chore/change default split page behavior to true #118

Merged
merged 2 commits into from
Jun 17, 2024
6 changes: 3 additions & 3 deletions .speakeasy/gen.lock
@@ -1,12 +1,12 @@
lockVersion: 2.0.0
id: 8b5fa338-9106-4734-abf0-e30d67044a90
management:
-docChecksum: 5365c99c52e23b044ef9916ecf51b1a9
+docChecksum: c7e23b3b8242eb21eccb2091bcc57c72
docVersion: 1.0.35
speakeasyVersion: 1.308.1
generationVersion: 2.342.6
-releaseVersion: 0.23.5
-configChecksum: e210d7bff3afd386269cb7c6adeef630
+releaseVersion: 0.23.6
+configChecksum: 4e2e510c7f4b61e04b61acf7de2939a3
repoURL: https://github.com/Unstructured-IO/unstructured-python-client.git
repoSubDirectory: .
installationURL: https://github.com/Unstructured-IO/unstructured-python-client.git
5 changes: 3 additions & 2 deletions README.md
@@ -72,7 +72,9 @@ Refer to the [API parameters page](https://docs.unstructured.io/api-reference/ap

#### Splitting PDF by pages

-In order to speed up processing of long PDF files, `split_pdf_page` can be set to `True` (defaults to `False`). It will cause the PDF to be split at client side, before sending to API, and combining individual responses as single result. This parameter will affect only PDF files, no need to disable it for other filetypes.
+See [page splitting](https://docs.unstructured.io/api-reference/api-services/sdk#page-splitting) for more details.
+
+In order to speed up processing of large PDF files, the client splits up PDFs into smaller files, sends these to the API concurrently, and recombines the results. `split_pdf_page` can be set to `False` to disable this.

The number of workers used for splitting PDFs is set by the `split_pdf_concurrency_level` parameter, with a default of 5 and a maximum of 15 to keep resource usage and costs in check. The splitting process uses `asyncio` to manage concurrency.
The size of each batch of pages (ranging from 2 to 20) is determined internally from the concurrency level and the total number of pages in the document. Because the splitting process uses `asyncio`, the client can encounter event loop issues if it is nested in another async runner, such as a `gevent`-spawned task. It is, however, safe to run in multiprocessing workers (e.g., using `multiprocessing.Pool` with the `fork` context).
@@ -83,7 +85,6 @@ req = shared.PartitionParameters(
files=files,
strategy="fast",
languages=["eng"],
-split_pdf_page=True,
split_pdf_concurrency_level=8
)
```
2 changes: 1 addition & 1 deletion _test_unstructured_client/unit/test_split_pdf_hook.py
@@ -276,7 +276,7 @@ def test_unit_is_pdf_invalid_extension(caplog):
"""Test is pdf method returns False for file with invalid extension."""
file = shared.Files(b"txt_content", "test_file.txt")

-with caplog.at_level(logging.WARNING):
+with caplog.at_level(logging.INFO):
result = pdf_utils.is_pdf(file)

assert result is False
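
The capture level in this test has to follow the new log level: a capture set to `WARNING` would no longer record a message that is now emitted at `INFO`. The sketch below illustrates that with the standard `logging` module only. `ListHandler`, `is_pdf_by_extension`, `capture`, and the logger name are hypothetical stand-ins for pytest's `caplog` and `pdf_utils.is_pdf`, not the client's code.

```python
import logging
from typing import List, Tuple

logger = logging.getLogger("split_pdf_sketch")  # hypothetical logger name
logger.propagate = False  # keep the sketch's records out of the root logger


class ListHandler(logging.Handler):
    """Collects emitted messages, loosely mimicking pytest's caplog."""

    def __init__(self, level: int) -> None:
        super().__init__(level)
        self.messages: List[str] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.messages.append(record.getMessage())


def is_pdf_by_extension(file_name: str) -> bool:
    """Stand-in for the extension check only; the real is_pdf does more."""
    if not file_name.endswith(".pdf"):
        logger.info("Given file doesn't have '.pdf' extension, so splitting is not enabled.")
        return False
    return True


def capture(level: int, file_name: str) -> Tuple[bool, List[str]]:
    """Run the check while capturing records at or above `level`."""
    handler = ListHandler(level)
    previous_level = logger.level
    logger.addHandler(handler)
    logger.setLevel(level)
    try:
        return is_pdf_by_extension(file_name), handler.messages
    finally:
        # Restore the logger so repeated captures do not interfere.
        logger.removeHandler(handler)
        logger.setLevel(previous_level)
```

Calling `capture(logging.INFO, "test_file.txt")` records the message, while `capture(logging.WARNING, "test_file.txt")` returns an empty message list, which is why the test's capture level moves together with the log call.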
2 changes: 1 addition & 1 deletion gen.yaml
@@ -10,7 +10,7 @@ generation:
auth:
oAuth2ClientCredentialsEnabled: false
python:
-version: 0.23.5
+version: 0.23.6
additionalDependencies:
dependencies:
deepdiff: '>=6.0'
2 changes: 1 addition & 1 deletion overlay_client.yaml
@@ -10,7 +10,7 @@ actions:
"type": "boolean",
"title": "Split Pdf Page",
"description": "This parameter determines if the PDF file should be split on the client side. It's an internal parameter for the Python client and is not sent to the backend.",
-"default": false,
+"default": true,
}
- target: $["components"]["schemas"]["partition_parameters"]["properties"]
update:
2 changes: 1 addition & 1 deletion setup.py
@@ -19,7 +19,7 @@

setuptools.setup(
name='unstructured-client',
-version='0.23.5',
+version='0.23.6',
author='Unstructured',
description='Python Client SDK for Unstructured API',
license = 'MIT',
2 changes: 1 addition & 1 deletion src/unstructured_client/_hooks/custom/pdf_utils.py
@@ -59,7 +59,7 @@ def is_pdf(file: shared.Files) -> bool:
True if the file is a PDF, False otherwise.
"""
if not file.file_name.endswith(".pdf"):
-logger.warning("Given file doesn't have '.pdf' extension. Continuing without splitting.")
+logger.info("Given file doesn't have '.pdf' extension, so splitting is not enabled.")
return False

try:
4 changes: 2 additions & 2 deletions src/unstructured_client/_hooks/custom/split_pdf_hook.py
@@ -135,7 +135,7 @@ def before_request(
or not isinstance(file, shared.Files)
or not pdf_utils.is_pdf(file)
):
-logger.warning("File could not be split. Partitioning without split.")
+logger.info("Partitioning without split.")
return request

starting_page_number = form_utils.get_starting_page_number(
@@ -160,7 +160,7 @@ def before_request(
logger.info("Determined optimal split size of %d pages.", split_size)

if split_size >= len(pdf.pages):
-logger.warning(
+logger.info(
"Document has too few pages (%d) to be split efficiently. Partitioning without split.",
len(pdf.pages),
)
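
The guard in this hunk skips splitting when a single batch would already cover the whole document. The following is a minimal sketch of how such a split size could be derived, assuming an even spread of pages over workers clamped to the 2 to 20 range quoted in the README. The formula and the names `sketch_split_size` and `should_split` are assumptions for illustration; the client's actual calculation is not shown in this diff.

```python
import math

MIN_PAGES_PER_SPLIT = 2   # lower bound quoted in the README
MAX_PAGES_PER_SPLIT = 20  # upper bound quoted in the README


def sketch_split_size(num_pages: int, concurrency_level: int) -> int:
    """Assumed heuristic: spread pages evenly over workers, clamped to 2..20."""
    even_share = math.ceil(num_pages / concurrency_level)
    return max(MIN_PAGES_PER_SPLIT, min(MAX_PAGES_PER_SPLIT, even_share))


def should_split(num_pages: int, concurrency_level: int) -> bool:
    """Mirror the hunk's guard: skip splitting when one batch covers the file."""
    return sketch_split_size(num_pages, concurrency_level) < num_pages
```

Under these assumptions a 2-page document with the default concurrency of 5 is sent unsplit, which corresponds to the "too few pages" message that this change downgrades from a warning to an info log.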
@@ -83,7 +83,7 @@ class PartitionParameters:
r"""The document types that you want to skip table extraction with. Default: []"""
split_pdf_concurrency_level: Optional[int] = dataclasses.field(default=5, metadata={'multipart_form': { 'field_name': 'split_pdf_concurrency_level' }})
r"""When `split_pdf_page` is set to `True`, this parameter specifies the number of workers used for sending requests when the PDF is split on the client side. It's an internal parameter for the Python client and is not sent to the backend."""
-split_pdf_page: Optional[bool] = dataclasses.field(default=False, metadata={'multipart_form': { 'field_name': 'split_pdf_page' }})
+split_pdf_page: Optional[bool] = dataclasses.field(default=True, metadata={'multipart_form': { 'field_name': 'split_pdf_page' }})
r"""This parameter determines if the PDF file should be split on the client side. It's an internal parameter for the Python client and is not sent to the backend."""
starting_page_number: Optional[int] = dataclasses.field(default=None, metadata={'multipart_form': { 'field_name': 'starting_page_number' }})
r"""When PDF is split into pages before sending it into the API, providing this information will allow the page number to be assigned correctly. Introduced in 1.0.27."""
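
Because the default flips from `False` to `True`, callers that never set `split_pdf_page` now get client-side splitting for PDFs and must pass `False` explicitly to keep the old behavior. The sketch below shows that effect with a hypothetical stand-in dataclass. `PartitionParametersSketch` and `will_attempt_split` are illustrative names, and the gate is a simplification of the checks in `before_request`, which also inspects the file itself.

```python
import dataclasses
from typing import Optional


@dataclasses.dataclass
class PartitionParametersSketch:
    """Hypothetical stand-in with only the two client-side fields from the hunk."""

    split_pdf_concurrency_level: Optional[int] = 5
    split_pdf_page: Optional[bool] = True  # was False before this change


def will_attempt_split(params: PartitionParametersSketch, file_name: str) -> bool:
    """Assumed gate: split only when the flag is on and the name ends in .pdf."""
    return bool(params.split_pdf_page) and file_name.endswith(".pdf")


# Default construction now opts in; opting out has to be explicit.
default_params = PartitionParametersSketch()
opted_out = PartitionParametersSketch(split_pdf_page=False)
```

With the real client, the equivalent opt-out would be passing `split_pdf_page=False` to `shared.PartitionParameters`, as the updated README text describes.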
4 changes: 2 additions & 2 deletions src/unstructured_client/sdkconfiguration.py
@@ -29,9 +29,9 @@ class SDKConfiguration:
server: Optional[str] = ''
language: str = 'python'
openapi_doc_version: str = '1.0.35'
-sdk_version: str = '0.23.5'
+sdk_version: str = '0.23.6'
gen_version: str = '2.342.6'
-user_agent: str = 'speakeasy-sdk/python 0.23.5 2.342.6 1.0.35 unstructured-client'
+user_agent: str = 'speakeasy-sdk/python 0.23.6 2.342.6 1.0.35 unstructured-client'
retry_config: Optional[RetryConfig] = None

def __post_init__(self):