Update sdk docs for page splitting defaulting to true #77


Merged 2 commits on Jun 17, 2024
23 changes: 14 additions & 9 deletions api-reference/api-services/sdk.mdx
…deployment of Unstructured API, you can access the API using the Python or TypeScript SDK.

## Page Splitting

To speed up processing of large PDF files, the `splitPdfPage`[*](#parameter-names) parameter is `true` by default. This
causes the client to split the PDF into small batches of pages before sending requests to the API. The client
awaits all parallel requests and combines the responses into a single response object. This applies only to PDF files;
the parameter is ignored for other file types.

The number of parallel requests is controlled by `splitPdfConcurrencyLevel`[*](#parameter-names).
The default is 5, and the maximum is 15, to avoid high resource usage and costs.

If at least one request is successful, the responses are combined into a single response object. An
error is returned only if all requests failed or there was an error during splitting.
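The splitting strategy described above can be sketched in plain Python (a hypothetical helper using `concurrent.futures`, not the SDK's actual internals; `send_request` stands in for a call to the partition endpoint):

```python
from concurrent.futures import ThreadPoolExecutor

def _attempt(fn, batch):
    """Run one batch request, returning the exception instead of raising."""
    try:
        return fn(batch)
    except Exception as exc:
        return exc

def split_partition(pages, send_request, batch_size=2, concurrency=5):
    """Split `pages` into small batches, send them in parallel, and
    combine the successful responses into a single list of elements."""
    batches = [pages[i:i + batch_size] for i in range(0, len(pages), batch_size)]
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        outcomes = list(pool.map(lambda b: _attempt(send_request, b), batches))
    successes = [o for o in outcomes if not isinstance(o, Exception)]
    # An error surfaces only when every batch request failed.
    if not successes:
        raise RuntimeError("all batch requests failed")
    return [element for batch in successes for element in batch]
```

Note how a single failed batch is silently dropped rather than failing the whole call, matching the error semantics above.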

<Note>
This feature may lead to unexpected results when chunking, because the server never sees the entire
document context at once. If you'd like to chunk across the whole document and still get the speedup from
parallel processing, you can:
* Partition the PDF with `splitPdfPage` set to `true`, without any chunking parameters
* Store the returned elements in `results.json`
* Partition this JSON file with the desired chunking parameters
</Note>
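The two-pass flow in the note above can be sketched as follows. This is a hypothetical helper, not part of the SDK: `partition_fn` stands in for a call into the SDK's partition endpoint, and `by_title` is just an example chunking strategy.

```python
import json
from typing import Callable, Sequence

def partition_then_chunk(
    pdf_bytes: bytes,
    partition_fn: Callable[..., Sequence[dict]],
    results_path: str = "results.json",
) -> Sequence[dict]:
    """Pass 1: partition with splitting enabled and no chunking parameters.
    Pass 2: partition the stored JSON with the desired chunking parameters,
    so the server sees the whole document's elements at once."""
    elements = partition_fn(
        content=pdf_bytes,
        file_name="document.pdf",
        split_pdf_page=True,  # fast, parallel first pass
    )
    # Store the returned elements in results.json.
    with open(results_path, "w") as f:
        json.dump(list(elements), f)
    # Second request: chunk the JSON file with full document context.
    with open(results_path, "rb") as f:
        return partition_fn(
            content=f.read(),
            file_name=results_path,
            chunking_strategy="by_title",  # example chunking parameter
        )
```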

<CodeGroup>
```python Python
    # `s` is an UnstructuredClient instance
    resp = s.general.partition(
        shared.PartitionParameters(
            files=shared.Files(
                content=file.read(),
                file_name=filename,
            ),
            split_pdf_page=True,  # Set to `False` to disable PDF splitting
            split_pdf_concurrency_level=10,  # Number of parallel requests
        )
    )
    ```
    ```typescript TypeScript
    const resp = await client.general.partition({
        files: {
            content: data,
            fileName: filename,
        },
        // Set to `false` to disable PDF splitting
        splitPdfPage: true,
        // Modify splitPdfConcurrencyLevel to set the number of parallel requests
        splitPdfConcurrencyLevel: 10,
    });
    ```
</CodeGroup>