Commit d20c8c1

Update sdk docs for page splitting defaulting to true (#77)

1 parent f41ec22
File tree: 1 file changed, +14 −9 lines

api-reference/api-services/sdk.mdx (14 additions & 9 deletions)
@@ -220,20 +220,25 @@ deployment of Unstructured API, you can access the API using the Python or TypeS

 ## Page Splitting

-In order to speed up processing of long PDF files, set the `splitPdfPage`[*](#parameter-names) parameter to `true`. This will
-cause the PDF to be split into small batches of pages by the client before sending requests to the API. The client
-awaits all parallel requests and combines the responses into a single response object. This will
-work only for PDF files, so don't set it for other types of files.
+In order to speed up processing of large PDF files, the `splitPdfPage`[*](#parameter-names) parameter is `true` by default. This
+causes the client to split the PDF into small batches of pages before sending requests to the API. The client
+awaits all parallel requests and combines the responses into a single response object. This applies only to PDF
+files; the setting is ignored for other file types.

 The number of parallel requests is controlled by `splitPdfConcurrencyLevel`[*](#parameter-names).
 The default is 5 and the max is set to 15 to avoid high resource usage and costs.

 If at least one request is successful, the responses are combined into a single response object. An
 error is returned only if all requests failed or there was an error during splitting.

-When using page splitting, note that chunking will not always work as expected since chunking will happen on the
-API side. When chunking elements the whole document context is processed but when we use splitting we only have a part
-of the context. If you need to chunk, you can make a second request to the API with the returned elements.
+<Note>
+This feature may lead to unexpected results when chunking, because the server does not see the entire
+document context at once. If you'd like to chunk across the whole document and still get the speedup from
+parallel processing, you can:
+* Partition the PDF with `splitPdfPage` set to `true`, without any chunking parameters
+* Store the returned elements in `results.json`
+* Partition this JSON file with the desired chunking parameters
+</Note>

 <CodeGroup>
 ```python Python
@@ -243,7 +248,7 @@ deployment of Unstructured API, you can access the API using the Python or TypeS
         content=file.read(),
         file_name=filename,
     ),
-    split_pdf_page=True,  # Set `split_pdf_page` parameter to `True` to enable splitting the PDF file
+    split_pdf_page=True,  # Set to `False` to disable PDF splitting
     split_pdf_concurrency_level=10,  # Modify split_pdf_concurrency_level to set the number of parallel requests
 )
 )
@@ -256,7 +261,7 @@ deployment of Unstructured API, you can access the API using the Python or TypeS
         content: data,
         fileName: filename,
     },
-    // Set `splitPdfPage` parameter to `true` to enable splitting the PDF file
+    // Set to `false` to disable PDF splitting
     splitPdfPage: true,
     // Modify splitPdfConcurrencyLevel to set the number of parallel requests
    splitPdfConcurrencyLevel: 10,
