@@ -220,20 +220,25 @@ deployment of Unstructured API, you can access the API using the Python or TypeS
## Page Splitting
- In order to speed up processing of long PDF files, set the `splitPdfPage`[*](#parameter-names) parameter to `true`. This will
- cause the PDF to be split into small batches of pages by the client before sending requests to the API. The client
- awaits all parallel requests and combines the responses into a single response object. This will
- work only for PDF files, so don't set it for other types of files.
+ To speed up processing of large PDF files, the `splitPdfPage`[*](#parameter-names) parameter is `true` by default. This
+ causes the client to split the PDF into small batches of pages before sending requests to the API. The client
+ awaits all parallel requests and combines the responses into a single response object. Splitting applies only to PDF files;
+ the parameter is ignored for other file types.
The number of parallel requests is controlled by `splitPdfConcurrencyLevel`[*](#parameter-names).
The default is 5 and the max is set to 15 to avoid high resource usage and costs.
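As a rough illustration of the client-side behavior described here, the sketch below groups page numbers into batches and processes them with a bounded thread pool. The names `clamp_concurrency`, `page_batches`, `process_batches`, and `send_batch`, the batch size, and the idea that an out-of-range concurrency level is clamped are all assumptions made for illustration; the actual client may split and schedule requests differently.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, List, Optional

DEFAULT_CONCURRENCY = 5   # default stated in the docs
MAX_CONCURRENCY = 15      # maximum stated in the docs


def clamp_concurrency(requested: Optional[int] = None) -> int:
    """Return a worker count between 1 and MAX_CONCURRENCY (assumed clamping behavior)."""
    if requested is None:
        return DEFAULT_CONCURRENCY
    return max(1, min(requested, MAX_CONCURRENCY))


def page_batches(page_count: int, batch_size: int) -> List[range]:
    """Group 1-based page numbers into consecutive batches of at most batch_size pages."""
    return [
        range(start, min(start + batch_size, page_count + 1))
        for start in range(1, page_count + 1, batch_size)
    ]


def process_batches(
    batches: List[range],
    send_batch: Callable[[range], Any],  # hypothetical per-batch request function
    concurrency: Optional[int] = None,
) -> List[Any]:
    """Call send_batch on every batch in parallel and return results in page order."""
    with ThreadPoolExecutor(max_workers=clamp_concurrency(concurrency)) as pool:
        # Executor.map preserves input order, so results line up with the batches.
        return list(pool.map(send_batch, batches))
```

In this sketch, `send_batch` stands in for whatever request the client issues per batch; swapping in a real request function would not change the batching or ordering logic.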
If at least one request is successful, the responses are combined into a single response object. An
error is returned only if all requests failed or there was an error during splitting.
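The combining rule in this paragraph can be sketched as follows. This is a minimal illustration, not the client's implementation: `combine_batch_results` is a hypothetical helper, elements are modeled as plain dictionaries, and a failed batch is represented by the exception it raised.

```python
from typing import Any, Dict, List, Union

Element = Dict[str, Any]
# Each batch either produced a list of elements or failed with an exception.
BatchResult = Union[List[Element], Exception]


def combine_batch_results(results: List[BatchResult]) -> List[Element]:
    """Merge successful batches in order; raise only if every batch failed."""
    successes = [r for r in results if not isinstance(r, Exception)]
    if not successes:
        raise RuntimeError("all split requests failed")
    combined: List[Element] = []
    for elements in successes:
        combined.extend(elements)
    return combined
```

Note that under this rule a partially failed run still returns a response, so the combined elements may be missing pages from the failed batches.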
- When using page splitting, note that chunking will not always work as expected since chunking will happen on the
- API side. When chunking elements the whole document context is processed but when we use splitting we only have a part
- of the context. If you need to chunk, you can make a second request to the API with the returned elements.
+ <Note>
+ This feature may lead to unexpected results when chunking because the server does not see the entire
+ document context at once. If you'd like to chunk across the whole document and still get the speedup from
+ parallel processing, you can:
+ * Partition the PDF with `splitPdfPage` set to `true`, without any chunking parameters
+ * Store the returned elements in `results.json`
+ * Partition this JSON file with the desired chunking parameters
+ </Note>
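The three steps in the note could be wired together roughly as below. This is a sketch under stated assumptions: `partition` is a hypothetical callable standing in for a request to the API (it is not an SDK function), and passing `split_pdf_page` and `chunking_strategy` as keyword arguments is an assumed calling convention chosen for the example.

```python
import json
from pathlib import Path
from typing import Any, Callable, Dict, List

Element = Dict[str, Any]
# Hypothetical signature: partition(file_name, content, **params) -> list of elements.
PartitionFn = Callable[..., List[Element]]


def partition_then_chunk(
    pdf_path: Path,
    partition: PartitionFn,
    chunking_strategy: str,
    results_path: Path = Path("results.json"),
) -> List[Element]:
    """Partition a PDF with page splitting, then chunk the stored elements in a second pass."""
    # Pass 1: partition the PDF with splitting enabled and no chunking parameters.
    elements = partition(pdf_path.name, pdf_path.read_bytes(), split_pdf_page=True)

    # Store the returned elements so they can be submitted as a JSON file.
    results_path.write_text(json.dumps(elements))

    # Pass 2: partition the JSON file with the desired chunking parameters.
    return partition(
        results_path.name,
        results_path.read_bytes(),
        chunking_strategy=chunking_strategy,
    )
```

Keeping the request behind a callable makes the two-pass flow easy to exercise with a stub before pointing it at a real endpoint.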
<CodeGroup >
``` python Python
@@ -243,7 +248,7 @@ deployment of Unstructured API, you can access the API using the Python or TypeS
            content=file.read(),
            file_name=filename,
        ),
-       split_pdf_page=True,  # Set `split_pdf_page` parameter to `True` to enable splitting the PDF file
+       split_pdf_page=True,  # Set to `False` to disable PDF splitting
        split_pdf_concurrency_level=10,  # Modify split_pdf_concurrency_level to set the number of parallel requests
    )
)
@@ -256,7 +261,7 @@ deployment of Unstructured API, you can access the API using the Python or TypeS
            content: data,
            fileName: filename,
        },
-       // Set `splitPdfPage` parameter to `true` to enable splitting the PDF file
+       // Set to `false` to disable PDF splitting
        splitPdfPage: true,
        // Modify splitPdfConcurrencyLevel to set the number of parallel requests
        splitPdfConcurrencyLevel: 10,