chore/set pdf page splitting to true by default (#84)
Mirror of
Unstructured-IO/unstructured-python-client#118
* Set the split_pdf_page default to true and run `make client-generate`
locally.
* Update the README, add another reference back to our docs, and bring back
some autogenerated sections, as in the Python repo
* Change some warning logs to info. The user should not be warned about
default behavior for non-PDF files
# Testing
Use the client locally and verify that split mode is the default, and
that the dev experience is good
* Create a new test dir and run `npm init -y; npm install typescript
tsx`
* Check out this branch and install from your test dir: `npm i
file:~/repos/unstructured-js-client`
* Run this sample script. Try a few different input files and verify that
the logging and results look acceptable.
`npx tsx unstructured.ts`
```
import { UnstructuredClient } from "unstructured-client";
import { PartitionResponse } from "unstructured-client/sdk/models/operations";
import { Strategy } from "unstructured-client/sdk/models/shared";
import * as fs from "fs";

const key = "free-api-key";

const client = new UnstructuredClient({
    security: {
        apiKeyAuth: key,
    },
});

const filename = "fake-html.html";
const data = fs.readFileSync(filename);

client.general.partition({
    partitionParameters: {
        files: {
            content: data,
            fileName: filename,
        },
        strategy: Strategy.Auto,
    }
}).then((res: PartitionResponse) => {
    if (res.statusCode == 200) {
        console.log(res.elements);
    }
}).catch((e) => {
    if (e.statusCode) {
        console.log(e.statusCode);
        console.log(e.body);
    } else {
        console.log(e);
    }
});
```
See the [general partition](docs/sdk/models/shared/partitionparameters.md) page for all available parameters.
+Some of the endpoints in this SDK support retries. If you use the SDK without any configuration, it will fall back to the default retry strategy provided by the API. However, the default retry strategy can be overridden on a per-operation basis, or across the entire SDK.

-### Splitting PDF by pages
+To change the default retry strategy for a single API call, simply provide a retryConfig object to the call:
-In order to speed up processing of long PDF files, set `splitPdfPage` parameter to `true`. It will cause the PDF to be split into smaller batches at client side, before sending to API, and combining individual responses as single result. This will work only for PDF files, so don't set it for other types of files. Size of each batch is determined internally and it can vary between 2 and 20 pages per split.
+async function run() {
+    const result = await unstructuredClient.general.partition(
+        {
+            partitionParameters: {
+                files: await openAsBlob("./sample-file"),
+                strategy: Strategy.Auto,
+            },
+        },
+        {
+            retries: {
+                strategy: "backoff",
+                backoff: {
+                    initialInterval: 1,
+                    maxInterval: 50,
+                    exponent: 1.1,
+                    maxElapsedTime: 100,
+                },
+                retryConnectionErrors: false,
+            },
+        }
+    );

-The amount of parallel requests is controlled by `splitPdfConcurrencyLevel` parameter. By default it equals to 5. It can't be more than 15, to avoid too high resource usage and costs.
+    // Handle the result
+    console.log(result);
+}
+
+run();

+```
+
+If you'd like to override the default retry strategy for all operations that support retries, you can provide a retryConfig at SDK initialization:
+    const result = await unstructuredClient.general.partition({
+        partitionParameters: {
+            files: await openAsBlob("./sample-file"),
+            strategy: Strategy.Auto,
+        },
+    });
+
+    // Handle the result
+    console.log(result);
+}
+
+run();
+
+```
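
Review note: to make the four `backoff` fields in the added README example concrete, here is a minimal sketch of how an exponential backoff schedule is commonly derived from such values. The function name `backoffSchedule` and the formula (delay = `initialInterval * exponent^attempt`, capped at `maxInterval`, stopping once the total wait would exceed `maxElapsedTime`) are assumptions for illustration. The SDK's actual retry logic may differ, for example by adding jitter, and the units are whatever the SDK defines.

```typescript
// Field names mirror the `backoff` object shown in the README diff.
// The algorithm below is an assumed, generic exponential backoff, not the SDK's code.
interface BackoffConfig {
    initialInterval: number;
    maxInterval: number;
    exponent: number;
    maxElapsedTime: number;
}

// Hypothetical helper: returns the successive waits between retry attempts.
function backoffSchedule(cfg: BackoffConfig): number[] {
    const delays: number[] = [];
    let elapsed = 0;
    for (let attempt = 0; ; attempt++) {
        // Grow the delay geometrically, but never beyond maxInterval.
        const delay = Math.min(
            cfg.initialInterval * Math.pow(cfg.exponent, attempt),
            cfg.maxInterval,
        );
        // Stop on a non-positive delay (avoids looping forever) or when the
        // total time budget would be exceeded.
        if (delay <= 0 || elapsed + delay > cfg.maxElapsedTime) {
            break;
        }
        delays.push(delay);
        elapsed += delay;
    }
    return delays;
}

// Values taken from the README example above.
const schedule = backoffSchedule({
    initialInterval: 1,
    maxInterval: 50,
    exponent: 1.1,
    maxElapsedTime: 100,
});
console.log(schedule.length, schedule.slice(0, 3));
```

With an exponent as small as 1.1 the delays grow slowly, so under these assumptions `maxElapsedTime` is reached long before `maxInterval` caps any single wait.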
+<!-- End Retries [retries] -->
+
+### Splitting PDF by pages
+
+See [page splitting](https://docs.unstructured.io/api-reference/api-services/sdk#page-splitting) for more details.
+
+In order to speed up processing of large PDF files, the client splits up PDFs into smaller files, sends these to the API concurrently, and recombines the results. `splitPdfPage` can be set to `false` to disable this.
+
+The amount of parallel requests is controlled by `splitPdfConcurrencyLevel` parameter. By default it equals to 5. It can't be more than 15, to avoid too high resource usage and costs. The size of each batch is determined internally and it can vary between 2 and 20 pages per split.
+
+```typescript
client.general.partition({
    partitionParameters: {
        files: {
@@ -186,14 +254,7 @@ client.general.partition({
        // Modify splitPdfConcurrencyLevel to change the limit of parallel requests
        splitPdfConcurrencyLevel: 10,
    },
-}).then((res: PartitionResponse) => {
-    if (res.statusCode == 200) {
-        console.log(res.elements);
-    }
-}).catch((e) => {
-    console.log(e.statusCode);
-    console.log(e.body);
-});
+});
```

<!-- Start Requirements [requirements] -->
@@ -244,6 +305,12 @@ run();
```
<!-- End File uploads [file-upload] -->

+<!-- No Authentication -->
+<!-- No SDK Available Operations -->
+<!-- No Pagination -->
+<!-- No Error Handling -->
+<!-- No Server Selection -->
+
<!-- Placeholder for Future Speakeasy SDK Sections -->
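
Review note: the "Splitting PDF by pages" text describes splitting a PDF into batches of 2 to 20 pages, sending them with at most `splitPdfConcurrencyLevel` requests in flight (default 5, maximum 15), and recombining the results. The sketch below illustrates that general pattern only. All names in it (`splitIntoPageRanges`, `clampConcurrency`, `mapWithConcurrency`) are hypothetical and are not part of `unstructured-client`; the client chooses its batch size internally, whereas here the caller passes it in.

```typescript
// Hypothetical helpers for illustration; not the client's implementation.

// Inclusive, 1-based page range for one batch.
type PageRange = { start: number; end: number };

// Split `totalPages` into consecutive batches, clamping the batch size to the
// documented bounds of 2 to 20 pages per split.
function splitIntoPageRanges(totalPages: number, pagesPerBatch: number): PageRange[] {
    const size = Math.min(Math.max(Math.floor(pagesPerBatch), 2), 20);
    const ranges: PageRange[] = [];
    for (let start = 1; start <= totalPages; start += size) {
        ranges.push({ start, end: Math.min(start + size - 1, totalPages) });
    }
    return ranges;
}

// Keep the requested concurrency between 1 and the documented cap of 15.
function clampConcurrency(requested: number): number {
    return Math.min(Math.max(Math.floor(requested), 1), 15);
}

// Run `worker` over `items` with at most `limit` calls in flight at once.
// Results are stored by index, so the output order matches the input order.
async function mapWithConcurrency<T, R>(
    items: T[],
    limit: number,
    worker: (item: T) => Promise<R>,
): Promise<R[]> {
    const results: R[] = new Array(items.length);
    let next = 0;
    async function runner(): Promise<void> {
        while (next < items.length) {
            const index = next++;
            results[index] = await worker(items[index]);
        }
    }
    const runnerCount = Math.min(clampConcurrency(limit), items.length);
    await Promise.all(Array.from({ length: runnerCount }, () => runner()));
    return results;
}

// Usage: a 45-page document in batches of 20, processed 5 at a time.
// The worker here is a stand-in for one API request per batch.
const pageRanges = splitIntoPageRanges(45, 20);
mapWithConcurrency(pageRanges, 5, async (r) => `pages ${r.start}-${r.end}`)
    .then((combined) => console.log(combined));
```

Storing results by index rather than by completion order is what lets the individual responses be recombined into a single result in page order.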