Commit 78b5d57

chore/set pdf page splitting to true by default (#84)
Mirror of Unstructured-IO/unstructured-python-client#118

* Set the split_pdf_page default to true and run `make client-generate` locally.
* Update the readme, add another reference back to our docs, bring back some autogenerated sections like in the python repo
* Change some warning logs to info. The user should not be warned about default behavior for non pdf files

# Testing

Use the client locally and verify that split mode is the default, and that the dev experience is good

* Create a new test dir and run `npm init -y; npm install typescript tsx`
* Check out this branch and install from your test dir: `npm i file:~/repos/unstructured-js-client`
* Run this sample script. Try some different files in and verify that the logging and results look acceptable. `npx tsx unstructured.ts`

```
import { UnstructuredClient } from "unstructured-client";
import { PartitionResponse } from "unstructured-client/sdk/models/operations";
import { Strategy } from "unstructured-client/sdk/models/shared";
import * as fs from "fs";

const key = "free-api-key";

const client = new UnstructuredClient({
    security: {
        apiKeyAuth: key,
    },
});

const filename = "fake-html.html";
const data = fs.readFileSync(filename);

client.general.partition({
    partitionParameters: {
        files: {
            content: data,
            fileName: filename,
        },
        strategy: Strategy.Auto,
    }
}).then((res: PartitionResponse) => {
    if (res.statusCode == 200) {
        console.log(res.elements);
    }
}).catch((e) => {
    if (e.statusCode) {
        console.log(e.statusCode);
        console.log(e.body);
    } else {
        console.log(e);
    }
});
```
1 parent 49997b0 commit 78b5d57

13 files changed, with 155 additions and 76 deletions.

.speakeasy/gen.lock

Lines changed: 3 additions & 3 deletions
@@ -1,12 +1,12 @@
 lockVersion: 2.0.0
 id: f42cb8e6-e2ce-4565-b975-5a9f38b94d5a
 management:
-  docChecksum: f152122a1bb6e932d6eb54355be2da11
+  docChecksum: fd76ef24456d50f23903277067ffaa2e
   docVersion: 1.0.35
   speakeasyVersion: 1.308.1
   generationVersion: 2.342.6
-  releaseVersion: 0.11.1
-  configChecksum: bd8ea9af107b087e7927abd855542414
+  releaseVersion: 0.11.2
+  configChecksum: f5f0ec91134be577b27ca7ca87e4f5fa
   repoURL: https://github.com/Unstructured-IO/unstructured-js-client.git
   repoSubDirectory: .
   installationURL: https://github.com/Unstructured-IO/unstructured-js-client

README.md

Lines changed: 124 additions & 57 deletions
@@ -9,12 +9,6 @@
 <a href="https://speakeasyapi.dev/"><img src="https://custom-icon-badges.demolab.com/badge/-Built%20By%20Speakeasy-212015?style=for-the-badge&logoColor=FBE331&logo=speakeasy&labelColor=545454" /></a>
 </div>

-<h2 align="center">
-<p>Typescript SDK for the Unstructured API</p>
-</h2>
-
-This is a Typescript client for the [Unstructured API](https://unstructured-io.github.io/unstructured/api.html).
-
 <div align="center">

 <a
@@ -24,6 +18,13 @@ This is a Typescript client for the [Unstructured API](https://unstructured-io.g

 </div>

+<h2 align="center">
+<p>Typescript SDK for the Unstructured API</p>
+</h2>
+
+This is a Typescript client for the [Unstructured API](https://unstructured-io.github.io/unstructured/api.html).
+
+Please refer to the [Unstructured docs](https://docs.unstructured.io/api-reference/api-services/sdk) for a full guide to using the client.

 ## SDK Installation

@@ -40,45 +41,40 @@ yarn add unstructured-client --dev
 ```
 <!-- No SDK Installation -->

+<!-- Start SDK Example Usage [usage] -->
 ## SDK Example Usage
-Only the `files` parameter is required for partition. See the [general partition](docs/sdks/general/README.md) page for all available parameters.
+
+### Example

 ```typescript
+import { openAsBlob } from "node:fs";
 import { UnstructuredClient } from "unstructured-client";
-import { PartitionResponse } from "unstructured-client/dist/sdk/models/operations";
-import * as fs from "fs";
-
-const key = "YOUR-API-KEY";
+import { Strategy } from "unstructured-client/sdk/models/shared";

-const client = new UnstructuredClient({
+const unstructuredClient = new UnstructuredClient({
     security: {
-        apiKeyAuth: key,
+        apiKeyAuth: "YOUR_API_KEY",
     },
-    // uncomment and change the URL below depending on which services you use or hosting locally; see below for more details
-    // by default it will make requests againt the url for the freemium (https://unstructured.io/api-key-free) API service
-    // serverURL: "http://localhost:8000",
 });

-const filename = "sample-docs/layout-parser-paper.pdf";
-const data = fs.readFileSync(filename);
+async function run() {
+    const result = await unstructuredClient.general.partition({
+        partitionParameters: {
+            files: await openAsBlob("./sample-file"),
+            strategy: Strategy.Auto,
+        },
+    });
+
+    // Handle the result
+    console.log(result);
+}
+
+run();

-client.general.partition({
-    // Note that this currently only supports a single file
-    files: {
-        content: data,
-        fileName: filename,
-    },
-    // Other partition params
-    strategy: "fast",
-}).then((res: PartitionResponse) => {
-    if (res.statusCode == 200) {
-        console.log(res.elements);
-    }
-}).catch((e) => {
-    console.log(e.statusCode);
-    console.log(e.body);
-});
 ```
+<!-- End SDK Example Usage [usage] -->
+
+Refer to the [API parameters page](https://docs.unstructured.io/api-reference/api-services/api-parameters) for all available parameters.

 ## Change the base URL

@@ -103,12 +99,6 @@ const client = new UnstructuredClient({
 ```


-<!-- No SDK Example Usage -->
-<!-- No SDK Available Operations -->
-<!-- No Pagination -->
-<!-- No Error Handling -->
-<!-- No Server Selection -->
-
 <!-- Start Custom HTTP Client [http-client] -->
 ## Custom HTTP Client

@@ -157,24 +147,102 @@ httpClient.addHook("requestError", (error, request) => {
 const sdk = new UnstructuredClient({ httpClient });
 ```
 <!-- End Custom HTTP Client [http-client] -->
-<!-- No Retries -->
-<!-- No Authentication -->

-## PartitionParameters
+<!-- Start Retries [retries] -->
+## Retries

-See the [general partition](docs/sdk/models/shared/partitionparameters.md) page for all available parameters.
+Some of the endpoints in this SDK support retries. If you use the SDK without any configuration, it will fall back to the default retry strategy provided by the API. However, the default retry strategy can be overridden on a per-operation basis, or across the entire SDK.

-### Splitting PDF by pages
+To change the default retry strategy for a single API call, simply provide a retryConfig object to the call:
+```typescript
+import { openAsBlob } from "node:fs";
+import { UnstructuredClient } from "unstructured-client";
+import { Strategy } from "unstructured-client/sdk/models/shared";
+
+const unstructuredClient = new UnstructuredClient({
+    security: {
+        apiKeyAuth: "YOUR_API_KEY",
+    },
+});

-In order to speed up processing of long PDF files, set `splitPdfPage` parameter to `true`. It will cause the PDF to be split into smaller batches at client side, before sending to API, and combining individual responses as single result. This will work only for PDF files, so don't set it for other types of files. Size of each batch is determined internally and it can vary between 2 and 20 pages per split.
+async function run() {
+    const result = await unstructuredClient.general.partition(
+        {
+            partitionParameters: {
+                files: await openAsBlob("./sample-file"),
+                strategy: Strategy.Auto,
+            },
+        },
+        {
+            retries: {
+                strategy: "backoff",
+                backoff: {
+                    initialInterval: 1,
+                    maxInterval: 50,
+                    exponent: 1.1,
+                    maxElapsedTime: 100,
+                },
+                retryConnectionErrors: false,
+            },
+        }
+    );

-The amount of parallel requests is controlled by `splitPdfConcurrencyLevel` parameter. By default it equals to 5. It can't be more than 15, to avoid too high resource usage and costs.
+    // Handle the result
+    console.log(result);
+}
+
+run();

+```
+
+If you'd like to override the default retry strategy for all operations that support retries, you can provide a retryConfig at SDK initialization:
 ```typescript
-import { SplitPdfHook } from "unstructured-client/hooks/custom/SplitPdfHook";
+import { openAsBlob } from "node:fs";
+import { UnstructuredClient } from "unstructured-client";
+import { Strategy } from "unstructured-client/sdk/models/shared";

-...
+const unstructuredClient = new UnstructuredClient({
+    retryConfig: {
+        strategy: "backoff",
+        backoff: {
+            initialInterval: 1,
+            maxInterval: 50,
+            exponent: 1.1,
+            maxElapsedTime: 100,
+        },
+        retryConnectionErrors: false,
+    },
+    security: {
+        apiKeyAuth: "YOUR_API_KEY",
+    },
+});

+async function run() {
+    const result = await unstructuredClient.general.partition({
+        partitionParameters: {
+            files: await openAsBlob("./sample-file"),
+            strategy: Strategy.Auto,
+        },
+    });
+
+    // Handle the result
+    console.log(result);
+}
+
+run();
+
+```
+<!-- End Retries [retries] -->
+
+### Splitting PDF by pages
+
+See [page splitting](https://docs.unstructured.io/api-reference/api-services/sdk#page-splitting) for more details.
+
+In order to speed up processing of large PDF files, the client splits up PDFs into smaller files, sends these to the API concurrently, and recombines the results. `splitPdfPage` can be set to `false` to disable this.
+
+The amount of parallel requests is controlled by `splitPdfConcurrencyLevel` parameter. By default it equals to 5. It can't be more than 15, to avoid too high resource usage and costs. The size of each batch is determined internally and it can vary between 2 and 20 pages per split.
+
+```typescript
 client.general.partition({
     partitionParameters: {
         files: {
@@ -186,14 +254,7 @@ client.general.partition({
         // Modify splitPdfConcurrencyLevel to change the limit of parallel requests
         splitPdfConcurrencyLevel: 10,
     },
-}).then((res: PartitionResponse) => {
-    if (res.statusCode == 200) {
-        console.log(res.elements);
-    }
-}).catch((e) => {
-    console.log(e.statusCode);
-    console.log(e.body);
-});
+}};
 ```

 <!-- Start Requirements [requirements] -->
@@ -244,6 +305,12 @@ run();
 ```
 <!-- End File uploads [file-upload] -->

+<!-- No Authentication -->
+<!-- No SDK Available Operations -->
+<!-- No Pagination -->
+<!-- No Error Handling -->
+<!-- No Server Selection -->
+
 <!-- Placeholder for Future Speakeasy SDK Sections -->

 ### Maturity
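
A note on the retry settings that the README hunks above introduce (`initialInterval: 1`, `maxInterval: 50`, `exponent: 1.1`, `maxElapsedTime: 100`): they are easier to reason about as a delay schedule. The following is a minimal sketch of a generic exponential backoff, not the SDK's actual retry code. `BackoffConfig` and `backoffSchedule` are hypothetical names, jitter is left out, and treating the values as milliseconds is an assumption.

```typescript
// Hypothetical shape mirroring the `backoff` object in the README example.
interface BackoffConfig {
  initialInterval: number; // first delay (assumed to be milliseconds)
  maxInterval: number;     // upper bound for any single delay
  exponent: number;        // growth factor applied per attempt
  maxElapsedTime: number;  // total time budget across all delays
}

// Returns the delays a plain exponential backoff would wait, in order,
// stopping before the cumulative wait would exceed maxElapsedTime.
// maxAttempts is a safety cap so the loop always terminates.
function backoffSchedule(cfg: BackoffConfig, maxAttempts = 1000): number[] {
  if (cfg.initialInterval <= 0) {
    throw new Error("initialInterval must be positive");
  }
  const delays: number[] = [];
  let elapsed = 0;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const raw = cfg.initialInterval * Math.pow(cfg.exponent, attempt);
    const delay = Math.min(raw, cfg.maxInterval);
    if (elapsed + delay > cfg.maxElapsedTime) break;
    delays.push(delay);
    elapsed += delay;
  }
  return delays;
}

const schedule = backoffSchedule({
  initialInterval: 1,
  maxInterval: 50,
  exponent: 1.1,
  maxElapsedTime: 100,
});
console.log(schedule.length, schedule.slice(0, 3));
```

Under these assumptions and with the README's example values, the delays grow slowly enough that the time budget ends the schedule before any delay reaches `maxInterval`.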

RELEASES.md

Lines changed: 11 additions & 1 deletion
@@ -333,4 +333,14 @@ Based on:
 ### Generated
 - [typescript v0.11.1] .
 ### Releases
-- [NPM v0.11.1] https://www.npmjs.com/package/unstructured-client/v/0.11.1 - .
+- [NPM v0.11.1] https://www.npmjs.com/package/unstructured-client/v/0.11.1 - .
+
+## 2024-06-17 17:43:15
+### Changes
+Based on:
+- OpenAPI Doc
+- Speakeasy CLI 1.308.1 (2.342.6) https://github.com/speakeasy-api/speakeasy
+### Generated
+- [typescript v0.11.1] .
+### Releases
+- [NPM v0.11.2] https://www.npmjs.com/package/unstructured-client/v/0.11.2 - .

gen.yaml

Lines changed: 1 addition & 1 deletion
@@ -10,7 +10,7 @@ generation:
   auth:
     oAuth2ClientCredentialsEnabled: false
 typescript:
-  version: 0.11.1
+  version: 0.11.2
   additionalDependencies:
     dependencies:
       async: ^3.2.5

jsr.json

Lines changed: 1 addition & 1 deletion
@@ -2,7 +2,7 @@

 {
   "name": "unstructured-client",
-  "version": "0.11.1",
+  "version": "0.11.2",
   "exports": {
     ".": "./src/index.ts",
     "./sdk/models/errors": "./src/sdk/models/errors/index.ts",

overlay_client.yaml

Lines changed: 1 addition & 1 deletion
@@ -10,7 +10,7 @@ actions:
           "type": "boolean",
           "title": "Split Pdf Page",
           "description": "Should the pdf file be split at client. Ignored on backend.",
-          "default": false,
+          "default": true,
         }
   - target: $["components"]["schemas"]["partition_parameters"]["properties"]
     update:

package.json

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 {
   "name": "unstructured-client",
-  "version": "0.11.1",
+  "version": "0.11.2",
   "author": "Unstructured",
   "main": "./index.js",
   "sideEffects": false,

src/hooks/custom/SplitPdfHook.ts

Lines changed: 4 additions & 4 deletions
@@ -22,7 +22,7 @@ import {
   prepareResponseHeaders,
   splitPdf,
   stringToBoolean,
-} from "./utils";
+} from "./utils/index";
 import {
   MIN_PAGES_PER_THREAD,
   PARTITION_FORM_FILES_KEY,
@@ -98,12 +98,12 @@ export class SplitPdfHook

     const [error, pdf, pagesCount] = await loadPdf(file);
     if (file === null || pdf === null || error) {
-      console.warn("File could not be split. Partitioning without split.")
+      console.info("Partitioning without split.")
       return request;
     }

     if (pagesCount < MIN_PAGES_PER_THREAD) {
-      console.warn(
+      console.info(
         `PDF has less than ${MIN_PAGES_PER_THREAD} pages. Partitioning without split.`
       );
       return request;
@@ -119,7 +119,7 @@ export class SplitPdfHook
     console.info("Determined optimal split size of %d pages.", splitSize)

     if (splitSize >= pagesCount) {
-      console.warn(
+      console.info(
         "Document has too few pages (%d) to be split efficiently. Partitioning without split.",
         pagesCount,
       )
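
The three early returns touched in this file are all expected "no split" outcomes, which is why they were downgraded from warnings to info. Here is a minimal sketch of that decision flow, assuming the 2 and 20 page bounds quoted in the README. `planSplit` and `chooseSplitSize` are hypothetical helpers, and the batch-size formula is an assumption rather than the hook's actual computation.

```typescript
// Bounds taken from the README text ("between 2 and 20 pages per split").
// The constant names mirror the hook, but these values are assumptions.
const MIN_PAGES_PER_THREAD = 2;
const MAX_PAGES_PER_THREAD = 20;

type SplitPlan =
  | { split: false; reason: string }
  | { split: true; splitSize: number };

// Hypothetical helper: spread pages evenly over the allowed number of
// concurrent requests, clamped to the documented batch-size range.
function chooseSplitSize(pagesCount: number, concurrency: number): number {
  const even = Math.ceil(pagesCount / concurrency);
  return Math.min(Math.max(even, MIN_PAGES_PER_THREAD), MAX_PAGES_PER_THREAD);
}

// Mirrors the order of the checks in the diff: file type first, then the
// minimum page count, then whether splitting would produce only one batch.
function planSplit(fileName: string, pagesCount: number, concurrency = 5): SplitPlan {
  if (!fileName.endsWith(".pdf")) {
    return { split: false, reason: "not a PDF" };
  }
  if (pagesCount < MIN_PAGES_PER_THREAD) {
    return { split: false, reason: "fewer pages than the minimum batch" };
  }
  const splitSize = chooseSplitSize(pagesCount, concurrency);
  if (splitSize >= pagesCount) {
    return { split: false, reason: "too few pages to split efficiently" };
  }
  return { split: true, splitSize };
}

console.log(planSplit("fake-html.html", 0));
console.log(planSplit("short.pdf", 2));
console.log(planSplit("long.pdf", 100));
```

With splitting on by default, an HTML file or a two-page PDF reaches one of these branches on an ordinary call, so an info-level message fits better than a warning.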

src/hooks/custom/utils/pdf.ts

Lines changed: 1 addition & 1 deletion
@@ -102,7 +102,7 @@ export async function loadPdf(
   file: File | null
 ): Promise<[boolean, PDFDocument | null, number]> {
   if (!file?.name.endsWith(".pdf")) {
-    console.warn("Given file is not a PDF. Continuing without splitting.");
+    console.info("Given file is not a PDF, so splitting is not enabled.");
     return [true, null, 0];
   }

src/lib/config.ts

Lines changed: 2 additions & 2 deletions
@@ -61,7 +61,7 @@ export function serverURLFromOptions(options: SDKOptions): URL | null {
 export const SDK_METADATA = {
     language: "typescript",
     openapiDocVersion: "1.0.35",
-    sdkVersion: "0.11.1",
+    sdkVersion: "0.11.2",
     genVersion: "2.342.6",
-    userAgent: "speakeasy-sdk/typescript 0.11.1 2.342.6 1.0.35 unstructured-client",
+    userAgent: "speakeasy-sdk/typescript 0.11.2 2.342.6 1.0.35 unstructured-client",
 } as const;

src/sdk/models/shared/partitionparameters.ts

Lines changed: 2 additions & 2 deletions
@@ -240,7 +240,7 @@ export namespace PartitionParameters$ {
             similarity_threshold: z.nullable(z.number()).optional(),
             skip_infer_table_types: z.array(z.string()).optional(),
             split_pdf_concurrency_level: z.number().int().default(5),
-            split_pdf_page: z.boolean().default(false),
+            split_pdf_page: z.boolean().default(true),
             starting_page_number: z.nullable(z.number().int()).optional(),
             strategy: Strategy$.inboundSchema.default(Strategy.Auto),
             unique_element_ids: z.boolean().default(false),
@@ -326,7 +326,7 @@ export namespace PartitionParameters$ {
             similarityThreshold: z.nullable(z.number()).optional(),
             skipInferTableTypes: z.array(z.string()).optional(),
             splitPdfConcurrencyLevel: z.number().int().default(5),
-            splitPdfPage: z.boolean().default(false),
+            splitPdfPage: z.boolean().default(true),
             startingPageNumber: z.nullable(z.number().int()).optional(),
             strategy: Strategy$.outboundSchema.default(Strategy.Auto),
             uniqueElementIds: z.boolean().default(false),
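
For callers, flipping the schema default means that omitting `splitPdfPage` now enables splitting, while an explicit `false` still disables it. The sketch below shows that behaviour in plain TypeScript without the schema library. `SplitOptions` and `withDefaults` are hypothetical names and are not part of the SDK.

```typescript
// Hypothetical subset of the partition parameters touched by this commit.
interface SplitOptions {
  splitPdfPage?: boolean;
  splitPdfConcurrencyLevel?: number;
}

// Fill in defaults only where the caller left a field undefined.
// `??` keeps an explicit `false` intact, whereas `||` would silently
// turn it back into `true`.
function withDefaults(opts: SplitOptions): Required<SplitOptions> {
  return {
    splitPdfPage: opts.splitPdfPage ?? true,
    splitPdfConcurrencyLevel: opts.splitPdfConcurrencyLevel ?? 5,
  };
}

console.log(withDefaults({}));
console.log(withDefaults({ splitPdfPage: false }));
```

The first call resolves `splitPdfPage` to `true` and the second keeps the caller's `false`. Both fall back to the concurrency default of 5 shown in the schema.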
