chore/set pdf page splitting to true by default #84

Merged 5 commits on Jun 17, 2024
6 changes: 3 additions & 3 deletions .speakeasy/gen.lock
@@ -1,12 +1,12 @@
lockVersion: 2.0.0
id: f42cb8e6-e2ce-4565-b975-5a9f38b94d5a
management:
docChecksum: f152122a1bb6e932d6eb54355be2da11
docChecksum: fd76ef24456d50f23903277067ffaa2e
docVersion: 1.0.35
speakeasyVersion: 1.308.1
generationVersion: 2.342.6
releaseVersion: 0.11.1
configChecksum: bd8ea9af107b087e7927abd855542414
releaseVersion: 0.11.2
configChecksum: f5f0ec91134be577b27ca7ca87e4f5fa
repoURL: https://github.com/Unstructured-IO/unstructured-js-client.git
repoSubDirectory: .
installationURL: https://github.com/Unstructured-IO/unstructured-js-client
181 changes: 124 additions & 57 deletions README.md
@@ -9,12 +9,6 @@
<a href="https://speakeasyapi.dev/"><img src="https://custom-icon-badges.demolab.com/badge/-Built%20By%20Speakeasy-212015?style=for-the-badge&logoColor=FBE331&logo=speakeasy&labelColor=545454" /></a>
</div>

<h2 align="center">
<p>Typescript SDK for the Unstructured API</p>
</h2>

This is a Typescript client for the [Unstructured API](https://unstructured-io.github.io/unstructured/api.html).

<div align="center">

<a
@@ -24,6 +18,13 @@ This is a Typescript client for the [Unstructured API](https://unstructured-io.g

</div>

<h2 align="center">
<p>Typescript SDK for the Unstructured API</p>
</h2>

This is a Typescript client for the [Unstructured API](https://unstructured-io.github.io/unstructured/api.html).

Please refer to the [Unstructured docs](https://docs.unstructured.io/api-reference/api-services/sdk) for a full guide to using the client.

## SDK Installation

@@ -40,45 +41,40 @@ yarn add unstructured-client --dev
```
<!-- No SDK Installation -->

<!-- Start SDK Example Usage [usage] -->
## SDK Example Usage
Only the `files` parameter is required for partition. See the [general partition](docs/sdks/general/README.md) page for all available parameters.

### Example

```typescript
import { openAsBlob } from "node:fs";
import { UnstructuredClient } from "unstructured-client";
import { PartitionResponse } from "unstructured-client/dist/sdk/models/operations";
import * as fs from "fs";

const key = "YOUR-API-KEY";
import { Strategy } from "unstructured-client/sdk/models/shared";

const client = new UnstructuredClient({
const unstructuredClient = new UnstructuredClient({
security: {
apiKeyAuth: key,
apiKeyAuth: "YOUR_API_KEY",
},
// uncomment and change the URL below depending on which services you use or hosting locally; see below for more details
// by default it will make requests against the URL for the freemium (https://unstructured.io/api-key-free) API service
// serverURL: "http://localhost:8000",
});

const filename = "sample-docs/layout-parser-paper.pdf";
const data = fs.readFileSync(filename);
async function run() {
const result = await unstructuredClient.general.partition({
partitionParameters: {
files: await openAsBlob("./sample-file"),
strategy: Strategy.Auto,
},
});

// Handle the result
console.log(result);
}

run();

client.general.partition({
// Note that this currently only supports a single file
files: {
content: data,
fileName: filename,
},
// Other partition params
strategy: "fast",
}).then((res: PartitionResponse) => {
if (res.statusCode == 200) {
console.log(res.elements);
}
}).catch((e) => {
console.log(e.statusCode);
console.log(e.body);
});
```
<!-- End SDK Example Usage [usage] -->

Refer to the [API parameters page](https://docs.unstructured.io/api-reference/api-services/api-parameters) for all available parameters.

## Change the base URL

@@ -103,12 +99,6 @@ const client = new UnstructuredClient({
```


<!-- No SDK Example Usage -->
<!-- No SDK Available Operations -->
<!-- No Pagination -->
<!-- No Error Handling -->
<!-- No Server Selection -->

<!-- Start Custom HTTP Client [http-client] -->
## Custom HTTP Client

@@ -157,24 +147,102 @@ httpClient.addHook("requestError", (error, request) => {
const sdk = new UnstructuredClient({ httpClient });
```
<!-- End Custom HTTP Client [http-client] -->
<!-- No Retries -->
<!-- No Authentication -->

## PartitionParameters
<!-- Start Retries [retries] -->
## Retries

See the [general partition](docs/sdk/models/shared/partitionparameters.md) page for all available parameters.
Some of the endpoints in this SDK support retries. If you use the SDK without any configuration, it will fall back to the default retry strategy provided by the API. However, the default retry strategy can be overridden on a per-operation basis, or across the entire SDK.

### Splitting PDF by pages
To change the default retry strategy for a single API call, simply provide a retryConfig object to the call:
```typescript
import { openAsBlob } from "node:fs";
import { UnstructuredClient } from "unstructured-client";
import { Strategy } from "unstructured-client/sdk/models/shared";

const unstructuredClient = new UnstructuredClient({
security: {
apiKeyAuth: "YOUR_API_KEY",
},
});

To speed up processing of long PDF files, set the `splitPdfPage` parameter to `true`. The PDF is then split into smaller batches on the client side before being sent to the API, and the individual responses are combined into a single result. This works only for PDF files, so do not set it for other file types. The size of each batch is determined internally and can vary between 2 and 20 pages per split.
async function run() {
const result = await unstructuredClient.general.partition(
{
partitionParameters: {
files: await openAsBlob("./sample-file"),
strategy: Strategy.Auto,
},
},
{
retries: {
strategy: "backoff",
backoff: {
initialInterval: 1,
maxInterval: 50,
exponent: 1.1,
maxElapsedTime: 100,
},
retryConnectionErrors: false,
},
}
);

The number of parallel requests is controlled by the `splitPdfConcurrencyLevel` parameter. It defaults to 5 and cannot exceed 15, to avoid excessive resource usage and costs.
// Handle the result
console.log(result);
}

run();

```

If you'd like to override the default retry strategy for all operations that support retries, you can provide a retryConfig at SDK initialization:
```typescript
import { SplitPdfHook } from "unstructured-client/hooks/custom/SplitPdfHook";
import { openAsBlob } from "node:fs";
import { UnstructuredClient } from "unstructured-client";
import { Strategy } from "unstructured-client/sdk/models/shared";

...
const unstructuredClient = new UnstructuredClient({
retryConfig: {
strategy: "backoff",
backoff: {
initialInterval: 1,
maxInterval: 50,
exponent: 1.1,
maxElapsedTime: 100,
},
retryConnectionErrors: false,
},
security: {
apiKeyAuth: "YOUR_API_KEY",
},
});

async function run() {
const result = await unstructuredClient.general.partition({
partitionParameters: {
files: await openAsBlob("./sample-file"),
strategy: Strategy.Auto,
},
});

// Handle the result
console.log(result);
}

run();

```
<!-- End Retries [retries] -->
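
To get a feel for what the `backoff` values above imply, the following is a minimal sketch of how an exponential backoff schedule could be derived from them. It assumes plain exponential growth capped at `maxInterval`, with no jitter, and stops once the cumulative wait would exceed `maxElapsedTime`. The `backoffSchedule` helper and `BackoffConfig` interface are hypothetical names used only for illustration; they are not part of `unstructured-client`, and the SDK's actual retry timing may differ.

```typescript
// Hypothetical helper for illustration; it is not part of unstructured-client.
interface BackoffConfig {
  initialInterval: number; // wait before the first retry
  maxInterval: number; // upper bound on any single wait
  exponent: number; // growth factor applied per attempt
  maxElapsedTime: number; // total waiting budget across all retries
}

// Returns the waits that fit inside maxElapsedTime, assuming plain
// exponential growth with no jitter (an assumption, not the SDK's contract).
function backoffSchedule(cfg: BackoffConfig): number[] {
  if (
    cfg.initialInterval <= 0 ||
    cfg.maxInterval <= 0 ||
    cfg.exponent < 1 ||
    !isFinite(cfg.maxElapsedTime)
  ) {
    throw new RangeError(
      "intervals must be positive, exponent >= 1, and maxElapsedTime finite",
    );
  }
  const waits: number[] = [];
  let elapsed = 0;
  for (let attempt = 0; ; attempt++) {
    // Grow the wait geometrically, but never beyond maxInterval.
    const wait = Math.min(
      cfg.initialInterval * Math.pow(cfg.exponent, attempt),
      cfg.maxInterval,
    );
    // Stop as soon as the next wait would overrun the total budget.
    if (elapsed + wait > cfg.maxElapsedTime) break;
    waits.push(wait);
    elapsed += wait;
  }
  return waits;
}

// Same values as the retry examples above.
const schedule = backoffSchedule({
  initialInterval: 1,
  maxInterval: 50,
  exponent: 1.1,
  maxElapsedTime: 100,
});
console.log(schedule.length, schedule.slice(0, 3));
```

With an exponent close to 1, as in the examples, the waits grow slowly, so many short retries fit into the budget; a larger exponent yields fewer, longer waits.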

### Splitting PDF by pages

See [page splitting](https://docs.unstructured.io/api-reference/api-services/sdk#page-splitting) for more details.

In order to speed up processing of large PDF files, the client splits up PDFs into smaller files, sends these to the API concurrently, and recombines the results. `splitPdfPage` can be set to `false` to disable this.

The number of parallel requests is controlled by the `splitPdfConcurrencyLevel` parameter. It defaults to 5 and cannot exceed 15, to avoid excessive resource usage and costs. The size of each batch is determined internally and can vary between 2 and 20 pages per split.

```typescript
client.general.partition({
partitionParameters: {
files: {
@@ -186,14 +254,7 @@ client.general.partition({
// Modify splitPdfConcurrencyLevel to change the limit of parallel requests
splitPdfConcurrencyLevel: 10,
},
}).then((res: PartitionResponse) => {
if (res.statusCode == 200) {
console.log(res.elements);
}
}).catch((e) => {
console.log(e.statusCode);
console.log(e.body);
});
});
```
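
The batching rules described above (2 to 20 pages per split, concurrency defaulting to 5 and capped at 15, no split when the document is too small) can be sketched roughly as follows. This is only an illustration under stated assumptions: `planSplit`, `clamp`, and the constants are hypothetical names, the "pages divided by concurrency" heuristic is a guess, and the client's internal sizing logic may differ.

```typescript
// Hypothetical constants and helpers; they are not exported by unstructured-client.
const MIN_PAGES_PER_SPLIT = 2;
const MAX_PAGES_PER_SPLIT = 20;
const DEFAULT_CONCURRENCY = 5;
const MAX_CONCURRENCY = 15;

function clamp(value: number, low: number, high: number): number {
  return Math.min(Math.max(value, low), high);
}

// Returns the page count of each batch, or null when the document
// would be sent as a single request without splitting.
function planSplit(
  pagesCount: number,
  concurrency: number = DEFAULT_CONCURRENCY,
): number[] | null {
  // Assumption: out-of-range concurrency is clamped rather than rejected.
  const workers = clamp(Math.floor(concurrency), 1, MAX_CONCURRENCY);
  if (pagesCount < MIN_PAGES_PER_SPLIT) return null;

  // Assumption: aim for one batch per worker, within the 2..20 page bounds.
  const splitSize = clamp(
    Math.ceil(pagesCount / workers),
    MIN_PAGES_PER_SPLIT,
    MAX_PAGES_PER_SPLIT,
  );
  if (splitSize >= pagesCount) return null; // too few pages to split usefully

  const batches: number[] = [];
  for (let start = 0; start < pagesCount; start += splitSize) {
    // The final batch may be smaller than splitSize.
    batches.push(Math.min(splitSize, pagesCount - start));
  }
  return batches;
}

console.log(planSplit(100)); // five batches of 20 pages under these assumptions
console.log(planSplit(2)); // null: sent without splitting
```

Under this heuristic, raising `splitPdfConcurrencyLevel` only shrinks batches until the 2-page floor is reached, and very long documents produce more batches than workers because of the 20-page ceiling.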

<!-- Start Requirements [requirements] -->
@@ -244,6 +305,12 @@ run();
```
<!-- End File uploads [file-upload] -->

<!-- No Authentication -->
<!-- No SDK Available Operations -->
<!-- No Pagination -->
<!-- No Error Handling -->
<!-- No Server Selection -->

<!-- Placeholder for Future Speakeasy SDK Sections -->

### Maturity
12 changes: 11 additions & 1 deletion RELEASES.md
@@ -333,4 +333,14 @@ Based on:
### Generated
- [typescript v0.11.1] .
### Releases
- [NPM v0.11.1] https://www.npmjs.com/package/unstructured-client/v/0.11.1 - .
- [NPM v0.11.1] https://www.npmjs.com/package/unstructured-client/v/0.11.1 - .

## 2024-06-17 17:43:15
### Changes
Based on:
- OpenAPI Doc
- Speakeasy CLI 1.308.1 (2.342.6) https://github.com/speakeasy-api/speakeasy
### Generated
- [typescript v0.11.1] .
### Releases
- [NPM v0.11.2] https://www.npmjs.com/package/unstructured-client/v/0.11.2 - .
2 changes: 1 addition & 1 deletion gen.yaml
@@ -10,7 +10,7 @@ generation:
auth:
oAuth2ClientCredentialsEnabled: false
typescript:
version: 0.11.1
version: 0.11.2
additionalDependencies:
dependencies:
async: ^3.2.5
2 changes: 1 addition & 1 deletion jsr.json
@@ -2,7 +2,7 @@

{
"name": "unstructured-client",
"version": "0.11.1",
"version": "0.11.2",
"exports": {
".": "./src/index.ts",
"./sdk/models/errors": "./src/sdk/models/errors/index.ts",
2 changes: 1 addition & 1 deletion overlay_client.yaml
@@ -10,7 +10,7 @@ actions:
"type": "boolean",
"title": "Split Pdf Page",
"description": "Should the pdf file be split at client. Ignored on backend.",
"default": false,
"default": true,
}
- target: $["components"]["schemas"]["partition_parameters"]["properties"]
update:
2 changes: 1 addition & 1 deletion package.json
@@ -1,6 +1,6 @@
{
"name": "unstructured-client",
"version": "0.11.1",
"version": "0.11.2",
"author": "Unstructured",
"main": "./index.js",
"sideEffects": false,
8 changes: 4 additions & 4 deletions src/hooks/custom/SplitPdfHook.ts
@@ -22,7 +22,7 @@ import {
prepareResponseHeaders,
splitPdf,
stringToBoolean,
} from "./utils";
} from "./utils/index";
import {
MIN_PAGES_PER_THREAD,
PARTITION_FORM_FILES_KEY,
@@ -98,12 +98,12 @@ export class SplitPdfHook

const [error, pdf, pagesCount] = await loadPdf(file);
if (file === null || pdf === null || error) {
console.warn("File could not be split. Partitioning without split.")
console.info("Partitioning without split.")
return request;
}

if (pagesCount < MIN_PAGES_PER_THREAD) {
console.warn(
console.info(
`PDF has less than ${MIN_PAGES_PER_THREAD} pages. Partitioning without split.`
);
return request;
@@ -119,7 +119,7 @@ export class SplitPdfHook
console.info("Determined optimal split size of %d pages.", splitSize)

if (splitSize >= pagesCount) {
console.warn(
console.info(
"Document has too few pages (%d) to be split efficiently. Partitioning without split.",
pagesCount,
)
2 changes: 1 addition & 1 deletion src/hooks/custom/utils/pdf.ts
@@ -102,7 +102,7 @@ export async function loadPdf(
file: File | null
): Promise<[boolean, PDFDocument | null, number]> {
if (!file?.name.endsWith(".pdf")) {
console.warn("Given file is not a PDF. Continuing without splitting.");
console.info("Given file is not a PDF, so splitting is not enabled.");
return [true, null, 0];
}

4 changes: 2 additions & 2 deletions src/lib/config.ts
@@ -61,7 +61,7 @@ export function serverURLFromOptions(options: SDKOptions): URL | null {
export const SDK_METADATA = {
language: "typescript",
openapiDocVersion: "1.0.35",
sdkVersion: "0.11.1",
sdkVersion: "0.11.2",
genVersion: "2.342.6",
userAgent: "speakeasy-sdk/typescript 0.11.1 2.342.6 1.0.35 unstructured-client",
userAgent: "speakeasy-sdk/typescript 0.11.2 2.342.6 1.0.35 unstructured-client",
} as const;
4 changes: 2 additions & 2 deletions src/sdk/models/shared/partitionparameters.ts
@@ -240,7 +240,7 @@ export namespace PartitionParameters$ {
similarity_threshold: z.nullable(z.number()).optional(),
skip_infer_table_types: z.array(z.string()).optional(),
split_pdf_concurrency_level: z.number().int().default(5),
split_pdf_page: z.boolean().default(false),
split_pdf_page: z.boolean().default(true),
starting_page_number: z.nullable(z.number().int()).optional(),
strategy: Strategy$.inboundSchema.default(Strategy.Auto),
unique_element_ids: z.boolean().default(false),
@@ -326,7 +326,7 @@ export namespace PartitionParameters$ {
similarityThreshold: z.nullable(z.number()).optional(),
skipInferTableTypes: z.array(z.string()).optional(),
splitPdfConcurrencyLevel: z.number().int().default(5),
splitPdfPage: z.boolean().default(false),
splitPdfPage: z.boolean().default(true),
startingPageNumber: z.nullable(z.number().int()).optional(),
strategy: Strategy$.outboundSchema.default(Strategy.Auto),
uniqueElementIds: z.boolean().default(false),
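
The practical effect of flipping the schema default is that callers who omit `splitPdfPage` now get splitting, while an explicit `false` still opts out. The following is a minimal sketch of that resolution in plain TypeScript, without zod. `SplitOptions` and `resolveSplitOptions` are hypothetical names for illustration and do not exist in the SDK; the defaults mirror the values shown in the schema above.

```typescript
// Hypothetical types and helper; not part of unstructured-client.
interface SplitOptions {
  splitPdfPage?: boolean;
  splitPdfConcurrencyLevel?: number;
}

// Fills in missing fields the way a schema-level default would:
// only `undefined` triggers the default, an explicit `false` is kept.
function resolveSplitOptions(opts: SplitOptions = {}): Required<SplitOptions> {
  return {
    splitPdfPage: opts.splitPdfPage ?? true, // default after this change
    splitPdfConcurrencyLevel: opts.splitPdfConcurrencyLevel ?? 5,
  };
}

console.log(resolveSplitOptions()); // splitting enabled when nothing is passed
console.log(resolveSplitOptions({ splitPdfPage: false })); // explicit opt-out is respected
```

Callers that relied on the previous default and want unsplit requests would need to pass `splitPdfPage: false` explicitly.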