feat: Parameter to send custom page range when splitting pdf #125

awalker4 · 2024-07-10T15:25:10Z

New parameter

Add a client-side param called split_pdf_page_range, which takes a list of two integers, [start_page, end_page]. If split_pdf_page is True and a range is set, slice the doc from start_page up to and including end_page. Only this page range is sent to the API. The subset of pages is still split up as needed.
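
A minimal sketch of the range semantics described above, assuming pages are 1-indexed (as the `[1, 16]` example under Testing suggests). The helper name `resolve_page_range` and its signature are hypothetical and are not taken from the client code; this only illustrates the inclusive bounds and the invalid cases.

```python
from typing import Optional, Sequence, Tuple


def resolve_page_range(
    page_range: Optional[Sequence[int]], total_pages: int
) -> Tuple[int, int]:
    """Return an inclusive, 1-indexed (start_page, end_page) pair.

    Hypothetical helper: the name and signature are illustrative only.
    """
    if page_range is None:
        # No range set: process the whole document.
        return 1, total_pages

    if len(page_range) != 2:
        raise ValueError("page range must be [start_page, end_page]")

    start_page, end_page = page_range
    if start_page < 1 or end_page > total_pages or end_page < start_page:
        raise ValueError(
            f"page range {list(page_range)} is invalid for a document "
            f"with {total_pages} pages"
        )
    return start_page, end_page


# Inclusive 1-indexed pages map onto a 0-indexed, half-open slice.
start_page, end_page = resolve_page_range([4, 8], total_pages=16)
selected = list(range(16))[start_page - 1 : end_page]  # 0-indexed pages 3..7
```

Because the end page is inclusive, the slice stops at `end_page` rather than `end_page - 1`.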

Other changes

Allow our custom hooks to properly access list parameters, so we're able to intercept split_pdf_page_range. We need extra handling to get list params out of the request in parse_form_data, and to rebuild the payload in create_request_body.

Testing

Check out this branch and set up a request to your local API:

    from unstructured_client import UnstructuredClient
    from unstructured_client.models import shared

    client = UnstructuredClient(api_key_auth="", server_url="localhost:8000")

    filename = "_sample_docs/layout-parser-paper.pdf"
    with open(filename, "rb") as f:
        files = shared.Files(
            content=f.read(),
            file_name=filename,
        )

    req = shared.PartitionParameters(
        files=files,
        strategy="fast",
        split_pdf_page=True,
        split_pdf_page_range=[1, 16],
    )

    resp = client.general.partition(req)

Test out various page ranges and confirm that the returned elements are within the range. Invalid ranges should raise a ValueError (pages are out of bounds, or end_page < start_page).
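
One way to confirm that the returned elements are within the range is sketched below. It assumes each element is a dict whose `metadata` carries a 1-indexed `page_number` relative to the original document; the helper name `pages_outside_range` is hypothetical.

```python
from typing import Any, Dict, Iterable, List


def pages_outside_range(
    elements: Iterable[Dict[str, Any]], start_page: int, end_page: int
) -> List[int]:
    """Collect page numbers of elements outside [start_page, end_page].

    Assumes each element looks like {"metadata": {"page_number": int}}.
    Elements without a page number are ignored.
    """
    outside = set()
    for element in elements:
        page = element.get("metadata", {}).get("page_number")
        if page is not None and not start_page <= page <= end_page:
            outside.add(page)
    return sorted(outside)


# Illustrative data only, not real API output.
sample_elements = [
    {"type": "Title", "metadata": {"page_number": 4}},
    {"type": "NarrativeText", "metadata": {"page_number": 8}},
    {"type": "NarrativeText", "metadata": {"page_number": 9}},
]
stray_pages = pages_outside_range(sample_elements, 4, 8)
```

With the request above, a check such as `assert pages_outside_range(resp.elements, 1, 16) == []` would flag any element that slipped outside the requested range.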

When the client prepares the request, it turns list parameters into multiple instances of the same key. For instance, `extract_image_block_types=["Image", "Table"]` becomes

    extract_image_block_types[]="Image"
    extract_image_block_types[]="Table"

We need to account for this in our `parse_form_data` helper if we want to use list params in our hooks. Likewise, we need to go the other way when recreating the request in `create_request_body`.
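
The round trip described here can be sketched as follows. This is not the code in `parse_form_data` or `create_request_body`; the function names `fields_to_params` and `params_to_fields` are hypothetical, and the sketch assumes the form fields are already available as `(name, value)` string pairs.

```python
from typing import Dict, List, Tuple, Union

FormValue = Union[str, List[str]]


def fields_to_params(fields: List[Tuple[str, str]]) -> Dict[str, FormValue]:
    """Collapse repeated `key[]` fields into a single list-valued param."""
    params: Dict[str, FormValue] = {}
    for name, value in fields:
        if name.endswith("[]"):
            # Strip the "[]" suffix and accumulate values in order.
            params.setdefault(name[:-2], []).append(value)
        else:
            params[name] = value
    return params


def params_to_fields(params: Dict[str, FormValue]) -> List[Tuple[str, str]]:
    """Expand list-valued params back into repeated `key[]` fields."""
    fields: List[Tuple[str, str]] = []
    for name, value in params.items():
        if isinstance(value, list):
            fields.extend((f"{name}[]", item) for item in value)
        else:
            fields.append((name, value))
    return fields


fields = [
    ("strategy", "fast"),
    ("extract_image_block_types[]", "Image"),
    ("extract_image_block_types[]", "Table"),
]
params = fields_to_params(fields)
rebuilt = params_to_fields(params)
```

Keeping the two directions symmetric means a hook can read a list param such as `split_pdf_page_range`, then rebuild an equivalent request body for each split.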

src/unstructured_client/_hooks/custom/split_pdf_hook.py

_test_unstructured_client/integration/test_decorators.py

_test_unstructured_client/unit/test_split_pdf_hook.py

src/unstructured_client/_hooks/custom/form_utils.py


pawel-kmiecik

LGTM!


To match the python feature: Unstructured-IO/unstructured-python-client#125

# New parameter

Add a client-side param called `splitPdfPageRange` which takes a list of two integers, `[start, end]`. If `splitPdfPage` is `true` and a range is set, slice the doc from `start` up to and including `end`. Only this page range will be sent to the API. The subset of pages is still split up as needed.

If `[start, end]` is out of bounds, throw an error to the user.

# Testing

Check out this branch and set up a request to your local API:

```
const client = new UnstructuredClient({
    serverURL: "http://localhost:8000",
    security: {
        apiKeyAuth: key,
    },
});

const filename = "layout-parser-paper.pdf";
const data = fs.readFileSync(filename);

client.general.partition({
    partitionParameters: {
        files: {
            content: data,
            fileName: filename,
        },
        strategy: Strategy.Fast,
        splitPdfPage: true,
        splitPdfPageRange: [4, 8],
    }
}).then((res: PartitionResponse) => {
    if (res.statusCode == 200) {
        console.log(res.elements);
    }
}).catch((e) => {
    if (e.statusCode) {
        console.log(e.statusCode);
        console.log(e.body);
    } else {
        console.log(e);
    }
});
```

Test out various page ranges and confirm that the returned elements are within the range. Invalid ranges should throw a useful Error (pages are out of bounds, or end_page < start_page).

awalker4 added 2 commits July 3, 2024 17:10

Add get_page_range form util helper

f54ad53

awalker4 changed the title ~~Feat/page ranges~~ feat: Parameter to send custom page ranges when splitting pdf Jul 10, 2024

awalker4 changed the title ~~feat: Parameter to send custom page ranges when splitting pdf~~ feat: Parameter to send custom page range when splitting pdf Jul 10, 2024

Add support for page ranges in pdf split hook

f98193c

awalker4 force-pushed the feat/page-ranges branch from b441902 to 17f84c6 Compare July 10, 2024 17:11

Add split_pdf_page_range parameter, update unit tests

80902b5

awalker4 force-pushed the feat/page-ranges branch from 17f84c6 to 80902b5 Compare July 10, 2024 17:26

Update param docstring

39483e9

awalker4 marked this pull request as ready for review July 10, 2024 18:06

awalker4 requested review from Klaijan, pawel-kmiecik and mpolomdeepsense July 10, 2024 18:06

awalker4 and others added 3 commits July 10, 2024 14:11

Update comment

fe6a6a3

Fix pylint errors

f2ad642

Merge branch 'main' into feat/page-ranges

4a20be0

mpolomdeepsense reviewed Jul 11, 2024


src/unstructured_client/_hooks/custom/split_pdf_hook.py Outdated

pawel-kmiecik reviewed Jul 11, 2024


Update some log messages, unit test assertions

ab11a4d

awalker4 force-pushed the feat/page-ranges branch from 61ff079 to ab11a4d Compare July 11, 2024 20:02

pawel-kmiecik approved these changes Jul 12, 2024


awalker4 merged commit 5e30cda into main Jul 12, 2024
7 checks passed

awalker4 deleted the feat/page-ranges branch July 12, 2024 15:48

awalker4 mentioned this pull request Aug 7, 2024

feat: Parameter to send custom page range when splitting pdf Unstructured-IO/unstructured-js-client#101

Merged
