feat: Parameter to send custom page range when splitting pdf #125

Merged
merged 9 commits into from
Jul 12, 2024

Conversation

awalker4
Collaborator

@awalker4 awalker4 commented Jul 10, 2024

New parameter

Add a client-side parameter called split_pdf_page_range, which takes a list of two integers, [start_page, end_page]. If split_pdf_page is True and a range is set, the client slices the document from start_page up to and including end_page, and only those pages are sent to the API. The selected pages are still split up as needed.
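
The range semantics described above can be sketched roughly as follows. This is a minimal illustration, not the code added in this PR: `select_page_range` and `pages_to_send` are hypothetical helpers, and the sketch assumes pages are numbered from 1, which matches the `[1, 16]` range used in the Testing section.

```python
from typing import List, Sequence, Tuple


def select_page_range(total_pages: int, page_range: Sequence[int]) -> Tuple[int, int]:
    """Validate a 1-based, inclusive [start_page, end_page] pair.

    Hypothetical helper for illustration only; not part of the client.
    """
    if len(page_range) != 2:
        raise ValueError("page range must be [start_page, end_page]")
    start_page, end_page = page_range
    if end_page < start_page:
        raise ValueError(f"end_page {end_page} is before start_page {start_page}")
    if start_page < 1 or end_page > total_pages:
        raise ValueError(
            f"page range [{start_page}, {end_page}] is out of bounds "
            f"for a {total_pages}-page document"
        )
    return start_page, end_page


def pages_to_send(total_pages: int, page_range: Sequence[int]) -> List[int]:
    """Return the 1-based page numbers that would be sent to the API."""
    start_page, end_page = select_page_range(total_pages, page_range)
    # range() excludes its stop value, so add 1 to include end_page.
    return list(range(start_page, end_page + 1))
```

Under these assumptions, `pages_to_send(16, [4, 8])` gives `[4, 5, 6, 7, 8]` for a 16-page document, while `[8, 4]` or `[5, 17]` raises a `ValueError`.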

Other changes

Allow our custom hooks to properly access list parameters, so we're able to intercept split_pdf_page_range. We need extra handling to get list params out of the request in parse_form_data, and to rebuild the payload in create_request_body.

Testing

Check out this branch and set up a request to your local API:

    from unstructured_client import UnstructuredClient
    from unstructured_client.models import shared

    client = UnstructuredClient(api_key_auth="", server_url="localhost:8000")

    filename = "_sample_docs/layout-parser-paper.pdf"
    with open(filename, "rb") as f:
        files = shared.Files(
            content=f.read(),
            file_name=filename,
        )

    req = shared.PartitionParameters(
        files=files,
        strategy="fast",
        split_pdf_page=True,
        split_pdf_page_range=[1, 16],
    )

    resp = client.general.partition(req)

Test out various page ranges and confirm that the returned elements are within the range. Invalid ranges should raise a ValueError (pages are out of bounds, or end_page < start_page).
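
One way to check the returned elements is a small helper like the one below. It is only a sketch: `elements_outside_range` is a hypothetical name, and it assumes each element is a dict carrying `metadata.page_number` and that the page numbers refer to the original document rather than to the sliced subset.

```python
from typing import Any, Dict, List


def elements_outside_range(
    elements: List[Dict[str, Any]], start_page: int, end_page: int
) -> List[Dict[str, Any]]:
    """Return elements whose page number falls outside [start_page, end_page].

    Assumes each element is a dict with a `metadata.page_number` entry.
    Elements without a page number are skipped rather than flagged.
    """
    outside = []
    for element in elements:
        page = element.get("metadata", {}).get("page_number")
        if page is not None and not (start_page <= page <= end_page):
            outside.append(element)
    return outside
```

With the request above, `assert not elements_outside_range(resp.elements, 1, 16)` would be the corresponding check, assuming `resp.elements` is a list of such dicts.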

awalker4 added 2 commits July 3, 2024 17:10
When the client prepares the request, it turns list parameters into multiple instances of the same
key. For instance:

`extract_image_block_types=["Image", "Table"]`

becomes

`extract_image_block_types[]="Image"`
`extract_image_block_types[]="Table"`

We need to account for this in our `parse_form_data` helper if we want to use list params in our
hooks. Likewise, we need to go the other way when recreating the request in `create_request_body`.
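
The round trip described above can be illustrated with plain key/value pairs. This is a simplified sketch, not the actual `parse_form_data` or `create_request_body` implementations, which work on the multipart request itself; `collapse_list_fields` and `expand_list_fields` are hypothetical names.

```python
from typing import Any, Dict, List, Tuple


def collapse_list_fields(fields: List[Tuple[str, str]]) -> Dict[str, Any]:
    """Merge repeated `key[]` fields into a single list stored under `key`."""
    form: Dict[str, Any] = {}
    for name, value in fields:
        if name.endswith("[]"):
            # Strip the trailing "[]" and collect every value for this key.
            form.setdefault(name[:-2], []).append(value)
        else:
            form[name] = value
    return form


def expand_list_fields(form: Dict[str, Any]) -> List[Tuple[str, str]]:
    """Reverse the step above: emit one `key[]` field per list item."""
    fields: List[Tuple[str, str]] = []
    for key, value in form.items():
        if isinstance(value, list):
            fields.extend((f"{key}[]", str(item)) for item in value)
        else:
            fields.append((key, value))
    return fields
```

In this sketch, the two `extract_image_block_types[]` fields collapse into `{"extract_image_block_types": ["Image", "Table"]}`, and expanding that dict yields the original two fields again.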
@awalker4 awalker4 changed the title Feat/page ranges feat: Parameter to send custom page ranges when splitting pdf Jul 10, 2024
@awalker4 awalker4 changed the title feat: Parameter to send custom page ranges when splitting pdf feat: Parameter to send custom page range when splitting pdf Jul 10, 2024
@awalker4 awalker4 marked this pull request as ready for review July 10, 2024 18:06
Contributor

@pawel-kmiecik pawel-kmiecik left a comment

LGTM!

@awalker4 awalker4 merged commit 5e30cda into main Jul 12, 2024
7 checks passed
@awalker4 awalker4 deleted the feat/page-ranges branch July 12, 2024 15:48
awalker4 added a commit to Unstructured-IO/unstructured-js-client that referenced this pull request Aug 7, 2024
To match the python feature: Unstructured-IO/unstructured-python-client#125

Add a client-side param called `splitPdfPageRange` which takes a list of two integers, `[start, end]`. If `splitPdfPage` is `true` and a range is set, slice the doc from `start` up to and including `end`. Only this page range will be sent to the API. The subset of pages is still split up as needed. If `[start, end]` is out of bounds, throw an error to the user.
awalker4 added a commit to Unstructured-IO/unstructured-js-client that referenced this pull request Aug 9, 2024
To match the python feature:
Unstructured-IO/unstructured-python-client#125

# New parameter
Add a client-side param called `splitPdfPageRange` which takes a list of
two integers, `[start, end]`. If `splitPdfPage` is `true` and a range is
set, slice the doc from `start` up to and including `end`. Only this
page range will be sent to the API. The subset of pages is still split
up as needed. If `[start, end]` is out of bounds, throw an error to the
user.

# Testing
Check out this branch and set up a request to your local API:

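```typescript
import * as fs from "fs";

import { UnstructuredClient } from "unstructured-client";
import { PartitionResponse } from "unstructured-client/sdk/models/operations";
import { Strategy } from "unstructured-client/sdk/models/shared";

// `key` is assumed to hold your API key string.
const client = new UnstructuredClient({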
    serverURL: "http://localhost:8000",
    security: {
        apiKeyAuth: key,
    },
});

const filename = "layout-parser-paper.pdf";
const data = fs.readFileSync(filename);

client.general.partition({
    partitionParameters: {
        files: {
            content: data,
            fileName: filename,
        },
        strategy: Strategy.Fast,
        splitPdfPage: true,
        splitPdfPageRange: [4, 8],
    }
}).then((res: PartitionResponse) => {
    if (res.statusCode === 200) {
        console.log(res.elements);
    }
}).catch((e) => {
    if (e.statusCode) {
        console.log(e.statusCode);
        console.log(e.body);
    } else {
        console.log(e);
    }
});
```

Test out various page ranges and confirm that the returned elements are
within the range. Invalid ranges should throw a useful Error (pages are
out of bounds, or `end` < `start`).