fix: Allow split page logic to process files concurrently #175

Merged
merged 4 commits into from
Sep 17, 2024

awalker4
Collaborator

@awalker4 awalker4 commented Sep 16, 2024

The Issue

split_pdf_hook.py does not support processing multiple files concurrently. We store the split request tasks in self.coroutines_to_execute[operation_id], where operation_id is always the string "partition". If two documents are sent concurrently through the same SDK instance, both requests therefore try to await the same list of coroutines. This could interleave results, but in practice it usually fails with RuntimeError: coroutine is being awaited already when the second request starts awaiting its coroutines. This blocks anyone trying to use the new partition_async to fan out their PDFs.
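
A minimal sketch of this failure mode, outside the SDK. The names fetch_page and run_operation are hypothetical stand-ins for the split requests and the hook logic; only the shared "partition" key mirrors the description above.

```python
import asyncio


async def fetch_page(n):
    # Hypothetical stand-in for one split-page request.
    await asyncio.sleep(0.01)
    return n


async def run_operation(shared, operation_id):
    # Every caller looks up the same list, because the key is constant.
    return [await coro for coro in shared[operation_id]]


async def main():
    # One list of coroutine objects stored under the fixed key "partition".
    shared = {"partition": [fetch_page(1), fetch_page(2)]}
    # Two "documents" processed concurrently both await the same coroutines.
    return await asyncio.gather(
        run_operation(shared, "partition"),
        run_operation(shared, "partition"),
        return_exceptions=True,
    )


if __name__ == "__main__":
    first, second = asyncio.run(main())
    print(first)         # the first caller gets its results
    print(repr(second))  # the second caller gets a RuntimeError
```

The second caller fails as soon as it tries to await a coroutine that the first caller has already started and is still suspended on.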

Note that the js/ts client also has this issue.

The fix

We need a unique id to index into coroutines_to_execute. In before_request, we generate a UUID and build up the list of coroutines for this document under that key. after_success needs the same id to retrieve the results, so we set it as a header on the "dummy" request that is returned to the SDK.
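
The idea can be sketched roughly as follows. This is a simplified illustration, not the code in this PR: the class SplitHookSketch, the header name "operation_id", the dict-shaped dummy request, and the helper _send_page are all assumptions, and the real hook signatures differ.

```python
import asyncio
import uuid

# Hypothetical header name used to carry the id between the two hooks.
OPERATION_ID_HEADER = "operation_id"


class SplitHookSketch:
    def __init__(self):
        # Maps a per-document id to that document's pending coroutines.
        self.coroutines_to_execute = {}

    async def _send_page(self, page):
        # Hypothetical stand-in for one split-page request.
        await asyncio.sleep(0.01)
        return f"elements for {page}"

    def before_request(self, pages):
        # A fresh UUID per document replaces the constant "partition" key.
        operation_id = str(uuid.uuid4())
        self.coroutines_to_execute[operation_id] = [
            self._send_page(page) for page in pages
        ]
        # The "dummy" request carries the id so after_success can find it.
        return {"headers": {OPERATION_ID_HEADER: operation_id}}

    async def after_success(self, dummy_request):
        operation_id = dummy_request["headers"][OPERATION_ID_HEADER]
        # pop() also cleans up the entry once this document is done.
        coroutines = self.coroutines_to_execute.pop(operation_id)
        return await asyncio.gather(*coroutines)


async def partition(hook, pages):
    request = hook.before_request(pages)
    return await hook.after_success(request)


async def main():
    hook = SplitHookSketch()
    # Two documents share one hook instance without sharing coroutines.
    return await asyncio.gather(
        partition(hook, ["a1", "a2"]),
        partition(hook, ["b1"]),
    )


if __name__ == "__main__":
    print(asyncio.run(main()))
```

Because each document has its own key, concurrent calls never await each other's coroutines, and each call gets back only its own pages.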

Testing

See the new integration test. It sends two documents serially, then sends them again with asyncio.gather, and confirms that the results are the same.
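
The shape of that check, as a rough sketch. fake_partition_async is a hypothetical stand-in for the real client call, and the file names are placeholders; the actual integration test in this PR talks to a running API.

```python
import asyncio


async def fake_partition_async(filename):
    # Hypothetical stand-in for the SDK's async partition call.
    await asyncio.sleep(0.01)
    return [f"{filename}: page {i}" for i in range(3)]


async def compare_serial_and_concurrent(filenames):
    # Serial baseline: one document at a time.
    serial = [await fake_partition_async(name) for name in filenames]
    # Concurrent run: all documents at once; gather keeps input order.
    concurrent = await asyncio.gather(
        *(fake_partition_async(name) for name in filenames)
    )
    return serial, concurrent


if __name__ == "__main__":
    serial, concurrent = asyncio.run(
        compare_serial_and_concurrent(["doc_a.pdf", "doc_b.pdf"])
    )
    assert serial == concurrent
```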

@awalker4 awalker4 enabled auto-merge (squash) September 17, 2024 18:58
@awalker4 awalker4 requested a review from badGarnet September 17, 2024 19:30
@awalker4 awalker4 merged commit 54fd2eb into main Sep 17, 2024
7 checks passed
@awalker4 awalker4 deleted the fix/concurrent-files branch September 17, 2024 20:32
yuming-long added a commit to Unstructured-IO/unstructured-js-client that referenced this pull request Oct 2, 2024
### Summary

Copy Unstructured-IO/unstructured-python-client#175 into JS/TS.

### Test
Added an integration test that passes on CI.
Local integration test:
* Start the core product API with Docker, but with -p 8000:5000 (change the
corresponding line in the Makefile used by make docker-start-api)
* `make build`
* `npx jest --verbose --detectOpenHandles --config jest.config.js
test/integration --forceExit -t "SplitPDF async can be used to send
multiple files concurrently"`

You can also move the unit test to main and run `make build npx jest
test...` again; the test fails there, but passes on this branch.