Make push_to_hub atomic (#7600) #7605


Open · wants to merge 1 commit into main

Conversation


@sharvil commented Jun 9, 2025

No description provided.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@lhoestq (Member) commented Jun 11, 2025

Hi! Unfortunately we can't allow atomic commits for commits with hundreds of file additions (HF would time out).

Maybe an alternative would be to retry if there was a commit in between? This could be the default behavior as well.
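
Something like this, roughly (just a sketch; `commit_with_retry` and `MAX_RETRIES` are made-up names, and I'm assuming a stale `parent_commit` rejection surfaces as an `HfHubHTTPError`):

```python
# Rough sketch only; commit_with_retry / MAX_RETRIES are made-up names.
from huggingface_hub import HfApi
from huggingface_hub.utils import HfHubHTTPError

api = HfApi()
MAX_RETRIES = 3  # made-up bound

def commit_with_retry(repo_id, operations, message):
    for attempt in range(MAX_RETRIES):
        # Pin the commit to the head we just observed; the Hub rejects it
        # if another writer moved the branch in the meantime.
        parent = api.dataset_info(repo_id).sha
        try:
            return api.create_commit(
                repo_id=repo_id,
                repo_type="dataset",
                operations=operations,
                commit_message=message,
                parent_commit=parent,
            )
        except HfHubHTTPError:
            if attempt == MAX_RETRIES - 1:
                raise  # give up after a few concurrent-commit rejections
```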

@sharvil (Author) commented Jun 11, 2025

Thanks for taking a look – much appreciated!

I've verified that commits with up to 20,000 files don't time out, and commit time scales linearly with the number of operations enqueued. The 20k-file commit completed successfully in just under 2 minutes.

The fundamental issue I'm trying to tackle here is dataset corruption: getting into a state where a dataset on the Hub cannot be used when downloaded. Non-atomic commits won't get us there, I think. If, for example, 3 of 5 commits complete and the machine/process calling push_to_hub has a network, hardware, or other failure that prevents it from completing the rest of the commits (even with retries), we'll now have some pointer files pointing to the new data and others pointing to the old data => corrupted. While this may seem like an unlikely scenario, it's a regular occurrence at scale.

If you still feel strongly that atomic commits are not the right way to go, I can either set it to not be the default or remove it entirely from this PR.

As for retries, it's a good idea. In a non-atomic world, the logic gets more complicated (rough sketch after the list):

  • keep an explicit queue of pending add/delete operations
  • chunkwise pop from queue and commit with parent_commit set to previous chunked commit hash
  • if create_commit fails:
    • re-fetch README and set parent_commit to latest hash for revision
    • re-generate dataset card content
    • swap old CommitOperationAdd with new one for README in the pending queue
  • resume chunkwise committing from the queue as above
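
Sketched out (illustrative only; `push_in_chunks`, `make_readme_op`, and `CHUNK` are made-up names, and I'm assuming a stale `parent_commit` rejection surfaces as an `HfHubHTTPError`):

```python
# Illustrative sketch only; none of these names exist in the PR.
from collections import deque
from huggingface_hub import HfApi
from huggingface_hub.utils import HfHubHTTPError

api = HfApi()
CHUNK = 100  # made-up chunk size

def push_in_chunks(repo_id, operations, make_readme_op, message):
    # make_readme_op() would regenerate the dataset card and return a fresh
    # CommitOperationAdd for README.md (hypothetical callback).
    queue = deque(operations)  # explicit queue of pending add/delete ops
    parent = api.dataset_info(repo_id).sha
    while queue:
        chunk = [queue.popleft() for _ in range(min(CHUNK, len(queue)))]
        try:
            info = api.create_commit(
                repo_id=repo_id,
                repo_type="dataset",
                operations=chunk,
                commit_message=message,
                parent_commit=parent,  # chain each chunk onto the previous one
            )
            parent = info.oid  # next chunk's parent is this commit's hash
        except HfHubHTTPError:
            # A commit landed in between: put the chunk back, regenerate the
            # README, and resume from the new head. (A real implementation
            # would bound retries and only handle the stale-parent error.)
            queue.extendleft(reversed(chunk))
            queue = deque(
                make_readme_op() if op.path_in_repo == "README.md" else op
                for op in queue
            )
            parent = api.dataset_info(repo_id).sha
```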

Entirely doable, but more involved than I signed up for with this PR.

@sharvil (Author) commented Jun 11, 2025

Just to clarify – setting the parent_commit can be separated from making the commit atomic (which is what I'm suggesting: either make atomic commits non-default or remove them from this PR). It's crucial to set the parent commit to avoid the read-modify-write race condition on the README schema.
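
Concretely, the read-modify-write loop for the README alone would look something like this (illustrative sketch; assumes the stale-parent rejection surfaces as an `HfHubHTTPError`):

```python
# Sketch of pinning parent_commit for the README update alone; illustrative.
from huggingface_hub import CommitOperationAdd, DatasetCard, HfApi
from huggingface_hub.utils import HfHubHTTPError

api = HfApi()

def update_readme(repo_id):
    while True:  # a real implementation would bound retries
        head = api.dataset_info(repo_id).sha   # read the current head
        card = DatasetCard.load(repo_id)       # modify (schema edit elided)
        op = CommitOperationAdd("README.md", str(card).encode())
        try:
            # Write, pinned to the head we read; fails if someone committed
            # in between, in which case we re-read and regenerate.
            return api.create_commit(
                repo_id=repo_id,
                repo_type="dataset",
                operations=[op],
                commit_message="Update README.md",
                parent_commit=head,
            )
        except HfHubHTTPError:
            continue
```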
