Make push_to_hub atomic (#7600) #7605


Open · wants to merge 1 commit into main

Conversation


@sharvil commented Jun 9, 2025

No description provided.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@lhoestq (Member) commented Jun 11, 2025

Hi! Unfortunately we can't allow atomic commits for commits with hundreds of file additions (HF would time out).

Maybe an alternative would be to retry if there was a commit in between? This could be the default behavior as well.
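
Something like this, roughly (just a sketch; `commit_with_retry` and `MAX_RETRIES` are made-up names, and I'm assuming a stale `parent_commit` rejection surfaces as an `HfHubHTTPError`):

```python
# Rough sketch only; commit_with_retry / MAX_RETRIES are made-up names.
from huggingface_hub import HfApi
from huggingface_hub.utils import HfHubHTTPError

api = HfApi()
MAX_RETRIES = 3  # made-up bound

def commit_with_retry(repo_id, operations, message):
    for attempt in range(MAX_RETRIES):
        # Pin the commit to the head we just observed; the Hub rejects it
        # if another writer moved the branch in the meantime.
        parent = api.dataset_info(repo_id).sha
        try:
            return api.create_commit(
                repo_id=repo_id,
                repo_type="dataset",
                operations=operations,
                commit_message=message,
                parent_commit=parent,
            )
        except HfHubHTTPError:
            if attempt == MAX_RETRIES - 1:
                raise  # give up after a few concurrent-commit rejections
```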

@sharvil (Author) commented Jun 11, 2025

Thanks for taking a look – much appreciated!

I've verified that commits with up to 20,000 files don't time out, and commit time scales linearly with the number of operations enqueued. The 20k-file commit completed successfully in just under 2 minutes.

The fundamental issue I'm trying to tackle here is dataset corruption: getting into a state where a dataset on the Hub cannot be used when downloaded. Non-atomic commits won't get us there, I think. If, for example, 3 of 5 commits complete and the machine/process calling push_to_hub has a network, hardware, or other failure that prevents it from completing the rest of the commits (even with retries), we'll now have some pointer files pointing to the new data and others pointing to the old data => corrupted. While this may seem like an unlikely scenario, it's a regular occurrence at scale.

If you still feel strongly that atomic commits are not the right way to go, I can either set it to not be the default or remove it entirely from this PR.

As for retries, it's a good idea. In a non-atomic world, the logic gets more complicated (rough sketch after the list):

  • keep an explicit queue of pending add/delete operations
  • chunkwise pop from queue and commit with parent_commit set to previous chunked commit hash
  • if create_commit fails:
    • re-fetch README and set parent_commit to latest hash for revision
    • re-generate dataset card content
    • swap old CommitOperationAdd with new one for README in the pending queue
  • resume chunkwise committing from the queue as above
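
Sketched out (illustrative only; `push_in_chunks`, `make_readme_op`, and `CHUNK` are made-up names, and I'm assuming a stale `parent_commit` rejection surfaces as an `HfHubHTTPError`):

```python
# Illustrative sketch only; none of these names exist in the PR.
from collections import deque
from huggingface_hub import HfApi
from huggingface_hub.utils import HfHubHTTPError

api = HfApi()
CHUNK = 100  # made-up chunk size

def push_in_chunks(repo_id, operations, make_readme_op, message):
    # make_readme_op() would regenerate the dataset card and return a fresh
    # CommitOperationAdd for README.md (hypothetical callback).
    queue = deque(operations)  # explicit queue of pending add/delete ops
    parent = api.dataset_info(repo_id).sha
    while queue:
        chunk = [queue.popleft() for _ in range(min(CHUNK, len(queue)))]
        try:
            info = api.create_commit(
                repo_id=repo_id,
                repo_type="dataset",
                operations=chunk,
                commit_message=message,
                parent_commit=parent,  # chain each chunk onto the previous one
            )
            parent = info.oid  # next chunk's parent is this commit's hash
        except HfHubHTTPError:
            # A commit landed in between: put the chunk back, regenerate the
            # README, and resume from the new head. (A real implementation
            # would bound retries and only handle the stale-parent error.)
            queue.extendleft(reversed(chunk))
            queue = deque(
                make_readme_op() if op.path_in_repo == "README.md" else op
                for op in queue
            )
            parent = api.dataset_info(repo_id).sha
```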

Entirely doable, but more involved than I signed up for with this PR.

@sharvil (Author) commented Jun 11, 2025

Just to clarify – setting the parent_commit can be separated from making the commit atomic (which is what I'm suggesting: either make atomic commits non-default or remove them from this PR). It's crucial to set the parent commit to avoid the read-modify-write race condition on the README schema.
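
Concretely, the read-modify-write loop for the README alone would look something like this (illustrative sketch; assumes the stale-parent rejection surfaces as an `HfHubHTTPError`):

```python
# Sketch of pinning parent_commit for the README update alone; illustrative.
from huggingface_hub import CommitOperationAdd, DatasetCard, HfApi
from huggingface_hub.utils import HfHubHTTPError

api = HfApi()

def update_readme(repo_id):
    while True:  # a real implementation would bound retries
        head = api.dataset_info(repo_id).sha   # read the current head
        card = DatasetCard.load(repo_id)       # modify (schema edit elided)
        op = CommitOperationAdd("README.md", str(card).encode())
        try:
            # Write, pinned to the head we read; fails if someone committed
            # in between, in which case we re-read and regenerate.
            return api.create_commit(
                repo_id=repo_id,
                repo_type="dataset",
                operations=[op],
                commit_message="Update README.md",
                parent_commit=head,
            )
        except HfHubHTTPError:
            continue
```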
