Skip to content

feat(hub): adding snapshot download method #1038

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

axel7083
Copy link
Contributor

Description

We can now create a snapshotDownload method similator to the snapshot_download of the PY lib1, clone to the cache (only cache supported for now) a repository (either model, space or dataset)

Related issues/PR

With the amazing help of @coyotte508 we were able to merge the following changes

Which allow this PR to provide a python compliant clone of a hugging face repository to the cache directory.

Testing

  • unit tests are covering the new feature

Manually

await snapshotDownload({
	repo: {
		name: 'OuteAI/OuteTTS-0.1-350M',
		type: 'model',
	},
});

assert using the huggingface-cli tool (python)

$: huggingface-cli scan-cache
REPO ID                             REPO TYPE SIZE ON DISK NB FILES LAST_ACCESSED     LAST_MODIFIED     REFS LOCAL PATH                                                                         
----------------------------------- --------- ------------ -------- ----------------- ----------------- ---- ---------------------------------------------------------------------------------- 
OuteAI/OuteTTS-0.1-350M             model           731.6M       14 5 minutes ago     5 minutes ago     main /home/axel7083/.cache/huggingface/hub/models--OuteAI--OuteTTS-0.1-350M

Footnotes

  1. https://huggingface.co/docs/huggingface_hub/en/guides/download#download-an-entire-repository

@axel7083 axel7083 requested a review from coyotte508 as a code owner November 18, 2024 15:07
Copy link
Member

@coyotte508 coyotte508 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! Thanks for the full feature

Copy link
Member

@coyotte508 coyotte508 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And can you add a small section in packages/hub/README.md regarding cache management? 🙏

@axel7083
Copy link
Contributor Author

And can you add a small section in packages/hub/README.md regarding cache management? 🙏

Yes np!

@axel7083 axel7083 requested a review from coyotte508 November 18, 2024 20:03
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@coyotte508 coyotte508 merged commit 0bcfcd7 into huggingface:main Nov 19, 2024
5 checks passed
Copy link
Contributor

@Wauplin Wauplin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @axel7083 ! I'm really impressed by the work done in the last PRs + this one!

I just reviewed them and logic is good. I'm a bit worried though about the lack of locks when downloading an individual file. In huggingface_hub, we use filelock which guarantees that a single file is never downloaded twice concurrently. In the js implementation, since the (...).incomplete filename is deterministic, it can happen that 2 processes are downloading bytes to the same file concurrently (and in this case I expect one process to crash?). Is there an equivalent that could be used in JS to avoid concurrency issues?

@axel7083
Copy link
Contributor Author

I just reviewed them and logic is good. I'm a bit worried though about the lack of locks when downloading an individual file. In huggingface_hub, we use filelock which guarantees that a single file is never downloaded twice concurrently. In the js implementation, since the (...).incomplete filename is deterministic, it can happen that 2 processes are downloading bytes to the same file concurrently (and in this case I expect one process to crash?). Is there an equivalent that could be used in JS to avoid concurrency issues?

This is a very complicated problem actually, I've just spent an hour investigating.

The python library filelock use a very complex mechanism to lock the file in a way, that other processes cannot delete the file.

The problem is that the python library apply a real lock on windows it uses msvcrt which provide advanced File Operations and on unix system fcntl

There are using platform dependent mechanism to provide a real lock.

JS packages

There is currently two well known packages:

@coyotte508
Copy link
Member

coyotte508 commented Nov 19, 2024

If we open a file in write mode, no two processes can open it at the same time, so it suffices as a lock?

In any case, let's not add additional deps

@axel7083
Copy link
Contributor Author

axel7083 commented Nov 19, 2024

If we open a file in write mode, no two processes can open it at the same time, so it suffices as a lock?

No this is not the case in Node sadly, you can check it by deleting the file through the windows explorer when running the following code.

const file = await open(path, 'w');

setTimeout(async () => {
    await file.close();
}, 60_000);

@Wauplin
Copy link
Contributor

Wauplin commented Nov 20, 2024

no two processes can open it at the same time

What happens if we try to do so? An error is triggered? If that's the case, we could catch the error, wait while the file is been downloaded by the other process and then continue without downloaded once the first process has completed.

I'm raising the question mainly because concurrency issues caused quite some trouble at some point on the Python side. When using multiple GPUs with 1 GPU == 1 process, each process tries to download the weight file at the same time - hence causing the concurrency issue. I guess this is less likely to happen in JS but still want to raise the question.

@axel7083
Copy link
Contributor Author

What happens if we try to do so? An error is triggered?

No error are triggered sadly, we may check for file existence, but the python implementation is complex and change the file descriptor using C code to interact with os level structure.

We could from the JS detect if the file is locked, so the JS would avoid downloading the file if the python is doing it, but we cannot prevent the python to do it if the JS is downloading it :/

@coyotte508
Copy link
Member

Note that we can open a file in Node.JS with mode "wx", so it fails if the file already exists.

@axel7083
Copy link
Contributor Author

Note that we can open a file in Node.JS with mode "wx", so it fails if the file already exists.

Yes, but we can stat to check if it exists, that not the issue, the problem is telling the python that we lock the file, which we cannot :(

@coyotte508
Copy link
Member

well @Wauplin is the maintainer of the python lib :)

@Wauplin
Copy link
Contributor

Wauplin commented Nov 20, 2024

Ah, I wasn't thinking of a lock shared between JS and Python (that seems much less likely to happen). I think handling the concurrency issue between JS processes is already good enough

@axel7083
Copy link
Contributor Author

Ah, I wasn't thinking of a lock shared between JS and Python (that seems much less likely to happen). I think handling the concurrency issue between JS processes is already good enough

Oh yeah, if that's the only thing we can definitely use https://github.com/npm/lockfile

@coyotte508
Copy link
Member

We can copy the mechanism internally (or just use wx write mode when downloading files) - but let's not add dependencies

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants