-
Notifications
You must be signed in to change notification settings - Fork 430
feat(hub): adding snapshot download method #1038
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(hub): adding snapshot download method #1038
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! Thanks for the full feature
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And can you add a small section in packages/hub/README.md
regarding cache management? 🙏
Yes np! |
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @axel7083 ! I'm really impressed by the work done in the last PRs + this one!
I just reviewed them and logic is good. I'm a bit worried though about the lack of locks when downloading an individual file. In huggingface_hub
, we use filelock
which guarantees that a single file is never downloaded twice concurrently. In the js implementation, since the (...).incomplete
filename is deterministic, it can happen that 2 processes are downloading bytes to the same file concurrently (and in this case I expect one process to crash?). Is there an equivalent that could be used in JS to avoid concurrency issues?
This is a very complicated problem actually, I've just spent an hour investigating. The python library The problem is that the python library apply a real lock on windows it uses msvcrt which provide advanced File Operations and on unix system fcntl There are using platform dependent mechanism to provide a real lock. JS packagesThere is currently two well known packages:
|
If we open a file in write mode, no two processes can open it at the same time, so it suffices as a lock? In any case, let's not add additional deps |
No this is not the case in Node sadly, you can check it by deleting the file through the windows explorer when running the following code. const file = await open(path, 'w');
setTimeout(async () => {
await file.close();
}, 60_000); |
What happens if we try to do so? An error is triggered? If that's the case, we could catch the error, wait while the file is been downloaded by the other process and then continue without downloaded once the first process has completed. I'm raising the question mainly because concurrency issues caused quite some trouble at some point on the Python side. When using multiple GPUs with 1 GPU == 1 process, each process tries to download the weight file at the same time - hence causing the concurrency issue. I guess this is less likely to happen in JS but still want to raise the question. |
No error are triggered sadly, we may check for file existence, but the python implementation is complex and change the file descriptor using C code to interact with os level structure. We could from the JS detect if the file is locked, so the JS would avoid downloading the file if the python is doing it, but we cannot prevent the python to do it if the JS is downloading it :/ |
Note that we can open a file in Node.JS with mode "wx", so it fails if the file already exists. |
Yes, but we can stat to check if it exists, that not the issue, the problem is telling the python that we lock the file, which we cannot :( |
well @Wauplin is the maintainer of the python lib :) |
Ah, I wasn't thinking of a lock shared between JS and Python (that seems much less likely to happen). I think handling the concurrency issue between JS processes is already good enough |
Oh yeah, if that's the only thing we can definitely use https://github.com/npm/lockfile |
We can copy the mechanism internally (or just use |
Description
We can now create a
snapshotDownload
method similator to thesnapshot_download
of the PY lib1, clone to the cache (only cache supported for now) a repository (either model, space or dataset)Related issues/PR
With the amazing help of @coyotte508 we were able to merge the following changes
downloadFileToCacheDir
#1034Which allow this PR to provide a python compliant clone of a hugging face repository to the cache directory.
Testing
Manually
assert using the
huggingface-cli
tool (python)Footnotes
https://huggingface.co/docs/huggingface_hub/en/guides/download#download-an-entire-repository ↩