feat(hub): adding snapshot download method #1038

axel7083 · 2024-11-18T15:07:20Z

Description

We can now create a snapshotDownload method, similar to the snapshot_download of the Python library¹, which clones a repository (model, space, or dataset) to the cache (only the cache is supported for now).

Related issues/PR

With the amazing help of @coyotte508 we were able to merge the following changes

These allow this PR to provide a Python-compliant clone of a Hugging Face repository in the cache directory.

Testing

Unit tests cover the new feature.

Manually

await snapshotDownload({
	repo: {
		name: 'OuteAI/OuteTTS-0.1-350M',
		type: 'model',
	},
});

Then verify the result with the huggingface-cli tool (Python):

$: huggingface-cli scan-cache
REPO ID                             REPO TYPE SIZE ON DISK NB FILES LAST_ACCESSED     LAST_MODIFIED     REFS LOCAL PATH                                                                         
----------------------------------- --------- ------------ -------- ----------------- ----------------- ---- ---------------------------------------------------------------------------------- 
OuteAI/OuteTTS-0.1-350M             model           731.6M       14 5 minutes ago     5 minutes ago     main /home/axel7083/.cache/huggingface/hub/models--OuteAI--OuteTTS-0.1-350M
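
For readers wondering where the LOCAL PATH above comes from: the cache folder name is the pluralized repo type followed by the segments of the repo name, joined with `--`. A minimal sketch of that mapping, assuming a hypothetical helper called `repoFolderName` (this is an illustration, not necessarily the function this PR uses):

```typescript
// Hypothetical helper, for illustration only: derives the cache folder name
// shown in the LOCAL PATH column of `huggingface-cli scan-cache`.
type RepoType = "model" | "dataset" | "space";

function repoFolderName(type: RepoType, name: string): string {
	// "model" + "OuteAI/OuteTTS-0.1-350M" -> "models--OuteAI--OuteTTS-0.1-350M"
	return [`${type}s`, ...name.split("/")].join("--");
}
```

Keeping the same naming as the Python library is what lets huggingface-cli list a repository that was downloaded from JS.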

¹ https://huggingface.co/docs/huggingface_hub/en/guides/download#download-an-entire-repository

coyotte508

Nice! Thanks for the full feature

packages/hub/src/lib/snapshot-download.ts

coyotte508

And can you add a small section in packages/hub/README.md regarding cache management? 🙏

axel7083 · 2024-11-18T20:03:32Z

And can you add a small section in packages/hub/README.md regarding cache management? 🙏

Yes np!

packages/hub/README.md

HuggingFaceDocBuilderDev · 2024-11-19T07:55:33Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Wauplin

Thanks @axel7083 ! I'm really impressed by the work done in the last PRs + this one!

I just reviewed them and logic is good. I'm a bit worried though about the lack of locks when downloading an individual file. In huggingface_hub, we use filelock which guarantees that a single file is never downloaded twice concurrently. In the js implementation, since the (...).incomplete filename is deterministic, it can happen that 2 processes are downloading bytes to the same file concurrently (and in this case I expect one process to crash?). Is there an equivalent that could be used in JS to avoid concurrency issues?

axel7083 · 2024-11-19T16:56:54Z

I just reviewed them and logic is good. I'm a bit worried though about the lack of locks when downloading an individual file. In huggingface_hub, we use filelock which guarantees that a single file is never downloaded twice concurrently. In the js implementation, since the (...).incomplete filename is deterministic, it can happen that 2 processes are downloading bytes to the same file concurrently (and in this case I expect one process to crash?). Is there an equivalent that could be used in JS to avoid concurrency issues?

This is actually a very complicated problem; I've just spent an hour investigating.

The Python library filelock uses a fairly complex mechanism to lock the file so that other processes cannot delete it.

The problem is that the Python library applies a real lock: on Windows it uses msvcrt, which provides advanced file operations, and on Unix systems it uses fcntl.

They use platform-dependent mechanisms to provide a real lock.

JS packages

There are currently two well-known packages:

https://github.com/npm/lockfile seems to only create the file, and check for existence.
https://www.npmjs.com/package/proper-lockfile uses the mkdir method to lock a file.
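
To make the mkdir method concrete: directory creation is atomic, so only one process can create a given directory and the others get EEXIST. A rough sketch using only Node built-ins, with hypothetical names `tryLock` and `unlock` and an assumed `<target>.lock` directory name. This is not the package's actual code, and it ignores stale locks left behind by crashed processes:

```typescript
import { mkdir, rmdir } from "node:fs/promises";

// Try to take the lock by creating `${target}.lock` as a directory.
// Returns true if we created it, false if someone else already holds it.
async function tryLock(target: string): Promise<boolean> {
	try {
		await mkdir(`${target}.lock`);
		return true;
	} catch (err) {
		if ((err as NodeJS.ErrnoException).code === "EEXIST") {
			return false;
		}
		throw err; // any other error (permissions, missing parent) is a real failure
	}
}

// Release the lock by removing the (empty) lock directory.
async function unlock(target: string): Promise<void> {
	await rmdir(`${target}.lock`);
}
```

Note that this only works between processes that agree on the convention; it does nothing against a process that ignores the lock directory.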

coyotte508 · 2024-11-19T17:03:51Z

If we open a file in write mode, no two processes can open it at the same time, so it suffices as a lock?

In any case, let's not add additional deps

axel7083 · 2024-11-19T17:10:10Z

If we open a file in write mode, no two processes can open it at the same time, so it suffices as a lock?

No, sadly this is not the case in Node. You can check it by running the following code and deleting the file through Windows Explorer while the handle is still open.

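import { open } from 'node:fs/promises';

// Open the file for writing; `path` is any file path. This does not take an OS-level lock.
const file = await open(path, 'w');

// Keep the handle open for 60 seconds, then close it.
setTimeout(async () => {
    await file.close();
}, 60_000);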

Wauplin · 2024-11-20T08:51:09Z

no two processes can open it at the same time

What happens if we try to do so? Is an error triggered? If that's the case, we could catch the error, wait while the file is being downloaded by the other process, and then continue without downloading once the first process has completed.

I'm raising the question mainly because concurrency issues caused quite some trouble at some point on the Python side. When using multiple GPUs with 1 GPU == 1 process, each process tries to download the weight file at the same time - hence causing the concurrency issue. I guess this is less likely to happen in JS but still want to raise the question.
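
The "wait, then continue without downloading" idea could look roughly like the sketch below. It assumes the downloading process moves the file to its final path only once the download is complete; `waitForFile` and its default timings are hypothetical, not something that exists in the package:

```typescript
import { stat } from "node:fs/promises";
import { setTimeout as sleep } from "node:timers/promises";

// Polls until `finalPath` exists, i.e. the other process has finished and
// moved its download into place. Returns false if the timeout elapses first.
async function waitForFile(finalPath: string, timeoutMs = 60_000, intervalMs = 200): Promise<boolean> {
	const deadline = Date.now() + timeoutMs;
	while (Date.now() < deadline) {
		try {
			await stat(finalPath);
			return true; // the file is there, no need to download it again
		} catch (err) {
			if ((err as NodeJS.ErrnoException).code !== "ENOENT") {
				throw err;
			}
		}
		await sleep(intervalMs);
	}
	return false; // caller decides whether to retry or download itself
}
```

Polling is the simplest option here; the timeout matters because the other process may crash and never produce the file.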

axel7083 · 2024-11-20T09:05:25Z

What happens if we try to do so? An error is triggered?

No error is triggered, sadly. We may check for file existence, but the Python implementation is complex and changes the file descriptor using C code to interact with OS-level structures.

We could detect from JS whether the file is locked, so JS would avoid downloading the file if Python is already doing it, but we cannot prevent Python from downloading it while JS is doing so :/

coyotte508 · 2024-11-20T09:38:41Z

Note that we can open a file in Node.JS with mode "wx", so it fails if the file already exists.
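
A small sketch of what the "wx" flag gives us: the open call creates the file exclusively and rejects with EEXIST when it already exists, so it can act as a lock between cooperating Node processes. `openExclusive` is a hypothetical name, not a function from this PR:

```typescript
import { open } from "node:fs/promises";
import type { FileHandle } from "node:fs/promises";

// Returns a handle if we created the file, or null if it already existed
// (for example because another process is already downloading it).
async function openExclusive(path: string): Promise<FileHandle | null> {
	try {
		return await open(path, "wx");
	} catch (err) {
		if ((err as NodeJS.ErrnoException).code === "EEXIST") {
			return null;
		}
		throw err;
	}
}
```

Unlike a stat followed by an open, the existence check and the creation happen in a single call, so two processes cannot both believe they created the file.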

axel7083 · 2024-11-20T09:50:47Z

Note that we can open a file in Node.JS with mode "wx", so it fails if the file already exists.

Yes, but we can also stat to check whether it exists; that's not the issue. The problem is telling Python that we hold a lock on the file, which we cannot :(

coyotte508 · 2024-11-20T10:00:53Z

well @Wauplin is the maintainer of the python lib :)

Wauplin · 2024-11-20T10:10:32Z

Ah, I wasn't thinking of a lock shared between JS and Python (that seems much less likely to happen). I think handling the concurrency issue between JS processes is already good enough

axel7083 · 2024-11-20T10:12:46Z

Ah, I wasn't thinking of a lock shared between JS and Python (that seems much less likely to happen). I think handling the concurrency issue between JS processes is already good enough

Oh yeah, if that's the only requirement, we can definitely use https://github.com/npm/lockfile

coyotte508 · 2024-11-20T10:19:18Z

We can copy the mechanism internally (or just use wx write mode when downloading files) - but let's not add dependencies

axel7083 added 2 commits November 18, 2024 15:50

feat(hub): adding snapshot download method

6613966

fix: params propagation

b63f16f

axel7083 requested a review from coyotte508 as a code owner November 18, 2024 15:07

coyotte508 approved these changes Nov 18, 2024

packages/hub/src/lib/snapshot-download.ts

docs: adding code documentation to snapshotDownload

b452eb2

coyotte508 reviewed Nov 18, 2024

packages/hub/src/lib/snapshot-download.ts (outdated)

Update packages/hub/src/lib/snapshot-download.ts

df29083

coyotte508 reviewed Nov 18, 2024

axel7083 added 3 commits November 18, 2024 20:51

fix: tests

5aec3d5

docs: adding cache related function to readme

92f453b

docs: typo

58389eb

axel7083 requested a review from coyotte508 November 18, 2024 20:03

coyotte508 reviewed Nov 19, 2024

packages/hub/README.md (outdated)

Update packages/hub/README.md

ab7f732

coyotte508 merged commit 0bcfcd7 into huggingface:main Nov 19, 2024
5 checks passed

Wauplin reviewed Nov 19, 2024
