fix: make `fileDownloadInfo` use the repo `etag` and not the CDN one #1024

axel7083 · 2024-11-13T20:41:08Z

Full explanation on #1023

I opened the pull request, because I had some time to look at it, but I would understand that the fileDownloadInfo do not have the same result expectation than get_hf_file_metadata. Maybe we are expecting to have the CDN etag when using fileDownloadInfo, but then should a dedicated method getFileMetadata be created ?

Related issues

Fixes #1023

Testing

unit tests has been added

Manually

You can try to use the fileDownloadInfo for a specific file (LFS), and compare it with the blobs created by the python cli tool.

coyotte508 · 2024-11-13T20:53:43Z

packages/hub/src/lib/file-download-info.ts

+		// prevent automatic redirect
+		redirect: 'manual',


This doesn't work in the browser

Hummm, yes I see, after reading a bit the fetch spec, you are right;

https://fetch.spec.whatwg.org/#concept-request-redirect-mode

https://developer.mozilla.org/en-US/docs/Web/API/Response/type

I tried the previous version url, the etag is never included in the headers. As there is a redirect the headers are empty.

Meaning, that today the function is already not working in the browser ?

ah a CDN misconfiguration probably

I tried the following function from the devtools console of https://huggingface.co/google-bert/bert-base-uncased

See the fetch function

(await fetch("https://huggingface.co/bert-base-uncased/resolve/dd4bc8b21efa05ec961e3efc4ee5e3832a3679c7/tf_model.h5", { method: 'GET', headers: {'Range': 'byte=0-0'} }));

I confirm I am getting the right value in Node

yes our CDN CORS headers are probably misconfigured, missing Access-Control-Expose-Headers, we'll fix it.

(Did you see my other comment?)

coyotte508

There's an endpoint /api/:repoType(models|spaces|datasets)/:namespace?/:repo/paths-info/:rev called with POST

With body {paths: string[], expand: boolean} (expand adds security status & last commit info)

I think it would be more helpful here. Not sure if we want to keep the same fileDownloadInfo function or create a new one. A new one, probably, since the etag has a precise meaning, and in a browser context there's no way to make etag works other than the way it currently does, I think.

Hub-side, we could maybe add download URLs to the response (maybe change expand: boolean to a list of strings, cc @Wauplin @hanouticelina - with current behavior of expand kepts for backward compat) - do you think it would be helpful/important?

axel7083 · 2024-11-14T09:39:12Z

There's an endpoint /api/:repoType(models|spaces|datasets)/:namespace?/:repo/paths-info/:rev called with POST

Since the HEAD request is providing all the information needed (etag, commitHash, size) , and as you mentioned, currently it does not work in the browser because the CDN is probably missconfigured¹, since here I am trying to get the same result as the get_hf_file_metadata, to me it would make sense to use the same HEAD request 🤔

Not sure if we want to keep the same fileDownloadInfo function or create a new one.

I am open to create a new one, if this is not a problem to have two different method with some similar behaviour.

Hub-side, we could maybe add download URLs to the response

POST request a bigger than a HEAD, just in term of response size, I have a preference for getting the info from the headers.

https://github.com/huggingface/huggingface.js/pull/1024#discussion_r1841164931 ↩

Wauplin · 2024-11-14T11:02:26Z

Hub-side, we could maybe add download URLs to the response (maybe change expand: boolean to a list of strings, cc @Wauplin @hanouticelina - with current behavior of expand kepts for backward compat) - do you think it would be helpful/important?

What would you see as the list of strings? A list of fields like lastCommit, "securityFileStatus", "downloadUrl", etc. and return only the required fields?
I don't think if it'll be super-important but if it can save some server-compute why not. If we go this path, it would be good to have the same behavior for the expand parameter in /api/:repoType(models|spaces|datasets)/:namespace?/:repo/paths-info/:rev as well for consistency. When expand is passed with a list of optional fields, it would be nice if the default fields are also returned (path, size, blob_id, etc.).

coyotte508 · 2024-11-14T12:09:14Z

Since the HEAD request is providing all the information needed (etag, commitHash, size) , and as you mentioned, currently it does not work in the browser because the CDN is probably missconfigured1, since here I am trying to get the same result as the get_hf_file_metadata, to me it would make sense to use the same HEAD request 🤔

There's another edge case. When you do a HEAD request, depending on where you make the request (eg inside a space on huggingface), you get a S3 url instead of a Cloudfront one.

And AWS S3 has a very annoying bug: a presigned URL for HEAD request is not valid for a GET call (and vice-versa). So if in the backend my node.js process gets the HEAD url, and passes it to the frontend, and the frontend tries to download that file, it will error.

(There's also the problem when making calls from the frontend, for some of our customers which have a direct link to HF, they would get a S3 url not a cloudfront one when making calls to our backend from their browser)

With python it's solved by stopping the redirects before getting to the S3-presigned URL - by using a intermediate URL. It could be possible with NodeJS but it would mean redirect: manual and different browser/nodejs behavior here as well.

We've had a lot of headaches with this 😅

Hence why the additional endpoint - good thing about the additional endpoint, you could query several files at once, instead of making a request for each one.

What would you see as the list of strings? A list of fields like lastCommit, "securityFileStatus", "downloadUrl", etc. and return only the required fields?

Yes those 3 would be optional, and the other fields would always be provided. The problem with lastCommit is it can fail if requesting many files in a repo with a lot of commits, with securityFileStatus it makes an extra API calll, and downloadUrl uses some CPU resources too (not that much, but not been added/needed so far).

But for a first version we could use the endpoint as is, and add the download url if there's a need in the future.

axel7083 · 2024-11-14T13:14:19Z

With python it's solved by stopping the redirects before getting to the S3-presigned URL - by using a intermediate URL. It could be possible with NodeJS but it would mean redirect: manual and different browser/nodejs behavior here as well.

@coyotte508 is this the behaviour that you are mentioning only following redirect for same origin (relative redirect?)

I was looking at the _get_metadata_or_catch_error see file_download.py#L1323

I tried to do the same with the following function, does it solve the problem for the s3, or is it something different ?

huggingface.js/packages/hub/src/lib/file-download-info.ts

Line 27 in ccedf63

async function followSameOriginRedirect(params: {

I was searching for comments or mention of s3 in the python hub library, but not much luck to understand the problem you are facing.. :(

coyotte508 · 2024-11-14T15:10:16Z

@axel7083 do you need the etag or the download url as well too? If only the etag, the paths-info endpoint is already suited.

The JS lib in browser faces issues that the python lib doesn't, due to CORS restrictions, trying to make it behave the same won't work, we tried (and that's why we fell back to a GET request with 0-0 range).

axel7083 · 2024-11-14T15:37:06Z

The JS lib in browser faces issues that the python lib doesn't, due to CORS restrictions, trying to make it behave the same won't work, we tried (and that's why we fell back to a GET request with 0-0 range).

After testing it, I can confirm that this is returning the information that I need. The download link is not a problem, as the downloadFile do the redirect for us.

axel7083 · 2024-11-14T15:38:01Z

Moving to draft while I do more work/experiment on it. Thanks you very much for all the details. Really appreciate the feedback and reactivity 👍

axel7083 · 2024-11-15T10:57:50Z

Closing in favour of #1031

@coyotte508

## Description Following discussion #1024 and incompatibility of using the `HEAD` request to get the same etag as the python library is using for populating the cache directory. This PR add the `pathsInfo` function that return the paths information including the LFS oid (or etag) if the file is a LFS pointer. As suggested by @coyotte508 in #1024 (review) ## Related issues Fixes #1023 (provide an alternative method to `fileDownloadInfo`. ## Tests - [x] unit tests has been added

Wauplin · 2024-11-15T13:32:42Z

@coyotte508 I just thought about a use case for expand: string[]. In datasets, they use /tree to list files with their last commit. They do that we expand=True but don't use the security info at all.

fix: make file download info compliant with python tool

ccedf63

axel7083 requested a review from coyotte508 as a code owner November 13, 2024 20:41

axel7083 changed the title ~~fix: make fileDownloadInfo use the right etag~~ fix: make fileDownloadInfo use the repo etag and not the CDN one Nov 13, 2024

coyotte508 reviewed Nov 13, 2024

View reviewed changes

axel7083 marked this pull request as draft November 14, 2024 15:37

axel7083 mentioned this pull request Nov 15, 2024

feat(hub): adding pathsInfo function #1031

Merged

1 task

axel7083 closed this Nov 15, 2024

fix: make fileDownloadInfo use the repo etag and not the CDN one #1024

fix: make fileDownloadInfo use the repo etag and not the CDN one #1024

Uh oh!

Conversation

axel7083 commented Nov 13, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Related issues

Testing

Uh oh!

coyotte508 Nov 13, 2024

Choose a reason for hiding this comment

Uh oh!

axel7083 Nov 13, 2024

Choose a reason for hiding this comment

Uh oh!

axel7083 Nov 13, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

coyotte508 Nov 13, 2024

Choose a reason for hiding this comment

Uh oh!

axel7083 Nov 13, 2024

Choose a reason for hiding this comment

Uh oh!

coyotte508 Nov 13, 2024

Choose a reason for hiding this comment

Uh oh!

coyotte508 left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

axel7083 commented Nov 14, 2024

Footnotes

Uh oh!

Wauplin commented Nov 14, 2024

Uh oh!

coyotte508 commented Nov 14, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

axel7083 commented Nov 14, 2024

Uh oh!

coyotte508 commented Nov 14, 2024

Uh oh!

axel7083 commented Nov 14, 2024

Uh oh!

axel7083 commented Nov 14, 2024

Uh oh!

axel7083 commented Nov 15, 2024

Uh oh!

Wauplin commented Nov 15, 2024

Uh oh!

Uh oh!

fix: make `fileDownloadInfo` use the repo `etag` and not the CDN one #1024

fix: make `fileDownloadInfo` use the repo `etag` and not the CDN one #1024

axel7083 commented Nov 13, 2024 •

edited

Loading

axel7083 Nov 13, 2024 •

edited

Loading

coyotte508 left a comment •

edited

Loading

coyotte508 commented Nov 14, 2024 •

edited

Loading