Skip to content

fix: make fileDownloadInfo use the repo etag and not the CDN one #1024

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

axel7083
Copy link
Contributor

@axel7083 axel7083 commented Nov 13, 2024

Full explanation on #1023

I opened the pull request, because I had some time to look at it, but I would understand that the fileDownloadInfo do not have the same result expectation than get_hf_file_metadata. Maybe we are expecting to have the CDN etag when using fileDownloadInfo, but then should a dedicated method getFileMetadata be created ?

Related issues

Fixes #1023

Testing

  • unit tests has been added

Manually

You can try to use the fileDownloadInfo for a specific file (LFS), and compare it with the blobs created by the python cli tool.

@axel7083 axel7083 requested a review from coyotte508 as a code owner November 13, 2024 20:41
@axel7083 axel7083 changed the title fix: make fileDownloadInfo use the right etag fix: make fileDownloadInfo use the repo etag and not the CDN one Nov 13, 2024
Comment on lines +39 to +40
// prevent automatic redirect
redirect: 'manual',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't work in the browser

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

@axel7083 axel7083 Nov 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried the previous version url, the etag is never included in the headers. As there is a redirect the headers are empty.

image

Meaning, that today the function is already not working in the browser ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah a CDN misconfiguration probably

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried the following function from the devtools console of https://huggingface.co/google-bert/bert-base-uncased

image

See the fetch function

(await fetch("https://huggingface.co/bert-base-uncased/resolve/dd4bc8b21efa05ec961e3efc4ee5e3832a3679c7/tf_model.h5", {
    method: 'GET',
    headers: {'Range': 'byte=0-0'}
}));

I confirm I am getting the right value in Node

image

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes our CDN CORS headers are probably misconfigured, missing Access-Control-Expose-Headers, we'll fix it.

(Did you see my other comment?)

Copy link
Member

@coyotte508 coyotte508 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's an endpoint /api/:repoType(models|spaces|datasets)/:namespace?/:repo/paths-info/:rev called with POST

With body {paths: string[], expand: boolean} (expand adds security status & last commit info)

I think it would be more helpful here. Not sure if we want to keep the same fileDownloadInfo function or create a new one. A new one, probably, since the etag has a precise meaning, and in a browser context there's no way to make etag works other than the way it currently does, I think.

Hub-side, we could maybe add download URLs to the response (maybe change expand: boolean to a list of strings, cc @Wauplin @hanouticelina - with current behavior of expand kepts for backward compat) - do you think it would be helpful/important?

@axel7083
Copy link
Contributor Author

There's an endpoint /api/:repoType(models|spaces|datasets)/:namespace?/:repo/paths-info/:rev called with POST

Since the HEAD request is providing all the information needed (etag, commitHash, size) , and as you mentioned, currently it does not work in the browser because the CDN is probably missconfigured1, since here I am trying to get the same result as the get_hf_file_metadata, to me it would make sense to use the same HEAD request 🤔

Not sure if we want to keep the same fileDownloadInfo function or create a new one.

I am open to create a new one, if this is not a problem to have two different method with some similar behaviour.

Hub-side, we could maybe add download URLs to the response

POST request a bigger than a HEAD, just in term of response size, I have a preference for getting the info from the headers.

Footnotes

  1. https://github.com/huggingface/huggingface.js/pull/1024#discussion_r1841164931

@Wauplin
Copy link
Contributor

Wauplin commented Nov 14, 2024

Hub-side, we could maybe add download URLs to the response (maybe change expand: boolean to a list of strings, cc @Wauplin @hanouticelina - with current behavior of expand kepts for backward compat) - do you think it would be helpful/important?

What would you see as the list of strings? A list of fields like lastCommit, "securityFileStatus", "downloadUrl", etc. and return only the required fields?
I don't think if it'll be super-important but if it can save some server-compute why not. If we go this path, it would be good to have the same behavior for the expand parameter in /api/:repoType(models|spaces|datasets)/:namespace?/:repo/paths-info/:rev as well for consistency. When expand is passed with a list of optional fields, it would be nice if the default fields are also returned (path, size, blob_id, etc.).

@coyotte508
Copy link
Member

coyotte508 commented Nov 14, 2024

Since the HEAD request is providing all the information needed (etag, commitHash, size) , and as you mentioned, currently it does not work in the browser because the CDN is probably missconfigured1, since here I am trying to get the same result as the get_hf_file_metadata, to me it would make sense to use the same HEAD request 🤔

There's another edge case. When you do a HEAD request, depending on where you make the request (eg inside a space on huggingface), you get a S3 url instead of a Cloudfront one.

And AWS S3 has a very annoying bug: a presigned URL for HEAD request is not valid for a GET call (and vice-versa). So if in the backend my node.js process gets the HEAD url, and passes it to the frontend, and the frontend tries to download that file, it will error.

(There's also the problem when making calls from the frontend, for some of our customers which have a direct link to HF, they would get a S3 url not a cloudfront one when making calls to our backend from their browser)

With python it's solved by stopping the redirects before getting to the S3-presigned URL - by using a intermediate URL. It could be possible with NodeJS but it would mean redirect: manual and different browser/nodejs behavior here as well.

We've had a lot of headaches with this 😅

Hence why the additional endpoint - good thing about the additional endpoint, you could query several files at once, instead of making a request for each one.

What would you see as the list of strings? A list of fields like lastCommit, "securityFileStatus", "downloadUrl", etc. and return only the required fields?

Yes those 3 would be optional, and the other fields would always be provided. The problem with lastCommit is it can fail if requesting many files in a repo with a lot of commits, with securityFileStatus it makes an extra API calll, and downloadUrl uses some CPU resources too (not that much, but not been added/needed so far).

But for a first version we could use the endpoint as is, and add the download url if there's a need in the future.

@axel7083
Copy link
Contributor Author

With python it's solved by stopping the redirects before getting to the S3-presigned URL - by using a intermediate URL. It could be possible with NodeJS but it would mean redirect: manual and different browser/nodejs behavior here as well.

@coyotte508 is this the behaviour that you are mentioning only following redirect for same origin (relative redirect?)

I was looking at the _get_metadata_or_catch_error see file_download.py#L1323

I tried to do the same with the following function, does it solve the problem for the s3, or is it something different ?

async function followSameOriginRedirect(params: {

I was searching for comments or mention of s3 in the python hub library, but not much luck to understand the problem you are facing.. :(

@coyotte508
Copy link
Member

@axel7083 do you need the etag or the download url as well too? If only the etag, the paths-info endpoint is already suited.

The JS lib in browser faces issues that the python lib doesn't, due to CORS restrictions, trying to make it behave the same won't work, we tried (and that's why we fell back to a GET request with 0-0 range).

@axel7083
Copy link
Contributor Author

The JS lib in browser faces issues that the python lib doesn't, due to CORS restrictions, trying to make it behave the same won't work, we tried (and that's why we fell back to a GET request with 0-0 range).

After testing it, I can confirm that this is returning the information that I need. The download link is not a problem, as the downloadFile do the redirect for us.

@axel7083 axel7083 marked this pull request as draft November 14, 2024 15:37
@axel7083
Copy link
Contributor Author

Moving to draft while I do more work/experiment on it. Thanks you very much for all the details. Really appreciate the feedback and reactivity 👍

@axel7083
Copy link
Contributor Author

Closing in favour of #1031

@axel7083 axel7083 closed this Nov 15, 2024
coyotte508 pushed a commit that referenced this pull request Nov 15, 2024
## Description

Following discussion
#1024 and
incompatibility of using the `HEAD` request to get the same etag as the
python library is using for populating the cache directory.

This PR add the `pathsInfo` function that return the paths information
including the LFS oid (or etag) if the file is a LFS pointer.

As suggested by @coyotte508 in
#1024 (review)

## Related issues

Fixes #1023 (provide
an alternative method to `fileDownloadInfo`.

## Tests

- [x] unit tests has been added
@Wauplin
Copy link
Contributor

Wauplin commented Nov 15, 2024

@coyotte508 I just thought about a use case for expand: string[]. In datasets, they use /tree to list files with their last commit. They do that we expand=True but don't use the security info at all.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

bug(@huggingface/hub): fileDownloadInfo return an etag for LFS file which seems weird
3 participants