-
Notifications
You must be signed in to change notification settings - Fork 430
fix: make fileDownloadInfo
use the repo etag
and not the CDN one
#1024
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: make fileDownloadInfo
use the repo etag
and not the CDN one
#1024
Conversation
fileDownloadInfo
use the right etag
fileDownloadInfo
use the repo etag
and not the CDN one
// prevent automatic redirect | ||
redirect: 'manual', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't work in the browser
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hummm, yes I see, after reading a bit the fetch spec, you are right;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah a CDN misconfiguration probably
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried the following function from the devtools console of https://huggingface.co/google-bert/bert-base-uncased
See the fetch function
(await fetch("https://huggingface.co/bert-base-uncased/resolve/dd4bc8b21efa05ec961e3efc4ee5e3832a3679c7/tf_model.h5", {
method: 'GET',
headers: {'Range': 'byte=0-0'}
}));
I confirm I am getting the right value in Node
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes our CDN CORS headers are probably misconfigured, missing Access-Control-Expose-Headers
, we'll fix it.
(Did you see my other comment?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's an endpoint /api/:repoType(models|spaces|datasets)/:namespace?/:repo/paths-info/:rev
called with POST
With body {paths: string[], expand: boolean}
(expand adds security status & last commit info)
I think it would be more helpful here. Not sure if we want to keep the same fileDownloadInfo
function or create a new one. A new one, probably, since the etag has a precise meaning, and in a browser context there's no way to make etag works other than the way it currently does, I think.
Hub-side, we could maybe add download URLs to the response (maybe change expand: boolean
to a list of strings, cc @Wauplin @hanouticelina - with current behavior of expand
kepts for backward compat) - do you think it would be helpful/important?
Since the HEAD request is providing all the information needed (
I am open to create a new one, if this is not a problem to have two different method with some similar behaviour.
POST request a bigger than a HEAD, just in term of response size, I have a preference for getting the info from the headers. Footnotes |
What would you see as the list of strings? A list of fields like |
There's another edge case. When you do a HEAD request, depending on where you make the request (eg inside a space on huggingface), you get a S3 url instead of a Cloudfront one. And AWS S3 has a very annoying bug: a presigned URL for HEAD request is not valid for a GET call (and vice-versa). So if in the backend my node.js process gets the HEAD url, and passes it to the frontend, and the frontend tries to download that file, it will error. (There's also the problem when making calls from the frontend, for some of our customers which have a direct link to HF, they would get a S3 url not a cloudfront one when making calls to our backend from their browser) With python it's solved by stopping the redirects before getting to the S3-presigned URL - by using a intermediate URL. It could be possible with NodeJS but it would mean We've had a lot of headaches with this 😅 Hence why the additional endpoint - good thing about the additional endpoint, you could query several files at once, instead of making a request for each one.
Yes those 3 would be optional, and the other fields would always be provided. The problem with But for a first version we could use the endpoint as is, and add the download url if there's a need in the future. |
@coyotte508 is this the behaviour that you are mentioning only following redirect for same origin (relative redirect?) I was looking at the I tried to do the same with the following function, does it solve the problem for the s3, or is it something different ?
I was searching for comments or mention of s3 in the python hub library, but not much luck to understand the problem you are facing.. :( |
@axel7083 do you need the etag or the download url as well too? If only the etag, the The JS lib in browser faces issues that the python lib doesn't, due to CORS restrictions, trying to make it behave the same won't work, we tried (and that's why we fell back to a GET request with 0-0 range). |
After testing it, I can confirm that this is returning the information that I need. The download link is not a problem, as the |
Moving to draft while I do more work/experiment on it. Thanks you very much for all the details. Really appreciate the feedback and reactivity 👍 |
Closing in favour of #1031 |
## Description Following discussion #1024 and incompatibility of using the `HEAD` request to get the same etag as the python library is using for populating the cache directory. This PR add the `pathsInfo` function that return the paths information including the LFS oid (or etag) if the file is a LFS pointer. As suggested by @coyotte508 in #1024 (review) ## Related issues Fixes #1023 (provide an alternative method to `fileDownloadInfo`. ## Tests - [x] unit tests has been added
@coyotte508 I just thought about a use case for |
Full explanation on #1023
Related issues
Fixes #1023
Testing
Manually
You can try to use the
fileDownloadInfo
for a specific file (LFS), and compare it with the blobs created by the python cli tool.