feat🚀: Add the ability to download from HF by repo_id #65

Merged
merged 11 commits into from
Sep 16, 2023

Conversation

Yossef-Dawoad
Collaborator

The clip model object in python_bindings can now take a
model_path_or_repo_id parameter to download a model from Hugging Face by repo_id.

⚠️ breaking changes:

  • adds huggingface_hub as a dependency to support downloading models from HF by repo_id
  • the model_path parameter was renamed to model_path_or_repo_id to support downloading by repo id
  • you can pass model_file when the HF repo_id contains more than one .bin file, to specify the exact model file to download from that repo
  • if model_path_or_repo_id is a HF repo id and model_file is not specified,
    the default model file is downloaded (usually the file with the smallest name ending with .bin)
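
A minimal sketch of how a single argument could be treated as either a local path or a repo id. This is an illustration only, not the code in this PR: the helper name classify_model_source and the rule "an existing file is local, anything else is a repo id" are assumptions.

```python
from pathlib import Path


def classify_model_source(model_path_or_repo_id: str) -> str:
    """Hypothetical helper: decide how to interpret the argument.

    Assumption: if the string points at an existing file on disk it is a
    local model path; otherwise it is treated as a Hugging Face repo id.
    """
    if Path(model_path_or_repo_id).is_file():
        return "local"
    return "repo_id"
```

Under this assumption, an existing local file always takes precedence, so a repo id is only consulted when no file with that name is found on disk.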

📝 files changed:

  • python_bindings/clip_cpp/clip.py
  • python_bindings/example_main.py
  • pyproject.toml -> add huggingface_hub dependency
  • update the python_bindings/README.md
  • update the ./README.md

Yossef-Dawoad and others added 4 commits September 14, 2023 22:25
@Yossef-Dawoad
Collaborator Author

Please test it before you merge. For some reason I can't get it to build correctly on my WSL; I keep getting #55, but the PyPI install works fine. Can you tell me your workflow for building the Python binding package?

@monatis
Owner

monatis commented Sep 15, 2023

Thanks for taking this!

In fact, we don't need huggingface_hub as a dependency.

We can resolve actual file URLs by replacing blob with resolve in the URLs we see when browsing HF model repos.

For example, this URL is for browsing the file page; clicking it takes you to that page in your browser: https://huggingface.co/Green-Sky/ggml_laion_clip-vit-b-32-laion2b-s34b-b79k/blob/main/laion_clip-vit-b-32-laion2b-s34b-b79k.ggmlv0.f16.bin

However, once you replace blob with resolve you will get a url to download that file: https://huggingface.co/Green-Sky/ggml_laion_clip-vit-b-32-laion2b-s34b-b79k/resolve/main/laion_clip-vit-b-32-laion2b-s34b-b79k.ggmlv0.f16.bin

So if we know repo_id and model_file we can form the downloadable URL.
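
As a rough sketch of that URL construction (the function names and the revision parameter are assumptions for illustration; "main" is taken from the example URLs above):

```python
def hf_resolve_url(repo_id: str, model_file: str, revision: str = "main") -> str:
    """Build a direct-download URL from a repo id and a file name.

    Assumption: the file lives on the given revision ("main" by default).
    """
    return f"https://huggingface.co/{repo_id}/resolve/{revision}/{model_file}"


def blob_to_resolve(url: str) -> str:
    """Turn a file-browsing URL into a download URL by swapping the
    first "/blob/" path segment for "/resolve/"."""
    return url.replace("/blob/", "/resolve/", 1)
```

Calling hf_resolve_url with the repo id and file name from the example above should reproduce the resolve URL shown there.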

And we can use urllib.request.urlretrieve to download with a progress bar as in the following example:

import urllib.request
import sys


def download_with_progress_bar(url, destination_path):
    def reporthook(count, block_size, total_size):
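        # Calculate the progress; total_size may be unknown (<= 0) and the
        # last block can overshoot, so guard the division and clamp to 1.0
        progress = min(count * block_size / total_size, 1.0) if total_size > 0 else 0.0
        progress_percent = int(progress * 100)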

        # Create a simple progress bar
        bar_length = 50
        progress_bar = '=' * int(progress * bar_length)
        spaces = ' ' * (bar_length - len(progress_bar))
        sys.stdout.write(f"\r[{progress_bar}{spaces}] {progress_percent}%")
        sys.stdout.flush()

    try:
        urllib.request.urlretrieve(url, destination_path, reporthook=reporthook)
        sys.stdout.write("\n")  # Move to the next line after download is complete
        print(f"File downloaded to {destination_path}")
        return True
    except Exception as e:
        print(f"\nError downloading file: {e}")
        return False

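if __name__ == "__main__":
    # The resolve URL from the example above
    url = 'https://huggingface.co/Green-Sky/ggml_laion_clip-vit-b-32-laion2b-s34b-b79k/resolve/main/laion_clip-vit-b-32-laion2b-s34b-b79k.ggmlv0.f16.bin'
    destination_path = 'laion_clip-vit-b-32-laion2b-s34b-b79k.ggmlv0.f16.bin'

    download_with_progress_bar(url, destination_path)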

@monatis
Owner

monatis commented Sep 15, 2023

but the PyPI install works fine. Can you tell me your workflow for building the Python binding package?

Here's the workflow to build the shared lib and the Python package end-to-end. Of course, it would be better to have a build-python.sh file for it. You can save it in the main directory of the repo:

#!/bin/bash

rm -rf ./build
mkdir build
cd build
cmake -DBUILD_SHARED_LIBS=ON -DCLIP_NATIVE=OFF ..
make
cp ./libclip.so ../examples/python_bindings/clip_cpp/
cp ./ggml/src/libggml.so ../examples/python_bindings/clip_cpp/
cd ../examples/python_bindings
poetry build

@monatis
Owner

monatis commented Sep 15, 2023

With #66, you can now simply run sh ./scripts/build-python.sh; it will build and copy the shared libs to the right place.

@Yossef-Dawoad
Collaborator Author

We can resolve actual file URLs by replacing blob with resolve in the URLs we see when browsing HF model repos.

Interesting, I will try to work with it and edit the PR today. I should also handle checking whether the file already exists locally.

With #66, you can now simply run sh ./scripts/build-python.sh; it will build and copy the shared libs to the right place.

This is nice, it should be much easier to build the binding now. Thanks!

@Yossef-Dawoad
Collaborator Author

Yossef-Dawoad commented Sep 16, 2023

I have finished implementing the download functionality. It checks whether the file already exists and whether its size matches the size reported by the server, and it has proper error handling, just like the huggingface_api. One last thing is missing: if the user provides only the repo_id without the file_name, how can I check which files are available on the server before downloading, so that I can choose the smallest .bin file in the repo?

example:

model = Clip(
    model_path_or_repo_id=repo_id,
    ## ⚠️⚠️ Here I was able, with the huggingface lib, to check for all available files
    ## and choose the smallest .bin file
    model_file=None,
    verbosity=2,
)

Maybe there is a request you know of that I can make to list the files in the Hugging Face repo, ideally with their sizes.
There is a {repo_id}/tree/ URL path, but it returns the full page.
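
For reference, the existence and size check mentioned above could look roughly like the following. This is a minimal sketch, not the code in the PR; the name needs_download is hypothetical, and expected_size is assumed to come from whatever the server reports for the file.

```python
import os


def needs_download(destination_path: str, expected_size: int) -> bool:
    """Hypothetical helper: return True when the file should be (re)downloaded.

    The file is considered up to date only if it exists locally and its
    size in bytes equals the size reported by the server.
    """
    if not os.path.isfile(destination_path):
        return True
    return os.path.getsize(destination_path) != expected_size
```

A size match does not prove the content is identical, so this only guards against missing or truncated files.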

@monatis
Owner

monatis commented Sep 16, 2023

Great! Some digging into the code revealed the endpoint we can use to retrieve the metadata.

Hm, you are using huggingface_hub.model_info to get that info. When I look at its source code, the URL they're forming is https://huggingface.co/api/models/{repo_id}?blobs=true

For example, below is the response to https://huggingface.co/api/models/Green-Sky/ggml_laion_clip-vit-b-32-laion2b-s34b-b79k?blobs=true

We can parse that JSON with the standard json package and get the required file information.

{
  "_id": "6491777a87ae7236b212148f",
  "id": "Green-Sky/ggml_laion_clip-vit-b-32-laion2b-s34b-b79k",
  "modelId": "Green-Sky/ggml_laion_clip-vit-b-32-laion2b-s34b-b79k",
  "author": "Green-Sky",
  "sha": "02595089876c11995d92348a84466d26f6038b52",
  "lastModified": "2023-06-25T16:04:20.000Z",
  "private": false,
  "disabled": false,
  "gated": false,
  "tags": [
    "clip",
    "vision",
    "ggml",
    "clip.cpp",
    "license:mit",
    "region:us"
  ],
  "downloads": 0,
  "likes": 2,
  "model-index": null,
  "config": {

  },
  "cardData": {
    "license": "mit",
    "tags": [
      "clip",
      "vision",
      "ggml",
      "clip.cpp"
    ]
  },
  "spaces": [],
  "siblings": [
    {
      "rfilename": ".gitattributes",
      "blobId": "a6344aac8c09253b3b630fb776ae94478aa0275b",
      "size": 1519
    },
    {
      "rfilename": "README.md",
      "blobId": "6a67f6909611556bb67735869b6812e3d3731c29",
      "size": 391
    },
    {
      "rfilename": "laion_clip-vit-b-32-laion2b-s34b-b79k.ggmlv0.f16.bin",
      "blobId": "e9bcf0bc818e3588d507e0439cf30b0b05d237f1",
      "size": 303606311,
      "lfs": {
        "sha256": "a916d7b54205f5237fe26361f5f259e2d0eab0cb609436d9f9f52a786553c0c5",
        "size": 303606311,
        "pointerSize": 134
      }
    },
    {
      "rfilename": "laion_clip-vit-b-32-laion2b-s34b-b79k.ggmlv0.q4_0.bin",
      "blobId": "132a765c5dbc5367a3c336b619741fecd358ded6",
      "size": 89830695,
      "lfs": {
        "sha256": "47cad6cbbe4c311ecd5ad3c32915b0838ce395c41e68b479919d313605c375da",
        "size": 89830695,
        "pointerSize": 133
      }
    },
    {
      "rfilename": "laion_clip-vit-b-32-laion2b-s34b-b79k.ggmlv0.q4_1.bin",
      "blobId": "14419b7636fb441cdb53fea95e9d61fbb5728352",
      "size": 99125287,
      "lfs": {
        "sha256": "acf3c61c06c16630369acc31387b27096aa85780861a6ed5cb675e7284ea2a94",
        "size": 99125287,
        "pointerSize": 133
      }
    }
  ]
}
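
A rough sketch of how this response could be used to pick the smallest .bin file with only the standard library. The function names are hypothetical, and the code assumes the response keeps the shape shown above (a "siblings" list whose entries carry "rfilename" and "size").

```python
import json
import urllib.request


def fetch_model_info(repo_id):
    """Fetch repo metadata, including file sizes, from the endpoint above."""
    url = f"https://huggingface.co/api/models/{repo_id}?blobs=true"
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def smallest_bin_file(model_info):
    """Return the name of the smallest .bin file, or None if there is none.

    Assumption: each entry in "siblings" has "rfilename" and "size" keys,
    as in the response shown above; entries without a size are skipped.
    """
    bins = [
        sibling
        for sibling in model_info.get("siblings", [])
        if sibling["rfilename"].endswith(".bin") and "size" in sibling
    ]
    if not bins:
        return None
    return min(bins, key=lambda sibling: sibling["size"])["rfilename"]
```

For the response above, the candidate sizes are 303606311, 89830695 and 99125287 bytes, so the q4_0 file would be selected.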

Owner

@monatis monatis left a comment

Very good job. Bump the version in pyproject.toml and it's good to merge.

@Yossef-Dawoad
Collaborator Author

I think it's ready. I will try to write tests and maybe set up automated GitHub Actions.
