Can this TGI approach work with Llama2 weights downloaded directly from Meta in .pth format? #1412
-
I am planning to set up a TGI inference endpoint for a local installation of Llama2. Can this TGI approach work with Llama2 weights downloaded directly from Meta in .pth format? Or do the weights need to be in a specific format for this TGI approach to work? If they need to be in a specific format, please point me to the script that converts the .pth format to the required format.
-
You need the Hugging Face format (layer names), not the original Meta weights. Like these ones: https://huggingface.co/meta-llama/Llama-2-7b-hf. Also, you'll notice in those repos …
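For the conversion question above: the transformers repository ships a script that converts Meta's original .pth checkpoints into this format. A minimal sketch, assuming a transformers checkout and the usual Meta download layout (all paths here are illustrative):

```bash
# Dependencies for the conversion script.
pip install torch transformers sentencepiece

# Convert Meta's .pth checkpoint into the Hugging Face layout
# (config.json, tokenizer files, weights with the HF layer names).
python src/transformers/models/llama/convert_llama_weights_to_hf.py \
    --input_dir /path/to/meta/llama-2-download \
    --model_size 7B \
    --output_dir /path/to/llama-2-7b-hf
```

The resulting `--output_dir` can then be passed to TGI as a local `--model-id`.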
-
@Narsil thanks for the reply. I understand that we have to use model weights in the HF .bin or .safetensors format. I want to set up a TGI server inference endpoint for a Llama2 model. This should be a completely local model; it should work even without internet within my company. I cloned TGI and the Llama2 model onto my computer first, then followed the steps below:

```bash
# Copy HF TGI
mkdir Llama2_TGI
# Go inside the text-generation-inference folder
cd text-generation-inference
# Create a subfolder called 'data'
mkdir data
# Copy 'Llama-2-7B-Chat-GPTQ' with its .safetensors files inside the 'data' folder
cd data
model=Llama-2-7B-Chat-GPTQ
docker run --rm --entrypoint /bin/bash -it --name $model -v $volume:/data --gpus all -p 8080:80 ghcr.io/huggingface/text-generation-inference:latest --model-id /nfs/Llama2_TGI/text-generation-inference/data/Llama-2-7B-Chat-GPTQ
```

Issue:

```
Unable to find image 'ghcr.io/huggingface/text-generation-inference:latest' locally
```

I am following the `--model-id` usage given in the Hugging Face instructions below (see the last command): https://huggingface.co/docs/text-generation-inference/basic_tutorials/non_core_models

Note that I can run Llama2 models using transformers pipelines without using TGI.
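Two things stand out here, neither confirmed in the thread: `$volume` is never assigned in the steps above, so `-v $volume:/data` likely expands to an empty mount; and the "Unable to find image … locally" line is just Docker reporting that it has to pull the image, which fails without internet access. For a fully offline machine, one option is to transfer the image manually — a sketch:

```bash
# One-time, on a machine with internet access: pull the image and export it.
docker pull ghcr.io/huggingface/text-generation-inference:latest
docker save ghcr.io/huggingface/text-generation-inference:latest -o tgi.tar

# On the offline machine: load the image into the local Docker cache,
# after which 'docker run' no longer needs to pull anything.
docker load -i tgi.tar
```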
-
Because you are mounting your local directory at `/data` inside the container, `--model-id` must be the path as the container sees it (`/data/...`), not the host path. If you are not sure what to do, just launch the docker with entrypoint bash and look around manually from within the docker; you will see the directories as the docker sees them.
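A minimal sketch of that debugging approach, using the paths from the post above:

```bash
# Start the TGI image with a shell instead of the launcher.
docker run --rm -it --entrypoint /bin/bash \
    -v /nfs/Llama2_TGI/text-generation-inference/data:/data \
    --gpus all \
    ghcr.io/huggingface/text-generation-inference:latest

# Inside the container: --model-id is resolved against this view of the
# filesystem, so the model folder must show up under /data.
ls /data
ls /data/Llama-2-7B-Chat-GPTQ
```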
-
```bash
$ docker run --rm --name Llama-2-7B-Chat-GPTQ -v /nfs/Llama2_TGI/text-generation-inference/data:/data --gpus all -p 8080:80 ghcr.io/huggingface/text-generation-inference:latest --model-id=/data/Llama-2-7B-Chat-GPTQ
```

By running the above command I am getting the error below. Any suggestion to overcome this error?

```
RuntimeError: weight model.layers.0.self_attn.q_proj.weight does not exist
```

Complete error message:

```
2024-01-08T15:23:54.655843Z INFO text_generation_launcher: Args { model_id: "/data/Llama-2-7B-Chat-GPTQ", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: None, speculate: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_length: 1024, max_total_tokens: 2048, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: None, max_waiting_tokens: 20, hostname: "65a9437a5845", port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, env: false }
2024-01-08T15:23:58.359409Z INFO download: text_generation_launcher: Successfully downloaded weights.
2024-01-08T15:24:04.265017Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the …
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 89, in serve
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 235, in serve
  File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 196, in serve_inner
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py", line 233, in get_model
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_llama.py", line 69, in __init__
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 412, in __init__
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 350, in __init__
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 351, in <listcomp>
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 287, in __init__
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 171, in __init__
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 105, in load_attention
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/layers.py", line 460, in load_multi
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/weights.py", line 223, in get_multi_weights_col
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/weights.py", line 223, in <listcomp>
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/weights.py", line 113, in get_sharded
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/weights.py", line 64, in get_filename
RuntimeError: weight model.layers.0.self_attn.q_proj.weight does not exist
2024-01-08T15:24:04.364427Z ERROR text_generation_launcher: Shard 0 failed to start
```

The following files are present inside /data.
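For readers hitting the same traceback: the Args line above shows `quantize: None` while the model directory is a GPTQ checkpoint, which stores packed tensors (`qweight`, `qzeros`, scales) instead of a plain `q_proj.weight`. A likely fix, though not confirmed in this thread, is to tell TGI the weights are GPTQ-quantized:

```bash
# Same command as above, plus --quantize gptq so TGI looks for
# GPTQ tensor names instead of plain fp16 weights.
docker run --rm --name Llama-2-7B-Chat-GPTQ \
    -v /nfs/Llama2_TGI/text-generation-inference/data:/data \
    --gpus all -p 8080:80 \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id=/data/Llama-2-7B-Chat-GPTQ \
    --quantize gptq
```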
Finally, I could make it work. Thanks @Narsil