Can this TGI approach work with Llama2 weights downloaded directly from Meta in .pth format? #1412
-
I am planning to set up a TGI inference endpoint for a local installation of Llama2. Can this TGI approach work with Llama2 weights downloaded directly from Meta in .pth format? Or do the weights need to be in a specific format for this TGI approach to work? If they need to be in a specific format, please point me to the script that converts the .pth format to the required format.
-
You need the Hugging Face format (layer names), not the original Meta weights. Like these ones: https://huggingface.co/meta-llama/Llama-2-7b-hf. Also, you'll notice in those repos …
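For the conversion question above: the transformers repository ships a script that converts Meta's original .pth checkpoints into this format. A minimal sketch, assuming a transformers checkout and the usual Meta download layout (all paths here are illustrative):

```bash
# Dependencies for the conversion script.
pip install torch transformers sentencepiece

# Convert Meta's .pth checkpoint into the Hugging Face layout
# (config.json, tokenizer files, weights with the HF layer names).
python src/transformers/models/llama/convert_llama_weights_to_hf.py \
    --input_dir /path/to/meta/llama-2-download \
    --model_size 7B \
    --output_dir /path/to/llama-2-7b-hf
```

The resulting `--output_dir` can then be passed to TGI as a local `--model-id`.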
-
@Narsil thanks for the reply. I understand that we have to use model weights in the HF .bin or .safetensors format. I want to set up a TGI server inference endpoint for a Llama2 model. This should be a completely local model; it should work even without internet within my company. I cloned TGI and the Llama2 model onto my computer first, then followed the steps below:

```bash
# Copy HF TGI
mkdir Llama2_TGI
# Go inside the text-generation-inference folder
cd text-generation-inference
# Create a subfolder called 'data'
mkdir data
# Copy 'Llama-2-7B-Chat-GPTQ' with its .safetensors files inside the 'data' folder
cd data
model=Llama-2-7B-Chat-GPTQ
docker run --rm --entrypoint /bin/bash -it --name $model -v $volume:/data --gpus all -p 8080:80 ghcr.io/huggingface/text-generation-inference:latest --model-id /nfs/Llama2_TGI/text-generation-inference/data/Llama-2-7B-Chat-GPTQ
```

Issue:

```
Unable to find image 'ghcr.io/huggingface/text-generation-inference:latest' locally
```

I am following the `--model-id` usage given in the Hugging Face instructions below (see the last command): https://huggingface.co/docs/text-generation-inference/basic_tutorials/non_core_models

Note that I can run Llama2 models using transformers pipelines without using TGI.
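Two things stand out here, neither confirmed in the thread: `$volume` is never assigned in the steps above, so `-v $volume:/data` likely expands to an empty mount; and the "Unable to find image … locally" line is just Docker reporting that it has to pull the image, which fails without internet access. For a fully offline machine, one option is to transfer the image manually — a sketch:

```bash
# One-time, on a machine with internet access: pull the image and export it.
docker pull ghcr.io/huggingface/text-generation-inference:latest
docker save ghcr.io/huggingface/text-generation-inference:latest -o tgi.tar

# On the offline machine: load the image into the local Docker cache,
# after which 'docker run' no longer needs to pull anything.
docker load -i tgi.tar
```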
-
Because you are mounting your local directory at `/data` inside the container, `--model-id` must be the path as the container sees it (`/data/...`), not the host path. If you are not sure what to do, just launch the docker with entrypoint bash and look around manually from within the docker; you will see the directories as the docker sees them.
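A minimal sketch of that debugging approach, using the paths from the post above:

```bash
# Start the TGI image with a shell instead of the launcher.
docker run --rm -it --entrypoint /bin/bash \
    -v /nfs/Llama2_TGI/text-generation-inference/data:/data \
    --gpus all \
    ghcr.io/huggingface/text-generation-inference:latest

# Inside the container: --model-id is resolved against this view of the
# filesystem, so the model folder must show up under /data.
ls /data
ls /data/Llama-2-7B-Chat-GPTQ
```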
-
```bash
$ docker run --rm --name Llama-2-7B-Chat-GPTQ -v /nfs/Llama2_TGI/text-generation-inference/data:/data --gpus all -p 8080:80 ghcr.io/huggingface/text-generation-inference:latest --model-id=/data/Llama-2-7B-Chat-GPTQ
```

By running the above command I am getting the error below. Any suggestion to overcome this error?

```
RuntimeError: weight model.layers.0.self_attn.q_proj.weight does not exist
```

Complete error message:

```
2024-01-08T15:23:54.655843Z INFO text_generation_launcher: Args { model_id: "/data/Llama-2-7B-Chat-GPTQ", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: None, speculate: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_length: 1024, max_total_tokens: 2048, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: None, max_waiting_tokens: 20, hostname: "65a9437a5845", port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, env: false }
2024-01-08T15:23:58.359409Z INFO download: text_generation_launcher: Successfully downloaded weights.
2024-01-08T15:24:04.265017Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the …
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 89, in serve
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 235, in serve
  File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 196, in serve_inner
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py", line 233, in get_model
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_llama.py", line 69, in __init__
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 412, in __init__
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 350, in __init__
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 351, in <listcomp>
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 287, in __init__
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 171, in __init__
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_llama_modeling.py", line 105, in load_attention
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/layers.py", line 460, in load_multi
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/weights.py", line 223, in get_multi_weights_col
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/weights.py", line 223, in <listcomp>
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/weights.py", line 113, in get_sharded
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/weights.py", line 64, in get_filename
RuntimeError: weight model.layers.0.self_attn.q_proj.weight does not exist
2024-01-08T15:24:04.364427Z ERROR text_generation_launcher: Shard 0 failed to start
```

The following files are present inside /data.
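For readers hitting the same traceback: the Args line above shows `quantize: None` while the model directory is a GPTQ checkpoint, which stores packed tensors (`qweight`, `qzeros`, scales) instead of a plain `q_proj.weight`. A likely fix, though not confirmed in this thread, is to tell TGI the weights are GPTQ-quantized:

```bash
# Same command as above, plus --quantize gptq so TGI looks for
# GPTQ tensor names instead of plain fp16 weights.
docker run --rm --name Llama-2-7B-Chat-GPTQ \
    -v /nfs/Llama2_TGI/text-generation-inference/data:/data \
    --gpus all -p 8080:80 \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id=/data/Llama-2-7B-Chat-GPTQ \
    --quantize gptq
```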
Finally, I could make it work. Thanks @Narsil