Commit 34f59ed

lucylq authored and facebook-github-bot committed
llama2 readme (#3315)
Summary:
- add a note on embedding quantization for llama3
- re-order export args to match llama2; `group_size` was missing its `--` prefix

Pull Request resolved: #3315
Reviewed By: cccclai
Differential Revision: D56528535
Pulled By: lucylq
fbshipit-source-id: 4453070339ebdb3d782b45f96fe43d28c7006092
1 parent: f6758fc

File tree

1 file changed (+3, −1 lines)


examples/models/llama2/README.md

Lines changed: 3 additions & 1 deletion
@@ -111,9 +111,11 @@ You can export and run the original Llama3 8B model.
 
 2. Export model and generate `.pte` file
 ```
-python -m examples.models.llama2.export_llama --checkpoint <consolidated.00.pth> -p <params.json> -d=fp32 -X -qmode 8da4w -kv --use_sdpa_with_kv_cache --output_name="llama3_kv_sdpa_xnn_qe_4_32.pte" group_size 128 --metadata '{"get_bos_id":128000, "get_eos_id":128001}' --embedding-quantize 4,32
+python -m examples.models.llama2.export_llama --checkpoint <consolidated.00.pth> -p <params.json> -kv --use_sdpa_with_kv_cache -X -qmode 8da4w --group_size 128 -d fp32 --metadata '{"get_bos_id":128000, "get_eos_id":128001}' --embedding-quantize 4,32 --output_name="llama3_kv_sdpa_xnn_qe_4_32.pte"
 ```
 
+Due to the larger vocabulary size of Llama3, we recommend quantizing the embeddings with `--embedding-quantize 4,32` to further reduce the model size.
+
 ## (Optional) Finetuning
 
 If you want to finetune your model based on a specific dataset, PyTorch provides [TorchTune](https://github.com/pytorch/torchtune) - a native-Pytorch library for easily authoring, fine-tuning and experimenting with LLMs.
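
For reference, the corrected command can be invoked as in the sketch below. This is a minimal invocation sketch only: `LLAMA3_DIR` is a hypothetical placeholder for wherever the Llama 3 8B files (`consolidated.00.pth` and `params.json`) were downloaded, while the flags themselves are taken verbatim from the added line above (`-X` delegates to XNNPACK, `-qmode 8da4w` selects 8-bit dynamic-activation / 4-bit weight quantization, and `--group_size 128` now carries the `--` prefix this commit restores).

```
# Sketch only: LLAMA3_DIR is a hypothetical download location, not a path from this repo.
LLAMA3_DIR="$HOME/llama3-8b"

python -m examples.models.llama2.export_llama \
  --checkpoint "$LLAMA3_DIR/consolidated.00.pth" \
  -p "$LLAMA3_DIR/params.json" \
  -kv --use_sdpa_with_kv_cache \
  -X \
  -qmode 8da4w --group_size 128 \
  -d fp32 \
  --metadata '{"get_bos_id":128000, "get_eos_id":128001}' \
  --embedding-quantize 4,32 \
  --output_name="llama3_kv_sdpa_xnn_qe_4_32.pte"
```

The flag ordering mirrors the llama2 instructions, which is the point of the re-ordering in this commit; the resulting `llama3_kv_sdpa_xnn_qe_4_32.pte` bundles the 4-bit-weight model with group-quantized (4,32) embeddings.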
