Commit ceae80a

mergennachin authored and facebook-github-bot committed
Instructions for Llama3 (#3154)
Summary: Pull Request resolved: #3154. All the steps until validating on desktop.

Reviewed By: iseeyuan

Differential Revision: D56358723

fbshipit-source-id: 32d246882d9609840932a7da22c2e3dbf015c0a8
1 parent 3257c66 commit ceae80a

File tree: 1 file changed

examples/models/llama2/README.md: 20 additions & 4 deletions
````diff
@@ -17,9 +17,9 @@ Please note that the models are subject to the [acceptable use policy](https://g
 
 # Results
 
-Since the 7B Llama2 model needs at least 4-bit quantization to fit even within some of the high-end phones, results presented here correspond to a 4-bit groupwise post-training quantized model.
+Since the 7B Llama2 model needs at least 4-bit quantization to fit even within some of the high-end phones, results presented here correspond to a 4-bit groupwise post-training quantized model.
 
-For Llama3, we can use the same process. Note that it's only supported in the ExecuTorch main branch.
+For Llama3, we can use the same process. Note that it's only supported in the ExecuTorch main branch.
 
 ## Quantization:
 We employed 4-bit groupwise per-token dynamic quantization of all the linear layers of the model. Dynamic quantization refers to quantizing activations dynamically, such that the quantization parameters for activations are calculated, from the min/max range, at runtime. Here we quantized activations with 8 bits (signed integer). Furthermore, weights are statically quantized; in our case weights were per-channel groupwise quantized with 4-bit signed integers. For more information refer to this [page](https://github.com/pytorch-labs/ao/).
````
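To make the scheme concrete, here is a minimal NumPy sketch of the two quantization steps described above. It is illustrative only: the function names are mine, activations are scaled per-tensor rather than per-token for brevity, and the production kernels live in the torchao repository linked above.

```python
import numpy as np

def quantize_weights_groupwise_int4(w: np.ndarray, group_size: int = 128):
    """Static, symmetric groupwise quantization of a weight matrix to signed int4."""
    rows, cols = w.shape
    groups = w.reshape(rows, cols // group_size, group_size)
    # One scale per group, chosen so the largest magnitude maps to the int4 limit.
    scales = np.abs(groups).max(axis=-1, keepdims=True) / 7.0  # int4 range: [-8, 7]
    scales = np.maximum(scales, 1e-9)  # avoid division by zero for all-zero groups
    q = np.clip(np.round(groups / scales), -8, 7).astype(np.int8)
    return q, scales  # dequantize with q * scales

def quantize_activations_dynamic_int8(x: np.ndarray):
    """Dynamic quantization: the scale comes from the runtime min/max range."""
    scale = max(np.abs(x).max() / 127.0, 1e-9)  # int8 range: [-128, 127]
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale

# Example: quantize a 32x256 linear-layer weight and one activation vector.
qw, w_scales = quantize_weights_groupwise_int4(np.random.randn(32, 256).astype(np.float32))
qx, x_scale = quantize_activations_dynamic_int8(np.random.randn(256).astype(np.float32))
```

A quantized linear layer then multiplies the integer values and folds `w_scales` and `x_scale` back in at the end.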
````diff
@@ -57,7 +57,7 @@ Performance was measured on Samsung Galaxy S22, S24, One Plus 12 and iPhone 15 m
 - For Llama7b, your device may require at least 32GB RAM. If this is a constraint for you, please try the smaller stories model.
 
 ## Step 1: Setup
-1. Follow the [tutorial](https://pytorch.org/executorch/main/getting-started-setup) to set up ExecuTorch
+1. Follow the [tutorial](https://pytorch.org/executorch/main/getting-started-setup) to set up ExecuTorch. For installation run `./install_requirements.sh --pybind xnnpack`
 2. Run `examples/models/llama2/install_requirements.sh` to install a few dependencies.
 
 ## Step 2: Prepare model
````
````diff
@@ -103,6 +103,16 @@ If you want to deploy and run a smaller model for educational purposes. From `ex
 python -m examples.models.llama2.tokenizer.tokenizer -t tokenizer.model -o tokenizer.bin
 ```
 
+### Option C: Download and export Llama3 8B model
+
+You can export and run the original Llama3 8B model.
+
+1. Llama3 pretrained parameters can be downloaded from [Meta's official llama3 repository](https://github.com/meta-llama/llama3/).
+
+2. Export model and generate `.pte` file
+```
+python -m examples.models.llama2.export_llama --checkpoint <consolidated.00.pth> -p <params.json> -d=fp32 -X -qmode 8da4w -kv --use_sdpa_with_kv_cache --output_name="llama3_kv_sdpa_xnn_qe_4_32.pte" --group_size 128 --metadata '{"get_bos_id":128000, "get_eos_id":128001}' --embedding-quantize 4,32
+```
 
 ## (Optional) Finetuning
 
````
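For reference, the two ids in the `--metadata` argument correspond to Llama3 tokenizer special tokens: 128000 is `<|begin_of_text|>` and 128001 is `<|end_of_text|>`, so the exported `.pte` carries the correct BOS/EOS ids for the runtime.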
````diff
@@ -148,6 +158,7 @@ The Uncyclotext results generated above used: `{max_seq_len: 2048, limit: 1000}`
 -DEXECUTORCH_BUILD_EXTENSION_MODULE=ON \
 -DEXECUTORCH_BUILD_EXTENSION_DATA_LOADER=ON \
 -DEXECUTORCH_BUILD_XNNPACK=ON \
+-DEXECUTORCH_BUILD_QUANTIZED=ON \
 -DEXECUTORCH_BUILD_OPTIMIZED=ON \
 -DEXECUTORCH_BUILD_CUSTOM=ON \
 -Bcmake-out .
````
````diff
@@ -163,17 +174,22 @@ The Uncyclotext results generated above used: `{max_seq_len: 2048, limit: 1000}`
 -DEXECUTORCH_BUILD_CUSTOM=ON \
 -DEXECUTORCH_BUILD_OPTIMIZED=ON \
 -DEXECUTORCH_BUILD_XNNPACK=ON \
+-DEXECUTORCH_BUILD_QUANTIZED=ON \
 -Bcmake-out/examples/models/llama2 \
 examples/models/llama2
 
 cmake --build cmake-out/examples/models/llama2 -j16 --config Release
 ```
 
+For Llama3, add the `-DEXECUTORCH_USE_TIKTOKEN=ON` option when building the llama runner.
+
 3. Run model. Run options are available [here](https://github.com/pytorch/executorch/blob/main/examples/models/llama2/main.cpp#L18-L40).
 ```
 cmake-out/examples/models/llama2/llama_main --model_path=<model pte file> --tokenizer_path=<tokenizer.bin> --prompt=<prompt>
 ```
 
+For Llama3, you can pass the original `tokenizer.model` (without converting it to a `.bin` file).
+
 ## Step 5: Run benchmark on Android phone
 
 **1. Build llama runner binary for Android**
````
````diff
@@ -271,7 +287,7 @@ This example tries to reuse the Python code, with minimal modifications to make
 ```
 git clean -xfd
 pip uninstall executorch
-./install_requirements.sh <options>
+./install_requirements.sh --pybind xnnpack
 
 rm -rf cmake-out
 ```
````
