
Commit 53a94af: Support quantized llama models (#6486)

1 parent e6d93de
2 files changed: +76 -31 lines

README.md

Lines changed: 1 addition & 1 deletion
@@ -25,7 +25,7 @@ Check out the [Getting Started](https://pytorch.org/executorch/stable/getting-st
 Check out the examples of [Llama](./examples/models/llama/README.md), [Llava](./examples/models/llava/README.md) and [other models](./examples/README.md) running on edge devices using ExecuTorch.


-**[UPDATE - 09/25]** We have added support for running [Llama 3.2 1B/3B](./examples/models/llama/README.md) models via ExecuTorch.
+**[UPDATE - 10/24]** We have added support for running [Llama 3.2 Quantized 1B/3B](./examples/models/llama/README.md) models via ExecuTorch.

 ## Feedback

examples/models/llama/README.md

Lines changed: 75 additions & 30 deletions
@@ -4,6 +4,7 @@ This example demonstrates how to run [Llama models](https://www.llama.com/) on m
 Here are supported models:

 - Llama 3.2 1B and 3B
+- Llama 3.2 Quantized 1B and 3B
 - Llama 3.1 8B
 - Llama 3 8B
 - [Llama 2 7B](../llama2/README.md)
@@ -24,48 +25,62 @@ Please note that the models are subject to the [Llama 2 Acceptable Use Policy](h

 # Results

-## Llama 3.2 1B/3B
+## Llama 3.2 1B/3B and quantized 1B/3B models

-For Llama 3.2 1B/3B models, we have enabled the original bf16 format and quantization to 4-bit, using SpinQuant, for enhanced performance.
+For Llama 3.2 1B/3B models, we have enabled the original BF16 format and quantization to 4-bit, using SpinQuant and QAT+LoRA, for enhanced performance.

-### 1. Enablement
+The quantized models were optimized primarily for the Arm CPU architecture by leveraging XNNPACK and the KleidiAI library. Work is underway to enable quantization on mobile accelerators for Llama 1B/3B.
+
+### Enablement

 We have successfully verified performance on the following devices: iPhone 15 Pro, iPhone 15 Pro Max, Samsung Galaxy S24+, S22 and OnePlus 12 (featuring 16GB RAM).

-Note, the Llama 3.2 3B unquantized bf16 model was only tested on the OnePlus 12, which has sufficient memory (16GB RAM) to support its size requirements.
+Note that the Llama 3.2 3B unquantized BF16 model was only tested on the OnePlus 12, which has sufficient memory (16GB RAM) to support its size requirements.
+
+### Quantization

-### 2. Quantization
+The 1B/3B models are sensitive to accuracy loss when regular post-training quantization (PTQ) is applied. To achieve a balance between accuracy, performance and memory, we utilized 4-bit quantization, using the [SpinQuant](https://github.com/facebookresearch/SpinQuant/tree/main) and QAT+LoRA methods.

-#### 2.1 SpinQuant
+Our quantization scheme involves three parts, applicable to both methods:

-The 1B/3B models are sensitive to accuracy loss when regular post-training quantization (PTQ) is applied. To achieve a balance between accuracy, performance and memory, we utilized 4-bit quantization with [SpinQuant](https://github.com/facebookresearch/SpinQuant/tree/main). With SpinQuant, we currently quantize 4-bit groupwise (with groupsize 32) weight, 8bit dynamic activation of all the linear layers of the model, except embedding and output layers. The embedding and output layers are quantized as 8-bit per-channel weight and 8-bit dynamic activation.
+- We quantize all linear layers in all transformer blocks to a 4-bit groupwise scheme (with a group size of 32) for weights and 8-bit per-token dynamic quantization for activations.
+- The classification layer is quantized to 8-bit per-channel weights and 8-bit per-token dynamic quantization for activations.
+- We employ 8-bit per-channel quantization for the embedding layer.
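For orientation, here is a hedged sketch of how this three-part scheme shows up in the `export_llama` flags used by the SpinQuant and QAT+LoRA commands later in this file. The flag spellings are copied from those commands; the per-flag interpretation in the comments is this edit's reading rather than official documentation, and a real export would also include the additional options shown in the full commands below.

```
# Sketch: mapping the three-part scheme above onto the pre-quantized checkpoint
# flags that appear in the export commands later in this README.
#   --preq_mode 8da4w_output_8da8w  -> linear layers: 8-bit dynamic activations,
#                                      4-bit groupwise weights; output (classification)
#                                      layer: 8-bit dynamic activations, 8-bit weights
#   --preq_group_size 32            -> group size of 32 for the 4-bit groupwise weights
#   --preq_embedding_quantize 8,0   -> 8-bit per-channel embedding weights
LLAMA_QUANTIZED_CHECKPOINT=path/to/quantized/checkpoint.pth
LLAMA_PARAMS=path/to/params.json

python -m examples.models.llama.export_llama \
  --checkpoint "${LLAMA_QUANTIZED_CHECKPOINT:?}" \
  --params "${LLAMA_PARAMS:?}" \
  --preq_mode 8da4w_output_8da8w \
  --preq_group_size 32 \
  --preq_embedding_quantize 8,0 \
  -X \
  --output_name "llama3_2.pte"
```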
+
+#### SpinQuant

 The SpinQuant method takes the original weights and produces optimized quantized weights with minimal outliers, resulting in higher accuracy. This can be achieved without any finetuning of the weights and only requires 100 iterations on a single A100 node.

 SpinQuant can generate quantized weights that are [compatible with ExecuTorch](https://github.com/facebookresearch/SpinQuant/tree/main?tab=readme-ov-file#3-export-to-executorch), specifically, it can be integrated with the existing optimized XNNPACK kernels (e.g., group-wise 4bit weight and 8bit dynamic activation). This allows developers to benefit from the higher accuracy of SpinQuant while also taking advantage of the strong performance of ExecuTorch acceleration.

-### 3. Accuracy
+#### Quantization-Aware Training and LoRA (QAT+LoRA)
+
+Quantization-Aware Training (QAT) is employed to simulate the effects of quantization during the training of Llama 3.2 models, enabling optimization of their performance in low-precision environments. To initialize QAT, the BF16 Llama 3.2 checkpoints obtained after supervised fine-tuning (SFT) are used, and an additional full round of SFT with QAT is performed. The backbone of the QAT model is then frozen and another round of SFT is performed with low-rank adaptation (LoRA) adaptors applied to all layers within the transformer block; the LoRA adaptors' weights and activations are maintained in BF16.
+
+### Accuracy

 Please see the [Llama 3.2 model card](https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/MODEL_CARD.md) for accuracy evaluations.

-### 4. Performance:
+### Performance

-Llama 3.2 1B and 3B performance was measured on Android OnePlus 12 device. The performance measurement is expressed in terms of tokens per second using an [adb binary-based approach](#step-5-run-benchmark-on) with prompt length of 64.
+Llama 3.2 1B and 3B performance was measured on an Android OnePlus 12 device. The performance measurement is expressed in terms of tokens per second using an [adb binary-based approach](#step-4-run-benchmark-on-android-phone) with a prompt length of 64. Measurements were taken with the KleidiAI library; KleidiAI is not enabled by default yet, so use `-DEXECUTORCH_XNNPACK_ENABLE_KLEIDI=ON` to enable it in the build.

-|Model | decode (tokens/s) | prefill (tokens/s) | Memory size (RSS in MiB) |
-|-------|------------------------ |------------------ | ------------------ |
-|1B bf16 | 19.2 | 60.3 | 3,185 |
-|1B SpinQuant | 50.2 | 260.5 | 1,921 |
-|3B bf16 | 7.6 | 21.2 | 7,419 |
-|3B SpinQuant | 19.7 | 89.7 | 3,726 |
+|Model | Decode (tokens/s) | Time-to-first-token (sec) | Prefill (tokens/s) | Model size (PTE file size in MiB) | Memory size (RSS in MiB) |
+|-------|------------------:|--------------------------:|-------------------:|----------------------------------:|-------------------------:|
+|1B BF16 (baseline) | 19.2 | 1.0 | 60.3 | 2,358 | 3,185 |
+|1B SpinQuant | 50.2 (2.6x) | 0.3 (-76.9%) | 260.5 (4.3x) | 1,083 (-54.1%) | 1,921 (-39.7%) |
+|1B QAT+LoRA | 45.8 (2.4x) | 0.3 (-76.0%) | 252.0 (4.2x) | 1,127 (-52.2%) | 2,255 (-29.2%) |
+|3B BF16 (baseline) | 7.6 | 3.0 | 21.2 | 6,129 | 7,419 |
+|3B SpinQuant | 19.7 (2.6x) | 0.7 (-76.4%) | 89.7 (4.2x) | 2,435 (-60.3%) | 3,726 (-49.8%) |
+|3B QAT+LoRA | 18.5 (2.4x) | 0.7 (-76.1%) | 88.8 (4.2x) | 2,529 (-58.7%) | 4,060 (-45.3%) |
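Since the paragraph above the table notes that KleidiAI is not yet enabled by default, a minimal sketch of passing that option to an ExecuTorch CMake build follows. Only `-DEXECUTORCH_XNNPACK_ENABLE_KLEIDI=ON` and `-DEXECUTORCH_BUILD_XNNPACK=ON` are taken from this document; the build directory name and the remaining options are assumptions.

```
# Hedged sketch: configure an XNNPACK-enabled ExecuTorch build with KleidiAI on.
# Only the two EXECUTORCH_* options are quoted from this README; the rest is assumed.
cmake -S . -B cmake-out \
  -DCMAKE_BUILD_TYPE=Release \
  -DEXECUTORCH_BUILD_XNNPACK=ON \
  -DEXECUTORCH_XNNPACK_ENABLE_KLEIDI=ON
cmake --build cmake-out -j8
```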


 <table>
 <tr>
 <td>
 <img src="./Android3_2_1B_bf16.gif" width="300">
 <br>
-<em> Llama3.2 1B, unquantized, bf16 on Android phone. </em>
+<em> Llama3.2 1B, unquantized, BF16 on Android phone. </em>
 </td>
 <td>
 <img src="./Android3_2_3B_SpinQuant.gif" width="300">
@@ -80,15 +95,15 @@ Llama 3.2 1B and 3B performance was measured on Android OnePlus 12 device. The p
 ## Llama 3/3.1 8B
 Since the Llama 3 8B model needs at least 4-bit quantization to fit even within some of the high-end phones, results presented here correspond to a 4-bit groupwise post-training quantized (PTQ) model.

-### 1. Enablement
+### Enablement

 For Llama 3 8B and Llama 3.1 8B, we have verified so far on iPhone 15 Pro, iPhone 15 Pro Max, Samsung Galaxy S24+ and OnePlus 12 (with 16GB RAM) by quantizing to 4-bit.

-### 2. Quantization
+### Quantization

 We employed PTQ 4-bit groupwise per-token dynamic quantization of all the linear layers of the model. Dynamic quantization refers to quantizing activations dynamically, such that quantization parameters for activations are calculated, from the min/max range, at runtime. Here we quantized activations with 8 bits (signed integer). Furthermore, weights are statically quantized; in our case weights were per-channel groupwise quantized with a 4-bit signed integer. Due to Llama 3's vocabulary size, we had to quantize the embedding lookup table as well. For these results, the embedding lookup table was groupwise quantized with 4 bits and a group size of 32.
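To make the scheme concrete, here is a hedged sketch of what such an export could look like. `--embedding-quantize 4,32` is the flag recommended later in this file, while `-qmode 8da4w` and `--group_size 128` are assumed spellings of the export_llama quantization options rather than text quoted from this diff.

```
# Hedged sketch of a 4-bit groupwise PTQ export for Llama 3 8B.
# --embedding-quantize 4,32 is quoted from this README; -qmode 8da4w and
# --group_size 128 are assumed option names for the quantization described above.
LLAMA_CHECKPOINT=path/to/checkpoint.pth
LLAMA_PARAMS=path/to/params.json

python -m examples.models.llama.export_llama \
  --checkpoint "${LLAMA_CHECKPOINT:?}" \
  --params "${LLAMA_PARAMS:?}" \
  -kv \
  --use_sdpa_with_kv_cache \
  -X \
  -qmode 8da4w \
  --group_size 128 \
  --embedding-quantize 4,32 \
  --output_name "llama3_8b.pte"
```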

-### 3. Accuracy
+### Accuracy

 We evaluated UncycloText perplexity using [LM Eval](https://github.com/EleutherAI/lm-evaluation-harness). Below are the results for two different group sizes, with max_seq_length 2048 and limit 1000.

@@ -98,9 +113,9 @@ We evaluated UncycloText perplexity using [LM Eval](https://github.com/EleutherAI/l

 Please note that LM Eval reports perplexity normalized by word count instead of token count. You may see different perplexity for UncycloText from other sources if they implement it differently. More details can be found [here](https://github.com/EleutherAI/lm-evaluation-harness/issues/2301).

-### 4. Performance
+### Performance

-Llama 3 8B performance was measured on the Samsung Galaxy S22, S24, and OnePlus 12 devices. The performance measurement is expressed in terms of tokens per second using an [adb binary-based approach](#step-5-run-benchmark-on).
+Llama 3 8B performance was measured on the Samsung Galaxy S22, S24, and OnePlus 12 devices. The performance measurement is expressed in terms of tokens per second using an [adb binary-based approach](#step-4-run-benchmark-on-android-phone).

 |Device | Groupwise 4-bit (128) | Groupwise 4-bit (256)
 |--------| ---------------------- | ---------------
@@ -137,9 +152,11 @@ Llama 3 8B performance was measured on the Samsung Galaxy S22, S24, and OnePlus

 1. Download `consolidated.00.pth`, `params.json` and `tokenizer.model` from the [Llama website](https://www.llama.com/llama-downloads/) or [Hugging Face](https://huggingface.co/meta-llama/Llama-3.2-1B). For chat use-cases, download the instruct models.

-2. Export model and generate `.pte` file. Use original bfloat16 version, without any quantization.
+2. Export the model and generate a `.pte` file.

+- Use the **original BF16** version, without any quantization.
 ```
+# No quantization
 # Set these paths to point to the downloaded files
 LLAMA_CHECKPOINT=path/to/checkpoint.pth
 LLAMA_PARAMS=path/to/params.json
@@ -155,20 +172,22 @@ python -m examples.models.llama.export_llama \
 --output_name="llama3_2.pte"
 ```

-Optionally, we can apply SpinQuant to quantize the model without sacrifacing too much accuracy loss.
-
-To use SpinQuant, follow its [instruction](https://github.com/facebookresearch/SpinQuant/tree/main?tab=readme-ov-file#3-export-to-executorch) for exporting checkpoint to ExecuTorch and then export the SpinQuant checkpoint.
+- To use **SpinQuant**, there are two ways:
+  - Download directly from the [Llama website](https://www.llama.com/llama-downloads). The model weights are pre-quantized and can be exported to a `.pte` file directly.
+  - Follow the [instructions](https://github.com/facebookresearch/SpinQuant/tree/main?tab=readme-ov-file#3-export-to-executorch) for exporting the checkpoint to ExecuTorch, then export the SpinQuant checkpoint.

 ```
+# SpinQuant
 # Set these paths to point to the exported files
 LLAMA_QUANTIZED_CHECKPOINT=path/to/spinquant/checkpoint.pth
-LLAMA_PARAMS=path/to/params.json
+LLAMA_PARAMS=path/to/spinquant/params.json

 python -m examples.models.llama.export_llama \
 --checkpoint "${LLAMA_QUANTIZED_CHECKPOINT:?}" \
 --params "${LLAMA_PARAMS:?}" \
 --use_sdpa_with_kv_cache \
 -X \
+--xnnpack-extended-ops \
 --preq_mode 8da4w_output_8da8w \
 --preq_group_size 32 \
 --max_seq_length 2048 \
@@ -180,6 +199,32 @@ python -m examples.models.llama.export_llama \
 --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}'
 ```

+- To use **QAT+LoRA**, download directly from the [Llama website](https://www.llama.com/llama-downloads). The model weights are pre-quantized and can be exported to a `.pte` file directly by:
+
+```
+# QAT+LoRA
+# Set these paths to point to the exported files
+LLAMA_QUANTIZED_CHECKPOINT=path/to/qlora/checkpoint.pth
+LLAMA_PARAMS=path/to/qlora/params.json
+
+python -m examples.models.llama.export_llama \
+--checkpoint "${LLAMA_QUANTIZED_CHECKPOINT:?}" \
+--params "${LLAMA_PARAMS:?}" \
+-qat \
+-lora 16 \
+--preq_mode 8da4w_output_8da8w \
+--preq_group_size 32 \
+--preq_embedding_quantize 8,0 \
+--use_sdpa_with_kv_cache \
+-kv \
+-X \
+--xnnpack-extended-ops \
+-d fp32 \
+--max_seq_length 2048 \
+--output_name "llama3_2.pte" \
+--metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}'
+```
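Whichever of the three export paths is used (BF16, SpinQuant, or QAT+LoRA), a quick host-side sanity check of the resulting `.pte` with the llama_main runner is possible before moving to a device. This is only a sketch: the cmake-out binary path is an assumption about the host build layout, and the flags mirror the on-device command shown later in this README.

```
# Hedged sketch: validate the exported .pte on the host before deploying.
# The cmake-out path assumes a host build of the llama runner; flags mirror
# the on-device llama_main invocation shown later in this README.
cmake-out/examples/models/llama/llama_main \
  --model_path llama3_2.pte \
  --tokenizer_path tokenizer.model \
  --prompt "What is the capital of France?" \
  --seq_len 120
```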
+
 ### Option B: Download and export Llama 3 8B instruct model

 You can export and run the original Llama 3 8B instruct model.
@@ -193,7 +238,7 @@ You can export and run the original Llama 3 8B instruct model.

 Due to the larger vocabulary size of Llama 3, we recommend quantizing the embeddings with `--embedding-quantize 4,32` as shown above to further reduce the model size.

-## Step 4: Run on your computer to validate
+## Step 3: Run on your computer to validate

 1. Build ExecuTorch with optimized CPU performance as follows. Build options are available [here](https://github.com/pytorch/executorch/blob/main/CMakeLists.txt#L59).
 ```
@@ -236,7 +281,7 @@ Note for Mac users: There's a known linking issue with Xcode 15.1. Refer to the

 To build for the CoreML backend and validate on Mac, replace `-DEXECUTORCH_BUILD_XNNPACK=ON` with `-DEXECUTORCH_BUILD_COREML=ON`.

-## Step 5: Run benchmark on Android phone
+## Step 4: Run benchmark on Android phone

 **1. Build llama runner binary for Android**

@@ -301,7 +346,7 @@ adb push cmake-out-android/examples/models/llama/llama_main /data/local/tmp/llam

 **2.3 Run model**
 ```
-adb shell "cd /data/local/tmp/llama && ./llama_main --model_path <model.pte> --tokenizer_path <tokenizer.model> --prompt \"Once upon a time\" --seq_len 120"
+adb shell "cd /data/local/tmp/llama && ./llama_main --model_path <model.pte> --tokenizer_path <tokenizer.model> --prompt \"What is the capital of France?\" --seq_len 120 --warmup=1"
 ```
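If the exported model and tokenizer are not already on the device, staging them with adb push along these lines would precede the run command above; the /data/local/tmp/llama directory is taken from that command, while the local file names are assumptions.

```
# Hedged sketch: stage the model and tokenizer on the device before running.
# The target directory comes from the command above; local file names are assumed.
adb shell mkdir -p /data/local/tmp/llama
adb push llama3_2.pte /data/local/tmp/llama/
adb push tokenizer.model /data/local/tmp/llama/
```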
 ## Step 6: Build Mobile apps
307352
