README.md (1 addition, 1 deletion)

@@ -25,7 +25,7 @@ Check out the [Getting Started](https://pytorch.org/executorch/stable/getting-st
Check out the examples of [Llama](./examples/models/llama/README.md), [Llava](./examples/models/llava/README.md) and [other models](./examples/README.md) running on edge devices using ExecuTorch.
-**[UPDATE - 09/25]** We have added support for running [Llama 3.2 1B/3B](./examples/models/llama/README.md) models via ExecuTorch.
+**[UPDATE - 10/24]** We have added support for running [Llama 3.2 Quantized 1B/3B](./examples/models/llama/README.md) models via ExecuTorch.

examples/models/llama/README.md (75 additions, 30 deletions)

@@ -4,6 +4,7 @@ This example demonstrates how to run [Llama models](https://www.llama.com/) on m
Here are supported models:
- Llama 3.2 1B and 3B
+- Llama 3.2 Quantized 1B and 3B
- Llama 3.1 8B
- Llama 3 8B
- [Llama 2 7B](../llama2/README.md)
@@ -24,48 +25,62 @@ Please note that the models are subject to the [Llama 2 Acceptable Use Policy](h
# Results
-## Llama 3.2 1B/3B
+## Llama 3.2 1B/3B and quantized 1B/3B models
-For Llama 3.2 1B/3B models, we have enabled the original bf16 format and quantization to 4-bit, using SpinQuant, for enhanced performance.
+For Llama 3.2 1B/3B models, we have enabled the original BF16 format and 4-bit quantization, using SpinQuant and QAT+LoRA, for enhanced performance.
-### 1. Enablement
+The quantized models were optimized primarily for the Arm CPU architecture by leveraging XNNPACK and the Kleidi AI library. Work is underway to enable quantization on mobile accelerators for Llama 1B/3B as well.
+
+### Enablement
We have successfully verified performance on the following devices: iPhone 15 Pro, iPhone 15 Pro Max, Samsung Galaxy S24+, S22 and OnePlus 12 (featuring 16GB RAM).
-Note, the Llama 3.2 3B unquantized bf16 model was only tested on the OnePlus 12, which has sufficient memory (16GB RAM) to support its size requirements.
+Note that the Llama 3.2 3B unquantized BF16 model was only tested on the OnePlus 12, which has sufficient memory (16GB RAM) to support its size requirements.
+
+### Quantization
-### 2. Quantization
+The 1B/3B models are sensitive to accuracy loss when regular post-training quantization (PTQ) is applied. To achieve a balance between accuracy, performance and memory, we utilized 4-bit quantization using the [SpinQuant](https://github.com/facebookresearch/SpinQuant/tree/main) and QAT+LoRA methods.
-#### 2.1 SpinQuant
+Our quantization scheme involves three parts, applicable to both methods:
-The 1B/3B models are sensitive to accuracy loss when regular post-training quantization (PTQ) is applied. To achieve a balance between accuracy, performance and memory, we utilized 4-bit quantization with [SpinQuant](https://github.com/facebookresearch/SpinQuant/tree/main). With SpinQuant, we currently quantize 4-bit groupwise (with groupsize 32) weight, 8bit dynamic activation of all the linear layers of the model, except embedding and output layers. The embedding and output layers are quantized as 8-bit per-channel weight and 8-bit dynamic activation.
+- We quantize all linear layers in all transformer blocks to a 4-bit groupwise scheme (with a group size of 32) for weights and 8-bit per-token dynamic quantization for activations.
+- The classification layer is quantized to 8-bit per-channel for weights and 8-bit per-token dynamic quantization for activations.
+- We employ 8-bit per-channel quantization for the embedding layer.
+
+#### SpinQuant
The SpinQuant method takes the original weights and produces optimized quantized weights with minimal outliers, resulting in higher accuracy. This can be achieved without any finetuning of the weights and only requires 100 iterations on a single A100 node.
SpinQuant can generate quantized weights that are [compatible with ExecuTorch](https://github.com/facebookresearch/SpinQuant/tree/main?tab=readme-ov-file#3-export-to-executorch); specifically, they can be integrated with the existing optimized XNNPACK kernels (e.g., group-wise 4-bit weight and 8-bit dynamic activation). This allows developers to benefit from the higher accuracy of SpinQuant while also taking advantage of the strong performance of ExecuTorch acceleration.
-### 3. Accuracy
+#### Quantization-Aware Training and LoRA (QAT+LoRA)
+
+Quantization-Aware Training (QAT) is employed to simulate the effects of quantization during the training of Llama-3.2 models, enabling optimization of their performance in low-precision environments. To initialize QAT, BF16 Llama-3.2 model checkpoints obtained after supervised fine-tuning (SFT) are utilized, and an additional full round of SFT training with QAT is performed. The backbone of the QAT model is then frozen and another round of SFT is performed with low-rank adaptation (LoRA) adaptors applied to all layers within the transformer block. Meanwhile, the LoRA adaptors' weights and activations are maintained in BF16.
+
+### Accuracy
Please see the [Llama 3.2 model card](https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/MODEL_CARD.md) for accuracy evaluations.
-### 4. Performance:
+### Performance
-Llama 3.2 1B and 3B performance was measured on Android OnePlus 12 device. The performance measurement is expressed in terms of tokens per second using an [adb binary-based approach](#step-5-run-benchmark-on) with prompt length of 64.
+Llama 3.2 1B and 3B performance was measured on an Android OnePlus 12 device. The performance measurement is expressed in terms of tokens per second using an [adb binary-based approach](#step-4-run-benchmark-on-android-phone) with a prompt length of 64. It is measured with the KleidiAI library. KleidiAI is not enabled by default yet; use `-DEXECUTORCH_XNNPACK_ENABLE_KLEIDI=ON` to enable it in the build.
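For reference only, here is a minimal sketch of how the KleidiAI option mentioned above might be passed when configuring the ExecuTorch CMake build. The source/build directories, build type, and job count are placeholders, not part of the original instructions; see the build step later in this README for the full set of flags used there.

```bash
# Hedged sketch: configure an ExecuTorch build with XNNPACK and the optional KleidiAI micro-kernels.
# Directory names and the job count are placeholders.
cmake -S . -B cmake-out \
  -DCMAKE_BUILD_TYPE=Release \
  -DEXECUTORCH_BUILD_XNNPACK=ON \
  -DEXECUTORCH_XNNPACK_ENABLE_KLEIDI=ON   # off by default; needed to reproduce the KleidiAI-backed numbers above
cmake --build cmake-out -j8
```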
@@ -80,15 +95,15 @@ Llama 3.2 1B and 3B performance was measured on Android OnePlus 12 device. The p
## Llama 3/3.1 8B
Since the Llama 3 8B model needs at least 4-bit quantization to fit even within some of the high-end phones, the results presented here correspond to a 4-bit groupwise post-training quantized (PTQ) model.
-### 1. Enablement
+### Enablement
For Llama 3 8B and Llama 3.1 8B, we have verified so far on iPhone 15 Pro, iPhone 15 Pro Max, Samsung Galaxy S24+ and OnePlus 12 (with 16GB RAM) by quantizing to 4-bit.
-### 2. Quantization
+### Quantization
We employed PTQ 4-bit groupwise per-token dynamic quantization of all the linear layers of the model. Dynamic quantization refers to quantizing activations dynamically, such that quantization parameters for activations are calculated, from the min/max range, at runtime. Here we quantized activations with 8 bits (signed integer). Furthermore, weights are statically quantized; in our case weights were per-channel groupwise quantized with 4-bit signed integers. Due to Llama 3's vocabulary size, we had to quantize the embedding lookup table as well. For these results, the embedding lookup table was groupwise quantized with 4 bits and a group size of 32.
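To make that scheme concrete, here is a hedged sketch of what such an export invocation might look like. The module path, flag spellings, output name, and file paths are recalled from ExecuTorch's Llama example and may differ between versions; treat them as assumptions rather than the exact command used to produce these results.

```bash
# Hedged sketch: 4-bit groupwise weight / 8-bit dynamic activation PTQ export of Llama 3 8B,
# with the embedding table quantized to 4-bit, group size 32.
# Module path and flags are assumptions and may differ across ExecuTorch versions.
python -m examples.models.llama.export_llama \
  --checkpoint /path/to/consolidated.00.pth \
  -p /path/to/params.json \
  -kv -X -d fp32 \
  -qmode 8da4w --group_size 32 \
  --embedding-quantize 4,32 \
  --output_name "llama3_8b_8da4w_gs32.pte"
```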
-### 3. Accuracy
+### Accuracy
We evaluated UncycloText perplexity using [LM Eval](https://github.com/EleutherAI/lm-evaluation-harness). Below are the results for two different groupsizes, with max_seq_length 2048, and limit 1000.
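For orientation, a hedged sketch of how such a perplexity evaluation is typically driven through the example's eval entry point is shown below. The module path and every flag name here are assumptions recalled from the ExecuTorch Llama example and may not match the exact command used for these numbers; consult the repository for the current interface.

```bash
# Hedged sketch: UncycloText perplexity via the lm-evaluation-harness-backed eval script.
# Module path, flag names, and file paths are assumptions; limit and max_seq_length follow the text above.
python -m examples.models.llama.eval_llama \
  -c /path/to/consolidated.00.pth \
  -p /path/to/params.json \
  -t /path/to/tokenizer.model \
  -kv -d fp32 \
  --tasks wikitext \
  --max_seq_length 2048 \
  --limit 1000
```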
94
109
@@ -98,9 +113,9 @@ We evaluated UncycloText perplexity using [LM Eval](https://github.com/EleutherAI/l
Please note that LM Eval reports perplexity normalized by word count instead of token count. You may see different perplexity for UncycloText from other sources if they implement it differently. More details could be found [here](https://github.com/EleutherAI/lm-evaluation-harness/issues/2301).
-### 4. Performance
+### Performance
-Llama 3 8B performance was measured on the Samsung Galaxy S22, S24, and OnePlus 12 devices. The performance measurement is expressed in terms of tokens per second using an [adb binary-based approach](#step-5-run-benchmark-on).
+Llama 3 8B performance was measured on the Samsung Galaxy S22, S24, and OnePlus 12 devices. The performance measurement is expressed in terms of tokens per second using an [adb binary-based approach](#step-4-run-benchmark-on-android-phone).
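As an illustration of that adb binary-based flow, here is a minimal sketch that pushes the exported artifacts and the `llama_main` runner to the device and runs it, mirroring the run command shown later in this README. The angle-bracket placeholders and the build output path are placeholders; adjust them to your build.

```bash
# Hedged sketch of the adb-based measurement: push artifacts and the runner, then run on-device.
adb push <model.pte> /data/local/tmp/llama/
adb push <tokenizer.model> /data/local/tmp/llama/
adb push <path-to-build>/llama_main /data/local/tmp/llama/
adb shell "cd /data/local/tmp/llama && ./llama_main --model_path <model.pte> --tokenizer_path <tokenizer.model> --prompt \"What is the capital of France?\" --seq_len 120 --warmup=1"
```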
@@ -137,9 +152,11 @@ Llama 3 8B performance was measured on the Samsung Galaxy S22, S24, and OnePlus
1. Download `consolidated.00.pth`, `params.json` and `tokenizer.model` from [Llama website](https://www.llama.com/llama-downloads/) or [Hugging Face](https://huggingface.co/meta-llama/Llama-3.2-1B). For chat use-cases, download the instruct models.
-2. Export model and generate `.pte` file. Use original bfloat16 version, without any quantization.
+2. Export model and generate `.pte` file.
+- Use **original BF16** version, without any quantization.
```
+# No quantization
# Set these paths to point to the downloaded files
Optionally, we can apply SpinQuant to quantize the model without sacrificing too much accuracy.
-
-To use SpinQuant, follow its [instruction](https://github.com/facebookresearch/SpinQuant/tree/main?tab=readme-ov-file#3-export-to-executorch) for exporting checkpoint to ExecuTorch and then export the SpinQuant checkpoint.
+- To use **SpinQuant**, here are two ways:
+    - Download directly from the [Llama website](https://www.llama.com/llama-downloads). The model weights are prequantized and can be exported to a `pte` file directly.
+    - Follow its [instruction](https://github.com/facebookresearch/SpinQuant/tree/main?tab=readme-ov-file#3-export-to-executorch) for exporting the checkpoint to ExecuTorch and then export the SpinQuant checkpoint.
+- To use **QAT+LoRA**, download directly from the [Llama website](https://www.llama.com/llama-downloads). The model weights are prequantized and can be exported to a `pte` file directly by:
### Option B: Download and export Llama 3 8B instruct model
You can export and run the original Llama 3 8B instruct model.
@@ -193,7 +238,7 @@ You can export and run the original Llama 3 8B instruct model.
Due to the larger vocabulary size of Llama 3, we recommend quantizing the embeddings with `--embedding-quantize 4,32` as shown above to further reduce the model size.
-## Step 4: Run on your computer to validate
+## Step 3: Run on your computer to validate
1. Build executorch with optimized CPU performance as follows. Build options available [here](https://github.com/pytorch/executorch/blob/main/CMakeLists.txt#L59).
```
@@ -236,7 +281,7 @@ Note for Mac users: There's a known linking issue with Xcode 15.1. Refer to the
To build for CoreML backend and validate on Mac, replace `-DEXECUTORCH_BUILD_XNNPACK=ON` with `-DEXECUTORCH_BUILD_COREML=ON`
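To spell that substitution out, here is a hedged sketch of the CoreML variant of the configure step. Only the two flags named above come from this README; the directories, build type, and remaining options are placeholders and other required build options are elided.

```bash
# Hedged sketch: same configure step as the CPU build, but targeting the CoreML backend for macOS validation.
cmake -S . -B cmake-out \
  -DCMAKE_BUILD_TYPE=Release \
  -DEXECUTORCH_BUILD_COREML=ON   # replaces -DEXECUTORCH_BUILD_XNNPACK=ON from the CPU/XNNPACK build
cmake --build cmake-out -j8
```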
-adb shell "cd /data/local/tmp/llama && ./llama_main --model_path <model.pte> --tokenizer_path <tokenizer.model> --prompt \"Once upon a time\" --seq_len 120"
+adb shell "cd /data/local/tmp/llama && ./llama_main --model_path <model.pte> --tokenizer_path <tokenizer.model> --prompt \"What is the capital of France?\" --seq_len 120 --warmup=1"