Summary:
Moved the SpinQuant documentation around a bit in the Llama README.
Removed Llama3.1 8B from SpinQuant support; once we validate it, we can add it back.
bypass-github-export-checks
bypass-github-pytorch-ci-checks
bypass-github-executorch-ci-checks
Reviewed By: helunwencser
Differential Revision: D63541861
fbshipit-source-id: 453425d3762ab9de6ad0011f2d4697adce97b951
examples/models/llama2/README.md: 7 additions & 7 deletions
@@ -26,7 +26,7 @@ Since Llama 2 7B or Llama 3 8B model needs at least 4-bit quantization to fit ev
 For Llama 3.2 1B/3B, we validated the models by running them in their original bf16 datatype and unquantized on both Android and iOS phones. The 3B version required high-end phones with more RAM to fit the model.

-Additionally, these models are sensitive to accuracy when regular PTQ quantization is applied, so we employed [SpinQuant](https://github.com/facebookresearch/SpinQuant/tree/main) to achieve a good balance between accuracyand performance.
+Additionally, 1B/3B models are sensitive to accuracy loss when regular PTQ quantization is applied, so we employed 4-bit quantization using [SpinQuant](https://github.com/facebookresearch/SpinQuant/tree/main) to achieve a good balance between accuracy, performance, and memory.

 <table>
 <tr>
@@ -56,11 +56,11 @@ We evaluated WikiText perplexity using [LM Eval](https://github.com/EleutherAI/l
 Note that groupsize less than 128 was not enabled, since such models were still too large. This is because our current efforts have focused on enabling FP32, and support for FP16 is under way. What this implies for model size is that 1) the embedding table is in FP32 and 2) quantized weight scales are in FP32.

-### SpinQuant (Optional)
+### SpinQuant for Llama 3.2 1B/3B models (Optional)

 To improve accuracy, we can use [SpinQuant](https://github.com/facebookresearch/SpinQuant/tree/main), a post-training quantization (PTQ) technique that generates new quantized weights. In the standard PTQ process, quantization may lead to a decrease in accuracy when there are outliers. The SpinQuant method takes the original weights and produces optimized quantized weights with minimal outliers, resulting in higher accuracy. This can be achieved without any finetuning of the weights and only requires 100 iterations on a single A100 node.

-SpinQuant can generate quantized weights that are [compatible with ExecuTorch](https://github.com/facebookresearch/SpinQuant/tree/main?tab=readme-ov-file#3-export-to-executorch), specifically, it can be integrated with the existing optimized XNNPACK kernels (aka group-wise 4bit weight and 8bit dynamic activation). This allows developers to benefit from the higher accuracy of SpinQuant while also taking advantage of the strong performance of ExecuTorch acceleration. We are currently working on enabling SpinQuant for the Llama3.1 8B and Llama3.2 1B/3B models on ExecuTorch.
+SpinQuant can generate quantized weights that are [compatible with ExecuTorch](https://github.com/facebookresearch/SpinQuant/tree/main?tab=readme-ov-file#3-export-to-executorch); specifically, it can be integrated with the existing optimized XNNPACK kernels (e.g., group-wise 4-bit weight and 8-bit dynamic activation). This allows developers to benefit from the higher accuracy of SpinQuant while also taking advantage of the strong performance of ExecuTorch acceleration. We have enabled SpinQuant for the Llama3.2 1B/3B models on ExecuTorch.

 ## Enablement
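The group-wise 4-bit weight / 8-bit dynamic activation scheme mentioned above is the low-bit XNNPACK path used elsewhere in this README. A minimal export sketch, assuming the flag spellings from the README's export examples and placeholder checkpoint paths:

```bash
# Sketch: export Llama with 4-bit group-wise weights and 8-bit dynamic
# activations ("8da4w"), lowered to the XNNPACK backend.
# -kv enables the KV cache, -X lowers to XNNPACK, and --group_size sets the
# weight-quantization group size; the paths below are placeholders.
python -m examples.models.llama2.export_llama \
  --checkpoint /path/to/consolidated.00.pth \
  -p /path/to/params.json \
  -kv -X \
  -qmode 8da4w \
  --group_size 128 \
  -d fp32
```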
@@ -73,11 +73,13 @@ We have verified running Llama 2 7B [mobile applications](#step-6-build-mobile-a
 ### Llama 3.2 1B and 3B
 Llama 3.2 1B and 3B performance was measured on the OnePlus 12 device. The performance measurement is expressed in terms of tokens per second using an [adb binary-based approach](#step-5-run-benchmark-on) for generating 128 tokens.
+(*) With SpinQuant, we currently quantize the weights of all linear layers of the model, except the embedding and output layers, to 4-bit group-wise (with group size 32) with 8-bit dynamic activation. The embedding and output layers are quantized to 8-bit per-channel weight with 8-bit dynamic activation.
+
 ### Llama3 8B and Llama3.1 8B
 Llama 3 8B performance was measured on the Samsung Galaxy S22, S24, and OnePlus 12 devices. The performance measurement is expressed in terms of tokens per second using an [adb binary-based approach](#step-5-run-benchmark-on).

@@ … @@
-Optionally, we can apply SpinQuant to quantize the model without sacrifacing too much accuracy loss. With SpinQuant, we currently support 8-bit per-channel groupwise quantization for embeddings, 8-bit per-channel groupwise weight and 8-bit dynamic activation for the last output layer, 4-bit groupwise with group size 32 weight and 8-bit dynamic activation for other linear layers.
+Optionally, we can apply SpinQuant to quantize the model without sacrificing too much accuracy.

 To use SpinQuant, follow its [instructions](https://github.com/facebookresearch/SpinQuant/tree/main?tab=readme-ov-file#3-export-to-executorch) for exporting the checkpoint to ExecuTorch, and then export the SpinQuant checkpoint.
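In practice the SpinQuant path is two steps: generate an optimized checkpoint with the scripts in the SpinQuant repo, then export it here with `--use_spin_quant native`. A sketch follows; the script names come from the SpinQuant instructions referenced above, their arguments are omitted, and the paths are placeholders:

```bash
# Step 1 (SpinQuant repo): learn the rotation matrices and run the PTQ
# export/eval. See the SpinQuant README for the arguments these scripts take.
bash 31_optimize_rotation_executorch.sh
bash 32_eval_ptq_executorch.sh

# Step 2 (executorch repo): export the resulting checkpoint, applying the
# native SpinQuant transforms during export.
python -m examples.models.llama2.export_llama \
  --checkpoint /path/to/spinquant_checkpoint.pth \
  -p /path/to/params.json \
  --use_spin_quant native
```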
@@ -172,8 +174,6 @@ You can export and run the original Llama 3 8B instruct model.
 Due to the larger vocabulary size of Llama 3, we recommend quantizing the embeddings with `--embedding-quantize 4,32` as shown above to further reduce the model size.

-3. SpinQuant [Optional]. If you want to improve accuracy, you can use [SpinQuant](https://github.com/facebookresearch/SpinQuant). Namely, (1) you can generate a new checkpoint via `31_optimize_rotation_executorch.sh` and `32_eval_ptq_executorch.sh` commands in [SpinQuant repo](https://github.com/facebookresearch/SpinQuant/tree/main?tab=readme-ov-file#3-export-to-executorch) (2) pass in an extra `--use_spin_quant native` argument in `export_llama` script above.
-

 ### Option C: Download and export stories110M model

 If you want to deploy and run a smaller model for educational purposes, start from the `executorch` root:
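For reference, the recommended embedding quantization is a single extra flag on top of the usual export; in `4,32`, the first number is the bit width and the second is the group size. A sketch under the same placeholder-path assumptions as the earlier examples:

```bash
# Sketch: the same 8da4w export as before, plus 4-bit (group size 32)
# quantization of Llama 3's large embedding table.
python -m examples.models.llama2.export_llama \
  --checkpoint /path/to/consolidated.00.pth \
  -p /path/to/params.json \
  -kv -X -qmode 8da4w --group_size 128 -d fp32 \
  --embedding-quantize 4,32
```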