Commit bd03d6b

mergennachin authored and facebook-github-bot committed

Improve llama README with SpinQuant
Summary: Moved things around a bit. Removed Llama3.1 8B as part of SpinQuant support. Once we validate, we can add it back.

bypass-github-export-checks
bypass-github-pytorch-ci-checks
bypass-github-executorch-ci-checks

Reviewed By: helunwencser

Differential Revision: D63541861

fbshipit-source-id: 453425d3762ab9de6ad0011f2d4697adce97b951
1 parent 8f3a83b commit bd03d6b


examples/models/llama2/README.md

Lines changed: 7 additions & 7 deletions
@@ -26,7 +26,7 @@ Since Llama 2 7B or Llama 3 8B model needs at least 4-bit quantization to fit ev

For Llama 3.2 1B/3B, we validated the models by running them in their original bf16 datatype and unquantized on both Android and iOS phones. The 3B version required high-end phones with larger RAMs to fit the model.

-Additionally, these models are sensitive to accuracy when regular PTQ quantization is applied, so we employed [SpinQuant](https://github.com/facebookresearch/SpinQuant/tree/main) to achieve a good balance between accuracy and performance.
+Additionally, the 1B/3B models are sensitive to accuracy loss when regular PTQ quantization is applied, so we employed 4-bit quantization using [SpinQuant](https://github.com/facebookresearch/SpinQuant/tree/main) to achieve a good balance between accuracy, performance, and memory.

<table>
<tr>
@@ -56,11 +56,11 @@ We evaluated WikiText perplexity using [LM Eval](https://github.com/EleutherAI/l

Note that groupsize less than 128 was not enabled, since such models were still too large. This is because our current efforts have focused on enabling FP32 and support for FP16 is under way. What this implies for model size is that 1) embedding table is in FP32 and 2) quantized weights scales are FP32.

-### SpinQuant (Optional)
+### SpinQuant for Llama 3.2 1B/3B models (Optional)

To improve accuracy, we can use [SpinQuant](https://github.com/facebookresearch/SpinQuant/tree/main), a post-training quantization (PTQ) technique that generates new quantized weights. In the standard PTQ process, quantization may lead to a decrease in accuracy when there are outliers. The SpinQuant method takes the original weights and produces optimized quantized weights with minimal outliers, resulting in higher accuracy. This can be achieved without any finetuning of the weights and only requires 100 iterations on a single A100 node.

-SpinQuant can generate quantized weights that are [compatible with ExecuTorch](https://github.com/facebookresearch/SpinQuant/tree/main?tab=readme-ov-file#3-export-to-executorch), specifically, it can be integrated with the existing optimized XNNPACK kernels (aka group-wise 4bit weight and 8bit dynamic activation). This allows developers to benefit from the higher accuracy of SpinQuant while also taking advantage of the strong performance of ExecuTorch acceleration. We are currently working on enabling SpinQuant for the Llama3.1 8B and Llama3.2 1B/3B models on ExecuTorch.
+SpinQuant can generate quantized weights that are [compatible with ExecuTorch](https://github.com/facebookresearch/SpinQuant/tree/main?tab=readme-ov-file#3-export-to-executorch); specifically, they can be integrated with the existing optimized XNNPACK kernels (e.g., group-wise 4-bit weight and 8-bit dynamic activation). This allows developers to benefit from the higher accuracy of SpinQuant while also taking advantage of the strong performance of ExecuTorch acceleration. We have enabled SpinQuant for the Llama 3.2 1B/3B models on ExecuTorch.

## Enablement

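For intuition on why rotations help, here is a minimal numpy sketch of the idea only (SpinQuant itself learns optimized rotations rather than using a random one): rotating the weight vector spreads a single outlier across many dimensions, so symmetric uniform quantization wastes less of its range.

```python
# Minimal sketch, not the SpinQuant implementation: an orthogonal rotation
# spreads an outlier weight across dimensions, so 4-bit uniform
# quantization loses less precision overall.
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=512)
w[7] = 40.0  # inject an outlier, the failure mode SpinQuant targets

def fake_quant(x, bits=4):
    # symmetric uniform quantize-dequantize over the tensor's full range
    scale = np.abs(x).max() / (2 ** (bits - 1) - 1)
    return np.round(x / scale) * scale

# a random orthogonal matrix stands in for SpinQuant's optimized rotation
q, _ = np.linalg.qr(rng.normal(size=(512, 512)))

mse_plain = np.mean((w - fake_quant(w)) ** 2)
mse_rotated = np.mean((w - q.T @ fake_quant(q @ w)) ** 2)
print(f"quantization MSE, no rotation: {mse_plain:.4f}")
print(f"quantization MSE, with rotation: {mse_rotated:.4f}")
```

Because the rotation is orthogonal, it can be folded into adjacent weight matrices at export time, which is why the result stays compatible with standard quantized kernels.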
@@ -73,11 +73,13 @@ We have verified running Llama 2 7B [mobile applications](#step-6-build-mobile-a

### Llama 3.2 1B and 3B
Llama 3.2 1B and 3B performance was measured on the OnePlus 12 device. The performance measurement is expressed in terms of tokens per second using an [adb binary-based approach](#step-5-run-benchmark-on) for generating 128 tokens.

-|Model | bf16 | SpinQuant
+|Model | bf16 | 4-bit (*) via SpinQuant
|--------| ---------------------- | ---------------
|1B | 19.4 tokens/second | 53.41 tokens/second |
|3B | 7.76 tokens/second | 22.98 tokens/second |

+(*) With SpinQuant, we currently quantize the weights of all linear layers except the embedding and output layers to 4-bit group-wise (groupsize 32) with 8-bit dynamic activations; the embedding and output layers are quantized to 8-bit per-channel weights with 8-bit dynamic activations.
+
### Llama3 8B and Llama3.1 8B
Llama 3 8B performance was measured on the Samsung Galaxy S22, S24, and OnePlus 12 devices. The performance measurement is expressed in terms of tokens per second using an [adb binary-based approach](#step-5-run-benchmark-on).

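To make the (*) scheme concrete, here is a hedged Python sketch of symmetric 4-bit group-wise weight quantization with groupsize 32; the real XNNPACK kernels pack two int4 values per byte and lay out scales differently, so treat this as an illustration rather than the implementation.

```python
# Illustrative sketch of the 4-bit group-wise scheme described above,
# not the XNNPACK kernel implementation.
import numpy as np

def quantize_groupwise_int4(weight: np.ndarray, group_size: int = 32):
    """Symmetric per-group quantization: each group of `group_size` weights
    in a row shares one fp32 scale; values land in the int4 range [-8, 7]."""
    rows, cols = weight.shape
    assert cols % group_size == 0
    groups = weight.reshape(rows, cols // group_size, group_size)
    scales = np.abs(groups).max(axis=-1, keepdims=True) / 7.0
    q = np.clip(np.round(groups / scales), -8, 7).astype(np.int8)
    return q, scales.astype(np.float32)

def dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    # reconstruct fp32 weights from int4 codes and per-group scales
    return (q.astype(np.float32) * scales).reshape(q.shape[0], -1)

w = np.random.randn(4, 128).astype(np.float32)
q, s = quantize_groupwise_int4(w)
print("reconstruction max abs error:", np.abs(w - dequantize(q, s)).max())
```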
@@ -134,7 +136,7 @@ python -m examples.models.llama2.export_llama \
  --output_name="llama3_2.pte"
```

-Optionally, we can apply SpinQuant to quantize the model without sacrifacing too much accuracy loss. With SpinQuant, we currently support 8-bit per-channel groupwise quantization for embeddings, 8-bit per-channel groupwise weight and 8-bit dynamic activation for the last output layer, 4-bit groupwise with group size 32 weight and 8-bit dynamic activation for other linear layers.
+Optionally, we can apply SpinQuant to quantize the model without sacrificing too much accuracy.

To use SpinQuant, follow its [instruction](https://github.com/facebookresearch/SpinQuant/tree/main?tab=readme-ov-file#3-export-to-executorch) for exporting checkpoint to ExecuTorch and then export the SpinQuant checkpoint.

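As a usage sketch, the snippet below drives the same `export_llama` entry point shown in the fenced command above from Python. The `--use_spin_quant native` and `--output_name` flags appear elsewhere in this README; the checkpoint flag and all paths are placeholder assumptions to verify against `export_llama --help`.

```python
# Hedged sketch: invoke the export_llama CLI with the SpinQuant flag this
# README documents. Paths are placeholders; flags not shown elsewhere in
# the README are assumptions and should be checked against --help first.
import subprocess

cmd = [
    "python", "-m", "examples.models.llama2.export_llama",
    "--checkpoint", "path/to/spinquant_checkpoint.pth",  # placeholder path
    "--use_spin_quant", "native",        # documented in this README
    "--output_name", "llama3_2_spinquant.pte",
]
subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
```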
@@ -172,8 +174,6 @@ You can export and run the original Llama 3 8B instruct model.

Due to the larger vocabulary size of Llama 3, we recommend quantizing the embeddings with `--embedding-quantize 4,32` as shown above to further reduce the model size.

-3. SpinQuant [Optional]. If you want to improve accuracy, you can use [SpinQuant](https://github.com/facebookresearch/SpinQuant). Namely, (1) you can generate a new checkpoint via `31_optimize_rotation_executorch.sh` and `32_eval_ptq_executorch.sh` commands in [SpinQuant repo](https://github.com/facebookresearch/SpinQuant/tree/main?tab=readme-ov-file#3-export-to-executorch) (2) pass in an extra `--use_spin_quant native` argument in `export_llama` script above.
-
### Option C: Download and export stories110M model

If you want to deploy and run a smaller model for educational purposes. From `executorch` root:
