
Commit dbf876c

mergennachin authored and facebook-github-bot committed
Add animated gif for 3B SpinQuant (#5763)
Summary:
Pull Request resolved: #5763
- Add animated gif for 3B SpinQuant
- Remove bf16 numbers, for now, until we verify.

bypass-github-export-checks bypass-github-pytorch-ci-checks bypass-github-executorch-ci-checks allow-large-files

Differential Revision: D63649777
1 parent 9720715 commit dbf876c

File tree: 2 files changed (+16, -4 lines)


examples/models/llama2/README.md

Lines changed: 16 additions & 4 deletions
```diff
@@ -62,6 +62,18 @@ To improve accuracy, we can use [SpinQuant](https://github.com/facebookresearch/
 
 SpinQuant can generate quantized weights that are [compatible with ExecuTorch](https://github.com/facebookresearch/SpinQuant/tree/main?tab=readme-ov-file#3-export-to-executorch), specifically, it can be integrated with the existing optimized XNNPACK kernels (e.g., group-wise 4bit weight and 8bit dynamic activation). This allows developers to benefit from the higher accuracy of SpinQuant while also taking advantage of the strong performance of ExecuTorch acceleration. We enabled SpinQuant for Llama3.2 1B/3B models on ExecuTorch.
 
+<p align="center">
+<img src="./Android3_2_3B_SpinQuant.gif" width=300>
+<br>
+<em>
+Running Llama3.2 3B on Android phone.
+</em>
+<br>
+<em>
+4bit quantization using SpinQuant
+</em>
+</p>
+
 ## Enablement
 
 For Llama 3 8B and Llama3.1 8B, we have verified so far on iPhone 15 Pro, iPhone 15 Pro Max, Samsung Galaxy S24+ and OnePlus 12 (with 16GB RAM).
```
```diff
@@ -73,10 +85,10 @@ We have verified running Llama 2 7B [mobile applications](#step-6-build-mobile-a
 ### Llama 3.2 1B and 3B
 Llama 3.2 1B and 3B performance was measured on the OnePlus 12 device. The performance measurement is expressed in terms of tokens per second using an [adb binary-based approach](#step-5-run-benchmark-on) for generating 128 tokens.
 
-|Model | bf16 | 4bit(*) via SpinQuant
-|--------| ---------------------- | ---------------
-|1B | 19.4 tokens/second | 53.41 tokens/second |
-|3B | 7.76 tokens/second | 22.98 tokens/second |
+|Model | 4bit(*) via SpinQuant
+|--------| ---------------
+|1B | 53.41 tokens/second |
+|3B | 22.98 tokens/second |
 
 (*) With SpinQuant, we currently quantize 4-bit groupwise (with groupsize 32) weight, 8bit dynamic activation of all the linear layers of the model, except embedding and output layers. The embedding and output layers are quantized as 8-bit per-channel weight and 8-bit dynamic activation.
 
```
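The quantization recipe described in the (*) note can be illustrated with a small numeric sketch. This is a hypothetical illustration of symmetric 4-bit groupwise weight quantization with group size 32, not ExecuTorch's or SpinQuant's actual implementation; all function names here are made up:

```python
# Illustrative sketch (NOT the ExecuTorch/SpinQuant implementation) of
# symmetric 4-bit groupwise weight quantization with group size 32,
# as described in the (*) note above.
import numpy as np

GROUP_SIZE = 32  # per the note: groupwise with groupsize 32

def quantize_4bit_groupwise(w: np.ndarray):
    """Quantize a flat weight tensor to int4 range, one scale per group of 32."""
    assert w.size % GROUP_SIZE == 0
    groups = w.reshape(-1, GROUP_SIZE)
    # Signed 4-bit range is [-8, 7]; symmetric scaling by each group's max magnitude.
    scales = np.abs(groups).max(axis=1, keepdims=True) / 7.0
    scales = np.where(scales == 0, 1.0, scales)  # avoid divide-by-zero for all-zero groups
    q = np.clip(np.round(groups / scales), -8, 7).astype(np.int8)
    return q, scales

def dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Recover approximate float weights from quantized values and per-group scales."""
    return (q.astype(np.float32) * scales).reshape(-1)

w = np.random.randn(64).astype(np.float32)
q, s = quantize_4bit_groupwise(w)
w_hat = dequantize(q, s)
# Per-element reconstruction error is bounded by half a quantization step
# (0.5 * the group's scale).
```

At inference time the matching XNNPACK kernels consume such packed 4-bit weights directly together with dynamically quantized 8-bit activations; the sketch above only shows the weight side of the scheme.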

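For reference, the tokens-per-second figures in the table are simply the generated-token count divided by wall-clock generation time. A minimal sketch (the 2.4 s timing below is invented for illustration, not a measured number):

```python
# Hypothetical sketch of the tokens/second metric reported in the table above:
# generated-token count divided by wall-clock generation time.
def tokens_per_second(num_tokens: int, elapsed_seconds: float) -> float:
    return num_tokens / elapsed_seconds

# The benchmark generates 128 tokens; 2.4 s here is an invented timing.
rate = tokens_per_second(128, 2.4)
```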