Commit dfbf6fd

Mention torchao in llama readme page (#6487)
Reviewed By: TiRune
Differential Revision: D64914497
1 parent 5889cc3 commit dfbf6fd

1 file changed: +4 -0 lines changed

examples/models/llama/README.md

Lines changed: 4 additions & 0 deletions
@@ -47,6 +47,8 @@ Our quantization scheme involves three parts, applicable to both methods:
 - The classification layer is quantized to 8-bit per-channel for weight and 8-bit per token dynamic quantization for activation.
 - We employ an 8-bit per channel quantization for embedding.
 
+We use [torchao](https://github.com/pytorch/ao) library APIs to define these schemes.
+
 #### SpinQuant
 
 The SpinQuant method takes the original weights and produces optimized quantized weights with minimal outliers, resulting in higher accuracy. This can be achieved without any finetuning of the weights and only requires 100 iterations on a single A100 node.
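For reference, below is a minimal sketch of how a scheme like the one above (8-bit per-token dynamic activations with 8-bit per-channel weights, as used for the classification layer) can be expressed through torchao's `quantize_` API. This is an illustrative assumption, not the actual ExecuTorch export flow, and the layer shape is made up:

```python
# A minimal sketch, assuming torchao's quantize_ API; not the actual
# ExecuTorch export flow. The Linear layer is a hypothetical stand-in
# for the classification (output) layer described above.
import torch
from torchao.quantization.quant_api import (
    quantize_,
    int8_dynamic_activation_int8_weight,
)

# Illustrative module; shapes are made up for the example.
model = torch.nn.Sequential(torch.nn.Linear(4096, 128256))

# Weights become 8-bit per-channel quantized tensors; activations are
# quantized to 8 bits per token, dynamically, at runtime.
quantize_(model, int8_dynamic_activation_int8_weight())
```

By default `quantize_` applies the config to every `torch.nn.Linear` in the module; a filter function can be passed to restrict it to specific layers, such as only the classification head.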
@@ -103,6 +105,8 @@ For Llama 3 8B and Llama3.1 8B, we have verified so far on iPhone 15 Pro, iPhone
 
 We employed PTQ 4-bit groupwise per-token dynamic quantization of all the linear layers of the model. Dynamic quantization refers to quantizing activations dynamically, such that quantization parameters for activations are calculated, from the min/max range, at runtime. Here we quantized activations with 8 bits (signed integer). Furthermore, weights are statically quantized; in our case weights were per-channel groupwise quantized with 4-bit signed integers. Due to Llama3's vocabulary size, we had to quantize the embedding lookup table as well. For these results the embedding lookup table was groupwise quantized with 4 bits and a group size of 32.
 
+We use [torchao](https://github.com/pytorch/ao) library APIs to define these schemes.
+
 ### Accuracy
 
 We evaluated UncycloText perplexity using [LM Eval](https://github.com/EleutherAI/lm-evaluation-harness). Below are the results for two different groupsizes, with max_seq_length 2048, and limit 1000.
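Similarly, the 8-bit dynamic activation / 4-bit groupwise weight scheme described in this hunk maps onto torchao's `int8_dynamic_activation_int4_weight` config. A minimal sketch under the same assumptions (illustrative shapes, not the actual export path):

```python
# A minimal sketch, assuming torchao's int8_dynamic_activation_int4_weight
# config: 8-bit per-token dynamic activations, 4-bit groupwise weights.
# Not the actual ExecuTorch export flow; the layer shape is illustrative.
import torch
from torchao.quantization.quant_api import (
    quantize_,
    int8_dynamic_activation_int4_weight,
)

model = torch.nn.Sequential(torch.nn.Linear(4096, 4096))

# group_size sets how many weights along the input dimension share one
# quantization scale; 32 matches the group size quoted above. Smaller
# groups trade model size for accuracy.
quantize_(model, int8_dynamic_activation_int4_weight(group_size=32))
```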
