
Commit e9def32

mikekgfb authored and malfet committed
Update quantization.md (#659)
Update support tables and explanatory text.
1 parent 084bec1 commit e9def32

File tree

1 file changed

+19 -9 lines changed


docs/quantization.md

Lines changed: 19 additions & 9 deletions
@@ -10,7 +10,7 @@ While quantization can potentially degrade the model's performance, the methods
### Weight Quantization
| compression | FP Precision | bitwidth | group size | dynamic activation quantization | Eager | AOTI | ExecuTorch |
|--|--|--|--|--|--|--|--|
-| linear (asymmetric) | fp32, fp16, bf16 | [8, 4]* | [32, 64, 128, 256]** | ||| |
+| linear (asymmetric) | fp32, fp16, bf16 | [8, 4]* | [32, 64, 128, 256]** | ||| 🚧 |
| linear with dynamic activations (symmetric) | fp32^ | | [32, 64, 128, 256]** | a8w4dq | 🚧 |🚧 ||
| linear with GPTQ*** (asymmetric) | | |[32, 64, 128, 256]** | ||||
| linear with HQQ*** (asymmetric) | | |[32, 64, 128, 256]** | ||||
@@ -22,14 +22,23 @@ Due to the larger vocabulary size of llama3, we also recommend quantizing the em
|--|--|--|--|--|--|--|--|
| embedding (symmetric) | fp32, fp16, bf16 | [8, 4]* | [32, 64, 128, 256]** | ||||

-^a8w4dq quantization scheme requires model to be converted to fp32, due to lack of support for fp16 and bf16.
+^ The a8w4dq quantization scheme requires inputs to be converted to fp32, due to lack of support for fp16 and bf16.

-*These are the only valid bitwidth options.
+* These are the only valid bitwidth options.

-**There are many valid group size options, including 512, 1024, etc. Note that smaller groupsize tends to be better for preserving model quality and accuracy, and larger groupsize for further improving performance. Set 0 for channelwise quantization.
+** There are many valid group size options, including 512, 1024, etc. Note that a smaller groupsize tends to be better for preserving model quality and accuracy, while a larger groupsize further improves performance. Set 0 for channelwise quantization.

*** [GPTQ](https://arxiv.org/abs/2210.17323) and [HQQ](https://mobiusml.github.io/hqq_blog/) are two different algorithms to address accuracy loss when using lower bit quantization. Because HQQ relies on data-/calibration-free quantization, it tends to take less time to quantize the model.

+## Quantization Profiles
+Torchchat quantization supports profiles with multiple settings such as accelerator, dtype, and quantization specified in a JSON file. Four sample profiles are included with the torchchat distribution in config/data: `cuda.json`, `desktop.json`, `mobile.json`, and `pi5.json`, optimizing for execution on CUDA, desktop, mobile, and Raspberry Pi devices, respectively.
+
+In addition to the quantization recipes described below, the profiles also enable developers to specify the accelerator and dtype to be used.
+
+At present torchchat supports the fast, cuda, mps, and cpu devices. The default device in torchchat is "fast". The "fast" device is a virtual device that defaults to the fastest executor available on the system, selecting cuda, mps, and cpu in this order.
+
+At present torchchat supports the fast16, fast, bf16, fp16, and fp32 data types. The default data type for models is "fast16". The "fast16" data type is a virtual data type that defaults to the best 16-bit floating point data type available on the selected device. The "fast" data type is a virtual data type that defaults to the best floating point data type available on the selected device. ("Best" here means a combination of speed and accuracy.)
+
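By way of illustration, a profile combining these settings might look roughly like the sketch below. The `executor` and `precision` key names are assumptions rather than a copy of the shipped schema; the `quantize` entries mirror the ExecuTorch example later in this document. Consult the actual files in config/data for the authoritative layout.

```
{
    "executor": {"accelerator": "fast"},
    "precision": {"dtype": "fast16"},
    "quantize": {
        "embedding": {"bitwidth": 4, "groupsize": 32},
        "linear:a8w4dq": {"groupsize": 256}
    }
}
```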
## Quantization API
Quantization options are passed in JSON format, either as a config file (see [cuda.json](../config/data/cuda.json) and [mobile.json](../config/data/mobile.json)) or as a JSON string.

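Both forms go through the same `--quantize` flag. The sketch below assumes the flag also accepts a path to one of the profile files in addition to an inline JSON string; only the inline form appears verbatim in this document, so treat the file-path variant as an illustration and check the torchchat CLI help for exact behavior.

```
# Sketch: pass a profile file (path form assumed, not shown verbatim in this document)
python3 torchchat.py export llama3 --quantize config/data/mobile.json --output-pte-path llama3.pte

# Pass an inline JSON string (matches the ExecuTorch example below)
python3 torchchat.py export llama3 --quantize '{"embedding": {"bitwidth": 4, "groupsize":32}, "linear:a8w4dq": {"groupsize" : 256}}' --output-pte-path llama3.pte
```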
@@ -73,7 +82,7 @@ python3 generate.py llama3 --dso-path llama3.dso --prompt "Hello my name is"
```
### ExecuTorch
```
-python3 torchchat.py export llama3 --dtype fp32 --quantize '{"embedding": {"bitwidth": 4, "groupsize":32}, "linear:a8w4dq": {"groupsize" : 256}}' --output-pte-path llama3.pte
+python3 torchchat.py export llama3 --quantize '{"embedding": {"bitwidth": 4, "groupsize":32}, "linear:a8w4dq": {"groupsize" : 256}}' --output-pte-path llama3.pte

python3 generate.py llama3 --pte-path llama3.pte --prompt "Hello my name is"
```
@@ -82,12 +91,13 @@ python3 generate.py llama3 --pte-path llama3.pte --prompt "Hello my name is"
On top of quantizing models with the integer quantization schemes mentioned above, models can be converted to lower-bit floating point precision to reduce the memory bandwidth requirement and take advantage of the higher-density compute available. For example, many GPUs and some CPUs have good support for BFloat16 and Float16. This can be taken advantage of via the `--dtype` arg as shown below.

```
-python3 generate.py --dtype [bf16 | fp16 | fp32] ...
-python3 export.py --dtype [bf16 | fp16 | fp32] ...
+python3 generate.py --dtype [fast16 | fast | bf16 | fp16 | fp32] ...
+python3 export.py --dtype [fast16 | fast | bf16 | fp16 | fp32] ...
```

-Unlike gpt-fast which uses bfloat16 as default, torchchat uses float32 as the default. As a consequence you will have to set to --dtype bf16 or --dtype fp16 on server / desktop for best performance.
-Support for FP16 and BF16 is limited in many embedded processors. Additional ExecuTorch support for 16-bit floating point types may be added in the future based on hardware support.
+Unlike gpt-fast, which uses bfloat16 as its default, torchchat uses the dtype "fast16" as the default. Torchchat will pick the appropriate 16-bit floating point type that is available and offers the best performance (for execution with ExecuTorch, and on macOS/ARM and Linux/x86 platforms). On macOS, support depends on the OS version: versions 14.0 and later support bfloat16, while earlier versions fall back to float16, based on system support for these data types.
+
+Support for FP16 and BF16 is limited in many embedded processors, and --dtype fp32 may be required in some environments. Additional ExecuTorch support for 16-bit floating point types may be added in the future based on hardware support.
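For such environments, the precision can be pinned explicitly at export time. The command below simply re-adds --dtype fp32 to the ExecuTorch export example shown earlier in this diff; no flags beyond those already used in this document are assumed.

```
# Force fp32 on targets without usable fp16/bf16 support (same flags as the ExecuTorch example above, plus --dtype fp32)
python3 torchchat.py export llama3 --dtype fp32 --quantize '{"embedding": {"bitwidth": 4, "groupsize":32}, "linear:a8w4dq": {"groupsize" : 256}}' --output-pte-path llama3.pte
```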

## Adding additional quantization schemes
We invite contributors to submit established quantization schemes, with accuracy and performance results demonstrating soundness.
