# Update quantization.md #403

Merged (1 commit, Apr 23, 2024)
67 changes: 45 additions & 22 deletions docs/quantization.md
@@ -30,9 +30,32 @@ Support for FP16 and BF16 is limited in many embedded processors. Additional ex…

Next, we'll show you how to optimize your model for mobile execution (with ET) or get the most from your server or desktop hardware (with AOTI). The basic model build for mobile surfaces two issues: models quickly run out of memory, and execution can be slow. In this section, we show you how to fit your models into the limited memory of a mobile device and how to optimize execution speed -- both using quantization. This is the torchchat repo after all!
For high-performance devices such as GPUs, quantization reduces the memory bandwidth required to fetch model weights, letting you take advantage of the massive compute capabilities of today's server-based accelerators. Besides computing results faster by avoiding memory-bandwidth stalls, quantization also allows accelerators (which usually have a limited amount of memory) to store and process larger models than they otherwise could.
We can specify quantization parameters with the --quantize option. The quantize option takes a JSON/dictionary with quantizers and quantization options.
Both generate and export (for ET and AOTI) accept quantization options. We show only a subset of the possible combinations to avoid a combinatorial explosion.

## Quantization API

Model quantization recipes are specified by a JSON file or dict describing the quantizations to perform. Each quantization step names a higher-level quantization operator and supplies a dict with any options for that quantizer:

```
{
    "<quantizer1>": {
        "<quantizer1_option1>": value,
        "<quantizer1_option2>": value,
        ...
    },
    "<quantizer2>": {
        "<quantizer2_option1>": value,
        "<quantizer2_option2>": value,
        ...
    },
    ...
}
```
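
For example, a recipe that quantizes both embedding tables and linear operators to 8 bits with channelwise scales (both quantizers are described in the sections below; the schema above suggests they can be combined in a single recipe) might look like this illustrative sketch:

```
{
    "embedding": {"bitwidth": 8, "groupsize": 0},
    "linear:int8": {"bitwidth": 8, "groupsize": 0}
}
```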

The quantization recipe may be specified either on the command line as a single JSON string with `--quantize "<json string>"`, or as the name of a file containing the recipe as a JSON structure, with `--quantize filename.json`. We recommend storing longer recipes in a JSON file, while the inline variant may be more convenient for quick ad-hoc experiments.
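
As a sketch of the two invocation styles (the recipe file name recipe.json is a placeholder; the recipe shown is the embedding example used later in this document):

```
# Pass the recipe inline as a JSON string
python3 generate.py --checkpoint-path ${MODEL_PATH} --prompt "Hello, my name is" --quantize '{"embedding": {"bitwidth": 8, "groupsize": 0}}' --device cpu

# Or store the same recipe in a file and pass the file name
echo '{"embedding": {"bitwidth": 8, "groupsize": 0}}' > recipe.json
python3 generate.py --checkpoint-path ${MODEL_PATH} --prompt "Hello, my name is" --quantize recipe.json --device cpu
```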


## 8-Bit Embedding Quantization (channelwise & groupwise)
The simplest way to quantize embedding tables is with int8 "channelwise" (symmetric) quantization, where each value is represented by an 8-bit integer and a floating-point scale is stored either once per embedding row (channelwise quantization) or once per group of values within an embedding row (groupwise quantization).
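
Concretely, the two variants differ only in the groupsize option: 0 selects channelwise, while a positive value selects groupwise. As a rough illustration, assuming an embedding dimension of 4096 (a hypothetical size, not tied to any particular model), channelwise quantization stores one scale per embedding row, while groupsize 8 stores 4096 / 8 = 512 scales per row, trading extra metadata for finer-grained accuracy:

```
{"embedding": {"bitwidth": 8, "groupsize": 0}}
{"embedding": {"bitwidth": 8, "groupsize": 8}}
```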

@@ -42,13 +65,13 @@ We can do this in eager mode (optionally with torch.compile), we use the embeddi…

TODO: Write this so that someone can copy paste
```
-python3 generate.py [--compile] --checkpoint-path ${MODEL_PATH} --prompt "Hello, my name is" --quant '{"embedding" : {"bitwidth": 8, "groupsize": 0}}' --device cpu
+python3 generate.py [--compile] --checkpoint-path ${MODEL_PATH} --prompt "Hello, my name is" --quantize '{"embedding" : {"bitwidth": 8, "groupsize": 0}}' --device cpu

```

Then, export as follows with ExecuTorch:
```
-python3 export.py --checkpoint-path ${MODEL_PATH} -d fp32 --quant '{"embedding": {"bitwidth": 8, "groupsize": 0} }' --output-pte-path ${MODEL_OUT}/${MODEL_NAME}_emb8b-gw256.pte
+python3 export.py --checkpoint-path ${MODEL_PATH} -d fp32 --quantize '{"embedding": {"bitwidth": 8, "groupsize": 0} }' --output-pte-path ${MODEL_OUT}/${MODEL_NAME}_emb8b-gw256.pte
```

Now you can run your model with the same command as before:
@@ -60,13 +83,13 @@ python3 generate.py --pte-path ${MODEL_OUT}/${MODEL_NAME}_int8.pte --prompt "Hel…
We can do this in eager mode (optionally with torch.compile) by using the embedding quantizer and specifying the group size:

```
-python3 generate.py [--compile] --checkpoint-path ${MODEL_PATH} --prompt "Hello, my name is" --quant '{"embedding" : {"bitwidth": 8, "groupsize": 8}}' --device cpu
+python3 generate.py [--compile] --checkpoint-path ${MODEL_PATH} --prompt "Hello, my name is" --quantize '{"embedding" : {"bitwidth": 8, "groupsize": 8}}' --device cpu

```
Then, export as follows:

```
-python3 export.py --checkpoint-path ${MODEL_PATH} -d fp32 --quant '{"embedding": {"bitwidth": 8, "groupsize": 8} }' --output-pte-path ${MODEL_OUT}/${MODEL_NAME}_emb8b-gw256.pte
+python3 export.py --checkpoint-path ${MODEL_PATH} -d fp32 --quantize '{"embedding": {"bitwidth": 8, "groupsize": 8} }' --output-pte-path ${MODEL_OUT}/${MODEL_NAME}_emb8b-gw256.pte

```

@@ -82,12 +105,12 @@ Quantizing embedding tables with int4 provides even higher compression of embedd…
We can do this in eager mode (optionally with torch.compile) by using the embedding quantizer with groupsize set to 0, which selects channelwise quantization:

```
-python3 generate.py [--compile] --checkpoint-path ${MODEL_PATH} --prompt "Hello, my name is" --quant '{"embedding" : {"bitwidth": 4, "groupsize": 0}}' --device cpu
+python3 generate.py [--compile] --checkpoint-path ${MODEL_PATH} --prompt "Hello, my name is" --quantize '{"embedding" : {"bitwidth": 4, "groupsize": 0}}' --device cpu
```

Then, export as follows:
```
-python3 export.py --checkpoint-path ${MODEL_PATH} -d fp32 --quant '{"embedding": {"bitwidth": 4, "groupsize": 0} }' --output-pte-path ${MODEL_OUT}/${MODEL_NAME}_emb8b-gw256.pte
+python3 export.py --checkpoint-path ${MODEL_PATH} -d fp32 --quantize '{"embedding": {"bitwidth": 4, "groupsize": 0} }' --output-pte-path ${MODEL_OUT}/${MODEL_NAME}_emb8b-gw256.pte
```

Now you can run your model with the same command as before:
@@ -100,12 +123,12 @@ python3 generate.py --pte-path ${MODEL_OUT}/${MODEL_NAME}_int8.pte --prompt "Hel…
We can do this in eager mode (optionally with torch.compile) by using the embedding quantizer and specifying the group size:

```
-python3 generate.py [--compile] --checkpoint-path ${MODEL_PATH} --prompt "Hello, my name is" --quant '{"embedding" : {"bitwidth": 4, "groupsize": 8}}' --device cpu
+python3 generate.py [--compile] --checkpoint-path ${MODEL_PATH} --prompt "Hello, my name is" --quantize '{"embedding" : {"bitwidth": 4, "groupsize": 8}}' --device cpu
```

Then, export as follows:
```
-python3 export.py --checkpoint-path ${MODEL_PATH} -d fp32 --quant '{"embedding": {"bitwidth": 4, "groupsize": 0} }' --output-pte-path ${MODEL_OUT}/${MODEL_NAME}_emb8b-gw256.pte
+python3 export.py --checkpoint-path ${MODEL_PATH} -d fp32 --quantize '{"embedding": {"bitwidth": 4, "groupsize": 8} }' --output-pte-path ${MODEL_OUT}/${MODEL_NAME}_emb8b-gw256.pte
```

Now you can run your model with the same command as before:
@@ -124,13 +147,13 @@ The simplest way to quantize embedding tables is with int8 groupwise quantizatio…
We can do this in eager mode (optionally with torch.compile) by using the linear:int8 quantizer with groupsize set to 0, which selects channelwise quantization:

```
-python3 generate.py [--compile] --checkpoint-path ${MODEL_PATH} --prompt "Hello, my name is" --quant '{"linear:int8" : {"bitwidth": 8, "groupsize": 0}}' --device cpu
+python3 generate.py [--compile] --checkpoint-path ${MODEL_PATH} --prompt "Hello, my name is" --quantize '{"linear:int8" : {"bitwidth": 8, "groupsize": 0}}' --device cpu
```

Then, export as follows using ExecuTorch for mobile backends:

```
-python3 export.py --checkpoint-path ${MODEL_PATH} -d fp32 --quant '{"linear:int8": {"bitwidth": 8, "groupsize": 0} }' --output-pte-path ${MODEL_OUT}/${MODEL_NAME}_int8.pte
+python3 export.py --checkpoint-path ${MODEL_PATH} -d fp32 --quantize '{"linear:int8": {"bitwidth": 8, "groupsize": 0} }' --output-pte-path ${MODEL_OUT}/${MODEL_NAME}_int8.pte
```

Now you can run your model with the same command as before:
@@ -142,7 +165,7 @@ python3 generate.py --pte-path ${MODEL_OUT}/${MODEL_NAME}_int8.pte --checkpoint-…
Or, export as follows for server/desktop deployments:

```
-python3 export.py --checkpoint-path ${MODEL_PATH} -d fp32 --quant '{"linear:int8": {"bitwidth": 8, "groupsize": 0} }' --output-pte-path ${MODEL_OUT}/${MODEL_NAME}_int8.so
+python3 export.py --checkpoint-path ${MODEL_PATH} -d fp32 --quantize '{"linear:int8": {"bitwidth": 8, "groupsize": 0} }' --output-dso-path ${MODEL_OUT}/${MODEL_NAME}_int8.so
```

Now you can run your model with the same command as before:
@@ -155,12 +178,12 @@ python3 generate.py --dso-path ${MODEL_OUT}/${MODEL_NAME}_int8.so --checkpoint-p…
We can do this in eager mode (optionally with torch.compile) by using the linear:int8 quantizer and specifying the group size:

```
-python3 generate.py [--compile] --checkpoint-path ${MODEL_PATH} --prompt "Hello, my name is" --quant '{"linear:int8" : {"bitwidth": 8, "groupsize": 8}}' --device cpu
+python3 generate.py [--compile] --checkpoint-path ${MODEL_PATH} --prompt "Hello, my name is" --quantize '{"linear:int8" : {"bitwidth": 8, "groupsize": 8}}' --device cpu
```
Then, export as follows using ExecuTorch:

```
-python3 export.py --checkpoint-path ${MODEL_PATH} -d fp32 --quant '{"linear:int8": {"bitwidth": 8, "groupsize": 0} }' --output-pte-path ${MODEL_OUT}/${MODEL_NAME}_int8-gw256.pte
+python3 export.py --checkpoint-path ${MODEL_PATH} -d fp32 --quantize '{"linear:int8": {"bitwidth": 8, "groupsize": 8} }' --output-pte-path ${MODEL_OUT}/${MODEL_NAME}_int8-gw256.pte
```

**Now you can run your model with the same command as before:**
@@ -170,7 +193,7 @@ python3 generate.py --pte-path ${MODEL_OUT}/${MODEL_NAME}_int8-gw256.pte --check…
```
*Or, export as follows for server/desktop deployments:*
```
-python3 export.py --checkpoint-path ${MODEL_PATH} -d fp32 --quant '{"linear:int8": {"bitwidth": 8, "groupsize": 0} }' --output-dso-path ${MODEL_OUT}/${MODEL_NAME}_int8-gw256.so
+python3 export.py --checkpoint-path ${MODEL_PATH} -d fp32 --quantize '{"linear:int8": {"bitwidth": 8, "groupsize": 8} }' --output-dso-path ${MODEL_OUT}/${MODEL_NAME}_int8-gw256.so
```

Now you can run your model with the same command as before:
@@ -187,11 +210,11 @@ To compress your model even more, 4-bit integer quantization may be used. To ach…
We can do this in eager mode (optionally with torch.compile) by using the linear:int4 quantizer and specifying the group size:

```
-python3 generate.py [--compile] --checkpoint-path ${MODEL_PATH} --prompt "Hello, my name is" --quant '{"linear:int4" : {"groupsize": 32}}' --device [ cpu | cuda | mps ]
+python3 generate.py [--compile] --checkpoint-path ${MODEL_PATH} --prompt "Hello, my name is" --quantize '{"linear:int4" : {"groupsize": 32}}' --device [ cpu | cuda | mps ]
```

```
-python3 export.py --checkpoint-path ${MODEL_PATH} -d fp32 --quant "{'linear:int4': {'groupsize' : 32} }" [ --output-pte-path ${MODEL_OUT}/${MODEL_NAME}_int4-gw32.pte | --output-dso-path ${MODEL_OUT}/${MODEL_NAME}_int4-gw32.dso]
+python3 export.py --checkpoint-path ${MODEL_PATH} -d fp32 --quantize "{'linear:int4': {'groupsize' : 32} }" [ --output-pte-path ${MODEL_OUT}/${MODEL_NAME}_int4-gw32.pte | --output-dso-path ${MODEL_OUT}/${MODEL_NAME}_int4-gw32.dso]
```
Now you can run your model with the same command as before:

@@ -204,7 +227,7 @@ To compress your model even more, 4-bit integer quantization may be used. To ach…

**TODO (Digant): a8w4dq eager mode support [#335](https://github.com/pytorch/torchchat/issues/335)**
```
-python3 export.py --checkpoint-path ${MODEL_PATH} -d fp32 --quant "{'linear:a8w4dq': {'groupsize' : 7} }" [ --output-pte-path ${MODEL_OUT}/${MODEL_NAME}_8da4w.pte | ...dso... ]
+python3 export.py --checkpoint-path ${MODEL_PATH} -d fp32 --quantize "{'linear:a8w4dq': {'groupsize' : 7} }" [ --output-pte-path ${MODEL_OUT}/${MODEL_NAME}_8da4w.pte | ...dso... ]
```

Now you can run your model with the same command as before:
@@ -220,11 +243,11 @@ Compression offers smaller memory footprints (to fit on memory-constrained accel…

We can use GPTQ with eager execution, optionally in conjunction with torch.compile:
```
-python3 generate.py [--compile] --checkpoint-path ${MODEL_PATH} --prompt "Hello, my name is" --quant '{"linear:int4" : {"groupsize": 32}}' --device [ cpu | cuda | mps ]
+python3 generate.py [--compile] --checkpoint-path ${MODEL_PATH} --prompt "Hello, my name is" --quantize '{"linear:int4" : {"groupsize": 32}}' --device [ cpu | cuda | mps ]
```

```
-python3 export.py --checkpoint-path ${MODEL_PATH} -d fp32 --quant "{'linear:gptq': {'groupsize' : 32} }" [ --output-pte-path ${MODEL_OUT}/${MODEL_NAME}_gptq.pte | ...dso... ]
+python3 export.py --checkpoint-path ${MODEL_PATH} -d fp32 --quantize "{'linear:gptq': {'groupsize' : 32} }" [ --output-pte-path ${MODEL_OUT}/${MODEL_NAME}_gptq.pte | ...dso... ]
```
Now you can run your model with the same command as before:

@@ -240,11 +263,11 @@ Compression offers smaller memory footprints (to fit on memory-constrained accel…

We can use HQQ with eager execution, optionally in conjunction with torch.compile:
```
-python3 generate.py [--compile] --checkpoint-path ${MODEL_PATH} --prompt "Hello, my name is" --quant '{"linear:hqq" : {"groupsize": 32}}' --device [ cpu | cuda | mps ]
+python3 generate.py [--compile] --checkpoint-path ${MODEL_PATH} --prompt "Hello, my name is" --quantize '{"linear:hqq" : {"groupsize": 32}}' --device [ cpu | cuda | mps ]
```

```
-python3 export.py --checkpoint-path ${MODEL_PATH} -d fp32 --quant "{'linear:hqq': {'groupsize' : 32} }" [ --output-pte-path ${MODEL_OUT}/${MODEL_NAME}_hqq.pte | ...dso... ]
+python3 export.py --checkpoint-path ${MODEL_PATH} -d fp32 --quantize "{'linear:hqq': {'groupsize' : 32} }" [ --output-pte-path ${MODEL_OUT}/${MODEL_NAME}_hqq.pte | ...dso... ]
```
Now you can run your model with the same command as before:
