
Commit 956504f

mikekgfb authored and seemethere committed
Run quantization.md document from docs/ (#718)
* improve updown parser, and use in README.md execution
* cut/paste errors
* typo: true -> false
* we scan each partial line, so need to suppress at partial line level :(
* make it twice as nice
* improved updown parsing
* special handling for lines w/o option
* enable run on quantization doc
* handle white space before trip backtick
* updates
* mps test
* updates
* Update run-readme-pr-macos.yml: rename test to avoid name conflict
* Update run-readme-pr.yml: Y
* Update run-readme-pr-mps.yml: 2
* typos
* add updown end command
* typo
* move broken mps
* Update parking_lot/run-readme-pr-mps.yml (Co-authored-by: Eli Uriegas <[email protected]>)

---------

Co-authored-by: Eli Uriegas <[email protected]>
1 parent 0bbf0be commit 956504f

6 files changed: +291, -41 lines changed


.github/workflows/run-readme-periodic.yml

Lines changed: 23 additions & 7 deletions
@@ -34,19 +34,35 @@ jobs:
 # )
 # echo "::endgroup::"
 
-echo "::group::Create script"
-python3 scripts/updown.py --file README.md > ./we-run-this.sh
+echo "::group::Create script to run README"
+python3 scripts/updown.py --file README.md > ./run-readme.sh
 # for good measure, if something happened to updown processor,
 # and it did not error out, fail with an exit 1
-echo "exit 1" >> ./we-run-this.sh
+echo "exit 1" >> ./run-readme.sh
 echo "::endgroup::"
 
-echo "::group::Run This"
+echo "::group::Run README"
 echo "*******************************************"
-cat ./we-run-this.sh
+cat ./run-readme.sh
 echo "*******************************************"
-bash -x ./we-run-this.sh
-
+bash -x ./run-readme.sh
+echo "::endgroup::"
+
+echo "::group::Create script to run quantization"
+python3 scripts/updown.py --file docs/quantization.md > ./run-quantization.sh
+# for good measure, if something happened to updown processor,
+# and it did not error out, fail with an exit 1
+echo "exit 1" >> ./run-quantization.sh
+echo "::endgroup::"
+
+echo "::group::Run quantization"
+echo "*******************************************"
+cat ./run-quantization.sh
+echo "*******************************************"
+bash -x ./run-quantization.sh
+echo "::endgroup::"
+
+echo "::group::Completion"
 echo "tests complete"
 echo "*******************************************"
 echo "::endgroup::"

.github/workflows/run-readme-pr-macos.yml

Lines changed: 64 additions & 10 deletions
@@ -7,7 +7,7 @@ on:
 workflow_dispatch:
 jobs:
 test-readme-macos:
-runs-on: macos-14-xlarge
+runs-on: macos-14-xlarge
 steps:
 - name: Checkout code
 uses: actions/checkout@v2
@@ -37,20 +37,74 @@ jobs:
 # yum install -y devtoolset-10-binutils
 # export PATH=/opt/rh/devtoolset-10/root/usr/bin/:$PATH
 # echo "::endgroup::"
-
-echo "::group::Create script"
-python3 scripts/updown.py --file README.md --replace 'llama3:stories15M,-l 3:-l 2,meta-llama/Meta-Llama-3-8B-Instruct:stories15M' --suppress huggingface-cli,HF_TOKEN > ./we-run-this.sh
-# for good measure, if something happened to updown processor,
+
+echo "::group::Create script to run README"
+python3 scripts/updown.py --file README.md --replace 'llama3:stories15M,-l 3:-l 2,meta-llama/Meta-Llama-3-8B-Instruct:stories15M' --suppress huggingface-cli,HF_TOKEN > ./run-readme.sh
+# for good measure, if something happened to updown processor,
 # and it did not error out, fail with an exit 1
-echo "exit 1" >> ./we-run-this.sh
+echo "exit 1" >> ./run-readme.sh
 echo "::endgroup::"
-
-echo "::group::Run This"
+
+echo "::group::Run README"
+echo "*******************************************"
+cat ./run-readme.sh
 echo "*******************************************"
-cat ./we-run-this.sh
+bash -x ./run-readme.sh
+echo "::endgroup::"
+
+echo "::group::Completion"
+echo "tests complete"
 echo "*******************************************"
-bash -x ./we-run-this.sh
+echo "::endgroup::"
 
+
+test-quantization-macos:
+runs-on: macos-14-xlarge
+steps:
+- name: Checkout code
+uses: actions/checkout@v2
+- uses: actions/setup-python@v4
+with:
+python-version: '3.10.11'
+- name: Setup Xcode
+if: runner.os == 'macOS'
+uses: maxim-lobanov/setup-xcode@v1
+with:
+xcode-version: '15.3'
+- name: Run script
+run: |
+set -x
+# NS: Remove previous installation of torch first
+# as this script does not isntall anything into conda env but rather as system dep
+pip3 uninstall -y torch || true
+set -eou pipefail
+
+echo "::group::Print machine info"
+uname -a
+sysctl machdep.cpu.brand_string
+sysctl machdep.cpu.core_count
+echo "::endgroup::"
+
+# echo "::group::Install newer objcopy that supports --set-section-alignment"
+# yum install -y devtoolset-10-binutils
+# export PATH=/opt/rh/devtoolset-10/root/usr/bin/:$PATH
+# echo "::endgroup::"
+
+echo "::group::Create script to run quantization"
+python3 scripts/updown.py --file docs/quantization.md --replace llama3:stories15M --suppress huggingface-cli,HF_TOKEN > ./run-quantization.sh
+# for good measure, if something happened to updown processor,
+# and it did not error out, fail with an exit 1
+echo "exit 1" >> ./run-quantization.sh
+echo "::endgroup::"
+
+echo "::group::Run quantization"
+echo "*******************************************"
+cat ./run-quantization.sh
+echo "*******************************************"
+bash -x ./run-quantization.sh
+echo "::endgroup::"
+
+echo "::group::Completion"
 echo "tests complete"
 echo "*******************************************"
 echo "::endgroup::"

.github/workflows/run-readme-pr.yml

Lines changed: 44 additions & 6 deletions
@@ -25,19 +25,57 @@ jobs:
 # export PATH=/opt/rh/devtoolset-10/root/usr/bin/:$PATH
 # echo "::endgroup::"
 
-echo "::group::Create script"
-python3 scripts/updown.py --file README.md --replace 'llama3:stories15M,-l 3:-l 2,meta-llama/Meta-Llama-3-8B-Instruct:stories15M' --suppress huggingface-cli,HF_TOKEN > ./we-run-this.sh
+echo "::group::Create script to run README"
+python3 scripts/updown.py --file README.md --replace 'llama3:stories15M,-l 3:-l 2,meta-llama/Meta-Llama-3-8B-Instruct:stories15M' --suppress huggingface-cli,HF_TOKEN > ./run-readme.sh
 # for good measure, if something happened to updown processor,
 # and it did not error out, fail with an exit 1
-echo "exit 1" >> ./we-run-this.sh
+echo "exit 1" >> ./run-readme.sh
 echo "::endgroup::"
 
-echo "::group::Run This"
+echo "::group::Run README"
 echo "*******************************************"
-cat ./we-run-this.sh
+cat ./run-readme.sh
 echo "*******************************************"
-bash -x ./we-run-this.sh
+bash -x ./run-readme.sh
+echo "::endgroup::"
+
+echo "::group::Completion"
+echo "tests complete"
+echo "*******************************************"
+echo "::endgroup::"
+
+test-quantization-any:
+uses: pytorch/test-infra/.github/workflows/linux_job.yml@main
+with:
+runner: linux.g5.4xlarge.nvidia.gpu
+gpu-arch-type: cuda
+gpu-arch-version: "12.1"
+timeout: 60
+script: |
+echo "::group::Print machine info"
+uname -a
+echo "::endgroup::"
+
+# echo "::group::Install newer objcopy that supports --set-section-alignment"
+# yum install -y devtoolset-10-binutils
+# export PATH=/opt/rh/devtoolset-10/root/usr/bin/:$PATH
+# echo "::endgroup::"
+
+echo "::group::Create script to run quantization"
+python3 scripts/updown.py --file docs/quantization.md --replace llama3:stories15M --suppress huggingface-cli,HF_TOKEN > ./run-quantization.sh
+# for good measure, if something happened to updown processor,
+# and it did not error out, fail with an exit 1
+echo "exit 1" >> ./run-quantization.sh
+echo "::endgroup::"
+
+echo "::group::Run quantization"
+echo "*******************************************"
+cat ./run-quantization.sh
+echo "*******************************************"
+bash -x ./run-quantization.sh
+echo "::endgroup::"
 
+echo "::group::Completion"
 echo "tests complete"
 echo "*******************************************"
 echo "::endgroup::"

docs/quantization.md

Lines changed: 63 additions & 17 deletions
@@ -1,6 +1,9 @@
 
 # Quantization
 
+[shell default]: HF_TOKEN="${SECRET_HF_TOKEN_PERIODIC}" huggingface-cli login
+[shell default]: TORCHCHAT_ROOT=${PWD} ./scripts/install_et.sh
+
 ## Introduction
 Quantization focuses on reducing the precision of model parameters and computations from floating-point to lower-bit integers, such as 8-bit integers. This approach aims to minimize memory requirements, accelerate inference speeds, and decrease power consumption, making models more feasible for deployment on edge devices with limited computational resources. For high-performance devices such as GPUs, quantization provides a way to reduce the required memory bandwidth and take advantage of the massive compute capabilities provided by today's server-based accelerators such as GPUs.
 
@@ -13,36 +16,68 @@ While quantization can potentially degrade the model's performance, the methods
 | linear (asymmetric) | fp32, fp16, bf16 | [8, 4]* | [32, 64, 128, 256]** | ||| 🚧 |
 | linear with GPTQ*** (asymmetric) | | |[32, 64, 128, 256]** | ||||
 | linear with HQQ*** (asymmetric) | | |[32, 64, 128, 256]** | ||||
-| linear with dynamic activations (symmetric) | fp32^ | | [32, 64, 128, 256] | a8w4dq | 🚧 |🚧 ||
+| linear with dynamic activations (symmetric) | fp32^ | | [32, 64, 128, 256]* | a8w4dq | 🚧 |🚧 ||
 
 ### Embedding Quantization
-Due to the larger vocabulary size of llama3, we also recommend quantizing the embeddings to further reduce the model size for on-device usecases.
+
+Due to the larger vocabulary size of llama3, we also recommend
+quantizing the embeddings to further reduce the model size for
+on-device usecases.
 
 | compression | FP Precision | weight quantization (bitwidth)| weight quantization (group size) | dynamic activation quantization | Eager | AOTI | ExecuTorch |
 |--|--|--|--|--|--|--|--|
-| embedding (symmetric) | fp32, fp16, bf16 | [8, 4]* | [32, 64, 128, 256]** | ||||
+| embedding (symmetric) | fp32, fp16, bf16 | [8, 4]* | [ any > 1 ] | ||||
 
-^a8w4dq quantization scheme requires model to be converted to fp32, due to lack of support for fp16 and bf16 in the kernels provided with ExecuTorch.
+^ a8w4dq quantization scheme requires model to be converted to fp32,
+due to lack of support for fp16 and bf16 in the kernels provided with
+ExecuTorch.
 
 * These are the only valid bitwidth options.
 
-** There are many valid group size options, including 512, 1024, etc. Note that smaller groupsize tends to be better for preserving model quality and accuracy, and larger groupsize for further improving performance. Set 0 for channelwise quantization.
+** There are many valid group size options, including 512, 1024,
+etc. Note that smaller groupsize tends to be better for preserving
+model quality and accuracy, and larger groupsize for further
+improving performance. Set 0 for channelwise quantization.
 
-*** [GPTQ](https://arxiv.org/abs/2210.17323) and [HQQ](https://mobiusml.github.io/hqq_blog/) are two different algorithms to address accuracy loss when using lower bit quantization. Due to HQQ relying on data/calibration free quantization, it tends to take less time to quantize model.
+*** [GPTQ](https://arxiv.org/abs/2210.17323) and
+[HQQ](https://mobiusml.github.io/hqq_blog/) are two different
+algorithms to address accuracy loss when using lower bit
+quantization. Due to HQQ relying on data/calibration free
+quantization, it tends to take less time to quantize model.
 
 ## Quantization Profiles
-Torchchat quantization supports profiles with multiple settings such as accelerator, dtype, and quantization specified in a JSON file. Four sample profiles are included wwith the torchchat distributin in config/data: `cuda.json`, `desktop.json`, `mobile.json`, `pi5.json` with profiles optimizing for execution on cuda, desktop, mobile and raspberry Pi devices.
-
-In addition to quantization recipes described below, the profiles also enable developers to specify the accelerator and dtype to be used.
-
-At present torchchat supports the fast, cuda, mps, and cpu devices. The default device in torchchat is "fast". The "fast" device is a virtual device that defaults to the fastest executor available in the system, selecting cuda, mps, and cpu in this order.
 
-At present torchchat supports the fast16, fast, bf16, fp16 and fp32 data types. The default data type for models is "fast16". The "fast16" data type is a virtual data type that defaults to the best 16-bit floating point data type available on the selected device. The "fast" data type is a virtual data type that defaults to the best floating point data type available on the selected device. ("Best" tangibly representing a combination of speed and accuracy.)
+Torchchat quantization supports profiles with multiple settings such
+as accelerator, dtype, and quantization specified in a JSON file.
+Four sample profiles are included wwith the torchchat distributin in
+config/data: `cuda.json`, `desktop.json`, `mobile.json`, `pi5.json`
+with profiles optimizing for execution on cuda, desktop, mobile and
+raspberry Pi devices.
+
+In addition to quantization recipes described below, the profiles also
+enable developers to specify the accelerator and dtype to be used.
+
+At present torchchat supports the fast, cuda, mps, and cpu devices.
+The default device in torchchat is "fast". The "fast" device is a
+virtual device that defaults to the fastest executor available in the
+system, selecting cuda, mps, and cpu in this order.
+
+At present torchchat supports the fast16, fast, bf16, fp16 and fp32
+data types. The default data type for models is "fast16". The
+"fast16" data type is a virtual data type that defaults to the best
+16-bit floating point data type available on the selected device. The
+"fast" data type is a virtual data type that defaults to the best
+floating point data type available on the selected device. ("Best"
+tangibly representing a combination of speed and accuracy.)
 
 ## Quantization API
-Quantization options are passed in json format either as a config file (see [cuda.json](../config/data/cuda.json) and [mobile.json](../config/data/mobile.json)) or a JSON string.
 
-The expected JSON format is described below. Refer to the tables above for valid `bitwidth` and `groupsize` values.
+Quantization options are passed in json format either as a config file
+(see [cuda.json](../config/data/cuda.json) and
+[mobile.json](../config/data/mobile.json)) or a JSON string.
+
+The expected JSON format is described below. Refer to the tables above
+for valid `bitwidth` and `groupsize` values.
 
 | compression | JSON string |
 |--|--|
@@ -57,6 +92,7 @@ See the available quantization schemes [here](https://github.com/pytorch/torchch
 ## Examples
 We can mix and match weight quantization with embedding quantization.
 
+[skip default]: begin
 * Config file
 ```
 --quantize quant_config.json
@@ -69,16 +105,22 @@ We can mix and match weight quantization with embedding quantization.
 ```
 --quantize '{"embedding": {"bitwidth": 4, "groupsize":32}, "linear:a8w4dq": {"groupsize" : 256}}'
 ```
-Quantization recipes can be applied in conjunction with any of the `chat`, `generate`, `browser` and `export` commands. Below are examples showcasing eager mode with `generate` and AOTI and ExecuTorch with `export`.
+[skip default]: end
+
+Quantization recipes can be applied in conjunction with any of the
+`chat`, `generate`, `browser` and `export` commands. Below are
+examples showcasing eager mode with `generate` and AOTI and ExecuTorch
+with `export`.
+
 ### Eager mode
 ```
 python3 generate.py [--compile] llama3 --prompt "Hello, my name is" --quantize '{"embedding" : {"bitwidth": 8, "groupsize": 0}}' --device cpu
 ```
 ### AOTI
 ```
-python3 torchchat.py export llama3 --quantize '{"embedding": {"bitwidth": 4, "groupsize":32}, "linear:int4": {"groupsize" : 256}}' --output-dso-path llama3.dso
+python3 torchchat.py export llama3 --quantize '{"embedding": {"bitwidth": 4, "groupsize":32}, "linear:int4": {"groupsize" : 256}}' --output-dso-path llama3.so
 
-python3 generate.py llama3 --dso-path llama3.dso --prompt "Hello my name is"
+python3 generate.py llama3 --dso-path llama3.so --prompt "Hello my name is"
 ```
 ### ExecuTorch
 ```
@@ -90,10 +132,12 @@ python3 generate.py llama3 --pte-path llama3.pte --prompt "Hello my name is"
 ## Model precision (dtype precision setting)
 On top of quantizing models with integer quantization schemes mentioned above, models can be converted to lower bit floating point precision to reduce the memory bandwidth requirement and take advantage of higher density compute available. For example, many GPUs and some of the CPUs have good support for BFloat16 and Float16. This can be taken advantage of via `--dtype` arg as shown below.
 
+[skip default]: begin
 ```
 python3 generate.py --dtype [ fast16 | fast | bf16 | fp16 | fp32] ...
 python3 export.py --dtype [ fast16 | fast | bf16 | fp16 | fp32] ...
 ```
+[skip default]: end
 
 Unlike gpt-fast which uses bfloat16 as default, torchchat uses the dtype "fast16" as the default. Torchchat will pick the appropriate 16-bit floating point type available and offering the best performance (for execution with Executorch, macOS/ARM and Linux/x86 platforms). For macOS, support depends on the OS version, with versions starting with 14.0 supporting bfloat16 as support, and float16 for earlier OS version based on system support for these data types.
 
@@ -109,3 +153,5 @@ We invite contributors to submit established quantization schemes, with accuracy
 - Quantization reference, describe options for --quantize parameter
 - Show a table with performance/accuracy metrics
 - Quantization support matrix? torchchat Quantization Support Matrix
+
+[end default]: end
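The new `[shell default]: ...`, `[skip default]: begin`/`end`, and `[end default]: end` annotations added to docs/quantization.md look like markdown link-reference lines, so they render invisibly on GitHub while being picked up by the updown extractor. A plausible reading, sketched below, is that `[shell default]` injects a command directly into the generated script, `[skip default]` fences off code blocks that should not run in the default pass, and `[end default]` stops extraction. This is a hypothetical illustration of that behavior under those assumptions, not the real scripts/updown.py.

```python
import re

# Hypothetical interpretation of the updown markers in docs/quantization.md;
# the real scripts/updown.py may differ in details.
MARKER = re.compile(r"^\[(shell|skip|end) default\]:\s*(.*)$")

def extract_commands(md_text):
    commands, skipping, in_code = [], False, False
    for line in md_text.splitlines():
        m = MARKER.match(line.strip())
        if m:
            kind, arg = m.group(1), m.group(2).strip()
            if kind == "shell":      # emit this command verbatim
                commands.append(arg)
            elif kind == "skip":     # begin/end a region whose code blocks are not run
                skipping = (arg == "begin")
            elif kind == "end":      # stop extracting altogether
                break
            continue
        # tolerate leading whitespace before the triple backtick
        if line.strip().startswith("```"):
            in_code = not in_code
            continue
        if in_code and not skipping:
            commands.append(line)
    return commands
```

Under this reading, the `[shell default]` lines at the top of the doc log into Hugging Face and install ExecuTorch before any of the doc's own commands run, which is what lets CI execute the document end to end.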

0 commit comments
