
Commit d07f3c9

Update quantization.md (#369)
python => python3
1 parent 5a0b621 commit d07f3c9

File tree

docs/quantization.md

Lines changed: 35 additions & 35 deletions
@@ -19,8 +19,8 @@ You can generate models (for both export and generate, with eager, torch.compile
 
 TODO: These need to be commands that can be copy paste
 ```
-python generate.py --dtype [bf16 | fp16 | fp32] ...
-python export.py --dtype [bf16 | fp16 | fp32] ...
+python3 generate.py --dtype [bf16 | fp16 | fp32] ...
+python3 export.py --dtype [bf16 | fp16 | fp32] ...
 ```
 
 Unlike gpt-fast which uses bfloat16 as default, torchchat uses float32 as the default. As a consequence you will have to set --dtype bf16 or --dtype fp16 on server / desktop for best performance.
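The commands in this hunk (and in the rest of the diff) assume shell variables such as ${MODEL_PATH}, ${MODEL_OUT} and ${MODEL_NAME} are already set. A minimal setup sketch, with purely hypothetical paths and an assumed example model name, might look like this:

```bash
# Hypothetical setup for the variables used throughout docs/quantization.md.
# The model name and paths below are placeholders, not taken from this diff.
MODEL_NAME=stories15M
MODEL_PATH=checkpoints/${MODEL_NAME}/model.pth
MODEL_OUT=/tmp/torchchat-exports
mkdir -p "${MODEL_OUT}"

# Per the note above, bf16 (or fp16) is the better --dtype on server / desktop.
python3 generate.py --dtype bf16 --checkpoint-path "${MODEL_PATH}" --prompt "Hello, my name is"
```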
@@ -42,37 +42,37 @@ We can do this in eager mode (optionally with torch.compile), we use the embeddi
 
 TODO: Write this so that someone can copy paste
 ```
-python generate.py [--compile] --checkpoint-path ${MODEL_PATH} --prompt "Hello, my name is" --quant '{"embedding" : {"bitwidth": 8, "groupsize": 0}}' --device cpu
+python3 generate.py [--compile] --checkpoint-path ${MODEL_PATH} --prompt "Hello, my name is" --quant '{"embedding" : {"bitwidth": 8, "groupsize": 0}}' --device cpu
 
 ```
 
 Then, export as follows with ExecuTorch:
 ```
-python export.py --checkpoint-path ${MODEL_PATH} -d fp32 --quant '{"embedding": {"bitwidth": 8, "groupsize": 0} }' --output-pte-path ${MODEL_OUT}/${MODEL_NAME}_emb8b-gw256.pte
+python3 export.py --checkpoint-path ${MODEL_PATH} -d fp32 --quant '{"embedding": {"bitwidth": 8, "groupsize": 0} }' --output-pte-path ${MODEL_OUT}/${MODEL_NAME}_emb8b-gw256.pte
 ```
 
 Now you can run your model with the same command as before:
 ```
-python generate.py --pte-path ${MODEL_OUT}/${MODEL_NAME}_int8.pte --prompt "Hello my name is"
+python3 generate.py --pte-path ${MODEL_OUT}/${MODEL_NAME}_int8.pte --prompt "Hello my name is"
 ```
 
 *Groupwise quantization:*
 We can do this in eager mode (optionally with torch.compile), we use the embedding quantizer by specifying the group size:
 
 ```
-python generate.py [--compile] --checkpoint-path ${MODEL_PATH} --prompt "Hello, my name is" --quant '{"embedding" : {"bitwidth": 8, "groupsize": 8}}' --device cpu
+python3 generate.py [--compile] --checkpoint-path ${MODEL_PATH} --prompt "Hello, my name is" --quant '{"embedding" : {"bitwidth": 8, "groupsize": 8}}' --device cpu
 
 ```
 Then, export as follows:
 
 ```
-python export.py --checkpoint-path ${MODEL_PATH} -d fp32 --quant '{"embedding": {"bitwidth": 8, "groupsize": 8} }' --output-pte-path ${MODEL_OUT}/${MODEL_NAME}_emb8b-gw256.pte
+python3 export.py --checkpoint-path ${MODEL_PATH} -d fp32 --quant '{"embedding": {"bitwidth": 8, "groupsize": 8} }' --output-pte-path ${MODEL_OUT}/${MODEL_NAME}_emb8b-gw256.pte
 
 ```
 
 Now you can run your model with the same command as before:
 ```
-python generate.py --pte-path ${MODEL_OUT}/${MODEL_NAME}_emb8b-gw256.pte --prompt "Hello my name is"
+python3 generate.py --pte-path ${MODEL_OUT}/${MODEL_NAME}_emb8b-gw256.pte --prompt "Hello my name is"
 ```
 
 ## 4-Bit Embedding Quantization (channelwise & groupwise)
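Since the hunk above still carries a "TODO: Write this so that someone can copy paste", here is a copy-pasteable sketch of the two eager-mode runs it shows, assuming ${MODEL_PATH} was set as in the earlier sketch:

```bash
# 8-bit embedding quantization, channelwise (groupsize 0):
python3 generate.py --compile --checkpoint-path "${MODEL_PATH}" \
  --prompt "Hello, my name is" \
  --quant '{"embedding": {"bitwidth": 8, "groupsize": 0}}' --device cpu

# 8-bit embedding quantization, groupwise (groupsize 8):
python3 generate.py --compile --checkpoint-path "${MODEL_PATH}" \
  --prompt "Hello, my name is" \
  --quant '{"embedding": {"bitwidth": 8, "groupsize": 8}}' --device cpu
```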
@@ -82,35 +82,35 @@ Quantizing embedding tables with int4 provides even higher compression of embedd
 We can do this in eager mode (optionally with torch.compile), we use the embedding quantizer with groupsize set to 0 which uses channelwise quantization:
 
 ```
-python generate.py [--compile] --checkpoint-path ${MODEL_PATH} --prompt "Hello, my name is" --quant '{"embedding" : {"bitwidth": 4, "groupsize": 0}}' --device cpu
+python3 generate.py [--compile] --checkpoint-path ${MODEL_PATH} --prompt "Hello, my name is" --quant '{"embedding" : {"bitwidth": 4, "groupsize": 0}}' --device cpu
 ```
 
 Then, export as follows:
 ```
-python export.py --checkpoint-path ${MODEL_PATH} -d fp32 --quant '{"embedding": {"bitwidth": 4, "groupsize": 0} }' --output-pte-path ${MODEL_OUT}/${MODEL_NAME}_emb8b-gw256.pte
+python3 export.py --checkpoint-path ${MODEL_PATH} -d fp32 --quant '{"embedding": {"bitwidth": 4, "groupsize": 0} }' --output-pte-path ${MODEL_OUT}/${MODEL_NAME}_emb8b-gw256.pte
 ```
 
 Now you can run your model with the same command as before:
 
 ```
-python generate.py --pte-path ${MODEL_OUT}/${MODEL_NAME}_int8.pte --prompt "Hello my name is"
+python3 generate.py --pte-path ${MODEL_OUT}/${MODEL_NAME}_int8.pte --prompt "Hello my name is"
 ```
 
 *Groupwise quantization:*
 We can do this in eager mode (optionally with torch.compile), we use the embedding quantizer by specifying the group size:
 
 ```
-python generate.py [--compile] --checkpoint-path ${MODEL_PATH} --prompt "Hello, my name is" --quant '{"embedding" : {"bitwidth": 4, "groupsize": 8}}' --device cpu
+python3 generate.py [--compile] --checkpoint-path ${MODEL_PATH} --prompt "Hello, my name is" --quant '{"embedding" : {"bitwidth": 4, "groupsize": 8}}' --device cpu
 ```
 
 Then, export as follows:
 ```
-python export.py --checkpoint-path ${MODEL_PATH} -d fp32 --quant '{"embedding": {"bitwidth": 4, "groupsize": 0} }' --output-pte-path ${MODEL_OUT}/${MODEL_NAME}_emb8b-gw256.pte
+python3 export.py --checkpoint-path ${MODEL_PATH} -d fp32 --quant '{"embedding": {"bitwidth": 4, "groupsize": 0} }' --output-pte-path ${MODEL_OUT}/${MODEL_NAME}_emb8b-gw256.pte
 ```
 
 Now you can run your model with the same command as before:
 ```
-python generate.py --pte-path ${MODEL_OUT}/${MODEL_NAME}_emb8b-gw256.pte --prompt "Hello my name is"
+python3 generate.py --pte-path ${MODEL_OUT}/${MODEL_NAME}_emb8b-gw256.pte --prompt "Hello my name is"
 ```
 
 ## 8-Bit Integer Linear Quantization (linear operator, channel-wise and groupwise)
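In the 4-bit hunk above, the groupwise export still passes "groupsize": 0 and reuses the _emb8b-gw256.pte filename. A sketch of what the groupwise 4-bit export would presumably look like; the groupsize value of 8 (mirroring the eager command) and the output filename are assumptions, not part of this diff:

```bash
# Presumed 4-bit groupwise embedding export; "groupsize": 8 mirrors the eager
# command above, and the _emb4b-gw8.pte filename is a made-up placeholder.
python3 export.py --checkpoint-path "${MODEL_PATH}" -d fp32 \
  --quant '{"embedding": {"bitwidth": 4, "groupsize": 8}}' \
  --output-pte-path "${MODEL_OUT}/${MODEL_NAME}_emb4b-gw8.pte"

python3 generate.py --pte-path "${MODEL_OUT}/${MODEL_NAME}_emb4b-gw8.pte" --prompt "Hello my name is"
```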
@@ -124,58 +124,58 @@ The simplest way to quantize embedding tables is with int8 groupwise quantizatio
 We can do this in eager mode (optionally with torch.compile), we use the linear:int8 quantizer with groupsize set to 0 which uses channelwise quantization:
 
 ```
-python generate.py [--compile] --checkpoint-path ${MODEL_PATH} --prompt "Hello, my name is" --quant '{"linear:int8" : {"bitwidth": 8, "groupsize": 0}}' --device cpu
+python3 generate.py [--compile] --checkpoint-path ${MODEL_PATH} --prompt "Hello, my name is" --quant '{"linear:int8" : {"bitwidth": 8, "groupsize": 0}}' --device cpu
 ```
 
 Then, export as follows using ExecuTorch for mobile backends:
 
 ```
-python export.py --checkpoint-path ${MODEL_PATH} -d fp32 --quant '{"linear:int8": {"bitwidth": 8, "groupsize": 0} }' --output-pte-path ${MODEL_OUT}/${MODEL_NAME}_int8.pte
+python3 export.py --checkpoint-path ${MODEL_PATH} -d fp32 --quant '{"linear:int8": {"bitwidth": 8, "groupsize": 0} }' --output-pte-path ${MODEL_OUT}/${MODEL_NAME}_int8.pte
 ```
 
 Now you can run your model with the same command as before:
 
 ```
-python generate.py --pte-path ${MODEL_OUT}/${MODEL_NAME}_int8.pte --checkpoint-path ${MODEL_PATH} --prompt "Hello my name is"
+python3 generate.py --pte-path ${MODEL_OUT}/${MODEL_NAME}_int8.pte --checkpoint-path ${MODEL_PATH} --prompt "Hello my name is"
 ```
 
 Or, export as follows for server/desktop deployments:
 
 ```
-python export.py --checkpoint-path ${MODEL_PATH} -d fp32 --quant '{"linear:int8": {"bitwidth": 8, "groupsize": 0} }' --output-pte-path ${MODEL_OUT}/${MODEL_NAME}_int8.so
+python3 export.py --checkpoint-path ${MODEL_PATH} -d fp32 --quant '{"linear:int8": {"bitwidth": 8, "groupsize": 0} }' --output-pte-path ${MODEL_OUT}/${MODEL_NAME}_int8.so
 ```
 
 Now you can run your model with the same command as before:
 
 ```
-python generate.py --dso-path ${MODEL_OUT}/${MODEL_NAME}_int8.so --checkpoint-path ${MODEL_PATH} --prompt "Hello my name is"
+python3 generate.py --dso-path ${MODEL_OUT}/${MODEL_NAME}_int8.so --checkpoint-path ${MODEL_PATH} --prompt "Hello my name is"
 ```
 
 *Groupwise quantization:*
 We can do this in eager mode (optionally with torch.compile), we use the linear:int8 quantizer by specifying the group size:
 
 ```
-python generate.py [--compile] --checkpoint-path ${MODEL_PATH} --prompt "Hello, my name is" --quant '{"linear:int8" : {"bitwidth": 8, "groupsize": 8}}' --device cpu
+python3 generate.py [--compile] --checkpoint-path ${MODEL_PATH} --prompt "Hello, my name is" --quant '{"linear:int8" : {"bitwidth": 8, "groupsize": 8}}' --device cpu
 ```
 Then, export as follows using ExecuTorch:
 
 ```
-python export.py --checkpoint-path ${MODEL_PATH} -d fp32 --quant '{"linear:int8": {"bitwidth": 8, "groupsize": 0} }' --output-pte-path ${MODEL_OUT}/${MODEL_NAME}_int8-gw256.pte
+python3 export.py --checkpoint-path ${MODEL_PATH} -d fp32 --quant '{"linear:int8": {"bitwidth": 8, "groupsize": 0} }' --output-pte-path ${MODEL_OUT}/${MODEL_NAME}_int8-gw256.pte
 ```
 
 **Now you can run your model with the same command as before:**
 
 ```
-python generate.py --pte-path ${MODEL_OUT}/${MODEL_NAME}_int8-gw256.pte --checkpoint-path ${MODEL_PATH} --prompt "Hello my name is"
+python3 generate.py --pte-path ${MODEL_OUT}/${MODEL_NAME}_int8-gw256.pte --checkpoint-path ${MODEL_PATH} --prompt "Hello my name is"
 ```
 *Or, export*
 ```
-python export.py --checkpoint-path ${MODEL_PATH} -d fp32 --quant '{"linear:int8": {"bitwidth": 8, "groupsize": 0} }' --output-dso-path ${MODEL_OUT}/${MODEL_NAME}_int8-gw256.so
+python3 export.py --checkpoint-path ${MODEL_PATH} -d fp32 --quant '{"linear:int8": {"bitwidth": 8, "groupsize": 0} }' --output-dso-path ${MODEL_OUT}/${MODEL_NAME}_int8-gw256.so
 ```
 
 Now you can run your model with the same command as before:
 ```
-python generate.py --pte-path ${MODEL_OUT}/${MODEL_NAME}_int8-gw256.so --checkpoint-path ${MODEL_PATH} -d fp32 --prompt "Hello my name is"
+python3 generate.py --pte-path ${MODEL_OUT}/${MODEL_NAME}_int8-gw256.so --checkpoint-path ${MODEL_PATH} -d fp32 --prompt "Hello my name is"
 ```
 
 Please note that group-wise quantization works functionally, but has not been optimized for CUDA and CPU targets where the best performance requires a group-wise quantized mixed dtype linear operator.
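In the server/desktop path of the hunk above, the .so artifact is written with --output-pte-path, while the groupwise example further down uses --output-dso-path for its .so output. A sketch of the channelwise server/desktop flow under the assumption that --output-dso-path is the intended flag for shared objects (paths are the hypothetical ones from earlier):

```bash
# Channelwise linear:int8 export for server/desktop, assuming --output-dso-path
# (as used for the groupwise .so example) is the flag meant for .so outputs.
python3 export.py --checkpoint-path "${MODEL_PATH}" -d fp32 \
  --quant '{"linear:int8": {"bitwidth": 8, "groupsize": 0}}' \
  --output-dso-path "${MODEL_OUT}/${MODEL_NAME}_int8.so"

python3 generate.py --dso-path "${MODEL_OUT}/${MODEL_NAME}_int8.so" \
  --checkpoint-path "${MODEL_PATH}" --prompt "Hello my name is"
```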
@@ -187,30 +187,30 @@ To compress your model even more, 4-bit integer quantization may be used. To ach
 We can do this in eager mode (optionally with torch.compile), we use the linear:int8 quantizer by specifying the group size:
 
 ```
-python generate.py [--compile] --checkpoint-path ${MODEL_PATH} --prompt "Hello, my name is" --quant '{"linear:int4" : {"groupsize": 32}}' --device [ cpu | cuda | mps ]
+python3 generate.py [--compile] --checkpoint-path ${MODEL_PATH} --prompt "Hello, my name is" --quant '{"linear:int4" : {"groupsize": 32}}' --device [ cpu | cuda | mps ]
 ```
 
 ```
-python export.py --checkpoint-path ${MODEL_PATH} -d fp32 --quant "{'linear:int4': {'groupsize' : 32} }" [ --output-pte-path ${MODEL_OUT}/${MODEL_NAME}_int4-gw32.pte | --output-dso-path ${MODEL_OUT}/${MODEL_NAME}_int4-gw32.dso]
+python3 export.py --checkpoint-path ${MODEL_PATH} -d fp32 --quant "{'linear:int4': {'groupsize' : 32} }" [ --output-pte-path ${MODEL_OUT}/${MODEL_NAME}_int4-gw32.pte | --output-dso-path ${MODEL_OUT}/${MODEL_NAME}_int4-gw32.dso]
 ```
 Now you can run your model with the same command as before:
 
 ```
-python generate.py [ --pte-path ${MODEL_OUT}/${MODEL_NAME}_int4-gw32.pte | --dso-path ${MODEL_OUT}/${MODEL_NAME}_int4-gw32.dso] --prompt "Hello my name is"
+python3 generate.py [ --pte-path ${MODEL_OUT}/${MODEL_NAME}_int4-gw32.pte | --dso-path ${MODEL_OUT}/${MODEL_NAME}_int4-gw32.dso] --prompt "Hello my name is"
 ```
 
 ## 4-Bit Integer Linear Quantization (a8w4dq)
 To compress your model even more, 4-bit integer quantization may be used. To achieve good accuracy, we recommend the use of groupwise quantization where (small to mid-sized) groups of int4 weights share a scale. We also quantize activations to 8-bit, giving this scheme its name (a8w4dq = 8-bit dynamically quantized activations with 4b weights), and boost performance.
 
 **TODO (Digant): a8w4dq eager mode support [#335](https://github.com/pytorch/torchchat/issues/335) **
 ```
-python export.py --checkpoint-path ${MODEL_PATH} -d fp32 --quant "{'linear:a8w4dq': {'groupsize' : 7} }" [ --output-pte-path ${MODEL_OUT}/${MODEL_NAME}_8da4w.pte | ...dso... ]
+python3 export.py --checkpoint-path ${MODEL_PATH} -d fp32 --quant "{'linear:a8w4dq': {'groupsize' : 7} }" [ --output-pte-path ${MODEL_OUT}/${MODEL_NAME}_8da4w.pte | ...dso... ]
 ```
 
 Now you can run your model with the same command as before:
 
 ```
-python generate.py [ --pte-path ${MODEL_OUT}/${MODEL_NAME}_a8w4dq.pte | ...dso...] --prompt "Hello my name is"
+python3 generate.py [ --pte-path ${MODEL_OUT}/${MODEL_NAME}_a8w4dq.pte | ...dso...] --prompt "Hello my name is"
 ```
 
 ## 4-bit Integer Linear Quantization with GPTQ (gptq)
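The bracketed [ --output-pte-path ... | --output-dso-path ... ] notation in the hunk above means "choose one", so neither line is copy-pasteable as written. A sketch of the int4 groupsize-32 flow with the .pte alternative spelled out; the same pattern applies to the a8w4dq, gptq, and hqq exports in the surrounding hunks:

```bash
# linear:int4, groupsize 32, exported to an ExecuTorch .pte and then run.
python3 export.py --checkpoint-path "${MODEL_PATH}" -d fp32 \
  --quant "{'linear:int4': {'groupsize': 32}}" \
  --output-pte-path "${MODEL_OUT}/${MODEL_NAME}_int4-gw32.pte"

python3 generate.py --pte-path "${MODEL_OUT}/${MODEL_NAME}_int4-gw32.pte" --prompt "Hello my name is"
```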
@@ -220,16 +220,16 @@ Compression offers smaller memory footprints (to fit on memory-constrained accel
 
 We can use GPTQ with eager execution, optionally in conjunction with torch.compile:
 ```
-python generate.py [--compile] --checkpoint-path ${MODEL_PATH} --prompt "Hello, my name is" --quant '{"linear:int4" : {"groupsize": 32}}' --device [ cpu | cuda | mps ]
+python3 generate.py [--compile] --checkpoint-path ${MODEL_PATH} --prompt "Hello, my name is" --quant '{"linear:int4" : {"groupsize": 32}}' --device [ cpu | cuda | mps ]
 ```
 
 ```
-python export.py --checkpoint-path ${MODEL_PATH} -d fp32 --quant "{'linear:gptq': {'groupsize' : 32} }" [ --output-pte-path ${MODEL_OUT}/${MODEL_NAME}_gptq.pte | ...dso... ]
+python3 export.py --checkpoint-path ${MODEL_PATH} -d fp32 --quant "{'linear:gptq': {'groupsize' : 32} }" [ --output-pte-path ${MODEL_OUT}/${MODEL_NAME}_gptq.pte | ...dso... ]
 ```
 Now you can run your model with the same command as before:
 
 ```
-python generate.py [ --pte-path ${MODEL_OUT}/${MODEL_NAME}_gptq.pte | ...dso...] --prompt "Hello my name is"
+python3 generate.py [ --pte-path ${MODEL_OUT}/${MODEL_NAME}_gptq.pte | ...dso...] --prompt "Hello my name is"
 ```
 
 ## 4-bit Integer Linear Quantization with HQQ (hqq)
@@ -240,16 +240,16 @@ Compression offers smaller memory footprints (to fit on memory-constrained accel
 
 We can use HQQ with eager execution, optionally in conjunction with torch.compile:
 ```
-python generate.py [--compile] --checkpoint-path ${MODEL_PATH} --prompt "Hello, my name is" --quant '{"linear:hqq" : {"groupsize": 32}}' --device [ cpu | cuda | mps ]
+python3 generate.py [--compile] --checkpoint-path ${MODEL_PATH} --prompt "Hello, my name is" --quant '{"linear:hqq" : {"groupsize": 32}}' --device [ cpu | cuda | mps ]
 ```
 
 ```
-python export.py --checkpoint-path ${MODEL_PATH} -d fp32 --quant "{'linear:hqq': {'groupsize' : 32} }" [ --output-pte-path ${MODEL_OUT}/${MODEL_NAME}_hqq.pte | ...dso... ]
+python3 export.py --checkpoint-path ${MODEL_PATH} -d fp32 --quant "{'linear:hqq': {'groupsize' : 32} }" [ --output-pte-path ${MODEL_OUT}/${MODEL_NAME}_hqq.pte | ...dso... ]
 ```
 Now you can run your model with the same command as before:
 
 ```
-python generate.py [ --pte-path ${MODEL_OUT}/${MODEL_NAME}_hqq.pte | ...dso...] --prompt "Hello my name is"
+python3 generate.py [ --pte-path ${MODEL_OUT}/${MODEL_NAME}_hqq.pte | ...dso...] --prompt "Hello my name is"
 
 
 ## Adding additional quantization schemes
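Finally, since the whole commit is a mechanical python => python3 substitution, a quick way to confirm that no plain `python` invocations were missed in the touched file (assuming you are at the repository root):

```bash
# Lists any command lines in the doc that still start with plain `python `;
# prints a confirmation if none remain (grep exits non-zero on no match).
grep -n "^python " docs/quantization.md || echo "all commands use python3"
```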
