Unlike gpt-fast, which uses bfloat16 as its default, torchchat uses float32 as the default dtype. As a consequence, you will have to set --dtype bf16 or --dtype fp16 on server / desktop for best performance.
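
For example, a minimal sketch that reuses the ${MODEL_PATH} checkpoint variable from the examples below (the device and dtype chosen here are illustrative; adjust them to your hardware):

```
python3 generate.py --checkpoint-path ${MODEL_PATH} --dtype bf16 --device cuda --prompt "Hello, my name is"
```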

We can do this in eager mode (optionally with torch.compile), using the embedding quantizer with groupsize set to 0, which selects channelwise quantization:

TODO: Write this so that someone can copy paste

```
python3 generate.py [--compile] --checkpoint-path ${MODEL_PATH} --prompt "Hello, my name is" --quant '{"embedding" : {"bitwidth": 8, "groupsize": 0}}' --device cpu
```

Quantizing embedding tables with int4 provides even higher compression of embedding tables.

We can do this in eager mode (optionally with torch.compile), using the embedding quantizer with groupsize set to 0, which selects channelwise quantization:

```
python3 generate.py [--compile] --checkpoint-path ${MODEL_PATH} --prompt "Hello, my name is" --quant '{"embedding" : {"bitwidth": 4, "groupsize": 0}}' --device cpu
```
Now you can run your model with the same command as before:

```
python3 generate.py --pte-path ${MODEL_OUT}/${MODEL_NAME}_emb8b-gw256.pte --prompt "Hello my name is"
```

## 8-Bit Integer Linear Quantization (linear operator, channel-wise and groupwise)

The simplest way to quantize linear operators is with int8 groupwise quantization.

We can do this in eager mode (optionally with torch.compile), using the linear:int8 quantizer with groupsize set to 0, which selects channelwise quantization:

```
python3 generate.py [--compile] --checkpoint-path ${MODEL_PATH} --prompt "Hello, my name is" --quant '{"linear:int8" : {"bitwidth": 8, "groupsize": 0}}' --device cpu
```

Then, export as follows using ExecuTorch for mobile backends:
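
The export command itself is not shown in this excerpt. As a hypothetical sketch only, assuming torchchat exposes an export.py script with an --output-pte-path flag (neither is confirmed by this document), such an export might look like:

```
# Hypothetical sketch: the script name, flags, and groupsize are assumptions, not taken from this document.
python3 export.py --checkpoint-path ${MODEL_PATH} -d fp32 --quant '{"linear:int8" : {"bitwidth": 8, "groupsize": 256}}' --output-pte-path ${MODEL_OUT}/${MODEL_NAME}_int8-gw256.pte
```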
Now you can run your model with the same command as before:

```
python3 generate.py --pte-path ${MODEL_OUT}/${MODEL_NAME}_int8-gw256.pte --checkpoint-path ${MODEL_PATH} -d fp32 --prompt "Hello my name is"
```

Please note that group-wise quantization works functionally, but has not been optimized for CUDA and CPU targets, where the best performance requires a group-wise quantized mixed-dtype linear operator.
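
The groupwise variant of the int8 linear quantizer uses the same quant schema with a nonzero groupsize. A hedged sketch (the groupsize value of 256 is illustrative, not prescribed by this document):

```
python3 generate.py [--compile] --checkpoint-path ${MODEL_PATH} --prompt "Hello, my name is" --quant '{"linear:int8" : {"bitwidth": 8, "groupsize": 256}}' --device cpu
```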

## 4-Bit Integer Linear Quantization (int4)

To compress your model even more, 4-bit integer quantization may be used. To achieve good accuracy, we recommend the use of groupwise quantization, where (small to mid-sized) groups of int4 weights share a scale.

We can do this in eager mode (optionally with torch.compile), using the linear:int4 quantizer and specifying the group size:

```
python3 generate.py [--compile] --checkpoint-path ${MODEL_PATH} --prompt "Hello, my name is" --quant '{"linear:int4" : {"groupsize": 32}}' --device [ cpu | cuda | mps ]
```
Now you can run your model with the same command as before:

```
python3 generate.py [ --pte-path ${MODEL_OUT}/${MODEL_NAME}_int4-gw32.pte | --dso-path ${MODEL_OUT}/${MODEL_NAME}_int4-gw32.dso ] --prompt "Hello my name is"
```

## 4-Bit Integer Linear Quantization (a8w4dq)
To compress your model even more, 4-bit integer quantization may be used. To achieve good accuracy, we recommend the use of groupwise quantization, where (small to mid-sized) groups of int4 weights share a scale. We also quantize activations to 8-bit, giving this scheme its name (a8w4dq = 8-bit dynamically quantized activations with 4-bit weights) and boosting performance.
**TODO (Digant): a8w4dq eager mode support [#335](https://github.com/pytorch/torchchat/issues/335)**
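
As a hypothetical sketch of how this scheme might be applied at export time, assuming the export.py script sketched earlier and assuming the quantizer is exposed under a "linear:a8w4dq" key (both are assumptions, not confirmed by this excerpt):

```
# Hypothetical sketch: the quantizer key, script name, and flags are assumptions, not taken from this document.
python3 export.py --checkpoint-path ${MODEL_PATH} --quant '{"linear:a8w4dq" : {"groupsize": 256}}' --output-pte-path ${MODEL_OUT}/${MODEL_NAME}_a8w4dq.pte
```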