
Commit d6da1b4

lucylq authored and malfet committed
create dir on download (#425)
1 parent 2df1e28 commit d6da1b4

File tree

1 file changed: +6 -6 lines changed

docs/quantization.md

Lines changed: 6 additions & 6 deletions
@@ -214,7 +214,7 @@ python3 generate.py [--compile] --checkpoint-path ${MODEL_PATH} --prompt "Hello,
 ```
 
 ```
-python3 export.py --checkpoint-path ${MODEL_PATH} -d fp32 --quantize "{'linear:int4': {'groupsize' : 32} }" [ --output-pte-path ${MODEL_OUT}/${MODEL_NAME}_int4-gw32.pte | --output-dso-path ${MODEL_OUT}/${MODEL_NAME}_int4-gw32.dso]
+python3 export.py --checkpoint-path ${MODEL_PATH} -d fp32 --quantize '{"linear:int4": {"groupsize" : 32} }' [ --output-pte-path ${MODEL_OUT}/${MODEL_NAME}_int4-gw32.pte | --output-dso-path ${MODEL_OUT}/${MODEL_NAME}_int4-gw32.dso]
 ```
 Now you can run your model with the same command as before:
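
The quoting change in this hunk (and in the matching hunks below) replaces shell double quotes around a Python-dict-style string with single quotes around a JSON object whose keys use double quotes. Assuming the `--quantize` argument is parsed as JSON on the Python side (an assumption; the parsing code in `export.py` is not part of this diff), a minimal sketch of why the new form is the safer one:

```python
import json

# New style from this commit: single shell quotes wrap a JSON object with
# double-quoted keys, so the string parses cleanly as JSON.
new_arg = '{"linear:int4": {"groupsize" : 32} }'
print(json.loads(new_arg))  # {'linear:int4': {'groupsize': 32}}

# Old style: double shell quotes wrap a dict with single-quoted keys.
# json.loads rejects it (it would need something like ast.literal_eval).
old_arg = "{'linear:int4': {'groupsize' : 32} }"
try:
    json.loads(old_arg)
except json.JSONDecodeError as err:
    print("not valid JSON:", err)
```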

@@ -227,7 +227,7 @@ To compress your model even more, 4-bit integer quantization may be used. To ach
 
 **TODO (Digant): a8w4dq eager mode support [#335](https://github.com/pytorch/torchchat/issues/335) **
 ```
-python3 export.py --checkpoint-path ${MODEL_PATH} -d fp32 --quantize "{'linear:a8w4dq': {'groupsize' : 7} }" [ --output-pte-path ${MODEL_OUT}/${MODEL_NAME}_8da4w.pte | ...dso... ]
+python3 export.py --checkpoint-path ${MODEL_PATH} -d fp32 --quantize '{"linear:a8w4dq": {"groupsize" : 8} }' [ --output-pte-path ${MODEL_OUT}/${MODEL_NAME}_8da4w.pte | ...dso... ]
 ```
 
 Now you can run your model with the same command as before:
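
Besides the quoting fix, this hunk also changes `groupsize` from 7 to 8. Groupwise schemes generally expect the group size to divide the weight dimension evenly, and group sizes are typically powers of two; that is the likely motivation, though the diff itself does not state it. Purely as an illustration (not the torchchat implementation), here is a sketch of symmetric int4 quantization in which every group of 8 consecutive weights shares one scale:

```python
import numpy as np

def quantize_groupwise_int4(weights: np.ndarray, groupsize: int = 8):
    """Toy symmetric int4 groupwise quantization: every `groupsize`
    consecutive weights share a single scale."""
    assert weights.size % groupsize == 0, "groupsize must divide the number of weights"
    groups = weights.reshape(-1, groupsize)
    # One scale per group; the largest magnitude in the group maps to the int4 limit of 7.
    scales = np.abs(groups).max(axis=1, keepdims=True) / 7.0
    scales = np.where(scales == 0.0, 1.0, scales)  # guard against all-zero groups
    q = np.clip(np.round(groups / scales), -8, 7).astype(np.int8)
    return q, scales

def dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    return (q.astype(np.float32) * scales).reshape(-1)

w = np.random.default_rng(0).normal(size=32).astype(np.float32)
q, s = quantize_groupwise_int4(w, groupsize=8)
print(np.max(np.abs(dequantize(q, s) - w)))  # small per-group rounding error
```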
@@ -237,7 +237,7 @@ python3 generate.py [ --pte-path ${MODEL_OUT}/${MODEL_NAME}_a8w4dq.pte | ...dso.
 ```
 
 ## 4-bit Integer Linear Quantization with GPTQ (gptq)
-Compression offers smaller memory footprints (to fit on memory-constrained accelerators and mobile/edge devices) and reduced memory bandwidth (for better performance), but often at the price of quality degradation. GPTQ 4-bit integer quantization may be used to reduce the quality impact. To achieve good accuracy, we recommend the use of groupwise quantization where (small to mid-sized) groups of int4 weights share a scale.
+Compression offers smaller memory footprints (to fit on memory-constrained accelerators and mobile/edge devices) and reduced memory bandwidth (for better performance), but often at the price of quality degradation. GPTQ 4-bit integer quantization may be used to reduce the quality impact. To achieve good accuracy, we recommend the use of groupwise quantization where (small to mid-sized) groups of int4 weights share a scale.
 
 **TODO (Jerry): GPTQ quantization documentation [#336](https://github.com/pytorch/torchchat/issues/336) **
 
@@ -247,7 +247,7 @@ python3 generate.py [--compile] --checkpoint-path ${MODEL_PATH} --prompt "Hello,
 ```
 
 ```
-python3 export.py --checkpoint-path ${MODEL_PATH} -d fp32 --quantize "{'linear:gptq': {'groupsize' : 32} }" [ --output-pte-path ${MODEL_OUT}/${MODEL_NAME}_gptq.pte | ...dso... ]
+python3 export.py --checkpoint-path ${MODEL_PATH} -d fp32 --quantize '{"linear:gptq": {"groupsize" : 32} }' [ --output-pte-path ${MODEL_OUT}/${MODEL_NAME}_gptq.pte | ...dso... ]
 ```
 Now you can run your model with the same command as before:
 
@@ -257,7 +257,7 @@ python3 generate.py [ --pte-path ${MODEL_OUT}/${MODEL_NAME}_gptq.pte | ...dso...
 
 ## 4-bit Integer Linear Quantization with HQQ (hqq)
 
-Compression offers smaller memory footprints (to fit on memory-constrained accelerators and mobile/edge devices) and reduced memory bandwidth (for better performance), but often at the price of quality degradation. GPTQ 4-bit integer quantization may be used to reduce the quality impact, but at the cost of significant additional computation time. HQQ Quantization balances performance, accuracy, and runtime, we recommend the use of groupwise quantization where (small to mid-sized) groups of int4 weights share a scale.
+Compression offers smaller memory footprints (to fit on memory-constrained accelerators and mobile/edge devices) and reduced memory bandwidth (for better performance), but often at the price of quality degradation. GPTQ 4-bit integer quantization may be used to reduce the quality impact, but at the cost of significant additional computation time. HQQ Quantization balances performance, accuracy, and runtime, we recommend the use of groupwise quantization where (small to mid-sized) groups of int4 weights share a scale.
 
 **TODO (Zhengxu): HQQ quantization documentation [#337](https://github.com/pytorch/torchchat/issues/336) **
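
The paragraph above positions HQQ as a middle ground: like GPTQ it tries to recover accuracy lost to 4-bit weights, but with a cheaper optimization. A rough intuition (this is not the actual HQQ algorithm, which uses a half-quadratic solver) is that the quantization parameters are tuned to reduce weight reconstruction error instead of being taken directly from the min/max range. The toy sketch below stands in for that idea with a simple zero-point search:

```python
import numpy as np

def recon_error(w, scale, zero):
    """Mean reconstruction error of asymmetric 4-bit round-to-nearest
    quantization for a given scale and zero-point."""
    q = np.clip(np.round(w / scale + zero), 0, 15)
    w_hat = (q - zero) * scale
    return np.abs(w_hat - w).mean()

w = np.random.default_rng(0).normal(size=64)

# Baseline: scale and zero-point taken directly from the min/max range.
scale = (w.max() - w.min()) / 15
zero = -w.min() / scale
baseline = recon_error(w, scale, zero)

# Toy stand-in for HQQ's optimizer: search nearby zero-points for a lower
# reconstruction error (offset 0.0 reproduces the baseline, so the search
# can only match or improve it).
offsets = np.concatenate(([0.0], np.linspace(-1.0, 1.0, 81)))
best = min(recon_error(w, scale, zero + d) for d in offsets)
print(f"min/max zero-point error: {baseline:.4f}  after search: {best:.4f}")
```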

@@ -267,7 +267,7 @@ python3 generate.py [--compile] --checkpoint-path ${MODEL_PATH} --prompt "Hello,
 ```
 
 ```
-python3 export.py --checkpoint-path ${MODEL_PATH} -d fp32 --quantize "{'linear:hqq': {'groupsize' : 32} }" [ --output-pte-path ${MODEL_OUT}/${MODEL_NAME}_hqq.pte | ...dso... ]
+python3 export.py --checkpoint-path ${MODEL_PATH} -d fp32 --quantize '{"linear:hqq": {"groupsize" : 32} }' [ --output-pte-path ${MODEL_OUT}/${MODEL_NAME}_hqq.pte | ...dso... ]
 ```
 Now you can run your model with the same command as before:
 