
Commit ed95bd1

quantization

1 parent 16d7d83

2 files changed: +7, -6 lines

README.md (1 addition & 0 deletions)
````diff
@@ -233,6 +233,7 @@ python3 torchchat.py export stories15M --output-pte-path stories15M.pte
 # Execute
 python3 torchchat.py generate --device cpu --pte-path stories15M.pte --prompt "Hello my name is"
 ```
+* Note, to export a llama model, it's recommended to quantize with `--quantize '{"embedding": {"bitwidth": 4, "groupsize": 32}, "linear:int4": {"groupsize": 256}}'`
 
 See below under [Mobile Execution](#run-mobile) if you want to deploy and execute a model in your iOS or Android app.
 
````
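For reference, the full export command with the newly recommended options folded in might look like the sketch below, based on the `stories15M` invocation shown in the hunk; the model name `llama3` and the output path are illustrative placeholders, not part of this commit.

```bash
# Sketch only: "llama3" and the output path are illustrative placeholders.
# Folds the recommended --quantize config (4-bit grouped embeddings plus
# int4 linear quantization with groupsize 256) into the export invocation.
python3 torchchat.py export llama3 \
  --quantize '{"embedding": {"bitwidth": 4, "groupsize": 32}, "linear:int4": {"groupsize": 256}}' \
  --output-pte-path llama3.pte
```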

docs/quantization.md (6 additions & 6 deletions)
````diff
@@ -214,12 +214,12 @@ python3 generate.py [--compile] --checkpoint-path ${MODEL_PATH} --prompt "Hello,
 ```
 
 ```
-python3 export.py --checkpoint-path ${MODEL_PATH} -d fp32 --quantize '{"linear:int4": {"groupsize" : 32} }' [ --output-pte-path ${MODEL_OUT}/${MODEL_NAME}_int4-gw32.pte | --output-dso-path ${MODEL_OUT}/${MODEL_NAME}_int4-gw32.dso]
+python3 export.py --checkpoint-path ${MODEL_PATH} -d fp32 --quantize '{"linear:int4": {"groupsize" : 32} }' --output-dso-path ${MODEL_OUT}/${MODEL_NAME}_int4-gw32.dso
 ```
 Now you can run your model with the same command as before:
 
 ```
-python3 generate.py [ --pte-path ${MODEL_OUT}/${MODEL_NAME}_int4-gw32.pte | --dso-path ${MODEL_OUT}/${MODEL_NAME}_int4-gw32.dso] --prompt "Hello my name is"
+python3 generate.py --dso-path ${MODEL_OUT}/${MODEL_NAME}_int4-gw32.dso --prompt "Hello my name is"
 ```
 
 ## 4-Bit Integer Linear Quantization (a8w4dq)
````
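Putting the post-change commands together, the int4 flow runs end-to-end as sketched below; the `MODEL_PATH`, `MODEL_OUT`, and `MODEL_NAME` values are illustrative placeholders, while the two commands are the `+` lines from the hunk above.

```bash
# Illustrative placeholders; substitute your own checkpoint and output dir.
MODEL_PATH=checkpoints/stories15M/stories15M.pt
MODEL_OUT=exports
MODEL_NAME=stories15M

# Export with group-wise int4 linear quantization (groupsize 32) to a DSO.
python3 export.py --checkpoint-path ${MODEL_PATH} -d fp32 \
  --quantize '{"linear:int4": {"groupsize" : 32} }' \
  --output-dso-path ${MODEL_OUT}/${MODEL_NAME}_int4-gw32.dso

# Run the exported DSO with the same generate entry point as before.
python3 generate.py --dso-path ${MODEL_OUT}/${MODEL_NAME}_int4-gw32.dso \
  --prompt "Hello my name is"
```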
````diff
@@ -247,12 +247,12 @@ python3 generate.py [--compile] --checkpoint-path ${MODEL_PATH} --prompt "Hello,
 ```
 
 ```
-python3 export.py --checkpoint-path ${MODEL_PATH} -d fp32 --quantize '{"linear:gptq": {"groupsize" : 32} }' [ --output-pte-path ${MODEL_OUT}/${MODEL_NAME}_gptq.pte | ...dso... ]
+python3 export.py --checkpoint-path ${MODEL_PATH} -d fp32 --quantize '{"linear:gptq": {"groupsize" : 32} }' --output-dso-path ${MODEL_OUT}/${MODEL_NAME}_gptq.dso
 ```
 Now you can run your model with the same command as before:
 
 ```
-python3 generate.py [ --pte-path ${MODEL_OUT}/${MODEL_NAME}_gptq.pte | ...dso...] --prompt "Hello my name is"
+python3 generate.py --dso-path ${MODEL_OUT}/${MODEL_NAME}_gptq.dso --prompt "Hello my name is"
 ```
 
 ## 4-bit Integer Linear Quantization with HQQ (hqq)
````
````diff
@@ -267,12 +267,12 @@ python3 generate.py [--compile] --checkpoint-path ${MODEL_PATH} --prompt "Hello,
 ```
 
 ```
-python3 export.py --checkpoint-path ${MODEL_PATH} -d fp32 --quantize '{"linear:hqq": {"groupsize" : 32} }' [ --output-pte-path ${MODEL_OUT}/${MODEL_NAME}_hqq.pte | ...dso... ]
+python3 export.py --checkpoint-path ${MODEL_PATH} -d fp32 --quantize '{"linear:hqq": {"groupsize" : 32} }' --output-dso-path ${MODEL_OUT}/${MODEL_NAME}_hqq.dso
 ```
 Now you can run your model with the same command as before:
 
 ```
-python3 generate.py [ --pte-path ${MODEL_OUT}/${MODEL_NAME}_hqq.pte | ...dso...] --prompt "Hello my name is"
+python3 generate.py --dso-path ${MODEL_OUT}/${MODEL_NAME}_hqq.dso --prompt "Hello my name is"
 ```
 
 ## Adding additional quantization schemes
````
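All three hunks in this file make the same edit: the bracketed `--output-pte-path | --output-dso-path` alternative is replaced by the DSO path alone, so only the quantizer key varies between schemes. Below is a generic sketch of that shared pattern, assuming the docs' placeholder variables; `SCHEME` is an illustrative shell variable introduced here, not something from the docs.

```bash
# SCHEME is one of the quantizer keys touched by this commit:
# linear:int4, linear:gptq, or linear:hqq.
SCHEME="linear:gptq"

# Export with the chosen scheme at groupsize 32, emitting only a DSO;
# ${SCHEME##*:} strips the "linear:" prefix to form a short file suffix.
python3 export.py --checkpoint-path ${MODEL_PATH} -d fp32 \
  --quantize "{\"${SCHEME}\": {\"groupsize\" : 32} }" \
  --output-dso-path ${MODEL_OUT}/${MODEL_NAME}_${SCHEME##*:}.dso

# Run the exported DSO.
python3 generate.py --dso-path ${MODEL_OUT}/${MODEL_NAME}_${SCHEME##*:}.dso \
  --prompt "Hello my name is"
```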
