
Commit ee0ad19

metascroy authored and malfet committed
update gguf docs (#794)
1 parent 1cb162a commit ee0ad19

File tree

docs/ADVANCED-USERS.md
docs/GGUF.md

2 files changed (+28, -41 lines)


docs/ADVANCED-USERS.md

Lines changed: 4 additions & 40 deletions
@@ -132,22 +132,10 @@ GGUF model with the option `--load-gguf ${MODELNAME}.gguf`. Presently,
 the F16, F32, Q4_0, and Q6_K formats are supported and converted into
 native torchchat models.
 
-You may also dequantize GGUF models with the GGUF quantize tool, and
-then load and requantize with torchchat native quantization options.
-
 | GGUF Model | Tested | Eager | torch.compile | AOT Inductor | ExecuTorch | Fits on Mobile |
 |-----|--------|-------|-----|-----|-----|-----|
 | llama-2-7b.Q4_0.gguf | 🚧 | 🚧 | 🚧 | 🚧 | 🚧 |
 
-You may also dequantize GGUF models with the GGUF quantize tool, and
-then load and requantize with torchchat native quantization options.
-
-**Please note that quantizing and dequantizing is a lossy process, and
-you will get the best results by starting with the original
-unquantized model checkpoint, not a previously quantized and then
-dequantized model.**
-
-
 ## Conventions used in this document
 
 We use several variables in this example, which may be set as a
@@ -232,7 +220,7 @@ submission guidelines.)
 
 Torchchat supports several devices. You may also let torchchat use
 heuristics to select the best device from available devices using
-torchchat's virtual device named `fast`.
+torchchat's virtual device named `fast`.
 
 Torchchat supports execution using several floating-point datatypes.
 Please note that the selection of execution floating point type may
@@ -398,9 +386,9 @@ linear operator (asymmetric) with GPTQ | n/a | 4b (group) | n/a |
 linear operator (asymmetric) with HQQ | n/a | work in progress | n/a |
 
 ## Model precision (dtype precision setting)
-On top of quantizing models with quantization schemes mentioned above, models can be converted
-to lower precision floating point representations to reduce the memory bandwidth requirement and
-take advantage of higher density compute available. For example, many GPUs and some of the CPUs
+On top of quantizing models with quantization schemes mentioned above, models can be converted
+to lower precision floating point representations to reduce the memory bandwidth requirement and
+take advantage of higher density compute available. For example, many GPUs and some of the CPUs
 have good support for bfloat16 and float16. This can be taken advantage of via `--dtype arg` as shown below.
 
 [skip default]: begin
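
The two hunks above concern torchchat's virtual `fast` device and the `--dtype` precision setting; the following is a minimal sketch of how those selections might be passed on the command line. The `llama2` model alias and the specific flag values are illustrative assumptions, not part of this commit:

```
# Illustrative invocation (not from this commit): let torchchat pick the best
# available device via the virtual "fast" device and run the weights in bfloat16
# to reduce memory bandwidth. "llama2" is an assumed model alias placeholder.
python3 torchchat.py generate llama2 --device fast --dtype bf16 --prompt "Once upon a time"
```
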
@@ -439,30 +427,6 @@ may dequantize them using GGUF tools, and then laod the model into
 torchchat to quantize with torchchat's quantization workflow.)
 
 
-## Loading unsupported GGUF formats in torchchat
-
-GGUF formats not presently supported natively in torchchat may be
-converted to one of the supported formats with GGUF's
-`${GGUF}/quantize` utility to be loaded in torchchat. If you convert
-to the FP16 or FP32 formats with GGUF's `quantize` utility, you may
-then requantize these models with torchchat's quantization workflow.
-
-**Note that quantizing and dequantizing is a lossy process, and you will
-get the best results by starting with the original unquantized model
-checkpoint, not a previously quantized and then dequantized
-model.** Thus, while you can convert your q4_1 model to FP16 or FP32
-GGUF formats and then requantize, you might get better results if you
-start with the original FP16 or FP32 GGUF format.
-
-To use the quantize tool, install the GGML tools at ${GGUF} . Then,
-you can, for example, convert a quantized model to f16 format:
-
-[end default]: end
-```
-${GGUF}/quantize --allow-requantize your_quantized_model.gguf fake_unquantized_model.gguf f16
-```
-
-
 ## Optimizing your model for server, desktop and mobile devices
 
 While we have shown the export and execution of a small model on CPU

docs/GGUF.md

Lines changed: 24 additions & 1 deletion
@@ -56,7 +56,6 @@ python3 torchchat.py generate --gguf-path ${GGUF_MODEL_PATH} --dso-path ${GGUF_S
 
 ```
 
-
 ### ExecuTorch export + generate
 Before running this example, you must first [Set-up ExecuTorch](executorch_setup.md).
 ```
@@ -67,4 +66,28 @@ python3 torchchat.py export --gguf-path ${GGUF_MODEL_PATH} --output-pte-path ${G
 python3 torchchat.py generate --gguf-path ${GGUF_MODEL_PATH} --pte-path ${GGUF_PTE_PATH} --tokenizer-path ${GGUF_TOKENIZER_PATH} --temperature 0 --prompt "Once upon a time" --max-new-tokens 15
 ```
 
+### Advanced: loading unsupported GGUF formats in torchchat
+GGUF formats not presently supported natively in torchchat can be
+converted to one of the supported formats with GGUF's
+[quantize](https://github.com/ggerganov/llama.cpp/tree/master/examples/quantize) utility.
+If you convert to the FP16 or FP32 formats with GGUF's quantize utility, you can
+then requantize these models with torchchat's native quantization workflow.
+
+**Please note that quantizing and dequantizing is a lossy process, and
+you will get the best results by starting with the original
+unquantized model, not a previously quantized and then
+dequantized model.**
+
+As an example, suppose you have [llama.cpp cloned and installed](https://github.com/ggerganov/llama.cpp) at ~/repos/llama.cpp.
+You can then convert a model to FP16 with the following command:
+
+[skip default]: begin
+```
+~/repos/llama.cpp/quantize --allow-requantize path_of_model_you_are_converting_from.gguf path_for_model_you_are_converting_to.gguf fp16
+```
+[skip default]: end
+
+After the model is converted to a supported format like FP16, you can proceed using the instructions above.
+
+
 [end default]: end
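
The added GGUF.md section ends at the conversion step. As a hedged sketch of the follow-up it points to, the converted FP16 file could then be loaded through torchchat's `--gguf-path` flag and requantized with torchchat's native quantization options; the `--quantize` configuration and scheme name below are assumptions for illustration, not part of this commit:

```
# Hypothetical follow-up (not from this commit): load the GGUF file produced by
# llama.cpp's quantize tool into torchchat and requantize with a native scheme.
# The --quantize JSON and the "linear:int4" scheme/groupsize are assumed examples.
python3 torchchat.py generate \
  --gguf-path path_for_model_you_are_converting_to.gguf \
  --tokenizer-path ${GGUF_TOKENIZER_PATH} \
  --quantize '{"linear:int4": {"groupsize": 256}}' \
  --prompt "Once upon a time" --max-new-tokens 15
```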
