update gguf docs #794


Merged
merged 1 commit into from
May 14, 2024
44 changes: 4 additions & 40 deletions docs/ADVANCED-USERS.md
@@ -132,22 +132,10 @@ GGUF model with the option `--load-gguf ${MODELNAME}.gguf`. Presently,
the F16, F32, Q4_0, and Q6_K formats are supported and converted into
native torchchat models.
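
For illustration, a minimal invocation along these lines might look as follows (a sketch only; it assumes the `generate` entry point used elsewhere in these docs, and `${MODELNAME}` is a placeholder for your model file):

[skip default]: begin
```
python3 torchchat.py generate --load-gguf ${MODELNAME}.gguf --prompt "Once upon a time"
```
[skip default]: end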

You may also dequantize GGUF models with the GGUF quantize tool, and
then load and requantize with torchchat native quantization options.

| GGUF Model | Tested | Eager | torch.compile | AOT Inductor | ExecuTorch | Fits on Mobile |
|-----|--------|-------|-----|-----|-----|-----|
| llama-2-7b.Q4_0.gguf | 🚧 | 🚧 | 🚧 | 🚧 | 🚧 | 🚧 |

**Please note that quantizing and dequantizing is a lossy process, and
you will get the best results by starting with the original
unquantized model checkpoint, not a previously quantized and then
dequantized model.**


## Conventions used in this document

We use several variables in this example, which may be set as a
@@ -232,7 +220,7 @@ submission guidelines.)

Torchchat supports several devices. You may also let torchchat use
heuristics to select the best device from available devices using
torchchat's virtual device named `fast`.
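
For example, a device can be selected explicitly or left to the heuristic (a sketch; it assumes the `--device` option accepts the virtual device name described above):

[skip default]: begin
```
# let torchchat pick the best available device automatically
python3 torchchat.py generate --device fast --prompt "Once upon a time"
```
[skip default]: end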

Torchchat supports execution using several floating-point datatypes.
Please note that the selection of execution floating point type may
@@ -398,9 +386,9 @@ linear operator (asymmetric) with GPTQ | n/a | 4b (group) | n/a |
linear operator (asymmetric) with HQQ | n/a | work in progress | n/a |

## Model precision (dtype precision setting)
On top of quantizing models with quantization schemes mentioned above, models can be converted
to lower precision floating point representations to reduce the memory bandwidth requirement and
take advantage of higher density compute available. For example, many GPUs and some of the CPUs
have good support for bfloat16 and float16. This can be taken advantage of via `--dtype arg` as shown below.
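
A minimal sketch of such an invocation (assuming `bf16` is among the accepted `--dtype` values):

[skip default]: begin
```
python3 torchchat.py generate --dtype bf16 --prompt "Once upon a time"
```
[skip default]: end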

[skip default]: begin
@@ -439,30 +427,6 @@ may dequantize them using GGUF tools, and then load the model into
torchchat to quantize with torchchat's quantization workflow.)


## Loading unsupported GGUF formats in torchchat

GGUF formats not presently supported natively in torchchat may be
converted to one of the supported formats with GGUF's
`${GGUF}/quantize` utility to be loaded in torchchat. If you convert
to the FP16 or FP32 formats with GGUF's `quantize` utility, you may
then requantize these models with torchchat's quantization workflow.

**Note that quantizing and dequantizing is a lossy process, and you will
get the best results by starting with the original unquantized model
checkpoint, not a previously quantized and then dequantized
model.** Thus, while you can convert your q4_1 model to FP16 or FP32
GGUF formats and then requantize, you might get better results if you
start with the original FP16 or FP32 GGUF format.

To use the quantize tool, install the GGML tools at ${GGUF}. Then,
you can, for example, convert a quantized model to f16 format:

[end default]: end
```
${GGUF}/quantize --allow-requantize your_quantized_model.gguf fake_unquantized_model.gguf f16
```


## Optimizing your model for server, desktop and mobile devices

While we have shown the export and execution of a small model on CPU
25 changes: 24 additions & 1 deletion docs/GGUF.md
Contributor Author

@mikekgfb what are these lines doing:

[shell default]: HF_TOKEN="${SECRET_HF_TOKEN_PERIODIC}" huggingface-cli login
[shell default]: TORCHCHAT_ROOT=${PWD} ./scripts/install_et.sh

They show up in the rendered markdown.

Contributor

These are commands for our CI, which extracts the shell commands from the docs and runs them to ensure that the instructions actually work. Because the commands in the document don't install et (we assume that has already been done), we need to do it in a side channel.

They should be commented out; I'll add that.

@@ -56,7 +56,6 @@ python3 torchchat.py generate --gguf-path ${GGUF_MODEL_PATH} --dso-path ${GGUF_S

```


### ExecuTorch export + generate
Before running this example, you must first [Set-up ExecuTorch](executorch_setup.md).
```
@@ -67,4 +66,28 @@ python3 torchchat.py export --gguf-path ${GGUF_MODEL_PATH} --output-pte-path ${G
python3 torchchat.py generate --gguf-path ${GGUF_MODEL_PATH} --pte-path ${GGUF_PTE_PATH} --tokenizer-path ${GGUF_TOKENIZER_PATH} --temperature 0 --prompt "Once upon a time" --max-new-tokens 15
```

### Advanced: loading unsupported GGUF formats in torchchat
GGUF formats not presently supported natively in torchchat can be
converted to one of the supported formats with GGUF's
[quantize](https://github.com/ggerganov/llama.cpp/tree/master/examples/quantize) utility.
If you convert to the FP16 or FP32 formats with GGUF's quantize utility, you can
then requantize these models with torchchat's native quantization workflow.

**Please note that quantizing and dequantizing is a lossy process, and
you will get the best results by starting with the original
unquantized model, not a previously quantized and then
dequantized model.**

As an example, suppose you have [llama.cpp cloned and installed](https://github.com/ggerganov/llama.cpp) at ~/repos/llama.cpp.
You can then convert a model to FP16 with the following command:

[skip default]: begin
```
~/repos/llama.cpp/quantize --allow-requantize path_of_model_you_are_converting_from.gguf path_for_model_you_are_converting_to.gguf f16
```
[skip default]: end

After the model is converted to a supported format like FP16, you can proceed using the instructions above.
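
Once converted, the resulting FP16 file can be loaded like any other supported GGUF model, for example (a sketch mirroring the generate example above; substitute your own model and tokenizer paths):

[skip default]: begin
```
python3 torchchat.py generate --gguf-path path_for_model_you_are_converting_to.gguf --tokenizer-path ${GGUF_TOKENIZER_PATH} --prompt "Once upon a time"
```
[skip default]: end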


[end default]: end