clean up gguf doc #416

Merged · 4 commits · Apr 23, 2024

**`README.md`** (4 changes: 2 additions & 2 deletions)

```diff
@@ -3,7 +3,7 @@ Torchchat is a small codebase to showcase running large language models (LLMs) w
 
 ## Highlights
 - Command line interaction with popular LLMs such as Llama 3, Llama 2, Stories, Mistral and more
-- Supporting both GGUF fp32/16 and the Hugging Face checkpoint format
+- Supporting [some GGUF files](docs/GGUF.md) and the Hugging Face checkpoint format
 - PyTorch-native execution with performance
 - Supports popular hardware and OS
 - Linux (x86)
@@ -41,7 +41,7 @@ Most models use HuggingFace as the distribution channel, so you will need to cre
 account.
 
 Create a HuggingFace user access token [as documented here](https://huggingface.co/docs/hub/en/security-tokens).
-Run `huggingface-cli login`, which will prompt for the newly created token. 
+Run `huggingface-cli login`, which will prompt for the newly created token.
 
 Once this is done, torchchat will be able to download model artifacts from
 HuggingFace.
```
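
The login step above is interactive. For scripted setups, the `huggingface_hub` package that ships `huggingface-cli` exposes the same login as a Python call; a minimal sketch (not part of this PR), with a placeholder token:

```python
# Minimal sketch, assuming the `huggingface_hub` package is installed.
from huggingface_hub import login

# "hf_..." is a hypothetical placeholder; use the user access token
# created via https://huggingface.co/docs/hub/en/security-tokens
login(token="hf_...")
```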

**`build/gguf_loader.py`** (4 changes: 1 addition & 3 deletions)

```diff
@@ -15,11 +15,9 @@
 
 import torch
 
-wd = Path(__file__).parent.resolve()
-sys.path.append(str(wd))
 
 from gguf import GGUFValueType
-from model import ModelArgs, Transformer
+from .model import ModelArgs, Transformer
 from quantize import pack_scales_and_zeros, WeightOnlyInt4Linear
 
 from build.gguf_util import Q4_0, to_float
```
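
For background on the `Q4_0` import above (this sketch is not torchchat's `build/gguf_util.py` implementation): a GGUF Q4_0 tensor is stored in blocks of 32 weights, each block holding one float16 scale `d` followed by 16 bytes of packed 4-bit quants, which dequantize as `w = d * (q - 8)`. A minimal NumPy sketch under that standard ggml layout:

```python
# Illustrative sketch of GGUF Q4_0 dequantization, assuming the
# standard ggml block layout: 2-byte fp16 scale + 16 bytes of quants.
import numpy as np

def dequant_q4_0_block(block: bytes) -> np.ndarray:
    assert len(block) == 18  # one block encodes 32 weights
    d = np.frombuffer(block[:2], dtype=np.float16)[0]
    packed = np.frombuffer(block[2:], dtype=np.uint8)
    lo = (packed & 0x0F).astype(np.int8) - 8  # low nibbles: weights 0..15
    hi = (packed >> 4).astype(np.int8) - 8    # high nibbles: weights 16..31
    return d.astype(np.float32) * np.concatenate([lo, hi]).astype(np.float32)
```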

**`docs/GGUF.md`** (35 changes: 27 additions & 8 deletions)

````diff
@@ -1,15 +1,19 @@
 # Using GGUF Models
-We currently support the following models
+We support parsing [GGUF](https://github.com/ggerganov/ggml/blob/master/docs/gguf.md) files with the following tensor types:
 - F16
 - F32
 - Q4_0
 - Q6_K
 
-### Download
-First download a GGUF model and tokenizer. In this example, we use GGUF Q4_0 format.
+If an unsupported type is encountered while parsing a GGUF file, an exception is raised.
+
+We now go over an example of using GGUF files in the torchchat flow.
+
+### Download resources
+First download a GGUF model and tokenizer. In this example, we use a Q4_0 GGUF file. (Note that Q4_0 is only the dominant tensor type in the file, but the file also contains GGUF tensors of types Q6_K, F16, and F32.)
 
 ```
 # Download resources
 mkdir -p ggufs/open_orca
 cd ggufs/open_orca
 wget -O open_orca.Q4_0.gguf "https://huggingface.co/TheBloke/TinyLlama-1.1B-1T-OpenOrca-GGUF/resolve/main/tinyllama-1.1b-1t-openorca.Q4_0.gguf?download=true"
````
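
As an aside, the tensor-type mix that the new doc text mentions can be checked directly with the `gguf` Python package (published from the llama.cpp repo); a small sketch, assuming that package is installed and the file above has been downloaded:

```python
# Sketch: list the GGML tensor types present in a GGUF file, assuming
# the `gguf` package is installed (pip install gguf).
from collections import Counter

from gguf import GGUFReader

reader = GGUFReader("ggufs/open_orca/open_orca.Q4_0.gguf")
counts = Counter(t.tensor_type.name for t in reader.tensors)
# Expect mostly Q4_0, plus some Q6_K/F16/F32 tensors, per the note above.
print(counts)
```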

````diff
@@ -19,20 +23,35 @@
 cd ../..
 
 export GGUF_MODEL_PATH=ggufs/open_orca/open_orca.Q4_0.gguf
 export GGUF_TOKENIZER_PATH=ggufs/open_orca/tokenizer.model
+
+# Define export paths for examples below
+export GGUF_SO_PATH=/tmp/gguf_model.so
+export GGUF_PTE_PATH=/tmp/gguf_model.pte
 ```
 
-### Generate eager
+### Eager generate
+We can generate text in eager mode as we did before, but we now pass a GGUF file path.
 ```
-python torchchat.py generate --gguf-path ${GGUF_MODEL_PATH} --tokenizer-path ${GGUF_TOKENIZER_PATH} --temperature 0 --prompt "In a faraway land" --max-new-tokens 20
+python3 torchchat.py generate --gguf-path ${GGUF_MODEL_PATH} --tokenizer-path ${GGUF_TOKENIZER_PATH} --temperature 0 --prompt "Once upon a time" --max-new-tokens 15
 ```
 
-### ExecuTorch export + generate
+### AOTI export + generate
 ```
 # Convert the model for use
-python torchchat.py export --gguf-path ${GGUF_MODEL_PATH} --output-pte-path ${GGUF_PTE_PATH}
+python3 torchchat.py export --gguf-path ${GGUF_MODEL_PATH} --output-dso-path ${GGUF_SO_PATH}
 
 # Generate using the PTE model that was created by the export command
-python torchchat.py generate --gguf-path ${GGUF_MODEL_PATH} --pte-path ${GGUF_PTE_PATH} --tokenizer-path ${GGUF_TOKENIZER_PATH} --temperature 0 --prompt "In a faraway land" --max-new-tokens 20
+python3 torchchat.py generate --gguf-path ${GGUF_MODEL_PATH} --dso-path ${GGUF_SO_PATH} --tokenizer-path ${GGUF_TOKENIZER_PATH} --temperature 0 --prompt "Once upon a time" --max-new-tokens 15
+
 ```
 
+
+### ExecuTorch export + generate
+Before running this example, you must first [Set-up ExecuTorch](executorch_setup.md).
+```
+# Convert the model for use
+python3 torchchat.py export --gguf-path ${GGUF_MODEL_PATH} --output-pte-path ${GGUF_PTE_PATH}
+
+# Generate using the PTE model that was created by the export command
+python3 torchchat.py generate --gguf-path ${GGUF_MODEL_PATH} --pte-path ${GGUF_PTE_PATH} --tokenizer-path ${GGUF_TOKENIZER_PATH} --temperature 0 --prompt "Once upon a time" --max-new-tokens 15
+```
````
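
Since all three flows run with `--temperature 0`, sampling is greedy and the generated text should be comparable across eager, AOTI, and ExecuTorch execution. A hypothetical Python smoke test (not part of this PR) that runs the three commands above and prints their outputs side by side, assuming the environment variables from the download step are set and the export commands have already been run:

```python
# Hypothetical smoke test; uses only the flags shown in the doc above.
import os
import subprocess

COMMON = [
    "--gguf-path", os.environ["GGUF_MODEL_PATH"],
    "--tokenizer-path", os.environ["GGUF_TOKENIZER_PATH"],
    "--temperature", "0",
    "--prompt", "Once upon a time",
    "--max-new-tokens", "15",
]

def generate(*extra: str) -> str:
    """Run one torchchat generate command and capture its stdout."""
    result = subprocess.run(
        ["python3", "torchchat.py", "generate", *COMMON, *extra],
        check=True, capture_output=True, text=True,
    )
    return result.stdout

# Outputs also contain loading/timing banners, so compare by eye rather
# than with a strict equality assertion.
for name, text in [
    ("eager", generate()),
    ("aoti", generate("--dso-path", os.environ["GGUF_SO_PATH"])),
    ("executorch", generate("--pte-path", os.environ["GGUF_PTE_PATH"])),
]:
    print(f"--- {name} ---\n{text}")
```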