
Commit ba783a0

byjlw authored and malfet committed
Updated Documentation (#295)
1 parent 7ebe4d1 commit ba783a0

File tree

7 files changed

+1196
-715
lines changed


README.md

Lines changed: 122 additions & 715 deletions
Large diffs are not rendered by default.

docs/Android.md

Whitespace-only changes.

docs/GGUF.md

Lines changed: 38 additions & 0 deletions

# Using GGUF Models

We currently support GGUF models with the following tensor data types:
- F16
- F32
- Q4_0
- Q6_K

### Download
First download a GGUF model and tokenizer. In this example, we use the GGUF Q4_0 format.

```
mkdir -p ggufs/open_orca
cd ggufs/open_orca
wget -O open_orca.Q4_0.gguf "https://huggingface.co/TheBloke/TinyLlama-1.1B-1T-OpenOrca-GGUF/resolve/main/tinyllama-1.1b-1t-openorca.Q4_0.gguf?download=true"
wget -O tokenizer.model "https://github.com/karpathy/llama2.c/raw/master/tokenizer.model"
cd ../..

export GGUF_MODEL_PATH=ggufs/open_orca/open_orca.Q4_0.gguf
export GGUF_TOKENIZER_PATH=ggufs/open_orca/tokenizer.model
export GGUF_PTE_PATH=/tmp/gguf_model.pte
```

### Generate (eager)
```
python torchchat.py generate --gguf-path ${GGUF_MODEL_PATH} --tokenizer-path ${GGUF_TOKENIZER_PATH} --temperature 0 --prompt "In a faraway land" --max-new-tokens 20
```

### ExecuTorch export + generate
```
# Convert the model for use
python torchchat.py export --gguf-path ${GGUF_MODEL_PATH} --output-pte-path ${GGUF_PTE_PATH}

# Generate using the PTE model that was created by the export command
python torchchat.py generate --gguf-path ${GGUF_MODEL_PATH} --pte-path ${GGUF_PTE_PATH} --tokenizer-path ${GGUF_TOKENIZER_PATH} --temperature 0 --prompt "In a faraway land" --max-new-tokens 20
```

docs/MISC.md

Lines changed: 792 additions & 0 deletions
Large diffs are not rendered by default.

docs/Models.md

Lines changed: 20 additions & 0 deletions

# Models

These are the supported models:

| Model | Mobile Friendly | Notes |
|------------------|---|---------------------|
|[tinyllamas/stories15M](https://huggingface.co/karpathy/tinyllamas/tree/main)|||
|[tinyllamas/stories42M](https://huggingface.co/karpathy/tinyllamas/tree/main)|||
|[tinyllamas/stories110M](https://huggingface.co/karpathy/tinyllamas/tree/main)|||
|[openlm-research/open_llama_7b](https://huggingface.co/openlm-research/open_llama_7b)|||
|[meta-llama/Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf)|||
|[meta-llama/Llama-2-13b-chat-hf](https://huggingface.co/meta-llama/Llama-2-13b-chat-hf)|||
|[meta-llama/Llama-2-70b-chat-hf](https://huggingface.co/meta-llama/Llama-2-70b-chat-hf)|||
|[meta-llama/CodeLlama-7b-Python-hf](https://huggingface.co/meta-llama/CodeLlama-7b-Python-hf)|||
|[meta-llama/CodeLlama-34b-Python-hf](https://huggingface.co/meta-llama/CodeLlama-34b-Python-hf)|||
|[mistralai/Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1)|||
|[mistralai/Mistral-7B-Instruct-v0.1](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1)|||
|[mistralai/Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2)|||
|[meta-llama/Llama3](https://huggingface.co/meta-llama/Meta-Llama-3-8B)|||

See the [documentation on GGUF](GGUF.md) to learn how to use GGUF files.

docs/iOS.md

Whitespace-only changes.

docs/quantization.md

Lines changed: 224 additions & 0 deletions

# Quantization

### Introduction
Quantization focuses on reducing the precision of model parameters and computations from floating-point to lower-bit integers, such as 8-bit integers. This approach aims to minimize memory requirements, accelerate inference speeds, and decrease power consumption, making models more feasible for deployment on edge devices with limited computational resources. While quantization can potentially degrade the model's performance, the methods supported by torchchat are designed to mitigate this effect, maintaining a balance between efficiency and accuracy.
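
To build intuition for what a quantizer does, here is a minimal sketch (illustrative only, not torchchat's implementation) of symmetric int8 quantization, where a tensor is mapped to 8-bit integers plus a single floating-point scale:

```
import torch

def quantize_int8_symmetric(x: torch.Tensor):
    # One scale for the whole tensor: map the largest magnitude to 127.
    scale = x.abs().max().clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8)
    return q, scale

def dequantize_int8(q: torch.Tensor, scale: torch.Tensor):
    # Recover an approximation of the original floating-point values.
    return q.to(torch.float32) * scale

x = torch.randn(4, 8)
q, scale = quantize_int8_symmetric(x)
print((x - dequantize_int8(q, scale)).abs().max())  # small quantization error
```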

### Supported quantization techniques

| compression | FP precision | weight quantization | dynamic activation quantization |
|--|--|--|--|
| embedding table (symmetric) | fp32, fp16, bf16 | 8b (group/channel), 4b (group/channel) | n/a |
| linear operator (symmetric) | fp32, fp16, bf16 | 8b (group/channel) | n/a |
| linear operator (asymmetric) | n/a | 4b (group), a6w4dq | a8w4dq (group) |
| linear operator (asymmetric) with GPTQ | n/a | 4b (group) | n/a |
| linear operator (asymmetric) with HQQ | n/a | work in progress | n/a |

### Model precision (dtype precision setting)
For both export and generate (with eager, torch.compile, AOTI, and ET, across all backends; at present, mobile primarily supports fp32), you can specify the precision of the model with:

TODO: These need to be commands that can be copy-pasted
```
python generate.py --dtype [bf16 | fp16 | fp32] ...
python export.py --dtype [bf16 | fp16 | fp32] ...
```

Unlike gpt-fast, which uses bfloat16 as the default, torchchat uses float32 as the default. As a consequence, you will have to pass --dtype bf16 or --dtype fp16 on server / desktop for best performance.
Support for FP16 and BF16 is limited in many embedded processors. Additional ExecuTorch support for 16-bit floating-point types may be added in the future based on hardware support.

## Making your models fit and execute fast!

Next, we'll show you how to optimize your model for mobile execution (with ET) or get the most from your server or desktop hardware (with AOTI). The basic model build for mobile surfaces two issues: models quickly run out of memory, and execution can be slow. In this section, we show you how to fit your models in the limited memory of a mobile device and how to optimize execution speed -- both using quantization. This is the torchchat repo after all!
For high-performance devices such as GPUs, quantization reduces the memory bandwidth required to compute a result and takes advantage of the massive compute capabilities provided by today's server-based accelerators. In addition to computing results faster by avoiding memory-bandwidth stalls, quantization allows accelerators (which usually have a limited amount of memory) to store and process larger models than they otherwise could.
We can specify quantization parameters with the --quantize option. The quantize option takes a JSON/dictionary with quantizers and quantization options.
Both generate and export (for both ET and AOTI) accept quantization options. We only show a subset of the combinations to avoid a combinatorial explosion.
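
For reference, the option string is just a JSON object mapping quantizer names to their settings. The sketch below assembles such a string in Python; the quantizer names and keys shown are the ones used in the examples on this page, and combining multiple quantizers in one string is shown as an assumption; check the torchchat CLI help for the authoritative set of accepted keys.

```
import json

# Quantizer names and option keys as used in the examples on this page.
quantize_options = {
    "embedding": {"bitwidth": 8, "groupsize": 0},      # channelwise int8 embedding tables
    "linear:int8": {"bitwidth": 8, "groupsize": 256},  # groupwise int8 linear operators
}

# The resulting string is what gets passed on the command line, e.g.
#   python generate.py --checkpoint-path ${MODEL_PATH} --quant '<this string>' ...
print(json.dumps(quantize_options))
```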

### Embedding quantization (8 bit integer, channelwise & groupwise)
The simplest way to quantize embedding tables is with int8 "channelwise" (symmetric) quantization, where each value is represented by an 8-bit integer and a floating-point scale per embedding (channelwise quantization) or one scale for each group of values in an embedding (groupwise quantization).
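
To make the channelwise/groupwise distinction concrete, the sketch below (an illustration, not torchchat's implementation) computes one int8 scale per embedding row versus one scale per group of values within a row:

```
import torch

def quantize_embedding_int8(weight: torch.Tensor, groupsize: int = 0):
    # weight has shape (num_embeddings, dim).
    # groupsize == 0 -> one scale per embedding row (channelwise);
    # groupsize  > 0 -> one scale per `groupsize` consecutive values in a row (groupwise).
    num_emb, dim = weight.shape
    if groupsize == 0:
        groupsize = dim  # channelwise is just one group spanning the whole row
    w = weight.reshape(num_emb, dim // groupsize, groupsize)
    scales = (w.abs().amax(dim=-1, keepdim=True) / 127.0).clamp(min=1e-8)
    q = torch.clamp(torch.round(w / scales), -128, 127).to(torch.int8)
    return q.reshape(num_emb, dim), scales.squeeze(-1)

emb = torch.randn(1000, 256)
q_cw, s_cw = quantize_embedding_int8(emb, groupsize=0)  # one scale per row
q_gw, s_gw = quantize_embedding_int8(emb, groupsize=8)  # 256 / 8 = 32 scales per row
print(s_cw.shape, s_gw.shape)  # torch.Size([1000, 1]) torch.Size([1000, 32])
```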

*Channelwise quantization:*

We can do this in eager mode (optionally with torch.compile) by using the embedding quantizer with groupsize set to 0, which selects channelwise quantization:

TODO: Write this so that someone can copy-paste
```
python generate.py [--compile] --checkpoint-path ${MODEL_PATH} --prompt "Hello, my name is" --quant '{"embedding" : {"bitwidth": 8, "groupsize": 0}}' --device cpu
```

Then, export as follows:
```
python export.py --checkpoint-path ${MODEL_PATH} -d fp32 --quant '{"embedding": {"bitwidth": 8, "groupsize": 0} }' --output-pte-path ${MODEL_OUT}/${MODEL_NAME}_emb8b-cw.pte
```

Now you can run your model with the same command as before:
```
python generate.py --pte-path ${MODEL_OUT}/${MODEL_NAME}_emb8b-cw.pte --prompt "Hello my name is"
```

*Groupwise quantization:*

We can do this in eager mode (optionally with torch.compile) by using the embedding quantizer and specifying the group size:

```
python generate.py [--compile] --checkpoint-path ${MODEL_PATH} --prompt "Hello, my name is" --quant '{"embedding" : {"bitwidth": 8, "groupsize": 8}}' --device cpu
```

Then, export as follows:
```
python export.py --checkpoint-path ${MODEL_PATH} -d fp32 --quant '{"embedding": {"bitwidth": 8, "groupsize": 8} }' --output-pte-path ${MODEL_OUT}/${MODEL_NAME}_emb8b-gw256.pte
```

Now you can run your model with the same command as before:
```
python generate.py --pte-path ${MODEL_OUT}/${MODEL_NAME}_emb8b-gw256.pte --prompt "Hello my name is"
```

### Embedding quantization (4 bit integer, channelwise & groupwise)
Quantizing embedding tables with int4 provides even higher compression than int8, potentially at the cost of embedding quality and model output quality. In 4-bit embedding table quantization, each value is represented by a 4-bit integer, and two values are packed into each byte for greater compression efficiency.
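
The packing itself can be illustrated with a small sketch; this is only an illustration of the idea, and torchchat's actual storage layout may differ:

```
import torch

def pack_int4(q: torch.Tensor) -> torch.Tensor:
    # q holds signed 4-bit values in [-8, 7]; shift to unsigned [0, 15] first.
    u = (q + 8).to(torch.uint8).reshape(-1, 2)
    # Two 4-bit values per byte: first value in the low nibble, second in the high nibble.
    return u[:, 0] | (u[:, 1] << 4)

def unpack_int4(packed: torch.Tensor) -> torch.Tensor:
    low = (packed & 0x0F).to(torch.int8) - 8
    high = ((packed >> 4) & 0x0F).to(torch.int8) - 8
    return torch.stack([low, high], dim=1).reshape(-1)

q = torch.randint(-8, 8, (16,), dtype=torch.int8)
assert torch.equal(unpack_int4(pack_int4(q)), q)
```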

*Channelwise quantization:*

We can do this in eager mode (optionally with torch.compile) by using the embedding quantizer with groupsize set to 0, which selects channelwise quantization:

```
python generate.py [--compile] --checkpoint-path ${MODEL_PATH} --prompt "Hello, my name is" --quant '{"embedding" : {"bitwidth": 4, "groupsize": 0}}' --device cpu
```

Then, export as follows:
```
python export.py --checkpoint-path ${MODEL_PATH} -d fp32 --quant '{"embedding": {"bitwidth": 4, "groupsize": 0} }' --output-pte-path ${MODEL_OUT}/${MODEL_NAME}_emb4b-cw.pte
```

Now you can run your model with the same command as before:

```
python generate.py --pte-path ${MODEL_OUT}/${MODEL_NAME}_emb4b-cw.pte --prompt "Hello my name is"
```

*Groupwise quantization:*

We can do this in eager mode (optionally with torch.compile) by using the embedding quantizer and specifying the group size:

```
python generate.py [--compile] --checkpoint-path ${MODEL_PATH} --prompt "Hello, my name is" --quant '{"embedding" : {"bitwidth": 4, "groupsize": 8}}' --device cpu
```

Then, export as follows:
```
python export.py --checkpoint-path ${MODEL_PATH} -d fp32 --quant '{"embedding": {"bitwidth": 4, "groupsize": 8} }' --output-pte-path ${MODEL_OUT}/${MODEL_NAME}_emb4b-gw256.pte
```

Now you can run your model with the same command as before:
```
python generate.py --pte-path ${MODEL_OUT}/${MODEL_NAME}_emb4b-gw256.pte --prompt "Hello my name is"
```

### Linear 8 bit integer quantization (channelwise and groupwise)

The simplest way to quantize linear operators is with int8 quantization, where each value is represented by an 8-bit integer and a floating-point scale:

*Channelwise quantization:*

With channelwise quantization, all of the weights in an output channel of the linear operator share one floating-point scale.

We can do this in eager mode (optionally with torch.compile) by using the linear:int8 quantizer with groupsize set to 0, which selects channelwise quantization:

```
python generate.py [--compile] --checkpoint-path ${MODEL_PATH} --prompt "Hello, my name is" --quant '{"linear:int8" : {"bitwidth": 8, "groupsize": 0}}' --device cpu
```

Then, export as follows using ExecuTorch for mobile backends:

```
python export.py --checkpoint-path ${MODEL_PATH} -d fp32 --quant '{"linear:int8": {"bitwidth": 8, "groupsize": 0} }' --output-pte-path ${MODEL_OUT}/${MODEL_NAME}_int8.pte
```

Now you can run your model with the same command as before:

```
python generate.py --pte-path ${MODEL_OUT}/${MODEL_NAME}_int8.pte --checkpoint-path ${MODEL_PATH} --prompt "Hello my name is"
```

Or, export as follows for server/desktop deployments:

```
python export.py --checkpoint-path ${MODEL_PATH} -d fp32 --quant '{"linear:int8": {"bitwidth": 8, "groupsize": 0} }' --output-dso-path ${MODEL_OUT}/${MODEL_NAME}_int8.so
```

Now you can run your model with the same command as before:

```
python generate.py --dso-path ${MODEL_OUT}/${MODEL_NAME}_int8.so --checkpoint-path ${MODEL_PATH} --prompt "Hello my name is"
```
*Groupwise quantization:*

We can do this in eager mode (optionally with torch.compile) by using the linear:int8 quantizer and specifying the group size:

```
python generate.py [--compile] --checkpoint-path ${MODEL_PATH} --prompt "Hello, my name is" --quant '{"linear:int8" : {"bitwidth": 8, "groupsize": 8}}' --device cpu
```

Then, export as follows using ExecuTorch:

```
python export.py --checkpoint-path ${MODEL_PATH} -d fp32 --quant '{"linear:int8": {"bitwidth": 8, "groupsize": 256} }' --output-pte-path ${MODEL_OUT}/${MODEL_NAME}_int8-gw256.pte
```

Now you can run your model with the same command as before:

```
python generate.py --pte-path ${MODEL_OUT}/${MODEL_NAME}_int8-gw256.pte --checkpoint-path ${MODEL_PATH} --prompt "Hello my name is"
```

Or, export as follows for server/desktop deployments:

```
python export.py --checkpoint-path ${MODEL_PATH} -d fp32 --quant '{"linear:int8": {"bitwidth": 8, "groupsize": 256} }' --output-dso-path ${MODEL_OUT}/${MODEL_NAME}_int8-gw256.so
```

Now you can run your model with the same command as before:
```
python generate.py --dso-path ${MODEL_OUT}/${MODEL_NAME}_int8-gw256.so --checkpoint-path ${MODEL_PATH} -d fp32 --prompt "Hello my name is"
```

Please note that groupwise quantization works functionally, but has not been optimized for CUDA and CPU targets, where the best performance requires a groupwise quantized mixed-dtype linear operator.

**4-bit integer quantization (int4)**

To compress your model even more, 4-bit integer quantization may be used. To achieve good accuracy, we recommend the use of groupwise quantization, where (small to mid-sized) groups of int4 weights share a scale.
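
For intuition, the sketch below shows asymmetric quantization of a single weight group, with one scale and one zero point shared by the group (a simplified illustration, not torchchat's actual kernels):

```
import torch

def quantize_group_asym_int4(w_group: torch.Tensor):
    # Asymmetric quantization: map the group's [min, max] range onto the 16 int4 levels [0, 15].
    lo, hi = w_group.min(), w_group.max()
    scale = (hi - lo).clamp(min=1e-8) / 15.0
    zero_point = torch.round(-lo / scale)
    q = torch.clamp(torch.round(w_group / scale) + zero_point, 0, 15).to(torch.uint8)
    return q, scale, zero_point

def dequantize_group(q, scale, zero_point):
    return (q.to(torch.float32) - zero_point) * scale

group = torch.randn(32)  # e.g. groupsize 32, as in the export command below
q, s, zp = quantize_group_asym_int4(group)
print((group - dequantize_group(q, s, zp)).abs().max())  # small quantization error
```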

```
python export.py --checkpoint-path ${MODEL_PATH} -d fp32 --quant "{'linear:int4': {'groupsize' : 32} }" [ --output-pte-path ${MODEL_OUT}/${MODEL_NAME}_int4-gw32.pte | --output-dso-path ${MODEL_OUT}/${MODEL_NAME}_int4-gw32.dso]
```

Now you can run your model with the same command as before:

```
python generate.py [ --pte-path ${MODEL_OUT}/${MODEL_NAME}_int4-gw32.pte | --dso-path ${MODEL_OUT}/${MODEL_NAME}_int4-gw32.dso] --prompt "Hello my name is"
```

**4-bit integer quantization (8da4w)**

To compress your model even more, 4-bit integer quantization may be used. To achieve good accuracy, we recommend the use of groupwise quantization, where (small to mid-sized) groups of int4 weights share a scale. We also dynamically quantize activations to 8 bits, which gives this scheme its name (8da4w = 8b dynamically quantized activations with 4b weights) and boosts performance.
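
Conceptually, "dynamically quantized activations" means the activation scale is computed at runtime from the activations actually flowing through the layer, rather than being fixed ahead of time. A rough sketch of that idea (not the fused kernels torchchat actually dispatches to):

```
import torch

def dynamic_int8_activation(x: torch.Tensor):
    # The scale is derived from the current activation tensor at inference time.
    scale = x.abs().amax().clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8)
    return q, scale

# In 8da4w the int8 activations would then feed an int4-weight matmul;
# here we just dequantize to show the round trip.
x = torch.randn(1, 64)
q, s = dynamic_int8_activation(x)
print((x - q.to(torch.float32) * s).abs().max())
```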

```
python export.py --checkpoint-path ${MODEL_PATH} -d fp32 --quant "{'linear:8da4w': {'groupsize' : 7} }" [ --output-pte-path ${MODEL_OUT}/${MODEL_NAME}_8da4w.pte | ...dso... ]
```

Now you can run your model with the same command as before:

```
python generate.py [ --pte-path ${MODEL_OUT}/${MODEL_NAME}_8da4w.pte | ...dso...] --prompt "Hello my name is"
```

**Quantization with GPTQ (gptq)**

```
python export.py --checkpoint-path ${MODEL_PATH} -d fp32 --quant "{'linear:gptq': {'groupsize' : 32} }" [ --output-pte-path ${MODEL_OUT}/${MODEL_NAME}_gptq.pte | ...dso... ] # may require additional options, check with AO team
```

Now you can run your model with the same command as before:

```
python generate.py [ --pte-path ${MODEL_OUT}/${MODEL_NAME}_gptq.pte | ...dso...] --prompt "Hello my name is"
```

**Adding additional quantization schemes (hqq)**

We invite contributors to submit established quantization schemes, with accuracy and performance results demonstrating soundness.

- Explain terminology: weight size vs activation size, per-channel vs groupwise vs per-tensor, embedding quantization, linear quantization.
- Explain GPTQ and RTN quantization approaches, with examples.
- Show the general form of the --quantize parameter.
- Describe how to choose a quantization scheme. Which factors should be taken into account? Give concrete recommendations for use cases, especially mobile.
- Quantization reference: describe options for the --quant parameter.
- Show a table with performance/accuracy metrics.
- Quantization support matrix? torchchat Quantization Support Matrix
