Commit e027cac

mikekgfb authored and malfet committed
Update README.md (#499)
Update README
1 parent 143e557 commit e027cac

File tree: 1 file changed (+101, -99 lines)


README.md

Lines changed: 101 additions & 99 deletions
@@ -77,6 +77,8 @@ with `python3 torchchat.py remove llama3`.
* [Run exported ExecuTorch file on iOS or Android](#mobile-execution)
* in Chat mode
* in Generate mode
+* Fine-tuned models from torchtune
+

## Running via PyTorch / Python

@@ -85,8 +87,15 @@ Designed for interactive and conversational use.
In chat mode, the LLM engages in a back-and-forth dialogue with the user. It responds to queries, participates in discussions, provides explanations, and can adapt to the flow of conversation.

**Examples**
+
```bash
-python3 torchchat.py chat llama3
+# Llama 3 8B Instruct
+python3 torchchat.py chat llama3
+```
+
+```bash
+# CodeLlama 7B for Python
+python3 torchchat.py chat codellama
```

For more information run `python3 torchchat.py chat --help`
@@ -107,120 +116,33 @@ For more information run `python3 torchchat.py generate --help`

Designed for interactive graphical conversations using the familiar web browser GUI. The browser command provides a GUI-based experience for engaging with the LLM in a back-and-forth dialogue. It responds to queries, participates in discussions, provides explanations, and can adapt to the flow of conversation.

-## Quantizing your model (suggested for mobile)
-
-Quantization is the process of converting a model into a more memory-efficient representation. Quantization is particularly important for accelerators, to take advantage of the available memory bandwidth and fit into their often limited high-speed memory, and for mobile devices, to fit into the typically very limited memory of mobile devices.
-
-Depending on the model and the target device, different quantization recipes may be applied. Torchchat contains two example configurations to optimize performance for GPU-based systems (`config/data/qconfig_gpu.json`) and mobile systems (`config/data/qconfig_mobile.json`). The GPU configuration is targeted towards optimizing for memory bandwidth, which is a scarce resource on powerful GPUs (and, to a lesser degree, for memory footprint to fit large models into a device's memory). The mobile configuration is targeted towards optimizing for memory footprint, because on many devices a single application is limited to as little as a GB or less of memory.
-
-You can use the quantization recipes in conjunction with any of the `chat`, `generate` and `browser` commands to test their impact and accelerate model execution. You will apply these recipes to the `export` commands below, to optimize the exported models. For example:
-```
-python3 torchchat.py chat llama3 --quantize config/data/qconfig_gpu.json
-```
-To adapt these recipes or write your own, please refer to the [quantization overview](docs/quantization.md).
-
-*TO BE REPLACED BY SUITABLE WORDING PROVIDED BY LEGAL:*
-
-With quantization, 32-bit floating-point numbers can be represented with as few as 8 or even 4 bits, plus a scale shared by a group of these weights. This transformation is lossy and modifies the behavior of models. While research is being conducted on how to efficiently quantize large language models for use on mobile devices, this transformation invariably results in both quality loss and a reduced amount of control over the output of the models, leading to an increased risk of undesirable responses, hallucinations and stuttering. In effect, a developer quantizing a model has much control, and even more responsibility, to quantify and reduce these effects.
-
-
-## Exporting your model
-Compiles a model and saves it to run later.
-
-For more information run `python3 torchchat.py export --help`
-
-### Exporting for Desktop / Server-side via AOT Inductor
-
-```
-python3 torchchat.py export stories15M --output-dso-path stories15M.so
-```
-
-This produces a `.so` file, also called a Dynamic Shared Object. This `.so` can be linked into your own C++ program.
-
-### Running the exported `.so` via your own C++ application
-
-[TBF]
-
-### Exporting for Mobile via ExecuTorch
-
-Before exporting to an ExecuTorch pte file with the command below, you must first [set up ExecuTorch](docs/executorch_setup.md) inside torchchat.
-
-```
-python3 torchchat.py export stories15M --output-pte-path stories15M.pte
-```
-
-### Browser
-Run a chatbot in your browser, powered by the model you specify in the command.
-
**Examples**

```
-python3 torchchat.py browser stories15M --temperature 0 --num-samples 10
+python3 torchchat.py browser llama3 --temperature 0 --num-samples 10
```

*Running on http://127.0.0.1:5000* should be printed out on the terminal. Click the link or go to [http://127.0.0.1:5000](http://127.0.0.1:5000) in your browser to start interacting with it.

Enter some text in the input box, then hit the enter key or click the "SEND" button. After a second or two, the text you entered together with the generated text will be displayed. Repeat to have a conversation.

-### Eval
-Uses the lm_eval library to evaluate model accuracy on a variety of tasks. Defaults to wikitext and can be manually controlled using the `tasks` and `limit` args.
-
-For more information run `python3 torchchat.py eval --help`
-
-**Examples**
-
-Eager mode:
-```
-python3 torchchat.py eval stories15M -d fp32 --limit 5
-```
-
-To test the perplexity for a lowered or quantized model, pass it in the same way you would to generate:
-
-```
-python3 torchchat.py eval stories15M --pte-path stories15M.pte --limit 5
-```
-

-## Models

-The following models are supported by torchchat and have associated aliases. Other models, including GGUF format, can be run by specifying a URL directly.
-
-| Model | Mobile Friendly | Notes |
-|------------------|---|---------------------|
-|[meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct)||Tuned for `chat`. Alias to `llama3`.|
-|[meta-llama/Meta-Llama-3-8B](https://huggingface.co/meta-llama/Meta-Llama-3-8B)||Best for `generate`. Alias to `llama3-base`.|
-|[meta-llama/Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf)||Tuned for `chat`. Alias to `llama2`.|
-|[meta-llama/Llama-2-13b-chat-hf](https://huggingface.co/meta-llama/Llama-2-13b-chat-hf)||Tuned for `chat`. Alias to `llama2-13b-chat`.|
-|[meta-llama/Llama-2-70b-chat-hf](https://huggingface.co/meta-llama/Llama-2-70b-chat-hf)||Tuned for `chat`. Alias to `llama2-70b-chat`.|
-|[meta-llama/Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf)||Best for `generate`. Alias to `llama2-base`.|
-|[meta-llama/CodeLlama-7b-Python-hf](https://huggingface.co/meta-llama/CodeLlama-7b-Python-hf)||Tuned for Python and `generate`. Alias to `codellama`.|
-|[meta-llama/CodeLlama-34b-Python-hf](https://huggingface.co/meta-llama/CodeLlama-34b-Python-hf)||Tuned for Python and `generate`. Alias to `codellama-34b`.|
-|[mistralai/Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1)||Best for `generate`. Alias to `mistral-7b-v01-base`.|
-|[mistralai/Mistral-7B-Instruct-v0.1](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1)||Tuned for `chat`. Alias to `mistral-7b-v01-instruct`.|
-|[mistralai/Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2)||Tuned for `chat`. Alias to `mistral`.|
-|[tinyllamas/stories15M](https://huggingface.co/karpathy/tinyllamas/tree/main)||Toy model for `generate`. Alias to `stories15M`.|
-|[tinyllamas/stories42M](https://huggingface.co/karpathy/tinyllamas/tree/main)||Toy model for `generate`. Alias to `stories42M`.|
-|[tinyllamas/stories110M](https://huggingface.co/karpathy/tinyllamas/tree/main)||Toy model for `generate`. Alias to `stories110M`.|
-|[openlm-research/open_llama_7b](https://huggingface.co/openlm-research/open_llama_7b)||Best for `generate`. Alias to `open-llama`.|
+## Quantizing your model (suggested for mobile)

-Torchchat also supports loading of many models in the GGUF format. See the [documentation on GGUF](docs/GGUF.md) to learn how to use GGUF files.
+Quantization is the process of converting a model into a more memory-efficient representation. Quantization is particularly important for accelerators, to take advantage of the available memory bandwidth and fit into their often limited high-speed memory, and for mobile devices, to fit into the typically very limited memory of mobile devices.

-**Examples**
+Depending on the model and the target device, different quantization recipes may be applied. Torchchat contains two example configurations to optimize performance for GPU-based systems (`config/data/qconfig_gpu.json`) and mobile systems (`config/data/qconfig_mobile.json`). The GPU configuration is targeted towards optimizing for memory bandwidth, which is a scarce resource on powerful GPUs (and, to a lesser degree, for memory footprint to fit large models into a device's memory). The mobile configuration is targeted towards optimizing for memory footprint, because on many devices a single application is limited to as little as a GB or less of memory.

+You can use the quantization recipes in conjunction with any of the `chat`, `generate` and `browser` commands to test their impact and accelerate model execution. You will apply these recipes to the `export` commands below, to optimize the exported models. For example:
```
-# Llama 3 8B Instruct
-python3 torchchat.py chat llama3 --dtype fp16
+python3 torchchat.py chat llama3 --quantize config/data/qconfig_gpu.json
```
+To adapt these recipes or write your own, please refer to the [quantization overview](docs/quantization.md).
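The same recipes also apply when exporting. As an illustrative sketch only, the command below combines the `--quantize` flag above with the `--output-pte-path` flag from the export examples elsewhere in this README; check `python3 torchchat.py export --help` for the exact options:

```bash
# Sketch: quantized export for mobile, combining flags shown elsewhere in this README
python3 torchchat.py export llama3 --quantize config/data/qconfig_mobile.json --output-pte-path llama3.pte
```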

-```
-# Stories 15M
-python3 torchchat.py chat stories15M
-```
+*TO BE REPLACED BY SUITABLE WORDING PROVIDED BY LEGAL:*

-```
-# CodeLlama 7B for Python
-python3 torchchat.py chat codellama
-```
+With quantization, 32-bit floating-point numbers can be represented with as few as 8 or even 4 bits, plus a scale shared by a group of these weights. This transformation is lossy and modifies the behavior of models. While research is being conducted on how to efficiently quantize large language models for use on mobile devices, this transformation invariably results in both quality loss and a reduced amount of control over the output of the models, leading to an increased risk of undesirable responses, hallucinations and stuttering. In effect, a developer quantizing a model has much control, and even more responsibility, to quantify and reduce these effects.
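As a concrete illustration of the group-wise scheme described above, here is a minimal Python sketch of 4-bit quantization with a per-group scale. It is illustrative only, not torchchat's implementation; the function names and group size are arbitrary:

```python
# Minimal sketch of group-wise 4-bit quantization (illustrative only, not torchchat's code).
import torch

def quantize_groupwise(w: torch.Tensor, group_size: int = 32, bits: int = 4):
    """Quantize a 1-D float tensor in groups; each group shares one scale."""
    qmax = 2 ** (bits - 1) - 1                 # 7 for signed 4-bit values
    groups = w.reshape(-1, group_size)         # assumes numel is a multiple of group_size
    scales = groups.abs().amax(dim=1, keepdim=True) / qmax
    scales = scales.clamp(min=1e-8)            # avoid dividing by zero for all-zero groups
    q = torch.clamp(torch.round(groups / scales), -qmax - 1, qmax).to(torch.int8)
    return q, scales

def dequantize_groupwise(q: torch.Tensor, scales: torch.Tensor) -> torch.Tensor:
    return (q.to(torch.float32) * scales).reshape(-1)

w = torch.randn(4096)
q, s = quantize_groupwise(w)
print("max reconstruction error:", (w - dequantize_groupwise(q, s)).abs().max().item())
```

The nonzero reconstruction error printed at the end is the lossiness the paragraph above refers to.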

## Desktop Execution

@@ -295,7 +217,7 @@ scripts/build_native.sh et
Run:

```bash
-cmake-out/et_run model.pte -z tokenizer.model -i "Once upon a time"
+cmake-out/et_run llama3.pte -z tokenizer.model -i "Once upon a time"
```

## Fine-tuned models from torchtune
@@ -329,11 +251,91 @@ python3 torchchat.py generate \
--device cuda
```

+### Eval
+Uses the lm_eval library to evaluate model accuracy on a variety of tasks. Defaults to wikitext and can be manually controlled using the `tasks` and `limit` args.
+
+For more information run `python3 torchchat.py eval --help`
+
+**Examples**
+
+Eager mode:
+```
+python3 torchchat.py eval llama3 -d fp32 --limit 5
+```
+
+To test the perplexity for a lowered or quantized model, pass it in the same way you would to generate:
+
+```
+python3 torchchat.py eval llama3 --pte-path llama3.pte --limit 5
+```
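As a sketch of overriding the defaults mentioned above, the invocation below assumes the task and limit flags follow the `tasks`/`limit` naming in the sentence above; confirm the exact spellings with `python3 torchchat.py eval --help`:

```bash
# Sketch: evaluate on a different task with a sample cap (flag names assumed from the text above)
python3 torchchat.py eval llama3 --tasks hellaswag --limit 5
```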
+
+
+
+## Models
+
+The following models are supported by torchchat and have associated aliases. Other models, including GGUF format, can be run by specifying a URL directly.
+
+| Model | Mobile Friendly | Notes |
+|------------------|---|---------------------|
+|[meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct)||Tuned for `chat`. Alias to `llama3`.|
+|[meta-llama/Meta-Llama-3-8B](https://huggingface.co/meta-llama/Meta-Llama-3-8B)||Best for `generate`. Alias to `llama3-base`.|
+|[meta-llama/Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf)||Tuned for `chat`. Alias to `llama2`.|
+|[meta-llama/Llama-2-13b-chat-hf](https://huggingface.co/meta-llama/Llama-2-13b-chat-hf)||Tuned for `chat`. Alias to `llama2-13b-chat`.|
+|[meta-llama/Llama-2-70b-chat-hf](https://huggingface.co/meta-llama/Llama-2-70b-chat-hf)||Tuned for `chat`. Alias to `llama2-70b-chat`.|
+|[meta-llama/Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf)||Best for `generate`. Alias to `llama2-base`.|
+|[meta-llama/CodeLlama-7b-Python-hf](https://huggingface.co/meta-llama/CodeLlama-7b-Python-hf)||Tuned for Python and `generate`. Alias to `codellama`.|
+|[meta-llama/CodeLlama-34b-Python-hf](https://huggingface.co/meta-llama/CodeLlama-34b-Python-hf)||Tuned for Python and `generate`. Alias to `codellama-34b`.|
+|[mistralai/Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1)||Best for `generate`. Alias to `mistral-7b-v01-base`.|
+|[mistralai/Mistral-7B-Instruct-v0.1](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1)||Tuned for `chat`. Alias to `mistral-7b-v01-instruct`.|
+|[mistralai/Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2)||Tuned for `chat`. Alias to `mistral`.|
+|[tinyllamas/stories15M](https://huggingface.co/karpathy/tinyllamas/tree/main)||Toy model for `generate`. Alias to `stories15M`.|
+|[tinyllamas/stories42M](https://huggingface.co/karpathy/tinyllamas/tree/main)||Toy model for `generate`. Alias to `stories42M`.|
+|[tinyllamas/stories110M](https://huggingface.co/karpathy/tinyllamas/tree/main)||Toy model for `generate`. Alias to `stories110M`.|
+|[openlm-research/open_llama_7b](https://huggingface.co/openlm-research/open_llama_7b)||Best for `generate`. Alias to `open-llama`.|
+
+Torchchat also supports loading of many models in the GGUF format. See the [documentation on GGUF](docs/GGUF.md) to learn how to use GGUF files.
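For example, a local GGUF file can typically be run by pointing torchchat at it directly. The sketch below assumes the `--gguf-path` and `--tokenizer-path` flags described in the GGUF documentation, and the paths are placeholders; consult docs/GGUF.md for the exact invocation:

```bash
# Sketch: running a local GGUF checkpoint (placeholder paths; see docs/GGUF.md)
python3 torchchat.py generate --gguf-path ~/models/model.Q4_0.gguf \
  --tokenizer-path ~/models/tokenizer.model \
  --prompt "Once upon a time"
```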
+
+While we describe how to use torchchat with the popular llama3 model, you can run the example commands with any of the models listed above.
+
+
+
## Acknowledgements
Thank you to the [community](docs/ACKNOWLEDGEMENTS.md) for all the awesome libraries and tools
you've built around local LLM inference.

+* Georgi Gerganov and his [GGML](https://github.com/ggerganov/ggml)
+project, which shone a spotlight on community-based enablement and
+inspired so many other projects.
+
+* Andrej Karpathy and his
+[llama2.c](https://github.com/karpathy/llama2.c) project. So many
+great (and simple!) ideas in llama2.c that we have directly adopted
+(both ideas and code) from his repo. You can never go wrong by
+following Andrej's work.
+
+* Michael Gschwind, Bert Maher, Scott Wolchok, Bin Bao, Chen Yang,
+Huamin Li and Mu-Chu Li, who built the first version of nanogpt (`DSOGPT`)
+with AOT Inductor, proving that AOTI can be used to build efficient
+LLMs and that DSOs are a viable distribution format for models.
+[nanoGPT](https://github.com/karpathy/nanoGPT).
+
+* Bert Maher and his
+[llama2.so](https://github.com/bertmaher/llama2.so), which built on
+Andrej's llama2.c and on DSOGPT to close the loop on Llama models
+with AOTInductor.
+
+* Christian Puhrsch, Horace He, Joe Isaacson and many more for their
+many contributions in Accelerating GenAI models in the *"Anything,
+Fast!"* pytorch.org blogs, and, in particular, Horace He for [GPT,
+Fast!](https://github.com/pytorch-labs/gpt-fast), which we have
+directly adopted (both ideas and code) from his repo.
+
+* Mobius Labs as the authors of the HQQ quantization algorithms
+included in this distribution.
+
+
## License
-Torchchat is released under the [BSD 3 license](LICENSE). However, you may have other legal obligations
+Torchchat is released under the [BSD 3 license](LICENSE). (Additional code in this
+distribution is covered by the MIT and Apache Open Source licenses.) However, you may have other legal obligations
that govern your use of content, such as the terms of service for third-party models.
![image](https://github.com/pytorch/torchchat/assets/61328285/1cfccb53-c025-43d7-8475-94b34cf92339)
