* [Run exported ExecuTorch file on iOS or Android](#mobile-execution)
* in Chat mode
* in Generate mode
* Fine-tuned models from torchtune

## Running via PyTorch / Python
### Chat

Designed for interactive and conversational use.

In chat mode, the LLM engages in a back-and-forth dialogue with the user. It responds to queries, participates in discussions, provides explanations, and can adapt to the flow of conversation.

**Examples**
```bash
# Llama 3 8B Instruct
python3 torchchat.py chat llama3
```
```bash
# CodeLlama 7B for Python
python3 torchchat.py chat codellama
```
For more information run `python3 torchchat.py chat --help`
### Generate

For more information run `python3 torchchat.py generate --help`
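
For example, a one-shot completion can be requested directly from the command line. A minimal sketch, assuming the `--prompt` flag listed by `--help`:

```bash
# One-shot completion with Llama 3 8B Instruct (--prompt assumed from --help)
python3 torchchat.py generate llama3 --prompt "write me a story about a boy and his bear"
```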
### Browser

Designed for interactive graphical conversations using the familiar web browser GUI. The `browser` command provides a GUI-based experience for engaging the LLM in a back-and-forth dialogue. It responds to queries, participates in discussions, provides explanations, and can adapt to the flow of conversation.
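
To start the GUI, point the `browser` subcommand at a model alias. A minimal sketch; check `python3 torchchat.py browser --help` for the exact flags:

```bash
# Launch the chat web UI backed by Llama 3 8B Instruct
python3 torchchat.py browser llama3
```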
*Running on http://127.0.0.1:5000* should be printed to the terminal. Click the link or go to [http://127.0.0.1:5000](http://127.0.0.1:5000) in your browser to start interacting with it.

Enter some text in the input box, then hit the enter key or click the “SEND” button. After a second or two, the text you entered together with the generated text will be displayed. Repeat to have a conversation.
## Quantizing your model (suggested for mobile)
Quantization is the process of converting a model into a more memory-efficient representation. It is particularly important for accelerators, where it makes better use of the available memory bandwidth and helps models fit in their often limited high-speed memory, and for mobile devices, where memory is typically very constrained.

Depending on the model and the target device, different quantization recipes may be applied. Torchchat contains two example configurations: `config/data/qconfig_gpu.json`, optimized for GPU-based systems, and `config/data/qconfig_mobile.json`, for mobile systems. The GPU configuration targets memory bandwidth, which is a scarce resource on powerful GPUs (and, to a lesser degree, memory footprint, to fit large models into a device's memory). The mobile configuration targets memory footprint, because on many devices a single application is limited to as little as a gigabyte or less of memory.

You can use the quantization recipes in conjunction with any of the `chat`, `generate`, and `browser` commands to test their impact and accelerate model execution. You will apply these recipes to the `export` commands below to optimize the exported models. For example:
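
A minimal sketch, assuming a recipe file is passed to one of these commands via the `--quantize` flag:

```bash
# Chat with Llama 3 quantized using the example GPU recipe
python3 torchchat.py chat llama3 --quantize config/data/qconfig_gpu.json
```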
To adapt these recipes or write your own, please refer to the [quantization overview](docs/quantization.md).

*TO BE REPLACED BY SUITABLE WORDING PROVIDED BY LEGAL:*

With quantization, 32-bit floating-point numbers can be represented with as few as 8 or even 4 bits, plus a scale shared by a group of those weights. This transformation is lossy and modifies the behavior of models. While research is being conducted on how to efficiently quantize large language models for use on mobile devices, this transformation invariably results in both quality loss and reduced control over the model's output, leading to an increased risk of undesirable responses, hallucinations, and stuttering. In effect, a developer who quantizes a model has both more control over and more responsibility for quantifying and reducing these effects.
## Exporting your model

Compiles a model and saves it to run later.

For more information run `python3 torchchat.py export --help`
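
A minimal sketch, assuming an `--output-pte-path` flag for ExecuTorch output; it produces the `llama3.pte` file used by the native runner below:

```bash
# Export Llama 3 to an ExecuTorch .pte file for on-device execution
python3 torchchat.py export llama3 --output-pte-path llama3.pte
```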
## Desktop Execution
Build the native ExecuTorch runner with `scripts/build_native.sh et`.

Run:
```bash
cmake-out/et_run llama3.pte -z tokenizer.model -i "Once upon a time"
```

### Eval

Uses the lm_eval library to evaluate model accuracy on a variety of tasks. Defaults to the wikitext task; task selection and sample counts can be controlled with the `--tasks` and `--limit` arguments.

For more information run `python3 torchchat.py eval --help`
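
A minimal sketch, assuming `--tasks` and `--limit` as the control arguments:

```bash
# Evaluate Llama 3 on wikitext, capped at 10 samples
python3 torchchat.py eval llama3 --tasks wikitext --limit 10
```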
## Models

The following models are supported by torchchat and have associated aliases. Other models, including models in GGUF format, can be run by specifying a URL directly.

| Model | Mobile Friendly | Notes |
|------------------|---|---------------------|
|[meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct)|✅|Tuned for `chat`. Alias to `llama3`.|
|[meta-llama/Meta-Llama-3-8B](https://huggingface.co/meta-llama/Meta-Llama-3-8B)|✅|Best for `generate`. Alias to `llama3-base`.|
|[meta-llama/Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf)|✅|Tuned for `chat`. Alias to `llama2`.|
|[meta-llama/Llama-2-13b-chat-hf](https://huggingface.co/meta-llama/Llama-2-13b-chat-hf)||Tuned for `chat`. Alias to `llama2-13b-chat`.|
|[meta-llama/Llama-2-70b-chat-hf](https://huggingface.co/meta-llama/Llama-2-70b-chat-hf)||Tuned for `chat`. Alias to `llama2-70b-chat`.|
|[meta-llama/Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf)|✅|Best for `generate`. Alias to `llama2-base`.|
|[meta-llama/CodeLlama-7b-Python-hf](https://huggingface.co/meta-llama/CodeLlama-7b-Python-hf)|✅|Tuned for Python and `generate`. Alias to `codellama`.|
|[meta-llama/CodeLlama-34b-Python-hf](https://huggingface.co/meta-llama/CodeLlama-34b-Python-hf)|✅|Tuned for Python and `generate`. Alias to `codellama-34b`.|
|[mistralai/Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1)|✅|Best for `generate`. Alias to `mistral-7b-v01-base`.|
|[mistralai/Mistral-7B-Instruct-v0.1](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1)|✅|Tuned for `chat`. Alias to `mistral-7b-v01-instruct`.|
|[mistralai/Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2)|✅|Tuned for `chat`. Alias to `mistral`.|
|[tinyllamas/stories15M](https://huggingface.co/karpathy/tinyllamas/tree/main)|✅|Toy model for `generate`. Alias to `stories15M`.|
|[tinyllamas/stories42M](https://huggingface.co/karpathy/tinyllamas/tree/main)|✅|Toy model for `generate`. Alias to `stories42M`.|
|[tinyllamas/stories110M](https://huggingface.co/karpathy/tinyllamas/tree/main)|✅|Toy model for `generate`. Alias to `stories110M`.|
|[openlm-research/open_llama_7b](https://huggingface.co/openlm-research/open_llama_7b)|✅|Best for `generate`. Alias to `open-llama`.|

Torchchat also supports loading many models in the GGUF format. See the [documentation on GGUF](docs/GGUF.md) to learn how to use GGUF files.
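
A minimal sketch, assuming a `--gguf-path` flag for local GGUF files; see the GGUF documentation above for the supported options:

```bash
# Generate from a local GGUF checkpoint (--gguf-path assumed; see docs/GGUF.md)
python3 torchchat.py generate --gguf-path path/to/model.gguf --prompt "Once upon a time"
```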
While we describe how to use torchchat with the popular llama3 model, you can run the example commands with any of the models above.
## Acknowledgements
Thank you to the [community](docs/ACKNOWLEDGEMENTS.md) for all the awesome libraries and tools you've built around local LLM inference.

* Georgi Gerganov and his [GGML](https://github.com/ggerganov/ggml) project, for shining a spotlight on community-based enablement and inspiring so many other projects.
* Andrej Karpathy and his [llama2.c](https://github.com/karpathy/llama2.c) project. So many great (and simple!) ideas in llama2.c that we have directly adopted (both ideas and code) from his repo. You can never go wrong by following Andrej's work.
* Michael Gschwind, Bert Maher, Scott Wolchok, Bin Bao, Chen Yang, Huamin Li, and Mu-Chu Li, who built the first version of nanogpt (`DSOGPT`) with AOT Inductor, proving that AOTI can be used to build efficient LLMs and that DSOs are a viable distribution format for models.