torchchat is a compact codebase showcasing the ability to run large language models (LLMs) seamlessly. With torchchat, you can run LLMs using Python, within your own (C/C++) application (desktop or server), and on iOS and Android.
## What can you do with torchchat?
- [Setup the Repo](#installation)
- [Download Models](#download-weights)
- [Run models via PyTorch / Python](#running-via-pytorch--python)
  - [Chat](#chat)
  - [Generate](#generate)
  - [Run chat in the Browser](#browser)
- [Export models for running on desktop/server without python](#desktopserver-execution)
  - [Use AOT Inductor for faster execution](#aoti-aot-inductor)
  - [Running in c++ using the runner](#running-native-using-our-c-runner)
- [Run on mobile](#mobile-execution)
  - [Setup](#set-up-executorch)
  - [Export a model for use on mobile](#export-for-mobile)
  - [Deploy and run on iOS](#deploy-and-run-on-ios)
  - [Deploy and run on Android](#deploy-and-run-on-android)
- [Evaluate a model](#eval)
- [Fine-tuned models from torchtune](#fine-tuned-models-from-torchtune)
- [Supported Models](#models)
- [Troubleshooting](#troubleshooting)
## Highlights
- Command line interaction with popular LLMs such as Llama 3, Llama 2, Stories, Mistral and more
### Disclaimer
The torchchat Repository Content is provided without any guarantees about performance or compatibility. In particular, torchchat makes available model architectures written in Python for PyTorch that may not perform in the same manner or meet the same standards as the original versions of those models. When using the torchchat Repository Content, including any model architectures, you are solely responsible for determining the appropriateness of using or redistributing the torchchat Repository Content and assume any risks associated with your use of the torchchat Repository Content or any models, outputs, or results, both alone and in combination with any other technologies. Additionally, you may have other legal obligations that govern your use of other content, such as the terms of service for third-party models, weights, data, or other technologies, and you are solely responsible for complying with all such obligations.
## Installation
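Once the repository is set up, you can sanity-check the install and list the available subcommands:

```
python3 torchchat.py --help
```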
## Download Weights
Most models use HuggingFace as the distribution channel, so you will need to create a HuggingFace account.

Create a HuggingFace user access token [as documented here](https://huggingface.co/docs/hub/en/security-tokens), then log in; you will be prompted for the newly created token:
```
huggingface-cli login
```
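If you prefer a non-interactive login, the CLI also accepts the token as a flag (a sketch; `HF_TOKEN` is a placeholder environment variable holding your token):

```
# assumes your token is stored in the HF_TOKEN environment variable
huggingface-cli login --token "$HF_TOKEN"
```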
Once this is done, torchchat will be able to download model artifacts from
HuggingFace.
To download the `llama3` model, run:

```
python3 torchchat.py download llama3
```
*NOTE: This command may prompt you to request access to llama3 via HuggingFace, if you do not already have access. Simply follow the prompts and re-run the command when access is granted.*

View available models with:

```
python3 torchchat.py list
```

You can also remove downloaded models with the remove command:

```
python3 torchchat.py remove llama3
```
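To locate a downloaded model's artifacts on disk, torchchat's CLI also exposes a `where` subcommand (treat the exact name as an assumption; confirm with `python3 torchchat.py --help`):

```
python3 torchchat.py where llama3
```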
## Running via PyTorch / Python
[Follow the installation steps if you haven't already](#installation)
### Chat
Designed for interactive and conversational use. In chat mode, the LLM engages in a back-and-forth dialogue with the user. It responds to queries, participates in discussions, provides explanations, and can adapt to the flow of conversation.

**Examples**
```bash
# Llama 3 8B Instruct
python3 torchchat.py chat llama3
```
```
# CodeLlama 7B for Python
python3 torchchat.py chat codellama
```
For more information run `python3 torchchat.py chat --help`
### Generate
Aimed at producing content based on specific prompts or instructions. In generate mode, the LLM focuses on creating text based on a detailed prompt or instruction. This mode is often used for generating written content like articles, stories, reports, or even creative writing like poetry.

**Examples**
```bash
python3 torchchat.py generate llama3
```
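You can also pass the prompt inline; the `--prompt` flag is the same one used by the export examples later in this README:

```bash
python3 torchchat.py generate llama3 --prompt "Hello my name is"
```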
For more information run `python3 torchchat.py generate --help`
### Browser
Designed for interactive graphical conversations using the familiar web browser GUI. The browser command provides a GUI-based experience to engage with the LLM in a back-and-forth dialogue. It responds to queries, participates in discussions, provides explanations, and can adapt to the flow of conversation.
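To start the browser chat, a command of the following shape should work (a sketch; `browser` is the subcommand listed in the contents above, and `llama3` is the alias used throughout this README):

```bash
python3 torchchat.py browser llama3
```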
Enter some text in the input box, then hit the enter key or click the “SEND” button.
## Quantizing your model (suggested for mobile)
Quantization is the process of converting a model into a more memory-efficient representation. It is particularly important for accelerators, where it helps exploit the available memory bandwidth and fit models into their often limited high-speed memory, and for mobile devices, where application memory is typically very constrained.

Depending on the model and the target device, different quantization recipes may be applied. torchchat contains two example configurations: `config/data/cuda.json`, optimized for GPU-based systems, and `config/data/mobile.json`, optimized for mobile systems. The GPU configuration targets memory bandwidth, a scarce resource even on powerful GPUs (and, to a lesser degree, memory footprint, to fit large models into a device's memory). The mobile configuration targets memory footprint, because on many devices a single application is limited to a small amount of memory.

You can use the quantization recipes in conjunction with any of the `chat`, `generate`, and `browser` commands to test their impact and accelerate model execution, and you can apply the same recipes to the `export` commands below to optimize the exported models.
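For example, applying the GPU recipe during generation (the `--quantize` flag and recipe path match the export examples below):

```bash
python3 torchchat.py generate llama3 --quantize config/data/cuda.json --prompt "Hello my name is"
```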
To adapt these recipes or write your own, please refer to the [quantization overview](docs/quantization.md).

With quantization, 32-bit floating-point numbers can be represented with as few as 8 or even 4 bits, together with a scale shared by a group of weights. This transformation is lossy and modifies the behavior of models. While research is being conducted on how to efficiently quantize large language models for use on mobile devices, this transformation invariably results in both quality loss and reduced control over the model's output, leading to an increased risk of undesirable responses, hallucination, and stuttering. In effect, a developer quantizing a model has a responsibility to understand and reduce these effects.
## Desktop/Server Execution
### AOTI (AOT Inductor)
AOT compiles models into machine code before execution, enhancing performance and predictability. It's particularly beneficial for frequently used models or those requiring quick start times. However, it may lead to larger binary sizes and lacks the runtime flexibility of eager mode.

**Examples**

The following example exports and executes the Llama3 8B Instruct model:

```
# Export the model (the exact export invocation is an assumption here;
# verify the flags with `python3 torchchat.py export --help`)
python3 torchchat.py export llama3 --quantize config/data/cuda.json --output-dso-path llama3.so

# Execute the exported model using Python
python3 torchchat.py generate llama3 --quantize config/data/cuda.json --dso-path llama3.so --prompt "Hello my name is"
```
NOTE: We use `--quantize config/data/cuda.json` to quantize the llama3 model to reduce model size and improve performance for on-device use cases.
### Running native using our C++ Runner
The end-to-end C++ [runner](runner/run.cpp) runs an `*.so` file exported in the previous step.
To build the runner binary on your Mac or Linux:
```bash
scripts/build_native.sh aoti
```
Execute:
```bash
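# Flag meanings as used in this example (interpretations assumed, not documented here):
#   -z  path to the tokenizer model
#   -l  model family version (3 selects Llama 3 handling)
#   -i  input prompt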
cmake-out/aoti_run model.so -z tokenizer.model -l 3 -i "Once upon a time"
```
## Mobile Execution
ExecuTorch enables you to optimize your model for execution on a mobile or embedded device, but can also be used on desktop for testing.
### Set Up ExecuTorch
Before running any commands in torchchat that require ExecuTorch, you must first install ExecuTorch.

To install ExecuTorch, run the following commands *from the torchchat root directory*. This will download the ExecuTorch repo to `./et-build/src` and install various ExecuTorch libraries to `./et-build/install`.
```
export TORCHCHAT_ROOT=${PWD}
export ENABLE_ET_PYBIND=true
./scripts/install_et.sh $ENABLE_ET_PYBIND
```
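Once the script finishes, you can sanity-check the checkout and install locations mentioned above:

```
ls ./et-build/src ./et-build/install
```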
### Export for mobile
The following example uses the Llama3 8B Instruct model:

```
# Export the model to ExecuTorch's .pte format (the exact export invocation is
# an assumption here; verify the flags with `python3 torchchat.py export --help`)
python3 torchchat.py export llama3 --quantize config/data/mobile.json --output-pte-path llama3.pte

# Execute the exported model using Python
python3 torchchat.py generate llama3 --device cpu --pte-path llama3.pte --prompt "Hello my name is"
```
NOTE: We use `--quantize config/data/mobile.json` to quantize the llama3 model to reduce model size and improve performance for on-device use cases.

For more details on quantization and what settings to use for your use case, visit our [Quantization documentation](docs/quantization.md) or run `python3 torchchat.py export`.
### Deploy and run on iOS
The following assumes you've completed the steps for [Setting up ExecuTorch](#set-up-executorch) and [exporting a model for mobile](#export-for-mobile).

Open the Xcode project:
```
open et-build/src/executorch/examples/demo-apps/apple_ios/LLaMA/LLaMA.xcodeproj
```

Then click the Play button to launch the app in Simulator.
To run on a device, given that you already have it set up for development, you'll need to have a provisioning profile with the [`increased-memory-limit`](https://developer.apple.com/documentation/bundleresources/entitlements/com_apple_developer_kernel_increased-memory-limit) entitlement. Just change the app's bundle identifier to whatever matches your provisioning profile with the aforementioned capability enabled.

After the app launches successfully, copy the exported ExecuTorch model (`.pte`) and tokenizer (`.bin`) files to the iLLaMA folder.

For the Simulator, just drag and drop both files onto the Simulator window and save them in the `On My iPhone > iLLaMA` folder.

For a device, open it in a separate Finder window, navigate to the Files tab, drag and drop both files into the iLLaMA folder, and wait until the copying finishes.
Now, follow the app's UI guidelines to pick the model and tokenizer files from the local filesystem and issue a prompt.

Read the [iOS documentation](docs/iOS.md) for more details on iOS.

### Deploy and run on Android

Read the [Android documentation](docs/Android.md) for more details on Android.
## Models

The following models are supported by torchchat and have associated aliases. Other models, including GGUF format, can be run by specifying a URL directly.

| Model | Mobile Friendly | Notes |
|---|---|---|

torchchat also supports loading of many models in the GGUF format. See the [documentation on GGUF](docs/GGUF.md) to learn how to use GGUF files.

While we describe how to use torchchat using the popular llama3 model, you can perform the example commands with any of these models.
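For example, assuming `stories15M` is one of the published aliases (confirm with `python3 torchchat.py list`), the generate example becomes:

```bash
python3 torchchat.py generate stories15M --prompt "Once upon a time"
```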
## Troubleshooting
**CERTIFICATE_VERIFY_FAILED**:
Run `pip install --upgrade certifi`.

**Access to model is restricted and you are not in the authorized list.**
Some models require an additional step to access. Follow the link provided in the error to get access.
## Acknowledgements
Thank you to the [community](docs/ACKNOWLEDGEMENTS.md) for all the awesome libraries and tools.