
Commit 8374876

mikekgfb authored and malfet committed
update (#560)
1 parent 6cb09b9 commit 8374876

File tree: 1 file changed (+88, -100 lines)


README.md

Lines changed: 88 additions & 100 deletions
@@ -1,5 +1,28 @@
 # Chat with LLMs Everywhere
-torchchat is a compact codebase to showcase the capability of running large language models (LLMs) seamlessly across diverse platforms. With torchchat, you could run LLMs from with Python, your own (C/C++) application on mobile (iOS/Android), desktop or servers.
+torchchat is a compact codebase showcasing the ability to run large language models (LLMs) seamlessly. With torchchat, you can run LLMs using Python, within your own (C/C++) application (desktop or server) and on iOS and Android.
+
+
+
+## What can you do with torchchat?
+- [Setup the Repo](#installation)
+- [Download Models](#download-weights)
+- [Run models via PyTorch / Python](#running-via-pytorch--python)
+  - [Chat](#chat)
+  - [Generate](#generate)
+  - [Run chat in the Browser](#browser)
+- [Export models for running on desktop/server without python](#desktopserver-execution)
+  - [Use AOT Inductor for faster execution](#aoti-aot-inductor)
+  - [Running in c++ using the runner](#running-native-using-our-c-runner)
+- [Run on mobile](#mobile-execution)
+  - [Setup](#set-up-executorch)
+  - [Export a model for use on mobile](#export-for-mobile)
+  - [Deploy and run on iOS](#deploy-and-run-on-ios)
+  - [Deploy and run on Android](#deploy-and-run-on-android)
+- [Evaluate a model](#eval)
+- [Fine-tuned models from torchtune](#fine-tuned-models-from-torchtune)
+- [Supported Models](#models)
+- [Troubleshooting](#troubleshooting)
+

 ## Highlights
 - Command line interaction with popular LLMs such as Llama 3, Llama 2, Stories, Mistral and more
@@ -14,7 +37,8 @@ torchchat is a compact codebase to showcase the capability of running large lang
 - Multiple quantization schemes
 - Multiple execution modes including: Python (Eager, Compile) or Native (AOT Inductor (AOTI), ExecuTorch)

-*Disclaimer:* The torchchat Repository Content is provided without any guarantees about performance or compatibility. In particular, torchchat makes available model architectures written in Python for PyTorch that may not perform in the same manner or meet the same standards as the original versions of those models. When using the torchchat Repository Content, including any model architectures, you are solely responsible for determining the appropriateness of using or redistributing the torchchat Repository Content and assume any risks associated with your use of the torchchat Repository Content or any models, outputs, or results, both alone and in combination with any other technologies. Additionally, you may have other legal obligations that govern your use of other content, such as the terms of service for third-party models, weights, data, or other technologies, and you are solely responsible for complying with all such obligations.
+### Disclaimer
+The torchchat Repository Content is provided without any guarantees about performance or compatibility. In particular, torchchat makes available model architectures written in Python for PyTorch that may not perform in the same manner or meet the same standards as the original versions of those models. When using the torchchat Repository Content, including any model architectures, you are solely responsible for determining the appropriateness of using or redistributing the torchchat Repository Content and assume any risks associated with your use of the torchchat Repository Content or any models, outputs, or results, both alone and in combination with any other technologies. Additionally, you may have other legal obligations that govern your use of other content, such as the terms of service for third-party models, weights, data, or other technologies, and you are solely responsible for complying with all such obligations.


 ## Installation
@@ -42,7 +66,10 @@ python3 torchchat.py --help
 Most models use HuggingFace as the distribution channel, so you will need to create a HuggingFace account.

 Create a HuggingFace user access token [as documented here](https://huggingface.co/docs/hub/en/security-tokens).
-Run `huggingface-cli login`, which will prompt for the newly created token.
+Log into HuggingFace:
+```
+huggingface-cli login
+```

 Once this is done, torchchat will be able to download model artifacts from
 HuggingFace.
@@ -51,63 +78,30 @@ HuggingFace.
 python3 torchchat.py download llama3
 ```

-NOTE: This command may prompt you to request access to llama3 via HuggingFace, if you do not already have access. Simply follow the prompts and re-run the command when access is granted.
-
-View available models with `python3 torchchat.py list`. You can also remove downloaded models
-with `python3 torchchat.py remove llama3`.
+*NOTE: This command may prompt you to request access to llama3 via HuggingFace, if you do not already have access. Simply follow the prompts and re-run the command when access is granted.*

-### Common Issues
-
-* **CERTIFICATE_VERIFY_FAILED**:
-Run `pip install --upgrade certifi`.
-* **Access to model is restricted and you are not in the authorized list. Visit \[link\] to ask for access**:
-Some models require an additional step to access. Follow the link to fill out the request form on HuggingFace.
-
-## What can you do with torchchat?
-
-* Run models via PyTorch / Python:
-  * [Chat](#chat)
-  * [Generate](#generate)
-  * [Run via Browser](#browser)
-* [Quantizing your model (suggested for mobile)](#quantizing-your-model-suggested-for-mobile)
-* Export and run models in native environments (C++, your own app, mobile, etc.)
-  * [Export for desktop/servers via AOTInductor](#export-server)
-  * [Run exported .so file via your own C++ application](#run-server)
-    * in Chat mode
-    * in Generate mode
-  * [Export for mobile via ExecuTorch](#exporting-for-mobile-via-executorch)
-  * [Run exported ExecuTorch file on iOS or Android](#mobile-execution)
-    * in Chat mode
-    * in Generate mode
-* Fine-tuned models from torchtune
+View available models with:
+```
+python3 torchchat.py list
+```

+You can also remove downloaded models with the remove command:
+```
+python3 torchchat.py remove llama3
+```

 ## Running via PyTorch / Python
+[Follow the installation steps if you haven't](#installation)

 ### Chat
-Designed for interactive and conversational use.
-In chat mode, the LLM engages in a back-and-forth dialogue with the user. It responds to queries, participates in discussions, provides explanations, and can adapt to the flow of conversation.
-
-**Examples**
-
 ```bash
 # Llama 3 8B Instruct
 python3 torchchat.py chat llama3
 ```

-```
-# CodeLama 7B for Python
-python3 torchchat.py chat codellama
-```
-
 For more information run `python3 torchchat.py chat --help`

 ### Generate
-Aimed at producing content based on specific prompts or instructions.
-In generate mode, the LLM focuses on creating text based on a detailed prompt or instruction. This mode is often used for generating written content like articles, stories, reports, or even creative writing like poetry.
-
-
-**Examples**
 ```bash
 python3 torchchat.py generate llama3
 ```
@@ -116,10 +110,6 @@ For more information run `python3 torchchat.py generate --help`

 ### Browser

-Designed for interactive graphical conversations using the familiar web browser GUI. The browser command provides a GUI-based experience to engage with the LLM in a back-and-forth dialogue with the user. It responds to queries, participates in discussions, provides explanations, and can adapt to the flow of conversation.
-
-**Examples**
-
 ```
 python3 torchchat.py browser llama3 --temperature 0 --num-samples 10
 ```
@@ -130,96 +120,89 @@ Enter some text in the input box, then hit the enter key or click the “SEND”



-## Quantizing your model (suggested for mobile)
-
-Quantization is the process of converting a model into a more memory-efficient representation. Quantization is particularly important for accelerators -- to take advantage of the available memory bandwidth, and fit in the often limited high-speed memory in accelerators – and mobile devices – to fit in the typically very limited memory of mobile devices.
-
-Depending on the model and the target device, different quantization recipes may be applied. torchchat contains two example configurations to optimize performance for GPU-based systems `config/data/cuda.json`, and mobile systems `config/data/mobile.json`. The GPU configuration is targeted towards optimizing for memory bandwidth which is a scarce resource in powerful GPUs (and to a less degree, memory footprint to fit large models into a device's memory). The mobile configuration is targeted towards optimizing for memory fotoprint because in many devices, a single application is limited to as little as GB or less of memory.
-
-You can use the quantization recipes in conjunction with any of the `chat`, `generate` and `browser` commands to test their impact and accelerate model execution. You will apply these recipes to the `export` comamnds below, to optimize the exported models. For example:
-```
-python3 torchchat.py chat llama3 --quantize config/data/cuda.json
-```
-To adapt these recipes or wrote your own, please refer to the [quantization overview](docs/quantization.md).
-
-
-With quantization, 32-bit floating numbers can be represented with as few as 8 or even 4 bits, and a scale shared by a group of these weights. This transformation is lossy and modifies the behavior of models. While research is being conducted on how to efficiently quantize large language models for use in mobile devices, this transformation invariably results in both quality loss and a reduced amount of control over the output of the models, leading to an increased risk of undesirable responses, hallucination and stuttering. In effect, a developer quantizing a model has a responsibility to understand and reduce these effects.
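
To make the idea of group-wise scales concrete, here is a minimal sketch of symmetric 4-bit group quantization of a weight tensor. It illustrates the arithmetic only and is not torchchat's implementation, which is driven by the `--quantize` recipes described above:

```python
# Illustration only: symmetric 4-bit group-wise quantization with one scale
# shared per group of weights. Not torchchat's actual implementation.
import torch

def quantize_int4_groupwise(weight: torch.Tensor, groupsize: int = 32):
    """Quantize a 2D float weight to int4-range values plus one scale per group."""
    out_features, in_features = weight.shape
    w = weight.reshape(out_features, in_features // groupsize, groupsize)
    # One scale per group, chosen so the largest magnitude in the group maps to 7.
    scales = (w.abs().amax(dim=-1, keepdim=True) / 7.0).clamp(min=1e-8)
    q = torch.clamp(torch.round(w / scales), -8, 7).to(torch.int8)
    return q, scales

def dequantize(q: torch.Tensor, scales: torch.Tensor, shape) -> torch.Tensor:
    return (q.float() * scales).reshape(shape)

w = torch.randn(16, 64)
q, s = quantize_int4_groupwise(w, groupsize=32)
w_hat = dequantize(q, s, w.shape)
# The reconstruction is close but not exact: quantization is lossy.
print("max abs error:", (w - w_hat).abs().max().item())
```

Storing 4-bit values plus one scale per group of 32 weights brings the per-weight cost from 32 bits down to roughly 5 bits, which is where the memory-footprint savings described above come from.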
-
-## Desktop Execution
+## Desktop/Server Execution

 ### AOTI (AOT Inductor)
-AOT compiles models into machine code before execution, enhancing performance and predictability. It's particularly beneficial for frequently used models or those requiring quick start times. However, it may lead to larger binary sizes and lacks the runtime flexibility of eager mode.
+AOT compiles models before execution for faster inference.

-**Examples**
-The following example uses the Llama3 8B model.
+The following example exports and executes the Llama3 8B Instruct model.
 ```
 # Compile
 python3 torchchat.py export llama3 --output-dso-path llama3.so

-# Execute
-python3 torchchat.py generate llama3 --quantize config/data/cuda.json--dso-path llama3.so --prompt "Hello my name is"
+# Execute the exported model using Python
+python3 torchchat.py generate llama3 --quantize config/data/cuda.json --dso-path llama3.so --prompt "Hello my name is"
 ```

 NOTE: We use `--quantize config/data/cuda.json` to quantize the llama3 model to reduce model size and improve performance for on-device use cases.
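
A recipe like `config/data/cuda.json` is a small JSON file describing which quantization schemes to apply and with what options. The snippet below is purely illustrative; the scheme names and values are assumptions rather than the file's actual contents, so see `config/data/` and the [quantization overview](docs/quantization.md) for the real recipes:

```
{
  "precision": {"dtype": "bf16"},
  "linear:int4": {"groupsize": 256}
}
```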

-**Build Native Runner Binary**
+### Running native using our C++ Runner

-We provide an end-to-end C++ [runner](runner/run.cpp) that runs the `*.so` file exported after following the previous [examples](#aoti-aot-inductor) section. To build the runner binary on your Mac or Linux:
+The end-to-end C++ [runner](runner/run.cpp) runs an `*.so` file exported in the previous step.

+To build the runner binary on your Mac or Linux:
 ```bash
 scripts/build_native.sh aoti
 ```

-Run:
-
+Execute:
 ```bash
 cmake-out/aoti_run model.so -z tokenizer.model -l 3 -i "Once upon a time"
 ```

-### ExecuTorch
-
+## Mobile Execution
 ExecuTorch enables you to optimize your model for execution on a mobile or embedded device, but can also be used on desktop for testing.
-Before running ExecuTorch commands, you must first set-up ExecuTorch in torchchat, see [Set-up Executorch](docs/executorch_setup.md).

-**Examples**
-The following example uses the Llama3 8B model.
+### Set Up ExecuTorch
+Before running any commands in torchchat that require ExecuTorch, you must first install ExecuTorch.
+
+To install ExecuTorch, run the following commands *from the torchchat root directory*.
+This will download the ExecuTorch repo to ./et-build/src and install various ExecuTorch libraries to ./et-build/install.
 ```
-# Compile
+export TORCHCHAT_ROOT=${PWD}
+export ENABLE_ET_PYBIND=true
+./scripts/install_et.sh $ENABLE_ET_PYBIND
+```
+
+### Export for mobile
+The following example uses the Llama3 8B Instruct model.
+```
+# Export
 python3 torchchat.py export llama3 --quantize config/data/mobile.json --output-pte-path llama3.pte

 # Execute
 python3 torchchat.py generate llama3 --device cpu --pte-path llama3.pte --prompt "Hello my name is"
 ```
 NOTE: We use `--quantize config/data/mobile.json` to quantize the llama3 model to reduce model size and improve performance for on-device use cases.

-See below under [Mobile Execution](#mobile-execution) if you want to deploy and execute a model in your iOS or Android app.
-
+For more details on quantization and what settings to use for your use case, visit our [Quantization documentation](docs/quantization.md) or run `python3 torchchat.py export --help`.

+### Deploy and run on iOS
+The following assumes you've completed the steps for [Setting up ExecuTorch](#set-up-executorch) and [exporting a model for mobile](#export-for-mobile).

-## Mobile Execution
-**Prerequisites**
-
-ExecuTorch lets you run your model on a mobile or embedded device. The exported ExecuTorch .pte model file plus runtime is all you need.
+Open the Xcode project:
+```
+open et-build/src/executorch/examples/demo-apps/apple_ios/LLaMA/LLaMA.xcodeproj
+```
+Then click the Play button to launch the app in Simulator.

-Install [ExecuTorch](https://pytorch.org/executorch/stable/getting-started-setup.html) to get started.
+To run on a device, given that you already have it set up for development, you'll need to have a provisioning profile with the [`increased-memory-limit`](https://developer.apple.com/documentation/bundleresources/entitlements/com_apple_developer_kernel_increased-memory-limit) entitlement. Just change the app's bundle identifier to whatever matches your provisioning profile with the aforementioned capability enabled.

-Read the [iOS documentation](docs/iOS.md) for more details on iOS.
+After the app launches successfully, copy the exported ExecuTorch model (`.pte`) and tokenizer (`.bin`) files to the iLLaMA folder.

-Read the [Android documentation](docs/Android.md) for more details on Android.
+For the Simulator, just drag and drop both files onto the Simulator window and save them in the `On My iPhone > iLLaMA` folder.

-**Build Native Runner Binary**
+For a device, open it in a separate Finder window, navigate to the Files tab, drag and drop both files onto the iLLaMA folder, and wait until the copying finishes.

-We provide an end-to-end C++ [runner](runner/run.cpp) that runs the `*.pte` file exported after following the previous [ExecuTorch](#executorch) section. Notice that this binary is for demo purpose, please follow the respective documentations, to see how to build a similar application on iOS and Android. To build the runner binary on your Mac or Linux:
+Now, follow the app's UI guidelines to pick the model and tokenizer files from the local filesystem and issue a prompt.

-```bash
-scripts/build_native.sh et
-```
+*Click the image below to see it in action!*
+<a href="https://pytorch.org/executorch/main/_static/img/llama_ios_app.mp4">
+<img src="https://pytorch.org/executorch/main/_static/img/llama_ios_app.png" width="600" alt="iOS app running a LLaMA model">
+</a>

-Run:
+### Deploy and run on Android

-```bash
-cmake-out/et_run llama3.pte -z tokenizer.model -l 3 -i "Once upon a time"
-```

 ## Fine-tuned models from torchtune

@@ -273,7 +256,6 @@ python3 torchchat.py eval llama3 --pte-path llama3.pte --limit 5


 ## Models
-
 The following models are supported by torchchat and have associated aliases. Other models, including GGUF format, can be run by specifying a URL directly.

 | Model | Mobile Friendly | Notes |
@@ -298,7 +280,13 @@ torchchat also supports loading of many models in the GGUF format. See the [docu

 While we describe how to use torchchat using the popular llama3 model, you can perform the example commands with any of these models.
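
For example, to try one of the other supported models, swap its alias into the same commands. The `stories15M` alias below is only an illustration; use the exact alias names listed in the table above:

```
python3 torchchat.py generate stories15M --prompt "Once upon a time"
```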

+## Troubleshooting
+
+**CERTIFICATE_VERIFY_FAILED**:
+Run `pip install --upgrade certifi`.

+**Access to model is restricted and you are not in the authorized list.**
+Some models require an additional step to access. Follow the link provided in the error to get access.

 ## Acknowledgements
 Thank you to the [community](docs/ACKNOWLEDGEMENTS.md) for all the awesome libraries and tools
