
Commit 4e1879c

Improve the README via .md components and reducing scope (#920)
* Update README.md
* Adding additional minor changes and Using markdown note blocks
* Minor typos and undoing changes that are more impactful
* adds

1 parent ea63e21, commit 4e1879c

File tree: 1 file changed

README.md: 83 additions & 59 deletions
torchchat is a small codebase showcasing the ability to run large language models (LLMs) seamlessly.


## What can you do with torchchat?
- [Run models via PyTorch / Python](#running-via-pytorch--python)
  - [Chat](#chat)
  - [Generate](#generate)
  - [Run chat in the Browser](#browser)
- [Run models on desktop/server without python](#desktopserver-execution)
  - [Use AOT Inductor for faster execution](#aoti-aot-inductor)
  - [Running in c++ using the runner](#running-native-using-our-c-runner)
- [Run models on mobile](#mobile-execution)
  - [Deploy and run on iOS](#deploy-and-run-on-ios)
  - [Deploy and run on Android](#deploy-and-run-on-android)
- [Evaluate a model](#eval)


## Highlights
- Command line interaction with popular LLMs such as Llama 3, Llama 2, Stories, Mistral and more
- PyTorch-native execution with performance
- Supports popular hardware and OS
  - Linux (x86)
  - Mac OS (M1/M2/M3)
  - Android (Devices that support XNNPACK)
  - iOS 17+ (iPhone 13 Pro+)
- Multiple data types including: float32, float16, bfloat16
- Multiple quantization schemes
- Multiple execution modes including: Python (Eager, Compile) or Native (AOT Inductor (AOTI), ExecuTorch)

### Download Weights
Most models use Hugging Face as the distribution channel, so you will need to create a Hugging Face account.

Create a Hugging Face user access token [as documented here](https://huggingface.co/docs/hub/en/security-tokens) with the `write` role.

Log into Hugging Face:

[prefix default]: HF_TOKEN="${SECRET_HF_TOKEN_PERIODIC}"

```
huggingface-cli login
```
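
If you prefer a non-interactive login (in CI, for example), the CLI can also take the token directly; a minimal sketch, assuming your token is already exported as `HF_TOKEN`:

```bash
# Non-interactive login; assumes HF_TOKEN holds a valid access token
huggingface-cli login --token "$HF_TOKEN"
```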

Once you are logged in, you can download models from Hugging Face:

```
python3 torchchat.py download llama3
```

> [!NOTE]
> This command may prompt you to request access to Llama 3 via
> Hugging Face, if you do not already have access. Simply follow the
> prompts and re-run the command when access is granted.

<details>
<summary>Additional Model Inventory Management Commands</summary>

```
# View available models
python3 torchchat.py list

# Query the location of a particular model
# This is useful in scripts when you do not want to hard-code paths
python3 torchchat.py where llama3

# Remove downloaded models
python3 torchchat.py remove llama3
```

More information about these commands can be found by adding the `--help` option.
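
As an illustration of the scripting use case mentioned above, `where` lets you capture a model's location instead of hard-coding it (a sketch; the variable name is ours):

```bash
# Resolve the model directory at runtime rather than hard-coding it
MODEL_DIR=$(python3 torchchat.py where llama3)
echo "llama3 artifacts live in: ${MODEL_DIR}"
```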

</details>


## Running via PyTorch / Python
[Follow the installation steps if you haven't already.](#installation)

### Chat
This mode allows you to chat with an LLM in an interactive fashion.

[skip default]: begin
```bash
# Llama 3 8B Instruct
python3 torchchat.py chat llama3
```
[skip default]: end

### Generate
This mode generates text based on an input prompt.
```bash
python3 torchchat.py generate llama3 --prompt "write me a story about a boy and his bear"
```

### Server
This mode starts a local server that you can send requests to, for example with curl.

[skip default]: begin
```bash
python3 torchchat.py server llama3
```
[skip default]: end

<details>
<summary>Sample Input + Output</summary>

> [!NOTE]
> Depending on the model configuration, this query might take a few minutes
> to respond.

```
curl http://127.0.0.1:5000/chat \
  -H "Content-Type: application/json" \
  ...
    }
  ]
}'

{"response":" I'm a software developer with a passion for building innovative and user-friendly applications. I have experience in developing web and mobile applications using various technologies such as Java, Python, and JavaScript. I'm always looking for new challenges and opportunities to learn and grow as a developer.\n\nIn my free time, I enjoy reading books on computer science and programming, as well as experimenting with new technologies and techniques. I'm also interested in machine learning and artificial intelligence, and I'm always looking for ways to apply these concepts to real-world problems.\n\nI'm excited to be a part of the developer community and to have the opportunity to share my knowledge and experience with others. I'm always happy to help with any questions or problems you may have, and I'm looking forward to learning from you as well.\n\nThank you for visiting my profile! I hope you find my information helpful and interesting. If you have any questions or would like to discuss any topics, please feel free to reach out to me. I"}
```

</details>

### Browser
This mode provides access to the model through a [Streamlit](https://streamlit.io/) app running on localhost.
Running the command automatically opens a tab in your browser.
```
streamlit run torchchat.py -- browser <model_name> <model_args>
```

For example, to quantize and chat with LLaMA3:

[skip default]: begin
```
streamlit run torchchat.py -- browser llama3 --quantize '{"precision": {"dtype":"float16"}, "executor":{"accelerator":"cpu"}}' --max-new-tokens 256 --compile
```
[skip default]: end

> [!TIP]
> For more information about these commands, please refer to the `--help` menu.

## Desktop/Server Execution

### AOTI (AOT Inductor)
[AOTI](https://pytorch.org/blog/pytorch2-2/) compiles models before execution for faster inference.

The following example exports and executes the Llama3 8B Instruct
model. The first command performs the actual export, the second runs the
exported model in Python, allowing users to test the exported model.

```
python3 torchchat.py export llama3 --output-dso-path exportedModels/llama3.so

# Execute the exported model using Python
python3 torchchat.py generate llama3 --dso-path exportedModels/llama3.so --prompt "Hello my name is"
```

> [!NOTE]
> If your machine has CUDA, add `--quantize config/data/cuda.json` when
> exporting for better performance. You'll also need to tell generate to use
> `--device cuda`.
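
Putting that note together with the example above, a CUDA run might look like the following (a sketch; it assumes a CUDA-capable machine and reuses the output path from above):

```bash
# Export with the CUDA quantization config
python3 torchchat.py export llama3 --quantize config/data/cuda.json --output-dso-path exportedModels/llama3.so

# Run the exported model on the GPU
python3 torchchat.py generate llama3 --dso-path exportedModels/llama3.so --device cuda --prompt "Hello my name is"
```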

### Running native using our C++ Runner
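
Roughly, the flow is to build the native runner and point it at the model exported above. A sketch (the `aoti` build argument and the tokenizer path here are assumptions, not commands verified against this revision):

```bash
# Build the native AOTI runner (assumed analogous to the ExecuTorch build below)
scripts/build_native.sh aoti

# Run the exported model; the tokenizer path is illustrative
cmake-out/aoti_run exportedModels/llama3.so -z `python3 torchchat.py where llama3`/tokenizer.model -i "Once upon a time"
```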

## Mobile Execution

[ExecuTorch](https://github.com/pytorch/executorch) enables you to optimize your model for execution on a
mobile or embedded device.

### Set Up ExecuTorch

Before running any commands in torchchat that require ExecuTorch, you
must first install ExecuTorch.

To install ExecuTorch, run the following commands. This will download the
ExecuTorch repo to ./et-build/src and install various ExecuTorch libraries to
./et-build/install.

> [!IMPORTANT]
> The following commands should be run from the torchchat root directory.

```
export TORCHCHAT_ROOT=${PWD}
./scripts/install_et.sh
```

### Test it out using our ExecuTorch runner

While ExecuTorch does not focus on desktop inference, it is capable
of building a runner to do so. This is handy for testing out PTE
models without sending them to a physical device.

Build the runner:
```bash
scripts/build_native.sh et
```

Get a PTE file if you don't have one already:
```
python3 torchchat.py export llama3 --quantize config/data/mobile.json --output-pte-path llama3.pte
```
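
With the runner built and a PTE file in hand, you can exercise the model directly on your desktop. A sketch (the `et_run` binary name and the tokenizer path are assumptions):

```bash
# Run the PTE model with the ExecuTorch runner built above
cmake-out/et_run llama3.pte -z `python3 torchchat.py where llama3`/tokenizer.model -i "Once upon a time"
```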
You can also run the exported PTE model from Python:

```
python3 torchchat.py generate llama3 --device cpu --pte-path llama3.pte --prompt "Hello my name is"
```

> [!NOTE]
> We use `--quantize config/data/mobile.json` to quantize the
> llama3 model to reduce model size and improve performance for
> on-device use cases.

For more details on quantization and what settings to use for your use
case visit our [Quantization documentation](docs/quantization.md) or
run `python3 torchchat.py export`

[end default]: end


### Deploy and run on iOS

The following assumes you've completed the steps for [Setting up ExecuTorch](#set-up-executorch).

<details>
<summary>Deploying with Xcode</summary>

#### Requirements
- Xcode 15.0 or later
- A development provisioning profile with the [`increased-memory-limit`](https://developer.apple.com/documentation/bundleresources/entitlements/com_apple_developer_kernel_increased-memory-limit) entitlement.

#### Steps

1. Open the Xcode project:

<a href="https://pytorch.org/executorch/main/_static/img/llama_ios_app.mp4">
<img src="https://pytorch.org/executorch/main/_static/img/llama_ios_app.png" width="600" alt="iOS app running a LlaMA model">
</a>

</details>


### Deploy and run on Android

The following assumes you've completed the steps for [Setting up ExecuTorch](#set-up-executorch). In torchchat, we show 2 approaches for Android deployment:

<details>
<summary>Approach 1 (Recommended): Android Studio</summary>

If you have Android Studio set up, and you have Java 17 and Android SDK 34 configured, you can follow these steps.
If your model uses tiktoken tokenizer (llama3 model for example), download the tiktoken AAR linked below instead.

Currently the tokenizer is built at compile time, so you need to re-build the app when you want to use a different tokenizer for a different model.

> [!NOTE]
> The script to build the AAR can be found [here](https://github.com/pytorch/executorch/blob/main/build/build_android_library.sh). If you need to tweak the tokenizer or runtime (for example, to use your own tokenizer or runtime library), you can modify the ExecuTorch code and use that script to build the AAR library.

[executorch-llama-torchchat-bpe.aar](https://ossci-android.s3.amazonaws.com/executorch/release/0.3/executorch-llama-bpe-rc1.aar) (SHASUM: 673af4a1338a93d47369b68ec0d52b8ea7f983a2)

[executorch-llama-torchchat-tiktoken.aar](https://ossci-android.s3.amazonaws.com/executorch/release/0.3/executorch-llama-tiktoken-rc1.aar) (SHASUM: 575190205dbb1ee932a277b50520dc4260a9a9cf)

For BPE tokenizer:
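
A sketch of that download with curl, using the BPE link above (the destination path `app/libs/executorch-llama.aar` is an assumption about where the app build expects the library):

```bash
# Fetch the BPE-tokenizer AAR; adjust the destination to your app layout
curl -L https://ossci-android.s3.amazonaws.com/executorch/release/0.3/executorch-llama-bpe-rc1.aar \
  -o app/libs/executorch-llama.aar
```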

Now, follow the app's UI guidelines to pick the model and tokenizer files from the local filesystem and issue a prompt.

<img src="https://pytorch.org/executorch/main/_static/img/android_llama_app.png" width="600" alt="Android app running a LlaMA model">

</details>
<details>
<summary>Approach 2: E2E Script</summary>

Alternatively, you can run `scripts/android_example.sh` which sets up Java, Android SDK Manager, Android SDK, Android emulator (if no physical device is found), builds the app, and launches it for you. It can be used if you don't have a GUI.

```
export USE_TIKTOKEN=ON # Set this only for tiktoken tokenizer
sh scripts/android_example.sh
```

</details>

## Eval

Uses the lm_eval library to evaluate model accuracy on a variety of
tasks. Defaults to wikitext and can be manually controlled using the
tasks and limit args.

See [Evaluation](docs/evaluation.md)

**Examples**

Eager mode:
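
For instance, a quick eager-mode check against the default wikitext task (a sketch; `--limit` mirrors the exported-model example below):

```bash
python3 torchchat.py eval llama3 --limit 5
```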

To evaluate an exported model, pass the exported artifact in the same way you would to `generate`:

```
python3 torchchat.py eval llama3 --pte-path llama3.pte --limit 5
```

## Models

The following models are supported by torchchat and have associated
aliases.

| Model | Mobile Friendly | Notes |
|------------------|---|---------------------|
|[tinyllamas/stories110M](https://huggingface.co/karpathy/tinyllamas/tree/main)|✅|Toy model for `generate`. Alias to `stories110M`.|
|[openlm-research/open_llama_7b](https://huggingface.co/openlm-research/open_llama_7b)|✅|Best for `generate`. Alias to `open-llama`.|

While we describe how to use torchchat using the popular llama3 model,
you can perform the example commands with any of these models.

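For example, swapping in an alias from the table is all it takes to rerun an earlier command with a different model:

```bash
# Same generate flow as before, now with the stories110M toy model
python3 torchchat.py generate stories110M --prompt "write me a story about a boy and his bear"
```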