This example demonstrates how to run [Llama models](https://www.llama.com/) on mobile via ExecuTorch. We use XNNPACK to accelerate performance and 4-bit groupwise PTQ quantization to fit the model on a phone.
Here are supported models:
Pretrained models are not included in this repo. You can download them from [here](https://ai.meta.com/resources/models-and-libraries/llama-downloads/).
This page contains the basic recipe for running Llama. See the [Llama utils page](./UTILS.md) for more advanced use-cases such as fine-tuning and running smaller models for educational purposes.
# What is Llama?
Llama is a collection of large language models that use publicly available data for training. These models are based on the transformer architecture, which allows them to process input sequences of arbitrary length and generate output sequences of variable length. One of the key features of Llama models is their ability to generate coherent and contextually relevant text. This is achieved through the use of attention mechanisms, which allow the model to focus on different parts of the input sequence as it generates output. Additionally, Llama models use a technique called “masked language modeling” to pre-train the model on a large corpus of text, which helps it learn to predict missing words in a sentence.
## Tested on
- macOS M1/M2, Linux.
- For Llama 3 8B, your device may require at least 32GB RAM. If this is a constraint for you, please try the [smaller stories model](./UTILS.md).
## Step 1: Setup
> :warning: **Double-check your Python environment**: make sure `conda activate <VENV>` is run before running all of the bash and Python scripts below.
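
For example, a minimal environment setup might look like the following (the environment name and Python version here are placeholders, not requirements of this repo):

```
conda create -yn executorch python=3.10
conda activate executorch
```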
You can export and run the original Llama 3 8B instruct model.
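
As an illustration, an export command for an XNNPACK-accelerated, 4-bit groupwise quantized model generally takes the following shape. The module path and flag names here are assumptions inferred from the evaluation command later on this page; consult the export script's `--help` output for the authoritative options:

```
python -m examples.models.llama.export_llama \
  -c <consolidated.00.pth> \
  -p <params.json> \
  -kv -X \
  -qmode 8da4w --group_size 128 \
  -d fp32 \
  --embedding-quantize 4,32 \
  --output_name="llama3_kv_xnn_qe_4_32.pte"
```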
Due to the larger vocabulary size of Llama 3, we recommend quantizing the embeddings with `--embedding-quantize 4,32` as shown above to further reduce the model size.
## Step 4: Run on your computer to validate
1. Build executorch with optimized CPU performance as follows. Build options available [here](https://github.com/pytorch/executorch/blob/main/CMakeLists.txt#L59).
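
A rough sketch of such a build (the linked CMakeLists.txt is the authoritative list of option names; enable the ExecuTorch XNNPACK and optimized-kernel options from it as needed):

```
cmake -DCMAKE_BUILD_TYPE=Release -Bcmake-out .
cmake --build cmake-out -j$(nproc)
```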
### Android
Please refer to [this tutorial](https://pytorch.org/executorch/main/llm/llama-demo-android.html) for full instructions on building the Android LLAMA Demo App.
### Evaluate model accuracy
> Forewarning: Model evaluation without a GPU may take a long time, especially on larger models.
We use [LM Eval](https://github.com/EleutherAI/lm-evaluation-harness) to evaluate model accuracy.
For base models, use the following example command to calculate perplexity on UncycloText.
```
python -m examples.models.llama.eval_llama \
  -c <checkpoint.pth> \
  -p <params.json> \
  -t <tokenizer.model/bin> \
  -kv \
  -d <checkpoint dtype> \
  --max_seq_len <max sequence length> \
  --limit <number of samples>
```
For instruct models, use the following example command to calculate the MMLU score.
```
python -m examples.models.llama.eval_llama \
  -c <checkpoint.pth> \
  -p <params.json> \
  -t <tokenizer.model/bin> \
  -kv \
  -d <checkpoint dtype> \
  --tasks mmlu \
  --num_fewshot 5 \
  --max_seq_len <max sequence length>
```
See the [Llama utils page](./UTILS.md) for more advanced use-cases such as fine-tuning, running smaller models for educational purposes, and quick iteration and verification.
# What is coming next?
## Quantization
- Lower bit quantization
## Models
- Enabling more generative AI models and architectures.
- Enable support for multi-modal models like LLaVA.
## Performance
- Performance improvement via techniques such as speculative decoding
- Enabling Llama and other architectures via Vulkan
- Enabling performant execution of widely used quantization schemes.
# Notes
This example tries to reuse the Python code, with minimal modifications to make it compatible with current ExecuTorch:
1. Since ExecuTorch does not support the complex Tensor data type, we use customized functions to implement rotary embedding with real numbers. Please see [GitHub issue: Support complex data type in ExecuTorch](https://github.com/pytorch/executorch/issues/886).
# Llama utils

## Running a smaller model for educational purposes

If you want to deploy and run a smaller model for educational purposes, you can try the stories110M model. It has the same architecture as Llama, just smaller. It can also be used for fast iteration and verification during development.
### Export:
From `executorch` root:
1. Download `stories110M.pt` and `tokenizer.model` from GitHub.
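
A sketch of the remaining steps (the URLs below point at Karpathy's tinyllamas checkpoint and the llama2.c tokenizer; the `params.json` values and export flags mirror the Llama recipe above and should be treated as assumptions):

```
wget "https://huggingface.co/karpathy/tinyllamas/resolve/main/stories110M.pt"
wget "https://raw.githubusercontent.com/karpathy/llama2.c/master/tokenizer.model"

# stories110M uses a small Llama-style config; write it to params.json
echo '{"dim": 768, "multiple_of": 32, "n_heads": 12, "n_layers": 12, "norm_eps": 1e-05, "vocab_size": 32000}' > params.json

# Export an XNNPACK-delegated .pte with a KV cache
python -m examples.models.llama.export_llama -c stories110M.pt -p params.json -X -kv
```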
## Smaller models delegated to other backends

Currently we support lowering the stories model to other backends, including CoreML, MPS and QNN. Please refer to the instructions for each backend ([CoreML](https://pytorch.org/executorch/main/build-run-coreml.html), [MPS](https://pytorch.org/executorch/main/build-run-mps.html), [QNN](https://pytorch.org/executorch/main/build-run-qualcomm-ai-engine-direct-backend.html)) before trying to lower them. After the backend library is installed, the script to export a lowered model is shown below.
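
As an illustration only (the backend selector flags below are assumptions; verify them against the export script's `--help`):

```
# Pick one backend flag: --coreml, --mps, or --qnn
python -m examples.models.llama.export_llama -c stories110M.pt -p params.json -kv --coreml
```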
The iOS LLAMA app supports the CoreML and MPS models, and the Android LLAMA app supports the QNN model. On Android, you can also cross-compile the llama runner binary, push it to the device, and run it.
For CoreML, there are 2 additional optional arguments:
* `--coreml-ios`: Specify the minimum iOS version to deploy (and turn on available optimizations). E.g. `--coreml-ios 18` will turn on [in-place KV cache](https://developer.apple.com/documentation/coreml/mlstate?language=objc) and [fused scaled dot product attention kernel](https://apple.github.io/coremltools/source/coremltools.converters.mil.mil.ops.defs.html#coremltools.converters.mil.mil.ops.defs.iOS18.transformers.scaled_dot_product_attention) (the resulting model will then need at least iOS 18 to run, though)
* `--coreml-quantize`: Use [quantization tailored for CoreML](https://apple.github.io/coremltools/docs-guides/source/opt-quantization-overview.html). E.g. `--coreml-quantize b4w` will perform per-block 4-bit weight-only quantization in a way tailored for CoreML
## Download models from Hugging Face and convert from safetensors format to state dict
You can also download the above models from [Hugging Face](https://huggingface.co/). Since ExecuTorch starts from a PyTorch model, a script like the one below can be used to convert the Hugging Face safetensors format to PyTorch's state dict. It leverages the utils provided by [TorchTune](https://github.com/pytorch/torchtune).
```Python
from torchtune.utils import FullModelHFCheckpointer
from torchtune.models import convert_weights
import torch

# Convert from safetensors to TorchTune. Suppose the model has been downloaded from Hugging Face
# (directory, checkpoint file names and model_type below are placeholders; adjust them to your download).
checkpointer = FullModelHFCheckpointer(
    checkpoint_dir="<HF download dir>",
    checkpoint_files=["model-00001-of-00004.safetensors", "model-00002-of-00004.safetensors"],
    output_dir="<output dir>", model_type="LLAMA3")
# Convert the loaded weights to Meta's format and save them as a PyTorch state dict (.pth).
state_dict = convert_weights.tune_to_meta(checkpointer.load_checkpoint()["model"])
torch.save(state_dict, "<output dir>/checkpoint.pth")
```
## Fine-tuning

If you want to fine-tune your model on a specific dataset, PyTorch provides [TorchTune](https://github.com/pytorch/torchtune) - a native-PyTorch library for easily authoring, fine-tuning and experimenting with LLMs.
Once you have [TorchTune installed](https://github.com/pytorch/torchtune?tab=readme-ov-file#get-started), you can fine-tune the Llama2 7B model using LoRA on a single GPU with the following command. This will produce a checkpoint where the LoRA weights are merged with the base model, so the output checkpoint will be in the same format as the original Llama2 model.
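
A sketch of such a command, based on TorchTune's single-device LoRA recipe (the recipe and config names below are assumptions; see the TorchTune documentation for the current ones):

```
tune run lora_finetune_single_device \
  --config llama2/7B_lora_single_device \
  checkpointer.checkpoint_dir=<dir containing the Llama2 7B checkpoint> \
  tokenizer.path=<dir>/tokenizer.model
```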