
Commit 5a6a571

Refactor utility functions out of llama readme
Differential Revision: D64619935
Pull Request resolved: #6428
1 parent 89ba47a · commit 5a6a571

File tree

2 files changed: +125 −114 lines changed


examples/models/llama/README.md

Lines changed: 36 additions & 114 deletions
@@ -1,5 +1,5 @@
 # Summary
-This example demonstrates how to run a [llama models](https://www.llama.com/) on mobile via ExecuTorch. We use XNNPACK to accelerate the performance and 4-bit groupwise PTQ quantization to fit the model on a phone.
+This example demonstrates how to run [Llama models](https://www.llama.com/) on mobile via ExecuTorch. We use XNNPACK to accelerate performance and 4-bit groupwise PTQ quantization to fit the model on a phone.
 
 Here are supported models:
 
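(Editorial aside, not part of this diff.) The XNNPACK plus 4-bit groupwise PTQ flow mentioned in the summary corresponds roughly to an export command like the sketch below; the quantization flags (`-qmode 8da4w`, `--group_size`) are assumed from the `export_llama` CLI rather than taken from this diff.

```
python -m examples.models.llama.export_llama \
  --checkpoint <consolidated.00.pth> \
  --params <params.json> \
  -kv -X \
  -qmode 8da4w --group_size 128 \
  -d fp32 \
  --embedding-quantize 4,32 \
  --output_name "llama_xnnpack_4bit.pte"
```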
@@ -10,6 +10,8 @@ Here are supported models:
 
 Pretrained models are not included in this repo. You can download them [here](https://ai.meta.com/resources/models-and-libraries/llama-downloads/).
 
+This page contains the basic recipe for running Llama. See the [Llama utils](./UTILS.md) page for more advanced use cases, such as fine-tuning and running smaller models for educational purposes.
+
 # What is Llama?
 Llama is a collection of large language models that use publicly available data for training. These models are based on the transformer architecture, which allows them to process input sequences of arbitrary length and generate output sequences of variable length. One of the key features of Llama models is their ability to generate coherent and contextually relevant text. This is achieved through the use of attention mechanisms, which allow the model to focus on different parts of the input sequence as it generates output. Additionally, Llama models are pre-trained with a next-token prediction objective on a large corpus of text, which helps them learn to predict the next word in a sentence.
 
@@ -108,7 +110,7 @@ Due to Llama3's vocabulary size, we had to quantize embedding lookup table as we
 ## Tested on
 
 - MacOS M1/M2, Linux.
-- For Llama 3 8B, your device may require at least 32GB RAM. If this is a constraint for you, please try the smaller stories model.
+- For Llama 3 8B, your device may require at least 32GB RAM. If this is a constraint for you, please try the [smaller stories model](./UTILS.md).
 
 ## Step 1: Setup
 > :warning: **double check your python environment**: make sure `conda activate <VENV>` is run before all the bash and python scripts.
@@ -179,106 +181,6 @@ You can export and run the original Llama 3 8B instruct model.
 
 Due to the larger vocabulary size of Llama 3, we recommend quantizing the embeddings with `--embedding-quantize 4,32` as shown above to further reduce the model size.
 
-### Option C: Download and export stories110M model
-
-If you want to deploy and run a smaller model for educational purposes. From `executorch` root:
-
-1. Download `stories110M.pt` and `tokenizer.model` from Github.
-```
-wget "https://huggingface.co/karpathy/tinyllamas/resolve/main/stories110M.pt"
-wget "https://raw.githubusercontent.com/karpathy/llama2.c/master/tokenizer.model"
-```
-2. Create params file.
-```
-echo '{"dim": 768, "multiple_of": 32, "n_heads": 12, "n_layers": 12, "norm_eps": 1e-05, "vocab_size": 32000}' > params.json
-```
-3. Export model and generate `.pte` file.
-```
-python -m examples.models.llama.export_llama -c stories110M.pt -p params.json -X -kv
-```
-
-### Option D: Download models from Hugging Face and convert from safetensor format to state dict
-
-
-You can also download above models from [Hugging Face](https://huggingface.co/). Since ExecuTorch starts from a PyTorch model, a script like below can be used to convert the Hugging Face safetensors format to PyTorch's state dict. It leverages the utils provided by [TorchTune](https://github.com/pytorch/torchtune).
-
-
-```Python
-from torchtune.utils import FullModelHFCheckpointer
-from torchtune.models import convert_weights
-import torch
-
-# Convert from safetensors to TorchTune. Suppose the model has been downloaded from Hugging Face
-checkpointer = FullModelHFCheckpointer(
-  checkpoint_dir='/home/.cache/huggingface/hub/models/snapshots/hash-number',
-  checkpoint_files=['model-00001-of-00002.safetensors', 'model-00002-of-00002.safetensors'],
-  output_dir='/the/destination/dir' ,
-  model_type='LLAMA3' # or other types that TorchTune supports
-)
-
-print("loading checkpoint")
-sd = checkpointer.load_checkpoint()
-
-# Convert from TorchTune to Meta (PyTorch native)
-sd = convert_weights.tune_to_meta(sd['model'])
-
-print("saving checkpoint")
-torch.save(sd, "/the/destination/dir/checkpoint.pth")
-```
-
-## (Optional) Finetuning
-
-If you want to finetune your model based on a specific dataset, PyTorch provides [TorchTune](https://github.com/pytorch/torchtune) - a native-Pytorch library for easily authoring, fine-tuning and experimenting with LLMs.
-
-Once you have [TorchTune installed](https://github.com/pytorch/torchtune?tab=readme-ov-file#get-started) you can finetune Llama2 7B model using LoRA on a single GPU, using the following command. This will produce a checkpoint where the LoRA weights are merged with the base model and so the output checkpoint will be in the same format as the original Llama2 model.
-
-```
-tune run lora_finetune_single_device \
-  --config llama2/7B_lora_single_device \
-  checkpointer.checkpoint_dir=<path_to_checkpoint_folder> \
-  tokenizer.path=<path_to_checkpoint_folder>/tokenizer.model
-```
-
-To run full finetuning with Llama2 7B on a single device, you can use the following command.
-
-```
-tune run full_finetune_single_device \
-  --config llama2/7B_full_single_device \
-  checkpointer.checkpoint_dir=<path_to_checkpoint_folder> \
-  tokenizer.path=<path_to_checkpoint_folder>/tokenizer.model
-```
-
-## Step 3: Evaluate model accuracy
-
-> Forewarning: Model evaluation without a GPU may take a long time, especially on larger models.
-
-We use [LM Eval](https://github.com/EleutherAI/lm-evaluation-harness) to evaluate model accuracy.
-
-For base models, use the following example command to calculate its perplexity based on UncycloText.
-```
-python -m examples.models.llama.eval_llama \
-  -c <checkpoint.pth> \
-  -p <params.json> \
-  -t <tokenizer.model/bin> \
-  -kv \
-  -d <checkpoint dtype> \
-  --max_seq_len <max sequence length> \
-  --limit <number of samples>
-```
-
-For instruct models, use the following example command to calculate its MMLU score.
-```
-python -m examples.models.llama.eval_llama \
-  -c <checkpoint.pth> \
-  -p <params.json> \
-  -t <tokenizer.model/bin> \
-  -kv \
-  -d <checkpoint dtype> \
-  --tasks mmlu \
-  --num_fewshot 5 \
-  --max_seq_len <max sequence length>
-```
-
 ## Step 4: Run on your computer to validate
 
 1. Build executorch with optimized CPU performance as follows. Build options available [here](https://github.com/pytorch/executorch/blob/main/CMakeLists.txt#L59).
@@ -398,19 +300,41 @@ Please refer to [this tutorial](https://pytorch.org/executorch/main/llm/llama-de
 ### Android
 Please refer to [this tutorial](https://pytorch.org/executorch/main/llm/llama-demo-android.html) for full instructions on building the Android LLAMA Demo App.
 
-## Optional: Smaller models delegated to other backends
-Currently we supported lowering the stories model to other backends, including, CoreML, MPS and QNN. Please refer to the instruction
-for each backend ([CoreML](https://pytorch.org/executorch/main/build-run-coreml.html), [MPS](https://pytorch.org/executorch/main/build-run-mps.html), [QNN](https://pytorch.org/executorch/main/build-run-qualcomm-ai-engine-direct-backend.html)) before trying to lower them. After the backend library is installed, the script to export a lowered model is
 
-- Lower to CoreML: `python -m examples.models.llama.export_llama -kv --disable_dynamic_shape --coreml -c stories110M.pt -p params.json `
-- MPS: `python -m examples.models.llama.export_llama -kv --disable_dynamic_shape --mps -c stories110M.pt -p params.json `
-- QNN: `python -m examples.models.llama.export_llama -kv --disable_dynamic_shape --qnn -c stories110M.pt -p params.json `
+## Utility tools for Llama enablement
 
-The iOS LLAMA app supports the CoreML and MPS model and the Android LLAMA app supports the QNN model. On Android, it also allow to cross compiler the llama runner binary, push to the device and run.
+### Evaluate model accuracy
 
-For CoreML, there are 2 additional optional arguments:
-* `--coreml-ios`: Specify the minimum iOS version to deploy (and turn on available optimizations). E.g. `--coreml-ios 18` will turn on [in-place KV cache](https://developer.apple.com/documentation/coreml/mlstate?language=objc) and [fused scaled dot product attention kernel](https://apple.github.io/coremltools/source/coremltools.converters.mil.mil.ops.defs.html#coremltools.converters.mil.mil.ops.defs.iOS18.transformers.scaled_dot_product_attention) (the resulting model will then need at least iOS 18 to run, though)
-* `--coreml-quantize`: Use [quantization tailored for CoreML](https://apple.github.io/coremltools/docs-guides/source/opt-quantization-overview.html). E.g. `--coreml-quantize b4w` will perform per-block 4-bit weight-only quantization in a way tailored for CoreML
+> Forewarning: Model evaluation without a GPU may take a long time, especially on larger models.
+
+We use [LM Eval](https://github.com/EleutherAI/lm-evaluation-harness) to evaluate model accuracy.
+
+For base models, use the following example command to calculate perplexity on UncycloText.
+```
+python -m examples.models.llama.eval_llama \
+  -c <checkpoint.pth> \
+  -p <params.json> \
+  -t <tokenizer.model/bin> \
+  -kv \
+  -d <checkpoint dtype> \
+  --max_seq_len <max sequence length> \
+  --limit <number of samples>
+```
+
+For instruct models, use the following example command to calculate the MMLU score.
+```
+python -m examples.models.llama.eval_llama \
+  -c <checkpoint.pth> \
+  -p <params.json> \
+  -t <tokenizer.model/bin> \
+  -kv \
+  -d <checkpoint dtype> \
+  --tasks mmlu \
+  --num_fewshot 5 \
+  --max_seq_len <max sequence length>
+```
+
+See the [Llama utils](./UTILS.md) page for more advanced use cases such as fine-tuning, running smaller models for educational purposes, and quick iteration and verification.
 
 # What is coming next?
 ## Quantization
@@ -420,13 +344,11 @@ For CoreML, there are 2 additional optional arguments:
 - Lower bit quantization
 ## Models
 - Enabling more generative AI models and architectures.
-- Enable support for mult-modal models like LlaVa.
 ## Performance
 - Performance improvement via techniques such as speculative decoding
 - Enabling Llama and other architectures via Vulkan
 - Enabling performant execution of widely used quantization schemes.
 
-
 # Notes
 This example tries to reuse the Python code, with minimal modifications to make it compatible with current ExecuTorch:
 1. Since ExecuTorch does not support the complex Tensor data type, we use customized functions to implement rotary embedding with real numbers. Please see [GitHub issue: Support complex data type in ExecuTorch](https://github.com/pytorch/executorch/issues/886).
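
(Editorial aside, not code from the ExecuTorch sources.) "Rotary embedding with real numbers" means replacing the complex-valued rotation with an equivalent pair of real cos/sin tensors, roughly like this sketch:

```python
import torch

def apply_rope_real(x, freqs_cos, freqs_sin):
    # Illustrative only. x: [batch, seq, n_heads, head_dim];
    # freqs_cos / freqs_sin: [seq, head_dim // 2].
    # Treat consecutive pairs of the head dimension as (real, imaginary) parts.
    x_r, x_i = x.float().reshape(*x.shape[:-1], -1, 2).unbind(-1)
    # Broadcast the per-position frequencies over the batch and head dimensions.
    cos = freqs_cos.view(1, x.shape[1], 1, -1)
    sin = freqs_sin.view(1, x.shape[1], 1, -1)
    # Complex multiplication (x_r + i*x_i) * (cos + i*sin), written with real ops only.
    out_r = x_r * cos - x_i * sin
    out_i = x_r * sin + x_i * cos
    return torch.stack([out_r, out_i], dim=-1).flatten(-2).type_as(x)
```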

examples/models/llama/UTILS.md

Lines changed: 89 additions & 0 deletions (new file; full contents below)
# Utility tools for Llama enablement

## Stories110M model

If you want to deploy and run a smaller model for educational purposes, you can try the stories110M model. It has the same architecture as Llama, just smaller. It can also be used for fast iteration and verification during development.

### Export

From the `executorch` root:

1. Download `stories110M.pt` and `tokenizer.model` from GitHub.
```
wget "https://huggingface.co/karpathy/tinyllamas/resolve/main/stories110M.pt"
wget "https://raw.githubusercontent.com/karpathy/llama2.c/master/tokenizer.model"
```
2. Create a params file.
```
echo '{"dim": 768, "multiple_of": 32, "n_heads": 12, "n_layers": 12, "norm_eps": 1e-05, "vocab_size": 32000}' > params.json
```
3. Export the model and generate a `.pte` file.
```
python -m examples.models.llama.export_llama -c stories110M.pt -p params.json -X -kv
```
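
As an optional, purely illustrative smoke test (not part of this page), the exported file can be loaded through ExecuTorch's Python bindings; the output filename and the (tokens, start position) input signature are assumptions based on the `-kv` export above:

```python
# Illustrative sketch; the filename and input shapes are assumptions.
import torch
from executorch.extension.pybindings.portable_lib import _load_for_executorch

module = _load_for_executorch("<output>.pte")     # whatever filename export_llama produced
tokens = torch.tensor([[1]], dtype=torch.long)    # a single token, shape [batch, seq]
input_pos = torch.tensor([0], dtype=torch.long)   # KV-cache start position
outputs = module.forward([tokens, input_pos])
print(outputs[0].shape)                           # logits for the next token
```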

## Smaller model delegated to other backends

Currently we support lowering the stories model to other backends, including CoreML, MPS and QNN. Please refer to the instructions for each backend ([CoreML](https://pytorch.org/executorch/main/build-run-coreml.html), [MPS](https://pytorch.org/executorch/main/build-run-mps.html), [QNN](https://pytorch.org/executorch/main/build-run-qualcomm-ai-engine-direct-backend.html)) before trying to lower them. After the backend library is installed, the script to export a lowered model is:

- CoreML: `python -m examples.models.llama.export_llama -kv --disable_dynamic_shape --coreml -c stories110M.pt -p params.json`
- MPS: `python -m examples.models.llama.export_llama -kv --disable_dynamic_shape --mps -c stories110M.pt -p params.json`
- QNN: `python -m examples.models.llama.export_llama -kv --disable_dynamic_shape --qnn -c stories110M.pt -p params.json`

The iOS LLAMA app supports the CoreML and MPS models, and the Android LLAMA app supports the QNN model. On Android, you can also cross-compile the llama runner binary, push it to the device, and run it.

For CoreML, there are two additional optional arguments:
* `--coreml-ios`: Specify the minimum iOS version to deploy (and turn on available optimizations). E.g. `--coreml-ios 18` will turn on [in-place KV cache](https://developer.apple.com/documentation/coreml/mlstate?language=objc) and the [fused scaled dot product attention kernel](https://apple.github.io/coremltools/source/coremltools.converters.mil.mil.ops.defs.html#coremltools.converters.mil.mil.ops.defs.iOS18.transformers.scaled_dot_product_attention) (the resulting model will then need at least iOS 18 to run, though).
* `--coreml-quantize`: Use [quantization tailored for CoreML](https://apple.github.io/coremltools/docs-guides/source/opt-quantization-overview.html). E.g. `--coreml-quantize b4w` will perform per-block 4-bit weight-only quantization in a way tailored for CoreML.
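
For example, these flags can be combined with the CoreML export command above (an illustrative combination of the documented options, not a command taken from this diff):

```
python -m examples.models.llama.export_llama -kv --disable_dynamic_shape --coreml --coreml-ios 18 --coreml-quantize b4w -c stories110M.pt -p params.json
```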

## Download models from Hugging Face and convert from safetensors format to state dict

You can also download the above models from [Hugging Face](https://huggingface.co/). Since ExecuTorch starts from a PyTorch model, a script like the one below can be used to convert the Hugging Face safetensors format to PyTorch's state dict. It leverages the utils provided by [TorchTune](https://github.com/pytorch/torchtune).

```python
from torchtune.utils import FullModelHFCheckpointer
from torchtune.models import convert_weights
import torch

# Convert from safetensors to TorchTune format. Suppose the model has been downloaded from Hugging Face.
checkpointer = FullModelHFCheckpointer(
    checkpoint_dir='/home/.cache/huggingface/hub/models/snapshots/hash-number',
    checkpoint_files=['model-00001-of-00002.safetensors', 'model-00002-of-00002.safetensors'],
    output_dir='/the/destination/dir',
    model_type='LLAMA3'  # or other types that TorchTune supports
)

print("loading checkpoint")
sd = checkpointer.load_checkpoint()

# Convert from TorchTune format to the Meta format (PyTorch native)
sd = convert_weights.tune_to_meta(sd['model'])

print("saving checkpoint")
torch.save(sd, "/the/destination/dir/checkpoint.pth")
```
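
The resulting `checkpoint.pth` is in the Meta (PyTorch-native) format, so it can be fed to the export flow described in the main README. A sketch, assuming you also have the matching `params.json` from the original Meta distribution:

```
python -m examples.models.llama.export_llama -c /the/destination/dir/checkpoint.pth -p params.json -kv -X --embedding-quantize 4,32
```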

## Finetuning

If you want to finetune your model on a specific dataset, PyTorch provides [TorchTune](https://github.com/pytorch/torchtune) - a native-PyTorch library for easily authoring, fine-tuning and experimenting with LLMs.

Once you have [TorchTune installed](https://github.com/pytorch/torchtune?tab=readme-ov-file#get-started), you can finetune the Llama2 7B model using LoRA on a single GPU with the following command. This will produce a checkpoint where the LoRA weights are merged with the base model, so the output checkpoint will be in the same format as the original Llama2 model.

```
tune run lora_finetune_single_device \
  --config llama2/7B_lora_single_device \
  checkpointer.checkpoint_dir=<path_to_checkpoint_folder> \
  tokenizer.path=<path_to_checkpoint_folder>/tokenizer.model
```

To run full finetuning with Llama2 7B on a single device, you can use the following command.

```
tune run full_finetune_single_device \
  --config llama2/7B_full_single_device \
  checkpointer.checkpoint_dir=<path_to_checkpoint_folder> \
  tokenizer.path=<path_to_checkpoint_folder>/tokenizer.model
```
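
Because the merged checkpoint keeps the original Llama2 format, the finetuned model can be evaluated with the same `eval_llama` flow shown in the main README; the dtype, sequence length and sample limit below are illustrative values:

```
python -m examples.models.llama.eval_llama \
  -c <path_to_checkpoint_folder>/<finetuned_checkpoint.pth> \
  -p <params.json> \
  -t <path_to_checkpoint_folder>/tokenizer.model \
  -kv \
  -d fp32 \
  --max_seq_len 2048 \
  --limit 1000
```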
