Commit e65bba4

mergennachin authored and facebook-github-bot committed
Prepare Llama2 README.md for consumption (#2831)
Summary: Cleaning up old contents from Llama2. This is purely a skeleton; follow-up diffs will fix individual steps.

Reviewed By: kimishpatel, iseeyuan

Differential Revision: D55703398
1 parent a25dea6 commit e65bba4

File tree

examples/models/llama2/README.md

1 file changed: +53 −27 lines

examples/models/llama2/README.md

Lines changed: 53 additions & 27 deletions
@@ -1,5 +1,7 @@
 # Summary
-This example demonstrates how to Export a [Llama 2](https://ai.meta.com/llama/) model in ExecuTorch such that it can be used in a mobile environment.
+This example demonstrates how to run a [Llama 2](https://ai.meta.com/llama/) 7B model on mobile via ExecuTorch. We use XNNPACK to accelerate the performance and 4-bit groupwise PTQ quantization to fit the model on a phone.
+
+
 For Llama2, please refer to [the llama's github page](https://github.com/facebookresearch/llama) for details.
 Pretrained parameters are not included in this repo. Users are suggested to download them through [the llama's download page](https://ai.meta.com/resources/models-and-libraries/llama-downloads/).
 
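
The headline change in this hunk is the 4-bit groupwise PTQ claim. Concretely, groupwise quantization gives each small block of weights its own scale, which keeps quantization error local. A minimal PyTorch sketch of that idea, assuming a symmetric int4 scheme with one scale per group; this is illustrative only, not the exact `8da4w` recipe that `export_llama` applies (which also quantizes activations dynamically to 8 bits):

```
# Illustrative 4-bit groupwise weight quantization (symmetric).
# Each group of `group_size` weights along the input dim shares a scale.
import torch

def quantize_4bit_groupwise(w: torch.Tensor, group_size: int = 128):
    out_f, in_f = w.shape
    groups = w.reshape(out_f, in_f // group_size, group_size)
    # One scale per group: map the max magnitude to the int4 limit (7).
    scales = (groups.abs().amax(dim=-1, keepdim=True) / 7.0).clamp(min=1e-8)
    q = torch.clamp(torch.round(groups / scales), -8, 7).to(torch.int8)
    return q, scales

def dequantize(q: torch.Tensor, scales: torch.Tensor) -> torch.Tensor:
    return (q.float() * scales).reshape(q.shape[0], -1)

w = torch.randn(32, 256)
q, s = quantize_4bit_groupwise(w)
print((w - dequantize(q, s)).abs().max())  # small reconstruction error
```
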
@@ -12,31 +14,34 @@ Overall, Llama models are powerful and versatile language models that can be use
 
 Please note that the models are subject to the [acceptable use policy](https://github.com/facebookresearch/llama/blob/main/USE_POLICY.md) and the provided [responsible use guide](https://ai.meta.com/static-resource/responsible-use-guide/).
 
-# Notes
-1. This example is to show the feasibility of exporting a Llama2 model in ExecuTorch. There is no guarantee for performance.
-2. The provided checkpoint, demo_rand_params.pth is a dummy checkpoint with random parameters. It does not provide meaningful results. It's only for the purpose of demonstration and fast iterations. Use the options `--checkpoint <checkpoint>` and `--params <params>` for custom checkpoints.
-
 
-# Limitations
-This example tries to reuse the Python code, with modifications to make it compatible with current ExecuTorch:
-1. Since ExecuTorch does not support complex Tensor data type, use the customized functions to have rotary embedding with real numbers. Please see [GitHub issue: Support complex data type in ExecuTorch](https://github.com/pytorch/executorch/issues/886).
-2. No KV cache. The current cache implementation in the original Llama2 repo is not supported by ExecuTorch, because ExecuTorch runtime assumes model data attributes being static. Please see [GitHub issue: Add support of mutable buffers in ExecuTorch](https://github.com/pytorch/executorch/issues/897).
-3. No CUDA. ExecuTorch is focused on Edge use cases where CUDA is not available on most of the edge devices.
-4. No dependencies on fairscale. The ColumnParallelLinear, ParallelEmbedding and training are not needed and supported in ExecuTorch.
+# Results
 
+TODO - Will fill in table of results.
 
 # Instructions:
-### Setup
-1. Follow the [tutorial](https://pytorch.org/executorch/stable/getting-started-setup) to set up ExecuTorch
-2. `cd examples/third-party/llama`
-3. `pip install -e .`
-4. Go back to `executorch` root, run `bash examples/models/llama2/install_requirements.sh`.
+## Step 1: Setup
+1. Follow the [tutorial](https://pytorch.org/executorch/main/getting-started-setup) to set up ExecuTorch.
+2. Run `examples/models/llama2/install_requirements.sh` to install a few requirements.
+
+## Step 2: Prepare model
+
+### Option A: Download and export llama2 model
+
+You can export and run the original Llama2 7B model.
+
+1. Llama2 pretrained parameters can be downloaded [here](https://ai.meta.com/resources/models-and-libraries/llama-downloads/)
+
+2. TODO: Do some preparation.
+
+3. Export model and generate `.pte` file:
+```
+python -m examples.models.llama2.export_llama --checkpoint <checkpoint.pth> --params <params.json> -kv --use_sdpa_with_kv_cache -X -qmode 8da4w -d fp32
+```
 
-### Export llama2 models
-2. From `executorch` root, run `python3 -m examples.models.llama2.export_llama`. The exported program, llama2.pte would be saved in current directory using the dummy checkpoint.
-3. Llama2 pretrained parameters can be downloaded [here](https://ai.meta.com/resources/models-and-libraries/llama-downloads/) and run with `python3 -m examples.models.llama2.export_llama --checkpoint <checkpoint.pth> --params <params.json>`.
+### Option B: Download and export stories110M model
 
-### Export and run stories110M model
+If you want to deploy and run a smaller model for educational purposes, start from the `executorch` root:
 
 1. Download `stories110M.pt` and `tokenizer.model` from Github.
 ```
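
Put together, Option A comes down to one command once the checkpoint is in hand. A minimal sketch, assuming the download produced `consolidated.00.pth` and `params.json` (file names typical of a Llama 2 7B download; substitute your actual paths):

```
# Run from the executorch root, after the Step 1 setup.
python -m examples.models.llama2.export_llama \
  --checkpoint consolidated.00.pth \
  --params params.json \
  -kv --use_sdpa_with_kv_cache -X -qmode 8da4w -d fp32
# The exported .pte program is written to the current directory.
```
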
@@ -47,23 +52,44 @@ This example tries to reuse the Python code, with modifications to make it compa
 ```
 echo '{"dim": 768, "multiple_of": 32, "n_heads": 12, "n_layers": 12, "norm_eps": 1e-05, "vocab_size": 32000}' > params.json
 ```
-3. Export model. Export options available [here](https://github.com/pytorch/executorch/blob/main/examples/models/llama2/export_llama_lib.py#L161).
+3. Export model and generate `.pte` file.
 ```
-python3 -m examples.models.llama2.export_llama -c stories110M.pt -p params.json
+python -m examples.models.llama2.export_llama -c stories110M.pt -p params.json
 ```
 4. Create tokenizer.bin.
 
 Build with buck2:
 ```
-buck2 run examples/models/llama2/tokenizer:tokenizer_py -- -t tokenizer.model -o tokenizer.bin
+python -m examples.models.llama2.tokenizer.tokenizer -t tokenizer.model -o tokenizer.bin
 ```
-Build with cmake: todo
 
-5. Run model. Run options available [here](https://github.com/pytorch/executorch/blob/main/examples/models/llama2/main.cpp#L13).
+## Step 3: Run on your computer to validate
+
+1. Build llama runner. TODO
+
+2. Run model. Run options available [here](https://github.com/pytorch/executorch/blob/main/examples/models/llama2/main.cpp#L13).
 Build with buck2:
 ```
 buck2 run examples/models/llama2:main -- --model_path=llama2.pte --tokenizer_path=tokenizer.bin --prompt="Once"
 ```
-Build with cmake: todo
+Build with cmake: TODO
+
+## Step 4: Run benchmark on Android phone
+
+1. Build llama runner binary for Android
+
+2. Run on Android via adb shell
 
-See test script [here](https://github.com/pytorch/executorch/blob/main/.ci/scripts/test_llama.sh).
+## Step 5: Build iOS and/or Android apps
+
+TODO
+
+# What is coming next?
+
+TODO
+
+# Notes
+This example tries to reuse the Python code, with minimal modifications to make it compatible with current ExecuTorch:
+1. Since ExecuTorch does not support the complex Tensor data type, customized functions are used to implement rotary embedding with real numbers. Please see [GitHub issue: Support complex data type in ExecuTorch](https://github.com/pytorch/executorch/issues/886).
+2. No CUDA. ExecuTorch is focused on edge use cases, where CUDA is not available on most devices.
+3. No dependencies on fairscale. ColumnParallelLinear, ParallelEmbedding, and training are neither needed nor supported in ExecuTorch.
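
Step 4 above is left as a stub in this revision. A typical adb flow for a command-line runner looks like the sketch below; the binary name `llama_main` and the on-device paths are assumptions rather than this commit's final instructions, and the runner flags mirror the buck2 invocation shown earlier:

```
# Assumed names/paths; the Android build step itself is still TODO here.
adb push llama_main /data/local/tmp/
adb push llama2.pte /data/local/tmp/
adb push tokenizer.bin /data/local/tmp/
adb shell "cd /data/local/tmp && ./llama_main \
    --model_path=llama2.pte --tokenizer_path=tokenizer.bin --prompt='Once'"
```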

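The first item under the new Notes section (rotary embedding with real numbers) is the core workaround for the missing complex dtype. A condensed PyTorch sketch of the real-valued formulation; the helper name is illustrative, not the repo's actual function:

```
# Real-valued RoPE: instead of multiplying complex pairs
# (x_even + i*x_odd) by e^{i*theta}, apply the equivalent cos/sin
# rotation, which avoids complex tensors entirely.
import torch

def apply_rope_real(x: torch.Tensor, theta: torch.Tensor) -> torch.Tensor:
    # x: (..., seq_len, head_dim); theta: (seq_len, head_dim // 2)
    cos, sin = theta.cos(), theta.sin()
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out
```
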