Summary:
Cleaning up old contents from Llama2. This is purely a skeleton.
Follow-up diffs will fix individual steps.
Reviewed By: kimishpatel, iseeyuan
Differential Revision: D55703398
This example demonstrates how to run a [Llama 2](https://ai.meta.com/llama/) 7B model on mobile via ExecuTorch. We use XNNPACK to accelerate the performance and 4-bit groupwise PTQ quantization to fit the model on a phone.
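As a rough illustration of the 4-bit groupwise PTQ scheme mentioned above, here is a hypothetical NumPy sketch (not the quantizer actually used by ExecuTorch or XNNPACK; function names are invented for illustration): each group of weights within a row shares one floating-point scale, and each weight is rounded to a signed 4-bit integer.

```python
import numpy as np

np.random.seed(0)

def quantize_4bit_groupwise(w, group_size=32):
    """Symmetric 4-bit groupwise PTQ of a 2-D weight matrix (illustrative).

    Each contiguous group of `group_size` values in a row shares one
    float scale; values are rounded to signed 4-bit integers in [-8, 7].
    """
    rows, cols = w.shape
    assert cols % group_size == 0, "columns must divide evenly into groups"
    groups = w.reshape(rows, cols // group_size, group_size)
    # Pick each group's scale so its largest magnitude maps to +/-7.
    scales = np.abs(groups).max(axis=-1, keepdims=True) / 7.0
    scales[scales == 0] = 1.0  # avoid division by zero for all-zero groups
    q = np.clip(np.round(groups / scales), -8, 7).astype(np.int8)
    return q, scales

def dequantize_4bit_groupwise(q, scales, shape):
    # Reconstruct an approximation of the original weights.
    return (q.astype(np.float32) * scales).reshape(shape)

w = np.random.randn(4, 64).astype(np.float32)
q, scales = quantize_4bit_groupwise(w)
w_hat = dequantize_4bit_groupwise(q, scales, w.shape)
max_err = np.abs(w - w_hat).max()
```

The point of the groupwise scheme is that the reconstruction error of each weight is bounded by half of its own group's scale, so outliers in one group do not degrade precision elsewhere.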
For Llama 2, please refer to [the Llama GitHub page](https://github.com/facebookresearch/llama) for details.
Pretrained parameters are not included in this repo. Users are encouraged to download them from [the Llama download page](https://ai.meta.com/resources/models-and-libraries/llama-downloads/).
Please note that the models are subject to the [acceptable use policy](https://github.com/facebookresearch/llama/blob/main/USE_POLICY.md) and the provided [responsible use guide](https://ai.meta.com/static-resource/responsible-use-guide/).
# Results
TODO - Will fill in table of results.
# Instructions
## Step 1: Setup
24
+
1. Follow the [tutorial](https://pytorch.org/executorch/main/getting-started-setup) to set up ExecuTorch
25
+
2. Run `examples/models/llama2/install_requirements.sh` to install a few requirements.
## Step 2: Prepare model
### Option A: Download and export llama2 model
You can export and run the original Llama2 7B model.
1. Llama2 pretrained parameters can be downloaded [here](https://ai.meta.com/resources/models-and-libraries/llama-downloads/)
2. From the `executorch` root, run `python3 -m examples.models.llama2.export_llama`. The exported program, `llama2.pte`, will be saved in the current directory using the dummy checkpoint.
3. To export with the downloaded pretrained parameters, run `python3 -m examples.models.llama2.export_llama --checkpoint <checkpoint.pth> --params <params.json>`.
### Option B: Download and export stories110M model
Use this option if you want to deploy and run a smaller model for educational purposes. From the `executorch` root:
1. Download `stories110M.pt` and `tokenizer.model` from GitHub.
## Step 3: Run on your computer to validate
1. Build llama runner. TODO
2. Run the model. Runtime options are listed [here](https://github.com/pytorch/executorch/blob/main/examples/models/llama2/main.cpp#L13).
Build with buck2:
```
buck2 run examples/models/llama2:main -- --model_path=llama2.pte --tokenizer_path=tokenizer.bin --prompt="Once"
```
Build with cmake: TODO
## Step 4: Run benchmark on Android phone
1. Build llama runner binary for Android
2. Run on Android via adb shell
See the test script [here](https://github.com/pytorch/executorch/blob/main/.ci/scripts/test_llama.sh).
## Step 5: Build iOS and/or Android apps
TODO
# What is coming next?
TODO
# Notes
This example tries to reuse the Python code, with minimal modifications to make it compatible with current ExecuTorch:
1. Since ExecuTorch does not support the complex Tensor data type, we use customized functions to implement rotary embedding with real numbers. Please see [GitHub issue: Support complex data type in ExecuTorch](https://github.com/pytorch/executorch/issues/886).
2. No CUDA. ExecuTorch is focused on edge use cases, where CUDA is not available on most edge devices.
3. No dependencies on fairscale. `ColumnParallelLinear`, `ParallelEmbedding`, and training are neither needed nor supported in ExecuTorch.
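The first note above (rotary embedding computed with real numbers instead of a complex dtype) can be sketched as follows. This is an illustrative NumPy version, not the customized functions used in this example; it assumes the adjacent-pair feature layout of the reference Llama code and cross-checks against the complex formulation.

```python
import numpy as np

np.random.seed(0)

def rope_real(x, theta=10000.0):
    """Rotary position embedding using only real arithmetic (illustrative).

    x: (seq_len, dim) with even dim. Each adjacent feature pair
    (x[:, 2i], x[:, 2i+1]) is treated as a 2-D vector and rotated by
    angle pos * theta**(-2i/dim), which mirrors the complex
    multiplication in the reference Llama code without complex dtypes.
    """
    seq_len, dim = x.shape
    half = dim // 2
    freqs = theta ** (-np.arange(half) * 2.0 / dim)    # (half,)
    angles = np.outer(np.arange(seq_len), freqs)       # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    # 2-D rotation applied pairwise: (a, b) -> (a cos - b sin, a sin + b cos)
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out

# Cross-check against the complex-number formulation.
x = np.random.randn(5, 8)
xc = x[:, 0::2] + 1j * x[:, 1::2]
freqs = 10000.0 ** (-np.arange(4) * 2.0 / 8)
reference = xc * np.exp(1j * np.outer(np.arange(5), freqs))
rotated = rope_real(x)
```

Because each pair is only rotated, the per-position norms are preserved, which is a quick sanity check that the real-valued rewrite is equivalent to the complex one.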