This example demonstrates how to run a [Llama 2](https://ai.meta.com/llama/) 7B model on mobile via ExecuTorch. We use XNNPACK to accelerate performance and 4-bit group-wise post-training quantization (PTQ) to fit the model on a phone.
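To give a rough feel for what 4-bit group-wise quantization does, here is a simplified NumPy sketch (not the quantizer ExecuTorch actually uses; the function names and group size are illustrative): each group of weights shares one scale, chosen so that the largest magnitude in the group maps onto the signed 4-bit range.

```python
import numpy as np

def quantize_4bit_groupwise(weights: np.ndarray, group_size: int = 32):
    """Quantize a 1-D float array to int4 values with one scale per group.

    Symmetric scheme: the largest magnitude in each group maps to +/-7.
    """
    assert weights.size % group_size == 0
    groups = weights.reshape(-1, group_size)
    scales = np.abs(groups).max(axis=1, keepdims=True) / 7.0
    scales[scales == 0] = 1.0  # avoid division by zero for all-zero groups
    q = np.clip(np.round(groups / scales), -8, 7).astype(np.int8)
    return q, scales

def dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Recover approximate float weights from int4 values and per-group scales."""
    return (q.astype(np.float32) * scales).reshape(-1)

w = np.random.randn(64).astype(np.float32)
q, s = quantize_4bit_groupwise(w, group_size=32)
w_hat = dequantize(q, s)
# Reconstruction error per weight is bounded by half the group's scale.
print(np.max(np.abs(w - w_hat)))
```

The memory win is what matters on a phone: each weight shrinks from 32 bits to 4 bits plus a small per-group scale overhead.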
For Llama 2, please refer to [the Llama GitHub page](https://github.com/facebookresearch/llama) for details.
Pretrained parameters are not included in this repo. Users are encouraged to download them through [the Llama download page](https://ai.meta.com/resources/models-and-libraries/llama-downloads/).
Overall, Llama models are powerful and versatile language models that can be used for a wide range of natural language processing tasks.
Please note that the models are subject to the [acceptable use policy](https://github.com/facebookresearch/llama/blob/main/USE_POLICY.md) and the provided [responsible use guide](https://ai.meta.com/static-resource/responsible-use-guide/).
# Results
TODO - Will fill in table of results.
# Instructions
## Step 1: Setup
1. Follow the [tutorial](https://pytorch.org/executorch/main/getting-started-setup) to set up ExecuTorch.
2. Run `examples/models/llama2/install_requirements.sh` to install a few dependencies.
## Step 2: Prepare model
### Option A: Download and export llama2 7B model
You can export and run the original Llama2 7B model.
1. Llama2 pretrained parameters can be downloaded [here](https://ai.meta.com/resources/models-and-libraries/llama-downloads/).
2. From `executorch` root, run `python3 -m examples.models.llama2.export_llama --checkpoint <checkpoint.pth> --params <params.json>`. The exported program, `llama2.pte`, will be saved in the current directory. Without these flags, the dummy checkpoint with random parameters is used.
### Option B: Download and export stories110M model
If you want to deploy and run a smaller model for educational purposes, you can use stories110M. From `executorch` root:
1. Download `stories110M.pt` and `tokenizer.model` from GitHub.
```
wget "https://huggingface.co/karpathy/tinyllamas/resolve/main/stories110M.pt"
wget "https://raw.githubusercontent.com/karpathy/llama2.c/master/tokenizer.model"
```
## Step 3: Run on your computer to validate
1. Build llama runner. TODO
2. Run model. Run options available [here](https://github.com/pytorch/executorch/blob/main/examples/models/llama2/main.cpp#L13).
Build with buck2:
```
buck2 run examples/models/llama2:main -- --model_path=llama2.pte --tokenizer_path=tokenizer.bin --prompt="Once"
```
Build with cmake: TODO
## Step 4: Run benchmark on Android phone
1. Build llama runner binary for Android
2. Run on Android via adb shell
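In practice the two steps above could look roughly like the following. The binary name `llama_main`, the build output path, and the on-device directory are illustrative assumptions, not the project's documented layout; the runner flags mirror the buck2 example earlier in this README.

```shell
# Hypothetical sketch: push the exported model, tokenizer, and an
# Android-built runner binary to the device, then execute it there.
# Paths and binary name are assumptions for illustration only.
# Commands are guarded so this script is a no-op when adb is absent.
DEVICE_DIR=/data/local/tmp/llama
if command -v adb >/dev/null 2>&1; then
  adb shell mkdir -p "$DEVICE_DIR"
  adb push llama2.pte "$DEVICE_DIR/"
  adb push tokenizer.bin "$DEVICE_DIR/"
  adb push cmake-android-out/llama_main "$DEVICE_DIR/"  # assumed build output
  adb shell "cd $DEVICE_DIR && ./llama_main --model_path=llama2.pte --tokenizer_path=tokenizer.bin --prompt='Once'"
fi
```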
## Step 5: Build iOS and/or Android apps
See test script [here](https://github.com/pytorch/executorch/blob/main/.ci/scripts/test_llama.sh).
TODO
# What is coming next?
TODO
# Notes
This example tries to reuse the Python code, with minimal modifications to make it compatible with current ExecuTorch:
1. Since ExecuTorch does not support complex Tensor data type, use the customized functions to have rotary embedding with real numbers. Please see [GitHub issue: Support complex data type in ExecuTorch](https://github.com/pytorch/executorch/issues/886).
2. No CUDA. ExecuTorch is focused on Edge use cases where CUDA is not available on most of the edge devices.
3. No dependency on fairscale. The ColumnParallelLinear, ParallelEmbedding and training are neither needed nor supported in ExecuTorch.
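The first note (rotary embedding with real numbers) can be sketched as follows. This is a minimal NumPy illustration, not the ExecuTorch code: the complex product `(a + bi) * (cos θ + i sin θ)` is expanded into its real and imaginary parts, so no complex dtype is needed.

```python
import numpy as np

def rotary_real(x: np.ndarray, positions: np.ndarray, theta: float = 10000.0) -> np.ndarray:
    """Apply rotary position embedding to x of shape (seq_len, head_dim)
    using only real-valued arrays. Adjacent (even, odd) columns are treated
    as the (real, imaginary) parts of one complex number."""
    seq_len, head_dim = x.shape
    half = head_dim // 2
    # Standard RoPE frequencies: theta^(-2k / head_dim) for k = 0..half-1.
    freqs = 1.0 / (theta ** (np.arange(half) * 2.0 / head_dim))
    angles = np.outer(positions, freqs)          # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin    # real part of the product
    out[:, 1::2] = x_even * sin + x_odd * cos    # imaginary part
    return out

x = np.random.randn(4, 8)
out = rotary_real(x, np.arange(4))
# Rotation preserves the norm of every (even, odd) pair, hence the total norm.
print(np.allclose(np.linalg.norm(out), np.linalg.norm(x)))
```

Because each pair is just rotated in the plane, the result is numerically identical to the complex-valued formulation used in the original Llama 2 code.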