Prepare Llama2 README.md for consumption #2831

Updated file: `examples/models/llama2/README.md`

# Summary
This example demonstrates how to run a [Llama 2](https://ai.meta.com/llama/) 7B model on mobile via ExecuTorch. We use XNNPACK to accelerate performance and 4-bit groupwise PTQ quantization to fit the model on a phone.


For Llama2, please refer to the [Llama GitHub page](https://github.com/facebookresearch/llama) for details.
Pretrained parameters are not included in this repo. You can download them from the [Llama download page](https://ai.meta.com/resources/models-and-libraries/llama-downloads/).

Overall, Llama models are powerful and versatile language models that can be used for a wide range of natural language processing tasks.

Please note that the models are subject to the [acceptable use policy](https://github.com/facebookresearch/llama/blob/main/USE_POLICY.md) and the provided [responsible use guide](https://ai.meta.com/static-resource/responsible-use-guide/).

# Results

TODO - Will fill in table of results.

# Instructions

## Step 1: Setup
1. Follow the [tutorial](https://pytorch.org/executorch/main/getting-started-setup) to set up ExecuTorch
2. Run `examples/models/llama2/install_requirements.sh` to install a few dependencies.
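
For reference, the setup typically looks like the sketch below; the linked tutorial is the authoritative source, so treat the exact clone and install commands as assumptions:
```
# Sketch of the setup flow (see the tutorial for the authoritative steps).
git clone https://github.com/pytorch/executorch.git
cd executorch
git submodule sync && git submodule update --init
./install_requirements.sh                            # core ExecuTorch dependencies
bash examples/models/llama2/install_requirements.sh  # Llama2-specific dependencies
```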

## Step 2: Prepare model

### Option A: Download and export llama2 7B model

You can export and run the original Llama2 7B model.

1. Llama2 pretrained parameters can be downloaded [here](https://ai.meta.com/resources/models-and-libraries/llama-downloads/)

2. TODO: Do some preparation.

3. Export model and generate `.pte` file:
```
python -m examples.models.llama2.export_llama --checkpoint <checkpoint.pth> --params <params.json> -kv --use_sdpa_with_kv_cache -X -qmode 8da4w -d fp32
```
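
In this command, `-kv` enables the KV cache, `--use_sdpa_with_kv_cache` swaps in a custom SDPA op that works with that cache, `-X` delegates execution to XNNPACK, `-qmode 8da4w` applies 8-bit dynamic-activation / 4-bit groupwise weight quantization, and `-d fp32` sets the dtype override; this is our reading of the exporter flags, and `export_llama_lib.py` has the full list.

To give a feel for what 4-bit groupwise weight quantization does, here is a minimal PyTorch sketch of symmetric per-group quantization. It is illustrative only, not the exact scheme `-qmode 8da4w` implements:
```
import torch

def quantize_4bit_groupwise(w, group_size=128):
    # w: [out_features, in_features]; one scale per group of `group_size` input channels.
    out_f, in_f = w.shape
    groups = w.reshape(out_f, in_f // group_size, group_size)
    # Symmetric int4 range is [-8, 7]; derive each group's scale from its max magnitude.
    scale = groups.abs().amax(dim=-1, keepdim=True).clamp(min=1e-9) / 7.0
    q = torch.clamp(torch.round(groups / scale), -8, 7).to(torch.int8)
    return q, scale  # stored as int8 here; real kernels pack two 4-bit values per byte

def dequantize_4bit_groupwise(q, scale):
    return (q.float() * scale).reshape(q.shape[0], -1)

w = torch.randn(32, 256)
q, s = quantize_4bit_groupwise(w)
print((w - dequantize_4bit_groupwise(q, s)).abs().max())  # small reconstruction error
```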

### Option B: Download and export stories110M model

If you want to deploy and run a smaller model for educational purposes, you can use the stories110M model. From `executorch` root:

1. Download `stories110M.pt` and `tokenizer.model` from Github.
```
wget "https://huggingface.co/karpathy/tinyllamas/resolve/main/stories110M.pt"
wget "https://raw.githubusercontent.com/karpathy/llama2.c/master/tokenizer.model"
```
2. Create params file.
```
echo '{"dim": 768, "multiple_of": 32, "n_heads": 12, "n_layers": 12, "norm_eps": 1e-05, "vocab_size": 32000}' > params.json
```
3. Export model and generate `.pte` file.
```
python -m examples.models.llama2.export_llama -c stories110M.pt -p params.json
```
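The optimization flags shown in Option A (e.g. `-kv`, `-X`, `-qmode 8da4w`) should, as far as we can tell, compose with this command as well.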
4. Create tokenizer.bin.

```
python -m examples.models.llama2.tokenizer.tokenizer -t tokenizer.model -o tokenizer.bin
```
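This step converts the SentencePiece `tokenizer.model` into `tokenizer.bin`, a flat binary format the llama runner can load without a SentencePiece dependency.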

## Step 3: Run on your computer to validate

1. Build llama runner. TODO

2. Run model. Run options available [here](https://github.com/pytorch/executorch/blob/main/examples/models/llama2/main.cpp#L13).
Build with buck2:
```
buck2 run examples/models/llama2:main -- --model_path=llama2.pte --tokenizer_path=tokenizer.bin --prompt="Once"
```
Build with cmake: TODO
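
Until the build steps above are filled in, the end-to-end flow will likely resemble the sketch below; the target and binary names here are placeholders, not final instructions:
```
# Hypothetical cmake flow; target/binary names are placeholders.
cmake -B cmake-out -DCMAKE_BUILD_TYPE=Release .
cmake --build cmake-out -j8
./cmake-out/examples/models/llama2/llama_main \
  --model_path=llama2.pte --tokenizer_path=tokenizer.bin --prompt="Once"
```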

## Step 4: Run benchmark on Android phone

1. Build llama runner binary for Android

2. Run on Android via adb shell
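
The benchmark steps are still being written up, but a typical adb flow looks roughly like the sketch below (binary and directory names are placeholders):
```
# Push the model, tokenizer, and an Android-built runner binary, then run on-device.
adb push llama2.pte tokenizer.bin /data/local/tmp/
adb push <android-build-dir>/llama_main /data/local/tmp/
adb shell "cd /data/local/tmp && ./llama_main --model_path=llama2.pte --tokenizer_path=tokenizer.bin --prompt='Once'"
```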

## Step 5: Build iOS and/or Android apps

See test script [here](https://github.com/pytorch/executorch/blob/main/.ci/scripts/test_llama.sh).
TODO

# What is coming next?

TODO

# Notes
This example reuses the original Python code, with minimal modifications to make it compatible with current ExecuTorch:
1. Since ExecuTorch does not support the complex Tensor data type, we use customized functions to implement rotary embedding with real numbers (see the sketch after this list). Please see [GitHub issue: Support complex data type in ExecuTorch](https://github.com/pytorch/executorch/issues/886).
2. No CUDA. ExecuTorch is focused on edge use cases, where CUDA is not available on most devices.
3. No dependency on fairscale. `ColumnParallelLinear`, `ParallelEmbedding`, and training are neither needed nor supported in ExecuTorch.
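
For note 1, here is a minimal sketch of rotary embedding computed with only real-valued tensors (precomputed cos/sin in place of complex multiplication). It is illustrative, not the exact ExecuTorch implementation:
```
import torch

def precompute_cos_sin(seq_len, head_dim, base=10000.0):
    # Same frequencies as the reference Llama code, kept as two real tensors.
    freqs = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    angles = torch.outer(torch.arange(seq_len).float(), freqs)  # [seq, head_dim/2]
    return angles.cos(), angles.sin()

def apply_rotary_emb_real(x, cos, sin):
    # x: [batch, seq, n_heads, head_dim]; consecutive channel pairs act as (real, imag).
    x_r, x_i = x[..., 0::2], x[..., 1::2]
    cos, sin = cos[None, :, None, :], sin[None, :, None, :]  # broadcast over batch/heads
    out_r = x_r * cos - x_i * sin  # real part of (x_r + i*x_i) * (cos + i*sin)
    out_i = x_r * sin + x_i * cos  # imaginary part
    return torch.stack((out_r, out_i), dim=-1).flatten(-2)

cos, sin = precompute_cos_sin(seq_len=16, head_dim=64)
q = torch.randn(1, 16, 8, 64)
print(apply_rotary_emb_real(q, cos, sin).shape)  # torch.Size([1, 16, 8, 64])
```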