Instructions for Llama3 #3154

Closed
wants to merge 1 commit
24 changes: 20 additions & 4 deletions examples/models/llama2/README.md
@@ -17,9 +17,9 @@ Please note that the models are subject to the [acceptable use policy](https://g

# Results

Since the 7B Llama2 model needs at least 4-bit quantization to fit even on some of the high-end phones, the results presented here correspond to the 4-bit groupwise post-training quantized model.

For Llama3, we can use the same process. Note that Llama3 is supported only on the ExecuTorch main branch.

## Quantization:
We employed 4-bit groupwise per-token dynamic quantization of all the linear layers of the model. Dynamic quantization refers to quantizing activations dynamically, such that the quantization parameters for activations are calculated from the min/max range at runtime. Here we quantized activations with 8 bits (signed integer). Furthermore, weights are statically quantized: in our case weights were per-channel groupwise quantized with a 4-bit signed integer. For more information refer to this [page](https://github.com/pytorch-labs/ao/).
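
To make the scheme concrete, here is a minimal sketch in plain PyTorch of the two quantization steps described above. It is an illustration of the idea only, not the torchao kernels ExecuTorch actually uses; the function names are ours, and the activation path is shown per-tensor for brevity where the real scheme is per-token.

```
import torch

def quantize_activations_8bit(x: torch.Tensor):
    # Dynamic quantization: scale and zero-point are derived from the
    # runtime min/max range of the activations (the real scheme computes
    # these per token; shown per-tensor here for simplicity).
    qmin, qmax = -128, 127
    scale = (x.max() - x.min()).clamp(min=1e-8) / (qmax - qmin)
    zero_point = qmin - torch.round(x.min() / scale)
    q = torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax).to(torch.int8)
    return q, scale, zero_point

def quantize_weights_4bit_groupwise(w: torch.Tensor, group_size: int = 128):
    # Static weight quantization: each contiguous group of `group_size`
    # weights along the input dimension gets its own scale, and values map
    # to the signed 4-bit range [-8, 7].
    out_features, in_features = w.shape  # in_features must be divisible by group_size
    grouped = w.reshape(out_features, in_features // group_size, group_size)
    scale = grouped.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 7.0
    q = torch.clamp(torch.round(grouped / scale), -8, 7).to(torch.int8)
    return q, scale
```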
@@ -57,7 +57,7 @@ Performance was measured on Samsung Galaxy S22, S24, One Plus 12 and iPhone 15 m
- For Llama 7B, your device may require at least 32GB RAM. If this is a constraint for you, please try the smaller stories model.

## Step 1: Setup
1. Follow the [tutorial](https://pytorch.org/executorch/main/getting-started-setup) to set up ExecuTorch. For installation, run `./install_requirements.sh --pybind xnnpack`.
2. Run `examples/models/llama2/install_requirements.sh` to install a few dependencies.
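
Assuming both scripts are run from the root of an ExecuTorch checkout, the setup boils down to:

```
# Run from the ExecuTorch repository root (assumed layout):
./install_requirements.sh --pybind xnnpack          # core setup with XNNPACK pybind
./examples/models/llama2/install_requirements.sh    # llama2 example dependencies
```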

## Step 2: Prepare model
@@ -103,6 +103,16 @@ If you want to deploy and run a smaller model for educational purposes. From `ex
python -m examples.models.llama2.tokenizer.tokenizer -t tokenizer.model -o tokenizer.bin
```

### Option C: Download and export Llama3 8B model

You can export and run the original Llama3 8B model.

1. Llama3 pretrained parameters can be downloaded from [Meta's official llama3 repository](https://github.com/meta-llama/llama3/).

2. Export the model and generate a `.pte` file
```
python -m examples.models.llama2.export_llama --checkpoint <consolidated.00.pth> -p <params.json> -d=fp32 -X -qmode 8da4w -kv --use_sdpa_with_kv_cache --output_name="llama3_kv_sdpa_xnn_qe_4_32.pte" --group_size 128 --metadata '{"get_bos_id":128000, "get_eos_id":128001}' --embedding-quantize 4,32
```
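
For instance, with hypothetical local paths to the downloaded checkpoint (substitute your own), the invocation might look like:

```
# Assumed example paths; adjust to wherever you downloaded the Llama3 weights:
python -m examples.models.llama2.export_llama \
  --checkpoint /path/to/Meta-Llama-3-8B/consolidated.00.pth \
  -p /path/to/Meta-Llama-3-8B/params.json \
  -d=fp32 -X -qmode 8da4w -kv --use_sdpa_with_kv_cache \
  --output_name="llama3_kv_sdpa_xnn_qe_4_32.pte" \
  --group_size 128 \
  --metadata '{"get_bos_id":128000, "get_eos_id":128001}' \
  --embedding-quantize 4,32
```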

## (Optional) Finetuning

@@ -148,6 +158,7 @@ The Wikitext results generated above used: `{max_seq_len: 2048, limit: 1000}`
-DEXECUTORCH_BUILD_EXTENSION_MODULE=ON \
-DEXECUTORCH_BUILD_EXTENSION_DATA_LOADER=ON \
-DEXECUTORCH_BUILD_XNNPACK=ON \
-DEXECUTORCH_BUILD_QUANTIZED=ON \
-DEXECUTORCH_BUILD_OPTIMIZED=ON \
-DEXECUTORCH_BUILD_CUSTOM=ON \
-Bcmake-out .
@@ -163,17 +174,22 @@ The Wikitext results generated above used: `{max_seq_len: 2048, limit: 1000}`
-DEXECUTORCH_BUILD_CUSTOM=ON \
-DEXECUTORCH_BUILD_OPTIMIZED=ON \
-DEXECUTORCH_BUILD_XNNPACK=ON \
-DEXECUTORCH_BUILD_QUANTIZED=ON \
-Bcmake-out/examples/models/llama2 \
examples/models/llama2

cmake --build cmake-out/examples/models/llama2 -j16 --config Release
```

For Llama3, add the `-DEXECUTORCH_USE_TIKTOKEN=ON` option when building the llama runner.
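
As a sketch, the configure step above would then look like this (repeating only the flags shown in this README; your full invocation may carry additional flags):

```
# Same llama runner configure command as above, with tiktoken enabled for Llama3:
cmake -DEXECUTORCH_BUILD_CUSTOM=ON \
    -DEXECUTORCH_BUILD_OPTIMIZED=ON \
    -DEXECUTORCH_BUILD_XNNPACK=ON \
    -DEXECUTORCH_BUILD_QUANTIZED=ON \
    -DEXECUTORCH_USE_TIKTOKEN=ON \
    -Bcmake-out/examples/models/llama2 \
    examples/models/llama2
```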

3. Run the model. Run options are documented [here](https://github.com/pytorch/executorch/blob/main/examples/models/llama2/main.cpp#L18-L40).
```
cmake-out/examples/models/llama2/llama_main --model_path=<model pte file> --tokenizer_path=<tokenizer.bin> --prompt=<prompt>
```

For Llama3, you can pass the original `tokenizer.model` directly (without converting it to a `.bin` file).
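
For instance, with the example artifact names from Step 2 (paths and prompt are assumptions):

```
# Assumed paths; for Llama3 the tokenizer.model file is passed directly:
cmake-out/examples/models/llama2/llama_main \
    --model_path=llama3_kv_sdpa_xnn_qe_4_32.pte \
    --tokenizer_path=tokenizer.model \
    --prompt="What is the capital of France?"
```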

## Step 5: Run benchmark on Android phone

**1. Build llama runner binary for Android**
@@ -271,7 +287,7 @@ This example tries to reuse the Python code, with minimal modifications to make
```
# Start from a clean tree: remove untracked files and build artifacts
git clean -xfd
# Uninstall the existing ExecuTorch package, then reinstall with XNNPACK pybind
pip uninstall executorch
./install_requirements.sh --pybind xnnpack

# Remove the stale CMake output directory
rm -rf cmake-out
```