
Commit 6820398

mergennachin authored and facebook-github-bot committed

Prepare Llama2 README.md for consumption (#2831)

Summary: Cleaning up old content from Llama2. This is purely a skeleton; follow-up diffs will fix the individual steps.

Reviewed By: iseeyuan

Differential Revision: D55703398

1 parent a25dea6, commit 6820398

File tree

1 file changed, +46 -24 lines

examples/models/llama2/README.md

@@ -1,5 +1,7 @@
# Summary
-This example demonstrates how to export a [Llama 2](https://ai.meta.com/llama/) model in ExecuTorch such that it can be used in a mobile environment.
+This example demonstrates how to run a [Llama 2](https://ai.meta.com/llama/) model on mobile via ExecuTorch. We use XNNPACK to accelerate the performance and 4-bit groupwise PTQ quantization to fit the model on a phone.
+
+
For Llama2, please refer to [the Llama GitHub page](https://github.com/facebookresearch/llama) for details.
Pretrained parameters are not included in this repo. Users are encouraged to download them through [the Llama download page](https://ai.meta.com/resources/models-and-libraries/llama-downloads/).

@@ -12,31 +14,28 @@ Overall, Llama models are powerful and versatile language models that can be use

Please note that the models are subject to the [acceptable use policy](https://github.com/facebookresearch/llama/blob/main/USE_POLICY.md) and the provided [responsible use guide](https://ai.meta.com/static-resource/responsible-use-guide/).

-# Notes
-1. This example is to show the feasibility of exporting a Llama2 model in ExecuTorch. There is no guarantee of performance.
-2. The provided checkpoint, demo_rand_params.pth, is a dummy checkpoint with random parameters. It does not produce meaningful results; it is only for demonstration and fast iteration. Use the options `--checkpoint <checkpoint>` and `--params <params>` for custom checkpoints.
-

-# Limitations
-This example tries to reuse the Python code, with modifications to make it compatible with current ExecuTorch:
-1. Since ExecuTorch does not support the complex Tensor data type, customized functions are used to compute rotary embeddings with real numbers. Please see [GitHub issue: Support complex data type in ExecuTorch](https://github.com/pytorch/executorch/issues/886).
-2. No KV cache. The cache implementation in the original Llama2 repo is not supported by ExecuTorch, because the ExecuTorch runtime assumes model data attributes are static. Please see [GitHub issue: Add support of mutable buffers in ExecuTorch](https://github.com/pytorch/executorch/issues/897).
-3. No CUDA. ExecuTorch is focused on edge use cases, where CUDA is not available on most edge devices.
-4. No dependencies on fairscale. ColumnParallelLinear, ParallelEmbedding, and training are neither needed nor supported in ExecuTorch.
+# Results

+TODO - Will fill in table of results.

# Instructions:
-### Setup
-1. Follow the [tutorial](https://pytorch.org/executorch/stable/getting-started-setup) to set up ExecuTorch.
-2. `cd examples/third-party/llama`
-3. `pip install -e .`
-4. Go back to `executorch` root, run `bash examples/models/llama2/install_requirements.sh`.
+### Step 1: Setup
+1. Follow the [tutorial](https://pytorch.org/executorch/main/getting-started-setup) to set up ExecuTorch.
+2. Run `examples/models/llama2/install_requirements.sh` (a sketch follows this list).
+
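A minimal sketch of this setup step, assuming the getting-started tutorial has already been completed and the command is run from the `executorch` repo root (the `bash` invocation mirrors the old step 4 above):

```
# Install the Llama2 example dependencies; script path is taken from this README.
# Assumes the ExecuTorch dev environment from the tutorial is already active.
bash examples/models/llama2/install_requirements.sh
```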
+### Step 2: Prepare model

-### Export llama2 models
-2. From `executorch` root, run `python3 -m examples.models.llama2.export_llama`. The exported program, llama2.pte, would be saved in the current directory using the dummy checkpoint.
-3. Llama2 pretrained parameters can be downloaded [here](https://ai.meta.com/resources/models-and-libraries/llama-downloads/) and run with `python3 -m examples.models.llama2.export_llama --checkpoint <checkpoint.pth> --params <params.json>`.
+#### Option A: Download and export llama2 model

-### Export and run stories110M model
+You can export and run the original Llama2 model.
+
+1. From `executorch` root, run `python3 -m examples.models.llama2.export_llama`. The exported program, llama2.pte, will be saved in the current directory using the dummy checkpoint.
+2. Llama2 pretrained parameters can be downloaded [here](https://ai.meta.com/resources/models-and-libraries/llama-downloads/) and exported with `python3 -m examples.models.llama2.export_llama --checkpoint <checkpoint.pth> --params <params.json>` (sketched below).
+
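A minimal sketch of Option A, using only the flags shown above; `<checkpoint.pth>` and `<params.json>` are placeholders for the files obtained from Meta's download page:

```
# Export with the dummy checkpoint; llama2.pte is written to the current directory.
python3 -m examples.models.llama2.export_llama

# Export real Llama2 weights (placeholder paths).
python3 -m examples.models.llama2.export_llama --checkpoint <checkpoint.pth> --params <params.json>
```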
+#### Option B: Download and export stories110M model
+
+If you want to deploy and run a smaller model for educational purposes, use stories110M.

1. Download `stories110M.pt` and `tokenizer.model` from Github.
```
@@ -49,21 +48,44 @@ This example tries to reuse the Python code, with modifications to make it compa
```
3. Export model. Export options available [here](https://github.com/pytorch/executorch/blob/main/examples/models/llama2/export_llama_lib.py#L161).
```
-python3 -m examples.models.llama2.export_llama -c stories110M.pt -p params.json
+python -m executorch.examples.models.llama2.export_llama -c stories110M.pt -p params.json
```
4. Create tokenizer.bin.

Build with buck2:
```
-buck2 run examples/models/llama2/tokenizer:tokenizer_py -- -t tokenizer.model -o tokenizer.bin
+python -m executorch.examples.models.llama2.tokenizer.tokenizer -t tokenizer.model -o tokenizer.bin
```
-Build with cmake: todo

-5. Run model. Run options available [here](https://github.com/pytorch/executorch/blob/main/examples/models/llama2/main.cpp#L13).
+### Step 3: Run on your computer to validate
+
+1. Build llama runner.
+
+2. Run model. Run options available [here](https://github.com/pytorch/executorch/blob/main/examples/models/llama2/main.cpp#L13).
Build with buck2:
```
buck2 run examples/models/llama2:main -- --model_path=llama2.pte --tokenizer_path=tokenizer.bin --prompt="Once"
```
Build with cmake: todo

See the test script [here](https://github.com/pytorch/executorch/blob/main/.ci/scripts/test_llama.sh).
+
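Putting Option B and Step 3 together, a consolidated sketch of the host-side validation flow, assuming `stories110M.pt`, `params.json`, and `tokenizer.model` are in the current directory:

```
# Export the model and generate tokenizer.bin (Option B commands above).
python -m executorch.examples.models.llama2.export_llama -c stories110M.pt -p params.json
python -m executorch.examples.models.llama2.tokenizer.tokenizer -t tokenizer.model -o tokenizer.bin

# Run the exported program with the buck2-built runner.
buck2 run examples/models/llama2:main -- --model_path=llama2.pte --tokenizer_path=tokenizer.bin --prompt="Once"
```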
+### Step 4: Run benchmark on Android phone
+
+1. Build llama runner binary for Android.
+
+2. Run on Android via adb shell (see the sketch below).
+
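Since the Android build itself is still TODO in this skeleton, the following is only a hypothetical adb sequence: the runner binary name `llama_main` and its location are assumptions, and the flags are reused from the host invocation above:

```
# Hypothetical: push artifacts and an Android-built runner, then execute on device.
adb push llama2.pte /data/local/tmp/
adb push tokenizer.bin /data/local/tmp/
adb push <path-to-android-build>/llama_main /data/local/tmp/
adb shell "cd /data/local/tmp && ./llama_main --model_path=llama2.pte --tokenizer_path=tokenizer.bin --prompt='Once'"
```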
+### Step 5: Build iOS and/or Android apps
+
+TODO
+
+### What is coming next?
+
+TODO
+
+# Notes
+This example tries to reuse the Python code, with minimal modifications to make it compatible with current ExecuTorch:
+1. Since ExecuTorch does not support the complex Tensor data type, customized functions are used to compute rotary embeddings with real numbers. Please see [GitHub issue: Support complex data type in ExecuTorch](https://github.com/pytorch/executorch/issues/886).
+2. No CUDA. ExecuTorch is focused on edge use cases, where CUDA is not available on most edge devices.
+3. No dependencies on fairscale. ColumnParallelLinear, ParallelEmbedding, and training are neither needed nor supported in ExecuTorch.
