
Commit e6d93de

WuhanMonkey and Chester Hu authored
Update Android and iOS demo app readme for Spinquant and QAT+LoRA model support (#6485)
Summary:
1. Added export commands for SpinQuant and QAT+LoRA using the prequantized models to be released on 10/24.
2. Cleaned up duplicated commands.
3. Renamed paths in the export commands to avoid confusion.
4. Removed outdated information.

Reviewed By: cmodi-meta

Differential Revision: D64784695

Co-authored-by: Chester Hu <[email protected]>
1 parent 8477fa9 · commit e6d93de

4 files changed: +56 −51 lines

examples/demo-apps/android/LlamaDemo/README.md

Lines changed: 4 additions & 6 deletions
@@ -1,5 +1,7 @@
 # ExecuTorch Llama Android Demo App
 
+**[UPDATE - 10/24]** We have added support for running quantized Llama 3.2 1B/3B models in demo apps on the [XNNPACK backend](https://github.com/pytorch/executorch/blob/main/examples/demo-apps/android/LlamaDemo/docs/delegates/xnnpack_README.md). We currently support inference with SpinQuant and QAT+LoRA quantization methods.
+
 We’re excited to share that the newly revamped Android demo app is live and includes many new updates to provide a more intuitive and smoother user experience with a chat use case! The primary goal of this app is to showcase how easily ExecuTorch can be integrated into an Android demo app and how to exercise the many features ExecuTorch and Llama models have to offer.
 
 This app serves as a valuable resource to inspire your creativity and provide foundational code that you can customize and adapt for your particular use case.
@@ -17,7 +19,8 @@ The goal is for you to see the type of support ExecuTorch provides and feel comf
 
 ## Supporting Models
 As a whole, the models that this app supports are (varies by delegate):
-* Llama 3.2 1B/3B
+* Llama 3.2 Quantized 1B/3B
+* Llama 3.2 1B/3B in BF16
 * Llama Guard 3 1B
 * Llama 3.1 8B
 * Llama 3 8B
@@ -34,11 +37,6 @@ First it’s important to note that currently ExecuTorch provides support across
 | QNN (Qualcomm AI Accelerators) | [link](https://github.com/pytorch/executorch/blob/main/examples/demo-apps/android/LlamaDemo/docs/delegates/qualcomm_README.md) |
 | MediaTek (MediaTek AI Accelerators) | [link](https://github.com/pytorch/executorch/blob/main/examples/demo-apps/android/LlamaDemo/docs/delegates/mediatek_README.md) |
 
-**WARNING** NDK r27 will cause issues like:
-```
-java.lang.UnsatisfiedLinkError: dlopen failed: cannot locate symbol "_ZTVNSt6__ndk114basic_ifstreamIcNS_11char_traitsIcEEEE" referenced by "/data/app/~~F5IwquaXUZPdLpSEYA-JGA==/com.example.executorchllamademo-FSyx80gEhsQCsxz7hvS2Ew==/lib/arm64/libexecutorch.so"...
-```
-Please use NDK version 26.3.11579264.
 
 ## How to Use the App
 
examples/demo-apps/android/LlamaDemo/docs/delegates/xnnpack_README.md

Lines changed: 26 additions & 31 deletions
@@ -1,7 +1,4 @@
 # Building ExecuTorch Android Demo App for Llama/Llava running XNNPACK
-
-**[UPDATE - 09/25]** We have added support for running [Llama 3.2 models](#for-llama-32-1b-and-3b-models) on the XNNPACK backend. We currently support inference on their original data type (BFloat16). We have also added instructions to run [Llama Guard 1B models](#for-llama-guard-1b-models) on-device.
-
 This tutorial covers the end-to-end workflow for building an Android demo app that runs on the CPU on-device via the XNNPACK framework.
 More specifically, it covers:
 1. Export and quantization of Llama and Llava models against the XNNPACK backend.
@@ -10,26 +7,15 @@ More specifically, it covers:
 
 Phone verified: OnePlus 12, OnePlus 9 Pro. Samsung S23 (Llama only), Samsung S24+ (Llama only), Pixel 8 Pro (Llama only)
 
-
-## Known Issues
-* With prompts like “What is the maxwell equation” the runner+jni is unable to handle odd unicodes.
-
 ## Prerequisites
 * Install [Java 17 JDK](https://www.oracle.com/java/technologies/javase/jdk17-archive-downloads.html).
-* Install the [Android SDK API Level 34](https://developer.android.com/about/versions/15/setup-sdk) and [Android NDK 26.3.11579264](https://developer.android.com/studio/projects/install-ndk). **WARNING** NDK r27 will cause issues like:
-```
-java.lang.UnsatisfiedLinkError: dlopen failed: cannot locate symbol "_ZTVNSt6__ndk114basic_ifstreamIcNS_11char_traitsIcEEEE" referenced by "/data/app/~~F5IwquaXUZPdLpSEYA-JGA==/com.example.executorchllamademo-FSyx80gEhsQCsxz7hvS2Ew==/lib/arm64/libexecutorch.so"...
-```
-Please downgrade to version 26.3.11579264.
+* Install the [Android SDK API Level 34](https://developer.android.com/about/versions/15/setup-sdk) and [Android NDK r27b](https://github.com/android/ndk/releases/tag/r27b).
+  * Note: This demo app and tutorial have only been validated with the arm64-v8a [ABI](https://developer.android.com/ndk/guides/abis), with NDK 26.3.11579264 and r27b.
 * If you have Android Studio set up, you can install them with:
   * Android Studio Settings -> Language & Frameworks -> Android SDK -> SDK Platforms -> Check the row with API Level 34.
   * Android Studio Settings -> Language & Frameworks -> Android SDK -> SDK Tools -> Check the NDK (Side by side) row.
 * Alternatively, you can follow [this guide](https://github.com/pytorch/executorch/blob/856e085b9344c8b0bf220a97976140a5b76356aa/examples/demo-apps/android/LlamaDemo/SDK.md) to set up Java/SDK/NDK with the CLI (see the sdkmanager sketch after this hunk).
-Supported Host OS: CentOS, macOS Sonoma on Apple Silicon.
-
-
-Note: This demo app and tutorial has only been validated with arm64-v8a [ABI](https://developer.android.com/ndk/guides/abis), with NDK 26.3.11579264.
-
+* Supported Host OS: CentOS, macOS Sonoma on Apple Silicon.
 
 
 ## Setup ExecuTorch
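Editor's note on the CLI route referenced in the hunk above: a minimal `sdkmanager` sketch for installing the pinned SDK/NDK versions might look like the following. This is not part of the commit; it assumes the Android command-line tools are installed and on PATH, and the package identifiers are assumptions based on the standard sdkmanager catalog.

```
# Hedged sketch: install the API 34 platform and the pinned NDK from the CLI.
# Package names are assumptions based on the standard sdkmanager catalog.
sdkmanager "platforms;android-34" "ndk;26.3.11579264"
# Verify what was installed:
sdkmanager --list | grep -E "platforms;android-34|ndk;26.3.11579264"
```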
@@ -61,20 +47,33 @@ Optional: Use the --pybind flag to install with pybindings.
 
 ## Prepare Models
 In this demo app, we support text-only inference with up-to-date Llama models and image reasoning inference with LLaVA 1.5.
-
-### For Llama 3.2 1B and 3B models
-We have supported BFloat16 as a data type on the XNNPACK backend for Llama 3.2 1B/3B models.
 * You can request and download model weights for Llama through Meta's official [website](https://llama.meta.com/).
 * For chat use-cases, download the instruct models instead of pretrained.
 * Run `examples/models/llama/install_requirements.sh` to install dependencies.
-* The 1B model in BFloat16 format can run on mobile devices with 8GB RAM. The 3B model will require 12GB+ RAM.
+* Rename the tokenizer for Llama 3.x with the command: `mv tokenizer.model tokenizer.bin`. We are updating the demo app to support the tokenizer in its original format directly.
+
+### For Llama 3.2 1B and 3B SpinQuant models
+Meta has released prequantized INT4 SpinQuant Llama 3.2 models that ExecuTorch supports on the XNNPACK backend.
 * Export the Llama model and generate a .pte file as below:
+```
+python -m examples.models.llama.export_llama --checkpoint <path-to-your-checkpoint.pth> --params <path-to-your-params.json> -kv --use_sdpa_with_kv_cache -X -d fp32 --xnnpack-extended-ops --preq_mode 8da4w_output_8da8w --preq_group_size 32 --max_seq_length 2048 --preq_embedding_quantize 8,0 --use_spin_quant native --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --output_name "llama3_2_spinquant.pte"
+```
 
+### For Llama 3.2 1B and 3B QAT+LoRA models
+Meta has released prequantized INT4 QAT+LoRA Llama 3.2 models that ExecuTorch supports on the XNNPACK backend.
+* Export the Llama model and generate a .pte file as below:
 ```
-python -m examples.models.llama.export_llama --checkpoint <checkpoint.pth> --params <params.json> -kv --use_sdpa_with_kv_cache -X -d bf16 --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --output_name="llama3_2.pte"
+python -m examples.models.llama.export_llama --checkpoint <path-to-your-checkpoint.pth> --params <path-to-your-params.json> -qat -lora 16 -kv --use_sdpa_with_kv_cache -X -d fp32 --xnnpack-extended-ops --preq_mode 8da4w_output_8da8w --preq_group_size 32 --max_seq_length 2048 --preq_embedding_quantize 8,0 --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --output_name "llama3_2_qat_lora.pte"
 ```
 
-* Rename tokenizer for Llama 3.2 with command: `mv tokenizer.model tokenizer.bin`. We are updating the demo app to support tokenizer in original format directly.
+### For Llama 3.2 1B and 3B BF16 models
+We have supported BF16 as a data type on the XNNPACK backend for Llama 3.2 1B/3B models.
+* The 1B model in BF16 format can run on mobile devices with 8GB RAM. The 3B model will require 12GB+ RAM.
+* Export the Llama model and generate a .pte file as below:
+
+```
+python -m examples.models.llama.export_llama --checkpoint <path-to-your-checkpoint.pth> --params <path-to-your-params.json> -kv --use_sdpa_with_kv_cache -X -d bf16 --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --output_name="llama3_2_bf16.pte"
+```
 
 For more details on using Llama 3.2 lightweight models, including the prompt template, please go to our official [website](https://www.llama.com/docs/model-cards-and-prompt-formats/llama3_2#-llama-3.2-lightweight-models-(1b/3b)-).
 
@@ -88,19 +87,17 @@ To safeguard your application, you can use our Llama Guard models for prompt cla
 * We prepared this model using the following command:
 
 ```
-python -m examples.models.llama.export_llama --checkpoint <pruned llama guard 1b checkpoint.pth> --params <params.json> -d fp32 -kv --use_sdpa_with_kv_cache --quantization_mode 8da4w --group_size 256 --xnnpack --max_seq_length 8193 --embedding-quantize 4,32 --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --output_prune_map <llama_guard pruned layers map.json> --output_name="llama_guard_3_1b_pruned_xnnpack.pte"
+python -m examples.models.llama.export_llama --checkpoint <path-to-pruned-llama-guard-1b-checkpoint.pth> --params <path-to-your-params.json> -d fp32 -kv --use_sdpa_with_kv_cache --quantization_mode 8da4w --group_size 256 --xnnpack --max_seq_length 8193 --embedding-quantize 4,32 --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --output_prune_map <path-to-your-llama_guard-pruned-layers-map.json> --output_name="llama_guard_3_1b_pruned_xnnpack.pte"
 ```
 
 
 ### For Llama 3.1 and Llama 2 models
-* You can download original model weights for Llama through Meta official [website](https://llama.meta.com/).
-* For Llama 2 models, Edit params.json file. Replace "vocab_size": -1 with "vocab_size": 32000. This is a short-term workaround
-* Run `examples/models/llama/install_requirements.sh` to install dependencies.
+* For Llama 2 models, edit the params.json file and replace "vocab_size": -1 with "vocab_size": 32000. This is a short-term workaround (see the sed sketch after this hunk).
 * The Llama 3.1 and Llama 2 models (8B and 7B) can run on devices with 12GB+ RAM.
-* Export Llama model and generate .pte file
+* Export the Llama model and generate a .pte file as below:
 
 ```
-python -m examples.models.llama.export_llama --checkpoint <checkpoint.pth> --params <params.json> -kv --use_sdpa_with_kv_cache -X -qmode 8da4w --group_size 128 -d fp32 --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --output_name="llama.pte"
+python -m examples.models.llama.export_llama --checkpoint <path-to-your-checkpoint.pth> --params <path-to-your-params.json> -kv --use_sdpa_with_kv_cache -X -qmode 8da4w --group_size 128 -d fp32 --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --output_name="llama.pte"
 ```
 
 You may wonder what the ‘--metadata’ flag is doing. This flag helps export the model with the proper special tokens added so that the runner can detect EOS tokens easily.
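Editor's note: the params.json edit for Llama 2 called out above can also be scripted. A minimal sketch, not part of this commit, assuming GNU sed and params.json in the current directory:

```
# Hedged sketch of the short-term vocab_size workaround for Llama 2.
# Assumes GNU sed; on macOS/BSD sed, use: sed -i '' 's/.../.../' params.json
sed -i 's/"vocab_size": -1/"vocab_size": 32000/' params.json
```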
@@ -109,8 +106,6 @@ You may wonder what the ‘--metadata’ flag is doing. This flag helps export t
 ```
 python -m extension.llm.tokenizer.tokenizer -t tokenizer.model -o tokenizer.bin
 ```
-* Rename tokenizer for Llama 3.1 with command: `mv tokenizer.model tokenizer.bin`. We are updating the demo app to support tokenizer in original format directly.
-
 
 ### For LLaVA model
 * For the Llava 1.5 model, you can get it from Huggingface [here](https://huggingface.co/llava-hf/llava-1.5-7b-hf).

examples/demo-apps/apple_ios/LLaMA/README.md

Lines changed: 4 additions & 1 deletion
@@ -1,5 +1,7 @@
 # ExecuTorch Llama iOS Demo App
 
+**[UPDATE - 10/24]** We have added support for running quantized Llama 3.2 1B/3B models in demo apps on the [XNNPACK backend](https://github.com/pytorch/executorch/blob/main/examples/demo-apps/apple_ios/LLaMA/docs/delegates/xnnpack_README.md). We currently support inference with SpinQuant and QAT+LoRA quantization methods.
+
 We’re excited to share that the newly revamped iOS demo app is live and includes many new updates to provide a more intuitive and smoother user experience with a chat use case! The primary goal of this app is to showcase how easily ExecuTorch can be integrated into an iOS demo app and how to exercise the many features ExecuTorch and Llama models have to offer.
 
 This app serves as a valuable resource to inspire your creativity and provide foundational code that you can customize and adapt for your particular use case.
@@ -17,7 +19,8 @@ The goal is for you to see the type of support ExecuTorch provides and feel comf
 ## Supported Models
 
 As a whole, the models that this app supports are (varies by delegate):
-* Llama 3.2 1B/3B
+* Llama 3.2 Quantized 1B/3B
+* Llama 3.2 1B/3B in BF16
 * Llama 3.1 8B
 * Llama 3 8B
 * Llama 2 7B

examples/demo-apps/apple_ios/LLaMA/docs/delegates/xnnpack_README.md

Lines changed: 22 additions & 13 deletions
@@ -1,7 +1,5 @@
 # Building Llama iOS Demo for XNNPACK Backend
 
-**[UPDATE - 09/25]** We have added support for running [Llama 3.2 models](#for-llama-32-1b-and-3b-models) on the XNNPACK backend. We currently support inference on their original data type (BFloat16).
-
 This tutorial covers the end-to-end workflow for building an iOS demo app using the XNNPACK backend on-device.
 More specifically, it covers:
 1. Export and quantization of Llama models against the XNNPACK backend.
@@ -38,24 +36,35 @@ Install dependencies
 ```
 
 ## Prepare Models
-In this demo app, we support text-only inference with up-to-date Llama models.
-
-Install the required packages to export the model
+In this demo app, we support text-only inference with up-to-date Llama models and image reasoning inference with LLaVA 1.5.
+* You can request and download model weights for Llama through Meta's official [website](https://llama.meta.com/).
+* For chat use-cases, download the instruct models instead of pretrained.
+* Install the required packages to export the model:
 
 ```
 sh examples/models/llama/install_requirements.sh
 ```
+### For Llama 3.2 1B and 3B SpinQuant models
+Meta has released prequantized INT4 SpinQuant Llama 3.2 models that ExecuTorch supports on the XNNPACK backend.
+* Export the Llama model and generate a .pte file as below:
+```
+python -m examples.models.llama.export_llama --checkpoint <path-to-your-checkpoint.pth> --params <path-to-your-params.json> -kv --use_sdpa_with_kv_cache -X -d fp32 --xnnpack-extended-ops --preq_mode 8da4w_output_8da8w --preq_group_size 32 --max_seq_length 2048 --preq_embedding_quantize 8,0 --use_spin_quant native --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --output_name "llama3_2_spinquant.pte"
+```
 
-### For Llama 3.2 1B and 3B models
-We have supported BFloat16 as a data type on the XNNPACK backend for Llama 3.2 1B/3B models.
-* You can download original model weights for Llama through Meta official [website](https://llama.meta.com/).
-* For chat use-cases, download the instruct models instead of pretrained.
-* Run “examples/models/llama/install_requirements.sh” to install dependencies.
-* The 1B model in BFloat16 format can run on mobile devices with 8GB RAM (iPhone 15 Pro and later). The 3B model will require 12GB+ RAM and hence will not fit on 8GB RAM phones.
+### For Llama 3.2 1B and 3B QAT+LoRA models
+Meta has released prequantized INT4 QAT+LoRA Llama 3.2 models that ExecuTorch supports on the XNNPACK backend.
+* Export the Llama model and generate a .pte file as below:
+```
+python -m examples.models.llama.export_llama --checkpoint <path-to-your-checkpoint.pth> --params <path-to-your-params.json> -qat -lora 16 -kv --use_sdpa_with_kv_cache -X -d fp32 --xnnpack-extended-ops --preq_mode 8da4w_output_8da8w --preq_group_size 32 --max_seq_length 2048 --preq_embedding_quantize 8,0 --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --output_name "llama3_2_qat_lora.pte"
+```
+
+### For Llama 3.2 1B and 3B BF16 models
+We have supported BF16 as a data type on the XNNPACK backend for Llama 3.2 1B/3B models.
+* The 1B model in BF16 format can run on mobile devices with 8GB RAM (iPhone 15 Pro and later). The 3B model will require 12GB+ RAM and hence will not fit on 8GB RAM phones.
 * Export the Llama model and generate a .pte file as below:
 
 ```
-python -m examples.models.llama.export_llama --checkpoint <checkpoint.pth> --params <params.json> -kv --use_sdpa_with_kv_cache -X -d bf16 --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --output_name="llama3_2.pte"
+python -m examples.models.llama.export_llama --checkpoint <path-to-your-checkpoint.pth> --params <path-to-your-params.json> -kv --use_sdpa_with_kv_cache -X -d bf16 --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --output_name="llama3_2_bf16.pte"
 ```
 
 For more details on using Llama 3.2 lightweight models, including the prompt template, please go to our official [website](https://www.llama.com/docs/model-cards-and-prompt-formats/llama3_2#-llama-3.2-lightweight-models-(1b/3b)-).
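Editor's note: the iOS hunks above stop at producing the .pte file and do not show a tokenizer step. The Android XNNPACK README earlier in this commit converts the tokenizer with the command below, repeated here for convenience; whether the iOS app consumes tokenizer.bin identically is an assumption to verify against the iOS docs.

```
# Command taken verbatim from the Android XNNPACK README in this same commit.
python -m extension.llm.tokenizer.tokenizer -t tokenizer.model -o tokenizer.bin
```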
@@ -64,7 +73,7 @@ For more detail using Llama 3.2 lightweight models including prompt template, pl
 
 Export the model
 ```
-python -m examples.models.llama.export_llama --checkpoint <consolidated.00.pth> -p <params.json> -kv --use_sdpa_with_kv_cache -X -qmode 8da4w --group_size 128 -d fp32 --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --embedding-quantize 4,32 --output_name="llama3_kv_sdpa_xnn_qe_4_32.pte"
+python -m examples.models.llama.export_llama --checkpoint <path-to-your-checkpoint.pth> -p <path-to-your-params.json> -kv --use_sdpa_with_kv_cache -X -qmode 8da4w --group_size 128 -d fp32 --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --embedding-quantize 4,32 --output_name="llama3_kv_sdpa_xnn_qe_4_32.pte"
 ```
 
 ### For LLaVA model
