
Commit e6d93de

WuhanMonkey and Chester Hu authored
Update Android and iOS demo app readme for Spinquant and QAT+LoRA model support (#6485)
Summary:
1. Added export commands for SpinQuant and QAT+LoRA using the prequantized models to be released on 10/24.
2. Cleaned up duplicated commands.
3. Renamed paths in the export commands to avoid confusion.
4. Removed outdated information.

Reviewed By: cmodi-meta

Differential Revision: D64784695

Co-authored-by: Chester Hu <[email protected]>
1 parent 8477fa9 · commit e6d93de

4 files changed: +56 −51 lines

examples/demo-apps/android/LlamaDemo/README.md

Lines changed: 4 additions & 6 deletions
@@ -1,5 +1,7 @@
 # ExecuTorch Llama Android Demo App
 
+**[UPDATE - 10/24]** We have added support for running quantized Llama 3.2 1B/3B models in demo apps on the [XNNPACK backend](https://github.com/pytorch/executorch/blob/main/examples/demo-apps/android/LlamaDemo/docs/delegates/xnnpack_README.md). We currently support inference with SpinQuant and QAT+LoRA quantization methods.
+
 We’re excited to share that the newly revamped Android demo app is live and includes many new updates to provide a more intuitive and smoother user experience with a chat use case! The primary goal of this app is to showcase how easily ExecuTorch can be integrated into an Android demo app and how to exercise the many features ExecuTorch and Llama models have to offer.
 
 This app serves as a valuable resource to inspire your creativity and provide foundational code that you can customize and adapt for your particular use case.
@@ -17,7 +19,8 @@ The goal is for you to see the type of support ExecuTorch provides and feel comf
 
 ## Supporting Models
 As a whole, the models that this app supports are (varies by delegate):
-* Llama 3.2 1B/3B
+* Llama 3.2 Quantized 1B/3B
+* Llama 3.2 1B/3B in BF16
 * Llama Guard 3 1B
 * Llama 3.1 8B
 * Llama 3 8B
@@ -34,11 +37,6 @@ First it’s important to note that currently ExecuTorch provides support across
 | QNN (Qualcomm AI Accelerators) | [link](https://github.com/pytorch/executorch/blob/main/examples/demo-apps/android/LlamaDemo/docs/delegates/qualcomm_README.md) |
 | MediaTek (MediaTek AI Accelerators) | [link](https://github.com/pytorch/executorch/blob/main/examples/demo-apps/android/LlamaDemo/docs/delegates/mediatek_README.md) |
 
-**WARNING** NDK r27 will cause issues like:
-```
-java.lang.UnsatisfiedLinkError: dlopen failed: cannot locate symbol "_ZTVNSt6__ndk114basic_ifstreamIcNS_11char_traitsIcEEEE" referenced by "/data/app/~~F5IwquaXUZPdLpSEYA-JGA==/com.example.executorchllamademo-FSyx80gEhsQCsxz7hvS2Ew==/lib/arm64/libexecutorch.so"...
-```
-Please use NDK version 26.3.11579264.
 
 ## How to Use the App
 
examples/demo-apps/android/LlamaDemo/docs/delegates/xnnpack_README.md

Lines changed: 26 additions & 31 deletions
@@ -1,7 +1,4 @@
 # Building ExecuTorch Android Demo App for Llama/Llava running XNNPACK
-
-**[UPDATE - 09/25]** We have added support for running [Llama 3.2 models](#for-llama-32-1b-and-3b-models) on the XNNPACK backend. We currently support inference on their original data type (BFloat16). We have also added instructions to run [Llama Guard 1B models](#for-llama-guard-1b-models) on-device.
-
 This tutorial covers the end-to-end workflow for building an Android demo app that runs on the CPU on-device via the XNNPACK framework.
 More specifically, it covers:
 1. Export and quantization of Llama and Llava models against the XNNPACK backend.
@@ -10,26 +7,15 @@ More specifically, it covers:
 
 Phone verified: OnePlus 12, OnePlus 9 Pro. Samsung S23 (Llama only), Samsung S24+ (Llama only), Pixel 8 Pro (Llama only)
 
-
-## Known Issues
-* With prompts like “What is the maxwell equation” the runner+jni is unable to handle odd unicodes.
-
 ## Prerequisites
 * Install [Java 17 JDK](https://www.oracle.com/java/technologies/javase/jdk17-archive-downloads.html).
-* Install the [Android SDK API Level 34](https://developer.android.com/about/versions/15/setup-sdk) and [Android NDK 26.3.11579264](https://developer.android.com/studio/projects/install-ndk). **WARNING** NDK r27 will cause issues like:
-```
-java.lang.UnsatisfiedLinkError: dlopen failed: cannot locate symbol "_ZTVNSt6__ndk114basic_ifstreamIcNS_11char_traitsIcEEEE" referenced by "/data/app/~~F5IwquaXUZPdLpSEYA-JGA==/com.example.executorchllamademo-FSyx80gEhsQCsxz7hvS2Ew==/lib/arm64/libexecutorch.so"...
-```
-Please downgrade to version 26.3.11579264.
+* Install the [Android SDK API Level 34](https://developer.android.com/about/versions/15/setup-sdk) and [Android NDK r27b](https://github.com/android/ndk/releases/tag/r27b).
+  * Note: This demo app and tutorial have only been validated with the arm64-v8a [ABI](https://developer.android.com/ndk/guides/abis), with NDK 26.3.11579264 and r27b.
 * If you have Android Studio set up, you can install them with:
   * Android Studio Settings -> Language & Frameworks -> Android SDK -> SDK Platforms -> Check the row with API Level 34.
   * Android Studio Settings -> Language & Frameworks -> Android SDK -> SDK Tools -> Check the NDK (Side by side) row.
 * Alternatively, you can follow [this guide](https://github.com/pytorch/executorch/blob/856e085b9344c8b0bf220a97976140a5b76356aa/examples/demo-apps/android/LlamaDemo/SDK.md) to set up Java/SDK/NDK with the CLI (see the sdkmanager sketch after this hunk).
-Supported Host OS: CentOS, macOS Sonoma on Apple Silicon.
-
-
-Note: This demo app and tutorial has only been validated with arm64-v8a [ABI](https://developer.android.com/ndk/guides/abis), with NDK 26.3.11579264.
-
+* Supported Host OS: CentOS, macOS Sonoma on Apple Silicon.
 
 
 ## Setup ExecuTorch
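Editor's note on the CLI route referenced in the hunk above: a minimal `sdkmanager` sketch for installing the pinned SDK/NDK versions might look like the following. This is not part of the commit; it assumes the Android command-line tools are installed and on PATH, and the package identifiers are assumptions based on the standard sdkmanager catalog.

```
# Hedged sketch: install the API 34 platform and the pinned NDK from the CLI.
# Package names are assumptions based on the standard sdkmanager catalog.
sdkmanager "platforms;android-34" "ndk;26.3.11579264"
# Verify what was installed:
sdkmanager --list | grep -E "platforms;android-34|ndk;26.3.11579264"
```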
@@ -61,20 +47,33 @@ Optional: Use the --pybind flag to install with pybindings.
 
 ## Prepare Models
 In this demo app, we support text-only inference with up-to-date Llama models and image reasoning inference with LLaVA 1.5.
-
-### For Llama 3.2 1B and 3B models
-We have supported BFloat16 as a data type on the XNNPACK backend for Llama 3.2 1B/3B models.
 * You can request and download model weights for Llama through Meta's official [website](https://llama.meta.com/).
 * For chat use-cases, download the instruct models instead of pretrained.
 * Run `examples/models/llama/install_requirements.sh` to install dependencies.
-* The 1B model in BFloat16 format can run on mobile devices with 8GB RAM. The 3B model will require 12GB+ RAM.
+* Rename the tokenizer for Llama 3.x with the command: `mv tokenizer.model tokenizer.bin`. We are updating the demo app to support the tokenizer in its original format directly.
+
+### For Llama 3.2 1B and 3B SpinQuant models
+Meta has released prequantized INT4 SpinQuant Llama 3.2 models that ExecuTorch supports on the XNNPACK backend.
 * Export the Llama model and generate a .pte file as below:
+```
+python -m examples.models.llama.export_llama --checkpoint <path-to-your-checkpoint.pth> --params <path-to-your-params.json> -kv --use_sdpa_with_kv_cache -X -d fp32 --xnnpack-extended-ops --preq_mode 8da4w_output_8da8w --preq_group_size 32 --max_seq_length 2048 --preq_embedding_quantize 8,0 --use_spin_quant native --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --output_name "llama3_2_spinquant.pte"
+```
 
+### For Llama 3.2 1B and 3B QAT+LoRA models
+Meta has released prequantized INT4 QAT+LoRA Llama 3.2 models that ExecuTorch supports on the XNNPACK backend.
+* Export the Llama model and generate a .pte file as below:
 ```
-python -m examples.models.llama.export_llama --checkpoint <checkpoint.pth> --params <params.json> -kv --use_sdpa_with_kv_cache -X -d bf16 --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --output_name="llama3_2.pte"
+python -m examples.models.llama.export_llama --checkpoint <path-to-your-checkpoint.pth> --params <path-to-your-params.json> -qat -lora 16 -kv --use_sdpa_with_kv_cache -X -d fp32 --xnnpack-extended-ops --preq_mode 8da4w_output_8da8w --preq_group_size 32 --max_seq_length 2048 --preq_embedding_quantize 8,0 --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --output_name "llama3_2_qat_lora.pte"
 ```
 
-* Rename tokenizer for Llama 3.2 with command: `mv tokenizer.model tokenizer.bin`. We are updating the demo app to support tokenizer in original format directly.
+### For Llama 3.2 1B and 3B BF16 models
+We have supported BF16 as a data type on the XNNPACK backend for Llama 3.2 1B/3B models.
+* The 1B model in BF16 format can run on mobile devices with 8GB RAM. The 3B model will require 12GB+ RAM.
+* Export the Llama model and generate a .pte file as below:
+
+```
+python -m examples.models.llama.export_llama --checkpoint <path-to-your-checkpoint.pth> --params <path-to-your-params.json> -kv --use_sdpa_with_kv_cache -X -d bf16 --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --output_name="llama3_2_bf16.pte"
+```
 
 For more details on using Llama 3.2 lightweight models, including the prompt template, please go to our official [website](https://www.llama.com/docs/model-cards-and-prompt-formats/llama3_2#-llama-3.2-lightweight-models-(1b/3b)-).
 
@@ -88,19 +87,17 @@ To safeguard your application, you can use our Llama Guard models for prompt cla
 * We prepared this model using the following command:
 
 ```
-python -m examples.models.llama.export_llama --checkpoint <pruned llama guard 1b checkpoint.pth> --params <params.json> -d fp32 -kv --use_sdpa_with_kv_cache --quantization_mode 8da4w --group_size 256 --xnnpack --max_seq_length 8193 --embedding-quantize 4,32 --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --output_prune_map <llama_guard pruned layers map.json> --output_name="llama_guard_3_1b_pruned_xnnpack.pte"
+python -m examples.models.llama.export_llama --checkpoint <path-to-pruned-llama-guard-1b-checkpoint.pth> --params <path-to-your-params.json> -d fp32 -kv --use_sdpa_with_kv_cache --quantization_mode 8da4w --group_size 256 --xnnpack --max_seq_length 8193 --embedding-quantize 4,32 --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --output_prune_map <path-to-your-llama_guard-pruned-layers-map.json> --output_name="llama_guard_3_1b_pruned_xnnpack.pte"
 ```
 
 
 ### For Llama 3.1 and Llama 2 models
-* You can download original model weights for Llama through Meta official [website](https://llama.meta.com/).
-* For Llama 2 models, Edit params.json file. Replace "vocab_size": -1 with "vocab_size": 32000. This is a short-term workaround
-* Run `examples/models/llama/install_requirements.sh` to install dependencies.
+* For Llama 2 models, edit the params.json file and replace "vocab_size": -1 with "vocab_size": 32000. This is a short-term workaround (see the sed sketch after this hunk).
 * The Llama 3.1 and Llama 2 models (8B and 7B) can run on devices with 12GB+ RAM.
-* Export Llama model and generate .pte file
+* Export the Llama model and generate a .pte file as below:
 
 ```
-python -m examples.models.llama.export_llama --checkpoint <checkpoint.pth> --params <params.json> -kv --use_sdpa_with_kv_cache -X -qmode 8da4w --group_size 128 -d fp32 --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --output_name="llama.pte"
+python -m examples.models.llama.export_llama --checkpoint <path-to-your-checkpoint.pth> --params <path-to-your-params.json> -kv --use_sdpa_with_kv_cache -X -qmode 8da4w --group_size 128 -d fp32 --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --output_name="llama.pte"
 ```
 
 You may wonder what the ‘--metadata’ flag is doing. This flag helps export the model with the proper special tokens added so that the runner can detect EOS tokens easily.
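Editor's note: the params.json edit for Llama 2 called out above can also be scripted. A minimal sketch, not part of this commit, assuming GNU sed and params.json in the current directory:

```
# Hedged sketch of the short-term vocab_size workaround for Llama 2.
# Assumes GNU sed; on macOS/BSD sed, use: sed -i '' 's/.../.../' params.json
sed -i 's/"vocab_size": -1/"vocab_size": 32000/' params.json
```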
@@ -109,8 +106,6 @@ You may wonder what the ‘--metadata’ flag is doing. This flag helps export t
 ```
 python -m extension.llm.tokenizer.tokenizer -t tokenizer.model -o tokenizer.bin
 ```
-* Rename tokenizer for Llama 3.1 with command: `mv tokenizer.model tokenizer.bin`. We are updating the demo app to support tokenizer in original format directly.
-
 
 ### For LLaVA model
 * For the Llava 1.5 model, you can get it from Huggingface [here](https://huggingface.co/llava-hf/llava-1.5-7b-hf).

examples/demo-apps/apple_ios/LLaMA/README.md

Lines changed: 4 additions & 1 deletion
@@ -1,5 +1,7 @@
 # ExecuTorch Llama iOS Demo App
 
+**[UPDATE - 10/24]** We have added support for running quantized Llama 3.2 1B/3B models in demo apps on the [XNNPACK backend](https://github.com/pytorch/executorch/blob/main/examples/demo-apps/apple_ios/LLaMA/docs/delegates/xnnpack_README.md). We currently support inference with SpinQuant and QAT+LoRA quantization methods.
+
 We’re excited to share that the newly revamped iOS demo app is live and includes many new updates to provide a more intuitive and smoother user experience with a chat use case! The primary goal of this app is to showcase how easily ExecuTorch can be integrated into an iOS demo app and how to exercise the many features ExecuTorch and Llama models have to offer.
 
 This app serves as a valuable resource to inspire your creativity and provide foundational code that you can customize and adapt for your particular use case.
@@ -17,7 +19,8 @@ The goal is for you to see the type of support ExecuTorch provides and feel comf
 ## Supported Models
 
 As a whole, the models that this app supports are (varies by delegate):
-* Llama 3.2 1B/3B
+* Llama 3.2 Quantized 1B/3B
+* Llama 3.2 1B/3B in BF16
 * Llama 3.1 8B
 * Llama 3 8B
 * Llama 2 7B

examples/demo-apps/apple_ios/LLaMA/docs/delegates/xnnpack_README.md

Lines changed: 22 additions & 13 deletions
@@ -1,7 +1,5 @@
 # Building Llama iOS Demo for XNNPACK Backend
 
-**[UPDATE - 09/25]** We have added support for running [Llama 3.2 models](#for-llama-32-1b-and-3b-models) on the XNNPACK backend. We currently support inference on their original data type (BFloat16).
-
 This tutorial covers the end-to-end workflow for building an iOS demo app using the XNNPACK backend on-device.
 More specifically, it covers:
 1. Export and quantization of Llama models against the XNNPACK backend.
@@ -38,24 +36,35 @@ Install dependencies
 ```
 
 ## Prepare Models
-In this demo app, we support text-only inference with up-to-date Llama models.
-
-Install the required packages to export the model
+In this demo app, we support text-only inference with up-to-date Llama models and image reasoning inference with LLaVA 1.5.
+* You can request and download model weights for Llama through Meta's official [website](https://llama.meta.com/).
+* For chat use-cases, download the instruct models instead of pretrained.
+* Install the required packages to export the model:
 
 ```
 sh examples/models/llama/install_requirements.sh
 ```
+### For Llama 3.2 1B and 3B SpinQuant models
+Meta has released prequantized INT4 SpinQuant Llama 3.2 models that ExecuTorch supports on the XNNPACK backend.
+* Export the Llama model and generate a .pte file as below:
+```
+python -m examples.models.llama.export_llama --checkpoint <path-to-your-checkpoint.pth> --params <path-to-your-params.json> -kv --use_sdpa_with_kv_cache -X -d fp32 --xnnpack-extended-ops --preq_mode 8da4w_output_8da8w --preq_group_size 32 --max_seq_length 2048 --preq_embedding_quantize 8,0 --use_spin_quant native --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --output_name "llama3_2_spinquant.pte"
+```
 
-### For Llama 3.2 1B and 3B models
-We have supported BFloat16 as a data type on the XNNPACK backend for Llama 3.2 1B/3B models.
-* You can download original model weights for Llama through Meta official [website](https://llama.meta.com/).
-* For chat use-cases, download the instruct models instead of pretrained.
-* Run “examples/models/llama/install_requirements.sh” to install dependencies.
-* The 1B model in BFloat16 format can run on mobile devices with 8GB RAM (iPhone 15 Pro and later). The 3B model will require 12GB+ RAM and hence will not fit on 8GB RAM phones.
+### For Llama 3.2 1B and 3B QAT+LoRA models
+Meta has released prequantized INT4 QAT+LoRA Llama 3.2 models that ExecuTorch supports on the XNNPACK backend.
+* Export the Llama model and generate a .pte file as below:
+```
+python -m examples.models.llama.export_llama --checkpoint <path-to-your-checkpoint.pth> --params <path-to-your-params.json> -qat -lora 16 -kv --use_sdpa_with_kv_cache -X -d fp32 --xnnpack-extended-ops --preq_mode 8da4w_output_8da8w --preq_group_size 32 --max_seq_length 2048 --preq_embedding_quantize 8,0 --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --output_name "llama3_2_qat_lora.pte"
+```
+
+### For Llama 3.2 1B and 3B BF16 models
+We have supported BF16 as a data type on the XNNPACK backend for Llama 3.2 1B/3B models.
+* The 1B model in BF16 format can run on mobile devices with 8GB RAM (iPhone 15 Pro and later). The 3B model will require 12GB+ RAM and hence will not fit on 8GB RAM phones.
 * Export the Llama model and generate a .pte file as below:
 
 ```
-python -m examples.models.llama.export_llama --checkpoint <checkpoint.pth> --params <params.json> -kv --use_sdpa_with_kv_cache -X -d bf16 --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --output_name="llama3_2.pte"
+python -m examples.models.llama.export_llama --checkpoint <path-to-your-checkpoint.pth> --params <path-to-your-params.json> -kv --use_sdpa_with_kv_cache -X -d bf16 --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --output_name="llama3_2_bf16.pte"
 ```
 
 For more details on using Llama 3.2 lightweight models, including the prompt template, please go to our official [website](https://www.llama.com/docs/model-cards-and-prompt-formats/llama3_2#-llama-3.2-lightweight-models-(1b/3b)-).
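Editor's note: the iOS hunks above stop at producing the .pte file and do not show a tokenizer step. The Android XNNPACK README earlier in this commit converts the tokenizer with the command below, repeated here for convenience; whether the iOS app consumes tokenizer.bin identically is an assumption to verify against the iOS docs.

```
# Command taken verbatim from the Android XNNPACK README in this same commit.
python -m extension.llm.tokenizer.tokenizer -t tokenizer.model -o tokenizer.bin
```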
@@ -64,7 +73,7 @@ For more detail using Llama 3.2 lightweight models including prompt template, pl
 
 Export the model
 ```
-python -m examples.models.llama.export_llama --checkpoint <consolidated.00.pth> -p <params.json> -kv --use_sdpa_with_kv_cache -X -qmode 8da4w --group_size 128 -d fp32 --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --embedding-quantize 4,32 --output_name="llama3_kv_sdpa_xnn_qe_4_32.pte"
+python -m examples.models.llama.export_llama --checkpoint <path-to-your-checkpoint.pth> -p <path-to-your-params.json> -kv --use_sdpa_with_kv_cache -X -qmode 8da4w --group_size 128 -d fp32 --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --embedding-quantize 4,32 --output_name="llama3_kv_sdpa_xnn_qe_4_32.pte"
 ```
 
 ### For LLaVA model
