
Update Android and iOS demo app README for SpinQuant and QAT+LoRA model support #6485


Merged · 1 commit · Oct 24, 2024
10 changes: 4 additions & 6 deletions examples/demo-apps/android/LlamaDemo/README.md
@@ -1,5 +1,7 @@
# ExecuTorch Llama Android Demo App

**[UPDATE - 10/24]** We have added support for running quantized Llama 3.2 1B/3B models in demo apps on the [XNNPACK backend](https://github.com/pytorch/executorch/blob/main/examples/demo-apps/android/LlamaDemo/docs/delegates/xnnpack_README.md). We currently support inference with SpinQuant and QAT+LoRA quantization methods.

We’re excited to share that the newly revamped Android demo app is live and includes many new updates to provide a more intuitive and smoother user experience with a chat use case! The primary goal of this app is to showcase how easily ExecuTorch can be integrated into an Android demo app and how to exercise the many features ExecuTorch and Llama models have to offer.

This app serves as a valuable resource to inspire your creativity and provide foundational code that you can customize and adapt for your particular use case.
@@ -17,7 +19,8 @@ The goal is for you to see the type of support ExecuTorch provides and feel comf

## Supported Models
As a whole, the models this app supports (varying by delegate) are:
* Llama 3.2 1B/3B
* Llama 3.2 Quantized 1B/3B
* Llama 3.2 1B/3B in BF16
* Llama Guard 3 1B
* Llama 3.1 8B
* Llama 3 8B
@@ -34,11 +37,6 @@ First it’s important to note that currently ExecuTorch provides support across
| QNN (Qualcomm AI Accelerators) | [link](https://github.com/pytorch/executorch/blob/main/examples/demo-apps/android/LlamaDemo/docs/delegates/qualcomm_README.md) |
| MediaTek (MediaTek AI Accelerators) | [link](https://github.com/pytorch/executorch/blob/main/examples/demo-apps/android/LlamaDemo/docs/delegates/mediatek_README.md) |

**WARNING** NDK r27 will cause issues like:
```
java.lang.UnsatisfiedLinkError: dlopen failed: cannot locate symbol "_ZTVNSt6__ndk114basic_ifstreamIcNS_11char_traitsIcEEEE" referenced by "/data/app/~~F5IwquaXUZPdLpSEYA-JGA==/com.example.executorchllamademo-FSyx80gEhsQCsxz7hvS2Ew==/lib/arm64/libexecutorch.so"...
```
Please use NDK version 26.3.11579264.

## How to Use the App

examples/demo-apps/android/LlamaDemo/docs/delegates/xnnpack_README.md
@@ -1,7 +1,4 @@
# Building ExecuTorch Android Demo App for Llama/Llava running XNNPACK

**[UPDATE - 09/25]** We have added support for running [Llama 3.2 models](#for-llama-32-1b-and-3b-models) on the XNNPACK backend. We currently support inference on their original data type (BFloat16). We have also added instructions to run [Llama Guard 1B models](#for-llama-guard-1b-models) on-device.

This tutorial covers the end-to-end workflow for building an Android demo app that runs on-device on CPU via the XNNPACK framework.
More specifically, it covers:
1. Export and quantization of Llama and Llava models against the XNNPACK backend.
@@ -10,26 +7,15 @@ More specifically, it covers:

Phones verified: OnePlus 12, OnePlus 9 Pro, Samsung S23 (Llama only), Samsung S24+ (Llama only), Pixel 8 Pro (Llama only)


## Known Issues
* With prompts like “What is the Maxwell equation?”, the runner and JNI layer are unable to handle unusual Unicode characters.

## Prerequisites
* Install [Java 17 JDK](https://www.oracle.com/java/technologies/javase/jdk17-archive-downloads.html).
* Install the [Android SDK API Level 34](https://developer.android.com/about/versions/15/setup-sdk) and [Android NDK 26.3.11579264](https://developer.android.com/studio/projects/install-ndk). **WARNING** NDK r27 will cause issues like:
```
java.lang.UnsatisfiedLinkError: dlopen failed: cannot locate symbol "_ZTVNSt6__ndk114basic_ifstreamIcNS_11char_traitsIcEEEE" referenced by "/data/app/~~F5IwquaXUZPdLpSEYA-JGA==/com.example.executorchllamademo-FSyx80gEhsQCsxz7hvS2Ew==/lib/arm64/libexecutorch.so"...
```
Please downgrade to version 26.3.11579264.
* Install the [Android SDK API Level 34](https://developer.android.com/about/versions/15/setup-sdk) and [Android NDK r27b](https://github.com/android/ndk/releases/tag/r27b).
* Note: This demo app and tutorial have only been validated with the arm64-v8a [ABI](https://developer.android.com/ndk/guides/abis), with NDK 26.3.11579264 and r27b (a command-line install sketch follows this list).
* If you have Android Studio set up, you can install them with
* Android Studio Settings -> Language & Frameworks -> Android SDK -> SDK Platforms -> Check the row with API Level 34.
* Android Studio Settings -> Language & Frameworks -> Android SDK -> SDK Tools -> Check NDK (Side by side) row.
* Alternatively, you can follow [this guide](https://github.com/pytorch/executorch/blob/856e085b9344c8b0bf220a97976140a5b76356aa/examples/demo-apps/android/LlamaDemo/SDK.md) to set up Java/SDK/NDK with CLI.
Supported Host OS: CentOS, macOS Sonoma on Apple Silicon.


Note: This demo app and tutorial have only been validated with the arm64-v8a [ABI](https://developer.android.com/ndk/guides/abis), with NDK 26.3.11579264.

* Supported Host OS: CentOS, macOS Sonoma on Apple Silicon.
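If you prefer the command line, here is a minimal sketch for installing the pinned SDK platform and NDK with `sdkmanager`, assuming the Android cmdline-tools are already installed and on your PATH:

```
# Install the API 34 platform and the NDK version this tutorial was validated with.
sdkmanager "platforms;android-34" "ndk;26.3.11579264"
```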


## Setup ExecuTorch
@@ -61,20 +47,33 @@ Optional: Use the --pybind flag to install with pybindings.

## Prepare Models
In this demo app, we support text-only inference with up-to-date Llama models and image reasoning inference with LLaVA 1.5.

### For Llama 3.2 1B and 3B models
We support BFloat16 as a data type on the XNNPACK backend for Llama 3.2 1B/3B models.
* You can request and download model weights for Llama through Meta's official [website](https://llama.meta.com/).
* For chat use cases, download the instruct models instead of the pretrained ones.
* Run `examples/models/llama/install_requirements.sh` to install dependencies.
* The 1B model in BFloat16 format can run on mobile devices with 8GB RAM. The 3B model will require 12GB+ RAM.
* Rename the tokenizer for Llama 3.x with: `mv tokenizer.model tokenizer.bin`. We are updating the demo app to support the tokenizer in its original format directly.

### For Llama 3.2 1B and 3B SpinQuant models
Meta has released prequantized INT4 SpinQuant Llama 3.2 models that ExecuTorch supports on the XNNPACK backend.
* Export Llama model and generate .pte file as below:
```
python -m examples.models.llama.export_llama --checkpoint <path-to-your-checkpoint.pth> --params <path-to-your-params.json> -kv --use_sdpa_with_kv_cache -X -d fp32 --xnnpack-extended-ops --preq_mode 8da4w_output_8da8w --preq_group_size 32 --max_seq_length 2048 --preq_embedding_quantize 8,0 --use_spin_quant native --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --output_name "llama3_2_spinquant.pte"
```
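Once exported, the artifacts need to be on the device for the demo app to load. A minimal sketch using `adb`; the destination folder below is an assumption, so use whatever path your build of the app reads models from:

```
# Hypothetical on-device location; adjust to the folder the app reads from.
adb shell mkdir -p /data/local/tmp/llama
adb push llama3_2_spinquant.pte /data/local/tmp/llama/
adb push tokenizer.bin /data/local/tmp/llama/
```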

### For Llama 3.2 1B and 3B QAT+LoRA models
Meta has released prequantized INT4 QAT+LoRA Llama 3.2 models that ExecuTorch supports on the XNNPACK backend.
* Export Llama model and generate .pte file as below:
```
python -m examples.models.llama.export_llama --checkpoint <checkpoint.pth> --params <params.json> -kv --use_sdpa_with_kv_cache -X -d bf16 --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --output_name="llama3_2.pte"
python -m examples.models.llama.export_llama --checkpoint <path-to-your-checkpoint.pth> --params <path-to-your-params.json> -qat -lora 16 -kv --use_sdpa_with_kv_cache -X -d fp32 --xnnpack-extended-ops --preq_mode 8da4w_output_8da8w --preq_group_size 32 --max_seq_length 2048 --preq_embedding_quantize 8,0 --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --output_name "llama3_2_qat_lora.pte"
```
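Before deploying, you can optionally sanity-check that an exported file loads. A sketch assuming ExecuTorch was installed with the `--pybind` flag (the pybindings module path may vary across versions):

```
# Loads the exported program and fails fast if the .pte is malformed.
python -c "from executorch.extension.pybindings.portable_lib import _load_for_executorch; _load_for_executorch('llama3_2_qat_lora.pte'); print('loaded OK')"
```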

* Rename the tokenizer for Llama 3.2 with: `mv tokenizer.model tokenizer.bin`. We are updating the demo app to support the tokenizer in its original format directly.
### For Llama 3.2 1B and 3B BF16 models
We support BF16 as a data type on the XNNPACK backend for Llama 3.2 1B/3B models.
* The 1B model in BF16 format can run on mobile devices with 8GB RAM. The 3B model will require 12GB+ RAM.
* Export Llama model and generate .pte file as below:

```
python -m examples.models.llama.export_llama --checkpoint <path-to-your-checkpoint.pth> --params <path-to-your-params.json> -kv --use_sdpa_with_kv_cache -X -d bf16 --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --output_name="llama3_2_bf16.pte"
```
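As a rough sanity check, the file size should track the parameter count: BF16 stores 2 bytes per weight, so expect roughly 2.5 GB for the ~1.2B-parameter 1B model and roughly 6.5 GB for 3B (ballpark figures, not exact):

```
ls -lh llama3_2_bf16.pte   # ≈1.2B params × 2 bytes ≈ 2.5 GB for 1B; ≈6.5 GB for 3B
```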

For more details on using Llama 3.2 lightweight models, including the prompt template, please visit our official [website](https://www.llama.com/docs/model-cards-and-prompt-formats/llama3_2#-llama-3.2-lightweight-models-(1b/3b)-).

@@ -88,19 +87,17 @@ To safeguard your application, you can use our Llama Guard models for prompt cla
* We prepared this model using the following command:

```
python -m examples.models.llama.export_llama --checkpoint <pruned llama guard 1b checkpoint.pth> --params <params.json> -d fp32 -kv --use_sdpa_with_kv_cache --quantization_mode 8da4w --group_size 256 --xnnpack --max_seq_length 8193 --embedding-quantize 4,32 --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --output_prune_map <llama_guard pruned layers map.json> --output_name="llama_guard_3_1b_pruned_xnnpack.pte"
python -m examples.models.llama.export_llama --checkpoint <path-to-pruned-llama-guard-1b-checkpoint.pth> --params <path-to-your-params.json> -d fp32 -kv --use_sdpa_with_kv_cache --quantization_mode 8da4w --group_size 256 --xnnpack --max_seq_length 8193 --embedding-quantize 4,32 --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --output_prune_map <path-to-your-llama_guard-pruned-layers-map.json> --output_name="llama_guard_3_1b_pruned_xnnpack.pte"
```


### For Llama 3.1 and Llama 2 models
* You can download original model weights for Llama through Meta's official [website](https://llama.meta.com/).
* For Llama 2 models, edit the params.json file: replace `"vocab_size": -1` with `"vocab_size": 32000`. This is a short-term workaround.
* Run `examples/models/llama/install_requirements.sh` to install dependencies.
* For Llama 2 models, edit the params.json file: replace `"vocab_size": -1` with `"vocab_size": 32000`. This is a short-term workaround.
* The Llama 3.1 and Llama 2 models (8B and 7B) can run on devices with 12GB+ RAM.
* Export Llama model and generate .pte file
* Export Llama model and generate .pte file as below:

```
python -m examples.models.llama.export_llama --checkpoint <checkpoint.pth> --params <params.json> -kv --use_sdpa_with_kv_cache -X -qmode 8da4w --group_size 128 -d fp32 --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --output_name="llama.pte"
python -m examples.models.llama.export_llama --checkpoint <path-to-your-checkpoint.pth> --params <path-to-your-params.json> -kv --use_sdpa_with_kv_cache -X -qmode 8da4w --group_size 128 -d fp32 --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --output_name="llama.pte"
```
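A similar back-of-the-envelope check applies here: `-qmode 8da4w` stores roughly 4 bits per weight, so an 8B/7B export should land in the 4–5 GB range once embeddings and metadata are included (a ballpark estimate):

```
ls -lh llama.pte   # ≈8B params × 0.5 bytes ≈ 4 GB, plus embeddings and metadata
```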

You may wonder what the ‘--metadata’ flag does. It exports the model with the proper special tokens added so that the runner can easily detect EOS tokens.
@@ -109,8 +106,6 @@ You may wonder what the ‘--metadata’ flag is doing. This flag helps export t
```
python -m extension.llm.tokenizer.tokenizer -t tokenizer.model -o tokenizer.bin
```
* Rename the tokenizer for Llama 3.1 with: `mv tokenizer.model tokenizer.bin`. We are updating the demo app to support the tokenizer in its original format directly.


### For LLaVA model
* For the LLaVA 1.5 model, you can get it from Hugging Face [here](https://huggingface.co/llava-hf/llava-1.5-7b-hf).
5 changes: 4 additions & 1 deletion examples/demo-apps/apple_ios/LLaMA/README.md
@@ -1,5 +1,7 @@
# ExecuTorch Llama iOS Demo App

**[UPDATE - 10/24]** We have added support for running quantized Llama 3.2 1B/3B models in demo apps on the [XNNPACK backend](https://github.com/pytorch/executorch/blob/main/examples/demo-apps/apple_ios/LLaMA/docs/delegates/xnnpack_README.md). We currently support inference with SpinQuant and QAT+LoRA quantization methods.

We’re excited to share that the newly revamped iOS demo app is live and includes many new updates to provide a more intuitive and smoother user experience with a chat use case! The primary goal of this app is to showcase how easily ExecuTorch can be integrated into an iOS demo app and how to exercise the many features ExecuTorch and Llama models have to offer.

This app serves as a valuable resource to inspire your creativity and provide foundational code that you can customize and adapt for your particular use case.
@@ -17,7 +19,8 @@ The goal is for you to see the type of support ExecuTorch provides and feel comf
## Supported Models

As a whole, the models this app supports (varying by delegate) are:
* Llama 3.2 1B/3B
* Llama 3.2 Quantized 1B/3B
* Llama 3.2 1B/3B in BF16
* Llama 3.1 8B
* Llama 3 8B
* Llama 2 7B
examples/demo-apps/apple_ios/LLaMA/docs/delegates/xnnpack_README.md
@@ -1,7 +1,5 @@
# Building Llama iOS Demo for XNNPACK Backend

**[UPDATE - 09/25]** We have added support for running [Llama 3.2 models](#for-llama-32-1b-and-3b-models) on the XNNPACK backend. We currently support inference on their original data type (BFloat16).

This tutorial covers the end-to-end workflow for building an iOS demo app that runs on-device via the XNNPACK backend.
More specifically, it covers:
1. Export and quantization of Llama models against the XNNPACK backend.
@@ -38,24 +36,35 @@ Install dependencies
```

## Prepare Models
In this demo app, we support text-only inference with up-to-date Llama models.

Install the required packages to export the model
In this demo app, we support text-only inference with up-to-date Llama models and image reasoning inference with LLaVA 1.5.
* You can request and download model weights for Llama through Meta's official [website](https://llama.meta.com/).
* For chat use cases, download the instruct models instead of the pretrained ones.
* Install the required packages to export the model:

```
sh examples/models/llama/install_requirements.sh
```
### For Llama 3.2 1B and 3B SpinQuant models
Meta has released prequantized INT4 SpinQuant Llama 3.2 models that ExecuTorch supports on the XNNPACK backend.
* Export Llama model and generate .pte file as below:
```
python -m examples.models.llama.export_llama --checkpoint <path-to-your-checkpoint.pth> --params <path-to-your-params.json> -kv --use_sdpa_with_kv_cache -X -d fp32 --xnnpack-extended-ops --preq_mode 8da4w_output_8da8w --preq_group_size 32 --max_seq_length 2048 --preq_embedding_quantize 8,0 --use_spin_quant native --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --output_name "llama3_2_spinquant.pte"
```

### For Llama 3.2 1B and 3B models
We support BFloat16 as a data type on the XNNPACK backend for Llama 3.2 1B/3B models.
* You can download original model weights for Llama through Meta's official [website](https://llama.meta.com/).
* For chat use cases, download the instruct models instead of the pretrained ones.
* Run `examples/models/llama/install_requirements.sh` to install dependencies.
* The 1B model in BFloat16 format can run on mobile devices with 8GB RAM (iPhone 15 Pro and later). The 3B model will require 12GB+ RAM and hence will not fit on 8GB RAM phones.
### For Llama 3.2 1B and 3B QAT+LoRA models
Meta has released prequantized INT4 QAT+LoRA Llama 3.2 models that ExecuTorch supports on the XNNPACK backend.
* Export Llama model and generate .pte file as below:
```
python -m examples.models.llama.export_llama --checkpoint <path-to-your-checkpoint.pth> --params <path-to-your-params.json> -qat -lora 16 -kv --use_sdpa_with_kv_cache -X -d fp32 --xnnpack-extended-ops --preq_mode 8da4w_output_8da8w --preq_group_size 32 --max_seq_length 2048 --preq_embedding_quantize 8,0 --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --output_name "llama3_2_qat_lora.pte"
```

### For Llama 3.2 1B and 3B BF16 models
We support BF16 as a data type on the XNNPACK backend for Llama 3.2 1B/3B models.
* The 1B model in BF16 format can run on mobile devices with 8GB RAM (iPhone 15 Pro and later). The 3B model will require 12GB+ RAM and hence will not fit on 8GB RAM phones.
* Export Llama model and generate .pte file as below:

```
python -m examples.models.llama.export_llama --checkpoint <checkpoint.pth> --params <params.json> -kv --use_sdpa_with_kv_cache -X -d bf16 --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --output_name="llama3_2.pte"
python -m examples.models.llama.export_llama --checkpoint <path-to-your-checkpoint.pth> --params <path-to-your-params.json> -kv --use_sdpa_with_kv_cache -X -d bf16 --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --output_name="llama3_2_bf16.pte"
```
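The same sizing intuition applies on iOS: the BF16 1B export should come out near 2.5 GB, while the INT4 SpinQuant and QAT+LoRA exports above should be roughly 4x smaller (ballpark figures, not exact):

```
du -h llama3_2_bf16.pte llama3_2_spinquant.pte llama3_2_qat_lora.pte   # BF16 1B ≈2.5 GB; INT4 exports roughly 4x smaller
```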

For more details on using Llama 3.2 lightweight models, including the prompt template, please visit our official [website](https://www.llama.com/docs/model-cards-and-prompt-formats/llama3_2#-llama-3.2-lightweight-models-(1b/3b)-).
@@ -64,7 +73,7 @@ For more detail using Llama 3.2 lightweight models including prompt template, pl

Export the model
```
python -m examples.models.llama.export_llama --checkpoint <consolidated.00.pth> -p <params.json> -kv --use_sdpa_with_kv_cache -X -qmode 8da4w --group_size 128 -d fp32 --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --embedding-quantize 4,32 --output_name="llama3_kv_sdpa_xnn_qe_4_32.pte"
python -m examples.models.llama.export_llama --checkpoint <path-to-your-checkpoint.pth> -p <path-to-your-params.json> -kv --use_sdpa_with_kv_cache -X -qmode 8da4w --group_size 128 -d fp32 --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --embedding-quantize 4,32 --output_name="llama3_kv_sdpa_xnn_qe_4_32.pte"
```

### For LLaVA model