Commit d6186d9

Create qualcomm_README.md (#5480)
Create qualcomm_README.md (#5394)

Summary: Update Qualcomm tutorial for android demo apps

Pull Request resolved: #5394
Reviewed By: cccclai
Differential Revision: D62771411
Pulled By: WuhanMonkey
fbshipit-source-id: 49cb2ccc4a3ab4612ede8d5e4a47429ea63e834b
(cherry picked from commit 9c068ab)
Co-authored-by: Chester Hu <[email protected]>
1 parent c715c3d commit d6186d9

File tree

2 files changed: +229 -1 lines changed

examples/demo-apps/android/LlamaDemo/README.md

Lines changed: 1 addition & 1 deletion
```
@@ -29,7 +29,7 @@ First it’s important to note that currently ExecuTorch provides support across
 | Delegate | Resource |
 | ------------- | ------------- |
 | XNNPACK (CPU-based library) | [link](docs/delegates/xnnpack_README.md) |
-| QNN (Qualcomm AI Accelerators) | Coming soon |
+| QNN (Qualcomm AI Accelerators) | [link](docs/delegates/qualcomm_README.md) |
 | MediaTek (MediaTek AI Accelerators) | [link](docs/delegates/mediatek_README.md) |

 ## How to Use the App
```
examples/demo-apps/android/LlamaDemo/docs/delegates/qualcomm_README.md

Lines changed: 228 additions & 0 deletions
@@ -0,0 +1,228 @@
# Building ExecuTorch Android Demo App for Llama running Qualcomm

This tutorial covers the end-to-end workflow for building an Android demo app that uses Qualcomm AI accelerators on device.
More specifically, it covers:
1. Export and quantization of Llama models against the Qualcomm backend.
2. Building and linking the libraries required for on-device inference on Android using Qualcomm AI accelerators.
3. Building the Android demo app itself.

Verified on Linux CentOS, QNN SDK [v2.26](https://softwarecenter.qualcomm.com/api/download/software/qualcomm_neural_processing_sdk/v2.26.0.240828.zip), python 3.10, Android NDK r27 and r26b.

Phone verified: OnePlus 12, Samsung 24+, Samsung 23

## Prerequisites
* Download and unzip QNN SDK [v2.26](https://softwarecenter.qualcomm.com/api/download/software/qualcomm_neural_processing_sdk/v2.26.0.240828.zip)
* Download and unzip Android NDK [r27](https://developer.android.com/ndk/downloads)
* Android phone with Snapdragon 8 Gen 3 (SM8650) or Gen 2 (SM8550). Gen 1 and lower SoCs might be supported but are not fully validated.
* Desired Llama model weights in .PTH format. You can download them on HuggingFace ([Example](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct)); a layout sketch follows this list.

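The export commands later in this tutorial refer to the downloaded weights through a `MODEL_DIR` variable. Below is a minimal layout sketch, assuming a hypothetical download location; adjust the path to wherever your checkpoint actually lives.
```
# Hypothetical location of the downloaded Llama weights; adjust as needed.
export MODEL_DIR=$HOME/models/Meta-Llama-3-8B-Instruct/original

# The export commands in this tutorial expect these files to exist.
ls "$MODEL_DIR/consolidated.00.pth" "$MODEL_DIR/params.json" "$MODEL_DIR/tokenizer.model"
```
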
## Setup ExecuTorch
In this section, we will set up the ExecuTorch repo with Conda environment management. Make sure you have Conda available on your system (or follow the instructions to install it [here](https://anaconda.org/anaconda/conda)). The commands below were run on Linux (CentOS).

Create a Conda environment
```
conda create -n et_qnn python=3.10.0
conda activate et_qnn
```

Check out the ExecuTorch repo and sync submodules
```
git clone https://github.com/pytorch/executorch.git
cd executorch
git submodule sync
git submodule update --init
```
Install dependencies
```
./install_requirements.sh
```

## Setup QNN
```
# Set these variables correctly for your environment
export ANDROID_NDK_ROOT=$HOME/android-ndk-r27 # Download the Android NDK and unzip it to your home directory
export QNN_SDK_ROOT=$HOME/Your-SDK-Root # Folder that contains the lib directory
export EXECUTORCH_ROOT=$HOME/repos/executorch
export LD_LIBRARY_PATH=$QNN_SDK_ROOT/lib/x86_64-linux-clang/:$LD_LIBRARY_PATH
export PYTHONPATH=$EXECUTORCH_ROOT/..
cp schema/program.fbs exir/_serialize/program.fbs
cp schema/scalar_type.fbs exir/_serialize/scalar_type.fbs
```

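Before building, an optional sanity check (using the variables set above) can catch a mis-set SDK or NDK root early:
```
# Both paths should resolve if QNN_SDK_ROOT and ANDROID_NDK_ROOT are set correctly.
ls "$QNN_SDK_ROOT/lib/x86_64-linux-clang/libQnnHtp.so"
ls "$ANDROID_NDK_ROOT/build/cmake/android.toolchain.cmake"
```
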
### Build QNN backend with ExecuTorch
```
./backends/qualcomm/scripts/build.sh --release

cmake -DPYTHON_EXECUTABLE=python \
    -DCMAKE_INSTALL_PREFIX=cmake-out \
    -DEXECUTORCH_ENABLE_LOGGING=1 \
    -DCMAKE_BUILD_TYPE=Release \
    -DEXECUTORCH_BUILD_EXTENSION_MODULE=ON \
    -DEXECUTORCH_BUILD_EXTENSION_DATA_LOADER=ON \
    -DEXECUTORCH_BUILD_EXTENSION_TENSOR=ON \
    -DEXECUTORCH_BUILD_QNN=ON \
    -DQNN_SDK_ROOT=${QNN_SDK_ROOT} \
    -DEXECUTORCH_BUILD_KERNELS_QUANTIZED=ON \
    -DEXECUTORCH_BUILD_KERNELS_OPTIMIZED=ON \
    -DEXECUTORCH_BUILD_KERNELS_CUSTOM=ON \
    -Bcmake-out .
cmake --build cmake-out -j16 --target install --config Release
```

### Setup Llama Runner
Next we need to build the Llama runner. This is similar to the requirements for running Llama with XNNPACK.
```
sh examples/models/llama2/install_requirements.sh

cmake -DPYTHON_EXECUTABLE=python \
    -DCMAKE_INSTALL_PREFIX=cmake-out \
    -DCMAKE_BUILD_TYPE=Release \
    -DEXECUTORCH_BUILD_KERNELS_OPTIMIZED=ON \
    -DEXECUTORCH_BUILD_KERNELS_QUANTIZED=ON \
    -DEXECUTORCH_BUILD_KERNELS_CUSTOM=ON \
    -DEXECUTORCH_BUILD_EXTENSION_TENSOR=ON \
    -DEXECUTORCH_BUILD_QNN=ON \
    -Bcmake-out/examples/models/llama2 \
    examples/models/llama2
cmake --build cmake-out/examples/models/llama2 -j16 --config Release
```

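If the build succeeds, the host-side runner binary should appear under the build directory. A quick check, assuming the binary keeps the name used by the ExecuTorch llama2 example (llama_main):
```
# The binary name is an assumption based on the ExecuTorch llama2 example.
ls cmake-out/examples/models/llama2/llama_main
```
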
## Export Llama Model
QNN backend currently supports exporting to these data types: fp32, int4/int8 with PTQ, int4 with SpinQuant (Llama 3 only).

We also support export for different Qualcomm SoCs. We have verified SM8650 (V75) and SM8550 (V73). To export for a different SoC, add `--soc_model SM8550` to your export command (see the sketch below). Without this flag, the export defaults to SM8650.

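As a sketch, an export command targeting SM8550 simply adds that flag to the 16a4w PTQ example shown in the next subsection (the output file name here is illustrative):
```
# Illustrative: the 16a4w PTQ export from below, plus --soc_model SM8550.
python -m examples.models.llama2.export_llama --checkpoint "${MODEL_DIR}/consolidated.00.pth" -p "${MODEL_DIR}/params.json" -kv --disable_dynamic_shape --qnn --pt2e_quantize qnn_16a4w -d fp32 --soc_model SM8550 --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --output_name="llama_sm8550.pte"
```
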
### Export with PTQ
We support PTQ by default. The entire export may take ~20 minutes (Llama 3.1 8B). However, there is some accuracy regression, and we are working on improving it.
8B models might need 16GB of RAM on the device to run.

Examples:
```
# 4-bit weight-only quantization
python -m examples.models.llama2.export_llama --checkpoint "${MODEL_DIR}/consolidated.00.pth" -p "${MODEL_DIR}/params.json" -kv --disable_dynamic_shape --qnn --pt2e_quantize qnn_16a4w -d fp32 --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --output_name="test.pte"
```
If the model is very large, it may require model sharding because the Qualcomm DSP is a 32-bit system and has a 4GB size limit. For example, for Llama 3 8B models we need to shard the model into 4 parts, but ExecuTorch still packages it into one PTE file. Here is an example:
```
# 8-bit quantization with 4 shards
python -m examples.models.llama2.export_llama --checkpoint "${MODEL_DIR}/consolidated.00.pth" -p "${MODEL_DIR}/params.json" -kv --disable_dynamic_shape --qnn --pt2e_quantize qnn_8a8w -d fp32 --num_sharding 4 --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --output_name="test.pte"
```
Note: if you encounter the issue below
```
[ERROR] [Qnn ExecuTorch]: Cannot Open QNN library libQnnHtp.so, with error: libc++.so.1: cannot open shared object file: No such file or directory
```

Resolve it by one of the following:

* Install an older QNN SDK such as 2.23 or below and copy the missing library from ${QNN_SDK_ROOT}/lib/x86_64-linux-clang
* Install it yourself with apt-get
* Install it with the script ${QNN_SDK_ROOT}/bin/check-linux-dependency.sh. You can refer to the [QNN SDK documentation](https://docs.qualcomm.com/bundle/publicresource/topics/80-63442-50/setup.html?product=1601111740009302#linux-platform-dependencies)
* Install it with Conda:
```
conda install -c conda-forge libcxx=14.0.0
```

After installation, check that libc++.so.1 is discoverable via your LD_LIBRARY_PATH or system library path. Refer to this [issue](https://github.com/pytorch/executorch/issues/5120) for more detail.

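A sketch of that check on a Linux host (assumes ldconfig is available):
```
# Look for libc++.so.1 in the loader cache and in every LD_LIBRARY_PATH entry.
ldconfig -p | grep libc++.so.1
for d in ${LD_LIBRARY_PATH//:/ }; do ls "$d"/libc++.so.1 2>/dev/null; done
```
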
You may also wonder what the `--metadata` flag does. It exports the model with the proper special tokens added so that the runner can easily detect EOS tokens.

Convert the tokenizer for Llama 2
```
python -m extension.llm.tokenizer.tokenizer -t <tokenizer.model> -o tokenizer.bin
```
Convert the tokenizer for Llama 3 by renaming tokenizer.model to tokenizer.bin, as in the snippet below.

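A minimal sketch of the Llama 3 case, copying rather than renaming so the original file is preserved (the source path assumes the MODEL_DIR layout from the Prerequisites section):
```
# Llama 3 uses the tokenizer.model file as-is; it only needs the tokenizer.bin name.
cp "${MODEL_DIR}/tokenizer.model" tokenizer.bin
```
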
### Export with SpinQuant (Llama 3 8B only)
We also support Llama 3 8B with SpinQuant, where the accuracy regression is minimal.

Deploying large language models like Llama 3 on-device presents the following challenges:
* The model size is too large to fit in device memory for inference.
* High model loading and inference time.
* Difficulty in quantization.

To address these challenges, we have implemented the following solutions:
* Using `--pt2e_quantize qnn_16a4w` to quantize activations and weights, thereby reducing the on-disk model size and alleviating memory pressure during inference.
* Using `--num_sharding 8` to shard the model into sub-parts.
* Performing graph transformations to convert or decompose operations into more accelerator-friendly operations.
* Using `--optimized_rotation_path <path_to_optimized_matrix>` to apply R1 and R2 of [SpinQuant](https://github.com/facebookresearch/SpinQuant) to improve accuracy.
* Using `--calibration_data "<|start_header_id|>system<|end_header_id|..."` to ensure that the calibration for Llama 3 8B Instruct includes the special tokens in the prompt template. For more details on the prompt template, refer to the [model card](https://llama.meta.com/docs/model-cards-and-prompt-formats/meta-llama-3/) of Meta Llama 3 Instruct.

To get the optimized matrix, please refer to [SpinQuant](https://github.com/facebookresearch/SpinQuant) on GitHub. You can download the optimized rotation matrices in the Quantized Models section. Please choose "LLaMA-3-8B/8B_W4A16KV16_lr_1.5_seed_0".

To export Llama 3 8B Instruct with the Qualcomm AI Engine Direct backend, keep the following in mind:
* The host machine needs more than 100GB of memory (RAM + swap space).
* The entire process takes a few hours.
* 8B models might need 16GB of RAM on the device to run.
```
# Please note that calibration_data must include the prompt template for special tokens.
python -m examples.models.llama2.export_llama -t <path_to_tokenizer.model> -p <path_to_params.json> -c <path_to_checkpoint_for_Meta-Llama-3-8B-Instruct> --use_kv_cache --qnn --pt2e_quantize qnn_16a4w --disable_dynamic_shape --num_sharding 8 --calibration_tasks wikitext --calibration_limit 1 --calibration_seq_length 128 --optimized_rotation_path <path_to_optimized_matrix> --calibration_data "<|start_header_id|>system<|end_header_id|>\n\nYou are a funny chatbot.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nCould you tell me about Facebook?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
```

## Pushing Model and Tokenizer

Once you have the model and tokenizer ready, you can push them to the device before building the Android demo app.
```
adb shell mkdir -p /data/local/tmp/llama
adb push llama-exported.pte /data/local/tmp/llama
adb push tokenizer.bin /data/local/tmp/llama
```

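Optionally, verify the push (the file names match the commands above; substitute whatever you named your exported PTE):
```
# Both the .pte file and tokenizer.bin should be listed.
adb shell ls -l /data/local/tmp/llama
```
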
## Build AAR Library
Open a terminal window and navigate to the root directory of executorch.
Set the following environment variables:
```
export ANDROID_NDK=<path_to_android_ndk>
export ANDROID_ABI=arm64-v8a
```
Note: `<path_to_android_ndk>` is the root of the NDK, which is usually under ~/Library/Android/sdk/ndk/XX.Y.ZZZZZ on macOS and contains NOTICE and README.md. We use `<path_to_android_ndk>/build/cmake/android.toolchain.cmake` for CMake to cross-compile.
Build the Android Java extension code:
```
pushd extension/android
./gradlew build
popd
```
Run the following command to set up the required JNI library:
```
pushd examples/demo-apps/android/LlamaDemo
./gradlew :app:setupQnn
popd
```
Alternatively, you can run the shell script directly from the root directory:
```
sh examples/demo-apps/android/LlamaDemo/setup-with-qnn.sh
```
This runs the shell script that configures the required core ExecuTorch, Llama 2/3, and Android libraries, builds them, and copies them to jniLibs.
Note: If you are building the Android app mentioned in the next section on a separate machine (e.g. macOS for the app, but Linux for building and exporting with the QNN backend), make sure you copy the AAR file generated by the setup-with-qnn script to `examples/demo-apps/android/LlamaDemo/app/libs` before building the Android app; a sketch of that copy follows.

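A minimal sketch of that copy, assuming the AAR has already been transferred to the machine building the app (the source path and AAR file name below are illustrative; use whatever setup-with-qnn actually produced):
```
# Illustrative paths and file name: copy the generated AAR into the demo app's libs folder.
mkdir -p examples/demo-apps/android/LlamaDemo/app/libs
cp <path_to_generated_aar>/executorch.aar examples/demo-apps/android/LlamaDemo/app/libs/
```
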
## Run the Android Demo App

First, make sure your Android phone’s chipset version is compatible with this demo (SM8650, SM8550). You can find the Qualcomm chipset version here in the [mapping](https://docs.qualcomm.com/bundle/publicresource/topics/80-63442-50/overview.html).

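If you are unsure which SoC your phone uses, you can usually query it over adb; this is a convenience check rather than part of the official flow, and the property below is only present on recent Android builds:
```
# Prints the SoC model, e.g. SM8650 on a Snapdragon 8 Gen 3 device.
adb shell getprop ro.soc.model
```
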
If you ran the setup-with-qnn script on a separate machine rather than the one where you are building the Android app, make sure you copy the AAR file it generated into `examples/demo-apps/android/LlamaDemo/app/libs`.

### Alternative 1: Android Studio (Recommended)
Open Android Studio and select “Open an existing Android Studio project” to open examples/demo-apps/android/LlamaDemo.
Run the app (^R). This builds and launches the app on the phone.

### Alternative 2: Command line
Without Android Studio UI, we can run gradle directly to build the app. We need to set up the Android SDK path and invoke gradle.
```
export ANDROID_HOME=<path_to_android_sdk_home>
pushd examples/demo-apps/android/LlamaDemo
./gradlew :app:installDebug
popd
```
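After installDebug you can also launch the app from the command line; the application id below is an assumption, so check app/build.gradle in the demo project for the actual value:
```
# Hypothetical application id; verify it in the demo app's Gradle config before using.
adb shell monkey -p com.example.executorchllamademo -c android.intent.category.LAUNCHER 1
```
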
If the app runs successfully on your device, you should see something like below:

<p align="center">
<img src="https://github.com/pytorch/executorch/blob/main/examples/demo-apps/android/LlamaDemo/docs/screenshots/opening_the_app_details.png" width=800>
</p>

## Reporting Issues
If you encounter any bugs or issues while following this tutorial, please file a bug/issue on GitHub.
