# Building an ExecuTorch Android Demo App for Llama Running on Qualcomm

This tutorial covers the end-to-end workflow for building an Android demo app that runs Llama on device using Qualcomm AI accelerators.
More specifically, it covers:
1. Exporting and quantizing Llama models for the Qualcomm backend.
2. Building and linking the libraries required to run inference on-device on Android using Qualcomm AI accelerators.
3. Building the Android demo app itself.

Verified on Linux CentOS, QNN SDK [v2.26](https://softwarecenter.qualcomm.com/api/download/software/qualcomm_neural_processing_sdk/v2.26.0.240828.zip), Python 3.10, Android NDK r27 and r26b.

Phones verified: OnePlus 12, Samsung Galaxy S24+, Samsung Galaxy S23

## Prerequisites
* Download and unzip the QNN SDK [v2.26](https://softwarecenter.qualcomm.com/api/download/software/qualcomm_neural_processing_sdk/v2.26.0.240828.zip) (see the unzip sketch after this list).
* Download and unzip the Android NDK [r27](https://developer.android.com/ndk/downloads).
* An Android phone with a Snapdragon 8 Gen 3 (SM8650) or Gen 2 (SM8550) SoC. Gen 1 and lower SoCs might be supported but are not fully validated.
* The desired Llama model weights in .pth format. You can download them from Hugging Face ([example](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct)).

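Below is a minimal sketch of the download-and-unzip step, assuming both archives were saved to your home directory. The archive file names and extraction layout are assumptions; adjust them to match what you actually downloaded, and make sure QNN_SDK_ROOT (set later) ends up pointing at the folder that contains lib.
```
# Assumed archive names; adjust to match your downloads
unzip -q $HOME/v2.26.0.240828.zip -d $HOME/qnn-sdk        # QNN SDK
unzip -q $HOME/android-ndk-r27-linux.zip -d $HOME         # creates $HOME/android-ndk-r27
```
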
## Setup ExecuTorch
In this section, we set up the ExecuTorch repo with Conda environment management. Make sure Conda is available on your system (or follow the instructions [here](https://anaconda.org/anaconda/conda) to install it). The commands below were run on Linux (CentOS).

Create a Conda environment
```
conda create -n et_qnn python=3.10.0
conda activate et_qnn
```

Checkout ExecuTorch repo and sync submodules
```
git clone https://github.com/pytorch/executorch.git
cd executorch
git submodule sync
git submodule update --init
```
Install dependencies
```
./install_requirements.sh
```

## Setup QNN
```
# Set these variables correctly for your environment
export ANDROID_NDK_ROOT=$HOME/android-ndk-r27 # Download the Android NDK and unzip it to your home directory
export QNN_SDK_ROOT=$HOME/Your-SDK-Root # Folder that contains lib
export EXECUTORCH_ROOT=$HOME/repos/executorch
export LD_LIBRARY_PATH=$QNN_SDK_ROOT/lib/x86_64-linux-clang/:$LD_LIBRARY_PATH
export PYTHONPATH=$EXECUTORCH_ROOT/..
cp schema/program.fbs exir/_serialize/program.fbs
cp schema/scalar_type.fbs exir/_serialize/scalar_type.fbs
```
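
As an optional sanity check, you can confirm that the QNN HTP library referenced later in this tutorial is visible under the SDK root you just exported (a sketch; the library location assumes the standard SDK layout):
```
# libQnnHtp.so ships with the QNN SDK and should be found here
ls $QNN_SDK_ROOT/lib/x86_64-linux-clang/libQnnHtp.so
```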

### Build QNN backend with ExecuTorch
```
./backends/qualcomm/scripts/build.sh --release

cmake -DPYTHON_EXECUTABLE=python \
  -DCMAKE_INSTALL_PREFIX=cmake-out \
  -DEXECUTORCH_ENABLE_LOGGING=1 \
  -DCMAKE_BUILD_TYPE=Release \
  -DEXECUTORCH_BUILD_EXTENSION_MODULE=ON \
  -DEXECUTORCH_BUILD_EXTENSION_DATA_LOADER=ON \
  -DEXECUTORCH_BUILD_EXTENSION_TENSOR=ON \
  -DEXECUTORCH_BUILD_QNN=ON \
  -DQNN_SDK_ROOT=${QNN_SDK_ROOT} \
  -DEXECUTORCH_BUILD_KERNELS_QUANTIZED=ON \
  -DEXECUTORCH_BUILD_KERNELS_OPTIMIZED=ON \
  -DEXECUTORCH_BUILD_KERNELS_CUSTOM=ON \
  -Bcmake-out .
cmake --build cmake-out -j16 --target install --config Release
```

### Setup Llama Runner
Next we need to build and compile the Llama runner. This is similar to the requirements for running Llama with XNNPACK.
```
sh examples/models/llama2/install_requirements.sh

cmake -DPYTHON_EXECUTABLE=python \
  -DCMAKE_INSTALL_PREFIX=cmake-out \
  -DCMAKE_BUILD_TYPE=Release \
  -DEXECUTORCH_BUILD_KERNELS_OPTIMIZED=ON \
  -DEXECUTORCH_BUILD_KERNELS_QUANTIZED=ON \
  -DEXECUTORCH_BUILD_KERNELS_CUSTOM=ON \
  -DEXECUTORCH_BUILD_EXTENSION_TENSOR=ON \
  -DEXECUTORCH_BUILD_QNN=ON \
  -Bcmake-out/examples/models/llama2 \
  examples/models/llama2
cmake --build cmake-out/examples/models/llama2 -j16 --config Release
```
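
If the build succeeds, the runner binary should appear in the build output directory. The check below is a sketch; it assumes the runner target is named llama_main, which is what the ExecuTorch Llama example builds at the time of writing:
```
# Verify the Llama runner binary was produced (path and name assumed)
ls cmake-out/examples/models/llama2/llama_main
```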

## Export Llama Model
The QNN backend currently supports exporting these data types: fp32, int4/int8 with PTQ, and int4 with SpinQuant (Llama 3 only).

We also support exporting for different Qualcomm SoCs. We have verified SM8650 (V75) and SM8550 (V73). To export for a different SoC, add "--soc_model SM8550" to your export command. Without this flag, the export defaults to SM8650.
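
For example, the command below appends the flag to the 16a4w export shown in the next section to target SM8550 (a sketch; checkpoint and params paths, and the output name, are placeholders):
```
# Same 16a4w export as below, but targeting SM8550 (paths and output name illustrative)
python -m examples.models.llama2.export_llama --checkpoint "${MODEL_DIR}/consolidated.00.pth" -p "${MODEL_DIR}/params.json" -kv --disable_dynamic_shape --qnn --pt2e_quantize qnn_16a4w -d fp32 --soc_model SM8550 --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --output_name="llama_sm8550.pte"
```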

### Export with PTQ
We support PTQ by default. The entire export may take ~20 minutes (Llama 3.1 8B). However, there is an accuracy regression, and we are working on improving it.
8B models might need 16GB of RAM on the device to run.

Examples:
```
# 16a4w quantization (4-bit weights, 16-bit activations)
python -m examples.models.llama2.export_llama --checkpoint "${MODEL_DIR}/consolidated.00.pth" -p "${MODEL_DIR}/params.json" -kv --disable_dynamic_shape --qnn --pt2e_quantize qnn_16a4w -d fp32 --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --output_name="test.pte"
```
If the model is very large, it may require model sharding because the Qualcomm DSP is a 32-bit system and has a 4GB size limit. For example, for Llama 3 8B models, we need to shard the model into 4 parts, but ExecuTorch still packages them into a single PTE file. Here is an example:
```
# 8-bit quantization (8a8w) with 4 shards
python -m examples.models.llama2.export_llama --checkpoint "${MODEL_DIR}/consolidated.00.pth" -p "${MODEL_DIR}/params.json" -kv --disable_dynamic_shape --qnn --pt2e_quantize qnn_8a8w -d fp32 --num_sharding 4 --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --output_name="test.pte"
```
Note: if you encounter the issue below
```
[ERROR] [Qnn ExecuTorch]: Cannot Open QNN library libQnnHtp.so, with error: libc++.so.1: cannot open shared object file: No such file or directory
```

resolve it in one of the following ways:

* Install an older QNN SDK, such as 2.23 or below, and copy libc++ from its ${QNN_SDK_ROOT}/lib/x86_64-linux-clang
* Install it yourself with apt-get
* Install it with the script at ${QNN_SDK_ROOT}/bin/check-linux-dependency.sh.
  You can refer to the [QNN SDK documentation](https://docs.qualcomm.com/bundle/publicresource/topics/80-63442-50/setup.html?product=1601111740009302#linux-platform-dependencies)
* Install it with Conda:
```
conda install -c conda-forge libcxx=14.0.0
```

After installation, check that libc++.so.1 is on your LD_LIBRARY_PATH or in a system library path. Refer to this [issue](https://github.com/pytorch/executorch/issues/5120) for more detail.
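
One way to check (a sketch, not the only way) is to ask the dynamic linker and search the directories on LD_LIBRARY_PATH:
```
# Is libc++.so.1 registered with the system linker?
ldconfig -p | grep libc++.so.1
# Is it present in any directory on LD_LIBRARY_PATH?
echo "$LD_LIBRARY_PATH" | tr ':' '\n' | xargs -I{} sh -c 'ls {}/libc++.so.1 2>/dev/null'
```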

You may also wonder what the "--metadata" flag does. This flag exports the model with the proper special tokens added so that the runner can easily detect EOS tokens.
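
The IDs in the commands above are for Llama 3. For a Llama 2 checkpoint, the same flag would carry Llama 2's token IDs instead (a sketch; the values assume the standard Llama 2 tokenizer, where BOS is 1 and EOS is 2):
```
# Llama 2 metadata (illustrative flag fragment)
--metadata '{"get_bos_id":1, "get_eos_ids":[2]}'
```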

Convert the tokenizer for Llama 2:
```
python -m extension.llm.tokenizer.tokenizer -t <tokenizer.model> -o tokenizer.bin
```
Convert the tokenizer for Llama 3: rename tokenizer.model to tokenizer.bin, as in the sketch below.
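
For example (the path is a placeholder):
```
# Llama 3 ships a tiktoken-style tokenizer; only a rename/copy is needed
cp <path_to_llama3_tokenizer>/tokenizer.model tokenizer.bin
```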

### Export with SpinQuant (Llama 3 8B only)
We also support Llama 3 8B with SpinQuant, where the accuracy regression is minimal.

Deploying large language models like Llama 3 on-device presents the following challenges:
* The model size is too large to fit in device memory for inference.
* High model loading and inference time.
* Difficulty in quantization.

To address these challenges, we have implemented the following solutions:
* Using --pt2e_quantize qnn_16a4w to quantize activations and weights, thereby reducing the on-disk model size and alleviating memory pressure during inference.
* Using --num_sharding 8 to shard the model into sub-parts.
* Performing graph transformations to convert or decompose operations into more accelerator-friendly operations.
* Using --optimized_rotation_path <path_to_optimized_matrix> to apply R1 and R2 of [SpinQuant](https://github.com/facebookresearch/SpinQuant) to improve accuracy.
* Using --calibration_data "<|start_header_id|>system<|end_header_id|..." to ensure that during the quantization of Llama 3 8B Instruct, the calibration includes special tokens in the prompt template. For more details on the prompt template, refer to the [model card](https://llama.meta.com/docs/model-cards-and-prompt-formats/meta-llama-3/) of Meta Llama 3 Instruct.

To get the optimized rotation matrix, please refer to [SpinQuant](https://github.com/facebookresearch/SpinQuant) on GitHub. You can download the optimized rotation matrices in the Quantized Models section. Please choose "LLaMA-3-8B/8B_W4A16KV16_lr_1.5_seed_0".

Before exporting Llama 3 8B Instruct with the Qualcomm AI Engine Direct backend, note the following:
* The host machine needs more than 100GB of memory (RAM + swap space).
* The entire process takes a few hours.
* 8B models might need 16GB of RAM on the device to run.
```
# Please note that calibration_data must include the prompt template for special tokens.
python -m examples.models.llama2.export_llama -t <path_to_tokenizer.model> -p <path_to_params.json> -c <path_to_checkpoint_for_Meta-Llama-3-8B-Instruct> --use_kv_cache --qnn --pt2e_quantize qnn_16a4w --disable_dynamic_shape --num_sharding 8 --calibration_tasks wikitext --calibration_limit 1 --calibration_seq_length 128 --optimized_rotation_path <path_to_optimized_matrix> --calibration_data "<|start_header_id|>system<|end_header_id|>\n\nYou are a funny chatbot.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nCould you tell me about Facebook?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
```

## Pushing Model and Tokenizer

Once the model and tokenizer are ready, you can push them to the device before building the Android demo app.
```
adb shell mkdir -p /data/local/tmp/llama
adb push llama-exported.pte /data/local/tmp/llama
adb push tokenizer.bin /data/local/tmp/llama
```
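
You can optionally confirm that both files landed on the device:
```
# Lists the pushed model and tokenizer
adb shell ls -l /data/local/tmp/llama
```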

## Build AAR Library
Open a terminal window and navigate to the root directory of the ExecuTorch repository.
Set the following environment variables:
```
export ANDROID_NDK=<path_to_android_ndk>
export ANDROID_ABI=arm64-v8a
```
Note: <path_to_android_ndk> is the root of the NDK, which is usually under ~/Library/Android/sdk/ndk/XX.Y.ZZZZZ on macOS and contains NOTICE and README.md. We use <path_to_android_ndk>/build/cmake/android.toolchain.cmake for CMake to cross-compile.
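
A quick way to confirm ANDROID_NDK points at the right place is to check for the toolchain file mentioned above:
```
# Should list the CMake toolchain file used for cross-compilation
ls $ANDROID_NDK/build/cmake/android.toolchain.cmake
```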
Build the Android Java extension code:
```
pushd extension/android
./gradlew build
popd
```
Run the following command to set up the required JNI library:
```
pushd examples/demo-apps/android/LlamaDemo
./gradlew :app:setupQnn
popd
```
Alternatively, you can run the shell script directly from the root directory:
```
sh examples/demo-apps/android/LlamaDemo/setup-with-qnn.sh
```
This runs the shell script that configures the required core ExecuTorch, Llama 2/3, and Android libraries, builds them, and copies them to jniLibs.
Note: If you are building the Android app mentioned in the next section on a separate machine (e.g. on macOS while building and exporting for the QNN backend on Linux), make sure you copy the AAR file generated by the setup-with-qnn script to "examples/demo-apps/android/LlamaDemo/app/libs" before building the Android app. A sketch of that copy step follows.
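
The commands below are a sketch of that copy, assuming the generated archive is named executorch.aar; check the actual file name produced by the script on your machine:
```
# Copy the AAR built on the export machine into the app's libs folder (file name assumed)
mkdir -p examples/demo-apps/android/LlamaDemo/app/libs
cp <path_to_build_output>/executorch.aar examples/demo-apps/android/LlamaDemo/app/libs/
```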

## Run the Android Demo App

First, make sure your Android phone's chipset is compatible with this demo (SM8650, SM8550). You can find the Qualcomm chipset versions in this [mapping](https://docs.qualcomm.com/bundle/publicresource/topics/80-63442-50/overview.html).
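
One way to check which SoC your connected phone uses (a sketch; this system property is available on recent Android builds):
```
# Prints the SoC model, e.g. SM8650
adb shell getprop ro.soc.model
```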

If you run the setup-with-qnn script on a machine other than the one where you build the Android app, make sure you copy the AAR file it generated into "examples/demo-apps/android/LlamaDemo/app/libs".

### Alternative 1: Android Studio (Recommended)
Open Android Studio and select "Open an existing Android Studio project" to open examples/demo-apps/android/LlamaDemo.
Run the app (^R). This builds and launches the app on the phone.

### Alternative 2: Command line
Without the Android Studio UI, we can run Gradle directly to build the app. We need to set the Android SDK path and invoke Gradle.
```
export ANDROID_HOME=<path_to_android_sdk_home>
pushd examples/demo-apps/android/LlamaDemo
./gradlew :app:installDebug
popd
```
If the app runs successfully on your device, you should see something like the following:

<p align="center">
<img src="https://github.com/pytorch/executorch/blob/main/examples/demo-apps/android/LlamaDemo/docs/screenshots/opening_the_app_details.png" width=800>
</p>

## Reporting Issues
If you encounter any bugs or issues while following this tutorial, please file a bug/issue on GitHub.