## Summary

LLaVA is the first multi-modal LLM ExecuTorch supports. In this directory, we

- Host a model definition for [LLaVA](https://github.com/haotian-liu/LLaVA).
- Demonstrate how to export the LLaVA multimodal model to an ExecuTorch `.pte` file.
- Provide a C++ runner and Android/iOS apps that load the `.pte` file, the tokenizer, and an image, then generate responses based on the user prompt.
- Discuss the optimizations that went into enabling LLaVA on a phone, along with early performance numbers.

The tokenizer, the image encoder, and the pretrained text model, which is based
on Meta [Llama2-7b](https://llama.meta.com/llama2/), are loaded from the LLaVA
Hugging Face page [llava-hf/llava-1.5-7b-hf](https://huggingface.co/llava-hf/llava-1.5-7b-hf).

## What is LLaVA?

[LLaVA](https://llava-vl.github.io/) is a novel end-to-end trained large
multimodal model that combines a vision encoder and Vicuna (a Llama2-based text
model) for general-purpose visual and language understanding. It achieves
impressive chat capabilities in the spirit of cutting-edge multimodal models
and sets a high bar for accuracy on Science QA.

## Instructions

First you need to generate a `.pte` file for the model, along with an input
image and other artifacts. Then you need either the C++ runner or the Android
or iOS app to test things out on device.

### Generate the ExecuTorch `.pte` file and other artifacts

Prerequisite: run `install_requirements.sh` to install ExecuTorch and run
`examples/models/llava/install_requirements.sh` to install the example's dependencies.
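
From the root of an ExecuTorch checkout, those two steps might look roughly
like this (a sketch; adjust paths and invocation to your setup):

```bash
# Install ExecuTorch and its Python dependencies
./install_requirements.sh

# Install the LLaVA example's extra dependencies
bash examples/models/llava/install_requirements.sh
```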

Then run the following command to generate `llava.pte`, `tokenizer.bin`, and an
image tensor (serialized in TorchScript) `image.pt`:

```bash
python -m executorch.examples.models.llava.export_llava --pte-name llava.pte --with-artifacts
```

Currently the whole export process takes about 6 minutes. We also provide a
small test utility to verify the correctness of the exported `.pte` file. Just run:

```bash
python -m executorch.examples.models.llava.test.test_pte llava.pte
```

### Build C++ Runner

Run the `.ci/scripts/test_llava.sh` shell script (or follow the steps in it) to
build the C++ runner. The script also has preliminary support for building the
C++ runner for Android.
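
For example, from the repository root (a sketch; the script builds the
ExecuTorch libraries and then the LLaVA runner, and its exact behavior may
change):

```bash
# Build ExecuTorch core libraries and the llava_main runner via the CI script
bash .ci/scripts/test_llava.sh
```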

The script also includes an image utility Python step that generates the input
image in a PyTorch-loadable format. Separately, we are working on an image
format that does not need PyTorch to load, motivated by building the C++
runner for Android.

Then you should be able to find the `llava_main` binary:

```bash
cmake-out/examples/models/llava/llava_main
```

### Build Mobile Apps

#### Android

We can run LLaVA using the Llama Demo App. Please refer to [this
tutorial](https://github.com/pytorch/executorch/tree/main/examples/demo-apps/android/LlamaDemo)
for full instructions on building the Android Llama Demo App.

#### iOS

We can run LLaVA using the Llama Demo App. Please refer to [this
tutorial](https://github.com/pytorch/executorch/tree/main/examples/demo-apps/apple_ios/LLaMA)
for full instructions on building the iOS Llama Demo App.

### Running LLaVA

Run:
```bash
cmake-out/examples/models/llava/llava_main \
    --model_path=llava.pte \
    --tokenizer_path=tokenizer.bin \
    --image_path=image.pt \
    --prompt="ASSISTANT:" \
    --seq_len=768 \
    --temperature=0
```
(See `--help` for other options.)

We used the following example image as the input:

![image](https://upload.wikimedia.org/wikipedia/commons/3/3e/Chicago_Bulls_-_New_Jersey_Nets_match_on_March_28%2C_1991.jpg)

You should get a response like the following (tested on Arm CPUs with the ET
XNNPACK delegate):

```
ASSISTANT: image captures a basketball game in progress, with several players on the court. ...
```

## Optimizations and Results

Since the LLaVA model needs at least 4-bit quantization to fit even on some
high-end phones, the results presented here correspond to a 4-bit groupwise
post-training quantized model.

In addition, this work focuses mainly on Arm CPUs and the ET XNNPACK delegate.

### Memory Footprint Reduction Techniques

With LLaVA, we needed to find a way to reduce the memory footprint to make it
feasible to run on edge devices. Out of the box, even with 4-bit quantized
weights, the memory footprint is around ~11 GiB, which is prohibitively large
even for high-end Android or iOS devices.

We applied several optimizations, which should already be enabled if you follow
this tutorial, to get the memory footprint down to ~5 GiB, which makes it
possible to run on high-end devices.

#### Sharing intermediate memory across delegates

Sharing working memory across ET XNNPACK delegate instances helps reduce the
peak memory usage for LLMs with many DQLinear ops. For LLaVA, this reduced peak
memory by 36.1% (from 10.44 GiB to 6.67 GiB), moving us toward running on
phones.
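
In the CMake build, this behavior corresponds to the
`EXECUTORCH_XNNPACK_SHARED_WORKSPACE` option; `.ci/scripts/test_llava.sh` is
expected to enable it for you, so the configuration sketch below (with most
other options omitted) is only relevant if you build manually:

```bash
# Configure ExecuTorch with the XNNPACK backend and a shared workspace across
# delegate instances (flag names taken from the repository's CMake options)
cmake -DEXECUTORCH_BUILD_XNNPACK=ON \
      -DEXECUTORCH_XNNPACK_SHARED_WORKSPACE=ON \
      -Bcmake-out .
```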

#### Reducing maximum sequence length

To free up more memory, we examined non-constant memory usage, specifically
focusing on intermediate tensors used throughout the model during inference.
The majority of these were found in the KV-cache allocations. Based on a
"minimum we can get away with" heuristic, we reduced the maximum sequence
length to 768 from the previous default of 2048. This adjustment led to a
further memory reduction of approximately 1.23 GiB (from 6.67 GiB to 5.44 GiB).

#### Quantizing embedding weights to 8 bits

By quantizing the embedding layer to 8 bits, we were able to achieve an
additional memory footprint reduction of approximately 300 MiB, bringing the
total down to ~5 GiB.

### Performance Optimizations

#### Decode performance

Decode was already heavily optimized through the KV-cache and GEMV kernel
optimization efforts for Llama2/3.

#### Encode performance

With large image-based prompts, encode (prefill) was the focus of the
performance optimizations for LLaVA. We implemented two main optimizations,
which together cut the image prefill time to less than half of the baseline.

* **Two XNNPACK Partitioners**

For text-only LLMs, our approach involved lowering only DQLinear ops to
XNNPACK and relying on ExecuTorch-optimized operators or custom ops (utilizing
Neon SIMD) for multiplication, addition, and other operations. For LLaVA,
lowering these remaining operations to XNNPACK as well (hence the two
partitioners) significantly improves Time to First Token (TTFT).

* **New Arm Neon i8mm GEMM kernels**

We introduced new kernels in XNNPACK for the quantization scheme used here,
upgrading our existing dot-product-based GEMM kernels to i8mm-based GEMM
kernels. The new kernels offer significantly improved performance by leveraging
the more efficient SMMLA instruction from Arm Neon. Note, however, that this
instruction is only available on newer Arm CPUs.

### Results

Note that this is an active area of development in the ExecuTorch repository.
You will need PR [5380](https://github.com/pytorch/executorch/pull/5380) to
supply an image to the C++ runner on Android without a Torch dependency. It
should be merged soon.
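
Once you have an Android build of `llava_main`, one way to reproduce such a run
is to push the binary and artifacts to the device over `adb`; the sketch below
is hypothetical (device paths and the Android build output location are
illustrative):

```bash
# Create a working directory on the device and push the artifacts
adb shell mkdir -p /data/local/tmp/llava
adb push llava.pte tokenizer.bin image.pt /data/local/tmp/llava/
adb push cmake-out-android/examples/models/llava/llava_main /data/local/tmp/llava/

# Run on device with the same flags as the host example above
adb shell 'cd /data/local/tmp/llava && chmod +x llava_main && \
  ./llava_main --model_path=llava.pte --tokenizer_path=tokenizer.bin \
    --image_path=image.pt --prompt="ASSISTANT:" --seq_len=768 --temperature=0'
```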

With those caveats out of the way, here are some preliminary numbers (averaged
over three runs) for LLaVA using the C++ runner on an Android OnePlus 12 device
with 12 GiB of memory.

| Experiment Setup | Prefill time in seconds | Decode tokens/second |
| :------------- | -------------: | -------------: |
| Baseline | 29.95 | 8.75 |
| + Two XNNPACK Partitioners | 17.82 | 8.93 |
| + New Arm Neon i8mm GEMM Kernels | 14.60 | 8.92 |

We appreciate your feedback. Please let us know if you run into any issues.