
Commit c715c3d

Update Llava README (#5477)
Update a ReadMe (#5473)

Summary: Pull Request resolved: #5473

bypass-github-export-checks
bypass-github-pytorch-ci-checks
bypass-github-executorch-ci-checks

Reviewed By: mergennachin

Differential Revision: D62925519

fbshipit-source-id: 0872ca5f095cf0367a341d47492ae43e60a66146

(cherry picked from commit ad95e46)

Co-authored-by: Digant Desai <[email protected]>
1 parent c809692 commit c715c3d

1 file changed: examples/models/llava/README.md (153 additions, 55 deletions)
## Summary

LLaVA is the first multi-modal LLM that ExecuTorch supports. In this directory, we

- Host a model definition for [LLaVA](https://github.com/haotian-liu/LLaVA).
- Demonstrate how to export the LLaVA multimodal model to an ExecuTorch `.pte` file.
- Provide a C++ runner and Android/iOS apps that load the `.pte` file, the tokenizer, and an image, then generate responses based on the user prompt.
- Discuss the optimizations that went into enabling LLaVA on a phone, and share early performance numbers.

The tokenizer, image encoder, and pretrained text model, which is based on Meta
[Llama2-7b](https://llama.meta.com/llama2/), are loaded from the LLaVA
Hugging Face page [llava-hf/llava-1.5-7b-hf](https://huggingface.co/llava-hf/llava-1.5-7b-hf).
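
For reference, that checkpoint can be inspected directly with Hugging Face `transformers`; this is only an illustrative sketch and is not part of the ExecuTorch export flow below:

```python
# Minimal sketch, for inspection only: pull the pretrained LLaVA checkpoint
# from Hugging Face. The ExecuTorch export flow below handles loading the
# checkpoint itself; this is not a required step.
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
model = LlavaForConditionalGeneration.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)  # tokenizer + image processor
print(model.config.text_config.num_hidden_layers)    # 32 layers in the Llama2-7b backbone
```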
## What is LLaVA?

[LLaVA](https://llava-vl.github.io/) is a novel end-to-end trained large
multimodal model that combines a vision encoder and Vicuna (a Llama2-based text
model) for general-purpose visual and language understanding, achieving
impressive chat capabilities that mimic the spirit of cutting-edge multimodal
models and setting a high bar for accuracy on Science QA.

## Instructions

First you need to generate a `.pte` file for the model, along with an input image
and other artifacts. Then you need either the C++ runner or an Android or iOS
application to test things out on device.

### Generate ExecuTorch .PTE and other artifacts

Run the following command to generate `llava.pte`, `tokenizer.bin` and an image
tensor (serialized in TorchScript) `image.pt`.

Prerequisite: run `install_requirements.sh` to install ExecuTorch and run
`examples/models/llava/install_requirements.sh` to install dependencies.
```bash
python -m executorch.examples.models.llava.export_llava --pte-name llava.pte --with-artifacts
```

Currently the whole export process takes about 6 minutes. We also provide a
small test utility to verify the correctness of the exported `.pte` file. Just run:

```bash
python -m executorch.examples.models.llava.test.test_pte llava.pte
```

### Build C++ Runner

See or run the `.ci/scripts/test_llava.sh` shell script to build the C++ runner. The
script also has preliminary support for building the C++ runner for Android.

The script also includes an image utility Python script that generates the image in a
PyTorch-loadable format. Alternatively, we are working on an image format that does
not need PyTorch to load; the motivation for this is to build the C++ runner on
Android without a Torch dependency.
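
As a rough illustration of what a "PyTorch-loadable" image looks like, an image can be serialized along the lines below. The file names, resolution, dtype, and exact serialization format that `llava_main` expects are assumptions here; prefer the repo's utility script (or the `image.pt` produced by the export step) for real runs.

```python
# Rough illustration only: serialize an image as a tensor file. The resolution,
# dtype, layout, and serialization format that llava_main expects are
# assumptions here; use the repo's utility script for real runs.
import torch
import torchvision

img = torchvision.io.read_image("basketball.jpg")                # uint8 tensor, CHW layout
img = torchvision.transforms.functional.resize(img, [336, 336])  # LLaVA-1.5 vision encoder input size
torch.save(img, "image.pt")                                       # readable with torch.load
```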

Then you should be able to find the `llava_main` binary:

```bash
cmake-out/examples/models/llava/llava_main
```

### Build Mobile Apps

#### Android

We can run LLaVA using the LLaMA Demo Apps. Please refer to [this
tutorial](https://github.com/pytorch/executorch/tree/main/examples/demo-apps/android/LlamaDemo)
for full instructions on building the Android LLaMA Demo App.

#### iOS

We can run LLaVA using the LLaMA Demo Apps. Please refer to [this
tutorial](https://github.com/pytorch/executorch/tree/main/examples/demo-apps/apple_ios/LLaMA)
for full instructions on building the iOS LLaMA Demo App.

### Running LLaVA

Run:

```bash
cmake-out/examples/models/llava/llava_main \
    --model_path=llava.pte \
    --tokenizer_path=tokenizer.bin \
    --image_path=image.pt \
    --prompt="ASSISTANT:" \
    --seq_len=768 \
    --temperature=0
```

(See `--help` for other options.)

For the example image used here,

![image](https://upload.wikimedia.org/wikipedia/commons/3/3e/Chicago_Bulls_-_New_Jersey_Nets_match_on_March_28%2C_1991.jpg)

you should get a response like the following (tested on Arm CPUs with the ET XNNPACK delegate):

```
ASSISTANT: image captures a basketball game in progress, with several players on the court. ...
```
99+
## Optimizations and Results
100+
101+
Since LLaVA model needs at least 4-bit quantization to fit even within some of
102+
the high-end phones, results presented here correspond to 4-bit groupwise
103+
post-training quantized model.
104+
105+
In addition to that, work is mainly focused on using Arm CPUs and ET XNNPACK delegate.
106+
107+
### Memory Footprint Reduction Techniques
108+
109+
With Llava, we needed to find a way to reduce the memory footprint in order to
110+
make it feasible to run on edge devices. Out of the box, even with 4-bit
111+
quantized weights, the memory footprint is around ~11 GiB, which is
112+
prohibitively large even for high-end Android or iOS devices.
113+
114+
We did several optimizations, which should be already enabled if you follow this
115+
tutorial, to get the memory footprint down to ~5 GiB, which unblocks us to run
116+
on high-end devices.
117+
118+
#### Sharing intermediate memory across delegates
119+
120+
Sharing working memory across ET XNNPACK delegates helps reduce the peak memory
121+
usage for LLMs with many DQLinears. We reduced it by 36.1% (from 10.44GiB to
122+
6.67GiB) for Llava towards unblocking it to run on Phones.
123+
124+
#### Reducing maximum sequence length
125+
126+
To free up more memory, we examined non-constant memory usage, specifically
127+
focusing on intermediate tensors used throughout the model during inference.
128+
The majority of these were found in the KV-cache allocations. Based on “minimum
129+
can get away with” heuristic, we reduced max sequence length number to 768 from
130+
previous default 2048. This adjustment led to a further memory reduction of
131+
approximately 1.23 GiB (from 6.67 GiB to 5.44 GiB).
132+
133+
#### Quantizing embedding weights to 8b
134+
135+
By quantizing the embedding layer to 8 bit, we were able to achieve an
136+
additional memory footprint reduction of approximately 300 MiB, bringing the
137+
total down to ~5 GiB.
138+
139+
### Performance Optimizations
140+
141+
#### Decode performance
142+
143+
This was already heavily optimized through KV-cache and GEMV kernel
144+
optimization efforts for LLama2/3.
145+

#### Encode performance

With large image-based prompts, this was the focus of the performance
optimizations for LLaVA. We implemented two main optimizations to improve the
image prefill performance by more than 2x over the baseline.

* **Two XNNPACK Partitioners**

  For text-only LLMs, our approach involved lowering only DQLinear ops
  to XNNPACK and relying on ExecuTorch-optimized operators or custom ops
  (utilizing Neon SIMD) to support multiplication, addition, and other
  operations. For LLaVA, lowering these operations to XNNPACK as well, via a
  second XNNPACK partitioner, significantly improves Time to First Token (TTFT);
  see the sketch after this list.

* **New Arm Neon I8mm GEMM kernels**

  We introduced new kernels in XNNPACK for the quantization scheme used
  here, upgrading our existing dot-product based GEMM kernels to i8mm based
  GEMM kernels. The new kernels offer significantly improved performance by
  leveraging the more efficient SMMLA instruction from Arm Neon. However, it's
  worth noting that this instruction is only available on newer Arm CPUs.
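
Below is a sketch of what the two-partitioner lowering could look like. The import paths and constructor arguments are assumptions about the ExecuTorch XNNPACK partitioner API and may not match `export_llava.py` exactly:

```python
# Sketch only: lower dynamically-quantized linears and the remaining fp32 ops
# with two XNNPACK partitioner instances. Import paths and arguments are
# assumptions and may differ from the actual export_llava.py code.
from executorch.backends.xnnpack.partition.config.xnnpack_config import ConfigPrecisionType
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner

partitioners = [
    # DQLinear ops (4-bit groupwise weights, dynamic activation quantization)
    XnnpackPartitioner(config_precisions=ConfigPrecisionType.DYNAMIC_QUANT, per_op_mode=True),
    # Everything else XNNPACK can handle in fp32 (mul, add, etc.), which
    # matters most for the image-encoder prefill path
    XnnpackPartitioner(config_precisions=ConfigPrecisionType.FP32),
]
# edge_program.to_backend(partitioners) would then delegate both sets of ops.
```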

### Results

Note that this is an active area of development in the ExecuTorch repository. You
will need PR [5380](https://github.com/pytorch/executorch/pull/5380) to
supply an image to the C++ runner on Android without a Torch dependency. It
should be merged soon.

With those caveats out of the way, here are some preliminary numbers (averaged over
three runs) for LLaVA using the C++ runner on an Android OnePlus 12 device with 12 GiB
of memory.

| Experiment Setup | Prefill time (seconds) | Decode speed (tokens/second) |
| :------------- | -------------: | -------------: |
| Baseline | 29.95 | 8.75 |
| + Two XNNPACK Partitioners | 17.82 | 8.93 |
| + New Arm Neon i8mm GEMM Kernels | 14.60 | 8.92 |
We appreciate your feedback. Please let us know if you run into any issues.
