
Commit 3a1d98f

Add example docs for granite vision
Signed-off-by: Alex-Brooks <[email protected]>
1 parent c392e50 commit 3a1d98f

1 file changed: +179 -0 lines changed

# Granite Vision

Download the model and point your `GRANITE_MODEL` environment variable to the path.

```bash
git clone https://huggingface.co/ibm-granite/granite-vision-3.1-2b-preview
export GRANITE_MODEL=/Users/alexanderjbrooks/workspace/develop/llama.cpp/examples/llava/granite-vision-3.1-2b-preview
```

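Optionally, you can sanity check that the variable points at the cloned model before continuing (a quick extra check, not part of the original steps):

```bash
# Should list the Hugging Face model files, e.g. config.json and the model weights.
ls $GRANITE_MODEL
```
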
### 1. Running llava surgery v2
First, we need to run the llava surgery script as shown below:

`python llava_surgery_v2.py -C -m $GRANITE_MODEL`

You should see two new files (`llava.clip` and `llava.projector`) written into your model's directory, i.e., the visual encoder and the projector split out into their own files:

`ls $GRANITE_MODEL | grep -i llava`

You can load both files directly with PyTorch to validate that they are nonempty:
```python
import os
import torch

MODEL_PATH = os.getenv("GRANITE_MODEL")
if not MODEL_PATH:
    raise ValueError("env var GRANITE_MODEL is unset!")

encoder_tensors = torch.load(os.path.join(MODEL_PATH, "llava.clip"))
projector_tensors = torch.load(os.path.join(MODEL_PATH, "llava.projector"))

assert len(encoder_tensors) > 0
assert len(projector_tensors) > 0
```

If you inspect the `.keys()` of the loaded tensors, you should see a lot of `vision_model` tensors in `encoder_tensors`, and 5 tensors (`'multi_modal_projector.linear_1.bias'`, `'multi_modal_projector.linear_1.weight'`, `'multi_modal_projector.linear_2.bias'`, `'multi_modal_projector.linear_2.weight'`, `'image_newline'`) in the multimodal `projector_tensors`.

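For example, continuing from the snippet above, a quick way to peek at what was split out:

```python
# Continuing from the validation snippet above.
print(f"{len(encoder_tensors)} encoder tensors")
print(sorted(projector_tensors.keys()))
```
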
### 2. Creating the Visual Component GGUF
To create the GGUF for the visual components, we need to write a config for the visual encoder; make sure the config contains the correct `image_grid_pinpoints`.

Note: we refer to this file as `$VISION_CONFIG` later on.
```json
{
    "_name_or_path": "siglip-model",
    "architectures": [
        "SiglipVisionModel"
    ],
    "image_grid_pinpoints": [
        [384,768],
        [384,1152],
        [384,1536],
        [384,1920],
        [384,2304],
        [384,2688],
        [384,3072],
        [384,3456],
        [384,3840],
        [768,384],
        [768,768],
        [768,1152],
        [768,1536],
        [768,1920],
        [1152,384],
        [1152,768],
        [1152,1152],
        [1536,384],
        [1536,768],
        [1920,384],
        [1920,768],
        [2304,384],
        [2688,384],
        [3072,384],
        [3456,384],
        [3840,384]
    ],
    "mm_patch_merge_type": "spatial_unpad",
    "hidden_size": 1152,
    "image_size": 384,
    "intermediate_size": 4304,
    "model_type": "siglip_vision_model",
    "num_attention_heads": 16,
    "num_hidden_layers": 27,
    "patch_size": 14,
    "transformers_version": "4.45.0.dev0",
    "layer_norm_eps": 1e-6,
    "hidden_act": "gelu_pytorch_tanh",
    "projection_dim": 0,
    "vision_feature_layer": [-24, -20, -12, -1]
}
```

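As a quick sanity check (not part of the original steps), you can load the config and confirm the grid pinpoints line up with the base image size; this sketch assumes you exported `VISION_CONFIG` in your shell:

```python
import json
import os

# Assumes the config was saved and exported, e.g. `export VISION_CONFIG=/path/to/config.json`.
with open(os.environ["VISION_CONFIG"]) as f:
    cfg = json.load(f)

# Each pinpoint should be a multiple of the base image size (384 for this config).
assert cfg["model_type"] == "siglip_vision_model"
assert all(h % cfg["image_size"] == 0 and w % cfg["image_size"] == 0
           for h, w in cfg["image_grid_pinpoints"])
print(f"{len(cfg['image_grid_pinpoints'])} image_grid_pinpoints look consistent")
```
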
Create a new directory to hold the visual components, and copy the `llava.clip`/`llava.projector` files, as well as the vision config, into it.

```bash
ENCODER_PATH=/Users/alexanderjbrooks/workspace/develop/llama.cpp/examples/llava/visual_encoder
mkdir $ENCODER_PATH

cp $GRANITE_MODEL/llava.clip $ENCODER_PATH/pytorch_model.bin
cp $GRANITE_MODEL/llava.projector $ENCODER_PATH/
cp $VISION_CONFIG $ENCODER_PATH/config.json
```

At which point you should have something like this:
```bash
(venv) alexanderjbrooks@Alexanders-MacBook-Pro llava % ls $ENCODER_PATH
config.json             llava.projector         pytorch_model.bin
```

Now convert the components to GGUF. Note that we also override the image mean/std dev to `[0.5, 0.5, 0.5]` since we use the SigLIP visual encoder; in the `transformers` model, you can find these values in the [preprocessor_config.json](https://huggingface.co/ibm-granite/granite-vision-3.1-2b-preview/blob/main/preprocessor_config.json).
```bash
python convert_image_encoder_to_gguf.py \
    -m $ENCODER_PATH \
    --llava-projector $ENCODER_PATH/llava.projector \
    --output-dir $ENCODER_PATH \
    --clip-model-is-vision \
    --clip-model-is-siglip \
    --image-mean 0.5 0.5 0.5 --image-std 0.5 0.5 0.5
```

This will create the first GGUF file at `$ENCODER_PATH/mmproj-model-f16.gguf`; we will refer to the absolute path of this file as `$VISUAL_GGUF_PATH`.

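For convenience in the later steps, you can capture that path in an environment variable now (assuming the output location above):

```bash
export VISUAL_GGUF_PATH=$ENCODER_PATH/mmproj-model-f16.gguf
```
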
### 3. Creating the LLM GGUF
The granite vision model contains a granite LLM as its language model. For now, the easiest way to get the GGUF for the LLM is by loading the composite model in `transformers` and exporting the LLM so that it can be directly converted with the normal conversion path.

First, set `LLM_EXPORT_PATH` to the directory to export the `transformers` LLM to.
```bash
export LLM_EXPORT_PATH=/Users/alexanderjbrooks/workspace/develop/llama.cpp/examples/llava/granite_vision_llm
```

```python
import os
import transformers

MODEL_PATH = os.getenv("GRANITE_MODEL")
if not MODEL_PATH:
    raise ValueError("env var GRANITE_MODEL is unset!")

LLM_EXPORT_PATH = os.getenv("LLM_EXPORT_PATH")
if not LLM_EXPORT_PATH:
    raise ValueError("env var LLM_EXPORT_PATH is unset!")

tokenizer = transformers.AutoTokenizer.from_pretrained(MODEL_PATH)

# NOTE: granite vision support was added to transformers very recently (4.49);
# if you get size mismatches, your version is too old.
# If you are running with an older version, set `ignore_mismatched_sizes=True`
# as shown below; the composite model won't be loaded correctly, but the LLM part
# of the model that we are exporting will be.
model = transformers.AutoModelForImageTextToText.from_pretrained(MODEL_PATH, ignore_mismatched_sizes=True)

tokenizer.save_pretrained(LLM_EXPORT_PATH)
model.language_model.save_pretrained(LLM_EXPORT_PATH)
```

Now you can convert the exported LLM to GGUF with the normal converter in the root of the llama.cpp project.
```bash
LLM_GGUF_PATH=$LLM_EXPORT_PATH/granite_llm.gguf
python convert_hf_to_gguf.py --outfile $LLM_GGUF_PATH $LLM_EXPORT_PATH
```

### 4. Running the Model in llama.cpp
Build llama.cpp normally; you should have a target binary named `llama-llava-cli`, to which you can pass the two GGUF files built above. Sample usage:

Note: the test image shown below can be found [here](https://github-production-user-asset-6210df.s3.amazonaws.com/10740300/415512792-d90d5562-8844-4f34-a0a5-77f62d5a58b5.jpg?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAVCODYLSA53PQK4ZA%2F20250221%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20250221T054145Z&X-Amz-Expires=300&X-Amz-Signature=86c60be490aa49ef7d53f25d6c973580a8273904fed11ed2453d0a38240ee40a&X-Amz-SignedHeaders=host).

```bash
./build/bin/llama-llava-cli -m $LLM_GGUF_PATH \
    --mmproj $VISUAL_GGUF_PATH \
    --image cherry_blossom.jpg \
    -c 16384 \
    -p "<|system|>\nA chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.\n<|user|>\n<image>\nWhat type of flowers are in this picture?\n<|assistant|>\n" \
    --temp 0
```

Sample response: `The flowers in the picture are cherry blossoms, which are known for their delicate pink petals and are often associated with the beauty of spring.`
