
Add Doc for Converting Granite Vision -> GGUF #12006


Merged: 2 commits into ggml-org:master on Feb 25, 2025

Conversation

alex-jw-brooks
Contributor

Adds example docs for converting a Granite Vision model, which is essentially a LLaVA Next model that uses multiple feature layers, with SigLIP as the visual encoder and a Granite language model as the LLM.
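For context, a minimal sketch of what that composition looks like from the Hugging Face side (my illustration, not part of this PR; the model id and the expected field values are assumptions based on the `LlavaNextConfig` schema):

```python
# Hedged sketch: inspect the HF config to see the llava-next-style layout
# described above. Model id and expected values are assumptions.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("ibm-granite/granite-vision-3.1-2b-preview")

print(cfg.model_type)                # expected: "llava_next"
print(cfg.vision_config.model_type)  # expected: "siglip_vision_model"
print(cfg.text_config.model_type)    # expected: "granite"
print(cfg.vision_feature_layer)      # expected: a list of several encoder layers
```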

Depends on #11794

CC @danbev

Signed-off-by: Alex-Brooks <[email protected]>

Remove trailing whitespace

Signed-off-by: Alex-Brooks <[email protected]>
danbev merged commit 4d1051a into ggml-org:master on Feb 25, 2025
2 checks passed
orca-zhang pushed a commit to orca-zhang/llama.cpp that referenced this pull request Feb 26, 2025
* Add example docs for granite vision

Signed-off-by: Alex-Brooks <[email protected]>
@samkoesnadi
Contributor

samkoesnadi commented Feb 27, 2025

@alex-jw-brooks Hi Alex, thanks for the contribution here! I just want to understand this specific model better. As I understand it, this model comes from IBM; how does it compare to Qwen 2.5 VL?

@alex-jw-brooks
Contributor Author

alex-jw-brooks commented Feb 28, 2025

Hi @samkoesnadi! Thanks for your interest 🙂 I'm unfortunately not yet super familiar with all the details of Qwen2.5 VL, but hopefully I can explain some details about our model, clarify how it relates to existing architectures, and suggest which use-cases each is suited for.

The best way to understand Granite Vision is to compare it to LLaVA Next, because they are very similar architecturally. The main differences compared to other LLaVA Next models:

  • It uses multiple feature layers from the visual encoder (see the sketch after this list)
  • The visual encoder is SigLIP instead of CLIP, which means larger tiles in anyres and more image features per tile
  • It uses a Granite LLM as the language model
  • It handles a pretty wide variety of aspect ratios, i.e., more choices of image grid pinpoints for anyres to use
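To make the multi-feature-layer point concrete, here is a rough sketch (my own illustration, not code from this PR or from llama.cpp; the checkpoint id and the layer indices are assumptions) of how several SigLIP hidden states can be concatenated into the image features a LLaVA Next style projector would consume:

```python
# Illustrative only: combine several SigLIP hidden states into one feature
# tensor, as a multi-feature-layer llava next variant would. The checkpoint
# id and layer indices below are assumptions for the sake of the example.
import torch
from transformers import SiglipVisionModel

encoder = SiglipVisionModel.from_pretrained("google/siglip-so400m-patch14-384")
pixel_values = torch.randn(1, 3, 384, 384)  # one preprocessed 384x384 tile

with torch.no_grad():
    out = encoder(pixel_values, output_hidden_states=True)

feature_layers = [-24, -20, -12, -1]  # hypothetical multi-layer selection
features = torch.cat([out.hidden_states[i] for i in feature_layers], dim=-1)

# Each layer yields (1, 729, 1152) for this checkpoint (27x27 patches), so the
# concatenated features are (1, 729, 1152 * len(feature_layers)).
print(features.shape)
```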

In terms of use-cases, Granite Vision is largely fine-tuned for document understanding tasks.

For how it compares to Qwen2.5 VL, I'd suggest reading the Granite Vision technical report alongside the Qwen2.5 VL technical report to dig into the technical differences and model performance. Our 2B model is also Apache 2.0 licensed, whereas the 3B Qwen2.5 VL model is not (although the 7B is).

@samkoesnadi
Contributor


Thank you for your clear explanation, appreciate it :)

mglambda pushed a commit to mglambda/llama.cpp that referenced this pull request Mar 8, 2025
* Add example docs for granite vision

Signed-off-by: Alex-Brooks <[email protected]>
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Mar 19, 2025
* Add example docs for granite vision

Signed-off-by: Alex-Brooks <[email protected]>
mostlyuseful pushed a commit to mostlyuseful/llama.cpp that referenced this pull request May 12, 2025
* Add example docs for granite vision

Signed-off-by: Alex-Brooks <[email protected]>