mtmd : support Qwen 2.5 Omni (input audio+vision, no audio output) #13784


Merged
17 commits merged on May 27, 2025

Conversation

@ngxson (Collaborator) commented May 25, 2025

This PR was aimed at just adding the capability to use 2 clip_ctx, one for audio and one for vision, but I ended up doing quite a bit more refactoring than I initially thought:

  • some fields are moved from clip_ctx to clip_model
  • from a single clip_model_loader, we can now create multiple clip_model and clip_ctx instances
  • libmtmd can handle 2 clip_ctx and dispatches calls according to the chunk type
  • refactor mtmd_tokenize so it can handle mixed modalities
  • nits: some functions in clip.cpp have many if..else branches; refactor them into switch (...)
  • refactor tests; they can now cover audio input
  • implement the SinusoidsPositionEmbedding used by the audio encoder --> generate it during conversion (see the sketch below)
  • fix the M-RoPE position for audio chunks
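
For reference, a minimal sketch of a Whisper-style sinusoidal position table (sin terms in the first half of each row, cos terms in the second, max timescale 10000). This only illustrates the idea behind SinusoidsPositionEmbedding; the function name, dimensions and exact layout here are assumptions, not the actual conversion code.

#include <cmath>
#include <vector>

// Illustration only: precompute a [length x channels] sinusoidal position table.
// Assumed row layout: first channels/2 entries are sin, last channels/2 are cos.
static std::vector<float> sinusoids(int length, int channels, float max_timescale = 10000.0f) {
    const int half = channels / 2; // channels assumed even and > 2
    const float log_timescale_increment = std::log(max_timescale) / (half - 1);
    std::vector<float> pe((size_t) length * channels);
    for (int pos = 0; pos < length; pos++) {
        for (int j = 0; j < half; j++) {
            const float inv_timescale = std::exp(-log_timescale_increment * j);
            const float v = (float) pos * inv_timescale;
            pe[(size_t) pos * channels + j       ] = std::sin(v);
            pe[(size_t) pos * channels + half + j] = std::cos(v);
        }
    }
    return pe;
}

Generating this table once at conversion time and storing it in the GGUF means it does not have to be recomputed at runtime in clip.cpp.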

TODO in next PRs:

  • Remove unused clip APIs
  • Move image preprocessing to mtmd-image.cpp
  • Move file reader/decoder to mtmd-helper.cpp

Why no audio output?

The simple answer is: I don't have time to implement it.

The long answer: Qwen 2.5 Omni generates audio in 2 steps:

  • Step 1: generate the mel spectrogram using a DiT (diffusion transformer). This is essentially image generation, so it kinda gives away what Qwen is cooking next 😃
  • Step 2: turn the mel spectrogram into a wav using BigVGAN

So adding audio generation effectively means adding image generation capability, which I don't really have time to do right now.

Demo

Pre-quantized models:

# Qwen2.5 Omni
# Capabilities: audio input, vision input
(tool_name) -hf ggml-org/Qwen2.5-Omni-3B-GGUF
(tool_name) -hf ggml-org/Qwen2.5-Omni-7B-GGUF

@github-actions bot added the examples and python (python script changes) labels on May 25, 2025
@oyaay commented May 26, 2025

Wow, I like this.

@ngxson ngxson marked this pull request as ready for review May 26, 2025 09:26
@ngxson ngxson requested a review from ggerganov May 26, 2025 09:26
@ngxson (Collaborator, Author) commented May 26, 2025

@ggerganov Sorry for including quite a lot of changes in one single PR.

The overall idea of this PR is to use 2 dedicated clip_ctx, one per modality. I'm also thinking about possibly supporting audio output and image output in the future, so having multiple clip_ctx can be useful to make that happen.

A given clip_model is attached to exactly one clip_ctx for now (so multiple contexts cannot share the same model). Since each modality has its own set of tensors, I don't feel it's necessary to split model & ctx the way libllama does. However, only one clip_model_loader is used when reading the model. I think it's more useful this way because the loader can tell in advance how many contexts should be created (and the model file is only opened once).
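
To make that ownership model concrete, here is a hypothetical, heavily simplified sketch; the type and function names are illustrative and are not the actual clip.cpp/libmtmd API. One loader opens the file once, reports which modalities are present, and each modality gets its own model + context pair.

#include <map>
#include <memory>
#include <string>
#include <vector>

// Hypothetical names, for illustration only -- not the real clip.cpp API.
enum class modality { vision, audio };

struct enc_model { modality mod; /* per-modality tensors ... */ };
struct enc_ctx   { std::shared_ptr<enc_model> model; /* graph, backend buffers ... */ };

struct model_loader {
    std::vector<modality> present; // known up-front, so we know how many ctxs to create

    explicit model_loader(const std::string & /*path*/) {
        // the model file is opened exactly once here; in practice the list of
        // modalities would be read from the GGUF metadata
        present = { modality::vision, modality::audio };
    }

    enc_ctx init_ctx(modality m) { // one model is attached to exactly one ctx
        auto mdl = std::make_shared<enc_model>();
        mdl->mod = m;
        return enc_ctx{ std::move(mdl) };
    }
};

int main() {
    model_loader loader("mmproj.gguf");
    std::map<modality, enc_ctx> ctxs;
    for (auto m : loader.present) {
        ctxs[m] = loader.init_ctx(m);
    }
    // an image chunk would then be encoded with ctxs[modality::vision],
    // an audio chunk with ctxs[modality::audio]
}

Since each modality owns its own tensors, this keeps the split simple without the model/context separation used in libllama.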

On the libmtmd side, I rewrote the tokenize function to handle mixed image/audio chunks (see the sketch after this list). It can also handle some edge cases (cc @mattjcly ), for example:

  • Empty input
  • Input with only one media, no text
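
A rough sketch of the chunking idea, assuming a marker-based prompt format; the marker string, chunk struct and function name are made up for illustration and are not the libmtmd API. It shows how both edge cases above fall out naturally: empty input yields no chunks, and a media-only input yields a single media chunk.

#include <string>
#include <vector>

// Illustration only -- not the actual libmtmd implementation.
enum class chunk_type { text, image, audio };

struct chunk {
    chunk_type  type;
    std::string text;          // filled for text chunks
    size_t      media_idx = 0; // index into the input media list for media chunks
};

static std::vector<chunk> tokenize_mixed(const std::string & prompt,
                                         const std::vector<chunk_type> & media_types,
                                         const std::string & marker = "<media>") {
    std::vector<chunk> out;
    size_t pos = 0, media_idx = 0;
    while (true) {
        const size_t next = prompt.find(marker, pos);
        const std::string text = prompt.substr(pos,
            next == std::string::npos ? std::string::npos : next - pos);
        if (!text.empty()) {
            out.push_back({chunk_type::text, text}); // text between two markers
        }
        if (next == std::string::npos) {
            break; // no more markers; an empty input produces no chunks at all
        }
        // each marker consumes the next media item, which may be an image or an audio clip
        out.push_back({media_types.at(media_idx), "", media_idx});
        media_idx++;
        pos = next + marker.size();
    }
    return out; // "one media, no text" yields exactly one media chunk
}

A downstream loop would then route each media chunk to the vision or audio clip_ctx according to its type and splice the resulting embeddings between the text chunks.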

Don't hesitate to ping me if something is unclear. Thanks!

@github-actions bot added the documentation (improvements or additions to documentation) label on May 26, 2025
@mega-cqz

Amazing work!!

@ngxson (Collaborator, Author) commented May 26, 2025

Test results:

[vision] OK:   llama-mtmd-cli ggml-org/SmolVLM-500M-Instruct-GGUF:Q8_0
[vision] OK:   llama-mtmd-cli ggml-org/SmolVLM2-2.2B-Instruct-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli ggml-org/SmolVLM2-500M-Video-Instruct-GGUF:Q8_0
[vision] OK:   llama-mtmd-cli ggml-org/gemma-3-4b-it-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli THUDM/glm-edge-v-5b-gguf:Q4_K_M
[vision] OK:   llama-mtmd-cli second-state/Llava-v1.5-7B-GGUF:Q2_K
[vision] OK:   llama-mtmd-cli cjpais/llava-1.6-mistral-7b-gguf:Q3_K_M
[vision] OK:   llama-mtmd-cli ibm-research/granite-vision-3.2-2b-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli second-state/MiniCPM-Llama3-V-2_5-GGUF:Q2_K
[vision] OK:   llama-mtmd-cli openbmb/MiniCPM-V-2_6-gguf:Q2_K
[vision] OK:   llama-mtmd-cli openbmb/MiniCPM-o-2_6-gguf:Q4_0
[vision] OK:   llama-mtmd-cli bartowski/Qwen2-VL-2B-Instruct-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli ggml-org/Qwen2.5-VL-3B-Instruct-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli ggml-org/InternVL2_5-1B-GGUF:Q8_0
[vision] OK:   llama-mtmd-cli ggml-org/InternVL3-1B-Instruct-GGUF:Q8_0
[vision] OK:   llama-mtmd-cli ggml-org/Qwen2.5-Omni-3B-GGUF:Q4_K_M
[audio]  OK:   llama-mtmd-cli ggml-org/ultravox-v0_5-llama-3_2-1b-GGUF:Q8_0
[audio]  OK:   llama-mtmd-cli ggml-org/Qwen2.5-Omni-3B-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli ggml-org/pixtral-12b-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli ggml-org/Mistral-Small-3.1-24B-Instruct-2503-GGUF
[vision] OK:   llama-mtmd-cli ggml-org/Qwen2-VL-2B-Instruct-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli ggml-org/Qwen2-VL-7B-Instruct-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli ggml-org/Qwen2.5-VL-3B-Instruct-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli ggml-org/Qwen2.5-VL-7B-Instruct-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli ggml-org/InternVL3-8B-Instruct-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli ggml-org/InternVL3-14B-Instruct-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli ggml-org/Qwen2.5-Omni-7B-GGUF:Q4_K_M
[audio]  OK:   llama-mtmd-cli ggml-org/ultravox-v0_5-llama-3_1-8b-GGUF:Q4_K_M
[audio]  OK:   llama-mtmd-cli ggml-org/Qwen2.5-Omni-7B-GGUF:Q4_K_M

    img->nx = hparams.warmup_image_size;
    img->ny = hparams.warmup_image_size;
} else {
    img->nx = 1024; // TODO @ngxson : use a better default
    img->nx = hparams.warmup_audio_size;
    img->ny = hparams.n_mel_bins;
}
img->buf.resize(img->nx * img->ny * 3);
Member

Is this needed for audio modalities? We have a single channel in this case, correct?

Collaborator Author

Indeed, only the image shape is needed during this warmup, so we don't actually need to allocate this buffer. I removed it in 0531096

Comment on lines 89 to 105
// M-RoPE for audio
void set_position_mrope_1d(llama_pos pos_0, int32_t n_tokens, llama_seq_id seq_id) {
    GGML_ASSERT(n_pos_per_embd == 4);
    seq_id_0[0] = seq_id;
    for (int i = 0; i < n_tokens; i++) {
        pos[i                     ] = pos_0 + i;
        pos[i + batch.n_tokens    ] = pos_0 + i;
        pos[i + batch.n_tokens * 2] = pos_0 + i;
        pos[i + batch.n_tokens * 3] = 0; // last pos dim is unused
    }
    for (int i = 0; i < batch.n_tokens; i++) {
        batch.n_seq_id[i] = 1;
        batch.seq_id  [i] = seq_id_0.data();
        batch.logits  [i] = false;
    }
}

Member

I think n_tokens and batch.n_tokens refer to the same thing here. If so, we should simplify by using only batch.n_tokens and removing the n_tokens argument, or vice versa.

Collaborator Author

Yes, thanks for noticing. It's leftover code from the 2D version; I removed it in 27a8f26

@henfiber

Just FYI: with the Vulkan backend (AMD RADV RENOIR) and the Q8_0 mmproj file, the server crashes with:

Floating point exception (core dumped)

Details
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Graphics (RADV RENOIR) (radv) | uma: 1 | fp16: 1 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none
build: 5510 (a8ea03d8) with cc (GCC) 14.2.1 20250110 (Red Hat 14.2.1-7) for x86_64-redhat-linux
system info: n_threads = 5, n_threads_batch = 6, total_threads = 12

system_info: n_threads = 5 (n_threads_batch = 6) / 12 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |

clip_model_loader: model name:
clip_model_loader: description:
clip_model_loader: GGUF version: 3
clip_model_loader: alignment: 32
clip_model_loader: n_tensors: 1008
clip_model_loader: n_kv: 32

clip_model_loader: has vision encoder
clip_model_loader: has audio encoder
clip_ctx: CLIP using Vulkan0 backend
load_hparams: projector: qwen2.5o
load_hparams: n_embd: 1280
load_hparams: n_head: 16
load_hparams: n_ff: 1280
load_hparams: n_layer: 32
load_hparams: ffn_op: silu
load_hparams: projection_dim: 3584

--- vision hparams ---
load_hparams: image_size: 1024
load_hparams: patch_size: 14
load_hparams: has_llava_proj: 0
load_hparams: minicpmv_version: 0
load_hparams: proj_scale_factor: 0
load_hparams: n_wa_pattern: 8

load_hparams: model size: 1476.70 MiB
load_hparams: metadata size: 0.35 MiB
alloc_compute_meta: Vulkan0 compute buffer size = 2.77 MiB
alloc_compute_meta: CPU compute buffer size = 0.16 MiB
clip_ctx: CLIP using Vulkan0 backend
load_hparams: projector: qwen2.5o
load_hparams: n_embd: 1280
load_hparams: n_head: 20
load_hparams: n_ff: 5120
load_hparams: n_layer: 32
load_hparams: ffn_op: gelu_erf
load_hparams: projection_dim: 3584

--- audio hparams ---
load_hparams: n_mel_bins: 128
load_hparams: proj_stack_factor: 0

load_hparams: model size: 1476.70 MiB
load_hparams: metadata size: 0.35 MiB
Floating point exception (core dumped)

(mmproj-Qwen2.5-Omni-7B-Q8_0.gguf)

The error is avoided with --no-mmproj-offload.

The FP16 mmproj file works properly.

Maybe there is an issue with the Q8_0 file or with the way the Vulkan backend handles the mixed-precision format. Other Q8_0 mmproj files for other models (e.g. mmproj-InternVL3-2B-Instruct-Q8_0.gguf) work properly (although that is a vision-only model).

(I did not open an issue in case the problem is with the file rather than the backend.)

@pwilkin (Contributor) commented May 27, 2025

@ngxson Just out of curiosity, how much work would it be to implement diffusion support? It seems like more and more models are coming out with "all-to-all" capabilities (like Bagel); it would probably be nice for image generation to appear on the roadmap at some point...

@nqchieutb01

@ngxson
I've just fine-tuned the Qwen2.5-Omni-7B model using LoRA, and then merged the adapters into the base model. Now, I’d like to convert this merged model to the GGUF format to run it with llama.cpp.
Could you please guide me on how to do this?

Thanks in advance!
