mtmd : support Qwen 2.5 Omni (input audio+vision, no audio output) #13784


Merged
17 commits merged on May 27, 2025

Conversation

@ngxson (Collaborator) commented May 25, 2025

This PR was aimed at just adding the capability to use 2 clip_ctx, one for audio and one for vision, but I ended up doing quite a bit more refactoring than I initially thought:

  • some fields are moved from clip_ctx to clip_model
  • from a single clip_model_loader, we can now create multiple clip_model and clip_ctx instances
  • libmtmd can handle 2 clip_ctx and dispatches calls according to the chunk type
  • refactor mtmd_tokenize so it can handle mixed modalities
  • nits: some functions in clip.cpp have many if..else branches; refactor them into switch (...)
  • refactor tests; they can now cover audio input
  • implement the SinusoidsPositionEmbedding used by the audio encoder --> generate it during conversion (see the sketch below)
  • fix the M-RoPE position for audio chunks
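
For reference, a minimal sketch of a Whisper-style sinusoidal position table (sin terms in the first half of each row, cos terms in the second, max timescale 10000). This only illustrates the idea behind SinusoidsPositionEmbedding; the function name, dimensions and exact layout here are assumptions, not the actual conversion code.

#include <cmath>
#include <vector>

// Illustration only: precompute a [length x channels] sinusoidal position table.
// Assumed row layout: first channels/2 entries are sin, last channels/2 are cos.
static std::vector<float> sinusoids(int length, int channels, float max_timescale = 10000.0f) {
    const int half = channels / 2; // channels assumed even and > 2
    const float log_timescale_increment = std::log(max_timescale) / (half - 1);
    std::vector<float> pe((size_t) length * channels);
    for (int pos = 0; pos < length; pos++) {
        for (int j = 0; j < half; j++) {
            const float inv_timescale = std::exp(-log_timescale_increment * j);
            const float v = (float) pos * inv_timescale;
            pe[(size_t) pos * channels + j       ] = std::sin(v);
            pe[(size_t) pos * channels + half + j] = std::cos(v);
        }
    }
    return pe;
}

Generating this table once at conversion time and storing it in the GGUF means it does not have to be recomputed at runtime in clip.cpp.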

TODO in next PRs:

  • Remove unused clip APIs
  • Move image preprocessing to mtmd-image.cpp
  • Move file reader/decoder to mtmd-helper.cpp

Why no audio output?

The simple answer is: I don't have time to implement it.

The long answer: Qwen 2.5 Omni generates audio in 2 steps:

  • Step 1: generate the mel spectrogram using a DiT (diffusion transformer). This is essentially image generation, so it kinda gives away what Qwen is cooking next 😃
  • Step 2: turn the mel spectrogram into a wav using BigVGAN

So adding audio generation effectively means adding image generation capability, which I don't really have time to do right now.

Demo

Pre-quantized models:

# Qwen2.5 Omni
# Capabilities: audio input, vision input
(tool_name) -hf ggml-org/Qwen2.5-Omni-3B-GGUF
(tool_name) -hf ggml-org/Qwen2.5-Omni-7B-GGUF

@github-actions bot added the examples and python (python script changes) labels on May 25, 2025
@oyaay commented May 26, 2025

Wow, I like this.

@ngxson ngxson marked this pull request as ready for review May 26, 2025 09:26
@ngxson ngxson requested a review from ggerganov May 26, 2025 09:26
@ngxson (Collaborator, Author) commented May 26, 2025

@ggerganov Sorry for including quite a lot of changes in one single PR.

The overall idea of this PR is to use 2 dedicated clip_ctx, one per modality. I'm also thinking about possibly supporting audio output and image output in the future, so having multiple clip_ctx can be useful to make that happen.

A given clip_model is attached to exactly one clip_ctx for now (so multiple contexts cannot share the same model). Since each modality has its own set of tensors, I don't feel it's necessary to split model & ctx the way libllama does. However, only one clip_model_loader is used when reading the model. I think it's more useful this way because the loader can tell in advance how many contexts should be created (and the model file is only opened once).
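
To make that ownership model concrete, here is a hypothetical, heavily simplified sketch; the type and function names are illustrative and are not the actual clip.cpp/libmtmd API. One loader opens the file once, reports which modalities are present, and each modality gets its own model + context pair.

#include <map>
#include <memory>
#include <string>
#include <vector>

// Hypothetical names, for illustration only -- not the real clip.cpp API.
enum class modality { vision, audio };

struct enc_model { modality mod; /* per-modality tensors ... */ };
struct enc_ctx   { std::shared_ptr<enc_model> model; /* graph, backend buffers ... */ };

struct model_loader {
    std::vector<modality> present; // known up-front, so we know how many ctxs to create

    explicit model_loader(const std::string & /*path*/) {
        // the model file is opened exactly once here; in practice the list of
        // modalities would be read from the GGUF metadata
        present = { modality::vision, modality::audio };
    }

    enc_ctx init_ctx(modality m) { // one model is attached to exactly one ctx
        auto mdl = std::make_shared<enc_model>();
        mdl->mod = m;
        return enc_ctx{ std::move(mdl) };
    }
};

int main() {
    model_loader loader("mmproj.gguf");
    std::map<modality, enc_ctx> ctxs;
    for (auto m : loader.present) {
        ctxs[m] = loader.init_ctx(m);
    }
    // an image chunk would then be encoded with ctxs[modality::vision],
    // an audio chunk with ctxs[modality::audio]
}

Since each modality owns its own tensors, this keeps the split simple without the model/context separation used in libllama.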

On the libmtmd side, I rewrote the tokenize function to handle mixed image/audio chunks (see the sketch after this list). It can also handle some edge cases (cc @mattjcly ), for example:

  • Empty input
  • Input with only one media, no text
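
A rough sketch of the chunking idea, assuming a marker-based prompt format; the marker string, chunk struct and function name are made up for illustration and are not the libmtmd API. It shows how both edge cases above fall out naturally: empty input yields no chunks, and a media-only input yields a single media chunk.

#include <string>
#include <vector>

// Illustration only -- not the actual libmtmd implementation.
enum class chunk_type { text, image, audio };

struct chunk {
    chunk_type  type;
    std::string text;          // filled for text chunks
    size_t      media_idx = 0; // index into the input media list for media chunks
};

static std::vector<chunk> tokenize_mixed(const std::string & prompt,
                                         const std::vector<chunk_type> & media_types,
                                         const std::string & marker = "<media>") {
    std::vector<chunk> out;
    size_t pos = 0, media_idx = 0;
    while (true) {
        const size_t next = prompt.find(marker, pos);
        const std::string text = prompt.substr(pos,
            next == std::string::npos ? std::string::npos : next - pos);
        if (!text.empty()) {
            out.push_back({chunk_type::text, text}); // text between two markers
        }
        if (next == std::string::npos) {
            break; // no more markers; an empty input produces no chunks at all
        }
        // each marker consumes the next media item, which may be an image or an audio clip
        out.push_back({media_types.at(media_idx), "", media_idx});
        media_idx++;
        pos = next + marker.size();
    }
    return out; // "one media, no text" yields exactly one media chunk
}

A downstream loop would then route each media chunk to the vision or audio clip_ctx according to its type and splice the resulting embeddings between the text chunks.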

Don't hesitate to ping me if something is unclear. Thanks!

@github-actions bot added the documentation (improvements or additions to documentation) label on May 26, 2025
@mega-cqz

Amazing work!!

@ngxson (Collaborator, Author) commented May 26, 2025

Test results:

[vision] OK:   llama-mtmd-cli ggml-org/SmolVLM-500M-Instruct-GGUF:Q8_0
[vision] OK:   llama-mtmd-cli ggml-org/SmolVLM2-2.2B-Instruct-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli ggml-org/SmolVLM2-500M-Video-Instruct-GGUF:Q8_0
[vision] OK:   llama-mtmd-cli ggml-org/gemma-3-4b-it-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli THUDM/glm-edge-v-5b-gguf:Q4_K_M
[vision] OK:   llama-mtmd-cli second-state/Llava-v1.5-7B-GGUF:Q2_K
[vision] OK:   llama-mtmd-cli cjpais/llava-1.6-mistral-7b-gguf:Q3_K_M
[vision] OK:   llama-mtmd-cli ibm-research/granite-vision-3.2-2b-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli second-state/MiniCPM-Llama3-V-2_5-GGUF:Q2_K
[vision] OK:   llama-mtmd-cli openbmb/MiniCPM-V-2_6-gguf:Q2_K
[vision] OK:   llama-mtmd-cli openbmb/MiniCPM-o-2_6-gguf:Q4_0
[vision] OK:   llama-mtmd-cli bartowski/Qwen2-VL-2B-Instruct-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli ggml-org/Qwen2.5-VL-3B-Instruct-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli ggml-org/InternVL2_5-1B-GGUF:Q8_0
[vision] OK:   llama-mtmd-cli ggml-org/InternVL3-1B-Instruct-GGUF:Q8_0
[vision] OK:   llama-mtmd-cli ggml-org/Qwen2.5-Omni-3B-GGUF:Q4_K_M
[audio]  OK:   llama-mtmd-cli ggml-org/ultravox-v0_5-llama-3_2-1b-GGUF:Q8_0
[audio]  OK:   llama-mtmd-cli ggml-org/Qwen2.5-Omni-3B-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli ggml-org/pixtral-12b-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli ggml-org/Mistral-Small-3.1-24B-Instruct-2503-GGUF
[vision] OK:   llama-mtmd-cli ggml-org/Qwen2-VL-2B-Instruct-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli ggml-org/Qwen2-VL-7B-Instruct-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli ggml-org/Qwen2.5-VL-3B-Instruct-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli ggml-org/Qwen2.5-VL-7B-Instruct-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli ggml-org/InternVL3-8B-Instruct-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli ggml-org/InternVL3-14B-Instruct-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli ggml-org/Qwen2.5-Omni-7B-GGUF:Q4_K_M
[audio]  OK:   llama-mtmd-cli ggml-org/ultravox-v0_5-llama-3_1-8b-GGUF:Q4_K_M
[audio]  OK:   llama-mtmd-cli ggml-org/Qwen2.5-Omni-7B-GGUF:Q4_K_M

    img->nx = hparams.warmup_image_size;
    img->ny = hparams.warmup_image_size;
} else {
    img->nx = 1024; // TODO @ngxson : use a better default
    img->nx = hparams.warmup_audio_size;
    img->ny = hparams.n_mel_bins;
}
img->buf.resize(img->nx * img->ny * 3);
Member

Is this needed for audio modalities? We have a single channel in this case, correct?

Collaborator Author

Indeed, only the image shape is needed during this warmup, so we don't actually need to allocate this buffer. I removed it in 0531096

Comment on lines 89 to 105
// M-RoPE for audio
void set_position_mrope_1d(llama_pos pos_0, int32_t n_tokens, llama_seq_id seq_id) {
    GGML_ASSERT(n_pos_per_embd == 4);
    seq_id_0[0] = seq_id;
    for (int i = 0; i < n_tokens; i++) {
        pos[i                     ] = pos_0 + i;
        pos[i + batch.n_tokens    ] = pos_0 + i;
        pos[i + batch.n_tokens * 2] = pos_0 + i;
        pos[i + batch.n_tokens * 3] = 0; // last pos dim is unused
    }
    for (int i = 0; i < batch.n_tokens; i++) {
        batch.n_seq_id[i] = 1;
        batch.seq_id  [i] = seq_id_0.data();
        batch.logits  [i] = false;
    }
}

Member

I think n_tokens and batch.n_tokens refer to the same thing here. If so, we should simplify by using only batch.n_tokens and removing the n_tokens argument, or vice versa.

Collaborator Author

Yes, thanks for noticing. It's leftover code from the 2D version; I removed it in 27a8f26

@henfiber

Just FYI: with the Vulkan backend (AMD RADV RENOIR) and the Q8_0 mmproj file, the server crashes with:

Floating point exception (core dumped)

Details
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Graphics (RADV RENOIR) (radv) | uma: 1 | fp16: 1 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none
build: 5510 (a8ea03d8) with cc (GCC) 14.2.1 20250110 (Red Hat 14.2.1-7) for x86_64-redhat-linux
system info: n_threads = 5, n_threads_batch = 6, total_threads = 12

system_info: n_threads = 5 (n_threads_batch = 6) / 12 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |

clip_model_loader: model name:
clip_model_loader: description:
clip_model_loader: GGUF version: 3
clip_model_loader: alignment: 32
clip_model_loader: n_tensors: 1008
clip_model_loader: n_kv: 32

clip_model_loader: has vision encoder
clip_model_loader: has audio encoder
clip_ctx: CLIP using Vulkan0 backend
load_hparams: projector: qwen2.5o
load_hparams: n_embd: 1280
load_hparams: n_head: 16
load_hparams: n_ff: 1280
load_hparams: n_layer: 32
load_hparams: ffn_op: silu
load_hparams: projection_dim: 3584

--- vision hparams ---
load_hparams: image_size: 1024
load_hparams: patch_size: 14
load_hparams: has_llava_proj: 0
load_hparams: minicpmv_version: 0
load_hparams: proj_scale_factor: 0
load_hparams: n_wa_pattern: 8

load_hparams: model size: 1476.70 MiB
load_hparams: metadata size: 0.35 MiB
alloc_compute_meta: Vulkan0 compute buffer size = 2.77 MiB
alloc_compute_meta: CPU compute buffer size = 0.16 MiB
clip_ctx: CLIP using Vulkan0 backend
load_hparams: projector: qwen2.5o
load_hparams: n_embd: 1280
load_hparams: n_head: 20
load_hparams: n_ff: 5120
load_hparams: n_layer: 32
load_hparams: ffn_op: gelu_erf
load_hparams: projection_dim: 3584

--- audio hparams ---
load_hparams: n_mel_bins: 128
load_hparams: proj_stack_factor: 0

load_hparams: model size: 1476.70 MiB
load_hparams: metadata size: 0.35 MiB
Floating point exception (core dumped)

(mmproj-Qwen2.5-Omni-7B-Q8_0.gguf)

The error is avoided with --no-mmproj-offload.

The FP16 mmproj file works properly.

Maybe there is an issue with the Q8_0 file or with the way the Vulkan backend handles the mixed-precision format. Other Q8_0 mmproj files for other models (e.g. mmproj-InternVL3-2B-Instruct-Q8_0.gguf) work properly (although that is a vision-only model).

(I did not open an issue in case the problem is with the file rather than the backend.)

@pwilkin (Contributor) commented May 27, 2025

@ngxson Just out of curiosity, how much work would it be to implement diffusion support? It seems like more and more models are coming out with "all-to-all" capabilities (like Bagel); it would probably be nice for image generation to appear on the roadmap at some point...

@nqchieutb01

@ngxson
I've just fine-tuned the Qwen2.5-Omni-7B model using LoRA, and then merged the adapters into the base model. Now, I’d like to convert this merged model to the GGUF format to run it with llama.cpp.
Could you please guide me on how to do this?

Thanks in advance!
