mtmd : support Qwen 2.5 Omni (input audio+vision, no audio output) #13784
Conversation
Wow, I like this!
@ggerganov Sorry for including quite a lot of changes in one single PR. The global idea of this PR is to use 2 dedicated `clip_ctx`, one for audio and one for vision.
Don't hesitate to ping if something is not clear to you. Thanks!
Amazing work!!

Test results:
tools/mtmd/clip.cpp (outdated)

```diff
     img->nx = hparams.warmup_image_size;
     img->ny = hparams.warmup_image_size;
 } else {
-    img->nx = 1024; // TODO @ngxson : use a better default
+    img->nx = hparams.warmup_audio_size;
     img->ny = hparams.n_mel_bins;
 }
 img->buf.resize(img->nx * img->ny * 3);
```
Is this needed for audio modalities? We have a single channel in this case, correct?
Indeed, only the image shape is needed during this warmup, so we don't actually need to allocate this buffer. I removed it in 0531096
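For reference, a minimal sketch of the resolved warmup path under those observations; `is_vision` is a hypothetical stand-in for however the code checks the model's modality, while the `hparams` names come from the diff above (the actual change is in 0531096):

```cpp
// Warmup only needs the dummy input's shape to size the compute graph;
// no pixel/mel data is ever read, so the buffer allocation (and the
// 3-channel assumption, wrong for single-channel audio) can be dropped.
if (is_vision) { // hypothetical flag; the real code checks the model's modality
    img->nx = hparams.warmup_image_size;
    img->ny = hparams.warmup_image_size;
} else {
    img->nx = hparams.warmup_audio_size;
    img->ny = hparams.n_mel_bins; // mel spectrogram: time steps x mel bins
}
// note: no img->buf.resize(...) here, since warmup never reads the data
```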
tools/mtmd/mtmd-helper.cpp (outdated)
```cpp
// M-RoPE for audio
void set_position_mrope_1d(llama_pos pos_0, int32_t n_tokens, llama_seq_id seq_id) {
    GGML_ASSERT(n_pos_per_embd == 4);
    seq_id_0[0] = seq_id;
    for (int i = 0; i < n_tokens; i++) {
        pos[i                     ] = pos_0 + i;
        pos[i + batch.n_tokens    ] = pos_0 + i;
        pos[i + batch.n_tokens * 2] = pos_0 + i;
        pos[i + batch.n_tokens * 3] = 0; // last pos dim is unused
    }
    for (int i = 0; i < batch.n_tokens; i++) {
        batch.n_seq_id[i] = 1;
        batch.seq_id [i] = seq_id_0.data();
        batch.logits [i] = false;
    }
}
```
I think here `n_tokens` and `batch.n_tokens` refer to the same thing. If so, we should simplify by using, for example, only `batch.n_tokens` and removing the `n_tokens` argument. Or vice versa.
Yes, thanks for noticing. It's leftover code from the 2D version; I removed it in 27a8f26
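For context, this is roughly what the simplified helper looks like once the redundant parameter is gone; a sketch derived from the snippet above (same member context, same fields), not a verbatim copy of 27a8f26:

```cpp
// M-RoPE for audio: embeddings lie on a single time axis, so the three
// used position dimensions carry the same linear value for every token.
void set_position_mrope_1d(llama_pos pos_0, llama_seq_id seq_id) {
    GGML_ASSERT(n_pos_per_embd == 4);
    seq_id_0[0] = seq_id;
    for (int i = 0; i < batch.n_tokens; i++) {
        pos[i                     ] = pos_0 + i;
        pos[i + batch.n_tokens    ] = pos_0 + i;
        pos[i + batch.n_tokens * 2] = pos_0 + i;
        pos[i + batch.n_tokens * 3] = 0; // last pos dim is unused
    }
    for (int i = 0; i < batch.n_tokens; i++) {
        batch.n_seq_id[i] = 1;
        batch.seq_id [i] = seq_id_0.data();
        batch.logits [i] = false;
    }
}
```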
Just FYI, with the Vulkan backend (AMD RADV RENOIR) and the Q8_0 mmproj file, the server crashes with:

```
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Graphics (RADV RENOIR) (radv) | uma: 1 | fp16: 1 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none
build: 5510 (a8ea03d8) with cc (GCC) 14.2.1 20250110 (Red Hat 14.2.1-7) for x86_64-redhat-linux
system info: n_threads = 5, n_threads_batch = 6, total_threads = 12
(mmproj-Qwen2.5-Omni-7B-Q8_0.gguf)
```

The error is avoided with […]. The FP16 mmproj file works properly. Maybe there is an issue with the Q8_0 file or with the way the Vulkan backend handles the mixed-precision format. Other Q8_0 mmproj files for other models (e.g. mmproj-InternVL3-2B-Instruct-Q8_0.gguf) work properly, although that is a vision-only model. (Did not open an issue in case the problem is with the file and not the backend.)
@ngxson Just out of curiosity, how much work would it be to implement diffusion support? It seems like more and more models are coming out with "all-to-all" capabilities (like Bagel), so it would probably be nice for image generation to appear on the roadmap at some point...
@ngxson Thanks in advance!
This PR aimed to just add the capability to use 2 `clip_ctx`, one for audio and one for vision. But I ended up doing quite a bit more refactoring than I initially thought:

- split `clip_ctx` into `clip_model` and `clip_ctx`; thanks to `clip_model_loader`, we are now able to create multiple `clip_model` and `clip_ctx`
- `libmtmd` can handle 2 `clip_ctx` and switch the calls according to the chunk type (a sketch follows below)
- rework `mtmd_tokenize` so it can handle mixed modality
- `clip.cpp` has many `if..else` branches; we can refactor them to `switch (...)`
- `SinusoidsPositionEmbedding` used by the audio encoder --> generate it during conversion

TODO in next PRs:

- `mtmd-image.cpp`
- `mtmd-helper.cpp`
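To make the two-`clip_ctx` idea concrete, here is a hedged sketch of the dispatch; the struct and the field names `ctx_v`/`ctx_a` are illustrative assumptions rather than the PR's literal code, while the `MTMD_INPUT_CHUNK_TYPE_*` values come from the public `mtmd.h` API:

```cpp
#include "clip.h" // clip_ctx
#include "mtmd.h" // mtmd_input_chunk_type

// Hypothetical sketch: libmtmd keeps one encoder context per modality
// and picks the right one based on the chunk type from mtmd_tokenize().
struct mtmd_context_sketch {
    clip_ctx * ctx_v; // vision encoder context
    clip_ctx * ctx_a; // audio encoder context

    clip_ctx * ctx_for(mtmd_input_chunk_type t) const {
        switch (t) {
            case MTMD_INPUT_CHUNK_TYPE_IMAGE: return ctx_v;
            case MTMD_INPUT_CHUNK_TYPE_AUDIO: return ctx_a;
            default:                          return nullptr; // text chunks need no encoder
        }
    }
};
```

This also illustrates the `switch (...)` style mentioned in the list above: one exhaustive dispatch point instead of scattered `if..else` modality checks.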
Why no audio output?
The simple answer is: I don't have time to implement it.
The long answer: Qwen 2.5 Omni generates audio using 2 steps: the "talker" first generates discrete audio codes, then a diffusion (DiT-based) model converts those codes into a waveform. So adding audio generation is indeed adding image generation capability, which I don't really have time to do right now.
Demo
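A hedged example of trying the result; the flags are the usual mtmd CLI options, and the file names are placeholders, not the exact artifacts linked below:

```sh
# example invocation (placeholder file names); image and audio inputs are
# tokenized per modality and routed to the matching encoder
llama-mtmd-cli -m Qwen2.5-Omni-7B-Q4_K_M.gguf \
    --mmproj mmproj-Qwen2.5-Omni-7B-f16.gguf \
    --image demo.jpg --audio demo.wav \
    -p "Describe what you see and hear."
```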
Pre-quantized models: