
mtmd : add C public API #13184


Merged
merged 18 commits into from
May 4, 2025

Conversation

ngxson
Collaborator

@ngxson ngxson commented Apr 29, 2025

Fix #13124

/**
 * libmtmd: A library for multimodal support in llama.cpp.
 *
 * WARNING: This API is experimental and subject to many BREAKING CHANGES.
 *          Issues related to API usage may receive lower priority support.
 *
 * For the usage, see an example in mtmd-cli.cpp
 */

The overall idea of this PR is to make a C-only wrapper around the C++ types. Think of it as a manually transpiled version of libmtmd from C++ to C.

The design of this PR is as follows:

  • All structs containing C++ types are converted to opaque pointers
  • Opaque (private) types need setter/getter functions to interact with them
  • Convenient C++ wrappers are added to avoid manual free() calls in C++ code; they are grouped into a namespace

For example:

// old code
mtmd_input_chunks chunks = ...;
for (auto & chunk : chunks) {
    mtmd_input_chunk_type type = chunk.type;
    size_t n_tokens = chunk.tokens_text.size();
    ...
}

// new code
mtmd_input_chunks * chunks = mtmd_input_chunks_init();
int32_t res_code = mtmd_tokenize(..., chunks, ...);
size_t n_chunks = mtmd_input_chunks_size(chunks);
for (size_t i = 0; i < n_chunks; i++) {
    mtmd_input_chunk * chunk = mtmd_input_chunks_get(chunks, i);
    mtmd_input_chunk_type type = mtmd_input_chunk_get_type(chunk);
    size_t n_tokens;
    mtmd_input_chunk_get_tokens_text(chunk, &n_tokens);
    ...
}
mtmd_input_chunks_free(chunks);

// or, with c++ wrapper
mtmd::input_chunks chunks;
int32_t res_code = mtmd_tokenize(..., chunks.ptr.get(), ...);
size_t n_chunks = mtmd_input_chunks_size(chunks);
for (size_t i = 0; i < n_chunks; i++) {
    // (same as above)
    ...
}

@ngxson ngxson requested review from ggerganov and slaren April 29, 2025 14:29
@ngxson ngxson changed the title from "mtmd : add C-only public API" to "mtmd : add C public API" Apr 29, 2025
@github-actions github-actions bot added the "testing" (Everything test related) label Apr 29, 2025
@ggerganov
Member

Wouldn't having to maintain a C API for libmtmd make the development harder without benefits?

My understanding is that libmtmd allows us to prototype the multi-modality functionality in parallel to the changes that are needed in libllama to support this. And eventually, the multi-modality should be supported directly from libllama in order to reuse the existing infra for model and context management. If we provide a libmtmd C API now, we would have to maintain it and deal with a lot of breaking changes in the future. So I don't think I see a reason to add the C API now. Maybe I am missing something?

@ngxson
Collaborator Author

ngxson commented May 2, 2025

The key benefit I was thinking about is to allow users to use libmtmd in their downstream projects. This will let us gather more feedback about the internal design and the API.

Also, this is quite necessary: as more small vision models become available, people want to use them in their mobile applications. While they can wait for proper support to come to libllama, it's not guaranteed to come in the next 1-2 months. And when it does come, my vision is to provide a simple way to convert most of the API from libmtmd to libllama.

And finally, that's also why the C API is needed in libmtmd: it's part of the experiment of how we design a C API that deals with multimodal input.

Re. your point about breaking changes, this is a valid concern, but I think I'm following the trajectory of libllama in its early days. IIRC we didn't have a stable API until a certain version because things were changing quite fast. libmtmd is in early development, and I think breaking changes are expected in either the C or the C++ API. To make it clear, I will state in the header file that libmtmd is experimental and that breaking changes are expected.

Member

@ggerganov ggerganov left a comment

I still think it's early to expose an API and invite third-party projects to interface with it. But if you want to give this a try, that's OK. I like the momentum that you've created with the implementation so far. Just take some steps to inform developers that this will be very unstable and that issues will be handled with low priority.

IMO the most important feedback that we will get is from integrating libmtmd in llama-server and figuring out how to support all the necessary features.

It's difficult for me to provide a comprehensive review at this point. The C APIs are hard to get right and usually I take an approach to implement some examples that exercise them in order to understand what is needed.

Unless there are some additional concerns, I think we can merge this. @slaren Curious if you have any thoughts too.

// you can move the chunk ownership to your own code
// this will release the chunk from the list of input chunks
// remember to free the chunk when you are done with it
MTMD_API mtmd_input_chunk * mtmd_input_chunk_release(mtmd_input_chunk * chunk);
Member

I am not sure about the usage of this function, but maybe a better name would be mtmd_input_chunk_take. But generally it seems like something that should not be needed.

Collaborator Author

@ngxson ngxson May 2, 2025

Indeed, this API is inspired by unique_ptr::release(), which transfers the ownership of a chunk from its container to user code.

For the usage, it's actually related to this discussion on llama-server. Basically, the idea is to have a mapping std::map<llama_pos, mtmd_input_chunk *> to map each image chunk to the correct position in the llama_tokens array. And since I've been playing around with the same idea in wllama, I'm pretty sure this is what we want to have.

Member

Since it is making a copy rather than really releasing it, would mtmd_input_chunk_dup or copy be more accurate?

Member

Ignore that, I see that it is moved and only a small struct is allocated. Wouldn't this leave the mtmd_input_chunks that used to own this chunk in a bad state?

Collaborator Author

@ngxson ngxson May 2, 2025

Yes, it will leave that chunk's position in mtmd_input_chunks in an invalid state (actually not invalid, but it will become a text chunk with 0 tokens).

Your idea of mtmd_input_chunk_dup sounds better though. I'll implement it. I think the cost of copying some images is negligible for now, as we have not yet designed this API to accept video input (which is essentially a sequence of images). If more models support video input in the future, we can introduce another API specifically for tokenizing video.

Collaborator Author

@ngxson ngxson May 2, 2025

I implemented this in 6bc7a30

Edit: some structs may also need clone() function, I added it here: 4d842eb

@ngxson
Collaborator Author

ngxson commented May 2, 2025

Thanks for the feedback. I'll add some comments stating that libmtmd is under active development and that breaking changes are expected.

At this stage, I think not many developers are even aware of the existence of this library, so I assume we won't get many reports. If we start to get more reported issues about the usage of libmtmd in downstream projects, I'll add a dedicated issue template to inform them that such issues will be low priority.

And yes, I would love to hear your thoughts on this proposal @slaren

Comment on lines 203 to 221
MTMD_API int32_t mtmd_helper_eval_chunks(mtmd_context * ctx,
struct llama_context * lctx,
mtmd_input_chunks * chunks,
llama_pos n_past,
llama_seq_id seq_id,
int32_t n_batch,
bool logits_last,
llama_pos * new_n_past);

// works like mtmd_helper_eval_chunks(), but only for a single chunk
// this function is NOT thread-safe
MTMD_API int32_t mtmd_helper_eval_chunk_single(mtmd_context * ctx,
struct llama_context * lctx,
mtmd_input_chunk * chunk,
llama_pos n_past,
llama_seq_id seq_id,
int32_t n_batch,
bool logits_last,
llama_pos * new_n_past);
Member

const chunks?

Collaborator Author

Yup, thanks, I added const in various places. Places without const are:

  • *_init()
  • setters (e.g. mtmd_bitmap_set_id)
  • *_free()
  • mtmd_input_chunk_copy --> roughly equivalent to _init()
  • mtmd_tokenize --> because it takes mtmd_input_chunks * as output and modifies it

I'm merging this PR once the CI is green

@ngxson
Collaborator Author

ngxson commented May 4, 2025

For visibility, I added this comment, which should be enough to communicate the state of libmtmd:

/**
 * libmtmd: A library for multimodal support in llama.cpp.
 *
 * WARNING: This API is experimental and subject to many BREAKING CHANGES.
 *          Issues related to API usage may receive lower priority support.
 *
 * For the usage, see an example in mtmd-cli.cpp
 */

@ngxson
Collaborator Author

ngxson commented May 4, 2025

I implemented this API in #12898 and it works well; token caching also works correctly. Should be good to merge.

@ngxson ngxson merged commit 27aa259 into ggml-org:master May 4, 2025
45 checks passed
gabe-l-hart added a commit to gabe-l-hart/llama.cpp that referenced this pull request May 6, 2025
* origin/master: (27 commits)
llama : fix build_ffn without gate (ggml-org#13336)
CUDA: fix bad asserts for partial offload (ggml-org#13337)
convert : qwen2/3moe : set yarn metadata if present (ggml-org#13331)
CUDA: fix --split-mode row for MMQ (ggml-org#13323)
gguf-py : avoid requiring pyside6 for other scripts (ggml-org#13036)
CUDA: fix logic for clearing padding with -ngl 0 (ggml-org#13320)
sampling : Integrate Top-nσ into main sampling chain (and add it to the server) (ggml-org#13264)
server : Webui - change setText command from parent window to also send the message. (ggml-org#13309)
mtmd : rename llava directory to mtmd (ggml-org#13311)
clip : fix confused naming ffn_up and ffn_down (ggml-org#13290)
convert : bailingmoe : set yarn metadata if present (ggml-org#13312)
SYCL: Disable mul_mat kernels for noncontiguous tensor b (ggml-org#13308)
mtmd : add C public API (ggml-org#13184)
rpc : use backend registry, support dl backends (ggml-org#13304)
ggml : activate s390x simd for Q3_K (ggml-org#13301)
llava/mtmd : fixes to fully support dl backends (ggml-org#13303)
llama : build windows releases with dl backends (ggml-org#13220)
CUDA: fix race condition in MMQ stream-k fixup (ggml-org#13299)
CUDA: fix race condition in MMQ ids_dst (ggml-org#13294)
vulkan: Additional type support for unary, binary, and copy (ggml-org#13266)
...
@zhouwg
Contributor

zhouwg commented May 23, 2025

> The key benefit I was thinking about is to allow users to use libmtmd in their downstream projects. This will let us gather more feedback about the internal design and the API. [...]

ngxson, thanks for your MTMD API. It's very helpful in my downstream project: https://github.com/kantv-ai/kantv/blob/master/core/ggml/jni/realtime-video-recognition.cpp#L304.

I have two tech questions here:

  • could we use multiple (e.g. 2) libllama instances in a single process?
  • could we use multiple (e.g. 2) libmtmd instances in a single process?

Thanks for your time.
