mtmd : add C public API #13184
Conversation
> Wouldn't having to maintain a C API for […] My understanding is that […]
The key benefit I was thinking about was to allow users to use libmtmd in their downstream projects. This will allow us to gather more feedback about the internal design and the API. It is also quite necessary: as more small vision models become available, people want to use them in their mobile applications. While they could wait for proper support to land in libllama, I don't think that's guaranteed to happen within 1-2 months. And if it ever comes in the future, my vision is to provide a simple way to convert most of the API from libmtmd to libllama. Finally, that's also why the C API is needed in libmtmd: it's part of the experiment in how we design a C API that deals with multimodal input.

Re. your point about breaking changes, this is a valid concern, but I think I'm following the trajectory of libllama in its early days. IIRC we didn't have a stable API until a certain version because things were changing quite fast. libmtmd is in early development and I think breaking changes are expected in either the C or the C++ API. To make this clear, I will state in the header file that libmtmd is experimental and breaking changes are expected.
I still think it's early to expose an API and invite third-party projects to interface with it. But if you want to give this a try that's OK. I like the momentum that you've created with the implementation so far. Just take some steps to inform the developers that this will be very unstable and that issues will be handled with low priority.
IMO the most important feedback that we will get is from integrating libmtmd in llama-server and figuring out how to support all the necessary features.
It's difficult for me to provide a comprehensive review at this point. The C APIs are hard to get right and usually I take an approach to implement some examples that exercise them in order to understand what is needed.
Unless there are some additional concerns, I think we can merge this. @slaren Curious if you have any thoughts too.
examples/llava/mtmd.h (outdated):

```c
// you can move the chunk ownership to your own code
// this will release the chunk from the list of input chunks
// remember to free the chunk when you are done with it
MTMD_API mtmd_input_chunk * mtmd_input_chunk_release(mtmd_input_chunk * chunk);
```
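To illustrate the release-then-free ownership pattern these comments describe, here is a minimal self-contained C++ sketch. The `chunk`/`chunk_list` types and function names are stand-ins for illustration, not the real mtmd API:

```cpp
#include <cassert>
#include <string>
#include <vector>

struct chunk {
    std::string text; // payload; empty once released out of the list
};

using chunk_list = std::vector<chunk>;

// Move chunk i out of the list; the caller now owns the returned chunk
// and must free it. The slot left behind becomes a valid-but-empty
// placeholder, mirroring the behavior discussed later in this thread.
inline chunk * chunk_release(chunk_list & list, size_t i) {
    chunk * out = new chunk(std::move(list[i]));
    list[i] = chunk{}; // container stays structurally intact
    return out;
}

inline void chunk_free(chunk * c) { delete c; }
```

The key design point is that ownership transfers on release, so freeing the list afterwards must not double-free the released chunk.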
I am not sure about the usage of this function, but maybe a better name would be `mtmd_input_chunk_take`. But generally it seems like something that should not be needed.
Indeed, this API is inspired by `unique_ptr::release()`, which transfers the ownership of a chunk from its container to user code. For the usage, it's actually related to this discussion on llama-server. Basically the idea is to have a mapping `std::map<llama_pos, mtmd_input_chunk *>` that maps each image chunk to the correct position in the `llama_tokens` array. And since I've been playing around with the same idea in wllama, I'm pretty sure that this is what we want to have.
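As an illustration of that mapping idea, here is a hedged C++ sketch with stand-in types (`image_chunk` and `chunk_at` are hypothetical; only the `llama_pos` typedef matches llama.cpp) that looks up which image chunk, if any, covers a given token position:

```cpp
#include <cassert>
#include <cstdint>
#include <map>

using llama_pos = int32_t;

struct image_chunk { int n_tokens; }; // stand-in for mtmd_input_chunk

// start position -> image chunk beginning there
using chunk_map = std::map<llama_pos, image_chunk *>;

// Return the chunk covering `pos`, or nullptr if `pos` falls on text tokens.
inline image_chunk * chunk_at(const chunk_map & m, llama_pos pos) {
    auto it = m.upper_bound(pos); // first chunk starting strictly after pos
    if (it == m.begin()) return nullptr;
    --it;                         // last chunk starting at or before pos
    if (pos < it->first + it->second->n_tokens) return it->second;
    return nullptr;
}
```

This kind of lookup is what lets a server match cached token positions back to the image chunk they came from.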
Since it is making a copy rather than really releasing it, would `mtmd_input_chunk_dup` or `copy` be more accurate?
Ignore that, I see that it is moved and only a small struct is allocated. Wouldn't this leave the `mtmd_input_chunks` that used to own this chunk in a bad state?
Yes, it will leave the position of that chunk in `mtmd_input_chunks` in an invalid state (actually not invalid, but it will be a text chunk with 0 tokens).

Your idea of `mtmd_input_chunk_dup` sounds better though; I'll implement it. I think the cost of copying some images is negligible for now, as we haven't yet designed this API to accept video input (which is essentially a sequence of images). If more models support video input in the future, we can introduce another API specifically for tokenizing video.
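A dup-style API deep-copies the chunk and leaves the owning container untouched, in contrast to a release-style move. A minimal sketch with a stand-in `chunk` type (hypothetical, not the real mtmd struct):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct chunk {
    std::vector<uint8_t> pixels; // image payload; duplicated on dup
};

// Deep copy: the source chunk stays valid inside its owning container.
inline chunk * chunk_dup(const chunk & src) {
    return new chunk(src);
}

inline void chunk_free(chunk * c) { delete c; }
```

The trade-off discussed above is exactly this copy of the pixel payload, which is acceptable for single images but would get expensive for video (a sequence of images).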
Thanks for the feedback. I'll add some comments stating that libmtmd is under active development and breaking changes are expected. At this stage, I think not many developers are even aware of the existence of this library, so I assume we won't get many reports. If we start to get more reported issues about the usage of libmtmd in downstream projects, I'll add a dedicated issue template to inform them that such issues will be low priority. And yes, I would love to hear your thoughts on this proposal @slaren
tools/llava/mtmd.h (outdated):

```c
MTMD_API int32_t mtmd_helper_eval_chunks(mtmd_context * ctx,
                                         struct llama_context * lctx,
                                         mtmd_input_chunks * chunks,
                                         llama_pos n_past,
                                         llama_seq_id seq_id,
                                         int32_t n_batch,
                                         bool logits_last,
                                         llama_pos * new_n_past);

// works like mtmd_helper_eval_chunks(), but only for a single chunk
// this function is NOT thread-safe
MTMD_API int32_t mtmd_helper_eval_chunk_single(mtmd_context * ctx,
                                               struct llama_context * lctx,
                                               mtmd_input_chunk * chunk,
                                               llama_pos n_past,
                                               llama_seq_id seq_id,
                                               int32_t n_batch,
                                               bool logits_last,
                                               llama_pos * new_n_past);
```
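To show how a chunk-list helper can be layered on the single-chunk helper while threading `n_past` through the out-parameter, here is a self-contained sketch with stand-in types; the real helpers also take a context, sequence id, batch size, and logits flag, which are omitted here:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using llama_pos = int32_t;

struct chunk { int n_tokens; }; // stand-in for mtmd_input_chunk

// Placeholder single-chunk eval: "decodes" the chunk and reports the new
// n_past through the out-parameter. Returns 0 on success, mirroring the
// int32_t status convention of the declarations above.
inline int32_t eval_chunk_single(const chunk & c, llama_pos n_past,
                                 llama_pos * new_n_past) {
    *new_n_past = n_past + c.n_tokens;
    return 0;
}

// List helper: evaluates chunks in order, carrying n_past forward.
inline int32_t eval_chunks(const std::vector<chunk> & chunks, llama_pos n_past,
                           llama_pos * new_n_past) {
    for (const chunk & c : chunks) {
        int32_t ret = eval_chunk_single(c, n_past, &n_past);
        if (ret != 0) return ret; // stop on the first failure
    }
    *new_n_past = n_past;
    return 0;
}
```

The out-parameter matters because image chunks may occupy a different number of positions than text tokens, so the caller cannot compute the new `n_past` itself.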
const chunks?
Yup thanks, I added `const` in various places. Places not having `const` are:
- `*_init()`
- setters (for ex. `mtmd_bitmap_set_id`)
- `*_free()`
- `mtmd_input_chunk_copy` --> kinda equivalent to `_init()`
- `mtmd_tokenize` --> because it takes `mtmd_input_chunks * output` and modifies it
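The convention above can be sketched on a stand-in `bitmap` type (hypothetical names, not the real mtmd API): accessors take a `const` pointer, while init, setters, and free take a non-const one:

```cpp
#include <cassert>
#include <cstring>
#include <string>

struct bitmap { std::string id; };

inline bitmap * bitmap_init() { return new bitmap(); }                        // creates: non-const
inline void bitmap_set_id(bitmap * b, const char * id) { b->id = id; }        // setter: non-const object
inline const char * bitmap_get_id(const bitmap * b) { return b->id.c_str(); } // getter: const object
inline void bitmap_free(bitmap * b) { delete b; }                             // destroys: non-const
```

Keeping getters `const` lets callers pass read-only handles around without casting, which is exactly what the review comment above was asking for.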
I'm merging this PR once the CI is green
For viz, I added this comment, which should be enough to communicate the state of libmtmd:

```c
/**
 * libmtmd: A library for multimodal support in llama.cpp.
 *
 * WARNING: This API is experimental and subject to many BREAKING CHANGES.
 * Issues related to API usage may receive lower priority support.
 *
 * For the usage, see an example in mtmd-cli.cpp
 */
```
I implemented this API in #12898 and it works well; cached tokens also work correctly. Should be good to merge.
* origin/master: (27 commits)
  llama : fix build_ffn without gate (ggml-org#13336)
  CUDA: fix bad asserts for partial offload (ggml-org#13337)
  convert : qwen2/3moe : set yarn metadata if present (ggml-org#13331)
  CUDA: fix --split-mode row for MMQ (ggml-org#13323)
  gguf-py : avoid requiring pyside6 for other scripts (ggml-org#13036)
  CUDA: fix logic for clearing padding with -ngl 0 (ggml-org#13320)
  sampling : Integrate Top-nσ into main sampling chain (and add it to the server) (ggml-org#13264)
  server : Webui - change setText command from parent window to also send the message. (ggml-org#13309)
  mtmd : rename llava directory to mtmd (ggml-org#13311)
  clip : fix confused naming ffn_up and ffn_down (ggml-org#13290)
  convert : bailingmoe : set yarn metadata if present (ggml-org#13312)
  SYCL: Disable mul_mat kernels for noncontiguous tensor b (ggml-org#13308)
  mtmd : add C public API (ggml-org#13184)
  rpc : use backend registry, support dl backends (ggml-org#13304)
  ggml : activate s390x simd for Q3_K (ggml-org#13301)
  llava/mtmd : fixes to fully support dl backends (ggml-org#13303)
  llama : build windows releases with dl backends (ggml-org#13220)
  CUDA: fix race condition in MMQ stream-k fixup (ggml-org#13299)
  CUDA: fix race condition in MMQ ids_dst (ggml-org#13294)
  vulkan: Additional type support for unary, binary, and copy (ggml-org#13266)
  ...
ngxson, thanks for your MTMD API. It's very helpful in a downstream project: https://github.com/kantv-ai/kantv/blob/master/core/ggml/jni/realtime-video-recognition.cpp#L304. I have two tech questions here:

Thanks for your time.
Fix #13124
The global idea of this PR is to make a C-only wrapper that wraps around C++ types. Think of it as a manually transpiled version of libmtmd from C++ to C.
The idea of this PR is as follows:
- `free()` calls in cpp code will be grouped into a namespace

For example:
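The author's example is truncated here; as a rough reconstruction of the described pattern (stand-in `mtmd_bitmap_t` type and hypothetical names, not the actual PR code), the destructor-backed deleters can be grouped in one namespace behind C-style free functions:

```cpp
#include <cassert>

static int g_live = 0; // track live objects so the pattern is observable

struct mtmd_bitmap_t {
    mtmd_bitmap_t()  { ++g_live; }
    ~mtmd_bitmap_t() { --g_live; }
};

// All deleters live in one namespace; the C-visible free functions
// simply delegate to them, keeping destruction logic in one place.
namespace mtmd_impl {
    inline void free_bitmap(mtmd_bitmap_t * b) { delete b; }
}

inline mtmd_bitmap_t * mtmd_bitmap_init() { return new mtmd_bitmap_t(); }
inline void mtmd_bitmap_free(mtmd_bitmap_t * b) { mtmd_impl::free_bitmap(b); }
```

This keeps the public surface C-only while the C++ destructors still do the real cleanup.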