Releases · ggml-org/llama.cpp
b5494
server: fix regression on streamed non-chat completion w/ stops (#13785)
* more forgiving message diffs: partial stop words aren't erased, full stops are
* add a (slow) server test for completion + stream + stop
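The gist of the fix, per the commit bullets: a trailing fragment that could still grow into a stop word is held back from the stream rather than erased, and only a completed stop word is stripped. A minimal C++ sketch of that buffering logic (hypothetical helper names, not the server's actual code):

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Length of the longest suffix of `text` that is a proper prefix of any stop
// word. That many trailing bytes must be held back from the stream, since the
// next token may still complete the stop word.
static size_t partial_stop_len(const std::string & text, const std::vector<std::string> & stops) {
    size_t best = 0;
    for (const auto & stop : stops) {
        if (stop.empty()) {
            continue;
        }
        for (size_t n = std::min(text.size(), stop.size() - 1); n > 0; --n) {
            if (text.compare(text.size() - n, n, stop, 0, n) == 0) {
                best = std::max(best, n);
                break;
            }
        }
    }
    return best;
}

// Per-token streaming step: returns the chunk that is now safe to emit.
// A full stop word is erased from the output; a partial one is only delayed.
static std::string next_chunk(std::string & acc, const std::string & piece,
                              const std::vector<std::string> & stops, bool & done) {
    acc += piece;
    for (const auto & stop : stops) {
        const size_t pos = acc.find(stop);
        if (pos != std::string::npos) {
            done = true;
            const std::string out = acc.substr(0, pos); // stop word itself is dropped
            acc.clear();
            return out;
        }
    }
    done = false;
    const size_t hold = partial_stop_len(acc, stops);
    const std::string out = acc.substr(0, acc.size() - hold);
    acc.erase(0, out.size()); // keep only the possibly-partial stop suffix
    return out;
}
```

With stop word "END", streaming "the " then "EN" then "Q" emits "the ", then nothing (the "EN" is held back, not erased), then "ENQ" once the stop can no longer complete.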
b5493
examples : allow extracting embeddings from decoder contexts (#13797) ggml-ci
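For context, extracting an embedding from a decoder context boils down to enabling embeddings on the context and reading the pooled vector back after decoding. A minimal sketch assuming the llama.h API around these releases; error handling is mostly omitted, and exact names such as llama_init_from_model and llama_model_n_embd should be checked against your llama.h revision:

```cpp
#include "llama.h"

#include <vector>

// Sketch: pull a pooled embedding out of a decoder-only model's context.
// Assumes `model` is already loaded and `tokens` holds the tokenized prompt.
std::vector<float> embed_prompt(llama_model * model, const std::vector<llama_token> & tokens) {
    llama_context_params cparams = llama_context_default_params();
    cparams.embeddings   = true;                    // store embeddings in the context
    cparams.pooling_type = LLAMA_POOLING_TYPE_MEAN; // pool over the whole sequence

    llama_context * ctx = llama_init_from_model(model, cparams);

    llama_batch batch = llama_batch_init((int32_t) tokens.size(), 0, 1);
    for (int32_t i = 0; i < (int32_t) tokens.size(); ++i) {
        batch.token   [i]    = tokens[i];
        batch.pos     [i]    = i;
        batch.n_seq_id[i]    = 1;
        batch.seq_id  [i][0] = 0;
        batch.logits  [i]    = true; // request output for every position
    }
    batch.n_tokens = (int32_t) tokens.size();

    std::vector<float> out;
    if (llama_decode(ctx, batch) == 0) {
        const float * emb = llama_get_embeddings_seq(ctx, 0); // pooled vector for seq 0
        out.assign(emb, emb + llama_model_n_embd(model));
    }

    llama_batch_free(batch);
    llama_free(ctx);
    return out;
}
```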
b5492
llama : clarify deprecation message (#13794)
b5490
vulkan: mark IM2COL as supporting non-contig (#13783)
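Background on the operator: IM2COL unfolds each input patch the kernel would touch into one contiguous row, turning convolution into a single matrix multiply; accepting non-contiguous sources lets the backend consume views and permutations without an extra copy. A scalar 1-D sketch of the operation itself (illustrative only, unrelated to the Vulkan shader):

```cpp
#include <vector>

// 1-D im2col: copy the k samples each output position's kernel window touches
// into one contiguous row, so convolution becomes a plain matrix multiply
// between this (n_out x k) matrix and the flattened kernel weights.
std::vector<float> im2col_1d(const std::vector<float> & x, int k, int stride) {
    const int n_out = ((int) x.size() - k) / stride + 1;
    std::vector<float> cols((size_t) n_out * (size_t) k);
    for (int i = 0; i < n_out; ++i) {
        for (int j = 0; j < k; ++j) {
            cols[(size_t) i * k + j] = x[(size_t) i * stride + j];
        }
    }
    return cols;
}
```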
b5489
CANN: add basic support for the Flash Attention kernel (#13627)
* cann: add the basic FA support
* cann: update the readme
* cann: update the FlashAttention with PSEShift
* cann: update the input parameters in FA
* cann: update the alibi with max_bias
* cann: add the constraints of softcap
* cann: update the docs CANN.md
* cann: fix typo in CANN.md
* cann: add some comments and update CANN.md
* cann: update the inner precise for fusedInferAttention
* cann: update the constraints of flash_attn_ext in ggml-cann.cpp
* cann: clean up whitespace
* cann: add a trailing newline
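For reference, the ALiBi (max_bias) and softcap pieces listed above both act on the pre-softmax attention scores. Roughly, in my reading of ggml's flash-attention semantics (not text from the PR; with logit_softcap c = 0 the tanh capping is skipped, and the per-head slope m_h is derived from max_bias):

```latex
% Pre-softmax score for query i and key j in head h (sketch):
s_{ij} = c \,\tanh\!\left( \frac{q_i \cdot k_j}{c\,\sqrt{d}} \right) + m_h\,(j - i),
\qquad
\mathrm{Attn}(Q, K, V)_i = \sum_j \operatorname{softmax}_j\!\left( s_{ij} \right) v_j
```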
b5488
`server`: add `--reasoning-budget 0` to disable thinking (incl. qwen3…
b5486
tests : improve UGM tokenizer test coverage (#13773)
b5484
rpc : Fix build on OpenBSD (#13541)
b5483
mtmd : add support for Qwen2-Audio and SeaLLM-Audio (#13760)
* mtmd : add Qwen2-Audio support
* small clean up
* update discussion link
* clarify mtmd_get_output_embd
* clarification in multimodal.md
* fix ultravox bug
* ggml_cont
b5481
server: fix/test add_generation_prompt (#13770)
Co-authored-by: ochafik
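For context, add_generation_prompt controls whether the rendered chat template ends with an open assistant turn, which primes the model to answer rather than to continue the last message. A minimal ChatML-style sketch (hypothetical helper, not the server's Jinja path):

```cpp
#include <string>
#include <vector>

struct chat_msg { std::string role, content; };

// Render ChatML-style: each message is wrapped in <|im_start|>/<|im_end|>.
// With add_generation_prompt the result ends in an open assistant turn,
// so generation produces a reply instead of continuing the last message.
std::string render_chatml(const std::vector<chat_msg> & msgs, bool add_generation_prompt) {
    std::string out;
    for (const auto & m : msgs) {
        out += "<|im_start|>" + m.role + "\n" + m.content + "<|im_end|>\n";
    }
    if (add_generation_prompt) {
        out += "<|im_start|>assistant\n"; // left open: the model fills it in
    }
    return out;
}
```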