Releases · ggml-org/llama.cpp
b5585
CUDA: fix FTZ in FA for Gemma 3 (#13991)
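For readers unfamiliar with the acronyms: FTZ is flush-to-zero, a floating-point mode in which subnormal values are treated as zero, and FA is flash attention. The sketch below is a host-side C++ illustration of the general effect, not the CUDA fix itself: flushing a subnormal changes downstream results, such as logarithms or exponentials in a softmax path.

```cpp
#include <cstdio>
#include <cmath>
#include <limits>

// Subnormals are the tiny values between 0 and FLT_MIN. Under a
// flush-to-zero (FTZ) regime they become exactly 0, so any expression
// that depends on them changes value downstream.
int main() {
    float sub = std::numeric_limits<float>::denorm_min(); // smallest subnormal
    float ftz = 0.0f;                                     // what FTZ yields instead
    std::printf("subnormal: %g  log(subnormal): %g\n", sub, std::log(sub));
    std::printf("flushed:   %g  log(flushed):   %g\n", ftz, std::log(ftz)); // -inf
}
```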
b5584
kv-cache : fix unified::seq_rm to work with seq_id < 0 (#13985) ggml-ci
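For context, the KV-cache sequence-removal API treats a negative seq_id as "match any sequence". Below is a minimal usage sketch, assuming the `llama_kv_self_seq_rm` entry point from `llama.h` around these builds; the helper name is illustrative.

```cpp
#include "llama.h"

// Drop every cached token at position >= keep_up_to, across *all* sequences,
// by passing seq_id = -1, which is the seq_id < 0 case this release fixes
// for the unified cache. p1 < 0 means "up to the end of the cache".
static void drop_tail(llama_context * ctx, llama_pos keep_up_to) {
    llama_kv_self_seq_rm(ctx, /*seq_id=*/-1, /*p0=*/keep_up_to, /*p1=*/-1);
}
```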
b5581
opencl: add `backend_synchronize` (#13939)
* Not needed for normal use, where the result is read back with `tensor_get`, but it lets the perf mode of `test-backend-ops` measure performance properly.
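A rough sketch of why the synchronize hook matters for benchmarking: a wall-clock measurement is only valid once the queued device work has drained. This uses the public `ggml_backend_graph_compute` and `ggml_backend_synchronize` wrappers from `ggml-backend.h`; the timing loop itself is illustrative, not the actual `test-backend-ops` code.

```cpp
#include <chrono>
#include "ggml-backend.h"

// Time one graph evaluation on a backend. Without the synchronize call,
// the clock could stop while work is still queued on the device, making
// the measurement meaningless for asynchronous backends such as OpenCL.
static double time_graph(ggml_backend_t backend, ggml_cgraph * graph) {
    const auto t0 = std::chrono::high_resolution_clock::now();
    ggml_backend_graph_compute(backend, graph);
    ggml_backend_synchronize(backend); // wait for queued work to finish
    const auto t1 = std::chrono::high_resolution_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}
```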
b5580
OpenCL: Add concat, tsembd, upscale, tanh, pad and repeat (#13840)
* add concat, pad, repeat, tsembd, tanh, upscale
* small fixes
b5579
server : disable speculative decoding for SWA models (#13970)
* server : use swa-full for draft context ggml-ci
* server : disable speculative decoding for SWA models
b5578
metal : use F32 accumulators in FA kernels (#13975) ggml-ci
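As background on why accumulator width matters in reductions like flash attention: once a low-precision running sum grows large, small addends are rounded away. The sketch below shows the effect with float vs. double accumulators, as an analogy for the F16 vs. F32 choice in the Metal kernels.

```cpp
#include <cstdio>

// Sum 1e7 copies of 0.1. Near 1e6 a float can no longer represent
// increments of 0.1 accurately, so the float accumulator drifts, while
// the wider accumulator stays close to the true value. The same failure
// mode, at a smaller scale, motivates F32 accumulators in F16 attention
// kernels.
int main() {
    float  acc32 = 0.0f;
    double acc64 = 0.0;
    for (int i = 0; i < 10000000; ++i) {
        acc32 += 0.1f;
        acc64 += 0.1f;
    }
    std::printf("float  accumulator: %f\n", acc32); // noticeably off from 1e6
    std::printf("double accumulator: %f\n", acc64); // ~1000000.0
}
```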
b5577
gemma : more consistent attention scaling for v2 and v3 (#13951)
* gemma : fix attn scale for 27B
* cont : apply scale before attn
* cont : consistent attention scaling
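The "apply scale before attn" item refers to where the attention scale is multiplied in: into Q before the Q·Kᵀ product, rather than into the resulting logits. Here is a minimal single-query sketch of the two orderings, with hypothetical names (Gemma uses a model-specific scale rather than the usual 1/sqrt(head_dim)).

```cpp
#include <cstddef>
#include <vector>

// Compute attention logits for one query against a set of keys, applying
// the scale either to q up front or to each logit afterwards. The two
// orderings are mathematically equivalent; consistency matters so that
// every code path rounds the same way.
static std::vector<float> attn_logits(const std::vector<float> & q,
                                      const std::vector<std::vector<float>> & keys,
                                      float scale, bool scale_q_first) {
    std::vector<float> logits;
    logits.reserve(keys.size());
    for (const auto & k : keys) {
        float dot = 0.0f;
        for (size_t i = 0; i < q.size(); ++i) {
            dot += (scale_q_first ? q[i] * scale : q[i]) * k[i];
        }
        logits.push_back(scale_q_first ? dot : dot * scale);
    }
    return logits;
}
```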
b5576
`server`: update deepseek reasoning format (pass reasoning_content as…
b5575
mtmd : fix memory leak in mtmd_helper_eval_chunk_single (#13961)
* mtmd : fix memory leak in mtmd_helper_eval_chunk_single
* mtmd-cli : fix mem leak
* Update tools/mtmd/mtmd-cli.cpp
Co-authored-by: Georgi Gerganov <[email protected]>
b5574
cmake : Handle mixed-case 'Power' strings in POWER CPU detection (#13…