Releases · ggml-org/llama.cpp
b5588
ci : remove cuda 11.7 releases, switch runner to windows 2022 (#13997)
b5587
releases : use dl backend for linux release, remove arm64 linux relea…
b5586
llama-graph : use ggml_repeat_4d (#13998)
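A minimal sketch of what this change means in practice, not taken from the PR: broadcasting a small tensor across a batch without allocating a dummy tensor just to carry the target shape. The exact `ggml_repeat_4d` signature (target dims passed directly) is an assumption here; `ggml_repeat` takes a second tensor whose shape is the broadcast target.

```cpp
#include "ggml.h"

int main() {
    ggml_init_params params = { /*mem_size*/ 16*1024*1024, /*mem_buffer*/ nullptr, /*no_alloc*/ false };
    ggml_context * ctx = ggml_init(params);

    ggml_tensor * bias = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 64);  // [64]

    // Assumed API: pass the target dims directly.
    ggml_tensor * rep  = ggml_repeat_4d(ctx, bias, 64, 8, 1, 1);      // [64, 8]

    // Equivalent with plain ggml_repeat: a throwaway tensor supplies the shape.
    ggml_tensor * shape = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 64, 8);
    ggml_tensor * rep2  = ggml_repeat(ctx, bias, shape);

    GGML_UNUSED(rep); GGML_UNUSED(rep2);
    ggml_free(ctx);
    return 0;
}
```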
b5585
CUDA: fix FTZ in FA for Gemma 3 (#13991)
b5584
kv-cache : fix unified::seq_rm to work with seq_id < 0 (#13985) ggml-ci
b5581
opencl: add `backend_synchronize` (#13939) * This is not needed for normal use, where the result is read using `tensor_get`, but it allows the perf mode of `test-backend-ops` to properly measure performance.
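A backend synchronize hook typically just drains the device command queue so that externally-timed work has actually finished. A minimal sketch, assuming a single `cl_command_queue` per backend; the struct and function names below are illustrative, not the actual ggml-opencl internals:

```cpp
#include <CL/cl.h>

// Hypothetical per-backend state; the real OpenCL backend keeps more than this.
struct opencl_backend_ctx {
    cl_command_queue queue;
};

// Block until every enqueued kernel/transfer on the queue has completed.
// Only needed when timing work externally (e.g. the perf mode of
// test-backend-ops); normal reads via tensor_get already block on the result.
static void opencl_backend_synchronize(opencl_backend_ctx * ctx) {
    cl_int err = clFinish(ctx->queue);
    if (err != CL_SUCCESS) {
        // handle/log the error as appropriate
    }
}
```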
b5580
OpenCL: Add concat, tsembd, upscale, tanh, pad and repeat (#13840) * add concat, pad, repeat, tsembd, tanh, upscale * small fixes
b5579
server : disable speculative decoding for SWA models (#13970) * server : use swa-full for draft context ggml-ci * server : disable speculative decoding for SWA models
b5578
metal : use F32 accumulators in FA kernels (#13975) ggml-ci
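The reason for widening the accumulator: summing many low-precision products in an equally low-precision register loses bits long before the final result does. A rough standalone illustration of the same effect one precision level up (float vs. double accumulator, standing in for F16 vs. F32 in the Metal flash-attention kernels; this is an analogy, not kernel code):

```cpp
#include <cstdio>

int main() {
    // Accumulate 10 million small terms, conceptually like reducing over a long context.
    const int n = 10'000'000;
    float  acc_narrow = 0.0f;   // stand-in for an F16 accumulator
    double acc_wide   = 0.0;    // stand-in for an F32 accumulator

    for (int i = 0; i < n; ++i) {
        const float term = 1e-4f;
        acc_narrow += term;          // rounding error compounds each step
        acc_wide   += (double) term; // wider accumulator keeps the bits
    }

    // Exact sum is 1000; the narrow accumulator drifts noticeably, the wide one does not.
    std::printf("narrow: %.6f\nwide:   %.6f\n", acc_narrow, acc_wide);
    return 0;
}
```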
b5577
gemma : more consistent attention scaling for v2 and v3 (#13951) * gemma : fix attn scale for 27B * cont : apply scale before attn * cont : consistent attention scaling
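Why "apply scale before attn" is a safe reordering: scaling Q by s before the QK^T matmul yields the same logits as scaling the matmul output, since (sQ)K^T = s(QK^T). A tiny self-contained check of that equivalence (plain C++, not the actual llama-graph code):

```cpp
#include <cstdio>
#include <cmath>
#include <vector>

int main() {
    // One query and two keys, head_dim = 4.
    const float scale = 1.0f / std::sqrt(4.0f);
    std::vector<float> q = {0.5f, -1.0f, 2.0f, 0.25f};
    std::vector<std::vector<float>> k = {
        {1.0f, 0.0f, -1.0f, 2.0f},
        {0.5f, 0.5f,  0.5f, 0.5f},
    };

    for (size_t j = 0; j < k.size(); ++j) {
        float logit_scale_after  = 0.0f; // scale * (q . k_j)
        float logit_scale_before = 0.0f; // (scale * q) . k_j
        for (size_t d = 0; d < q.size(); ++d) {
            logit_scale_after  += q[d] * k[j][d];
            logit_scale_before += (scale * q[d]) * k[j][d];
        }
        logit_scale_after *= scale;
        // The two logits agree up to rounding.
        std::printf("k%zu: %.6f vs %.6f\n", j, logit_scale_after, logit_scale_before);
    }
    return 0;
}
```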