Releases: ggml-org/llama.cpp

b5588 · 04 Jun 14:21 · 2589ad3
ci : remove cuda 11.7 releases, switch runner to windows 2022 (#13997)

b5587 · 04 Jun 11:53 · 4825487
releases : use dl backend for linux release, remove arm64 linux relea…

b5586 · 04 Jun 08:27 · 3ac6753
llama-graph : use ggml_repeat_4d (#13998)

b5585 · 04 Jun 07:50 · 0b4be4c
CUDA: fix FTZ in FA for Gemma 3 (#13991)

b5584 · 04 Jun 07:45 · e0e806f
kv-cache : fix unified::seq_rm to work with seq_id < 0 (#13985)

ggml-ci
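
For context, negative IDs act as wildcards in the KV-cache removal API. A minimal sketch below shows that convention, assuming the long-standing `llama_kv_cache_seq_rm` helper from `llama.h` (newer builds may expose the same call under a different name, so check the header for your build):

```cpp
// Hedged sketch of the "negative means all" convention this fix targets:
// seq_id < 0 matches every sequence, and p0/p1 < 0 give an open-ended
// position range. Exact function name is an assumption; verify in llama.h.
#include "llama.h"

void clear_prefix(llama_context * ctx) {
    // remove positions [0, 32) from every sequence in the cache
    llama_kv_cache_seq_rm(ctx, /*seq_id=*/-1, /*p0=*/0, /*p1=*/32);

    // remove everything: all sequences, all positions
    llama_kv_cache_seq_rm(ctx, -1, -1, -1);
}
```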

b5581 · 03 Jun 00:49 · 71e74a3
opencl: add `backend_synchronize` (#13939)

* This is not needed in normal use, where the result is read back
  using `tensor_get`, but it allows the perf mode of `test-backend-ops`
  to properly measure performance.
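
As a rough illustration of what a backend-level synchronize buys here, the sketch below blocks until all queued OpenCL work has finished, so a benchmark measures kernel execution time rather than just enqueue time. The struct and function names are illustrative, not the actual ggml-opencl identifiers:

```cpp
// Illustrative sketch only: drain the command queue so timing stops after
// the kernels have actually run. Names are invented for this example.
#include <CL/cl.h>

struct opencl_backend_ctx {
    cl_command_queue queue;   // queue all kernels are enqueued on
};

static void opencl_backend_synchronize(opencl_backend_ctx * ctx) {
    // clFinish blocks the host until every command in the queue completes
    clFinish(ctx->queue);
}
```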

b5580 · 03 Jun 00:38 · bfb1e01
OpenCL: Add concat, tsembd, upscale, tanh, pad and repeat (#13840)

* add concat, pad, repeat, tsembd, tanh, upscale

* small fixes

b5579 · 02 Jun 19:30 · 3637576
server : disable speculative decoding for SWA models (#13970)

* server : use swa-full for draft context

ggml-ci

* server : disable speculative decoding for SWA models
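
The shape of the change is roughly the guard sketched below; the field and parameter names are made up for illustration and are not the actual llama.cpp server identifiers:

```cpp
// Illustrative sketch: if the model uses sliding-window attention and the
// full KV cache is not kept, turn drafting off rather than risk a cache
// rollback past positions the sliding window has already dropped.
struct server_settings {
    bool swa_full    = false; // keep the full KV cache despite SWA
    int  n_draft_max = 16;    // max tokens to draft per decoding step
};

void maybe_disable_speculative(server_settings & s, bool model_uses_swa) {
    if (model_uses_swa && !s.swa_full) {
        // speculative decoding must remove rejected draft tokens from the
        // target cache; that is not generally possible once the window has
        // slid, so drafting is disabled instead
        s.n_draft_max = 0;
    }
}
```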

b5578 · 02 Jun 19:26 · ea394d7
metal : use F32 accumulators in FA kernels (#13975)

ggml-ci

b5577 · 02 Jun 18:16 · 5582c49
gemma : more consistent attention scaling for v2 and v3 (#13951)

* gemma : fix attn scale for 27B

* cont : apply scale before attn

* cont : consistent attention scaling
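
To make "apply scale before attn" concrete, here is a minimal sketch using the public ggml ops: the queries are scaled once, ahead of the Q·Kᵀ matmul, instead of scaling the KQ product afterwards. The helper name and the scale argument are illustrative; Gemma's actual scale comes from the model hyperparameters and is not necessarily 1/sqrt(head_dim):

```cpp
// Minimal sketch, assuming a single per-query scale. Not the actual
// llama-graph code; only ggml_scale and ggml_mul_mat are real API calls.
#include "ggml.h"

static ggml_tensor * attn_scores(
        ggml_context * ctx,
        ggml_tensor  * q,        // [head_dim, n_tokens, n_head]
        ggml_tensor  * k,        // [head_dim, n_kv,     n_head]
        float          q_scale) {
    q = ggml_scale(ctx, q, q_scale);   // pre-scale the queries
    return ggml_mul_mat(ctx, k, q);    // KQ^T -> [n_kv, n_tokens, n_head]
}
```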