kv-cache : use ggml_set_rows #14285

ggerganov · 2025-06-19T16:30:58Z

Utilize ggml_set_rows() for updating the KV cache.

Make the graph static with respect to the KV cells head offset
Relax the requirement for continuous KV slots of the input ubatch, making the defrag logic almost obsolete

Currently enabled only if the environment variable LLAMA_SET_ROWS is defined. If not, we fallback to the original way of updating the KV cache using a view + cpy of continuous slots. This is needed until the ggml_set_rows() implementation is finalized and supported by all backends.

Will merge after #14274.

Testing

# regular
LLAMA_SET_ROWS=1 ./bin/llama-cli -m ../models/llama-3.2-3b-instruct/ggml-model-q8_0.gguf \
     -p "I believe the meaning of life is" -n 32 --top-k 1 -fa

# SWA
LLAMA_SET_ROWS=1 ./bin/llama-cli -m ../models/gemma-3-4b/ggml-model-q8_0.gguf \
     -p "I believe the meaning of life is" -n 32 --top-k 1 -fa

Next PRs

Introduce the concept of "virtual sequences"
Extend llama_kv_cache_unified to support virtual sequences
Extend the attention graph to support virtual sequences
Enable efficient multi-sequence decoding

rgerganov · 2025-06-20T07:53:30Z

I tried this PR with the following change in the RPC backend:

diff --git a/ggml/src/ggml-rpc/ggml-rpc.cpp b/ggml/src/ggml-rpc/ggml-rpc.cpp
index f468f796..dcbede89 100644
--- a/ggml/src/ggml-rpc/ggml-rpc.cpp
+++ b/ggml/src/ggml-rpc/ggml-rpc.cpp
@@ -761,6 +761,8 @@ static enum ggml_status ggml_backend_rpc_graph_compute(ggml_backend_t backend, g
     ggml_backend_rpc_context * rpc_ctx = (ggml_backend_rpc_context *)backend->context;
     std::vector<uint8_t> input;
     serialize_graph(cgraph, input);
+    auto graph_hash = fnv_hash(input.data(), input.size());
+    printf("RPC graph compute: hash = 0x%" PRIx64 ", size = %zu\n", graph_hash, input.size());
     rpc_msg_graph_compute_rsp response;
     auto sock = get_socket(rpc_ctx->endpoint);
     bool status = send_rpc_cmd(sock, RPC_CMD_GRAPH_COMPUTE, input.data(), input.size(), &response, sizeof(response));

The compute graph doesn't change and produces the same hash with gpt2, tinyllama and mistral-7b models. However, the hash does change with gemma3 models. The serialized graph includes tensor addresses, so it's possible that we rebuild same tensors on different addresses resulting in different graph hash.

But in any way this looks like a great progress!

ggerganov · 2025-06-20T08:02:02Z

However, the hash does change with gemma3 models.

Yes, this is expected. I've applied the change only for the unified cache. For the unified+iswa, it is still using the ggml_cpy:

https://github.com/ggml-org/llama.cpp/pull/14285/files#diff-9be9eea14f4aefce7375482c05968900192634e88e92ac263cedb955a64ad7feR1290-R1291

ggerganov · 2025-06-20T08:07:58Z

Should work with Gemma now as well.

ggerganov · 2025-06-20T09:02:46Z

The non-FA path is now also supported, though I am not 100% sure this is the best way to do it.

ggerganov · 2025-06-20T10:06:08Z

The non-FA path is now also supported, though I am not 100% sure this is the best way to do it.

I don't observe any performance regression with CPU-only build, so the implementation should be good enough I think.

rgerganov · 2025-06-23T08:14:04Z

@ggerganov the following test segfaults on my machine:

$ bin/test-backend-ops -b CPU -p "type=iq2_xxs,n=256,m=5,r=4,b0=1,bs=1,v=0"
Testing 1 devices

Backend 1/1: CPU
  Device description: 11th Gen Intel(R) Core(TM) i7-11850H @ 2.50GHz
  Device memory: 31835 MB (31835 MB free)

  SET_ROWS(type=iq2_xxs,n=256,m=5,r=4,b0=1,bs=1,v=0): [1]    786752 segmentation fault (core dumped)  bin/test-backend-ops -b CPU -p "type=iq2_xxs,n=256,m=5,r=4,b0=1,bs=1,v=0"

ggerganov · 2025-06-23T08:20:31Z

Apply this patch:

diff --git a/ggml/src/ggml-cpu/ggml-cpu.cpp b/ggml/src/ggml-cpu/ggml-cpu.cpp
index 735ef3f01..cc9b922fa 100644
--- a/ggml/src/ggml-cpu/ggml-cpu.cpp
+++ b/ggml/src/ggml-cpu/ggml-cpu.cpp
@@ -416,6 +416,7 @@ static bool ggml_backend_cpu_device_supports_op(ggml_backend_dev_t dev, const st
 
     switch (op->op) {
         case GGML_OP_CPY:
+        case GGML_OP_SET_ROWS:
             return
                 op->type != GGML_TYPE_IQ3_XXS &&
                 op->type != GGML_TYPE_IQ3_S   &&

Add ggml_set_rows(a, b, c) which copies rows from 'b' into 'a' using indices from 'c'. ref: #8366

ggml-ci

github-actions bot added the ggml changes relating to the ggml tensor library for machine learning label Jun 19, 2025

ggerganov mentioned this pull request Jun 19, 2025

ggml : add ggml_set_rows #14274

Draft

ggerganov force-pushed the gg/model-rework-out-ids branch from 1e86597 to 2b940c0 Compare June 20, 2025 07:16

Base automatically changed from gg/model-rework-out-ids to master June 20, 2025 07:50

ggerganov force-pushed the gg/kv-cache-use-set-rows branch from 8f1c5e3 to 5f87f28 Compare June 20, 2025 07:59

ggerganov force-pushed the gg/kv-cache-use-set-rows branch from 4d0c0ea to db0cd69 Compare June 20, 2025 09:01

ggerganov force-pushed the gg/kv-cache-use-set-rows branch 2 times, most recently from d40f705 to d1da992 Compare June 21, 2025 06:20

github-actions bot added the examples label Jun 21, 2025

ggerganov force-pushed the gg/kv-cache-use-set-rows branch from 1031a5d to 14554a8 Compare June 21, 2025 12:29

ggerganov marked this pull request as ready for review June 21, 2025 13:42

ggerganov force-pushed the gg/kv-cache-use-set-rows branch 2 times, most recently from 335161d to e1aba6a Compare June 22, 2025 08:05

github-actions bot added testing Everything test related Apple Metal https://en.wikipedia.org/wiki/Metal_(API) labels Jun 22, 2025

ggerganov force-pushed the gg/kv-cache-use-set-rows branch 3 times, most recently from b5fea54 to c4273b8 Compare June 23, 2025 06:52

ggerganov force-pushed the gg/kv-cache-use-set-rows branch from c4273b8 to 96327b5 Compare June 23, 2025 08:48

rgerganov and others added 4 commits June 23, 2025 13:21

ggml : add ggml_set_rows

c1a581a

Add ggml_set_rows(a, b, c) which copies rows from 'b' into 'a' using indices from 'c'. ref: #8366

use I64 for indices

f2cd962

ggml : add repeat impl for i64

695b6b7

ggml : add ggml_is_contiguous_rows

313a444

ggerganov and others added 14 commits June 23, 2025 13:21

ggml : ggml_set_rows support broadcast

df71c80

ggml : ggml_set_rows support quantized dst

630c84a

ggml-ci

ggml : support GGML_TYPE_F32 ".from_float" trait

e897097

ggml : ggml_set_rows update comment + better index name

e73690a

tests : add ggml_set_rows

828e5d2

metal : add ggml_set_rows implementation

c0cfc2f

ggml-ci

ggml : simplify forward_dup_f32

eba9757

ggml : fix supports_op

1f647b5

kv-cache : use ggml_set_rows

79dac3c

ggml-ci

cont : gate the ggml_set_rows usage with env var

db2bb37

ggml-ci

cont : migrate to using set of indices instead of slot head

f875d6c

ggml-ci

cont : kv-cells cp/set for non-cont slots

39d0b1e

ggml-ci

cont : support non-continuous slots

332f073

ggml-ci

kv-cache : utilize ggml_set_rows broadcast

36f8e20

ggml-ci

ggerganov force-pushed the gg/kv-cache-use-set-rows branch from 96327b5 to 36f8e20 Compare June 23, 2025 10:22

ggerganov mentioned this pull request Jun 24, 2025

llama : add high-throughput mode #14363

Draft

15 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

kv-cache : use ggml_set_rows #14285

kv-cache : use ggml_set_rows #14285

ggerganov commented Jun 19, 2025 •

edited

Loading

Uh oh!

rgerganov commented Jun 20, 2025

Uh oh!

ggerganov commented Jun 20, 2025

Uh oh!

ggerganov commented Jun 20, 2025

Uh oh!

ggerganov commented Jun 20, 2025

Uh oh!

ggerganov commented Jun 20, 2025

Uh oh!

rgerganov commented Jun 23, 2025

Uh oh!

ggerganov commented Jun 23, 2025

Uh oh!

Uh oh!

kv-cache : use ggml_set_rows #14285

Are you sure you want to change the base?

kv-cache : use ggml_set_rows #14285

Conversation

ggerganov commented Jun 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Testing

Next PRs

Uh oh!

rgerganov commented Jun 20, 2025

Uh oh!

ggerganov commented Jun 20, 2025

Uh oh!

ggerganov commented Jun 20, 2025

Uh oh!

ggerganov commented Jun 20, 2025

Uh oh!

ggerganov commented Jun 20, 2025

Uh oh!

rgerganov commented Jun 23, 2025

Uh oh!

ggerganov commented Jun 23, 2025

Uh oh!

Uh oh!

ggerganov commented Jun 19, 2025 •

edited

Loading