
kv-cache : use ggml_set_rows #14285


Status: Open. ggerganov wants to merge 18 commits into master from gg/kv-cache-use-set-rows.

Conversation

ggerganov (Member) commented on Jun 19, 2025

Depends on #14274.

Utilize ggml_set_rows() for updating the KV cache.

  • Make the graph static with respect to the KV cells head offset
  • Relax the requirement for contiguous KV slots for the input ubatch, making the defrag logic almost obsolete

Currently this is enabled only if the environment variable LLAMA_SET_ROWS is defined. If not, we fall back to the original way of updating the KV cache using a view + cpy of contiguous slots. This is needed until the ggml_set_rows() implementation is finalized and supported by all backends.
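
To illustrate, here is a minimal sketch (not the PR's actual code) contrasting the two update strategies; tensor names, shapes, and the kv_idxs indices tensor are illustrative assumptions:

#include "ggml.h"

// Old path (view + cpy): the ubatch's KV slots must form one contiguous
// range starting at 'head', so the graph changes whenever 'head' changes.
static void kv_store_cpy(struct ggml_context * ctx, struct ggml_cgraph * gf,
                         struct ggml_tensor * k_cache,  // [n_embd_k, kv_size]
                         struct ggml_tensor * k_cur,    // [n_embd_k, n_tokens]
                         int64_t head) {
    struct ggml_tensor * k_view = ggml_view_1d(ctx, k_cache,
            k_cur->ne[0]*k_cur->ne[1],  // number of elements to write
            head*k_cache->nb[1]);       // byte offset of the head slot
    ggml_build_forward_expand(gf, ggml_cpy(ctx, k_cur, k_view));
}

// New path (set_rows): each token carries its own destination row index in
// 'kv_idxs', so rows can land anywhere in the cache and the graph stays
// static with respect to the head offset.
static void kv_store_set_rows(struct ggml_context * ctx, struct ggml_cgraph * gf,
                              struct ggml_tensor * k_cache,   // [n_embd_k, kv_size]
                              struct ggml_tensor * k_cur,     // [n_embd_k, n_tokens]
                              struct ggml_tensor * kv_idxs) { // I64, [n_tokens]
    ggml_build_forward_expand(gf, ggml_set_rows(ctx, k_cache, k_cur, kv_idxs));
}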

Will merge after #14274.

Testing

# regular
LLAMA_SET_ROWS=1 ./bin/llama-cli -m ../models/llama-3.2-3b-instruct/ggml-model-q8_0.gguf \
     -p "I believe the meaning of life is" -n 32 --top-k 1 -fa
# SWA
LLAMA_SET_ROWS=1 ./bin/llama-cli -m ../models/gemma-3-4b/ggml-model-q8_0.gguf \
     -p "I believe the meaning of life is" -n 32 --top-k 1 -fa

Next PRs

  • Introduce the concept of "virtual sequences"
  • Extend llama_kv_cache_unified to support virtual sequences
  • Extend the attention graph to support virtual sequences
  • Enable efficient multi-sequence decoding

github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Jun 19, 2025
ggerganov force-pushed the gg/model-rework-out-ids branch from 1e86597 to 2b940c0 on June 20, 2025
Base automatically changed from gg/model-rework-out-ids to master on June 20, 2025

rgerganov (Collaborator) commented:

I tried this PR with the following change in the RPC backend:

diff --git a/ggml/src/ggml-rpc/ggml-rpc.cpp b/ggml/src/ggml-rpc/ggml-rpc.cpp
index f468f796..dcbede89 100644
--- a/ggml/src/ggml-rpc/ggml-rpc.cpp
+++ b/ggml/src/ggml-rpc/ggml-rpc.cpp
@@ -761,6 +761,8 @@ static enum ggml_status ggml_backend_rpc_graph_compute(ggml_backend_t backend, g
     ggml_backend_rpc_context * rpc_ctx = (ggml_backend_rpc_context *)backend->context;
     std::vector<uint8_t> input;
     serialize_graph(cgraph, input);
+    auto graph_hash = fnv_hash(input.data(), input.size());
+    printf("RPC graph compute: hash = 0x%" PRIx64 ", size = %zu\n", graph_hash, input.size());
     rpc_msg_graph_compute_rsp response;
     auto sock = get_socket(rpc_ctx->endpoint);
     bool status = send_rpc_cmd(sock, RPC_CMD_GRAPH_COMPUTE, input.data(), input.size(), &response, sizeof(response));

The compute graph doesn't change and produces the same hash with the gpt2, tinyllama and mistral-7b models. However, the hash does change with gemma3 models. The serialized graph includes tensor addresses, so it's possible that we rebuild the same tensors at different addresses, resulting in a different graph hash.

In any case, this looks like great progress!
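
For reference, the fnv_hash used in the patch above is presumably the usual 64-bit FNV-1a hash already present in ggml-rpc.cpp; a minimal sketch of that assumption:

#include <stdint.h>
#include <stddef.h>

// 64-bit FNV-1a: XOR each byte into the state, then multiply by the prime.
static uint64_t fnv_hash(const uint8_t * data, size_t len) {
    const uint64_t fnv_prime = 0x100000001b3ULL;
    uint64_t hash = 0xcbf29ce484222325ULL; // FNV offset basis

    for (size_t i = 0; i < len; ++i) {
        hash ^= data[i];
        hash *= fnv_prime;
    }
    return hash;
}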

ggerganov force-pushed the gg/kv-cache-use-set-rows branch from 8f1c5e3 to 5f87f28 on June 20, 2025

ggerganov (Member, Author) commented:

> However, the hash does change with gemma3 models.

Yes, this is expected. I've applied the change only to the unified cache. The unified+iSWA cache still uses ggml_cpy:

https://github.com/ggml-org/llama.cpp/pull/14285/files#diff-9be9eea14f4aefce7375482c05968900192634e88e92ac263cedb955a64ad7feR1290-R1291

ggerganov (Member, Author) commented:

Should work with Gemma now as well.

ggerganov force-pushed the gg/kv-cache-use-set-rows branch from 4d0c0ea to db0cd69 on June 20, 2025

ggerganov (Member, Author) commented:

The non-FA path is now also supported, though I am not 100% sure this is the best way to do it.

ggerganov (Member, Author) commented:

> The non-FA path is now also supported, though I am not 100% sure this is the best way to do it.

I don't observe any performance regression with a CPU-only build, so the implementation should be good enough, I think.
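
A hypothetical way to reproduce such a comparison with llama-bench (the exact commands are an illustration, not from the PR):

# baseline: original view + cpy path
./bin/llama-bench -m ../models/llama-3.2-3b-instruct/ggml-model-q8_0.gguf
# new path: ggml_set_rows
LLAMA_SET_ROWS=1 ./bin/llama-bench -m ../models/llama-3.2-3b-instruct/ggml-model-q8_0.gguf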

ggerganov force-pushed the gg/kv-cache-use-set-rows branch 2 times, most recently from d40f705 to d1da992 on June 21, 2025
ggerganov force-pushed the gg/kv-cache-use-set-rows branch from 1031a5d to 14554a8 on June 21, 2025
ggerganov marked this pull request as ready for review on June 21, 2025
ggerganov force-pushed the gg/kv-cache-use-set-rows branch 2 times, most recently from 335161d to e1aba6a on June 22, 2025
github-actions bot added the testing and Apple Metal labels on Jun 22, 2025
ggerganov force-pushed the gg/kv-cache-use-set-rows branch 3 times, most recently from b5fea54 to c4273b8 on June 23, 2025

rgerganov (Collaborator) commented:

@ggerganov the following test segfaults on my machine:

$ bin/test-backend-ops -b CPU -p "type=iq2_xxs,n=256,m=5,r=4,b0=1,bs=1,v=0"
Testing 1 devices

Backend 1/1: CPU
  Device description: 11th Gen Intel(R) Core(TM) i7-11850H @ 2.50GHz
  Device memory: 31835 MB (31835 MB free)

  SET_ROWS(type=iq2_xxs,n=256,m=5,r=4,b0=1,bs=1,v=0): [1]    786752 segmentation fault (core dumped)  bin/test-backend-ops -b CPU -p "type=iq2_xxs,n=256,m=5,r=4,b0=1,bs=1,v=0"

ggerganov (Member, Author) commented:

Apply this patch, which makes the CPU backend report SET_ROWS as unsupported for the same destination types as CPY, so the test is skipped instead of crashing:

diff --git a/ggml/src/ggml-cpu/ggml-cpu.cpp b/ggml/src/ggml-cpu/ggml-cpu.cpp
index 735ef3f01..cc9b922fa 100644
--- a/ggml/src/ggml-cpu/ggml-cpu.cpp
+++ b/ggml/src/ggml-cpu/ggml-cpu.cpp
@@ -416,6 +416,7 @@ static bool ggml_backend_cpu_device_supports_op(ggml_backend_dev_t dev, const st
 
     switch (op->op) {
         case GGML_OP_CPY:
+        case GGML_OP_SET_ROWS:
             return
                 op->type != GGML_TYPE_IQ3_XXS &&
                 op->type != GGML_TYPE_IQ3_S   &&

ggerganov force-pushed the gg/kv-cache-use-set-rows branch from c4273b8 to 96327b5 on June 23, 2025

rgerganov and others added 4 commits on June 23, 2025
Add ggml_set_rows(a, b, c) which copies rows from 'b' into 'a' using
indices from 'c'.

ref: #8366
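
As a semantics check, here is a self-contained sketch of how the new op could be exercised on the CPU backend. The F32 tensor types and the I64 index type are assumptions, as is using the helper ggml_graph_compute_with_ctx from ggml-cpu.h:

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include "ggml.h"
#include "ggml-cpu.h" // for ggml_graph_compute_with_ctx

int main(void) {
    struct ggml_init_params params = { /*mem_size*/ 16*1024*1024, /*mem_buffer*/ NULL, /*no_alloc*/ false };
    struct ggml_context * ctx = ggml_init(params);

    // destination: 8 rows of 4 floats; source: 2 rows to scatter into it
    struct ggml_tensor * dst = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 4, 8);
    struct ggml_tensor * src = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 4, 2);
    struct ggml_tensor * idx = ggml_new_tensor_1d(ctx, GGML_TYPE_I64, 2);

    memset(dst->data, 0, ggml_nbytes(dst));
    for (int i = 0; i < 8; ++i) ((float *) src->data)[i] = (float) i;
    ((int64_t *) idx->data)[0] = 5; // src row 0 -> dst row 5
    ((int64_t *) idx->data)[1] = 1; // src row 1 -> dst row 1

    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, ggml_set_rows(ctx, dst, src, idx));
    ggml_graph_compute_with_ctx(ctx, gf, 1);

    const float * d = (const float *) dst->data;
    printf("dst row 5: %.0f %.0f %.0f %.0f\n",
           d[5*4+0], d[5*4+1], d[5*4+2], d[5*4+3]); // expected: 0 1 2 3

    ggml_free(ctx);
    return 0;
}
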
ggerganov force-pushed the gg/kv-cache-use-set-rows branch from 96327b5 to 36f8e20 on June 23, 2025

ggerganov mentioned this pull request on Jun 24, 2025

Labels
Apple Metal, examples, ggml, testing