kv-cache : use ggml_set_rows #14285

base: master

Conversation
Force-pushed from 1e86597 to 2b940c0
I tried this PR with the following change in the RPC backend:

```diff
diff --git a/ggml/src/ggml-rpc/ggml-rpc.cpp b/ggml/src/ggml-rpc/ggml-rpc.cpp
index f468f796..dcbede89 100644
--- a/ggml/src/ggml-rpc/ggml-rpc.cpp
+++ b/ggml/src/ggml-rpc/ggml-rpc.cpp
@@ -761,6 +761,8 @@ static enum ggml_status ggml_backend_rpc_graph_compute(ggml_backend_t backend, g
     ggml_backend_rpc_context * rpc_ctx = (ggml_backend_rpc_context *)backend->context;
     std::vector<uint8_t> input;
     serialize_graph(cgraph, input);
+    auto graph_hash = fnv_hash(input.data(), input.size());
+    printf("RPC graph compute: hash = 0x%" PRIx64 ", size = %zu\n", graph_hash, input.size());
     rpc_msg_graph_compute_rsp response;
     auto sock = get_socket(rpc_ctx->endpoint);
     bool status = send_rpc_cmd(sock, RPC_CMD_GRAPH_COMPUTE, input.data(), input.size(), &response, sizeof(response));
```

The compute graph doesn't change and produces the same hash with the gpt2, tinyllama and mistral-7b models. However, the hash does change with gemma3 models. The serialized graph includes tensor addresses, so it's possible that we rebuild the same tensors at different addresses, resulting in a different graph hash. In any case, this looks like great progress!
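For reference, `fnv_hash` is an existing helper in `ggml-rpc.cpp`; a standard 64-bit FNV-1a over the serialized bytes looks like this minimal sketch (with the usual FNV-1a constants):

```cpp
#include <cstdint>
#include <cstddef>

// Standard 64-bit FNV-1a over a byte buffer -- a sketch of the kind of
// helper used above to fingerprint the serialized compute graph.
static uint64_t fnv_hash(const uint8_t * data, size_t len) {
    uint64_t hash = 0xcbf29ce484222325ULL; // FNV offset basis
    for (size_t i = 0; i < len; ++i) {
        hash ^= data[i];
        hash *= 0x100000001b3ULL;          // FNV prime
    }
    return hash;
}
```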
Force-pushed from 8f1c5e3 to 5f87f28
Yes, this is expected. I've applied the change only for the unified cache. For the unified+iswa, it is still using the original path.
Should work with Gemma now as well.
Force-pushed from 4d0c0ea to db0cd69
The non-FA path is now also supported, though I am not 100% sure this is the best way to do it.

I don't observe any performance regression with a CPU-only build, so I think the implementation is good enough.
Force-pushed from d40f705 to d1da992

Force-pushed from 1031a5d to 14554a8

Force-pushed from 335161d to e1aba6a

Force-pushed from b5fea54 to c4273b8
@ggerganov the following test segfaults on my machine:
Apply this patch:

```diff
diff --git a/ggml/src/ggml-cpu/ggml-cpu.cpp b/ggml/src/ggml-cpu/ggml-cpu.cpp
index 735ef3f01..cc9b922fa 100644
--- a/ggml/src/ggml-cpu/ggml-cpu.cpp
+++ b/ggml/src/ggml-cpu/ggml-cpu.cpp
@@ -416,6 +416,7 @@ static bool ggml_backend_cpu_device_supports_op(ggml_backend_dev_t dev, const st
     switch (op->op) {
         case GGML_OP_CPY:
+        case GGML_OP_SET_ROWS:
             return
                 op->type != GGML_TYPE_IQ3_XXS &&
                 op->type != GGML_TYPE_IQ3_S &&
```
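Presumably this works because it extends the existing `GGML_OP_CPY` gating to `GGML_OP_SET_ROWS`: the CPU backend then reports these i-quant types as unsupported destinations for the op, rather than crashing while trying to write rows into them.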
Force-pushed from c4273b8 to 96327b5
Add ggml_set_rows(a, b, c) which copies rows from 'b' into 'a' using indices from 'c'.

ref: #8366

ggml-ci
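For illustration, a minimal sketch of how the new op composes, assuming the `ggml_set_rows(ctx, a, b, c)` signature described above and I64 row indices; the wrapper name and tensor shapes are hypothetical:

```cpp
#include "ggml.h"

// Hypothetical wrapper: scatter the rows of 'cur' into 'cache' at the row
// positions listed in 'idxs' (a sketch, not code from this PR).
static struct ggml_tensor * kv_scatter_rows(
        struct ggml_context * ctx,
        struct ggml_tensor  * cache,  // destination, e.g. [n_embd, kv_size]
        struct ggml_tensor  * cur,    // new rows,    e.g. [n_embd, n_tokens]
        struct ggml_tensor  * idxs) { // row indices, e.g. [n_tokens], assumed GGML_TYPE_I64
    // row i of 'cur' ends up at row idxs[i] of 'cache'
    return ggml_set_rows(ctx, cache, cur, idxs);
}
```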
Force-pushed from 96327b5 to 36f8e20
depends #14274

Utilize `ggml_set_rows()` for updating the KV cache `head` offset.

Currently enabled only if the environment variable `LLAMA_SET_ROWS` is defined. If not, we fall back to the original way of updating the KV cache using a view + cpy of contiguous slots. This is needed until the `ggml_set_rows()` implementation is finalized and supported by all backends.

Will merge after #14274.
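A minimal sketch of this environment-variable gating (the variable name is from this PR; the helper is hypothetical):

```cpp
#include <cstdlib>

// Hypothetical helper: decide once whether to use the ggml_set_rows()
// scatter path or the original view + cpy of contiguous slots.
static bool kv_use_set_rows() {
    static const bool enabled = std::getenv("LLAMA_SET_ROWS") != nullptr;
    return enabled;
}
```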
Testing

Next PRs

- `llama_kv_cache_unified` to support virtual sequences