llama : fix buffer checks for mamba and rwk #10111
Conversation
```cpp
op_tensor = ggml_ssm_conv(ctx, nullptr, w);
// FIXME
ggml_tensor * conv_x = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, 12345, w->ne[1], 6789);
op_tensor = ggml_ssm_conv(ctx, conv_x, w);
```
Should this be the other way around, with the convolution/filter as the third argument?

```cpp
op_tensor = ggml_ssm_conv(ctx, w, conv_x);
```
No, it is important that the weight w is in the same position as when it is used during inference, so that the backend supports_op can check it. This function is called as ggml_ssm_conv(ctx, conv_x, model.layers[il].ssm_conv1d) during inference, so the weight is the third argument.
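For context, a minimal sketch of that pattern: build a throwaway op with the weight in its inference-time position and hand it to the backend check. This is not the PR's actual helper; `backend`, `w`, and the placeholder sizes are assumptions carried over from the snippet above.

```cpp
#include "ggml.h"
#include "ggml-backend.h"

// Sketch: create a dummy ggml_ssm_conv node with the weight `w` kept as the
// third argument (its inference-time position), then ask the backend whether
// it supports the resulting op. Only metadata is needed, so no_alloc is used.
static bool ssm_conv_weight_supported(ggml_backend_t backend, ggml_tensor * w) {
    ggml_init_params params = {
        /*.mem_size   =*/ ggml_tensor_overhead() * 8,
        /*.mem_buffer =*/ nullptr,
        /*.no_alloc   =*/ true,
    };
    ggml_context * ctx = ggml_init(params);

    // arbitrary input sizes (as in the snippet above); only the argument layout matters here
    ggml_tensor * conv_x    = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, 12345, w->ne[1], 6789);
    ggml_tensor * op_tensor = ggml_ssm_conv(ctx, conv_x, w); // weight stays the 3rd argument

    const bool ok = ggml_backend_supports_op(backend, op_tensor);

    ggml_free(ctx);
    return ok;
}
```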
Ah, I did not realize that, thanks for clarifying!

When running this, the error seems to happen when building the graph for:

```cpp
19696     // reserve again with pp graph to avoid ggml-alloc reallocations during inference
19697     gf_pp = llama_build_graph(*ctx, ubatch_pp, false);
19698     if (!ggml_backend_sched_reserve(ctx->sched, gf_pp)) {
19699         LLAMA_LOG_ERROR("%s: failed to allocate compute buffers\n", __func__);
19700         llama_free(ctx);
19701         return nullptr;
19702     }
```

Setting a breakpoint on this line and inspecting and stepping through shows:

```
(gdb) p llm.n_kv
$71 = 0
```

And this will cause lctx.inp_s_mask = ggml_new_tensor_2d(ctx0, GGML_TYPE_F32, 1, n_kv); to create a tensor with a zero-sized dimension:

```
(gdb) p n_kv
$72 = 0
(gdb) p lctx.inp_s_mask->ne
$73 = {1, 0, 1, 1}
```

Could this be the cause of the error perhaps? I've tried adding the following to src/llama.cpp:
```diff
diff --git a/src/llama.cpp b/src/llama.cpp
index bedacfcb..517b1eb6 100644
--- a/src/llama.cpp
+++ b/src/llama.cpp
@@ -10257,7 +10257,7 @@ struct llm_build_context {
         norm_eps         (hparams.f_norm_eps),
         norm_rms_eps     (hparams.f_norm_rms_eps),
         n_tokens         (ubatch.n_tokens),
-        n_kv             (worst_case ? kv_self.size : kv_self.n),
+        n_kv             (worst_case ? kv_self.size : (kv_self.recurrent ? 1 : kv_self.n)),
         n_outputs        (worst_case ? n_tokens : lctx.n_outputs),
         n_outputs_enc    (worst_case ? n_tokens : lctx.embd_enc.size() / hparams.n_embd),
         kv_head          (worst_case ? (kv_self.recurrent ? 0 : kv_self.size - n_tokens) : kv_self.head),
```

With this I'm able to run inference. I'm not sure if this is a proper fix or not, but as I'm running out of time today I thought I'd let you know in case it sparks some ideas for you about this issue. I'd be happy to continue investigating this tomorrow if needed/wanted.
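In case it helps to read the intent of that one-line change in isolation, here is a small standalone sketch of the proposed n_kv selection. The struct and field names below only mirror the ones referenced in the diff and are assumptions, not the actual llama.cpp implementation.

```cpp
#include <cstdint>

// Mirrors the fields referenced in the diff above (assumed, simplified).
struct kv_cache_view {
    bool     recurrent; // true for recurrent state caches (e.g. Mamba/RWKV)
    uint32_t size;      // total number of cells in the cache
    uint32_t n;         // cells in use for the current ubatch (can be 0 while reserving)
};

// Proposed selection: for recurrent caches never let n_kv drop to 0, so
// tensors like inp_s_mask keep a non-zero dimension during graph reserve.
static uint32_t select_n_kv(const kv_cache_view & kv_self, bool worst_case) {
    if (worst_case) {
        return kv_self.size;
    }
    return kv_self.recurrent ? 1u : kv_self.n;
}
```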
Thanks, the
Similar to LCPP ggml-org#10111
* llama : fix buffer checks for mamba and rwk
* llama : fix missing worst case flag during reserve
* cuda : fix supports_op for norm
* disable sched SET_CAUSE
Added random values to pass the asserts, but I don't know if they make sense.

The models load, but I wasn't able to run them. I tried Mamba and RWKV models that I found on HF, and both crash during inference in llm_build_copy_mask_state.

Fixes #10109