
llama : fix buffer checks for mamba and rwk #10111


Merged: 4 commits from sl/fix-mamba-rwk-checks into master on Oct 31, 2024

Conversation

slaren (Member) commented on Oct 31, 2024:

Added random values to pass the asserts, but I don't know if they make sense.

The models load, but I wasn't able to run them. I tried a Mamba model and an RWKV model that I found on HF, and both crash during inference in llm_build_copy_mask_state.

Fixes #10109

The github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Oct 31, 2024.
Review comment on the following hunk from the PR diff:

op_tensor = ggml_ssm_conv(ctx, nullptr, w);
// FIXME
ggml_tensor * conv_x = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, 12345, w->ne[1], 6789);
op_tensor = ggml_ssm_conv(ctx, conv_x, w);

danbev (Collaborator) commented on Oct 31, 2024:

Should this be the other way around, with the convolution/filter as the third argument:

op_tensor = ggml_ssm_conv(ctx, w, conv_x);

slaren (Member, Author) replied on Oct 31, 2024:

No, it is important that the weight w is in the same position as when used during inference so that the backend supports_op can check it. This function is called as ggml_ssm_conv(ctx, conv_x, model.layers[il].ssm_conv1d) during inference, so the weight is the third argument.
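
For illustration, a minimal sketch of this probing pattern (not the exact llama.cpp code: the helper name is made up, and the dummy dimensions mirror the placeholder values from the hunk above):

#include "ggml.h"
#include "ggml-backend.h"

// Sketch only: probe whether a backend can run GGML_OP_SSM_CONV with the real
// weight `w` in the same src slot it occupies during inference.
static bool ssm_conv_supported_sketch(struct ggml_context * ctx, ggml_backend_t backend, struct ggml_tensor * w) {
    // dummy activations; the sizes are placeholders, only the argument position matters
    struct ggml_tensor * conv_x = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, 12345, w->ne[1], 6789);

    // weight last, matching ggml_ssm_conv(ctx, conv_x, model.layers[il].ssm_conv1d)
    struct ggml_tensor * op_tensor = ggml_ssm_conv(ctx, conv_x, w);

    // the backend inspects op_tensor (including its srcs) to decide whether it supports the op
    return ggml_backend_supports_op(backend, op_tensor);
}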

danbev (Collaborator) replied:

Ah I did not realize that, thanks for clarifying!

danbev (Collaborator) commented on Oct 31, 2024:

When running this, the error seems to happen when building the graph for gf_pp and not the other graphs before it:

            // reserve again with pp graph to avoid ggml-alloc reallocations during inference
            gf_pp = llama_build_graph(*ctx, ubatch_pp, false);
            if (!ggml_backend_sched_reserve(ctx->sched, gf_pp)) {
                LLAMA_LOG_ERROR("%s: failed to allocate compute buffers\n", __func__);
                llama_free(ctx);
                return nullptr;
            }

Setting a breakpoint on this line and stepping through llama_build_graph, I noticed that the value of n_kv is 0:

(gdb) p llm.n_kv                                                                    
$71 = 0                                                                             

And this will cause the inp_s_mask tensor to have a size of 0:

        lctx.inp_s_mask = ggml_new_tensor_2d(ctx0, GGML_TYPE_F32, 1, n_kv);         
(gdb) p n_kv                                                                    
$72 = 0                                                                         
(gdb) p lctx.inp_s_mask->ne                                                     
$73 = {1, 0, 1, 1}                                                              

Could this be the cause of the error perhaps?
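
For reference, a tiny standalone illustration (an aside, not code from the PR) of what a zero n_kv means for that tensor:

#include "ggml.h"
#include <stdint.h>
#include <stdio.h>

// Sketch only: a tensor created with a 0-sized dimension has zero elements,
// so anything built on top of inp_s_mask ends up operating on no data.
int main(void) {
    struct ggml_init_params params = {
        /*.mem_size   =*/ 16*1024*1024,
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ false,
    };
    struct ggml_context * ctx = ggml_init(params);

    const int64_t n_kv = 0; // what llm.n_kv evaluated to in the failing run
    struct ggml_tensor * inp_s_mask = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 1, n_kv);

    printf("ne = {%lld, %lld, %lld, %lld}, nelements = %lld\n",
           (long long) inp_s_mask->ne[0], (long long) inp_s_mask->ne[1],
           (long long) inp_s_mask->ne[2], (long long) inp_s_mask->ne[3],
           (long long) ggml_nelements(inp_s_mask)); // ne = {1, 0, 1, 1}, nelements = 0

    ggml_free(ctx);
    return 0;
}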

I've tried adding the following to llm_build_context:

diff --git a/src/llama.cpp b/src/llama.cpp                                      
index bedacfcb..517b1eb6 100644                                                 
--- a/src/llama.cpp                                                             
+++ b/src/llama.cpp                                                             
@@ -10257,7 +10257,7 @@ struct llm_build_context {                              
         norm_eps         (hparams.f_norm_eps),                                 
         norm_rms_eps     (hparams.f_norm_rms_eps),                             
         n_tokens         (ubatch.n_tokens),                                    
-        n_kv             (worst_case ? kv_self.size : kv_self.n),              
+        n_kv             (worst_case ? kv_self.size : (kv_self.recurrent ? 1 : kv_self.n)),
         n_outputs        (worst_case ? n_tokens : lctx.n_outputs),             
         n_outputs_enc    (worst_case ? n_tokens : lctx.embd_enc.size() / hparams.n_embd),
         kv_head          (worst_case ? (kv_self.recurrent ? 0 : kv_self.size - n_tokens) : kv_self.head),

With this I'm able to run inference using falcon-mamba-7b-Q4_K_S.gguf.

I'm not sure whether this is a proper fix, but as I'm running out of time today I thought I'd let you know in case it sparks some ideas about this issue. I'd be happy to continue investigating tomorrow if needed/wanted.

slaren (Member, Author) commented on Oct 31, 2024:

Thanks, the worst_case flag was missing during initialization. I also fixed the issue with the CUDA norm being incorrectly reported as supported when the tensor is not contiguous. The operation will run on the CPU instead, but the model will no longer crash when offloaded to CUDA.
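
As a rough illustration of the second point (an editor's sketch, not the actual ggml-cuda patch; the function name is made up and the real check covers many more ops), a supports_op guard of this shape only claims norm ops for contiguous inputs, so the scheduler leaves non-contiguous cases on the CPU backend:

#include "ggml.h"

// Sketch only: the general shape of a backend supports_op guard for norm ops.
static bool supports_op_sketch(const struct ggml_tensor * op) {
    switch (op->op) {
        case GGML_OP_NORM:
        case GGML_OP_RMS_NORM:
            // only claim support when the input is contiguous; otherwise the
            // scheduler keeps the op on the CPU backend
            return ggml_is_contiguous(op->src[0]);
        default:
            return true;
    }
}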

slaren linked an issue on Oct 31, 2024 that may be closed by this pull request.
slaren merged commit c02e5ab into master on Oct 31, 2024 (54 checks passed).
slaren deleted the sl/fix-mamba-rwk-checks branch on October 31, 2024 at 21:54.
Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request on Nov 2, 2024.
arthw pushed commits to arthw/llama.cpp that referenced this pull request on Nov 15 and Nov 18, 2024, with the following message:

* llama : fix buffer checks for mamba and rwk
* llama : fix missing worst case flag during reserve
* cuda : fix supports_op for norm
* disable sched SET_CAUSE
Labels: ggml (changes relating to the ggml tensor library for machine learning)

Successfully merging this pull request may close these issues:

* Bug: Cannot load Mamba model
* Bug: Error when offloading falcon mamba layers on GPU
3 participants