
llama : fix buffer checks for mamba and rwk #10111


Merged: 4 commits from sl/fix-mamba-rwk-checks into master on Oct 31, 2024

Conversation

slaren (Member) commented on Oct 31, 2024:

Added random values to pass the asserts, but I don't know if they make sense.

The models load, but I wasn't able to run them. I tried a Mamba model and an RWKV model that I found on HF, and both crash during inference in llm_build_copy_mask_state.

Fixes #10109

The github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Oct 31, 2024.
Review comment on the following hunk from the PR diff:

op_tensor = ggml_ssm_conv(ctx, nullptr, w);
// FIXME
ggml_tensor * conv_x = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, 12345, w->ne[1], 6789);
op_tensor = ggml_ssm_conv(ctx, conv_x, w);

danbev (Collaborator) commented on Oct 31, 2024:

Should this be the other way around, with the convolution/filter as the third argument:

op_tensor = ggml_ssm_conv(ctx, w, conv_x);

slaren (Member, Author) replied on Oct 31, 2024:

No, it is important that the weight w is in the same position as when used during inference so that the backend supports_op can check it. This function is called as ggml_ssm_conv(ctx, conv_x, model.layers[il].ssm_conv1d) during inference, so the weight is the third argument.
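
For illustration, a minimal sketch of this probing pattern (not the exact llama.cpp code: the helper name is made up, and the dummy dimensions mirror the placeholder values from the hunk above):

#include "ggml.h"
#include "ggml-backend.h"

// Sketch only: probe whether a backend can run GGML_OP_SSM_CONV with the real
// weight `w` in the same src slot it occupies during inference.
static bool ssm_conv_supported_sketch(struct ggml_context * ctx, ggml_backend_t backend, struct ggml_tensor * w) {
    // dummy activations; the sizes are placeholders, only the argument position matters
    struct ggml_tensor * conv_x = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, 12345, w->ne[1], 6789);

    // weight last, matching ggml_ssm_conv(ctx, conv_x, model.layers[il].ssm_conv1d)
    struct ggml_tensor * op_tensor = ggml_ssm_conv(ctx, conv_x, w);

    // the backend inspects op_tensor (including its srcs) to decide whether it supports the op
    return ggml_backend_supports_op(backend, op_tensor);
}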

danbev (Collaborator) replied:

Ah I did not realize that, thanks for clarifying!

danbev (Collaborator) commented on Oct 31, 2024:

When running this, the error seems to happen when building the graph for gf_pp and not the other graphs before it:

            // reserve again with pp graph to avoid ggml-alloc reallocations during inference
            gf_pp = llama_build_graph(*ctx, ubatch_pp, false);
            if (!ggml_backend_sched_reserve(ctx->sched, gf_pp)) {
                LLAMA_LOG_ERROR("%s: failed to allocate compute buffers\n", __func__);
                llama_free(ctx);
                return nullptr;
            }

Setting a breakpoint on this line and stepping through llama_build_graph, I noticed that the value of n_kv is 0:

(gdb) p llm.n_kv                                                                    
$71 = 0                                                                             

And this will cause the inp_s_mask tensor to have a size of 0:

        lctx.inp_s_mask = ggml_new_tensor_2d(ctx0, GGML_TYPE_F32, 1, n_kv);         
(gdb) p n_kv                                                                    
$72 = 0                                                                         
(gdb) p lctx.inp_s_mask->ne                                                     
$73 = {1, 0, 1, 1}                                                              

Could this be the cause of the error perhaps?
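
For reference, a tiny standalone illustration (an aside, not code from the PR) of what a zero n_kv means for that tensor:

#include "ggml.h"
#include <stdint.h>
#include <stdio.h>

// Sketch only: a tensor created with a 0-sized dimension has zero elements,
// so anything built on top of inp_s_mask ends up operating on no data.
int main(void) {
    struct ggml_init_params params = {
        /*.mem_size   =*/ 16*1024*1024,
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ false,
    };
    struct ggml_context * ctx = ggml_init(params);

    const int64_t n_kv = 0; // what llm.n_kv evaluated to in the failing run
    struct ggml_tensor * inp_s_mask = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 1, n_kv);

    printf("ne = {%lld, %lld, %lld, %lld}, nelements = %lld\n",
           (long long) inp_s_mask->ne[0], (long long) inp_s_mask->ne[1],
           (long long) inp_s_mask->ne[2], (long long) inp_s_mask->ne[3],
           (long long) ggml_nelements(inp_s_mask)); // ne = {1, 0, 1, 1}, nelements = 0

    ggml_free(ctx);
    return 0;
}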

I've tried adding the following to llm_build_context:

diff --git a/src/llama.cpp b/src/llama.cpp                                      
index bedacfcb..517b1eb6 100644                                                 
--- a/src/llama.cpp                                                             
+++ b/src/llama.cpp                                                             
@@ -10257,7 +10257,7 @@ struct llm_build_context {                              
         norm_eps         (hparams.f_norm_eps),                                 
         norm_rms_eps     (hparams.f_norm_rms_eps),                             
         n_tokens         (ubatch.n_tokens),                                    
-        n_kv             (worst_case ? kv_self.size : kv_self.n),              
+        n_kv             (worst_case ? kv_self.size : (kv_self.recurrent ? 1 : kv_self.n)),
         n_outputs        (worst_case ? n_tokens : lctx.n_outputs),             
         n_outputs_enc    (worst_case ? n_tokens : lctx.embd_enc.size() / hparams.n_embd),
         kv_head          (worst_case ? (kv_self.recurrent ? 0 : kv_self.size - n_tokens) : kv_self.head),

With this I'm able to run inference using falcon-mamba-7b-Q4_K_S.gguf.

I'm not sure whether this is a proper fix, but as I'm running out of time today I thought I'd let you know in case it sparks some ideas about this issue. I'd be happy to continue investigating tomorrow if needed/wanted.

slaren (Member, Author) commented on Oct 31, 2024:

Thanks, the worst_case flag was missing during initialization. I also fixed the issue with the CUDA norm being incorrectly reported as supported when the tensor is not contiguous. The operation will run on the CPU instead, but the model will no longer crash when offloaded to CUDA.
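
As a rough illustration of the second point (an editor's sketch, not the actual ggml-cuda patch; the function name is made up and the real check covers many more ops), a supports_op guard of this shape only claims norm ops for contiguous inputs, so the scheduler leaves non-contiguous cases on the CPU backend:

#include "ggml.h"

// Sketch only: the general shape of a backend supports_op guard for norm ops.
static bool supports_op_sketch(const struct ggml_tensor * op) {
    switch (op->op) {
        case GGML_OP_NORM:
        case GGML_OP_RMS_NORM:
            // only claim support when the input is contiguous; otherwise the
            // scheduler keeps the op on the CPU backend
            return ggml_is_contiguous(op->src[0]);
        default:
            return true;
    }
}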

slaren linked an issue on Oct 31, 2024 that may be closed by this pull request.
slaren merged commit c02e5ab into master on Oct 31, 2024 (54 checks passed).
slaren deleted the sl/fix-mamba-rwk-checks branch on October 31, 2024 at 21:54.
Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request on Nov 2, 2024.
arthw pushed commits to arthw/llama.cpp that referenced this pull request on Nov 15 and Nov 18, 2024, with the following message:

* llama : fix buffer checks for mamba and rwk
* llama : fix missing worst case flag during reserve
* cuda : fix supports_op for norm
* disable sched SET_CAUSE
Labels: ggml (changes relating to the ggml tensor library for machine learning)

Successfully merging this pull request may close these issues:

* Bug: Cannot load Mamba model
* Bug: Error when offloading falcon mamba layers on GPU
3 participants