
Gemma2: 9B query_pre_attn_scalar = 256 not 224 #8444


Closed
danielhanchen wants to merge 2 commits

Conversation

@danielhanchen (Contributor) commented Jul 12, 2024

See google/gemma_pytorch@03e6575
Gemma 2 9B should use 256, not 224 (i.e. self.config.hidden_size // self.config.num_attention_heads).

Google engineers confirmed that only the 9B is affected, not the 27B. This PR just disables the sanity check, since the latest HF repo for Gemma 2 9B has already been updated - see https://huggingface.co/google/gemma-2-9b-it/blob/1937c70277fcc5f7fb0fc772fc5bc69378996e71/config.json#L24
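
For reference, here is a minimal sketch (not the conversion script itself) of how the two candidate values change the query scaling, using the shapes from the Gemma 2 9B config linked above (hidden_size 3584, 16 attention heads, head_dim 256):

# Minimal sketch: attention scaling implied by the two query_pre_attn_scalar values.
# Shapes are taken from the Gemma 2 9B HF config linked above.
hidden_size = 3584
num_attention_heads = 16
head_dim = 256

old_scalar = hidden_size // num_attention_heads  # 224 - what the old sanity check expected
new_scalar = head_dim                            # 256 - what google/gemma_pytorch now uses

print(old_scalar ** -0.5)  # ~0.0668, i.e. queries scaled by 1/sqrt(224)
print(new_scalar ** -0.5)  # 0.0625,  i.e. queries scaled by 1/sqrt(256)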

@github-actions bot added the python (python script changes) label Jul 12, 2024
@ggerganov (Member)

Need to also apply the following change to llama.cpp so that the correct scaling factor is used in the code:

diff --git a/src/llama.cpp b/src/llama.cpp
index f91ac777..7aa8f676 100644
--- a/src/llama.cpp
+++ b/src/llama.cpp
@@ -11679,7 +11679,12 @@ struct llm_build_context {
                         ext_factor, attn_factor, beta_fast, beta_slow);
                 cb(Qcur, "Qcur", il);
 
-                Qcur = ggml_scale(ctx0, Qcur, 1.0f / sqrtf(float(n_embd / n_head)));
+                // ref: https://github.com/google/gemma_pytorch/commit/03e657582d17cb5a8617ebf333c1c16f3694670e
+                switch (model.type) {
+                    case e_model::MODEL_9B:  Qcur = ggml_scale(ctx0, Qcur, 1.0f / sqrtf(float(n_embd_head_k)));   break;
+                    case e_model::MODEL_27B: Qcur = ggml_scale(ctx0, Qcur, 1.0f / sqrtf(float(n_embd / n_head))); break;
+                    default: GGML_ASSERT(false);
+                };
                 cb(Qcur, "Qcur_scaled", il);
 
                 Kcur = ggml_rope_ext(
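
For reference, a small Python sketch of the dispatch above, using llama.cpp's variable names. The 9B shapes come from the config discussed in this PR; the 27B shapes (hidden size 4608, 32 heads, head_dim 128) are my assumption from the published 27B config, so double-check them before relying on the printed values:

# Sketch of the per-model query scaling that the patch above selects (Python for brevity).
# n_embd / n_head / n_embd_head_k follow llama.cpp naming.
import math

def gemma2_q_scale(model_type: str, n_embd: int, n_head: int, n_embd_head_k: int) -> float:
    if model_type == "9B":
        return 1.0 / math.sqrt(n_embd_head_k)    # 1/sqrt(head_dim) = 1/sqrt(256)
    if model_type == "27B":
        return 1.0 / math.sqrt(n_embd / n_head)  # 1/sqrt(hidden_size / n_head), unchanged
    raise ValueError(f"unexpected Gemma 2 model type: {model_type}")

print(gemma2_q_scale("9B", 3584, 16, 256))    # 0.0625
print(gemma2_q_scale("27B", 4608, 32, 128))   # assumed 27B shapes -> 1/sqrt(144) ~ 0.0833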

@mofosyne added the Review Complexity : Low (trivial changes that most beginner devs, or those who want a break, can tackle, e.g. a UI fix) label Jul 13, 2024
@qnixsynapse (Collaborator)

BTW, regenerating Gemma GGUF files from the HF repos is currently broken because of this.

Anything left before this can be merged?

@ggerganov (Member)

Continued in #8473

@ggerganov closed this Jul 13, 2024