
Gemma2: 9B query_pre_attn_scalar = 256 not 224 #8444


Closed
danielhanchen wants to merge 2 commits

Conversation

@danielhanchen (Contributor) commented Jul 12, 2024

See google/gemma_pytorch@03e6575
Gemma 2 9B should use 256, not 224 (i.e. self.config.hidden_size // self.config.num_attention_heads).

Google engineers confirmed that only the 9B is affected, not the 27B. This PR just disables the sanity check, since the latest HF repo for Gemma 2 9B has already been updated - see https://huggingface.co/google/gemma-2-9b-it/blob/1937c70277fcc5f7fb0fc772fc5bc69378996e71/config.json#L24
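
For reference, here is a minimal sketch (not the conversion script itself) of how the two candidate values change the query scaling, using the shapes from the Gemma 2 9B config linked above (hidden_size 3584, 16 attention heads, head_dim 256):

# Minimal sketch: attention scaling implied by the two query_pre_attn_scalar values.
# Shapes are taken from the Gemma 2 9B HF config linked above.
hidden_size = 3584
num_attention_heads = 16
head_dim = 256

old_scalar = hidden_size // num_attention_heads  # 224 - what the old sanity check expected
new_scalar = head_dim                            # 256 - what google/gemma_pytorch now uses

print(old_scalar ** -0.5)  # ~0.0668, i.e. queries scaled by 1/sqrt(224)
print(new_scalar ** -0.5)  # 0.0625,  i.e. queries scaled by 1/sqrt(256)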

@github-actions bot added the python (python script changes) label Jul 12, 2024
@ggerganov (Member)

Need to also apply the following change to llama.cpp so that the correct scaling factor is used in the code:

diff --git a/src/llama.cpp b/src/llama.cpp
index f91ac777..7aa8f676 100644
--- a/src/llama.cpp
+++ b/src/llama.cpp
@@ -11679,7 +11679,12 @@ struct llm_build_context {
                         ext_factor, attn_factor, beta_fast, beta_slow);
                 cb(Qcur, "Qcur", il);
 
-                Qcur = ggml_scale(ctx0, Qcur, 1.0f / sqrtf(float(n_embd / n_head)));
+                // ref: https://github.com/google/gemma_pytorch/commit/03e657582d17cb5a8617ebf333c1c16f3694670e
+                switch (model.type) {
+                    case e_model::MODEL_9B:  Qcur = ggml_scale(ctx0, Qcur, 1.0f / sqrtf(float(n_embd_head_k)));   break;
+                    case e_model::MODEL_27B: Qcur = ggml_scale(ctx0, Qcur, 1.0f / sqrtf(float(n_embd / n_head))); break;
+                    default: GGML_ASSERT(false);
+                };
                 cb(Qcur, "Qcur_scaled", il);
 
                 Kcur = ggml_rope_ext(
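
For reference, a small Python sketch of the dispatch above, using llama.cpp's variable names. The 9B shapes come from the config discussed in this PR; the 27B shapes (hidden size 4608, 32 heads, head_dim 128) are my assumption from the published 27B config, so double-check them before relying on the printed values:

# Sketch of the per-model query scaling that the patch above selects (Python for brevity).
# n_embd / n_head / n_embd_head_k follow llama.cpp naming.
import math

def gemma2_q_scale(model_type: str, n_embd: int, n_head: int, n_embd_head_k: int) -> float:
    if model_type == "9B":
        return 1.0 / math.sqrt(n_embd_head_k)    # 1/sqrt(head_dim) = 1/sqrt(256)
    if model_type == "27B":
        return 1.0 / math.sqrt(n_embd / n_head)  # 1/sqrt(hidden_size / n_head), unchanged
    raise ValueError(f"unexpected Gemma 2 model type: {model_type}")

print(gemma2_q_scale("9B", 3584, 16, 256))    # 0.0625
print(gemma2_q_scale("27B", 4608, 32, 128))   # assumed 27B shapes -> 1/sqrt(144) ~ 0.0833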

@mofosyne added the Review Complexity : Low (trivial changes that most beginner devs, or those who want a break, can tackle, e.g. a UI fix) label Jul 13, 2024
@qnixsynapse (Collaborator)

BTW, regenerating Gemma GGUF files from the HF repos is currently broken because of this.

Anything left before this can be merged?

@ggerganov (Member)

Continued in #8473

@ggerganov closed this Jul 13, 2024