metal : improve FA + improve MoE #12612
Conversation
The branch was force-pushed from 6917e63 to 1e0f5ad.
I'm getting this error when running https://huggingface.co/ggml-org/gemma-3-4b-it-GGUF […]
@ggerganov No, still getting the same error even after a clean rebuild.
And you are sure that you checked out the branch from #12659?
Yes, I applied the patch onto the current master.
@PkmX I pushed one more change there - can you check if it works now? It's strange that it fails for you. I tested that the […]
I did a little more debugging and found that the […] The following patch seems to work around the problem.

```diff
diff --git a/ggml/src/ggml-metal/ggml-metal.metal b/ggml/src/ggml-metal/ggml-metal.metal
index 54a92247..e82bb5dd 100644
--- a/ggml/src/ggml-metal/ggml-metal.metal
+++ b/ggml/src/ggml-metal/ggml-metal.metal
@@ -3958,7 +3958,7 @@ kernel void kernel_flash_attn_ext_vec(
     half, half4, \
     half4
 
-typedef decltype(kernel_flash_attn_ext_vec<FA_TYPES, half4, 1, dequantize_f16_t4, half4, 1, dequantize_f16_t4, 128, 128, 128>) flash_attn_ext_vec_t;
+typedef decltype(kernel_flash_attn_ext_vec<FA_TYPES, half4, 1, dequantize_f16_t4, half4, 1, dequantize_f16_t4, 128, 128, 4>) flash_attn_ext_vec_t;
 
 template [[host_name("kernel_flash_attn_ext_vec_f16_h128")]] kernel flash_attn_ext_vec_t kernel_flash_attn_ext_vec<FA_TYPES, half4, 1, dequantize_f16_t4, half4, 1, dequantize_f16_t4, 128, 128, 4>;
 #if defined(GGML_METAL_USE_BF16)
```
Right, this is indeed a mistake. Thanks for pinpointing it.
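For readers unfamiliar with the pattern: the typedef is built with `decltype` over one concrete specialization, and it is then reused to declare the `[[host_name(...)]]` instantiations, so its template arguments are expected to mirror those instantiations. A minimal C++ sketch of the idea (hypothetical names and signature, not the actual Metal kernel):

```cpp
#include <cstdio>

// Stand-in for kernel_flash_attn_ext_vec; the last parameter (NE) controls
// per-simdgroup work in the real kernel and does not appear in the signature.
template <int DK, int DV, int NE>
void flash_attn_vec(const float * q, float * out) {
    (void) q; (void) out;
    std::printf("DK=%d DV=%d NE=%d\n", DK, DV, NE);
}

// The typedef names the function type of one concrete specialization. Writing
// <128, 128, 128> here while instantiating <128, 128, 4> below is the kind of
// mismatch the patch above fixes.
typedef decltype(flash_attn_vec<128, 128, 4>) flash_attn_vec_t;

// Analogous to the [[host_name(...)]] instantiation lines in ggml-metal.metal.
template void flash_attn_vec<128, 128, 4>(const float *, float *);

int main() {
    flash_attn_vec_t * fn = flash_attn_vec<128, 128, 4>; // pick the variant
    fn(nullptr, nullptr);
    return 0;
}
```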
Overview
- Support `head_size_k != head_size_v` in the FA kernels => V cache quantization support for DeepSeek (a scalar reference sketch is included in the Q8_0 KV cache section below)
- Dispatch `mul_mat_id` based on the number of rows per expert (huge bottleneck for DeepSeek models), as sketched after this list (9c2b783)
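A minimal sketch of the dispatch idea from the second item (names and the threshold are illustrative only, not the actual ggml-metal code): with many experts, each expert typically receives only a few rows of the batch, so the matrix-matrix vs matrix-vector kernel choice should be driven by rows per expert rather than by the total batch size.

```cpp
#include <cstdint>

enum kernel_kind { KERNEL_MAT_VEC, KERNEL_MAT_MAT };

// n_rows: total rows in the batch; n_expert_used: experts selected per token;
// n_expert: total number of experts. Hypothetical helper for illustration.
kernel_kind choose_mul_mat_id_kernel(int64_t n_rows, int64_t n_expert_used, int64_t n_expert) {
    // expected number of rows routed to each expert (rounded up)
    const int64_t rows_per_expert = (n_rows*n_expert_used + n_expert - 1)/n_expert;

    // below some small per-expert row count the mat-vec path wins
    // (the threshold of 32 is made up for this sketch)
    return rows_per_expert < 32 ? KERNEL_MAT_VEC : KERNEL_MAT_MAT;
}
```

The point is that for a model like DeepSeek with a large expert count, a sizeable batch can still translate into only a handful of rows per expert, where a matrix-matrix kernel is a poor fit.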
M2 Studio results

Improved DeepSeek V2 Lite PP (prompt processing) and TG (text generation) perf
Improved DeepSeek V2 Lite large context perf
```sh
make -j && ./bin/llama-batched-bench -m ../models/deepseek-v2-lite-chat/ggml-model-q8_0.gguf -c 16384 -b 2048 -ub 512 -npp 512,4096,8192 -ntg 128 -npl 1 -lv 1 -fa
```
master: […]
PR: […]

DeepSeek V3 IQ1_S
```sh
make -j && ./bin/llama-bench -m unsloth_DeepSeek-V3-0324-GGUF_UD-IQ1_S_DeepSeek-V3-0324-UD-IQ1_S-00001-of-00004.gguf -fa 1
make -j && ./bin/llama-batched-bench -m unsloth_DeepSeek-V3-0324-GGUF_UD-IQ1_S_DeepSeek-V3-0324-UD-IQ1_S-00001-of-00004.gguf -c 2048 -b 2048 -ub 512 -npp 1,1500 -ntg 128 -npl 1 -lv 1 -fa -ctk q8_0 -ctv q8_0
```
DeepSeek V2 Lite with Q8_0 KV cache
```sh
make -j && ./bin/llama-batched-bench -m ../models/deepseek-v2-lite-chat/ggml-model-q8_0.gguf -c 16384 -b 2048 -ub 512 -npp 512,4096,8192 -ntg 128 -npl 1 -lv 1 -fa -ctk q8_0 -ctv q8_0
```
master: not supported
PR: […]
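For context on why master reports "not supported" here: DeepSeek attention uses `head_size_k != head_size_v`, and the quantized V cache path requires FA kernels that handle the two sizes separately (the first Overview item). A scalar reference of attention with distinct K and V head sizes, illustrative only and not the Metal implementation:

```cpp
#include <cmath>
#include <vector>

// Single-query attention where K heads have DK dims and V heads have DV dims
// (DeepSeek-style, e.g. DK=192, DV=128).
void attn_ref(const float * q,   // [DK]
              const float * k,   // [n_kv][DK]
              const float * v,   // [n_kv][DV]
              float * out,       // [DV]
              int n_kv, int DK, int DV, float scale) {
    std::vector<float> s(n_kv);
    float smax = -INFINITY;
    for (int i = 0; i < n_kv; ++i) {   // scores use the K head size
        float dot = 0.0f;
        for (int d = 0; d < DK; ++d) dot += q[d]*k[i*DK + d];
        s[i] = dot*scale;
        smax = std::fmax(smax, s[i]);
    }
    float sum = 0.0f;
    for (int i = 0; i < n_kv; ++i) { s[i] = std::exp(s[i] - smax); sum += s[i]; }
    for (int d = 0; d < DV; ++d) out[d] = 0.0f;
    for (int i = 0; i < n_kv; ++i) {   // the output uses the V head size
        const float w = s[i]/sum;
        for (int d = 0; d < DV; ++d) out[d] += w*v[i*DV + d];
    }
}
```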
Improved Gemma TG perf
F16 KV cache:
```sh
make -j && ./bin/llama-batched-bench -m ../models/gemma-2-9b/ggml-model-q4_k.gguf -c 16384 -b 2048 -ub 2048 -npp 512,4096,8192 -ntg 128 -npl 1 -lv 1 -fa
```
master: […]
PR: […]

Q8_0 KV cache:
```sh
make -j && ./bin/llama-batched-bench -m ../models/gemma-2-9b/ggml-model-q4_k.gguf -c 16384 -b 2048 -ub 2048 -npp 512,4096,8192 -ntg 128 -npl 1 -lv 1 -fa -ctk q8_0 -ctv q8_0
```
master: […]
PR: […]