
metal : optimize dequant q6_K kernel #11892


Merged: 1 commit merged into ggml-org:master on Feb 15, 2025

Conversation

@akretz (Contributor) commented on Feb 15, 2025

In this PR I have optimized the Metal q6_K dequantization kernel. The basic idea is to load the weights into 32-bit variables and to do the bit twiddling on them. That gives an overall 4-5% speed improvement on prompt processing on my M1 Max.

| Model | Test | t/s master | t/s dequant_q6_metal | Speedup |
| --- | --- | --- | --- | --- |
| llama 8B Q6_K | pp64 | 289.95 | 303.06 | 1.05 |
| llama 8B Q6_K | pp128 | 320.58 | 335.25 | 1.05 |
| llama 8B Q6_K | pp256 | 332.92 | 347.99 | 1.05 |
| llama 8B Q6_K | pp512 | 339.55 | 354.55 | 1.04 |
| llama 8B Q6_K | pp1024 | 337.06 | 351.88 | 1.04 |
| llama 8B Q6_K | pp2048 | 330.56 | 344.86 | 1.04 |
| qwen2 14B Q6_K | pp64 | 153.16 | 159.54 | 1.04 |
| qwen2 14B Q6_K | pp128 | 169.42 | 176.74 | 1.04 |
| qwen2 14B Q6_K | pp256 | 177.42 | 185.07 | 1.04 |
| qwen2 14B Q6_K | pp512 | 179.47 | 187.02 | 1.04 |
| qwen2 14B Q6_K | pp1024 | 178.11 | 185.53 | 1.04 |
| qwen2 14B Q6_K | pp2048 | 174.79 | 181.95 | 1.04 |
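
For context on the idea described above: Q6_K packs each 6-bit weight as a low nibble in `ql[]` plus a 2-bit remainder in `qh[]`, with per-16-weight scales and a per-super-block scale `d`. The sketch below is not the PR's Metal source; it is a minimal CPU-side C translation of the same idea: assemble four weights at a time by loading `ql`/`qh` bytes into 32-bit variables and doing the masking and shifting on whole words instead of per byte. The struct name `block_q6_K_sketch` and the function name are hypothetical, the fp16 super-block scale is simplified to a `float`, and little-endian byte order is assumed.

```c
#include <stdint.h>
#include <string.h>

#define QK_K 256

// Hypothetical CPU-side mirror of the Q6_K super-block layout
// (the real ggml struct stores the super-block scale as fp16).
typedef struct {
    uint8_t ql[QK_K/2];       // lower 4 bits of each 6-bit weight, 2 per byte
    uint8_t qh[QK_K/4];       // upper 2 bits of each 6-bit weight, 4 per byte
    int8_t  scales[QK_K/16];  // per-16-weight scales
    float   d;                // super-block scale
} block_q6_K_sketch;

// Dequantize one super-block, combining ql/qh four bytes at a time via
// 32-bit loads -- the same bit-twiddling idea the PR applies in the kernel.
static void dequant_q6_K_sketch(const block_q6_K_sketch * x, float * y) {
    const uint8_t * ql = x->ql;
    const uint8_t * qh = x->qh;
    const int8_t  * sc = x->scales;
    const float d = x->d;

    for (int n = 0; n < QK_K; n += 128) {  // two halves of 128 weights
        for (int l = 0; l < 32; l += 4) {
            uint32_t ql0, ql32, qh0;
            memcpy(&ql0,  ql + l,      4); // nibbles for y[l..]   and y[l+64..]
            memcpy(&ql32, ql + l + 32, 4); // nibbles for y[l+32..] and y[l+96..]
            memcpy(&qh0,  qh + l,      4); // high-bit pairs for all four groups

            // Assemble four 6-bit weights per 32-bit word with masked shifts
            // (assumes little-endian byte order).
            const uint32_t q1 = ( ql0        & 0x0F0F0F0F) | ((qh0 << 4) & 0x30303030);
            const uint32_t q2 = ( ql32       & 0x0F0F0F0F) | ((qh0 << 2) & 0x30303030);
            const uint32_t q3 = ((ql0  >> 4) & 0x0F0F0F0F) | ( qh0       & 0x30303030);
            const uint32_t q4 = ((ql32 >> 4) & 0x0F0F0F0F) | ((qh0 >> 2) & 0x30303030);

            for (int k = 0; k < 4; ++k) {
                const int is = (l + k)/16;
                y[l + k +  0] = d * sc[is + 0] * (((int)(q1 >> 8*k) & 0x3F) - 32);
                y[l + k + 32] = d * sc[is + 2] * (((int)(q2 >> 8*k) & 0x3F) - 32);
                y[l + k + 64] = d * sc[is + 4] * (((int)(q3 >> 8*k) & 0x3F) - 32);
                y[l + k + 96] = d * sc[is + 6] * (((int)(q4 >> 8*k) & 0x3F) - 32);
            }
        }
        y  += 128;
        ql += 64;
        qh += 32;
        sc += 8;
    }
}
```

The masks `0x0F0F0F0F` and `0x30303030` extract the low nibble and the relevant 2-bit pair from all four bytes in one operation, which is the kind of per-word work the optimized kernel performs instead of handling each byte separately.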

@github-actions bot added the ggml (changes relating to the ggml tensor library for machine learning) and Apple Metal (https://en.wikipedia.org/wiki/Metal_(API)) labels on Feb 15, 2025
@ggerganov (Member)

M2 Ultra results:

```sh
./scripts/compare-commits.sh master pr/11892 \
    -m models/llama-8b-v3/ggml-model-q6_k.gguf \
    -m models/llama-8b-v3/ggml-model-q4_k.gguf \
    -m models/qwen2.5-7b-coder/ggml-model-q6_k.gguf \
    -m models/qwen2.5-7b-coder/ggml-model-q4_k.gguf \
    -t 1 -fa 1 -p 1,2,3,4,8,16,32,64,128,512
```
| Model | Test | t/s master | t/s pr/11892 | Speedup |
| --- | --- | --- | --- | --- |
| llama 8B Q4_K_M | pp1 | 90.67 | 90.47 | 1.00 |
| llama 8B Q4_K_M | pp2 | 123.40 | 123.06 | 1.00 |
| llama 8B Q4_K_M | pp3 | 142.98 | 143.06 | 1.00 |
| llama 8B Q4_K_M | pp4 | 146.43 | 147.71 | 1.01 |
| llama 8B Q4_K_M | pp8 | 188.94 | 184.88 | 0.98 |
| llama 8B Q4_K_M | pp16 | 233.41 | 235.22 | 1.01 |
| llama 8B Q4_K_M | pp32 | 460.10 | 463.28 | 1.01 |
| llama 8B Q4_K_M | pp64 | 700.10 | 704.51 | 1.01 |
| llama 8B Q4_K_M | pp128 | 922.30 | 930.33 | 1.01 |
| llama 8B Q4_K_M | pp512 | 1135.04 | 1142.44 | 1.01 |
| llama 8B Q4_K_M | tg128 | 90.01 | 90.07 | 1.00 |
| llama 8B Q6_K | pp1 | 74.39 | 74.83 | 1.01 |
| llama 8B Q6_K | pp2 | 97.57 | 97.98 | 1.00 |
| llama 8B Q6_K | pp3 | 107.83 | 108.99 | 1.01 |
| llama 8B Q6_K | pp4 | 143.78 | 153.44 | 1.07 |
| llama 8B Q6_K | pp8 | 178.59 | 196.38 | 1.10 |
| llama 8B Q6_K | pp16 | 221.91 | 232.37 | 1.05 |
| llama 8B Q6_K | pp32 | 438.59 | 457.39 | 1.04 |
| llama 8B Q6_K | pp64 | 668.77 | 695.76 | 1.04 |
| llama 8B Q6_K | pp128 | 874.08 | 915.95 | 1.05 |
| llama 8B Q6_K | pp512 | 1072.01 | 1123.14 | 1.05 |
| llama 8B Q6_K | tg128 | 75.16 | 75.39 | 1.00 |
| qwen2 7B Q4_K_M | pp1 | 91.65 | 91.24 | 1.00 |
| qwen2 7B Q4_K_M | pp2 | 121.88 | 121.57 | 1.00 |
| qwen2 7B Q4_K_M | pp3 | 141.70 | 141.80 | 1.00 |
| qwen2 7B Q4_K_M | pp4 | 157.67 | 158.70 | 1.01 |
| qwen2 7B Q4_K_M | pp8 | 198.63 | 200.83 | 1.01 |
| qwen2 7B Q4_K_M | pp16 | 251.80 | 253.93 | 1.01 |
| qwen2 7B Q4_K_M | pp32 | 496.30 | 500.68 | 1.01 |
| qwen2 7B Q4_K_M | pp64 | 743.22 | 748.85 | 1.01 |
| qwen2 7B Q4_K_M | pp128 | 960.58 | 966.90 | 1.01 |
| qwen2 7B Q4_K_M | pp512 | 1226.83 | 1234.10 | 1.01 |
| qwen2 7B Q4_K_M | tg128 | 90.80 | 90.53 | 1.00 |
| qwen2 7B Q6_K | pp1 | 77.33 | 77.98 | 1.01 |
| qwen2 7B Q6_K | pp2 | 101.18 | 101.86 | 1.01 |
| qwen2 7B Q6_K | pp3 | 114.21 | 114.81 | 1.01 |
| qwen2 7B Q6_K | pp4 | 153.18 | 165.03 | 1.08 |
| qwen2 7B Q6_K | pp8 | 186.73 | 204.02 | 1.09 |
| qwen2 7B Q6_K | pp16 | 238.37 | 249.68 | 1.05 |
| qwen2 7B Q6_K | pp32 | 470.06 | 490.19 | 1.04 |
| qwen2 7B Q6_K | pp64 | 705.50 | 736.11 | 1.04 |
| qwen2 7B Q6_K | pp128 | 899.02 | 944.28 | 1.05 |
| qwen2 7B Q6_K | pp512 | 1156.60 | 1211.02 | 1.05 |
| qwen2 7B Q6_K | tg128 | 77.51 | 77.53 | 1.00 |

Included Q4_K as well, since those models contain some Q6_K tensors; seems like a ~1% improvement there too.

@ggerganov ggerganov merged commit 2288510 into ggml-org:master Feb 15, 2025
42 checks passed
orca-zhang pushed a commit to orca-zhang/llama.cpp that referenced this pull request Feb 26, 2025
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Feb 26, 2025
mglambda pushed a commit to mglambda/llama.cpp that referenced this pull request Mar 8, 2025