
metal : add FA-vec kernels for head size 96 #12952


Merged on Apr 15, 2025 (1 commit).

ggerganov (Member)

fix #12948

Adds flash-attention vector (FA-vec) kernels for head size 96. This improves text-generation (TG) performance at long contexts for models with a head size of 96, such as Phi-3.
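The PR does not describe the kernel internals. As a rough illustration of the general technique behind a "vec" flash-attention path (one query row attending over a long KV cache, the situation in token generation), the sketch below splits the KV cache into chunks, runs an online softmax over each chunk, and merges the partial results. The function names, the default chunk size of 256, and the chunking scheme are assumptions for illustration only; this is not a transcription of the Metal kernels in ggml, and it models only the math, not the GPU parallelism or memory behavior that produces the speedup.

```python
import math
import random

HEAD_DIM = 96  # head size targeted by this PR (e.g. Phi-3)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention_reference(q, keys, values):
    """Plain softmax(q . K^T / sqrt(d)) . V for a single query vector."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * dot(q, k) for k in keys]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    total = sum(w)
    return [sum(wi * v[d] for wi, v in zip(w, values)) / total
            for d in range(len(values[0]))]


def partial_chunk(q, keys, values, scale):
    """Online softmax over one KV chunk -> (running max, running sum, unnormalized acc)."""
    m, l, acc = -math.inf, 0.0, [0.0] * len(values[0])
    for k, v in zip(keys, values):
        s = scale * dot(q, k)
        m_new = max(m, s)
        corr = math.exp(m - m_new)  # rescales the old state; 0.0 while m is -inf
        p = math.exp(s - m_new)
        l = l * corr + p
        acc = [a * corr + p * x for a, x in zip(acc, v)]
        m = m_new
    return m, l, acc


def merge(parts):
    """Combine per-chunk partial results into the final normalized output."""
    m = max(pm for pm, _, _ in parts)
    l = sum(pl * math.exp(pm - m) for pm, pl, _ in parts)
    out = [0.0] * len(parts[0][2])
    for pm, _, pacc in parts:
        c = math.exp(pm - m)
        for d, a in enumerate(pacc):
            out[d] += a * c
    return [x / l for x in out]


def attention_vec(q, keys, values, chunk=256):
    """Hypothetical chunked single-query attention (chunks could run in parallel)."""
    scale = 1.0 / math.sqrt(len(q))
    parts = [partial_chunk(q, keys[i:i + chunk], values[i:i + chunk], scale)
             for i in range(0, len(keys), chunk)]
    return merge(parts)


if __name__ == "__main__":
    random.seed(0)
    n_kv = 1000  # hypothetical KV-cache length
    q = [random.gauss(0.0, 1.0) for _ in range(HEAD_DIM)]
    keys = [[random.gauss(0.0, 1.0) for _ in range(HEAD_DIM)] for _ in range(n_kv)]
    values = [[random.gauss(0.0, 1.0) for _ in range(HEAD_DIM)] for _ in range(n_kv)]
    ref = attention_reference(q, keys, values)
    out = attention_vec(q, keys, values)
    # Max absolute difference; expected to be at floating-point round-off level.
    print(max(abs(a - b) for a, b in zip(ref, out)))
```

Because each chunk only needs its own running max, sum, and accumulator, the chunks are independent until the final merge, which is what makes this formulation attractive when the KV cache is long and there is only one query row to spread across the GPU.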

./bin/llama-batched-bench -m ../models/phi-3-mini-128k-instruct/ggml-model-q8_0.gguf -c 32768 -b 4096 -ub 4096 -npp 0,512,4096,8192,16384 -ntg 128 -npl 1 -lv 1 -fa
master:
   PP   TG  B   N_KV  T_PP (s)  S_PP (t/s)  T_TG (s)  S_TG (t/s)   T (s)  S (t/s)
    0  128  1    128     0.898        0.00     1.400       91.42   2.298    55.69
  512  128  1    640     0.396     1294.24     1.510       84.74   1.906   335.77
 4096  128  1   4224     2.173     1884.70     2.266       56.49   4.439   951.49
 8192  128  1   8320     4.863     1684.49     3.128       40.92   7.991  1041.18
16384  128  1  16512    12.512     1309.50     4.867       26.30  17.378   950.15

PR:
   PP   TG  B   N_KV  T_PP (s)  S_PP (t/s)  T_TG (s)  S_TG (t/s)   T (s)  S (t/s)
    0  128  1    128     0.968        0.00     1.294       98.92   2.262    56.59
  512  128  1    640     0.421     1217.31     1.345       95.19   1.765   362.55
 4096  128  1   4224     2.150     1904.81     1.686       75.91   3.837  1100.97
 8192  128  1   8320     4.848     1689.73     2.074       61.72   6.922  1201.97
16384  128  1  16512    12.522     1308.46     2.864       44.70  15.385  1073.24
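One way to read the two tables is to divide the PR's TG throughput (S_TG, tokens/s) by master's at each prompt length. The sketch below hard-codes the T_TG and S_TG values copied from the tables, checks that S_TG agrees with TG / T_TG for B = 1 (up to the rounding of the printed times), and prints the per-prompt-length speedup. The helper names are hypothetical.

```python
TG = 128  # tokens generated per run (-ntg 128), with B = 1

# PP -> (T_TG in s, S_TG in t/s), copied from the master and PR tables
master = {0: (1.400, 91.42), 512: (1.510, 84.74), 4096: (2.266, 56.49),
          8192: (3.128, 40.92), 16384: (4.867, 26.30)}
pr = {0: (1.294, 98.92), 512: (1.345, 95.19), 4096: (1.686, 75.91),
      8192: (2.074, 61.72), 16384: (2.864, 44.70)}


def check_consistency(table):
    """S_TG should equal TG / T_TG for B = 1; allow 0.1% slack for rounded T_TG."""
    for pp, (t_tg, s_tg) in table.items():
        assert abs(TG / t_tg - s_tg) / s_tg < 1e-3, pp


def tg_speedup(before, after):
    """Ratio of TG throughput (after / before) for each prompt length PP."""
    return {pp: after[pp][1] / before[pp][1] for pp in before}


check_consistency(master)
check_consistency(pr)
speedup = tg_speedup(master, pr)
for pp, r in speedup.items():
    print(f"PP={pp:>5}: TG speedup {r:.2f}x")
```

The ratio grows with context length, from about 1.08x at PP = 0 to about 1.70x at PP = 16384, while the prompt-processing columns (T_PP, S_PP) stay essentially unchanged.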

ggerganov merged commit f8f820c into master on Apr 15, 2025; all 59 checks passed. The gg/metal-fa-vec-add-h96 branch was deleted on April 15, 2025 at 11:45.
colout pushed a commit to colout/llama.cpp that referenced this pull request Apr 21, 2025
Labels: Apple Metal (https://en.wikipedia.org/wiki/Metal_(API)), ggml (changes relating to the ggml tensor library for machine learning)
Linked issue: low performance in large contex compared to mlx format model