Metal: PP speedup #3084

Merged: 10 commits, Sep 11, 2023

Conversation

ikawrakow
Contributor

Speedup is achieved via

  • Faster soft_max, diag_mask_inf, scale, silu, and gelu kernels (a minimal element-wise kernel sketch follows this list)
  • A new kernel for f16 x f32 matrix multiplications that cannot be done via the usual matrix multiplication kernel because the tensors are not contiguous. This mostly affects the K x Q matrix multiplication, whose importance grows with increasing context / batch size.
  • Tuning of the k_quants dequantization kernels. The effect is strongest for Q5_K and Q6_K.
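
For reference, here is a minimal sketch of what such an element-wise Metal kernel looks like. It is illustrative only and not the optimized code from this PR; the kernel name and the one-element-per-thread mapping are placeholders.

```metal
#include <metal_stdlib>
using namespace metal;

// Illustrative SiLU kernel, one element per thread.
// silu(x) = x * sigmoid(x) = x / (1 + exp(-x))
kernel void kernel_silu_sketch(
        device const float * src,
        device       float * dst,
        uint tpig [[thread_position_in_grid]]) {
    const float x = src[tpig];
    dst[tpig] = x / (1.0f + exp(-x));
}
```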

The table below gives results on a 30-core M2 Max. TG performance is mostly unaffected, but there are some very minor gains here and there. For PP the lion's share of the speedup comes from the new f16 x f32 matrix multiplication kernel, which for some reason does not benefit the Falcon model as much, so this is something left to look into. (A sketch of such a strided multiplication follows the table.)

| model | backend | test | t/s (master) | t/s (PR) | speedup |
|---|---|---|---:|---:|---:|
| Falcon 7B mostly F16 | Metal | pp 512 | 365.11 ± 0.06 | 380.18 ± 0.13 | 1.041 |
| LLaMA 7B mostly F16 | Metal | pp 512 | 489.73 ± 0.32 | 539.46 ± 0.31 | 1.102 |
| LLaMA 7B mostly Q8_0 | Metal | pp 512 | 445.27 ± 0.13 | 486.91 ± 0.18 | 1.094 |
| LLaMA 7B mostly Q4_0 | Metal | pp 512 | 445.65 ± 0.30 | 495.91 ± 0.38 | 1.113 |
| LLaMA 7B mostly Q4_1 | Metal | pp 512 | 448.26 ± 0.11 | 494.31 ± 0.26 | 1.103 |
| LLaMA 7B mostly Q6_K | Metal | pp 512 | 367.58 ± 0.15 | 413.82 ± 0.39 | 1.126 |
| LLaMA 7B mostly Q5_K - Small | Metal | pp 512 | 366.29 ± 0.14 | 413.35 ± 0.27 | 1.128 |
| LLaMA 7B mostly Q4_K - Small | Metal | pp 512 | 399.22 ± 0.28 | 438.64 ± 0.23 | 1.099 |
| LLaMA 7B mostly Q3_K - Small | Metal | pp 512 | 379.32 ± 0.18 | 417.91 ± 0.25 | 1.102 |
| Falcon 7B mostly F16 | Metal | tg 128 | 23.34 ± 0.05 | 23.33 ± 0.04 | 1.000 |
| LLaMA 7B mostly F16 | Metal | tg 128 | 24.25 ± 0.10 | 24.26 ± 0.04 | 1.000 |
| LLaMA 7B mostly Q8_0 | Metal | tg 128 | 40.09 ± 0.34 | 40.16 ± 0.27 | 1.000 |
| LLaMA 7B mostly Q4_0 | Metal | tg 128 | 62.23 ± 0.88 | 62.53 ± 0.06 | 1.005 |
| LLaMA 7B mostly Q4_1 | Metal | tg 128 | 57.93 ± 0.49 | 58.88 ± 0.78 | 1.016 |
| LLaMA 7B mostly Q6_K | Metal | tg 128 | 42.68 ± 0.54 | 42.93 ± 0.48 | 1.006 |
| LLaMA 7B mostly Q5_K - Small | Metal | tg 128 | 43.90 ± 0.65 | 44.67 ± 0.03 | 1.017 |
| LLaMA 7B mostly Q4_K - Small | Metal | tg 128 | 56.42 ± 0.98 | 57.31 ± 0.44 | 1.016 |
| LLaMA 7B mostly Q3_K - Small | Metal | tg 128 | 45.67 ± 0.57 | 46.19 ± 0.50 | 1.011 |
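
To illustrate the second point above, here is a hedged sketch, not the actual kernel added in this PR, of an f16 x f32 matrix-vector product that reads the f16 operand through explicit byte strides, so it can consume a non-contiguous view (such as K taken from the KV cache) without first copying it into contiguous memory. The kernel name, argument layout, and one-row-per-thread mapping are assumptions for illustration.

```metal
#include <metal_stdlib>
using namespace metal;

// Hedged sketch: dst = src0 * src1, where src0 is an f16 matrix with ne1 rows of
// ne0 elements, addressed via byte strides nb0 (within a row) and nb1 (between rows),
// and src1 is a contiguous f32 vector. One thread computes one output row.
kernel void kernel_mul_mat_f16_f32_sketch(
        device const char  * src0,   // f16 matrix data, possibly a strided view
        device const float * src1,   // f32 vector, contiguous
        device       float * dst,    // f32 result, one value per row
        constant   int64_t & ne0,    // elements per row
        constant   int64_t & ne1,    // number of rows
        constant  uint64_t & nb0,    // byte stride between elements within a row
        constant  uint64_t & nb1,    // byte stride between rows
        uint tpig [[thread_position_in_grid]]) {
    if ((int64_t) tpig >= ne1) {
        return;
    }
    float sum = 0.0f;
    for (int64_t i = 0; i < ne0; ++i) {
        device const half * x = (device const half *)(src0 + tpig*nb1 + (uint64_t)i*nb0);
        sum += (float)(*x) * src1[i];
    }
    dst[tpig] = sum;
}
```

A production kernel would process several elements per thread and reduce partial sums within a simdgroup; the point of the sketch is only that explicit strides let the kernel read a non-contiguous operand directly.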

@ikawrakow ikawrakow requested a review from ggerganov September 8, 2023 16:20
@ggerganov
Member

ggerganov commented Sep 9, 2023

M2 Ultra results:

| model | size | test | th | t/s (master) | t/s (PR) | speedup |
|---|---|---|---|---:|---:|---:|
| LLaMA 7B mostly F16 | 12.55 GiB | pp 512 | 4 | 1125.99 ± 0.69 | 1266.25 ± 0.94 | 1.125 |
| LLaMA 7B mostly Q8_0 | 6.67 GiB | pp 512 | 4 | 1029.88 ± 0.27 | 1142.73 ± 0.37 | 1.110 |
| LLaMA 7B mostly Q4_0 | 3.56 GiB | pp 512 | 4 | 1032.08 ± 0.60 | 1163.70 ± 0.44 | 1.128 |
| LLaMA 7B mostly Q4_1 | 3.95 GiB | pp 512 | 4 | 1038.05 ± 0.52 | 1161.42 ± 1.21 | 1.119 |
| LLaMA 7B mostly Q6_K | 5.15 GiB | pp 512 | 4 | 856.86 ± 0.57 | 977.63 ± 0.64 | 1.141 |
| LLaMA 7B mostly Q5_K - Medium | 4.45 GiB | pp 512 | 4 | 857.33 ± 0.36 | 976.55 ± 1.93 | 1.139 |
| LLaMA 7B mostly Q5_K - Small | 4.33 GiB | pp 512 | 4 | 857.07 ± 0.45 | 977.32 ± 0.55 | 1.140 |
| LLaMA 7B mostly Q4_K - Medium | 3.80 GiB | pp 512 | 4 | 920.57 ± 0.36 | 1027.82 ± 0.48 | 1.117 |
| LLaMA 7B mostly Q4_K - Small | 3.59 GiB | pp 512 | 4 | 930.13 ± 0.65 | 1035.05 ± 0.63 | 1.113 |
| LLaMA 7B mostly Q3_K - Medium | 3.07 GiB | pp 512 | 4 | 903.22 ± 0.71 | 1007.75 ± 0.42 | 1.116 |
| LLaMA 7B mostly Q3_K - Small | 2.75 GiB | pp 512 | 4 | 886.80 ± 0.67 | 990.78 ± 0.31 | 1.117 |
| LLaMA 7B mostly F16 | 12.55 GiB | tg 128 | 4 | 40.82 ± 0.02 | 40.90 ± 0.05 | 1.002 |
| LLaMA 7B mostly Q8_0 | 6.67 GiB | tg 128 | 4 | 64.50 ± 0.04 | 64.67 ± 0.06 | 1.003 |
| LLaMA 7B mostly Q4_0 | 3.56 GiB | tg 128 | 4 | 90.25 ± 0.11 | 90.90 ± 0.15 | 1.007 |
| LLaMA 7B mostly Q4_1 | 3.95 GiB | tg 128 | 4 | 85.47 ± 0.09 | 85.88 ± 0.06 | 1.005 |
| LLaMA 7B mostly Q6_K | 5.15 GiB | tg 128 | 4 | 71.01 ± 0.08 | 71.44 ± 0.07 | 1.006 |
| LLaMA 7B mostly Q5_K - Medium | 4.45 GiB | tg 128 | 4 | 72.36 ± 0.07 | 72.59 ± 0.08 | 1.003 |
| LLaMA 7B mostly Q5_K - Small | 4.33 GiB | tg 128 | 4 | 73.75 ± 0.08 | 73.91 ± 0.13 | 1.002 |
| LLaMA 7B mostly Q4_K - Medium | 3.80 GiB | tg 128 | 4 | 83.39 ± 0.13 | 83.69 ± 0.18 | 1.004 |
| LLaMA 7B mostly Q4_K - Small | 3.59 GiB | tg 128 | 4 | 86.59 ± 0.11 | 86.90 ± 0.14 | 1.004 |
| LLaMA 7B mostly Q3_K - Medium | 3.07 GiB | tg 128 | 4 | 83.50 ± 0.10 | 83.69 ± 0.10 | 1.002 |
| LLaMA 7B mostly Q3_K - Small | 2.75 GiB | tg 128 | 4 | 84.73 ± 0.10 | 85.25 ± 0.09 | 1.006 |

Looks like there is a small performance hit on Q3_K TG with the Ultra. I redid the test 2 times, but the numbers are consistent.

Updated the table after rebasing to master

@ikawrakow
Contributor Author

> Looks like there is a small performance hit on Q3_K TG with the Ultra.

I think this is because I did not rebase on the latest master before opening the PR. In the meantime you merged #2995 into master, which brings a significant improvement in TG performance for Q3_K. If you rebase the PR on current master, I expect the Q3_K performance drop to turn into a small performance increase.

Member

@ggerganov ggerganov left a comment


Yup, I've updated the table after rebasing - all good now

Let me take a more detailed look and will merge this later today
