Metal: PP speedup #3084

ikawrakow · 2023-09-08T16:20:36Z

Speedup is achieved via:

- Faster soft_max, diag_mask_inf, scale, silu, and gelu kernels.
- A new kernel for f16 x f32 matrix multiplications that cannot go through the usual matrix multiplication kernel because the tensors are not contiguous. This mostly affects the K x Q matrix multiplication, whose importance grows with increasing context / batch size.
- Tuning of the k_quants dequantization kernels. The effect is strongest for Q5_K and Q6_K.
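To illustrate what "not contiguous" means for the K x Q product, here is a minimal sketch that computes element strides for a densely packed tensor and for a permuted view of it. The shape (128, 32, 512), read as head size, head count, and token count, and the permutation order are assumptions chosen for illustration. The helper names are hypothetical and this is not ggml code.

```python
def dense_strides(shape):
    """Element strides of a densely packed tensor; shape[0] varies fastest."""
    strides, step = [], 1
    for extent in shape:
        strides.append(step)
        step *= extent
    return strides


def permute(shape, strides, order):
    """Reorder dimensions without moving any data (a view)."""
    return [shape[i] for i in order], [strides[i] for i in order]


def is_contiguous(shape, strides):
    """A view is contiguous when its strides match a dense layout of its shape."""
    return strides == dense_strides(shape)


# Assumed layout of cached K: (head_size, n_head, n_tokens), densely packed.
k_shape = [128, 32, 512]
k_strides = dense_strides(k_shape)

# Assumed view used for K x Q: swap the head and token dimensions.
p_shape, p_strides = permute(k_shape, k_strides, (0, 2, 1))

print(is_contiguous(k_shape, k_strides))  # dense original
print(is_contiguous(p_shape, p_strides))  # permuted view
print(p_shape, p_strides)
```

In the permuted view, consecutive rows are no longer adjacent in memory, so a kernel that assumes a dense layout cannot be applied to it directly.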

The table gives results on a 30-core M2 Max. TG (token generation) performance is mostly unaffected, with some very minor gains here and there. For PP (prompt processing), the lion's share of the speedup comes from the new f16 x f32 matrix multiplication kernel, which somehow benefits the Falcon model much less, so that is something left to look into.

model	backend	test	t/s (Master)	t/s (PR)	Speedup
Falcon 7B mostly F16	Metal	pp 512	365.11 ± 0.06	380.18 ± 0.13	1.041
LLaMA 7B mostly F16	Metal	pp 512	489.73 ± 0.32	539.46 ± 0.31	1.102
LLaMA 7B mostly Q8_0	Metal	pp 512	445.27 ± 0.13	486.91 ± 0.18	1.094
LLaMA 7B mostly Q4_0	Metal	pp 512	445.65 ± 0.30	495.91 ± 0.38	1.113
LLaMA 7B mostly Q4_1	Metal	pp 512	448.26 ± 0.11	494.31 ± 0.26	1.103
LLaMA 7B mostly Q6_K	Metal	pp 512	367.58 ± 0.15	413.82 ± 0.39	1.126
LLaMA 7B mostly Q5_K - Small	Metal	pp 512	366.29 ± 0.14	413.35 ± 0.27	1.128
LLaMA 7B mostly Q4_K - Small	Metal	pp 512	399.22 ± 0.28	438.64 ± 0.23	1.099
LLaMA 7B mostly Q3_K - Small	Metal	pp 512	379.32 ± 0.18	417.91 ± 0.25	1.102
Falcon 7B mostly F16	Metal	tg 128	23.34 ± 0.05	23.33 ± 0.04	1.000
LLaMA 7B mostly F16	Metal	tg 128	24.25 ± 0.10	24.26 ± 0.04	1.000
LLaMA 7B mostly Q8_0	Metal	tg 128	40.09 ± 0.34	40.16 ± 0.27	1.000
LLaMA 7B mostly Q4_0	Metal	tg 128	62.23 ± 0.88	62.53 ± 0.06	1.005
LLaMA 7B mostly Q4_1	Metal	tg 128	57.93 ± 0.49	58.88 ± 0.78	1.016
LLaMA 7B mostly Q6_K	Metal	tg 128	42.68 ± 0.54	42.93 ± 0.48	1.006
LLaMA 7B mostly Q5_K - Small	Metal	tg 128	43.90 ± 0.65	44.67 ± 0.03	1.017
LLaMA 7B mostly Q4_K - Small	Metal	tg 128	56.42 ± 0.98	57.31 ± 0.44	1.016
LLaMA 7B mostly Q3_K - Small	Metal	tg 128	45.67 ± 0.57	46.19 ± 0.50	1.011
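For reference, the Speedup column appears to be the PR throughput divided by the master throughput, rounded to three decimals. The short sketch below recomputes a few rows from the table under that assumption; the function name `speedup` is hypothetical.

```python
def speedup(master_tps, pr_tps):
    """Ratio of PR throughput to master throughput, rounded to 3 decimals."""
    return round(pr_tps / master_tps, 3)


# (master t/s, PR t/s) mean values copied from the M2 Max table above.
rows = {
    "Falcon 7B mostly F16, pp 512": (365.11, 380.18),
    "LLaMA 7B mostly F16, pp 512": (489.73, 539.46),
    "LLaMA 7B mostly Q4_0, pp 512": (445.65, 495.91),
    "LLaMA 7B mostly Q4_0, tg 128": (62.23, 62.53),
}

for name, (master_tps, pr_tps) in rows.items():
    print(f"{name}: {speedup(master_tps, pr_tps):.3f}")
```

The ratio uses only the mean values; the ± spreads reported in the table are not propagated.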

ggerganov · 2023-09-09T08:03:28Z

M2 Ultra results:

model	size	test	threads	master t/s	PR t/s	speedup
LLaMA 7B mostly F16	12.55 GiB	pp 512	4	1125.99 ± 0.69	1266.25 ± 0.94	1.125
LLaMA 7B mostly Q8_0	6.67 GiB	pp 512	4	1029.88 ± 0.27	1142.73 ± 0.37	1.110
LLaMA 7B mostly Q4_0	3.56 GiB	pp 512	4	1032.08 ± 0.60	1163.70 ± 0.44	1.128
LLaMA 7B mostly Q4_1	3.95 GiB	pp 512	4	1038.05 ± 0.52	1161.42 ± 1.21	1.119
LLaMA 7B mostly Q6_K	5.15 GiB	pp 512	4	856.86 ± 0.57	977.63 ± 0.64	1.141
LLaMA 7B mostly Q5_K - Medium	4.45 GiB	pp 512	4	857.33 ± 0.36	976.55 ± 1.93	1.139
LLaMA 7B mostly Q5_K - Small	4.33 GiB	pp 512	4	857.07 ± 0.45	977.32 ± 0.55	1.140
LLaMA 7B mostly Q4_K - Medium	3.80 GiB	pp 512	4	920.57 ± 0.36	1027.82 ± 0.48	1.117
LLaMA 7B mostly Q4_K - Small	3.59 GiB	pp 512	4	930.13 ± 0.65	1035.05 ± 0.63	1.113
LLaMA 7B mostly Q3_K - Medium	3.07 GiB	pp 512	4	903.22 ± 0.71	1007.75 ± 0.42	1.116
LLaMA 7B mostly Q3_K - Small	2.75 GiB	pp 512	4	886.80 ± 0.67	990.78 ± 0.31	1.117
LLaMA 7B mostly F16	12.55 GiB	tg 128	4	40.82 ± 0.02	40.90 ± 0.05	1.002
LLaMA 7B mostly Q8_0	6.67 GiB	tg 128	4	64.50 ± 0.04	64.67 ± 0.06	1.003
LLaMA 7B mostly Q4_0	3.56 GiB	tg 128	4	90.25 ± 0.11	90.90 ± 0.15	1.007
LLaMA 7B mostly Q4_1	3.95 GiB	tg 128	4	85.47 ± 0.09	85.88 ± 0.06	1.005
LLaMA 7B mostly Q6_K	5.15 GiB	tg 128	4	71.01 ± 0.08	71.44 ± 0.07	1.006
LLaMA 7B mostly Q5_K - Medium	4.45 GiB	tg 128	4	72.36 ± 0.07	72.59 ± 0.08	1.003
LLaMA 7B mostly Q5_K - Small	4.33 GiB	tg 128	4	73.75 ± 0.08	73.91 ± 0.13	1.002
LLaMA 7B mostly Q4_K - Medium	3.80 GiB	tg 128	4	83.39 ± 0.13	83.69 ± 0.18	1.004
LLaMA 7B mostly Q4_K - Small	3.59 GiB	tg 128	4	86.59 ± 0.11	86.90 ± 0.14	1.004
LLaMA 7B mostly Q3_K - Medium	3.07 GiB	tg 128	4	83.50 ± 0.10	83.69 ± 0.10	1.002
LLaMA 7B mostly Q3_K - Small	2.75 GiB	tg 128	4	84.73 ± 0.10	85.25 ± 0.09	1.006

~~Looks like there is a small performance hit on Q3_K TG with the Ultra. I redid the test 2 times, but the numbers are consistent.~~

Updated the table after rebasing to master

ikawrakow · 2023-09-09T09:23:16Z

> Looks like there is a small performance hit on Q3_K TG with the Ultra.

I think this is because I did not rebase on latest master before opening the PR. In the meantime you merged #2995 into master, which brings significant improvement in TG for Q3_K. If you rebase the PR on current master, I expect the Q3_K performance drop to turn into a small performance increase.

ggerganov

Yup, I've updated the table after rebasing - all good now

Let me take a more detailed look and will merge this later today

Kawrakow added 8 commits September 7, 2023 16:01

Minor speed gains for all quantization types

9a90106

metal: faster kernel_scale via float4

7c8c6ce

Various other speedups for "small" kernels

2699cac

metal: faster soft_max via float4

43ca769

metal: faster diagonal infinity

fa5a989

Although, to me it looks like one should simply fuse scale + diagonal infinity + soft_max on the KQ tensor.

Another faster f16 x f32 matrix multiply kernel

4560acc

Reverting the diag infinity change

4fc615e

It does work for PP, but somehow it fails for TG. Need to look more into it.

metal: add back faster diagonal infinity

7331d1e

This time more carefully

ikawrakow requested a review from ggerganov September 8, 2023 16:20

ggerganov approved these changes Sep 9, 2023


ggerganov added 2 commits September 11, 2023 10:25

Merge branch 'master' into ik/metal_pp

0c17b08

metal : minor (readability)

211d82a

ggerganov merged commit f31b6f4 into master Sep 11, 2023

ikawrakow mentioned this pull request Sep 11, 2023

POC: combined scale + diagonal mask infinity + soft max op #3121

Closed

ikawrakow deleted the ik/metal_pp branch September 24, 2023 16:10
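The fusion idea raised in the commit notes and in the follow-up POC (combining scale, diagonal mask infinity, and soft max into one op) can be sketched in plain Python as follows. This is an illustrative sketch only, not the Metal kernel from either PR. The function names are hypothetical, and the masking convention (row `i` may see columns `0..n_past + i`) is an assumption.

```python
import math


def scale(row, s):
    return [x * s for x in row]


def diag_mask_inf(row, n_past, i):
    # Assumed convention: columns beyond n_past + i are masked out.
    return [x if j <= n_past + i else -math.inf for j, x in enumerate(row)]


def soft_max(row):
    m = max(row)
    e = [math.exp(x - m) for x in row]
    total = sum(e)
    return [v / total for v in e]


def fused_scale_mask_soft_max(row, s, n_past, i):
    """Same result as the three steps above, without materializing
    the scaled and masked intermediates."""
    n_visible = min(len(row), n_past + i + 1)
    m = max(row[j] * s for j in range(n_visible))
    e = [math.exp(row[j] * s - m) if j < n_visible else 0.0
         for j in range(len(row))]
    total = sum(e)
    return [v / total for v in e]


# Small made-up KQ block: 3 query rows, 4 key columns, 1 past token.
kq = [[0.5, 1.0, -2.0, 3.0],
      [1.5, -0.5, 0.25, 2.0],
      [0.0, 0.75, -1.0, 0.5]]
s, n_past = 0.125, 1

for i, row in enumerate(kq):
    separate = soft_max(diag_mask_inf(scale(row, s), n_past, i))
    fused = fused_scale_mask_soft_max(row, s, n_past, i)
    print(i, all(math.isclose(a, b, abs_tol=1e-12)
                 for a, b in zip(separate, fused)))
```

The fused version touches each row once per pass instead of writing and re-reading two intermediate tensors, which is the kind of saving the suggestion is after.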
