vulkan: subgroup size tuning #12087

Merged (10 commits, Mar 17, 2025)

Conversation

@daniandtheweb (Contributor)

This PR is a continuation of the tests done in #11826 on the subgroup size in the Vulkan backend and its effect on performance, especially on RDNA cards. For now it includes a specific configuration that improves RDNA1 performance in almost all operations.
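
As background, here is a minimal sketch of what "setting the subgroup size" means at the Vulkan API level, assuming Vulkan 1.3 / VK_EXT_subgroup_size_control. This is not the actual ggml-vulkan code; the function name and the example value are placeholders:

```cpp
// Minimal sketch (not the actual ggml-vulkan code): pin a compute pipeline to a
// fixed subgroup size via VK_EXT_subgroup_size_control / Vulkan 1.3.
// `device`, `layout` and `shader_module` are assumed to exist already.
#include <vulkan/vulkan.h>

VkPipeline create_pipeline_with_subgroup_size(VkDevice device,
                                              VkPipelineLayout layout,
                                              VkShaderModule shader_module,
                                              uint32_t subgroup_size /* e.g. 32 on RDNA */) {
    VkPipelineShaderStageRequiredSubgroupSizeCreateInfo subgroup_info = {};
    subgroup_info.sType = VK_STRUCTURE_TYPE_PIPELINE_SHADER_STAGE_REQUIRED_SUBGROUP_SIZE_CREATE_INFO;
    subgroup_info.requiredSubgroupSize = subgroup_size;

    VkPipelineShaderStageCreateInfo stage = {};
    stage.sType  = VK_STRUCTURE_TYPE_PIPELINE_SHADER_STAGE_CREATE_INFO;
    stage.pNext  = &subgroup_info;               // request the subgroup size for this stage
    stage.stage  = VK_SHADER_STAGE_COMPUTE_BIT;
    stage.module = shader_module;
    stage.pName  = "main";

    VkComputePipelineCreateInfo info = {};
    info.sType  = VK_STRUCTURE_TYPE_COMPUTE_PIPELINE_CREATE_INFO;
    info.stage  = stage;
    info.layout = layout;

    VkPipeline pipeline = VK_NULL_HANDLE;
    vkCreateComputePipelines(device, VK_NULL_HANDLE, 1, &info, nullptr, &pipeline);
    return pipeline;
}
```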

@github-actions bot added the Vulkan (issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) labels on Feb 26, 2025
@0cc4m (Collaborator) commented on Mar 5, 2025

I haven't forgotten about this; I was just working on the int8 matmul. I'll get back to you here soon.

@0cc4m (Collaborator) commented on Mar 8, 2025

I tried doing device architecture detection in a better way in daniandtheweb#1; let me know what you think.

@daniandtheweb force-pushed the rdna-subgroup-size branch 3 times, most recently from c22e0b4 to 7176d04 on March 8, 2025 at 21:01
@0cc4m (Collaborator) commented on Mar 11, 2025

I'll give this another try on RDNA2 soon; I expect that you can add RDNA2 to the RDNA1 logic. Once that is done, I think we can merge this.

Can you resolve the merge conflict?

@daniandtheweb (Contributor, Author)

Sure, I'll do it later once I'm back home.

@daniandtheweb marked this pull request as ready for review on March 11, 2025 at 13:52
@daniandtheweb changed the title from "vulkan: subgroup size test" to "vulkan: subgroup size tuning" on Mar 11, 2025
@daniandtheweb (Contributor, Author) commented on Mar 11, 2025

I was testing the change on RDNA3 with stable-diffusion.cpp, and it seems like using subgroup size 32 on this architecture completely breaks image generation, resulting in black images. Do you have any clue what could be causing this? Text generation works properly using wave 32, and there's even a small performance uplift in prompt processing (1164 master / 1238 subgroup 32).

@0cc4m (Collaborator) commented on Mar 11, 2025

I don't know what it is, but to figure out which op is failing I implemented the GGML_VULKAN_CHECK_RESULTS option. Try compiling with that; it should tell you which result is wrong.
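
(As a rough illustration of what such a check does, not the actual GGML_VULKAN_CHECK_RESULTS implementation: each op's GPU output is compared against a reference, and the first mismatching result is reported.)

```cpp
// Conceptual sketch only, not the ggml implementation: compare a GPU result
// buffer against a reference and report the largest absolute error, so the
// failing op can be identified by name.
#include <cmath>
#include <cstdio>
#include <vector>

bool check_results(const char * op_name,
                   const std::vector<float> & gpu,
                   const std::vector<float> & reference,
                   float tolerance = 1e-3f) {
    float max_err = 0.0f;
    size_t max_idx = 0;
    for (size_t i = 0; i < gpu.size() && i < reference.size(); ++i) {
        const float err = std::fabs(gpu[i] - reference[i]);
        if (err > max_err) { max_err = err; max_idx = i; }
    }
    if (max_err > tolerance) {
        std::fprintf(stderr, "%s: mismatch at index %zu (max abs err %g)\n", op_name, max_idx, max_err);
        return false;
    }
    return true;
}
```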

@daniandtheweb (Contributor, Author) commented on Mar 11, 2025

For some reason all mul_mat and mul_mat_id tests fail on RDNA3 when using a subgroup size of 32 (I was in a rush before, so I wasn't able to run test-backend-ops), so I'll just avoid making changes on this arch for now until I find a proper tuning for this card.

@0cc4m (Collaborator) commented on Mar 11, 2025

> For some reason all mul_mat and mul_mat_id tests fail on RDNA3 when using a subgroup size of 32 (I was in a rush before, so I wasn't able to run test-backend-ops), so I'll just avoid making changes on this arch for now until I find a proper tuning for this card.

That's a coopmat thing, I think. If you ignore coopmat shaders it's probably fine.

Edit: Maybe disable setting the required subgroup size if require_full_subgroups is set; that would cover the coopmat stuff.
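
A possible shape for that guard, as a sketch with assumed variable names rather than the actual ggml-vulkan code: only attach the required subgroup size when full subgroups are not required, so the coopmat pipelines keep the driver default.

```cpp
// Hypothetical sketch of the suggested guard (not the actual ggml-vulkan code):
// skip the required subgroup size when full subgroups are requested (the
// coopmat shaders), so only the regular shaders get the tuned size.
#include <cstdint>
#include <vulkan/vulkan.h>

void set_stage_subgroup_info(VkPipelineShaderStageCreateInfo & stage,
                             VkPipelineShaderStageRequiredSubgroupSizeCreateInfo & subgroup_info,
                             uint32_t required_subgroup_size,
                             bool require_full_subgroups) {
    if (require_full_subgroups) {
        // coopmat path: require full subgroups, keep the default subgroup size
        stage.flags |= VK_PIPELINE_SHADER_STAGE_CREATE_REQUIRE_FULL_SUBGROUPS_BIT;
    } else if (required_subgroup_size != 0) {
        // regular shaders: pin the tuned subgroup size (e.g. 32 on RDNA1/2)
        subgroup_info.sType = VK_STRUCTURE_TYPE_PIPELINE_SHADER_STAGE_REQUIRED_SUBGROUP_SIZE_CREATE_INFO;
        subgroup_info.requiredSubgroupSize = required_subgroup_size;
        stage.pNext = &subgroup_info;
    }
}
```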

@daniandtheweb (Contributor, Author) commented on Mar 11, 2025

Thanks for the advice. I still hadn't looked at the coopmat shaders, so I didn't know how they would interact with these changes. With the new changes the tests pass and performance on RDNA3 is slightly better in prompt processing (1164 master / 1220 pr). Stable diffusion seems to get a minor improvement.

For now I think it's safe to keep the same values as on RDNA1, since the card seems to behave similarly when it comes to subgroup size.

@0cc4m (Collaborator) commented on Mar 12, 2025

I checked and you can add RDNA2 to the list. It seems to prefer the same shader sizes. Afterwards we can merge this.

@daniandtheweb (Contributor, Author)

I created a common RDNA map to store the subgroup preferences, since the architectures mostly behave the same; however, I've kept the option to specify different configurations in case better tuning options are found for a specific architecture. I'll run more tests with other subgroup configurations, but if I find something better I'll leave it for another PR.
If this looks good to you, I think it's ready to be merged.
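
To illustrate the idea, here is a hedged sketch of such a shared map; the enum, pipeline names and values are hypothetical and not the PR's actual code:

```cpp
// Hypothetical sketch of a shared RDNA subgroup-size table with room for
// per-architecture overrides; names and values are illustrative only.
#include <cstdint>
#include <string>
#include <unordered_map>

enum class vk_device_architecture { OTHER, AMD_RDNA1, AMD_RDNA2, AMD_RDNA3 };

// shared preferences for the RDNA family (pipeline name -> subgroup size)
static const std::unordered_map<std::string, uint32_t> rdna_subgroup_sizes = {
    { "soft_max_f32",   32 },
    { "im2col_f32_f16", 32 },
    { "mul_mat_vec_f16", 32 },
};

uint32_t get_subgroup_size(const std::string & pipeline, vk_device_architecture arch) {
    // per-architecture overrides could be added here if better tunings are found
    if (arch == vk_device_architecture::AMD_RDNA1 ||
        arch == vk_device_architecture::AMD_RDNA2 ||
        arch == vk_device_architecture::AMD_RDNA3) {
        auto it = rdna_subgroup_sizes.find(pipeline);
        if (it != rdna_subgroup_sizes.end()) {
            return it->second;
        }
    }
    return 0; // 0 = leave the driver default
}
```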

@0cc4m (Collaborator) left a comment

Thank you, looks good now.

@daniandtheweb (Contributor, Author) commented on Mar 12, 2025

I disabled warp 32 for RDNA3 as it seems to behave worse than with warp 64 when using the matrix cores, the biggest performance hits being on MUL_MAT f32 f32 and f16 f32. Every other operation seems mostly unaffected, but since this results in no performance improvement I'll leave it untouched for now.

MUL_MAT(type_a=f32,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 288.38 GFLOPS  243.03 GFLOPS  -45.35 GFLOPS
MUL_MAT(type_a=f16,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 517.14 GFLOPS  437.11 GFLOPS  -80.03 GFLOPS
MUL_MAT(type_a=f32,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 568.25 GFLOPS  497.69 GFLOPS  -70.56 GFLOPS
MUL_MAT(type_a=f16,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 1000.0 GFLOPS  876.59 GFLOPS  -123.41 GFLOPS

@0cc4m (Collaborator) commented on Mar 12, 2025

Matrix cores only get used with coopmat, and that's only for matrix-matrix multiplications. What you're looking at is just a regular shader. But it's interesting that it performs worse with subgroup size 32; I guess I should check RDNA2 and RDNA1 again as well.

@daniandtheweb (Contributor, Author) commented on Mar 12, 2025

I'm retesting the results on RDNA1, and despite some small regressions in MUL_MAT, the resulting performance in something like llama-bench is actually better.

master:
ggml_vulkan: 0 = AMD Radeon RX 5700 XT (RADV NAVI10) (radv) | uma: 0 | fp16: 1 | warp size: 64 | shared memory: 65536 | matrix cores: none
| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     | 100 |         pp512 |        461.39 ± 0.73 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     | 100 |         tg128 |         68.95 ± 0.15 |

  MUL_MAT(type_a=f32,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    1704 runs -   616.43 us/run - 117.44 MFLOP/run - 190.52 GFLOPS
  MUL_MAT(type_a=f16,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    3408 runs -   323.47 us/run - 117.44 MFLOP/run - 363.06 GFLOPS

pr:
ggml_vulkan: 0 = AMD Radeon RX 5700 XT (RADV NAVI10) (radv) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 65536 | matrix cores: none
| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     | 100 |         pp512 |        494.96 ± 0.51 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     | 100 |         tg128 |         68.27 ± 0.02 |

  MUL_MAT(type_a=f32,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    1704 runs -   681.27 us/run - 117.44 MFLOP/run - 172.38 GFLOPS
  MUL_MAT(type_a=f16,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    3408 runs -   306.59 us/run - 117.44 MFLOP/run - 383.05 GFLOPS

@daniandtheweb (Contributor, Author) commented on Mar 12, 2025

I'm currently running all the performance tests on both RDNA1 and RDNA3 to see if a finer tuning of the subgroup sizes can improve the situation.

@daniandtheweb (Contributor, Author)

Here's the comparison for RDNA1, master vs PR (master / PR / diff):
ADD(type=f32,ne=[4096,1,1,1],nr=[1,1,1,1]): 14.5 GB/s  14.49 GB/s  -0.01 GB/s
ADD(type=f32,ne=[4096,1,1,1],nr=[1,512,1,1]): 188.42 GB/s  200.48 GB/s  12.06 GB/s
CPY(type_src=f32,type_dst=f16,ne=[512,3072,1,1],permute=[0,0,0,0]): 372.14 GB/s  374.11 GB/s  1.97 GB/s
CPY(type_src=f32,type_dst=f32,ne=[8192,512,2,1],permute=[0,2,1,3]): 247.2 GB/s  257.96 GB/s  10.76 GB/s
CPY(type_src=f32,type_dst=f32,ne=[3072,512,2,1],permute=[0,2,1,3]): 241.03 GB/s  251.16 GB/s  10.13 GB/s
SOFT_MAX(type=f32,ne=[4096,4096,5,1],mask=0,m_prec=f32,scale=1.000000,max_bias=0.000000): 384.12 GB/s  384.08 GB/s  -0.04 GB/s
SOFT_MAX(type=f32,ne=[77,4096,5,1],mask=0,m_prec=f32,scale=1.000000,max_bias=0.000000): 285.89 GB/s  277.22 GB/s  -8.67 GB/s
SOFT_MAX(type=f32,ne=[1024,1024,10,1],mask=0,m_prec=f32,scale=1.000000,max_bias=0.000000): 305.99 GB/s  307.26 GB/s  1.27 GB/s
SOFT_MAX(type=f32,ne=[77,1024,10,1],mask=0,m_prec=f32,scale=1.000000,max_bias=0.000000): 261.68 GB/s  261.52 GB/s  -0.16 GB/s
SOFT_MAX(type=f32,ne=[256,256,20,1],mask=0,m_prec=f32,scale=1.000000,max_bias=0.000000): 281.1 GB/s  281.39 GB/s  0.29 GB/s
SOFT_MAX(type=f32,ne=[64,64,20,1],mask=0,m_prec=f32,scale=1.000000,max_bias=0.000000): 152.15 GB/s  151.82 GB/s  -0.33 GB/s
SOFT_MAX(type=f32,ne=[77,64,20,1],mask=0,m_prec=f32,scale=1.000000,max_bias=0.000000): 145.55 GB/s  145.33 GB/s  -0.22 GB/s
ARGMAX(type=f32,ne=[32,10,1,1]): 0.43 GB/s  0.41 GB/s  -0.02 GB/s
ARGMAX(type=f32,ne=[1024,10,1,1]): 3.03 GB/s  3.57 GB/s  0.54 GB/s
ARGMAX(type=f32,ne=[32000,512,1,1]): 319.78 GB/s  306.66 GB/s  -13.12 GB/s
MUL_MAT(type_a=f32,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 190.41 GFLOPS  170.16 GFLOPS  -20.25 GFLOPS
MUL_MAT(type_a=f16,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 361.98 GFLOPS  380.97 GFLOPS  18.99 GFLOPS
  MUL_MAT(type_a=bf16,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): not supported
MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 1.14 TFLOPS  1.04 TFLOPS  -0.10 TFLOPS
MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 1.01 TFLOPS  998.71 GFLOPS  0.00 TFLOPS
MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 839.64 GFLOPS  845.49 GFLOPS  5.85 GFLOPS
MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 771.28 GFLOPS  776.84 GFLOPS  5.56 GFLOPS
MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 671.78 GFLOPS  555.79 GFLOPS  -115.99 GFLOPS
MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 1.23 TFLOPS  1.23 TFLOPS  0.00 TFLOPS
MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 942.72 GFLOPS  945.72 GFLOPS  3.00 GFLOPS
MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 1.16 TFLOPS  1.15 TFLOPS  -0.01 TFLOPS
MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 990.3 GFLOPS  946.01 GFLOPS  -44.29 GFLOPS
MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 857.58 GFLOPS  860.37 GFLOPS  2.79 GFLOPS
MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 1.39 TFLOPS  1.43 TFLOPS  0.04 TFLOPS
MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 1.31 TFLOPS  1.4 TFLOPS  0.09 TFLOPS
MUL_MAT(type_a=iq2_s,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 1.24 TFLOPS  1.35 TFLOPS  0.11 TFLOPS
MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 1.32 TFLOPS  1.36 TFLOPS  0.04 TFLOPS
MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 1.33 TFLOPS  1.45 TFLOPS  0.12 TFLOPS
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 1.2 TFLOPS  1.3 TFLOPS  0.10 TFLOPS
MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 1.19 TFLOPS  1.07 TFLOPS  -0.12 TFLOPS
MUL_MAT(type_a=iq3_s,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 1.07 TFLOPS  1.03 TFLOPS  -0.04 TFLOPS
MUL_MAT(type_a=iq4_xs,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 698.32 GFLOPS  749.54 GFLOPS  51.22 GFLOPS
MUL_MAT(type_a=f32,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 367.7 GFLOPS  330.11 GFLOPS  -37.59 GFLOPS
MUL_MAT(type_a=f16,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 651.75 GFLOPS  721.02 GFLOPS  69.27 GFLOPS
  MUL_MAT(type_a=bf16,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): not supported
MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 1.82 TFLOPS  1.62 TFLOPS  -0.20 TFLOPS
MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 1.52 TFLOPS  1.52 TFLOPS  0.00 TFLOPS
MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 1.41 TFLOPS  1.4 TFLOPS  -0.01 TFLOPS
MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 1.19 TFLOPS  1.23 TFLOPS  0.04 TFLOPS
MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 987.95 GFLOPS  989.86 GFLOPS  1.91 GFLOPS
MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 1.73 TFLOPS  1.74 TFLOPS  0.01 TFLOPS
MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 1.47 TFLOPS  1.48 TFLOPS  0.01 TFLOPS
MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 1.77 TFLOPS  1.8 TFLOPS  0.03 TFLOPS
MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 1.55 TFLOPS  1.51 TFLOPS  -0.04 TFLOPS
MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 1.62 TFLOPS  1.58 TFLOPS  -0.04 TFLOPS
MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 2.1 TFLOPS  2.22 TFLOPS  0.12 TFLOPS
MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 2.13 TFLOPS  2.33 TFLOPS  0.20 TFLOPS
MUL_MAT(type_a=iq2_s,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 2.05 TFLOPS  2.2 TFLOPS  0.15 TFLOPS
MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 1.97 TFLOPS  2.07 TFLOPS  0.10 TFLOPS
MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 1.5 TFLOPS  1.47 TFLOPS  -0.03 TFLOPS
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 1.36 TFLOPS  1.85 TFLOPS  0.49 TFLOPS
MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 2.08 TFLOPS  2.09 TFLOPS  0.01 TFLOPS
MUL_MAT(type_a=iq3_s,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 1.19 TFLOPS  1.63 TFLOPS  0.44 TFLOPS
MUL_MAT(type_a=iq4_xs,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 1.26 TFLOPS  1.31 TFLOPS  0.05 TFLOPS
MUL_MAT(type_a=f32,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 527.76 GFLOPS  487.7 GFLOPS  -40.06 GFLOPS
MUL_MAT(type_a=f16,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 928.52 GFLOPS  1.04 TFLOPS  0.00 GFLOPS
  MUL_MAT(type_a=bf16,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): not supported
MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 1.9 TFLOPS  2.08 TFLOPS  0.18 TFLOPS
MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 1.73 TFLOPS  1.76 TFLOPS  0.03 TFLOPS
MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 1.7 TFLOPS  1.62 TFLOPS  -0.08 TFLOPS
MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 1.49 TFLOPS  1.5 TFLOPS  0.01 TFLOPS
MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 1.19 TFLOPS  1.11 TFLOPS  -0.08 TFLOPS
MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 1.9 TFLOPS  2.01 TFLOPS  0.11 TFLOPS
MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 1.81 TFLOPS  1.74 TFLOPS  -0.07 TFLOPS
MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 2.13 TFLOPS  2.07 TFLOPS  -0.06 TFLOPS
MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 1.94 TFLOPS  1.92 TFLOPS  -0.02 TFLOPS
MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 1.99 TFLOPS  1.87 TFLOPS  -0.12 TFLOPS
MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 2.49 TFLOPS  2.9 TFLOPS  0.41 TFLOPS
MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 2.65 TFLOPS  2.88 TFLOPS  0.23 TFLOPS
MUL_MAT(type_a=iq2_s,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 2.46 TFLOPS  2.74 TFLOPS  0.28 TFLOPS
MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 2.3 TFLOPS  2.65 TFLOPS  0.35 TFLOPS
MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 1.54 TFLOPS  1.88 TFLOPS  0.34 TFLOPS
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 1.59 TFLOPS  1.87 TFLOPS  0.28 TFLOPS
MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 2.62 TFLOPS  2.68 TFLOPS  0.06 TFLOPS
MUL_MAT(type_a=iq3_s,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 1.32 TFLOPS  1.59 TFLOPS  0.27 TFLOPS
MUL_MAT(type_a=iq4_xs,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 1.75 TFLOPS  1.76 TFLOPS  0.01 TFLOPS
MUL_MAT(type_a=f32,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 689.43 GFLOPS  650.27 GFLOPS  -39.16 GFLOPS
MUL_MAT(type_a=f16,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 1.16 TFLOPS  1.26 TFLOPS  0.10 TFLOPS
  MUL_MAT(type_a=bf16,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): not supported
MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 2.28 TFLOPS  2.14 TFLOPS  -0.14 TFLOPS
MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 1.91 TFLOPS  1.91 TFLOPS  0.00 TFLOPS
MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 1.95 TFLOPS  1.96 TFLOPS  0.01 TFLOPS
MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 1.67 TFLOPS  1.67 TFLOPS  0.00 TFLOPS
MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 1.24 TFLOPS  1.11 TFLOPS  -0.13 TFLOPS
MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 2.02 TFLOPS  2.06 TFLOPS  0.04 TFLOPS
MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 1.87 TFLOPS  1.92 TFLOPS  0.05 TFLOPS
MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 2.28 TFLOPS  2.18 TFLOPS  -0.10 TFLOPS
MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 2.0 TFLOPS  1.96 TFLOPS  -0.04 TFLOPS
MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 2.1 TFLOPS  2.12 TFLOPS  0.02 TFLOPS
MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 2.72 TFLOPS  3.1 TFLOPS  0.38 TFLOPS
MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 2.71 TFLOPS  3.1 TFLOPS  0.39 TFLOPS
MUL_MAT(type_a=iq2_s,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 2.47 TFLOPS  3.0 TFLOPS  0.53 TFLOPS
MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 2.57 TFLOPS  2.91 TFLOPS  0.34 TFLOPS
MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 1.31 TFLOPS  1.27 TFLOPS  -0.04 TFLOPS
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 1.39 TFLOPS  1.62 TFLOPS  0.23 TFLOPS
MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 2.79 TFLOPS  2.87 TFLOPS  0.08 TFLOPS
MUL_MAT(type_a=iq3_s,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 1.31 TFLOPS  1.43 TFLOPS  0.12 TFLOPS
MUL_MAT(type_a=iq4_xs,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 2.07 TFLOPS  2.07 TFLOPS  0.00 TFLOPS
MUL_MAT(type_a=f32,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 864.28 GFLOPS  790.25 GFLOPS  -74.03 GFLOPS
MUL_MAT(type_a=f16,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 1.45 TFLOPS  1.38 TFLOPS  -0.07 TFLOPS
  MUL_MAT(type_a=bf16,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): not supported
MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 2.52 TFLOPS  2.33 TFLOPS  -0.19 TFLOPS
MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 2.01 TFLOPS  2.01 TFLOPS  0.00 TFLOPS
MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 2.06 TFLOPS  1.93 TFLOPS  -0.13 TFLOPS
MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 1.8 TFLOPS  1.79 TFLOPS  -0.01 TFLOPS
MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 1.17 TFLOPS  1.23 TFLOPS  0.06 TFLOPS
MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 2.07 TFLOPS  2.11 TFLOPS  0.04 TFLOPS
MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 1.95 TFLOPS  2.01 TFLOPS  0.06 TFLOPS
MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 2.31 TFLOPS  2.23 TFLOPS  -0.08 TFLOPS
MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 2.17 TFLOPS  2.14 TFLOPS  -0.03 TFLOPS
MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 2.25 TFLOPS  2.26 TFLOPS  0.01 TFLOPS
MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 2.63 TFLOPS  3.25 TFLOPS  0.62 TFLOPS
MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 2.49 TFLOPS  3.36 TFLOPS  0.87 TFLOPS
MUL_MAT(type_a=iq2_s,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 2.41 TFLOPS  3.15 TFLOPS  0.74 TFLOPS
MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 2.5 TFLOPS  3.09 TFLOPS  0.59 TFLOPS
MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 1.33 TFLOPS  1.4 TFLOPS  0.07 TFLOPS
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 1.38 TFLOPS  1.45 TFLOPS  0.07 TFLOPS
MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 3.13 TFLOPS  3.06 TFLOPS  -0.07 TFLOPS
MUL_MAT(type_a=iq3_s,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 1.27 TFLOPS  1.32 TFLOPS  0.05 TFLOPS
MUL_MAT(type_a=iq4_xs,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 2.21 TFLOPS  2.37 TFLOPS  0.16 TFLOPS
MUL_MAT(type_a=f32,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 1.37 TFLOPS  1.23 TFLOPS  -0.14 TFLOPS
MUL_MAT(type_a=f16,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 1.85 TFLOPS  1.58 TFLOPS  -0.27 TFLOPS
  MUL_MAT(type_a=bf16,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): not supported
MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 2.51 TFLOPS  2.35 TFLOPS  -0.16 TFLOPS
MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 2.08 TFLOPS  2.14 TFLOPS  0.06 TFLOPS
MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 2.34 TFLOPS  2.31 TFLOPS  -0.03 TFLOPS
MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 2.0 TFLOPS  2.0 TFLOPS  0.00 TFLOPS
MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 1.16 TFLOPS  1.12 TFLOPS  -0.04 TFLOPS
MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 1.95 TFLOPS  1.69 TFLOPS  -0.26 TFLOPS
MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 1.81 TFLOPS  1.58 TFLOPS  -0.23 TFLOPS
MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 2.46 TFLOPS  2.17 TFLOPS  -0.29 TFLOPS
MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 2.25 TFLOPS  2.18 TFLOPS  -0.07 TFLOPS
MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 2.31 TFLOPS  2.15 TFLOPS  -0.16 TFLOPS
MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 2.66 TFLOPS  2.62 TFLOPS  -0.04 TFLOPS
MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 2.88 TFLOPS  2.76 TFLOPS  -0.12 TFLOPS
MUL_MAT(type_a=iq2_s,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 2.66 TFLOPS  2.8 TFLOPS  0.14 TFLOPS
MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 2.68 TFLOPS  2.4 TFLOPS  -0.28 TFLOPS
MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 593.64 GFLOPS  584.46 GFLOPS  -9.18 GFLOPS
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 543.98 GFLOPS  593.96 GFLOPS  49.98 GFLOPS
MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 3.56 TFLOPS  3.63 TFLOPS  0.07 TFLOPS
MUL_MAT(type_a=iq3_s,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 446.48 GFLOPS  476.68 GFLOPS  30.20 GFLOPS
MUL_MAT(type_a=iq4_xs,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 2.8 TFLOPS  3.0 TFLOPS  0.20 TFLOPS
MUL_MAT(type_a=f32,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 2.68 TFLOPS  2.86 TFLOPS  0.18 TFLOPS
MUL_MAT(type_a=f16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 3.62 TFLOPS  3.63 TFLOPS  0.01 TFLOPS
  MUL_MAT(type_a=bf16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): not supported
MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 6.25 TFLOPS  6.79 TFLOPS  0.54 TFLOPS
MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 6.34 TFLOPS  6.88 TFLOPS  0.54 TFLOPS
MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 5.77 TFLOPS  6.38 TFLOPS  0.61 TFLOPS
MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 5.84 TFLOPS  6.41 TFLOPS  0.57 TFLOPS
MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 6.14 TFLOPS  6.73 TFLOPS  0.59 TFLOPS
MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 5.47 TFLOPS  5.89 TFLOPS  0.42 TFLOPS
MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 4.98 TFLOPS  5.2 TFLOPS  0.22 TFLOPS
MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 4.84 TFLOPS  5.25 TFLOPS  0.41 TFLOPS
MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 4.63 TFLOPS  4.92 TFLOPS  0.29 TFLOPS
MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 5.12 TFLOPS  5.45 TFLOPS  0.33 TFLOPS
MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 4.99 TFLOPS  5.38 TFLOPS  0.39 TFLOPS
MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 4.98 TFLOPS  5.43 TFLOPS  0.45 TFLOPS
MUL_MAT(type_a=iq2_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 4.72 TFLOPS  5.37 TFLOPS  0.65 TFLOPS
MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 5.07 TFLOPS  5.36 TFLOPS  0.29 TFLOPS
MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 5.21 TFLOPS  5.7 TFLOPS  0.49 TFLOPS
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 4.89 TFLOPS  5.27 TFLOPS  0.38 TFLOPS
MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 5.89 TFLOPS  6.61 TFLOPS  0.72 TFLOPS
MUL_MAT(type_a=iq3_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 5.02 TFLOPS  5.38 TFLOPS  0.36 TFLOPS
MUL_MAT(type_a=iq4_xs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 5.1 TFLOPS  5.61 TFLOPS  0.51 TFLOPS
IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1): 169.18 GB/s  169.32 GB/s  0.14 GB/s
IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1): 58.99 GB/s  58.97 GB/s  -0.02 GB/s
IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[256,256,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1): 7.85 GB/s  7.63 GB/s  -0.22 GB/s
IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,2560,1],ne_kernel=[3,3,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1): 164.32 GB/s  164.99 GB/s  0.67 GB/s
IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,2560,1],ne_kernel=[3,3,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1): 9.57 GB/s  6.35 GB/s  -3.22 GB/s
IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1): 98.62 GB/s  98.9 GB/s  0.28 GB/s
IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1): 32.0 GB/s  32.83 GB/s  0.83 GB/s
IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[256,256,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1): 4.9 GB/s  5.07 GB/s  0.17 GB/s
IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,2560,1],ne_kernel=[5,5,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1): 101.08 GB/s  101.07 GB/s  -0.01 GB/s
IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,2560,1],ne_kernel=[5,5,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1): 13.13 GB/s  13.09 GB/s  -0.04 GB/s

@daniandtheweb (Contributor, Author)

After running multiple tests I can say for certain that RDNA3 doesn't like subgroup size 32, at least on Linux with amdgpu (I tried radv, amdvlk and vk_pro, and all behave similarly): every single operation seems to get a small performance hit from it. I'm not sure whether this is expected behavior or there's some issue in the amdgpu driver (I've found other issues in other compute tasks, so it's a possibility). For now I don't think it's a good idea to change the default subgroup size for this architecture.

Regarding RDNA1, I managed to solve every single performance regression by setting most mul_mat_vec operations to use subgroup size 64. All the performance gains from switching to subgroup size 32 are maintained, without any regression in any operation (at least the ones in test-backend-ops). I'll push the changes soon.

If you can compare the results using warp 64 and warp 32 on RDNA2, I can tune the parameters that still need some adjustments, if there's any need for it. Does the performance on RDNA2 improve at all with these changes, or does it behave similarly to what I described for RDNA3?
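
For illustration, a hedged sketch of that per-op split on RDNA1 (a hypothetical helper, not the PR's actual code): default to subgroup size 32, but keep the mul_mat_vec pipelines at the full wave of 64.

```cpp
// Illustrative sketch only: RDNA1 default of 32, with the mul_mat_vec
// pipelines kept at 64 to avoid the regressions seen in test-backend-ops.
#include <cstdint>
#include <string>

uint32_t rdna1_subgroup_size(const std::string & pipeline_name) {
    const bool is_mul_mat_vec = pipeline_name.find("mul_mat_vec") != std::string::npos;
    return is_mul_mat_vec ? 64u : 32u;
}
```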

@0cc4m (Collaborator) commented on Mar 15, 2025

Here's RDNA2 with the diff between running everything with subgroup size 64 and with 32 (64 / 32 / diff / relative change):

ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 6800 XT (RADV NAVI21) (radv) | uma: 0 | fp16: 1 | warp size: 64 | shared memory: 65536 | matrix cores: none
Testing 2 devices

Backend 1/2: Vulkan0
  Device description: AMD Radeon RX 6800 XT (RADV NAVI21)
  Device memory: 16368 MB (16368 MB free)


ADD(type=f32,ne=[4096,1,1,1],nr=[1,1,1,1]): 28.41 GB/s  28.72 GB/s  +0.31 GB/s  +1.08%
ADD(type=f32,ne=[4096,1,1,1],nr=[1,512,1,1]): 443.24 GB/s  461.55 GB/s  +18.31 GB/s  +3.97%
CPY(type_src=f32,type_dst=f16,ne=[512,3072,1,1],permute=[0,0,0,0]): 1433.46 GB/s  1432.10 GB/s  -1.36 GB/s  -0.09%
CPY(type_src=f32,type_dst=f32,ne=[8192,512,2,1],permute=[0,2,1,3]): 534.03 GB/s  587.19 GB/s  +53.16 GB/s  +9.05%
CPY(type_src=f32,type_dst=f32,ne=[3072,512,2,1],permute=[0,2,1,3]): 520.07 GB/s  568.88 GB/s  +48.81 GB/s  +8.58%
SOFT_MAX(type=f32,ne=[4096,4096,5,1],mask=0,m_prec=f32,scale=1.000000,max_bias=0.000000): 419.23 GB/s  413.18 GB/s  -6.05 GB/s  -1.44%
SOFT_MAX(type=f32,ne=[77,4096,5,1],mask=0,m_prec=f32,scale=1.000000,max_bias=0.000000): 568.38 GB/s  416.22 GB/s  -152.16 GB/s  -26.77%
SOFT_MAX(type=f32,ne=[1024,1024,10,1],mask=0,m_prec=f32,scale=1.000000,max_bias=0.000000): 1341.64 GB/s  1220.65 GB/s  -120.99 GB/s  -9.02%
SOFT_MAX(type=f32,ne=[77,1024,10,1],mask=0,m_prec=f32,scale=1.000000,max_bias=0.000000): 519.60 GB/s  390.33 GB/s  -129.27 GB/s  -24.88%
SOFT_MAX(type=f32,ne=[256,256,20,1],mask=0,m_prec=f32,scale=1.000000,max_bias=0.000000): 935.80 GB/s  761.65 GB/s  -174.15 GB/s  -18.61%
SOFT_MAX(type=f32,ne=[64,64,20,1],mask=0,m_prec=f32,scale=1.000000,max_bias=0.000000): 270.09 GB/s  183.25 GB/s  -86.84 GB/s  -32.15%
SOFT_MAX(type=f32,ne=[77,64,20,1],mask=0,m_prec=f32,scale=1.000000,max_bias=0.000000): 327.21 GB/s  204.12 GB/s  -123.09 GB/s  -37.62%
ARGMAX(type=f32,ne=[32,10,1,1]): 0.94 GB/s  0.87 GB/s  -0.07 GB/s  -7.45%
ARGMAX(type=f32,ne=[1024,10,1,1]): 8.48 GB/s  8.44 GB/s  -0.04 GB/s  -0.47%
ARGMAX(type=f32,ne=[32000,512,1,1]): 616.23 GB/s  623.30 GB/s  +7.07 GB/s  +1.13%
MUL_MAT(type_a=f32,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 241.38 GFLOPS  212.09 GFLOPS  -29.29 GFLOPS  -12.13%
MUL_MAT(type_a=f16,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 1.64 TFLOPS  1.65 TFLOPS  +0.01 TFLOPS  +0.61%
  MUL_MAT(type_a=bf16,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): not supported
MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 2.51 TFLOPS  2.57 TFLOPS  +0.06 TFLOPS  +2.33%
MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 2.21 TFLOPS  2.20 TFLOPS  -0.01 TFLOPS  -0.45%
MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 1.76 TFLOPS  1.76 TFLOPS  +0.00 TFLOPS  -0.00%
MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 1.67 TFLOPS  1.65 TFLOPS  -0.02 TFLOPS  -1.20%
MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 2.34 TFLOPS  2.16 TFLOPS  -0.18 TFLOPS  -7.69%
MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 2.62 TFLOPS  2.60 TFLOPS  -0.02 TFLOPS  -0.76%
MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 1.93 TFLOPS  1.90 TFLOPS  -0.03 TFLOPS  -1.55%
MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 2.71 TFLOPS  2.71 TFLOPS  +0.00 TFLOPS  -0.00%
MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 2.32 TFLOPS  2.30 TFLOPS  -0.02 TFLOPS  -0.86%
MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 2.51 TFLOPS  2.48 TFLOPS  -0.03 TFLOPS  -1.20%
MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 2.66 TFLOPS  2.77 TFLOPS  +0.11 TFLOPS  +3.97%
MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 2.74 TFLOPS  2.66 TFLOPS  -0.08 TFLOPS  -2.92%
MUL_MAT(type_a=iq2_s,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 2.62 TFLOPS  2.73 TFLOPS  +0.11 TFLOPS  +4.03%
MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 2.46 TFLOPS  2.58 TFLOPS  +0.12 TFLOPS  +4.65%
MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 2.74 TFLOPS  2.84 TFLOPS  +0.10 TFLOPS  +3.52%
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 2.54 TFLOPS  2.62 TFLOPS  +0.08 TFLOPS  +3.05%
MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 2.70 TFLOPS  2.79 TFLOPS  +0.09 TFLOPS  +3.23%
MUL_MAT(type_a=iq3_s,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 2.25 TFLOPS  2.22 TFLOPS  -0.03 TFLOPS  -1.33%
MUL_MAT(type_a=iq4_xs,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 1.36 TFLOPS  1.47 TFLOPS  +0.11 TFLOPS  +7.48%
MUL_MAT(type_a=f32,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 459.49 GFLOPS  408.60 GFLOPS  -50.89 GFLOPS  -11.08%
MUL_MAT(type_a=f16,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 2.85 TFLOPS  3.10 TFLOPS  +0.25 TFLOPS  +8.06%
  MUL_MAT(type_a=bf16,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): not supported
MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 3.88 TFLOPS  3.97 TFLOPS  +0.09 TFLOPS  +2.27%
MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 3.12 TFLOPS  3.12 TFLOPS  +0.00 TFLOPS  -0.00%
MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 2.86 TFLOPS  2.86 TFLOPS  +0.00 TFLOPS  -0.00%
MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 2.51 TFLOPS  2.52 TFLOPS  +0.01 TFLOPS  +0.40%
MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 2.99 TFLOPS  2.79 TFLOPS  -0.20 TFLOPS  -6.69%
MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 3.68 TFLOPS  3.70 TFLOPS  +0.02 TFLOPS  +0.54%
MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 3.02 TFLOPS  2.98 TFLOPS  -0.04 TFLOPS  -1.32%
MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 3.78 TFLOPS  3.77 TFLOPS  -0.01 TFLOPS  -0.26%
MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 3.48 TFLOPS  3.48 TFLOPS  +0.00 TFLOPS  -0.00%
MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 3.92 TFLOPS  3.91 TFLOPS  -0.01 TFLOPS  -0.26%
MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 4.24 TFLOPS  4.32 TFLOPS  +0.08 TFLOPS  +1.85%
MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 4.45 TFLOPS  4.42 TFLOPS  -0.03 TFLOPS  -0.67%
MUL_MAT(type_a=iq2_s,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 4.21 TFLOPS  4.44 TFLOPS  +0.23 TFLOPS  +5.18%
MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 3.73 TFLOPS  3.78 TFLOPS  +0.05 TFLOPS  +1.32%
MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 3.50 TFLOPS  4.02 TFLOPS  +0.52 TFLOPS  +12.94%
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 3.37 TFLOPS  3.86 TFLOPS  +0.49 TFLOPS  +12.69%
MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 4.04 TFLOPS  4.13 TFLOPS  +0.09 TFLOPS  +2.18%
MUL_MAT(type_a=iq3_s,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 3.04 TFLOPS  3.44 TFLOPS  +0.40 TFLOPS  +11.63%
MUL_MAT(type_a=iq4_xs,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 2.51 TFLOPS  2.58 TFLOPS  +0.07 TFLOPS  +2.71%
MUL_MAT(type_a=f32,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 656.96 GFLOPS  611.26 GFLOPS  -45.70 GFLOPS  -6.96%
MUL_MAT(type_a=f16,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 3.47 TFLOPS  3.77 TFLOPS  +0.30 TFLOPS  +7.96%
  MUL_MAT(type_a=bf16,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): not supported
MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 4.69 TFLOPS  4.76 TFLOPS  +0.07 TFLOPS  +1.47%
MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 3.60 TFLOPS  3.63 TFLOPS  +0.03 TFLOPS  +0.83%
MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 3.62 TFLOPS  3.68 TFLOPS  +0.06 TFLOPS  +1.63%
MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 3.09 TFLOPS  3.04 TFLOPS  -0.05 TFLOPS  -1.62%
MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 3.21 TFLOPS  2.96 TFLOPS  -0.25 TFLOPS  -7.79%
MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 4.23 TFLOPS  4.30 TFLOPS  +0.07 TFLOPS  +1.63%
MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 3.69 TFLOPS  3.67 TFLOPS  -0.02 TFLOPS  -0.54%
MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 4.38 TFLOPS  4.40 TFLOPS  +0.02 TFLOPS  +0.45%
MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 4.10 TFLOPS  4.20 TFLOPS  +0.10 TFLOPS  +2.38%
MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 4.91 TFLOPS  4.89 TFLOPS  -0.02 TFLOPS  -0.41%
MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 5.48 TFLOPS  5.77 TFLOPS  +0.29 TFLOPS  +5.03%
MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 5.46 TFLOPS  5.79 TFLOPS  +0.33 TFLOPS  +5.70%
MUL_MAT(type_a=iq2_s,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 5.06 TFLOPS  5.55 TFLOPS  +0.49 TFLOPS  +8.83%
MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 5.15 TFLOPS  5.43 TFLOPS  +0.28 TFLOPS  +5.16%
MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 3.61 TFLOPS  4.16 TFLOPS  +0.55 TFLOPS  +13.22%
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 3.59 TFLOPS  4.28 TFLOPS  +0.69 TFLOPS  +16.12%
MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 5.05 TFLOPS  5.23 TFLOPS  +0.18 TFLOPS  +3.44%
MUL_MAT(type_a=iq3_s,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 3.18 TFLOPS  3.79 TFLOPS  +0.61 TFLOPS  +16.09%
MUL_MAT(type_a=iq4_xs,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 3.37 TFLOPS  3.47 TFLOPS  +0.10 TFLOPS  +2.88%
MUL_MAT(type_a=f32,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 873.57 GFLOPS  814.94 GFLOPS  -58.63 GFLOPS  -6.71%
MUL_MAT(type_a=f16,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 3.96 TFLOPS  3.95 TFLOPS  -0.01 TFLOPS  -0.25%
  MUL_MAT(type_a=bf16,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): not supported
MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 5.11 TFLOPS  4.53 TFLOPS  -0.58 TFLOPS  -11.35%
MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 3.91 TFLOPS  3.94 TFLOPS  +0.03 TFLOPS  +0.76%
MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 4.14 TFLOPS  4.30 TFLOPS  +0.16 TFLOPS  +3.72%
MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 3.43 TFLOPS  3.39 TFLOPS  -0.04 TFLOPS  -1.17%
MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 3.17 TFLOPS  2.87 TFLOPS  -0.30 TFLOPS  -9.46%
MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 4.49 TFLOPS  4.54 TFLOPS  +0.05 TFLOPS  +1.10%
MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 3.97 TFLOPS  4.02 TFLOPS  +0.05 TFLOPS  +1.24%
MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 4.65 TFLOPS  4.76 TFLOPS  +0.11 TFLOPS  +2.31%
MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 4.47 TFLOPS  4.61 TFLOPS  +0.14 TFLOPS  +3.04%
MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 5.01 TFLOPS  5.02 TFLOPS  +0.01 TFLOPS  +0.20%
MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 5.93 TFLOPS  6.49 TFLOPS  +0.56 TFLOPS  +8.63%
MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 5.92 TFLOPS  6.50 TFLOPS  +0.58 TFLOPS  +8.92%
MUL_MAT(type_a=iq2_s,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 5.26 TFLOPS  6.22 TFLOPS  +0.96 TFLOPS  +15.43%
MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 5.58 TFLOPS  6.15 TFLOPS  +0.57 TFLOPS  +9.27%
MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 3.55 TFLOPS  4.01 TFLOPS  +0.46 TFLOPS  +11.47%
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 3.48 TFLOPS  3.95 TFLOPS  +0.47 TFLOPS  +11.90%
MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 5.73 TFLOPS  5.86 TFLOPS  +0.13 TFLOPS  +2.22%
MUL_MAT(type_a=iq3_s,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 3.19 TFLOPS  3.67 TFLOPS  +0.48 TFLOPS  +13.08%
MUL_MAT(type_a=iq4_xs,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 3.96 TFLOPS  4.06 TFLOPS  +0.10 TFLOPS  +2.46%
MUL_MAT(type_a=f32,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 1.15 TFLOPS  1.00 TFLOPS  -0.15 TFLOPS  -13.04%
MUL_MAT(type_a=f16,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 4.37 TFLOPS  3.95 TFLOPS  -0.42 TFLOPS  -9.61%
  MUL_MAT(type_a=bf16,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): not supported
MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 5.51 TFLOPS  5.00 TFLOPS  -0.51 TFLOPS  -9.26%
MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 4.15 TFLOPS  4.16 TFLOPS  +0.01 TFLOPS  +0.24%
MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 4.78 TFLOPS  4.70 TFLOPS  -0.08 TFLOPS  -1.67%
MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 3.69 TFLOPS  3.67 TFLOPS  -0.02 TFLOPS  -0.54%
MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 2.89 TFLOPS  2.83 TFLOPS  -0.06 TFLOPS  -2.08%
MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 4.51 TFLOPS  4.67 TFLOPS  +0.16 TFLOPS  +3.43%
MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 4.00 TFLOPS  4.21 TFLOPS  +0.21 TFLOPS  +4.99%
MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 4.83 TFLOPS  4.93 TFLOPS  +0.10 TFLOPS  +2.03%
MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 4.63 TFLOPS  4.91 TFLOPS  +0.28 TFLOPS  +5.70%
MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 4.93 TFLOPS  5.03 TFLOPS  +0.10 TFLOPS  +1.99%
MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 5.73 TFLOPS  6.55 TFLOPS  +0.82 TFLOPS  +12.52%
MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 5.49 TFLOPS  6.94 TFLOPS  +1.45 TFLOPS  +20.89%
MUL_MAT(type_a=iq2_s,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 5.24 TFLOPS  6.50 TFLOPS  +1.26 TFLOPS  +19.38%
MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 5.20 TFLOPS  6.03 TFLOPS  +0.83 TFLOPS  +13.76%
MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 3.37 TFLOPS  3.65 TFLOPS  +0.28 TFLOPS  +7.67%
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 3.45 TFLOPS  3.65 TFLOPS  +0.20 TFLOPS  +5.48%
MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 5.97 TFLOPS  6.49 TFLOPS  +0.52 TFLOPS  +8.01%
MUL_MAT(type_a=iq3_s,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 3.15 TFLOPS  3.51 TFLOPS  +0.36 TFLOPS  +10.26%
MUL_MAT(type_a=iq4_xs,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 4.50 TFLOPS  4.72 TFLOPS  +0.22 TFLOPS  +4.66%
MUL_MAT(type_a=f32,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 1.80 TFLOPS  1.56 TFLOPS  -0.24 TFLOPS  -13.33%
MUL_MAT(type_a=f16,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 4.51 TFLOPS  3.84 TFLOPS  -0.67 TFLOPS  -14.86%
  MUL_MAT(type_a=bf16,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): not supported
MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 5.28 TFLOPS  5.28 TFLOPS  +0.00 TFLOPS  -0.00%
MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 4.51 TFLOPS  4.51 TFLOPS  +0.00 TFLOPS  -0.00%
MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 5.53 TFLOPS  4.43 TFLOPS  -1.10 TFLOPS  -19.89%
MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 4.17 TFLOPS  4.19 TFLOPS  +0.02 TFLOPS  +0.48%
MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 2.80 TFLOPS  2.60 TFLOPS  -0.20 TFLOPS  -7.14%
MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 4.20 TFLOPS  4.27 TFLOPS  +0.07 TFLOPS  +1.64%
MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 4.13 TFLOPS  4.12 TFLOPS  -0.01 TFLOPS  -0.24%
MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 4.90 TFLOPS  5.35 TFLOPS  +0.45 TFLOPS  +8.41%
MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 4.85 TFLOPS  5.15 TFLOPS  +0.30 TFLOPS  +5.83%
MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 4.66 TFLOPS  4.75 TFLOPS  +0.09 TFLOPS  +1.89%
MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 6.17 TFLOPS  7.06 TFLOPS  +0.89 TFLOPS  +12.61%
MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 6.22 TFLOPS  7.05 TFLOPS  +0.83 TFLOPS  +11.77%
MUL_MAT(type_a=iq2_s,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 5.97 TFLOPS  6.90 TFLOPS  +0.93 TFLOPS  +13.48%
MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 6.08 TFLOPS  6.89 TFLOPS  +0.81 TFLOPS  +11.76%
MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 1.25 TFLOPS  1.23 TFLOPS  -0.02 TFLOPS  -1.60%
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 1.26 TFLOPS  1.27 TFLOPS  +0.01 TFLOPS  +0.79%
MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 6.79 TFLOPS  6.41 TFLOPS  -0.38 TFLOPS  -5.60%
MUL_MAT(type_a=iq3_s,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 1.05 TFLOPS  1.07 TFLOPS  +0.02 TFLOPS  +1.87%
MUL_MAT(type_a=iq4_xs,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 5.53 TFLOPS  5.61 TFLOPS  +0.08 TFLOPS  +1.43%
MUL_MAT(type_a=f32,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 7.33 TFLOPS  7.99 TFLOPS  +0.66 TFLOPS  +8.26%
MUL_MAT(type_a=f16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 9.41 TFLOPS  10.28 TFLOPS  +0.87 TFLOPS  +8.46%
  MUL_MAT(type_a=bf16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): not supported
MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 11.97 TFLOPS  13.05 TFLOPS  +1.08 TFLOPS  +8.28%
MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 12.08 TFLOPS  13.52 TFLOPS  +1.44 TFLOPS  +10.65%
MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 10.94 TFLOPS  12.31 TFLOPS  +1.37 TFLOPS  +11.13%
MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 11.23 TFLOPS  12.52 TFLOPS  +1.29 TFLOPS  +10.30%
MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 11.58 TFLOPS  13.04 TFLOPS  +1.46 TFLOPS  +11.20%
MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 10.28 TFLOPS  11.51 TFLOPS  +1.23 TFLOPS  +10.69%
MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 9.17 TFLOPS  10.05 TFLOPS  +0.88 TFLOPS  +8.76%
MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 9.37 TFLOPS  10.45 TFLOPS  +1.08 TFLOPS  +10.33%
MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 8.72 TFLOPS  9.66 TFLOPS  +0.94 TFLOPS  +9.73%
MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 9.52 TFLOPS  10.57 TFLOPS  +1.05 TFLOPS  +9.93%
MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): 9.52 TFLOPS  10.58 TFLOPS  +1.06 TFLOPS  +10.02%
MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   9.76 TFLOPS�[0m       10.77 TFLOPS�[0m       +1.01 TFLOPS�[0m      +9.38%
MUL_MAT(type_a=iq2_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    9.47 TFLOPS�[0m       10.54 TFLOPS�[0m       +1.07 TFLOPS�[0m     +10.15%
MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  9.44 TFLOPS�[0m       10.49 TFLOPS�[0m       +1.05 TFLOPS�[0m     +10.01%
MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   10.08 TFLOPS�[0m       11.30 TFLOPS�[0m       +1.22 TFLOPS�[0m     +10.80%
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    9.47 TFLOPS�[0m       10.52 TFLOPS�[0m       +1.05 TFLOPS�[0m      +9.98%
MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  11.64 TFLOPS�[0m       12.75 TFLOPS�[0m       +1.11 TFLOPS�[0m      +8.71%
MUL_MAT(type_a=iq3_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    9.58 TFLOPS�[0m       10.74 TFLOPS�[0m       +1.16 TFLOPS�[0m     +10.80%
MUL_MAT(type_a=iq4_xs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  10.19 TFLOPS�[0m       11.14 TFLOPS�[0m       +0.95 TFLOPS�[0m      +8.53%
IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):    385.95 GB/s�[0m      271.15 GB/s�[0m     -114.80 GB/s�[0m     -29.74%
IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):    208.20 GB/s�[0m      106.04 GB/s�[0m     -102.16 GB/s�[0m     -49.07%
IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[256,256,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):      7.48 GB/s�[0m        6.04 GB/s�[0m       -1.44 GB/s�[0m     -19.25%
IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,2560,1],ne_kernel=[3,3,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):    427.74 GB/s�[0m      247.21 GB/s�[0m     -180.53 GB/s�[0m     -42.21%
IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,2560,1],ne_kernel=[3,3,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):     35.05 GB/s�[0m       15.21 GB/s�[0m      -19.84 GB/s�[0m     -56.60%
IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):    160.50 GB/s�[0m      115.04 GB/s�[0m      -45.46 GB/s�[0m     -28.32%
IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):     52.46 GB/s�[0m       39.15 GB/s�[0m      -13.31 GB/s�[0m     -25.37%
IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[256,256,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):      5.04 GB/s�[0m        5.19 GB/s�[0m       +0.15 GB/s�[0m      +2.89%
IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,2560,1],ne_kernel=[5,5,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):     89.62 GB/s�[0m       99.28 GB/s�[0m       +9.66 GB/s�[0m      +9.73%
IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,2560,1],ne_kernel=[5,5,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):     16.70 GB/s�[0m        7.40 GB/s�[0m       -9.30 GB/s�[0m     -55.69%
  Backend Vulkan0: �[1;32mOK�[0m

Backend 2/2: CPU
  Skipping CPU backend
2/2 backends passed
OK

@0cc4m 0cc4m self-requested a review March 16, 2025 07:17
@0cc4m
Collaborator

0cc4m commented Mar 16, 2025

Here's another RDNA1 comparison; it's not as negative on mul_mat_vec as it was for you:

ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Graphics (RADV GFX1013) (radv) | uma: 1 | fp16: 1 | warp size: 64 | shared memory: 65536 | matrix cores: none
Testing 2 devices

Backend 1/2: Vulkan0
  Device description: AMD Radeon Graphics (RADV GFX1013)
  Device memory: 8011 MB (8011 MB free)


ADD(type=f32,ne=[4096,1,1,1],nr=[1,1,1,1]):                                                              16.56 GB/s       16.55 GB/s       -0.01 GB/s      -0.06%
ADD(type=f32,ne=[4096,1,1,1],nr=[1,512,1,1]):                                                           126.25 GB/s      130.85 GB/s       +4.60 GB/s      +3.52%
CPY(type_src=f32,type_dst=f16,ne=[512,3072,1,1],permute=[0,0,0,0]):                                     336.92 GB/s      332.98 GB/s       -3.94 GB/s      -1.17%
CPY(type_src=f32,type_dst=f32,ne=[8192,512,2,1],permute=[0,2,1,3]):                                     163.47 GB/s      169.34 GB/s       +5.87 GB/s      +3.47%
CPY(type_src=f32,type_dst=f32,ne=[3072,512,2,1],permute=[0,2,1,3]):                                     160.42 GB/s      166.02 GB/s       +5.60 GB/s      +3.37%
SOFT_MAX(type=f32,ne=[4096,4096,5,1],mask=0,m_prec=f32,scale=1.000000,max_bias=0.000000):               253.18 GB/s      255.03 GB/s       +1.85 GB/s      +0.73%
SOFT_MAX(type=f32,ne=[77,4096,5,1],mask=0,m_prec=f32,scale=1.000000,max_bias=0.000000):                 133.63 GB/s      103.23 GB/s      -30.40 GB/s     -22.75%
SOFT_MAX(type=f32,ne=[1024,1024,10,1],mask=0,m_prec=f32,scale=1.000000,max_bias=0.000000):              297.86 GB/s      279.00 GB/s      -18.86 GB/s      -6.33%
SOFT_MAX(type=f32,ne=[77,1024,10,1],mask=0,m_prec=f32,scale=1.000000,max_bias=0.000000):                126.17 GB/s       98.41 GB/s      -27.76 GB/s     -22.00%
SOFT_MAX(type=f32,ne=[256,256,20,1],mask=0,m_prec=f32,scale=1.000000,max_bias=0.000000):                234.25 GB/s      201.73 GB/s      -32.52 GB/s     -13.88%
SOFT_MAX(type=f32,ne=[64,64,20,1],mask=0,m_prec=f32,scale=1.000000,max_bias=0.000000):                   91.38 GB/s       69.55 GB/s      -21.83 GB/s     -23.89%
SOFT_MAX(type=f32,ne=[77,64,20,1],mask=0,m_prec=f32,scale=1.000000,max_bias=0.000000):                   92.82 GB/s       70.45 GB/s      -22.37 GB/s     -24.10%
ARGMAX(type=f32,ne=[32,10,1,1]):                                                                          0.55 GB/s        0.53 GB/s       -0.02 GB/s      -3.64%
ARGMAX(type=f32,ne=[1024,10,1,1]):                                                                        5.56 GB/s        5.57 GB/s       +0.01 GB/s      +0.18%
ARGMAX(type=f32,ne=[32000,512,1,1]):                                                                    296.45 GB/s      176.38 GB/s     -120.07 GB/s     -40.50%
MUL_MAT(type_a=f32,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      200.61 GFLOPS      182.61 GFLOPS      -18.00 GFLOPS      -8.97%
MUL_MAT(type_a=f16,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      374.05 GFLOPS      353.91 GFLOPS      -20.14 GFLOPS      -5.38%
  MUL_MAT(type_a=bf16,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): not supported
MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     837.60 GFLOPS      858.64 GFLOPS      +21.04 GFLOPS      +2.45%
MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     723.50 GFLOPS      710.68 GFLOPS      -12.82 GFLOPS      -1.77%
MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     569.98 GFLOPS      574.10 GFLOPS       +4.12 GFLOPS      +0.72%
MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     528.57 GFLOPS      533.52 GFLOPS       +4.95 GFLOPS      +0.93%
MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     527.07 GFLOPS      493.76 GFLOPS      -33.31 GFLOPS      -6.32%
MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     845.55 GFLOPS      851.89 GFLOPS       +6.34 GFLOPS      +0.74%
MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     641.31 GFLOPS      659.35 GFLOPS      +18.04 GFLOPS      +2.74%
MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     870.28 GFLOPS      876.81 GFLOPS       +6.53 GFLOPS      +0.74%
MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     730.38 GFLOPS      757.25 GFLOPS      +26.87 GFLOPS      +3.55%
MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     779.15 GFLOPS      780.00 GFLOPS       +0.85 GFLOPS      +0.11%
MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  904.63 GFLOPS      940.88 GFLOPS      +36.25 GFLOPS      +3.85%
MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   893.10 GFLOPS      906.18 GFLOPS      +13.08 GFLOPS      +1.44%
MUL_MAT(type_a=iq2_s,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    835.41 GFLOPS      891.71 GFLOPS      +56.30 GFLOPS      +6.31%
MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  864.02 GFLOPS      886.61 GFLOPS      +22.59 GFLOPS      +2.55%
MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    879.17 GFLOPS      950.67 GFLOPS      +71.50 GFLOPS      +7.52%
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    831.61 GFLOPS      875.19 GFLOPS      +43.58 GFLOPS      +4.98%
MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   953.62 GFLOPS      991.14 GFLOPS      +37.52 GFLOPS      +3.79%
MUL_MAT(type_a=iq3_s,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    657.51 GFLOPS      662.80 GFLOPS       +5.29 GFLOPS      +0.80%
MUL_MAT(type_a=iq4_xs,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   457.86 GFLOPS      504.42 GFLOPS      +46.56 GFLOPS      +9.23%
MUL_MAT(type_a=f32,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      380.67 GFLOPS      337.66 GFLOPS      -43.01 GFLOPS     -11.30%
MUL_MAT(type_a=f16,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      654.74 GFLOPS      625.24 GFLOPS      -29.50 GFLOPS      -4.51%
  MUL_MAT(type_a=bf16,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): not supported
MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.27 TFLOPS        1.15 TFLOPS       -0.12 TFLOPS      -9.45%
MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.00 TFLOPS        1.00 TFLOPS       -0.00 TFLOPS      -0.41%
MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     951.15 GFLOPS      957.41 GFLOPS       +6.26 GFLOPS      +0.65%
MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     793.88 GFLOPS      828.36 GFLOPS      +34.48 GFLOPS      +4.16%
MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     647.98 GFLOPS      670.58 GFLOPS      +22.60 GFLOPS      +3.37%
MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.13 TFLOPS        1.18 TFLOPS       +0.05 TFLOPS      +4.24%
MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.00 TFLOPS        1.03 TFLOPS       +0.03 TFLOPS      +2.91%
MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.24 TFLOPS        1.19 TFLOPS       -0.05 TFLOPS      -4.03%
MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.03 TFLOPS        1.13 TFLOPS       +0.10 TFLOPS      +8.85%
MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.23 TFLOPS        1.12 TFLOPS       -0.11 TFLOPS      -8.94%
MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    1.40 TFLOPS        1.49 TFLOPS       +0.09 TFLOPS      +6.04%
MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     1.44 TFLOPS        1.51 TFLOPS       +0.07 TFLOPS      +4.64%
MUL_MAT(type_a=iq2_s,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      1.33 TFLOPS        1.44 TFLOPS       +0.11 TFLOPS      +7.64%
MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    1.31 TFLOPS        1.40 TFLOPS       +0.09 TFLOPS      +6.43%
MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      1.00 TFLOPS        1.19 TFLOPS       +0.19 TFLOPS     +15.97%
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      0.97 TFLOPS        1.23 TFLOPS       +0.26 TFLOPS     +20.90%
MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     1.41 TFLOPS        1.41 TFLOPS       +0.00 TFLOPS      -0.00%
MUL_MAT(type_a=iq3_s,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      0.86 TFLOPS        1.08 TFLOPS       +0.22 TFLOPS     +20.78%
MUL_MAT(type_a=iq4_xs,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   843.92 GFLOPS      879.70 GFLOPS      +35.78 GFLOPS      +4.07%
MUL_MAT(type_a=f32,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      540.57 GFLOPS      495.25 GFLOPS      -45.32 GFLOPS      -8.38%
MUL_MAT(type_a=f16,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      861.63 GFLOPS      806.48 GFLOPS      -55.15 GFLOPS      -6.40%
  MUL_MAT(type_a=bf16,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): not supported
MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.27 TFLOPS        1.41 TFLOPS       +0.14 TFLOPS      +9.93%
MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.20 TFLOPS        1.21 TFLOPS       +0.01 TFLOPS      +0.83%
MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.16 TFLOPS        1.10 TFLOPS       -0.06 TFLOPS      -5.17%
MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.03 TFLOPS        1.03 TFLOPS       +0.00 TFLOPS      -0.00%
MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     788.01 GFLOPS      737.93 GFLOPS      -50.08 GFLOPS      -6.36%
MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.32 TFLOPS        1.38 TFLOPS       +0.06 TFLOPS      +4.35%
MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.21 TFLOPS        1.24 TFLOPS       +0.03 TFLOPS      +2.42%
MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.43 TFLOPS        1.43 TFLOPS       +0.00 TFLOPS      -0.00%
MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.28 TFLOPS        1.31 TFLOPS       +0.03 TFLOPS      +2.29%
MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.49 TFLOPS        1.47 TFLOPS       -0.02 TFLOPS      -1.34%
MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    1.63 TFLOPS        1.93 TFLOPS       +0.30 TFLOPS     +15.54%
MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     1.76 TFLOPS        1.91 TFLOPS       +0.15 TFLOPS      +7.85%
MUL_MAT(type_a=iq2_s,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      1.58 TFLOPS        1.80 TFLOPS       +0.22 TFLOPS     +12.22%
MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    1.52 TFLOPS        1.85 TFLOPS       +0.33 TFLOPS     +17.84%
MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      1.06 TFLOPS        1.25 TFLOPS       +0.19 TFLOPS     +15.20%
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      1.08 TFLOPS        1.28 TFLOPS       +0.20 TFLOPS     +15.62%
MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     1.77 TFLOPS        1.88 TFLOPS       +0.11 TFLOPS      +5.85%
MUL_MAT(type_a=iq3_s,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      0.88 TFLOPS        1.02 TFLOPS       +0.14 TFLOPS     +13.69%
MUL_MAT(type_a=iq4_xs,type_b=f32,m=4096,n=3,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     1.14 TFLOPS        1.17 TFLOPS       +0.03 TFLOPS      +2.56%
MUL_MAT(type_a=f32,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      689.66 GFLOPS      631.37 GFLOPS      -58.29 GFLOPS      -8.45%
MUL_MAT(type_a=f16,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                        1.01 TFLOPS        0.92 TFLOPS       -0.09 TFLOPS      -8.64%
  MUL_MAT(type_a=bf16,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): not supported
MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.49 TFLOPS        1.46 TFLOPS       -0.03 TFLOPS      -2.01%
MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.31 TFLOPS        1.31 TFLOPS       +0.00 TFLOPS      -0.00%
MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.32 TFLOPS        1.36 TFLOPS       +0.04 TFLOPS      +2.94%
MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.15 TFLOPS        1.16 TFLOPS       +0.01 TFLOPS      +0.86%
MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     829.80 GFLOPS      760.90 GFLOPS      -68.90 GFLOPS      -8.30%
MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.36 TFLOPS        1.39 TFLOPS       +0.03 TFLOPS      +2.16%
MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.25 TFLOPS        1.30 TFLOPS       +0.05 TFLOPS      +3.85%
MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.51 TFLOPS        1.55 TFLOPS       +0.04 TFLOPS      +2.58%
MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.32 TFLOPS        1.31 TFLOPS       -0.01 TFLOPS      -0.76%
MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.50 TFLOPS        1.53 TFLOPS       +0.03 TFLOPS      +1.96%
MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    1.81 TFLOPS        2.07 TFLOPS       +0.26 TFLOPS     +12.56%
MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     1.76 TFLOPS        2.05 TFLOPS       +0.29 TFLOPS     +14.15%
MUL_MAT(type_a=iq2_s,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      1.62 TFLOPS        1.97 TFLOPS       +0.35 TFLOPS     +17.77%
MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    1.68 TFLOPS        1.97 TFLOPS       +0.29 TFLOPS     +14.72%
MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      1.01 TFLOPS        1.06 TFLOPS       +0.05 TFLOPS      +4.72%
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      1.01 TFLOPS        1.08 TFLOPS       +0.07 TFLOPS      +6.48%
MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     1.99 TFLOPS        2.00 TFLOPS       +0.01 TFLOPS      +0.50%
MUL_MAT(type_a=iq3_s,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    905.57 GFLOPS      956.36 GFLOPS      +50.79 GFLOPS      +5.31%
MUL_MAT(type_a=iq4_xs,type_b=f32,m=4096,n=4,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     1.32 TFLOPS        1.37 TFLOPS       +0.05 TFLOPS      +3.65%
MUL_MAT(type_a=f32,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      827.61 GFLOPS      728.83 GFLOPS      -98.78 GFLOPS     -11.94%
MUL_MAT(type_a=f16,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                        1.19 TFLOPS        0.99 TFLOPS       -0.20 TFLOPS     -16.47%
  MUL_MAT(type_a=bf16,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): not supported
MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.67 TFLOPS        1.62 TFLOPS       -0.05 TFLOPS      -2.99%
MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.39 TFLOPS        1.39 TFLOPS       +0.00 TFLOPS      -0.00%
MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.43 TFLOPS        1.34 TFLOPS       -0.09 TFLOPS      -6.29%
MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.21 TFLOPS        1.21 TFLOPS       +0.00 TFLOPS      -0.00%
MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     808.22 GFLOPS      848.22 GFLOPS      +40.00 GFLOPS      +4.72%
MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.35 TFLOPS        1.46 TFLOPS       +0.11 TFLOPS      +7.53%
MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.34 TFLOPS        1.40 TFLOPS       +0.06 TFLOPS      +4.29%
MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.53 TFLOPS        1.59 TFLOPS       +0.06 TFLOPS      +3.77%
MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.45 TFLOPS        1.54 TFLOPS       +0.09 TFLOPS      +5.84%
MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.68 TFLOPS        1.73 TFLOPS       +0.05 TFLOPS      +2.89%
MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    1.84 TFLOPS        2.19 TFLOPS       +0.35 TFLOPS     +15.98%
MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     1.77 TFLOPS        2.19 TFLOPS       +0.42 TFLOPS     +19.18%
MUL_MAT(type_a=iq2_s,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      1.65 TFLOPS        2.06 TFLOPS       +0.41 TFLOPS     +19.90%
MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    1.70 TFLOPS        2.03 TFLOPS       +0.33 TFLOPS     +16.26%
MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      0.96 TFLOPS        1.08 TFLOPS       +0.12 TFLOPS     +10.66%
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      0.99 TFLOPS        1.13 TFLOPS       +0.14 TFLOPS     +12.17%
MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     2.14 TFLOPS        2.13 TFLOPS       -0.01 TFLOPS      -0.47%
MUL_MAT(type_a=iq3_s,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    894.15 GFLOPS      965.20 GFLOPS      +71.05 GFLOPS      +7.36%
MUL_MAT(type_a=iq4_xs,type_b=f32,m=4096,n=5,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     1.47 TFLOPS        1.64 TFLOPS       +0.17 TFLOPS     +10.37%
MUL_MAT(type_a=f32,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                        1.17 TFLOPS        0.96 TFLOPS       -0.21 TFLOPS     -17.68%
MUL_MAT(type_a=f16,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                        1.46 TFLOPS        1.15 TFLOPS       -0.31 TFLOPS     -21.23%
  MUL_MAT(type_a=bf16,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): not supported
MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.72 TFLOPS        1.64 TFLOPS       -0.08 TFLOPS      -4.65%
MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.48 TFLOPS        1.52 TFLOPS       +0.04 TFLOPS      +2.63%
MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.70 TFLOPS        1.62 TFLOPS       -0.08 TFLOPS      -4.71%
MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.34 TFLOPS        1.37 TFLOPS       +0.03 TFLOPS      +2.19%
MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     908.84 GFLOPS      889.57 GFLOPS      -19.27 GFLOPS      -2.12%
MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.34 TFLOPS        1.31 TFLOPS       -0.03 TFLOPS      -2.24%
MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.33 TFLOPS        1.25 TFLOPS       -0.08 TFLOPS      -6.02%
MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.63 TFLOPS        1.73 TFLOPS       +0.10 TFLOPS      +5.78%
MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.49 TFLOPS        1.62 TFLOPS       +0.13 TFLOPS      +8.02%
MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                       1.79 TFLOPS        1.74 TFLOPS       -0.05 TFLOPS      -2.79%
MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    1.96 TFLOPS        2.00 TFLOPS       +0.04 TFLOPS      +2.00%
MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     1.95 TFLOPS        1.98 TFLOPS       +0.03 TFLOPS      +1.52%
MUL_MAT(type_a=iq2_s,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      1.87 TFLOPS        1.98 TFLOPS       +0.11 TFLOPS      +5.56%
MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    1.85 TFLOPS        1.87 TFLOPS       +0.02 TFLOPS      +1.07%
MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    398.53 GFLOPS      424.71 GFLOPS      +26.18 GFLOPS      +6.16%
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    408.13 GFLOPS      419.71 GFLOPS      +11.58 GFLOPS      +2.76%
MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     2.36 TFLOPS        2.54 TFLOPS       +0.18 TFLOPS      +7.09%
MUL_MAT(type_a=iq3_s,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    326.80 GFLOPS      357.76 GFLOPS      +30.96 GFLOPS      +8.65%
MUL_MAT(type_a=iq4_xs,type_b=f32,m=4096,n=8,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     1.89 TFLOPS        1.97 TFLOPS       +0.08 TFLOPS      +4.06%
MUL_MAT(type_a=f32,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      2.28 TFLOPS        2.76 TFLOPS       +0.48 TFLOPS     +17.39%
MUL_MAT(type_a=f16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                      3.25 TFLOPS        3.26 TFLOPS       +0.01 TFLOPS      +0.31%
  MUL_MAT(type_a=bf16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]): not supported
MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     3.93 TFLOPS        4.49 TFLOPS       +0.56 TFLOPS     +12.47%
MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     4.13 TFLOPS        4.53 TFLOPS       +0.40 TFLOPS      +8.83%
MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     3.78 TFLOPS        4.21 TFLOPS       +0.43 TFLOPS     +10.21%
MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     3.77 TFLOPS        4.19 TFLOPS       +0.42 TFLOPS     +10.02%
MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     3.94 TFLOPS        4.45 TFLOPS       +0.51 TFLOPS     +11.46%
MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     3.55 TFLOPS        3.81 TFLOPS       +0.26 TFLOPS      +6.82%
MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     3.18 TFLOPS        3.52 TFLOPS       +0.34 TFLOPS      +9.66%
MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     3.09 TFLOPS        3.48 TFLOPS       +0.39 TFLOPS     +11.21%
MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     2.97 TFLOPS        3.24 TFLOPS       +0.27 TFLOPS      +8.33%
MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                     3.34 TFLOPS        3.66 TFLOPS       +0.32 TFLOPS      +8.74%
MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  3.21 TFLOPS        3.53 TFLOPS       +0.32 TFLOPS      +9.07%
MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   3.19 TFLOPS        3.51 TFLOPS       +0.32 TFLOPS      +9.12%
MUL_MAT(type_a=iq2_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    3.03 TFLOPS        3.48 TFLOPS       +0.45 TFLOPS     +12.93%
MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  3.22 TFLOPS        3.55 TFLOPS       +0.33 TFLOPS      +9.30%
MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    3.32 TFLOPS        3.72 TFLOPS       +0.40 TFLOPS     +10.75%
MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    3.08 TFLOPS        3.43 TFLOPS       +0.35 TFLOPS     +10.20%
MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   3.81 TFLOPS        4.22 TFLOPS       +0.41 TFLOPS      +9.72%
MUL_MAT(type_a=iq3_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                    3.25 TFLOPS        3.53 TFLOPS       +0.28 TFLOPS      +7.93%
MUL_MAT(type_a=iq4_xs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   3.33 TFLOPS        3.67 TFLOPS       +0.34 TFLOPS      +9.26%
IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):    117.34 GB/s       98.53 GB/s      -18.81 GB/s     -16.03%
IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):     58.04 GB/s       30.62 GB/s      -27.42 GB/s     -47.24%
IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[256,256,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):      7.04 GB/s        5.07 GB/s       -1.97 GB/s     -27.98%
IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,2560,1],ne_kernel=[3,3,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):    127.38 GB/s       80.36 GB/s      -47.02 GB/s     -36.91%
IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,2560,1],ne_kernel=[3,3,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):     27.14 GB/s        9.19 GB/s      -17.95 GB/s     -66.14%
IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):     78.32 GB/s       53.59 GB/s      -24.73 GB/s     -31.58%
IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):     25.28 GB/s       16.25 GB/s       -9.03 GB/s     -35.72%
IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[256,256,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):      6.15 GB/s        8.03 GB/s       +1.88 GB/s     +23.41%
IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,2560,1],ne_kernel=[5,5,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):     75.67 GB/s       43.12 GB/s      -32.55 GB/s     -43.02%
IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,2560,1],ne_kernel=[5,5,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):     17.00 GB/s        5.91 GB/s      -11.09 GB/s     -65.24%
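
For readers wondering where the "warp size: 64" in the device line above and the wave32/wave64 choice come from, here is a minimal sketch (not code from this PR; the function name and the printing are illustrative only) of how the default subgroup size and the range allowed by VK_EXT_subgroup_size_control can be queried:

```cpp
// Minimal sketch (not from this PR): query the default subgroup size and the
// range allowed by VK_EXT_subgroup_size_control. Extension checks and error
// handling are omitted; `physical_device` is assumed to be valid.
#include <vulkan/vulkan.h>
#include <cstdio>

static void print_subgroup_info(VkPhysicalDevice physical_device) {
    VkPhysicalDeviceSubgroupSizeControlPropertiesEXT size_control = {};
    size_control.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_SUBGROUP_SIZE_CONTROL_PROPERTIES_EXT;

    VkPhysicalDeviceSubgroupProperties subgroup = {};
    subgroup.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_SUBGROUP_PROPERTIES;
    subgroup.pNext = &size_control;

    VkPhysicalDeviceProperties2 props2 = {};
    props2.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_PROPERTIES_2;
    props2.pNext = &subgroup;

    vkGetPhysicalDeviceProperties2(physical_device, &props2);

    // RDNA GPUs can run compute in wave32 or wave64, so min/max usually come
    // back as 32/64, while GCN reports a fixed 64. subgroupSize is the
    // driver's default and is presumably what shows up as "warp size" above.
    printf("default subgroup size: %u, min: %u, max: %u\n",
           subgroup.subgroupSize,
           size_control.minSubgroupSize,
           size_control.maxSubgroupSize);
}
```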

@daniandtheweb
Contributor Author

Thanks for the results. Is GFX1013 only the PlayStation 5 GPU, or is there any other card using that chip? Does llama-bench or stable-diffusion.cpp performance increase on either of those two cards with subgroup 32, or is there any regression in those real-world use cases?

@0cc4m
Collaborator

0cc4m commented Mar 16, 2025

Thanks for the results. Is GFX1013 only the PlayStation 5 GPU, or is there any other card using that chip? Does llama-bench or stable-diffusion.cpp performance increase on either of those two cards with subgroup 32, or is there any regression in those real-world use cases?

It's an AMD BC-250, based on the PS5 APU, yeah. Actual performance in llama-bench does increase with your RDNA1 tuning:

| model | size | params | backend | ngl | test | t/s Master | t/s PR | t/s subgroup 32 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Vulkan | 99 | pp512 | 278.58 ± 0.27 | 313.68 ± 0.04 | 313.23 ± 0.05 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | Vulkan | 99 | tg128 | 45.41 ± 0.46 | 44.96 ± 0.02 | 45.43 ± 0.03 |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | Vulkan | 99 | pp512 | 226.23 ± 0.13 | 249.13 ± 0.04 | 249.06 ± 0.05 |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | Vulkan | 99 | tg128 | 47.57 ± 0.01 | 47.53 ± 0.04 | 47.77 ± 0.02 |

There might be a small regression in mul_mat_vec when you keep it at 64, but it's hard to tell.

@0cc4m
Collaborator

0cc4m commented Mar 16, 2025

Let's not hold up this PR with efforts to tune perfectly. We have a decent idea for now and can merge it with or without MMV at subgroup 32. Over time we'll get a clearer picture and can adjust it.

@daniandtheweb
Contributor Author

daniandtheweb commented Mar 16, 2025

I adjusted the code to remove the workaround I had used to avoid duplicating code. If RDNA2 also gets a slight performance improvement from these changes, I think it's ready to be merged.
Regarding the mul_mat_vec regression you mentioned, I'm actually not able to find any, at least on RDNA1. The card you tested is a bit different from a normal RDNA1 card, though, so the fine-tuning I did on the 5700 XT may behave differently on it. If the regression is minimal, I think it can be addressed in the future, since for now the changes improve performance overall.

Collaborator

@0cc4m 0cc4m left a comment

There seem to be differences between Q4_0 and Q4_K_S on RDNA2, but it's a minor tg regression at most. It's good enough for now.

Thank you for the tuning work!

@0cc4m 0cc4m merged commit cf2270e into ggml-org:master Mar 17, 2025
47 checks passed
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Mar 19, 2025
* vulkan: subgroup size test

* Vulkan: Add device architecture enum and logic to recognize AMD generations

* vulkan: use new architecture logic to specify subgroup size

* Initial vulkan subgroup size tuning for RDNA3

* vulkan: commonize RDNA subgroup tuning

* vulkan: override subgroup size if required_subgroup_size = 0

* vulkan: disable warp 32 for RDNA3

* vulkan: fine tuned RDNA1 subgroup sizes

* vulkan: adjusted subgroup size map

* vulkan: fixed RDNA2 subgroup map

---------

Co-authored-by: 0cc4m <[email protected]>
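
As a closing illustration of the Vulkan mechanism the commits above rely on, this sketch shows how a required compute subgroup size can be chained into pipeline creation, treating required_subgroup_size == 0 as "no override". It assumes VK_EXT_subgroup_size_control (or Vulkan 1.3) is available; the helper name and parameters are illustrative, not the code merged here.

```cpp
// Sketch: attach a required subgroup size to a compute shader stage, with 0
// meaning "no requirement, let the driver decide". Not the merged code.
#include <vulkan/vulkan.h>

static VkPipelineShaderStageCreateInfo make_compute_stage(
        VkShaderModule module,
        uint32_t required_subgroup_size,
        VkPipelineShaderStageRequiredSubgroupSizeCreateInfoEXT * subgroup_info) {
    VkPipelineShaderStageCreateInfo stage = {};
    stage.sType  = VK_STRUCTURE_TYPE_PIPELINE_SHADER_STAGE_CREATE_INFO;
    stage.stage  = VK_SHADER_STAGE_COMPUTE_BIT;
    stage.module = module;
    stage.pName  = "main";

    if (required_subgroup_size != 0) {
        // Force a specific wave size (e.g. 32 on RDNA1/RDNA2 for most shaders).
        subgroup_info->sType =
            VK_STRUCTURE_TYPE_PIPELINE_SHADER_STAGE_REQUIRED_SUBGROUP_SIZE_CREATE_INFO_EXT;
        subgroup_info->pNext = nullptr;
        subgroup_info->requiredSubgroupSize = required_subgroup_size;
        stage.pNext = subgroup_info;
    }
    // With required_subgroup_size == 0 nothing is chained and the driver's
    // default subgroup size applies.
    return stage;
}
```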