vulkan: matmul gcn tuning #13016

Merged · 4 commits merged into ggml-org:master on Apr 24, 2025

Conversation

@netrunnereve (Collaborator) commented Apr 18, 2025

I tried to do some manual tuning on the mmq warptile settings and I'm seeing good improvements on my end. Right now these changes are only applied to AMD GCN, but they might help other chips as well.

Results on my RX 470, locked to 900 MHz so the power limit doesn't mess with my numbers:

PR

  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                       42 runs - 25039.55 us/run -  60.13 GFLOP/run -   2.40 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                       42 runs - 24727.38 us/run -  60.13 GFLOP/run -   2.43 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                       38 runs - 26618.84 us/run -  60.13 GFLOP/run -   2.26 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                       38 runs - 26566.58 us/run -  60.13 GFLOP/run -   2.26 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                       40 runs - 25383.05 us/run -  60.13 GFLOP/run -   2.37 TFLOPS
  MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                       34 runs - 30912.41 us/run -  60.13 GFLOP/run -   1.95 TFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                       32 runs - 31700.53 us/run -  60.13 GFLOP/run -   1.90 TFLOPS
  MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                       34 runs - 30709.47 us/run -  60.13 GFLOP/run -   1.96 TFLOPS
  MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                       30 runs - 33692.97 us/run -  60.13 GFLOP/run -   1.78 TFLOPS
  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                       34 runs - 30832.82 us/run -  60.13 GFLOP/run -   1.95 TFLOPS
  MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                    32 runs - 31994.59 us/run -  60.13 GFLOP/run -   1.88 TFLOPS
  MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                     32 runs - 31827.38 us/run -  60.13 GFLOP/run -   1.89 TFLOPS
  MUL_MAT(type_a=iq2_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      26 runs - 39016.85 us/run -  60.13 GFLOP/run -   1.54 TFLOPS
  MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                    32 runs - 32360.03 us/run -  60.13 GFLOP/run -   1.86 TFLOPS
  MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      34 runs - 30087.26 us/run -  60.13 GFLOP/run -   2.00 TFLOPS
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      32 runs - 32532.38 us/run -  60.13 GFLOP/run -   1.85 TFLOPS
  MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                     40 runs - 25327.12 us/run -  60.13 GFLOP/run -   2.37 TFLOPS
  MUL_MAT(type_a=iq3_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      34 runs - 30346.79 us/run -  60.13 GFLOP/run -   1.98 TFLOPS
  MUL_MAT(type_a=iq4_xs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                     34 runs - 29419.88 us/run -  60.13 GFLOP/run -   2.04 TFLOPS
| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | pp512 | 171.52 ± 0.50 |
| llama 8B Q2_K - Medium | 2.95 GiB | 8.03 B | Vulkan | 99 | pp512 | 144.36 ± 0.49 |
| llama 8B IQ3_XXS - 3.0625 bpw | 3.04 GiB | 8.03 B | Vulkan | 99 | pp512 | 136.64 ± 0.45 |

Master

  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                       30 runs - 33936.63 us/run -  60.13 GFLOP/run -   1.77 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                       30 runs - 33703.70 us/run -  60.13 GFLOP/run -   1.78 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                       26 runs - 38655.42 us/run -  60.13 GFLOP/run -   1.56 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                       28 runs - 37412.71 us/run -  60.13 GFLOP/run -   1.61 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                       30 runs - 34417.07 us/run -  60.13 GFLOP/run -   1.75 TFLOPS
  MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                       26 runs - 40389.42 us/run -  60.13 GFLOP/run -   1.49 TFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                       24 runs - 45311.38 us/run -  60.13 GFLOP/run -   1.33 TFLOPS
  MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                       24 runs - 44301.62 us/run -  60.13 GFLOP/run -   1.36 TFLOPS
  MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                       22 runs - 47887.09 us/run -  60.13 GFLOP/run -   1.26 TFLOPS
  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                       24 runs - 41909.17 us/run -  60.13 GFLOP/run -   1.43 TFLOPS
  MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                    24 runs - 43341.67 us/run -  60.13 GFLOP/run -   1.39 TFLOPS
  MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                     24 runs - 42465.96 us/run -  60.13 GFLOP/run -   1.42 TFLOPS
  MUL_MAT(type_a=iq2_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      22 runs - 49311.50 us/run -  60.13 GFLOP/run -   1.22 TFLOPS
  MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                    24 runs - 43765.83 us/run -  60.13 GFLOP/run -   1.37 TFLOPS
  MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      26 runs - 39816.38 us/run -  60.13 GFLOP/run -   1.51 TFLOPS
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      24 runs - 43036.33 us/run -  60.13 GFLOP/run -   1.40 TFLOPS
  MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                     30 runs - 34430.73 us/run -  60.13 GFLOP/run -   1.75 TFLOPS
  MUL_MAT(type_a=iq3_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      24 runs - 41953.67 us/run -  60.13 GFLOP/run -   1.43 TFLOPS
  MUL_MAT(type_a=iq4_xs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                     24 runs - 41902.25 us/run -  60.13 GFLOP/run -   1.43 TFLOPS
| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | pp512 | 136.10 ± 0.41 |
| llama 8B Q2_K - Medium | 2.95 GiB | 8.03 B | Vulkan | 99 | pp512 | 105.10 ± 0.40 |
| llama 8B IQ3_XXS - 3.0625 bpw | 3.04 GiB | 8.03 B | Vulkan | 99 | pp512 | 102.03 ± 0.23 |

I also tried tuning the small and large warptiles by setting them as the default, but I wasn't able to get them to beat the 256-thread medium shader. The FP16 and FP32 shaders already perform best on GCN using the existing parameters.
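
To give a sense of what this kind of tune boils down to, here's a rough C++ sketch of the idea. The struct, field names, and numbers below are made up for illustration only; the real warptile setup lives in ggml-vulkan.cpp and is organised differently.

```cpp
#include <cstdint>

// Hypothetical warptile description: tile sizes processed per workgroup plus
// the workgroup size in threads. Names and values are illustrative only.
struct warptile_params {
    uint32_t tile_m, tile_n, tile_k; // output tile handled per workgroup
    uint32_t wg_size;                // threads per workgroup
};

// Placeholder defaults: a generic "medium" tune versus a GCN-specific one
// that keeps the 256-thread medium shader but uses smaller tiles.
constexpr warptile_params mmq_medium_generic { 64, 64, 32, 256 };
constexpr warptile_params mmq_medium_gcn     { 32, 32, 32, 256 };

// Device dispatch: GCN gets its own tune, everything else keeps the default.
inline warptile_params pick_mmq_warptile(bool is_amd_gcn) {
    return is_amd_gcn ? mmq_medium_gcn : mmq_medium_generic;
}
```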

@netrunnereve marked this pull request as ready for review April 18, 2025 21:39
@github-actions bot added the Vulkan (Issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) labels Apr 18, 2025
@netrunnereve marked this pull request as draft April 19, 2025 03:26
@netrunnereve (Collaborator, Author) commented:

I'm setting this back to draft while I adjust it a little bit more...

@netrunnereve (Collaborator, Author) commented Apr 19, 2025

Okay, I think it's ready. The 16x16 tiles I'm using now perform like the 64x16 ones at a fixed clock speed, but once I turn frequency scaling back on the chip manages to clock higher than before and I get a 4% improvement in pp512 speed. Maybe the smaller tile sizes make it run more efficiently?

Since all the threads in the workgroup do the same calculations and share the same memory we don't necessarily have to make the shader's warp size match the subgroup size.
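
In plain Vulkan terms, what I mean is something like this minimal sketch (not the actual backend code): the subgroup size is whatever the driver reports, while the mmq shader's workgroup size is purely a tuning choice.

```cpp
#include <vulkan/vulkan.h>
#include <cstdint>

// Ask the driver for the hardware subgroup (wave) size.
uint32_t query_subgroup_size(VkPhysicalDevice dev) {
    VkPhysicalDeviceSubgroupProperties subgroup {};
    subgroup.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_SUBGROUP_PROPERTIES;

    VkPhysicalDeviceProperties2 props {};
    props.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_PROPERTIES_2;
    props.pNext = &subgroup;

    vkGetPhysicalDeviceProperties2(dev, &props);
    return subgroup.subgroupSize; // 64 on GCN
}

// The shader's workgroup size is picked for performance, independent of the
// value above, because the kernel only relies on shared memory and barriers.
constexpr uint32_t MMQ_MEDIUM_WORKGROUP_SIZE = 256; // illustrative value
```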

@netrunnereve marked this pull request as ready for review April 19, 2025 21:19
@netrunnereve requested a review from 0cc4m April 19, 2025 21:24
@masamaru-san commented:

Oh... this does not seem to be suitable for my Ryzen integrated graphics (Ryzen 5700U with Radeon Graphics, gfx90c), at least.

The most obvious difference shows up in test-backend-ops, where the FLOPS values drop by about 8 percent at n=512, except for f16 and f32.
Even at the application level, generation time increases by about 4 percent when running stable-diffusion-v1.5 in stable-diffusion.cpp.

before

  MUL_MAT(type_a=f32,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                 8 runs - 141894.75 us/run -  60.13 GFLOP/run - 423.76 GFLOPS
  MUL_MAT(type_a=f16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                10 runs - 104162.00 us/run -  60.13 GFLOP/run - 577.27 GFLOPS
  MUL_MAT(type_a=bf16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0): not supported
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                       14 runs - 74643.71 us/run -  60.13 GFLOP/run - 805.55 GFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                       14 runs - 72368.29 us/run -  60.13 GFLOP/run - 830.88 GFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                       12 runs - 84916.33 us/run -  60.13 GFLOP/run - 708.10 GFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                       14 runs - 80803.64 us/run -  60.13 GFLOP/run - 744.14 GFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                       14 runs - 77914.29 us/run -  60.13 GFLOP/run - 771.74 GFLOPS
... 

after (this PR)

  MUL_MAT(type_a=f32,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                 8 runs - 141682.25 us/run -  60.13 GFLOP/run - 424.40 GFLOPS
  MUL_MAT(type_a=f16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                10 runs - 104117.40 us/run -  60.13 GFLOP/run - 577.52 GFLOPS
  MUL_MAT(type_a=bf16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0): not supported
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                       12 runs - 84012.67 us/run -  60.13 GFLOP/run - 715.72 GFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                       12 runs - 85369.75 us/run -  60.13 GFLOP/run - 704.34 GFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                       14 runs - 82779.86 us/run -  60.13 GFLOP/run - 726.38 GFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                       12 runs - 93338.33 us/run -  60.13 GFLOP/run - 644.21 GFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                       12 runs - 90233.58 us/run -  60.13 GFLOP/run - 666.38 GFLOPS
...

Maybe, like Java and other VMs, we should store performance profile data for each device and apply it automatically?
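
For example, something along these lines. This is purely hypothetical, nothing like it exists in the backend today: a lookup keyed by the reported device name, which an offline auto-tuner could fill in and cache to disk.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical per-device matmul tuning profile (illustrative names/values).
struct mm_profile {
    uint32_t tile_m, tile_n;
    uint32_t wg_size;
};

// Could be generated by an auto-tuner on first run and persisted to disk.
// The entries below are examples only.
static const std::unordered_map<std::string, mm_profile> k_profiles = {
    { "AMD Radeon RX 470 Graphics (RADV POLARIS10)", { 16, 16, 256 } }, // example entry
    { "AMD Radeon(TM) Graphics",                     { 64, 16, 256 } }, // example entry
};

inline mm_profile pick_profile(const std::string & device_name, mm_profile fallback) {
    auto it = k_profiles.find(device_name);
    return it != k_profiles.end() ? it->second : fallback;
}
```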

@0cc4m (Collaborator) commented Apr 20, 2025

@masamaru-san Please also do before and after tests with a model, because the unit tests are not reliable for judging whether it is actually detrimental.

@netrunnereve (Collaborator, Author) commented:

> The most obvious difference shows up in test-backend-ops, where the FLOPS values drop by about 8 percent at n=512, except for f16 and f32.
> Even at the application level, generation time increases by about 4 percent when running stable-diffusion-v1.5 in stable-diffusion.cpp.

My first thought was that your integrated graphics only had two cores, as that wouldn't handle 256 threads well, but that's not the case since you have eight cores. It's also not a Vega or FP16 issue, as @0cc4m's card is doing fine.

Since your 5700U is a 15 W chip, though, this might actually be a power issue. For example, when prompt processing Llama 2 7B Q4_0 on master my 470 runs at 1.15 GHz and gets 171 t/s. With my PR it only runs at 1 GHz but gets 189 t/s, and both times I'm hitting the 130 A TDC limit on my card. Please run with a real model and compare the GPU clock speeds and power levels against master. On Linux you can use radeontop and sensors for this; for Windows I have no idea 🤷‍♀️.

> Maybe, like Java and other VMs, we should store performance profile data for each device and apply it automatically?

Obviously that's the best option but it's a lot of work. Right now everyone can just submit tunes as PRs as it's not that hard to do.

@masamaru-san commented:

I rechecked the degree of performance change for AMD Ryzen 7 5700U with Radeon Graphics (Lucienne/gfx90c).

The conclusion is that test-backend-ops.exe was about 16% slower on average at Q8_0, n=512, while sd.exe was only about 2% slower on average under real conditions.
I ran sd.exe and test-backend-ops.exe alternately with and without this PR applied and made a comparison. This check was done with automatic power limiting turned off by RyzenAdj.

Perhaps this is also due to AMD's Vulkan driver for Windows?


Environment
  • OS

    • Windows 11 24H2 Home
      > cmd /c ver
      
      Microsoft Windows [Version 10.0.26100.3775]
  • device info

    • HP Pavilion Laptop 15-eh1080AU

      • BIOS: AMI F.30 - AMD AGESA CezannePI-FP6 1.0.1.1 12/02/2024
      • VRAM 512MB + 32 GB UMA (64 GB RAM)
      • vulkaninfoSDK
      > E:\VulkanSDK\1.4.309.0\Bin\vulkaninfoSDK.exe --summary
      
      WARNING: [Loader Message] Code 0 : Layer VK_LAYER_AMD_switchable_graphics uses API version 1.3 
      which is older than the application specified API version of 1.4. May cause issues.
      ==========
      VULKANINFO
      ==========
      
      Vulkan Instance Version: 1.4.309
      
      
      Instance Extensions: count = 13
      -------------------------------
      VK_EXT_debug_report                    : extension revision 10
      VK_EXT_debug_utils                     : extension revision 2
      VK_EXT_swapchain_colorspace            : extension revision 4
      VK_KHR_device_group_creation           : extension revision 1
      VK_KHR_external_fence_capabilities     : extension revision 1
      VK_KHR_external_memory_capabilities    : extension revision 1
      VK_KHR_external_semaphore_capabilities : extension revision 1
      VK_KHR_get_physical_device_properties2 : extension revision 2
      VK_KHR_get_surface_capabilities2       : extension revision 1
      VK_KHR_portability_enumeration         : extension revision 1
      VK_KHR_surface                         : extension revision 25
      VK_KHR_win32_surface                   : extension revision 6
      VK_LUNARG_direct_driver_loading        : extension revision 1
      
      Instance Layers: count = 10
      ---------------------------
      VK_LAYER_AMD_switchable_graphics  AMD switchable graphics layer                 1.3.260  version 1
      VK_LAYER_KHRONOS_profiles         Khronos Profiles layer                        1.4.309  version 1
      VK_LAYER_KHRONOS_shader_object    Khronos Shader object layer                   1.4.309  version 1
      VK_LAYER_KHRONOS_synchronization2 Khronos Synchronization2 layer                1.4.309  version 1
      VK_LAYER_KHRONOS_validation       Khronos Validation Layer                      1.4.309  version 1
      VK_LAYER_LUNARG_api_dump          LunarG API dump layer                         1.4.309  version 2
      VK_LAYER_LUNARG_crash_diagnostic  Crash Diagnostic Layer is a crash/hang debugging tool that helps determines GPU progress in a Vulkan application.  1.4.309  version 1
      VK_LAYER_LUNARG_gfxreconstruct    GFXReconstruct Capture Layer Version 1.0.5    1.4.309  version 4194309
      VK_LAYER_LUNARG_monitor           Execution Monitoring Layer                    1.4.309  version 1
      VK_LAYER_LUNARG_screenshot        LunarG image capture layer                    1.4.309  version 1
      
      Devices:
      ========
      GPU0:
              apiVersion         = 1.3.260
              driverVersion      = 2.0.279
              vendorID           = 0x1002
              deviceID           = 0x164c
              deviceType         = PHYSICAL_DEVICE_TYPE_INTEGRATED_GPU
              deviceName         = AMD Radeon(TM) Graphics
              driverID           = DRIVER_ID_AMD_PROPRIETARY
              driverName         = AMD proprietary driver
              driverInfo         = 24.9.1 (AMD proprietary shader compiler)
              conformanceVersion = 1.3.3.1
              deviceUUID         = 00000000-0400-0000-0000-000000000000
              driverUUID         = 414d442d-5749-4e2d-4452-560000000000
    • Disable power limiting control:
      Temporarily disabled via RyzenAdj, so the graphics core can always boost to 1900 MHz (this device's max). In this case, however, this was not necessary since each test case was shorter than 280 seconds.

Build toolset
  • IDE: Microsoft Visual Studio
    • ver 17.13.6 Community

    • Using cmake (msvc internal)

      cmake version 3.30.5-msvc23
      
      CMake suite maintained and supported by Kitware (kitware.com/cmake).
    • build toolset: msvc internal LLVM/CLang

      > clang-cl.exe -v
      clang version 19.1.1
      Target: x86_64-pc-windows-msvc
      Thread model: posix
      InstallDir: C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\llvm\x64\bin
CASE: stable-diffusion.cpp
  • base: public master 10c6501

    • Modification to display additional debug messages.
  • cmake command arguments: -DSD_VULKAN=ON

  • test model: stable-diffusion-v1-5-pruned-emaonly-Q8_0.gguf from HF

  • command line:

    > .\sd.exe -m E:\AI\models\stable-diffusion-v1-5-pruned-emaonly-Q8_0.gguf --color -p 'A cat is sleeping on bed, viewable her whole body.' --seed 11 -o .\cat_vulkan_20250420_6p.png -v
  • generation time result

    (seconds: Model loading time is not included.)

    | PR | Original |
    | --- | --- |
    | 107.90 | 105.37 |
    | 107.79 | 105.39 |
    | 107.75 | 105.35 |
    | 107.43 | 105.11 |
    | 107.54 | 105.68 |
    | 107.84 | 105.50 |
    | 107.61 | 105.79 |
    | 107.61 | 105.69 |
    | 107.54 | 105.33 |
    | 107.47 | 105.16 |
    | 107.56 | |
    | 107.64 | |
    | 107.43 | |
    | 107.59 | |
    | 110.62 | |
    | 107.89 | |
    | 107.35 | |
    | 106.96 | |
    | 107.00 | |
    | 107.23 | |

    |  | PR | Original | diff |
    | --- | --- | --- | --- |
    | AVERAGE | 107.69 | 105.44 | +102% |
    | DEV +/- | 0.367 | 0.182 | |
CASE: ggml/test-backend-ops
  • base tree: public ggml-org/ggml master-13bcf9ce

    • Added option to specify use or non-use of imatrix during tensor initialization to reduce variability in test conditions.
  • Static and native build.

    • Using GGML_BUILD_TESTS, GGML_LTO
  • command line:

    > .\test-backend-ops.exe perf -p ',n=512,' --imatrix-on
  • Test result:
    As a representative example, the case of type_a=Q8_0 quantization with n=512 is shown.

    run time (us/run)

    | PR | Original |
    | --- | --- |
    | 85,364.75 | 82,245.71 |
    | 92,048.25 | 74,723.00 |
    | 88,251.17 | 79,141.21 |
    | 96,547.42 | 81,962.00 |
    | 94,852.75 | 81,862.86 |
    | 94,912.50 | 82,464.00 |
    | 94,694.58 | 81,765.57 |
    | 94,988.92 | 74,945.21 |
    | 94,584.92 | 75,804.07 |
    | 95,331.75 | 85,008.42 |
    | 94,573.08 | |
    | 94,645.67 | |
    | 94,269.75 | |
    | 94,886.50 | |
    | 96,574.75 | |
    | 85,707.50 | |
    | 87,073.25 | |
    | 91,830.42 | |
    | 98,114.58 | |
    | 84,771.17 | |

    |  | PR | Original | diff |
    | --- | --- | --- | --- |
    | AVERAGE | 92,701.18 | 79,992.21 | +116% |
    | DEV (+/-) | 3,386.18 | 3,071.07 | |

@netrunnereve (Collaborator, Author) commented Apr 21, 2025

> The conclusion is that test-backend-ops.exe was about 16% slower on average at Q8_0, n=512, while sd.exe was only about 2% slower on average under real conditions.

Again, as mentioned, can you run llama-bench with a real model instead of test-backend-ops? I don't plan on looking into the stable diffusion results, as that's basically a fork with an older version of GGML and possibly some unknown backend changes.

> This check was done with automatic power limiting turned off by RyzenAdj.

After doing that, what wattage and frequencies are you seeing when running prompt processing? Is it the same for both this PR and master? I'm hoping that you know what you're doing here and are not just casually cranking up the wattage and current limits, as that can fry your chip.

@masamaru-san commented:

> Again, as mentioned, can you run llama-bench with a real model instead of test-backend-ops? I don't plan on looking into the stable diffusion results, as that's basically a fork with an older version of GGML and possibly some unknown backend changes.

Sorry to bother you; I ran llama-bench ⬇️. The differences are 6 to 8 t/s on the pp512 test. I think something is jammed because there are only two graphics cores, too. I will treat this within a local fork.

llama-bench test result
> &{
>> (1..3) | foreach {
>> "Repeating: $_"
>> "`nMaster version`nWaiting 30seconds..."; Start-Sleep -Seconds 30
>> cd ..\bin.Master\
>>
>> .\llama-bench.exe -m E:\AI\models\llama-2-7b.Q4_0.gguf
>>
>> "`nPR version`nWaiting 30seconds..."; Start-Sleep -Seconds 30
>>
>> cd ..\bin\
>>
>> .\llama-bench.exe -m E:\AI\models\llama-2-7b.Q4_0.gguf
>> "`n----`n"
>> }
>> }

Repeating: 1

Master version
Waiting 30seconds...
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon(TM) Graphics (AMD proprietary driver) | uma: 1 | fp16: 1 | warp size: 64 | shared memory: 32768 | int dot: 0 | matrix cores: none

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | pp512 | 60.74 ± 0.78 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | tg128 | 7.17 ± 0.03 |

build: 2016f07 (5162)

PR version
Waiting 30seconds...
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon(TM) Graphics (AMD proprietary driver) | uma: 1 | fp16: 1 | warp size: 64 | shared memory: 32768 | int dot: 0 | matrix cores: none

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | pp512 | 51.68 ± 0.19 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | tg128 | 7.16 ± 0.01 |

build: 2016f07 (5162)


Repeating: 2

Master version
Waiting 30seconds...
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon(TM) Graphics (AMD proprietary driver) | uma: 1 | fp16: 1 | warp size: 64 | shared memory: 32768 | int dot: 0 | matrix cores: none

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | pp512 | 58.74 ± 0.33 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | tg128 | 7.15 ± 0.05 |

build: 2016f07 (5162)

PR version
Waiting 30seconds...
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon(TM) Graphics (AMD proprietary driver) | uma: 1 | fp16: 1 | warp size: 64 | shared memory: 32768 | int dot: 0 | matrix cores: none

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | pp512 | 51.72 ± 0.33 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | tg128 | 7.17 ± 0.03 |

build: 2016f07 (5162)


Repeating: 3

Master version
Waiting 30seconds...
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon(TM) Graphics (AMD proprietary driver) | uma: 1 | fp16: 1 | warp size: 64 | shared memory: 32768 | int dot: 0 | matrix cores: none

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | pp512 | 58.47 ± 0.84 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | tg128 | 7.17 ± 0.02 |

build: 2016f07 (5162)

PR version
Waiting 30seconds...
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon(TM) Graphics (AMD proprietary driver) | uma: 1 | fp16: 1 | warp size: 64 | shared memory: 32768 | int dot: 0 | matrix cores: none

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | pp512 | 52.12 ± 0.19 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | tg128 | 7.22 ± 0.03 |

build: 2016f07 (5162)

> After doing that, what wattage and frequencies are you seeing when running prompt processing? Is it the same for both this PR and master? I'm hoping that you know what you're doing here and are not just casually cranking up the wattage and current limits, as that can fry your chip.

I was monitoring watts, clock, load rate, etc. with GPU-Z and RyzenAdj, and it didn't look to me like there was any difference between master and the PR. I've attached the logs for part of the second run.
GPU-Z Sensor Log_single.zip

It is not clocked up; it can only run at 25 W (the default) the whole time if it is given periodic cooling time before the 15 W power limit threshold is triggered.

@netrunnereve (Collaborator, Author) commented Apr 21, 2025

> I think something is jammed because there are only two graphics cores, too. I will treat this within a local fork.

I don't understand what you mean here, as your llama-bench results show you should be using all 8 of your GPU cores. Also, if your GPU hypothetically had its core count limited, you would have to adjust that in the driver or graphics BIOS, not by modifying llama.cpp code.

> I was monitoring watts, clock, load rate, etc. with GPU-Z and RyzenAdj, and it didn't look to me like there was any difference between master and the PR. I've attached the logs for part of the second run.

Thanks, that's pretty helpful. I skimmed through the chart and I'm seeing a bit of power limiting during prompt processing with the chip hitting an average of 1700 MHz or so. It then jumps up to the full 1900 MHz for inference, and in both cases it's running slightly below the 25W limit. This is perfectly normal since the prompt processing stage is compute bound while inference is memory bound, and I see this on my own GPUs.

Considering how the prompt processing clocks are similar between master and my PR my guess is that your Windows driver is behaving differently than the Linux RADV driver that's used by @0cc4m and me.

@netrunnereve (Collaborator, Author) commented:

Speaking of limiting core count, I retested this PR on my 470 with only 8 CUs enabled (2 per shader engine) and still got a 20% improvement in prompt processing speed. Yeah, this is looking more and more like a driver thing.

@netrunnereve (Collaborator, Author) commented:

I also noticed that your driver is reporting that you have 32k shared memory on your Vega graphics, which makes no sense. Anyway, I've straight up disabled these changes for the AMD proprietary driver, so we should be good to go.
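
For reference, the proprietary driver can be identified through VkPhysicalDeviceDriverProperties. Below is a minimal sketch of that kind of gate; it is not necessarily the exact check used in this PR.

```cpp
#include <vulkan/vulkan.h>

// Detect AMD's proprietary driver (requires Vulkan 1.2 or
// VK_KHR_driver_properties) so the GCN tune can be skipped on it.
bool is_amd_proprietary_driver(VkPhysicalDevice dev) {
    VkPhysicalDeviceDriverProperties driver {};
    driver.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_DRIVER_PROPERTIES;

    VkPhysicalDeviceProperties2 props {};
    props.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_PROPERTIES_2;
    props.pNext = &driver;

    vkGetPhysicalDeviceProperties2(dev, &props);
    return driver.driverID == VK_DRIVER_ID_AMD_PROPRIETARY;
}
```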

@0cc4m merged commit b3b6d86 into ggml-org:master Apr 24, 2025
48 checks passed
@netrunnereve deleted the matmul_tuning branch April 24, 2025 14:56
pockers21 pushed a commit to pockers21/llama.cpp that referenced this pull request Apr 28, 2025
* tune matmul for gcn

* this one is more power efficient

* Update ggml/src/ggml-vulkan/ggml-vulkan.cpp

Co-authored-by: 0cc4m <[email protected]>

* disable this tune for the proprietary driver

---------

Co-authored-by: 0cc4m <[email protected]>