vulkan: matmul gcn tuning #13016

Merged · 4 commits merged into ggml-org:master on Apr 24, 2025

Conversation

@netrunnereve (Collaborator) commented Apr 18, 2025

I tried to do some manual tuning on the mmq warptile settings and I'm seeing good improvements on my end. Right now these changes are only applied to AMD GCN, but they might help other chips as well.

Results on my RX 470, locked to 900 MHz so the power limit doesn't mess with my numbers:

PR

  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                       42 runs - 25039.55 us/run -  60.13 GFLOP/run -   2.40 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                       42 runs - 24727.38 us/run -  60.13 GFLOP/run -   2.43 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                       38 runs - 26618.84 us/run -  60.13 GFLOP/run -   2.26 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                       38 runs - 26566.58 us/run -  60.13 GFLOP/run -   2.26 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                       40 runs - 25383.05 us/run -  60.13 GFLOP/run -   2.37 TFLOPS
  MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                       34 runs - 30912.41 us/run -  60.13 GFLOP/run -   1.95 TFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                       32 runs - 31700.53 us/run -  60.13 GFLOP/run -   1.90 TFLOPS
  MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                       34 runs - 30709.47 us/run -  60.13 GFLOP/run -   1.96 TFLOPS
  MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                       30 runs - 33692.97 us/run -  60.13 GFLOP/run -   1.78 TFLOPS
  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                       34 runs - 30832.82 us/run -  60.13 GFLOP/run -   1.95 TFLOPS
  MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                    32 runs - 31994.59 us/run -  60.13 GFLOP/run -   1.88 TFLOPS
  MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                     32 runs - 31827.38 us/run -  60.13 GFLOP/run -   1.89 TFLOPS
  MUL_MAT(type_a=iq2_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      26 runs - 39016.85 us/run -  60.13 GFLOP/run -   1.54 TFLOPS
  MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                    32 runs - 32360.03 us/run -  60.13 GFLOP/run -   1.86 TFLOPS
  MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      34 runs - 30087.26 us/run -  60.13 GFLOP/run -   2.00 TFLOPS
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      32 runs - 32532.38 us/run -  60.13 GFLOP/run -   1.85 TFLOPS
  MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                     40 runs - 25327.12 us/run -  60.13 GFLOP/run -   2.37 TFLOPS
  MUL_MAT(type_a=iq3_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      34 runs - 30346.79 us/run -  60.13 GFLOP/run -   1.98 TFLOPS
  MUL_MAT(type_a=iq4_xs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                     34 runs - 29419.88 us/run -  60.13 GFLOP/run -   2.04 TFLOPS
| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | pp512 | 171.52 ± 0.50 |
| llama 8B Q2_K - Medium | 2.95 GiB | 8.03 B | Vulkan | 99 | pp512 | 144.36 ± 0.49 |
| llama 8B IQ3_XXS - 3.0625 bpw | 3.04 GiB | 8.03 B | Vulkan | 99 | pp512 | 136.64 ± 0.45 |

Master

  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                       30 runs - 33936.63 us/run -  60.13 GFLOP/run -   1.77 TFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                       30 runs - 33703.70 us/run -  60.13 GFLOP/run -   1.78 TFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                       26 runs - 38655.42 us/run -  60.13 GFLOP/run -   1.56 TFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                       28 runs - 37412.71 us/run -  60.13 GFLOP/run -   1.61 TFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                       30 runs - 34417.07 us/run -  60.13 GFLOP/run -   1.75 TFLOPS
  MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                       26 runs - 40389.42 us/run -  60.13 GFLOP/run -   1.49 TFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                       24 runs - 45311.38 us/run -  60.13 GFLOP/run -   1.33 TFLOPS
  MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                       24 runs - 44301.62 us/run -  60.13 GFLOP/run -   1.36 TFLOPS
  MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                       22 runs - 47887.09 us/run -  60.13 GFLOP/run -   1.26 TFLOPS
  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                       24 runs - 41909.17 us/run -  60.13 GFLOP/run -   1.43 TFLOPS
  MUL_MAT(type_a=iq2_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                    24 runs - 43341.67 us/run -  60.13 GFLOP/run -   1.39 TFLOPS
  MUL_MAT(type_a=iq2_xs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                     24 runs - 42465.96 us/run -  60.13 GFLOP/run -   1.42 TFLOPS
  MUL_MAT(type_a=iq2_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      22 runs - 49311.50 us/run -  60.13 GFLOP/run -   1.22 TFLOPS
  MUL_MAT(type_a=iq3_xxs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                    24 runs - 43765.83 us/run -  60.13 GFLOP/run -   1.37 TFLOPS
  MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      26 runs - 39816.38 us/run -  60.13 GFLOP/run -   1.51 TFLOPS
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      24 runs - 43036.33 us/run -  60.13 GFLOP/run -   1.40 TFLOPS
  MUL_MAT(type_a=iq4_nl,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                     30 runs - 34430.73 us/run -  60.13 GFLOP/run -   1.75 TFLOPS
  MUL_MAT(type_a=iq3_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                      24 runs - 41953.67 us/run -  60.13 GFLOP/run -   1.43 TFLOPS
  MUL_MAT(type_a=iq4_xs,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                     24 runs - 41902.25 us/run -  60.13 GFLOP/run -   1.43 TFLOPS
| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | pp512 | 136.10 ± 0.41 |
| llama 8B Q2_K - Medium | 2.95 GiB | 8.03 B | Vulkan | 99 | pp512 | 105.10 ± 0.40 |
| llama 8B IQ3_XXS - 3.0625 bpw | 3.04 GiB | 8.03 B | Vulkan | 99 | pp512 | 102.03 ± 0.23 |

I also tried tuning the small and large warptiles by setting them as the default, but I wasn't able to get them to beat the 256-thread medium shader. The FP16 and FP32 shaders already perform best on GCN using the existing parameters.
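
To give a sense of what this kind of tune boils down to, here's a rough C++ sketch of the idea. The struct, field names, and numbers below are made up for illustration only; the real warptile setup lives in ggml-vulkan.cpp and is organised differently.

```cpp
#include <cstdint>

// Hypothetical warptile description: tile sizes processed per workgroup plus
// the workgroup size in threads. Names and values are illustrative only.
struct warptile_params {
    uint32_t tile_m, tile_n, tile_k; // output tile handled per workgroup
    uint32_t wg_size;                // threads per workgroup
};

// Placeholder defaults: a generic "medium" tune versus a GCN-specific one
// that keeps the 256-thread medium shader but uses smaller tiles.
constexpr warptile_params mmq_medium_generic { 64, 64, 32, 256 };
constexpr warptile_params mmq_medium_gcn     { 32, 32, 32, 256 };

// Device dispatch: GCN gets its own tune, everything else keeps the default.
inline warptile_params pick_mmq_warptile(bool is_amd_gcn) {
    return is_amd_gcn ? mmq_medium_gcn : mmq_medium_generic;
}
```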

@netrunnereve marked this pull request as ready for review April 18, 2025 21:39
@github-actions bot added the Vulkan (Issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) labels Apr 18, 2025
@netrunnereve marked this pull request as draft April 19, 2025 03:26
@netrunnereve (Collaborator, Author) commented:

I'm setting this back to draft while I adjust it a little bit more...

@netrunnereve (Collaborator, Author) commented Apr 19, 2025

Okay, I think it's ready. The 16x16 tiles I'm using now perform like the 64x16 ones at a fixed clock speed, but once I turn frequency scaling back on the chip manages to clock higher than before and I get a 4% improvement in pp512 speed. Maybe the smaller tile sizes make it run more efficiently?

Since all the threads in the workgroup do the same calculations and share the same memory we don't necessarily have to make the shader's warp size match the subgroup size.
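
In plain Vulkan terms, what I mean is something like this minimal sketch (not the actual backend code): the subgroup size is whatever the driver reports, while the mmq shader's workgroup size is purely a tuning choice.

```cpp
#include <vulkan/vulkan.h>
#include <cstdint>

// Ask the driver for the hardware subgroup (wave) size.
uint32_t query_subgroup_size(VkPhysicalDevice dev) {
    VkPhysicalDeviceSubgroupProperties subgroup {};
    subgroup.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_SUBGROUP_PROPERTIES;

    VkPhysicalDeviceProperties2 props {};
    props.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_PROPERTIES_2;
    props.pNext = &subgroup;

    vkGetPhysicalDeviceProperties2(dev, &props);
    return subgroup.subgroupSize; // 64 on GCN
}

// The shader's workgroup size is picked for performance, independent of the
// value above, because the kernel only relies on shared memory and barriers.
constexpr uint32_t MMQ_MEDIUM_WORKGROUP_SIZE = 256; // illustrative value
```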

@netrunnereve marked this pull request as ready for review April 19, 2025 21:19
@netrunnereve requested a review from 0cc4m April 19, 2025 21:24
@masamaru-san commented:

Oh... this does not seem to be suitable for my Ryzen integrated graphics (Ryzen 5700U with Radeon Graphics, gfx90c), at least.

The most obvious difference shows up in test-backend-ops, where the FLOPS values drop by about 8 percent at n=512, except for f16 and f32.
Even at the application level, generation time increases by about 4 percent when running stable-diffusion-v1.5 in stable-diffusion.cpp.

before

  MUL_MAT(type_a=f32,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                 8 runs - 141894.75 us/run -  60.13 GFLOP/run - 423.76 GFLOPS
  MUL_MAT(type_a=f16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                10 runs - 104162.00 us/run -  60.13 GFLOP/run - 577.27 GFLOPS
  MUL_MAT(type_a=bf16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0): not supported
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                       14 runs - 74643.71 us/run -  60.13 GFLOP/run - 805.55 GFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                       14 runs - 72368.29 us/run -  60.13 GFLOP/run - 830.88 GFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                       12 runs - 84916.33 us/run -  60.13 GFLOP/run - 708.10 GFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                       14 runs - 80803.64 us/run -  60.13 GFLOP/run - 744.14 GFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                       14 runs - 77914.29 us/run -  60.13 GFLOP/run - 771.74 GFLOPS
... 

after (this PR)

  MUL_MAT(type_a=f32,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                 8 runs - 141682.25 us/run -  60.13 GFLOP/run - 424.40 GFLOPS
  MUL_MAT(type_a=f16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                10 runs - 104117.40 us/run -  60.13 GFLOP/run - 577.52 GFLOPS
  MUL_MAT(type_a=bf16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0): not supported
  MUL_MAT(type_a=q4_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                       12 runs - 84012.67 us/run -  60.13 GFLOP/run - 715.72 GFLOPS
  MUL_MAT(type_a=q4_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                       12 runs - 85369.75 us/run -  60.13 GFLOP/run - 704.34 GFLOPS
  MUL_MAT(type_a=q5_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                       14 runs - 82779.86 us/run -  60.13 GFLOP/run - 726.38 GFLOPS
  MUL_MAT(type_a=q5_1,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                       12 runs - 93338.33 us/run -  60.13 GFLOP/run - 644.21 GFLOPS
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0):                       12 runs - 90233.58 us/run -  60.13 GFLOP/run - 666.38 GFLOPS
...

Maybe, like Java and other VMs, we should store performance profile data for each device and apply it automatically?
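
For example, something along these lines. This is purely hypothetical, nothing like it exists in the backend today: a lookup keyed by the reported device name, which an offline auto-tuner could fill in and cache to disk.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical per-device matmul tuning profile (illustrative names/values).
struct mm_profile {
    uint32_t tile_m, tile_n;
    uint32_t wg_size;
};

// Could be generated by an auto-tuner on first run and persisted to disk.
// The entries below are examples only.
static const std::unordered_map<std::string, mm_profile> k_profiles = {
    { "AMD Radeon RX 470 Graphics (RADV POLARIS10)", { 16, 16, 256 } }, // example entry
    { "AMD Radeon(TM) Graphics",                     { 64, 16, 256 } }, // example entry
};

inline mm_profile pick_profile(const std::string & device_name, mm_profile fallback) {
    auto it = k_profiles.find(device_name);
    return it != k_profiles.end() ? it->second : fallback;
}
```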

@0cc4m (Collaborator) commented Apr 20, 2025

@masamaru-san Please also do before and after tests with a model, because the unit tests are not reliable for judging whether it is actually detrimental.

@netrunnereve (Collaborator, Author) commented:

> The most obvious difference shows up in test-backend-ops, where the FLOPS values drop by about 8 percent at n=512, except for f16 and f32.
> Even at the application level, generation time increases by about 4 percent when running stable-diffusion-v1.5 in stable-diffusion.cpp.

My first thought was that your integrated graphics only had two cores, as that wouldn't handle 256 threads well, but that's not the case since you have eight cores. It's also not a Vega or FP16 issue, as @0cc4m's card is doing fine.

Since your 5700U is a 15 W chip, though, this might actually be a power issue. For example, when prompt processing Llama 2 7B Q4_0 on master my 470 runs at 1.15 GHz and gets 171 t/s. With my PR it only runs at 1 GHz but gets 189 t/s, and both times I'm hitting the 130 A TDC limit on my card. Please run with a real model and compare the GPU clock speeds and power levels against master. On Linux you can use radeontop and sensors for this; for Windows I have no idea 🤷‍♀️.

> Maybe, like Java and other VMs, we should store performance profile data for each device and apply it automatically?

Obviously that's the best option but it's a lot of work. Right now everyone can just submit tunes as PRs as it's not that hard to do.

@masamaru-san commented:

I rechecked the degree of performance change for AMD Ryzen 7 5700U with Radeon Graphics (Lucienne/gfx90c).

The conclusion is that test-backend-ops.exe was about 16% slower on average at Q8_0, n=512, while sd.exe was only about 2% slower on average under real conditions.
I ran sd.exe and test-backend-ops.exe alternately with and without this PR applied and made a comparison. This check was done with automatic power limiting turned off by RyzenAdj.

Perhaps this is also due to AMD's Vulkan driver for Windows?


Environment
  • OS

    • Windows 11 24H2 Home
      > cmd /c ver
      
      Microsoft Windows [Version 10.0.26100.3775]
  • device info

    • HP Pavilion Laptop 15-eh1080AU

      • BIOS: AMI F.30 - AMD AGESA CezannePI-FP6 1.0.1.1 12/02/2024
      • VRAM 512MB + 32 GB UMA (64 GB RAM)
      • vulkaninfoSDK
      > E:\VulkanSDK\1.4.309.0\Bin\vulkaninfoSDK.exe --summary
      
      WARNING: [Loader Message] Code 0 : Layer VK_LAYER_AMD_switchable_graphics uses API version 1.3 
      which is older than the application specified API version of 1.4. May cause issues.
      ==========
      VULKANINFO
      ==========
      
      Vulkan Instance Version: 1.4.309
      
      
      Instance Extensions: count = 13
      -------------------------------
      VK_EXT_debug_report                    : extension revision 10
      VK_EXT_debug_utils                     : extension revision 2
      VK_EXT_swapchain_colorspace            : extension revision 4
      VK_KHR_device_group_creation           : extension revision 1
      VK_KHR_external_fence_capabilities     : extension revision 1
      VK_KHR_external_memory_capabilities    : extension revision 1
      VK_KHR_external_semaphore_capabilities : extension revision 1
      VK_KHR_get_physical_device_properties2 : extension revision 2
      VK_KHR_get_surface_capabilities2       : extension revision 1
      VK_KHR_portability_enumeration         : extension revision 1
      VK_KHR_surface                         : extension revision 25
      VK_KHR_win32_surface                   : extension revision 6
      VK_LUNARG_direct_driver_loading        : extension revision 1
      
      Instance Layers: count = 10
      ---------------------------
      VK_LAYER_AMD_switchable_graphics  AMD switchable graphics layer                 1.3.260  version 1
      VK_LAYER_KHRONOS_profiles         Khronos Profiles layer                        1.4.309  version 1
      VK_LAYER_KHRONOS_shader_object    Khronos Shader object layer                   1.4.309  version 1
      VK_LAYER_KHRONOS_synchronization2 Khronos Synchronization2 layer                1.4.309  version 1
      VK_LAYER_KHRONOS_validation       Khronos Validation Layer                      1.4.309  version 1
      VK_LAYER_LUNARG_api_dump          LunarG API dump layer                         1.4.309  version 2
      VK_LAYER_LUNARG_crash_diagnostic  Crash Diagnostic Layer is a crash/hang debugging tool that helps determines GPU progress in a Vulkan application.  1.4.309  version 1
      VK_LAYER_LUNARG_gfxreconstruct    GFXReconstruct Capture Layer Version 1.0.5    1.4.309  version 4194309
      VK_LAYER_LUNARG_monitor           Execution Monitoring Layer                    1.4.309  version 1
      VK_LAYER_LUNARG_screenshot        LunarG image capture layer                    1.4.309  version 1
      
      Devices:
      ========
      GPU0:
              apiVersion         = 1.3.260
              driverVersion      = 2.0.279
              vendorID           = 0x1002
              deviceID           = 0x164c
              deviceType         = PHYSICAL_DEVICE_TYPE_INTEGRATED_GPU
              deviceName         = AMD Radeon(TM) Graphics
              driverID           = DRIVER_ID_AMD_PROPRIETARY
              driverName         = AMD proprietary driver
              driverInfo         = 24.9.1 (AMD proprietary shader compiler)
              conformanceVersion = 1.3.3.1
              deviceUUID         = 00000000-0400-0000-0000-000000000000
              driverUUID         = 414d442d-5749-4e2d-4452-560000000000
    • Disable power limiting control:
      Temporarily disabled via RyzenAdj, so the graphics core can always boost to 1900 MHz (this device's max). In this case, however, this was not necessary since each test case was shorter than 280 seconds.

Build toolset
  • IDE: Microsoft Visual Studio
    • ver 17.13.6 Community

    • Using cmake (msvc internal)

      cmake version 3.30.5-msvc23
      
      CMake suite maintained and supported by Kitware (kitware.com/cmake).
    • build toolset: msvc internal LLVM/CLang

      > clang-cl.exe -v
      clang version 19.1.1
      Target: x86_64-pc-windows-msvc
      Thread model: posix
      InstallDir: C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\llvm\x64\bin
CASE: stable-diffusion.cpp
  • base: public master 10c6501

    • Modification to display additional debug messages.
  • cmake command arguments: -DSD_VULKAN=ON

  • test model: stable-diffusion-v1-5-pruned-emaonly-Q8_0.gguf from HF

  • command line:

    > .\sd.exe -m E:\AI\models\stable-diffusion-v1-5-pruned-emaonly-Q8_0.gguf --color -p 'A cat is sleeping on bed, viewable her whole body.' --seed 11 -o .\cat_vulkan_20250420_6p.png -v
  • generation time result

    (seconds: Model loading time is not included.)

    | PR | Original |
    | --- | --- |
    | 107.90 | 105.37 |
    | 107.79 | 105.39 |
    | 107.75 | 105.35 |
    | 107.43 | 105.11 |
    | 107.54 | 105.68 |
    | 107.84 | 105.50 |
    | 107.61 | 105.79 |
    | 107.61 | 105.69 |
    | 107.54 | 105.33 |
    | 107.47 | 105.16 |
    | 107.56 | |
    | 107.64 | |
    | 107.43 | |
    | 107.59 | |
    | 110.62 | |
    | 107.89 | |
    | 107.35 | |
    | 106.96 | |
    | 107.00 | |
    | 107.23 | |

    |  | PR | Original | diff |
    | --- | --- | --- | --- |
    | AVERAGE | 107.69 | 105.44 | +102% |
    | DEV +/- | 0.367 | 0.182 | |
CASE: ggml/test-backend-ops
  • base tree: public ggml-org/ggml master-13bcf9ce

    • Added option to specify use or non-use of imatrix during tensor initialization to reduce variability in test conditions.
  • Static and native build.

    • Using GGML_BUILD_TESTS, GGML_LTO
  • command line:

    > .\test-backend-ops.exe perf -p ',n=512,' --imatrix-on
  • Test result:
    As a representative example, the case of type_a=Q8_0 quantization with n=512 is shown.

    run time (us/run)

    | PR | Original |
    | --- | --- |
    | 85,364.75 | 82,245.71 |
    | 92,048.25 | 74,723.00 |
    | 88,251.17 | 79,141.21 |
    | 96,547.42 | 81,962.00 |
    | 94,852.75 | 81,862.86 |
    | 94,912.50 | 82,464.00 |
    | 94,694.58 | 81,765.57 |
    | 94,988.92 | 74,945.21 |
    | 94,584.92 | 75,804.07 |
    | 95,331.75 | 85,008.42 |
    | 94,573.08 | |
    | 94,645.67 | |
    | 94,269.75 | |
    | 94,886.50 | |
    | 96,574.75 | |
    | 85,707.50 | |
    | 87,073.25 | |
    | 91,830.42 | |
    | 98,114.58 | |
    | 84,771.17 | |

    |  | PR | Original | diff |
    | --- | --- | --- | --- |
    | AVERAGE | 92,701.18 | 79,992.21 | +116% |
    | DEV (+/-) | 3,386.18 | 3,071.07 | |

@netrunnereve (Collaborator, Author) commented Apr 21, 2025

> The conclusion is that test-backend-ops.exe was about 16% slower on average at Q8_0, n=512, while sd.exe was only about 2% slower on average under real conditions.

Again, as mentioned, can you run llama-bench with a real model instead of test-backend-ops? I don't plan on looking into the stable diffusion results, as that's basically a fork with an older version of GGML and possibly some unknown backend changes.

> This check was done with automatic power limiting turned off by RyzenAdj.

After doing that, what wattage and frequencies are you seeing when running prompt processing? Is it the same for both this PR and master? I'm hoping that you know what you're doing here and are not just casually cranking up the wattage and current limits, as that can fry your chip.

@masamaru-san commented:

> Again, as mentioned, can you run llama-bench with a real model instead of test-backend-ops? I don't plan on looking into the stable diffusion results, as that's basically a fork with an older version of GGML and possibly some unknown backend changes.

Sorry to bother you; I ran llama-bench ⬇️. The differences are 6 to 8 t/s on the pp512 test. I think something is jammed because there are only two graphics cores, too. I will treat this within a local fork.

llama-bench test result
> &{
>> (1..3) | foreach {
>> "Repeating: $_"
>> "`nMaster version`nWaiting 30seconds..."; Start-Sleep -Seconds 30
>> cd ..\bin.Master\
>>
>> .\llama-bench.exe -m E:\AI\models\llama-2-7b.Q4_0.gguf
>>
>> "`nPR version`nWaiting 30seconds..."; Start-Sleep -Seconds 30
>>
>> cd ..\bin\
>>
>> .\llama-bench.exe -m E:\AI\models\llama-2-7b.Q4_0.gguf
>> "`n----`n"
>> }
>> }

Repeating: 1

Master version
Waiting 30seconds...
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon(TM) Graphics (AMD proprietary driver) | uma: 1 | fp16: 1 | warp size: 64 | shared memory: 32768 | int dot: 0 | matrix cores: none

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | pp512 | 60.74 ± 0.78 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | tg128 | 7.17 ± 0.03 |

build: 2016f07 (5162)

PR version
Waiting 30seconds...
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon(TM) Graphics (AMD proprietary driver) | uma: 1 | fp16: 1 | warp size: 64 | shared memory: 32768 | int dot: 0 | matrix cores: none

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | pp512 | 51.68 ± 0.19 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | tg128 | 7.16 ± 0.01 |

build: 2016f07 (5162)


Repeating: 2

Master version
Waiting 30seconds...
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon(TM) Graphics (AMD proprietary driver) | uma: 1 | fp16: 1 | warp size: 64 | shared memory: 32768 | int dot: 0 | matrix cores: none

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | pp512 | 58.74 ± 0.33 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | tg128 | 7.15 ± 0.05 |

build: 2016f07 (5162)

PR version
Waiting 30seconds...
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon(TM) Graphics (AMD proprietary driver) | uma: 1 | fp16: 1 | warp size: 64 | shared memory: 32768 | int dot: 0 | matrix cores: none

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | pp512 | 51.72 ± 0.33 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | tg128 | 7.17 ± 0.03 |

build: 2016f07 (5162)


Repeating: 3

Master version
Waiting 30seconds...
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon(TM) Graphics (AMD proprietary driver) | uma: 1 | fp16: 1 | warp size: 64 | shared memory: 32768 | int dot: 0 | matrix cores: none

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | pp512 | 58.47 ± 0.84 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | tg128 | 7.17 ± 0.02 |

build: 2016f07 (5162)

PR version
Waiting 30seconds...
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon(TM) Graphics (AMD proprietary driver) | uma: 1 | fp16: 1 | warp size: 64 | shared memory: 32768 | int dot: 0 | matrix cores: none

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | pp512 | 52.12 ± 0.19 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | tg128 | 7.22 ± 0.03 |

build: 2016f07 (5162)

> After doing that, what wattage and frequencies are you seeing when running prompt processing? Is it the same for both this PR and master? I'm hoping that you know what you're doing here and are not just casually cranking up the wattage and current limits, as that can fry your chip.

I was monitoring watts, clock, load rate, etc. with GPU-Z and RyzenAdj, and it didn't look to me like there was any difference between master and the PR. I've attached the logs for part of the second run.
GPU-Z Sensor Log_single.zip

It is not clocked up; it can only run at 25 W (the default) the whole time if it is given periodic cooling time before the 15 W power limit threshold is triggered.

@netrunnereve (Collaborator, Author) commented Apr 21, 2025

> I think something is jammed because there are only two graphics cores, too. I will treat this within a local fork.

I don't understand what you mean here, as your llama-bench results show you should be using all 8 of your GPU cores. Also, if your GPU hypothetically had its core count limited, you would have to adjust that in the driver or graphics BIOS, not by modifying llama.cpp code.

> I was monitoring watts, clock, load rate, etc. with GPU-Z and RyzenAdj, and it didn't look to me like there was any difference between master and the PR. I've attached the logs for part of the second run.

Thanks, that's pretty helpful. I skimmed through the chart and I'm seeing a bit of power limiting during prompt processing with the chip hitting an average of 1700 MHz or so. It then jumps up to the full 1900 MHz for inference, and in both cases it's running slightly below the 25W limit. This is perfectly normal since the prompt processing stage is compute bound while inference is memory bound, and I see this on my own GPUs.

Considering how the prompt processing clocks are similar between master and my PR my guess is that your Windows driver is behaving differently than the Linux RADV driver that's used by @0cc4m and me.

@netrunnereve (Collaborator, Author) commented:

Speaking of limiting core count, I retested this PR on my 470 with only 8 CUs enabled (2 per shader engine) and still got a 20% improvement in prompt processing speed. Yeah, this is looking more and more like a driver thing.

@netrunnereve (Collaborator, Author) commented:

I also noticed that your driver is reporting that you have 32k shared memory on your Vega graphics, which makes no sense. Anyway, I've straight up disabled these changes for the AMD proprietary driver, so we should be good to go.
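
For reference, the proprietary driver can be identified through VkPhysicalDeviceDriverProperties. Below is a minimal sketch of that kind of gate; it is not necessarily the exact check used in this PR.

```cpp
#include <vulkan/vulkan.h>

// Detect AMD's proprietary driver (requires Vulkan 1.2 or
// VK_KHR_driver_properties) so the GCN tune can be skipped on it.
bool is_amd_proprietary_driver(VkPhysicalDevice dev) {
    VkPhysicalDeviceDriverProperties driver {};
    driver.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_DRIVER_PROPERTIES;

    VkPhysicalDeviceProperties2 props {};
    props.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_PROPERTIES_2;
    props.pNext = &driver;

    vkGetPhysicalDeviceProperties2(dev, &props);
    return driver.driverID == VK_DRIVER_ID_AMD_PROPRIETARY;
}
```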

@0cc4m merged commit b3b6d86 into ggml-org:master Apr 24, 2025
48 checks passed
@netrunnereve deleted the matmul_tuning branch April 24, 2025 14:56
pockers21 pushed a commit to pockers21/llama.cpp that referenced this pull request Apr 28, 2025
* tune matmul for gcn

* this one is more power efficient

* Update ggml/src/ggml-vulkan/ggml-vulkan.cpp

Co-authored-by: 0cc4m <[email protected]>

* disable this tune for the proprietary driver

---------

Co-authored-by: 0cc4m <[email protected]>