
vulkan: improve im2col #11826

Merged: 12 commits merged into ggml-org:master from vk-shader-optimizations-1 on Feb 28, 2025

Conversation

daniandtheweb
Contributor

This PR supersedes #11778.
Here are the performance numbers on my Radeon RX 5700 XT (RADV).

Vulkan:

Master:
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):              13104 runs -    95.82 us/run -    10244 kB/run -  101.96 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):               1640 runs -   639.72 us/run -    40964 kB/run -   61.08 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[256,256,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                       52 runs - 79006.56 us/run -   655364 kB/run -    7.93 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,2560,1],ne_kernel=[3,3,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                     1312 runs -   900.50 us/run -   102445 kB/run -  108.53 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,2560,1],ne_kernel=[3,3,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                       82 runs - 20310.68 us/run -   409645 kB/run -   19.26 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):               4278 runs -   236.25 us/run -    23536 kB/run -   95.01 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                335 runs -  3518.08 us/run -   100208 kB/run -   27.17 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[256,256,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                       20 runs - 313087.20 us/run -  1678448 kB/run -    5.12 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,2560,1],ne_kernel=[5,5,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                      572 runs -  2263.98 us/run -   235365 kB/run -   99.18 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,2560,1],ne_kernel=[5,5,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                       34 runs - 71241.88 us/run -  1002085 kB/run -   13.43 GB/s

PR:
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):              19656 runs -    58.61 us/run -    10244 kB/run -  166.70 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):               1640 runs -   664.67 us/run -    40964 kB/run -   58.78 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[256,256,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                       52 runs - 79993.31 us/run -   655364 kB/run -    7.83 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,2560,1],ne_kernel=[3,3,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                     1968 runs -   602.13 us/run -   102445 kB/run -  162.31 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,2560,1],ne_kernel=[3,3,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                       82 runs - 17203.68 us/run -   409645 kB/run -   22.74 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):               5704 runs -   227.95 us/run -    23536 kB/run -   98.47 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                670 runs -  2895.43 us/run -   100208 kB/run -   33.01 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[256,256,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                       20 runs - 315784.75 us/run -  1678448 kB/run -    5.08 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,2560,1],ne_kernel=[5,5,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                      572 runs -  2218.08 us/run -   235365 kB/run -  101.23 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,2560,1],ne_kernel=[5,5,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                       34 runs - 72422.56 us/run -  1002085 kB/run -   13.21 GB/s

HIP:

  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):               3276 runs -   923.04 us/run -    10244 kB/run -   10.58 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                820 runs -  3359.14 us/run -    40964 kB/run -   11.63 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[256,256,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                       52 runs - 58759.37 us/run -   655364 kB/run -   10.66 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,2560,1],ne_kernel=[3,3,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                      328 runs -  9271.95 us/run -   102445 kB/run -   10.54 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,2560,1],ne_kernel=[3,3,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                       82 runs - 79109.07 us/run -   409645 kB/run -    4.94 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):               1426 runs -  1077.72 us/run -    23536 kB/run -   20.83 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                335 runs -  9038.40 us/run -   100208 kB/run -   10.57 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[256,256,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                       20 runs - 160059.45 us/run -  1678448 kB/run -   10.02 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,2560,1],ne_kernel=[5,5,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                      143 runs - 12785.92 us/run -   235365 kB/run -   17.56 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,2560,1],ne_kernel=[5,5,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                       34 runs - 120635.09 us/run -  1002085 kB/run -    7.93 GB/s

I'm also including the benchmark results when using RADV_PERFTEST=cswave32. It's interesting to note that, despite this variable hurting this specific operation's performance, it actually improves speed in stable-diffusion.cpp (sd 1.5 512x512: without PR 1.38 it/s, with PR 1.45 it/s, with PR + cswave32 1.55 it/s).

Vulkan cswave32:

Master:
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):              13104 runs -    94.26 us/run -    10244 kB/run -  103.65 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                820 runs -  1426.19 us/run -    40964 kB/run -   27.40 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[256,256,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                       52 runs - 122464.00 us/run -   655364 kB/run -    5.11 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,2560,1],ne_kernel=[3,3,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                      984 runs -  1240.23 us/run -   102445 kB/run -   78.80 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,2560,1],ne_kernel=[3,3,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                       82 runs - 67429.80 us/run -   409645 kB/run -    5.80 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):               4278 runs -   325.81 us/run -    23536 kB/run -   68.89 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                335 runs -  7415.25 us/run -   100208 kB/run -   12.89 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[256,256,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                       20 runs - 254549.75 us/run -  1678448 kB/run -    6.30 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,2560,1],ne_kernel=[5,5,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                      286 runs -  4823.12 us/run -   235365 kB/run -   46.55 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,2560,1],ne_kernel=[5,5,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                       34 runs - 358562.76 us/run -  1002085 kB/run -    2.67 GB/s

PR:
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):              13104 runs -    81.44 us/run -    10244 kB/run -  119.98 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                820 runs -  1439.59 us/run -    40964 kB/run -   27.14 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[256,256,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                       52 runs - 135311.06 us/run -   655364 kB/run -    4.63 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,2560,1],ne_kernel=[3,3,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                      984 runs -  1191.29 us/run -   102445 kB/run -   82.04 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,2560,1],ne_kernel=[3,3,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                       82 runs - 62476.68 us/run -   409645 kB/run -    6.26 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):               4278 runs -   315.84 us/run -    23536 kB/run -   71.07 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                335 runs -  7497.88 us/run -   100208 kB/run -   12.75 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[256,256,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                       20 runs - 254943.60 us/run -  1678448 kB/run -    6.29 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,2560,1],ne_kernel=[5,5,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                      286 runs -  4767.25 us/run -   235365 kB/run -   47.10 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,2560,1],ne_kernel=[5,5,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                       34 runs - 358791.47 us/run -  1002085 kB/run -    2.67 GB/s

@github-actions github-actions bot added the Vulkan (Issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) labels Feb 12, 2025
@0cc4m 0cc4m self-requested a review February 13, 2025 10:07
@0cc4m
Collaborator

0cc4m commented Feb 13, 2025

If cswave32 helps you, maybe look into whether setting the requiredSubgroupSize in the shader does the same thing.

@daniandtheweb
Contributor Author

That's nice to know, thanks. I'll look into it to check where it does make a difference because, as I mentioned, it actually slows down im2col.

@0cc4m
Collaborator

0cc4m commented Feb 13, 2025

> That's nice to know, thanks. I'll look into it to check where it does make a difference because, as I mentioned, it actually slows down im2col.

Real-world performance (like sd.cpp benchmarks) is more important than GB/s in test-backend-ops perf. The dimensions in real use are probably different, and there are other factors at play (for example, graph execution instead of a single op getting repeated).

@daniandtheweb
Contributor Author

daniandtheweb commented Feb 13, 2025

I've been trying to use the VK_EXT_subgroup_size_control extension in GLSL to set the requiredSubgroupSize, but I can't manage to make it work. Is the right approach to enable the "GL_EXT_subgroup_size_control" extension and then append requiredSubgroupSize to the layout in which local_size_x_id is set, or am I using the extension wrongly?

What I'm trying to do is set the subgroup size on this specific shader, or at least apply the change to all the shaders through ggml-vulkan.cpp, without having to rely on the RADV_PERFTEST env, but for now I can't find a way to do that.

@0cc4m
Collaborator

0cc4m commented Feb 13, 2025

Oh sorry, I thought you knew that the extension is already implemented. You can set values when loading pipelines:

https://github.com/ggerganov/llama.cpp/blob/8a8c4ceb6050bd9392609114ca56ae6d26f5b8f5/ggml/src/ggml-vulkan/ggml-vulkan.cpp#L1550-L1552
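
For context, that required_subgroup_size parameter maps to Vulkan's VK_EXT_subgroup_size_control mechanism (core in Vulkan 1.3), where a VkPipelineShaderStageRequiredSubgroupSizeCreateInfo struct is chained into the compute shader stage when the pipeline is created. The sketch below is only an illustration under that assumption, not the repository's code; shader_module, pipeline_layout and the helper name are placeholders.

    #include <vulkan/vulkan.hpp>

    // Hedged sketch (not the actual ggml-vulkan code): force a compute shader
    // to a specific subgroup size via VK_EXT_subgroup_size_control. The device
    // must have the subgroupSizeControl feature enabled; shader_module and
    // pipeline_layout are assumed to exist already.
    static vk::Pipeline create_compute_pipeline_with_subgroup_size(
            vk::Device device, vk::ShaderModule shader_module,
            vk::PipelineLayout pipeline_layout, uint32_t required_subgroup_size) {
        vk::PipelineShaderStageRequiredSubgroupSizeCreateInfoEXT subgroup_info{};
        subgroup_info.requiredSubgroupSize = required_subgroup_size;

        vk::PipelineShaderStageCreateInfo stage_info(
            vk::PipelineShaderStageCreateFlags{},
            vk::ShaderStageFlagBits::eCompute, shader_module, "main");
        if (required_subgroup_size != 0) {
            stage_info.pNext = &subgroup_info;  // 0 = keep the driver's default
        }

        vk::ComputePipelineCreateInfo pipeline_info(
            vk::PipelineCreateFlags{}, stage_info, pipeline_layout);
        return device.createComputePipeline(nullptr, pipeline_info).value;
    }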

@jeffbolznv
Collaborator

Perf on RTX 4070:

before
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):              36036 runs -    30.08 us/run -    10244 kB/run -  324.76 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):               6560 runs -   153.85 us/run -    40964 kB/run -  253.95 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[256,256,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                      156 runs -  8152.89 us/run -   655364 kB/run -   76.81 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,2560,1],ne_kernel=[3,3,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                     3608 runs -   287.17 us/run -   102445 kB/run -  340.32 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,2560,1],ne_kernel=[3,3,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                      574 runs -  1961.57 us/run -   409645 kB/run -  199.40 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):               4278 runs -   278.76 us/run -    23536 kB/run -   80.52 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):               1005 runs -  1312.75 us/run -   100208 kB/run -   72.81 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[256,256,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                       40 runs - 37220.75 us/run -  1678448 kB/run -   43.09 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,2560,1],ne_kernel=[5,5,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                      429 runs -  2852.16 us/run -   235365 kB/run -   78.72 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,2560,1],ne_kernel=[5,5,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                       68 runs - 21740.88 us/run -  1002085 kB/run -   44.01 GB/s
  
after
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):              36036 runs -    28.20 us/run -    10244 kB/run -  346.41 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):               7380 runs -   146.53 us/run -    40964 kB/run -  266.65 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[256,256,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                      156 runs -  8175.81 us/run -   655364 kB/run -   76.59 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,2560,1],ne_kernel=[3,3,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                     3936 runs -   263.05 us/run -   102445 kB/run -  371.53 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,2560,1],ne_kernel=[3,3,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                      574 runs -  1958.18 us/run -   409645 kB/run -  199.75 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):               4278 runs -   280.32 us/run -    23536 kB/run -   80.07 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):               1005 runs -  1320.69 us/run -   100208 kB/run -   72.37 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[256,256,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                       40 runs - 37208.22 us/run -  1678448 kB/run -   43.10 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,2560,1],ne_kernel=[5,5,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                      429 runs -  2866.03 us/run -   235365 kB/run -   78.34 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,2560,1],ne_kernel=[5,5,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                       68 runs - 21979.57 us/run -  1002085 kB/run -   43.53 GB/s

@daniandtheweb
Contributor Author

daniandtheweb commented Feb 13, 2025

Thanks a lot for the information; I'd been trying to implement it directly in the shader, not knowing it was already implemented in the main ggml-vulkan.cpp code.

I did some tests manually setting the subgroup size to 32, and I can recreate the results I had using the RADV env.
I also tested creating a second pipeline that uses a subgroup size of 64 and used it only on im2col, and I managed to get another speedup in stable-diffusion (1.55 it/s PR + subgroup 32 vs 1.59 it/s PR + mixed subgroups). There also seem to be other operations I'm currently testing that work faster on subgroup 64 than 32 besides im2col (IM2COL 64, MUL_MAT some faster on 32 and some on 64, SOFT_MAX 64, ADD 32, CPY 32).

Do you think it would be a good idea to create two pipelines (one with subgroup 64 and one with subgroup 32) and use them only on specific GPUs? I don't know if it could be useful on other GPUs.

@daniandtheweb
Contributor Author

By setting some pipelines to subgroup 64 and some to subgroup 32, I can get some good performance gains on stable diffusion xl 1024x1024 20 steps with tiled vae decode: stock 108.73 s, PR 107.59 s, PR + wave 32 105.66 s, PR + mixed pipelines 100.99 s.

@0cc4m
Collaborator

0cc4m commented Feb 14, 2025

> Do you think it would be a good idea to create two pipelines (one with subgroup 64 and one with subgroup 32) and use them only on specific GPUs? I don't know if it could be useful on other GPUs.

If there is a specific pipeline where it gives a significant advantage on RDNA, you could do that. Otherwise, just globally set the pipeline to 32 or 64 for RDNA, depending on what is better, and leave it alone for other vendors or older AMD.

@daniandtheweb
Contributor Author

daniandtheweb commented Feb 14, 2025

With the latest changes, the RX 5700 XT uses subgroup 32 when it's detected, and forces subgroup 64 on IM2COL (other operations, like softmax, may be faster on subgroup 64, but since they don't make any difference in real-world usage I didn't include them).

With this approach + mesa-git, the stable diffusion xl 1024x1024 20-step run went down to 98 s from the original 108 s.

I'm still not sure if the ggml_vk_create_pipeline_64 used in im2col is the best approach (I really don't like how I had to duplicate that part of the code), but I couldn't figure out a better way to do it.

@daniandtheweb daniandtheweb changed the title vulkan: improve im2col performance vulkan: improve im2col and AMD RX 5700XT performance Feb 14, 2025
@daniandtheweb daniandtheweb force-pushed the vk-shader-optimizations-1 branch 2 times, most recently from 9205de6 to a4ef6dd on February 14, 2025 21:22
@daniandtheweb daniandtheweb force-pushed the vk-shader-optimizations-1 branch from a4ef6dd to 14ea4fa on February 14, 2025 21:22
@daniandtheweb
Contributor Author

I've pushed the latest changes, which remove the awful code duplication and instead introduce a helper function that looks up a per-shader subgroup-size override in a map and applies it when one is present. I'm not sure whether this could be useful on other GPUs (I wonder if other AMD GPUs, or Intel ones, might benefit from overriding certain subgroup sizes).
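
For illustration, a minimal sketch of the kind of helper described above, under the assumption that it simply consults a name-to-size map; the actual PR code, shader names and chosen sizes may differ:

    #include <cstdint>
    #include <string>
    #include <unordered_map>

    // Hedged sketch: per-shader subgroup-size overrides. Shaders not listed
    // keep the device default. The entries are illustrative, matching the
    // behaviour described in this thread (im2col prefers wave64 on RDNA1).
    static uint32_t get_subgroup_size(const std::string & shader_name,
                                      uint32_t device_default) {
        static const std::unordered_map<std::string, uint32_t> overrides = {
            { "im2col_f32",     64 },
            { "im2col_f32_f16", 64 },
        };
        auto it = overrides.find(shader_name);
        return it != overrides.end() ? it->second : device_default;
    }

    // Usage (illustrative): pass the result as the pipeline's required subgroup size.
    // uint32_t sg = get_subgroup_size(name, device_subgroup_size);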

@0cc4m if you think this looks good I think that the PR is ready.

@daniandtheweb daniandtheweb force-pushed the vk-shader-optimizations-1 branch from d4ba722 to 04100e8 on February 14, 2025 21:47
@daniandtheweb daniandtheweb changed the title vulkan: improve im2col and AMD RX 5700XT performance vulkan: improve im2col and RDNA1 performance Feb 15, 2025
@daniandtheweb daniandtheweb force-pushed the vk-shader-optimizations-1 branch from c95dff0 to d151973 on February 15, 2025 18:53
@daniandtheweb daniandtheweb force-pushed the vk-shader-optimizations-1 branch from d20e97a to 0e5dd68 on February 16, 2025 02:26
@0cc4m
Collaborator

0cc4m commented Feb 21, 2025

I do see significant positive impact on the RX 6800 XT, in specific shaders. Especially matrix-matrix multiplication seems to like subgroup size 32. But we basically have to tune this for every shader on every RDNA generation, or even chip. This might need an autotuner at a later point.

@@ -1543,11 +1586,17 @@ static void ggml_vk_load_shaders(vk_device& device) {
device->pipeline_matmul_id_f32 = std::make_shared<vk_matmul_pipeline_struct>();
}

vk::PhysicalDeviceProperties2 props2;
device->physical_device.getProperties2(&props2);
std::string device_name = props2.properties.deviceName.data();
Collaborator

@0cc4m 0cc4m Feb 21, 2025


I don't think this is needed anymore (also, the device name is available in device->name, and the properties in device->properties)

Edit: I forgot it's used by the get_subgroup_size function. But just use the device field.

Contributor Author

@daniandtheweb daniandtheweb Feb 22, 2025


If I don't set the device name like this, the shader executes at wave64 speed instead of wave32 and I get no performance improvement at all.

Collaborator


I checked, and device->name contains just something like Vulkan0, so that's why. But if you need the actual device name, it should be stored in the device struct on init and only accessed here, something like physical_device_name.

But as mentioned in my other comment, we can probably ignore device names for now.

@0cc4m
Collaborator

0cc4m commented Feb 21, 2025

Applying this to stable-diffusion.cpp is actually very tedious. If you've already done that, could you upload your version to a fork? Then I can compare the sd15 and sdxl times on the 6800 XT.

@daniandtheweb
Contributor Author

daniandtheweb commented Feb 22, 2025

I've just created two stable-diffusion.cpp branches in my fork with all the required changes. You can check out sync as the baseline and wave_test with this PR's changes, and build as usual.

@daniandtheweb
Contributor Author

daniandtheweb commented Feb 22, 2025

Here are my findings:

1. Declaring these three lines is needed to effectively enable wave32, at least on RDNA1 (otherwise, if I just call device->name directly, everything keeps using wave64):

       vk::PhysicalDeviceProperties2 props2;
       device->physical_device.getProperties2(&props2);
       std::string device_name = props2.properties.deviceName.data();

2. Setting device->subgroup_size to the same value as required_subgroup_size keeps the wave32 results roughly neutral compared to wave64 (some exceptions are im2col and softmax, which clearly prefer wave64).
3. Using the new changes, I reach the same performance in llama-bench as with RADV_PERFTEST=cswave32 on master, but without any additional flag.
4. These wave32 changes seem to only benefit the RADV driver. The Vulkan Pro and AMDVLK drivers seem to have similar performance in wave32 and wave64.

./llama-bench -m ~/Applications/chat/gguf/llama-2-7b.Q4_0.gguf -ngl 100

Master:
ggml_vulkan: 0 = AMD Radeon RX 5700 XT (RADV NAVI10) (radv) | uma: 0 | fp16: 1 | warp size: 64 | shared memory: 65536 | matrix cores: none
| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     | 100 |         pp512 |        434.46 ± 0.67 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     | 100 |         tg128 |         68.75 ± 0.03 |

build: 51f311e0 (4753)
Master + cswave32:
ggml_vulkan: 0 = AMD Radeon RX 5700 XT (RADV NAVI10) (radv) | uma: 0 | fp16: 1 | warp size: 64 | shared memory: 65536 | matrix cores: none
| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     | 100 |         pp512 |        465.67 ± 0.33 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     | 100 |         tg128 |         67.72 ± 0.04 |

build: 51f311e0 (4753)
PR:
ggml_vulkan: 0 = AMD Radeon RX 5700 XT (RADV NAVI10) (radv) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 65536 | matrix cores: none
| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     | 100 |         pp512 |        465.67 ± 0.12 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     | 100 |         tg128 |         67.68 ± 0.26 |

build: 4a3988e9 (4688)

I have updated the branch wave_test in my stable-diffusion.cpp fork to use the new changes.

stable-diffusion.cpp sync : 
sd 1.5 - 512x512 - 20 steps - 16.52s, 1.41it/s
sd xl - 1024x1024 - 20 steps - 108.78s, 4.35s/it, vae tiling 2.65it/s
stable-diffusion.cpp wave_test:
sd 1.5 - 512x512 - 20 steps - 14.87s, 1.60it/s
sd xl - 1024x1024 - 20 steps - 101.46s, 4.02s/it, vae tiling 2.74it/s

@0cc4m
Collaborator

0cc4m commented Feb 25, 2025

Thank you, I tested it on RX 6800 XT and found that this PR is also mostly positive for it. The difference for stable-diffusion is small, but noticeable. I think we can apply this to all RDNA, which simplifies the detection by quite a bit. Simply look at min and max subgroup sizes. If min is 32 and max is 64, it's RDNA. We've used this in the past as well. Then we don't need to look at device names.
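
For illustration, a hedged sketch of that detection using the subgroup-size-control properties (the struct and field names are the standard Vulkan ones; how this is wired into ggml-vulkan may differ, and a real check would presumably also verify that the vendor is AMD):

    #include <vulkan/vulkan.hpp>

    // Hedged sketch: RDNA GPUs report a variable subgroup size of 32..64
    // through VK_EXT_subgroup_size_control, while GCN and most other vendors
    // report a single fixed size.
    static bool device_looks_like_rdna(vk::PhysicalDevice physical_device) {
        vk::PhysicalDeviceSubgroupSizeControlPropertiesEXT subgroup_props{};
        vk::PhysicalDeviceProperties2 props2{};
        props2.pNext = &subgroup_props;
        physical_device.getProperties2(&props2);

        return subgroup_props.minSubgroupSize == 32 &&
               subgroup_props.maxSubgroupSize == 64;
    }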

@daniandtheweb
Contributor Author

daniandtheweb commented Feb 25, 2025

Yesterday I received my new GPU, an RX 7800 XT. Once I get home I'll try enabling the changes for all RDNA and let you know how it performs.

@daniandtheweb
Contributor Author

daniandtheweb commented Feb 25, 2025

I did some tests on the 7800 XT: im2col and softmax are still better in wave64 than in wave32. However, there's no overall speedup from wave32; I only get regressions in some operations when switching to it, and some big regressions in stable-diffusion.cpp.
I don't think it would be a good idea to set all RDNA cards to wave32, at least for now, given those regressions.

The im2col changes alone still slightly improve the generation speed.

RX 7800XT results:

Master:
 IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):              19656 runs -    51.67 us/run -    10244 kB/run -  189.10 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):               4100 runs -   249.27 us/run -    40964 kB/run -  156.74 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[256,256,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                       52 runs - 68990.81 us/run -   655364 kB/run -    9.08 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,2560,1],ne_kernel=[3,3,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                     1968 runs -   527.49 us/run -   102445 kB/run -  185.27 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,2560,1],ne_kernel=[3,3,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                      164 runs - 11948.24 us/run -   409645 kB/run -   32.74 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):               7130 runs -   165.70 us/run -    23536 kB/run -  135.47 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):               1005 runs -  1270.84 us/run -   100208 kB/run -   75.21 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[256,256,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                       20 runs - 198765.85 us/run -  1678448 kB/run -    8.07 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,2560,1],ne_kernel=[5,5,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                      715 runs -  1685.78 us/run -   235365 kB/run -  133.19 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,2560,1],ne_kernel=[5,5,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                       34 runs - 41817.24 us/run -  1002085 kB/run -   22.88 GB/s


PR:
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):              29484 runs -    36.56 us/run -    10244 kB/run -  267.23 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):               5740 runs -   194.34 us/run -    40964 kB/run -  201.05 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[256,256,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                       52 runs - 78117.40 us/run -   655364 kB/run -    8.02 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,2560,1],ne_kernel=[3,3,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                     2296 runs -   500.25 us/run -   102445 kB/run -  195.36 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,2560,1],ne_kernel=[3,3,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                      164 runs - 11770.01 us/run -   409645 kB/run -   33.23 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):               7130 runs -   158.68 us/run -    23536 kB/run -  141.46 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):               1005 runs -  1266.73 us/run -   100208 kB/run -   75.45 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[256,256,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                       20 runs - 197080.80 us/run -  1678448 kB/run -    8.14 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,2560,1],ne_kernel=[5,5,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                      715 runs -  1642.38 us/run -   235365 kB/run -  136.71 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,2560,1],ne_kernel=[5,5,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                       34 runs - 40360.91 us/run -  1002085 kB/run -   23.71 GB/s
stable-diffusion.cpp sync : 
sd 1.5 - 512x512 - 20 steps - 8.01s, 3.23it/s
sd xl - 1024x1024 - 20 steps - 49.88s, 1.90s/it, vae tiling 4.52it/s
stable-diffusion.cpp wave_test (default wave64) : 
sd 1.5 - 512x512 - 20 steps - 7.73s, 3.37it/s
sd xl - 1024x1024 - 20 steps - 49.62s, 1.89s/it, vae tiling 4.50it/s

@0cc4m
Collaborator

0cc4m commented Feb 26, 2025

Hmm, in that case maybe it's better to stick with im2col in this PR and move the RDNA subgroup size optimization to another one? It's a really interesting find that forcing subgroup 32 helps in a lot of cases, but selection needs more work.

@daniandtheweb
Contributor Author

Sounds good to me; this PR started like that after all. I'll open a draft PR with the subgroup changes so that more testing can easily be done.

@daniandtheweb daniandtheweb changed the title vulkan: improve im2col and RDNA1 performance vulkan: improve im2col Feb 26, 2025
Collaborator

@0cc4m 0cc4m left a comment


It's an overall improvement on all my GPUs, at least in perf. LGTM

@0cc4m 0cc4m merged commit 581650b into ggml-org:master Feb 28, 2025
43 checks passed
mglambda pushed a commit to mglambda/llama.cpp that referenced this pull request Mar 8, 2025
* vulkan: improve im2col performance
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Mar 19, 2025
* vulkan: improve im2col performance
mostlyuseful pushed a commit to mostlyuseful/llama.cpp that referenced this pull request May 12, 2025
* vulkan: improve im2col performance