
vulkan: improve im2col #11826

Merged: 12 commits merged into ggml-org:master from vk-shader-optimizations-1 on Feb 28, 2025

Conversation

daniandtheweb
Contributor

This PR supersedes #11778.
Here are the performance numbers on my Radeon RX 5700 XT (RADV).

Vulkan:

Master:
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):              13104 runs -    95.82 us/run -    10244 kB/run -  101.96 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):               1640 runs -   639.72 us/run -    40964 kB/run -   61.08 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[256,256,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                       52 runs - 79006.56 us/run -   655364 kB/run -    7.93 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,2560,1],ne_kernel=[3,3,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                     1312 runs -   900.50 us/run -   102445 kB/run -  108.53 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,2560,1],ne_kernel=[3,3,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                       82 runs - 20310.68 us/run -   409645 kB/run -   19.26 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):               4278 runs -   236.25 us/run -    23536 kB/run -   95.01 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                335 runs -  3518.08 us/run -   100208 kB/run -   27.17 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[256,256,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                       20 runs - 313087.20 us/run -  1678448 kB/run -    5.12 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,2560,1],ne_kernel=[5,5,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                      572 runs -  2263.98 us/run -   235365 kB/run -   99.18 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,2560,1],ne_kernel=[5,5,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                       34 runs - 71241.88 us/run -  1002085 kB/run -   13.43 GB/s

PR:
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):              19656 runs -    58.61 us/run -    10244 kB/run -  166.70 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):               1640 runs -   664.67 us/run -    40964 kB/run -   58.78 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[256,256,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                       52 runs - 79993.31 us/run -   655364 kB/run -    7.83 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,2560,1],ne_kernel=[3,3,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                     1968 runs -   602.13 us/run -   102445 kB/run -  162.31 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,2560,1],ne_kernel=[3,3,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                       82 runs - 17203.68 us/run -   409645 kB/run -   22.74 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):               5704 runs -   227.95 us/run -    23536 kB/run -   98.47 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                670 runs -  2895.43 us/run -   100208 kB/run -   33.01 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[256,256,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                       20 runs - 315784.75 us/run -  1678448 kB/run -    5.08 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,2560,1],ne_kernel=[5,5,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                      572 runs -  2218.08 us/run -   235365 kB/run -  101.23 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,2560,1],ne_kernel=[5,5,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                       34 runs - 72422.56 us/run -  1002085 kB/run -   13.21 GB/s

HIP:

  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):               3276 runs -   923.04 us/run -    10244 kB/run -   10.58 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                820 runs -  3359.14 us/run -    40964 kB/run -   11.63 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[256,256,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                       52 runs - 58759.37 us/run -   655364 kB/run -   10.66 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,2560,1],ne_kernel=[3,3,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                      328 runs -  9271.95 us/run -   102445 kB/run -   10.54 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,2560,1],ne_kernel=[3,3,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                       82 runs - 79109.07 us/run -   409645 kB/run -    4.94 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):               1426 runs -  1077.72 us/run -    23536 kB/run -   20.83 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                335 runs -  9038.40 us/run -   100208 kB/run -   10.57 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[256,256,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                       20 runs - 160059.45 us/run -  1678448 kB/run -   10.02 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,2560,1],ne_kernel=[5,5,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                      143 runs - 12785.92 us/run -   235365 kB/run -   17.56 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,2560,1],ne_kernel=[5,5,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                       34 runs - 120635.09 us/run -  1002085 kB/run -    7.93 GB/s

I'm also including the benchmark results when using RADV_PERFTEST=cswave32. It's interesting to note that, despite this variable hurting this specific operation's performance, it actually improves speed in stable-diffusion.cpp (sd 1.5 512x512: without PR 1.38 it/s, with PR 1.45 it/s, with PR + cswave32 1.55 it/s).

Vulkan cswave32:

Master:
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):              13104 runs -    94.26 us/run -    10244 kB/run -  103.65 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                820 runs -  1426.19 us/run -    40964 kB/run -   27.40 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[256,256,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                       52 runs - 122464.00 us/run -   655364 kB/run -    5.11 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,2560,1],ne_kernel=[3,3,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                      984 runs -  1240.23 us/run -   102445 kB/run -   78.80 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,2560,1],ne_kernel=[3,3,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                       82 runs - 67429.80 us/run -   409645 kB/run -    5.80 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):               4278 runs -   325.81 us/run -    23536 kB/run -   68.89 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                335 runs -  7415.25 us/run -   100208 kB/run -   12.89 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[256,256,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                       20 runs - 254549.75 us/run -  1678448 kB/run -    6.30 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,2560,1],ne_kernel=[5,5,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                      286 runs -  4823.12 us/run -   235365 kB/run -   46.55 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,2560,1],ne_kernel=[5,5,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                       34 runs - 358562.76 us/run -  1002085 kB/run -    2.67 GB/s

PR:
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):              13104 runs -    81.44 us/run -    10244 kB/run -  119.98 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                820 runs -  1439.59 us/run -    40964 kB/run -   27.14 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[256,256,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                       52 runs - 135311.06 us/run -   655364 kB/run -    4.63 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,2560,1],ne_kernel=[3,3,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                      984 runs -  1191.29 us/run -   102445 kB/run -   82.04 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,2560,1],ne_kernel=[3,3,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                       82 runs - 62476.68 us/run -   409645 kB/run -    6.26 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):               4278 runs -   315.84 us/run -    23536 kB/run -   71.07 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                335 runs -  7497.88 us/run -   100208 kB/run -   12.75 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[256,256,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                       20 runs - 254943.60 us/run -  1678448 kB/run -    6.29 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,2560,1],ne_kernel=[5,5,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                      286 runs -  4767.25 us/run -   235365 kB/run -   47.10 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,2560,1],ne_kernel=[5,5,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                       34 runs - 358791.47 us/run -  1002085 kB/run -    2.67 GB/s

@github-actions github-actions bot added the Vulkan (Issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) labels Feb 12, 2025
@0cc4m 0cc4m self-requested a review February 13, 2025 10:07
@0cc4m
Collaborator

0cc4m commented Feb 13, 2025

If cswave32 helps you, maybe look into whether setting the requiredSubgroupSize in the shader does the same thing.

@daniandtheweb
Contributor Author

That's nice to know, thanks. I'll look into it to check where it does make a difference because, as I mentioned, it actually slows down im2col.

@0cc4m
Collaborator

0cc4m commented Feb 13, 2025

> That's nice to know, thanks. I'll look into it to check where it does make a difference because, as I mentioned, it actually slows down im2col.

Real-world performance (like sd.cpp benchmarks) is more important than GB/s in test-backend-ops perf. The dimensions in real use are probably different, and there are other factors at play (for example, graph execution instead of a single op getting repeated).

@daniandtheweb
Contributor Author

daniandtheweb commented Feb 13, 2025

I've been trying to use the VK_EXT_subgroup_size_control extension in GLSL to set the requiredSubgroupSize, but I can't manage to make it work. Is the right approach to enable the "GL_EXT_subgroup_size_control" extension and then append requiredSubgroupSize to the layout in which local_size_x_id is set, or am I using the extension wrongly?

What I'm trying to do is set the subgroup size on this specific shader, or at least apply the change to all the shaders through ggml-vulkan.cpp, without having to rely on the RADV_PERFTEST env, but for now I can't find a way to do that.

@0cc4m
Collaborator

0cc4m commented Feb 13, 2025

Oh sorry, I thought you knew that the extension is already implemented. You can set values when loading pipelines:

https://github.com/ggerganov/llama.cpp/blob/8a8c4ceb6050bd9392609114ca56ae6d26f5b8f5/ggml/src/ggml-vulkan/ggml-vulkan.cpp#L1550-L1552
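
For context, that required_subgroup_size parameter maps to Vulkan's VK_EXT_subgroup_size_control mechanism (core in Vulkan 1.3), where a VkPipelineShaderStageRequiredSubgroupSizeCreateInfo struct is chained into the compute shader stage when the pipeline is created. The sketch below is only an illustration under that assumption, not the repository's code; shader_module, pipeline_layout and the helper name are placeholders.

    #include <vulkan/vulkan.hpp>

    // Hedged sketch (not the actual ggml-vulkan code): force a compute shader
    // to a specific subgroup size via VK_EXT_subgroup_size_control. The device
    // must have the subgroupSizeControl feature enabled; shader_module and
    // pipeline_layout are assumed to exist already.
    static vk::Pipeline create_compute_pipeline_with_subgroup_size(
            vk::Device device, vk::ShaderModule shader_module,
            vk::PipelineLayout pipeline_layout, uint32_t required_subgroup_size) {
        vk::PipelineShaderStageRequiredSubgroupSizeCreateInfoEXT subgroup_info{};
        subgroup_info.requiredSubgroupSize = required_subgroup_size;

        vk::PipelineShaderStageCreateInfo stage_info(
            vk::PipelineShaderStageCreateFlags{},
            vk::ShaderStageFlagBits::eCompute, shader_module, "main");
        if (required_subgroup_size != 0) {
            stage_info.pNext = &subgroup_info;  // 0 = keep the driver's default
        }

        vk::ComputePipelineCreateInfo pipeline_info(
            vk::PipelineCreateFlags{}, stage_info, pipeline_layout);
        return device.createComputePipeline(nullptr, pipeline_info).value;
    }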

@jeffbolznv
Collaborator

Perf on RTX 4070:

before
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):              36036 runs -    30.08 us/run -    10244 kB/run -  324.76 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):               6560 runs -   153.85 us/run -    40964 kB/run -  253.95 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[256,256,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                      156 runs -  8152.89 us/run -   655364 kB/run -   76.81 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,2560,1],ne_kernel=[3,3,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                     3608 runs -   287.17 us/run -   102445 kB/run -  340.32 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,2560,1],ne_kernel=[3,3,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                      574 runs -  1961.57 us/run -   409645 kB/run -  199.40 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):               4278 runs -   278.76 us/run -    23536 kB/run -   80.52 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):               1005 runs -  1312.75 us/run -   100208 kB/run -   72.81 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[256,256,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                       40 runs - 37220.75 us/run -  1678448 kB/run -   43.09 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,2560,1],ne_kernel=[5,5,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                      429 runs -  2852.16 us/run -   235365 kB/run -   78.72 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,2560,1],ne_kernel=[5,5,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                       68 runs - 21740.88 us/run -  1002085 kB/run -   44.01 GB/s
  
after
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):              36036 runs -    28.20 us/run -    10244 kB/run -  346.41 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):               7380 runs -   146.53 us/run -    40964 kB/run -  266.65 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[256,256,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                      156 runs -  8175.81 us/run -   655364 kB/run -   76.59 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,2560,1],ne_kernel=[3,3,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                     3936 runs -   263.05 us/run -   102445 kB/run -  371.53 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,2560,1],ne_kernel=[3,3,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                      574 runs -  1958.18 us/run -   409645 kB/run -  199.75 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):               4278 runs -   280.32 us/run -    23536 kB/run -   80.07 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):               1005 runs -  1320.69 us/run -   100208 kB/run -   72.37 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[256,256,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                       40 runs - 37208.22 us/run -  1678448 kB/run -   43.10 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,2560,1],ne_kernel=[5,5,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                      429 runs -  2866.03 us/run -   235365 kB/run -   78.34 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,2560,1],ne_kernel=[5,5,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                       68 runs - 21979.57 us/run -  1002085 kB/run -   43.53 GB/s

@daniandtheweb
Contributor Author

daniandtheweb commented Feb 13, 2025

Thanks a lot for the information; I'd been trying to implement it directly in the shader, not knowing it was already implemented in the main ggml-vulkan.cpp code.

I did some tests manually setting the subgroup size to 32, and I can recreate the results I had using the RADV env.
I also tested creating a second pipeline that uses a subgroup size of 64 and used it only on im2col, and I managed to get another speedup in stable-diffusion (1.55 it/s PR + subgroup 32 vs 1.59 it/s PR + mixed subgroups). There also seem to be other operations I'm currently testing that work faster on subgroup 64 than 32 besides im2col (IM2COL 64, MUL_MAT some faster on 32 and some on 64, SOFT_MAX 64, ADD 32, CPY 32).

Do you think it would be a good idea to create two pipelines (one with subgroup 64 and one with subgroup 32) and use them only on specific GPUs? I don't know if it could be useful on other GPUs.

@daniandtheweb
Contributor Author

By setting some pipelines to subgroup 64 and some to subgroup 32, I can get some good performance gains on stable diffusion xl 1024x1024 20 steps with tiled vae decode: stock 108.73 s, PR 107.59 s, PR + wave 32 105.66 s, PR + mixed pipelines 100.99 s.

@0cc4m
Collaborator

0cc4m commented Feb 14, 2025

> Do you think it would be a good idea to create two pipelines (one with subgroup 64 and one with subgroup 32) and use them only on specific GPUs? I don't know if it could be useful on other GPUs.

If there is a specific pipeline where it gives a significant advantage on RDNA, you could do that. Otherwise, just globally set the pipeline to 32 or 64 for RDNA, depending on what is better, and leave it alone for other vendors or older AMD.

@daniandtheweb
Contributor Author

daniandtheweb commented Feb 14, 2025

With the latest changes, the RX 5700 XT uses subgroup 32 when it's detected, and forces subgroup 64 on IM2COL (other operations, like softmax, may be faster on subgroup 64, but since they don't make any difference in real-world usage I didn't include them).

With this approach + mesa-git, the stable diffusion xl 1024x1024 20-step run went down to 98 s from the original 108 s.

I'm still not sure if the ggml_vk_create_pipeline_64 used in im2col is the best approach (I really don't like how I had to duplicate that part of the code), but I couldn't figure out a better way to do it.

@daniandtheweb daniandtheweb changed the title vulkan: improve im2col performance vulkan: improve im2col and AMD RX 5700XT performance Feb 14, 2025
@daniandtheweb daniandtheweb force-pushed the vk-shader-optimizations-1 branch 2 times, most recently from 9205de6 to a4ef6dd on February 14, 2025 21:22
@daniandtheweb daniandtheweb force-pushed the vk-shader-optimizations-1 branch from a4ef6dd to 14ea4fa on February 14, 2025 21:22
@daniandtheweb
Contributor Author

I've pushed the latest changes, which remove the awful code duplication and instead introduce a helper function that looks up a per-shader subgroup-size override in a map and applies it when one is present. I'm not sure whether this could be useful on other GPUs (I wonder if other AMD GPUs, or Intel ones, might benefit from overriding certain subgroup sizes).
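
For illustration, a minimal sketch of the kind of helper described above, under the assumption that it simply consults a name-to-size map; the actual PR code, shader names and chosen sizes may differ:

    #include <cstdint>
    #include <string>
    #include <unordered_map>

    // Hedged sketch: per-shader subgroup-size overrides. Shaders not listed
    // keep the device default. The entries are illustrative, matching the
    // behaviour described in this thread (im2col prefers wave64 on RDNA1).
    static uint32_t get_subgroup_size(const std::string & shader_name,
                                      uint32_t device_default) {
        static const std::unordered_map<std::string, uint32_t> overrides = {
            { "im2col_f32",     64 },
            { "im2col_f32_f16", 64 },
        };
        auto it = overrides.find(shader_name);
        return it != overrides.end() ? it->second : device_default;
    }

    // Usage (illustrative): pass the result as the pipeline's required subgroup size.
    // uint32_t sg = get_subgroup_size(name, device_subgroup_size);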

@0cc4m if you think this looks good I think that the PR is ready.

@daniandtheweb daniandtheweb force-pushed the vk-shader-optimizations-1 branch from d4ba722 to 04100e8 on February 14, 2025 21:47
@daniandtheweb daniandtheweb changed the title vulkan: improve im2col and AMD RX 5700XT performance vulkan: improve im2col and RDNA1 performance Feb 15, 2025
@daniandtheweb daniandtheweb force-pushed the vk-shader-optimizations-1 branch from c95dff0 to d151973 on February 15, 2025 18:53
@daniandtheweb daniandtheweb force-pushed the vk-shader-optimizations-1 branch from d20e97a to 0e5dd68 on February 16, 2025 02:26
@0cc4m
Collaborator

0cc4m commented Feb 21, 2025

I do see significant positive impact on the RX 6800 XT, in specific shaders. Especially matrix-matrix multiplication seems to like subgroup size 32. But we basically have to tune this for every shader on every RDNA generation, or even chip. This might need an autotuner at a later point.

@@ -1543,11 +1586,17 @@ static void ggml_vk_load_shaders(vk_device& device) {
device->pipeline_matmul_id_f32 = std::make_shared<vk_matmul_pipeline_struct>();
}

vk::PhysicalDeviceProperties2 props2;
device->physical_device.getProperties2(&props2);
std::string device_name = props2.properties.deviceName.data();
Collaborator

@0cc4m 0cc4m Feb 21, 2025


I don't think this is needed anymore (also, the device name is available in device->name, and the properties in device->properties)

Edit: I forgot it's used by the get_subgroup_size function. But just use the device field.

Contributor Author

@daniandtheweb daniandtheweb Feb 22, 2025


If I don't set the device name like this, the shader executes at wave64 speed instead of wave32 and I get no performance improvement at all.

Collaborator


I checked, and device->name contains just something like Vulkan0, so that's why. But if you need the actual device name, it should be stored in the device struct on init and only accessed here, something like physical_device_name.

But as mentioned in my other comment, we can probably ignore device names for now.

@0cc4m
Collaborator

0cc4m commented Feb 21, 2025

Applying this to stable-diffusion.cpp is actually very tedious. If you've already done that, could you upload your version to a fork? Then I can compare the sd15 and sdxl times on the 6800 XT.

@daniandtheweb
Contributor Author

daniandtheweb commented Feb 22, 2025

I've just created two stable-diffusion.cpp branches in my fork with all the required changes. You can check out sync as the baseline and wave_test with this PR's changes, and build as usual.

@daniandtheweb
Contributor Author

daniandtheweb commented Feb 22, 2025

Here are my findings:

1. Declaring these three lines is needed to effectively enable wave32, at least on RDNA1 (otherwise, if I just call device->name directly, everything keeps using wave64):

       vk::PhysicalDeviceProperties2 props2;
       device->physical_device.getProperties2(&props2);
       std::string device_name = props2.properties.deviceName.data();

2. Setting device->subgroup_size to the same value as required_subgroup_size keeps the wave32 results roughly neutral compared to wave64 (some exceptions are im2col and softmax, which clearly prefer wave64).
3. Using the new changes, I reach the same performance in llama-bench as with RADV_PERFTEST=cswave32 on master, but without any additional flag.
4. These wave32 changes seem to only benefit the RADV driver. The Vulkan Pro and AMDVLK drivers seem to have similar performance in wave32 and wave64.

./llama-bench -m ~/Applications/chat/gguf/llama-2-7b.Q4_0.gguf -ngl 100

Master:
ggml_vulkan: 0 = AMD Radeon RX 5700 XT (RADV NAVI10) (radv) | uma: 0 | fp16: 1 | warp size: 64 | shared memory: 65536 | matrix cores: none
| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     | 100 |         pp512 |        434.46 ± 0.67 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     | 100 |         tg128 |         68.75 ± 0.03 |

build: 51f311e0 (4753)
Master + cswave32:
ggml_vulkan: 0 = AMD Radeon RX 5700 XT (RADV NAVI10) (radv) | uma: 0 | fp16: 1 | warp size: 64 | shared memory: 65536 | matrix cores: none
| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     | 100 |         pp512 |        465.67 ± 0.33 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     | 100 |         tg128 |         67.72 ± 0.04 |

build: 51f311e0 (4753)
PR:
ggml_vulkan: 0 = AMD Radeon RX 5700 XT (RADV NAVI10) (radv) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 65536 | matrix cores: none
| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     | 100 |         pp512 |        465.67 ± 0.12 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     | 100 |         tg128 |         67.68 ± 0.26 |

build: 4a3988e9 (4688)

I have updated the branch wave_test in my stable-diffusion.cpp fork to use the new changes.

stable-diffusion.cpp sync : 
sd 1.5 - 512x512 - 20 steps - 16.52s, 1.41it/s
sd xl - 1024x1024 - 20 steps - 108.78s, 4.35s/it, vae tiling 2.65it/s
stable-diffusion.cpp wave_test:
sd 1.5 - 512x512 - 20 steps - 14.87s, 1.60it/s
sd xl - 1024x1024 - 20 steps - 101.46s, 4.02s/it, vae tiling 2.74it/s

@0cc4m
Collaborator

0cc4m commented Feb 25, 2025

Thank you, I tested it on RX 6800 XT and found that this PR is also mostly positive for it. The difference for stable-diffusion is small, but noticeable. I think we can apply this to all RDNA, which simplifies the detection by quite a bit. Simply look at min and max subgroup sizes. If min is 32 and max is 64, it's RDNA. We've used this in the past as well. Then we don't need to look at device names.
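
For illustration, a hedged sketch of that detection using the subgroup-size-control properties (the struct and field names are the standard Vulkan ones; how this is wired into ggml-vulkan may differ, and a real check would presumably also verify that the vendor is AMD):

    #include <vulkan/vulkan.hpp>

    // Hedged sketch: RDNA GPUs report a variable subgroup size of 32..64
    // through VK_EXT_subgroup_size_control, while GCN and most other vendors
    // report a single fixed size.
    static bool device_looks_like_rdna(vk::PhysicalDevice physical_device) {
        vk::PhysicalDeviceSubgroupSizeControlPropertiesEXT subgroup_props{};
        vk::PhysicalDeviceProperties2 props2{};
        props2.pNext = &subgroup_props;
        physical_device.getProperties2(&props2);

        return subgroup_props.minSubgroupSize == 32 &&
               subgroup_props.maxSubgroupSize == 64;
    }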

@daniandtheweb
Contributor Author

daniandtheweb commented Feb 25, 2025

Yesterday I received my new GPU, an RX 7800 XT. Once I get home I'll try enabling the changes for all RDNA and let you know how it performs.

@daniandtheweb
Contributor Author

daniandtheweb commented Feb 25, 2025

I did some tests on the 7800 XT: im2col and softmax are still better in wave64 than in wave32. However, there's no overall speedup from wave32; I only get regressions in some operations when switching to it, and some big regressions in stable-diffusion.cpp.
I don't think it would be a good idea to set all RDNA cards to wave32, at least for now, given those regressions.

The im2col changes alone still slightly improve the generation speed.

RX 7800XT results:

Master:
 IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):              19656 runs -    51.67 us/run -    10244 kB/run -  189.10 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):               4100 runs -   249.27 us/run -    40964 kB/run -  156.74 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[256,256,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                       52 runs - 68990.81 us/run -   655364 kB/run -    9.08 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,2560,1],ne_kernel=[3,3,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                     1968 runs -   527.49 us/run -   102445 kB/run -  185.27 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,2560,1],ne_kernel=[3,3,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                      164 runs - 11948.24 us/run -   409645 kB/run -   32.74 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):               7130 runs -   165.70 us/run -    23536 kB/run -  135.47 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):               1005 runs -  1270.84 us/run -   100208 kB/run -   75.21 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[256,256,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                       20 runs - 198765.85 us/run -  1678448 kB/run -    8.07 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,2560,1],ne_kernel=[5,5,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                      715 runs -  1685.78 us/run -   235365 kB/run -  133.19 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,2560,1],ne_kernel=[5,5,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                       34 runs - 41817.24 us/run -  1002085 kB/run -   22.88 GB/s


PR:
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):              29484 runs -    36.56 us/run -    10244 kB/run -  267.23 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):               5740 runs -   194.34 us/run -    40964 kB/run -  201.05 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[256,256,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                       52 runs - 78117.40 us/run -   655364 kB/run -    8.02 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,2560,1],ne_kernel=[3,3,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                     2296 runs -   500.25 us/run -   102445 kB/run -  195.36 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,2560,1],ne_kernel=[3,3,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                      164 runs - 11770.01 us/run -   409645 kB/run -   33.23 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):               7130 runs -   158.68 us/run -    23536 kB/run -  141.46 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):               1005 runs -  1266.73 us/run -   100208 kB/run -   75.45 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[256,256,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                       20 runs - 197080.80 us/run -  1678448 kB/run -    8.14 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,2560,1],ne_kernel=[5,5,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                      715 runs -  1642.38 us/run -   235365 kB/run -  136.71 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,2560,1],ne_kernel=[5,5,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                       34 runs - 40360.91 us/run -  1002085 kB/run -   23.71 GB/s
stable-diffusion.cpp sync : 
sd 1.5 - 512x512 - 20 steps - 8.01s, 3.23it/s
sd xl - 1024x1024 - 20 steps - 49.88s, 1.90s/it, vae tiling 4.52it/s
stable-diffusion.cpp wave_test (default wave64) : 
sd 1.5 - 512x512 - 20 steps - 7.73s, 3.37it/s
sd xl - 1024x1024 - 20 steps - 49.62s, 1.89s/it, vae tiling 4.50it/s

@0cc4m
Collaborator

0cc4m commented Feb 26, 2025

Hmm, in that case maybe it's better to stick with im2col in this PR and move the RDNA subgroup size optimization to another one? It's a really interesting find that forcing subgroup 32 helps in a lot of cases, but selection needs more work.

@daniandtheweb
Contributor Author

Sounds good to me; this PR started like that after all. I'll open a draft PR with the subgroup changes so that more testing can easily be done.

@daniandtheweb daniandtheweb changed the title vulkan: improve im2col and RDNA1 performance vulkan: improve im2col Feb 26, 2025
Collaborator

@0cc4m 0cc4m left a comment


It's an overall improvement on all my GPUs, at least in perf. LGTM

@0cc4m 0cc4m merged commit 581650b into ggml-org:master Feb 28, 2025
43 checks passed
mglambda pushed a commit to mglambda/llama.cpp that referenced this pull request Mar 8, 2025
* vulkan: improve im2col performance
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Mar 19, 2025
* vulkan: improve im2col performance
mostlyuseful pushed a commit to mostlyuseful/llama.cpp that referenced this pull request May 12, 2025
* vulkan: improve im2col performance