vulkan: subgroup size tuning #12087
Conversation
I haven't forgotten about this; I was just working on the int8 matmul. I'll get back to you here soon.
I tried doing device architecture detection in a better way in daniandtheweb#1; let me know what you think.
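One plausible way to detect the AMD generation, sketched below, is to look at the subgroup size range reported through `VK_EXT_subgroup_size_control`: GCN wavefronts are fixed at 64, while RDNA parts support both wave32 and wave64. This is an illustrative sketch, not the PR's actual detection code; the enum and function names are hypothetical.

```cpp
#include <cstdint>

// Hypothetical architecture enum (illustrative, not llama.cpp's exact code).
enum class vk_device_architecture {
    OTHER,
    AMD_GCN,
    AMD_RDNA,
};

constexpr uint32_t VK_VENDOR_ID_AMD = 0x1002;

// min/max subgroup sizes come from
// VkPhysicalDeviceSubgroupSizeControlPropertiesEXT on real devices.
vk_device_architecture detect_amd_arch(uint32_t vendor_id,
                                       uint32_t min_subgroup_size,
                                       uint32_t max_subgroup_size) {
    if (vendor_id != VK_VENDOR_ID_AMD) {
        return vk_device_architecture::OTHER;
    }
    // GCN wavefronts are fixed at 64.
    if (min_subgroup_size == 64 && max_subgroup_size == 64) {
        return vk_device_architecture::AMD_GCN;
    }
    // RDNA supports both wave32 and wave64.
    if (min_subgroup_size == 32 && max_subgroup_size == 64) {
        // Telling RDNA1/2/3 apart needs extra signals (e.g. coopmat
        // support on RDNA3); collapsed here for brevity.
        return vk_device_architecture::AMD_RDNA;
    }
    return vk_device_architecture::OTHER;
}
```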
Force-pushed from c22e0b4 to 7176d04.
I'll give this another try on RDNA2 soon; I expect that you can add RDNA2 to the RDNA1 logic. Once that is done, I think we can merge this. Can you resolve the merge conflict?
Sure, I'll do it later once I'm back home.
Force-pushed from 7176d04 to 7695541.
I was testing the change on RDNA3 with stable-diffusion.cpp, and it seems like using subgroup size 32 on this architecture completely breaks image generation, resulting in black images. Do you have any clue what could be causing this? Text generation works properly using wave32, and there's even a small performance uplift in prompt processing (1164 master / 1238 subgroup 32).
I don't know what it is, but to figure out which op is failing I implemented the
For some reason all mul_mat and mul_mat_id fail on RDNA3 when using a subgroup size of 32 (I was in a rush before, so I wasn't able to run test-backend-ops), so I'll avoid making changes on this arch for now until I find a proper tuning for this card.
That's a coopmat thing, I think. If you ignore coopmat shaders it's probably fine. Edit: Maybe disable setting the required subgroup size if
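The commit "vulkan: override subgroup size if required_subgroup_size = 0" suggests a convention where 0 means "don't force a size". The sketch below illustrates one way that could look; the struct and function names are hypothetical, and the coopmat handling reflects the RDNA3 breakage discussed above rather than llama.cpp's exact code.

```cpp
#include <cstdint>

// Illustrative pipeline config (not the actual llama.cpp structure).
struct pipeline_config {
    uint32_t required_subgroup_size = 0;  // 0 = no explicit requirement
    bool     uses_coopmat           = false;
};

// Returns the subgroup size to request via
// VkPipelineShaderStageRequiredSubgroupSizeCreateInfo, or 0 when no
// explicit requirement should be chained and the driver default is kept.
uint32_t subgroup_size_to_request(const pipeline_config & cfg,
                                  uint32_t tuned_size) {
    if (cfg.uses_coopmat) {
        // Forcing wave32 on coopmat shaders broke mul_mat on RDNA3,
        // so leave those at the driver default.
        return 0;
    }
    if (cfg.required_subgroup_size != 0) {
        return cfg.required_subgroup_size;  // explicit per-shader setting
    }
    return tuned_size;  // per-architecture tuned default (may be 0)
}
```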
Thanks for the advice; I still haven't looked at coopmat shaders, so I didn't know how they would interact with these changes. With the new changes the tests pass, and performance on RDNA3 is slightly better in prompt processing (1164 master / 1220 PR). Stable diffusion seems to get a minor improvement. For now I think it's safe to keep the same values as on RDNA1, as the card seems to behave similarly when it comes to subgroup size.
I checked, and you can add RDNA2 to the list. It seems to prefer the same shader sizes. Afterwards we can merge this.
I created a common RDNA map to store the subgroup preferences, since the architectures mostly behave the same; however, I've kept the option to specify different configurations in case better tuning options are found for a specific architecture. I'll run more tests with other subgroup configurations, but if I find something better I'll leave it for another PR.
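A common map with per-architecture overrides could be sketched as below. This is a hypothetical illustration of the design described above (the shader names, map names, and tuned values are placeholders, not the PR's actual tables).

```cpp
#include <cstdint>
#include <map>
#include <string>

// Shared RDNA subgroup-size preferences (illustrative entries only).
static const std::map<std::string, uint32_t> rdna_subgroup_sizes = {
    { "mul_mat_f32",     32 },
    { "mul_mat_vec_f32", 64 },  // RDNA1 regressed at wave32, keep wave64
};

// Per-architecture overrides; an empty map means "use the common table".
static const std::map<std::string, uint32_t> rdna3_overrides = {
    // RDNA3 showed no benefit from wave32, so nothing is forced here.
};

// Look up the preferred size, checking arch-specific overrides first.
// 0 means "no preference, keep the driver default".
uint32_t preferred_subgroup_size(const std::string & shader,
                                 const std::map<std::string, uint32_t> & overrides) {
    auto it = overrides.find(shader);
    if (it != overrides.end()) {
        return it->second;
    }
    it = rdna_subgroup_sizes.find(shader);
    return it != rdna_subgroup_sizes.end() ? it->second : 0;
}
```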
Thank you, looks good now.
I disabled warp 32 for RDNA3, as it seems to behave worse than with warp 64 when using the matrix cores, with the biggest performance hits on MUL_MAT f32 f32 and f16 f32. Every other operation seems mostly unaffected, but since this results in no performance improvement I'll leave it untouched for now.
Matrix cores only get used with coopmat, and that's only for matrix-matrix multiplications. What you're looking at is just a regular shader. But it's interesting that it performs worse with subgroup size 32; I guess I should check RDNA2 and RDNA1 again as well.
I'm retesting the results on RDNA1, and despite some small regressions in MUL_MAT, the resulting performance in something like llama-bench is actually better.
I'm currently running all the performance tests on both RDNA1 and RDNA3 to see if a finer tuning of the subgroup sizes can improve the situation.
Here's the comparison for RDNA1 on master vs PR:
After running multiple tests I can say for certain that RDNA3 doesn't like subgroup size 32, at least on Linux using amdgpu (I tried radv, amdvlk, and vk_pro, and all behave similarly); every single operation seems to get a small performance hit from using it. I'm not sure if this is expected behavior or if there's some issue in the amdgpu driver (I found issues in other compute tasks, so it's a possibility). For now I don't think it's a good idea to change the default subgroup size for this architecture.

Regarding RDNA1, I managed to solve every single performance regression by setting most mul_mat_vec operations to use subgroup 64. All the performance gains from switching to subgroup size 32 are maintained without any regression on any operation (at least the ones in test-backend-ops). I'll push the changes soon. If you can compare the results using warp 64 and warp 32 on RDNA2, I can tune the parameters that still need adjustment, if there's any need for it. Does the performance on RDNA2 improve at all with these changes, or does it behave similarly to how I described RDNA3?
Here's RDNA2 with the diff between everything in 64 and 32.
Here's another RDNA1 comparison; it's not as negative on mul_mat_vec as it was for you:
Thanks for the results. Is GFX1013 only the PlayStation 5 GPU, or is there any other card using that chip? Does performance in llama-bench or stable-diffusion.cpp increase on either of those two cards using subgroup 32, or is there any performance regression in those real-world use cases?
It's an AMD BC-250, based on the PS5 APU, yeah. Actual performance in llama-bench does increase with your RDNA1 tuning:
There might be a small regression in mul_mat_vec when you keep it at 64, but it's hard to tell.
Let's not hold up this PR with efforts to tune perfectly. We have a decent idea for now and can merge it with or without MMV at subgroup 32. Over time we'll get a clearer picture and can adjust it.
I adjusted the code to remove the workaround I had used to avoid duplicating code. If RDNA2 also gets a slight performance improvement from these changes, I think it's ready to be merged.
There seem to be differences between Q4_0 and Q4_K_S on RDNA2, but a minor tg regression at most. It's good enough for now.
Thank you for the tuning work!
* vulkan: subgroup size test
* Vulkan: Add device architecture enum and logic to recognize AMD generations
* vulkan: use new architecture logic to specify subgroup size
* Initial vulkan subgroup size tuning for RDNA3
* vulkan: commonize RDNA subgroup tuning
* vulkan: override subgroup size if required_subgroup_size = 0
* vulkan: disable warp 32 for RDNA3
* vulkan: fine tuned RDNA1 subgroup sizes
* vulkan: adjusted subgroup size map
* vulkan: fixed RDNA2 subgroup map

Co-authored-by: 0cc4m <[email protected]>
This PR is a continuation of the tests done in #11826 on the subgroup size in Vulkan and its effects on performance, especially on RDNA cards. For now it includes a specific configuration that improves RDNA1 performance in almost all operations.