-
Notifications
You must be signed in to change notification settings - Fork 12.2k
vulkan: improve im2col #11826
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
vulkan: improve im2col #11826
Conversation
If cswave32 helps you, maybe look into if setting the |
That's nice to know, thanks. I'll look into it to check where it does make a difference because, as I mentioned, it actually slows down im2col. |
Real world performance (like sd.cpp benchmarks) are more important than GB/s in test-backend-ops perf. The dimensions of the real use are probably different, and there's other factors at play (for example graph execution instead of a single op getting repeated). |
I've been trying to use the VK_EXT_subgroup_size_control extension in GLSL to set the What I'm trying to do is to set the subgroupsize on this specific shader or at least apply the change to all the shaders through ggml-vulkan.cpp without having to rely on the RADV_PERFTEST env but for now I can't find a way to do that. |
Oh sorry, I thought you knew that the extension is already implemented. You can set values when loading pipelines: |
Perf on rtx 4070:
|
Thanks a lot for the information, I've been trying to implement it directly in the shader not knowing it was already implemented in the main ggml-vulkan.cpp code. I did some tests by manually setting the subgroup size for 32 and I can recreate the results I had using the RADV env. Do you think it could be a good idea creating two pipelines (1 with subgroup 64 and 1 with subgroup 32) and use them only on specific GPUs? I don't know if it could be useful on other GPUs. |
Setting some pipelines to subgroup 64 and some to subgroup 32 I can get some good performance gains on stable diffusion xl 1024x1024 20 steps with tiled vae decode: stock 108.73 s, PR 107.59 s, PR + wave 32 105.66 s, PR + mixed pipelines 100.99 s. |
If there is a specific pipeline where it gives a significant advantage on RDNA, you could do that. Otherwise, just globally set the pipeline to 32 or 64 for RDNA, depending on what is better, and leave it alone for other vendors or older AMD. |
With the latest changes RX 5700 uses subgroup 32 when it's detected and forces subgroup 64 on IM2COL (other operations may be faster on subgroup 64 like softmax but since they don't make any difference in real world usage I didn't include them). With this approach + mesa-git the stable diffusion xl 1024x1024 20 steps went down to 98 s from the original 108 s. I'm still not sure if the ggml_vk_create_pipeline_64 used in im2col is the best approach (I really don't like how I had to duplicate that part of the code) but I couldn't figure out a better way to do it. |
9205de6
to
a4ef6dd
Compare
a4ef6dd
to
14ea4fa
Compare
I've pushed the last changes in which I remove the awful code duplication and instead introduce a helper function that checks if there's an override for the subgroup size from a map and sets it accordingly. I'm not sure if this could be useful or not on other GPUs (I wonder if maybe other AMD gpus or Intel ones may get some benefits from overriding certain subgroups). @0cc4m if you think this looks good I think that the PR is ready. |
d4ba722
to
04100e8
Compare
c95dff0
to
d151973
Compare
d20e97a
to
0e5dd68
Compare
I do see significant positive impact on RX 6800 XT, in specific shaders. Especially matrix matrix multiplication seems to like subgroup size 32. But we basically have to tune this for every shader for every RDNA generation or even chip. This might need an autotuner at a later point.. |
ggml/src/ggml-vulkan/ggml-vulkan.cpp
Outdated
@@ -1543,11 +1586,17 @@ static void ggml_vk_load_shaders(vk_device& device) { | |||
device->pipeline_matmul_id_f32 = std::make_shared<vk_matmul_pipeline_struct>(); | |||
} | |||
|
|||
vk::PhysicalDeviceProperties2 props2; | |||
device->physical_device.getProperties2(&props2); | |||
std::string device_name = props2.properties.deviceName.data(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is needed anymore (also, the device name is available in device->name
, and the properties in device->properties
)
Edit: I forgot it's used by the get_subgroup_size function. But just use the device field.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I don't set the device name like this the shader executes at wave64 speed instead of wave32 and I get no performance improvement at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I checked and device->name
contains just something like Vulkan0
, so that's why. But if you need the actual device name, it should be stored in the device struct on init and only accessed here, something like physica_device_name
.
But as mentioned in my other comment, we can probably ignore device names for now.
Applying this to stable-diffusion.cpp is actually very tedious. If you already have done that, could you upload your version to a fork? Then I can compare the sd15 and sdxl times on 6800 XT. |
I've just created two stable-diffusion.cpp branches in my fork with all the required changes. You can just checkout |
Here are my findings:
Declaring those three lines is needed to effectively enabling wave32, at least on RDNA1 (otherwise if I just call device->name directly everything just keeps using wave64).
I have updated the branch wave_test in my stable-diffusion.cpp fork to use the new changes.
|
Thank you, I tested it on RX 6800 XT and found that this PR is also mostly positive for it. The difference for stable-diffusion is small, but noticeable. I think we can apply this to all RDNA, which simplifies the detection by quite a bit. Simply look at min and max subgroup sizes. If min is 32 and max is 64, it's RDNA. We've used this in the past as well. Then we don't need to look at device names. |
Yesterday I've received my new GPU, a RX 7800 XT. Once I get home I'll try enabling the changes for all RDNA and let you know how's the performance on it. |
I did some tests on the 7800xt and im2col and softmax still are better in wave64 compared to wave32, however there's no speedup in wave32, I only get regressions in some operations switching to it and some big regressions in stable-diffusion.cpp. The im2col changes alone still slightly improve the generation speed. RX 7800XT results:
|
Hmm, in that case maybe it's better to stick with im2col in this PR and move the RDNA subgroup size optimization to another one? It's a really interesting find that forcing subgroup 32 helps in a lot of cases, but selection needs more work. |
Sounds good to me, this pr just started like that after all. I'll open a draft pr with the subgroup changes so that more testing can be easily done. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's an overall improvement on all my GPUs, at least in perf. LGTM
* vulkan: improve im2col performance
* vulkan: improve im2col performance
* vulkan: improve im2col performance
This PR supersedes #11778.
Here's the performance numbers on my Radeon RX 5700XT (RADV).
Vulkan
:HIP
:I'm also including the benchmark results when using
RADV_PERFTEST=cswave32
. It's interesting to note that despite this variable hurting this specific operation's performance it actually improves the speed in stable-diffusion.cpp (sd 1.5 512x512: without PR 1.38 it/s, with PR 1.45 it/s, with PR + cswave32 1.55 it/s).Vulkan cswave32
: