vulkan: matmul gcn tuning #13016
Conversation
I'm setting this back to draft while I adjust it a little bit more...
Okay, I think it's ready. The 16x16 tiles I'm using now perform like the 64x16 ones at a fixed clock speed, but once I turn frequency scaling back on the chip manages to clock higher than before and I get a 4% improvement in pp512 speed. Maybe the smaller tile sizes make it run more efficiently? Since all the threads in the workgroup do the same calculations and share the same memory, we don't necessarily have to make the shader's warp size match the subgroup size.
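To make that last point concrete, here is a minimal host-side sketch of what such a tile configuration amounts to. The struct and field names are hypothetical and not the actual ggml-vulkan parameter layout, and the block sizes other than the 16x16/256 values mentioned above are placeholders.

```cpp
#include <cstdint>

// Hypothetical sketch, not the real ggml-vulkan warptile layout: one workgroup
// computes a block of the output matrix from a shared-memory tile, and each
// logical "warp" inside the workgroup owns a smaller sub-tile of it.
struct matmul_tile_config {
    uint32_t block_m, block_n;  // output block computed per workgroup (placeholder values below)
    uint32_t block_k;           // K-slice staged in shared memory per iteration
    uint32_t warp_m, warp_n;    // sub-tile owned by one logical warp (the 16x16 above)
    uint32_t workgroup_size;    // total threads per workgroup (the 256 above)
};

// The logical warp size implied by warp_m/warp_n does not have to equal the
// hardware subgroup (wavefront) size of 64 on GCN, because every thread in the
// workgroup executes the same code and reads the same shared-memory tile.
constexpr matmul_tile_config gcn_medium_sketch = { 64, 64, 16, 16, 16, 256 };
```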
Oh... this does not seem to be suitable for my Ryzen integrated graphics (Ryzen 5700U with Radeon Graphics, gfx90c), at least. The most obvious difference can be seen in the test results, before vs. after (this PR).
Maybe, like Java and other VMs, we should store performance profile data for each device and apply it automatically?
@masamaru-san Please also do before and after tests with a model, because the unit tests are not reliable for judging whether it is actually detrimental.
My first thought was that your integrated graphics only had two cores, as that wouldn't handle 256 threads well, but that's not the case since you have eight cores. It's also not a Vega or FP16 issue, as @0cc4m's card is doing fine. Since your 5700U is a 15W chip, this might actually be a power issue though. For example, when prompt processing Llama 2 7B Q4_0 on master my 470 runs at 1.15 GHz and gets 171 t/s. With my PR it only runs at 1 GHz but gets 189 t/s, and both times I'm hitting the 130A TDC limit on my card. Please run with a real model and compare the GPU clock speeds and power levels with master. On Linux you can use radeontop and sensors for this; for Windows I have no idea 🤷♀️.
Obviously that's the best option, but it's a lot of work. Right now everyone can just submit tunes as PRs since it's not that hard to do.
I rechecked the degree of performance change on the AMD Ryzen 7 5700U with Radeon Graphics (Lucienne/gfx90c). My conclusion is that perhaps this is also due to AMD's Vulkan driver for Windows?

Environment
Build toolset
CASE: stable-diffusion.cpp
CASE: ggml/test-backend-ops
Again, as mentioned, can you run llama-bench with a real model? After doing that, what wattage and frequencies are you seeing when running prompt processing? Is it the same for both this PR and master? I'm hoping you know what you're doing here and are not just casually cranking up the wattage and current limits, as that can fry your chip.
Sorry to bother you. I ran llama-bench ⬇️; the differences are 6 to 8 t/s on the pp512 test. I think something is jammed because there are only two graphics cores, too. I will deal with this in my local fork.

llama-bench test result:

```powershell
& {
    (1..3) | foreach {
        "Repeating: $_"

        "`nMaster version`nWaiting 30 seconds..."; Start-Sleep -Seconds 30
        cd ..\bin.Master\
        .\llama-bench.exe -m E:\AI\models\llama-2-7b.Q4_0.gguf

        "`nPR version`nWaiting 30 seconds..."; Start-Sleep -Seconds 30
        cd ..\bin\
        .\llama-bench.exe -m E:\AI\models\llama-2-7b.Q4_0.gguf

        "`n----`n"
    }
}
```

```
Repeating: 1
Master version
build: 2016f07 (5162)
PR version
build: 2016f07 (5162)

Repeating: 2
Master version
build: 2016f07 (5162)
PR version
build: 2016f07 (5162)

Repeating: 3
Master version
build: 2016f07 (5162)
PR version
build: 2016f07 (5162)
```
I was monitoring watts, clocks, load, etc. with GPU-Z and RyzenAdj, and it didn't look to me like there was any difference between master and the PR. I've attached the logs for part of the second run. Not clock-up; it can only operate at 25W (the default) at all times if it is given periodic cooling time before the power limit threshold of 15W is triggered.
I don't understand what you mean here as from your
Thanks, that's pretty helpful. I skimmed through the chart and I'm seeing a bit of power limiting during prompt processing, with the chip hitting an average of 1700 MHz or so. It then jumps up to the full 1900 MHz for inference, and in both cases it's running slightly below the 25W limit. This is perfectly normal since the prompt processing stage is compute bound while inference is memory bound, and I see this on my own GPUs. Considering how the prompt processing clocks are similar between master and my PR, my guess is that your Windows driver is behaving differently than the Linux RADV driver that's used by @0cc4m and me.
Speaking of limiting core count, I retested this PR on my 470 with only 8 CUs enabled (2 per shader engine) and still got a 20% improvement in prompt processing speed. Yeah, this is looking more and more like a driver thing.
I also noticed that your driver is reporting that you have 32k of shared memory on your Vega graphics, which makes no sense. Anyway, I've straight up disabled these changes for the AMD proprietary driver, so we should be good to go.
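For anyone curious how a driver check like that can be done, here is a generic Vulkan sketch (not the actual ggml-vulkan code) using the driver IDs exposed by VK_KHR_driver_properties / Vulkan 1.2:

```cpp
#include <vulkan/vulkan.h>

// Generic sketch, not the actual ggml-vulkan code: ask Vulkan which driver is
// behind a physical device so a tune can be skipped on AMD's proprietary
// driver while staying enabled on RADV.
static bool is_amd_proprietary_driver(VkPhysicalDevice device) {
    VkPhysicalDeviceDriverProperties driver_props = {};
    driver_props.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_DRIVER_PROPERTIES;

    VkPhysicalDeviceProperties2 props2 = {};
    props2.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_PROPERTIES_2;
    props2.pNext = &driver_props;

    // Requires Vulkan 1.2 or the VK_KHR_driver_properties extension.
    vkGetPhysicalDeviceProperties2(device, &props2);

    return driver_props.driverID == VK_DRIVER_ID_AMD_PROPRIETARY;
}
```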
* tune matmul for gcn

* this one is more power efficient

* Update ggml/src/ggml-vulkan/ggml-vulkan.cpp

Co-authored-by: 0cc4m <[email protected]>

* disable this tune for the proprietary driver

---------

Co-authored-by: 0cc4m <[email protected]>
I tried to do some manual tuning on the mmq warptile settings and I'm seeing good improvements on my end. Right now these changes are only applied on AMD GCN but they might help other chips as well.
Results on my RX470, locked to 900 MHz so the power limit doesn't mess with my numbers:
PR
Master
I also tried tuning the small and large warptiles by setting them as the default, but I wasn't able to get them to beat the 256-thread medium shader. The FP16 and FP32 shaders already perform best on GCN using the existing parameters.
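Purely as an illustration of how the gating discussed in this thread fits together (the names are made up, not the actual ggml-vulkan API): the tuned parameters only kick in for a GCN device when the driver is not AMD's proprietary one.

```cpp
// Illustrative only, not the actual ggml-vulkan code. "tuned_gcn" stands in
// for the retuned 256-thread medium warptile from this PR; "defaults" stands
// in for the existing parameters that every other device keeps using.
enum class mmq_tile_set { defaults, tuned_gcn };

static mmq_tile_set select_mmq_tile_set(bool device_is_gcn, bool amd_proprietary_driver) {
    // The tune helped GCN under the Linux RADV driver but looked detrimental on
    // the Windows proprietary driver, so it is skipped there.
    if (device_is_gcn && !amd_proprietary_driver) {
        return mmq_tile_set::tuned_gcn;
    }
    return mmq_tile_set::defaults;
}
```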