
vulkan: enable coopmat2 FA gqa and split_k optimizations more often #12931


Merged: 1 commit into ggml-org:master on Apr 16, 2025

Conversation

jeffbolznv (Collaborator)

The grouped query attention optimization doesn't require a power-of-two ratio; the only thing relying on it was the modulo operation written as a bitwise &.
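
To illustrate (this is a minimal C++ sketch with made-up names, not the actual shader code): the power-of-two requirement came purely from the classic bit trick that `x & (n - 1)` equals `x % n` only when `n` is a power of two, so swapping the & for a true modulo lifts the restriction.

```cpp
#include <cassert>
#include <cstdint>

int main() {
    for (uint32_t gqa_ratio = 1; gqa_ratio <= 12; ++gqa_ratio) {
        const bool pow2 = (gqa_ratio & (gqa_ratio - 1)) == 0;
        for (uint32_t head = 0; head < 64; ++head) {
            // old form: only valid when gqa_ratio is a power of two
            const uint32_t with_and = head & (gqa_ratio - 1);
            // new form: valid for any gqa_ratio
            const uint32_t with_mod = head % gqa_ratio;
            if (pow2) {
                assert(with_and == with_mod); // the bit trick agrees only here
            }
        }
    }
    return 0;
}
```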

split_k need not depend on gqa_ratio: enable it any time there's only one workgroup in the X dimension. The shader gets the split index from the x coordinate, and multiple workgroups in the X dimension (pre-split) indicate a larger FA operation that wouldn't need splitting.
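
A rough host-side sketch of that heuristic, using hypothetical names (`choose_split_k`, `workgroups_x`, `shader_core_count`) rather than the actual variables in ggml-vulkan.cpp:

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical sketch of the dispatch heuristic described above.
// split_k divides one FA workload across k workgroups along X, each
// producing a partial result that a later reduction pass combines.
uint32_t choose_split_k(uint32_t workgroups_x, uint32_t shader_core_count) {
    if (workgroups_x > 1) {
        // Multiple pre-split workgroups in X: the op is already large
        // enough to fill the GPU, so splitting would only add the
        // overhead of the reduction pass.
        return 1;
    }
    // A single workgroup in X: split so the work spreads across the
    // available shader cores. (The exact policy here is illustrative;
    // the real backend is free to clamp or round differently.)
    return std::max<uint32_t>(1, shader_core_count);
}
```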

Perf results:

before:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -m C:\models\second-state\StarCoder2-7B-GGUF\starcoder2-7b-Q4_0.gguf -m C:\models\Qwen2-VL-7B-Instruct-IQ4_NL.gguf -m C:\models\Phi-3-mini-4k-instruct-q4.gguf -fa 1 -p 0 -n 8192 --repetitions 1
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 4070 (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ------------: | -------------------: |
| starcoder2 7B Q4_0             |   3.76 GiB |     7.17 B | Vulkan     |  99 |  1 |        tg8192 |         55.98 ± 0.00 |
| qwen2vl 7B IQ4_NL - 4.5 bpw    |   4.13 GiB |     7.62 B | Vulkan     |  99 |  1 |        tg8192 |         57.88 ± 0.00 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  1 |        tg8192 |         68.98 ± 0.00 |

after:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -m C:\models\second-state\StarCoder2-7B-GGUF\starcoder2-7b-Q4_0.gguf -m C:\models\Qwen2-VL-7B-Instruct-IQ4_NL.gguf -m C:\models\Phi-3-mini-4k-instruct-q4.gguf -fa 1 -p 0 -n 8192 --repetitions 1
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 4070 (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ------------: | -------------------: |
| starcoder2 7B Q4_0             |   3.76 GiB |     7.17 B | Vulkan     |  99 |  1 |        tg8192 |         74.72 ± 0.00 |
| qwen2vl 7B IQ4_NL - 4.5 bpw    |   4.13 GiB |     7.62 B | Vulkan     |  99 |  1 |        tg8192 |         74.87 ± 0.00 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  1 |        tg8192 |         77.90 ± 0.00 |

(This qwen model seems to be broken at ToT (tip of tree), even with the CUDA backend, but the speedup is probably realistic.)

jeffbolznv requested a review from 0cc4m on April 13, 2025.
github-actions bot added the labels testing (Everything test related), Vulkan (Issues specific to the Vulkan backend), and ggml (changes relating to the ggml tensor library for machine learning) on Apr 13, 2025.
0cc4m merged commit 015022b into ggml-org:master on Apr 16, 2025. 51 checks passed.
colout pushed a commit to colout/llama.cpp that referenced this pull request on Apr 21, 2025:
vulkan: enable coopmat2 FA gqa and split_k optimizations more often (ggml-org#12931)
