
Optimize Vulkan backend for better CPU performance and less GPU synchronization overhead. #8943


Merged
2 commits merged into ggml-org:master on Aug 11, 2024

Conversation

@mtavenrath (Contributor) commented Aug 9, 2024

Performance of the Vulkan backend increases by 2x on GeForce 4090-class GPUs for stable-code-instruct-3b-Q8_0.

  • Allocation overhead for the temporary std::vectors was easily detectable with a sampling profiler and simple to remove.

  • ggml_vk_sync_buffer used a full pipeline sync, which can have a significant cost on the GPU side, sometimes larger than the actual kernel execution. Adding only barriers for shader reads/writes and transfers appears sufficient, judging from the code, which either launches compute kernels or copies tensors (see the sketch after the checklist below).

  • [x] I have read the contributing guidelines

  • Self-reported review complexity:

    • [x] Low
    • [ ] Medium
    • [ ] High
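
For illustration, here is a minimal Vulkan-Hpp sketch of the barrier change from the second bullet. This is a sketch of the idea only, not the actual ggml-vulkan code; the function names are made up:

```cpp
#include <vulkan/vulkan.hpp>

// Before (conceptually): a full pipeline barrier. Every prior command must
// complete and its writes be made visible before any later command may start.
void full_sync(vk::CommandBuffer cmd) {
    vk::MemoryBarrier barrier{
        vk::AccessFlagBits::eMemoryRead | vk::AccessFlagBits::eMemoryWrite,
        vk::AccessFlagBits::eMemoryRead | vk::AccessFlagBits::eMemoryWrite};
    cmd.pipelineBarrier(vk::PipelineStageFlagBits::eAllCommands,
                        vk::PipelineStageFlagBits::eAllCommands,
                        {}, barrier, nullptr, nullptr);
}

// After (conceptually): only order compute-shader and transfer accesses,
// which covers everything the backend records between syncs.
void narrow_sync(vk::CommandBuffer cmd) {
    const vk::AccessFlags access =
        vk::AccessFlagBits::eShaderRead | vk::AccessFlagBits::eShaderWrite |
        vk::AccessFlagBits::eTransferRead | vk::AccessFlagBits::eTransferWrite;
    vk::MemoryBarrier barrier{access, access};
    cmd.pipelineBarrier(vk::PipelineStageFlagBits::eComputeShader |
                            vk::PipelineStageFlagBits::eTransfer,
                        vk::PipelineStageFlagBits::eComputeShader |
                            vk::PipelineStageFlagBits::eTransfer,
                        {}, barrier, nullptr, nullptr);
}
```

The std::vector fix from the first bullet is the usual hot-path pattern: hoist the temporary container out of the per-call scope and reuse it, since clear() keeps the allocated capacity and repeated calls then stop hitting the allocator.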

Optimize Vulkan backend for better CPU performance and less GPU synchronization overhead.

- Allocation overhead for the temporary std::vectors was easily detectable with a sampling profiler and simple to remove.
- ggml_vk_sync_buffer introduced a full pipeline sync, which has a significant cost on the GPU side, sometimes larger than the actual kernel execution. Adding only barriers for shader reads/writes and transfers appears sufficient, judging from the code, which either launches compute kernels or copies tensors.
@github-actions bot added the Vulkan (Issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) labels on Aug 9, 2024
@mofosyne added the performance (Speed related topics) and Review Complexity : Low (Trivial changes to code that most beginner devs, or those who want a break, can tackle, e.g. a UI fix) labels on Aug 9, 2024
@ExtReMLapin (Contributor)

> 2x on GeForce 4090-class GPUs for stable-code-instruct-3b-Q8_0.

Could you please give pp/s and tg/s numbers?

@mtavenrath (Contributor, Author)

I have to correct myself: Intel Xe isn't getting faster with this PR. I had a bogus measurement on Intel Xe before, probably because something was running in the background.

master:

```
Intel(R) Iris(R) Xe Graphics (i7-13800h)
llama_print_timings: prompt eval time =     539.70 ms /     9 tokens (   59.97 ms per token,    16.68 tokens per second)
llama_print_timings:        eval time =   18839.67 ms /   215 runs   (   87.63 ms per token,    11.41 tokens per second)

NVIDIA RTX 6000 Ada Generation
llama_print_timings: prompt eval time =      78.59 ms /     9 tokens (    8.73 ms per token,   114.51 tokens per second)
llama_print_timings:        eval time =    4262.02 ms /   280 runs   (   15.22 ms per token,    65.70 tokens per second)
```

PR:

```
Intel(R) Iris(R) Xe Graphics (i7-13800h)
llama_print_timings: prompt eval time =     555.27 ms /     9 tokens (   61.70 ms per token,    16.21 tokens per second)
llama_print_timings:        eval time =    7616.56 ms /    87 runs   (   87.55 ms per token,    11.42 tokens per second)

NVIDIA RTX 6000 Ada Generation
llama_print_timings: prompt eval time =      70.06 ms /     9 tokens (    7.78 ms per token,   128.45 tokens per second)
llama_print_timings:        eval time =    4514.31 ms /   576 runs   (    7.84 ms per token,   127.59 tokens per second)
```

@slaren (Member) commented Aug 9, 2024

| GPU | Model | Test | t/s master | t/s vulkan_perf | Speedup |
| --- | --- | --- | --- | --- | --- |
| RTX 3090 Ti | llama 7B Q4_0 | pp512 | 976.00 | 997.27 | 1.02 |
| RTX 3090 Ti | llama 7B Q4_0 | tg128 | 48.89 | 72.25 | 1.48 |

@0cc4m self-assigned this Aug 10, 2024

@0cc4m (Collaborator) commented Aug 10, 2024

Very nice work; this makes a big difference in CPU-limited scenarios. I'm reducing some of the GPU bottlenecks in #8959, so this should combine well with that.

MaggotHATE added a commit to MaggotHATE/Llama_chat that referenced this pull request Aug 10, 2024
… and more

* fixed default sampling queue to include p_step
* changed sampling queue display to better reflect the actual logic
* added VK-specific settings `use_mmap_vk`, `flash_attn_vk`, `no_kv_offload_vk`
* added new presets for testing
@0cc4m merged commit 7c5bfd5 into ggml-org:master Aug 11, 2024
52 checks passed
@0cc4m mentioned this pull request Aug 11, 2024
@mtavenrath (Contributor, Author)

The potentially best next optimization would be to overlap copies and command-buffer creation with actual work on the GPU.

Below is a profile of stable-code-instruct-3b-Q8_0 on my system. It takes ~2.5 ms of CPU time to create 5.5 ms of GPU work. ggml_build_graph alone has an average execution time of 1.17 ms in my example. If we could split command-buffer generation and execute the partial command buffers early, before the whole graph has been recorded, this time could potentially be eliminated entirely, which at the ~7.8 ms per token measured above works out to roughly 15% more performance. On systems with slower CPUs (e.g. due to power saving or thermal constraints) the benefit can be even bigger.

[profiler screenshot]

@0cc4m (Collaborator) commented Aug 12, 2024

Yes, that is true. I've been looking into options for that.

We can submit command buffers early so that the GPU can start working while the rest of the graph is still being recorded. If that gain outweighs the overhead of the additional queue submissions, it should reduce the CPU overhead.

We can also look into multithreading the command-recording process, maybe in combination with the first option. That just needs careful synchronization to make sure it doesn't add more overhead than it removes (see the sketch below).
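
On the multithreading option: the main Vulkan constraint is that vk::CommandPool is externally synchronized, so each recording thread needs its own pool. A minimal sketch under that assumption (GraphChunk and record_chunk are hypothetical placeholders, not ggml code):

```cpp
#include <thread>
#include <vector>
#include <vulkan/vulkan.hpp>

struct GraphChunk { /* a slice of the compute graph (hypothetical) */ };
void record_chunk(vk::CommandBuffer cmd, const GraphChunk &chunk);  // hypothetical

// Record one command buffer per chunk in parallel. Each thread uses its own
// command pool, because a pool must not be accessed from two threads at once.
std::vector<vk::CommandBuffer> record_parallel(vk::Device device,
                                               uint32_t queue_family,
                                               const std::vector<GraphChunk> &chunks) {
    std::vector<vk::CommandBuffer> cmds(chunks.size());
    std::vector<std::thread> workers;
    for (size_t i = 0; i < chunks.size(); i++) {
        workers.emplace_back([&, i] {
            // one pool per thread; a real implementation would cache and reset these
            vk::CommandPool pool = device.createCommandPool({{}, queue_family});
            cmds[i] = device.allocateCommandBuffers(
                {pool, vk::CommandBufferLevel::ePrimary, 1})[0];
            cmds[i].begin({vk::CommandBufferUsageFlagBits::eOneTimeSubmit});
            record_chunk(cmds[i], chunks[i]);
            cmds[i].end();
        });
    }
    for (auto &w : workers) w.join();
    return cmds;  // submit in graph order on the main thread
}
```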

@mtavenrath (Contributor, Author)

When submitting early there is no need for multithreading as long as a single CPU core can keep up. Let's keep it simple at first, before adding complexity that makes it hard to profile and debug.

ggml_vk_build_graph has a boolean last_node which can be used to stop command-buffer recording and start a new one on the next call; ggml_vk_compute_forward handles this case quite well. I haven't yet figured out how to efficiently trigger the execution of a command buffer up front. Roughly the idea sketched below.
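
A rough fragment of the chunked-recording idea (argument lists simplified; the real ggml_vk_build_graph takes more parameters, and submit_closed_cmdbuffer is a hypothetical helper):

```cpp
// Record the graph in chunks. Passing last_node = true closes the current
// command buffer, so the next call starts a fresh one; the closed buffer
// could then be handed to the GPU immediately instead of at the very end.
const int CHUNK = 64;  // illustrative split point, to be tuned
for (int i = 0; i < cgraph->n_nodes; i++) {
    const bool close_here = ((i + 1) % CHUNK == 0) || (i + 1 == cgraph->n_nodes);
    ggml_vk_build_graph(ctx, cgraph->nodes[i], /*last_node=*/close_here);
    if (close_here) {
        // hypothetical: submit now, overlapping GPU work with further recording
        submit_closed_cmdbuffer(ctx);
    }
}
```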

@0cc4m (Collaborator) commented Aug 12, 2024

I would modify build_graph to submit the buffer to the queue once it gets closed, with a timeline semaphore to synchronize it with the next submission. On the last submission, add a fence and wait on that. The semaphore and fence have to go into some data structure so they are remembered across a run (roughly as sketched below).
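
A minimal sketch of that submission chain, assuming a vk::Device device, a vk::Queue queue, and already-recorded command buffers. Illustrative only, not the ggml-vulkan code; timeline semaphores need Vulkan 1.2 or VK_KHR_timeline_semaphore:

```cpp
#include <cstdint>
#include <vector>
#include <vulkan/vulkan.hpp>

// Submit each command buffer as soon as it is closed. Submission i waits for
// timeline value i and signals i + 1; only the last submission gets a fence.
void submit_chained(vk::Device device, vk::Queue queue,
                    const std::vector<vk::CommandBuffer> &cmds) {
    vk::SemaphoreTypeCreateInfo type_info{vk::SemaphoreType::eTimeline, 0};
    vk::SemaphoreCreateInfo sem_info{};
    sem_info.pNext = &type_info;
    vk::Semaphore timeline = device.createSemaphore(sem_info);
    vk::Fence fence = device.createFence({});

    for (size_t i = 0; i < cmds.size(); i++) {
        uint64_t wait_value   = i;      // value 0 is satisfied immediately
        uint64_t signal_value = i + 1;
        vk::TimelineSemaphoreSubmitInfo ts{1, &wait_value, 1, &signal_value};
        vk::PipelineStageFlags wait_stage = vk::PipelineStageFlagBits::eAllCommands;

        vk::SubmitInfo si{1, &timeline, &wait_stage, 1, &cmds[i], 1, &timeline};
        si.pNext = &ts;

        const bool last = (i + 1 == cmds.size());
        queue.submit(si, last ? fence : vk::Fence{});
    }
    // Block once, on the fence of the final submission.
    (void)device.waitForFences(fence, VK_TRUE, UINT64_MAX);
    device.destroyFence(fence);
    device.destroySemaphore(timeline);
}
```

In a real integration each loop iteration would run as its buffer closes during recording, with the semaphore, fence, and current timeline value kept in a persistent structure across the run, as described above.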

Do you want to look into it? If not, I'll try it when I find some time.

@mtavenrath (Contributor, Author)

@0cc4m I've pushed a draft for the overlapped cmdbuffer creation & execution in PR #9118. I didn't touch the semaphore handling... though it works fine on my system. I cannot tell when I'll have time to take care of the semaphores as well.

arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 15, 2024
Optimize Vulkan backend for better CPU performance and less GPU synchronization overhead. (ggml-org#8943)

* Optimize Vulkan backend for better CPU performance and less GPU synchronization overhead.

- Allocation overhead for the temporary std::vectors was easily detectable with a sampling profiler and simple to remove.
- ggml_vk_sync_buffer introduced a full pipeline sync, which has a significant cost on the GPU side, sometimes larger than the actual kernel execution. Adding only barriers for shader reads/writes and transfers appears sufficient, judging from the code, which either launches compute kernels or copies tensors.

* Fix small typo

---------

Co-authored-by: 0cc4m <[email protected]>
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 18, 2024
Optimize Vulkan backend for better CPU performance and less GPU synchronization overhead. (ggml-org#8943)
Labels
ggml (changes relating to the ggml tensor library for machine learning) · performance (Speed related topics) · Review Complexity : Low (Trivial changes to code that most beginner devs, or those who want a break, can tackle, e.g. a UI fix) · Vulkan (Issues specific to the Vulkan backend)

5 participants