Optimize Vulkan backend for better CPU performance and less GPU synchronization overhead. #8943
Conversation
…ronization overhead.

- Allocation overhead for the temporary std::vectors was easily detectable with a sampling profiler and simple to remove.
- ggml_vk_sync_buffer introduced a full pipeline sync, which has a significant cost on the GPU side, sometimes larger than the actual kernel execution. Adding only barriers for shader reads/writes and transfers seems to be sufficient, judging from the code, which either launches compute kernels or copies tensors.
Could you please give pp/s and tg/s?
I have to correct myself, since Intel Xe isn't getting faster with this PR. I had a bogus measurement on Intel Xe before, probably because something was running in the background. master:
PR:
Very nice work, this makes a big difference in CPU-limited scenarios. I'm reducing some of the GPU bottlenecks in #8959, so this should combine well with that.
… and more

* fixed default sampling queue to include p_step
* changed sampling queue display to better reflect the actual logic
* added VK-specific settings `use_mmap_vk`, `flash_attn_vk`, `no_kv_offload_vk`
* added new presets for testing
The potentially best next optimization would be to overlap copies and command-buffer creation with actual work on the GPU. Below is a profile of stable-code-instruct-3b-Q8_0 on my system. It takes ~2.5 ms of CPU time to create 5.5 ms of GPU work; ggml_build_graph alone has an average execution time of 1.17 ms in my example. If we could split command-buffer generation and execute the split command buffers early, before the full graph is created, this time could potentially be eliminated completely, resulting in ~15% more performance. On systems with slower CPUs (e.g. to save energy, thermal constraints, etc.) the benefit can be even bigger.
Yes, that is true. I've been looking into options for that. We can submit command buffers early so that the GPU can start to work while the rest of the graph is still being recorded. If that overcomes the overhead from more queue submissions, it could reduce the CPU overhead. We can also look into multithreading the command recording process, maybe in combination with the first option. That just needs careful synchronization to make sure it doesn't add more overhead than it resolves.
When submitting early there is no need for multithreading, as long as a single CPU core can keep up. Let's keep it simple first before adding complexity that makes it hard to profile and debug.
I would modify build_graph to submit the command buffer to the queue once it gets closed, with a timeline semaphore to synchronize it with the next submission. On the last submission, add a fence and wait for that. The semaphore and fence have to go into some data structure to remember them across a run. Do you want to look into it? If not, I'll try it when I find some time.
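A minimal sketch of how such a chain of early submissions could look, using the plain Vulkan C API and timeline semaphores (core since Vulkan 1.2). The struct and function names (`early_submit_state`, `submit_partial_cmdbuf`) are hypothetical and not part of ggml's Vulkan backend; the actual implementation may differ.

```cpp
// Sketch: chain early command-buffer submissions with a timeline semaphore.
// Hypothetical helpers, not ggml code; assumes Vulkan 1.2 (timeline semaphores are core).
#include <vulkan/vulkan.h>
#include <cstdint>

struct early_submit_state {
    VkSemaphore timeline = VK_NULL_HANDLE; // timeline semaphore shared across submissions
    uint64_t    value    = 0;              // last signaled value
};

static VkSemaphore create_timeline_semaphore(VkDevice device) {
    VkSemaphoreTypeCreateInfo type_info = { VK_STRUCTURE_TYPE_SEMAPHORE_TYPE_CREATE_INFO };
    type_info.semaphoreType = VK_SEMAPHORE_TYPE_TIMELINE;
    type_info.initialValue  = 0;

    VkSemaphoreCreateInfo info = { VK_STRUCTURE_TYPE_SEMAPHORE_CREATE_INFO };
    info.pNext = &type_info;

    VkSemaphore sem = VK_NULL_HANDLE;
    vkCreateSemaphore(device, &info, nullptr, &sem);
    return sem;
}

// Submit a closed command buffer immediately; the GPU waits for the previous
// chunk (timeline value N) and signals N+1 when this chunk finishes.
static void submit_partial_cmdbuf(VkQueue queue, VkCommandBuffer cmd,
                                  early_submit_state & st,
                                  VkFence fence /* VK_NULL_HANDLE except for the last submit */) {
    const uint64_t wait_value   = st.value;
    const uint64_t signal_value = ++st.value;

    VkTimelineSemaphoreSubmitInfo timeline_info = { VK_STRUCTURE_TYPE_TIMELINE_SEMAPHORE_SUBMIT_INFO };
    timeline_info.waitSemaphoreValueCount   = 1;
    timeline_info.pWaitSemaphoreValues      = &wait_value;
    timeline_info.signalSemaphoreValueCount = 1;
    timeline_info.pSignalSemaphoreValues    = &signal_value;

    const VkPipelineStageFlags wait_stage = VK_PIPELINE_STAGE_ALL_COMMANDS_BIT;

    VkSubmitInfo submit = { VK_STRUCTURE_TYPE_SUBMIT_INFO };
    submit.pNext                = &timeline_info;
    submit.waitSemaphoreCount   = 1;
    submit.pWaitSemaphores      = &st.timeline;  // wait for the previous chunk
    submit.pWaitDstStageMask    = &wait_stage;
    submit.commandBufferCount   = 1;
    submit.pCommandBuffers      = &cmd;
    submit.signalSemaphoreCount = 1;
    submit.pSignalSemaphores    = &st.timeline;  // signal completion of this chunk

    vkQueueSubmit(queue, 1, &submit, fence);
}
```

On the last submission a regular `VkFence` would be passed and then waited on with `vkWaitForFences`, matching the suggestion above; the semaphore and its current value live in the persistent state struct across the run.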
…ronization overhead. (ggml-org#8943)

* Optimize Vulkan backend for better CPU performance and less GPU synchronization overhead.

- Allocation overhead for the temporary std::vectors was easily detectable with a sampling profiler and simple to remove.
- ggml_vk_sync_buffer introduced a full pipeline sync, which has a significant cost on the GPU side, sometimes larger than the actual kernel execution. Adding only barriers for shader reads/writes and transfers seems to be sufficient, judging from the code, which either launches compute kernels or copies tensors.

* Fix small typo

---------

Co-authored-by: 0cc4m <[email protected]>
Performance of the Vulkan backend increases by 2x on GeForce RTX 4090 class GPUs for stable-code-instruct-3b-Q8_0.
Allocation overhead for the temporary std::vectors was easily detectable with a sampling profiler and simple to remove.
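As an illustration of the kind of change this refers to, here is a minimal sketch of reusing a persistent scratch vector instead of allocating a temporary `std::vector` on every call. The struct and function names are hypothetical and not the actual ggml-vulkan code.

```cpp
// Illustration only: reuse a persistent scratch vector instead of allocating a
// temporary std::vector on every call (names are hypothetical, not ggml code).
#include <vector>
#include <vulkan/vulkan.h>

struct vk_submit_ctx {
    // Kept alive across calls, so the heap allocation happens once and the
    // capacity is reused afterwards.
    std::vector<VkDescriptorBufferInfo> scratch_buffer_infos;
};

static void record_dispatch(vk_submit_ctx & ctx, size_t num_buffers) {
    // Before: std::vector<VkDescriptorBufferInfo> infos(num_buffers);  // allocates every call
    ctx.scratch_buffer_infos.clear();              // keeps capacity, no free/alloc
    ctx.scratch_buffer_infos.resize(num_buffers);  // only grows when needed
    // ... fill descriptor infos and record the dispatch ...
}
```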
ggml_vk_sync_buffer used a full pipeline sync, which can have a significant cost on the GPU side, sometimes larger than the actual kernel execution. Adding only barriers for shader reads/writes and transfers seems to be sufficient, judging from the code, which either launches compute kernels or copies tensors.
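For illustration, a rough sketch in the plain Vulkan C API of the kind of barrier described above: a single global `VkMemoryBarrier` restricted to shader and transfer access, instead of a heavyweight all-commands/all-access synchronization. The exact stage and access masks used in the PR may differ.

```cpp
// Sketch: a global memory barrier limited to shader and transfer access,
// instead of a full pipeline synchronization. Illustrative only; the actual
// masks in the PR may differ.
#include <vulkan/vulkan.h>

static void vk_sync_shader_and_transfer(VkCommandBuffer cmd) {
    VkMemoryBarrier barrier = { VK_STRUCTURE_TYPE_MEMORY_BARRIER };
    barrier.srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT | VK_ACCESS_TRANSFER_WRITE_BIT;
    barrier.dstAccessMask = VK_ACCESS_SHADER_READ_BIT   | VK_ACCESS_SHADER_WRITE_BIT |
                            VK_ACCESS_TRANSFER_READ_BIT | VK_ACCESS_TRANSFER_WRITE_BIT;

    vkCmdPipelineBarrier(cmd,
        VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT | VK_PIPELINE_STAGE_TRANSFER_BIT,  // producers
        VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT | VK_PIPELINE_STAGE_TRANSFER_BIT,  // consumers
        0,                 // no dependency flags
        1, &barrier,       // one global memory barrier
        0, nullptr,        // no buffer barriers
        0, nullptr);       // no image barriers
}
```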
I have read the contributing guidelines
Self-reported review complexity: