
Optimize Vulkan backend for better CPU performance and less GPU synchronization overhead. #8943


Merged
2 commits merged into ggml-org:master on Aug 11, 2024

Conversation

@mtavenrath (Contributor) commented Aug 9, 2024

Performance of the Vulkan backend increases by 2x on GeForce 4090-class GPUs for stable-code-instruct-3b-Q8_0.

  • Allocation overhead for the temporary std::vectors was easily detectable with a sampling profiler and simple to remove.

  • ggml_vk_sync_buffer used a full pipeline sync, which can have a significant cost on the GPU side, sometimes larger than the actual kernel execution. Adding only barriers for shader reads/writes and transfers appears sufficient, judging from the code, which either launches compute kernels or copies tensors (see the sketch after the checklist below).

  • [x] I have read the contributing guidelines

  • Self-reported review complexity:

    • [x] Low
    • [ ] Medium
    • [ ] High
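
For illustration, here is a minimal Vulkan-Hpp sketch of the barrier change from the second bullet. This is a sketch of the idea only, not the actual ggml-vulkan code; the function names are made up:

```cpp
#include <vulkan/vulkan.hpp>

// Before (conceptually): a full pipeline barrier. Every prior command must
// complete and its writes be made visible before any later command may start.
void full_sync(vk::CommandBuffer cmd) {
    vk::MemoryBarrier barrier{
        vk::AccessFlagBits::eMemoryRead | vk::AccessFlagBits::eMemoryWrite,
        vk::AccessFlagBits::eMemoryRead | vk::AccessFlagBits::eMemoryWrite};
    cmd.pipelineBarrier(vk::PipelineStageFlagBits::eAllCommands,
                        vk::PipelineStageFlagBits::eAllCommands,
                        {}, barrier, nullptr, nullptr);
}

// After (conceptually): only order compute-shader and transfer accesses,
// which covers everything the backend records between syncs.
void narrow_sync(vk::CommandBuffer cmd) {
    const vk::AccessFlags access =
        vk::AccessFlagBits::eShaderRead | vk::AccessFlagBits::eShaderWrite |
        vk::AccessFlagBits::eTransferRead | vk::AccessFlagBits::eTransferWrite;
    vk::MemoryBarrier barrier{access, access};
    cmd.pipelineBarrier(vk::PipelineStageFlagBits::eComputeShader |
                            vk::PipelineStageFlagBits::eTransfer,
                        vk::PipelineStageFlagBits::eComputeShader |
                            vk::PipelineStageFlagBits::eTransfer,
                        {}, barrier, nullptr, nullptr);
}
```

The std::vector fix from the first bullet is the usual hot-path pattern: hoist the temporary container out of the per-call scope and reuse it, since clear() keeps the allocated capacity and repeated calls then stop hitting the allocator.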

Optimize Vulkan backend for better CPU performance and less GPU synchronization overhead.

- Allocation overhead for the temporary std::vectors was easily detectable with a sampling profiler and simple to remove.
- ggml_vk_sync_buffer introduced a full pipeline sync, which has a significant cost on the GPU side, sometimes larger than the actual kernel execution. Adding only barriers for shader reads/writes and transfers appears sufficient, judging from the code, which either launches compute kernels or copies tensors.
@github-actions bot added the Vulkan (Issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) labels on Aug 9, 2024
@mofosyne added the performance (Speed related topics) and Review Complexity : Low (Trivial changes to code that most beginner devs, or those who want a break, can tackle, e.g. a UI fix) labels on Aug 9, 2024
@ExtReMLapin (Contributor)

> 2x on GeForce 4090-class GPUs for stable-code-instruct-3b-Q8_0.

Could you please give pp/s and tg/s numbers?

@mtavenrath (Contributor, Author)

I have to correct myself: Intel Xe isn't getting faster with this PR. I had a bogus measurement on Intel Xe before, probably because something was running in the background.

master:

```
Intel(R) Iris(R) Xe Graphics (i7-13800h)
llama_print_timings: prompt eval time =     539.70 ms /     9 tokens (   59.97 ms per token,    16.68 tokens per second)
llama_print_timings:        eval time =   18839.67 ms /   215 runs   (   87.63 ms per token,    11.41 tokens per second)

NVIDIA RTX 6000 Ada Generation
llama_print_timings: prompt eval time =      78.59 ms /     9 tokens (    8.73 ms per token,   114.51 tokens per second)
llama_print_timings:        eval time =    4262.02 ms /   280 runs   (   15.22 ms per token,    65.70 tokens per second)
```

PR:

```
Intel(R) Iris(R) Xe Graphics (i7-13800h)
llama_print_timings: prompt eval time =     555.27 ms /     9 tokens (   61.70 ms per token,    16.21 tokens per second)
llama_print_timings:        eval time =    7616.56 ms /    87 runs   (   87.55 ms per token,    11.42 tokens per second)

NVIDIA RTX 6000 Ada Generation
llama_print_timings: prompt eval time =      70.06 ms /     9 tokens (    7.78 ms per token,   128.45 tokens per second)
llama_print_timings:        eval time =    4514.31 ms /   576 runs   (    7.84 ms per token,   127.59 tokens per second)
```

@slaren (Member) commented Aug 9, 2024

| GPU | Model | Test | t/s master | t/s vulkan_perf | Speedup |
| --- | --- | --- | --- | --- | --- |
| RTX 3090 Ti | llama 7B Q4_0 | pp512 | 976.00 | 997.27 | 1.02 |
| RTX 3090 Ti | llama 7B Q4_0 | tg128 | 48.89 | 72.25 | 1.48 |

@0cc4m self-assigned this Aug 10, 2024

@0cc4m (Collaborator) commented Aug 10, 2024

Very nice work; this makes a big difference in CPU-limited scenarios. I'm reducing some of the GPU bottlenecks in #8959, so this should combine well with that.

MaggotHATE added a commit to MaggotHATE/Llama_chat that referenced this pull request Aug 10, 2024
… and more

* fixed default sampling queue to include p_step
* changed sampling queue display to better reflect the actual logic
* added VK-specific settings `use_mmap_vk`, `flash_attn_vk`, `no_kv_offload_vk`
* added new presets for testing
@0cc4m merged commit 7c5bfd5 into ggml-org:master Aug 11, 2024
52 checks passed
@0cc4m mentioned this pull request Aug 11, 2024
@mtavenrath (Contributor, Author)

The potentially best next optimization would be to overlap copies and command-buffer creation with actual work on the GPU.

Below is a profile of stable-code-instruct-3b-Q8_0 on my system. It takes ~2.5 ms of CPU time to create 5.5 ms of GPU work. ggml_build_graph alone has an average execution time of 1.17 ms in my example. If we could split command-buffer generation and execute the partial command buffers early, before the whole graph has been recorded, this time could potentially be eliminated entirely, which at the ~7.8 ms per token measured above works out to roughly 15% more performance. On systems with slower CPUs (e.g. due to power saving or thermal constraints) the benefit can be even bigger.

[profiler screenshot]

@0cc4m (Collaborator) commented Aug 12, 2024

Yes, that is true. I've been looking into options for that.

We can submit command buffers early so that the GPU can start working while the rest of the graph is still being recorded. If that gain outweighs the overhead of the additional queue submissions, it should reduce the CPU overhead.

We can also look into multithreading the command-recording process, maybe in combination with the first option. That just needs careful synchronization to make sure it doesn't add more overhead than it removes (see the sketch below).
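
On the multithreading option: the main Vulkan constraint is that vk::CommandPool is externally synchronized, so each recording thread needs its own pool. A minimal sketch under that assumption (GraphChunk and record_chunk are hypothetical placeholders, not ggml code):

```cpp
#include <thread>
#include <vector>
#include <vulkan/vulkan.hpp>

struct GraphChunk { /* a slice of the compute graph (hypothetical) */ };
void record_chunk(vk::CommandBuffer cmd, const GraphChunk &chunk);  // hypothetical

// Record one command buffer per chunk in parallel. Each thread uses its own
// command pool, because a pool must not be accessed from two threads at once.
std::vector<vk::CommandBuffer> record_parallel(vk::Device device,
                                               uint32_t queue_family,
                                               const std::vector<GraphChunk> &chunks) {
    std::vector<vk::CommandBuffer> cmds(chunks.size());
    std::vector<std::thread> workers;
    for (size_t i = 0; i < chunks.size(); i++) {
        workers.emplace_back([&, i] {
            // one pool per thread; a real implementation would cache and reset these
            vk::CommandPool pool = device.createCommandPool({{}, queue_family});
            cmds[i] = device.allocateCommandBuffers(
                {pool, vk::CommandBufferLevel::ePrimary, 1})[0];
            cmds[i].begin({vk::CommandBufferUsageFlagBits::eOneTimeSubmit});
            record_chunk(cmds[i], chunks[i]);
            cmds[i].end();
        });
    }
    for (auto &w : workers) w.join();
    return cmds;  // submit in graph order on the main thread
}
```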

@mtavenrath (Contributor, Author)

When submitting early there is no need for multithreading as long as a single CPU core can keep up. Let's keep it simple at first, before adding complexity that makes it hard to profile and debug.

ggml_vk_build_graph has a boolean last_node which can be used to stop command-buffer recording and start a new one on the next call; ggml_vk_compute_forward handles this case quite well. I haven't yet figured out how to efficiently trigger the execution of a command buffer up front. Roughly the idea sketched below.
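
A rough fragment of the chunked-recording idea (argument lists simplified; the real ggml_vk_build_graph takes more parameters, and submit_closed_cmdbuffer is a hypothetical helper):

```cpp
// Record the graph in chunks. Passing last_node = true closes the current
// command buffer, so the next call starts a fresh one; the closed buffer
// could then be handed to the GPU immediately instead of at the very end.
const int CHUNK = 64;  // illustrative split point, to be tuned
for (int i = 0; i < cgraph->n_nodes; i++) {
    const bool close_here = ((i + 1) % CHUNK == 0) || (i + 1 == cgraph->n_nodes);
    ggml_vk_build_graph(ctx, cgraph->nodes[i], /*last_node=*/close_here);
    if (close_here) {
        // hypothetical: submit now, overlapping GPU work with further recording
        submit_closed_cmdbuffer(ctx);
    }
}
```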

@0cc4m (Collaborator) commented Aug 12, 2024

I would modify build_graph to submit the buffer to the queue once it gets closed, with a timeline semaphore to synchronize it with the next submission. On the last submission, add a fence and wait on that. The semaphore and fence have to go into some data structure so they are remembered across a run (roughly as sketched below).
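
A minimal sketch of that submission chain, assuming a vk::Device device, a vk::Queue queue, and already-recorded command buffers. Illustrative only, not the ggml-vulkan code; timeline semaphores need Vulkan 1.2 or VK_KHR_timeline_semaphore:

```cpp
#include <cstdint>
#include <vector>
#include <vulkan/vulkan.hpp>

// Submit each command buffer as soon as it is closed. Submission i waits for
// timeline value i and signals i + 1; only the last submission gets a fence.
void submit_chained(vk::Device device, vk::Queue queue,
                    const std::vector<vk::CommandBuffer> &cmds) {
    vk::SemaphoreTypeCreateInfo type_info{vk::SemaphoreType::eTimeline, 0};
    vk::SemaphoreCreateInfo sem_info{};
    sem_info.pNext = &type_info;
    vk::Semaphore timeline = device.createSemaphore(sem_info);
    vk::Fence fence = device.createFence({});

    for (size_t i = 0; i < cmds.size(); i++) {
        uint64_t wait_value   = i;      // value 0 is satisfied immediately
        uint64_t signal_value = i + 1;
        vk::TimelineSemaphoreSubmitInfo ts{1, &wait_value, 1, &signal_value};
        vk::PipelineStageFlags wait_stage = vk::PipelineStageFlagBits::eAllCommands;

        vk::SubmitInfo si{1, &timeline, &wait_stage, 1, &cmds[i], 1, &timeline};
        si.pNext = &ts;

        const bool last = (i + 1 == cmds.size());
        queue.submit(si, last ? fence : vk::Fence{});
    }
    // Block once, on the fence of the final submission.
    (void)device.waitForFences(fence, VK_TRUE, UINT64_MAX);
    device.destroyFence(fence);
    device.destroySemaphore(timeline);
}
```

In a real integration each loop iteration would run as its buffer closes during recording, with the semaphore, fence, and current timeline value kept in a persistent structure across the run, as described above.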

Do you want to look into it? If not, I'll try it when I find some time.

@mtavenrath (Contributor, Author)

@0cc4m I've pushed a draft for the overlapped cmdbuffer creation & execution in PR #9118. I didn't touch the semaphore handling... though it works fine on my system. I cannot tell when I'll have time to take care of the semaphores as well.

arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 15, 2024
Optimize Vulkan backend for better CPU performance and less GPU synchronization overhead. (ggml-org#8943)

* Optimize Vulkan backend for better CPU performance and less GPU synchronization overhead.

- Allocation overhead for the temporary std::vectors was easily detectable with a sampling profiler and simple to remove.
- ggml_vk_sync_buffer introduced a full pipeline sync, which has a significant cost on the GPU side, sometimes larger than the actual kernel execution. Adding only barriers for shader reads/writes and transfers appears sufficient, judging from the code, which either launches compute kernels or copies tensors.

* Fix small typo

---------

Co-authored-by: 0cc4m <[email protected]>
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 18, 2024
Optimize Vulkan backend for better CPU performance and less GPU synchronization overhead. (ggml-org#8943)
Labels
ggml (changes relating to the ggml tensor library for machine learning) · performance (Speed related topics) · Review Complexity : Low (Trivial changes to code that most beginner devs, or those who want a break, can tackle, e.g. a UI fix) · Vulkan (Issues specific to the Vulkan backend)

5 participants