
[CUDA] Enable CUDA Graph on CUDA Toolkit < 12.x #12394


Merged: 3 commits into ggml-org:master on Mar 17, 2025

Conversation

gaugarg-nv
Contributor

@gaugarg-nv gaugarg-nv commented Mar 14, 2025

The `cudaGraphExecUpdate` API signature was changed in CUDA Toolkit (CTK) 12.x. For this reason, CUDA graph support was disabled on older CUDA toolkits. This change enables CUDA graph support on CTK < 12.x by falling back to the older API signature when compiling against those versions.

Performance Gains on CUDA 11.8, RTX 4090

This PR improves generation-phase (tg128) throughput by roughly 34–39% on this setup.

Master

llama-bench.exe -m DeepSeek-R1-Distill-Qwen-7B-GGUF\DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| qwen2 7B Q4_K - Medium         |   4.36 GiB |     7.62 B | CUDA       |  99 |         pp512 |     10987.87 ± 29.37 |
| qwen2 7B Q4_K - Medium         |   4.36 GiB |     7.62 B | CUDA       |  99 |         tg128 |        110.47 ± 0.25 |

build: 8fcb5636 (4887)

llama-bench.exe -m DeepSeek-R1-Distill-Llama-8B-GGUF\DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CUDA       |  99 |         pp512 |    10345.30 ± 273.76 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CUDA       |  99 |         tg128 |        109.59 ± 0.16 |

build: 8fcb5636 (4887)

This PR

llama-bench.exe -m DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| qwen2 7B Q4_K - Medium         |   4.36 GiB |     7.62 B | CUDA       |  99 |         pp512 |    10737.57 ± 247.49 |
| qwen2 7B Q4_K - Medium         |   4.36 GiB |     7.62 B | CUDA       |  99 |         tg128 |        153.02 ± 0.18 |

build: fc7f195c (4888)

llama-bench.exe -m DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CUDA       |  99 |         pp512 |     10518.24 ± 45.75 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CUDA       |  99 |         tg128 |        146.70 ± 0.26 |

build: fc7f195c (4888)


The `cudaGraphExecUpdate` API was changed in CTK 12.x. For this reason, CUDA graph support was disabled on older CUDA toolkits. This change enables CUDA graph support on CTK < 12.x by using the older API when CTK < 12.x.
@github-actions github-actions bot added the Nvidia GPU (Issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) labels Mar 14, 2025

@gaugarg-nv gaugarg-nv changed the title Enable CUDA Graph on CUDA Toolkit < 12.x [CUDA] Enable CUDA Graph on CUDA Toolkit < 12.x Mar 15, 2025
@ggerganov
Member

@gaugarg-nv
Contributor Author

Seems to cause an error in the MUSA build:

https://github.com/ggml-org/llama.cpp/actions/runs/13860248342/job/38872230770?pr=12394#step:6:6724

Added a change that should fix the issues with the MUSA build.

#else
cudaGraphExecUpdateResultInfo result_info;
Member


Is `cudaGraphExecUpdateResultInfo` correct here, or should it be `cudaGraphExecUpdateResult`?

As it is, `cudaGraphExecUpdateResultInfo` is only defined in vendors/musa.h:

#define cudaGraphExecUpdateResultInfo musaGraphExecUpdateResult

Contributor Author

@gaugarg-nv gaugarg-nv Mar 17, 2025


This is correct. `cudaGraphExecUpdateResultInfo` is declared in the headers of CTK >= 12.x; CTK < 12.x declares `cudaGraphExecUpdateResult` instead.

Looked into the MUSA headers, and it seems `musaGraphExecUpdate` itself is commented out there.
I realized that CUDA graph support was previously disabled on the MUSA platform, so that part of the code was not being compiled for MUSA at all. However, my change enabled it on MUSA when I removed the `CUDART_VERSION >= 12000` check. I have now disabled CUDA graph support on the MUSA platform again. I have also updated musa.h to use the right macros, though this does not matter in practice since the code is not compiled for MUSA.

@ggerganov ggerganov merged commit b1b132e into ggml-org:master Mar 17, 2025
46 checks passed
@gaugarg-nv gaugarg-nv deleted the enable_cuda_graph_on_11.x branch March 18, 2025 00:05
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Mar 19, 2025
* Enable CUDA Graph on CTK < 12.x

The `cudaGraphExecUpdate` API was changed in CTK 12.x. For this reason, CUDA graph support was disabled on older CUDA toolkits. This change enables CUDA graph support on CTK < 12.x by using the older API when CTK < 12.x.

* Fix compilation errors with MUSA

* Disable CUDA Graph for MUSA