
[CUDA] Enable CUDA Graph on CUDA Toolkit < 12.x #12394


Merged: 3 commits into ggml-org:master on Mar 17, 2025

Conversation

gaugarg-nv
Contributor

@gaugarg-nv gaugarg-nv commented Mar 14, 2025

The `cudaGraphExecUpdate` API signature was changed in CUDA Toolkit (CTK) 12.x. For this reason, CUDA graph support was disabled on older CUDA toolkits. This change enables CUDA graph support on CTK < 12.x by falling back to the older API signature when compiling against those versions.

Performance Gains on CUDA 11.8, RTX 4090

This PR improves generation-phase (tg128) throughput by roughly 34–39% on this setup.

Master

llama-bench.exe -m DeepSeek-R1-Distill-Qwen-7B-GGUF\DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| qwen2 7B Q4_K - Medium         |   4.36 GiB |     7.62 B | CUDA       |  99 |         pp512 |     10987.87 ± 29.37 |
| qwen2 7B Q4_K - Medium         |   4.36 GiB |     7.62 B | CUDA       |  99 |         tg128 |        110.47 ± 0.25 |

build: 8fcb5636 (4887)

llama-bench.exe -m DeepSeek-R1-Distill-Llama-8B-GGUF\DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CUDA       |  99 |         pp512 |    10345.30 ± 273.76 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CUDA       |  99 |         tg128 |        109.59 ± 0.16 |

build: 8fcb5636 (4887)

This PR

llama-bench.exe -m DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| qwen2 7B Q4_K - Medium         |   4.36 GiB |     7.62 B | CUDA       |  99 |         pp512 |    10737.57 ± 247.49 |
| qwen2 7B Q4_K - Medium         |   4.36 GiB |     7.62 B | CUDA       |  99 |         tg128 |        153.02 ± 0.18 |

build: fc7f195c (4888)

llama-bench.exe -m DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CUDA       |  99 |         pp512 |     10518.24 ± 45.75 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | CUDA       |  99 |         tg128 |        146.70 ± 0.26 |

build: fc7f195c (4888)


The `cudaGraphExecUpdate` API was changed in CTK 12.x. For this reason, CUDA graph support was disabled on older CUDA toolkits. This change enables CUDA graph support on CTK < 12.x by using the older API when CTK < 12.x.
@github-actions github-actions bot added the Nvidia GPU (Issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) labels Mar 14, 2025

@gaugarg-nv gaugarg-nv changed the title Enable CUDA Graph on CUDA Toolkit < 12.x [CUDA] Enable CUDA Graph on CUDA Toolkit < 12.x Mar 15, 2025
@ggerganov
Member

@gaugarg-nv
Contributor Author

Seems to cause an error in the MUSA build:

https://github.com/ggml-org/llama.cpp/actions/runs/13860248342/job/38872230770?pr=12394#step:6:6724

Added a change that should fix the issues with the MUSA build.

#else
cudaGraphExecUpdateResultInfo result_info;
Member


Is `cudaGraphExecUpdateResultInfo` correct here, or should it be `cudaGraphExecUpdateResult`?

As it is, `cudaGraphExecUpdateResultInfo` is only defined in vendors/musa.h:

#define cudaGraphExecUpdateResultInfo musaGraphExecUpdateResult

Contributor Author

@gaugarg-nv gaugarg-nv Mar 17, 2025


This is correct. `cudaGraphExecUpdateResultInfo` is declared in the headers of CTK >= 12.x; CTK < 12.x declares `cudaGraphExecUpdateResult` instead.

Looked into the MUSA headers, and it seems `musaGraphExecUpdate` itself is commented out there.
I realized that CUDA graph support was previously disabled on the MUSA platform, so that part of the code was not being compiled for MUSA at all. However, my change enabled it on MUSA when I removed the `CUDART_VERSION >= 12000` check. I have now disabled CUDA graph support on the MUSA platform again. I have also updated musa.h to use the right macros, though this does not matter in practice since the code is not compiled for MUSA.

@ggerganov ggerganov merged commit b1b132e into ggml-org:master Mar 17, 2025
46 checks passed
@gaugarg-nv gaugarg-nv deleted the enable_cuda_graph_on_11.x branch March 18, 2025 00:05
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Mar 19, 2025
* Enable CUDA Graph on CTK < 12.x

The `cudaGraphExecUpdate` API was changed in CTK 12.x. For this reason, CUDA graph support was disabled on older CUDA toolkits. This change enables CUDA graph support on CTK < 12.x by using the older API when CTK < 12.x.

* Fix compilation errors with MUSA

* Disable CUDA Graph for MUSA