
CUDA Graph Compute Function Refactor (precursor for performance improvements) #11042


Merged: 11 commits merged into ggml-org:master from akieslinger/refactor_cuda_backend on Jan 13, 2025

Conversation

aendk (Contributor) commented on Jan 2, 2025

Hi All,
I am working on improving llama.cpp's CUDA graph performance on behalf of NVIDIA.
In preliminary testing, we are seeing up to a 3% performance gain from overlapping CPU and GPU work and from improved CPU -> GPU copy scheduling on a high-end system. The changes are likely to be even more impactful on less capable hardware.

To pave the way for these changes (and to keep the diffs readable), I first isolated the cosmetic changes in this PR.
This PR contains no changes to the logic. It merely slims down ggml_backend_cuda_graph_compute() by moving certain loops and other subtasks of the original function into five new functions.

These changes considerably improve the readability and future maintainability of this part of the CUDA backend.
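For orientation, here is a structural sketch of the shape this aims for. maintain_cuda_graph() and the cuda_graph_update_required flag appear in the commit messages further down; every other helper name and all signatures are illustrative assumptions, not the PR's exact code.

```cpp
// Structural sketch only -- helper names besides maintain_cuda_graph()
// and cuda_graph_update_required are assumptions.
static bool check_node_graph_compatibility (ggml_backend_cuda_context * cuda_ctx, ggml_cgraph * cgraph);  // assumed
static bool is_cuda_graph_update_required  (ggml_backend_cuda_context * cuda_ctx, ggml_cgraph * cgraph);  // assumed
static void maintain_cuda_graph            (ggml_backend_cuda_context * cuda_ctx, bool update_required);  // name from commits, signature assumed
static void evaluate_and_capture_cuda_graph(ggml_backend_cuda_context * cuda_ctx, ggml_cgraph * cgraph,
                                            bool use_cuda_graph, bool update_required);                   // assumed

static enum ggml_status ggml_backend_cuda_graph_compute(ggml_backend_t backend, ggml_cgraph * cgraph) {
    ggml_backend_cuda_context * cuda_ctx = (ggml_backend_cuda_context *) backend->context;

    bool use_cuda_graph             = false;
    bool cuda_graph_update_required = false;
#ifdef USE_CUDA_GRAPH
    // 1. Can this graph be executed as a CUDA graph at all?
    use_cuda_graph = check_node_graph_compatibility(cuda_ctx, cgraph);
    if (use_cuda_graph) {
        // 2. Is the previously captured graph stale?
        cuda_graph_update_required = is_cuda_graph_update_required(cuda_ctx, cgraph);
        // 3. Re-capture the graph or patch copy parameters as needed.
        maintain_cuda_graph(cuda_ctx, cuda_graph_update_required);
    }
#endif
    // 4. Either launch the captured CUDA graph or evaluate node by node.
    evaluate_and_capture_cuda_graph(cuda_ctx, cgraph, use_cuda_graph, cuda_graph_update_required);
    return GGML_STATUS_SUCCESS;
}
```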

Should I add prefixes to the new function names, and if so, what do you suggest?

@agray3 @mtavenrath

github-actions bot added the Nvidia GPU (Issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) labels on Jan 2, 2025
aendk marked this pull request as a draft on January 2, 2025 15:47
aendk (Contributor, Author) commented on Jan 2, 2025

FYI: setting Status to Draft whilst I investigate the failed tests.

ggerganov (Member) replied:

> FYI: setting Status to Draft whilst I investigate the failed tests.

I think you just need to make the functions static.
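For context, a minimal sketch of that fix; the helper below and its signature are invented for illustration, but the mechanism is standard C++. A non-static function definition in a .cu file has external linkage, so a build with missing-declaration warnings treated as errors rejects it unless a prior declaration exists; static gives the helper internal linkage and confines it to its translation unit.

```cpp
// Hypothetical helper in ggml-cuda.cu. Without 'static', this definition has
// external linkage and can trigger missing-declaration warnings that CI
// treats as errors; 'static' limits it to this translation unit.
static bool graph_needs_recapture(const ggml_cgraph * cgraph, int prev_n_nodes) {
    // Illustrative check only: re-capture when the node count changed.
    return cgraph->n_nodes != prev_n_nodes;
}
```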

aendk force-pushed the akieslinger/refactor_cuda_backend branch from 3998c0d to 004ec3a on January 7, 2025 08:36
aendk force-pushed the akieslinger/refactor_cuda_backend branch from 004ec3a to 98d4e55 on January 9, 2025 13:15
aendk marked this pull request as ready for review on January 9, 2025 13:52
aendk (Contributor, Author) commented on Jan 9, 2025

OK, third time's a charm. My first attempt lacked the static keyword; my second revealed some issues around missing #ifdefs when CUDA graphs are disabled.

Now this refactor-only PR is ready for review.
Due to the #ifdefs, the overall diff looks quite messy.

I strongly suggest reviewing commit by commit; there it is easier to see that not much has changed.
I moved some code into five new functions and removed a few lines of unused/dead code.
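As an aside on those #ifdef failures, a minimal illustration of the kind of guard involved, assuming the usual ggml-cuda setup where CUDA-graph state only exists when USE_CUDA_GRAPH is defined. maintain_cuda_graph() is named in the commit messages below; the signature and body here are assumptions.

```cpp
// Illustrative only: CUDA-graph bookkeeping compiles solely when CUDA graphs
// are enabled, so any helper touching it needs the same guard, or builds
// with CUDA graphs disabled fail.
#ifdef USE_CUDA_GRAPH
static void maintain_cuda_graph(ggml_backend_cuda_context * cuda_ctx, bool cuda_graph_update_required) {
    // ... update the captured graph or patch copy parameters ...
}
#endif
```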

slaren (Member) left a comment


There are multiple consecutive #ifdef USE_CUDA_GRAPH blocks; consider consolidating them into a single one.
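To illustrate the request (a sketch with stand-in declarations, not the PR's actual code):

```cpp
// Before: adjacent blocks each carry their own guard.
#ifdef USE_CUDA_GRAPH
bool use_cuda_graph = true;
#endif
#ifdef USE_CUDA_GRAPH
bool cuda_graph_update_required = false;
#endif

// After: one consolidated guard covers the contiguous region.
#ifdef USE_CUDA_GRAPH
bool use_cuda_graph             = true;
bool cuda_graph_update_required = false;
#endif
```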

aendk (Contributor, Author) commented on Jan 13, 2025

I implemented your requested changes. The single test failure looks like a CI/runner issue.
Feel free to re-review or re-run the failed test.

slaren merged commit 39509fb into ggml-org:master on Jan 13, 2025
47 checks passed
tinglou pushed a commit to tinglou/llama.cpp that referenced this pull request Feb 13, 2025
CUDA Graph Compute Function Refactor (precursor for performance improvements) (ggml-org#11042)

* Refactor: Moves cuda graph executable update step to separate function.

* Refactor: Moves cuda graph update check to separate function.

* Refactor: Moves cuda graph maintenance (update or adjusting copy parameters) to separate function for improved readability.

* Fix: Adds missing reference to maintain_cuda_graph() definition.

* Refactor: Improves structure and abstractions by moving CUDA graph evaluation and capture to its own function.

* Refactor: Moves node graph checks and copy ops into individual function for improved readability.

* Refactor: Removes code permanently excluded from compilation to increase readability.

* Style: Adds missing newline

* Style: Consolidates several neighboring '#ifdef USE_CUDA_GRAPH' into a single one

* Refactor: Makes 'cuda_graph_update_required' a local variable

* remove double lines between functions

---------

Co-authored-by: slaren <[email protected]>
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Feb 26, 2025
mglambda pushed a commit to mglambda/llama.cpp that referenced this pull request Mar 8, 2025