
cuda: fix layer split mode preventing cuda graph compilation #13815


Closed
wants to merge 1 commit into from

Conversation


@koush koush commented May 27, 2025

When layer split is used, the output from one layer is the input to another layer.

I noticed that CUDA graphs would eventually give up due to too many changes in the graph: the input tensor's data address would cycle between four (I think) allocated values. I dug into this a bit to see if I could fix that, so the CUDA graph would not see a noisy input address, but my understanding is that this is done for pipelining? Maybe that is the correct fix, and someone else could make that change. This change works under the assumption that the underlying platform is working as intended here.

In any case, I changed CUDA graph compilation to account for input address changes by caching CUDA graphs and (quickly) finding the one that matches from an LRU set.
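A minimal sketch of that idea, assuming a small fixed capacity and using a placeholder `GraphExec` struct in place of a real `cudaGraphExec_t` (this is illustrative, not the PR's actual code):

```cpp
#include <cstdint>
#include <list>
#include <unordered_map>

// Stand-in for a captured/instantiated cudaGraphExec_t (hypothetical).
struct GraphExec { std::uintptr_t input_addr; };

// Small LRU set of compiled graphs keyed by the input tensor's address.
class GraphLRU {
public:
    explicit GraphLRU(std::size_t capacity) : capacity_(capacity) {}

    // Returns the cached graph for this input address, or "captures"
    // (here: constructs) a new one, evicting the least recently used
    // entry when the set is full.
    GraphExec & get(std::uintptr_t addr) {
        auto it = index_.find(addr);
        if (it != index_.end()) {
            // Hit: move the entry to the front (most recently used).
            order_.splice(order_.begin(), order_, it->second);
            return order_.front();
        }
        if (order_.size() == capacity_) {
            // Miss with a full cache: evict the least recently used graph.
            index_.erase(order_.back().input_addr);
            order_.pop_back();
        }
        order_.push_front(GraphExec{addr});
        index_[addr] = order_.begin();
        return order_.front();
    }

    std::size_t size() const { return order_.size(); }

private:
    std::size_t capacity_;
    std::list<GraphExec> order_;
    std::unordered_map<std::uintptr_t, std::list<GraphExec>::iterator> index_;
};
```

With a capacity at least as large as the number of cycling addresses, every address hits the cache after the first pass, so graphs are reused rather than re-captured on each pointer change.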

This showed appreciable gains in tokens per second:

bartowski/Qwen2.5-72B-Instruct-GGUF:IQ4_XS, 2 x RTX PRO 6000:

Before: 29 tokens/s
After: 35 tokens/s

@github-actions github-actions bot added the Nvidia GPU (Issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) labels May 27, 2025

koush commented May 27, 2025

I just noticed that a fix for this landed at nearly the same time as mine, so this change may be redundant:

#13814

For issue:
#13751

@koush koush closed this May 27, 2025
@koush koush deleted the layer-split-cuda branch May 27, 2025 14:00
@koush koush restored the layer-split-cuda branch June 1, 2025 06:05
@koush koush reopened this Jun 1, 2025

koush commented Jun 1, 2025

Actually, it turns out that I will likely need this change, or a derivative of it, in the tensor-parallel work I am doing: the graph is split at every RMS normalization where the GPUs need to gather/reduce tensors, which results in graph churn.


koush commented Jun 1, 2025

I will open another pull request once that work is settled.

@koush koush closed this Jun 1, 2025