CUDA: fix FA tg at long context for CC >= 8.9 #13852

Merged

Conversation

JohannesGaessler
Collaborator

Fixes #12816 (comment).

On master, the CUDA kernel for combining FlashAttention results without stream-k implicitly assumes that the number of parallel blocks is small relative to the head size. This used to be true, but after the number of parallel blocks was made variable in #12182 there are configurations where the shared memory is not initialized correctly. This PR fixes the issue by replacing the faulty conditional statement with a loop, as sketched below.
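A minimal sketch of the shape of the bug and the fix; the kernel name, parameters, and indexing are illustrative placeholders, not the actual ggml code:

```cuda
// Illustrative sketch (not the real ggml kernel): one CUDA block combines the
// n_blocks partial FlashAttention results for one head element.
// Assumption: blockDim.x equals the head size; smem holds n_blocks entries.
__global__ void combine_partial_results(const float * partial, float * dst,
                                        const int n_blocks) {
    extern __shared__ float smem[];

    // Buggy pattern on master: a single conditional write. It only covers
    // smem[0..blockDim.x) and leaves entries uninitialized once
    // n_blocks > blockDim.x:
    //
    //     if (threadIdx.x < n_blocks) {
    //         smem[threadIdx.x] = partial[blockIdx.x*n_blocks + threadIdx.x];
    //     }

    // Fix in this PR, in spirit: stride over the buffer with a loop so every
    // entry is written regardless of how n_blocks compares to blockDim.x:
    for (int i = threadIdx.x; i < n_blocks; i += blockDim.x) {
        smem[i] = partial[blockIdx.x*n_blocks + i];
    }
    __syncthreads();

    // ... combine the n_blocks partial values for this thread's head element ...
}
```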

@JohannesGaessler JohannesGaessler merged commit a682474 into ggml-org:master May 28, 2025
42 checks passed
Labels
ggml: changes relating to the ggml tensor library for machine learning
Nvidia GPU: Issues specific to Nvidia GPUs