CUDA: fix FA tg at long context for CC >= 8.9 #13852

Merged

Conversation

JohannesGaessler
Collaborator

Fixes #12816 (comment).

On master, the CUDA kernel for combining FlashAttention results without stream-k implicitly assumes that the number of parallel blocks is small relative to the head size. This used to be true, but after the number of parallel blocks was made variable in #12182 there are configurations where the shared memory is not initialized correctly. This PR fixes the issue by replacing the faulty conditional statement with a loop, as sketched below.
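A minimal sketch of the shape of the bug and the fix; the kernel name, parameters, and indexing are illustrative placeholders, not the actual ggml code:

```cuda
// Illustrative sketch (not the real ggml kernel): one CUDA block combines the
// n_blocks partial FlashAttention results for one head element.
// Assumption: blockDim.x equals the head size; smem holds n_blocks entries.
__global__ void combine_partial_results(const float * partial, float * dst,
                                        const int n_blocks) {
    extern __shared__ float smem[];

    // Buggy pattern on master: a single conditional write. It only covers
    // smem[0..blockDim.x) and leaves entries uninitialized once
    // n_blocks > blockDim.x:
    //
    //     if (threadIdx.x < n_blocks) {
    //         smem[threadIdx.x] = partial[blockIdx.x*n_blocks + threadIdx.x];
    //     }

    // Fix in this PR, in spirit: stride over the buffer with a loop so every
    // entry is written regardless of how n_blocks compares to blockDim.x:
    for (int i = threadIdx.x; i < n_blocks; i += blockDim.x) {
        smem[i] = partial[blockIdx.x*n_blocks + i];
    }
    __syncthreads();

    // ... combine the n_blocks partial values for this thread's head element ...
}
```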

@JohannesGaessler JohannesGaessler merged commit a682474 into ggml-org:master May 28, 2025
42 checks passed
Labels
ggml: changes relating to the ggml tensor library for machine learning
Nvidia GPU: Issues specific to Nvidia GPUs