
feat: Holistic persistent kernel template with global scheduler #1026


Closed
yzh119 wants to merge 41 commits

Conversation

@yzh119 (Collaborator) commented Apr 20, 2025

Follow-up of #858 and #967: this PR implements a persistent kernel template that supports sequential execution of multiple kernels (e.g., one wave for prefill attention, one wave for decode attention, and one wave for attention reduction) in a single kernel launch, with a global scheduler for load balancing:

[figure: persistent kernel waves dispatched by the global scheduler]

POD-Attention can be implemented as a different scheduler within this framework.
This PR should also resolve the issue mentioned in #1022.
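
A minimal sketch of the execution model (illustrative only; `WorkTile`, `work_counter`, and `run_tile` are hypothetical placeholders, not the actual template):

```cpp
// Sketch: each CTA stays resident and pulls work tiles from a global queue.
struct WorkTile {
  int kind;      // 0 = prefill, 1 = decode, 2 = attention reduction
  int tile_idx;  // which tile within the wave
};

__device__ int work_counter = 0;  // global scheduler state

__device__ void run_tile(const WorkTile& tile) {
  // Placeholder for the per-tile body (prefill / decode / reduction).
}

__global__ void persistent_kernel(const WorkTile* tiles, int num_tiles) {
  __shared__ int tile_id;
  for (;;) {
    // One thread per CTA pops the next tile from the global counter, so
    // faster CTAs naturally pick up more work (dynamic load balancing).
    if (threadIdx.x == 0) tile_id = atomicAdd(&work_counter, 1);
    __syncthreads();
    if (tile_id >= num_tiles) break;  // all waves drained
    run_tile(tiles[tile_id]);
    __syncthreads();  // reuse shared memory safely before the next tile
  }
}
```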

Co-authored-by: Yilong Zhao [email protected]

@Edenzzzz (Contributor)
Thanks for your attention. I guess the issue with POD is that when the prefill/decode sequences have varying lengths, there's significant wave quantization between blocks? I might try to adapt it to your template.
Also, I wonder if you have any hints on debugging the illegal memory access issue? Sometimes I also see `operation not supported on global/shared address space`.

@AKKamath (Contributor)
Sorry, I've been a bit busy recently. I'll try taking a look at your illegal access issue. It's what's mentioned in #967, right?

@Edenzzzz (Contributor)
@AKKamath Hi, it's #1022, and I will try to upstream a cleaner reproduction.

Apart from the illegal access, I'm not sure if the slowdown with irregular decode & prefill seqlens is due to a global sync in each wave; I don't see any global sync points blocking the next CTA apart from the final merge states.
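
(By a global sync point I mean something like a grid-wide barrier between waves; a rough cooperative-groups sketch, not code from this PR:)

```cpp
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Rough sketch: a grid-wide sync between two waves (requires a cooperative
// launch via cudaLaunchCooperativeKernel); not code from this PR.
__global__ void two_wave_kernel() {
  cg::grid_group grid = cg::this_grid();
  // ... wave 1, e.g. prefill attention ...
  grid.sync();  // every CTA waits here before any CTA starts wave 2
  // ... wave 2, e.g. decode attention ...
}
```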

Comment on lines +1149 to +1152
```cpp
auto [cluster_idx, accum_cost] = cluster_cost_heap.pop();
int actual_len = std::min(remaining_len, kv_len_limit);
cluster_cost_heap.insert(
    {cluster_idx, accum_cost + cost_function(cluster_tile_q, actual_len)});
```
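
For context, the greedy balancing around these lines works roughly as follows (a simplified host-side sketch; the heap type, cost model, and variable names are illustrative, not the actual scheduler code):

```cpp
#include <algorithm>
#include <cstdint>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Simplified sketch of greedy least-loaded assignment: each KV chunk of a
// request goes to the cluster with the smallest accumulated cost so far, so
// long requests are split across clusters and work stays balanced.
int main() {
  const int num_clusters = 4, cluster_tile_q = 16, kv_len_limit = 512;
  std::vector<int> request_kv_lens = {2048, 64, 768, 128};  // toy inputs

  auto cost_function = [](int tile_q, int kv_len) {
    return static_cast<int64_t>(tile_q) * kv_len;  // toy cost model
  };

  // Min-heap keyed on accumulated cost: (accum_cost, cluster_idx).
  using Entry = std::pair<int64_t, int>;
  std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;
  for (int i = 0; i < num_clusters; ++i) heap.push({0, i});

  for (int remaining_len : request_kv_lens) {
    while (remaining_len > 0) {
      auto [accum_cost, cluster_idx] = heap.top();  // least-loaded cluster
      heap.pop();
      int actual_len = std::min(remaining_len, kv_len_limit);
      heap.push({accum_cost + cost_function(cluster_tile_q, actual_len),
                 cluster_idx});
      remaining_len -= actual_len;
      // ... record that this KV chunk is assigned to cluster_idx ...
    }
  }
  return 0;
}
```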
Contributor

It seems that with load balancing across tile sizes, the persistent kernel achieves the same goal as POD Attention (and perhaps an even more balanced workload, because POD Attention currently calls `plan()` to balance the decode workload?).
It should also have lower CTA launch and quantization overheads?

Contributor

Though if we can carefully schedule POD Attention to run two CTAs per SM concurrently, each with a different tile size, we could increase tensor core utilization.

Collaborator Author

One of the goals of POD-Attention is to overlap compute-bound (prefill) and IO-bound (decode) work through concurrent execution of the two kinds of workload within an SM (two CTAs per SM). We didn't explore that in this PR, but I suppose it's feasible with a carefully designed scheduler.
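
A rough sketch of what such a scheduler could look like (purely illustrative, not part of this PR): launch two CTAs per SM and let them drain different queues, so a compute-bound prefill CTA and an IO-bound decode CTA can overlap on the same SM:

```cpp
// Purely illustrative: even blocks drain the prefill queue, odd blocks drain
// the decode queue, so (with two resident CTAs per SM) the two kinds of work
// can overlap on one SM. Names and the parity trick are hypothetical.
__device__ int prefill_counter = 0;
__device__ int decode_counter = 0;

__global__ void pod_style_persistent_kernel(int num_prefill_tiles,
                                            int num_decode_tiles) {
  const bool is_prefill = (blockIdx.x % 2 == 0);
  int* counter = is_prefill ? &prefill_counter : &decode_counter;
  const int num_tiles = is_prefill ? num_prefill_tiles : num_decode_tiles;

  __shared__ int tile_id;
  for (;;) {
    if (threadIdx.x == 0) tile_id = atomicAdd(counter, 1);
    __syncthreads();
    if (tile_id >= num_tiles) break;
    // ... run a prefill tile (large CTA_TILE_Q) or a decode tile (small
    //     CTA_TILE_Q) depending on is_prefill ...
    __syncthreads();  // make shared memory reusable before the next tile
  }
}
```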

```cpp
BlockPersistentRunner1::Run(params_1, &smem_storage_1);
PROFILER_EVENT_END(profiler_closure, PersistentProfileEventType::kRunner1);

__syncthreads();
```
Contributor

Could this sync be removed?

Collaborator Author

yes!

@yzh119 (Collaborator, Author) commented Jun 12, 2025

Moved development to #1137

@yzh119 closed this Jun 12, 2025
yzh119 added a commit that referenced this pull request Jun 12, 2025

## 📌 Description
Follow-up of #858, #967, and #1026, this PR aims to provide an efficient and
unified API for processing prefill and decode requests within a single kernel
launch. Key features include:
1. Single CUDA graph capture for all batch sizes and sequence lengths.
Prior to this PR, the FA2 template was implemented as a non-persistent
kernel, which dispatches `padded_batch_sizes` CTAs and relies on static
information (ref:
https://github.com/flashinfer-ai/flashinfer/blob/f484fd3c7f09a1d0afb75d779872b9762a35e445/include/flashinfer/attention/scheduler.cuh#L527).
This necessitates a specialized CUDA graph for each combination of seqlens
and batch size to maximize throughput. Furthermore, prefill and decode are
executed by different kernel launches, which further multiplies the number
of CUDA graphs. This PR implements a persistent-style kernel, which enables
a single CUDA graph to capture the work for all seqlens and batch sizes.
2. Dynamic specialization for prefill and decode. Implemented as a
persistent kernel, prefill and decode requests are dynamically executed
by an efficient kernel template with suitable hyperparameters. For
example, decode requests with `qo_len=1` are processed with `CTA_TILE_Q=16`,
while prefill requests with `qo_len>=128` are processed with `CTA_TILE_Q=128`
(see the sketch below).
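
A simplified sketch of the dispatch rule described above (the thresholds, names, and tile struct are illustrative, not the exact hyperparameters used by the kernel template):

```cpp
#include <vector>

// Illustrative only: pick a query tile size that roughly matches each
// request's qo_len, then emit tiles for the persistent kernel's scheduler.
struct QTile {
  int request_idx;
  int q_start;
  int cta_tile_q;
};

int select_cta_tile_q(int qo_len) {
  if (qo_len <= 16) return 16;   // decode-like requests
  if (qo_len <= 64) return 64;   // short prefill
  return 128;                    // long prefill
}

std::vector<QTile> plan_q_tiles(const std::vector<int>& qo_lens) {
  std::vector<QTile> tiles;
  for (int r = 0; r < static_cast<int>(qo_lens.size()); ++r) {
    int tile_q = select_cta_tile_q(qo_lens[r]);
    for (int q = 0; q < qo_lens[r]; q += tile_q) {
      tiles.push_back({r, q, tile_q});
    }
  }
  return tiles;  // consumed by the persistent kernel's global scheduler
}
```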

## Perf Benchmarks
The benchmark script is at `benchmarks/bench_batch_attention.py` and was
tested with Qwen-2.5-7B configurations on a single H200. Visualization:
<img width="594" alt="image"
src="https://github.com/user-attachments/assets/735aca14-387d-4013-b3f4-e199b6cff5f3"
/>
1. 30% bandwidth boost in hybrid scenarios.
2. Slightly worse performance on pure workloads, which may be caused by the
reduction overhead.

## Unit Tests
Unit tests are located at `tests/bench_batch_attention.py`.
<img width="1527" alt="image"
src="https://github.com/user-attachments/assets/fff06c6d-c121-497c-9f62-039653149a4d"
/>

## Future work
1. Add a profiler to analyze perf bottlenecks.
2. Optimize the reduction kernel schedule.

## 🔍 Related Issues
#1022

Advised by @yzh119. CC @AKKamath @Edenzzzz 

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [x] Tests have been added or updated as needed.
- [x] All tests are passing (`unittest`, etc.).

## Reviewer Notes


Co-authored-by: yzh119 <[email protected]>
Co-authored-by: happierpig <[email protected]>