[Inductor] Support tiling reduction dimensions #137243

blaine-rister · 2024-10-03T00:45:08Z

Sub-PRs containing refactors from this one:

These refactor PRs should land before the main one.

Feature

Note: to minimize risk, multi-dimensional reductions are gated by the flag config.triton.tile_reductions, which defaults to False.

Instead of having a single reduction dimension called "r", we can now support 2D reductions with "r0_" and "r1_" dimensions. 2D reductions generate two nested loops, with different block pointer advancements in each loop body. Most of the implementation is generic to ND reductions, but for now the tiling algorithm sets a hard limit at 2D.

Here's an example of a 2D persistent reduction kernel:

@triton.jit
def triton_per_fused_sum_0(in_ptr0, out_ptr0, xnumel, r0_numel, r1_numel, XBLOCK : tl.constexpr):
    xnumel = 1
    r0_numel = 15
    R0_BLOCK: tl.constexpr = 16
    r1_numel = 15
    R1_BLOCK: tl.constexpr = 16
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:, None, None]
    xmask = tl.full([XBLOCK, R0_BLOCK, R1_BLOCK], True, tl.int1)
    r0_index = tl.arange(0, R0_BLOCK)[None, :, None]
    r0_offset = 0
    r0_mask = r0_index < r0_numel
    r1_index = tl.arange(0, R1_BLOCK)[None, None, :]
    r1_offset = 0
    r1_mask = r1_index < r1_numel
    rnumel = r0_numel * r1_numel
    RBLOCK: tl.constexpr = R0_BLOCK*R1_BLOCK
    roffset = r1_offset + (r0_offset*r1_numel)
    rindex = r1_index + (r0_index*r1_numel)
    r0_0 = r0_index
    r1_1 = r1_index
    tmp0 = tl.load(tl.make_block_ptr(in_ptr0, shape=[15, 15], strides=[30, 1], block_shape=[R0_BLOCK, R1_BLOCK], order=[1, 0], offsets=[r0_offset, r1_offset]), boundary_check=[0, 1], padding_option='zero')[None, :, :]
    tmp1 = tl.broadcast_to(tmp0, [XBLOCK, R0_BLOCK, R1_BLOCK])
    tmp3 = tl.where(r0_mask & r1_mask, tmp1, 0)
    tmp4 = tl.reshape(tmp3, [XBLOCK, RBLOCK])
    tmp5 = tl.sum(tmp4, 1)[:, None, None]
    tl.store(out_ptr0 + (tl.full([XBLOCK, 1, 1], 0, tl.int32)), tmp5, None)
''', device_str='cuda')

There are a few main differences between this kernel and what Inductor would generate without this PR.

Instead of an r/RBLOCK dimension, we have two reduction dimensions: r0_/R0_BLOCK and r1_/R1_BLOCK.
There are special size and indexing variables for reductions, which don't directly correspond to any kernel dimension (rindex, rnumel, RBLOCK, and roffset). These collapse N-D reduction sizes and indices into 1D. This simplifies the codegen for reductions, which sometimes want to access linear indices instead of N-dimensional ones. Doing things this way allows us to generate N-D loads and stores, but access this data as if it were 1D, minimizing the blast radius of this PR. Although this makes the code more verbose, it shouldn't have a perf impact because the Triton compiler eliminates dead code.
We generate the line tmp4 = tl.reshape(tmp3, [XBLOCK, RBLOCK]) before performing the actual reduction. This reshapes N reduction dimensions into 1D. This allows us to reduce over all N dimensions at once, simplifying the codegen and allowing the Triton compiler to decide the order of processing under the hood.
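To make the flattening concrete, here is a minimal pure-Python model of the persistent kernel above. It is only a sketch: the input values and the `load_block` helper are made up for illustration, and plain lists stand in for Triton blocks. It shows that `rindex = r1_index + r0_index * r1_numel` enumerates the unpadded reduction exactly once, and that summing the reshaped, zero-padded block matches the 2D sum.

```python
# Pure-Python model of the flattened reduction in the persistent kernel above.
R0_NUMEL, R1_NUMEL = 15, 15   # true reduction sizes (taken from the example kernel)
R0_BLOCK, R1_BLOCK = 16, 16   # block sizes, padded up to a power of two

# Hypothetical input: element (r0, r1) holds the value r0 * 100 + r1.
data = [[r0 * 100 + r1 for r1 in range(R1_NUMEL)] for r0 in range(R0_NUMEL)]


def load_block():
    """Mimic a boundary-checked block load: out-of-bounds lanes read as zero."""
    return [
        [data[r0][r1] if (r0 < R0_NUMEL and r1 < R1_NUMEL) else 0
         for r1 in range(R1_BLOCK)]
        for r0 in range(R0_BLOCK)
    ]


block = load_block()

# Analogue of tl.reshape(tmp3, [XBLOCK, RBLOCK]): collapse both reduction dims.
flat = [value for row in block for value in row]

# rindex = r1_index + r0_index * r1_numel is a linear index over the
# unpadded reduction: every value in 0..rnumel-1 appears exactly once.
rindex = {
    (r0, r1): r1 + r0 * R1_NUMEL
    for r0 in range(R0_NUMEL)
    for r1 in range(R1_NUMEL)
}
assert sorted(rindex.values()) == list(range(R0_NUMEL * R1_NUMEL))

total_flat = sum(flat)                       # reduce over the single flattened dim
total_2d = sum(sum(row) for row in data)     # reference: reduce over both dims
assert total_flat == total_2d
```

The padded lanes contribute zero to the sum, which is why the kernel can reduce over the full `RBLOCK` without changing the result.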

Here's an example of a looped reduction:

@triton.jit
def triton_red_fused_sum_0(in_ptr0, out_ptr0, xnumel, r0_numel, r1_numel, XBLOCK : tl.constexpr, R0_BLOCK : tl.constexpr, R1_BLOCK : tl.constexpr):
    xnumel = 3
    r0_numel = 43
    r1_numel = 129
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:, None, None]
    xmask = xindex < xnumel
    r0_base = tl.arange(0, R0_BLOCK)[None, :, None]
    r1_base = tl.arange(0, R1_BLOCK)[None, None, :]
    rnumel = r0_numel * r1_numel
    RBLOCK: tl.constexpr = R0_BLOCK*R1_BLOCK
    rbase = r1_base + (r0_base*r1_numel)
    x0 = xindex
    block_ptr0 = tl.make_block_ptr(in_ptr0, shape=[3, 43, 129], strides=[11094, 258, 1], block_shape=[XBLOCK, R0_BLOCK, R1_BLOCK], order=[2, 1, 0], offsets=[xoffset, 0, 0])
    _tmp2 = tl.full([XBLOCK, R0_BLOCK, R1_BLOCK], 0, tl.float32)
    for r0_offset in range(0, r0_numel, R0_BLOCK):
        r0_index = r0_offset + r0_base
        r0_mask = r0_index < r0_numel
        for r1_offset in range(0, r1_numel, R1_BLOCK):
            r1_index = r1_offset + r1_base
            r1_mask = r1_index < r1_numel
            roffset = r1_offset + (r0_offset*r1_numel)
            rindex = r1_index + (r0_index*r1_numel)
            r0_1 = r0_index
            r1_2 = r1_index
            tmp0 = tl.load(block_ptr0, boundary_check=[0, 1, 2], padding_option='zero', eviction_policy='evict_first')
            tmp1 = tl.broadcast_to(tmp0, [XBLOCK, R0_BLOCK, R1_BLOCK])
            tmp3 = _tmp2 + tmp1
            _tmp2 = tl.where(r0_mask & r1_mask & xmask, tmp3, _tmp2)
            block_ptr0 = tl.advance(block_ptr0, [0, 0, R1_BLOCK])
        block_ptr0 = tl.advance(block_ptr0, [0, R0_BLOCK, (-1)*R1_BLOCK*((128 + R1_BLOCK) // R1_BLOCK)])
    tmp4 = tl.reshape(_tmp2, [XBLOCK, RBLOCK])
    tmp2 = tl.sum(tmp4, 1)[:, None, None]
    tl.store(tl.make_block_ptr(out_ptr0, shape=[3], strides=[1], block_shape=[XBLOCK], order=[0], offsets=[xoffset]), tl.reshape(tmp2, [XBLOCK]).to(tl.float32), boundary_check=[0])
''', device_str='cuda')

In addition to the aforementioned changes to the persistent reduction, multidimensional looped reductions have a few more lines of code:

They calculate indices inside the loop using r0_base and r1_base. For compatibility with existing codegen, these are collapsed to the 1D variant rbase.
Block pointer advancements are more nuanced for multidimensional loops. At the end of each loop body, we emit a tl.advance line which not only increments the pointer in its own dimension, but also undoes the cumulative increments of the previous loop level. This is equivalent to the usual practice in nested loops of starting with a fresh iteration variable at each level. Implementing this required refactoring the way we generate pointer advancements into a new self.pointer_advancements field of the kernel, which categorizes advancements by dimension.
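The advancement rule can be checked with a small pure-Python simulation. This is a sketch rather than Inductor's codegen: `visited_offsets` and `expected_offsets` are hypothetical helpers, and a two-element list stands in for the block pointer's reduction offsets. The inner loop advances by `R1_BLOCK`; the outer loop advances by `R0_BLOCK` and rewinds the inner dimension by `R1_BLOCK` times the inner trip count, mirroring the `(128 + R1_BLOCK) // R1_BLOCK` factor in the kernel above.

```python
def visited_offsets(r0_numel, r1_numel, r0_block, r1_block):
    """Track a 2D block pointer's offsets using only relative advances."""
    ptr = [0, 0]  # [r0 offset, r1 offset]
    visited = []
    # Number of inner-loop iterations, i.e. ceil(r1_numel / r1_block).
    inner_trips = (r1_numel - 1 + r1_block) // r1_block
    for _r0_offset in range(0, r0_numel, r0_block):
        for _r1_offset in range(0, r1_numel, r1_block):
            visited.append(tuple(ptr))
            ptr[1] += r1_block            # inner advance: [0, R1_BLOCK]
        ptr[0] += r0_block                # outer advance in its own dimension
        ptr[1] -= r1_block * inner_trips  # undo the inner loop's cumulative advance
    return visited


def expected_offsets(r0_numel, r1_numel, r0_block, r1_block):
    """Reference: offsets taken directly from fresh loop variables."""
    return [
        (r0, r1)
        for r0 in range(0, r0_numel, r0_block)
        for r1 in range(0, r1_numel, r1_block)
    ]


# Sizes from the looped kernel above; the block sizes are arbitrary choices.
assert visited_offsets(43, 129, 8, 32) == expected_offsets(43, 129, 8, 32)
```

Because the pointer only ever moves by relative amounts, the rewind term is what keeps it in step with the loop variables at every level.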

The biggest difficulty in implementing this feature was that we represented tiling with a tuple like (5,2). In the existing codebase, the compiler can infer that the reduction dimension of (5,2) is 2, since reductions are always the last dimension. This became cumbersome now that we have to support multiple reduction dimensions, so I refactored tiling into a dict like {"x": 5, "r0_": 2, "r1_": 4}. This required quite a few code changes, but I don't think it makes the underlying logic much more complex. This will also make it easier to eventually support simultaneous pointwise and reduction tiling, like {"x": 5, "y": 5, "r0_": 2, "r1_": 4}. (This is not supported today, but we might want to do it eventually.)
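As a rough illustration of why the dict form is easier to work with, the sketch below separates pointwise and reduction dimensions by prefix. The helper names are hypothetical and not taken from the PR, and the rule that reduction prefixes start with "r" is an assumption based on the "r0_"/"r1_" naming used here.

```python
from math import prod


def is_reduction_dim(prefix: str) -> bool:
    # Assumption for this sketch: reduction prefixes start with "r" ("r0_", "r1_", ...).
    return prefix.startswith("r")


def split_tiling(tiling: dict[str, int]) -> tuple[dict[str, int], dict[str, int]]:
    """Split a tiling dict into its pointwise and reduction parts."""
    pointwise = {k: v for k, v in tiling.items() if not is_reduction_dim(k)}
    reduction = {k: v for k, v in tiling.items() if is_reduction_dim(k)}
    return pointwise, reduction


tiling = {"x": 5, "r0_": 2, "r1_": 4}
pointwise, reduction = split_tiling(tiling)
reduction_numel = prod(reduction.values())  # total elements reduced per output
```

With a tuple such as (5, 2), the same split depends on a positional convention, which stops working once there can be more than one reduction dimension.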

The existing tiling algorithm generalized naturally to support reductions. For pointwise kernels, we tile the pointwise dimensions ("x", "y") as is. For reduction kernels, we never tile the "x" dimension, and only tile the reduction dimensions ("r0_", "r1_"). Thus we only ever tile pointwise OR reduction dimensions, but not both. In principle it seems possible to support both, but it would likely require changes to the kernel fusion and autotuning logic. I thought it best to keep this PR as minimal as possible since it already touched a lot of different files.

Unfortunately, these changes weren't enough to get block pointers in some seemingly simple test cases. In some tests for argmax and var_mean, we already collapse reduction dimensions into 1D and generate modular indexing expressions, prior to tiling. So it's not trivial to figure out how to expand the collapsed reduction dimension back to a shape that would simplify the indexing.

To address these cases, this PR adds a new feature to the config.prefer_nd_tiling option, which analyzes reads and writes in the kernel, using the same mod-div pattern matching logic that generates block pointers later on. By matching this pattern, we can solve for the tiling splits which would simplify the indexing expression, and then use that tiling to eliminate the modular indexing and emit a block pointer. This tiling mode is still off by default, but it's important for certain applications where we need to get as many block pointers as possible.
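The idea of solving for a tiling split can be illustrated with a brute-force stand-in. This is not how Inductor does it (the PR describes symbolic mod-div pattern matching); `find_split` is a hypothetical helper that simply tries each divisor of the collapsed reduction size and checks whether the index becomes affine in the two split variables. The example index matches the [15, 15] shape with strides [30, 1] from the persistent kernel above.

```python
def find_split(index_fn, numel):
    """Find (r0_numel, r1_numel, stride0, stride1) such that
    index_fn(r0 * r1_numel + r1) == base + stride0 * r0 + stride1 * r1
    for every (r0, r1), or return None if no 2D split works."""
    base = index_fn(0)
    stride1 = index_fn(1) - base
    for r1_numel in range(2, numel):
        if numel % r1_numel:
            continue  # only exact splits are considered in this sketch
        r0_numel = numel // r1_numel
        stride0 = index_fn(r1_numel) - base
        if all(
            index_fn(r0 * r1_numel + r1) == base + stride0 * r0 + stride1 * r1
            for r0 in range(r0_numel)
            for r1 in range(r1_numel)
        ):
            return r0_numel, r1_numel, stride0, stride1
    return None


def collapsed_index(r):
    # Collapsed 1D reduction index with div/mod: a 15x15 view of rows 30 elements apart.
    return 30 * (r // 15) + (r % 15)


split = find_split(collapsed_index, 15 * 15)  # (15, 15, 30, 1)
```

Once the split is known, the div/mod expression becomes `30 * r0 + r1`, which is the strided form a block pointer can express.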

Test plan

This touches pretty much anything that uses the Triton and Halide backends, so the existing CI provides good coverage. However, 2D reductions are gated behind a few feature flags like config.prefer_nd_tiling and config.triton.tile_reductions, so this really only checks that the PR doesn't break 1D reductions.

In addition to existing CI tests, this PR also adds some new tests that specifically stress 2D reductions:

test_2d_reduction_odd_shapes: test 2D reductions with a variety of ops and sizes. This covers the typical persistent and looped reductions.
test_2d_reduce_no_x_dim: test 2D reductions with no x dimension.
test_2d_welford_reduction: test 2D welford reductions with block pointers.
test_welford_non_block_pointer: test a 2D welford reduction when block pointer analysis fails.
test_reduction_multiple_discontiguous_dims: test reducing over more than one discontiguous dimension. We won't get a block pointer for this case, since that would require 3D tiling, but we're currently limited to 2D.
test_2d_reduction_multi_kernel: test multi kernel autotuning on a 2D softmax kernel.
test_enable_tiled_reductions: test that config.triton.tile_reductions enables/disables this feature.

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @gujinghui @PenghuiCheng @jianyuh @min-jean-cho @yanbing-j @Guobing-Chen @Xia-Weiwen @snadampal @voznesenskym @penguinwu @EikanWang @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @aakhundov

torch/_inductor/codegen/triton.py

torch/_inductor/codegen/simd.py

facebook-github-bot · 2024-12-31T00:03:50Z

@blaine-rister has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

blaine-rister · 2024-12-31T00:09:23Z

Just to be safe, this draft PR tests the CI with tiled reductions enabled by default: #144008

blaine-rister · 2024-12-31T03:54:51Z

Just to be safe, this draft PR tests the CI with tiled reductions enabled by default: #144008

Tiled reductions by default turned out to break a few things, including var_mean in test_torchinductor.py. The root cause is that some nodes are missing reduction ranges. I think that the existing code was ignoring this because the default tiling isn't part of ranked_tilings.

Given the weight of this PR, I think it makes sense to merge this as is and handle the missing reduction ranges in a follow-up.

blaine-rister · 2024-12-31T03:55:37Z

@pytorchbot merge

pytorchmergebot · 2024-12-31T03:57:15Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).


…44041)

# Issue

This PR cleans up an edge case that wasn't handled by #137243. The existing tiling code assumes that `node.get_ranges()` is a reliable source of pointwise and reduction numels. This is true for pointwise kernels, but the situation is more complicated with reductions. Since reductions change the number of elements in a tensor, not all ops within a reduction kernel will have the same number of iterations. For example, `var_mean` fuses pointwise division with the output of reduction sum, and the division lacks the corresponding reduction ranges.

# Fix

Instead of getting numels from `node.get_ranges()`, explicitly pass the global pointwise and reduction numels to the relevant tiling functions. In `SIMDKernel.complete_partial_tiling`, we solve for the missing numel by dividing the global numel by the partial tiling's numel. This ensures all tilings have the correct global numel.

Also, in `SIMDKernel.is_compatible`, add the global reduction numel to node ranges that are missing it. For example, `{"x": 8, "r0_": 8}` is compatible with a node of ranges `([8], [])` when we have `reduction_numel=8`.

Finally, this PR generalizes some of the existing codegen to handle multiple reduction dims. We already had code to ignore reduction splits for pointwise kernels, but it only worked for 1D reductions. Now it can handle ND.

# Test plan

This PR parametrizes the existing CI test for `var_mean` to also run with tiled reductions. It also adds a new test checking that `var_mean` generates 2D tilings (with tiled reduction enabled). These new tests would fail on the current main branch.

Pull Request resolved: #144041
Approved by: https://github.com/jansel
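The two rules in the fix can be sketched in a few lines of Python. The functions below are hypothetical stand-ins, not the real `SIMDKernel.complete_partial_tiling` or `SIMDKernel.is_compatible` (their actual signatures and checks are not shown in this thread). They only model the arithmetic described above, and they assume reduction prefixes start with "r".

```python
from math import prod


def complete_tiling_sketch(partial, numel, missing_prefix):
    """Fill in one missing dim so the tiling's product equals the global numel."""
    partial_numel = prod(partial.values())
    if numel % partial_numel != 0:
        raise ValueError("partial tiling does not evenly divide the global numel")
    return {**partial, missing_prefix: numel // partial_numel}


def is_compatible_sketch(tiling, node_ranges, reduction_numel):
    """Compare a tiling against a node's (pointwise, reduction) ranges.

    Nodes that lack reduction ranges (e.g. a pointwise op fused after the
    reduction) are treated as if they had the global reduction numel.
    """
    pointwise_ranges, reduction_ranges = node_ranges
    if not reduction_ranges:
        reduction_ranges = [reduction_numel]
    tiled_pointwise = prod(v for k, v in tiling.items() if not k.startswith("r"))
    tiled_reduction = prod(v for k, v in tiling.items() if k.startswith("r"))
    return (
        tiled_pointwise == prod(pointwise_ranges)
        and tiled_reduction == prod(reduction_ranges)
    )


# The example from the description: ranges ([8], []) with reduction_numel=8.
compatible = is_compatible_sketch({"x": 8, "r0_": 8}, ([8], []), reduction_numel=8)
completed = complete_tiling_sketch({"r1_": 4}, numel=8, missing_prefix="r0_")
```

Passing the global numels explicitly means the result no longer depends on which node in the fused kernel happens to be inspected.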

# Issue

#137243 introduced a feature where the ND tiling algorithm analyzes memory dependencies. It iterates over all `Dep`'s of the kernel. However, the analysis is only applicable to `MemoryDep` instances, which are a subclass of `Dep`. In particular, it doesn't work for `StarDep`'s, for the reasons described here: https://github.com/pytorch/pytorch/blob/main/torch/_inductor/codegen/simd.py#L1653

# Fix

This PR changes the algorithm to only iterate over `MemoryDep` instances.

# Testing

Parameterized an existing test for `torch.bucketize` to also run with ND tiling. This test emits a node with `StarDep`'s. Without this PR, the compiler would crash on this test case.

Pull Request resolved: #144497
Approved by: https://github.com/eellison
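The shape of that fix is a type filter before the analysis runs. The sketch below uses stand-in classes that only borrow the names `Dep`, `MemoryDep`, and `StarDep`; their fields are invented for illustration and do not reflect Inductor's real definitions.

```python
from dataclasses import dataclass


class Dep:
    """Stand-in base class for a kernel dependency."""


@dataclass
class MemoryDep(Dep):
    name: str
    index: str  # hypothetical: an indexing expression the tiling analysis can inspect


@dataclass
class StarDep(Dep):
    name: str   # hypothetical: depends on a whole buffer, with no index to analyze


def deps_for_tiling_analysis(deps):
    """Keep only the dependencies that carry an index expression."""
    return [dep for dep in deps if isinstance(dep, MemoryDep)]


deps = [MemoryDep("buf0", "30*r0 + r1"), StarDep("buf1")]
analyzable = deps_for_tiling_analysis(deps)
```

Filtering up front keeps the pattern matcher from ever seeing a dependency it cannot interpret.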

blaine-rister and others added 30 commits July 26, 2024 00:22

draft: prefer tiling to higher dimensions

b88c192

fix none dense stride

6fa6c69

add unit test; try to fix for none order path

7de6dff

fix passing [] rather than None

1fbf80b

fix bug for test_index_put_reinplace. return x for actual_strides too

1b68e78

draft: prefer tiling to higher dimensions

462c87a

bug fix

cfc0d15

Merge branch 'brister/prefer_tiling' of github.com:pytorch/pytorch into brister/prefer_tiling

11420e4

fix 3D leading dims

8183766

remove successful tests

8362008

Merge remote-tracking branch 'origin/main' into brister/prefer_tiling

5f0c245

update tests for ND pointwise. Tiled reduction is out of scope

29bddfa

pass type checking, parametrize a few more tests

762edd2

fix test

bac7d50

Add a test for 5D tensors

6e502de

fix type annotations for python 3.8

4e1b398

sort tilings in descending order

8543226

remove unexpected success; fix bug for channel_last

a313b78

skip non-dense+reinterpretview; add list convert

607a229

remove unexpected success

6d0afc9

add a new API require_exact_strides

6419105

for view operation, it's ok the stride is not equal.

1ab9ccd

add conversion for symint and symmul

1b5cb3f

Merge branch 'brister/prefer_tiling' into brister/custom_strides_pass

4a77ba9

fix none dense stride

bf01a24

add unit test; try to fix for none order path

273aa7c

fix passing [] rather than None

350a88e

fix bug for test_index_put_reinplace. return x for actual_strides too

d8381d7

remove successful tests

0a003d3

remove unexpected success; fix bug for channel_last

60be746

blaine-rister added 2 commits December 19, 2024 12:34

simplify 3d tiling

b885fd1

revert is_compatible_with_numels

76ed913

blaine-rister commented Dec 19, 2024

torch/_inductor/codegen/triton.py (outdated)

blaine-rister added 4 commits December 19, 2024 13:15

revert mask changes

16d1589

revert format

0dbc5ea

restore type hint

98740ac

revert format

931ecb9

blaine-rister commented Dec 19, 2024

torch/_inductor/codegen/simd.py

generalize index expression tiling search

04b7a65

blaine-rister requested a review from jansel December 20, 2024 08:16

jansel approved these changes Dec 22, 2024

blaine-rister added 3 commits December 30, 2024 14:17

add fusion test

4756597

add another multi reduction test

a3b1bff

Merge branch 'main' into brister/tiled_reduction

a46235d

pytorch-bot bot added the ciflow/trunk label Dec 31, 2024

pytorchmergebot added the merging label Dec 31, 2024

pytorchmergebot added the Merged label Dec 31, 2024

pytorchmergebot closed this in a2753e3 Dec 31, 2024

pytorchmergebot removed the merging label Dec 31, 2024

blaine-rister mentioned this pull request Dec 31, 2024

[Inductor] Generalize tiling algorithm to handle fused reductions #144041

Closed

blaine-rister mentioned this pull request Jan 9, 2025

[Inductor] Restrict ND tiling analysis to MemoryDeps #144497

Closed

github-actions bot deleted the brister/tiled_reduction branch February 1, 2025 02:06
