[mlir][SCF] scf.parallel: Make reductions part of the terminator #75314

Merged

Conversation

@matthias-springer (Member) commented Dec 13, 2023

This commit makes reductions part of the terminator. Instead of scf.yield, scf.reduce now terminates the body of scf.parallel ops. scf.reduce may contain an arbitrary number of reductions, with one region per reduction.

Example:

%init = arith.constant 0.0 : f32
%r:2 = scf.parallel (%iv) = (%lb) to (%ub) step (%step) init (%init, %init)
    -> f32, f32 {
  %elem_to_reduce1 = load %buffer1[%iv] : memref<100xf32>
  %elem_to_reduce2 = load %buffer2[%iv] : memref<100xf32>
  scf.reduce(%elem_to_reduce1, %elem_to_reduce2 : f32, f32) {
    ^bb0(%lhs : f32, %rhs: f32):
      %res = arith.addf %lhs, %rhs : f32
      scf.reduce.return %res : f32
  }, {
    ^bb0(%lhs : f32, %rhs: f32):
      %res = arith.mulf %lhs, %rhs : f32
      scf.reduce.return %res : f32
  }
}

scf.reduce operations can no longer be interleaved with other ops in the body of scf.parallel. This simplifies the op and makes it possible to assign the RecursiveMemoryEffects trait to scf.reduce. (This was not possible before because the op was not a terminator, which would have caused it to be DCE'd.)
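For contrast, a minimal sketch of the previously allowed style, which no longer verifies after this change (the buffer and the trailing "some.op" are placeholders for illustration, not taken from the PR):

%r = scf.parallel (%iv) = (%lb) to (%ub) step (%step) init (%init) -> f32 {
  %elem = load %buffer[%iv] : memref<100xf32>
  // Old form: scf.reduce in the middle of the body, scf.yield as the terminator.
  scf.reduce(%elem) : f32 {
    ^bb0(%lhs : f32, %rhs : f32):
      %res = arith.addf %lhs, %rhs : f32
      scf.reduce.return %res : f32
  }
  "some.op"() : () -> ()
  scf.yield
}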

@llvmbot (Member) commented Dec 13, 2023

@llvm/pr-subscribers-mlir-spirv
@llvm/pr-subscribers-mlir-gpu
@llvm/pr-subscribers-mlir-sparse
@llvm/pr-subscribers-mlir-openmp
@llvm/pr-subscribers-flang-openmp
@llvm/pr-subscribers-mlir-scf

@llvm/pr-subscribers-mlir

Author: Matthias Springer (matthias-springer)

Changes

scf.reduce itself does not have any side effects, but its body may.


Full diff: https://github.com/llvm/llvm-project/pull/75314.diff

1 file affected:

  • (modified) mlir/include/mlir/Dialect/SCF/IR/SCFOps.td (+2-1)
diff --git a/mlir/include/mlir/Dialect/SCF/IR/SCFOps.td b/mlir/include/mlir/Dialect/SCF/IR/SCFOps.td
index 573e804b405e84..3948f145d50bd6 100644
--- a/mlir/include/mlir/Dialect/SCF/IR/SCFOps.td
+++ b/mlir/include/mlir/Dialect/SCF/IR/SCFOps.td
@@ -853,7 +853,8 @@ def ParallelOp : SCF_Op<"parallel",
 // ReduceOp
 //===----------------------------------------------------------------------===//
 
-def ReduceOp : SCF_Op<"reduce", [HasParent<"ParallelOp">]> {
+def ReduceOp : SCF_Op<"reduce", [
+    HasParent<"ParallelOp">, RecursiveMemoryEffects]> {
   let summary = "reduce operation for parallel for";
   let description = [{
     "scf.reduce" is an operation occurring inside "scf.parallel" operations.

@matthias-springer (Member, Author)

Based on the failing tests, it looks like this change can cause scf.reduce ops to fold away. I'm wondering if there is a way to indicate that this op is free of side effects while at the same time preventing it from being folded away.

Motivation for this change: I am changing #75127 such that all ops with unknown memory effects will be rejected by the buffer deallocation pass. scf.reduce ops are rejected in my current prototype.

@Hardcode84 (Contributor)

It's checked in wouldOpBeTriviallyDead, which has special cases for terminators and SymbolOpInterface but otherwise just checks memory interfaces. We should probably add something like a DoNotErase trait and check it in wouldOpBeTriviallyDead as well.

@joker-eph (Collaborator)

I'm wondering if there is a way to indicate that this op is free of side effects while at the same time preventing it from being folded away

Why is this a problem to fold it away?

@Hardcode84 (Contributor)

Why is this a problem to fold it away?

scf.reduce has weird semantics: it describes how values should be reduced between scf.parallel loop iterations. The number of scf.reduce ops must match the number of reduction variables, and they should never be removed.

  %0:2 = scf.parallel (%i0, %i1) = (%c1, %c3) to (%c2, %c6) step (%c1, %c3) init(%A, %B) -> (index, index) {
    scf.reduce(%i0) : index {
    ^bb0(%lhs: index, %rhs: index):
      %1 = arith.addi %lhs, %rhs : index
      scf.reduce.return %1 : index
    }
    scf.reduce(%i1) : index {
    ^bb0(%lhs: index, %rhs: index):
      %2 = arith.muli %lhs, %rhs : index
      scf.reduce.return %2 : index
    }
    scf.yield
  }

@matthias-springer (Member, Author)

scf.parallel expects that the number of op results is the same as the number of scf.reduce ops in its body. If we fold it away, the enclosing scf.parallel op will no longer verify.

scf.reduce is kind of like a terminator that yields values to the enclosing op. But it does not have to be the last op and can appear anywhere in the body of the scf.parallel.

@matthias-springer (Member, Author)

It's checked in wouldOpBeTriviallyDead, which has special cases for terminators and SymbolOpInterface but otherwise just checks memory interfaces. We should probably add something like a DoNotErase trait and check it in wouldOpBeTriviallyDead as well.

Instead of DoNotErase, I think of it more as a "this op conceptually belongs to the enclosing op" trait. The Terminator trait is often used like that. E.g., a terminator is not considered trivially dead, presumably because removing it would make the enclosing op invalid. Also, some transformations (e.g., bufferization) process the enclosing op and its terminators in one go, as if they were one op.

It could be useful to have such a "belongs to the enclosing op" (not sure what to call it) trait. The Terminator trait could have it as a dependent trait.

@joker-eph (Collaborator)

scf.reduce has weird semantics: it describes how values should be reduced between scf.parallel loop iterations.

Ah right, I misread; in my head I was thinking about the loop construct itself.

Any reason the multiple scf.reduce ops can't be consolidated into a terminator?

(I don't know if there are other ops like scf.reduce that would want to "communicate with the parent op" without being a terminator or having side effects.)

@Hardcode84 (Contributor) commented Dec 14, 2023

It could be useful to have such a "belongs to the enclosing op" (not sure what to call it) trait. The Terminator trait could have it as a dependent trait.

+1, yes, I think "belongs to the enclosing op" better conveys the intent, but I don't have a good short name for it either.

Any reason the multiple scf.reduce ops can't be consolidated into a terminator?

Yes, I think this would be a better solution long term, but it will be quite a lot of work and probably a lot of churn both upstream and downstream. We probably want the trait as an interim solution before this happens.

(But I'm just a passerby; who owns the SCF dialect?)

@joker-eph (Collaborator) commented Dec 14, 2023

Another way to keep these alive is to return a token that is passed to the yield.

(I'm not convinced by a traits-based approach here: this does not seem like general semantics we're modeling.)
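To make the token idea concrete, a rough sketch in hypothetical syntax (the !scf.reduce_token type and the result-producing variant of scf.reduce do not exist; they are invented here purely for illustration):

%t = scf.reduce(%elem) : f32 -> !scf.reduce_token {
^bb0(%lhs : f32, %rhs : f32):
  %res = arith.addf %lhs, %rhs : f32
  scf.reduce.return %res : f32
}
// Yielding the token ties the reduce op to the parallel op's results,
// so DCE cannot drop the reduce without also breaking the yield.
scf.yield %t : !scf.reduce_token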

@Hardcode84 (Contributor) commented Dec 14, 2023

Another way to keep these alive is to return a token that is passed to the yield.

With tokens, you would still have to somehow convince CSE not to merge two identical reduce ops.

@matthias-springer (Member, Author) commented Dec 14, 2023

I think it would be OK to CSE two identical reduce ops. The same reduce op token would be passed to the yield twice, indicating that there are two reductions. scf.reduce is just like a C++ lambda. (In theory, scf.reduce could even be allowed outside of scf.parallel ops then.)

However, this approach would change the op semantics a bit. I believe at the moment the reductions are guaranteed to be performed in the order in which they appear in the loop body. This would no longer be the case. (But I also cannot think of any case in which one would want to depend on a certain order.)

(If we go with the token approach, we could also just use scf.reduce as a terminator, with one region per reduction.)

@joker-eph (Collaborator)

I think it would be OK to CSE two identical reduce ops. The same reduce op token would be passed to the yield twice, indicating that there are two reductions. scf.reduce is just like a C++ lambda. (In theory, scf.reduce could even be allowed outside of scf.parallel ops then.)

It does not seem entirely safe to me: we can't structurally guarantee that the use-def chain can be tracked back to the reduce op (think "function outlining", for example).

I believe at the moment the reductions are guaranteed to be performed in the order in which they appear in the loop body

Is this documented? Otherwise I'm not sure where the guarantee would come from.

we could also just use scf.reduce as a terminator, with one region per reduction.

Seems like the most robust option to me?

@Hardcode84 (Contributor)

I'm generally +1 on making scf.reduce a terminator with multiple regions, but as I said previously, it will be a lot of work.

Is anyone really ready to commit their time to this?

@matthias-springer changed the title from "[mlir][SCF] Add RecursiveMemoryEffects to scf.reduce" to "[mlir][SCF] scf.parallel: Make reductions part of the terminator" on Dec 16, 2023
@matthias-springer (Member, Author)

I believe at the moment the reductions are guaranteed to be performed in the order in which they appear in the loop body

Is this documented? Otherwise I'm not sure where the guarantee would come from.

You are right, this is not documented anywhere.

@Hardcode84 (Contributor)

Nice!

This probably also requires a PSA announcement on the Discourse forum.

@Hardcode84 (Contributor)

Looks good to me, but wait for other people's opinions.

@Hardcode84 (Contributor) left a comment

I suggest waiting a couple more days and, if there aren't any more comments, merging it.

@kiranchandramohan (Contributor)

Not against this change, just a drive-by comment.

This might make it difficult for frontends to directly lower to scf.parallel, since frontend reductions like in OpenMP do not require the reduction to be the last operation.

@Hardcode84 (Contributor) commented Dec 19, 2023

This might make it difficult for frontends to directly lower to scf.parallel, since frontend reductions like in OpenMP do not require the reduction to be the last operation.

That's a good point, but I think this is solvable in frontends, and having clean semantics at the middle level is important.

E.g., in our Python compiler we support code like this:

sum = 0
for i in prange(size): # parallel loop
  ... some code
  sum += a[i]
  ... some more code

print(sum)

We are gradually uplifting cf -> scf.while -> scf.for -> scf.parallel, and during the scf.for -> scf.parallel uplifting we transform sum += into scf.reduce and move it to the end of the block (see the sketch below).
And if the user is doing something really weird with sum, we just leave the loop as scf.for.
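Under the terminator form proposed in this PR, the uplifted loop would end up looking roughly like this (a sketch; the SSA names, bounds, memref type, and %zero initializer are made up for illustration):

%sum = scf.parallel (%i) = (%c0) to (%size) step (%c1) init (%zero) -> f32 {
  // ... some code
  %elem = memref.load %a[%i] : memref<?xf32>
  // ... some more code
  // The "sum += a[i]" is rewritten into the reduction region of the terminator.
  scf.reduce(%elem : f32) {
    ^bb0(%lhs : f32, %rhs : f32):
      %res = arith.addf %lhs, %rhs : f32
      scf.reduce.return %res : f32
  }
}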

@matthias-springer force-pushed the scf_reduce_side_effects branch 2 times, most recently from b928c05 to c3ebd54 on December 20, 2023 at 01:50