[mlir][SCF] scf.parallel: Make reductions part of the terminator #75314

Merged

Conversation

@matthias-springer (Member) commented Dec 13, 2023

This commit makes reductions part of the terminator. Instead of scf.yield, scf.reduce now terminates the body of scf.parallel ops. scf.reduce may contain an arbitrary number of reductions, with one region per reduction.

Example:

%init = arith.constant 0.0 : f32
%r:2 = scf.parallel (%iv) = (%lb) to (%ub) step (%step) init (%init, %init)
    -> f32, f32 {
  %elem_to_reduce1 = load %buffer1[%iv] : memref<100xf32>
  %elem_to_reduce2 = load %buffer2[%iv] : memref<100xf32>
  scf.reduce(%elem_to_reduce1, %elem_to_reduce2 : f32, f32) {
    ^bb0(%lhs : f32, %rhs: f32):
      %res = arith.addf %lhs, %rhs : f32
      scf.reduce.return %res : f32
  }, {
    ^bb0(%lhs : f32, %rhs: f32):
      %res = arith.mulf %lhs, %rhs : f32
      scf.reduce.return %res : f32
  }
}

scf.reduce operations can no longer be interleaved with other ops in the body of scf.parallel. This simplifies the op and makes it possible to assign the RecursiveMemoryEffects trait to scf.reduce. (This was not possible before because the op was not a terminator, which would have caused it to be DCE'd.)
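For contrast, a minimal sketch of the previously allowed style, which no longer verifies after this change (the buffer and the trailing "some.op" are placeholders for illustration, not taken from the PR):

%r = scf.parallel (%iv) = (%lb) to (%ub) step (%step) init (%init) -> f32 {
  %elem = load %buffer[%iv] : memref<100xf32>
  // Old form: scf.reduce in the middle of the body, scf.yield as the terminator.
  scf.reduce(%elem) : f32 {
    ^bb0(%lhs : f32, %rhs : f32):
      %res = arith.addf %lhs, %rhs : f32
      scf.reduce.return %res : f32
  }
  "some.op"() : () -> ()
  scf.yield
}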

@llvmbot (Member) commented Dec 13, 2023

@llvm/pr-subscribers-mlir-spirv
@llvm/pr-subscribers-mlir-gpu
@llvm/pr-subscribers-mlir-sparse
@llvm/pr-subscribers-mlir-openmp
@llvm/pr-subscribers-flang-openmp
@llvm/pr-subscribers-mlir-scf

@llvm/pr-subscribers-mlir

Author: Matthias Springer (matthias-springer)

Changes

scf.reduce itself does not have any side effects, but its body may.


Full diff: https://github.com/llvm/llvm-project/pull/75314.diff

1 file affected:

  • (modified) mlir/include/mlir/Dialect/SCF/IR/SCFOps.td (+2-1)
diff --git a/mlir/include/mlir/Dialect/SCF/IR/SCFOps.td b/mlir/include/mlir/Dialect/SCF/IR/SCFOps.td
index 573e804b405e84..3948f145d50bd6 100644
--- a/mlir/include/mlir/Dialect/SCF/IR/SCFOps.td
+++ b/mlir/include/mlir/Dialect/SCF/IR/SCFOps.td
@@ -853,7 +853,8 @@ def ParallelOp : SCF_Op<"parallel",
 // ReduceOp
 //===----------------------------------------------------------------------===//
 
-def ReduceOp : SCF_Op<"reduce", [HasParent<"ParallelOp">]> {
+def ReduceOp : SCF_Op<"reduce", [
+    HasParent<"ParallelOp">, RecursiveMemoryEffects]> {
   let summary = "reduce operation for parallel for";
   let description = [{
     "scf.reduce" is an operation occurring inside "scf.parallel" operations.

@matthias-springer (Member, Author)

Based on the failing tests, it looks like this change can cause scf.reduce ops to fold away. I'm wondering if there is a way to indicate that this op is free of side effects while at the same time preventing it from being folded away.

Motivation for this change: I am changing #75127 such that all ops with unknown memory effects will be rejected by the buffer deallocation pass. scf.reduce ops are rejected in my current prototype.

@Hardcode84 (Contributor)

It's checked in wouldOpBeTriviallyDead, which has special cases for terminators and SymbolOpInterface but otherwise just checks memory interfaces. We should probably add something like a DoNotErase trait and check it in wouldOpBeTriviallyDead as well.

@joker-eph (Collaborator)

I'm wondering if there is a way to indicate that this op is free of side effects while at the same time preventing it from being folded away

Why is this a problem to fold it away?

@Hardcode84 (Contributor)

Why is this a problem to fold it away?

scf.reduce has weird semantics: it describes how values should be reduced between scf.parallel loop iterations. The number of scf.reduce ops must match the number of reduction variables, and they should never be removed.

  %0:2 = scf.parallel (%i0, %i1) = (%c1, %c3) to (%c2, %c6) step (%c1, %c3) init(%A, %B) -> (index, index) {
    scf.reduce(%i0) : index {
    ^bb0(%lhs: index, %rhs: index):
      %1 = arith.addi %lhs, %rhs : index
      scf.reduce.return %1 : index
    }
    scf.reduce(%i1) : index {
    ^bb0(%lhs: index, %rhs: index):
      %2 = arith.muli %lhs, %rhs : index
      scf.reduce.return %2 : index
    }
    scf.yield
  }

@matthias-springer (Member, Author)

scf.parallel expects that the number of op results is the same as the number of scf.reduce ops in its body. If we fold it away, the enclosing scf.parallel op will no longer verify.

scf.reduce is kind of like a terminator that yields values to the enclosing op. But it does not have to be the last op and can appear anywhere in the body of the scf.parallel.

@matthias-springer (Member, Author)

It's checked in wouldOpBeTriviallyDead, which has special cases for terminators and SymbolOpInterface but otherwise just checks memory interfaces. We should probably add something like a DoNotErase trait and check it in wouldOpBeTriviallyDead as well.

Instead of DoNotErase, I think of it more as a "this op conceptually belongs to the enclosing op" trait. The Terminator trait is often used like that. E.g., a terminator is not considered trivially dead, presumably because removing it would make the enclosing op invalid. Also, some transformations (e.g., bufferization) process the enclosing op and its terminators in one go, as if they were one op.

It could be useful to have such a "belongs to the enclosing op" (not sure what to call it) trait. The Terminator trait could have it as a dependent trait.

@joker-eph (Collaborator)

scf.reduce has weird semantics: it describes how values should be reduced between scf.parallel loop iterations.

Ah right, I misread; in my head I was thinking about the loop construct itself.

Any reason the multiple scf.reduce ops can't be consolidated into a terminator?

(I don't know if there are other ops like scf.reduce that would want to "communicate with the parent op" without being a terminator or having side effects.)

@Hardcode84 (Contributor) commented Dec 14, 2023

It could be useful to have such a "belongs to the enclosing op" (not sure what to call it) trait. The Terminator trait could have it as a dependent trait.

+1, yes, I think "belongs to the enclosing op" better conveys the intent, but I don't have a good short name for it either.

Any reason the multiple scf.reduce ops can't be consolidated into a terminator?

Yes, I think this would be a better solution long term, but it will be quite a lot of work and probably a lot of churn both upstream and downstream. We probably want the trait as an interim solution before this happens.

(But I'm just a passerby; who owns the SCF dialect?)

@joker-eph (Collaborator) commented Dec 14, 2023

Another way to keep these alive is to return a token that is passed to the yield.

(I'm not convinced by a traits-based approach here: this does not seem like general semantics we're modeling.)
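To make the token idea concrete, a rough sketch in hypothetical syntax (the !scf.reduce_token type and the result-producing variant of scf.reduce do not exist; they are invented here purely for illustration):

%t = scf.reduce(%elem) : f32 -> !scf.reduce_token {
^bb0(%lhs : f32, %rhs : f32):
  %res = arith.addf %lhs, %rhs : f32
  scf.reduce.return %res : f32
}
// Yielding the token ties the reduce op to the parallel op's results,
// so DCE cannot drop the reduce without also breaking the yield.
scf.yield %t : !scf.reduce_token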

@Hardcode84 (Contributor) commented Dec 14, 2023

Another way to keep these alive is to return a token that is passed to the yield.

With tokens, you would still have to somehow convince CSE not to merge two identical reduce ops.

@matthias-springer (Member, Author) commented Dec 14, 2023

I think it would be OK to CSE two identical reduce ops. The same reduce op token would be passed to the yield twice, indicating that there are two reductions. scf.reduce is just like a C++ lambda. (In theory, scf.reduce could even be allowed outside of scf.parallel ops then.)

However, this approach would change the op semantics a bit. I believe at the moment the reductions are guaranteed to be performed in the order in which they appear in the loop body. This would no longer be the case. (But I also cannot think of any case in which one would want to depend on a certain order.)

(If we go with the token approach, we could also just use scf.reduce as a terminator, with one region per reduction.)

@joker-eph (Collaborator)

I think it would be OK to CSE two identical reduce ops. The same reduce op token would be passed to the yield twice, indicating that there are two reductions. scf.reduce is just like a C++ lambda. (In theory, scf.reduce could even be allowed outside of scf.parallel ops then.)

It does not seem entirely safe to me: we can't structurally guarantee that the use-def chain can be tracked back to the reduce op (think "function outlining", for example).

I believe at the moment the reductions are guaranteed to be performed in the order in which they appear in the loop body

Is this documented? Otherwise I'm not sure where the guarantee would come from.

we could also just use scf.reduce as a terminator, with one region per reduction.

Seems like the most robust option to me?

@Hardcode84 (Contributor)

I'm generally +1 on making scf.reduce a terminator with multiple regions, but as I said previously, it will be a lot of work.

Is anyone really ready to commit their time to this?

@matthias-springer changed the title from "[mlir][SCF] Add RecursiveMemoryEffects to scf.reduce" to "[mlir][SCF] scf.parallel: Make reductions part of the terminator" on Dec 16, 2023
@matthias-springer (Member, Author)

I believe at the moment the reductions are guaranteed to be performed in the order in which they appear in the loop body

Is this documented? Otherwise I'm not sure where the guarantee would come from.

You are right, this is not documented anywhere.

@Hardcode84 (Contributor)

Nice!

This probably also requires a PSA announcement on the Discourse forum.

@Hardcode84 (Contributor)

Looks good to me, but wait for other people's opinions.

@Hardcode84 (Contributor) left a comment

I suggest waiting a couple more days and, if there aren't any more comments, merging it.

@kiranchandramohan (Contributor)

Not against this change, just a drive-by comment.

This might make it difficult for frontends to directly lower to scf.parallel, since frontend reductions like in OpenMP do not require the reduction to be the last operation.

@Hardcode84 (Contributor) commented Dec 19, 2023

This might make it difficult for frontends to directly lower to scf.parallel, since frontend reductions like in OpenMP do not require the reduction to be the last operation.

That's a good point, but I think this is solvable in frontends, and having clean semantics at the middle level is important.

E.g., in our Python compiler we support code like this:

sum = 0
for i in prange(size): # parallel loop
  ... some code
  sum += a[i]
  ... some more code

print(sum)

We are gradually uplifting cf -> scf.while -> scf.for -> scf.parallel, and during the scf.for -> scf.parallel uplifting we transform sum += into scf.reduce and move it to the end of the block (see the sketch below).
And if the user is doing something really weird with sum, we just leave the loop as scf.for.
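Under the terminator form proposed in this PR, the uplifted loop would end up looking roughly like this (a sketch; the SSA names, bounds, memref type, and %zero initializer are made up for illustration):

%sum = scf.parallel (%i) = (%c0) to (%size) step (%c1) init (%zero) -> f32 {
  // ... some code
  %elem = memref.load %a[%i] : memref<?xf32>
  // ... some more code
  // The "sum += a[i]" is rewritten into the reduction region of the terminator.
  scf.reduce(%elem : f32) {
    ^bb0(%lhs : f32, %rhs : f32):
      %res = arith.addf %lhs, %rhs : f32
      scf.reduce.return %res : f32
  }
}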

@matthias-springer force-pushed the scf_reduce_side_effects branch 2 times, most recently from b928c05 to c3ebd54 on December 20, 2023 at 01:50