[mlir][scf] Add simple LICM pattern for scf.while
#76370
Conversation
Move non-side-effecting ops from the `before` region if all their args are defined outside the loop. This is a cleanup needed for `scf.while` -> `scf.for` uplifting llvm#76108, which expects a `before` block consisting of a single `cmp` op.
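For illustration, here is roughly what the pattern does to the new `@while_licm` test added in this PR (the input is the test verbatim; the hoisted form is a sketch inferred from the updated CHECK lines):

```mlir
// Input: both adds only use values defined outside the loop and have no
// memory effects, so they qualify for hoisting.
func.func @while_licm(%arg1: i32, %arg2: i32, %arg3: i32) {
  scf.while () : () -> () {
    %val0 = arith.addi %arg1, %arg2 : i32
    %val = arith.addi %val0, %arg3 : i32
    %condition = "test.condition"(%val) : (i32) -> i1
    scf.condition(%condition)
  } do {
  ^bb0():
    "test.test"() : () -> ()
    scf.yield
  }
  return
}
```

becomes:

```mlir
// The adds are moved in front of the loop; only the condition computation
// remains in the `before` block.
func.func @while_licm(%arg1: i32, %arg2: i32, %arg3: i32) {
  %val0 = arith.addi %arg1, %arg2 : i32
  %val = arith.addi %val0, %arg3 : i32
  scf.while () : () -> () {
    %condition = "test.condition"(%val) : (i32) -> i1
    scf.condition(%condition)
  } do {
  ^bb0():
    "test.test"() : () -> ()
    scf.yield
  }
  return
}
```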
@llvm/pr-subscribers-mlir-scf @llvm/pr-subscribers-mlir Author: Ivan Butygin (Hardcode84). Changes: Move non-side-effecting ops from the `before` region if all their args are defined outside the loop. Full diff: https://github.com/llvm/llvm-project/pull/76370.diff. 2 Files Affected:
diff --git a/mlir/lib/Dialect/SCF/IR/SCF.cpp b/mlir/lib/Dialect/SCF/IR/SCF.cpp
index 5570c2ec688c8a..1f5c420e4775fa 100644
--- a/mlir/lib/Dialect/SCF/IR/SCF.cpp
+++ b/mlir/lib/Dialect/SCF/IR/SCF.cpp
@@ -3810,6 +3810,36 @@ struct WhileRemoveUnusedArgs : public OpRewritePattern<WhileOp> {
}
};
+/// Simple Loop Invariant Code Motion pattern for `scf.while` op.
+/// `scf.while` to `scf.for` uplifting expects `before` block consisting of
+/// single `cmp` op.
+/// Pattern moves ops from `before` block, doesn't visit nested regions.
+struct SCFWhileLICM : public OpRewritePattern<WhileOp> {
+ using OpRewritePattern::OpRewritePattern;
+
+ LogicalResult matchAndRewrite(WhileOp loop,
+ PatternRewriter &rewriter) const override {
+ bool changed = false;
+
+ DominanceInfo dom;
+ Block *body = loop.getBeforeBody();
+ for (Operation &op :
+ llvm::make_early_inc_range(body->without_terminator())) {
+ if (llvm::any_of(op.getOperands(), [&](Value arg) {
+ return !dom.properlyDominates(arg, loop);
+ }))
+ continue;
+
+ if (!isMemoryEffectFree(&op))
+ continue;
+
+ rewriter.updateRootInPlace(&op, [&]() { op.moveBefore(loop); });
+ changed = true;
+ }
+ return success(changed);
+ }
+};
+
/// Remove duplicated ConditionOp args.
struct WhileRemoveDuplicatedResults : public OpRewritePattern<WhileOp> {
using OpRewritePattern::OpRewritePattern;
@@ -3879,7 +3909,7 @@ void WhileOp::getCanonicalizationPatterns(RewritePatternSet &results,
results.add<RemoveLoopInvariantArgsFromBeforeBlock,
RemoveLoopInvariantValueYielded, WhileConditionTruth,
WhileCmpCond, WhileUnusedResult, WhileRemoveDuplicatedResults,
- WhileRemoveUnusedArgs>(context);
+ SCFWhileLICM, WhileRemoveUnusedArgs>(context);
}
//===----------------------------------------------------------------------===//
diff --git a/mlir/test/Dialect/SCF/canonicalize.mlir b/mlir/test/Dialect/SCF/canonicalize.mlir
index 52e0fdfa36d6cd..f8f4d737b381df 100644
--- a/mlir/test/Dialect/SCF/canonicalize.mlir
+++ b/mlir/test/Dialect/SCF/canonicalize.mlir
@@ -1022,10 +1022,10 @@ func.func @while_loop_invariant_argument_different_order(%arg : tensor<i32>) ->
// CHECK-SAME: (%[[ARG:.+]]: tensor<i32>)
// CHECK: %[[ZERO:.*]] = arith.constant dense<0>
// CHECK: %[[ONE:.*]] = arith.constant dense<1>
+// CHECK: %[[COND:.*]] = arith.cmpi sgt, %[[ARG]], %[[ZERO]]
+// CHECK: %[[COND1:.*]] = tensor.extract %[[COND]][]
// CHECK: %[[WHILE:.*]]:2 = scf.while (%[[ARG1:.*]] = %[[ONE]], %[[ARG4:.*]] = %[[ZERO]])
-// CHECK: arith.cmpi sgt, %[[ARG]], %[[ZERO]]
-// CHECK: tensor.extract %{{.*}}[]
-// CHECK: scf.condition(%{{.*}}) %[[ARG1]], %[[ARG4]]
+// CHECK: scf.condition(%[[COND1]]) %[[ARG1]], %[[ARG4]]
// CHECK: } do {
// CHECK: ^{{.*}}(%{{.*}}: tensor<i32>, %{{.*}}: tensor<i32>):
// CHECK: scf.yield %[[ZERO]], %[[ONE]]
@@ -1144,6 +1144,29 @@ func.func @while_duplicated_res() -> (i32, i32) {
// CHECK: }
// CHECK: return %[[RES]], %[[RES]] : i32, i32
+// -----
+
+func.func @while_licm(%arg1: i32, %arg2: i32, %arg3: i32) {
+ scf.while () : () -> () {
+ %val0 = arith.addi %arg1, %arg2 : i32
+ %val = arith.addi %val0, %arg3 : i32
+ %condition = "test.condition"(%val) : (i32) -> i1
+ scf.condition(%condition)
+ } do {
+ ^bb0():
+ "test.test"() : () -> ()
+ scf.yield
+ }
+ return
+}
+// CHECK-LABEL: @while_licm
+// CHECK-SAME: (%[[ARG1:.*]]: i32, %[[ARG2:.*]]: i32, %[[ARG3:.*]]: i32)
+// CHECK: %[[VAL0:.*]] = arith.addi %[[ARG1]], %[[ARG2]] : i32
+// CHECK: %[[VAL1:.*]] = arith.addi %[[VAL0]], %[[ARG3]] : i32
+// CHECK: scf.while
+// CHECK-NEXT: %[[COND:.*]] = "test.condition"(%[[VAL1]]) : (i32) -> i1
+// CHECK-NEXT: scf.condition(%[[COND]])
+
// -----
We already have a LICM pass that hoists ops from loops. (I recently extended it to support `scf.while`.) This change duplicates a part of the LICM pass in a canonicalization pattern.
Your `scf.while` uplifting pattern is probably just one example of a pattern that is enabled by LICM, so I'd be in favor of handling this in a more generic way: we could apply LICM as part of every greedy pattern rewrite, before or after each greedy pattern rewriter iteration (https://github.com/llvm/llvm-project/blob/main/mlir/lib/Transforms/Utils/GreedyPatternRewriteDriver.cpp#L763). We already perform various IR optimizations in there: region simplification and constant de-duplication. In practice, the latter is often a form of LICM (when CSE'ing a constant outside of a loop with a constant inside of a loop), so there is some precedent.
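For reference, invoking the existing utility around a greedy rewrite could look roughly like this; a sketch only (not code from this PR), assuming the `moveLoopInvariantCode` helper from `mlir/Transforms/LoopInvariantCodeMotionUtils.h`:

```c++
#include "mlir/IR/Operation.h"
#include "mlir/Interfaces/LoopLikeInterface.h"
#include "mlir/Transforms/LoopInvariantCodeMotionUtils.h"

using namespace mlir;

/// Hoist hoistable loop-invariant ops out of every loop-like op under `root`,
/// e.g. before or after a greedy pattern rewrite iteration. The default
/// post-order walk visits inner loops first, so invariant ops can bubble
/// outwards through nested loops in a single traversal.
static void hoistLoopInvariants(Operation *root) {
  root->walk([](LoopLikeOpInterface loopLike) {
    // Returns the number of moved ops; ignored in this sketch.
    (void)moveLoopInvariantCode(loopLike);
  });
}
```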
I don't think LICM should belong to canonicalization. LICM is not always good for code quality as it may increase register pressure, so applying it blindly during canonicalization seems dangerous.
Note: I'm only doing LICM from the `before` block.
This is true for general loops, but for the `before` block …
I don't see a contradiction between what you're saying and the canonicalization. The process of canonicalization isn't to make the code faster, it is to make two equivalent input programs converge to the same IR. It is expected that a cost model and target-specific lowering have to do some scheduling, and likely undo some potential canonicalizations.
After some thought, I actually like the idea of making general LICM part of …
By that definition many optimizations can be considered a canonicalization, yet I don't think we would want to do them as part of canonicalization. There are precedents where we decided not to do some transformations as canonicalizations because they are too opinionated. I think this is one of those cases.
My concern is that it won't be possible to consider the initial IR to decide whether some IR should be inside or outside the loop. It seems common for compilers to have some kind of cost model for LICM (or other code motion), and this will prevent that for all users.
What is the advantage of doing that as part of the rewriter compared to including the LICM pattern explicitly when needed? LICM is kind of SCF-specific in this case; is the idea to add a LICM interface that dialects can register?
I've seen cases where it was beneficial to interleave CSE with pattern applications until a fixed point is reached. CSE and patterns enabled each other and had to be run multiple times in a loop. (Whether that is a good design or not is a different question.) I had similar cases in mind with LICM and pattern applications.
This shouldn't be a canonicalization pattern for a variety of reasons. Some others have already pointed out that this is not an "always good" optimization and should not be enabled by default.
An additional problem is that any canonicalization pattern that walks the body of an operation risks O(N^2) blowup in the pattern rewriter: each time an op is modified in the body of the while, the pattern is triggered again, causing another walk even if nothing changes.
Like what?
Please elaborate: I'm on the opposite side and I don't quite see the reasoning here.
I don't understand what you're trying to say here. What does "initial IR" mean? To me the input program to the compiler is "arbitrary"; I don't put any value on the "initial IR" in terms of being more "optimal". Doing so would break composability and goes entirely against the concept of canonicalization as I understand it.
This is not the definition of a canonicalization: I strongly oppose this argument right now.
This is a better argument to me, to be studied in detail though.
+1 to what mehdi said. Canonical to me doesn't have to mean more optimised, but rather that the IR is in a shape more useful for future transformations. LICM is a transformation helping in that regard, as it makes loops more canonical (by getting closer to a loop only containing loop-variant operations). That said, if it has quadratic complexity it should probably not be enabled in the canonicalize pass. I see canonicalize being used analogously to something like …
Here are a few examples of transformations that were moved out of canonicalization over time because having more control over when transformations are applied was necessary: … Those are just some examples from a quick search, but I recall there were more.
I understand the reasoning, but in practice it is common for compilers to rely on the original program as a fallback when a transformation is not obviously a win. I don't understand where the line is with regard to canonicalization if we go that route. What kind of transformation would not be a canonicalization then? Would function inlining be a canonicalization, for instance? And if not, why? My interpretation (admittedly not based on docs but inferred from use cases) was that canonicalizations were a set of always-good simplifications that made it simpler for the rest of the compiler to decide which form of IR to expect. My concern is that this could significantly reduce user control over what transformations are applied, as canonicalization has to be applied all or nothing and those patterns are usually not exposed otherwise, preventing any alternative solution for users.
Not just "always good", necessary. Without canonicalization, passes down the pipeline cannot assume anything and the complexity of pattern matching expressions can go combinatorial.
I think people are mixing two different things:
Questions about the validity of keeping interim information that is prone to cleanups as a dependency between transformation passes aside, there is always some information left by analysis passes for following transformation passes that needs to be kept. So it's not a clear-cut answer here. Other artefacts (like annotations, properties, metadata) could be created and managed for different purposes, but having IR ops in a "known but not canonical" shape between two specific passes isn't unheard of and may be beneficial in some cases. Now, back to this PR, I agree with @joker-eph here: LICM, even a simplified version, isn't canonicalization, it's a transformation. @Hardcode84 if you add this pattern to the LICM pass and run it before your conversion, would that solve your problem?
Actually you're saying the opposite of me :) I'm saying that the form with the invariant computation inside the loop and the form with it hoisted out (see the sketch below) are equivalent programs, and the argument that the second one "may increase register pressure" (or, for example, any other target-specific consideration) is not a valid objection to canonicalization.
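A minimal sketch of such a pair, in the style of the `@while_licm` test from this PR:

```mlir
// Invariant computation inside the `before` block...
func.func @compute_inside(%a: i32, %b: i32) {
  scf.while () : () -> () {
    %sum = arith.addi %a, %b : i32
    %cond = "test.condition"(%sum) : (i32) -> i1
    scf.condition(%cond)
  } do {
    "test.test"() : () -> ()
    scf.yield
  }
  return
}

// ...and the hoisted form: the same program, just written differently.
func.func @compute_outside(%a: i32, %b: i32) {
  %sum = arith.addi %a, %b : i32
  scf.while () : () -> () {
    %cond = "test.condition"(%sum) : (i32) -> i1
    scf.condition(%cond)
  } do {
    "test.test"() : () -> ()
    scf.yield
  }
  return
}
```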
Some of these aren't motivated, so it's hard to form an opinion. But when I read: …
That seems completely bogus to me. If obvious dead code removal can't be a canonicalization, then we could just as well remove the concept of canonicalization entirely.
I see this as a very fragile way to design compilers; I would say that if you want to design your compiler that way, don't include canonicalization in your pass pipeline.
"it's ambiguous" sometimes., but "Inlining" is covered here: https://sunfishcode.github.io/blog/2018/10/22/Canonicalization.html The way I see it is "how hard is it to reverse a transformation later"? In the case of LICM it is a local transformation and one that a compiler must be able to "undo" (because often the input program will come in both form and the compiler should optimize it!). Inlining is a global program transformation instead, and we don't know efficiently how to manage this (that is: it's still in the realm of compiler research area, even though there are some implementations around PGO+outlining of common sequence in cold region for example).
How is there a "reduction" of control when there is no control already? That is: we don't have canonicalization knobs, almost by design; we just let people either use canonicalization or not.
But then I cannot access the existing canonicalization patterns so I would have to rewrite them which is obviously not practical.
Because canonicalization doesn't have knobs, if we widen what we consider to be canonicalizations we lose control. This reduces the modularity of MLIR patterns, which I think is going to be a downside for users.
😅 It's confusing, because I agree with your statement and still think LICM should be LICM, not canonicalization. Not because the two snippets you show are different (they're not), but because these are well known passes (like CSE, DCE) which are arguably all forms of canonicalization. So, we either call every transform pass that doesn't change program semantics "canonicalization", or we keep them in separate passes because some canonicalizations are different than others. I strongly prefer the latter, because of my second point earlier: the control that some bundle-of-passes can have on which types of canonicalization can happen in between its inner transformations.
Indeed. Register/memory allocators should operate on canonical forms and do their most optimal work in those forms. However, the argument itself is still valid. If you interpret this as "the canonical form is changing, and the following passes were coded with the previous canonical form in mind", then my argument about isolating following passes is broken and we need to think about the consequences. But it feels to me that we're drifting away from the initial aims of this PR. As for the code here, it shouldn't matter if this transform is in pass A or B, as long as the pass runs before the `scf.while` -> `scf.for` uplifting.
But that's already the case, I'm saying: nothing new here.
This comment seems to me to apply to any addition of a canonicalization pattern from day-1...
I feel you're mixing up the "addition of a new pattern you don't like" with the fundamental design that "canonicalization patterns are an opaque bag and you don't get to choose". That is: there is no design change here. You just happen to realize that canonicalization patterns are a growing set of transformations and you can't cherry-pick. Unless we freeze the set of patterns, we're necessarily in the situation you're describing, as far as I understand.
Yes, and we run DCE during canonicalization :p So for canonicalization candidates I'd rather look at: …
Some of the impact of keeping a transformation within canonicalization is the cross-transformation impact: you need to perform these interleaved in the same IR traversal to converge. You can often craft input IR where performing one before the other does not lead to the desired simplification.
As was mentioned before in the thread, canonicalizations can open new opportunities for CSE and LICM and vice-versa, so you may need to run this sequence of passes an arbitrary number of times. I can live without this pattern upstream; internally we have a facility to run a sequence of passes until a fixed point is reached. I can work on upstreaming it instead. (How is this handled in LLVM proper, BTW?)
To me, the reversibility is the key. No loss of information, not just program interpretation (which could say undefined behavior is fair game). For example, what constitutes "local" can be a matter of interpretation. It can be argued that LICM uses non-local information (op + loop + loop dependency analysis looking at the context, not just the op itself). The same can be said for CSE, since it looks at multiple expressions, and even for simpler things like expression order canonicalization. So, I try not to argue too strongly about which passes are canonicalizations and which are not. My suggestion to have them separate is not to determine what is what, but to allow interim pass bundles to control which types of canonicalization (reversible changes into canonical shapes) we apply within. But in between these "bundles", we should always run all of them until a fixed point. What determines a "bundle" isn't always clear, so don't take my suggestion as concrete, just as a design goal for an upstream pipeline that we all use, so we can collaborate upstream without reinventing the wheel downstream.
This is key: the canonical form isn't a static form, but an agreed-upon form that the following passes agree to operate on. This form can change with time and passes need to adapt to it, but this change can be expensive and needs to be agreed upon again. In our project we use a lot of upstream passes. We used to have trouble with the shapes that the passes leave in the IR, but after we started running canonicalization in between every "bundle", the changes to the canonical shapes became minimal and easily treatable, without loss of information. We may not even agree on what a "bundle" looks like across different projects, but if we run canonicalization to a fixed point in between them locally, the likelihood that there is loss of information is very low, if we agree on what a "canonical form" is.
I chatted with @ThomasRaoux and this topic came up; it felt worth reporting some more on canonicalization aspects. The argument against this as a canonicalization may come down to the definition and purpose of …
Side comment: doesn't this argument also apply to the entire concept of uplifting `scf.while` -> `scf.for`? As I said previously, I can live with the LICM being a separate pass; the big limitation of this approach is that upstream is currently missing a facility to run a sequence of passes until a fixed point is reached.
I am not so sure: it won't move operations from other dialects, or cause operations to be executed that wouldn't be executed otherwise (like LICM on a zero-trip loop). It is purely a transformation from an scf operation to another equivalent scf operation.
The implementation of the canonicalization pass is ~20 lines of code; it should be fairly trivial to write another one downstream that adds more patterns, or even (for example) interleaves CSE and canonicalization until a fixed point?
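A rough sketch of what such a downstream pass could look like; illustrative only (the names `ExtendedCanonicalizer`, `MyWhileLICMPattern`, and `extended-canonicalize` are made up), with the pattern collection mirroring what the upstream canonicalizer does in its `initialize` hook:

```c++
#include "mlir/IR/Dialect.h"
#include "mlir/IR/PatternMatch.h"
#include "mlir/Pass/Pass.h"
#include "mlir/Transforms/GreedyPatternRewriteDriver.h"

using namespace mlir;

namespace {
/// A canonicalizer-like pass that reuses all registered canonicalization
/// patterns and appends extra downstream patterns.
struct ExtendedCanonicalizer
    : public PassWrapper<ExtendedCanonicalizer, OperationPass<>> {
  MLIR_DEFINE_EXPLICIT_INTERNAL_INLINE_TYPE_ID(ExtendedCanonicalizer)

  StringRef getArgument() const final { return "extended-canonicalize"; }

  LogicalResult initialize(MLIRContext *ctx) override {
    RewritePatternSet owningPatterns(ctx);
    // Collect canonicalization patterns from all loaded dialects and all
    // registered ops, the same way the upstream canonicalizer does.
    for (Dialect *dialect : ctx->getLoadedDialects())
      dialect->getCanonicalizationPatterns(owningPatterns);
    for (RegisteredOperationName op : ctx->getRegisteredOperations())
      op.getCanonicalizationPatterns(owningPatterns, ctx);
    // Downstream-only additions would go here, e.g.:
    // owningPatterns.add<MyWhileLICMPattern>(ctx);
    patterns = FrozenRewritePatternSet(std::move(owningPatterns));
    return success();
  }

  void runOnOperation() override {
    // Greedy application also folds and removes trivially dead ops.
    if (failed(applyPatternsAndFoldGreedily(getOperation(), patterns)))
      signalPassFailure();
  }

  FrozenRewritePatternSet patterns;
};
} // namespace
```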
Yes, and we did it downstream in a kind of generic way (https://github.com/numba/numba-mlir/blob/main/mlir/lib/Transforms/CompositePass.cpp), but there is nothing like this upstream :) Anyways, I can close this PR and continue with the main PR (#76108), making it a test pass first, and then making it a canonicalization in a separate PR. Also, I'll need to add another pattern, which most likely doesn't qualify as a canonicalization, transforming:
…
to
…
I initially wanted to add it to the uplifting pass itself, but if it will be gone, we probably need another separate pass.
Move non-side-effecting ops from the `before` block if all their args are defined outside the loop. Doesn't visit nested regions. This cleanup is needed for `scf.while` -> `scf.for` uplifting #76108, as it expects a `before` block consisting of a single `cmp` op.