
[SCEV] Apply loop guards to End in computeMaxBECountForLT #116187


Closed

Conversation

Contributor

@lukel97 lukel97 commented Nov 14, 2024

This is a follow on from #115705. In computeMaxBECountForLT, we can apply the loop guards to the RHS before computing its maximum range to get a more accurate constant maximum trip count, similarly to what is done in howFarToZero.

One of the effects of this is that it allows isIndvarOverflowCheckKnownFalse to skip the overflow check in more cases.

@lukel97 lukel97 requested a review from nikic as a code owner November 14, 2024 09:21
@llvmbot llvmbot added llvm:analysis Includes value tracking, cost tables and constant folding llvm:transforms labels Nov 14, 2024
@llvmbot
Member

llvmbot commented Nov 14, 2024

@llvm/pr-subscribers-llvm-transforms

@llvm/pr-subscribers-llvm-analysis

Author: Luke Lau (lukel97)

Changes

This is a follow on from #115705. Applying the loop guard allows us to calculate the maximum trip count in more places, which in turn allows isIndvarOverflowCheckKnownFalse to skip the overflow check.


Full diff: https://github.com/llvm/llvm-project/pull/116187.diff

3 Files Affected:

  • (modified) llvm/include/llvm/Analysis/ScalarEvolution.h (+2-2)
  • (modified) llvm/lib/Analysis/ScalarEvolution.cpp (+10-9)
  • (modified) llvm/test/Transforms/LoopVectorize/RISCV/vectorize-force-tail-with-evl-known-no-overflow.ll (+2-7)
diff --git a/llvm/include/llvm/Analysis/ScalarEvolution.h b/llvm/include/llvm/Analysis/ScalarEvolution.h
index 4b8cb3a39a86db..8b7745e6ab1034 100644
--- a/llvm/include/llvm/Analysis/ScalarEvolution.h
+++ b/llvm/include/llvm/Analysis/ScalarEvolution.h
@@ -2218,8 +2218,8 @@ class ScalarEvolution {
   ///   actually doesn't, or we'd have to immediately execute UB)
   /// We *don't* assert these preconditions so please be careful.
   const SCEV *computeMaxBECountForLT(const SCEV *Start, const SCEV *Stride,
-                                     const SCEV *End, unsigned BitWidth,
-                                     bool IsSigned);
+                                     const SCEV *End, const Loop *L,
+                                     unsigned BitWidth, bool IsSigned);
 
   /// Verify if an linear IV with positive stride can overflow when in a
   /// less-than comparison, knowing the invariant term of the comparison,
diff --git a/llvm/lib/Analysis/ScalarEvolution.cpp b/llvm/lib/Analysis/ScalarEvolution.cpp
index b10811133770e1..bb7306f5ba3778 100644
--- a/llvm/lib/Analysis/ScalarEvolution.cpp
+++ b/llvm/lib/Analysis/ScalarEvolution.cpp
@@ -12857,11 +12857,10 @@ const SCEV *ScalarEvolution::getUDivCeilSCEV(const SCEV *N, const SCEV *D) {
   return getAddExpr(MinNOne, getUDivExpr(NMinusOne, D));
 }
 
-const SCEV *ScalarEvolution::computeMaxBECountForLT(const SCEV *Start,
-                                                    const SCEV *Stride,
-                                                    const SCEV *End,
-                                                    unsigned BitWidth,
-                                                    bool IsSigned) {
+const SCEV *
+ScalarEvolution::computeMaxBECountForLT(const SCEV *Start, const SCEV *Stride,
+                                        const SCEV *End, const Loop *L,
+                                        unsigned BitWidth, bool IsSigned) {
   // The logic in this function assumes we can represent a positive stride.
   // If we can't, the backedge-taken count must be zero.
   if (IsSigned && BitWidth == 1)
@@ -12895,8 +12894,10 @@ const SCEV *ScalarEvolution::computeMaxBECountForLT(const SCEV *Start,
   // the case End = RHS of the loop termination condition. This is safe because
   // in the other case (End - Start) is zero, leading to a zero maximum backedge
   // taken count.
-  APInt MaxEnd = IsSigned ? APIntOps::smin(getSignedRangeMax(End), Limit)
-                          : APIntOps::umin(getUnsignedRangeMax(End), Limit);
+  const SCEV *GuardedEnd = applyLoopGuards(End, L);
+  APInt MaxEnd = IsSigned
+                     ? APIntOps::smin(getSignedRangeMax(GuardedEnd), Limit)
+                     : APIntOps::umin(getUnsignedRangeMax(GuardedEnd), Limit);
 
   // MaxBECount = ceil((max(MaxEnd, MinStart) - MinStart) / Stride)
   MaxEnd = IsSigned ? APIntOps::smax(MaxEnd, MinStart)
@@ -13150,7 +13151,7 @@ ScalarEvolution::howManyLessThans(const SCEV *LHS, const SCEV *RHS,
       // loop (RHS), and the fact that IV does not overflow (which is
       // checked above).
       const SCEV *MaxBECount = computeMaxBECountForLT(
-          Start, Stride, RHS, getTypeSizeInBits(LHS->getType()), IsSigned);
+          Start, Stride, RHS, L, getTypeSizeInBits(LHS->getType()), IsSigned);
       return ExitLimit(getCouldNotCompute() /* ExactNotTaken */, MaxBECount,
                        MaxBECount, false /*MaxOrZero*/, Predicates);
     }
@@ -13334,7 +13335,7 @@ ScalarEvolution::howManyLessThans(const SCEV *LHS, const SCEV *RHS,
     MaxOrZero = true;
   } else {
     ConstantMaxBECount = computeMaxBECountForLT(
-        Start, Stride, RHS, getTypeSizeInBits(LHS->getType()), IsSigned);
+        Start, Stride, RHS, L, getTypeSizeInBits(LHS->getType()), IsSigned);
   }
 
   if (isa<SCEVCouldNotCompute>(ConstantMaxBECount) &&
diff --git a/llvm/test/Transforms/LoopVectorize/RISCV/vectorize-force-tail-with-evl-known-no-overflow.ll b/llvm/test/Transforms/LoopVectorize/RISCV/vectorize-force-tail-with-evl-known-no-overflow.ll
index a8cf002182e240..735421a4f65ee8 100644
--- a/llvm/test/Transforms/LoopVectorize/RISCV/vectorize-force-tail-with-evl-known-no-overflow.ll
+++ b/llvm/test/Transforms/LoopVectorize/RISCV/vectorize-force-tail-with-evl-known-no-overflow.ll
@@ -4,8 +4,7 @@
 ; RUN: -prefer-predicate-over-epilogue=predicate-dont-vectorize \
 ; RUN: -mtriple=riscv64 -mattr=+v -S < %s | FileCheck %s
 
-; TODO: We know the IV will never overflow here so we can skip the overflow
-; check
+; We know the IV will never overflow here so we can skip the overflow check
 
 define void @trip_count_max_1024(ptr %p, i64 %tc) vscale_range(2, 1024) {
 ; CHECK-LABEL: define void @trip_count_max_1024(
@@ -15,11 +14,7 @@ define void @trip_count_max_1024(ptr %p, i64 %tc) vscale_range(2, 1024) {
 ; CHECK-NEXT:    br i1 [[GUARD]], label %[[EXIT:.*]], label %[[LOOP_PREHEADER:.*]]
 ; CHECK:       [[LOOP_PREHEADER]]:
 ; CHECK-NEXT:    [[UMAX:%.*]] = call i64 @llvm.umax.i64(i64 [[TC]], i64 1)
-; CHECK-NEXT:    [[TMP0:%.*]] = sub i64 -1, [[UMAX]]
-; CHECK-NEXT:    [[TMP1:%.*]] = call i64 @llvm.vscale.i64()
-; CHECK-NEXT:    [[TMP2:%.*]] = mul i64 [[TMP1]], 2
-; CHECK-NEXT:    [[TMP3:%.*]] = icmp ult i64 [[TMP0]], [[TMP2]]
-; CHECK-NEXT:    br i1 [[TMP3]], label %[[SCALAR_PH:.*]], label %[[VECTOR_PH:.*]]
+; CHECK-NEXT:    br i1 false, label %[[SCALAR_PH:.*]], label %[[VECTOR_PH:.*]]
 ; CHECK:       [[VECTOR_PH]]:
 ; CHECK-NEXT:    [[TMP4:%.*]] = call i64 @llvm.vscale.i64()
 ; CHECK-NEXT:    [[TMP5:%.*]] = mul i64 [[TMP4]], 2

@lukel97 lukel97 changed the title [SCEV] Apply loop guards to End computeMaxBECountForLT [SCEV] Apply loop guards to End in computeMaxBECountForLT Nov 14, 2024
Contributor

@david-arm david-arm left a comment


LGTM! The patch seems reasonable to me - both howFarToZero and howManyLessThans already seem to apply loop guards in other places.

@lukel97
Contributor Author

lukel97 commented Nov 14, 2024

The pre-merge CI has picked up a lot more test changes that I completely forgot to run locally, my bad. I'll check the diffs on my end and then update this PR soon.

@david-arm
Contributor

Hmm, that's also my bad for being too eager to accept before the pre-commit results. :) No worries - I'll take a look again once you've updated the patch!

@lukel97 lukel97 marked this pull request as draft November 14, 2024 10:03
@fhahn
Contributor

fhahn commented Nov 14, 2024

Please make sure there are SCEV-only tests for the change

Contributor

@nikic nikic left a comment


This adds significant compile time overhead: https://llvm-compile-time-tracker.com/compare.php?from=e52238b59f250aef5dc0925866d0308305a19dbf&to=49f1bf83d7a988020bea4492e76620ae685d73aa&stat=instructions:u

Possibly using pre-computed guards in howManyLessThans would help, though I'm not sure (the existing guard uses are conditional).

@nikic
Contributor

nikic commented Nov 14, 2024

Possibly using pre-computed guards in howManyLessThans would help, though I'm not sure (the existing guard uses are conditional).

Or going one level higher, we could pre-compute the guards for all exit count calculation for a given loop, which would be beneficial for multi-exit loops. Possibly this needs to be lazy though for the cases where we don't need the guards at all.

@david-arm
Contributor

Possibly using pre-computed guards in howManyLessThans would help, though I'm not sure (the existing guard uses are conditional).

Or going one level higher, we could pre-compute the guards for all exit count calculation for a given loop, which would be beneficial for multi-exit loops. Possibly this needs to be lazy though for the cases where we don't need the guards at all.

I see that getSmallConstantTripMultiple applies the loop guards after calculating the exit count, i.e.

unsigned ScalarEvolution::getSmallConstantTripMultiple(const Loop *L,
                                                       const SCEV *ExitCount) {
  if (ExitCount == getCouldNotCompute())
    return 1;

  // Get the trip count
  const SCEV *TCExpr = getTripCountFromExitCount(applyLoopGuards(ExitCount, L));

...
}

and perhaps this function is used less in the code base so the additional compile time cost is minimal? If we did something similar in getSmallConstantTripCount perhaps that might keep the compile time down? Not sure if applying the loop guards this late still gives us the benefits though.

@lukel97 lukel97 marked this pull request as ready for review November 19, 2024 11:43
@lukel97
Contributor Author

lukel97 commented Nov 19, 2024

I went through and updated the failed tests, this actually ends up reducing the max trip count in a lot more places than expected.

If we did something similar in getSmallConstantTripCount perhaps that might keep the compile time down? Not sure if applying the loop guards this late still gives us the benefits though.

I gave this a try but it doesn't seem to work: it looks like we need to apply the guards before we call get{Signed,Unsigned}RangeMax in order for ConstantMaxBECount to actually be reduced.

I've also noticed that a good few of the other calls to get{Signed,Unsigned}RangeMax tend to wrap the SCEV with applyLoopGuards, e.g. in howFarToZero, so this seems to be something of a pattern.

I'll continue poking around to see if there's anywhere else we can make this more efficient. My gut feeling is that hoisting the loop guard collection to howManyLessThans won't help, given that the only other uses of applyLoopGuards there are mutually exclusive with computeMaxBECountForLT, so I don't think it would end up deduplicating any work.

@lukel97
Contributor Author

lukel97 commented Nov 21, 2024

I'm going to close this for now since I can't really think of another way of avoiding the compile-time hit, and this was kind of just a follow-up that fell out of #115705.

I benchmarked this on SPEC CPU 2017 on RISC-V and while the generated code was different (I think I saw one loop vectorized where it wasn't previously), it didn't have any measurable performance impact, so I'm happy to go without it for now.

If we end up moving to EVL tail folding eventually, this might be worth picking up again to remove the overflow check; but on RISC-V vscale will always be a power of 2, so there might be other ways of avoiding it.

@lukel97 lukel97 closed this Nov 21, 2024