Skip to content

[LV] Invalidate SCEV values in the scalar loop after loop vectorization #69886

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from

Conversation

dtcxzyw
Copy link
Member

@dtcxzyw dtcxzyw commented Oct 22, 2023

This PR fixes incorrect trip counts after vectorizing the first loop.
After loop vectorization, we should forget all SCEV values in the scalar loops since the scalar loop doesn't dominate any of the loop exits.

Fixes #69097.
Fixes #66616.
Fixes #63970.

@llvmbot
Copy link
Member

llvmbot commented Oct 22, 2023

@llvm/pr-subscribers-llvm-transforms

Author: Yingwei Zheng (dtcxzyw)

Changes

This PR fixes the assertion failure of SE.verify() after loop vectorization (with --verify-scev-strict).
After loop vectorization, we should forget all SCEV values in the scalar loops since the scalar loop doesn't dominate any of the loop exits.

Fixes #69097.
Fixes #66616.

I wonder if it increases compile time. Waiting for the evaluation result from https://llvm-compile-time-tracker.com/.


Full diff: https://github.com/llvm/llvm-project/pull/69886.diff

2 Files Affected:

  • (modified) llvm/lib/Transforms/Vectorize/LoopVectorize.cpp (+4-1)
  • (added) llvm/test/Transforms/LoopVectorize/pr66616.ll (+35)
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index a5d523894bef155..8c8b4a1e2c32f7a 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -3543,7 +3543,10 @@ void InnerLoopVectorizer::fixVectorizedLoop(VPTransformState &State,
 
   // Forget the original basic block.
   PSE.getSE()->forgetLoop(OrigLoop);
-  PSE.getSE()->forgetBlockAndLoopDispositions();
+  // Now the scalar loop doesn't dominate any of the loop exits.
+  for (auto *Block : OrigLoop->getBlocks())
+    for (auto &Inst : *Block)
+      PSE.getSE()->forgetValue(&Inst);
 
   // After vectorization, the exit blocks of the original loop will have
   // additional predecessors. Invalidate SCEVs for the exit phis in case SE
diff --git a/llvm/test/Transforms/LoopVectorize/pr66616.ll b/llvm/test/Transforms/LoopVectorize/pr66616.ll
new file mode 100644
index 000000000000000..8374cfacfb71ecc
--- /dev/null
+++ b/llvm/test/Transforms/LoopVectorize/pr66616.ll
@@ -0,0 +1,35 @@
+; RUN: opt -passes="require<scalar-evolution>,print<scalar-evolution>,loop-vectorize" --verify-scev --verify-scev-strict -S < %s | FileCheck %s
+
+target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128"
+target triple = "x86_64-unknown-linux-gnu"
+
+; CHECK-LABEL: @pr66616(
+; CHECK: vector.body
+define void @pr66616() {
+entry:
+  br label %for.body
+
+for.cond5.preheader:                              ; preds = %for.body
+  br label %while.body.i
+
+for.body:                                         ; preds = %for.body, %entry
+  %storemerge12 = phi i8 [ 0, %entry ], [ %inc, %for.body ]
+  %0 = load i8, ptr null, align 1
+  %conv2 = sext i8 %0 to i32
+  %add3 = add i32 %conv2, 1
+  %inc = add i8 %storemerge12, 1
+  %conv1 = zext i8 %inc to i32
+  %tobool.not = icmp eq i32 %conv1, 0
+  br i1 %tobool.not, label %for.cond5.preheader, label %for.body
+
+while.body.i:                                     ; preds = %while.body.i, %for.cond5.preheader
+  %i.addr.09.i = phi i32 [ %dec.i, %while.body.i ], [ %add3, %for.cond5.preheader ]
+  %incdec.ptr48.i = phi ptr [ %incdec.ptr.i, %while.body.i ], [ null, %for.cond5.preheader ]
+  %dec.i = add i32 %i.addr.09.i, 1
+  %incdec.ptr.i = getelementptr i8, ptr %incdec.ptr48.i, i64 1
+  %tobool.not.i = icmp eq i32 %i.addr.09.i, 0
+  br i1 %tobool.not.i, label %for.inc9.loopexit, label %while.body.i
+
+for.inc9.loopexit:                                ; preds = %while.body.i
+  ret void
+}

@dtcxzyw
Copy link
Member Author

dtcxzyw commented Oct 22, 2023

Copy link
Contributor

@fhahn fhahn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for investigating this!

Could you provide more context about the failure in the description? form the issue, it looks like the crash is bool llvm::ScalarEvolution::isLoopEntryGuardedByCond(const Loop *, ICmpInst::Predicate, const SCEV *, const SCEV *): Assertion isAvailableAtLoopEntry(LHS, L) && "LHS is not available at Loop Entry"`, correct?

In that case, is it possible that we kept around some invalid block & loop dispositions?

Copy link
Contributor

@nikic nikic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is the correct solution, but given the lack of root cause analysis in the description I cannot suggest what the correct one would be.

@dtcxzyw
Copy link
Member Author

dtcxzyw commented Oct 23, 2023

Thanks for investigating this!

Could you provide more context about the failure in the description? form the issue, it looks like the crash is bool llvm::ScalarEvolution::isLoopEntryGuardedByCond(const Loop *, ICmpInst::Predicate, const SCEV *, const SCEV *): Assertion isAvailableAtLoopEntry(LHS, L) && "LHS is not available at Loop Entry"`, correct?

IR after vectorizing the loop %for.body<header><latch><exiting>:

define void @pr66616() {
entry:
  br i1 false, label %scalar.ph, label %vector.ph

vector.ph:                                        ; preds = %entry
  br label %vector.body

vector.body:                                      ; preds = %vector.body, %vector.ph
  %index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %0 = load i8, ptr null, align 1
  %broadcast.splatinsert = insertelement <16 x i8> poison, i8 %0, i64 0
  %broadcast.splat = shufflevector <16 x i8> %broadcast.splatinsert, <16 x i8> poison, <16 x i32> zeroinitializer
  %1 = sext <16 x i8> %broadcast.splat to <16 x i32>
  %2 = add <16 x i32> %1, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
  %index.next = add nuw i32 %index, 16
  %3 = icmp eq i32 %index.next, 256
  br i1 %3, label %middle.block, label %vector.body, !llvm.loop !0

middle.block:                                     ; preds = %vector.body
  %4 = extractelement <16 x i32> %2, i32 15
  br i1 true, label %for.cond5.preheader, label %scalar.ph

scalar.ph:                                        ; preds = %entry, %middle.block
  %bc.resume.val = phi i8 [ 0, %middle.block ], [ 0, %entry ]
  br label %for.body

for.cond5.preheader:                              ; preds = %middle.block, %for.body
  %add3.lcssa = phi i32 [ %add3, %for.body ], [ %4, %middle.block ]
  br label %while.body.i

for.body:                                         ; preds = %for.body, %scalar.ph
  %storemerge12 = phi i8 [ %bc.resume.val, %scalar.ph ], [ %inc, %for.body ]
  %5 = load i8, ptr null, align 1
  %conv2 = sext i8 %5 to i32
  %add3 = add i32 %conv2, 1
  %inc = add i8 %storemerge12, 1
  %conv1 = zext i8 %inc to i32
  %tobool.not = icmp eq i32 %conv1, 0
  br i1 %tobool.not, label %for.cond5.preheader, label %for.body, !llvm.loop !3

while.body.i:                                     ; preds = %while.body.i, %for.cond5.preheader
  %i.addr.09.i = phi i32 [ %dec.i, %while.body.i ], [ %add3.lcssa, %for.cond5.preheader ]
  %incdec.ptr48.i = phi ptr [ %incdec.ptr.i, %while.body.i ], [ null, %for.cond5.preheader ]
  %dec.i = add i32 %i.addr.09.i, 1
  %incdec.ptr.i = getelementptr i8, ptr %incdec.ptr48.i, i64 1
  %tobool.not.i = icmp eq i32 %i.addr.09.i, 0
  br i1 %tobool.not.i, label %for.inc9.loopexit, label %while.body.i

for.inc9.loopexit:                                ; preds = %while.body.i
  ret void
}

Without this patch, the trip count for loop %while.body.i<header><latch><exiting> is:
(-1 + (-1 * (sext i8 %5 to i32))<nsw>)<nsw>
However, we cannot access %5 in BB %while.body.i since %for.body doesn't dominate %while.body.i.
The correct trip count should be (-1 * %add3.lcssa)<nsw>.

In that case, is it possible that we kept around some invalid block & loop dispositions?
forgetValue also cleanups block & loop dispositions.

I will update the test later.

@dtcxzyw
Copy link
Member Author

dtcxzyw commented Oct 23, 2023

Simplified test: https://godbolt.org/z/36cqs16os

@nikic
Copy link
Contributor

nikic commented Oct 23, 2023

Thanks for the explanation. As I understand it, the root issue is that SCEV looks through trivial LCSSA phi nodes, and we currently fail to fully invalidate when they become non-trivial. If this understand is correct, I believe it would be sufficient to invalidate SCEVUnknowns used by the LCSSA phi and defined in the loop. I've put up #69909 based on that idea.

@dtcxzyw
Copy link
Member Author

dtcxzyw commented Oct 23, 2023

Closed as #69909 is better.

@dtcxzyw dtcxzyw closed this Oct 23, 2023
nikic added a commit to nikic/llvm-project that referenced this pull request Nov 16, 2023
Fix this an alternative to llvm#69886. The basic problem is that
SCEV can look through trivial LCSSA phis. When the phi node
later becomes non-trivial, we do invalidate it, but this doesn't
catch uses that are not covered by the IR use-def walk, such as
thoe in BECounts.

Fix this by adding a special invalidation method for LCSSA phis,
which will also invalidate all the SCEVUnknowns used by the LCSSA
phi node.

We should probably also use this invalidation method in other
places that add predecessors to exit blocks, such as loop unroling
and loop peeling.
nikic added a commit that referenced this pull request Nov 17, 2023
This an alternative to #69886. The basic problem is that SCEV can look
through trivial LCSSA phis. When the phi node later becomes non-trivial,
we do invalidate it, but this doesn't catch uses that are not covered by
the IR use-def walk, such as those in BECounts.

Fix this by adding a special invalidation method for LCSSA phis, which
will also invalidate all the SCEVUnknowns/SCEVAddRecExprs used by the
LCSSA phi node and defined in the loop.

We should probably also use this invalidation method in other places
that add predecessors to exit blocks, such as loop unrolling and loop
peeling.

Fixes #69097.
Fixes #66616.
Fixes #63970.
sr-tream pushed a commit to sr-tream/llvm-project that referenced this pull request Nov 20, 2023
This an alternative to llvm#69886. The basic problem is that SCEV can look
through trivial LCSSA phis. When the phi node later becomes non-trivial,
we do invalidate it, but this doesn't catch uses that are not covered by
the IR use-def walk, such as those in BECounts.

Fix this by adding a special invalidation method for LCSSA phis, which
will also invalidate all the SCEVUnknowns/SCEVAddRecExprs used by the
LCSSA phi node and defined in the loop.

We should probably also use this invalidation method in other places
that add predecessors to exit blocks, such as loop unrolling and loop
peeling.

Fixes llvm#69097.
Fixes llvm#66616.
Fixes llvm#63970.
zahiraam pushed a commit to zahiraam/llvm-project that referenced this pull request Nov 20, 2023
This an alternative to llvm#69886. The basic problem is that SCEV can look
through trivial LCSSA phis. When the phi node later becomes non-trivial,
we do invalidate it, but this doesn't catch uses that are not covered by
the IR use-def walk, such as those in BECounts.

Fix this by adding a special invalidation method for LCSSA phis, which
will also invalidate all the SCEVUnknowns/SCEVAddRecExprs used by the
LCSSA phi node and defined in the loop.

We should probably also use this invalidation method in other places
that add predecessors to exit blocks, such as loop unrolling and loop
peeling.

Fixes llvm#69097.
Fixes llvm#66616.
Fixes llvm#63970.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment