[LV] Don't require scalar epilogue for unsupported IAG with tail #96544

Open · wants to merge 2 commits into main

Conversation

nikolaypanchenko
Contributor

LV should check that all interleave groups that require a scalar epilogue can actually be widened. Otherwise, if an InterleavedAccessGroup cannot be widened but does have a tail element, the current logic in LV still requires emitting a scalar epilogue, which leads to inefficient vector code.
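
For illustration, here is a minimal loop of the kind affected (a hypothetical example written for this summary, not taken from the patch): the stride-2 read of A forms an interleave group with a gap at the end, so the group reports that it requires a scalar epilogue; the problem described above is that the current check forces that epilogue even when the group will not actually be widened for the chosen VF.

  // Hypothetical example: only the even elements of A are read, so the loop
  // vectorizer forms an interleave group over A with a gap at the end.
  void addEvenElements(int *__restrict B, const int *__restrict A, int y, int n) {
    for (int i = 0; i < n; ++i)
      B[i] = A[2 * i] + y;
  }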

@llvmbot
Member

llvmbot commented Jun 24, 2024

@llvm/pr-subscribers-llvm-transforms

@llvm/pr-subscribers-backend-risc-v

Author: Kolya Panchenko (nikolaypanchenko)

Changes

LV should check that all interleave groups that require a scalar epilogue can actually be widened. Otherwise, if an InterleavedAccessGroup cannot be widened but does have a tail element, the current logic in LV still requires emitting a scalar epilogue, which leads to inefficient vector code.


Patch is 103.39 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/96544.diff

9 Files Affected:

  • (modified) llvm/lib/Transforms/Vectorize/LoopVectorize.cpp (+55-28)
  • (modified) llvm/test/Transforms/LoopVectorize/ARM/pointer_iv.ll (+141-197)
  • (modified) llvm/test/Transforms/LoopVectorize/RISCV/riscv-vector-reverse.ll (+8)
  • (modified) llvm/test/Transforms/LoopVectorize/RISCV/strided-accesses.ll (+23-24)
  • (added) llvm/test/Transforms/LoopVectorize/RISCV/unsupported-interleaved-access-with-gaps.ll (+126)
  • (modified) llvm/test/Transforms/LoopVectorize/RISCV/zvl32b.ll (+4-4)
  • (modified) llvm/test/Transforms/LoopVectorize/X86/cost-model.ll (+42-43)
  • (modified) llvm/test/Transforms/LoopVectorize/X86/fixed-order-recurrence.ll (+118-19)
  • (modified) llvm/test/Transforms/LoopVectorize/vplan-printing-before-execute.ll (+1-1)
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index 771fb247f201f..c7a74283b5258 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -1448,29 +1448,58 @@ class LoopVectorizationCostModel {
   /// Returns true if \p I is a memory instruction in an interleaved-group
   /// of memory accesses that can be vectorized with wide vector loads/stores
   /// and shuffles.
-  bool interleavedAccessCanBeWidened(Instruction *I, ElementCount VF);
+  bool interleavedAccessCanBeWidened(Instruction *I, ElementCount VF) const;
 
   /// Check if \p Instr belongs to any interleaved access group.
-  bool isAccessInterleaved(Instruction *Instr) {
+  bool isAccessInterleaved(Instruction *Instr) const {
     return InterleaveInfo.isInterleaved(Instr);
   }
 
   /// Get the interleaved access group that \p Instr belongs to.
   const InterleaveGroup<Instruction> *
-  getInterleavedAccessGroup(Instruction *Instr) {
+  getInterleavedAccessGroup(Instruction *Instr) const {
     return InterleaveInfo.getInterleaveGroup(Instr);
   }
 
   /// Returns true if we're required to use a scalar epilogue for at least
   /// the final iteration of the original loop.
-  bool requiresScalarEpilogue(bool IsVectorizing) const {
-    if (!isScalarEpilogueAllowed())
+  bool requiresScalarEpilogue(ElementCount VF) const {
+    if (!isScalarEpilogueAllowed()) {
+      LLVM_DEBUG(dbgs() << "LV: Loop with VF = " << VF
+                        << " does not require scalar epilogue\n");
       return false;
+    }
     // If we might exit from anywhere but the latch, must run the exiting
     // iteration in scalar form.
-    if (TheLoop->getExitingBlock() != TheLoop->getLoopLatch())
+    if (TheLoop->getExitingBlock() != TheLoop->getLoopLatch()) {
+      LLVM_DEBUG(dbgs() << "LV: Loop with VF = " << VF
+                        << " requires scalar epilogue: multiple exits\n");
       return true;
-    return IsVectorizing && InterleaveInfo.requiresScalarEpilogue();
+    }
+    if (VF.isVector()) {
+      if (InterleaveInfo.requiresScalarEpilogue()) {
+        // Make sure interleaved groups that require scalar epilogue will be
+        // widened.
+        for (auto *G : InterleaveInfo.getInterleaveGroups()) {
+          if (!G->requiresScalarEpilogue())
+            continue;
+
+          Instruction *I = G->getMember(0);
+          InstWidening Decision = getWideningDecision(I, VF);
+          if (Decision == CM_Interleave ||
+              (Decision == CM_Unknown &&
+               interleavedAccessCanBeWidened(G->getMember(0), VF))) {
+            LLVM_DEBUG(dbgs() << "LV: Loop with VF = " << VF
+                              << " requires scalar epilogue: vectorizable "
+                                 "interleaved group\n");
+            return true;
+          }
+        }
+      }
+    }
+    LLVM_DEBUG(dbgs() << "LV: Loop with VF = " << VF
+                      << " does not require scalar epilogue\n");
+    return false;
   }
 
   /// Returns true if we're required to use a scalar epilogue for at least
@@ -1479,7 +1508,7 @@ class LoopVectorizationCostModel {
   /// none.
   bool requiresScalarEpilogue(VFRange Range) const {
     auto RequiresScalarEpilogue = [this](ElementCount VF) {
-      return requiresScalarEpilogue(VF.isVector());
+      return requiresScalarEpilogue(VF);
     };
     bool IsRequired = all_of(Range, RequiresScalarEpilogue);
     assert(
@@ -2776,7 +2805,7 @@ InnerLoopVectorizer::getOrCreateVectorTripCount(BasicBlock *InsertBlock) {
   // the step does not evenly divide the trip count, no adjustment is necessary
   // since there will already be scalar iterations. Note that the minimum
   // iterations check ensures that N >= Step.
-  if (Cost->requiresScalarEpilogue(VF.isVector())) {
+  if (Cost->requiresScalarEpilogue(VF)) {
     auto *IsZero = Builder.CreateICmpEQ(R, ConstantInt::get(R->getType(), 0));
     R = Builder.CreateSelect(IsZero, Step, R);
   }
@@ -2829,8 +2858,8 @@ void InnerLoopVectorizer::emitIterationCountCheck(BasicBlock *Bypass) {
   // vector trip count is zero. This check also covers the case where adding one
   // to the backedge-taken count overflowed leading to an incorrect trip count
   // of zero. In this case we will also jump to the scalar loop.
-  auto P = Cost->requiresScalarEpilogue(VF.isVector()) ? ICmpInst::ICMP_ULE
-                                                       : ICmpInst::ICMP_ULT;
+  auto P = Cost->requiresScalarEpilogue(VF) ? ICmpInst::ICMP_ULE
+                                            : ICmpInst::ICMP_ULT;
 
   // If tail is to be folded, vector loop takes care of all iterations.
   Type *CountTy = Count->getType();
@@ -2879,7 +2908,7 @@ void InnerLoopVectorizer::emitIterationCountCheck(BasicBlock *Bypass) {
 
   // Update dominator for Bypass & LoopExit (if needed).
   DT->changeImmediateDominator(Bypass, TCCheckBlock);
-  if (!Cost->requiresScalarEpilogue(VF.isVector()))
+  if (!Cost->requiresScalarEpilogue(VF))
     // If there is an epilogue which must run, there's no edge from the
     // middle block to exit blocks  and thus no need to update the immediate
     // dominator of the exit blocks.
@@ -2908,7 +2937,7 @@ BasicBlock *InnerLoopVectorizer::emitSCEVChecks(BasicBlock *Bypass) {
   // Update dominator only if this is first RT check.
   if (LoopBypassBlocks.empty()) {
     DT->changeImmediateDominator(Bypass, SCEVCheckBlock);
-    if (!Cost->requiresScalarEpilogue(VF.isVector()))
+    if (!Cost->requiresScalarEpilogue(VF))
       // If there is an epilogue which must run, there's no edge from the
       // middle block to exit blocks  and thus no need to update the immediate
       // dominator of the exit blocks.
@@ -2961,7 +2990,7 @@ void InnerLoopVectorizer::createVectorLoopSkeleton(StringRef Prefix) {
   LoopVectorPreHeader = OrigLoop->getLoopPreheader();
   assert(LoopVectorPreHeader && "Invalid loop structure");
   LoopExitBlock = OrigLoop->getUniqueExitBlock(); // may be nullptr
-  assert((LoopExitBlock || Cost->requiresScalarEpilogue(VF.isVector())) &&
+  assert((LoopExitBlock || Cost->requiresScalarEpilogue(VF)) &&
          "multiple exit loop without required epilogue?");
 
   LoopMiddleBlock =
@@ -2976,7 +3005,7 @@ void InnerLoopVectorizer::createVectorLoopSkeleton(StringRef Prefix) {
   // unconditional branch from the middle block to the scalar preheader. In that
   // case, there's no edge from the middle block to exit blocks  and thus no
   // need to update the immediate dominator of the exit blocks.
-  if (Cost->requiresScalarEpilogue(VF.isVector())) {
+  if (Cost->requiresScalarEpilogue(VF)) {
     assert(
         LoopMiddleBlock->getSingleSuccessor() == LoopScalarPreHeader &&
         " middle block should have the scalar preheader as single successor");
@@ -3109,7 +3138,7 @@ BasicBlock *InnerLoopVectorizer::completeLoopSkeleton() {
   //    Thus if tail is to be folded, we know we don't need to run the
   //    remainder and we can use the previous value for the condition (true).
   // 3) Otherwise, construct a runtime check.
-  if (!Cost->requiresScalarEpilogue(VF.isVector()) &&
+  if (!Cost->requiresScalarEpilogue(VF) &&
       !Cost->foldTailByMasking()) {
     // Here we use the same DebugLoc as the scalar loop latch terminator instead
     // of the corresponding compare because they may have ended up with
@@ -3419,7 +3448,7 @@ void InnerLoopVectorizer::fixVectorizedLoop(VPTransformState &State,
   VPRegionBlock *VectorRegion = State.Plan->getVectorLoopRegion();
   VPBasicBlock *LatchVPBB = VectorRegion->getExitingBasicBlock();
   Loop *VectorLoop = LI->getLoopFor(State.CFG.VPBB2IRBB[LatchVPBB]);
-  if (Cost->requiresScalarEpilogue(VF.isVector())) {
+  if (Cost->requiresScalarEpilogue(VF)) {
     // No edge from the middle block to the unique exit block has been inserted
     // and there is nothing to fix from vector loop; phis should have incoming
     // from scalar loop only.
@@ -3936,7 +3965,7 @@ LoopVectorizationCostModel::getDivRemSpeculationCost(Instruction *I,
 }
 
 bool LoopVectorizationCostModel::interleavedAccessCanBeWidened(
-    Instruction *I, ElementCount VF) {
+    Instruction *I, ElementCount VF) const {
   assert(isAccessInterleaved(I) && "Expecting interleaved access.");
   assert(getWideningDecision(I, VF) == CM_Unknown &&
          "Decision should not be set yet.");
@@ -4670,7 +4699,7 @@ ElementCount LoopVectorizationCostModel::getMaximizedVFForTarget(
   // When a scalar epilogue is required, at least one iteration of the scalar
   // loop has to execute. Adjust MaxTripCount accordingly to avoid picking a
   // max VF that results in a dead vector loop.
-  if (MaxTripCount > 0 && requiresScalarEpilogue(true))
+  if (MaxTripCount > 0 && requiresScalarEpilogue(MaxVectorElementCount))
     MaxTripCount -= 1;
 
   if (MaxTripCount && MaxTripCount <= WidestRegisterMinEC &&
@@ -5302,7 +5331,7 @@ LoopVectorizationCostModel::selectInterleaveCount(ElementCount VF,
     // At least one iteration must be scalar when this constraint holds. So the
     // maximum available iterations for interleaving is one less.
     unsigned AvailableTC =
-        requiresScalarEpilogue(VF.isVector()) ? KnownTC - 1 : KnownTC;
+        requiresScalarEpilogue(VF) ? KnownTC - 1 : KnownTC;
 
     // If trip count is known we select between two prospective ICs, where
     // 1) the aggressive IC is capped by the trip count divided by VF
@@ -5331,7 +5360,7 @@ LoopVectorizationCostModel::selectInterleaveCount(ElementCount VF,
   } else if (BestKnownTC && *BestKnownTC > 0) {
     // At least one iteration must be scalar when this constraint holds. So the
     // maximum available iterations for interleaving is one less.
-    unsigned AvailableTC = requiresScalarEpilogue(VF.isVector())
+    unsigned AvailableTC = requiresScalarEpilogue(VF)
                                ? (*BestKnownTC) - 1
                                : *BestKnownTC;
 
@@ -7638,8 +7667,7 @@ EpilogueVectorizerMainLoop::emitIterationCountCheck(BasicBlock *Bypass,
 
   // Generate code to check if the loop's trip count is less than VF * UF of the
   // main vector loop.
-  auto P = Cost->requiresScalarEpilogue(ForEpilogue ? EPI.EpilogueVF.isVector()
-                                                    : VF.isVector())
+  auto P = Cost->requiresScalarEpilogue(ForEpilogue ? EPI.EpilogueVF : VF)
                ? ICmpInst::ICMP_ULE
                : ICmpInst::ICMP_ULT;
 
@@ -7661,7 +7689,7 @@ EpilogueVectorizerMainLoop::emitIterationCountCheck(BasicBlock *Bypass,
 
     // Update dominator for Bypass & LoopExit.
     DT->changeImmediateDominator(Bypass, TCCheckBlock);
-    if (!Cost->requiresScalarEpilogue(EPI.EpilogueVF.isVector()))
+    if (!Cost->requiresScalarEpilogue(EPI.EpilogueVF))
       // For loops with multiple exits, there's no edge from the middle block
       // to exit blocks (as the epilogue must run) and thus no need to update
       // the immediate dominator of the exit blocks.
@@ -7730,7 +7758,7 @@ EpilogueVectorizerEpilogueLoop::createEpilogueVectorizedLoopSkeleton(
 
   DT->changeImmediateDominator(LoopScalarPreHeader,
                                EPI.EpilogueIterationCountCheck);
-  if (!Cost->requiresScalarEpilogue(EPI.EpilogueVF.isVector()))
+  if (!Cost->requiresScalarEpilogue(EPI.EpilogueVF))
     // If there is an epilogue which must run, there's no edge from the
     // middle block to exit blocks  and thus no need to update the immediate
     // dominator of the exit blocks.
@@ -7812,9 +7840,8 @@ EpilogueVectorizerEpilogueLoop::emitMinimumVectorEpilogueIterCountCheck(
 
   // Generate code to check if the loop's trip count is less than VF * UF of the
   // vector epilogue loop.
-  auto P = Cost->requiresScalarEpilogue(EPI.EpilogueVF.isVector())
-               ? ICmpInst::ICMP_ULE
-               : ICmpInst::ICMP_ULT;
+  auto P = Cost->requiresScalarEpilogue(EPI.EpilogueVF) ? ICmpInst::ICMP_ULE
+                                                        : ICmpInst::ICMP_ULT;
 
   Value *CheckMinIters =
       Builder.CreateICmp(P, Count,
diff --git a/llvm/test/Transforms/LoopVectorize/ARM/pointer_iv.ll b/llvm/test/Transforms/LoopVectorize/ARM/pointer_iv.ll
index 2269b774d9f31..1592ddb26f67d 100644
--- a/llvm/test/Transforms/LoopVectorize/ARM/pointer_iv.ll
+++ b/llvm/test/Transforms/LoopVectorize/ARM/pointer_iv.ll
@@ -12,16 +12,16 @@ define hidden void @pointer_phi_v4i32_add1(ptr noalias nocapture readonly %A, pt
 ; CHECK-NEXT:    br label [[VECTOR_BODY:%.*]]
 ; CHECK:       vector.body:
 ; CHECK-NEXT:    [[INDEX:%.*]] = phi i32 [ 0, [[ENTRY:%.*]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
-; CHECK-NEXT:    [[TMP0:%.*]] = shl i32 [[INDEX]], 2
-; CHECK-NEXT:    [[NEXT_GEP:%.*]] = getelementptr i8, ptr [[A:%.*]], i32 [[TMP0]]
-; CHECK-NEXT:    [[TMP1:%.*]] = shl i32 [[INDEX]], 2
-; CHECK-NEXT:    [[NEXT_GEP4:%.*]] = getelementptr i8, ptr [[B:%.*]], i32 [[TMP1]]
+; CHECK-NEXT:    [[OFFSET_IDX:%.*]] = shl i32 [[INDEX]], 2
+; CHECK-NEXT:    [[NEXT_GEP:%.*]] = getelementptr i8, ptr [[A:%.*]], i32 [[OFFSET_IDX]]
+; CHECK-NEXT:    [[OFFSET_IDX4:%.*]] = shl i32 [[INDEX]], 2
+; CHECK-NEXT:    [[NEXT_GEP5:%.*]] = getelementptr i8, ptr [[B:%.*]], i32 [[OFFSET_IDX4]]
 ; CHECK-NEXT:    [[WIDE_LOAD:%.*]] = load <4 x i32>, ptr [[NEXT_GEP]], align 4
-; CHECK-NEXT:    [[TMP2:%.*]] = add nsw <4 x i32> [[WIDE_LOAD]], [[BROADCAST_SPLAT]]
-; CHECK-NEXT:    store <4 x i32> [[TMP2]], ptr [[NEXT_GEP4]], align 4
+; CHECK-NEXT:    [[TMP0:%.*]] = add nsw <4 x i32> [[WIDE_LOAD]], [[BROADCAST_SPLAT]]
+; CHECK-NEXT:    store <4 x i32> [[TMP0]], ptr [[NEXT_GEP5]], align 4
 ; CHECK-NEXT:    [[INDEX_NEXT]] = add nuw i32 [[INDEX]], 4
-; CHECK-NEXT:    [[TMP3:%.*]] = icmp eq i32 [[INDEX_NEXT]], 1000
-; CHECK-NEXT:    br i1 [[TMP3]], label [[END:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
+; CHECK-NEXT:    [[TMP1:%.*]] = icmp eq i32 [[INDEX_NEXT]], 1000
+; CHECK-NEXT:    br i1 [[TMP1]], label [[END:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
 ; CHECK:       end:
 ; CHECK-NEXT:    ret void
 ;
@@ -53,24 +53,24 @@ define hidden void @pointer_phi_v4i32_add2(ptr noalias nocapture readonly %A, pt
 ; CHECK-NEXT:    br label [[VECTOR_BODY:%.*]]
 ; CHECK:       vector.body:
 ; CHECK-NEXT:    [[INDEX:%.*]] = phi i32 [ 0, [[ENTRY:%.*]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
-; CHECK-NEXT:    [[TMP0:%.*]] = shl i32 [[INDEX]], 3
-; CHECK-NEXT:    [[NEXT_GEP:%.*]] = getelementptr i8, ptr [[A]], i32 [[TMP0]]
-; CHECK-NEXT:    [[TMP1:%.*]] = shl i32 [[INDEX]], 2
-; CHECK-NEXT:    [[NEXT_GEP4:%.*]] = getelementptr i8, ptr [[B]], i32 [[TMP1]]
+; CHECK-NEXT:    [[OFFSET_IDX:%.*]] = shl i32 [[INDEX]], 3
+; CHECK-NEXT:    [[NEXT_GEP:%.*]] = getelementptr i8, ptr [[A]], i32 [[OFFSET_IDX]]
+; CHECK-NEXT:    [[OFFSET_IDX4:%.*]] = shl i32 [[INDEX]], 2
+; CHECK-NEXT:    [[NEXT_GEP5:%.*]] = getelementptr i8, ptr [[B]], i32 [[OFFSET_IDX4]]
 ; CHECK-NEXT:    [[WIDE_VEC:%.*]] = load <8 x i32>, ptr [[NEXT_GEP]], align 4
 ; CHECK-NEXT:    [[STRIDED_VEC:%.*]] = shufflevector <8 x i32> [[WIDE_VEC]], <8 x i32> poison, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
-; CHECK-NEXT:    [[TMP2:%.*]] = add nsw <4 x i32> [[STRIDED_VEC]], [[BROADCAST_SPLAT]]
-; CHECK-NEXT:    store <4 x i32> [[TMP2]], ptr [[NEXT_GEP4]], align 4
+; CHECK-NEXT:    [[TMP0:%.*]] = add nsw <4 x i32> [[STRIDED_VEC]], [[BROADCAST_SPLAT]]
+; CHECK-NEXT:    store <4 x i32> [[TMP0]], ptr [[NEXT_GEP5]], align 4
 ; CHECK-NEXT:    [[INDEX_NEXT]] = add nuw i32 [[INDEX]], 4
-; CHECK-NEXT:    [[TMP3:%.*]] = icmp eq i32 [[INDEX_NEXT]], 996
-; CHECK-NEXT:    br i1 [[TMP3]], label [[FOR_BODY:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP3:![0-9]+]]
+; CHECK-NEXT:    [[TMP1:%.*]] = icmp eq i32 [[INDEX_NEXT]], 996
+; CHECK-NEXT:    br i1 [[TMP1]], label [[FOR_BODY:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP3:![0-9]+]]
 ; CHECK:       for.body:
 ; CHECK-NEXT:    [[A_ADDR_09:%.*]] = phi ptr [ [[ADD_PTR:%.*]], [[FOR_BODY]] ], [ [[IND_END]], [[VECTOR_BODY]] ]
 ; CHECK-NEXT:    [[I_08:%.*]] = phi i32 [ [[INC:%.*]], [[FOR_BODY]] ], [ 996, [[VECTOR_BODY]] ]
 ; CHECK-NEXT:    [[B_ADDR_07:%.*]] = phi ptr [ [[INCDEC_PTR:%.*]], [[FOR_BODY]] ], [ [[IND_END2]], [[VECTOR_BODY]] ]
-; CHECK-NEXT:    [[TMP4:%.*]] = load i32, ptr [[A_ADDR_09]], align 4
+; CHECK-NEXT:    [[TMP2:%.*]] = load i32, ptr [[A_ADDR_09]], align 4
 ; CHECK-NEXT:    [[ADD_PTR]] = getelementptr inbounds i8, ptr [[A_ADDR_09]], i32 8
-; CHECK-NEXT:    [[ADD:%.*]] = add nsw i32 [[TMP4]], [[Y]]
+; CHECK-NEXT:    [[ADD:%.*]] = add nsw i32 [[TMP2]], [[Y]]
 ; CHECK-NEXT:    store i32 [[ADD]], ptr [[B_ADDR_07]], align 4
 ; CHECK-NEXT:    [[INCDEC_PTR]] = getelementptr inbounds i8, ptr [[B_ADDR_07]], i32 4
 ; CHECK-NEXT:    [[INC]] = add nuw nsw i32 [[I_08]], 1
@@ -100,36 +100,22 @@ end:
 define hidden void @pointer_phi_v4i32_add3(ptr noalias nocapture readonly %A, ptr noalias nocapture %B, i32 %y) {
 ; CHECK-LABEL: @pointer_phi_v4i32_add3(
 ; CHECK-NEXT:  entry:
-; CHECK-NEXT:    [[IND_END:%.*]] = getelementptr i8, ptr [[A:%.*]], i32 11952
-; CHECK-NEXT:    [[IND_END2:%.*]] = getelementptr i8, ptr [[B:%.*]], i32 3984
 ; CHECK-NEXT:    [[BROADCAST_SPLATINSERT:%.*]] = insertelement <4 x i32> poison, i32 [[Y:%.*]], i64 0
 ; CHECK-NEXT:    [[BROADCAST_SPLAT:%.*]] = shufflevector <4 x i32> [[BROADCAST_SPLATINSERT]], <4 x i32> poison, <4 x i32> zeroinitializer
 ; CHECK-NEXT:    br label [[VECTOR_BODY:%.*]]
 ; CHECK:       vector.body:
-; CHECK-NEXT:    [[POINTER_PHI:%.*]] = phi ptr [ [[A]], [[ENTRY:%.*]] ], [ [[PTR_IND:%.*]], [[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[POINTER_PHI:%.*]] = phi ptr [ [[A:%.*]], [[ENTRY:%.*]] ], [ [[PTR_IND:%.*]], [[VECTOR_BODY]] ]
 ; CHECK-NEXT:    [[INDEX:%.*]] = phi i32 [ 0, [[ENTRY]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
 ; CHECK-NEXT:    [[TMP0:%.*]] = getelementptr i8, ptr [[POINTER_PHI]], <4 x i32> <i32 0, i32 12, i32 24, i32 36>
-; CHECK-NEXT:    [[TMP1:%.*]] = shl i32 [[INDEX]], 2
-; CHECK-NEXT:    [[NEXT_GEP:%.*]] = getelementptr i8, ptr [[B]], i32 [[TMP1]]
+; CHECK-NEXT:    [[OFFSET_IDX:%.*]] = shl i32 [[INDEX]], 2
+; CHECK-NEXT:    [[NEXT_GEP:%.*]] = getelementptr i8, ptr [[B:%.*]], i32 [[OFFSET_IDX]]
 ; CHECK-NEXT:    [[WIDE_MASKED_GATHER:%.*]] = call <4 x i32> @llvm.masked.gather.v4i32.v4p0(<4 x ptr> [[TMP0]], i32 4, <4 x i1> <i1 true, i1 true, i1 true, i1 true>, <4 x i32> poison)
-; CHECK-NEXT:    [[TMP2:%.*]] = add nsw <4 x i32> [[WIDE_MASKED_GATHER]], [[BROADCAST_SPLAT]]
-; CHECK-NEXT:    store <4 x i32> [[TMP2]], ptr [[NEXT_GEP]], align 4
+; CHECK-NEXT:    [[TMP1:%.*]] = add nsw <4 x i32> [[WIDE_MASKED_GATHER]], [[BROADCAST_SPLAT]]
+; CHECK-NEXT:    store <4 x i32> [[TMP1]], ptr [[NEXT_GEP]], align 4
 ; CHECK-NEXT:    [[INDEX_NEXT]] = add nuw i32 [[INDEX]], 4
 ; CHECK-NEXT:    [[PTR_IND]] = getelementptr i8, ptr [[POINTER_PHI]], i32 48
-; CHECK-NEXT:    [[TMP3:%.*]] = icmp eq i32 [[INDEX_NEXT]], 996
-; CHECK-NEXT:    br i1 [[TMP3]], label [[FOR_BODY:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP5:![0-9]+]]
-; CHECK:       for.body:
-; CHECK-NEXT:    [[A_ADDR_09:%.*]] = phi ptr [ [[ADD_PTR:%.*]], [[FOR_BODY]] ], [ [[IND_END]], [[VECTOR_BODY]] ]
-; CHECK-NEXT:    [[I_08:%.*]] = phi i32 [ [[INC:%.*]], [[FOR_BODY]] ], [ 996, [[VECTOR_BODY]] ]
-; CHECK-NEXT:    [[B_ADDR_07:%.*]] = phi ptr [ [[INCDEC_PTR:%.*]], [[FOR_BODY]] ], [ [[IND_END2]], [[VECTOR_BODY]] ]
-; CHECK-NEXT:    [[TMP4:%.*]] = load i32, ptr [[A_ADDR_09]], align 4
-; CHECK-NEXT:    [[ADD_PTR]] = getelementptr inbounds i8, ptr [[A_ADDR_09]], i32 12
-; CHECK-NEXT:    [[ADD:%.*]] = add nsw i32 [[TMP4]], [[Y]]
-; CHECK-NEXT:    store i32 [[ADD]], ptr [[B_ADDR_07]], align 4
-; CHECK-NEXT:    [[INCDEC_PTR]] = getelementptr inbounds i8, ptr [[B_ADDR_07]], i32 4
-; CHECK-NEXT:    [[INC]] = add nuw nsw i32 [[I_08]], 1
-; CHECK-NEXT:    [[EXITCOND:%.*]] = icmp eq i32 [[INC]], 1000
-; CHECK-NEXT:    br i1 [[EXITCOND]], label [[END:%.*]], label [[FOR_BODY]], !llvm.loop [[LOOP6:![0-9]+]]
+; CHECK-NEXT:    [[TMP2:%.*]] = icmp eq i32 [[INDEX_NEXT]], 1000
+; CHECK-NEXT:    br i1 [[TMP2]], label [[END:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP5:![0-9]+]]
 ; CHECK:       end:
 ; CHECK-NEXT:    ret void
 ;
@@ -160,16 +146,16 @@ define hidden void @pointer_phi_v8i16_add1(ptr noalias nocapture readonly %A, pt
 ; CHECK-NEXT:    br label [[VECTOR_BODY:%.*]]
 ; CHECK:       vector.body:
 ; CHECK-NEXT:    [[INDEX:%.*]] = phi i32 [ 0, [[ENTRY:%.*]] ], [ [[INDEX_NE...
[truncated]

@npanchen npanchen requested a review from alexey-bataev June 24, 2024 20:00
@@ -1448,29 +1448,58 @@ class LoopVectorizationCostModel {
/// Returns true if \p I is a memory instruction in an interleaved-group
/// of memory accesses that can be vectorized with wide vector loads/stores
/// and shuffles.
-  bool interleavedAccessCanBeWidened(Instruction *I, ElementCount VF);
+  bool interleavedAccessCanBeWidened(Instruction *I, ElementCount VF) const;
Member

Better to make it a separate NFC patch.

Contributor

Would be good to also include the additional DEBUG printing as part of the NFC patch

Comment on lines +1479 to +1463
if (VF.isVector()) {
if (InterleaveInfo.requiresScalarEpilogue()) {
Member

Suggested change
-    if (VF.isVector()) {
-      if (InterleaveInfo.requiresScalarEpilogue()) {
+    if (VF.isVector() && InterleaveInfo.hasGroups() && InterleaveInfo.requiresScalarEpilogue()) {

Contributor Author

Checking the number of groups should not be needed: requiresScalarEpilogue is false when no groups were constructed.

Contributor

Yes, so the below code should be good enough:

Suggested change
-    if (VF.isVector()) {
-      if (InterleaveInfo.requiresScalarEpilogue()) {
+    if (VF.isVector() && InterleaveInfo.requiresScalarEpilogue()) {

@@ -8,7 +8,7 @@ target datalayout = "e-m:o-i64:64-i128:128-n32:64-S128"
; VF/IC combination.
define void @test_tc_less_than_16(ptr %A, i64 %N) {
; CHECK: LV: Scalarizing: %cmp =
-; CHECK-NEXT: VPlan 'Initial VPlan for VF={8},UF>=1' {
+; CHECK: VPlan 'Initial VPlan for VF={8},UF>=1' {
Collaborator

Should the additional line that broke the -NEXT be checked too?

Contributor Author

The purpose of the test is to check the VPlan's representation.

; CHECK-NEXT: [[TMP92:%.*]] = load double, ptr [[TMP60]], align 8
; CHECK-NEXT: [[TMP93:%.*]] = load double, ptr [[TMP61]], align 8
; CHECK-NEXT: [[TMP94:%.*]] = load double, ptr [[TMP62]], align 8
; CHECK-NEXT: [[TMP95:%.*]] = load double, ptr [[TMP63]], align 8
Collaborator

Are these loads dead?

Contributor Author

yes

Contributor

@Mel-Chen Mel-Chen left a comment

That's a great discovery. Here's another possible approach. Perhaps we could directly invalidate groups where the WideningDecision is not CM_Interleave within LoopVectorizationCostModel::setCostBasedWideningDecision. The relevant implementation can be referenced from InterleavedAccessInfo::invalidateGroupsRequiringScalarEpilogue().

Comment on lines 1481 to 1496
// Make sure interleaved groups that require scalar epilogue will be
// widened.
for (auto *G : InterleaveInfo.getInterleaveGroups()) {
if (!G->requiresScalarEpilogue())
continue;

Instruction *I = G->getMember(0);
InstWidening Decision = getWideningDecision(I, VF);
if (Decision == CM_Interleave ||
(Decision == CM_Unknown &&
interleavedAccessCanBeWidened(G->getMember(0), VF))) {
LLVM_DEBUG(dbgs() << "LV: Loop with VF = " << VF
<< " requires scalar epilogue: vectorizable "
"interleaved group\n");
return true;
}
Contributor

Could you use any_of to implement this check?

Contributor Author

Will do. I hesitated to use auto *G, as the proper type InterleaveGroup<Instruction> is quite large
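
For reference, a rough sketch of that check folded into any_of (an untested sketch using the names from this patch, not the final code):

  // Sketch: does any interleave group that needs a scalar epilogue actually
  // get widened (or could it still be widened) for this VF?
  return any_of(InterleaveInfo.getInterleaveGroups(),
                [&](const InterleaveGroup<Instruction> *G) {
                  if (!G->requiresScalarEpilogue())
                    return false;
                  Instruction *I = G->getMember(0);
                  InstWidening Decision = getWideningDecision(I, VF);
                  return Decision == CM_Interleave ||
                         (Decision == CM_Unknown &&
                          interleavedAccessCanBeWidened(I, VF));
                });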

Comment on lines 1490 to 1491
(Decision == CM_Unknown &&
interleavedAccessCanBeWidened(G->getMember(0), VF))) {
Contributor

I can't imagine why this condition would be needed. Could you explain more?

Contributor Author

There's no expectation about when requiresScalarEpilogue is called: it can be called both before and after the widening decision is made. The latter check makes sure requiresScalarEpilogue returns the same value when the decision has not yet been made. Guarding on Decision == CM_Unknown is necessary because interleavedAccessCanBeWidened is expected to be called only while the decision is still unknown.
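
For context, the precondition referred to here is the assertion at the top of interleavedAccessCanBeWidened, quoted from the diff above:

  assert(isAccessInterleaved(I) && "Expecting interleaved access.");
  assert(getWideningDecision(I, VF) == CM_Unknown &&
         "Decision should not be set yet.");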

@nikolaypanchenko
Contributor Author

> That's a great discovery. Here's another possible approach. Perhaps we could directly invalidate groups where the WideningDecision is not CM_Interleave within LoopVectorizationCostModel::setCostBasedWideningDecision. The relevant implementation can be referenced from InterleavedAccessInfo::invalidateGroupsRequiringScalarEpilogue().

I was considering that option too, but groups are constructed for the entire loop, and a group can be valid with one VF but not with another. That said, invalidation is illegal while VPlans with and without interleaved groups are still in flight; invalidation can take place only once a specific VPlan is selected.

LV should check that all interleave groups that require a scalar epilogue can actually be widened. Otherwise, if an InterleavedAccessGroup cannot be widened but does have a tail element, the current logic in LV still requires emitting a scalar epilogue, which leads to inefficient vector code.
@nikolaypanchenko nikolaypanchenko force-pushed the npanchen/pr/scalar_epilogue_iai_tail branch from 1d2c4fe to fa9e2e9 on June 26, 2024 22:16
@Mel-Chen
Contributor

> I was considering that option too, but groups are constructed for the entire loop, and a group can be valid with one VF but not with another. That said, invalidation is illegal while VPlans with and without interleaved groups are still in flight; invalidation can take place only once a specific VPlan is selected.

Makes sense, thanks.

@nikolaypanchenko
Contributor Author

@fhahn ping

Instruction *I = Group->getMember(0);
InstWidening Decision = getWideningDecision(I, VF);
return Decision == CM_Interleave ||
(Decision == CM_Unknown &&
Contributor

Is it possible that the decision later changes, and with it the result of requiresScalarEpilogue? As mentioned for #96681, would it be possible to compute whether a scalar epilogue is needed once and set a flag, to avoid the result potentially changing later?
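
For illustration only, one shape such a cached flag could take (hypothetical member and function names, not part of this patch or of LLVM):

  // Hypothetical sketch: compute the per-VF answer once, after the widening
  // decisions for VF are final, then only read the cached value so the result
  // cannot change later.
  DenseMap<ElementCount, bool> ScalarEpilogueRequiredForVF; // hypothetical cache

  void recordRequiresScalarEpilogue(ElementCount VF, bool Required) {
    ScalarEpilogueRequiredForVF[VF] = Required;
  }

  bool requiresScalarEpilogue(ElementCount VF) const {
    auto It = ScalarEpilogueRequiredForVF.find(VF);
    assert(It != ScalarEpilogueRequiredForVF.end() &&
           "scalar-epilogue decision not computed for this VF yet");
    return It->second;
  }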
