[LV] Don't predicate divs with invariant divisor when folding tail #98904

Merged
merged 16 commits into llvm:main from lv-uniform-tailfold on Jul 25, 2024

Conversation

@fhahn fhahn commented Jul 15, 2024

When folding the tail, at least one of the lanes must execute
unconditionally. If the divisor is loop-invariant no predication is
needed, as predication would not prevent the divide-by-0 on the executed
lane.

Depends on #98892.
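
For illustration only (not from the patch), here is a minimal C-style sketch of the situation, modeled loosely on the sdiv_feeding_gep test added below: the divisor never changes inside the loop, so masking the division under tail folding buys no safety on the lane that is guaranteed to execute.

    // Hypothetical analogue of the affected loops; names are illustrative.
    // Because 'd' is loop-invariant, a divide-by-0 would also occur on the
    // unconditionally executed lane, so the sdiv can stay unpredicated.
    void store_quotient(long *dst, long m, long d, long n) {
      for (long i = 0; i < n; ++i)
        dst[i] = m / d; // loop-invariant divisor; no predication needed
    }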

fhahn added 2 commits July 15, 2024 14:39
Any instruction marked as uniform will result in a uniform
VPReplicateRecipe. If it requires predication, it will be placed in a
replicate region, even if isScalarWithPredication returns false.

Check isPredicatedInst instead of isScalarWithPredication to avoid
generating uniform VPReplicateRecipes placed inside a replicate region.
This fixes an assertion when using scalable VFs.

Fixes llvm#80416.
Fixes llvm#94328.

When folding the tail, at least one of the lanes must execute
unconditionally. If the divisor is loop-invariant no predication is
needed, as predication would not prevent the divide-by-0 on the executed
lane.

Depends on llvm#98892.

llvmbot commented Jul 15, 2024

@llvm/pr-subscribers-llvm-transforms

Author: Florian Hahn (fhahn)

Changes

When folding the tail, at least one of the lanes must execute
unconditionally. If the divisor is loop-invariant no predication is
needed, as predication would not prevent the divide-by-0 on the executed
lane.

Depends on #98892.


Patch is 45.04 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/98904.diff

5 Files Affected:

  • (modified) llvm/lib/Transforms/Vectorize/LoopVectorize.cpp (+33-15)
  • (modified) llvm/lib/Transforms/Vectorize/VPRecipeBuilder.h (+1-1)
  • (added) llvm/test/Transforms/LoopVectorize/AArch64/divs-with-scalable-vfs.ll (+380)
  • (modified) llvm/test/Transforms/LoopVectorize/X86/consecutive-ptr-uniforms.ll (+1-1)
  • (modified) llvm/test/Transforms/LoopVectorize/X86/divs-with-tail-folding.ll (+30-92)
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index 5520baef7152d..f5a860e980c74 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -1402,7 +1402,8 @@ class LoopVectorizationCostModel {
   /// Returns true if \p I is an instruction that needs to be predicated
   /// at runtime.  The result is independent of the predication mechanism.
   /// Superset of instructions that return true for isScalarWithPredication.
-  bool isPredicatedInst(Instruction *I) const;
+  bool isPredicatedInst(Instruction *I, ElementCount VF,
+                        bool IsKnownUniform = false) const;
 
   /// Return the costs for our two available strategies for lowering a
   /// div/rem operation which requires speculating at least one lane.
@@ -3637,7 +3638,7 @@ void LoopVectorizationCostModel::collectLoopScalars(ElementCount VF) {
 
 bool LoopVectorizationCostModel::isScalarWithPredication(
     Instruction *I, ElementCount VF) const {
-  if (!isPredicatedInst(I))
+  if (!isPredicatedInst(I, VF))
     return false;
 
   // Do we have a non-scalar lowering for this predicated
@@ -3676,7 +3677,9 @@ bool LoopVectorizationCostModel::isScalarWithPredication(
   }
 }
 
-bool LoopVectorizationCostModel::isPredicatedInst(Instruction *I) const {
+bool LoopVectorizationCostModel::isPredicatedInst(Instruction *I,
+                                                  ElementCount VF,
+                                                  bool IsKnownUniform) const {
   if (!blockNeedsPredicationForAnyReason(I->getParent()))
     return false;
 
@@ -3710,6 +3713,15 @@ bool LoopVectorizationCostModel::isPredicatedInst(Instruction *I) const {
   case Instruction::SDiv:
   case Instruction::SRem:
   case Instruction::URem:
+    // When folding the tail, at least one of the lanes must execute
+    // unconditionally. If the divisor is loop-invariant no predication is
+    // needed, as predication would not prevent the divide-by-0 on the executed
+    // lane.
+    if (foldTailByMasking() && !Legal->blockNeedsPredication(I->getParent()) &&
+        TheLoop->isLoopInvariant(I->getOperand(1)) &&
+        (IsKnownUniform || isUniformAfterVectorization(I, VF)))
+      return false;
+
     // TODO: We can use the loop-preheader as context point here and get
     // context sensitive reasoning
     return !isSafeToSpeculativelyExecute(I);
@@ -3907,7 +3919,7 @@ void LoopVectorizationCostModel::collectLoopUniforms(ElementCount VF) {
   SetVector<Instruction *> Worklist;
 
   // Add uniform instructions demanding lane 0 to the worklist. Instructions
-  // that are scalar with predication must not be considered uniform after
+  // that are require predication must not be considered uniform after
   // vectorization, because that would create an erroneous replicating region
   // where only a single instance out of VF should be formed.
   // TODO: optimize such seldom cases if found important, see PR40816.
@@ -3917,9 +3929,10 @@ void LoopVectorizationCostModel::collectLoopUniforms(ElementCount VF) {
                         << *I << "\n");
       return;
     }
-    if (isScalarWithPredication(I, VF)) {
-      LLVM_DEBUG(dbgs() << "LV: Found not uniform being ScalarWithPredication: "
-                        << *I << "\n");
+    if (isPredicatedInst(I, VF, true)) {
+      LLVM_DEBUG(
+          dbgs() << "LV: Found not uniform due to requiring predication: " << *I
+                 << "\n");
       return;
     }
     LLVM_DEBUG(dbgs() << "LV: Found uniform instruction: " << *I << "\n");
@@ -5633,7 +5646,7 @@ bool LoopVectorizationCostModel::useEmulatedMaskMemRefHack(Instruction *I,
   // from moving "masked load/store" check from legality to cost model.
   // Masked Load/Gather emulation was previously never allowed.
   // Limited number of Masked Store/Scatter emulation was allowed.
-  assert((isPredicatedInst(I)) &&
+  assert((isPredicatedInst(I, VF)) &&
          "Expecting a scalar emulated instruction");
   return isa<LoadInst>(I) ||
          (isa<StoreInst>(I) &&
@@ -5912,7 +5925,7 @@ LoopVectorizationCostModel::getMemInstScalarizationCost(Instruction *I,
   // If we have a predicated load/store, it will need extra i1 extracts and
   // conditional branches, but may not be executed for each vector lane. Scale
   // the cost by the probability of executing the predicated block.
-  if (isPredicatedInst(I)) {
+  if (isPredicatedInst(I, VF)) {
     Cost /= getReciprocalPredBlockProb();
 
     // Add the cost of an i1 extract and a branch
@@ -6772,7 +6785,7 @@ LoopVectorizationCostModel::getInstructionCost(Instruction *I,
   case Instruction::SDiv:
   case Instruction::URem:
   case Instruction::SRem:
-    if (VF.isVector() && isPredicatedInst(I)) {
+    if (VF.isVector() && isPredicatedInst(I, VF)) {
       const auto [ScalarCost, SafeDivisorCost] = getDivRemSpeculationCost(I, VF);
       return isDivRemScalarWithPredication(ScalarCost, SafeDivisorCost) ?
         ScalarCost : SafeDivisorCost;
@@ -8444,7 +8457,7 @@ bool VPRecipeBuilder::shouldWiden(Instruction *I, VFRange &Range) const {
 
 VPWidenRecipe *VPRecipeBuilder::tryToWiden(Instruction *I,
                                            ArrayRef<VPValue *> Operands,
-                                           VPBasicBlock *VPBB) {
+                                           VPBasicBlock *VPBB, VFRange &Range) {
   switch (I->getOpcode()) {
   default:
     return nullptr;
@@ -8454,7 +8467,10 @@ VPWidenRecipe *VPRecipeBuilder::tryToWiden(Instruction *I,
   case Instruction::URem: {
     // If not provably safe, use a select to form a safe divisor before widening the
     // div/rem operation itself.  Otherwise fall through to general handling below.
-    if (CM.isPredicatedInst(I)) {
+    bool IsPredicated = LoopVectorizationPlanner::getDecisionAndClampRange(
+        [&](ElementCount VF) -> bool { return CM.isPredicatedInst(I, VF); },
+        Range);
+    if (IsPredicated) {
       SmallVector<VPValue *> Ops(Operands.begin(), Operands.end());
       VPValue *Mask = getBlockInMask(I->getParent());
       VPValue *One =
@@ -8504,8 +8520,8 @@ VPReplicateRecipe *VPRecipeBuilder::handleReplication(Instruction *I,
       [&](ElementCount VF) { return CM.isUniformAfterVectorization(I, VF); },
       Range);
 
-  bool IsPredicated = CM.isPredicatedInst(I);
-
+  bool IsPredicated = LoopVectorizationPlanner::getDecisionAndClampRange(
+      [&](ElementCount VF) { return CM.isPredicatedInst(I, VF); }, Range);
   // Even if the instruction is not marked as uniform, there are certain
   // intrinsic calls that can be effectively treated as such, so we check for
   // them here. Conservatively, we only do this for scalable vectors, since
@@ -8625,7 +8641,7 @@ VPRecipeBuilder::tryToCreateWidenRecipe(Instruction *Instr,
                                  *CI);
   }
 
-  return tryToWiden(Instr, Operands, VPBB);
+  return tryToWiden(Instr, Operands, VPBB, Range);
 }
 
 void LoopVectorizationPlanner::buildVPlansWithVPRecipes(ElementCount MinVF,
@@ -9482,6 +9498,8 @@ void VPInterleaveRecipe::execute(VPTransformState &State) {
 void VPReplicateRecipe::execute(VPTransformState &State) {
   Instruction *UI = getUnderlyingInstr();
   if (State.Instance) { // Generate a single instance.
+    assert((State.VF.isScalar() || !isUniform()) &&
+           "uniform recipe shouldn't be predicated");
     assert(!State.VF.isScalable() && "Can't scalarize a scalable vector");
     State.ILV->scalarizeInstruction(UI, this, *State.Instance, State);
     // Insert scalar instance packing it into a vector.
diff --git a/llvm/lib/Transforms/Vectorize/VPRecipeBuilder.h b/llvm/lib/Transforms/Vectorize/VPRecipeBuilder.h
index b4c7ab02f928f..0c7c47258bf9e 100644
--- a/llvm/lib/Transforms/Vectorize/VPRecipeBuilder.h
+++ b/llvm/lib/Transforms/Vectorize/VPRecipeBuilder.h
@@ -100,7 +100,7 @@ class VPRecipeBuilder {
   /// if it can. The function should only be called if the cost-model indicates
   /// that widening should be performed.
   VPWidenRecipe *tryToWiden(Instruction *I, ArrayRef<VPValue *> Operands,
-                            VPBasicBlock *VPBB);
+                            VPBasicBlock *VPBB, VFRange &Range);
 
 public:
   VPRecipeBuilder(VPlan &Plan, Loop *OrigLoop, const TargetLibraryInfo *TLI,
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/divs-with-scalable-vfs.ll b/llvm/test/Transforms/LoopVectorize/AArch64/divs-with-scalable-vfs.ll
new file mode 100644
index 0000000000000..e1ce8d430e741
--- /dev/null
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/divs-with-scalable-vfs.ll
@@ -0,0 +1,380 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5
+; RUN: opt -p loop-vectorize -mtriple aarch64 -mcpu=neoverse-v1 -S %s | FileCheck %s
+
+; Test case for https://github.com/llvm/llvm-project/issues/94328.
+define void @sdiv_feeding_gep(ptr %dst, i32 %x, i64 %M, i64 %conv6, i64 %N) {
+; CHECK-LABEL: define void @sdiv_feeding_gep(
+; CHECK-SAME: ptr [[DST:%.*]], i32 [[X:%.*]], i64 [[M:%.*]], i64 [[CONV6:%.*]], i64 [[N:%.*]]) #[[ATTR0:[0-9]+]] {
+; CHECK-NEXT:  [[ENTRY:.*]]:
+; CHECK-NEXT:    [[CONV61:%.*]] = zext i32 [[X]] to i64
+; CHECK-NEXT:    [[TMP0:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT:    [[TMP1:%.*]] = mul i64 [[TMP0]], 4
+; CHECK-NEXT:    [[TMP2:%.*]] = call i64 @llvm.umax.i64(i64 8, i64 [[TMP1]])
+; CHECK-NEXT:    [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 [[N]], [[TMP2]]
+; CHECK-NEXT:    br i1 [[MIN_ITERS_CHECK]], label %[[SCALAR_PH:.*]], label %[[VECTOR_SCEVCHECK:.*]]
+; CHECK:       [[VECTOR_SCEVCHECK]]:
+; CHECK-NEXT:    [[TMP3:%.*]] = add i64 [[N]], -1
+; CHECK-NEXT:    [[TMP4:%.*]] = trunc i64 [[TMP3]] to i32
+; CHECK-NEXT:    [[TMP5:%.*]] = icmp slt i32 [[TMP4]], 0
+; CHECK-NEXT:    [[TMP6:%.*]] = icmp ugt i64 [[TMP3]], 4294967295
+; CHECK-NEXT:    [[TMP7:%.*]] = or i1 [[TMP5]], [[TMP6]]
+; CHECK-NEXT:    br i1 [[TMP7]], label %[[SCALAR_PH]], label %[[VECTOR_PH:.*]]
+; CHECK:       [[VECTOR_PH]]:
+; CHECK-NEXT:    [[TMP8:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT:    [[TMP9:%.*]] = mul i64 [[TMP8]], 4
+; CHECK-NEXT:    [[N_MOD_VF:%.*]] = urem i64 [[N]], [[TMP9]]
+; CHECK-NEXT:    [[N_VEC:%.*]] = sub i64 [[N]], [[N_MOD_VF]]
+; CHECK-NEXT:    [[TMP10:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT:    [[TMP11:%.*]] = mul i64 [[TMP10]], 4
+; CHECK-NEXT:    br label %[[VECTOR_BODY:.*]]
+; CHECK:       [[VECTOR_BODY]]:
+; CHECK-NEXT:    [[INDEX:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[TMP12:%.*]] = add i64 [[INDEX]], 0
+; CHECK-NEXT:    [[TMP13:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT:    [[TMP14:%.*]] = mul i64 [[TMP13]], 2
+; CHECK-NEXT:    [[TMP15:%.*]] = add i64 [[TMP14]], 0
+; CHECK-NEXT:    [[TMP16:%.*]] = mul i64 [[TMP15]], 1
+; CHECK-NEXT:    [[TMP17:%.*]] = add i64 [[INDEX]], [[TMP16]]
+; CHECK-NEXT:    [[TMP18:%.*]] = sdiv i64 [[M]], [[CONV6]]
+; CHECK-NEXT:    [[TMP19:%.*]] = sdiv i64 [[M]], [[CONV6]]
+; CHECK-NEXT:    [[TMP20:%.*]] = trunc i64 [[TMP18]] to i32
+; CHECK-NEXT:    [[TMP21:%.*]] = trunc i64 [[TMP19]] to i32
+; CHECK-NEXT:    [[TMP22:%.*]] = mul i64 [[TMP18]], [[CONV61]]
+; CHECK-NEXT:    [[TMP23:%.*]] = mul i64 [[TMP19]], [[CONV61]]
+; CHECK-NEXT:    [[TMP24:%.*]] = sub i64 [[TMP12]], [[TMP22]]
+; CHECK-NEXT:    [[TMP25:%.*]] = sub i64 [[TMP17]], [[TMP23]]
+; CHECK-NEXT:    [[TMP26:%.*]] = trunc i64 [[TMP24]] to i32
+; CHECK-NEXT:    [[TMP27:%.*]] = trunc i64 [[TMP25]] to i32
+; CHECK-NEXT:    [[TMP28:%.*]] = mul i32 [[X]], [[TMP20]]
+; CHECK-NEXT:    [[TMP29:%.*]] = mul i32 [[X]], [[TMP21]]
+; CHECK-NEXT:    [[TMP30:%.*]] = add i32 [[TMP28]], [[TMP26]]
+; CHECK-NEXT:    [[TMP31:%.*]] = add i32 [[TMP29]], [[TMP27]]
+; CHECK-NEXT:    [[TMP32:%.*]] = sext i32 [[TMP30]] to i64
+; CHECK-NEXT:    [[TMP33:%.*]] = sext i32 [[TMP31]] to i64
+; CHECK-NEXT:    [[TMP34:%.*]] = getelementptr double, ptr [[DST]], i64 [[TMP32]]
+; CHECK-NEXT:    [[TMP35:%.*]] = getelementptr double, ptr [[DST]], i64 [[TMP33]]
+; CHECK-NEXT:    [[TMP36:%.*]] = getelementptr double, ptr [[TMP34]], i32 0
+; CHECK-NEXT:    [[TMP37:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT:    [[TMP38:%.*]] = mul i64 [[TMP37]], 2
+; CHECK-NEXT:    [[TMP39:%.*]] = getelementptr double, ptr [[TMP34]], i64 [[TMP38]]
+; CHECK-NEXT:    store <vscale x 2 x double> zeroinitializer, ptr [[TMP36]], align 8
+; CHECK-NEXT:    store <vscale x 2 x double> zeroinitializer, ptr [[TMP39]], align 8
+; CHECK-NEXT:    [[INDEX_NEXT]] = add nuw i64 [[INDEX]], [[TMP11]]
+; CHECK-NEXT:    [[TMP40:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
+; CHECK-NEXT:    br i1 [[TMP40]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
+; CHECK:       [[MIDDLE_BLOCK]]:
+; CHECK-NEXT:    [[CMP_N:%.*]] = icmp eq i64 [[N]], [[N_VEC]]
+; CHECK-NEXT:    br i1 [[CMP_N]], label %[[EXIT:.*]], label %[[SCALAR_PH]]
+; CHECK:       [[SCALAR_PH]]:
+; CHECK-NEXT:    [[BC_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], %[[MIDDLE_BLOCK]] ], [ 0, %[[ENTRY]] ], [ 0, %[[VECTOR_SCEVCHECK]] ]
+; CHECK-NEXT:    br label %[[LOOP:.*]]
+; CHECK:       [[LOOP]]:
+; CHECK-NEXT:    [[IV:%.*]] = phi i64 [ [[BC_RESUME_VAL]], %[[SCALAR_PH]] ], [ [[IV_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT:    [[DIV18:%.*]] = sdiv i64 [[M]], [[CONV6]]
+; CHECK-NEXT:    [[CONV20:%.*]] = trunc i64 [[DIV18]] to i32
+; CHECK-NEXT:    [[MUL30:%.*]] = mul i64 [[DIV18]], [[CONV61]]
+; CHECK-NEXT:    [[SUB31:%.*]] = sub i64 [[IV]], [[MUL30]]
+; CHECK-NEXT:    [[CONV34:%.*]] = trunc i64 [[SUB31]] to i32
+; CHECK-NEXT:    [[MUL35:%.*]] = mul i32 [[X]], [[CONV20]]
+; CHECK-NEXT:    [[ADD36:%.*]] = add i32 [[MUL35]], [[CONV34]]
+; CHECK-NEXT:    [[IDXPROM:%.*]] = sext i32 [[ADD36]] to i64
+; CHECK-NEXT:    [[GEP:%.*]] = getelementptr double, ptr [[DST]], i64 [[IDXPROM]]
+; CHECK-NEXT:    store double 0.000000e+00, ptr [[GEP]], align 8
+; CHECK-NEXT:    [[IV_NEXT]] = add i64 [[IV]], 1
+; CHECK-NEXT:    [[EC:%.*]] = icmp eq i64 [[IV_NEXT]], [[N]]
+; CHECK-NEXT:    br i1 [[EC]], label %[[EXIT]], label %[[LOOP]], !llvm.loop [[LOOP3:![0-9]+]]
+; CHECK:       [[EXIT]]:
+; CHECK-NEXT:    ret void
+;
+entry:
+  %conv61 = zext i32 %x to i64
+  br label %loop
+
+loop:
+  %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
+  %div18 = sdiv i64 %M, %conv6
+  %conv20 = trunc i64 %div18 to i32
+  %mul30 = mul i64 %div18, %conv61
+  %sub31 = sub i64 %iv, %mul30
+  %conv34 = trunc i64 %sub31 to i32
+  %mul35 = mul i32 %x, %conv20
+  %add36 = add i32 %mul35, %conv34
+  %idxprom = sext i32 %add36 to i64
+  %gep = getelementptr double, ptr %dst, i64 %idxprom
+  store double 0.000000e+00, ptr %gep, align 8
+  %iv.next = add i64 %iv, 1
+  %ec = icmp eq i64 %iv.next, %N
+  br i1 %ec, label %exit, label %loop
+
+exit:
+  ret void
+}
+
+define void @sdiv_feeding_gep_predicated(ptr %dst, i32 %x, i64 %M, i64 %conv6, i64 %N) {
+; CHECK-LABEL: define void @sdiv_feeding_gep_predicated(
+; CHECK-SAME: ptr [[DST:%.*]], i32 [[X:%.*]], i64 [[M:%.*]], i64 [[CONV6:%.*]], i64 [[N:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT:  [[ENTRY:.*]]:
+; CHECK-NEXT:    [[CONV61:%.*]] = zext i32 [[X]] to i64
+; CHECK-NEXT:    br i1 false, label %[[SCALAR_PH:.*]], label %[[VECTOR_SCEVCHECK:.*]]
+; CHECK:       [[VECTOR_SCEVCHECK]]:
+; CHECK-NEXT:    [[TMP0:%.*]] = add i64 [[N]], -1
+; CHECK-NEXT:    [[TMP1:%.*]] = trunc i64 [[TMP0]] to i32
+; CHECK-NEXT:    [[TMP2:%.*]] = icmp slt i32 [[TMP1]], 0
+; CHECK-NEXT:    [[TMP3:%.*]] = icmp ugt i64 [[TMP0]], 4294967295
+; CHECK-NEXT:    [[TMP4:%.*]] = or i1 [[TMP2]], [[TMP3]]
+; CHECK-NEXT:    br i1 [[TMP4]], label %[[SCALAR_PH]], label %[[VECTOR_PH:.*]]
+; CHECK:       [[VECTOR_PH]]:
+; CHECK-NEXT:    [[TMP5:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT:    [[TMP6:%.*]] = mul i64 [[TMP5]], 2
+; CHECK-NEXT:    [[TMP7:%.*]] = sub i64 [[TMP6]], 1
+; CHECK-NEXT:    [[N_RND_UP:%.*]] = add i64 [[N]], [[TMP7]]
+; CHECK-NEXT:    [[N_MOD_VF:%.*]] = urem i64 [[N_RND_UP]], [[TMP6]]
+; CHECK-NEXT:    [[N_VEC:%.*]] = sub i64 [[N_RND_UP]], [[N_MOD_VF]]
+; CHECK-NEXT:    [[TMP8:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT:    [[TMP9:%.*]] = mul i64 [[TMP8]], 2
+; CHECK-NEXT:    [[TMP10:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT:    [[TMP11:%.*]] = mul i64 [[TMP10]], 2
+; CHECK-NEXT:    [[TMP12:%.*]] = sub i64 [[N]], [[TMP11]]
+; CHECK-NEXT:    [[TMP13:%.*]] = icmp ugt i64 [[N]], [[TMP11]]
+; CHECK-NEXT:    [[TMP14:%.*]] = select i1 [[TMP13]], i64 [[TMP12]], i64 0
+; CHECK-NEXT:    [[ACTIVE_LANE_MASK_ENTRY:%.*]] = call <vscale x 2 x i1> @llvm.get.active.lane.mask.nxv2i1.i64(i64 0, i64 [[N]])
+; CHECK-NEXT:    [[TMP15:%.*]] = call <vscale x 2 x i64> @llvm.experimental.stepvector.nxv2i64()
+; CHECK-NEXT:    [[TMP16:%.*]] = add <vscale x 2 x i64> [[TMP15]], zeroinitializer
+; CHECK-NEXT:    [[TMP17:%.*]] = mul <vscale x 2 x i64> [[TMP16]], shufflevector (<vscale x 2 x i64> insertelement (<vscale x 2 x i64> poison, i64 1, i64 0), <vscale x 2 x i64> poison, <vscale x 2 x i32> zeroinitializer)
+; CHECK-NEXT:    [[INDUCTION:%.*]] = add <vscale x 2 x i64> zeroinitializer, [[TMP17]]
+; CHECK-NEXT:    [[TMP18:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT:    [[TMP19:%.*]] = mul i64 [[TMP18]], 2
+; CHECK-NEXT:    [[TMP20:%.*]] = mul i64 1, [[TMP19]]
+; CHECK-NEXT:    [[DOTSPLATINSERT:%.*]] = insertelement <vscale x 2 x i64> poison, i64 [[TMP20]], i64 0
+; CHECK-NEXT:    [[DOTSPLAT:%.*]] = shufflevector <vscale x 2 x i64> [[DOTSPLATINSERT]], <vscale x 2 x i64> poison, <vscale x 2 x i32> zeroinitializer
+; CHECK-NEXT:    [[BROADCAST_SPLATINSERT:%.*]] = insertelement <vscale x 2 x i64> poison, i64 [[M]], i64 0
+; CHECK-NEXT:    [[BROADCAST_SPLAT:%.*]] = shufflevector <vscale x 2 x i64> [[BROADCAST_SPLATINSERT]], <vscale x 2 x i64> poison, <vscale x 2 x i32> zeroinitializer
+; CHECK-NEXT:    [[BROADCAST_SPLATINSERT1:%.*]] = insertelement <vscale x 2 x i64> poison, i64 [[CONV6]], i64 0
+; CHECK-NEXT:    [[BROADCAST_SPLAT2:%.*]] = shufflevector <vscale x 2 x i64> [[BROADCAST_SPLATINSERT1]], <vscale x 2 x i64> poison, <vscale x 2 x i32> zeroinitializer
+; CHECK-NEXT:    br label %[[VECTOR_BODY:.*]]
+; CHECK:       [[VECTOR_BODY]]:
+; CHECK-NEXT:    [[INDEX:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[ACTIVE_LANE_MASK:%.*]] = phi <vscale x 2 x i1> [ [[ACTIVE_LANE_MASK_ENTRY]], %[[VECTOR_PH]] ], [ [[ACTIVE_LANE_MASK_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[VEC_IND:%.*]] = phi <vscale x 2 x i64> [ [[INDUCTION]], %[[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[TMP21:%.*]] = add i64 [[INDEX]], 0
+; CHECK-NEXT:    [[TMP22:%.*]] = icmp ule <vscale x 2 x i64> [[VEC_IND]], [[BROADCAST_SPLAT]]
+; CHECK-NEXT:    [[TMP23:%.*]] = select <vscale x 2 x i1> [[ACTIVE_LANE_MASK]], <vscale x 2 x i1> [[TMP22]], <vscale x 2 x i1> zeroinitializer
+; CHECK-NEXT:    [[TMP24:%.*]] = select <vscale x 2 x i1> [[TMP23]], <vscale x 2 x i64> [[BROADCAST_SPLAT2]], <vscale x 2 x i64> shufflevector (<vscale x 2 x i64> insertelement (<vscale x 2 x i64> poison, i64 1, i64 0), <vscale x 2 x i64> poison, <vscale x 2 x i32> zeroinitializer)
+; CHECK-NEXT:    [[TMP25:%.*]] = sdiv <vscale x 2 x i64> [[BROADCAST_SPLAT]], [[TMP24]]
+; CHECK-NEXT:    [[TMP26:%.*]] = extractelement <vscale x 2 x i64> [[TMP25]], i32 0
+; CHECK-NEXT:    [[TMP27:%.*]] = trunc i64 [[TMP26]] to i32
+; CHECK-NEXT:    [[TMP28:%.*]] = mul i64 [[TMP26]], [[CONV61]]
+; CHECK-NEXT:    [[TMP29:%.*]] = sub i64 [[TMP21]], [[TMP28]]
+; CHECK-NEXT:    [[TMP30:%.*]] = trunc i64 [[TMP29]] to i32
+; CHECK-NEXT:    [[TMP31:%.*]] = mul i32 [[X]], [[TMP27]]
+; CHECK-NEXT:    [[TMP32:%.*]] = add i32 [[TMP31]], [[TMP30]]
+; CHECK-NEXT:    [[TMP33:%.*]] = sext i32 [[TMP32]] to i64
+; CHECK-NEXT:    [[TMP34:%.*]] = getelementptr double, ptr [[DST]], i64 [[TMP33]]
+; CHECK-NEXT:    [[TMP35:%.*]] = getelementptr double, ptr [[TMP34]], i32 0
+; CHECK-NEXT:    call void @llvm.masked.store.nxv2f64.p0(<vscale x 2 x double> zeroinitializer, ptr [[TMP35]], i32 8, <vscale x 2 x i1> [[TMP23]])
+; CHECK-NEXT:    [[INDEX_NEXT]] = add i64 [[INDEX]], [[TMP9]]
+; CHECK-NEXT:    [[ACTIVE_LANE_MASK_NEXT]] = call <vscale x 2 x i1> @llvm.get.active.lane.mask.nxv2i1.i64(i64 [[INDEX]], i64 [[TMP14]])
+; CHECK-NEXT:    [[TMP36:%.*]] = xor <vscale x 2 x i1> [[ACTIVE_LANE_MASK_NEXT]], shufflevector (<vscale x 2 x i1> insertelement (<vscale x 2 x i1> poison, i1 true, i64 0), <vs...
[truncated]

if (isScalarWithPredication(I, VF)) {
LLVM_DEBUG(dbgs() << "LV: Found not uniform being ScalarWithPredication: "
<< *I << "\n");
if (isPredicatedInst(I, VF, true)) {
Member

Suggested change
if (isPredicatedInst(I, VF, true)) {
if (isPredicatedInst(I, VF, /*IsKnownUniform=*/true)) {

Contributor Author

Code is gone in the latest version, thanks!

@@ -3907,7 +3919,7 @@ void LoopVectorizationCostModel::collectLoopUniforms(ElementCount VF) {
SetVector<Instruction *> Worklist;

// Add uniform instructions demanding lane 0 to the worklist. Instructions
// that are scalar with predication must not be considered uniform after
// that are require predication must not be considered uniform after
Member

Suggested change
// that are require predication must not be considered uniform after
// that require predication must not be considered uniform after

Contributor Author

(Should have been fixed by earlier PR)

Comment on lines 3716 to 3723
// When folding the tail, at least one of the lanes must execute
// unconditionally. If the divisor is loop-invariant no predication is
// needed, as predication would not prevent the divide-by-0 on the executed
// lane.
if (foldTailByMasking() && !Legal->blockNeedsPredication(I->getParent()) &&
TheLoop->isLoopInvariant(I->getOperand(1)) &&
(IsKnownUniform || isUniformAfterVectorization(I, VF)))
return false;
Collaborator

This is similar to unpredicating loads and stores above, and should have consistent explanations and implementations. Checking !Legal->blockNeedsPredication(I->getParent()) implies foldTailByMasking(), given that we passed blockNeedsPredicationForAnyReason(I->getParent()) to get here. Why are IsKnownUniform and isUniformAfterVectorization needed - would Legal->isInvariant(I->getOperand(1)) suffice, as in the load-from-uniform-address case above?

A side-effecting instruction under mask could be unmasked provided (a) its side-effect operand(s) are uniform (divisor for div/rem, address for load, address and value for store), and (b) its mask is known to have its first lane set (conceivably extensible to having any lane set?). This is the case for instructions subject only to the fold-tail mask. Would be good to have an explicit API, say, MaskKnownToExcludeFirstLane().

Would it suffice to have

    if (Legal->isInvariant(I->getOperand(1)) &&
        !Legal->blockNeedsPredication(I->getParent()))
      return false;

as in the load-from-uniform-address case above?
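
A rough sketch of what such an explicit query could look like (the name, placement in LoopVectorizationCostModel, and exact conditions are assumptions for illustration, not code from this patch):

    // Hypothetical helper: the only mask applied to I's block comes from tail
    // folding, and that mask always has its first lane active, so instructions
    // whose side-effect operands are invariant may stay unmasked.
    bool LoopVectorizationCostModel::maskIsOnlyFromTailFolding(
        Instruction *I) const {
      return foldTailByMasking() &&
             !Legal->blockNeedsPredication(I->getParent());
    }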

Contributor Author

Thanks, I restructured the code and generalized it so the unconditionally-executed but tail-folded cases are all handled at the end.

Removed the IsKnownUniform requirement.

@fhahn fhahn force-pushed the lv-uniform-tailfold branch from bafa29b to 0f63249 Compare July 20, 2024 13:26
@fhahn fhahn force-pushed the lv-uniform-tailfold branch from 0f63249 to 8a71ea7 Compare July 20, 2024 13:26
@fhahn fhahn changed the title [LV] Don't predicate uniform divides with loop-invariant divisor. [LV] Don't predicate divs with invariant divisor when folding tail Jul 20, 2024
Collaborator

@ayalz ayalz left a comment

This looks fine, raising some more thoughts ... which could possibly be addressed separately.

@@ -3348,40 +3348,48 @@ bool LoopVectorizationCostModel::isPredicatedInst(Instruction *I) const {
return false;

// Can we prove this instruction is safe to unconditionally execute?
// If not, we must use some form of predication.
Collaborator

Is this comment removal intended?

Contributor Author

Code was restructured in the latest version, with the separate if being removed.

@@ -3348,40 +3348,48 @@ bool LoopVectorizationCostModel::isPredicatedInst(Instruction *I) const {
return false;
Collaborator

Suggested change
return false;
// If predication is not needed, avoid it.
if (!blockNeedsPredicationForAnyReason(I->getParent())
|| isSafeToSpeculativelyExecute(I))
|| !Legal->isMaskRequired(I)
|| isa<BranchInst, PHINode>(I))
return false;

should the case of I being isSafeToSpeculativelyExecute() be handled here at the outset along with the case of I's block not needing predication for any reason?

Legal's isMaskRequired() deals only with load, stores, and calls (in particular "assume" intrinsics - which get dropped rather than predicated), and is tail-folding aware. Should originally unconditional loads and stores (of invariant values) from invariant addresses be excluded from MaskedOps? I.e., should Legal's blockCanBePredicated() take care of unmasking such loads and stores under tail-folding, alongside its unmasking of loads from safe pointers? Furthermore, is the distinction between CM.isPredicatedInst(I) and Legal.isMaskRequired(I) clear and desired, or could the collection of all MaskedOps/PredicatedInsts be done once? Masks were and still are introduced only where necessary, as a functional requirement rather than a performance preference.

Contributor Author

Reorganized to combine all checks at the top, thanks!

As for unifying the logic, it might be good to do that separately, but it definitely seems worthwhile given that the decisions are independent of cost.

return false;

// TODO: We can use the loop-preheader as context point here and get
// context sensitive reasoning
Collaborator

Suggested change
// context sensitive reasoning
// context sensitive reasoning.

would be good to indicate that the comment refers to isSafeToSpeculativelyExecute(I), i.e., is I safe to speculatively execute in the preheader.

Contributor Author

Updated, thanks!

Comment on lines 3367 to 3370
// Tail folding may introduce additional predication, but we're guaranteed to
// always have at least one active lane. If the instruction in the original
// scalar loop was executed unconditionally, it may not need predication,
// depending on its operands.
Collaborator

suggesting a rephrase, feel free to modify:

Suggested change
// Tail folding may introduce additional predication, but we're guaranteed to
// always have at least one active lane. If the instruction in the original
// scalar loop was executed unconditionally, it may not need predication,
// depending on its operands.
// All that remain are instructions with side-effects originally executed in the loop unconditionally, but now execute under a tail-fold mask (only) having at least one active lane (the first). If the side-effects of the instruction are invariant, executing it w/o (the tail-folding) mask is safe - it will cause the same side-effects as when masked.

Contributor Author

Adjusted, thanks!

Comment on lines 3374 to 3375
"instruction should have been considered to not require predication "
"by earlier checks");
Collaborator

Suggested change
"instruction should have been considered to not require predication "
"by earlier checks");
"instruction should have been considered by earlier checks");

to either require predication or not.

; CHECK-NEXT: [[TMP15:%.*]] = phi i64 [ poison, %[[PRED_SDIV_CONTINUE4]] ], [ [[TMP14]], %[[PRED_SDIV_IF5]] ]
; CHECK-NEXT: [[TMP16:%.*]] = extractelement <4 x i1> [[TMP6]], i32 3
; CHECK-NEXT: br i1 [[TMP16]], label %[[PRED_SDIV_IF7:.*]], label %[[PRED_SDIV_CONTINUE8]]
; CHECK: [[PRED_SDIV_IF7]]:
; CHECK-NEXT: [[TMP17:%.*]] = sdiv i64 [[M]], [[CONV6]]
Collaborator

Note: invariant sdiv in header with invariant divisor and dividend remains scalar w/o predication rather than scalar w/ predication.

; CHECK-NEXT: [[TMP74:%.*]] = ashr i64 [[TMP73]], 32
; CHECK-NEXT: [[TMP75:%.*]] = getelementptr i64, ptr [[DST]], i64 [[TMP74]]
; CHECK-NEXT: [[TMP76:%.*]] = getelementptr i64, ptr [[TMP75]], i32 0
; CHECK-NEXT: call void @llvm.masked.store.v4i64.p0(<4 x i64> [[TMP60]], ptr [[TMP76]], i32 4, <4 x i1> [[TMP5]])
Collaborator

It is quite remarkable to see that TMP76 is known to be an AddRec, leading to vectorizing the store into a wide vector store. But why not simplify this into a store to %dst[%iv]?

Contributor Author

I think at the moment we always base the vector pointers on the original IR pointer; in this case it would be more efficient to expand the SCEV for the pointer instead.
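
As a hedged sketch of that alternative (an assumption about how it could be approached, not code from this patch or from LV today), the pointer's AddRec could be materialized with SCEVExpander instead of re-basing the vector pointer on the original GEP chain:

    // Illustrative only: expand a pointer's SCEV (e.g. {%dst,+,8}<%loop>) at a
    // chosen insertion point rather than widening the existing GEP computation.
    #include "llvm/Analysis/ScalarEvolution.h"
    #include "llvm/IR/Module.h"
    #include "llvm/Transforms/Utils/ScalarEvolutionExpander.h"
    using namespace llvm;

    static Value *expandPointer(ScalarEvolution &SE, Value *Ptr,
                                Instruction *InsertPt) {
      const SCEV *PtrSCEV = SE.getSCEV(Ptr);
      SCEVExpander Expander(SE, InsertPt->getModule()->getDataLayout(),
                            "vec.ptr");
      return Expander.expandCodeFor(PtrSCEV, Ptr->getType(), InsertPt);
    }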

@@ -254,117 +227,9 @@ define void @udiv_urem_feeding_gep(i64 %x, ptr %dst, i64 %N) {
; CHECK-NEXT: [[ENTRY:.*]]:
; CHECK-NEXT: [[MUL_1_I:%.*]] = mul i64 [[X]], [[X]]
; CHECK-NEXT: [[MUL_2_I:%.*]] = mul i64 [[MUL_1_I]], [[X]]
; CHECK-NEXT: [[TMP0:%.*]] = add i64 [[N]], 1
; CHECK-NEXT: br i1 false, label %[[SCALAR_PH:.*]], label %[[VECTOR_SCEVCHECK:.*]]
Collaborator

This case, involving six divs and rems all with invariant divisors in the header, now avoids vectorizing. It is indeed questionable whether vectorization is profitable - packing scalar remainders into a vector just to feed a wide store. But it is worth understanding why unpredicating these divs and rems seems less profitable / detrimental.

Contributor Author

It looks like this is down to a large difference between the cost of a udiv <4 x i64> and 4 * udiv i64 plus the scalarization-overhead estimate in LV (80 vs 7!). This looks like an X86 cost-model issue independent of the patch.
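
A small sketch of how that gap could be inspected directly via TTI (illustrative only; the numbers depend on the subtarget, and this is not how LV assembles its final cost):

    #include "llvm/Analysis/TargetTransformInfo.h"
    #include "llvm/IR/DerivedTypes.h"
    #include "llvm/IR/Instruction.h"
    #include "llvm/IR/LLVMContext.h"
    #include "llvm/Support/raw_ostream.h"
    using namespace llvm;

    // Compare the cost of one <4 x i64> udiv against four scalar i64 udivs.
    static void dumpUDivCosts(const TargetTransformInfo &TTI, LLVMContext &Ctx) {
      Type *I64 = Type::getInt64Ty(Ctx);
      auto *V4I64 = FixedVectorType::get(I64, 4);
      InstructionCost Vec = TTI.getArithmeticInstrCost(Instruction::UDiv, V4I64);
      InstructionCost Scl = TTI.getArithmeticInstrCost(Instruction::UDiv, I64);
      errs() << "v4i64 udiv: " << Vec << ", 4 * i64 udiv: " << Scl * 4 << "\n";
    }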

@@ -1,4 +1,5 @@
; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5
; NOTE: Assertions t have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5
Collaborator

Suggested change
; NOTE: Assertions t have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5

?

Contributor Author

Removed, thanks!

; CHECK-NEXT: [[TMP46:%.*]] = getelementptr i64, ptr [[TMP45]], i32 0
; CHECK-NEXT: call void @llvm.masked.store.nxv2i64.p0(<vscale x 2 x i64> [[TMP22]], ptr [[TMP46]], i32 4, <vscale x 2 x i1> [[ACTIVE_LANE_MASK]])
; CHECK-NEXT: [[TMP21:%.*]] = add i64 [[INDEX]], 0
; CHECK-NEXT: [[TMP23:%.*]] = udiv <vscale x 2 x i64> [[VEC_IND]], [[BROADCAST_SPLAT4]]
Collaborator

This seems right - broadcasting the invariant divisors rather than filling poison for masked lanes, or computing a single unmasked lane if only it is used.
Doing so for fixed vectors in X86/divs-with-tail-folding.ll led to abandoning vectorization - see below.

Collaborator

@ayalz ayalz left a comment

This LGTM, thanks!

Perhaps worth leaving behind TODOs and/or opening tickets for

  • folding CM.isPredicatedInst() into Legal.isMaskRequired()
  • X86 cost-model issue with vector versus scalar udiv
  • expand the AddRec SCEV of pointers for wide loads/stores instead of processing their existing GEP computation

fhahn commented Jul 25, 2024

This LGTM, thanks!

Perhaps worth leaving behind TODOs and/or opening tickets for

  • folding CM.isPredicatedInst() into Legal.isMaskRequired()
  • X86 cost-model issue with vector versus scalar udiv
  • expand the AddRec SCEV of pointers for wide loads/stores instead of processing their existing GEP computation

Thanks, added a TODO for the first; will file issues for the others.

@fhahn fhahn merged commit 72532c9 into llvm:main Jul 25, 2024
7 checks passed
@fhahn fhahn deleted the lv-uniform-tailfold branch July 25, 2024 11:21
yuxuanchen1997 pushed a commit that referenced this pull request Jul 25, 2024
…98904)

Summary:
When folding the tail, at least one of the lanes must execute
unconditionally. If the divisor is loop-invariant no predication is
needed, as predication would not prevent the divide-by-0 on the executed
lane.

Depends on #98892.

PR: #98904

Differential Revision: https://phabricator.intern.facebook.com/D60250648