[LV] Don't predicate divs with invariant divisor when folding tail #98904
Conversation
Any instruction marked as uniform will result in a uniform VPReplicateRecipe. If it requires predication, it will be placed in a replicate region, even if isScalarWithPredication returns false. Check isPredicatedInst instead of isScalarWithPredication to avoid generating uniform VPReplicateRecipes placed inside a replicate region. This fixes an assertion when using scalable VFs. Fixes llvm#80416. Fixes llvm#94328.
When folding the tail, at least one of the lanes must execute unconditionally. If the divisor is loop-invariant no predication is needed, as predication would not prevent the divide-by-0 on the executed lane. Depends on llvm#98892.
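To make the case concrete, here is a minimal sketch of the kind of loop this change targets (an illustrative example, not one of the tests added by the patch): the divisor %d is loop-invariant and the sdiv executes unconditionally in the scalar loop, so under tail folding it can be widened directly instead of being wrapped in a safe-divisor select or scalarized into a replicate region.

define void @invariant_divisor(ptr %dst, i64 %a, i64 %d, i64 %n) {
entry:
  br label %loop

loop:
  %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
  ; Executed unconditionally in the scalar loop; the divisor %d is
  ; loop-invariant, so predication would not avoid a divide-by-0 anyway.
  %div = sdiv i64 %a, %d
  %gep = getelementptr i64, ptr %dst, i64 %iv
  store i64 %div, ptr %gep, align 8
  %iv.next = add i64 %iv, 1
  %ec = icmp eq i64 %iv.next, %n
  br i1 %ec, label %exit, label %loop

exit:
  ret void
}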
@llvm/pr-subscribers-llvm-transforms

Author: Florian Hahn (fhahn)

Changes

When folding the tail, at least one of the lanes must execute unconditionally. If the divisor is loop-invariant no predication is needed, as predication would not prevent the divide-by-0 on the executed lane.

Depends on #98892.

Patch is 45.04 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/98904.diff

5 Files Affected:
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index 5520baef7152d..f5a860e980c74 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -1402,7 +1402,8 @@ class LoopVectorizationCostModel {
/// Returns true if \p I is an instruction that needs to be predicated
/// at runtime. The result is independent of the predication mechanism.
/// Superset of instructions that return true for isScalarWithPredication.
- bool isPredicatedInst(Instruction *I) const;
+ bool isPredicatedInst(Instruction *I, ElementCount VF,
+ bool IsKnownUniform = false) const;
/// Return the costs for our two available strategies for lowering a
/// div/rem operation which requires speculating at least one lane.
@@ -3637,7 +3638,7 @@ void LoopVectorizationCostModel::collectLoopScalars(ElementCount VF) {
bool LoopVectorizationCostModel::isScalarWithPredication(
Instruction *I, ElementCount VF) const {
- if (!isPredicatedInst(I))
+ if (!isPredicatedInst(I, VF))
return false;
// Do we have a non-scalar lowering for this predicated
@@ -3676,7 +3677,9 @@ bool LoopVectorizationCostModel::isScalarWithPredication(
}
}
-bool LoopVectorizationCostModel::isPredicatedInst(Instruction *I) const {
+bool LoopVectorizationCostModel::isPredicatedInst(Instruction *I,
+ ElementCount VF,
+ bool IsKnownUniform) const {
if (!blockNeedsPredicationForAnyReason(I->getParent()))
return false;
@@ -3710,6 +3713,15 @@ bool LoopVectorizationCostModel::isPredicatedInst(Instruction *I) const {
case Instruction::SDiv:
case Instruction::SRem:
case Instruction::URem:
+ // When folding the tail, at least one of the lanes must execute
+ // unconditionally. If the divisor is loop-invariant no predication is
+ // needed, as predication would not prevent the divide-by-0 on the executed
+ // lane.
+ if (foldTailByMasking() && !Legal->blockNeedsPredication(I->getParent()) &&
+ TheLoop->isLoopInvariant(I->getOperand(1)) &&
+ (IsKnownUniform || isUniformAfterVectorization(I, VF)))
+ return false;
+
// TODO: We can use the loop-preheader as context point here and get
// context sensitive reasoning
return !isSafeToSpeculativelyExecute(I);
@@ -3907,7 +3919,7 @@ void LoopVectorizationCostModel::collectLoopUniforms(ElementCount VF) {
SetVector<Instruction *> Worklist;
// Add uniform instructions demanding lane 0 to the worklist. Instructions
- // that are scalar with predication must not be considered uniform after
+ // that are require predication must not be considered uniform after
// vectorization, because that would create an erroneous replicating region
// where only a single instance out of VF should be formed.
// TODO: optimize such seldom cases if found important, see PR40816.
@@ -3917,9 +3929,10 @@ void LoopVectorizationCostModel::collectLoopUniforms(ElementCount VF) {
<< *I << "\n");
return;
}
- if (isScalarWithPredication(I, VF)) {
- LLVM_DEBUG(dbgs() << "LV: Found not uniform being ScalarWithPredication: "
- << *I << "\n");
+ if (isPredicatedInst(I, VF, true)) {
+ LLVM_DEBUG(
+ dbgs() << "LV: Found not uniform due to requiring predication: " << *I
+ << "\n");
return;
}
LLVM_DEBUG(dbgs() << "LV: Found uniform instruction: " << *I << "\n");
@@ -5633,7 +5646,7 @@ bool LoopVectorizationCostModel::useEmulatedMaskMemRefHack(Instruction *I,
// from moving "masked load/store" check from legality to cost model.
// Masked Load/Gather emulation was previously never allowed.
// Limited number of Masked Store/Scatter emulation was allowed.
- assert((isPredicatedInst(I)) &&
+ assert((isPredicatedInst(I, VF)) &&
"Expecting a scalar emulated instruction");
return isa<LoadInst>(I) ||
(isa<StoreInst>(I) &&
@@ -5912,7 +5925,7 @@ LoopVectorizationCostModel::getMemInstScalarizationCost(Instruction *I,
// If we have a predicated load/store, it will need extra i1 extracts and
// conditional branches, but may not be executed for each vector lane. Scale
// the cost by the probability of executing the predicated block.
- if (isPredicatedInst(I)) {
+ if (isPredicatedInst(I, VF)) {
Cost /= getReciprocalPredBlockProb();
// Add the cost of an i1 extract and a branch
@@ -6772,7 +6785,7 @@ LoopVectorizationCostModel::getInstructionCost(Instruction *I,
case Instruction::SDiv:
case Instruction::URem:
case Instruction::SRem:
- if (VF.isVector() && isPredicatedInst(I)) {
+ if (VF.isVector() && isPredicatedInst(I, VF)) {
const auto [ScalarCost, SafeDivisorCost] = getDivRemSpeculationCost(I, VF);
return isDivRemScalarWithPredication(ScalarCost, SafeDivisorCost) ?
ScalarCost : SafeDivisorCost;
@@ -8444,7 +8457,7 @@ bool VPRecipeBuilder::shouldWiden(Instruction *I, VFRange &Range) const {
VPWidenRecipe *VPRecipeBuilder::tryToWiden(Instruction *I,
ArrayRef<VPValue *> Operands,
- VPBasicBlock *VPBB) {
+ VPBasicBlock *VPBB, VFRange &Range) {
switch (I->getOpcode()) {
default:
return nullptr;
@@ -8454,7 +8467,10 @@ VPWidenRecipe *VPRecipeBuilder::tryToWiden(Instruction *I,
case Instruction::URem: {
// If not provably safe, use a select to form a safe divisor before widening the
// div/rem operation itself. Otherwise fall through to general handling below.
- if (CM.isPredicatedInst(I)) {
+ bool IsPredicated = LoopVectorizationPlanner::getDecisionAndClampRange(
+ [&](ElementCount VF) -> bool { return CM.isPredicatedInst(I, VF); },
+ Range);
+ if (IsPredicated) {
SmallVector<VPValue *> Ops(Operands.begin(), Operands.end());
VPValue *Mask = getBlockInMask(I->getParent());
VPValue *One =
@@ -8504,8 +8520,8 @@ VPReplicateRecipe *VPRecipeBuilder::handleReplication(Instruction *I,
[&](ElementCount VF) { return CM.isUniformAfterVectorization(I, VF); },
Range);
- bool IsPredicated = CM.isPredicatedInst(I);
-
+ bool IsPredicated = LoopVectorizationPlanner::getDecisionAndClampRange(
+ [&](ElementCount VF) { return CM.isPredicatedInst(I, VF); }, Range);
// Even if the instruction is not marked as uniform, there are certain
// intrinsic calls that can be effectively treated as such, so we check for
// them here. Conservatively, we only do this for scalable vectors, since
@@ -8625,7 +8641,7 @@ VPRecipeBuilder::tryToCreateWidenRecipe(Instruction *Instr,
*CI);
}
- return tryToWiden(Instr, Operands, VPBB);
+ return tryToWiden(Instr, Operands, VPBB, Range);
}
void LoopVectorizationPlanner::buildVPlansWithVPRecipes(ElementCount MinVF,
@@ -9482,6 +9498,8 @@ void VPInterleaveRecipe::execute(VPTransformState &State) {
void VPReplicateRecipe::execute(VPTransformState &State) {
Instruction *UI = getUnderlyingInstr();
if (State.Instance) { // Generate a single instance.
+ assert((State.VF.isScalar() || !isUniform()) &&
+ "uniform recipe shouldn't be predicated");
assert(!State.VF.isScalable() && "Can't scalarize a scalable vector");
State.ILV->scalarizeInstruction(UI, this, *State.Instance, State);
// Insert scalar instance packing it into a vector.
diff --git a/llvm/lib/Transforms/Vectorize/VPRecipeBuilder.h b/llvm/lib/Transforms/Vectorize/VPRecipeBuilder.h
index b4c7ab02f928f..0c7c47258bf9e 100644
--- a/llvm/lib/Transforms/Vectorize/VPRecipeBuilder.h
+++ b/llvm/lib/Transforms/Vectorize/VPRecipeBuilder.h
@@ -100,7 +100,7 @@ class VPRecipeBuilder {
/// if it can. The function should only be called if the cost-model indicates
/// that widening should be performed.
VPWidenRecipe *tryToWiden(Instruction *I, ArrayRef<VPValue *> Operands,
- VPBasicBlock *VPBB);
+ VPBasicBlock *VPBB, VFRange &Range);
public:
VPRecipeBuilder(VPlan &Plan, Loop *OrigLoop, const TargetLibraryInfo *TLI,
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/divs-with-scalable-vfs.ll b/llvm/test/Transforms/LoopVectorize/AArch64/divs-with-scalable-vfs.ll
new file mode 100644
index 0000000000000..e1ce8d430e741
--- /dev/null
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/divs-with-scalable-vfs.ll
@@ -0,0 +1,380 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5
+; RUN: opt -p loop-vectorize -mtriple aarch64 -mcpu=neoverse-v1 -S %s | FileCheck %s
+
+; Test case for https://github.com/llvm/llvm-project/issues/94328.
+define void @sdiv_feeding_gep(ptr %dst, i32 %x, i64 %M, i64 %conv6, i64 %N) {
+; CHECK-LABEL: define void @sdiv_feeding_gep(
+; CHECK-SAME: ptr [[DST:%.*]], i32 [[X:%.*]], i64 [[M:%.*]], i64 [[CONV6:%.*]], i64 [[N:%.*]]) #[[ATTR0:[0-9]+]] {
+; CHECK-NEXT: [[ENTRY:.*]]:
+; CHECK-NEXT: [[CONV61:%.*]] = zext i32 [[X]] to i64
+; CHECK-NEXT: [[TMP0:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT: [[TMP1:%.*]] = mul i64 [[TMP0]], 4
+; CHECK-NEXT: [[TMP2:%.*]] = call i64 @llvm.umax.i64(i64 8, i64 [[TMP1]])
+; CHECK-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 [[N]], [[TMP2]]
+; CHECK-NEXT: br i1 [[MIN_ITERS_CHECK]], label %[[SCALAR_PH:.*]], label %[[VECTOR_SCEVCHECK:.*]]
+; CHECK: [[VECTOR_SCEVCHECK]]:
+; CHECK-NEXT: [[TMP3:%.*]] = add i64 [[N]], -1
+; CHECK-NEXT: [[TMP4:%.*]] = trunc i64 [[TMP3]] to i32
+; CHECK-NEXT: [[TMP5:%.*]] = icmp slt i32 [[TMP4]], 0
+; CHECK-NEXT: [[TMP6:%.*]] = icmp ugt i64 [[TMP3]], 4294967295
+; CHECK-NEXT: [[TMP7:%.*]] = or i1 [[TMP5]], [[TMP6]]
+; CHECK-NEXT: br i1 [[TMP7]], label %[[SCALAR_PH]], label %[[VECTOR_PH:.*]]
+; CHECK: [[VECTOR_PH]]:
+; CHECK-NEXT: [[TMP8:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT: [[TMP9:%.*]] = mul i64 [[TMP8]], 4
+; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[N]], [[TMP9]]
+; CHECK-NEXT: [[N_VEC:%.*]] = sub i64 [[N]], [[N_MOD_VF]]
+; CHECK-NEXT: [[TMP10:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT: [[TMP11:%.*]] = mul i64 [[TMP10]], 4
+; CHECK-NEXT: br label %[[VECTOR_BODY:.*]]
+; CHECK: [[VECTOR_BODY]]:
+; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT: [[TMP12:%.*]] = add i64 [[INDEX]], 0
+; CHECK-NEXT: [[TMP13:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT: [[TMP14:%.*]] = mul i64 [[TMP13]], 2
+; CHECK-NEXT: [[TMP15:%.*]] = add i64 [[TMP14]], 0
+; CHECK-NEXT: [[TMP16:%.*]] = mul i64 [[TMP15]], 1
+; CHECK-NEXT: [[TMP17:%.*]] = add i64 [[INDEX]], [[TMP16]]
+; CHECK-NEXT: [[TMP18:%.*]] = sdiv i64 [[M]], [[CONV6]]
+; CHECK-NEXT: [[TMP19:%.*]] = sdiv i64 [[M]], [[CONV6]]
+; CHECK-NEXT: [[TMP20:%.*]] = trunc i64 [[TMP18]] to i32
+; CHECK-NEXT: [[TMP21:%.*]] = trunc i64 [[TMP19]] to i32
+; CHECK-NEXT: [[TMP22:%.*]] = mul i64 [[TMP18]], [[CONV61]]
+; CHECK-NEXT: [[TMP23:%.*]] = mul i64 [[TMP19]], [[CONV61]]
+; CHECK-NEXT: [[TMP24:%.*]] = sub i64 [[TMP12]], [[TMP22]]
+; CHECK-NEXT: [[TMP25:%.*]] = sub i64 [[TMP17]], [[TMP23]]
+; CHECK-NEXT: [[TMP26:%.*]] = trunc i64 [[TMP24]] to i32
+; CHECK-NEXT: [[TMP27:%.*]] = trunc i64 [[TMP25]] to i32
+; CHECK-NEXT: [[TMP28:%.*]] = mul i32 [[X]], [[TMP20]]
+; CHECK-NEXT: [[TMP29:%.*]] = mul i32 [[X]], [[TMP21]]
+; CHECK-NEXT: [[TMP30:%.*]] = add i32 [[TMP28]], [[TMP26]]
+; CHECK-NEXT: [[TMP31:%.*]] = add i32 [[TMP29]], [[TMP27]]
+; CHECK-NEXT: [[TMP32:%.*]] = sext i32 [[TMP30]] to i64
+; CHECK-NEXT: [[TMP33:%.*]] = sext i32 [[TMP31]] to i64
+; CHECK-NEXT: [[TMP34:%.*]] = getelementptr double, ptr [[DST]], i64 [[TMP32]]
+; CHECK-NEXT: [[TMP35:%.*]] = getelementptr double, ptr [[DST]], i64 [[TMP33]]
+; CHECK-NEXT: [[TMP36:%.*]] = getelementptr double, ptr [[TMP34]], i32 0
+; CHECK-NEXT: [[TMP37:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT: [[TMP38:%.*]] = mul i64 [[TMP37]], 2
+; CHECK-NEXT: [[TMP39:%.*]] = getelementptr double, ptr [[TMP34]], i64 [[TMP38]]
+; CHECK-NEXT: store <vscale x 2 x double> zeroinitializer, ptr [[TMP36]], align 8
+; CHECK-NEXT: store <vscale x 2 x double> zeroinitializer, ptr [[TMP39]], align 8
+; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], [[TMP11]]
+; CHECK-NEXT: [[TMP40:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
+; CHECK-NEXT: br i1 [[TMP40]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
+; CHECK: [[MIDDLE_BLOCK]]:
+; CHECK-NEXT: [[CMP_N:%.*]] = icmp eq i64 [[N]], [[N_VEC]]
+; CHECK-NEXT: br i1 [[CMP_N]], label %[[EXIT:.*]], label %[[SCALAR_PH]]
+; CHECK: [[SCALAR_PH]]:
+; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], %[[MIDDLE_BLOCK]] ], [ 0, %[[ENTRY]] ], [ 0, %[[VECTOR_SCEVCHECK]] ]
+; CHECK-NEXT: br label %[[LOOP:.*]]
+; CHECK: [[LOOP]]:
+; CHECK-NEXT: [[IV:%.*]] = phi i64 [ [[BC_RESUME_VAL]], %[[SCALAR_PH]] ], [ [[IV_NEXT:%.*]], %[[LOOP]] ]
+; CHECK-NEXT: [[DIV18:%.*]] = sdiv i64 [[M]], [[CONV6]]
+; CHECK-NEXT: [[CONV20:%.*]] = trunc i64 [[DIV18]] to i32
+; CHECK-NEXT: [[MUL30:%.*]] = mul i64 [[DIV18]], [[CONV61]]
+; CHECK-NEXT: [[SUB31:%.*]] = sub i64 [[IV]], [[MUL30]]
+; CHECK-NEXT: [[CONV34:%.*]] = trunc i64 [[SUB31]] to i32
+; CHECK-NEXT: [[MUL35:%.*]] = mul i32 [[X]], [[CONV20]]
+; CHECK-NEXT: [[ADD36:%.*]] = add i32 [[MUL35]], [[CONV34]]
+; CHECK-NEXT: [[IDXPROM:%.*]] = sext i32 [[ADD36]] to i64
+; CHECK-NEXT: [[GEP:%.*]] = getelementptr double, ptr [[DST]], i64 [[IDXPROM]]
+; CHECK-NEXT: store double 0.000000e+00, ptr [[GEP]], align 8
+; CHECK-NEXT: [[IV_NEXT]] = add i64 [[IV]], 1
+; CHECK-NEXT: [[EC:%.*]] = icmp eq i64 [[IV_NEXT]], [[N]]
+; CHECK-NEXT: br i1 [[EC]], label %[[EXIT]], label %[[LOOP]], !llvm.loop [[LOOP3:![0-9]+]]
+; CHECK: [[EXIT]]:
+; CHECK-NEXT: ret void
+;
+entry:
+ %conv61 = zext i32 %x to i64
+ br label %loop
+
+loop:
+ %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
+ %div18 = sdiv i64 %M, %conv6
+ %conv20 = trunc i64 %div18 to i32
+ %mul30 = mul i64 %div18, %conv61
+ %sub31 = sub i64 %iv, %mul30
+ %conv34 = trunc i64 %sub31 to i32
+ %mul35 = mul i32 %x, %conv20
+ %add36 = add i32 %mul35, %conv34
+ %idxprom = sext i32 %add36 to i64
+ %gep = getelementptr double, ptr %dst, i64 %idxprom
+ store double 0.000000e+00, ptr %gep, align 8
+ %iv.next = add i64 %iv, 1
+ %ec = icmp eq i64 %iv.next, %N
+ br i1 %ec, label %exit, label %loop
+
+exit:
+ ret void
+}
+
+define void @sdiv_feeding_gep_predicated(ptr %dst, i32 %x, i64 %M, i64 %conv6, i64 %N) {
+; CHECK-LABEL: define void @sdiv_feeding_gep_predicated(
+; CHECK-SAME: ptr [[DST:%.*]], i32 [[X:%.*]], i64 [[M:%.*]], i64 [[CONV6:%.*]], i64 [[N:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT: [[ENTRY:.*]]:
+; CHECK-NEXT: [[CONV61:%.*]] = zext i32 [[X]] to i64
+; CHECK-NEXT: br i1 false, label %[[SCALAR_PH:.*]], label %[[VECTOR_SCEVCHECK:.*]]
+; CHECK: [[VECTOR_SCEVCHECK]]:
+; CHECK-NEXT: [[TMP0:%.*]] = add i64 [[N]], -1
+; CHECK-NEXT: [[TMP1:%.*]] = trunc i64 [[TMP0]] to i32
+; CHECK-NEXT: [[TMP2:%.*]] = icmp slt i32 [[TMP1]], 0
+; CHECK-NEXT: [[TMP3:%.*]] = icmp ugt i64 [[TMP0]], 4294967295
+; CHECK-NEXT: [[TMP4:%.*]] = or i1 [[TMP2]], [[TMP3]]
+; CHECK-NEXT: br i1 [[TMP4]], label %[[SCALAR_PH]], label %[[VECTOR_PH:.*]]
+; CHECK: [[VECTOR_PH]]:
+; CHECK-NEXT: [[TMP5:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT: [[TMP6:%.*]] = mul i64 [[TMP5]], 2
+; CHECK-NEXT: [[TMP7:%.*]] = sub i64 [[TMP6]], 1
+; CHECK-NEXT: [[N_RND_UP:%.*]] = add i64 [[N]], [[TMP7]]
+; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[N_RND_UP]], [[TMP6]]
+; CHECK-NEXT: [[N_VEC:%.*]] = sub i64 [[N_RND_UP]], [[N_MOD_VF]]
+; CHECK-NEXT: [[TMP8:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT: [[TMP9:%.*]] = mul i64 [[TMP8]], 2
+; CHECK-NEXT: [[TMP10:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT: [[TMP11:%.*]] = mul i64 [[TMP10]], 2
+; CHECK-NEXT: [[TMP12:%.*]] = sub i64 [[N]], [[TMP11]]
+; CHECK-NEXT: [[TMP13:%.*]] = icmp ugt i64 [[N]], [[TMP11]]
+; CHECK-NEXT: [[TMP14:%.*]] = select i1 [[TMP13]], i64 [[TMP12]], i64 0
+; CHECK-NEXT: [[ACTIVE_LANE_MASK_ENTRY:%.*]] = call <vscale x 2 x i1> @llvm.get.active.lane.mask.nxv2i1.i64(i64 0, i64 [[N]])
+; CHECK-NEXT: [[TMP15:%.*]] = call <vscale x 2 x i64> @llvm.experimental.stepvector.nxv2i64()
+; CHECK-NEXT: [[TMP16:%.*]] = add <vscale x 2 x i64> [[TMP15]], zeroinitializer
+; CHECK-NEXT: [[TMP17:%.*]] = mul <vscale x 2 x i64> [[TMP16]], shufflevector (<vscale x 2 x i64> insertelement (<vscale x 2 x i64> poison, i64 1, i64 0), <vscale x 2 x i64> poison, <vscale x 2 x i32> zeroinitializer)
+; CHECK-NEXT: [[INDUCTION:%.*]] = add <vscale x 2 x i64> zeroinitializer, [[TMP17]]
+; CHECK-NEXT: [[TMP18:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT: [[TMP19:%.*]] = mul i64 [[TMP18]], 2
+; CHECK-NEXT: [[TMP20:%.*]] = mul i64 1, [[TMP19]]
+; CHECK-NEXT: [[DOTSPLATINSERT:%.*]] = insertelement <vscale x 2 x i64> poison, i64 [[TMP20]], i64 0
+; CHECK-NEXT: [[DOTSPLAT:%.*]] = shufflevector <vscale x 2 x i64> [[DOTSPLATINSERT]], <vscale x 2 x i64> poison, <vscale x 2 x i32> zeroinitializer
+; CHECK-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <vscale x 2 x i64> poison, i64 [[M]], i64 0
+; CHECK-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <vscale x 2 x i64> [[BROADCAST_SPLATINSERT]], <vscale x 2 x i64> poison, <vscale x 2 x i32> zeroinitializer
+; CHECK-NEXT: [[BROADCAST_SPLATINSERT1:%.*]] = insertelement <vscale x 2 x i64> poison, i64 [[CONV6]], i64 0
+; CHECK-NEXT: [[BROADCAST_SPLAT2:%.*]] = shufflevector <vscale x 2 x i64> [[BROADCAST_SPLATINSERT1]], <vscale x 2 x i64> poison, <vscale x 2 x i32> zeroinitializer
+; CHECK-NEXT: br label %[[VECTOR_BODY:.*]]
+; CHECK: [[VECTOR_BODY]]:
+; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT: [[ACTIVE_LANE_MASK:%.*]] = phi <vscale x 2 x i1> [ [[ACTIVE_LANE_MASK_ENTRY]], %[[VECTOR_PH]] ], [ [[ACTIVE_LANE_MASK_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT: [[VEC_IND:%.*]] = phi <vscale x 2 x i64> [ [[INDUCTION]], %[[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT: [[TMP21:%.*]] = add i64 [[INDEX]], 0
+; CHECK-NEXT: [[TMP22:%.*]] = icmp ule <vscale x 2 x i64> [[VEC_IND]], [[BROADCAST_SPLAT]]
+; CHECK-NEXT: [[TMP23:%.*]] = select <vscale x 2 x i1> [[ACTIVE_LANE_MASK]], <vscale x 2 x i1> [[TMP22]], <vscale x 2 x i1> zeroinitializer
+; CHECK-NEXT: [[TMP24:%.*]] = select <vscale x 2 x i1> [[TMP23]], <vscale x 2 x i64> [[BROADCAST_SPLAT2]], <vscale x 2 x i64> shufflevector (<vscale x 2 x i64> insertelement (<vscale x 2 x i64> poison, i64 1, i64 0), <vscale x 2 x i64> poison, <vscale x 2 x i32> zeroinitializer)
+; CHECK-NEXT: [[TMP25:%.*]] = sdiv <vscale x 2 x i64> [[BROADCAST_SPLAT]], [[TMP24]]
+; CHECK-NEXT: [[TMP26:%.*]] = extractelement <vscale x 2 x i64> [[TMP25]], i32 0
+; CHECK-NEXT: [[TMP27:%.*]] = trunc i64 [[TMP26]] to i32
+; CHECK-NEXT: [[TMP28:%.*]] = mul i64 [[TMP26]], [[CONV61]]
+; CHECK-NEXT: [[TMP29:%.*]] = sub i64 [[TMP21]], [[TMP28]]
+; CHECK-NEXT: [[TMP30:%.*]] = trunc i64 [[TMP29]] to i32
+; CHECK-NEXT: [[TMP31:%.*]] = mul i32 [[X]], [[TMP27]]
+; CHECK-NEXT: [[TMP32:%.*]] = add i32 [[TMP31]], [[TMP30]]
+; CHECK-NEXT: [[TMP33:%.*]] = sext i32 [[TMP32]] to i64
+; CHECK-NEXT: [[TMP34:%.*]] = getelementptr double, ptr [[DST]], i64 [[TMP33]]
+; CHECK-NEXT: [[TMP35:%.*]] = getelementptr double, ptr [[TMP34]], i32 0
+; CHECK-NEXT: call void @llvm.masked.store.nxv2f64.p0(<vscale x 2 x double> zeroinitializer, ptr [[TMP35]], i32 8, <vscale x 2 x i1> [[TMP23]])
+; CHECK-NEXT: [[INDEX_NEXT]] = add i64 [[INDEX]], [[TMP9]]
+; CHECK-NEXT: [[ACTIVE_LANE_MASK_NEXT]] = call <vscale x 2 x i1> @llvm.get.active.lane.mask.nxv2i1.i64(i64 [[INDEX]], i64 [[TMP14]])
+; CHECK-NEXT: [[TMP36:%.*]] = xor <vscale x 2 x i1> [[ACTIVE_LANE_MASK_NEXT]], shufflevector (<vscale x 2 x i1> insertelement (<vscale x 2 x i1> poison, i1 true, i64 0), <vs...
[truncated]
if (isScalarWithPredication(I, VF)) {
LLVM_DEBUG(dbgs() << "LV: Found not uniform being ScalarWithPredication: "
<< *I << "\n");
if (isPredicatedInst(I, VF, true)) { |
Suggested change:
- if (isPredicatedInst(I, VF, true)) {
+ if (isPredicatedInst(I, VF, /*IsKnownUniform=*/true)) {
Code is gone in the latest version, thanks!
@@ -3907,7 +3919,7 @@ void LoopVectorizationCostModel::collectLoopUniforms(ElementCount VF) {
SetVector<Instruction *> Worklist;

// Add uniform instructions demanding lane 0 to the worklist. Instructions
// that are scalar with predication must not be considered uniform after
// that are require predication must not be considered uniform after |
Suggested change:
- // that are require predication must not be considered uniform after
+ // that require predication must not be considered uniform after
(Should have been fixed by earlier PR)
// When folding the tail, at least one of the lanes must execute
// unconditionally. If the divisor is loop-invariant no predication is
// needed, as predication would not prevent the divide-by-0 on the executed
// lane.
if (foldTailByMasking() && !Legal->blockNeedsPredication(I->getParent()) &&
    TheLoop->isLoopInvariant(I->getOperand(1)) &&
    (IsKnownUniform || isUniformAfterVectorization(I, VF)))
return false; |
This is similar to unpredicating loads and stores above, and should have consistent explanations and implementations. Checking !Legal->blockNeedsPredication(I->getParent()) already implies foldTailByMasking(), given that we passed blockNeedsPredicationForAnyReason(I->getParent()) to get here. Why are IsKnownUniform and isUniformAfterVectorization needed? Would Legal->isInvariant(I->getOperand(1)) suffice, as in the load-from-uniform-address case above?
A side-effecting instruction under mask could be unmasked provided (a) its side-effect operands are uniform (divisor for div/rem, address for load, address and value for store), and (b) its mask is known to have its first lane set (conceivably extensible to having any lane set?). This is the case for instructions subject only to the fold-tail mask. Would be good to have an explicit API, say, MaskKnownToExcludeFirstLane().
Would it suffice to have
if (Legal->isInvariant(I->getOperand(1)) &&
    !Legal->blockNeedsPredication(I->getParent()))
  return false;
as in the load-from-uniform-address case above?
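Purely to illustrate the shape of such a helper, a hypothetical sketch (the name and placement are assumptions for illustration, not part of this patch; it only combines queries that already appear in the diff above):

// Sketch: an instruction is masked only by the fold-tail mask when its block
// needed no predication in the original scalar loop; that mask always has its
// first lane active, so side effects driven by invariant operands are the same
// whether or not the instruction is masked.
bool maskedOnlyByTailFolding(const Instruction *I) const {
  return foldTailByMasking() && !Legal->blockNeedsPredication(I->getParent());
}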
Thanks, I restructured the code and generalized it so the unconditionally-executed but tail-folded cases are all handled at the end.
Removed the IsKnownUniform requirement.
Force-pushed from bafa29b to 0f63249.
Force-pushed from 0f63249 to 8a71ea7.
This looks fine, raising some more thoughts ... which could possibly be addressed separately.
@@ -3348,40 +3348,48 @@ bool LoopVectorizationCostModel::isPredicatedInst(Instruction *I) const {
return false;

// Can we prove this instruction is safe to unconditionally execute?
// If not, we must use some form of predication. |
Is this comment removal intended?
Code restructured in the latest version, with the separate if being removed.
@@ -3348,40 +3348,48 @@ bool LoopVectorizationCostModel::isPredicatedInst(Instruction *I) const {
return false;
Suggested change:
- return false;
+ // If predication is not needed, avoid it.
+ if (!blockNeedsPredicationForAnyReason(I->getParent())
+     || isSafeToSpeculativelyExecute(I)
+     || !Legal->isMaskRequired(I)
+     || isa<BranchInst, PHINode>(I))
+   return false;
Should the case of I being isSafeToSpeculativelyExecute() be handled here at the outset, along with the case of I's block not needing predication for any reason?
Legal's isMaskRequired() deals only with loads, stores, and calls (in particular "assume" intrinsics, which get dropped rather than predicated), and is tail-folding aware. Should originally unconditional loads and stores (of invariant values) from invariant addresses be excluded from MaskedOps? I.e., should Legal's blockCanBePredicated() take care of unmasking such loads and stores under tail-folding, alongside its unmasking of loads from safe pointers? Furthermore, is the distinction between CM.isPredicatedInst(I) and Legal.isMaskRequired(I) clear and desired, or could the collection of all MaskedOps/PredicatedInsts be done once? Masks were and still are introduced only where necessary, as a functional requirement rather than a performance preference.
Reorganized to combine all checks at the top, thanks!
As for unifying the logic, it might be good to do that separately, but it definitely seems worthwhile given that the decisions are independent of cost.
return false;

// TODO: We can use the loop-preheader as context point here and get
// context sensitive reasoning |
Suggested change:
- // context sensitive reasoning
+ // context sensitive reasoning.
Would be good to indicate that the comment refers to isSafeToSpeculativelyExecute(I), i.e., is I safe to speculatively execute in the preheader.
Updated, thanks!
// Tail folding may introduce additional predication, but we're guaranteed to
// always have at least one active lane. If the instruction in the original
// scalar loop was executed unconditionally, it may not need predication,
// depending on its operands. |
Suggesting a rephrase, feel free to modify:

Suggested change:
- // Tail folding may introduce additional predication, but we're guaranteed to
- // always have at least one active lane. If the instruction in the original
- // scalar loop was executed unconditionally, it may not need predication,
- // depending on its operands.
+ // All that remain are instructions with side-effects originally executed in
+ // the loop unconditionally, but now execute under a tail-fold mask (only)
+ // having at least one active lane (the first). If the side-effects of the
+ // instruction are invariant, executing it w/o (the tail-folding) mask is
+ // safe - it will cause the same side-effects as when masked.
Adjusted, thanks!
"instruction should have been considered to not require predication " | ||
"by earlier checks"); |
"instruction should have been considered to not require predication " | |
"by earlier checks"); | |
"instruction should have been considered by earlier checks"); |
to either require predication or not.
; CHECK-NEXT: [[TMP15:%.*]] = phi i64 [ poison, %[[PRED_SDIV_CONTINUE4]] ], [ [[TMP14]], %[[PRED_SDIV_IF5]] ]
; CHECK-NEXT: [[TMP16:%.*]] = extractelement <4 x i1> [[TMP6]], i32 3
; CHECK-NEXT: br i1 [[TMP16]], label %[[PRED_SDIV_IF7:.*]], label %[[PRED_SDIV_CONTINUE8]]
; CHECK: [[PRED_SDIV_IF7]]:
; CHECK-NEXT: [[TMP17:%.*]] = sdiv i64 [[M]], [[CONV6]]
Note: invariant sdiv in header with invariant divisor and dividend remains scalar w/o predication rather than scalar w/ predication.
; CHECK-NEXT: [[TMP74:%.*]] = ashr i64 [[TMP73]], 32
; CHECK-NEXT: [[TMP75:%.*]] = getelementptr i64, ptr [[DST]], i64 [[TMP74]]
; CHECK-NEXT: [[TMP76:%.*]] = getelementptr i64, ptr [[TMP75]], i32 0
; CHECK-NEXT: call void @llvm.masked.store.v4i64.p0(<4 x i64> [[TMP60]], ptr [[TMP76]], i32 4, <4 x i1> [[TMP5]])
It is quite remarkable to see that TMP76 is known to be an AddRec, leading to vectorizing the store into a wide vector store. But why not simplify this into a store to %dst[%iv]?
I think at the moment we always base the vector pointers on the original IR pointer; in this case it would be more efficient to instead expand the SCEV for the pointer.
@@ -254,117 +227,9 @@ define void @udiv_urem_feeding_gep(i64 %x, ptr %dst, i64 %N) {
; CHECK-NEXT: [[ENTRY:.*]]:
; CHECK-NEXT: [[MUL_1_I:%.*]] = mul i64 [[X]], [[X]]
; CHECK-NEXT: [[MUL_2_I:%.*]] = mul i64 [[MUL_1_I]], [[X]]
; CHECK-NEXT: [[TMP0:%.*]] = add i64 [[N]], 1
; CHECK-NEXT: br i1 false, label %[[SCALAR_PH:.*]], label %[[VECTOR_SCEVCHECK:.*]]
This case involving six divs and rems all with invariant divisors in the header, now avoids vectorizing. It is indeed questionable if vectorization is profitable - packing scalar remainders into a vector just to feed a wide store. But worth understanding why unpredicating these divs and rems seems less profitable / detrimental.
It looks like this is down to a large difference in the costs of udiv <4 x i64> vs 4 * udiv i64 + scalarization overhead estimate in LV (80 vs 7!). This looks like an X86 cost-model issue independent of the patch.
@@ -1,4 +1,5 @@
; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5
; NOTE: Assertions t have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5 |
Suggested change:
- ; NOTE: Assertions t have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5
?
Removed, thanks!
; CHECK-NEXT: [[TMP46:%.*]] = getelementptr i64, ptr [[TMP45]], i32 0
; CHECK-NEXT: call void @llvm.masked.store.nxv2i64.p0(<vscale x 2 x i64> [[TMP22]], ptr [[TMP46]], i32 4, <vscale x 2 x i1> [[ACTIVE_LANE_MASK]])
; CHECK-NEXT: [[TMP21:%.*]] = add i64 [[INDEX]], 0
; CHECK-NEXT: [[TMP23:%.*]] = udiv <vscale x 2 x i64> [[VEC_IND]], [[BROADCAST_SPLAT4]]
This seems right - broadcasting the invariant divisors rather than filling poison for masked lanes, or computing a single unmasked lane if only it is used.
Doing so for fixed vectors in X86/divs-with-tail-folding.ll led to abandoning vectorization - see below.
This LGTM, thanks!
Perhaps worth leaving behind TODOs and/or opening tickets for
- folding CM.isPredicatedInst() into Legal.isMaskRequired()
- X86 cost-model issue with vector versus scalar udiv
- expand the AddRec SCEV of pointers for wide loads/stores instead of processing their existing GEP computation
Thanks, added a TODO for the first, will file for the others.
When folding the tail, at least one of the lanes must execute unconditionally. If the divisor is loop-invariant no predication is needed, as predication would not prevent the divide-by-0 on the executed lane.

Depends on #98892.