-
Notifications
You must be signed in to change notification settings - Fork 14.3k
[LoopVectorize] Add the cost of VPInstruction::AnyOf to vplan #125058
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@llvm/pr-subscribers-llvm-transforms @llvm/pr-subscribers-vectorizers Author: David Sherwood (david-arm) ChangesThis patch adds an initial implementation of Full diff: https://github.com/llvm/llvm-project/pull/125058.diff 4 Files Affected:
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index 1e948d2c756c24..e2947d73361cc5 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -7539,6 +7539,7 @@ VectorizationFactor LoopVectorizationPlanner::computeBestVF() {
CM.CostKind);
precomputeCosts(BestPlan, BestFactor.Width, CostCtx);
assert((BestFactor.Width == LegacyVF.Width ||
+ Legal->hasUncountableEarlyExit() ||
planContainsAdditionalSimplifications(getPlanFor(BestFactor.Width),
CostCtx, OrigLoop) ||
planContainsAdditionalSimplifications(getPlanFor(LegacyVF.Width),
diff --git a/llvm/lib/Transforms/Vectorize/VPlan.h b/llvm/lib/Transforms/Vectorize/VPlan.h
index 459222234bc37f..051cda12875917 100644
--- a/llvm/lib/Transforms/Vectorize/VPlan.h
+++ b/llvm/lib/Transforms/Vectorize/VPlan.h
@@ -1316,10 +1316,7 @@ class VPInstruction : public VPRecipeWithIRFlags,
/// Return the cost of this VPInstruction.
InstructionCost computeCost(ElementCount VF,
- VPCostContext &Ctx) const override {
- // TODO: Compute accurate cost after retiring the legacy cost model.
- return 0;
- }
+ VPCostContext &Ctx) const override;
#if !defined(NDEBUG) || defined(LLVM_ENABLE_DUMP)
/// Print the VPInstruction to \p O.
diff --git a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
index 81031b9401ca09..3e484a68751a8e 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
@@ -709,6 +709,20 @@ Value *VPInstruction::generate(VPTransformState &State) {
}
}
+InstructionCost VPInstruction::computeCost(ElementCount VF,
+ VPCostContext &Ctx) const {
+ switch (getOpcode()) {
+ case VPInstruction::AnyOf: {
+ auto *VecI1Ty = VectorType::get(Type::getInt1Ty(Ctx.LLVMCtx), VF);
+ return Ctx.TTI.getArithmeticReductionCost(Instruction::Or, VecI1Ty,
+ std::nullopt, Ctx.CostKind);
+ }
+ default:
+ // TODO: Fill out other opcodes!
+ return 0;
+ }
+}
+
bool VPInstruction::isVectorToScalar() const {
return getOpcode() == VPInstruction::ExtractFromEnd ||
getOpcode() == VPInstruction::ExtractFirstActive ||
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/simple_early_exit.ll b/llvm/test/Transforms/LoopVectorize/AArch64/simple_early_exit.ll
index 9d21ea0ab6de39..b439b64e829e5f 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/simple_early_exit.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/simple_early_exit.ll
@@ -272,36 +272,52 @@ define i64 @loop_contains_safe_div() #1 {
; CHECK-NEXT: [[P2:%.*]] = alloca [1024 x i8], align 4
; CHECK-NEXT: call void @init_mem(ptr [[P1]], i64 1024)
; CHECK-NEXT: call void @init_mem(ptr [[P2]], i64 1024)
+; CHECK-NEXT: [[TMP11:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT: [[TMP12:%.*]] = mul i64 [[TMP11]], 4
; CHECK-NEXT: br i1 false, label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
; CHECK: vector.ph:
+; CHECK-NEXT: [[TMP10:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT: [[TMP3:%.*]] = mul i64 [[TMP10]], 4
+; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 64, [[TMP3]]
+; CHECK-NEXT: [[INDEX1:%.*]] = sub i64 64, [[N_MOD_VF]]
+; CHECK-NEXT: [[TMP4:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT: [[TMP5:%.*]] = mul i64 [[TMP4]], 4
+; CHECK-NEXT: [[OFFSET_IDX:%.*]] = add i64 3, [[INDEX1]]
+; CHECK-NEXT: [[TMP16:%.*]] = call <vscale x 4 x i64> @llvm.stepvector.nxv4i64()
+; CHECK-NEXT: [[TMP17:%.*]] = mul <vscale x 4 x i64> [[TMP16]], splat (i64 1)
+; CHECK-NEXT: [[INDUCTION:%.*]] = add <vscale x 4 x i64> splat (i64 3), [[TMP17]]
+; CHECK-NEXT: [[TMP9:%.*]] = mul i64 1, [[TMP5]]
+; CHECK-NEXT: [[DOTSPLATINSERT:%.*]] = insertelement <vscale x 4 x i64> poison, i64 [[TMP9]], i64 0
+; CHECK-NEXT: [[DOTSPLAT:%.*]] = shufflevector <vscale x 4 x i64> [[DOTSPLATINSERT]], <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer
; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
; CHECK: vector.body:
-; CHECK-NEXT: [[INDEX1:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT2:%.*]], [[VECTOR_BODY]] ]
-; CHECK-NEXT: [[VEC_IND:%.*]] = phi <2 x i64> [ <i64 3, i64 4>, [[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], [[VECTOR_BODY]] ]
-; CHECK-NEXT: [[OFFSET_IDX:%.*]] = add i64 3, [[INDEX1]]
-; CHECK-NEXT: [[TMP0:%.*]] = add i64 [[OFFSET_IDX]], 0
+; CHECK-NEXT: [[INDEX2:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT2:%.*]], [[VECTOR_BODY]] ]
+; CHECK-NEXT: [[VEC_IND:%.*]] = phi <vscale x 4 x i64> [ [[INDUCTION]], [[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], [[VECTOR_BODY]] ]
+; CHECK-NEXT: [[OFFSET_IDX1:%.*]] = add i64 3, [[INDEX2]]
+; CHECK-NEXT: [[TMP0:%.*]] = add i64 [[OFFSET_IDX1]], 0
; CHECK-NEXT: [[TMP1:%.*]] = getelementptr inbounds i32, ptr [[P1]], i64 [[TMP0]]
; CHECK-NEXT: [[TMP2:%.*]] = getelementptr inbounds i32, ptr [[TMP1]], i32 0
-; CHECK-NEXT: [[WIDE_LOAD:%.*]] = load <2 x i32>, ptr [[TMP2]], align 1
-; CHECK-NEXT: [[TMP3:%.*]] = udiv <2 x i32> [[WIDE_LOAD]], splat (i32 20000)
-; CHECK-NEXT: [[TMP4:%.*]] = icmp eq <2 x i32> [[TMP3]], splat (i32 1)
-; CHECK-NEXT: [[INDEX_NEXT2]] = add nuw i64 [[INDEX1]], 2
-; CHECK-NEXT: [[TMP5:%.*]] = xor <2 x i1> [[TMP4]], splat (i1 true)
-; CHECK-NEXT: [[TMP6:%.*]] = call i1 @llvm.vector.reduce.or.v2i1(<2 x i1> [[TMP5]])
-; CHECK-NEXT: [[TMP7:%.*]] = icmp eq i64 [[INDEX_NEXT2]], 64
-; CHECK-NEXT: [[VEC_IND_NEXT]] = add <2 x i64> [[VEC_IND]], splat (i64 2)
+; CHECK-NEXT: [[WIDE_LOAD:%.*]] = load <vscale x 4 x i32>, ptr [[TMP2]], align 1
+; CHECK-NEXT: [[TMP13:%.*]] = udiv <vscale x 4 x i32> [[WIDE_LOAD]], splat (i32 20000)
+; CHECK-NEXT: [[TMP14:%.*]] = icmp eq <vscale x 4 x i32> [[TMP13]], splat (i32 1)
+; CHECK-NEXT: [[INDEX_NEXT2]] = add nuw i64 [[INDEX2]], [[TMP5]]
+; CHECK-NEXT: [[TMP15:%.*]] = xor <vscale x 4 x i1> [[TMP14]], splat (i1 true)
+; CHECK-NEXT: [[TMP6:%.*]] = call i1 @llvm.vector.reduce.or.nxv4i1(<vscale x 4 x i1> [[TMP15]])
+; CHECK-NEXT: [[TMP7:%.*]] = icmp eq i64 [[INDEX_NEXT2]], [[INDEX1]]
+; CHECK-NEXT: [[VEC_IND_NEXT]] = add <vscale x 4 x i64> [[VEC_IND]], [[DOTSPLAT]]
; CHECK-NEXT: [[TMP8:%.*]] = or i1 [[TMP6]], [[TMP7]]
; CHECK-NEXT: br i1 [[TMP8]], label [[MIDDLE_SPLIT:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP8:![0-9]+]]
; CHECK: middle.split:
; CHECK-NEXT: br i1 [[TMP6]], label [[VECTOR_EARLY_EXIT:%.*]], label [[MIDDLE_BLOCK:%.*]]
; CHECK: vector.early.exit:
-; CHECK-NEXT: [[FIRST_ACTIVE_LANE:%.*]] = call i64 @llvm.experimental.cttz.elts.i64.v2i1(<2 x i1> [[TMP5]], i1 true)
-; CHECK-NEXT: [[EARLY_EXIT_VALUE:%.*]] = extractelement <2 x i64> [[VEC_IND]], i64 [[FIRST_ACTIVE_LANE]]
+; CHECK-NEXT: [[FIRST_ACTIVE_LANE:%.*]] = call i64 @llvm.experimental.cttz.elts.i64.nxv4i1(<vscale x 4 x i1> [[TMP15]], i1 true)
+; CHECK-NEXT: [[EARLY_EXIT_VALUE:%.*]] = extractelement <vscale x 4 x i64> [[VEC_IND]], i64 [[FIRST_ACTIVE_LANE]]
; CHECK-NEXT: br label [[LOOP_END:%.*]]
; CHECK: middle.block:
-; CHECK-NEXT: br i1 true, label [[LOOP_END]], label [[SCALAR_PH]]
+; CHECK-NEXT: [[CMP_N:%.*]] = icmp eq i64 64, [[INDEX1]]
+; CHECK-NEXT: br i1 [[CMP_N]], label [[LOOP_END]], label [[SCALAR_PH]]
; CHECK: scalar.ph:
-; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ 67, [[MIDDLE_BLOCK]] ], [ 3, [[ENTRY:%.*]] ]
+; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ [[OFFSET_IDX]], [[MIDDLE_BLOCK]] ], [ 3, [[ENTRY:%.*]] ]
; CHECK-NEXT: br label [[LOOP:%.*]]
; CHECK: loop:
; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ [[INDEX_NEXT:%.*]], [[LOOP_INC:%.*]] ], [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ]
|
switch (getOpcode()) { | ||
case VPInstruction::AnyOf: { | ||
auto *VecI1Ty = VectorType::get(Type::getInt1Ty(Ctx.LLVMCtx), VF); | ||
return Ctx.TTI.getArithmeticReductionCost(Instruction::Or, VecI1Ty, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi Dave,
Is there a reason behind using the reduction cost specifically?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah so in VPInstruction::generate
the AnyOf instruction is implemented as follows:
case VPInstruction::AnyOf: {
Value *A = State.get(getOperand(0));
return Builder.CreateOrReduce(A);
}
i.e. it's creating an or
reduction across the input vector so I've just created a cost that matches the IR generated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aha, I see. Thanks for the clarification.
@@ -7539,7 +7539,7 @@ VectorizationFactor LoopVectorizationPlanner::computeBestVF() { | |||
CM.CostKind); | |||
precomputeCosts(BestPlan, BestFactor.Width, CostCtx); | |||
assert((BestFactor.Width == LegacyVF.Width || | |||
Legal->hasUncountableEarlyExit() || | |||
BestPlan.hasRegionWithEarlyExit() || |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the update. I've been thinking about the naming a bit, and the region/plan doesn't technically have any early exits any more , as we converted those, but the input had one. I don't really have any good suggestions for the naming. Any ideas for a better terminology to use here?
Alternatively we could just inline the check here, with an explanation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I struggled with naming too. There is only one exiting block that just happens to have the conditions merged and there is only a single successor - the middle split block. What most accurately describes the CFG is something like:
doesRegionTerminatorHaveMultipleConditionsForExiting
or
isMiddleBlockIndirectSuccessorOfVectorRegion
but these seem a bit wordy! I'm happy to inline the check here for now, however I imagine at some point it will be useful to have some way of querying this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would probably be best to inline things for now, and reconsider after more uses arise. Maybe things will get clearer then
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, done!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM with suggestion inline, thanks
This patch adds an initial implementation of VPInstruction::computeCost with support for only one instruction so far - VPInstruction::AnyOf. This is only used when vectorising loops with uncountable early exits.
I had to rebase + force push just to work around the ridiculous "Processing Updates" hang in github since this morning. |
I still miss Phabricator. :( |
Following on from llvm#125058, this patch takes into account the work done in the vector early exit block when assessing the profitability of vectorising the loop. I have renamed areRuntimeChecksProfitable to isOutsideLoopWorkProfitable and we now pass in the early exit costs. As part of this, I have added the ExtractFirstActive opcode to VPInstruction::computeCost. It's worth pointing out that when we assess profitability of the loop we calculate a minimum trip count and compare that against the *maximum* trip count. However, since the loop has an early exit the runtime trip count can still end up being less than the minimum. Alternatively, we may never take the early exit at all at runtime and so we have the opposite problem of over-estimating the cost of the loop. The loop vectoriser cannot simultaneously take two contradictory positions and so I feel the only sensible thing to do is be conservative and assume the loop will be more expensive than loops without early exits.
…25058) This patch adds an initial implementation of VPInstruction::computeCost with support for only one instruction so far - VPInstruction::AnyOf. This is only used when vectorising loops with uncountable early exits.
Following on from llvm#125058, this patch takes into account the work done in the vector early exit block when assessing the profitability of vectorising the loop. I have renamed areRuntimeChecksProfitable to isOutsideLoopWorkProfitable and we now pass in the early exit costs. As part of this, I have added the ExtractFirstActive opcode to VPInstruction::computeCost. It's worth pointing out that when we assess profitability of the loop we calculate a minimum trip count and compare that against the *maximum* trip count. However, since the loop has an early exit the runtime trip count can still end up being less than the minimum. Alternatively, we may never take the early exit at all at runtime and so we have the opposite problem of over-estimating the cost of the loop. The loop vectoriser cannot simultaneously take two contradictory positions and so I feel the only sensible thing to do is be conservative and assume the loop will be more expensive than loops without early exits. We may find in future that we need to adjust the cost according to the probability of taking the early exit. This will become even more important once we support multiple early exits. However, we have to start somewhere and we can always revisit this later.
Following on from llvm#125058, this patch takes into account the work done in the vector early exit block when assessing the profitability of vectorising the loop. I have renamed areRuntimeChecksProfitable to isOutsideLoopWorkProfitable and we now pass in the early exit costs. As part of this, I have added the ExtractFirstActive opcode to VPInstruction::computeCost. It's worth pointing out that when we assess profitability of the loop we calculate a minimum trip count and compare that against the *maximum* trip count. However, since the loop has an early exit the runtime trip count can still end up being less than the minimum. Alternatively, we may never take the early exit at all at runtime and so we have the opposite problem of over-estimating the cost of the loop. The loop vectoriser cannot simultaneously take two contradictory positions and so I feel the only sensible thing to do is be conservative and assume the loop will be more expensive than loops without early exits. We may find in future that we need to adjust the cost according to the probability of taking the early exit. This will become even more important once we support multiple early exits. However, we have to start somewhere and we can always revisit this later.
Following on from llvm#125058, this patch takes into account the work done in the vector early exit block when assessing the profitability of vectorising the loop. I have renamed areRuntimeChecksProfitable to isOutsideLoopWorkProfitable and we now pass in the early exit costs. As part of this, I have added the ExtractFirstActive opcode to VPInstruction::computeCost. It's worth pointing out that when we assess profitability of the loop we calculate a minimum trip count and compare that against the *maximum* trip count. However, since the loop has an early exit the runtime trip count can still end up being less than the minimum. Alternatively, we may never take the early exit at all at runtime and so we have the opposite problem of over-estimating the cost of the loop. The loop vectoriser cannot simultaneously take two contradictory positions and so I feel the only sensible thing to do is be conservative and assume the loop will be more expensive than loops without early exits. We may find in future that we need to adjust the cost according to the probability of taking the early exit. This will become even more important once we support multiple early exits. However, we have to start somewhere and we can always revisit this later.
) Following on from #125058, this patch takes into account the work done in the vector early exit block when assessing the profitability of vectorising the loop. I have renamed areRuntimeChecksProfitable to isOutsideLoopWorkProfitable and we now pass in the early exit costs. As part of this, I have added the ExtractFirstActive opcode to VPInstruction::computeCost. It's worth pointing out that when we assess profitability of the loop we calculate a minimum trip count and compare that against the *maximum* trip count. However, since the loop has an early exit the runtime trip count can still end up being less than the minimum. Alternatively, we may never take the early exit at all at runtime and so we have the opposite problem of over-estimating the cost of the loop. The loop vectoriser cannot simultaneously take two contradictory positions and so I feel the only sensible thing to do is be conservative and assume the loop will be more expensive than loops without early exits. We may find in future that we need to adjust the cost according to the probability of taking the early exit. This will become even more important once we support multiple early exits. However, we have to start somewhere and we can always revisit this later.
This patch adds an initial implementation of
VPInstruction::computeCost with support for only one
instruction so far - VPInstruction::AnyOf. This is only
used when vectorising loops with uncountable early exits.