Skip to content

[LoopVectorize] Add the cost of VPInstruction::AnyOf to vplan #125058

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Feb 5, 2025

Conversation

david-arm
Copy link
Contributor

This patch adds an initial implementation of
VPInstruction::computeCost with support for only one
instruction so far - VPInstruction::AnyOf. This is only
used when vectorising loops with uncountable early exits.

@llvmbot
Copy link
Member

llvmbot commented Jan 30, 2025

@llvm/pr-subscribers-llvm-transforms

@llvm/pr-subscribers-vectorizers

Author: David Sherwood (david-arm)

Changes

This patch adds an initial implementation of
VPInstruction::computeCost with support for only one
instruction so far - VPInstruction::AnyOf. This is only
used when vectorising loops with uncountable early exits.


Full diff: https://github.com/llvm/llvm-project/pull/125058.diff

4 Files Affected:

  • (modified) llvm/lib/Transforms/Vectorize/LoopVectorize.cpp (+1)
  • (modified) llvm/lib/Transforms/Vectorize/VPlan.h (+1-4)
  • (modified) llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp (+14)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/simple_early_exit.ll (+32-16)
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index 1e948d2c756c24..e2947d73361cc5 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -7539,6 +7539,7 @@ VectorizationFactor LoopVectorizationPlanner::computeBestVF() {
                         CM.CostKind);
   precomputeCosts(BestPlan, BestFactor.Width, CostCtx);
   assert((BestFactor.Width == LegacyVF.Width ||
+          Legal->hasUncountableEarlyExit() ||
           planContainsAdditionalSimplifications(getPlanFor(BestFactor.Width),
                                                 CostCtx, OrigLoop) ||
           planContainsAdditionalSimplifications(getPlanFor(LegacyVF.Width),
diff --git a/llvm/lib/Transforms/Vectorize/VPlan.h b/llvm/lib/Transforms/Vectorize/VPlan.h
index 459222234bc37f..051cda12875917 100644
--- a/llvm/lib/Transforms/Vectorize/VPlan.h
+++ b/llvm/lib/Transforms/Vectorize/VPlan.h
@@ -1316,10 +1316,7 @@ class VPInstruction : public VPRecipeWithIRFlags,
 
   /// Return the cost of this VPInstruction.
   InstructionCost computeCost(ElementCount VF,
-                              VPCostContext &Ctx) const override {
-    // TODO: Compute accurate cost after retiring the legacy cost model.
-    return 0;
-  }
+                              VPCostContext &Ctx) const override;
 
 #if !defined(NDEBUG) || defined(LLVM_ENABLE_DUMP)
   /// Print the VPInstruction to \p O.
diff --git a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
index 81031b9401ca09..3e484a68751a8e 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
@@ -709,6 +709,20 @@ Value *VPInstruction::generate(VPTransformState &State) {
   }
 }
 
+InstructionCost VPInstruction::computeCost(ElementCount VF,
+                                           VPCostContext &Ctx) const {
+  switch (getOpcode()) {
+  case VPInstruction::AnyOf: {
+    auto *VecI1Ty = VectorType::get(Type::getInt1Ty(Ctx.LLVMCtx), VF);
+    return Ctx.TTI.getArithmeticReductionCost(Instruction::Or, VecI1Ty,
+                                              std::nullopt, Ctx.CostKind);
+  }
+  default:
+    // TODO: Fill out other opcodes!
+    return 0;
+  }
+}
+
 bool VPInstruction::isVectorToScalar() const {
   return getOpcode() == VPInstruction::ExtractFromEnd ||
          getOpcode() == VPInstruction::ExtractFirstActive ||
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/simple_early_exit.ll b/llvm/test/Transforms/LoopVectorize/AArch64/simple_early_exit.ll
index 9d21ea0ab6de39..b439b64e829e5f 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/simple_early_exit.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/simple_early_exit.ll
@@ -272,36 +272,52 @@ define i64 @loop_contains_safe_div() #1 {
 ; CHECK-NEXT:    [[P2:%.*]] = alloca [1024 x i8], align 4
 ; CHECK-NEXT:    call void @init_mem(ptr [[P1]], i64 1024)
 ; CHECK-NEXT:    call void @init_mem(ptr [[P2]], i64 1024)
+; CHECK-NEXT:    [[TMP11:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT:    [[TMP12:%.*]] = mul i64 [[TMP11]], 4
 ; CHECK-NEXT:    br i1 false, label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
 ; CHECK:       vector.ph:
+; CHECK-NEXT:    [[TMP10:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT:    [[TMP3:%.*]] = mul i64 [[TMP10]], 4
+; CHECK-NEXT:    [[N_MOD_VF:%.*]] = urem i64 64, [[TMP3]]
+; CHECK-NEXT:    [[INDEX1:%.*]] = sub i64 64, [[N_MOD_VF]]
+; CHECK-NEXT:    [[TMP4:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT:    [[TMP5:%.*]] = mul i64 [[TMP4]], 4
+; CHECK-NEXT:    [[OFFSET_IDX:%.*]] = add i64 3, [[INDEX1]]
+; CHECK-NEXT:    [[TMP16:%.*]] = call <vscale x 4 x i64> @llvm.stepvector.nxv4i64()
+; CHECK-NEXT:    [[TMP17:%.*]] = mul <vscale x 4 x i64> [[TMP16]], splat (i64 1)
+; CHECK-NEXT:    [[INDUCTION:%.*]] = add <vscale x 4 x i64> splat (i64 3), [[TMP17]]
+; CHECK-NEXT:    [[TMP9:%.*]] = mul i64 1, [[TMP5]]
+; CHECK-NEXT:    [[DOTSPLATINSERT:%.*]] = insertelement <vscale x 4 x i64> poison, i64 [[TMP9]], i64 0
+; CHECK-NEXT:    [[DOTSPLAT:%.*]] = shufflevector <vscale x 4 x i64> [[DOTSPLATINSERT]], <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer
 ; CHECK-NEXT:    br label [[VECTOR_BODY:%.*]]
 ; CHECK:       vector.body:
-; CHECK-NEXT:    [[INDEX1:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT2:%.*]], [[VECTOR_BODY]] ]
-; CHECK-NEXT:    [[VEC_IND:%.*]] = phi <2 x i64> [ <i64 3, i64 4>, [[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], [[VECTOR_BODY]] ]
-; CHECK-NEXT:    [[OFFSET_IDX:%.*]] = add i64 3, [[INDEX1]]
-; CHECK-NEXT:    [[TMP0:%.*]] = add i64 [[OFFSET_IDX]], 0
+; CHECK-NEXT:    [[INDEX2:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT2:%.*]], [[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[VEC_IND:%.*]] = phi <vscale x 4 x i64> [ [[INDUCTION]], [[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], [[VECTOR_BODY]] ]
+; CHECK-NEXT:    [[OFFSET_IDX1:%.*]] = add i64 3, [[INDEX2]]
+; CHECK-NEXT:    [[TMP0:%.*]] = add i64 [[OFFSET_IDX1]], 0
 ; CHECK-NEXT:    [[TMP1:%.*]] = getelementptr inbounds i32, ptr [[P1]], i64 [[TMP0]]
 ; CHECK-NEXT:    [[TMP2:%.*]] = getelementptr inbounds i32, ptr [[TMP1]], i32 0
-; CHECK-NEXT:    [[WIDE_LOAD:%.*]] = load <2 x i32>, ptr [[TMP2]], align 1
-; CHECK-NEXT:    [[TMP3:%.*]] = udiv <2 x i32> [[WIDE_LOAD]], splat (i32 20000)
-; CHECK-NEXT:    [[TMP4:%.*]] = icmp eq <2 x i32> [[TMP3]], splat (i32 1)
-; CHECK-NEXT:    [[INDEX_NEXT2]] = add nuw i64 [[INDEX1]], 2
-; CHECK-NEXT:    [[TMP5:%.*]] = xor <2 x i1> [[TMP4]], splat (i1 true)
-; CHECK-NEXT:    [[TMP6:%.*]] = call i1 @llvm.vector.reduce.or.v2i1(<2 x i1> [[TMP5]])
-; CHECK-NEXT:    [[TMP7:%.*]] = icmp eq i64 [[INDEX_NEXT2]], 64
-; CHECK-NEXT:    [[VEC_IND_NEXT]] = add <2 x i64> [[VEC_IND]], splat (i64 2)
+; CHECK-NEXT:    [[WIDE_LOAD:%.*]] = load <vscale x 4 x i32>, ptr [[TMP2]], align 1
+; CHECK-NEXT:    [[TMP13:%.*]] = udiv <vscale x 4 x i32> [[WIDE_LOAD]], splat (i32 20000)
+; CHECK-NEXT:    [[TMP14:%.*]] = icmp eq <vscale x 4 x i32> [[TMP13]], splat (i32 1)
+; CHECK-NEXT:    [[INDEX_NEXT2]] = add nuw i64 [[INDEX2]], [[TMP5]]
+; CHECK-NEXT:    [[TMP15:%.*]] = xor <vscale x 4 x i1> [[TMP14]], splat (i1 true)
+; CHECK-NEXT:    [[TMP6:%.*]] = call i1 @llvm.vector.reduce.or.nxv4i1(<vscale x 4 x i1> [[TMP15]])
+; CHECK-NEXT:    [[TMP7:%.*]] = icmp eq i64 [[INDEX_NEXT2]], [[INDEX1]]
+; CHECK-NEXT:    [[VEC_IND_NEXT]] = add <vscale x 4 x i64> [[VEC_IND]], [[DOTSPLAT]]
 ; CHECK-NEXT:    [[TMP8:%.*]] = or i1 [[TMP6]], [[TMP7]]
 ; CHECK-NEXT:    br i1 [[TMP8]], label [[MIDDLE_SPLIT:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP8:![0-9]+]]
 ; CHECK:       middle.split:
 ; CHECK-NEXT:    br i1 [[TMP6]], label [[VECTOR_EARLY_EXIT:%.*]], label [[MIDDLE_BLOCK:%.*]]
 ; CHECK:       vector.early.exit:
-; CHECK-NEXT:    [[FIRST_ACTIVE_LANE:%.*]] = call i64 @llvm.experimental.cttz.elts.i64.v2i1(<2 x i1> [[TMP5]], i1 true)
-; CHECK-NEXT:    [[EARLY_EXIT_VALUE:%.*]] = extractelement <2 x i64> [[VEC_IND]], i64 [[FIRST_ACTIVE_LANE]]
+; CHECK-NEXT:    [[FIRST_ACTIVE_LANE:%.*]] = call i64 @llvm.experimental.cttz.elts.i64.nxv4i1(<vscale x 4 x i1> [[TMP15]], i1 true)
+; CHECK-NEXT:    [[EARLY_EXIT_VALUE:%.*]] = extractelement <vscale x 4 x i64> [[VEC_IND]], i64 [[FIRST_ACTIVE_LANE]]
 ; CHECK-NEXT:    br label [[LOOP_END:%.*]]
 ; CHECK:       middle.block:
-; CHECK-NEXT:    br i1 true, label [[LOOP_END]], label [[SCALAR_PH]]
+; CHECK-NEXT:    [[CMP_N:%.*]] = icmp eq i64 64, [[INDEX1]]
+; CHECK-NEXT:    br i1 [[CMP_N]], label [[LOOP_END]], label [[SCALAR_PH]]
 ; CHECK:       scalar.ph:
-; CHECK-NEXT:    [[BC_RESUME_VAL:%.*]] = phi i64 [ 67, [[MIDDLE_BLOCK]] ], [ 3, [[ENTRY:%.*]] ]
+; CHECK-NEXT:    [[BC_RESUME_VAL:%.*]] = phi i64 [ [[OFFSET_IDX]], [[MIDDLE_BLOCK]] ], [ 3, [[ENTRY:%.*]] ]
 ; CHECK-NEXT:    br label [[LOOP:%.*]]
 ; CHECK:       loop:
 ; CHECK-NEXT:    [[INDEX:%.*]] = phi i64 [ [[INDEX_NEXT:%.*]], [[LOOP_INC:%.*]] ], [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ]

switch (getOpcode()) {
case VPInstruction::AnyOf: {
auto *VecI1Ty = VectorType::get(Type::getInt1Ty(Ctx.LLVMCtx), VF);
return Ctx.TTI.getArithmeticReductionCost(Instruction::Or, VecI1Ty,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Dave,
Is there a reason behind using the reduction cost specifically?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah so in VPInstruction::generate the AnyOf instruction is implemented as follows:

  case VPInstruction::AnyOf: {
    Value *A = State.get(getOperand(0));
    return Builder.CreateOrReduce(A);
  }

i.e. it's creating an or reduction across the input vector so I've just created a cost that matches the IR generated.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aha, I see. Thanks for the clarification.

@@ -7539,7 +7539,7 @@ VectorizationFactor LoopVectorizationPlanner::computeBestVF() {
CM.CostKind);
precomputeCosts(BestPlan, BestFactor.Width, CostCtx);
assert((BestFactor.Width == LegacyVF.Width ||
Legal->hasUncountableEarlyExit() ||
BestPlan.hasRegionWithEarlyExit() ||
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update. I've been thinking about the naming a bit, and the region/plan doesn't technically have any early exits any more , as we converted those, but the input had one. I don't really have any good suggestions for the naming. Any ideas for a better terminology to use here?

Alternatively we could just inline the check here, with an explanation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I struggled with naming too. There is only one exiting block that just happens to have the conditions merged and there is only a single successor - the middle split block. What most accurately describes the CFG is something like:

doesRegionTerminatorHaveMultipleConditionsForExiting

or

isMiddleBlockIndirectSuccessorOfVectorRegion

but these seem a bit wordy! I'm happy to inline the check here for now, however I imagine at some point it will be useful to have some way of querying this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would probably be best to inline things for now, and reconsider after more uses arise. Maybe things will get clearer then

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, done!

Copy link
Contributor

@fhahn fhahn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with suggestion inline, thanks

This patch adds an initial implementation of
VPInstruction::computeCost with support for only one
instruction so far - VPInstruction::AnyOf. This is only
used when vectorising loops with uncountable early exits.
@david-arm
Copy link
Contributor Author

I had to rebase + force push just to work around the ridiculous "Processing Updates" hang in github since this morning.

@david-arm
Copy link
Contributor Author

I still miss Phabricator. :(

@david-arm david-arm merged commit f07cd36 into llvm:main Feb 5, 2025
8 checks passed
david-arm added a commit to david-arm/llvm-project that referenced this pull request Feb 7, 2025
Following on from llvm#125058, this patch takes into account the
work done in the vector early exit block when assessing the
profitability of vectorising the loop. I have renamed
areRuntimeChecksProfitable to isOutsideLoopWorkProfitable and
we now pass in the early exit costs. As part of this, I have
added the ExtractFirstActive opcode to VPInstruction::computeCost.

It's worth pointing out that when we assess profitability of the
loop we calculate a minimum trip count and compare that against
the *maximum* trip count. However, since the loop has an early
exit the runtime trip count can still end up being less than the
minimum. Alternatively, we may never take the early exit at all
at runtime and so we have the opposite problem of over-estimating
the cost of the loop. The loop vectoriser cannot simultaneously
take two contradictory positions and so I feel the only sensible
thing to do is be conservative and assume the loop will be more
expensive than loops without early exits.
Icohedron pushed a commit to Icohedron/llvm-project that referenced this pull request Feb 11, 2025
…25058)

This patch adds an initial implementation of
VPInstruction::computeCost with support for only one
instruction so far - VPInstruction::AnyOf. This is only
used when vectorising loops with uncountable early exits.
david-arm added a commit to david-arm/llvm-project that referenced this pull request Feb 27, 2025
Following on from llvm#125058, this patch takes into account the
work done in the vector early exit block when assessing the
profitability of vectorising the loop. I have renamed
areRuntimeChecksProfitable to isOutsideLoopWorkProfitable and
we now pass in the early exit costs. As part of this, I have
added the ExtractFirstActive opcode to VPInstruction::computeCost.

It's worth pointing out that when we assess profitability of the
loop we calculate a minimum trip count and compare that against
the *maximum* trip count. However, since the loop has an early
exit the runtime trip count can still end up being less than the
minimum. Alternatively, we may never take the early exit at all
at runtime and so we have the opposite problem of over-estimating
the cost of the loop. The loop vectoriser cannot simultaneously
take two contradictory positions and so I feel the only sensible
thing to do is be conservative and assume the loop will be more
expensive than loops without early exits.

We may find in future that we need to adjust the cost according to
the probability of taking the early exit. This will become even
more important once we support multiple early exits. However, we
have to start somewhere and we can always revisit this later.
@david-arm david-arm deleted the ee_costs1 branch February 27, 2025 13:02
david-arm added a commit to david-arm/llvm-project that referenced this pull request Mar 5, 2025
Following on from llvm#125058, this patch takes into account the
work done in the vector early exit block when assessing the
profitability of vectorising the loop. I have renamed
areRuntimeChecksProfitable to isOutsideLoopWorkProfitable and
we now pass in the early exit costs. As part of this, I have
added the ExtractFirstActive opcode to VPInstruction::computeCost.

It's worth pointing out that when we assess profitability of the
loop we calculate a minimum trip count and compare that against
the *maximum* trip count. However, since the loop has an early
exit the runtime trip count can still end up being less than the
minimum. Alternatively, we may never take the early exit at all
at runtime and so we have the opposite problem of over-estimating
the cost of the loop. The loop vectoriser cannot simultaneously
take two contradictory positions and so I feel the only sensible
thing to do is be conservative and assume the loop will be more
expensive than loops without early exits.

We may find in future that we need to adjust the cost according to
the probability of taking the early exit. This will become even
more important once we support multiple early exits. However, we
have to start somewhere and we can always revisit this later.
david-arm added a commit to david-arm/llvm-project that referenced this pull request Mar 11, 2025
Following on from llvm#125058, this patch takes into account the
work done in the vector early exit block when assessing the
profitability of vectorising the loop. I have renamed
areRuntimeChecksProfitable to isOutsideLoopWorkProfitable and
we now pass in the early exit costs. As part of this, I have
added the ExtractFirstActive opcode to VPInstruction::computeCost.

It's worth pointing out that when we assess profitability of the
loop we calculate a minimum trip count and compare that against
the *maximum* trip count. However, since the loop has an early
exit the runtime trip count can still end up being less than the
minimum. Alternatively, we may never take the early exit at all
at runtime and so we have the opposite problem of over-estimating
the cost of the loop. The loop vectoriser cannot simultaneously
take two contradictory positions and so I feel the only sensible
thing to do is be conservative and assume the loop will be more
expensive than loops without early exits.

We may find in future that we need to adjust the cost according to
the probability of taking the early exit. This will become even
more important once we support multiple early exits. However, we
have to start somewhere and we can always revisit this later.
david-arm added a commit that referenced this pull request Mar 11, 2025
)

Following on from #125058, this patch takes into account the
work done in the vector early exit block when assessing the
profitability of vectorising the loop. I have renamed
areRuntimeChecksProfitable to isOutsideLoopWorkProfitable and
we now pass in the early exit costs. As part of this, I have
added the ExtractFirstActive opcode to VPInstruction::computeCost.

It's worth pointing out that when we assess profitability of the
loop we calculate a minimum trip count and compare that against
the *maximum* trip count. However, since the loop has an early
exit the runtime trip count can still end up being less than the
minimum. Alternatively, we may never take the early exit at all
at runtime and so we have the opposite problem of over-estimating
the cost of the loop. The loop vectoriser cannot simultaneously
take two contradictory positions and so I feel the only sensible
thing to do is be conservative and assume the loop will be more
expensive than loops without early exits.

We may find in future that we need to adjust the cost according to
the probability of taking the early exit. This will become even
more important once we support multiple early exits. However, we
have to start somewhere and we can always revisit this later.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants