Skip to content

[LoopVectorizer][AArch64] Move getMinTripCountTailFoldingThreshold later. #132170

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Mar 26, 2025
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
66 changes: 40 additions & 26 deletions llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -4025,11 +4025,8 @@ LoopVectorizationCostModel::computeMaxVF(ElementCount UserVF, unsigned UserIC) {
MaxPowerOf2RuntimeVF = std::nullopt; // Stick with tail-folding for now.
}

if (MaxPowerOf2RuntimeVF && *MaxPowerOf2RuntimeVF > 0) {
assert((UserVF.isNonZero() || isPowerOf2_32(*MaxPowerOf2RuntimeVF)) &&
"MaxFixedVF must be a power of 2");
unsigned MaxVFtimesIC =
UserIC ? *MaxPowerOf2RuntimeVF * UserIC : *MaxPowerOf2RuntimeVF;
auto NoScalarEpilogueNeeded = [this, &UserIC](unsigned MaxVF) {
unsigned MaxVFtimesIC = UserIC ? MaxVF * UserIC : MaxVF;
ScalarEvolution *SE = PSE.getSE();
// Currently only loops with countable exits are vectorized, but calling
// getSymbolicMaxBackedgeTakenCount allows enablement work for loops with
Expand All @@ -4043,13 +4040,41 @@ LoopVectorizationCostModel::computeMaxVF(ElementCount UserVF, unsigned UserIC) {
const SCEV *Rem = SE->getURemExpr(
SE->applyLoopGuards(ExitCount, TheLoop),
SE->getConstant(BackedgeTakenCount->getType(), MaxVFtimesIC));
if (Rem->isZero()) {
return Rem->isZero();
};

if (MaxPowerOf2RuntimeVF > 0) {
assert((UserVF.isNonZero() || isPowerOf2_32(*MaxPowerOf2RuntimeVF)) &&
"MaxFixedVF must be a power of 2");
if (NoScalarEpilogueNeeded(*MaxPowerOf2RuntimeVF)) {
// Accept MaxFixedVF if we do not have a tail.
LLVM_DEBUG(dbgs() << "LV: No tail will remain for any chosen VF.\n");
return MaxFactors;
}
}

auto ExpectedTC = getSmallBestKnownTC(PSE, TheLoop);
if (ExpectedTC && ExpectedTC <= TTI.getMinTripCountTailFoldingThreshold()) {
if (MaxPowerOf2RuntimeVF > 0) {
// If we have a low-trip-count, and the fixed-width VF is known to divide
// the trip count but the scalable factor does not, use the fixed-width
// factor in preference to allow the generation of a non-predicated loop.
if (ScalarEpilogueStatus == CM_ScalarEpilogueNotAllowedLowTripLoop &&
NoScalarEpilogueNeeded(MaxFactors.FixedVF.getFixedValue())) {
LLVM_DEBUG(dbgs() << "LV: Picking a fixed-width so that no tail will "
"remain for any chosen VF.\n");
MaxFactors.ScalableVF = ElementCount::getScalable(0);
return MaxFactors;
}
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you re-add the debug output that we had before, i.e. something like:

        LLVM_DEBUG(dbgs() << "The target considers the trip count too "
                             "small to consider vectorizing with tail-folding.\n");

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reportVectorizationFailure will print the message it is reporting to dbgs() too. It didn't seem necessary to print the same info twice.

It will print

LV: Not vectorizing: The trip count is below the minial threshold value.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK yeah fair enough, but I think we're still missing the original Hints.emitRemarkWithHints(); call? I think LoopVectorizationCostModel does have access to a Hints pointer.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like it gets emitted in one of the parent functions. None of the other returns from this function use
emitRemarkWithHints, which was the reason I didn't more it over.

reportVectorizationFailure will emit the -Rpass=analysis output. The -Rpass=missed remark will be emitted by the VF now being scalar

remark: the cost-model indicates that vectorization is not beneficial [-Rpass-missed=loop-vectorize]

reportVectorizationFailure(
"The trip count is below the minial threshold value.",
"loop trip count is too low, avoiding vectorization", "LowTripCount",
ORE, TheLoop);
return FixedScalableVFPair::getNone();
}

// If we don't know the precise trip count, or if the trip count that we
// found modulo the vectorization factor is not zero, try to fold the tail
// by masking.
Expand Down Expand Up @@ -10597,26 +10622,15 @@ bool LoopVectorizePass::processLoop(Loop *L) {
if (Hints.getForce() == LoopVectorizeHints::FK_Enabled)
LLVM_DEBUG(dbgs() << " But vectorizing was explicitly forced.\n");
else {
if (*ExpectedTC > TTI->getMinTripCountTailFoldingThreshold()) {
LLVM_DEBUG(dbgs() << "\n");
// Predicate tail-folded loops are efficient even when the loop
// iteration count is low. However, setting the epilogue policy to
// `CM_ScalarEpilogueNotAllowedLowTripLoop` prevents vectorizing loops
// with runtime checks. It's more effective to let
// `isOutsideLoopWorkProfitable` determine if vectorization is
// beneficial for the loop.
if (SEL != CM_ScalarEpilogueNotNeededUsePredicate)
SEL = CM_ScalarEpilogueNotAllowedLowTripLoop;
} else {
LLVM_DEBUG(dbgs() << " But the target considers the trip count too "
"small to consider vectorizing.\n");
reportVectorizationFailure(
"The trip count is below the minial threshold value.",
"loop trip count is too low, avoiding vectorization",
"LowTripCount", ORE, L);
Hints.emitRemarkWithHints();
return false;
}
LLVM_DEBUG(dbgs() << "\n");
// Predicate tail-folded loops are efficient even when the loop
// iteration count is low. However, setting the epilogue policy to
// `CM_ScalarEpilogueNotAllowedLowTripLoop` prevents vectorizing loops
// with runtime checks. It's more effective to let
// `isOutsideLoopWorkProfitable` determine if vectorization is
// beneficial for the loop.
if (SEL != CM_ScalarEpilogueNotNeededUsePredicate)
SEL = CM_ScalarEpilogueNotAllowedLowTripLoop;
}
}

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -18,7 +18,7 @@ target triple = "aarch64-unknown-linux-gnu"

; DEBUG-LABEL: LV: Checking a loop in 'trip_count_too_small'
; DEBUG: LV: Found a loop with a very small trip count. This loop is worth vectorizing only if no scalar iteration overheads are incurred.
; DEBUG: LV: Not vectorizing: The trip count is below the minial threshold value..
; DEBUG: LV: Not vectorizing: Runtime SCEV check is required with -Os/-Oz.

; DEBUG-LABEL: LV: Checking a loop in 'too_many_runtime_checks'
; DEBUG: LV: Found trip count: 0
Expand Down Expand Up @@ -490,9 +490,103 @@ while.end:
ret void
}

; This has a trip-count of 4, and should vectorize with vf==4.
define i32 @tc4(ptr noundef readonly captures(none) %tmp) vscale_range(1,16) {
; CHECK-LABEL: define i32 @tc4(
; CHECK-SAME: ptr noundef readonly captures(none) [[TMP:%.*]]) #[[ATTR1]] {
; CHECK-NEXT: [[ENTRY:.*]]:
; CHECK-NEXT: br i1 false, label %[[SCALAR_PH:.*]], label %[[VECTOR_PH:.*]]
; CHECK: [[VECTOR_PH]]:
; CHECK-NEXT: br label %[[VECTOR_BODY:.*]]
; CHECK: [[VECTOR_BODY]]:
; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
; CHECK-NEXT: [[VEC_PHI:%.*]] = phi <4 x i32> [ zeroinitializer, %[[VECTOR_PH]] ], [ [[TMP3:%.*]], %[[VECTOR_BODY]] ]
; CHECK-NEXT: [[ARRAYIDX1:%.*]] = getelementptr inbounds nuw [4 x i32], ptr [[TMP]], i64 0, i64 [[INDEX]]
; CHECK-NEXT: [[TMP2:%.*]] = getelementptr inbounds nuw i32, ptr [[ARRAYIDX1]], i32 0
; CHECK-NEXT: [[WIDE_LOAD:%.*]] = load <4 x i32>, ptr [[TMP2]], align 4
; CHECK-NEXT: [[TMP3]] = add <4 x i32> [[VEC_PHI]], [[WIDE_LOAD]]
; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 4
; CHECK-NEXT: br i1 true, label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP7:![0-9]+]]
; CHECK: [[MIDDLE_BLOCK]]:
; CHECK-NEXT: [[TMP4:%.*]] = call i32 @llvm.vector.reduce.add.v4i32(<4 x i32> [[TMP3]])
; CHECK-NEXT: br i1 true, label %[[FOR_COND_CLEANUP:.*]], label %[[SCALAR_PH]]
; CHECK: [[SCALAR_PH]]:
; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ 4, %[[MIDDLE_BLOCK]] ], [ 0, %[[ENTRY]] ]
; CHECK-NEXT: [[BC_MERGE_RDX:%.*]] = phi i32 [ [[TMP4]], %[[MIDDLE_BLOCK]] ], [ 0, %[[ENTRY]] ]
; CHECK-NEXT: br label %[[FOR_BODY:.*]]
; CHECK: [[FOR_COND_CLEANUP]]:
; CHECK-NEXT: [[ADD_LCSSA:%.*]] = phi i32 [ [[ADD:%.*]], %[[FOR_BODY]] ], [ [[TMP4]], %[[MIDDLE_BLOCK]] ]
; CHECK-NEXT: ret i32 [[ADD_LCSSA]]
; CHECK: [[FOR_BODY]]:
; CHECK-NEXT: [[INDVARS_IV:%.*]] = phi i64 [ [[BC_RESUME_VAL]], %[[SCALAR_PH]] ], [ [[INDVARS_IV_NEXT:%.*]], %[[FOR_BODY]] ]
; CHECK-NEXT: [[SUM_0179:%.*]] = phi i32 [ [[BC_MERGE_RDX]], %[[SCALAR_PH]] ], [ [[ADD]], %[[FOR_BODY]] ]
; CHECK-NEXT: [[ARRAYIDX2:%.*]] = getelementptr inbounds nuw [4 x i32], ptr [[TMP]], i64 0, i64 [[INDVARS_IV]]
; CHECK-NEXT: [[TMP5:%.*]] = load i32, ptr [[ARRAYIDX2]], align 4
; CHECK-NEXT: [[ADD]] = add i32 [[SUM_0179]], [[TMP5]]
; CHECK-NEXT: [[INDVARS_IV_NEXT]] = add nuw nsw i64 [[INDVARS_IV]], 1
; CHECK-NEXT: [[EXITCOND_NOT:%.*]] = icmp eq i64 [[INDVARS_IV_NEXT]], 4
; CHECK-NEXT: br i1 [[EXITCOND_NOT]], label %[[FOR_COND_CLEANUP]], label %[[FOR_BODY]], !llvm.loop [[LOOP8:![0-9]+]]
;
entry:
br label %for.body

for.cond.cleanup: ; preds = %for.body
%add.lcssa = phi i32 [ %add, %for.body ]
ret i32 %add.lcssa

for.body: ; preds = %entry, %for.body
%indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
%sum.0179 = phi i32 [ 0, %entry ], [ %add, %for.body ]
%arrayidx1 = getelementptr inbounds nuw [4 x i32], ptr %tmp, i64 0, i64 %indvars.iv
%0 = load i32, ptr %arrayidx1, align 4
%add = add i32 %sum.0179, %0
%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
%exitcond.not = icmp eq i64 %indvars.iv.next, 4
br i1 %exitcond.not, label %for.cond.cleanup, label %for.body
}

; This has a trip-count of 4 from a profile.
define i32 @tc4_from_profile(ptr noundef readonly captures(none) %tmp, i64 %N) vscale_range(1,16) {
; CHECK-LABEL: define i32 @tc4_from_profile(
; CHECK-SAME: ptr noundef readonly captures(none) [[TMP:%.*]], i64 [[N:%.*]]) #[[ATTR1]] {
; CHECK-NEXT: [[ENTRY:.*]]:
; CHECK-NEXT: br label %[[FOR_BODY:.*]]
; CHECK: [[FOR_COND_CLEANUP:.*]]:
; CHECK-NEXT: [[TMP4:%.*]] = phi i32 [ [[ADD:%.*]], %[[FOR_BODY]] ]
; CHECK-NEXT: ret i32 [[TMP4]]
; CHECK: [[FOR_BODY]]:
; CHECK-NEXT: [[INDVARS_IV:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[INDVARS_IV_NEXT:%.*]], %[[FOR_BODY]] ]
; CHECK-NEXT: [[SUM_0179:%.*]] = phi i32 [ 0, %[[ENTRY]] ], [ [[ADD]], %[[FOR_BODY]] ]
; CHECK-NEXT: [[ARRAYIDX1:%.*]] = getelementptr inbounds nuw [4 x i32], ptr [[TMP]], i64 0, i64 [[INDVARS_IV]]
; CHECK-NEXT: [[TMP0:%.*]] = load i32, ptr [[ARRAYIDX1]], align 4
; CHECK-NEXT: [[ADD]] = add i32 [[SUM_0179]], [[TMP0]]
; CHECK-NEXT: [[INDVARS_IV_NEXT]] = add nuw nsw i64 [[INDVARS_IV]], 1
; CHECK-NEXT: [[EXITCOND_NOT:%.*]] = icmp eq i64 [[INDVARS_IV_NEXT]], [[N]]
; CHECK-NEXT: br i1 [[EXITCOND_NOT]], label %[[FOR_COND_CLEANUP]], label %[[FOR_BODY]], !prof [[PROF9:![0-9]+]]
;
entry:
br label %for.body

for.cond.cleanup: ; preds = %for.body
%add.lcssa = phi i32 [ %add, %for.body ]
ret i32 %add.lcssa

for.body: ; preds = %entry, %for.body
%indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
%sum.0179 = phi i32 [ 0, %entry ], [ %add, %for.body ]
%arrayidx1 = getelementptr inbounds nuw [4 x i32], ptr %tmp, i64 0, i64 %indvars.iv
%0 = load i32, ptr %arrayidx1, align 4
%add = add i32 %sum.0179, %0
%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
%exitcond.not = icmp eq i64 %indvars.iv.next, %N
br i1 %exitcond.not, label %for.cond.cleanup, label %for.body, !prof !2
}


!0 = distinct !{!0, !1}
!1 = !{!"llvm.loop.vectorize.predicate.enable", i1 true}
!2 = !{!"branch_weights", i32 10, i32 30}

;.
; CHECK-VS1: [[LOOP0]] = distinct !{[[LOOP0]], [[META1:![0-9]+]], [[META2:![0-9]+]]}
; CHECK-VS1: [[META1]] = !{!"llvm.loop.isvectorized", i32 1}
Expand All @@ -501,6 +595,9 @@ while.end:
; CHECK-VS1: [[LOOP4]] = distinct !{[[LOOP4]], [[META1]]}
; CHECK-VS1: [[LOOP5]] = distinct !{[[LOOP5]], [[META1]], [[META2]]}
; CHECK-VS1: [[LOOP6]] = distinct !{[[LOOP6]], [[META1]]}
; CHECK-VS1: [[LOOP7]] = distinct !{[[LOOP7]], [[META1]], [[META2]]}
; CHECK-VS1: [[LOOP8]] = distinct !{[[LOOP8]], [[META2]], [[META1]]}
; CHECK-VS1: [[PROF9]] = !{!"branch_weights", i32 10, i32 30}
;.
; CHECK-VS2: [[LOOP0]] = distinct !{[[LOOP0]], [[META1:![0-9]+]], [[META2:![0-9]+]]}
; CHECK-VS2: [[META1]] = !{!"llvm.loop.isvectorized", i32 1}
Expand All @@ -509,4 +606,7 @@ while.end:
; CHECK-VS2: [[LOOP4]] = distinct !{[[LOOP4]], [[META1]]}
; CHECK-VS2: [[LOOP5]] = distinct !{[[LOOP5]], [[META1]], [[META2]]}
; CHECK-VS2: [[LOOP6]] = distinct !{[[LOOP6]], [[META1]]}
; CHECK-VS2: [[LOOP7]] = distinct !{[[LOOP7]], [[META1]], [[META2]]}
; CHECK-VS2: [[LOOP8]] = distinct !{[[LOOP8]], [[META2]], [[META1]]}
; CHECK-VS2: [[PROF9]] = !{!"branch_weights", i32 10, i32 30}
;.