Commit 3ffafb7

[LV] Change loops' interleave count computation
A set of microbenchmarks in llvm-test-suite (llvm/llvm-test-suite#56), when tested on an AArch64 platform, demonstrates that loop interleaving is beneficial in two cases:

1) when TC > 2 * VW * IC, such that the interleaved vectorized portion of the loop runs at least twice
2) when TC is an exact multiple of VW * IC, such that there is no epilogue loop to run

where TC = trip count, VW = vectorization width, IC = interleaving count.

We change the interleave count computation based on this information, but we leave it the same when the flag InterleaveSmallLoopScalarReduction is set to true, since it handles a special case (https://reviews.llvm.org/D81416).
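The two conditions above can be written as a small predicate. This is only an illustrative sketch of the commit message's wording: the function name `interleavingLooksBeneficial` is hypothetical and does not exist in LLVM, and the patch itself applies the idea by capping the maximum interleave count rather than by testing each candidate IC.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper, not part of LLVM. Returns true when a loop with trip
// count TC, vectorization width VW and interleave count IC satisfies either
// of the two conditions listed in the commit message.
bool interleavingLooksBeneficial(std::uint64_t TC, std::uint64_t VW,
                                 std::uint64_t IC) {
  assert(VW > 0 && IC > 0 && "width and interleave count must be non-zero");
  const std::uint64_t Step = VW * IC;          // elements per vector iteration
  const bool RunsAtLeastTwice = TC > 2 * Step; // condition 1
  const bool NoEpilogue = TC % Step == 0;      // condition 2
  return RunsAtLeastTwice || NoEpilogue;
}
```

For example, with TC = 33 and VW = 16, an IC of 2 fails both conditions (33 is not greater than 64 and leaves a remainder of 1), whereas TC = 32 with the same VW and IC passes the second condition.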
1 parent 6a4489a commit 3ffafb7

6 files changed, 415 additions and 505 deletions

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

Lines changed: 15 additions & 6 deletions
@@ -5727,8 +5727,12 @@ LoopVectorizationCostModel::selectInterleaveCount(ElementCount VF,
   }
 
   // If trip count is known or estimated compile time constant, limit the
-  // interleave count to be less than the trip count divided by VF, provided it
-  // is at least 1.
+  // interleave count to be less than the trip count divided by VF * 2,
+  // provided VF is at least 1 and the trip count is not an exact multiple of
+  // VF, such that the vector loop runs at least twice to make interleaving seem
+  // profitable when there is an epilogue loop present. When
+  // InterleaveSmallLoopScalarReduction is true or trip count is an exact
+  // multiple of VF, we allow interleaving even when the vector loop runs once.
   //
   // For scalable vectors we can't know if interleaving is beneficial. It may
   // not be beneficial for small loops if none of the lanes in the second vector
@@ -5737,10 +5741,15 @@ LoopVectorizationCostModel::selectInterleaveCount(ElementCount VF,
   // the InterleaveCount as if vscale is '1', although if some information about
   // the vector is known (e.g. min vector size), we can make a better decision.
   if (BestKnownTC) {
-    MaxInterleaveCount =
-        std::min(*BestKnownTC / VF.getKnownMinValue(), MaxInterleaveCount);
-    // Make sure MaxInterleaveCount is greater than 0.
-    MaxInterleaveCount = std::max(1u, MaxInterleaveCount);
+    if (InterleaveSmallLoopScalarReduction ||
+        (*BestKnownTC % VF.getKnownMinValue() == 0))
+      MaxInterleaveCount =
+          std::min(*BestKnownTC / VF.getKnownMinValue(), MaxInterleaveCount);
+    else
+      MaxInterleaveCount = std::min(*BestKnownTC / (VF.getKnownMinValue() * 2),
+                                    MaxInterleaveCount);
+    // Make sure MaxInterleaveCount is greater than 0 & a power of 2.
+    MaxInterleaveCount = llvm::bit_floor(std::max(1u, MaxInterleaveCount));
   }
 
   assert(MaxInterleaveCount > 0 &&
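
To see the effect of this hunk in isolation, the old and new clamping logic can be modelled on plain integers. The following is a minimal sketch, not the LLVM implementation: `clampOld`, `clampNew` and `bitFloor` are hypothetical names, `bitFloor` is a hand-written stand-in for `llvm::bit_floor`, and `VF` stands for `VF.getKnownMinValue()`.

```cpp
#include <algorithm>
#include <cassert>

// Largest power of two that is <= X, for X > 0 (stand-in for llvm::bit_floor).
unsigned bitFloor(unsigned X) {
  assert(X > 0 && "only called on positive values in this sketch");
  unsigned P = 1;
  while (P <= X / 2)
    P *= 2;
  return P;
}

// Behaviour before the patch: cap at TC / VF, but never below 1.
unsigned clampOld(unsigned TC, unsigned VF, unsigned MaxIC) {
  return std::max(1u, std::min(TC / VF, MaxIC));
}

// Behaviour after the patch: cap at TC / (VF * 2) unless the flag is set or
// TC is an exact multiple of VF, then round down to a power of two.
unsigned clampNew(unsigned TC, unsigned VF, unsigned MaxIC,
                  bool SmallLoopScalarReduction) {
  const bool ExactMultiple = TC % VF == 0;
  const unsigned Divisor =
      (SmallLoopScalarReduction || ExactMultiple) ? VF : VF * 2;
  return bitFloor(std::max(1u, std::min(TC / Divisor, MaxIC)));
}
```

Assuming a target maximum of 8, a trip count of 33 with VF 16 gives `clampOld(33, 16, 8) == 2` but `clampNew(33, 16, 8, false) == 1`, while a trip count of 32 still allows an interleave count of 2. The power-of-two rounding matters for cases such as TC 50 with VF 4, where 50 / 8 = 6 is rounded down to 4.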

llvm/test/Transforms/LoopVectorize/AArch64/interleave_count.ll

Lines changed: 4 additions & 4 deletions
@@ -29,9 +29,9 @@ for.end:
   ret void
 }
 
-; TODO: For this loop with known TC of 33, when the auto-vectorizer chooses VF 16, it should choose
+; For this loop with known TC of 33, when the auto-vectorizer chooses VF 16, it should choose
 ; IC 1 since there may be a remainder loop that needs to run after the vector loop.
-; CHECK: remark: <unknown>:0:0: vectorized loop (vectorization width: 16, interleaved count: 2)
+; CHECK: remark: <unknown>:0:0: vectorized loop (vectorization width: 16, interleaved count: 1)
 define void @loop_with_tc_33(ptr noalias %p, ptr noalias %q) {
 entry:
   br label %for.body
@@ -78,10 +78,10 @@ for.end:
   ret void
 }
 
-; TODO: For a loop with unknown trip count but a profile showing an approx TC estimate of 33,
+; For a loop with unknown trip count but a profile showing an approx TC estimate of 33,
 ; when the auto-vectorizer chooses VF 16, it should choose IC 1 since chances are high that the
 ; remainder loop will need to run
-; CHECK: remark: <unknown>:0:0: vectorized loop (vectorization width: 16, interleaved count: 2)
+; CHECK: remark: <unknown>:0:0: vectorized loop (vectorization width: 16, interleaved count: 1)
 define void @loop_with_profile_tc_33(ptr noalias %p, ptr noalias %q, i64 %n) {
 entry:
   br label %for.body
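
The reasoning behind the updated CHECK lines can be illustrated by splitting a trip count into vector-loop iterations and leftover scalar iterations. This is a simplified model under stated assumptions: `LoopSplit` and `splitTripCount` are hypothetical names, and the model ignores tail folding and any other case where the vectorizer handles the remainder differently.

```cpp
#include <cassert>

// Hypothetical result type: how a trip count divides between the vector loop
// and the scalar remainder (epilogue) loop.
struct LoopSplit {
  unsigned VectorIterations;
  unsigned RemainderIterations;
};

// Each vector-loop iteration processes VF * IC elements; whatever is left
// over runs in the remainder loop.
LoopSplit splitTripCount(unsigned TC, unsigned VF, unsigned IC) {
  assert(VF > 0 && IC > 0 && "VF and IC must be non-zero");
  const unsigned Step = VF * IC;
  return {TC / Step, TC % Step};
}
```

Under this model, TC 33 with VF 16 and IC 2 runs the interleaved vector body once and leaves 1 scalar iteration, whereas IC 1 runs the vector body twice with the same 1-iteration remainder. That matches the test comments' expectation of IC 1 when a remainder loop is likely to run.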

llvm/test/Transforms/LoopVectorize/PowerPC/large-loop-rdx.ll

Lines changed: 0 additions & 4 deletions
@@ -8,10 +8,6 @@
 ; CHECK-NEXT: fadd
 ; CHECK-NEXT: fadd
 ; CHECK-NEXT: fadd
-; CHECK-NEXT: fadd
-; CHECK-NEXT: fadd
-; CHECK-NEXT: fadd
-; CHECK-NEXT: fadd
 ; CHECK-NEXT: =
 ; CHECK-NOT: fadd
 ; CHECK-SAME: >
