[LV] Don't require scalar epilogue for unsupported IAG with tail #96544
base: main
Conversation
@llvm/pr-subscribers-llvm-transforms @llvm/pr-subscribers-backend-risc-v

Author: Kolya Panchenko (nikolaypanchenko)

Changes: LV should check that every interleave group that requires a scalar epilogue can actually be widened. Otherwise, when an InterleavedAccessGroup cannot be widened but does have a tail element, the current logic in LV still requires emitting a scalar epilogue, which leads to inefficient vector code.

Patch is 103.39 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/96544.diff

9 Files Affected:
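For background, the situation the patch targets can be illustrated outside LLVM with a hand-vectorized strided loop. This is a plain C++ sketch, not LLVM code; the function name, VF, and stride are illustrative. When an interleave group has gaps, the last full-width load can overrun the buffer, so at least the final iterations must run in a scalar epilogue; if the group will never be widened anyway, requiring that epilogue is pure overhead.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sum a[3*i] for i in [0, n). A vectorized version with VF = 4 would load
// 12 contiguous elements per iteration and shuffle out every third one.
// Near the end, such a wide load would read past the last element actually
// used, so the remaining iterations must run in a scalar epilogue.
long strided_sum(const std::vector<long> &a, std::size_t n) {
  constexpr std::size_t VF = 4, Stride = 3;
  long sum = 0;
  std::size_t i = 0;
  // "Vector" loop: only run while a full wide load (Stride * VF contiguous
  // elements starting at a[Stride * i]) stays inside the buffer.
  while (i + VF <= n && Stride * i + Stride * VF <= a.size()) {
    for (std::size_t j = 0; j < VF; ++j) // stands in for wide load + shuffle
      sum += a[Stride * (i + j)];
    i += VF;
  }
  // Scalar epilogue: the tail iterations whose wide load would overrun.
  for (; i < n; ++i)
    sum += a[Stride * i];
  return sum;
}
```

With n = 12 and a buffer of 3 * 11 + 1 = 34 elements, the check `Stride * i + Stride * VF <= a.size()` stops the vector loop at i = 8 even though four iterations remain, and the epilogue finishes them.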
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index 771fb247f201f..c7a74283b5258 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -1448,29 +1448,58 @@ class LoopVectorizationCostModel {
/// Returns true if \p I is a memory instruction in an interleaved-group
/// of memory accesses that can be vectorized with wide vector loads/stores
/// and shuffles.
- bool interleavedAccessCanBeWidened(Instruction *I, ElementCount VF);
+ bool interleavedAccessCanBeWidened(Instruction *I, ElementCount VF) const;
/// Check if \p Instr belongs to any interleaved access group.
- bool isAccessInterleaved(Instruction *Instr) {
+ bool isAccessInterleaved(Instruction *Instr) const {
return InterleaveInfo.isInterleaved(Instr);
}
/// Get the interleaved access group that \p Instr belongs to.
const InterleaveGroup<Instruction> *
- getInterleavedAccessGroup(Instruction *Instr) {
+ getInterleavedAccessGroup(Instruction *Instr) const {
return InterleaveInfo.getInterleaveGroup(Instr);
}
/// Returns true if we're required to use a scalar epilogue for at least
/// the final iteration of the original loop.
- bool requiresScalarEpilogue(bool IsVectorizing) const {
- if (!isScalarEpilogueAllowed())
+ bool requiresScalarEpilogue(ElementCount VF) const {
+ if (!isScalarEpilogueAllowed()) {
+ LLVM_DEBUG(dbgs() << "LV: Loop with VF = " << VF
+ << " does not require scalar epilogue\n");
return false;
+ }
// If we might exit from anywhere but the latch, must run the exiting
// iteration in scalar form.
- if (TheLoop->getExitingBlock() != TheLoop->getLoopLatch())
+ if (TheLoop->getExitingBlock() != TheLoop->getLoopLatch()) {
+ LLVM_DEBUG(dbgs() << "LV: Loop with VF = " << VF
+                      << " requires scalar epilogue: multiple exits\n");
return true;
- return IsVectorizing && InterleaveInfo.requiresScalarEpilogue();
+ }
+ if (VF.isVector()) {
+ if (InterleaveInfo.requiresScalarEpilogue()) {
+ // Make sure interleaved groups that require scalar epilogue will be
+ // widened.
+ for (auto *G : InterleaveInfo.getInterleaveGroups()) {
+ if (!G->requiresScalarEpilogue())
+ continue;
+
+ Instruction *I = G->getMember(0);
+ InstWidening Decision = getWideningDecision(I, VF);
+ if (Decision == CM_Interleave ||
+ (Decision == CM_Unknown &&
+ interleavedAccessCanBeWidened(G->getMember(0), VF))) {
+ LLVM_DEBUG(dbgs() << "LV: Loop with VF = " << VF
+ << " requires scalar epilogue: vectorizable "
+ "interleaved group\n");
+ return true;
+ }
+ }
+ }
+ }
+ LLVM_DEBUG(dbgs() << "LV: Loop with VF = " << VF
+ << " does not require scalar epilogue\n");
+ return false;
}
/// Returns true if we're required to use a scalar epilogue for at least
@@ -1479,7 +1508,7 @@ class LoopVectorizationCostModel {
/// none.
bool requiresScalarEpilogue(VFRange Range) const {
auto RequiresScalarEpilogue = [this](ElementCount VF) {
- return requiresScalarEpilogue(VF.isVector());
+ return requiresScalarEpilogue(VF);
};
bool IsRequired = all_of(Range, RequiresScalarEpilogue);
assert(
@@ -2776,7 +2805,7 @@ InnerLoopVectorizer::getOrCreateVectorTripCount(BasicBlock *InsertBlock) {
// the step does not evenly divide the trip count, no adjustment is necessary
// since there will already be scalar iterations. Note that the minimum
// iterations check ensures that N >= Step.
- if (Cost->requiresScalarEpilogue(VF.isVector())) {
+ if (Cost->requiresScalarEpilogue(VF)) {
auto *IsZero = Builder.CreateICmpEQ(R, ConstantInt::get(R->getType(), 0));
R = Builder.CreateSelect(IsZero, Step, R);
}
@@ -2829,8 +2858,8 @@ void InnerLoopVectorizer::emitIterationCountCheck(BasicBlock *Bypass) {
// vector trip count is zero. This check also covers the case where adding one
// to the backedge-taken count overflowed leading to an incorrect trip count
// of zero. In this case we will also jump to the scalar loop.
- auto P = Cost->requiresScalarEpilogue(VF.isVector()) ? ICmpInst::ICMP_ULE
- : ICmpInst::ICMP_ULT;
+ auto P = Cost->requiresScalarEpilogue(VF) ? ICmpInst::ICMP_ULE
+ : ICmpInst::ICMP_ULT;
// If tail is to be folded, vector loop takes care of all iterations.
Type *CountTy = Count->getType();
@@ -2879,7 +2908,7 @@ void InnerLoopVectorizer::emitIterationCountCheck(BasicBlock *Bypass) {
// Update dominator for Bypass & LoopExit (if needed).
DT->changeImmediateDominator(Bypass, TCCheckBlock);
- if (!Cost->requiresScalarEpilogue(VF.isVector()))
+ if (!Cost->requiresScalarEpilogue(VF))
// If there is an epilogue which must run, there's no edge from the
// middle block to exit blocks and thus no need to update the immediate
// dominator of the exit blocks.
@@ -2908,7 +2937,7 @@ BasicBlock *InnerLoopVectorizer::emitSCEVChecks(BasicBlock *Bypass) {
// Update dominator only if this is first RT check.
if (LoopBypassBlocks.empty()) {
DT->changeImmediateDominator(Bypass, SCEVCheckBlock);
- if (!Cost->requiresScalarEpilogue(VF.isVector()))
+ if (!Cost->requiresScalarEpilogue(VF))
// If there is an epilogue which must run, there's no edge from the
// middle block to exit blocks and thus no need to update the immediate
// dominator of the exit blocks.
@@ -2961,7 +2990,7 @@ void InnerLoopVectorizer::createVectorLoopSkeleton(StringRef Prefix) {
LoopVectorPreHeader = OrigLoop->getLoopPreheader();
assert(LoopVectorPreHeader && "Invalid loop structure");
LoopExitBlock = OrigLoop->getUniqueExitBlock(); // may be nullptr
- assert((LoopExitBlock || Cost->requiresScalarEpilogue(VF.isVector())) &&
+ assert((LoopExitBlock || Cost->requiresScalarEpilogue(VF)) &&
"multiple exit loop without required epilogue?");
LoopMiddleBlock =
@@ -2976,7 +3005,7 @@ void InnerLoopVectorizer::createVectorLoopSkeleton(StringRef Prefix) {
// unconditional branch from the middle block to the scalar preheader. In that
// case, there's no edge from the middle block to exit blocks and thus no
// need to update the immediate dominator of the exit blocks.
- if (Cost->requiresScalarEpilogue(VF.isVector())) {
+ if (Cost->requiresScalarEpilogue(VF)) {
assert(
LoopMiddleBlock->getSingleSuccessor() == LoopScalarPreHeader &&
" middle block should have the scalar preheader as single successor");
@@ -3109,7 +3138,7 @@ BasicBlock *InnerLoopVectorizer::completeLoopSkeleton() {
// Thus if tail is to be folded, we know we don't need to run the
// remainder and we can use the previous value for the condition (true).
// 3) Otherwise, construct a runtime check.
- if (!Cost->requiresScalarEpilogue(VF.isVector()) &&
+ if (!Cost->requiresScalarEpilogue(VF) &&
!Cost->foldTailByMasking()) {
// Here we use the same DebugLoc as the scalar loop latch terminator instead
// of the corresponding compare because they may have ended up with
@@ -3419,7 +3448,7 @@ void InnerLoopVectorizer::fixVectorizedLoop(VPTransformState &State,
VPRegionBlock *VectorRegion = State.Plan->getVectorLoopRegion();
VPBasicBlock *LatchVPBB = VectorRegion->getExitingBasicBlock();
Loop *VectorLoop = LI->getLoopFor(State.CFG.VPBB2IRBB[LatchVPBB]);
- if (Cost->requiresScalarEpilogue(VF.isVector())) {
+ if (Cost->requiresScalarEpilogue(VF)) {
// No edge from the middle block to the unique exit block has been inserted
// and there is nothing to fix from vector loop; phis should have incoming
// from scalar loop only.
@@ -3936,7 +3965,7 @@ LoopVectorizationCostModel::getDivRemSpeculationCost(Instruction *I,
}
bool LoopVectorizationCostModel::interleavedAccessCanBeWidened(
- Instruction *I, ElementCount VF) {
+ Instruction *I, ElementCount VF) const {
assert(isAccessInterleaved(I) && "Expecting interleaved access.");
assert(getWideningDecision(I, VF) == CM_Unknown &&
"Decision should not be set yet.");
@@ -4670,7 +4699,7 @@ ElementCount LoopVectorizationCostModel::getMaximizedVFForTarget(
// When a scalar epilogue is required, at least one iteration of the scalar
// loop has to execute. Adjust MaxTripCount accordingly to avoid picking a
// max VF that results in a dead vector loop.
- if (MaxTripCount > 0 && requiresScalarEpilogue(true))
+ if (MaxTripCount > 0 && requiresScalarEpilogue(MaxVectorElementCount))
MaxTripCount -= 1;
if (MaxTripCount && MaxTripCount <= WidestRegisterMinEC &&
@@ -5302,7 +5331,7 @@ LoopVectorizationCostModel::selectInterleaveCount(ElementCount VF,
// At least one iteration must be scalar when this constraint holds. So the
// maximum available iterations for interleaving is one less.
unsigned AvailableTC =
- requiresScalarEpilogue(VF.isVector()) ? KnownTC - 1 : KnownTC;
+ requiresScalarEpilogue(VF) ? KnownTC - 1 : KnownTC;
// If trip count is known we select between two prospective ICs, where
// 1) the aggressive IC is capped by the trip count divided by VF
@@ -5331,7 +5360,7 @@ LoopVectorizationCostModel::selectInterleaveCount(ElementCount VF,
} else if (BestKnownTC && *BestKnownTC > 0) {
// At least one iteration must be scalar when this constraint holds. So the
// maximum available iterations for interleaving is one less.
- unsigned AvailableTC = requiresScalarEpilogue(VF.isVector())
+ unsigned AvailableTC = requiresScalarEpilogue(VF)
? (*BestKnownTC) - 1
: *BestKnownTC;
@@ -7638,8 +7667,7 @@ EpilogueVectorizerMainLoop::emitIterationCountCheck(BasicBlock *Bypass,
// Generate code to check if the loop's trip count is less than VF * UF of the
// main vector loop.
- auto P = Cost->requiresScalarEpilogue(ForEpilogue ? EPI.EpilogueVF.isVector()
- : VF.isVector())
+ auto P = Cost->requiresScalarEpilogue(ForEpilogue ? EPI.EpilogueVF : VF)
? ICmpInst::ICMP_ULE
: ICmpInst::ICMP_ULT;
@@ -7661,7 +7689,7 @@ EpilogueVectorizerMainLoop::emitIterationCountCheck(BasicBlock *Bypass,
// Update dominator for Bypass & LoopExit.
DT->changeImmediateDominator(Bypass, TCCheckBlock);
- if (!Cost->requiresScalarEpilogue(EPI.EpilogueVF.isVector()))
+ if (!Cost->requiresScalarEpilogue(EPI.EpilogueVF))
// For loops with multiple exits, there's no edge from the middle block
// to exit blocks (as the epilogue must run) and thus no need to update
// the immediate dominator of the exit blocks.
@@ -7730,7 +7758,7 @@ EpilogueVectorizerEpilogueLoop::createEpilogueVectorizedLoopSkeleton(
DT->changeImmediateDominator(LoopScalarPreHeader,
EPI.EpilogueIterationCountCheck);
- if (!Cost->requiresScalarEpilogue(EPI.EpilogueVF.isVector()))
+ if (!Cost->requiresScalarEpilogue(EPI.EpilogueVF))
// If there is an epilogue which must run, there's no edge from the
// middle block to exit blocks and thus no need to update the immediate
// dominator of the exit blocks.
@@ -7812,9 +7840,8 @@ EpilogueVectorizerEpilogueLoop::emitMinimumVectorEpilogueIterCountCheck(
// Generate code to check if the loop's trip count is less than VF * UF of the
// vector epilogue loop.
- auto P = Cost->requiresScalarEpilogue(EPI.EpilogueVF.isVector())
- ? ICmpInst::ICMP_ULE
- : ICmpInst::ICMP_ULT;
+ auto P = Cost->requiresScalarEpilogue(EPI.EpilogueVF) ? ICmpInst::ICMP_ULE
+ : ICmpInst::ICMP_ULT;
Value *CheckMinIters =
Builder.CreateICmp(P, Count,
diff --git a/llvm/test/Transforms/LoopVectorize/ARM/pointer_iv.ll b/llvm/test/Transforms/LoopVectorize/ARM/pointer_iv.ll
index 2269b774d9f31..1592ddb26f67d 100644
--- a/llvm/test/Transforms/LoopVectorize/ARM/pointer_iv.ll
+++ b/llvm/test/Transforms/LoopVectorize/ARM/pointer_iv.ll
@@ -12,16 +12,16 @@ define hidden void @pointer_phi_v4i32_add1(ptr noalias nocapture readonly %A, pt
; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
; CHECK: vector.body:
; CHECK-NEXT: [[INDEX:%.*]] = phi i32 [ 0, [[ENTRY:%.*]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
-; CHECK-NEXT: [[TMP0:%.*]] = shl i32 [[INDEX]], 2
-; CHECK-NEXT: [[NEXT_GEP:%.*]] = getelementptr i8, ptr [[A:%.*]], i32 [[TMP0]]
-; CHECK-NEXT: [[TMP1:%.*]] = shl i32 [[INDEX]], 2
-; CHECK-NEXT: [[NEXT_GEP4:%.*]] = getelementptr i8, ptr [[B:%.*]], i32 [[TMP1]]
+; CHECK-NEXT: [[OFFSET_IDX:%.*]] = shl i32 [[INDEX]], 2
+; CHECK-NEXT: [[NEXT_GEP:%.*]] = getelementptr i8, ptr [[A:%.*]], i32 [[OFFSET_IDX]]
+; CHECK-NEXT: [[OFFSET_IDX4:%.*]] = shl i32 [[INDEX]], 2
+; CHECK-NEXT: [[NEXT_GEP5:%.*]] = getelementptr i8, ptr [[B:%.*]], i32 [[OFFSET_IDX4]]
; CHECK-NEXT: [[WIDE_LOAD:%.*]] = load <4 x i32>, ptr [[NEXT_GEP]], align 4
-; CHECK-NEXT: [[TMP2:%.*]] = add nsw <4 x i32> [[WIDE_LOAD]], [[BROADCAST_SPLAT]]
-; CHECK-NEXT: store <4 x i32> [[TMP2]], ptr [[NEXT_GEP4]], align 4
+; CHECK-NEXT: [[TMP0:%.*]] = add nsw <4 x i32> [[WIDE_LOAD]], [[BROADCAST_SPLAT]]
+; CHECK-NEXT: store <4 x i32> [[TMP0]], ptr [[NEXT_GEP5]], align 4
; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i32 [[INDEX]], 4
-; CHECK-NEXT: [[TMP3:%.*]] = icmp eq i32 [[INDEX_NEXT]], 1000
-; CHECK-NEXT: br i1 [[TMP3]], label [[END:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
+; CHECK-NEXT: [[TMP1:%.*]] = icmp eq i32 [[INDEX_NEXT]], 1000
+; CHECK-NEXT: br i1 [[TMP1]], label [[END:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
; CHECK: end:
; CHECK-NEXT: ret void
;
@@ -53,24 +53,24 @@ define hidden void @pointer_phi_v4i32_add2(ptr noalias nocapture readonly %A, pt
; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
; CHECK: vector.body:
; CHECK-NEXT: [[INDEX:%.*]] = phi i32 [ 0, [[ENTRY:%.*]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
-; CHECK-NEXT: [[TMP0:%.*]] = shl i32 [[INDEX]], 3
-; CHECK-NEXT: [[NEXT_GEP:%.*]] = getelementptr i8, ptr [[A]], i32 [[TMP0]]
-; CHECK-NEXT: [[TMP1:%.*]] = shl i32 [[INDEX]], 2
-; CHECK-NEXT: [[NEXT_GEP4:%.*]] = getelementptr i8, ptr [[B]], i32 [[TMP1]]
+; CHECK-NEXT: [[OFFSET_IDX:%.*]] = shl i32 [[INDEX]], 3
+; CHECK-NEXT: [[NEXT_GEP:%.*]] = getelementptr i8, ptr [[A]], i32 [[OFFSET_IDX]]
+; CHECK-NEXT: [[OFFSET_IDX4:%.*]] = shl i32 [[INDEX]], 2
+; CHECK-NEXT: [[NEXT_GEP5:%.*]] = getelementptr i8, ptr [[B]], i32 [[OFFSET_IDX4]]
; CHECK-NEXT: [[WIDE_VEC:%.*]] = load <8 x i32>, ptr [[NEXT_GEP]], align 4
; CHECK-NEXT: [[STRIDED_VEC:%.*]] = shufflevector <8 x i32> [[WIDE_VEC]], <8 x i32> poison, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
-; CHECK-NEXT: [[TMP2:%.*]] = add nsw <4 x i32> [[STRIDED_VEC]], [[BROADCAST_SPLAT]]
-; CHECK-NEXT: store <4 x i32> [[TMP2]], ptr [[NEXT_GEP4]], align 4
+; CHECK-NEXT: [[TMP0:%.*]] = add nsw <4 x i32> [[STRIDED_VEC]], [[BROADCAST_SPLAT]]
+; CHECK-NEXT: store <4 x i32> [[TMP0]], ptr [[NEXT_GEP5]], align 4
; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i32 [[INDEX]], 4
-; CHECK-NEXT: [[TMP3:%.*]] = icmp eq i32 [[INDEX_NEXT]], 996
-; CHECK-NEXT: br i1 [[TMP3]], label [[FOR_BODY:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP3:![0-9]+]]
+; CHECK-NEXT: [[TMP1:%.*]] = icmp eq i32 [[INDEX_NEXT]], 996
+; CHECK-NEXT: br i1 [[TMP1]], label [[FOR_BODY:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP3:![0-9]+]]
; CHECK: for.body:
; CHECK-NEXT: [[A_ADDR_09:%.*]] = phi ptr [ [[ADD_PTR:%.*]], [[FOR_BODY]] ], [ [[IND_END]], [[VECTOR_BODY]] ]
; CHECK-NEXT: [[I_08:%.*]] = phi i32 [ [[INC:%.*]], [[FOR_BODY]] ], [ 996, [[VECTOR_BODY]] ]
; CHECK-NEXT: [[B_ADDR_07:%.*]] = phi ptr [ [[INCDEC_PTR:%.*]], [[FOR_BODY]] ], [ [[IND_END2]], [[VECTOR_BODY]] ]
-; CHECK-NEXT: [[TMP4:%.*]] = load i32, ptr [[A_ADDR_09]], align 4
+; CHECK-NEXT: [[TMP2:%.*]] = load i32, ptr [[A_ADDR_09]], align 4
; CHECK-NEXT: [[ADD_PTR]] = getelementptr inbounds i8, ptr [[A_ADDR_09]], i32 8
-; CHECK-NEXT: [[ADD:%.*]] = add nsw i32 [[TMP4]], [[Y]]
+; CHECK-NEXT: [[ADD:%.*]] = add nsw i32 [[TMP2]], [[Y]]
; CHECK-NEXT: store i32 [[ADD]], ptr [[B_ADDR_07]], align 4
; CHECK-NEXT: [[INCDEC_PTR]] = getelementptr inbounds i8, ptr [[B_ADDR_07]], i32 4
; CHECK-NEXT: [[INC]] = add nuw nsw i32 [[I_08]], 1
@@ -100,36 +100,22 @@ end:
define hidden void @pointer_phi_v4i32_add3(ptr noalias nocapture readonly %A, ptr noalias nocapture %B, i32 %y) {
; CHECK-LABEL: @pointer_phi_v4i32_add3(
; CHECK-NEXT: entry:
-; CHECK-NEXT: [[IND_END:%.*]] = getelementptr i8, ptr [[A:%.*]], i32 11952
-; CHECK-NEXT: [[IND_END2:%.*]] = getelementptr i8, ptr [[B:%.*]], i32 3984
; CHECK-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <4 x i32> poison, i32 [[Y:%.*]], i64 0
; CHECK-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <4 x i32> [[BROADCAST_SPLATINSERT]], <4 x i32> poison, <4 x i32> zeroinitializer
; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
; CHECK: vector.body:
-; CHECK-NEXT: [[POINTER_PHI:%.*]] = phi ptr [ [[A]], [[ENTRY:%.*]] ], [ [[PTR_IND:%.*]], [[VECTOR_BODY]] ]
+; CHECK-NEXT: [[POINTER_PHI:%.*]] = phi ptr [ [[A:%.*]], [[ENTRY:%.*]] ], [ [[PTR_IND:%.*]], [[VECTOR_BODY]] ]
; CHECK-NEXT: [[INDEX:%.*]] = phi i32 [ 0, [[ENTRY]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
; CHECK-NEXT: [[TMP0:%.*]] = getelementptr i8, ptr [[POINTER_PHI]], <4 x i32> <i32 0, i32 12, i32 24, i32 36>
-; CHECK-NEXT: [[TMP1:%.*]] = shl i32 [[INDEX]], 2
-; CHECK-NEXT: [[NEXT_GEP:%.*]] = getelementptr i8, ptr [[B]], i32 [[TMP1]]
+; CHECK-NEXT: [[OFFSET_IDX:%.*]] = shl i32 [[INDEX]], 2
+; CHECK-NEXT: [[NEXT_GEP:%.*]] = getelementptr i8, ptr [[B:%.*]], i32 [[OFFSET_IDX]]
; CHECK-NEXT: [[WIDE_MASKED_GATHER:%.*]] = call <4 x i32> @llvm.masked.gather.v4i32.v4p0(<4 x ptr> [[TMP0]], i32 4, <4 x i1> <i1 true, i1 true, i1 true, i1 true>, <4 x i32> poison)
-; CHECK-NEXT: [[TMP2:%.*]] = add nsw <4 x i32> [[WIDE_MASKED_GATHER]], [[BROADCAST_SPLAT]]
-; CHECK-NEXT: store <4 x i32> [[TMP2]], ptr [[NEXT_GEP]], align 4
+; CHECK-NEXT: [[TMP1:%.*]] = add nsw <4 x i32> [[WIDE_MASKED_GATHER]], [[BROADCAST_SPLAT]]
+; CHECK-NEXT: store <4 x i32> [[TMP1]], ptr [[NEXT_GEP]], align 4
; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i32 [[INDEX]], 4
; CHECK-NEXT: [[PTR_IND]] = getelementptr i8, ptr [[POINTER_PHI]], i32 48
-; CHECK-NEXT: [[TMP3:%.*]] = icmp eq i32 [[INDEX_NEXT]], 996
-; CHECK-NEXT: br i1 [[TMP3]], label [[FOR_BODY:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP5:![0-9]+]]
-; CHECK: for.body:
-; CHECK-NEXT: [[A_ADDR_09:%.*]] = phi ptr [ [[ADD_PTR:%.*]], [[FOR_BODY]] ], [ [[IND_END]], [[VECTOR_BODY]] ]
-; CHECK-NEXT: [[I_08:%.*]] = phi i32 [ [[INC:%.*]], [[FOR_BODY]] ], [ 996, [[VECTOR_BODY]] ]
-; CHECK-NEXT: [[B_ADDR_07:%.*]] = phi ptr [ [[INCDEC_PTR:%.*]], [[FOR_BODY]] ], [ [[IND_END2]], [[VECTOR_BODY]] ]
-; CHECK-NEXT: [[TMP4:%.*]] = load i32, ptr [[A_ADDR_09]], align 4
-; CHECK-NEXT: [[ADD_PTR]] = getelementptr inbounds i8, ptr [[A_ADDR_09]], i32 12
-; CHECK-NEXT: [[ADD:%.*]] = add nsw i32 [[TMP4]], [[Y]]
-; CHECK-NEXT: store i32 [[ADD]], ptr [[B_ADDR_07]], align 4
-; CHECK-NEXT: [[INCDEC_PTR]] = getelementptr inbounds i8, ptr [[B_ADDR_07]], i32 4
-; CHECK-NEXT: [[INC]] = add nuw nsw i32 [[I_08]], 1
-; CHECK-NEXT: [[EXITCOND:%.*]] = icmp eq i32 [[INC]], 1000
-; CHECK-NEXT: br i1 [[EXITCOND]], label [[END:%.*]], label [[FOR_BODY]], !llvm.loop [[LOOP6:![0-9]+]]
+; CHECK-NEXT: [[TMP2:%.*]] = icmp eq i32 [[INDEX_NEXT]], 1000
+; CHECK-NEXT: br i1 [[TMP2]], label [[END:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP5:![0-9]+]]
; CHECK: end:
; CHECK-NEXT: ret void
;
@@ -160,16 +146,16 @@ define hidden void @pointer_phi_v8i16_add1(ptr noalias nocapture readonly %A, pt
; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
; CHECK: vector.body:
; CHECK-NEXT: [[INDEX:%.*]] = phi i32 [ 0, [[ENTRY:%.*]] ], [ [[INDEX_NE...
[truncated]
@@ -1448,29 +1448,58 @@ class LoopVectorizationCostModel {
   /// Returns true if \p I is a memory instruction in an interleaved-group
   /// of memory accesses that can be vectorized with wide vector loads/stores
   /// and shuffles.
-  bool interleavedAccessCanBeWidened(Instruction *I, ElementCount VF);
+  bool interleavedAccessCanBeWidened(Instruction *I, ElementCount VF) const;
Better to make it a separate NFC patch.
Would be good to also include the additional DEBUG printing as part of the NFC patch
    if (VF.isVector()) {
      if (InterleaveInfo.requiresScalarEpilogue()) {
Suggested change:
-    if (VF.isVector()) {
-      if (InterleaveInfo.requiresScalarEpilogue()) {
+    if (VF.isVector() && InterleaveInfo.hasGroups() && InterleaveInfo.requiresScalarEpilogue()) {
Should not be needed to check for #groups. requiresScalarEpilogue is false when no groups were constructed.
Yes, so the below code should be good enough:
Suggested change:
-    if (VF.isVector()) {
-      if (InterleaveInfo.requiresScalarEpilogue()) {
+    if (VF.isVector() && InterleaveInfo.requiresScalarEpilogue()) {
@@ -8,7 +8,7 @@ target datalayout = "e-m:o-i64:64-i128:128-n32:64-S128"
 ; VF/IC combination.
 define void @test_tc_less_than_16(ptr %A, i64 %N) {
 ; CHECK: LV: Scalarizing: %cmp =
-; CHECK-NEXT: VPlan 'Initial VPlan for VF={8},UF>=1' {
+; CHECK: VPlan 'Initial VPlan for VF={8},UF>=1' {
Should the additional line that broke the -NEXT be checked too?
The purpose of the test is to check the VPlan representation.
; CHECK-NEXT: [[TMP92:%.*]] = load double, ptr [[TMP60]], align 8
; CHECK-NEXT: [[TMP93:%.*]] = load double, ptr [[TMP61]], align 8
; CHECK-NEXT: [[TMP94:%.*]] = load double, ptr [[TMP62]], align 8
; CHECK-NEXT: [[TMP95:%.*]] = load double, ptr [[TMP63]], align 8
Are these loads dead?
yes
That's a great discovery. Here's another possible approach: perhaps we could directly invalidate groups where the WideningDecision is not CM_Interleave within LoopVectorizationCostModel::setCostBasedWideningDecision. The relevant implementation can be referenced from InterleavedAccessInfo::invalidateGroupsRequiringScalarEpilogue().
      // Make sure interleaved groups that require scalar epilogue will be
      // widened.
      for (auto *G : InterleaveInfo.getInterleaveGroups()) {
        if (!G->requiresScalarEpilogue())
          continue;

        Instruction *I = G->getMember(0);
        InstWidening Decision = getWideningDecision(I, VF);
        if (Decision == CM_Interleave ||
            (Decision == CM_Unknown &&
             interleavedAccessCanBeWidened(G->getMember(0), VF))) {
          LLVM_DEBUG(dbgs() << "LV: Loop with VF = " << VF
                            << " requires scalar epilogue: vectorizable "
                               "interleaved group\n");
          return true;
        }
Could you use any_of to implement this check?
Will do. I hesitated to use `auto *G`, as the proper type `InterleaveGroup<Instruction>` is quite large.
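The any_of rewrite the reviewer asks for can be sketched with stand-in types. This is a self-contained analog, not LLVM's actual code: the hypothetical `Group` struct's flags stand in for `requiresScalarEpilogue()`, `getWideningDecision(I, VF)`, and `interleavedAccessCanBeWidened(I, VF)`, and `std::any_of` plays the role of `llvm::any_of`.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Stand-in for LoopVectorizationCostModel::InstWidening.
enum InstWidening { CM_Unknown, CM_Interleave, CM_Scalarize };

// Hypothetical flattened view of one interleave group for a fixed VF.
struct Group {
  bool NeedsScalarEpilogue; // G->requiresScalarEpilogue()
  InstWidening Decision;    // getWideningDecision(G->getMember(0), VF)
  bool CanBeWidened;        // interleavedAccessCanBeWidened(..., VF)
};

// The explicit for-loop becomes: "any group that requires a scalar
// epilogue is either already decided to be interleaved, or undecided
// but still widenable".
bool anyWidenedGroupNeedsEpilogue(const std::vector<Group> &Groups) {
  return std::any_of(Groups.begin(), Groups.end(), [](const Group &G) {
    if (!G.NeedsScalarEpilogue)
      return false;
    return G.Decision == CM_Interleave ||
           (G.Decision == CM_Unknown && G.CanBeWidened);
  });
}
```

The early `return false` inside the lambda replaces the `continue` of the original loop, and the short-circuit of any_of replaces the `return true`.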
            (Decision == CM_Unknown &&
             interleavedAccessCanBeWidened(G->getMember(0), VF))) {
I can't imagine why this condition would be needed. Could you explain more?
There's no expectation about when `requiresScalarEpilogue` is called: it can be called both before and after the widening decision is made. The latter check makes sure `requiresScalarEpilogue` returns the same value while the decision is not yet made. The `Decision == CM_Unknown` guard is necessary, as `interleavedAccessCanBeWidened` is expected to be called only while the decision is still unknown.
I was considering that option too, but groups are constructed for the entire loop, and one group can be valid with one VF but not with another.
Force-pushed from 1d2c4fe to fa9e2e9.
Makes sense, thanks.
@fhahn ping
        Instruction *I = Group->getMember(0);
        InstWidening Decision = getWideningDecision(I, VF);
        return Decision == CM_Interleave ||
               (Decision == CM_Unknown &&
Is it possible that the decision later changes, and with it the result of `requiresScalarEpilogue`? As mentioned for #96681, would it be possible to compute whether a scalar epilogue is needed once and set a flag, to avoid it potentially changing later?