Skip to content

Commit 7231446

Browse files
committed
[LV] Unconditionally branch from middle to scalar preheader if the scalar loop must execute (try 4)
Resubmit after the following changes: * Fix a latent bug related to unrolling with required epilogue (see e49d65f). I believe this is the cause of the prior PPC buildbot failure. * Disable non-latch exits for epilogue vectorization to be safe (9ffa90d) * Split out assert movement (600624a) to reduce churn if this gets reverted again. Previous commit message (try 3) Resubmit after fixing test/Transforms/LoopVectorize/ARM/mve-gather-scatter-tailpred.ll Previous commit message... This is a resubmit of 3e5ce4 (which was reverted by 7fe41ac). The original commit caused a PPC build bot failure we never really got to the bottom of. I can't reproduce the issue, and the bot owner was non-responsive. In the meantime, we stumbled across an issue which seems possibly related, and worked around a latent bug in 80e8025. My best guess is that the original patch exposed that latent issue at higher frequency, but it really is just a guess. Original commit message follows... If we know that the scalar epilogue is required to run, modify the CFG to end the middle block with an unconditional branch to scalar preheader. This is instead of a conditional branch to either the preheader or the exit block. The motivation to do this is to support multiple exit blocks. Specifically, the current structure forces us to identify immediate dominators and *which* exit block to branch from in the middle terminator. For the multiple exit case - where we know require scalar will hold - these questions are ill formed. This is the last change needed to support multiple exit loops, but since the diffs are already large enough, I'm going to land this, and then enable separately. You can think of this as being NFCIish prep work, but the changes are a bit too involved for me to feel comfortable tagging the review that way. Differential Revision: https://reviews.llvm.org/D94892
1 parent 5888a19 commit 7231446

File tree

6 files changed

+161
-133
lines changed

6 files changed

+161
-133
lines changed

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

Lines changed: 83 additions & 41 deletions
Original file line numberDiff line numberDiff line change
@@ -816,7 +816,7 @@ class InnerLoopVectorizer {
816816
/// Middle Block between the vector and the scalar.
817817
BasicBlock *LoopMiddleBlock;
818818

819-
/// The (unique) ExitBlock of the scalar loop. Note that
819+
/// The unique ExitBlock of the scalar loop if one exists. Note that
820820
/// there can be multiple exiting edges reaching this block.
821821
BasicBlock *LoopExitBlock;
822822

@@ -3268,9 +3268,13 @@ void InnerLoopVectorizer::emitMinimumIterationCountCheck(Loop *L,
32683268
DT->getNode(Bypass)->getIDom()) &&
32693269
"TC check is expected to dominate Bypass");
32703270

3271-
// Update dominator for Bypass & LoopExit.
3271+
// Update dominator for Bypass & LoopExit (if needed).
32723272
DT->changeImmediateDominator(Bypass, TCCheckBlock);
3273-
DT->changeImmediateDominator(LoopExitBlock, TCCheckBlock);
3273+
if (!Cost->requiresScalarEpilogue(VF))
3274+
// If there is an epilogue which must run, there's no edge from the
3275+
// middle block to exit blocks and thus no need to update the immediate
3276+
// dominator of the exit blocks.
3277+
DT->changeImmediateDominator(LoopExitBlock, TCCheckBlock);
32743278

32753279
ReplaceInstWithInst(
32763280
TCCheckBlock->getTerminator(),
@@ -3294,7 +3298,11 @@ BasicBlock *InnerLoopVectorizer::emitSCEVChecks(Loop *L, BasicBlock *Bypass) {
32943298
// Update dominator only if this is first RT check.
32953299
if (LoopBypassBlocks.empty()) {
32963300
DT->changeImmediateDominator(Bypass, SCEVCheckBlock);
3297-
DT->changeImmediateDominator(LoopExitBlock, SCEVCheckBlock);
3301+
if (!Cost->requiresScalarEpilogue(VF))
3302+
// If there is an epilogue which must run, there's no edge from the
3303+
// middle block to exit blocks and thus no need to update the immediate
3304+
// dominator of the exit blocks.
3305+
DT->changeImmediateDominator(LoopExitBlock, SCEVCheckBlock);
32983306
}
32993307

33003308
LoopBypassBlocks.push_back(SCEVCheckBlock);
@@ -3447,9 +3455,10 @@ Value *InnerLoopVectorizer::emitTransformedIndex(
34473455
Loop *InnerLoopVectorizer::createVectorLoopSkeleton(StringRef Prefix) {
34483456
LoopScalarBody = OrigLoop->getHeader();
34493457
LoopVectorPreHeader = OrigLoop->getLoopPreheader();
3450-
LoopExitBlock = OrigLoop->getUniqueExitBlock();
3451-
assert(LoopExitBlock && "Must have an exit block");
34523458
assert(LoopVectorPreHeader && "Invalid loop structure");
3459+
LoopExitBlock = OrigLoop->getUniqueExitBlock(); // may be nullptr
3460+
assert((LoopExitBlock || Cost->requiresScalarEpilogue(VF)) &&
3461+
"multiple exit loop without required epilogue?");
34533462

34543463
LoopMiddleBlock =
34553464
SplitBlock(LoopVectorPreHeader, LoopVectorPreHeader->getTerminator(), DT,
@@ -3458,12 +3467,20 @@ Loop *InnerLoopVectorizer::createVectorLoopSkeleton(StringRef Prefix) {
34583467
SplitBlock(LoopMiddleBlock, LoopMiddleBlock->getTerminator(), DT, LI,
34593468
nullptr, Twine(Prefix) + "scalar.ph");
34603469

3461-
// Set up branch from middle block to the exit and scalar preheader blocks.
3462-
// completeLoopSkeleton will update the condition to use an iteration check,
3463-
// if required to decide whether to execute the remainder.
3464-
BranchInst *BrInst =
3465-
BranchInst::Create(LoopExitBlock, LoopScalarPreHeader, Builder.getTrue());
34663470
auto *ScalarLatchTerm = OrigLoop->getLoopLatch()->getTerminator();
3471+
3472+
// Set up the middle block terminator. Two cases:
3473+
// 1) If we know that we must execute the scalar epilogue, emit an
3474+
// unconditional branch.
3475+
// 2) Otherwise, we must have a single unique exit block (due to how we
3476+
// implement the multiple exit case). In this case, set up a conditonal
3477+
// branch from the middle block to the loop scalar preheader, and the
3478+
// exit block. completeLoopSkeleton will update the condition to use an
3479+
// iteration check, if required to decide whether to execute the remainder.
3480+
BranchInst *BrInst = Cost->requiresScalarEpilogue(VF) ?
3481+
BranchInst::Create(LoopScalarPreHeader) :
3482+
BranchInst::Create(LoopExitBlock, LoopScalarPreHeader,
3483+
Builder.getTrue());
34673484
BrInst->setDebugLoc(ScalarLatchTerm->getDebugLoc());
34683485
ReplaceInstWithInst(LoopMiddleBlock->getTerminator(), BrInst);
34693486

@@ -3475,7 +3492,11 @@ Loop *InnerLoopVectorizer::createVectorLoopSkeleton(StringRef Prefix) {
34753492
nullptr, nullptr, Twine(Prefix) + "vector.body");
34763493

34773494
// Update dominator for loop exit.
3478-
DT->changeImmediateDominator(LoopExitBlock, LoopMiddleBlock);
3495+
if (!Cost->requiresScalarEpilogue(VF))
3496+
// If there is an epilogue which must run, there's no edge from the
3497+
// middle block to exit blocks and thus no need to update the immediate
3498+
// dominator of the exit blocks.
3499+
DT->changeImmediateDominator(LoopExitBlock, LoopMiddleBlock);
34793500

34803501
// Create and register the new vector loop.
34813502
Loop *Lp = LI->AllocateLoop();
@@ -3577,10 +3598,14 @@ BasicBlock *InnerLoopVectorizer::completeLoopSkeleton(Loop *L,
35773598
auto *ScalarLatchTerm = OrigLoop->getLoopLatch()->getTerminator();
35783599

35793600
// Add a check in the middle block to see if we have completed
3580-
// all of the iterations in the first vector loop.
3581-
// If (N - N%VF) == N, then we *don't* need to run the remainder.
3582-
// If tail is to be folded, we know we don't need to run the remainder.
3583-
if (!Cost->foldTailByMasking()) {
3601+
// all of the iterations in the first vector loop. Three cases:
3602+
// 1) If we require a scalar epilogue, there is no conditional branch as
3603+
// we unconditionally branch to the scalar preheader. Do nothing.
3604+
// 2) If (N - N%VF) == N, then we *don't* need to run the remainder.
3605+
// Thus if tail is to be folded, we know we don't need to run the
3606+
// remainder and we can use the previous value for the condition (true).
3607+
// 3) Otherwise, construct a runtime check.
3608+
if (!Cost->requiresScalarEpilogue(VF) && !Cost->foldTailByMasking()) {
35843609
Instruction *CmpN = CmpInst::Create(Instruction::ICmp, CmpInst::ICMP_EQ,
35853610
Count, VectorTripCount, "cmp.n",
35863611
LoopMiddleBlock->getTerminator());
@@ -3644,17 +3669,17 @@ BasicBlock *InnerLoopVectorizer::createVectorizedLoopSkeleton() {
36443669
| [ ]_| <-- vector loop.
36453670
| |
36463671
| v
3647-
| -[ ] <--- middle-block.
3648-
| / |
3649-
| / v
3650-
-|- >[ ] <--- new preheader.
3672+
\ -[ ] <--- middle-block.
3673+
\/ |
3674+
/\ v
3675+
| ->[ ] <--- new preheader.
36513676
| |
3652-
| v
3677+
(opt) v <-- edge from middle to exit iff epilogue is not required.
36533678
| [ ] \
3654-
| [ ]_| <-- old scalar loop to handle remainder.
3679+
| [ ]_| <-- old scalar loop to handle remainder (scalar epilogue).
36553680
\ |
36563681
\ v
3657-
>[ ] <-- exit block.
3682+
>[ ] <-- exit block(s).
36583683
...
36593684
*/
36603685

@@ -4088,13 +4113,19 @@ void InnerLoopVectorizer::fixVectorizedLoop(VPTransformState &State) {
40884113
// Forget the original basic block.
40894114
PSE.getSE()->forgetLoop(OrigLoop);
40904115

4091-
// Fix-up external users of the induction variables.
4092-
for (auto &Entry : Legal->getInductionVars())
4093-
fixupIVUsers(Entry.first, Entry.second,
4094-
getOrCreateVectorTripCount(LI->getLoopFor(LoopVectorBody)),
4095-
IVEndValues[Entry.first], LoopMiddleBlock);
4116+
// If we inserted an edge from the middle block to the unique exit block,
4117+
// update uses outside the loop (phis) to account for the newly inserted
4118+
// edge.
4119+
if (!Cost->requiresScalarEpilogue(VF)) {
4120+
// Fix-up external users of the induction variables.
4121+
for (auto &Entry : Legal->getInductionVars())
4122+
fixupIVUsers(Entry.first, Entry.second,
4123+
getOrCreateVectorTripCount(LI->getLoopFor(LoopVectorBody)),
4124+
IVEndValues[Entry.first], LoopMiddleBlock);
4125+
4126+
fixLCSSAPHIs(State);
4127+
}
40964128

4097-
fixLCSSAPHIs(State);
40984129
for (Instruction *PI : PredicatedInstructions)
40994130
sinkScalarOperands(&*PI);
41004131

@@ -4309,12 +4340,13 @@ void InnerLoopVectorizer::fixFirstOrderRecurrence(VPWidenPHIRecipe *PhiR,
43094340
// recurrence in the exit block, and then add an edge for the middle block.
43104341
// Note that LCSSA does not imply single entry when the original scalar loop
43114342
// had multiple exiting edges (as we always run the last iteration in the
4312-
// scalar epilogue); in that case, the exiting path through middle will be
4313-
// dynamically dead and the value picked for the phi doesn't matter.
4314-
for (PHINode &LCSSAPhi : LoopExitBlock->phis())
4315-
if (any_of(LCSSAPhi.incoming_values(),
4316-
[Phi](Value *V) { return V == Phi; }))
4317-
LCSSAPhi.addIncoming(ExtractForPhiUsedOutsideLoop, LoopMiddleBlock);
4343+
// scalar epilogue); in that case, there is no edge from middle to exit and
4344+
// and thus no phis which needed updated.
4345+
if (!Cost->requiresScalarEpilogue(VF))
4346+
for (PHINode &LCSSAPhi : LoopExitBlock->phis())
4347+
if (any_of(LCSSAPhi.incoming_values(),
4348+
[Phi](Value *V) { return V == Phi; }))
4349+
LCSSAPhi.addIncoming(ExtractForPhiUsedOutsideLoop, LoopMiddleBlock);
43184350
}
43194351

43204352
void InnerLoopVectorizer::fixReduction(VPReductionPHIRecipe *PhiR,
@@ -4483,10 +4515,11 @@ void InnerLoopVectorizer::fixReduction(VPReductionPHIRecipe *PhiR,
44834515
// We know that the loop is in LCSSA form. We need to update the PHI nodes
44844516
// in the exit blocks. See comment on analogous loop in
44854517
// fixFirstOrderRecurrence for a more complete explaination of the logic.
4486-
for (PHINode &LCSSAPhi : LoopExitBlock->phis())
4487-
if (any_of(LCSSAPhi.incoming_values(),
4488-
[LoopExitInst](Value *V) { return V == LoopExitInst; }))
4489-
LCSSAPhi.addIncoming(ReducedPartRdx, LoopMiddleBlock);
4518+
if (!Cost->requiresScalarEpilogue(VF))
4519+
for (PHINode &LCSSAPhi : LoopExitBlock->phis())
4520+
if (any_of(LCSSAPhi.incoming_values(),
4521+
[LoopExitInst](Value *V) { return V == LoopExitInst; }))
4522+
LCSSAPhi.addIncoming(ReducedPartRdx, LoopMiddleBlock);
44904523

44914524
// Fix the scalar loop reduction variable with the incoming reduction sum
44924525
// from the vector body and from the backedge value.
@@ -8316,7 +8349,11 @@ BasicBlock *EpilogueVectorizerMainLoop::emitMinimumIterationCountCheck(
83168349

83178350
// Update dominator for Bypass & LoopExit.
83188351
DT->changeImmediateDominator(Bypass, TCCheckBlock);
8319-
DT->changeImmediateDominator(LoopExitBlock, TCCheckBlock);
8352+
if (!Cost->requiresScalarEpilogue(EPI.EpilogueVF))
8353+
// For loops with multiple exits, there's no edge from the middle block
8354+
// to exit blocks (as the epilogue must run) and thus no need to update
8355+
// the immediate dominator of the exit blocks.
8356+
DT->changeImmediateDominator(LoopExitBlock, TCCheckBlock);
83208357

83218358
LoopBypassBlocks.push_back(TCCheckBlock);
83228359

@@ -8380,7 +8417,12 @@ EpilogueVectorizerEpilogueLoop::createEpilogueVectorizedLoopSkeleton() {
83808417

83818418
DT->changeImmediateDominator(LoopScalarPreHeader,
83828419
EPI.EpilogueIterationCountCheck);
8383-
DT->changeImmediateDominator(LoopExitBlock, EPI.EpilogueIterationCountCheck);
8420+
if (!Cost->requiresScalarEpilogue(EPI.EpilogueVF))
8421+
// If there is an epilogue which must run, there's no edge from the
8422+
// middle block to exit blocks and thus no need to update the immediate
8423+
// dominator of the exit blocks.
8424+
DT->changeImmediateDominator(LoopExitBlock,
8425+
EPI.EpilogueIterationCountCheck);
83848426

83858427
// Keep track of bypass blocks, as they feed start values to the induction
83868428
// phis in the scalar loop preheader.

0 commit comments

Comments
 (0)