Commit 3a4ecfc

[VPlan] Model branch cond to enter scalar epilogue in VPlan.
This patch moves branch condition creation to enter the scalar epilogue loop to VPlan. Modeling the branch in the middle block also requires modeling the successor blocks. To do so, this patch introduces a new VPBlockBase sub-type that simply wraps an existing IR block: VPIRWrapperBlock (name subject to better suggestions). This allows connecting existing IR blocks naturally as leaf nodes. It can also be used to gradually transition modeling of more bits of the skeleton to VPlan.

Note that the middle.block is still created as part of the skeleton and then patched in during VPlan execution. Unfortunately the skeleton needs to create the middle.block early on, as it is also used for induction resume value creation and is needed to properly update the dominator tree during skeleton creation.

After this patch lands, I plan to move induction resume value and phi node creation in the scalar preheader to VPlan. Once that is done, we should be able to create the middle.block in VPlan directly.

At the moment, VPIRWrapperBlocks only wrap an original IR block and don't allow any additions. We could allow IR wrapper blocks to also contain recipes that get added at the beginning of the block. Then the middle.block could also be wrapped. Something like that will also be needed to place the induction/reduction resume phi nodes in the scalar preheader.

This is a re-worked version of the earlier https://reviews.llvm.org/D150398; the main change is the introduction and use of VPIRWrapperBlock.

Note that this patch adds and uses a new helper, reorderIncomingBlocks, to preserve the original order of incoming blocks in created phi nodes. Because this patch changes the order in which branches are created, the order of the predecessors changes, which would otherwise cause a different order of incoming blocks in some phis we create. The helper ensures the created IR doesn't change (modulo some minor differences in name numbering). After this change has landed, the tests should be updated and the helper removed separately.

Depends on #92525
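As a rough illustration of the wrapper idea, the sketch below shows the general shape such a block takes: a plan-level node that carries no recipes of its own and only points at an existing IR basic block, plus a resetBlock-style setter so the wrapped block can be swapped once the real target (e.g. the scalar preheader) is known during execution. The IRBlock and IRWrapperBlock names are illustrative stand-ins, not the actual VPIRWrapperBlock class this patch adds to VPlan.h.

#include <cassert>
#include <string>

// Stand-in for an IR basic block; in LLVM this role is played by llvm::BasicBlock.
struct IRBlock {
  std::string Name;
};

// Minimal sketch (not the real VPIRWrapperBlock): a plan-level block that owns
// no instructions or recipes and simply wraps an existing IR block. The wrapped
// block can be re-pointed later, mirroring how the middle block's successors
// are patched to the scalar preheader during VPlan execution in this patch.
class IRWrapperBlock {
  IRBlock *WrappedBlock;

public:
  explicit IRWrapperBlock(IRBlock *BB) : WrappedBlock(BB) {}

  IRBlock *getWrappedBlock() const { return WrappedBlock; }

  // Re-point the wrapper at a different IR block, e.g. once the scalar
  // preheader created by the legacy skeleton is known.
  void resetBlock(IRBlock *NewBB) {
    assert(NewBB && "new block must not be null");
    WrappedBlock = NewBB;
  }
};

int main() {
  IRBlock Placeholder{"placeholder"};
  IRBlock ScalarPH{"scalar.ph"};

  // The plan initially points the successor wrapper at a placeholder block...
  IRWrapperBlock Succ(&Placeholder);
  // ...and later redirects it to the actual scalar preheader.
  Succ.resetBlock(&ScalarPH);
  assert(Succ.getWrappedBlock()->Name == "scalar.ph");
  return 0;
}

Wrapping existing IR blocks rather than recreating them is what allows transitioning more of the skeleton into VPlan gradually: the plan can reference blocks the legacy skeleton still creates and re-point them later, much as executePlan does for the middle block's successors in the diff below.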
1 parent 53e8e56 commit 3a4ecfc

12 files changed: +363 -96 lines

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

Lines changed: 66 additions & 70 deletions
@@ -59,6 +59,7 @@
 #include "VPlan.h"
 #include "VPlanAnalysis.h"
 #include "VPlanHCFGBuilder.h"
+#include "VPlanPatternMatch.h"
 #include "VPlanTransforms.h"
 #include "VPlanVerifier.h"
 #include "llvm/ADT/APInt.h"
@@ -2972,22 +2973,7 @@ void InnerLoopVectorizer::createVectorLoopSkeleton(StringRef Prefix) {
   SplitBlock(LoopMiddleBlock, LoopMiddleBlock->getTerminator(), DT, LI,
              nullptr, Twine(Prefix) + "scalar.ph");

-  auto *ScalarLatchTerm = OrigLoop->getLoopLatch()->getTerminator();
-
-  // Set up the middle block terminator. Two cases:
-  // 1) If we know that we must execute the scalar epilogue, emit an
-  //    unconditional branch.
-  // 2) Otherwise, we must have a single unique exit block (due to how we
-  //    implement the multiple exit case). In this case, set up a conditional
-  //    branch from the middle block to the loop scalar preheader, and the
-  //    exit block. completeLoopSkeleton will update the condition to use an
-  //    iteration check, if required to decide whether to execute the remainder.
-  BranchInst *BrInst =
-      Cost->requiresScalarEpilogue(VF.isVector())
-          ? BranchInst::Create(LoopScalarPreHeader)
-          : BranchInst::Create(LoopExitBlock, LoopScalarPreHeader,
-                               Builder.getTrue());
-  BrInst->setDebugLoc(ScalarLatchTerm->getDebugLoc());
+  auto *BrInst = new UnreachableInst(LoopMiddleBlock->getContext());
   ReplaceInstWithInst(LoopMiddleBlock->getTerminator(), BrInst);

   // Update dominator for loop exit. During skeleton creation, only the vector
@@ -3094,50 +3080,7 @@ void InnerLoopVectorizer::createInductionResumeValues(
   }
 }

-BasicBlock *InnerLoopVectorizer::completeLoopSkeleton() {
-  // The trip counts should be cached by now.
-  Value *Count = getTripCount();
-  Value *VectorTripCount = getOrCreateVectorTripCount(LoopVectorPreHeader);
-
-  auto *ScalarLatchTerm = OrigLoop->getLoopLatch()->getTerminator();
-
-  // Add a check in the middle block to see if we have completed
-  // all of the iterations in the first vector loop. Three cases:
-  // 1) If we require a scalar epilogue, there is no conditional branch as
-  //    we unconditionally branch to the scalar preheader. Do nothing.
-  // 2) If (N - N%VF) == N, then we *don't* need to run the remainder.
-  //    Thus if tail is to be folded, we know we don't need to run the
-  //    remainder and we can use the previous value for the condition (true).
-  // 3) Otherwise, construct a runtime check.
-  if (!Cost->requiresScalarEpilogue(VF.isVector()) &&
-      !Cost->foldTailByMasking()) {
-    // Here we use the same DebugLoc as the scalar loop latch terminator instead
-    // of the corresponding compare because they may have ended up with
-    // different line numbers and we want to avoid awkward line stepping while
-    // debugging. Eg. if the compare has got a line number inside the loop.
-    // TODO: At the moment, CreateICmpEQ will simplify conditions with constant
-    // operands. Perform simplification directly on VPlan once the branch is
-    // modeled there.
-    IRBuilder<> B(LoopMiddleBlock->getTerminator());
-    B.SetCurrentDebugLocation(ScalarLatchTerm->getDebugLoc());
-    Value *CmpN = B.CreateICmpEQ(Count, VectorTripCount, "cmp.n");
-    BranchInst &BI = *cast<BranchInst>(LoopMiddleBlock->getTerminator());
-    BI.setCondition(CmpN);
-    if (hasBranchWeightMD(*ScalarLatchTerm)) {
-      // Assume that `Count % VectorTripCount` is equally distributed.
-      unsigned TripCount = UF * VF.getKnownMinValue();
-      assert(TripCount > 0 && "trip count should not be zero");
-      const uint32_t Weights[] = {1, TripCount - 1};
-      setBranchWeights(BI, Weights);
-    }
-  }
-
-#ifdef EXPENSIVE_CHECKS
-  assert(DT->verify(DominatorTree::VerificationLevel::Fast));
-#endif

-  return LoopVectorPreHeader;
-}

 std::pair<BasicBlock *, Value *>
 InnerLoopVectorizer::createVectorizedLoopSkeleton(
@@ -3198,7 +3141,7 @@ InnerLoopVectorizer::createVectorizedLoopSkeleton(
   // Emit phis for the new starting index of the scalar loop.
   createInductionResumeValues(ExpandedSCEVs);

-  return {completeLoopSkeleton(), nullptr};
+  return {LoopVectorPreHeader, nullptr};
 }

 // Fix up external users of the induction variable. At this point, we are
@@ -3481,6 +3424,18 @@ void InnerLoopVectorizer::fixVectorizedLoop(VPTransformState &State,
                                VF.getKnownMinValue() * UF);
 }

+// Helper to reorder blocks so they match the original order even after the
+// order of the predecessors changes. This is only used to avoid a number of
+// test changes due to reordering of incoming blocks in phi nodes and should be
+// removed soon, with the tests being updated.
+static void reorderIncomingBlocks(SmallVectorImpl<BasicBlock *> &Blocks,
+                                  BasicBlock *LoopMiddleBlock) {
+  if (Blocks.front() == LoopMiddleBlock)
+    std::swap(Blocks.front(), Blocks.back());
+  if (Blocks.size() == 3)
+    std::swap(Blocks[0], Blocks[1]);
+}
+
 void InnerLoopVectorizer::fixFixedOrderRecurrence(
     VPFirstOrderRecurrencePHIRecipe *PhiR, VPTransformState &State) {
   // This is the second phase of vectorizing first-order recurrences. An
@@ -3591,7 +3546,9 @@ void InnerLoopVectorizer::fixFixedOrderRecurrence(
   PHINode *Phi = cast<PHINode>(PhiR->getUnderlyingValue());
   auto *Start = Builder.CreatePHI(Phi->getType(), 2, "scalar.recur.init");
   auto *ScalarInit = PhiR->getStartValue()->getLiveInIRValue();
-  for (auto *BB : predecessors(LoopScalarPreHeader)) {
+  SmallVector<BasicBlock *> Blocks(predecessors(LoopScalarPreHeader));
+  reorderIncomingBlocks(Blocks, LoopMiddleBlock);
+  for (auto *BB : Blocks) {
     auto *Incoming = BB == LoopMiddleBlock ? ExtractForScalar : ScalarInit;
     Start->addIncoming(Incoming, BB);
   }
@@ -7480,7 +7437,9 @@ static void createAndCollectMergePhiForReduction(
   // If we are fixing reductions in the epilogue loop then we should already
   // have created a bc.merge.rdx Phi after the main vector body. Ensure that
   // we carry over the incoming values correctly.
-  for (auto *Incoming : predecessors(LoopScalarPreHeader)) {
+  SmallVector<BasicBlock *> Blocks(predecessors(LoopScalarPreHeader));
+  reorderIncomingBlocks(Blocks, LoopMiddleBlock);
+  for (auto *Incoming : Blocks) {
     if (Incoming == LoopMiddleBlock)
       BCBlockPhi->addIncoming(FinalValue, Incoming);
     else if (ResumePhi && is_contained(ResumePhi->blocks(), Incoming))
@@ -7551,6 +7510,21 @@ LoopVectorizationPlanner::executePlan(
   std::tie(State.CFG.PrevBB, CanonicalIVStartValue) =
       ILV.createVectorizedLoopSkeleton(ExpandedSCEVs ? *ExpandedSCEVs
                                                      : State.ExpandedSCEVs);
+#ifdef EXPENSIVE_CHECKS
+  assert(DT->verify(DominatorTree::VerificationLevel::Fast));
+#endif
+
+  VPBasicBlock *MiddleVPBB =
+      cast<VPBasicBlock>(BestVPlan.getVectorLoopRegion()->getSingleSuccessor());
+
+  using namespace llvm::VPlanPatternMatch;
+  if (MiddleVPBB->begin() != MiddleVPBB->end() &&
+      match(&MiddleVPBB->back(), m_BranchOnCond(m_VPValue()))) {
+    cast<VPIRWrapperBlock>(MiddleVPBB->getSuccessors()[1])
+        ->resetBlock(OrigLoop->getLoopPreheader());
+  } else
+    cast<VPIRWrapperBlock>(MiddleVPBB->getSuccessors()[0])
+        ->resetBlock(OrigLoop->getLoopPreheader());

   // Only use noalias metadata when using memory checks guaranteeing no overlap
   // across all iterations.
@@ -7687,7 +7661,7 @@ EpilogueVectorizerMainLoop::createEpilogueVectorizedLoopSkeleton(
   // inductions in the epilogue loop are created before executing the plan for
   // the epilogue loop.

-  return {completeLoopSkeleton(), nullptr};
+  return {LoopVectorPreHeader, nullptr};
 }

 void EpilogueVectorizerMainLoop::printDebugTracesAtStart() {
@@ -7811,8 +7785,11 @@ EpilogueVectorizerEpilogueLoop::createEpilogueVectorizedLoopSkeleton(
       VecEpilogueIterationCountCheck,
       VecEpilogueIterationCountCheck->getSinglePredecessor());

-  DT->changeImmediateDominator(LoopScalarPreHeader,
-                               EPI.EpilogueIterationCountCheck);
+  if (auto *N = DT->getNode(LoopScalarPreHeader))
+    DT->changeImmediateDominator(LoopScalarPreHeader,
+                                 EPI.EpilogueIterationCountCheck);
+  else
+    DT->addNewBlock(LoopScalarPreHeader, EPI.EpilogueIterationCountCheck);
   if (!Cost->requiresScalarEpilogue(EPI.EpilogueVF.isVector()))
     // If there is an epilogue which must run, there's no edge from the
     // middle block to exit blocks and thus no need to update the immediate
@@ -7876,7 +7853,7 @@ EpilogueVectorizerEpilogueLoop::createEpilogueVectorizedLoopSkeleton(
                               {VecEpilogueIterationCountCheck,
                                EPI.VectorTripCount} /* AdditionalBypass */);

-  return {completeLoopSkeleton(), EPResumeVal};
+  return {LoopVectorPreHeader, EPResumeVal};
 }

 BasicBlock *
@@ -8625,9 +8602,25 @@ LoopVectorizationPlanner::tryToBuildVPlanWithVPRecipes(VFRange &Range) {
   // modified; a basic block for the vector pre-header, followed by a region for
   // the vector loop, followed by the middle basic block. The skeleton vector
   // loop region contains a header and latch basic blocks.
+
+  // Add a check in the middle block to see if we have completed
+  // all of the iterations in the first vector loop. Three cases:
+  // 1) If we require a scalar epilogue, there is no conditional branch as
+  //    we unconditionally branch to the scalar preheader. Do nothing.
+  // 2) If (N - N%VF) == N, then we *don't* need to run the remainder.
+  //    Thus if tail is to be folded, we know we don't need to run the
+  //    remainder and we can use the previous value for the condition (true).
+  // 3) Otherwise, construct a runtime check.
+  bool RequiresScalarEpilogueCheck =
+      LoopVectorizationPlanner::getDecisionAndClampRange(
+          [this](ElementCount VF) {
+            return !CM.requiresScalarEpilogue(VF.isVector());
+          },
+          Range);
   VPlanPtr Plan = VPlan::createInitialVPlan(
       createTripCountSCEV(Legal->getWidestInductionType(), PSE, OrigLoop),
-      *PSE.getSE());
+      *PSE.getSE(), RequiresScalarEpilogueCheck, CM.foldTailByMasking(),
+      OrigLoop);
   VPBasicBlock *HeaderVPBB = new VPBasicBlock("vector.body");
   VPBasicBlock *LatchVPBB = new VPBasicBlock("vector.latch");
   VPBlockUtils::insertBlockAfter(LatchVPBB, HeaderVPBB);
@@ -8875,7 +8868,7 @@ VPlanPtr LoopVectorizationPlanner::buildVPlan(VFRange &Range) {
   // Create new empty VPlan
   auto Plan = VPlan::createInitialVPlan(
       createTripCountSCEV(Legal->getWidestInductionType(), PSE, OrigLoop),
-      *PSE.getSE());
+      *PSE.getSE(), true, false, OrigLoop);

   // Build hierarchical CFG
   VPlanHCFGBuilder HCFGBuilder(OrigLoop, LI, *Plan);
@@ -9084,6 +9077,9 @@ void LoopVectorizationPlanner::adjustRecipesForReductions(
     }
   }
   Builder.setInsertPoint(&*LatchVPBB->begin());
+  VPBasicBlock *MiddleVPBB =
+      cast<VPBasicBlock>(VectorLoopRegion->getSingleSuccessor());
+  VPBasicBlock::iterator IP = MiddleVPBB->begin();
   for (VPRecipeBase &R :
        Plan->getVectorLoopRegion()->getEntryBasicBlock()->phis()) {
     VPReductionPHIRecipe *PhiR = dyn_cast<VPReductionPHIRecipe>(&R);
@@ -9192,8 +9188,8 @@ void LoopVectorizationPlanner::adjustRecipesForReductions(
     // also modeled in VPlan.
     auto *FinalReductionResult = new VPInstruction(
         VPInstruction::ComputeReductionResult, {PhiR, NewExitingVPV}, ExitDL);
-    cast<VPBasicBlock>(VectorLoopRegion->getSingleSuccessor())
-        ->appendRecipe(FinalReductionResult);
+    FinalReductionResult->insertBefore(*MiddleVPBB, IP);
+    IP = std::next(FinalReductionResult->getIterator());
     OrigExitingVPV->replaceUsesWithIf(
         FinalReductionResult,
         [](VPUser &User, unsigned) { return isa<VPLiveOut>(&User); });
