Commit 55fcb29

[LV] Compute register usage for interleaving on VPlan.

Add a version of calculateRegisterUsage that estimates register usage for a VPlan. This mostly just ports the existing code, with some updates to figure out which recipes will generate vectors vs. scalars.

There are a number of changes in the computed register usages, but they should be more accurate w.r.t. the generated vector code. The changes are:

* Scalar usage increases in most cases by 1, as we always create a scalar canonical IV, which is alive across the loop and is not considered by the legacy implementation.
* Output is ordered by insertion; scalar registers are now added first due to the canonical IV phi.
* Using the VPlan, we now also know more precisely whether an induction will be vectorized or scalarized.
1 parent 4d1e4ef commit 55fcb29

13 files changed, +327 -145 lines changed

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

Lines changed: 229 additions & 4 deletions
@@ -992,7 +992,8 @@ class LoopVectorizationCostModel {
   /// If interleave count has been specified by metadata it will be returned.
   /// Otherwise, the interleave count is computed and returned. VF and LoopCost
   /// are the selected vectorization factor and the cost of the selected VF.
-  unsigned selectInterleaveCount(ElementCount VF, InstructionCost LoopCost);
+  unsigned selectInterleaveCount(VPlan &Plan, ElementCount VF,
+                                 InstructionCost LoopCost);
 
   /// Memory access instruction may be vectorized in more than one way.
   /// Form of instruction after vectorization depends on cost.
@@ -4871,8 +4872,232 @@ void LoopVectorizationCostModel::collectElementTypesForWidening() {
   }
 }
 
+/// Estimate the register usage for \p Plan and vectorization factors in \p VFs.
+/// Returns the register usage for each VF in \p VFs.
+static SmallVector<LoopVectorizationCostModel::RegisterUsage, 8>
+calculateRegisterUsage(VPlan &Plan, ArrayRef<ElementCount> VFs,
+                       const TargetTransformInfo &TTI) {
+  // This function calculates the register usage by measuring the highest number
+  // of values that are alive at a single location. Obviously, this is a very
+  // rough estimation. We scan the loop in a topological order in order and
+  // assign a number to each recipe. We use RPO to ensure that defs are
+  // met before their users. We assume that each recipe that has in-loop
+  // users starts an interval. We record every time that an in-loop value is
+  // used, so we have a list of the first and last occurrences of each
+  // recipe. Next, we transpose this data structure into a multi map that
+  // holds the list of intervals that *end* at a specific location. This multi
+  // map allows us to perform a linear search. We scan the instructions linearly
+  // and record each time that a new interval starts, by placing it in a set.
+  // If we find this value in the multi-map then we remove it from the set.
+  // The max register usage is the maximum size of the set.
+  // We also search for instructions that are defined outside the loop, but are
+  // used inside the loop. We need this number separately from the max-interval
+  // usage number because when we unroll, loop-invariant values do not take
+  // more register.
+  LoopVectorizationCostModel::RegisterUsage RU;
+
+  // Each 'key' in the map opens a new interval. The values
+  // of the map are the index of the 'last seen' usage of the
+  // recipe that is the key.
+  using IntervalMap = SmallDenseMap<VPRecipeBase *, unsigned, 16>;
+
+  // Maps recipe to its index.
+  SmallVector<VPRecipeBase *, 64> IdxToRecipe;
+  // Marks the end of each interval.
+  IntervalMap EndPoint;
+  // Saves the list of recipe indices that are used in the loop.
+  SmallPtrSet<VPRecipeBase *, 8> Ends;
+  // Saves the list of values that are used in the loop but are defined outside
+  // the loop (not including non-recipe values such as arguments and
+  // constants).
+  SmallSetVector<VPValue *, 8> LoopInvariants;
+  LoopInvariants.insert(&Plan.getVectorTripCount());
+
+  ReversePostOrderTraversal<VPBlockDeepTraversalWrapper<VPBlockBase *>> RPOT(
+      Plan.getVectorLoopRegion());
+  for (VPBasicBlock *VPBB : VPBlockUtils::blocksOnly<VPBasicBlock>(RPOT)) {
+    if (!VPBB->getParent())
+      break;
+    for (VPRecipeBase &R : *VPBB) {
+      IdxToRecipe.push_back(&R);
+
+      // Save the end location of each USE.
+      for (VPValue *U : R.operands()) {
+        auto *DefR = U->getDefiningRecipe();
+
+        // Ignore non-recipe values such as arguments, constants, etc.
+        // FIXME: Might need some motivation why these values are ignored. If
+        // for example an argument is used inside the loop it will increase the
+        // register pressure (so shouldn't we add it to LoopInvariants).
+        if (!DefR && (!U->getLiveInIRValue() ||
+                      !isa<Instruction>(U->getLiveInIRValue())))
+          continue;
+
+        // If this recipe is outside the loop then record it and continue.
+        if (!DefR) {
+          LoopInvariants.insert(U);
+          continue;
+        }
+
+        // Overwrite previous end points.
+        EndPoint[DefR] = IdxToRecipe.size();
+        Ends.insert(DefR);
+      }
+    }
+    if (VPBB == Plan.getVectorLoopRegion()->getExiting()) {
+      // VPWidenIntOrFpInductionRecipes are used implicitly at the end of the
+      // exiting block, where their increment will get materialized eventually.
+      for (auto &R : Plan.getVectorLoopRegion()->getEntryBasicBlock()->phis()) {
+        if (isa<VPWidenIntOrFpInductionRecipe>(&R)) {
+          EndPoint[&R] = IdxToRecipe.size();
+          Ends.insert(&R);
+        }
+      }
+    }
+  }
+
+  // Saves the list of intervals that end with the index in 'key'.
+  using RecipeList = SmallVector<VPRecipeBase *, 2>;
+  SmallDenseMap<unsigned, RecipeList, 16> TransposeEnds;
+
+  // Transpose the EndPoints to a list of values that end at each index.
+  for (auto &Interval : EndPoint)
+    TransposeEnds[Interval.second].push_back(Interval.first);
+
+  SmallPtrSet<VPRecipeBase *, 8> OpenIntervals;
+  SmallVector<LoopVectorizationCostModel::RegisterUsage, 8> RUs(VFs.size());
+  SmallVector<SmallMapVector<unsigned, unsigned, 4>, 8> MaxUsages(VFs.size());
+
+  LLVM_DEBUG(dbgs() << "LV(REG): Calculating max register usage:\n");
+
+  VPTypeAnalysis TypeInfo(Plan.getCanonicalIV()->getScalarType());
+
+  const auto &TTICapture = TTI;
+  auto GetRegUsage = [&TTICapture](Type *Ty, ElementCount VF) -> unsigned {
+    if (Ty->isTokenTy() || !VectorType::isValidElementType(Ty) ||
+        (VF.isScalable() &&
+         !TTICapture.isElementTypeLegalForScalableVector(Ty)))
+      return 0;
+    return TTICapture.getRegUsageForType(VectorType::get(Ty, VF));
+  };
+
+  for (unsigned int Idx = 0, Sz = IdxToRecipe.size(); Idx < Sz; ++Idx) {
+    VPRecipeBase *R = IdxToRecipe[Idx];
+
+    // Remove all of the recipes that end at this location.
+    RecipeList &List = TransposeEnds[Idx];
+    for (VPRecipeBase *ToRemove : List)
+      OpenIntervals.erase(ToRemove);
+
+    // Ignore recipes that are never used within the loop.
+    if (!Ends.count(R) && !R->mayHaveSideEffects())
+      continue;
+
+    // For each VF find the maximum usage of registers.
+    for (unsigned J = 0, E = VFs.size(); J < E; ++J) {
+      // Count the number of registers used, per register class, given all open
+      // intervals.
+      // Note that elements in this SmallMapVector will be default constructed
+      // as 0. So we can use "RegUsage[ClassID] += n" in the code below even if
+      // there is no previous entry for ClassID.
+      SmallMapVector<unsigned, unsigned, 4> RegUsage;
+
+      if (VFs[J].isScalar()) {
+        for (auto *Inst : OpenIntervals) {
+          for (VPValue *DefV : Inst->definedValues()) {
+            unsigned ClassID = TTI.getRegisterClassForType(
+                false, TypeInfo.inferScalarType(DefV));
+            // FIXME: The target might use more than one register for the type
+            // even in the scalar case.
+            RegUsage[ClassID] += 1;
+          }
+        }
+      } else {
+        for (auto *R : OpenIntervals) {
+          if (isa<VPVectorPointerRecipe, VPReverseVectorPointerRecipe>(R))
+            continue;
+          if (isa<VPCanonicalIVPHIRecipe, VPReplicateRecipe, VPDerivedIVRecipe,
+                  VPScalarIVStepsRecipe>(R) ||
+              (isa<VPInstruction>(R) &&
+               all_of(cast<VPSingleDefRecipe>(R)->users(), [&](VPUser *U) {
+                 return cast<VPRecipeBase>(U)->usesScalars(
+                     R->getVPSingleValue());
+               }))) {
+            unsigned ClassID = TTI.getRegisterClassForType(
+                false, TypeInfo.inferScalarType(R->getVPSingleValue()));
+            // FIXME: The target might use more than one register for the type
+            // even in the scalar case.
+            RegUsage[ClassID] += 1;
+          } else {
+            for (VPValue *DefV : R->definedValues()) {
+              Type *ScalarTy = TypeInfo.inferScalarType(DefV);
+              unsigned ClassID = TTI.getRegisterClassForType(true, ScalarTy);
+              RegUsage[ClassID] += GetRegUsage(ScalarTy, VFs[J]);
+            }
+          }
+        }
+      }
+
+      for (const auto &Pair : RegUsage) {
+        auto &Entry = MaxUsages[J][Pair.first];
+        Entry = std::max(Entry, Pair.second);
+      }
+    }
+
+    LLVM_DEBUG(dbgs() << "LV(REG): At #" << Idx << " Interval # "
+                      << OpenIntervals.size() << '\n');
+
+    // Add the current recipe to the list of open intervals.
+    OpenIntervals.insert(R);
+  }
+
+  for (unsigned Idx = 0, End = VFs.size(); Idx < End; ++Idx) {
+    // Note that elements in this SmallMapVector will be default constructed
+    // as 0. So we can use "Invariant[ClassID] += n" in the code below even if
+    // there is no previous entry for ClassID.
+    SmallMapVector<unsigned, unsigned, 4> Invariant;
+
+    for (auto *In : LoopInvariants) {
+      // FIXME: The target might use more than one register for the type
+      // even in the scalar case.
+      bool IsScalar = all_of(In->users(), [&](VPUser *U) {
+        return cast<VPRecipeBase>(U)->usesScalars(In);
+      });
+
+      ElementCount VF = IsScalar ? ElementCount::getFixed(1) : VFs[Idx];
+      unsigned ClassID = TTI.getRegisterClassForType(
+          VF.isVector(), TypeInfo.inferScalarType(In));
+      Invariant[ClassID] += GetRegUsage(TypeInfo.inferScalarType(In), VF);
+    }
+
+    LLVM_DEBUG({
+      dbgs() << "LV(REG): VF = " << VFs[Idx] << '\n';
+      dbgs() << "LV(REG): Found max usage: " << MaxUsages[Idx].size()
+             << " item\n";
+      for (const auto &pair : MaxUsages[Idx]) {
+        dbgs() << "LV(REG): RegisterClass: "
+               << TTI.getRegisterClassName(pair.first) << ", " << pair.second
+               << " registers\n";
+      }
+      dbgs() << "LV(REG): Found invariant usage: " << Invariant.size()
+             << " item\n";
+      for (const auto &pair : Invariant) {
+        dbgs() << "LV(REG): RegisterClass: "
+               << TTI.getRegisterClassName(pair.first) << ", " << pair.second
+               << " registers\n";
+      }
+    });
+
+    RU.LoopInvariantRegs = Invariant;
+    RU.MaxLocalUsers = MaxUsages[Idx];
+    RUs[Idx] = RU;
+  }
+
+  return RUs;
+}
+
 unsigned
-LoopVectorizationCostModel::selectInterleaveCount(ElementCount VF,
+LoopVectorizationCostModel::selectInterleaveCount(VPlan &Plan, ElementCount VF,
                                                   InstructionCost LoopCost) {
   // -- The interleave heuristics --
   // We interleave the loop in order to expose ILP and reduce the loop overhead.
@@ -4922,7 +5147,7 @@ LoopVectorizationCostModel::selectInterleaveCount(ElementCount VF,
     return 1;
   }
 
-  RegisterUsage R = calculateRegisterUsage({VF})[0];
+  RegisterUsage R = ::calculateRegisterUsage(Plan, {VF}, TTI)[0];
   // We divide by these constants so assume that we have at least one
   // instruction that uses at least one register.
   for (auto &Pair : R.MaxLocalUsers) {
@@ -10760,7 +10985,7 @@ bool LoopVectorizePass::processLoop(Loop *L) {
                            AddBranchWeights, CM.CostKind);
   if (LVP.hasPlanWithVF(VF.Width)) {
     // Select the interleave count.
-    IC = CM.selectInterleaveCount(VF.Width, VF.Cost);
+    IC = CM.selectInterleaveCount(LVP.getPlanFor(VF.Width), VF.Width, VF.Cost);
 
     unsigned SelectedIC = std::max(IC, UserIC);
     // Optimistically generate runtime checks if they are needed. Drop them if
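
The comment at the top of the new calculateRegisterUsage describes an interval scan: record where each value is last used, transpose that into "intervals ending at index N", then walk the instructions once while tracking the set of open intervals. The following is a minimal standalone sketch of that idea, not the code from this commit. `Instr` and `maxLiveValues` are hypothetical names, plain standard-library containers stand in for the LLVM ADTs, and register classes, VFs and loop invariants are left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for a recipe: it only records which earlier
// instructions (by index) it reads.
struct Instr {
  std::vector<std::size_t> Operands;
};

// Returns the largest number of values that are live at once, following the
// three steps from the comment: record end points, transpose, scan.
std::size_t maxLiveValues(const std::vector<Instr> &Body) {
  // Step 1: for every definition, remember one past the index of its last
  // user. Later users overwrite earlier end points.
  std::unordered_map<std::size_t, std::size_t> EndPoint;
  for (std::size_t I = 0; I < Body.size(); ++I) {
    for (std::size_t Def : Body[I].Operands) {
      assert(Def < I && "definitions must precede their users");
      EndPoint[Def] = I + 1;
    }
  }

  // Step 2: transpose into "which intervals end at index N".
  std::unordered_map<std::size_t, std::vector<std::size_t>> EndsAt;
  for (const auto &KV : EndPoint)
    EndsAt[KV.second].push_back(KV.first);

  // Step 3: linear scan. Close intervals first, then measure, then open the
  // interval of the current instruction.
  std::unordered_set<std::size_t> Open;
  std::size_t MaxLive = 0;
  for (std::size_t I = 0; I < Body.size(); ++I) {
    auto It = EndsAt.find(I);
    if (It != EndsAt.end())
      for (std::size_t Def : It->second)
        Open.erase(Def);

    // Values without users never open an interval. This is a simplification:
    // the diff also keeps recipes that may have side effects.
    if (EndPoint.find(I) == EndPoint.end())
      continue;

    MaxLive = std::max(MaxLive, Open.size());
    Open.insert(I);
  }
  return MaxLive;
}
```

For a five-instruction body in which instruction 2 reads 0 and 1, instruction 3 reads 2, and instruction 4 reads 0 and 3, this sketch reports a maximum of 2 open intervals. As in the diff, the measurement at an index is taken before the instruction at that index is added to the open set.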

llvm/test/Transforms/LoopVectorize/AArch64/i1-reg-usage.ll

Lines changed: 2 additions & 2 deletions
@@ -8,8 +8,8 @@ target triple = "aarch64"
 ; CHECK-LABEL: LV: Checking a loop in 'or_reduction_neon' from <stdin>
 ; CHECK: LV(REG): VF = 32
 ; CHECK-NEXT: LV(REG): Found max usage: 2 item
+; CHECK-NEXT: LV(REG): RegisterClass: Generic::ScalarRC, 2 registers
 ; CHECK-NEXT: LV(REG): RegisterClass: Generic::VectorRC, 72 registers
-; CHECK-NEXT: LV(REG): RegisterClass: Generic::ScalarRC, 1 registers
 
 define i1 @or_reduction_neon(i32 %arg, ptr %ptr) {
 entry:
@@ -31,8 +31,8 @@ loop:
 ; CHECK-LABEL: LV: Checking a loop in 'or_reduction_sve'
 ; CHECK: LV(REG): VF = 64
 ; CHECK-NEXT: LV(REG): Found max usage: 2 item
+; CHECK-NEXT: LV(REG): RegisterClass: Generic::ScalarRC, 2 registers
 ; CHECK-NEXT: LV(REG): RegisterClass: Generic::VectorRC, 136 registers
-; CHECK-NEXT: LV(REG): RegisterClass: Generic::ScalarRC, 1 registers
 
 define i1 @or_reduction_sve(i32 %arg, ptr %ptr) vscale_range(2,2) "target-features"="+sve" {
 entry:
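
The reordered CHECK lines in this test match the commit message's note that output is ordered by insertion: per-class counts are kept in a SmallMapVector, which iterates in insertion order, so whichever register class is touched first is printed first. Below is a small standalone sketch of that behaviour using only the standard library. `OrderedCounts` and `classOrder` are hypothetical names, and the class IDs 0 (scalar) and 1 (vector) are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical insertion-ordered counter map: operator[] creates a zero
// entry on first access, and iteration follows first-insertion order.
class OrderedCounts {
  std::vector<std::pair<unsigned, unsigned>> Entries;
  std::unordered_map<unsigned, std::size_t> Index;

public:
  unsigned &operator[](unsigned Key) {
    auto It = Index.find(Key);
    if (It == Index.end()) {
      It = Index.emplace(Key, Entries.size()).first;
      Entries.emplace_back(Key, 0u);
    }
    return Entries[It->second].second;
  }

  const std::vector<std::pair<unsigned, unsigned>> &entries() const {
    return Entries;
  }
};

// Counts one register per live value and returns the class IDs in the order
// a dump of the map would print them.
std::vector<unsigned> classOrder(const std::vector<unsigned> &LiveClasses) {
  OrderedCounts Usage;
  for (unsigned ClassID : LiveClasses)
    Usage[ClassID] += 1;

  std::vector<unsigned> Order;
  std::size_t Total = 0;
  for (const auto &Entry : Usage.entries()) {
    Order.push_back(Entry.first);
    Total += Entry.second;
  }
  // Sanity check: every live value was counted exactly once.
  assert(Total == LiveClasses.size());
  return Order;
}
```

With a scalar value (such as a canonical IV phi) seen first, `classOrder({0, 1, 1})` lists class 0 before class 1; if a vector value is seen first, the order flips.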

llvm/test/Transforms/LoopVectorize/AArch64/reg-usage.ll

Lines changed: 3 additions & 4 deletions
@@ -16,11 +16,10 @@ define void @get_invariant_reg_usage(ptr %z) {
 ; CHECK-LABEL: LV: Checking a loop in 'get_invariant_reg_usage'
 ; CHECK: LV(REG): VF = vscale x 16
 ; CHECK-NEXT: LV(REG): Found max usage: 2 item
-; CHECK-NEXT: LV(REG): RegisterClass: Generic::ScalarRC, 3 registers
-; CHECK-NEXT: LV(REG): RegisterClass: Generic::VectorRC, 1 registers
-; CHECK-NEXT: LV(REG): Found invariant usage: 2 item
 ; CHECK-NEXT: LV(REG): RegisterClass: Generic::ScalarRC, 2 registers
-; CHECK-NEXT: LV(REG): RegisterClass: Generic::VectorRC, 8 registers
+; CHECK-NEXT: LV(REG): RegisterClass: Generic::VectorRC, 1 registers
+; CHECK-NEXT: LV(REG): Found invariant usage: 1 item
+; CHECK-NEXT: LV(REG): RegisterClass: Generic::ScalarRC, 3 registers
 
 L.entry:
   %0 = load i128, ptr %z, align 16

llvm/test/Transforms/LoopVectorize/LoongArch/reg-usage.ll

Lines changed: 4 additions & 3 deletions
@@ -9,17 +9,18 @@
 define void @bar(ptr %A, i32 signext %n) {
 ; CHECK-LABEL: bar
 ; CHECK-SCALAR: LV(REG): Found max usage: 2 item
-; CHECK-SCALAR-NEXT: LV(REG): RegisterClass: LoongArch::GPRRC, 2 registers
+; CHECK-SCALAR-NEXT: LV(REG): RegisterClass: LoongArch::GPRRC, 3 registers
 ; CHECK-SCALAR-NEXT: LV(REG): RegisterClass: LoongArch::FPRRC, 1 registers
 ; CHECK-SCALAR-NEXT: LV(REG): Found invariant usage: 1 item
 ; CHECK-SCALAR-NEXT: LV(REG): RegisterClass: LoongArch::GPRRC, 1 registers
 ; CHECK-SCALAR-NEXT: LV: The target has 30 registers of LoongArch::GPRRC register class
 ; CHECK-SCALAR-NEXT: LV: The target has 32 registers of LoongArch::FPRRC register class
 ; CHECK-VECTOR: LV(REG): Found max usage: 2 item
-; CHECK-VECTOR-NEXT: LV(REG): RegisterClass: LoongArch::VRRC, 3 registers
-; CHECK-VECTOR-NEXT: LV(REG): RegisterClass: LoongArch::GPRRC, 1 registers
+; CHECK-VECTOR-NEXT: LV(REG): RegisterClass: LoongArch::GPRRC, 2 registers
+; CHECK-VECTOR-NEXT: LV(REG): RegisterClass: LoongArch::VRRC, 2 registers
 ; CHECK-VECTOR-NEXT: LV(REG): Found invariant usage: 1 item
 ; CHECK-VECTOR-NEXT: LV(REG): RegisterClass: LoongArch::GPRRC, 1 registers
+; CHECK-VECTOR-NEXT: LV: The target has 30 registers of LoongArch::GPRRC register class
 ; CHECK-VECTOR-NEXT: LV: The target has 32 registers of LoongArch::VRRC register class
 
 entry:
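
The LoongArch checks pair each class's usage with a "The target has N registers" line, and the selectInterleaveCount hunk notes that it divides by the per-class maxima. The exact formula is not part of this diff, so the following is only a hedged sketch of one plausible way such numbers could bound an interleave count per register class. `floorPowerOfTwo` and `interleaveBoundForClass` are hypothetical helpers, and rounding down to a power of two is an assumption.

```cpp
#include <algorithm>
#include <cassert>

// Largest power of two that is <= X, or 0 when X is 0.
unsigned floorPowerOfTwo(unsigned X) {
  unsigned Power = 0;
  for (unsigned Bit = 1; Bit != 0 && Bit <= X; Bit <<= 1)
    Power = Bit;
  return Power;
}

// Assumed bound for one register class: registers left after loop invariants,
// divided by the peak number of in-loop users, rounded down to a power of two.
unsigned interleaveBoundForClass(unsigned TargetRegs, unsigned InvariantRegs,
                                 unsigned MaxLocalUsers) {
  assert(TargetRegs >= InvariantRegs && "invariants exceed the register file");
  // Mirror the diff's comment: assume at least one register is in use so the
  // division is well defined.
  MaxLocalUsers = std::max(MaxLocalUsers, 1u);
  return floorPowerOfTwo((TargetRegs - InvariantRegs) / MaxLocalUsers);
}
```

Under these assumptions, the scalar GPRRC numbers from the test (30 registers, 1 invariant, peak usage 3) give (30 - 1) / 3 = 9, which rounds down to 8. A full heuristic would take the minimum over all register classes and apply further target limits that are outside this sketch.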
