Commit ec0117d

fhahn authored and SamTebbs33 committed
[LV] Compute register usage for interleaving on VPlan.
Add a version of calculateRegisterUsage that estimates register usage for a VPlan. This mostly just ports the existing code, with some updates to figure out which recipes will generate vectors vs. scalars.

There are a number of changes in the computed register usages, but they should be more accurate w.r.t. the generated vector code. The changes are:

* Scalar usage increases by 1 in most cases, as we always create a scalar canonical IV, which is alive across the loop and is not considered by the legacy implementation.
* Output is ordered by insertion; scalar registers are now added first due to the canonical IV phi.
* Using the VPlan, we now also know more precisely whether an induction will be vectorized or scalarized.
1 parent b738b82 commit ec0117d

13 files changed: +325 -152 lines changed

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

Lines changed: 229 additions & 4 deletions
@@ -992,7 +992,8 @@ class LoopVectorizationCostModel {
   /// If interleave count has been specified by metadata it will be returned.
   /// Otherwise, the interleave count is computed and returned. VF and LoopCost
   /// are the selected vectorization factor and the cost of the selected VF.
-  unsigned selectInterleaveCount(ElementCount VF, InstructionCost LoopCost);
+  unsigned selectInterleaveCount(VPlan &Plan, ElementCount VF,
+                                 InstructionCost LoopCost);
 
   /// Memory access instruction may be vectorized in more than one way.
   /// Form of instruction after vectorization depends on cost.
@@ -4881,8 +4882,232 @@ void LoopVectorizationCostModel::collectElementTypesForWidening() {
   }
 }
 
+/// Estimate the register usage for \p Plan and vectorization factors in \p VFs.
+/// Returns the register usage for each VF in \p VFs.
+static SmallVector<LoopVectorizationCostModel::RegisterUsage, 8>
+calculateRegisterUsage(VPlan &Plan, ArrayRef<ElementCount> VFs,
+                       const TargetTransformInfo &TTI) {
+  // This function calculates the register usage by measuring the highest number
+  // of values that are alive at a single location. Obviously, this is a very
+  // rough estimation. We scan the loop in a topological order in order and
+  // assign a number to each recipe. We use RPO to ensure that defs are
+  // met before their users. We assume that each recipe that has in-loop
+  // users starts an interval. We record every time that an in-loop value is
+  // used, so we have a list of the first and last occurrences of each
+  // recipe. Next, we transpose this data structure into a multi map that
+  // holds the list of intervals that *end* at a specific location. This multi
+  // map allows us to perform a linear search. We scan the instructions linearly
+  // and record each time that a new interval starts, by placing it in a set.
+  // If we find this value in the multi-map then we remove it from the set.
+  // The max register usage is the maximum size of the set.
+  // We also search for instructions that are defined outside the loop, but are
+  // used inside the loop. We need this number separately from the max-interval
+  // usage number because when we unroll, loop-invariant values do not take
+  // more register.
+  LoopVectorizationCostModel::RegisterUsage RU;
+
+  // Each 'key' in the map opens a new interval. The values
+  // of the map are the index of the 'last seen' usage of the
+  // recipe that is the key.
+  using IntervalMap = SmallDenseMap<VPRecipeBase *, unsigned, 16>;
+
+  // Maps recipe to its index.
+  SmallVector<VPRecipeBase *, 64> IdxToRecipe;
+  // Marks the end of each interval.
+  IntervalMap EndPoint;
+  // Saves the list of recipe indices that are used in the loop.
+  SmallPtrSet<VPRecipeBase *, 8> Ends;
+  // Saves the list of values that are used in the loop but are defined outside
+  // the loop (not including non-recipe values such as arguments and
+  // constants).
+  SmallSetVector<VPValue *, 8> LoopInvariants;
+  LoopInvariants.insert(&Plan.getVectorTripCount());
+
+  ReversePostOrderTraversal<VPBlockDeepTraversalWrapper<VPBlockBase *>> RPOT(
+      Plan.getVectorLoopRegion());
+  for (VPBasicBlock *VPBB : VPBlockUtils::blocksOnly<VPBasicBlock>(RPOT)) {
+    if (!VPBB->getParent())
+      break;
+    for (VPRecipeBase &R : *VPBB) {
+      IdxToRecipe.push_back(&R);
+
+      // Save the end location of each USE.
+      for (VPValue *U : R.operands()) {
+        auto *DefR = U->getDefiningRecipe();
+
+        // Ignore non-recipe values such as arguments, constants, etc.
+        // FIXME: Might need some motivation why these values are ignored. If
+        // for example an argument is used inside the loop it will increase the
+        // register pressure (so shouldn't we add it to LoopInvariants).
+        if (!DefR && (!U->getLiveInIRValue() ||
+                      !isa<Instruction>(U->getLiveInIRValue())))
+          continue;
+
+        // If this recipe is outside the loop then record it and continue.
+        if (!DefR) {
+          LoopInvariants.insert(U);
+          continue;
+        }
+
+        // Overwrite previous end points.
+        EndPoint[DefR] = IdxToRecipe.size();
+        Ends.insert(DefR);
+      }
+    }
+    if (VPBB == Plan.getVectorLoopRegion()->getExiting()) {
+      // VPWidenIntOrFpInductionRecipes are used implicitly at the end of the
+      // exiting block, where their increment will get materialized eventually.
+      for (auto &R : Plan.getVectorLoopRegion()->getEntryBasicBlock()->phis()) {
+        if (isa<VPWidenIntOrFpInductionRecipe>(&R)) {
+          EndPoint[&R] = IdxToRecipe.size();
+          Ends.insert(&R);
+        }
+      }
+    }
+  }
+
+  // Saves the list of intervals that end with the index in 'key'.
+  using RecipeList = SmallVector<VPRecipeBase *, 2>;
+  SmallDenseMap<unsigned, RecipeList, 16> TransposeEnds;
+
+  // Transpose the EndPoints to a list of values that end at each index.
+  for (auto &Interval : EndPoint)
+    TransposeEnds[Interval.second].push_back(Interval.first);
+
+  SmallPtrSet<VPRecipeBase *, 8> OpenIntervals;
+  SmallVector<LoopVectorizationCostModel::RegisterUsage, 8> RUs(VFs.size());
+  SmallVector<SmallMapVector<unsigned, unsigned, 4>, 8> MaxUsages(VFs.size());
+
+  LLVM_DEBUG(dbgs() << "LV(REG): Calculating max register usage:\n");
+
+  VPTypeAnalysis TypeInfo(Plan.getCanonicalIV()->getScalarType());
+
+  const auto &TTICapture = TTI;
+  auto GetRegUsage = [&TTICapture](Type *Ty, ElementCount VF) -> unsigned {
+    if (Ty->isTokenTy() || !VectorType::isValidElementType(Ty) ||
+        (VF.isScalable() &&
+         !TTICapture.isElementTypeLegalForScalableVector(Ty)))
+      return 0;
+    return TTICapture.getRegUsageForType(VectorType::get(Ty, VF));
+  };
+
+  for (unsigned int Idx = 0, Sz = IdxToRecipe.size(); Idx < Sz; ++Idx) {
+    VPRecipeBase *R = IdxToRecipe[Idx];
+
+    // Remove all of the recipes that end at this location.
+    RecipeList &List = TransposeEnds[Idx];
+    for (VPRecipeBase *ToRemove : List)
+      OpenIntervals.erase(ToRemove);
+
+    // Ignore recipes that are never used within the loop.
+    if (!Ends.count(R) && !R->mayHaveSideEffects())
+      continue;
+
+    // For each VF find the maximum usage of registers.
+    for (unsigned J = 0, E = VFs.size(); J < E; ++J) {
+      // Count the number of registers used, per register class, given all open
+      // intervals.
+      // Note that elements in this SmallMapVector will be default constructed
+      // as 0. So we can use "RegUsage[ClassID] += n" in the code below even if
+      // there is no previous entry for ClassID.
+      SmallMapVector<unsigned, unsigned, 4> RegUsage;
+
+      if (VFs[J].isScalar()) {
+        for (auto *Inst : OpenIntervals) {
+          for (VPValue *DefV : Inst->definedValues()) {
+            unsigned ClassID = TTI.getRegisterClassForType(
+                false, TypeInfo.inferScalarType(DefV));
+            // FIXME: The target might use more than one register for the type
+            // even in the scalar case.
+            RegUsage[ClassID] += 1;
+          }
+        }
+      } else {
+        for (auto *R : OpenIntervals) {
+          if (isa<VPVectorPointerRecipe, VPReverseVectorPointerRecipe>(R))
+            continue;
+          if (isa<VPCanonicalIVPHIRecipe, VPReplicateRecipe, VPDerivedIVRecipe,
+                  VPScalarIVStepsRecipe>(R) ||
+              (isa<VPInstruction>(R) &&
+               all_of(cast<VPSingleDefRecipe>(R)->users(), [&](VPUser *U) {
+                 return cast<VPRecipeBase>(U)->usesScalars(
+                     R->getVPSingleValue());
+               }))) {
+            unsigned ClassID = TTI.getRegisterClassForType(
+                false, TypeInfo.inferScalarType(R->getVPSingleValue()));
+            // FIXME: The target might use more than one register for the type
+            // even in the scalar case.
+            RegUsage[ClassID] += 1;
+          } else {
+            for (VPValue *DefV : R->definedValues()) {
+              Type *ScalarTy = TypeInfo.inferScalarType(DefV);
+              unsigned ClassID = TTI.getRegisterClassForType(true, ScalarTy);
+              RegUsage[ClassID] += GetRegUsage(ScalarTy, VFs[J]);
+            }
+          }
+        }
+      }
+
+      for (const auto &Pair : RegUsage) {
+        auto &Entry = MaxUsages[J][Pair.first];
+        Entry = std::max(Entry, Pair.second);
+      }
+    }
+
+    LLVM_DEBUG(dbgs() << "LV(REG): At #" << Idx << " Interval # "
+                      << OpenIntervals.size() << '\n');
+
+    // Add the current recipe to the list of open intervals.
+    OpenIntervals.insert(R);
+  }
+
+  for (unsigned Idx = 0, End = VFs.size(); Idx < End; ++Idx) {
+    // Note that elements in this SmallMapVector will be default constructed
+    // as 0. So we can use "Invariant[ClassID] += n" in the code below even if
+    // there is no previous entry for ClassID.
+    SmallMapVector<unsigned, unsigned, 4> Invariant;
+
+    for (auto *In : LoopInvariants) {
+      // FIXME: The target might use more than one register for the type
+      // even in the scalar case.
+      bool IsScalar = all_of(In->users(), [&](VPUser *U) {
+        return cast<VPRecipeBase>(U)->usesScalars(In);
+      });
+
+      ElementCount VF = IsScalar ? ElementCount::getFixed(1) : VFs[Idx];
+      unsigned ClassID = TTI.getRegisterClassForType(
+          VF.isVector(), TypeInfo.inferScalarType(In));
+      Invariant[ClassID] += GetRegUsage(TypeInfo.inferScalarType(In), VF);
+    }
+
+    LLVM_DEBUG({
+      dbgs() << "LV(REG): VF = " << VFs[Idx] << '\n';
+      dbgs() << "LV(REG): Found max usage: " << MaxUsages[Idx].size()
+             << " item\n";
+      for (const auto &pair : MaxUsages[Idx]) {
+        dbgs() << "LV(REG): RegisterClass: "
+               << TTI.getRegisterClassName(pair.first) << ", " << pair.second
+               << " registers\n";
+      }
+      dbgs() << "LV(REG): Found invariant usage: " << Invariant.size()
+             << " item\n";
+      for (const auto &pair : Invariant) {
+        dbgs() << "LV(REG): RegisterClass: "
+               << TTI.getRegisterClassName(pair.first) << ", " << pair.second
+               << " registers\n";
+      }
+    });
+
+    RU.LoopInvariantRegs = Invariant;
+    RU.MaxLocalUsers = MaxUsages[Idx];
+    RUs[Idx] = RU;
+  }
+
+  return RUs;
+}
+
 unsigned
-LoopVectorizationCostModel::selectInterleaveCount(ElementCount VF,
+LoopVectorizationCostModel::selectInterleaveCount(VPlan &Plan, ElementCount VF,
                                                   InstructionCost LoopCost) {
   // -- The interleave heuristics --
   // We interleave the loop in order to expose ILP and reduce the loop overhead.
@@ -4932,7 +5157,7 @@ LoopVectorizationCostModel::selectInterleaveCount(ElementCount VF,
     return 1;
   }
 
-  RegisterUsage R = calculateRegisterUsage({VF})[0];
+  RegisterUsage R = ::calculateRegisterUsage(Plan, {VF}, TTI)[0];
   // We divide by these constants so assume that we have at least one
   // instruction that uses at least one register.
   for (auto &Pair : R.MaxLocalUsers) {
@@ -10717,7 +10942,7 @@ bool LoopVectorizePass::processLoop(Loop *L) {
                                AddBranchWeights, CM.CostKind);
       if (LVP.hasPlanWithVF(VF.Width)) {
         // Select the interleave count.
-        IC = CM.selectInterleaveCount(VF.Width, VF.Cost);
+        IC = CM.selectInterleaveCount(LVP.getPlanFor(VF.Width), VF.Width, VF.Cost);
 
         unsigned SelectedIC = std::max(IC, UserIC);
         // Optimistically generate runtime checks if they are needed. Drop them if
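
The comment at the top of calculateRegisterUsage describes an interval scan: record the last use of each definition, transpose that into "intervals ending at position N", then walk the recipes while keeping a set of open intervals. The following is a minimal standalone sketch of that idea only. Node and maxLiveValues are hypothetical names, plain indices stand in for recipes, and register classes, vectorization factors, side effects and loop invariants are all left out.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <set>
#include <vector>

// Hypothetical stand-in for a recipe: it only lists the indices of the
// earlier nodes whose results it reads.
struct Node {
  std::vector<std::size_t> Operands;
};

// Returns the largest number of intervals that are open at the same time.
std::size_t maxLiveValues(const std::vector<Node> &Nodes) {
  // EndPoint[D] is one past the index of the last user of node D.
  std::map<std::size_t, std::size_t> EndPoint;
  for (std::size_t Idx = 0; Idx < Nodes.size(); ++Idx)
    for (std::size_t Def : Nodes[Idx].Operands)
      EndPoint[Def] = Idx + 1; // Later users overwrite earlier end points.

  // Transpose: for each position, the intervals that end there.
  std::map<std::size_t, std::vector<std::size_t>> TransposeEnds;
  for (const auto &Entry : EndPoint)
    TransposeEnds[Entry.second].push_back(Entry.first);

  std::set<std::size_t> OpenIntervals;
  std::size_t MaxOpen = 0;
  for (std::size_t Idx = 0; Idx < Nodes.size(); ++Idx) {
    // Close every interval that ends at this position.
    auto It = TransposeEnds.find(Idx);
    if (It != TransposeEnds.end())
      for (std::size_t Def : It->second)
        OpenIntervals.erase(Def);

    // Nodes without users never open an interval in this sketch.
    if (!EndPoint.count(Idx))
      continue;

    // Measure before opening the interval of the current node.
    MaxOpen = std::max(MaxOpen, OpenIntervals.size());
    OpenIntervals.insert(Idx);
  }
  return MaxOpen;
}
```

For five nodes a, b, c = a + b, d = c + a and a final user of d, maxLiveValues returns 2: a and b are open when c is visited, and a and c are open when d is visited.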

llvm/test/Transforms/LoopVectorize/AArch64/i1-reg-usage.ll

Lines changed: 2 additions & 2 deletions
@@ -8,8 +8,8 @@ target triple = "aarch64"
 ; CHECK-LABEL: LV: Checking a loop in 'or_reduction_neon' from <stdin>
 ; CHECK: LV(REG): VF = 32
 ; CHECK-NEXT: LV(REG): Found max usage: 2 item
+; CHECK-NEXT: LV(REG): RegisterClass: Generic::ScalarRC, 2 registers
 ; CHECK-NEXT: LV(REG): RegisterClass: Generic::VectorRC, 72 registers
-; CHECK-NEXT: LV(REG): RegisterClass: Generic::ScalarRC, 1 registers
 
 define i1 @or_reduction_neon(i32 %arg, ptr %ptr) {
 entry:
@@ -31,8 +31,8 @@ loop:
 ; CHECK-LABEL: LV: Checking a loop in 'or_reduction_sve'
 ; CHECK: LV(REG): VF = 64
 ; CHECK-NEXT: LV(REG): Found max usage: 2 item
+; CHECK-NEXT: LV(REG): RegisterClass: Generic::ScalarRC, 2 registers
 ; CHECK-NEXT: LV(REG): RegisterClass: Generic::VectorRC, 136 registers
-; CHECK-NEXT: LV(REG): RegisterClass: Generic::ScalarRC, 1 registers
 
 define i1 @or_reduction_sve(i32 %arg, ptr %ptr) vscale_range(2,2) "target-features"="+sve" {
 entry:
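
The reordered CHECK lines in this test follow from the per-class maxima being kept in an insertion-ordered map (SmallMapVector in the patch): whichever register class is seen first is printed first. Below is a minimal sketch of that behaviour using standard containers rather than LLVM's. OrderedUsage and recordMax are hypothetical names, and the sequence of updates in the usage note is invented for illustration; only the class names and the final values echo the test.

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-in for an insertion-ordered map.
class OrderedUsage {
  std::vector<std::pair<std::string, unsigned>> Entries;

public:
  // Returns the entry for Key, appending a zero-initialized one if missing.
  unsigned &operator[](const std::string &Key) {
    for (auto &Entry : Entries)
      if (Entry.first == Key)
        return Entry.second;
    Entries.emplace_back(Key, 0u);
    return Entries.back().second;
  }

  // Entries in the order in which their keys were first inserted.
  const std::vector<std::pair<std::string, unsigned>> &entries() const {
    return Entries;
  }
};

// Fold one snapshot of per-class usage into the running maximum.
void recordMax(OrderedUsage &Max, const std::string &Class, unsigned Used) {
  unsigned &Entry = Max[Class];
  Entry = std::max(Entry, Used);
}
```

Recording ScalarRC 1, VectorRC 72, ScalarRC 2 and VectorRC 8 in that order leaves the entries as ScalarRC 2 followed by VectorRC 72. The scalar class comes first simply because it was touched first, which matches the commit message's note about the canonical IV phi.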

llvm/test/Transforms/LoopVectorize/AArch64/reg-usage.ll

Lines changed: 3 additions & 4 deletions
@@ -16,11 +16,10 @@ define void @get_invariant_reg_usage(ptr %z) {
 ; CHECK-LABEL: LV: Checking a loop in 'get_invariant_reg_usage'
 ; CHECK: LV(REG): VF = vscale x 16
 ; CHECK-NEXT: LV(REG): Found max usage: 2 item
-; CHECK-NEXT: LV(REG): RegisterClass: Generic::ScalarRC, 3 registers
-; CHECK-NEXT: LV(REG): RegisterClass: Generic::VectorRC, 1 registers
-; CHECK-NEXT: LV(REG): Found invariant usage: 2 item
 ; CHECK-NEXT: LV(REG): RegisterClass: Generic::ScalarRC, 2 registers
-; CHECK-NEXT: LV(REG): RegisterClass: Generic::VectorRC, 8 registers
+; CHECK-NEXT: LV(REG): RegisterClass: Generic::VectorRC, 1 registers
+; CHECK-NEXT: LV(REG): Found invariant usage: 1 item
+; CHECK-NEXT: LV(REG): RegisterClass: Generic::ScalarRC, 3 registers
 
 L.entry:
   %0 = load i128, ptr %z, align 16

llvm/test/Transforms/LoopVectorize/LoongArch/reg-usage.ll

Lines changed: 4 additions & 3 deletions
@@ -9,17 +9,18 @@
 define void @bar(ptr %A, i32 signext %n) {
 ; CHECK-LABEL: bar
 ; CHECK-SCALAR: LV(REG): Found max usage: 2 item
-; CHECK-SCALAR-NEXT: LV(REG): RegisterClass: LoongArch::GPRRC, 2 registers
+; CHECK-SCALAR-NEXT: LV(REG): RegisterClass: LoongArch::GPRRC, 3 registers
 ; CHECK-SCALAR-NEXT: LV(REG): RegisterClass: LoongArch::FPRRC, 1 registers
 ; CHECK-SCALAR-NEXT: LV(REG): Found invariant usage: 1 item
 ; CHECK-SCALAR-NEXT: LV(REG): RegisterClass: LoongArch::GPRRC, 1 registers
 ; CHECK-SCALAR-NEXT: LV: The target has 30 registers of LoongArch::GPRRC register class
 ; CHECK-SCALAR-NEXT: LV: The target has 32 registers of LoongArch::FPRRC register class
 ; CHECK-VECTOR: LV(REG): Found max usage: 2 item
-; CHECK-VECTOR-NEXT: LV(REG): RegisterClass: LoongArch::VRRC, 3 registers
-; CHECK-VECTOR-NEXT: LV(REG): RegisterClass: LoongArch::GPRRC, 1 registers
+; CHECK-VECTOR-NEXT: LV(REG): RegisterClass: LoongArch::GPRRC, 2 registers
+; CHECK-VECTOR-NEXT: LV(REG): RegisterClass: LoongArch::VRRC, 2 registers
 ; CHECK-VECTOR-NEXT: LV(REG): Found invariant usage: 1 item
 ; CHECK-VECTOR-NEXT: LV(REG): RegisterClass: LoongArch::GPRRC, 1 registers
+; CHECK-VECTOR-NEXT: LV: The target has 30 registers of LoongArch::GPRRC register class
 ; CHECK-VECTOR-NEXT: LV: The target has 32 registers of LoongArch::VRRC register class
 
 entry:
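
To give a feel for why these per-class numbers matter to selectInterleaveCount, here is a rough sketch of a register-based bound on the interleave count. The formula (free registers divided by peak local users, rounded down to a power of two, minimum over classes) is an assumption for illustration and is not taken from this diff; the real heuristic has further terms and options. ClassUsage, floorPowerOfTwo and registerBoundOnInterleave are hypothetical names.

```cpp
#include <algorithm>
#include <limits>
#include <vector>

// Hypothetical summary of one register class.
struct ClassUsage {
  unsigned TargetRegs;     // Registers the target offers in this class.
  unsigned MaxLocalUsers;  // Peak number of values live inside the loop.
  unsigned LoopInvariants; // Values that stay live across the whole loop.
};

// Largest power of two that is <= N, or 0 when N is 0.
unsigned floorPowerOfTwo(unsigned N) {
  unsigned P = 0;
  for (unsigned Bit = 1; Bit != 0 && Bit <= N; Bit <<= 1)
    P = Bit;
  return P;
}

// Assumed bound: how many copies of the loop body fit into each class.
// Returns the maximum unsigned value when no class is given.
unsigned registerBoundOnInterleave(const std::vector<ClassUsage> &Classes) {
  unsigned IC = std::numeric_limits<unsigned>::max();
  for (const ClassUsage &C : Classes) {
    // Avoid dividing by zero: assume at least one local user.
    unsigned Users = std::max(1u, C.MaxLocalUsers);
    unsigned Free =
        C.TargetRegs > C.LoopInvariants ? C.TargetRegs - C.LoopInvariants : 0;
    IC = std::min(IC, floorPowerOfTwo(Free / Users));
  }
  return IC;
}
```

Plugging in the CHECK-VECTOR numbers above (GPRRC: 30 registers, 2 local users, 1 invariant; VRRC: 32 registers, 2 local users, no invariant) gives 8 under this assumed formula, with the scalar class as the limiting one.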
