
Commit 6bae597

[CSSPGO] Call site prioritized inlining for sample PGO
This change implements call site prioritized BFS profile-guided inlining for the sample profile loader. The new inlining strategy maximizes the benefit of context-sensitive profiles, as mentioned in the follow-up discussion of the CSSPGO RFC. The change does not affect today's AutoFDO because it is opt-in. CSSPGO now defaults to the new FDO inliner, but can fall back to today's replay inliner using a switch (`-sample-profile-prioritized-inline=0`).

Motivation

With baseline AutoFDO, the inliner in the sample profile loader only replays previous inlining, and the profile is used only to prune previous inlining that turned out to be cold. Due to the nature of replay, the FDO inliner is simple, with hotness being the only decision factor. It has the following limitations that we're now improving for CSSPGO.

- It doesn't take inline candidate size into account. Since it's doing replay, size growth is bounded by previous CGSCC inlining. With context-sensitive profiles, the FDO inliner is no longer limited by previous inlining, so we need to take size into account to avoid significant size bloat.
- The way it looks at hotness is not accurate. It uses total samples in an inlinee as a proxy for hotness, while what really matters for an inline decision is the call site count. This is an unfortunate fallback because call site counts and callee entry counts are not reliable with DWARF-based correlation, especially for inlinees. Paired with pseudo-probes, we now have accurate call site counts and callee entry counts, so we can gauge hotness more accurately.
- It treats all call sites in a block as hot as long as one call site in the block is considered hot. This is normally true, but since total samples are used as the hotness proxy, this transitiveness within a block magnifies the inaccurate hotness heuristic. With pseudo-probes and the change above, this is no longer an issue for CSSPGO.

New FDO Inliner

Putting all the requirements for CSSPGO together, we need a top-down, call site prioritized BFS inliner. Here are the reasons why each component is needed; an illustrative sketch of the resulting inlining loop follows this commit message.

- Top-down: We need a top-down inliner to better leverage context-sensitive profiles, so inlining is driven by accurate context profiles and post-inline counts are also accurate. This is already implemented in https://reviews.llvm.org/D70655.
- Size cap: For a top-down inliner, taking function size into account for the inline decision alone isn't sufficient to control size growth. We also need to explicitly cap size growth, because with top-down inlining we can grow the inlined size significantly with a large number of small inlinees, even if each one individually passes the cost/size check.
- Prioritized call sites: With a size cap, inlining order also becomes important, because if we stop inlining due to the size budget limit, we want to spend the budget on the most beneficial call sites.
- BFS inlining: As with call site prioritization, if we stop inlining due to the size budget limit, we want a balanced inline tree rather than going deep on one call path.

Note that the new inliner avoids repeatedly evaluating the same set of call sites, so it should help with compile time too. For this reason, we could transition today's FDO inliner to use a queue with equal priority to avoid wasted reevaluation of the same call sites (TODO).

Speculative indirect call promotion and inlining is also supported now with CSSPGO, just like baseline AutoFDO.

Tunings and knobs

I created tuning knobs for size growth/cap control, and for a hot threshold separate from the CGSCC inliner. The default values were selected based on initial tuning with CSSPGO.

Results

Evaluated with an internal LLVM fork a couple of months ago, plus another change to adjust the hot-threshold cutoff for context profiles (to be sent after this one), the new inliner shows a ~1% geomean performance win on SPEC2006 with CSSPGO, while also reducing code size. The measurement was done with a train-train setup, MonoLTO with the new pass manager and pseudo-probes. Note that this is just a starting point - we hope the new inliner will open up more opportunities with CSSPGO, but it will certainly take more time and effort to make it fully calibrated and ready for bigger workloads (we're working on it).

Differential Revision: https://reviews.llvm.org/D94001
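To make the shape of the new inliner concrete, below is a minimal, self-contained C++ sketch of a call site prioritized BFS inlining loop with a size cap. It is illustrative only and is not the actual SampleProfileLoader implementation; InlineCandidate, inlineAndCollectNewCallSites, and the size accounting are hypothetical placeholders.

#include <cstdint>
#include <queue>
#include <string>
#include <vector>

// Illustrative placeholder types, not the actual SampleProfileLoader API.
struct InlineCandidate {
  std::string CalleeName; // callee being considered at this call site
  uint64_t CallSiteCount; // context-sensitive call site count (hotness)
  unsigned Size;          // estimated callee size
};

// Hotter call sites come out of the priority queue first.
struct ByHotness {
  bool operator()(const InlineCandidate &A, const InlineCandidate &B) const {
    return A.CallSiteCount < B.CallSiteCount;
  }
};

// Stub hook: a real inliner would perform the inline here and return the
// call sites newly exposed in the inlined body, each with its context profile.
std::vector<InlineCandidate> inlineAndCollectNewCallSites(const InlineCandidate &) {
  return {};
}

void inlineTopDownBFS(std::vector<InlineCandidate> InitialCallSites,
                      unsigned SizeBudget) {
  std::priority_queue<InlineCandidate, std::vector<InlineCandidate>, ByHotness>
      Queue(ByHotness(), std::move(InitialCallSites));
  while (!Queue.empty()) {
    InlineCandidate C = Queue.top();
    Queue.pop();
    if (C.Size > SizeBudget)
      continue; // size cap: skip candidates that would exceed the budget
    SizeBudget -= C.Size;
    // New call sites exposed by this inline go back on the shared queue rather
    // than being recursed into, which yields breadth-first, balanced growth.
    for (const InlineCandidate &NewCS : inlineAndCollectNewCallSites(C))
      Queue.push(NewCS);
  }
}

Replacing the hotness comparator with a constant (equal priority for every call site) would give the queue-based replay variant mentioned above as a possible compile-time improvement for today's FDO inliner.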
1 parent 49c9c3a commit 6bae597

10 files changed: +904 / -81 lines changed


llvm/include/llvm/Transforms/IPO/SampleContextTracker.h

Lines changed: 5 additions & 1 deletion
@@ -23,6 +23,7 @@
 #include "llvm/ProfileData/SampleProf.h"
 #include <list>
 #include <map>
+#include <vector>
 
 using namespace llvm;
 using namespace sampleprof;
@@ -42,7 +43,7 @@ class ContextTrieNode {
         CallSiteLoc(CallLoc){};
   ContextTrieNode *getChildContext(const LineLocation &CallSite,
                                    StringRef CalleeName);
-  ContextTrieNode *getChildContext(const LineLocation &CallSite);
+  ContextTrieNode *getHottestChildContext(const LineLocation &CallSite);
   ContextTrieNode *getOrCreateChildContext(const LineLocation &CallSite,
                                            StringRef CalleeName,
                                            bool AllowCreate = true);
@@ -94,6 +95,9 @@ class SampleContextTracker {
   // call-site. The full context is identified by location of call instruction.
   FunctionSamples *getCalleeContextSamplesFor(const CallBase &Inst,
                                               StringRef CalleeName);
+  // Get samples for indirect call targets for call site at given location.
+  std::vector<const FunctionSamples *>
+  getIndirectCalleeContextSamplesFor(const DILocation *DIL);
   // Query context profile for a given location. The full context
   // is identified by input DILocation.
   FunctionSamples *getContextSamplesFor(const DILocation *DIL);

llvm/lib/Transforms/IPO/SampleContextTracker.cpp

Lines changed: 54 additions & 19 deletions
@@ -30,7 +30,7 @@ namespace llvm {
 ContextTrieNode *ContextTrieNode::getChildContext(const LineLocation &CallSite,
                                                   StringRef CalleeName) {
   if (CalleeName.empty())
-    return getChildContext(CallSite);
+    return getHottestChildContext(CallSite);
 
   uint32_t Hash = nodeHash(CalleeName, CallSite);
   auto It = AllChildContext.find(Hash);
@@ -40,18 +40,22 @@ ContextTrieNode *ContextTrieNode::getChildContext(const LineLocation &CallSite,
 }
 
 ContextTrieNode *
-ContextTrieNode::getChildContext(const LineLocation &CallSite) {
+ContextTrieNode::getHottestChildContext(const LineLocation &CallSite) {
   // CSFDO-TODO: This could be slow, change AllChildContext so we can
   // do point look up for child node by call site alone.
-  // CSFDO-TODO: Return the child with max count for indirect call
+  // Retrieve the child node with max count for indirect call
   ContextTrieNode *ChildNodeRet = nullptr;
+  uint64_t MaxCalleeSamples = 0;
   for (auto &It : AllChildContext) {
     ContextTrieNode &ChildNode = It.second;
-    if (ChildNode.CallSiteLoc == CallSite) {
-      if (ChildNodeRet)
-        return nullptr;
-      else
-        ChildNodeRet = &ChildNode;
+    if (ChildNode.CallSiteLoc != CallSite)
+      continue;
+    FunctionSamples *Samples = ChildNode.getFunctionSamples();
+    if (!Samples)
+      continue;
+    if (Samples->getTotalSamples() > MaxCalleeSamples) {
+      ChildNodeRet = &ChildNode;
+      MaxCalleeSamples = Samples->getTotalSamples();
     }
   }
 
@@ -191,12 +195,12 @@ FunctionSamples *
 SampleContextTracker::getCalleeContextSamplesFor(const CallBase &Inst,
                                                  StringRef CalleeName) {
   LLVM_DEBUG(dbgs() << "Getting callee context for instr: " << Inst << "\n");
-  // CSFDO-TODO: We use CalleeName to differentiate indirect call
-  // We need to get sample for indirect callee too.
   DILocation *DIL = Inst.getDebugLoc();
   if (!DIL)
     return nullptr;
 
+  // For indirect call, CalleeName will be empty, in which case the context
+  // profile for callee with largest total samples will be returned.
   ContextTrieNode *CalleeContext = getCalleeContextFor(DIL, CalleeName);
   if (CalleeContext) {
     FunctionSamples *FSamples = CalleeContext->getFunctionSamples();
@@ -209,6 +213,26 @@ SampleContextTracker::getCalleeContextSamplesFor(const CallBase &Inst,
   return nullptr;
 }
 
+std::vector<const FunctionSamples *>
+SampleContextTracker::getIndirectCalleeContextSamplesFor(
+    const DILocation *DIL) {
+  std::vector<const FunctionSamples *> R;
+  if (!DIL)
+    return R;
+
+  ContextTrieNode *CallerNode = getContextFor(DIL);
+  LineLocation CallSite = FunctionSamples::getCallSiteIdentifier(DIL);
+  for (auto &It : CallerNode->getAllChildContext()) {
+    ContextTrieNode &ChildNode = It.second;
+    if (ChildNode.getCallSiteLoc() != CallSite)
+      continue;
+    if (FunctionSamples *CalleeSamples = ChildNode.getFunctionSamples())
+      R.push_back(CalleeSamples);
+  }
+
+  return R;
+}
+
 FunctionSamples *
 SampleContextTracker::getContextSamplesFor(const DILocation *DIL) {
   assert(DIL && "Expect non-null location");
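As a usage note for the new getIndirectCalleeContextSamplesFor API above, here is a hedged sketch of how a client might rank the returned context profiles when picking speculative indirect call promotion targets. The pickIndirectCallTargets wrapper and the HotCountThreshold parameter are hypothetical and not part of this patch; only getIndirectCalleeContextSamplesFor and getTotalSamples come from the code above.

#include "llvm/Transforms/IPO/SampleContextTracker.h"
#include <algorithm>
#include <cstdint>
#include <vector>

using namespace llvm;
using namespace sampleprof;

// Hypothetical helper: order indirect call targets at a call site by their
// context-sensitive total samples, hottest first, and drop targets below a
// caller-supplied hotness threshold.
std::vector<const FunctionSamples *>
pickIndirectCallTargets(SampleContextTracker &Tracker, const DILocation *DIL,
                        uint64_t HotCountThreshold) {
  std::vector<const FunctionSamples *> Targets =
      Tracker.getIndirectCalleeContextSamplesFor(DIL);
  // Hottest target first.
  std::sort(Targets.begin(), Targets.end(),
            [](const FunctionSamples *A, const FunctionSamples *B) {
              return A->getTotalSamples() > B->getTotalSamples();
            });
  // Keep only targets hot enough to be worth a speculative check.
  Targets.erase(std::remove_if(Targets.begin(), Targets.end(),
                               [&](const FunctionSamples *FS) {
                                 return FS->getTotalSamples() < HotCountThreshold;
                               }),
                Targets.end());
  return Targets;
}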
@@ -295,11 +319,6 @@ void SampleContextTracker::promoteMergeContextSamplesTree(
     const Instruction &Inst, StringRef CalleeName) {
   LLVM_DEBUG(dbgs() << "Promoting and merging context tree for instr: \n"
                     << Inst << "\n");
-  // CSFDO-TODO: We also need to promote context profile from indirect
-  // calls. We won't have callee names from those from call instr.
-  if (CalleeName.empty())
-    return;
-
   // Get the caller context for the call instruction, we don't use callee
   // name from call because there can be context from indirect calls too.
   DILocation *DIL = Inst.getDebugLoc();
@@ -309,6 +328,22 @@ void SampleContextTracker::promoteMergeContextSamplesTree(
 
   // Get the context that needs to be promoted
   LineLocation CallSite = FunctionSamples::getCallSiteIdentifier(DIL);
+  // For indirect call, CalleeName will be empty, in which case we need to
+  // promote all non-inlined child context profiles.
+  if (CalleeName.empty()) {
+    for (auto &It : CallerNode->getAllChildContext()) {
+      ContextTrieNode *NodeToPromo = &It.second;
+      if (CallSite != NodeToPromo->getCallSiteLoc())
+        continue;
+      FunctionSamples *FromSamples = NodeToPromo->getFunctionSamples();
+      if (FromSamples && FromSamples->getContext().hasState(InlinedContext))
+        continue;
+      promoteMergeContextSamplesTree(*NodeToPromo);
+    }
+    return;
+  }
+
+  // Get the context for the given callee that needs to be promoted
   ContextTrieNode *NodeToPromo =
       CallerNode->getChildContext(CallSite, CalleeName);
   if (!NodeToPromo)
@@ -328,6 +363,8 @@ ContextTrieNode &SampleContextTracker::promoteMergeContextSamplesTree(
   LLVM_DEBUG(dbgs() << "  Found context tree root to promote: "
                     << FromSamples->getContext() << "\n");
 
+  assert(!FromSamples->getContext().hasState(InlinedContext) &&
+         "Shouldn't promote inlined context profile");
   StringRef ContextStrToRemove = FromSamples->getContext().getCallingContext();
   return promoteMergeContextSamplesTree(NodeToPromo, RootContext,
                                         ContextStrToRemove);
@@ -360,14 +397,12 @@ SampleContextTracker::getCalleeContextFor(const DILocation *DIL,
                                           StringRef CalleeName) {
   assert(DIL && "Expect non-null location");
 
-  // CSSPGO-TODO: need to support indirect callee
-  if (CalleeName.empty())
-    return nullptr;
-
   ContextTrieNode *CallContext = getContextFor(DIL);
   if (!CallContext)
     return nullptr;
 
+  // When CalleeName is empty, the child context profile with max
+  // total samples will be returned.
   return CallContext->getChildContext(
       FunctionSamples::getCallSiteIdentifier(DIL), CalleeName);
 }
