[MemProf] Handle missing tail call frames #75823
If tail call optimization was not disabled for the profiled binary, the call contexts will be missing frames for tail calls. Handle this by performing a limited search through tail call edges for the profiled callee when a discontinuity is detected. The search depth is adjustable but defaults to 5. If we are able to identify a short sequence of tail calls, update the graph for those calls. In the case of ThinLTO, synthesize the necessary CallsiteInfos for carrying the cloning information to the backends.
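For orientation, here is a minimal standalone sketch of that search (simplified signatures and a hypothetical helper name; the actual implementations are the findProfiledCalleeThroughTailCalls routines in the diff below, which additionally handle aliases and the summary-based ThinLTO case):

```cpp
#include "llvm/ADT/SmallVector.h"
#include "llvm/IR/Function.h"
#include "llvm/IR/Instructions.h"
using namespace llvm;

// Walk tail calls starting from the callee seen in the IR, looking for the
// profiled callee; records the discovered chain in callee-to-caller order.
static bool findThroughTailCalls(const Function *ProfiledCallee,
                                 const Function *CurCallee, unsigned Depth,
                                 unsigned MaxDepth,
                                 SmallVectorImpl<const Instruction *> &Chain) {
  if (Depth > MaxDepth)
    return false;
  for (const BasicBlock &BB : *CurCallee)
    for (const Instruction &I : BB) {
      const auto *CB = dyn_cast<CallBase>(&I);
      // Only calls marked as tail calls can correspond to missing frames.
      if (!CB || !CB->isTailCall())
        continue;
      const Function *Called = CB->getCalledFunction();
      if (!Called)
        continue;
      // Direct hit, or reachable through a deeper chain of tail calls.
      if (Called == ProfiledCallee ||
          findThroughTailCalls(ProfiledCallee, Called, Depth + 1, MaxDepth,
                               Chain)) {
        Chain.push_back(&I);
        return true;
      }
    }
  return false;
}
```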
@llvm/pr-subscribers-lto @llvm/pr-subscribers-llvm-transforms

Author: Teresa Johnson (teresajohnson)

Changes: If tail call optimization was not disabled for the profiled binary, the call contexts will be missing frames for tail calls. Handle this by performing a limited search through tail call edges for the profiled callee when a discontinuity is detected. The search depth is adjustable but defaults to 5. If we are able to identify a short sequence of tail calls, update the graph for those calls. In the case of ThinLTO, synthesize the necessary CallsiteInfos for carrying the cloning information to the backends.

Patch is 36.47 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/75823.diff

5 Files Affected:
diff --git a/llvm/include/llvm/IR/ModuleSummaryIndex.h b/llvm/include/llvm/IR/ModuleSummaryIndex.h
index e72f74ad4adb66..66c7d10d823d9c 100644
--- a/llvm/include/llvm/IR/ModuleSummaryIndex.h
+++ b/llvm/include/llvm/IR/ModuleSummaryIndex.h
@@ -1011,6 +1011,12 @@ class FunctionSummary : public GlobalValueSummary {
return *Callsites;
}
+ void addCallsite(CallsiteInfo &Callsite) {
+ if (!Callsites)
+ Callsites = std::make_unique<CallsitesTy>();
+ Callsites->push_back(Callsite);
+ }
+
ArrayRef<AllocInfo> allocs() const {
if (Allocs)
return *Allocs;
diff --git a/llvm/lib/Bitcode/Writer/BitcodeWriter.cpp b/llvm/lib/Bitcode/Writer/BitcodeWriter.cpp
index 8fca569a391baf..a5fc267b1883bf 100644
--- a/llvm/lib/Bitcode/Writer/BitcodeWriter.cpp
+++ b/llvm/lib/Bitcode/Writer/BitcodeWriter.cpp
@@ -459,9 +459,24 @@ class IndexBitcodeWriter : public BitcodeWriterBase {
// Record all stack id indices actually used in the summary entries being
// written, so that we can compact them in the case of distributed ThinLTO
// indexes.
- for (auto &CI : FS->callsites())
+ for (auto &CI : FS->callsites()) {
+ // If the stack id list is empty, this callsite info was synthesized for
+ // a missing tail call frame. Ensure that the callee's GUID gets a value
+ // id. Normally we only generate these for defined summaries, which in
+ // the case of distributed ThinLTO is only the functions already defined
+ // in the module or that we want to import. We don't bother to include
+ // all the callee symbols as they aren't normally needed in the backend.
+ // However, for the synthesized callsite infos we do need the callee
+ // GUID in the backend so that we can correlate the identified callee
+ // with this callsite info (which for non-tail calls is done by the
+ // ordering of the callsite infos and verified via stack ids).
+ if (CI.StackIdIndices.empty()) {
+ GUIDToValueIdMap[CI.Callee.getGUID()] = ++GlobalValueId;
+ continue;
+ }
for (auto Idx : CI.StackIdIndices)
StackIdIndices.push_back(Idx);
+ }
for (auto &AI : FS->allocs())
for (auto &MIB : AI.MIBs)
for (auto Idx : MIB.StackIdIndices)
diff --git a/llvm/lib/Transforms/IPO/MemProfContextDisambiguation.cpp b/llvm/lib/Transforms/IPO/MemProfContextDisambiguation.cpp
index 70a3f3067d9d6d..59f982cd420bb6 100644
--- a/llvm/lib/Transforms/IPO/MemProfContextDisambiguation.cpp
+++ b/llvm/lib/Transforms/IPO/MemProfContextDisambiguation.cpp
@@ -77,6 +77,14 @@ STATISTIC(MaxAllocVersionsThinBackend,
"allocation during ThinLTO backend");
STATISTIC(UnclonableAllocsThinBackend,
"Number of unclonable ambigous allocations during ThinLTO backend");
+STATISTIC(RemovedEdgesWithMismatchedCallees,
+ "Number of edges removed due to mismatched callees (profiled vs IR)");
+STATISTIC(FoundProfiledCalleeCount,
+ "Number of profiled callees found via tail calls");
+STATISTIC(FoundProfiledCalleeDepth,
+ "Aggregate depth of profiled callees found via tail calls");
+STATISTIC(FoundProfiledCalleeMaxDepth,
+ "Maximum depth of profiled callees found via tail calls");
static cl::opt<std::string> DotFilePathPrefix(
"memprof-dot-file-path-prefix", cl::init(""), cl::Hidden,
@@ -104,6 +112,12 @@ static cl::opt<std::string> MemProfImportSummary(
cl::desc("Import summary to use for testing the ThinLTO backend via opt"),
cl::Hidden);
+static cl::opt<unsigned>
+ TailCallSearchDepth("memprof-tail-call-search-depth", cl::init(5),
+ cl::Hidden,
+ cl::desc("Max depth to recursively search for missing "
+ "frames through tail calls."));
+
namespace llvm {
// Indicate we are linking with an allocator that supports hot/cold operator
// new interfaces.
@@ -365,8 +379,7 @@ class CallsiteContextGraph {
/// Save lists of calls with MemProf metadata in each function, for faster
/// iteration.
- std::vector<std::pair<FuncTy *, std::vector<CallInfo>>>
- FuncToCallsWithMetadata;
+ std::map<FuncTy *, std::vector<CallInfo>> FuncToCallsWithMetadata;
/// Map from callsite node to the enclosing caller function.
std::map<const ContextNode *, const FuncTy *> NodeToCallingFunc;
@@ -411,9 +424,25 @@ class CallsiteContextGraph {
return static_cast<const DerivedCCG *>(this)->getStackId(IdOrIndex);
}
- /// Returns true if the given call targets the given function.
- bool calleeMatchesFunc(CallTy Call, const FuncTy *Func) {
- return static_cast<DerivedCCG *>(this)->calleeMatchesFunc(Call, Func);
+ /// Returns true if the given call targets the callee of the given edge, or if
+ /// we were able to identify the call chain through intermediate tail calls.
+ /// In the latter case new context nodes are added to the graph for the
+ /// identified tail calls, and their synthesized nodes are added to
+ /// TailCallToContextNodeMap. The EdgeIter is updated in either case to the
+ /// next element after the input position (either incremented or updated after
+ /// removing the old edge).
+ bool
+ calleesMatch(CallTy Call, EdgeIter &EI,
+ MapVector<CallInfo, ContextNode *> &TailCallToContextNodeMap);
+
+ /// Returns true if the given call targets the given function, or if we were
+ /// able to identify the call chain through intermediate tail calls (in which
+ /// case FoundCalleeChain will be populated).
+ bool calleeMatchesFunc(
+ CallTy Call, const FuncTy *Func, const FuncTy *CallerFunc,
+ std::vector<std::pair<CallTy, FuncTy *>> &FoundCalleeChain) {
+ return static_cast<DerivedCCG *>(this)->calleeMatchesFunc(
+ Call, Func, CallerFunc, FoundCalleeChain);
}
/// Get a list of nodes corresponding to the stack ids in the given
@@ -553,7 +582,12 @@ class ModuleCallsiteContextGraph
Instruction *>;
uint64_t getStackId(uint64_t IdOrIndex) const;
- bool calleeMatchesFunc(Instruction *Call, const Function *Func);
+ bool calleeMatchesFunc(
+ Instruction *Call, const Function *Func, const Function *CallerFunc,
+ std::vector<std::pair<Instruction *, Function *>> &FoundCalleeChain);
+ bool findProfiledCalleeThroughTailCalls(
+ const Function *ProfiledCallee, Value *CurCallee, unsigned Depth,
+ std::vector<std::pair<Instruction *, Function *>> &FoundCalleeChain);
uint64_t getLastStackId(Instruction *Call);
std::vector<uint64_t> getStackIdsWithContextNodesForCall(Instruction *Call);
void updateAllocationCall(CallInfo &Call, AllocationType AllocType);
@@ -606,12 +640,30 @@ class IndexCallsiteContextGraph
function_ref<bool(GlobalValue::GUID, const GlobalValueSummary *)>
isPrevailing);
+ ~IndexCallsiteContextGraph() {
+ // Now that we are done with the graph it is safe to add the new
+ // CallsiteInfo structs to the function summary vectors. The graph nodes
+ // point into locations within these vectors, so we don't want to add them
+ // any earlier.
+ for (auto &I : FunctionCalleesToSynthesizedCallsiteInfos) {
+ auto *FS = I.first;
+ for (auto &Callsite : I.second)
+ FS->addCallsite(*Callsite.second);
+ }
+ }
+
private:
friend CallsiteContextGraph<IndexCallsiteContextGraph, FunctionSummary,
IndexCall>;
uint64_t getStackId(uint64_t IdOrIndex) const;
- bool calleeMatchesFunc(IndexCall &Call, const FunctionSummary *Func);
+ bool calleeMatchesFunc(
+ IndexCall &Call, const FunctionSummary *Func,
+ const FunctionSummary *CallerFunc,
+ std::vector<std::pair<IndexCall, FunctionSummary *>> &FoundCalleeChain);
+ bool findProfiledCalleeThroughTailCalls(
+ ValueInfo ProfiledCallee, ValueInfo CurCallee, unsigned Depth,
+ std::vector<std::pair<IndexCall, FunctionSummary *>> &FoundCalleeChain);
uint64_t getLastStackId(IndexCall &Call);
std::vector<uint64_t> getStackIdsWithContextNodesForCall(IndexCall &Call);
void updateAllocationCall(CallInfo &Call, AllocationType AllocType);
@@ -630,6 +682,16 @@ class IndexCallsiteContextGraph
std::map<const FunctionSummary *, ValueInfo> FSToVIMap;
const ModuleSummaryIndex &Index;
+ function_ref<bool(GlobalValue::GUID, const GlobalValueSummary *)>
+ isPrevailing;
+
+ // Saves/owns the callsite info structures synthesized for missing tail call
+ // frames that we discover while building the graph.
+ // It maps from the summary of the function making the tail call, to a map
+ // of callee ValueInfo to corresponding synthesized callsite info.
+ std::map<FunctionSummary *,
+ std::map<ValueInfo, std::unique_ptr<CallsiteInfo>>>
+ FunctionCalleesToSynthesizedCallsiteInfos;
};
} // namespace
@@ -1493,7 +1555,7 @@ ModuleCallsiteContextGraph::ModuleCallsiteContextGraph(
}
}
if (!CallsWithMetadata.empty())
- FuncToCallsWithMetadata.push_back({&F, CallsWithMetadata});
+ FuncToCallsWithMetadata[&F] = CallsWithMetadata;
}
if (DumpCCG) {
@@ -1518,7 +1580,7 @@ IndexCallsiteContextGraph::IndexCallsiteContextGraph(
ModuleSummaryIndex &Index,
function_ref<bool(GlobalValue::GUID, const GlobalValueSummary *)>
isPrevailing)
- : Index(Index) {
+ : Index(Index), isPrevailing(isPrevailing) {
for (auto &I : Index) {
auto VI = Index.getValueInfo(I);
for (auto &S : VI.getSummaryList()) {
@@ -1572,7 +1634,7 @@ IndexCallsiteContextGraph::IndexCallsiteContextGraph(
CallsWithMetadata.push_back({&SN});
if (!CallsWithMetadata.empty())
- FuncToCallsWithMetadata.push_back({FS, CallsWithMetadata});
+ FuncToCallsWithMetadata[FS] = CallsWithMetadata;
if (!FS->allocs().empty() || !FS->callsites().empty())
FSToVIMap[FS] = VI;
@@ -1604,6 +1666,11 @@ void CallsiteContextGraph<DerivedCCG, FuncTy,
// this transformation for regular LTO, and for ThinLTO we can simulate that
// effect in the summary and perform the actual speculative devirtualization
// while cloning in the ThinLTO backend.
+
+ // Keep track of the new nodes synthesizes for discovered tail calls missing
+ // from the profiled contexts.
+ MapVector<CallInfo, ContextNode *> TailCallToContextNodeMap;
+
for (auto Entry = NonAllocationCallToContextNodeMap.begin();
Entry != NonAllocationCallToContextNodeMap.end();) {
auto *Node = Entry->second;
@@ -1611,13 +1678,17 @@ void CallsiteContextGraph<DerivedCCG, FuncTy,
// Check all node callees and see if in the same function.
bool Removed = false;
auto Call = Node->Call.call();
- for (auto &Edge : Node->CalleeEdges) {
- if (!Edge->Callee->hasCall())
+ for (auto EI = Node->CalleeEdges.begin(); EI != Node->CalleeEdges.end();) {
+ auto Edge = *EI;
+ if (!Edge->Callee->hasCall()) {
+ ++EI;
continue;
+ }
assert(NodeToCallingFunc.count(Edge->Callee));
// Check if the called function matches that of the callee node.
- if (calleeMatchesFunc(Call, NodeToCallingFunc[Edge->Callee]))
+ if (calleesMatch(Call, EI, TailCallToContextNodeMap))
continue;
+ RemovedEdgesWithMismatchedCallees++;
// Work around by setting Node to have a null call, so it gets
// skipped during cloning. Otherwise assignFunctions will assert
// because its data structures are not designed to handle this case.
@@ -1629,6 +1700,11 @@ void CallsiteContextGraph<DerivedCCG, FuncTy,
if (!Removed)
Entry++;
}
+
+ // Add the new nodes after the above loop so that the iteration is not
+ // invalidated.
+ for (auto I : TailCallToContextNodeMap)
+ NonAllocationCallToContextNodeMap[I.first] = I.second;
}
uint64_t ModuleCallsiteContextGraph::getStackId(uint64_t IdOrIndex) const {
@@ -1642,8 +1718,160 @@ uint64_t IndexCallsiteContextGraph::getStackId(uint64_t IdOrIndex) const {
return Index.getStackIdAtIndex(IdOrIndex);
}
-bool ModuleCallsiteContextGraph::calleeMatchesFunc(Instruction *Call,
- const Function *Func) {
+template <typename DerivedCCG, typename FuncTy, typename CallTy>
+bool CallsiteContextGraph<DerivedCCG, FuncTy, CallTy>::calleesMatch(
+ CallTy Call, EdgeIter &EI,
+ MapVector<CallInfo, ContextNode *> &TailCallToContextNodeMap) {
+ auto Edge = *EI;
+ const FuncTy *ProfiledCalleeFunc = NodeToCallingFunc[Edge->Callee];
+ const FuncTy *CallerFunc = NodeToCallingFunc[Edge->Caller];
+ // Will be populated in order of callee to caller if we find a chain of tail
+ // calls between the profiled caller and callee.
+ std::vector<std::pair<CallTy, FuncTy *>> FoundCalleeChain;
+ if (!calleeMatchesFunc(Call, ProfiledCalleeFunc, CallerFunc,
+ FoundCalleeChain)) {
+ ++EI;
+ return false;
+ }
+
+ // The usual case where the profiled callee matches that of the IR/summary.
+ if (FoundCalleeChain.empty()) {
+ ++EI;
+ return true;
+ }
+
+ auto AddEdge = [Edge, &EI](ContextNode *Caller, ContextNode *Callee) {
+ auto *CurEdge = Callee->findEdgeFromCaller(Caller);
+ // If there is already an edge between these nodes, simply update it and
+ // return.
+ if (CurEdge) {
+ CurEdge->ContextIds.insert(Edge->ContextIds.begin(),
+ Edge->ContextIds.end());
+ CurEdge->AllocTypes |= Edge->AllocTypes;
+ return;
+ }
+ // Otherwise, create a new edge and insert it into the caller and callee
+ // lists.
+ auto NewEdge = std::make_shared<ContextEdge>(
+ Callee, Caller, Edge->AllocTypes, Edge->ContextIds);
+ Callee->CallerEdges.push_back(NewEdge);
+ if (Caller == Edge->Caller) {
+ // If we are inserting the new edge into the current edge's caller, insert
+ // the new edge before the current iterator position, and then increment
+ // back to the current edge.
+ EI = Caller->CalleeEdges.insert(EI, NewEdge);
+ ++EI;
+ assert(*EI == Edge);
+ } else
+ Caller->CalleeEdges.push_back(NewEdge);
+ };
+
+ // Create new nodes for each found callee and connect in between the profiled
+ // caller and callee.
+ auto *CurCalleeNode = Edge->Callee;
+ for (auto I : FoundCalleeChain) {
+ auto &NewCall = I.first;
+ auto *Func = I.second;
+ ContextNode *NewNode = nullptr;
+ // First check if we have already synthesized a node for this tail call.
+ if (TailCallToContextNodeMap.count(NewCall)) {
+ NewNode = TailCallToContextNodeMap[NewCall];
+ NewNode->ContextIds.insert(Edge->ContextIds.begin(),
+ Edge->ContextIds.end());
+ NewNode->AllocTypes |= Edge->AllocTypes;
+ } else {
+ FuncToCallsWithMetadata[Func].push_back({NewCall});
+ // Create Node and record node info.
+ NodeOwner.push_back(
+ std::make_unique<ContextNode>(/*IsAllocation=*/false, NewCall));
+ NewNode = NodeOwner.back().get();
+ NodeToCallingFunc[NewNode] = Func;
+ TailCallToContextNodeMap[NewCall] = NewNode;
+ NewNode->ContextIds = Edge->ContextIds;
+ NewNode->AllocTypes = Edge->AllocTypes;
+ }
+
+ // Hook up node to its callee node
+ AddEdge(NewNode, CurCalleeNode);
+
+ CurCalleeNode = NewNode;
+ }
+
+ // Hook up edge's original caller to new callee node.
+ AddEdge(Edge->Caller, CurCalleeNode);
+
+ // Remove old edge
+ Edge->Callee->eraseCallerEdge(Edge.get());
+ EI = Edge->Caller->CalleeEdges.erase(EI);
+
+ return true;
+}
+
+bool ModuleCallsiteContextGraph::findProfiledCalleeThroughTailCalls(
+ const Function *ProfiledCallee, Value *CurCallee, unsigned Depth,
+ std::vector<std::pair<Instruction *, Function *>> &FoundCalleeChain) {
+ // Stop recursive search if we have already explored the maximum specified
+ // depth.
+ if (Depth > TailCallSearchDepth)
+ return false;
+
+ auto SaveCallsiteInfo = [&](Instruction *Callsite, Function *F) {
+ FoundCalleeChain.push_back({Callsite, F});
+ };
+
+ auto *CalleeFunc = dyn_cast<Function>(CurCallee);
+ if (!CalleeFunc) {
+ auto *Alias = dyn_cast<GlobalAlias>(CurCallee);
+ assert(Alias);
+ CalleeFunc = dyn_cast<Function>(Alias->getAliasee());
+ assert(CalleeFunc);
+ }
+
+ // Look for tail calls in this function, and check if they either call the
+ // profiled callee directly, or indirectly (via a recursive search).
+ for (auto &BB : *CalleeFunc) {
+ for (auto &I : BB) {
+ auto *CB = dyn_cast<CallBase>(&I);
+ if (!CB->isTailCall())
+ continue;
+ auto *CalledValue = CB->getCalledOperand();
+ auto *CalledFunction = CB->getCalledFunction();
+ if (CalledValue && !CalledFunction) {
+ CalledValue = CalledValue->stripPointerCasts();
+ // Stripping pointer casts can reveal a called function.
+ CalledFunction = dyn_cast<Function>(CalledValue);
+ }
+ // Check if this is an alias to a function. If so, get the
+ // called aliasee for the checks below.
+ if (auto *GA = dyn_cast<GlobalAlias>(CalledValue)) {
+ assert(!CalledFunction &&
+ "Expected null called function in callsite for alias");
+ CalledFunction = dyn_cast<Function>(GA->getAliaseeObject());
+ }
+ if (!CalledFunction)
+ continue;
+ if (CalledFunction == ProfiledCallee) {
+ FoundProfiledCalleeCount++;
+ FoundProfiledCalleeDepth += Depth;
+ if (Depth > FoundProfiledCalleeMaxDepth)
+ FoundProfiledCalleeMaxDepth = Depth;
+ SaveCallsiteInfo(&I, CalleeFunc);
+ return true;
+ }
+ if (findProfiledCalleeThroughTailCalls(ProfiledCallee, CalledFunction,
+ Depth + 1, FoundCalleeChain)) {
+ SaveCallsiteInfo(&I, CalleeFunc);
+ return true;
+ }
+ }
+ }
+
+ return false;
+}
+
+bool ModuleCallsiteContextGraph::calleeMatchesFunc(
+ Instruction *Call, const Function *Func, const Function *CallerFunc,
+ std::vector<std::pair<Instruction *, Function *>> &FoundCalleeChain) {
auto *CB = dyn_cast<CallBase>(Call);
if (!CB->getCalledOperand())
return false;
@@ -1652,11 +1880,96 @@ bool ModuleCallsiteContextGraph::calleeMatchesFunc(Instruction *Call,
if (CalleeFunc == Func)
return true;
auto *Alias = dyn_cast<GlobalAlias>(CalleeVal);
- return Alias && Alias->getAliasee() == Func;
+ if (Alias && Alias->getAliasee() == Func)
+ return true;
+
+ // Recursively search for the profiled callee through tail calls starting with
+ // the actual Callee. The discovered tail call chain is saved in
+ // FoundCalleeChain, and we will fixup the graph to include these callsites
+ // after returning.
+ // FIXME: We will currently redo the same recursive walk if we find the same
+ // mismatched callee from another callsite. We can improve this with more
+ // bookkeeping of the created chain of new nodes for each mismatch.
+ unsigned Depth = 1;
+ if (!findProfiledCalleeThroughTailCalls(Func, CalleeVal, Depth,
+ FoundCalleeChain)) {
+ LLVM_DEBUG(dbgs() << "Not found through tail calls: " << Func->getName()
+ << " from " << CallerFunc->getName()
+ << " that actually called " << CalleeVal->getName()
+ << "\n");
+ return false;
+ }
+
+ return true;
}
-bool IndexCallsiteContextGraph::calleeMatchesFunc(IndexCall &Call,
- const FunctionSummary *Func) {
+bool IndexCallsiteContextGraph::findProfiledCalleeThroughTailCalls(
+ ValueInfo ProfiledCallee, ValueInfo CurCallee, unsigned Depth,
+ std::vector<std::pair<IndexCall, FunctionSummary *>> &FoundCalleeChain) {
+ // Stop recursive search if we have already explored the maximum specified
+ // depth.
+ if (Depth > TailCallSearchDepth)
+ return false;
+
+ auto CreateAndSaveCallsiteInfo = [&](ValueInfo Callee, FunctionSummary *FS) {
+ // Make a CallsiteInfo for each discovered callee, if one hasn't already
+ // been synthesized.
+ if (!FunctionCalleesToSynthesizedCallsiteInfos.count(FS) ||
+ !FunctionCalleesToSynthesizedCallsiteInfos[FS].count(Callee))
+ // StackIds is empty (we don't have debug info available in the index for
+ // these callsites)
+ FunctionCal...
[truncated]
Hi, @teresajohnson. I have one question. For example, suppose there are two possible call chains between the profiled caller and callee. It seems the search would find only one of them; is that right?
Another question: is it possible that tail calls are introduced in the thin backend for the instrumented binary? If so, can this patch handle it? Or should the instrumented binary not be compiled with LTO/ThinLTO?
Yes, I should have mentioned that this algorithm is greedy and therefore won't currently find all possible paths, so we would miss the second case here. I was thinking it was a simple extension to handle all paths, but there is a slight complication. I hacked up the change to see how difficult it would be and it works for a test case like this, but I need to do a bit broader testing of it. Not sure yet whether it is worth handling this in this patch or as a follow-on. Let me do a bit more investigation on my prototyped change and circle back.

If we're missing a frame for something not marked as a tail call during the LTO link, then this approach would not be able to recover it. We only search through calls marked as tail calls at that phase to keep the overhead smaller.
Still reading..
@@ -1604,20 +1666,29 @@ void CallsiteContextGraph<DerivedCCG, FuncTy,
   // this transformation for regular LTO, and for ThinLTO we can simulate that
   // effect in the summary and perform the actual speculative devirtualization
   // while cloning in the ThinLTO backend.
+
+  // Keep track of the new nodes synthesizes for discovered tail calls missing
typo: synthesized
done
@@ -365,8 +379,7 @@ class CallsiteContextGraph {

   /// Save lists of calls with MemProf metadata in each function, for faster
   /// iteration.
-  std::vector<std::pair<FuncTy *, std::vector<CallInfo>>>
-      FuncToCallsWithMetadata;
+  std::map<FuncTy *, std::vector<CallInfo>> FuncToCallsWithMetadata;
An ordered map indexed by a pointer sounds like a recipe for non-deterministic output. Can we use an unordered_map instead if determinism is unaffected, or use a different key if ordering is required?
Good catch. I do have a concern after looking through the uses of this map that one of the iterations could result in non-deterministic behavior. Since I am not sure the best way to index using a key with a deterministic order, I have changed this to MapVector.
Edit: yep, this exposed an ordering non-determinism, in the order the cloned functions were generated and thus emitted into the IR, resulting in a test fix.
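For illustration, a self-contained toy (not code from this patch) showing why the switch matters: a std::map keyed by pointers iterates in pointer-value order, which can change from run to run, while llvm::MapVector iterates in insertion order:

```cpp
#include "llvm/ADT/MapVector.h"
#include <map>

struct Func {}; // stand-in for the FuncTy* key

void demo() {
  Func A, B, C;
  // std::map keyed by pointer iterates in pointer-value order, which can
  // vary run to run (allocator behavior, ASLR), so anything emitted while
  // iterating may come out in a different order on each run.
  std::map<Func *, int> PtrKeyed{{&C, 2}, {&A, 0}, {&B, 1}};
  // llvm::MapVector iterates in insertion order: deterministic regardless
  // of where the keys happen to be allocated.
  llvm::MapVector<Func *, int> MV;
  MV[&C] = 2;
  MV[&A] = 0;
  MV[&B] = 1;
  for (auto &[F, Val] : MV) {
    // Always visits &C, &A, &B in that order.
    (void)F;
    (void)Val;
  }
}
```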
+  // frames that we discover while building the graph.
+  // It maps from the summary of the function making the tail call, to a map
+  // of callee ValueInfo to corresponding synthesized callsite info.
+  std::map<FunctionSummary *,
Same concern of determinism here.
No ordering requirements for this one, so have changed to unordered_map.
+
+  // Add the new nodes after the above loop so that the iteration is not
+  // invalidated.
+  for (auto I : TailCallToContextNodeMap)
auto& to avoid a copy
done
@@ -630,6 +682,16 @@ class IndexCallsiteContextGraph
   std::map<const FunctionSummary *, ValueInfo> FSToVIMap;

   const ModuleSummaryIndex &Index;
+  function_ref<bool(GlobalValue::GUID, const GlobalValueSummary *)>
Spell out llvm::function_ref (since there exists std::function_ref too)?
There are a bunch of existing uses of this type and none are fully spelled out (I checked that it does use the llvm version, however). How about I do a follow on NFC patch that adds the llvm:: prefix to all?
Sure, sounds good to me.
+      // back to the current edge.
+      EI = Caller->CalleeEdges.insert(EI, NewEdge);
+      ++EI;
+      assert(*EI == Edge);
Add a message for the assertion in LLVM style, like assert(cond && "message")?
done
The behaviour I prefer is that we only perform cloning for unambiguous cases (as far as we can get with dynamic profiles). I think we should enumerate all paths up front and treat multiple paths as unreachable. This mimics the behaviour of the missing frame inference logic in SampleProf. I'm wary of the downsides of false positives since we may degrade performance. Do you have any stats on unambiguous matches vs total matches? Handling in a separate patch sounds fine to me.
This is a good point, and in fact much simpler and likely to be effective in the vast majority of cases: I had earlier collected stats from a large application profiled with tail call optimization enabled, and found that 98% of the profiled callees were found via only 1 level of tail call. In that case we know for sure the function of the missing frame (from the call in the IR of the profiled caller frame). What we don't know is which tail call it is if there are multiple on different calls to the same callee. For now my handling will be somewhat imperfect in this case (we might clone the wrong one), but I think this is probably not a common enough case to be too concerned with unless we see otherwise.

When I am back from the holiday I may just modify the patch to be conservative if there are multiple paths through tail calls in different functions.
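A rough sketch of what that conservative check could look like (hypothetical name and structure, not the actual patch): count the distinct tail call chains, capped at two, and only perform the graph fixup when exactly one is found:

```cpp
#include "llvm/IR/Function.h"
#include "llvm/IR/Instructions.h"
using namespace llvm;

// Returns the number of distinct tail-call chains (capped at 2) from
// CurCallee to ProfiledCallee within MaxDepth. A result of 0 means not
// found; a result greater than 1 means the match is ambiguous and the
// caller should skip the graph fixup entirely.
static unsigned countTailCallChains(const Function *ProfiledCallee,
                                    const Function *CurCallee, unsigned Depth,
                                    unsigned MaxDepth) {
  if (Depth > MaxDepth)
    return 0;
  unsigned Found = 0;
  for (const BasicBlock &BB : *CurCallee)
    for (const Instruction &I : BB) {
      const auto *CB = dyn_cast<CallBase>(&I);
      if (!CB || !CB->isTailCall())
        continue;
      if (const Function *Called = CB->getCalledFunction()) {
        Found += (Called == ProfiledCallee)
                     ? 1
                     : countTailCallChains(ProfiledCallee, Called, Depth + 1,
                                           MaxDepth);
        if (Found > 1)
          return Found; // Ambiguous; no need to keep searching.
      }
    }
  return Found;
}
```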
I have addressed the current round of comments from @snehasish , but haven't yet made the change to ignore multiple paths. Going to do that now.
Added the handling to treat multiple tail call chains conservatively, along with a new test for that case.
lgtm with some optional suggestions.
;; Test to make sure that missing tail call frames in memprof profiles are
;; identified but not cloned when there are multiple non-unique possible
;; tail call chains between the profiled frames.
Should we include the source from which these test.ll files were generated for posterity?
I didn't end up including that here because I did a lot of manual clean up of the original test (for example there were calls to other functions like sleep and delete etc). So I'm not convinced there is a lot of value in including that here.
@@ -104,6 +114,12 @@ static cl::opt<std::string> MemProfImportSummary(
     cl::desc("Import summary to use for testing the ThinLTO backend via opt"),
     cl::Hidden);

+static cl::opt<unsigned>
+    TailCallSearchDepth("memprof-tail-call-search-depth", cl::init(5),
Do you think we need to have an option, or is a hard-coded limit fine? Given our strategy of selecting only unambiguous paths, increasing the depth will increase both runtime and the chance of an ambiguous path. So then we probably won't tune it, and I'd prefer not to introduce a flag which we never change. What do you think?
Personally I dislike hardcoded thresholds that can't be adjusted at runtime. For debugging and analysis it is sometimes really handy to have a command line flag to change these types of limits. So I'd prefer to keep it as an option for now.
I added testing of this option to one of the tests. It also provides a handy way to disable the tail call discovery if we encounter issues.
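For example, since the search starts at depth 1, setting the limit to 0 turns the discovery off entirely (an illustrative opt invocation; the exact pipeline used in the tests may differ):

```
opt -passes=memprof-context-disambiguation \
    -memprof-tail-call-search-depth=0 -disable-output input.ll
```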
+  // Create new nodes for each found callee and connect in between the profiled
+  // caller and callee.
+  auto *CurCalleeNode = Edge->Callee;
+  for (auto I : FoundCalleeChain) {
nit: auto& [NewCall, Func] : FoundCalleeChain
done
+
+  // Add the new nodes after the above loop so that the iteration is not
+  // invalidated.
+  for (auto &I : TailCallToContextNodeMap)
A structured binding with variable names would be more readable than the I.first and I.second usage right now.
done
 }

-bool IndexCallsiteContextGraph::calleeMatchesFunc(IndexCall &Call,
-                                                  const FunctionSummary *Func) {
+bool IndexCallsiteContextGraph::findProfiledCalleeThroughTailCalls(
[optional] I wonder if we can refactor the Index and Module version of this method by factoring out the FSToVI lookup as a functor as we have done in other places (if I recall correctly)?
There are pieces of it that could be refactored out, but it is hard to share most of it because there is a lot that is specific to whether this is IR or Summary. So I opted to just keep the versions separate.
As suggested in llvm#75823, to avoid confusion with std::function_ref, qualify all uses with llvm:: (we were already using the llvm version, but this avoids ambiguity).
Should fix buildbot failure https://lab.llvm.org/buildbot/#/builders/54/builds/8451 from llvm#75823.