
[MemProf] Handle missing tail call frames #75823


Merged
teresajohnson merged 5 commits into llvm:main on Jan 11, 2024

Conversation

teresajohnson
Contributor

If tail call optimization was not disabled for the profiled binary, the
call contexts will be missing frames for tail calls. Handle this by
performing a limited search through tail call edges for the profiled
callee when a discontinuity is detected. The search depth is adjustable
but defaults to 5.

If we are able to identify a short sequence of tail calls, update the
graph for those calls. In the case of ThinLTO, synthesize the necessary
CallsiteInfos for carrying the cloning information to the backends.
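
To make the problem concrete, here is a minimal source-level sketch (hypothetical code, not from the patch):

// Hypothetical example: bar tail-calls baz, so bar's frame is elided at
// run time. The profiled context for the allocation then reads foo -> baz,
// with the bar frame missing; that is the discontinuity this patch detects
// and repairs by searching tail call edges in the IR.
int *baz() { return new int[16]; } // allocation site with MemProf data

int *bar() {
  return baz(); // eligible for tail call optimization: no bar frame
}

int *foo() {
  return bar(); // unwound profile stack appears as foo -> baz
}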

@llvmbot llvmbot added LTO Link time optimization (regular/full LTO or ThinLTO) llvm:ir llvm:transforms labels Dec 18, 2023
@llvmbot
Member

llvmbot commented Dec 18, 2023

@llvm/pr-subscribers-lto
@llvm/pr-subscribers-llvm-ir

@llvm/pr-subscribers-llvm-transforms

Author: Teresa Johnson (teresajohnson)

Changes

If tail call optimization was not disabled for the profiled binary, the
call contexts will be missing frames for tail calls. Handle this by
performing a limited search through tail call edges for the profiled
callee when a discontinuity is detected. The search depth is adjustable
but defaults to 5.

If we are able to identify a short sequence of tail calls, update the
graph for those calls. In the case of ThinLTO, synthesize the necessary
CallsiteInfos for carrying the cloning information to the backends.


Patch is 36.47 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/75823.diff

5 Files Affected:

  • (modified) llvm/include/llvm/IR/ModuleSummaryIndex.h (+6)
  • (modified) llvm/lib/Bitcode/Writer/BitcodeWriter.cpp (+16-1)
  • (modified) llvm/lib/Transforms/IPO/MemProfContextDisambiguation.cpp (+422-54)
  • (added) llvm/test/ThinLTO/X86/memprof-tailcall.ll (+110)
  • (added) llvm/test/Transforms/MemProfContextDisambiguation/tailcall.ll (+84)
diff --git a/llvm/include/llvm/IR/ModuleSummaryIndex.h b/llvm/include/llvm/IR/ModuleSummaryIndex.h
index e72f74ad4adb66..66c7d10d823d9c 100644
--- a/llvm/include/llvm/IR/ModuleSummaryIndex.h
+++ b/llvm/include/llvm/IR/ModuleSummaryIndex.h
@@ -1011,6 +1011,12 @@ class FunctionSummary : public GlobalValueSummary {
     return *Callsites;
   }
 
+  void addCallsite(CallsiteInfo &Callsite) {
+    if (!Callsites)
+      Callsites = std::make_unique<CallsitesTy>();
+    Callsites->push_back(Callsite);
+  }
+
   ArrayRef<AllocInfo> allocs() const {
     if (Allocs)
       return *Allocs;
diff --git a/llvm/lib/Bitcode/Writer/BitcodeWriter.cpp b/llvm/lib/Bitcode/Writer/BitcodeWriter.cpp
index 8fca569a391baf..a5fc267b1883bf 100644
--- a/llvm/lib/Bitcode/Writer/BitcodeWriter.cpp
+++ b/llvm/lib/Bitcode/Writer/BitcodeWriter.cpp
@@ -459,9 +459,24 @@ class IndexBitcodeWriter : public BitcodeWriterBase {
       // Record all stack id indices actually used in the summary entries being
       // written, so that we can compact them in the case of distributed ThinLTO
       // indexes.
-      for (auto &CI : FS->callsites())
+      for (auto &CI : FS->callsites()) {
+        // If the stack id list is empty, this callsite info was synthesized for
+        // a missing tail call frame. Ensure that the callee's GUID gets a value
+        // id. Normally we only generate these for defined summaries, which in
+        // the case of distributed ThinLTO is only the functions already defined
+        // in the module or that we want to import. We don't bother to include
+        // all the callee symbols as they aren't normally needed in the backend.
+        // However, for the synthesized callsite infos we do need the callee
+        // GUID in the backend so that we can correlate the identified callee
+        // with this callsite info (which for non-tail calls is done by the
+        // ordering of the callsite infos and verified via stack ids).
+        if (CI.StackIdIndices.empty()) {
+          GUIDToValueIdMap[CI.Callee.getGUID()] = ++GlobalValueId;
+          continue;
+        }
         for (auto Idx : CI.StackIdIndices)
           StackIdIndices.push_back(Idx);
+      }
       for (auto &AI : FS->allocs())
         for (auto &MIB : AI.MIBs)
           for (auto Idx : MIB.StackIdIndices)
diff --git a/llvm/lib/Transforms/IPO/MemProfContextDisambiguation.cpp b/llvm/lib/Transforms/IPO/MemProfContextDisambiguation.cpp
index 70a3f3067d9d6d..59f982cd420bb6 100644
--- a/llvm/lib/Transforms/IPO/MemProfContextDisambiguation.cpp
+++ b/llvm/lib/Transforms/IPO/MemProfContextDisambiguation.cpp
@@ -77,6 +77,14 @@ STATISTIC(MaxAllocVersionsThinBackend,
           "allocation during ThinLTO backend");
 STATISTIC(UnclonableAllocsThinBackend,
           "Number of unclonable ambigous allocations during ThinLTO backend");
+STATISTIC(RemovedEdgesWithMismatchedCallees,
+          "Number of edges removed due to mismatched callees (profiled vs IR)");
+STATISTIC(FoundProfiledCalleeCount,
+          "Number of profiled callees found via tail calls");
+STATISTIC(FoundProfiledCalleeDepth,
+          "Aggregate depth of profiled callees found via tail calls");
+STATISTIC(FoundProfiledCalleeMaxDepth,
+          "Maximum depth of profiled callees found via tail calls");
 
 static cl::opt<std::string> DotFilePathPrefix(
     "memprof-dot-file-path-prefix", cl::init(""), cl::Hidden,
@@ -104,6 +112,12 @@ static cl::opt<std::string> MemProfImportSummary(
     cl::desc("Import summary to use for testing the ThinLTO backend via opt"),
     cl::Hidden);
 
+static cl::opt<unsigned>
+    TailCallSearchDepth("memprof-tail-call-search-depth", cl::init(5),
+                        cl::Hidden,
+                        cl::desc("Max depth to recursively search for missing "
+                                 "frames through tail calls."));
+
 namespace llvm {
 // Indicate we are linking with an allocator that supports hot/cold operator
 // new interfaces.
@@ -365,8 +379,7 @@ class CallsiteContextGraph {
 
   /// Save lists of calls with MemProf metadata in each function, for faster
   /// iteration.
-  std::vector<std::pair<FuncTy *, std::vector<CallInfo>>>
-      FuncToCallsWithMetadata;
+  std::map<FuncTy *, std::vector<CallInfo>> FuncToCallsWithMetadata;
 
   /// Map from callsite node to the enclosing caller function.
   std::map<const ContextNode *, const FuncTy *> NodeToCallingFunc;
@@ -411,9 +424,25 @@ class CallsiteContextGraph {
     return static_cast<const DerivedCCG *>(this)->getStackId(IdOrIndex);
   }
 
-  /// Returns true if the given call targets the given function.
-  bool calleeMatchesFunc(CallTy Call, const FuncTy *Func) {
-    return static_cast<DerivedCCG *>(this)->calleeMatchesFunc(Call, Func);
+  /// Returns true if the given call targets the callee of the given edge, or if
+  /// we were able to identify the call chain through intermediate tail calls.
+  /// In the latter case new context nodes are added to the graph for the
+  /// identified tail calls, and their synthesized nodes are added to
+  /// TailCallToContextNodeMap. The EdgeIter is updated in either case to the
+  /// next element after the input position (either incremented or updated after
+  /// removing the old edge).
+  bool
+  calleesMatch(CallTy Call, EdgeIter &EI,
+               MapVector<CallInfo, ContextNode *> &TailCallToContextNodeMap);
+
+  /// Returns true if the given call targets the given function, or if we were
+  /// able to identify the call chain through intermediate tail calls (in which
+  /// case FoundCalleeChain will be populated).
+  bool calleeMatchesFunc(
+      CallTy Call, const FuncTy *Func, const FuncTy *CallerFunc,
+      std::vector<std::pair<CallTy, FuncTy *>> &FoundCalleeChain) {
+    return static_cast<DerivedCCG *>(this)->calleeMatchesFunc(
+        Call, Func, CallerFunc, FoundCalleeChain);
   }
 
   /// Get a list of nodes corresponding to the stack ids in the given
@@ -553,7 +582,12 @@ class ModuleCallsiteContextGraph
                               Instruction *>;
 
   uint64_t getStackId(uint64_t IdOrIndex) const;
-  bool calleeMatchesFunc(Instruction *Call, const Function *Func);
+  bool calleeMatchesFunc(
+      Instruction *Call, const Function *Func, const Function *CallerFunc,
+      std::vector<std::pair<Instruction *, Function *>> &FoundCalleeChain);
+  bool findProfiledCalleeThroughTailCalls(
+      const Function *ProfiledCallee, Value *CurCallee, unsigned Depth,
+      std::vector<std::pair<Instruction *, Function *>> &FoundCalleeChain);
   uint64_t getLastStackId(Instruction *Call);
   std::vector<uint64_t> getStackIdsWithContextNodesForCall(Instruction *Call);
   void updateAllocationCall(CallInfo &Call, AllocationType AllocType);
@@ -606,12 +640,30 @@ class IndexCallsiteContextGraph
       function_ref<bool(GlobalValue::GUID, const GlobalValueSummary *)>
           isPrevailing);
 
+  ~IndexCallsiteContextGraph() {
+    // Now that we are done with the graph it is safe to add the new
+    // CallsiteInfo structs to the function summary vectors. The graph nodes
+    // point into locations within these vectors, so we don't want to add them
+    // any earlier.
+    for (auto &I : FunctionCalleesToSynthesizedCallsiteInfos) {
+      auto *FS = I.first;
+      for (auto &Callsite : I.second)
+        FS->addCallsite(*Callsite.second);
+    }
+  }
+
 private:
   friend CallsiteContextGraph<IndexCallsiteContextGraph, FunctionSummary,
                               IndexCall>;
 
   uint64_t getStackId(uint64_t IdOrIndex) const;
-  bool calleeMatchesFunc(IndexCall &Call, const FunctionSummary *Func);
+  bool calleeMatchesFunc(
+      IndexCall &Call, const FunctionSummary *Func,
+      const FunctionSummary *CallerFunc,
+      std::vector<std::pair<IndexCall, FunctionSummary *>> &FoundCalleeChain);
+  bool findProfiledCalleeThroughTailCalls(
+      ValueInfo ProfiledCallee, ValueInfo CurCallee, unsigned Depth,
+      std::vector<std::pair<IndexCall, FunctionSummary *>> &FoundCalleeChain);
   uint64_t getLastStackId(IndexCall &Call);
   std::vector<uint64_t> getStackIdsWithContextNodesForCall(IndexCall &Call);
   void updateAllocationCall(CallInfo &Call, AllocationType AllocType);
@@ -630,6 +682,16 @@ class IndexCallsiteContextGraph
   std::map<const FunctionSummary *, ValueInfo> FSToVIMap;
 
   const ModuleSummaryIndex &Index;
+  function_ref<bool(GlobalValue::GUID, const GlobalValueSummary *)>
+      isPrevailing;
+
+  // Saves/owns the callsite info structures synthesized for missing tail call
+  // frames that we discover while building the graph.
+  // It maps from the summary of the function making the tail call, to a map
+  // of callee ValueInfo to corresponding synthesized callsite info.
+  std::map<FunctionSummary *,
+           std::map<ValueInfo, std::unique_ptr<CallsiteInfo>>>
+      FunctionCalleesToSynthesizedCallsiteInfos;
 };
 } // namespace
 
@@ -1493,7 +1555,7 @@ ModuleCallsiteContextGraph::ModuleCallsiteContextGraph(
       }
     }
     if (!CallsWithMetadata.empty())
-      FuncToCallsWithMetadata.push_back({&F, CallsWithMetadata});
+      FuncToCallsWithMetadata[&F] = CallsWithMetadata;
   }
 
   if (DumpCCG) {
@@ -1518,7 +1580,7 @@ IndexCallsiteContextGraph::IndexCallsiteContextGraph(
     ModuleSummaryIndex &Index,
     function_ref<bool(GlobalValue::GUID, const GlobalValueSummary *)>
         isPrevailing)
-    : Index(Index) {
+    : Index(Index), isPrevailing(isPrevailing) {
   for (auto &I : Index) {
     auto VI = Index.getValueInfo(I);
     for (auto &S : VI.getSummaryList()) {
@@ -1572,7 +1634,7 @@ IndexCallsiteContextGraph::IndexCallsiteContextGraph(
           CallsWithMetadata.push_back({&SN});
 
       if (!CallsWithMetadata.empty())
-        FuncToCallsWithMetadata.push_back({FS, CallsWithMetadata});
+        FuncToCallsWithMetadata[FS] = CallsWithMetadata;
 
       if (!FS->allocs().empty() || !FS->callsites().empty())
         FSToVIMap[FS] = VI;
@@ -1604,6 +1666,11 @@ void CallsiteContextGraph<DerivedCCG, FuncTy,
   // this transformation for regular LTO, and for ThinLTO we can simulate that
   // effect in the summary and perform the actual speculative devirtualization
   // while cloning in the ThinLTO backend.
+
+  // Keep track of the new nodes synthesizes for discovered tail calls missing
+  // from the profiled contexts.
+  MapVector<CallInfo, ContextNode *> TailCallToContextNodeMap;
+
   for (auto Entry = NonAllocationCallToContextNodeMap.begin();
        Entry != NonAllocationCallToContextNodeMap.end();) {
     auto *Node = Entry->second;
@@ -1611,13 +1678,17 @@ void CallsiteContextGraph<DerivedCCG, FuncTy,
     // Check all node callees and see if in the same function.
     bool Removed = false;
     auto Call = Node->Call.call();
-    for (auto &Edge : Node->CalleeEdges) {
-      if (!Edge->Callee->hasCall())
+    for (auto EI = Node->CalleeEdges.begin(); EI != Node->CalleeEdges.end();) {
+      auto Edge = *EI;
+      if (!Edge->Callee->hasCall()) {
+        ++EI;
         continue;
+      }
       assert(NodeToCallingFunc.count(Edge->Callee));
       // Check if the called function matches that of the callee node.
-      if (calleeMatchesFunc(Call, NodeToCallingFunc[Edge->Callee]))
+      if (calleesMatch(Call, EI, TailCallToContextNodeMap))
         continue;
+      RemovedEdgesWithMismatchedCallees++;
       // Work around by setting Node to have a null call, so it gets
       // skipped during cloning. Otherwise assignFunctions will assert
       // because its data structures are not designed to handle this case.
@@ -1629,6 +1700,11 @@ void CallsiteContextGraph<DerivedCCG, FuncTy,
     if (!Removed)
       Entry++;
   }
+
+  // Add the new nodes after the above loop so that the iteration is not
+  // invalidated.
+  for (auto I : TailCallToContextNodeMap)
+    NonAllocationCallToContextNodeMap[I.first] = I.second;
 }
 
 uint64_t ModuleCallsiteContextGraph::getStackId(uint64_t IdOrIndex) const {
@@ -1642,8 +1718,160 @@ uint64_t IndexCallsiteContextGraph::getStackId(uint64_t IdOrIndex) const {
   return Index.getStackIdAtIndex(IdOrIndex);
 }
 
-bool ModuleCallsiteContextGraph::calleeMatchesFunc(Instruction *Call,
-                                                   const Function *Func) {
+template <typename DerivedCCG, typename FuncTy, typename CallTy>
+bool CallsiteContextGraph<DerivedCCG, FuncTy, CallTy>::calleesMatch(
+    CallTy Call, EdgeIter &EI,
+    MapVector<CallInfo, ContextNode *> &TailCallToContextNodeMap) {
+  auto Edge = *EI;
+  const FuncTy *ProfiledCalleeFunc = NodeToCallingFunc[Edge->Callee];
+  const FuncTy *CallerFunc = NodeToCallingFunc[Edge->Caller];
+  // Will be populated in order of callee to caller if we find a chain of tail
+  // calls between the profiled caller and callee.
+  std::vector<std::pair<CallTy, FuncTy *>> FoundCalleeChain;
+  if (!calleeMatchesFunc(Call, ProfiledCalleeFunc, CallerFunc,
+                         FoundCalleeChain)) {
+    ++EI;
+    return false;
+  }
+
+  // The usual case where the profiled callee matches that of the IR/summary.
+  if (FoundCalleeChain.empty()) {
+    ++EI;
+    return true;
+  }
+
+  auto AddEdge = [Edge, &EI](ContextNode *Caller, ContextNode *Callee) {
+    auto *CurEdge = Callee->findEdgeFromCaller(Caller);
+    // If there is already an edge between these nodes, simply update it and
+    // return.
+    if (CurEdge) {
+      CurEdge->ContextIds.insert(Edge->ContextIds.begin(),
+                                 Edge->ContextIds.end());
+      CurEdge->AllocTypes |= Edge->AllocTypes;
+      return;
+    }
+    // Otherwise, create a new edge and insert it into the caller and callee
+    // lists.
+    auto NewEdge = std::make_shared<ContextEdge>(
+        Callee, Caller, Edge->AllocTypes, Edge->ContextIds);
+    Callee->CallerEdges.push_back(NewEdge);
+    if (Caller == Edge->Caller) {
+      // If we are inserting the new edge into the current edge's caller, insert
+      // the new edge before the current iterator position, and then increment
+      // back to the current edge.
+      EI = Caller->CalleeEdges.insert(EI, NewEdge);
+      ++EI;
+      assert(*EI == Edge);
+    } else
+      Caller->CalleeEdges.push_back(NewEdge);
+  };
+
+  // Create new nodes for each found callee and connect in between the profiled
+  // caller and callee.
+  auto *CurCalleeNode = Edge->Callee;
+  for (auto I : FoundCalleeChain) {
+    auto &NewCall = I.first;
+    auto *Func = I.second;
+    ContextNode *NewNode = nullptr;
+    // First check if we have already synthesized a node for this tail call.
+    if (TailCallToContextNodeMap.count(NewCall)) {
+      NewNode = TailCallToContextNodeMap[NewCall];
+      NewNode->ContextIds.insert(Edge->ContextIds.begin(),
+                                 Edge->ContextIds.end());
+      NewNode->AllocTypes |= Edge->AllocTypes;
+    } else {
+      FuncToCallsWithMetadata[Func].push_back({NewCall});
+      // Create Node and record node info.
+      NodeOwner.push_back(
+          std::make_unique<ContextNode>(/*IsAllocation=*/false, NewCall));
+      NewNode = NodeOwner.back().get();
+      NodeToCallingFunc[NewNode] = Func;
+      TailCallToContextNodeMap[NewCall] = NewNode;
+      NewNode->ContextIds = Edge->ContextIds;
+      NewNode->AllocTypes = Edge->AllocTypes;
+    }
+
+    // Hook up node to its callee node
+    AddEdge(NewNode, CurCalleeNode);
+
+    CurCalleeNode = NewNode;
+  }
+
+  // Hook up edge's original caller to new callee node.
+  AddEdge(Edge->Caller, CurCalleeNode);
+
+  // Remove old edge
+  Edge->Callee->eraseCallerEdge(Edge.get());
+  EI = Edge->Caller->CalleeEdges.erase(EI);
+
+  return true;
+}
+
+bool ModuleCallsiteContextGraph::findProfiledCalleeThroughTailCalls(
+    const Function *ProfiledCallee, Value *CurCallee, unsigned Depth,
+    std::vector<std::pair<Instruction *, Function *>> &FoundCalleeChain) {
+  // Stop recursive search if we have already explored the maximum specified
+  // depth.
+  if (Depth > TailCallSearchDepth)
+    return false;
+
+  auto SaveCallsiteInfo = [&](Instruction *Callsite, Function *F) {
+    FoundCalleeChain.push_back({Callsite, F});
+  };
+
+  auto *CalleeFunc = dyn_cast<Function>(CurCallee);
+  if (!CalleeFunc) {
+    auto *Alias = dyn_cast<GlobalAlias>(CurCallee);
+    assert(Alias);
+    CalleeFunc = dyn_cast<Function>(Alias->getAliasee());
+    assert(CalleeFunc);
+  }
+
+  // Look for tail calls in this function, and check if they either call the
+  // profiled callee directly, or indirectly (via a recursive search).
+  for (auto &BB : *CalleeFunc) {
+    for (auto &I : BB) {
+      auto *CB = dyn_cast<CallBase>(&I);
+      if (!CB || !CB->isTailCall())
+        continue;
+      auto *CalledValue = CB->getCalledOperand();
+      auto *CalledFunction = CB->getCalledFunction();
+      if (CalledValue && !CalledFunction) {
+        CalledValue = CalledValue->stripPointerCasts();
+        // Stripping pointer casts can reveal a called function.
+        CalledFunction = dyn_cast<Function>(CalledValue);
+      }
+      // Check if this is an alias to a function. If so, get the
+      // called aliasee for the checks below.
+      if (auto *GA = dyn_cast<GlobalAlias>(CalledValue)) {
+        assert(!CalledFunction &&
+               "Expected null called function in callsite for alias");
+        CalledFunction = dyn_cast<Function>(GA->getAliaseeObject());
+      }
+      if (!CalledFunction)
+        continue;
+      if (CalledFunction == ProfiledCallee) {
+        FoundProfiledCalleeCount++;
+        FoundProfiledCalleeDepth += Depth;
+        if (Depth > FoundProfiledCalleeMaxDepth)
+          FoundProfiledCalleeMaxDepth = Depth;
+        SaveCallsiteInfo(&I, CalleeFunc);
+        return true;
+      }
+      if (findProfiledCalleeThroughTailCalls(ProfiledCallee, CalledFunction,
+                                             Depth + 1, FoundCalleeChain)) {
+        SaveCallsiteInfo(&I, CalleeFunc);
+        return true;
+      }
+    }
+  }
+
+  return false;
+}
+
+bool ModuleCallsiteContextGraph::calleeMatchesFunc(
+    Instruction *Call, const Function *Func, const Function *CallerFunc,
+    std::vector<std::pair<Instruction *, Function *>> &FoundCalleeChain) {
   auto *CB = dyn_cast<CallBase>(Call);
   if (!CB->getCalledOperand())
     return false;
@@ -1652,11 +1880,96 @@ bool ModuleCallsiteContextGraph::calleeMatchesFunc(Instruction *Call,
   if (CalleeFunc == Func)
     return true;
   auto *Alias = dyn_cast<GlobalAlias>(CalleeVal);
-  return Alias && Alias->getAliasee() == Func;
+  if (Alias && Alias->getAliasee() == Func)
+    return true;
+
+  // Recursively search for the profiled callee through tail calls starting with
+  // the actual Callee. The discovered tail call chain is saved in
+  // FoundCalleeChain, and we will fixup the graph to include these callsites
+  // after returning.
+  // FIXME: We will currently redo the same recursive walk if we find the same
+  // mismatched callee from another callsite. We can improve this with more
+  // bookkeeping of the created chain of new nodes for each mismatch.
+  unsigned Depth = 1;
+  if (!findProfiledCalleeThroughTailCalls(Func, CalleeVal, Depth,
+                                          FoundCalleeChain)) {
+    LLVM_DEBUG(dbgs() << "Not found through tail calls: " << Func->getName()
+                      << " from " << CallerFunc->getName()
+                      << " that actually called " << CalleeVal->getName()
+                      << "\n");
+    return false;
+  }
+
+  return true;
 }
 
-bool IndexCallsiteContextGraph::calleeMatchesFunc(IndexCall &Call,
-                                                  const FunctionSummary *Func) {
+bool IndexCallsiteContextGraph::findProfiledCalleeThroughTailCalls(
+    ValueInfo ProfiledCallee, ValueInfo CurCallee, unsigned Depth,
+    std::vector<std::pair<IndexCall, FunctionSummary *>> &FoundCalleeChain) {
+  // Stop recursive search if we have already explored the maximum specified
+  // depth.
+  if (Depth > TailCallSearchDepth)
+    return false;
+
+  auto CreateAndSaveCallsiteInfo = [&](ValueInfo Callee, FunctionSummary *FS) {
+    // Make a CallsiteInfo for each discovered callee, if one hasn't already
+    // been synthesized.
+    if (!FunctionCalleesToSynthesizedCallsiteInfos.count(FS) ||
+        !FunctionCalleesToSynthesizedCallsiteInfos[FS].count(Callee))
+      // StackIds is empty (we don't have debug info available in the index for
+      // these callsites)
+      FunctionCal...
[truncated]

@lifengxiang1025
Contributor

Hi, @teresajohnson. I have one question. For example, there are two call chains:

A->B(tail call)->C(tail call)->D(tail call)->G
A->B(tail call)->E(tail call)->F(tail call)->G

It seems that in calleesMatch only the chain (A->B->C->D->G) will be recorded, while the chain (A->B->E->F->G) won't. Is this what you expected?

@lifengxiang1025
Contributor

Another question: is it possible that tail calls are introduced in the thin backend for the instrumented binary? If so, can this patch handle it? Or should the instrumented binary not be compiled with LTO/ThinLTO?

@teresajohnson
Contributor Author

Hi, @teresajohnson. I have one question. For example, there are two call chains:

A->B(tail call)->C(tail call)->D(tail call)->G
A->B(tail call)->E(tail call)->F(tail call)->G

It seems that in calleesMatch only the chain (A->B->C->D->G) will be recorded, while the chain (A->B->E->F->G) won't. Is this what you expected?

Yes, I should have mentioned that this algorithm is greedy and therefore won't currently find all possible paths, so we would miss the second case here. I was thinking it was a simple extension to handle all paths, but there is a slight complication. I hacked up the change to see how difficult it would be, and it works for a test case like this, but I need to do a bit broader testing of it.

Not sure yet whether it is worth handling this in this patch or as a follow-on. Let me do a bit more investigation on my prototyped change and circle back.

Another question: is it possible that tail calls are introduced in the thin backend for the instrumented binary? If so, can this patch handle it? Or should the instrumented binary not be compiled with LTO/ThinLTO?

If we're missing a frame for something not marked as a tail call during the LTO link, then this approach would not be able to recover it. We only search through calls marked as tail calls at that phase to keep the overhead smaller.
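
Concretely, the search only follows call sites that are explicitly marked in the IR, e.g. (hypothetical IR):

define i32 @bar() {
entry:
  ; Only calls carrying the "tail" marker are followed by the search.
  %r = tail call i32 @baz()
  ret i32 %r
}

declare i32 @baz()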

@snehasish (Contributor) left a comment:

Still reading..

@@ -1604,20 +1666,29 @@ void CallsiteContextGraph<DerivedCCG, FuncTy,
// this transformation for regular LTO, and for ThinLTO we can simulate that
// effect in the summary and perform the actual speculative devirtualization
// while cloning in the ThinLTO backend.

// Keep track of the new nodes synthesizes for discovered tail calls missing
snehasish (Contributor):
typo: synthesized

teresajohnson (Contributor Author):
done

@@ -365,8 +379,7 @@ class CallsiteContextGraph {

/// Save lists of calls with MemProf metadata in each function, for faster
/// iteration.
std::vector<std::pair<FuncTy *, std::vector<CallInfo>>>
FuncToCallsWithMetadata;
std::map<FuncTy *, std::vector<CallInfo>> FuncToCallsWithMetadata;
snehasish (Contributor):
An ordered map indexed by a pointer sounds like a recipe for non-deterministic output. Can we use an unordered_map instead if determinism is unaffected, or use a different key if ordering is required?

teresajohnson (Contributor Author):
Good catch. I do have a concern after looking through the uses of this map that one of the iterations could result in non-deterministic behavior. Since I am not sure the best way to index using a key with a deterministic order, I have changed this to MapVector.

Edit: yep, this exposed an ordering non-determinism, in the order the cloned functions were generated and thus emitted into the IR, resulting in a test fix.
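
As a standalone sketch of the concern (hypothetical types, not the patch's code): a std::map keyed by pointer iterates in pointer-value order, which can change from run to run, while llvm::MapVector iterates in insertion order.

#include "llvm/ADT/MapVector.h"
#include <map>

struct Func { int Id; };

void iterationOrder() {
  Func *A = new Func{1};
  Func *B = new Func{2};

  // Ordered by pointer value: whichever object the allocator placed at
  // the lower address is visited first, so iteration order can vary
  // between runs and leak into the emitted IR.
  std::map<Func *, int> ByAddress;
  ByAddress[B] = 2;
  ByAddress[A] = 1;

  // MapVector still keys on the pointer but iterates in insertion
  // order, so B is always visited before A.
  llvm::MapVector<Func *, int> ByInsertion;
  ByInsertion[B] = 2;
  ByInsertion[A] = 1;

  delete A;
  delete B;
}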

// frames that we discover while building the graph.
// It maps from the summary of the function making the tail call, to a map
// of callee ValueInfo to corresponding synthesized callsite info.
std::map<FunctionSummary *,
snehasish (Contributor):
Same concern of determinism here.

teresajohnson (Contributor Author):
No ordering requirements for this one, so have changed to unordered_map.


// Add the new nodes after the above loop so that the iteration is not
// invalidated.
for (auto I : TailCallToContextNodeMap)
snehasish (Contributor):
auto& to avoid a copy

teresajohnson (Contributor Author):
done

@@ -630,6 +682,16 @@ class IndexCallsiteContextGraph
std::map<const FunctionSummary *, ValueInfo> FSToVIMap;

const ModuleSummaryIndex &Index;
function_ref<bool(GlobalValue::GUID, const GlobalValueSummary *)>
snehasish (Contributor):
Spell out llvm::function_ref (since there exists std::function_ref too)?

teresajohnson (Contributor Author):
There are a bunch of existing uses of this type and none are fully spelled out (I checked that it does use the llvm version, however). How about I do a follow on NFC patch that adds the llvm:: prefix to all?

snehasish (Contributor):
Sure, sounds good to me.
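
For context, llvm::function_ref is a non-owning reference to a callable; a trivial usage sketch (hypothetical function):

#include "llvm/ADT/STLFunctionalExtras.h"

// function_ref does not own its target: it is cheap to pass by value
// but must not outlive the callable it was constructed from.
static bool anyMatch(llvm::function_ref<bool(int)> Pred) {
  for (int I : {1, 2, 3})
    if (Pred(I))
      return true;
  return false;
}

bool hasEven() {
  return anyMatch([](int I) { return I % 2 == 0; });
}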

// back to the current edge.
EI = Caller->CalleeEdges.insert(EI, NewEdge);
++EI;
assert(*EI == Edge);
snehasish (Contributor):
Add a message for the assertion in LLVM style like assert(cond && "message")?

teresajohnson (Contributor Author):
done

@snehasish
Contributor

Hi, @teresajohnson. I have one question. For example, there are two call chains:

A->B(tail call)->C(tail call)->D(tail call)->G
A->B(tail call)->E(tail call)->F(tail call)->G

It seems that in calleesMatch only the chain (A->B->C->D->G) will be recorded, while the chain (A->B->E->F->G) won't. Is this what you expected?

Yes, I should have mentioned that this algorithm is greedy and therefore won't currently find all possible paths, so we would miss the second case here. I was thinking it was a simple extension to handle all paths, but there is a slight complication. I hacked up the change to see how difficult it would be, and it works for a test case like this, but I need to do a bit broader testing of it.

Not sure yet whether it is worth handling this in this patch or as a follow-on. Let me do a bit more investigation on my prototyped change and circle back.

The behaviour I prefer is that we only perform cloning for unambiguous cases (as far as we can get with dynamic profiles). I think we should enumerate all paths up to the search depth and treat multiple paths as unreachable. This mimics the behaviour of the missing-frame inference logic in SampleProf. I'm wary of the downsides of false positives since we may degrade performance. Do you have any stats on unambiguous matches vs total matches?

Handling in a separate patch sounds fine to me.

@teresajohnson
Contributor Author

Hi, @teresajohnson. I have one question. For example, there are two call chains:

A->B(tail call)->C(tail call)->D(tail call)->G
A->B(tail call)->E(tail call)->F(tail call)->G

It seems that in calleesMatch only the chain (A->B->C->D->G) will be recorded, while the chain (A->B->E->F->G) won't. Is this what you expected?

Yes, I should have mentioned that this algorithm is greedy and therefore won't currently find all possible paths, so we would miss the second case here. I was thinking it was a simple extension to handle all paths, but there is a slight complication. I hacked up the change to see how difficult it would be, and it works for a test case like this, but I need to do a bit broader testing of it.
Not sure yet whether it is worth handling this in this patch or as a follow-on. Let me do a bit more investigation on my prototyped change and circle back.

The behaviour I prefer is that we only perform cloning for unambiguous cases (as far as we can get with dynamic profiles). I think we should enumerate all paths up to the search depth and treat multiple paths as unreachable. This mimics the behaviour of the missing-frame inference logic in SampleProf. I'm wary of the downsides of false positives since we may degrade performance. Do you have any stats on unambiguous matches vs total matches?

This is a good point, and in fact much simpler, and it will likely be effective in the vast majority of cases: I had earlier collected stats from a large application profiled with tail call optimization enabled, and found that 98% of the profiled callees were found via only 1 level of tail call. In that case we know for sure the function of the missing frame (from the call in the IR of the profiled caller frame). What we don't know is which tail call it is if there are multiple call sites to the same callee. For now my handling will be somewhat imperfect in this case (we might clone the wrong one), but I think this is probably not a common enough case to be too concerned with unless we see otherwise.

Handling in a separate patch sounds fine to me.

When I am back from the holiday I may just modify the patch to be conservative if there are multiple paths through tail calls in different functions.
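
A toy model of that conservative rule (hypothetical code, not the patch's implementation): count the distinct tail call chains that reach the profiled callee within the depth limit, and only accept a unique one.

#include <map>
#include <vector>

// Hypothetical call graph: node -> its tail callees.
using TailCallGraph = std::map<int, std::vector<int>>;

// Counts distinct chains from Cur to Target using at most MaxDepth edges.
static unsigned countChains(const TailCallGraph &G, int Cur, int Target,
                            unsigned Depth, unsigned MaxDepth) {
  if (Cur == Target)
    return 1;
  if (Depth == MaxDepth)
    return 0;
  auto It = G.find(Cur);
  if (It == G.end())
    return 0;
  unsigned N = 0;
  for (int Next : It->second)
    N += countChains(G, Next, Target, Depth + 1, MaxDepth);
  return N;
}

// Only a unique chain is trusted; ambiguous cases are skipped so we
// never clone along the wrong path.
static bool hasUniqueChain(const TailCallGraph &G, int From, int To,
                           unsigned MaxDepth) {
  return countChains(G, From, To, 0, MaxDepth) == 1;
}

For the two chains in the earlier example (B->C->D->G and B->E->F->G), countChains from B to G returns 2, so hasUniqueChain fails and no cloning is attempted.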

@teresajohnson (Contributor Author) left a comment:

I have addressed the current round of comments from @snehasish, but haven't yet made the change to ignore multiple paths. Going to do that now.


@teresajohnson
Contributor Author

When I am back from the holiday I may just modify the patch to be conservative if there are multiple paths through tail calls in different functions.

Added the handling to make this conservative, along with a new test for that case.

@snehasish (Contributor) left a comment:

lgtm with some optional suggestions.

;; Test to make sure that missing tail call frames in memprof profiles are
;; identified but not cloned when there are multiple non-unique possible
;; tail call chains between the profiled frames.

snehasish (Contributor):
Should we include the source from which these test.ll files were generated for posterity?

teresajohnson (Contributor Author):
I didn't end up including that here because I did a lot of manual clean up of the original test (for example there were calls to other functions like sleep and delete etc). So I'm not convinced there is a lot of value in including that here.

@@ -104,6 +114,12 @@ static cl::opt<std::string> MemProfImportSummary(
cl::desc("Import summary to use for testing the ThinLTO backend via opt"),
cl::Hidden);

static cl::opt<unsigned>
TailCallSearchDepth("memprof-tail-call-search-depth", cl::init(5),
snehasish (Contributor):
Do you think we need to have an option, or is a hard-coded limit fine? Given our strategy of selecting only unambiguous paths, increasing the depth will increase both the runtime and the chance of an ambiguous path. So we probably won't tune it, and I'd prefer not to introduce a flag which we never change. What do you think?

teresajohnson (Contributor Author):
Personally I dislike hardcoded thresholds that can't be adjusted at runtime. For debugging and analysis it is sometimes really handy to have a command line flag to change these types of limits. So I'd prefer to keep it as an option for now.

teresajohnson (Contributor Author):
I added testing of this option to one of the tests. It also provides a handy way to disable the tail call discovery if we encounter issues.
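
As a usage sketch in the RUN-line style of the added tests (the real tests pass additional flags): a depth of 0 disables the search entirely, since the initial Depth of 1 already exceeds the limit.

; Discovery disabled: depth 0 is below the initial search depth of 1.
; RUN: opt -passes=memprof-context-disambiguation \
; RUN:     -memprof-tail-call-search-depth=0 %s -S | FileCheck %s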


// Create new nodes for each found callee and connect in between the profiled
// caller and callee.
auto *CurCalleeNode = Edge->Callee;
for (auto I : FoundCalleeChain) {
snehasish (Contributor):
nit: auto& [NewCall, Func] : FoundCalleeChain

teresajohnson (Contributor Author):
done


// Add the new nodes after the above loop so that the iteration is not
// invalidated.
for (auto &I : TailCallToContextNodeMap)
snehasish (Contributor):
A structured binding with variable names would be more readable than the I.first and I.second usage right now.

teresajohnson (Contributor Author):
done
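
For reference, the shape of that change (a sketch of the loop from this patch):

// Before: opaque pair accesses.
for (auto &I : TailCallToContextNodeMap)
  NonAllocationCallToContextNodeMap[I.first] = I.second;

// After: structured bindings name the key and value.
for (auto &[Call, Node] : TailCallToContextNodeMap)
  NonAllocationCallToContextNodeMap[Call] = Node;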

}

bool IndexCallsiteContextGraph::calleeMatchesFunc(IndexCall &Call,
const FunctionSummary *Func) {
bool IndexCallsiteContextGraph::findProfiledCalleeThroughTailCalls(
snehasish (Contributor):
[optional] I wonder if we can refactor the Index and Module version of this method by factoring out the FSToVI lookup as a functor as we have done in other places (if I recall correctly)?

teresajohnson (Contributor Author):
There are pieces of it that could be refactored out, but it is hard to share most of it because there is a lot that is specific to whether this is IR or Summary. So I opted to just keep the versions separate.

@teresajohnson teresajohnson merged commit 26a8664 into llvm:main Jan 11, 2024
teresajohnson added a commit to teresajohnson/llvm-project that referenced this pull request Jan 11, 2024
As suggested in llvm#75823, to
avoid confusion with std::function_ref, qualify all uses with llvm::
(we were already using the llvm version, but this avoids ambiguity).
teresajohnson added a commit to teresajohnson/llvm-project that referenced this pull request Jan 11, 2024
teresajohnson added a commit that referenced this pull request Jan 16, 2024
As suggested in #75823, to
avoid confusion with std::function_ref, qualify all uses with llvm::
(we were already using the llvm version, but this avoids ambiguity).
justinfargnoli pushed a commit to justinfargnoli/llvm-project that referenced this pull request Jan 28, 2024
If tail call optimization was not disabled for the profiled binary, the
call contexts will be missing frames for tail calls. Handle this by
performing a limited search through tail call edges for the profiled
callee when a discontinuity is detected. The search depth is adjustable
but defaults to 5.

If we are able to identify a short sequence of tail calls, update the
graph for those calls. In the case of ThinLTO, synthesize the necessary
CallsiteInfos for carrying the cloning information to the backends.
justinfargnoli pushed a commit to justinfargnoli/llvm-project that referenced this pull request Jan 28, 2024
justinfargnoli pushed a commit to justinfargnoli/llvm-project that referenced this pull request Jan 28, 2024
…77783)

As suggested in llvm#75823, to
avoid confusion with std::function_ref, qualify all uses with llvm::
(we were already using the llvm version, but this avoids ambiguity).