[ctxprof] root autodetection mechanism #133147

Merged: mtrofin merged 1 commit into main from users/mtrofin/03-24-rootautodetect on Apr 8, 2025

@mtrofin mtrofin commented Mar 26, 2025

This is an optional mechanism that automatically detects roots. It's a best-effort mechanism, and its main goal is to avoid pointing at the message pump function as a root. This is the function that polls message queue(s) in an infinite loop, and is thus a bad root (it never exits).

At a high level, when collection is requested - which should happen once a server has already been set up and is handling requests - we spend a bit of time sampling all the server's threads. Each sample is a stack, which we insert into a PerThreadCallsiteTrie. After a while, we run the root detection logic on each PerThreadCallsiteTrie. We then traverse all the FunctionData, find the ones matching the detected roots, and allocate a ContextRoot for each of them. From here, in __llvm_ctx_profile_get_context, we special-case FunctionData objects that have a CtxRoot and route them to __llvm_ctx_profile_start_context.
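
To make the sampling and detection steps more concrete, here is a minimal sketch of a stack trie with a root heuristic. It is not the PerThreadCallsiteTrie implementation: the StackTrie type and the mergeRoots helper are hypothetical, nodes are keyed by function name rather than by callsite address, and the heuristic (treat the first level where sampled stacks diverge as the roots) is an assumption chosen to match the "message pump calls the roots" picture above.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for a per-thread trie of sampled stacks.
// Nodes are keyed by function name here, not by callsite address.
struct StackTrie {
  struct Node {
    uint64_t Samples = 0; // number of sampled stacks passing through this node
    std::map<std::string, std::unique_ptr<Node>> Children;
  };
  Node Top;

  // Insert one sampled stack, listed outermost frame first.
  void insertStack(const std::vector<std::string> &Frames) {
    assert(!Frames.empty() && "a sample needs at least one frame");
    Node *N = &Top;
    for (const auto &F : Frames) {
      auto &Child = N->Children[F];
      if (!Child)
        Child = std::make_unique<Node>();
      ++Child->Samples;
      N = Child.get();
    }
  }

  // Assumed heuristic: follow the single-child chain from the top (thread
  // entry, message pump, ...). The first level with 2 or more distinct callees
  // holds the candidate roots, each with the number of samples that hit it.
  std::map<std::string, uint64_t> determineRoots() const {
    const Node *N = &Top;
    while (N->Children.size() == 1)
      N = N->Children.begin()->second.get();
    std::map<std::string, uint64_t> Roots;
    for (const auto &[Name, Child] : N->Children)
      Roots[Name] = Child->Samples;
    return Roots;
  }
};

// Sum the per-thread results into one table, similar in spirit to what the
// worker thread in the patch does with its AllRoots map.
inline std::map<std::string, uint64_t>
mergeRoots(const std::vector<const StackTrie *> &Tries) {
  std::map<std::string, uint64_t> All;
  for (const auto *T : Tries)
    for (const auto &[Name, Count] : T->determineRoots())
      All[Name] += Count;
  return All;
}
```

With stacks such as main > pump > handleA and main > pump > handleB, this sketch skips main and pump (each level has a single callee) and reports handleA and handleB as candidate roots. If every sampled stack is identical, it reports nothing, which is one way a best-effort detector can come up empty.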

For this to work, on the LLVM side, all functions need to call __llvm_ctx_profile_release_context, because any of them might be a root. This comes at a slight penalty (a few percent) during collection - which we can afford, since the overall technique is ~5x faster than normal instrumentation. We can later explore conditionally enabling autoroot detection and avoiding this penalty, if desired.
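
Since every function now calls the release hook, the hook has to be a no-op unless the caller actually owns the current root. The following is a simplified model of that guard, based on the updated __llvm_ctx_profile_release_context in the diff. The types are cut-down stand-ins, an atomic flag replaces the runtime's Taken mutex, and startContext / releaseContext are hypothetical names used only for illustration.

```cpp
#include <atomic>
#include <cassert>

// Cut-down stand-ins for the runtime types; only what the guard needs.
struct ContextRoot {
  std::atomic<bool> Taken{false}; // models the root's "Taken" lock
};
struct FunctionData {
  ContextRoot *CtxRoot = nullptr; // set only for functions that are roots
};

// Models __llvm_ctx_profile_current_context_root (per thread).
thread_local ContextRoot *CurrentContextRoot = nullptr;

// Hypothetical entry helper: take the root if nobody else holds it.
inline bool startContext(FunctionData &FD) {
  assert(FD.CtxRoot && "only roots can start a context");
  if (FD.CtxRoot->Taken.exchange(true))
    return false; // another thread is already collecting under this root
  CurrentContextRoot = FD.CtxRoot;
  return true;
}

// Called on exit from *every* function. Only the function whose root is the
// current one releases it; for all other callers this is a cheap no-op.
inline bool releaseContext(FunctionData &FD) {
  const ContextRoot *Current = CurrentContextRoot;
  if (!Current || FD.CtxRoot != Current)
    return false;
  CurrentContextRoot = nullptr;
  FD.CtxRoot->Taken.store(false);
  return true;
}
```

The comparison against the current root matters when one detected root calls another: the inner function has a CtxRoot of its own, but it is not the one being collected, so its return must not end the outer root's context.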

Note that functions that musttail call can't have their return instrumented this way, and a subsequent patch will harden the mechanism against this case.

The mechanism could be used in combination with explicit root specification, too.
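
As a rough illustration of how a function that has a ContextRoot gets routed when no root is current on the thread, here is a model of the decision made in getUnhandledContext in the diff. It only captures the branching, not the context allocation. The Route, State and Decision types and the routeUnhandled name are hypothetical, and explicitly specified roots (which call __llvm_ctx_profile_start_context directly) are outside what this sketch models.

```cpp
#include <cassert>

// Where a call ends up when it cannot be attached to a caller's callsite.
enum class Route { StartContext, Scratch, Flat, UnhandledUnderRoot };

struct State {
  bool HasCurrentRoot;   // this thread is already collecting under some root
  bool DetectorRunning;  // the root autodetector is still sampling
  bool FunctionHasRoot;  // the function's FunctionData has a CtxRoot
  bool IsUnderContext;   // models the runtime's IsUnderContext flag
  bool ProfilingStarted; // models the runtime's ProfilingStarted flag
};

struct Decision {
  Route R;
  bool SampleAttempted; // sample() was invoked (it may still skip the sample)
};

inline Decision routeUnhandled(const State &S) {
  // Under an active root, the call is recorded in that root's unhandled list.
  if (S.HasCurrentRoot)
    return {Route::UnhandledUnderRoot, false};

  bool Sampled = false;
  if (S.DetectorRunning)
    Sampled = true; // keep sampling stacks; no contexts are started yet
  else if (S.FunctionHasRoot)
    return {Route::StartContext, false}; // a root: begin contextual collection

  if (S.IsUnderContext || !S.ProfilingStarted)
    return {Route::Scratch, Sampled};
  return {Route::Flat, Sampled}; // otherwise, count into the flat profile
}
```

One consequence visible in this model: while the detector is still running, even a function that already has a ContextRoot is counted in the flat profile rather than started as a context.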

github-actions bot commented Mar 26, 2025

✅ With the latest revision this PR passed the C/C++ code formatter.

@mtrofin mtrofin changed the title from "RootAutodetect" to "[ctxprof] root autodetection mechanism" Mar 27, 2025
Base automatically changed from users/mtrofin/03-26-_ctxprof_nfc_move_2_implementation_functions_up_in_ctxinstrprofiling.cpp_ to main March 29, 2025 03:53
mtrofin added a commit that referenced this pull request Mar 31, 2025
Most of the functionality will be reused with the auto-root detection mechanism (which is introduced subsequently in PR #133147).
@mtrofin mtrofin marked this pull request as ready for review March 31, 2025 19:32
@llvmbot llvmbot added the compiler-rt, PGO (Profile Guided Optimizations), and llvm:transforms labels Mar 31, 2025
llvmbot commented Mar 31, 2025

@llvm/pr-subscribers-llvm-transforms

@llvm/pr-subscribers-pgo

Author: Mircea Trofin (mtrofin)

Changes


Patch is 33.28 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/133147.diff

11 Files Affected:

  • (modified) compiler-rt/lib/ctx_profile/CMakeLists.txt (+1-1)
  • (modified) compiler-rt/lib/ctx_profile/CtxInstrContextNode.h (+1)
  • (modified) compiler-rt/lib/ctx_profile/CtxInstrProfiling.cpp (+53-23)
  • (modified) compiler-rt/lib/ctx_profile/CtxInstrProfiling.h (+1-1)
  • (modified) compiler-rt/lib/ctx_profile/RootAutoDetector.cpp (+94)
  • (modified) compiler-rt/lib/ctx_profile/RootAutoDetector.h (+31)
  • (added) compiler-rt/test/ctx_profile/TestCases/autodetect-roots.cpp (+188)
  • (modified) compiler-rt/test/ctx_profile/TestCases/generate-context.cpp (+3-2)
  • (modified) llvm/include/llvm/ProfileData/CtxInstrContextNode.h (+1)
  • (modified) llvm/lib/Transforms/Instrumentation/PGOCtxProfLowering.cpp (+17-9)
  • (modified) llvm/test/Transforms/PGOProfile/ctx-instrumentation.ll (+43-7)
diff --git a/compiler-rt/lib/ctx_profile/CMakeLists.txt b/compiler-rt/lib/ctx_profile/CMakeLists.txt
index bb606449c61b1..446ebc96408dd 100644
--- a/compiler-rt/lib/ctx_profile/CMakeLists.txt
+++ b/compiler-rt/lib/ctx_profile/CMakeLists.txt
@@ -27,7 +27,7 @@ endif()
 add_compiler_rt_runtime(clang_rt.ctx_profile
   STATIC
   ARCHS ${CTX_PROFILE_SUPPORTED_ARCH}
-  OBJECT_LIBS RTSanitizerCommon RTSanitizerCommonLibc
+  OBJECT_LIBS RTSanitizerCommon RTSanitizerCommonLibc RTSanitizerCommonSymbolizer
   CFLAGS ${EXTRA_FLAGS}
   SOURCES ${CTX_PROFILE_SOURCES}
   ADDITIONAL_HEADERS ${CTX_PROFILE_HEADERS}
diff --git a/compiler-rt/lib/ctx_profile/CtxInstrContextNode.h b/compiler-rt/lib/ctx_profile/CtxInstrContextNode.h
index a42bf9ebb01ea..55423d95b3088 100644
--- a/compiler-rt/lib/ctx_profile/CtxInstrContextNode.h
+++ b/compiler-rt/lib/ctx_profile/CtxInstrContextNode.h
@@ -127,6 +127,7 @@ class ContextNode final {
 /// MUTEXDECL takes one parameter, the name of a field that is a mutex.
 #define CTXPROF_FUNCTION_DATA(PTRDECL, VOLATILE_PTRDECL, MUTEXDECL)            \
   PTRDECL(FunctionData, Next)                                                  \
+  VOLATILE_PTRDECL(void, EntryAddress)                                         \
   VOLATILE_PTRDECL(ContextRoot, CtxRoot)                                       \
   VOLATILE_PTRDECL(ContextNode, FlatCtx)                                       \
   MUTEXDECL(Mutex)
diff --git a/compiler-rt/lib/ctx_profile/CtxInstrProfiling.cpp b/compiler-rt/lib/ctx_profile/CtxInstrProfiling.cpp
index 10a6a8c1f71e5..d8b6947a62e60 100644
--- a/compiler-rt/lib/ctx_profile/CtxInstrProfiling.cpp
+++ b/compiler-rt/lib/ctx_profile/CtxInstrProfiling.cpp
@@ -7,6 +7,7 @@
 //===----------------------------------------------------------------------===//
 
 #include "CtxInstrProfiling.h"
+#include "RootAutoDetector.h"
 #include "sanitizer_common/sanitizer_allocator_internal.h"
 #include "sanitizer_common/sanitizer_atomic.h"
 #include "sanitizer_common/sanitizer_atomic_clang.h"
@@ -43,6 +44,12 @@ Arena *FlatCtxArena = nullptr;
 __thread bool IsUnderContext = false;
 __sanitizer::atomic_uint8_t ProfilingStarted = {};
 
+__sanitizer::atomic_uintptr_t RootDetector = {};
+RootAutoDetector *getRootDetector() {
+  return reinterpret_cast<RootAutoDetector *>(
+      __sanitizer::atomic_load_relaxed(&RootDetector));
+}
+
 // utility to taint a pointer by setting the LSB. There is an assumption
 // throughout that the addresses of contexts are even (really, they should be
 // align(8), but "even"-ness is the minimum assumption)
@@ -201,7 +208,7 @@ ContextNode *getCallsiteSlow(GUID Guid, ContextNode **InsertionPoint,
   return Ret;
 }
 
-ContextNode *getFlatProfile(FunctionData &Data, GUID Guid,
+ContextNode *getFlatProfile(FunctionData &Data, void *Callee, GUID Guid,
                             uint32_t NumCounters) {
   if (ContextNode *Existing = Data.FlatCtx)
     return Existing;
@@ -232,6 +239,7 @@ ContextNode *getFlatProfile(FunctionData &Data, GUID Guid,
     auto *Ret = allocContextNode(AllocBuff, Guid, NumCounters, 0);
     Data.FlatCtx = Ret;
 
+    Data.EntryAddress = Callee;
     Data.Next = reinterpret_cast<FunctionData *>(
         __sanitizer::atomic_load_relaxed(&AllFunctionsData));
     while (!__sanitizer::atomic_compare_exchange_strong(
@@ -316,27 +324,32 @@ ContextNode *getUnhandledContext(FunctionData &Data, GUID Guid,
   // entered once and never exit. They should be assumed to be entered before
   // profiling starts - because profiling should start after the server is up
   // and running (which is equivalent to "message pumps are set up").
-  ContextRoot *R = __llvm_ctx_profile_current_context_root;
-  if (!R) {
+  if (!CtxRoot) {
+    if (auto *RAD = getRootDetector())
+      RAD->sample();
+    else if (auto *CR = Data.CtxRoot)
+      return tryStartContextGivenRoot(CR, Guid, NumCounters, NumCallsites);
     if (IsUnderContext || !__sanitizer::atomic_load_relaxed(&ProfilingStarted))
       return TheScratchContext;
     else
       return markAsScratch(
-          onContextEnter(*getFlatProfile(Data, Guid, NumCounters)));
+          onContextEnter(*getFlatProfile(Data, Callee, Guid, NumCounters)));
   }
-  auto [Iter, Ins] = R->Unhandled.insert({Guid, nullptr});
+  auto [Iter, Ins] = CtxRoot->Unhandled.insert({Guid, nullptr});
   if (Ins)
-    Iter->second =
-        getCallsiteSlow(Guid, &R->FirstUnhandledCalleeNode, NumCounters, 0);
+    Iter->second = getCallsiteSlow(Guid, &CtxRoot->FirstUnhandledCalleeNode,
+                                   NumCounters, 0);
   return markAsScratch(onContextEnter(*Iter->second));
 }
 
 ContextNode *__llvm_ctx_profile_get_context(FunctionData *Data, void *Callee,
                                             GUID Guid, uint32_t NumCounters,
                                             uint32_t NumCallsites) {
+  auto *CtxRoot = __llvm_ctx_profile_current_context_root;
   // fast "out" if we're not even doing contextual collection.
-  if (!__llvm_ctx_profile_current_context_root)
-    return getUnhandledContext(*Data, Guid, NumCounters);
+  if (!CtxRoot)
+    return getUnhandledContext(*Data, Callee, Guid, NumCounters, NumCallsites,
+                               nullptr);
 
   // also fast "out" if the caller is scratch. We can see if it's scratch by
   // looking at the interior pointer into the subcontexts vector that the caller
@@ -345,7 +358,8 @@ ContextNode *__llvm_ctx_profile_get_context(FunctionData *Data, void *Callee,
   // precisely, aligned - 8 values)
   auto **CallsiteContext = consume(__llvm_ctx_profile_callsite[0]);
   if (!CallsiteContext || isScratch(CallsiteContext))
-    return getUnhandledContext(*Data, Guid, NumCounters);
+    return getUnhandledContext(*Data, Callee, Guid, NumCounters, NumCallsites,
+                               CtxRoot);
 
   // if the callee isn't the expected one, return scratch.
   // Signal handler(s) could have been invoked at any point in the execution.
@@ -363,7 +377,8 @@ ContextNode *__llvm_ctx_profile_get_context(FunctionData *Data, void *Callee,
   // for that case.
   auto *ExpectedCallee = consume(__llvm_ctx_profile_expected_callee[0]);
   if (ExpectedCallee != Callee)
-    return getUnhandledContext(*Data, Guid, NumCounters);
+    return getUnhandledContext(*Data, Callee, Guid, NumCounters, NumCallsites,
+                               CtxRoot);
 
   auto *Callsite = *CallsiteContext;
   // in the case of indirect calls, we will have all seen targets forming a
@@ -385,24 +400,26 @@ ContextNode *__llvm_ctx_profile_get_context(FunctionData *Data, void *Callee,
   return Ret;
 }
 
-ContextNode *__llvm_ctx_profile_start_context(
-    FunctionData *FData, GUID Guid, uint32_t Counters,
-    uint32_t Callsites) SANITIZER_NO_THREAD_SAFETY_ANALYSIS {
+ContextNode *__llvm_ctx_profile_start_context(FunctionData *FData, GUID Guid,
+                                              uint32_t Counters,
+                                              uint32_t Callsites) {
+
   return tryStartContextGivenRoot(FData->getOrAllocateContextRoot(), Guid,
                                   Counters, Callsites);
 }
 
 void __llvm_ctx_profile_release_context(FunctionData *FData)
     SANITIZER_NO_THREAD_SAFETY_ANALYSIS {
+  const auto *CurrentRoot = __llvm_ctx_profile_current_context_root;
+  if (!CurrentRoot || FData->CtxRoot != CurrentRoot)
+    return;
   IsUnderContext = false;
-  if (__llvm_ctx_profile_current_context_root) {
-    __llvm_ctx_profile_current_context_root = nullptr;
-    assert(FData->CtxRoot);
-    FData->CtxRoot->Taken.Unlock();
-  }
+  assert(FData->CtxRoot);
+  __llvm_ctx_profile_current_context_root = nullptr;
+  FData->CtxRoot->Taken.Unlock();
 }
 
-void __llvm_ctx_profile_start_collection() {
+void __llvm_ctx_profile_start_collection(unsigned AutodetectDuration) {
   size_t NumMemUnits = 0;
   __sanitizer::GenericScopedLock<__sanitizer::SpinMutex> Lock(
       &AllContextsMutex);
@@ -418,12 +435,24 @@ void __llvm_ctx_profile_start_collection() {
       resetContextNode(*Root->FirstUnhandledCalleeNode);
     __sanitizer::atomic_store_relaxed(&Root->TotalEntries, 0);
   }
+  if (AutodetectDuration) {
+    auto *RD = new (__sanitizer::InternalAlloc(sizeof(RootAutoDetector)))
+        RootAutoDetector(AllFunctionsData, RootDetector, AutodetectDuration);
+    RD->start();
+  } else {
+    __sanitizer::Printf("[ctxprof] Initial NumMemUnits: %zu \n", NumMemUnits);
+  }
   __sanitizer::atomic_store_relaxed(&ProfilingStarted, true);
-  __sanitizer::Printf("[ctxprof] Initial NumMemUnits: %zu \n", NumMemUnits);
 }
 
 bool __llvm_ctx_profile_fetch(ProfileWriter &Writer) {
   __sanitizer::atomic_store_relaxed(&ProfilingStarted, false);
+  if (auto *RD = getRootDetector()) {
+    __sanitizer::Printf("[ctxprof] Expected the root autodetector to have "
+                        "finished well before attempting to fetch a context");
+    RD->join();
+  }
+
   __sanitizer::GenericScopedLock<__sanitizer::SpinMutex> Lock(
       &AllContextsMutex);
 
@@ -448,8 +477,9 @@ bool __llvm_ctx_profile_fetch(ProfileWriter &Writer) {
   const auto *Pos = reinterpret_cast<const FunctionData *>(
       __sanitizer::atomic_load_relaxed(&AllFunctionsData));
   for (; Pos; Pos = Pos->Next)
-    Writer.writeFlat(Pos->FlatCtx->guid(), Pos->FlatCtx->counters(),
-                     Pos->FlatCtx->counters_size());
+    if (!Pos->CtxRoot)
+      Writer.writeFlat(Pos->FlatCtx->guid(), Pos->FlatCtx->counters(),
+                       Pos->FlatCtx->counters_size());
   Writer.endFlatSection();
   return true;
 }
diff --git a/compiler-rt/lib/ctx_profile/CtxInstrProfiling.h b/compiler-rt/lib/ctx_profile/CtxInstrProfiling.h
index 6326beaa53085..4983f086d230d 100644
--- a/compiler-rt/lib/ctx_profile/CtxInstrProfiling.h
+++ b/compiler-rt/lib/ctx_profile/CtxInstrProfiling.h
@@ -207,7 +207,7 @@ ContextNode *__llvm_ctx_profile_get_context(__ctx_profile::FunctionData *FData,
 
 /// Prepares for collection. Currently this resets counter values but preserves
 /// internal context tree structure.
-void __llvm_ctx_profile_start_collection();
+void __llvm_ctx_profile_start_collection(unsigned AutodetectDuration = 0);
 
 /// Completely free allocated memory.
 void __llvm_ctx_profile_free();
diff --git a/compiler-rt/lib/ctx_profile/RootAutoDetector.cpp b/compiler-rt/lib/ctx_profile/RootAutoDetector.cpp
index 483c55c25eefe..281ce5e33865a 100644
--- a/compiler-rt/lib/ctx_profile/RootAutoDetector.cpp
+++ b/compiler-rt/lib/ctx_profile/RootAutoDetector.cpp
@@ -8,6 +8,7 @@
 
 #include "RootAutoDetector.h"
 
+#include "CtxInstrProfiling.h"
 #include "sanitizer_common/sanitizer_common.h"
 #include "sanitizer_common/sanitizer_placement_new.h" // IWYU pragma: keep (DenseMap)
 #include <assert.h>
@@ -17,6 +18,99 @@
 using namespace __ctx_profile;
 template <typename T> using Set = DenseMap<T, bool>;
 
+namespace __sanitizer {
+void BufferedStackTrace::UnwindImpl(uptr pc, uptr bp, void *context,
+                                    bool request_fast, u32 max_depth) {
+  // We can't implement the fast variant. The fast variant ends up invoking an
+  // external allocator, because of pthread_attr_getstack. If this happens
+  // during an allocation of the program being instrumented, a non-reentrant
+  // lock may be taken (this was observed). The allocator called by
+  // pthread_attr_getstack will also try to take that lock.
+  UnwindSlow(pc, max_depth);
+}
+} // namespace __sanitizer
+
+RootAutoDetector::PerThreadSamples::PerThreadSamples(RootAutoDetector &Parent) {
+  GenericScopedLock<SpinMutex> L(&Parent.AllSamplesMutex);
+  Parent.AllSamples.PushBack(this);
+}
+
+void RootAutoDetector::start() {
+  atomic_store_relaxed(&Self, reinterpret_cast<uintptr_t>(this));
+  pthread_create(
+      &WorkerThread, nullptr,
+      +[](void *Ctx) -> void * {
+        RootAutoDetector *RAD = reinterpret_cast<RootAutoDetector *>(Ctx);
+        SleepForSeconds(RAD->WaitSeconds);
+        // To avoid holding the AllSamplesMutex, make a snapshot of all the
+        // thread samples collected so far
+        Vector<PerThreadSamples *> SamplesSnapshot;
+        {
+          GenericScopedLock<SpinMutex> M(&RAD->AllSamplesMutex);
+          SamplesSnapshot.Resize(RAD->AllSamples.Size());
+          for (uptr I = 0; I < RAD->AllSamples.Size(); ++I)
+            SamplesSnapshot[I] = RAD->AllSamples[I];
+        }
+        DenseMap<uptr, uint64_t> AllRoots;
+        for (uptr I = 0; I < SamplesSnapshot.Size(); ++I) {
+          GenericScopedLock<SpinMutex>(&SamplesSnapshot[I]->M);
+          SamplesSnapshot[I]->TrieRoot.determineRoots().forEach([&](auto &KVP) {
+            auto [FAddr, Count] = KVP;
+            AllRoots[FAddr] += Count;
+            return true;
+          });
+        }
+        // FIXME: as a next step, establish a minimum relative nr of samples
+        // per root that would qualify it as a root.
+        for (auto *FD = reinterpret_cast<FunctionData *>(
+                 atomic_load_relaxed(&RAD->FunctionDataListHead));
+             FD; FD = FD->Next) {
+          if (AllRoots.contains(reinterpret_cast<uptr>(FD->EntryAddress))) {
+            FD->getOrAllocateContextRoot();
+          }
+        }
+        atomic_store_relaxed(&RAD->Self, 0);
+        return nullptr;
+      },
+      this);
+}
+
+void RootAutoDetector::join() { pthread_join(WorkerThread, nullptr); }
+
+void RootAutoDetector::sample() {
+  // tracking reentry in case we want to re-explore fast stack unwind - which
+  // does potentially re-enter the runtime because it calls the instrumented
+  // allocator because of pthread_attr_getstack. See the notes also on
+  // UnwindImpl above.
+  static thread_local bool Entered = false;
+  static thread_local uint64_t Entries = 0;
+  if (Entered || (++Entries % SampleRate))
+    return;
+  Entered = true;
+  collectStack();
+  Entered = false;
+}
+
+void RootAutoDetector::collectStack() {
+  GET_CALLER_PC_BP;
+  BufferedStackTrace CurrentStack;
+  CurrentStack.Unwind(pc, bp, nullptr, false);
+  // 2 stack frames would be very unlikely to mean anything, since at least the
+  // compiler-rt frame - which can't be inlined - should be observable, which
+  // counts as 1; we can be even more aggressive with this number.
+  if (CurrentStack.size <= 2)
+    return;
+  static thread_local PerThreadSamples *ThisThreadSamples =
+      new (__sanitizer::InternalAlloc(sizeof(PerThreadSamples)))
+          PerThreadSamples(*this);
+
+  if (!ThisThreadSamples->M.TryLock())
+    return;
+
+  ThisThreadSamples->TrieRoot.insertStack(CurrentStack);
+  ThisThreadSamples->M.Unlock();
+}
+
 uptr PerThreadCallsiteTrie::getFctStartAddr(uptr CallsiteAddress) const {
   // this requires --linkopt=-Wl,--export-dynamic
   Dl_info Info;
diff --git a/compiler-rt/lib/ctx_profile/RootAutoDetector.h b/compiler-rt/lib/ctx_profile/RootAutoDetector.h
index 85dd5ef1c32d9..5c2abaeb1d0fa 100644
--- a/compiler-rt/lib/ctx_profile/RootAutoDetector.h
+++ b/compiler-rt/lib/ctx_profile/RootAutoDetector.h
@@ -12,6 +12,7 @@
 #include "sanitizer_common/sanitizer_dense_map.h"
 #include "sanitizer_common/sanitizer_internal_defs.h"
 #include "sanitizer_common/sanitizer_stacktrace.h"
+#include "sanitizer_common/sanitizer_vector.h"
 #include <pthread.h>
 #include <sanitizer/common_interface_defs.h>
 
@@ -53,5 +54,35 @@ class PerThreadCallsiteTrie {
   /// thread, together with the number of samples that included them.
   DenseMap<uptr, uint64_t> determineRoots() const;
 };
+
+class RootAutoDetector final {
+  static const uint64_t SampleRate = 6113;
+  const unsigned WaitSeconds;
+  pthread_t WorkerThread;
+
+  struct PerThreadSamples {
+    PerThreadSamples(RootAutoDetector &Parent);
+
+    PerThreadCallsiteTrie TrieRoot;
+    SpinMutex M;
+  };
+  SpinMutex AllSamplesMutex;
+  SANITIZER_GUARDED_BY(AllSamplesMutex)
+  Vector<PerThreadSamples *> AllSamples;
+  atomic_uintptr_t &FunctionDataListHead;
+  atomic_uintptr_t &Self;
+  void collectStack();
+
+public:
+  RootAutoDetector(atomic_uintptr_t &FunctionDataListHead,
+                   atomic_uintptr_t &Self, unsigned WaitSeconds)
+      : WaitSeconds(WaitSeconds), FunctionDataListHead(FunctionDataListHead),
+        Self(Self) {}
+
+  void sample();
+  void start();
+  void join();
+};
+
 } // namespace __ctx_profile
 #endif
diff --git a/compiler-rt/test/ctx_profile/TestCases/autodetect-roots.cpp b/compiler-rt/test/ctx_profile/TestCases/autodetect-roots.cpp
new file mode 100644
index 0000000000000..d4d4eb0230fc6
--- /dev/null
+++ b/compiler-rt/test/ctx_profile/TestCases/autodetect-roots.cpp
@@ -0,0 +1,188 @@
+// Root autodetection test for contextual profiling
+//
+// Copy the header defining ContextNode.
+// RUN: mkdir -p %t_include
+// RUN: cp %llvm_src/include/llvm/ProfileData/CtxInstrContextNode.h %t_include/
+//
+// Compile with ctx instrumentation "on". We use -profile-context-root as signal
+// that we want contextual profiling, but we can specify anything there, that
+// won't be matched with any function, and result in the behavior we are aiming
+// for here.
+//
+// RUN: %clangxx %s %ctxprofilelib -I%t_include -O2 -o %t.bin \
+// RUN:   -mllvm -profile-context-root="<autodetect>" -g -Wl,-export-dynamic
+//
+// Run the binary, and observe the profile fetch handler's output.
+// RUN %t.bin | FileCheck %s
+
+#include "CtxInstrContextNode.h"
+#include <atomic>
+#include <cstdio>
+#include <iostream>
+#include <thread>
+
+using namespace llvm::ctx_profile;
+extern "C" void __llvm_ctx_profile_start_collection(unsigned);
+extern "C" bool __llvm_ctx_profile_fetch(ProfileWriter &);
+
+// avoid name mangling
+extern "C" {
+__attribute__((noinline)) void anotherFunction() {}
+__attribute__((noinline)) void mock1() {}
+__attribute__((noinline)) void mock2() {}
+__attribute__((noinline)) void someFunction(int I) {
+  if (I % 2)
+    mock1();
+  else
+    mock2();
+  anotherFunction();
+}
+
+// block inlining because the pre-inliner otherwise will inline this - it's
+// too small.
+__attribute__((noinline)) void theRoot() {
+  someFunction(1);
+#pragma nounroll
+  for (auto I = 0; I < 2; ++I) {
+    someFunction(I);
+  }
+  anotherFunction();
+}
+}
+
+class TestProfileWriter : public ProfileWriter {
+  void printProfile(const ContextNode &Node, const std::string &Indent,
+                    const std::string &Increment) {
+    std::cout << Indent << "Guid: " << Node.guid() << std::endl;
+    std::cout << Indent << "Entries: " << Node.entrycount() << std::endl;
+    std::cout << Indent << Node.counters_size() << " counters and "
+              << Node.callsites_size() << " callsites" << std::endl;
+    std::cout << Indent << "Counter values: ";
+    for (uint32_t I = 0U; I < Node.counters_size(); ++I)
+      std::cout << Node.counters()[I] << " ";
+    std::cout << std::endl;
+    for (uint32_t I = 0U; I < Node.callsites_size(); ++I)
+      for (const auto *N = Node.subContexts()[I]; N; N = N->next()) {
+        std::cout << Indent << "At Index " << I << ":" << std::endl;
+        printProfile(*N, Indent + Increment, Increment);
+      }
+  }
+
+  void startContextSection() override {
+    std::cout << "Entered Context Section" << std::endl;
+  }
+
+  void endContextSection() override {
+    std::cout << "Exited Context Section" << std::endl;
+  }
+
+  void writeContextual(const ContextNode &RootNode,
+                       const ContextNode *Unhandled,
+                       uint64_t EntryCount) override {
+    std::cout << "Entering Root " << RootNode.guid()
+              << " with total entry count " << EntryCount << std::endl;
+    for (const auto *P = Unhandled; P; P = P->next())
+      std::cout << "Unhandled GUID: " << P->guid() << " entered "
+                << P->entrycount() << " times" << std::endl;
+    printProfile(RootNode, " ", " ");
+  }
+
+  void startFlatSection() override {
+    std::cout << "Entered Flat Section" << std::endl;
+  }
+
+  void writeFlat(GUID Guid, const uint64_t *Buffer,
+                 size_t BufferSize) override {
+    std::cout << "Flat: " << Guid << " " << Buffer[0];
+    for (size_t I = 1U; I < BufferSize; ++I)
+      std::cout << "," << Buffer[I];
+    std::cout << std::endl;
+  };
+
+  void endFlatSection() override {
+    std::cout << "Exited Flat Section" << std::endl;
+  }
+};
+
+// Guid:3950394326069683896 is anotherFunction
+// Guid:6759619411192316602 is someFunction
+// These are expected to be the auto-detected roots. This is because we cannot
+// discerne (with the current aut...
[truncated]
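
The gating logic in RootAutoDetector::sample() from the patch above (take every Nth entry, and never re-enter while a sample is being collected) can be modeled in isolation. The sketch below is a simplification under stated assumptions: SampleGate and trySample are hypothetical names, and the state lives in an object that is assumed to be used by a single thread, whereas the patch keeps it in static thread_local variables.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical single-thread model of the sampling gate: sample every
// Rate-th entry and refuse to re-enter while a sample is in progress.
class SampleGate {
  const uint64_t Rate;
  uint64_t Entries = 0;
  bool Entered = false;

public:
  explicit SampleGate(uint64_t Rate) : Rate(Rate) {
    assert(Rate > 0 && "a zero rate would divide by zero");
  }

  // Runs Collect() and returns true only on every Rate-th non-reentrant call.
  template <typename Fn> bool trySample(Fn &&Collect) {
    // Short-circuit: a reentrant call does not even bump the entry counter.
    if (Entered || (++Entries % Rate))
      return false;
    Entered = true;
    Collect(); // may call back into trySample(); that call is rejected
    Entered = false;
    return true;
  }
};
```

The reentrancy flag is there because collecting a stack may itself end up in instrumented code, as the comments in the patch point out for the fast unwinder. The patch uses a sample rate of 6113; the rate in this model is just a constructor argument.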

@mtrofin mtrofin requested a review from snehasish April 7, 2025 18:05
@snehasish snehasish left a comment

lgtm

mtrofin commented Apr 8, 2025

Merge activity

  • Apr 8, 9:57 AM EDT: A user started a stack merge that includes this pull request via Graphite.
  • Apr 8, 9:59 AM EDT: A user merged this pull request with Graphite.

@mtrofin mtrofin merged commit b2dea4f into main Apr 8, 2025
11 checks passed
@mtrofin mtrofin deleted the users/mtrofin/03-24-rootautodetect branch April 8, 2025 13:59
ilovepi commented Apr 8, 2025

We're seeing some test failures after this patch, would you mind taking a look? This is on linux-x86_64. There isn't too much special in our CMake, but since we use a very recent clang to build it, maybe it's easiest to repro w/ a 2-stage build.

Failing Test:

  • CtxProfile-x86_64-linux :: TestCases/autodetect-roots.cpp
  • CtxProfile-x86_64-linux :: TestCases/generate-context.cpp

Failing Bot:
https://ci.chromium.org/ui/p/fuchsia/builders/toolchain.ci/clang-linux-x64/b8718149220847740529/overview.

mtrofin commented Apr 8, 2025

looking

mtrofin commented Apr 8, 2025

PR #134932 should take care of it
