[ctxprof] root autodetection mechanism #133147
Most of the functionality will be reused with the auto-root detection mechanism (which is introduced subsequently in PR #133147).
@llvm/pr-subscribers-llvm-transforms @llvm/pr-subscribers-pgo

Author: Mircea Trofin (mtrofin)

Changes

This is an optional mechanism that automatically detects roots. It's a best-effort mechanism, and its main goal is to avoid pointing at the message pump function as a root. This is the function that polls message queue(s) in an infinite loop, and is thus a bad root (it never exits).

High-level: when collection is requested - which should happen when a server has already been set up and is handling requests - we spend a bit of time sampling all the server's threads. Each sample is a stack, which we insert in a `PerThreadCallsiteTrie`. After a while, we run the root detection logic for each `PerThreadCallsiteTrie`. We then traverse all the `FunctionData`, find the ones matching the detected roots, and allocate a `ContextRoot` for them. From here, in `__llvm_ctx_profile_get_context`, we special-case `FunctionData` objects that have a `CtxRoot` and route them to `__llvm_ctx_profile_start_context`.

For this to work, on the LLVM side, we need to have all functions call `__llvm_ctx_profile_release_context`, because they might be roots. This comes at a slight (percentages) penalty during collection - which we can afford since the overall technique is ~5x faster than normal instrumentation. We can later explore conditionally enabling auto-root detection and avoiding this penalty, if desired.

Note that functions that `musttail call` can't have their return instrumented this way; a subsequent patch will harden the mechanism against this case.

The mechanism could be used in combination with explicit root specification, too.

Patch is 33.28 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/133147.diff

11 Files Affected:
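To make the high-level description concrete, here is a minimal, self-contained sketch of the sampling side: each sampled stack is inserted into a per-thread trie whose nodes count how many samples passed through them, and detection aggregates per-function counts across threads. All names and the half-of-the-samples threshold below are illustrative assumptions; the patch's `PerThreadCallsiteTrie` / `determineRoots` are the authoritative implementation.

```c++
#include <cstdint>
#include <map>
#include <memory>
#include <vector>

// Illustrative stand-in for PerThreadCallsiteTrie: a trie keyed by frame
// address; each node counts the samples whose stacks passed through it.
struct ToyTrie {
  struct Node {
    uint64_t Count = 0;
    std::map<uintptr_t, std::unique_ptr<Node>> Children;
  };
  Node Root;
  uint64_t NumSamples = 0;

  // Frames are ordered outermost (e.g. main) to innermost.
  void insertStack(const std::vector<uintptr_t> &Frames) {
    ++NumSamples;
    Node *Cur = &Root;
    for (uintptr_t F : Frames) {
      auto &Slot = Cur->Children[F];
      if (!Slot)
        Slot = std::make_unique<Node>();
      Cur = Slot.get();
      ++Cur->Count;
    }
  }

  // Toy detection: report functions observed on at least half the samples.
  // The real determineRoots() heuristic differs; this only shows the shape
  // of the result (function address -> number of samples including it).
  std::map<uintptr_t, uint64_t> determineRoots() const {
    std::map<uintptr_t, uint64_t> Roots;
    collect(Root, Roots);
    return Roots;
  }

private:
  void collect(const Node &N, std::map<uintptr_t, uint64_t> &Out) const {
    for (const auto &[Addr, Child] : N.Children) {
      if (Child->Count * 2 >= NumSamples)
        Out[Addr] += Child->Count;
      collect(*Child, Out);
    }
  }
};

// Cross-thread aggregation, mirroring what the worker thread in this patch
// does with each thread's results before marking FunctionData entries.
std::map<uintptr_t, uint64_t> aggregate(const std::vector<ToyTrie> &PerThread) {
  std::map<uintptr_t, uint64_t> All;
  for (const auto &T : PerThread)
    for (const auto &[Addr, Count] : T.determineRoots())
      All[Addr] += Count;
  return All;
}
```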
diff --git a/compiler-rt/lib/ctx_profile/CMakeLists.txt b/compiler-rt/lib/ctx_profile/CMakeLists.txt
index bb606449c61b1..446ebc96408dd 100644
--- a/compiler-rt/lib/ctx_profile/CMakeLists.txt
+++ b/compiler-rt/lib/ctx_profile/CMakeLists.txt
@@ -27,7 +27,7 @@ endif()
add_compiler_rt_runtime(clang_rt.ctx_profile
STATIC
ARCHS ${CTX_PROFILE_SUPPORTED_ARCH}
- OBJECT_LIBS RTSanitizerCommon RTSanitizerCommonLibc
+ OBJECT_LIBS RTSanitizerCommon RTSanitizerCommonLibc RTSanitizerCommonSymbolizer
CFLAGS ${EXTRA_FLAGS}
SOURCES ${CTX_PROFILE_SOURCES}
ADDITIONAL_HEADERS ${CTX_PROFILE_HEADERS}
diff --git a/compiler-rt/lib/ctx_profile/CtxInstrContextNode.h b/compiler-rt/lib/ctx_profile/CtxInstrContextNode.h
index a42bf9ebb01ea..55423d95b3088 100644
--- a/compiler-rt/lib/ctx_profile/CtxInstrContextNode.h
+++ b/compiler-rt/lib/ctx_profile/CtxInstrContextNode.h
@@ -127,6 +127,7 @@ class ContextNode final {
/// MUTEXDECL takes one parameter, the name of a field that is a mutex.
#define CTXPROF_FUNCTION_DATA(PTRDECL, VOLATILE_PTRDECL, MUTEXDECL) \
PTRDECL(FunctionData, Next) \
+ VOLATILE_PTRDECL(void, EntryAddress) \
VOLATILE_PTRDECL(ContextRoot, CtxRoot) \
VOLATILE_PTRDECL(ContextNode, FlatCtx) \
MUTEXDECL(Mutex)
diff --git a/compiler-rt/lib/ctx_profile/CtxInstrProfiling.cpp b/compiler-rt/lib/ctx_profile/CtxInstrProfiling.cpp
index 10a6a8c1f71e5..d8b6947a62e60 100644
--- a/compiler-rt/lib/ctx_profile/CtxInstrProfiling.cpp
+++ b/compiler-rt/lib/ctx_profile/CtxInstrProfiling.cpp
@@ -7,6 +7,7 @@
//===----------------------------------------------------------------------===//
#include "CtxInstrProfiling.h"
+#include "RootAutoDetector.h"
#include "sanitizer_common/sanitizer_allocator_internal.h"
#include "sanitizer_common/sanitizer_atomic.h"
#include "sanitizer_common/sanitizer_atomic_clang.h"
@@ -43,6 +44,12 @@ Arena *FlatCtxArena = nullptr;
__thread bool IsUnderContext = false;
__sanitizer::atomic_uint8_t ProfilingStarted = {};
+__sanitizer::atomic_uintptr_t RootDetector = {};
+RootAutoDetector *getRootDetector() {
+ return reinterpret_cast<RootAutoDetector *>(
+ __sanitizer::atomic_load_relaxed(&RootDetector));
+}
+
// utility to taint a pointer by setting the LSB. There is an assumption
// throughout that the addresses of contexts are even (really, they should be
// align(8), but "even"-ness is the minimum assumption)
@@ -201,7 +208,7 @@ ContextNode *getCallsiteSlow(GUID Guid, ContextNode **InsertionPoint,
return Ret;
}
-ContextNode *getFlatProfile(FunctionData &Data, GUID Guid,
+ContextNode *getFlatProfile(FunctionData &Data, void *Callee, GUID Guid,
uint32_t NumCounters) {
if (ContextNode *Existing = Data.FlatCtx)
return Existing;
@@ -232,6 +239,7 @@ ContextNode *getFlatProfile(FunctionData &Data, GUID Guid,
auto *Ret = allocContextNode(AllocBuff, Guid, NumCounters, 0);
Data.FlatCtx = Ret;
+ Data.EntryAddress = Callee;
Data.Next = reinterpret_cast<FunctionData *>(
__sanitizer::atomic_load_relaxed(&AllFunctionsData));
while (!__sanitizer::atomic_compare_exchange_strong(
@@ -316,27 +324,32 @@ ContextNode *getUnhandledContext(FunctionData &Data, GUID Guid,
// entered once and never exit. They should be assumed to be entered before
// profiling starts - because profiling should start after the server is up
// and running (which is equivalent to "message pumps are set up").
- ContextRoot *R = __llvm_ctx_profile_current_context_root;
- if (!R) {
+ if (!CtxRoot) {
+ if (auto *RAD = getRootDetector())
+ RAD->sample();
+ else if (auto *CR = Data.CtxRoot)
+ return tryStartContextGivenRoot(CR, Guid, NumCounters, NumCallsites);
if (IsUnderContext || !__sanitizer::atomic_load_relaxed(&ProfilingStarted))
return TheScratchContext;
else
return markAsScratch(
- onContextEnter(*getFlatProfile(Data, Guid, NumCounters)));
+ onContextEnter(*getFlatProfile(Data, Callee, Guid, NumCounters)));
}
- auto [Iter, Ins] = R->Unhandled.insert({Guid, nullptr});
+ auto [Iter, Ins] = CtxRoot->Unhandled.insert({Guid, nullptr});
if (Ins)
- Iter->second =
- getCallsiteSlow(Guid, &R->FirstUnhandledCalleeNode, NumCounters, 0);
+ Iter->second = getCallsiteSlow(Guid, &CtxRoot->FirstUnhandledCalleeNode,
+ NumCounters, 0);
return markAsScratch(onContextEnter(*Iter->second));
}
ContextNode *__llvm_ctx_profile_get_context(FunctionData *Data, void *Callee,
GUID Guid, uint32_t NumCounters,
uint32_t NumCallsites) {
+ auto *CtxRoot = __llvm_ctx_profile_current_context_root;
// fast "out" if we're not even doing contextual collection.
- if (!__llvm_ctx_profile_current_context_root)
- return getUnhandledContext(*Data, Guid, NumCounters);
+ if (!CtxRoot)
+ return getUnhandledContext(*Data, Callee, Guid, NumCounters, NumCallsites,
+ nullptr);
// also fast "out" if the caller is scratch. We can see if it's scratch by
// looking at the interior pointer into the subcontexts vector that the caller
@@ -345,7 +358,8 @@ ContextNode *__llvm_ctx_profile_get_context(FunctionData *Data, void *Callee,
// precisely, aligned - 8 values)
auto **CallsiteContext = consume(__llvm_ctx_profile_callsite[0]);
if (!CallsiteContext || isScratch(CallsiteContext))
- return getUnhandledContext(*Data, Guid, NumCounters);
+ return getUnhandledContext(*Data, Callee, Guid, NumCounters, NumCallsites,
+ CtxRoot);
// if the callee isn't the expected one, return scratch.
// Signal handler(s) could have been invoked at any point in the execution.
@@ -363,7 +377,8 @@ ContextNode *__llvm_ctx_profile_get_context(FunctionData *Data, void *Callee,
// for that case.
auto *ExpectedCallee = consume(__llvm_ctx_profile_expected_callee[0]);
if (ExpectedCallee != Callee)
- return getUnhandledContext(*Data, Guid, NumCounters);
+ return getUnhandledContext(*Data, Callee, Guid, NumCounters, NumCallsites,
+ CtxRoot);
auto *Callsite = *CallsiteContext;
// in the case of indirect calls, we will have all seen targets forming a
@@ -385,24 +400,26 @@ ContextNode *__llvm_ctx_profile_get_context(FunctionData *Data, void *Callee,
return Ret;
}
-ContextNode *__llvm_ctx_profile_start_context(
- FunctionData *FData, GUID Guid, uint32_t Counters,
- uint32_t Callsites) SANITIZER_NO_THREAD_SAFETY_ANALYSIS {
+ContextNode *__llvm_ctx_profile_start_context(FunctionData *FData, GUID Guid,
+ uint32_t Counters,
+ uint32_t Callsites) {
+
return tryStartContextGivenRoot(FData->getOrAllocateContextRoot(), Guid,
Counters, Callsites);
}
void __llvm_ctx_profile_release_context(FunctionData *FData)
SANITIZER_NO_THREAD_SAFETY_ANALYSIS {
+ const auto *CurrentRoot = __llvm_ctx_profile_current_context_root;
+ if (!CurrentRoot || FData->CtxRoot != CurrentRoot)
+ return;
IsUnderContext = false;
- if (__llvm_ctx_profile_current_context_root) {
- __llvm_ctx_profile_current_context_root = nullptr;
- assert(FData->CtxRoot);
- FData->CtxRoot->Taken.Unlock();
- }
+ assert(FData->CtxRoot);
+ __llvm_ctx_profile_current_context_root = nullptr;
+ FData->CtxRoot->Taken.Unlock();
}
-void __llvm_ctx_profile_start_collection() {
+void __llvm_ctx_profile_start_collection(unsigned AutodetectDuration) {
size_t NumMemUnits = 0;
__sanitizer::GenericScopedLock<__sanitizer::SpinMutex> Lock(
&AllContextsMutex);
@@ -418,12 +435,24 @@ void __llvm_ctx_profile_start_collection() {
resetContextNode(*Root->FirstUnhandledCalleeNode);
__sanitizer::atomic_store_relaxed(&Root->TotalEntries, 0);
}
+ if (AutodetectDuration) {
+ auto *RD = new (__sanitizer::InternalAlloc(sizeof(RootAutoDetector)))
+ RootAutoDetector(AllFunctionsData, RootDetector, AutodetectDuration);
+ RD->start();
+ } else {
+ __sanitizer::Printf("[ctxprof] Initial NumMemUnits: %zu \n", NumMemUnits);
+ }
__sanitizer::atomic_store_relaxed(&ProfilingStarted, true);
- __sanitizer::Printf("[ctxprof] Initial NumMemUnits: %zu \n", NumMemUnits);
}
bool __llvm_ctx_profile_fetch(ProfileWriter &Writer) {
__sanitizer::atomic_store_relaxed(&ProfilingStarted, false);
+ if (auto *RD = getRootDetector()) {
+ __sanitizer::Printf("[ctxprof] Expected the root autodetector to have "
+ "finished well before attempting to fetch a context");
+ RD->join();
+ }
+
__sanitizer::GenericScopedLock<__sanitizer::SpinMutex> Lock(
&AllContextsMutex);
@@ -448,8 +477,9 @@ bool __llvm_ctx_profile_fetch(ProfileWriter &Writer) {
const auto *Pos = reinterpret_cast<const FunctionData *>(
__sanitizer::atomic_load_relaxed(&AllFunctionsData));
for (; Pos; Pos = Pos->Next)
- Writer.writeFlat(Pos->FlatCtx->guid(), Pos->FlatCtx->counters(),
- Pos->FlatCtx->counters_size());
+ if (!Pos->CtxRoot)
+ Writer.writeFlat(Pos->FlatCtx->guid(), Pos->FlatCtx->counters(),
+ Pos->FlatCtx->counters_size());
Writer.endFlatSection();
return true;
}
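A side note on the `markAsScratch` / `isScratch` calls visible in the hunks above: per the comment near the top of this file, context addresses are assumed to be at least even (really, align(8)), so the pointer's low bit is free to carry a "scratch" flag. A minimal sketch of that tainting pattern, with illustrative names and signatures:

```c++
#include <cassert>
#include <cstdint>

// Set the LSB to taint a pointer as "scratch"; assumes even addresses.
template <typename T> T *markAsScratch(T *Ptr) {
  auto V = reinterpret_cast<uintptr_t>(Ptr);
  assert((V & 1) == 0 && "context addresses are expected to be even");
  return reinterpret_cast<T *>(V | 1);
}

// Check the taint bit.
template <typename T> bool isScratch(const T *Ptr) {
  return reinterpret_cast<uintptr_t>(Ptr) & 1;
}

// Clear the taint bit to recover the usable pointer.
template <typename T> T *untaint(T *Ptr) {
  return reinterpret_cast<T *>(reinterpret_cast<uintptr_t>(Ptr) &
                               ~static_cast<uintptr_t>(1));
}
```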
diff --git a/compiler-rt/lib/ctx_profile/CtxInstrProfiling.h b/compiler-rt/lib/ctx_profile/CtxInstrProfiling.h
index 6326beaa53085..4983f086d230d 100644
--- a/compiler-rt/lib/ctx_profile/CtxInstrProfiling.h
+++ b/compiler-rt/lib/ctx_profile/CtxInstrProfiling.h
@@ -207,7 +207,7 @@ ContextNode *__llvm_ctx_profile_get_context(__ctx_profile::FunctionData *FData,
/// Prepares for collection. Currently this resets counter values but preserves
/// internal context tree structure.
-void __llvm_ctx_profile_start_collection();
+void __llvm_ctx_profile_start_collection(unsigned AutodetectDuration = 0);
/// Completely free allocated memory.
void __llvm_ctx_profile_free();
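Given the two declarations above, a host application could drive a collection cycle roughly as follows. The 5-second autodetection window is an arbitrary illustration, and `Writer` stands for any `ProfileWriter` implementation (the test case later on this page defines one, `TestProfileWriter`):

```c++
#include "CtxInstrContextNode.h" // defines llvm::ctx_profile::ProfileWriter

using llvm::ctx_profile::ProfileWriter;

extern "C" void __llvm_ctx_profile_start_collection(unsigned AutodetectDuration);
extern "C" bool __llvm_ctx_profile_fetch(ProfileWriter &);

void collectOnWarmServer(ProfileWriter &Writer) {
  // Call once the server is up and handling requests. A nonzero argument
  // enables root autodetection: threads are sampled for that many seconds
  // before roots are picked and contextual collection effectively begins.
  __llvm_ctx_profile_start_collection(/*AutodetectDuration=*/5);

  // ... let the workload run, well past the autodetection window ...

  // Fetch the profile: contextual trees for the detected roots, then the
  // flat profiles for everything else.
  if (!__llvm_ctx_profile_fetch(Writer)) {
    // handle fetch failure
  }
}
```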
diff --git a/compiler-rt/lib/ctx_profile/RootAutoDetector.cpp b/compiler-rt/lib/ctx_profile/RootAutoDetector.cpp
index 483c55c25eefe..281ce5e33865a 100644
--- a/compiler-rt/lib/ctx_profile/RootAutoDetector.cpp
+++ b/compiler-rt/lib/ctx_profile/RootAutoDetector.cpp
@@ -8,6 +8,7 @@
#include "RootAutoDetector.h"
+#include "CtxInstrProfiling.h"
#include "sanitizer_common/sanitizer_common.h"
#include "sanitizer_common/sanitizer_placement_new.h" // IWYU pragma: keep (DenseMap)
#include <assert.h>
@@ -17,6 +18,99 @@
using namespace __ctx_profile;
template <typename T> using Set = DenseMap<T, bool>;
+namespace __sanitizer {
+void BufferedStackTrace::UnwindImpl(uptr pc, uptr bp, void *context,
+ bool request_fast, u32 max_depth) {
+ // We can't implement the fast variant. The fast variant ends up invoking an
+ // external allocator, because of pthread_attr_getstack. If this happens
+ // during an allocation of the program being instrumented, a non-reentrant
+ // lock may be taken (this was observed). The allocator called by
+ // pthread_attr_getstack will also try to take that lock.
+ UnwindSlow(pc, max_depth);
+}
+} // namespace __sanitizer
+
+RootAutoDetector::PerThreadSamples::PerThreadSamples(RootAutoDetector &Parent) {
+ GenericScopedLock<SpinMutex> L(&Parent.AllSamplesMutex);
+ Parent.AllSamples.PushBack(this);
+}
+
+void RootAutoDetector::start() {
+ atomic_store_relaxed(&Self, reinterpret_cast<uintptr_t>(this));
+ pthread_create(
+ &WorkerThread, nullptr,
+ +[](void *Ctx) -> void * {
+ RootAutoDetector *RAD = reinterpret_cast<RootAutoDetector *>(Ctx);
+ SleepForSeconds(RAD->WaitSeconds);
+ // To avoid holding the AllSamplesMutex, make a snapshot of all the
+ // thread samples collected so far
+ Vector<PerThreadSamples *> SamplesSnapshot;
+ {
+ GenericScopedLock<SpinMutex> M(&RAD->AllSamplesMutex);
+ SamplesSnapshot.Resize(RAD->AllSamples.Size());
+ for (uptr I = 0; I < RAD->AllSamples.Size(); ++I)
+ SamplesSnapshot[I] = RAD->AllSamples[I];
+ }
+ DenseMap<uptr, uint64_t> AllRoots;
+ for (uptr I = 0; I < SamplesSnapshot.Size(); ++I) {
+ GenericScopedLock<SpinMutex> L(&SamplesSnapshot[I]->M);
+ SamplesSnapshot[I]->TrieRoot.determineRoots().forEach([&](auto &KVP) {
+ auto [FAddr, Count] = KVP;
+ AllRoots[FAddr] += Count;
+ return true;
+ });
+ }
+ // FIXME: as a next step, establish a minimum relative nr of samples
+ // per root that would qualify it as a root.
+ for (auto *FD = reinterpret_cast<FunctionData *>(
+ atomic_load_relaxed(&RAD->FunctionDataListHead));
+ FD; FD = FD->Next) {
+ if (AllRoots.contains(reinterpret_cast<uptr>(FD->EntryAddress))) {
+ FD->getOrAllocateContextRoot();
+ }
+ }
+ atomic_store_relaxed(&RAD->Self, 0);
+ return nullptr;
+ },
+ this);
+}
+
+void RootAutoDetector::join() { pthread_join(WorkerThread, nullptr); }
+
+void RootAutoDetector::sample() {
+ // tracking reentry in case we want to re-explore fast stack unwind - which
+ // does potentially re-enter the runtime because it calls the instrumented
+ // allocator because of pthread_attr_getstack. See the notes also on
+ // UnwindImpl above.
+ static thread_local bool Entered = false;
+ static thread_local uint64_t Entries = 0;
+ if (Entered || (++Entries % SampleRate))
+ return;
+ Entered = true;
+ collectStack();
+ Entered = false;
+}
+
+void RootAutoDetector::collectStack() {
+ GET_CALLER_PC_BP;
+ BufferedStackTrace CurrentStack;
+ CurrentStack.Unwind(pc, bp, nullptr, false);
+ // 2 stack frames would be very unlikely to mean anything, since at least the
+ // compiler-rt frame - which can't be inlined - should be observable, which
+ // counts as 1; we can be even more aggressive with this number.
+ if (CurrentStack.size <= 2)
+ return;
+ static thread_local PerThreadSamples *ThisThreadSamples =
+ new (__sanitizer::InternalAlloc(sizeof(PerThreadSamples)))
+ PerThreadSamples(*this);
+
+ if (!ThisThreadSamples->M.TryLock())
+ return;
+
+ ThisThreadSamples->TrieRoot.insertStack(CurrentStack);
+ ThisThreadSamples->M.Unlock();
+}
+
uptr PerThreadCallsiteTrie::getFctStartAddr(uptr CallsiteAddress) const {
// this requires --linkopt=-Wl,--export-dynamic
Dl_info Info;
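`sample()` above pairs a thread-local reentrancy flag with a modulo counter, so only one call in `SampleRate` (6113, per the header below) pays for a stack unwind, and the unwind can never recursively trigger another sample on the same thread. The guard in isolation, with an illustrative payload:

```c++
#include <cstdint>

constexpr uint64_t SampleRate = 6113; // value from RootAutoDetector.h

// Stand-in for collectStack(): unwind the stack and record it.
static void expensiveSampleWork() { /* ... */ }

void maybeSample() {
  static thread_local bool Entered = false; // reentrancy guard
  static thread_local uint64_t Entries = 0; // per-thread call counter
  // Bail if this thread is already sampling, or if this isn't the
  // 1-in-SampleRate call (counter not a multiple of the rate).
  if (Entered || (++Entries % SampleRate))
    return;
  Entered = true;
  expensiveSampleWork();
  Entered = false;
}
```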
diff --git a/compiler-rt/lib/ctx_profile/RootAutoDetector.h b/compiler-rt/lib/ctx_profile/RootAutoDetector.h
index 85dd5ef1c32d9..5c2abaeb1d0fa 100644
--- a/compiler-rt/lib/ctx_profile/RootAutoDetector.h
+++ b/compiler-rt/lib/ctx_profile/RootAutoDetector.h
@@ -12,6 +12,7 @@
#include "sanitizer_common/sanitizer_dense_map.h"
#include "sanitizer_common/sanitizer_internal_defs.h"
#include "sanitizer_common/sanitizer_stacktrace.h"
+#include "sanitizer_common/sanitizer_vector.h"
#include <pthread.h>
#include <sanitizer/common_interface_defs.h>
@@ -53,5 +54,35 @@ class PerThreadCallsiteTrie {
/// thread, together with the number of samples that included them.
DenseMap<uptr, uint64_t> determineRoots() const;
};
+
+class RootAutoDetector final {
+ static const uint64_t SampleRate = 6113;
+ const unsigned WaitSeconds;
+ pthread_t WorkerThread;
+
+ struct PerThreadSamples {
+ PerThreadSamples(RootAutoDetector &Parent);
+
+ PerThreadCallsiteTrie TrieRoot;
+ SpinMutex M;
+ };
+ SpinMutex AllSamplesMutex;
+ SANITIZER_GUARDED_BY(AllSamplesMutex)
+ Vector<PerThreadSamples *> AllSamples;
+ atomic_uintptr_t &FunctionDataListHead;
+ atomic_uintptr_t &Self;
+ void collectStack();
+
+public:
+ RootAutoDetector(atomic_uintptr_t &FunctionDataListHead,
+ atomic_uintptr_t &Self, unsigned WaitSeconds)
+ : WaitSeconds(WaitSeconds), FunctionDataListHead(FunctionDataListHead),
+ Self(Self) {}
+
+ void sample();
+ void start();
+ void join();
+};
+
} // namespace __ctx_profile
#endif
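Tying this interface back to the runtime hunks earlier in the patch: `__llvm_ctx_profile_start_collection` publishes and starts the detector, `getUnhandledContext` feeds it samples while it is live, and `__llvm_ctx_profile_fetch` joins it if it is (unexpectedly) still running. Below is a condensed, illustrative model of that lifecycle, using standard threading in place of the sanitizer primitives; names and bodies are stand-ins, not the real implementation.

```c++
#include <atomic>
#include <chrono>
#include <thread>

struct DetectorSketch {
  unsigned WaitSeconds;
  std::thread Worker;

  explicit DetectorSketch(unsigned S) : WaitSeconds(S) {}

  void start(std::atomic<DetectorSketch *> &Self) {
    Self.store(this); // publish first, so sample() can find the detector
    Worker = std::thread([this, &Self] {
      std::this_thread::sleep_for(std::chrono::seconds(WaitSeconds));
      // ... snapshot the per-thread tries, determine roots, and allocate
      // ContextRoots for the matching FunctionData entries ...
      Self.store(nullptr); // detection done; sampling stops
    });
  }
  void sample() { /* insert the current stack into a per-thread trie */ }
  void join() { Worker.join(); }
};

std::atomic<DetectorSketch *> Detector{nullptr};

// The essence of __llvm_ctx_profile_start_collection(AutodetectDuration).
// Like the runtime (which InternalAlloc()s it), we never free the detector.
void startCollectionSketch(unsigned AutodetectDuration) {
  if (AutodetectDuration)
    (new DetectorSketch(AutodetectDuration))->start(Detector);
}

// The first step of getUnhandledContext in the patch:
void onUnhandledCallSketch() {
  if (auto *RD = Detector.load())
    RD->sample();
}

// __llvm_ctx_profile_fetch's guard against a still-running detector:
void fetchSketch() {
  if (auto *RD = Detector.load())
    RD->join(); // unexpected: detection should end well before any fetch
}
```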
diff --git a/compiler-rt/test/ctx_profile/TestCases/autodetect-roots.cpp b/compiler-rt/test/ctx_profile/TestCases/autodetect-roots.cpp
new file mode 100644
index 0000000000000..d4d4eb0230fc6
--- /dev/null
+++ b/compiler-rt/test/ctx_profile/TestCases/autodetect-roots.cpp
@@ -0,0 +1,188 @@
+// Root autodetection test for contextual profiling
+//
+// Copy the header defining ContextNode.
+// RUN: mkdir -p %t_include
+// RUN: cp %llvm_src/include/llvm/ProfileData/CtxInstrContextNode.h %t_include/
+//
+// Compile with ctx instrumentation "on". We use -profile-context-root as signal
+// that we want contextual profiling, but we can specify anything there, that
+// won't be matched with any function, and result in the behavior we are aiming
+// for here.
+//
+// RUN: %clangxx %s %ctxprofilelib -I%t_include -O2 -o %t.bin \
+// RUN: -mllvm -profile-context-root="<autodetect>" -g -Wl,-export-dynamic
+//
+// Run the binary, and observe the profile fetch handler's output.
+// RUN: %t.bin | FileCheck %s
+
+#include "CtxInstrContextNode.h"
+#include <atomic>
+#include <cstdio>
+#include <iostream>
+#include <thread>
+
+using namespace llvm::ctx_profile;
+extern "C" void __llvm_ctx_profile_start_collection(unsigned);
+extern "C" bool __llvm_ctx_profile_fetch(ProfileWriter &);
+
+// avoid name mangling
+extern "C" {
+__attribute__((noinline)) void anotherFunction() {}
+__attribute__((noinline)) void mock1() {}
+__attribute__((noinline)) void mock2() {}
+__attribute__((noinline)) void someFunction(int I) {
+ if (I % 2)
+ mock1();
+ else
+ mock2();
+ anotherFunction();
+}
+
+// block inlining because the pre-inliner otherwise will inline this - it's
+// too small.
+__attribute__((noinline)) void theRoot() {
+ someFunction(1);
+#pragma nounroll
+ for (auto I = 0; I < 2; ++I) {
+ someFunction(I);
+ }
+ anotherFunction();
+}
+}
+
+class TestProfileWriter : public ProfileWriter {
+ void printProfile(const ContextNode &Node, const std::string &Indent,
+ const std::string &Increment) {
+ std::cout << Indent << "Guid: " << Node.guid() << std::endl;
+ std::cout << Indent << "Entries: " << Node.entrycount() << std::endl;
+ std::cout << Indent << Node.counters_size() << " counters and "
+ << Node.callsites_size() << " callsites" << std::endl;
+ std::cout << Indent << "Counter values: ";
+ for (uint32_t I = 0U; I < Node.counters_size(); ++I)
+ std::cout << Node.counters()[I] << " ";
+ std::cout << std::endl;
+ for (uint32_t I = 0U; I < Node.callsites_size(); ++I)
+ for (const auto *N = Node.subContexts()[I]; N; N = N->next()) {
+ std::cout << Indent << "At Index " << I << ":" << std::endl;
+ printProfile(*N, Indent + Increment, Increment);
+ }
+ }
+
+ void startContextSection() override {
+ std::cout << "Entered Context Section" << std::endl;
+ }
+
+ void endContextSection() override {
+ std::cout << "Exited Context Section" << std::endl;
+ }
+
+ void writeContextual(const ContextNode &RootNode,
+ const ContextNode *Unhandled,
+ uint64_t EntryCount) override {
+ std::cout << "Entering Root " << RootNode.guid()
+ << " with total entry count " << EntryCount << std::endl;
+ for (const auto *P = Unhandled; P; P = P->next())
+ std::cout << "Unhandled GUID: " << P->guid() << " entered "
+ << P->entrycount() << " times" << std::endl;
+ printProfile(RootNode, " ", " ");
+ }
+
+ void startFlatSection() override {
+ std::cout << "Entered Flat Section" << std::endl;
+ }
+
+ void writeFlat(GUID Guid, const uint64_t *Buffer,
+ size_t BufferSize) override {
+ std::cout << "Flat: " << Guid << " " << Buffer[0];
+ for (size_t I = 1U; I < BufferSize; ++I)
+ std::cout << "," << Buffer[I];
+ std::cout << std::endl;
+ };
+
+ void endFlatSection() override {
+ std::cout << "Exited Flat Section" << std::endl;
+ }
+};
+
+// Guid:3950394326069683896 is anotherFunction
+// Guid:6759619411192316602 is someFunction
+// These are expected to be the auto-detected roots. This is because we cannot
+// discern (with the current aut...
[truncated]
lgtm
We're seeing some test failures after this patch, would you mind taking a look? This is on linux-x86_64. There isn't much special in our CMake, but since we use a very recent clang to build it, maybe it's easiest to repro with a 2-stage build.
Failing Test:
Failing Bot:
looking
PR #134932 should take care of it