Skip to content

[memprof] Speed up caller-callee pair extraction #116184

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

kazutakahirata
Copy link
Contributor

We know that the MemProf profile has a lot of duplicate call stacks.
Extracting caller-callee pairs from a call stack we've seen before is
a wasteful effort.

This patch makes the extraction more efficient by first coming up with
a work list of linear call stack IDs -- the set of starting positions
in the radix tree array -- and then extract caller-callee pairs from
each call stack in the work list.

We implement the work list as a bit vector because we expect the work
list to be dense in the range [0, RadixTreeSize). Also, we want the
set insertion to be cheap.

Without this patch, it takes 25 seconds to extract caller-callee pairs
from a large MemProf profile. This patch shortenes that down to 4
seconds.

We know that the MemProf profile has a lot of duplicate call stacks.
Extracting caller-callee pairs from a call stack we've seen before is
a wasteful effort.

This patch makes the extraction more efficient by first coming up with
a work list of linear call stack IDs -- the set of starting positions
in the radix tree array -- and then extract caller-callee pairs from
each call stack in the work list.

We implement the work list as a bit vector because we expect the work
list to be dense in the range [0, RadixTreeSize).  Also, we want the
set insertion to be cheap.

Without this patch, it takes 25 seconds to extract caller-callee pairs
from a large MemProf profile.  This patch shortenes that down to 4
seconds.
@llvmbot llvmbot added the PGO Profile Guided Optimizations label Nov 14, 2024
@llvmbot
Copy link
Member

llvmbot commented Nov 14, 2024

@llvm/pr-subscribers-pgo

Author: Kazu Hirata (kazutakahirata)

Changes

We know that the MemProf profile has a lot of duplicate call stacks.
Extracting caller-callee pairs from a call stack we've seen before is
a wasteful effort.

This patch makes the extraction more efficient by first coming up with
a work list of linear call stack IDs -- the set of starting positions
in the radix tree array -- and then extract caller-callee pairs from
each call stack in the work list.

We implement the work list as a bit vector because we expect the work
list to be dense in the range [0, RadixTreeSize). Also, we want the
set insertion to be cheap.

Without this patch, it takes 25 seconds to extract caller-callee pairs
from a large MemProf profile. This patch shortenes that down to 4
seconds.


Full diff: https://github.com/llvm/llvm-project/pull/116184.diff

2 Files Affected:

  • (modified) llvm/include/llvm/ProfileData/InstrProfReader.h (+2)
  • (modified) llvm/lib/ProfileData/InstrProfReader.cpp (+16-1)
diff --git a/llvm/include/llvm/ProfileData/InstrProfReader.h b/llvm/include/llvm/ProfileData/InstrProfReader.h
index 42414bc193bc84..1930cc3f5c2c30 100644
--- a/llvm/include/llvm/ProfileData/InstrProfReader.h
+++ b/llvm/include/llvm/ProfileData/InstrProfReader.h
@@ -683,6 +683,8 @@ class IndexedMemProfReader {
   const unsigned char *FrameBase = nullptr;
   /// The starting address of the call stack array.
   const unsigned char *CallStackBase = nullptr;
+  // The number of elements in the radix tree array.
+  unsigned RadixTreeSize = 0;
 
   Error deserializeV012(const unsigned char *Start, const unsigned char *Ptr,
                         uint64_t FirstWord);
diff --git a/llvm/lib/ProfileData/InstrProfReader.cpp b/llvm/lib/ProfileData/InstrProfReader.cpp
index cae6ce5b824e62..88148962dbc390 100644
--- a/llvm/lib/ProfileData/InstrProfReader.cpp
+++ b/llvm/lib/ProfileData/InstrProfReader.cpp
@@ -1303,6 +1303,10 @@ Error IndexedMemProfReader::deserializeV3(const unsigned char *Start,
   FrameBase = Ptr;
   CallStackBase = Start + CallStackPayloadOffset;
 
+  // Compute the number of elements in the radix tree array.
+  RadixTreeSize = (RecordPayloadOffset - CallStackPayloadOffset) /
+                  sizeof(memprof::LinearFrameId);
+
   // Now initialize the table reader with a pointer into data buffer.
   MemProfRecordTable.reset(MemProfRecordHashTable::Create(
       /*Buckets=*/Start + RecordTableOffset,
@@ -1674,11 +1678,22 @@ IndexedMemProfReader::getMemProfCallerCalleePairs() const {
   memprof::LinearFrameIdConverter FrameIdConv(FrameBase);
   memprof::CallerCalleePairExtractor Extractor(CallStackBase, FrameIdConv);
 
+  // The set of linear call stack IDs that we need to traverse from.  We expect
+  // the set to be dense, so we use a BitVector.
+  BitVector Worklist(RadixTreeSize);
+
+  // Collect the set of linear call stack IDs.  Since we expect a lot of
+  // duplicates, we first collect them in the form of a bit vector before
+  // processing them.
   for (const memprof::IndexedMemProfRecord &IndexedRecord :
        MemProfRecordTable->data())
     for (const memprof::IndexedAllocationInfo &IndexedAI :
          IndexedRecord.AllocSites)
-      Extractor(IndexedAI.CSId);
+      Worklist.set(IndexedAI.CSId);
+
+  // Collect caller-callee pairs for each linear call stack ID in Worklist.
+  for (unsigned CS : Worklist.set_bits())
+    Extractor(CS);
 
   DenseMap<uint64_t, SmallVector<memprof::CallEdgeTy, 0>> Pairs =
       std::move(Extractor.CallerCalleePairs);

Copy link
Contributor Author

@kazutakahirata kazutakahirata left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated. Please take a look. Thanks!

Copy link
Contributor

@snehasish snehasish left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@kazutakahirata kazutakahirata merged commit 59da1af into llvm:main Nov 14, 2024
8 checks passed
@kazutakahirata kazutakahirata deleted the memprof_undrift_faster_extract branch November 14, 2024 23:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
PGO Profile Guided Optimizations
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants