Skip to content

[llvm-profgen] Add --sample-period to estimate absolute counts #99826

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jul 22, 2024

Conversation

tcreech-intel
Copy link
Contributor

Without --sample-period, no assumptions are made about perf profile sample frequencies. This is useful for comparing relative hotness of different program locations within the same profile.

With --sample-period, LBR- and IP-based profile hit counts are adjusted to estimate the absolute total event count for each program location. This makes it reasonable to compare hit counts between different profiles, e.g., between two LBR-based execution frequency profiles with different sampling periods or between LBR-based execution frequency profiles and IP-based branch mispredict profiles.

This functionality is in support of HWPGO1, which aims to enable feedback from a wider range of hardware events.

Footnotes

  1. https://llvm.org/devmtg/2024-04/slides/TechnicalTalks/Xiao-EnablingHW-BasedPGO.pdf

Without `--sample-period`, no assumptions are made about perf profile
sample frequencies. This is useful for comparing relative hotness
of different program locations within the same profile.

With `--sample-period`, LBR- and IP-based profile hit counts are
adjusted to estimate the absolute total event count for each program
location. This makes it reasonable to compare hit counts between
different profiles, e.g., between LBR-based execution frequency
profiles and IP-based branch mispredict profiles.
@tcreech-intel
Copy link
Contributor Author

@williamweixiao, @HaohaiWen, it may also be useful to have this PR, which adds --sample-period functionality. This is a nice usability improvement.
(It's a separate PR because it depends on the changes in #99026.)

@tcreech-intel tcreech-intel marked this pull request as ready for review July 21, 2024 21:47
@llvmbot llvmbot added the PGO Profile Guided Optimizations label Jul 21, 2024
@llvmbot
Copy link
Member

llvmbot commented Jul 21, 2024

@llvm/pr-subscribers-pgo

Author: Tim Creech (tcreech-intel)

Changes

Without --sample-period, no assumptions are made about perf profile sample frequencies. This is useful for comparing relative hotness of different program locations within the same profile.

With --sample-period, LBR- and IP-based profile hit counts are adjusted to estimate the absolute total event count for each program location. This makes it reasonable to compare hit counts between different profiles, e.g., between two LBR-based execution frequency profiles with different sampling periods or between LBR-based execution frequency profiles and IP-based branch mispredict profiles.

This functionality is in support of HWPGO1, which aims to enable feedback from a wider range of hardware events.


Full diff: https://github.com/llvm/llvm-project/pull/99826.diff

2 Files Affected:

  • (added) llvm/test/tools/llvm-profgen/period-scaling.test (+84)
  • (modified) llvm/tools/llvm-profgen/PerfReader.cpp (+14)
diff --git a/llvm/test/tools/llvm-profgen/period-scaling.test b/llvm/test/tools/llvm-profgen/period-scaling.test
new file mode 100644
index 0000000000000..a44c9a78caeaf
--- /dev/null
+++ b/llvm/test/tools/llvm-profgen/period-scaling.test
@@ -0,0 +1,84 @@
+// RUN: llvm-profgen --format=text --perfscript=%S/Inputs/cmov_3.perfscript --binary=%S/Inputs/cmov_3.perfbin --output=%t --skip-symbolization --perf-event=br_inst_retired.near_taken:upp --sample-period=1000003
+// RUN: FileCheck %s --input-file %t --check-prefix=CHECK-RAW-PROFILE
+// RUN: llvm-profgen --format=text --perfscript=%S/Inputs/cmov_3.perfscript --binary=%S/Inputs/cmov_3.perfbin --output=%t --perf-event=br_inst_retired.near_taken:upp --sample-period=1000003
+// RUN: FileCheck %s --input-file %t --check-prefix=CHECK
+
+// RUN: llvm-profgen --format=text --perfscript=%S/Inputs/cmov_3.perfscript --binary=%S/Inputs/cmov_3.perfbin --output=%t --skip-symbolization --perf-event=br_misp_retired.all_branches:upp --leading-ip-only --sample-period=1000003
+// RUN: FileCheck %s --input-file %t --check-prefix=UNPRED-RAW-PROFILE
+// RUN: llvm-profgen --format=text --perfscript=%S/Inputs/cmov_3.perfscript --binary=%S/Inputs/cmov_3.perfbin --output=%t --perf-event=br_misp_retired.all_branches:upp --leading-ip-only --sample-period=1000003
+// RUN: FileCheck %s --input-file %t --check-prefix=UNPRED
+
+// Check that we can use perf event filtering to generate multiple types of
+// source-level profiles from a single perf profile. In this case, we generate
+// a typical execution frequency profile using br_inst_retired.near_taken LBRs,
+// and a branch mispredict profile using br_misp_retired.all_branches sample
+// IPs.
+
+// Check that we can use --sample-period to compute LBR and IP-based profiles
+// which have comparable and absolute magnitudes. For example, in this case the
+// branch of interest (at source line offset 4) is in a loop body which is
+// executed ~20M times in total, and it's mispredicted about 9M times, yielding
+// a mispredict rate of roughly 0.45.
+
+// The source example below is based on perfKernelCpp/cmov_3, except a
+// misleading builtin is used to persuade the compiler not to use cmov, which
+// induces branch mispredicts.
+
+// CHECK: sel_arr:652547082:0
+// CHECK:  3.1: 20225766
+// CHECK:  3.2: 20225766
+// CHECK:  4: 19838670
+// CHECK:  5: 20225766
+
+// UNPRED: sel_arr:18000054:0
+// UNPRED:  3.1: 0
+// UNPRED:  3.2: 0
+// UNPRED:  4: 9000027
+// UNPRED:  5: 0
+
+// CHECK-RAW-PROFILE:      3
+// CHECK-RAW-PROFILE-NEXT: 2f0-2fa:9774174
+// CHECK-RAW-PROFILE-NEXT: 2f0-310:10064496
+// CHECK-RAW-PROFILE-NEXT: 2ff-310:10161270
+
+// UNPRED-RAW-PROFILE:      1
+// UNPRED-RAW-PROFILE-NEXT: 2fa-2fa:9000027
+
+// original code:
+// icx -fprofile-sample-generate lit.c
+#include <stdlib.h>
+
+#define N 20000
+#define ITERS 10000
+
+static int *m_s1, *m_s2, *m_s3, *m_dst;
+
+void init(void) {
+    m_s1 = malloc(sizeof(int)*N);
+    m_s2 = malloc(sizeof(int)*N);
+    m_s3 = malloc(sizeof(int)*N);
+    m_dst = malloc(sizeof(int)*N);
+    srand(42);
+
+    for (int i = 0; i < N; i++) {
+        m_s1[i] = rand() % N;
+        m_s2[i] = 0;
+        m_s3[i] = 1;
+    }
+}
+
+void __attribute__((noinline)) sel_arr(int *dst, int *s1, int *s2, int *s3) {
+#pragma nounroll
+#pragma clang loop vectorize(disable) interleave(disable)
+    for (int i = 0; i < N; i++) {
+        int *p = __builtin_expect((s1[i] < 10035), 0) ? &s2[i] : &s3[i];
+        dst[i] = *p;
+    }
+}
+
+int main(void) {
+  init();
+  for(int i=0; i<ITERS; ++i)
+    sel_arr(m_dst, m_s1, m_s2, m_s3);
+  return 0;
+}
diff --git a/llvm/tools/llvm-profgen/PerfReader.cpp b/llvm/tools/llvm-profgen/PerfReader.cpp
index b4e4911fb8912..4041271cc0a82 100644
--- a/llvm/tools/llvm-profgen/PerfReader.cpp
+++ b/llvm/tools/llvm-profgen/PerfReader.cpp
@@ -53,6 +53,10 @@ static cl::alias
                           cl::desc("Comma-delimited version of -perf-event"),
                           cl::aliasopt(PerfEventFilter));
 
+static cl::opt<uint64_t>
+    SamplePeriod("sample-period", cl::init(1),
+                 cl::desc("The sampling period (-c) used for perf data"));
+
 extern cl::opt<std::string> PerfTraceFilename;
 extern cl::opt<bool> ShowDisassemblyOnly;
 extern cl::opt<bool> ShowSourceLocations;
@@ -1000,6 +1004,16 @@ void LBRPerfReader::parseSample(TraceStream &TraceIt, uint64_t Count) {
   if (extractLBRStack(TraceIt, Sample->LBRStack)) {
     warnIfMissingMMap();
     // Record LBR only samples by aggregation
+    // If a sampling period is given we can adjust the magnitude of sample
+    // counts to estimate the absolute magnitute.
+    if (SamplePeriod.getNumOccurrences()) {
+      Count *= SamplePeriod;
+      // If counts are LBR-based, as opposed to IP-based, then the magnitude is
+      // now amplified by roughly the LBR stack size. By adjusting this down, we
+      // can produce LBR-based and IP-based profiles with comparable magnitudes.
+      if (!LeadingIPOnly && Sample->LBRStack.size() > 1)
+        Count /= (Sample->LBRStack.size() - 1);
+    }
     AggregatedSamples[Hashable<PerfSample>(Sample)] += Count;
   }
 }

Footnotes

  1. https://llvm.org/devmtg/2024-04/slides/TechnicalTalks/Xiao-EnablingHW-BasedPGO.pdf

Copy link
Contributor

@williamweixiao williamweixiao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We definitely need this feature when handing multiple events together.

@williamweixiao williamweixiao merged commit 01d7836 into llvm:main Jul 22, 2024
11 checks passed
@WenleiHe
Copy link
Member

I don't get it why --sample-period is needed from the change summary.

It sounds like the rational behind this change is two-fold:

  1. To normalize LBR sample with IP sample. If that's the case, we only need to normalize by LBR depth regardless of sampling rate/period.

  2. To support calibration and merging of profile collected with different sampling rate/period. If that's the case, llvm-profdata merge -weighted-input .. is the way to go.

Let me know if I misunderstood the intention of the change, otherwise I'd suggest not taking this patch. We need to be careful about avoiding duplicating/overlapping features, avoiding proliferation of switches, for long term maintainability.

@@ -53,6 +53,10 @@ static cl::alias
cl::desc("Comma-delimited version of -perf-event"),
cl::aliasopt(PerfEventFilter));

static cl::opt<uint64_t>
SamplePeriod("sample-period", cl::init(1),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be confusing -- having default value 1 implies we always assume a sample period of 1.

@tcreech-intel
Copy link
Contributor Author

I'm interested in handling the combination of 1) and 2), where we're sampling the IP-only profile at a different period from the LBR-based profile and want to compare the two, e.g., for computing branch miss ratios.

However, let's say we're only concerned with 1), i.e., dividing LBR hit counts by the LBR depth, because we can account for sampling periods with llvm-profdata post-processing. If we apply the sampling period then we can do this division while processing each sample's contribution. This is useful because the effective LBR depth can vary by platform and even across samples in the same profile. Without applying the sampling period first, Count will typically be smaller than Sample->LBRStack.size() - 1 and the result won't be useful.

In short, --sample-period lets us avoid dividing integers which are smaller than LBR depths without assuming anything about LBR depths. (It also produces profile counts which approximate absolute event counts, but we don't depend on that property.)

Thoughts here are appreciated, but I will move to revert these two llvm-profgen changes.

tcreech-intel added a commit to tcreech-intel/llvm-project that referenced this pull request Jul 23, 2024
@WenleiHe
Copy link
Member

This is useful because the effective LBR depth can vary by platform and even across samples in the same profile.

But when processing profile, we know the exact depth of each LBR sample?

Without applying the sampling period first, Count will typically be smaller than Sample->LBRStack.size() - 1 and the result won't be useful.

Ok, I see why you want to multiple the count with something first. In this case, the multiplier doesn't have to be sampling period, instead it just needs to be something bigger than max LBR depth? It's sort of like a normalizing factor.. So I think we can get away with something simpler than actual sample period -- like we don't want to suggest that user need to provide the actual sampling period.

@tcreech-intel
Copy link
Contributor Author

But when processing profile, we know the exact depth of each LBR sample?

Yes, this should be Sample->LBRStack.size().

In this case, the multiplier doesn't have to be sampling period, instead it just needs to be something bigger than max LBR depth? It's sort of like a normalizing factor.. So I think we can get away with something simpler than actual sample period

Yes, for purposes of normalizing LBR and IP-only profiles with uniform sampling period I think that's true. It doesn't allow a user to compare profiles of different sampling periods, though, and I think that will be a useful capability. (Both perf and SEP support per-event periods.)

With a fixed normalizing factor users in that case would need to use their sampling periods to come up with an adjustment factor and apply it with llvm-profdata merge. I certainly understand the appeal of organizing the tooling this way, but I do worry some users will have trouble with this extra step.

we don't want to suggest that user need to provide the actual sampling period

Stepping back a bit, another direction might be to have llvm-profgen run perf script asking for the "period" field, i.e., to include the sampling period associated with each sample. (In general perf allows for the period to vary within a profile.) This could be optional behavior. Do you think that might be reasonable?

yuxuanchen1997 pushed a commit that referenced this pull request Jul 25, 2024
Summary:
Without `--sample-period`, no assumptions are made about perf profile
sample frequencies. This is useful for comparing relative hotness of
different program locations within the same profile.

With `--sample-period`, LBR- and IP-based profile hit counts are
adjusted to estimate the absolute total event count for each program
location. This makes it reasonable to compare hit counts between
different profiles, e.g., between two LBR-based execution frequency
profiles with different sampling periods or between LBR-based execution
frequency profiles and IP-based branch mispredict profiles.

This functionality is in support of HWPGO[^1], which aims to enable
feedback from a wider range of hardware events.

[^1]:
https://llvm.org/devmtg/2024-04/slides/TechnicalTalks/Xiao-EnablingHW-BasedPGO.pdf

Test Plan: 

Reviewers: 

Subscribers: 

Tasks: 

Tags: 


Differential Revision: https://phabricator.intern.facebook.com/D60251524
williamweixiao pushed a commit that referenced this pull request Aug 2, 2024
Revert #99826 and #99026 to allow for additional input.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
PGO Profile Guided Optimizations
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants