[llvm-profgen] Add --sample-period to estimate absolute counts #99826

tcreech-intel · 2024-07-21T21:41:39Z

Without --sample-period, no assumptions are made about perf profile sample frequencies. This is useful for comparing relative hotness of different program locations within the same profile.

With --sample-period, LBR- and IP-based profile hit counts are adjusted to estimate the absolute total event count for each program location. This makes it reasonable to compare hit counts between different profiles, e.g., between two LBR-based execution frequency profiles with different sampling periods or between LBR-based execution frequency profiles and IP-based branch mispredict profiles.

This functionality is in support of HWPGO¹, which aims to enable feedback from a wider range of hardware events.

https://llvm.org/devmtg/2024-04/slides/TechnicalTalks/Xiao-EnablingHW-BasedPGO.pdf ↩

Without `--sample-period`, no assumptions are made about perf profile sample frequencies. This is useful for comparing relative hotness of different program locations within the same profile. With `--sample-period`, LBR- and IP-based profile hit counts are adjusted to estimate the absolute total event count for each program location. This makes it reasonable to compare hit counts between different profiles, e.g., between LBR-based execution frequency profiles and IP-based branch mispredict profiles.

tcreech-intel · 2024-07-21T21:45:35Z

@williamweixiao, @HaohaiWen, it may also be useful to have this PR, which adds --sample-period functionality. This is a nice usability improvement.
(It's a separate PR because it depends on the changes in #99026.)

llvmbot · 2024-07-21T21:48:04Z

@llvm/pr-subscribers-pgo

Author: Tim Creech (tcreech-intel)

Changes

Without --sample-period, no assumptions are made about perf profile sample frequencies. This is useful for comparing relative hotness of different program locations within the same profile.

With --sample-period, LBR- and IP-based profile hit counts are adjusted to estimate the absolute total event count for each program location. This makes it reasonable to compare hit counts between different profiles, e.g., between two LBR-based execution frequency profiles with different sampling periods or between LBR-based execution frequency profiles and IP-based branch mispredict profiles.

This functionality is in support of HWPGO¹, which aims to enable feedback from a wider range of hardware events.

Full diff: https://github.com/llvm/llvm-project/pull/99826.diff

2 Files Affected:

(added) llvm/test/tools/llvm-profgen/period-scaling.test (+84)
(modified) llvm/tools/llvm-profgen/PerfReader.cpp (+14)

diff --git a/llvm/test/tools/llvm-profgen/period-scaling.test b/llvm/test/tools/llvm-profgen/period-scaling.test
new file mode 100644
index 0000000000000..a44c9a78caeaf
--- /dev/null
+++ b/llvm/test/tools/llvm-profgen/period-scaling.test
@@ -0,0 +1,84 @@
+// RUN: llvm-profgen --format=text --perfscript=%S/Inputs/cmov_3.perfscript --binary=%S/Inputs/cmov_3.perfbin --output=%t --skip-symbolization --perf-event=br_inst_retired.near_taken:upp --sample-period=1000003
+// RUN: FileCheck %s --input-file %t --check-prefix=CHECK-RAW-PROFILE
+// RUN: llvm-profgen --format=text --perfscript=%S/Inputs/cmov_3.perfscript --binary=%S/Inputs/cmov_3.perfbin --output=%t --perf-event=br_inst_retired.near_taken:upp --sample-period=1000003
+// RUN: FileCheck %s --input-file %t --check-prefix=CHECK
+
+// RUN: llvm-profgen --format=text --perfscript=%S/Inputs/cmov_3.perfscript --binary=%S/Inputs/cmov_3.perfbin --output=%t --skip-symbolization --perf-event=br_misp_retired.all_branches:upp --leading-ip-only --sample-period=1000003
+// RUN: FileCheck %s --input-file %t --check-prefix=UNPRED-RAW-PROFILE
+// RUN: llvm-profgen --format=text --perfscript=%S/Inputs/cmov_3.perfscript --binary=%S/Inputs/cmov_3.perfbin --output=%t --perf-event=br_misp_retired.all_branches:upp --leading-ip-only --sample-period=1000003
+// RUN: FileCheck %s --input-file %t --check-prefix=UNPRED
+
+// Check that we can use perf event filtering to generate multiple types of
+// source-level profiles from a single perf profile. In this case, we generate
+// a typical execution frequency profile using br_inst_retired.near_taken LBRs,
+// and a branch mispredict profile using br_misp_retired.all_branches sample
+// IPs.
+
+// Check that we can use --sample-period to compute LBR and IP-based profiles
+// which have comparable and absolute magnitudes. For example, in this case the
+// branch of interest (at source line offset 4) is in a loop body which is
+// executed ~20M times in total, and it's mispredicted about 9M times, yielding
+// a mispredict rate of roughly 0.45.
+
+// The source example below is based on perfKernelCpp/cmov_3, except a
+// misleading builtin is used to persuade the compiler not to use cmov, which
+// induces branch mispredicts.
+
+// CHECK: sel_arr:652547082:0
+// CHECK:  3.1: 20225766
+// CHECK:  3.2: 20225766
+// CHECK:  4: 19838670
+// CHECK:  5: 20225766
+
+// UNPRED: sel_arr:18000054:0
+// UNPRED:  3.1: 0
+// UNPRED:  3.2: 0
+// UNPRED:  4: 9000027
+// UNPRED:  5: 0
+
+// CHECK-RAW-PROFILE:      3
+// CHECK-RAW-PROFILE-NEXT: 2f0-2fa:9774174
+// CHECK-RAW-PROFILE-NEXT: 2f0-310:10064496
+// CHECK-RAW-PROFILE-NEXT: 2ff-310:10161270
+
+// UNPRED-RAW-PROFILE:      1
+// UNPRED-RAW-PROFILE-NEXT: 2fa-2fa:9000027
+
+// original code:
+// icx -fprofile-sample-generate lit.c
+#include <stdlib.h>
+
+#define N 20000
+#define ITERS 10000
+
+static int *m_s1, *m_s2, *m_s3, *m_dst;
+
+void init(void) {
+    m_s1 = malloc(sizeof(int)*N);
+    m_s2 = malloc(sizeof(int)*N);
+    m_s3 = malloc(sizeof(int)*N);
+    m_dst = malloc(sizeof(int)*N);
+    srand(42);
+
+    for (int i = 0; i < N; i++) {
+        m_s1[i] = rand() % N;
+        m_s2[i] = 0;
+        m_s3[i] = 1;
+    }
+}
+
+void __attribute__((noinline)) sel_arr(int *dst, int *s1, int *s2, int *s3) {
+#pragma nounroll
+#pragma clang loop vectorize(disable) interleave(disable)
+    for (int i = 0; i < N; i++) {
+        int *p = __builtin_expect((s1[i] < 10035), 0) ? &s2[i] : &s3[i];
+        dst[i] = *p;
+    }
+}
+
+int main(void) {
+  init();
+  for(int i=0; i<ITERS; ++i)
+    sel_arr(m_dst, m_s1, m_s2, m_s3);
+  return 0;
+}
diff --git a/llvm/tools/llvm-profgen/PerfReader.cpp b/llvm/tools/llvm-profgen/PerfReader.cpp
index b4e4911fb8912..4041271cc0a82 100644
--- a/llvm/tools/llvm-profgen/PerfReader.cpp
+++ b/llvm/tools/llvm-profgen/PerfReader.cpp
@@ -53,6 +53,10 @@ static cl::alias
                           cl::desc("Comma-delimited version of -perf-event"),
                           cl::aliasopt(PerfEventFilter));
 
+static cl::opt<uint64_t>
+    SamplePeriod("sample-period", cl::init(1),
+                 cl::desc("The sampling period (-c) used for perf data"));
+
 extern cl::opt<std::string> PerfTraceFilename;
 extern cl::opt<bool> ShowDisassemblyOnly;
 extern cl::opt<bool> ShowSourceLocations;
@@ -1000,6 +1004,16 @@ void LBRPerfReader::parseSample(TraceStream &TraceIt, uint64_t Count) {
   if (extractLBRStack(TraceIt, Sample->LBRStack)) {
     warnIfMissingMMap();
     // Record LBR only samples by aggregation
+    // If a sampling period is given we can adjust the magnitude of sample
+    // counts to estimate the absolute magnitute.
+    if (SamplePeriod.getNumOccurrences()) {
+      Count *= SamplePeriod;
+      // If counts are LBR-based, as opposed to IP-based, then the magnitude is
+      // now amplified by roughly the LBR stack size. By adjusting this down, we
+      // can produce LBR-based and IP-based profiles with comparable magnitudes.
+      if (!LeadingIPOnly && Sample->LBRStack.size() > 1)
+        Count /= (Sample->LBRStack.size() - 1);
+    }
     AggregatedSamples[Hashable<PerfSample>(Sample)] += Count;
   }
 }

https://llvm.org/devmtg/2024-04/slides/TechnicalTalks/Xiao-EnablingHW-BasedPGO.pdf ↩

williamweixiao

We definitely need this feature when handing multiple events together.

WenleiHe · 2024-07-23T05:22:16Z

I don't get it why --sample-period is needed from the change summary.

It sounds like the rational behind this change is two-fold:

To normalize LBR sample with IP sample. If that's the case, we only need to normalize by LBR depth regardless of sampling rate/period.
To support calibration and merging of profile collected with different sampling rate/period. If that's the case, llvm-profdata merge -weighted-input .. is the way to go.

Let me know if I misunderstood the intention of the change, otherwise I'd suggest not taking this patch. We need to be careful about avoiding duplicating/overlapping features, avoiding proliferation of switches, for long term maintainability.

WenleiHe · 2024-07-23T05:06:59Z

llvm/tools/llvm-profgen/PerfReader.cpp

@@ -53,6 +53,10 @@ static cl::alias
                          cl::desc("Comma-delimited version of -perf-event"),
                          cl::aliasopt(PerfEventFilter));

+static cl::opt<uint64_t>
+    SamplePeriod("sample-period", cl::init(1),


This can be confusing -- having default value 1 implies we always assume a sample period of 1.

tcreech-intel · 2024-07-23T16:09:18Z

I'm interested in handling the combination of 1) and 2), where we're sampling the IP-only profile at a different period from the LBR-based profile and want to compare the two, e.g., for computing branch miss ratios.

However, let's say we're only concerned with 1), i.e., dividing LBR hit counts by the LBR depth, because we can account for sampling periods with llvm-profdata post-processing. If we apply the sampling period then we can do this division while processing each sample's contribution. This is useful because the effective LBR depth can vary by platform and even across samples in the same profile. Without applying the sampling period first, Count will typically be smaller than Sample->LBRStack.size() - 1 and the result won't be useful.

In short, --sample-period lets us avoid dividing integers which are smaller than LBR depths without assuming anything about LBR depths. (It also produces profile counts which approximate absolute event counts, but we don't depend on that property.)

Thoughts here are appreciated, but I will move to revert these two llvm-profgen changes.

llvm#99826)" This reverts commit 01d7836.

WenleiHe · 2024-07-23T22:45:09Z

This is useful because the effective LBR depth can vary by platform and even across samples in the same profile.

But when processing profile, we know the exact depth of each LBR sample?

Without applying the sampling period first, Count will typically be smaller than Sample->LBRStack.size() - 1 and the result won't be useful.

Ok, I see why you want to multiple the count with something first. In this case, the multiplier doesn't have to be sampling period, instead it just needs to be something bigger than max LBR depth? It's sort of like a normalizing factor.. So I think we can get away with something simpler than actual sample period -- like we don't want to suggest that user need to provide the actual sampling period.

tcreech-intel · 2024-07-24T12:56:09Z

But when processing profile, we know the exact depth of each LBR sample?

Yes, this should be Sample->LBRStack.size().

In this case, the multiplier doesn't have to be sampling period, instead it just needs to be something bigger than max LBR depth? It's sort of like a normalizing factor.. So I think we can get away with something simpler than actual sample period

Yes, for purposes of normalizing LBR and IP-only profiles with uniform sampling period I think that's true. It doesn't allow a user to compare profiles of different sampling periods, though, and I think that will be a useful capability. (Both perf and SEP support per-event periods.)

With a fixed normalizing factor users in that case would need to use their sampling periods to come up with an adjustment factor and apply it with llvm-profdata merge. I certainly understand the appeal of organizing the tooling this way, but I do worry some users will have trouble with this extra step.

we don't want to suggest that user need to provide the actual sampling period

Stepping back a bit, another direction might be to have llvm-profgen run perf script asking for the "period" field, i.e., to include the sampling period associated with each sample. (In general perf allows for the period to vary within a profile.) This could be optional behavior. Do you think that might be reasonable?

Summary: Without `--sample-period`, no assumptions are made about perf profile sample frequencies. This is useful for comparing relative hotness of different program locations within the same profile. With `--sample-period`, LBR- and IP-based profile hit counts are adjusted to estimate the absolute total event count for each program location. This makes it reasonable to compare hit counts between different profiles, e.g., between two LBR-based execution frequency profiles with different sampling periods or between LBR-based execution frequency profiles and IP-based branch mispredict profiles. This functionality is in support of HWPGO[^1], which aims to enable feedback from a wider range of hardware events. [^1]: https://llvm.org/devmtg/2024-04/slides/TechnicalTalks/Xiao-EnablingHW-BasedPGO.pdf Test Plan: Reviewers: Subscribers: Tasks: Tags: Differential Revision: https://phabricator.intern.facebook.com/D60251524

Revert #99826 and #99026 to allow for additional input.

tcreech-intel marked this pull request as ready for review July 21, 2024 21:47

llvmbot added the PGO Profile Guided Optimizations label Jul 21, 2024

williamweixiao requested review from WenleiHe, wlei-llvm, HaohaiWen, tianqingw and williamweixiao July 22, 2024 01:37

williamweixiao approved these changes Jul 22, 2024

View reviewed changes

williamweixiao merged commit 01d7836 into llvm:main Jul 22, 2024
11 checks passed

WenleiHe reviewed Jul 23, 2024

View reviewed changes

tcreech-intel added a commit to tcreech-intel/llvm-project that referenced this pull request Jul 23, 2024

Revert "[llvm-profgen] Add --sample-period to estimate absolute counts (

52fad27

llvm#99826)" This reverts commit 01d7836.

tcreech-intel mentioned this pull request Jul 23, 2024

[llvm-profgen] Revert #99826 and #99026 #100147

Merged

williamweixiao pushed a commit that referenced this pull request Aug 2, 2024

[llvm-profgen] Revert #99826 and #99026 (#100147)

23609a3

Revert #99826 and #99026 to allow for additional input.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[llvm-profgen] Add --sample-period to estimate absolute counts #99826

[llvm-profgen] Add --sample-period to estimate absolute counts #99826

tcreech-intel commented Jul 21, 2024

Uh oh!

tcreech-intel commented Jul 21, 2024

Uh oh!

llvmbot commented Jul 21, 2024

Uh oh!

williamweixiao left a comment

Uh oh!

Uh oh!

WenleiHe commented Jul 23, 2024

Uh oh!

WenleiHe Jul 23, 2024

Uh oh!

tcreech-intel commented Jul 23, 2024

Uh oh!

WenleiHe commented Jul 23, 2024

Uh oh!

tcreech-intel commented Jul 24, 2024

Uh oh!

Uh oh!

[llvm-profgen] Add --sample-period to estimate absolute counts #99826

[llvm-profgen] Add --sample-period to estimate absolute counts #99826

Conversation

tcreech-intel commented Jul 21, 2024

Footnotes

Uh oh!

tcreech-intel commented Jul 21, 2024

Uh oh!

llvmbot commented Jul 21, 2024

Footnotes

Uh oh!

williamweixiao left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

WenleiHe commented Jul 23, 2024

Uh oh!

WenleiHe Jul 23, 2024

Choose a reason for hiding this comment

Uh oh!

tcreech-intel commented Jul 23, 2024

Uh oh!

WenleiHe commented Jul 23, 2024

Uh oh!

tcreech-intel commented Jul 24, 2024

Uh oh!

Uh oh!