-
Notifications
You must be signed in to change notification settings - Fork 14.3k
[llvm-profgen] Add --sample-period to estimate absolute counts #99826
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Without `--sample-period`, no assumptions are made about perf profile sample frequencies. This is useful for comparing relative hotness of different program locations within the same profile. With `--sample-period`, LBR- and IP-based profile hit counts are adjusted to estimate the absolute total event count for each program location. This makes it reasonable to compare hit counts between different profiles, e.g., between LBR-based execution frequency profiles and IP-based branch mispredict profiles.
@williamweixiao, @HaohaiWen, it may also be useful to have this PR, which adds |
@llvm/pr-subscribers-pgo Author: Tim Creech (tcreech-intel) ChangesWithout With This functionality is in support of HWPGO1, which aims to enable feedback from a wider range of hardware events. Full diff: https://github.com/llvm/llvm-project/pull/99826.diff 2 Files Affected:
diff --git a/llvm/test/tools/llvm-profgen/period-scaling.test b/llvm/test/tools/llvm-profgen/period-scaling.test
new file mode 100644
index 0000000000000..a44c9a78caeaf
--- /dev/null
+++ b/llvm/test/tools/llvm-profgen/period-scaling.test
@@ -0,0 +1,84 @@
+// RUN: llvm-profgen --format=text --perfscript=%S/Inputs/cmov_3.perfscript --binary=%S/Inputs/cmov_3.perfbin --output=%t --skip-symbolization --perf-event=br_inst_retired.near_taken:upp --sample-period=1000003
+// RUN: FileCheck %s --input-file %t --check-prefix=CHECK-RAW-PROFILE
+// RUN: llvm-profgen --format=text --perfscript=%S/Inputs/cmov_3.perfscript --binary=%S/Inputs/cmov_3.perfbin --output=%t --perf-event=br_inst_retired.near_taken:upp --sample-period=1000003
+// RUN: FileCheck %s --input-file %t --check-prefix=CHECK
+
+// RUN: llvm-profgen --format=text --perfscript=%S/Inputs/cmov_3.perfscript --binary=%S/Inputs/cmov_3.perfbin --output=%t --skip-symbolization --perf-event=br_misp_retired.all_branches:upp --leading-ip-only --sample-period=1000003
+// RUN: FileCheck %s --input-file %t --check-prefix=UNPRED-RAW-PROFILE
+// RUN: llvm-profgen --format=text --perfscript=%S/Inputs/cmov_3.perfscript --binary=%S/Inputs/cmov_3.perfbin --output=%t --perf-event=br_misp_retired.all_branches:upp --leading-ip-only --sample-period=1000003
+// RUN: FileCheck %s --input-file %t --check-prefix=UNPRED
+
+// Check that we can use perf event filtering to generate multiple types of
+// source-level profiles from a single perf profile. In this case, we generate
+// a typical execution frequency profile using br_inst_retired.near_taken LBRs,
+// and a branch mispredict profile using br_misp_retired.all_branches sample
+// IPs.
+
+// Check that we can use --sample-period to compute LBR and IP-based profiles
+// which have comparable and absolute magnitudes. For example, in this case the
+// branch of interest (at source line offset 4) is in a loop body which is
+// executed ~20M times in total, and it's mispredicted about 9M times, yielding
+// a mispredict rate of roughly 0.45.
+
+// The source example below is based on perfKernelCpp/cmov_3, except a
+// misleading builtin is used to persuade the compiler not to use cmov, which
+// induces branch mispredicts.
+
+// CHECK: sel_arr:652547082:0
+// CHECK: 3.1: 20225766
+// CHECK: 3.2: 20225766
+// CHECK: 4: 19838670
+// CHECK: 5: 20225766
+
+// UNPRED: sel_arr:18000054:0
+// UNPRED: 3.1: 0
+// UNPRED: 3.2: 0
+// UNPRED: 4: 9000027
+// UNPRED: 5: 0
+
+// CHECK-RAW-PROFILE: 3
+// CHECK-RAW-PROFILE-NEXT: 2f0-2fa:9774174
+// CHECK-RAW-PROFILE-NEXT: 2f0-310:10064496
+// CHECK-RAW-PROFILE-NEXT: 2ff-310:10161270
+
+// UNPRED-RAW-PROFILE: 1
+// UNPRED-RAW-PROFILE-NEXT: 2fa-2fa:9000027
+
+// original code:
+// icx -fprofile-sample-generate lit.c
+#include <stdlib.h>
+
+#define N 20000
+#define ITERS 10000
+
+static int *m_s1, *m_s2, *m_s3, *m_dst;
+
+void init(void) {
+ m_s1 = malloc(sizeof(int)*N);
+ m_s2 = malloc(sizeof(int)*N);
+ m_s3 = malloc(sizeof(int)*N);
+ m_dst = malloc(sizeof(int)*N);
+ srand(42);
+
+ for (int i = 0; i < N; i++) {
+ m_s1[i] = rand() % N;
+ m_s2[i] = 0;
+ m_s3[i] = 1;
+ }
+}
+
+void __attribute__((noinline)) sel_arr(int *dst, int *s1, int *s2, int *s3) {
+#pragma nounroll
+#pragma clang loop vectorize(disable) interleave(disable)
+ for (int i = 0; i < N; i++) {
+ int *p = __builtin_expect((s1[i] < 10035), 0) ? &s2[i] : &s3[i];
+ dst[i] = *p;
+ }
+}
+
+int main(void) {
+ init();
+ for(int i=0; i<ITERS; ++i)
+ sel_arr(m_dst, m_s1, m_s2, m_s3);
+ return 0;
+}
diff --git a/llvm/tools/llvm-profgen/PerfReader.cpp b/llvm/tools/llvm-profgen/PerfReader.cpp
index b4e4911fb8912..4041271cc0a82 100644
--- a/llvm/tools/llvm-profgen/PerfReader.cpp
+++ b/llvm/tools/llvm-profgen/PerfReader.cpp
@@ -53,6 +53,10 @@ static cl::alias
cl::desc("Comma-delimited version of -perf-event"),
cl::aliasopt(PerfEventFilter));
+static cl::opt<uint64_t>
+ SamplePeriod("sample-period", cl::init(1),
+ cl::desc("The sampling period (-c) used for perf data"));
+
extern cl::opt<std::string> PerfTraceFilename;
extern cl::opt<bool> ShowDisassemblyOnly;
extern cl::opt<bool> ShowSourceLocations;
@@ -1000,6 +1004,16 @@ void LBRPerfReader::parseSample(TraceStream &TraceIt, uint64_t Count) {
if (extractLBRStack(TraceIt, Sample->LBRStack)) {
warnIfMissingMMap();
// Record LBR only samples by aggregation
+ // If a sampling period is given we can adjust the magnitude of sample
+ // counts to estimate the absolute magnitute.
+ if (SamplePeriod.getNumOccurrences()) {
+ Count *= SamplePeriod;
+ // If counts are LBR-based, as opposed to IP-based, then the magnitude is
+ // now amplified by roughly the LBR stack size. By adjusting this down, we
+ // can produce LBR-based and IP-based profiles with comparable magnitudes.
+ if (!LeadingIPOnly && Sample->LBRStack.size() > 1)
+ Count /= (Sample->LBRStack.size() - 1);
+ }
AggregatedSamples[Hashable<PerfSample>(Sample)] += Count;
}
}
Footnotes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We definitely need this feature when handing multiple events together.
I don't get it why --sample-period is needed from the change summary. It sounds like the rational behind this change is two-fold:
Let me know if I misunderstood the intention of the change, otherwise I'd suggest not taking this patch. We need to be careful about avoiding duplicating/overlapping features, avoiding proliferation of switches, for long term maintainability. |
@@ -53,6 +53,10 @@ static cl::alias | |||
cl::desc("Comma-delimited version of -perf-event"), | |||
cl::aliasopt(PerfEventFilter)); | |||
|
|||
static cl::opt<uint64_t> | |||
SamplePeriod("sample-period", cl::init(1), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be confusing -- having default value 1 implies we always assume a sample period of 1.
I'm interested in handling the combination of 1) and 2), where we're sampling the IP-only profile at a different period from the LBR-based profile and want to compare the two, e.g., for computing branch miss ratios. However, let's say we're only concerned with 1), i.e., dividing LBR hit counts by the LBR depth, because we can account for sampling periods with In short, --sample-period lets us avoid dividing integers which are smaller than LBR depths without assuming anything about LBR depths. (It also produces profile counts which approximate absolute event counts, but we don't depend on that property.) Thoughts here are appreciated, but I will move to revert these two llvm-profgen changes. |
llvm#99826)" This reverts commit 01d7836.
But when processing profile, we know the exact depth of each LBR sample?
Ok, I see why you want to multiple the count with something first. In this case, the multiplier doesn't have to be sampling period, instead it just needs to be something bigger than max LBR depth? It's sort of like a normalizing factor.. So I think we can get away with something simpler than actual sample period -- like we don't want to suggest that user need to provide the actual sampling period. |
Yes, this should be
Yes, for purposes of normalizing LBR and IP-only profiles with uniform sampling period I think that's true. It doesn't allow a user to compare profiles of different sampling periods, though, and I think that will be a useful capability. (Both perf and SEP support per-event periods.) With a fixed normalizing factor users in that case would need to use their sampling periods to come up with an adjustment factor and apply it with
Stepping back a bit, another direction might be to have |
Summary: Without `--sample-period`, no assumptions are made about perf profile sample frequencies. This is useful for comparing relative hotness of different program locations within the same profile. With `--sample-period`, LBR- and IP-based profile hit counts are adjusted to estimate the absolute total event count for each program location. This makes it reasonable to compare hit counts between different profiles, e.g., between two LBR-based execution frequency profiles with different sampling periods or between LBR-based execution frequency profiles and IP-based branch mispredict profiles. This functionality is in support of HWPGO[^1], which aims to enable feedback from a wider range of hardware events. [^1]: https://llvm.org/devmtg/2024-04/slides/TechnicalTalks/Xiao-EnablingHW-BasedPGO.pdf Test Plan: Reviewers: Subscribers: Tasks: Tags: Differential Revision: https://phabricator.intern.facebook.com/D60251524
Without
--sample-period
, no assumptions are made about perf profile sample frequencies. This is useful for comparing relative hotness of different program locations within the same profile.With
--sample-period
, LBR- and IP-based profile hit counts are adjusted to estimate the absolute total event count for each program location. This makes it reasonable to compare hit counts between different profiles, e.g., between two LBR-based execution frequency profiles with different sampling periods or between LBR-based execution frequency profiles and IP-based branch mispredict profiles.This functionality is in support of HWPGO1, which aims to enable feedback from a wider range of hardware events.
Footnotes
https://llvm.org/devmtg/2024-04/slides/TechnicalTalks/Xiao-EnablingHW-BasedPGO.pdf ↩