Add initial support for SPE brstack format #129231

Merged 18 commits on Jun 20, 2025

kaadam (Contributor) commented Feb 28, 2025

Perf will be able to report SPE branch events in the same way as it reports LBR brstack entries.
Therefore we can reuse the existing LBR parsing process for SPE as well.

Example of the SPE brstack input format:

perf script -i perf.data -F pid,brstack --itrace=bl
  PID       FROM / TO / PREDICTED

16984  0x72e342e5f4/0x72e36192d0/M/-/-/11/RET/-
16984  0x72e7b8b3b4/0x72e7b8b3b8/PN/-/-/11/COND/-
16984  0x72e7b92b48/0x72e7b92b4c/PN/-/-/8/COND/-
16984  0x72eacc6b7c/0x760cc94b00/P/-/-/9/RET/-
16984  0x72e3f210fc/0x72e3f21068/P/-/-/4//-
16984  0x72e39b8c5c/0x72e3627b24/P/-/-/4//-
16984  0x72e7b89d20/0x72e7b92bbc/P/-/-/4/RET/-

The SPE brstack mispredicted flag may be two characters long: PN or MN. The N means the branch was marked as NOT-TAKEN. This applies only to conditional instructions (conditional branch or compare-and-branch) and indicates that the instruction failed its condition code check.
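To make the flag rules concrete, here is a minimal standalone sketch of validating the mispredicted field (C++17). `MispredFlag`, `parseMispredFlag`, and `AllowSpe` are hypothetical names for illustration, not BOLT's API, and rejecting a two-character form that starts with `-` is an assumption of the sketch rather than something the PR states.

```cpp
#include <cassert>
#include <optional>
#include <string_view>

// Hypothetical result type for this sketch; it is not a BOLT type.
struct MispredFlag {
  bool Mispredicted; // true for 'M'; false for 'P' or '-'
  bool NotTaken;     // true when a trailing 'N' is present (SPE only)
};

// Accepts "P", "M" and "-", plus "PN" and "MN" when AllowSpe is set.
// Returns std::nullopt for any other string.
inline std::optional<MispredFlag> parseMispredFlag(std::string_view Str,
                                                   bool AllowSpe) {
  if (Str.empty() || Str.size() > 2)
    return std::nullopt;
  const char First = Str[0];
  if (First != 'P' && First != 'M' && First != '-')
    return std::nullopt;
  bool NotTaken = false;
  if (Str.size() == 2) {
    // Assumption of this sketch: only "PN" and "MN" are valid two-character
    // forms, and only in SPE mode.
    if (!AllowSpe || First == '-' || Str[1] != 'N')
      return std::nullopt;
    NotTaken = true;
  }
  return MispredFlag{First == 'M', NotTaken};
}

// Usage example.
inline void exampleMispredFlags() {
  const auto Flag = parseMispredFlag("PN", /*AllowSpe=*/true);
  assert(Flag && !Flag->Mispredicted && Flag->NotTaken);
  assert(!parseMispredFlag("PN", /*AllowSpe=*/false));
}
```

Returning the not-taken bit separately keeps it available to callers even though the LBR-style entry only records whether the branch was mispredicted.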

Perf with 'brstack' support for SPE is available here:

https://github.com/Leo-Yan/linux/tree/perf_arm_spe_branch_flags_v2

Example usage with SPE perf data:

perf2bolt -p perf.data -o perf.fdata --spe BINARY

Capture standard SPE branch events with perf:

perf record -e 'arm_spe_0/branch_filter=1/u' -- BINARY

A unit test is also added to check parsing of the SPE brstack format.

llvmbot (Member) commented Feb 28, 2025

@llvm/pr-subscribers-bolt

Author: Ádám Kallai (kaadam)

Changes

Full diff: https://github.com/llvm/llvm-project/pull/129231.diff

3 Files Affected:

  • (modified) bolt/lib/Profile/DataAggregator.cpp (+37-23)
  • (modified) bolt/test/perf2bolt/AArch64/perf2bolt-spe.test (+1-1)
  • (modified) bolt/unittests/Profile/PerfSpeEvents.cpp (+71)
diff --git a/bolt/lib/Profile/DataAggregator.cpp b/bolt/lib/Profile/DataAggregator.cpp
index cce9fdbef99bd..4af3a493b8be6 100644
--- a/bolt/lib/Profile/DataAggregator.cpp
+++ b/bolt/lib/Profile/DataAggregator.cpp
@@ -49,12 +49,10 @@ static cl::opt<bool>
                      cl::desc("aggregate basic samples (without LBR info)"),
                      cl::cat(AggregatorCategory));
 
-cl::opt<bool> ArmSPE(
-    "spe",
-    cl::desc(
-        "Enable Arm SPE mode. Used in conjuction with no-lbr mode, ie `--spe "
-        "--nl`"),
-    cl::cat(AggregatorCategory));
+cl::opt<bool> ArmSPE("spe",
+                     cl::desc("Enable Arm SPE mode. Can combine with `--nl` "
+                              "to use in no-lbr mode"),
+                     cl::cat(AggregatorCategory));
 
 static cl::opt<std::string>
     ITraceAggregation("itrace",
@@ -180,13 +178,16 @@ void DataAggregator::start() {
 
   if (opts::ArmSPE) {
     if (!opts::BasicAggregation) {
-      errs() << "PERF2BOLT-ERROR: Arm SPE mode is combined only with "
-                "BasicAggregation.\n";
-      exit(1);
+      // pid    from_ip      to_ip        predicted?
+      // 12345  0x123/0x456/P/-/-/8/RET/-
+      launchPerfProcess("SPE branch events", MainEventsPPI,
+                        "script -F pid,brstack --itrace=bl",
+                        /*Wait = */ false);
+    } else {
+      launchPerfProcess("SPE brstack events", MainEventsPPI,
+                        "script -F pid,event,ip,addr --itrace=i1i",
+                        /*Wait = */ false);
     }
-    launchPerfProcess("branch events with SPE", MainEventsPPI,
-                      "script -F pid,event,ip,addr --itrace=i1i",
-                      /*Wait = */ false);
   } else if (opts::BasicAggregation) {
     launchPerfProcess("events without LBR", MainEventsPPI,
                       "script -F pid,event,ip",
@@ -527,8 +528,7 @@ Error DataAggregator::preprocessProfile(BinaryContext &BC) {
     }
     exit(0);
   }
-
-  if (((!opts::BasicAggregation && !opts::ArmSPE) && parseBranchEvents()) ||
+  if ((!opts::BasicAggregation && parseBranchEvents()) ||
       (opts::BasicAggregation && opts::ArmSPE && parseSpeAsBasicEvents()) ||
       (opts::BasicAggregation && parseBasicEvents()))
     errs() << "PERF2BOLT: failed to parse samples\n";
@@ -1034,7 +1034,11 @@ ErrorOr<LBREntry> DataAggregator::parseLBREntry() {
   if (std::error_code EC = MispredStrRes.getError())
     return EC;
   StringRef MispredStr = MispredStrRes.get();
-  if (MispredStr.size() != 1 ||
+  // SPE brstack mispredicted flags might be two characters long: 'PN' or 'MN'.
+  bool ProperStrSize = (MispredStr.size() == 2 && opts::ArmSPE)
+                           ? (MispredStr[1] == 'N')
+                           : (MispredStr.size() == 1);
+  if (!ProperStrSize ||
       (MispredStr[0] != 'P' && MispredStr[0] != 'M' && MispredStr[0] != '-')) {
     reportError("expected single char for mispred bit");
     Diag << "Found: " << MispredStr << "\n";
@@ -1565,9 +1569,11 @@ uint64_t DataAggregator::parseLBRSample(const PerfBranchSample &Sample,
 }
 
 std::error_code DataAggregator::parseBranchEvents() {
-  outs() << "PERF2BOLT: parse branch events...\n";
-  NamedRegionTimer T("parseBranch", "Parsing branch events", TimerGroupName,
-                     TimerGroupDesc, opts::TimeAggregator);
+  std::string BranchEventTypeStr =
+      opts::ArmSPE ? "SPE branch events in LBR-format" : "branch events";
+  outs() << "PERF2BOLT: parse " << BranchEventTypeStr << "...\n";
+  NamedRegionTimer T("parseBranch", "Parsing " + BranchEventTypeStr,
+                     TimerGroupName, TimerGroupDesc, opts::TimeAggregator);
 
   uint64_t NumTotalSamples = 0;
   uint64_t NumEntries = 0;
@@ -1595,7 +1601,8 @@ std::error_code DataAggregator::parseBranchEvents() {
     }
 
     NumEntries += Sample.LBR.size();
-    if (BAT && Sample.LBR.size() == 32 && !NeedsSkylakeFix) {
+    if (this->BC->isX86() && BAT && Sample.LBR.size() == 32 &&
+        !NeedsSkylakeFix) {
       errs() << "PERF2BOLT-WARNING: using Intel Skylake bug workaround\n";
       NeedsSkylakeFix = true;
     }
@@ -1630,10 +1637,17 @@ std::error_code DataAggregator::parseBranchEvents() {
     if (NumSamples && NumSamplesNoLBR == NumSamples) {
       // Note: we don't know if perf2bolt is being used to parse memory samples
       // at this point. In this case, it is OK to parse zero LBRs.
-      errs() << "PERF2BOLT-WARNING: all recorded samples for this binary lack "
-                "LBR. Record profile with perf record -j any or run perf2bolt "
-                "in no-LBR mode with -nl (the performance improvement in -nl "
-                "mode may be limited)\n";
+      if (!opts::ArmSPE)
+        errs()
+            << "PERF2BOLT-WARNING: all recorded samples for this binary lack "
+               "LBR. Record profile with perf record -j any or run perf2bolt "
+               "in no-LBR mode with -nl (the performance improvement in -nl "
+               "mode may be limited)\n";
+      else
+        errs()
+            << "PERF2BOLT-WARNING: all recorded samples for this binary lack "
+               "SPE brstack entries. Record profile with: "
+               "perf record -e 'arm_spe_0/branch_filter=1/u'\n";
     } else {
       const uint64_t IgnoredSamples = NumTotalSamples - NumSamples;
       const float PercentIgnored = 100.0f * IgnoredSamples / NumTotalSamples;
diff --git a/bolt/test/perf2bolt/AArch64/perf2bolt-spe.test b/bolt/test/perf2bolt/AArch64/perf2bolt-spe.test
index d7cea7ff769b8..d34a2c7994f72 100644
--- a/bolt/test/perf2bolt/AArch64/perf2bolt-spe.test
+++ b/bolt/test/perf2bolt/AArch64/perf2bolt-spe.test
@@ -11,4 +11,4 @@ CHECK-SPE-NO-LBR: PERF2BOLT: Starting data aggregation job
 RUN: perf record -e cycles -q -o %t.perf.data -- %t.exe
 RUN: not perf2bolt -p %t.perf.data -o %t.perf.boltdata --spe %t.exe 2>&1 | FileCheck %s --check-prefix=CHECK-SPE-LBR
 
-CHECK-SPE-LBR: PERF2BOLT-ERROR: Arm SPE mode is combined only with BasicAggregation.
+CHECK-SPE-LBR: PERF2BOLT: spawning perf job to read SPE branch events
diff --git a/bolt/unittests/Profile/PerfSpeEvents.cpp b/bolt/unittests/Profile/PerfSpeEvents.cpp
index e52393b516fa3..448354b784f29 100644
--- a/bolt/unittests/Profile/PerfSpeEvents.cpp
+++ b/bolt/unittests/Profile/PerfSpeEvents.cpp
@@ -23,6 +23,7 @@ using namespace llvm::ELF;
 
 namespace opts {
 extern cl::opt<std::string> ReadPerfEvents;
+extern cl::opt<bool> ArmSPE;
 } // namespace opts
 
 namespace llvm {
@@ -88,6 +89,45 @@ struct PerfSpeEventsTestHelper : public testing::Test {
 
     return SampleSize == DA.BasicSamples.size();
   }
+
+  /// Compare LBREntries
+  bool checkLBREntry(const LBREntry &Lhs, const LBREntry &Rhs) {
+    return Lhs.From == Rhs.From && Lhs.To == Rhs.To &&
+           Lhs.Mispred == Rhs.Mispred;
+  }
+
+  /// Parse and check SPE brstack as LBR
+  void parseAndCheckBrstackEvents(
+      uint64_t PID,
+      const std::vector<SmallVector<LBREntry, 2>> &ExpectedSamples) {
+    int NumSamples = 0;
+
+    DataAggregator DA("<pseudo input>");
+    DA.ParsingBuf = opts::ReadPerfEvents;
+    DA.BC = BC.get();
+    DataAggregator::MMapInfo MMap;
+    DA.BinaryMMapInfo.insert(std::make_pair(PID, MMap));
+
+    // Process buffer.
+    while (DA.hasData()) {
+      ErrorOr<DataAggregator::PerfBranchSample> SampleRes =
+          DA.parseBranchSample();
+      if (std::error_code EC = SampleRes.getError())
+        EXPECT_NE(EC, std::errc::no_such_process);
+
+      DataAggregator::PerfBranchSample &Sample = SampleRes.get();
+      EXPECT_EQ(Sample.LBR.size(), ExpectedSamples[NumSamples].size());
+
+      // Check the parsed LBREntries.
+      const auto *ActualIter = Sample.LBR.begin();
+      const auto *ExpectIter = ExpectedSamples[NumSamples].begin();
+      while (ActualIter != Sample.LBR.end() &&
+             ExpectIter != ExpectedSamples[NumSamples].end())
+        EXPECT_TRUE(checkLBREntry(*ActualIter++, *ExpectIter++));
+
+      ++NumSamples;
+    }
+  }
 };
 
 } // namespace bolt
@@ -113,6 +153,37 @@ TEST_F(PerfSpeEventsTestHelper, SpeBranches) {
   EXPECT_TRUE(checkEvents(1234, 10, {"branches-spe:"}));
 }
 
+TEST_F(PerfSpeEventsTestHelper, SpeBranchesWithBrstack) {
+  // Check perf input with SPE branch events as brstack format.
+  // Example collection command:
+  // ```
+  // perf record -e 'arm_spe_0/branch_filter=1/u' -- BINARY
+  // ```
+  // How Bolt extracts the branch events:
+  // ```
+  // perf script -F pid,brstack --itrace=bl
+  // ```
+
+  opts::ArmSPE = true;
+  opts::ReadPerfEvents = "  1234  0xa001/0xa002/PN/-/-/10/COND/-\n"
+                         "  1234  0xb001/0xb002/P/-/-/4/RET/-\n"
+                         "  1234  0xc001/0xc002/P/-/-/13/-/-\n"
+                         "  1234  0xd001/0xd002/M/-/-/7/RET/-\n"
+                         "  1234  0xe001/0xe002/P/-/-/14/RET/-\n"
+                         "  1234  0xf001/0xf002/MN/-/-/8/COND/-\n";
+
+  LBREntry Entry1 = {0xa001, 0xa002, false};
+  LBREntry Entry2 = {0xb001, 0xb002, false};
+  LBREntry Entry3 = {0xc001, 0xc002, false};
+  LBREntry Entry4 = {0xd001, 0xd002, true};
+  LBREntry Entry5 = {0xe001, 0xe002, false};
+  LBREntry Entry6 = {0xf001, 0xf002, true};
+  std::vector<SmallVector<LBREntry, 2>> ExpectedSamples = {
+      {{Entry1}}, {{Entry2}}, {{Entry3}}, {{Entry4}}, {{Entry5}}, {{Entry6}},
+  };
+  parseAndCheckBrstackEvents(1234, ExpectedSamples);
+}
+
 TEST_F(PerfSpeEventsTestHelper, SpeBranchesAndCycles) {
   // Check perf input with SPE branch events and cycles.
   // Example collection command:

paschalis-mpeis (Member) left a comment

Thanks a lot for your work Adam!
I commented on some changes and nits.

Also noting that for now this PR is stacked on top of #120741.

github-actions bot commented Apr 10, 2025

✅ With the latest revision this PR passed the C/C++ code formatter.

@paschalis-mpeis paschalis-mpeis changed the base branch from users/paschalis-mpeis/bolt-spe-mode to main May 22, 2025 07:52

paschalis-mpeis (Member) left a comment

Hey Adam,

Thanks for addressing the comments!
AArch64/perf2bolt-spe.test seems to be failing. See comment below.
I've also added a few more nits and cleanups.

paschalis-mpeis (Member) left a comment

Hey Adam,

Thanks for addressing the comments. A final thing would be to reintroduce the loop for checking the samples in the unit-test.

@kaadam kaadam force-pushed the spe_brstack branch 2 times, most recently from dea69a1 to 18ba358 Compare June 16, 2025 10:48

paschalis-mpeis (Member) left a comment

Hey Adam, thank you for all this work, LGTM!

Any comments from Meta? (cc: @maksfb, @aaupov)

kaadam (Contributor Author) commented Jun 16, 2025

Paschalis, thanks for your review. I have applied all the suggested changes.

aaupov (Contributor) left a comment

LGTM, thank you for contributing this.
Do you have plans to include an end-to-end (binary) test, with perf data with SPE samples?

// N: optionally appears when the branch was Not-Taken (ie fall-through)
// 12345 0x123/0x456/PN/-/-/8/RET/-
launchPerfProcess("SPE brstack events", MainEventsPPI,
"script -F pid,brstack --itrace=bl",

Can we just override itrace if ArmSPE is set?

kaadam (Contributor Author) commented Jun 18, 2025

Amir, thanks for your review. Are you thinking of something like this?

Suggested change
"script -F pid,brstack --itrace=bl",
if (opts::ArmSPE) {
  opts::ITraceAggregation = "bl";
  opts::ParseMemProfile = true;   // with itrace it's disabled
  opts::BasicAggregation = false; // Do not use --nl along with --spe
}
if (opts::BasicAggregation) {

Technically we can use ITraceAggregation for SPE, of course. The intention is not entirely clear to me, though, since it might be worth setting some other options along with itrace to ensure execution takes the right path.

kaadam (Contributor Author) replied:

Updated.

paschalis-mpeis (Member) commented Jun 18, 2025

> Do you have plans to include an end-to-end (binary) test, with perf data with SPE samples?

Good idea! Testing on an fdata file directly may not cover everything. And running perf2bolt requires a recent perf version, which can be flaky across platforms.

Perhaps we could reuse/extend the --perf-script-events logic to accept a pre-generated text file instead? 🤔
One produced on a platform with a recent perf, eg:

perf script -F pid,brstack --itrace=bl ..

We could also add another in-tree unittest, that uses SPE with the PBT feature.
I could provide some sample output in the coming days/weeks.

Those additions can go in a follow-up PR, so we don't block this one?

paschalis-mpeis and others added 5 commits June 19, 2025 11:21
BOLT gains the ability to process branch target information generated by
Arm SPE data, using the `BasicAggregation` format.

Example usage is:
```bash
perf2bolt -p perf.data -o perf.boltdata --nl --spe BINARY
```

New branch data and compatibility:
---
SPE branch entries in perf data contain a branch pair (`IP` -> `ADDR`)
for the source and destination branches. DataAggregator processes those
by creating two basic samples. Any other event types will have `ADDR`
field set to `0x0`. For those a single sample will be created. Such
events can be either SPE or non-SPE, like `l1d-access` and `cycles`
respectively.
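The pair-to-samples rule described above can be sketched as follows. This is an illustration only: `SpeEvent`, `BasicSample`, and `toBasicSamples` are hypothetical names, not DataAggregator's types, and emitting the source before the destination is an assumption of the sketch.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical input record: one parsed perf entry (PID omitted for brevity).
struct SpeEvent {
  std::string EventName;
  uint64_t IP;   // branch source for SPE branch entries, sampled PC otherwise
  uint64_t Addr; // branch destination, or 0x0 for non-branch events
};

// Hypothetical output record: one basic (non-LBR) sample.
struct BasicSample {
  std::string EventName;
  uint64_t PC;
};

// A branch pair (IP -> ADDR) yields two samples; any other event yields one.
inline std::vector<BasicSample> toBasicSamples(const SpeEvent &Ev) {
  std::vector<BasicSample> Samples;
  Samples.push_back({Ev.EventName, Ev.IP});
  if (Ev.Addr != 0)
    Samples.push_back({Ev.EventName, Ev.Addr});
  assert(!Samples.empty() && "every event contributes at least one sample");
  return Samples;
}
```

For example, an entry with `IP` 0xa001 and `ADDR` 0xa002 produces two samples, while a `cycles` entry with `ADDR` 0x0 produces one.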

The format of the input perf entries is:
```
PID   EVENT-TYPE   ADDR   IP
```

When in SPE mode and:
- host is not `AArch64`, BOLT will exit with a relevant message
- `ADDR` field is unavailable, BOLT will exit with a relevant message
- no branch pairs were recorded, BOLT will present a warning
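The three conditions in this list can be sketched as a single check. The names `SpeInputStatus`, `checkSpeInput`, and `isFatal` are hypothetical, and the order in which the conditions are tested is an assumption of the sketch; the commit message does not specify how BOLT orders or implements them.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical status codes mirroring the three conditions listed above.
enum class SpeInputStatus {
  Ok,
  ErrorHostNotAArch64,   // BOLT exits with a message
  ErrorAddrFieldMissing, // BOLT exits with a message
  WarnNoBranchPairs      // BOLT continues with a warning
};

inline SpeInputStatus checkSpeInput(bool HostIsAArch64, bool HasAddrField,
                                    std::size_t NumBranchPairs) {
  if (!HostIsAArch64)
    return SpeInputStatus::ErrorHostNotAArch64;
  if (!HasAddrField)
    return SpeInputStatus::ErrorAddrFieldMissing;
  if (NumBranchPairs == 0)
    return SpeInputStatus::WarnNoBranchPairs;
  return SpeInputStatus::Ok;
}

// Only the first two conditions stop processing.
inline bool isFatal(SpeInputStatus Status) {
  return Status == SpeInputStatus::ErrorHostNotAArch64 ||
         Status == SpeInputStatus::ErrorAddrFieldMissing;
}

// Usage example.
inline void exampleSpeChecks() {
  assert(!isFatal(checkSpeInput(true, true, 0)));
  assert(isFatal(checkSpeInput(false, true, 10)));
}
```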

Examples of generating profiling data for the SPE mode:
---
Profiles can be captured with perf on AArch64 machines with SPE enabled.
They can be combined with other events, SPE or not.

Capture only SPE branch data events:
```bash
perf record -e 'arm_spe_0/branch_filter=1/u' -- BINARY
```

Capture any SPE events:
```bash
perf record -e 'arm_spe_0//u' -- BINARY
```

Capture any SPE events and cycles:
```bash
perf record -e 'arm_spe_0//u' -e cycles:u -- BINARY
```

Add more filters and jitter, and specify a count to control overhead and quality:
```bash
perf record -e 'arm_spe_0/branch_filter=1,load_filter=0,store_filter=0,jitter=1/u' -c 10007 -- BINARY
```

Perf will be able to report SPE branch events in the same way as it
reports LBR brstack entries.
Therefore we can reuse the existing LBR parsing process for SPE as well.

Example of the SPE brstack input format:
```bash
perf script -i perf.data -F pid,brstack --itrace=bl
```
```
---
PID    FROM         TO           PREDICTED
---
16984  0x72e342e5f4/0x72e36192d0/M/-/-/11/RET/-
16984  0x72e7b8b3b4/0x72e7b8b3b8/PN/-/-/11/COND/-
16984  0x72e7b92b48/0x72e7b92b4c/PN/-/-/8/COND/-
16984  0x72eacc6b7c/0x760cc94b00/P/-/-/9/RET/-
16984  0x72e3f210fc/0x72e3f21068/P/-/-/4//-
16984  0x72e39b8c5c/0x72e3627b24/P/-/-/4//-
16984  0x72e7b89d20/0x72e7b92bbc/P/-/-/4/RET/-
```
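As an illustration of how one line of the output above breaks down, the following C++17 sketch splits a line into the PID, the FROM and TO addresses, and the raw flag field. It assumes one brstack entry per line, as in the sample; `BrstackLine`, `parseUnsigned`, and `parseBrstackLine` are hypothetical names, and this is not the parser BOLT uses.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <exception>
#include <optional>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical record type for this sketch; BOLT's own types differ.
struct BrstackLine {
  uint64_t Pid;
  uint64_t From;
  uint64_t To;
  std::string Flags; // raw third field, e.g. "P", "M", "PN", "MN"
};

// Parses an unsigned integer in the given base, rejecting trailing characters.
inline std::optional<uint64_t> parseUnsigned(const std::string &Str, int Base) {
  try {
    std::size_t Pos = 0;
    const unsigned long long Value = std::stoull(Str, &Pos, Base);
    if (Pos != Str.size())
      return std::nullopt;
    return static_cast<uint64_t>(Value);
  } catch (const std::exception &) {
    return std::nullopt; // empty, non-numeric, or out of range
  }
}

// Expects "<pid> <from>/<to>/<flags>/..." with a single entry per line.
inline std::optional<BrstackLine> parseBrstackLine(const std::string &Line) {
  std::istringstream In(Line);
  std::string PidStr, Entry;
  if (!(In >> PidStr >> Entry))
    return std::nullopt;

  // Split the entry on '/'; empty fields (as in "4//-") are preserved.
  std::vector<std::string> Fields;
  std::istringstream EntryIn(Entry);
  for (std::string Field; std::getline(EntryIn, Field, '/');)
    Fields.push_back(Field);
  if (Fields.size() < 3)
    return std::nullopt;

  const auto Pid = parseUnsigned(PidStr, 10);
  const auto From = parseUnsigned(Fields[0], 16);
  const auto To = parseUnsigned(Fields[1], 16);
  if (!Pid || !From || !To)
    return std::nullopt;
  return BrstackLine{*Pid, *From, *To, Fields[2]};
}

// Usage example with a line taken from the sample output.
inline void exampleBrstackLine() {
  const auto Parsed =
      parseBrstackLine("16984  0x72e342e5f4/0x72e36192d0/M/-/-/11/RET/-");
  assert(Parsed && Parsed->Pid == 16984 && Parsed->Flags == "M");
}
```

The flag field is deliberately left as a raw string here so that its interpretation stays separate from the line splitting.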
The SPE brstack mispredicted flag may be two characters long: 'PN' or 'MN'.
The 'N' means the branch was marked as NOT-TAKEN. This applies only to
conditional instructions (conditional branch or compare-and-branch)
and indicates that the instruction failed its condition code check.

Perf with 'brstack' support for SPE is available here:
```
https://github.com/Leo-Yan/linux/tree/perf_arm_spe_branch_flags_v2
```

Example usage with SPE perf data:
```bash
perf2bolt -p perf.data -o perf.fdata --spe BINARY
```

Capture standard SPE branch events with perf:
```bash
perf record -e 'arm_spe_0/branch_filter=1/u' -- BINARY
```

A unit test is also added to check parsing of the SPE brstack format.
kaadam and others added 13 commits June 19, 2025 11:21
The aim of this commit is to uncouple the SPE BRStack and SPE BasicAggregation approaches,
based on the decision in issue llvm#115333.

The BRStack change relies on the unit test logic introduced by
Paschalis Mpeis (ARM) in llvm#120741. Since that logic is common to both aggregation
techniques, an essential part of it needs to be retained.

All tests relevant to BasicAggregation are removed.

Co-Authored-By: Paschalis Mpeis <[email protected]>

The test could be simplified after PR llvm#143288, since
the validation phase was removed from parseLBRSample.
Now we can use the branchLBRs container for testing.
Formerly, if BOLT was supplied with mock addresses, the branchLBRs container
was empty due to the validation phase.
kaadam (Contributor Author) commented Jun 19, 2025

@aaupov, @paschalis-mpeis Thanks to both of you for the comments.

I have updated the PR based on the suggestions and rebased it after the changes from #143289. If there are no further comments, I can merge by the end of this week.

Regarding end-to-end testing, I agree it is a good idea. Using a pre-generated text file will most probably be a good way to do this. Thanks for the hint, Paschalis.

I'm happy to send a follow-up change in bolt-tests repo and cover end-to-end for SPE.

kaadam (Contributor Author) commented Jun 20, 2025

@aaupov @paschalis-mpeis Just one more thing: if you both find the change appropriate, may I ask one of you to merge the PR at the final stage? I do not have permission to do that. Thank you.

@paschalis-mpeis paschalis-mpeis merged commit f759739 into llvm:main Jun 20, 2025
7 checks passed
@kaadam kaadam deleted the spe_brstack branch June 20, 2025 13:28