Skip to content

[AArch64][BOLT] Ensure tentative code layout for cold BBs runs. #96609

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

paschalis-mpeis
Copy link
Member

@paschalis-mpeis paschalis-mpeis commented Jun 25, 2024

When split functions is used, BOLT may skip tentative code layout estimation in some cases, like:

  • when there is no profile data for some blocks (ie cold blocks)
  • when there are cold functions in lite mode
  • when skip functions is used

However, when rewriting the binary we still need to compute PC-relative
distances between hot and cold basic blocks. Without cold layout
estimation, BOLT uses '0x0' as the address of the first cold block,
leading to incorrect estimations of any PC-relative addresses.

This affects large binaries as the relaxStub method expands more
branches than necessary using the short-jump sequence, at it wrongly
believes it has exceeded the branch distance boundary.

This increases code size with both a larger and slower sequence; however,
performance regression is expected to be minimal since this only affects
any called cold code.

Example of such an unnecessary relaxation:
from:

b       .Ltmp1234

to:

adrp    x16, .Ltmp1234
add     x16, x16, :lo12:.Ltmp1234
br      x16

@paschalis-mpeis paschalis-mpeis changed the title Users/paschalis mpeis/lite split functions cold layout [AArch64][BOLT] Ensure tentative code layout for cold BBs runs. Jun 25, 2024
@paschalis-mpeis paschalis-mpeis requested review from ilinpv and yota9 June 25, 2024 12:54
@paschalis-mpeis paschalis-mpeis marked this pull request as ready for review June 25, 2024 12:55
@llvmbot llvmbot added the BOLT label Jun 25, 2024
@llvmbot
Copy link
Member

llvmbot commented Jun 25, 2024

@llvm/pr-subscribers-bolt

Author: Paschalis Mpeis (paschalis-mpeis)

Changes

When BOLT is not processing all functions (e.g., when lite mode is
enabled or when multiple functions are ignored via the 'skip-funcs' flag),
we might skip the tentative code layout estimation for cold basic blocks.
However, due to the `-split-functions`` flag, we still need to compute
PC-relative distances between hot and cold basic blocks.

In such cases, BOLT will use '0x0' as the first address of the basic
block, leading to incorrect estimations of the necessary PC-relative
addresses. Consequently, the relaxStub method expands all those branches
to the short-jump sequence unnecessarily.

Such an unnecessary expansion by relaxStub is from:

b       .Ltmp1234

to:

adrp    x16, .Ltmp1234
add     x16, x16, :lo12:.Ltmp1234
br      x16

Full diff: https://github.com/llvm/llvm-project/pull/96609.diff

2 Files Affected:

  • (modified) bolt/lib/Passes/LongJmp.cpp (+11-3)
  • (added) bolt/test/AArch64/split-funcs-lite.test (+14)
diff --git a/bolt/lib/Passes/LongJmp.cpp b/bolt/lib/Passes/LongJmp.cpp
index c483f70a836ee..053650f8da16c 100644
--- a/bolt/lib/Passes/LongJmp.cpp
+++ b/bolt/lib/Passes/LongJmp.cpp
@@ -11,6 +11,7 @@
 //===----------------------------------------------------------------------===//
 
 #include "bolt/Passes/LongJmp.h"
+#include "bolt/Utils/CommandLineOpts.h"
 
 #define DEBUG_TYPE "longjmp"
 
@@ -324,7 +325,6 @@ uint64_t LongJmpPass::tentativeLayoutRelocColdPart(
 uint64_t LongJmpPass::tentativeLayoutRelocMode(
     const BinaryContext &BC, std::vector<BinaryFunction *> &SortedFunctions,
     uint64_t DotAddress) {
-
   // Compute hot cold frontier
   uint32_t LastHotIndex = -1u;
   uint32_t CurrentIndex = 0;
@@ -354,9 +354,12 @@ uint64_t LongJmpPass::tentativeLayoutRelocMode(
   for (BinaryFunction *Func : SortedFunctions) {
     if (!BC.shouldEmit(*Func)) {
       HotAddresses[Func] = Func->getAddress();
-      continue;
+      // Don't perform any tentative address estimation of a function's cold
+      // layout if it won't be emitted, unless we are ignoring a large number of
+      // functions (ie, on lite mode) and we haven't done such estimation yet.
+      if (opts::processAllFunctions() || ColdLayoutDone)
+        continue;
     }
-
     if (!ColdLayoutDone && CurrentIndex >= LastHotIndex) {
       DotAddress =
           tentativeLayoutRelocColdPart(BC, SortedFunctions, DotAddress);
@@ -382,6 +385,11 @@ uint64_t LongJmpPass::tentativeLayoutRelocMode(
     DotAddress += Func->estimateConstantIslandSize();
     ++CurrentIndex;
   }
+  if (!ColdLayoutDone) {
+    errs() << "BOLT-ERROR: Did not perform tentative code layout for cold functions.\n";
+    exit(1);
+  }
+
   // BBs
   for (BinaryFunction *Func : SortedFunctions)
     tentativeBBLayout(*Func);
diff --git a/bolt/test/AArch64/split-funcs-lite.test b/bolt/test/AArch64/split-funcs-lite.test
new file mode 100644
index 0000000000000..c7a11918fe061
--- /dev/null
+++ b/bolt/test/AArch64/split-funcs-lite.test
@@ -0,0 +1,14 @@
+// This test checks that tentative code layout for cold basic blocks runs
+// at least once, even when each function after the hot/cold frontier is not
+// emittable. This is done by ignoring each function using the 'skip-funcs' flag.
+// In a realistic scenario, this may happen when lite mode is enabled along
+// with a bolt profile.
+
+REQUIRES: system-linux
+
+RUN: %clang %cflags %p/../Inputs/asm_main.c -Wl,-q -o %t
+
+RUN: llvm-bolt %t -o %t.bolt -lite=1 -split-functions -split-all-cold \
+RUN: --skip-funcs="_init,_start,call_weak_fn/1,deregister_tm_clones/1,register_tm_clones/1,__do_global_dtors_aux/1,frame_dummy/1,main,foo,_fini" 2>&1 | FileCheck %s
+
+CHECK-NOT: BOLT-ERROR: Did not perform tentative code layout for cold functions.
\ No newline at end of file

@paschalis-mpeis
Copy link
Member Author

paschalis-mpeis commented Jun 25, 2024

This PR is a draft to initiate a discussion on the following two issues:

  1. On a large input binary (~60MiB of code), we encountered a case where LongJmp was expanding some branches it shouldn't (the current patch aims to solve this).
  2. This expansion seemed to be leading to a runtime crash.

A more detailed discussion follows.


(1) Expanding branches that shouldn't have

On AArch64, when using -lite=1 and -split-functions in relocation mode, we encountered cases where functions that come after the hot/cold frontier are not 'emittable', so when LongJmp needs to generate stubs between hot/cold basic blocks, it has incomplete information.

This happens because tentativeLayoutRelocMode never calls tentativeLayoutRelocColdPart, causing the estimated dot address of the first cold BB for such functions to be 0x0. As a result, when emitting stubs to jump from/to cold BBs, BOLT estimates much larger PC-relative addresses.

Example: one such function that had 6 hot and 7 cold basic blocks. The tentative addresses computed per basic block were:

 HOT BB: .LBB01750    0x8000054
 HOT BB: .Ltmp26442   0x8000074
 HOT BB: .Ltmp26443   0x800007c
 HOT BB: .Ltmp26444   0x8000090
 HOT BB: .LFT24377    0x800009c
 HOT BB: .LLP3525     0x80000a4
COLD BB: .LFT24370    0x0
COLD BB: .LFT24372    0x14
COLD BB: .LFT24378    0x30
COLD BB: .LFT24380    0x40
COLD BB: .LFT24381    0x48
COLD BB: .Ltmp26445   0x5c
COLD BB: .LFT24382    0x68

So for a jump from hot BB .Ltmp26443 to cold .LFT24378, we get:

  • Dot Target Address: 0x800007c
  • Dot Source Address: 0x30 (ie bb where we jump from)
    This makes the computed PCRelTgtAddress 0x800004c, which is above 0x8000000 ($2^{27}$), forcing relaxStub to emit a short-jump as it does not fit in the b instruction.

The same happens vice versa, e.g., when jumping from a cold to a hot BB.

A third example, which did not expand by luck, was:

  • Dot Target Address: 0x8000090
  • Dot Source Address: 0x96
    This makes PCRelTgtAddress=0x7fffffa (0x8000090-0x96), which falls just under the limit and fits into a single instruction.

Please note that without lite mode, tentativeLayoutRelocColdPart runs at least once as all functions are forced to be 'emittable'. Therefore, we don't encounter any wrong dot address estimations. Despite having to perform longer jumps (as each function is duplicated and optionally split into hot/cold), it just so happens that for the same binary all jumps fit in a single instruction and no 'relaxation' is needed.

The current patch, forces tentative address estimations on cold BBs to run at least once, otherwise it exits with a 'BOLT-ERROR'. This change should decrease stub 'relaxations' with lite mode. It can lead to slight decreased code size and may have a very slight impact on performance as well (given this concerns cold blocks; mostly if the profile is not good or if CPU caches are affected).


(2) Relaxing shouldn't have caused a runtime crash

Due to the wrong estimations described in (1), some branches are expanded/relaxed. However, it is not clear how such 'relaxation' could be responsible for a runtime crash. Despite being unnecessary, the below replacement looks like legit code. Given that, any ideas on what may have caused this?

Example: in the stub below, the branch was 'relaxed' from:

 b       .Ltmp1234

to:

adrp    x16, .Ltmp1234
add     x16, x16, :lo12:.Ltmp1234
br      x16

NOTE: The issue described in (1) ('relaxing' jumps where it shouldn't), seems to be found in other binaries that use -lite=1 -split-functions. However, the crash described here was not observed in those cases. It is currently unclear if this was due to pure luck (i.e., those cold stubs were never reached) or if there is something else that affects only larger binaries.

Copy link
Contributor

@samolisov samolisov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@paschalis-mpeis Thank you for the patch. LGTM, but I worked with a rewritten tentative layout analysis and may not remember all the specific details of the mainline version. So, this will be great if you can wait for an approve for a more experienced bolt developer. Thank you.

@samolisov
Copy link
Contributor

samolisov commented Jun 29, 2024

(2) Relaxing shouldn't have caused a runtime crash

This is very interesting! May you document a bit the details of the crash and open an issue with them? I believe, every runtime crash after (mis-)compiling/processing deserve an issue to help the people track what happens. Thank you.

@paschalis-mpeis paschalis-mpeis force-pushed the users/paschalis-mpeis/lite-split-functions-cold-layout branch from 7cd7950 to 690d485 Compare October 1, 2024 11:16
@paschalis-mpeis
Copy link
Member Author

paschalis-mpeis commented Oct 1, 2024

Thanks all for your reviews. I've introduced a lambda which I call after the loop as suggested.

I force-pushed as I rebased to recent main.

I've also adjusted the initial 'showcase commit' (08dd00f) to reflect the changes.
Based on that commit, the below existing test cases were not previously running ColdLayout:

  AArch64/bf_min_alignment.s
  runtime/AArch64/basic-instrumentation.test

However, it might later require the addresses of cold BBs to compute
PC-relative jumps between hot and cold fragments of a function. Since
the addresses were never estimated, it uses 0x0 for cold BBs, resulting
in incorrect estimated addresses for any instructions within them.

As a result, LongJump often expands a branch that would otherwise fit in
a single instruction into a short jump.

For example, `relaxStub` might expand from:
```armasm
b       .Ltmp1234
```

to:
```armasm
adrp    x16, .Ltmp1234
add     x16, x16, :lo12:.Ltmp1234
br      x16
```

While this expansion is not wrong, it is unnecessary. Moreover, in some
large binaries this expansion has lead to runtime crashes. This should
be investigated further, as such expansions should not cause crashes.
When split functions is used, BOLT may skip tentative code layout
estimation in some cases, like:
- when there is no profile data for some blocks (ie cold blocks)
- when there are cold functions in lite mode
- when skip functions is used

However, when rewriting the binary we still need to compute PC-relative
distances between hot and cold basic blocks. Without cold layout
estimation, BOLT uses '0x0' as the address of the first cold block,
leading to incorrect estimations of any PC-relative addresses.

This affects large binaries as the relaxStub method expands more
branches than necessary using the short-jump sequence, at it wrongly
believes it has exceeded the branch distance boundary.

This increases code size with a larger and slower sequence; however,
performance regression is expected to be minimal since this only affects
called cold code.

Example of such an unnecessary relaxation:
from:
```armasm
b       .Ltmp1234
```

to:
```armasm
adrp    x16, .Ltmp1234
add     x16, x16, :lo12:.Ltmp1234
br      x16
```
@paschalis-mpeis paschalis-mpeis force-pushed the users/paschalis-mpeis/lite-split-functions-cold-layout branch from 690d485 to 104f14b Compare October 2, 2024 09:56
@paschalis-mpeis paschalis-mpeis requested a review from yota9 October 2, 2024 10:04
Copy link
Member

@yota9 yota9 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the commit. Please fix indentation a bit a let's give 1-2 days for Meta team to review

@paschalis-mpeis
Copy link
Member Author

If no objections I will be merging this tomorrow.

Copy link
Contributor

@maksfb maksfb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change looks good to me, but I believe the test case will fail on non-debug builds because LLVM_DEBUG will be a no-op and the expected string wouldn't be printed.

@maksfb
Copy link
Contributor

maksfb commented Oct 13, 2024

@paschalis-mpeis, could you try the test from #112110 without --compact-code-model? It fails in JITLink (without the flag), but this PR supposedly fixes it.

Changed comment style on the test.
@paschalis-mpeis
Copy link
Member Author

the change looks good to me, but I believe the test case will fail on non-debug builds because LLVM_DEBUG will be a no-op and the expected string wouldn't be printed.

Thanks @maksfb. The test should now run only when compiled with assertions.

@paschalis-mpeis
Copy link
Member Author

paschalis-mpeis commented Oct 15, 2024

Hey @maksfb, thanks for the suggestion. I played around with the test of #112110, and below are my findings.

In summary, with random splitting alone, no stubs are inserted, and no relocation is needed or performed. So as expected, the JITLink crash you observe does not go away.

It's only when I attach profile data that parts of LongJmp are engaged (incl. my patch). The JITLink crash is then is gone, but not due to my patch. In this simple/small test even with a wrong cold layout BOLT is not relaxing any branches. There is an example at the very end.

Details

The current patch (#96609) avoids relaxing (with relaxStubToShortJmp) any branches that tentative code layout predicts they are not needed.

When trying the test of #112110, the flags --split-functions --split-strategy=randomN do not trigger relaxStub. Therefore, my proposed change does not run, no stubs are added, and no relaxation is needed.
Was something different expected and could it be an issue with the random splitting strategy? 🤔

BTW, this was the reason for not using random splitting in my test, but instead a profile with internal labels.


Running test of #112110

When running the test (just w/o the compact-flag of course), the fix-point loop does a single iteration on runOnFunctions for both _start, and large_function, which results to no stub insertions and no relaxation:

Inserted 0 stubs in the hot area and 0 stubs in the cold area. Shared 0 times, iterated 1 times.

As expected, the JITLink crash appears, either with or without this patch:

BOLT-ERROR: JITLink failed: In graph in-memory object file, section .text: relocation target .text.cold + 0x4 at address 0x724fc0 is out of range of CondBranch19PCRel fixup at 0x600004 (_start, 0x600000 + 0x4)

What is the intention of the #112110 test?

Using your patch, in the bolted binary we get something like:

Disassembly of section .text:

0000000000600000 <_start>:
  600000: cmp	x1, #1
  600004: b.hi	0x60000c <_start+0xc>
  600008: b	0x724fc0 <_start.cold.0>
  60000c: b	0x724fc4 <_start.cold.0+0x4>

0000000000600010 <large_function>:
  600010: nop
...
  724f8c: nop
  724f90: ret

Disassembly of section .text.cold:

0000000000724fc0 <_start.cold.0>:
  724fc0: bl	0x600010 <large_function>
  724fc4: ret

And this aligns with your comments in the test. So we have:

  • in the hot section: the main fragment _start and the large_function
  • in the cold section: the non-main fragment of _start

With the surrounding code as is, I only achieve this kind of splitting when I change your test to attach some profile data on the _start entry-point and the large_function. In that case, the JITLink error is gone and the resulting code is:


Disassembly of section .text:

0000000000600000 <_start>:
  600000: cmp	x1, #0x1
  600004: b.ls	0x60000c <_start+0xc>
  600008: b	0x724fc4 <_start.cold.0+0x4>
  60000c: b	0x724fc0 <_start.cold.0>

0000000000600010 <large_function>:
  600010: nop
...
  724f8c: nop
  724f90: ret

Disassembly of section .text.cold:

0000000000724fc0 <_start.cold.0>:
  724fc0: bl	0x600010 <large_function>
  724fc4: ret

Note that the branching condition in the above snippet is switched to b.ls when compared to the original test. That must be due to the profile data I used that indicate a call to large_function is the more likely branch target.

So when profile data are attached, wider code of LongJmp runs (incl. my change), a couple of stubs are inserted, and the JITLink crash is gone. But as far as my patch is concerned, in this simple/small example it has no effect.

For example, without my patch we'll get addresses (1):

and with my patch we'll get (2):

  • SrcAddr: 0x600008
  • TgtAddr: 0x724fc4

Distance in (1) is within bounds, despite having a wrong cold block target address. So in this case, applying my patch wouldn't have made a difference (by chance).

Copy link
Contributor

@maksfb maksfb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation, @paschalis-mpeis.

@paschalis-mpeis paschalis-mpeis merged commit cb9bacf into main Oct 17, 2024
7 checks passed
@paschalis-mpeis paschalis-mpeis deleted the users/paschalis-mpeis/lite-split-functions-cold-layout branch October 17, 2024 07:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants