-
Notifications
You must be signed in to change notification settings - Fork 14.3k
[AArch64][BOLT] Ensure tentative code layout for cold BBs runs. #96609
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[AArch64][BOLT] Ensure tentative code layout for cold BBs runs. #96609
Conversation
@llvm/pr-subscribers-bolt Author: Paschalis Mpeis (paschalis-mpeis) ChangesWhen BOLT is not processing all functions (e.g., when lite mode is In such cases, BOLT will use '0x0' as the first address of the basic Such an unnecessary expansion by
to:
Full diff: https://github.com/llvm/llvm-project/pull/96609.diff 2 Files Affected:
diff --git a/bolt/lib/Passes/LongJmp.cpp b/bolt/lib/Passes/LongJmp.cpp
index c483f70a836ee..053650f8da16c 100644
--- a/bolt/lib/Passes/LongJmp.cpp
+++ b/bolt/lib/Passes/LongJmp.cpp
@@ -11,6 +11,7 @@
//===----------------------------------------------------------------------===//
#include "bolt/Passes/LongJmp.h"
+#include "bolt/Utils/CommandLineOpts.h"
#define DEBUG_TYPE "longjmp"
@@ -324,7 +325,6 @@ uint64_t LongJmpPass::tentativeLayoutRelocColdPart(
uint64_t LongJmpPass::tentativeLayoutRelocMode(
const BinaryContext &BC, std::vector<BinaryFunction *> &SortedFunctions,
uint64_t DotAddress) {
-
// Compute hot cold frontier
uint32_t LastHotIndex = -1u;
uint32_t CurrentIndex = 0;
@@ -354,9 +354,12 @@ uint64_t LongJmpPass::tentativeLayoutRelocMode(
for (BinaryFunction *Func : SortedFunctions) {
if (!BC.shouldEmit(*Func)) {
HotAddresses[Func] = Func->getAddress();
- continue;
+ // Don't perform any tentative address estimation of a function's cold
+ // layout if it won't be emitted, unless we are ignoring a large number of
+ // functions (ie, on lite mode) and we haven't done such estimation yet.
+ if (opts::processAllFunctions() || ColdLayoutDone)
+ continue;
}
-
if (!ColdLayoutDone && CurrentIndex >= LastHotIndex) {
DotAddress =
tentativeLayoutRelocColdPart(BC, SortedFunctions, DotAddress);
@@ -382,6 +385,11 @@ uint64_t LongJmpPass::tentativeLayoutRelocMode(
DotAddress += Func->estimateConstantIslandSize();
++CurrentIndex;
}
+ if (!ColdLayoutDone) {
+ errs() << "BOLT-ERROR: Did not perform tentative code layout for cold functions.\n";
+ exit(1);
+ }
+
// BBs
for (BinaryFunction *Func : SortedFunctions)
tentativeBBLayout(*Func);
diff --git a/bolt/test/AArch64/split-funcs-lite.test b/bolt/test/AArch64/split-funcs-lite.test
new file mode 100644
index 0000000000000..c7a11918fe061
--- /dev/null
+++ b/bolt/test/AArch64/split-funcs-lite.test
@@ -0,0 +1,14 @@
+// This test checks that tentative code layout for cold basic blocks runs
+// at least once, even when each function after the hot/cold frontier is not
+// emittable. This is done by ignoring each function using the 'skip-funcs' flag.
+// In a realistic scenario, this may happen when lite mode is enabled along
+// with a bolt profile.
+
+REQUIRES: system-linux
+
+RUN: %clang %cflags %p/../Inputs/asm_main.c -Wl,-q -o %t
+
+RUN: llvm-bolt %t -o %t.bolt -lite=1 -split-functions -split-all-cold \
+RUN: --skip-funcs="_init,_start,call_weak_fn/1,deregister_tm_clones/1,register_tm_clones/1,__do_global_dtors_aux/1,frame_dummy/1,main,foo,_fini" 2>&1 | FileCheck %s
+
+CHECK-NOT: BOLT-ERROR: Did not perform tentative code layout for cold functions.
\ No newline at end of file
|
This PR is a draft to initiate a discussion on the following two issues:
A more detailed discussion follows. (1) Expanding branches that shouldn't haveOn AArch64, when using This happens because Example: one such function that had 6 hot and 7 cold basic blocks. The tentative addresses computed per basic block were:
So for a jump from hot BB
The same happens vice versa, e.g., when jumping from a cold to a hot BB. A third example, which did not expand by luck, was:
Please note that without lite mode, The current patch, forces tentative address estimations on cold BBs to run at least once, otherwise it exits with a 'BOLT-ERROR'. This change should decrease stub 'relaxations' with lite mode. It can lead to slight decreased code size and may have a very slight impact on performance as well (given this concerns cold blocks; mostly if the profile is not good or if CPU caches are affected). (2) Relaxing shouldn't have caused a runtime crashDue to the wrong estimations described in (1), some branches are expanded/relaxed. However, it is not clear how such 'relaxation' could be responsible for a runtime crash. Despite being unnecessary, the below replacement looks like legit code. Given that, any ideas on what may have caused this? Example: in the stub below, the branch was 'relaxed' from:
to:
NOTE: The issue described in (1) ('relaxing' jumps where it shouldn't), seems to be found in other binaries that use |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@paschalis-mpeis Thank you for the patch. LGTM, but I worked with a rewritten tentative layout analysis and may not remember all the specific details of the mainline version. So, this will be great if you can wait for an approve for a more experienced bolt developer. Thank you.
(2) Relaxing shouldn't have caused a runtime crash This is very interesting! May you document a bit the details of the crash and open an issue with them? I believe, every runtime crash after (mis-)compiling/processing deserve an issue to help the people track what happens. Thank you. |
7cd7950
to
690d485
Compare
Thanks all for your reviews. I've introduced a lambda which I call after the loop as suggested. I force-pushed as I rebased to recent main. I've also adjusted the initial 'showcase commit' (08dd00f) to reflect the changes.
|
However, it might later require the addresses of cold BBs to compute PC-relative jumps between hot and cold fragments of a function. Since the addresses were never estimated, it uses 0x0 for cold BBs, resulting in incorrect estimated addresses for any instructions within them. As a result, LongJump often expands a branch that would otherwise fit in a single instruction into a short jump. For example, `relaxStub` might expand from: ```armasm b .Ltmp1234 ``` to: ```armasm adrp x16, .Ltmp1234 add x16, x16, :lo12:.Ltmp1234 br x16 ``` While this expansion is not wrong, it is unnecessary. Moreover, in some large binaries this expansion has lead to runtime crashes. This should be investigated further, as such expansions should not cause crashes.
When split functions is used, BOLT may skip tentative code layout estimation in some cases, like: - when there is no profile data for some blocks (ie cold blocks) - when there are cold functions in lite mode - when skip functions is used However, when rewriting the binary we still need to compute PC-relative distances between hot and cold basic blocks. Without cold layout estimation, BOLT uses '0x0' as the address of the first cold block, leading to incorrect estimations of any PC-relative addresses. This affects large binaries as the relaxStub method expands more branches than necessary using the short-jump sequence, at it wrongly believes it has exceeded the branch distance boundary. This increases code size with a larger and slower sequence; however, performance regression is expected to be minimal since this only affects called cold code. Example of such an unnecessary relaxation: from: ```armasm b .Ltmp1234 ``` to: ```armasm adrp x16, .Ltmp1234 add x16, x16, :lo12:.Ltmp1234 br x16 ```
690d485
to
104f14b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the commit. Please fix indentation a bit a let's give 1-2 days for Meta team to review
If no objections I will be merging this tomorrow. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The change looks good to me, but I believe the test case will fail on non-debug builds because LLVM_DEBUG
will be a no-op and the expected string wouldn't be printed.
@paschalis-mpeis, could you try the test from #112110 without |
Changed comment style on the test.
Thanks @maksfb. The test should now run only when compiled with assertions. |
Hey @maksfb, thanks for the suggestion. I played around with the test of #112110, and below are my findings. In summary, with random splitting alone, no stubs are inserted, and no relocation is needed or performed. So as expected, the JITLink crash you observe does not go away. It's only when I attach profile data that parts of LongJmp are engaged (incl. my patch). The JITLink crash is then is gone, but not due to my patch. In this simple/small test even with a wrong cold layout BOLT is not relaxing any branches. There is an example at the very end. DetailsThe current patch (#96609) avoids relaxing (with When trying the test of #112110, the flags BTW, this was the reason for not using random splitting in my test, but instead a profile with internal labels. Running test of #112110When running the test (just w/o the compact-flag of course), the fix-point loop does a single iteration on
As expected, the JITLink crash appears, either with or without this patch:
What is the intention of the #112110 test?Using your patch, in the bolted binary we get something like:
And this aligns with your comments in the test. So we have:
With the surrounding code as is, I only achieve this kind of splitting when I change your test to attach some profile data on the
Note that the branching condition in the above snippet is switched to So when profile data are attached, wider code of LongJmp runs (incl. my change), a couple of stubs are inserted, and the JITLink crash is gone. But as far as my patch is concerned, in this simple/small example it has no effect. For example, without my patch we'll get addresses (1): and with my patch we'll get (2):
Distance in (1) is within bounds, despite having a wrong cold block target address. So in this case, applying my patch wouldn't have made a difference (by chance). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the explanation, @paschalis-mpeis.
When split functions is used, BOLT may skip tentative code layout estimation in some cases, like:
However, when rewriting the binary we still need to compute PC-relative
distances between hot and cold basic blocks. Without cold layout
estimation, BOLT uses '0x0' as the address of the first cold block,
leading to incorrect estimations of any PC-relative addresses.
This affects large binaries as the relaxStub method expands more
branches than necessary using the short-jump sequence, at it wrongly
believes it has exceeded the branch distance boundary.
This increases code size with both a larger and slower sequence; however,
performance regression is expected to be minimal since this only affects
any called cold code.
Example of such an unnecessary relaxation:
from:
to: