
[BOLT][AArch64] Add support for compact code model #112110


Merged
5 commits merged into llvm:main from gh-aarch64-compact-code-model on Nov 7, 2024

Conversation

maksfb
Contributor

@maksfb maksfb commented Oct 12, 2024

Add the --compact-code-model option that performs an alternative branch relaxation under the assumption that the resulting binary has less than 128MB of code. The relaxation is done in relaxLocalBranches(), which operates at the function level and runs on multiple functions in parallel.

Running the new option on an AArch64 Clang binary produces slightly smaller code, and the relaxation finishes in about 1/10th of the time.

Note that the new .text has to be smaller than 128MB, and .plt has to be closer than 128MB to .text.
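
For illustration, the transformation performed under the new option looks roughly like the sketch below. The labels are hypothetical, the ranges are the AArch64 architectural limits (+/-1MB for b.cond, +/-128MB for b), and the pattern only roughly mirrors the CHECK lines in the test added by this PR.

  ## Before: after splitting, the conditional branch targets a block that may
  ## end up in a cold fragment beyond the +/-1MB reach of b.cond.
    cmp   x1, 1
    b.hi  .Lcold_block            // out of range once the fragment moves away

  ## After relaxLocalBranches(): the conditional branch targets a nearby
  ## trampoline, whose unconditional b covers +/-128MB.
    cmp   x1, 1
    b.hi  .Ltrampoline
    // ... rest of the hot fragment ...
  .Ltrampoline:
    b     .Lcold_block            // reaches the cold fragment under the <128MB assumption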

@llvmbot
Member

llvmbot commented Oct 12, 2024

@llvm/pr-subscribers-bolt

Author: Maksim Panchenko (maksfb)

Changes

Add the --compact-code-model option that performs an alternative branch relaxation under the assumption that the resulting binary has less than 128MB of code. The relaxation is done in relaxLocalBranches(), which operates at the function level and runs on multiple functions in parallel.

Running the new option on an AArch64 Clang binary produces slightly smaller code, and the relaxation finishes in about 1/10th of the time.

Note that the new .text has to be smaller than 128MB, and .plt has to be closer than 128MB to .text.


Full diff: https://github.com/llvm/llvm-project/pull/112110.diff

6 Files Affected:

  • (modified) bolt/include/bolt/Core/BinaryBasicBlock.h (+3)
  • (modified) bolt/include/bolt/Core/FunctionLayout.h (+2-1)
  • (modified) bolt/include/bolt/Passes/LongJmp.h (+13)
  • (modified) bolt/lib/Core/FunctionLayout.cpp (+3-1)
  • (modified) bolt/lib/Passes/LongJmp.cpp (+275-3)
  • (added) bolt/test/AArch64/compact-code-model.s (+48)
diff --git a/bolt/include/bolt/Core/BinaryBasicBlock.h b/bolt/include/bolt/Core/BinaryBasicBlock.h
index b4f31cf2bae6f6..25cccc4edecf68 100644
--- a/bolt/include/bolt/Core/BinaryBasicBlock.h
+++ b/bolt/include/bolt/Core/BinaryBasicBlock.h
@@ -819,6 +819,9 @@ class BinaryBasicBlock {
     return OutputAddressRange;
   }
 
+  uint64_t getOutputStartAddress() const { return OutputAddressRange.first; }
+  uint64_t getOutputEndAddress() const { return OutputAddressRange.second; }
+
   bool hasLocSyms() const { return LocSyms != nullptr; }
 
   /// Return mapping of input offsets to symbols in the output.
diff --git a/bolt/include/bolt/Core/FunctionLayout.h b/bolt/include/bolt/Core/FunctionLayout.h
index 6a13cbec69fee7..ee4dd689b8dd64 100644
--- a/bolt/include/bolt/Core/FunctionLayout.h
+++ b/bolt/include/bolt/Core/FunctionLayout.h
@@ -123,7 +123,8 @@ class FunctionFragment {
   const_iterator begin() const;
   iterator end();
   const_iterator end() const;
-  const BinaryBasicBlock *front() const;
+  BinaryBasicBlock *front() const;
+  BinaryBasicBlock *back() const;
 
   friend class FunctionLayout;
 };
diff --git a/bolt/include/bolt/Passes/LongJmp.h b/bolt/include/bolt/Passes/LongJmp.h
index 3d02d75ac4a277..df3ea9620918af 100644
--- a/bolt/include/bolt/Passes/LongJmp.h
+++ b/bolt/include/bolt/Passes/LongJmp.h
@@ -63,6 +63,19 @@ class LongJmpPass : public BinaryFunctionPass {
   uint32_t NumColdStubs{0};
   uint32_t NumSharedStubs{0};
 
+  /// The shortest distance for any branch instruction on AArch64.
+  static constexpr size_t ShortestJumpBits = 16;
+  static constexpr size_t ShortestJumpSpan = 1ULL << (ShortestJumpBits - 1);
+
+  /// The longest single-instruction branch.
+  static constexpr size_t LongestJumpBits = 28;
+  static constexpr size_t LongestJumpSpan = 1ULL << (LongestJumpBits - 1);
+
+  /// Relax all internal function branches including those between fragments.
+  /// Assume that fragments are placed in different sections but are within
+  /// 128MB of each other.
+  void relaxLocalBranches(BinaryFunction &BF);
+
   ///                 -- Layout estimation methods --
   /// Try to do layout before running the emitter, by looking at BinaryFunctions
   /// and MCInsts -- this is an estimation. To be correct for longjmp inserter
diff --git a/bolt/lib/Core/FunctionLayout.cpp b/bolt/lib/Core/FunctionLayout.cpp
index 15e6127ad2e9e8..4498fc44da9548 100644
--- a/bolt/lib/Core/FunctionLayout.cpp
+++ b/bolt/lib/Core/FunctionLayout.cpp
@@ -33,7 +33,9 @@ FunctionFragment::const_iterator FunctionFragment::end() const {
   return const_iterator(Layout->block_begin() + StartIndex + Size);
 }
 
-const BinaryBasicBlock *FunctionFragment::front() const { return *begin(); }
+BinaryBasicBlock *FunctionFragment::front() const { return *begin(); }
+
+BinaryBasicBlock *FunctionFragment::back() const { return *std::prev(end()); }
 
 FunctionLayout::FunctionLayout() { addFragment(); }
 
diff --git a/bolt/lib/Passes/LongJmp.cpp b/bolt/lib/Passes/LongJmp.cpp
index c483f70a836ee1..4ce2322ab4352c 100644
--- a/bolt/lib/Passes/LongJmp.cpp
+++ b/bolt/lib/Passes/LongJmp.cpp
@@ -11,18 +11,26 @@
 //===----------------------------------------------------------------------===//
 
 #include "bolt/Passes/LongJmp.h"
+#include "bolt/Core/ParallelUtilities.h"
+#include "llvm/Support/MathExtras.h"
 
 #define DEBUG_TYPE "longjmp"
 
 using namespace llvm;
 
 namespace opts {
+extern cl::OptionCategory BoltCategory;
 extern cl::OptionCategory BoltOptCategory;
 extern llvm::cl::opt<unsigned> AlignText;
 extern cl::opt<unsigned> AlignFunctions;
 extern cl::opt<bool> UseOldText;
 extern cl::opt<bool> HotFunctionsAtEnd;
 
+static cl::opt<bool>
+    CompactCodeModel("compact-code-model",
+                     cl::desc("generate code for binaries <128MB on AArch64"),
+                     cl::init(false), cl::cat(BoltCategory));
+
 static cl::opt<bool> GroupStubs("group-stubs",
                                 cl::desc("share stubs across functions"),
                                 cl::init(true), cl::cat(BoltOptCategory));
@@ -61,10 +69,10 @@ static BinaryBasicBlock *getBBAtHotColdSplitPoint(BinaryFunction &Func) {
     if (Next != E && (*Next)->isCold())
       return *I;
   }
-  llvm_unreachable("No hot-colt split point found");
+  llvm_unreachable("No hot-cold split point found");
 }
 
-static bool shouldInsertStub(const BinaryContext &BC, const MCInst &Inst) {
+static bool mayNeedStub(const BinaryContext &BC, const MCInst &Inst) {
   return (BC.MIB->isBranch(Inst) || BC.MIB->isCall(Inst)) &&
          !BC.MIB->isIndirectBranch(Inst) && !BC.MIB->isIndirectCall(Inst);
 }
@@ -565,7 +573,7 @@ Error LongJmpPass::relax(BinaryFunction &Func, bool &Modified) {
       if (BC.MIB->isPseudo(Inst))
         continue;
 
-      if (!shouldInsertStub(BC, Inst)) {
+      if (!mayNeedStub(BC, Inst)) {
         DotAddress += InsnSize;
         continue;
       }
@@ -629,7 +637,271 @@ Error LongJmpPass::relax(BinaryFunction &Func, bool &Modified) {
   return Error::success();
 }
 
+void LongJmpPass::relaxLocalBranches(BinaryFunction &BF) {
+  BinaryContext &BC = BF.getBinaryContext();
+  auto &MIB = BC.MIB;
+
+  if (!BF.isSimple())
+    return;
+
+  // Quick path.
+  if (!BF.isSplit() && BF.estimateSize() < ShortestJumpSpan)
+    return;
+
+  auto isBranchOffsetInRange = [&](const MCInst &Inst, int64_t Offset) {
+    const unsigned Bits = MIB->getPCRelEncodingSize(Inst);
+    return isIntN(Bits, Offset);
+  };
+
+  auto isBlockInRange = [&](const MCInst &Inst, uint64_t InstAddress,
+                            const BinaryBasicBlock &BB) {
+    const int64_t Offset = BB.getOutputStartAddress() - InstAddress;
+    return isBranchOffsetInRange(Inst, Offset);
+  };
+
+  // Keep track of *all* function trampolines that are going to be added to the
+  // function layout at the end of relaxation.
+  std::vector<std::pair<BinaryBasicBlock *, std::unique_ptr<BinaryBasicBlock>>>
+      FunctionTrampolines;
+
+  // Function fragments are relaxed independently.
+  for (FunctionFragment &FF : BF.getLayout().fragments()) {
+    // Fill out code size estimation for the fragment. Use output BB address
+    // ranges to store offsets from the start of the function.
+    uint64_t CodeSize = 0;
+    for (BinaryBasicBlock *BB : FF) {
+      BB->setOutputStartAddress(CodeSize);
+      CodeSize += BB->estimateSize();
+      BB->setOutputEndAddress(CodeSize);
+    }
+
+    // Dynamically-updated size of the fragment.
+    uint64_t FragmentSize = CodeSize;
+
+    // Size of the trampoline in bytes.
+    constexpr uint64_t TrampolineSize = 4;
+
+    // Trampolines created for the fragment. DestinationBB -> TrampolineBB.
+    // NB: here we store only the first trampoline created for DestinationBB.
+    DenseMap<const BinaryBasicBlock *, BinaryBasicBlock *> FragmentTrampolines;
+
+    // Create a trampoline code after \p BB or at the end of the fragment if BB
+    // is nullptr.
+    auto addTrampolineAfter = [&](BinaryBasicBlock *BB,
+                                  BinaryBasicBlock *TargetBB, uint64_t Count,
+                                  bool UpdateOffsets = true) {
+      std::unique_ptr<BinaryBasicBlock> TrampolineBB = BF.createBasicBlock();
+      MCInst Inst;
+      {
+        auto L = BC.scopeLock();
+        MIB->createUncondBranch(Inst, TargetBB->getLabel(), BC.Ctx.get());
+      }
+      TrampolineBB->addInstruction(Inst);
+      TrampolineBB->addSuccessor(TargetBB, Count);
+      TrampolineBB->setExecutionCount(Count);
+      const uint64_t TrampolineAddress =
+          BB ? BB->getOutputEndAddress() : FragmentSize;
+      TrampolineBB->setOutputStartAddress(TrampolineAddress);
+      TrampolineBB->setOutputEndAddress(TrampolineAddress + TrampolineSize);
+      TrampolineBB->setFragmentNum(FF.getFragmentNum());
+
+      if (UpdateOffsets) {
+        FragmentSize += TrampolineSize;
+        for (BinaryBasicBlock *IBB : FF) {
+          if (IBB->getOutputStartAddress() >= TrampolineAddress) {
+            IBB->setOutputStartAddress(IBB->getOutputStartAddress() +
+                                       TrampolineSize);
+            IBB->setOutputEndAddress(IBB->getOutputEndAddress() +
+                                     TrampolineSize);
+          }
+        }
+        for (auto &Pair : FunctionTrampolines) {
+          BinaryBasicBlock *IBB = Pair.second.get();
+          if (IBB->getFragmentNum() != TrampolineBB->getFragmentNum())
+            continue;
+          if (IBB == TrampolineBB.get())
+            continue;
+          if (IBB->getOutputStartAddress() >= TrampolineAddress) {
+            IBB->setOutputStartAddress(IBB->getOutputStartAddress() +
+                                       TrampolineSize);
+            IBB->setOutputEndAddress(IBB->getOutputEndAddress() +
+                                     TrampolineSize);
+          }
+        }
+      }
+
+      if (!FragmentTrampolines.lookup(TargetBB))
+        FragmentTrampolines[TargetBB] = TrampolineBB.get();
+      FunctionTrampolines.emplace_back(BB ? BB : FF.back(),
+                                       std::move(TrampolineBB));
+
+      return FunctionTrampolines.back().second.get();
+    };
+
+    // Pre-populate trampolines by splitting unconditional branches from the
+    // containing basic block.
+    for (BinaryBasicBlock *BB : FF) {
+      MCInst *Inst = BB->getLastNonPseudoInstr();
+      if (!Inst || !MIB->isUnconditionalBranch(*Inst))
+        continue;
+
+      const MCSymbol *TargetSymbol = MIB->getTargetSymbol(*Inst);
+      BB->eraseInstruction(BB->findInstruction(Inst));
+      BB->setOutputEndAddress(BB->getOutputEndAddress() - TrampolineSize);
+
+      BinaryBasicBlock::BinaryBranchInfo BI;
+      BinaryBasicBlock *TargetBB = BB->getSuccessor(TargetSymbol, BI);
+
+      BinaryBasicBlock *TrampolineBB =
+          addTrampolineAfter(BB, TargetBB, BI.Count, /*UpdateOffsets*/ false);
+      BB->replaceSuccessor(TargetBB, TrampolineBB, BI.Count);
+    }
+
+    /// Relax the branch \p Inst. Return true if basic block offsets need an
+    /// update after the trampoline insertion.
+    auto relaxBranch = [&](BinaryBasicBlock *BB, MCInst &Inst,
+                           uint64_t InstAddress, BinaryBasicBlock *TargetBB) {
+      BinaryFunction *BF = BB->getParent();
+
+      // Use branch taken count for optimal relaxation.
+      const uint64_t Count = BB->getBranchInfo(*TargetBB).Count;
+      assert(Count != BinaryBasicBlock::COUNT_NO_PROFILE &&
+             "Expected valid branch execution count");
+
+      // Try to reuse an existing trampoline without introducing any new code.
+      BinaryBasicBlock *TrampolineBB = FragmentTrampolines.lookup(TargetBB);
+      if (TrampolineBB && isBlockInRange(Inst, InstAddress, *TrampolineBB)) {
+        BB->replaceSuccessor(TargetBB, TrampolineBB, Count);
+        TrampolineBB->setExecutionCount(TrampolineBB->getExecutionCount() +
+                                        Count);
+        auto L = BC.scopeLock();
+        MIB->replaceBranchTarget(Inst, TrampolineBB->getLabel(), BC.Ctx.get());
+        return;
+      }
+
+      // For cold branches, check if we can introduce a trampoline at the end
+      // of the fragment that is within the branch reach. Note that such
+      // trampoline may change address later and become unreachable in which
+      // case we will need further relaxation.
+      const int64_t OffsetToEnd = FragmentSize - InstAddress;
+      if (Count == 0 && isBranchOffsetInRange(Inst, OffsetToEnd)) {
+        TrampolineBB = addTrampolineAfter(nullptr, TargetBB, Count);
+        BB->replaceSuccessor(TargetBB, TrampolineBB, Count);
+        auto L = BC.scopeLock();
+        MIB->replaceBranchTarget(Inst, TrampolineBB->getLabel(), BC.Ctx.get());
+
+        return;
+      }
+
+      // Insert a new block after the current one and use it as a trampoline.
+      TrampolineBB = addTrampolineAfter(BB, TargetBB, Count);
+
+      // If the other successor is a fall-through, invert the condition code.
+      const BinaryBasicBlock *const NextBB =
+          BF->getLayout().getBasicBlockAfter(BB, /*IgnoreSplits*/ false);
+      if (BB->getConditionalSuccessor(false) == NextBB) {
+        BB->swapConditionalSuccessors();
+        auto L = BC.scopeLock();
+        MIB->reverseBranchCondition(Inst, NextBB->getLabel(), BC.Ctx.get());
+      } else {
+        auto L = BC.scopeLock();
+        MIB->replaceBranchTarget(Inst, TrampolineBB->getLabel(), BC.Ctx.get());
+      }
+      BB->replaceSuccessor(TargetBB, TrampolineBB, Count);
+    };
+
+    bool MayNeedRelaxation;
+    uint64_t NumIterations = 0;
+    do {
+      MayNeedRelaxation = false;
+      ++NumIterations;
+      for (auto BBI = FF.begin(); BBI != FF.end(); ++BBI) {
+        BinaryBasicBlock *BB = *BBI;
+        uint64_t NextInstOffset = BB->getOutputStartAddress();
+        for (MCInst &Inst : *BB) {
+          const size_t InstAddress = NextInstOffset;
+          if (!MIB->isPseudo(Inst))
+            NextInstOffset += 4;
+
+          if (!mayNeedStub(BF.getBinaryContext(), Inst))
+            continue;
+
+          const size_t BitsAvailable = MIB->getPCRelEncodingSize(Inst);
+
+          // Span of +/-128MB.
+          if (BitsAvailable == LongestJumpBits)
+            continue;
+
+          const MCSymbol *TargetSymbol = MIB->getTargetSymbol(Inst);
+          BinaryBasicBlock *TargetBB = BB->getSuccessor(TargetSymbol);
+          assert(TargetBB &&
+                 "Basic block target expected for conditional branch.");
+
+          // Check if the relaxation is needed.
+          if (TargetBB->getFragmentNum() == FF.getFragmentNum() &&
+              isBlockInRange(Inst, InstAddress, *TargetBB))
+            continue;
+
+          relaxBranch(BB, Inst, InstAddress, TargetBB);
+
+          MayNeedRelaxation = true;
+        }
+      }
+
+      // We may have added new instructions, but the whole fragment is less than
+      // the minimum branch span.
+      if (FragmentSize < ShortestJumpSpan)
+        MayNeedRelaxation = false;
+
+    } while (MayNeedRelaxation);
+
+    LLVM_DEBUG({
+      if (NumIterations > 2) {
+        dbgs() << "BOLT-DEBUG: relaxed fragment " << FF.getFragmentNum().get()
+               << " of " << BF << " in " << NumIterations << " iterations\n";
+      }
+    });
+  }
+
+  // Add trampoline blocks from all fragments to the layout.
+  DenseMap<BinaryBasicBlock *, std::vector<std::unique_ptr<BinaryBasicBlock>>>
+      Insertions;
+  for (std::pair<BinaryBasicBlock *, std::unique_ptr<BinaryBasicBlock>> &Pair :
+       FunctionTrampolines) {
+    if (!Pair.second)
+      continue;
+    Insertions[Pair.first].emplace_back(std::move(Pair.second));
+  }
+
+  for (auto &Pair : Insertions) {
+    BF.insertBasicBlocks(Pair.first, std::move(Pair.second),
+                         /*UpdateLayout*/ true, /*UpdateCFI*/ true,
+                         /*RecomputeLPs*/ false);
+  }
+}
+
 Error LongJmpPass::runOnFunctions(BinaryContext &BC) {
+
+  if (opts::CompactCodeModel) {
+    BC.outs()
+        << "BOLT-INFO: relaxing branches for compact code model (<128MB)\n";
+
+    ParallelUtilities::WorkFuncTy WorkFun = [&](BinaryFunction &BF) {
+      relaxLocalBranches(BF);
+    };
+
+    ParallelUtilities::PredicateTy SkipPredicate =
+        [&](const BinaryFunction &BF) {
+          return !BC.shouldEmit(BF) || !BF.isSimple();
+        };
+
+    ParallelUtilities::runOnEachFunction(
+        BC, ParallelUtilities::SchedulingPolicy::SP_INST_LINEAR, WorkFun,
+        SkipPredicate, "RelaxLocalBranches");
+
+    return Error::success();
+  }
+
   BC.outs() << "BOLT-INFO: Starting stub-insertion pass\n";
   std::vector<BinaryFunction *> Sorted = BC.getSortedFunctions();
   bool Modified;
diff --git a/bolt/test/AArch64/compact-code-model.s b/bolt/test/AArch64/compact-code-model.s
new file mode 100644
index 00000000000000..c8d8ac9131b45c
--- /dev/null
+++ b/bolt/test/AArch64/compact-code-model.s
@@ -0,0 +1,48 @@
+## Check that llvm-bolt successfully relaxes branches for compact (<128MB) code
+## model.
+
+# RUN: llvm-mc -filetype=obj -triple aarch64-unknown-unknown %s -o %t.o
+# RUN: %clang %cflags %t.o -o %t.exe -Wl,-q -static
+# RUN: llvm-bolt %t.exe -o %t.bolt --split-functions --split-strategy=randomN \
+# RUN:   --keep-nops --compact-code-model
+# RUN: llvm-objdump -d --disassemble-symbols=_start %t.bolt | FileCheck %s
+# RUN: llvm-nm -n %t.bolt | FileCheck %s --check-prefix=CHECK-NM
+
+## _start will be split and its main fragment will be separated from other
+## fragments by large_function() which is over 1MB.
+
+# CHECK-NM: _start
+# CHECK-NM-NEXT: large_function
+# CHECK-NM-NEXT: _start.cold
+
+  .text
+  .globl _start
+  .type _start, %function
+_start:
+  .cfi_startproc
+  cmp  x1, 1
+  b.hi  .L1
+# CHECK: b.hi
+# CHECK-NEXT: b
+# CHECK-NEXT: b
+
+  bl large_function
+.L1:
+  ret  x30
+  .cfi_endproc
+.size _start, .-_start
+
+
+  .globl large_function
+  .type large_function, %function
+large_function:
+  .cfi_startproc
+  .rept 300000
+    nop
+  .endr
+  ret  x30
+  .cfi_endproc
+.size large_function, .-large_function
+
+## Force relocation mode.
+  .reloc 0, R_AARCH64_NONE

@kbeyls
Collaborator

kbeyls commented Oct 14, 2024

@smithp35 would you also be able to have a look at this?

Collaborator

@smithp35 smithp35 left a comment


@smithp35 would you also be able to have a look at this?

I've taken a look. I'm primarily a linker person rather than a BOLT person so I'll need someone more familiar with BOLT to comment on the details of the code structure. I could follow what the code was trying to do and I couldn't spot anything obviously incorrect.

My understanding is that this local relaxation is for conditional branch targets which are normally local to the function, but if BOLT splits a function then they can get out of range.

One possible extension, so that this could be used without needing a compact mode, is to write the stub as ADRP, ADD, BR if the destination is more than 128MB away. LLD will try to use "short" unconditional branches at first, but if these get knocked out of range they are rewritten as "long" sequences. This can add more passes as more instructions are added. We also don't change a long stub back to a short one to make sure we converge.
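
For reference, a minimal sketch of the "long" stub sequence described above (the label and the use of x16/IP0 as a scratch register are illustrative assumptions, not code from this patch):

  ## Hypothetical range-extension stub for a destination beyond the +/-128MB
  ## reach of a plain b/bl:
  .Lstub:
    adrp  x16, target             // page address of target, +/-4GB reach
    add   x16, x16, :lo12:target  // add the low 12 bits of the address
    br    x16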

I did think that it could be worth inserting the trampoline later than the end of the block to maximize the chance it can be reused, but that might end up losing performance so I can see why it is created as close to the source as possible.

BB->replaceSuccessor(TargetBB, TrampolineBB, BI.Count);
}

/// Relax the branch \p Inst. Return true if basic block offsets need an
Collaborator


Is the "Return true" part of the comment still accurate? It looks like this lambda doesn't return a value, and addTrampolineAfter handles the update of offsets.

# RUN: llvm-objdump -d --disassemble-symbols=_start %t.bolt | FileCheck %s
# RUN: llvm-nm -n %t.bolt | FileCheck %s --check-prefix=CHECK-NM

## _start will be split and its main fragment will be separated from other
Collaborator


Not a comment about this specific test. Looks like there are cases that can be handled that aren't covered by the tests. For example:

  • Reuse of existing branches by multiple conditional branches.
  • Only applying to simple functions.
  • What happens if compact mode is specified and there is > 128MB of code.
  • Single or multiple iterations. I'm not well versed enough in bolt to confidently know which one the current test case handles.

I'm not sure what the overall policy is in BOLT for test coverage, as I expect non-trivial test cases to be difficult to write; please do ignore this if it is too much.

@paschalis-mpeis
Member

paschalis-mpeis commented Oct 14, 2024

One possible extension so that this could be used without needing a compact mode, is to write the stub as ADRP, ADD, BR if the destination were greater than 128MB away. LLD will try and use "short" unconditional branches at first, but if these get knocked out of range they are rewritten as "long" sequences. This can add more passes as more instructions are added. We also don't change a long stub back to a short one to make sure we converge.

@smithp35 that must be what the original LongJmp was doing, i.e. before the --compact-code-model extension.

This is the loop of the fixed-point algorithm; if relaxations keep happening (in relaxStub), the algorithm keeps going.

During relaxation:

  • When code is beyond the branch limits (with SingleInstrMask) but within ADRP/ADD/BR range (with ShortJmpMask), it relaxes with a 'short jump' (relaxStubToShortJmp, which emits the ADRP/ADD/BR sequence).
  • When it does not fit the ADRP range and the binary is non-PIC, it will use the long-jump sequence (relaxStubToLongJmp with absolute addressing). For a PIC binary it will fail.

(the naming is a bit different, as 'long jump' refers to >4GB addressing and 'short jump' to >128MB)

@maksfb
Contributor Author

maksfb commented Oct 14, 2024

@smithp35, thank you for the feedback. One thing I didn't mention in this PR is that I plan follow-up PRs with support for >128MB binaries via stubs/thunks/veneers. Ideally, it would be done in JITLink. Regarding unit tests, yes, they are not as comprehensive as I would like them to be, and often we rely on regression tests where we continuously build large binaries. However, we don't have that many regression tests for AArch64 at the moment.

@lhames, do you know if anyone is working (or has plans) on AArch64 call relaxation in JITLink?

Member

@paschalis-mpeis paschalis-mpeis left a comment


Just checking IIUC:
The compact mode processes all fragments of a function locally.
Unconditional branches are split with a trampoline.
Conditional branches (excluding indirects) are relaxed/split with trampolines as needed, until we converge.
During relaxation (relaxBranch), the pass attempts to add cold-branch trampolines at the end of the fragment. If out of range, those are added at the next block (just like with hot ones).
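
A rough sketch of the two trampoline placements described above, with hypothetical labels (the actual logic lives in relaxLocalBranches() in the diff):

  ## Pre-populated trampoline: the trailing unconditional branch is split out
  ## of its block into its own trampoline block so other branches can reuse it.
  .Lblock:
    cmp   x0, x1
    b.eq  .Lother
  .Ltramp_hot:                    // was the "b .Ltarget" terminating .Lblock
    b     .Ltarget

  ## Cold (zero-count) branch: if the end of the fragment is still within the
  ## conditional branch's reach, the trampoline is appended there instead.
    b.hi  .Ltramp_cold
    // ... rest of the fragment ...
  .Ltramp_cold:                   // appended at the end of the fragment
    b     .Lcold_target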


Note that the new .text has to be smaller than 128MB

Could this be checked? Or do we pass this responsibility to the user, since the compact mode flag was their choice?


produces slightly smaller code

Do we anticipate some trampoline duplication on unconditional branches to, say, common cold functions? Do we consistently expect code size to get smaller?
Also, I'm curious if those code size reductions are still observed since we have now merged:

BinaryContext &BC = BF.getBinaryContext();
auto &MIB = BC.MIB;

if (!BF.isSimple())
Member


This is also covered by the SkipPredicate, right?

BinaryBasicBlock *IBB = Pair.second.get();
if (IBB->getFragmentNum() != TrampolineBB->getFragmentNum())
continue;
if (IBB == TrampolineBB.get())
Member

@paschalis-mpeis paschalis-mpeis Oct 17, 2024


I'm not sure if I'm reading the code correctly, but this could go away if:

  1. we get rid of the earlier loop (L710), and
  2. we move the FunctionTrampolines insertion (L733) before the loop (if there are no other side effects)?

Contributor Author


We delay the insertion of trampolines into the CFG until after the CFG is iterated over (L866). Hence, we have to iterate over the existing blocks in the CFG and over trampolines separately.

Member


Hey @maksfb, sorry for any confusion; I was referring just to the addTrampolineAfter lambda. Something like the diff below is what I had in mind.

If FunctionTrampolines insertion happens just before offsets are updated, then the start/end addresses of the added trampoline BB can be adjusted with a single loop? This assumes that only the added TrampolineBB of the original FF would need such an adjustment.

With some quick testing the diff seems to be OK. Tests passed (incl. compact-code-model.s cur/prev revisions). Also the more complex mongodb binary of #99848 was OK.

This was just a suggestion/observation anyway; ignore it if you think it is not always correct or if the original code is more readable.

diff --git a/bolt/lib/Passes/LongJmp.cpp b/bolt/lib/Passes/LongJmp.cpp
index 279ff63faf11..cb1d556a06e5 100644
--- a/bolt/lib/Passes/LongJmp.cpp
+++ b/bolt/lib/Passes/LongJmp.cpp
@@ -702,21 +702,17 @@ void LongJmpPass::relaxLocalBranches(BinaryFunction &BF) {
       TrampolineBB->setOutputEndAddress(TrampolineAddress + TrampolineSize);
       TrampolineBB->setFragmentNum(FF.getFragmentNum());
 
+      if (!FragmentTrampolines.lookup(TargetBB))
+        FragmentTrampolines[TargetBB] = TrampolineBB.get();
+      FunctionTrampolines.emplace_back(BB ? BB : FF.back(),
+                                       std::move(TrampolineBB));
+
+      auto *TBB = FunctionTrampolines.back().second.get();
       if (UpdateOffsets) {
         FragmentSize += TrampolineSize;
-        for (BinaryBasicBlock *IBB : FF) {
-          if (IBB->getOutputStartAddress() >= TrampolineAddress) {
-            IBB->setOutputStartAddress(IBB->getOutputStartAddress() +
-                                       TrampolineSize);
-            IBB->setOutputEndAddress(IBB->getOutputEndAddress() +
-                                     TrampolineSize);
-          }
-        }
         for (auto &Pair : FunctionTrampolines) {
           BinaryBasicBlock *IBB = Pair.second.get();
-          if (IBB->getFragmentNum() != TrampolineBB->getFragmentNum())
-            continue;
-          if (IBB == TrampolineBB.get())
+          if (IBB->getFragmentNum() != TBB->getFragmentNum())
             continue;
           if (IBB->getOutputStartAddress() >= TrampolineAddress) {
             IBB->setOutputStartAddress(IBB->getOutputStartAddress() +
@@ -726,13 +722,7 @@ void LongJmpPass::relaxLocalBranches(BinaryFunction &BF) {
           }
         }
       }
-
-      if (!FragmentTrampolines.lookup(TargetBB))
-        FragmentTrampolines[TargetBB] = TrampolineBB.get();
-      FunctionTrampolines.emplace_back(BB ? BB : FF.back(),
-                                       std::move(TrampolineBB));
-
-      return FunctionTrampolines.back().second.get();
+      return TBB;
     };
 
     // Pre-populate trampolines by splitting unconditional branches from the

Contributor Author


Thanks for your feedback and code review, Paschalis. I've added more comments and made mods to make the code more readable. Let me know if it makes sense. We need the loop above to update offsets when inserting the trampoline in the middle of the fragment.

Member


Yeap, it does make sense. My diff made the assumption I mentioned, which I was unsure was true even though the tests passed:

This assumes that only the added TrampolineBB of the original FF would need such an adjustment.

It is much clearer now, thanks for the restructuring and the added comments.

// Create a trampoline code after \p BB or at the end of the fragment if BB
// is nullptr.
auto addTrampolineAfter = [&](BinaryBasicBlock *BB,
BinaryBasicBlock *TargetBB, uint64_t Count,
Member

@paschalis-mpeis paschalis-mpeis Oct 17, 2024


(not so sure on this, so consider it as me 'thinking out loud')

Would it make sense to add a check on whether TargetBB is in range?
Given that this is for binaries <128MB, I assume it'll be in range.

But could there be a borderline case where TargetBB was initially in range, but relaxations in between have shifted it just outside of range? If that's a possibility, would we expect the relaxation loop to eventually get it right? Or is this part of the bits that will be 'offloaded' to the JIT linker?

Contributor Author


Would it make sense to add a check on whether TargetBB is in range?

If TargetBB is in the same fragment, we can add an assertion that the fragment is <128MB. I think it's reasonable to reject functions over such a size, as I'm not sure we'll ever see those in practice. Otherwise, it wouldn't be possible to check the condition here, since TargetBB is expected to be output into a different section (with an unassigned address).

As of now, if the target is out of range, we'll get an error from the emitter when the destination is in the same section, or from JITLink if it is in a different section. In the future, I expect JITLink to handle the latter case with thunks/veneers.

Member


Thanks for the explanation.

we'll get an error from the emitter when the destination is in the same section

Re the same section case:
I guess it's done with an assertion like this. Maybe in the future it makes sense to run follow-up checks to deal with this, as I guess it's too late by the time it reaches the emitter?

I stumbled upon such assertions by the emitter (pre compact-mode) and plan to follow up on them.

Contributor Author


I'd expect an error from here during the code emission.

@lhames
Contributor

lhames commented Oct 21, 2024

@lhames, do you know if anyone is working (or has plans) on AArch64 call relaxation in JITLink?

I'd love to have it, but don't have time to implement it myself at the moment. I'd be very happy to review patches if you or someone else has time to look at it.

@maksfb
Contributor Author

maksfb commented Oct 22, 2024

Added a test case for branch sharing.

@maksfb
Contributor Author

maksfb commented Oct 22, 2024

Unconditional branches are split with a trampoline.

Yes, they are isolated in separate basic blocks and registered as trampolines so that we can reuse them if needed. Otherwise the code should stay the same.

Conditional branches (excluding indirects) are relaxed/split with trampolines as needed, until we converge. During relaxation (relaxBranch), the pass attempts to add cold-branch trampolines at the end of the fragment. If out of range, those are added at the next block (just like with hot ones).

Correct.

Note that the new .text has to be smaller than 128MB

Could this be checked? Or do we pass this responsibility to the user since compact mode flag was their choice?

We will know the final layout only after the JITLink pass. At that point, JITLink would fail if the binary exceeds the limits. Of course, we can print a warning if the input binary was already larger than 128MB. At the moment, before we add thunk support to JITLink, we do pass the responsibility to the user.

produces slightly smaller code

Do we anticipate some trampoline duplication on unconditional branches to, say, common cold functions? Do we consistently expect code size to get smaller? Also, I'm curious if those code size reductions are still observed since we have now merged:

I don't think we have such an expectation. Just something I've noticed.

Member

@paschalis-mpeis paschalis-mpeis left a comment


Hey Maks, thanks for the update. Compact mode looks good.


I'm not sure how LongJmp might evolve in the future from a code design point of view.
E.g., if JITLink does relaxation for >128MB, then we may have a 'Tentative Mode' (the original LongJmp pass) and a 'Non-Tentative/Local Mode' (the evolution of compact mode).


Re code size decreases:

I don't think we have such expectation. Just something I've noticed.

Okay thanks. Quickly comparing sizes of a single binary (~55MB of text) in compact mode versus the updated LongJmp (#96609):

  • cold section had the same size
  • LongJmp's hot section was smaller, I guess due to hot stub sharing

DenseMap<const BinaryBasicBlock *, BinaryBasicBlock *> FragmentTrampolines;

// Create a trampoline code after \p BB or at the end of the fragment if BB
// is nullptr. If /p UpdateOffsets is true, update FragmentSize and offsets
Member


Suggested change
// is nullptr. If /p UpdateOffsets is true, update FragmentSize and offsets
// is nullptr. If \p UpdateOffsets is true, update FragmentSize and offsets

Member


nit


@maksfb
Contributor Author

maksfb commented Oct 29, 2024

I'm not sure how LongJmp might evolve in the future from a code design point of view. E.g., if JITLink does relaxation for >128MB, then we may have a 'Tentative Mode' (the original LongJmp pass) and a 'Non-Tentative/Local Mode' (the evolution of compact mode).

Thanks again for the code review, Paschalis. We might leave the original LongJmp for some time, but eventually I plan to replace it with JITLink or any other pass that doesn't rely on code layout assumptions.

Re code size decreases:

I don't think we have such expectation. Just something I've noticed.

Okay thanks. Quickly comparing sizes of a single binary (~55MB of text) in compact mode versus the updated LongJmp (#96609):

  • cold section had the same size
  • LongJmp's hot section was smaller, I guess due to hot stub sharing

This is quite possible when, for some functions, fragments are separated by less than 1MB (e.g. hot code is less than or close to 1MB). In such a case, some trampolines to cold code introduced by relaxLocalBranches() will be unnecessary. If, however, that's not the case, I would like to take a look at your test case and see if we can improve the pass.

I just checked the build with #96609 for Clang. Hot code is ~6MB and --compact-code-model is slightly more efficient.

@paschalis-mpeis
Member

This is quite possible when, for some functions, fragments are separated by less than 1MB (e.g. hot code is less than or close to 1MB). In such a case, some trampolines to cold code introduced by relaxLocalBranches() will be unnecessary. If, however, that's not the case, I would like to take a look at your test case and see if we can improve the pass.

Haven't checked if that was the case (i.e., fragments <1MB). For now just sharing what was done:

Version            Hot Size   Cold Size   Hot Size Diff   Cold Size Diff   Comments
Without #96609     735Ki      908Ki       -               -                baseline
With #96609        628Ki      803Ki       -17%            -13%
compact-mode       652Ki      805Ki       -12%            -12%

This comment has some 'Stub Insertion Stats' (hot/cold/shared) for LongJmp if more details are needed (at section 3).

@maksfb maksfb merged commit 49ee606 into llvm:main Nov 7, 2024
7 checks passed
@maksfb
Contributor Author

maksfb commented Nov 7, 2024

@paschalis-mpeis, in the example above, the hot size is less than 1MB, so it's expected that the "compact model" generates branches that could be redundant.

Groverkss pushed a commit to iree-org/llvm-project that referenced this pull request Nov 15, 2024
@maksfb maksfb deleted the gh-aarch64-compact-code-model branch March 6, 2025 02:46