
ELF: Add branch-to-branch optimization. #138366


Closed

Conversation

pcc
Contributor

@pcc pcc commented May 2, 2025

When code calls a function that immediately tail-calls another
function, there is no need to go via the intermediate function. By
branching directly to the target function we reduce the program's working
set, for a slight increase in runtime performance.

Functions that do nothing but tail-call another function are normally
relatively uncommon, but with LLVM control flow integrity we have jump
tables that replace the function itself as the canonical address. As a
result, when a function address is taken and called directly, for
example after a compiler optimization resolves the indirect call, or
when code built without control flow integrity calls the function, the
call goes via the jump table.

The impact of this optimization was measured using a large internal
Google benchmark. The results were as follows:

CFI enabled: +0.1% ± 0.05% queries per second
CFI disabled: +0.01% queries per second [not statistically significant]

The optimization is enabled by default at -O2 but may also be enabled
or disabled individually with --{,no-}branch-to-branch.

This optimization is implemented for AArch64 and X86_64 only.

lld's runtime performance (real execution time) after adding this
optimization was measured using firefox-x64 from lld-speed-test [1]
with ldflags "-O2 -S" on an Apple M2 Ultra. The results are as follows:

    N           Min           Max        Median           Avg        Stddev
x 512     1.2264546     1.3481076     1.2970261     1.2965788   0.018620888
+ 512     1.2561196     1.3839965     1.3214632     1.3209327   0.019443971
Difference at 95.0% confidence
	0.0243538 +/- 0.00233202
	1.87831% +/- 0.179859%
	(Student's t, pooled s = 0.0190369)

[1] https://discourse.llvm.org/t/improving-the-reproducibility-of-linker-benchmarking/86057

@llvmbot
Member

llvmbot commented May 2, 2025

@llvm/pr-subscribers-backend-loongarch
@llvm/pr-subscribers-coroutines
@llvm/pr-subscribers-compiler-rt-sanitizer
@llvm/pr-subscribers-mlgo
@llvm/pr-subscribers-backend-nvptx
@llvm/pr-subscribers-backend-directx
@llvm/pr-subscribers-vectorizers
@llvm/pr-subscribers-lld-coff
@llvm/pr-subscribers-lto
@llvm/pr-subscribers-flang-driver
@llvm/pr-subscribers-libcxxabi
@llvm/pr-subscribers-clang-tidy
@llvm/pr-subscribers-backend-spir-v
@llvm/pr-subscribers-bolt
@llvm/pr-subscribers-llvm-regalloc
@llvm/pr-subscribers-clangd
@llvm/pr-subscribers-clang-codegen
@llvm/pr-subscribers-lldb
@llvm/pr-subscribers-backend-sparc
@llvm/pr-subscribers-lld

@llvm/pr-subscribers-lld-elf

Author: Peter Collingbourne (pcc)

Changes

(PR description repeated above.)

Full diff: https://github.com/llvm/llvm-project/pull/138366.diff

11 Files Affected:

  • (modified) lld/ELF/Arch/AArch64.cpp (+58)
  • (added) lld/ELF/Arch/TargetImpl.h (+87)
  • (modified) lld/ELF/Arch/X86_64.cpp (+54)
  • (modified) lld/ELF/Config.h (+1)
  • (modified) lld/ELF/Driver.cpp (+2)
  • (modified) lld/ELF/Options.td (+4)
  • (modified) lld/ELF/Relocations.cpp (+6-2)
  • (modified) lld/ELF/Target.h (+1)
  • (modified) lld/docs/ld.lld.1 (+6-2)
  • (added) lld/test/ELF/aarch64-branch-to-branch.s (+58)
  • (added) lld/test/ELF/x86-64-branch-to-branch.s (+58)
diff --git a/lld/ELF/Arch/AArch64.cpp b/lld/ELF/Arch/AArch64.cpp
index 9538dd4a70bae..f3a24bd8a9184 100644
--- a/lld/ELF/Arch/AArch64.cpp
+++ b/lld/ELF/Arch/AArch64.cpp
@@ -11,6 +11,7 @@
 #include "Symbols.h"
 #include "SyntheticSections.h"
 #include "Target.h"
+#include "TargetImpl.h"
 #include "lld/Common/ErrorHandler.h"
 #include "llvm/BinaryFormat/ELF.h"
 #include "llvm/Support/Endian.h"
@@ -83,6 +84,7 @@ class AArch64 : public TargetInfo {
                 uint64_t val) const override;
   RelExpr adjustTlsExpr(RelType type, RelExpr expr) const override;
   void relocateAlloc(InputSectionBase &sec, uint8_t *buf) const override;
+  void applyBranchToBranchOpt() const override;
 
 private:
   void relaxTlsGdToLe(uint8_t *loc, const Relocation &rel, uint64_t val) const;
@@ -975,6 +977,62 @@ void AArch64::relocateAlloc(InputSectionBase &sec, uint8_t *buf) const {
   }
 }
 
+static std::optional<uint64_t> getControlTransferAddend(InputSection &is,
+                                                        Relocation &r) {
+  // Identify a control transfer relocation for the branch-to-branch
+  // optimization. A "control transfer relocation" means a B or BL
+  // target but it also includes relative vtable relocations for example.
+  //
+  // We require the relocation type to be JUMP26, CALL26 or PLT32. With a
+  // relocation type of PLT32 the value may be assumed to be used for branching
+  // directly to the symbol and the addend is only used to produce the relocated
+  // value (hence the effective addend is always 0). This is because if a PLT is
+  // needed the addend will be added to the address of the PLT, and it doesn't
+  // make sense to branch into the middle of a PLT. For example, relative vtable
+  // relocations use PLT32 and 0 or a positive value as the addend but still are
+  // used to branch to the symbol.
+  //
+  // With JUMP26 or CALL26 the only reasonable interpretation of a non-zero
+  // addend is that we are branching to symbol+addend so that becomes the
+  // effective addend.
+  if (r.type == R_AARCH64_PLT32)
+    return 0;
+  if (r.type == R_AARCH64_JUMP26 || r.type == R_AARCH64_CALL26)
+    return r.addend;
+  return std::nullopt;
+}
+
+static std::pair<Relocation *, uint64_t> getBranchInfo(InputSection &is,
+                                                       uint64_t offset) {
+  auto *i = std::lower_bound(
+      is.relocations.begin(), is.relocations.end(), offset,
+      [](Relocation &r, uint64_t offset) { return r.offset < offset; });
+  if (i != is.relocations.end() && i->offset == offset &&
+      i->type == R_AARCH64_JUMP26) {
+    return {i, i->addend};
+  }
+  return {nullptr, 0};
+}
+
+static void mergeControlTransferRelocations(Relocation &r1,
+                                            const Relocation &r2) {
+  r1.expr = r2.expr;
+  r1.sym = r2.sym;
+  // With PLT32 we must respect the original addend as that affects the value's
+  // interpretation. With the other relocation types the original addend is
+  // irrelevant because it referred to an offset within the original target
+  // section so we overwrite it.
+  if (r1.type == R_AARCH64_PLT32)
+    r1.addend += r2.addend;
+  else
+    r1.addend = r2.addend;
+}
+
+void AArch64::applyBranchToBranchOpt() const {
+  applyBranchToBranchOptImpl(ctx, getBranchInfo, getControlTransferAddend,
+                             mergeControlTransferRelocations);
+}
+
 // AArch64 may use security features in variant PLT sequences. These are:
 // Pointer Authentication (PAC), introduced in armv8.3-a and Branch Target
 // Indicator (BTI) introduced in armv8.5-a. The additional instructions used
diff --git a/lld/ELF/Arch/TargetImpl.h b/lld/ELF/Arch/TargetImpl.h
new file mode 100644
index 0000000000000..bb10749516953
--- /dev/null
+++ b/lld/ELF/Arch/TargetImpl.h
@@ -0,0 +1,87 @@
+//===- TargetImpl.h ---------------------------------------------*- C++ -*-===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+#ifndef LLD_ELF_ARCH_TARGETIMPL_H
+#define LLD_ELF_ARCH_TARGETIMPL_H
+
+#include "InputFiles.h"
+#include "InputSection.h"
+#include "Relocations.h"
+#include "Symbols.h"
+#include "llvm/BinaryFormat/ELF.h"
+
+namespace lld {
+namespace elf {
+
+// getControlTransferAddend: If this relocation is used for control transfer
+// instructions (e.g. branch, branch-link or call) or code references (e.g.
+// virtual function pointers) and indicates an address-insignificant reference,
+// return the effective addend for the relocation, otherwise return
+// std::nullopt. The effective addend for a relocation is the addend that is
+// used to determine its branch destination.
+//
+// getBranchInfo: If a control transfer relocation referring to is+offset
+// directly transfers control to a relocated branch instruction in the specified
+// section, return the relocation for the branch target as well as its effective
+// addend (see above). Otherwise return {nullptr, 0}.
+//
+// mergeControlTransferRelocations: Given r1, a relocation for which
+// getControlTransferAddend() returned a value, and r2, a relocation returned by
+// getBranchInfo(), modify r1 so that it branches directly to the target of r2.
+template <typename GetBranchInfo, typename GetControlTransferAddend,
+          typename MergeControlTransferRelocations>
+inline void applyBranchToBranchOptImpl(
+    Ctx &ctx, GetBranchInfo getBranchInfo,
+    GetControlTransferAddend getControlTransferAddend,
+    MergeControlTransferRelocations mergeControlTransferRelocations) {
+  // Needs to run serially because it writes to the relocations array as well as
+  // reading relocations of other sections.
+  for (ELFFileBase *f : ctx.objectFiles) {
+    auto getRelocBranchInfo =
+        [&ctx, &getBranchInfo](Relocation &r,
+               uint64_t addend) -> std::pair<Relocation *, uint64_t> {
+      auto *target = dyn_cast_or_null<Defined>(r.sym);
+      // We don't allow preemptible symbols (may go somewhere else),
+      // absolute symbols (runtime behavior unknown), non-executable memory
+      // (ditto) or non-regular sections (no section data).
+      if (!target || target->isPreemptible || !target->section ||
+          !(target->section->flags & llvm::ELF::SHF_EXECINSTR) ||
+          target->section->kind() != SectionBase::Regular)
+        return {nullptr, 0};
+      return getBranchInfo(*cast<InputSection>(target->section),
+                                       target->value + addend);
+    };
+    for (InputSectionBase *s : f->getSections()) {
+      if (!s)
+        continue;
+      for (Relocation &r : s->relocations) {
+        if (std::optional<uint64_t> addend =
+                getControlTransferAddend(*cast<InputSection>(s),
+                                                     r)) {
+          std::pair<Relocation *, uint64_t> targetAndAddend =
+              getRelocBranchInfo(r, *addend);
+          if (targetAndAddend.first) {
+            while (1) {
+              std::pair<Relocation *, uint64_t> nextTargetAndAddend =
+                  getRelocBranchInfo(*targetAndAddend.first, targetAndAddend.second);
+              if (!nextTargetAndAddend.first)
+                break;
+              targetAndAddend = nextTargetAndAddend;
+            }
+            mergeControlTransferRelocations(r, *targetAndAddend.first);
+          }
+        }
+      }
+    }
+  }
+}
+
+} // namespace elf
+} // namespace lld
+
+#endif
diff --git a/lld/ELF/Arch/X86_64.cpp b/lld/ELF/Arch/X86_64.cpp
index 0c4fd00cab65c..0a4578b0aca4b 100644
--- a/lld/ELF/Arch/X86_64.cpp
+++ b/lld/ELF/Arch/X86_64.cpp
@@ -11,6 +11,7 @@
 #include "Symbols.h"
 #include "SyntheticSections.h"
 #include "Target.h"
+#include "TargetImpl.h"
 #include "lld/Common/ErrorHandler.h"
 #include "llvm/BinaryFormat/ELF.h"
 #include "llvm/Support/Endian.h"
@@ -50,6 +51,7 @@ class X86_64 : public TargetInfo {
   bool deleteFallThruJmpInsn(InputSection &is, InputFile *file,
                              InputSection *nextIS) const override;
   bool relaxOnce(int pass) const override;
+  void applyBranchToBranchOpt() const override;
 
 private:
   void relaxTlsGdToLe(uint8_t *loc, const Relocation &rel, uint64_t val) const;
@@ -1162,6 +1164,58 @@ void X86_64::relocateAlloc(InputSectionBase &sec, uint8_t *buf) const {
   }
 }
 
+static std::optional<uint64_t> getControlTransferAddend(InputSection &is,
+                                                        Relocation &r) {
+  // Identify a control transfer relocation for the branch-to-branch
+  // optimization. A "control transfer relocation" usually means a CALL or JMP
+  // target but it also includes relative vtable relocations for example.
+  //
+  // We require the relocation type to be PLT32. With a relocation type of PLT32
+  // the value may be assumed to be used for branching directly to the symbol
+  // and the addend is only used to produce the relocated value (hence the
+  // effective addend is always 0). This is because if a PLT is needed the
+  // addend will be added to the address of the PLT, and it doesn't make sense
+  // to branch into the middle of a PLT. For example, relative vtable
+  // relocations use PLT32 and 0 or a positive value as the addend but still are
+  // used to branch to the symbol.
+  if (r.type == R_X86_64_PLT32)
+    return 0;
+  return std::nullopt;
+}
+
+static std::pair<Relocation *, uint64_t> getBranchInfo(InputSection &is,
+                                                       uint64_t offset) {
+  auto content = is.contentMaybeDecompress();
+  if (content.size() > offset && content[offset] == 0xe9) { // JMP immediate
+    auto *i = std::lower_bound(
+        is.relocations.begin(), is.relocations.end(), offset + 1,
+        [](Relocation &r, uint64_t offset) { return r.offset < offset; });
+    // Unlike with getControlTransferAddend() it is valid to accept a PC32
+    // relocation here because we know that this is actually a JMP and not some
+    // other reference, so the interpretation is that we add 4 to the addend and
+    // use that as the effective addend.
+    if (i != is.relocations.end() && i->offset == offset + 1 &&
+        (i->type == R_X86_64_PC32 || i->type == R_X86_64_PLT32)) {
+      return {i, i->addend + 4};
+    }
+  }
+  return {nullptr, 0};
+}
+
+static void mergeControlTransferRelocations(Relocation &r1,
+                                            const Relocation &r2) {
+  r1.expr = r2.expr;
+  r1.sym = r2.sym;
+  // The +4 is here to compensate for r2.addend which will likely be -4,
+  // but may also be addend-4 in case of a PC32 branch to symbol+addend.
+  r1.addend += r2.addend + 4;
+}
+
+void X86_64::applyBranchToBranchOpt() const {
+  applyBranchToBranchOptImpl(ctx, getBranchInfo, getControlTransferAddend,
+                             mergeControlTransferRelocations);
+}
+
 // If Intel Indirect Branch Tracking is enabled, we have to emit special PLT
 // entries containing endbr64 instructions. A PLT entry will be split into two
 // parts, one in .plt.sec (writePlt), and the other in .plt (writeIBTPlt).
diff --git a/lld/ELF/Config.h b/lld/ELF/Config.h
index f0e9592d85dd6..b7449b9d13cf5 100644
--- a/lld/ELF/Config.h
+++ b/lld/ELF/Config.h
@@ -276,6 +276,7 @@ struct Config {
   bool bpFunctionOrderForCompression = false;
   bool bpDataOrderForCompression = false;
   bool bpVerboseSectionOrderer = false;
+  bool branchToBranch = false;
   bool checkSections;
   bool checkDynamicRelocs;
   std::optional<llvm::DebugCompressionType> compressDebugSections;
diff --git a/lld/ELF/Driver.cpp b/lld/ELF/Driver.cpp
index 9d36071e1532f..e79372957e408 100644
--- a/lld/ELF/Driver.cpp
+++ b/lld/ELF/Driver.cpp
@@ -1589,6 +1589,8 @@ static void readConfigs(Ctx &ctx, opt::InputArgList &args) {
   ctx.arg.zWxneeded = hasZOption(args, "wxneeded");
   setUnresolvedSymbolPolicy(ctx, args);
   ctx.arg.power10Stubs = args.getLastArgValue(OPT_power10_stubs_eq) != "no";
+  ctx.arg.branchToBranch = args.hasFlag(
+      OPT_branch_to_branch, OPT_no_branch_to_branch, ctx.arg.optimize >= 2);
 
   if (opt::Arg *arg = args.getLastArg(OPT_eb, OPT_el)) {
     if (arg->getOption().matches(OPT_eb))
diff --git a/lld/ELF/Options.td b/lld/ELF/Options.td
index 76d28096f82c8..40fc0d2c8c64e 100644
--- a/lld/ELF/Options.td
+++ b/lld/ELF/Options.td
@@ -59,6 +59,10 @@ def build_id: J<"build-id=">, HelpText<"Generate build ID note">,
   MetaVarName<"[fast,md5,sha1,uuid,0x<hexstring>]">;
 def : F<"build-id">, Alias<build_id>, AliasArgs<["sha1"]>, HelpText<"Alias for --build-id=sha1">;
 
+defm branch_to_branch: B<"branch-to-branch",
+    "Enable branch-to-branch optimization (default at -O2)",
+    "Disable branch-to-branch optimization (default at -O0 and -O1)">;
+
 defm check_sections: B<"check-sections",
     "Check section addresses for overlaps (default)",
     "Do not check section addresses for overlaps">;
diff --git a/lld/ELF/Relocations.cpp b/lld/ELF/Relocations.cpp
index 277acb26987bc..457fd19da5493 100644
--- a/lld/ELF/Relocations.cpp
+++ b/lld/ELF/Relocations.cpp
@@ -1671,9 +1671,10 @@ void RelocationScanner::scan(Relocs<RelTy> rels) {
   }
 
   // Sort relocations by offset for more efficient searching for
-  // R_RISCV_PCREL_HI20 and R_PPC64_ADDR64.
+  // R_RISCV_PCREL_HI20, R_PPC64_ADDR64 and the branch-to-branch optimization.
   if (ctx.arg.emachine == EM_RISCV ||
-      (ctx.arg.emachine == EM_PPC64 && sec->name == ".toc"))
+      (ctx.arg.emachine == EM_PPC64 && sec->name == ".toc") ||
+      ctx.arg.branchToBranch)
     llvm::stable_sort(sec->relocs(),
                       [](const Relocation &lhs, const Relocation &rhs) {
                         return lhs.offset < rhs.offset;
@@ -1964,6 +1965,9 @@ void elf::postScanRelocations(Ctx &ctx) {
   for (ELFFileBase *file : ctx.objectFiles)
     for (Symbol *sym : file->getLocalSymbols())
       fn(*sym);
+
+  if (ctx.arg.branchToBranch)
+    ctx.target->applyBranchToBranchOpt();
 }
 
 static bool mergeCmp(const InputSection *a, const InputSection *b) {
diff --git a/lld/ELF/Target.h b/lld/ELF/Target.h
index fd1e5d33c438a..6dd20b2f0cbaa 100644
--- a/lld/ELF/Target.h
+++ b/lld/ELF/Target.h
@@ -101,6 +101,7 @@ class TargetInfo {
 
   virtual void applyJumpInstrMod(uint8_t *loc, JumpModType type,
                                  JumpModType val) const {}
+  virtual void applyBranchToBranchOpt() const {}
 
   virtual ~TargetInfo();
 
diff --git a/lld/docs/ld.lld.1 b/lld/docs/ld.lld.1
index 7b2650637cb10..d7b987ded784d 100644
--- a/lld/docs/ld.lld.1
+++ b/lld/docs/ld.lld.1
@@ -93,6 +93,10 @@ Bind default visibility defined STB_GLOBAL function symbols locally for
 .Fl shared.
 .It Fl --be8
 Write a Big Endian ELF File using BE8 format(AArch32 only)
+.It Fl -branch-to-branch
+Enable the branch-to-branch optimizations: a branch whose target is
+another branch instruction is rewritten to point to the latter branch
+target (AArch64 and X86_64 only). Enabled by default at -O2.
 .It Fl -build-id Ns = Ns Ar value
 Generate a build ID note.
 .Ar value
@@ -414,7 +418,7 @@ If not specified,
 .Dv a.out
 is used as a default.
 .It Fl O Ns Ar value
-Optimize output file size.
+Optimize output file.
 .Ar value
 may be:
 .Pp
@@ -424,7 +428,7 @@ Disable string merging.
 .It Cm 1
 Enable string merging.
 .It Cm 2
-Enable string tail merging.
+Enable string tail merging and branch-to-branch optimization.
 .El
 .Pp
 .Fl O Ns Cm 1
diff --git a/lld/test/ELF/aarch64-branch-to-branch.s b/lld/test/ELF/aarch64-branch-to-branch.s
new file mode 100644
index 0000000000000..3a3ae04ac0538
--- /dev/null
+++ b/lld/test/ELF/aarch64-branch-to-branch.s
@@ -0,0 +1,58 @@
+# REQUIRES: aarch64
+
+## Test that the branch-to-branch optimization follows the links
+## from f1 -> f2 -> f3 and updates all references to point to f3.
+
+# RUN: llvm-mc -filetype=obj -triple=aarch64-pc-linux %s -o %t.o
+# RUN: ld.lld %t.o -o %t --branch-to-branch
+# RUN: llvm-objdump -d -s %t | FileCheck --check-prefix=B2B %s
+# RUN: ld.lld %t.o -o %t -O2
+# RUN: llvm-objdump -d -s %t | FileCheck --check-prefix=B2B %s
+
+## Test that branch-to-branch is disabled by default.
+
+# RUN: ld.lld %t.o -o %t
+# RUN: llvm-objdump -d -s %t | FileCheck --check-prefix=NOB2B %s
+# RUN: ld.lld %t.o -o %t -O2 --no-branch-to-branch
+# RUN: llvm-objdump -d -s %t | FileCheck --check-prefix=NOB2B %s
+
+## Test that branch-to-branch is disabled for preemptible symbols.
+
+# RUN: ld.lld %t.o -o %t --branch-to-branch -shared
+# RUN: llvm-objdump -d -s %t | FileCheck --check-prefix=NOB2B %s
+
+.section .rodata.vtable,"a"
+.globl vtable
+vtable:
+# B2B: Contents of section .rodata:
+# B2B-NEXT: [[VF:[0-9a-f]{8}]]
+.4byte f1@PLT - vtable
+# B2B-SAME: [[VF]]
+.4byte f2@PLT - vtable
+# B2B-SAME: [[VF]]
+.4byte f3@PLT - vtable
+
+.section .text._start,"ax"
+.globl _start
+_start:
+# B2B: bl {{.*}} <f3>
+# NOB2B: bl {{.*}} <f1{{.*}}>
+bl f1
+# B2B: b {{.*}} <f3>
+# NOB2B: b {{.*}} <f2{{.*}}>
+b f2
+
+.section .text.f1,"ax"
+.globl f1
+f1:
+b f2
+
+.section .text.f2,"ax"
+.globl f2
+f2:
+b f3
+
+.section .text.f3,"ax"
+.globl f3
+f3:
+ret
diff --git a/lld/test/ELF/x86-64-branch-to-branch.s b/lld/test/ELF/x86-64-branch-to-branch.s
new file mode 100644
index 0000000000000..b9c9abe2eb752
--- /dev/null
+++ b/lld/test/ELF/x86-64-branch-to-branch.s
@@ -0,0 +1,58 @@
+# REQUIRES: x86
+
+## Test that the branch-to-branch optimization follows the links
+## from f1 -> f2 -> f3 and updates all references to point to f3.
+ 
+# RUN: llvm-mc -filetype=obj -triple=x86_64-pc-linux %s -o %t.o
+# RUN: ld.lld %t.o -o %t --branch-to-branch
+# RUN: llvm-objdump -d -s %t | FileCheck --check-prefix=B2B %s
+# RUN: ld.lld %t.o -o %t -O2
+# RUN: llvm-objdump -d -s %t | FileCheck --check-prefix=B2B %s
+
+## Test that branch-to-branch is disabled by default.
+
+# RUN: ld.lld %t.o -o %t
+# RUN: llvm-objdump -d -s %t | FileCheck --check-prefix=NOB2B %s
+# RUN: ld.lld %t.o -o %t -O2 --no-branch-to-branch
+# RUN: llvm-objdump -d -s %t | FileCheck --check-prefix=NOB2B %s
+
+## Test that branch-to-branch is disabled for preemptible symbols.
+
+# RUN: ld.lld %t.o -o %t --branch-to-branch -shared
+# RUN: llvm-objdump -d -s %t | FileCheck --check-prefix=NOB2B %s
+
+.section .rodata.vtable,"a"
+.globl vtable
+vtable:
+# B2B: Contents of section .rodata:
+# B2B-NEXT: [[VF:[0-9a-f]{8}]]
+.4byte f1@PLT - vtable
+# B2B-SAME: [[VF]]
+.4byte f2@PLT - vtable
+# B2B-SAME: [[VF]]
+.4byte f3@PLT - vtable
+
+.section .text._start,"ax"
+.globl _start
+_start:
+# B2B: jmp {{.*}} <f3>
+# NOB2B: jmp {{.*}} <f1{{.*}}>
+jmp f1
+# B2B: jmp {{.*}} <f3>
+# NOB2B: jmp {{.*}} <f2{{.*}}>
+jmp f2
+
+.section .text.f1,"ax"
+.globl f1
+f1:
+jmp f2
+
+.section .text.f2,"ax"
+.globl f2
+f2:
+jmp f3
+
+.section .text.f3,"ax"
+.globl f3
+f3:
+ret

@pcc pcc requested review from MaskRay and smithp35 May 2, 2025 23:59

github-actions bot commented May 2, 2025

✅ With the latest revision this PR passed the C/C++ code formatter.

Member

@MaskRay MaskRay left a comment


(Still on a trip with limited computer access)

@@ -59,6 +59,10 @@ def build_id: J<"build-id=">, HelpText<"Generate build ID note">,
MetaVarName<"[fast,md5,sha1,uuid,0x<hexstring>]">;
def : F<"build-id">, Alias<build_id>, AliasArgs<["sha1"]>, HelpText<"Alias for --build-id=sha1">;

defm branch_to_branch: B<"branch-to-branch",
Member


Use BB (which only allows `--` long options) for new options, since there is no GNU ld compatibility concern.

Contributor Author


Done

vtable:
# B2B: Contents of section .rodata:
# B2B-NEXT: [[VF:[0-9a-f]{8}]]
.4byte f1@PLT - vtable
Member


Note: f1@PLT - . is unofficial syntax (in both AArch64 and X86) that will change. It should be straightforward to update the uses when the change happens.

(For RISC-V I've switched to %pltpcrel(foo+offset); see #132569.)

Contributor Author


Ack

# B2B: b {{.*}} <f3>
# NOB2B: b {{.*}} <f2{{.*}}>
b f2

Member


There is no negative test (i.e., one where the destination does not start with a branch instruction).

Contributor Author


Done, we now check that the branch in f2 still goes to f3.

Collaborator

@smithp35 smithp35 left a comment


As a general idea this looks good to me. Some thoughts on some possible additional tests and edge-cases.

The BOLT folks have been asking whether range-extension thunks can output relocations so that they can recreate the control flow more easily. It looks like this will handle emit-relocs naturally; it could be worth a test case to check.

This looks like it runs before range extension thunks, which likely simplifies the logic, as we don't need to check branch ranges. However, due to the way thunks are currently placed, it is possible that we remove a branch to a branch, only for the thunk generation code to then place a range extension thunk such that a long branch is generated. For example:

fn:
b fn2
...
fn2: // in range of fn
b fn3
...
fn3: // in range of fn2 but not fn

If we change that so that fn branches directly to fn3, our range extension thunk may not be conveniently placed, and an indirect branch may be needed.

If this is a concern for large programs it may be better to do this after range extension thunks, although it would need a range check before applying it.

Alternatively, running it earlier than garbage collection would in theory permit some sections (a branch in its own section) to be removed, but that's likely to be a negligible saving.

@pcc
Contributor Author

pcc commented May 6, 2025

The bolt folk have been asking if the range-extension thunks can output relocations so that they can recreate the control flow more easily. It looks like this will handle emit-relocs naturally, could be worth a test case to check.

It doesn't look like that works because InputSection::copyRelocations reads the original relocation section and not the relocations array.

This looks like it runs before range extension thunks, which likely simplifies the logic as we don't need to check for branch range. Due to the way thunks are currently placed it could be possible that we end up removing a branch to a branch, only for the thunk generation code placing a range extension thunk such that a long-branch is generated.

Correct. I placed it before range extension thunks deliberately, as it means that in some cases we would be able to avoid the range extension thunk entirely. I had the CFI jump table case in mind, where typically all the jump tables will appear near the end of the program (link position of the combined full LTO object file). For example:

f1:
b f2
f2.cfi:
ret
[....]
f2:
b f2.cfi

So it's neutral (or possibly positive) in terms of the number of branches executed, but possibly negative for code size. That being said, this optimization did not change the number of thunks created in lld-speed-test/firefox-arm64, but that doesn't use CFI. (The optimization didn't fire at all in lld-speed-test/chrome because of BTI instructions, but that should be fixable in a followup.)

Alternatively, running it earlier than garbage collection would in theory permit some sections (a branch in its own section) to be removed, but that's likely to be a negligible saving.

Correct. Also, if it ran before ICF that could expose more ICF opportunities but that's likely to be minor as well.

I considered placing it before ICF and GC, but we don't really have any infrastructure for rewriting relocation targets until reaching the relocation scan in the writer, and if we moved that earlier it would mean doing extra work (creating relocation lists for ICF'd and GC'd sections) for only a minor benefit.

@pcc
Contributor Author

pcc commented May 6, 2025

After further testing of this patch I found the following problem affecting cross-section branches within the same object file.

.section .text.a,"ax",@progbits
jmp hfoo
jmp hfoo+1
jmp lfoo
jmp lfoo+1
jmp .Lfoo
jmp .Lfoo+1
jmp .text
jmp bar

.text
.globl hfoo
.hidden hfoo
hfoo:
lfoo:
  ret
.Lfoo:
  ret
pcc@pcc0 /tmp> as jmpl.s && objdump -dr

a.out:     file format elf64-x86-64


Disassembly of section .text:

0000000000000000 <hfoo>:
   0:	c3                   	ret
   1:	c3                   	ret

Disassembly of section .text.a:

0000000000000000 <.text.a>:
   0:	e9 00 00 00 00       	jmp    5 <.text.a+0x5>
			1: R_X86_64_PLT32	hfoo-0x4
   5:	e9 00 00 00 00       	jmp    a <.text.a+0xa>
			6: R_X86_64_PC32	hfoo-0x3
   a:	e9 00 00 00 00       	jmp    f <.text.a+0xf>
			b: R_X86_64_PC32	.text-0x4
   f:	e9 00 00 00 00       	jmp    14 <.text.a+0x14>
			10: R_X86_64_PC32	.text-0x3
  14:	e9 00 00 00 00       	jmp    19 <.text.a+0x19>
			15: R_X86_64_PC32	.text-0x3
  19:	e9 00 00 00 00       	jmp    1e <.text.a+0x1e>
			1a: R_X86_64_PC32	.text-0x2
  1e:	e9 00 00 00 00       	jmp    23 <.text.a+0x23>
			1f: R_X86_64_PC32	.text-0x4
  23:	e9 00 00 00 00       	jmp    28 <hfoo+0x28>
			24: R_X86_64_PLT32	bar-0x4
```
pcc@pcc0 /tmp> ~/l2/ra/bin/clang -c jmpl.s && objdump -dr jmpl.o

jmpl.o:     file format elf64-x86-64


Disassembly of section .text:

0000000000000000 <hfoo>:
   0:	c3                   	ret
   1:	c3                   	ret

Disassembly of section .text.a:

0000000000000000 <.text.a>:
   0:	e9 00 00 00 00       	jmp    5 <.text.a+0x5>
			1: R_X86_64_PLT32	hfoo-0x4
   5:	e9 00 00 00 00       	jmp    a <.text.a+0xa>
			6: R_X86_64_PC32	hfoo-0x3
   a:	e9 00 00 00 00       	jmp    f <.text.a+0xf>
			b: R_X86_64_PLT32	.text-0x4
   f:	e9 00 00 00 00       	jmp    14 <.text.a+0x14>
			10: R_X86_64_PC32	.text-0x3
  14:	e9 00 00 00 00       	jmp    19 <.text.a+0x19>
			15: R_X86_64_PLT32	.text-0x3
  19:	e9 00 00 00 00       	jmp    1e <.text.a+0x1e>
			1a: R_X86_64_PC32	.text-0x2
  1e:	e9 00 00 00 00       	jmp    23 <.text.a+0x23>
			1f: R_X86_64_PLT32	.text-0x4
  23:	e9 00 00 00 00       	jmp    28 <hfoo+0x28>
			24: R_X86_64_PLT32	bar-0x4
```

The problematic case is jmp .Lfoo, which generates a PLT32 relocation with an addend other than -4. To fix this we could change the LLVM assembler to generate PC32 for branches that would otherwise result in an addend other than -4. I don't think we have to match GNU as in the jmp lfoo or jmp .text cases. As a temporary workaround for old object files, we could change how r_addend is interpreted on x86:

  • If relocation is on a non-SHF_EXECINSTR section: effective addend is always 0 (relocation assumed to be for a relative vtable)
  • If relocation is on an SHF_EXECINSTR section: effective addend is r_addend+4 (relocation assumed to be for a branch instruction)
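The proposed interpretation can be sketched as follows. This is a minimal illustration under the assumption that x86-64 branch relocations carry the usual -4 PC-relative bias; the names here are invented for the sketch, not lld's:

```cpp
#include <cassert>
#include <cstdint>

struct Reloc {
  int64_t addend; // r_addend from the RELA entry
};

// Hypothetical workaround: interpret r_addend based on the flags of the
// section the relocation applies to.
int64_t effectiveAddend(const Reloc &r, bool sectionIsExecutable) {
  if (!sectionIsExecutable)
    return 0; // assumed to be a relative vtable entry: ignore the addend
  // Assumed to be a branch: undo the -4 PC-relative bias so the result is
  // the offset of the branch target within the symbol.
  return r.addend + 4;
}
```

For example, a plain `jmp foo` (r_addend = -4) would yield an effective addend of 0, i.e. a branch to the start of foo.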

@pcc
Contributor Author

pcc commented May 7, 2025

#138795 fixes the MC bug cited above. I'll implement the old object file workaround later.

@pcc
Contributor Author

pcc commented May 9, 2025

I changed my mind about #138366 (comment) and decided that the LLVM assembler is correct and that not handling this case is a linker bug. I updated this PR to fix it by checking the symbol type.

I also updated this PR with two more bug fixes found during testing:

  • Don't follow relocations to STT_GNU_IFUNC symbols
  • Don't get stuck in an infinite loop when encountering a branch loop
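To illustrate the branch-loop hazard, here is a hedged sketch of chain-following with a loop/hop guard. All names and the map-based model are invented for the sketch; lld's actual implementation works on relocations, not an address map:

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <optional>

// Hypothetical model: each key is the address of an unconditional branch
// and the value is the address it jumps to.
using BranchMap = std::map<uint64_t, uint64_t>;

// Follow a chain of branches to its final destination. Bail out with
// std::nullopt if we come back to the start or exceed a hop budget, so a
// branch loop cannot make us spin forever.
std::optional<uint64_t> finalTarget(const BranchMap &branches,
                                    uint64_t start) {
  uint64_t cur = start;
  for (int hops = 0; hops < 64; ++hops) {
    auto it = branches.find(cur);
    if (it == branches.end())
      return cur; // not a branch: this is the real destination
    if (it->second == start)
      return std::nullopt; // direct loop back to where we began
    cur = it->second;
  }
  return std::nullopt; // chain too long (or a loop not through start)
}
```

Two stubs that tail call each other (0 -> 4 -> 0) would be detected as a loop and left alone, while a simple stub chain resolves to the final address.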

@pcc
Contributor Author

pcc commented May 21, 2025

Ping.

Collaborator

@smithp35 smithp35 left a comment

I've made a few suggestions for some names that I found a bit confusing. The AArch64 side looks good to me. I've not checked the x86_64 side.

// section, return the relocation for the branch target as well as its effective
// addend (see above). Otherwise return {nullptr, 0}.
//
// mergeControlTransferRelocations: Given r1, a relocation for which
Collaborator

As a personal preference, I think redirectControlTransferRelocation could work better than mergeControlTransferRelocations. A possible interpretation of merge is that it makes both relocations the same, but it is more like redirecting the destination of just one.

Contributor Author

Done

// We don't allow preemptible symbols or ifuncs (may go somewhere else),
// absolute symbols (runtime behavior unknown), non-executable memory
// (ditto) or non-regular sections (no section data).
if (!target || target->isPreemptible || target->isGnuIFunc() ||
Collaborator

While uncommon, it is possible to have SHT_REL relocs with a non-zero addend. I know of at least one tool that can generate them. I don't think these need to be supported, but it could be worth skipping any that are encountered.

Contributor Author

Shouldn't SHT_REL just work already because we read the implicit addend when producing the Relocation object?

I wanted to add a test case for this, but it looks like llvm-mc doesn't have an option to write SHT_REL; instead, SHT_REL is tested with yaml2obj hacks, e.g. lld/test/ELF/aarch64-reloc-implicit-addend.test. I think that test already provides enough coverage of the SHT_REL path (otherwise we would need duplicate, difficult-to-maintain tests of every feature that processes Relocations).

Collaborator

Yes, just checked and it does copy the relocation addend.

I agree that this wouldn't need a test case.

As an aside, when checking where the addends were read in, I ran into this bit of copyRelocations: https://github.com/llvm/llvm-project/blob/main/lld/ELF/InputSection.cpp#L433

```
  if (ctx.arg.relax && !ctx.arg.relocatable &&
      (ctx.arg.emachine == EM_RISCV || ctx.arg.emachine == EM_LOONGARCH)) {
    // On LoongArch and RISC-V, relaxation might change relocations: copy
    // from internal ones that are updated by relaxation.
    InputSectionBase *sec = getRelocatedSection();
    copyRelocations<ELFT, RelTy>(
        ctx, buf,
        llvm::make_range(sec->relocations.begin(), sec->relocations.end()));
```
I think I mentioned in a previous comment that bolt uses --emit-relocs, so it may be worth following suit here when the transformation is applied.

I suspect that if bolt trusts the original relocation then, in the worst case, the transformation is simply undone.

Member

I agree that there should be a --emit-relocs test

Contributor Author

Added an --emit-relocs test; also switched to the other code path in InputSection.cpp.

// std::nullopt. The effective addend for a relocation is the addend that is
// used to determine its branch destination.
//
// getBranchInfo: If a control transfer relocation referring to is+offset
Collaborator

Could we name this getBranchInfoAtTarget, or use some other synonym of Target, to indicate at the call site that we're referring to the information at the end of the relocation and not the control transfer relocation we started at?

Maybe it's just me, but I had to double-check to make sure I had the right one.

Contributor Author

Done

@pcc pcc changed the base branch from main to users/pcc/spr/main.elf-add-branch-to-branch-optimization May 23, 2025 04:31
@pcc
Contributor Author

pcc commented May 23, 2025

Right, this feature doesn't change section sizes, so there shouldn't be an interaction with SHT_LLVM_BB_ADDR_MAP. AFAICT LLD doesn't contain code that parses SHT_LLVM_BB_ADDR_MAP so I don't see value in adding a test for it.

pcc added a commit that referenced this pull request May 23, 2025
When code calls a function which then immediately tail calls another
function there is no need to go via the intermediate function. By
branching directly to the target function we reduce the program's working
set for a slight increase in runtime performance.

Normally it is relatively uncommon to have functions that just tail call
another function, but with LLVM control flow integrity we have jump tables
that replace the function itself as the canonical address. As a result,
when a function address is taken and called directly, for example after
a compiler optimization resolves the indirect call, or if code built
without control flow integrity calls the function, the call will go via
the jump table.

The impact of this optimization was measured using a large internal
Google benchmark. The results were as follows:

CFI enabled:  +0.1% ± 0.05% queries per second
CFI disabled: +0.01% queries per second [not statistically significant]

The optimization is enabled by default at -O2 but may also be enabled
or disabled individually with --{,no-}branch-to-branch.

This optimization is implemented for AArch64 and X86_64 only.

lld's runtime performance (real execution time) after adding this
optimization was measured using firefox-x64 from lld-speed-test [1]
with ldflags "-O2 -S" on an Apple M2 Ultra. The results are as follows:

```
    N           Min           Max        Median           Avg        Stddev
x 512     1.2264546     1.3481076     1.2970261     1.2965788   0.018620888
+ 512     1.2561196     1.3839965     1.3214632     1.3209327   0.019443971
Difference at 95.0% confidence
	0.0243538 +/- 0.00233202
	1.87831% +/- 0.179859%
	(Student's t, pooled s = 0.0190369)
```

[1] https://discourse.llvm.org/t/improving-the-reproducibility-of-linker-benchmarking/86057

Pull Request: #138366
@pcc pcc force-pushed the users/pcc/spr/elf-add-branch-to-branch-optimization branch from e0581c8 to d67e152 Compare May 23, 2025 23:57
@smithp35
Collaborator

Thanks for the updates. I don't have any more comments.

@pcc
Contributor Author

pcc commented Jun 10, 2025

@MaskRay ping.

@MaskRay
Member

MaskRay commented Jun 15, 2025

pcc wants to merge 1 commit into users/pcc/spr/main.elf-add-branch-to-branch-optimization from users/pcc/spr/elf-add-branch-to-branch-optimization

Is the base branch associated with an open PR?

Neither

curl -L https://github.com/llvm/llvm-project/pull/138366.diff | patch -p1

nor

spr patch 138366

works.

@pcc
Contributor Author

pcc commented Jun 15, 2025 via email


static std::pair<Relocation *, uint64_t> getBranchInfo(InputSection &is,
uint64_t offset) {
auto *i = std::lower_bound(
Member

Complex lower_bound/upper_bound can be simplified with partition_point
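For reference, a minimal sketch of the equivalence, using std::partition_point (which llvm::partition_point wraps for ranges); the Relocation struct and helper below are stand-ins for illustration, not lld's definitions:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct Relocation {
  unsigned offset; // stand-in for r_offset
};

// Find the first relocation at or after `offset`, assuming `rels` is
// sorted by offset. partition_point takes the predicate directly, which
// reads more simply than lower_bound with a custom comparator.
const Relocation *firstAtOrAfter(const std::vector<Relocation> &rels,
                                 unsigned offset) {
  auto it = std::partition_point(
      rels.begin(), rels.end(),
      [offset](const Relocation &r) { return r.offset < offset; });
  return it == rels.end() ? nullptr : &*it;
}
```

The predicate partitions the range into "before offset" and "at or after offset"; partition_point returns the boundary, exactly what the lower_bound call computes.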

Contributor Author

Done

#include "Relocations.h"
#include "Symbols.h"
#include "llvm/BinaryFormat/ELF.h"

Member

namespace lld::elf

Contributor Author

Done

@MaskRay
Member

MaskRay commented Jun 15, 2025

Perhaps you'll need to change the base branch to main and force push to users/pcc/spr/elf-add-branch-to-branch-optimization?

if (!s)
continue;
for (Relocation &r : s->relocations) {
if (std::optional<uint64_t> addend =
Member

Can use an early return for this if and the next if to reduce indentation.
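For illustration, the early-continue shape being suggested; the types and the getControlTransferAddend helper below are simplified stand-ins for the ones in the patch, not lld's actual definitions:

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

struct Relocation {
  int64_t addend;
};

// Stand-in: returns an effective addend for control transfer relocations,
// std::nullopt otherwise.
static std::optional<uint64_t> getControlTransferAddend(const Relocation &r) {
  if (r.addend < 0)
    return std::nullopt;
  return static_cast<uint64_t>(r.addend);
}

// With an early `continue`, the interesting code stays at one level of
// indentation instead of nesting inside the `if`.
int countControlTransfers(const std::vector<Relocation> &rels) {
  int n = 0;
  for (const Relocation &r : rels) {
    std::optional<uint64_t> addend = getControlTransferAddend(r);
    if (!addend)
      continue; // bail out early instead of wrapping the body in an if
    // ... would follow the branch using *addend here ...
    ++n;
  }
  return n;
}
```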

Contributor Author

Done

@@ -0,0 +1,94 @@
//===- TargetImpl.h ---------------------------------------------*- C++ -*-===//
Member

//===----------------------------------------------------------------------===// for new file per https://llvm.org/docs/CodingStandards.html#file-headers
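For reference, a new-style file header per the coding standards looks something like the following (the separator line with no filename, followed by the standard license block):

```cpp
//===----------------------------------------------------------------------===//
//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
//===----------------------------------------------------------------------===//
```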

Contributor Author

Done

@pcc pcc changed the base branch from users/pcc/spr/main.elf-add-branch-to-branch-optimization to main June 19, 2025 04:59
@pcc pcc force-pushed the users/pcc/spr/elf-add-branch-to-branch-optimization branch from d67e152 to 01efd87 Compare June 19, 2025 05:02
@pcc
Contributor Author

pcc commented Jun 19, 2025

Perhaps you'll need to change the base branch to main and force push to users/pcc/spr/elf-add-branch-to-branch-optimization?

Yeah, done

Member

@MaskRay MaskRay left a comment

Looks good to me! Let's give @smithp35 a chance to share feedback too.

@@ -0,0 +1,93 @@
//===- TargetImpl.h -------------------------------------------------------===//
Member

Nit: switch to the new style as well

@@ -430,8 +430,10 @@ InputSectionBase *InputSection::getRelocatedSection() const {

template <class ELFT, class RelTy>
void InputSection::copyRelocations(Ctx &ctx, uint8_t *buf) {
if (ctx.arg.relax && !ctx.arg.relocatable &&
(ctx.arg.emachine == EM_RISCV || ctx.arg.emachine == EM_LOONGARCH)) {
if (!ctx.arg.relocatable &&
Member

The condition is now complex... Perhaps define a variable for linker relaxation targets (RISCV,LoongArch)?

bool linkerRelax = ctx.arg.relax && is_contained({EM_RISCV, EM_LOONGARCH}, ctx.arg.emachine);
...
(linkerRelax || ctx.arg.branchToBranch)

.It Fl -branch-to-branch
Enable the branch-to-branch optimizations: a branch whose target is
another branch instruction is rewritten to point to the latter branch
target (AArch64 and X86_64 only). Enabled by default at -O2.
Member

-O2 should use .Fl O2. (unfortunately the dot after -O2 has to be highlighted as well...)

Collaborator

.Fl O2 Ns .?
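Concretely, the last sentence of the option description would then be marked up as something like:

```
target (AArch64 and X86_64 only). Enabled by default at
.Fl O2 Ns .
```

Here .Ns suppresses the space so the period follows -O2 directly.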

@MaskRay
Member

MaskRay commented Jun 19, 2025

Could you also update lld/docs/ReleaseNotes.rst?

pcc added a commit that referenced this pull request Jun 20, 2025
@pcc
Contributor Author

pcc commented Jun 20, 2025

Could you also update lld/docs/ReleaseNotes.rst?

Done

@pcc
Contributor Author

pcc commented Jun 20, 2025

Merged 491b82a

@pcc pcc closed this Jun 20, 2025
llvm-sync bot pushed a commit to arm/arm-toolchain that referenced this pull request Jun 20, 2025
@carlocab carlocab deleted the users/pcc/spr/elf-add-branch-to-branch-optimization branch June 21, 2025 07:25
@zmodem
Collaborator

zmodem commented Jun 23, 2025

We're hitting an assert after this. I've attached a reproducer at https://crbug.com/427013249#comment3

$ tar zxf /tmp/repro.tar.gz
$ ( cd repro && ../build/bin/ld.lld @response.txt )
ld.lld: /work/llvm-project/llvm/include/llvm/Support/Casting.h:578: decltype(auto) llvm::cast(From*) [with To = lld::elf::InputSection; From = lld::elf::InputSectionBase]: Assertion `isa<To>(Val) && "cast<Ty>() argument of incompatible type!"' failed.
PLEASE submit a bug report to https://bugs.llvm.org/ and include the crash backtrace.
Stack dump:
0.      Program arguments: ../build/bin/ld.lld @response.txt
 #0 0x000055ef8070ae7b llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) (../build/bin/ld.lld+0x132ae7b)
 #1 0x000055ef80707f76 SignalHandler(int, siginfo_t*, void*) Signals.cpp:0:0
 #2 0x00007f356d359df0 (/lib/x86_64-linux-gnu/libc.so.6+0x3fdf0)
 #3 0x00007f356d3ae95c __pthread_kill_implementation ./nptl/pthread_kill.c:44:76
 #4 0x00007f356d359cc2 raise ./signal/../sysdeps/posix/raise.c:27:6
 #5 0x00007f356d3424ac abort ./stdlib/abort.c:81:3
 #6 0x00007f356d342420 __assert_perror_fail ./assert/assert-perr.c:31:1
 #7 0x000055ef80042c1d (anonymous namespace)::X86_64::applyBranchToBranchOpt() const (.cold) X86_64.cpp:0:0
 #8 0x000055ef80a287e4 (anonymous namespace)::Writer<llvm::object::ELFType<(llvm::endianness)1, true>>::finalizeSections() Writer.cpp:0:0
 #9 0x000055ef80a2a3f0 void lld::elf::writeResult<llvm::object::ELFType<(llvm::endianness)1, true>>(lld::elf::Ctx&) (../build/bin/ld.lld+0x164a3f0)
#10 0x000055ef80884a35 void lld::elf::LinkerDriver::link<llvm::object::ELFType<(llvm::endianness)1, true>>(llvm::opt::InputArgList&) (../build/bin/ld.lld+0x14a4a35)
#11 0x000055ef8088bfe2 lld::elf::LinkerDriver::linkerMain(llvm::ArrayRef<char const*>) (../build/bin/ld.lld+0x14abfe2)
#12 0x000055ef8088c8ff lld::elf::link(llvm::ArrayRef<char const*>, llvm::raw_ostream&, llvm::raw_ostream&, bool, bool) (../build/bin/ld.lld+0x14ac8ff)
#13 0x000055ef8074349a lld::unsafeLldMain(llvm::ArrayRef<char const*>, llvm::raw_ostream&, llvm::raw_ostream&, llvm::ArrayRef<lld::DriverDef>, bool) (../build/bin/ld.lld+0x136349a)
#14 0x000055ef80681409 lld_main(int, char**, llvm::ToolContext const&) (../build/bin/ld.lld+0x12a1409)
#15 0x000055ef80532643 main (../build/bin/ld.lld+0x1152643)
#16 0x00007f356d343ca8 __libc_start_call_main ./csu/../sysdeps/nptl/libc_start_call_main.h:74:3
#17 0x00007f356d343d65 call_init ./csu/../csu/libc-start.c:128:20
#18 0x00007f356d343d65 __libc_start_main ./csu/../csu/libc-start.c:347:5
#19 0x000055ef80680ce1 _start (../build/bin/ld.lld+0x12a0ce1)
Aborted (core dumped)

I'll revert this for now.

zmodem added a commit that referenced this pull request Jun 23, 2025
This caused assertion failures in applyBranchToBranchOpt():

  llvm/include/llvm/Support/Casting.h:578:
  decltype(auto) llvm::cast(From*)
  [with To = lld::elf::InputSection; From = lld::elf::InputSectionBase]:
  Assertion `isa<To>(Val) && "cast<Ty>() argument of incompatible type!"' failed.

See comment on the PR (#138366)

This reverts commit 491b82a.

This also reverts the follow-up "[lld] Use llvm::partition_point (NFC) (#145209)"

This reverts commit 2ac293f.
llvm-sync bot pushed a commit to arm/arm-toolchain that referenced this pull request Jun 23, 2025
miguelcsx pushed a commit to miguelcsx/llvm-project that referenced this pull request Jun 23, 2025
Jaddyen pushed a commit to Jaddyen/llvm-project that referenced this pull request Jun 23, 2025
Jaddyen pushed a commit to Jaddyen/llvm-project that referenced this pull request Jun 23, 2025
@pcc
Contributor Author

pcc commented Jun 24, 2025

We're hitting an assert after this. I've attached a reproducer at https://crbug.com/427013249#comment3

I'll revert this for now.

Thanks, sent #145579 with the fix.

anthonyhatran pushed a commit to anthonyhatran/llvm-project that referenced this pull request Jun 26, 2025