Commit 491b82a

ELF: Add branch-to-branch optimization.
When code calls a function which then immediately tail calls another function there is no need to go via the intermediate function. By branching directly to the target function we reduce the program's working set for a slight increase in runtime performance.

Normally it is relatively uncommon to have functions that just tail call another function, but with LLVM control flow integrity we have jump tables that replace the function itself as the canonical address. As a result, when a function address is taken and called directly, for example after a compiler optimization resolves the indirect call, or if code built without control flow integrity calls the function, the call will go via the jump table.

The impact of this optimization was measured using a large internal Google benchmark. The results were as follows:

CFI enabled: +0.1% ± 0.05% queries per second
CFI disabled: +0.01% queries per second [not statistically significant]

The optimization is enabled by default at -O2 but may also be enabled or disabled individually with --{,no-}branch-to-branch.

This optimization is implemented for AArch64 and X86_64 only.

lld's runtime performance (real execution time) after adding this optimization was measured using firefox-x64 from lld-speed-test [1] with ldflags "-O2 -S" on an Apple M2 Ultra. The results are as follows:

```
    N           Min           Max        Median           Avg        Stddev
x 512     1.2264546     1.3481076     1.2970261     1.2965788   0.018620888
+ 512     1.2561196     1.3839965     1.3214632     1.3209327   0.019443971
Difference at 95.0% confidence
        0.0243538 +/- 0.00233202
        1.87831% +/- 0.179859%
        (Student's t, pooled s = 0.0190369)
```

[1] https://discourse.llvm.org/t/improving-the-reproducibility-of-linker-benchmarking/86057

Pull Request: #138366
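The link-time figures in the commit message can be checked by hand. The sketch below recomputes the pooled standard deviation and the confidence half-width from the two table rows. The struct and function names are made up for illustration, and the critical value 1.960 is an assumption (a large-sample approximation) that happens to reproduce the reported interval; the commit message does not state which value was used.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Summary of one row of the table: sample count, mean and standard deviation.
struct Sample {
  std::size_t n;
  double mean;
  double stddev;
};

// Pooled standard deviation of two independent samples.
double pooledStddev(const Sample &a, const Sample &b) {
  double num =
      (a.n - 1) * a.stddev * a.stddev + (b.n - 1) * b.stddev * b.stddev;
  return std::sqrt(num / (a.n + b.n - 2));
}

// Half-width of the confidence interval for the difference of the means.
// `critical` is the Student's t critical value (assumed by the caller).
double halfWidth(const Sample &a, const Sample &b, double critical) {
  return critical * pooledStddev(a, b) * std::sqrt(1.0 / a.n + 1.0 / b.n);
}

// Values copied from the "x" and "+" rows of the table.
const Sample rowX{512, 1.2965788, 0.018620888};
const Sample rowPlus{512, 1.3209327, 0.019443971};
```

With these inputs, `pooledStddev(rowX, rowPlus)` comes out at roughly 0.019037 and `halfWidth(rowX, rowPlus, 1.960)` at roughly 0.002332, in line with the `pooled s` and `+/-` figures above. Dividing the difference of the means by `rowX.mean` gives the relative slowdown of about 1.88%.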
1 parent 3b9795b commit 491b82a

13 files changed, +464 -6 lines changed

lld/ELF/Arch/AArch64.cpp

Lines changed: 59 additions & 0 deletions
@@ -11,6 +11,7 @@
 #include "Symbols.h"
 #include "SyntheticSections.h"
 #include "Target.h"
+#include "TargetImpl.h"
 #include "llvm/BinaryFormat/ELF.h"
 #include "llvm/Support/Endian.h"
 
@@ -82,6 +83,7 @@ class AArch64 : public TargetInfo {
                 uint64_t val) const override;
   RelExpr adjustTlsExpr(RelType type, RelExpr expr) const override;
   void relocateAlloc(InputSectionBase &sec, uint8_t *buf) const override;
+  void applyBranchToBranchOpt() const override;
 
 private:
   void relaxTlsGdToLe(uint8_t *loc, const Relocation &rel, uint64_t val) const;
@@ -974,6 +976,63 @@ void AArch64::relocateAlloc(InputSectionBase &sec, uint8_t *buf) const {
   }
 }
 
+static std::optional<uint64_t> getControlTransferAddend(InputSection &is,
+                                                        Relocation &r) {
+  // Identify a control transfer relocation for the branch-to-branch
+  // optimization. A "control transfer relocation" means a B or BL
+  // target but it also includes relative vtable relocations for example.
+  //
+  // We require the relocation type to be JUMP26, CALL26 or PLT32. With a
+  // relocation type of PLT32 the value may be assumed to be used for branching
+  // directly to the symbol and the addend is only used to produce the relocated
+  // value (hence the effective addend is always 0). This is because if a PLT is
+  // needed the addend will be added to the address of the PLT, and it doesn't
+  // make sense to branch into the middle of a PLT. For example, relative vtable
+  // relocations use PLT32 and 0 or a positive value as the addend but still are
+  // used to branch to the symbol.
+  //
+  // With JUMP26 or CALL26 the only reasonable interpretation of a non-zero
+  // addend is that we are branching to symbol+addend so that becomes the
+  // effective addend.
+  if (r.type == R_AARCH64_PLT32)
+    return 0;
+  if (r.type == R_AARCH64_JUMP26 || r.type == R_AARCH64_CALL26)
+    return r.addend;
+  return std::nullopt;
+}
+
+static std::pair<Relocation *, uint64_t>
+getBranchInfoAtTarget(InputSection &is, uint64_t offset) {
+  auto *i =
+      std::partition_point(is.relocations.begin(), is.relocations.end(),
+                           [&](Relocation &r) { return r.offset < offset; });
+  if (i != is.relocations.end() && i->offset == offset &&
+      i->type == R_AARCH64_JUMP26) {
+    return {i, i->addend};
+  }
+  return {nullptr, 0};
+}
+
+static void redirectControlTransferRelocations(Relocation &r1,
+                                               const Relocation &r2) {
+  r1.expr = r2.expr;
+  r1.sym = r2.sym;
+  // With PLT32 we must respect the original addend as that affects the value's
+  // interpretation. With the other relocation types the original addend is
+  // irrelevant because it referred to an offset within the original target
+  // section so we overwrite it.
+  if (r1.type == R_AARCH64_PLT32)
+    r1.addend += r2.addend;
+  else
+    r1.addend = r2.addend;
+}
+
+void AArch64::applyBranchToBranchOpt() const {
+  applyBranchToBranchOptImpl(ctx, getControlTransferAddend,
+                             getBranchInfoAtTarget,
+                             redirectControlTransferRelocations);
+}
+
 // AArch64 may use security features in variant PLT sequences. These are:
 // Pointer Authentication (PAC), introduced in armv8.3-a and Branch Target
 // Indicator (BTI) introduced in armv8.5-a. The additional instructions used
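To make the AArch64 addend rules easier to follow, here is a minimal standalone sketch of the same decisions. `RelType`, `Reloc`, `effectiveAddend` and `redirect` are hypothetical stand-ins, not lld's types. The toy uses signed addends and an integer symbol index for readability, and it omits the `expr` field that the real code also copies.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical stand-ins for lld's relocation types and Relocation struct.
enum class RelType { Plt32, Jump26, Call26, Abs64 };

struct Reloc {
  RelType type;
  int symbol;     // index of the referenced symbol in a toy symbol table
  int64_t addend;
};

// Same rule as getControlTransferAddend() in the patch: PLT32 always branches
// to the symbol itself, JUMP26/CALL26 branch to symbol+addend, and any other
// type is not treated as a control transfer.
std::optional<int64_t> effectiveAddend(const Reloc &r) {
  if (r.type == RelType::Plt32)
    return 0;
  if (r.type == RelType::Jump26 || r.type == RelType::Call26)
    return r.addend;
  return std::nullopt;
}

// Same rule as redirectControlTransferRelocations(): retarget r1 at whatever
// r2 branches to.
void redirect(Reloc &r1, const Reloc &r2) {
  r1.symbol = r2.symbol;
  if (r1.type == RelType::Plt32)
    r1.addend += r2.addend; // the PLT32 addend shapes the relocated value
  else
    r1.addend = r2.addend;  // old addend was an offset into the old target
}
```

For example, redirecting a `Call26` relocation with addend 0 through a `Jump26` relocation to symbol 2 leaves it pointing at symbol 2 with addend 0, while a `Plt32` relocation with addend 12 keeps that 12 after being redirected.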

lld/ELF/Arch/TargetImpl.h

Lines changed: 93 additions & 0 deletions
@@ -0,0 +1,93 @@
+//===----------------------------------------------------------------------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+#ifndef LLD_ELF_ARCH_TARGETIMPL_H
+#define LLD_ELF_ARCH_TARGETIMPL_H
+
+#include "InputFiles.h"
+#include "InputSection.h"
+#include "Relocations.h"
+#include "Symbols.h"
+#include "llvm/BinaryFormat/ELF.h"
+
+namespace lld::elf {
+
+// getControlTransferAddend: If this relocation is used for control transfer
+// instructions (e.g. branch, branch-link or call) or code references (e.g.
+// virtual function pointers) and indicates an address-insignificant reference,
+// return the effective addend for the relocation, otherwise return
+// std::nullopt. The effective addend for a relocation is the addend that is
+// used to determine its branch destination.
+//
+// getBranchInfoAtTarget: If a control transfer relocation referring to
+// is+offset directly transfers control to a relocated branch instruction in the
+// specified section, return the relocation for the branch target as well as its
+// effective addend (see above). Otherwise return {nullptr, 0}.
+//
+// redirectControlTransferRelocations: Given r1, a relocation for which
+// getControlTransferAddend() returned a value, and r2, a relocation returned by
+// getBranchInfoAtTarget(), modify r1 so that it branches directly to the target
+// of r2.
+template <typename GetControlTransferAddend, typename GetBranchInfoAtTarget,
+          typename RedirectControlTransferRelocations>
+inline void applyBranchToBranchOptImpl(
+    Ctx &ctx, GetControlTransferAddend getControlTransferAddend,
+    GetBranchInfoAtTarget getBranchInfoAtTarget,
+    RedirectControlTransferRelocations redirectControlTransferRelocations) {
+  // Needs to run serially because it writes to the relocations array as well as
+  // reading relocations of other sections.
+  for (ELFFileBase *f : ctx.objectFiles) {
+    auto getRelocBranchInfo =
+        [&getBranchInfoAtTarget](
+            Relocation &r,
+            uint64_t addend) -> std::pair<Relocation *, uint64_t> {
+      auto *target = dyn_cast_or_null<Defined>(r.sym);
+      // We don't allow preemptible symbols or ifuncs (may go somewhere else),
+      // absolute symbols (runtime behavior unknown), non-executable or writable
+      // memory (ditto) or non-regular sections (no section data).
+      if (!target || target->isPreemptible || target->isGnuIFunc() ||
+          !target->section ||
+          !(target->section->flags & llvm::ELF::SHF_EXECINSTR) ||
+          (target->section->flags & llvm::ELF::SHF_WRITE) ||
+          target->section->kind() != SectionBase::Regular)
+        return {nullptr, 0};
+      return getBranchInfoAtTarget(*cast<InputSection>(target->section),
+                                   target->value + addend);
+    };
+    for (InputSectionBase *s : f->getSections()) {
+      if (!s)
+        continue;
+      for (Relocation &r : s->relocations) {
+        std::optional<uint64_t> addend =
+            getControlTransferAddend(*cast<InputSection>(s), r);
+        if (!addend)
+          continue;
+        std::pair<Relocation *, uint64_t> targetAndAddend =
+            getRelocBranchInfo(r, *addend);
+        if (!targetAndAddend.first)
+          continue;
+        // Avoid getting stuck in an infinite loop if we encounter a branch
+        // that (possibly indirectly) branches to itself. It is unlikely
+        // that more than 5 iterations will ever be needed in practice.
+        size_t iterations = 5;
+        while (iterations--) {
+          std::pair<Relocation *, uint64_t> nextTargetAndAddend =
+              getRelocBranchInfo(*targetAndAddend.first,
+                                 targetAndAddend.second);
+          if (!nextTargetAndAddend.first)
+            break;
+          targetAndAddend = nextTargetAndAddend;
+        }
+        redirectControlTransferRelocations(r, *targetAndAddend.first);
+      }
+    }
+  }
+}
+
+} // namespace lld::elf
+
+#endif
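The core of `applyBranchToBranchOptImpl` is a bounded chain walk: find the branch at the call target, then keep following while the next target is itself a branch, for at most five further hops. The sketch below models only that walk on a toy table where each function either tail calls another function or has a real body. `TailCalls` and `resolveBranchToBranch` are hypothetical names, and the symbol, section and addend checks of the real code are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

// Toy model: tailTarget[i] holds the function that function i immediately
// tail calls, or std::nullopt if function i has a real body.
using TailCalls = std::vector<std::optional<std::size_t>>;

// Returns the function a call to `callee` could be redirected to, or
// std::nullopt if `callee` is not a branch-only function.
std::optional<std::size_t> resolveBranchToBranch(const TailCalls &tailTarget,
                                                 std::size_t callee) {
  std::optional<std::size_t> target = tailTarget[callee];
  if (!target)
    return std::nullopt;
  // Bounded follow-up so that a branch which (possibly indirectly) targets
  // itself cannot loop forever. The bound of 5 matches the patch.
  std::size_t iterations = 5;
  while (iterations--) {
    std::optional<std::size_t> next = tailTarget[*target];
    if (!next)
      break;
    target = next;
  }
  return target;
}
```

With the table `{1, 2, nullopt, 3}`, a call to function 0 resolves to function 2 through the intermediate function 1, a call to function 2 is left alone, and the self-branching function 3 resolves to itself once the iteration budget is used up.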

lld/ELF/Arch/X86_64.cpp

Lines changed: 69 additions & 0 deletions
@@ -11,6 +11,7 @@
 #include "Symbols.h"
 #include "SyntheticSections.h"
 #include "Target.h"
+#include "TargetImpl.h"
 #include "llvm/BinaryFormat/ELF.h"
 #include "llvm/Support/Endian.h"
 #include "llvm/Support/MathExtras.h"
@@ -49,6 +50,7 @@ class X86_64 : public TargetInfo {
   bool deleteFallThruJmpInsn(InputSection &is, InputFile *file,
                              InputSection *nextIS) const override;
   bool relaxOnce(int pass) const override;
+  void applyBranchToBranchOpt() const override;
 
 private:
   void relaxTlsGdToLe(uint8_t *loc, const Relocation &rel, uint64_t val) const;
@@ -1161,6 +1163,73 @@ void X86_64::relocateAlloc(InputSectionBase &sec, uint8_t *buf) const {
   }
 }
 
+static std::optional<uint64_t> getControlTransferAddend(InputSection &is,
+                                                        Relocation &r) {
+  // Identify a control transfer relocation for the branch-to-branch
+  // optimization. A "control transfer relocation" usually means a CALL or JMP
+  // target but it also includes relative vtable relocations for example.
+  //
+  // We require the relocation type to be PLT32. With a relocation type of PLT32
+  // the value may be assumed to be used for branching directly to the symbol
+  // and the addend is only used to produce the relocated value (hence the
+  // effective addend is always 0). This is because if a PLT is needed the
+  // addend will be added to the address of the PLT, and it doesn't make sense
+  // to branch into the middle of a PLT. For example, relative vtable
+  // relocations use PLT32 and 0 or a positive value as the addend but still are
+  // used to branch to the symbol.
+  //
+  // STT_SECTION symbols are a special case on x86 because the LLVM assembler
+  // uses them for branches to local symbols which are assembled as referring to
+  // the section symbol with the addend equal to the symbol value - 4.
+  if (r.type == R_X86_64_PLT32) {
+    if (r.sym->isSection())
+      return r.addend + 4;
+    return 0;
+  }
+  return std::nullopt;
+}
+
+static std::pair<Relocation *, uint64_t>
+getBranchInfoAtTarget(InputSection &is, uint64_t offset) {
+  auto content = is.contentMaybeDecompress();
+  if (content.size() > offset && content[offset] == 0xe9) { // JMP immediate
+    auto *i = std::partition_point(
+        is.relocations.begin(), is.relocations.end(),
+        [&](Relocation &r) { return r.offset < offset + 1; });
+    // Unlike with getControlTransferAddend() it is valid to accept a PC32
+    // relocation here because we know that this is actually a JMP and not some
+    // other reference, so the interpretation is that we add 4 to the addend and
+    // use that as the effective addend.
+    if (i != is.relocations.end() && i->offset == offset + 1 &&
+        (i->type == R_X86_64_PC32 || i->type == R_X86_64_PLT32)) {
+      return {i, i->addend + 4};
+    }
+  }
+  return {nullptr, 0};
+}
+
+static void redirectControlTransferRelocations(Relocation &r1,
+                                               const Relocation &r2) {
+  // The isSection() check handles the STT_SECTION case described above.
+  // In that case the original addend is irrelevant because it referred to an
+  // offset within the original target section so we overwrite it.
+  //
+  // The +4 is here to compensate for r2.addend which will likely be -4,
+  // but may also be addend-4 in case of a PC32 branch to symbol+addend.
+  if (r1.sym->isSection())
+    r1.addend = r2.addend;
+  else
+    r1.addend += r2.addend + 4;
+  r1.expr = r2.expr;
+  r1.sym = r2.sym;
+}
+
+void X86_64::applyBranchToBranchOpt() const {
+  applyBranchToBranchOptImpl(ctx, getControlTransferAddend,
+                             getBranchInfoAtTarget,
+                             redirectControlTransferRelocations);
+}
+
 // If Intel Indirect Branch Tracking is enabled, we have to emit special PLT
 // entries containing endbr64 instructions. A PLT entry will be split into two
 // parts, one in .plt.sec (writePlt), and the other in .plt (writeIBTPlt).
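The `+4` adjustments on x86-64 follow from how a rel32 branch is encoded: the displacement is relative to the end of the 4-byte field, so a branch to `sym` normally carries an addend of -4. The sketch below works through that arithmetic on a simplified record. `Reloc`, `jumpEffectiveAddend` and `redirect` are hypothetical names, the symbol is a plain integer index, and the `expr` field is omitted.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified relocation record.
struct Reloc {
  bool symIsSection; // true if the referenced symbol is an STT_SECTION symbol
  int symbol;
  int64_t addend;
};

// Effective addend of a rel32 field that is known to belong to a JMP: the
// displacement is relative to the next instruction, 4 bytes past the start of
// the relocated field, so the branch lands on symbol + addend + 4.
int64_t jumpEffectiveAddend(const Reloc &r) { return r.addend + 4; }

// Same arithmetic as redirectControlTransferRelocations() in the patch.
void redirect(Reloc &r1, const Reloc &r2) {
  if (r1.symIsSection)
    r1.addend = r2.addend;      // old addend was an offset into the section
  else
    r1.addend += r2.addend + 4; // cancel r2's built-in -4, keep r1's own
  r1.symbol = r2.symbol;
  r1.symIsSection = r2.symIsSection;
}
```

A `call foo` (addend -4) whose target is `jmp bar` (addend -4) ends up as a call to `bar` with addend -4 + (-4) + 4 = -4. If the intermediate jump is `jmp bar+8` (addend 4), the redirected call gets addend 4, which again means `bar+8`.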

lld/ELF/Config.h

Lines changed: 1 addition & 0 deletions
@@ -302,6 +302,7 @@ struct Config {
   bool bpFunctionOrderForCompression = false;
   bool bpDataOrderForCompression = false;
   bool bpVerboseSectionOrderer = false;
+  bool branchToBranch = false;
   bool checkSections;
   bool checkDynamicRelocs;
   std::optional<llvm::DebugCompressionType> compressDebugSections;

lld/ELF/Driver.cpp

Lines changed: 2 additions & 0 deletions
@@ -1644,6 +1644,8 @@ static void readConfigs(Ctx &ctx, opt::InputArgList &args) {
   ctx.arg.zWxneeded = hasZOption(args, "wxneeded");
   setUnresolvedSymbolPolicy(ctx, args);
   ctx.arg.power10Stubs = args.getLastArgValue(OPT_power10_stubs_eq) != "no";
+  ctx.arg.branchToBranch = args.hasFlag(
+      OPT_branch_to_branch, OPT_no_branch_to_branch, ctx.arg.optimize >= 2);
 
   if (opt::Arg *arg = args.getLastArg(OPT_eb, OPT_el)) {
     if (arg->getOption().matches(OPT_eb))
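The `hasFlag` call gives the option a "last flag wins, otherwise fall back to the default" behavior, with the default derived from the `-O` level. A toy model of that rule is shown below; `resolveBranchToBranch` and its plain string matching are illustrative only and are not how lld parses options.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy equivalent of "the last of --branch-to-branch / --no-branch-to-branch
// wins, otherwise use the -O level default".
bool resolveBranchToBranch(const std::vector<std::string> &args,
                           int optLevel) {
  bool enabled = optLevel >= 2; // default: on at -O2, off at -O0 and -O1
  for (const std::string &arg : args) {
    if (arg == "--branch-to-branch")
      enabled = true;
    else if (arg == "--no-branch-to-branch")
      enabled = false;
  }
  return enabled;
}
```

Under this model, `resolveBranchToBranch({}, 2)` is true, `resolveBranchToBranch({"--branch-to-branch"}, 0)` is true, and passing both flags with the negative one last yields false regardless of the `-O` level.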

lld/ELF/InputSection.cpp

Lines changed: 3 additions & 2 deletions
@@ -430,8 +430,9 @@ InputSectionBase *InputSection::getRelocatedSection() const {
 
 template <class ELFT, class RelTy>
 void InputSection::copyRelocations(Ctx &ctx, uint8_t *buf) {
-  if (ctx.arg.relax && !ctx.arg.relocatable &&
-      (ctx.arg.emachine == EM_RISCV || ctx.arg.emachine == EM_LOONGARCH)) {
+  bool linkerRelax =
+      ctx.arg.relax && is_contained({EM_RISCV, EM_LOONGARCH}, ctx.arg.emachine);
+  if (!ctx.arg.relocatable && (linkerRelax || ctx.arg.branchToBranch)) {
     // On LoongArch and RISC-V, relaxation might change relocations: copy
     // from internal ones that are updated by relaxation.
     InputSectionBase *sec = getRelocatedSection();

lld/ELF/Options.td

Lines changed: 4 additions & 0 deletions
@@ -59,6 +59,10 @@ def build_id: J<"build-id=">, HelpText<"Generate build ID note">,
   MetaVarName<"[fast,md5,sha1,uuid,0x<hexstring>]">;
 def : F<"build-id">, Alias<build_id>, AliasArgs<["sha1"]>, HelpText<"Alias for --build-id=sha1">;
 
+defm branch_to_branch: BB<"branch-to-branch",
+    "Enable branch-to-branch optimization (default at -O2)",
+    "Disable branch-to-branch optimization (default at -O0 and -O1)">;
+
 defm check_sections: B<"check-sections",
     "Check section addresses for overlaps (default)",
     "Do not check section addresses for overlaps">;

lld/ELF/Relocations.cpp

Lines changed: 6 additions & 2 deletions
@@ -1665,9 +1665,10 @@ void RelocationScanner::scan(Relocs<RelTy> rels) {
   }
 
   // Sort relocations by offset for more efficient searching for
-  // R_RISCV_PCREL_HI20 and R_PPC64_ADDR64.
+  // R_RISCV_PCREL_HI20, R_PPC64_ADDR64 and the branch-to-branch optimization.
   if (ctx.arg.emachine == EM_RISCV ||
-      (ctx.arg.emachine == EM_PPC64 && sec->name == ".toc"))
+      (ctx.arg.emachine == EM_PPC64 && sec->name == ".toc") ||
+      ctx.arg.branchToBranch)
     llvm::stable_sort(sec->relocs(),
                       [](const Relocation &lhs, const Relocation &rhs) {
                         return lhs.offset < rhs.offset;
@@ -1958,6 +1959,9 @@ void elf::postScanRelocations(Ctx &ctx) {
   for (ELFFileBase *file : ctx.objectFiles)
     for (Symbol *sym : file->getLocalSymbols())
       fn(*sym);
+
+  if (ctx.arg.branchToBranch)
+    ctx.target->applyBranchToBranchOpt();
 }
 
 static bool mergeCmp(const InputSection *a, const InputSection *b) {
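Sorting relocations by offset is what lets the target hooks use `std::partition_point` to find the relocation at a given offset in logarithmic time. The following standalone sketch shows that pairing with a hypothetical `Reloc` record and helper names; it assumes nothing about lld beyond what the hunks above show.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

struct Reloc {
  uint64_t offset;
  int type; // hypothetical type code
};

// Sort by offset while keeping the original order of equal offsets.
void sortByOffset(std::vector<Reloc> &relocs) {
  std::stable_sort(
      relocs.begin(), relocs.end(),
      [](const Reloc &a, const Reloc &b) { return a.offset < b.offset; });
}

// Binary search for the relocation at exactly `offset`. This is only valid if
// `relocs` is already sorted by offset.
const Reloc *findRelocAt(const std::vector<Reloc> &relocs, uint64_t offset) {
  auto it = std::partition_point(
      relocs.begin(), relocs.end(),
      [&](const Reloc &r) { return r.offset < offset; });
  if (it != relocs.end() && it->offset == offset)
    return &*it;
  return nullptr;
}
```

After `sortByOffset`, `findRelocAt(relocs, 4)` returns the entry at offset 4 if one exists and `nullptr` otherwise. On an unsorted vector the precondition of `std::partition_point` is violated and the lookup may miss entries, which is why the sort condition is extended to cover `branchToBranch`.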

lld/ELF/Target.h

Lines changed: 1 addition & 0 deletions
@@ -101,6 +101,7 @@ class TargetInfo {
 
   virtual void applyJumpInstrMod(uint8_t *loc, JumpModType type,
                                  JumpModType val) const {}
+  virtual void applyBranchToBranchOpt() const {}
 
   virtual ~TargetInfo();

lld/docs/ReleaseNotes.rst

Lines changed: 4 additions & 0 deletions
@@ -62,6 +62,10 @@ ELF Improvements
   on executable sections.
   (`#128883 <https://github.com/llvm/llvm-project/pull/128883>`_)
 
+* For AArch64 and X86_64, added ``--branch-to-branch``, which rewrites branches
+  that point to another branch instruction to instead branch directly to the
+  target of the second instruction. Enabled by default at ``-O2``.
+
 Breaking changes
 ----------------
 * Executable-only and readable-executable sections are now allowed to be placed

lld/docs/ld.lld.1

Lines changed: 7 additions & 2 deletions
@@ -93,6 +93,11 @@ Bind default visibility defined STB_GLOBAL function symbols locally for
 .Fl shared.
 .It Fl -be8
 Write a Big Endian ELF File using BE8 format(AArch32 only)
+.It Fl -branch-to-branch
+Enable the branch-to-branch optimization: a branch whose target is
+another branch instruction is rewritten to point to the latter branch's
+target (AArch64 and X86_64 only). Enabled by default at
+.Fl O2 Ns .
 .It Fl -build-id Ns = Ns Ar value
 Generate a build ID note.
 .Ar value
@@ -414,7 +419,7 @@ If not specified,
 .Dv a.out
 is used as a default.
 .It Fl O Ns Ar value
-Optimize output file size.
+Optimize output file.
 .Ar value
 may be:
 .Pp
@@ -424,7 +429,7 @@ Disable string merging.
 .It Cm 1
 Enable string merging.
 .It Cm 2
-Enable string tail merging.
+Enable string tail merging and branch-to-branch optimization.
 .El
 .Pp
 .Fl O Ns Cm 1
