Commit f708c82
[X86] Relax existing instructions to reduce the number of nops needed for alignment purposes
If we have an explicit align directive, we currently default to emitting nops to fill the space. As discussed in the context of the prefix padding work for branch alignment (D72225), we're allowed to play other tricks, such as extending the size of previous instructions, instead. This patch converts near jumps to far jumps when doing so decreases the number of nop bytes needed for a following align. It does so as a post-pass after relaxation is complete, and it intentionally works without moving any labels or doing anything that might require another round of relaxation.

The point of this patch is mainly to mock out the approach. The optimization implemented is real, and possibly useful, but the main point is to demonstrate an approach for implementing such "pad previous instruction" schemes. The key notion in this patch is to treat padding previous instructions as an optional optimization, not as a core part of relaxation. The benefit is that we avoid the potential concern about increasing the distance between two labels and thus causing further, potentially non-local, code growth due to relaxation. The downside is that we may miss some opportunities to avoid nops.

For the moment, this patch only implements a small set of existing relaxations. Assuming the approach is satisfactory, I plan to extend it to a broader set of instructions where there are obvious "relaxations" which are roughly performance equivalent. Note that this patch does *not* change which instructions are relaxable. We may wish to explore that separately to increase optimization opportunity, but I figured that deserved its own discussion.

There are possible downsides to this optimization (and all "pad previous instruction" variants). The major two are potentially increasing instruction fetch and perturbing uop caching (i.e. the usual alignment risks). Specifically:
* If we pad an instruction such that it crosses a fetch window (16 bytes on modern X86-64), we may cause the decoder to trigger a fetch it wouldn't have otherwise. This can affect both decode speed and icache pressure.
* Intel's uop cache has particular restrictions on which instruction combinations can fit in a particular way. By moving instructions around, we can both cause new misses and turn existing misses into hits.

Many of the most painful cases are around branch density, so I don't expect this to be too bad. On the whole, I expect to see small swings (i.e. the typical alignment change problem), but nothing major or systematic in either direction.

Differential Revision: https://reviews.llvm.org/D75203
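As a minimal illustrative sketch (not taken from the patch or its tests; the offsets, byte values, and the label "target" are hypothetical), re-emitting the last jump before an align directive in its rel32 form can remove the nop padding entirely:

  # Before the post-pass: the 2-byte rel8 jump ends at 0xd, so 3 nop bytes
  # are needed to reach the 16-byte boundary required by .p2align 4.
  #    b: eb 03              jmp  target        # jmp rel8 (2 bytes)
  #    d: 0f 1f 00           nopl (%rax)        # 3 bytes of alignment padding
  #   10:                    target:
  #
  # After the post-pass: the same jump is emitted in its 5-byte rel32 form,
  # so the align directive is already satisfied and no nops are needed.
  #    b: e9 00 00 00 00     jmp  target        # jmp rel32 (5 bytes)
  #   10:                    target: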
1 parent 2b2a1a4 commit f708c82

File tree

3 files changed: +303, -0 lines

llvm/lib/Target/X86/MCTargetDesc/X86AsmBackend.cpp

Lines changed: 176 additions & 0 deletions
@@ -12,7 +12,9 @@
 #include "llvm/BinaryFormat/ELF.h"
 #include "llvm/BinaryFormat/MachO.h"
 #include "llvm/MC/MCAsmBackend.h"
+#include "llvm/MC/MCAsmLayout.h"
 #include "llvm/MC/MCAssembler.h"
+#include "llvm/MC/MCCodeEmitter.h"
 #include "llvm/MC/MCContext.h"
 #include "llvm/MC/MCDwarf.h"
 #include "llvm/MC/MCELFObjectWriter.h"
@@ -103,6 +105,14 @@ cl::opt<bool> X86AlignBranchWithin32BBoundaries(
         "assumptions about labels corresponding to particular instructions, "
         "and should be used with caution."));
 
+cl::opt<bool> X86PadForAlign(
+    "x86-pad-for-align", cl::init(true), cl::Hidden,
+    cl::desc("Pad previous instructions to implement align directives"));
+
+cl::opt<bool> X86PadForBranchAlign(
+    "x86-pad-for-branch-align", cl::init(true), cl::Hidden,
+    cl::desc("Pad previous instructions to implement branch alignment"));
+
 class X86ELFObjectWriter : public MCELFObjectTargetWriter {
 public:
   X86ELFObjectWriter(bool is64Bit, uint8_t OSABI, uint16_t EMachine,
@@ -173,6 +183,10 @@ class X86AsmBackend : public MCAsmBackend {
   void relaxInstruction(const MCInst &Inst, const MCSubtargetInfo &STI,
                         MCInst &Res) const override;
 
+  bool padInstructionEncoding(MCRelaxableFragment &RF, MCCodeEmitter &Emitter,
+                              unsigned &RemainingSize) const;
+  void finishLayout(MCAssembler const &Asm, MCAsmLayout &Layout) const override;
+
   bool writeNopData(raw_ostream &OS, uint64_t Count) const override;
 };
 } // end anonymous namespace
@@ -639,6 +653,168 @@ void X86AsmBackend::relaxInstruction(const MCInst &Inst,
   Res.setOpcode(RelaxedOp);
 }
 
+static bool canBeRelaxedForPadding(const MCRelaxableFragment &RF) {
+  // TODO: There are lots of other tricks we could apply for increasing
+  // encoding size without impacting performance.
+  auto &Inst = RF.getInst();
+  auto &STI = *RF.getSubtargetInfo();
+  bool is16BitMode = STI.getFeatureBits()[X86::Mode16Bit];
+  return getRelaxedOpcode(Inst, is16BitMode) != Inst.getOpcode();
+}
+
+bool X86AsmBackend::padInstructionEncoding(MCRelaxableFragment &RF,
+                                           MCCodeEmitter &Emitter,
+                                           unsigned &RemainingSize) const {
+  if (!canBeRelaxedForPadding(RF))
+    return false;
+
+  MCInst Relaxed;
+  relaxInstruction(RF.getInst(), *RF.getSubtargetInfo(), Relaxed);
+
+  SmallVector<MCFixup, 4> Fixups;
+  SmallString<15> Code;
+  raw_svector_ostream VecOS(Code);
+  Emitter.encodeInstruction(Relaxed, VecOS, Fixups, *RF.getSubtargetInfo());
+  const unsigned OldSize = RF.getContents().size();
+  const unsigned NewSize = Code.size();
+  assert(NewSize >= OldSize && "size decrease during relaxation?");
+  unsigned Delta = NewSize - OldSize;
+  if (Delta > RemainingSize)
+    return false;
+  RF.setInst(Relaxed);
+  RF.getContents() = Code;
+  RF.getFixups() = Fixups;
+  RemainingSize -= Delta;
+  return true;
+}
+
+void X86AsmBackend::finishLayout(MCAssembler const &Asm,
+                                 MCAsmLayout &Layout) const {
+  // See if we can further relax some instructions to cut down on the number of
+  // nop bytes required for code alignment. The actual win is in reducing
+  // instruction count, not number of bytes. Modern X86-64 can easily end up
+  // decode limited. It is often better to reduce the number of instructions
+  // (i.e. eliminate nops) even at the cost of increasing the size and
+  // complexity of others.
+  if (!X86PadForAlign && !X86PadForBranchAlign)
+    return;
+
+  DenseSet<MCFragment *> LabeledFragments;
+  for (const MCSymbol &S : Asm.symbols())
+    LabeledFragments.insert(S.getFragment(false));
+
+  for (MCSection &Sec : Asm) {
+    if (!Sec.getKind().isText())
+      continue;
+
+    SmallVector<MCRelaxableFragment *, 4> Relaxable;
+    for (MCSection::iterator I = Sec.begin(), IE = Sec.end(); I != IE; ++I) {
+      MCFragment &F = *I;
+
+      if (LabeledFragments.count(&F))
+        Relaxable.clear();
+
+      if (F.getKind() == MCFragment::FT_Data ||
+          F.getKind() == MCFragment::FT_CompactEncodedInst)
+        // Skip and ignore
+        continue;
+
+      if (F.getKind() == MCFragment::FT_Relaxable) {
+        auto &RF = cast<MCRelaxableFragment>(*I);
+        Relaxable.push_back(&RF);
+        continue;
+      }
+
+      auto canHandle = [](MCFragment &F) -> bool {
+        switch (F.getKind()) {
+        default:
+          return false;
+        case MCFragment::FT_Align:
+          return X86PadForAlign;
+        case MCFragment::FT_BoundaryAlign:
+          return X86PadForBranchAlign;
+        }
+      };
+      // For any unhandled kind, assume we can't change layout.
+      if (!canHandle(F)) {
+        Relaxable.clear();
+        continue;
+      }
+
+      const uint64_t OrigOffset = Layout.getFragmentOffset(&F);
+      const uint64_t OrigSize = Asm.computeFragmentSize(Layout, F);
+      if (OrigSize == 0 || Relaxable.empty()) {
+        Relaxable.clear();
+        continue;
+      }
+
+      // To keep the effects local, prefer to relax instructions closest to
+      // the align directive. This is purely about human understandability
+      // of the resulting code. If we later find a reason to expand
+      // particular instructions over others, we can adjust.
+      MCFragment *FirstChangedFragment = nullptr;
+      unsigned RemainingSize = OrigSize;
+      while (!Relaxable.empty() && RemainingSize != 0) {
+        auto &RF = *Relaxable.pop_back_val();
+        // Give the backend a chance to play any tricks it wishes to increase
+        // the encoding size of the given instruction. Target independent code
+        // will try further relaxation, but targets may play further tricks.
+        if (padInstructionEncoding(RF, Asm.getEmitter(), RemainingSize))
+          FirstChangedFragment = &RF;
+
+        // If we have an instruction which hasn't been fully relaxed, we can't
+        // skip past it and insert bytes before it. Changing its starting
+        // offset might require a larger negative offset than it can encode.
+        // We don't need to worry about larger positive offsets as none of the
+        // possible offsets between this and our align are visible, and the
+        // ones afterwards aren't changing.
+        if (mayNeedRelaxation(RF.getInst(), *RF.getSubtargetInfo()))
+          break;
+      }
+      Relaxable.clear();
+
+      if (FirstChangedFragment) {
+        // Make sure the offsets for any fragments in the affected range get
+        // updated. Note that this (conservatively) invalidates the offsets of
+        // those following, but this is not required.
+        Layout.invalidateFragmentsFrom(FirstChangedFragment);
+      }
+
+      // BoundaryAlign explicitly tracks its size (unlike align)
+      if (F.getKind() == MCFragment::FT_BoundaryAlign)
+        cast<MCBoundaryAlignFragment>(F).setSize(RemainingSize);
+
+      const uint64_t FinalOffset = Layout.getFragmentOffset(&F);
+      const uint64_t FinalSize = Asm.computeFragmentSize(Layout, F);
+      assert(OrigOffset + OrigSize == FinalOffset + FinalSize &&
+             "can't move start of next fragment!");
+      assert(FinalSize == RemainingSize && "inconsistent size computation?");
+
+      // If we're looking at a boundary align, make sure we don't try to pad
+      // its target instructions for some following directive. Doing so would
+      // break the alignment of the current boundary align.
+      if (F.getKind() == MCFragment::FT_BoundaryAlign) {
+        auto &BF = cast<MCBoundaryAlignFragment>(F);
+        const MCFragment *F = BF.getNextNode();
+        // If the branch is unfused, it is emitted into one fragment, otherwise
+        // it is emitted into two fragments at most; the next
+        // MCBoundaryAlignFragment (if it exists) also marks the end of the branch.
+        for (int i = 0, N = BF.isFused() ? 2 : 1;
+             i != N && !isa<MCBoundaryAlignFragment>(F);
+             ++i, F = F->getNextNode(), I++) {
+        }
+      }
+    }
+  }
+
+  // The layout is done. Mark every fragment as valid.
+  for (unsigned int i = 0, n = Layout.getSectionOrder().size(); i != n; ++i) {
+    MCSection &Section = *Layout.getSectionOrder()[i];
+    Layout.getFragmentOffset(&*Section.getFragmentList().rbegin());
+    Asm.computeFragmentSize(Layout, *Section.getFragmentList().rbegin());
+  }
+}
+
 /// Write a sequence of optimal nops to the output, covering \p Count
 /// bytes.
 /// \return - true on success, false on failure

llvm/test/MC/X86/align-branch-64.s

Lines changed: 53 additions & 0 deletions
@@ -103,6 +103,59 @@ test_indirect:
 bar:
   retq
 
+# CHECK: test_pad_via_relax:
+# CHECK: 200: testq
+# CHECK: 203: jne
+# CHECK: 209: int3
+# note 6 byte jne which could be a 2 byte jne, but is instead
+# expanded for padding purposes
+# CHECK-NOT: nop
+# CHECK: 220: callq
+  .global test_pad_via_relax
+  .p2align 5
+test_pad_via_relax:
+  testq %rax, %rax
+  jnz bar
+  .rept 23
+  int3
+  .endr
+  callq bar
+
+# This case looks really tempting to pad, but doing so for the call causes
+# the jmp to be misaligned.
+# CHECK: test_pad_via_relax_neg1:
+# CHECK: 240: int3
+# CHECK: 25a: testq
+# CHECK: 25d: jne
+# CHECK: 25f: nop
+# CHECK: 260: callq
+  .global test_pad_via_relax_neg1
+  .p2align 5
+test_pad_via_relax_neg1:
+  .rept 26
+  int3
+  .endr
+  testq %rax, %rax
+  jnz bar
+  callq bar
+
+# Same as previous, but without fusion
+# CHECK: test_pad_via_relax_neg2:
+# CHECK: 280: int3
+# CHECK: 29d: jmp
+# CHECK: 29f: nop
+# CHECK: 2a0: callq
+  .global test_pad_via_relax_neg2
+  .p2align 5
+test_pad_via_relax_neg2:
+  .rept 29
+  int3
+  .endr
+  jmp bar2
+  callq bar2
+
+bar2:
+
   .section "unknown"
   .p2align 4
   .type baz,@function
Lines changed: 74 additions & 0 deletions
@@ -0,0 +1,74 @@
+# RUN: llvm-mc -mcpu=skylake -filetype=obj -triple x86_64-pc-linux-gnu %s | llvm-objdump -d --section=.text - | FileCheck %s
+
+
+  .file "test.c"
+  .text
+  .section .text
+# Demonstrate that we can relax instructions to provide padding, not
+# just insert nops. jmps are being used for ease of demonstration.
+# CHECK: .text
+# CHECK: 0: eb 1f                            jmp 31 <foo>
+# CHECK: 2: e9 1a 00 00 00                   jmp 26 <foo>
+# CHECK: 7: e9 15 00 00 00                   jmp 21 <foo>
+# CHECK: c: e9 10 00 00 00                   jmp 16 <foo>
+# CHECK: 11: e9 0b 00 00 00                  jmp 11 <foo>
+# CHECK: 16: e9 06 00 00 00                  jmp 6 <foo>
+# CHECK: 1b: e9 01 00 00 00                  jmp 1 <foo>
+# CHECK: 20: cc                              int3
+  .p2align 4
+  jmp foo
+  jmp foo
+  jmp foo
+  jmp foo
+  jmp foo
+  jmp foo
+  jmp foo
+  .p2align 5
+  int3
+foo:
+  ret
+
+# Check that we're not shifting around the offsets of labels - doing
+# that would require a further round of relaxation
+# CHECK: bar:
+# CHECK: 22: eb fe                           jmp -2 <bar>
+# CHECK: 24: 66 2e 0f 1f 84 00 00 00 00 00   nopw %cs:(%rax,%rax)
+# CHECK: 2e: 66 90                           nop
+# CHECK: 30: 0f 0b                           ud2
+
+bar:
+  jmp bar
+nobypass:
+  .p2align 4
+  ud2
+
+
+# Canonical toy loop to show benefit - we can align the loop header with
+# fewer nops by relaxing the branch, even though we don't need to
+# CHECK: loop_preheader:
+# CHECK: 45: 48 85 c0                        testq %rax, %rax
+# CHECK: 48: 0f 8e 22 00 00 00               jle 34 <loop_exit>
+# CHECK: 4e: 66 2e 0f 1f 84 00 00 00 00 00   nopw %cs:(%rax,%rax)
+# CHECK: 58: 0f 1f 84 00 00 00 00 00         nopl (%rax,%rax)
+# CHECK: loop_header:
+# CHECK: 60: 48 83 e8 01                     subq $1, %rax
+# CHECK: 64: 48 85 c0                        testq %rax, %rax
+# CHECK: 67: 7e 07                           jle 7 <loop_exit>
+# CHECK: 69: e9 f2 ff ff ff                  jmp -14 <loop_header>
+# CHECK: 6e: 66 90                           nop
+# CHECK: loop_exit:
+# CHECK: 70: c3                              retq
+  .p2align 5
+  .skip 5
+loop_preheader:
+  testq %rax, %rax
+  jle loop_exit
+  .p2align 5
+loop_header:
+  subq $1, %rax
+  testq %rax, %rax
+  jle loop_exit
+  jmp loop_header
+  .p2align 4
+loop_exit:
+  ret