Commit 78eaff2

[llvm-exegesis] Loop unrolling for loop snippet repetitor mode
I really needed this, like, factually, yesterday, when verifying dependency breaking idioms for the AMD Zen 3 scheduler model.

Consider the following example:

```
$ ./bin/llvm-exegesis --mode=inverse_throughput --snippets-file=/tmp/snippet.s --num-repetitions=1000000 --repetition-mode=duplicate
Check generated assembly with: /usr/bin/objdump -d /tmp/snippet-4a7e50.o
---
mode: inverse_throughput
key:
  instructions:
    - 'VPXORYrr YMM0 YMM0 YMM0'
  config: ''
  register_initial_values: []
cpu_name: znver3
llvm_triple: x86_64-unknown-linux-gnu
num_repetitions: 1000000
measurements:
  - { key: inverse_throughput, value: 0.31025, per_snippet_value: 0.31025 }
error: ''
info: ''
assembled_snippet: C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C3
...
```

What does it tell us? So wait, it can only execute ~3 x86 AVX YMM PXOR zero-idioms per cycle? That doesn't seem right. That's even less than there are pipes supporting this type of op.

Now, second example:

```
$ ./bin/llvm-exegesis --mode=inverse_throughput --snippets-file=/tmp/snippet.s --num-repetitions=1000000 --repetition-mode=loop
Check generated assembly with: /usr/bin/objdump -d /tmp/snippet-2418b5.o
---
mode: inverse_throughput
key:
  instructions:
    - 'VPXORYrr YMM0 YMM0 YMM0'
  config: ''
  register_initial_values: []
cpu_name: znver3
llvm_triple: x86_64-unknown-linux-gnu
num_repetitions: 1000000
measurements:
  - { key: inverse_throughput, value: 1.00011, per_snippet_value: 1.00011 }
error: ''
info: ''
assembled_snippet: 49B80800000000000000C5FDEFC0C5FDEFC04983C0FF75F2C3
...
```

Now that's just worse. Due to the looping, the throughput completely plummeted, and now we can only do a single instruction/cycle!? That's not great.

And final example:

```
$ ./bin/llvm-exegesis --mode=inverse_throughput --snippets-file=/tmp/snippet.s --num-repetitions=1000000 --repetition-mode=loop --loop-body-size=1000
Check generated assembly with: /usr/bin/objdump -d /tmp/snippet-c402e2.o
---
mode: inverse_throughput
key:
  instructions:
    - 'VPXORYrr YMM0 YMM0 YMM0'
  config: ''
  register_initial_values: []
cpu_name: znver3
llvm_triple: x86_64-unknown-linux-gnu
num_repetitions: 1000000
measurements:
  - { key: inverse_throughput, value: 0.167087, per_snippet_value: 0.167087 }
error: ''
info: ''
assembled_snippet: 49B80800000000000000C5FDEFC0C5FDEFC04983C0FF75F2C3
...
```

So if we merge the previous two approaches, duplicate this single-instruction snippet 1000x (loop-body-size/instruction count in snippet), and run a loop with 1000 iterations over that duplicated/unrolled snippet, the measured throughput goes through the roof, up to 5.9 instructions/cycle, which finally tells us that this idiom is zero-cycle!

Reviewed By: courbet

Differential Revision: https://reviews.llvm.org/D102522
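The per-cycle figures quoted in the message above are the reciprocals of the reported `inverse_throughput` values. A minimal sketch of that arithmetic follows; the helper name `instructionsPerCycle` is hypothetical and not part of llvm-exegesis.

```cpp
#include <cassert>

// Hypothetical helper, not part of llvm-exegesis: converts a measured
// inverse throughput (cycles per instruction) into instructions per cycle.
double instructionsPerCycle(double InverseThroughput) {
  assert(InverseThroughput > 0.0 && "inverse throughput must be positive");
  return 1.0 / InverseThroughput;
}
```

Applied to the three runs, 0.31025 gives roughly 3.2 instructions per cycle, 1.00011 gives roughly 1.0, and 0.167087 gives roughly 5.98.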
1 parent 44843e2 commit 78eaff2

8 files changed, +69 -27 lines changed
llvm/docs/CommandGuide/llvm-exegesis.rst

Lines changed: 23 additions & 9 deletions
@@ -189,7 +189,8 @@ OPTIONS
189189

190190
`latency` mode can make use of either RDTSC or LBR.
191191
`latency[LBR]` is only available on X86 (at least `Skylake`).
192-
To run in `latency` mode, a positive value must be specified for `x86-lbr-sample-period` and `--repetition-mode=loop`.
192+
To run in `latency` mode, a positive value must be specified
193+
for `x86-lbr-sample-period` and `--repetition-mode=loop`.
193194

194195
In `analysis` mode, you also need to specify at least one of the
195196
`-analysis-clusters-output-file=` and `-analysis-inconsistencies-output-file=`.
@@ -202,23 +203,36 @@ OPTIONS
202203
On choosing the "right" sampling period, a small value is preferred, but throttling
203204
could occur if the sampling is too frequent. A prime number should be used to
204205
avoid consistently skipping certain blocks.
205-
206+
206207
.. option:: -repetition-mode=[duplicate|loop|min]
207208

208209
Specify the repetition mode. `duplicate` will create a large, straight line
209-
basic block with `num-repetitions` copies of the snippet. `loop` will wrap
210-
the snippet in a loop which will be run `num-repetitions` times. The `loop`
211-
mode tends to better hide the effects of the CPU frontend on architectures
210+
basic block with `num-repetitions` instructions (repeating the snippet
211+
`num-repetitions`/`snippet size` times). `loop` will, optionally, duplicate the
212+
snippet until the loop body contains at least `loop-body-size` instructions,
213+
and then wrap the result in a loop which will execute `num-repetitions`
214+
instructions (thus, again, repeating the snippet
215+
`num-repetitions`/`snippet size` times). The `loop` mode, especially with loop
216+
unrolling, tends to better hide the effects of the CPU frontend on architectures
212217
that cache decoded instructions, but consumes a register for counting
213-
iterations. If performing an analysis over many opcodes, it may be best
214-
to instead use the `min` mode, which will run each other mode, and produce
215-
the minimal measured result.
218+
iterations. If performing an analysis over many opcodes, it may be best to
219+
instead use the `min` mode, which will run each other mode,
220+
and produce the minimal measured result.
216221

217222
.. option:: -num-repetitions=<Number of repetitions>
218223

219-
Specify the number of repetitions of the asm snippet.
224+
Specify the target number of executed instructions. Note that the actual
225+
repetition count of the snippet will be `num-repetitions`/`snippet size`.
220226
Higher values lead to more accurate measurements but lengthen the benchmark.
221227

228+
.. option:: -loop-body-size=<Preferred loop body size>
229+
230+
Only effective for `-repetition-mode=[loop|min]`.
231+
Instead of looping over the snippet directly, first duplicate it so that the
232+
loop body contains at least this many instructions. This potentially results
233+
in the loop body being cached in the CPU Op Cache / Loop Cache,
234+
which may have higher throughput than the CPU decoders.
235+
222236
.. option:: -max-configs-per-opcode=<value>
223237

224238
Specify the maximum configurations that can be generated for each opcode.
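
The documentation change above redefines `-num-repetitions` as a target instruction count, so the snippet itself is repeated `num-repetitions`/`snippet size` times. A minimal sketch of that relation follows; the helper name `snippetRepetitions` is hypothetical, and rounding up is an assumption based on the "at least `MinInstructions`" wording in the source comments.

```cpp
#include <cassert>

// Hypothetical helper, not part of llvm-exegesis: number of times a snippet
// of SnippetSize instructions is repeated so that at least NumRepetitions
// instructions execute. Rounding up is an assumption (see lead-in).
unsigned snippetRepetitions(unsigned NumRepetitions, unsigned SnippetSize) {
  assert(SnippetSize > 0 && "snippet must contain at least one instruction");
  // Ceiling division written so that it cannot overflow.
  return NumRepetitions / SnippetSize + (NumRepetitions % SnippetSize != 0);
}
```

For the single-instruction snippet in the commit message, `snippetRepetitions(1000000, 1)` is 1000000. A 3-instruction snippet with `-num-repetitions=12` would be repeated 4 times, which matches the example in the `BenchmarkResult.h` comment.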

llvm/tools/llvm-exegesis/lib/BenchmarkResult.h

Lines changed: 1 addition & 1 deletion
@@ -67,7 +67,7 @@ struct InstructionBenchmark {
6767
const MCInst &keyInstruction() const { return Key.Instructions[0]; }
6868
// The number of instructions inside the repeated snippet. For example, if a
6969
// snippet of 3 instructions is repeated 4 times, this is 12.
70-
int NumRepetitions = 0;
70+
unsigned NumRepetitions = 0;
7171
enum RepetitionModeE { Duplicate, Loop, AggregateMin };
7272
// Note that measurements are per instruction.
7373
std::vector<BenchmarkMeasure> Measurements;

llvm/tools/llvm-exegesis/lib/BenchmarkRunner.cpp

Lines changed: 7 additions & 5 deletions
@@ -133,7 +133,7 @@ class FunctionExecutorImpl : public BenchmarkRunner::FunctionExecutor {
133133
} // namespace
134134

135135
Expected<InstructionBenchmark> BenchmarkRunner::runConfiguration(
136-
const BenchmarkCode &BC, unsigned NumRepetitions,
136+
const BenchmarkCode &BC, unsigned NumRepetitions, unsigned LoopBodySize,
137137
ArrayRef<std::unique_ptr<const SnippetRepetitor>> Repetitors,
138138
bool DumpObjectToDisk) const {
139139
InstructionBenchmark InstrBenchmark;
@@ -168,14 +168,16 @@ Expected<InstructionBenchmark> BenchmarkRunner::runConfiguration(
168168
// Assemble at least kMinInstructionsForSnippet instructions by repeating
169169
// the snippet for debug/analysis. This is so that the user clearly
170170
// understands that the inside instructions are repeated.
171-
constexpr const int kMinInstructionsForSnippet = 16;
171+
const int MinInstructionsForSnippet = 4 * Instructions.size();
172+
const int LoopBodySizeForSnippet = 2 * Instructions.size();
172173
{
173174
SmallString<0> Buffer;
174175
raw_svector_ostream OS(Buffer);
175176
if (Error E = assembleToStream(
176177
State.getExegesisTarget(), State.createTargetMachine(),
177178
BC.LiveIns, BC.Key.RegisterInitialValues,
178-
Repetitor->Repeat(Instructions, kMinInstructionsForSnippet),
179+
Repetitor->Repeat(Instructions, MinInstructionsForSnippet,
180+
LoopBodySizeForSnippet),
179181
OS)) {
180182
return std::move(E);
181183
}
@@ -187,8 +189,8 @@ Expected<InstructionBenchmark> BenchmarkRunner::runConfiguration(
187189

188190
// Assemble NumRepetitions instructions repetitions of the snippet for
189191
// measurements.
190-
const auto Filler =
191-
Repetitor->Repeat(Instructions, InstrBenchmark.NumRepetitions);
192+
const auto Filler = Repetitor->Repeat(
193+
Instructions, InstrBenchmark.NumRepetitions, LoopBodySize);
192194

193195
object::OwningBinary<object::ObjectFile> ObjectFile;
194196
if (DumpObjectToDisk) {

llvm/tools/llvm-exegesis/lib/BenchmarkRunner.h

Lines changed: 1 addition & 0 deletions
@@ -41,6 +41,7 @@ class BenchmarkRunner {
4141

4242
Expected<InstructionBenchmark>
4343
runConfiguration(const BenchmarkCode &Configuration, unsigned NumRepetitions,
44+
unsigned LoopUnrollFactor,
4445
ArrayRef<std::unique_ptr<const SnippetRepetitor>> Repetitors,
4546
bool DumpObjectToDisk) const;
4647

llvm/tools/llvm-exegesis/lib/SnippetRepetitor.cpp

Lines changed: 21 additions & 8 deletions
@@ -11,6 +11,7 @@
1111

1212
#include "SnippetRepetitor.h"
1313
#include "Target.h"
14+
#include "llvm/ADT/Sequence.h"
1415
#include "llvm/CodeGen/TargetInstrInfo.h"
1516
#include "llvm/CodeGen/TargetSubtargetInfo.h"
1617

@@ -24,8 +25,8 @@ class DuplicateSnippetRepetitor : public SnippetRepetitor {
2425

2526
// Repeats the snippet until there are at least MinInstructions in the
2627
// resulting code.
27-
FillFunction Repeat(ArrayRef<MCInst> Instructions,
28-
unsigned MinInstructions) const override {
28+
FillFunction Repeat(ArrayRef<MCInst> Instructions, unsigned MinInstructions,
29+
unsigned LoopBodySize) const override {
2930
return [Instructions, MinInstructions](FunctionFiller &Filler) {
3031
auto Entry = Filler.getEntry();
3132
if (!Instructions.empty()) {
@@ -53,17 +54,26 @@ class LoopSnippetRepetitor : public SnippetRepetitor {
5354
State.getTargetMachine().getTargetTriple())) {}
5455

5556
// Loop over the snippet ceil(MinInstructions / Instructions.Size()) times.
56-
FillFunction Repeat(ArrayRef<MCInst> Instructions,
57-
unsigned MinInstructions) const override {
58-
return [this, Instructions, MinInstructions](FunctionFiller &Filler) {
57+
FillFunction Repeat(ArrayRef<MCInst> Instructions, unsigned MinInstructions,
58+
unsigned LoopBodySize) const override {
59+
return [this, Instructions, MinInstructions,
60+
LoopBodySize](FunctionFiller &Filler) {
5961
const auto &ET = State.getExegesisTarget();
6062
auto Entry = Filler.getEntry();
6163
auto Loop = Filler.addBasicBlock();
6264
auto Exit = Filler.addBasicBlock();
6365

66+
const unsigned LoopUnrollFactor =
67+
LoopBodySize <= Instructions.size()
68+
? 1
69+
: divideCeil(LoopBodySize, Instructions.size());
70+
assert(LoopUnrollFactor >= 1 && "Should end up with at least 1 snippet.");
71+
6472
// Set loop counter to the right value:
65-
const APInt LoopCount(32, (MinInstructions + Instructions.size() - 1) /
66-
Instructions.size());
73+
const APInt LoopCount(
74+
32,
75+
divideCeil(MinInstructions, LoopUnrollFactor * Instructions.size()));
76+
assert(LoopCount.uge(1) && "Trip count should be at least 1.");
6777
for (const MCInst &Inst :
6878
ET.setRegTo(State.getSubtargetInfo(), LoopCounter, LoopCount))
6979
Entry.addInstruction(Inst);
@@ -78,7 +88,10 @@ class LoopSnippetRepetitor : public SnippetRepetitor {
7888
Loop.MBB->addLiveIn(Reg);
7989
for (const auto &LiveIn : Entry.MBB->liveins())
8090
Loop.MBB->addLiveIn(LiveIn);
81-
Loop.addInstructions(Instructions);
91+
for (auto _ : seq(0U, LoopUnrollFactor)) {
92+
(void)_;
93+
Loop.addInstructions(Instructions);
94+
}
8295
ET.decrementLoopCounterAndJump(*Loop.MBB, *Loop.MBB,
8396
State.getInstrInfo());
8497
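
The unroll factor and trip count computed in `LoopSnippetRepetitor::Repeat` can be reproduced outside LLVM. The sketch below mirrors the two expressions from the patch; `LoopPlan`, `planLoop` and `ceilDiv` are hypothetical names, with `ceilDiv` standing in for LLVM's `divideCeil`.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for llvm::divideCeil: ceiling of N / D.
static uint64_t ceilDiv(uint64_t N, uint64_t D) {
  return N / D + (N % D != 0);
}

// Hypothetical result type, not part of llvm-exegesis.
struct LoopPlan {
  uint64_t UnrollFactor; // copies of the snippet in the loop body
  uint64_t TripCount;    // number of loop iterations
};

// Mirrors the arithmetic in the patch: unroll until the body holds at least
// LoopBodySize instructions, then loop until MinInstructions have executed.
LoopPlan planLoop(uint64_t MinInstructions, uint64_t LoopBodySize,
                  uint64_t SnippetSize) {
  assert(SnippetSize > 0 && "snippet must contain at least one instruction");
  const uint64_t Unroll =
      LoopBodySize <= SnippetSize ? 1 : ceilDiv(LoopBodySize, SnippetSize);
  const uint64_t Trip = ceilDiv(MinInstructions, Unroll * SnippetSize);
  return {Unroll, Trip};
}
```

With the values from the commit message (`MinInstructions = 1000000`, `LoopBodySize = 1000`, one instruction in the snippet) this yields an unroll factor of 1000 and a trip count of 1000. With the default `LoopBodySize = 0` the unroll factor is 1, so the loop body is the snippet as written.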

llvm/tools/llvm-exegesis/lib/SnippetRepetitor.h

Lines changed: 2 additions & 1 deletion
@@ -39,7 +39,8 @@ class SnippetRepetitor {
3939
// Returns a functor that repeats `Instructions` so that the function executes
4040
// at least `MinInstructions` instructions.
4141
virtual FillFunction Repeat(ArrayRef<MCInst> Instructions,
42-
unsigned MinInstructions) const = 0;
42+
unsigned MinInstructions,
43+
unsigned LoopBodySize) const = 0;
4344

4445
explicit SnippetRepetitor(const LLVMState &State) : State(State) {}
4546

llvm/tools/llvm-exegesis/llvm-exegesis.cpp

Lines changed: 8 additions & 1 deletion
@@ -116,6 +116,13 @@ static cl::opt<unsigned>
116116
cl::desc("number of time to repeat the asm snippet"),
117117
cl::cat(BenchmarkOptions), cl::init(10000));
118118

119+
static cl::opt<unsigned>
120+
LoopBodySize("loop-body-size",
121+
cl::desc("when repeating the instruction snippet by looping "
122+
"over it, duplicate the snippet until the loop body "
123+
"contains at least this many instructions"),
124+
cl::cat(BenchmarkOptions), cl::init(0));
125+
119126
static cl::opt<unsigned> MaxConfigsPerOpcode(
120127
"max-configs-per-opcode",
121128
cl::desc(
@@ -365,7 +372,7 @@ void benchmarkMain() {
365372

366373
for (const BenchmarkCode &Conf : Configurations) {
367374
InstructionBenchmark Result = ExitOnErr(Runner->runConfiguration(
368-
Conf, NumRepetitions, Repetitors, DumpObjectToDisk));
375+
Conf, NumRepetitions, LoopBodySize, Repetitors, DumpObjectToDisk));
369376
ExitOnFileError(BenchmarkFile, Result.writeYaml(State, BenchmarkFile));
370377
}
371378
exegesis::pfm::pfmTerminate();

llvm/unittests/tools/llvm-exegesis/X86/SnippetRepetitorTest.cpp

Lines changed: 6 additions & 2 deletions
@@ -42,11 +42,13 @@ class X86SnippetRepetitorTest : public X86TestBase {
4242
const auto Repetitor = SnippetRepetitor::Create(RepetitionMode, State);
4343
const std::vector<MCInst> Instructions = {MCInstBuilder(X86::NOOP)};
4444
FunctionFiller Sink(*MF, {X86::EAX});
45-
const auto Fill = Repetitor->Repeat(Instructions, kMinInstructions);
45+
const auto Fill =
46+
Repetitor->Repeat(Instructions, kMinInstructions, kLoopBodySize);
4647
Fill(Sink);
4748
}
4849

4950
static constexpr const unsigned kMinInstructions = 3;
51+
static constexpr const unsigned kLoopBodySize = 5;
5052

5153
std::unique_ptr<LLVMTargetMachine> TM;
5254
std::unique_ptr<LLVMContext> Context;
@@ -78,7 +80,9 @@ TEST_F(X86SnippetRepetitorTest, Loop) {
7880
ASSERT_EQ(MF->getNumBlockIDs(), 3u);
7981
const auto &LoopBlock = *MF->getBlockNumbered(1);
8082
EXPECT_THAT(LoopBlock.instrs(),
81-
ElementsAre(HasOpcode(X86::NOOP), HasOpcode(X86::ADD64ri8),
83+
ElementsAre(HasOpcode(X86::NOOP), HasOpcode(X86::NOOP),
84+
HasOpcode(X86::NOOP), HasOpcode(X86::NOOP),
85+
HasOpcode(X86::NOOP), HasOpcode(X86::ADD64ri8),
8286
HasOpcode(X86::JCC_1)));
8387
EXPECT_THAT(LoopBlock.liveins(),
8488
UnorderedElementsAre(
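
The updated expectation in the `Loop` test follows from `kLoopBodySize = 5` and a one-instruction snippet: five `NOOP`s, then the counter decrement and the back-edge jump. The sketch below models the loop block as a list of opcode names; `modelLoopBlock` is a hypothetical helper, and the trailing `ADD64ri8` / `JCC_1` pair is taken from the X86 test expectation rather than from any target-independent rule.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical model, not llvm-exegesis code: lists the opcode names expected
// in the loop block for a given snippet and preferred loop body size.
std::vector<std::string> modelLoopBlock(const std::vector<std::string> &Snippet,
                                        unsigned LoopBodySize) {
  assert(!Snippet.empty() && "snippet must contain at least one instruction");
  const unsigned Size = static_cast<unsigned>(Snippet.size());
  // Same rule as the patch: unroll only when the snippet is smaller than the
  // requested loop body, rounding the number of copies up.
  const unsigned Unroll =
      LoopBodySize <= Size ? 1
                           : LoopBodySize / Size + (LoopBodySize % Size != 0);
  std::vector<std::string> Block;
  for (unsigned I = 0; I < Unroll; ++I)
    Block.insert(Block.end(), Snippet.begin(), Snippet.end());
  // Assumption from the X86 test: counter decrement, then the back-edge jump.
  Block.push_back("ADD64ri8");
  Block.push_back("JCC_1");
  return Block;
}
```

Calling `modelLoopBlock({"NOOP"}, 5)` produces seven entries in the same order as the `ElementsAre` matcher in the test, while a loop body size of 0 reproduces the pre-patch block of three instructions.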
