Commit 5ed023c

Inside Rust - Exploring PGO for the Rust Compiler: Style touch ups.
1 parent 7ed8d52 commit 5ed023c

1 file changed, +18 -14 lines changed

posts/inside-rust/2020-10-30-exploring-pgo-for-the-rust-compiler.md

Lines changed: 18 additions & 14 deletions
@@ -55,8 +55,8 @@ In order to enable PGO for rustc's LLVM we basically follow the steps laid out i
 [llvm]

 # Pass extra compiler and linker flags to the LLVM CMake build.
-# <PROFDATA_DIR> must be an absolute path to a writeable directory,
-# like for example /tmp/my-rustc-profdata
+# <PROFDATA_DIR> must be an absolute path to a writeable
+# directory, like for example /tmp/my-rustc-profdata
 cflags = "-fprofile-generate=<PROFDATA_DIR>"
 cxxflags = "-fprofile-generate=<PROFDATA_DIR>"

@@ -94,7 +94,7 @@ In order to enable PGO for rustc's LLVM we basically follow the steps laid out i
 [llvm-profdata]: https://clang.llvm.org/docs/UsersManual.html#cmdoption-fprofile-generate

 3. Now that the combined profile data from all *rustc* invocations can be found in `<PROFDATA_DIR>/rustc-llvm.profdata` it is time to re-compile LLVM and *rustc* again, this time instructing Clang to make use of this valuable new information.
-To this end, we modify `config.toml` as follows:
+To this end we modify `config.toml` as follows:

 ```toml
 [llvm]
@@ -137,7 +137,7 @@ Diving more into details shows the expected profile:
 [rustc-perf-pgo-llvm-expanded]: /images/inside-rust/2020-10-30-exploring-pgo-for-the-rust-compiler/rustc-perf-pgo-llvm-expanded.png

 Workloads that spend most of their time in LLVM (e.g. optimized builds) show the most improvement, while workloads that don't invoke LLVM at all (e.g. check builds) also don't profit from a faster LLVM.
-Let's take a look how we can take things further by applying PGO to the other half of the compiler.
+Let's take a look at how we can take things further by applying PGO to the other half of the compiler.

 [clang-pgo-20]: https://www.llvm.org/docs/HowToBuildWithPGO.html#introduction
 [perf.rlo]: https://perf.rust-lang.org/
@@ -166,8 +166,9 @@ pub fn rustc_cargo_env(builder: &Builder<'_>,
 cargo.env("RUSTC_VERIFY_LLVM_IR", "1");
 }

-// This is new: Hard code instrumentation in the RUSTFLAGS of the Cargo
-// invocation that builds the compiler
+// This is new: Hard code instrumentation in the
+// RUSTFLAGS of the Cargo invocation that builds
+// the compiler
 cargo.rustflag("-Cprofile-generate=<PROFDATA_DIR>");

 // ... omitted ...
@@ -190,11 +191,14 @@ pub fn rustc_cargo_env(builder: &Builder<'_>,
 cargo.env("RUSTC_VERIFY_LLVM_IR", "1");
 }

-// Replace `-Cprofile-generate` with `-Cprofile-use`, assuming
-// that we used the `llvm-profdata` tool to merge the collected
-// `<PROFDATA_DIR>/*.profraw` files into a common file named
+// Replace `-Cprofile-generate` with `-Cprofile-use`,
+// assuming that we used the `llvm-profdata` tool to
+// merge the collected `<PROFDATA_DIR>/*.profraw` files
+// into a common file named
 // `<PROFDATA_DIR>/rustc-rust.profdata`.
-cargo.rustflag("-Cprofile-use=<PROFDATA_DIR>/rustc-rust.profdata");
+cargo.rustflag(
+"-Cprofile-use=<PROFDATA_DIR>/rustc-rust.profdata"
+);

 // ... omitted ...
 }
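
Note on the two hunks above: they differ only in which rustc flag is hard-coded, `-Cprofile-generate=<PROFDATA_DIR>` for the instrumented build and `-Cprofile-use=<PROFDATA_DIR>/rustc-rust.profdata` for the optimized one. The following is a minimal sketch of that switch, not the bootstrap code from the post. `PgoPhase` and `pgo_rustflag` are hypothetical names introduced only for illustration; the flag spellings and the `rustc-rust.profdata` file name are taken from the diff.

```rust
use std::path::Path;

/// The two compiler builds of a PGO cycle (hypothetical type, for illustration only).
enum PgoPhase {
    /// First build: emit an instrumented compiler that writes `.profraw` files.
    Generate,
    /// Second build: optimize using the merged profile data.
    Use,
}

/// Returns the rustc flag for the given phase (hypothetical helper).
/// Assumes the `.profraw` files were merged with `llvm-profdata` into
/// `<profdata_dir>/rustc-rust.profdata`, as the comment in the diff describes.
fn pgo_rustflag(phase: PgoPhase, profdata_dir: &Path) -> String {
    match phase {
        PgoPhase::Generate => {
            format!("-Cprofile-generate={}", profdata_dir.display())
        }
        PgoPhase::Use => {
            let merged = profdata_dir.join("rustc-rust.profdata");
            format!("-Cprofile-use={}", merged.display())
        }
    }
}
```

Under these assumptions, the result of `pgo_rustflag(PgoPhase::Use, Path::new("/tmp/my-rustc-profdata"))` could be handed to `cargo.rustflag(...)` in place of the hard-coded string.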
@@ -214,7 +218,7 @@ As expected the results are similar to when PGO was applied to LLVM: a reduction

 Because different workloads execute different amounts of Rust code (vs C++/LLVM code), the total reduction can be a lot less for LLVM-heavy cases.
 For example, a full *webrender-opt* build will spend more than 80% of its time in LLVM, so reducing the remaining 20% by 5% can only reduce the total number by 1%.
-On the other hand, a *check* build or an *incr-unchanged* build spends almost no time in LLVM, so the 5% Rust performance improvement translates almost entirely into a 5% build time reduction for these cases:
+On the other hand, a *check* build or an *incr-unchanged* build spends almost no time in LLVM, so the 5% Rust performance improvement translates almost entirely into a 5% instruction count reduction for these cases:

 ![Performance improvements gained from applying PGO to (only) the Rust part of the compiler (details)][rustc-perf-pgo-rust-expanded]
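
Note on the arithmetic in the hunk above: the overall saving is the share of the workload spent in Rust code multiplied by the improvement of that Rust code. A minimal sketch of the calculation follows. `total_reduction` and its parameter names are hypothetical and not part of rustc or rustc-perf, and the model assumes the LLVM part of the workload is unchanged.

```rust
/// Fraction by which the whole workload shrinks when only its Rust part
/// gets faster (hypothetical helper, for illustration only).
///
/// `rust_share`: fraction of the total spent in Rust code (0.0 to 1.0).
/// `rust_improvement`: fractional reduction of that Rust part (0.05 means 5%).
fn total_reduction(rust_share: f64, rust_improvement: f64) -> f64 {
    // The LLVM share (1.0 - rust_share) is assumed unchanged,
    // so only the Rust share contributes to the saving.
    rust_share * rust_improvement
}
```

With the numbers from the text, `total_reduction(0.20, 0.05)` comes to about 0.01 (1%) for an LLVM-heavy build such as *webrender-opt*, while a *check* build with a Rust share close to 1.0 keeps nearly the full 5%.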

@@ -248,7 +252,7 @@ Given that PGO adds quite a few complications to the build process of the compil

 [rustc-perf-pgo-both]: https://perf.rust-lang.org/compare.html?start=pgo-2020-10-30-none&end=pgo-2020-10-30-both&stat=instructions%3Au

-I then took a glance that the benchmarks' wall time measurements (instead of the instruction count measurements) and saw quite a different picture: *webrender-opt* minus 15%, *style-servo-opt* minus 14%, *serde-check* minus 15%?
+I then took a glance at the benchmarks' wall time measurements (instead of the instruction count measurements) and saw quite a different picture: *webrender-opt* minus 15%, *style-servo-opt* minus 14%, *serde-check* minus 15%?
 This looked decidedly better than for instruction counts.
 But wall time measurements can be very noisy (which is why most people only look at instruction counts on perf.rust-lang.org), and `rustc-perf` only does a single iteration for each benchmark, so I was not prepared to trust these numbers just yet.
 I decided to try and reduce the noise by increasing the number of benchmark iterations from one to twenty.
@@ -260,7 +264,7 @@ After roughly eight hours to complete both the PGO and the non-PGO versions of t
 [rustc-perf-pgo-both-walltime-thumb]: /images/inside-rust/2020-10-30-exploring-pgo-for-the-rust-compiler/rustc-perf-pgo-both-walltime-thumb.png
 [rustc-perf-pgo-both-walltime]: https://perf.rust-lang.org/compare.html?start=pgo-2020-10-30-none-20&end=pgo-2020-10-30-both-20&stat=wall-time

-As you can see we get a 10-16% reduction of build times almost across the board.
+As you can see we get a 10-16% reduction of build times almost across the board for real world test cases.
 This was more in line with what I had initially hoped to get from PGO.
 It is a bit surprising that the difference between instruction counts and wall time is so pronounced.
 One plausible explanation would be that PGO improves instruction cache utilization, something which makes a difference for execution time but would not be reflected in the amount of instructions executed.
@@ -300,4 +304,4 @@ It's unlikely that I can spend a lot of time on this personally -- but my hope i

 [dist-builds]: https://github.com/rust-lang/rust/tree/master/src/ci/docker/host-x86_64

-**PS** -- Special thanks to Mark Rousskov for uploading my local benchmarking data to [perf.rust-lang.org][perf.rlo], which makes it much nicer to explore!
+**PS** -- Special thanks to Mark Rousskov for uploading my local benchmarking data to [perf.rust-lang.org][rustc-perf-pgo-both-walltime], which makes it much nicer to explore!
