Commit f9a4efc

infra: port the "How the Rust CI works" dropbox paper
1 parent 6c9c55b commit f9a4efc

2 files changed

+230
-0
lines changed

src/SUMMARY.md

Lines changed: 1 addition & 0 deletions
@@ -24,6 +24,7 @@
2424
- [docs.rs](./infra/docs/docs-rs.md)
2525
- [Monitoring](./infra/docs/monitoring.md)
2626
- [rust-bots server](./infra/docs/rust-bots.md)
27+
- [rust-lang/rust CI](./infra/docs/rustc-ci.md)
2728
- [Language](./lang/README.md)
2829
- [RFC Merge Procedure](./lang/rfc-merge-procedure.md)
2930
- [Release](./release/README.md)

src/infra/docs/rustc-ci.md

Lines changed: 229 additions & 0 deletions
@@ -0,0 +1,229 @@
1+
# How the Rust CI works
2+
3+
## Which jobs we run
4+
5+
The `rust-lang/rust` repository uses Azure Pipelines to test [all the
6+
platforms][platforms] we support. We currently have two kinds of jobs running
7+
for each commit we want to merge to master:
8+
9+
- Dist jobs build a full release of the compiler for that platform, including
10+
all the tools we ship through rustup. Those builds are then uploaded to the
11+
`rust-lang-ci2` S3 bucket and are available to be locally installed with the
12+
[rustup-toolchain-install-master] tool. The same builds are also used for
13+
actual releases: our release process basically consists of copying those
14+
artifacts from `rust-lang-ci2` to the production endpoint and signing them.
15+
16+
- Non-dist jobs run our full test suite on the platform, and the test suite of
17+
all the tools we ship through rustup. The amount of testing depends on
18+
the platform (for example some tests are run only on Tier 1 platforms), and
19+
some quicker platforms are grouped together on the same builder to avoid
20+
wasting CI resources.
21+
22+
All the builds except those on macOS and Windows are executed inside that
23+
platform’s custom Docker container. This has a lot of advantages for us:
24+
25+
- The build environment is consistent regardless of the changes of the
26+
underlying image (switching from the trusty image to xenial was painless for
27+
us).
28+
- We can use ancient build environments to ensure maximum binary compatibility,
29+
for example [using CentOS 5][dist-x86_64-linux] on our Linux builders.
30+
- We can avoid reinstalling tools (like QEMU or the Android emulator) every
31+
time thanks to Docker image caching.
32+
- Users can run the same tests in the same environment locally by just running
33+
`src/ci/docker/run.sh image-name`, which is very useful for debugging failures.
34+
35+
We also run tests for less common architectures (mainly Tier 2 and Tier 3
36+
platforms) on Azure Pipelines. Since those platforms are not x86 we either run
37+
everything inside QEMU or just cross-compile if we don’t want to run the tests
38+
for that platform.
39+
40+
[platforms]: https://forge.rust-lang.org/release/platform-support.html
41+
[rustup-toolchain-install-master]: https://github.com/kennytm/rustup-toolchain-install-master
42+
[dist-x86_64-linux]: https://github.com/rust-lang/rust/blob/master/src/ci/docker/dist-x86_64-linux/Dockerfile
43+
44+
## Merging PRs serially with bors
45+
46+
CI services usually test the last commit of a branch merged with the last
47+
commit in master, and while that’s great to check if the feature works in
48+
isolation it doesn’t provide any guarantee the code is going to work once it’s
49+
merged. Breakages like these usually happen when another, incompatible PR is
50+
merged after the build happened.
51+
52+
To ensure a master that works all the time we forbid manual merges: instead all
53+
PRs have to be approved through our bot, [bors] (the software behind it is
54+
called [homu]). All the approved PRs are put [in a queue][homu-queue] (sorted
55+
by priority and creation date) and are automatically tested one at a time. If
56+
all the builders are green the PR is merged, otherwise the failure is recorded
57+
and the PR will have to be re-approved.
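
A minimal sketch of that queue behaviour, not homu's actual code. The `ApprovedPR` type, the `run_ci` callback and the sort direction (higher priority first, then oldest first) are assumptions made for illustration:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ApprovedPR:
    number: int
    priority: int    # assumption: a higher value is tested sooner
    created_at: int  # assumption: any sortable creation marker, e.g. a Unix timestamp


def next_to_test(queue):
    """Return the approved PR that would be tested next, or None if the queue is empty."""
    if not queue:
        return None
    # Highest priority first; among equal priorities, the oldest PR first.
    return min(queue, key=lambda pr: (-pr.priority, pr.created_at))


def process(queue, run_ci):
    """Test PRs one at a time; return (merged, needs_reapproval) lists of PR numbers."""
    pending = list(queue)
    merged, needs_reapproval = [], []
    while pending:
        pr = next_to_test(pending)
        pending.remove(pr)
        # A green run merges the PR; a red run sends it back for re-approval.
        (merged if run_ci(pr) else needs_reapproval).append(pr.number)
    return merged, needs_reapproval
```

Because only one PR is under test at any moment, every green result is known to be valid against the exact state of master it will be merged into.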
58+
59+
Bors doesn’t interact with CI services directly, but it works by pushing the
60+
merge commit it wants to test to a branch called `auto`, and detecting the
61+
outcome of the build by listening for either Commit Statuses or Check Runs.
62+
Since the merge commit is based on the latest master and only one can be tested
63+
at a time, when the results are green master is fast-forwarded to that
64+
merge commit.
65+
66+
Unfortunately testing a single PR at a time, combined with our long CI (~3.5
67+
hours for a full run), means we can’t merge too many PRs in a single day, and a
68+
single failure greatly impacts our throughput for the day. The maximum number
69+
of PRs we can merge in a day is around 7 (24 hours divided by ~3.5 hours per run).
70+
71+
[bors]: https://github.com/bors
72+
[homu]: https://github.com/rust-lang/homu
73+
[homu-queue]: https://buildbot2.rust-lang.org/homu/queue/rust
74+
75+
### Rollups
76+
77+
Some PRs don’t need the full test suite to be executed: trivial changes like
78+
typo fixes or README improvements *shouldn’t* break the build, and testing
79+
every single one of them for 2 to 3 hours is a big waste of time. To solve this
80+
we do a "rollup", a PR where we merge all the trivial PRs so they can be tested
81+
together. Rollups are created manually by a team member who uses their
82+
judgement to decide if a PR is risky or not, and are the best tool we have at
83+
the moment to keep the queue in a manageable state.
84+
85+
### Try builds
86+
87+
Sometimes we need a working compiler build before approving a PR, usually for
88+
[benchmarking][perf] or [checking the impact of the PR across the
89+
ecosystem][crater]. Bors supports creating them by pushing the merge commit on
90+
a separate branch (`try`), and they basically work the same as normal builds,
91+
without the actual merge at the end. Any number of try builds can happen at the
92+
same time, even if there is a normal PR in progress.
93+
94+
[perf]: https://perf.rust-lang.org
95+
[crater]: https://github.com/rust-lang/crater
96+
97+
## Which branches we test
98+
99+
Our builders are defined in `src/ci/azure-pipelines/`, and they depend on the
100+
branch used for the build. Each job is configured in one of the top-level `.yml`
101+
files.
102+
103+
### PR builds
104+
105+
All the commits pushed in a PR run a limited set of tests: a job containing a
106+
bunch of lints plus a cross-compile check build to Windows mingw (without
107+
producing any artifacts) and the `x86_64-gnu-llvm-6.0` non-dist builder. Those
108+
two builders are enough to catch most of the common errors introduced in a PR,
109+
but they don’t cover other platforms at all. Unfortunately it would take too
110+
many resources to run the full test suite for each commit on every PR.
111+
112+
Additionally, if the PR changes submodules the `x86_64-gnu-tools` non-dist
113+
builder is run.
114+
115+
### The `try` branch
116+
117+
On the main rust repo try builds produce just a Linux toolchain. Builds on
118+
that branch run a job containing the lint builder and both the dist and
119+
non-dist builders for `linux-x86_64`. Usually we don’t need `try` builds for
120+
other platforms, but in the rare cases when this is needed we just add a
121+
temporary commit that changes the `src/ci/azure-pipelines/try.yml` file to
122+
enable the builders we need on that platform (disabling Linux to avoid wasting
123+
CI resources).
124+
125+
### The `auto` branch
126+
127+
This branch is used by bors to run all the tests on a PR before merging it, so
128+
all the builders are enabled for it. Bors repeatedly force-pushes to it
129+
(every time a new commit is tested).
130+
131+
### The `master` branch
132+
133+
Since all the commits to `master` are fast-forwarded from the `auto` branch (if
134+
they pass all the tests there) we don’t need to build or test anything. A quick
135+
job is executed on each push to update toolstate (see the toolstate description
136+
below).
137+
138+
### Other branches
139+
140+
Other branches are just disabled and don’t run any kind of builds, since all
141+
the in-progress branches will eventually be tested in a PR. We try to encourage
142+
contributors to create branches on their own fork, but there is no way for us
143+
to prevent branches from being created on the main repository.
144+
145+
## Caching
146+
147+
The main rust repository doesn’t use the native Azure Pipelines caching tools.
148+
All our caching is uploaded to an S3 bucket we control
149+
(`rust-lang-ci-sccache2`), and it’s used mainly for two things:
150+
151+
### Docker images caching
152+
153+
The Docker images we use to run most of the Linux-based builders take a *long*
154+
time to fully build: every time we need to build them (for example when the CI
155+
scripts change) we consistently reach the build timeout, forcing us to retry
156+
the merge. To avoid the timeouts we cache the exported images on the S3 bucket
157+
(with `docker save`/`docker load`).
158+
159+
Since we test multiple, diverged branches (`master`, `beta` and `stable`) we
160+
can’t rely on a single cache for the images, otherwise builds on a branch would
161+
overwrite the cache for the others. Instead we store the images, identifying them
162+
with a custom hash, made from the host’s Docker version and the contents of all
163+
the Dockerfiles and related scripts.
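
The idea can be sketched as follows. This is an illustration rather than the CI scripts' actual implementation: the function name, the choice of SHA-256 and the way paths are mixed into the hash are all assumptions.

```python
import hashlib
from typing import Mapping


def image_cache_key(docker_version: str, files: Mapping[str, bytes]) -> str:
    """Derive a cache key from the host's Docker version and the build inputs.

    `files` maps a relative path (Dockerfile or helper script) to its contents.
    """
    hasher = hashlib.sha256()
    hasher.update(docker_version.encode("utf-8"))
    # Sort the paths so the key does not depend on iteration order.
    for path in sorted(files):
        # Separate fields with NUL bytes so adjacent values cannot run together.
        hasher.update(b"\0" + path.encode("utf-8") + b"\0")
        hasher.update(files[path])
    return hasher.hexdigest()
```

Two branches with identical Dockerfiles and scripts share one cached image, while any change to the inputs or to the Docker version yields a new key instead of overwriting an existing entry.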
164+
165+
### LLVM caching with sccache
166+
167+
We compile some C/C++ code during the build and we rely on [sccache] to cache
168+
intermediate LLVM artifacts. Sccache is a distributed ccache developed by
169+
Mozilla, and it can use an object storage bucket as the storage backend, like
170+
we do with our S3 bucket.
171+
172+
[sccache]: https://github.com/mozilla/sccache
173+
174+
## Custom tooling around CI
175+
176+
Over the years we developed some custom tooling to improve our CI experience.
177+
178+
### Cancelbot to keep the queue short
179+
180+
We have limited CI capacity on Azure Pipelines, and while that’s enough for a
181+
single build we can’t run more than one at a time. Unfortunately when a job
182+
fails the other jobs on the same build will continue to run, limiting the
183+
available capacity. To avoid the issue we have a tool called [cancelbot] that
184+
runs in cron every 2 minutes and kills all the jobs not related to a running
185+
build through the API.
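
The selection step can be sketched like this. It is a hypothetical model: the `(job_id, build_id, state)` tuples and the state names are invented for illustration, and the real tool obtains this data from the CI service's API.

```python
def jobs_to_cancel(jobs, running_build_ids):
    """Pick the jobs that still occupy capacity but belong to no running build.

    `jobs` is an iterable of (job_id, build_id, state) tuples.
    """
    running = set(running_build_ids)
    return [
        job_id
        for job_id, build_id, state in jobs
        # Only queued or running jobs hold CI capacity; finished ones are ignored.
        if state in ("queued", "running") and build_id not in running
    ]
```

Running a check like this on a short timer frees capacity soon after a build fails, without having to wait for its remaining jobs to finish.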
186+
187+
[cancelbot]: https://github.com/rust-lang/rust-central-station/tree/master/cancelbot
188+
189+
### Rust Log Analyzer to show the error message in PRs
190+
191+
The build logs for `rust-lang/rust` are huge, and it’s not practical to find
192+
what caused the build to fail by looking at the logs. To improve the
193+
developers’ experience we developed a bot called [Rust Log Analyzer][rla] (RLA)
194+
that receives the build logs on failure and extracts the error message
195+
automatically, posting it on the PR.
196+
197+
The bot is not hardcoded to look for error strings, but was trained with a
198+
bunch of build failures to recognize which lines are common between builds and
199+
which are not. While the generated snippets can be weird sometimes, the bot is
200+
pretty good at identifying the relevant lines even if it’s an error we never
201+
saw before.
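
To give an intuition for "common versus uncommon lines", here is a deliberately simplified stand-in, not RLA's actual model: it counts in how many training logs each exact line appeared and reports the lines of a new log that were seen fewer than `threshold` times. The function names and the threshold are assumptions.

```python
from collections import Counter


def train(logs):
    """Count, for each distinct line, the number of training logs containing it."""
    counts = Counter()
    for log in logs:
        # Use a set so a line repeated inside one log is counted once.
        counts.update(set(log.splitlines()))
    return counts


def unusual_lines(log, counts, threshold=1):
    """Return the non-blank lines of `log` seen in fewer than `threshold` training logs."""
    return [
        line
        for line in log.splitlines()
        if line.strip() and counts[line] < threshold
    ]
```

Exact-match counting breaks as soon as a line contains a timestamp or a path, which is one reason a real analyzer needs something more tolerant than this sketch.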
202+
203+
[rla]: https://github.com/rust-lang/rust-log-analyzer
204+
205+
### Toolstate to support allowed failures
206+
207+
The `rust-lang/rust` repo doesn’t only test the compiler on its CI, but also
208+
all the tools distributed through rustup (like rls, rustfmt, clippy…). Since
209+
those tools rely on the compiler internals (which don’t have any kind of
210+
stability guarantee) they often break after the compiler code is changed.
211+
212+
If we blocked merging rustc PRs on the tools being fixed we would be stuck in a
213+
chicken-and-egg problem, because the tools need the new rustc to be fixed but
214+
we can’t merge the rustc change until the tools are fixed. To avoid the problem
215+
most of the tools are allowed to fail, and their status is recorded in
216+
[rust-toolstate]. When a tool breaks a bot automatically pings the tool authors
217+
so they know about the breakage, and it records the failure on the toolstate
218+
repository. The release process will then ignore broken tools on nightly,
219+
removing them from the shipped nightlies.
220+
221+
While tool failures are allowed most of the time, they’re automatically
222+
forbidden a week before a release: we don’t care if tools are broken on nightly
223+
but they must work on beta and stable, so they also need to work on nightly a
224+
few days before we promote nightly to beta.
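
The rule can be written as a small date check. This is a sketch under assumptions: the function name, the exact 7-day window and the idea that the caller knows the next release date are illustrative, not taken from the toolstate scripts.

```python
from datetime import date, timedelta


def tool_failures_allowed(today, next_release, freeze=timedelta(days=7)):
    """Return True while a broken tool should not block merging a rustc PR.

    Failures are allowed until `freeze` before the upcoming release; from then
    on a broken tool is treated as a build failure.
    """
    return today < next_release - freeze
```

For example, with a hypothetical release on `date(2019, 11, 7)`, tool failures would be tolerated up to and including October 30 and rejected from October 31 onwards.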
225+
226+
More information is available in the [toolstate documentation].
227+
228+
[rust-toolstate]: https://rust-lang-nursery.github.io/rust-toolstate
229+
[toolstate documentation]: https://forge.rust-lang.org/infra/toolstate.html
