| 1 | +# How the Rust CI works |
| 2 | + |
| 3 | +## Which jobs we run |
| 4 | + |
| 5 | +The `rust-lang/rust` repository uses Azure Pipelines to test [all the |
| 6 | +platforms][platforms] we support. We currently have two kinds of jobs running |
| 7 | +for each commit we want to merge to master: |
| 8 | + |
| 9 | +- Dist jobs build a full release of the compiler for that platform, including |
| 10 | + all the tools we ship through rustup. Those builds are then uploaded to the |
| 11 | + `rust-lang-ci2` S3 bucket and can be installed locally with the |
| 12 | + [rustup-toolchain-install-master] tool. The same builds are also used for |
| 13 | + actual releases: our release process basically consists of copying those |
| 14 | + artifacts from `rust-lang-ci2` to the production endpoint and signing them. |
| 15 | + |
| 16 | +- Non-dist jobs run our full test suite on the platform, and the test suite of |
| 17 | + all the tools we ship through rustup. The amount of testing depends on the |
| 18 | + platform (for example, some tests are run only on Tier 1 platforms), and |
| 19 | + some quicker platforms are grouped together on the same builder to avoid |
| 20 | + wasting CI resources. |
| 21 | + |
| 22 | +All the builds except those on macOS and Windows are executed inside that |
| 23 | +platform’s custom Docker container. This has a lot of advantages for us: |
| 24 | + |
| 25 | +- The build environment is consistent regardless of changes to the |
| 26 | + underlying image (switching from the trusty image to xenial was painless for |
| 27 | + us). |
| 28 | +- We can use ancient build environments to ensure maximum binary compatibility, |
| 29 | + for example [using CentOS 5][dist-x86_64-linux] on our Linux builders. |
| 30 | +- We can avoid reinstalling tools (like QEMU or the Android emulator) every |
| 31 | + time thanks to Docker image caching. |
| 32 | +- Users can run the same tests in the same environment locally by just running |
| 33 | + `src/ci/docker/run.sh image-name`, which is very useful for debugging failures. |
| 34 | + |
| 35 | +We also run tests for less common architectures (mainly Tier 2 and Tier 3 |
| 36 | +platforms) on Azure Pipelines. Since those platforms are not x86, we either run |
| 37 | +everything inside QEMU or, if we don’t want to run the tests for that platform, |
| 38 | +just cross-compile. |
| 39 | + |
| 40 | +[platforms]: https://forge.rust-lang.org/release/platform-support.html |
| 41 | +[rustup-toolchain-install-master]: https://github.com/kennytm/rustup-toolchain-install-master |
| 42 | +[dist-x86_64-linux]: https://github.com/rust-lang/rust/blob/master/src/ci/docker/dist-x86_64-linux/Dockerfile |
| 43 | + |
| 44 | +## Merging PRs serially with bors |
| 45 | + |
| 46 | +CI services usually test the last commit of a branch merged with the last |
| 47 | +commit in master. While that’s great for checking whether the feature works in |
| 48 | +isolation, it doesn’t guarantee that the code is going to work once it’s |
| 49 | +merged. Breakages like these usually happen when another, incompatible PR is |
| 50 | +merged after the build happened. |
| 51 | + |
| 52 | +To ensure a master that works all the time we forbid manual merges: instead all |
| 53 | +PRs have to be approved through our bot, [bors] (the software behind it is |
| 54 | +called [homu]). All the approved PRs are put [in a queue][homu-queue] (sorted |
| 55 | +by priority and creation date) and are automatically tested one at a time. If |
| 56 | +all the builders are green the PR is merged, otherwise the failure is recorded |
| 57 | +and the PR will have to be re-approved. |
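
The queue ordering described above can be sketched in a few lines of Python. This is only an illustration, not homu's actual code: the `ApprovedPR` class, its field names, and the sample PR numbers are hypothetical, and it assumes that a larger priority value is tested sooner and that ties are broken by the older creation date.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class ApprovedPR:
    number: int        # hypothetical PR number
    priority: int      # assumption: a larger value is tested sooner
    created: datetime  # when the PR was opened


def queue_order(prs):
    """Return PRs in testing order: highest priority first, then oldest first."""
    return sorted(prs, key=lambda pr: (-pr.priority, pr.created))


queue = [
    ApprovedPR(101, priority=0, created=datetime(2019, 7, 1)),
    ApprovedPR(102, priority=5, created=datetime(2019, 7, 3)),
    ApprovedPR(103, priority=0, created=datetime(2019, 6, 20)),
]
print([pr.number for pr in queue_order(queue)])  # [102, 103, 101]
```

Only the first entry of the ordered list would be under test at any moment; the rest wait for the current run to finish.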
| 58 | + |
| 59 | +Bors doesn’t interact with CI services directly. Instead, it pushes the |
| 60 | +merge commit it wants to test to a branch called `auto`, and detects the |
| 61 | +outcome of the build by listening for either Commit Statuses or Check Runs. |
| 62 | +Since the merge commit is based on the latest master and only one can be tested |
| 63 | +at a time, when the results are green master is fast-forwarded to that |
| 64 | +merge commit. |
| 65 | + |
| 66 | +Unfortunately testing a single PR at a time, combined with our long CI (~3.5 |
| 67 | +hours for a full run), means we can’t merge too many PRs in a single day, and a |
| 68 | +single failure greatly impacts our throughput for the day. The maximum number |
| 69 | +of PRs we can merge in a day is roughly 7 (24 hours / ~3.5 hours per run). |
| 70 | + |
| 71 | +[bors]: https://github.com/bors |
| 72 | +[homu]: https://github.com/rust-lang/homu |
| 73 | +[homu-queue]: https://buildbot2.rust-lang.org/homu/queue/rust |
| 74 | + |
| 75 | +### Rollups |
| 76 | + |
| 77 | +Some PRs don’t need the full test suite to be executed: trivial changes like |
| 78 | +typo fixes or README improvements *shouldn’t* break the build, and testing |
| 79 | +every single one of them for 2 to 3 hours is a big waste of time. To solve this |
| 80 | +we do a "rollup", a PR where we merge all the trivial PRs so they can be tested |
| 81 | +together. Rollups are created manually by a team member who uses their |
| 82 | +judgement to decide if a PR is risky or not, and are the best tool we have at |
| 83 | +the moment to keep the queue in a manageable state. |
| 84 | + |
| 85 | +### Try builds |
| 86 | + |
| 87 | +Sometimes we need a working compiler build before approving a PR, usually for |
| 88 | +[benchmarking][perf] or [checking the impact of the PR across the |
| 89 | +ecosystem][crater]. Bors supports creating them by pushing the merge commit to |
| 90 | +a separate branch (`try`), and they basically work the same as normal builds, |
| 91 | +without the actual merge at the end. Any number of try builds can happen at the |
| 92 | +same time, even if there is a normal PR in progress. |
| 93 | + |
| 94 | +[perf]: https://perf.rust-lang.org |
| 95 | +[crater]: https://github.com/rust-lang/crater |
| 96 | + |
| 97 | +## Which branches we test |
| 98 | + |
| 99 | +Our builders are defined in `src/ci/azure-pipelines/`, and they depend on the |
| 100 | +branch used for the build. Each job is configured in one of the top-level `.yml` |
| 101 | +files. |
| 102 | + |
| 103 | +### PR builds |
| 104 | + |
| 105 | +All the commits pushed to a PR run a limited set of tests: a job containing a |
| 106 | +bunch of lints plus a cross-compile check build to Windows mingw (without |
| 107 | +producing any artifacts) and the `x86_64-gnu-llvm-6.0` non-dist builder. Those |
| 108 | +two builders are enough to catch most of the common errors introduced in a PR, |
| 109 | +but they don’t cover other platforms at all. Unfortunately it would take too |
| 110 | +many resources to run the full test suite for each commit on every PR. |
| 111 | + |
| 112 | +Additionally, if the PR changes submodules the `x86_64-gnu-tools` non-dist |
| 113 | +builder is run. |
| 114 | + |
| 115 | +### The `try` branch |
| 116 | + |
| 117 | +On the main rust repo try builds produce just a Linux toolchain. Builds on |
| 118 | +this branch run a job containing the lint builder and both the dist and |
| 119 | +non-dist builders for `linux-x86_64`. Usually we don’t need `try` builds for |
| 120 | +other platforms, but in the rare cases when this is needed we just add a |
| 121 | +temporary commit that changes the `src/ci/azure-pipelines/try.yml` file to |
| 122 | +enable the builders we need on that platform (disabling Linux to avoid wasting |
| 123 | +CI resources). |
| 124 | + |
| 125 | +### The `auto` branch |
| 126 | + |
| 127 | +This branch is used by bors to run all the tests on a PR before merging it, so |
| 128 | +all the builders are enabled for it. Bors repeatedly force-pushes to it |
| 129 | +(every time a new commit is tested). |
| 130 | + |
| 131 | +### The `master` branch |
| 132 | + |
| 133 | +Since all the commits to `master` are fast-forwarded from the `auto` branch (if |
| 134 | +they pass all the tests there) we don’t need to build or test anything. A quick |
| 135 | +job is executed on each push to update toolstate (see the toolstate description |
| 136 | +below). |
| 137 | + |
| 138 | +### Other branches |
| 139 | + |
| 140 | +Other branches are just disabled and don’t run any kind of builds, since all |
| 141 | +the in-progress branches will eventually be tested in a PR. We try to encourage |
| 142 | +contributors to create branches on their own fork, but there is no way for us |
| 143 | +to prevent branches from being created on the main repository. |
| 144 | + |
| 145 | +## Caching |
| 146 | + |
| 147 | +The main rust repository doesn’t use the native Azure Pipelines caching tools. |
| 148 | +All our caches are uploaded to an S3 bucket we control |
| 149 | +(`rust-lang-ci-sccache2`), and it’s used mainly for two things: |
| 150 | + |
| 151 | +### Docker images caching |
| 152 | + |
| 153 | +The Docker images we use to run most of the Linux-based builders take a *long* |
| 154 | +time to fully build: every time we need to build them (for example when the CI |
| 155 | +scripts change) we consistently reach the build timeout, forcing us to retry |
| 156 | +the merge. To avoid the timeouts we cache the exported images on the S3 bucket |
| 157 | +(with `docker save`/`docker load`). |
| 158 | + |
| 159 | +Since we test multiple, diverged branches (`master`, `beta` and `stable`) we |
| 160 | +can’t rely on a single cache for the images, otherwise builds on a branch would |
| 161 | +overwrite the cache for the others. Instead we identify each stored image |
| 162 | +with a custom hash, made from the host’s Docker version and the contents of all |
| 163 | +the Dockerfiles and related scripts. |
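
As a rough sketch of how such a content-derived key behaves, the Python snippet below hashes a Docker version string together with the contents of the build inputs. It is an illustration under stated assumptions, not the CI scripts' actual implementation: the `image_cache_key` function, the choice of SHA-256, the version string, and the sample file contents are all hypothetical.

```python
import hashlib


def image_cache_key(docker_version, files):
    """Derive a cache key from the Docker version and build input contents.

    `files` maps a relative path to that file's bytes.
    """
    digest = hashlib.sha256()
    digest.update(docker_version.encode("utf-8"))
    # Sort by path so the key does not depend on dictionary ordering.
    for path in sorted(files):
        digest.update(b"\0" + path.encode("utf-8") + b"\0")
        digest.update(files[path])
    return digest.hexdigest()


# Hypothetical inputs for two diverged branches.
master = {"Dockerfile": b"FROM centos:5\n", "build-gcc.sh": b"make\n"}
beta = {"Dockerfile": b"FROM centos:5\n", "build-gcc.sh": b"make -j4\n"}

key_master = image_cache_key("18.09.7", master)
key_beta = image_cache_key("18.09.7", beta)
print(key_master == key_beta)
```

Because the key changes whenever any input changes, branches whose Dockerfiles or scripts differ store their images under different keys and cannot overwrite each other, while identical inputs on any branch reuse the same cached image.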
| 164 | + |
| 165 | +### LLVM caching with sccache |
| 166 | + |
| 167 | +We build some C/C++ code during the build and we rely on [sccache] to cache |
| 168 | +intermediate LLVM artifacts. Sccache is a distributed ccache developed by |
| 169 | +Mozilla, and it can use an object storage bucket as the storage backend, like |
| 170 | +we do with our S3 bucket. |
| 171 | + |
| 172 | +[sccache]: https://github.com/mozilla/sccache |
| 173 | + |
| 174 | +## Custom tooling around CI |
| 175 | + |
| 176 | +Over the years we have developed some custom tooling to improve our CI experience. |
| 177 | + |
| 178 | +### Cancelbot to keep the queue short |
| 179 | + |
| 180 | +We have limited CI capacity on Azure Pipelines, and while that’s enough for a |
| 181 | +single build we can’t run more than one at a time. Unfortunately when a job |
| 182 | +fails the other jobs on the same build will continue to run, limiting the |
| 183 | +available capacity. To avoid the issue we have a tool called [cancelbot] that |
| 184 | +runs from cron every 2 minutes and kills all the jobs not related to a running |
| 185 | +build through the API. |
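
The selection step can be pictured with the Python sketch below. It only models the decision of which jobs to cancel; the real cancelbot talks to the CI service's API, which is not shown here. The `jobs_to_cancel` function, the dictionary fields, the state names, and the build identifiers are all hypothetical.

```python
def jobs_to_cancel(jobs, active_builds):
    """Pick unfinished jobs whose build is no longer being tested.

    `active_builds` is the set of build ids that are still worth running.
    """
    unfinished = {"running", "queued"}
    return [
        job["id"]
        for job in jobs
        if job["state"] in unfinished and job["build"] not in active_builds
    ]


# Hypothetical snapshot: build "auto-41" already failed, "auto-42" is current.
jobs = [
    {"id": 1, "build": "auto-42", "state": "running"},
    {"id": 2, "build": "auto-41", "state": "running"},
    {"id": 3, "build": "auto-41", "state": "failed"},
    {"id": 4, "build": "auto-42", "state": "queued"},
]
print(jobs_to_cancel(jobs, {"auto-42"}))  # [2]
```

Job 2 is still consuming capacity for a build that can no longer succeed, so it is the only one selected; finished jobs and jobs of the current build are left alone.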
| 186 | + |
| 187 | +[cancelbot]: https://github.com/rust-lang/rust-central-station/tree/master/cancelbot |
| 188 | + |
| 189 | +### Rust Log Analyzer to show the error message in PRs |
| 190 | + |
| 191 | +The build logs for `rust-lang/rust` are huge, and it’s not practical to find |
| 192 | +what caused the build to fail by looking at the logs. To improve the |
| 193 | +developers’ experience we developed a bot called [Rust Log Analyzer][rla] (RLA) |
| 194 | +that receives the build logs on failure and extracts the error message |
| 195 | +automatically, posting it on the PR. |
| 196 | + |
| 197 | +The bot is not hardcoded to look for error strings, but was trained with a |
| 198 | +bunch of build failures to recognize which lines are common between builds and |
| 199 | +which are not. While the generated snippets can be weird sometimes, the bot is |
| 200 | +pretty good at identifying the relevant lines even if it’s an error we never |
| 201 | +saw before. |
| 202 | + |
| 203 | +[rla]: https://github.com/rust-lang/rust-log-analyzer |
| 204 | + |
| 205 | +### Toolstate to support allowed failures |
| 206 | + |
| 207 | +The `rust-lang/rust` repo doesn’t only test the compiler on its CI, but also |
| 208 | +all the tools distributed through rustup (like rls, rustfmt, clippy…). Since |
| 209 | +those tools rely on the compiler internals (which don’t have any kind of |
| 210 | +stability guarantee) they often break after the compiler code is changed. |
| 211 | + |
| 212 | +If we blocked merging rustc PRs on the tools being fixed we would be stuck in a |
| 213 | +chicken-and-egg problem, because the tools need the new rustc to be fixed but |
| 214 | +we can’t merge the rustc change until the tools are fixed. To avoid the problem |
| 215 | +most of the tools are allowed to fail, and their status is recorded in |
| 216 | +[rust-toolstate]. When a tool breaks, a bot automatically pings the tool authors |
| 217 | +so they know about the breakage, and it records the failure on the toolstate |
| 218 | +repository. The release process will then ignore broken tools on nightly, |
| 219 | +removing them from the shipped nightlies. |
| 220 | + |
| 221 | +While tool failures are allowed most of the time, they’re automatically |
| 222 | +forbidden a week before a release: we don’t care if tools are broken on nightly |
| 223 | +but they must work on beta and stable, so they also need to work on nightly a |
| 224 | +few days before we promote nightly to beta. |
| 225 | + |
| 226 | +More information is available in the [toolstate documentation]. |
| 227 | + |
| 228 | +[rust-toolstate]: https://rust-lang-nursery.github.io/rust-toolstate |
| 229 | +[toolstate documentation]: https://forge.rust-lang.org/infra/toolstate.html |