Commit 95a4b71

Merge pull request #431 from rust-lang/docsrs-postmortem
Add postmortem about the docs.rs outage
2 parents 3376e3d + 4d8a876 commit 95a4b71

1 file changed

+82
-0
lines changed

@@ -0,0 +1,82 @@
---
layout: post
title: "docs.rs outage postmortem"
author: Pietro Albini
team: the infrastructure team <https://www.rust-lang.org/governance/teams/operations#infra>
---

At 2019-10-21 01:38 UTC the docs.rs website went down because the server
hosting the application ran out of disk space. Crate builds had been failing
since 2019-10-20 00:55 UTC for the same reason.
12+
## Root cause of the outage
13+
14+
docs.rs needs to store the built documentation on the filesystem before
15+
uploading it to the database, and it does so in the
16+
`/opt/docs-rs-prefix/documentations` directory. docs.rs never cleared that
17+
directory though, so over time it started to increase its size until it caused
18+
this outage. Code to periodically purge temporary directories was present, but
19+
it was never configured to purge the one which caused the outage.
20+
21+
## Resolution
22+
23+
As the directory doesn’t contain any persistent data we cleared it and the web
24+
server was restarted. Once we were confident the situation was resolved all the
25+
crates that failed due to the outage were queued for a rebuild.
26+
27+
## Postmortem
28+
29+
The increased disk usage was gradual over weeks, slowly reaching 100% and
30+
causing the outage. While monitoring systems were in place and recorded graphs
31+
of the increase, no alert was configured so nobody noticed the problem. We need
32+
to add alerts when disk usage reaches 90%, so the problem can be investigated
33+
and dealt with on time.
34+
35+
Crates started to fail to build a day earlier, and close to no builds were
36+
successfully completed since then. We need to setup alerts when most of the
37+
builds are failing: as we don’t have the necessary metrics at the moment to
38+
reliably alert we'll have to add extra instrumentation as well.
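
One possible shape for that instrumentation is a sliding window over recent
build outcomes. The class below is a hypothetical sketch: the name
`BuildFailureMonitor`, the window size, and the 50% threshold are all
assumptions chosen for illustration, not metrics docs.rs exposes.

```python
from collections import deque


class BuildFailureMonitor:
    """Track the most recent build outcomes and flag when most of them failed."""

    def __init__(self, window_size: int = 20, failure_threshold: float = 0.5):
        # deque with maxlen drops the oldest outcome automatically.
        self.outcomes = deque(maxlen=window_size)
        self.failure_threshold = failure_threshold

    def record(self, succeeded: bool) -> None:
        self.outcomes.append(succeeded)

    def failure_rate(self) -> float:
        if not self.outcomes:
            return 0.0
        failures = sum(1 for ok in self.outcomes if not ok)
        return failures / len(self.outcomes)

    def should_alert(self) -> bool:
        # Wait for a full window so a handful of early failures does not page anyone.
        window_full = len(self.outcomes) == self.outcomes.maxlen
        return window_full and self.failure_rate() > self.failure_threshold
```

Some crates legitimately fail to build, so the alert is keyed on the failure
rate over a window rather than on any individual failure.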

Our response was slowed down by issues with our on-call rotation for the
service. The primary contacts don’t have the level of access required to
increase the disk space of the instance (a temporary fix that was investigated
at first, but discarded after discovering that nobody awake could do it), and
the backup contacts don’t have any production access or expertise on docs.rs.
47+
## Timeline of events
48+
49+
Unless otherwise noted all events happened on 2019-10-21, and all times are in
50+
UTC.
51+
52+
- **2019-10-20 00:55: crate builds started failing due to the low disk space**
53+
- **01:38: alerts fired for the docs.rs website being down, [ashleygwilliams]
54+
(backup contact) got paged**
55+
- 01:39: [QuietMisdreavus] joins into the operations channel
56+
- 01:39: [QuietMisdreavus] found the reason for the outage (full root partition)
57+
- 01:52: [ashleygwilliams] proposed to increase disk space, nobody with
58+
permissions required to so was awake or available though
59+
- 01:56: [ashleygwilliams] contacts [Mark-Simulacrum], who has the access
60+
required to increase disk space
61+
- 01:57: [QuietMisdreavus] found the directory taking up all the disk space
62+
- 02:00: [QuietMisdreavus] removed the directory taking up all the disk space
63+
- 02:03: [QuietMisdreavus] restarted the web server
64+
- **02:06: CDN propagated the changes, docs.rs back online**
65+
- 02:06: [Mark-Simulacrum] joins into the operations channel
66+
- 08:19: [pietroalbini] added builds failed during the outage back into the
67+
queue
68+
- **19:27: builds of the crates failed during the outage finished**

[ashleygwilliams]: https://github.com/ashleygwilliams
[QuietMisdreavus]: https://github.com/QuietMisdreavus
[Mark-Simulacrum]: https://github.com/Mark-Simulacrum
[pietroalbini]: https://github.com/pietroalbini
75+
## Action items
76+
77+
* Update the docs.rs source code to cleanup the offending directory
78+
automatically.
79+
* Add alerts when the available disk space on a server is below 10%.
80+
* Add alerts when most of the builds are failing.
81+
* Revisit the on-call rotation to make sure everyone on it has the
82+
permissions to either react to the incidents or escalate.
