|
| 1 | +--- |
| 2 | +layout: post |
| 3 | +title: "docs.rs outage postmortem" |
| 4 | +author: Pietro Albini |
| 5 | +team: the infrastructure team <https://www.rust-lang.org/governance/teams/operations#infra> |
| 6 | +--- |
| 7 | + |
| 8 | +At 2019-10-21 01:38 UTC the docs.rs website went down because the server |
| 9 | +hosting the application ran out of disk space. Crate builds had been failing |
| 10 | +since 2019-10-20 00:55 UTC for the same reason. |
| 11 | + |
| 12 | +## Root cause of the outage |
| 13 | + |
| 14 | +docs.rs needs to store the built documentation on the filesystem before |
| 15 | +uploading it to the database, and it does so in the |
| 16 | +`/opt/docs-rs-prefix/documentations` directory. docs.rs never cleared that |
| 17 | +directory though, so it kept growing until it filled the disk and caused |
| 18 | +this outage. Code to periodically purge temporary directories was present, but |
| 19 | +it was never configured to purge the one which caused the outage. |
| 20 | + |
| 21 | +## Resolution |
| 22 | + |
| 23 | +As the directory doesn’t contain any persistent data, we cleared it and |
| 24 | +restarted the web server. Once we were confident the situation was resolved, we |
| 25 | +queued all the crates that failed due to the outage for a rebuild. |
| 26 | + |
| 27 | +## Postmortem |
| 28 | + |
| 29 | +Disk usage increased gradually over weeks, slowly reaching 100% and |
| 30 | +causing the outage. While monitoring systems were in place and recorded graphs |
| 31 | +of the increase, no alert was configured so nobody noticed the problem. We need |
| 32 | +to add alerts when disk usage reaches 90%, so the problem can be investigated |
| 33 | +and dealt with in time. |
| 34 | + |
| 35 | +Crates started to fail to build a day earlier, and almost no builds |
| 36 | +completed successfully after that. We need to set up alerts when most of the |
| 37 | +builds are failing: as we don’t have the necessary metrics at the moment to |
| 38 | +reliably alert, we'll have to add extra instrumentation as well. |
| 39 | + |
| 40 | +Our response was slowed down by issues with our on-call rotation for the |
| 41 | +service. The primary contacts don’t have the level of access required to |
| 42 | +increase the disk space of the instance (a temporary fix that was |
| 43 | +investigated at first but discarded once we discovered nobody awake could |
| 44 | +apply it), and the backup contacts don’t have any production access or |
| 45 | +expertise on docs.rs. |
| 46 | + |
| 47 | +## Timeline of events |
| 48 | + |
| 49 | +Unless otherwise noted, all events happened on 2019-10-21, and all times are in |
| 50 | +UTC. |
| 51 | + |
| 52 | +- **2019-10-20 00:55: crate builds started failing due to the low disk space** |
| 53 | +- **01:38: alerts fired for the docs.rs website being down, [ashleygwilliams] |
| 54 | + (backup contact) got paged** |
| 55 | +- 01:39: [QuietMisdreavus] joined the operations channel |
| 56 | +- 01:39: [QuietMisdreavus] found the reason for the outage (full root partition) |
| 57 | +- 01:52: [ashleygwilliams] proposed to increase disk space, but nobody with the |
| 58 | + permissions required to do so was awake or available |
| 59 | +- 01:56: [ashleygwilliams] contacted [Mark-Simulacrum], who has the access |
| 60 | + required to increase disk space |
| 61 | +- 01:57: [QuietMisdreavus] found the directory taking up all the disk space |
| 62 | +- 02:00: [QuietMisdreavus] removed the directory taking up all the disk space |
| 63 | +- 02:03: [QuietMisdreavus] restarted the web server |
| 64 | +- **02:06: CDN propagated the changes, docs.rs back online** |
| 65 | +- 02:06: [Mark-Simulacrum] joined the operations channel |
| 66 | +- 08:19: [pietroalbini] added the builds that failed during the outage back |
| 67 | + into the queue |
| 68 | +- **19:27: builds of the crates that failed during the outage finished** |
| 69 | + |
| 70 | +[ashleygwilliams]: https://github.com/ashleygwilliams |
| 71 | +[QuietMisdreavus]: https://github.com/QuietMisdreavus |
| 72 | +[Mark-Simulacrum]: https://github.com/Mark-Simulacrum |
| 73 | +[pietroalbini]: https://github.com/pietroalbini |
| 74 | + |
| 75 | +## Action items |
| 76 | + |
| 77 | +* Update the docs.rs source code to clean up the offending directory |
| 78 | + automatically. |
| 79 | +* Add alerts when the available disk space on a server is below 10%. |
| 80 | +* Add alerts when most of the builds are failing. |
| 81 | +* Revisit the on-call rotation to make sure everyone on it has the |
| 82 | + permissions to either react to incidents or escalate. |
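
The two alerting action items above boil down to simple threshold checks. The following is a minimal Python sketch of those checks, not the monitoring configuration the infrastructure team actually uses. The function names, the 10-build minimum sample size, and the idea of passing build outcomes as a list of booleans are all assumptions made for illustration; only the 10% free-space threshold and the "most builds failing" condition come from the post.

```python
import shutil
from typing import Sequence


def free_percent(path: str) -> float:
    """Return the percentage of the filesystem holding `path` that is free."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.free / usage.total


def disk_alert_needed(free_pct: float, min_free_pct: float = 10.0) -> bool:
    """Alert when available disk space drops below the threshold (10% in the post)."""
    return free_pct < min_free_pct


def most_builds_failing(outcomes: Sequence[bool], min_builds: int = 10) -> bool:
    """Alert when more than half of the recent builds failed.

    `outcomes` holds True for a successful build and False for a failed one.
    `min_builds` is a hypothetical guard so a handful of builds cannot
    trigger the alert on its own.
    """
    if len(outcomes) < min_builds:
        return False
    failures = sum(1 for ok in outcomes if not ok)
    return failures * 2 > len(outcomes)


if __name__ == "__main__":
    # The path is an assumption; the outage involved the root partition.
    pct = free_percent("/")
    print(f"free: {pct:.1f}%, alert: {disk_alert_needed(pct)}")
```

In practice these conditions would more likely be expressed as rules in whatever monitoring system already records the disk usage graphs, but the logic being evaluated is the same.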