Commit e4aecba

pietroalbini authored and XAMPPRocky committed
infra: move services docs from the infra-team repo
1 parent 715397a commit e4aecba

8 files changed

+397
-0
lines changed

src/SUMMARY.md

Lines changed: 7 additions & 0 deletions
@@ -13,6 +13,13 @@
1313
- [Service Infrastructure](./infra/service-infrastructure.md)
1414
- [Team Maintenance](./infra/team-maintenance.md)
1515
- [The Toolstate System](./infra/toolstate.md)
16+
- [Documentation](./infra/docs/README.md)
17+
- [Bastion server](./infra/docs/bastion.md)
18+
- [Crater agents](./infra/docs/crater-agents.md)
19+
- [Discord moderation bot](./infra/docs/discord-mods-bot.md)
20+
- [docs.rs](./infra/docs/docs-rs.md)
21+
- [Monitoring](./infra/docs/monitoring.md)
22+
- [rust-bots server](./infra/docs/rust-bots.md)
1623
- [Language](./lang/README.md)
1724
- [RFC Merge Procedure](./lang/rfc-merge-procedure.md)
1825
- [Release](./release/README.md)

src/infra/docs/README.md

Lines changed: 5 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,5 @@
1+
# Infrastructure team documentation
2+
3+
This section contains the documentation about the services hosted and managed
4+
by the Rust Infrastructure Team. Most of the linked resources and instructions
5+
are only available to infra team members though.

src/infra/docs/bastion.md

Lines changed: 78 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,78 @@
1+
# Bastion server
2+
3+
* FQDN: `bastion.infra.rust-lang.org`
4+
* [Ansible playbook][ansible] to deploy this server.
5+
* [Terraform configuration][terraform] to create AWS resources.
6+
* [Instance metrics][grafana] (only available to infra team members).
7+
8+
## Logging into servers through the bastion
9+
10+
To improve the security of our infrastructure it's not possible to connect
11+
directly to a production server with SSH. Instead, all connections must come
12+
from a small server called the "bastion", which only allows connections from a
13+
few whitelisted networks and logs any connection attempt.
14+
15+
To log into a server through the bastion you can use SSH's `-J` flag:
16+
17+
```
18+
ssh -J bastion.infra.rust-lang.org servername.infra.rust-lang.org
19+
```
20+
21+
It's also possible to configure SSH to always jump through the bastion when
22+
connecting to a host. Add this snippet to your SSH configuration file (usually
23+
located in `~/.ssh/config`):
24+
25+
```
26+
Host servername.infra.rust-lang.org
27+
    ProxyJump bastion.infra.rust-lang.org
28+
```
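
To confirm that the jump host is picked up, you can ask OpenSSH to print the configuration it would use for a host. This is only a sketch: it assumes an OpenSSH version that supports `ssh -G`, and `effective_proxyjump` is a hypothetical helper name, not part of any existing tooling.

```shell
#!/bin/sh
# Hypothetical helper: reads `ssh -G <host>` output on stdin and prints
# the ProxyJump value SSH would use for that host, if any.
effective_proxyjump() {
    awk 'tolower($1) == "proxyjump" { print $2 }'
}

# Example (only evaluates the configuration, no connection is opened):
#   ssh -G servername.infra.rust-lang.org | effective_proxyjump
```

If nothing is printed, the `Host` pattern most likely doesn't match the name you passed.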
29+
30+
Please remember the bastion server only allows connections from a small list of
31+
IP addresses. Infra team members with AWS access can change the whitelist, but
32+
it's good practice to either have your own bastion server or a static IP
33+
address.
34+
35+
The SSH keys authorized to log into each account are stored in the [simpleinfra
36+
repository][keys]. Additionally, people with sensitive 1password access can use
37+
the master key stored in the vault to log into every account, provided their
38+
connection comes from any whitelisted IP.
39+
40+
## Common maintenance procedures
41+
42+
### Adding a new user to the bastion server
43+
44+
To add a new user to the bastion you need to add their key to a file named
45+
`<username>.pub` in [`ansible/roles/common/files/ssh-keys`][keys], and change
46+
the [Ansible playbook][ansible] adding the user to the list of unprivileged
47+
users. Please leave a comment clarifying which servers the user will have
48+
access to.
49+
50+
Once that's done [apply the playbook][ansible-apply] and [add a new whitelisted
51+
IP address](#updating-the-whitelisted-ips).
52+
53+
### Updating the whitelisted IPs
54+
55+
For privacy reasons, all the static IP addresses of team members with access
56+
to the bastion are stored on [AWS SSM Parameter Store][ssm] instead of public
57+
git repositories. To add or update an IP address you can run this command
58+
(taking care to replace `USERNAME` and `IP_ADDRESS` with the proper values):
59+
60+
```
61+
aws ssm put-parameter --type String --name "/prod/bastion/allowed-ips/USERNAME" --value "IP_ADDRESS/32"
62+
```
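
Since a typo in the username or address ends up in the production whitelist, it can help to build and review the command before running it. The helper below is only a sketch: `allowed_ip_command` is a hypothetical name, it checks just the rough shape of an IPv4 address, and it prints the command instead of executing it. Note that the AWS CLI refuses to replace an existing parameter unless `--overwrite` is passed, so that flag is likely needed when updating an address.

```shell
#!/bin/sh
# Hypothetical helper: prints the put-parameter command for a user and
# IPv4 address so it can be reviewed before being run by hand.
allowed_ip_command() {
    username="$1"
    ip="$2"
    # Rough sanity check only: four dot-separated groups of 1-3 digits.
    if ! printf '%s\n' "$ip" | grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}$'; then
        echo "error: '$ip' does not look like an IPv4 address" >&2
        return 1
    fi
    printf 'aws ssm put-parameter --type String --name "/prod/bastion/allowed-ips/%s" --value "%s/32"\n' \
        "$username" "$ip"
}

# Example:
#   allowed_ip_command USERNAME 203.0.113.7
```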
63+
64+
If you're adding an IP address instead of updating it, you'll also need to add
65+
the username to the list in [`terraform/services.tf`][allowed-ips] (key
66+
`allowed_users` in the `service_bastion` module).
67+
68+
Once you've made all the needed changes you need to [apply the
69+
Terraform configuration][terraform-apply].
70+
71+
[ansible]: https://github.com/rust-lang/simpleinfra/blob/master/ansible/playbooks/bastion.yml
72+
[terraform]: https://github.com/rust-lang/simpleinfra/tree/master/terraform/services/bastion
73+
[grafana]: https://grafana.rust-lang.org/d/rpXrFfKWz/instance-metrics?orgId=1&var-instance=bastion.infra.rust-lang.org:9100
74+
[keys]: https://github.com/rust-lang/simpleinfra/tree/master/ansible/roles/common/files/ssh-keys
75+
[ansible-apply]: https://github.com/rust-lang/simpleinfra/blob/master/ansible/README.md#executing-a-playbook
76+
[ssm]: https://docs.aws.amazon.com/systems-manager/latest/userguide/systems-manager-parameter-store.html
77+
[allowed-ips]: https://github.com/rust-lang/simpleinfra/blob/master/terraform/services.tf
78+
[terraform-apply]: https://github.com/rust-lang/simpleinfra/tree/master/terraform#applying-the-configuration

src/infra/docs/crater-agents.md

Lines changed: 59 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,59 @@
1+
# Crater agents
2+
3+
* Source code: [rust-lang/crater][repo]
4+
* Hosted on:
5+
  * `crater-aws-1.infra.rust-lang.org` (behind the bastion -- [how to connect][bastion-connect])
6+
  * `crater-azure-1.infra.rust-lang.org` (behind the bastion -- [how to connect][bastion-connect])
7+
* Maintainers: [pietroalbini]
8+
* [Application metrics][grafana-app] (only available to infra team members).
9+
* Instance metrics (only available to infra team members):
10+
  * [`crater-aws-1.infra.rust-lang.org`][grafana-instance-aws-1]
11+
  * [`crater-azure-1.infra.rust-lang.org`][grafana-instance-azure-1]
12+
13+
## Service configuration
14+
15+
Crater agents are servers with our standard configuration running a Docker
16+
container hosting the agent. A timer checks for updates every 5 minutes, and if
17+
a newer Docker image is present the container will automatically be updated and
18+
restarted. This service is [managed with Ansible][ansible].
19+
20+
## Common maintenance procedures
21+
22+
### Starting and stopping the agent
23+
24+
The agent is managed by the `container-crater-agent.service` systemd unit. That
25+
means it's possible to start, stop and restart it with the usual systemctl
26+
commands:
27+
28+
```
29+
systemctl stop container-crater-agent.service
30+
systemctl start container-crater-agent.service
31+
systemctl restart container-crater-agent.service
32+
```
33+
34+
### Inspecting the logs of the agent
35+
36+
Logs of the agents are forwarded and collected by journald. To see them you can
37+
use journalctl:
38+
39+
```
40+
journalctl -u container-crater-agent.service
41+
```
42+
43+
### Manually updating the container image
44+
45+
The container is updated automatically every 5 minutes (provided a newer image
46+
is present). If you need to update it sooner you can manually start the
47+
updater service by running this command:
48+
49+
```
50+
systemctl start docker-images-update.service
51+
```
52+
53+
[repo]: https://github.com/rust-lang/crater
54+
[bastion-connect]: https://github.com/rust-lang/infra-team/blob/master/docs/hosts/bastion.md#logging-into-servers-through-the-bastion
55+
[pietroalbini]: https://github.com/pietroalbini
56+
[grafana-instance-aws-1]: https://grafana.rust-lang.org/d/rpXrFfKWz/instance-metrics?orgId=1&var-instance=crater-aws-1.infra.rust-lang.org:9100
57+
[grafana-instance-azure-1]: https://grafana.rust-lang.org/d/rpXrFfKWz/instance-metrics?orgId=1&var-instance=crater-azure-1.infra.rust-lang.org:9100
58+
[grafana-app]: https://grafana.rust-lang.org/d/WLeJySTZz/crater?orgId=1
59+
[ansible]: https://github.com/rust-lang/simpleinfra/blob/master/ansible/playbooks/crater.yml

src/infra/docs/discord-mods-bot.md

Lines changed: 44 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,44 @@
1+
# Discord moderation bot
2+
3+
* Source code: [rust-lang-nursery/discord-mods-bot][repo]
4+
* Hosted on: [`bots.infra.rust-lang.org`][rust-bots] (behind the bastion -- [how to connect][bastion-connect])
5+
* Maintainers: [technetos]
6+
7+
## Service configuration
8+
9+
The service uses a PostgreSQL database called `discord_mods_bot` hosted on the
10+
same server, and connects to it with the `discord_mods_bot` user. Backups are
11+
not yet set up for the database contents.
12+
13+
The service is run with docker-compose on the home of the `ec2-user` user, and
14+
its docker image is hosted on ECR. The image is automatically rebuilt by the
15+
git repository's CI every time a new commit is pushed to master.
16+
17+
The server doesn't expose anything to the outside, as it uses websockets to
18+
communicate with Discord.
19+
20+
The bot is [`rustbot#4299`][devportal]. [pietroalbini], [Mark-Simulacrum],
21+
[alexcrichton] and [aidanhs] have access to the developer portal.
22+
23+
## Common maintenance procedures
24+
25+
### Deploying changes to the bot
26+
27+
Once the CI build on the `master` branch of [the repo][repo] ends you can SSH
28+
into the server and run this command:
29+
30+
```
31+
./redeploy
32+
```
33+
34+
The command might also redeploy other services hosted on the same server.
35+
36+
[repo]: https://github.com/rust-lang-nursery/discord-mods-bot
37+
[rust-bots]: https://github.com/rust-lang/infra-team/blob/master/docs/hosts/rust-bots.md
38+
[bastion-connect]: https://github.com/rust-lang/infra-team/blob/master/docs/hosts/bastion.md#logging-into-servers-through-the-bastion
39+
[devportal]: https://discordapp.com/developers/applications/615806512790503424
40+
[technetos]: https://github.com/technetos
41+
[pietroalbini]: https://github.com/pietroalbini
42+
[Mark-Simulacrum]: https://github.com/Mark-Simulacrum
43+
[alexcrichton]: https://github.com/alexcrichton
44+
[aidanhs]: https://github.com/aidanhs

src/infra/docs/docs-rs.md

Lines changed: 77 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,77 @@
1+
# docs.rs
2+
3+
* Source code: [rust-lang/docs.rs][repo]
4+
* Hosted on: `docsrs.infra.rust-lang.org` (behind the bastion -- [how to connect][bastion-connect])
5+
* Maintainers: [onur][onur], [QuietMisdreavus][QuietMisdreavus]
6+
* [Instance metrics][grafana-instance] (only available to infra team members).
7+
* [Application metrics][grafana-app] (only available to infra team members).
8+
9+
## Common maintenance procedures
10+
11+
### Temporarily remove a crate from the queue
12+
13+
It might happen that a crate fails to build repeatedly due to a docs.rs bug,
14+
clogging up the queue and preventing other crates from building. In this case it's
15+
possible to temporarily remove the crate from the queue until the docs.rs bug
16+
is fixed. To do that, log into the machine and open a PostgreSQL shell with:
17+
18+
```
19+
psql
20+
```
21+
22+
Then you can run this SQL query to remove the crate:
23+
24+
```
25+
UPDATE queue SET attempt = 100 WHERE name = '<CRATE_NAME>';
26+
```
27+
28+
To add the crate back to the queue you can run this query in the PostgreSQL
29+
shell:
30+
31+
```
32+
UPDATE queue SET attempt = 0 WHERE name = '<CRATE_NAME>';
33+
```
34+
35+
### Pinning a version of nightly
36+
37+
Sometimes the latest nightly might be broken, causing doc builds to fail. In
38+
those cases it's possible to tell docs.rs to stop updating to the latest
39+
nightly and instead pin a specific release. To do that you need to edit the
40+
`/home/cratesfyi/.docs-rs-env` file, adding or changing this environment
41+
variable:
42+
43+
```
44+
CRATESFYI_TOOLCHAIN=nightly-YYYY-MM-DD
45+
```
46+
47+
Once the file has changed, docs.rs needs to be restarted:
48+
49+
```
50+
systemctl restart docs.rs
51+
```
52+
53+
To return to the latest nightly simply remove the environment variable and
54+
restart docs.rs again.
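
The edit described above can be sketched as a small shell function. This is an illustration under assumptions, not an existing docs.rs tool: `pin_toolchain` is a hypothetical name, it relies on `mktemp` being available, and it only accepts values shaped like `nightly-YYYY-MM-DD`.

```shell
#!/bin/sh
# Hypothetical helper: set or replace CRATESFYI_TOOLCHAIN in an env file.
pin_toolchain() {
    env_file="$1"
    toolchain="$2"
    # Only accept values shaped like nightly-YYYY-MM-DD.
    case "$toolchain" in
        nightly-[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]) ;;
        *)
            echo "error: expected nightly-YYYY-MM-DD, got '$toolchain'" >&2
            return 1
            ;;
    esac
    [ -f "$env_file" ] || { echo "error: $env_file not found" >&2; return 1; }
    tmp="$(mktemp)" || return 1
    # Keep every line except an existing assignment (grep exits non-zero
    # when nothing is left, which is fine here).
    grep -v '^CRATESFYI_TOOLCHAIN=' "$env_file" > "$tmp" || true
    printf 'CRATESFYI_TOOLCHAIN=%s\n' "$toolchain" >> "$tmp"
    # Write back through the original path so owner and mode are kept.
    cat "$tmp" > "$env_file" && rm -f "$tmp"
}
```

On the server this would be followed by the restart, for example `pin_toolchain /home/cratesfyi/.docs-rs-env nightly-YYYY-MM-DD && systemctl restart docs.rs` with a real date filled in.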
55+
56+
### Adding all the crates that failed after a date back to the queue
57+
58+
After an outage you might want to add all the failed builds back to the queue.
59+
To do that, log into the machine and open a PostgreSQL shell with:
60+
61+
```
62+
psql
63+
```
64+
65+
Then you can run this SQL query to add all the crates that failed after `YYYY-MM-DD
66+
HH:MM:SS` back in the queue:
67+
68+
```
69+
UPDATE queue SET attempt = 0 WHERE attempt >= 5 AND build_time > 'YYYY-MM-DD HH:MM:SS';
70+
```
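
Before running the `UPDATE`, it may be worth listing the rows it would touch. The sketch below builds a read-only `SELECT` with the same `WHERE` clause, using only the `name`, `attempt` and `build_time` columns that appear in the queries in this document. `requeue_preview_sql` is a hypothetical helper name, and the strict timestamp check is an assumption made to avoid splicing unexpected text into the query.

```shell
#!/bin/sh
# Hypothetical helper: prints a read-only query listing the queue rows
# that the UPDATE would reset for a given cutoff timestamp.
requeue_preview_sql() {
    since="$1"
    # Accept only 'YYYY-MM-DD HH:MM:SS'.
    if ! printf '%s\n' "$since" | grep -Eq '^[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}$'; then
        echo "error: expected 'YYYY-MM-DD HH:MM:SS'" >&2
        return 1
    fi
    printf "SELECT name, attempt, build_time FROM queue WHERE attempt >= 5 AND build_time > '%s';\n" "$since"
}

# Example, run on the docs.rs machine:
#   psql -c "$(requeue_preview_sql 'YYYY-MM-DD HH:MM:SS')"   # with a real timestamp
```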
71+
72+
[repo]: https://github.com/rust-lang/docs.rs
73+
[grafana-instance]: https://grafana.rust-lang.org/d/rpXrFfKWz/instance-metrics?orgId=1&var-instance=docsrs.infra.rust-lang.org:9100
74+
[grafana-app]: https://grafana.rust-lang.org/d/-wWFg2cZz/docs-rs?orgId=1
75+
[bastion-connect]: https://github.com/rust-lang/infra-team/blob/master/docs/hosts/bastion.md#logging-into-servers-through-the-bastion
76+
[onur]: https://github.com/onur
77+
[QuietMisdreavus]: https://github.com/QuietMisdreavus

src/infra/docs/monitoring.md

Lines changed: 94 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,94 @@
1+
# Monitoring
2+
3+
* Hosted on: `monitoring.infra.rust-lang.org` (behind the bastion -- [how to connect][bastion-connect])
4+
* Maintainers: [pietroalbini], infra team
5+
* Public URL: [grafana.rust-lang.org](https://grafana.rust-lang.org)
6+
* [Ansible playbook][ansible-playbook] to deploy this server.
7+
* [Instance metrics][grafana-instance] (only available to infra team members).
8+
9+
## Service configuration
10+
11+
Our monitoring service is composed of three parts: [Prometheus] to scrape,
12+
collect and monitor metrics, [Alertmanager] to dispatch the alerts generated by
13+
Prometheus, and [Grafana] to display the metrics. All the parts are configured
14+
through [Ansible].
15+
16+
The metrics are not backed up, as Prometheus purges them after 7 days anyway,
17+
but the Grafana dashboards are stored in a PostgreSQL database, which is backed
18+
up with [restic] in the `rust-backups` bucket (`monitoring` subdirectory). The
19+
password to decrypt the backups is in 1password.
20+
21+
## Common maintenance procedures
22+
23+
### Scrape a new metrics source
24+
25+
Prometheus works by periodically scraping a list of HTTP endpoints for metrics,
26+
written [in its custom format][metrics-format]. In our configuration the list
27+
is located in the `prometheus_scrape` section of the
28+
`ansible/playbooks/monitoring.yml` file in the [simpleinfra] repository.
29+
30+
To add a new metrics source, add your endpoint to an existing job or, if the
31+
metrics you're scraping are not related to any other job, a new one. The
32+
endpoint must be reachable from the monitoring instance. You can read the
33+
[Prometheus documentation][prometheus-scrape] to find all the available
34+
options.
35+
36+
### Create a new alert
37+
38+
Alerts are generated by Prometheus every time a custom rule defined in its
39+
configuration evaluates to true. In our configuration the list of rules is
40+
located in the `prometheus_rule_groups` section of the
41+
`ansible/playbooks/monitoring.yml` file in the [simpleinfra] repository.
42+
43+
To add a new alert you need to create an alerting rule either in an existing
44+
group or a new one. The full list of options is available in the [Prometheus
45+
documentation][prometheus-alert].
46+
47+
### Add permissions to a user
48+
49+
There are two steps needed to grant access to [our Grafana
50+
instance][grafana-ours] to a user.
51+
52+
First of all, to enable the user to log into the instance with their GitHub
53+
account they need to be a member of a team authorized to log in. The list of
54+
teams is defined in the `grafana_github_teams` section of the
55+
`ansible/playbooks/monitoring.yml` file in the [simpleinfra] repository, and it
56+
contains a list of GitHub team IDs. To fetch an ID you can run this command:
57+
58+
```
59+
curl -H "Authorization: token $GITHUB_TOKEN" https://api.github.com/orgs/<ORG>/teams/<NAME> | jq .id
60+
```
61+
62+
Once the user is a member of a team authorized to log in they will
63+
automatically be added to the main Grafana organization with "viewer"
64+
permissions. For infrastructure team members that needs to be changed to
65+
"admin" (in the "Configuration" -> "Users" page), otherwise leave it as viewer.
66+
67+
By default a viewer only has access to the unrestricted dashboards. To grant
68+
access to other dashboards you'll need to add them to a team (in the
69+
"Configuration" -> "Teams" page). It's also possible to grant admin privileges
70+
to the whole Grafana instance in the "Server Admin" -> "Users" ->
71+
"`<username>`" page. Do not grant those permissions except to trusted infra
72+
team members.
73+
74+
## Additional resources
75+
76+
* [Prometheus documentation][prometheus-docs]
77+
* [Grafana documentation][grafana-docs]
78+
79+
[bastion-connect]: https://github.com/rust-lang/infra-team/blob/master/docs/hosts/bastion.md#logging-into-servers-through-the-bastion
80+
[pietroalbini]: https://github.com/pietroalbini
81+
[ansible-playbook]: https://github.com/rust-lang/simpleinfra/blob/master/ansible/playbooks/monitoring.yml
82+
[grafana-instance]: https://grafana.rust-lang.org/d/rpXrFfKWz/instance-metrics?orgId=1&var-instance=monitoring.infra.rust-lang.org:9100
83+
[Prometheus]: https://prometheus.io
84+
[Alertmanager]: https://prometheus.io/docs/alerting/alertmanager/
85+
[Grafana]: https://grafana.com
86+
[Ansible]: https://github.com/rust-lang/simpleinfra/tree/master/ansible
87+
[restic]: https://restic.net
88+
[metrics-format]: https://prometheus.io/docs/instrumenting/exposition_formats/
89+
[simpleinfra]: https://github.com/rust-lang/simpleinfra
90+
[prometheus-scrape]: https://prometheus.io/docs/prometheus/latest/configuration/configuration/#scrape_config
91+
[prometheus-alert]: https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/
92+
[grafana-ours]: https://grafana.rust-lang.org
93+
[prometheus-docs]: https://prometheus.io/docs/introduction/overview/
94+
[grafana-docs]: https://grafana.com/docs/
