Commit a09a140

add alertmanager docs
1 parent 167d37e commit a09a140

File tree

4 files changed

+122
-72
lines changed

ansible/roles/alertmanager/README.md

Lines changed: 0 additions & 63 deletions
@@ -101,66 +101,3 @@ Mem: 3.6Gi 2.4Gi 168Mi 11Mi 1.5Gi 1.2Gi
101101
Swap: 0B 0B 0B
102102
```
103103

104-
105-
106-
## Slack Integration
107-
108-
1. Create an app with a bot token:
109-
110-
- Go to https://api.slack.com/apps
111-
- select "Create an App"
112-
- select "From scratch"
113-
- Set app name and workspacef fields, select "Create"
114-
- Fill out "Short description" and "Background color" fields, select "Save changes"
115-
- Select "OAuth & Permissions" on left menu
116-
- Under "Scopes : Bot Token Scopes", select "Add an OAuth Scope", add
117-
`chat:write` and select "Save changes"
118-
- Select "Install App" on left menu, select "Install to your-workspace", select Allow
119-
- Copy the Bot User OAuth token shown
120-
121-
2. Add the bot token into the config and enable Slurm integration
122-
123-
- Open `environments/$ENV/inventory/group_vars/all/vault_alertmanager.yml`
124-
- Uncomment `vault_alertmanager_slack_integration_app_creds` and add the token
125-
- Vault-encrypt that file:
126-
127-
ansible-vault encrypt environments/$ENV/inventory/group_vars/all/vault_alertmanager.yml
128-
129-
- Open `environments/$ENV/inventory/group_vars/all/alertmanager.yml`
130-
- Uncomment the `alertmanager_slack_integration` mapping and set your alert channel name
131-
132-
3. Invite the bot to your alerts channel
133-
- In the appropriate Slack channel type:
134-
135-
/invite @YOUR_BOT_NAME
136-
137-
TODO: note that `prometheus_web_external_url` might need overriding too.
138-
139-
## Alert Rules
140-
141-
These are part of [Prometheus configuration](https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/), defined in the appliance at
142-
[environments/common/inventory/group_vars/all/prometheus.yml](../../../environments/common/inventory/group_vars/all/prometheus.yml).
143-
144-
A fairly-minimal set of default alert rule files is provided at
145-
`environments/common/files/prometheus/rules/`. Because compute nodes are expected
146-
to operate with heavy CPU and memory load, no alerting on those parameters is
147-
defined for those nodes.
148-
149-
By default `prometheus_alert_rules_files` is set such that any `*.rules` files
150-
in a directory `files/prometheus/rules` in the current environment or *any*
151-
parent environment are loaded. So usually, site-specific alerts should be added
152-
by creating additional rules files in `environments/site/files/prometheus/rules`.
153-
154-
Note that the Prometheus targets are defined such that each node will have labels:
155-
- `env`: `ungrouped`, by default, unless a group/host var `prometheus_env` is set
156-
- `group`: One of `login`, `control`, `compute` or `other`
157-
These may be used to limit alerts to specific sets of nodes.
158-
159-
Some ideas for future alerts which could be useful:
160-
- smartctl-exporter-based rules for baremetal nodes where the is no
161-
infrastructure-level smart monitoring
162-
- loss of "up" network interfaces
163-
164-
165-
TODO: suggest awesome alerts
166-
TODO: note that child env rule files override parent envs

docs/alerting.md

Lines changed: 106 additions & 0 deletions
@@ -0,0 +1,106 @@
1+
# Alerting
2+
3+
The [prometheus.io docs](https://prometheus.io/docs/alerting/latest/overview/)
4+
describe the overall alerting process:
5+
6+
> Alerting with Prometheus is separated into two parts. Alerting rules in
7+
Prometheus servers send alerts to an Alertmanager. The Alertmanager then
8+
manages those alerts, including silencing, inhibition, aggregation and
9+
sending out notifications via methods such as email, on-call notification
10+
systems, and chat platforms.
11+
12+
By default, both a `prometheus` server and an `alertmanager` server are
13+
deployed on the control node for new environments:
14+
15+
```ini
16+
# environments/site/groups:
17+
[prometheus:children]
18+
control
19+
20+
[alertmanager:children]
21+
control
22+
```
23+
24+
The general Prometheus configuration is described in
25+
[monitoring-and-logging.md](./monitoring-and-logging.md#defaults-3) - note this
26+
section specifies some role variables which commonly need modification.
27+
28+
The alertmanager server is defined by the [ansible/roles/alertmanager](../ansible/roles/alertmanager/README.md) role,
29+
and all the configuration options and defaults are defined there. By default
30+
it will be fully functional but:
31+
- `alertmanager_web_external_url` is likely to require modification.
32+
- A [receiver](https://prometheus.io/docs/alerting/latest/configuration/#receiver)
33+
must be defined to actually provide notifications. Currently a Slack receiver
34+
integration is provided (see below) but alternative receivers
35+
could be defined using the provided role variables.
36+
37+
## Slack receiver
38+
39+
This section describes how to enable the Slack receiver to provide notifications
40+
of alerts via Slack.
41+
42+
1. Create an app with a bot token:
43+
44+
- Go to https://api.slack.com/apps
45+
- select "Create an App"
46+
- select "From scratch"
47+
- Set app name and workspace fields, select "Create"
48+
- Fill out "Short description" and "Background color" fields, select "Save changes"
49+
- Select "OAuth & Permissions" on left menu
50+
- Under "Scopes : Bot Token Scopes", select "Add an OAuth Scope", add
51+
`chat:write` and select "Save changes"
52+
- Select "Install App" on left menu, select "Install to your-workspace", select Allow
53+
- Copy the Bot User OAuth token shown
54+
55+
2. Add the bot token into the config and enable Slack integration:
56+
57+
- Open `environments/$ENV/inventory/group_vars/all/vault_alertmanager.yml`
58+
- Uncomment `vault_alertmanager_slack_integration_app_creds` and add the token
59+
- Vault-encrypt that file:
60+
61+
ansible-vault encrypt environments/$ENV/inventory/group_vars/all/vault_alertmanager.yml
62+
63+
- Open `environments/$ENV/inventory/group_vars/all/alertmanager.yml`
64+
- Uncomment the `alertmanager_slack_integration` mapping and set your alert channel name
65+
66+
3. Invite the bot to your alerts channel
67+
- In the appropriate Slack channel type:
68+
69+
/invite @YOUR_BOT_NAME
70+
71+
72+
## Alerting Rules
73+
74+
These are part of [Prometheus configuration](https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/)
75+
which is defined in the appliance at
76+
[environments/common/inventory/group_vars/all/prometheus.yml](../environments/common/inventory/group_vars/all/prometheus.yml).
77+
78+
Two `cloudalchemy.prometheus` role variables are relevant:
79+
- `prometheus_alert_rules_files`: Paths to check for files providing rules.
80+
Note these are copied to Prometheus config directly, so Jinja expressions for
81+
Prometheus do not need escaping.
82+
- `prometheus_alert_rules`: YAML-format rules. Jinja templating here will be
83+
interpolated by Ansible, so templating intended for Prometheus must be escaped
84+
using `{% raw %}`/`{% endraw %}` tags.
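
As a rough illustration of the escaping described for `prometheus_alert_rules`, the sketch below defines one rule inline. The alert name, threshold and severity are hypothetical; only the variable name and the `{% raw %}`/`{% endraw %}` requirement come from the text above.

```yaml
# Hypothetical override, e.g. in an environment's group_vars (location is an assumption).
prometheus_alert_rules:
  - alert: InstanceDown              # hypothetical alert name
    expr: up == 0
    for: 5m
    labels:
      severity: critical             # hypothetical severity value
    annotations:
      # Ansible would otherwise try to interpolate {{ ... }} itself, so the
      # Prometheus template is wrapped in raw/endraw tags.
      summary: "{% raw %}Instance {{ $labels.instance }} is down{% endraw %}"
```

Without the raw tags, Ansible would attempt to resolve `$labels.instance` as its own template expression and templating would most likely fail.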
85+
86+
By default, `prometheus_alert_rules_files` is set so that any `*.rules` files
87+
in a directory `files/prometheus/rules` in the current environment or *any*
88+
parent environment are loaded. So usually, site-specific alerts should be added
89+
by creating additional rules files in `environments/site/files/prometheus/rules`.
90+
If the same file exists in more than one environment, the "child" file will take
91+
precedence and any rules in the "parent" file will be ignored.
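
A minimal sketch of a site-specific rules file follows, assuming a hypothetical filename `site.rules` and a hypothetical alert. Because rules files are copied to the Prometheus config directly, the `{{ ... }}` template is not escaped here. The `group` label is the one described in [monitoring-and-logging.md](./monitoring-and-logging.md#prometheus_node_exporter_targets).

```yaml
# environments/site/files/prometheus/rules/site.rules  (hypothetical filename)
groups:
  - name: site                       # hypothetical rule group name
    rules:
      - alert: LoginNodeUnreachable  # hypothetical alert name
        # Fires when a scrape target labelled as a login node stops responding.
        expr: up{group="login"} == 0
        for: 5m
        labels:
          severity: critical         # hypothetical severity value
        annotations:
          summary: "Login node {{ $labels.instance }} is not responding to scrapes"
```

Giving the file a name not used in any parent environment avoids unintentionally overriding a default rules file.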
92+
93+
A set of default alert rule files is provided at `environments/common/files/prometheus/rules/`.
94+
These cover:
95+
- Some node-exporter metrics for disk, filesystems, memory and clock. Note
96+
no alerts are triggered on memory for compute nodes due to the intended use
97+
of those nodes.
98+
- Slurm nodes in DOWN or FAIL states, or the Slurm DBD message queue being too
99+
large, usually indicating a database problem.
100+
101+
When defining additional rules, note the [labels defined](./monitoring-and-logging.md#prometheus_node_exporter_targets) for node-exporter targets.
102+
103+
In future more alerts may be added for:
104+
- smartctl-exporter-based rules for baremetal nodes where there is no
105+
infrastructure-level smart monitoring
106+
- loss of "up" network interfaces

docs/monitoring-and-logging.md

Lines changed: 15 additions & 7 deletions
@@ -215,6 +215,12 @@ Internally, we use the [cloudalchemy.prometheus](https://github.com/cloudalchemy
215215

216216
> [environments/common/inventory/group_vars/all/prometheus.yml](../environments/common/inventory/group_vars/all/prometheus.yml)
217217
218+
Prometheus will be functional by default but the following variables should
219+
commonly be modified:
220+
- `prometheus_web_external_url`
221+
- `prometheus_storage_retention`
222+
- `prometheus_storage_retention_size`
223+
218224
### Placement
219225

220226
The `prometheus` group determines the placement of the prometheus service. Load balancing is currently unsupported so it is important that you only assign one host to this group.
@@ -240,12 +246,7 @@ This appliance provides a default set of recording rules which can be found here
240246
The intended purpose is to pre-compute some expensive queries that are used
241247
in the reference set of grafana dashboards.
242248

243-
To add new, or to remove rules you will be to adjust the `prometheus_alert_rules_files` variable. The default value can be found in:
244-
245-
> [environments/common/inventory/group_vars/all/prometheus.yml](../environments/common/inventory/group_vars/all/prometheus.yml)
246-
247-
You can extend this variable in your environment specific configuration to reference extra files or to remove the defaults. The reference set of dashboards expect these variables to be defined, so if you remove them, you
248-
will also have to update your dashboards.
249+
For information on configuring alerting rules see [docs/alerting.md#alerting-rules](./alerting.md#alerting-rules).
249250

250251
### node_exporter
251252

@@ -273,7 +274,14 @@ Variables in this file should *not* be customised directly, but should be overri
273274

274275
#### prometheus_node_exporter_targets
275276

276-
Groups prometheus targets into per environment groups. The ansible variable, `env` is used to determine the grouping. The metrics for each target in the group are given the prometheus label, `env: $env`, where `$env` is the value of the `env` variable for that host.
277+
Groups prometheus targets. Metrics from `node_exporter` hosts have two labels
278+
applied:
279+
- `env`: This is set from the Ansible variable `prometheus_env` if present
280+
(e.g. from hostvars or groupvars), defaulting to `ungrouped`. This can be
281+
used to group metrics by some arbitrary "environment", e.g. rack.
282+
- `group`: This refers to the "top-level" inventory group for the host and
283+
is one of `control`, `login`, `compute` or `other`. This can be used to
284+
define rules for specific host functionalities.
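
As an illustration of setting the `env` label, the sketch below assumes a hypothetical inventory group `rack1` with its own group_vars file. Only the variable name `prometheus_env` and its `ungrouped` default come from the text above.

```yaml
# environments/site/inventory/group_vars/rack1.yml  (hypothetical group and path)
# Hosts in this group get the Prometheus label env="rack1" instead of the
# default env="ungrouped".
prometheus_env: rack1
```

Queries and alert rules could then select these hosts with a matcher such as `{env="rack1"}`.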
277285

278286
## slurm-stats
279287

docs/production.md

Lines changed: 1 addition & 2 deletions
@@ -150,5 +150,4 @@ and referenced from the `site` and `production` environments, e.g.:
150150
In general it should be possible to raise this value to 50-100 if the cloud
151151
is properly tuned, again, demonstrated through testing.
152152

153-
- Enable alertmanager following the [role docs](../ansible/roles/alertmanager/README.md)
154-
if Slack is available.
153+
- Enable alertmanager if Slack is available - see [docs/alerting.md](./alerting.md).
