Commit d806dee

alertmanager docs/defaults
1 parent a9cd55e commit d806dee

4 files changed: +120 -32 lines changed

ansible/roles/alertmanager/README.md

Lines changed: 104 additions & 21 deletions
@@ -1,28 +1,90 @@
11
# alertmanager
22

3+
Deploy [alertmanager](https://prometheus.io/docs/alerting/latest/alertmanager/)
4+
to route Prometheus alerts to a receiver. Currently Slack is the only supported
5+
receiver.
36

4-
notes:
5-
- HA is not supported
6-
- state ("notification state and configured silences") is not preserved across rebuild
7-
- not used for caas
8-
- no dashboard
7+
Note that:
8+
- HA configuration is not supported
9+
- Alertmanager state is not preserved when the node it runs on (by default,
10+
control node) is reimaged, so any alerts silenced via the GUI will reoccur.
11+
- No Grafana dashboard for alerts is currently provided.
912

13+
- not used for caas - todo - maybe we should disable by default, unless everyone has slack?
14+
- have fixed bug with `env` hostvar for prom, now called `prometheus_env`
15+
- have added label "group" to prom for control,compute,login
1016

17+
Alertmanager is enabled by default on the `control` node in the
18+
[everything](../../../environments/common/layouts/everything) template which
19+
`cookiecutter` uses for a new environment's `inventory/groups` file.
20+
21+
In general usage may only require:
22+
- Adding the `control` node into the `alertmanager` group in `environments/site/groups`
23+
if upgrading an existing environment.
24+
- Enabling the Slack integration (see below).
1125

1226
## Role variables
1327

28+
All variables are optional. See [defaults/main.yml](defaults/main.yml) for
29+
all default values.
30+
31+
General variables:
32+
- `alertmanager_version`: String, version (no leading 'v')
33+
- `alertmanager_download_checksum`: String, checksum for relevant version from
34+
[prometheus.io download page](https://prometheus.io/download/), in format
35+
`type:value`.
36+
- `alertmanager_download_dest`: String, path of temporary directory used for
37+
download. Must exist.
38+
- `alertmanager_binary_dir`: String, path of directory to install alertmanager
39+
binary to. Must exist.
40+
- `alertmanager_started`: Bool, whether the alertmanager service should be started.
41+
- `alertmanager_enabled`: Bool, whether the alertmanager service should be enabled.
42+
- `alertmanager_system_user`: String, name of user to run alertmanager as. Will be created.
43+
- `alertmanager_system_group`: String, name of group of alertmanager user.
44+
- `alertmanager_port`: Port to listen on.
45+
1446
The following variables are equivalent to similarly-named arguments to the
1547
`alertmanager` binary. See `man alertmanager` for more info:
1648

17-
- TODO:
18-
19-
The following variables are templated into the alertmanager configuration file:
20-
21-
- TODO:
22-
23-
Other variables:
24-
- TODO:
25-
49+
- `alertmanager_config_file`: String, path alertmanager config file will be
50+
written to. Parent directory will be created if necessary.
51+
- `alertmanager_storage_path`: String, base path for data storage.
52+
- `alertmanager_web_listen_addresses`: List of strings, defining addresses to listen on.
53+
- `alertmanager_web_external_url`: String, the URL under which Alertmanager is
54+
externally reachable. See man page for more details if proxying alertmanager.
55+
- `alertmanager_data_retention`: String, how long to keep data for
56+
- `alertmanager_data_maintenance_interval`: String, interval between garbage
57+
collection and snapshotting to disk of the silences and the notification logs.
58+
- `alertmanager_config_flags`: Mapping. Keys/values in here are written to the
59+
alertmanager commandline as `--{{ key }}={{ value }}`.
60+
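- `alertmanager_default_receivers`: List of receiver mappings defined by default: the `null` receiver, plus the Slack receiver if that has been enabled (see below).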
61+
62+
The following variables are templated into the [alertmanager configuration](https://prometheus.io/docs/alerting/latest/configuration/):
63+
- `alertmanager_config_template`: String, path to configuration template. The default
64+
is to template in `alertmanager_config_default` and `alertmanager_config_extra`.
65+
- `alertmanager_config_default`: Mapping with default configuration for the
66+
top-level `route` and `receivers` keys. The default is to send all alerts to
67+
the Slack receiver, if that has been enabled (see below).
68+
- `alertmanager_receivers`: A list of [receiver](https://prometheus.io/docs/alerting/)
69+
mappings to define under the top-level `receivers` configuration key. This
70+
will contain the Slack receiver if that has been enabled (see below).
71+
- `alertmanager_extra_receivers`: A list of additional [receiver](https://prometheus.io/docs/alerting/)
72+
mappings to add, by default empty.
73+
- `alertmanager_slack_receiver`: Mapping defining the [Slack receiver](https://prometheus.io/docs/alerting/latest/configuration/#slack_config). Note the default configuration for this is in
74+
`environments/common/inventory/group_vars/all/alertmanager.yml`.
75+
- `alertmanager_null_receiver`: Mapping defining a `null` [receiver](https://prometheus.io/docs/alerting/latest/configuration/#receiver) so a receiver is always defined.
76+
- `alertmanager_config_extra`: Mapping with additional configuration. Keys in
77+
this become top-level keys in the configuration. E.g. this might be:
78+
```yaml
79+
alertmanager_config_extra:
80+
  global:
81+
    smtp_smarthost: smtp.example.org:587
82+
  time_intervals:
83+
    - name: monday-to-friday
84+
      time_intervals:
85+
        - weekdays: ['monday:friday']
86+
```
87+
Note that `route` and `receivers` keys should not be added here.
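
As an illustration of the general and command-line variables above, a site override might look like the sketch below. The file location, version number and retention value are assumptions for illustration, not values taken from this commit, and the checksum is left as a placeholder. Because Ansible replaces mapping variables rather than merging them, check `defaults/main.yml` before overriding `alertmanager_config_flags` so that any default flags are carried over.

```yaml
# Hypothetical file: environments/site/inventory/group_vars/all/alertmanager.yml
alertmanager_version: '0.28.1'            # illustrative version, no leading 'v'
alertmanager_download_checksum: 'sha256:<value from the prometheus.io download page>'
alertmanager_data_retention: '240h'       # illustrative; the role default is '120h'

# Each key/value is written to the command line as --{{ key }}={{ value }}.
# NOTE: this replaces the whole default mapping, so copy any defaults you need.
alertmanager_config_flags:
  log.level: debug                        # rendered as --log.level=debug
```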
2688

2789
## TODO
2890

@@ -54,21 +116,42 @@ Swap: 0B 0B 0B
54116

55117
2. Add the bot token into the config and enable Slack integration
56118

57-
- Open `environments/site/inventory/group_vars/all/vault_alertmanager.yml`
119+
- Open `environments/$ENV/inventory/group_vars/all/vault_alertmanager.yml`
58120
- Uncomment `vault_alertmanager_slack_integration_app_creds` and add the token
59121
- Vault-encrypt that file:
60122

61-
ansible-vault encrypt environments/$ENV/inventory/group_vars/all/vault_alertmanager.yml
123+
ansible-vault encrypt environments/$ENV/inventory/group_vars/all/vault_alertmanager.yml
62124

63-
- Open `environments/site/inventory/group_vars/all/alertmanager.yml`
64-
- Uncomment the config and set your alert channel name
125+
- Open `environments/$ENV/inventory/group_vars/all/alertmanager.yml`
126+
- Uncomment the `alertmanager_slack_integration` mapping and set your alert channel name
65127

66128
3. Invite the bot to your alerts channel
67129
- In the appropriate Slack channel type:
68130

69-
/invite @YOUR_BOT_NAME
131+
/invite @YOUR_BOT_NAME
132+
133+
134+
## Alert Rules
135+
136+
These are part of [Prometheus configuration](https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/), defined in the appliance at
137+
[environments/common/inventory/group_vars/all/prometheus.yml](../../../environments/common/inventory/group_vars/all/prometheus.yml).
138+
139+
A fairly-minimal set of default alert rule files is provided at
140+
`environments/common/files/prometheus/rules/`. Because compute nodes are expected
141+
to operate with heavy CPU and memory load, no alerting on those parameters is
142+
defined for those nodes.
70143

144+
By default `prometheus_alert_rules_files` is set such that any `*.rules` files
145+
in a directory `files/prometheus/rules` in the current environment or *any*
146+
parent environment are loaded. So usually, site-specific alerts should be added
147+
by creating additional rules files in `environments/site/files/prometheus/rules`.
71148

72-
## Adding Rules
149+
Note that the Prometheus targets are defined such that each node will have labels:
150+
- `env`: `ungrouped`, by default, unless a group/host var `prometheus_env` is set
151+
- `group`: One of `login`, `control`, `compute` or `other`
152+
These may be used to limit alerts to specific sets of nodes.
73153

74-
TODO: describe how prom config works
154+
Some ideas for future alerts which could be useful:
155+
- smartctl-exporter-based rules for baremetal nodes where there is no
156+
infrastructure-level smart monitoring
157+
- loss of "up" network interfaces
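
A site-specific rules file using the `group` label described above could be sketched as follows. The file name, alert name, threshold and duration are hypothetical; only the `*.rules` suffix, the directory and the `group` label come from the README text. The metric names are standard node_exporter memory metrics.

```yaml
# Hypothetical file: environments/site/files/prometheus/rules/site.rules
groups:
  - name: site-memory
    rules:
      - alert: ControlNodeLowMemory       # hypothetical alert name
        # Restrict the alert to control nodes via the `group` target label.
        expr: >-
          node_memory_MemAvailable_bytes{group="control"}
          / node_memory_MemTotal_bytes{group="control"} < 0.1
        for: 10m                          # illustrative duration
        labels:
          severity: warning
        annotations:
          summary: Less than 10% memory available on a control node
```

The annotation deliberately avoids Prometheus `{{ $labels... }}` templating, since it is not stated here whether rule files are copied verbatim or passed through Jinja first.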

ansible/roles/alertmanager/defaults/main.yml

Lines changed: 9 additions & 11 deletions
@@ -6,15 +6,14 @@ alertmanager_started: true
66
alertmanager_enabled: true
77

88
alertmanager_system_user: alertmanager
9-
alertmanager_system_group: alertmanager
9+
alertmanager_system_group: "{{ alertmanager_system_user }}"
1010
alertmanager_config_file: /etc/alertmanager/alertmanager.yml # --config.file: Alertmanager configuration file name
1111
alertmanager_storage_path: /var/lib/alertmanager # --storage.path: Base path for data storage
1212

1313
alertmanager_port: '9093'
14-
alertmanager_web_listen_addresses: # elements of --web.listen-address
14+
alertmanager_web_listen_addresses:
1515
  - ":{{ alertmanager_port }}"
16-
alertmanager_web_external_url: "http://localhost:{{ alertmanager_port}}/" # --web.external-url: The URL under which Alertmanager is externally reachable (for example, if Alertmanager is served via a reverse proxy). Used for generating relative and absolute links back to Alertmanager itself. If the URL has a path portion, it will be used to prefix all HTTP endpoints served by Alertmanager. If omitted, relevant URL components will be derived automatically
17-
# TODO: work out how we proxy this through ondemand
16+
alertmanager_web_external_url: "http://localhost:{{ alertmanager_port}}/" # TODO: is this right??
1817

1918
alertmanager_data_retention: '120h' # --data.retention # How long to keep data for
2019
alertmanager_data_maintenance_interval: '15m' # --data.maintenance-interval: Interval between garbage collection and snapshotting to disk of the silences and the notification logs
@@ -30,19 +29,18 @@ alertmanager_config_template: alertmanager.yml.j2
3029
# channel: '#alerts'
3130
# app_creds:
3231

33-
34-
alertmanager_default_receivers:
35-
  - name: 'null'
36-
32+
alertmanager_null_receiver:
33+
  name: 'null'
3734
alertmanager_slack_receiver: {} # defined in common env as it needs prometheus_address
38-
39-
alertmanager_extra_receivers: "{{ [alertmanager_slack_receiver] if alertmanager_slack_integration is defined else [] }}"
35+
alertmanager_extra_receivers: []
36+
alertmanager_default_receivers: "{{ [alertmanager_null_receiver] + ([alertmanager_slack_receiver] if alertmanager_slack_integration is defined else []) }}"
37+
alertmanager_receivers: "{{ alertmanager_default_receivers + alertmanager_extra_receivers }}"
4038

4139
alertmanager_config_default:
4240
  route:
4341
    group_by: ['...']
4442
    receiver: "{{ 'slack-receiver' if alertmanager_slack_integration is defined else 'null' }}"
45-
  receivers: "{{ alertmanager_default_receivers + alertmanager_extra_receivers }}"
43+
  receivers: "{{ alertmanager_receivers }}"
4644

4745
alertmanager_config_extra: {} # top-level only
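
To show how the receiver variables in this hunk compose, the sketch below adds one extra receiver. The receiver name and URL are hypothetical; `webhook_configs` is a standard Alertmanager receiver option. With these defaults the top-level `route` still points at `slack-receiver` or `null`, so an extra receiver is only defined, not used, unless the route in `alertmanager_config_default` is also overridden.

```yaml
# Hypothetical site override
alertmanager_extra_receivers:
  - name: ops-webhook                         # hypothetical receiver name
    webhook_configs:
      - url: http://alerts.example.org/hook   # hypothetical endpoint

# With the defaults above, alertmanager_receivers then evaluates to:
#   alertmanager_default_receivers + alertmanager_extra_receivers
# i.e. the 'null' receiver, the Slack receiver (only if
# alertmanager_slack_integration is defined), then 'ops-webhook'.
```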
4846

docs/production.md

Lines changed: 3 additions & 0 deletions
@@ -149,3 +149,6 @@ and referenced from the `site` and `production` environments, e.g.:
149149
raised using [shards](https://specs.openstack.org/openstack/nova-specs/specs/2024.1/implemented/ironic-shards.html).
150150
In general it should be possible to raise this value to 50-100 if the cloud
151151
is properly tuned, again, demonstrated through testing.
152+
153+
- Enable alertmanager following the [role docs](../ansible/roles/alertmanager/README.md)
154+
if Slack is available.
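
For orientation, enabling the Slack integration probably amounts to defining a mapping along the lines below. The `channel` and `app_creds` keys come from the commented-out default in the role's `defaults/main.yml`, and the vault variable name from the role README. Wiring the two together this way is an assumption, so check it against the commented template in the environment's `alertmanager.yml`.

```yaml
# environments/$ENV/inventory/group_vars/all/alertmanager.yml
alertmanager_slack_integration:
  channel: '#alerts'                      # your alert channel name
  # Assumed wiring: token stored vault-encrypted in vault_alertmanager.yml
  app_creds: "{{ vault_alertmanager_slack_integration_app_creds }}"
```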

environments/common/inventory/group_vars/all/prometheus.yml

Lines changed: 4 additions & 0 deletions
@@ -27,6 +27,9 @@ prometheus_targets:
2727
control: "{{ groups.get('node_exporter', []) | intersect(groups['control']) | prometheus_node_exporter_targets(hostvars, 'prometheus_env', 'control') }}"
2828
login: "{{ groups.get('node_exporter', []) | intersect(groups['login']) | prometheus_node_exporter_targets(hostvars, 'prometheus_env', 'login') }}"
2929
compute: "{{ groups.get('node_exporter', []) | intersect(groups['compute']) | prometheus_node_exporter_targets(hostvars, 'prometheus_env', 'compute') }}"
30+
# openhpc is defined as control+login+compute so this gets anything else:
31+
other: "{{ groups.get('node_exporter', []) | difference(groups['openhpc']) | prometheus_node_exporter_targets(hostvars, 'prometheus_env', 'other') }}"
32+
# TODO: check empty list gets coped with correctly!
3033

3134
prometheus_scrape_configs_default:
3235
- job_name: "prometheus"
@@ -44,6 +47,7 @@ prometheus_scrape_configs_default:
4447
- /etc/prometheus/file_sd/control.yml
4548
- /etc/prometheus/file_sd/login.yml
4649
- /etc/prometheus/file_sd/compute.yml
50+
- /etc/prometheus/file_sd/other.yml
4751
relabel_configs:
4852
# strip off port
4953
- source_labels: ['__address__']
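
The `other` target above relies on Ansible's `difference` filter. The standalone playbook below is a minimal sketch of that set logic using made-up host lists (all host names are hypothetical). It also covers the empty-list case raised in the TODO, at least as far as the filter itself is concerned. It does not exercise the appliance's custom `prometheus_node_exporter_targets` filter.

```yaml
# Hypothetical check playbook, run with: ansible-playbook difference-check.yml
- hosts: localhost
  gather_facts: false
  vars:
    node_exporter_hosts: [ctl, login-0, compute-0, squid]   # made-up hosts
    openhpc_hosts: [ctl, login-0, compute-0]
  tasks:
    - name: Hosts running node_exporter that are outside the openhpc group
      ansible.builtin.debug:
        msg: "{{ node_exporter_hosts | difference(openhpc_hosts) }}"   # ['squid']

    - name: When every node_exporter host is in openhpc, the result is an empty list
      ansible.builtin.debug:
        msg: "{{ openhpc_hosts | difference(openhpc_hosts) }}"         # []
```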
