|
1 | 1 | # alertmanager
|
2 | 2 |
|
| 3 | +Deploy [alertmanager](https://prometheus.io/docs/alerting/latest/alertmanager/) |
| 4 | +to route Prometheus alerts to a receiver. Currently Slack is the only supported |
| 5 | +receiver. |
3 | 6 |
|
4 |
| -notes: |
5 |
| -- HA is not supported |
6 |
| -- state ("notification state and configured silences") is not preserved across rebuild |
7 |
| -- not used for caas |
8 |
| -- no dashboard |
| 7 | +Note that: |
| 8 | +- HA configuration is not supported |
| 9 | +- Alertmanager state is not preserved when the node it runs on (by default, |
| 10 | + the control node) is reimaged, so any alerts silenced via the GUI will reoccur. |
| 11 | +- No Grafana dashboard for alerts is currently provided. |
9 | 12 |
|
| 13 | +- not used for caas - todo - maybe we should disable by default, unless everyone has slack? |
| 14 | +- have fixed bug with `env` hostvar for prom, now called `prometheus_env` |
| 15 | +- have added label "group" to prom for control,compute,login |
10 | 16 |
|
| 17 | +Alertmanager is enabled by default on the `control` node in the |
| 18 | +[everything](../../../environments/common/layouts/everything) template which |
| 19 | +`cookiecutter` uses for a new environment's `inventory/groups` file. |
| 20 | + |
| 21 | +In general, usage may only require: |
| 22 | +- Adding the `control` node into the `alertmanager` group in `environments/site/groups` |
| 23 | + if upgrading an existing environment. |
| 24 | +- Enabling the Slack integration (see below). |
11 | 25 |
|
12 | 26 | ## Role variables
|
13 | 27 |
|
| 28 | +All variables are optional. See [defaults/main.yml](defaults/main.yml) for |
| 29 | +all default values. |
| 30 | + |
| 31 | +General variables: |
| 32 | +- `alertmanager_version`: String, version (no leading 'v') |
| 33 | +- `alertmanager_download_checksum`: String, checksum for relevant version from |
| 34 | + [prometheus.io download page](https://prometheus.io/download/), in format |
| 35 | + `type:value`. |
| 36 | +- `alertmanager_download_dest`: String, path of temporary directory used for |
| 37 | + download. Must exist. |
| 38 | +- `alertmanager_binary_dir`: String, path of directory to install alertmanager |
| 39 | + binary to. Must exist. |
| 40 | +- `alertmanager_started`: Bool, whether the alertmanager service should be started. |
| 41 | +- `alertmanager_enabled`: Bool, whether the alertmanager service should be enabled. |
| 42 | +- `alertmanager_system_user`: String, name of user to run alertmanager as. Will be created. |
| 43 | +- `alertmanager_system_group`: String, name of group of alertmanager user. |
| 44 | +- `alertmanager_port`: Port to listen on. |
| 45 | + |
14 | 46 | The following variables are equivalent to similarly-named arguments to the
|
15 | 47 | `alertmanager` binary. See `man alertmanager` for more info:
|
16 | 48 |
|
17 |
| -- TODO: |
18 |
| - |
19 |
| -The following variables are templated into the alertmanager configuration file: |
20 |
| - |
21 |
| -- TODO: |
22 |
| - |
23 |
| -Other variables: |
24 |
| -- TODO: |
25 |
| - |
| 49 | +- `alertmanager_config_file`: String, path alertmanager config file will be |
| 50 | + written to. Parent directory will be created if necessary. |
| 51 | +- `alertmanager_storage_path`: String, base path for data storage. |
| 52 | +- `alertmanager_web_listen_addresses`: List of strings, defining addresses to listen on. |
| 53 | +- `alertmanager_web_external_url`: String, the URL under which Alertmanager is |
| 54 | + externally reachable. See man page for more details if proxying alertmanager. |
| 55 | +- `alertmanager_data_retention`: String, how long to keep data for. |
| 56 | +- `alertmanager_data_maintenance_interval`: String, interval between garbage |
| 57 | + collection and snapshotting to disk of the silences and the notification logs. |
| 58 | +- `alertmanager_config_flags`: Mapping. Keys/values in here are written to the |
| 59 | + alertmanager commandline as `--{{ key }}={{ value }}`. |
| 60 | +- `alertmanager_default_receivers`: |
| 61 | + |
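As an illustration of `alertmanager_config_flags`, the sketch below passes two extra options to the binary. The keys shown are existing `alertmanager` flags chosen purely as an example; they are not taken from this role's defaults.

```yaml
# Illustrative only: each key/value pair is rendered as --key=value
# on the alertmanager command line.
alertmanager_config_flags:
  log.level: debug    # rendered as --log.level=debug
  log.format: json    # rendered as --log.format=json
```
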
| 62 | +The following variables are templated into the [alertmanager configuration](https://prometheus.io/docs/alerting/latest/configuration/): |
| 63 | +- `alertmanager_config_template`: String, path to configuration template. The default |
| 64 | + is to template in `alertmanager_config_default` and `alertmanager_config_extra`. |
| 65 | +- `alertmanager_config_default`: Mapping with default configuration for the |
| 66 | + top-level `route` and `receivers` keys. The default is to send all alerts to |
| 67 | + the Slack receiver, if that has been enabled (see below). |
| 68 | +- `alertmanager_receivers`: A list of [receiver](https://prometheus.io/docs/alerting/) |
| 69 | + mappings to define under the top-level `receivers` configuration key. This |
| 70 | + will contain the Slack receiver if that has been enabled (see below). |
| 71 | +- `alertmanager_extra_receivers`: A list of additional [receiver](https://prometheus.io/docs/alerting/) |
| 72 | + mappings to add, by default empty. |
| 73 | +- `alertmanager_slack_receiver`: Mapping defining the [Slack receiver](https://prometheus.io/docs/alerting/latest/configuration/#slack_config). Note the default configuration for this is in |
| 74 | +`environments/common/inventory/group_vars/all/alertmanager.yml`. |
| 75 | +- `alertmanager_null_receiver`: Mapping defining a `null` [receiver](https://prometheus.io/docs/alerting/latest/configuration/#receiver) so a receiver is always defined. |
| 76 | +- `alertmanager_config_extra`: Mapping with additional configuration. Keys in |
| 77 | + this become top-level keys in the configuration. E.g. this might be: |
| 78 | + ```yaml |
| 79 | + alertmanager_config_extra: |
| 80 | + global: |
| 81 | + smtp_smarthost: smtp.example.org:587 |
| 82 | + time_intervals: |
| 83 | + - name: monday-to-friday |
| 84 | + time_intervals: |
| 85 | + - weekdays: ['monday:friday'] |
| 86 | + ``` |
| 87 | + Note that `route` and `receivers` keys should not be added here. |
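
A minimal sketch of `alertmanager_extra_receivers`, assuming a site wants an additional webhook receiver. The receiver name and URL are hypothetical. Since the default route sends all alerts to the Slack receiver, an extra receiver would presumably only receive alerts if the route is also customised to reference it, which is not shown here.

```yaml
# Hypothetical example: one extra receiver using Alertmanager's
# standard webhook_configs receiver type.
alertmanager_extra_receivers:
  - name: example-webhook                       # hypothetical receiver name
    webhook_configs:
      - url: https://alerts.example.org/hook    # hypothetical endpoint
```
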
26 | 88 |
|
27 | 89 | ## TODO
|
28 | 90 |
|
@@ -54,21 +116,42 @@ Swap: 0B 0B 0B
|
54 | 116 |
|
55 | 118 | 2. Add the bot token into the config and enable Slack integration
|
56 | 118 |
|
57 |
| -- Open `environments/site/inventory/group_vars/all/vault_alertmanager.yml` |
| 119 | +- Open `environments/$ENV/inventory/group_vars/all/vault_alertmanager.yml` |
58 | 120 | - Uncomment `vault_alertmanager_slack_integration_app_creds` and add the token
|
59 | 121 | - Vault-encrypt that file:
|
60 | 122 |
|
61 |
| - ansible-vault encrypt environments/$ENV/inventory/group_vars/all/vault_alertmanager.yml |
| 123 | + ansible-vault encrypt environments/$ENV/inventory/group_vars/all/vault_alertmanager.yml |
62 | 124 |
|
63 |
| -- Open `environments/site/inventory/group_vars/all/alertmanager.yml` |
64 |
| -- Uncomment the config and set your alert channel name |
| 125 | +- Open `environments/$ENV/inventory/group_vars/all/alertmanager.yml` |
| 126 | +- Uncomment the `alertmanager_slack_integration` mapping and set your alert channel name |
65 | 127 |
|
66 | 128 | 3. Invite the bot to your alerts channel
|
67 | 129 | - In the appropriate Slack channel type:
|
68 | 130 |
|
69 |
| - /invite @YOUR_BOT_NAME |
| 131 | + /invite @YOUR_BOT_NAME |
| 132 | + |
| 133 | + |
| 134 | +## Alert Rules |
| 135 | + |
| 136 | +These are part of [Prometheus configuration](https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/), defined in the appliance at |
| 137 | +[environments/common/inventory/group_vars/all/prometheus.yml](../../../environments/common/inventory/group_vars/all/prometheus.yml). |
| 138 | + |
| 139 | +A fairly-minimal set of default alert rule files is provided at |
| 140 | +`environments/common/files/prometheus/rules/`. Because compute nodes are expected |
| 141 | +to operate with heavy CPU and memory load, no alerting on those parameters is |
| 142 | +defined for those nodes. |
70 | 143 |
|
| 144 | +By default `prometheus_alert_rules_files` is set such that any `*.rules` files |
| 145 | +in a directory `files/prometheus/rules` in the current environment or *any* |
| 146 | +parent environment are loaded. So usually, site-specific alerts should be added |
| 147 | +by creating additional rules files in `environments/site/files/prometheus/rules`. |
71 | 148 |
|
72 |
| -## Adding Rules |
| 149 | +Note that the Prometheus targets are defined such that each node will have labels: |
| 150 | + - `env`: `ungrouped`, by default, unless a group/host var `prometheus_env` is set |
| 151 | + - `group`: One of `login`, `control` or `compute` |
| 152 | +These may be used to limit alerts to specific sets of nodes. |
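
As a rough sketch of a site-specific rules file that uses the `group` label, something like the following could be placed in `environments/site/files/prometheus/rules/`. The file name `site.rules`, the alert name, and the thresholds are hypothetical, and the metric names assume standard node-exporter filesystem metrics are being scraped.

```yaml
# environments/site/files/prometheus/rules/site.rules (hypothetical file name)
groups:
  - name: site-example
    rules:
      # Fire only for login and control nodes; compute nodes are excluded
      # by the matcher on the `group` target label.
      - alert: ExampleFilesystemAlmostFull
        expr: |
          node_filesystem_avail_bytes{group=~"login|control", fstype!~"tmpfs|overlay"}
            / node_filesystem_size_bytes{group=~"login|control", fstype!~"tmpfs|overlay"}
            < 0.10
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: Less than 10% free space on a login or control node filesystem
```

A matcher on `env` could be added in the same way to restrict a rule to nodes where `prometheus_env` has been set to a particular value.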
73 | 153 |
|
74 |
| -TODO: describe how prom config works |
| 154 | +Some ideas for future alerts which could be useful: |
| 155 | +- smartctl-exporter-based rules for baremetal nodes where there is no |
| 156 | + infrastructure-level smart monitoring |
| 157 | +- loss of "up" network interfaces |