# Alerting

The [prometheus.io docs](https://prometheus.io/docs/alerting/latest/overview/)
describe the overall alerting process:

> Alerting with Prometheus is separated into two parts. Alerting rules in
> Prometheus servers send alerts to an Alertmanager. The Alertmanager then
> manages those alerts, including silencing, inhibition, aggregation and
> sending out notifications via methods such as email, on-call notification
> systems, and chat platforms.

By default, both a `prometheus` server and an `alertmanager` server are
deployed on the control node for new environments:

```ini
# environments/site/groups:
[prometheus:children]
control

[alertmanager:children]
control
```

The general Prometheus configuration is described in
[monitoring-and-logging.md](./monitoring-and-logging.md#defaults-3) - note this
section specifies some role variables which commonly need modification.

The alertmanager server is defined by the [ansible/roles/alertmanager](../ansible/roles/alertmanager/README.md)
role, and all the configuration options and defaults are defined there. The
defaults give a fully-functional server, but note that:
- `alertmanager_web_external_url` is likely to require modification (see the
  example after this list).
- A [receiver](https://prometheus.io/docs/alerting/latest/configuration/#receiver)
  must be defined to actually provide notifications. Currently a Slack receiver
  integration is provided (see below), but alternative receivers could be
  defined using the provided role variables.
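
As a minimal sketch of the first point - the address shown is only a
placeholder and should be replaced with whatever users will actually browse to -
the override could be set in the site environment's group vars:

```yaml
# environments/site/inventory/group_vars/all/alertmanager.yml
# Hypothetical external address - replace with the real address of the Alertmanager web UI:
alertmanager_web_external_url: "https://alertmanager.mysite.example.org/"
```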

## Slack receiver

This section describes how to enable the Slack receiver to provide notifications
of alerts via Slack.

1. Create an app with a bot token:

- Go to https://api.slack.com/apps
- Select "Create an App"
- Select "From scratch"
- Set the app name and workspace fields, then select "Create"
- Fill out the "Short description" and "Background color" fields, then select "Save changes"
- Select "OAuth & Permissions" in the left menu
- Under "Scopes: Bot Token Scopes", select "Add an OAuth Scope", add
  `chat:write` and select "Save changes"
- Select "Install App" in the left menu, select "Install to your-workspace", then select "Allow"
- Copy the Bot User OAuth token shown

2. Add the bot token into the config and enable the Slack integration (a sketch
   of the resulting files is shown after this list):

- Open `environments/$ENV/inventory/group_vars/all/vault_alertmanager.yml`
- Uncomment `vault_alertmanager_slack_integration_app_creds` and add the token
- Vault-encrypt that file:

      ansible-vault encrypt environments/$ENV/inventory/group_vars/all/vault_alertmanager.yml

- Open `environments/$ENV/inventory/group_vars/all/alertmanager.yml`
- Uncomment the `alertmanager_slack_integration` mapping and set your alert channel name

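As a rough sketch of what those two files might then contain - the exact key
names are defined by the role (check its `README.md`), and the token and
channel values below are placeholders only:

```yaml
# environments/$ENV/inventory/group_vars/all/vault_alertmanager.yml (contents before vault-encryption):
vault_alertmanager_slack_integration_app_creds: "xoxb-0000000000000-placeholder-bot-token"

# environments/$ENV/inventory/group_vars/all/alertmanager.yml:
alertmanager_slack_integration:
  channel: '#cluster-alerts'  # assumed key name for the Slack channel to notify
```
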
3. Invite the bot to your alerts channel:

- In the appropriate Slack channel, type:

      /invite @YOUR_BOT_NAME

## Alerting Rules

These are part of the [Prometheus configuration](https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/),
which for this appliance is defined at
[environments/common/inventory/group_vars/all/prometheus.yml](../environments/common/inventory/group_vars/all/prometheus.yml).

Two `cloudalchemy.prometheus` role variables are relevant:
- `prometheus_alert_rules_files`: Paths to check for files providing rules.
  Note these are copied to the Prometheus configuration directly, so Jinja
  expressions intended for Prometheus do not need escaping.
- `prometheus_alert_rules`: YAML-format rules. Jinja templating here will be
  interpolated by Ansible, so templating intended for Prometheus must be escaped
  using `{% raw %}`/`{% endraw %}` tags (see the sketch after this list).
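
As an illustration of that escaping - a hedged sketch only, the alert below is
hypothetical and not one of the appliance defaults:

```yaml
prometheus_alert_rules:
  - alert: ExampleNodeExporterDown
    # The expression and the $labels template are for Prometheus, not Ansible,
    # so they are wrapped in raw tags to stop Ansible interpolating them:
    expr: '{% raw %}up{job="node"} == 0{% endraw %}'
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: '{% raw %}node exporter on {{ $labels.instance }} is down{% endraw %}'
```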

By default, `prometheus_alert_rules_files` is set so that any `*.rules` files
in a directory `files/prometheus/rules` in the current environment or *any*
parent environment are loaded. So usually, site-specific alerts should be added
by creating additional rules files in `environments/site/files/prometheus/rules`.
If the same file exists in more than one environment, the "child" file will take
precedence and any rules in the "parent" file will be ignored.
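
As a sketch of such a site-specific rules file - the group and alert names are
arbitrary examples, and because the file is copied to Prometheus directly no
Ansible escaping is needed:

```yaml
# environments/site/files/prometheus/rules/site.rules:
groups:
  - name: site-custom
    rules:
      - alert: ExampleHighLoad
        expr: node_load1 > 100
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Load average on {{ $labels.instance }} is unusually high"
```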
| 92 | + |
| 93 | +A set of default alert rule files is provided at `environments/common/files/prometheus/rules/`. |
| 94 | +These cover: |
| 95 | +- Some node-exporter metrics for disk, filesystems, memory and clock. Note |
| 96 | + no alerts are triggered on memory for compute nodes due to the intended use |
| 97 | + of those nodes. |
| 98 | +- Slurm nodes in DOWN or FAIL states, or the Slurm DBD message queue being too |
| 99 | + large, usually indicating a database problem. |

When defining additional rules, note the [labels defined](./monitoring-and-logging.md#prometheus_node_exporter_targets) for node-exporter targets.

In future, more alerts may be added for:
- smartctl-exporter-based rules for baremetal nodes where there is no
  infrastructure-level SMART monitoring
- loss of "up" network interfaces