add alertmanager docs

sjpb · sjpb · commit a09a140c5583 · 2025-04-11T13:49:39.000Z
diff --git a/ansible/roles/alertmanager/README.md b/ansible/roles/alertmanager/README.md
@@ -101,66 +101,3 @@ Mem:           3.6Gi       2.4Gi       168Mi        11Mi       1.5Gi       1.2Gi
 Swap:             0B          0B          0B
 ```
 
-
-
-## Slack Integration
-
-1. Create an app with a bot token:
-
-- Go to https://api.slack.com/apps
-- select "Create an App"
-- select "From scratch"
-- Set app name and workspacef fields, select "Create"
-- Fill out "Short description" and "Background color" fields, select "Save changes"
-- Select "OAuth & Permissions" on left menu
-- Under "Scopes : Bot Token Scopes", select "Add an OAuth Scope", add
-  `chat:write` and select "Save changes"
-- Select "Install App" on left menu, select "Install to your-workspace", select Allow
-- Copy the Bot User OAuth token shown
-
-2. Add the bot token into the config and enable Slurm integration
-
-- Open `environments/$ENV/inventory/group_vars/all/vault_alertmanager.yml`
-- Uncomment `vault_alertmanager_slack_integration_app_creds` and add the token
-- Vault-encrypt that file:
-
-        ansible-vault encrypt environments/$ENV/inventory/group_vars/all/vault_alertmanager.yml
-
-- Open `environments/$ENV/inventory/group_vars/all/alertmanager.yml`
-- Uncomment the `alertmanager_slack_integration` mapping and set your alert channel name
-
-3. Invite the bot to your alerts channel
-- In the appropriate Slack channel type:
-
-        /invite @YOUR_BOT_NAME
-
-TODO: note that `prometheus_web_external_url` might need overriding too.
-
-## Alert Rules
-
-These are part of [Prometheus configuration](https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/), defined in the appliance at
-[environments/common/inventory/group_vars/all/prometheus.yml](../../../environments/common/inventory/group_vars/all/prometheus.yml).
-
-A fairly-minimal set of default alert rule files is provided at
-`environments/common/files/prometheus/rules/`. Because compute nodes are expected
-to operate with heavy CPU and memory load, no alerting on those parameters is
-defined for those nodes.
-
-By default `prometheus_alert_rules_files` is set such that any `*.rules` files
-in a directory `files/prometheus/rules` in the current environment or *any*
-parent environment are loaded. So usually, site-specific alerts should be added
-by creating additional rules files in `environments/site/files/prometheus/rules`.
-
-Note that the Prometheus targets are defined such that each node will have labels:
-    - `env`: `ungrouped`, by default, unless a group/host var `prometheus_env` is set
-    - `group`: One of `login`, `control`, `compute` or `other`
-These may be used to limit alerts to specific sets of nodes.
-
-Some ideas for future alerts which could be useful:
-- smartctl-exporter-based rules for baremetal nodes where the is no
-  infrastructure-level smart monitoring
-- loss of "up" network interfaces
-
-
-TODO: suggest awesome alerts
-TODO: note that child env rule files override parent envs
diff --git a/docs/alerting.md b/docs/alerting.md
@@ -0,0 +1,106 @@
+# Alerting
+
+The [prometheus.io docs](https://prometheus.io/docs/alerting/latest/overview/)
+describe the overall alerting process:
+
+> Alerting with Prometheus is separated into two parts. Alerting rules in
+  Prometheus servers send alerts to an Alertmanager. The Alertmanager then
+  manages those alerts, including silencing, inhibition, aggregation and
+  sending out notifications via methods such as email, on-call notification
+  systems, and chat platforms.
+
+By default, both a `prometheus` server and an `alertmanager` server are
+deployed on the control node for new environments:
+
+```ini
+# environments/site/groups:
+[prometheus:children]
+control
+
+[alertmanager:children]
+control
+```
+
+The general Prometheus configuration is described in
+[monitoring-and-logging.md](./monitoring-and-logging.md#defaults-3) - note this
+section specifies some role variables which commonly need modification.
+
+The alertmanager server is defined by the [ansible/roles/alertmanager](../ansible/roles/alertmanager/README.md) role,
+and all the configuration options and defaults are defined there. By default
+it will be fully functional but:
+- `alertmanager_web_external_url` is likely to require modification.
+- A [receiver](https://prometheus.io/docs/alerting/latest/configuration/#receiver)
+  must be defined to actually provide notifications. Currently a Slack receiver
+  integration is provided (see below) but alternative receivers
+  could be defined using the provided role variables.
+
+## Slack receiver
+
+This section describes how to enable the Slack receiver to provide notifications
+of alerts via Slack.
+
+1. Create an app with a bot token:
+
+- Go to https://api.slack.com/apps
+- Select "Create an App"
+- Select "From scratch"
+- Set app name and workspace fields, select "Create"
+- Fill out "Short description" and "Background color" fields, select "Save changes"
+- Select "OAuth & Permissions" on left menu
+- Under "Scopes : Bot Token Scopes", select "Add an OAuth Scope", add
+  `chat:write` and select "Save changes"
+- Select "Install App" on left menu, select "Install to your-workspace", select "Allow"
+- Copy the Bot User OAuth token shown
+
+2. Add the bot token into the config and enable Slack integration:
+
+- Open `environments/$ENV/inventory/group_vars/all/vault_alertmanager.yml`
+- Uncomment `vault_alertmanager_slack_integration_app_creds` and add the token
+- Vault-encrypt that file:
+
+        ansible-vault encrypt environments/$ENV/inventory/group_vars/all/vault_alertmanager.yml
+
+- Open `environments/$ENV/inventory/group_vars/all/alertmanager.yml`
+- Uncomment the `alertmanager_slack_integration` mapping and set your alert channel name
+
+3. Invite the bot to your alerts channel:
+- In the appropriate Slack channel type:
+
+        /invite @YOUR_BOT_NAME
+
+
+## Alerting Rules
+
+These are part of [Prometheus configuration](https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/)
+which is defined for the appliance at
+[environments/common/inventory/group_vars/all/prometheus.yml](../environments/common/inventory/group_vars/all/prometheus.yml).
+
+Two `cloudalchemy.prometheus` role variables are relevant:
+- `prometheus_alert_rules_files`: Paths to check for files providing rules.
+  Note these are copied to the Prometheus config directly, so template
+  expressions intended for Prometheus do not need escaping.
+- `prometheus_alert_rules`: YAML-format rules. Jinja templating here will be
+  interpolated by Ansible, so templating intended for Prometheus must be escaped
+  using `{% raw %}`/`{% endraw %}` tags.
+
+By default, `prometheus_alert_rules_files` is set so that any `*.rules` files
+in a directory `files/prometheus/rules` in the current environment or *any*
+parent environment are loaded. So usually, site-specific alerts should be added
+by creating additional rules files in `environments/site/files/prometheus/rules`.
+If a file of the same name exists in more than one environment, the "child" file
+takes precedence and any rules in the "parent" file are ignored.
+
+A set of default alert rule files is provided at `environments/common/files/prometheus/rules/`.
+These cover:
+- Some node-exporter metrics for disk, filesystems, memory and clock. Note
+  no alerts are triggered on memory for compute nodes due to the intended use
+  of those nodes.
+- Slurm nodes in DOWN or FAIL states, or the Slurm DBD message queue being too
+  large, usually indicating a database problem.
+
+When defining additional rules, note the [labels defined](./monitoring-and-logging.md#prometheus_node_exporter_targets) for node-exporter targets.
+
+In future, more alerts may be added for:
+- smartctl-exporter-based rules for baremetal nodes where there is no
+  infrastructure-level SMART monitoring
+- loss of "up" network interfaces
diff --git a/docs/monitoring-and-logging.md b/docs/monitoring-and-logging.md
@@ -215,6 +215,12 @@ Internally, we use the [cloudalchemy.prometheus](https://github.com/cloudalchemy
 
 > [environments/common/inventory/group_vars/all/prometheus.yml](../environments/common/inventory/group_vars/all/prometheus.yml)
 
+Prometheus will be functional by default, but the following variables
+commonly need modifying:
+- `prometheus_web_external_url`
+- `prometheus_storage_retention`
+- `prometheus_storage_retention_size`
+
 ### Placement
 
 The `prometheus` group determines the placement of the prometheus service. Load balancing is currently unsupported so it is important that you only assign one host to this group.
@@ -240,12 +246,7 @@ This appliance provides a default set of recording rules which can be found here
 The intended purpose is to pre-compute some expensive queries that are used
 in the reference set of grafana dashboards.
 
-To add new, or to remove rules you will be to adjust the `prometheus_alert_rules_files` variable. The default value can be found in:
-
-> [environments/common/inventory/group_vars/all/prometheus.yml](../environments/common/inventory/group_vars/all/prometheus.yml)
-
-You can extend this variable in your environment specific configuration to reference extra files or to remove the defaults. The reference set of dashboards expect these variables to be defined, so if you remove them, you
-will also have to update your dashboards.
+For information on configuring alerting rules see [docs/alerting.md#alerting-rules](./alerting.md#alerting-rules).
 
 ### node_exporter
 
@@ -273,7 +274,14 @@ Variables in this file should *not* be customised directly, but should be overri
 
 #### prometheus_node_exporter_targets
 
-Groups prometheus targets into per environment groups. The ansible variable, `env` is used to determine the grouping. The metrics for each target in the group are given the prometheus label, `env: $env`, where `$env` is the value of the `env` variable for that host.
+Groups prometheus targets. Metrics from `node_exporter` hosts have two labels
+applied:
+   - `env`: This is set from the Ansible variable `prometheus_env` if present
+     (e.g. from hostvars or groupvars), defaulting to `ungrouped`. This can be
+     used to group metrics by some arbitrary "environment", e.g. rack.
+   - `group`: This refers to the "top-level" inventory group for the host and
+     is one of `control`, `login`, `compute` or `other`. This can be used to
+     define rules for specific host functionalities.
 
 ## slurm-stats
 
diff --git a/docs/production.md b/docs/production.md
@@ -150,5 +150,4 @@ and referenced from the `site` and `production` environments, e.g.:
   In general it should be possible to raise this value to 50-100 if the cloud
   is properly tuned, again, demonstrated through testing.
 
-- Enable alertmanager following the [role docs](../ansible/roles/alertmanager/README.md)
-  if Slack is available.
+- Enable alertmanager if Slack is available - see [docs/alerting.md](./alerting.md).