Add support for alertmanager #649

Merged
merged 35 commits into from
Apr 23, 2025
Commits
52e4daa
WIP: add alertmanager
sjpb Apr 9, 2025
6340947
disable alertmananger for caas
sjpb Apr 10, 2025
a1cc078
get slack integration working, with node down alert
sjpb Apr 10, 2025
a44f777
add node-exporter disk space alert
sjpb Apr 10, 2025
e9ffefd
setup fatimage/site
sjpb Apr 10, 2025
4b437aa
fix bug where prometheus environments didn't work
sjpb Apr 10, 2025
16b3a9b
add group label (login,control,compute) to prom targets
sjpb Apr 10, 2025
a9cd55e
add node-exporter rules
sjpb Apr 10, 2025
d806dee
alertmanager docs/defaults
sjpb Apr 10, 2025
ce9ed5b
add node failure alert
sjpb Apr 10, 2025
9c2469f
update alertmanager comments
sjpb Apr 11, 2025
72245f9
fix slack creds being exposed in alertmanager config
sjpb Apr 11, 2025
43f4a87
change alerts to ignore compute
sjpb Apr 11, 2025
4fee13c
update rules comments
sjpb Apr 11, 2025
ccd3014
change alertmanager external web url to use host IP
sjpb Apr 11, 2025
a2c07e0
fix up prom address in slack alert links
sjpb Apr 11, 2025
99df07f
alert on large Slurmdbd queue
sjpb Apr 11, 2025
167d37e
don't alert on /run/credentials/systemd fs problems
sjpb Apr 11, 2025
e6a4d3c
add alertmanager docs
sjpb Apr 11, 2025
8d04c50
guard alertmanager install
sjpb Apr 15, 2025
c8d761c
fix unused turbovcn service crashing
sjpb Apr 15, 2025
ba1a95e
bump CI image
sjpb Apr 15, 2025
5c3e93c
add basic auth with default user for alertmanager
sjpb Apr 15, 2025
d876471
add missing alertmanager web config template
sjpb Apr 15, 2025
86ae309
fix CI for secrets changing between PRs
sjpb Apr 15, 2025
d7efaf6
fix bug with json-encoded munge key in compute-init playbook
sjpb Apr 16, 2025
2ccf041
bump CI image
sjpb Apr 16, 2025
3bbb02f
add extra prom alertmanager config + fix bug in same
sjpb Apr 22, 2025
28ccabf
make slack alertmanager receiver more configurable
sjpb Apr 22, 2025
9033532
bump openhpc role to get facts for alert config
sjpb Apr 22, 2025
6820a37
remove empty alertmanager tasks file
sjpb Apr 22, 2025
1a6eff8
Merge branch 'main' into feat/alertmanager
sjpb Apr 22, 2025
41c7331
bump CI image
sjpb Apr 22, 2025
2c0c787
Merge branch 'feat/alertmanager' of github.com:stackhpc/ansible-slurm…
sjpb Apr 22, 2025
cad147f
fix promethes auth to alertmanager
sjpb Apr 22, 2025
3 changes: 2 additions & 1 deletion .github/workflows/stackhpc.yml
@@ -109,7 +109,6 @@ jobs:
         run: |
           . venv/bin/activate
           . environments/.stackhpc/activate
-          ansible-playbook ansible/adhoc/generate-passwords.yml
           echo vault_demo_user_password: "$DEMO_USER_PASSWORD" > $APPLIANCES_ENVIRONMENT_ROOT/inventory/group_vars/all/test_user.yml
         env:
           DEMO_USER_PASSWORD: ${{ secrets.TEST_USER_PASSWORD }}
@@ -135,6 +134,7 @@ jobs:
           . venv/bin/activate
           . environments/.stackhpc/activate
           ansible all -m wait_for_connection
+          ansible-playbook ansible/adhoc/generate-passwords.yml
           ansible-playbook -v ansible/site.yml
           ansible-playbook -v ansible/ci/check_slurm.yml
 
@@ -170,6 +170,7 @@ jobs:
           . venv/bin/activate
           . environments/.stackhpc/activate
           ansible all -m wait_for_connection
+          ansible-playbook ansible/adhoc/generate-passwords.yml
           ansible-playbook -v ansible/site.yml
           ansible-playbook -v ansible/ci/check_slurm.yml
 
2 changes: 2 additions & 0 deletions ansible/.gitignore
@@ -88,3 +88,5 @@ roles/*
 !roles/slurm_tools/**
 !roles/gateway/
 !roles/gateway/**
+!roles/alertmanager/
+!roles/alertmanager/**
6 changes: 6 additions & 0 deletions ansible/fatimage.yml
@@ -178,6 +178,12 @@
         slurm_exporter_state: stopped
       when: "'slurm_exporter' in group_names"
 
+    - name: Install alertmanager
+      include_role:
+        name: alertmanager
+        tasks_from: install.yml
+      when: "'alertmanager' in group_names"
+
 - hosts: prometheus
   become: yes
   gather_facts: yes
18 changes: 14 additions & 4 deletions ansible/filter_plugins/utils.py
@@ -11,16 +11,26 @@
 import os.path
 import re
 
-def prometheus_node_exporter_targets(hosts, env):
+def prometheus_node_exporter_targets(hosts, hostvars, env_key, group):
+    """ Return a mapping in cloudalchemy.nodeexporter prometheus_targets
+        format.
+
+        hosts: list of inventory_hostnames
+        hostvars: Ansible hostvars variable
+        env_key: key to lookup in each host's hostvars to add as label 'env' (default: 'ungrouped')
+        group: string to add as label 'group'
+    """
     result = []
     per_env = defaultdict(list)
     for host in hosts:
-        per_env[env].append(host)
+        host_env = hostvars[host].get(env_key, 'ungrouped')
+        per_env[host_env].append(host)
     for env, hosts in per_env.items():
         target = {
-            "targets": ["{target}:9100".format(target=target) for target in hosts],
+            "targets": [f"{target}:9100" for target in hosts],
             "labels": {
-                "env": env
+                'env': env,
+                'group': group
             }
         }
         result.append(target)
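The change to `prometheus_node_exporter_targets` can be illustrated with a minimal standalone sketch. It assumes the function ends by returning the collected list (the hunk is cut off before that point), and the host names and the `prometheus_env` key are hypothetical values chosen only for this example.

```python
from collections import defaultdict


def node_exporter_targets(hosts, hostvars, env_key, group):
    """Group hosts by an 'env' label read from hostvars and tag each entry with 'group'."""
    per_env = defaultdict(list)
    for host in hosts:
        # Hosts whose hostvars lack env_key fall back to 'ungrouped', as in the diff.
        per_env[hostvars[host].get(env_key, "ungrouped")].append(host)
    # Assumption: the real filter returns a list of target mappings like this one.
    return [
        {
            "targets": [f"{member}:9100" for member in members],
            "labels": {"env": env, "group": group},
        }
        for env, members in per_env.items()
    ]


# Hypothetical inventory data, for illustration only.
hostvars = {
    "login-0": {"prometheus_env": "prod"},
    "login-1": {},
}
targets = node_exporter_targets(
    ["login-0", "login-1"], hostvars, "prometheus_env", "login"
)
```

Compared with the previous signature, the `env` label is now resolved per host rather than passed once for the whole list, which is presumably what the commit "fix bug where prometheus environments didn't work" addresses.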
11 changes: 11 additions & 0 deletions ansible/monitoring.yml
@@ -86,3 +86,14 @@
         grafana_dashboards: []
     - import_role: # done in same play so it can use handlers from cloudalchemy.grafana
         name: grafana-dashboards
+
+- name: Deploy alertmanager
+  hosts: alertmanager
+  tags: alertmanager
+  become: yes
+  gather_facts: false
+  tasks:
+    - name: Configure alertmanager
+      include_role:
+        name: alertmanager
+        tasks_from: configure.yml
97 changes: 97 additions & 0 deletions ansible/roles/alertmanager/README.md
@@ -0,0 +1,97 @@
# alertmanager

Deploy [alertmanager](https://prometheus.io/docs/alerting/latest/alertmanager/)
to route Prometheus alerts to a receiver. Currently Slack is the only supported
receiver.

Note that:
- HA configuration is not supported
- Alertmanager state is not preserved when the node it runs on (by default,
control node) is reimaged, so any alerts silenced via the GUI will reoccur.
- No Grafana dashboard for alerts is currently provided.

Alertmanager is enabled by default on the `control` node in the
[everything](../../../environments/common/layouts/everything) template which
`cookiecutter` uses for a new environment's `inventory/groups` file.

In general, usage may only require:
- Adding the `control` node into the `alertmanager` group in `environments/site/groups`
if upgrading an existing environment.
- Enabling the Slack integration (see section below).
- Possibly setting `alertmanager_web_external_url`.

The web UI is available on `alertmanager_web_external_url`.

## Role variables

All variables are optional. See [defaults/main.yml](defaults/main.yml) for
all default values.

General variables:
- `alertmanager_version`: String, version (no leading 'v')
- `alertmanager_download_checksum`: String, checksum for relevant version from
[prometheus.io download page](https://prometheus.io/download/), in format
`type:value`.
- `alertmanager_download_dest`: String, path of temporary directory used for
download. Must exist.
- `alertmanager_binary_dir`: String, path of directory to install alertmanager
binary to. Must exist.
- `alertmanager_started`: Bool, whether the alertmanager service should be started.
- `alertmanager_enabled`: Bool, whether the alertmanager service should be enabled.
- `alertmanager_system_user`: String, name of user to run alertmanager as. Will be created.
- `alertmanager_system_group`: String, name of group of alertmanager user.
- `alertmanager_port`: Port to listen on.

The following variables are equivalent to similarly-named arguments to the
`alertmanager` binary. See `man alertmanager` for more info:

- `alertmanager_config_file`: String, path the main alertmanager config file
will be written to. Parent directory will be created if necessary.
- `alertmanager_web_config_file`: String, path alertmanager web config file
will be written to. Parent directory will be created if necessary.
- `alertmanager_storage_path`: String, base path for data storage.
- `alertmanager_web_listen_addresses`: List of strings, defining addresses to listen on.
- `alertmanager_web_external_url`: String, the URL under which Alertmanager is
externally reachable - defaults to host IP address and `alertmanager_port`.
See man page for more details if proxying alertmanager.
- `alertmanager_data_retention`: String, how long to keep data for
- `alertmanager_data_maintenance_interval`: String, interval between garbage
collection and snapshotting to disk of the silences and the notification logs.
- `alertmanager_config_flags`: Mapping. Keys/values in here are written to the
alertmanager commandline as `--{{ key }}={{ value }}`.
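- `alertmanager_default_receivers`: List of receivers defined by default: the
`null` receiver, plus the Slack receiver if the Slack integration is enabled.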

The following variables are templated into the alertmanager [main configuration](https://prometheus.io/docs/alerting/latest/configuration/):
- `alertmanager_config_template`: String, path to configuration template. The default
is to template in `alertmanager_config_default` and `alertmanager_config_extra`.
- `alertmanager_config_default`: Mapping with default configuration for the
top-level `route` and `receivers` keys. The default is to send all alerts to
the Slack receiver, if that has been enabled (see below).
- `alertmanager_receivers`: A list of [receiver](https://prometheus.io/docs/alerting/)
mappings to define under the top-level `receivers` configuration key. This
will contain the Slack receiver if that has been enabled (see below).
- `alertmanager_extra_receivers`: A list of additional [receiver](https://prometheus.io/docs/alerting/)
mappings to add, by default empty.
- `alertmanager_slack_receiver`: Mapping defining the [Slack receiver](https://prometheus.io/docs/alerting/latest/configuration/#slack_config). Note the default configuration for this is in
`environments/common/inventory/group_vars/all/alertmanager.yml`.
- `alertmanager_slack_receiver_name`: String, name for the above Slack receiver.
- `alertmanager_slack_receiver_send_resolved`: Bool, whether to send resolved alerts via the above Slack receiver.
- `alertmanager_null_receiver`: Mapping defining a `null` [receiver](https://prometheus.io/docs/alerting/latest/configuration/#receiver) so a receiver is always defined.
- `alertmanager_config_extra`: Mapping with additional configuration. Keys in
this become top-level keys in the configuration. E.g. this might be:
```yaml
alertmanager_config_extra:
  global:
    smtp_smarthost: smtp.example.org:587
  time_intervals:
    - name: monday-to-friday
      time_intervals:
        - weekdays: ['monday:friday']
```
Note that `route` and `receivers` keys should not be added here.
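
The restriction on `alertmanager_config_extra` could be checked before templating. The sketch below is a hypothetical helper, not part of the role: the name `reserved_keys_in_extra` and the idea of a pre-flight check are assumptions made for illustration.

```python
# Keys the README says must not appear in alertmanager_config_extra,
# because alertmanager_config_default already provides them.
RESERVED_KEYS = {"route", "receivers"}


def reserved_keys_in_extra(config_extra):
    """Return the reserved top-level keys found in config_extra, sorted by name."""
    return sorted(RESERVED_KEYS & set(config_extra))


# The example mapping from the README passes the check.
extra = {
    "global": {"smtp_smarthost": "smtp.example.org:587"},
    "time_intervals": [
        {"name": "monday-to-friday",
         "time_intervals": [{"weekdays": ["monday:friday"]}]},
    ],
}
offending = reserved_keys_in_extra(extra)
```

An empty result means the extra mapping can be written alongside the default one without producing duplicate top-level keys.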

The following variables are templated into the alertmanager [web configuration](https://prometheus.io/docs/alerting/latest/https/):
- `alertmanager_web_config_default`: Mapping with default configuration for
`basic_auth_users` providing the default web user.
- `alertmanager_alertmanager_web_config_extra`: Mapping with additional web
configuration. Keys in this become top-level keys in the web configuration.
50 changes: 50 additions & 0 deletions ansible/roles/alertmanager/defaults/main.yml
@@ -0,0 +1,50 @@
alertmanager_version: '0.28.1'
alertmanager_download_checksum: 'sha256:5ac7ab5e4b8ee5ce4d8fb0988f9cb275efcc3f181b4b408179fafee121693311'
alertmanager_download_dest: /tmp/alertmanager.tar.gz
alertmanager_binary_dir: /usr/local/bin
alertmanager_started: true
alertmanager_enabled: true

alertmanager_system_user: alertmanager
alertmanager_system_group: "{{ alertmanager_system_user }}"
alertmanager_config_file: /etc/alertmanager/alertmanager.yml
alertmanager_web_config_file: /etc/alertmanager/alertmanager-web.yml
alertmanager_storage_path: /var/lib/alertmanager

alertmanager_port: '9093'
alertmanager_web_listen_addresses:
- ":{{ alertmanager_port }}"
alertmanager_web_external_url: '' # defined in environments/common/inventory/group_vars/all/alertmanager.yml for visibility

alertmanager_data_retention: '120h'
alertmanager_data_maintenance_interval: '15m'
alertmanager_config_flags: {} # other command-line parameters as shown by `man alertmanager`
alertmanager_config_template: alertmanager.yml.j2
alertmanager_web_config_template: alertmanager-web.yml.j2

alertmanager_web_config_default:
basic_auth_users:
alertmanager: "{{ vault_alertmanager_admin_password | password_hash('bcrypt', '1234567890123456789012', ident='2b') }}"
alertmanager_alertmanager_web_config_extra: {} # top-level only

# Variables below are interpolated into alertmanager_config_default:

# Uncomment below and add Slack bot app creds for Slack integration
# alertmanager_slack_integration:
# channel: '#alerts'
# app_creds:

alertmanager_null_receiver:
name: 'null'
alertmanager_slack_receiver: {} # defined in environments/common/inventory/group_vars/all/alertmanager.yml as it needs prometheus_address
alertmanager_extra_receivers: []
alertmanager_default_receivers: "{{ [alertmanager_null_receiver] + ([alertmanager_slack_receiver] if alertmanager_slack_integration is defined else []) }}"
alertmanager_receivers: "{{ alertmanager_default_receivers + alertmanager_extra_receivers }}"

alertmanager_config_default:
route:
group_by: ['...']
receiver: "{{ alertmanager_slack_receiver_name if alertmanager_slack_integration is defined else 'null' }}"
receivers: "{{ alertmanager_receivers }}"

alertmanager_config_extra: {} # top-level only
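
The way these defaults compose the receiver list and the default route can be sketched in plain Python. This mirrors the Jinja expressions above under stated assumptions: `slack_integration=None` stands in for the variable being undefined, and the receiver name `slack-receiver` is a hypothetical placeholder, since the real value is defined outside this diff.

```python
def build_config_default(slack_integration=None, slack_receiver=None,
                         slack_receiver_name="slack-receiver",
                         extra_receivers=()):
    """Mirror alertmanager_config_default: pick the route receiver and build the receiver list."""
    null_receiver = {"name": "null"}
    slack_enabled = slack_integration is not None  # stands in for "is defined"
    default_receivers = [null_receiver] + ([slack_receiver] if slack_enabled else [])
    return {
        "route": {
            "group_by": ["..."],
            "receiver": slack_receiver_name if slack_enabled else "null",
        },
        "receivers": default_receivers + list(extra_receivers),
    }


without_slack = build_config_default()
with_slack = build_config_default(
    slack_integration={"channel": "#alerts"},
    slack_receiver={"name": "slack-receiver"},  # hypothetical receiver mapping
)
```

The `null` receiver is always present, so the rendered configuration has a valid route target even when no Slack credentials are configured.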
6 changes: 6 additions & 0 deletions ansible/roles/alertmanager/handlers/main.yml
@@ -0,0 +1,6 @@
- name: Restart alertmanager
systemd:
name: alertmanager
state: restarted
daemon_reload: "{{ _alertmanager_service.changed | default(false) }}"
when: alertmanager_started | bool
47 changes: 47 additions & 0 deletions ansible/roles/alertmanager/tasks/configure.yml
@@ -0,0 +1,47 @@
- name: Create alertmanager directories
ansible.builtin.file:
path: "{{ item }}"
state: directory
owner: "{{ alertmanager_system_user }}"
group: "{{ alertmanager_system_group }}"
mode: u=rwX,go=rX
loop:
- "{{ alertmanager_config_file | dirname }}"
- "{{ alertmanager_web_config_file | dirname }}"
- "{{ alertmanager_storage_path }}"

- name: Create alertmanager service file with immutable options
template:
src: alertmanager.service.j2
dest: /usr/lib/systemd/system/alertmanager.service
owner: root
group: root
mode: u=rw,go=r
register: _alertmanager_service
notify: Restart alertmanager

- name: Template alertmanager config
ansible.builtin.template:
src: "{{ alertmanager_config_template }}"
dest: "{{ alertmanager_config_file }}"
owner: "{{ alertmanager_system_user }}"
group: "{{ alertmanager_system_group }}"
mode: u=rw,go=
notify: Restart alertmanager

- name: Template alertmanager web config
ansible.builtin.template:
src: "{{ alertmanager_web_config_template }}"
dest: "{{ alertmanager_web_config_file }}"
owner: "{{ alertmanager_system_user }}"
group: "{{ alertmanager_system_group }}"
mode: u=rw,go=
notify: Restart alertmanager

- meta: flush_handlers

- name: Ensure alertmanager service state
systemd:
name: alertmanager
state: "{{ 'started' if alertmanager_started | bool else 'stopped' }}"
enabled: "{{ alertmanager_enabled | bool }}"
25 changes: 25 additions & 0 deletions ansible/roles/alertmanager/tasks/install.yml
@@ -0,0 +1,25 @@
- name: Create alertmanager system user
ansible.builtin.user:
name: "{{ alertmanager_system_user }}"
system: true
create_home: false

- name: Download alertmanager binary
ansible.builtin.get_url:
url: "https://github.com/prometheus/alertmanager/releases/download/v{{ alertmanager_version }}/alertmanager-{{ alertmanager_version }}.linux-amd64.tar.gz"
dest: "{{ alertmanager_download_dest }}"
owner: root
group: root
mode: u=rw,go=
checksum: "{{ alertmanager_download_checksum }}"

- name: Unpack alertmanager binary
ansible.builtin.unarchive:
src: "{{ alertmanager_download_dest }}"
include: "alertmanager-{{ alertmanager_version }}.linux-amd64/alertmanager"
dest: "{{ alertmanager_binary_dir }}"
owner: root
group: root
mode: u=rwx,go=rx
remote_src: true
extra_opts: ['--strip-components=1', '--show-stored-names']
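
The download step relies on a release URL derived from `alertmanager_version` and a checksum in `type:value` format. The sketch below shows how such a specification can be interpreted. The helper names are hypothetical, and this is not how `ansible.builtin.get_url` is implemented internally.

```python
import hashlib


def release_url(version):
    """Build the release tarball URL in the same shape as the get_url task."""
    return (
        "https://github.com/prometheus/alertmanager/releases/download/"
        f"v{version}/alertmanager-{version}.linux-amd64.tar.gz"
    )


def checksum_matches(data, spec):
    """Check data (bytes) against a 'type:value' checksum specification."""
    algorithm, _, expected = spec.partition(":")
    return hashlib.new(algorithm, data).hexdigest() == expected.lower()


# Illustrative payload only; a real check would read the downloaded tarball.
payload = b"example payload"
spec = "sha256:" + hashlib.sha256(payload).hexdigest()
ok = checksum_matches(payload, spec)
tampered = checksum_matches(b"tampered payload", spec)
```

Because the checksum is tied to one release artifact, changing `alertmanager_version` also requires updating `alertmanager_download_checksum`.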
4 changes: 4 additions & 0 deletions ansible/roles/alertmanager/templates/alertmanager-web.yml.j2
@@ -0,0 +1,4 @@
{{ ansible_managed | comment }}

{{ alertmanager_web_config_default | to_nice_yaml }}
{{ alertmanager_alertmanager_web_config_extra | to_nice_yaml if alertmanager_alertmanager_web_config_extra | length > 0 else '' }}
53 changes: 53 additions & 0 deletions ansible/roles/alertmanager/templates/alertmanager.service.j2
@@ -0,0 +1,53 @@



{{ ansible_managed | comment }}
[Unit]
Description=Prometheus Alertmanager
After=network-online.target
StartLimitInterval=0
StartLimitIntervalSec=0

[Service]
Type=simple
PIDFile=/run/alertmanager.pid
User={{ alertmanager_system_user }}
Group={{ alertmanager_system_group }}
ExecReload=/bin/kill -HUP $MAINPID
ExecStart={{ alertmanager_binary_dir }}/alertmanager \
--cluster.listen-address='' \
--config.file={{ alertmanager_config_file }} \
--storage.path={{ alertmanager_storage_path }} \
--data.retention={{ alertmanager_data_retention }} \
--data.maintenance-interval={{ alertmanager_data_maintenance_interval }} \
{% for address in alertmanager_web_listen_addresses %}
--web.listen-address={{ address }} \
{% endfor %}
--web.external-url={{ alertmanager_web_external_url }} \
--web.config.file={{ alertmanager_web_config_file }} \
{% for flag, flag_value in alertmanager_config_flags.items() %}
--{{ flag }}={{ flag_value }} \
{% endfor %}

SyslogIdentifier=alertmanager
Restart=always
RestartSec=5

CapabilityBoundingSet=CAP_SET_UID
LockPersonality=true
NoNewPrivileges=true
MemoryDenyWriteExecute=true
PrivateTmp=true
ProtectHome=true
ReadWriteDirectories={{ alertmanager_storage_path }}
RemoveIPC=true
RestrictSUIDSGID=true

PrivateUsers=true
ProtectControlGroups=true
ProtectKernelModules=true
ProtectKernelTunables=yes
ProtectSystem=strict

[Install]
WantedBy=multi-user.target
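
The argument list that this unit template renders for `ExecStart` can be sketched as follows. The function reproduces the template text, including the literal `''` after `--cluster.listen-address=`. The function name and the external URL are hypothetical; the other values are the role defaults shown earlier.

```python
def exec_start_args(config_file, storage_path, data_retention,
                    maintenance_interval, listen_addresses,
                    external_url, web_config_file, config_flags):
    """Return the alertmanager arguments in the order the unit template emits them."""
    args = [
        "--cluster.listen-address=''",  # fixed value in the template
        f"--config.file={config_file}",
        f"--storage.path={storage_path}",
        f"--data.retention={data_retention}",
        f"--data.maintenance-interval={maintenance_interval}",
    ]
    # One --web.listen-address flag per configured address.
    args += [f"--web.listen-address={address}" for address in listen_addresses]
    args += [
        f"--web.external-url={external_url}",
        f"--web.config.file={web_config_file}",
    ]
    # alertmanager_config_flags entries are appended as --key=value.
    args += [f"--{flag}={value}" for flag, value in config_flags.items()]
    return args


args = exec_start_args(
    config_file="/etc/alertmanager/alertmanager.yml",
    storage_path="/var/lib/alertmanager",
    data_retention="120h",
    maintenance_interval="15m",
    listen_addresses=[":9093"],
    external_url="http://192.0.2.10:9093/",  # hypothetical address
    web_config_file="/etc/alertmanager/alertmanager-web.yml",
    config_flags={"log.level": "debug"},
)
```

Extra flags always come last, so `alertmanager_config_flags` can add options without editing the template.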
4 changes: 4 additions & 0 deletions ansible/roles/alertmanager/templates/alertmanager.yml.j2
@@ -0,0 +1,4 @@
{{ ansible_managed | comment }}

{{ alertmanager_config_default | to_nice_yaml }}
{{ alertmanager_config_extra | to_nice_yaml if alertmanager_config_extra | length > 0 else '' }}
4 changes: 3 additions & 1 deletion ansible/roles/compute_init/files/compute-init.yml
@@ -299,7 +299,9 @@
       # if not the case
     - name: Write Munge key
       copy:
-        content: "{{ openhpc_munge_key }}"
+        # NB: openhpc_munge_key is *binary* and may not survive json encoding
+        # so do same as environments/common/inventory/group_vars/all/openhpc.yml
+        content: "{{ vault_openhpc_mungekey | b64decode }}"
         dest: "/etc/munge/munge.key"
         owner: munge
         group: munge
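The reason for decoding the base64 value at the point of use can be shown with a small sketch. It uses a synthetic 256-byte value as a stand-in for a munge key, which is an assumption for illustration only; real keys are generated elsewhere.

```python
import base64
import json

# Synthetic stand-in for a binary munge key (not a real key).
raw_key = bytes(range(256))

# Arbitrary binary data is generally not valid text, so it cannot be
# carried through a JSON string unchanged.
try:
    raw_key.decode("utf-8")
    decodes_as_text = True
except UnicodeDecodeError:
    decodes_as_text = False

# The base64 form is plain ASCII and survives a JSON round trip.
vaulted = base64.b64encode(raw_key).decode("ascii")
round_tripped = json.loads(json.dumps({"vault_openhpc_mungekey": vaulted}))

# Decoding at the point of use recovers the original bytes.
restored = base64.b64decode(round_tripped["vault_openhpc_mungekey"])
```

Here `restored` equals `raw_key`, which corresponds to applying `b64decode` in the task itself rather than passing the already-decoded key through variables.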