Add support for alertmanager #649
Merged
35 commits
52e4daa  sjpb  WIP: add alertmanager
6340947  sjpb  disable alertmananger for caas
a1cc078  sjpb  get slack integration working, with node down alert
a44f777  sjpb  add node-exporter disk space alert
e9ffefd  sjpb  setup fatimage/site
4b437aa  sjpb  fix bug where prometheus environments didn't work
16b3a9b  sjpb  add group label (login,control,compute) to prom targets
a9cd55e  sjpb  add node-exporter rules
d806dee  sjpb  alertmanager docs/defaults
ce9ed5b  sjpb  add node failure alert
9c2469f  sjpb  update alertmanager comments
72245f9  sjpb  fix slack creds being exposed in alertmanager config
43f4a87  sjpb  change alerts to ignore compute
4fee13c  sjpb  update rules comments
ccd3014  sjpb  change alertmanager external web url to use host IP
a2c07e0  sjpb  fix up prom address in slack alert links
99df07f  sjpb  alert on large Slurmdbd queue
167d37e  sjpb  don't alert on /run/credentials/systemd fs problems
e6a4d3c  sjpb  add alertmanager docs
8d04c50  sjpb  guard alertmanager install
c8d761c  sjpb  fix unused turbovcn service crashing
ba1a95e  sjpb  bump CI image
5c3e93c  sjpb  add basic auth with default user for alertmanager
d876471  sjpb  add missing alertmanager web config template
86ae309  sjpb  fix CI for secrets changing between PRs
d7efaf6  sjpb  fix bug with json-encoded munge key in compute-init playbook
2ccf041  sjpb  bump CI image
3bbb02f  sjpb  add extra prom alertmanager config + fix bug in same
28ccabf  sjpb  make slack alertmanager receiver more configurable
9033532  sjpb  bump openhpc role to get facts for alert config
6820a37  sjpb  remove empty alertmanager tasks file
1a6eff8  sjpb  Merge branch 'main' into feat/alertmanager
41c7331  sjpb  bump CI image
2c0c787  sjpb  Merge branch 'feat/alertmanager' of github.com:stackhpc/ansible-slurm…
cad147f  sjpb  fix promethes auth to alertmanager
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -88,3 +88,5 @@ roles/* | |
!roles/slurm_tools/** | ||
!roles/gateway/ | ||
!roles/gateway/** | ||
!roles/alertmanager/ | ||
!roles/alertmanager/** |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,97 @@ | ||
# alertmanager | ||
|
||
Deploy [alertmanager](https://prometheus.io/docs/alerting/latest/alertmanager/) | ||
to route Prometheus alerts to a receiver. Currently Slack is the only supported | ||
receiver. | ||
|
||
Note that: | ||
- HA configuration is not supported | ||
- Alertmanager state is not preserved when the node it runs on (by default, | ||
control node) is reimaged, so any alerts silenced via the GUI will reoccur. | ||
- No Grafana dashboard for alerts is currently provided. | ||
|
||
Alertmanager is enabled by default on the `control` node in the | ||
[everything](../../../environments/common/layouts/everything) template which | ||
`cookiecutter` uses for a new environment's `inventory/groups` file. | ||
|
||
In general, usage may only require: | |
- Adding the `control` node into the `alertmanager` group in `environments/site/groups` | ||
if upgrading an existing environment. | ||
- Enabling the Slack integration (see section below). | ||
- Possibly setting `alertmanager_web_external_url`. | ||
|
||
The web UI is available on `alertmanager_web_external_url`. | ||
|
||
## Role variables | ||
|
||
All variables are optional. See [defaults/main.yml](defaults/main.yml) for | ||
all default values. | ||
|
||
General variables: | ||
- `alertmanager_version`: String, version (no leading 'v') | ||
- `alertmanager_download_checksum`: String, checksum for relevant version from | ||
[prometheus.io download page](https://prometheus.io/download/), in format | ||
`type:value`. | ||
- `alertmanager_download_dest`: String, path the release tarball is downloaded | |
to. The parent directory must exist. | |
- `alertmanager_binary_dir`: String, path of directory to install alertmanager | ||
binary to. Must exist. | ||
- `alertmanager_started`: Bool, whether the alertmanager service should be started. | ||
- `alertmanager_enabled`: Bool, whether the alertmanager service should be enabled. | ||
- `alertmanager_system_user`: String, name of user to run alertmanager as. Will be created. | ||
- `alertmanager_system_group`: String, name of group of alertmanager user. | ||
- `alertmanager_port`: Port to listen on. | ||
|
||
The following variables are equivalent to similarly-named arguments to the | ||
`alertmanager` binary. See `man alertmanager` for more info: | ||
|
||
- `alertmanager_config_file`: String, path the main alertmanager config file | ||
will be written to. Parent directory will be created if necessary. | ||
- `alertmanager_web_config_file`: String, path alertmanager web config file | ||
will be written to. Parent directory will be created if necessary. | ||
- `alertmanager_storage_path`: String, base path for data storage. | ||
- `alertmanager_web_listen_addresses`: List of strings, defining addresses to listen on. | |
- `alertmanager_web_external_url`: String, the URL under which Alertmanager is | ||
externally reachable - defaults to host IP address and `alertmanager_port`. | ||
See man page for more details if proxying alertmanager. | ||
- `alertmanager_data_retention`: String, how long to keep data for. | |
- `alertmanager_data_maintenance_interval`: String, interval between garbage | ||
collection and snapshotting to disk of the silences and the notification logs. | ||
- `alertmanager_config_flags`: Mapping. Keys/values in here are written to the | ||
alertmanager commandline as `--{{ key }}={{ value }}`. | ||
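- `alertmanager_default_receivers`: List of receiver mappings defined by default: the `null` receiver, plus the Slack receiver if `alertmanager_slack_integration` is defined. | |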
|
||
The following variables are templated into the alertmanager [main configuration](https://prometheus.io/docs/alerting/latest/configuration/): | ||
- `alertmanager_config_template`: String, path to configuration template. The default | ||
is to template in `alertmanager_config_default` and `alertmanager_config_extra`. | ||
- `alertmanager_config_default`: Mapping with default configuration for the | ||
top-level `route` and `receivers` keys. The default is to send all alerts to | ||
the Slack receiver, if that has been enabled (see below). | ||
- `alertmanager_receivers`: A list of [receiver](https://prometheus.io/docs/alerting/) | ||
mappings to define under the top-level `receivers` configuration key. This | ||
will contain the Slack receiver if that has been enabled (see below). | ||
- `alertmanager_extra_receivers`: A list of additional [receiver](https://prometheus.io/docs/alerting/) | |
mappings to add, by default empty. | |
- `alertmanager_slack_receiver`: Mapping defining the [Slack receiver](https://prometheus.io/docs/alerting/latest/configuration/#slack_config). Note the default configuration for this is in | ||
`environments/common/inventory/group_vars/all/alertmanager.yml`. | ||
- `alertmanager_slack_receiver_name`: String, name for the above Slack receiver. | |
- `alertmanager_slack_receiver_send_resolved`: Bool, whether to send resolved alerts via the above Slack receiver. | |
- `alertmanager_null_receiver`: Mapping defining a `null` [receiver](https://prometheus.io/docs/alerting/latest/configuration/#receiver) so a receiver is always defined. | ||
- `alertmanager_config_extra`: Mapping with additional configuration. Keys in | ||
this become top-level keys in the configuration. E.g. this might be: | |
```yaml | ||
alertmanager_config_extra: | ||
global: | ||
smtp_smarthost: smtp.example.org:587 | |
time_intervals: | ||
- name: monday-to-friday | ||
time_intervals: | ||
- weekdays: ['monday:friday'] | ||
``` | ||
Note that `route` and `receivers` keys should not be added here. | ||
|
||
The following variables are templated into the alertmanager [web configuration](https://prometheus.io/docs/alerting/latest/https/): | ||
- `alertmanager_web_config_default`: Mapping with default configuration for | ||
`basic_auth_users` providing the default web user. | ||
- `alertmanager_alertmanager_web_config_extra`: Mapping with additional web | ||
configuration. Keys in this become top-level keys in the web configuration. |
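
The "general usage" steps in the README above might look roughly like the following for an existing environment. This is a minimal sketch, not taken from the PR: the INI group syntax shown in the comments, the vault variable name `vault_alertmanager_slack_app_creds` and the example URL are assumptions for illustration. Only `alertmanager_slack_integration` (with its `channel` and `app_creds` keys) and `alertmanager_web_external_url` appear in this change.

```yaml
# 1. In the environment's inventory groups file (assumed to be INI format),
#    make the control node a member of the alertmanager group:
#
#      [alertmanager:children]
#      control

# 2. In a group_vars file for the environment, enable the Slack integration.
alertmanager_slack_integration:
  channel: '#alerts'
  # Hypothetical vault variable holding the Slack bot app credentials.
  app_creds: "{{ vault_alertmanager_slack_app_creds }}"

# 3. Optionally, set the externally reachable URL (e.g. when proxying).
#    Hypothetical URL, shown only as an example.
alertmanager_web_external_url: "https://alerts.example.org/"
```

Keeping the credentials in a vault variable rather than inline avoids committing them in plain text.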
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,50 @@ | ||
alertmanager_version: '0.28.1' | ||
alertmanager_download_checksum: 'sha256:5ac7ab5e4b8ee5ce4d8fb0988f9cb275efcc3f181b4b408179fafee121693311' | ||
alertmanager_download_dest: /tmp/alertmanager.tar.gz | ||
alertmanager_binary_dir: /usr/local/bin | ||
alertmanager_started: true | ||
alertmanager_enabled: true | ||
|
||
alertmanager_system_user: alertmanager | ||
alertmanager_system_group: "{{ alertmanager_system_user }}" | ||
alertmanager_config_file: /etc/alertmanager/alertmanager.yml | ||
alertmanager_web_config_file: /etc/alertmanager/alertmanager-web.yml | ||
alertmanager_storage_path: /var/lib/alertmanager | ||
|
||
alertmanager_port: '9093' | ||
alertmanager_web_listen_addresses: | ||
- ":{{ alertmanager_port }}" | ||
alertmanager_web_external_url: '' # defined in environments/common/inventory/group_vars/all/alertmanager.yml for visibility | ||
|
||
alertmanager_data_retention: '120h' | ||
alertmanager_data_maintenance_interval: '15m' | ||
alertmanager_config_flags: {} # other command-line parameters as shown by `man alertmanager` | ||
alertmanager_config_template: alertmanager.yml.j2 | ||
alertmanager_web_config_template: alertmanager-web.yml.j2 | ||
|
||
alertmanager_web_config_default: | ||
basic_auth_users: | ||
alertmanager: "{{ vault_alertmanager_admin_password | password_hash('bcrypt', '1234567890123456789012', ident='2b') }}" | ||
alertmanager_alertmanager_web_config_extra: {} # top-level only | ||
|
||
# Variables below are interpolated into alertmanager_config_default: | ||
|
||
# Uncomment below and add Slack bot app creds for Slack integration | ||
# alertmanager_slack_integration: | ||
# channel: '#alerts' | ||
# app_creds: | ||
|
||
alertmanager_null_receiver: | ||
name: 'null' | ||
alertmanager_slack_receiver: {} # defined in environments/common/inventory/group_vars/all/alertmanager.yml as it needs prometheus_address | ||
alertmanager_extra_receivers: [] | ||
alertmanager_default_receivers: "{{ [alertmanager_null_receiver] + ([alertmanager_slack_receiver] if alertmanager_slack_integration is defined else []) }}" | ||
alertmanager_receivers: "{{ alertmanager_default_receivers + alertmanager_extra_receivers }}" | ||
|
||
alertmanager_config_default: | ||
route: | ||
group_by: ['...'] | ||
receiver: "{{ alertmanager_slack_receiver_name if alertmanager_slack_integration is defined else 'null' }}" | ||
receivers: "{{ alertmanager_receivers }}" | ||
|
||
alertmanager_config_extra: {} # top-level only |
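
Since `alertmanager_receivers` is the default receivers plus `alertmanager_extra_receivers`, a site could plausibly add its own receiver and point the route at it as sketched below. The receiver name `ops-webhook` and the endpoint URL are hypothetical; `webhook_configs`, `url` and `send_resolved` are standard Alertmanager receiver settings. Overriding `alertmanager_config_default` in this way is an assumption about how the variables are intended to be combined, not something the PR documents.

```yaml
# Add a webhook receiver alongside the default null (and optional Slack) receivers.
alertmanager_extra_receivers:
  - name: ops-webhook                          # hypothetical receiver name
    webhook_configs:
      - url: "http://hooks.example.org/alert"  # hypothetical endpoint
        send_resolved: true

# Override the default route so all alerts go to the new receiver.
# The receivers list is still built from alertmanager_receivers, so the
# null receiver remains defined.
alertmanager_config_default:
  route:
    group_by: ['...']
    receiver: ops-webhook
  receivers: "{{ alertmanager_receivers }}"
```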
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,6 @@ | ||
- name: Restart alertmanager | ||
systemd: | ||
name: alertmanager | ||
state: restarted | ||
daemon_reload: "{{ _alertmanager_service.changed | default(false) }}" | ||
when: alertmanager_started | bool |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,47 @@ | ||
- name: Create alertmanager directories | ||
ansible.builtin.file: | ||
path: "{{ item }}" | ||
state: directory | ||
owner: "{{ alertmanager_system_user }}" | ||
group: "{{ alertmanager_system_group }}" | ||
mode: u=rwX,go=rX | ||
loop: | ||
- "{{ alertmanager_config_file | dirname }}" | ||
- "{{ alertmanager_web_config_file | dirname }}" | ||
- "{{ alertmanager_storage_path }}" | ||
|
||
- name: Create alertmanager service file with immutable options | ||
template: | ||
src: alertmanager.service.j2 | ||
dest: /usr/lib/systemd/system/alertmanager.service | ||
owner: root | ||
group: root | ||
mode: u=rw,go=r | ||
register: _alertmanager_service | ||
notify: Restart alertmanager | ||
|
||
- name: Template alertmanager config | ||
ansible.builtin.template: | ||
src: "{{ alertmanager_config_template }}" | ||
dest: "{{ alertmanager_config_file }}" | ||
owner: "{{ alertmanager_system_user }}" | ||
group: "{{ alertmanager_system_group }}" | ||
mode: u=rw,go= | ||
notify: Restart alertmanager | ||
|
||
- name: Template alertmanager web config | ||
ansible.builtin.template: | ||
src: "{{ alertmanager_web_config_template }}" | ||
dest: "{{ alertmanager_web_config_file }}" | ||
owner: "{{ alertmanager_system_user }}" | ||
group: "{{ alertmanager_system_group }}" | ||
mode: u=rw,go= | ||
notify: Restart alertmanager | ||
|
||
- meta: flush_handlers | ||
|
||
- name: Ensure alertmanager service state | ||
systemd: | ||
name: alertmanager | ||
state: "{{ 'started' if alertmanager_started | bool else 'stopped' }}" | ||
enabled: "{{ alertmanager_enabled | bool }}" |
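
After the service state is ensured, a follow-up check could confirm that Alertmanager is actually answering requests. The task below is an illustrative sketch and not part of this PR. It assumes the default listen address (all interfaces on `alertmanager_port`), no route prefix in `alertmanager_web_external_url`, and the default `alertmanager` basic-auth user from the role defaults. The register name `_alertmanager_health` is hypothetical.

```yaml
- name: Check alertmanager health endpoint (illustrative)
  ansible.builtin.uri:
    # /-/healthy is Alertmanager's built-in health endpoint.
    url: "http://localhost:{{ alertmanager_port }}/-/healthy"
    # Basic auth is enabled by the web config, so credentials are required.
    url_username: alertmanager
    url_password: "{{ vault_alertmanager_admin_password }}"
    force_basic_auth: true
    status_code: 200
  register: _alertmanager_health  # hypothetical variable name
  # Retry briefly, as the service may still be starting.
  until: _alertmanager_health.status == 200
  retries: 5
  delay: 2
  when: alertmanager_started | bool
```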
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,25 @@ | ||
- name: Create alertmanager system user | ||
ansible.builtin.user: | ||
name: "{{ alertmanager_system_user }}" | ||
system: true | ||
create_home: false | ||
|
||
- name: Download alertmanager binary | ||
ansible.builtin.get_url: | ||
url: "https://github.com/prometheus/alertmanager/releases/download/v{{ alertmanager_version }}/alertmanager-{{ alertmanager_version }}.linux-amd64.tar.gz" | ||
dest: "{{ alertmanager_download_dest }}" | ||
owner: root | ||
group: root | ||
mode: u=rw,go= | ||
checksum: "{{ alertmanager_download_checksum }}" | ||
|
||
- name: Unpack alertmanager binary | ||
ansible.builtin.unarchive: | ||
src: "{{ alertmanager_download_dest }}" | ||
include: "alertmanager-{{ alertmanager_version }}.linux-amd64/alertmanager" | ||
dest: "{{ alertmanager_binary_dir }}" | ||
owner: root | ||
group: root | ||
mode: u=rwx,go=rx | ||
remote_src: true | ||
extra_opts: ['--strip-components=1', '--show-stored-names'] |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,4 @@ | ||
{{ ansible_managed | comment }} | ||
|
||
{{ alertmanager_web_config_default | to_nice_yaml }} | ||
{{ alertmanager_alertmanager_web_config_extra | to_nice_yaml if alertmanager_alertmanager_web_config_extra | length > 0 else '' }} |
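
Because the extra mapping is appended as further top-level keys of the web configuration, it could be used to enable TLS, for example. This is a sketch under assumptions: the certificate paths are hypothetical, the files would need to be readable by the alertmanager user, and anything connecting to Alertmanager (including Prometheus) would then need to use HTTPS. `tls_server_config`, `cert_file` and `key_file` are standard keys of the Alertmanager web configuration.

```yaml
alertmanager_alertmanager_web_config_extra:
  tls_server_config:
    cert_file: /etc/alertmanager/tls/server.crt  # hypothetical path
    key_file: /etc/alertmanager/tls/server.key   # hypothetical path
```

`basic_auth_users` should not be repeated here: it is already a top-level key in `alertmanager_web_config_default`, and a second copy would produce a duplicate key in the rendered file.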
53 changes: 53 additions & 0 deletions
ansible/roles/alertmanager/templates/alertmanager.service.j2
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,53 @@ | ||
|
||
|
||
|
||
{{ ansible_managed | comment }} | ||
[Unit] | ||
Description=Prometheus Alertmanager | ||
After=network-online.target | ||
StartLimitInterval=0 | ||
StartLimitIntervalSec=0 | ||
|
||
[Service] | ||
Type=simple | ||
PIDFile=/run/alertmanager.pid | ||
User={{ alertmanager_system_user }} | ||
Group={{ alertmanager_system_group }} | ||
ExecReload=/bin/kill -HUP $MAINPID | ||
ExecStart={{ alertmanager_binary_dir }}/alertmanager \ | ||
--cluster.listen-address='' \ | ||
--config.file={{ alertmanager_config_file }} \ | ||
--storage.path={{ alertmanager_storage_path }} \ | ||
--data.retention={{ alertmanager_data_retention }} \ | ||
--data.maintenance-interval={{ alertmanager_data_maintenance_interval }} \ | ||
{% for address in alertmanager_web_listen_addresses %} | ||
--web.listen-address={{ address }} \ | ||
{% endfor %} | ||
--web.external-url={{ alertmanager_web_external_url }} \ | ||
--web.config.file={{ alertmanager_web_config_file }} \ | ||
{% for flag, flag_value in alertmanager_config_flags.items() %} | ||
--{{ flag }}={{ flag_value }} \ | ||
{% endfor %} | ||
|
||
SyslogIdentifier=alertmanager | ||
Restart=always | ||
RestartSec=5 | ||
|
||
CapabilityBoundingSet=CAP_SET_UID | ||
LockPersonality=true | ||
NoNewPrivileges=true | ||
MemoryDenyWriteExecute=true | ||
PrivateTmp=true | ||
ProtectHome=true | ||
ReadWriteDirectories={{ alertmanager_storage_path }} | ||
RemoveIPC=true | ||
RestrictSUIDSGID=true | ||
|
||
PrivateUsers=true | ||
ProtectControlGroups=true | ||
ProtectKernelModules=true | ||
ProtectKernelTunables=yes | ||
ProtectSystem=strict | ||
|
||
[Install] | ||
WantedBy=multi-user.target |
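
The final loop in `ExecStart` turns each entry of `alertmanager_config_flags` into a `--key=value` argument. As a small sketch, the following would raise the log verbosity; `log.level` and `log.format` are existing Alertmanager flags, and the values are chosen only for illustration.

```yaml
alertmanager_config_flags:
  log.level: debug
  log.format: json

# The template loop above would append these lines to ExecStart:
#   --log.level=debug \
#   --log.format=json \
```

Flags the unit file already sets explicitly (such as `--config.file` or `--storage.path`) are probably best left to their dedicated variables, to avoid passing the same flag twice.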
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,4 @@ | ||
{{ ansible_managed | comment }} | ||
|
||
{{ alertmanager_config_default | to_nice_yaml }} | ||
{{ alertmanager_config_extra | to_nice_yaml if alertmanager_config_extra | length > 0 else '' }} |
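
The template simply concatenates two YAML dumps, which is why `route` and `receivers` must not also appear in `alertmanager_config_extra`: they would become duplicate top-level keys. With the role defaults, no Slack integration, and a hypothetical `alertmanager_config_extra` setting `global.resolve_timeout`, the rendered file would look approximately as follows. Exact indentation, quoting and key order depend on `to_nice_yaml` and may differ.

```yaml
# From alertmanager_config_default:
receivers:
- name: 'null'
route:
  group_by:
  - '...'
  receiver: 'null'

# From alertmanager_config_extra (hypothetical value for illustration):
global:
  resolve_timeout: 10m
```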