Add support for alertmanager #649

Merged
merged 35 commits into from
Apr 23, 2025

Collaborator

@sjpb sjpb commented Apr 10, 2025

  • Adds alertmanager:
    • Enabled by default, but no alerts are sent unless the slack integration is enabled
    • Not enabled for caas
    • No HA
    • No persistent state
    • basic_auth on the web interface
      (see ansible/roles/alertmanager/README.md for more)
  • Adds default alert rules for node exporter, and for slurm nodes that are down or failed
  • Fixes a bug where the prometheus env label was not set from hostvars. It can now be set using the host/group var prometheus_env, e.g. to group nodes by rack
  • Adds a prometheus label group to node_exporter targets, set to login, control, compute, or other, so that alerting rules can target specific kinds of node
  • Fixes a bug where compute-init could fail on boot due to JSON-encoding of the binary munge key
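
As an illustration of how the two labels described above might be used, here is a minimal sketch that is not taken from this PR. The first document sets the prometheus_env var in a group vars file; the value rack1 is hypothetical. The second document is a plain Prometheus rule file (not necessarily the form the role uses to define rules) that fires only for node_exporter targets labelled group="compute". The rule group name, alert name, 5m duration, and severity label are all assumptions.

```yaml
# Hypothetical group vars file: group these hosts under one env label value.
# "rack1" is an illustrative value, not one defined by this PR.
prometheus_env: rack1
---
# Hypothetical Prometheus rule file: alert when a compute node's
# node_exporter target has been unreachable for 5 minutes.
groups:
  - name: example-compute-alerts      # assumed name
    rules:
      - alert: ComputeNodeExporterDown  # assumed name
        # "up" is the standard per-target health metric; the group label
        # is the one this PR adds to node_exporter targets.
        expr: up{group="compute"} == 0
        for: 5m
        labels:
          severity: warning            # assumed label
        annotations:
          summary: "node_exporter on {{ $labels.instance }} is unreachable"
```

Changing the matcher to group="login" or group="control" would scope the same rule to those nodes instead.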

@sjpb sjpb marked this pull request as ready for review April 10, 2025 15:24
@sjpb sjpb requested a review from a team as a code owner April 10, 2025 15:24
@sjpb sjpb force-pushed the feat/alertmanager branch from 0bf1f9c to d806dee Compare April 10, 2025 15:34
@sjpb sjpb force-pushed the feat/alertmanager branch from a09a140 to e6a4d3c Compare April 11, 2025 13:53
@sjpb
Collaborator Author

sjpb commented Apr 11, 2025

Note: tvncserver is falling over on all nodes. TBH I wasn't expecting it to be enabled or started unless a job requested it. This should be fixed.

@sjpb
Collaborator Author

sjpb commented Apr 15, 2025

Tried running the ondemand remote desktop with the service stopped and disabled:

[root@RL9-compute-0 rocky]# systemctl stop tvncserver
[root@RL9-compute-0 rocky]# systemctl disable tvncserver
tvncserver.service is not a native service, redirecting to systemd-sysv-install.
Executing: /usr/lib/systemd/systemd-sysv-install disable tvncserver

This worked OK.

@sjpb
Collaborator Author

sjpb commented Apr 15, 2025

@sjpb
Collaborator Author

sjpb commented Apr 16, 2025

@sjpb sjpb requested review from m-bull and wtripp180901 April 16, 2025 14:09
Contributor

@wtripp180901 wtripp180901 left a comment

Looks good, mostly just configurability/consistency changes.

Also, is there a reason there's an empty task file in the alertmanager role?

wtripp180901 previously approved these changes Apr 22, 2025
Contributor

@wtripp180901 wtripp180901 left a comment

LGTM

@sjpb
Collaborator Author

sjpb commented Apr 22, 2025

@sjpb
Collaborator Author

sjpb commented Apr 22, 2025

@sjpb sjpb requested a review from wtripp180901 April 23, 2025 08:26
Contributor

@wtripp180901 wtripp180901 left a comment

LGTM

@sjpb sjpb merged commit 986a6bc into main Apr 23, 2025
7 checks passed
@sjpb sjpb deleted the feat/alertmanager branch April 23, 2025 08:29