fix: use rabbitmq length for RabbitMQNodeDown #1579

Merged
Alex-Welsh merged 3 commits into stackhpc/2024.1 from fix-rabbitmq-node-down-rule on May 19, 2025

Conversation

jackhodgkiss
Contributor

The `RabbitMQNodeDown` alert assumed that every deployment has three controllers. This is not always the case: we also support deployments with a single controller or with more than three.

Previously, this would have caused false alerts in deployments with a single controller, while concealing alerts in deployments with more than three controllers.
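
The two failure modes described above can be illustrated with a minimal Python sketch. It assumes the old rule compared the number of reporting RabbitMQ nodes against a literal 3; the function names and the `HARD_CODED_EXPECTED` constant are hypothetical and only model the comparison, not the actual rules file.

```python
# Hypothetical model of the alert condition: the rule fires when fewer
# RabbitMQ nodes report in than the rule expects.
HARD_CODED_EXPECTED = 3  # assumption: the old rule used a literal 3


def alert_fires(nodes_up: int, expected: int) -> bool:
    """Return True when fewer nodes are up than expected."""
    return nodes_up < expected


def describe(deployed: int, nodes_up: int) -> dict:
    """Compare the hard-coded threshold with one derived from deployment size."""
    return {
        "hard_coded": alert_fires(nodes_up, HARD_CODED_EXPECTED),
        "per_deployment": alert_fires(nodes_up, deployed),
    }


# Single node, healthy: the hard-coded rule fires anyway (false alert).
print(describe(deployed=1, nodes_up=1))
# Five nodes, one down: the hard-coded rule stays silent (concealed alert).
print(describe(deployed=5, nodes_up=4))
```

With a threshold derived from the deployment size, the healthy single-node case stays quiet and the degraded five-node case alerts.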

@jackhodgkiss jackhodgkiss self-assigned this Mar 17, 2025
@jackhodgkiss jackhodgkiss requested a review from a team as a code owner March 17, 2025 13:21
@product-auto-label product-auto-label bot added size: xs monitoring All things related to observability & telemetry labels Mar 17, 2025
The `RabbitMQNodeDown` alert assumed that all deployments involve
exactly three RabbitMQ nodes. However, this is not always the case, as
we support deployments with a single node or with more than three.

Previously, this would have caused false alerts in deployments with a
single RabbitMQ node, while concealing alerts in deployments with more
than three nodes.
@jackhodgkiss jackhodgkiss force-pushed the fix-rabbitmq-node-down-rule branch from 61b564c to e183052 Compare March 23, 2025 12:39
@jackhodgkiss jackhodgkiss requested review from jovial and MoteHue March 23, 2025 12:40
@jackhodgkiss jackhodgkiss changed the title fix: use controller length for RabbitMQNodeDown fix: use rabbitmq length for RabbitMQNodeDown Mar 24, 2025
Contributor

@MoteHue MoteHue left a comment

LGTM now, thanks!

MoteHue previously approved these changes Mar 24, 2025
@jackhodgkiss jackhodgkiss marked this pull request as draft March 24, 2025 22:54
@jackhodgkiss
Contributor Author

This fails to template correctly.

  - alert: RabbitMQNodeDown
    expr: sum(rabbitmq_build_info{instance!=""}) < {{ groups['rabbitmq'] | length }}
    for: 30m
    labels:

@Alex-Welsh
Member

Kolla-Ansible uses copy, not template, for rules files [1], so they can either be hard-coded or templated by Kayobe.

Possible Kayobe groups are: all, ungrouped, seed, seed-hypervisor, container-image-builders, hypervisors, infra-vms, wazuh-manager, wazuh-agent, github-runners, github-writer, controllers, network, monitoring, storage, compute-vgpu, compute, overcloud, vgpu, iommu, mlnx, docker, docker-registry, ntp, baremetal-compute, mgmt-switches, ctl-switches, hs-switches, switches, ceph, mons, mgrs, osds, rgws, cis-hardening, redfish_exporter_targets, fix-hostname, tempest_runner, controllers_with_ironic_enabled_False

In the short term, I'd say we add a new variable in SKC that defaults to the length of the controllers group, and add a backlog task to make the Prometheus rules files templatable in KA.
[1] https://github.com/openstack/kolla-ansible/blob/master/ansible/roles/prometheus/tasks/config.yml#L38
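
The short-term proposal above (a variable that defaults to the size of the controllers group but can be overridden) can be sketched in Python. This is only an illustration of the defaulting logic: `expected_node_count`, `render_expr`, the `override` parameter, and the host names are hypothetical, not Kayobe or Kolla-Ansible APIs, and the real change would live in configuration rather than code.

```python
from typing import Mapping, Optional, Sequence


def expected_node_count(
    groups: Mapping[str, Sequence[str]],
    override: Optional[int] = None,
) -> int:
    """Default to the size of the controllers group unless overridden."""
    if override is not None:
        return override
    return len(groups.get("controllers", []))


def render_expr(count: int) -> str:
    """Build the alert expression with a concrete node count filled in."""
    return 'sum(rabbitmq_build_info{instance!=""}) < ' + str(count)


# Hypothetical inventory with five controllers.
groups = {"controllers": ["ctl0", "ctl1", "ctl2", "ctl3", "ctl4"]}

# Default: the threshold follows the controllers group.
print(render_expr(expected_node_count(groups)))
# Override: for sites where RabbitMQ does not run on every controller.
print(render_expr(expected_node_count(groups, override=3)))
```

Keeping the count overridable matters because the number of RabbitMQ nodes need not equal the number of controllers, which is the distinction behind the PR title changing from "controller length" to "rabbitmq length".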

@seunghun1ee
Member

Good idea. Happy to +1 once it's in ready-to-review state

@jackhodgkiss jackhodgkiss marked this pull request as ready for review May 18, 2025 12:11
Member

@Alex-Welsh Alex-Welsh left a comment

Thanks Jack, I think this solution works well. We just need to update the release note.

@jackhodgkiss jackhodgkiss force-pushed the fix-rabbitmq-node-down-rule branch from e62f3fb to 747181f Compare May 19, 2025 13:17
@Alex-Welsh Alex-Welsh enabled auto-merge (squash) May 19, 2025 13:20
@Alex-Welsh Alex-Welsh merged commit 64da1b1 into stackhpc/2024.1 May 19, 2025
15 checks passed
@Alex-Welsh Alex-Welsh deleted the fix-rabbitmq-node-down-rule branch May 19, 2025 15:09
Labels
monitoring All things related to observability & telemetry size: s
5 participants