Skip to content

CLI: new health check that detects QQs without an elected leader #13433

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 25 commits into from

Conversation

Ayanda-D
Copy link
Contributor

@Ayanda-D Ayanda-D commented Feb 26, 2025

Proposed Changes

Hi Team 👋

The following changes are for a CLI diagnostics tool used for checking the health of Quorum Queue leaders. The main use is to detect presence quorum queues which are in a bad state (e.g. in particular, leaderless, where no messaging can be done) as reported and discussed here: #13101. There are some edge cases where quorum queues are observed to "lose leaders" and re-election is not triggered leaving such queues in bad state, and where any node restarts could result in complete message loss. This diagnostics tool allows us to carry out quick QQ leader health checks per vhost, or globally across all vhosts on a node/cluster for a specific queue match-spec. The command signature is as follows:

rabbitmq-diagnostics check_for_quorum_queues_without_an_elected_leader [--vhost <vhost>] [--across-all-vhosts] <pattern>

Output on a healthy node would be as follows:

rabbitmq-diagnostics check_for_quorum_queues_without_an_elected_leader --across-all-vhosts ".*"
Checking availability and health status of leaders for quorum queues matching .* in all vhosts ...
┌────────────────────────────────────────────────────────────────────────────────────────┐
│ Output                                                                                 │
├────────────────────────────────────────────────────────────────────────────────────────┤
│ Node rabbit@host-1 reported all quorum queue leaders as healthy                        │
└────────────────────────────────────────────────────────────────────────────────────────┘

Output on a node with unhealthy quorum queue leaders would be as follows (json formatted):

rabbitmq-diagnostics check_for_quorum_queues_without_an_elected_leader --across-all-vhosts ".*"  --formatter json
{"message":"Node rabbit@host-2 reported unhealthy quorum queue leaders","queues":[{"name":"TEST-QQ-1","readable_name":"queue 'TEST-QQ-1' in vhost 'VHOST-A'","type":"rabbit_quorum_queue","virtual_host":"VHOST-A"},{"name":"TEST-QQ-2","readable_name":"queue 'TEST-QQ-2' in vhost 'VHOST-B'","type":"rabbit_quorum_queue","virtual_host":"VHOST-B"}],"result":"error"}

In addition to the command implementation, few items to note:

  • The command spawns and executes leader-health-check procedures concurrently per QQ and gathers results for fast execution
  • Each leader-health-check procedure queries the leaderboard and executes a ping as an aliveness and health check per leader
  • A ProcessLimitThreshold is set to 40% of the node's process_limit to ensure process_count when executing these checks will never reach the node's process_limit (and halt the node).
  • The CLI output.ex is extended to support :check_passed results with a payload. This is simply to help for clarity for commands such as these in which the command check is a successful operation, while the result itself is an error (list of unhealthy QQs).

This diagnostics tool would help catch leaderless queues, which on the Management UI, are listed as follows:

LEADERLESS-QUORUM-QUEUE-1

Some environments are quite risky to carrying out maintenance operations such as complete node shutdown. The main goal is to help avoid situations where nodes/queues are assumed to be in a good state (where other checks such as quorum critical checks which rely on quorum node count pass) while QQ leaders may be in an unhealthy state. If nodes pass both these checks, then we can deem them healthy and OK for various maintenance procedures. Following on, on this detection tool, are investigations to also fix the underlying problem reported in #13101.

Please take a look/review. We're keen on having these capabilities available upstream to catch and monitor this recurring issue^ 👍

Types of Changes

What types of changes does your code introduce to this project?
Put an x in the boxes that apply

  • Bug fix (non-breaking change which fixes issue #NNNN)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause an observable behavior change in existing systems)
  • Documentation improvements (corrections, new content, etc)
  • Cosmetic change (whitespace, formatting, etc)
  • Build system and/or CI

Checklist

Put an x in the boxes that apply.
You can also fill these out after creating the PR.
If you're unsure about any of them, don't hesitate to ask on the mailing list.
We're here to help!
This is simply a reminder of what we are going to look for before merging your code.

  • I have read the CONTRIBUTING.md document
  • I have signed the CA (see https://cla.pivotal.io/sign/rabbitmq)
  • I have added tests that prove my fix is effective or that my feature works
  • All tests pass locally with my changes
  • If relevant, I have added necessary documentation to https://github.com/rabbitmq/rabbitmq-website
  • If relevant, I have added this change to the first version(s) in release-notes that I expect to introduce it

Further Comments

If this is a relatively large or complex change, kick off the discussion by explaining why you chose the solution
you did and what alternatives you considered, etc.

Copy link
Collaborator

@michaelklishin michaelklishin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is easy to see from

rabbitmq-diagnostics help | grep chec

that all existing health check commands except for the deprecated no-op one begin with check_, so this one should be named check_for_quorum_queues_without_an_elected_leader.

I'd rename --global to --across-all-vhosts.

@@ -144,6 +147,8 @@
-define(SNAPSHOT_INTERVAL, 8192). %% the ra default is 4096
% -define(UNLIMITED_PREFETCH_COUNT, 2000). %% something large for ra
-define(MIN_CHECKPOINT_INTERVAL, 8192). %% the ra default is 16384
-define(LEADER_HEALTH_CHECK_TIMEOUT, 1_000).
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Timeouts lower than 5s are guaranteed to result in false positives.

leader_health_check(QueueNameOrRegEx, VHost) ->
%% Set a process limit threshold to 40% of ErlangVM process limit, beyond which
%% we cannot spawn any new processes for executing QQ leader health checks.
ProcessLimitThreshold = round(0.4 * erlang:system_info(process_limit)),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

40% sounds like a lot for a health check. I'd use 20% at most.

Qs =
case VHost of
global ->
rabbit_amqqueue:list();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The modern modules for working with schema data stores is rabbit_db_queue. It is aware of the metadata store used.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It also provides functions such as get_all_by_type/1.

@michaelklishin michaelklishin changed the title Diagnostics tool for QQ leader health checks CLI: new health check that detects QQs without an elected leader Feb 26, 2025
Copy link
Contributor

@gomoripeti gomoripeti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great initiative!

What do you think if there would be a function check_local_leader_health() that would only check local leaders therefore it would avoid a lot of potential inter-node communication (a ping for each remote queue) and a wrapper function which calls check_local_leader_health() via rpc on every node? Similar to a common pattern in RabbitMQ

@michaelklishin
Copy link
Collaborator

@gomoripeti we even have separate health checks, e.g. for local vs. cluster-wide alarms.

I don't know if we need two separate health checks in this case. I'd try to get this command right first.

FTR, rabbit_quorum_queue:list_with_local_promotable/0 demonstrates how to filter only the local QQs (that have a replica on node()).

and use across_all_vhosts option for global checks
and introduce rabbit_db_queue:get_all_by_type_and_vhost/2.
Update leader health check timeout to 5s and process limit
threshold to 20% of node's process_limit.
@Ayanda-D
Copy link
Contributor Author

@gomoripeti @michaelklishin thanks for the feedback. For local or remote call optimization, gen_statem should already do that under-hood (if I'm understanding the question correctly). For this case though, large cluster sizes, purpose of this command is to list leaderless queues without any awareness of whether its local or remote, and only executed at specific times. Didnt see the need to filter locally or remote? Unless you see the need for this extension.

Copy link
Collaborator

@michaelklishin michaelklishin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cd deps/rabbit;
gmake dialyze

now fails with

rabbit_quorum_queue.erl:2189:5: Expression produces a value of type 
          {'error', _, _} | {'ok', _, _}, but this value is unmatched

@Ayanda-D Ayanda-D requested a review from michaelklishin March 3, 2025 12:30
_ ->
From ! {error, HealthCheckRef, QResource}
end,
{_, _, _} =
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds like _ = … then would do?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, dialyzer didnt complain for both. switching to _ if thats preferred 👍

def output({:error, unhealthy_queues}, %{vhost: _vhost}) when is_list(unhealthy_queues) do
lines = queue_lines(unhealthy_queues)

{:ok, :check_passed, Enum.join(lines, line_separator())}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Ayanda-D if this is an error, why does this return an :ok and :check_passed?

As things stand right now, this command will always exit with a code of zero. That's not at all the convention we have for all the other health checks.

It should have been

{:error, :check_failed}

or one of its variations.

Copy link
Contributor Author

@Ayanda-D Ayanda-D Mar 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@michaelklishin Thank you for the QA/updates/merge. We previously had {:error, :check_failed} 68739a6 then switched to {:ok, :check_passed, payload} to get error code 0 instead of 69, meaning "command ran successfully and returned", then let the caller handle the result "payload", be it ok or error. Keeping context in result code as well sounds right, as per "health checks" convention.

Thanks! We're happy to now have these leader checks available upstream in 4.x 🎉

CC: @25schmekles (take note, this has changed^^^).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm working on adding an HTTP API endpoint and rabbitmqadmin v2 support.

michaelklishin added a commit that referenced this pull request Mar 12, 2025
… --silent

plus simplify function heads.

References #13433.
michaelklishin added a commit that referenced this pull request Mar 12, 2025
… --silent

plus simplify function heads.

References #13433.

(cherry picked from commit 7b39231)
@michaelklishin
Copy link
Collaborator

Moved to #13487.

michaelklishin added a commit that referenced this pull request Mar 12, 2025
…d reachable leader #13433 (#13487)

* Implement rabbitmq-queues leader_health_check command for quorum queues

(cherry picked from commit c26edbe)

* Tests for rabbitmq-queues leader_health_check command

(cherry picked from commit 6cc03b0)

* Ensure calling ParentPID in leader health check execution and
reuse and extend formatting API, with amqqueue:to_printable/2

(cherry picked from commit 76d66a1)

* Extend core leader health check tests and update badrpc error handling in cli tests

(cherry picked from commit 857e2a7)

* Refactor leader_health_check command validators and ignore vhost arg

(cherry picked from commit 6cf9339)

* Update leader_health_check_command description and banner

(cherry picked from commit 96b8bce)

* Improve output formatting for healthy leaders and support
silent mode in rabbitmq-queues leader_health_check command

(cherry picked from commit 239a69b)

* Support global flag to run leader health check for
all queues in all vhosts on local node

(cherry picked from commit 48ba3e1)

* Return immediately for leader health checks on empty vhosts

(cherry picked from commit 7873737)

* Rename leader health check timeout refs

(cherry picked from commit b7dec89)

* Update banner message for global leader health check

(cherry picked from commit c7da4d5)

* QQ leader-health-check: check_process_limit_safety before spawning leader checks

(cherry picked from commit 1736845)

* Log leader health check result in broker logs (if any leaderless queues)

(cherry picked from commit 1084179)

* Ensure check_passed result for leader health internal calls)

(cherry picked from commit 68739a6)

* Extend CLI format output to process check_passed payload

(cherry picked from commit 5f5e992)

* Format leader healthcheck result log and function exports

(cherry picked from commit ebffd7d)

* Change leader_health_check command scope from queues to diagnostics

(cherry picked from commit 663fc98)

* Update (c) line year

(cherry picked from commit df82f12)

* Rename command to check_for_quorum_queues_without_an_elected_leader
and use across_all_vhosts option for global checks

(cherry picked from commit b2acbae)

* Use rabbit_db_queue for qq leader health check lookups
and introduce rabbit_db_queue:get_all_by_type_and_vhost/2.
Update leader health check timeout to 5s and process limit
threshold to 20% of node's process_limit.

(cherry picked from commit 7a8e166)

* Update tests: quorum_queue_SUITE and rabbit_db_queue_SUITE

(cherry picked from commit 9bdb81f)

* Fix typo (cli test module)

(cherry picked from commit 6158568)

* Small refactor - simpler final leader health check result return on function head match

(cherry picked from commit ea07938)

* Clear dialyzer warning & fix type spec

(cherry picked from commit a45aa81)

* Ignore result without strict match to avoid diayzer warning

(cherry picked from commit bb43c0b)

* 'rabbitmq-diagnostics check_for_quorum_queues_without_an_elected_leader' documentation edits

(cherry picked from commit 845230b)

* 'rabbitmq-diagnostics check_for_quorum_queues_without_an_elected_leader' output copywriting

(cherry picked from commit 235f43b)

* diagnostics check_for_quorum_queues_without_an_elected_leader: behave like a health check w.r.t. error reporting

(cherry picked from commit db73767)

* check_for_quorum_queues_without_an_elected_leader: handle --quiet and --silent

plus simplify function heads.

References #13433.

(cherry picked from commit 7b39231)

---------

Co-authored-by: Ayanda Dube <[email protected]>
mergify bot pushed a commit that referenced this pull request Mar 12, 2025
…d reachable leader #13433 (#13487)

* Implement rabbitmq-queues leader_health_check command for quorum queues

(cherry picked from commit c26edbe)

* Tests for rabbitmq-queues leader_health_check command

(cherry picked from commit 6cc03b0)

* Ensure calling ParentPID in leader health check execution and
reuse and extend formatting API, with amqqueue:to_printable/2

(cherry picked from commit 76d66a1)

* Extend core leader health check tests and update badrpc error handling in cli tests

(cherry picked from commit 857e2a7)

* Refactor leader_health_check command validators and ignore vhost arg

(cherry picked from commit 6cf9339)

* Update leader_health_check_command description and banner

(cherry picked from commit 96b8bce)

* Improve output formatting for healthy leaders and support
silent mode in rabbitmq-queues leader_health_check command

(cherry picked from commit 239a69b)

* Support global flag to run leader health check for
all queues in all vhosts on local node

(cherry picked from commit 48ba3e1)

* Return immediately for leader health checks on empty vhosts

(cherry picked from commit 7873737)

* Rename leader health check timeout refs

(cherry picked from commit b7dec89)

* Update banner message for global leader health check

(cherry picked from commit c7da4d5)

* QQ leader-health-check: check_process_limit_safety before spawning leader checks

(cherry picked from commit 1736845)

* Log leader health check result in broker logs (if any leaderless queues)

(cherry picked from commit 1084179)

* Ensure check_passed result for leader health internal calls)

(cherry picked from commit 68739a6)

* Extend CLI format output to process check_passed payload

(cherry picked from commit 5f5e992)

* Format leader healthcheck result log and function exports

(cherry picked from commit ebffd7d)

* Change leader_health_check command scope from queues to diagnostics

(cherry picked from commit 663fc98)

* Update (c) line year

(cherry picked from commit df82f12)

* Rename command to check_for_quorum_queues_without_an_elected_leader
and use across_all_vhosts option for global checks

(cherry picked from commit b2acbae)

* Use rabbit_db_queue for qq leader health check lookups
and introduce rabbit_db_queue:get_all_by_type_and_vhost/2.
Update leader health check timeout to 5s and process limit
threshold to 20% of node's process_limit.

(cherry picked from commit 7a8e166)

* Update tests: quorum_queue_SUITE and rabbit_db_queue_SUITE

(cherry picked from commit 9bdb81f)

* Fix typo (cli test module)

(cherry picked from commit 6158568)

* Small refactor - simpler final leader health check result return on function head match

(cherry picked from commit ea07938)

* Clear dialyzer warning & fix type spec

(cherry picked from commit a45aa81)

* Ignore result without strict match to avoid diayzer warning

(cherry picked from commit bb43c0b)

* 'rabbitmq-diagnostics check_for_quorum_queues_without_an_elected_leader' documentation edits

(cherry picked from commit 845230b)

* 'rabbitmq-diagnostics check_for_quorum_queues_without_an_elected_leader' output copywriting

(cherry picked from commit 235f43b)

* diagnostics check_for_quorum_queues_without_an_elected_leader: behave like a health check w.r.t. error reporting

(cherry picked from commit db73767)

* check_for_quorum_queues_without_an_elected_leader: handle --quiet and --silent

plus simplify function heads.

References #13433.

(cherry picked from commit 7b39231)

---------

Co-authored-by: Ayanda Dube <[email protected]>
(cherry picked from commit 09f1ab4)
michaelklishin added a commit that referenced this pull request Mar 12, 2025
By @Ayanda-D: new CLI health check that detects QQs without an elected reachable leader #13433 (backport #13487)
mergify bot pushed a commit that referenced this pull request Mar 12, 2025
…d reachable leader #13433 (#13487)

* Implement rabbitmq-queues leader_health_check command for quorum queues

(cherry picked from commit c26edbe)

* Tests for rabbitmq-queues leader_health_check command

(cherry picked from commit 6cc03b0)

* Ensure calling ParentPID in leader health check execution and
reuse and extend formatting API, with amqqueue:to_printable/2

(cherry picked from commit 76d66a1)

* Extend core leader health check tests and update badrpc error handling in cli tests

(cherry picked from commit 857e2a7)

* Refactor leader_health_check command validators and ignore vhost arg

(cherry picked from commit 6cf9339)

* Update leader_health_check_command description and banner

(cherry picked from commit 96b8bce)

* Improve output formatting for healthy leaders and support
silent mode in rabbitmq-queues leader_health_check command

(cherry picked from commit 239a69b)

* Support global flag to run leader health check for
all queues in all vhosts on local node

(cherry picked from commit 48ba3e1)

* Return immediately for leader health checks on empty vhosts

(cherry picked from commit 7873737)

* Rename leader health check timeout refs

(cherry picked from commit b7dec89)

* Update banner message for global leader health check

(cherry picked from commit c7da4d5)

* QQ leader-health-check: check_process_limit_safety before spawning leader checks

(cherry picked from commit 1736845)

* Log leader health check result in broker logs (if any leaderless queues)

(cherry picked from commit 1084179)

* Ensure check_passed result for leader health internal calls)

(cherry picked from commit 68739a6)

* Extend CLI format output to process check_passed payload

(cherry picked from commit 5f5e992)

* Format leader healthcheck result log and function exports

(cherry picked from commit ebffd7d)

* Change leader_health_check command scope from queues to diagnostics

(cherry picked from commit 663fc98)

* Update (c) line year

(cherry picked from commit df82f12)

* Rename command to check_for_quorum_queues_without_an_elected_leader
and use across_all_vhosts option for global checks

(cherry picked from commit b2acbae)

* Use rabbit_db_queue for qq leader health check lookups
and introduce rabbit_db_queue:get_all_by_type_and_vhost/2.
Update leader health check timeout to 5s and process limit
threshold to 20% of node's process_limit.

(cherry picked from commit 7a8e166)

* Update tests: quorum_queue_SUITE and rabbit_db_queue_SUITE

(cherry picked from commit 9bdb81f)

* Fix typo (cli test module)

(cherry picked from commit 6158568)

* Small refactor - simpler final leader health check result return on function head match

(cherry picked from commit ea07938)

* Clear dialyzer warning & fix type spec

(cherry picked from commit a45aa81)

* Ignore result without strict match to avoid diayzer warning

(cherry picked from commit bb43c0b)

* 'rabbitmq-diagnostics check_for_quorum_queues_without_an_elected_leader' documentation edits

(cherry picked from commit 845230b)

* 'rabbitmq-diagnostics check_for_quorum_queues_without_an_elected_leader' output copywriting

(cherry picked from commit 235f43b)

* diagnostics check_for_quorum_queues_without_an_elected_leader: behave like a health check w.r.t. error reporting

(cherry picked from commit db73767)

* check_for_quorum_queues_without_an_elected_leader: handle --quiet and --silent

plus simplify function heads.

References #13433.

(cherry picked from commit 7b39231)

---------

Co-authored-by: Ayanda Dube <[email protected]>
(cherry picked from commit 09f1ab4)
(cherry picked from commit e1d7481)

# Conflicts:
#	deps/rabbit/test/quorum_queue_SUITE.erl
michaelklishin added a commit that referenced this pull request Mar 12, 2025
By @Ayanda-D: new CLI health check that detects QQs without an elected reachable leader #13433 (backport #13487) (backport #13488)
michaelklishin added a commit that referenced this pull request Mar 17, 2025
…d reachable leader #13433 (#13487)

* Implement rabbitmq-queues leader_health_check command for quorum queues

(cherry picked from commit c26edbe)

* Tests for rabbitmq-queues leader_health_check command

(cherry picked from commit 6cc03b0)

* Ensure calling ParentPID in leader health check execution and
reuse and extend formatting API, with amqqueue:to_printable/2

(cherry picked from commit 76d66a1)

* Extend core leader health check tests and update badrpc error handling in cli tests

(cherry picked from commit 857e2a7)

* Refactor leader_health_check command validators and ignore vhost arg

(cherry picked from commit 6cf9339)

* Update leader_health_check_command description and banner

(cherry picked from commit 96b8bce)

* Improve output formatting for healthy leaders and support
silent mode in rabbitmq-queues leader_health_check command

(cherry picked from commit 239a69b)

* Support global flag to run leader health check for
all queues in all vhosts on local node

(cherry picked from commit 48ba3e1)

* Return immediately for leader health checks on empty vhosts

(cherry picked from commit 7873737)

* Rename leader health check timeout refs

(cherry picked from commit b7dec89)

* Update banner message for global leader health check

(cherry picked from commit c7da4d5)

* QQ leader-health-check: check_process_limit_safety before spawning leader checks

(cherry picked from commit 1736845)

* Log leader health check result in broker logs (if any leaderless queues)

(cherry picked from commit 1084179)

* Ensure check_passed result for leader health internal calls)

(cherry picked from commit 68739a6)

* Extend CLI format output to process check_passed payload

(cherry picked from commit 5f5e992)

* Format leader healthcheck result log and function exports

(cherry picked from commit ebffd7d)

* Change leader_health_check command scope from queues to diagnostics

(cherry picked from commit 663fc98)

* Update (c) line year

(cherry picked from commit df82f12)

* Rename command to check_for_quorum_queues_without_an_elected_leader
and use across_all_vhosts option for global checks

(cherry picked from commit b2acbae)

* Use rabbit_db_queue for qq leader health check lookups
and introduce rabbit_db_queue:get_all_by_type_and_vhost/2.
Update leader health check timeout to 5s and process limit
threshold to 20% of node's process_limit.

(cherry picked from commit 7a8e166)

* Update tests: quorum_queue_SUITE and rabbit_db_queue_SUITE

(cherry picked from commit 9bdb81f)

* Fix typo (cli test module)

(cherry picked from commit 6158568)

* Small refactor - simpler final leader health check result return on function head match

(cherry picked from commit ea07938)

* Clear dialyzer warning & fix type spec

(cherry picked from commit a45aa81)

* Ignore result without strict match to avoid diayzer warning

(cherry picked from commit bb43c0b)

* 'rabbitmq-diagnostics check_for_quorum_queues_without_an_elected_leader' documentation edits

(cherry picked from commit 845230b)

* 'rabbitmq-diagnostics check_for_quorum_queues_without_an_elected_leader' output copywriting

(cherry picked from commit 235f43b)

* diagnostics check_for_quorum_queues_without_an_elected_leader: behave like a health check w.r.t. error reporting

(cherry picked from commit db73767)

* check_for_quorum_queues_without_an_elected_leader: handle --quiet and --silent

plus simplify function heads.

References #13433.

(cherry picked from commit 7b39231)

---------

Co-authored-by: Ayanda Dube <[email protected]>
ikavgo pushed a commit that referenced this pull request May 5, 2025
…d reachable leader #13433 (#13487)

* Implement rabbitmq-queues leader_health_check command for quorum queues

(cherry picked from commit c26edbe)

* Tests for rabbitmq-queues leader_health_check command

(cherry picked from commit 6cc03b0)

* Ensure calling ParentPID in leader health check execution and
reuse and extend formatting API, with amqqueue:to_printable/2

(cherry picked from commit 76d66a1)

* Extend core leader health check tests and update badrpc error handling in cli tests

(cherry picked from commit 857e2a7)

* Refactor leader_health_check command validators and ignore vhost arg

(cherry picked from commit 6cf9339)

* Update leader_health_check_command description and banner

(cherry picked from commit 96b8bce)

* Improve output formatting for healthy leaders and support
silent mode in rabbitmq-queues leader_health_check command

(cherry picked from commit 239a69b)

* Support global flag to run leader health check for
all queues in all vhosts on local node

(cherry picked from commit 48ba3e1)

* Return immediately for leader health checks on empty vhosts

(cherry picked from commit 7873737)

* Rename leader health check timeout refs

(cherry picked from commit b7dec89)

* Update banner message for global leader health check

(cherry picked from commit c7da4d5)

* QQ leader-health-check: check_process_limit_safety before spawning leader checks

(cherry picked from commit 1736845)

* Log leader health check result in broker logs (if any leaderless queues)

(cherry picked from commit 1084179)

* Ensure check_passed result for leader health internal calls)

(cherry picked from commit 68739a6)

* Extend CLI format output to process check_passed payload

(cherry picked from commit 5f5e992)

* Format leader healthcheck result log and function exports

(cherry picked from commit ebffd7d)

* Change leader_health_check command scope from queues to diagnostics

(cherry picked from commit 663fc98)

* Update (c) line year

(cherry picked from commit df82f12)

* Rename command to check_for_quorum_queues_without_an_elected_leader
and use across_all_vhosts option for global checks

(cherry picked from commit b2acbae)

* Use rabbit_db_queue for qq leader health check lookups
and introduce rabbit_db_queue:get_all_by_type_and_vhost/2.
Update leader health check timeout to 5s and process limit
threshold to 20% of node's process_limit.

(cherry picked from commit 7a8e166)

* Update tests: quorum_queue_SUITE and rabbit_db_queue_SUITE

(cherry picked from commit 9bdb81f)

* Fix typo (cli test module)

(cherry picked from commit 6158568)

* Small refactor - simpler final leader health check result return on function head match

(cherry picked from commit ea07938)

* Clear dialyzer warning & fix type spec

(cherry picked from commit a45aa81)

* Ignore result without strict match to avoid diayzer warning

(cherry picked from commit bb43c0b)

* 'rabbitmq-diagnostics check_for_quorum_queues_without_an_elected_leader' documentation edits

(cherry picked from commit 845230b)

* 'rabbitmq-diagnostics check_for_quorum_queues_without_an_elected_leader' output copywriting

(cherry picked from commit 235f43b)

* diagnostics check_for_quorum_queues_without_an_elected_leader: behave like a health check w.r.t. error reporting

(cherry picked from commit db73767)

* check_for_quorum_queues_without_an_elected_leader: handle --quiet and --silent

plus simplify function heads.

References #13433.

(cherry picked from commit 7b39231)

---------

Co-authored-by: Ayanda Dube <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants