-
Notifications
You must be signed in to change notification settings - Fork 3.9k
CLI: new health check that detects QQs without an elected leader #13433
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
reuse and extend formatting API, with amqqueue:to_printable/2
silent mode in rabbitmq-queues leader_health_check command
all queues in all vhosts on local node
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is easy to see from
rabbitmq-diagnostics help | grep chec
that all existing health check commands except for the deprecated no-op one begin with check_
, so this one should be named check_for_quorum_queues_without_an_elected_leader
.
I'd rename --global
to --across-all-vhosts
.
@@ -144,6 +147,8 @@ | |||
-define(SNAPSHOT_INTERVAL, 8192). %% the ra default is 4096 | |||
% -define(UNLIMITED_PREFETCH_COUNT, 2000). %% something large for ra | |||
-define(MIN_CHECKPOINT_INTERVAL, 8192). %% the ra default is 16384 | |||
-define(LEADER_HEALTH_CHECK_TIMEOUT, 1_000). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Timeouts lower than 5s are guaranteed to result in false positives.
leader_health_check(QueueNameOrRegEx, VHost) -> | ||
%% Set a process limit threshold to 40% of ErlangVM process limit, beyond which | ||
%% we cannot spawn any new processes for executing QQ leader health checks. | ||
ProcessLimitThreshold = round(0.4 * erlang:system_info(process_limit)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
40% sounds like a lot for a health check. I'd use 20% at most.
Qs = | ||
case VHost of | ||
global -> | ||
rabbit_amqqueue:list(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The modern modules for working with schema data stores is rabbit_db_queue
. It is aware of the metadata store used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It also provides functions such as get_all_by_type/1
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
great initiative!
What do you think if there would be a function check_local_leader_health()
that would only check local leaders therefore it would avoid a lot of potential inter-node communication (a ping for each remote queue) and a wrapper function which calls check_local_leader_health()
via rpc on every node? Similar to a common pattern in RabbitMQ
@gomoripeti we even have separate health checks, e.g. for local vs. cluster-wide alarms. I don't know if we need two separate health checks in this case. I'd try to get this command right first. FTR, |
and use across_all_vhosts option for global checks
and introduce rabbit_db_queue:get_all_by_type_and_vhost/2. Update leader health check timeout to 5s and process limit threshold to 20% of node's process_limit.
@gomoripeti @michaelklishin thanks for the feedback. For local or remote call optimization, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cd deps/rabbit;
gmake dialyze
now fails with
rabbit_quorum_queue.erl:2189:5: Expression produces a value of type
{'error', _, _} | {'ok', _, _}, but this value is unmatched
_ -> | ||
From ! {error, HealthCheckRef, QResource} | ||
end, | ||
{_, _, _} = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds like _ = …
then would do?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, dialyzer didnt complain for both. switching to _
if thats preferred 👍
def output({:error, unhealthy_queues}, %{vhost: _vhost}) when is_list(unhealthy_queues) do | ||
lines = queue_lines(unhealthy_queues) | ||
|
||
{:ok, :check_passed, Enum.join(lines, line_separator())} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Ayanda-D if this is an error, why does this return an :ok
and :check_passed
?
As things stand right now, this command will always exit with a code of zero. That's not at all the convention we have for all the other health checks.
It should have been
{:error, :check_failed}
or one of its variations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@michaelklishin Thank you for the QA/updates/merge. We previously had {:error, :check_failed}
68739a6 then switched to {:ok, :check_passed, payload}
to get error code 0 instead of 69, meaning "command ran successfully and returned", then let the caller handle the result "payload", be it ok or error. Keeping context in result code as well sounds right, as per "health checks" convention.
Thanks! We're happy to now have these leader checks available upstream in 4.x 🎉
CC: @25schmekles (take note, this has changed^^^).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm working on adding an HTTP API endpoint and rabbitmqadmin
v2 support.
… --silent plus simplify function heads. References #13433.
Moved to #13487. |
…d reachable leader #13433 (#13487) * Implement rabbitmq-queues leader_health_check command for quorum queues (cherry picked from commit c26edbe) * Tests for rabbitmq-queues leader_health_check command (cherry picked from commit 6cc03b0) * Ensure calling ParentPID in leader health check execution and reuse and extend formatting API, with amqqueue:to_printable/2 (cherry picked from commit 76d66a1) * Extend core leader health check tests and update badrpc error handling in cli tests (cherry picked from commit 857e2a7) * Refactor leader_health_check command validators and ignore vhost arg (cherry picked from commit 6cf9339) * Update leader_health_check_command description and banner (cherry picked from commit 96b8bce) * Improve output formatting for healthy leaders and support silent mode in rabbitmq-queues leader_health_check command (cherry picked from commit 239a69b) * Support global flag to run leader health check for all queues in all vhosts on local node (cherry picked from commit 48ba3e1) * Return immediately for leader health checks on empty vhosts (cherry picked from commit 7873737) * Rename leader health check timeout refs (cherry picked from commit b7dec89) * Update banner message for global leader health check (cherry picked from commit c7da4d5) * QQ leader-health-check: check_process_limit_safety before spawning leader checks (cherry picked from commit 1736845) * Log leader health check result in broker logs (if any leaderless queues) (cherry picked from commit 1084179) * Ensure check_passed result for leader health internal calls) (cherry picked from commit 68739a6) * Extend CLI format output to process check_passed payload (cherry picked from commit 5f5e992) * Format leader healthcheck result log and function exports (cherry picked from commit ebffd7d) * Change leader_health_check command scope from queues to diagnostics (cherry picked from commit 663fc98) * Update (c) line year (cherry picked from commit df82f12) * Rename command to check_for_quorum_queues_without_an_elected_leader and use across_all_vhosts option for global checks (cherry picked from commit b2acbae) * Use rabbit_db_queue for qq leader health check lookups and introduce rabbit_db_queue:get_all_by_type_and_vhost/2. Update leader health check timeout to 5s and process limit threshold to 20% of node's process_limit. (cherry picked from commit 7a8e166) * Update tests: quorum_queue_SUITE and rabbit_db_queue_SUITE (cherry picked from commit 9bdb81f) * Fix typo (cli test module) (cherry picked from commit 6158568) * Small refactor - simpler final leader health check result return on function head match (cherry picked from commit ea07938) * Clear dialyzer warning & fix type spec (cherry picked from commit a45aa81) * Ignore result without strict match to avoid diayzer warning (cherry picked from commit bb43c0b) * 'rabbitmq-diagnostics check_for_quorum_queues_without_an_elected_leader' documentation edits (cherry picked from commit 845230b) * 'rabbitmq-diagnostics check_for_quorum_queues_without_an_elected_leader' output copywriting (cherry picked from commit 235f43b) * diagnostics check_for_quorum_queues_without_an_elected_leader: behave like a health check w.r.t. error reporting (cherry picked from commit db73767) * check_for_quorum_queues_without_an_elected_leader: handle --quiet and --silent plus simplify function heads. References #13433. (cherry picked from commit 7b39231) --------- Co-authored-by: Ayanda Dube <[email protected]>
…d reachable leader #13433 (#13487) * Implement rabbitmq-queues leader_health_check command for quorum queues (cherry picked from commit c26edbe) * Tests for rabbitmq-queues leader_health_check command (cherry picked from commit 6cc03b0) * Ensure calling ParentPID in leader health check execution and reuse and extend formatting API, with amqqueue:to_printable/2 (cherry picked from commit 76d66a1) * Extend core leader health check tests and update badrpc error handling in cli tests (cherry picked from commit 857e2a7) * Refactor leader_health_check command validators and ignore vhost arg (cherry picked from commit 6cf9339) * Update leader_health_check_command description and banner (cherry picked from commit 96b8bce) * Improve output formatting for healthy leaders and support silent mode in rabbitmq-queues leader_health_check command (cherry picked from commit 239a69b) * Support global flag to run leader health check for all queues in all vhosts on local node (cherry picked from commit 48ba3e1) * Return immediately for leader health checks on empty vhosts (cherry picked from commit 7873737) * Rename leader health check timeout refs (cherry picked from commit b7dec89) * Update banner message for global leader health check (cherry picked from commit c7da4d5) * QQ leader-health-check: check_process_limit_safety before spawning leader checks (cherry picked from commit 1736845) * Log leader health check result in broker logs (if any leaderless queues) (cherry picked from commit 1084179) * Ensure check_passed result for leader health internal calls) (cherry picked from commit 68739a6) * Extend CLI format output to process check_passed payload (cherry picked from commit 5f5e992) * Format leader healthcheck result log and function exports (cherry picked from commit ebffd7d) * Change leader_health_check command scope from queues to diagnostics (cherry picked from commit 663fc98) * Update (c) line year (cherry picked from commit df82f12) * Rename command to check_for_quorum_queues_without_an_elected_leader and use across_all_vhosts option for global checks (cherry picked from commit b2acbae) * Use rabbit_db_queue for qq leader health check lookups and introduce rabbit_db_queue:get_all_by_type_and_vhost/2. Update leader health check timeout to 5s and process limit threshold to 20% of node's process_limit. (cherry picked from commit 7a8e166) * Update tests: quorum_queue_SUITE and rabbit_db_queue_SUITE (cherry picked from commit 9bdb81f) * Fix typo (cli test module) (cherry picked from commit 6158568) * Small refactor - simpler final leader health check result return on function head match (cherry picked from commit ea07938) * Clear dialyzer warning & fix type spec (cherry picked from commit a45aa81) * Ignore result without strict match to avoid diayzer warning (cherry picked from commit bb43c0b) * 'rabbitmq-diagnostics check_for_quorum_queues_without_an_elected_leader' documentation edits (cherry picked from commit 845230b) * 'rabbitmq-diagnostics check_for_quorum_queues_without_an_elected_leader' output copywriting (cherry picked from commit 235f43b) * diagnostics check_for_quorum_queues_without_an_elected_leader: behave like a health check w.r.t. error reporting (cherry picked from commit db73767) * check_for_quorum_queues_without_an_elected_leader: handle --quiet and --silent plus simplify function heads. References #13433. (cherry picked from commit 7b39231) --------- Co-authored-by: Ayanda Dube <[email protected]> (cherry picked from commit 09f1ab4)
…d reachable leader #13433 (#13487) * Implement rabbitmq-queues leader_health_check command for quorum queues (cherry picked from commit c26edbe) * Tests for rabbitmq-queues leader_health_check command (cherry picked from commit 6cc03b0) * Ensure calling ParentPID in leader health check execution and reuse and extend formatting API, with amqqueue:to_printable/2 (cherry picked from commit 76d66a1) * Extend core leader health check tests and update badrpc error handling in cli tests (cherry picked from commit 857e2a7) * Refactor leader_health_check command validators and ignore vhost arg (cherry picked from commit 6cf9339) * Update leader_health_check_command description and banner (cherry picked from commit 96b8bce) * Improve output formatting for healthy leaders and support silent mode in rabbitmq-queues leader_health_check command (cherry picked from commit 239a69b) * Support global flag to run leader health check for all queues in all vhosts on local node (cherry picked from commit 48ba3e1) * Return immediately for leader health checks on empty vhosts (cherry picked from commit 7873737) * Rename leader health check timeout refs (cherry picked from commit b7dec89) * Update banner message for global leader health check (cherry picked from commit c7da4d5) * QQ leader-health-check: check_process_limit_safety before spawning leader checks (cherry picked from commit 1736845) * Log leader health check result in broker logs (if any leaderless queues) (cherry picked from commit 1084179) * Ensure check_passed result for leader health internal calls) (cherry picked from commit 68739a6) * Extend CLI format output to process check_passed payload (cherry picked from commit 5f5e992) * Format leader healthcheck result log and function exports (cherry picked from commit ebffd7d) * Change leader_health_check command scope from queues to diagnostics (cherry picked from commit 663fc98) * Update (c) line year (cherry picked from commit df82f12) * Rename command to check_for_quorum_queues_without_an_elected_leader and use across_all_vhosts option for global checks (cherry picked from commit b2acbae) * Use rabbit_db_queue for qq leader health check lookups and introduce rabbit_db_queue:get_all_by_type_and_vhost/2. Update leader health check timeout to 5s and process limit threshold to 20% of node's process_limit. (cherry picked from commit 7a8e166) * Update tests: quorum_queue_SUITE and rabbit_db_queue_SUITE (cherry picked from commit 9bdb81f) * Fix typo (cli test module) (cherry picked from commit 6158568) * Small refactor - simpler final leader health check result return on function head match (cherry picked from commit ea07938) * Clear dialyzer warning & fix type spec (cherry picked from commit a45aa81) * Ignore result without strict match to avoid diayzer warning (cherry picked from commit bb43c0b) * 'rabbitmq-diagnostics check_for_quorum_queues_without_an_elected_leader' documentation edits (cherry picked from commit 845230b) * 'rabbitmq-diagnostics check_for_quorum_queues_without_an_elected_leader' output copywriting (cherry picked from commit 235f43b) * diagnostics check_for_quorum_queues_without_an_elected_leader: behave like a health check w.r.t. error reporting (cherry picked from commit db73767) * check_for_quorum_queues_without_an_elected_leader: handle --quiet and --silent plus simplify function heads. References #13433. (cherry picked from commit 7b39231) --------- Co-authored-by: Ayanda Dube <[email protected]> (cherry picked from commit 09f1ab4) (cherry picked from commit e1d7481) # Conflicts: # deps/rabbit/test/quorum_queue_SUITE.erl
…d reachable leader #13433 (#13487) * Implement rabbitmq-queues leader_health_check command for quorum queues (cherry picked from commit c26edbe) * Tests for rabbitmq-queues leader_health_check command (cherry picked from commit 6cc03b0) * Ensure calling ParentPID in leader health check execution and reuse and extend formatting API, with amqqueue:to_printable/2 (cherry picked from commit 76d66a1) * Extend core leader health check tests and update badrpc error handling in cli tests (cherry picked from commit 857e2a7) * Refactor leader_health_check command validators and ignore vhost arg (cherry picked from commit 6cf9339) * Update leader_health_check_command description and banner (cherry picked from commit 96b8bce) * Improve output formatting for healthy leaders and support silent mode in rabbitmq-queues leader_health_check command (cherry picked from commit 239a69b) * Support global flag to run leader health check for all queues in all vhosts on local node (cherry picked from commit 48ba3e1) * Return immediately for leader health checks on empty vhosts (cherry picked from commit 7873737) * Rename leader health check timeout refs (cherry picked from commit b7dec89) * Update banner message for global leader health check (cherry picked from commit c7da4d5) * QQ leader-health-check: check_process_limit_safety before spawning leader checks (cherry picked from commit 1736845) * Log leader health check result in broker logs (if any leaderless queues) (cherry picked from commit 1084179) * Ensure check_passed result for leader health internal calls) (cherry picked from commit 68739a6) * Extend CLI format output to process check_passed payload (cherry picked from commit 5f5e992) * Format leader healthcheck result log and function exports (cherry picked from commit ebffd7d) * Change leader_health_check command scope from queues to diagnostics (cherry picked from commit 663fc98) * Update (c) line year (cherry picked from commit df82f12) * Rename command to check_for_quorum_queues_without_an_elected_leader and use across_all_vhosts option for global checks (cherry picked from commit b2acbae) * Use rabbit_db_queue for qq leader health check lookups and introduce rabbit_db_queue:get_all_by_type_and_vhost/2. Update leader health check timeout to 5s and process limit threshold to 20% of node's process_limit. (cherry picked from commit 7a8e166) * Update tests: quorum_queue_SUITE and rabbit_db_queue_SUITE (cherry picked from commit 9bdb81f) * Fix typo (cli test module) (cherry picked from commit 6158568) * Small refactor - simpler final leader health check result return on function head match (cherry picked from commit ea07938) * Clear dialyzer warning & fix type spec (cherry picked from commit a45aa81) * Ignore result without strict match to avoid diayzer warning (cherry picked from commit bb43c0b) * 'rabbitmq-diagnostics check_for_quorum_queues_without_an_elected_leader' documentation edits (cherry picked from commit 845230b) * 'rabbitmq-diagnostics check_for_quorum_queues_without_an_elected_leader' output copywriting (cherry picked from commit 235f43b) * diagnostics check_for_quorum_queues_without_an_elected_leader: behave like a health check w.r.t. error reporting (cherry picked from commit db73767) * check_for_quorum_queues_without_an_elected_leader: handle --quiet and --silent plus simplify function heads. References #13433. (cherry picked from commit 7b39231) --------- Co-authored-by: Ayanda Dube <[email protected]>
…d reachable leader #13433 (#13487) * Implement rabbitmq-queues leader_health_check command for quorum queues (cherry picked from commit c26edbe) * Tests for rabbitmq-queues leader_health_check command (cherry picked from commit 6cc03b0) * Ensure calling ParentPID in leader health check execution and reuse and extend formatting API, with amqqueue:to_printable/2 (cherry picked from commit 76d66a1) * Extend core leader health check tests and update badrpc error handling in cli tests (cherry picked from commit 857e2a7) * Refactor leader_health_check command validators and ignore vhost arg (cherry picked from commit 6cf9339) * Update leader_health_check_command description and banner (cherry picked from commit 96b8bce) * Improve output formatting for healthy leaders and support silent mode in rabbitmq-queues leader_health_check command (cherry picked from commit 239a69b) * Support global flag to run leader health check for all queues in all vhosts on local node (cherry picked from commit 48ba3e1) * Return immediately for leader health checks on empty vhosts (cherry picked from commit 7873737) * Rename leader health check timeout refs (cherry picked from commit b7dec89) * Update banner message for global leader health check (cherry picked from commit c7da4d5) * QQ leader-health-check: check_process_limit_safety before spawning leader checks (cherry picked from commit 1736845) * Log leader health check result in broker logs (if any leaderless queues) (cherry picked from commit 1084179) * Ensure check_passed result for leader health internal calls) (cherry picked from commit 68739a6) * Extend CLI format output to process check_passed payload (cherry picked from commit 5f5e992) * Format leader healthcheck result log and function exports (cherry picked from commit ebffd7d) * Change leader_health_check command scope from queues to diagnostics (cherry picked from commit 663fc98) * Update (c) line year (cherry picked from commit df82f12) * Rename command to check_for_quorum_queues_without_an_elected_leader and use across_all_vhosts option for global checks (cherry picked from commit b2acbae) * Use rabbit_db_queue for qq leader health check lookups and introduce rabbit_db_queue:get_all_by_type_and_vhost/2. Update leader health check timeout to 5s and process limit threshold to 20% of node's process_limit. (cherry picked from commit 7a8e166) * Update tests: quorum_queue_SUITE and rabbit_db_queue_SUITE (cherry picked from commit 9bdb81f) * Fix typo (cli test module) (cherry picked from commit 6158568) * Small refactor - simpler final leader health check result return on function head match (cherry picked from commit ea07938) * Clear dialyzer warning & fix type spec (cherry picked from commit a45aa81) * Ignore result without strict match to avoid diayzer warning (cherry picked from commit bb43c0b) * 'rabbitmq-diagnostics check_for_quorum_queues_without_an_elected_leader' documentation edits (cherry picked from commit 845230b) * 'rabbitmq-diagnostics check_for_quorum_queues_without_an_elected_leader' output copywriting (cherry picked from commit 235f43b) * diagnostics check_for_quorum_queues_without_an_elected_leader: behave like a health check w.r.t. error reporting (cherry picked from commit db73767) * check_for_quorum_queues_without_an_elected_leader: handle --quiet and --silent plus simplify function heads. References #13433. (cherry picked from commit 7b39231) --------- Co-authored-by: Ayanda Dube <[email protected]>
Proposed Changes
Hi Team 👋
The following changes are for a CLI diagnostics tool used for checking the health of Quorum Queue leaders. The main use is to detect presence quorum queues which are in a bad state (e.g. in particular, leaderless, where no messaging can be done) as reported and discussed here: #13101. There are some edge cases where quorum queues are observed to "lose leaders" and re-election is not triggered leaving such queues in bad state, and where any node restarts could result in complete message loss. This diagnostics tool allows us to carry out quick QQ leader health checks per vhost, or globally across all vhosts on a node/cluster for a specific queue match-spec. The command signature is as follows:
Output on a healthy node would be as follows:
Output on a node with unhealthy quorum queue leaders would be as follows (json formatted):
In addition to the command implementation, few items to note:
ping
as an aliveness and health check per leaderProcessLimitThreshold
is set to 40% of the node'sprocess_limit
to ensureprocess_count
when executing these checks will never reach the node'sprocess_limit
(and halt the node).:check_passed
results with a payload. This is simply to help for clarity for commands such as these in which the command check is a successful operation, while the result itself is an error (list of unhealthy QQs).This diagnostics tool would help catch leaderless queues, which on the Management UI, are listed as follows:
Some environments are quite risky to carrying out maintenance operations such as complete node shutdown. The main goal is to help avoid situations where nodes/queues are assumed to be in a good state (where other checks such as quorum critical checks which rely on quorum node count pass) while QQ leaders may be in an unhealthy state. If nodes pass both these checks, then we can deem them healthy and OK for various maintenance procedures. Following on, on this detection tool, are investigations to also fix the underlying problem reported in #13101.
Please take a look/review. We're keen on having these capabilities available upstream to catch and monitor this recurring issue^ 👍
Types of Changes
What types of changes does your code introduce to this project?
Put an
x
in the boxes that applyChecklist
Put an
x
in the boxes that apply.You can also fill these out after creating the PR.
If you're unsure about any of them, don't hesitate to ask on the mailing list.
We're here to help!
This is simply a reminder of what we are going to look for before merging your code.
CONTRIBUTING.md
documentFurther Comments
If this is a relatively large or complex change, kick off the discussion by explaining why you chose the solution
you did and what alternatives you considered, etc.