@@ -420,18 +420,20 @@ to most systems:

 ## <a id="health-checks" class="anchor" href="#health-checks">Health Checks</a>

-A health check is a [periodically executed](#monitoring-frequency) command
-or set of commands that collect a few essential metrics of a RabbitMQ node or cluster.
-Just like with human or veterinary health checks, there's a variety of checks that
-can be performed and some are more intrusive than others. Different checks also have
-a different probability of reporting [false positives](https://en.wikipedia.org/wiki/False_positives_and_false_negatives)
-(a scenario when a node is reported as unhealthy even when it is actually healthy).
-
-Health checks therefore should be thought of as a range of options, starting with the most
-basic and virtually never producing false positives to increasingly more comprehensive,
-intrusive, and opinionated checks that have a probability of false positives that should be
-taken into account. Health checks can verify the state of an individual node or the entire cluster. The former
-kind is known as node health checks and the latter as cluster health checks.
+A health check is a [periodically executed](#monitoring-frequency) command that
+tries to determine whether an aspect of the RabbitMQ service is operating
+normally.
+
+There is a series of health checks that can be performed, starting
+with the most basic and virtually never producing [false
+positives](https://en.wikipedia.org/wiki/False_positives_and_false_negatives),
+to increasingly more comprehensive, intrusive, and opinionated checks that have a
+higher probability of false positives. In other words, the more comprehensive a
+health check is, the less conclusive the result will be.
+
+Health checks can verify the state of an
+individual node (node health checks), or the entire cluster (cluster health
+checks).

 ### <a id="individual-checks" class="anchor" href="#individual-checks">Individual Node Checks</a>

@@ -453,14 +455,14 @@ The most basic check ensures that the runtime is running
 and (indirectly) that CLI tools can authenticate to it.

 Except for the CLI tool authentication
-part, the probability of false positives can be considered approaching 0
+part, the probability of false positives can be considered approaching `0`
 except for upgrades and maintenance windows.

 [`rabbitmq-diagnostics ping`](/rabbitmq-diagnostics.8.html) performs this check:

 <pre class="lang-bash">
 rabbitmq-diagnostics ping -q
-# => Ping succeeded
+# => Ping succeeded if exit code is 0
 </pre>

 #### Stage 2
@@ -477,7 +479,7 @@ rabbitmq-diagnostics -q status
 </pre>

 This is a common way of sanity checking a node.
-The probability of false positives can be considered approaching 0
+The probability of false positives can be considered approaching `0`
 except for upgrades and maintenance windows.

 #### Stage 3
@@ -610,7 +612,11 @@ maintenance windows can raise significantly.

 Includes all checks in stage 3 plus checks that there are no failed [virtual hosts](/vhosts.html).

-RabbitMQ CLI tools currently do not provide a dedicated command for this check.
+RabbitMQ CLI tools currently do not provide a dedicated command for this check, but here is an example that could be used in the meantime:
+<pre class="lang-bash">
+rabbitmqctl eval '[true = rabbit_vhost:is_running_on_all_nodes(VHost) || VHost <- rabbit_vhost:list()], all_vhosts_are_running_on_all_nodes.'
+all_vhosts_are_running_on_all_nodes
+</pre>

 The probability of false positives is generally low except for systems that are under
 high CPU load.
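
The commands touched by this diff could be chained into one staged check that runs the least intrusive stage first and stops at the first failure. The following is only a rough sketch, not part of the change itself: `run_stage` and `node_health_check` are hypothetical helper names, the CLI tools are assumed to be on `PATH` and able to authenticate to the node, and `rabbitmqctl eval` is assumed to exit with a non-zero code when the expression raises (for example, on a failed `true = ...` match).

```shell
#!/usr/bin/env bash
# Sketch of a staged node health check built from the commands in this diff.
# Assumption: each CLI command exits with 0 on success and non-zero on failure.

# Hypothetical helper: runs one stage quietly and reports its outcome.
run_stage() {
  local name="$1"
  shift
  if "$@" >/dev/null 2>&1; then
    echo "OK: ${name}"
  else
    echo "FAILED: ${name}"
    return 1
  fi
}

# Hypothetical wrapper: stages ordered from least to most intrusive.
# Returns non-zero as soon as one stage fails, so later stages are skipped.
node_health_check() {
  run_stage "ping" rabbitmq-diagnostics ping -q || return 1
  run_stage "status" rabbitmq-diagnostics -q status || return 1
  # Assumption: eval exits non-zero if any virtual host fails the match.
  run_stage "vhosts" rabbitmqctl eval '[true = rabbit_vhost:is_running_on_all_nodes(VHost) || VHost <- rabbit_vhost:list()], all_vhosts_are_running_on_all_nodes.' || return 1
}
```

A monitoring job could call `node_health_check` on a schedule and alert on a non-zero exit code. Because later stages are more likely to produce false positives, it may be worth treating a failure in the `vhosts` stage less severely than a failed `ping`.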