
Don't die in case of faulty node #894


Merged
merged 1 commit into rabbitmq:stable on Jul 28, 2016

Conversation

lemenkov
Contributor

This patch fixes a TOCTOU (time-of-check to time-of-use) issue introduced in the following commit:

* rabbitmq/rabbitmq-server@93b9e37

If a node was just removed from the cluster, there is a small window during which it is still listed locally as a member of the Mnesia cluster. We retrieve the list of nodes with a local call:

```erlang
 unsafe_rpc(Node, rabbit_mnesia, cluster_nodes, [running]).
```
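The description below implies that `nodes_in_cluster(Node)` is a thin helper around this call; a minimal sketch of that assumed shape (not verbatim source):

```erlang
%% Assumed shape of the helper referenced later in this description:
%% fetch the list of running cluster members once, via the local node.
nodes_in_cluster(Node) ->
    unsafe_rpc(Node, rabbit_mnesia, cluster_nodes, [running]).
```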

However, retrieving the status from that particular failed node is no longer possible and throws an exception. See the `alarms_by_node(Name)` function, which simply calls `unsafe_rpc(Name, rabbit, status, [])` for that node.
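
Based on that description, the pre-fix shape of the function is roughly the following sketch (extracting the alarms via `lists:keyfind/3` is an assumption, not verbatim source):

```erlang
%% Sketch of the pre-fix behaviour: if Name is down, unsafe_rpc/4
%% throws instead of returning, so this whole call chain dies.
alarms_by_node(Name) ->
    Status = unsafe_rpc(Name, rabbit, status, []),
    {_, As} = lists:keyfind(alarms, 1, Status),
    {Name, As}.
```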

This `unsafe_rpc/4` function is essentially a wrapper over `rabbit_misc:rpc_call/4` that translates `{badrpc, nodedown}` into an exception. The exception raised by the `alarms_by_node(Name)` call propagates to a very high level, so rabbitmqctl concludes that the entire cluster is down.
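
Such a wrapper typically looks like the following sketch (assumed shape, not copied verbatim from the source):

```erlang
%% Sketch of the wrapper pattern described above: any {badrpc, _}
%% result from the RPC (e.g. {badrpc, nodedown}) is re-raised as a
%% thrown exception instead of being returned as a value.
unsafe_rpc(Node, Mod, Fun, Args) ->
    case rabbit_misc:rpc_call(Node, Mod, Fun, Args) of
        {badrpc, _} = Error -> throw(Error);
        Result              -> Result
    end.
```

With this wrapper in the call chain, the resulting error message is rather bizarre: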

	Cluster status of node 'rabbit@overcloud-controller-0' ...
	Error: unable to connect to node 'rabbit@overcloud-controller-0':
	nodedown

	DIAGNOSTICS
	===========

	attempted to contact: ['rabbit@overcloud-controller-0']

	rabbit@overcloud-controller-0:
	  * connected to epmd (port 4369) on overcloud-controller-0
	  * node rabbit@overcloud-controller-0 up, 'rabbit' application running

	current node details:
	- node name: 'rabbitmq-cli-31@overcloud-controller-0'
	- home dir: /var/lib/rabbitmq
	- cookie hash: PB31uPq3vzeQeZ+MHv+wgg==

Note that it reports a failure to connect to node 'rabbit@overcloud-controller-0' (because it catches the exception from `alarms_by_node(Name)`), even though the attempt to connect to that node was actually successful ('rabbit' application running).

To fix this, we should not throw an exception during the subsequent calls (`[alarms_by_node(Name) || Name <- nodes_in_cluster(Node)]`), only during the first one (`unsafe_rpc(Node, rabbit_mnesia, status, [])`).
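
A minimal sketch of that fix, assuming `alarms_by_node/1` switches to `rabbit_misc:rpc_call/4` and flags an unreachable node instead of throwing (the exact representation of a down node is an assumption):

```erlang
%% Sketch of the fix: call rabbit_misc:rpc_call/4 directly so a down
%% node comes back as the value {badrpc, nodedown} rather than as an
%% exception, and report it as such instead of crashing the command.
alarms_by_node(Name) ->
    case rabbit_misc:rpc_call(Name, rabbit, status, []) of
        {badrpc, nodedown} ->
            {Name, [nodedown]};
        Status ->
            {_, As} = lists:keyfind(alarms, 1, Status),
            {Name, As}
    end.
```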

Moreover, we don't need to change `nodes_in_cluster(Node)`, because it is called locally. The only function that must use `rabbit_misc:rpc_call/4` is `alarms_by_node(Name)`, because it is executed remotely.

See this issue for further details and a real-world example:

* https://bugzilla.redhat.com/1356169

Signed-off-by: Peter Lemenkov <[email protected]>
@michaelklishin self-assigned this Jul 28, 2016
@michaelklishin added this to the 3.6.4 milestone Jul 28, 2016
@michaelklishin merged commit f8a9643 into rabbitmq:stable Jul 28, 2016