Skip to content

Commit 5bcf3ba

Browse files
committed
Don't die in case of faulty node
This patch fixes TOCTOU issue introduced in the following commit: * 93b9e37 If the node was just removed from the cluster, then there is a small window when it is still listed as a member of a Mnesia cluster locally. We retrieve list of nodes by calling locally ```erlang unsafe_rpc(Node, rabbit_mnesia, cluster_nodes, [running]). ``` However retrieving status from that particular failed node is no longer available and throws an exception. See `alarms_by_node(Name)` function, which is simply calls `unsafe_rpc(Name, rabbit, status, [])` for this node. This `unsafe_rpc/4` function is basically a wrapper over `rabbit_misc:rpc_call/4` which translates `{badrpc, nodedown}` into exception. This exception generated by `alarms_by_node(Name)` function call emerges on a very high level, so rabbitmqct thinks that the entire cluster is down, while generating a very bizarre message: Cluster status of node 'rabbit@overcloud-controller-0' ... Error: unable to connect to node 'rabbit@overcloud-controller-0': nodedown DIAGNOSTICS =========== attempted to contact: ['rabbit@overcloud-controller-0'] rabbit@overcloud-controller-0: * connected to epmd (port 4369) on overcloud-controller-0 * node rabbit@overcloud-controller-0 up, 'rabbit' application running current node details: - node name: 'rabbitmq-cli-31@overcloud-controller-0' - home dir: /var/lib/rabbitmq - cookie hash: PB31uPq3vzeQeZ+MHv+wgg== See - it reports that it failed to connect to node 'rabbit@overcloud-controller-0' (because it catches an exception from `alarms_by_node(Name)`), but attempt to connect to this node was successful ('rabbit' application running). In order to fix that we should not throw exception during consequent calls (`[alarms_by_node(Name) || Name <- nodes_in_cluster(Node)]`), only during the first one (`unsafe_rpc(Node, rabbit_mnesia, status, [])`). Even more - we don't need to change `nodes_in_cluster(Node)`, because it is called locally. The only function which must use `rabbit_misc:rpc_call/4` is `alarms_by_node(Name)` because it is executed remotely. See this issue for further details and real world example: * https://bugzilla.redhat.com/1356169 Signed-off-by: Peter Lemenkov <[email protected]>
1 parent 0c316fc commit 5bcf3ba

File tree

1 file changed

+6
-3
lines changed

1 file changed

+6
-3
lines changed

src/rabbit_control_main.erl

Lines changed: 6 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -953,6 +953,9 @@ nodes_in_cluster(Node) ->
953953
unsafe_rpc(Node, rabbit_mnesia, cluster_nodes, [running]).
954954

955955
alarms_by_node(Name) ->
956-
Status = unsafe_rpc(Name, rabbit, status, []),
957-
{_, As} = lists:keyfind(alarms, 1, Status),
958-
{Name, As}.
956+
case rpc_call(Name, rabbit, status, []) of
957+
{badrpc,nodedown} -> {Name, [nodedown]};
958+
Status ->
959+
{_, As} = lists:keyfind(alarms, 1, Status),
960+
{Name, As}
961+
end.

0 commit comments

Comments
 (0)