
Don't die in case of faulty node #894


Merged
merged 1 commit into rabbitmq:stable on Jul 28, 2016

Conversation

lemenkov
Contributor

This patch fixes a TOCTOU (time-of-check to time-of-use) issue introduced in the following commit:

* rabbitmq/rabbitmq-server@93b9e37

If a node was just removed from the cluster, there is a small window during which it is still listed locally as a member of the Mnesia cluster. We retrieve the list of nodes with a local call:

```erlang
 unsafe_rpc(Node, rabbit_mnesia, cluster_nodes, [running]).
```
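The description below implies that `nodes_in_cluster(Node)` is a thin helper around this call; a minimal sketch of that assumed shape (not verbatim source):

```erlang
%% Assumed shape of the helper referenced later in this description:
%% fetch the list of running cluster members once, via the local node.
nodes_in_cluster(Node) ->
    unsafe_rpc(Node, rabbit_mnesia, cluster_nodes, [running]).
```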

However, retrieving the status from that particular failed node is no longer possible and throws an exception. See the `alarms_by_node(Name)` function, which simply calls `unsafe_rpc(Name, rabbit, status, [])` for that node.
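
Based on that description, the pre-fix shape of the function is roughly the following sketch (extracting the alarms via `lists:keyfind/3` is an assumption, not verbatim source):

```erlang
%% Sketch of the pre-fix behaviour: if Name is down, unsafe_rpc/4
%% throws instead of returning, so this whole call chain dies.
alarms_by_node(Name) ->
    Status = unsafe_rpc(Name, rabbit, status, []),
    {_, As} = lists:keyfind(alarms, 1, Status),
    {Name, As}.
```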

This `unsafe_rpc/4` function is essentially a wrapper over `rabbit_misc:rpc_call/4` that translates `{badrpc, nodedown}` into an exception. The exception raised by the `alarms_by_node(Name)` call propagates to a very high level, so rabbitmqctl concludes that the entire cluster is down.
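
Such a wrapper typically looks like the following sketch (assumed shape, not copied verbatim from the source):

```erlang
%% Sketch of the wrapper pattern described above: any {badrpc, _}
%% result from the RPC (e.g. {badrpc, nodedown}) is re-raised as a
%% thrown exception instead of being returned as a value.
unsafe_rpc(Node, Mod, Fun, Args) ->
    case rabbit_misc:rpc_call(Node, Mod, Fun, Args) of
        {badrpc, _} = Error -> throw(Error);
        Result              -> Result
    end.
```

With this wrapper in the call chain, the resulting error message is rather bizarre: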

	Cluster status of node 'rabbit@overcloud-controller-0' ...
	Error: unable to connect to node 'rabbit@overcloud-controller-0':
	nodedown

	DIAGNOSTICS
	===========

	attempted to contact: ['rabbit@overcloud-controller-0']

	rabbit@overcloud-controller-0:
	  * connected to epmd (port 4369) on overcloud-controller-0
	  * node rabbit@overcloud-controller-0 up, 'rabbit' application running

	current node details:
	- node name: 'rabbitmq-cli-31@overcloud-controller-0'
	- home dir: /var/lib/rabbitmq
	- cookie hash: PB31uPq3vzeQeZ+MHv+wgg==

Note that it reports a failure to connect to node 'rabbit@overcloud-controller-0' (because it catches the exception from `alarms_by_node(Name)`), even though the attempt to connect to that node was actually successful ('rabbit' application running).

To fix this, we should not throw an exception during the subsequent calls (`[alarms_by_node(Name) || Name <- nodes_in_cluster(Node)]`), only during the first one (`unsafe_rpc(Node, rabbit_mnesia, status, [])`).
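
A minimal sketch of that fix, assuming `alarms_by_node/1` switches to `rabbit_misc:rpc_call/4` and flags an unreachable node instead of throwing (the exact representation of a down node is an assumption):

```erlang
%% Sketch of the fix: call rabbit_misc:rpc_call/4 directly so a down
%% node comes back as the value {badrpc, nodedown} rather than as an
%% exception, and report it as such instead of crashing the command.
alarms_by_node(Name) ->
    case rabbit_misc:rpc_call(Name, rabbit, status, []) of
        {badrpc, nodedown} ->
            {Name, [nodedown]};
        Status ->
            {_, As} = lists:keyfind(alarms, 1, Status),
            {Name, As}
    end.
```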

Moreover, we don't need to change `nodes_in_cluster(Node)`, because it is called locally. The only function that must use `rabbit_misc:rpc_call/4` is `alarms_by_node(Name)`, because it is executed remotely.

See this issue for further details and a real-world example:

* https://bugzilla.redhat.com/1356169

Signed-off-by: Peter Lemenkov <[email protected]>
@michaelklishin self-assigned this Jul 28, 2016
@michaelklishin added this to the 3.6.4 milestone Jul 28, 2016
@michaelklishin merged commit f8a9643 into rabbitmq:stable Jul 28, 2016