Use erlang:system_info(creation) as GUID #3631

Merged: 1 commit, Nov 3, 2021

mkuratczyk
Contributor

These changes aim to prevent false-positive network partition detection:
in some cases, in a multi-node cluster, restarting a single node could cause
the node monitor to declare a partition and stop the remaining nodes.

The node GUID makes it possible to distinguish between different incarnations
of a node. However, since rabbit may take some time to start (many queues,
bindings, etc.), there could be a significant delay between the Erlang VM
being up and responding to RPC calls and the new GUID being announced. During
that time, the node monitor could incorrectly assume there was a network
partition when in fact the node had simply been restarted. With this change,
as soon as the Erlang VM is up, we can tell whether it was restarted and
avoid false positives.
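
To illustrate the idea, here is a minimal sketch of how a creation value can be compared across a suspected restart. It is not the code from this PR: the module name `creation_check_sketch` and the functions `local_creation/0`, `remote_creation/1` and `classify/2` are hypothetical names chosen for this example. Only `erlang:system_info(creation)` and `rpc:call/4` are existing OTP calls.

```erlang
-module(creation_check_sketch).
-export([local_creation/0, remote_creation/1, classify/2]).

%% Creation value of the local Erlang VM. Per the OTP documentation the
%% creation changes when a node is restarted, so it can stand in for a
%% per-incarnation identifier.
local_creation() ->
    erlang:system_info(creation).

%% Ask a peer for its creation value. This only needs the remote VM to
%% answer RPC calls; it does not need the rabbit application to be running.
remote_creation(Node) ->
    case rpc:call(Node, erlang, system_info, [creation]) of
        {badrpc, Reason} -> {error, Reason};
        Creation when is_integer(Creation) -> {ok, Creation}
    end.

%% Compare the value recorded while the peer was last known to be up
%% (first argument) with the result of asking it again (second argument).
%% Hypothetical classification, for illustration only.
classify(Known, {ok, Known}) -> same_incarnation;
classify(_Known, {ok, _Other}) -> restarted;
classify(_Known, {error, _Reason}) -> unreachable.
```

For example, `classify(Recorded, remote_creation(Node))` returns `restarted` as soon as the peer VM answers with a different creation value, without waiting for rabbit to finish booting.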

Additionally, we now log whether any queues were deleted on behalf of the
restarted node. This can take quite a long time if there are many transient
queues (e.g. auto-delete queues). The longer this took, the higher the odds
were of the restarted node being up again by the time
check_partial_partition was called. We may need to reconsider this logic
as well, but for now we just log this activity.

Co-authored-by: Loïc Hoguin [email protected]

@michaelklishin
Collaborator

@Mergifyio rebase

@mergify

mergify bot commented Nov 3, 2021

rebase

✅ Branch has been successfully rebased

@michaelklishin michaelklishin merged commit 6318a7e into master Nov 3, 2021
@michaelklishin michaelklishin deleted the creation-as-guid branch November 3, 2021 10:21
michaelklishin added a commit that referenced this pull request Nov 3, 2021
Use erlang:system_info(creation) as GUID (backport #3631)
michaelklishin added a commit that referenced this pull request Nov 3, 2021
Use erlang:system_info(creation) as GUID

(cherry picked from commit 6318a7e)

Conflicts:
	deps/rabbit/src/rabbit_node_monitor.erl
@michaelklishin
Collaborator

Backported to v3.8.x for 3.8.24 manually.

Collaborator

@lukebakken lukebakken left a comment

👍

@michaelklishin michaelklishin added this to the 3.9.9 milestone Nov 9, 2021