Shutdown cannot fail #1396

Merged
merged 1 commit into from
Oct 17, 2017
lukebakken
Collaborator

This is a cherry-pick of commit 79fa87f from master. Removes a test that commit 904744d appears to have broken.

@michaelklishin
Collaborator

79fa87f doesn't explain the intent, but I bet it is this: we don't want node shutdown to fail with obscure implementation-specific errors. We want the node to shut down successfully so that tools such as Monit or systemd do not freak out for no good reason. Seen that way, this is an improvement.

@michaelklishin michaelklishin merged commit 8d8246b into stable Oct 17, 2017
@gerhard
Contributor

gerhard commented Oct 18, 2017

A rabbitmqctl shutdown can fail for the following reasons:

  1. The Erlang cookie does not match
  2. RabbitMQ is booting
  3. RabbitMQ is stopping
  4. The Erlang VM is crashing (in some cases this has been observed to take many hours)
  5. The Erlang VM is running but the distribution is not working correctly - this happens more often than you think in hostile environments, such as almost every enterprise, multi-tenant RabbitMQ cluster.

In all of the above scenarios, the thing that manages RabbitMQ fails because it believes that RabbitMQ is stopped (i.e. shutdown succeeded) while in fact the beam process is still running.

How should we communicate to the thing that manages RabbitMQ that the Erlang VM failed to stop?
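One way a managing tool could guard against this, independent of what rabbitmqctl reports, is to confirm that the OS process is really gone. The following is only a sketch under assumptions: the function names are hypothetical, it is POSIX-only, and it assumes the tool already knows the PID of the beam process (for example from a PID file).

```python
import os
import time


def is_process_running(pid: int) -> bool:
    """Return True if a process with this PID currently exists (POSIX only)."""
    try:
        # Signal 0 sends nothing; it only checks that the PID can be signalled.
        os.kill(pid, 0)
    except ProcessLookupError:
        return False  # no such process
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def wait_until_stopped(pid: int, timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Poll until the process disappears; return False if it outlives the timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if not is_process_running(pid):
            return True
        time.sleep(interval)
    # One last check so a process that exits right at the deadline still counts.
    return not is_process_running(pid)
```

A supervisor could treat the node as stopped only when both the shutdown command exits with 0 and `wait_until_stopped` returns True. Note that a child that has exited but has not been reaped by its parent (a zombie) still counts as existing for this check.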

@michaelklishin
Collaborator

michaelklishin commented Oct 18, 2017 via email

@gerhard
Contributor

gerhard commented Oct 18, 2017

This means that rabbitmqctl shutdown can fail, since the exit code will not be 0. We need to reach team consensus on this matter before we move on. (cc @lukebakken & @hairyhum)

A few points to remember:

  • we already use exit codes from sysexits(3) (cc @dumbbell)
  • the original behaviour of rabbitmqctl shutdown is still captured in the rabbitmqctl docs. We should discuss whether what we have learned since then means it is no longer correct. It might help to refer to the original context in #142699247 & #142699191
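For readers unfamiliar with sysexits(3), the sketch below lists its conventional exit codes and shows how a managing tool might name the code it received. The numeric values are the standard ones from `sysexits.h`; the helper names are hypothetical, and this thread does not state which of these codes rabbitmqctl returns for which failure, so no such mapping is assumed here.

```python
# Conventional exit codes from sysexits(3) / <sysexits.h>.
SYSEXITS = {
    0: "EX_OK",
    64: "EX_USAGE",
    65: "EX_DATAERR",
    66: "EX_NOINPUT",
    67: "EX_NOUSER",
    68: "EX_NOHOST",
    69: "EX_UNAVAILABLE",
    70: "EX_SOFTWARE",
    71: "EX_OSERR",
    72: "EX_OSFILE",
    73: "EX_CANTCREAT",
    74: "EX_IOERR",
    75: "EX_TEMPFAIL",
    76: "EX_PROTOCOL",
    77: "EX_NOPERM",
    78: "EX_CONFIG",
}


def describe_exit_code(code: int) -> str:
    """Return the sysexits name for an exit code, or 'unknown' if it has none."""
    return SYSEXITS.get(code, "unknown")


def is_temporary_failure(code: int) -> bool:
    """True for EX_TEMPFAIL, the code that invites the caller to try again later."""
    return code == 75
```

Keeping the lookup in a plain dictionary rather than relying on Python's `os.EX_*` constants makes the sketch behave the same on platforms where those constants are not defined.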

@lukebakken lukebakken deleted the lrb-fix-shutdown-tests branch October 18, 2017 17:46
michaelklishin added a commit that referenced this pull request Oct 19, 2017
This way tools that manage RabbitMQ nodes can detect this condition
among a bunch of fairly generic exit codes.

Note that when a timeout occurs, we still use a "temporary failure" code.

See #1396 for context.
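A tool that manages nodes could react to a "temporary failure" code by retrying and treat every other non-zero code as final. The sketch below assumes that the temporary failure code is EX_TEMPFAIL (75) from sysexits(3); the commit message does not give the numeric value, and the function name and parameters are hypothetical.

```python
import subprocess
import time

EX_TEMPFAIL = 75  # assumption: sysexits(3) value for "temporary failure"


def shutdown_with_retry(command: list[str], attempts: int = 3, delay: float = 5.0) -> int:
    """Run the command, retrying only while it exits with EX_TEMPFAIL.

    Returns the exit code of the last attempt.
    """
    code = EX_TEMPFAIL
    for attempt in range(attempts):
        code = subprocess.run(command).returncode
        if code != EX_TEMPFAIL:
            break  # success or a non-retryable failure: stop here
        if attempt < attempts - 1:
            time.sleep(delay)  # back off before the next attempt
    return code
```

A caller would pass the full command line it already uses for shutdown, then inspect the returned code: 0 means the command reported success, 75 means every attempt timed out, and anything else is a failure that retrying is unlikely to fix.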
3 participants