
Increase classic queue shutdown timeout #3409


Merged: 3 commits, Sep 21, 2021

Conversation

@lhoguin (Contributor) commented Sep 13, 2021

A value that is too low will prevent the index from shutting
down in time when there are many queues. This leads to the
process being killed and on the next RabbitMQ restart a
(potentially very long) dirty recovery is needed.

The value of 10 minutes was chosen to mirror the shutdown
timeout of the message store. Since both the queues and the
message store need to shut down gracefully in order to have
a clean restart, it makes sense to use the same value.

Related: c40c262

To reproduce

You can get a rough reproduction of the issue (before this PR) by using the following patch:

diff --git a/deps/rabbit/src/rabbit_queue_index.erl b/deps/rabbit/src/rabbit_queue_index.erl
index e8c573c08f..58adc0c953 100644
--- a/deps/rabbit/src/rabbit_queue_index.erl
+++ b/deps/rabbit/src/rabbit_queue_index.erl
@@ -318,6 +318,7 @@ recover(#resource{ virtual_host = VHost } = Name, Terms, MsgStoreRecovered,
 
 terminate(VHost, Terms, State = #qistate { dir = Dir }) ->
     {SegmentCounts, State1} = terminate(State),
+    timer:sleep(60000),
     rabbit_recovery_terms:store(VHost, filename:basename(Dir),
                                 [{segments, SegmentCounts} | Terms]),
     State1.

Then start the node with `make run-broker`, shut it down using `ERL_LIBS=deps ./escript/rabbitmqctl stop`, then start it again with `make run-broker`. Check the logs to confirm that RabbitMQ went into dirty recovery.

You will also likely see this in the logs during the shutdown:

2021-09-09 12:13:12.163 [error] <0.5609.636> Supervisor {<0.5609.636>,rabbit_amqqueue_sup} had child rabbit_amqqueue started with rabbit_prequeue:start_link({amqqueue,{resource,<<"61dacda2-b55f-413e-b2bc-b018dfa05ccc">>,queue,<<"classic-durable-mirrored-1...">>},...}, slave, <0.5606.636>) at <0.5620.636> exit with reason killed in context shutdown_error

Types of Changes

What types of changes does your code introduce to this project?
Put an x in the boxes that apply

  • Bug fix (non-breaking change which fixes issue #NNNN)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause an observable behavior change in existing systems)
  • Documentation improvements (corrections, new content, etc)
  • Cosmetic change (whitespace, formatting, etc)

Checklist

Put an x in the boxes that apply.
You can also fill these out after creating the PR.
If you're unsure about any of them, don't hesitate to ask on the mailing list.
We're here to help!
This is simply a reminder of what we are going to look for before merging your code.

  • I have read the CONTRIBUTING.md document
  • I have signed the CA (see https://cla.pivotal.io/sign/rabbitmq)
  • I have added tests that prove my fix is effective or that my feature works
  • All tests pass locally with my changes
  • If relevant, I have added necessary documentation to https://github.com/rabbitmq/rabbitmq-website
  • If relevant, I have added this change to the first version(s) in release-notes that I expect to introduce it

@lhoguin (Contributor Author) commented Sep 13, 2021

A bit more context: the natural case where this becomes an issue is 3 nodes, 10k+ classic queues, 100k+ consumers, and other objects such as quorum queues. To reproduce it naturally you simply need to make the VM do enough work that there aren't enough resources left for the classic queues to shut down within 30 seconds.
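As a rough back-of-the-envelope sketch (a toy model, not RabbitMQ code; the 5 ms per-queue shutdown cost is an assumed illustrative figure, while the 30 s and 600 s timeouts come from the PR), you can see how a fleet of classic queues blows past the old timeout while fitting comfortably in the new one:

```python
# Toy model: aggregate classic-queue shutdown time vs. the supervisor
# shutdown timeout. The per-queue cost is an assumed figure; the old
# 30 s and new 600 s (10 min) timeouts are the values from this PR.

def shutdown_is_clean(num_queues: int, per_queue_seconds: float,
                      timeout_seconds: float) -> bool:
    """True if all queues can terminate before the supervisor gives up."""
    return num_queues * per_queue_seconds <= timeout_seconds

queues = 10_000
per_queue = 0.005  # 5 ms of serialized shutdown work per queue (assumed)

old_ok = shutdown_is_clean(queues, per_queue, 30.0)   # old 30 s timeout
new_ok = shutdown_is_clean(queues, per_queue, 600.0)  # new 10 min timeout

print(old_ok, new_ok)  # prints: False True -- 10k * 5 ms = 50 s of work
```

With the old timeout the supervisor kills the index mid-terminate, the recovery terms are never stored, and the next start falls back to a dirty recovery.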

Without this change the test could take a very long time to
clean up the queues and finish, because of a race condition
between the queue deletion and the federation link being
restarted and declaring the queue again.

(The test bidirectional was renamed to message_flow to better
 represent what it is doing.)
@lhoguin lhoguin marked this pull request as ready for review September 14, 2021 13:09
@lhoguin (Contributor Author) commented Sep 14, 2021

The most recent commit fixes the issue with the bidirectional test (now message_flow) by shutting down federation before deleting the queues in the test.
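The race and its fix can be sketched with a toy simulation (plain Python, not RabbitMQ code; `cleanup_steps` and the step cap are invented for illustration): while federation is running, every queue the test deletes is re-declared by the restarting link, so cleanup never converges; stopping federation first lets it finish in a single pass.

```python
# Toy simulation of the race fixed in this PR: the test deletes queues
# while a running federation link may restart and re-declare a
# just-deleted queue, so cleanup keeps chasing its tail. Stopping
# federation before deleting the queues removes the race.

def cleanup_steps(num_queues: int, federation_running: bool,
                  max_steps: int = 1000) -> int:
    """Number of delete operations needed to empty the queue set,
    or max_steps if cleanup never converges."""
    queues = set(range(num_queues))
    steps = 0
    while queues and steps < max_steps:
        q = next(iter(queues))
        queues.discard(q)        # the test deletes the queue
        steps += 1
        if federation_running:   # link restarts and re-declares it
            queues.add(q)
    return steps

print(cleanup_steps(10, federation_running=True))   # prints 1000: hits the cap
print(cleanup_steps(10, federation_running=False))  # prints 10: one pass
```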

@lhoguin (Contributor Author) commented Sep 14, 2021

Consensus is to ship 3.9.6 without this, so this should only be merged after that release is out. The release notes will also need to be updated.

@michaelklishin michaelklishin merged commit 5fb118e into master Sep 21, 2021
@michaelklishin michaelklishin deleted the lh-increase-queue-shutdown-timeout branch September 21, 2021 13:46
michaelklishin added a commit that referenced this pull request Sep 21, 2021
@michaelklishin (Collaborator) commented:

Backported to v3.9.x.

michaelklishin added a commit that referenced this pull request Sep 21, 2021

Increase classic queue shutdown timeout

(cherry picked from commit 5fb118e)
@michaelklishin (Collaborator) commented:

Backported to v3.8.x.
