
Increase classic queue shutdown timeout #3409


Merged: 3 commits, Sep 21, 2021

Conversation

@lhoguin (Contributor) commented Sep 13, 2021

A value that is too low will prevent the index from shutting
down in time when there are many queues. This leads to the
process being killed and on the next RabbitMQ restart a
(potentially very long) dirty recovery is needed.

The value of 10 minutes was chosen to mirror the shutdown
timeout of the message store. Since both the queues and the
message store need to shut down gracefully in order to have
a clean restart, it makes sense to use the same value.

Related: c40c262

To reproduce

You can get a rough reproduction of the issue (before this PR) by using the following patch:

diff --git a/deps/rabbit/src/rabbit_queue_index.erl b/deps/rabbit/src/rabbit_queue_index.erl
index e8c573c08f..58adc0c953 100644
--- a/deps/rabbit/src/rabbit_queue_index.erl
+++ b/deps/rabbit/src/rabbit_queue_index.erl
@@ -318,6 +318,7 @@ recover(#resource{ virtual_host = VHost } = Name, Terms, MsgStoreRecovered,
 
 terminate(VHost, Terms, State = #qistate { dir = Dir }) ->
     {SegmentCounts, State1} = terminate(State),
+    timer:sleep(60000),
     rabbit_recovery_terms:store(VHost, filename:basename(Dir),
                                 [{segments, SegmentCounts} | Terms]),
     State1.

Then start the node with `make run-broker`, shut it down using `ERL_LIBS=deps ./escript/rabbitmqctl stop`, then start it again with `make run-broker`. Check the logs to confirm that RabbitMQ went into dirty recovery.

You will also likely see this in the logs during the shutdown:

2021-09-09 12:13:12.163 [error] <0.5609.636> Supervisor {<0.5609.636>,rabbit_amqqueue_sup} had child rabbit_amqqueue started with rabbit_prequeue:start_link({amqqueue,{resource,<<"61dacda2-b55f-413e-b2bc-b018dfa05ccc">>,queue,<<"classic-durable-mirrored-1...">>},...}, slave, <0.5606.636>) at <0.5620.636> exit with reason killed in context shutdown_error

Types of Changes

What types of changes does your code introduce to this project?
Put an x in the boxes that apply

  • Bug fix (non-breaking change which fixes issue #NNNN)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause an observable behavior change in existing systems)
  • Documentation improvements (corrections, new content, etc)
  • Cosmetic change (whitespace, formatting, etc)

Checklist

Put an x in the boxes that apply.
You can also fill these out after creating the PR.
If you're unsure about any of them, don't hesitate to ask on the mailing list.
We're here to help!
This is simply a reminder of what we are going to look for before merging your code.

  • I have read the CONTRIBUTING.md document
  • I have signed the CA (see https://cla.pivotal.io/sign/rabbitmq)
  • I have added tests that prove my fix is effective or that my feature works
  • All tests pass locally with my changes
  • If relevant, I have added necessary documentation to https://github.com/rabbitmq/rabbitmq-website
  • If relevant, I have added this change to the first version(s) in release-notes that I expect to introduce it

@lhoguin (Contributor Author) commented Sep 13, 2021

A bit more context: the natural case where this becomes an issue is 3 nodes, 10k+ classic queues, 100k+ consumers, and other objects such as quorum queues. To reproduce it naturally you simply need to make the VM do enough work that there aren't enough resources left for the classic queues to shut down within 30 seconds.
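As a rough back-of-the-envelope sketch (a toy model, not RabbitMQ code; the 5 ms per-queue shutdown cost is an assumed illustrative figure, while the 30 s and 600 s timeouts come from the PR), you can see how a fleet of classic queues blows past the old timeout while fitting comfortably in the new one:

```python
# Toy model: aggregate classic-queue shutdown time vs. the supervisor
# shutdown timeout. The per-queue cost is an assumed figure; the old
# 30 s and new 600 s (10 min) timeouts are the values from this PR.

def shutdown_is_clean(num_queues: int, per_queue_seconds: float,
                      timeout_seconds: float) -> bool:
    """True if all queues can terminate before the supervisor gives up."""
    return num_queues * per_queue_seconds <= timeout_seconds

queues = 10_000
per_queue = 0.005  # 5 ms of serialized shutdown work per queue (assumed)

old_ok = shutdown_is_clean(queues, per_queue, 30.0)   # old 30 s timeout
new_ok = shutdown_is_clean(queues, per_queue, 600.0)  # new 10 min timeout

print(old_ok, new_ok)  # prints: False True -- 10k * 5 ms = 50 s of work
```

With the old timeout the supervisor kills the index mid-terminate, the recovery terms are never stored, and the next start falls back to a dirty recovery.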

Without this change the test could take a very long time to
clean up the queues and finish, because of a race condition
between the queue deletion and the federation link being
restarted and declaring the queue again.

(The test bidirectional was renamed to message_flow to better
 represent what it is doing.)
@lhoguin lhoguin marked this pull request as ready for review September 14, 2021 13:09
@lhoguin (Contributor Author) commented Sep 14, 2021

The most recent commit fixes the issue with the bidirectional test (now message_flow) by shutting down federation before deleting the queues in the test.
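The race and its fix can be sketched with a toy simulation (plain Python, not RabbitMQ code; `cleanup_steps` and the step cap are invented for illustration): while federation is running, every queue the test deletes is re-declared by the restarting link, so cleanup never converges; stopping federation first lets it finish in a single pass.

```python
# Toy simulation of the race fixed in this PR: the test deletes queues
# while a running federation link may restart and re-declare a
# just-deleted queue, so cleanup keeps chasing its tail. Stopping
# federation before deleting the queues removes the race.

def cleanup_steps(num_queues: int, federation_running: bool,
                  max_steps: int = 1000) -> int:
    """Number of delete operations needed to empty the queue set,
    or max_steps if cleanup never converges."""
    queues = set(range(num_queues))
    steps = 0
    while queues and steps < max_steps:
        q = next(iter(queues))
        queues.discard(q)        # the test deletes the queue
        steps += 1
        if federation_running:   # link restarts and re-declares it
            queues.add(q)
    return steps

print(cleanup_steps(10, federation_running=True))   # prints 1000: hits the cap
print(cleanup_steps(10, federation_running=False))  # prints 10: one pass
```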

@lhoguin (Contributor Author) commented Sep 14, 2021

Consensus is to ship 3.9.6 without this, so this should only be merged after that release is out. The release notes will also need to be updated.

@michaelklishin michaelklishin merged commit 5fb118e into master Sep 21, 2021
@michaelklishin michaelklishin deleted the lh-increase-queue-shutdown-timeout branch September 21, 2021 13:46
michaelklishin added a commit that referenced this pull request Sep 21, 2021
@michaelklishin (Collaborator) commented:

Backported to v3.9.x.

michaelklishin added a commit that referenced this pull request Sep 21, 2021

Increase classic queue shutdown timeout

(cherry picked from commit 5fb118e)
@michaelklishin (Collaborator) commented:

Backported to v3.8.x.
