Skip to content

Commit e3cf90a

Browse files
committed
Terminate replaced rabbit_fifo_dlx_worker
[Jepsen dead lettering tests](https://github.com/rabbitmq/rabbitmq-ci/blob/5977f587e203698b8f281ed52b636d60489883b7/jepsen/scripts/qq-jepsen-test.sh#L108) of job `qq-jepsen-test-3-12` of Concourse pipeline `jepsen-tests` fail sometimes with following error: ``` {{:try_clause, [{:undefined, #PID<12128.3596.0>, :worker, [:rabbit_fifo_dlx_worker]}, {:undefined, #PID<12128.10212.0>, :worker, [:rabbit_fifo_dlx_worker]}]}, [{:erl_eval, :try_clauses, 10, [file: 'erl_eval.erl', line: 995]}, {:erl_eval, :exprs, 2, []}]} ``` At the end of the Jepsen test, there are 2 DLX workers on the same node. Analysing the logs reveals the following: Source quorum queue node becomes leader and starts its DLX worker: ``` 2023-03-18 12:14:04.365295+00:00 [debug] <0.1645.0> started rabbit_fifo_dlx_worker <0.3596.0> for queue 'jepsen.queue' in vhost '/' ``` Less than 1 second later, Mnesia reports a network partition (introduced by Jepsen). The DLX worker does not succeed to register as consumer to its source quorum queue because the Ra command times out: ``` 2023-03-18 12:15:04.365840+00:00 [warning] <0.3596.0> Failed to process command {dlx,{checkout,<0.3596.0>,32}} on quorum queue leader {'%2F_jepsen.queue', 2023-03-18 12:15:04.365840+00:00 [warning] <0.3596.0> 'rabbit@concourse-qq-jepsen-312-3'}: {timeout, 2023-03-18 12:15:04.365840+00:00 [warning] <0.3596.0> {'%2F_jepsen.queue', 2023-03-18 12:15:04.365840+00:00 [warning] <0.3596.0> 'rabbit@concourse-qq-jepsen-312-3'}} 2023-03-18 12:15:04.365840+00:00 [warning] <0.3596.0> Trying 5 more time(s)... ``` 3 seconds after the DLX worker got created, the local source quorum queue node is not leader anymore: ``` 2023-03-18 12:14:07.289213+00:00 [notice] <0.1645.0> queue 'jepsen.queue' in vhost '/': leader -> follower in term: 17 machine version: 3 ``` But because the DLX worker at this point failed to register as consumer, it will not be terminated in https://github.com/rabbitmq/rabbitmq-server/blob/865d533863c29ed52e03070ac8d9e1bcaee8b205/deps/rabbit/src/rabbit_fifo_dlx.erl#L264-L275 Eventually, when the local node becomes a leader again, that DLX worker succeeds to register as consumer (due to retries in https://github.com/rabbitmq/rabbitmq-server/blob/865d533863c29ed52e03070ac8d9e1bcaee8b205/deps/rabbit/src/rabbit_fifo_dlx_client.erl#L41-L58), and stays alive. When that happens, there is a 2nd DLX worker active because the 2nd got started when the local quorum queue node transitioned to become a leader. This commit prevents this issue. So, last consumer who does a `#checkout{}` wins and the “old one” has to terminate.
1 parent 865d533 commit e3cf90a

File tree

1 file changed

+10
-1
lines changed

1 file changed

+10
-1
lines changed

deps/rabbit/src/rabbit_fifo_dlx.erl

Lines changed: 10 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -114,11 +114,20 @@ apply(_, {dlx, #checkout{consumer = Pid,
114114
apply(_, {dlx, #checkout{consumer = ConsumerPid,
115115
prefetch = Prefetch}},
116116
at_least_once,
117-
#?MODULE{consumer = #dlx_consumer{checked_out = CheckedOutOldConsumer},
117+
#?MODULE{consumer = #dlx_consumer{pid = OldConsumerPid,
118+
checked_out = CheckedOutOldConsumer},
118119
discards = Discards0,
119120
msg_bytes = Bytes,
120121
msg_bytes_checkout = BytesCheckout} = State0) ->
121122
%% Since we allow only a single consumer, the new consumer replaces the old consumer.
123+
case ConsumerPid of
124+
OldConsumerPid ->
125+
ok;
126+
_ ->
127+
rabbit_log:debug("Terminating ~p since ~p becomes active rabbit_fifo_dlx_worker",
128+
[OldConsumerPid, ConsumerPid]),
129+
ensure_worker_terminated(State0)
130+
end,
122131
%% All checked out messages to the old consumer need to be returned to the discards queue
123132
%% such that these messages will be re-delivered to the new consumer.
124133
%% When inserting back into the discards queue, we respect the original order in which messages

0 commit comments

Comments
 (0)