After node reboot rabbitmq rejoins cluster but messages are being discarded. #2950
Replies: 6 comments 10 replies
-
I will convert this issue to a GitHub discussion. Currently GitHub will automatically close and lock the issue even though your question will be transferred and responded to elsewhere. This is to let you know that we do not intend to ignore this, but this is how the current GitHub conversion mechanism makes it look to users :(
-
Those messages are Erlang messages, not RabbitMQ ones. They were sent to a previous "generation" of a queue replica and cannot be accepted by the new restarted one. This is not something that can or has to be fixed. In terms of quorum queues, that means a replica was from another leader term (before the current queue leader was elected due to a node restart). Assuming you use quorum queues and publisher confirms, these warnings can be ignored entirely.
-
Perhaps it's worth clarifying that in Raft, the spec explicitly requires that messages sent during an earlier leadership term (epoch) be discarded by any replica that receives them. This is what the runtime does in the above case (without any awareness of Raft), because its developers concluded that ignoring such messages is the only safe thing to do. This is not a problem for quorum queues, by design, as long as a majority of nodes is still available (three sections starting with the one linked are all worth reading). If you don't use quorum queues, you should. They were developed as a replacement for classic mirrored queues. CMQs will be removed in a future version of RabbitMQ.
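To make the rule concrete, here is a minimal sketch (not RabbitMQ's or Ra's actual code; the `Replica` class and `handle_append` method are invented for illustration) of how a Raft replica discards a message stamped with an older leader term:

```python
# Illustrative sketch of the Raft term check described above.
# Not real RabbitMQ/Ra code -- names and structure are made up.

class Replica:
    def __init__(self):
        self.current_term = 0  # highest leader term this replica has seen
        self.log = []

    def handle_append(self, leader_term, entry):
        """Accept an entry only if its term is not older than ours."""
        if leader_term < self.current_term:
            return False  # stale leader term: discard, as the spec requires
        self.current_term = leader_term
        self.log.append((leader_term, entry))
        return True

r = Replica()
assert r.handle_append(2, "m1") is True   # entry from the term-2 leader: accepted
r.current_term = 3                        # a restart triggered a new election (term 3)
assert r.handle_append(2, "m2") is False  # late message from the old term: dropped
assert r.log == [(2, "m1")]
```

This mirrors what the warnings in the logs describe: a message addressed to a replica from before the restart arrives after a new leader term has begun, so it is safely ignored.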
-
Also a data point here: we almost never had this problem with 3.7; we used to hit it only once in a blue moon. Since 3.8 it can be reproduced very easily. I can share the startup logs for 3.7 vs 3.8 if you are interested.
-
@Alabme, just to let you know, this issue does not happen on three-node classic mirrored queue clusters. From what I have read, it happens with classic mirrored queues when a queue master is not elected after restarting a two-node cluster. I am posting this here to help people who still want to stay with classic queues and fix this problem. We would not be able to move over, as there are no OpenStack blueprints for moving mirroring to quorum queues yet. I hope this helps.
-
Hi, any update please? I have the same issue with the docker image rabbitmq:3.8.5-alpine on Kubernetes 1.13. However, we have 3 nodes.
Thank you
-
RabbitMQ version: 3.8.11
Erlang: 23.2.7, 22.0.2
Team,
We are seeing the following discard messages after a node restarts and tries to rejoin the existing HA members.
This is from a Kubernetes environment where epmd and RabbitMQ are co-located in a container, so when we restart RabbitMQ, both epmd and RabbitMQ are restarted. We have talked to the Erlang developers and they suggested reaching out to the RabbitMQ developers about this. The ask here is that RabbitMQ should handle a restart together with epmd without creating issues for the cluster.
Reply below is from Erlang dev:
"The process identifiers used when sending these messages identify a node with the same nodename as the receiving node, but it is an old instance of that node that is identified.
A node is identified by its name and an integer value called "creation", which is assigned when the Erlang distribution is started.
Both the nodename and the creation are stored in every process identifier. If the nodename matches but the creation doesn't when sending a message using a process identifier, the receiving node will print messages like the ones below and drop the message (since the receiving process doesn't exist on the node). In your case below, the creation of the old instance is
1615219349 and that of the new instance is 1615222369.
Either the receiving node has been restarted with the same name, or the Erlang distribution on the receiving node has been restarted under the same name. In both cases a new creation will be assigned to the node and it will reject messages directed to the old instance of the node.
In OTP 23 we began using 32-bit creation values. In OTP 22 these values were only 2 bits, which means creation values were reused very quickly. That is probably why you don't see this issue as often with OTP 22.
I think you have to turn to the RabbitMQ team for support regarding this and/or the person(s) that have configured your system."
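The check the Erlang developer describes can be sketched in a few lines (an illustrative model only, not the BEAM implementation; the `Pid` and `Node` types here are invented for the example). A pid carries both a nodename and a creation; a restarted node gets a new creation, so pids minted before the restart no longer match and their messages are dropped:

```python
# Illustrative model of the (nodename, creation) check described in the quote.
# Not actual Erlang/OTP code -- types and method names are made up.
from collections import namedtuple

Pid = namedtuple("Pid", ["nodename", "creation", "proc_id"])

class Node:
    def __init__(self, nodename, creation):
        self.nodename = nodename
        self.creation = creation  # assigned when distribution starts
        self.mailbox = []

    def deliver(self, target_pid, msg):
        """Deliver msg only if the pid identifies THIS instance of the node."""
        if target_pid.nodename != self.nodename:
            return False
        if target_pid.creation != self.creation:
            return False  # pid points at a pre-restart instance: drop the message
        self.mailbox.append(msg)
        return True

old_pid = Pid("rabbit@host1", 1615219349, 42)  # creation from before the restart
node = Node("rabbit@host1", 1615222369)        # restarted node, new creation
assert node.deliver(old_pid, "hello") is False  # discarded, with a log warning

new_pid = Pid("rabbit@host1", 1615222369, 42)
assert node.deliver(new_pid, "hello") is True   # matching creation: delivered
```

With OTP 22's 2-bit creations, a restarted node would frequently be assigned the same creation value it had before, so stale pids often matched by accident and the discards were rarely visible; OTP 23's 32-bit creations make such collisions rare, which is why the warnings became easy to reproduce.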