After node reboot rabbitmq rejoins cluster but messages are being discarded. #2950
Replies: 6 comments 10 replies
-
I will convert this issue to a GitHub discussion. Currently GitHub will automatically close and lock the issue even though your question will be transferred and responded to elsewhere. This is to let you know that we do not intend to ignore this, but this is how the current GitHub conversion mechanism makes it look to users :(
-
Those messages are Erlang messages, not RabbitMQ ones. They were sent to a previous "generation" of a queue replica and cannot be accepted by the new restarted one. This is not something that can or has to be fixed. In terms of quorum queues, that means a replica was from another leader term (before the current queue leader was elected due to a node restart). Assuming you use quorum queues and publisher confirms, these warnings can be ignored entirely.
-
Perhaps it's worth clarifying that in Raft, the spec explicitly requires that messages sent during an earlier leadership term (epoch) be discarded by any replica that receives them. This is what the runtime does in the above case (without any awareness of Raft), because its developers concluded that ignoring such messages is the only safe thing to do. This is not a problem for quorum queues, by design, as long as a majority of nodes is still available (three sections starting with the one linked are all worth reading). If you don't use quorum queues, you should. They were developed as a replacement for classic mirrored queues. CMQs will be removed in a future version of RabbitMQ.
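To make the rule concrete, here is a minimal sketch (not RabbitMQ's or Ra's actual code; the `Replica` class and `handle_append` method are invented for illustration) of how a Raft replica discards a message stamped with an older leader term:

```python
# Illustrative sketch of the Raft term check described above.
# Not real RabbitMQ/Ra code -- names and structure are made up.

class Replica:
    def __init__(self):
        self.current_term = 0  # highest leader term this replica has seen
        self.log = []

    def handle_append(self, leader_term, entry):
        """Accept an entry only if its term is not older than ours."""
        if leader_term < self.current_term:
            return False  # stale leader term: discard, as the spec requires
        self.current_term = leader_term
        self.log.append((leader_term, entry))
        return True

r = Replica()
assert r.handle_append(2, "m1") is True   # entry from the term-2 leader: accepted
r.current_term = 3                        # a restart triggered a new election (term 3)
assert r.handle_append(2, "m2") is False  # late message from the old term: dropped
assert r.log == [(2, "m1")]
```

This mirrors what the warnings in the logs describe: a message addressed to a replica from before the restart arrives after a new leader term has begun, so it is safely ignored.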
-
Also a data point here: we almost never had this problem with 3.7; we used to hit it only once in a blue moon. Since 3.8 it can be reproduced very easily. I can share the startup logs for 3.7 vs 3.8 if you are interested.
-
@Alabme, just to let you know, this issue does not happen on three-node classic mirrored queue clusters. From what I have read, it happens with classic mirrored queues when a queue master is not elected after restarting a two-node cluster. I am posting this here to help people who still want to stay with classic queues and fix this problem. We would not be able to move over, as there are no OpenStack blueprints for moving mirroring to quorum queues yet. I hope this helps.
-
Hi, any update please? I have the same issue with the docker image rabbitmq:3.8.5-alpine on Kubernetes 1.13. However, we have 3 nodes.
Thank you
-
RabbitMQ version: 3.8.11
Erlang: 23.2.7, 22.0.2
Team,
We are seeing the following discard messages after a node restarts and tries to rejoin the existing HA members.
This is from a Kubernetes environment where epmd and RabbitMQ are co-located in a container, so when we restart RabbitMQ, both epmd and RabbitMQ are restarted. We have talked to the Erlang developers and they suggested reaching out to the RabbitMQ developers about this. The ask here is that RabbitMQ should handle a restart together with epmd without creating issues for the cluster.
Reply below is from Erlang dev:
"The process identifiers used when sending these messages identify a node with the same nodename as the receiving node, but it is an old instance of that node that is identified.
A node is identified by its name and an integer value called "creation", which is assigned when the Erlang distribution is started.
Both the nodename and the creation are stored in every process identifier. If the nodename matches but the creation doesn't when sending a message using a process identifier, the receiving node will print messages like the ones below and drop the message (since the receiving process doesn't exist on the node). In your case below, the creation of the old instance is
1615219349 and that of the new instance is 1615222369.
Either the receiving node has been restarted with the same name, or the Erlang distribution on the receiving node has been restarted under the same name. In both cases a new creation will be assigned to the node and it will reject messages directed to the old instance of the node.
In OTP 23 we began using 32-bit creation values. In OTP 22 these values were only 2 bits, which means creation values were reused very quickly. That is probably why you don't see this issue as often with OTP 22.
I think you have to turn to the RabbitMQ team for support regarding this and/or the person(s) that have configured your system."
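The check the Erlang developer describes can be sketched in a few lines (an illustrative model only, not the BEAM implementation; the `Pid` and `Node` types here are invented for the example). A pid carries both a nodename and a creation; a restarted node gets a new creation, so pids minted before the restart no longer match and their messages are dropped:

```python
# Illustrative model of the (nodename, creation) check described in the quote.
# Not actual Erlang/OTP code -- types and method names are made up.
from collections import namedtuple

Pid = namedtuple("Pid", ["nodename", "creation", "proc_id"])

class Node:
    def __init__(self, nodename, creation):
        self.nodename = nodename
        self.creation = creation  # assigned when distribution starts
        self.mailbox = []

    def deliver(self, target_pid, msg):
        """Deliver msg only if the pid identifies THIS instance of the node."""
        if target_pid.nodename != self.nodename:
            return False
        if target_pid.creation != self.creation:
            return False  # pid points at a pre-restart instance: drop the message
        self.mailbox.append(msg)
        return True

old_pid = Pid("rabbit@host1", 1615219349, 42)  # creation from before the restart
node = Node("rabbit@host1", 1615222369)        # restarted node, new creation
assert node.deliver(old_pid, "hello") is False  # discarded, with a log warning

new_pid = Pid("rabbit@host1", 1615222369, 42)
assert node.deliver(new_pid, "hello") is True   # matching creation: delivered
```

With OTP 22's 2-bit creations, a restarted node would frequently be assigned the same creation value it had before, so stale pids often matched by accident and the discards were rarely visible; OTP 23's 32-bit creations make such collisions rare, which is why the warnings became easy to reproduce.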