Stream queues failing publisher confirms for multi-node cluster #3224
Replies: 8 comments 15 replies
-
with a load balancer, you should not have to set the
-
Probably 6000-6500
https://github.com/rabbitmq/osiris/blob/87447deb0361a7bf5caa47363031f2dfad6b0fe3/Makefile#L8
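If that default range needs to be pinned down for firewall rules, recent RabbitMQ versions expose the stream replication port range in `rabbitmq.conf`. A sketch; the key names and availability should be verified against the stream plugin documentation for your RabbitMQ version:

```
# Constrain the ports that osiris replica listeners may bind to
# (6000-6500 mirrors the default range from the osiris Makefile)
stream.replication.port_range.min = 6000
stream.replication.port_range.max = 6500
```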
On Tue, 27 Jul 2021 at 23:00, Johan Rhodin ***@***.***> wrote:
There is no firewall between the nodes.
Which port is the inter node communication running on for streams?
This is the listeners output from `rabbitmqctl status`:
```
Interface: [::], port: 25672, protocol: clustering, purpose: inter-node and CLI tool communication
Interface: [::], port: 15672, protocol: http, purpose: HTTP API
Interface: [::], port: 5552, protocol: stream, purpose: stream
Interface: 0.0.0.0, port: 5672, protocol: amqp, purpose: AMQP 0-9-1 and AMQP 1.0
Interface: 0.0.0.0, port: 5671, protocol: amqp/ssl, purpose: AMQP 0-9-1 and AMQP 1.0 over TLS
```
--
*Karl Nilsson*
-
Do your logs contain this warning?
https://github.com/rabbitmq/osiris/blob/master/src/osiris_replica_reader.erl#L58
This should tell you which hosts are being attempted.
On Fri, 13 Aug 2021 at 17:44, Chad Knutson ***@***.***> wrote:
Still no success in setting up mirrored stream queues in our clusters.
Today, I monitored network traffic on the node that is the leader of the
stream queue.
Before I get to that, I share more details about our cluster
configuration. Each cluster is deployed within a VPC. Inside the VPC, each
node has an internal IP (e.g. 10.56.72.1) as well as internal hostname
(e.g. node-01.in.example.com). The nodes also have external IP (e.g.,
54.1.1.1) and external hostname (e.g. node-01.example.com). We expect
that any communication between nodes will use the internal host/IP.
After I have created a stream queue on a 3 node cluster, the leader node
is selected, but online member nodes fail to populate. In Wireshark, I see
network traffic between the internal IP of the stream leader and the
external IP of a mirror.
Here is a sampling of the communication that I see. Ideally, I would be
able to filter out just the stream queue traffic, but it is not clear to me
what the filter criteria should be.
```
num  time     source         destination    protocol length info
6974 3.757509 IP-01-INTERNAL IP-02-EXTERNAL TLSv1.2 1514 Application Data [TCP segment of a reassembled PDU]
6975 3.757516 IP-01-INTERNAL IP-02-EXTERNAL TLSv1.2 1514 Application Data [TCP segment of a reassembled PDU]
6976 3.757517 IP-01-INTERNAL IP-02-EXTERNAL TLSv1.2 1514 Application Data, Application Data
6977 3.757520 IP-01-INTERNAL IP-02-EXTERNAL TLSv1.2 1514 Application Data, Application Data
6978 3.757521 IP-01-INTERNAL IP-02-EXTERNAL TCP 1514 37788 → 5671 [PSH, ACK] Seq=75079 Ack=1 Win=459 Len=1448 TSval=3683319730 TSecr=65526665 [TCP segment of a reassembled PDU]
6979 3.757534 IP-01-INTERNAL IP-02-EXTERNAL TCP 1514 37788 → 5671 [ACK] Seq=76527 Ack=1 Win=459 Len=1448 TSval=3683319730 TSecr=65526665 [TCP segment of a reassembled PDU]
6980 3.757541 IP-01-INTERNAL IP-02-EXTERNAL TCP 1514 37788 → 5671 [ACK] Seq=77975 Ack=1 Win=459 Len=1448 TSval=3683319730 TSecr=65526665 [TCP segment of a reassembled PDU]
6981 3.757549 IP-01-INTERNAL IP-02-EXTERNAL TLSv1.2 1514 Application Data [TCP segment of a reassembled PDU]
6982 3.757553 IP-01-INTERNAL IP-02-EXTERNAL TLSv1.2 1514 Application Data, Application Data
6983 3.757554 IP-01-INTERNAL IP-02-EXTERNAL TLSv1.2 1514 Application Data [TCP segment of a reassembled PDU]
6984 3.757577 IP-01-INTERNAL IP-02-EXTERNAL TCP 1514 37788 → 5671 [ACK] Seq=83767 Ack=1 Win=459 Len=1448 TSval=3683319731 TSecr=65526665 [TCP segment of a reassembled PDU]
8011 4.631005 IP-02-EXTERNAL IP-01-INTERNAL TCP 66 [TCP ACKed unseen segment] [TCP Previous segment not captured] 5671 → 37788 [ACK] Seq=38 Ack=260423 Win=15928 Len=0 TSval=65527539 TSecr=3683320532
8012 4.631029 IP-01-INTERNAL IP-02-EXTERNAL TLSv1.2 1514 [TCP ACKed unseen segment] [TCP Previous segment not captured] , Ignored Unknown Record
8013 4.631031 IP-01-INTERNAL IP-02-EXTERNAL TLSv1.2 1514 Ignored Unknown Record
8014 4.631035 IP-01-INTERNAL IP-02-EXTERNAL TLSv1.2 1514 Ignored Unknown Record
```
How can we enforce internal (local IP) communication between nodes?
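For isolating just the stream replication traffic in a capture, a Wireshark display filter over the replication port range mentioned earlier in the thread may help. A sketch; 6000-6500 is the default range from the osiris Makefile, so adjust if your deployment pins a different range:

```
tcp.port >= 6000 && tcp.port <= 6500
```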
-
Here is where the replica node resolves the IP addresses on the host that
the replica reader will try, in turn, to connect to. You can run this in an
Erlang shell or via a `rabbitmqctl eval` command until it resolves something
that can be connected to.
https://github.com/rabbitmq/osiris/blob/master/src/osiris_replica.erl#L173
On Fri, 13 Aug 2021 at 20:16, Chad Knutson ***@***.***> wrote:
For further testing, I removed the external IP and hostnames from all
nodes, including the /etc/hosts file. The only way to communicate now is
within the VPC using internal IP.
However, the RabbitMQ log is unchanged [now host 127.0.1.1 refers only to
the 'internal' host name]:
```
2021-08-13 19:08:55.419699+00:00 [warn] <0.12551.2> osiris replica connection refused, host:{127,0,1,1}
```
Wireshark shows errors similar to those in the initial capture, but now all
communication is between internal IPs.
-
Node-01 must be the node the replica is on. It sends that info to the
node hosting the leader, which then tries to connect back. The port is in the
range I mentioned before and is only listening whilst the replica process is active.
On Fri, 13 Aug 2021 at 22:50, Chad Knutson ***@***.***> wrote:
I get good results from that evaluation:
```
$ rabbitmqctl eval 'inet:getaddrs("node-01", inet).'
{ok,[{127,0,1,1}]}
$ rabbitmqctl eval 'inet:getaddrs("node-02", inet).'
{ok,[{10,16,16,222}]}
$ rabbitmqctl eval 'inet:getaddrs("node-03", inet).'
{ok,[{10,16,16,98}]}
```
This raises a few questions:
Why is the code only choosing 'node-01' every time? Why won't it choose
one of the other nodes?

I don't understand why it wouldn't be able to connect to 127.0.1.1 in any
case. I can telnet to that IP from node-01:
```
$ telnet 127.0.1.1 5672
Trying 127.0.1.1...
Connected to 127.0.1.1.
```
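Note that telnet to 5672 only proves the AMQP listener is reachable; per the earlier replies, the osiris replica listener binds a port in the 6000-6500 range and only listens while a replica process is active. A small probe over that range can distinguish the two cases. This is a sketch, not part of the thread: it is bash-specific (`/dev/tcp`), and the example host name at the bottom is hypothetical.

```shell
# probe_ports HOST FIRST_PORT LAST_PORT
# Reports which TCP ports in the range accept a connection on HOST.
# "No ports open" is the expected result when no replica process is running.
probe_ports() {
  local host=$1 first=$2 last=$3 p
  for p in $(seq "$first" "$last"); do
    # bash opens a TCP connection via the /dev/tcp pseudo-device;
    # the subshell closes fd 3 again when it exits
    if (exec 3<>"/dev/tcp/$host/$p") 2>/dev/null; then
      echo "open: $p"
    fi
  done
}

# Example (hypothetical replica host):
# probe_ports node-02.in.example.com 6000 6500
```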
-
After running that, I tried restarting the leader and another node became the leader, so I had to run it again on that node.
-
Here are some other logs we are getting. When starting a follower, we get warnings in the leader log, with the corresponding lines on the follower.
-
I am happy to report that I finally have a better handle on this.

The biggest issue is that we do not use true load balancers for our clusters in general. For the one case where we use a load balancer (AWS PrivateLink), the stream client works perfectly with the --load-balancer flag.

Other cases are complicated by the node name (e.g., test-node-01) vs. the FQDN (e.g., test-node-01.example.com). When using the public internet for connections, the stream client will try to connect to the node name, which the client cannot resolve, instead of the FQDN. In such cases, I found that setting the value of 'advertised_host' to the FQDN solves the problem.

The other issue that was causing problems was a loopback hostname in our /etc/hosts file. This was causing the mirroring issue described above when a stream queue was declared in a multi-node cluster. Removing that entry solved the problem.

Thank you for the guidance provided in many of these answers. I am happy to share more details about our configuration and conclusions upon request.
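The two fixes described above can be sketched concretely. Host names and addresses below reuse the thread's examples; the `stream.advertised_host` key is from the stream plugin's `rabbitmq.conf` syntax (the Erlang-term form shown earlier in the thread is the `advanced.config` equivalent) and should be checked against your RabbitMQ version:

```
# /etc/hosts — map the node's own name to its internal VPC address instead
# of a Debian-style "127.0.1.1  node-01" loopback entry, which is what the
# replica reader was resolving and failing to connect to:
10.56.72.1   node-01 node-01.in.example.com

# rabbitmq.conf — advertise the externally resolvable FQDN to stream clients:
stream.advertised_host = test-node-01.example.com
```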
-
I have a 3 node cluster hosted in a cloud platform with RabbitMQ 3.9.0.rc.2, Erlang 24.0.2 in Ubuntu (ARM processor). I'm using the released stream-perf-test-0.1.0 client running on my laptop (osx).
For a 1-node cluster, the client works fine, but I am not getting publisher confirms for my 3-node cluster. I am running the command on my laptop, where the RabbitMQ URI is DNS load-balanced over its 3 nodes (-01, -02, -03).
```
java -jar stream-perf-test-0.1.0.jar --uris rabbitmq-stream://$USR:[email protected]:5552/vhost
```
Response:
```
Starting producer
1, published 3637 msg/s, confirmed 0 msg/s, consumed 0 msg/s, latency min/median/75th/95th/99th 0/0/0/0/0 µs, chunk size 0
```
In the RabbitMQ manager for the queue, I see that the queue leader is node -01, online are nodes -01 and -02, members are -01, -02, -03. The streams view shows locator connected to node-01, producer on node-01, and consumer on node-02. Manager reports that 10000 messages have been published and that there is a consumer for the queue.
I have configured the stream plugin so that each of the 3 nodes uses its own URI for the advertised host. E.g., for node-01:
```
{advertised_host, "rabbitserver-01.example.com"}
```
I don't see anything in the RabbitMQ logs [probably that's another discussion; I don't think that I have the new logger configured correctly].
Am I missing a configuration, or is there a bug in either the client or RabbitMQ?