Connection recovery runs into timeouts when using NIO #413

ihiroky · 2018-10-12T01:24:56Z

Autorecovery of Java client (NIO mode) is failed when network gets down, not shutdown borker or client.
Its summary is shown below:

Network down (simulated on docker environment, disconnect client container by executing 'docker network disconnect').
Broker which the client connects detects heartbeat loss:
rabbitmq3 | 2018-10-11 06:35:49.757 [warning] <0.6261.0> closing AMQP connection <0.6261.0> (172.18.1.32:43016 -> 172.18.1.13:5672):
rabbitmq3 | missed heartbeats from client, timeout: 5s
Then the client gets error:
kanon-db2 | 2018-10-11 06:35:47,258 [rabbitmq-nio] ERROR c.r.c.i.ForgivingExceptionHandler - An unexpected connection driver error occured
kanon-db2 | com.rabbitmq.client.MissedHeartbeatException: Heartbeat missing with heartbeat = 5 seconds
kanon-db2 | at com.rabbitmq.client.impl.AMQConnection.handleHeartbeatFailure(AMQConnection.java:684)
kanon-db2 | at com.rabbitmq.client.impl.nio.NioLoop.run(NioLoop.java:80)
kanon-db2 | at java.base/java.lang.Thread.run(Thread.java:834)
kanon-db2 | 2018-10-11 06:35:47,265 [rabbitmq-nio] INFO c.e.anodim.LogShutdownListener - [shutdownCompleted] host:[/172.18.1.13:5672] id:[null] chNo:[-1] - com.rabbitmq.client.ShutdownSignalException: connection error
kanon-db2 | 2018-10-11 06:35:47,267 [rabbitmq-nio] INFO c.e.anodim.LogShutdownListener - [shutdownCompleted] host:[/172.18.1.13:5672] id:[null] chNo:[-1] - com.rabbitmq.client.ShutdownSignalException: connection error
Then, the client starts to recover:
anon-db2 | 2018-10-11 06:35:52,271 [rabbitmq-nio] INFO c.e.anodim.LogRecoveryListener - [handleRecoveryStarted] amqp://[email protected]:5672/
But the client says like below periodically (and never recovers):
kanon-db2 | 2018-10-11 06:36:57,290 [rabbitmq-nio] ERROR c.r.c.i.ForgivingExceptionHandler - Caught an exception during connection recovery!
kanon-db2 | java.util.concurrent.TimeoutException: null
kanon-db2 | at com.rabbitmq.utility.BlockingCell.get(BlockingCell.java:77)
kanon-db2 | at com.rabbitmq.utility.BlockingCell.uninterruptibleGet(BlockingCell.java:120)
kanon-db2 | at com.rabbitmq.utility.BlockingValueOrException.uninterruptibleGetValue(BlockingValueOrException.java:36)
kanon-db2 | at com.rabbitmq.client.impl.AMQChannel$BlockingRpcContinuation.getReply(AMQChannel.java:494)
kanon-db2 | at com.rabbitmq.client.impl.AMQConnection.start(AMQConnection.java:315)
kanon-db2 | at com.rabbitmq.client.impl.recovery.RecoveryAwareAMQConnectionFactory.newConnection(RecoveryAwareAMQConnectionFactory.java:64)
kanon-db2 | at com.rabbitmq.client.impl.recovery.AutorecoveringConnection.recoverConnection(AutorecoveringConnection.java:602)
kanon-db2 | at com.rabbitmq.client.impl.recovery.AutorecoveringConnection.beginAutomaticRecovery(AutorecoveringConnection.java:566)
kanon-db2 | at com.rabbitmq.client.impl.recovery.AutorecoveringConnection.lambda$addAutomaticRecoveryListener$3(AutorecoveringConnection.java:508)
kanon-db2 | at com.rabbitmq.client.impl.AMQConnection.notifyRecoveryCanBeginListeners(AMQConnection.java:748)
kanon-db2 | at com.rabbitmq.client.impl.AMQConnection.doFinalShutdown(AMQConnection.java:725)
kanon-db2 | at com.rabbitmq.client.impl.AMQConnection.handleHeartbeatFailure(AMQConnection.java:690)
kanon-db2 | at com.rabbitmq.client.impl.nio.NioLoop.run(NioLoop.java:80)
kanon-db2 | at java.base/java.lang.Thread.run(Thread.java:834)

This client recvovery process is executed in 'rabbitmq-nio' thread. And c.r.u.BlockingCell.get() blocks it. So any I/O event can't be executed (and never get result by BlockingCell.get()). I think this recovery process should be executed in another thread instead of 'rabbitmq-nio' thread, or split request/reply to eliminate the blocking.

Environment:
Broker: RabbitMQ 3.7.7 on Erlang 20.3.8.5 (docker image: 3-management)
Client runtime: Java 11 (docker image: 11-slim),
rabbitmq-client-java version: 5.4.1

michaelklishin · 2018-10-12T03:04:05Z

@ihiroky thanks for the details and digging in. Can you provide an environment we can use to reproduce (e.g. a public git repo with Docker files and Java client connection code)?

ihiroky · 2018-10-12T06:08:02Z

@michaelklishin I create https://github.com/ihiroky/rabbitmq-client-java-reproduce-413 for reproduction.

Clone the repository above.
Create client image and load it to docker. For example:
$ gradle
$ docker load --input build/jib-image.tar
Run containers (broker and client):
$ docker-compose up
Dissconnect client container from the network;
$ docker network disconnect <parent_dir>_rabbitmq_network client
and confirm errors.
Connect client container to the network
$ docker netowrk connect --ip 172.18.1.21 <parent_dir>_rabbitmq_network client
and confirm erros still continue.

Please let me know if there is anything I can do.

acogoluegnes · 2018-10-12T08:21:57Z

@ihiroky Thanks for the test project, I managed to reproduce.

A workaround is to increase the number of IO threads: the recovered connection will end up being dispatched to another IO thread.

I'm working on a fix to dispatch heartbeat failure in another thread, this is already the case for other connection failures.

Otherwise connection recovery can occur in the same NIO thread, leading to a deadlock. Fixes #413

Otherwise connection recovery can occur in the same NIO thread, leading to a deadlock. Fixes #413 (cherry picked from commit 46d3d2e)

Otherwise connection recovery can occur in the same NIO thread, leading to a deadlock. Fixes #413 (cherry picked from commit 46d3d2e) (cherry picked from commit 4f900f0) Conflicts: src/main/java/com/rabbitmq/client/impl/nio/NioLoop.java

acogoluegnes · 2018-10-15T11:55:48Z

@ihiroky 5.5.0.RC1 has shipped with the fix.

ihiroky · 2018-10-16T09:38:00Z

@acogoluegnes Thank you. I confirmed this isssue has been fixed.

michaelklishin changed the title ~~Connection recovery failed on rabbitmq-client-java:5.4.1 using NIO~~ Connection recovery runs into timeouts when using NIO Oct 12, 2018

acogoluegnes added a commit that referenced this issue Oct 12, 2018

Dispatch heartbeat failure in dedicated thread in NIO mode

46d3d2e

Otherwise connection recovery can occur in the same NIO thread, leading to a deadlock. Fixes #413

acogoluegnes mentioned this issue Oct 12, 2018

Dispatch heartbeat failure in dedicated thread in NIO mode #414

Merged

acogoluegnes added this to the 4.8.4 milestone Oct 12, 2018

michaelklishin closed this as completed in #414 Oct 12, 2018

acogoluegnes added a commit that referenced this issue Oct 15, 2018

Dispatch heartbeat failure in dedicated thread in NIO mode

4f900f0

Otherwise connection recovery can occur in the same NIO thread, leading to a deadlock. Fixes #413 (cherry picked from commit 46d3d2e)

acogoluegnes modified the milestones: 4.8.4, 4.9.0 Oct 15, 2018

acogoluegnes self-assigned this Oct 15, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Connection recovery runs into timeouts when using NIO #413

Connection recovery runs into timeouts when using NIO #413

ihiroky commented Oct 12, 2018

michaelklishin commented Oct 12, 2018

Uh oh!

ihiroky commented Oct 12, 2018

Uh oh!

acogoluegnes commented Oct 12, 2018

Uh oh!

acogoluegnes commented Oct 15, 2018

Uh oh!

ihiroky commented Oct 16, 2018 •

edited

Loading

Uh oh!

Connection recovery runs into timeouts when using NIO #413

Connection recovery runs into timeouts when using NIO #413

Comments

ihiroky commented Oct 12, 2018

michaelklishin commented Oct 12, 2018

Uh oh!

ihiroky commented Oct 12, 2018

Uh oh!

acogoluegnes commented Oct 12, 2018

Uh oh!

acogoluegnes commented Oct 15, 2018

Uh oh!

ihiroky commented Oct 16, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ihiroky commented Oct 16, 2018 •

edited

Loading