-
Notifications
You must be signed in to change notification settings - Fork 582
Connection recovery runs into timeouts when using NIO #413
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
@ihiroky thanks for the details and digging in. Can you provide an environment we can use to reproduce (e.g. a public git repo with Docker files and Java client connection code)? |
@michaelklishin I create https://github.com/ihiroky/rabbitmq-client-java-reproduce-413 for reproduction.
Please let me know if there is anything I can do. |
@ihiroky Thanks for the test project, I managed to reproduce. A workaround is to increase the number of IO threads: the recovered connection will end up being dispatched to another IO thread. I'm working on a fix to dispatch heartbeat failure in another thread, this is already the case for other connection failures. |
Otherwise connection recovery can occur in the same NIO thread, leading to a deadlock. Fixes #413
@acogoluegnes Thank you. I confirmed this isssue has been fixed. |
Autorecovery of Java client (NIO mode) is failed when network gets down, not shutdown borker or client.
Its summary is shown below:
Network down (simulated on docker environment, disconnect client container by executing 'docker network disconnect').
Broker which the client connects detects heartbeat loss:
rabbitmq3 | 2018-10-11 06:35:49.757 [warning] <0.6261.0> closing AMQP connection <0.6261.0> (172.18.1.32:43016 -> 172.18.1.13:5672):
rabbitmq3 | missed heartbeats from client, timeout: 5s
Then the client gets error:
kanon-db2 | 2018-10-11 06:35:47,258 [rabbitmq-nio] ERROR c.r.c.i.ForgivingExceptionHandler - An unexpected connection driver error occured
kanon-db2 | com.rabbitmq.client.MissedHeartbeatException: Heartbeat missing with heartbeat = 5 seconds
kanon-db2 | at com.rabbitmq.client.impl.AMQConnection.handleHeartbeatFailure(AMQConnection.java:684)
kanon-db2 | at com.rabbitmq.client.impl.nio.NioLoop.run(NioLoop.java:80)
kanon-db2 | at java.base/java.lang.Thread.run(Thread.java:834)
kanon-db2 | 2018-10-11 06:35:47,265 [rabbitmq-nio] INFO c.e.anodim.LogShutdownListener - [shutdownCompleted] host:[/172.18.1.13:5672] id:[null] chNo:[-1] - com.rabbitmq.client.ShutdownSignalException: connection error
kanon-db2 | 2018-10-11 06:35:47,267 [rabbitmq-nio] INFO c.e.anodim.LogShutdownListener - [shutdownCompleted] host:[/172.18.1.13:5672] id:[null] chNo:[-1] - com.rabbitmq.client.ShutdownSignalException: connection error
Then, the client starts to recover:
anon-db2 | 2018-10-11 06:35:52,271 [rabbitmq-nio] INFO c.e.anodim.LogRecoveryListener - [handleRecoveryStarted] amqp://[email protected]:5672/
But the client says like below periodically (and never recovers):
kanon-db2 | 2018-10-11 06:36:57,290 [rabbitmq-nio] ERROR c.r.c.i.ForgivingExceptionHandler - Caught an exception during connection recovery!
kanon-db2 | java.util.concurrent.TimeoutException: null
kanon-db2 | at com.rabbitmq.utility.BlockingCell.get(BlockingCell.java:77)
kanon-db2 | at com.rabbitmq.utility.BlockingCell.uninterruptibleGet(BlockingCell.java:120)
kanon-db2 | at com.rabbitmq.utility.BlockingValueOrException.uninterruptibleGetValue(BlockingValueOrException.java:36)
kanon-db2 | at com.rabbitmq.client.impl.AMQChannel$BlockingRpcContinuation.getReply(AMQChannel.java:494)
kanon-db2 | at com.rabbitmq.client.impl.AMQConnection.start(AMQConnection.java:315)
kanon-db2 | at com.rabbitmq.client.impl.recovery.RecoveryAwareAMQConnectionFactory.newConnection(RecoveryAwareAMQConnectionFactory.java:64)
kanon-db2 | at com.rabbitmq.client.impl.recovery.AutorecoveringConnection.recoverConnection(AutorecoveringConnection.java:602)
kanon-db2 | at com.rabbitmq.client.impl.recovery.AutorecoveringConnection.beginAutomaticRecovery(AutorecoveringConnection.java:566)
kanon-db2 | at com.rabbitmq.client.impl.recovery.AutorecoveringConnection.lambda$addAutomaticRecoveryListener$3(AutorecoveringConnection.java:508)
kanon-db2 | at com.rabbitmq.client.impl.AMQConnection.notifyRecoveryCanBeginListeners(AMQConnection.java:748)
kanon-db2 | at com.rabbitmq.client.impl.AMQConnection.doFinalShutdown(AMQConnection.java:725)
kanon-db2 | at com.rabbitmq.client.impl.AMQConnection.handleHeartbeatFailure(AMQConnection.java:690)
kanon-db2 | at com.rabbitmq.client.impl.nio.NioLoop.run(NioLoop.java:80)
kanon-db2 | at java.base/java.lang.Thread.run(Thread.java:834)
This client recvovery process is executed in 'rabbitmq-nio' thread. And c.r.u.BlockingCell.get() blocks it. So any I/O event can't be executed (and never get result by BlockingCell.get()). I think this recovery process should be executed in another thread instead of 'rabbitmq-nio' thread, or split request/reply to eliminate the blocking.
Environment:
Broker: RabbitMQ 3.7.7 on Erlang 20.3.8.5 (docker image: 3-management)
Client runtime: Java 11 (docker image: 11-slim),
rabbitmq-client-java version: 5.4.1
The text was updated successfully, but these errors were encountered: