Skip to content

Connection recovery runs into timeouts when using NIO #413

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
ihiroky opened this issue Oct 12, 2018 · 5 comments
Closed

Connection recovery runs into timeouts when using NIO #413

ihiroky opened this issue Oct 12, 2018 · 5 comments
Assignees
Milestone

Comments

@ihiroky
Copy link

ihiroky commented Oct 12, 2018

Autorecovery of Java client (NIO mode) is failed when network gets down, not shutdown borker or client.
Its summary is shown below:

  1. Network down (simulated on docker environment, disconnect client container by executing 'docker network disconnect').

  2. Broker which the client connects detects heartbeat loss:
    rabbitmq3 | 2018-10-11 06:35:49.757 [warning] <0.6261.0> closing AMQP connection <0.6261.0> (172.18.1.32:43016 -> 172.18.1.13:5672):
    rabbitmq3 | missed heartbeats from client, timeout: 5s

  3. Then the client gets error:
    kanon-db2 | 2018-10-11 06:35:47,258 [rabbitmq-nio] ERROR c.r.c.i.ForgivingExceptionHandler - An unexpected connection driver error occured
    kanon-db2 | com.rabbitmq.client.MissedHeartbeatException: Heartbeat missing with heartbeat = 5 seconds
    kanon-db2 | at com.rabbitmq.client.impl.AMQConnection.handleHeartbeatFailure(AMQConnection.java:684)
    kanon-db2 | at com.rabbitmq.client.impl.nio.NioLoop.run(NioLoop.java:80)
    kanon-db2 | at java.base/java.lang.Thread.run(Thread.java:834)
    kanon-db2 | 2018-10-11 06:35:47,265 [rabbitmq-nio] INFO c.e.anodim.LogShutdownListener - [shutdownCompleted] host:[/172.18.1.13:5672] id:[null] chNo:[-1] - com.rabbitmq.client.ShutdownSignalException: connection error
    kanon-db2 | 2018-10-11 06:35:47,267 [rabbitmq-nio] INFO c.e.anodim.LogShutdownListener - [shutdownCompleted] host:[/172.18.1.13:5672] id:[null] chNo:[-1] - com.rabbitmq.client.ShutdownSignalException: connection error

  4. Then, the client starts to recover:
    anon-db2 | 2018-10-11 06:35:52,271 [rabbitmq-nio] INFO c.e.anodim.LogRecoveryListener - [handleRecoveryStarted] amqp://[email protected]:5672/

  5. But the client says like below periodically (and never recovers):
    kanon-db2 | 2018-10-11 06:36:57,290 [rabbitmq-nio] ERROR c.r.c.i.ForgivingExceptionHandler - Caught an exception during connection recovery!
    kanon-db2 | java.util.concurrent.TimeoutException: null
    kanon-db2 | at com.rabbitmq.utility.BlockingCell.get(BlockingCell.java:77)
    kanon-db2 | at com.rabbitmq.utility.BlockingCell.uninterruptibleGet(BlockingCell.java:120)
    kanon-db2 | at com.rabbitmq.utility.BlockingValueOrException.uninterruptibleGetValue(BlockingValueOrException.java:36)
    kanon-db2 | at com.rabbitmq.client.impl.AMQChannel$BlockingRpcContinuation.getReply(AMQChannel.java:494)
    kanon-db2 | at com.rabbitmq.client.impl.AMQConnection.start(AMQConnection.java:315)
    kanon-db2 | at com.rabbitmq.client.impl.recovery.RecoveryAwareAMQConnectionFactory.newConnection(RecoveryAwareAMQConnectionFactory.java:64)
    kanon-db2 | at com.rabbitmq.client.impl.recovery.AutorecoveringConnection.recoverConnection(AutorecoveringConnection.java:602)
    kanon-db2 | at com.rabbitmq.client.impl.recovery.AutorecoveringConnection.beginAutomaticRecovery(AutorecoveringConnection.java:566)
    kanon-db2 | at com.rabbitmq.client.impl.recovery.AutorecoveringConnection.lambda$addAutomaticRecoveryListener$3(AutorecoveringConnection.java:508)
    kanon-db2 | at com.rabbitmq.client.impl.AMQConnection.notifyRecoveryCanBeginListeners(AMQConnection.java:748)
    kanon-db2 | at com.rabbitmq.client.impl.AMQConnection.doFinalShutdown(AMQConnection.java:725)
    kanon-db2 | at com.rabbitmq.client.impl.AMQConnection.handleHeartbeatFailure(AMQConnection.java:690)
    kanon-db2 | at com.rabbitmq.client.impl.nio.NioLoop.run(NioLoop.java:80)
    kanon-db2 | at java.base/java.lang.Thread.run(Thread.java:834)

This client recvovery process is executed in 'rabbitmq-nio' thread. And c.r.u.BlockingCell.get() blocks it. So any I/O event can't be executed (and never get result by BlockingCell.get()). I think this recovery process should be executed in another thread instead of 'rabbitmq-nio' thread, or split request/reply to eliminate the blocking.

Environment:
Broker: RabbitMQ 3.7.7 on Erlang 20.3.8.5 (docker image: 3-management)
Client runtime: Java 11 (docker image: 11-slim),
rabbitmq-client-java version: 5.4.1

@michaelklishin
Copy link
Contributor

@ihiroky thanks for the details and digging in. Can you provide an environment we can use to reproduce (e.g. a public git repo with Docker files and Java client connection code)?

@michaelklishin michaelklishin changed the title Connection recovery failed on rabbitmq-client-java:5.4.1 using NIO Connection recovery runs into timeouts when using NIO Oct 12, 2018
@ihiroky
Copy link
Author

ihiroky commented Oct 12, 2018

@michaelklishin I create https://github.com/ihiroky/rabbitmq-client-java-reproduce-413 for reproduction.

  • Clone the repository above.
  • Create client image and load it to docker. For example:
    $ gradle
    $ docker load --input build/jib-image.tar
  • Run containers (broker and client):
    $ docker-compose up
  • Dissconnect client container from the network;
    $ docker network disconnect <parent_dir>_rabbitmq_network client
    and confirm errors.
  • Connect client container to the network
    $ docker netowrk connect --ip 172.18.1.21 <parent_dir>_rabbitmq_network client
    and confirm erros still continue.

Please let me know if there is anything I can do.

@acogoluegnes
Copy link
Contributor

@ihiroky Thanks for the test project, I managed to reproduce.

A workaround is to increase the number of IO threads: the recovered connection will end up being dispatched to another IO thread.

I'm working on a fix to dispatch heartbeat failure in another thread, this is already the case for other connection failures.

acogoluegnes added a commit that referenced this issue Oct 12, 2018
Otherwise connection recovery can occur in the same NIO thread, leading
to a deadlock.

Fixes #413
@acogoluegnes acogoluegnes added this to the 4.8.4 milestone Oct 12, 2018
acogoluegnes added a commit that referenced this issue Oct 15, 2018
Otherwise connection recovery can occur in the same NIO thread, leading
to a deadlock.

Fixes #413

(cherry picked from commit 46d3d2e)
acogoluegnes added a commit that referenced this issue Oct 15, 2018
Otherwise connection recovery can occur in the same NIO thread, leading
to a deadlock.

Fixes #413

(cherry picked from commit 46d3d2e)
(cherry picked from commit 4f900f0)

Conflicts:
	src/main/java/com/rabbitmq/client/impl/nio/NioLoop.java
@acogoluegnes acogoluegnes modified the milestones: 4.8.4, 4.9.0 Oct 15, 2018
@acogoluegnes acogoluegnes self-assigned this Oct 15, 2018
@acogoluegnes
Copy link
Contributor

@ihiroky 5.5.0.RC1 has shipped with the fix.

@ihiroky
Copy link
Author

ihiroky commented Oct 16, 2018

@acogoluegnes Thank you. I confirmed this isssue has been fixed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants