Skip to content

RPC timeouts can cause subsequent ClassCastExceptions #290

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
vikinghawk opened this issue Jul 6, 2017 · 11 comments
Closed

RPC timeouts can cause subsequent ClassCastExceptions #290

vikinghawk opened this issue Jul 6, 2017 · 11 comments
Assignees
Milestone

Comments

@vikinghawk
Copy link
Contributor

During cluster node failovers we have seen some TimeoutExceptions during connection recovery end up causing ClassCastExceptions because the reply for the timed out rpc request comes in while a 2nd rpc request is waiting.

Ideally in cases like this where the requestor has already timed out and gone away you would just throw that incoming reply away. Looking thru the code I'm not sure there is a good way to handle this though. Checking in all places that the command type is of the expected class prevents the exception, but you still have the issue that the reply delivered is not actually for the current request.

I believe it would require tagging requests with a unique id and then have the server set that request id as a correlation id on the reply allowing the client to match the reply with the original request. Would something like that be feasible in the AMQP spec?

@vikinghawk vikinghawk changed the title RPC TimeoutExceptions can cause subsequent ClassCastExceptions RPC timeouts can cause subsequent ClassCastExceptions Jul 6, 2017
@michaelklishin
Copy link
Contributor

How can a channel implementation know if "the requestor has already timed out and gone away"?

@michaelklishin
Copy link
Contributor

I'm afraid extending every protocol method with a new field is unrealistic at this point.

@vikinghawk
Copy link
Contributor Author

How can a channel implementation know if "the requestor has already timed out and gone away"?

The AMQChannel code already is catching the TimeoutException and calls cleanRpcChannelState to get ready for the next request before rethrowing the timeout ex.

I'm afraid extending every protocol method with a new field is unrealistic at this point.

Ya thats what I was afraid of. I didn't know if AMQContentHeader could have more fields added to it passively on methods such as Queue.Declare and Queue.DeclareOk

There are still ways to make the code better however.

  1. we could protect against ClassCastExceptions by verifying the class passed to RpcContinuation.handleCommand is of the expected type
  1. additionally for requests/replies such as Queue.Declare/DeclareOk we could verify the queue names match before assuming the reply if for the current request.

@acogoluegnes
Copy link
Contributor

We can add an option to do the check. This would then go into 4.2.0.

@acogoluegnes acogoluegnes reopened this Jul 7, 2017
@acogoluegnes acogoluegnes self-assigned this Jul 7, 2017
@acogoluegnes acogoluegnes added this to the 4.2.0 milestone Jul 7, 2017
acogoluegnes added a commit that referenced this issue Jul 7, 2017
During cluster node failovers some TimeoutExceptions during connection recovery
end up causing ClassCastExceptions because the reply for the timed out
RPC request comes in while a 2nd RPC request is waiting.

This commit add the ConnectionFactory#channelCheckRpcReplyType flag that, if enabled,
will make the channel check if a reply is compatible with the outstanding
RPC request. The flag is set to false by default but could be set to true in the future,
when the check logic is proven reliable.

This code was originally contributed by @vikinghawk.

Fixes #290
@acogoluegnes
Copy link
Contributor

@vikinghawk @michaelklishin A PR that adapted @vikinghawk's code for 4.2.0.

@vikinghawk
Copy link
Contributor Author

+1

@acogoluegnes
Copy link
Contributor

@vikinghawk Thanks again for your contribution! You can have a try with the 4.2 snapshot.

@Inverness
Copy link

After encountering this issue myself and enabling the response type checking I got this:

2017-11-22 15:33:33 [AMQP Connection 10.5.17.63:5672] ERROR c.s.r.u.RMQDefaultExceptionHandler - Unhandled error in RabbitMQ connection
java.lang.NullPointerException: null
	at com.rabbitmq.client.impl.AMQChannel.handleCompleteInboundCommand(AMQChannel.java:185)
	at com.rabbitmq.client.impl.AMQChannel.handleFrame(AMQChannel.java:111)
	at com.rabbitmq.client.impl.AMQConnection.readFrame(AMQConnection.java:643)
	at com.rabbitmq.client.impl.AMQConnection.access$300(AMQConnection.java:47)
	at com.rabbitmq.client.impl.AMQConnection$MainLoop.run(AMQConnection.java:581)
	at java.lang.Thread.run(Thread.java:748)

The added code does not check if _activeRpc is null before calling canHandleReply()

acogoluegnes added a commit that referenced this issue Nov 24, 2017
@acogoluegnes
Copy link
Contributor

acogoluegnes commented Nov 24, 2017

@Inverness Thanks for pointing this out, 5ff6ad4 fixes this, it will go in to 4.4.0 (which will be released in the next few days).

@Inverness
Copy link

@acogoluegnes I'm currently using 5.0.0. When will the fix be available for that branch?

@michaelklishin
Copy link
Contributor

Reasonably soon.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants