
Fix flaky test #771


Closed
wants to merge 1 commit into from

Conversation

lukebakken
Collaborator

For whatever reason TestBasicCancelNoWait has been flaky. Trying to figure out why.

@lukebakken lukebakken added this to the 6.0.0 milestone Mar 24, 2020
@lukebakken lukebakken self-assigned this Mar 24, 2020
@lukebakken lukebakken force-pushed the lrb-fix-flaky-roundtrip-test branch from dbba130 to 957de05 Compare March 24, 2020 16:19
Sometimes waiting 2 seconds is not long enough:

https://ci.rabbitmq.com/teams/main/pipelines/dotnet/jobs/test/builds/240

refactor

Fix WorkPool to execute all jobs in the Work queue in a loop iteration
@lukebakken lukebakken force-pushed the lrb-fix-flaky-roundtrip-test branch from 957de05 to 0b673f0 Compare March 24, 2020 20:23
@lukebakken lukebakken marked this pull request as ready for review March 24, 2020 20:23
@lukebakken
Collaborator Author

lukebakken commented Mar 24, 2020

@danielmarbach you may find this fix interesting. The failure here...

https://ci.appveyor.com/project/rabbitmq/rabbitmq-dotnet-client/builds/31667700

is due to the Consume.Ok and Basic.Deliver frames arriving in the same TCP packet, which causes both to be enqueued in the WorkPool before the loop starts. In that case, only the first operation is run (processing Consume.Ok) and Basic.Deliver isn't. The reset event in the test times out, and the test fails.
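
For readers following along, here is a minimal sketch of the drain-the-queue behavior the fix restores. This is not the client's actual WorkPool code; the queue/semaphore layout and the Work type are assumptions for illustration:

```csharp
// Minimal sketch of the idea behind the fix, not the library's actual WorkPool.
// ConcurrentQueue/SemaphoreSlim and the Work type are assumptions for illustration.
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

public class Work
{
    public virtual Task Execute() => Task.CompletedTask; // placeholder job
}

public class WorkPoolSketch
{
    private readonly ConcurrentQueue<Work> _workQueue = new ConcurrentQueue<Work>();
    private readonly SemaphoreSlim _semaphore = new SemaphoreSlim(0);

    public void Enqueue(Work work)
    {
        _workQueue.Enqueue(work);
        _semaphore.Release();
    }

    public async Task Loop()
    {
        while (true)
        {
            await _semaphore.WaitAsync().ConfigureAwait(false);

            // Drain everything already queued. Without this inner while, two items
            // enqueued before the loop wakes up (e.g. Consume.Ok and Basic.Deliver
            // arriving in one TCP packet) would leave the second unprocessed until
            // the next Release(), which is exactly the flake described above.
            while (_workQueue.TryDequeue(out Work work))
            {
                await work.Execute().ConfigureAwait(false);
            }
        }
    }
}
```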

@danielmarbach
Collaborator

My original code always had a while loop

c393768#diff-4e86f90f6b775db8f50d55f854995db9R82

to take out all the enqueued work once the loop was spinning, because that was the behavior the sync version had. Could you have missed that when reviewing stebet's PR?

@danielmarbach
Collaborator

Pretty sure it is broken without the while loop because it would always miss one.

@michaelklishin
Contributor

@danielmarbach does the version in this PR look correct to you?

@danielmarbach
Collaborator

danielmarbach commented Mar 24, 2020 via email

@danielmarbach
Collaborator

@michaelklishin @lukebakken I have one question that has bothered me for a long time, before I send in the PR. The previous loop implementation did actually block on stop until the loop was properly stopped. Today's service implementation immediately stops and returns even though there might still be work going on or in the queue. Is this really the behavior you guys wanted when accepting stebet's PR?
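
To make the distinction concrete, here is a hedged sketch of a "drain, then stop" shutdown, reusing the Work type from the sketch above. All field and method names here are illustrative assumptions, not the library's actual members; the point is that StopAsync only completes once the loop has drained and exited, whereas an "immediate" stop returns while work may still be queued:

```csharp
// Hedged sketch only, not the client's actual code: a Stop that waits for the
// loop to finish draining, as the pre-refactor implementation did.
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

public class DrainingWorkPoolSketch
{
    private readonly ConcurrentQueue<Work> _workQueue = new ConcurrentQueue<Work>();
    private readonly SemaphoreSlim _semaphore = new SemaphoreSlim(0);
    private readonly CancellationTokenSource _cts = new CancellationTokenSource();
    private Task _loopTask;

    public void Start() => _loopTask = Task.Run(() => Loop(_cts.Token));

    public void Enqueue(Work work)
    {
        _workQueue.Enqueue(work);
        _semaphore.Release();
    }

    private async Task Loop(CancellationToken token)
    {
        while (!token.IsCancellationRequested)
        {
            try
            {
                await _semaphore.WaitAsync(token).ConfigureAwait(false);
            }
            catch (OperationCanceledException)
            {
                break; // stop was requested; fall through to the final drain
            }

            while (_workQueue.TryDequeue(out Work work))
            {
                await work.Execute().ConfigureAwait(false);
            }
        }

        // Final drain so already-enqueued work is not silently dropped on shutdown.
        while (_workQueue.TryDequeue(out Work remaining))
        {
            await remaining.Execute().ConfigureAwait(false);
        }
    }

    public Task StopAsync()
    {
        _cts.Cancel();
        return _loopTask; // completes only after the loop has drained and exited
    }
}
```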

@michaelklishin
Contributor

@danielmarbach it would be nice to stop the service cleanly after processing all pending operations. I assume we stop the service e.g. when the connection or channel is closed, so not on the hot path?

@danielmarbach
Collaborator

danielmarbach commented Mar 24, 2020 via email

@michaelklishin
Contributor

@danielmarbach sounds like it. Such things are easy to miss since shutdown is not seen as an "interesting" or "difficult" problem by most developers :)

@danielmarbach danielmarbach mentioned this pull request Mar 24, 2020
@danielmarbach
Collaborator

How about #772 as a first step? It also aligns the sync consumer service a bit.

@danielmarbach
Collaborator

@michaelklishin @lukebakken It seems this statement wasn't sufficiently reviewed:

#687 (comment)

@danielmarbach
Collaborator

Such things are easy to miss since shutdown is not seen as an "interesting" or "difficult" problem by most developers :)

In order to fix it we would need to reintroduce a GetAwaiter().GetResult() call within Close() or Abort(), which can still deadlock as long as the model is not fully async. Pick your poison ;)

What do you guys want?
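
For context, an illustration of the trade-off being discussed, assuming a placeholder ShutdownAsync method (not an actual API of the client):

```csharp
// Illustration only of the trade-off described above; ShutdownAsync is a
// placeholder name, not an actual member of the client.
using System.Threading.Tasks;

public class ModelShutdownSketch
{
    public void Close()
    {
        // Sync-over-async: fine on a thread-pool thread, but it can deadlock when
        // called from a thread that owns a SynchronizationContext (e.g. a WinForms
        // or WPF UI thread) if the awaited code tries to resume on that same,
        // now-blocked, context.
        ShutdownAsync().GetAwaiter().GetResult();
    }

    private Task ShutdownAsync()
    {
        // Placeholder for "wait for the dispatcher/WorkPool to drain and stop".
        return Task.CompletedTask;
    }
}
```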

@michaelklishin
Contributor

@danielmarbach, in that case, we might choose to go with the current behavior. The reasoning is that app developers can achieve a clean shutdown in practice e.g. by introducing a delay before models are closed. Unfortunately, there's usually no such path to recover from a deadlock.

What would be the conditions for the deadlock? How likely do you think they are?

@danielmarbach
Collaborator

It is really only a problem in environments with a SynchronizationContext. So any usage of the model in WinForms or WPF, for example, would hang on shutdown if the underlying code doesn't opt out of context capturing. This risk could be largely mitigated by having ConfigureAwait(false) in place or by directly returning the task. Of course there is also the problem of essentially wasting more threads than necessary, but that is negligible given we would only do this on shutdown.

4.x and 5.1.x already have the GetAwaiter call in place, btw: https://github.com/rabbitmq/rabbitmq-dotnet-client/blob/v5.1.2/projects/client/RabbitMQ.Client/src/client/impl/AsyncConsumerDispatcher.cs#L29

So it would bring back the shutdown behavior we had before stebet's changes.
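
A hedged sketch of the mitigation mentioned above, with ProcessRemainingWorkAsync as a placeholder name:

```csharp
// Hedged sketch of the ConfigureAwait(false) mitigation; ProcessRemainingWorkAsync
// is a placeholder, not the client's actual API.
using System.Threading.Tasks;

public class ModelShutdownWithConfigureAwaitSketch
{
    public void Close() => ShutdownAsync().GetAwaiter().GetResult();

    private async Task ShutdownAsync()
    {
        // Without ConfigureAwait(false), the continuation after this await is posted
        // back to the captured SynchronizationContext (the UI thread in WinForms/WPF).
        // If that thread is blocked inside Close() on GetAwaiter().GetResult(),
        // neither side can make progress: a classic sync-over-async deadlock.
        // With ConfigureAwait(false), the continuation runs on a thread-pool thread
        // instead, so the blocked caller is released once the Task completes.
        await ProcessRemainingWorkAsync().ConfigureAwait(false);
    }

    private Task ProcessRemainingWorkAsync() => Task.CompletedTask; // placeholder
}
```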

@michaelklishin
Contributor

@danielmarbach OK, those changes are probably worth the risk as most users we see do not use WinForms or WPF.

@lukebakken lukebakken closed this Mar 24, 2020
@lukebakken lukebakken deleted the lrb-fix-flaky-roundtrip-test branch March 24, 2020 22:40
@lukebakken
Collaborator Author

@danielmarbach as for missing items in previous reviews, I'm basically trusting the statements made by you, @bording and @stebet, as you all have more recent experience with this library. Debugging this issue has increased my understanding, however 😄

@danielmarbach
Collaborator

Fair enough. FYI, I mostly spelunk in this code base in my free time and don't spend enough time on it that I would trust myself. Sometimes when I get pinged on this repo about a specific question, I only look at the PR from the angle of that question and not further, to be mindful of my time, and that can cause me to miss things. I'm happy to help where I can from time to time, but I'm definitely not trustworthy 😁😊

@danielmarbach
Collaborator

Sent in #781
