
Fix flaky test #771


Closed
wants to merge 1 commit into from

Conversation

lukebakken
Collaborator

For whatever reason TestBasicCancelNoWait has been flaky. Trying to figure out why.

@lukebakken lukebakken added this to the 6.0.0 milestone Mar 24, 2020
@lukebakken lukebakken self-assigned this Mar 24, 2020
@lukebakken lukebakken force-pushed the lrb-fix-flaky-roundtrip-test branch from dbba130 to 957de05 Compare March 24, 2020 16:19
Sometimes waiting 2 seconds is not long enough:

https://ci.rabbitmq.com/teams/main/pipelines/dotnet/jobs/test/builds/240

refactor

Fix WorkPool to execute all jobs in the Work queue in a loop iteration
@lukebakken lukebakken force-pushed the lrb-fix-flaky-roundtrip-test branch from 957de05 to 0b673f0 Compare March 24, 2020 20:23
@lukebakken lukebakken marked this pull request as ready for review March 24, 2020 20:23
@lukebakken
Collaborator Author

lukebakken commented Mar 24, 2020

@danielmarbach you may find this fix interesting. The failure here...

https://ci.appveyor.com/project/rabbitmq/rabbitmq-dotnet-client/builds/31667700

is due to the Consume.Ok and Basic.Deliver frames arriving in the same TCP packet, which causes both to be enqueued in the WorkPool before the loop starts. In that case, only the first operation is run (processing Consume.Ok) and Basic.Deliver isn't. The reset event in the test times out, and the test fails.
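
For readers following along, here is a minimal sketch of the drain-the-queue behavior the fix restores. This is not the client's actual WorkPool code; the queue/semaphore layout and the Work type are assumptions for illustration:

```csharp
// Minimal sketch of the idea behind the fix, not the library's actual WorkPool.
// ConcurrentQueue/SemaphoreSlim and the Work type are assumptions for illustration.
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

public class Work
{
    public virtual Task Execute() => Task.CompletedTask; // placeholder job
}

public class WorkPoolSketch
{
    private readonly ConcurrentQueue<Work> _workQueue = new ConcurrentQueue<Work>();
    private readonly SemaphoreSlim _semaphore = new SemaphoreSlim(0);

    public void Enqueue(Work work)
    {
        _workQueue.Enqueue(work);
        _semaphore.Release();
    }

    public async Task Loop()
    {
        while (true)
        {
            await _semaphore.WaitAsync().ConfigureAwait(false);

            // Drain everything already queued. Without this inner while, two items
            // enqueued before the loop wakes up (e.g. Consume.Ok and Basic.Deliver
            // arriving in one TCP packet) would leave the second unprocessed until
            // the next Release(), which is exactly the flake described above.
            while (_workQueue.TryDequeue(out Work work))
            {
                await work.Execute().ConfigureAwait(false);
            }
        }
    }
}
```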

@danielmarbach
Collaborator

My original code always had a while loop

c393768#diff-4e86f90f6b775db8f50d55f854995db9R82

to take out all the enqueued work once the loop was spinning, because that was the behavior the sync version had. Could you have missed that when reviewing stebet's PR?

@danielmarbach
Collaborator

Pretty sure it is broken without the while loop because it would always miss one.

@michaelklishin
Contributor

@danielmarbach does the version in this PR look correct to you?

@danielmarbach
Collaborator

danielmarbach commented Mar 24, 2020 via email

@danielmarbach
Collaborator

@michaelklishin @lukebakken I have one question that has bothered me for a long time, before I send in the PR. The previous loop implementation did actually block on stop until the loop was properly stopped. Today's service implementation immediately stops and returns even though there might still be work going on or in the queue. Is this really the behavior you guys wanted when accepting stebet's PR?
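
To make the distinction concrete, here is a hedged sketch of a "drain, then stop" shutdown, reusing the Work type from the sketch above. All field and method names here are illustrative assumptions, not the library's actual members; the point is that StopAsync only completes once the loop has drained and exited, whereas an "immediate" stop returns while work may still be queued:

```csharp
// Hedged sketch only, not the client's actual code: a Stop that waits for the
// loop to finish draining, as the pre-refactor implementation did.
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

public class DrainingWorkPoolSketch
{
    private readonly ConcurrentQueue<Work> _workQueue = new ConcurrentQueue<Work>();
    private readonly SemaphoreSlim _semaphore = new SemaphoreSlim(0);
    private readonly CancellationTokenSource _cts = new CancellationTokenSource();
    private Task _loopTask;

    public void Start() => _loopTask = Task.Run(() => Loop(_cts.Token));

    public void Enqueue(Work work)
    {
        _workQueue.Enqueue(work);
        _semaphore.Release();
    }

    private async Task Loop(CancellationToken token)
    {
        while (!token.IsCancellationRequested)
        {
            try
            {
                await _semaphore.WaitAsync(token).ConfigureAwait(false);
            }
            catch (OperationCanceledException)
            {
                break; // stop was requested; fall through to the final drain
            }

            while (_workQueue.TryDequeue(out Work work))
            {
                await work.Execute().ConfigureAwait(false);
            }
        }

        // Final drain so already-enqueued work is not silently dropped on shutdown.
        while (_workQueue.TryDequeue(out Work remaining))
        {
            await remaining.Execute().ConfigureAwait(false);
        }
    }

    public Task StopAsync()
    {
        _cts.Cancel();
        return _loopTask; // completes only after the loop has drained and exited
    }
}
```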

@michaelklishin
Contributor

@danielmarbach it would be nice to stop the service cleanly after processing all pending operations. I assume we stop the service e.g. when the connection or channel is closed, so not on the hot path?

@danielmarbach
Collaborator

danielmarbach commented Mar 24, 2020 via email

@michaelklishin
Contributor

@danielmarbach sounds like it. Such things are easy to miss since shutdown is not seen as an "interesting" or "difficult" problem by most developers :)

@danielmarbach danielmarbach mentioned this pull request Mar 24, 2020
@danielmarbach
Collaborator

How about #772 as a first step? It also aligns the sync consumer service a bit.

@danielmarbach
Collaborator

@michaelklishin @lukebakken It seems this statement wasn't sufficiently reviewed:

#687 (comment)

@danielmarbach
Collaborator

Such things are easy to miss since shutdown is not seen as an "interesting" or "difficult" problem by most developers :)

In order to fix it we would need to reintroduce a GetAwaiter().GetResult() call within Close() or Abort(), which can still deadlock as long as the model is not fully async. Pick your poison ;)

What do you guys want?
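
For context, an illustration of the trade-off being discussed, assuming a placeholder ShutdownAsync method (not an actual API of the client):

```csharp
// Illustration only of the trade-off described above; ShutdownAsync is a
// placeholder name, not an actual member of the client.
using System.Threading.Tasks;

public class ModelShutdownSketch
{
    public void Close()
    {
        // Sync-over-async: fine on a thread-pool thread, but it can deadlock when
        // called from a thread that owns a SynchronizationContext (e.g. a WinForms
        // or WPF UI thread) if the awaited code tries to resume on that same,
        // now-blocked, context.
        ShutdownAsync().GetAwaiter().GetResult();
    }

    private Task ShutdownAsync()
    {
        // Placeholder for "wait for the dispatcher/WorkPool to drain and stop".
        return Task.CompletedTask;
    }
}
```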

@michaelklishin
Contributor

@danielmarbach, in that case, we might choose to go with the current behavior. The reasoning is that app developers can achieve a clean shutdown in practice e.g. by introducing a delay before models are closed. Unfortunately, there's usually no such path to recover from a deadlock.

What would be the conditions for the deadlock? How likely do you think they are?

@danielmarbach
Collaborator

It is really only a problem in environments with a SynchronizationContext. So any usage of the model in WinForms or WPF, for example, would hang on shutdown if the underlying code doesn't opt out of context capturing. This risk could be largely mitigated by having ConfigureAwait(false) in place or by directly returning the task. Of course there is also the problem of essentially wasting more threads than necessary, but that is negligible given we would only do this on shutdown.

4.x and 5.1.x already have the GetAwaiter call in place, btw: https://github.com/rabbitmq/rabbitmq-dotnet-client/blob/v5.1.2/projects/client/RabbitMQ.Client/src/client/impl/AsyncConsumerDispatcher.cs#L29

So it would bring back the shutdown behavior we had before stebet's changes.
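
A hedged sketch of the mitigation mentioned above, with ProcessRemainingWorkAsync as a placeholder name:

```csharp
// Hedged sketch of the ConfigureAwait(false) mitigation; ProcessRemainingWorkAsync
// is a placeholder, not the client's actual API.
using System.Threading.Tasks;

public class ModelShutdownWithConfigureAwaitSketch
{
    public void Close() => ShutdownAsync().GetAwaiter().GetResult();

    private async Task ShutdownAsync()
    {
        // Without ConfigureAwait(false), the continuation after this await is posted
        // back to the captured SynchronizationContext (the UI thread in WinForms/WPF).
        // If that thread is blocked inside Close() on GetAwaiter().GetResult(),
        // neither side can make progress: a classic sync-over-async deadlock.
        // With ConfigureAwait(false), the continuation runs on a thread-pool thread
        // instead, so the blocked caller is released once the Task completes.
        await ProcessRemainingWorkAsync().ConfigureAwait(false);
    }

    private Task ProcessRemainingWorkAsync() => Task.CompletedTask; // placeholder
}
```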

@michaelklishin
Contributor

@danielmarbach OK, those changes are probably worth the risk as most users we see do not use WinForms or WPF.

@lukebakken lukebakken closed this Mar 24, 2020
@lukebakken lukebakken deleted the lrb-fix-flaky-roundtrip-test branch March 24, 2020 22:40
@lukebakken
Collaborator Author

@danielmarbach as for missing items in previous reviews, I'm basically trusting the statements made by you, @bording and @stebet, as you all have more recent experience with this library. Debugging this issue has increased my understanding, however 😄

@danielmarbach
Collaborator

Fair enough. FYI, I mostly spelunk in this code base in my free time and don't spend enough time on it that I would trust myself. Sometimes when I get pinged on this repo about a specific question, I only look at the PR from the angle of that question and not further, to be mindful of my time, and that can cause me to miss things. I'm happy to help where I can from time to time, but I'm definitely not trustworthy 😁😊

@danielmarbach
Collaborator

Sent in #781
