Skip to content

Take "true" case into consideration #1685

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Dec 17, 2018
Merged

Take "true" case into consideration #1685

merged 2 commits into from
Dec 17, 2018

Conversation

lukebakken
Copy link
Collaborator

Fixes #1682

@lukebakken lukebakken added this to the 3.7.8 milestone Aug 27, 2018
@lukebakken lukebakken self-assigned this Aug 27, 2018
@hairyhum
Copy link
Contributor

I understand that the original assertion was there to fail if an operation fails on a live queue Pid.
This behaviour will retry the operation if it's failing and then fail with the timeout error.
Is there a way to make this behaviour better? Running it for 2000 times and then reporting timeout doesn't look less confusing than the failed assertion.

@lukebakken
Copy link
Collaborator Author

@hairyhum do you think the following sequence of events could trigger the badmatch:

  • with/4 is called on a stopped mirrored queue here
  • The thunk F fails, so we fall into the error handler here
  • During that time, the queue pid becomes active
  • Since the queue is mirrored, we enter this code block, and trigger a badmatch

@hairyhum
Copy link
Contributor

Maybe. Mirroring does some waiting to avoid false-positives. Unfortunately F is an arbitrary function so we don't know why it fails and it's not logged.

@michaelklishin
Copy link
Collaborator

@lukebakken's hypothesis sounds plausible to me.

@lukebakken
Copy link
Collaborator Author

We know that retry_wait/4 is only called if F exits with one of these reasons:

noproc
noconnection
nodedown
normal
shutdown

Let me know what you think of 9d9f624. I special-cased the scenario where the queue is mirrored and the pid is alive. Then, in the other case, rather than matching against false the error handler func is called.

@michaelklishin michaelklishin removed this from the 3.7.8 milestone Aug 30, 2018
@hairyhum
Copy link
Contributor

The fact that retry_wait is not called for every failure makes me wonder what is the difference between mirrored master and non-mirrored queue process becoming available again? I thought that the handler is called on any error.

Copy link
Contributor

@hairyhum hairyhum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still don't see the difference between master process being back and the non-mirrored queue being back after disconnect.
Also because pid reuse the QPid might not be the actual queue process. In this case sleep is actually helpful because we wait for mnesia record update.

@lukebakken
Copy link
Collaborator Author

Another instance of this error: https://groups.google.com/d/topic/rabbitmq-users/u9URsZ2ova8/discussion

Fixes #1682

Call E() instead of crashing with badmatch, retry when queue is mirrored
@lukebakken
Copy link
Collaborator Author

@hairyhum @michaelklishin @kjnilsson I'm not letting this one go! My latest change is simpler, and tries to do something sane where a crash would have happened.

Copy link
Contributor

@hairyhum hairyhum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should at least give more visibility.

@lukebakken
Copy link
Collaborator Author

Thanks for the review @hairyhum

@lukebakken lukebakken merged commit 6b73cad into master Dec 17, 2018
@lukebakken lukebakken deleted the rabbitmq-server-1682 branch December 17, 2018 20:49
lukebakken added a commit that referenced this pull request Dec 17, 2018
Take "true" case into consideration

(cherry picked from commit 6b73cad)
@lukebakken
Copy link
Collaborator Author

Backported to v3.7.x

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants