
Optimize publication path when messages are expiring #51


Closed

Conversation

richardlarocque
Contributor

We saw some unusual CPU spikes when we migrated our direct exchanges to a delayed exchange in a heavy-traffic environment. The spikes did not correlate well with the amount of traffic, which made the behavior very suspicious.

After some investigation [1], we found that both publication to the exchange and message timeout expiry were quite fast. However, when publication and message expiry were occurring at the same time, the publication rate plummeted and CPU usage skyrocketed.

The root cause of this issue appears to be the branch of code intended to start a timer when the first delayed message is published to an empty exchange. That branch is also triggered, unintentionally, when new messages are published while the exchange is falling behind in its handling of expired messages, and it happens to be much slower than the regular case.

This PR avoids the issue by moving the responsibility for starting the initial timer from the publication code path to the exchange itself. The exchange now maintains a far-future timer as a placeholder even when there are no delayed messages sitting in the queue.

[1] https://groups.google.com/forum/#!topic/rabbitmq-users/XgjY7UtLkfs
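
A minimal sketch of the placeholder-timer idea described above, with illustrative details not taken from the actual diff (only the `delay_first/0` name comes from the change; the one-hour constant and the `deliver` timeout message are assumptions):

```erlang
%% Sketch only: the exchange process always owns exactly one timer,
%% seeded far in the future when the delay queue is empty.
-define(FAR_FUTURE_MS, 60 * 60 * 1000).  %% 1 hour

%% Returns a timer reference that the exchange keeps in its state.
%% The 'deliver' atom stands in for whatever message the exchange
%% sends itself when a timer fires.
delay_first() ->
    erlang:start_timer(?FAR_FUTURE_MS, self(), deliver).
```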

Run `make rabbitmq-components-mk` and commit the result.
Replaces `maybe_delay_first` with `delay_first`.  Modifies the behavior
so that there is always one timer active, even when there are no
delayed messages in the exchange.  This timer is not used for anything
except bookkeeping, so we set it to expire far in the future (1 hour)
so it doesn't consume resources for no reason.

Previously the `erlang:read_timer(CurrTimer) == false` code path could
be hit in two scenarios:
- No timer had been set.  The message currently being added is the
  only one in the delay queue.
- The timer has already expired.  The delay queue is full of messages
  ready for publication and the exchange has not yet caught up.
In the former case, the exchange should start a new timer.  In the
latter, the exchange is far behind and starting a new timer will only
waste time.  The new invariant makes the first of those scenarios
impossible.

Updates the publication code path to take advantage of this change.
Since the first of those two scenarios is no longer possible, the
publication code path is no longer responsible for starting timers.
(It can still *replace* a timer if the current timer is set too far
in the future, but it can't create a new one from nothing.)
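
A hedged sketch of the publication-path behaviour described above; the helper name `maybe_replace_timer/2` and the `deliver` timeout message are hypothetical, not taken from the diff:

```erlang
%% Replace the current timer only if the incoming message is due
%% sooner than the timer would fire; never create a timer from
%% nothing on the publish path.
maybe_replace_timer(Delay, CurrTimer) ->
    case erlang:read_timer(CurrTimer) of
        Remaining when is_integer(Remaining), Remaining > Delay ->
            %% The new message expires before the current timer fires:
            %% cancel it and start a shorter one.
            erlang:cancel_timer(CurrTimer),
            erlang:start_timer(Delay, self(), deliver);
        _ ->
            %% Timer already expired (the exchange is catching up) or
            %% it fires soon enough; leave it as-is.
            CurrTimer
    end.
```

With the new invariant, `CurrTimer` is always a valid timer reference, so the publish path never has to handle a "no timer yet" case.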
@richardlarocque
Contributor Author

I know it may seem weird to keep a timer around even when the exchange is idle. However, I think the alternatives are worse.

We could instead track whether or not a timer has been set with a flag in the State record. That's a little cleaner, but it risks getting out of sync.

Maintaining the extra timer eliminates some code paths that would be needed to handle the state flag. It's not more expensive, either: we already have to inspect the timer on every publish to decide whether it needs to be restarted with a shorter delay.

@michaelklishin self-assigned this May 24, 2016
@michaelklishin
Contributor

Thank you, we will consider this.

    {noreply, State};
handle_cast(go, State = #state{}) ->
    delay_first(),
    {noreply, State#state{timer = delay_first()}};

There are two delay_first in a row. Is that on purpose?

@richardlarocque
Contributor Author

Thanks a lot for the review.

The extra delay_first() call is indeed a bug. The second comment probably indicates a real bug, too, but the solution to it is not as obvious. Let me know what you think of it and I'll update the PR accordingly.

@michaelklishin
Contributor

Let's close this in favour of #53.
