
Set segment_entry_count per vhost and use a better default #2954


Merged
michaelklishin merged 1 commit into master from new-segment-entry-count-default on May 26, 2021

Conversation


@lhoguin lhoguin commented Apr 7, 2021

Proposed Changes

Lower values of queue_index_segment_entry_count have been shown to
potentially improve queue performance. This PR allows changing the
default to a better value without breaking existing installations.
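
For reference, assuming the value is exposed through the rabbit application environment under the key named above (as the remote-shell testing later in this thread suggests), an advanced.config override might look like the sketch below; this is an illustration, not code taken from the PR.

    [
      {rabbit, [
        %% assumed key name, taken from the PR description; 2048 is the
        %% value this PR eventually settles on as the default
        {queue_index_segment_entry_count, 2048}
      ]}
    ].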

Types of Changes

  • Bug fix (non-breaking change which fixes issue #NNNN)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause an observable behavior change in existing systems)
  • Documentation improvements (corrections, new content, etc)
  • Cosmetic change (whitespace, formatting, etc)

Checklist

  • I have read the CONTRIBUTING.md document
  • I have signed the CA (see https://cla.pivotal.io/sign/rabbitmq)
  • All tests pass locally with my changes
  • I have added tests that prove my fix is effective or that my feature works
  • I have added necessary documentation (if appropriate)
  • Any dependent changes have been merged and published in related repositories

Further Comments

At this point I have not tested it nor have I confirmed it works when upgrading.

The new default needs to be figured out by testing with the values 512, 1024, 2048, 4096, 8192 and 16384. Since @gerhard already has a setup comparing 1024 and 16384, I think we should just extend the testing by adding more clusters for comparison.

@lhoguin lhoguin force-pushed the new-segment-entry-count-default branch from 66047f2 to 9ed851d on April 8, 2021 12:52
@@ -377,6 +410,10 @@ msg_store_dir_base() ->
    Dir = rabbit_mnesia:dir(),
    filename:join([Dir, "msg_stores", "vhosts"]).

config_file_path(VHost) ->
    VHostDir = msg_store_dir_path(VHost),
    filename:join(VHostDir, ".config").
Collaborator

We should avoid .config as it might generate confusion with advanced.config and such. How about .meta since all we store here is metadata?

Or did you intentionally want to emphasize that this is a file:consult/1-compatible file?

Contributor Author

I do want to show it's a consult file, but I agree that .config is not good; it's not something we want the user to ever modify. On that note, I think I will add a comment inside the file saying it's auto-generated.

Contributor Author

I have left .config and added a comment at the top of the file instead. I have noticed that we already have some .config files, for example https://github.com/rabbitmq/rabbitmq-server/blob/master/deps/rabbit/src/rabbit_node_monitor.erl#L73-L74

The .config for the queue is deep within the mnesia directory, so it is unlikely that users will stumble upon it. But if that's not good enough we can still change it; I'm not sure what to change it to, though.
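
Below is a minimal sketch (not the PR's actual code) of how such a file:consult/1-compatible per-vhost file could be written with an auto-generated notice at the top and read back. The names store_vhost_config/2 and read_vhost_config/1 are hypothetical; config_file_path/1 is the function from the diff above.

    %% Write the per-vhost metadata as Erlang terms, preceded by a comment
    %% warning that the file is auto-generated and should not be edited.
    store_vhost_config(VHost, Terms) ->
        Path = config_file_path(VHost),
        Data = ["%% This file is auto-generated. Do not edit.\n" |
                [io_lib:format("~p.~n", [Term]) || Term <- Terms]],
        ok = file:write_file(Path, Data).

    %% Read the terms back; a missing file simply means no per-vhost
    %% value has been stored yet.
    read_vhost_config(VHost) ->
        Path = config_file_path(VHost),
        case file:consult(Path) of
            {ok, Terms}     -> Terms;
            {error, enoent} -> []
        end.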


lhoguin commented Apr 29, 2021

Thanks to @gerhard, I have been able to run scenarios that produce interesting results for different values, based on a scenario reported by customers earlier this year. The scenario is 50k queues spread over 3 nodes, where published messages fan out across those 50k queues. PerfTest uses the following arguments:

             --predeclared
             --exchange 1001
             --routing-key 1001-1
             --flag persistent
             --producers 1
             --rate 1
             --size 4096
             --confirm 1
             --confirm-timeout 60
             --pmessages 10000
             --consumers 0 

Messages are NOT embedded in the index.

I have tested with queue_index_segment_entry_count values of 16384 (what RabbitMQ master uses), 8192, 4096, 2048, 1024 and 512. Between runs, I stopped the filling job, did a stop_app, deleted the queue data, made the node exit normally with q() so that the pod got recreated (I couldn't get start_app to work; it gave me a timeout), set the queue_index_segment_entry_count value in a remote shell and then, once the nodes recovered, started the filling job again.
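
The exact remote-shell commands are not recorded in this PR; a minimal sketch of setting the value on a node between runs, assuming it is read from the rabbit application environment, could be (2048 being just one of the tested values):

    %% run in a shell attached to the node, e.g. via
    %% `rabbitmq-diagnostics remote_shell` or wrapped in `rabbitmqctl eval`
    application:set_env(rabbit, queue_index_segment_entry_count, 2048).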

I have captured 3 interesting graphs. In all cases the queues get filled in about 2h10m (likely as a result of --rate 1). In all cases, once the queues are full, the effective publishing rate decreases by the same amount. The only difference in behavior is in the second graph, which shows memory before publishers are blocked.

queue_index_segment_entry_count 16384 (what RabbitMQ master uses)

Things go bad. One of the nodes behaves quite differently while filling up, using a lot more memory. Once the queues are full, the nodes all use significantly more memory and block the publishers at regular intervals. We can see a lot of large spikes in memory usage, up and down.

[graph: qisec_16384]

queue_index_segment_entry_count 8192

The memory stabilizes around 9.3GiB before publishers are blocked. We can still see medium-sized spikes, indicating that a lot of memory allocation and deallocation is likely going on due to how the index works.

[graph: qisec_8192]

queue_index_segment_entry_count 4096

Still around 9.3GiB. Similar scenario but with smaller spikes.

[graph: qisec_4096]

queue_index_segment_entry_count 2048

Memory ends up around 11.5GiB for two nodes, 10.5GiB for the other (the one that receives messages). There are very few spikes.

This scenario looks best, but it was also stopped perhaps a bit early. Over time I would expect the memory graphs to go down a little (as usage increases), just like in the 1024 and 512 scenarios.

[graph: qisec_2048]

queue_index_segment_entry_count 1024

Very similar to the 2048 scenario, perhaps using just a little more memory: I'd say around 11.4GiB and 10.4GiB respectively, where the previous scenario was at 11.5GiB and 10.5GiB. But it's hard to be very confident about this.

On the other hand, because messages are not embedded in the index this time, perhaps we could see a bigger difference if we did embed the messages.

[graph: qisec_1024]

queue_index_segment_entry_count 512

Very similar to scenarios 2048 and 1024. Around the same memory usage. Again, hard to be super confident.

[graph: qisec_512]

Conclusion

We should probably do test runs of 2048, 1024 and 512 with messages embedded in the index, and see what results we get. The winner will be one of those three.


lhoguin commented Apr 30, 2021

With messages embedded in the index, the results are inconclusive. As you can see, in all cases the memory slowly goes down until there is no more memory available. Then publishers are blocked, then publishers are started again, then memory goes down faster until there is none left. This repeats until the end of the experiment.

The memory usage after publishers are stopped seems to be directly related to the number of messages. Because in this scenario we do not end up with many segments (2 at most, even for the 512 scenario), we can easily see the impact that segments have on memory even at rest. Perhaps this is also why it is difficult to notice any difference between the three scenarios.

This is with message sizes of 4000 bytes so that the messages get embedded into the index. Because messages end up being duplicated into each queue index, disk alarms are triggered at around 45.1 million messages, which is where the experiment ends for 1024 and 512.
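
As background (this setting is not changed by the PR), whether a payload is embedded in the queue index is governed by queue_index_embed_msgs_below, which defaults to 4096 bytes, so the 4000-byte payloads above fall under that threshold. Expressed as an advanced.config sketch:

    [
      {rabbit, [
        %% messages smaller than this many bytes are embedded in the
        %% queue index rather than stored in the message store (default: 4096)
        {queue_index_embed_msgs_below, 4096}
      ]}
    ].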

There are two places where the publishers stopped that are not related to memory; I will comment below the relevant graphs. I suspect this is some network issue or similar. There are also times where we do not get metrics even though the nodes are running and nothing is in the logs.

queue_index_segment_entry_count 2048

[screenshot: 2021-04-30 13:23]

queue_index_segment_entry_count 1024

[screenshot: 2021-04-30 13:24]

Between around 15h50 to 17h05 the publishers were not running.

queue_index_segment_entry_count 512

[screenshot: 2021-04-30 13:25]

Between around 11h45 to 12h10 the publishers were not running.

Conclusion

Inconclusive. I am thinking of running a benchmark to see the messages/s against a single queue and see if there's any difference there. If there isn't, then I will consider all three values to be roughly equivalent and suggest that we use 2048 as the new default, because it will create fewer index files than the other two.


lhoguin commented May 5, 2021

I have tried 1 queue with 1 publisher and 1 consumer, with no delay on the consuming side, and didn't see anything useful. Messages rarely make it into segment files in that scenario.

I have also tried a similar scenario but with a consumer delay of 1ms between messages. In that scenario we quickly get hundreds of thousands of messages, so I have made the messages expire after 60s. The message payload is not embedded in the index. I ran 512, 1024, 2048 and 16384. This is the result:

[screenshot: 2021-05-05 16:59]

It's hard to notice a difference in some of the graphs but there is a tiny one:

SEC            | 512      | 1024     | 2048     | 16384
Messages ready | 715-720K | 735-740K | 745-748K | 750-753K
Published /s   | 12.0K    | 12.3K    | 12.4K    | 12.5K

Memory and disk before publishers are blocked show a clearer difference. Memory is not important in this scenario though, because there's only one queue. The difference in disk usage is probably unrelated, based on repeated testing.

There is a definite advantage to 16384 in that scenario. But it is a small advantage over 2048 (less than 1%) and it could be related to other workloads on the nodes or some other factor.

Conclusion

Based on these observations I am recommending 2048 as the new default. 2048 has a much better memory footprint and almost identical performance to the current value of 16384 (less than 1% difference).

I will make the changes tomorrow and mark the PR as ready for merging.

@lhoguin lhoguin force-pushed the new-segment-entry-count-default branch 2 times, most recently from 7ece9c4 to c6a0df7 on May 7, 2021 09:33
@lhoguin lhoguin changed the title WIP: Set segment_entry_count per vhost and use a better default Set segment_entry_count per vhost and use a better default May 7, 2021
@lhoguin lhoguin marked this pull request as ready for review May 7, 2021 09:35
@lhoguin lhoguin requested a review from michaelklishin May 7, 2021 09:36
@lhoguin lhoguin force-pushed the new-segment-entry-count-default branch from c6a0df7 to 8d342b3 on May 11, 2021 08:14
The new default of 2048 was chosen based on various scenarios.
It provides much better memory usage when many queues are used
(allowing one host to go from 500 queues to 800+ queues) and
there seems to be no or negligible performance cost (< 1%)
for single queues.
@lhoguin lhoguin force-pushed the new-segment-entry-count-default branch from 8d342b3 to d9344b2 on May 11, 2021 08:45

lhoguin commented May 11, 2021

I have fixed the couple of tests that failed. Please review and provide guidance on the next steps (if any).

I have marked the PR as a breaking change because, while going from before the PR to after the PR is a seamless change, rolling back to a version before the PR would not work.

@michaelklishin michaelklishin merged commit 12253d2 into master May 26, 2021
@michaelklishin michaelklishin deleted the new-segment-entry-count-default branch May 26, 2021 22:56
michaelklishin added a commit that referenced this pull request May 27, 2021
Set segment_entry_count per vhost and use a better default

(cherry picked from commit 12253d2)

Conflicts:
	deps/rabbit/BUILD.bazel
	deps/rabbit/Makefile
	deps/rabbit/src/rabbit_vhost.erl
@michaelklishin
Collaborator

Backported to v3.8.x.

@michaelklishin michaelklishin added this to the 3.8.17 milestone May 27, 2021
michaelklishin added a commit that referenced this pull request May 27, 2021
This happens during queue migration from a pre-3.7 version.

References #2954.