[Questions] 4.1 Enabling feature flag khepri_db while log exchange is enabled results in OOM #14069
gomoripeti asked this question in Questions (unanswered)
Replies: 1 comment
@gomoripeti the easiest option is to remove that node from the cluster and re-create it.
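For anyone else recovering from the same state, here is a rough sketch of what that removal and re-creation could look like on a Debian install. Node names, hostnames and the data directory path are the usual defaults and are placeholders here; wiping the directory destroys that node's local data, so this is only a sketch, not a verified procedure.

```sh
# Stop the broken node (Debian package service name)
sudo systemctl stop rabbitmq-server

# On one of the healthy nodes: drop the stopped member from the cluster
rabbitmqctl forget_cluster_node rabbit@broken-host

# Back on the broken node: wipe its data directory, start fresh and re-join
sudo rm -rf /var/lib/rabbitmq/mnesia/
sudo systemctl start rabbitmq-server
rabbitmqctl stop_app
rabbitmqctl join_cluster rabbit@healthy-host
rabbitmqctl start_app
```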
Community Support Policy
RabbitMQ version used
4.1.0
Erlang version used
27.3.x
Operating system (distribution) used
Ubuntu
How is RabbitMQ deployed?
Debian package
Logs from node 1 (with sensitive values edited out)
See https://www.rabbitmq.com/docs/logging to learn how to collect logs
Logs from node 2 (if applicable, with sensitive values edited out)
See https://www.rabbitmq.com/docs/logging to learn how to collect logs
Logs from node 3 (if applicable, with sensitive values edited out)
See https://www.rabbitmq.com/docs/logging to learn how to collect logs
Steps to deploy RabbitMQ cluster
Install deb package on 3 servers.
Steps to reproduce the behavior in question
Locally, from a git repo:

```sh
make start-cluster
sbin/rabbitmq-diagnostics remote_shell --node rabbit-1
sbin/rabbitmqctl stop_app --node rabbit-1 && sbin/rabbitmqctl start_app --node rabbit-1
sbin/rabbitmqctl enable_feature_flag --experimental khepri_db --node rabbit-1
```
The last command never returns and the memory use of node rabbit-1 keeps growing.
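While that command hangs, the growth is easy to watch from a second terminal; for example (both are standard diagnostics commands, nothing specific to this repro):

```sh
# Per-category memory breakdown of the affected node
sbin/rabbitmq-diagnostics memory_breakdown --node rabbit-1

# Live top-like process view (observer_cli), if preferred
sbin/rabbitmq-diagnostics observer --node rabbit-1
```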
advanced.config
What problem are you trying to solve?
On a 3-node cluster on RabbitMQ 4.1.0, the log exchange was enabled at info level. When enabling the khepri_db feature flag, the node where the operation was initiated ran out of memory (~2 GB RAM available). The issue is easily reproducible locally with a fresh, empty cluster if the log exchange level is set to debug (so there is more logging), but it sometimes also happens at higher levels (I suspect a timing race there; it might depend on the number of objects in Mnesia).
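For reference, the relevant configuration is just exchange logging turned on at debug level; in rabbitmq.conf terms it should be something like the following (the path assumes the Debian package, which is my assumption here; for the in-repo cluster point RABBITMQ_CONFIG_FILE at a test config instead):

```sh
# Append to rabbitmq.conf and restart the node so it takes effect
cat >> /etc/rabbitmq/rabbitmq.conf <<'EOF'
log.exchange = true
log.exchange.level = debug
EOF
```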
The OOM happens because of a cyclic dependency: khepri_cluster:is_store_running logs something, and the exchange logger in turn checks Khepri while looking up the log exchange. I managed to capture the partial stacktraces below before the OOM:
Another, deeper stacktrace, unfortunately only captured in observer:
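For anyone trying to catch them, the suspect processes can also be polled from outside the node, roughly like this (the pid is a placeholder, take it from the first command's output; recon ships with RabbitMQ as far as I know):

```sh
# Busiest processes by message queue length
sbin/rabbitmqctl eval 'recon:proc_count(message_queue_len, 5).' --node rabbit-1

# Current stacktrace of a suspect process; replace the pid with one from the list above
sbin/rabbitmqctl eval \
  'erlang:process_info(erlang:list_to_pid("<0.1234.0>"), [current_stacktrace, message_queue_len, memory]).' \
  --node rabbit-1
```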
This is similar to an issue that happened on 3.13 with the message containers feature flag (exchange logging also depended on the mc FF; sorry, I cannot find the discussion link). If it is hard to solve this properly, a workaround is to disable exchange logging while enabling the khepri_db
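Concretely, something along these lines (assuming exchange logging can only be toggled via config plus a node restart, which is my understanding):

```sh
# 1. Turn exchange logging off in rabbitmq.conf:
#      log.exchange = false
# 2. Restart the node fully so the config is re-read (stop_app/start_app is not enough):
sudo systemctl restart rabbitmq-server
# 3. Enable the flag with the exchange logger out of the way:
rabbitmqctl enable_feature_flag --experimental khepri_db
# 4. Once the migration has finished, set log.exchange back to true and restart once more.
```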
feature flag. Another issue is that when the first node restarted after the OOM, the cluster couldn't recover. First we saw the crash below:
And after it hit reached_max_restart_intensity, the boot failed like this:
The first node remained in a cyclic reboot, while the other nodes only logged:
The khepri_db feature flag got stuck in state_changing.
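This is visible from any of the nodes; something like:

```sh
# khepri_db keeps reporting "state_changing" here
rabbitmqctl list_feature_flags name state
```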
My question is whether there is any advice on how to recover the cluster from this state. (I imagine enabling Khepri can be interrupted for various reasons: a hardware failure, or, if there is too much metadata in Mnesia on a server with little memory, the migration itself can cause an OOM.) I'm not sure what state Khepri could be in (a badmatch in ra_flru:insert), but is it possible to use the rollback functions (which are used when there is an error during migration) and somehow switch the khepri_db feature flag state from state_changing back to false?