[Questions] 4.1 Enabling feature flag khepri_db while log exchange is enabled results in OOM #14069

gomoripeti · 2025-06-12T12:15:51Z

gomoripeti
Jun 12, 2025

Community Support Policy

I have read RabbitMQ's Community Support Policy
I run RabbitMQ 4.x, the only series currently covered by community support
I promise to provide all relevant information (versions, logs from all nodes, rabbitmq-diagnostics output, detailed reproduction steps)

RabbitMQ version used

4.1.0

Erlang version used

27.3.x

Operating system (distribution) used

ubuntu

How is RabbitMQ deployed?

Debian package

Logs from node 1 (with sensitive values edited out)

See https://www.rabbitmq.com/docs/logging to learn how to collect logs

# PASTE LOG HERE, BETWEEN BACKTICKS

Logs from node 2 (if applicable, with sensitive values edited out)

See https://www.rabbitmq.com/docs/logging to learn how to collect logs

# PASTE LOG HERE, BETWEEN BACKTICKS

Logs from node 3 (if applicable, with sensitive values edited out)

See https://www.rabbitmq.com/docs/logging to learn how to collect logs

# PASTE LOG HERE, BETWEEN BACKTICKS

Steps to deploy RabbitMQ cluster

Install deb package on 3 servers.

Steps to reproduce the behavior in question

locally from a git repo

make start-cluster
sbin/rabbitmq-diagnostics remote_shell --node rabbit-1

> application:set_env(rabbit, log, [{exchange, [{enabled, true}, {level, debug}, {formatter, {rabbit_logger_text_fmt, #{single_line => true}}}]},{file, [{level, info}]}]).

sbin/rabbitmqctl stop_app --node rabbit-1 && sbin/rabbitmctl start_app --node rabbit-1
sbin/rabbitmqctl enable_feature_flag --experimental khepri_db --node rabbit-1
this last command never returns and memory of node rabbit-1 goes up

advanced.config

{rabbit, [
    {log, [
      {exchange, [{enabled, true}, {level, debug}]},
      {file, [{level, debug}]}
    ]},
...

What problem are you trying to solve?

On a 3-node cluster on RabbitMQ 4.1.0 the log exchange was enabled on info level. When enabling feature flag khepri_db the node where the operation was initiated ran out of memory (~2GB RAM available).

The issue is easily reproducible locally with a fresh, empty cluster, if log exchange level is set to debug (so there is more logging) but it sometimes also happens on higher levels (I suspect some timing race there, might depend the number of objects in Mnesia)

The OOM happens because of a cyclic dependency between khepri_cluster,is_store_running logging something, and the exchange logger checking khepri while looking up the log exchange.

I managed to capture the below partial stacktraces before the OOM:

[{logger_h_common,string_to_binary,1,
     [{file,"logger_h_common.erl"},{line,427}]},
 {logger_h_common,do_log_to_binary,2,
     [{file,"logger_h_common.erl"},{line,397}]},
 {logger_h_common,log,2,
     [{file,"logger_h_common.erl"},{line,178}]},
 {logger_backend,call_handlers,3,
     [{file,"logger_backend.erl"},{line,52}]},
 {khepri_cluster,is_store_running,1,
     [{file,"src/khepri_cluster.erl"},{line,1600}]},
 {m2k_table_copy,is_migration_finished,2,
     [{file,"src/m2k_table_copy.erl"},{line,78}]},
 {mnesia_to_khepri,handle_fallback,5,
     [{file,"src/mnesia_to_khepri.erl"},{line,524}]},
 {rabbit_queue_type,publish_at_most_once,2,
     [{file,"rabbit_queue_type.erl"},{line,618}]}]

another deeper stacktrace, unfortunately only captured in observer

This is a similar issue to which happened on 3.13 with the message container feature flag (exchange logging also depended on mc FF, sorry I cannot find the discussion link). If it is hard to solve this properly, a workaround is to disable exchange logging while enabling the khepri_db feature flag.

Another issue is that when the first node restarted after an OOM, the cluster couldn't recover. First we've seen the below crash

2025-06-10 20:05:00.571052+00:00 [error] <0.435.0>   crasher:
2025-06-10 20:05:00.571052+00:00 [error] <0.435.0>     initial call: ra_server_proc:init/1
2025-06-10 20:05:00.571052+00:00 [error] <0.435.0>     pid: <0.435.0>
2025-06-10 20:05:00.571052+00:00 [error] <0.435.0>     registered_name: rabbitmq_metadata
2025-06-10 20:05:00.571052+00:00 [error] <0.435.0>     exception exit: {'EXIT',
2025-06-10 20:05:00.571052+00:00 [error] <0.435.0>                         {{badmatch,[]},
2025-06-10 20:05:00.571052+00:00 [error] <0.435.0>                          [{ra_flru,insert,3,
2025-06-10 20:05:00.571052+00:00 [error] <0.435.0>                               [{file,"src/ra_flru.erl"},{line,63}]},
2025-06-10 20:05:00.571052+00:00 [error] <0.435.0>                           {ra_log_reader,get_segment,3,
2025-06-10 20:05:00.571052+00:00 [error] <0.435.0>                               [{file,"src/ra_log_reader.erl"},{line,394}]},
2025-06-10 20:05:00.571052+00:00 [error] <0.435.0>                           {ra_log_reader,'-segment_fold/5-fun-1-',4,
2025-06-10 20:05:00.571052+00:00 [error] <0.435.0>                               [{file,"src/ra_log_reader.erl"},{line,343}]},
2025-06-10 20:05:00.571052+00:00 [error] <0.435.0>                           {ra_log_reader,'-segment_fold/5-lists^foldl/2-1-',
2025-06-10 20:05:00.571052+00:00 [error] <0.435.0>                               3,
2025-06-10 20:05:00.571052+00:00 [error] <0.435.0>                               [{file,"src/ra_log_reader.erl"},{line,341}]},
2025-06-10 20:05:00.571052+00:00 [error] <0.435.0>                           {ra_log_reader,segment_fold,5,
2025-06-10 20:05:00.571052+00:00 [error] <0.435.0>                               [{file,"src/ra_log_reader.erl"},{line,341}]},
2025-06-10 20:05:00.571052+00:00 [error] <0.435.0>                           {ra_log,fold,5,[{file,"src/ra_log.erl"},{line,465}]},
2025-06-10 20:05:00.571052+00:00 [error] <0.435.0>                           {ra_server,apply_to,5,
2025-06-10 20:05:00.571052+00:00 [error] <0.435.0>                               [{file,"src/ra_server.erl"},{line,2789}]},
2025-06-10 20:05:00.571052+00:00 [error] <0.435.0>                           {ra_server,handle_leader,2,
2025-06-10 20:05:00.571052+00:00 [error] <0.435.0>                               [{file,"src/ra_server.erl"},{line,676}]}]}}
2025-06-10 20:05:00.571052+00:00 [error] <0.435.0>       in function  ra_server_proc:handle_leader/2 (src/ra_server_proc.erl, line 1201)
2025-06-10 20:05:00.571052+00:00 [error] <0.435.0>       in call from ra_server_proc:leader/3 (src/ra_server_proc.erl, line 584)
2025-06-10 20:05:00.571052+00:00 [error] <0.435.0>       in call from gen_statem:loop_state_callback/11 (gen_statem.erl, line 3735)

And after it reached_max_restart_intensity boot failed like this:

2025-06-10 20:05:00.613797+00:00 [error] <0.348.0> ** State machine rabbit_ff_controller terminating
2025-06-10 20:05:00.613797+00:00 [error] <0.348.0> ** Last event = {internal,refresh_after_app_load}
2025-06-10 20:05:00.613797+00:00 [error] <0.348.0> ** When server state  = {updating_feature_flag_states,
2025-06-10 20:05:00.613797+00:00 [error] <0.348.0>                             {rabbit_ff_controller,
2025-06-10 20:05:00.613797+00:00 [error] <0.348.0>                                 {<0.226.0>,
2025-06-10 20:05:00.613797+00:00 [error] <0.348.0>                                  #Ref<0.1388194414.2754609153.125627>},
2025-06-10 20:05:00.613797+00:00 [error] <0.348.0>                                 #{}}}
2025-06-10 20:05:00.613797+00:00 [error] <0.348.0> ** Reason for termination = error:function_clause
2025-06-10 20:05:00.613797+00:00 [error] <0.348.0> ** Callback modules = [rabbit_ff_controller]
2025-06-10 20:05:00.613797+00:00 [error] <0.348.0> ** Callback mode = state_functions
2025-06-10 20:05:00.613797+00:00 [error] <0.348.0> ** Stacktrace =
2025-06-10 20:05:00.613797+00:00 [error] <0.348.0> **  [{rabbit_ff_controller,'-list_feature_flags_enabled_somewhere/2-inlined-0-',
2025-06-10 20:05:00.613797+00:00 [error] <0.348.0>                            [khepri_db,state_changing,
2025-06-10 20:05:00.613797+00:00 [error] <0.348.0>                             #{rabbit_exchange_type_local_random => true,
2025-06-10 20:05:00.613797+00:00 [error] <0.348.0>                               classic_mirrored_queue_version => true,
2025-06-10 20:05:00.613797+00:00 [error] <0.348.0>                               classic_queue_type_delivery_support => true,
2025-06-10 20:05:00.613797+00:00 [error] <0.348.0>                               detailed_queues_endpoint => true,
2025-06-10 20:05:00.613797+00:00 [error] <0.348.0>                               direct_exchange_routing_v2 => true,
2025-06-10 20:05:00.613797+00:00 [error] <0.348.0>                               drop_unroutable_metric => true,
2025-06-10 20:05:00.613797+00:00 [error] <0.348.0>                               empty_basic_get_metric => true,
2025-06-10 20:05:00.613797+00:00 [error] <0.348.0>                               feature_flags_v2 => true,
2025-06-10 20:05:00.613797+00:00 [error] <0.348.0>                               implicit_default_bindings => true,
2025-06-10 20:05:00.613797+00:00 [error] <0.348.0>                               listener_records_in_ets => true,
2025-06-10 20:05:00.613797+00:00 [error] <0.348.0>                               maintenance_mode_status => true,
2025-06-10 20:05:00.613797+00:00 [error] <0.348.0>                               message_containers => true,
2025-06-10 20:05:00.613797+00:00 [error] <0.348.0>                               message_containers_deaths_v2 => true,
2025-06-10 20:05:00.613797+00:00 [error] <0.348.0>                               quorum_queue => true,
2025-06-10 20:05:00.613797+00:00 [error] <0.348.0>                               quorum_queue_non_voters => true,
2025-06-10 20:05:00.613797+00:00 [error] <0.348.0>                               'rabbitmq_4.0.0' => true,
2025-06-10 20:05:00.613797+00:00 [error] <0.348.0>                               'rabbitmq_4.1.0' => true,
2025-06-10 20:05:00.613797+00:00 [error] <0.348.0>                               restart_streams => true,
2025-06-10 20:05:00.613797+00:00 [error] <0.348.0>                               stream_filtering => true,stream_queue => true,
2025-06-10 20:05:00.613797+00:00 [error] <0.348.0>                               stream_sac_coordinator_unblock_group => true,
2025-06-10 20:05:00.613797+00:00 [error] <0.348.0>                               stream_single_active_consumer => true,
2025-06-10 20:05:00.613797+00:00 [error] <0.348.0>                               stream_update_config_command => true,
2025-06-10 20:05:00.613797+00:00 [error] <0.348.0>                               tracking_records_in_ets => true,
2025-06-10 20:05:00.613797+00:00 [error] <0.348.0>                               user_limits => true,
2025-06-10 20:05:00.613797+00:00 [error] <0.348.0>                               virtual_host_metadata => true,
2025-06-10 20:05:00.613797+00:00 [error] <0.348.0>                               classic_queue_mirroring => true,
2025-06-10 20:05:00.613797+00:00 [error] <0.348.0>                               ram_node_type => true}],
2025-06-10 20:05:00.613797+00:00 [error] <0.348.0>                            [{file,"rabbit_ff_controller.erl"},{line,1485}]},
2025-06-10 20:05:00.613797+00:00 [error] <0.348.0>      {maps,fold_1,4,[{file,"maps.erl"},{line,860}]},
2025-06-10 20:05:00.613797+00:00 [error] <0.348.0>      {rabbit_ff_controller,list_feature_flags_enabled_somewhere,2,
2025-06-10 20:05:00.613797+00:00 [error] <0.348.0>                            [{file,"rabbit_ff_controller.erl"},{line,1482}]},
2025-06-10 20:05:00.613797+00:00 [error] <0.348.0>      {rabbit_ff_controller,sync_cluster_task,1,
2025-06-10 20:05:00.613797+00:00 [error] <0.348.0>                            [{file,"rabbit_ff_controller.erl"},{line,910}]},
2025-06-10 20:05:00.613797+00:00 [error] <0.348.0>      {rabbit_ff_controller,updating_feature_flag_states,3,
2025-06-10 20:05:00.613797+00:00 [error] <0.348.0>                            [{file,"rabbit_ff_controller.erl"},{line,307}]},
2025-06-10 20:05:00.613797+00:00 [error] <0.348.0>      {gen_statem,loop_state_callback,11,[{file,"gen_statem.erl"},{line,3735}]},
2025-06-10 20:05:00.613797+00:00 [error] <0.348.0>      {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,329}]}]
...
2025-06-10 20:05:00.624076+00:00 [error] <0.226.0> BOOT FAILED
2025-06-10 20:05:00.624076+00:00 [error] <0.226.0> ===========
2025-06-10 20:05:00.624076+00:00 [error] <0.226.0> Exception during startup:
2025-06-10 20:05:00.624076+00:00 [error] <0.226.0>
2025-06-10 20:05:00.624076+00:00 [error] <0.226.0> exit:{{function_clause,[{rabbit_ff_controller,'-list_feature_flags_enabled_somewhere/2-inlined-0-',[khepri_db,state_changing

The first node remained in a cyclic reboot, while the other nodes only logged:

2025-06-10 21:55:21.798967+00:00 [debug] <0.506.0> Feature flags: acquiring lock {feature_flags_state_change,<0.506.0>}
2025-06-10 21:55:21.800695+00:00 [debug] <0.506.0> Feature flags: acquired lock {feature_flags_state_change,<0.506.0>}
2025-06-10 21:55:21.801373+00:00 [debug] <0.506.0> Feature flags: releasing lock {feature_flags_state_change,<0.506.0>}

The khepri_db feature flag got stuck in state_changing.

My question is if there is any advice how to recover the cluster from this state. (I imagine enabling khepri can be interrupted for various reasons, hardware failure or if there is too much metadata in mnesia on a server with small memory, migration can cause an OOM). Im not sure what state khepri could be (badmatch ra_flru,insert) but is it posible to use rollback functions (which are used when there is an error during migration) and somehow switch the khepri_db feature flag state from state_changing to false?

michaelklishin · 2025-06-12T14:00:56Z

michaelklishin
Jun 12, 2025
Maintainer

@gomoripeti the easiest option is to remove that node from the cluster and re-create it.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Questions] 4.1 Enabling feature flag khepri_db while log exchange is enabled results in OOM #14069

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Replies: 1 comment

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

[Questions] 4.1 Enabling feature flag khepri_db while log exchange is enabled results in OOM #14069

Uh oh!

Uh oh!

gomoripeti Jun 12, 2025

Community Support Policy

RabbitMQ version used

Erlang version used

Operating system (distribution) used

How is RabbitMQ deployed?

Logs from node 1 (with sensitive values edited out)

Logs from node 2 (if applicable, with sensitive values edited out)

Logs from node 3 (if applicable, with sensitive values edited out)

Steps to deploy RabbitMQ cluster

Steps to reproduce the behavior in question

advanced.config

What problem are you trying to solve?

Replies: 1 comment

Uh oh!

michaelklishin Jun 12, 2025 Maintainer

gomoripeti
Jun 12, 2025

michaelklishin
Jun 12, 2025
Maintainer