Skip to content

Fix handling shovels with old supervisor id format #10004

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Dec 3, 2023

Conversation

gomoripeti
Copy link
Contributor

Proposed Changes

This is a forward port of #9965 and #9968 from v3.12.x to main (future 3.13.x)
(Sorry for the unusual method of not first submitting to main. I felt it was more urgent to fix the already released 3.12.x series and I thought that it will be easier to add khepri support than to remove it - this might not be the case)

Not fully tested yet, hence submitting a draft PR first

Types of Changes

What types of changes does your code introduce to this project?
Put an x in the boxes that apply

  • Bug fix (non-breaking change which fixes issue #NNNN)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause an observable behavior change in existing systems)
  • Documentation improvements (corrections, new content, etc)
  • Cosmetic change (whitespace, formatting, etc)
  • Build system and/or CI

Checklist

Put an x in the boxes that apply.
You can also fill these out after creating the PR.
If you're unsure about any of them, don't hesitate to ask on the mailing list.
We're here to help!
This is simply a reminder of what we are going to look for before merging your code.

  • I have read the CONTRIBUTING.md document
  • I have signed the CA (see https://cla.pivotal.io/sign/rabbitmq)
  • I have added tests that prove my fix is effective or that my feature works
  • All tests pass locally with my changes
  • If relevant, I have added necessary documentation to https://github.com/rabbitmq/rabbitmq-website
  • If relevant, I have added this change to the first version(s) in release-notes that I expect to introduce it

Further Comments

If this is a relatively large or complex change, kick off the discussion by explaining why you chose the solution you did and what alternatives you considered, etc.

@michaelklishin
Copy link
Collaborator

@gomoripeti any reason why this is currently a draft PR?

@gomoripeti
Copy link
Contributor Author

This is a draft because I haven't tested with khepri. Eg what happens after upgrade 3.12 -> 3.13 -> enable khepri (Does mirrored_supervisor:terminate_child(?SUPERVISOR, old_id(Name))works with khepri or needs a feature_flag check?)

@gomoripeti
Copy link
Contributor Author

I upgraded from 3.12.x with a shovel to 3.13.0-rc2. Then applied this patch (and cleaned up duplicate shovel). At this point there is a shovel worker with an old child id format.

When trying to enable khepri_db feature flag, I get the below exception because rabbit_db_msup_m2k_converter does not expect and old style id:

2023-12-01 17:30:55.802739+00:00 [error] <0.12931.0> Exception during Mnesia->Khepri data copy: {khepri_mnesia_migration_ex,converter_mod_exception,#{reason => function_clause,stacktrace => [{lists,foldl_1,[#Fun<khepri_path.2.100389385>,true,<<"/">>],[{file,"lists.erl"},{line,1598}]},{khepri_path,ensure_is_valid,1,[{file,"src/khepri_path.erl"},{line,607}]},{khepri_machine,put,4,[{file,"src/khepri_machine.erl"},{line,212}]},{khepri_adv,put,4,[{file,"src/khepri_adv.erl"},{line,396}]},{khepri,put,4,[{file,"src/khepri.erl"},{line,2084}]},{rabbit_db_msup_m2k_converter,copy_to_khepri,3,[{file,"rabbit_db_msup_m2k_converter.erl"},{line,60}]},{rabbit_db_m2k_converter,copy_to_khepri,3,[{file,"rabbit_db_m2k_converter.erl"},{line,88}]},{m2k_table_copy,handle_migration_records,1,[{file,"src/m2k_table_copy.erl"},{line,481}]}],tables => [rabbit_vhost,rabbit_user,rabbit_user_permission,rabbit_topic_permission,rabbit_runtime_parameters,rabbit_queue,rabbit_exchange,rabbit_exchange_serial,rabbit_route,rabbit_node_maintenance_states,mirrored_sup_childspec,rabbit_durable_queue,rabbit_durable_exchange,rabbit_durable_route,rabbit_semi_durable_route,rabbit_reverse_rou
te,rabbit_index_route,rabbit_exchange_type_consistent_hash_ring_state],class => error,converter_mod => {rabbit_db_m2k_converter,[{rabbit_vhost,rabbit_db_vhost_m2k_converter},{rabbit_user,rabbit_db_user_m2k_converter},{rabbit_user_permission,rabbit_db_user_m2k_converter},{rabbit_topic_permission,rabbit_db_user_m2k_converter},{rabbit_runtime_parameters,rabbit_db_rtparams_m2k_converter},{rabbit_queue,rabbit_db_queue_m2k_converter},{rabbit_exchange,rabbit_db_exchange_m2k_converter},{rabbit_exchange_serial,rabbit_db_exchange_m2k_converter},{rabbit_route,rabbit_db_binding_m2k_converter},{rabbit_node_maintenance_states,rabbit_db_maintenance_m2k_converter},{mirrored_sup_childspec,rabbit_db_msup_m2k_converter},rabbit_durable_queue,rabbit_durable_exchange,rabbit_durable_route,rabbit_semi_durable_route,rabbit_reverse_route,rabbit_index_route,{rabbit_exchange_type_consistent_hash_ring_state,rabbit_db_ch_exchange_m2k_converter},{x_jms_topic_table,rabbit_db_jms_exchange_m2k_converter},{rh_exchange_table,rabbit_db_rh_exchange_m2k_converter}]}}}
2023-12-01 17:30:55.802739+00:00 [error] <0.12931.0> [{m2k_table_copy,handle_migration_records,1,
2023-12-01 17:30:55.802739+00:00 [error] <0.12931.0>                  [{file,"src/m2k_table_copy.erl"},{line,518}]},
2023-12-01 17:30:55.802739+00:00 [error] <0.12931.0>  {m2k_table_copy,do_copy_data,1,[{file,"src/m2k_table_copy.erl"},{line,370}]},
2023-12-01 17:30:55.802739+00:00 [error] <0.12931.0>  {m2k_table_copy,handle_call,3,[{file,"src/m2k_table_copy.erl"},{line,267}]},
2023-12-01 17:30:55.802739+00:00 [error] <0.12931.0>  {gen_server,try_handle_call,4,[{file,"gen_server.erl"},{line,1113}]},
2023-12-01 17:30:55.802739+00:00 [error] <0.12931.0>  {gen_server,handle_msg,6,[{file,"gen_server.erl"},{line,1142}]},
2023-12-01 17:30:55.802739+00:00 [error] <0.12931.0>  {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,241}]}]

Additionally I did not test federation but I think it has the same set of issues with old style child ids starting from 3.12.8.

Wouldn't it be best to have a feature flag that converts from old style id to new style id? (both for federation and shovels) That would prevent the coexistence of both formats.

@michaelklishin
Copy link
Collaborator

@gomoripeti that would means that you must enable that flag before you enable Khepri. Extending m2k_table_copy to do the conversion sounds like a much better idea to me.

@michaelklishin
Copy link
Collaborator

I will look into it since this is one of our two biggest outstanding issues before the next 3.13 RC.

@michaelklishin
Copy link
Collaborator

Alternatively we can make the plugin migrate this data. It might be easier. I will take another look.

@michaelklishin
Copy link
Collaborator

@dcorbacho suggests that modifying rabbit_db_sup_m2k_converter may be as good an option.

@gomoripeti
Copy link
Contributor Author

My problem with solving this is that the mnesia->khepri conversion happens on the mirrored_supervisor level which does not know about the specific id format of each implementation (that is the shovel or federation). I got a slight hope when I saw the module name (eg rabbit_shovel_dyn_worker_sup_sup) in the mirrored_sup_childspec table. So I thought each module could implement an optional callback function for id conversion. But that atom is actually a group name not a callback module name (and in theory one module can use multiple groups), without leaking the abstraction the mirrored supervisor does not know which module understands the id format. So I would like to leave it to the Core Team to implement this conversion in the best place.

@gomoripeti
Copy link
Contributor Author

I pushed another commit to fix the case when Khepri is enabled. With the assumption that Mnesia->Khepri migration will make sure that no child id remains in the old format, I think this PR is ready.

Before 3.13.0 and 3.12.8 supervisor id format was different. Don't
start a shovel with the new id format if one exists with the old id
format.
Because the clause `({{_, N}, _, _, _}) -> N =:= Name;` matches both
new and old id format but always returns false for old id format,
`child_exists` also returned false for old ids.

Also rename SupId to ChildId to be pedantically correct.
@gomoripeti gomoripeti marked this pull request as ready for review December 3, 2023 00:12
It is assumed that old child id format cannot exist in this case.
@michaelklishin
Copy link
Collaborator

@gomoripeti thank you. This is a fair set of changes to merge, and we can look into Shovel process child ID conversion at a later point.

@michaelklishin michaelklishin added this to the 3.13.0 milestone Dec 3, 2023
@michaelklishin michaelklishin merged commit 5ede862 into rabbitmq:main Dec 3, 2023
@dcorbacho
Copy link
Contributor

@gomoripeti Could you describe the steps to reproduce the migration failure with the current main? I cannot reproduce it, as the mirrored supervisor record is updated by the shovel plugin when it starts. Thus, when the khepri_db feature flag is enabled it already finds the new id style. Not sure what I have missed.

@gomoripeti
Copy link
Contributor Author

hi @dcorbacho it needs to be a multi-node cluster on a version with old id format (eg 3.12.6). Create a shovel on the old version (will store old id format to Mnesia) Upgrade to 3.13/main in a rolling fashion -> the old id preserved in Mnesia. ("mirrored supervisor record is updated by the shovel plugin when it starts." I think this does not happen with multi-node rolling restarts) Then trying migration to khepri will hit the unexpected old id.

@dcorbacho
Copy link
Contributor

@gomoripeti thanks! That makes sense, I've tried a single node upgrade. I'll test this now

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants