Skip to content

Revert "rabbit_feature_flags: Retry after erpc:call() fails with noconnection" #11507

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

dumbbell
Copy link
Collaborator

@dumbbell dumbbell commented Jun 20, 2024

This reverts commit 8749c60.

Why

The patch was supposed to solve an issue that we didn't understand and that was likely a network/DNS problem outside of RabbitMQ. We know it didn't solve that issue because it was reported again 6 months after the initial pull request (#8411).

What we are sure however is that it increased the testing of RabbitMQ significantly because the code loops for 10+ minutes if the remote node is not running.

The retry in the Feature flags subsystem was not the right place either. The noconnection error is visible there because it runs earlier during RabbitMQ startup. But retrying there won't solve a network issue magically.

There are two ways to create a cluster:

  1. peer discovery and this subsystem takes care of retries if necessary and appropriate
  2. manually using the CLI, in which case the user is responsible for starting RabbitMQ nodes and clustering them

Let's revert it until the root cause is really understood.

@dumbbell dumbbell requested a review from kjnilsson June 20, 2024 09:44
@dumbbell dumbbell self-assigned this Jun 20, 2024
@kjnilsson
Copy link
Contributor

Additional observations:

  1. a clearly wrong command like rabbitmqctl join_node this-is-not-a-node@argh takes over a minute to return. This is bad UX.
  2. It makes an already slow test suite: clustering_managment_SUITE take longer than it needs to. There are two tests that test this functionality and each take over a minute to complete.

@dumbbell dumbbell force-pushed the revert-retry-after-noconnection-in-feature-flags-subsystem branch from ec41f06 to 61346ee Compare June 20, 2024 10:09
…onnection`"

This reverts commit 8749c60.

[Why]
The patch was supposed to solve an issue that we didn't understand and
that was likely a network/DNS problem outside of RabbitMQ. We know it
didn't solve that issue because it was reported again 6 months after the
initial pull request (#8411).

What we are sure however is that it increased the testing of RabbitMQ
significantly because the code loops for 10+ minutes if the remote node
is not running.

The retry in the Feature flags subsystem was not the right place either.
The `noconnection` error is visible there because it runs earlier during
RabbitMQ startup. But retrying there won't solve a network issue
magically.

There are two ways to create a cluster:
1. peer discovery and this subsystem takes care of retries if necessary
   and appropriate
2. manually using the CLI, in which case the user is responsible for
   starting RabbitMQ nodes and clustering them

Let's revert it until the root cause is really understood.
@dumbbell dumbbell force-pushed the revert-retry-after-noconnection-in-feature-flags-subsystem branch from 61346ee to d0c13b4 Compare June 20, 2024 12:30
@dumbbell dumbbell marked this pull request as ready for review June 20, 2024 14:29
@dumbbell dumbbell merged commit 8f1219a into main Jun 20, 2024
251 checks passed
@dumbbell dumbbell deleted the revert-retry-after-noconnection-in-feature-flags-subsystem branch June 20, 2024 14:31
dumbbell added a commit that referenced this pull request Jul 10, 2024
Revert "rabbit_feature_flags: Retry after erpc:call() fails with `noconnection`" (backport #11507)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants