Enable leader election in ws-manager-mk2 #18419
Conversation
```diff
@@ -176,7 +176,7 @@ func deployment(ctx *common.RenderContext) ([]runtime.Object, error) {
 		},
 		Spec: appsv1.DeploymentSpec{
 			Selector: &metav1.LabelSelector{MatchLabels: labels},
-			Replicas: common.Replicas(ctx, Component),
+			Replicas: pointer.Int32(2),
```
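For context, this is roughly how leader election is switched on in a controller-runtime based manager, which is the mechanism this PR relies on for ws-manager-mk2. This is only a minimal sketch: the lock name, namespace, and other option values are illustrative, not taken from this PR.

```go
package main

import (
	"os"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/log/zap"
)

func main() {
	ctrl.SetLogger(zap.New())
	log := ctrl.Log.WithName("setup")

	// With LeaderElection enabled, only one of the two replicas holds the lease
	// and runs the reconcilers; the other replica waits as a hot standby.
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		LeaderElection:          true,
		LeaderElectionID:        "ws-manager-mk2.gitpod.io", // illustrative lock name
		LeaderElectionNamespace: "default",                  // illustrative namespace
	})
	if err != nil {
		log.Error(err, "unable to create manager")
		os.Exit(1)
	}

	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		log.Error(err, "manager exited with error")
		os.Exit(1)
	}
}
```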
This could cause issues with activity timeouts. The load balancer will not know which instance is the leader, so workspace activity could be reported to the standby and workspaces could time out. I think in practice it will not be a problem because we send heartbeats often enough that the leader will still get notified before that happens. Something to be aware of, though.
That's a good point. I think we should find a solution for that before going with 2 replicas; otherwise it will become hard to investigate any timeout issues.
> This could cause issues with activity timeouts. The load balancer will not know which instance is the leader, so workspace activity could be reported to the standby and workspaces could time out. I think in practice it will not be a problem because we send heartbeats often enough that the leader will still get notified before that happens. Something to be aware of, though.

I think there's a misconception here: only the leader processes requests to the CRD types. The load balancer only sees one instance, not two.
> That's a good point. I think we should find a solution for that before going with 2 replicas; otherwise it will become hard to investigate any timeout issues.

Can you expand on what you mean by timeout issues?
> The load balancer only sees one instance, not two

Why does it see only one? If we have two replicas, would it not send requests to both? Does the standby not become ready?
> Can you expand on what you mean by timeout issues?

The workspaces send activity requests to ws-manager in order to mark the workspace as active. We store this information in memory in ws-manager. If those requests went to the standby, the active manager (which checks the timeouts) would not know that the workspace is active, because that information would only be stored by the standby.
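To make the concern concrete, here is a rough sketch of what such in-process activity tracking could look like. The type and field names are hypothetical, not ws-manager-mk2's actual code; the point is only that the state lives in one pod's memory.

```go
package activity

import (
	"sync"
	"time"
)

// WorkspaceActivity remembers the last time each workspace reported activity.
// Because it lives in process memory, a standby replica that received the
// heartbeat would record it locally, invisible to the leader that enforces
// timeouts.
type WorkspaceActivity struct {
	mu   sync.RWMutex
	last map[string]time.Time // keyed by workspace instance ID
}

func NewWorkspaceActivity() *WorkspaceActivity {
	return &WorkspaceActivity{last: map[string]time.Time{}}
}

// Store is called when a workspace sends an activity/heartbeat request.
func (a *WorkspaceActivity) Store(instanceID string, t time.Time) {
	a.mu.Lock()
	defer a.mu.Unlock()
	a.last[instanceID] = t
}

// GetLastActivity is what the timeout check consults.
func (a *WorkspaceActivity) GetLastActivity(instanceID string) (time.Time, bool) {
	a.mu.RLock()
	defer a.mu.RUnlock()
	t, ok := a.last[instanceID]
	return t, ok
}
```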
Correct, the standby is not ready to serve, but it still passes the readiness health checks.
You can see this in the preview environment by killing the current leader while tailing the other pod's logs.
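For illustration, this is roughly how readiness is usually wired in a controller-runtime manager; healthz.Ping ignores leadership entirely, which is why a standby can report ready. The function name is hypothetical and this may not match ws-manager-mk2's exact wiring.

```go
package health

import (
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/healthz"
)

// addChecks wires the usual liveness/readiness endpoints. healthz.Ping always
// returns nil and does not consult leadership, so both the leader and the
// standby answer /readyz with 200 and the Service may route activity
// heartbeats to either pod.
func addChecks(mgr ctrl.Manager) error {
	if err := mgr.AddHealthzCheck("healthz", healthz.Ping); err != nil {
		return err
	}
	return mgr.AddReadyzCheck("readyz", healthz.Ping)
}
```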
> We store this information in memory in ws-manager

This is wrong if, when we switch the leader, the other pod starts killing workspaces without waiting for status reports from the workspaces.
There's a callback we can implement for when a pod is elected the new leader, so we can wait for workspace reports before changing anything.
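One possible shape for that, assuming ws-manager-mk2 uses controller-runtime: the manager exposes an Elected() channel that is closed when this replica becomes leader, so a goroutine could wait on it and delay timeout enforcement until workspaces have had a chance to report. The function name, grace period, and enable hook below are hypothetical.

```go
package leaderwait

import (
	"context"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// waitForLeadershipThenAct blocks on mgr.Elected(), which controller-runtime
// closes once this replica wins the election, then waits a grace period so
// workspaces can report status/activity before timeout enforcement kicks in.
func waitForLeadershipThenAct(ctx context.Context, mgr ctrl.Manager, enableTimeoutEnforcement func()) {
	select {
	case <-ctx.Done():
		return
	case <-mgr.Elected():
		// We just became leader; our in-memory activity state is empty.
	}

	const gracePeriod = 2 * time.Minute // illustrative value
	select {
	case <-ctx.Done():
	case <-time.After(gracePeriod):
		enableTimeoutEnforcement() // hypothetical hook
	}
}
```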
We have decided that the impact of the above scenario is less than that of not merging this PR.
Description
Summary generated by Copilot
🤖 Generated by Copilot at f4df6f5
This pull request enhances the ws-manager-mk2 component, which is responsible for managing workspaces in Gitpod. It fixes an import error, removes an unnecessary flag, and enables leader election for the controller.
Related Issue(s)
Fixes ENG-53
How to test
Preview status
Gitpod was successfully deployed to your preview environment.
Build Options

Build
- Run the build with werft instead of GHA
- Run Leeway with --dont-test

Publish
- Installer
  - Add desired feature flags to the end of the line above, space separated

Preview Environment / Integration Tests
- If enabled this will build install/preview
- If enabled this will create the environment on GCE infra
- Valid options are all, workspace, webapp, ide, jetbrains, vscode, ssh. If enabled, with-preview and with-large-vm will be enabled.

/hold