Enable leader election in ws-manager-mk2 #18419

Merged

merged 2 commits into main from aledbf/ha-mk2 on Aug 14, 2023
Conversation

aledbf (Member) commented Aug 2, 2023

Description

Summary generated by Copilot

🤖 Generated by Copilot at f4df6f5

This pull request enhances the ws-manager-mk2 component, which is responsible for managing workspaces in Gitpod. It fixes an import error, removes an unnecessary flag, and enables leader election for the controller.
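
For reference, enabling leader election in a controller-runtime based manager generally looks like the sketch below. This is a minimal illustration, not the actual ws-manager-mk2 code; the election ID and namespace are placeholder values.

```go
package main

import (
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/log/zap"
)

func main() {
	ctrl.SetLogger(zap.New())

	// With leader election enabled, only one replica actively reconciles;
	// the others block on the election and take over if the Lease is lost.
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		LeaderElection:          true,
		LeaderElectionID:        "ws-manager-mk2.gitpod.io", // placeholder ID
		LeaderElectionNamespace: "default",                  // placeholder namespace
	})
	if err != nil {
		panic(err)
	}

	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}
```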

Related Issue(s)

Fixes ENG-53

How to test

  • Open the Preview environment and check workspaces work as expected
  • Kill one of the ws-manager-mk2 pods and check the logs of the remaining one for the leader election taking place

Preview status

Gitpod was successfully deployed to your preview environment.

Build Options

Build
  • /werft with-werft
    Run the build with werft instead of GHA
  • leeway-no-cache
  • /werft no-test
    Run Leeway with --dont-test
Publish
  • /werft publish-to-npm
  • /werft publish-to-jb-marketplace
Installer
  • analytics=segment
  • with-dedicated-emulation
  • workspace-feature-flags
    Add desired feature flags to the end of the line above, space separated
Preview Environment / Integration Tests
  • /werft with-local-preview
    If enabled this will build install/preview
  • /werft with-preview
  • /werft with-large-vm
  • /werft with-gce-vm
    If enabled this will create the environment on GCE infra
  • with-integration-tests=all
    Valid options are all, workspace, webapp, ide, jetbrains, vscode, ssh. If enabled, with-preview and with-large-vm will be enabled.
  • with-monitoring

/hold

@aledbf changed the title from "Enable leader election in wa-manager-mk2" to "Enable leader election in ws-manager-mk2" Aug 2, 2023
@aledbf marked this pull request as ready for review August 2, 2023 18:30
@aledbf requested a review from a team as a code owner August 2, 2023 18:30
```diff
@@ -176,7 +176,7 @@ func deployment(ctx *common.RenderContext) ([]runtime.Object, error) {
 		},
 		Spec: appsv1.DeploymentSpec{
 			Selector: &metav1.LabelSelector{MatchLabels: labels},
-			Replicas: common.Replicas(ctx, Component),
+			Replicas: pointer.Int32(2),
```

Member

This could cause issues with activity timeouts. The load balancer will not know which instance is the leader so workspace activity could be reported to the standby and workspaces will time out. I think in practice it will not be a problem because we send heartbeats often enough that the leader will still get notified before that happens. Something to be aware of though.

Member

That's a good point. I think we should find a solution for that before going with 2 replicas; otherwise it will become hard to investigate any timeout issues.

Member Author

> This could cause issues with activity timeouts. The load balancer will not know which instance is the leader so workspace activity could be reported to the standby and workspaces will time out. I think in practice it will not be a problem because we send heartbeats often enough that the leader will still get notified before that happens. Something to be aware of though.

I think there's a misconception here: only the leader processes requests for the CRD types.

The load balancer only sees one instance, not two

Member Author

> That's a good point. I think we should find a solution for that before going with 2 replicas; otherwise it will become hard to investigate any timeout issues.

Can you expand on what you mean by timeout issues?

Furisto (Member) commented Aug 3, 2023

> The load balancer only sees one instance, not two

Why does it see only one? If we have two replicas, would it not send requests to both? Does the standby not become ready?

> Can you expand on what you mean by timeout issues?

The workspaces send activity requests to ws-manager in order to mark the workspace as active. We store this information in memory in ws-manager. If the requests go to the standby, the active manager (which checks the timeouts) will not have the information that the workspace is active, because that information is stored by the standby.
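
To make the concern concrete, here is a minimal sketch of the kind of in-memory activity tracking being described (hypothetical names, not the actual ws-manager code): if MarkActive ran on the standby replica, the leader's map would never see the heartbeat, and its timeout check would consider the workspace idle.

```go
package sketch

import (
	"sync"
	"time"
)

// activityTracker is a hypothetical sketch of per-replica, in-memory activity state.
type activityTracker struct {
	mu           sync.Mutex
	lastActivity map[string]time.Time // workspace ID -> last heartbeat
}

// MarkActive records a heartbeat on whichever replica received the request.
// If that replica is the standby, the leader never learns about it.
func (a *activityTracker) MarkActive(workspaceID string) {
	a.mu.Lock()
	defer a.mu.Unlock()
	a.lastActivity[workspaceID] = time.Now()
}

// HasTimedOut is what the leader would check; it only sees its own local map.
func (a *activityTracker) HasTimedOut(workspaceID string, timeout time.Duration) bool {
	a.mu.Lock()
	defer a.mu.Unlock()
	last, ok := a.lastActivity[workspaceID]
	return !ok || time.Since(last) > timeout
}
```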

Member Author

Correct, the standby is not ready to serve, but it still passes the readiness health checks.

You can see this in the preview environment by killing the current leader while tailing the other pod's logs.

Member Author

> We store this information in memory in ws-manager

This would be wrong if, when we switch the leader, the other pod started killing workspaces without waiting for status reports from the workspaces.

Member Author

There's a callback we can implement when a pod is elected the new leader, so we can wait for workspace reports before changing anything.
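
For illustration, controller-runtime exposes the leadership transition through the manager's Elected() channel, so the idea could look roughly like this. This is a hypothetical helper, not the actual implementation; the grace period is an assumption.

```go
package sketch

import (
	"context"
	"time"

	"sigs.k8s.io/controller-runtime/pkg/manager"
)

// waitForLeadershipThenWarmUp is a hypothetical helper: it blocks until this
// replica wins the leader election, then waits a grace period so workspaces
// can report fresh activity before any timeout handling kicks in.
func waitForLeadershipThenWarmUp(ctx context.Context, mgr manager.Manager, grace time.Duration) error {
	select {
	case <-mgr.Elected(): // channel is closed once this replica becomes the leader
	case <-ctx.Done():
		return ctx.Err()
	}

	select {
	case <-time.After(grace): // grace period for workspaces to send status reports
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}
```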

@roboquat merged commit 12d7430 into main Aug 14, 2023
@roboquat deleted the aledbf/ha-mk2 branch August 14, 2023 08:28
Furisto (Member) commented Aug 14, 2023

We have decided that the impact of the above scenario is smaller than the impact of not merging this PR.

aledbf added a commit that referenced this pull request Aug 14, 2023
roboquat pushed a commit that referenced this pull request Aug 14, 2023