Enable leader election in ws-manager-mk2 #18511


Merged
merged 6 commits on Aug 17, 2023

Conversation

aledbf (Member) commented Aug 14, 2023

Description

Replaces #18419

This introduces an edge case: when maintenance mode is triggered and we deploy a new version, the standby replica never receives the ConfigMap update. We change the strategy to have one or more standby replicas waiting to become the leader, with all replicas watching the configuration ConfigMap.
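Not the PR's actual code, but a minimal sketch (assuming controller-runtime, with illustrative names such as configWatcher, the election ID, and the namespace) of how this strategy can be wired up: leader election is enabled on the manager and backed by a Lease, so controllers reconcile only on the elected replica, while the configuration watcher is added as a runnable that opts out of leader election and therefore runs on every replica, leader and standbys alike.

```go
package main

import (
	"context"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/manager"
)

// configWatcher is an illustrative runnable that keeps the component's
// configuration in sync with its ConfigMap.
type configWatcher struct{}

// Start runs until the context is cancelled, periodically re-reading the
// configuration. A real implementation would react to ConfigMap changes.
func (w *configWatcher) Start(ctx context.Context) error {
	ticker := time.NewTicker(30 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return nil
		case <-ticker.C:
			// re-read and apply the configuration here
		}
	}
}

// NeedLeaderElection returning false means this runnable starts on every
// replica, so standbys keep their configuration current while waiting.
func (w *configWatcher) NeedLeaderElection() bool { return false }

var _ manager.Runnable = (*configWatcher)(nil)
var _ manager.LeaderElectionRunnable = (*configWatcher)(nil)

func main() {
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		LeaderElection:          true,
		LeaderElectionID:        "ws-manager-mk2-leader", // illustrative lease name
		LeaderElectionNamespace: "default",               // illustrative namespace
	})
	if err != nil {
		panic(err)
	}

	// Runs on every replica because NeedLeaderElection() is false.
	if err := mgr.Add(&configWatcher{}); err != nil {
		panic(err)
	}

	// Controllers registered with this manager only start reconciling once
	// this replica has won the lease; standbys keep the watcher running.
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}
```

A real implementation would more likely watch the ConfigMap through an informer instead of polling; the point here is only that the watcher is not gated on leadership, so a standby that later wins the lease already holds the current configuration.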

Summary generated by Copilot

🤖 Generated by Copilot at ae677ea

This pull request adds leader election for the ws-manager-mk2 component using the Kubernetes API and a Lease object. It removes the --leader-elect argument from the component and its deployment files, as it is no longer needed. It also reorders some imports in the sample-workspace command.

Preview status

Gitpod was successfully deployed to your preview environment.

Build Options

Build
  • /werft with-werft
    Run the build with werft instead of GHA
  • leeway-no-cache
  • /werft no-test
    Run Leeway with --dont-test
Publish
  • /werft publish-to-npm
  • /werft publish-to-jb-marketplace
Installer
  • analytics=segment
  • with-dedicated-emulation
  • workspace-feature-flags
    Add desired feature flags to the end of the line above, space separated
Preview Environment / Integration Tests
  • /werft with-local-preview
    If enabled this will build install/preview
  • /werft with-preview
  • /werft with-large-vm
  • /werft with-gce-vm
    If enabled this will create the environment on GCE infra
  • with-integration-tests=all
    Valid options are all, workspace, webapp, ide, jetbrains, vscode, ssh. If enabled, with-preview and with-large-vm will be enabled.
  • with-monitoring

/hold

aledbf (Member, Author) commented Aug 14, 2023

/gh run recreate-vm=true

Comment triggered a workflow run

Started workflow run: 5858026990

  • recreate_vm: true

aledbf (Member, Author) commented Aug 14, 2023

/gh run recreate-vm=true

Comment triggered a workflow run

Started workflow run: 5858423705

  • recreate_vm: true

aledbf force-pushed the alerdbf/ha-mk2 branch 2 times, most recently from 1a95c59 to 765d4cc on August 15, 2023 at 08:11
aledbf (Member, Author) commented Aug 15, 2023

/gh run recreate-vm=true

Comment triggered a workflow run

Started workflow run: 5870929595

  • recreate_vm: true

aledbf (Member, Author) commented Aug 15, 2023

/gh run recreate-vm=true

Comment triggered a workflow run

Started workflow run: 5871222912

  • recreate_vm: true

aledbf (Member, Author) commented Aug 16, 2023

/gh run recreate-vm=true

Comment triggered a workflow run

Started workflow run: 5875956095

  • recreate_vm: true

aledbf (Member, Author) commented Aug 16, 2023

/gh run recreate-vm=true

Comment triggered a workflow run

Started workflow run: 5877356816

  • recreate_vm: true

aledbf (Member, Author) commented Aug 16, 2023

/gh run recreate-vm=true

Comment triggered a workflow run

Started workflow run: 5878388959

  • recreate_vm: true

aledbf (Member, Author) commented Aug 16, 2023

/gh run recreate-vm=true

Comment triggered a workflow run

Started workflow run: 5879474569

  • recreate_vm: true

aledbf force-pushed the alerdbf/ha-mk2 branch 2 times, most recently from b58232d to 04c44f4 on August 16, 2023 at 21:40
WVerlaek (Member) left a comment

looks good, some questions

Also coming back to a question on the previous PR: how do we ensure workspaces don't time out when a pod gets elected as leader after it has been running for a while?

We use the controller's startup time as the workspace's last activity (see the linked code reference). If a pod that was standby for 1 hour gets elected, it won't yet have workspace activity stored (as it didn't receive MarkActive requests), so it will assume the workspace's last activity was 1 hour ago and time out the workspace.

We could change ManagerStartedAt to e.g. ControllerActiveAt, and set this once a pod becomes elected?
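To illustrate that suggestion (hypothetical names, not ws-manager-mk2's actual code): the fallback timestamp could be recorded only once a replica wins the election, by waiting on the manager's Elected channel before setting it.

```go
package main

import (
	"time"

	"sigs.k8s.io/controller-runtime/pkg/manager"
)

// WorkspaceActivity is an illustrative stand-in for the activity tracking in
// ws-manager-mk2; ControllerActiveAt is the proposed replacement for the
// ManagerStartedAt fallback.
type WorkspaceActivity struct {
	ControllerActiveAt time.Time
}

// markControllerActive blocks until this replica becomes the leader, then
// records when it started acting as the controller. Timeout checks can fall
// back to this value for workspaces without recorded activity, instead of the
// process start time of a pod that may have been on standby for hours.
func markControllerActive(mgr manager.Manager, activity *WorkspaceActivity) {
	// Elected() returns a channel that is closed once this replica wins
	// leader election (or immediately if leader election is disabled).
	<-mgr.Elected()
	activity.ControllerActiveAt = time.Now()
}
```

Started as go markControllerActive(mgr, activity) before mgr.Start, this sets the timestamp at the moment a standby takes over rather than at process start.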

aledbf (Member, Author) commented Aug 17, 2023

Also coming back to a question on the previous PR: how do we ensure workspaces don't time out when a pod gets elected as leader after it has been running for a while?

This is similar to when we restart ws-manager-mk2 or deploy a new version. The worst-case scenario is that workspaces will run for longer than they should due to the lost state.

WVerlaek (Member) replied:

This is similar to when we restart ws-manager-mk2 or deploy a new version. The worst-case scenario is that workspaces will run for longer than they should due to the lost state.

I don't think it is, though: on a restart the ManagerStartedAt field also gets reset, but not on leader election. Unless I'm missing something, I do believe that pods will time out after an old standby pod gets elected.

go func() {
    for {
        <-mgr.Elected()
        activity.ManagerStartedAt = time.Now()
(truncated code reference)
aledbf (Member, Author) replied:

I don't think it is, though: on a restart the ManagerStartedAt field also gets reset, but not on leader election. Unless I'm missing something, I do believe that pods will time out after an old standby pod gets elected.

Here: the snippet above, where ManagerStartedAt is set once a pod is elected.
