
Enable leader election in ws-manager-mk2 #18419

Merged · 2 commits · Aug 14, 2023
3 changes: 2 additions & 1 deletion components/ws-manager-mk2/cmd/sample-workspace/main.go
@@ -10,10 +10,11 @@ import (
     "log"
     "time"
 
-    workspacev1 "github.com/gitpod-io/gitpod/ws-manager/api/crd/v1"
     metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
     "k8s.io/utils/pointer"
     "sigs.k8s.io/yaml"
+
+    workspacev1 "github.com/gitpod-io/gitpod/ws-manager/api/crd/v1"
 )
 
 func main() {
2 changes: 0 additions & 2 deletions components/ws-manager-mk2/config/manager/manager.yaml
@@ -33,8 +33,6 @@ spec:
       containers:
       - command:
         - /manager
-        args:
-        - --leader-elect
         image: controller:latest
         name: manager
         securityContext:
6 changes: 1 addition & 5 deletions components/ws-manager-mk2/main.go
@@ -68,13 +68,9 @@ func init() {
 }
 
 func main() {
-    var enableLeaderElection bool
     var configFN string
     var jsonLog bool
     var verbose bool
-    flag.BoolVar(&enableLeaderElection, "leader-elect", false,
-        "Enable leader election for controller manager. "+
-            "Enabling this will ensure there is only one active controller manager.")
     flag.StringVar(&configFN, "config", "", "Path to the config file")
     flag.BoolVar(&jsonLog, "json-log", true, "produce JSON log output on verbose level")
     flag.BoolVar(&verbose, "verbose", false, "Enable verbose logging")
@@ -125,7 +121,7 @@ func main() {
         MetricsBindAddress:     cfg.Prometheus.Addr,
         Port:                   9443,
         HealthProbeBindAddress: cfg.Health.Addr,
-        LeaderElection:         enableLeaderElection,
+        LeaderElection:         true,
         LeaderElectionID:       "ws-manager-mk2-leader.gitpod.io",
         NewCache:               cache.MultiNamespacedCacheBuilder([]string{cfg.Manager.Namespace, cfg.Manager.SecretsNamespace}),
     })
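For context on what the LeaderElection and LeaderElectionID options do: controller-runtime's leader election is lock-based, so each replica tries to acquire a lock object (typically a coordination.k8s.io Lease) named after LeaderElectionID, and only the replica holding the lock starts its controllers while the others wait as warm standbys. Below is a minimal standalone sketch of that configuration, not the ws-manager-mk2 setup itself; the ID and namespace are hypothetical placeholders.

package main

import (
	"os"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/log/zap"
)

func main() {
	ctrl.SetLogger(zap.New())

	// With LeaderElection enabled, the manager acquires a lock named after
	// LeaderElectionID and only starts its controllers once this replica
	// holds the lock; other replicas block as warm standbys.
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		LeaderElection:          true,
		LeaderElectionID:        "example-controller-leader.example.io", // hypothetical ID
		LeaderElectionNamespace: "default",                              // hypothetical namespace
	})
	if err != nil {
		os.Exit(1)
	}

	// Start blocks for the lifetime of the process; reconcilers run only on
	// the elected leader.
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		os.Exit(1)
	}
}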
12 changes: 6 additions & 6 deletions install/installer/pkg/components/ws-manager-mk2/deployment.go
@@ -5,17 +5,18 @@
 package wsmanagermk2
 
 import (
-    "github.com/gitpod-io/gitpod/installer/pkg/cluster"
-    "github.com/gitpod-io/gitpod/installer/pkg/common"
-    wsdaemon "github.com/gitpod-io/gitpod/installer/pkg/components/ws-daemon"
-    "github.com/gitpod-io/gitpod/installer/pkg/config/v1"
     appsv1 "k8s.io/api/apps/v1"
     corev1 "k8s.io/api/core/v1"
     "k8s.io/apimachinery/pkg/api/resource"
     metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
     "k8s.io/apimachinery/pkg/runtime"
     "k8s.io/apimachinery/pkg/util/intstr"
     "k8s.io/utils/pointer"
+
+    "github.com/gitpod-io/gitpod/installer/pkg/cluster"
+    "github.com/gitpod-io/gitpod/installer/pkg/common"
+    wsdaemon "github.com/gitpod-io/gitpod/installer/pkg/components/ws-daemon"
+    "github.com/gitpod-io/gitpod/installer/pkg/config/v1"
 )
 
 func deployment(ctx *common.RenderContext) ([]runtime.Object, error) {
@@ -58,7 +59,6 @@ func deployment(ctx *common.RenderContext) ([]runtime.Object, error) {
         Name: Component,
         Args: []string{
             "--config", "/config/config.json",
-            "--leader-elect",
         },
         Image:           ctx.ImageName(ctx.Config.Repository, Component, ctx.VersionManifest.Components.WSManagerMk2.Version),
         ImagePullPolicy: corev1.PullIfNotPresent,
@@ -176,7 +176,7 @@ func deployment(ctx *common.RenderContext) ([]runtime.Object, error) {
         },
         Spec: appsv1.DeploymentSpec{
             Selector: &metav1.LabelSelector{MatchLabels: labels},
-            Replicas: common.Replicas(ctx, Component),
+            Replicas: pointer.Int32(2),
Member:
This could cause issues with activity timeouts. The load balancer will not know which instance is the leader, so workspace activity could be reported to the standby and workspaces will time out. I think in practice it will not be a problem, because we send heartbeats often enough that the leader will still get notified before that happens. Something to be aware of, though.

Member:
That's a good point. I think we should find a solution for that before going with 2 replicas; otherwise it will become hard to investigate any timeout issues.

Member (Author):
> This could cause issues with activity timeouts. The load balancer will not know which instance is the leader, so workspace activity could be reported to the standby and workspaces will time out. I think in practice it will not be a problem, because we send heartbeats often enough that the leader will still get notified before that happens. Something to be aware of, though.

I think there's a misconception here: only the leader processes requests for the CRD types.

The load balancer only sees one instance, not two.

Member (Author):
> That's a good point. I think we should find a solution for that before going with 2 replicas; otherwise it will become hard to investigate any timeout issues.

Can you expand on what you mean by timeout issues?

Member (@Furisto, Aug 3, 2023):
> The load balancer only sees one instance, not two.

Why does it see only one? If we have two replicas, would it not send requests to both? Does the standby not become ready?

> Can you expand on what you mean by timeout issues?

The workspaces send activity requests to ws-manager in order to mark the workspace as active. We store this information in memory in ws-manager. If those requests went to the standby, the active manager (which checks the timeouts) would not know that the workspace is active, because that information would be held by the standby.
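To make that failure mode concrete, here is a hypothetical, heavily simplified sketch of the in-memory bookkeeping described above (it is not the actual ws-manager implementation): heartbeats update a per-workspace timestamp on whichever replica receives them, and the leader's timeout check only consults its own map.

package main

import (
	"fmt"
	"sync"
	"time"
)

// activityTracker models per-workspace heartbeats kept in process memory.
// If heartbeats land on the standby replica, the leader's map is never
// updated and its timeout check treats the workspace as inactive.
type activityTracker struct {
	mu       sync.Mutex
	lastSeen map[string]time.Time
}

func newActivityTracker() *activityTracker {
	return &activityTracker{lastSeen: map[string]time.Time{}}
}

// MarkActive is what handling an activity/heartbeat request would do.
func (t *activityTracker) MarkActive(workspaceID string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.lastSeen[workspaceID] = time.Now()
}

// TimedOut is what the leader's timeout loop would ask.
func (t *activityTracker) TimedOut(workspaceID string, timeout time.Duration) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	last, ok := t.lastSeen[workspaceID]
	// A workspace with no recorded heartbeat on this replica (for example
	// because the request went to the other replica) counts as inactive.
	return !ok || time.Since(last) > timeout
}

func main() {
	leader, standby := newActivityTracker(), newActivityTracker()
	standby.MarkActive("ws-123") // heartbeat routed to the standby

	// The leader never saw the heartbeat, so it reports ws-123 as timed out.
	fmt.Println(leader.TimedOut("ws-123", 30*time.Minute)) // prints: true
}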

Member (Author):
Correct, the standby is not ready to serve, but it is passing the readiness health checks.

You can see this in the preview by killing the current leader while tailing the other pod's logs.
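One possible way to act on the "standby still passes the readiness checks" observation (an idea, not something this PR does) would be to gate readiness on the election result: a controller-runtime manager exposes Elected(), a channel that is closed once the replica acquires the lease, and a readiness check can watch it. A hedged sketch, where addLeaderReadyz is a hypothetical helper:

package leaderreadyz

import (
	"errors"
	"net/http"

	"sigs.k8s.io/controller-runtime/pkg/manager"
)

// addLeaderReadyz registers a readiness check that only succeeds once this
// replica has become the leader, so a Service would stop routing requests to
// the standby. Hypothetical helper, shown for illustration only.
func addLeaderReadyz(mgr manager.Manager) error {
	return mgr.AddReadyzCheck("leader", func(_ *http.Request) error {
		select {
		case <-mgr.Elected():
			// mgr.Elected() is closed once the leader election lock is won.
			return nil
		default:
			return errors.New("not the leader (yet)")
		}
	})
}

One trade-off with this approach: the standby would never report Ready, which can interact badly with rollout strategies that wait for all replicas to become ready.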

Member (Author):
> We store this information in memory in ws-manager.

This would be wrong if, when we switch the leader, the other pod started killing workspaces without waiting for status reports from the workspaces.

Member (Author):
There's a callback we can implement for when a pod is elected the new leader, so we can wait for workspace reports before changing anything.
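controller-runtime does have hooks that fit this idea: runnables added through mgr.Add that do not opt out via NeedLeaderElection() are only started after the replica wins the election, and mgr.Elected() can be watched directly. Below is a hedged sketch of such a post-election warm-up step; addLeaderWarmup, the warm-up duration, and the onReady callback are hypothetical and not part of this PR.

package leaderwarmup

import (
	"context"
	"log"
	"time"

	"sigs.k8s.io/controller-runtime/pkg/manager"
)

// addLeaderWarmup registers a runnable that is only started once this replica
// is elected leader. It then waits a grace period so fresh workspace status
// and activity reports can arrive before onReady lets timeout enforcement
// begin. Hypothetical sketch for illustration only.
func addLeaderWarmup(mgr manager.Manager, warmup time.Duration, onReady func()) error {
	return mgr.Add(manager.RunnableFunc(func(ctx context.Context) error {
		// Reaching this point means the leader election lock is held, since
		// plain runnables are placed in the leader-election group.
		log.Printf("elected leader, waiting %s before enforcing timeouts", warmup)
		select {
		case <-time.After(warmup):
			onReady()
		case <-ctx.Done():
		}
		return nil
	}))
}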

             Strategy: common.DeploymentStrategy,
             Template: corev1.PodTemplateSpec{
                 ObjectMeta: metav1.ObjectMeta{