
[node-labeler] Introduce workspace count controller #20509


Merged
merged 21 commits into main from ft/node-deletion-annotations on Jan 21, 2025

Conversation

@filiptronicek (Member) commented Jan 8, 2025

Description

Adds a controller to node-labeler that tracks the number of workspace CRDs on each workspace node. While that count is above zero, it adds the cluster-autoscaler.kubernetes.io/scale-down-disabled: true annotation to the node, and it removes the annotation once the count reaches zero.

This should prevent data loss for workspaces whose backups take so long that the node would otherwise be scaled down before the backup completes.
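
For illustration, a minimal sketch of the reconcile step this describes, using controller-runtime. This is not the merged implementation; the package, function, and variable names here are assumptions, only the annotation key comes from this PR.

package nodelabeler

import (
    "context"

    corev1 "k8s.io/api/core/v1"
    "sigs.k8s.io/controller-runtime/pkg/client"
)

const scaleDownDisabledAnnotation = "cluster-autoscaler.kubernetes.io/scale-down-disabled"

// setScaleDownDisabled keeps nodeName pinned against scale-down while
// workspaceCount > 0 and drops the annotation once the count reaches zero.
func setScaleDownDisabled(ctx context.Context, c client.Client, nodeName string, workspaceCount int) error {
    var node corev1.Node
    if err := c.Get(ctx, client.ObjectKey{Name: nodeName}, &node); err != nil {
        return err
    }

    updated := node.DeepCopy()
    if updated.Annotations == nil {
        updated.Annotations = map[string]string{}
    }
    if workspaceCount > 0 {
        updated.Annotations[scaleDownDisabledAnnotation] = "true"
    } else {
        delete(updated.Annotations, scaleDownDisabledAnnotation)
    }

    // Patch instead of Update so concurrent changes to the node are not clobbered.
    return c.Patch(ctx, updated, client.MergeFrom(&node))
}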

Related Issue(s)

Fixes CLC-1054

How to test

  1. Start a workspace on this PR
  2. Open one terminal, where you'll watch the node's annotations
    watch "kubectl describe node | grep Anno"
    
  3. Open another terminal, where you'll observe the controller's logs
    kubectl logs -l component=node-labeler -f --all-containers
    

When a workspace is running on the preview environment (link below), you should see cluster-autoscaler.kubernetes.io/scale-down-disabled: true among the watched annotations. Once no workspace is running on the preview environment, the annotation is no longer held at true.

https://ft-node-de7200b44d5e.preview.gitpod-dev.com/workspaces

@roboquat added size/XXL and removed size/L labels Jan 13, 2025

socket-security bot commented Jan 13, 2025

New dependencies detected. Learn more about Socket for GitHub.

Package: golang/github.com/aws/[email protected]
New capabilities: environment, filesystem, network, shell, unsafe
Transitives: 0
Size: 739 kB

View full report

@roboquat added size/XL and removed size/XXL labels Jan 13, 2025
@filiptronicek changed the title from "[ws-daemon] Introduce pod count controller" to "[node-labeler] Introduce workspace count controller" Jan 13, 2025
@kylos101 (Contributor) left a comment

Hey @filiptronicek, I'll review this after our team sync, but I wanted to share some early feedback.

@kylos101 (Contributor) commented

@filiptronicek we should test node-labeler's behavior when a workspace is running and the underlying node is deleted on the cloud provider side.

I would expect ws-manager-mk2 to eventually see that the underlying node is gone from the cloud provider, and then to mark the workspace as stopped, with the stop reason being that the backup failed because the node was deleted. However, I'm unsure how node-labeler will respond when it comes to managing the annotation in Kubernetes. I expect the annotation will be set to false once the workspace is stopped, but we should confirm.

To do this test, I suggest using catfood (so that you don't have to build a test cell).

@filiptronicek (Member, Author) commented

@kylos101 if we delete the node on the cloud provider's side, does it also disappear from k8s, or does it stay there without any actual machine backing it? If it does disappear, there would probably be no annotations left to take care of. I think the only case where this could be problematic is if the node was still there but wasn't listable, and hence our controller couldn't take care of de-annotating it.

@kylos101 (Contributor) left a comment

Adding the balance of my feedback; I have none remaining, and would be happy to sync in my morning so we can finalize this PR and get it shipped.

RunSpecs(t, "Controller Suite")
}

var _ = Describe("WorkspaceCountController", func() {
Contributor left a comment

We have this disappearing node test for mk2.

It("node disappearing should fail with backup failure", func() {

It would be interesting to see if a similar test could be added for node-labeler, so that we could cover a scenario like this one: #20509 (comment)
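
A rough sketch of what such a test might look like inside the Controller Suite shown above; none of this is from the PR. It assumes the suite's usual dot-imports of Ginkgo/Gomega plus corev1, metav1, and client, that ctx and k8sClient are provided by the suite, and that there is some way to create a workspace whose Status.Runtime.NodeName points at the node. The node name and assertions are illustrative only.

var _ = Describe("WorkspaceCountController disappearing node", func() {
    It("should tolerate the node being deleted while a workspace is on it", func() {
        node := &corev1.Node{ObjectMeta: metav1.ObjectMeta{Name: "ws-node-0"}}
        Expect(k8sClient.Create(ctx, node)).To(Succeed())
        // ... create a workspace scheduled onto "ws-node-0" here ...

        By("waiting for the scale-down-disabled annotation to appear")
        Eventually(func() string {
            var n corev1.Node
            if err := k8sClient.Get(ctx, client.ObjectKeyFromObject(node), &n); err != nil {
                return ""
            }
            return n.Annotations["cluster-autoscaler.kubernetes.io/scale-down-disabled"]
        }).Should(Equal("true"))

        By("deleting node")
        Expect(k8sClient.Delete(ctx, node)).To(Succeed())

        // With the node object gone there is nothing left to de-annotate; the
        // interesting assertion is that the controller keeps running and does
        // not error while draining its queue for the vanished node.
    })
})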

Member Author left a comment

The mk2 test you linked here is using the k8s client to delete the node:

// Make node disappear 🪄
By("deleting node")
Expect(k8sClient.Delete(ctx, &node)).To(Succeed())

I have two questions here:

  1. Is that analogous to a node being ripped to shreds from the cloud provider's side?
  2. I'm probably missing something, but if the node's gone, what is there to annotate? Do we need to care about a node disappearing at all?

Contributor left a comment

I'm not sure. I think we need to do a brief manual test where we delete a node from the cloud provider while it has a workspace on it, and see how k8s reconciles the deleted node with the scale-down-disabled: true annotation.

@roboquat added size/XXL and removed size/XL labels Jan 15, 2025
Complete(wc)
}

func (wc *WorkspaceCountController) periodicReconciliation() {
Member Author left a comment

We do a manual reconciliation every 5m. Why? Because SyncPeriod in the manager options doesn't seem to do what I expected it to; I didn't see reconciliation being triggered by it even once during testing.

Hence the goroutine, which we properly dispose of via the stopChan.
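
A minimal sketch of the ticker-plus-stop-channel pattern being described. Only periodicReconciliation, stopChan, and the 5-minute interval come from this thread; the struct layout and the requeue helper are assumptions.

package nodelabeler

import "time"

// Only the fields needed for this sketch; the real controller has more.
type WorkspaceCountController struct {
    stopChan chan struct{}
}

// requeueAllWorkspaceNodes stands in for however the controller pushes node
// names back onto its work queue (e.g. via the nodesToReconcile channel).
func (wc *WorkspaceCountController) requeueAllWorkspaceNodes() {}

// periodicReconciliation re-triggers reconciliation on a fixed interval and
// shuts down cleanly when stopChan is closed.
func (wc *WorkspaceCountController) periodicReconciliation() {
    ticker := time.NewTicker(5 * time.Minute)
    defer ticker.Stop()

    for {
        select {
        case <-ticker.C:
            // Run a pass even if no watch event has fired since the last one.
            wc.requeueAllWorkspaceNodes()
        case <-wc.stopChan:
            return
        }
    }
}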

if ws.Status.Runtime != nil && ws.Status.Runtime.NodeName != "" {
    // Queue the node for reconciliation
    select {
    case wc.nodesToReconcile <- ws.Status.Runtime.NodeName:
Member Author left a comment

When a workspace is being deleted, we won't be able to query its node name, since the workspace will no longer exist by the time we query for it. Because of that, we instead capture node names in a channel, which is consumed any time Reconcile runs.
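
A sketch of the consuming side of that channel, assuming a buffered chan string named nodesToReconcile as in the snippet above; the drain helper itself is hypothetical, not code from this PR.

package nodelabeler

// drainQueuedNodes empties nodesToReconcile without blocking and returns the
// node names whose annotations still need to be re-evaluated. Intended to be
// called at the start of a Reconcile pass.
func drainQueuedNodes(nodesToReconcile chan string) []string {
    var nodes []string
    for {
        select {
        case name := <-nodesToReconcile:
            nodes = append(nodes, name)
        default:
            // Nothing more was queued since the last run.
            return nodes
        }
    }
}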

@filiptronicek (Member, Author) commented

This is now pending a test on an ephemeral workspace cluster on Gitpod.io, since replacing the node-labeler deployment's image on catfood is not possible (due to the permission changes needed in the installer).

@kylos101 (Contributor) left a comment

Re: metrics

@kylos101 (Contributor) left a comment

Awesome, thank you @filiptronicek!

We analyzed perf results on an ephemeral cluster and compared them to node-labeler in prod; performance was similar for API requests (read/write).

@filiptronicek (Member, Author) commented

Let's :shipit:!

@roboquat merged commit 4d3cca4 into main Jan 21, 2025
18 checks passed
@roboquat deleted the ft/node-deletion-annotations branch January 21, 2025 23:12