
[node-labeler] Introduce workspace count controller #20509


Merged
merged 21 commits into main from ft/node-deletion-annotations on Jan 21, 2025

Conversation

@filiptronicek (Member) commented Jan 8, 2025

Description

Adds a controller to node-labeler that tracks the number of workspace CRDs on each workspace node. While that count is above zero, it adds the cluster-autoscaler.kubernetes.io/scale-down-disabled: true annotation to the node, and it removes the annotation once the count reaches zero.

This should prevent data loss for workspaces whose backups take so long that the node would otherwise be scaled down before the backup completes.
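
For illustration, a minimal sketch of the reconcile step this describes, using controller-runtime. This is not the merged implementation; the package, function, and variable names here are assumptions, only the annotation key comes from this PR.

package nodelabeler

import (
    "context"

    corev1 "k8s.io/api/core/v1"
    "sigs.k8s.io/controller-runtime/pkg/client"
)

const scaleDownDisabledAnnotation = "cluster-autoscaler.kubernetes.io/scale-down-disabled"

// setScaleDownDisabled keeps nodeName pinned against scale-down while
// workspaceCount > 0 and drops the annotation once the count reaches zero.
func setScaleDownDisabled(ctx context.Context, c client.Client, nodeName string, workspaceCount int) error {
    var node corev1.Node
    if err := c.Get(ctx, client.ObjectKey{Name: nodeName}, &node); err != nil {
        return err
    }

    updated := node.DeepCopy()
    if updated.Annotations == nil {
        updated.Annotations = map[string]string{}
    }
    if workspaceCount > 0 {
        updated.Annotations[scaleDownDisabledAnnotation] = "true"
    } else {
        delete(updated.Annotations, scaleDownDisabledAnnotation)
    }

    // Patch instead of Update so concurrent changes to the node are not clobbered.
    return c.Patch(ctx, updated, client.MergeFrom(&node))
}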

Related Issue(s)

Fixes CLC-1054

How to test

  1. Start a workspace on this PR
  2. Open one terminal, where you'll watch the node's annotations
    watch "kubectl describe node | grep Anno"
    
  3. Open another terminal, where you'll observe the controller's logs
    kubectl logs -l component=node-labeler -f --all-containers
    

When a workspace is running on the preview environment (link below), you should see cluster-autoscaler.kubernetes.io/scale-down-disabled: true among the watched annotations. Once no workspace is running on the preview environment, the annotation is no longer held at true.

https://ft-node-de7200b44d5e.preview.gitpod-dev.com/workspaces

@roboquat added size/XXL and removed size/L labels Jan 13, 2025

socket-security bot commented Jan 13, 2025

New dependencies detected. Learn more about Socket for GitHub.

Package: golang/github.com/aws/[email protected]
New capabilities: environment, filesystem, network, shell, unsafe
Transitives: 0
Size: 739 kB

View full report

@roboquat added size/XL and removed size/XXL labels Jan 13, 2025
@filiptronicek changed the title from "[ws-daemon] Introduce pod count controller" to "[node-labeler] Introduce workspace count controller" Jan 13, 2025
@kylos101 (Contributor) left a comment

Hey @filiptronicek, I'll review this after our team sync, but I wanted to share some early feedback.

@kylos101 (Contributor) commented

@filiptronicek we should test node-labeler's behavior when a workspace is running and the underlying node is deleted on the cloud provider side.

I would expect ws-manager-mk2 to eventually see that the underlying node is gone from the cloud provider, and then to mark the workspace as stopped, with the stop reason being that the backup failed because the node was deleted. However, I'm unsure how node-labeler will respond when it comes to managing the annotation in Kubernetes. I expect the annotation will be set to false once the workspace is stopped, but we should confirm.

To do this test, I suggest using catfood (so that you don't have to build a test cell).

@filiptronicek (Member, Author) commented

@kylos101 if we delete the node on the cloud provider's side, does it also disappear from k8s, or does it stay there without any actual machine backing it? If it does disappear, there would probably be no annotations left to take care of. I think the only case where this could be problematic is if the node was still there but wasn't listable, and hence our controller couldn't take care of de-annotating it.

@kylos101 (Contributor) left a comment

Adding the balance of my feedback; I have none remaining, and would be happy to sync in my morning so we can finalize this PR and get it shipped.

RunSpecs(t, "Controller Suite")
}

var _ = Describe("WorkspaceCountController", func() {
Contributor left a comment

We have this disappearing node test for mk2.

It("node disappearing should fail with backup failure", func() {

It would be interesting to see if a similar test could be added for node-labeler, so that we could cover a scenario like this one: #20509 (comment)
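
A rough sketch of what such a test might look like inside the Controller Suite shown above; none of this is from the PR. It assumes the suite's usual dot-imports of Ginkgo/Gomega plus corev1, metav1, and client, that ctx and k8sClient are provided by the suite, and that there is some way to create a workspace whose Status.Runtime.NodeName points at the node. The node name and assertions are illustrative only.

var _ = Describe("WorkspaceCountController disappearing node", func() {
    It("should tolerate the node being deleted while a workspace is on it", func() {
        node := &corev1.Node{ObjectMeta: metav1.ObjectMeta{Name: "ws-node-0"}}
        Expect(k8sClient.Create(ctx, node)).To(Succeed())
        // ... create a workspace scheduled onto "ws-node-0" here ...

        By("waiting for the scale-down-disabled annotation to appear")
        Eventually(func() string {
            var n corev1.Node
            if err := k8sClient.Get(ctx, client.ObjectKeyFromObject(node), &n); err != nil {
                return ""
            }
            return n.Annotations["cluster-autoscaler.kubernetes.io/scale-down-disabled"]
        }).Should(Equal("true"))

        By("deleting node")
        Expect(k8sClient.Delete(ctx, node)).To(Succeed())

        // With the node object gone there is nothing left to de-annotate; the
        // interesting assertion is that the controller keeps running and does
        // not error while draining its queue for the vanished node.
    })
})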

Member Author left a comment

The mk2 test you linked here is using the k8s client to delete the node:

// Make node disappear 🪄
By("deleting node")
Expect(k8sClient.Delete(ctx, &node)).To(Succeed())

I have two questions here:

  1. Is that analogous to a node being ripped to shreds from the cloud provider's side?
  2. I'm probably missing something, but if the node's gone, what is there to annotate? Do we need to care about a node disappearing at all?

Contributor left a comment

I'm not sure. I think we need to do a brief manual test where we delete a node from the cloud provider while it has a workspace on it, and see how k8s reconciles the deleted node with the scale-down-disabled: true annotation.

@roboquat added size/XXL and removed size/XL labels Jan 15, 2025
Complete(wc)
}

func (wc *WorkspaceCountController) periodicReconciliation() {
Member Author left a comment

We do a manual reconciliation every 5m. Why? Because SyncPeriod in the manager options doesn't seem to do what I expected it to; I didn't see reconciliation being triggered by it even once during testing.

Hence the goroutine, which we properly dispose of via the stopChan.
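
A minimal sketch of the ticker-plus-stop-channel pattern being described. Only periodicReconciliation, stopChan, and the 5-minute interval come from this thread; the struct layout and the requeue helper are assumptions.

package nodelabeler

import "time"

// Only the fields needed for this sketch; the real controller has more.
type WorkspaceCountController struct {
    stopChan chan struct{}
}

// requeueAllWorkspaceNodes stands in for however the controller pushes node
// names back onto its work queue (e.g. via the nodesToReconcile channel).
func (wc *WorkspaceCountController) requeueAllWorkspaceNodes() {}

// periodicReconciliation re-triggers reconciliation on a fixed interval and
// shuts down cleanly when stopChan is closed.
func (wc *WorkspaceCountController) periodicReconciliation() {
    ticker := time.NewTicker(5 * time.Minute)
    defer ticker.Stop()

    for {
        select {
        case <-ticker.C:
            // Run a pass even if no watch event has fired since the last one.
            wc.requeueAllWorkspaceNodes()
        case <-wc.stopChan:
            return
        }
    }
}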

if ws.Status.Runtime != nil && ws.Status.Runtime.NodeName != "" {
    // Queue the node for reconciliation
    select {
    case wc.nodesToReconcile <- ws.Status.Runtime.NodeName:
Member Author left a comment

When a workspace is being deleted, we won't be able to query its node name, since the workspace will no longer exist by the time we query for it. Because of that, we instead capture node names in a channel, which is consumed any time Reconcile runs.
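
A sketch of the consuming side of that channel, assuming a buffered chan string named nodesToReconcile as in the snippet above; the drain helper itself is hypothetical, not code from this PR.

package nodelabeler

// drainQueuedNodes empties nodesToReconcile without blocking and returns the
// node names whose annotations still need to be re-evaluated. Intended to be
// called at the start of a Reconcile pass.
func drainQueuedNodes(nodesToReconcile chan string) []string {
    var nodes []string
    for {
        select {
        case name := <-nodesToReconcile:
            nodes = append(nodes, name)
        default:
            // Nothing more was queued since the last run.
            return nodes
        }
    }
}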

@filiptronicek (Member, Author) commented

This is now pending a test on an ephemeral workspace cluster on Gitpod.io, since replacing the node-labeler deployment's image on catfood is not possible (due to the permission changes needed in the installer).

@kylos101 (Contributor) left a comment

Re: metrics

@kylos101 (Contributor) left a comment

Awesome, thank you @filiptronicek!

We analyzed perf results on an ephemeral cluster and compared them to node-labeler in prod; performance was similar for API requests (read/write).

@filiptronicek (Member, Author) commented

Let's :shipit:!

@roboquat merged commit 4d3cca4 into main Jan 21, 2025
18 checks passed
@roboquat deleted the ft/node-deletion-annotations branch January 21, 2025 23:12