[node-labeler] Introduce workspace count controller #20509
Conversation
Hey @filiptronicek, I'll review this after our team sync, but wanted to share some early feedback.
components/ws-daemon/pkg/controller/workspace_controller_test.go
@filiptronicek we should test node-labeler's behavior for when a workspace is running and the underlying node is deleted on the cloud provider side. I would expect ws-manager-mk2 to eventually see that the underlying node is gone from the cloud provider, and then mark the workspace as stopped, with the stop reason being a failed backup because the node was deleted. However, I'm unsure how node-labeler will respond re: managing the annotation in Kubernetes. I expect it'll be set to false once the workspace is stopped, but we should confirm. To do this test, I suggest using catfood (so that you don't have to build a test cell).
@kylos101 if we delete the node on the cloud provider's side, does it also disappear from k8s, or does it stay there without any actual machine backing it? If it does disappear, there would probably be no annotations to take care of. I think the only instance where this could be problematic is if the node was still there but wasn't listable, and hence our controller couldn't take care of de-annotating it.
I've added the balance of my feedback and have none remaining; I'd be happy to sync in my morning so we can finalize this PR and get it shipped.
RunSpecs(t, "Controller Suite")
}

var _ = Describe("WorkspaceCountController", func() {
We have this disappearing-node test for mk2:
It("node disappearing should fail with backup failure", func() { |
It would be interesting to see if a similar test could be added for node-labeler; that way we could cover a scenario like this one: #20509 (comment)
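For illustration, here is a rough sketch of what such a node-labeler test might look like. It assumes the suite's `k8sClient`, `ctx`, `timeout`, and `interval` exist, that a `node` with a running workspace has already been set up, and that the controller exposes a `reconcileNode` helper that can be driven directly; all of these names are guesses, not the PR's actual API.

```go
// Hypothetical sketch only; names and helpers are assumed, not the PR's actual code.
It("handles the node disappearing on the cloud provider's side", func() {
	By("deleting node")
	Expect(k8sClient.Delete(ctx, &node)).To(Succeed())

	By("reconciling the deleted node without error")
	Eventually(func() error {
		// A NotFound node should be treated as "nothing left to annotate".
		return wsCountController.reconcileNode(ctx, node.Name)
	}, timeout, interval).Should(Succeed())
})
```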
The mk2 test you linked here uses the k8s client to delete the node:
gitpod/components/ws-manager-mk2/controllers/workspace_controller_test.go
Lines 346 to 348 in bd36bac
// Make node disappear 🪄
By("deleting node")
Expect(k8sClient.Delete(ctx, &node)).To(Succeed())
I have two questions here:
- Is that analogous to a node being ripped to shreds from the cloud provider's side?
- I'm probably missing something, but if the node's gone, what is there to annotate? Do we need to care about a node disappearing at all?
I'm not sure. I think we need to do a brief manual test where we delete a node from the cloud provider while it has a workspace on it, and see how k8s reconciles the deleted node with the `scale-down-disabled: true` annotation.
components/node-labeler/cmd/run.go
Complete(wc)
}

func (wc *WorkspaceCountController) periodicReconciliation() {
We do a manual reconciliation every 5m. Why? Because `SyncPeriod` in the manager options doesn't seem to do what I expected it to; I didn't see the reconciliation trigger externally even once during testing. Hence the goroutine, which we properly dispose of with the `stopChan`.
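For context, a minimal sketch of how such a periodic trigger can be wired up with a ticker and a stop channel. The `stopChan` field and the 5-minute interval follow the discussion above, but the body and the `requeueAllNodes` helper are illustrative, not the PR's exact implementation.

```go
// Illustrative sketch only; the actual periodicReconciliation in this PR may differ.
func (wc *WorkspaceCountController) periodicReconciliation() {
	ticker := time.NewTicker(5 * time.Minute)
	defer ticker.Stop()

	for {
		select {
		case <-ticker.C:
			// Re-evaluate the scale-down annotation on every known node,
			// independent of workspace events.
			wc.requeueAllNodes() // assumed helper
		case <-wc.stopChan:
			// Dispose of the goroutine cleanly when the controller shuts down.
			return
		}
	}
}
```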
components/node-labeler/cmd/run.go
if ws.Status.Runtime != nil && ws.Status.Runtime.NodeName != "" {
	// Queue the node for reconciliation
	select {
	case wc.nodesToReconcile <- ws.Status.Runtime.NodeName:
When a workspace is being deleted, we won't be able to query its node name, since the object will already be gone by the time we query for it. Because of that, we instead capture node names in a channel, which is consumed any time `Reconcile` is run.
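As a rough illustration of that pattern, split into the two sides involved; the helper names (`rememberNode`, `drainQueuedNodes`, `reconcileNode`) and the buffered `nodesToReconcile` channel are assumptions based on the snippet above, not the PR's exact code.

```go
// Illustrative sketch; helper names are assumptions, not the PR's actual code.

// Event side: capture the node name while the workspace object still exists.
func (wc *WorkspaceCountController) rememberNode(ws *workspacev1.Workspace) {
	if ws.Status.Runtime == nil || ws.Status.Runtime.NodeName == "" {
		return
	}
	select {
	case wc.nodesToReconcile <- ws.Status.Runtime.NodeName:
	default:
		// Channel full; the periodic reconciliation acts as a safety net.
	}
}

// Reconcile side: drain every captured node name and re-evaluate its annotation.
func (wc *WorkspaceCountController) drainQueuedNodes(ctx context.Context) error {
	for {
		select {
		case nodeName := <-wc.nodesToReconcile:
			if err := wc.reconcileNode(ctx, nodeName); err != nil {
				return err
			}
		default:
			return nil
		}
	}
}
```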
This is now pending a test on an ephemeral workspace cluster on Gitpod.io, since replacing the
Re: metrics
Awesome, thank you @filiptronicek!
We analyzed perf results on an ephemeral cluster, compared them to node-labeler from prod, and the performance was similar for API requests (read/write).
Let's
Description
Adds a controller to `node-labeler` that checks the workspace CRD count on a workspace node; until that count reaches zero, it adds the `cluster-autoscaler.kubernetes.io/scale-down-disabled: true` annotation (and removes it once that condition no longer holds). This should prevent data loss for workspaces that take too long to back up, whose node could otherwise scale down before the backup completes.
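Conceptually, the reconcile step boils down to something like the sketch below. It assumes the controller embeds a controller-runtime client and has a `countWorkspacesOnNode` helper that lists workspace CRDs by node; these names are illustrative, not the PR's actual code.

```go
// Illustrative sketch of the core idea; not the PR's actual implementation.
const scaleDownDisabledAnnotation = "cluster-autoscaler.kubernetes.io/scale-down-disabled"

func (wc *WorkspaceCountController) reconcileNode(ctx context.Context, nodeName string) error {
	// Count workspace CRDs currently scheduled on this node (assumed helper).
	count, err := wc.countWorkspacesOnNode(ctx, nodeName)
	if err != nil {
		return err
	}

	var node corev1.Node
	if err := wc.Get(ctx, types.NamespacedName{Name: nodeName}, &node); err != nil {
		// If the node is already gone, there is nothing to annotate.
		return client.IgnoreNotFound(err)
	}

	if node.Annotations == nil {
		node.Annotations = map[string]string{}
	}

	if count > 0 {
		// Workspaces still present: keep the cluster-autoscaler from scaling this node down.
		node.Annotations[scaleDownDisabledAnnotation] = "true"
	} else {
		// No workspaces left: allow scale-down again.
		delete(node.Annotations, scaleDownDisabledAnnotation)
	}

	return wc.Update(ctx, &node)
}
```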
Related Issue(s)
Fixes CLC-1054
How to test
When a workspace is running on the preview environment (link below), you should see `cluster-autoscaler.kubernetes.io/scale-down-disabled: true` in the watched annotations; you'll see a different value when no workspace is running on the preview environment.

https://ft-node-de7200b44d5e.preview.gitpod-dev.com/workspaces