[ws-manager-mk2] Refactor metrics with EverReady condition #17114

WVerlaek · 2023-03-31T12:45:16Z

Description

Introduce an EverReady condition:

It is set once the workspace container becomes Ready, i.e. when content init has succeeded and the supervisor readiness check passes.

This condition then allows us to:

Count startup failure metrics similarly to mk1: a workspace that reaches the Stopped phase without ever becoming ready counts as a startup failure.
Avoid moving a workspace from Running back to Initializing if its container becomes unready. Once a workspace has become ready, it should not move backwards in phase.
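
A minimal sketch of the two points above, assuming a trimmed-down status type. `Workspace`, `observe` and `isStartFailure` are hypothetical names for illustration only and are not the controller's actual code; the real condition lives in the workspace status conditions.

```go
package main

import "fmt"

// Phase is a simplified stand-in for the workspace phase.
type Phase string

const (
	PhaseInitializing Phase = "Initializing"
	PhaseRunning      Phase = "Running"
	PhaseStopped      Phase = "Stopped"
)

// Workspace is a hypothetical, trimmed-down status holder.
type Workspace struct {
	Phase     Phase
	EverReady bool // sticky: set once, never cleared
}

// observe applies one container readiness observation.
// EverReady is only ever set, so the phase cannot fall back
// from Running to Initializing when the container turns unready.
func observe(ws *Workspace, containerReady bool) {
	if containerReady {
		ws.EverReady = true
	}
	if ws.EverReady {
		ws.Phase = PhaseRunning
		return
	}
	ws.Phase = PhaseInitializing
}

// isStartFailure reports whether a workspace stopped without ever becoming ready.
func isStartFailure(ws Workspace) bool {
	return ws.Phase == PhaseStopped && !ws.EverReady
}

func main() {
	ws := &Workspace{Phase: PhaseInitializing}
	observe(ws, true)  // container becomes ready
	observe(ws, false) // container becomes unready again
	fmt.Println(ws.Phase, ws.EverReady)

	fmt.Println(isStartFailure(Workspace{Phase: PhaseStopped}))
}
```

The sketch deliberately ignores every other phase transition; it only shows why a sticky condition is enough to answer both questions.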

To track workspace failures that happen after startup, a new workspace_failure_total metric is added, which is incremented whenever a workspace receives the Failed condition. This includes content init and disposal failures, but also all other possible failures.
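
As a rough illustration of "incremented whenever a workspace receives the Failed condition", the sketch below counts each workspace at most once, even if the reconciler sees the condition repeatedly. `failureTracker` and its per-workspace bookkeeping are assumptions made for this example; the PR's actual counter and its labels may be implemented differently.

```go
package main

import "fmt"

// failureTracker is a hypothetical stand-in for a Prometheus counter plus
// the bookkeeping needed to count each workspace at most once.
type failureTracker struct {
	total   int
	counted map[string]bool // workspace IDs that were already counted
}

func newFailureTracker() *failureTracker {
	return &failureTracker{counted: make(map[string]bool)}
}

// observe is called on every reconcile with the current state of the
// Failed condition. It increments only on the first observation of a failure.
func (t *failureTracker) observe(workspaceID string, failed bool) {
	if !failed || t.counted[workspaceID] {
		return
	}
	t.counted[workspaceID] = true
	t.total++
}

func main() {
	t := newFailureTracker()
	t.observe("ws-1", false) // not failed yet
	t.observe("ws-1", true)  // first time Failed is seen: counted
	t.observe("ws-1", true)  // later reconcile: not counted again
	t.observe("ws-2", true)
	fmt.Println(t.total) // 2
}
```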

The workspace_stops_total metric is also updated to match MK1 behaviour by including the stop reason label.

Related Issue(s)

Fixes WKS-29, WKS-23

How to test

Run unit tests

Or in a preview env, check a workspace starts and stops as expected, and the right metrics are incremented.

E.g. the following stop metric reasons are reported for a workspace with an image build:

gitpod_ws_manager_mk2_workspace_stops_total{class="default",reason="regular-stop",type="ImageBuild"} 1
gitpod_ws_manager_mk2_workspace_stops_total{class="default",reason="regular-stop",type="Regular"} 1

Release Notes

NONE

Documentation

Build Options:

/werft with-werft
Run the build with werft instead of GHA
leeway-no-cache
/werft no-test
Run Leeway with --dont-test

Publish Options

/werft publish-to-npm
/werft publish-to-jb-marketplace

Installer Options

with-dedicated-emulation
with-ws-manager-mk2
workspace-feature-flags
Add desired feature flags to the end of the line above, space separated

Preview Environment Options:

/werft with-local-preview
If enabled this will build install/preview
/werft with-preview
/werft with-large-vm
/werft with-gce-vm
If enabled this will create the environment on GCE infra
with-integration-tests=all
Valid options are all, workspace, webapp, ide, jetbrains, vscode, ssh

WVerlaek · 2023-03-31T13:56:21Z

components/ws-manager-mk2/controllers/metrics.go

+	if c := wsk8s.GetCondition(ws.Status.Conditions, string(workspacev1.WorkspaceConditionFailed)); c != nil {
+		reason = StopReasonFailed
+		if !wsk8s.ConditionPresentAndTrue(ws.Status.Conditions, string(workspacev1.WorkspaceConditionEverReady)) {
+			// Don't record 'failed' if there was a start failure.
+			reason = StopReasonStartFailure
+		} else if strings.Contains(c.Message, "Pod ephemeral local storage usage exceeds the total limit of containers") {
+			reason = StopReasonOutOfSpace
+		}
+	} else if wsk8s.ConditionPresentAndTrue(ws.Status.Conditions, string(workspacev1.WorkspaceConditionAborted)) {
+		reason = StopReasonAborted
+	} else if wsk8s.ConditionPresentAndTrue(ws.Status.Conditions, string(workspacev1.WorkspaceConditionTimeout)) {
+		reason = StopReasonTimeout
+	} else if wsk8s.ConditionPresentAndTrue(ws.Status.Conditions, string(workspacev1.WorkspaceConditionClosed)) {
+		reason = StopReasonTabClosed
+	} else {
+		reason = StopReasonRegular
+	}


Re-ordered these checks compared to mk1 so that, e.g., Failed is checked first. If both a Failed and a Closed condition are present, we want the failure reason to be reported instead of closed.
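
The precedence described here can be sketched in a self-contained form. This is a simplification, not the code in metrics.go: conditions are modelled as a plain `map[string]bool`, the out-of-space message check is left out, and every reason string except `regular-stop` (which appears in the metric output in the description) is an assumed value.

```go
package main

import "fmt"

// conditions maps a condition name to whether it is present and true.
// The names mirror the WorkspaceCondition* identifiers in the diff;
// their exact string values are assumed here.
type conditions map[string]bool

// stopReason resolves a single stop reason, checking Failed first so that
// a failure wins over Aborted, Timeout and Closed.
func stopReason(c conditions) string {
	switch {
	case c["Failed"] && !c["EverReady"]:
		return "start-failure" // failed before ever becoming ready
	case c["Failed"]:
		return "failed"
	case c["Aborted"]:
		return "aborted"
	case c["Timeout"]:
		return "timeout"
	case c["Closed"]:
		return "tab-closed"
	default:
		return "regular-stop"
	}
}

func main() {
	// Both Failed and Closed are set: the failure is reported.
	fmt.Println(stopReason(conditions{"EverReady": true, "Failed": true, "Closed": true}))
	// Failed without EverReady: reported as a start failure.
	fmt.Println(stopReason(conditions{"Failed": true}))
	// No special condition: a regular stop.
	fmt.Println(stopReason(conditions{"EverReady": true}))
}
```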

WVerlaek · 2023-03-31T14:04:04Z

components/ws-manager-mk2/controllers/metrics.go

+		reason = StopReasonFailed
+		if !wsk8s.ConditionPresentAndTrue(ws.Status.Conditions, string(workspacev1.WorkspaceConditionEverReady)) {
+			// Don't record 'failed' if there was a start failure.
+			reason = StopReasonStartFailure


Different from MK1: instead of skipping the stop metric when there was a start failure, we now increment it, but with its own distinct reason.

WVerlaek added 2 commits March 31, 2023 10:59

[ws-manager-mk2] Refactor metrics with EverReady condition

4114014

Fix test, default failure message

483ed43

WVerlaek self-assigned this Mar 31, 2023

roboquat added do-not-merge/work-in-progress release-note-none size/L labels Mar 31, 2023

Add stop reason metric

7254866

WVerlaek commented Mar 31, 2023

WVerlaek marked this pull request as ready for review March 31, 2023 14:04

WVerlaek requested a review from a team March 31, 2023 14:04

roboquat removed the do-not-merge/work-in-progress label Mar 31, 2023

github-actions bot added the team: workspace Issue belongs to the Workspace team label Mar 31, 2023

WVerlaek added a commit that referenced this pull request Mar 31, 2023

Undo ready status change, refactored #17114

2a16da6

roboquat pushed a commit that referenced this pull request Apr 3, 2023

[ws-manager-mk2] Extract headless task failure (WKS-18) (#17091)

f237376

* [ws-manager-mk2] Extract headless task failure * Undo ready status change, refactored #17114

gitpod-io deleted a comment from reviewpad bot Apr 4, 2023

aledbf approved these changes Apr 8, 2023

roboquat merged commit 06ec36b into main Apr 8, 2023

roboquat deleted the wv/mk2-start-failure-metrics branch April 8, 2023 09:57

roboquat added deployed: workspace Workspace team change is running in production deployed Change is completely running in production labels Apr 20, 2023
