Skip to content

[ws-manager-mk2] Refactor metrics with EverReady condition #17114

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Apr 8, 2023

Conversation

WVerlaek
Copy link
Member

@WVerlaek WVerlaek commented Mar 31, 2023

Description

Introduce an EverReady condition:

  • Gets set when the workspace container becomes Ready, i.e. when content init succeeded and the supervisor readiness check passes.

This condition then allows us to:

  • Count startup failure metrics similarly to mk1: a workspace that reaches the Stopped phase without ever becoming ready
  • Not moving a workspace from Running to Initializing if its container becomes unready. Once a workspace becomes ready it should not move backwards in phase

To track workspace failures that happen after startup, a new workspace_failure_total metric is added which gets incremented whenever a workspaces receives the Failed condition. This includes content init and disposal failures, but also all other possible failures.

The workspace_stops_total metric is also updated to match MK1 behaviour by including the stop reason label.

Related Issue(s)

Fixes WKS-29, WKS-23

How to test

Run unit tests

Or in a preview env, check a workspace starts and stops as expected, and the right metrics are incremented.

E.g. the following stop metric reasons are reported for a workspace with an image build:

gitpod_ws_manager_mk2_workspace_stops_total{class="default",reason="regular-stop",type="ImageBuild"} 1
gitpod_ws_manager_mk2_workspace_stops_total{class="default",reason="regular-stop",type="Regular"} 1

Release Notes

NONE

Documentation

Build Options:

  • /werft with-werft
    Run the build with werft instead of GHA
  • leeway-no-cache
  • /werft no-test
    Run Leeway with --dont-test
Publish Options
  • /werft publish-to-npm
  • /werft publish-to-jb-marketplace
Installer Options
  • with-dedicated-emulation
  • with-ws-manager-mk2
  • workspace-feature-flags
    Add desired feature flags to the end of the line above, space separated

Preview Environment Options:

  • /werft with-local-preview
    If enabled this will build install/preview
  • /werft with-preview
  • /werft with-large-vm
  • /werft with-gce-vm
    If enabled this will create the environment on GCE infra
  • with-integration-tests=all
    Valid options are all, workspace, webapp, ide, jetbrains, vscode, ssh

Comment on lines +163 to +179
if c := wsk8s.GetCondition(ws.Status.Conditions, string(workspacev1.WorkspaceConditionFailed)); c != nil {
reason = StopReasonFailed
if !wsk8s.ConditionPresentAndTrue(ws.Status.Conditions, string(workspacev1.WorkspaceConditionEverReady)) {
// Don't record 'failed' if there was a start failure.
reason = StopReasonStartFailure
} else if strings.Contains(c.Message, "Pod ephemeral local storage usage exceeds the total limit of containers") {
reason = StopReasonOutOfSpace
}
} else if wsk8s.ConditionPresentAndTrue(ws.Status.Conditions, string(workspacev1.WorkspaceConditionAborted)) {
reason = StopReasonAborted
} else if wsk8s.ConditionPresentAndTrue(ws.Status.Conditions, string(workspacev1.WorkspaceConditionTimeout)) {
reason = StopReasonTimeout
} else if wsk8s.ConditionPresentAndTrue(ws.Status.Conditions, string(workspacev1.WorkspaceConditionClosed)) {
reason = StopReasonTabClosed
} else {
reason = StopReasonRegular
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

re-ordered these checks compared to mk1, to e.g. check for Failure first. If there's both a failure and closed condition, we'd want the failure reason to be reported instead of closed

reason = StopReasonFailed
if !wsk8s.ConditionPresentAndTrue(ws.Status.Conditions, string(workspacev1.WorkspaceConditionEverReady)) {
// Don't record 'failed' if there was a start failure.
reason = StopReasonStartFailure
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

different from MK1: instead of not incrementing the stop metric when there was a start failure, we increment it now but with its own unique reason.

@WVerlaek WVerlaek marked this pull request as ready for review March 31, 2023 14:04
@WVerlaek WVerlaek requested a review from a team March 31, 2023 14:04
@github-actions github-actions bot added the team: workspace Issue belongs to the Workspace team label Mar 31, 2023
WVerlaek added a commit that referenced this pull request Mar 31, 2023
roboquat pushed a commit that referenced this pull request Apr 3, 2023
* [ws-manager-mk2] Extract headless task failure

* Undo ready status change, refactored #17114
@gitpod-io gitpod-io deleted a comment from reviewpad bot Apr 4, 2023
@gitpod-io gitpod-io deleted a comment from reviewpad bot Apr 4, 2023
@gitpod-io gitpod-io deleted a comment from reviewpad bot Apr 4, 2023
@roboquat roboquat merged commit 06ec36b into main Apr 8, 2023
@roboquat roboquat deleted the wv/mk2-start-failure-metrics branch April 8, 2023 09:57
@roboquat roboquat added deployed: workspace Workspace team change is running in production deployed Change is completely running in production labels Apr 20, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
deployed: workspace Workspace team change is running in production deployed Change is completely running in production release-note-none size/L team: workspace Issue belongs to the Workspace team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants