Commit 2d3c03e

[obs] introduce workspace alerts for Dedicated (#18331)
* [obs] introduce GitpodImageBuildDurationAnomaly
  Depends on gitpod-io/runbooks#417
* [obs] Introduce GitpodImageBuilderCrashlooping
  As per https://samber.github.io/awesome-prometheus-alerts/rules#rule-kubernetes-1-19
* [obs] Introduce GitpodImageBuilderReplicasMismatch
* [obs] use generic GitpodWorkspaceDeploymentCrashlooping for GitpodWsManagerCrashLoopingMk2
* Fix GitpodWsManagerCrashLoopingMk2
  To avoid false positives
* Introduce GitpodWsManagerMk2ReplicasMismatch
* Fix syntax
* Fix GitpodWorkspaceDeploymentReplicaMismatch URL
* Introduce alerts for node-labeler and ws-proxy
* Fix severity and dedicated labels
* Fix proxy references
* Exclude ephemeral clusters
* Clean-up
1 parent a5b4a66 commit 2d3c03e

4 files changed: +141 -5 lines changed

Lines changed: 49 additions & 0 deletions
@@ -0,0 +1,49 @@
+# Copyright (c) 2022 Gitpod GmbH. All rights reserved.
+# Licensed under the GNU Affero General Public License (AGPL).
+# See License.AGPL.txt in the project root for license information.
+
+apiVersion: monitoring.coreos.com/v1
+kind: PrometheusRule
+metadata:
+  labels:
+    prometheus: k8s
+    role: alert-rules
+  name: image-builder-central-monitoring-rules
+spec:
+  groups:
+  - name: image-builder-central
+    rules:
+    - alert: GitpodImageBuildDurationAnomaly
+      labels:
+        severity: critical
+        dedicated: included
+      annotations:
+        runbook_url: https://github.com/gitpod-io/runbooks/blob/main/runbooks/GitpodImageBuildDurationAnomaly.md
+        summary: image-builder duration is unusually high in cluster {{ $labels.cluster }}
+        description: Users are waiting too long for image builds
+      expr: |
+        (avg_over_time(gitpod_ws_manager_mk2_workspace_phase_total{phase="Running", type="ImageBuild", cluster!~"ephemeral.*"}[1h])-avg_over_time(gitpod_ws_manager_mk2_workspace_phase_total{phase="Running", type="ImageBuild", cluster!~"ephemeral.*"}[1d]))
+        /
+        stddev_over_time(gitpod_ws_manager_mk2_workspace_phase_total{phase="Running", type="ImageBuild", cluster!~"ephemeral.*"}[14d]) >=2.0
+    - alert: GitpodImageBuilderCrashlooping
+      labels:
+        severity: critical
+        dedicated: included
+      annotations:
+        runbook_url: https://github.com/gitpod-io/runbooks/blob/main/runbooks/GitpodWorkspaceDeploymentCrashlooping.md
+        summary: image-builder-mk3 is crash looping in cluster {{ $labels.cluster }}
+        description: Pod {{ $labels.namespace }}/{{ $labels.pod }} ({{ $labels.container }}) is restarting {{ printf "%.2f" $value }} times / 3 minutes.
+      expr: |
+        increase(kube_pod_container_status_restarts_total{container="image-builder-mk3", cluster!~"ephemeral.*"}[1m]) > 3
+      for: 3m
+    - alert: GitpodImageBuilderReplicasMismatch
+      labels:
+        severity: critical
+        dedicated: included
+      annotations:
+        runbook_url: https://github.com/gitpod-io/runbooks/blob/main/runbooks/GitpodWorkspaceDeploymentReplicaMismatch.md
+        summary: Desired number of replicas for image-builder-mk3 are not available in cluster {{ $labels.cluster }}
+        description: The mismatch is {{ printf "%.2f" $value }}
+      expr: |
+        kube_deployment_spec_replicas{container="image-builder-mk3", cluster!~"ephemeral.*"} != kube_deployment_status_replicas_available{container="image-builder-mk3", cluster!~"ephemeral.*"}
+      for: 3m
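
The GitpodImageBuildDurationAnomaly expression has the shape of a z-score: the gap between the 1h and 1d averages, divided by the 14d standard deviation, compared against 2.0. A minimal Python sketch of that arithmetic follows, assuming plain lists of samples stand in for the three range vectors. The function and parameter names are hypothetical and not part of this commit.

```python
import math
from statistics import fmean, pstdev


def anomaly_score(last_hour, last_day, last_14_days):
    """(avg over 1h - avg over 1d) / population stddev over 14d."""
    diff = fmean(last_hour) - fmean(last_day)
    # Population standard deviation, which is what stddev_over_time computes.
    spread = pstdev(last_14_days)
    if spread == 0:
        # Follow floating-point division by zero: NaN for 0/0, else +/-Inf.
        return math.nan if diff == 0 else math.copysign(math.inf, diff)
    return diff / spread


def is_anomalous(last_hour, last_day, last_14_days, threshold=2.0):
    """True when the score reaches the threshold; NaN never does."""
    return anomaly_score(last_hour, last_day, last_14_days) >= threshold
```

With a 14-day spread of 1 and a last-hour average 3 above the daily average, the score is 3.0 and the check passes the `>=2.0` threshold.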
Lines changed: 37 additions & 0 deletions
@@ -0,0 +1,37 @@
+# Copyright (c) 2022 Gitpod GmbH. All rights reserved.
+# Licensed under the GNU Affero General Public License (AGPL).
+# See License.AGPL.txt in the project root for license information.
+
+apiVersion: monitoring.coreos.com/v1
+kind: PrometheusRule
+metadata:
+  labels:
+    prometheus: k8s
+    role: alert-rules
+  name: node-labeler-central-monitoring-rules
+spec:
+  groups:
+  - name: node-labeler
+    rules:
+    - alert: GitpodNodeLabelerCrashLooping
+      labels:
+        severity: critical
+        dedicated: included
+      annotations:
+        runbook_url: https://github.com/gitpod-io/runbooks/blob/main/runbooks/GitpodWorkspaceDeploymentCrashlooping.md
+        summary: node-labeler is crashlooping in cluster {{ $labels.cluster }}.
+        description: Pod {{ $labels.namespace }}/{{ $labels.pod }} ({{ $labels.container }}) is restarting {{ printf "%.2f" $value }} times / 3 minutes.
+      expr: |
+        increase(kube_pod_container_status_restarts_total{container="node-labeler", cluster!~"ephemeral.*"}[1m]) > 3
+      for: 3m
+    - alert: GitpodNodeLabelerReplicasMismatch
+      labels:
+        severity: critical
+        dedicated: included
+      annotations:
+        runbook_url: https://github.com/gitpod-io/runbooks/blob/main/runbooks/GitpodWorkspaceDeploymentReplicaMismatch.md
+        summary: Desired number of replicas for node-labeler are not available in cluster {{ $labels.cluster }}
+        description: The mismatch is {{ printf "%.2f" $value }}
+      expr: |
+        kube_deployment_spec_replicas{container="node-labeler", cluster!~"ephemeral.*"} != kube_deployment_status_replicas_available{container="node-labeler", cluster!~"ephemeral.*"}
+      for: 3m
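
Every expression in this commit carries the matcher `cluster!~"ephemeral.*"`. PromQL regex matchers are fully anchored, so the matcher drops clusters whose whole name matches `ephemeral.*`, that is, names starting with `ephemeral`. The Python sketch below illustrates that behaviour with `re.fullmatch`; the cluster names and the `is_monitored` helper are made up for illustration.

```python
import re

# Same pattern as in the alert expressions.
EPHEMERAL = re.compile(r"ephemeral.*")


def is_monitored(cluster: str) -> bool:
    """Mimic cluster!~"ephemeral.*": keep names the anchored regex rejects."""
    # fullmatch anchors at both ends, like a PromQL regex matcher.
    return EPHEMERAL.fullmatch(cluster) is None


if __name__ == "__main__":
    for name in ("prod-eu01", "ephemeral-pr-123", "my-ephemeral"):
        print(name, is_monitored(name))
```

A name such as `my-ephemeral` stays monitored under this reading, because the pattern has to match from the first character.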

operations/observability/mixins/workspace/rules/central/ws-manager.yaml

Lines changed: 18 additions & 5 deletions
@@ -13,12 +13,25 @@ spec:
   groups:
   - name: ws-manager
     rules:
-    - alert: GitpodWsManagerCrashLoopingMk2
+    - alert: GitpodWsManagerMk2CrashLooping
       labels:
-        severity: warning
+        severity: critical
+        dedicated: included
       annotations:
-        runbook_url: https://github.com/gitpod-io/runbooks/blob/main/runbooks/GitpodWsManagerCrashLooping.md
+        runbook_url: https://github.com/gitpod-io/runbooks/blob/main/runbooks/GitpodWorkspaceDeploymentCrashlooping.md
         summary: ws-manager-mk2 is crashlooping in cluster {{ $labels.cluster }}.
-        description: Pod {{ $labels.namespace }}/{{ $labels.pod }} ({{ $labels.container }}) is restarting {{ printf "%.2f" $value }} times / 10 minutes.
+        description: Pod {{ $labels.namespace }}/{{ $labels.pod }} ({{ $labels.container }}) is restarting {{ printf "%.2f" $value }} times / 3 minutes.
       expr: |
-        increase(kube_pod_container_status_restarts_total{container="ws-manager-mk2", cluster!~"ephemeral.*"}[10m]) > 0
+        increase(kube_pod_container_status_restarts_total{container="ws-manager-mk2", cluster!~"ephemeral.*"}[1m]) > 3
+      for: 3m
+    - alert: GitpodWsManagerMk2ReplicasMismatch
+      labels:
+        severity: critical
+        dedicated: included
+      annotations:
+        runbook_url: https://github.com/gitpod-io/runbooks/blob/main/runbooks/GitpodWorkspaceDeploymentReplicaMismatch.md
+        summary: Desired number of replicas for ws-manager-mk2 are not available in cluster {{ $labels.cluster }}
+        description: The mismatch is {{ printf "%.2f" $value }}
+      expr: |
+        kube_deployment_spec_replicas{container="ws-manager-mk2", cluster!~"ephemeral.*"} != kube_deployment_status_replicas_available{container="ws-manager-mk2", cluster!~"ephemeral.*"}
+      for: 3m
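
This hunk adds `for: 3m` to the crash-loop rule, so a single evaluation where the expression holds no longer fires the alert. The condition has to stay true across consecutive evaluations for at least that long. A rough Python sketch of the pending/firing transition, assuming a fixed evaluation interval and a simplified state model (the `alert_state` helper and its arguments are hypothetical):

```python
def alert_state(condition_history, eval_interval_s, for_s):
    """Return 'inactive', 'pending' or 'firing' after the last evaluation.

    condition_history holds one boolean per rule evaluation, oldest first.
    """
    active_since = None
    state = "inactive"
    for i, holds in enumerate(condition_history):
        now = i * eval_interval_s
        if not holds:
            # Any evaluation where the expression is empty resets the timer.
            active_since = None
            state = "inactive"
            continue
        if active_since is None:
            active_since = now
        state = "firing" if now - active_since >= for_s else "pending"
    return state


if __name__ == "__main__":
    # 30 s evaluation interval, for: 3m
    print(alert_state([True] * 6, 30, 180))
    print(alert_state([True] * 7, 30, 180))
```

With a 30-second interval, six consecutive true evaluations leave the alert pending and the seventh moves it to firing; one false evaluation in between restarts the count.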
Lines changed: 37 additions & 0 deletions
@@ -0,0 +1,37 @@
+# Copyright (c) 2022 Gitpod GmbH. All rights reserved.
+# Licensed under the GNU Affero General Public License (AGPL).
+# See License.AGPL.txt in the project root for license information.
+
+apiVersion: monitoring.coreos.com/v1
+kind: PrometheusRule
+metadata:
+  labels:
+    prometheus: k8s
+    role: alert-rules
+  name: ws-proxy-central-monitoring-rules
+spec:
+  groups:
+  - name: ws-proxy
+    rules:
+    - alert: GitpodWsProxyCrashLooping
+      labels:
+        severity: critical
+        dedicated: included
+      annotations:
+        runbook_url: https://github.com/gitpod-io/runbooks/blob/main/runbooks/GitpodWorkspaceDeploymentCrashlooping.md
+        summary: ws-proxy is crashlooping in cluster {{ $labels.cluster }}.
+        description: Pod {{ $labels.namespace }}/{{ $labels.pod }} ({{ $labels.container }}) is restarting {{ printf "%.2f" $value }} times / 3 minutes.
+      expr: |
+        increase(kube_pod_container_status_restarts_total{container="ws-proxy", cluster!~"ephemeral.*"}[1m]) > 3
+      for: 3m
+    - alert: GitpodWsProxyReplicasMismatch
+      labels:
+        severity: critical
+        dedicated: included
+      annotations:
+        runbook_url: https://github.com/gitpod-io/runbooks/blob/main/runbooks/GitpodWorkspaceDeploymentReplicaMismatch.md
+        summary: Desired number of replicas for ws-proxy are not available in cluster {{ $labels.cluster }}
+        description: The mismatch is {{ printf "%.2f" $value }}
+      expr: |
+        kube_deployment_spec_replicas{container="ws-proxy", cluster!~"ephemeral.*"} != kube_deployment_status_replicas_available{container="ws-proxy", cluster!~"ephemeral.*"}
+      for: 3m
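
The ReplicasMismatch rules compare two instant vectors with `!=`. Without the `bool` modifier, a PromQL comparison acts as a filter: series with identical label sets are paired up, and where the values differ the left-hand sample is kept. A small Python sketch of that filtering, assuming dictionaries keyed by a label tuple stand in for the two vectors (the keys and the `replicas_mismatch` helper are invented for illustration):

```python
def replicas_mismatch(spec, available):
    """Mimic `spec != available` on two instant vectors.

    Both arguments map a label set (here a tuple) to a sample value.
    Returns the left-hand (spec) samples whose paired value differs.
    """
    return {
        labels: desired
        for labels, desired in spec.items()
        # Series without a partner on the other side drop out of the result.
        if labels in available and available[labels] != desired
    }


if __name__ == "__main__":
    spec = {("cluster-a", "ws-proxy"): 3, ("cluster-b", "ws-proxy"): 2}
    available = {("cluster-a", "ws-proxy"): 1, ("cluster-b", "ws-proxy"): 2}
    print(replicas_mismatch(spec, available))
```

Under this model the value that reaches `{{ $value }}` is the desired replica count from the left-hand side (3 in the example), not the difference between desired and available.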
