
Bug 1763293: pkg/operator/sync: Track lastError in waitForDeploymentRollout #417


Conversation

wking
Member

@wking wking commented Oct 18, 2019

Because otherwise stuck deployments will result in the not-very-useful "timed out waiting for the condition" errors like:

Oct 17 18:41:52.205 E clusteroperator/machine-api changed Degraded to True: SyncingFailed: Failed when progressing towards operator: 4.3.0-0.ci-2019-10-17-173803 because timed out waiting for the condition

Also use %s instead of %q for formatting the deployment name, because we control the names being monitored and they don't contain whitespace or other potentially-confusing characters.
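To make the fix concrete, here is a minimal sketch of the lastError pattern, written against the pre-context client-go API that was current at the time. The poll intervals and the exact readiness check are assumptions for illustration, not the literal contents of pkg/operator/sync:

    import (
        "fmt"
        "time"

        appsv1 "k8s.io/api/apps/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/util/wait"
        "k8s.io/client-go/kubernetes"
    )

    func waitForDeploymentRollout(client kubernetes.Interface, d *appsv1.Deployment) error {
        var lastError error
        err := wait.Poll(5*time.Second, 5*time.Minute, func() (bool, error) {
            current, getErr := client.AppsV1().Deployments(d.Namespace).Get(d.Name, metav1.GetOptions{})
            if getErr != nil {
                lastError = fmt.Errorf("getting Deployment %s during rollout: %v", d.Name, getErr)
                return false, nil // transient; keep polling, but remember why this attempt failed
            }
            if current.Status.UpdatedReplicas == current.Status.Replicas &&
                current.Status.ReadyReplicas == current.Status.Replicas &&
                current.Status.UnavailableReplicas == 0 {
                lastError = nil
                return true, nil
            }
            // Track a descriptive error so a timeout reports more than
            // "timed out waiting for the condition".
            lastError = fmt.Errorf("deployment %s is not ready. status: (replicas: %d, updated: %d, ready: %d, unavailable: %d)",
                current.Name, current.Status.Replicas, current.Status.UpdatedReplicas,
                current.Status.ReadyReplicas, current.Status.UnavailableReplicas)
            return false, nil
        })
        if err != nil && lastError != nil {
            return lastError // surface the descriptive error instead of the generic timeout
        }
        return err
    }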

@openshift-ci-robot openshift-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Oct 18, 2019
@wking wking force-pushed the waitForDeploymentRollout-lastError branch from 8d44b86 to 5cd8ed4 on October 18, 2019 08:23
@enxebre
Member

enxebre commented Oct 18, 2019

thanks!
/approve

@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: enxebre

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 18, 2019
@smarterclayton
Contributor

Please backport to 4.2 as well

@wking wking changed the title pkg/operator/sync: Track lastError in waitForDeploymentRollout Bug 1763293: pkg/operator/sync: Track lastError in waitForDeploymentRollout Oct 18, 2019
@openshift-ci-robot
Contributor

@wking: This pull request references Bugzilla bug 1763293, which is invalid:

  • expected the bug to target the "4.3.0" release, but it targets "---" instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

Bug 1763293: pkg/operator/sync: Track lastError in waitForDeploymentRollout

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. label Oct 18, 2019
@wking
Member Author

wking commented Oct 18, 2019

/bugzilla refresh
/cherrypick release-4.1

@openshift-cherrypick-robot

@wking: once the present PR merges, I will cherry-pick it on top of release-4.1 in a new PR and assign it to you.

In response to this:

/bugzilla refresh
/cherrypick release-4.1

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot
Contributor

@wking: This pull request references Bugzilla bug 1763293, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

/bugzilla refresh
/cherrypick release-4.1

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. and removed bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. labels Oct 18, 2019
@wking
Member Author

wking commented Oct 18, 2019

/cherrypick release-4.2

@openshift-cherrypick-robot

@wking: once the present PR merges, I will cherry-pick it on top of release-4.2 in a new PR and assign it to you.

In response to this:

/cherrypick release-4.2

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@wking
Member Author

wking commented Oct 18, 2019

All green; just needs a /lgtm 😇

@wking
Member Author

wking commented Oct 18, 2019

And here we are in action from CI:

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_machine-api-operator/417/pull-ci-openshift-machine-api-operator-master-e2e-aws-upgrade/549/artifacts/e2e-aws-upgrade/container-logs/test.log | grep 'clusteroperator/machine-api changed Degraded'
Oct 18 09:09:36.913 E clusteroperator/machine-api changed Degraded to True: SyncingFailed: Failed when progressing towards operator: 0.0.1-2019-10-18-082547 because deployment machine-api-controllers is not ready. status: (replicas: 2, updated: 1, ready: 1, unavailable: 1)
Oct 18 09:09:36.937 W clusteroperator/machine-api changed Degraded to False

@wking
Member Author

wking commented Oct 18, 2019

Also, replicas: 2, updated: 1, ready: 1, unavailable: 1 shouldn't be Degraded, that's just a healthy upgrade (cf. openshift/cluster-dns-operator#134), but we can circle back and adjust stuff like that in follow-up work.

@wking
Member Author

wking commented Oct 18, 2019

/hold

Wait, no. I need to clear the error :p

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 18, 2019
Because otherwise stuck deployments will result in the not-very-useful
"timed out waiting for the condition" errors like [1]:

  Oct 17 18:41:52.205 E clusteroperator/machine-api changed Degraded to True: SyncingFailed: Failed when progressing towards operator: 4.3.0-0.ci-2019-10-17-173803 because timed out waiting for the condition

Also use %s instead of %q for formatting the deployment name, because
we control the names being monitored and they don't contain whitespace
or other potentially-confusing characters.

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/8809
@wking wking force-pushed the waitForDeploymentRollout-lastError branch from 5cd8ed4 to a68bba5 on October 18, 2019 17:25
@wking
Member Author

wking commented Oct 18, 2019

Wait, no. I need to clear the error :p

Fixed with 5cd8ed4 -> a68bba5.

/hold cancel

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 18, 2019
@wking
Member Author

wking commented Oct 18, 2019

e2e-aws:

...Error waiting for instance (i-0886fce3591fc4d09) to become ready...

/retest

@openshift-ci-robot
Contributor

openshift-ci-robot commented Oct 18, 2019

@wking: The following test failed, say /retest to rerun them all:

Test name | Commit | Details | Rerun command
ci/prow/e2e-aws-scaleup-rhel7 | a68bba5 | link | /test e2e-aws-scaleup-rhel7

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@wking
Member Author

wking commented Oct 18, 2019

Green again :)

@enxebre
Member

enxebre commented Oct 21, 2019

Also, replicas: 2, updated: 1, ready: 1, unavailable: 1 shouldn't be Degraded, that's just a healthy upgrade (cf. openshift/cluster-dns-operator#134)

mm, the deployment is given a timeout to roll out: https://github.com/openshift/machine-api-operator/pull/417/files#diff-fa45321336db7ad1cedc28bf643a4f97R117. It's only Degraded if it does not succeed in rolling out within that timeframe.

but we can circle back and adjust stuff like that in follow-up work.

Sounds good. FWIW we also need to clear this up: https://bugzilla.redhat.com/show_bug.cgi?id=1758616

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Oct 21, 2019
@openshift-merge-robot openshift-merge-robot merged commit cc22f61 into openshift:master Oct 21, 2019
@openshift-ci-robot
Contributor

@wking: All pull requests linked via external trackers have merged. Bugzilla bug 1763293 has been moved to the MODIFIED state.

In response to this:

Bug 1763293: pkg/operator/sync: Track lastError in waitForDeploymentRollout

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-cherrypick-robot

@wking: new pull request created: #419

In response to this:

/bugzilla refresh
/cherrypick release-4.1

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-cherrypick-robot

@wking: new pull request created: #420

In response to this:

/cherrypick release-4.2

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@ironcladlou

Sorry to raise this from the dead, but I stumbled across the PR by accident and noticed something confusing:

Deployment failed status already accounts for time using progressDeadlineSeconds and this is encoded in the failed condition — it’s not clear to me what using another timeout layer when waiting for the deployment achieves. What happens if the wait timeout in waitForDeploymentRollout is less than the deployment progressDeadlineSeconds and outer wait timeout is reached first? (Seems like deployment would prematurely and erroneously be given up on?)

Instead I guess I would expect to wait indefinitely for the deployment to reach either a rolled out state or the terminal failed state, and trust either way that timeout has been accounted for at the k8s deployment primitive layer.
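A rough sketch of that alternative shape, assuming the caller watches or polls the Deployment with no deadline of its own and treats the controller-set Progressing condition (driven by progressDeadlineSeconds) as the terminal failure signal; getDeploymentCondition and deploymentRolloutDone are illustrative helper names, not code that exists in this repo:

    import (
        "fmt"

        appsv1 "k8s.io/api/apps/v1"
    )

    // getDeploymentCondition returns the status condition of the given type, if present.
    func getDeploymentCondition(status appsv1.DeploymentStatus, condType appsv1.DeploymentConditionType) *appsv1.DeploymentCondition {
        for i := range status.Conditions {
            if status.Conditions[i].Type == condType {
                return &status.Conditions[i]
            }
        }
        return nil
    }

    // deploymentRolloutDone reports whether the rollout completed, or returns an
    // error once the deployment controller itself declares the rollout failed.
    func deploymentRolloutDone(d *appsv1.Deployment) (bool, error) {
        if cond := getDeploymentCondition(d.Status, appsv1.DeploymentProgressing); cond != nil &&
            cond.Reason == "ProgressDeadlineExceeded" {
            // Terminal: the deployment controller gave up after progressDeadlineSeconds,
            // so no second timeout layer is needed on the operator side.
            return false, fmt.Errorf("deployment %s exceeded its progress deadline", d.Name)
        }
        done := d.Status.UpdatedReplicas == d.Status.Replicas &&
            d.Status.ReadyReplicas == d.Status.Replicas &&
            d.Status.UnavailableReplicas == 0
        return done, nil
    }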

cc @wking
