
Bug 1763293: pkg/operator/sync: Track lastError in waitForDeploymentRollout #417


Conversation

wking
Member

@wking wking commented Oct 18, 2019

Because otherwise stuck deployments will result in the not-very-useful "timed out waiting for the condition" errors like:

Oct 17 18:41:52.205 E clusteroperator/machine-api changed Degraded to True: SyncingFailed: Failed when progressing towards operator: 4.3.0-0.ci-2019-10-17-173803 because timed out waiting for the condition

Also use %s instead of %q for formatting the deployment name, because we control the names being monitored and they don't contain whitespace or other potentially-confusing characters.
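To make the fix concrete, here is a minimal sketch of the lastError pattern, written against the pre-context client-go API that was current at the time. The poll intervals and the exact readiness check are assumptions for illustration, not the literal contents of pkg/operator/sync:

    import (
        "fmt"
        "time"

        appsv1 "k8s.io/api/apps/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/util/wait"
        "k8s.io/client-go/kubernetes"
    )

    func waitForDeploymentRollout(client kubernetes.Interface, d *appsv1.Deployment) error {
        var lastError error
        err := wait.Poll(5*time.Second, 5*time.Minute, func() (bool, error) {
            current, getErr := client.AppsV1().Deployments(d.Namespace).Get(d.Name, metav1.GetOptions{})
            if getErr != nil {
                lastError = fmt.Errorf("getting Deployment %s during rollout: %v", d.Name, getErr)
                return false, nil // transient; keep polling, but remember why this attempt failed
            }
            if current.Status.UpdatedReplicas == current.Status.Replicas &&
                current.Status.ReadyReplicas == current.Status.Replicas &&
                current.Status.UnavailableReplicas == 0 {
                lastError = nil
                return true, nil
            }
            // Track a descriptive error so a timeout reports more than
            // "timed out waiting for the condition".
            lastError = fmt.Errorf("deployment %s is not ready. status: (replicas: %d, updated: %d, ready: %d, unavailable: %d)",
                current.Name, current.Status.Replicas, current.Status.UpdatedReplicas,
                current.Status.ReadyReplicas, current.Status.UnavailableReplicas)
            return false, nil
        })
        if err != nil && lastError != nil {
            return lastError // surface the descriptive error instead of the generic timeout
        }
        return err
    }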

@openshift-ci-robot openshift-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Oct 18, 2019
@wking wking force-pushed the waitForDeploymentRollout-lastError branch from 8d44b86 to 5cd8ed4 on October 18, 2019 08:23
@enxebre
Member

enxebre commented Oct 18, 2019

thanks!
/approve

@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: enxebre

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 18, 2019
@smarterclayton
Contributor

Please backport to 4.2 as well

@wking wking changed the title pkg/operator/sync: Track lastError in waitForDeploymentRollout Bug 1763293: pkg/operator/sync: Track lastError in waitForDeploymentRollout Oct 18, 2019
@openshift-ci-robot
Contributor

@wking: This pull request references Bugzilla bug 1763293, which is invalid:

  • expected the bug to target the "4.3.0" release, but it targets "---" instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

Bug 1763293: pkg/operator/sync: Track lastError in waitForDeploymentRollout

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. label Oct 18, 2019
@wking
Member Author

wking commented Oct 18, 2019

/bugzilla refresh
/cherrypick release-4.1

@openshift-cherrypick-robot

@wking: once the present PR merges, I will cherry-pick it on top of release-4.1 in a new PR and assign it to you.

In response to this:

/bugzilla refresh
/cherrypick release-4.1

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot
Contributor

@wking: This pull request references Bugzilla bug 1763293, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

/bugzilla refresh
/cherrypick release-4.1

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. and removed bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. labels Oct 18, 2019
@wking
Member Author

wking commented Oct 18, 2019

/cherrypick release-4.2

@openshift-cherrypick-robot

@wking: once the present PR merges, I will cherry-pick it on top of release-4.2 in a new PR and assign it to you.

In response to this:

/cherrypick release-4.2

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@wking
Member Author

wking commented Oct 18, 2019

All green; just needs a /lgtm 😇

@wking
Member Author

wking commented Oct 18, 2019

And here we are in action from CI:

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_machine-api-operator/417/pull-ci-openshift-machine-api-operator-master-e2e-aws-upgrade/549/artifacts/e2e-aws-upgrade/container-logs/test.log | grep 'clusteroperator/machine-api changed Degraded'
Oct 18 09:09:36.913 E clusteroperator/machine-api changed Degraded to True: SyncingFailed: Failed when progressing towards operator: 0.0.1-2019-10-18-082547 because deployment machine-api-controllers is not ready. status: (replicas: 2, updated: 1, ready: 1, unavailable: 1)
Oct 18 09:09:36.937 W clusteroperator/machine-api changed Degraded to False

@wking
Member Author

wking commented Oct 18, 2019

Also, replicas: 2, updated: 1, ready: 1, unavailable: 1 shouldn't be Degraded, that's just a healthy upgrade (cf. openshift/cluster-dns-operator#134), but we can circle back and adjust stuff like that in follow-up work.

@wking
Member Author

wking commented Oct 18, 2019

/hold

Wait, no. I need to clear the error :p

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 18, 2019
Because otherwise stuck deployments will result in the not-very-useful
"timed out waiting for the condition" errors like [1]:

  Oct 17 18:41:52.205 E clusteroperator/machine-api changed Degraded to True: SyncingFailed: Failed when progressing towards operator: 4.3.0-0.ci-2019-10-17-173803 because timed out waiting for the condition

Also use %s instead of %q for formatting the deployment name, because
we control the names being monitored and they don't contain whitespace
or other potentially-confusing characters.

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/8809
@wking wking force-pushed the waitForDeploymentRollout-lastError branch from 5cd8ed4 to a68bba5 on October 18, 2019 17:25
@wking
Member Author

wking commented Oct 18, 2019

Wait, no. I need to clear the error :p

Fixed with 5cd8ed4 -> a68bba5.

/hold cancel

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 18, 2019
@wking
Member Author

wking commented Oct 18, 2019

e2e-aws:

...Error waiting for instance (i-0886fce3591fc4d09) to become ready...

/retest

@openshift-ci-robot
Contributor

openshift-ci-robot commented Oct 18, 2019

@wking: The following test failed, say /retest to rerun them all:

Test name | Commit | Details | Rerun command
ci/prow/e2e-aws-scaleup-rhel7 | a68bba5 | link | /test e2e-aws-scaleup-rhel7

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@wking
Member Author

wking commented Oct 18, 2019

Green again :)

@enxebre
Member

enxebre commented Oct 21, 2019

Also, replicas: 2, updated: 1, ready: 1, unavailable: 1 shouldn't be Degraded, that's just a healthy upgrade (cf. openshift/cluster-dns-operator#134)

mm, the deployment is given a timeout to roll out: https://github.com/openshift/machine-api-operator/pull/417/files#diff-fa45321336db7ad1cedc28bf643a4f97R117. It's only Degraded if it does not succeed in rolling out within that timeframe.

but we can circle back and adjust stuff like that in follow-up work.

Sounds good. FWIW we also need to clear this up: https://bugzilla.redhat.com/show_bug.cgi?id=1758616

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Oct 21, 2019
@openshift-merge-robot openshift-merge-robot merged commit cc22f61 into openshift:master Oct 21, 2019
@openshift-ci-robot
Contributor

@wking: All pull requests linked via external trackers have merged. Bugzilla bug 1763293 has been moved to the MODIFIED state.

In response to this:

Bug 1763293: pkg/operator/sync: Track lastError in waitForDeploymentRollout

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-cherrypick-robot

@wking: new pull request created: #419

In response to this:

/bugzilla refresh
/cherrypick release-4.1

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-cherrypick-robot

@wking: new pull request created: #420

In response to this:

/cherrypick release-4.2

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@ironcladlou

Sorry to raise this from the dead, but I stumbled across the PR by accident and noticed something confusing:

Deployment failed status already accounts for time using progressDeadlineSeconds and this is encoded in the failed condition — it’s not clear to me what using another timeout layer when waiting for the deployment achieves. What happens if the wait timeout in waitForDeploymentRollout is less than the deployment progressDeadlineSeconds and outer wait timeout is reached first? (Seems like deployment would prematurely and erroneously be given up on?)

Instead I guess I would expect to wait indefinitely for the deployment to reach either a rolled out state or the terminal failed state, and trust either way that timeout has been accounted for at the k8s deployment primitive layer.
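A rough sketch of that alternative shape, assuming the caller watches or polls the Deployment with no deadline of its own and treats the controller-set Progressing condition (driven by progressDeadlineSeconds) as the terminal failure signal; getDeploymentCondition and deploymentRolloutDone are illustrative helper names, not code that exists in this repo:

    import (
        "fmt"

        appsv1 "k8s.io/api/apps/v1"
    )

    // getDeploymentCondition returns the status condition of the given type, if present.
    func getDeploymentCondition(status appsv1.DeploymentStatus, condType appsv1.DeploymentConditionType) *appsv1.DeploymentCondition {
        for i := range status.Conditions {
            if status.Conditions[i].Type == condType {
                return &status.Conditions[i]
            }
        }
        return nil
    }

    // deploymentRolloutDone reports whether the rollout completed, or returns an
    // error once the deployment controller itself declares the rollout failed.
    func deploymentRolloutDone(d *appsv1.Deployment) (bool, error) {
        if cond := getDeploymentCondition(d.Status, appsv1.DeploymentProgressing); cond != nil &&
            cond.Reason == "ProgressDeadlineExceeded" {
            // Terminal: the deployment controller gave up after progressDeadlineSeconds,
            // so no second timeout layer is needed on the operator side.
            return false, fmt.Errorf("deployment %s exceeded its progress deadline", d.Name)
        }
        done := d.Status.UpdatedReplicas == d.Status.Replicas &&
            d.Status.ReadyReplicas == d.Status.Replicas &&
            d.Status.UnavailableReplicas == 0
        return done, nil
    }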

cc @wking
