Bug 1763293: pkg/operator/sync: Track lastError in waitForDeploymentRollout #417
Conversation
Force-pushed from 8d44b86 to 5cd8ed4
thanks!
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: enxebre

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
Please backport to 4.2 as well
@wking: This pull request references Bugzilla bug 1763293, which is invalid:
Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.
/bugzilla refresh
@wking: once the present PR merges, I will cherry-pick it on top of release-4.1 in a new PR and assign it to you.
@wking: This pull request references Bugzilla bug 1763293, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.
/cherrypick release-4.2
@wking: once the present PR merges, I will cherry-pick it on top of release-4.2 in a new PR and assign it to you.
All green; just needs a lgtm.
And here we are in action from CI:

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_machine-api-operator/417/pull-ci-openshift-machine-api-operator-master-e2e-aws-upgrade/549/artifacts/e2e-aws-upgrade/container-logs/test.log | grep 'clusteroperator/machine-api changed Degraded'
Oct 18 09:09:36.913 E clusteroperator/machine-api changed Degraded to True: SyncingFailed: Failed when progressing towards operator: 0.0.1-2019-10-18-082547 because deployment machine-api-controllers is not ready. status: (replicas: 2, updated: 1, ready: 1, unavailable: 1)
Oct 18 09:09:36.937 W clusteroperator/machine-api changed Degraded to False
Also,
/hold Wait, no. I need to clear the error :p
Because otherwise stuck deployments will result in the not-very-useful "timed out waiting for the condition" errors like [1]:

Oct 17 18:41:52.205 E clusteroperator/machine-api changed Degraded to True: SyncingFailed: Failed when progressing towards operator: 4.3.0-0.ci-2019-10-17-173803 because timed out waiting for the condition

Also use %s instead of %q for formatting the deployment name, because we control the names being monitored and they don't contain whitespace or other potentially-confusing characters.

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/8809
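For context, here is a minimal sketch of the approach the commit message describes, assuming a lister-backed Operator and illustrative poll constants rather than the operator's exact code: the poll callback records the most recent readiness problem in lastError, so a timeout surfaces that reason instead of wait.Poll's generic "timed out waiting for the condition".

```go
package operator

import (
	"fmt"
	"time"

	appsv1 "k8s.io/api/apps/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/util/wait"
	appslisterv1 "k8s.io/client-go/listers/apps/v1"
)

// Illustrative values; the operator's real constants may differ.
const (
	deploymentRolloutPollInterval = time.Second
	deploymentRolloutTimeout      = 5 * time.Minute
)

// Operator is trimmed down to the one dependency this sketch needs.
type Operator struct {
	deployLister appslisterv1.DeploymentLister
}

// waitForDeploymentRollout polls until the deployment reports a finished
// rollout, remembering the most recent reason it was not ready so that a
// timeout returns that reason instead of wait.Poll's generic error.
func (optr *Operator) waitForDeploymentRollout(resource *appsv1.Deployment) error {
	var lastError error
	err := wait.Poll(deploymentRolloutPollInterval, deploymentRolloutTimeout, func() (bool, error) {
		d, getErr := optr.deployLister.Deployments(resource.Namespace).Get(resource.Name)
		if apierrors.IsNotFound(getErr) {
			// Keep waiting; the deployment may not have reached the cache yet.
			lastError = fmt.Errorf("deployment %s is not found", resource.Name)
			return false, nil
		}
		if getErr != nil {
			// Transient API errors are retried, but remembered.
			lastError = fmt.Errorf("getting deployment %s failed: %v", resource.Name, getErr)
			return false, nil
		}
		if d.Generation <= d.Status.ObservedGeneration &&
			d.Status.UpdatedReplicas == d.Status.Replicas &&
			d.Status.UnavailableReplicas == 0 {
			lastError = nil
			return true, nil
		}
		lastError = fmt.Errorf("deployment %s is not ready. status: (replicas: %d, updated: %d, ready: %d, unavailable: %d)",
			d.Name, d.Status.Replicas, d.Status.UpdatedReplicas, d.Status.ReadyReplicas, d.Status.UnavailableReplicas)
		return false, nil
	})
	if lastError != nil {
		return lastError
	}
	return err
}
```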
Force-pushed from 5cd8ed4 to a68bba5
/retest
@wking: The following test failed, say /retest to rerun all failed tests:

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.
Green again :)
mm, the deployment is given a timeout to roll out: https://github.com/openshift/machine-api-operator/pull/417/files#diff-fa45321336db7ad1cedc28bf643a4f97R117. It only goes Degraded if the rollout does not succeed within that timeframe.
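To make that flow concrete, here is a hypothetical sketch of how a sync step could surface the rollout error as the Degraded message seen in the CI log above. The helper names (syncDeployment, waitForRollout, setDegraded, setAvailable) are illustrative stand-ins, not necessarily the operator's actual code:

```go
package operator

import (
	appsv1 "k8s.io/api/apps/v1"
)

// syncDeployment waits for the rollout within the operator's timeout window
// and only degrades the clusteroperator if the rollout fails inside that
// window, passing along the last observed rollout status as the message.
func syncDeployment(
	d *appsv1.Deployment,
	waitForRollout func(*appsv1.Deployment) error,
	setDegraded func(message string) error,
	setAvailable func() error,
) error {
	if err := waitForRollout(d); err != nil {
		// e.g. "deployment machine-api-controllers is not ready. status: (replicas: 2, updated: 1, ready: 1, unavailable: 1)"
		if statusErr := setDegraded(err.Error()); statusErr != nil {
			return statusErr
		}
		return err
	}
	return setAvailable()
}
```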
Sounds good. fwiw we also need to clear this up: https://bugzilla.redhat.com/show_bug.cgi?id=1758616

/lgtm
@wking: All pull requests linked via external trackers have merged. Bugzilla bug 1763293 has been moved to the MODIFIED state.
@wking: new pull request created: #419
@wking: new pull request created: #420
Sorry to raise this from the dead, but I stumbled across the PR by accident and noticed something confusing: Deployment failed status already accounts for time using progressDeadlineSeconds, and this is encoded in the failed condition, so it's not clear to me what another timeout layer when waiting for the deployment achieves. What happens if the wait timeout in waitForDeploymentRollout is less than the deployment's progressDeadlineSeconds and the outer wait timeout is reached first? (It seems like the deployment would be given up on prematurely and erroneously.) Instead, I would expect to wait indefinitely for the deployment to reach either a rolled-out state or the terminal failed state, and trust either way that the timeout has been accounted for at the k8s Deployment primitive layer. cc @wking
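For reference, a minimal sketch of the terminal-failure signal this comment refers to, using only the upstream apps/v1 types (the helper name is illustrative): once progressDeadlineSeconds elapses, the deployment controller sets the Progressing condition to False with reason ProgressDeadlineExceeded, so a caller waiting without its own timeout could treat that condition as the point to give up and report the condition's message.

```go
package operator

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
)

// deploymentFailed reports whether the deployment controller has given up on
// the rollout: progressDeadlineSeconds elapsed and the Progressing condition
// was set to False with reason ProgressDeadlineExceeded.
func deploymentFailed(d *appsv1.Deployment) (bool, string) {
	for _, c := range d.Status.Conditions {
		if c.Type == appsv1.DeploymentProgressing &&
			c.Status == corev1.ConditionFalse &&
			c.Reason == "ProgressDeadlineExceeded" {
			return true, c.Message
		}
	}
	return false, ""
}
```

Whether to rely on that condition alone or to keep an outer timeout as a safety net is exactly the trade-off being questioned here.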