Bug 1921264: Fail InstallPlan on bundle unpack timeout #78
The InstallPlan sync can stay stalled on the Installing phase if the bundle cannot be successfully unpacked. Adding a configurable timeout for the duration of the bundle unpack Job helps identify if an unpack Job is stalled, and the InstallPlan is then transitioned to Failed with the unpack Job's failure condition propagated to the InstallPlan condition. Upstream-commit: 27ced4137bc4c7637b9c36f95946fe53cefe0e3d Upstream-repository: operator-lifecycle-manager
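As a rough illustration of the timeout described in this commit, the sketch below turns a configurable duration into a Job deadline in whole seconds. `unpackJobSpec` and `withUnpackTimeout` are hypothetical names; the struct only mirrors the `ActiveDeadlineSeconds` field of the Kubernetes `batch/v1` `JobSpec` and is not OLM's actual code.

```go
package main

import (
	"fmt"
	"time"
)

// unpackJobSpec is a hypothetical stand-in for the part of a Kubernetes
// batch/v1 JobSpec that matters here; only the deadline field is modeled.
type unpackJobSpec struct {
	ActiveDeadlineSeconds *int64
}

// withUnpackTimeout returns a copy of spec with the deadline derived from
// timeout. A non-positive timeout leaves the deadline unset, which this
// sketch assumes to mean "no timeout".
func withUnpackTimeout(spec unpackJobSpec, timeout time.Duration) unpackJobSpec {
	if timeout <= 0 {
		spec.ActiveDeadlineSeconds = nil
		return spec
	}
	secs := int64(timeout / time.Second)
	if secs < 1 {
		// Round sub-second timeouts up so the deadline stays positive.
		secs = 1
	}
	spec.ActiveDeadlineSeconds = &secs
	return spec
}

func main() {
	spec := withUnpackTimeout(unpackJobSpec{}, 10*time.Minute)
	fmt.Println("deadline seconds:", *spec.ActiveDeadlineSeconds)
}
```

Once the deadline passes, Kubernetes marks the Job as failed, and that failure condition is what can then be propagated to the InstallPlan.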
Upstream-commit: 048cdb46bf4a2d313bea91b6d61760b7cb993f23 Upstream-repository: operator-lifecycle-manager
Instead of checking a specific error type when the bundle unpack Job fails, a new BundleLookupFailure condition is set to indicate a failed bundle lookup. The InstallPlan is transitioned to Failed based on the presence of this condition. While the InstallPlan is waiting on a BundleLookupPending condition, the message for that condition is also updated with the init container statuses of the unpack pods, since that can surface ImagePullErrs when the lookup is stalled on a non-existent bundle image. The bundle unpack Job's BackoffLimit is also set to a lower value to fail fast on repeated crashes, and the pod restart policy is set to Never to preserve the container logs after the Job is terminated via the backoff limit. Upstream-commit: 260bc91988bb2e4c61fc93cb3bcfecb9af43d6f9 Upstream-repository: operator-lifecycle-manager
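The condition-based transition in this commit can be sketched roughly as follows. All types and the `nextPhase` helper are simplified, hypothetical stand-ins rather than OLM's API; the condition names are taken from the commit message, and the message string in `main` is made up for illustration.

```go
package main

import "fmt"

// Condition type names as used in the commit message above.
const (
	bundleLookupPending = "BundleLookupPending"
	bundleLookupFailure = "BundleLookupFailure"
)

// condition and bundleLookup are hypothetical, simplified stand-ins for
// the InstallPlan status types.
type condition struct {
	Type    string
	Status  bool
	Message string
}

type bundleLookup struct {
	Identifier string
	Conditions []condition
}

// nextPhase returns "Failed" with the failure message as soon as any lookup
// carries a true failure condition; otherwise the plan stays "Installing".
func nextPhase(lookups []bundleLookup) (phase, message string) {
	for _, l := range lookups {
		for _, c := range l.Conditions {
			if c.Type == bundleLookupFailure && c.Status {
				return "Failed", c.Message
			}
		}
	}
	return "Installing", ""
}

func main() {
	lookups := []bundleLookup{{
		Identifier: "example-bundle",
		Conditions: []condition{
			{Type: bundleLookupPending, Status: true, Message: "unpack job not completed"},
			{Type: bundleLookupFailure, Status: true, Message: "illustrative failure message"},
		},
	}}
	phase, msg := nextPhase(lookups)
	fmt.Println(phase, "-", msg)
}
```

Keying the transition on a condition rather than on an error type means any cause of a failed lookup (deadline exceeded, backoff limit reached) goes through the same path.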
In the e2e test we have to wait for the default bundle unpack timeout of 10m to expire. Adding an annotation to set a timeout per InstallPlan lets us override the default unpack timeout for a faster e2e test. Upstream-commit: 06c8bfd4dbeec0cec26c7ac97d4851feb9f64b28 Upstream-repository: operator-lifecycle-manager
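A minimal sketch of such a per-InstallPlan override, assuming the timeout is written as a Go duration string. The annotation key below is a made-up placeholder, not the key OLM uses, and `unpackTimeout` is a hypothetical helper; only the 10m default comes from the commit message.

```go
package main

import (
	"fmt"
	"time"
)

// unpackTimeoutAnnotation is a hypothetical key chosen for this sketch.
const unpackTimeoutAnnotation = "example.com/bundle-unpack-timeout"

// defaultUnpackTimeout is the 10m default mentioned in the commit message.
const defaultUnpackTimeout = 10 * time.Minute

// unpackTimeout returns the annotation value when it is present and parses
// as a duration. A missing annotation yields the default; an unparsable one
// yields the default together with an error.
func unpackTimeout(annotations map[string]string) (time.Duration, error) {
	raw, ok := annotations[unpackTimeoutAnnotation]
	if !ok {
		return defaultUnpackTimeout, nil
	}
	d, err := time.ParseDuration(raw)
	if err != nil {
		return defaultUnpackTimeout, fmt.Errorf("invalid %s annotation %q: %w",
			unpackTimeoutAnnotation, raw, err)
	}
	return d, nil
}

func main() {
	// An e2e test could shorten the wait by annotating its InstallPlan.
	d, err := unpackTimeout(map[string]string{unpackTimeoutAnnotation: "5s"})
	fmt.Println(d, err)
}
```

Falling back to the default on a parse error keeps a mistyped annotation from disabling the timeout altogether; whether OLM behaves this way is not stated in the source.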
- Remove dependence on catalogsource registry image
- Remove redundant clones
- Wait for OperatorGroup to be synced to reduce flakes
Upstream-commit: 5a3b64dd342d2da3415df7af67f3626081589c94 Upstream-repository: operator-lifecycle-manager
Upstream-commit: bb13b90757c695f3996a6de337333ea371203241 Upstream-repository: operator-lifecycle-manager
Upstream-commit: e881752bdb2b9bee4037e381b076001b43a78a70 Upstream-repository: operator-lifecycle-manager
Upstream-commit: e7199d11e1a1fb126b4ba99fd31a4e3b2058ac62 Upstream-repository: operator-lifecycle-manager
Upstream-commit: 447f30e50ab37b6c7741f775f2282d106cd3b975 Upstream-repository: operator-lifecycle-manager
@hasbro17: This pull request references Bugzilla bug 1921264, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker. 3 validation(s) were run on this bug
No GitHub users were found matching the public email listed for the QA contact in Bugzilla ([email protected]), skipping review request. In response to this:
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: hasbro17
/retest
/lgtm
Same issue as the above comment. /retest
/retest Please review the full test history for this PR and help us cut down flakes.
@hasbro17: All pull requests linked via external trackers have merged: Bugzilla bug 1921264 has been moved to the MODIFIED state. In response to this:
The InstallPlan sync can stay stalled on the Installing phase if the bundle cannot be successfully unpacked.
Adding a configurable timeout for the duration of the bundle unpack Job helps identify if an unpack Job is
stalled, and the InstallPlan is then transitioned to Failed with the unpack Job's failure condition
propagated to the InstallPlan condition.
- `ActiveDeadlineSeconds`
- `BackoffLimit`, currently 3
- The InstallPlan stays `Installing`, but the `BundleLookupPending` condition is updated with the reason why the unpack Job's pods are in a pending state. This shows the `ErrImagePull`.

Example of Failed InstallPlan due to a bundle unpack timeout:
Example of Failed InstallPlan due to an invalid bundle causing repeated pod failures:
Example of an InstallPlan that is stalled on a non-existent bundle image lookup: