Bug 1921264: Fail InstallPlan on bundle unpack timeout #78

Merged

hasbro17
Contributor

The InstallPlan sync can stay stalled on the Installing phase if the bundle cannot be successfully unpacked.
Adding a configurable timeout for the duration of the bundle unpack Job helps identify if an unpack Job is
stalled, and the InstallPlan is then transitioned to Failed with the unpack Job's failure condition
propagated to the InstallPlan condition.

  • The InstallPlan fails once the unpack Job exceeds its ActiveDeadlineSeconds.
  • The InstallPlan fails once the unpack Job's pods have exited in error or crashed more times than the Job's BackoffLimit, currently 3.
  • For a non-existent image, the InstallPlan stays in the Installing phase, but the BundleLookupPending condition is updated with the reason the unpack Job's pods are pending. This surfaces the ErrImagePull.
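
The first two cases above can be sketched as a small decision function. This is only an illustration, not the code from this PR: `JobCondition` is a simplified stand-in for the Kubernetes Job condition type, and `phaseForUnpackJob` is a hypothetical helper name. The message format is borrowed from the example output further down.

```go
package main

import "fmt"

// JobCondition is a simplified, hypothetical stand-in for the condition
// entries found in a Kubernetes Job's status.
type JobCondition struct {
	Type    string // "Complete" or "Failed"
	Status  string // "True", "False" or "Unknown"
	Reason  string // e.g. "DeadlineExceeded" or "BackoffLimitExceeded"
	Message string
}

const (
	phaseInstalling = "Installing"
	phaseFailed     = "Failed"
)

// phaseForUnpackJob (hypothetical name) returns the InstallPlan phase and a
// condition message for the given unpack Job conditions. A Failed=True
// condition moves the InstallPlan to Failed and carries the Job's reason
// and message along; anything else leaves it in Installing.
func phaseForUnpackJob(conds []JobCondition) (string, string) {
	for _, c := range conds {
		if c.Type == "Failed" && c.Status == "True" {
			msg := fmt.Sprintf("Bundle unpacking failed. Reason: %s, and Message: %s",
				c.Reason, c.Message)
			return phaseFailed, msg
		}
	}
	return phaseInstalling, ""
}

func main() {
	conds := []JobCondition{{
		Type:    "Failed",
		Status:  "True",
		Reason:  "DeadlineExceeded",
		Message: "Job was active longer than specified deadline",
	}}
	phase, msg := phaseForUnpackJob(conds)
	fmt.Println(phase)
	fmt.Println(msg)
}
```

The same function covers both the deadline and the backoff-limit case, because Kubernetes reports each as a Failed condition that differs only in its reason.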

Example of Failed InstallPlan due to a bundle unpack timeout:

apiVersion: operators.coreos.com/v1alpha1
kind: InstallPlan
status:
  conditions:
    - lastTransitionTime: '2021-04-06T01:10:01Z'
      lastUpdateTime: '2021-04-06T01:10:01Z'
      message: >-
        Bundle extract Job failed with Reason: DeadlineExceeded, 
        and Message: Job was active longer than specified deadline
      reason: InstallCheckFailed
      status: 'False'
      type: Installed
  phase: Failed

Example of Failed InstallPlan due to an invalid bundle causing repeated pod failures:

apiVersion: operators.coreos.com/v1alpha1
kind: InstallPlan
status:
  bundleLookups:
  - conditions:
    - lastTransitionTime: "2021-04-21T20:19:27Z"
      message: Job has reached the specified backoff limit
      reason: BackoffLimitExceeded
      status: "True"
      type: BundleLookupFailed
    identifier: foobar.v0.0.1
    path: alpine:latest
  conditions:
  - lastTransitionTime: "2021-04-21T20:19:27Z"
    lastUpdateTime: "2021-04-21T20:19:27Z"
    message: 'Bundle unpacking failed. Reason: BackoffLimitExceeded, and Message:
      Job has reached the specified backoff limit'
    reason: InstallCheckFailed
    status: "False"
    type: Installed
  phase: Failed

Example of an InstallPlan that is stalled on a non-existent bundle image lookup:

apiVersion: operators.coreos.com/v1alpha1
kind: InstallPlan
status:
  bundleLookups:
  - conditions:
    - lastTransitionTime: "2021-04-09T00:44:55Z"
      message: "unpack job not completed: Unpack pod(default/8c3e265337115bdeff1ab7a092b9f531bffc580830a021ffe04d5214eenb2xr)
        container(pull) is pending. Reason: ErrImagePull, Message: rpc error: code
        = Unknown desc = failed to pull and unpack image \"quay.io/foo/bar:latest\":
        failed to resolve reference \"quay.io/foo/bar:latest\": unexpected status
        code [manifests latest]: 401 UNAUTHORIZED \n"
      reason: JobIncomplete
      status: "True"
      type: BundleLookupPending
    path: quay.io/foo/bar:latest

hasbro17 added 9 commits May 12, 2021 12:33
The InstallPlan sync can stay stalled on the Installing phase if the bundle cannot be successfully unpacked.
Adding a configurable timeout for the duration of the bundle unpack Job helps identify if an unpack Job is
stalled, and the InstallPlan is then transitioned to Failed with the unpack Job's failure condition
propagated to the InstallPlan condition.

Upstream-commit: 27ced4137bc4c7637b9c36f95946fe53cefe0e3d
Upstream-repository: operator-lifecycle-manager
Upstream-commit: 048cdb46bf4a2d313bea91b6d61760b7cb993f23
Upstream-repository: operator-lifecycle-manager
Instead of checking for a specific error type when the bundle unpack Job
fails, a new BundleLookupFailed condition is set to indicate
a failed bundle lookup. The InstallPlan is transitioned to Failed based
on the presence of this condition.

While the InstallPlan is waiting on a BundleLookupPending condition,
the message for that condition is also updated with the init container statuses
of the unpack pods, since these can surface image pull errors (such as ErrImagePull)
when the lookup is stalled on a non-existent bundle image.

The bundle unpack Job's BackoffLimit is also set to a lower value
to fail fast on repeated crashes, and the pod restart policy is set to Never
to preserve the container logs after the Job is terminated via the backoff limit.

Upstream-commit: 260bc91988bb2e4c61fc93cb3bcfecb9af43d6f9
Upstream-repository: operator-lifecycle-manager
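
As a rough illustration of how init container statuses could be folded into the BundleLookupPending message, the sketch below builds a string in the shape of the example output earlier in this PR. `ContainerStatus` is a simplified stand-in for the Kubernetes container status type, and `pendingMessage` and the container names are hypothetical. The actual implementation may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// ContainerStatus is a simplified, hypothetical stand-in for a pod's init
// container status. An empty WaitingReason means the container is not waiting.
type ContainerStatus struct {
	Name           string
	WaitingReason  string
	WaitingMessage string
}

// pendingMessage (hypothetical name) appends one clause per waiting init
// container to the base "unpack job not completed" message.
func pendingMessage(namespace, pod string, statuses []ContainerStatus) string {
	var b strings.Builder
	b.WriteString("unpack job not completed")
	for _, s := range statuses {
		if s.WaitingReason == "" {
			continue // running or terminated containers add nothing
		}
		fmt.Fprintf(&b, ": Unpack pod(%s/%s) container(%s) is pending. Reason: %s, Message: %s",
			namespace, pod, s.Name, s.WaitingReason, s.WaitingMessage)
	}
	return b.String()
}

func main() {
	statuses := []ContainerStatus{
		{Name: "util"}, // illustrative container that is not waiting
		{Name: "pull", WaitingReason: "ErrImagePull", WaitingMessage: "401 UNAUTHORIZED"},
	}
	fmt.Println(pendingMessage("default", "unpack-pod", statuses))
}
```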
In the e2e test we have to wait for the default bundle unpack timeout of 10m
to expire. Adding an annotation to set a timeout per InstallPlan lets us override
the default unpack timeout for a faster e2e test.

Upstream-commit: 06c8bfd4dbeec0cec26c7ac97d4851feb9f64b28
Upstream-repository: operator-lifecycle-manager
- Remove dependence on catalogsource registry image
- Remove redundant clones
- Wait for OperatorGroup to be synced to reduce flakes

Upstream-commit: 5a3b64dd342d2da3415df7af67f3626081589c94
Upstream-repository: operator-lifecycle-manager
Upstream-commit: bb13b90757c695f3996a6de337333ea371203241
Upstream-repository: operator-lifecycle-manager
Upstream-commit: e881752bdb2b9bee4037e381b076001b43a78a70
Upstream-repository: operator-lifecycle-manager
Upstream-commit: e7199d11e1a1fb126b4ba99fd31a4e3b2058ac62
Upstream-repository: operator-lifecycle-manager
Upstream-commit: 447f30e50ab37b6c7741f775f2282d106cd3b975
Upstream-repository: operator-lifecycle-manager
@openshift-ci openshift-ci bot added bugzilla/severity-medium Referenced Bugzilla bug's severity is medium for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. labels May 12, 2021
@openshift-ci
Contributor

openshift-ci bot commented May 12, 2021

@hasbro17: This pull request references Bugzilla bug 1921264, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.8.0) matches configured target release for branch (4.8.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

No GitHub users were found matching the public email listed for the QA contact in Bugzilla ([email protected]), skipping review request.

In response to this:

Bug 1921264: Fail InstallPlan on bundle unpack timeout

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci
Contributor

openshift-ci bot commented May 12, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hasbro17

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 12, 2021
@hasbro17
Contributor Author

/retest

@timflannagan
Contributor

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label May 13, 2021
@timflannagan
Contributor

Same issue as the above comment.

/retest

@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot openshift-merge-robot merged commit 0b7700e into openshift:master May 14, 2021
@openshift-ci
Copy link
Contributor

openshift-ci bot commented May 14, 2021

@hasbro17: All pull requests linked via external trackers have merged:

Bugzilla bug 1921264 has been moved to the MODIFIED state.

In response to this:

Bug 1921264: Fail InstallPlan on bundle unpack timeout

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. bugzilla/severity-medium Referenced Bugzilla bug's severity is medium for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. lgtm Indicates that a PR is ready to be merged.
4 participants