Fail InstallPlan on bundle unpack timeout #2093

hasbro17 · 2021-04-09T01:51:21Z

The InstallPlan sync can stay stalled on the Installing phase if the bundle cannot be successfully unpacked.
Adding a configurable timeout for the duration of the bundle unpack Job helps identify if an unpack Job is
stalled, and the InstallPlan is then transitioned to Failed with the unpack Job's failure condition
propagated to the InstallPlan condition.

InstallPlan will fail after the unpack job's ActiveDeadlineSeconds
InstallPlan will fail after the unpack Job's pods exit in error/crash more than the BackoffLimit, currently 3
For a non-existent image the InstallPlan stays in phase Installing but the BundleLookupPending condition is update with the reason for why the unpack Job's pods are in a pending state. This shows the ErrImagePull.

Example of Failed InstallPlan due to a bundle unpack timeout:

apiVersion: operators.coreos.com/v1alpha1
kind: InstallPlan
status:
  conditions:
    - lastTransitionTime: '2021-04-06T01:10:01Z'
      lastUpdateTime: '2021-04-06T01:10:01Z'
      message: >-
        Bundle extract Job failed with Reason: DeadlineExceeded, 
        and Message: Job was active longer than specified deadline
      reason: InstallCheckFailed
      status: 'False'
      type: Installed
  phase: Failed

Example of Failed InstallPlan due to an invalid bundle causing repeated pod failures:

apiVersion: operators.coreos.com/v1alpha1
kind: InstallPlan
status:
  bundleLookups:
    - lastTransitionTime: "2021-04-21T20:19:27Z"
      message: Job has reached the specified backoff limit
      reason: BackoffLimitExceeded
      status: "True"
      type: BundleLookupFailed
    identifier: foobar.v0.0.1
    path: alpine:latest
  conditions:
  - lastTransitionTime: "2021-04-21T20:19:27Z"
    lastUpdateTime: "2021-04-21T20:19:27Z"
    message: 'Bundle unpacking failed. Reason: BackoffLimitExceeded, and Message:
      Job has reached the specified backoff limit'
    reason: InstallCheckFailed
    status: "False"
    type: Installed
  phase: Failed

Example of an InstallPlan that is stalled on a non-existent bundle image lookup:

apiVersion: operators.coreos.com/v1alpha1
kind: InstallPlan
status:
  bundleLookups:
    conditions:
    - lastTransitionTime: "2021-04-09T00:44:55Z"
      message: "unpack job not completed: Unpack pod(default/8c3e265337115bdeff1ab7a092b9f531bffc580830a021ffe04d5214eenb2xr)
        container(pull) is pending. Reason: ErrImagePull, Message: rpc error: code
        = Unknown desc = failed to pull and unpack image \"quay.io/foo/bar:latest\":
        failed to resolve reference \"quay.io/foo/bar:latest\": unexpected status
        code [manifests latest]: 401 UNAUTHORIZED \n"
      reason: JobIncomplete
      status: "True"
      type: BundleLookupPending
    path: quay.io/foo/bar:latest

Reviewer Checklist

Implementation matches the proposed design, or proposal is updated to match implementation
Sufficient unit test coverage
Sufficient end-to-end test coverage
Docs updated or added to /doc
Commit messages sensible and descriptive

cmd/catalog/main.go

openshift-ci-robot · 2021-04-11T09:14:50Z

@hasbro17: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Xia-Zhao-rh · 2021-04-13T01:29:43Z

@hasbro17 Would you please help to rebase it?

hasbro17 · 2021-04-13T01:58:21Z

@Xia-Zhao-rh If this is with regards to testing, please hold that off for a while. I've just added the WIP label to indicate that this is still WIP and will update/rebase accordingly once it's ready for testing.

Xia-Zhao-rh · 2021-04-13T02:08:56Z

@Xia-Zhao-rh If this is with regards to testing, please hold that off for a while. I've just added the WIP label to indicate that this is still WIP and will update/rebase accordingly once it's ready for testing.

OK, got it, thanks a lot!

The InstallPlan sync can stay stalled on the Installing phase if the bundle cannot be successfully unpacked. Adding a configurable timeout for the duration of the bundle unpack Job helps identify if an unpack Job is stalled, and the InstallPlan is then transitioned to Failed with the unpack Job's failure condition propagated to the InstallPlan condition.

openshift-ci-robot · 2021-04-15T22:30:05Z

@hasbro17: This pull request references Bugzilla bug 1921264, which is invalid:

expected the bug to target the "4.8.0" release, but it targets "---" instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

Bug 1921264: Fail InstallPlan on bundle unpack timeout

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

hasbro17 · 2021-04-15T23:10:50Z

/bugzilla refresh

openshift-ci-robot · 2021-04-15T23:10:56Z

@hasbro17: This pull request references Bugzilla bug 1921264, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target release (4.8.0) matches configured target release for branch (4.8.0)
bug is in the state NEW, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

Requesting review from QA contact:
/cc @jianzhangbjz

In response to this:

/bugzilla refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-ci-robot · 2021-04-15T23:10:57Z

@openshift-ci-robot: GitHub didn't allow me to request PR reviews from the following users: jianzhangbjz.

Note that only operator-framework members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

@hasbro17: This pull request references Bugzilla bug 1921264, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug

bug is open, matching expected state (open)

bug target release (4.8.0) matches configured target release for branch (4.8.0)

bug is in the state NEW, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

Requesting review from QA contact:
/cc @jianzhangbjz

In response to this:

/bugzilla refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-ci · 2021-05-10T21:34:53Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: benluddy, hasbro17

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [benluddy]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

pkg/controller/bundle/bundle_unpacker.go

njhale · 2021-05-11T18:25:50Z

pkg/controller/bundle/bundle_unpacker_test.go

+								Spec: corev1.PodSpec{
+									RestartPolicy: corev1.RestartPolicyNever,
+									Containers: []corev1.Container{
+										{
+											Name:    "extract",
+											Image:   opmImage,
+											Command: []string{"opm", "alpha", "bundle", "extract", "-m", "/bundle/", "-n", "ns-a", "-c", pathHash},
+											Env: []corev1.EnvVar{
+												{
+													Name:  configmap.EnvContainerImage,
+													Value: bundlePath,
+												},
+											},
+											VolumeMounts: []corev1.VolumeMount{
+												{
+													Name:      "bundle",
+													MountPath: "/bundle",
+												},
+											},
+											Resources: corev1.ResourceRequirements{
+												Requests: corev1.ResourceList{
+													corev1.ResourceCPU:    resource.MustParse("10m"),
+													corev1.ResourceMemory: resource.MustParse("50Mi"),
+												},
+											},
+										},
+									},
+									InitContainers: []corev1.Container{
+										{
+											Name:    "util",
+											Image:   utilImage,
+											Command: []string{"/bin/cp", "-Rv", "/bin/cpb", "/util/cpb"}, // Copy tooling for the bundle container to use
+											VolumeMounts: []corev1.VolumeMount{
+												{
+													Name:      "util",
+													MountPath: "/util",
+												},
+											},
+											Resources: corev1.ResourceRequirements{
+												Requests: corev1.ResourceList{
+													corev1.ResourceCPU:    resource.MustParse("10m"),
+													corev1.ResourceMemory: resource.MustParse("50Mi"),
+												},
+											},
+										},
+										{


If this pod spec is identical in most test cases, can we factor it out?

I don't think so. I tried to trim down the Job spec and the problem is that func (c *ConfigMapUnpacker) UnpackBundle()
will fail to ensure that the expected Job is present with the desired spec.

operator-lifecycle-manager/pkg/controller/bundle/bundle_unpacker.go

Lines 411 to 430 in a326d83

func (c *ConfigMapUnpacker) ensureJob(cmRef *corev1.ObjectReference, bundlePath string, secrets []corev1.LocalObjectReference) (job *batchv1.Job, err error) {

fresh := c.job(cmRef, bundlePath, secrets)

job, err = c.jobLister.Jobs(fresh.GetNamespace()).Get(fresh.GetName())

if err != nil {

if apierrors.IsNotFound(err) {

job, err = c.client.BatchV1().Jobs(fresh.GetNamespace()).Create(context.TODO(), fresh, metav1.CreateOptions{})

}

return

}

if equality.Semantic.DeepDerivative(fresh.GetOwnerReferences(), job.GetOwnerReferences()) && equality.Semantic.DeepDerivative(fresh.Spec, job.Spec) {

return

}

// TODO: Decide when to fail-out instead of deleting the job

err = c.client.BatchV1().Jobs(job.GetNamespace()).Delete(context.TODO(), job.GetName(), metav1.DeleteOptions{})

job = nil

return

}

While testing this I've realized that we could end up with a nil job when we find and delete an unpack job that has the incorrect spec or ownerrefs.

operator-lifecycle-manager/pkg/controller/bundle/bundle_unpacker.go

Lines 358 to 362 in a326d83

var job *batchv1.Job

job, err = c.ensureJob(cmRef, result.Path, secrets)

if err != nil {

return

}

So I'm going to update to also return and retry when job == nil as well otherwise we could see a nil dereference later on.

njhale · 2021-05-11T18:35:37Z

pkg/controller/bundle/bundle_unpacker_test.go

@@ -789,7 +1313,8 @@ func TestConfigMapUnpacker(t *testing.T) {
 			)
 			require.NoError(t, err)

-			res, err := unpacker.UnpackBundle(tt.args.lookup)
+			// The annotation unpack timeout arg is negative so it is ignored for these tests
+			res, err := unpacker.UnpackBundle(tt.args.lookup, -1*time.Minute)


I would have expected at least one test where the timeout is varied and its inclusion on the job spec is verified

njhale · 2021-05-11T18:42:24Z

test/e2e/installplan_e2e_test.go

+		BeforeEach(func() {
+			ns = &corev1.Namespace{}
+			ns.SetName(genName("ns-"))
+			Expect(ctx.Ctx().Client().Create(context.Background(), ns)).To(Succeed())


I think we typically want all interactions with the cluster to use Eventually.

I thought Eventually was used for polling situations where we expect to see a certain condition eventually.
Are we worried that all CRUD operations could block for a long time? We do have an overall test timeout for that case.

Plus I might be wrong but I don't see that we've been using Eventually for ensuring Create,Update,Delete operations succeed in any of our other tests.

I thought Eventually was used for polling situations where we expect to see a certain condition eventually.

Yes. We eventually want the create to be successful or tell us the resource has already been created.

Are we worried that all CRUD operations could block for a long time?

Not really. We're more worried that any operation that reaches out over the network will experience transient errors at some point during our tests, yielding a false negative. This was much more pronounced on the OpenShift CI. I don't expect many errors of that nature running against our in-process kind infra, but I do think it's good practice with long running e2e suites none the less.

Plus I might be wrong but I don't see that we've been using Eventually for ensuring Create,Update,Delete operations succeed in any of our other tests.

We do -- find . -type f -print | xargs fgrep 'Eventually' in the e2e directory and you should find some

Makes sense. It's actually used pretty much everywhere on a closer look. I'll update these as well.

hasbro17 · 2021-05-12T00:40:05Z

Not sure why this unit test is failing all of a sudden:

--- FAIL: TestSyncOperatorGroups (1.40s)
    --- FAIL: TestSyncOperatorGroups/AllNamespaces/CSVPresent/InstallModeNotSupported (0.14s)
        operator_test.go:4489: 
            	Error Trace:	operator_test.go:4489
            	Error:      	Received unexpected error:
            	            	clusterroles.rbac.authorization.k8s.io "csv-role" already exists
            	Test:       	TestSyncOperatorGroups/AllNamespaces/CSVPresent/InstallModeNotSupported

Can't reproduce this locally but still looking.

hasbro17 · 2021-05-12T16:21:20Z

Seems like it was a flake.

njhale · 2021-05-12T17:14:11Z

Nice job!

/lgtm

hasbro17 · 2021-05-13T18:25:25Z

/cherry-pick release-4.7

openshift-cherrypick-robot · 2021-05-13T18:26:07Z

@hasbro17: #2093 failed to apply on top of branch "release-4.7":

Applying: Fail InstallPlan on bundle unpack timeout
Using index info to reconstruct a base tree...
M	cmd/catalog/main.go
M	pkg/controller/operators/catalog/operator.go
M	test/e2e/installplan_e2e_test.go
Falling back to patching base and 3-way merge...
Auto-merging test/e2e/installplan_e2e_test.go
Auto-merging pkg/controller/operators/catalog/operator.go
CONFLICT (content): Merge conflict in pkg/controller/operators/catalog/operator.go
Auto-merging cmd/catalog/main.go
CONFLICT (content): Merge conflict in cmd/catalog/main.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 Fail InstallPlan on bundle unpack timeout
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

In response to this:

/cherry-pick release-4.7

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

In operator-framework#2077, a new phase `Failed` was introduced for InstallPlans, and failure in detecting a valid OperatorGroup(OG) or a Service Account(SA) for the namespace the InstallPlan was being created in would transition the InstallPlan to the `Failed` state, i.e failure to detected these resources when the InstallPlan was reconciled the first time was considered a permanant failure. This is a regression from the previous behavior of InstallPlans where failure to detect OG/SA would requeue the InstallPlan for reconciliation, so creating the required resources before the retry limit of the informer queue was reached would transition the InstallPlan from the `Installing` phase to the `Complete` phase(unless the bundle unpacking step failed, in which case operator-framework#2093 introduced transitioning the InstallPlan to the `Failed` phase). This regression introduced oddities for users who has infra built that applies a set of manifests simultaneously to install an operator that includes a Subscription to an operator (that creates InstallPlans) along with the required OG/SAs. In those cases, whenever there was a delay in the reconciliation of the OG/SA, the InstallPlan would be transitioned to a state of permanant faliure. This PR: * Removes the logic that transitioned the InstallPlan to `Failed`. Instead, the InstallPlan will again be requeued for any reconciliation error. * Introduces logic to bubble up reconciliation error through the InstallPlan's status.Conditions, eg: When no OperatorGroup is detected: ``` conditions: - lastTransitionTime: "2021-06-23T18:16:00Z" lastUpdateTime: "2021-06-23T18:16:16Z" message: attenuated service account query failed - no operator group found that is managing this namespace reason: InstallCheckFailed status: "False" type: Installed ``` Then when a valid OperatorGroup is created: ``` conditions: - lastTransitionTime: "2021-06-23T18:33:37Z" lastUpdateTime: "2021-06-23T18:33:37Z" status: "True" type: Installed ```

In operator-framework#2077, a new phase `Failed` was introduced for InstallPlans, and failure in detecting a valid OperatorGroup(OG) or a Service Account(SA) for the namespace the InstallPlan was being created in would transition the InstallPlan to the `Failed` state, i.e failure to detected these resources when the InstallPlan was reconciled the first time was considered a permanant failure. This is a regression from the previous behavior of InstallPlans where failure to detect OG/SA would requeue the InstallPlan for reconciliation, so creating the required resources before the retry limit of the informer queue was reached would transition the InstallPlan from the `Installing` phase to the `Complete` phase(unless the bundle unpacking step failed, in which case operator-framework#2093 introduced transitioning the InstallPlan to the `Failed` phase). This regression introduced oddities for users who has infra built that applies a set of manifests simultaneously to install an operator that includes a Subscription to an operator (that creates InstallPlans) along with the required OG/SAs. In those cases, whenever there was a delay in the reconciliation of the OG/SA, the InstallPlan would be transitioned to a state of permanant faliure. This PR: * Removes the logic that transitioned the InstallPlan to `Failed`. Instead, the InstallPlan will again be requeued for any reconciliation error. * Introduces logic to bubble up reconciliation error through the InstallPlan's status.Conditions, eg: When no OperatorGroup is detected: ``` conditions: - lastTransitionTime: "2021-06-23T18:16:00Z" lastUpdateTime: "2021-06-23T18:16:16Z" message: attenuated service account query failed - no operator group found that is managing this namespace reason: InstallCheckFailed status: "False" type: Installed ``` Then when a valid OperatorGroup is created: ``` conditions: - lastTransitionTime: "2021-06-23T18:33:37Z" lastUpdateTime: "2021-06-23T18:33:37Z" status: "True" type: Installed ``` Signed-off-by: Anik Bhattacharjee <[email protected]>

hasbro17 requested review from ecordell, benluddy, njhale, kevinrizza and exdx April 9, 2021 01:51

openshift-ci-robot requested a review from awgreene April 9, 2021 01:51

hasbro17 commented Apr 9, 2021

View reviewed changes

cmd/catalog/main.go Outdated Show resolved Hide resolved

openshift-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 11, 2021

hasbro17 changed the title ~~Fail InstallPlan on bundle unpack timeout~~ WIP: Fail InstallPlan on bundle unpack timeout Apr 13, 2021

openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 13, 2021

hasbro17 force-pushed the bundle-lookup-timeout-on-IP-status branch from d71ecdc to 4191bbc Compare April 15, 2021 22:24

openshift-ci bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 15, 2021

hasbro17 changed the title ~~WIP: Fail InstallPlan on bundle unpack timeout~~ Bug 1921264: Fail InstallPlan on bundle unpack timeout Apr 15, 2021

openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 15, 2021

openshift-ci-robot added bugzilla/severity-medium Referenced Bugzilla bug's severity is medium for the branch this PR is targeting. bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. labels Apr 15, 2021

hasbro17 force-pushed the bundle-lookup-timeout-on-IP-status branch from 4191bbc to cc13189 Compare April 15, 2021 22:32

Cleanup e2e tests and add case for invalid bundle image

048cdb4

hasbro17 force-pushed the bundle-lookup-timeout-on-IP-status branch from cc13189 to 048cdb4 Compare April 15, 2021 22:35

openshift-ci-robot added bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. and removed bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. labels Apr 15, 2021

Add unit tests for UnpackBundle

e881752

hasbro17 requested a review from njhale May 10, 2021 21:38

njhale reviewed May 11, 2021

View reviewed changes

pkg/controller/bundle/bundle_unpacker.go Outdated Show resolved Hide resolved

njhale reviewed May 11, 2021

View reviewed changes

hasbro17 added 2 commits May 11, 2021 14:48

Add unit test for varying job timeout

e7199d1

Add eventually to e2e test crud operations

447f30e

openshift-ci bot assigned njhale May 12, 2021

openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label May 12, 2021

openshift-merge-robot merged commit 01af7a3 into operator-framework:master May 12, 2021

openshift-ci bot mentioned this pull request May 14, 2021

Bug 1921264: Fail InstallPlan on bundle unpack timeout openshift/operator-framework-olm#78

Merged

anik120 mentioned this pull request Jun 24, 2021

(fix)InstallPlan: Do not tranisition IP to failed on OG/SA failure #2215

Merged

5 tasks

	func (c ConfigMapUnpacker) ensureJob(cmRef corev1.ObjectReference, bundlePath string, secrets []corev1.LocalObjectReference) (job *batchv1.Job, err error) {
	fresh := c.job(cmRef, bundlePath, secrets)
	job, err = c.jobLister.Jobs(fresh.GetNamespace()).Get(fresh.GetName())
	if err != nil {
	if apierrors.IsNotFound(err) {
	job, err = c.client.BatchV1().Jobs(fresh.GetNamespace()).Create(context.TODO(), fresh, metav1.CreateOptions{})
	}

	return
	}

	if equality.Semantic.DeepDerivative(fresh.GetOwnerReferences(), job.GetOwnerReferences()) && equality.Semantic.DeepDerivative(fresh.Spec, job.Spec) {
	return
	}

	// TODO: Decide when to fail-out instead of deleting the job
	err = c.client.BatchV1().Jobs(job.GetNamespace()).Delete(context.TODO(), job.GetName(), metav1.DeleteOptions{})
	job = nil
	return
	}

	var job *batchv1.Job
	job, err = c.ensureJob(cmRef, result.Path, secrets)
	if err != nil {
	return
	}

Fail InstallPlan on bundle unpack timeout #2093

Fail InstallPlan on bundle unpack timeout #2093

Uh oh!

Conversation

hasbro17 commented Apr 9, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

openshift-ci-robot commented Apr 11, 2021

Uh oh!

Xia-Zhao-rh commented Apr 13, 2021

Uh oh!

hasbro17 commented Apr 13, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Xia-Zhao-rh commented Apr 13, 2021

Uh oh!

openshift-ci-robot commented Apr 15, 2021

Uh oh!

hasbro17 commented Apr 15, 2021

Uh oh!

openshift-ci-robot commented Apr 15, 2021

Uh oh!

openshift-ci-robot commented Apr 15, 2021

Uh oh!

openshift-ci bot commented May 10, 2021

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

hasbro17 commented May 12, 2021

Uh oh!

hasbro17 commented May 12, 2021

Uh oh!

njhale commented May 12, 2021

Uh oh!

hasbro17 commented May 13, 2021

Uh oh!

openshift-cherrypick-robot commented May 13, 2021

Uh oh!

Uh oh!

hasbro17 commented Apr 9, 2021 •

edited

Loading

hasbro17 commented Apr 13, 2021 •

edited

Loading