
Fail InstallPlan on bundle unpack timeout #2093


Conversation

hasbro17
Contributor

@hasbro17 hasbro17 commented Apr 9, 2021

The InstallPlan sync can stay stalled in the Installing phase if the bundle cannot be successfully unpacked.
Adding a configurable timeout for the duration of the bundle unpack Job makes it possible to detect a stalled
unpack Job; the InstallPlan is then transitioned to Failed with the unpack Job's failure condition
propagated to the InstallPlan's conditions.

  • The InstallPlan fails once the unpack Job exceeds its ActiveDeadlineSeconds (sketched below)
  • The InstallPlan fails once the unpack Job's pods have exited in error or crashed more times than the Job's BackoffLimit (currently 3)
  • For a non-existent image the InstallPlan stays in the Installing phase, but the BundleLookupPending condition is updated with the reason the unpack Job's pods are in a pending state, surfacing, for example, the ErrImagePull error
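
For reference, here is a minimal sketch of how the configurable timeout can map onto the unpack Job spec. This is not the PR's exact code: the applyUnpackTimeout helper and its unpackTimeout parameter are illustrative assumptions, while ActiveDeadlineSeconds and BackoffLimit are the standard Kubernetes batch/v1 Job fields referenced above.

// Minimal sketch (illustrative only): map a configurable unpack timeout onto the
// unpack Job's ActiveDeadlineSeconds and bound pod retries with BackoffLimit.
package unpack

import (
	"time"

	batchv1 "k8s.io/api/batch/v1"
)

func applyUnpackTimeout(job *batchv1.Job, unpackTimeout time.Duration) {
	// A non-positive timeout leaves the Job without a deadline, effectively
	// disabling the timeout check.
	if unpackTimeout <= 0 {
		return
	}
	activeDeadlineSeconds := int64(unpackTimeout.Seconds())
	job.Spec.ActiveDeadlineSeconds = &activeDeadlineSeconds

	// Pods that crash or exit in error are retried up to BackoffLimit times
	// before the Job is marked failed with reason BackoffLimitExceeded.
	backoffLimit := int32(3)
	job.Spec.BackoffLimit = &backoffLimit
}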

Example of Failed InstallPlan due to a bundle unpack timeout:

apiVersion: operators.coreos.com/v1alpha1
kind: InstallPlan
status:
  conditions:
    - lastTransitionTime: '2021-04-06T01:10:01Z'
      lastUpdateTime: '2021-04-06T01:10:01Z'
      message: >-
        Bundle extract Job failed with Reason: DeadlineExceeded, 
        and Message: Job was active longer than specified deadline
      reason: InstallCheckFailed
      status: 'False'
      type: Installed
  phase: Failed

Example of Failed InstallPlan due to an invalid bundle causing repeated pod failures:

apiVersion: operators.coreos.com/v1alpha1
kind: InstallPlan
status:
  bundleLookups:
  - conditions:
    - lastTransitionTime: "2021-04-21T20:19:27Z"
      message: Job has reached the specified backoff limit
      reason: BackoffLimitExceeded
      status: "True"
      type: BundleLookupFailed
    identifier: foobar.v0.0.1
    path: alpine:latest
  conditions:
  - lastTransitionTime: "2021-04-21T20:19:27Z"
    lastUpdateTime: "2021-04-21T20:19:27Z"
    message: 'Bundle unpacking failed. Reason: BackoffLimitExceeded, and Message:
      Job has reached the specified backoff limit'
    reason: InstallCheckFailed
    status: "False"
    type: Installed
  phase: Failed

Example of an InstallPlan that is stalled on a non-existent bundle image lookup:

apiVersion: operators.coreos.com/v1alpha1
kind: InstallPlan
status:
  bundleLookups:
  - conditions:
    - lastTransitionTime: "2021-04-09T00:44:55Z"
      message: "unpack job not completed: Unpack pod(default/8c3e265337115bdeff1ab7a092b9f531bffc580830a021ffe04d5214eenb2xr)
        container(pull) is pending. Reason: ErrImagePull, Message: rpc error: code
        = Unknown desc = failed to pull and unpack image \"quay.io/foo/bar:latest\":
        failed to resolve reference \"quay.io/foo/bar:latest\": unexpected status
        code [manifests latest]: 401 UNAUTHORIZED \n"
      reason: JobIncomplete
      status: "True"
      type: BundleLookupPending
    path: quay.io/foo/bar:latest

Reviewer Checklist

  • Implementation matches the proposed design, or proposal is updated to match implementation
  • Sufficient unit test coverage
  • Sufficient end-to-end test coverage
  • Docs updated or added to /doc
  • Commit messages sensible and descriptive

@openshift-ci-robot
Collaborator

@hasbro17: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 11, 2021
@Xia-Zhao-rh

@hasbro17 Would you please help to rebase it?

@hasbro17 hasbro17 changed the title from "Fail InstallPlan on bundle unpack timeout" to "WIP: Fail InstallPlan on bundle unpack timeout" Apr 13, 2021
@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 13, 2021
@hasbro17
Contributor Author

hasbro17 commented Apr 13, 2021

@Xia-Zhao-rh If this is with regards to testing, please hold that off for a while. I've just added the WIP label to indicate that this is still WIP and will update/rebase accordingly once it's ready for testing.

@Xia-Zhao-rh

@Xia-Zhao-rh If this is with regards to testing, please hold that off for a while. I've just added the WIP label to indicate that this is still WIP and will update/rebase accordingly once it's ready for testing.

OK, got it, thanks a lot!

@hasbro17 hasbro17 force-pushed the bundle-lookup-timeout-on-IP-status branch from d71ecdc to 4191bbc on April 15, 2021 22:24
@openshift-ci openshift-ci bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 15, 2021
@hasbro17 hasbro17 changed the title from "WIP: Fail InstallPlan on bundle unpack timeout" to "Bug 1921264: Fail InstallPlan on bundle unpack timeout" Apr 15, 2021
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 15, 2021
@openshift-ci-robot
Collaborator

@hasbro17: This pull request references Bugzilla bug 1921264, which is invalid:

  • expected the bug to target the "4.8.0" release, but it targets "---" instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

Bug 1921264: Fail InstallPlan on bundle unpack timeout

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added bugzilla/severity-medium Referenced Bugzilla bug's severity is medium for the branch this PR is targeting. bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. labels Apr 15, 2021
@hasbro17 hasbro17 force-pushed the bundle-lookup-timeout-on-IP-status branch from 4191bbc to cc13189 on April 15, 2021 22:32
@hasbro17 hasbro17 force-pushed the bundle-lookup-timeout-on-IP-status branch from cc13189 to 048cdb4 on April 15, 2021 22:35
@hasbro17
Contributor Author

/bugzilla refresh

@openshift-ci-robot
Collaborator

@hasbro17: This pull request references Bugzilla bug 1921264, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.8.0) matches configured target release for branch (4.8.0)
  • bug is in the state NEW, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

Requesting review from QA contact:
/cc @jianzhangbjz

In response to this:

/bugzilla refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. and removed bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. labels Apr 15, 2021
@openshift-ci-robot
Collaborator

@openshift-ci-robot: GitHub didn't allow me to request PR reviews from the following users: jianzhangbjz.

Note that only operator-framework members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

@hasbro17: This pull request references Bugzilla bug 1921264, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.8.0) matches configured target release for branch (4.8.0)
  • bug is in the state NEW, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

Requesting review from QA contact:
/cc @jianzhangbjz

In response to this:

/bugzilla refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci

openshift-ci bot commented May 10, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: benluddy, hasbro17

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@hasbro17 hasbro17 requested a review from njhale May 10, 2021 21:38
Comment on lines +805 to +850
Spec: corev1.PodSpec{
	RestartPolicy: corev1.RestartPolicyNever,
	Containers: []corev1.Container{
		{
			Name:    "extract",
			Image:   opmImage,
			Command: []string{"opm", "alpha", "bundle", "extract", "-m", "/bundle/", "-n", "ns-a", "-c", pathHash},
			Env: []corev1.EnvVar{
				{
					Name:  configmap.EnvContainerImage,
					Value: bundlePath,
				},
			},
			VolumeMounts: []corev1.VolumeMount{
				{
					Name:      "bundle",
					MountPath: "/bundle",
				},
			},
			Resources: corev1.ResourceRequirements{
				Requests: corev1.ResourceList{
					corev1.ResourceCPU:    resource.MustParse("10m"),
					corev1.ResourceMemory: resource.MustParse("50Mi"),
				},
			},
		},
	},
	InitContainers: []corev1.Container{
		{
			Name:    "util",
			Image:   utilImage,
			Command: []string{"/bin/cp", "-Rv", "/bin/cpb", "/util/cpb"}, // Copy tooling for the bundle container to use
			VolumeMounts: []corev1.VolumeMount{
				{
					Name:      "util",
					MountPath: "/util",
				},
			},
			Resources: corev1.ResourceRequirements{
				Requests: corev1.ResourceList{
					corev1.ResourceCPU:    resource.MustParse("10m"),
					corev1.ResourceMemory: resource.MustParse("50Mi"),
				},
			},
		},
		{
Member

If this pod spec is identical in most test cases, can we factor it out?

Contributor Author

I don't think so. I tried to trim down the Job spec and the problem is that func (c *ConfigMapUnpacker) UnpackBundle()
will fail to ensure that the expected Job is present with the desired spec.

func (c *ConfigMapUnpacker) ensureJob(cmRef *corev1.ObjectReference, bundlePath string, secrets []corev1.LocalObjectReference) (job *batchv1.Job, err error) {
	fresh := c.job(cmRef, bundlePath, secrets)
	job, err = c.jobLister.Jobs(fresh.GetNamespace()).Get(fresh.GetName())
	if err != nil {
		if apierrors.IsNotFound(err) {
			job, err = c.client.BatchV1().Jobs(fresh.GetNamespace()).Create(context.TODO(), fresh, metav1.CreateOptions{})
		}
		return
	}

	if equality.Semantic.DeepDerivative(fresh.GetOwnerReferences(), job.GetOwnerReferences()) && equality.Semantic.DeepDerivative(fresh.Spec, job.Spec) {
		return
	}

	// TODO: Decide when to fail-out instead of deleting the job
	err = c.client.BatchV1().Jobs(job.GetNamespace()).Delete(context.TODO(), job.GetName(), metav1.DeleteOptions{})
	job = nil
	return
}

While testing this I've realized that we could end up with a nil job when we find and delete an unpack job that has the incorrect spec or ownerrefs.

var job *batchv1.Job
job, err = c.ensureJob(cmRef, result.Path, secrets)
if err != nil {
	return
}

So I'm going to update this to also return and retry when job == nil; otherwise we could hit a nil dereference later on.
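
A minimal sketch of that guard, continuing the fragment above (the assumed shape of the fix, not necessarily the final merged code):

var job *batchv1.Job
job, err = c.ensureJob(cmRef, result.Path, secrets)
if err != nil {
	return
}
if job == nil {
	// ensureJob deleted a stale Job with the wrong spec or owner refs; return a
	// retryable error so the lookup is requeued and the Job is recreated on the
	// next sync, rather than dereferencing a nil Job below.
	err = fmt.Errorf("unpack job deleted, waiting for recreation")
	return
}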

@@ -789,7 +1313,8 @@ func TestConfigMapUnpacker(t *testing.T) {
 			)
 			require.NoError(t, err)
 
-			res, err := unpacker.UnpackBundle(tt.args.lookup)
+			// The annotation unpack timeout arg is negative so it is ignored for these tests
+			res, err := unpacker.UnpackBundle(tt.args.lookup, -1*time.Minute)
Member

I would have expected at least one test where the timeout is varied and its inclusion on the job spec is verified
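
Something along these lines would do it (a rough sketch only; the client and namespace fixtures are assumptions, not the actual test code):

timeout := 10 * time.Minute
res, err := unpacker.UnpackBundle(tt.args.lookup, timeout)
require.NoError(t, err)
require.NotNil(t, res)

// The unpack Job created through the fake clientset should carry the timeout
// as its ActiveDeadlineSeconds.
jobs, err := client.BatchV1().Jobs(namespace).List(context.TODO(), metav1.ListOptions{})
require.NoError(t, err)
require.Len(t, jobs.Items, 1)
require.NotNil(t, jobs.Items[0].Spec.ActiveDeadlineSeconds)
require.Equal(t, int64(timeout.Seconds()), *jobs.Items[0].Spec.ActiveDeadlineSeconds)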

BeforeEach(func() {
	ns = &corev1.Namespace{}
	ns.SetName(genName("ns-"))
	Expect(ctx.Ctx().Client().Create(context.Background(), ns)).To(Succeed())
Member

I think we typically want all interactions with the cluster to use Eventually.

Contributor Author

I thought Eventually was used for polling situations where we expect to see a certain condition eventually.
Are we worried that all CRUD operations could block for a long time? We do have an overall test timeout for that case.

Plus I might be wrong but I don't see that we've been using Eventually for ensuring Create,Update,Delete operations succeed in any of our other tests.

Member

I thought Eventually was used for polling situations where we expect to see a certain condition eventually.

Yes. We eventually want the create to be successful or tell us the resource has already been created.

Are we worried that all CRUD operations could block for a long time?

Not really. We're more worried that any operation that reaches out over the network will experience transient errors at some point during our tests, yielding a false negative. This was much more pronounced on the OpenShift CI. I don't expect many errors of that nature running against our in-process kind infra, but I do think it's good practice with long-running e2e suites nonetheless.

Plus I might be wrong but I don't see that we've been using Eventually for ensuring Create,Update,Delete operations succeed in any of our other tests.

We do -- find . -type f -print | xargs fgrep 'Eventually' in the e2e directory and you should find some
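
For example, the namespace creation quoted above could be wrapped like this (a sketch using the helpers already shown; not necessarily the exact change that was made):

BeforeEach(func() {
	ns = &corev1.Namespace{}
	ns.SetName(genName("ns-"))
	// Retry on transient API errors instead of failing the spec on the first error.
	Eventually(func() error {
		return ctx.Ctx().Client().Create(context.Background(), ns)
	}).Should(Succeed())
})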

Contributor Author

Makes sense. It's actually used pretty much everywhere on a closer look. I'll update these as well.

Contributor Author

Done

@hasbro17
Contributor Author

Not sure why this unit test is failing all of a sudden:

--- FAIL: TestSyncOperatorGroups (1.40s)
    --- FAIL: TestSyncOperatorGroups/AllNamespaces/CSVPresent/InstallModeNotSupported (0.14s)
        operator_test.go:4489: 
            	Error Trace:	operator_test.go:4489
            	Error:      	Received unexpected error:
            	            	clusterroles.rbac.authorization.k8s.io "csv-role" already exists
            	Test:       	TestSyncOperatorGroups/AllNamespaces/CSVPresent/InstallModeNotSupported

Can't reproduce this locally but still looking.

@hasbro17
Contributor Author

Seems like it was a flake.

@njhale
Member

njhale commented May 12, 2021

Nice job!

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label May 12, 2021
@openshift-merge-robot openshift-merge-robot merged commit 01af7a3 into operator-framework:master May 12, 2021
@hasbro17
Contributor Author

/cherry-pick release-4.7

@openshift-cherrypick-robot

@hasbro17: #2093 failed to apply on top of branch "release-4.7":

Applying: Fail InstallPlan on bundle unpack timeout
Using index info to reconstruct a base tree...
M	cmd/catalog/main.go
M	pkg/controller/operators/catalog/operator.go
M	test/e2e/installplan_e2e_test.go
Falling back to patching base and 3-way merge...
Auto-merging test/e2e/installplan_e2e_test.go
Auto-merging pkg/controller/operators/catalog/operator.go
CONFLICT (content): Merge conflict in pkg/controller/operators/catalog/operator.go
Auto-merging cmd/catalog/main.go
CONFLICT (content): Merge conflict in cmd/catalog/main.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 Fail InstallPlan on bundle unpack timeout
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

In response to this:

/cherry-pick release-4.7

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

anik120 added a commit to anik120/operator-lifecycle-manager that referenced this pull request Jun 24, 2021
In operator-framework#2077, a new phase `Failed` was introduced for InstallPlans, and failure to
detect a valid OperatorGroup (OG) or a Service Account (SA) for the namespace
the InstallPlan was being created in would transition the InstallPlan to the
`Failed` state, i.e. failure to detect these resources when the InstallPlan was
reconciled the first time was considered a permanent failure. This is a regression
from the previous behavior of InstallPlans, where failure to detect the OG/SA would
requeue the InstallPlan for reconciliation, so creating the required resources before
the retry limit of the informer queue was reached would transition the InstallPlan
from the `Installing` phase to the `Complete` phase (unless the bundle unpacking step
failed, in which case operator-framework#2093 introduced transitioning the InstallPlan to the `Failed`
phase).

This regression introduced oddities for users who have infra built that applies a
set of manifests simultaneously to install an operator, i.e. a Subscription to
an operator (which creates InstallPlans) along with the required OG/SAs. In those cases,
whenever there was a delay in the reconciliation of the OG/SA, the InstallPlan would
be transitioned to a state of permanent failure.

This PR:
* Removes the logic that transitioned the InstallPlan to `Failed`. Instead, the
InstallPlan will again be requeued for any reconciliation error.

* Introduces logic to bubble up reconciliation errors through the InstallPlan's
status.Conditions, e.g.:

When no OperatorGroup is detected:

```
conditions:
    - lastTransitionTime: "2021-06-23T18:16:00Z"
      lastUpdateTime: "2021-06-23T18:16:16Z"
      message: attenuated service account query failed - no operator group found that
        is managing this namespace
      reason: InstallCheckFailed
      status: "False"
      type: Installed
```

Then when a valid OperatorGroup is created:

```
conditions:
    - lastTransitionTime: "2021-06-23T18:33:37Z"
      lastUpdateTime: "2021-06-23T18:33:37Z"
      status: "True"
      type: Installed
```