
Indicate invalid OperatorGroup on InstallPlan status #2077


Conversation

hasbro17 (Contributor) commented Apr 6, 2021

Description of the change:
The InstallPlan reconciler/sync now sets the InstallPlan phase to Failed, along with a status condition message,
when it sees an invalid OperatorGroup.
An invalid OperatorGroup is one of the following (a minimal sketch of these checks follows the list):

  • No OperatorGroups in the InstallPlan's namespace
  • Multiple OperatorGroups in the InstallPlan's namespace
  • An incorrect or non-existent ServiceAccount name specified on the OperatorGroup
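For illustration, here is a minimal sketch of those checks. The function name and message wording are hypothetical, not OLM's actual implementation; the OperatorGroup fields (spec.serviceAccountName, status.serviceAccountRef) are the ones discussed later in this thread:

```go
package sketch

import (
	"fmt"

	v1 "github.com/operator-framework/api/pkg/operators/v1"
)

// validateOperatorGroup sketches the three invalid-OperatorGroup cases
// listed above; the name and messages are illustrative only.
func validateOperatorGroup(ogs []*v1.OperatorGroup) error {
	switch len(ogs) {
	case 0:
		// No OperatorGroup in the InstallPlan's namespace.
		return fmt.Errorf("invalid operator group - no operator group found that is managing this namespace")
	case 1:
		og := ogs[0]
		// A serviceAccountName was specified, but the referenced
		// ServiceAccount was never resolved onto the status.
		if og.Spec.ServiceAccountName != "" && og.Status.ServiceAccountRef == nil {
			return fmt.Errorf("invalid operator group - service account %q not found", og.Spec.ServiceAccountName)
		}
		return nil
	default:
		// More than one OperatorGroup in the namespace.
		return fmt.Errorf("invalid operator group - %d operator groups found in namespace", len(ogs))
	}
}
```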

Motivation for the change:
Previously, an InstallPlan would stay in the Installing phase indefinitely if the InstallPlan sync encountered an invalid OperatorGroup.
With this change, the failure is readily apparent, and the InstallPlan status condition is also propagated to the Subscription's status conditions.

apiVersion: operators.coreos.com/v1alpha1
kind: InstallPlan
status:
  conditions:
    - lastTransitionTime: '2021-04-06T01:10:01Z'
      lastUpdateTime: '2021-04-06T01:10:01Z'
      message: >-
        invalid operator group - no operator group found that
        is managing this namespace
      reason: InstallCheckFailed
      status: 'False'
      type: Installed
  phase: Failed

apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
...
status:
  ...
  conditions:
    ...
    - lastTransitionTime: '2021-04-06T01:29:16Z'
      reason: InstallCheckFailed
      status: 'True'
      type: InstallPlanFailed

See feature request: https://issues.redhat.com/browse/OLM-2116

Reviewer Checklist

  • Implementation matches the proposed design, or proposal is updated to match implementation
  • Sufficient unit test coverage
  • Sufficient end-to-end test coverage
  • Docs updated or added to /doc
  • Commit messages sensible and descriptive

The InstallPlan reconciler/sync will now update the InstallPlan phase to Failed
if it sees an invalid OperatorGroup.
@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 6, 2021
hasbro17 (Contributor, Author) commented Apr 6, 2021

/retest

@hasbro17 hasbro17 changed the title WIP: Indicate invalid OperatorGroup on InstallPlan status Indicate invalid OperatorGroup on InstallPlan status Apr 6, 2021
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 6, 2021
hasbro17 (Contributor, Author) commented Apr 6, 2021

Looking into the e2e test failures and going to add some e2e tests myself, but I think the implementation should be okay for a preliminary review.

Comment on lines 1334 to 1335
if plan.Status.Phase == v1alpha1.InstallPlanPhaseFailed {
	return
}
hasbro17 (Contributor, Author):

This seemed like an obvious step to me but I'm probably missing something if it wasn't here before.

Is the InstallPlan sync ever expected to recover back from a Failed InstallPlan? I thought that was terminal and you have to start over with a new one.

kuiwang02 commented:

@hasbro17 For "An incorrect or non-existent ServiceAccount name specified on the OperatorGroup", the error message does not seem correct. We expect "please make sure the service account exists...", but it is "invalid operator group - no operator group found that is managing this namespace". Here is the example:

[root@preserve-olm-env OCP-40958]# cat teiidcatsrc.yaml
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: teiid
  namespace: default
spec:
  displayName: "teiid Operators"
  image: quay.io/kuiwang/teiid-index:1906056
  publisher: QE
  sourceType: grpc
[root@preserve-olm-env OCP-40958]# oc apply -f teiidcatsrc.yaml
catalogsource.operators.coreos.com/teiid created
[root@preserve-olm-env OCP-40958]# oc get pod
NAME          READY   STATUS    RESTARTS   AGE
teiid-t2rtq   0/1     Running   0          7s
[root@preserve-olm-env OCP-40958]# cat ogwrongsa.yaml
kind: OperatorGroup
apiVersion: operators.coreos.com/v1
metadata:
  name: ogwrongsa
  namespace: default
spec:
  serviceAccountName: foo
  targetNamespaces:
  - default
[root@preserve-olm-env OCP-40958]# oc apply -f ogwrongsa.yaml
operatorgroup.operators.coreos.com/ogwrongsa created
[root@preserve-olm-env OCP-40958]# oc get og ogwrongsa -o yaml
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"operators.coreos.com/v1","kind":"OperatorGroup","metadata":{"annotations":{},"name":"ogwrongsa","namespace":"default"},"spec":{"serviceAccountName":"foo","targetNamespaces":["default"]}}
  creationTimestamp: "2021-04-08T03:09:14Z"
  generation: 1
  ...
  name: ogwrongsa
  namespace: default
  resourceVersion: "45889"
  uid: a5420f23-d7d9-46a0-95ca-97586300de68
spec:
  serviceAccountName: foo
  targetNamespaces:
  - default
[root@preserve-olm-env OCP-40958]# cat teiidsub.yaml
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: teiid
  namespace: default
spec:
  source: teiid
  sourceNamespace: default

  channel: beta
  installPlanApproval: Automatic
  name: teiid
  startingCSV: teiid.v0.4.0
[root@preserve-olm-env OCP-40958]# oc apply -f teiidsub.yaml
subscription.operators.coreos.com/teiid created
[root@preserve-olm-env OCP-40958]# oc get sub
NAME    PACKAGE   SOURCE   CHANNEL
teiid   teiid     teiid    beta
[root@preserve-olm-env OCP-40958]# oc get ip
NAME            CSV            APPROVAL    APPROVED
install-27nd7   teiid.v0.4.0   Automatic   true
[root@preserve-olm-env OCP-40958]# oc get ip install-27nd7 -o=jsonpath='{.status.conditions}'|jq .
[
  {
    "lastTransitionTime": "2021-04-08T03:10:01Z",
    "lastUpdateTime": "2021-04-08T03:10:01Z",
    "message": "invalid operator group - no operator group found that is managing this namespace",
    "reason": "InstallCheckFailed",
    "status": "False",
    "type": "Installed"
  }
]

[root@preserve-olm-env OCP-40958]# oc get sa -A|grep foo

Also added e2e test for missing ServiceAccountRef in OperatorGroup
and reformatted e2e tests to use Gomega matchers
hasbro17 (Contributor, Author) commented Apr 8, 2021

@kuiwang02 That's actually fine, I think. Specifying a non-existent ServiceAccount means the OperatorGroup never gets synced properly to have its status.Namespaces populated. That means the OperatorGroup is not managing that namespace, so the correct error to see is "invalid operator group - no operator group found that is managing this namespace".

op, err := a.serviceAccountSyncer.SyncOperatorGroup(op)
if err != nil {
	logger.Errorf("error updating service account - %v", err)
	return err
}

// A service account has been specified, we need to update the status.
sa, err := s.client.KubernetesInterface().CoreV1().ServiceAccounts(namespace).Get(context.TODO(), serviceAccountName, metav1.GetOptions{})
if err != nil {
	err = fmt.Errorf("failed to get service account, sa=%s %v", serviceAccountName, err)
	return
}

And that's what you can see from your OperatorGroup as well: it has no status.

[root@preserve-olm-env OCP-40958]# oc get og ogwrongsa -o yaml
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"operators.coreos.com/v1","kind":"OperatorGroup","metadata":{"annotations":{},"name":"ogwrongsa","namespace":"default"},"spec":{"serviceAccountName":"foo","targetNamespaces":["default"]}}
  creationTimestamp: "2021-04-08T03:09:14Z"
  generation: 1
  ...
  name: ogwrongsa
  namespace: default
  resourceVersion: "45889"
  uid: a5420f23-d7d9-46a0-95ca-97586300de68
spec:
  serviceAccountName: foo
  targetNamespaces:
  - default

The proper way to test the case of the missing ServiceAccountRef would be to create an OperatorGroup with status.namespaces populated but status.serviceAccountRef missing, e.g.:

apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"operators.coreos.com/v1","kind":"OperatorGroup","metadata":{"annotations":{},"name":"ogwrongsa","namespace":"default"},"spec":{"serviceAccountName":"foo","targetNamespaces":["default"]}}
  creationTimestamp: "2021-04-08T03:09:14Z"
  generation: 1
  ...
  name: ogwrongsa
  namespace: default
  resourceVersion: "45889"
  uid: a5420f23-d7d9-46a0-95ca-97586300de68
spec:
  serviceAccountName: foo
  targetNamespaces:
  - default
status:
  lastUpdated: "2021-04-06T01:00:44Z"
  namespaces:
  - default

I've just added an e2e test for that as well.
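For reference, an assertion in the Gomega style used for these e2e tests might look roughly like the following; fetchInstallPlan, crc, ipName, and testNamespace are hypothetical stand-ins for the suite's real helpers and fixtures:

```go
// Hypothetical sketch, not the suite's actual helpers: poll until the
// InstallPlan reports the Failed phase.
Eventually(func() (v1alpha1.InstallPlanPhase, error) {
	ip, err := fetchInstallPlan(crc, ipName, testNamespace)
	if err != nil {
		return "", err
	}
	return ip.Status.Phase, nil
}).Should(Equal(v1alpha1.InstallPlanPhaseFailed))
```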

kuiwang02 commented:

> @kuiwang02 That's actually fine, I think. Specifying a non-existent ServiceAccount means the OperatorGroup never gets synced properly to have its status.Namespaces populated. […]

@hasbro17 You are right, thanks.


openshift-ci bot commented Apr 8, 2021

@hasbro17: The following tests failed, say /retest to rerun all failed tests:

Test name            Commit    Details  Rerun command
ci/prow/e2e-upgrade  4596d05   link     /test e2e-upgrade
ci/prow/e2e-gcp      b206c08   link     /test e2e-gcp

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

hasbro17 (Contributor, Author) commented Apr 8, 2021

Fixed the failing e2e test that had an incorrect Gomega assertion in the previous commit.

exdx (Member) commented Apr 8, 2021

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Apr 8, 2021
benluddy (Contributor) commented Apr 8, 2021

/approve

openshift-ci-robot (Collaborator):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: benluddy, hasbro17

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 8, 2021
@openshift-merge-robot openshift-merge-robot merged commit b11215a into operator-framework:master Apr 8, 2021
@hasbro17 hasbro17 deleted the invalid-og-on-installplan-status branch April 8, 2021 19:25
jianzhangbjz (Contributor):

/label qe-approved

@openshift-ci openshift-ci bot added the qe-approved Signifies that QE has signed off on this PR label May 27, 2021
anik120 added a commit to anik120/operator-lifecycle-manager that referenced this pull request Jun 24, 2021
In operator-framework#2077, a new phase `Failed` was introduced for InstallPlans: failure to
detect a valid OperatorGroup (OG) or ServiceAccount (SA) in the namespace
the InstallPlan was being created in would transition the InstallPlan to the
`Failed` state, i.e. failure to detect these resources the first time the
InstallPlan was reconciled was considered a permanent failure. This is a regression
from the previous behavior of InstallPlans, where failure to detect the OG/SA would
requeue the InstallPlan for reconciliation, so creating the required resources before
the retry limit of the informer queue was reached would transition the InstallPlan
from the `Installing` phase to the `Complete` phase (unless the bundle unpacking step
failed, in which case operator-framework#2093 introduced transitioning the InstallPlan to the `Failed`
phase).

This regression introduced oddities for users who have infra built that applies a
set of manifests simultaneously to install an operator: a Subscription to the
operator (which creates InstallPlans) along with the required OG/SA. In those cases,
whenever there was a delay in the reconciliation of the OG/SA, the InstallPlan would
be transitioned to a state of permanent failure.

This PR:
* Removes the logic that transitioned the InstallPlan to `Failed`. Instead, the
InstallPlan will again be requeued for any reconciliation error.

* Introduces logic to bubble up reconciliation errors through the InstallPlan's
status.Conditions, e.g.:

When no OperatorGroup is detected:

```
conditions:
    - lastTransitionTime: "2021-06-23T18:16:00Z"
      lastUpdateTime: "2021-06-23T18:16:16Z"
      message: attenuated service account query failed - no operator group found that
        is managing this namespace
      reason: InstallCheckFailed
      status: "False"
      type: Installed
```

Then when a valid OperatorGroup is created:

```
conditions:
    - lastTransitionTime: "2021-06-23T18:33:37Z"
      lastUpdateTime: "2021-06-23T18:33:37Z"
      status: "True"
      type: Installed
```
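To make the described flow concrete, here is a minimal sketch. The function name and wiring are hypothetical, and the condition fields are an assumption against the v1alpha1 InstallPlan types; this is not the literal change:

```go
package sketch

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

	"github.com/operator-framework/api/pkg/operators/v1alpha1"
)

// recordCheckFailure is a hypothetical sketch of the revised flow: surface
// the reconciliation error on status.conditions and return it so the
// InstallPlan is requeued, instead of setting phase = Failed.
func recordCheckFailure(plan *v1alpha1.InstallPlan, err error) error {
	now := metav1.Now()
	plan.Status.SetCondition(v1alpha1.InstallPlanCondition{
		Type:               v1alpha1.InstallPlanInstalled,
		Status:             corev1.ConditionFalse,
		Reason:             v1alpha1.InstallPlanReasonInstallCheckFailed,
		Message:            err.Error(), // e.g. "attenuated service account query failed - ..."
		LastUpdateTime:     &now,
		LastTransitionTime: &now,
	})
	// Deliberately no transition to v1alpha1.InstallPlanPhaseFailed here;
	// returning the error lets the informer queue retry with backoff.
	return err
}
```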
anik120 added a commit to anik120/operator-lifecycle-manager that referenced this pull request Jun 24, 2021
In operator-framework#2077, a new phase `Failed` was introduced for InstallPlans, and failure in
detecting a valid OperatorGroup(OG) or a Service Account(SA) for the namespace
the InstallPlan was being created in would transition the InstallPlan to the
`Failed` state, i.e failure to detected these resources when the InstallPlan was
reconciled the first time was considered a permanant failure. This is a regression
from the previous behavior of InstallPlans where failure to detect OG/SA would
requeue the InstallPlan for reconciliation, so creating the required resources before
the retry limit of the informer queue was reached would transition the InstallPlan
from the `Installing` phase to the `Complete` phase(unless the bundle unpacking step
failed, in which case operator-framework#2093 introduced transitioning the InstallPlan to the `Failed`
phase).

This regression introduced oddities for users who has infra built that applies a
set of manifests simultaneously to install an operator that includes a Subscription to
an operator (that creates InstallPlans) along with the required OG/SAs. In those cases,
whenever there was a delay in the reconciliation of the OG/SA, the InstallPlan would
be transitioned to a state of permanant faliure.

This PR:
* Removes the logic that transitioned the InstallPlan to `Failed`. Instead, the
InstallPlan will again be requeued for any reconciliation error.

* Introduces logic to bubble up reconciliation error through the InstallPlan's
status.Conditions, eg:

When no OperatorGroup is detected:

```
conditions:
    - lastTransitionTime: "2021-06-23T18:16:00Z"
      lastUpdateTime: "2021-06-23T18:16:16Z"
      message: attenuated service account query failed - no operator group found that
        is managing this namespace
      reason: InstallCheckFailed
      status: "False"
      type: Installed
```

Then when a valid OperatorGroup is created:

```
conditions:
    - lastTransitionTime: "2021-06-23T18:33:37Z"
      lastUpdateTime: "2021-06-23T18:33:37Z"
      status: "True"
      type: Installed
```
anik120 added a commit to anik120/operator-lifecycle-manager that referenced this pull request Jun 24, 2021
In operator-framework#2077, a new phase `Failed` was introduced for InstallPlans, and failure in
detecting a valid OperatorGroup(OG) or a Service Account(SA) for the namespace
the InstallPlan was being created in would transition the InstallPlan to the
`Failed` state, i.e failure to detected these resources when the InstallPlan was
reconciled the first time was considered a permanant failure. This is a regression
from the previous behavior of InstallPlans where failure to detect OG/SA would
requeue the InstallPlan for reconciliation, so creating the required resources before
the retry limit of the informer queue was reached would transition the InstallPlan
from the `Installing` phase to the `Complete` phase(unless the bundle unpacking step
failed, in which case operator-framework#2093 introduced transitioning the InstallPlan to the `Failed`
phase).

This regression introduced oddities for users who has infra built that applies a
set of manifests simultaneously to install an operator that includes a Subscription to
an operator (that creates InstallPlans) along with the required OG/SAs. In those cases,
whenever there was a delay in the reconciliation of the OG/SA, the InstallPlan would
be transitioned to a state of permanant faliure.

This PR:
* Removes the logic that transitioned the InstallPlan to `Failed`. Instead, the
InstallPlan will again be requeued for any reconciliation error.

* Introduces logic to bubble up reconciliation error through the InstallPlan's
status.Conditions, eg:

When no OperatorGroup is detected:

```
conditions:
    - lastTransitionTime: "2021-06-23T18:16:00Z"
      lastUpdateTime: "2021-06-23T18:16:16Z"
      message: attenuated service account query failed - no operator group found that
        is managing this namespace
      reason: InstallCheckFailed
      status: "False"
      type: Installed
```

Then when a valid OperatorGroup is created:

```
conditions:
    - lastTransitionTime: "2021-06-23T18:33:37Z"
      lastUpdateTime: "2021-06-23T18:33:37Z"
      status: "True"
      type: Installed
```

Signed-off-by: Anik Bhattacharjee <[email protected]>
anik120 added a commit to anik120/operator-lifecycle-manager that referenced this pull request Jun 25, 2021
In operator-framework#2077, a new phase `Failed` was introduced for InstallPlans, and failure in
detecting a valid OperatorGroup(OG) or a Service Account(SA) for the namespace
the InstallPlan was being created in would transition the InstallPlan to the
`Failed` state, i.e failure to detected these resources when the InstallPlan was
reconciled the first time was considered a permanant failure. This is a regression
from the previous behavior of InstallPlans where failure to detect OG/SA would
requeue the InstallPlan for reconciliation, so creating the required resources before
the retry limit of the informer queue was reached would transition the InstallPlan
from the `Installing` phase to the `Complete` phase(unless the bundle unpacking step
failed, in which case operator-framework#2093 introduced transitioning the InstallPlan to the `Failed`
phase).

This regression introduced oddities for users who has infra built that applies a
set of manifests simultaneously to install an operator that includes a Subscription to
an operator (that creates InstallPlans) along with the required OG/SAs. In those cases,
whenever there was a delay in the reconciliation of the OG/SA, the InstallPlan would
be transitioned to a state of permanant faliure.

This PR:
* Removes the logic that transitioned the InstallPlan to `Failed`. Instead, the
InstallPlan will again be requeued for any reconciliation error.

* Introduces logic to bubble up reconciliation error through the InstallPlan's
status.Conditions, eg:

When no OperatorGroup is detected:

```
conditions:
    - lastTransitionTime: "2021-06-23T18:16:00Z"
      lastUpdateTime: "2021-06-23T18:16:16Z"
      message: attenuated service account query failed - no operator group found that
        is managing this namespace
      reason: InstallCheckFailed
      status: "False"
      type: Installed
```

Then when a valid OperatorGroup is created:

```
conditions:
    - lastTransitionTime: "2021-06-23T18:33:37Z"
      lastUpdateTime: "2021-06-23T18:33:37Z"
      status: "True"
      type: Installed
```

Signed-off-by: Anik Bhattacharjee <[email protected]>
anik120 added a commit to anik120/operator-lifecycle-manager that referenced this pull request Jun 25, 2021
In operator-framework#2077, a new phase `Failed` was introduced for InstallPlans, and failure in
detecting a valid OperatorGroup(OG) or a Service Account(SA) for the namespace
the InstallPlan was being created in would transition the InstallPlan to the
`Failed` state, i.e failure to detected these resources when the InstallPlan was
reconciled the first time was considered a permanant failure. This is a regression
from the previous behavior of InstallPlans where failure to detect OG/SA would
requeue the InstallPlan for reconciliation, so creating the required resources before
the retry limit of the informer queue was reached would transition the InstallPlan
from the `Installing` phase to the `Complete` phase(unless the bundle unpacking step
failed, in which case operator-framework#2093 introduced transitioning the InstallPlan to the `Failed`
phase).

This regression introduced oddities for users who has infra built that applies a
set of manifests simultaneously to install an operator that includes a Subscription to
an operator (that creates InstallPlans) along with the required OG/SAs. In those cases,
whenever there was a delay in the reconciliation of the OG/SA, the InstallPlan would
be transitioned to a state of permanant faliure.

This PR:
* Removes the logic that transitioned the InstallPlan to `Failed`. Instead, the
InstallPlan will again be requeued for any reconciliation error.

* Introduces logic to bubble up reconciliation error through the InstallPlan's
status.Conditions, eg:

When no OperatorGroup is detected:

```
conditions:
    - lastTransitionTime: "2021-06-23T18:16:00Z"
      lastUpdateTime: "2021-06-23T18:16:16Z"
      message: attenuated service account query failed - no operator group found that
        is managing this namespace
      reason: InstallCheckFailed
      status: "False"
      type: Installed
```

Then when a valid OperatorGroup is created:

```
conditions:
    - lastTransitionTime: "2021-06-23T18:33:37Z"
      lastUpdateTime: "2021-06-23T18:33:37Z"
      status: "True"
      type: Installed
```

Signed-off-by: Anik Bhattacharjee <[email protected]>
anik120 added a commit to anik120/operator-lifecycle-manager that referenced this pull request Jun 25, 2021
In operator-framework#2077, a new phase `Failed` was introduced for InstallPlans, and failure in
detecting a valid OperatorGroup(OG) or a Service Account(SA) for the namespace
the InstallPlan was being created in would transition the InstallPlan to the
`Failed` state, i.e failure to detected these resources when the InstallPlan was
reconciled the first time was considered a permanant failure. This is a regression
from the previous behavior of InstallPlans where failure to detect OG/SA would
requeue the InstallPlan for reconciliation, so creating the required resources before
the retry limit of the informer queue was reached would transition the InstallPlan
from the `Installing` phase to the `Complete` phase(unless the bundle unpacking step
failed, in which case operator-framework#2093 introduced transitioning the InstallPlan to the `Failed`
phase).

This regression introduced oddities for users who has infra built that applies a
set of manifests simultaneously to install an operator that includes a Subscription to
an operator (that creates InstallPlans) along with the required OG/SAs. In those cases,
whenever there was a delay in the reconciliation of the OG/SA, the InstallPlan would
be transitioned to a state of permanant faliure.

This PR:
* Removes the logic that transitioned the InstallPlan to `Failed`. Instead, the
InstallPlan will again be requeued for any reconciliation error.

* Introduces logic to bubble up reconciliation error through the InstallPlan's
status.Conditions, eg:

When no OperatorGroup is detected:

```
conditions:
    - lastTransitionTime: "2021-06-23T18:16:00Z"
      lastUpdateTime: "2021-06-23T18:16:16Z"
      message: attenuated service account query failed - no operator group found that
        is managing this namespace
      reason: InstallCheckFailed
      status: "False"
      type: Installed
```

Then when a valid OperatorGroup is created:

```
conditions:
    - lastTransitionTime: "2021-06-23T18:33:37Z"
      lastUpdateTime: "2021-06-23T18:33:37Z"
      status: "True"
      type: Installed
```

Signed-off-by: Anik Bhattacharjee <[email protected]>
anik120 added a commit to anik120/operator-lifecycle-manager that referenced this pull request Jun 30, 2021
In operator-framework#2077, a new phase `Failed` was introduced for InstallPlans, and failure in
detecting a valid OperatorGroup(OG) or a Service Account(SA) for the namespace
the InstallPlan was being created in would transition the InstallPlan to the
`Failed` state, i.e failure to detected these resources when the InstallPlan was
reconciled the first time was considered a permanant failure. This is a regression
from the previous behavior of InstallPlans where failure to detect OG/SA would
requeue the InstallPlan for reconciliation, so creating the required resources before
the retry limit of the informer queue was reached would transition the InstallPlan
from the `Installing` phase to the `Complete` phase(unless the bundle unpacking step
failed, in which case operator-framework#2093 introduced transitioning the InstallPlan to the `Failed`
phase).

This regression introduced oddities for users who has infra built that applies a
set of manifests simultaneously to install an operator that includes a Subscription to
an operator (that creates InstallPlans) along with the required OG/SAs. In those cases,
whenever there was a delay in the reconciliation of the OG/SA, the InstallPlan would
be transitioned to a state of permanant faliure.

This PR:
* Removes the logic that transitioned the InstallPlan to `Failed`. Instead, the
InstallPlan will again be requeued for any reconciliation error.

* Introduces logic to bubble up reconciliation error through the InstallPlan's
status.Conditions, eg:

When no OperatorGroup is detected:

```
conditions:
    - lastTransitionTime: "2021-06-23T18:16:00Z"
      lastUpdateTime: "2021-06-23T18:16:16Z"
      message: attenuated service account query failed - no operator group found that
        is managing this namespace
      reason: InstallCheckFailed
      status: "False"
      type: Installed
```

Then when a valid OperatorGroup is created:

```
conditions:
    - lastTransitionTime: "2021-06-23T18:33:37Z"
      lastUpdateTime: "2021-06-23T18:33:37Z"
      status: "True"
      type: Installed
```

Signed-off-by: Anik Bhattacharjee <[email protected]>
anik120 added a commit to anik120/operator-lifecycle-manager that referenced this pull request Jul 4, 2021
In operator-framework#2077, a new phase `Failed` was introduced for InstallPlans, and failure in
detecting a valid OperatorGroup(OG) or a Service Account(SA) for the namespace
the InstallPlan was being created in would transition the InstallPlan to the
`Failed` state, i.e failure to detected these resources when the InstallPlan was
reconciled the first time was considered a permanant failure. This is a regression
from the previous behavior of InstallPlans where failure to detect OG/SA would
requeue the InstallPlan for reconciliation, so creating the required resources before
the retry limit of the informer queue was reached would transition the InstallPlan
from the `Installing` phase to the `Complete` phase(unless the bundle unpacking step
failed, in which case operator-framework#2093 introduced transitioning the InstallPlan to the `Failed`
phase).

This regression introduced oddities for users who has infra built that applies a
set of manifests simultaneously to install an operator that includes a Subscription to
an operator (that creates InstallPlans) along with the required OG/SAs. In those cases,
whenever there was a delay in the reconciliation of the OG/SA, the InstallPlan would
be transitioned to a state of permanant faliure.

This PR:
* Removes the logic that transitioned the InstallPlan to `Failed`. Instead, the
InstallPlan will again be requeued for any reconciliation error.

* Introduces logic to bubble up reconciliation error through the InstallPlan's
status.Conditions, eg:

When no OperatorGroup is detected:

```
conditions:
    - lastTransitionTime: "2021-06-23T18:16:00Z"
      lastUpdateTime: "2021-06-23T18:16:16Z"
      message: attenuated service account query failed - no operator group found that
        is managing this namespace
      reason: InstallCheckFailed
      status: "False"
      type: Installed
```

Then when a valid OperatorGroup is created:

```
conditions:
    - lastTransitionTime: "2021-06-23T18:33:37Z"
      lastUpdateTime: "2021-06-23T18:33:37Z"
      status: "True"
      type: Installed
```

Signed-off-by: Anik Bhattacharjee <[email protected]>
anik120 added a commit to anik120/operator-lifecycle-manager that referenced this pull request Jul 4, 2021
In operator-framework#2077, a new phase `Failed` was introduced for InstallPlans, and failure in
detecting a valid OperatorGroup(OG) or a Service Account(SA) for the namespace
the InstallPlan was being created in would transition the InstallPlan to the
`Failed` state, i.e failure to detected these resources when the InstallPlan was
reconciled the first time was considered a permanant failure. This is a regression
from the previous behavior of InstallPlans where failure to detect OG/SA would
requeue the InstallPlan for reconciliation, so creating the required resources before
the retry limit of the informer queue was reached would transition the InstallPlan
from the `Installing` phase to the `Complete` phase(unless the bundle unpacking step
failed, in which case operator-framework#2093 introduced transitioning the InstallPlan to the `Failed`
phase).

This regression introduced oddities for users who has infra built that applies a
set of manifests simultaneously to install an operator that includes a Subscription to
an operator (that creates InstallPlans) along with the required OG/SAs. In those cases,
whenever there was a delay in the reconciliation of the OG/SA, the InstallPlan would
be transitioned to a state of permanant faliure.

This PR:
* Removes the logic that transitioned the InstallPlan to `Failed`. Instead, the
InstallPlan will again be requeued for any reconciliation error.

* Introduces logic to bubble up reconciliation error through the InstallPlan's
status.Conditions, eg:

When no OperatorGroup is detected:

```
conditions:
    - lastTransitionTime: "2021-06-23T18:16:00Z"
      lastUpdateTime: "2021-06-23T18:16:16Z"
      message: attenuated service account query failed - no operator group found that
        is managing this namespace
      reason: InstallCheckFailed
      status: "False"
      type: Installed
```

Then when a valid OperatorGroup is created:

```
conditions:
    - lastTransitionTime: "2021-06-23T18:33:37Z"
      lastUpdateTime: "2021-06-23T18:33:37Z"
      status: "True"
      type: Installed
```

Signed-off-by: Anik Bhattacharjee <[email protected]>
anik120 added a commit to anik120/operator-lifecycle-manager that referenced this pull request Jul 7, 2021
In operator-framework#2077, a new phase `Failed` was introduced for InstallPlans, and failure in
detecting a valid OperatorGroup(OG) or a Service Account(SA) for the namespace
the InstallPlan was being created in would transition the InstallPlan to the
`Failed` state, i.e failure to detected these resources when the InstallPlan was
reconciled the first time was considered a permanant failure. This is a regression
from the previous behavior of InstallPlans where failure to detect OG/SA would
requeue the InstallPlan for reconciliation, so creating the required resources before
the retry limit of the informer queue was reached would transition the InstallPlan
from the `Installing` phase to the `Complete` phase(unless the bundle unpacking step
failed, in which case operator-framework#2093 introduced transitioning the InstallPlan to the `Failed`
phase).

This regression introduced oddities for users who has infra built that applies a
set of manifests simultaneously to install an operator that includes a Subscription to
an operator (that creates InstallPlans) along with the required OG/SAs. In those cases,
whenever there was a delay in the reconciliation of the OG/SA, the InstallPlan would
be transitioned to a state of permanant faliure.

This PR:
* Removes the logic that transitioned the InstallPlan to `Failed`. Instead, the
InstallPlan will again be requeued for any reconciliation error.

* Introduces logic to bubble up reconciliation error through the InstallPlan's
status.Conditions, eg:

When no OperatorGroup is detected:

```
conditions:
    - lastTransitionTime: "2021-06-23T18:16:00Z"
      lastUpdateTime: "2021-06-23T18:16:16Z"
      message: attenuated service account query failed - no operator group found that
        is managing this namespace
      reason: InstallCheckFailed
      status: "False"
      type: Installed
```

Then when a valid OperatorGroup is created:

```
conditions:
    - lastTransitionTime: "2021-06-23T18:33:37Z"
      lastUpdateTime: "2021-06-23T18:33:37Z"
      status: "True"
      type: Installed
```

Signed-off-by: Anik Bhattacharjee <[email protected]>
anik120 added a commit to anik120/operator-lifecycle-manager that referenced this pull request Jul 7, 2021
In operator-framework#2077, a new phase `Failed` was introduced for InstallPlans, and failure in
detecting a valid OperatorGroup(OG) or a Service Account(SA) for the namespace
the InstallPlan was being created in would transition the InstallPlan to the
`Failed` state, i.e failure to detected these resources when the InstallPlan was
reconciled the first time was considered a permanant failure. This is a regression
from the previous behavior of InstallPlans where failure to detect OG/SA would
requeue the InstallPlan for reconciliation, so creating the required resources before
the retry limit of the informer queue was reached would transition the InstallPlan
from the `Installing` phase to the `Complete` phase(unless the bundle unpacking step
failed, in which case operator-framework#2093 introduced transitioning the InstallPlan to the `Failed`
phase).

This regression introduced oddities for users who has infra built that applies a
set of manifests simultaneously to install an operator that includes a Subscription to
an operator (that creates InstallPlans) along with the required OG/SAs. In those cases,
whenever there was a delay in the reconciliation of the OG/SA, the InstallPlan would
be transitioned to a state of permanant faliure.

This PR:
* Removes the logic that transitioned the InstallPlan to `Failed`. Instead, the
InstallPlan will again be requeued for any reconciliation error.

* Introduces logic to bubble up reconciliation error through the InstallPlan's
status.Conditions, eg:

When no OperatorGroup is detected:

```
conditions:
    - lastTransitionTime: "2021-06-23T18:16:00Z"
      lastUpdateTime: "2021-06-23T18:16:16Z"
      message: attenuated service account query failed - no operator group found that
        is managing this namespace
      reason: InstallCheckFailed
      status: "False"
      type: Installed
```

Then when a valid OperatorGroup is created:

```
conditions:
    - lastTransitionTime: "2021-06-23T18:33:37Z"
      lastUpdateTime: "2021-06-23T18:33:37Z"
      status: "True"
      type: Installed
```

Signed-off-by: Anik Bhattacharjee <[email protected]>