
Indicate invalid OperatorGroup on InstallPlan status #2077


Conversation

hasbro17 (Contributor) commented Apr 6, 2021

Description of the change:
The InstallPlan reconciler/sync now sets the InstallPlan phase to Failed, along with a status condition message,
when it sees an invalid OperatorGroup.
An invalid OperatorGroup is one of the following (a minimal sketch of these checks follows the list):

  • No OperatorGroups in the InstallPlan's namespace
  • Multiple OperatorGroups in the InstallPlan's namespace
  • An incorrect or non-existent ServiceAccount name specified on the OperatorGroup
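For illustration, here is a minimal sketch of those checks. The function name and message wording are hypothetical, not OLM's actual implementation; the OperatorGroup fields (spec.serviceAccountName, status.serviceAccountRef) are the ones discussed later in this thread:

```go
package sketch

import (
	"fmt"

	v1 "github.com/operator-framework/api/pkg/operators/v1"
)

// validateOperatorGroup sketches the three invalid-OperatorGroup cases
// listed above; the name and messages are illustrative only.
func validateOperatorGroup(ogs []*v1.OperatorGroup) error {
	switch len(ogs) {
	case 0:
		// No OperatorGroup in the InstallPlan's namespace.
		return fmt.Errorf("invalid operator group - no operator group found that is managing this namespace")
	case 1:
		og := ogs[0]
		// A serviceAccountName was specified, but the referenced
		// ServiceAccount was never resolved onto the status.
		if og.Spec.ServiceAccountName != "" && og.Status.ServiceAccountRef == nil {
			return fmt.Errorf("invalid operator group - service account %q not found", og.Spec.ServiceAccountName)
		}
		return nil
	default:
		// More than one OperatorGroup in the namespace.
		return fmt.Errorf("invalid operator group - %d operator groups found in namespace", len(ogs))
	}
}
```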

Motivation for the change:
Previously, an InstallPlan would stay in the Installing phase indefinitely if the InstallPlan sync encountered an invalid OperatorGroup.
With this change, the failure is readily apparent, and the InstallPlan status condition is also propagated to the Subscription's status conditions.

apiVersion: operators.coreos.com/v1alpha1
kind: InstallPlan
status:
  conditions:
    - lastTransitionTime: '2021-04-06T01:10:01Z'
      lastUpdateTime: '2021-04-06T01:10:01Z'
      message: >-
        invalid operator group - no operator group found that
        is managing this namespace
      reason: InstallCheckFailed
      status: 'False'
      type: Installed
  phase: Failed

apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
...
status:
  ...
  conditions:
    ...
    - lastTransitionTime: '2021-04-06T01:29:16Z'
      reason: InstallCheckFailed
      status: 'True'
      type: InstallPlanFailed

See feature request: https://issues.redhat.com/browse/OLM-2116

Reviewer Checklist

  • Implementation matches the proposed design, or proposal is updated to match implementation
  • Sufficient unit test coverage
  • Sufficient end-to-end test coverage
  • Docs updated or added to /doc
  • Commit messages sensible and descriptive

The InstallPlan reconciler/sync will now update the InstallPlan phase to Failed
if it sees an invalid OperatorGroup.
@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 6, 2021
hasbro17 (Contributor, Author) commented Apr 6, 2021

/retest

@hasbro17 hasbro17 changed the title WIP: Indicate invalid OperatorGroup on InstallPlan status Indicate invalid OperatorGroup on InstallPlan status Apr 6, 2021
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 6, 2021
hasbro17 (Contributor, Author) commented Apr 6, 2021

Looking into the e2e test failures and going to add some e2e tests myself, but I think the implementation should be okay for a preliminary review.

Comment on lines 1334 to 1335
if plan.Status.Phase == v1alpha1.InstallPlanPhaseFailed {
	return
}
hasbro17 (Contributor, Author):

This seemed like an obvious step to me but I'm probably missing something if it wasn't here before.

Is the InstallPlan sync ever expected to recover back from a Failed InstallPlan? I thought that was terminal and you have to start over with a new one.

kuiwang02 commented:

@hasbro17 For "An incorrect or non-existent ServiceAccount name specified on the OperatorGroup", the error message does not seem correct. We expect "please make sure the service account exists...", but it is "invalid operator group - no operator group found that is managing this namespace". Here is the example:

[root@preserve-olm-env OCP-40958]# cat teiidcatsrc.yaml
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: teiid
  namespace: default
spec:
  displayName: "teiid Operators"
  image: quay.io/kuiwang/teiid-index:1906056
  publisher: QE
  sourceType: grpc
[root@preserve-olm-env OCP-40958]# oc apply -f teiidcatsrc.yaml
catalogsource.operators.coreos.com/teiid created
[root@preserve-olm-env OCP-40958]# oc get pod
NAME          READY   STATUS    RESTARTS   AGE
teiid-t2rtq   0/1     Running   0          7s
[root@preserve-olm-env OCP-40958]# cat ogwrongsa.yaml
kind: OperatorGroup
apiVersion: operators.coreos.com/v1
metadata:
  name: ogwrongsa
  namespace: default
spec:
  serviceAccountName: foo
  targetNamespaces:
  - default
[root@preserve-olm-env OCP-40958]# oc apply -f ogwrongsa.yaml
operatorgroup.operators.coreos.com/ogwrongsa created
[root@preserve-olm-env OCP-40958]# oc get og ogwrongsa -o yaml
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"operators.coreos.com/v1","kind":"OperatorGroup","metadata":{"annotations":{},"name":"ogwrongsa","namespace":"default"},"spec":{"serviceAccountName":"foo","targetNamespaces":["default"]}}
  creationTimestamp: "2021-04-08T03:09:14Z"
  generation: 1
  ...
  name: ogwrongsa
  namespace: default
  resourceVersion: "45889"
  uid: a5420f23-d7d9-46a0-95ca-97586300de68
spec:
  serviceAccountName: foo
  targetNamespaces:
  - default
[root@preserve-olm-env OCP-40958]# cat teiidsub.yaml
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: teiid
  namespace: default
spec:
  source: teiid
  sourceNamespace: default

  channel: beta
  installPlanApproval: Automatic
  name: teiid
  startingCSV: teiid.v0.4.0
[root@preserve-olm-env OCP-40958]# oc apply -f teiidsub.yaml
subscription.operators.coreos.com/teiid created
[root@preserve-olm-env OCP-40958]# oc get sub
NAME    PACKAGE   SOURCE   CHANNEL
teiid   teiid     teiid    beta
[root@preserve-olm-env OCP-40958]# oc get ip
NAME            CSV            APPROVAL    APPROVED
install-27nd7   teiid.v0.4.0   Automatic   true
[root@preserve-olm-env OCP-40958]# oc get ip install-27nd7 -o=jsonpath='{.status.conditions}'|jq .
[
  {
    "lastTransitionTime": "2021-04-08T03:10:01Z",
    "lastUpdateTime": "2021-04-08T03:10:01Z",
    "message": "invalid operator group - no operator group found that is managing this namespace",
    "reason": "InstallCheckFailed",
    "status": "False",
    "type": "Installed"
  }
]

[root@preserve-olm-env OCP-40958]# oc get sa -A|grep foo

Also added e2e test for missing ServiceAccountRef in OperatorGroup
and reformatted e2e tests to use Gomega matchers
hasbro17 (Contributor, Author) commented Apr 8, 2021

@kuiwang02 That's actually fine, I think. Specifying a non-existent ServiceAccount means the OperatorGroup never gets synced properly to have its status.Namespaces populated. That means the OperatorGroup is not managing that namespace, so the correct error to see is "invalid operator group - no operator group found that is managing this namespace".

op, err := a.serviceAccountSyncer.SyncOperatorGroup(op)
if err != nil {
	logger.Errorf("error updating service account - %v", err)
	return err
}

// A service account has been specified, we need to update the status.
sa, err := s.client.KubernetesInterface().CoreV1().ServiceAccounts(namespace).Get(context.TODO(), serviceAccountName, metav1.GetOptions{})
if err != nil {
	err = fmt.Errorf("failed to get service account, sa=%s %v", serviceAccountName, err)
	return
}

And that's what you can see from your OperatorGroup as well: it has no status.

[root@preserve-olm-env OCP-40958]# oc get og ogwrongsa -o yaml
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"operators.coreos.com/v1","kind":"OperatorGroup","metadata":{"annotations":{},"name":"ogwrongsa","namespace":"default"},"spec":{"serviceAccountName":"foo","targetNamespaces":["default"]}}
  creationTimestamp: "2021-04-08T03:09:14Z"
  generation: 1
  ...
  name: ogwrongsa
  namespace: default
  resourceVersion: "45889"
  uid: a5420f23-d7d9-46a0-95ca-97586300de68
spec:
  serviceAccountName: foo
  targetNamespaces:
  - default

The proper way to test the case of the missing ServiceAccountRef would be to create an OperatorGroup with status.namespaces populated but status.serviceAccountRef missing, e.g.:

apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"operators.coreos.com/v1","kind":"OperatorGroup","metadata":{"annotations":{},"name":"ogwrongsa","namespace":"default"},"spec":{"serviceAccountName":"foo","targetNamespaces":["default"]}}
  creationTimestamp: "2021-04-08T03:09:14Z"
  generation: 1
  ...
  name: ogwrongsa
  namespace: default
  resourceVersion: "45889"
  uid: a5420f23-d7d9-46a0-95ca-97586300de68
spec:
  serviceAccountName: foo
  targetNamespaces:
  - default
status:
  lastUpdated: "2021-04-06T01:00:44Z"
  namespaces:
  - default

I've just added an e2e test for that as well.
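For reference, an assertion in the Gomega style used for these e2e tests might look roughly like the following; fetchInstallPlan, crc, ipName, and testNamespace are hypothetical stand-ins for the suite's real helpers and fixtures:

```go
// Hypothetical sketch, not the suite's actual helpers: poll until the
// InstallPlan reports the Failed phase.
Eventually(func() (v1alpha1.InstallPlanPhase, error) {
	ip, err := fetchInstallPlan(crc, ipName, testNamespace)
	if err != nil {
		return "", err
	}
	return ip.Status.Phase, nil
}).Should(Equal(v1alpha1.InstallPlanPhaseFailed))
```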

kuiwang02 commented:

> @kuiwang02 That's actually fine, I think. Specifying a non-existent ServiceAccount means the OperatorGroup never gets synced properly to have its status.Namespaces populated. […]

@hasbro17 You are right, thanks.


openshift-ci bot commented Apr 8, 2021

@hasbro17: The following tests failed, say /retest to rerun all failed tests:

Test name            Commit    Details  Rerun command
ci/prow/e2e-upgrade  4596d05   link     /test e2e-upgrade
ci/prow/e2e-gcp      b206c08   link     /test e2e-gcp

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

hasbro17 (Contributor, Author) commented Apr 8, 2021

Fixed the failing e2e test that had an incorrect Gomega assertion in the previous commit.

exdx (Member) commented Apr 8, 2021

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Apr 8, 2021
benluddy (Contributor) commented Apr 8, 2021

/approve

openshift-ci-robot (Collaborator):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: benluddy, hasbro17

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 8, 2021
@openshift-merge-robot openshift-merge-robot merged commit b11215a into operator-framework:master Apr 8, 2021
@hasbro17 hasbro17 deleted the invalid-og-on-installplan-status branch April 8, 2021 19:25
jianzhangbjz (Contributor):

/label qe-approved

@openshift-ci openshift-ci bot added the qe-approved Signifies that QE has signed off on this PR label May 27, 2021
anik120 added a commit to anik120/operator-lifecycle-manager that referenced this pull request Jun 24, 2021
In operator-framework#2077, a new phase `Failed` was introduced for InstallPlans: failure to
detect a valid OperatorGroup (OG) or ServiceAccount (SA) in the namespace
the InstallPlan was being created in would transition the InstallPlan to the
`Failed` state, i.e. failure to detect these resources the first time the
InstallPlan was reconciled was considered a permanent failure. This is a regression
from the previous behavior of InstallPlans, where failure to detect the OG/SA would
requeue the InstallPlan for reconciliation, so creating the required resources before
the retry limit of the informer queue was reached would transition the InstallPlan
from the `Installing` phase to the `Complete` phase (unless the bundle unpacking step
failed, in which case operator-framework#2093 introduced transitioning the InstallPlan to the `Failed`
phase).

This regression introduced oddities for users who have infra built that applies a
set of manifests simultaneously to install an operator: a Subscription to the
operator (which creates InstallPlans) along with the required OG/SA. In those cases,
whenever there was a delay in the reconciliation of the OG/SA, the InstallPlan would
be transitioned to a state of permanent failure.

This PR:
* Removes the logic that transitioned the InstallPlan to `Failed`. Instead, the
InstallPlan will again be requeued for any reconciliation error.

* Introduces logic to bubble up reconciliation errors through the InstallPlan's
status.Conditions, e.g.:

When no OperatorGroup is detected:

```
conditions:
    - lastTransitionTime: "2021-06-23T18:16:00Z"
      lastUpdateTime: "2021-06-23T18:16:16Z"
      message: attenuated service account query failed - no operator group found that
        is managing this namespace
      reason: InstallCheckFailed
      status: "False"
      type: Installed
```

Then when a valid OperatorGroup is created:

```
conditions:
    - lastTransitionTime: "2021-06-23T18:33:37Z"
      lastUpdateTime: "2021-06-23T18:33:37Z"
      status: "True"
      type: Installed
```
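To make the described flow concrete, here is a minimal sketch. The function name and wiring are hypothetical, and the condition fields are an assumption against the v1alpha1 InstallPlan types; this is not the literal change:

```go
package sketch

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

	"github.com/operator-framework/api/pkg/operators/v1alpha1"
)

// recordCheckFailure is a hypothetical sketch of the revised flow: surface
// the reconciliation error on status.conditions and return it so the
// InstallPlan is requeued, instead of setting phase = Failed.
func recordCheckFailure(plan *v1alpha1.InstallPlan, err error) error {
	now := metav1.Now()
	plan.Status.SetCondition(v1alpha1.InstallPlanCondition{
		Type:               v1alpha1.InstallPlanInstalled,
		Status:             corev1.ConditionFalse,
		Reason:             v1alpha1.InstallPlanReasonInstallCheckFailed,
		Message:            err.Error(), // e.g. "attenuated service account query failed - ..."
		LastUpdateTime:     &now,
		LastTransitionTime: &now,
	})
	// Deliberately no transition to v1alpha1.InstallPlanPhaseFailed here;
	// returning the error lets the informer queue retry with backoff.
	return err
}
```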
anik120 added a commit to anik120/operator-lifecycle-manager that referenced this pull request Jun 24, 2021
In operator-framework#2077, a new phase `Failed` was introduced for InstallPlans, and failure in
detecting a valid OperatorGroup(OG) or a Service Account(SA) for the namespace
the InstallPlan was being created in would transition the InstallPlan to the
`Failed` state, i.e failure to detected these resources when the InstallPlan was
reconciled the first time was considered a permanant failure. This is a regression
from the previous behavior of InstallPlans where failure to detect OG/SA would
requeue the InstallPlan for reconciliation, so creating the required resources before
the retry limit of the informer queue was reached would transition the InstallPlan
from the `Installing` phase to the `Complete` phase(unless the bundle unpacking step
failed, in which case operator-framework#2093 introduced transitioning the InstallPlan to the `Failed`
phase).

This regression introduced oddities for users who has infra built that applies a
set of manifests simultaneously to install an operator that includes a Subscription to
an operator (that creates InstallPlans) along with the required OG/SAs. In those cases,
whenever there was a delay in the reconciliation of the OG/SA, the InstallPlan would
be transitioned to a state of permanant faliure.

This PR:
* Removes the logic that transitioned the InstallPlan to `Failed`. Instead, the
InstallPlan will again be requeued for any reconciliation error.

* Introduces logic to bubble up reconciliation error through the InstallPlan's
status.Conditions, eg:

When no OperatorGroup is detected:

```
conditions:
    - lastTransitionTime: "2021-06-23T18:16:00Z"
      lastUpdateTime: "2021-06-23T18:16:16Z"
      message: attenuated service account query failed - no operator group found that
        is managing this namespace
      reason: InstallCheckFailed
      status: "False"
      type: Installed
```

Then when a valid OperatorGroup is created:

```
conditions:
    - lastTransitionTime: "2021-06-23T18:33:37Z"
      lastUpdateTime: "2021-06-23T18:33:37Z"
      status: "True"
      type: Installed
```
anik120 added a commit to anik120/operator-lifecycle-manager that referenced this pull request Jun 24, 2021
In operator-framework#2077, a new phase `Failed` was introduced for InstallPlans, and failure in
detecting a valid OperatorGroup(OG) or a Service Account(SA) for the namespace
the InstallPlan was being created in would transition the InstallPlan to the
`Failed` state, i.e failure to detected these resources when the InstallPlan was
reconciled the first time was considered a permanant failure. This is a regression
from the previous behavior of InstallPlans where failure to detect OG/SA would
requeue the InstallPlan for reconciliation, so creating the required resources before
the retry limit of the informer queue was reached would transition the InstallPlan
from the `Installing` phase to the `Complete` phase(unless the bundle unpacking step
failed, in which case operator-framework#2093 introduced transitioning the InstallPlan to the `Failed`
phase).

This regression introduced oddities for users who has infra built that applies a
set of manifests simultaneously to install an operator that includes a Subscription to
an operator (that creates InstallPlans) along with the required OG/SAs. In those cases,
whenever there was a delay in the reconciliation of the OG/SA, the InstallPlan would
be transitioned to a state of permanant faliure.

This PR:
* Removes the logic that transitioned the InstallPlan to `Failed`. Instead, the
InstallPlan will again be requeued for any reconciliation error.

* Introduces logic to bubble up reconciliation error through the InstallPlan's
status.Conditions, eg:

When no OperatorGroup is detected:

```
conditions:
    - lastTransitionTime: "2021-06-23T18:16:00Z"
      lastUpdateTime: "2021-06-23T18:16:16Z"
      message: attenuated service account query failed - no operator group found that
        is managing this namespace
      reason: InstallCheckFailed
      status: "False"
      type: Installed
```

Then when a valid OperatorGroup is created:

```
conditions:
    - lastTransitionTime: "2021-06-23T18:33:37Z"
      lastUpdateTime: "2021-06-23T18:33:37Z"
      status: "True"
      type: Installed
```

Signed-off-by: Anik Bhattacharjee <[email protected]>
anik120 added a commit to anik120/operator-lifecycle-manager that referenced this pull request Jun 25, 2021
In operator-framework#2077, a new phase `Failed` was introduced for InstallPlans, and failure in
detecting a valid OperatorGroup(OG) or a Service Account(SA) for the namespace
the InstallPlan was being created in would transition the InstallPlan to the
`Failed` state, i.e failure to detected these resources when the InstallPlan was
reconciled the first time was considered a permanant failure. This is a regression
from the previous behavior of InstallPlans where failure to detect OG/SA would
requeue the InstallPlan for reconciliation, so creating the required resources before
the retry limit of the informer queue was reached would transition the InstallPlan
from the `Installing` phase to the `Complete` phase(unless the bundle unpacking step
failed, in which case operator-framework#2093 introduced transitioning the InstallPlan to the `Failed`
phase).

This regression introduced oddities for users who has infra built that applies a
set of manifests simultaneously to install an operator that includes a Subscription to
an operator (that creates InstallPlans) along with the required OG/SAs. In those cases,
whenever there was a delay in the reconciliation of the OG/SA, the InstallPlan would
be transitioned to a state of permanant faliure.

This PR:
* Removes the logic that transitioned the InstallPlan to `Failed`. Instead, the
InstallPlan will again be requeued for any reconciliation error.

* Introduces logic to bubble up reconciliation error through the InstallPlan's
status.Conditions, eg:

When no OperatorGroup is detected:

```
conditions:
    - lastTransitionTime: "2021-06-23T18:16:00Z"
      lastUpdateTime: "2021-06-23T18:16:16Z"
      message: attenuated service account query failed - no operator group found that
        is managing this namespace
      reason: InstallCheckFailed
      status: "False"
      type: Installed
```

Then when a valid OperatorGroup is created:

```
conditions:
    - lastTransitionTime: "2021-06-23T18:33:37Z"
      lastUpdateTime: "2021-06-23T18:33:37Z"
      status: "True"
      type: Installed
```

Signed-off-by: Anik Bhattacharjee <[email protected]>
anik120 added a commit to anik120/operator-lifecycle-manager that referenced this pull request Jun 25, 2021
In operator-framework#2077, a new phase `Failed` was introduced for InstallPlans, and failure in
detecting a valid OperatorGroup(OG) or a Service Account(SA) for the namespace
the InstallPlan was being created in would transition the InstallPlan to the
`Failed` state, i.e failure to detected these resources when the InstallPlan was
reconciled the first time was considered a permanant failure. This is a regression
from the previous behavior of InstallPlans where failure to detect OG/SA would
requeue the InstallPlan for reconciliation, so creating the required resources before
the retry limit of the informer queue was reached would transition the InstallPlan
from the `Installing` phase to the `Complete` phase(unless the bundle unpacking step
failed, in which case operator-framework#2093 introduced transitioning the InstallPlan to the `Failed`
phase).

This regression introduced oddities for users who has infra built that applies a
set of manifests simultaneously to install an operator that includes a Subscription to
an operator (that creates InstallPlans) along with the required OG/SAs. In those cases,
whenever there was a delay in the reconciliation of the OG/SA, the InstallPlan would
be transitioned to a state of permanant faliure.

This PR:
* Removes the logic that transitioned the InstallPlan to `Failed`. Instead, the
InstallPlan will again be requeued for any reconciliation error.

* Introduces logic to bubble up reconciliation error through the InstallPlan's
status.Conditions, eg:

When no OperatorGroup is detected:

```
conditions:
    - lastTransitionTime: "2021-06-23T18:16:00Z"
      lastUpdateTime: "2021-06-23T18:16:16Z"
      message: attenuated service account query failed - no operator group found that
        is managing this namespace
      reason: InstallCheckFailed
      status: "False"
      type: Installed
```

Then when a valid OperatorGroup is created:

```
conditions:
    - lastTransitionTime: "2021-06-23T18:33:37Z"
      lastUpdateTime: "2021-06-23T18:33:37Z"
      status: "True"
      type: Installed
```

Signed-off-by: Anik Bhattacharjee <[email protected]>
anik120 added a commit to anik120/operator-lifecycle-manager that referenced this pull request Jun 25, 2021
In operator-framework#2077, a new phase `Failed` was introduced for InstallPlans, and failure in
detecting a valid OperatorGroup(OG) or a Service Account(SA) for the namespace
the InstallPlan was being created in would transition the InstallPlan to the
`Failed` state, i.e failure to detected these resources when the InstallPlan was
reconciled the first time was considered a permanant failure. This is a regression
from the previous behavior of InstallPlans where failure to detect OG/SA would
requeue the InstallPlan for reconciliation, so creating the required resources before
the retry limit of the informer queue was reached would transition the InstallPlan
from the `Installing` phase to the `Complete` phase(unless the bundle unpacking step
failed, in which case operator-framework#2093 introduced transitioning the InstallPlan to the `Failed`
phase).

This regression introduced oddities for users who has infra built that applies a
set of manifests simultaneously to install an operator that includes a Subscription to
an operator (that creates InstallPlans) along with the required OG/SAs. In those cases,
whenever there was a delay in the reconciliation of the OG/SA, the InstallPlan would
be transitioned to a state of permanant faliure.

This PR:
* Removes the logic that transitioned the InstallPlan to `Failed`. Instead, the
InstallPlan will again be requeued for any reconciliation error.

* Introduces logic to bubble up reconciliation error through the InstallPlan's
status.Conditions, eg:

When no OperatorGroup is detected:

```
conditions:
    - lastTransitionTime: "2021-06-23T18:16:00Z"
      lastUpdateTime: "2021-06-23T18:16:16Z"
      message: attenuated service account query failed - no operator group found that
        is managing this namespace
      reason: InstallCheckFailed
      status: "False"
      type: Installed
```

Then when a valid OperatorGroup is created:

```
conditions:
    - lastTransitionTime: "2021-06-23T18:33:37Z"
      lastUpdateTime: "2021-06-23T18:33:37Z"
      status: "True"
      type: Installed
```

Signed-off-by: Anik Bhattacharjee <[email protected]>
anik120 added a commit to anik120/operator-lifecycle-manager that referenced this pull request Jun 30, 2021
In operator-framework#2077, a new phase `Failed` was introduced for InstallPlans, and failure in
detecting a valid OperatorGroup(OG) or a Service Account(SA) for the namespace
the InstallPlan was being created in would transition the InstallPlan to the
`Failed` state, i.e failure to detected these resources when the InstallPlan was
reconciled the first time was considered a permanant failure. This is a regression
from the previous behavior of InstallPlans where failure to detect OG/SA would
requeue the InstallPlan for reconciliation, so creating the required resources before
the retry limit of the informer queue was reached would transition the InstallPlan
from the `Installing` phase to the `Complete` phase(unless the bundle unpacking step
failed, in which case operator-framework#2093 introduced transitioning the InstallPlan to the `Failed`
phase).

This regression introduced oddities for users who has infra built that applies a
set of manifests simultaneously to install an operator that includes a Subscription to
an operator (that creates InstallPlans) along with the required OG/SAs. In those cases,
whenever there was a delay in the reconciliation of the OG/SA, the InstallPlan would
be transitioned to a state of permanant faliure.

This PR:
* Removes the logic that transitioned the InstallPlan to `Failed`. Instead, the
InstallPlan will again be requeued for any reconciliation error.

* Introduces logic to bubble up reconciliation error through the InstallPlan's
status.Conditions, eg:

When no OperatorGroup is detected:

```
conditions:
    - lastTransitionTime: "2021-06-23T18:16:00Z"
      lastUpdateTime: "2021-06-23T18:16:16Z"
      message: attenuated service account query failed - no operator group found that
        is managing this namespace
      reason: InstallCheckFailed
      status: "False"
      type: Installed
```

Then when a valid OperatorGroup is created:

```
conditions:
    - lastTransitionTime: "2021-06-23T18:33:37Z"
      lastUpdateTime: "2021-06-23T18:33:37Z"
      status: "True"
      type: Installed
```

Signed-off-by: Anik Bhattacharjee <[email protected]>
anik120 added a commit to anik120/operator-lifecycle-manager that referenced this pull request Jul 4, 2021
In operator-framework#2077, a new phase `Failed` was introduced for InstallPlans, and failure in
detecting a valid OperatorGroup(OG) or a Service Account(SA) for the namespace
the InstallPlan was being created in would transition the InstallPlan to the
`Failed` state, i.e failure to detected these resources when the InstallPlan was
reconciled the first time was considered a permanant failure. This is a regression
from the previous behavior of InstallPlans where failure to detect OG/SA would
requeue the InstallPlan for reconciliation, so creating the required resources before
the retry limit of the informer queue was reached would transition the InstallPlan
from the `Installing` phase to the `Complete` phase(unless the bundle unpacking step
failed, in which case operator-framework#2093 introduced transitioning the InstallPlan to the `Failed`
phase).

This regression introduced oddities for users who has infra built that applies a
set of manifests simultaneously to install an operator that includes a Subscription to
an operator (that creates InstallPlans) along with the required OG/SAs. In those cases,
whenever there was a delay in the reconciliation of the OG/SA, the InstallPlan would
be transitioned to a state of permanant faliure.

This PR:
* Removes the logic that transitioned the InstallPlan to `Failed`. Instead, the
InstallPlan will again be requeued for any reconciliation error.

* Introduces logic to bubble up reconciliation error through the InstallPlan's
status.Conditions, eg:

When no OperatorGroup is detected:

```
conditions:
    - lastTransitionTime: "2021-06-23T18:16:00Z"
      lastUpdateTime: "2021-06-23T18:16:16Z"
      message: attenuated service account query failed - no operator group found that
        is managing this namespace
      reason: InstallCheckFailed
      status: "False"
      type: Installed
```

Then when a valid OperatorGroup is created:

```
conditions:
    - lastTransitionTime: "2021-06-23T18:33:37Z"
      lastUpdateTime: "2021-06-23T18:33:37Z"
      status: "True"
      type: Installed
```

Signed-off-by: Anik Bhattacharjee <[email protected]>
anik120 added a commit to anik120/operator-lifecycle-manager that referenced this pull request Jul 4, 2021
In operator-framework#2077, a new phase `Failed` was introduced for InstallPlans, and failure in
detecting a valid OperatorGroup(OG) or a Service Account(SA) for the namespace
the InstallPlan was being created in would transition the InstallPlan to the
`Failed` state, i.e failure to detected these resources when the InstallPlan was
reconciled the first time was considered a permanant failure. This is a regression
from the previous behavior of InstallPlans where failure to detect OG/SA would
requeue the InstallPlan for reconciliation, so creating the required resources before
the retry limit of the informer queue was reached would transition the InstallPlan
from the `Installing` phase to the `Complete` phase(unless the bundle unpacking step
failed, in which case operator-framework#2093 introduced transitioning the InstallPlan to the `Failed`
phase).

This regression introduced oddities for users who has infra built that applies a
set of manifests simultaneously to install an operator that includes a Subscription to
an operator (that creates InstallPlans) along with the required OG/SAs. In those cases,
whenever there was a delay in the reconciliation of the OG/SA, the InstallPlan would
be transitioned to a state of permanant faliure.

This PR:
* Removes the logic that transitioned the InstallPlan to `Failed`. Instead, the
InstallPlan will again be requeued for any reconciliation error.

* Introduces logic to bubble up reconciliation error through the InstallPlan's
status.Conditions, eg:

When no OperatorGroup is detected:

```
conditions:
    - lastTransitionTime: "2021-06-23T18:16:00Z"
      lastUpdateTime: "2021-06-23T18:16:16Z"
      message: attenuated service account query failed - no operator group found that
        is managing this namespace
      reason: InstallCheckFailed
      status: "False"
      type: Installed
```

Then when a valid OperatorGroup is created:

```
conditions:
    - lastTransitionTime: "2021-06-23T18:33:37Z"
      lastUpdateTime: "2021-06-23T18:33:37Z"
      status: "True"
      type: Installed
```

Signed-off-by: Anik Bhattacharjee <[email protected]>
anik120 added a commit to anik120/operator-lifecycle-manager that referenced this pull request Jul 7, 2021
In operator-framework#2077, a new phase `Failed` was introduced for InstallPlans, and failure in
detecting a valid OperatorGroup(OG) or a Service Account(SA) for the namespace
the InstallPlan was being created in would transition the InstallPlan to the
`Failed` state, i.e failure to detected these resources when the InstallPlan was
reconciled the first time was considered a permanant failure. This is a regression
from the previous behavior of InstallPlans where failure to detect OG/SA would
requeue the InstallPlan for reconciliation, so creating the required resources before
the retry limit of the informer queue was reached would transition the InstallPlan
from the `Installing` phase to the `Complete` phase(unless the bundle unpacking step
failed, in which case operator-framework#2093 introduced transitioning the InstallPlan to the `Failed`
phase).

This regression introduced oddities for users who has infra built that applies a
set of manifests simultaneously to install an operator that includes a Subscription to
an operator (that creates InstallPlans) along with the required OG/SAs. In those cases,
whenever there was a delay in the reconciliation of the OG/SA, the InstallPlan would
be transitioned to a state of permanant faliure.

This PR:
* Removes the logic that transitioned the InstallPlan to `Failed`. Instead, the
InstallPlan will again be requeued for any reconciliation error.

* Introduces logic to bubble up reconciliation error through the InstallPlan's
status.Conditions, eg:

When no OperatorGroup is detected:

```
conditions:
    - lastTransitionTime: "2021-06-23T18:16:00Z"
      lastUpdateTime: "2021-06-23T18:16:16Z"
      message: attenuated service account query failed - no operator group found that
        is managing this namespace
      reason: InstallCheckFailed
      status: "False"
      type: Installed
```

Then when a valid OperatorGroup is created:

```
conditions:
    - lastTransitionTime: "2021-06-23T18:33:37Z"
      lastUpdateTime: "2021-06-23T18:33:37Z"
      status: "True"
      type: Installed
```

Signed-off-by: Anik Bhattacharjee <[email protected]>
anik120 added a commit to anik120/operator-lifecycle-manager that referenced this pull request Jul 7, 2021
In operator-framework#2077, a new phase `Failed` was introduced for InstallPlans, and failure in
detecting a valid OperatorGroup(OG) or a Service Account(SA) for the namespace
the InstallPlan was being created in would transition the InstallPlan to the
`Failed` state, i.e failure to detected these resources when the InstallPlan was
reconciled the first time was considered a permanant failure. This is a regression
from the previous behavior of InstallPlans where failure to detect OG/SA would
requeue the InstallPlan for reconciliation, so creating the required resources before
the retry limit of the informer queue was reached would transition the InstallPlan
from the `Installing` phase to the `Complete` phase(unless the bundle unpacking step
failed, in which case operator-framework#2093 introduced transitioning the InstallPlan to the `Failed`
phase).

This regression introduced oddities for users who has infra built that applies a
set of manifests simultaneously to install an operator that includes a Subscription to
an operator (that creates InstallPlans) along with the required OG/SAs. In those cases,
whenever there was a delay in the reconciliation of the OG/SA, the InstallPlan would
be transitioned to a state of permanant faliure.

This PR:
* Removes the logic that transitioned the InstallPlan to `Failed`. Instead, the
InstallPlan will again be requeued for any reconciliation error.

* Introduces logic to bubble up reconciliation error through the InstallPlan's
status.Conditions, eg:

When no OperatorGroup is detected:

```
conditions:
    - lastTransitionTime: "2021-06-23T18:16:00Z"
      lastUpdateTime: "2021-06-23T18:16:16Z"
      message: attenuated service account query failed - no operator group found that
        is managing this namespace
      reason: InstallCheckFailed
      status: "False"
      type: Installed
```

Then when a valid OperatorGroup is created:

```
conditions:
    - lastTransitionTime: "2021-06-23T18:33:37Z"
      lastUpdateTime: "2021-06-23T18:33:37Z"
      status: "True"
      type: Installed
```

Signed-off-by: Anik Bhattacharjee <[email protected]>