Skip to content

(fix)InstallPlan: Do not tranisition IP to failed on OG/SA failure #2215

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

anik120
Copy link
Contributor

@anik120 anik120 commented Jun 24, 2021

Description of the change:

In #2077, a new phase Failed was introduced for InstallPlans, and failure in
detecting a valid OperatorGroup(OG) or a Service Account(SA) for the namespace
the InstallPlan was being created in would transition the InstallPlan to the
Failed state, i.e failure to detected these resources when the InstallPlan was
reconciled the first time was considered a permanant failure. This is a regression
from the previous behavior of InstallPlans where failure to detect OG/SA would
requeue the InstallPlan for reconciliation, so creating the required resources before
the retry limit of the informer queue was reached would transition the InstallPlan
from the Installing phase to the Complete phase(unless the bundle unpacking step
failed, in which case #2093 introduced transitioning the InstallPlan to the Failed
phase).

This regression introduced oddities for users who has infra built that applies a
set of manifests simultaneously to install an operator that includes a Subscription to
an operator (that creates InstallPlans) along with the required OG/SAs. In those cases,
whenever there was a delay in the reconciliation of the OG/SA, the InstallPlan would
be transitioned to a state of permanant faliure.

This PR:

  • Removes the logic that transitioned the InstallPlan to Failed. Instead, the
    InstallPlan will again be requeued for any reconciliation error.

  • Introduces logic to bubble up reconciliation error through the InstallPlan's
    status.Conditions, eg:

When no OperatorGroup is detected:

conditions:
    - lastTransitionTime: "2021-06-23T18:16:00Z"
      lastUpdateTime: "2021-06-23T18:16:16Z"
      message: attenuated service account query failed - no operator group found that
        is managing this namespace
      reason: InstallCheckFailed
      status: "False"
      type: Installed

Then when a valid OperatorGroup is created:

conditions:
    - lastTransitionTime: "2021-06-23T18:33:37Z"
      lastUpdateTime: "2021-06-23T18:33:37Z"
      status: "True"
      type: Installed

Reviewer Checklist

  • Implementation matches the proposed design, or proposal is updated to match implementation
  • Sufficient unit test coverage
  • Sufficient end-to-end test coverage
  • Docs updated or added to /doc
  • Commit messages sensible and descriptive

Closes #2207

@anik120 anik120 force-pushed the fix-ip-reconciliation branch from 0d29485 to 6203e29 Compare June 24, 2021 21:11
@anik120
Copy link
Contributor Author

anik120 commented Jun 24, 2021

Note: Removed the e2e test cases that checked if the InstallPlan was being transitioned to the Failed state, instead of modifying them to checked if not state transition was happening, and if condition Installed:False condition is present because the unit test cases are doing the exact same thing, making the e2e test cases superfluous.

@anik120 anik120 force-pushed the fix-ip-reconciliation branch 4 times, most recently from 0c9e5b7 to 2e16b7e Compare June 30, 2021 19:32
@benluddy
Copy link
Contributor

/approve

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 30, 2021
Comment on lines +1404 to +1407
// reset condition/message if it had been set in previous sync. This condition is being reset since any delay in the next steps
// (bundle unpacking/plan step errors being retried for a duration) could lead to this condition sticking around, even after
// the serviceAccountQuerier returns no error since the error has been resolved (by creating the required resources), which would
// be confusing to the user
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably add a unit test case to test this path and check that this holds, i.e pass an IP with the condtionInstalling=false and reason InstallPlanInstalledCheckFailed and check that the condition gets reset with an OG present.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added an e2e test for this instead since an e2e test feels more natural for this.

@anik120 anik120 force-pushed the fix-ip-reconciliation branch 2 times, most recently from 01239e5 to 9723d5d Compare July 4, 2021 00:42

It("should fail an InstallPlan when an OperatorGroup specifies an unsynced ServiceAccount with no ServiceAccountRef", func() {
It("should clear clear up the condition in the InstallPlan status that contains an error message when a valid OperatorGroup is created", func() {
// Create an operatorgroup for the same namespace
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before we create the OG to trigger the removal of the Installed/InstallCheckFailed condition, shouldn't we check if it's present before a valid OG is present?
Since we essentially want to test for the change of that condition.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point, added a check to see if the InstallPlan has the expected condition/message before it is being cleared up.

Comment on lines 3219 to 3221
if fetchedInstallPlan.Status.Phase != v1alpha1.InstallPlanPhaseInstalling {
return false, nil
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this necessary?
I think when we get the IP above with the phase check, it's going to stall (poll 20 mins) until it finds the IP in phase Installing.

fetchInstallPlanWithNamespace(GinkgoT(), crc, installPlanName, ns.GetName(), buildInstallPlanPhaseCheckFunc(operatorsv1alpha1.InstallPlanPhaseInstalling))

err = wait.Poll(pollInterval, pollDuration, func() (bool, error) {
fetchedInstallPlan, err = c.OperatorsV1alpha1().InstallPlans(namespace).Get(context.TODO(), name, metav1.GetOptions{})
if err != nil || fetchedInstallPlan == nil {
return false, err
}
return checkPhase(fetchedInstallPlan), nil

So the fetched IP will always be in phase Installing at this point.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a big fan of the buildInstallPlanPhaseCheckFunc, I think it's an artifact of the pre gikgo test framework we use now. fetchInstallPlan should do one thing, i.e fetch the IP, and the checking of the phase is a separate task. But looks like we're using it in a lot of places, so I'm okay with refactoring the tests later on, and leaving this check out of here for now.

@anik120 anik120 force-pushed the fix-ip-reconciliation branch from 9723d5d to 6aebb08 Compare July 7, 2021 14:45
In operator-framework#2077, a new phase `Failed` was introduced for InstallPlans, and failure in
detecting a valid OperatorGroup(OG) or a Service Account(SA) for the namespace
the InstallPlan was being created in would transition the InstallPlan to the
`Failed` state, i.e failure to detected these resources when the InstallPlan was
reconciled the first time was considered a permanant failure. This is a regression
from the previous behavior of InstallPlans where failure to detect OG/SA would
requeue the InstallPlan for reconciliation, so creating the required resources before
the retry limit of the informer queue was reached would transition the InstallPlan
from the `Installing` phase to the `Complete` phase(unless the bundle unpacking step
failed, in which case operator-framework#2093 introduced transitioning the InstallPlan to the `Failed`
phase).

This regression introduced oddities for users who has infra built that applies a
set of manifests simultaneously to install an operator that includes a Subscription to
an operator (that creates InstallPlans) along with the required OG/SAs. In those cases,
whenever there was a delay in the reconciliation of the OG/SA, the InstallPlan would
be transitioned to a state of permanant faliure.

This PR:
* Removes the logic that transitioned the InstallPlan to `Failed`. Instead, the
InstallPlan will again be requeued for any reconciliation error.

* Introduces logic to bubble up reconciliation error through the InstallPlan's
status.Conditions, eg:

When no OperatorGroup is detected:

```
conditions:
    - lastTransitionTime: "2021-06-23T18:16:00Z"
      lastUpdateTime: "2021-06-23T18:16:16Z"
      message: attenuated service account query failed - no operator group found that
        is managing this namespace
      reason: InstallCheckFailed
      status: "False"
      type: Installed
```

Then when a valid OperatorGroup is created:

```
conditions:
    - lastTransitionTime: "2021-06-23T18:33:37Z"
      lastUpdateTime: "2021-06-23T18:33:37Z"
      status: "True"
      type: Installed
```

Signed-off-by: Anik Bhattacharjee <[email protected]>
@anik120 anik120 force-pushed the fix-ip-reconciliation branch from 6aebb08 to 3a3874b Compare July 7, 2021 14:51
Copy link
Contributor

@hasbro17 hasbro17 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jul 7, 2021
@openshift-ci
Copy link

openshift-ci bot commented Jul 7, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: anik120, benluddy, hasbro17

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot openshift-merge-robot merged commit 2d11302 into operator-framework:master Jul 7, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

invalid operator group - no operator group found that is managing this namespace
4 participants