-
Notifications
You must be signed in to change notification settings - Fork 562
(fix)InstallPlan: Do not tranisition IP to failed on OG/SA failure #2215
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
(fix)InstallPlan: Do not tranisition IP to failed on OG/SA failure #2215
Conversation
0d29485
to
6203e29
Compare
Note: Removed the e2e test cases that checked if the InstallPlan was being transitioned to the |
0c9e5b7
to
2e16b7e
Compare
/approve |
// reset condition/message if it had been set in previous sync. This condition is being reset since any delay in the next steps | ||
// (bundle unpacking/plan step errors being retried for a duration) could lead to this condition sticking around, even after | ||
// the serviceAccountQuerier returns no error since the error has been resolved (by creating the required resources), which would | ||
// be confusing to the user |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably add a unit test case to test this path and check that this holds, i.e pass an IP with the condtionInstalling=false
and reason InstallPlanInstalledCheckFailed
and check that the condition gets reset with an OG present.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added an e2e test for this instead since an e2e test feels more natural for this.
01239e5
to
9723d5d
Compare
|
||
It("should fail an InstallPlan when an OperatorGroup specifies an unsynced ServiceAccount with no ServiceAccountRef", func() { | ||
It("should clear clear up the condition in the InstallPlan status that contains an error message when a valid OperatorGroup is created", func() { | ||
// Create an operatorgroup for the same namespace |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Before we create the OG to trigger the removal of the Installed
/InstallCheckFailed
condition, shouldn't we check if it's present before a valid OG is present?
Since we essentially want to test for the change of that condition.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good point, added a check to see if the InstallPlan has the expected condition/message before it is being cleared up.
test/e2e/installplan_e2e_test.go
Outdated
if fetchedInstallPlan.Status.Phase != v1alpha1.InstallPlanPhaseInstalling { | ||
return false, nil | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this necessary?
I think when we get the IP above with the phase check, it's going to stall (poll 20 mins) until it finds the IP in phase Installing
.
fetchInstallPlanWithNamespace(GinkgoT(), crc, installPlanName, ns.GetName(), buildInstallPlanPhaseCheckFunc(operatorsv1alpha1.InstallPlanPhaseInstalling))
operator-lifecycle-manager/test/e2e/installplan_e2e_test.go
Lines 4052 to 4058 in 9723d5d
err = wait.Poll(pollInterval, pollDuration, func() (bool, error) { | |
fetchedInstallPlan, err = c.OperatorsV1alpha1().InstallPlans(namespace).Get(context.TODO(), name, metav1.GetOptions{}) | |
if err != nil || fetchedInstallPlan == nil { | |
return false, err | |
} | |
return checkPhase(fetchedInstallPlan), nil |
So the fetched IP will always be in phase Installing
at this point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a big fan of the buildInstallPlanPhaseCheckFunc
, I think it's an artifact of the pre gikgo test framework we use now. fetchInstallPlan
should do one thing, i.e fetch the IP, and the checking of the phase is a separate task. But looks like we're using it in a lot of places, so I'm okay with refactoring the tests later on, and leaving this check out of here for now.
9723d5d
to
6aebb08
Compare
In operator-framework#2077, a new phase `Failed` was introduced for InstallPlans, and failure in detecting a valid OperatorGroup(OG) or a Service Account(SA) for the namespace the InstallPlan was being created in would transition the InstallPlan to the `Failed` state, i.e failure to detected these resources when the InstallPlan was reconciled the first time was considered a permanant failure. This is a regression from the previous behavior of InstallPlans where failure to detect OG/SA would requeue the InstallPlan for reconciliation, so creating the required resources before the retry limit of the informer queue was reached would transition the InstallPlan from the `Installing` phase to the `Complete` phase(unless the bundle unpacking step failed, in which case operator-framework#2093 introduced transitioning the InstallPlan to the `Failed` phase). This regression introduced oddities for users who has infra built that applies a set of manifests simultaneously to install an operator that includes a Subscription to an operator (that creates InstallPlans) along with the required OG/SAs. In those cases, whenever there was a delay in the reconciliation of the OG/SA, the InstallPlan would be transitioned to a state of permanant faliure. This PR: * Removes the logic that transitioned the InstallPlan to `Failed`. Instead, the InstallPlan will again be requeued for any reconciliation error. * Introduces logic to bubble up reconciliation error through the InstallPlan's status.Conditions, eg: When no OperatorGroup is detected: ``` conditions: - lastTransitionTime: "2021-06-23T18:16:00Z" lastUpdateTime: "2021-06-23T18:16:16Z" message: attenuated service account query failed - no operator group found that is managing this namespace reason: InstallCheckFailed status: "False" type: Installed ``` Then when a valid OperatorGroup is created: ``` conditions: - lastTransitionTime: "2021-06-23T18:33:37Z" lastUpdateTime: "2021-06-23T18:33:37Z" status: "True" type: Installed ``` Signed-off-by: Anik Bhattacharjee <[email protected]>
6aebb08
to
3a3874b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: anik120, benluddy, hasbro17 The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Description of the change:
In #2077, a new phase
Failed
was introduced for InstallPlans, and failure indetecting a valid OperatorGroup(OG) or a Service Account(SA) for the namespace
the InstallPlan was being created in would transition the InstallPlan to the
Failed
state, i.e failure to detected these resources when the InstallPlan wasreconciled the first time was considered a permanant failure. This is a regression
from the previous behavior of InstallPlans where failure to detect OG/SA would
requeue the InstallPlan for reconciliation, so creating the required resources before
the retry limit of the informer queue was reached would transition the InstallPlan
from the
Installing
phase to theComplete
phase(unless the bundle unpacking stepfailed, in which case #2093 introduced transitioning the InstallPlan to the
Failed
phase).
This regression introduced oddities for users who has infra built that applies a
set of manifests simultaneously to install an operator that includes a Subscription to
an operator (that creates InstallPlans) along with the required OG/SAs. In those cases,
whenever there was a delay in the reconciliation of the OG/SA, the InstallPlan would
be transitioned to a state of permanant faliure.
This PR:
Removes the logic that transitioned the InstallPlan to
Failed
. Instead, theInstallPlan will again be requeued for any reconciliation error.
Introduces logic to bubble up reconciliation error through the InstallPlan's
status.Conditions, eg:
When no OperatorGroup is detected:
Then when a valid OperatorGroup is created:
Reviewer Checklist
/doc
Closes #2207