Skip to content

fix: check installplan status in bundle e2e test #1631

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

exdx
Copy link
Member

@exdx exdx commented Jul 10, 2020

Description of the change:

Motivation for the change:
Reduce flakiness of bundle with additional objects e2e test.

Reviewer Checklist

  • Implementation matches the proposed design, or proposal is updated to match implementation
  • Sufficient unit test coverage
  • Sufficient end-to-end test coverage
  • Docs updated or added to /docs
  • Commit messages sensible and descriptive


// Wait for the installplan to complete (5 minute timeout)
Eventually(func() error {
_, err := fetchInstallPlan(GinkgoT(), operatorClient, installPlanRef, buildInstallPlanPhaseCheckFunc(v1alpha1.InstallPlanPhaseComplete))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can a subscription be at latest if the InstallPlan isn't complete?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assumed so but let me double check - that's basically my concern with this fix

Copy link
Member Author

@exdx exdx Jul 10, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From what I can see the, the answer is yes. The SubscriptionStateAtLatest state is applied only considering whether an update exists for that given CSV rather than the actual status of the CSV's InstallPlan.

see

func (o *Operator) ensureSubscriptionCSVState(logger *logrus.Entry, sub *v1alpha1.Subscription, querier resolver.SourceQuerier) (*v1alpha1.Subscription, bool, error) {

@njhale
Copy link
Member

njhale commented Jul 10, 2020

2020-07-10T15:22:31.9409853Z     �[91mTimed out after 60.000s.
2020-07-10T15:22:31.9410158Z     expected no error finding vpa object associated with csv
2020-07-10T15:22:31.9410383Z     Expected success, but got an error:
2020-07-10T15:22:31.9410835Z         <*errors.StatusError | 0xc000b85cc0>: {
2020-07-10T15:22:31.9410958Z             ErrStatus: {
2020-07-10T15:22:31.9411081Z                 TypeMeta: {Kind: "Status", APIVersion: "v1"},
2020-07-10T15:22:31.9411200Z                 ListMeta: {
2020-07-10T15:22:31.9411315Z                     SelfLink: "",
2020-07-10T15:22:31.9411432Z                     ResourceVersion: "",
2020-07-10T15:22:31.9411545Z                     Continue: "",
2020-07-10T15:22:31.9411636Z                     RemainingItemCount: nil,
2020-07-10T15:22:31.9411752Z                 },
2020-07-10T15:22:31.9411863Z                 Status: "Failure",
2020-07-10T15:22:31.9412203Z                 Message: "verticalpodautoscalers.autoscaling.k8s.io \"busybox-vpa\" not found",
2020-07-10T15:22:31.9412339Z                 Reason: "NotFound",
2020-07-10T15:22:31.9412654Z                 Details: {
2020-07-10T15:22:31.9412905Z                     Name: "busybox-vpa",
2020-07-10T15:22:31.9413172Z                     Group: "autoscaling.k8s.io",
2020-07-10T15:22:31.9413299Z                     Kind: "verticalpodautoscalers",
2020-07-10T15:22:31.9413416Z                     UID: "",
2020-07-10T15:22:31.9413531Z                     Causes: nil,
2020-07-10T15:22:31.9413647Z                     RetryAfterSeconds: 0,
2020-07-10T15:22:31.9413757Z                 },
2020-07-10T15:22:31.9413867Z                 Code: 404,
2020-07-10T15:22:31.9413976Z             },
2020-07-10T15:22:31.9414060Z         }
2020-07-10T15:22:31.9414530Z         verticalpodautoscalers.autoscaling.k8s.io "busybox-vpa" not found�[0m

^ Sample failure here. If this is what we're seeing in general, I think there may be a deeper issue with conflicting test setup or even with OLM's reconciliation. The fact that it still doesn't exist after 60 seconds is suspicious to me.

@exdx
Copy link
Member Author

exdx commented Jul 10, 2020

I was troubleshooting this with @ecordell locally and it looks like the VPA CRD installplan stepresource gets stuck in WaitingforAPI even though the VPA CRD itself is established and accepted on-cluster. So this causes the e2e test to fail (on the first run) locally. However it sometimes passes, as we've seen when it gets through CI.

I think there was also a question of whether the VPA is shipped in OpenShift by default and the provided CRD is unnecessary - but I could not find a reference to the VPA CRD in a stock clusterbot cluster.

@njhale
Copy link
Member

njhale commented Jul 10, 2020

WaitingforAPI even though the VPA CRD itself is established and accepted on-cluster. So this causes the e2e test to fail (on the first run) locally. However it sometimes passes, as we've seen when it gets through CI.

Do we ensure CRDs are ordered before any CRs? If so, what happens if the CR is applied by the InstallPlan while the API is unavailable? Are we resyncing InstallPlans in this state when CRDs they've applied have changed?

@exdx
Copy link
Member Author

exdx commented Jul 10, 2020

WaitingforAPI even though the VPA CRD itself is established and accepted on-cluster. So this causes the e2e test to fail (on the first run) locally. However it sometimes passes, as we've seen when it gets through CI.

Do we ensure CRDs are ordered before any CRs? If so, what happens if the CR is applied by the InstallPlan while the API is unavailable? Are we resyncing InstallPlans in this state when CRDs they've applied have changed?

We do ensure that CRDs are created first (they are the first thing that gets reconciled in ExecutePlan). We don't currently have a check in our dynamic object creation that the CR that is attempting to be created actually has the associated CRD in a ready and established state on the cluster (which I think you're getting at). So potentially we have a race here, where sometimes the VPA CRD is still in WaitingforAPI after the VPA CR is created from the bundle. That would explain the flakiness. I wonder if we would have this issue with the prometheus objects @awgreene added as well though?

Yes, we should be re-syncing the installplan with respect to the CRDs. The code certainly looks that way. The issue here is that it looks like the sync is not happening fast enough?

@exdx exdx force-pushed the fix/bundle-objects-test branch from df0cd49 to 37b2d42 Compare July 10, 2020 19:35
@exdx exdx requested a review from njhale July 10, 2020 19:42
@exdx
Copy link
Member Author

exdx commented Jul 10, 2020

So with this change instead of creating VPA CRs before the VPA CRD is Complete, the InstallPlan check will fail and the test will fail there.

A little confusing since a lot of the flakes we saw were around the PDB not so much the VPA. I think this fix helps with the PDB check (we were checking too early, before the installplan got a chance to create it) but uncovered another issue with the VPA object.

@exdx
Copy link
Member Author

exdx commented Jul 10, 2020

Personally I think this PR addresses the original intended issues and makes things marginally better so we could merge and address the potential race we uncovered in a follow-up. Thoughts @njhale @ecordell

@ecordell
Copy link
Member

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jul 10, 2020
@ecordell
Copy link
Member

/approve

@openshift-ci-robot
Copy link
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ecordell, exdx

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 11, 2020
@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

10 similar comments
@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

28 similar comments
@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot openshift-merge-robot merged commit 5dc2e63 into operator-framework:master Jul 14, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants