Bug 1859178: fix(installplans): GC older installplans #1669

Merged

Conversation

ecordell
Member

Description of the change:
This adds a GC check in the installplan controller.

If it finds more than 5 installplans for a namespace, it will delete the oldest ones. Age is determined first by Generation and second by CreationTimestamp (to account for any bugs that result in installplans with the same generation, which should never happen).
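
For illustration only, here is a minimal Go sketch of that ordering. The helper name oldestBeyondLimit is invented for this writeup, the v1alpha1 import path is assumed, and spec.generation is assumed to be the Generation referred to above; this is not the PR's actual implementation.

package sketch

import (
	"sort"

	"github.com/operator-framework/api/pkg/operators/v1alpha1"
)

// oldestBeyondLimit sorts InstallPlans newest-first by spec.generation,
// breaking ties with the creation timestamp, and returns everything past the
// newest `keep` entries as GC candidates.
func oldestBeyondLimit(ips []*v1alpha1.InstallPlan, keep int) []*v1alpha1.InstallPlan {
	sort.Slice(ips, func(i, j int) bool {
		if ips[i].Spec.Generation != ips[j].Spec.Generation {
			return ips[i].Spec.Generation > ips[j].Spec.Generation
		}
		// Equal generations should never happen; fall back to creation time.
		return ips[j].CreationTimestamp.Before(&ips[i].CreationTimestamp)
	})
	if len(ips) <= keep {
		return nil
	}
	return ips[keep:]
}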

Motivation for the change:
With the current resolver, it was possible to construct a catalog that would cause installplans to be created endlessly. This change protects against potential problems like that.

The new resolver will begin to output installplans even on failed resolution attempts, so this limit will support that feature as well.

Reviewer Checklist

  • Implementation matches the proposed design, or proposal is updated to match implementation
  • Sufficient unit test coverage
  • Sufficient end-to-end test coverage
  • Docs updated or added to /docs
  • Commit messages sensible and descriptive

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 22, 2020
@ecordell ecordell changed the title fix(installplans): GC older installplans Bug 1859178: fix(installplans): GC older installplans Jul 22, 2020
@openshift-ci-robot
Collaborator

@ecordell: This pull request references Bugzilla bug 1859178, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.6.0) matches configured target release for branch (4.6.0)
  • bug is in the state NEW, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

In response to this:

Bug 1859178: fix(installplans): GC older installplans

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added bugzilla/severity-urgent Referenced Bugzilla bug's severity is urgent for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. labels Jul 22, 2020
@kevinrizza
Member

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jul 22, 2020
Comment on lines 1238 to 1257
		if err := o.client.OperatorsV1alpha1().InstallPlans(namespace).Delete(context.TODO(), i.GetName(), metav1.DeleteOptions{}); err != nil {
			log.WithField("deleting", i.GetName()).WithError(err).Warn("error GCing old installplan")
		}
	}
Member

Should we aggregate and return transient errors we care about so this can be requeued?
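
For reference, the aggregation being suggested could look roughly like the following, using the identifiers from the snippet above plus an assumed toDelete slice of GC candidates, and the standard apimachinery error helpers; this is a sketch of the idea, not code from the PR.

// Sketch: collect per-item delete failures (ignoring already-deleted plans)
// and surface them as a single aggregate error so the sync can be requeued.
// apierrors is k8s.io/apimachinery/pkg/api/errors,
// utilerrors is k8s.io/apimachinery/pkg/util/errors.
var errs []error
for _, i := range toDelete {
	err := o.client.OperatorsV1alpha1().InstallPlans(namespace).Delete(context.TODO(), i.GetName(), metav1.DeleteOptions{})
	if err != nil && !apierrors.IsNotFound(err) {
		errs = append(errs, err)
	}
}
syncError = utilerrors.NewAggregate(errs)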

Member

What happens when this fix goes out to clusters with 1000s of install plans? They will chug along and delete a lot of install plans all at once? Do we want to implement a semaphore or some kind of rate limiting in this deletion step?

@ecordell ecordell Jul 22, 2020
Member Author

I wanted to avoid doing that so that problems with GC don't block otherwise normal installplan processing.

As it stands, I'm a little uncomfortable with this implementation, which stops installplan processing to go GC installplans. On clusters that have hit the installplan bug, this could potentially take a long time.

(this seems fine for clusters that haven't hit this bug, because there should almost always just be 1 installplan to GC at a given time)

I will build this controller into an image, trigger the bug, and test it out. It might be that this needs to be in its own controller entirely to be safe.

Member Author

They will chug along and delete a lot of install plans all at once?

OLM will issue a huge set of delete requests, yes.

Do we want to implement a semaphore or some kind of rate limiting in this deletion step?

This would make me feel better about having the GC run as part of an otherwise "real" control loop. I will look into options here (it would be nice to page through the cache) and also consider separating GC entirely.

@@ -1196,6 +1257,8 @@ func (o *Operator) syncInstallPlans(obj interface{}) (syncError error) {

logger.Info("syncing")

o.gcInstallPlans(logger, plan.GetNamespace())
Member

I think performing this GC on namespace resolve sync would result in less overall churn. Is there a reason to perform this for every InstallPlan event?

Member Author

I reasoned that this hedges against issues with catalogs / the resolver pinning the namespace sync for too long.

But lots of good ideas here + in other comments, I'll update this PR.

@@ -1196,6 +1257,8 @@ func (o *Operator) syncInstallPlans(obj interface{}) (syncError error) {

logger.Info("syncing")

o.gcInstallPlans(logger, plan.GetNamespace())
Member

I think we want to bail out if this InstallPlan isn't the latest generation -- or has been deleted -- right?

Member Author

That's already handled by the rest of the installplan loop

@ecordell
Member Author

/hold

I want to perform some additional testing and consider comments here before this merges

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 22, 2020
@@ -1180,6 +1183,64 @@ func (o *Operator) unpackBundles(plan *v1alpha1.InstallPlan) (bool, *v1alpha1.In
return unpacked, out, nil
}

func (o *Operator) gcInstallPlans(log logrus.FieldLogger, namespace string) {
	ips, err := o.lister.OperatorsV1alpha1().InstallPlanLister().InstallPlans(namespace).List(labels.Everything())
Member

Should we ensure that we are only getting installplans that are in some kind of terminal state? We could end up listing and deleting installplans that are actively being resolved on-cluster.

Member Author

this happens per-namespace, and we shouldn't be creating more than one "active" installplan at a time

Member

Oh, so if we make two subscriptions to two different operators in the same catalog (all in the same namespace), OLM will sequentially resolve the installplans (only one being active at a time)? Good to know. I thought maybe OLM worked on them incrementally.

Member

@exdx only the latest InstallPlan counts, and that will contain the resolution for all the Subscriptions in the namespace.

@openshift-ci-robot openshift-ci-robot removed the lgtm Indicates that a PR is ready to be merged. label Jul 22, 2020
@njhale njhale left a comment
Member

/lgtm

nice tests btw!

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jul 22, 2020
@openshift-ci-robot
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ecordell, njhale

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

	// only consider up to maxDeletesPerSweep installplans beyond the allowed count for deletion at one time
	ips := allIps
	if len(ips) > maxInstallPlanCount+maxDeletesPerSweep {
		ips = allIps[:maxInstallPlanCount+maxDeletesPerSweep]
Member

So here we limit the number of extra installplans removed to 8 at a time? Makes sense, but clusters with 1000s of installplans could be waiting a while. I was thinking of a more async process that goes through and deletes these extra ips, eventually reporting status back to OLM. It's definitely more work though, and this is an urgent bugfix, so idk.

	}

	for _, i := range toDelete {
		if err := o.client.OperatorsV1alpha1().InstallPlans(namespace).Delete(context.TODO(), i.GetName(), metav1.DeleteOptions{}); err != nil {
Member

I'm thinking we want to set the DeleteOption here?

@exdx exdx Jul 22, 2020
Member

If the InstallPlan has no dependents then an Orphan PropagationPolicy might be more efficient?
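
For reference, the DeleteOptions change being discussed would look roughly like this, again reusing the identifiers from the quoted snippet; metav1.DeletePropagationOrphan is the standard apimachinery constant, and this is a sketch rather than the PR's code.

// Sketch: request orphan propagation so dependents (if any) are not cascaded.
policy := metav1.DeletePropagationOrphan
opts := metav1.DeleteOptions{PropagationPolicy: &policy}
if err := o.client.OperatorsV1alpha1().InstallPlans(namespace).Delete(context.TODO(), i.GetName(), opts); err != nil {
	log.WithField("deleting", i.GetName()).WithError(err).Warn("error GCing old installplan")
}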

@ecordell
Member Author

/hold cancel

Tested in a 4.4 cluster with lots of installplans generated, and this kept them down to the five latest.

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 22, 2020
@ecordell
Member Author

/retest

@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

1 similar comment
@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-ci-robot openshift-ci-robot removed the lgtm Indicates that a PR is ready to be merged. label Jul 22, 2020
@dinhxuanvu
Member

/retest

	// only consider up to maxDeletesPerSweep installplans beyond the allowed count for deletion at one time
	ips := allIps
	if len(ips) > maxInstallPlanCount+maxDeletesPerSweep {
		ips = allIps[:maxInstallPlanCount+maxDeletesPerSweep]
Member

I don't fully understand the slicing with indices here. Is this array ordered/sorted? Otherwise, why just randomly pick a bunch of InstallPlans from the first 10 in the array?

Member

Also, there are a few clusters with InstallPlans on the order of 1000s; I know one of them has at least 1200. Given the sync period of 15 mins (though in reality the namespace sync happens more often due to resource changes) and the limit of 5 IPs per sweep, would this take a bit too much time to clean up, especially in the case of an idle cluster? In fact, I would prefer to keep the InstallPlans that are currently referenced in the Subscriptions and clean up the others at a higher rate. Maybe this can be changed in the future instead of now due to the urgent need of getting this in.

@njhale njhale Jul 23, 2020
Member

I don't fully understand the slicing with indices here. Is this array ordered/sorted? Otherwise, why just randomly pick a bunch of InstallPlans from the first 10 in the array?

IIUC, for each sweep we take up to maxDeletesPerSweep InstallPlans to delete, and leave the rest for the next sweep. We're essentially only sorting a chunk of the list on each sweep, and after X sweeps we'll have sorted the whole thing.
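
Pieced together from the fragments quoted in this review, the whole sweep looks roughly like the following. The constant values, the error handling on the list call, and the oldestBeyondLimit helper (from the sketch near the top of this page) are assumptions, so treat this as a reconstruction rather than the merged code.

func (o *Operator) gcInstallPlans(log logrus.FieldLogger, namespace string) {
	allIps, err := o.lister.OperatorsV1alpha1().InstallPlanLister().InstallPlans(namespace).List(labels.Everything())
	if err != nil {
		log.WithError(err).Warn("unable to list installplans for GC")
		return
	}
	if len(allIps) <= maxInstallPlanCount {
		return
	}

	// Bound the work done per sweep: look at the keep limit plus at most
	// maxDeletesPerSweep extra entries; the rest wait for a later sweep.
	ips := allIps
	if len(ips) > maxInstallPlanCount+maxDeletesPerSweep {
		ips = allIps[:maxInstallPlanCount+maxDeletesPerSweep]
	}

	// Sort the window newest-first and delete everything past the keep limit.
	for _, i := range oldestBeyondLimit(ips, maxInstallPlanCount) {
		if err := o.client.OperatorsV1alpha1().InstallPlans(namespace).Delete(context.TODO(), i.GetName(), metav1.DeleteOptions{}); err != nil {
			log.WithField("deleting", i.GetName()).WithError(err).Warn("error GCing old installplan")
		}
	}
}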

@njhale njhale left a comment
Member

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jul 23, 2020
@njhale
Member

njhale commented Jul 23, 2020

/retest

1 similar comment
@ecordell
Member Author

/retest

@openshift-merge-robot openshift-merge-robot merged commit 79e4d18 into operator-framework:master Jul 23, 2020
@openshift-ci-robot
Collaborator

@ecordell: All pull requests linked via external trackers have merged: operator-framework/operator-lifecycle-manager#1669. Bugzilla bug 1859178 has been moved to the MODIFIED state.

In response to this:

Bug 1859178: fix(installplans): GC older installplans

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
