Bug 1859178: fix(installplans): GC older installplans #1669

Merged

Conversation

ecordell
Member

Description of the change:
This adds a GC check in the installplan controller.

If it finds more than 5 installplans for a namespace, it will delete the oldest ones. Age is determined first by Generation and second by CreationTimestamp (to account for any bugs that result in installplans with the same generation, which should never happen).
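
For illustration only, here is a minimal Go sketch of that ordering. The helper name oldestBeyondLimit is invented for this writeup, the v1alpha1 import path is assumed, and spec.generation is assumed to be the Generation referred to above; this is not the PR's actual implementation.

package sketch

import (
	"sort"

	"github.com/operator-framework/api/pkg/operators/v1alpha1"
)

// oldestBeyondLimit sorts InstallPlans newest-first by spec.generation,
// breaking ties with the creation timestamp, and returns everything past the
// newest `keep` entries as GC candidates.
func oldestBeyondLimit(ips []*v1alpha1.InstallPlan, keep int) []*v1alpha1.InstallPlan {
	sort.Slice(ips, func(i, j int) bool {
		if ips[i].Spec.Generation != ips[j].Spec.Generation {
			return ips[i].Spec.Generation > ips[j].Spec.Generation
		}
		// Equal generations should never happen; fall back to creation time.
		return ips[j].CreationTimestamp.Before(&ips[i].CreationTimestamp)
	})
	if len(ips) <= keep {
		return nil
	}
	return ips[keep:]
}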

Motivation for the change:
With the current resolver, it was possible to construct a catalog that would cause installplans to be created endlessly. This change protects against potential problems like that.

The new resolver will begin to output installplans even on failed resolution attempts, so this limit will support that feature as well.

Reviewer Checklist

  • Implementation matches the proposed design, or proposal is updated to match implementation
  • Sufficient unit test coverage
  • Sufficient end-to-end test coverage
  • Docs updated or added to /docs
  • Commit messages sensible and descriptive

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 22, 2020
@ecordell ecordell changed the title fix(installplans): GC older installplans Bug 1859178: fix(installplans): GC older installplans Jul 22, 2020
@openshift-ci-robot
Collaborator

@ecordell: This pull request references Bugzilla bug 1859178, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.6.0) matches configured target release for branch (4.6.0)
  • bug is in the state NEW, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

In response to this:

Bug 1859178: fix(installplans): GC older installplans

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added bugzilla/severity-urgent Referenced Bugzilla bug's severity is urgent for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. labels Jul 22, 2020
@kevinrizza
Member

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jul 22, 2020
Comment on lines 1238 to 1257
		if err := o.client.OperatorsV1alpha1().InstallPlans(namespace).Delete(context.TODO(), i.GetName(), metav1.DeleteOptions{}); err != nil {
			log.WithField("deleting", i.GetName()).WithError(err).Warn("error GCing old installplan")
		}
	}
Member

Should we aggregate and return transient errors we care about so this can be requeued?
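
For reference, the aggregation being suggested could look roughly like the following, using the identifiers from the snippet above plus an assumed toDelete slice of GC candidates, and the standard apimachinery error helpers; this is a sketch of the idea, not code from the PR.

// Sketch: collect per-item delete failures (ignoring already-deleted plans)
// and surface them as a single aggregate error so the sync can be requeued.
// apierrors is k8s.io/apimachinery/pkg/api/errors,
// utilerrors is k8s.io/apimachinery/pkg/util/errors.
var errs []error
for _, i := range toDelete {
	err := o.client.OperatorsV1alpha1().InstallPlans(namespace).Delete(context.TODO(), i.GetName(), metav1.DeleteOptions{})
	if err != nil && !apierrors.IsNotFound(err) {
		errs = append(errs, err)
	}
}
syncError = utilerrors.NewAggregate(errs)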

Member

What happens when this fix goes out to clusters with 1000s of install plans? They will chug along and delete a lot of install plans all at once? Do we want to implement a semaphore or some kind of rate limiting in this deletion step?

@ecordell ecordell Jul 22, 2020
Member Author

I wanted to avoid doing that so that problems with GC don't block otherwise normal installplan processing.

As it stands, I'm a little uncomfortable with this implementation, which stops installplan processing to go GC installplans. On clusters that have hit the installplan bug, this could potentially take a long time.

(this seems fine for clusters that haven't hit this bug, because there should almost always just be 1 installplan to GC at a given time)

I will build this controller into an image, trigger the bug, and test it out. It might be that this needs to be in its own controller entirely to be safe.

Member Author

They will chug along and delete a lot of install plans all at once?

OLM will issue a huge set of delete requests, yes.

Do we want to implement a semaphore or some kind of rate limiting in this deletion step?

This would make me feel better about having the GC run as part of an otherwise "real" control loop. I will look into options here (it would be nice to page through the cache) and also consider separating GC entirely.

@@ -1196,6 +1257,8 @@ func (o *Operator) syncInstallPlans(obj interface{}) (syncError error) {

logger.Info("syncing")

o.gcInstallPlans(logger, plan.GetNamespace())
Member

I think performing this GC on namespace resolve sync would result in less overall churn. Is there a reason to perform this for every InstallPlan event?

Member Author

I reasoned that this hedges against issues with catalogs / the resolver pinning the namespace sync for too long.

But lots of good ideas here + in other comments, I'll update this PR.

@@ -1196,6 +1257,8 @@ func (o *Operator) syncInstallPlans(obj interface{}) (syncError error) {

logger.Info("syncing")

o.gcInstallPlans(logger, plan.GetNamespace())
Member

I think we want to bail out if this InstallPlan isn't the latest generation -- or has been deleted -- right?

Member Author

That's already handled by the rest of the installplan loop

@ecordell
Member Author

/hold

I want to perform some additional testing and consider comments here before this merges

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 22, 2020
@@ -1180,6 +1183,64 @@ func (o *Operator) unpackBundles(plan *v1alpha1.InstallPlan) (bool, *v1alpha1.In
return unpacked, out, nil
}

func (o *Operator) gcInstallPlans(log logrus.FieldLogger, namespace string) {
	ips, err := o.lister.OperatorsV1alpha1().InstallPlanLister().InstallPlans(namespace).List(labels.Everything())
Member

Should we ensure that we are only getting installplans that are in some kind of terminal state? We could end up listing and deleting installplans that are actively being resolved on-cluster.

Member Author

this happens per-namespace, and we shouldn't be creating more than one "active" installplan at a time

Member

Oh, so if we make two subscriptions to two different operators in the same catalog (all in the same namespace), OLM will sequentially resolve the installplans (only one being active at a time)? Good to know. I thought maybe OLM worked on them incrementally.

Member

@exdx only the latest InstallPlan counts, and that will contain the resolution for all the Subscriptions in the namespace.

@openshift-ci-robot openshift-ci-robot removed the lgtm Indicates that a PR is ready to be merged. label Jul 22, 2020
@njhale njhale left a comment
Member

/lgtm

nice tests btw!

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jul 22, 2020
@openshift-ci-robot
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ecordell, njhale

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

	// only consider up to maxDeletesPerSweep installplans beyond the allowed count for deletion at one time
	ips := allIps
	if len(ips) > maxInstallPlanCount+maxDeletesPerSweep {
		ips = allIps[:maxInstallPlanCount+maxDeletesPerSweep]
Member

So here we limit the number of extra installplans removed to 8 at a time? Makes sense, but clusters with 1000s of installplans could be waiting a while. I was thinking of a more async process that goes through and deletes these extra ips, eventually reporting status back to OLM. It's definitely more work though, and this is an urgent bugfix, so idk.

	}

	for _, i := range toDelete {
		if err := o.client.OperatorsV1alpha1().InstallPlans(namespace).Delete(context.TODO(), i.GetName(), metav1.DeleteOptions{}); err != nil {
Member

I'm thinking we want to set the DeleteOption here?

@exdx exdx Jul 22, 2020
Member

If the InstallPlan has no dependents then an Orphan PropagationPolicy might be more efficient?
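
For reference, the DeleteOptions change being discussed would look roughly like this, again reusing the identifiers from the quoted snippet; metav1.DeletePropagationOrphan is the standard apimachinery constant, and this is a sketch rather than the PR's code.

// Sketch: request orphan propagation so dependents (if any) are not cascaded.
policy := metav1.DeletePropagationOrphan
opts := metav1.DeleteOptions{PropagationPolicy: &policy}
if err := o.client.OperatorsV1alpha1().InstallPlans(namespace).Delete(context.TODO(), i.GetName(), opts); err != nil {
	log.WithField("deleting", i.GetName()).WithError(err).Warn("error GCing old installplan")
}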

@ecordell
Member Author

/hold cancel

Tested in a 4.4 cluster with lots of installplans generated, and this kept them down to the five latest.

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 22, 2020
@ecordell
Member Author

/retest

@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

1 similar comment
@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-ci-robot openshift-ci-robot removed the lgtm Indicates that a PR is ready to be merged. label Jul 22, 2020
@dinhxuanvu
Member

/retest

	// only consider up to maxDeletesPerSweep installplans beyond the allowed count for deletion at one time
	ips := allIps
	if len(ips) > maxInstallPlanCount+maxDeletesPerSweep {
		ips = allIps[:maxInstallPlanCount+maxDeletesPerSweep]
Member

I don't fully understand the slicing with indices here. Is this array ordered/sorted? Otherwise, why just randomly pick a bunch of InstallPlans from the first 10 in the array?

Member

Also, there are a few clusters with InstallPlans on the order of 1000s; I know one of them has at least 1200. Given the sync period of 15 mins (though in reality the namespace sync happens more often due to resource changes) and the limit of 5 IPs per sweep, would this take a bit too much time to clean up, especially in the case of an idle cluster? In fact, I would prefer to keep the InstallPlans that are currently referenced in the Subscriptions and clean up the others at a higher rate. Maybe this can be changed in the future instead of now due to the urgent need of getting this in.

@njhale njhale Jul 23, 2020
Member

I don't fully understand the slicing with indices here. Is this array ordered/sorted? Otherwise, why just randomly pick a bunch of InstallPlans from the first 10 in the array?

IIUC, for each sweep we take up to maxDeletesPerSweep InstallPlans to delete, and leave the rest for the next sweep. We're essentially only sorting a chunk of the list on each sweep, and after X sweeps we'll have sorted the whole thing.
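
Pieced together from the fragments quoted in this review, the whole sweep looks roughly like the following. The constant values, the error handling on the list call, and the oldestBeyondLimit helper (from the sketch near the top of this page) are assumptions, so treat this as a reconstruction rather than the merged code.

func (o *Operator) gcInstallPlans(log logrus.FieldLogger, namespace string) {
	allIps, err := o.lister.OperatorsV1alpha1().InstallPlanLister().InstallPlans(namespace).List(labels.Everything())
	if err != nil {
		log.WithError(err).Warn("unable to list installplans for GC")
		return
	}
	if len(allIps) <= maxInstallPlanCount {
		return
	}

	// Bound the work done per sweep: look at the keep limit plus at most
	// maxDeletesPerSweep extra entries; the rest wait for a later sweep.
	ips := allIps
	if len(ips) > maxInstallPlanCount+maxDeletesPerSweep {
		ips = allIps[:maxInstallPlanCount+maxDeletesPerSweep]
	}

	// Sort the window newest-first and delete everything past the keep limit.
	for _, i := range oldestBeyondLimit(ips, maxInstallPlanCount) {
		if err := o.client.OperatorsV1alpha1().InstallPlans(namespace).Delete(context.TODO(), i.GetName(), metav1.DeleteOptions{}); err != nil {
			log.WithField("deleting", i.GetName()).WithError(err).Warn("error GCing old installplan")
		}
	}
}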

@njhale njhale left a comment
Member

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jul 23, 2020
@njhale
Member

njhale commented Jul 23, 2020

/retest

1 similar comment
@ecordell
Member Author

/retest

@openshift-merge-robot openshift-merge-robot merged commit 79e4d18 into operator-framework:master Jul 23, 2020
@openshift-ci-robot
Collaborator

@ecordell: All pull requests linked via external trackers have merged: operator-framework/operator-lifecycle-manager#1669. Bugzilla bug 1859178 has been moved to the MODIFIED state.

In response to this:

Bug 1859178: fix(installplans): GC older installplans

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
