Bug 1859178: fix(installplans): GC older installplans #1669
Conversation
@ecordell: This pull request references Bugzilla bug 1859178, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker. 3 validation(s) were run on this bug
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Force-pushed from a670268 to 37df153
/lgtm
```go
if err := o.client.OperatorsV1alpha1().InstallPlans(namespace).Delete(context.TODO(), i.GetName(), metav1.DeleteOptions{}); err != nil {
	log.WithField("deleting", i.GetName()).WithError(err).Warn("error GCing old installplan")
}
```
Should we aggregate and return transient errors we care about so this can be requeued?
What happens when this fix goes out to clusters with 1000s of install plans? They will chug along and delete a lot of install plans all at once? Do we want to implement a semaphore or some kind of rate limiting in this deletion step?
I wanted to avoid doing that so that problems with GC don't block otherwise normal installplan processing.
As it stands, I'm a little uncomfortable with this implementation, which stops installplan processing to go GC installplans. On clusters that have hit the installplan bug, this could potentially take a long time.
(this seems fine for clusters that haven't hit this bug, because there should almost always just be 1 installplan to GC at a given time)
I will build this controller into an image, trigger the bug, and test it out. It might be that this needs to be in its own controller entirely to be safe.
> They will chug along and delete a lot of install plans all at once?

OLM will issue a huge set of delete requests, yes.

> Do we want to implement a semaphore or some kind of rate limiting in this deletion step?

This would make me feel better about having the gc run as part of an otherwise "real" control loop. I will look into options here (would be nice to page through the cache) and also consider separating GC entirely.
```diff
@@ -1196,6 +1257,8 @@ func (o *Operator) syncInstallPlans(obj interface{}) (syncError error) {
 	logger.Info("syncing")
+
+	o.gcInstallPlans(logger, plan.GetNamespace())
```
I think performing this GC on namespace resolve sync would result in less overall churn. Is there a reason to perform this for every InstallPlan event?
I reasoned that this hedges against issues with catalogs / the resolver pinning the namespace sync for too long.
But lots of good ideas here + in other comments, I'll update this PR.
```diff
@@ -1196,6 +1257,8 @@ func (o *Operator) syncInstallPlans(obj interface{}) (syncError error) {
 	logger.Info("syncing")
+
+	o.gcInstallPlans(logger, plan.GetNamespace())
```
I think we want to bail out if this InstallPlan isn't the latest generation -- or has been deleted -- right?
That's already handled by the rest of the installplan loop.
/hold

I want to perform some additional testing and consider the comments here before this merges.
```diff
@@ -1180,6 +1183,64 @@ func (o *Operator) unpackBundles(plan *v1alpha1.InstallPlan) (bool, *v1alpha1.In
 	return unpacked, out, nil
 }
 
+func (o *Operator) gcInstallPlans(log logrus.FieldLogger, namespace string) {
+	ips, err := o.lister.OperatorsV1alpha1().InstallPlanLister().InstallPlans(namespace).List(labels.Everything())
```
should we ensure that we are getting installplans that are in some kind of terminal state only? we could end up listing and deleting installplans that are actively being resolved on-cluster
this happens per-namespace, and we shouldn't be creating more than one "active" installplan at a time
Oh, so if we make two subscriptions to two different operators in the same catalog (all in the same namespace), OLM will sequentially resolve the installplans (only one being active at a time)? Good to know. I thought maybe OLM worked on them incrementally.
@exdx only the latest InstallPlan counts, and that will contain the resolution for all the Subscriptions in the namespace.
Force-pushed from 37df153 to a43ed75
/lgtm
nice tests btw!
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: ecordell, njhale. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
```go
// we only consider maxDeletesPerSweep more than the allowed number of installplans for delete at one time
ips := allIps
if len(ips) > maxInstallPlanCount+maxDeletesPerSweep {
	ips = allIps[:maxInstallPlanCount+maxDeletesPerSweep]
}
```
So here we limit the number of extra installplans removed to 8 at a time? Makes sense, but for these clusters with 1000s of installplans they could be waiting a while. I was thinking of a more async process that goes through and deletes these extra ips, eventually reporting status back to OLM. It's definitely more work though, and this is an urgent bugfix, so idk.
```go
for _, i := range toDelete {
	if err := o.client.OperatorsV1alpha1().InstallPlans(namespace).Delete(context.TODO(), i.GetName(), metav1.DeleteOptions{}); err != nil {
		log.WithField("deleting", i.GetName()).WithError(err).Warn("error GCing old installplan")
	}
}
```
I'm thinking we want to set the `DeleteOptions` here?
If the `InstallPlan` has no dependents, then an `Orphan` `PropagationPolicy` might be more efficient?
/hold cancel

Tested in a 4.4 cluster with lots of installplans generated, and this kept them down to the five latest.
/retest
/retest

Please review the full test history for this PR and help us cut down flakes.
Force-pushed from a43ed75 to 91d4f2a
/retest
```go
// we only consider maxDeletesPerSweep more than the allowed number of installplans for delete at one time
ips := allIps
if len(ips) > maxInstallPlanCount+maxDeletesPerSweep {
	ips = allIps[:maxInstallPlanCount+maxDeletesPerSweep]
}
```
I don't fully understand the slicing with indices here. Is this array ordered/sorted? Otherwise, why just randomly pick a bunch of InstallPlans from the first 10 in the array?
Also, there are a few clusters with InstallPlans on the order of 1000s; I know one of them has at least 1200. Given the sync period of 15 mins (though in reality the namespace sync happens more often due to resource changes) and the limit of 5 IPs per sweep, would this take a bit too much time to clean up, especially on an idle cluster? In fact, I would prefer to keep the InstallPlans that are currently referenced in the Subscriptions and clean up the others at a higher rate. Maybe this can be changed in the future instead of now, due to the urgent need of getting this in.
> I don't fully understand the slicing with indices here. Is this array ordered/sorted? Otherwise, why just randomly pick a bunch of InstallPlans from the first 10 in the array?
IIUC, for each sweep we take up to `maxDeletesPerSweep` InstallPlans to delete, and leave the rest for the next sweep. We're essentially only sorting a chunk of the list on each sweep, and after X sweeps we'll have sorted the whole thing.
/lgtm
/retest
@ecordell: All pull requests linked via external trackers have merged: operator-framework/operator-lifecycle-manager#1669. Bugzilla bug 1859178 has been moved to the MODIFIED state. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Description of the change:
This adds a GC check in the installplan controller. If it finds more than 5 installplans for a namespace, it will delete the oldest ones. Age is determined first by `Generation` and second via `CreationTimestamp` (to account for any bugs that result in installplans with the same generation, which should never happen).

Motivation for the change:
It was possible in the current resolver to construct a catalog that would cause installplans to be created endlessly. This change protects against potential problems like this. The new resolver will begin to output installplans even on failed resolution attempts, so this limit will support that feature as well.