
✨ Allow removing individual informers from the cache (#935) #936


Closed

Conversation

shomron

@shomron shomron commented May 2, 2020

This change adds support for removing individual informers from the
cache using the new Remove() method. This is allowed before or after the
cache has been started. Informers are stopped at the time of removal -
once stopped, they will no longer deliver events to registered event
handlers, and registered watches will be aborted.

Also adds non-blocking API for getting informers without waiting for their
cache to sync - GetInformerNonBlocking().

Currently marked as ✨, but depending on whether we decide to extend existing interfaces, this may become breaking ⚠️.

This PR is introduced without comprehensive tests to give an opportunity to iron out the API with the maintainers.

Signed-off-by: Oren Shomron [email protected]
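
For readers skimming the API shape, here is a rough sketch of what the additions described above might look like as an interface. The interface name, exact signatures, and return types here are assumptions for illustration and may differ from the PR's actual code.

```go
package example

import (
	"k8s.io/apimachinery/pkg/runtime"
	toolscache "k8s.io/client-go/tools/cache"
)

// removableInformers is a hypothetical interface sketching the two additions
// described in this PR; names and signatures are assumptions, not the PR's API.
type removableInformers interface {
	// Remove stops the informer for the given object type and removes it from
	// the cache. A stopped informer no longer delivers events to its
	// registered handlers, and its underlying watch is aborted.
	Remove(obj runtime.Object) error

	// GetInformerNonBlocking returns the informer for the given object type
	// without waiting for its cache to sync.
	GetInformerNonBlocking(obj runtime.Object) (toolscache.SharedIndexInformer, error)
}
```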

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label May 2, 2020
@k8s-ci-robot
Contributor

Welcome @shomron!

It looks like this is your first PR to kubernetes-sigs/controller-runtime 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/controller-runtime has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label May 2, 2020
@k8s-ci-robot
Contributor

Hi @shomron. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: shomron
To complete the pull request process, please assign vincepri
You can assign the PR to them by writing /assign @vincepri in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label May 2, 2020
@shomron shomron changed the title ✨ Allow removing individual informers from the cache (#935) ✨ Allow removing individual informers from the cache (#935) May 2, 2020
@shomron
Author

shomron commented May 6, 2020

@vincepri what's the best way to get some 👀 on this issue and decide how to proceed?

@shawn-hurley

/ok-to-test

Sorry if I am missing some context, but do we have an issue that describes the need for this? I may be out of the loop a little; I am just trying to understand when this would be used.

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 6, 2020
@shomron
Author

shomron commented May 7, 2020

@shawn-hurley yes, the linked issue is #935 and the use case is described there. Let me know if you have any questions, I'd be happy to elaborate further.

```diff
@@ -126,7 +126,7 @@ func defaultOpts(config *rest.Config, opts Options) (Options, error) {
 	// Construct a new Mapper if unset
 	if opts.Mapper == nil {
 		var err error
-		opts.Mapper, err = apiutil.NewDiscoveryRESTMapper(config)
+		opts.Mapper, err = apiutil.NewDynamicRESTMapper(config)
```
Member

This is a good change but should be its own PR as it doesn't really have anything to do with this PR

Author

Fair enough, I will move this into a separate PR. 👍🏻

@alvaroaleman
Member

@shomron this means that controllers that use the given informer for their watch will silently degrade, correct?

@shomron
Author

shomron commented May 7, 2020

this means that controllers that use the given informer for their watch will silently degrade, correct?

@alvaroaleman yes that is correct. In our use case, we have a primary controller multiplexing watch events from dynamically watched resources and forwarding them on to interested parties using a stable source.Channel-based watch. That is also the point where we do things like reference counting for interested parties.
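
As an illustration of the multiplexing pattern described above (not code from this PR), here is a minimal sketch that forwards events from a dynamically obtained informer onto a channel consumed by a long-lived source.Channel watch. The function name is made up, and the GenericEvent fields assume the controller-runtime v0.6-era API (newer versions use a single Object field).

```go
package example

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime"
	toolscache "k8s.io/client-go/tools/cache"
	"sigs.k8s.io/controller-runtime/pkg/event"
)

// forwardEvents multiplexes add/update/delete notifications from a dynamically
// created informer onto a channel that a stable source.Channel watch consumes.
func forwardEvents(inf toolscache.SharedIndexInformer, out chan<- event.GenericEvent) {
	push := func(obj interface{}) {
		o, okObj := obj.(runtime.Object)
		m, okMeta := obj.(metav1.Object)
		if okObj && okMeta {
			// GenericEvent's Meta/Object split matches the v0.6-era API.
			out <- event.GenericEvent{Meta: m, Object: o}
		}
	}
	inf.AddEventHandler(toolscache.ResourceEventHandlerFuncs{
		AddFunc:    push,
		UpdateFunc: func(_, newObj interface{}) { push(newObj) },
		DeleteFunc: push,
	})
}
```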

@alvaroaleman
Member

@alvaroaleman yes that is correct. In our use case, we have a primary controller multiplexing watch events from dynamically watched resources and forwarding them on to interested parties using a stable source.Channel-based watch. That is also the point where we do things like reference counting for interested parties.

Okay. So from what I can see on a quick look, it is currently not possible to find out if an Informer has event handlers registered (correct me if I missed something there). This means that there is no way for us to prevent removal of informers that are being used in a watch by a controller, which in turn means that the implementation in this PR introduces a huge footgun.

That being said, it would be great to support your use case; we've had something similar in the past. What prevents you from creating a cache per watch?

cc @thetechnick

@shomron
Author

shomron commented May 7, 2020

@alvaroaleman thank you for looking at this! Let me try to address your points:

Regarding checking whether the informer has registered event handlers - in our case we know the informer has registered event handlers, and explicitly want to stop watching. I think the issue is not introspection, but that event handlers cannot be removed from informers. Tangentially, as I called out in #863 (comment) - it is now possible to stop an unmanaged controller with registered event handlers - which I believe would cause its work queue to grow unbounded.

What prevents you from creating a cache per watch?

This seems less than ideal. The role of the cache is to manage the lifecycle of informers over disparate resources and provide reader/writer abstractions on top. If we created a cache-per-resource, we would end up implementing a cache of caches, where each cache manages a single informer. I feel we could achieve a more concise design.

Thoughts?

@alvaroaleman
Member

Regarding checking whether the informer has registered event handlers - in our case we know the informer has registered event handlers, and explicitly want to stop watching. I think the issue is not introspection, but that event handlers cannot be removed from informers. Tangentially, as I called out in #863 (comment) - it is now possible to stop an unmanaged controller with registered event handlers - which I believe would cause its work queue to grow unbounded.

True, deregistering is the better solution. My point here is that this introduces a silent and hard-to-debug failure mode for everyone in order to support a use case that is, in the grand scheme of things, rather exotic. And that is, in my opinion, not a good trade-off.

Maybe we can add refcounting ourselves on top of the underlying informer and refuse to remove it if it has registered event handlers? Potentially with the option to say "I know this has N event handlers, but this is okay because I am also stopping the controller"?

This is far from an elegant solution, but from what I can see it is the best we can do without upstream changes. Maybe it would also be a good idea to start a discussion with upstream sig-apimachinery folks about whether they would consider adding an option to remove event handlers.
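
(A very rough sketch of that refcounting idea, assuming the client-go API of that era where AddEventHandler returns nothing; the type and method names are invented for illustration.)

```go
package example

import (
	"fmt"
	"sync"

	toolscache "k8s.io/client-go/tools/cache"
)

// countedInformer is a hypothetical wrapper that tracks how many event
// handlers have been registered, so a cache could refuse to remove an
// informer that is still feeding a controller.
type countedInformer struct {
	toolscache.SharedIndexInformer
	mu       sync.Mutex
	handlers int
}

// AddEventHandler counts the registration before delegating to the wrapped informer.
func (c *countedInformer) AddEventHandler(h toolscache.ResourceEventHandler) {
	c.mu.Lock()
	c.handlers++
	c.mu.Unlock()
	c.SharedIndexInformer.AddEventHandler(h)
}

// checkRemovable returns an error unless no handlers are registered or the
// caller explicitly forces removal ("I know this has N handlers, but I am
// also stopping the controller").
func (c *countedInformer) checkRemovable(force bool) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.handlers > 0 && !force {
		return fmt.Errorf("informer still has %d registered event handler(s)", c.handlers)
	}
	return nil
}
```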

Tangentially, do you know if it's safe to stop individual informers, or if that might leak goroutines and/or memory?

This seems less than ideal. The role of the cache is to manage the lifecycle of informers over disparate resources and provide reader/writer abstractions on top. If we created a cache-per-resource, we would end up implementing a cache of caches, where each cache manages a single informer. I feel we could achieve a more concise design.

I completely agree that this isn't an ideal solution; it's a workaround. The trade-off that is in this PR right now (a big footgun for everyone to support a somewhat uncommon use case) is in my opinion not a good one. But I'd love for others to weigh in on that.

/cc @DirectXMan12 @mengqiy @alexeldeib

@alexeldeib
Contributor

I'm torn on this. I think the use case totally makes sense. I also agree with everything Alvaro said about it being a footgun and possible solutions.

I'm curious what other API machinery folks think about (de)registering event handlers, as that's definitely the most pure solution, albeit probably very non-trivial. If I had full choice, I'd probably prefer documenting a cache of caches as a short-term workaround for this case and fixing the event handlers long term to enable this scenario.

third opinions?
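
For concreteness, here is a minimal sketch of the cache-of-caches / cache-per-watch workaround mentioned above: each dynamically watched resource gets its own cache that can be stopped independently, leaving the manager's shared cache untouched. The function is illustrative; the Start signature shown is the v0.6-era stop-channel form (newer versions take a context).

```go
package example

import (
	"k8s.io/client-go/rest"
	"sigs.k8s.io/controller-runtime/pkg/cache"
)

// newStoppableCache builds a dedicated cache for one dynamic watch so its
// informer(s) can be dropped later simply by stopping this cache.
func newStoppableCache(cfg *rest.Config) (cache.Cache, func(), error) {
	c, err := cache.New(cfg, cache.Options{})
	if err != nil {
		return nil, nil, err
	}
	stop := make(chan struct{})
	go c.Start(stop) // v0.6-era signature; later versions take a context.Context
	return c, func() { close(stop) }, nil
}
```

A controller would then build its watch source against this dedicated cache and call the returned stop function once the watch is no longer needed.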


@vincepri vincepri self-assigned this Jun 9, 2020
@vincepri
Member

Having read the whole discussion, it might be worth taking a step back and doing a small proposal following the examples we already have in docs/proposals.

In general, the use case is valid. That said, we need to make sure we don't open a whole set of potential issues that weren't possible before. I'd like to see a way for us to support this feature while leaning on the side of safety and explicit user configuration/intent.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 9, 2020
@DirectXMan12 DirectXMan12 left a comment
Contributor

I've been discussing a use case with some folks around this. The general gist is that a given CRD may or may not exist on a cluster, and you want to gracefully degrade when it's not present, but you want to run a controller while it is. In various cases you don't actually want to restart the manager, so it's useful to be able to say "run this controller and equivalent informers while a resource is available, otherwise don't".

All that being said, a decent first step for this might be to keep it internal, use it for that use case, and work on refcounting or something in parallel.
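
As a companion to that use case, here is a sketch (not from this PR) of how an operator might probe whether an optional CRD's resource is currently served, using client-go discovery. The function name and the groupVersion/resource values are illustrative, and the error handling is simplified.

```go
package example

import (
	"k8s.io/client-go/discovery"
	"k8s.io/client-go/rest"
)

// resourceServed reports whether the given group/version currently serves the
// named resource, e.g. resource "widgets" in groupVersion "example.com/v1".
func resourceServed(cfg *rest.Config, groupVersion, resource string) (bool, error) {
	dc, err := discovery.NewDiscoveryClientForConfig(cfg)
	if err != nil {
		return false, err
	}
	list, err := dc.ServerResourcesForGroupVersion(groupVersion)
	if err != nil {
		// Simplification: treat any lookup failure as "not served"; real code
		// should distinguish NotFound from transient errors.
		return false, nil
	}
	for _, r := range list.APIResources {
		if r.Name == resource {
			return true, nil
		}
	}
	return false, nil
}
```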

```go
// via the provided stop argument.
func (e *MapEntry) Start(stop <-chan struct{}) {
	// Stop on either the whole map stopping or just this informer being removed.
	internalStop, cancel := anyOf(stop, e.stop)
```
Contributor

#1116 should help here
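
For context on the snippet above: the PR's anyOf helper is not shown here, but a minimal two-channel version might look like the following. This is an assumption about its behaviour, namely that the combined channel closes when either input closes and that the cancel function merely releases the internal goroutine.

```go
package example

import "sync"

// anyOf returns a channel that is closed as soon as either a or b is closed,
// plus a cancel function that releases the internal goroutine if neither
// input ever closes. Sketch only; the PR's actual helper may differ.
func anyOf(a, b <-chan struct{}) (<-chan struct{}, func()) {
	combined := make(chan struct{})
	cancelCh := make(chan struct{})
	var once sync.Once
	go func() {
		select {
		case <-a:
		case <-b:
		case <-cancelCh:
			// Cancelled: release the goroutine without closing combined.
			return
		}
		close(combined)
	}()
	// cancel is safe to call multiple times thanks to sync.Once.
	cancel := func() { once.Do(func() { close(cancelCh) }) }
	return combined, cancel
}
```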

@kevindelgado
Contributor

Ahead of tomorrow's community meeting, I've been working on a more formal design doc for this as requested above. See #1192

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 7, 2020
@k8s-ci-robot
Contributor

@shomron: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@DirectXMan12
Contributor

gonna close this for the moment, we'll roll it into whatever we decide to do with the stoppable controllers work and give attribution.
