
✨ Allow removing individual informers from the cache (#935) #936


Closed

Conversation

shomron

@shomron shomron commented May 2, 2020

This change adds support for removing individual informers from the
cache using the new Remove() method. This is allowed before or after the
cache has been started. Informers are stopped at the time of removal -
once stopped, they will no longer deliver events to registered event
handlers, and registered watches will be aborted.

Also adds non-blocking API for getting informers without waiting for their
cache to sync - GetInformerNonBlocking().

Currently marked as ✨, but depending on whether we decide to extend existing interfaces, this may become breaking ⚠️.

This PR is introduced without comprehensive tests to give an opportunity to iron out the API with the maintainers.

Signed-off-by: Oren Shomron [email protected]
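
For readers skimming the API shape, here is a rough sketch of what the additions described above might look like as an interface. The interface name, exact signatures, and return types here are assumptions for illustration and may differ from the PR's actual code.

```go
package example

import (
	"k8s.io/apimachinery/pkg/runtime"
	toolscache "k8s.io/client-go/tools/cache"
)

// removableInformers is a hypothetical interface sketching the two additions
// described in this PR; names and signatures are assumptions, not the PR's API.
type removableInformers interface {
	// Remove stops the informer for the given object type and removes it from
	// the cache. A stopped informer no longer delivers events to its
	// registered handlers, and its underlying watch is aborted.
	Remove(obj runtime.Object) error

	// GetInformerNonBlocking returns the informer for the given object type
	// without waiting for its cache to sync.
	GetInformerNonBlocking(obj runtime.Object) (toolscache.SharedIndexInformer, error)
}
```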

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label May 2, 2020
@k8s-ci-robot
Contributor

Welcome @shomron!

It looks like this is your first PR to kubernetes-sigs/controller-runtime 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/controller-runtime has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label May 2, 2020
@k8s-ci-robot
Contributor

Hi @shomron. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: shomron
To complete the pull request process, please assign vincepri
You can assign the PR to them by writing /assign @vincepri in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label May 2, 2020
@shomron shomron changed the title ✨ Allow removing individual informers from the cache (#935) ✨ Allow removing individual informers from the cache (#935) May 2, 2020
@shomron
Author

shomron commented May 6, 2020

@vincepri what's the best way to get some 👀 on this issue and decide how to proceed?

@shawn-hurley

/ok-to-test

Sorry if I am missing some context, but do we have an issue that describes the need for this? I may be out of the loop a little; I am just trying to understand when this would be used.

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 6, 2020
@shomron
Author

shomron commented May 7, 2020

@shawn-hurley yes, the linked issue is #935 and the use case is described there. Let me know if you have any questions, I'd be happy to elaborate further.

```diff
@@ -126,7 +126,7 @@ func defaultOpts(config *rest.Config, opts Options) (Options, error) {
 	// Construct a new Mapper if unset
 	if opts.Mapper == nil {
 		var err error
-		opts.Mapper, err = apiutil.NewDiscoveryRESTMapper(config)
+		opts.Mapper, err = apiutil.NewDynamicRESTMapper(config)
```
Member

This is a good change but should be its own PR as it doesn't really have anything to do with this PR

Author

Fair enough, I will move this into a separate PR. 👍🏻

@alvaroaleman
Member

@shomron this means that controllers that use the given informer for their watch will silently degrade, correct?

@shomron
Author

shomron commented May 7, 2020

this means that controllers that use the given informer for their watch will silently degrade, correct?

@alvaroaleman yes that is correct. In our use case, we have a primary controller multiplexing watch events from dynamically watched resources and forwarding them on to interested parties using a stable source.Channel-based watch. That is also the point where we do things like reference counting for interested parties.
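
As an illustration of the multiplexing pattern described above (not code from this PR), here is a minimal sketch that forwards events from a dynamically obtained informer onto a channel consumed by a long-lived source.Channel watch. The function name is made up, and the GenericEvent fields assume the controller-runtime v0.6-era API (newer versions use a single Object field).

```go
package example

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime"
	toolscache "k8s.io/client-go/tools/cache"
	"sigs.k8s.io/controller-runtime/pkg/event"
)

// forwardEvents multiplexes add/update/delete notifications from a dynamically
// created informer onto a channel that a stable source.Channel watch consumes.
func forwardEvents(inf toolscache.SharedIndexInformer, out chan<- event.GenericEvent) {
	push := func(obj interface{}) {
		o, okObj := obj.(runtime.Object)
		m, okMeta := obj.(metav1.Object)
		if okObj && okMeta {
			// GenericEvent's Meta/Object split matches the v0.6-era API.
			out <- event.GenericEvent{Meta: m, Object: o}
		}
	}
	inf.AddEventHandler(toolscache.ResourceEventHandlerFuncs{
		AddFunc:    push,
		UpdateFunc: func(_, newObj interface{}) { push(newObj) },
		DeleteFunc: push,
	})
}
```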

@alvaroaleman
Member

@alvaroaleman yes that is correct. In our use case, we have a primary controller multiplexing watch events from dynamically watched resources and forwarding them on to interested parties using a stable source.Channel-based watch. That is also the point where we do things like reference counting for interested parties.

Okay. So from what I can see on a quick look, it is currently not possible to find out if an Informer has event handlers registered (correct me if I missed something there). This means that there is no way for us to prevent removal of informers that are being used in a watch by a controller, which in turn means that the implementation in this PR introduces a huge footgun.

That being said, it would be great to support your use case; we've had something similar in the past. What prevents you from creating a cache per watch?

cc @thetechnick

@shomron
Author

shomron commented May 7, 2020

@alvaroaleman thank you for looking at this! Let me try to address your points:

Regarding checking whether the informer has registered event handlers - in our case we know the informer has registered event handlers, and explicitly want to stop watching. I think the issue is not introspection, but that event handlers cannot be removed from informers. Tangentially, as I called out in #863 (comment) - it is now possible to stop an unmanaged controller with registered event handlers - which I believe would cause its work queue to grow unbounded.

What prevents you from creating a cache per watch?

This seems less than ideal. The role of the cache is to manage the lifecycle of informers over disparate resources and provide reader/writer abstractions on top. If we created a cache-per-resource, we would end up implementing a cache of caches, where each cache manages a single informer. I feel we could achieve a more concise design.

Thoughts?

@alvaroaleman
Member

Regarding checking whether the informer has registered event handlers - in our case we know the informer has registered event handlers, and explicitly want to stop watching. I think the issue is not introspection, but that event handlers cannot be removed from informers. Tangentially, as I called out in #863 (comment) - it is now possible to stop an unmanaged controller with registered event handlers - which I believe would cause its work queue to grow unbounded.

True, deregistering is the better solution. My point here is that this introduces a silent and hard-to-debug failure mode for everyone in order to support a use case that is, in the grand scheme of things, rather exotic. And that is, in my opinion, not a good trade-off.

Maybe we can add refcounting ourselves on top of the underlying informer and refuse to remove it if it has registered event handlers? Potentially with the option to say "I know this has N event handlers, but this is okay because I am also stopping the controller"?

This is far from an elegant solution, but from what I can see it is the best we can do without upstream changes. Maybe it would also be a good idea to start a discussion with upstream sig-apimachinery folks about whether they would consider adding an option to remove event handlers.
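
(A very rough sketch of that refcounting idea, assuming the client-go API of that era where AddEventHandler returns nothing; the type and method names are invented for illustration.)

```go
package example

import (
	"fmt"
	"sync"

	toolscache "k8s.io/client-go/tools/cache"
)

// countedInformer is a hypothetical wrapper that tracks how many event
// handlers have been registered, so a cache could refuse to remove an
// informer that is still feeding a controller.
type countedInformer struct {
	toolscache.SharedIndexInformer
	mu       sync.Mutex
	handlers int
}

// AddEventHandler counts the registration before delegating to the wrapped informer.
func (c *countedInformer) AddEventHandler(h toolscache.ResourceEventHandler) {
	c.mu.Lock()
	c.handlers++
	c.mu.Unlock()
	c.SharedIndexInformer.AddEventHandler(h)
}

// checkRemovable returns an error unless no handlers are registered or the
// caller explicitly forces removal ("I know this has N handlers, but I am
// also stopping the controller").
func (c *countedInformer) checkRemovable(force bool) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.handlers > 0 && !force {
		return fmt.Errorf("informer still has %d registered event handler(s)", c.handlers)
	}
	return nil
}
```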

Tangentially, do you know if it's safe to stop individual informers, or if that might leak goroutines and/or memory?

This seems less than ideal. The role of the cache is to manage the lifecycle of informers over disparate resources and provide reader/writer abstractions on top. If we created a cache-per-resource, we would end up implementing a cache of caches, where each cache manages a single informer. I feel we could achieve a more concise design.

I completely agree that this isn't an ideal solution; it's a workaround. The trade-off that is in this PR right now (a big footgun for everyone to support a somewhat uncommon use case) is in my opinion not a good one. But I'd love for others to weigh in on that.

/cc @DirectXMan12 @mengqiy @alexeldeib

@alexeldeib
Contributor

I'm torn on this. I think the use case totally makes sense. I also agree with everything Alvaro said about it being a footgun and possible solutions.

I'm curious what other API machinery folks think about (de)registering event handlers, as that's definitely the most pure solution, albeit probably very non-trivial. If I had full choice, I'd probably prefer documenting a cache of caches as a short-term workaround for this case and fixing the event handlers long term to enable this scenario.

third opinions?
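
For concreteness, here is a minimal sketch of the cache-of-caches / cache-per-watch workaround mentioned above: each dynamically watched resource gets its own cache that can be stopped independently, leaving the manager's shared cache untouched. The function is illustrative; the Start signature shown is the v0.6-era stop-channel form (newer versions take a context).

```go
package example

import (
	"k8s.io/client-go/rest"
	"sigs.k8s.io/controller-runtime/pkg/cache"
)

// newStoppableCache builds a dedicated cache for one dynamic watch so its
// informer(s) can be dropped later simply by stopping this cache.
func newStoppableCache(cfg *rest.Config) (cache.Cache, func(), error) {
	c, err := cache.New(cfg, cache.Options{})
	if err != nil {
		return nil, nil, err
	}
	stop := make(chan struct{})
	go c.Start(stop) // v0.6-era signature; later versions take a context.Context
	return c, func() { close(stop) }, nil
}
```

A controller would then build its watch source against this dedicated cache and call the returned stop function once the watch is no longer needed.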


@vincepri vincepri self-assigned this Jun 9, 2020
@vincepri
Member

Having read the whole discussion, it might be worth taking a step back and doing a small proposal following the examples we already have in docs/proposals.

In general, the use case is valid. That said, we need to make sure we don't open a whole set of potential issues that weren't possible before. I'd like to see a way for us to support this feature while leaning on the side of safety and explicit user configuration/intent.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 9, 2020
@DirectXMan12 DirectXMan12 left a comment
Contributor

I've been discussing a use case with some folks around this. The general gist is that a given CRD may or may not exist on a cluster, and you want to gracefully degrade when it's not present, but you want to run a controller while it is. In various cases you don't actually want to restart the manager, so it's useful to be able to say "run this controller and equivalent informers while a resource is available, otherwise don't".

All that being said, a decent first step for this might be to keep it internal, use it for that use case, and work on refcounting or something in parallel.
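
As a companion to that use case, here is a sketch (not from this PR) of how an operator might probe whether an optional CRD's resource is currently served, using client-go discovery. The function name and the groupVersion/resource values are illustrative, and the error handling is simplified.

```go
package example

import (
	"k8s.io/client-go/discovery"
	"k8s.io/client-go/rest"
)

// resourceServed reports whether the given group/version currently serves the
// named resource, e.g. resource "widgets" in groupVersion "example.com/v1".
func resourceServed(cfg *rest.Config, groupVersion, resource string) (bool, error) {
	dc, err := discovery.NewDiscoveryClientForConfig(cfg)
	if err != nil {
		return false, err
	}
	list, err := dc.ServerResourcesForGroupVersion(groupVersion)
	if err != nil {
		// Simplification: treat any lookup failure as "not served"; real code
		// should distinguish NotFound from transient errors.
		return false, nil
	}
	for _, r := range list.APIResources {
		if r.Name == resource {
			return true, nil
		}
	}
	return false, nil
}
```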

```go
// via the provided stop argument.
func (e *MapEntry) Start(stop <-chan struct{}) {
	// Stop on either the whole map stopping or just this informer being removed.
	internalStop, cancel := anyOf(stop, e.stop)
```
Contributor

#1116 should help here
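
For context on the snippet above: the PR's anyOf helper is not shown here, but a minimal two-channel version might look like the following. This is an assumption about its behaviour, namely that the combined channel closes when either input closes and that the cancel function merely releases the internal goroutine.

```go
package example

import "sync"

// anyOf returns a channel that is closed as soon as either a or b is closed,
// plus a cancel function that releases the internal goroutine if neither
// input ever closes. Sketch only; the PR's actual helper may differ.
func anyOf(a, b <-chan struct{}) (<-chan struct{}, func()) {
	combined := make(chan struct{})
	cancelCh := make(chan struct{})
	var once sync.Once
	go func() {
		select {
		case <-a:
		case <-b:
		case <-cancelCh:
			// Cancelled: release the goroutine without closing combined.
			return
		}
		close(combined)
	}()
	// cancel is safe to call multiple times thanks to sync.Once.
	cancel := func() { once.Do(func() { close(cancelCh) }) }
	return combined, cancel
}
```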

@kevindelgado
Contributor

Ahead of tomorrow's community meeting, I've been working on a more formal design doc for this as requested above. See #1192

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 7, 2020
@k8s-ci-robot
Contributor

@shomron: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@DirectXMan12
Contributor

gonna close this for the moment, we'll roll it into whatever we decide to do with the stoppable controllers work and give attribution.
