- Conditionally Runnable Controllers
+ Informer Removal / Controller Lifecycle Management
==========================

## Summary

- Enable controller managers to successfully operate when the CRD the controller
- is configured to watch does not exist.
-
- Successful operation of a controller manager includes:
- * Starts and runs without error when a CRD the controller watches does not exist.
- * Begins watching the CRD once it is installed.
- * Unregisters (stops watching) once a CRD is uninstalled.
+ Enable fine-grained control over the lifecycle of a controller, including the
+ ability to start/stop/restart controllers and their caches, by exposing a way to
+ remove individual informers from the cache and working around restrictions that
+ currently prevent controllers from starting multiple times.

## Background/Motivation

- Usually there is a 1:1 relationship between controller and resource and a 1:1
- relationship between controller and k8s cluster. When this is the case it is
- fine to assume that for a controller to run successfully, the resource it
- controls must be installed on the cluster that the controller is watching.
-
- We are now seeing use cases where a single controller is responsible for
- multiple resources across multiple clusters. This creates situations where
- controllers need to successfully run even when the resource is not installed,
- and need to proceed successfully as they begin/terminate their watch on a resource
- when the resource is installed/uninstalled.
+ Currently, the user does not have much control over the lifecycle of a
+ controller. The user can add controllers to the manager and add informers to the
+ cache, but there is no way to remove either of these.

- The current approach is to check the discovery doc for a CRD's existence prior to
- adding the controller to the manager. This has its limitations, as complexity
- increases greatly for users who need to manage multiple controllers for a
- mixture of CRDs that might not always be installed on the cluster or are
- installed in a different order.
+ Additionally, there are restrictions that prevent a controller from running
+ multiple times, such as clearing the watches slice for a controller after it
+ has started.

- While there is a lot of groundwork needed to fully support multi-cluster
- controllers, this proposal offers incremental steps to supporting one specific use case
- (conditionally runnable controllers), of which hopefully some minimal form can be agreed
- upon before needing a complete multi-cluster story.
+ The effect of this is that users have no clean way of restarting controllers
+ after they have stopped them. This would be useful for a number of use-cases
+ where controllers have little control over the installation or
+ uninstallation of the resources that these controllers are responsible
+ for watching.

## Goals

- (In order from most necessary to least)
- 1. A mechanism for starting/stopping/restarting controllers and their caches.
- 2. A solution for running a controller conditionally, such that it automatically
- starts/stops/restarts upon installation/uninstallation/reinstallation of its
- respective CRD in the cluster.
- 3. An easy-to-use mechanism for solving goal #2 without the end user needing to
- understand too much.
+ An implementation of the minimally viable hooks needed in controller-runtime to
+ enable users to start, stop, and restart controllers and their caches.

## Non-Goals

- TODO
+ A complete and ergonomic solution for automatically starting/stopping/restarting
+ controllers upon arbitrary conditions.

## Proposal

- The following proposal attempts to address the three goals in four separate
- steps.
+ The following proposal offers a solution for controller/cache restarting by:
+ 1. Enabling the removal of individual informers.
+ 2. Publicly exposing the informer removal and adding hooks into the internal
+ controller implementation to allow for restarting controllers that have been
+ stopped.

- The first goal of enabling the possibility of starting/stopping/restarting
- controllers is addressed in two parts, first where a mechanism is built to
- enable the removal of informers, and second where we expose this mechanism and
- enable the restarting of controllers.
+ This proposal focuses on solutions that are entirely contained in
+ controller-runtime. In the alternatives section, we discuss potential ways that
+ changes can be added to api-machinery code in core kubernetes to enable a
+ possibly cleaner way of accomplishing our goals.

- The second goal of a solution for running controllers conditionally is addressed
- by creating a wrapper around a controller called a ConditionalController that,
- within its Start() function, polls the discovery doc for the existence of the
- watched object and starts/stops/restarts the underlying controller as necessary.
+ ### Informer Removal

- The third goal of an ergonomic mechanism to use ConditionalControllers is a
- small modification to the controller builder to enable running a controller as a
- ConditionalController.
+ The starting point for this proposal is Shomron’s proposed implementation of
+ individual informer removal,
+ [#936](https://github.com/kubernetes-sigs/controller-runtime/pull/936).

- ###### Proof of concept
- A proof of concept PR exists at
- [#1180](https://github.com/kubernetes-sigs/controller-runtime/pull/1180)
+ Risks/mitigations and alternatives are discussed in the linked PR as well as in the
+ corresponding issue,
+ [#935](https://github.com/kubernetes-sigs/controller-runtime/issues/935). A
+ summary of these discussions is presented below.
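
As a rough illustration of the shape such an API could take (the names and signatures below are assumptions made for this document, not the interface proposed in #936), the cache could expose removal alongside its existing informer accessors:

```go
// Hypothetical sketch only: RemovableCache does not exist in
// controller-runtime; it shows what a cache that supports informer removal
// might look like from the caller's side.
package sketch

import (
	"context"

	"k8s.io/apimachinery/pkg/runtime"
	"sigs.k8s.io/controller-runtime/pkg/cache"
)

// RemovableCache is the existing cache behavior plus one extra method.
type RemovableCache interface {
	cache.Cache

	// Remove stops the informer backing the given object's GroupVersionKind
	// and drops it from the cache, so that a later GetInformer call would
	// construct a fresh informer.
	Remove(ctx context.Context, obj runtime.Object) error
}
```

A controller that is shutting down (for example because its CRD was uninstalled) could call `Remove` for the objects it watches, and a later restart would rebuild the informer on demand.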

- Each commit maps to one step in the proposal and can loosely be considered going
- from most necessary to least.
+ #### Risks and Mitigations

- ### Informer Removal [Pre-req]
+ * Controllers will silently degrade if the informer backing their watch is
+ removed. This is largely mitigated by the fact that the controller responsible
+ for removing the informer is the one impacted by the informer's removal, so the
+ degradation is expected. If this is insufficient
+ for all cases, a potential mitigation is to implement reference counting in
+ controller-runtime such that an informer is aware of any and all outstanding
+ references when its removal is requested (a sketch of this idea follows this list).

- This proposal assumes a mechanism exists for removing individual informers. We
- are agnostic to how this is done, but the current proposal is built on top of
- Shomron’s proposed implementation of individual informer removal
- [#936](https://github.com/kubernetes-sigs/controller-runtime/pull/936).
+ * Safety of stopping individual informers. There is concern that stopping
+ individual informers will leak goroutines or memory. We should be able to use
+ pprof tooling and existing leak tooling in controller-runtime to identify and
+ mitigate any leaks.
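
A minimal sketch of the reference-counting mitigation mentioned in the first bullet (purely illustrative; no such type exists in controller-runtime): the cache would only stop an informer once every consumer that acquired it has released it.

```go
package sketch

import "sync"

// refCountedInformer is an illustrative wrapper that tracks how many
// consumers (watches) still depend on an informer before it may be stopped.
type refCountedInformer struct {
	mu   sync.Mutex
	refs int
	stop func() // stops the underlying shared informer
}

// Acquire records one more consumer of the informer.
func (r *refCountedInformer) Acquire() {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.refs++
}

// Release drops one consumer and only stops the informer once the last
// consumer is gone, so other controllers sharing it are not silently degraded.
func (r *refCountedInformer) Release() {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.refs > 0 {
		r.refs--
	}
	if r.refs == 0 && r.stop != nil {
		r.stop()
		r.stop = nil
	}
}
```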

+ #### Alternatives

- Risks/mitigations and alternatives are discussed in the linked PR as well as the
- corresponding issue
- [#935](https://github.com/kubernetes-sigs/controller-runtime/issues/935).
+ * Creating a cache per watch (i.e. a cache of caches) as the end user. The advantage
+ of this is that it avoids having to modify any code in controller-runtime.
+ The main drawback is that it is very clunky to maintain multiple caches (one
+ for each informer) and it breaks the clean design of the cache (a sketch of what
+ this looks like from the user's side follows this list).
+
+ * Adding support to de-register EventHandlers from informers in apimachinery.
+ This, along with ref counting, would be the cleanest way to free us of the concern
+ of controllers silently failing when their watch is removed.
+ The downside is that we are ultimately at the mercy of whether apimachinery
+ wants to support these changes, and even if they were on board, it could take
+ a long time to land these changes upstream.

- We are currently looking into whether that PR can be distilled further into a more
- minimal solution.
+ * TODO: Bubbling up errors from apimachinery.
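
To make the cache-per-watch alternative concrete, here is a rough sketch of what it looks like from the end user's side (names and wiring are illustrative, and signatures differ between controller-runtime versions):

```go
// Illustrative only: the "cache of caches" alternative, where the user keeps a
// standalone cache per watched type instead of removing informers from the
// shared cache. Treat this as a sketch, not working production code.
package sketch

import (
	appsv1 "k8s.io/api/apps/v1"
	"sigs.k8s.io/controller-runtime/pkg/cache"
	"sigs.k8s.io/controller-runtime/pkg/controller"
	"sigs.k8s.io/controller-runtime/pkg/handler"
	"sigs.k8s.io/controller-runtime/pkg/manager"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
	"sigs.k8s.io/controller-runtime/pkg/source"
)

func setupWithStandaloneCache(mgr manager.Manager, r reconcile.Reconciler) error {
	// A dedicated cache for just this watch, separate from the manager's
	// shared cache, so it can be started and thrown away independently.
	standalone, err := cache.New(mgr.GetConfig(), cache.Options{
		Scheme: mgr.GetScheme(),
		Mapper: mgr.GetRESTMapper(),
	})
	if err != nil {
		return err
	}
	// The user now owns this cache's lifecycle; here it is simply handed to
	// the manager, but stopping/recreating it later requires bespoke plumbing.
	if err := mgr.Add(standalone); err != nil {
		return err
	}

	c, err := controller.New("deployment-controller", mgr, controller.Options{Reconciler: r})
	if err != nil {
		return err
	}
	// Watch through the standalone cache rather than the shared one.
	return c.Watch(
		source.NewKindWithCache(&appsv1.Deployment{}, standalone),
		&handler.EnqueueRequestForObject{},
	)
}
```

Multiplying this across many dynamically added types is where the approach becomes clunky: every watch needs its own cache, its own lifecycle management, and its own bookkeeping outside of controller-runtime.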

- ### Minimal hooks
+
+ ### Minimal hooks needed to use informer removal externally

Given that a mechanism exists to remove individual informers, the next step is
to expose this removal functionality and enable safely
@@ -105,105 +105,56 @@ controller is stopped.

#### Risks and Mitigations

+ * We lack a consistent story around multi-cluster support, and introducing
+ changes such as this without fully thinking through the multi-cluster story
+ might bind us for future designs. We think that restarting
+ controllers is a valid use-case even for a single cluster, regardless of the
+ multi-cluster use case.
+
* [#1139](https://github.com/kubernetes-sigs/controller-runtime/pull/1139) discusses why
the ability to start a controller more than once was taken away. It's a little
unclear what effect explicitly enabling multiple starts in the case of
- conditional controllers will hae on the number of workers and workqueues
+ conditional controllers will have on the number of workers and workqueues
relative to expectations and metrics.
+
* [#1163](https://github.com/kubernetes-sigs/controller-runtime/pull/1163) discusses the
memory leak caused by not clearing out the watches internal slice. A possible
mitigation is to clear out the slices upon ConditionalController shutdown.

#### Alternatives

* A metacontroller or CRD controller could start and stop controllers based on
- the existence of their corresponding CRDs. This requires no changes to be made to
- controller-runtime. It does put the complexity of designing such a controller
+ the existence of their corresponding CRDs. This puts the complexity of designing such a controller
onto the end user, but there are potentially ways to provide end users with
- default, pluggable CRD controllers.
+ default, pluggable CRD controllers. More importantly, this is probably not even
+ sufficient for enabling controller restarting, because informers are shared
+ between all controllers, so restarting the controller will still try to use the
+ informer that is erroring out if the CRD it is watching goes away. Some hook
+ into removing informers is still required in order to use a metacontroller.
+
* Instead of exposing ResetStart and SaveWatches on the internal controller struct,
it might be better to expose them on the controller interface. This is more public
and creates more potential for abuse, but it prevents some hacky solutions
discussed below around needing to cast to the internal controller or create
extra interfaces (a sketch of how such hooks might be consumed follows this list).
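
For illustration, the hooks discussed above might be consumed roughly as follows (the interface and signatures are assumptions based on this proposal, not an existing controller-runtime API):

```go
package sketch

import "context"

// RestartableController is a hypothetical view of a controller that exposes
// the hooks this proposal discusses. Start is shown in its context-based
// form; older controller-runtime versions use a stop channel instead.
type RestartableController interface {
	Start(ctx context.Context) error
	// SaveWatches tells the controller to keep its watches slice when it
	// stops, instead of clearing it, so they can be replayed on restart.
	SaveWatches(save bool)
	// ResetStart clears the controller's Started flag so Start may be called again.
	ResetStart()
}

// restart prepares a previously stopped controller to run again and starts it
// with a fresh context; the returned cancel func stops it once more.
func restart(parent context.Context, c RestartableController) context.CancelFunc {
	c.SaveWatches(true) // keep watches across stops
	c.ResetStart()

	runCtx, cancel := context.WithCancel(parent)
	go func() {
		// Errors are dropped here purely to keep the sketch short.
		_ = c.Start(runCtx)
	}()
	return cancel
}
```

Before restarting, the caller would also remove the relevant informers from the cache (using the removal mechanism above) so the restarted controller does not reattach to an informer that is failing because its CRD was uninstalled.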

- ### Conditional Controllers
-
- With the minimal hooks needed to start/stop/restart controllers and their caches
- in place, the next step is to provide a wrapper controller around a traditional
- controller that starts/stops the underlying controller based on the existence of
- the CRD under watch.
-
- The proposal to do this:
- 1. `ConditionalController` that implements controller and, within its Start()
- method:
- 2. Polls the discovery doc every configurable amount of time and recognizes when
- the CRD has been installed/uninstalled.
- 3. Upon installation it merges the caller’s stop channel with a local stop channel
- and runs the underlying controller with the merged stop channel such that both
- the caller and this conditional controller can stop it.
- 4. Upon uninstallation it sets the controller's `Started` field to false so that it
- can be restarted, indicates that the controller should save its watches upon
- stopping, and then stops the controller via the local stop channel and removes
- the cache for the object under watch.
-
- #### Risks and Mitigations
-
- * With the above minimal hooks exposing `ResetStart()` and `SaveWatches()` on the
- internal controller, this creates the need to create a hacky intermediate
- interface (StoppableController) for testing and to avoid casting to the internal
- controller. The simplest solution is just to expose these on the controller
- interface (see above alternative), but there might be a better way.
+ ## Future Work / Motivating Use-Cases

- #### Alternatives
-
- * A more general solution could allow for conditional runnables of more than just
- controllers, where the user could supply the conditional check on their own,
- such that it’s not just limited to looking at the existence of a CRD. (But we
- believe that the common case will be a controller watching CRD install and thus
- there should still be a simple way to do this.)
- * Nit: it's unclear if the controller package is the best place for this to live.
- Originally it was thought that controllerutil might be best, but that creates an
- import cycle.
-
- ### Builder Ergonomics
-
- Since we provide simple ergonomics for creating controllers (builder), it would
- make sense to do the same for any conditional controller utility we create.
-
- The current proposal (link) is to:
- 1. Provide a forInput boolean option that the user can set to enable conditionally
- running and an option to configure the wait time that the ConditionalController
- will wait between attempts to poll the discovery doc.
- 2. In the builder’s doController function, it will first create an unmanaged
- controller.
- 3. If the conditionally run option is set, it will wrap the unmanaged controller in
- a conditional controller.
- 4. Add the resulting controller to the manager.
+ Were this to move forward, it would unlock a number of potential use-cases.

- #### Risks and Mitigations
+ 1. OPA/Gatekeeper can simplify its dynamic watch management by having greater
+ control over the lifecycle of a controller. See [this
+ doc](https://docs.google.com/document/d/1Wi3LM3sG6Qgfzm--bWb6R0SEKCkQCCt-ene6cO62FlM/edit).

- TODO
-
- #### Alternatives
- * A separate builder specifically for the conditional controller would prevent us
- from having to make changes to the current builder.
- * As noted above, a more general conditional runnable will probably require its own
- builder.
-
- ## Acceptance Criteria
-
- Ultimately, the end user must have some successful solution that involves
- starting with a controller for a CRD that can:
- 1. Run the mgr without the CRD installed yet, and the mgr should start successfully.
- 2. After CRD installation, the controller should start and run successfully.
- 3. Upon uninstalling the CRD, the controller should stop but the manager should
- continue running successfully.
- 4. Upon CRD reinstallation, the controller should once again start and run
- successfully with no disruption to the manager and its other runnables.
+ 2. We can support controllers that can conditionally start/stop/restart based on
+ the installation/uninstallation of their CRD. See [this proof-of-concept branch](https://github.com/kevindelgado/controller-runtime/tree/experimental/conditional-runnables)
+ and the sketch below.
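
As a sketch of the second use case (illustrative only; the proof-of-concept branch linked above structures this differently), the condition itself can be as simple as a discovery check that a wrapper runnable polls on a timer, starting or stopping the wrapped controller on transitions:

```go
package sketch

import (
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/client-go/discovery"
	"k8s.io/client-go/rest"
)

// crdInstalled reports whether a kind is currently served under the given
// group/version by consulting the discovery API. A wrapper runnable would
// poll this and start/stop/restart its controller when the answer changes.
func crdInstalled(cfg *rest.Config, groupVersion, kind string) (bool, error) {
	dc, err := discovery.NewDiscoveryClientForConfig(cfg)
	if err != nil {
		return false, err
	}
	resources, err := dc.ServerResourcesForGroupVersion(groupVersion)
	if apierrors.IsNotFound(err) {
		// The group/version is not served at all, so the CRD is not installed.
		return false, nil
	}
	if err != nil {
		return false, err
	}
	for _, r := range resources.APIResources {
		if r.Kind == kind {
			return true, nil
		}
	}
	return false, nil
}
```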

## Timeline of Events

* 9/30/2020: Propose idea in design doc and proof-of-concept PR to
controller-runtime
+ * 10/7/2020: Design doc and proof-of-concept distilled to focus only on minimal
+ hooks needed rather than proposing entire conditional controller solution.
+ Alternatives added to discuss opportunities to push some of the implementation
+ to the api-machinery level.
* 10/8/2020: Discuss idea in community meeting