|
| 1 | +Conditionally Runnable Controllers |
| 2 | +========================== |
| 3 | + |
| 4 | +## Summary |
| 5 | + |
| 6 | +Enable controller managers to successfully operate when the CRD the controller |
| 7 | +is configured to watch does not exist. |
| 8 | + |
| 9 | +Successful operation of a controller manager includes: |
| 10 | +* Starts and runs without error when a CRD the controller watches does not exist. |
| 11 | +* Begins watching the CRD once it is installed. |
| 12 | +* Unregisters (stops watching) once a CRD is uninstalled. |
| 13 | + |
| 14 | +## Background/Motivation |
| 15 | + |
| 16 | +Usually there is a 1:1 relationship between controller and resource and a 1:1 |
| 17 | +relationship between controller and k8s cluster. When this is the case it is |
| 18 | +fine to assume that for a controller to run successfully, the resource it |
| 19 | +controls must be installed on the cluster that the controller is watching. |
| 20 | + |
| 21 | +We are now seeing use cases where a single controller is responsible for |
| 22 | +multiple resources across multiple clusters. This creates situations where |
| 23 | +controllers need to successfully run even when the resource is not installed, |
| 24 | +and need to proceed successfully as it begins/terminates its watch on a resource |
| 25 | +when the resource is installed/uninstalled. |
| 26 | + |
| 27 | +The current approach is to check the discovery for a CRDs existence prior to |
| 28 | +adding the controller to the manager. This has its limitations, as complexity |
| 29 | +increases greatly for users who need to manage multiple controllers for a |
| 30 | +mixture of CRDs that might not always be installed on the cluster or are |
| 31 | +installed in different order. |
| 32 | + |
| 33 | +While there is a lot of groundwork needed to fully support multi-cluster |
| 34 | +controllers, this proposal offers incremental steps to supporting one specific use case |
| 35 | +(conditionally runnable controllers), of which hopefully some minimal form can be agreed |
| 36 | +upon before needing a comple multi-cluster story. |
| 37 | + |
| 38 | +## Goals |
| 39 | + |
| 40 | +(In order from most necessary to least) |
| 41 | +1. A mechanism for starting/stopping/restarting controllers and their caches. |
| 42 | +2. A solution for running a controller conditionally, such that it automatically |
| 43 | +starts/stops/restarts upon installation/uninstallation/reinstallation of its |
| 44 | +respective CRD in the cluster. |
| 45 | +3. An easy-to-use mechanism for solving goal #2 without the end user needing to |
| 46 | +understand too much. |
| 47 | + |
| 48 | +## Non-Goals |
| 49 | + |
| 50 | +TODO |
| 51 | + |
| 52 | +## Proposal |
| 53 | + |
| 54 | +The following proposal attempts to address the three goals in four separate |
| 55 | +steps. |
| 56 | + |
| 57 | +The first goal of enabling the possibility of starting/stopping/restarting |
| 58 | +controllers is addressed in two parts, first where a mechanism is built to |
| 59 | +enable the removal of informers, and second where we expose this mechanism and |
| 60 | +enable the restarting of controllers |
| 61 | + |
| 62 | +The second goal of a solution for running controllers conditionally is addressed |
| 63 | +by creating a wrapper around a controller called a ConditionalController that |
| 64 | +within its Start() function, it polls the discovery doc for the existence of the |
| 65 | +watched object and starts/stops/restarts the underlying controller as necessary |
| 66 | + |
| 67 | +The third goal of an ergonomic mechanism to use ConditionalControllers is a |
| 68 | +small modification to the controller builder to enable running a controller as a |
| 69 | +ConditionalController. |
| 70 | + |
| 71 | +###### Proof of concept |
| 72 | +A proof of concept PR exists at |
| 73 | +[#1180](https://github.com/kubernetes-sigs/controller-runtime/pull/1180) |
| 74 | + |
| 75 | +Each commit maps to one step in the proposal and can loosely be considered going |
| 76 | +from most necessary to least. |
| 77 | + |
| 78 | +### Informer Removal [Pre-req] |
| 79 | + |
| 80 | +This proposal assumes a mechanism exists for removing individual informers. We |
| 81 | +are agnostic to how this is done, but the current proposal is built on top of |
| 82 | +Shomron’s proposed implementation of individual informer removal |
| 83 | +[#936](https://github.com/kubernetes-sigs/controller-runtime/pull/936). |
| 84 | + |
| 85 | + |
| 86 | +Risks/mitigations and alternatives are discussed in the linked PR as well as the |
| 87 | +corresponding issue |
| 88 | +[#935](https://github.com/kubernetes-sigs/controller-runtime/issues/935). |
| 89 | + |
| 90 | +We are currently looking into whether that PR can be distilled further into a more |
| 91 | +minimal solution. |
| 92 | + |
| 93 | +### Minimal hooks |
| 94 | + |
| 95 | +Given that a mechanism exists to remove individual informers, the next step is |
| 96 | +to expose this removal functionality and enable safely |
| 97 | +starting/stopping/restarting controllers and their caches. |
| 98 | + |
| 99 | +The proposal to do this is: |
| 100 | +1. Expose the `informerCache.Remove()` functionality on the cache interface. |
| 101 | +2. Expose the ability to reset the internal controller’s `Started` field. |
| 102 | +3. Expose a field on the internal controller to determine whether to `saveWatches` |
| 103 | +or not and use this field to not empty the controller’s `startWatches` when the |
| 104 | +controller is stopped. |
| 105 | + |
| 106 | +#### Risks and Mitigations |
| 107 | + |
| 108 | +* [#1139](https://github.com/kubernetes-sigs/controller-runtime/pull/1139) discusses why |
| 109 | +the ability to start a controller more than once was taken away. It's a little |
| 110 | +unclear what effect explicitly enabling multiple starts in the case of |
| 111 | +conditional controllers will hae on the number of workers and workqueues |
| 112 | +relative to expectations and metrics. |
| 113 | +* [#1163](https://github.com/kubernetes-sigs/controller-runtime/pull/1163) discusses the |
| 114 | +memory leak caused by no clearing out the watches internal slice. A possible |
| 115 | +mitigation is to clear out the slices upon ConditionalController shutdown. |
| 116 | + |
| 117 | +#### Alternatives |
| 118 | + |
| 119 | +* Instead of exposing ResetStart and SaveWatches on the internal controller struct |
| 120 | +it might be better to expose it on the controller interface. This is more public |
| 121 | +and creates more potential for abuse, but it prevents some hacky solutions |
| 122 | +discussed below around needing to cast to the internal controller or create |
| 123 | +extra interfaces. |
| 124 | + |
| 125 | +### Conditional Controllers |
| 126 | + |
| 127 | +With the minimal hooks needed to start/stop/restart controllers and their caches |
| 128 | +in place, the next step is to provide a wrapper controller around a traditional |
| 129 | +controller that starts/stops the underlying controller based on the existence of |
| 130 | +the CRD under watch. |
| 131 | + |
| 132 | +The proposal to do this: |
| 133 | +1. `ConditionalController` that takes implements controller and with in Start() |
| 134 | +method: |
| 135 | +2. Polls the discovery doc every configurable amount of time and recognizes when |
| 136 | +the CRD has been installed/uninstalled. |
| 137 | +3. Upon installation it merges the caller’s stop channel with a local start channel |
| 138 | +and runs the underlying controller with the merged stop channel such that both |
| 139 | +the caller and this conditional controller can stop it. |
| 140 | +4. Upon uninstallation it sets the controllers `Started` field to false so that it |
| 141 | +can be restarted, indicates that the controller should save its watches upon |
| 142 | +stopping, and then stops the controller via the local stop channel and removes |
| 143 | +the cache for the object under watch. |
| 144 | + |
| 145 | +#### Risks and Mitigations |
| 146 | + |
| 147 | +* With the above minimal hooks exposing `ResetStart()` and `SaveWatches()` on the |
| 148 | +internal controller this creates the need to create a hacky intermediate |
| 149 | +interface (StoppableController) for testing and to avoid casting to the internal |
| 150 | +controller. The simplest solution is just to expose these on the controller |
| 151 | +interface (see above alternative), but there might be a better way. |
| 152 | + |
| 153 | +#### Alternatives |
| 154 | + |
| 155 | +* A more general solution could allow for conditional runnables of more than just |
| 156 | +controllers, where the user could supply the conditional check on their own, |
| 157 | +such that it’s not just limited to looking at the existence of a CRD. (but we |
| 158 | +believe that the common case will be a controller watching CRD install and thus |
| 159 | +there should still be a simple way to do this) |
| 160 | +* Nit: it's unclear if the controller package is the best place for this to live. |
| 161 | +Originally it was thought that controllerutil might be best, but that creates an |
| 162 | +import cycle. |
| 163 | + |
| 164 | +### Builder Ergonomics |
| 165 | + |
| 166 | +Since we provide simple ergonomics for creating controllers (builder), it would |
| 167 | +make sense to do the same for any conditional controller utility we create. |
| 168 | + |
| 169 | +The current proposal (link) is to: |
| 170 | +1. Provide a forInput boolean option that the user can set to enable conditionally |
| 171 | +running and an option to configure the wait time that the ConditionalController |
| 172 | +will wait between attempts to poll the discovery doc. |
| 173 | +2. In the builder’s doController function, it will first create an unmanaged |
| 174 | +controller. |
| 175 | +3. If the conditionally run option is set, it will wrap the unmanaged controller in |
| 176 | +a conditional controller. |
| 177 | +4. Add the resulting controller to the manager. |
| 178 | + |
| 179 | +#### Risks and Mitigations |
| 180 | + |
| 181 | +TODO |
| 182 | + |
| 183 | +#### Alternatives |
| 184 | +* A separate builder specifically for the conditional controller would prevent us |
| 185 | +from having to make changes to the current builder. |
| 186 | +* As noted above a more general conditional runnable will probably require its own |
| 187 | +builder. |
| 188 | + |
| 189 | +## Acceptance Criteria |
| 190 | + |
| 191 | +Ultimately, the end user must have some successful solution that involves |
| 192 | +starting with a controller for a CRD that can |
| 193 | +1. Run the mgr without CRD installed yet and mgr should start successfully |
| 194 | +2. After CRD installation the controller should start and run successfully |
| 195 | +3. Upon uninstalling the CRD, the controller should stop but the manager should |
| 196 | +continue running successfully |
| 197 | +4. Upon CRD reinstallation, the controller should once again start and run |
| 198 | +successfully with no disruption to the manager and its other runnables. |
| 199 | + |
| 200 | +## Timeline of Events |
| 201 | +* 9/30/2020: Propose idea in design doc and proof-of-concept PR to |
| 202 | +controller-runtime |
| 203 | +* 10/8/2020: Discuss idea in community meeting |
| 204 | + |
0 commit comments