Controller Lifecycle Management
===============================

## Summary

… control over which resources are installed).

## Goals

controller-runtime should support starting/stopping/restarting controllers and
their caches on arbitrary conditions.

## Non-Goals

* An implementation of starting/stopping a controller based on a specific
  condition (e.g. the existence of a CRD in discovery) does not need to be
  implemented in controller-runtime (but the end consumer of controller-runtime
  should be able to build it on their own).

* No further guarantees of multi-cluster support beyond what is already provided
  by controller-runtime.

## Proposal

The following proposal offers a solution for controller/cache restarting by:
1. Enabling the removal of individual informers.
2. Tracking the number of event handlers on an Informer.

This proposal focuses on solutions that are entirely contained in
controller-runtime. In the alternatives section, we discuss potential ways that
changes can be added to api-machinery code in core kubernetes to enable a
possibly cleaner SharedInformer interface for accomplishing our goals.

### Informer Removal

The starting point for this proposal is Shomron’s proposed implementation of
individual informer removal
([#936](https://github.com/kubernetes-sigs/controller-runtime/pull/936)).

It extends the informer cache to expose a remove method:
```
// Remove removes an informer specified by the obj argument from the cache and
// stops it if it existed.
Remove(ctx context.Context, obj runtime.Object) error
```
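
As an illustration (not from the PR), a consumer could use this to drop the
informer for a type it no longer watches. `examplev1.Widget` is a hypothetical
API type, and the snippet assumes the proposed `Remove` method lands on the
`cache.Cache` interface returned by `mgr.GetCache()`:

```
// Hypothetical usage: drop the informer for a type we no longer watch.
if err := mgr.GetCache().Remove(ctx, &examplev1.Widget{}); err != nil {
	log.Error(err, "unable to remove informer for Widget")
}
```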

This gets plumbed through to the informers map, where we remove the informer's
entry from the `informersByGVK` map:

```
// Remove removes an informer entry and stops it if it was running.
func (ip *specificInformersMap) Remove(gvk schema.GroupVersionKind) {
	...
	entry, ok := ip.informersByGVK[gvk]
	if !ok {
		return
	}
	close(entry.stop)
	delete(ip.informersByGVK, gvk)
}
```

We support this by modifying the Start method to run each informer with
a stop channel that can be closed either individually or globally (as opposed
to just globally).

```
// Start calls Run on each of the informers...
func (ip *specificInformersMap) Start(ctx context.Context) {
	...
	for _, entry := range ip.informersByGVK {
		go entry.Start(ctx.Done())
	}
	...
}

// Start starts the informer managed by a MapEntry.
// Blocks until the informer stops. The informer can be stopped
// either individually (via the entry's stop channel) or globally
// via the provided stop argument.
func (e *MapEntry) Start(stop <-chan struct{}) {
	// Stop on either the whole map stopping or just this informer being removed.
	internalStop, cancel := anyOf(stop, e.stop)
	defer cancel()
	e.Informer.Run(internalStop)
}
```
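
The `anyOf` helper is not shown in the PR snippet above; a minimal two-channel
sketch of what it might look like (the actual helper may differ, and this
assumes the standard library `sync` package is imported) is:

```
// anyOf returns a channel that is closed as soon as either input channel
// closes, plus a cancel func that releases the internal goroutine.
func anyOf(a, b <-chan struct{}) (<-chan struct{}, func()) {
	out := make(chan struct{})
	cancelCh := make(chan struct{})
	go func() {
		defer close(out)
		select {
		case <-a: // global stop
		case <-b: // per-informer stop
		case <-cancelCh: // caller released the channel
		}
	}()
	var once sync.Once
	return out, func() { once.Do(func() { close(cancelCh) }) }
}
```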

### EventHandler reference counting

The primary concern with adding individual informer removal is that it exposes
the risk of silently degrading controllers that are using the informer being
removed.

To mitigate this, we need a way to track the number of EventHandlers added to a
given informer and ensure that there are none remaining when an informer is
removed.

The proposal here is to expand the informer interface used by the `MapEntry` in
the informers map:
```
// CountingInformer exposes a way to track the number of EventHandlers
// registered on an informer.
//
// The implementation of this should store a count that gets incremented every
// time AddEventHandler is called and decremented every time RemoveEventHandler
// is called.
type CountingInformer interface {
	cache.SharedIndexInformer
	CountEventHandlers() int
	RemoveEventHandler()
}
```
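
A minimal sketch of a wrapper that could satisfy this interface, assuming it
embeds the client-go informer (the type and field names here are illustrative,
not from the proposal):

```
// countingInformer wraps a client-go informer and tracks how many
// EventHandlers have been registered on it.
type countingInformer struct {
	cache.SharedIndexInformer
	mu    sync.Mutex
	count int
}

func (c *countingInformer) AddEventHandler(h cache.ResourceEventHandler) {
	c.mu.Lock()
	c.count++
	c.mu.Unlock()
	c.SharedIndexInformer.AddEventHandler(h)
}

// RemoveEventHandler only decrements the count; the underlying client-go
// informer currently offers no way to deregister the handler itself.
func (c *countingInformer) RemoveEventHandler() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.count--
}

func (c *countingInformer) CountEventHandlers() int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.count
}
```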

Now we need a way to run a controller such that starting the controller
increments the EventHandler count on the informer and stopping the controller
decrements it.

We can modify/expand the `Source` interface such that we can call `Start()`
with a context, block on `ctx.Done()`, and call RemoveEventHandler on the
informer once the context is cancelled: at that point the controller has been
stopped, so the EventHandler no longer exists on the informer.

```
// StoppableSource expands the Source interface to add a start function that
// blocks on the context's Done channel, so that we know when the controller has
// been stopped and can remove/decrement the EventHandler count on the informer
// appropriately.
type StoppableSource interface {
	Source
	StartStoppable(context.Context, handler.EventHandler,
		workqueue.RateLimitingInterface, ...predicate.Predicate) error
}

// StartStoppable starts the source, blocks until the context is done, and then
// calls RemoveEventHandler on the kind's informer.
func (ks *Kind) StartStoppable(ctx context.Context, handler handler.EventHandler,
	queue workqueue.RateLimitingInterface, prct ...predicate.Predicate) error {
	informer, err := ks.cache.GetInformer(ctx, ks.Type)
	if err != nil {
		return err
	}
	if err := ks.Start(ctx, handler, queue, prct...); err != nil {
		return err
	}
	<-ctx.Done()
	informer.RemoveEventHandler()
	return nil
}
```
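
Hypothetically, a consumer managing a controller's lifecycle could then drive
the source with a cancellable context (`examplev1.Widget` and the surrounding
wiring are illustrative):

```
// Sketch: cancelling the context unblocks StartStoppable, which then
// decrements the informer's EventHandler count.
ctx, cancel := context.WithCancel(context.Background())
src := &source.Kind{Type: &examplev1.Widget{}}
go func() {
	if err := src.StartStoppable(ctx, &handler.EnqueueRequestForObject{}, queue); err != nil {
		log.Error(err, "stoppable source failed")
	}
}()

// ... later, when the controller should be stopped:
cancel()
```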

A simpler alternative to all of this is to bake the `EventHandler` tracking
into the underlying client-go `SharedIndexInformer`. This involves modifying
client-go and is discussed below in the alternatives section.

## Alternatives

We consider a couple of alternatives to this proposal: one that offloads all the
responsibility to the consumer and requires no changes to controller-runtime,
and one that moves the event handler reference counting upstream into client-go.

### Cache of Caches

One alternative is to create a separate cache per watch (i.e. a cache of caches)
as the end user consuming controller-runtime. The advantage is that it prevents
us from having to modify any code in controller-runtime. The main drawback is
that it's very clunky to maintain multiple caches (one for each informer) and it
breaks the clean design of the cache.

### Stoppable Informers and EventHandler removal natively in client-go

A few changes to the underlying SharedInformer interface in client-go could save
us a lot of work in controller-runtime.

One is to add a second `Run` method to the `SharedInformer` interface, such as:
```
// RunWithStopOptions runs the informer and provides options to be checked that
// would indicate under what conditions the informer should stop.
RunWithStopOptions(stopOptions StopOptions) StopReason

// StopOptions let the caller specify the conditions under which to stop the
// informer.
type StopOptions struct {
	// StopChannel stops the Informer when it receives a close signal.
	StopChannel <-chan struct{}

	// OnListError inspects errors returned from the informer's underlying
	// reflector, and based on the error determines whether or not to stop the
	// informer.
	OnListError func(err error) bool
}

// StopReason is a custom typed error that indicates how the informer was
// stopped.
type StopReason error
```
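
For example, a consumer could hypothetically use `OnListError` to stop an
informer once its backing CRD is uninstalled (this sketch assumes the proposed
API above plus the `apierrors` helpers from apimachinery):

```
// Sketch: stop the informer when the API server reports that the resource
// no longer exists (e.g. its CRD was uninstalled).
reason := informer.RunWithStopOptions(StopOptions{
	StopChannel: stopCh,
	OnListError: func(err error) bool {
		return apierrors.IsNotFound(err)
	},
})
log.Info("informer stopped", "reason", reason)
```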

This could be plumbed through controller-runtime, allowing users to pass
informer `StopOptions` when they create a controller.

Another potential change to `SharedInformer` is to have it be responsible for
tracking the count of event handlers and to allow consumers to remove
EventHandlers from a `SharedInformer`.

Similar to the `EventHandler` reference counting interface discussed above, the
`SharedInformer` could add methods:

```
type SharedInformer interface {
	...
	CountEventHandlers() int
	RemoveEventHandler(id) error
}
```
(where `id` is a to-be-determined identifier of the specific handler to be
removed).

This would remove the need to track EventHandler counts ourselves in
controller-runtime.

TODO: Bring this design up (potentially with a proof-of-concept branch) at an
upcoming sig-api-machinery meeting to begin getting feedback and see whether it
is feasible to land the changes in client-go.

## Open Questions

### Multiple Restarts

The ability to start a controller more than once was recently removed over
concerns around data races from workqueue recreation and starting too many
workers ([#1139](https://github.com/kubernetes-sigs/controller-runtime/pull/1139)).

Previously there was never a valid reason to start a controller more than once;
now we want to support that. The simplest workaround is to expose a
`ResetStart()` method on the controller interface that simply resets the
`Started` field on the controller. This allows restarts but doesn't address the
initial data race/excess worker concerns that drove the removal of this
functionality in the first place.
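
A sketch of that workaround on the internal controller, assuming its `Started`
field and mutex stay as they are today:

```
// ResetStart marks the controller as not started so Start can be called
// again. It deliberately leaves the queue and workers untouched, which is
// why the original data-race concerns remain unaddressed.
func (c *Controller) ResetStart() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.Started = false
}
```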

### Persisting controller watches

The list of start watches stored on the controller currently gets cleared out
once the controller is started in order to prevent a memory leak
([#1163](https://github.com/kubernetes-sigs/controller-runtime/pull/1163)).

We need to retain references to these watches in order to restart controllers.
The simplest workaround is to maintain a boolean `SaveWatches` field on the
controller that determines whether or not the watches slice should be retained
after starting a controller. We can then expose a method on the controller for
setting this field and enabling watches to be saved. We will still have a memory
leak, but only on controllers that are expected to be restarted. There might be
a better way to clear out the watches list that fits with the restartable
controllers paradigm.
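
A sketch of how the start path might honor such a field (the `SaveWatches` name
comes from this proposal; `startWatches` is the controller's internal slice):

```
// At the end of Controller.Start, release the watches only when the
// controller is not expected to be restarted. Restartable controllers keep
// the references and accept the retained memory.
if !c.SaveWatches {
	c.startWatches = nil
}
```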

### Multi-Cluster Support

A multi-cluster environment where a single controller operates across multiple
clusters will likely be a heavy user of these new features, because
multi-cluster setups frequently result in a situation where the cluster
administrator is different from the controller administrator.

We lack a consistent story around multi-cluster support, and introducing
changes such as these without fully thinking through the multi-cluster story
might bind us for future designs.

## Future Work / Use Cases

Were this to move forward, it unlocks a number of potential use-cases.

1. We can support a "metacontroller" or controller of controllers that could
   start and stop controllers based on the existence of their corresponding
   CRDs.

2. OPA/Gatekeeper can simplify its dynamic watch management by having greater
   control over the lifecycle of a controller. See [this
   doc](https://docs.google.com/document/d/1Wi3LM3sG6Qgfzm--bWb6R0SEKCkQCCt-ene6cO62FlM/edit).

## Timeline of Events

* 9/30/2020: Propose idea in design doc and proof-of-concept PR to
  controller-runtime
* 10/7/2020: Design doc and proof-of-concept distilled to focus only on minimal
  hooks needed rather than proposing entire conditional controller solution.
  Alternatives added to discuss opportunities to push some of the implementation
  to the api-machinery level.
* 10/8/2020: Discuss idea in community meeting
* 10/19/2020: Proposal updated to add EventHandler count tracking and client-go
  based alternative.