Commit 265a8df: Add event handler ref counting and client-go alternative

designs/conditional-controllers.md (+236 -112 lines)

Controller Lifecycle Management
==========================

## Summary

[…] control over which resources are installed).

## Goals

controller-runtime should support starting/stopping/restarting controllers and
their caches on arbitrary conditions.

## Non-Goals

* An implementation of starting/stopping a controller based on a specific
  condition (e.g. the existence of a CRD in discovery) does not need to be
  implemented in controller-runtime (but the end consumer of controller-runtime
  should be able to build it on their own).

* No further guarantees of multi-cluster support beyond what is already provided
  by controller-runtime.

## Proposal

The following proposal offers a solution for controller/cache restarting by:
1. Enabling the removal of individual informers.
2. Tracking the number of event handlers on an Informer.

This proposal focuses on solutions that are entirely contained in
controller-runtime. In the alternatives section, we discuss potential ways that
changes can be added to api-machinery code in core kubernetes to enable a
possibly cleaner SharedInformer interface for accomplishing our goals.

### Informer Removal

The starting point for this proposal is Shomron's proposed implementation of
individual informer removal
([#936](https://github.com/kubernetes-sigs/controller-runtime/pull/936)).

It extends the informer cache to expose a remove method:
```
// Remove removes an informer specified by the obj argument from the cache and
// stops it if it existed.
Remove(ctx context.Context, obj runtime.Object) error
```
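
As a usage sketch (`examplev1.Widget` is an illustrative CRD-backed type and
`mgr` a typical manager variable, neither taken from the PR), a consumer that
learns a CRD has been uninstalled could drop its informer:

```
// Sketch: stop watching a type whose CRD has been uninstalled so the cache
// no longer lists/watches a resource the API server can't serve.
if err := mgr.GetCache().Remove(ctx, &examplev1.Widget{}); err != nil {
    log.Error(err, "unable to remove informer for uninstalled type")
}
```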

This gets plumbed through to the informers map, where we remove the informer's
entry from the informersByGVK map:
```
// Remove removes an informer entry and stops it if it was running.
func (ip *specificInformersMap) Remove(gvk schema.GroupVersionKind) {
    ...
    entry, ok := ip.informersByGVK[gvk]
    if !ok {
        return
    }
    close(entry.stop)
    delete(ip.informersByGVK, gvk)
}
```

We support this by modifying the Start method to run each informer with
a stop channel that can be closed individually or globally (as opposed to just
globally):
```
// Start calls Run on each of the informers...
func (ip *specificInformersMap) Start(ctx context.Context) {
    ...
    for _, entry := range ip.informersByGVK {
        go entry.Start(ctx.Done())
    }
    ...
}

// Start starts the informer managed by a MapEntry.
// Blocks until the informer stops. The informer can be stopped
// either individually (via the entry's stop channel) or globally
// via the provided stop argument.
func (e *MapEntry) Start(stop <-chan struct{}) {
    // Stop on either the whole map stopping or just this informer being removed.
    internalStop, cancel := anyOf(stop, e.stop)
    defer cancel()
    e.Informer.Run(internalStop)
}
```
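
The `anyOf` helper merges the two stop signals. A minimal sketch of what it
might look like (the PR's actual implementation may differ):

```
import "sync"

// anyOf returns a channel that is closed as soon as either input channel is
// closed, plus a cancel func that releases the forwarding goroutine.
func anyOf(a, b <-chan struct{}) (<-chan struct{}, func()) {
    out := make(chan struct{})
    cancelCh := make(chan struct{})
    var once sync.Once
    go func() {
        defer close(out)
        select {
        case <-a:
        case <-b:
        case <-cancelCh:
        }
    }()
    return out, func() { once.Do(func() { close(cancelCh) }) }
}
```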

### EventHandler reference counting

The primary concern with adding individual informer removal is that it exposes
the risk of silently degrading controllers that are using the informer being
removed.

To mitigate this we need a way to track the number of EventHandlers added to a
given informer and ensure that there are none remaining when an informer is
removed.

The proposal here is to expand the informer interface used by the `MapEntry` in
the informers map:
```
// CountingInformer exposes a way to track the number of EventHandlers
// registered on an informer.
//
// The implementation of this should store a count that gets incremented every
// time AddEventHandler is called and decremented every time RemoveEventHandler
// is called.
type CountingInformer interface {
    cache.SharedIndexInformer
    CountEventHandlers() int
    RemoveEventHandler()
}
```
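
One possible shape for an implementation (a sketch; the
`handlerCountingInformer` name and field layout are illustrative, not from the
PR):

```
// handlerCountingInformer wraps a SharedIndexInformer and keeps a count of
// registered handlers. RemoveEventHandler only decrements the count, since
// client-go currently offers no way to actually detach a handler (see the
// client-go alternative below).
type handlerCountingInformer struct {
    cache.SharedIndexInformer
    mu    sync.Mutex
    count int
}

func (i *handlerCountingInformer) AddEventHandler(handler cache.ResourceEventHandler) {
    i.mu.Lock()
    i.count++
    i.mu.Unlock()
    i.SharedIndexInformer.AddEventHandler(handler)
}

func (i *handlerCountingInformer) RemoveEventHandler() {
    i.mu.Lock()
    defer i.mu.Unlock()
    i.count--
}

func (i *handlerCountingInformer) CountEventHandlers() int {
    i.mu.Lock()
    defer i.mu.Unlock()
    return i.count
}
```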

Now we need a way to run a controller such that starting the controller
increments the EventHandler count on the informer and stopping the controller
decrements it.

We can modify/expand the `Source` interface, such that we can call
`Start()` with a context that blocks on ctx.Done() and calls RemoveEventHandler
on the informer once the context is done, because at that point the controller
has been stopped and its EventHandler should no longer be counted against the
informer.
```
// StoppableSource expands the Source interface to add a start function that
// blocks on the context's Done channel, so that we know when the controller has
// been stopped and can remove/decrement the EventHandler count on the informer
// appropriately.
type StoppableSource interface {
    Source
    StartStoppable(context.Context, handler.EventHandler,
        workqueue.RateLimitingInterface, ...predicate.Predicate) error
}

// StartStoppable blocks for start to finish and then calls RemoveEventHandler
// on the kind's informer.
func (ks *Kind) StartStoppable(ctx context.Context, handler handler.EventHandler,
    queue workqueue.RateLimitingInterface, prct ...predicate.Predicate) error {
    informer, err := ks.cache.GetInformer(ctx, ks.Type)
    if err != nil {
        return err
    }
    if err := ks.Start(ctx, handler, queue, prct...); err != nil {
        return err
    }
    <-ctx.Done()
    informer.RemoveEventHandler()
    return nil
}
```
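
As an illustrative sketch of how the controller's watch-starting loop might use
this (the `startWatches`/`watchDescription` details here are assumptions about
the internal controller, not code from the PR):

```
// Sketch: prefer StartStoppable for sources that support it, so stopping the
// controller decrements the informer's EventHandler count.
for _, w := range c.startWatches {
    if src, ok := w.src.(source.StoppableSource); ok {
        go func(w watchDescription) {
            // Blocks until the controller's context is done; error handling
            // is elided in this sketch.
            _ = src.StartStoppable(ctx, w.handler, c.Queue, w.predicates...)
        }(w)
        continue
    }
    if err := w.src.Start(ctx, w.handler, c.Queue, w.predicates...); err != nil {
        return err
    }
}
```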

A simpler alternative to all of this is to bake the `EventHandler` tracking into
the underlying client-go `SharedIndexInformer`. This involves modifying
client-go and is discussed below in the alternatives section.

## Alternatives

There are a couple of alternatives to what we propose: one that offloads all the
responsibility to the consumer, requiring no changes to controller-runtime, and
one that moves the event handler reference counting upstream into client-go.

### Cache of Caches

One alternative is to create a separate cache per watch (i.e. cache of caches)
as the end user consuming controller-runtime. The advantage is that it prevents
us from having to modify any code in controller-runtime. The main drawback is
that it's very clunky to maintain multiple caches (one for each informer) and
breaks the clean design of the cache.
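
For concreteness, such a consumer-side structure might look like this (a
sketch; `cacheEntry` and `cacheOfCaches` are illustrative names, not an
existing API):

```
// cacheEntry pairs a single-type cache with the cancel func that stops it.
type cacheEntry struct {
    cache  cache.Cache
    cancel context.CancelFunc
}

// cacheOfCaches maintains one cache per watched GVK so that an individual
// watch can be stopped by cancelling its cache's context.
type cacheOfCaches struct {
    mu      sync.Mutex
    entries map[schema.GroupVersionKind]cacheEntry
}

// Stop stops the cache (and therefore the informer) for a single GVK.
func (c *cacheOfCaches) Stop(gvk schema.GroupVersionKind) {
    c.mu.Lock()
    defer c.mu.Unlock()
    if e, ok := c.entries[gvk]; ok {
        e.cancel()
        delete(c.entries, gvk)
    }
}
```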

### Stoppable Informers and EventHandler removal natively in client-go

A few changes to the underlying SharedInformer interface in client-go could save
us a lot of work in controller-runtime.

One is to add a second `Run` method on the `SharedInformer` interface, such as:
```
// RunWithStopOptions runs the informer and provides options to be checked that
// would indicate under what conditions the informer should stop.
RunWithStopOptions(stopOptions StopOptions) StopReason

// StopOptions lets the caller specify which conditions should stop the informer.
type StopOptions struct {
    // StopChannel stops the Informer when it receives a close signal.
    StopChannel <-chan struct{}

    // OnListError inspects errors returned from the informer's underlying
    // reflector, and based on the error determines whether or not to stop the
    // informer.
    OnListError func(error) bool
}

// StopReason is a custom typed error that indicates how the informer was
// stopped.
type StopReason error
```

This could be plumbed through controller-runtime, allowing users to pass
informer `StopOptions` when they create a controller.
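
For example (a sketch against this hypothetical client-go API; `apierrors` is
k8s.io/apimachinery/pkg/api/errors and `meta` is
k8s.io/apimachinery/pkg/api/meta):

```
// Sketch: stop the informer when its resource is uninstalled from the cluster.
reason := informer.RunWithStopOptions(StopOptions{
    StopChannel: ctx.Done(),
    OnListError: func(err error) bool {
        // A NotFound / NoMatch error suggests the type no longer exists.
        return apierrors.IsNotFound(err) || meta.IsNoMatchError(err)
    },
})
log.Info("informer stopped", "reason", reason)
```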

Another potential change to `SharedInformer` is to have it be responsible for
tracking the count of event handlers and allowing consumers to remove
EventHandlers from a `SharedInformer`.

Similar to the `EventHandler` reference counting interface we discussed above,
the `SharedInformer` could add methods:
```
type SharedInformer interface {
    ...
    CountEventHandlers() int
    RemoveEventHandler(id) error
}
```
(where `id` is a to-be-determined identifier of the specific handler to be
removed).

This would remove the need to track it ourselves in controller-runtime.

TODO: Bring this design up (potentially with a proof-of-concept branch) at an
upcoming sig-api-machinery meeting to begin getting feedback and see whether it
is feasible to land the changes in client-go.

## Open Questions

### Multiple Restarts

The ability to start a controller more than once was recently removed over
concerns around data races from workqueue recreation and starting too many
workers ([#1139](https://github.com/kubernetes-sigs/controller-runtime/pull/1139)).

Previously there was never a valid reason to start a controller more than once;
now we want to support that. The simplest workaround is to expose a
`ResetStart()` method on the controller interface that simply resets the
`Started` flag on the controller. This allows restarts but doesn't address the
data race/excess worker concerns that drove the removal of this functionality
in the first place.
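
A minimal sketch of what that could look like on the internal controller (field
names assumed from the current internal implementation):

```
// ResetStart allows a stopped controller to be started again by clearing the
// Started flag. Sketch only; it deliberately does not touch the workqueue or
// worker bookkeeping discussed above.
func (c *Controller) ResetStart() {
    c.mu.Lock()
    defer c.mu.Unlock()
    c.Started = false
}
```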

### Persisting controller watches

The list of start watches stored on the controller currently gets cleared out
once the controller is started in order to prevent a memory leak
([#1163](https://github.com/kubernetes-sigs/controller-runtime/pull/1163)).

We need to retain references to these watches in order to restart controllers.
The simplest workaround is to maintain a boolean `SaveWatches` field on the
controller that determines whether or not the watches slice should be retained
after starting a controller. We can then expose a method on the controller for
setting this field and enabling watches to be saved. We will still have a memory
leak, but only on controllers that are expected to be restarted. There might be
a better way to clear out the watches list that fits with the restartable
controllers paradigm.
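
A sketch of how Start might honor this (assuming the internal controller's
`startWatches` slice):

```
// Sketch: inside Controller.Start, only drop the watch list when the
// controller is not expected to be restarted.
if !c.SaveWatches {
    // Retaining startWatches is what allows a later restart; clearing it
    // avoids the memory leak described in #1163.
    c.startWatches = nil
}
```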

### Multi-Cluster Support

A multi-cluster environment where a single controller operates across multiple
clusters will likely be a heavy user of these new features, because
multi-cluster setups frequently result in a situation where the
cluster administrator is different from the controller administrator.

We lack a consistent story around multi-cluster support, and introducing changes
such as these without fully thinking through the multi-cluster story might bind
us for future designs.

## Future Work / Use Cases

Were this to move forward, it unlocks a number of potential use-cases.

1. We can support a "metacontroller" or controller of controllers that could
   start and stop controllers based on the existence of their corresponding
   CRDs.

2. OPA/Gatekeeper can simplify its dynamic watch management by having greater
   control over the lifecycle of a controller. See [this
   doc](https://docs.google.com/document/d/1Wi3LM3sG6Qgfzm--bWb6R0SEKCkQCCt-ene6cO62FlM/edit).
## Timeline of Events

* 9/30/2020: Propose idea in design doc and proof-of-concept PR to
  controller-runtime.
* 10/7/2020: Design doc and proof-of-concept distilled to focus only on the
  minimal hooks needed rather than proposing the entire conditional controller
  solution. Alternatives added to discuss opportunities to push some of the
  implementation to the api-machinery level.
* 10/8/2020: Discuss idea in community meeting.
* 10/19/2020: Proposal updated to add EventHandler count tracking and client-go
  based alternative.
