
Commit ca7439e

Add event handler ref counting and client-go alternative
1 parent 712b72f commit ca7439e

File tree

1 file changed: +239 −112 lines changed

designs/conditional-controllers.md

Controller Lifecycle Management
==========================

## Summary

[…] control over which resources are installed).

## Goals

controller-runtime should support starting/stopping/restarting controllers and
their caches on arbitrary conditions.

## Non-Goals

* An implementation of starting/stopping a controller based on a specific
  condition (e.g. the existence of a CRD in discovery) does not need to be
  implemented in controller-runtime (but the end consumer of controller-runtime
  should be able to build it on their own).

* No further guarantees of multi-cluster support beyond what is already provided
  by controller-runtime.

## Proposal

The following proposal offers a solution for controller/cache restarting by:
1. Enabling the removal of individual informers.
2. Tracking the number of event handlers on an Informer.

This proposal focuses on solutions that are entirely contained in
controller-runtime. In the alternatives section, we discuss potential ways that
changes could be added to api-machinery code in core Kubernetes to enable a
possibly cleaner SharedInformer interface for accomplishing our goals.

A proof-of-concept exists at
[#1180](https://github.com/kubernetes-sigs/controller-runtime/pull/1180).

### Informer Removal

The starting point for this proposal is Shomron’s proposed implementation of
individual informer removal
([#936](https://github.com/kubernetes-sigs/controller-runtime/pull/936)).

It extends the informer cache to expose a remove method:

```
// Remove removes an informer specified by the obj argument from the cache and
// stops it if it existed.
Remove(ctx context.Context, obj runtime.Object) error
```

This gets plumbed through to the informers map, where we remove an informer's
entry from the informersByGVK map:

```
// Remove removes an informer entry and stops it if it was running.
func (ip *specificInformersMap) Remove(gvk schema.GroupVersionKind) {
	...
	entry, ok := ip.informersByGVK[gvk]
	if !ok {
		return
	}
	close(entry.stop)
	delete(ip.informersByGVK, gvk)
}
```

We support this by modifying the Start method to run each informer with
a stop channel that can be closed either individually
or globally (as opposed to just globally).

```
// Start calls Run on each of the informers...
func (ip *specificInformersMap) Start(ctx context.Context) {
	...
	for _, entry := range ip.informersByGVK {
		go entry.Start(ctx.Done())
	}
	...
}

// Start starts the informer managed by a MapEntry.
// Blocks until the informer stops. The informer can be stopped
// either individually (via the entry's stop channel) or globally
// via the provided stop argument.
func (e *MapEntry) Start(stop <-chan struct{}) {
	// Stop on either the whole map stopping or just this informer being removed.
	internalStop, cancel := anyOf(stop, e.stop)
	defer cancel()
	e.Informer.Run(internalStop)
}
```
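The snippet above relies on a helper that merges two stop channels into one. It is not defined in the proposal; the following is a minimal self-contained sketch of what it might look like (the `anyOf` name, two-channel signature, and cancel func are assumptions inferred from the snippet, not controller-runtime code):

```go
package main

import (
	"fmt"
	"sync"
)

// anyOf returns a channel that is closed as soon as either input channel is
// closed, along with a cancel func that releases the internal goroutine.
func anyOf(c1, c2 <-chan struct{}) (<-chan struct{}, func()) {
	out := make(chan struct{})
	cancelCh := make(chan struct{})
	go func() {
		defer close(out)
		select {
		case <-c1: // e.g. the global stop channel
		case <-c2: // e.g. this informer's individual stop channel
		case <-cancelCh: // the caller no longer needs the merged channel
		}
	}()
	var once sync.Once
	return out, func() { once.Do(func() { close(cancelCh) }) }
}

func main() {
	global := make(chan struct{})
	individual := make(chan struct{})
	merged, cancel := anyOf(global, individual)
	defer cancel()

	close(individual) // stopping just this informer...
	<-merged          // ...closes the merged stop channel
	fmt.Println("merged channel closed")
}
```

The cancel func matters: without it, an informer stopped via one channel would leave the merging goroutine blocked forever on the other.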

### EventHandler reference counting

The primary concern with adding individual informer removal is that it exposes
the risk of silently degrading controllers that are using the informer being
removed.

To mitigate this we need a way to track the number of EventHandlers added to a
given informer and ensure that none remain when an informer is removed.

The proposal here is to expand the informer interface used by the `MapEntry` in
the informers map:

```
// CountingInformer exposes a way to track the number of EventHandlers
// registered on an informer.
//
// The implementation of this should store a count that gets incremented every
// time AddEventHandler is called and decremented every time RemoveEventHandler
// is called.
type CountingInformer interface {
	cache.SharedIndexInformer
	CountEventHandlers() int
	RemoveEventHandler()
}
```
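A self-contained sketch of what the counting behavior could look like, using a hypothetical `handlerRegistrar` stand-in in place of the real client-go `SharedIndexInformer` (all names and the struct layout here are illustrative assumptions, not the proposed controller-runtime code):

```go
package main

import (
	"fmt"
	"sync"
)

// handlerRegistrar stands in for the slice of client-go's SharedIndexInformer
// that registers event handlers (an assumption for this sketch).
type handlerRegistrar interface {
	AddEventHandler(h func(obj interface{}))
}

// countingInformer wraps a handlerRegistrar and tracks how many EventHandlers
// are currently registered, as the CountingInformer interface requires.
type countingInformer struct {
	handlerRegistrar
	mu    sync.Mutex
	count int
}

func (c *countingInformer) AddEventHandler(h func(obj interface{})) {
	c.mu.Lock()
	c.count++
	c.mu.Unlock()
	c.handlerRegistrar.AddEventHandler(h)
}

// RemoveEventHandler only decrements the count; client-go offers no way to
// deregister the handler itself (hence the client-go alternative below).
func (c *countingInformer) RemoveEventHandler() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.count--
}

func (c *countingInformer) CountEventHandlers() int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.count
}

// noopRegistrar is a trivial inner informer stand-in for demonstration.
type noopRegistrar struct{}

func (noopRegistrar) AddEventHandler(func(obj interface{})) {}

func main() {
	ci := &countingInformer{handlerRegistrar: noopRegistrar{}}
	ci.AddEventHandler(func(interface{}) {})
	ci.AddEventHandler(func(interface{}) {})
	ci.RemoveEventHandler()
	fmt.Println(ci.CountEventHandlers()) // one handler still registered
}
```

With such a wrapper, the cache's Remove path can refuse (or warn) when `CountEventHandlers() > 0`.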

Now, we need a way to run a controller such that starting the controller
increments the EventHandler count on the informer and stopping the controller
decrements it.

We can modify/expand the `Source` interface such that we can call
`Start()` with a context that blocks on ctx.Done() and calls RemoveEventHandler
on the informer once the context is closed, because the controller has been
stopped and so the EventHandler no longer exists on the informer.

```
// StoppableSource expands the Source interface to add a start function that
// blocks on the context's Done channel, so that we know when the controller has
// been stopped and can remove/decrement the EventHandler count on the informer
// appropriately.
type StoppableSource interface {
	Source
	StartStoppable(context.Context, handler.EventHandler,
		workqueue.RateLimitingInterface, ...predicate.Predicate) error
}

// StartStoppable blocks for Start to finish and then calls RemoveEventHandler
// on the kind's informer.
func (ks *Kind) StartStoppable(ctx, handler, queue, prct) {
	informer := ks.cache.GetInformer(ctx, ks.Type)
	ks.Start(ctx, handler, queue, prct)
	<-ctx.Done()
	informer.RemoveEventHandler()
}
```

A simpler alternative to all of this is to bake the `EventHandler` tracking into
the underlying client-go `SharedIndexInformer`. This involves modifying
client-go and is discussed below in the alternatives section.

## Alternatives

A couple of alternatives to what we propose: one that offloads all the
responsibility to the consumer, requiring no changes to controller-runtime, and
one that moves the event handler reference counting upstream into client-go.

### Cache of Caches

One alternative is to create a separate cache per watch (i.e. a cache of caches)
as the end user consuming controller-runtime. The advantage is that it prevents
us from having to modify any code in controller-runtime. The main drawback is
that it's very clunky to maintain multiple caches (one for each informer) and
breaks the clean design of the cache.

### Stoppable Informers and EventHandler removal natively in client-go

A few changes to the underlying SharedInformer interface in client-go could save
us a lot of work in controller-runtime.

One is to add a second `Run` method on the `SharedInformer` interface, such as:

```
// RunWithStopOptions runs the informer and provides options to be checked that
// would indicate under what conditions the informer should stop.
RunWithStopOptions(stopOptions StopOptions) StopReason

// StopOptions let the caller specify the conditions under which to stop the
// informer.
type StopOptions struct {
	// StopChannel stops the Informer when it receives a close signal.
	StopChannel <-chan struct{}

	// OnListError inspects errors returned from the informer's underlying
	// reflector, and based on the error determines whether or not to stop the
	// informer.
	OnListError func(error) bool
}

// StopReason is a custom typed error that indicates how the informer was
// stopped.
type StopReason error
```

This could be plumbed through controller-runtime, allowing users to pass
informer `StopOptions` when they create a controller.

Another potential change to `SharedInformer` is to have it be responsible for
tracking the count of event handlers and allowing consumers to remove
EventHandlers from a `SharedInformer`.

Similar to the `EventHandler` reference counting interface we discussed above,
the `SharedInformer` could add methods:

```
type SharedInformer interface {
	...
	CountEventHandlers() int
	RemoveEventHandler(id) error
}
```
(where `id` is a to-be-determined identifier of the specific handler to be
removed).

This would remove the need to track the count ourselves in controller-runtime.

TODO: Bring this design up (potentially with a proof-of-concept branch) at an
upcoming sig-api-machinery meeting to begin getting feedback and see whether it
is feasible to land the changes in client-go.

## Open Questions

### Multiple Restarts

The ability to start a controller more than once was recently removed over
concerns around data races from workqueue recreation and starting too many
workers ([#1139](https://github.com/kubernetes-sigs/controller-runtime/pull/1139)).

Previously there was never a valid reason to start a controller more than once;
now we want to support that. The simplest workaround is to expose a
`ResetStart()` method on the controller interface that simply resets the
`Started` field on the controller. This allows restarts but doesn't address the
initial data race/excess worker concerns that drove the removal of this
functionality in the first place.
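A toy sketch of the `ResetStart()` workaround (field and method names follow the text above; the real internal controller carries far more state, so this is only the shape of the idea):

```go
package main

import (
	"errors"
	"fmt"
)

// controller is a toy stand-in for controller-runtime's internal controller.
type controller struct {
	Started bool
}

// Start refuses to run twice, as the real controller does after #1139.
func (c *controller) Start() error {
	if c.Started {
		return errors.New("controller was started more than once")
	}
	c.Started = true
	return nil
}

// ResetStart clears Started so that Start can be called again. Note it does
// nothing about the workqueue/worker concerns that motivated the restriction.
func (c *controller) ResetStart() {
	c.Started = false
}

func main() {
	c := &controller{}
	fmt.Println(c.Start()) // first start succeeds
	fmt.Println(c.Start()) // second start is rejected
	c.ResetStart()
	fmt.Println(c.Start()) // allowed again after the reset
}
```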

### Persisting controller watches

The list of watches stored on the controller (`startWatches`) currently gets
cleared out once the controller is started in order to prevent a memory leak
([#1163](https://github.com/kubernetes-sigs/controller-runtime/pull/1163)).

We need to retain references to these watches in order to restart controllers.
The simplest workaround is to maintain a boolean `SaveWatches` field on the
controller that determines whether or not the watches slice should be retained
after starting a controller. We can then expose a method on the controller for
setting this field and enabling watches to be saved. We will still have a memory
leak, but only on controllers that are expected to be restarted. There might be
a better way to clear out the watches list that fits with the restartable
controllers paradigm.

### Multi-Cluster Support

A multi-cluster environment where a single controller operates across multiple
clusters will likely be a heavy user of these new features, because
multi-cluster setups frequently result in a situation where the
cluster administrator is different from the controller administrator.

We lack a consistent story around multi-cluster support, and introducing changes
such as these without fully thinking through the multi-cluster story might bind
us for future designs.

## Future Work / Use Cases

Were this to move forward, it unlocks a number of potential use-cases.

1. We can support a "metacontroller" or controller of controllers that could
   start and stop controllers based on the existence of their corresponding
   CRDs.

2. OPA/Gatekeeper can simplify its dynamic watch management by having greater
   control over the lifecycle of a controller. See [this
   doc](https://docs.google.com/document/d/1Wi3LM3sG6Qgfzm--bWb6R0SEKCkQCCt-ene6cO62FlM/edit).

## Timeline of Events

* 9/30/2020: Propose idea in design doc and proof-of-concept PR to
  controller-runtime.
* 10/7/2020: Design doc and proof-of-concept distilled to focus only on the
  minimal hooks needed rather than proposing an entire conditional controller
  solution. Alternatives added to discuss opportunities to push some of the
  implementation to the api-machinery level.
* 10/8/2020: Discuss idea in community meeting.
* 10/19/2020: Proposal updated to add EventHandler count tracking and client-go
  based alternative.
