- Conditionally Runnable Controllers
+ Informer Removal / Controller Lifecycle Management
==========================

## Summary

- Enable controller managers to successfully operate when the CRD the controller
- is configured to watch does not exist.
-
- Successful operation of a controller manager includes:
- * Starts and runs without error when a CRD the controller watches does not exist.
- * Begins watching the CRD once it is installed.
- * Unregisters (stops watching) once a CRD is uninstalled.
+ Enable fine-grained control over the lifecycle of a controller, including the
+ ability to start/stop/restart controllers and their caches, by exposing a way to
+ remove individual informers from the cache and working around restrictions that
+ currently prevent controllers from starting multiple times.

## Background/Motivation

- Usually there is a 1:1 relationship between controller and resource and a 1:1
- relationship between controller and k8s cluster. When this is the case it is
- fine to assume that for a controller to run successfully, the resource it
- controls must be installed on the cluster that the controller is watching.
-
- We are now seeing use cases where a single controller is responsible for
- multiple resources across multiple clusters. This creates situations where
- controllers need to successfully run even when the resource is not installed,
- and need to proceed successfully as they begin/terminate their watch on a resource
- when the resource is installed/uninstalled.
+ Currently, the user does not have much control over the lifecycle of a
+ controller. The user can add controllers to the manager and add informers to the
+ cache, but there is no way to remove either of these.

- The current approach is to check the discovery doc for a CRD's existence prior to
- adding the controller to the manager. This has its limitations, as complexity
- increases greatly for users who need to manage multiple controllers for a
- mixture of CRDs that might not always be installed on the cluster or are
- installed in a different order.
+ Additionally, there are restrictions that prevent a controller from running
+ multiple times, such as clearing the watches slice for a controller after it
+ has started.

- While there is a lot of groundwork needed to fully support multi-cluster
- controllers, this proposal offers incremental steps to supporting one specific use case
- (conditionally runnable controllers), of which hopefully some minimal form can be agreed
- upon before needing a complete multi-cluster story.
+ The effect of this is that users have no clean way of restarting controllers
+ after they have stopped them. This would be useful for a number of use-cases
+ where controllers have little control over the installation or
+ uninstallation of the resources that these controllers are responsible
+ for watching.

## Goals

- (In order from most necessary to least)
- 1. A mechanism for starting/stopping/restarting controllers and their caches.
- 2. A solution for running a controller conditionally, such that it automatically
- starts/stops/restarts upon installation/uninstallation/reinstallation of its
- respective CRD in the cluster.
- 3. An easy-to-use mechanism for solving goal #2 without the end user needing to
- understand too much.
+ An implementation of the minimally viable hooks needed in controller-runtime to
+ enable users to start, stop, and restart controllers and their caches.

## Non-Goals

- TODO
+ A complete and ergonomic solution for automatically starting/stopping/restarting
+ controllers upon arbitrary conditions.

## Proposal

- The following proposal attempts to address the three goals in four separate
- steps.
+ The following proposal offers a solution for controller/cache restarting by:
+ 1. Enabling the removal of individual informers.
+ 2. Publicly exposing the informer removal and adding hooks into the internal
+ controller implementation to allow for restarting controllers that have been
+ stopped.

- The first goal of enabling the possibility of starting/stopping/restarting
- controllers is addressed in two parts, first where a mechanism is built to
- enable the removal of informers, and second where we expose this mechanism and
- enable the restarting of controllers.
+ This proposal focuses on solutions that are entirely contained in
+ controller-runtime. In the alternatives section, we discuss potential ways that
+ changes can be added to api-machinery code in core kubernetes to enable a
+ possibly cleaner way of accomplishing our goals.

- The second goal of a solution for running controllers conditionally is addressed
- by creating a wrapper around a controller called a ConditionalController that,
- within its Start() function, polls the discovery doc for the existence of the
- watched object and starts/stops/restarts the underlying controller as necessary.
+ ### Informer Removal

- The third goal of an ergonomic mechanism to use ConditionalControllers is a
- small modification to the controller builder to enable running a controller as a
- ConditionalController.
+ The starting point for this proposal is Shomron’s proposed implementation of
+ individual informer removal,
+ [#936](https://github.com/kubernetes-sigs/controller-runtime/pull/936).

- ###### Proof of concept
- A proof of concept PR exists at
- [#1180](https://github.com/kubernetes-sigs/controller-runtime/pull/1180)
+ Risks/mitigations and alternatives are discussed in the linked PR as well as in the
+ corresponding issue,
+ [#935](https://github.com/kubernetes-sigs/controller-runtime/issues/935). A
+ summary of these discussions is presented below.
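
As a rough illustration of the shape such an API could take (the names and signatures below are assumptions made for this document, not the interface proposed in #936), the cache could expose removal alongside its existing informer accessors:

```go
// Hypothetical sketch only: RemovableCache does not exist in
// controller-runtime; it shows what a cache that supports informer removal
// might look like from the caller's side.
package sketch

import (
	"context"

	"k8s.io/apimachinery/pkg/runtime"
	"sigs.k8s.io/controller-runtime/pkg/cache"
)

// RemovableCache is the existing cache behavior plus one extra method.
type RemovableCache interface {
	cache.Cache

	// Remove stops the informer backing the given object's GroupVersionKind
	// and drops it from the cache, so that a later GetInformer call would
	// construct a fresh informer.
	Remove(ctx context.Context, obj runtime.Object) error
}
```

A controller that is shutting down (for example because its CRD was uninstalled) could call `Remove` for the objects it watches, and a later restart would rebuild the informer on demand.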

- Each commit maps to one step in the proposal and can loosely be considered going
- from most necessary to least.
+ #### Risks and Mitigations

- ### Informer Removal [Pre-req]
+ * Controllers will silently degrade if the informer backing their watch is
+ removed. This is largely mitigated by the fact that the controller responsible
+ for removing the informer is the one impacted by the informer's removal, so the
+ degradation is expected. If this is insufficient
+ for all cases, a potential mitigation is to implement reference counting in
+ controller-runtime such that an informer is aware of any and all outstanding
+ references when its removal is requested (a sketch of this idea follows this list).

- This proposal assumes a mechanism exists for removing individual informers. We
- are agnostic to how this is done, but the current proposal is built on top of
- Shomron’s proposed implementation of individual informer removal
- [#936](https://github.com/kubernetes-sigs/controller-runtime/pull/936).
+ * Safety of stopping individual informers. There is concern that stopping
+ individual informers will leak goroutines or memory. We should be able to use
+ pprof tooling and existing leak tooling in controller-runtime to identify and
+ mitigate any leaks.
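
A minimal sketch of the reference-counting mitigation mentioned in the first bullet (purely illustrative; no such type exists in controller-runtime): the cache would only stop an informer once every consumer that acquired it has released it.

```go
package sketch

import "sync"

// refCountedInformer is an illustrative wrapper that tracks how many
// consumers (watches) still depend on an informer before it may be stopped.
type refCountedInformer struct {
	mu   sync.Mutex
	refs int
	stop func() // stops the underlying shared informer
}

// Acquire records one more consumer of the informer.
func (r *refCountedInformer) Acquire() {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.refs++
}

// Release drops one consumer and only stops the informer once the last
// consumer is gone, so other controllers sharing it are not silently degraded.
func (r *refCountedInformer) Release() {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.refs > 0 {
		r.refs--
	}
	if r.refs == 0 && r.stop != nil {
		r.stop()
		r.stop = nil
	}
}
```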

+ #### Alternatives

- Risks/mitigations and alternatives are discussed in the linked PR as well as the
- corresponding issue
- [#935](https://github.com/kubernetes-sigs/controller-runtime/issues/935).
+ * Creating a cache per watch (i.e. a cache of caches) as the end user. The advantage
+ of this is that it avoids having to modify any code in controller-runtime.
+ The main drawback is that it is very clunky to maintain multiple caches (one
+ for each informer) and it breaks the clean design of the cache (a sketch of what
+ this looks like from the user's side follows this list).
+
+ * Adding support to de-register EventHandlers from informers in apimachinery.
+ This, along with ref counting, would be the cleanest way to free us of the concern
+ of controllers silently failing when their watch is removed.
+ The downside is that we are ultimately at the mercy of whether apimachinery
+ wants to support these changes, and even if they were on board, it could take
+ a long time to land these changes upstream.

- We are currently looking into whether that PR can be distilled further into a more
- minimal solution.
+ * TODO: Bubbling up errors from apimachinery.
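
To make the cache-per-watch alternative concrete, here is a rough sketch of what it looks like from the end user's side (names and wiring are illustrative, and signatures differ between controller-runtime versions):

```go
// Illustrative only: the "cache of caches" alternative, where the user keeps a
// standalone cache per watched type instead of removing informers from the
// shared cache. Treat this as a sketch, not working production code.
package sketch

import (
	appsv1 "k8s.io/api/apps/v1"
	"sigs.k8s.io/controller-runtime/pkg/cache"
	"sigs.k8s.io/controller-runtime/pkg/controller"
	"sigs.k8s.io/controller-runtime/pkg/handler"
	"sigs.k8s.io/controller-runtime/pkg/manager"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
	"sigs.k8s.io/controller-runtime/pkg/source"
)

func setupWithStandaloneCache(mgr manager.Manager, r reconcile.Reconciler) error {
	// A dedicated cache for just this watch, separate from the manager's
	// shared cache, so it can be started and thrown away independently.
	standalone, err := cache.New(mgr.GetConfig(), cache.Options{
		Scheme: mgr.GetScheme(),
		Mapper: mgr.GetRESTMapper(),
	})
	if err != nil {
		return err
	}
	// The user now owns this cache's lifecycle; here it is simply handed to
	// the manager, but stopping/recreating it later requires bespoke plumbing.
	if err := mgr.Add(standalone); err != nil {
		return err
	}

	c, err := controller.New("deployment-controller", mgr, controller.Options{Reconciler: r})
	if err != nil {
		return err
	}
	// Watch through the standalone cache rather than the shared one.
	return c.Watch(
		source.NewKindWithCache(&appsv1.Deployment{}, standalone),
		&handler.EnqueueRequestForObject{},
	)
}
```

Multiplying this across many dynamically added types is where the approach becomes clunky: every watch needs its own cache, its own lifecycle management, and its own bookkeeping outside of controller-runtime.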

- ### Minimal hooks
+
+ ### Minimal hooks needed to use informer removal externally

Given that a mechanism exists to remove individual informers, the next step is
to expose this removal functionality and enable safely
@@ -105,105 +105,56 @@ controller is stopped.

#### Risks and Mitigations

+ * We lack a consistent story around multi-cluster support, and introducing
+ changes such as this without fully thinking through the multi-cluster story
+ might bind us for future designs. We think that restarting
+ controllers is a valid use-case even for a single cluster, regardless of the
+ multi-cluster use case.
+
* [#1139](https://github.com/kubernetes-sigs/controller-runtime/pull/1139) discusses why
the ability to start a controller more than once was taken away. It's a little
unclear what effect explicitly enabling multiple starts in the case of
- conditional controllers will hae on the number of workers and workqueues
+ conditional controllers will have on the number of workers and workqueues
relative to expectations and metrics.
+
* [#1163](https://github.com/kubernetes-sigs/controller-runtime/pull/1163) discusses the
memory leak caused by not clearing out the watches internal slice. A possible
mitigation is to clear out the slices upon ConditionalController shutdown.

#### Alternatives

* A metacontroller or CRD controller could start and stop controllers based on
- the existence of their corresponding CRDs. This requires no changes to be made to
- controller-runtime. It does put the complexity of designing such a controller
+ the existence of their corresponding CRDs. This puts the complexity of designing such a controller
onto the end user, but there are potentially ways to provide end users with
- default, pluggable CRD controllers.
+ default, pluggable CRD controllers. More importantly, this is probably not even
+ sufficient for enabling controller restarting, because informers are shared
+ between all controllers, so restarting the controller will still try to use the
+ informer that is erroring out if the CRD it is watching goes away. Some hook
+ into removing informers is still required in order to use a metacontroller.
+
* Instead of exposing ResetStart and SaveWatches on the internal controller struct,
it might be better to expose them on the controller interface. This is more public
and creates more potential for abuse, but it prevents some hacky solutions
discussed below around needing to cast to the internal controller or create
extra interfaces (a sketch of how such hooks might be consumed follows this list).
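
For illustration, the hooks discussed above might be consumed roughly as follows (the interface and signatures are assumptions based on this proposal, not an existing controller-runtime API):

```go
package sketch

import "context"

// RestartableController is a hypothetical view of a controller that exposes
// the hooks this proposal discusses. Start is shown in its context-based
// form; older controller-runtime versions use a stop channel instead.
type RestartableController interface {
	Start(ctx context.Context) error
	// SaveWatches tells the controller to keep its watches slice when it
	// stops, instead of clearing it, so they can be replayed on restart.
	SaveWatches(save bool)
	// ResetStart clears the controller's Started flag so Start may be called again.
	ResetStart()
}

// restart prepares a previously stopped controller to run again and starts it
// with a fresh context; the returned cancel func stops it once more.
func restart(parent context.Context, c RestartableController) context.CancelFunc {
	c.SaveWatches(true) // keep watches across stops
	c.ResetStart()

	runCtx, cancel := context.WithCancel(parent)
	go func() {
		// Errors are dropped here purely to keep the sketch short.
		_ = c.Start(runCtx)
	}()
	return cancel
}
```

Before restarting, the caller would also remove the relevant informers from the cache (using the removal mechanism above) so the restarted controller does not reattach to an informer that is failing because its CRD was uninstalled.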

- ### Conditional Controllers
-
- With the minimal hooks needed to start/stop/restart controllers and their caches
- in place, the next step is to provide a wrapper controller around a traditional
- controller that starts/stops the underlying controller based on the existence of
- the CRD under watch.
-
- The proposal to do this:
- 1. `ConditionalController` that implements controller and, within its Start()
- method:
- 2. Polls the discovery doc every configurable amount of time and recognizes when
- the CRD has been installed/uninstalled.
- 3. Upon installation it merges the caller’s stop channel with a local stop channel
- and runs the underlying controller with the merged stop channel such that both
- the caller and this conditional controller can stop it.
- 4. Upon uninstallation it sets the controller's `Started` field to false so that it
- can be restarted, indicates that the controller should save its watches upon
- stopping, and then stops the controller via the local stop channel and removes
- the cache for the object under watch.
-
- #### Risks and Mitigations
-
- * With the above minimal hooks exposing `ResetStart()` and `SaveWatches()` on the
- internal controller, this creates the need to create a hacky intermediate
- interface (StoppableController) for testing and to avoid casting to the internal
- controller. The simplest solution is just to expose these on the controller
- interface (see above alternative), but there might be a better way.
+ ## Future Work / Motivating Use-Cases

- #### Alternatives
-
- * A more general solution could allow for conditional runnables of more than just
- controllers, where the user could supply the conditional check on their own,
- such that it’s not just limited to looking at the existence of a CRD. (But we
- believe that the common case will be a controller watching CRD install and thus
- there should still be a simple way to do this.)
- * Nit: it's unclear if the controller package is the best place for this to live.
- Originally it was thought that controllerutil might be best, but that creates an
- import cycle.
-
- ### Builder Ergonomics
-
- Since we provide simple ergonomics for creating controllers (builder), it would
- make sense to do the same for any conditional controller utility we create.
-
- The current proposal (link) is to:
- 1. Provide a forInput boolean option that the user can set to enable conditionally
- running and an option to configure the wait time that the ConditionalController
- will wait between attempts to poll the discovery doc.
- 2. In the builder’s doController function, it will first create an unmanaged
- controller.
- 3. If the conditionally run option is set, it will wrap the unmanaged controller in
- a conditional controller.
- 4. Add the resulting controller to the manager.
+ Were this to move forward, it would unlock a number of potential use-cases.

- #### Risks and Mitigations
+ 1. OPA/Gatekeeper can simplify its dynamic watch management by having greater
+ control over the lifecycle of a controller. See [this
+ doc](https://docs.google.com/document/d/1Wi3LM3sG6Qgfzm--bWb6R0SEKCkQCCt-ene6cO62FlM/edit).

- TODO
-
- #### Alternatives
- * A separate builder specifically for the conditional controller would prevent us
- from having to make changes to the current builder.
- * As noted above, a more general conditional runnable will probably require its own
- builder.
-
- ## Acceptance Criteria
-
- Ultimately, the end user must have some successful solution that involves
- starting with a controller for a CRD that can:
- 1. Run the mgr without the CRD installed yet, and the mgr should start successfully.
- 2. After CRD installation, the controller should start and run successfully.
- 3. Upon uninstalling the CRD, the controller should stop but the manager should
- continue running successfully.
- 4. Upon CRD reinstallation, the controller should once again start and run
- successfully with no disruption to the manager and its other runnables.
+ 2. We can support controllers that can conditionally start/stop/restart based on
+ the installation/uninstallation of their CRD. See [this proof-of-concept branch](https://github.com/kevindelgado/controller-runtime/tree/experimental/conditional-runnables)
+ and the sketch below.
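
As a sketch of the second use case (illustrative only; the proof-of-concept branch linked above structures this differently), the condition itself can be as simple as a discovery check that a wrapper runnable polls on a timer, starting or stopping the wrapped controller on transitions:

```go
package sketch

import (
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/client-go/discovery"
	"k8s.io/client-go/rest"
)

// crdInstalled reports whether a kind is currently served under the given
// group/version by consulting the discovery API. A wrapper runnable would
// poll this and start/stop/restart its controller when the answer changes.
func crdInstalled(cfg *rest.Config, groupVersion, kind string) (bool, error) {
	dc, err := discovery.NewDiscoveryClientForConfig(cfg)
	if err != nil {
		return false, err
	}
	resources, err := dc.ServerResourcesForGroupVersion(groupVersion)
	if apierrors.IsNotFound(err) {
		// The group/version is not served at all, so the CRD is not installed.
		return false, nil
	}
	if err != nil {
		return false, err
	}
	for _, r := range resources.APIResources {
		if r.Kind == kind {
			return true, nil
		}
	}
	return false, nil
}
```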

## Timeline of Events

* 9/30/2020: Propose idea in design doc and proof-of-concept PR to
controller-runtime
+ * 10/7/2020: Design doc and proof-of-concept distilled to focus only on minimal
+ hooks needed rather than proposing entire conditional controller solution.
+ Alternatives added to discuss opportunities to push some of the implementation
+ to the api-machinery level.
* 10/8/2020: Discuss idea in community meeting