@@ -16,18 +16,19 @@ cache, but there is no way to remove either of these.
Additionally there are restrictions that prevent a controller from running
multiple times, such as the clearing of the watches slice for a controller after it
- has started.
+ has started. The effect of this is that users have no clean way of restarting controllers
+ after they have stopped them.

- The effect of this is that users have no clean way of restarting controllers
- after they have stopped them. This would be useful for a number of use-cases
- around when controllers have little control over the installation or
- uninstallation on the of the resources that these controllers are responsible
- for watching .
+ Greater detail is given in the "Future Work / Use Cases" section, but in summary,
+ the motivating use-cases for this proposal arise any time a controller's watched
+ resource can be installed or uninstalled unexpectedly, or whenever the administrator
+ of a controller is different from the administrator of the cluster (and thus has
+ no control over which resources are installed).

## Goals

An implementation of the minimally viable hooks needed in controller-runtime to
- enable users to start, stop, and restart controllers and their caches.
+ enable controller administrators to start, stop, and restart controllers and their caches.

## Non-Goals
@@ -68,7 +69,7 @@ summarization of these discussions are presented below.
controller-runtime such that an informer is aware of any and all outstanding
references when its removal is called.

- * Safety of stopping individual informers. There is concern that stopping
+ * The safety of stopping individual informers is unclear. There is concern that stopping
individual informers will leak go routines or memory. We should be able to use
pprof tooling and existing leak tooling in controller-runtime to identify and
mitigate any leaks
@@ -81,14 +82,22 @@ summarization of these discussions are presented below.
for each informer) and breaks the clean design of the cache.

* Adding support to de-register EventHandlers from informers in apimachinery.
- This along with ref counting would be cleanest way to free us of the concern
+ This along with ref counting would be the cleanest way to free us of the concern
of controllers silently failing when their watch is removed.
The downside is that we are ultimately at the mercy of whether apimachinery
wants to support these changes, and even if they were on board, it could take
a long time to land these changes upstream.

- * TODO: Bubbling up errors from apimachinery.
-
+ * Surfacing the watch error handler from the reflector in client-go through the
+ informer. Controller-runtime could then look for specific errors and decide how to
+ handle them from there, such as terminating the informer if a specific error was
+ thrown indicating that the informer will no longer be viable (for example when
+ the resource is uninstalled). The advantage is that we'd be pushing updates
+ from the informer to the manager when errors arise (such as when the resource
+ disappears), and this would lead to more responsive informer shutdown that doesn't
+ require a separate watch mechanism to determine whether to remove informers. Like
+ de-registering EventHandlers, the downside is that we would need apimachinery to
+ support these changes, and it might take a long time to coordinate and implement them.
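
To make this last alternative concrete: recent client-go versions already expose the relevant hook at the informer level via `SharedInformer.SetWatchErrorHandler`, which the reflector calls whenever its list/watch fails. The sketch below is illustrative only; in particular, treating a NotFound error as "this informer is no longer viable" and simply closing a stop channel are assumptions of the sketch, not existing controller-runtime behaviour, and the open question in this proposal is how such a signal would be plumbed up through the cache to the manager.

```go
package main

import (
	"fmt"
	"sync"
	"time"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/fields"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// Pods are used here only so the example compiles against core types; the
	// interesting case is a custom resource whose CRD can be uninstalled.
	lw := cache.NewListWatchFromClient(
		clientset.CoreV1().RESTClient(), "pods", metav1.NamespaceAll, fields.Everything())
	informer := cache.NewSharedIndexInformer(lw, &corev1.Pod{}, 10*time.Minute, cache.Indexers{})

	stop := make(chan struct{})
	var once sync.Once

	// The reflector calls this handler every time its list/watch fails. If the
	// error indicates the resource type itself is gone, retrying forever is
	// pointless, so this sketch shuts the informer down instead.
	if err := informer.SetWatchErrorHandler(func(_ *cache.Reflector, err error) {
		if apierrors.IsNotFound(err) {
			fmt.Println("watched resource no longer served; stopping informer:", err)
			once.Do(func() { close(stop) })
		}
	}); err != nil {
		panic(err) // the handler must be set before the informer is started
	}

	informer.Run(stop) // returns once stop is closed by the handler
}
```

Client-go's default behaviour is to log the failure and keep retrying, which is why an informer whose resource disappears currently just goes quiet; deciding where controller-runtime would install a handler like the one above, and how it would react, is exactly what this alternative would need to work out.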

### Minimal hooks needed to use informer removal externally
@@ -103,13 +112,18 @@ The proposal to do this is:
or not and use this field to not empty the controller’s `startWatches` when the
controller is stopped.

+ A proof of concept PR is available at
+ [#1180](https://github.com/kubernetes-sigs/controller-runtime/pull/1180).
+
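
As a reading aid, here is a standalone toy model of that idea. None of this is controller-runtime code: `watchDescription`, the `saveWatches` field, and the `Stop` method are hypothetical stand-ins for the internals the proof-of-concept PR touches, used only to show why keeping `startWatches` around makes a controller restartable.

```go
package main

import "fmt"

// watchDescription stands in for the source/handler/predicates tuple that a
// controller records for every call to Watch.
type watchDescription struct{ name string }

// controller is a toy model of controller-runtime's internal controller, just
// detailed enough to show the effect of the proposed field.
type controller struct {
	// saveWatches is the hypothetical new field: when true, Stop leaves
	// startWatches intact so that a later Start can re-establish the watches.
	saveWatches  bool
	startWatches []watchDescription
}

func (c *controller) Watch(w watchDescription) {
	c.startWatches = append(c.startWatches, w)
}

func (c *controller) Start() {
	for _, w := range c.startWatches {
		fmt.Println("starting watch:", w.name) // real code would start sources/informers here
	}
}

func (c *controller) Stop() {
	// Today the recorded watches are dropped unconditionally, which is what makes
	// a second Start useless; the proposal gates that clearing on saveWatches.
	if !c.saveWatches {
		c.startWatches = nil
	}
}

func main() {
	c := &controller{saveWatches: true}
	c.Watch(watchDescription{name: "memcacheds.cache.example.com"})

	c.Start() // establishes the watch
	c.Stop()  // e.g. the watched CRD was uninstalled, so the controller is stopped
	c.Start() // the saved watches come back without calling Watch again
}
```

With `saveWatches` left false, this model reproduces today's behaviour: the second `Start` has nothing left to watch.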

#### Risks and Mitigations

* We lack a consistent story around multi-cluster support and introducing
changes such as this without fully thinking through the multi-cluster story
- might bind us for future designs. We think that restarting
- controllers is a valid use-case even for single cluster regardless of the
- multi-cluster use case.
+ might bind us for future designs. We think that gracefully handling degraded
+ functionality in informers we start, as end users modify the cluster, is a valid
+ use case that exists whenever the cluster administrator is different from the
+ controller administrator, and that it should be handled regardless of its
+ application in multi-cluster environments.

* [#1139](https://github.com/kubernetes-sigs/controller-runtime/pull/1139) discusses why
the ability to start a controller more than once was taken away. It's a little