Skip to content

Commit 78b926b

Browse files
committed
Add surfacing watch error handler alternative
1 parent ff1af8c commit 78b926b

File tree

1 file changed

+28
-14
lines changed

1 file changed

+28
-14
lines changed

designs/conditional-controllers.md

Lines changed: 28 additions & 14 deletions
Original file line numberDiff line numberDiff line change
@@ -16,18 +16,19 @@ cache, but there is no way to remove either of these.
1616

1717
Additionally there are restrictions that prevent a controller from running
1818
multiple times such as the clearing the watches slice for a controller after it
19-
has started.
19+
has started. The effect of this is that users have no clean way of restarting controllers
20+
after they have stopped them.
2021

21-
The effect of this is that users have no clean way of restarting controllers
22-
after they have stopped them. This would be useful for a number of use-cases
23-
around when controllers have little control over the installation or
24-
uninstallation on the of the resources that these controllers are responsible
25-
for watching.
22+
Greater detail is given in the "Future Work / Use Cases" section, but a summary
23+
of the motivating use-cases for the proposal is anytime a controller's watched resource
24+
can be installed/uninstalled unexpectedly or whenever the administrator of a
25+
controller is different from the administrator of the cluster (and thus has no
26+
control over which resources are installed).
2627

2728
## Goals
2829

2930
An implementation of the minimally viable hooks needed in controller-runtime to
30-
enable users to start, stop, and restart controllers and their caches.
31+
enable controller adminstrators to start, stop, and restart controllers and their caches.
3132

3233
## Non-Goals
3334

@@ -68,7 +69,7 @@ summarization of these discussions are presented below.
6869
controller-runtime such that an informer is aware of any and all outstanding
6970
references when its removal is called.
7071

71-
* Safety of stopping individual informers. There is concern that stopping
72+
* The safety of stopping individual informers is unclear. There is concern that stopping
7273
individual informers will leak go routines or memory. We should be able to use
7374
pprof tooling and exisiting leak tooling in controller-runtime to identify and
7475
mitigate any leaks
@@ -81,14 +82,22 @@ summarization of these discussions are presented below.
8182
for each informer) and breaks the clean design of the cache.
8283

8384
* Adding support to de-register EventHandlers from informers in apimachinery.
84-
This along with ref counting would be cleanest way to free us of the concern
85+
This along with ref counting would be the cleanest way to free us of the concern
8586
of controllers silently failing when their watch is removed.
8687
The downside is that we are ultimately at the mercy of whether apimachinery
8788
wants to support these changes, and even if they were on board, it could take
8889
a long time to land these changes upstream.
8990

90-
* TODO: Bubbling up errors from apimachinery.
91-
91+
* Surfacing the watch error handler from the reflector in client-go through the
92+
informer. Controller-runtimer could then look for specific errors and decide how to
93+
handle it from there, such as terminating the informer if a specific error was
94+
thrown indicating that the informer will no longer be viable (for example when
95+
the resource is uninstalled). The advantage is that we'd be pushing updates
96+
from the informer to the manager when errors arise (such as when the resource
97+
disappears) and this would lead to more responsive informer shutdown that doesn't
98+
require a separate watch mechanism to determine whether to remove informers. Like
99+
de-registering EventHandlers, the downside is that we would need api-machinery to
100+
support these changes and might take a long time to coordinate and implement.
92101

93102
### Minimal hooks needed to use informer removal externally
94103

@@ -103,13 +112,18 @@ The proposal to do this is:
103112
or not and use this field to not empty the controller’s `startWatches` when the
104113
controller is stopped.
105114

115+
A proof of concept for PR is here at
116+
[#1180](https://github.com/kubernetes-sigs/controller-runtime/pull/1180)
117+
106118
#### Risks and Mitigations
107119

108120
* We lack a consistent story around multi-cluster support and introducing
109121
changes such as this without fully thinking through the multi-cluster story
110-
might bind us for future designs. We think that restarting
111-
controllers is a valid use-case even for single cluster regardless of the
112-
multi-cluster use case.
122+
might bind us for future designs. We think gracefully handling degraded
123+
functionality in informers we start as end users modify the cluster is a valid
124+
use case that exists whenever the cluster administrator is different from the
125+
controller administrator and should be handled irregardless of its application
126+
in multi-cluster envrionments.
113127

114128
* [#1139](https://github.com/kubernetes-sigs/controller-runtime/pull/1139) discusses why
115129
the ability to start a controller more than once was taken away. It's a little

0 commit comments

Comments
 (0)