
🌱 Controller Lifecycle Management design doc #1192


Contributor

@kevindelgado kevindelgado commented Oct 1, 2020

Design doc for finer grained controller lifecycle management.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Oct 1, 2020
@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Oct 1, 2020
@k8s-ci-robot
Contributor

Hi @kevindelgado. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Oct 1, 2020
@kevindelgado
Contributor Author

/assign @alvaroaleman
/assign @DirectXMan12

#### Alternatives

* A metacontroller or CRD controller could start and stop controllers based on
the existence of their corresponding CRDs. This requires no changes to be made to
Contributor

I think the "meta controller" strategy still requires minimal hooks.

@kevindelgado kevindelgado changed the title 🌱 Initial ConditionalControllers design doc 🌱 Controller Lifecycle Management design doc Oct 7, 2020
@kevindelgado kevindelgado force-pushed the design/conditional-controllers branch from 7280a4e to 78b926b Compare October 8, 2020 00:19
@coderanger coderanger mentioned this pull request Oct 8, 2020
@kevindelgado kevindelgado force-pushed the design/conditional-controllers branch from 78b926b to 712b72f Compare October 8, 2020 15:57
* A metacontroller or CRD controller could start and stop controllers based on
the existence of their corresponding CRDs. This puts the complexity of designing such a controller
onto the end user, but there are potentially ways to provide end users with
default, pluggable CRD controllers. More importantly, this probably is not even
Member

I think the informer removal doesn't really have anything to do with whether a metacontroller is used or not; it's a requirement regardless of how the add/removal functionality is implemented.

The reason I personally like the metacontroller is that it is event-based, rather than doing polling.
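
For illustration only, here is a minimal sketch of what "event-based" could look like, assuming a stream of CRD install/uninstall events is available. `EventType`, `CRDEvent`, and `React` are hypothetical names invented for this sketch; they are not controller-runtime or client-go APIs.

```go
package main

import "fmt"

// EventType and CRDEvent are hypothetical types for this sketch only.
type EventType int

const (
	Installed EventType = iota
	Uninstalled
)

type CRDEvent struct {
	Kind string
	Type EventType
}

// React consumes CRD events until the channel is closed. It calls start the
// first time a kind is installed and stop when a running kind is uninstalled.
// It returns the set of kinds whose controllers are still running.
func React(events <-chan CRDEvent, start, stop func(kind string)) map[string]bool {
	running := make(map[string]bool)
	for ev := range events {
		switch ev.Type {
		case Installed:
			if !running[ev.Kind] {
				start(ev.Kind)
				running[ev.Kind] = true
			}
		case Uninstalled:
			if running[ev.Kind] {
				stop(ev.Kind)
				delete(running, ev.Kind)
			}
		}
	}
	return running
}

func main() {
	events := make(chan CRDEvent, 3)
	events <- CRDEvent{Kind: "Widget", Type: Installed}
	events <- CRDEvent{Kind: "Widget", Type: Installed} // duplicate, ignored
	events <- CRDEvent{Kind: "Widget", Type: Uninstalled}
	close(events)

	running := React(events,
		func(kind string) { fmt.Println("start controller for", kind) },
		func(kind string) { fmt.Println("stop controller for", kind) },
	)
	fmt.Println("controllers still running:", len(running))
}
```

Nothing here polls: the loop only does work when an event arrives, which is the property being contrasted with a polling-based approach.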

Contributor Author

That's fair, I think I was a bit confused on this point. I agree that metacontroller is not an alternative and will update the doc to reflect that.

A discussion of the metacontroller probably belongs in the "Future works / use cases" section rather than in the proposal, which is more focused on making it possible to build the metacontroller outside of c-r.

Enable fine-grained control over the lifecycle of a controller, including the
ability to start/stop/restart controllers and their caches by exposing a way to
remove individual informers from the cache and working around restrictions that
currently prevent controllers from starting multiple times.
Member

Is supporting this in multi-cluster scenarios a goal?

Contributor Author

@kevindelgado kevindelgado Oct 8, 2020

My understanding is that there isn't anything explicitly preventing a single-cluster solution from working for multiple clusters out of the box. It sounds like you have specific concerns about this approach not working multi-cluster, is that true?

It's just that because controller-runtime historically has not attempted to support multi-cluster, attempting to use any solution that comes from this effort in a multi-cluster manner would be harder and undocumented.

I think the short answer is no, we aren't looking to fully flesh out a multi-cluster solution here but @DirectXMan12 can chime in if I'm misspeaking

Member

@alvaroaleman alvaroaleman Oct 8, 2020

I guess if this is a generic solution, it doesn't really matter what cluster a given controller talks to (But I can't mark this convo as resolved)


## Goals

An implementation of the minimally viable hooks needed in controller-runtime to
Member

Is an actual implementation of those hooks also a goal or just to provide a hook?

Contributor Author

Just providing a hook for now.

Member

But that then achieves your non-goal of supporting start/stop on arbitrary conditions, right?

Contributor Author

Supporting start/stop on arbitrary conditions is the goal (which I believe is accomplished just by implementing a mechanism to remove informers and rerun controllers).

Implementing a solution (the actual polling mechanism that does the start/stopping such as a metacontroller or conditional controller) is not the goal.
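
To illustrate the distinction between the hook and the policy that uses it, here is a minimal sketch that assumes a controller can be modeled as a blocking function that returns when its context is cancelled. `RunFunc`, `Restartable`, and `RunTwice` are hypothetical names for this sketch, not controller-runtime APIs, and the real restrictions on restarting a controller may involve more state than is shown here.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
)

// RunFunc is a hypothetical stand-in for a controller's blocking Start method.
type RunFunc func(ctx context.Context) error

var ErrAlreadyRunning = errors.New("controller already running")

// Restartable is the "hook": it lets a caller start and stop the wrapped
// function repeatedly. Deciding when to do so is left to the caller.
type Restartable struct {
	mu     sync.Mutex
	run    RunFunc
	cancel context.CancelFunc
	done   chan struct{}
	starts int
}

func NewRestartable(run RunFunc) *Restartable {
	return &Restartable{run: run}
}

// Start launches the controller in a goroutine with a fresh context.
// Simplification: if run returns on its own, Stop must still be called
// before the controller can be started again.
func (r *Restartable) Start(parent context.Context) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.cancel != nil {
		return ErrAlreadyRunning
	}
	ctx, cancel := context.WithCancel(parent)
	done := make(chan struct{})
	r.cancel, r.done = cancel, done
	r.starts++
	go func() {
		defer close(done)
		_ = r.run(ctx)
	}()
	return nil
}

// Stop cancels the running controller and waits for it to exit.
func (r *Restartable) Stop() {
	r.mu.Lock()
	cancel, done := r.cancel, r.done
	r.cancel, r.done = nil, nil
	r.mu.Unlock()
	if cancel == nil {
		return
	}
	cancel()
	<-done
}

// Starts reports how many times the controller has been started.
func (r *Restartable) Starts() int {
	r.mu.Lock()
	defer r.mu.Unlock()
	return r.starts
}

// RunTwice starts and stops a no-op controller twice and returns the start count.
func RunTwice() (int, error) {
	c := NewRestartable(func(ctx context.Context) error {
		<-ctx.Done()
		return ctx.Err()
	})
	for i := 0; i < 2; i++ {
		if err := c.Start(context.Background()); err != nil {
			return c.Starts(), err
		}
		c.Stop()
	}
	return c.Starts(), nil
}

func main() {
	n, err := RunTwice()
	fmt.Println("starts:", n, "err:", err)
}
```

A metacontroller or conditional controller would sit on top of something like `Restartable` and call `Start` and `Stop` according to its own conditions; that layer is what is out of scope here.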

Member

okay perfect, that wasn't clear to me before. And agreed with #1192 (comment) that metacontroller is not an alternative but more of a follow-up

## Goals

An implementation of the minimally viable hooks needed in controller-runtime to
enable controller administrators to start, stop, and restart controllers and their caches.
Member

The non-goals say that you don't want to add support for starting/stopping on arbitrary conditions. Please explicitly mention the conditions this is supposed to support.

Contributor Author

Sorry, I don't think the non-goals are clear. Basically, I was just trying to say that the goal is only to provide a hook; an actual implementation of the hooks is a non-goal.


### Informer Removal

The starting point for this proposal is Shomron’s proposed implementation of
Member

Please describe this a bit, rather than just saying "This PR explains it", because the PR might change.

Contributor Author

done

@vincepri
Member

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Oct 12, 2020
@kevindelgado kevindelgado force-pushed the design/conditional-controllers branch from 5bb2ab5 to 265a8df Compare October 19, 2020 20:03
@kevindelgado kevindelgado force-pushed the design/conditional-controllers branch from 265a8df to ca7439e Compare October 19, 2020 22:03
@kevindelgado
Contributor Author

kevindelgado commented Oct 19, 2020

I took a pass at reorganizing, expanding, and addressing feedback on this design.

Main updates:

  1. Added tracking of event handlers to the proposal to address the feedback about silently degrading (✨ Allow removing individual informers from the cache (#935) #936 (comment))
  2. Added a discussion of what we could potentially ask api-machinery/client-go for in terms of updates to the SharedInformer interface. I'm working on a proposal and POC for that now. I'm not sure it will be ready to present at the api-machinery meeting this Wednesday, but that's my goal right now.
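
As a rough illustration of the first update, the sketch below shows one way handler tracking could prevent silent degradation: an informer is only removed once no event handlers depend on it. This is an assumption about the general idea, not the design in the doc or the POC. `InformerSet`, `AddHandler`, `RemoveHandler`, and `Remove` are hypothetical names and do not correspond to SharedInformer or controller-runtime cache methods.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var ErrHandlersRegistered = errors.New("informer still has registered event handlers")

// entry is a hypothetical record for one informer: a stop channel plus a
// count of the event handlers that currently depend on it.
type entry struct {
	stop     chan struct{}
	handlers int
}

// InformerSet tracks informers by kind (hypothetical, for this sketch only).
type InformerSet struct {
	mu      sync.Mutex
	entries map[string]*entry
}

func NewInformerSet() *InformerSet {
	return &InformerSet{entries: make(map[string]*entry)}
}

// AddHandler registers one handler for kind, creating the informer if needed.
func (s *InformerSet) AddHandler(kind string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	e, ok := s.entries[kind]
	if !ok {
		e = &entry{stop: make(chan struct{})}
		s.entries[kind] = e
	}
	e.handlers++
}

// RemoveHandler unregisters one handler for kind, if any are registered.
func (s *InformerSet) RemoveHandler(kind string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if e, ok := s.entries[kind]; ok && e.handlers > 0 {
		e.handlers--
	}
}

// Remove stops and deletes the informer for kind. It refuses while handlers
// remain, so a consumer is never left silently without events.
func (s *InformerSet) Remove(kind string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	e, ok := s.entries[kind]
	if !ok {
		return nil
	}
	if e.handlers > 0 {
		return ErrHandlersRegistered
	}
	close(e.stop)
	delete(s.entries, kind)
	return nil
}

// Has reports whether an informer for kind is currently tracked.
func (s *InformerSet) Has(kind string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	_, ok := s.entries[kind]
	return ok
}

func main() {
	s := NewInformerSet()
	s.AddHandler("Widget")
	fmt.Println("remove with a handler registered:", s.Remove("Widget"))
	s.RemoveHandler("Widget")
	fmt.Println("remove with no handlers:", s.Remove("Widget"))
}
```

Returning an error rather than removing the informer anyway is a design choice of this sketch; the actual proposal may handle the case differently.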

proof-of-concept updated at #1180

I think I've addressed all your feedback from last time @alvaroaleman, let me know if I've missed anything.

PTAL
cc @DirectXMan12 @shomron

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 3, 2021
@RainbowMango
Member

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 23, 2021
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: kevindelgado
To complete the pull request process, please ask for approval from alvaroaleman after the PR has been reviewed.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kevindelgado kevindelgado force-pushed the design/conditional-controllers branch from fc7d646 to f4f3ca4 Compare May 15, 2021 01:10
@kevindelgado kevindelgado force-pushed the design/conditional-controllers branch from f4f3ca4 to f242fff Compare May 15, 2021 01:11
@k8s-ci-robot
Contributor

@kevindelgado: The following test failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
pull-controller-runtime-test-master f242fff link /test pull-controller-runtime-test-master

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@kevindelgado
Contributor Author

Dusting the cobwebs off this with a new WIP implementation at #1527

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 18, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Sep 17, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closed this PR.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

8 participants