Add liveness check against /state #406
Conversation
Force-pushed from ac6ca19 to 2273adc
Force-pushed from 2273adc to dc9a27d
Force-pushed from dc9a27d to ba27b84
@tyrannasaurusbanks rather than overloading
I am impacted by this. Manually pulled in your changes and it's holding stable atm.
Force-pushed from 414d7e7 to be85bfa
pkg/controller/alb-controller.go (Outdated)
@@ -356,6 +360,18 @@ func (ac *albController) StatusHandler(w http.ResponseWriter, r *http.Request) {
	encoder.Encode(checkResults)
}

// AliveHandler only returns an empty response. It checks nothing downstream & should only be used to check that the controller is still running.
func (ac *albController) AliveHandler(w http.ResponseWriter, r *http.Request) {
Would you prefer to see a minimal internal check added here - like maybe acquiring a mutex.RLock() as in /state?
That's a good idea.
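For illustration only, a minimal self-contained sketch of that kind of RLock-guarded liveness handler. The stand-in struct, the mutex field name, and the path and port below are assumptions for the sketch, not the PR's actual wiring:

```go
package main

import (
	"net/http"
	"sync"
)

// albController stands in for the real controller struct; only the field
// needed for this sketch is shown, and the mutex name is assumed.
type albController struct {
	mutex sync.RWMutex
}

// AliveHandler returns an empty 200 response. Briefly taking the read lock
// (as /state does) shows the controller isn't wedged holding the write
// lock, without making any downstream AWS calls.
func (ac *albController) AliveHandler(w http.ResponseWriter, r *http.Request) {
	ac.mutex.RLock()
	defer ac.mutex.RUnlock()
	w.WriteHeader(http.StatusOK)
}

func main() {
	ac := &albController{}
	// Path and port are placeholders, not the ones registered by the PR.
	http.HandleFunc("/alive", ac.AliveHandler)
	http.ListenAndServe(":10254", nil)
}
```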
Add liveness check into deployment.yaml. Add explicit responseCode to stateHandler. Also fix some compiler warnings in alb-controller and correct log statement in ec2.
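For reference, a deployment.yaml probe along these lines might look like the snippet below; the path, port, and timings are illustrative placeholders rather than the exact values committed here:

```yaml
# Container spec excerpt (illustrative values only).
livenessProbe:
  httpGet:
    # Point liveness at the cheap endpoint that makes no AWS calls; the
    # real path and port depend on how the controller is wired up.
    path: /state
    port: 10254
  initialDelaySeconds: 30
  periodSeconds: 60
  timeoutSeconds: 10
readinessProbe:
  httpGet:
    path: /healthz   # readiness can keep checking downstream dependencies
    port: 10254
  periodSeconds: 10
  timeoutSeconds: 5
```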
Force-pushed from be85bfa to 29a4f68
Add liveness check against /state
What
Why
We’ve been running an older version of the controller which was working fine for us, but recently we started seeing the liveness and readiness probes sporadically fail, with the liveness probe failures causing the pod to restart. My guess is that we're being rate limited by AWS and so the API calls in /healthz are failing/retrying.
We've mitigated this by increasing the timeouts for both our liveness and readiness probes manually and increasing the period between checks.
When I went to create a PR on the project to make the probe timeout configurable, I noticed that the liveness probe has since been removed.
I'm unsure as to why, but I guess people may have run into similar issues where liveness probe failures could cause the pod to be killed.
We feel that the liveness probe shouldn't be checking downstream dependencies (AWS APIs) through the /healthz endpoint: if those dependencies fail, all that happens is that kube restarts the controller, which leaves it in no better state. I propose that it should instead just perform a simple check that the controller is alive.
Notes
I chose to copy the query-param approach from /healthz onto the /state handler (a rough sketch of what that could mean follows below) - shout if you'd prefer me to refactor /healthz to support liveness checking without downstream calls.
If people hate this addition, or if there's some reason I'm missing as to why there is no liveness check then let me know and I can amend the PR to simply include the timeout and logging change.
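Purely to illustrate the query-param idea on /state, here is a rough sketch reusing the stand-in albController struct from the earlier snippet (and assuming encoding/json is imported alongside it). The parameter name "live" and the handler internals are made up for illustration, not taken from the actual change:

```go
// StateHandler sketch: with the hypothetical "?live=true" query parameter
// the handler skips the expensive downstream checks and just confirms the
// controller is responsive; otherwise it reports state as before.
func (ac *albController) StateHandler(w http.ResponseWriter, r *http.Request) {
	ac.mutex.RLock()
	defer ac.mutex.RUnlock()

	if r.URL.Query().Get("live") == "true" {
		// Liveness mode: no AWS calls, just an explicit 200.
		w.WriteHeader(http.StatusOK)
		return
	}

	// Full mode: report state as JSON with an explicit response code.
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(http.StatusOK)
	json.NewEncoder(w).Encode(map[string]string{"status": "ok"}) // placeholder payload
}
```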
Thanks for a great project, great to see the recent activity & migration to kubernetes-sig!