Add liveness check against /state #406
Conversation
Force-pushed from ac6ca19 to 2273adc
Force-pushed from 2273adc to dc9a27d
Force-pushed from dc9a27d to ba27b84
@tyrannasaurusbanks rather than overloading
I am impacted by this. Manually pulled in your changes and it's holding stable atm.
Force-pushed from 414d7e7 to be85bfa
pkg/controller/alb-controller.go (Outdated)
@@ -356,6 +360,18 @@ func (ac *albController) StatusHandler(w http.ResponseWriter, r *http.Request) {
	encoder.Encode(checkResults)
}

// AliveHandler only returns an empty response. It checks nothing downstream & should only be used to check that the controller is still running.
func (ac *albController) AliveHandler(w http.ResponseWriter, r *http.Request) {
Would you prefer to see a minimal internal check added here - like maybe acquiring a mutex.RLock() as in /state?
That's a good idea.
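For illustration only, a minimal self-contained sketch of that kind of RLock-guarded liveness handler. The stand-in struct, the mutex field name, and the path and port below are assumptions for the sketch, not the PR's actual wiring:

```go
package main

import (
	"net/http"
	"sync"
)

// albController stands in for the real controller struct; only the field
// needed for this sketch is shown, and the mutex name is assumed.
type albController struct {
	mutex sync.RWMutex
}

// AliveHandler returns an empty 200 response. Briefly taking the read lock
// (as /state does) shows the controller isn't wedged holding the write
// lock, without making any downstream AWS calls.
func (ac *albController) AliveHandler(w http.ResponseWriter, r *http.Request) {
	ac.mutex.RLock()
	defer ac.mutex.RUnlock()
	w.WriteHeader(http.StatusOK)
}

func main() {
	ac := &albController{}
	// Path and port are placeholders, not the ones registered by the PR.
	http.HandleFunc("/alive", ac.AliveHandler)
	http.ListenAndServe(":10254", nil)
}
```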
Add liveness check into deployment.yaml. Add explicit responseCode to stateHandler. Also fix some compiler warnings in alb-controller and correct log statement in ec2.
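For reference, a deployment.yaml probe along these lines might look like the snippet below; the path, port, and timings are illustrative placeholders rather than the exact values committed here:

```yaml
# Container spec excerpt (illustrative values only).
livenessProbe:
  httpGet:
    # Point liveness at the cheap endpoint that makes no AWS calls; the
    # real path and port depend on how the controller is wired up.
    path: /state
    port: 10254
  initialDelaySeconds: 30
  periodSeconds: 60
  timeoutSeconds: 10
readinessProbe:
  httpGet:
    path: /healthz   # readiness can keep checking downstream dependencies
    port: 10254
  periodSeconds: 10
  timeoutSeconds: 5
```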
Force-pushed from be85bfa to 29a4f68
Add liveness check against /state
What
Why
We’ve been running an older version of the controller which was working fine for us, but recently we started seeing the liveness and readiness probes sporadically fail, with the liveness probe failures causing the pod to restart. My guess is that we're being rate limited by AWS and so the API calls in /healthz are failing/retrying.
We've mitigated this by increasing the timeouts for both our liveness and readiness probes manually and increasing the period between checks.
When I went to create a PR on the project to make the probe timeout configurable, I noticed that the liveness probe has since been removed.
I'm unsure as to why, but I guess people may have run into similar issues where liveness probe failures could cause the pod to be killed.
We feel that the liveness probe shouldn't be checking downstream dependencies (AWS APIs) through the /healthz endpoint: if those dependencies fail, all that happens is that kube restarts the controller, which leaves it in no better state. I propose that it should instead just perform a simple check that the controller is alive.
Notes
I chose to copy the query-param approach from /healthz onto the /state handler (a rough sketch of what that could mean follows below) - shout if you'd prefer me to refactor /healthz to support liveness checking without downstream calls.
If people hate this addition, or if there's some reason I'm missing as to why there is no liveness check then let me know and I can amend the PR to simply include the timeout and logging change.
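Purely to illustrate the query-param idea on /state, here is a rough sketch reusing the stand-in albController struct from the earlier snippet (and assuming encoding/json is imported alongside it). The parameter name "live" and the handler internals are made up for illustration, not taken from the actual change:

```go
// StateHandler sketch: with the hypothetical "?live=true" query parameter
// the handler skips the expensive downstream checks and just confirms the
// controller is responsive; otherwise it reports state as before.
func (ac *albController) StateHandler(w http.ResponseWriter, r *http.Request) {
	ac.mutex.RLock()
	defer ac.mutex.RUnlock()

	if r.URL.Query().Get("live") == "true" {
		// Liveness mode: no AWS calls, just an explicit 200.
		w.WriteHeader(http.StatusOK)
		return
	}

	// Full mode: report state as JSON with an explicit response code.
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(http.StatusOK)
	json.NewEncoder(w).Encode(map[string]string{"status": "ok"}) // placeholder payload
}
```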
Thanks for a great project, great to see the recent activity & migration to kubernetes-sig!