Skip to content

Add liveness check against /state #406

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jul 9, 2018

Conversation

tyrannasaurusbanks
Copy link

@tyrannasaurusbanks tyrannasaurusbanks commented Jun 22, 2018

What

  • Add liveness check against /state into deployment.yaml.
  • Tweak stateHandler to include an explicit responseCode and optional empty response with a query-param toggle.
  • Make readinessProbe timeout configurable
  • Correct log statement.

Why

We’ve been running an older version of the controller which was working fine for us but recently we starting seeing the liveness and readiness probes start to sporadically fail, with the liveness probe causing the pod to restart. My guess is that we're being rate limited by AWS and so the API calls in /healthz are failing/retrying.
We've mitigated this by increasing the timeouts for both our liveness and readiness probes manually and increasing the period between checks.

When I went to create a PR on the project to make the probe timeout configurable, I noticed that the liveness probe has since been removed.
I'm unsure as to why, but I guess people may have run into similar issues where the liveness probe failures could cause the pod to be killed.

We feel that the liveness probe shouldn't be checking downstream dependencies (AWS APIs) through the /healthz endpoint since if these downstream dependencies fail then all that will happen is that the controller will be restarted by kube - which would leave it in no better state. I propose that it should instead just conduct a simple check to see if the controller is alive.

Notes

I chose to copy the query-param approach from /healthz onto the /state handler - shout if you'd prefer me to refactor /healthz to support liveness checking without downstream calls.

If people hate this addition, or if there's some reason I'm missing as to why there is no liveness check then let me know and I can amend the PR to simply include the timeout and logging change.

Thanks for a great project, great to see the recently activity & migration to kubernetes-sig!

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jun 22, 2018
@k8s-ci-robot k8s-ci-robot requested a review from bigkraig June 22, 2018 14:32
@k8s-ci-robot k8s-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Jun 22, 2018
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jun 25, 2018
@tyrannasaurusbanks tyrannasaurusbanks changed the title Add liveness check against /state [Small] Add liveness check against /state Jun 25, 2018
@bigkraig
Copy link

bigkraig commented Jun 25, 2018

@tyrannasaurusbanks rather than overloading /state, how about we just create an /alive endpoint?

@geota
Copy link

geota commented Jun 27, 2018

I am impacted by this. Manually pulled in your changes and its holding stable atm.

@bigkraig bigkraig added this to the 1.0: ALB Stabilization milestone Jun 27, 2018
@tyrannasaurusbanks tyrannasaurusbanks force-pushed the master branch 2 times, most recently from 414d7e7 to be85bfa Compare June 29, 2018 09:23
@tyrannasaurusbanks
Copy link
Author

@bigkraig added /alive endpoint and rebased latest changes onto PR. @geota nice!

@@ -356,6 +360,18 @@ func (ac *albController) StatusHandler(w http.ResponseWriter, r *http.Request) {
encoder.Encode(checkResults)
}

// AliveHandler only returns a empty response. It checks nothing downstream & should only used to check the controller is still running.
func (ac *albController) AliveHandler(w http.ResponseWriter, r *http.Request) {

Copy link
Author

@tyrannasaurusbanks tyrannasaurusbanks Jun 29, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you prefer to see a minimal internal check added here - like maybe attain a mutex.RLock() like in /state?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good idea.

Add liveness check into deployment.yaml.
Add explicit responseCode to stateHandler.
Also fix some compiler warnings in alb-controller and correct log statement in ec2.
@bigkraig bigkraig merged commit f924913 into kubernetes-sigs:master Jul 9, 2018
bigkraig added a commit that referenced this pull request Oct 1, 2018
Add liveness check against /state
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants