v3: prod worker graceful shutdown #1034

Merged: 6 commits into main from v3/prod-graceful-shutdown, Apr 18, 2024

nicktrn (Collaborator) commented Apr 16, 2024

  • Give workers 10 minutes to finish their current attempt when receiving SIGTERM
  • Fail the run with a timeout error if the worker doesn't exit within that timeframe
  • Fix an issue where large timeout delays could exceed the 32-bit signed integer limit
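
For context on the last point: `setTimeout` in Node.js and browsers only accepts delays up to 2^31 - 1 ms (about 24.8 days), and Node.js falls back to a 1 ms delay when that limit is exceeded. The PR's actual fix isn't shown in this conversation; the following is a minimal sketch of one common workaround, chaining timers that each fit the limit. `chunkDelay` and `setLongTimeout` are hypothetical names used only for illustration.

```typescript
// Largest delay setTimeout accepts: 2^31 - 1 ms (about 24.8 days).
const MAX_TIMEOUT_MS = 2 ** 31 - 1;

// Hypothetical helper: splits a delay into chunks that each fit the limit.
function chunkDelay(delayMs: number): number[] {
  const chunks: number[] = [];
  let remaining = Math.max(0, Math.floor(delayMs));
  while (remaining > MAX_TIMEOUT_MS) {
    chunks.push(MAX_TIMEOUT_MS);
    remaining -= MAX_TIMEOUT_MS;
  }
  chunks.push(remaining);
  return chunks;
}

// Hypothetical helper: runs `callback` after `delayMs` by chaining timers,
// one per chunk. Returns a function that cancels the pending timer.
function setLongTimeout(callback: () => void, delayMs: number): () => void {
  const chunks = chunkDelay(delayMs);
  let timer: ReturnType<typeof setTimeout> | undefined;
  let cancelled = false;

  const next = (index: number): void => {
    if (cancelled) return;
    if (index >= chunks.length) {
      callback();
      return;
    }
    timer = setTimeout(() => next(index + 1), chunks[index]);
  };

  next(0);

  return () => {
    cancelled = true;
    if (timer !== undefined) clearTimeout(timer);
  };
}
```

Simply clamping the delay to `MAX_TIMEOUT_MS` would be shorter, but it fires early for very long delays; chaining keeps the requested total.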

changeset-bot bot commented Apr 16, 2024

🦋 Changeset detected

Latest commit: 44a1a92

The changes in this PR will be included in the next version bump.

matt-aitken (Member) commented

Some questions on this:

  • In what situations would workers get a SIGTERM?
  • What's the worst outcome for a customer running a task and they received a SIGTERM?

nicktrn (Collaborator, Author) commented Apr 16, 2024

I should probably have prefaced this by stating the previous default, which was to quit immediately. This change is more of a precaution to limit situations that are difficult to debug. With these changes we'll know when a SIGTERM arrives, and we have control over the shutdown process.
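
To illustrate the behaviour described in this PR (wait up to 10 minutes for the current attempt, then fail the run with a timeout error), here is a rough sketch. It is not the PR's implementation: `SignalSource`, `waitForAttempt`, `installSigtermHandler`, `getCurrentAttempt` and `failRun` are hypothetical names, and only the 10-minute grace period comes from the description above.

```typescript
// Grace period from the PR description: 10 minutes.
const GRACE_PERIOD_MS = 10 * 60 * 1000;

type ShutdownOutcome = "completed" | "timed_out";

// Minimal, hypothetical view of the process object, so a fake can be
// passed in tests. Node's global `process` is assumed to fit this shape.
interface SignalSource {
  once(event: "SIGTERM", listener: () => void): unknown;
  exit(code?: number): void;
}

// Waits for the in-flight attempt to settle, but no longer than the grace
// period. A rejected attempt still counts as "completed" here: it settled.
async function waitForAttempt(
  attempt: Promise<unknown>,
  gracePeriodMs: number = GRACE_PERIOD_MS
): Promise<ShutdownOutcome> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timedOut = new Promise<ShutdownOutcome>((resolve) => {
    timer = setTimeout(() => resolve("timed_out"), gracePeriodMs);
  });
  const settled = attempt.then(
    (): ShutdownOutcome => "completed",
    (): ShutdownOutcome => "completed"
  );
  try {
    return await Promise.race([settled, timedOut]);
  } finally {
    if (timer !== undefined) clearTimeout(timer);
  }
}

// On SIGTERM: wait for the current attempt, fail the run on timeout, exit.
function installSigtermHandler(
  proc: SignalSource,
  getCurrentAttempt: () => Promise<unknown> | undefined,
  failRun: (reason: string) => Promise<void>,
  gracePeriodMs: number = GRACE_PERIOD_MS
): void {
  proc.once("SIGTERM", () => {
    void (async () => {
      let exitCode = 0;
      try {
        const attempt = getCurrentAttempt();
        const outcome: ShutdownOutcome = attempt
          ? await waitForAttempt(attempt, gracePeriodMs)
          : "completed";
        if (outcome === "timed_out") {
          exitCode = 1;
          await failRun(`Attempt did not finish within ${gracePeriodMs} ms of SIGTERM`);
        }
      } catch {
        exitCode = 1;
      } finally {
        proc.exit(exitCode);
      }
    })();
  });
}
```

In a real worker this would presumably be wired up as `installSigtermHandler(process, ...)`, with `failRun` reporting the timeout to the platform before the process exits.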

> In what situations would workers get a SIGTERM?

Manually or automatically terminating containers / pods.

  • Manual: May be required when runs appear "stuck" or otherwise unresponsive.
  • Automatic: Could happen for a number of reasons, such as OOM conditions or other critical issues, or scaling down while tasks are still running on the affected node. I'm trying to reduce the chances of any of this happening.

> What's the worst outcome for a customer running a task and they received a SIGTERM?

Worst case would be the entire run fails and has to be replayed.

Could up the grace period to an hour for now?

matt-aitken (Member) left a comment

I had one question

@matt-aitken matt-aitken merged commit 584c7da into main Apr 18, 2024
@matt-aitken matt-aitken deleted the v3/prod-graceful-shutdown branch April 18, 2024 14:46