
Add enable-out-of-service-taint flag #1132


Merged
tiationg-kho merged 3 commits into aws:main on Feb 26, 2025

Conversation

tiationg-kho
Contributor

@tiationg-kho tiationg-kho commented Feb 21, 2025

Issue #, if available:
#1124

Description of changes:

Add enable-out-of-service-taint flag

  • When this flag is enabled, NTH adds the out-of-service taint to the node after the cordon/drain process. The taint forcefully evicts pods that have no matching toleration and detaches their persistent volumes.
  • This can prevent the PVC multi-attach error.
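
To make the taint step concrete, here is a minimal Go sketch of adding the out-of-service taint to a node's taint list. It is an illustration under stated assumptions, not the code from this PR: the `Taint` struct is a local stand-in for the Kubernetes API type, `withOutOfServiceTaint` is a hypothetical helper name, and the value `nodeshutdown` is taken from the Kubernetes non-graceful node shutdown documentation rather than confirmed from this change.

```go
package main

// Taint is a local stand-in for the Kubernetes node taint type, holding only
// the fields this sketch needs. It is NOT the client-go type (assumption).
type Taint struct {
	Key    string
	Value  string
	Effect string
}

// Key and effect of the out-of-service taint as described in the Kubernetes
// documentation. The value is the one used in the upstream docs; whether NTH
// uses the same value is an assumption.
const (
	outOfServiceKey    = "node.kubernetes.io/out-of-service"
	outOfServiceValue  = "nodeshutdown"
	outOfServiceEffect = "NoExecute"
)

// withOutOfServiceTaint (hypothetical helper) returns a copy of taints with
// the out-of-service taint appended. If a taint with the same key and effect
// is already present, it returns the input unchanged and false, so calling
// it repeatedly is safe.
func withOutOfServiceTaint(taints []Taint) ([]Taint, bool) {
	for _, t := range taints {
		if t.Key == outOfServiceKey && t.Effect == outOfServiceEffect {
			return taints, false
		}
	}
	// Copy before appending so the caller's slice is never modified.
	out := make([]Taint, len(taints), len(taints)+1)
	copy(out, taints)
	out = append(out, Taint{
		Key:    outOfServiceKey,
		Value:  outOfServiceValue,
		Effect: outOfServiceEffect,
	})
	return out, true
}
```

In a real controller the returned list would be written back to the node object through the Kubernetes API after the drain finishes. The idempotence check matters because the same node may be processed more than once.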

How you tested your changes:
Environment (Linux / Windows): Linux
Kubernetes Version: 1.31

Tested with NTH in Queue Processor mode by refreshing the ASG and monitoring Kubernetes events.

Ref:
PVC attaching takes much time
Kubernetes 1.28: Non-Graceful Node Shutdown Moves to GA

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@tiationg-kho tiationg-kho requested a review from a team as a code owner February 21, 2025 06:47
@Lu-David
Contributor

Were you able to test this on an actual ASG perchance? I wonder if it would be simple enough to reproduce the issue with start-instance-refresh for an ASG or using EKS? And also test that this fix will get rid of that multi-attach error?

Contributor

@LikithaVemulapalli LikithaVemulapalli left a comment

/lgtm, you can merge it after the typo fix. Thanks

@tiationg-kho
Contributor Author

> Were you able to test this on an actual ASG perchance? I wonder if it would be simple enough to reproduce the issue with start-instance-refresh for an ASG or using EKS? And also test that this fix will get rid of that multi-attach error?

We tested this by deploying a StatefulSet with a volume. First, we reproduced the PVC multi-attach error by running an ASG refresh.

Then we triggered the ASG refresh several more times with the enable-out-of-service-taint flag turned on in NTH. In each round, we monitored the events and confirmed that the error did not occur.

Contributor

@LikithaVemulapalli LikithaVemulapalli left a comment

/lgtm

@tiationg-kho tiationg-kho merged commit c06e7f1 into aws:main Feb 26, 2025
11 checks passed
3 participants