
Kubernetes graceful shutdown not working as expected #3305

Closed
stmcginnis opened this issue Jul 31, 2023 Discussed in #3291 · 0 comments · Fixed by #3308

Comments

@stmcginnis
Contributor

tldr: Kubernetes graceful shutdown relies on inhibitor locks. These are part of systemd-logind, which is not currently included as part of Bottlerocket.
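For context, kubelet's graceful node shutdown feature registers a delay inhibitor lock with systemd-logind over D-Bus and only reacts to shutdown once that lock is in place. A quick way to check the logind side on a node (a sketch; the exact lock text kubelet registers may vary by version):

```sh
# Check whether systemd-logind is present and running on the node; it provides
# the inhibitor lock API that kubelet's graceful shutdown depends on.
systemctl status systemd-logind

# List active inhibitor locks. With graceful node shutdown working, kubelet
# should hold a "delay" lock for shutdown here.
systemd-inhibit --list
```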

Discussed in #3291

Originally posted by carlosjgp July 25, 2023
I've been working on a way to roll out our EKS nodes gracefully. My first attempt was to use Kubernetes' native support for graceful node shutdown and avoid any extra infrastructure or deployments.

I can see that support for these properties exists here.

I tried setting these to shutdown_grace_period=3m and shutdown_grace_period_critical_pod=2m and swapping the nodes between Bottlerocket versions 1.14.1 and 1.14.2 while running a simple Nginx deployment (helm create test) with 8 replicas and an ingress set up to accept traffic.
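For reference, a sketch of applying these on a running node with apiclient; the setting names below are assumed to mirror kubelet's shutdownGracePeriod and shutdownGracePeriodCriticalPods in Bottlerocket's kebab-case style, so verify them against the settings documentation for your version:

```sh
# Assumed setting names; check the Bottlerocket settings docs before relying on them.
apiclient set \
  kubernetes.shutdown-grace-period="3m" \
  kubernetes.shutdown-grace-period-for-critical-pods="2m"

# Confirm what was actually applied.
apiclient get settings.kubernetes
```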

I then ran vegeta against this Nginx deployment to verify that the pods were moved across gracefully and the PodDisruptionBudget was respected, but there were a lot of failed requests.
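A minimal load test along those lines (the hostname, rate, and duration are illustrative):

```sh
# Hypothetical ingress hostname; sustain traffic through the instance refresh
# and check the report for non-2xx responses and errors.
echo "GET http://test.example.com/" | \
  vegeta attack -rate=50 -duration=10m | \
  vegeta report
```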

The nodes are supposed to be tainted with node.kubernetes.io/not-ready:NoSchedule during shutdown, but I didn't see this happening either.
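One way to confirm whether that taint ever appears is to inspect the node while the instance refresh is terminating it (sketch, with a placeholder node name):

```sh
# Inspect the node's taints while it is being shut down;
# node.kubernetes.io/not-ready:NoSchedule should appear if graceful shutdown kicks in.
kubectl describe node <node-name> | grep -A3 -i taints

# Recent events for the node can also show cordon/shutdown activity.
kubectl get events --field-selector involvedObject.name=<node-name>
```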

I'm using:

  • EKS K8s 1.27
  • Bottlerocket AMI from the SSM parameter /aws/service/bottlerocket/aws-k8s-1.27/x86_64/${var.bottlerocket_version}/image_id (see the lookup sketch after this list)
  • AWS Autoscaling group Instance refresh to roll out the nodes when the AMI id changes
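For completeness, a sketch of resolving that SSM parameter to a concrete AMI id; the region and version here are illustrative:

```sh
# Resolve the Bottlerocket AMI id that the autoscaling group will roll out.
aws ssm get-parameter \
  --name "/aws/service/bottlerocket/aws-k8s-1.27/x86_64/1.14.2/image_id" \
  --region us-east-1 \
  --query 'Parameter.Value' \
  --output text
```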

Has anyone seen this before?

I can provide more details from CloudWatch, container or OS logs, and K8s events.

@stmcginnis stmcginnis added type/bug Something isn't working area/kubernetes K8s including EKS, EKS-A, and including VMW status/research This issue is being researched labels Jul 31, 2023
@stmcginnis stmcginnis self-assigned this Jul 31, 2023