Kubernetes creates too many Pods when a trainer fails #149
Comments
Urgent issue!
As a workaround, we can periodically check the job status and mark the job as failed if too many pods are failing. I'll try to add this feature today.
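A minimal sketch of that periodic check, for illustration only. The names `should_fail_job`, `watch_job`, `list_pod_phases`, and `fail_job` are hypothetical and not part of PaddlePaddle cloud; the two callables stand in for whatever code queries the cluster and stops the job. The only Kubernetes fact assumed is that a failed Pod reports the phase `Failed`.

```python
import time
from collections import Counter


def should_fail_job(pod_phases, max_failed_pods=5):
    """Return True once the number of Pods in phase "Failed" reaches the limit."""
    return Counter(pod_phases)["Failed"] >= max_failed_pods


def watch_job(list_pod_phases, fail_job, max_failed_pods=5,
              interval_s=30.0, max_checks=None):
    """Poll the job's Pod phases and call fail_job() when too many have failed.

    list_pod_phases and fail_job are hypothetical callables supplied by the
    caller. max_checks=None polls forever; a number bounds the loop.
    """
    checks = 0
    while max_checks is None or checks < max_checks:
        if should_fail_job(list_pod_phases(), max_failed_pods):
            fail_job()
            return True
        checks += 1
        time.sleep(interval_s)
    return False
```

The threshold and polling interval are arbitrary defaults; keeping the decision in `should_fail_job` makes it testable without a cluster.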
Thanks to @emailweixu for the suggestion. Maybe we can add a function before https://github.com/PaddlePaddle/cloud/blob/develop/docker/paddle_k8s#L28 so that, when the number of failures exceeds the threshold, it returns 0 and writes a message into
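One way the failure-threshold idea could look, written in Python rather than in the `paddle_k8s` script itself. This is a sketch under assumptions: `run_with_failure_limit`, `counter_file`, and `max_failures` are hypothetical names, the counter file is assumed to live on storage shared by all Pods of the job (a fresh Pod would otherwise start from zero), and concurrent trainers updating the file are not handled.

```python
import subprocess
import sys
from pathlib import Path


def run_with_failure_limit(cmd, counter_file, max_failures=3):
    """Run cmd and return its exit code, or 0 once failures reach max_failures.

    Returning 0 makes the Pod look successful, so the Job stops replacing it.
    """
    counter = Path(counter_file)
    failures = int(counter.read_text()) if counter.exists() else 0
    if failures >= max_failures:
        # Threshold already reached by earlier runs: do not run the trainer.
        return 0

    code = subprocess.call(cmd)
    if code == 0:
        return 0

    failures += 1
    counter.write_text(str(failures))
    if failures >= max_failures:
        # Record why the Pod exits 0 even though the trainer failed.
        print(f"trainer failed {failures} times; giving up", file=sys.stderr)
        return 0
    return code
```

A caller would end with something like `sys.exit(run_with_failure_limit(["python", "train.py"], "/shared/failures"))`, where the path is again only an example. The trade-off is that the Job ends up reporting success, so the failure has to be surfaced through the written message.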
Will test and fix.
@Yancey1989 @typhoonzero How was this endless creation of new Pods resolved? By setting a limit on the number of retries?
Force exit 0. This still needs to be verified.
The Paddle trainers are scheduled by a Kubernetes Job. When any Pod fails, Kubernetes starts a new Pod, so if the uploaded train.py exits with a non-zero code, more and more Pods with an Error status pile up, and this never stops unless the user kills the job manually. Here is the design doc for backoff policy and failed pod limit.