Kubernetes create so much Pods when a trainer failed #149

Closed
Yancey1989 opened this issue Jun 14, 2017 · 6 comments

@Yancey1989
Collaborator

The Paddle trainers are scheduled as a Kubernetes Job. When any Pod fails, Kubernetes starts a new Pod, so if the uploaded train.py exits with a non-zero code, more and more Pods pile up in Error status. This never stops unless the user kills the job manually.

Here is the design doc for the backoff policy and failed Pod limit.

@typhoonzero
Collaborator

Urgent issue!

@typhoonzero
Collaborator

We can add a workaround, such as periodically checking the job status and marking the job as failed if too many Pods are failing. I'll try to add this feature today.
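The decision step of that periodic check could look roughly like the sketch below. It only covers the "too many Pods are failing" test; fetching the Pod list and actually stopping the job are left out. The function name `should_fail_job` and the `max_failed_pods` parameter are hypothetical, and the sketch assumes the caller has already collected each Pod's phase string (for example "Running" or "Failed") from Kubernetes.

```python
from collections import Counter


def should_fail_job(pod_phases, max_failed_pods):
    """Return True once the number of failed Pods reaches the limit.

    pod_phases: iterable of Pod phase strings, e.g. "Running" or "Failed"
                (assumed to be gathered by the caller).
    max_failed_pods: hypothetical threshold chosen by the operator.
    """
    counts = Counter(pod_phases)
    # Counter returns 0 for phases that never appear.
    return counts["Failed"] >= max_failed_pods
```

A periodic checker would call this on every poll and stop the job the first time it returns True.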

@Yancey1989
Collaborator Author

Thanks to @emailweixu for the suggestion. Maybe we can add a function before https://github.com/PaddlePaddle/cloud/blob/develop/docker/paddle_k8s#L28: when the number of failures exceeds the threshold, return 0 and write a message to /dev/termination-log, so that the paddlecloud command-line tool can fetch the error message.
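A minimal sketch of that idea, written in Python rather than as a change to the actual paddle_k8s script. The name `run_with_failure_limit` and its parameters are hypothetical, and it assumes the number of earlier failures (`prior_failures`) is obtained elsewhere, since a restarted Pod does not remember it on its own.

```python
import subprocess

TERMINATION_LOG = "/dev/termination-log"  # path named in the comment above


def run_with_failure_limit(cmd, prior_failures, max_failures,
                           log_path=TERMINATION_LOG):
    """Run cmd and return the exit code the container should report.

    prior_failures: failures seen before this run (assumed to be supplied
                    by the caller, e.g. from counting failed Pods).
    max_failures:   hypothetical threshold after which we stop retrying.
    """
    code = subprocess.call(cmd)
    if code == 0:
        return 0
    if prior_failures + 1 < max_failures:
        # Below the threshold: report the failure so Kubernetes retries.
        return code
    # Threshold reached: record the reason, then report success so the
    # Job stops creating replacement Pods.
    with open(log_path, "w") as f:
        f.write("trainer failed %d times, last exit code %d\n"
                % (prior_failures + 1, code))
    return 0
```

The trade-off is that the Pod then looks successful to Kubernetes, so the message in the termination log is the only place the real failure is recorded.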

@typhoonzero
Collaborator

Will test and fix~

@pineking

pineking commented Jul 7, 2017

@Yancey1989 @typhoonzero How was this problem of endlessly creating new Pods solved? By setting a limit on the number of retries?

@typhoonzero
Collaborator

Force exit 0. This still needs to be verified.
