-
Notifications
You must be signed in to change notification settings - Fork 78
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cleanup pserver
resource after trainer
job end.
#515
Conversation
@@ -0,0 +1,225 @@ | |||
package main |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not put this function in the controller?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
监听的资源和功能都不一样。
我准备把scaler的功能完善之后把这个功能merge进去。
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- 如果只是kill pserver,没必要使用Informer接口注册controller。直接每隔几分钟扫一遍所有job就行,代码更简单。
- 没必要先单独实现一个controller之外的单独的二进制。都放在controller内部,只需要一个二进制就可以启动。尽量不要有“中间状态”或“临时的程序”
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
如果只是kill pserver,没必要使用Informer接口注册controller。直接每隔几分钟扫一遍所有job就行,代码更简单。
job多的时候效率很差。
没必要先单独实现一个controller之外的单独的二进制。都放在controller内部,只需要一个二进制就可以启动。
监听的资源不是一样的。不是一个Generation的东西。
尽量不要有“中间状态”或“临时的程序”。
同意。
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
监听的资源不是一样的。不是一个Generation的东西。
我理解,我们需要把现在Create/Delete Job的功能都移到controller中,所以自动回收PServer资源也在Controller,应该也是没问题的,相当于做了garbage collection。
@@ -0,0 +1,225 @@ | |||
package main |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- 如果只是kill pserver,没必要使用Informer接口注册controller。直接每隔几分钟扫一遍所有job就行,代码更简单。
- 没必要先单独实现一个controller之外的单独的二进制。都放在controller内部,只需要一个二进制就可以启动。尽量不要有“中间状态”或“临时的程序”
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
同意@typhoonzero , 可能与controller放在同一个二进制比较好。
controller现在可以scale,加上这个可以delete,另外还需要create(不需要在这个PR里)。(不然controller只delete不create,有些奇怪)。
|
||
type event struct { | ||
Type eventType | ||
Job interface{} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
为何这里不是Job *batchv1.Job
,创建event的时候把类型改过来可能更好:event{Type: x, Job: job.(*batchv1.Job)}
log.Info(fmt.Sprintf("delete pserver replicaset namespace:%s jobname:%s", namespace, jobname)) | ||
|
||
// wait to delete replicaset | ||
time.Sleep(1 * time.Second) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
能否设置删除replica set的时候就自动删除对应的pod?这里好像有一些资料:https://kubernetes.io/docs/concepts/workloads/controllers/garbage-collection/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
好的。研究一下。
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
同意临时程序不merge,不过还是要解释一下背景:
这是云上的用户提出的问题。他们使用的是我们controller实现之前的版本。是通过PaddleCloud直接提交的任务。
我们的Controller监听的不是Job资源,而是TrainingJob资源。controller应该叫做TrainingJobController更能体现原有的意思一些。
这个patch监听的是Job资源,跟我们现有的实现不论实现目的上有一些差别。不宜放到一块。
完整版的Controller我正在写。包含Create,Delete还有scheduler,不过可能要这周才能写完。
@@ -0,0 +1,225 @@ | |||
package main |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
监听的资源不是一样的。不是一个Generation的东西。
我理解,我们需要把现在Create/Delete Job的功能都移到controller中,所以自动回收PServer资源也在Controller,应该也是没问题的,相当于做了garbage collection。
// Cleaner is a struct to clean pserver. | ||
type Cleaner struct { | ||
client *rest.RESTClient | ||
clientset *kubernetes.Clientset |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Kubernetes Client可以使用 go/controller/cluster.go#Cluster
。
@@ -0,0 +1,16 @@ | |||
apiVersion: extensions/v1beta1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
不需要一个独立的Yaml,和controller部署在起就好。
Fix #504
I will merge this function to scaler when I finish scaler. And now it's temporary patch to solve the ISSUE.