Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cleanup pserver resource after trainer job end. #515

Closed
wants to merge 7 commits into from

Conversation

gongweibao
Copy link
Collaborator

@gongweibao gongweibao commented Dec 9, 2017

Fix #504

I will merge this function to scaler when I finish scaler. And now it's temporary patch to solve the ISSUE.

@@ -0,0 +1,225 @@
package main
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not put this function in the controller?

Copy link
Collaborator Author

@gongweibao gongweibao Dec 11, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

监听的资源和功能都不一样。
我准备把scaler的功能完善之后把这个功能merge进去。

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. 如果只是kill pserver,没必要使用Informer接口注册controller。直接每隔几分钟扫一遍所有job就行,代码更简单。
  2. 没必要先单独实现一个controller之外的单独的二进制。都放在controller内部,只需要一个二进制就可以启动。尽量不要有“中间状态”或“临时的程序”

Copy link
Collaborator Author

@gongweibao gongweibao Dec 11, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

如果只是kill pserver,没必要使用Informer接口注册controller。直接每隔几分钟扫一遍所有job就行,代码更简单。

job多的时候效率很差。

没必要先单独实现一个controller之外的单独的二进制。都放在controller内部,只需要一个二进制就可以启动。

监听的资源不是一样的。不是一个Generation的东西。

尽量不要有“中间状态”或“临时的程序”。

同意。

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

监听的资源不是一样的。不是一个Generation的东西。

我理解,我们需要把现在Create/Delete Job的功能都移到controller中,所以自动回收PServer资源也在Controller,应该也是没问题的,相当于做了garbage collection。

@@ -0,0 +1,225 @@
package main
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. 如果只是kill pserver,没必要使用Informer接口注册controller。直接每隔几分钟扫一遍所有job就行,代码更简单。
  2. 没必要先单独实现一个controller之外的单独的二进制。都放在controller内部,只需要一个二进制就可以启动。尽量不要有“中间状态”或“临时的程序”

Copy link
Collaborator

@helinwang helinwang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

同意@typhoonzero , 可能与controller放在同一个二进制比较好。
controller现在可以scale,加上这个可以delete,另外还需要create(不需要在这个PR里)。(不然controller只delete不create,有些奇怪)。


type event struct {
Type eventType
Job interface{}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

为何这里不是Job *batchv1.Job,创建event的时候把类型改过来可能更好:event{Type: x, Job: job.(*batchv1.Job)}

log.Info(fmt.Sprintf("delete pserver replicaset namespace:%s jobname:%s", namespace, jobname))

// wait to delete replicaset
time.Sleep(1 * time.Second)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

能否设置删除replica set的时候就自动删除对应的pod?这里好像有一些资料:https://kubernetes.io/docs/concepts/workloads/controllers/garbage-collection/

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

好的。研究一下。

Copy link
Collaborator Author

@gongweibao gongweibao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

同意临时程序不merge,不过还是要解释一下背景:

这是云上的用户提出的问题。他们使用的是我们controller实现之前的版本。是通过PaddleCloud直接提交的任务。
我们的Controller监听的不是Job资源,而是TrainingJob资源。controller应该叫做TrainingJobController更能体现原有的意思一些。

这个patch监听的是Job资源,跟我们现有的实现不论实现目的上有一些差别。不宜放到一块。
完整版的Controller我正在写。包含Create,Delete还有scheduler,不过可能要这周才能写完。

@@ -0,0 +1,225 @@
package main
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

监听的资源不是一样的。不是一个Generation的东西。

我理解,我们需要把现在Create/Delete Job的功能都移到controller中,所以自动回收PServer资源也在Controller,应该也是没问题的,相当于做了garbage collection。

// Cleaner is a struct to clean pserver.
type Cleaner struct {
client *rest.RESTClient
clientset *kubernetes.Clientset
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kubernetes Client可以使用 go/controller/cluster.go#Cluster

@@ -0,0 +1,16 @@
apiVersion: extensions/v1beta1
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

不需要一个独立的Yaml,和controller部署在起就好。

@CLAassistant
Copy link

CLAassistant commented Mar 23, 2020

CLA assistant check
All committers have signed the CLA.

@xiaolao xiaolao closed this Apr 15, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

pserver需要自动释放资源
6 participants