Support submit v2 API job #99

Merged
merged 1 commit into from
May 26, 2017
11 changes: 10 additions & 1 deletion doc/usage_cn.md
@@ -40,7 +40,7 @@ Subcommands:
submit Submit job to PaddlePaddle Cloud.


Use "paddlecloud.darwin flags" for a list of top-level flags
Use "paddlecloud flags" for a list of top-level flags
```

## Prepare training data
@@ -120,10 +120,18 @@ scp -r my_training_package/ user@tunnel-server:/mnt/hdfs_mulan/idl/idl-dl/mypack

Run the following command to submit the prepared job:

- Submit a training job based on the V1 API

```bash
paddlecloud submit -jobname my-paddlecloud-job -cpu 1 -gpu 0 -memory 1Gi -parallelism 10 -pscpu 1 -pservers 3 -psmemory 1Gi -passes 1 -topology trainer_config.py /pfs/[datacenter_name]/home/[username]/ctr_demo_package
```

- Submit a training job based on the V2 API

```bash
paddlecloud submit -jobname my-paddlecloud-job -cpu 1 -gpu 0 -memory 1Gi -parallelism 10 -pscpu 1 -pservers 3 -psmemory 1Gi -passes 1 -entry "python trainer_config.py" /pfs/[datacenter_name]/home/[username]/ctr_demo_package
```

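Parameters:
- `jobname`: name of the submitted job; paddlecloud uses `jobname` to uniquely identify a job
- `-cpu`: CPU resources used by each trainer process, in cores
@@ -134,6 +142,7 @@ paddlecloud submit -jobname my-paddlecloud-job -cpu 1 -gpu 0 -memory 1Gi -parall
- `-pservers`: number of parameter server nodes
- `-psmemory`: memory used by the parameter server, written as a number followed by a unit; the unit can be Ki, Mi, or Gi
- `-topology`: Python model configuration file for PaddlePaddle v1 training
- `-entry`: startup command of the PaddlePaddle v2 training program
- `-passes`: number of training passes to run
- `package`: path of the training job package on HDFS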
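
The `-memory` and `-psmemory` values in the commands above use a number followed by a binary unit suffix. As a rough illustration of that format, here is a minimal Python 3 sketch that converts such a string to bytes. The helper `parse_memory` is hypothetical and not part of paddlecloud; it only assumes that Ki, Mi and Gi mean 1024, 1024² and 1024³ bytes.

```python
import re

# Binary multipliers for the three suffixes named in the parameter list.
_UNITS = {"Ki": 1024, "Mi": 1024 ** 2, "Gi": 1024 ** 3}


def parse_memory(quantity):
    """Convert a string such as "1Gi" into a number of bytes.

    Hypothetical helper for illustration only; paddlecloud itself is not
    known to expose such a function.
    """
    match = re.fullmatch(r"(\d+)(Ki|Mi|Gi)", quantity)
    if match is None:
        raise ValueError("expected <number><Ki|Mi|Gi>, got %r" % quantity)
    number, unit = match.groups()
    return int(number) * _UNITS[unit]


print(parse_memory("1Gi"))  # 1073741824
```

A value such as `1GB` is rejected by this sketch, since only the three suffixes listed above are accepted.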

2 changes: 1 addition & 1 deletion docker/paddle_k8s
@@ -47,7 +47,7 @@ start_trainer() {
--num_gradient_servers=$PADDLE_INIT_NUM_GRADIENT_SERVERS
;;
"v2")
python ${TOPOLOGY}
${ENTRY}
;;
*)
;;
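
In this hunk the `v2` branch runs `${ENTRY}` unquoted in place of `python ${TOPOLOGY}`, so the shell splits the value on whitespace before executing it. The Python sketch below approximates that splitting so its effect on an entry string is easy to see. It is a simplification: it assumes the default `IFS`, ignores pathname expansion, and the helper `split_entry` is hypothetical.

```python
def split_entry(entry):
    """Approximate the shell's word splitting of an unquoted ${ENTRY}.

    Splits on runs of whitespace only. Quote characters inside the value
    are NOT interpreted; they stay attached to the words they appear in.
    """
    return entry.split()


print(split_entry("python trainer_config.py"))
# ['python', 'trainer_config.py']

print(split_entry('python train.py --name "my job"'))
# ['python', 'train.py', '--name', '"my', 'job"']
```

The second call suggests that arguments containing spaces would probably not survive being passed through `-entry` in this form.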
5 changes: 5 additions & 0 deletions paddlecloud/paddlejob/paddle_job.py
@@ -18,6 +18,7 @@ def __init__(self,
pscpu,
psmemory,
topology,
entry,
image,
passes,
gpu=0,
@@ -38,6 +39,7 @@ def __init__(self,
self._pscpu = pscpu
self._psmemory = psmemory
self._topology = topology
self._entry = entry
self._image = image
self._volumes = volumes
self._registry_secret = registry_secret
@@ -67,6 +69,7 @@ def get_env(self):
envs.append({"name":"TRAINERS", "value":str(self._parallelism)})
envs.append({"name":"PSERVERS", "value":str(self._pservers)})
envs.append({"name":"TOPOLOGY", "value":self._topology})
envs.append({"name":"ENTRY", "value":self._entry})
envs.append({"name":"TRAINER_PACKAGE", "value":self._job_package})
envs.append({"name":"PADDLE_INIT_PORT", "value":str(DEFAULT_PADDLE_PORT)})
envs.append({"name":"PADDLE_INIT_TRAINER_COUNT", "value":str(self._cpu)})
Expand Down Expand Up @@ -97,6 +100,8 @@ def _get_pserver_entrypoint(self):
return ["paddle_k8s", "start_pserver"]

def _get_trainer_entrypoint(self):
if self._entry:
return ["paddle_k8s", "start_trainer", "v2"]
return ["paddle_k8s", "start_trainer", "v1"]

def _get_trainer_labels(self):
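
The entrypoint selection in this hunk can be sketched as a standalone function, which makes the rule easy to check in isolation: a non-empty entry selects the `v2` trainer, anything else falls back to `v1`. The free function `trainer_entrypoint` is a hypothetical stand-in for the method and takes the entry string directly rather than reading `self._entry`.

```python
def trainer_entrypoint(entry):
    """Return the container command for the trainer.

    Mirrors the branch shown in the diff: any non-empty entry string
    selects the v2 code path, otherwise v1 is used.
    """
    if entry:
        return ["paddle_k8s", "start_trainer", "v2"]
    return ["paddle_k8s", "start_trainer", "v1"]


print(trainer_entrypoint("python trainer_config.py"))
# ['paddle_k8s', 'start_trainer', 'v2']
print(trainer_entrypoint(""))
# ['paddle_k8s', 'start_trainer', 'v1']
```

Under this rule a job that supplies both a topology and an entry would run as `v2`.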
9 changes: 6 additions & 3 deletions paddlecloud/paddlejob/views.py
@@ -34,8 +34,10 @@ def post(self, request, format=None):
username = request.user.username
namespace = notebook.utils.email_escape(username)
obj = json.loads(request.body)
if not obj.get("topology"):
return utils.simple_response(500, "no topology specified")
topology = obj.get("topology", "")
entry = obj.get("entry", "")
if not topology and not entry:
return utils.simple_response(500, "no topology or entry specified")
if not obj.get("datacenter"):
return utils.simple_response(500, "no datacenter specified")
dc = obj.get("datacenter")
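
The request check in this hunk accepts a job when either `topology` or `entry` is present. The sketch below isolates that logic so it can be exercised without Django. It assumes the request body is a JSON object; the function `check_job_request` is hypothetical and returns the error message, or `None` when the request passes, instead of calling `utils.simple_response`.

```python
import json


def check_job_request(body):
    """Return an error message for an invalid request body, else None.

    Follows the order of checks in the diff: first topology/entry,
    then datacenter.
    """
    obj = json.loads(body)
    topology = obj.get("topology", "")
    entry = obj.get("entry", "")
    if not topology and not entry:
        return "no topology or entry specified"
    if not obj.get("datacenter"):
        return "no datacenter specified"
    return None


# A v2-style request: no topology, but an entry command is given.
body = json.dumps({"entry": "python trainer_config.py", "datacenter": "dc1"})
print(check_job_request(body))  # None
```

The datacenter name `dc1` is a made-up placeholder for the example.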
@@ -72,7 +74,8 @@ def post(self, request, format=None):
pservers = obj.get("pservers", 1),
pscpu = obj.get("pscpu", 1),
psmemory = obj.get("psmemory", "1Gi"),
topology = obj["topology"],
topology = topology,
entry = entry,
gpu = obj.get("gpu", 0),
image = obj.get("image", settings.JOB_DOCKER_IMAGE["image"]),
passes = obj.get("passes", 1),