
How to specify the distributed training job resource #2019

Closed
Yancey1989 opened this issue May 5, 2017 · 11 comments
@Yancey1989 (Contributor)

When a user submits a distributed training job, the cluster needs to determine the following resources:

  • trainer/pserver count: the number of trainer/pserver processes
  • trainer/pserver memory: the memory limit of each trainer/pserver process
  • trainer CPU/GPU count: the CPU/GPU count used by each trainer process
  • pserver CPU count: the CPU count used by each pserver

There are two ways to specify them:

  1. Specify every resource explicitly:
    • Pro: direct.
    • Con: the user must know the cluster's physical hardware, e.g. how many GPU cards each machine has. If every machine has only 4 GPU cards, the user should specify trainer_gpu_num=4, trainer_count=2 rather than trainer_gpu_num=8, trainer_count=1.
  2. Specify only the total CPU/GPU count and the memory limit; the pserver/trainer count is derived from the cluster's physical configuration, and the pserver/trainer memory, pserver CPU limit, and trainer CPU/GPU limit are allocated proportionally.
    • Pro: the user needs no knowledge of the cluster's physical configuration, and the pserver/trainer count can be adjusted dynamically to the cluster's current state.
    • Con: slightly less flexible; different models may need different allocation ratios between the pserver CPU limit and the trainer CPU limit, so the split cannot always be optimal.
@typhoonzero (Contributor)

Agree with option 2. Option 1 could be kept as a set of hidden, advanced configuration options.

If possible, we should implement a scheduler plugin that computes a reasonable scheduling plan from the total resources the user specifies and then launches the job; a sketch follows.
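
A minimal sketch of what such a scheduler plugin might compute, assuming GPU-first allocation and that no trainer may span machines; the function name, arguments, and one-CPU-per-trainer default are illustrative assumptions, not an existing PaddlePaddle API:

def plan_job(total_cpu, total_gpu, total_mem_gb, gpus_per_node):
    """Derive a trainer layout from user-specified resource totals."""
    if total_gpu > 0:
        # Never ask a single trainer for more GPUs than one machine has.
        per_trainer_gpu = min(total_gpu, gpus_per_node)
        trainer_count = total_gpu // per_trainer_gpu
    else:
        # Pure-CPU job: assume one CPU per trainer.
        per_trainer_gpu = 0
        trainer_count = total_cpu
    return {
        "trainer_count": trainer_count,
        "trainer_gpu": per_trainer_gpu,
        "trainer_cpu": max(1, total_cpu // trainer_count),
        "trainer_mem_gb": total_mem_gb / trainer_count,
    }

# e.g. 16 GPUs requested on a cluster of 4-GPU machines:
print(plan_job(total_cpu=32, total_gpu=16, total_mem_gb=64, gpus_per_node=4))
# -> 4 trainers, each with 4 GPUs, 8 CPUs, and 16 GB of memory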

@Yancey1989 Yancey1989 changed the title How to specify the resource(CPU, Memory, GPU) while the user submits a distributed training job How to specify the resource(CPU, Memory, GPU) for distributed training May 5, 2017
@helinwang (Contributor)

Agree with option 2. Specifying only the CPU number for a pure-CPU job and the GPU number for a pure-GPU job, plus a memory limit, is a good idea. For pservers, if any, rather than an exact count we could offer a few buckets: few, more, a lot, or similar.
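
A rough sketch of the bucket idea; the bucket names come from the comment above, but the ratios and the helper are illustrative assumptions:

# Map each bucket to a pserver-to-trainer ratio (assumed values).
PSERVER_BUCKETS = {
    "few": 0.05,
    "more": 0.20,
    "a lot": 0.50,
}

def pserver_count(trainer_count, bucket="few"):
    # Always run at least one pserver.
    return max(1, int(trainer_count * PSERVER_BUCKETS[bucket]))

print(pserver_count(100, "more"))  # -> 20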

@Yancey1989 (Contributor, Author)

"For pservers, if any, rather than an exact count we could offer a few buckets: few, more, a lot."

The pserver buckets are a good idea; we could also provide advanced parameters to specify the pserver resource limit.

Also, a pure-GPU mode probably won't exist: even when a GPU count is specified, a CPU limit is still required. A GPU-first allocation may be appropriate, for example:

## trainer limit: GPU-first allocation
if gpu_num > 0:
    # One GPU per trainer; spread CPUs and memory evenly.
    trainer_count = gpu_num
    trainer_cpu = cpu_num / trainer_count
    trainer_memory = memory / trainer_count
else:
    # Pure-CPU job: one CPU per trainer.
    trainer_count = cpu_num
    trainer_cpu = 1
    trainer_memory = memory / trainer_count
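
As a worked example (values assumed for illustration): with gpu_num=8, cpu_num=16, and memory=32, the GPU branch above yields trainer_count=8 trainers, each limited to 2 CPUs and 4 units of memory.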

@typhoonzero (Contributor)

@Yancey1989 Can you please update the job submit design doc for this and close this issue?

@Yancey1989 (Contributor, Author)

@typhoonzero I have already updated the design doc: #1770, and it's under review. I'll close this issue :)

@Yancey1989 Yancey1989 changed the title How to specify the resource(CPU, Memory, GPU) for distributed training How to specify the distributed training job resource May 10, 2017
@Yancey1989 (Contributor, Author) commented May 11, 2017

  • Required Parameters
PaddleJob(
    cpu_nums=10,
    gpu_nums=2,
    memory="10G"
)

By default, the user only specifies the job's total resource size; our scheduler decides how many trainers and pservers to use based on the cluster configuration and the job's resources.

  • Advanced Parameters
PaddleJob(
    trainers=3,
    pservers=4,
    trainer_cpu=1,
    trainer_mem="1G",
    pserver_cpu=1,
    pserver_mem="1G"
)

The advanced configuration specifies the trainer and pserver resources explicitly.

@helinwang (Contributor) commented May 11, 2017

How about specifying the advanced configuration first? Then we simply launch what is asked for, whereas the simple configuration additionally requires us to do the scheduling and allocation.

PaddleJob(
    trainers=3,
    pservers=4,
    trainer_cpu=1,
    trainer_mem="1G",
    gpu_num=10,
    pserver_cpu=1,
    pserver_mem="1G"
)

How about this? (gpu_num=10 is the addition)

@Yancey1989 (Contributor, Author)

gpu_num means the number of GPU cards per trainer, right? Then it could be named trainer_gpu.
Since these two parameter groups are mutually exclusive, how about providing both configuration modes at the same time? For beginners or experimental jobs, specifying the job's total resources is enough, while advanced users can use the other parameter group to specify the trainer and pserver resources separately; a sketch of keeping the two groups exclusive follows.
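
A hypothetical sketch of how PaddleJob might reject mixing the two parameter groups; the group membership and the helper name are assumptions, not the actual implementation:

# The two mutually exclusive parameter groups discussed above.
SIMPLE = {"cpu_nums", "gpu_nums", "memory"}
ADVANCED = {"trainers", "pservers", "trainer_cpu", "trainer_gpu",
            "trainer_mem", "pserver_cpu", "pserver_mem"}

def validate_job_args(**kwargs):
    given = set(kwargs)
    if given & SIMPLE and given & ADVANCED:
        raise ValueError("Specify either total resources or per-process "
                         "resources, not both: %s" % sorted(given))

validate_job_args(cpu_nums=10, gpu_nums=2, memory="10G")  # OK
# validate_job_args(cpu_nums=10, trainers=3)  # raises ValueError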

@Yancey1989 Yancey1989 reopened this May 11, 2017
@jacquesqiao (Member)

I think we can implement the precise configuration first, and implement an automatic-calculation version later if it proves convenient.

@Yancey1989 (Contributor, Author)

@jacquesqiao @helinwang Agreed: the first version will implement the precise configuration, and we'll implement automatic calculation once its design is thought through :)

@wanghaoshuang (Contributor)

Closing this issue due to inactivity; feel free to reopen it.
