
How to specify the distributed training job resource #2019

Closed
Yancey1989 opened this issue May 5, 2017 · 11 comments
@Yancey1989 (Contributor)

When a user submits a distributed training job, the cluster needs to determine the following resources:

  • trainer/pserver count: the number of trainer/pserver processes
  • trainer/pserver memory: the memory limit of each trainer/pserver process
  • trainer CPU/GPU count: the CPU/GPU count used by each trainer process
  • pserver CPU count: the CPU count used by each pserver

There are two ways to specify them:

  1. Specify every resource explicitly:
    • Pro: direct.
    • Con: the user must know the cluster's physical hardware, e.g. how many GPU cards each machine has. If every machine has only 4 GPU cards, the user should specify trainer_gpu_num=4, trainer_count=2 rather than trainer_gpu_num=8, trainer_count=1.
  2. Specify only the total CPU/GPU count and the memory limit; the pserver/trainer count is derived from the cluster's physical configuration, and the pserver/trainer memory, pserver CPU limit, and trainer CPU/GPU limit are allocated proportionally.
    • Pro: the user needs no knowledge of the cluster's physical configuration, and the pserver/trainer count can be adjusted dynamically to the cluster's current state.
    • Con: slightly less flexible; different models may need different allocation ratios between the pserver CPU limit and the trainer CPU limit, so the split cannot always be optimal.
@typhoonzero (Contributor)

Agree with option 2. Option 1 could be kept as a set of hidden, advanced configuration options.

If possible, we should implement a scheduler plugin that computes a reasonable scheduling plan from the total resources the user specifies and then launches the job; a sketch follows.
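
A minimal sketch of what such a scheduler plugin might compute, assuming GPU-first allocation and that no trainer may span machines; the function name, arguments, and one-CPU-per-trainer default are illustrative assumptions, not an existing PaddlePaddle API:

def plan_job(total_cpu, total_gpu, total_mem_gb, gpus_per_node):
    """Derive a trainer layout from user-specified resource totals."""
    if total_gpu > 0:
        # Never ask a single trainer for more GPUs than one machine has.
        per_trainer_gpu = min(total_gpu, gpus_per_node)
        trainer_count = total_gpu // per_trainer_gpu
    else:
        # Pure-CPU job: assume one CPU per trainer.
        per_trainer_gpu = 0
        trainer_count = total_cpu
    return {
        "trainer_count": trainer_count,
        "trainer_gpu": per_trainer_gpu,
        "trainer_cpu": max(1, total_cpu // trainer_count),
        "trainer_mem_gb": total_mem_gb / trainer_count,
    }

# e.g. 16 GPUs requested on a cluster of 4-GPU machines:
print(plan_job(total_cpu=32, total_gpu=16, total_mem_gb=64, gpus_per_node=4))
# -> 4 trainers, each with 4 GPUs, 8 CPUs, and 16 GB of memory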

@Yancey1989 Yancey1989 changed the title How to specify the resource(CPU, Memory, GPU) while the user submits a distributed training job How to specify the resource(CPU, Memory, GPU) for distributed training May 5, 2017
@helinwang (Contributor)

Agree with option 2. Specifying only the CPU number for a pure-CPU job and the GPU number for a pure-GPU job, plus a memory limit, is a good idea. For pservers, if any, rather than an exact count we could offer a few buckets: few, more, a lot, or similar.
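
A rough sketch of the bucket idea; the bucket names come from the comment above, but the ratios and the helper are illustrative assumptions:

# Map each bucket to a pserver-to-trainer ratio (assumed values).
PSERVER_BUCKETS = {
    "few": 0.05,
    "more": 0.20,
    "a lot": 0.50,
}

def pserver_count(trainer_count, bucket="few"):
    # Always run at least one pserver.
    return max(1, int(trainer_count * PSERVER_BUCKETS[bucket]))

print(pserver_count(100, "more"))  # -> 20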

@Yancey1989 (Contributor, Author)

"For pservers, if any, rather than an exact count we could offer a few buckets: few, more, a lot."

The pserver buckets are a good idea; we could also provide advanced parameters to specify the pserver resource limit.

Also, a pure-GPU mode probably won't exist: even when a GPU count is specified, a CPU limit is still required. A GPU-first allocation may be appropriate, for example:

## trainer limit: GPU-first allocation
if gpu_num > 0:
    # One GPU per trainer; spread CPUs and memory evenly.
    trainer_count = gpu_num
    trainer_cpu = cpu_num / trainer_count
    trainer_memory = memory / trainer_count
else:
    # Pure-CPU job: one CPU per trainer.
    trainer_count = cpu_num
    trainer_cpu = 1
    trainer_memory = memory / trainer_count
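
As a worked example (values assumed for illustration): with gpu_num=8, cpu_num=16, and memory=32, the GPU branch above yields trainer_count=8 trainers, each limited to 2 CPUs and 4 units of memory.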

@typhoonzero (Contributor)

@Yancey1989 Can you please update the job submit design doc for this and close this issue?

@Yancey1989 (Contributor, Author)

@typhoonzero I have already updated the design doc: #1770, and it's under review. I'll close this issue :)

@Yancey1989 Yancey1989 changed the title How to specify the resource(CPU, Memory, GPU) for distributed training How to specify the distributed training job resource May 10, 2017
@Yancey1989 (Contributor, Author) commented May 11, 2017

  • Required Parameters
PaddleJob(
    cpu_nums=10,
    gpu_nums=2,
    memory="10G"
)

By default, the user only specifies the job's total resource size; our scheduler decides how many trainers and pservers to use based on the cluster configuration and the job's resources.

  • Advanced Parameters
PaddleJob(
    trainers=3,
    pservers=4,
    trainer_cpu=1,
    trainer_mem="1G",
    pserver_cpu=1,
    pserver_mem="1G"
)

The advanced configuration specifies the trainer and pserver resources explicitly.

@helinwang (Contributor) commented May 11, 2017

How about specifying the advanced configuration first? Then we simply launch what is asked for, whereas the simple configuration additionally requires us to do the scheduling and allocation.

PaddleJob(
    trainers=3,
    pservers=4,
    trainer_cpu=1,
    trainer_mem="1G",
    gpu_num=10,
    pserver_cpu=1,
    pserver_mem="1G"
)

How about this? (gpu_num=10 is the addition)

@Yancey1989 (Contributor, Author)

gpu_num means the number of GPU cards per trainer, right? Then it could be named trainer_gpu.
Since these two parameter groups are mutually exclusive, how about providing both configuration modes at the same time? For beginners or experimental jobs, specifying the job's total resources is enough, while advanced users can use the other parameter group to specify the trainer and pserver resources separately; a sketch of keeping the two groups exclusive follows.
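
A hypothetical sketch of how PaddleJob might reject mixing the two parameter groups; the group membership and the helper name are assumptions, not the actual implementation:

# The two mutually exclusive parameter groups discussed above.
SIMPLE = {"cpu_nums", "gpu_nums", "memory"}
ADVANCED = {"trainers", "pservers", "trainer_cpu", "trainer_gpu",
            "trainer_mem", "pserver_cpu", "pserver_mem"}

def validate_job_args(**kwargs):
    given = set(kwargs)
    if given & SIMPLE and given & ADVANCED:
        raise ValueError("Specify either total resources or per-process "
                         "resources, not both: %s" % sorted(given))

validate_job_args(cpu_nums=10, gpu_nums=2, memory="10G")  # OK
# validate_job_args(cpu_nums=10, trainers=3)  # raises ValueError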

@Yancey1989 Yancey1989 reopened this May 11, 2017
@jacquesqiao (Member)

I think we can implement the precise configuration first, and implement an automatic-calculation version later if it proves convenient.

@Yancey1989 (Contributor, Author)

@jacquesqiao @helinwang Agreed: the first version will implement the precise configuration, and we'll implement automatic calculation once its design is thought through :)

@wanghaoshuang (Contributor)

Closing this issue due to inactivity; feel free to reopen it.
