How to specify the distributed training job resource #2019
Agree with option 2. We could also keep option 1 as advanced configuration options that are hidden by default. If possible, we should implement a scheduler plugin that computes a reasonable scheduling plan from the total resources the user specifies, and then launches the job.
Agree with option 2: the user only specifies the CPU number for CPU-only jobs and the GPU number for GPU-only jobs; a memory limit is also a good idea. For pservers, if any, instead of an exact count we could offer a few buckets such as "few", "more", and "a lot".
Also, a pure-GPU mode probably won't exist: even when the number of GPU cards is specified, a CPU limit still has to be set. Perhaps a GPU-first approach is appropriate, for example:
## trainer limit
if gpu_num > 0:
    trainer_count = gpu_num
    trainer_cpu = cpu_num / trainer_count
    trainer_memory = memory / trainer_count
else:
    trainer_count = cpu_num
    trainer_cpu = 1  # one CPU per trainer in the CPU-only case
    trainer_memory = memory / trainer_count
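The GPU-first branching above can be sketched as a single function. This is only an illustration of the comment's idea, not PaddlePaddle code; the function name `plan_trainers` and its return shape are assumptions.

```python
def plan_trainers(gpu_num, cpu_num, memory):
    """Return (trainer_count, cpu_per_trainer, memory_per_trainer).

    GPU-first: if any GPUs are requested, run one trainer per GPU and
    split the CPUs and memory evenly; otherwise run one trainer per CPU.
    """
    if gpu_num > 0:
        trainer_count = gpu_num
    else:
        trainer_count = cpu_num
    return (trainer_count,
            cpu_num / trainer_count,
            memory / trainer_count)

print(plan_trainers(2, 10, 20))  # (2, 5.0, 10.0)
```

Note that even in the GPU case every trainer still receives a CPU and memory share, matching the point that a pure-GPU mode does not exist.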
@Yancey1989 Can you please update the job submit design doc for this and close this issue?
@typhoonzero I have already updated the design doc: #1770, and it is under review. I'll close this issue :)
PaddleJob(
    cpu_nums=10,
    gpu_nums=2,
    memory="10G"
)
By default, the user only specifies the job's total resources; the scheduler decides how many trainers and pservers to use based on the cluster configuration and the job's resource size.
PaddleJob(
    trainers=3,
    pservers=4,
    trainer_cpu=1,
    trainer_mem="1G",
    pserver_cpu=1,
    pserver_mem="1G"
)
The advanced configuration specifies the trainer and pserver settings explicitly.
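A scheduler plugin could expand the simple form into the explicit one. The sketch below is only illustrative: the function name `expand_simple_job`, the one-pserver-per-four-trainers heuristic, and the even memory split are all assumptions, not anything decided in this thread.

```python
def expand_simple_job(cpu_nums, gpu_nums, memory_gb):
    """Derive explicit trainer/pserver settings from job totals (sketch)."""
    # GPU jobs: one trainer per GPU; CPU-only jobs: one trainer per CPU.
    trainers = gpu_nums if gpu_nums > 0 else cpu_nums
    # Assumed heuristic: roughly one pserver per four trainers, at least one.
    pservers = max(1, trainers // 4)
    # Split total memory evenly across all pods, with a 1G floor.
    per_pod_mem = max(1, memory_gb // (trainers + pservers))
    return {
        "trainers": trainers,
        "pservers": pservers,
        "trainer_cpu": max(1, cpu_nums // trainers),
        "trainer_mem": "%dG" % per_pod_mem,
        "pserver_cpu": 1,
        "pserver_mem": "%dG" % per_pod_mem,
    }

print(expand_simple_job(cpu_nums=10, gpu_nums=2, memory_gb=10))
```

For the simple example above (10 CPUs, 2 GPUs, 10G), this heuristic yields 2 trainers and 1 pserver with 3G each; a real scheduler would also consult the cluster's physical configuration.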
How about implementing the advanced configuration first? We just follow it as given, whereas the simple configuration still requires us to schedule and allocate resources ourselves.
PaddleJob(
    trainers=3,
    pservers=4,
    trainer_cpu=1,
    trainer_mem="1G",
    gpu_num=10,
    pserver_cpu=1,
    pserver_mem="1G"
)
How about this? (It adds gpu_num=10.)
I feel we can implement the exact configuration first, and add an auto-computed version later if that turns out to be convenient.
@jacquesqiao @helinwang Agreed: implement the exact configuration in the first version, and add auto-computation once that design is fully thought through :)
Closing this issue due to inactivity, feel free to reopen it.
When a user submits a distributed training job, the cluster needs to determine the following resources:
- The user specifies only the total CPU/GPU count and memory limit; the pserver/trainer count is then decided from the cluster's physical configuration, and the pserver/trainer memory, pserver CPU limit, and trainer CPU/GPU limit are allocated proportionally. For example, the scheduler should prefer trainer_gpu_num=4 with trainer_count=2 rather than trainer_gpu_num=8 with trainer_count=1.
- The user specifies the pserver/trainer count, the pserver CPU limit, and the trainer CPU limit directly.
Different models may require different allocation ratios, so a fixed scheme cannot be optimal.
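The stated preference for trainer_gpu_num=4, trainer_count=2 over trainer_gpu_num=8, trainer_count=1 can be sketched as capping the GPUs per trainer and spreading the rest. The cap of 4 GPUs per trainer and the function name `split_gpus` are assumptions for illustration only.

```python
def split_gpus(total_gpus, max_gpus_per_trainer=4):
    """Spread GPUs across the fewest trainers without exceeding the cap."""
    # Ceiling division: minimum number of trainers that respects the cap.
    trainer_count = -(-total_gpus // max_gpus_per_trainer)
    return total_gpus // trainer_count, trainer_count

print(split_gpus(8))  # (4, 2), not (8, 1)
```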