deepspeed/launcher: add launcher_helper as each rank's start portal #4699
mrwyattii merged 9 commits into deepspeedai:master
launcher_helper.py: init script to map env variables
@tjruwase Hi, could you please help start the workflow for this PR? Thanks.
Hi @YizhouZ, can you show the command line launched by deepspeed before and after your PR, illustrating how your PR helps reduce the command line length? Thanks!
Sure. Without this PR, when using the mpich runner, the cmd would look like this:
When the total number of ranks gets high, the cmd becomes extremely long and triggers the cmd word size limitations.
After this PR, the cmd is much shorter:
Hi @mrwyattii, do you have any comments on this PR? This PR is essential when we need to run DeepSpeed training on thousands of nodes with MPICH. The former implementation makes the command line so long that it overflows the command line buffer. The new implementation fixes this issue.
Hi @tjruwase @mrwyattii, do you have any comments on this PR? As @delock said previously, this new implementation is essential for us to enable DeepSpeed training on a large number of nodes. Otherwise the training launch would hit Linux command line limits.
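To make the length concern above concrete, here is a rough sketch that compares two command layouts. It is not the command DeepSpeed actually builds: the per-rank layout, the `rank_helper.py` script name, the host names, and the user command are all hypothetical and only illustrate how one argument group per rank grows with the rank count, while a single shared group stays short.

```python
import os


def per_rank_cmd(hosts, ppn, user_cmd):
    """Hypothetical layout: one colon-separated argument group per rank."""
    cmd = ["mpirun"]
    rank = 0
    for host in hosts:
        for local_rank in range(ppn):
            if rank > 0:
                cmd.append(":")  # separator between per-rank groups
            cmd += ["-n", "1", "-host", host,
                    "-env", "RANK", str(rank),
                    "-env", "LOCAL_RANK", str(local_rank)]
            cmd += user_cmd  # the user command is repeated for every rank
            rank += 1
    return cmd


def helper_cmd(hosts, ppn, user_cmd):
    """Hypothetical layout: one group; a helper script maps env per rank."""
    return (["mpirun", "-n", str(len(hosts) * ppn), "-ppn", str(ppn),
             "-hosts", ",".join(hosts),
             "python", "rank_helper.py"] + user_cmd)


def cmd_bytes(cmd):
    """Approximate size of the argument vector (each arg plus a terminator)."""
    return sum(len(arg.encode()) + 1 for arg in cmd)


if __name__ == "__main__":
    hosts = [f"node{i:04d}" for i in range(1000)]
    user_cmd = ["python", "train.py", "--deepspeed"]
    print("per-rank layout:", cmd_bytes(per_rank_cmd(hosts, 12, user_cmd)), "bytes")
    print("helper layout:  ", cmd_bytes(helper_cmd(hosts, 12, user_cmd)), "bytes")
    # Print the system limit for comparison where the platform exposes it.
    if hasattr(os, "sysconf") and "SC_ARG_MAX" in os.sysconf_names:
        print("ARG_MAX:", os.sysconf("SC_ARG_MAX"))
```

The per-rank layout grows with hosts times ranks per host, whereas the shared layout only grows with the host list.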
Thank you so much for this quick merge! @tjruwase @mrwyattii @loadams
…eepspeedai#4699)
File Changes:
multinode_runner.py: modify mpich runner to use launcher_helper
launcher_helper.py: init script to map env variables
Descriptions:
The previous mpich runner could hit Linux command line size limits when the number of ranks is extremely high. After discussion, we decided to fix this by using a helper script as each rank's start portal, which maps env variables such as rank and local_rank for deepspeed. So far we only use it for the mpich runner, but it is designed to be extensible: any runner facing a similar situation can be added. Only the necessary args are passed to the helper script.
Let us know if you have any suggestions.
---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Michael Wyatt <mrwyattii@gmail.com>
…eepspeedai#5025) Hi, my last PR deepspeedai#4699 about launcher_helper mistakenly used "PMIX" twice. This PR corrects them to "PMIX" and "PMI". I also added an _EnvironmentError_ to make sure the env value is never _None_; otherwise it would trigger an env setting error.
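The kind of guard described in this follow-up can be sketched as below. This is an illustration, not the code from the PR: the `require_env` helper and the variable name passed to it are hypothetical. The motivation is that assigning `None` into `os.environ` raises a `TypeError`, so failing early with a descriptive error is easier to debug.

```python
def require_env(env, name):
    """Return env[name], raising EnvironmentError if the variable is missing.

    Hypothetical helper for illustration; not taken from launcher_helper.py.
    """
    value = env.get(name)
    if value is None:
        # Fail with a clear message instead of letting None reach os.environ.
        raise EnvironmentError(f"{name} is not set in the current environment")
    return value
```

A caller would use it as `rank = require_env(os.environ, "PMI_RANK")`, where the variable name depends on the launcher and is an assumption here.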
File Changes:
multinode_runner.py: modify mpich runner to use launcher_helper
launcher_helper.py: init script to map env variables
Descriptions:
The previous mpich runner could hit Linux command line size limits when the number of ranks is extremely high. After discussion, we decided to fix this by using a helper script as each rank's start portal, which maps env variables such as rank and local_rank for deepspeed. So far we only use it for the mpich runner, but it is designed to be extensible: any runner facing a similar situation can be added. Only the necessary args are passed to the helper script.
Let us know if you have any suggestions.
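As a rough illustration of the "start portal" idea described above, the sketch below copies launcher-provided rank variables into `RANK` and `LOCAL_RANK` and then starts the user command with that environment. It is a minimal sketch, not the contents of launcher_helper.py: the function names are hypothetical, and the source variable names (`PMI_RANK`, `PMI_LOCAL_RANK`) are assumptions about what the MPI launcher exports.

```python
import os
import subprocess
import sys


def map_rank_env(env, rank_name, local_rank_name):
    """Return a copy of env with RANK and LOCAL_RANK filled in.

    rank_name and local_rank_name are the launcher-specific variable
    names (assumed here, e.g. "PMI_RANK" and "PMI_LOCAL_RANK").
    """
    mapped = dict(env)
    mapped["RANK"] = env[rank_name]              # KeyError if the launcher did not set it
    mapped["LOCAL_RANK"] = env[local_rank_name]
    return mapped


def run_with_mapped_env(user_cmd, env, rank_name="PMI_RANK",
                        local_rank_name="PMI_LOCAL_RANK"):
    """Start the user command once per rank with the mapped environment."""
    return subprocess.call(user_cmd, env=map_rank_env(env, rank_name, local_rank_name))


if __name__ == "__main__":
    # Everything after the script name is treated as the user command.
    sys.exit(run_with_mapped_env(sys.argv[1:], os.environ))
```

Because every rank runs the same helper and reads its own rank from the environment, the launch command no longer needs a separate argument group per rank.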