
deepspeed/launcher: add launcher_helper as each rank's start portal #4699

Merged
mrwyattii merged 9 commits into deepspeedai:master from YizhouZ:yizhou/launcher
Jan 27, 2024

Conversation

@YizhouZ
Copy link
Contributor

@YizhouZ YizhouZ commented Nov 17, 2023

File Changes:
multinode_runner.py: modify mpich runner to use launcher_helper
launcher_helper.py: init script to map env variables

Descriptions:
The previous MPICH runner could hit the Linux command-line size limit when the rank count is very high. After discussion, we optimized it with a helper script that serves as each rank's start portal and maps environment variables such as rank and local_rank for DeepSpeed. So far only the MPICH runner uses it, but it is designed to be extensible: any runner facing a similar situation can adopt it. Only the necessary args are passed to the helper script.

Let us know if there is any suggestion.
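To make the "start portal" idea concrete, here is a minimal sketch of what a launcher_helper-style script could look like. The environment-variable names (`PMI_RANK`, `MPI_LOCALRANKID`, `PMI_SIZE`) and the function names are illustrative assumptions, not the actual contents of the PR's `launcher_helper.py`:

```python
import os
import subprocess
import sys

# Assumed mapping from MPICH/PMI-provided variables to the names
# DeepSpeed reads; the real launcher_helper.py may use different keys.
ENV_MAPPING = {
    "RANK": "PMI_RANK",
    "LOCAL_RANK": "MPI_LOCALRANKID",
    "WORLD_SIZE": "PMI_SIZE",
}

def map_env(environ):
    """Copy MPI-provided variables into the names DeepSpeed expects,
    failing loudly if a source variable is missing."""
    env = dict(environ)
    for dst, src in ENV_MAPPING.items():
        if env.get(src) is None:
            raise EnvironmentError(f"{src} is not set; was this rank started by mpirun?")
        env[dst] = env[src]
    return env

def main():
    # Everything after the helper's own name is the per-rank user command,
    # e.g. "python train.py --deepspeed ...".
    env = map_env(os.environ)
    sys.exit(subprocess.call(sys.argv[1:], env=env))

if __name__ == "__main__" and len(sys.argv) > 1:
    main()
```

Because every rank runs the same helper and derives its identity from its own environment, the launcher no longer needs to spell out a distinct command segment per rank.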

launcher_helper.py: init script to map env variables
@YizhouZ
Copy link
Contributor Author

YizhouZ commented Nov 20, 2023

@tjruwase Hi, could you please help start the workflow for this PR? Thanks.

@YizhouZ
Copy link
Contributor Author

YizhouZ commented Dec 1, 2023

Hi @jeffra @awan-10 @tjruwase, is there any comment on this PR? Thanks!

@delock
Copy link
Collaborator

delock commented Dec 11, 2023

Hi @YizhouZ, can you show the command line launched by DeepSpeed before and after your PR, illustrating how your PR reduces the command-line length? Thanks!

@YizhouZ
Copy link
Contributor Author

YizhouZ commented Dec 18, 2023

Hi @YizhouZ, can you show the command line launched by DeepSpeed before and after your PR, illustrating how your PR reduces the command-line length? Thanks!

Sure.

Without this PR, when using the MPICH runner, the cmd would look like

cmd = mpirun --genv xxx -n 1 -env xxx python xxx.py : -n 1 -env xxx python xxx.py : <repeat local number times> : -n 1 -env xxx python xxx.py

When the total number of ranks grows large, the cmd becomes extremely long and hits the command-line size limit.

After this PR, the cmd would be much shorter:

cmd = mpirun --genv xxx python xxx.py <with no more additional cmds>
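The difference can be sketched with a toy illustration (this is not DeepSpeed's actual launcher code; the module path and env names are assumptions): the old form appends one colon-separated segment per rank, so its length grows linearly with the rank count, while the helper-based form stays constant.

```python
def mpich_cmd_old(num_ranks):
    # Old style: one "-n 1 -env ..." segment per rank, joined by ':'.
    segment = "-n 1 -env RANK_SPECIFIC_ENV python train.py"
    return "mpirun --genv SHARED_ENV " + " : ".join([segment] * num_ranks)

def mpich_cmd_new():
    # New style: a single segment; per-rank env mapping happens
    # inside the (hypothetically named) launcher_helper module.
    return ("mpirun --genv SHARED_ENV "
            "python -m deepspeed.launcher.launcher_helper python train.py")

# len(mpich_cmd_old(n)) grows linearly in n; len(mpich_cmd_new()) is fixed,
# so it cannot overflow the kernel's argument-size limit however many ranks run.
```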

@YizhouZ YizhouZ requested a review from mrwyattii as a code owner January 5, 2024 13:36
@tjruwase tjruwase requested review from ShadenSmith and tjruwase and removed request for awan-10 and jeffra January 5, 2024 13:37
@delock
Copy link
Collaborator

delock commented Jan 16, 2024

Hi @mrwyattii, do you have any comments on this PR? This PR is essential for running DeepSpeed training on thousands of nodes with MPICH. The former implementation made the command line so long that it overflowed the command-line buffer. The new implementation fixes this issue.

@YizhouZ
Copy link
Contributor Author

YizhouZ commented Jan 25, 2024

Hi @tjruwase @mrwyattii, do you have any comments on this PR? As @delock said previously, this new implementation is essential for us to enable DeepSpeed training on a large number of nodes. Otherwise the training process would hit the Linux command-line limits.

@tjruwase
Copy link
Contributor

@YizhouZ, @delock, thanks for this great PR. Apologies for the delay. We will review asap.

@loadams loadams added this pull request to the merge queue Jan 26, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jan 26, 2024
@mrwyattii mrwyattii merged commit 6e1a680 into deepspeedai:master Jan 27, 2024
@YizhouZ YizhouZ deleted the yizhou/launcher branch January 29, 2024 06:08
@YizhouZ
Copy link
Contributor Author

YizhouZ commented Jan 29, 2024

Thank you so much for this quick merge! @tjruwase @mrwyattii @loadams

github-merge-queue bot pushed a commit that referenced this pull request Jan 29, 2024
…5025)

Hi, my last PR #4699 about launcher_helper mistakenly used "PMIX" twice. In this PR I corrected them to "PMIX" and "PMI". I also added an _EnvironmentError_ check to ensure no env variable is _None_, which would otherwise trigger an env setting error.
mauryaavinash95 pushed a commit to mauryaavinash95/DeepSpeed that referenced this pull request Feb 17, 2024 (…eepspeedai#4699)
mauryaavinash95 pushed a commit to mauryaavinash95/DeepSpeed that referenced this pull request Feb 17, 2024 (…eepspeedai#5025)
rraminen pushed a commit to ROCm/DeepSpeed that referenced this pull request May 9, 2024 (…eepspeedai#4699)
rraminen pushed a commit to ROCm/DeepSpeed that referenced this pull request May 9, 2024 (…eepspeedai#5025)