Skip to content

Comments

[launcher] look ma, no more zombies#714

Merged
jeffra merged 4 commits intodeepspeedai:masterfrom
stas00:zombies
Feb 4, 2021
Merged

[launcher] look ma, no more zombies#714
jeffra merged 4 commits intodeepspeedai:masterfrom
stas00:zombies

Conversation

@stas00
Copy link
Collaborator

@stas00 stas00 commented Jan 30, 2021

Almost identical to the solution merged into pytorch pytorch/pytorch#49305, this PR automatically takes care of killing zombies, should one of the processes fail (or get killed) or the parent process gets manually terminated by the operator.

As I already explained the 2 problems and the solution in full details here: pytorch/pytorch#49305 I hope it's ok if I don't repeat it again.

Thank you.

@jeffra
Copy link
Collaborator

jeffra commented Jan 30, 2021

Thank you @stas! This has been on our backlog to investigate for a long time and a huge annoyance. Can’t wait to try this out soon.

@stas00
Copy link
Collaborator Author

stas00 commented Jan 30, 2021

Let's trade - I will fix annoying issues and perhaps someone could give me a bit of guidance with the 2D integration - as it looks I'm not making any progress - your docs say it supports non-DeepSpeed pipeline but it doesn't look it is the case :( #710 really could use help from someone who has the deepspeed internals understanding related to more than 1D parallelism.

@jeffra
Copy link
Collaborator

jeffra commented Feb 4, 2021

Thanks @stas00 for the help here! We've tested on our side and this is awesome. One remaining note though, this will fix the zombie problem on a single-node/multi-process setup however the issue currently still remains on a multi-node/multi-process setup. I think it shouldn't be too bad to extend this to work on multi-node though now that we have the single-node/multi-process setup working.

@jeffra jeffra merged commit 4f1d827 into deepspeedai:master Feb 4, 2021
@stas00 stas00 deleted the zombies branch February 4, 2021 20:31
ConnorJL pushed a commit to EleutherAI/DeeperSpeed that referenced this pull request Feb 8, 2021
* Dist testing backend fixes, etc. (deepspeedai#708)

* set_batch_fn and remove old sanity check (deepspeedai#712)

* properly set engine.local_rank if it's set to -1

* Add executable permission to `ds_elastic` and `ds_report` in `bin`. (deepspeedai#711)

* Add executable permission to `ds_elastic` and `ds_report` in `bin`.

* Automatic `ds_elastic` formatting

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* local rank of -1 means not set (deepspeedai#720)

* bump to 0.3.11

* [launcher] look ma, no more zombies (deepspeedai#714)

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: Jon Eyolfson <eyolfson@gmail.com>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
sdtblck added a commit to EleutherAI/DeeperSpeed that referenced this pull request Feb 11, 2021
* Dist testing backend fixes, etc. (deepspeedai#708)

* set_batch_fn and remove old sanity check (deepspeedai#712)

* properly set engine.local_rank if it's set to -1

* Add executable permission to `ds_elastic` and `ds_report` in `bin`. (deepspeedai#711)

* Add executable permission to `ds_elastic` and `ds_report` in `bin`.

* Automatic `ds_elastic` formatting

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* local rank of -1 means not set (deepspeedai#720)

* bump to 0.3.11

* [launcher] look ma, no more zombies (deepspeedai#714)

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* Improve starred expressions (deepspeedai#696)

* Improve starred expressions

`deepspeed/profiling/flops_profiler/profiler.py` uses starred expressions
that are no longer valid with [PEP 617][1]. The new Python parser is in 3.9,
and this change allows DeepSpeed to run with the newest Python version. I have
not checked all locations that has this issue. However, this change allows me
to run simple examples.

[1]: https://www.python.org/dev/peps/pep-0617/

* Match style for "Improve starred expressions", although readability suffers

The style guide might need to be updated for this new use case of expressions.
Python [Issue 40631][1] includes more discussion on the change.

[1]: https://bugs.python.org/issue40631

Co-authored-by: Cheng Li <pistasable@gmail.com>

* Fixed typo in Readme. (deepspeedai#737)

* 1bit_adam dependencies (deepspeedai#742)

* Clickable screenshots (deepspeedai#746)

* Fix docstring

* Make screenshots clickable for easier viewing

* Add flops profiler tutorial (deepspeedai#682)

* work on flops profiler tutorial

* update flops profiler tutorial

* add flops profiler tutorial and fix names

* work on flops profiler tutorial

* update flops profiler tutorial

* add flops profiler tutorial and fix names

* fix tailing ws

* fix names

* remove multistep profiling and update docs

* fix cases where functionals and submodules coexist in a parent module, update readme

* fix typo

* always invoke post hook function

* fix module flops sum and update tests

* update tutorial

* Only initialize distributed if required (deepspeedai#734)

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: Jon Eyolfson <eyolfson@gmail.com>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
Co-authored-by: Cheng Li <pistasable@gmail.com>
Co-authored-by: TheDudeFromCI <thedudefromci@gmail.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Sean Naren <sean@grid.ai>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants