[launcher] look ma, no more zombies#714
Conversation
|
Thank you @stas! This has been on our backlog to investigate for a long time and a huge annoyance. Can’t wait to try this out soon. |
|
Let's trade - I will fix annoying issues and perhaps someone could give me a bit of guidance with the 2D integration - as it looks I'm not making any progress - your docs say it supports non-DeepSpeed pipeline but it doesn't look it is the case :( #710 really could use help from someone who has the deepspeed internals understanding related to more than 1D parallelism. |
|
Thanks @stas00 for the help here! We've tested on our side and this is awesome. One remaining note though, this will fix the zombie problem on a single-node/multi-process setup however the issue currently still remains on a multi-node/multi-process setup. I think it shouldn't be too bad to extend this to work on multi-node though now that we have the single-node/multi-process setup working. |
* Dist testing backend fixes, etc. (deepspeedai#708) * set_batch_fn and remove old sanity check (deepspeedai#712) * properly set engine.local_rank if it's set to -1 * Add executable permission to `ds_elastic` and `ds_report` in `bin`. (deepspeedai#711) * Add executable permission to `ds_elastic` and `ds_report` in `bin`. * Automatic `ds_elastic` formatting Co-authored-by: Jeff Rasley <jerasley@microsoft.com> * local rank of -1 means not set (deepspeedai#720) * bump to 0.3.11 * [launcher] look ma, no more zombies (deepspeedai#714) Co-authored-by: Jeff Rasley <jerasley@microsoft.com> Co-authored-by: Jeff Rasley <jerasley@microsoft.com> Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com> Co-authored-by: Jon Eyolfson <eyolfson@gmail.com> Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
* Dist testing backend fixes, etc. (deepspeedai#708) * set_batch_fn and remove old sanity check (deepspeedai#712) * properly set engine.local_rank if it's set to -1 * Add executable permission to `ds_elastic` and `ds_report` in `bin`. (deepspeedai#711) * Add executable permission to `ds_elastic` and `ds_report` in `bin`. * Automatic `ds_elastic` formatting Co-authored-by: Jeff Rasley <jerasley@microsoft.com> * local rank of -1 means not set (deepspeedai#720) * bump to 0.3.11 * [launcher] look ma, no more zombies (deepspeedai#714) Co-authored-by: Jeff Rasley <jerasley@microsoft.com> * Improve starred expressions (deepspeedai#696) * Improve starred expressions `deepspeed/profiling/flops_profiler/profiler.py` uses starred expressions that are no longer valid with [PEP 617][1]. The new Python parser is in 3.9, and this change allows DeepSpeed to run with the newest Python version. I have not checked all locations that has this issue. However, this change allows me to run simple examples. [1]: https://www.python.org/dev/peps/pep-0617/ * Match style for "Improve starred expressions", although readability suffers The style guide might need to be updated for this new use case of expressions. Python [Issue 40631][1] includes more discussion on the change. [1]: https://bugs.python.org/issue40631 Co-authored-by: Cheng Li <pistasable@gmail.com> * Fixed typo in Readme. (deepspeedai#737) * 1bit_adam dependencies (deepspeedai#742) * Clickable screenshots (deepspeedai#746) * Fix docstring * Make screenshots clickable for easier viewing * Add flops profiler tutorial (deepspeedai#682) * work on flops profiler tutorial * update flops profiler tutorial * add flops profiler tutorial and fix names * work on flops profiler tutorial * update flops profiler tutorial * add flops profiler tutorial and fix names * fix tailing ws * fix names * remove multistep profiling and update docs * fix cases where functionals and submodules coexist in a parent module, update readme * fix typo * always invoke post hook function * fix module flops sum and update tests * update tutorial * Only initialize distributed if required (deepspeedai#734) Co-authored-by: Jeff Rasley <jerasley@microsoft.com> Co-authored-by: Jeff Rasley <jerasley@microsoft.com> Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com> Co-authored-by: Jon Eyolfson <eyolfson@gmail.com> Co-authored-by: Stas Bekman <stas00@users.noreply.github.com> Co-authored-by: Cheng Li <pistasable@gmail.com> Co-authored-by: TheDudeFromCI <thedudefromci@gmail.com> Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> Co-authored-by: Sean Naren <sean@grid.ai>
Almost identical to the solution merged into pytorch pytorch/pytorch#49305, this PR automatically takes care of killing zombies, should one of the processes fail (or get killed) or the parent process gets manually terminated by the operator.
As I already explained the 2 problems and the solution in full details here: pytorch/pytorch#49305 I hope it's ok if I don't repeat it again.
Thank you.