deepspeed/runtime/engine.py
Outdated
'local_rank') else int(os.environ.get("LOCAL_RANK",
-1))
env_local_rank = int(os.environ.get("LOCAL_RANK", -1))
if env_local_rank >= 0:
I think it would be nice if this validation logic was moved into _do_args_sanity_check().
* Dist testing backend fixes, etc. (deepspeedai#708)
* set_batch_fn and remove old sanity check (deepspeedai#712)
* properly set engine.local_rank if it's set to -1
* Add executable permission to `ds_elastic` and `ds_report` in `bin`. (deepspeedai#711)
* Add executable permission to `ds_elastic` and `ds_report` in `bin`.
* Automatic `ds_elastic` formatting
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
* local rank of -1 means not set (deepspeedai#720)
* bump to 0.3.11
* [launcher] look ma, no more zombies (deepspeedai#714)
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
* Improve starred expressions (deepspeedai#696)
* Improve starred expressions
`deepspeed/profiling/flops_profiler/profiler.py` uses starred expressions that are no longer valid with [PEP 617][1]. The new Python parser is in 3.9, and this change allows DeepSpeed to run with the newest Python version. I have not checked all locations that has this issue. However, this change allows me to run simple examples.
[1]: https://www.python.org/dev/peps/pep-0617/
* Match style for "Improve starred expressions", although readability suffers
The style guide might need to be updated for this new use case of expressions. Python [Issue 40631][1] includes more discussion on the change.
[1]: https://bugs.python.org/issue40631
Co-authored-by: Cheng Li <pistasable@gmail.com>
* Fixed typo in Readme. (deepspeedai#737)
* 1bit_adam dependencies (deepspeedai#742)
* Clickable screenshots (deepspeedai#746)
* Fix docstring
* Make screenshots clickable for easier viewing
* Add flops profiler tutorial (deepspeedai#682)
* work on flops profiler tutorial
* update flops profiler tutorial
* add flops profiler tutorial and fix names
* work on flops profiler tutorial
* update flops profiler tutorial
* add flops profiler tutorial and fix names
* fix tailing ws
* fix names
* remove multistep profiling and update docs
* fix cases where functionals and submodules coexist in a parent module, update readme
* fix typo
* always invoke post hook function
* fix module flops sum and update tests
* update tutorial
* Only initialize distributed if required (deepspeedai#734)
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: Jon Eyolfson <eyolfson@gmail.com>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
Co-authored-by: Cheng Li <pistasable@gmail.com>
Co-authored-by: TheDudeFromCI <thedudefromci@gmail.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Sean Naren <sean@grid.ai>
assert type(args.local_rank) == int, local_rank_err
if "LOCAL_RANK" in os.environ:
    env_local_rank = int(os.environ.get("LOCAL_RANK", -1))
    assert env_local_rank == args.local_rank, \
@jeffra SOS! the addition of this assertion is causing my jobs to fail now!
Why was this assert added? A downstream script can have args.local_rank existing in its argument parser and yet be totally stale until DeepSpeed overrides its value via MPI discovery, etc.
For instance, a downstream script may be written in a manner where it can be used for both single-GPU (without DeepSpeed) and multi-GPU training (with as well as without DeepSpeed), wherein single-GPU training is equivalent to defaulting to args.local_rank = -1.
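To make the scenario concrete, here is a minimal sketch of how a script-side check could treat a negative `args.local_rank` as "not set" instead of asserting on it. The helper name `resolve_local_rank` and its `environ` parameter are hypothetical and are not part of DeepSpeed; only the `LOCAL_RANK` variable and the `-1` default come from the discussion above.

```python
import argparse
import os


def resolve_local_rank(args, environ=None):
    """Hypothetical helper: pick a local rank, treating negative values as unset."""
    environ = os.environ if environ is None else environ
    arg_rank = getattr(args, "local_rank", -1)
    env_rank = int(environ.get("LOCAL_RANK", -1))

    # Only complain when both sources are actually set and disagree.
    if arg_rank >= 0 and env_rank >= 0 and arg_rank != env_rank:
        raise ValueError(
            f"Mismatch: args.local_rank={arg_rank}, LOCAL_RANK={env_rank}")

    # Prefer an explicit argument; otherwise fall back to the environment.
    # The result may still be -1 when nothing is set (the single-GPU case).
    return arg_rank if arg_rank >= 0 else env_rank


parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=-1)
args = parser.parse_args([])  # the script was started without --local_rank

single_gpu_rank = resolve_local_rank(args, environ={})
launched_rank = resolve_local_rank(args, environ={"LOCAL_RANK": "3"})
```

With this shape, a stale default of `-1` never conflicts with whatever the launcher exports, while two real, different values are still flagged.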
|
@jeffra Okay, so with the latest build, which includes #720, the assertion error I was getting earlier went away. Having said that, I see with this latest build of DeepSpeed that it is no longer updating […]. Earlier, the behavior was such that if the client script has the command-line arg, then deepspeed would overwrite the arg to contain the right values as identified by […]. Would appreciate any assistance here!
|
Hi @g-karthik, sorry for my delayed response on this blocking issue :( Glad to hear that #720 fixed the issue. Are you installing via PyPI or from source? PyPI is a bit out of date with master right now. I believe #764 should fix the issue you are now seeing. Can you give the PR branch a try though?
|
Hi @jeffra, I'm installing from source. Looks like you've merged it to master already, so I'll try the master branch! I'm a bit confused though: which commit caused this stale-args regression? I'm unable to find it. It worked totally fine earlier, i.e., […]
|
@jeffra the fix in #764 seems to work, thanks for this! But for my sanity, can you point me to the commit where this was originally "undone", i.e., […]? I'm trying to reconcile this particular behavior with past commits and pretty much all I see is […]. Without changing a single line of code in my scripts, I see that without your #764 fix, every rank tries to write a model checkpoint directory. But that was not happening 2-3 weeks ago, i.e., only one rank used to write a model checkpoint directory.
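One plausible reading of the symptom, sketched under assumptions: many training scripts guard checkpoint writes with a check like `local_rank in (-1, 0)`. If `args.local_rank` is left at a stale `-1` on every process, each process passes that guard. The function names below are hypothetical, and the guard is a common pattern rather than the code from the scripts discussed here.

```python
import os
import tempfile


def should_write_checkpoint(local_rank):
    """Common guard (assumed): -1 means not distributed, 0 is the first process."""
    return local_rank in (-1, 0)


def maybe_make_checkpoint_dir(base_dir, local_rank):
    """Create the checkpoint directory only on a process that passes the guard."""
    if not should_write_checkpoint(local_rank):
        return None
    path = os.path.join(base_dir, "checkpoint")
    os.makedirs(path, exist_ok=True)
    return path


with tempfile.TemporaryDirectory() as tmp:
    # Ranks correctly propagated: only rank 0 creates the directory.
    correct = [maybe_make_checkpoint_dir(tmp, rank) for rank in (0, 1, 2, 3)]

# Ranks left stale at -1 on all four processes: every process passes the guard.
stale = [should_write_checkpoint(-1) for _ in range(4)]
```

Note that on multi-node jobs a local rank of 0 exists once per node, so a guard based on the global rank may be more appropriate there.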
|
I think I may have added this regression in #608
Misc. distributed changes, outlined below.
* SimpleModel(empty_grads=True): empty_grads should only be true for tests where we want to test gradient imbalance issues. Only a few unit tests do this, so I removed a lot of unnecessary empty_grads=True calls.
* […] --local_rank <n> to each sub-process, so for now the user node will still need support for this.
* […] init_method param for torch distributed initialization (this was motivated by user request)
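As background on the `init_method` item, PyTorch's `torch.distributed.init_process_group` accepts an `init_method` string such as `env://` or `tcp://<host>:<port>`. The sketch below only builds such a string; `make_init_method` is a hypothetical helper, and how DeepSpeed forwards the value is not shown in this conversation.

```python
def make_init_method(master_addr=None, master_port=None):
    """Hypothetical helper: build an init_method string for torch.distributed."""
    if master_addr is None and master_port is None:
        # Let PyTorch read MASTER_ADDR / MASTER_PORT from the environment.
        return "env://"
    if master_addr is None or master_port is None:
        raise ValueError("master_addr and master_port must be given together")
    return f"tcp://{master_addr}:{int(master_port)}"


default_method = make_init_method()
explicit_method = make_init_method("10.1.1.20", 23456)

# With PyTorch installed, the string would be passed along these lines:
# torch.distributed.init_process_group(
#     backend="nccl", init_method=explicit_method, rank=rank, world_size=world_size)
```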