Dist testing backend fixes, etc.#708

Merged
jeffra merged 9 commits into master from jeffra/dist_changes
Jan 29, 2021
Conversation


@jeffra jeffra commented Jan 28, 2021

Misc. distributed changes, outlined below.

  • Update the distributed testing backend to not remove args.local_rank patching. Not all unit tests were setting the local rank properly, which led to many tests running on only one CUDA device even when world size > 1.
  • The previous fix exposed a hanging issue with ZeRO-2 tests and SimpleModel(empty_grads=True); empty_grads should only be true for tests that exercise gradient-imbalance issues. Only a few unit tests need this, so I removed many unnecessary empty_grads=True calls.
  • Add support for env["LOCAL_RANK"] in deepspeed.initialize, which means we (technically) no longer require args.local_rank to be passed into deepspeed.initialize via args. Our launcher still passes --local_rank <n> to each sub-process, so for now user code still needs to support this.
  • Expose the init_method param for torch distributed initialization (motivated by a user request).
  • Remove some transformer-kernel unit test cases that were failing on V100; @RezaYazdaniAminabadi is investigating this separately and will have an updated PR soon.

```python
env_local_rank = int(os.environ.get("LOCAL_RANK", -1))
if env_local_rank >= 0:
```
Contributor


I think it would be nice if this validation logic was moved into _do_args_sanity_check().

@jeffra jeffra merged commit 2e2dd86 into master Jan 29, 2021
@jeffra jeffra deleted the jeffra/dist_changes branch January 29, 2021 21:08
ConnorJL pushed a commit to EleutherAI/DeeperSpeed that referenced this pull request Feb 8, 2021
* Dist testing backend fixes, etc. (deepspeedai#708)

* set_batch_fn and remove old sanity check (deepspeedai#712)

* properly set engine.local_rank if it's set to -1

* Add executable permission to `ds_elastic` and `ds_report` in `bin`. (deepspeedai#711)

* Add executable permission to `ds_elastic` and `ds_report` in `bin`.

* Automatic `ds_elastic` formatting

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* local rank of -1 means not set (deepspeedai#720)

* bump to 0.3.11

* [launcher] look ma, no more zombies (deepspeedai#714)

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: Jon Eyolfson <eyolfson@gmail.com>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
sdtblck added a commit to EleutherAI/DeeperSpeed that referenced this pull request Feb 11, 2021
* Dist testing backend fixes, etc. (deepspeedai#708)

* set_batch_fn and remove old sanity check (deepspeedai#712)

* properly set engine.local_rank if it's set to -1

* Add executable permission to `ds_elastic` and `ds_report` in `bin`. (deepspeedai#711)

* Add executable permission to `ds_elastic` and `ds_report` in `bin`.

* Automatic `ds_elastic` formatting

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* local rank of -1 means not set (deepspeedai#720)

* bump to 0.3.11

* [launcher] look ma, no more zombies (deepspeedai#714)

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* Improve starred expressions (deepspeedai#696)

* Improve starred expressions

`deepspeed/profiling/flops_profiler/profiler.py` uses starred expressions
that are no longer valid with [PEP 617][1]. The new Python parser ships in 3.9,
and this change allows DeepSpeed to run with the newest Python version. I have
not checked all locations that have this issue; however, this change allows me
to run simple examples.

[1]: https://www.python.org/dev/peps/pep-0617/

* Match style for "Improve starred expressions", although readability suffers

The style guide might need to be updated for this new use case of expressions.
Python [Issue 40631][1] includes more discussion on the change.

[1]: https://bugs.python.org/issue40631

Co-authored-by: Cheng Li <pistasable@gmail.com>

* Fixed typo in Readme. (deepspeedai#737)

* 1bit_adam dependencies (deepspeedai#742)

* Clickable screenshots (deepspeedai#746)

* Fix docstring

* Make screenshots clickable for easier viewing

* Add flops profiler tutorial (deepspeedai#682)

* work on flops profiler tutorial

* update flops profiler tutorial

* add flops profiler tutorial and fix names

* work on flops profiler tutorial

* update flops profiler tutorial

* add flops profiler tutorial and fix names

* fix tailing ws

* fix names

* remove multistep profiling and update docs

* fix cases where functionals and submodules coexist in a parent module, update readme

* fix typo

* always invoke post hook function

* fix module flops sum and update tests

* update tutorial

* Only initialize distributed if required (deepspeedai#734)

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: Jon Eyolfson <eyolfson@gmail.com>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
Co-authored-by: Cheng Li <pistasable@gmail.com>
Co-authored-by: TheDudeFromCI <thedudefromci@gmail.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Sean Naren <sean@grid.ai>
```python
assert type(args.local_rank) == int, local_rank_err
if "LOCAL_RANK" in os.environ:
    env_local_rank = int(os.environ.get("LOCAL_RANK", -1))
    assert env_local_rank == args.local_rank, \
```


@jeffra SOS! The addition of this assertion is causing my jobs to fail now!

Why was this assert added? A downstream script can have args.local_rank existing in its argument parser and yet be totally stale until DeepSpeed overrides its value via MPI discovery, etc.

For instance, a downstream script may be written in a manner where it can be used for both single-GPU (without DeepSpeed) and multi-GPU training (with as well as without DeepSpeed), wherein single-GPU training is equivalent to defaulting to args.local_rank = -1.
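The failure mode described in this comment can be reproduced with a small standalone sketch. The function name strict_local_rank_check is hypothetical; it mirrors the strict equality assert quoted above:

```python
import os
import types


def strict_local_rank_check(args):
    """Hypothetical sketch of the strict assert reported here: it
    demands args.local_rank equal env LOCAL_RANK exactly."""
    if "LOCAL_RANK" in os.environ:
        env_local_rank = int(os.environ["LOCAL_RANK"])
        assert env_local_rank == args.local_rank, \
            "args.local_rank does not match LOCAL_RANK"


# The launcher (or MPI discovery) exports LOCAL_RANK, but the client
# parser left args.local_rank at its single-GPU default of -1.
os.environ["LOCAL_RANK"] = "0"
stale_args = types.SimpleNamespace(local_rank=-1)
try:
    strict_local_rank_check(stale_args)
    tripped = False
except AssertionError:
    tripped = True
# tripped is True: the stale parser default alone trips the assert,
# even though the job configuration itself is valid.
```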


@g-karthik g-karthik left a comment


@jeffra please take a look at my inline comment! Your addition of that specific assert is causing my jobs to fail now.

EDIT: Oh, hmm, looks like you addressed this in a follow-up PR #720. Perhaps my build doesn't contain this fix. Will re-build and test.

@g-karthik

@jeffra Okay so with the latest build which includes #720, the assertion error I was getting earlier went away.

Having said that, I see with this latest build of DeepSpeed that it is no longer updating args.local_rank to contain the right value of local_rank. My client script's args.local_rank remains -1 even after the deepspeed.initialize() call. This change in behavior, in turn, is affecting my downstream logic such as ensuring that only the 0th world rank process writes to disk, etc.

Earlier, the behavior was such that if the client script has the command-line arg, then deepspeed would overwrite the arg to contain the right values as identified by mpi4py (via the setting of the environment variables in the init_distributed() method call prior to args sanity checks and configurations in the DeepSpeedEngine).

Would appreciate any assistance here!
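The earlier overwrite behavior this comment describes can be sketched as follows. The name sync_args_local_rank is illustrative, not DeepSpeed's actual code: once the distributed init has exported LOCAL_RANK (e.g. via mpi4py discovery), a stale args.local_rank gets overwritten so downstream rank checks such as "only rank 0 writes checkpoints" keep working:

```python
import os
import types


def sync_args_local_rank(args):
    """Illustrative sketch of the earlier behavior: overwrite a stale
    args.local_rank from the LOCAL_RANK environment variable after the
    distributed environment has been set up."""
    env_local_rank = int(os.environ.get("LOCAL_RANK", -1))
    if env_local_rank >= 0:
        args.local_rank = env_local_rank
    return args
```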

@jeffra
Collaborator Author

jeffra commented Feb 17, 2021

Hi @g-karthik, sorry for my delayed response on this blocking issue :( Glad to hear that #720 fixed the issue. Are you installing via PyPI or from source? PyPI is a bit out of date with master right now.

I believe #764 should fix the issue you are now seeing. Can you give the PR branch a try though?

@g-karthik

g-karthik commented Feb 18, 2021

Hi @jeffra I'm installing from source.

Looks like you've merged it to master already, so I'll try the master branch!

I'm a bit confused though - which commit caused this stale-args regression? I'm unable to find it.

It worked totally fine earlier, i.e., args.local_rank was getting set internally by deepspeed.initialize().

@g-karthik

@jeffra the fix in #764 seems to work, thanks for this!

But for my sanity, can you point me to the commit where this was originally "undone", i.e., args.local_rank was no longer being set properly within deepspeed?

I'm trying to reconcile this particular behavior with past commits and pretty much all I see is self.local_rank assignments, not args.local_rank assignments.

Without changing a single line of code in my scripts, I see that without your #764 fix, every rank tries to write a model checkpoint directory. But that was not happening 2-3 weeks ago, i.e., only one rank used to write a model checkpoint directory.

@jeffra
Collaborator Author

jeffra commented Feb 19, 2021

I think I may have added this regression in #608.
