Dist testing backend fixes, etc.#708

Merged
jeffra merged 9 commits into master from jeffra/dist_changes
Jan 29, 2021
Conversation


@jeffra jeffra commented Jan 28, 2021

Misc. distributed changes, outlined below.

  • Update the distributed testing backend to not remove args.local_rank patching. Not all unit tests were setting the local rank properly, which led to many tests running on only one CUDA device even when world size > 1.
  • The previous fix exposed a hanging issue with ZeRO-2 tests and SimpleModel(empty_grads=True); empty_grads should only be true for tests that exercise gradient-imbalance issues. Only a few unit tests need this, so I removed many unnecessary empty_grads=True calls.
  • Add support for env["LOCAL_RANK"] in deepspeed.initialize, which means we (technically) no longer require args.local_rank to be passed into deepspeed.initialize via args. Our launcher still passes --local_rank <n> to each sub-process, so for now user code still needs to support this.
  • Expose the init_method param for torch distributed initialization (motivated by a user request).
  • Remove some transformer-kernel unit test cases that were failing on V100; @RezaYazdaniAminabadi is investigating this separately and will have an updated PR soon.

```python
env_local_rank = int(os.environ.get("LOCAL_RANK", -1))
if env_local_rank >= 0:
```
Contributor


I think it would be nice if this validation logic was moved into _do_args_sanity_check().

@jeffra jeffra merged commit 2e2dd86 into master Jan 29, 2021
@jeffra jeffra deleted the jeffra/dist_changes branch January 29, 2021 21:08
ConnorJL pushed a commit to EleutherAI/DeeperSpeed that referenced this pull request Feb 8, 2021
* Dist testing backend fixes, etc. (deepspeedai#708)

* set_batch_fn and remove old sanity check (deepspeedai#712)

* properly set engine.local_rank if it's set to -1

* Add executable permission to `ds_elastic` and `ds_report` in `bin`. (deepspeedai#711)

* Add executable permission to `ds_elastic` and `ds_report` in `bin`.

* Automatic `ds_elastic` formatting

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* local rank of -1 means not set (deepspeedai#720)

* bump to 0.3.11

* [launcher] look ma, no more zombies (deepspeedai#714)

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: Jon Eyolfson <eyolfson@gmail.com>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
sdtblck added a commit to EleutherAI/DeeperSpeed that referenced this pull request Feb 11, 2021
* Dist testing backend fixes, etc. (deepspeedai#708)

* set_batch_fn and remove old sanity check (deepspeedai#712)

* properly set engine.local_rank if it's set to -1

* Add executable permission to `ds_elastic` and `ds_report` in `bin`. (deepspeedai#711)

* Add executable permission to `ds_elastic` and `ds_report` in `bin`.

* Automatic `ds_elastic` formatting

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* local rank of -1 means not set (deepspeedai#720)

* bump to 0.3.11

* [launcher] look ma, no more zombies (deepspeedai#714)

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* Improve starred expressions (deepspeedai#696)

* Improve starred expressions

`deepspeed/profiling/flops_profiler/profiler.py` uses starred expressions
that are no longer valid with [PEP 617][1]. The new Python parser ships in 3.9,
and this change allows DeepSpeed to run with the newest Python version. I have
not checked all locations that have this issue; however, this change allows me
to run simple examples.

[1]: https://www.python.org/dev/peps/pep-0617/

* Match style for "Improve starred expressions", although readability suffers

The style guide might need to be updated for this new use case of expressions.
Python [Issue 40631][1] includes more discussion on the change.

[1]: https://bugs.python.org/issue40631

Co-authored-by: Cheng Li <pistasable@gmail.com>

* Fixed typo in Readme. (deepspeedai#737)

* 1bit_adam dependencies (deepspeedai#742)

* Clickable screenshots (deepspeedai#746)

* Fix docstring

* Make screenshots clickable for easier viewing

* Add flops profiler tutorial (deepspeedai#682)

* work on flops profiler tutorial

* update flops profiler tutorial

* add flops profiler tutorial and fix names

* work on flops profiler tutorial

* update flops profiler tutorial

* add flops profiler tutorial and fix names

* fix tailing ws

* fix names

* remove multistep profiling and update docs

* fix cases where functionals and submodules coexist in a parent module, update readme

* fix typo

* always invoke post hook function

* fix module flops sum and update tests

* update tutorial

* Only initialize distributed if required (deepspeedai#734)

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: Jon Eyolfson <eyolfson@gmail.com>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
Co-authored-by: Cheng Li <pistasable@gmail.com>
Co-authored-by: TheDudeFromCI <thedudefromci@gmail.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Sean Naren <sean@grid.ai>
```python
assert type(args.local_rank) == int, local_rank_err
if "LOCAL_RANK" in os.environ:
    env_local_rank = int(os.environ.get("LOCAL_RANK", -1))
    assert env_local_rank == args.local_rank, \
```


@jeffra SOS! The addition of this assertion is causing my jobs to fail now!

Why was this assert added? A downstream script can have args.local_rank existing in its argument parser and yet be totally stale until DeepSpeed overrides its value via MPI discovery, etc.

For instance, a downstream script may be written in a manner where it can be used for both single-GPU (without DeepSpeed) and multi-GPU training (with as well as without DeepSpeed), wherein single-GPU training is equivalent to defaulting to args.local_rank = -1.
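The failure mode described in this comment can be reproduced with a small standalone sketch. The function name strict_local_rank_check is hypothetical; it mirrors the strict equality assert quoted above:

```python
import os
import types


def strict_local_rank_check(args):
    """Hypothetical sketch of the strict assert reported here: it
    demands args.local_rank equal env LOCAL_RANK exactly."""
    if "LOCAL_RANK" in os.environ:
        env_local_rank = int(os.environ["LOCAL_RANK"])
        assert env_local_rank == args.local_rank, \
            "args.local_rank does not match LOCAL_RANK"


# The launcher (or MPI discovery) exports LOCAL_RANK, but the client
# parser left args.local_rank at its single-GPU default of -1.
os.environ["LOCAL_RANK"] = "0"
stale_args = types.SimpleNamespace(local_rank=-1)
try:
    strict_local_rank_check(stale_args)
    tripped = False
except AssertionError:
    tripped = True
# tripped is True: the stale parser default alone trips the assert,
# even though the job configuration itself is valid.
```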


@g-karthik g-karthik left a comment


@jeffra please take a look at my inline comment! Your addition of that specific assert is causing my jobs to fail now.

EDIT: Oh, hmm, looks like you addressed this in a follow-up PR #720. Perhaps my build doesn't contain this fix. Will re-build and test.

@g-karthik

@jeffra Okay so with the latest build which includes #720, the assertion error I was getting earlier went away.

Having said that, I see with this latest build of DeepSpeed that it is no longer updating args.local_rank to contain the right value of local_rank. My client script's args.local_rank remains -1 even after the deepspeed.initialize() call. This change in behavior, in turn, is affecting my downstream logic such as ensuring that only the 0th world rank process writes to disk, etc.

Earlier, the behavior was such that if the client script has the command-line arg, then deepspeed would overwrite the arg to contain the right values as identified by mpi4py (via the setting of the environment variables in the init_distributed() method call prior to args sanity checks and configurations in the DeepSpeedEngine).

Would appreciate any assistance here!
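The earlier overwrite behavior this comment describes can be sketched as follows. The name sync_args_local_rank is illustrative, not DeepSpeed's actual code: once the distributed init has exported LOCAL_RANK (e.g. via mpi4py discovery), a stale args.local_rank gets overwritten so downstream rank checks such as "only rank 0 writes checkpoints" keep working:

```python
import os
import types


def sync_args_local_rank(args):
    """Illustrative sketch of the earlier behavior: overwrite a stale
    args.local_rank from the LOCAL_RANK environment variable after the
    distributed environment has been set up."""
    env_local_rank = int(os.environ.get("LOCAL_RANK", -1))
    if env_local_rank >= 0:
        args.local_rank = env_local_rank
    return args
```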

@jeffra
Collaborator Author

jeffra commented Feb 17, 2021

Hi @g-karthik, sorry for my delayed response on this blocking issue :( Glad to hear that #720 fixed the issue. Are you installing via PyPI or from source? PyPI is a bit out of date with master right now.

I believe #764 should fix the issue you are now seeing. Can you give the PR branch a try though?

@g-karthik

g-karthik commented Feb 18, 2021

Hi @jeffra I'm installing from source.

Looks like you've merged it to master already, so I'll try the master branch!

I'm a bit confused though - which commit caused this stale-args regression? I'm unable to find it.

It worked totally fine earlier, i.e., args.local_rank was getting set internally by deepspeed.initialize().

@g-karthik

@jeffra the fix in #764 seems to work, thanks for this!

But for my sanity, can you point me to the commit where this was originally "undone", i.e., args.local_rank was no longer being set properly within deepspeed?

I'm trying to reconcile this particular behavior with past commits and pretty much all I see is self.local_rank assignments, not args.local_rank assignments.

Without changing a single line of code in my scripts, I see that without your #764 fix, every rank tries to write a model checkpoint directory. But that was not happening 2-3 weeks ago, i.e., only one rank used to write a model checkpoint directory.

@jeffra
Collaborator Author

jeffra commented Feb 19, 2021

I think I may have added this regression in #608.
