
Refactor LightningDistributedDataParallel #5185

Merged (22 commits into release/1.2-dev on Jan 13, 2021)

Conversation

@awaelchli (Contributor) commented Dec 18, 2020

What does this PR do?

Fixes #4630
cc @ananthsub

The motivation behind this refactor of the DDP wrapper is that we automatically inherit all (future) improvements from upstream PyTorch DDP, and users can comfortably subclass the PyTorch wrapper directly if they want to.
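(A rough sketch of the pattern; class and method names follow this PR's diff, but the exact Lightning internals differ slightly. The LightningModule is wrapped in a thin nn.Module whose forward() dispatches to training_step / validation_step / test_step, and that wrapper is handed to the unmodified torch DistributedDataParallel.)

import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel


class LightningDistributedWrapper(nn.Module):
    """Sketch: route forward() to the LightningModule's *_step hooks so the
    plain torch DDP wrapper can be used unchanged."""

    def __init__(self, pl_module):
        super().__init__()
        self.module = pl_module

    def forward(self, *args, **kwargs):
        if self.module.training:
            return self.module.training_step(*args, **kwargs)
        if getattr(self.module, "testing", False):
            return self.module.test_step(*args, **kwargs)
        return self.module.validation_step(*args, **kwargs)


# configure_ddp then reduces to a call into the native torch wrapper:
# model = DistributedDataParallel(
#     module=LightningDistributedWrapper(lightning_module),
#     device_ids=device_ids,
#     **ddp_kwargs,
# )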

TODO:

  • Docs
  • add unit tests for new class to keep coverage up
  • add deprecation for old class?

@awaelchli awaelchli added refactor design Includes a design discussion labels Dec 18, 2020
@awaelchli awaelchli added this to the 1.2 milestone Dec 18, 2020
@awaelchli awaelchli changed the base branch from master to release/1.2-dev December 18, 2020 19:20
@awaelchli awaelchli changed the base branch from release/1.2-dev to master December 18, 2020 19:21
@awaelchli awaelchli force-pushed the refactor/distrib-wrapper branch from 57fb035 to 7655c37 on December 18, 2020 19:24
@awaelchli awaelchli changed the base branch from master to release/1.2-dev December 18, 2020 19:24
@awaelchli awaelchli force-pushed the refactor/distrib-wrapper branch from 7655c37 to ef34dc1 on December 18, 2020 19:37
@awaelchli awaelchli changed the title from "Refactor LightningDistributedDataParallel [skip-ci]" to "Refactor LightningDistributedDataParallel [skip ci]" Dec 18, 2020
@awaelchli awaelchli added the distributed Generic distributed-related topic label Dec 18, 2020

def reducer_prepare_for_backwards(self, output):
Contributor Author:

I'm not sure where this should go; it requires the reducer from DDP.

Contributor Author:

@SeanNaren do you remember the conversation we started in #4976 about this? You had an idea there, I'm trying to understand it, maybe you can explain again? :)

Contributor:

I have been thinking about this one maybe too much, haha, but I didn't find a better way to do it: backward and optimizer.step are called inside training_step, and the DDP reducer is invoked on the training_step output.

Contributor:

We could open a PR in PyTorch to at least move this into a function we can reuse.

@ananthsub (Contributor):

fyi @pritamdamania87

@ananthsub (Contributor) left a comment:

Exciting progress! Are there other spots in Lightning where LightningDistributedDataParallel is referenced? I imagine this will also provide a perf benefit, since we can now automatically take advantage of improvements in DDP's forward.

  • Does the same pattern apply for data parallel?
  • (Down the line) What if someone wants to wrap only some modules in their LightningModule with DDP? What further changes would that require here?

Comment on lines 146 to 171
 @contextmanager
-def block_backward_sync(self, model: LightningDistributedDataParallel):
+def block_backward_sync(self, model: DistributedDataParallel):
     """
     Blocks ddp sync gradients behaviour on backwards pass.
     This is useful for skipping sync when accumulating gradients, reducing communication overhead
     Returns: context manager with sync behaviour off
     """
     yield model.no_sync()
Contributor:

Do we still need block_backward_sync? Can we call model.no_sync() directly?
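(For reference, a minimal sketch of what calling the native no_sync() directly looks like in a plain torch gradient-accumulation loop; loader, optimizer and accumulate_grad_batches are placeholder names, not Lightning API.)

from torch.nn.parallel import DistributedDataParallel


def accumulation_loop(ddp_model: DistributedDataParallel, loader, optimizer,
                      accumulate_grad_batches: int = 2):
    for i, batch in enumerate(loader):
        if (i + 1) % accumulate_grad_batches != 0:
            # skip the gradient all-reduce on intermediate micro-batches
            with ddp_model.no_sync():
                ddp_model(batch).sum().backward()
        else:
            # this backward() performs the all-reduce of the accumulated grads
            ddp_model(batch).sum().backward()
            optimizer.step()
            optimizer.zero_grad()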

Contributor Author:

The only reason I can think of for keeping it is so that users can override this method in their own plugin, though there is not much they can customize in this context manager :)

Member:

We basically added this because we did not want anything DDP-specific (i.e. type checks against the prior LightningDistributedDataParallel) inside the trainer/training loop, since that part should stay backend agnostic.


-def parallel_apply(self, replicas, inputs, kwargs):
-    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
+class LightningDistributedWrapper(torch.nn.Module):
Contributor:

for later: we should add comments for how this class is to be used

Contributor Author:

done, with example.

-model = LightningDistributedDataParallel(
-    model,
+model = DistributedDataParallel(
+    module=LightningDistributedWrapper(model),
Contributor:

the docstring above needs to be updated for the return type

Contributor Author:

done

@@ -63,8 +64,8 @@ def configure_ddp(self, model, device_ids):
self._ddp_kwargs["find_unused_parameters"] = self._ddp_kwargs.get(
"find_unused_parameters", True
Contributor:

I'm not sure why this is the default. It incurs a performance hit and differs from the DDP default.

@awaelchli (Contributor Author), Dec 25, 2020:

It was added by you in #4382; I'm not sure it's necessary.

Contributor:

Hey @awaelchli. Not sure Will looked into the default for find_unused_parameters. Let's stick to the PyTorch default, which is False, right?

Contributor:

It should be False; enabling it is only recommended when necessary: https://pytorch.org/docs/stable/notes/ddp.html#internal-design

@awaelchli (Contributor Author), Jan 8, 2021:

OK, made the default False to be in line with PyTorch DDP: #5435
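(With the default flipped to False, a user whose model really does have unused parameters would opt back in explicitly, e.g. through the DDP plugin. A sketch assuming the Lightning ~1.1/1.2 API; the exact import path may differ by version.)

from pytorch_lightning import Trainer
from pytorch_lightning.plugins import DDPPlugin  # import path varies across versions

# extra kwargs on DDPPlugin are forwarded to torch.nn.parallel.DistributedDataParallel
trainer = Trainer(
    gpus=2,
    accelerator="ddp",
    plugins=[DDPPlugin(find_unused_parameters=True)],
)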

@ananthsub (Contributor), Jan 9, 2021:

in #4382 I was preserving the prior behavior without digging into the full history behind the setting :/

This could be a nice speedup for distributed training jobs. @SeanNaren, n00b question: is there a way to estimate the possible gains using the Lightning benchmarks?

@awaelchli (Contributor Author):

> are there other spots in lightning where LightningDistributedDataParallel is referenced?

In the end I will do a search across the codebase, but I think it's more or less isolated to the plugins and accelerators; in most places Lightning operates on model.module.

> Does the same pattern apply for data parallel?

Yes, I believe so. Currently the biggest difference I see is the custom gather function for Results objects, but since we dropped support for those we should also be able to get rid of it and just use the native torch implementation for DP.

> (down the line) what if someone wants to wrap only some modules in their lightning module with DDP? what further changes would that require here?

Oh, that sounds adventurous! I'm not sure whether the wrapper in this PR would stand in the way in that scenario, because of the order of nesting it implies. I need to think about it more.

At the moment the PR is blocked by the problem of manual optimization. It seems the custom code for calling the reducer is necessary for Lightning's manual backward. I don't fully understand it; I hope @SeanNaren can give me some hints.

@SeanNaren (Contributor) commented Dec 27, 2020

Nice! I like the overall flow of this. Just to make sure I understand: all we're doing is adding a higher-level torch module on top of the DDP/DP modules? I have to ask: was there any reason this wasn't done initially? I may have missed something.

Regarding manual optimization, the additional logic is due to the backward all-reduce hooks being added after forward is finished. Right now we assume that if you override training_step in manual optimization, you can do everything you like within the function. Before the fix, this meant that if you called .backward() in training_step, gradients were not synced because no autograd reduce hooks had been added.

We added the extra logic as a temporary fix, but I really don't like it. I haven't had time to figure out a cleaner solution, but our current one boils down to calling reducer.prepare_for_backwards(...) manually, which adds the hooks (optionally using the loss to add hooks only to the parameters that actually require it, iirc).
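(For context, the temporary fix essentially replays the tail end of torch DDP's own forward() before a manual backward, re-arming the reducer hooks. A rough sketch below; it relies on private torch internals such as _find_tensors and the Reducer, so treat it as illustrative only, not the exact Lightning code.)

import torch
from torch.nn.parallel import DistributedDataParallel
from torch.nn.parallel.distributed import _find_tensors  # private torch helper


def prepare_for_backward(ddp_model: DistributedDataParallel, output):
    """Re-register DDP's all-reduce autograd hooks before a manual .backward()."""
    if torch.is_grad_enabled() and ddp_model.require_backward_grad_sync:
        ddp_model.require_forward_param_sync = True
        if ddp_model.find_unused_parameters:
            # hook only the parameters reachable from the training_step output
            ddp_model.reducer.prepare_for_backward(list(_find_tensors(output)))
        else:
            ddp_model.reducer.prepare_for_backward([])
    else:
        ddp_model.require_forward_param_sync = False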

@tchaton (Contributor) commented Jan 4, 2021

> (quoting @SeanNaren's comment above)

Hey @awaelchli, I can help you out with manual_optimization and Pipe. It would be great to finalise this work.

For PipeRPCPlugin, we just need to replace PREPARE_FOR_BACKWARD with require_backward_grad_sync.

For manual_optimization, we need to set require_backward_grad_sync=False by default.

And update the Accelerator.backward function. Still hacky...

def backward(self, closure_loss, optimizer, opt_idx, *args, should_sync = True, **kwargs):
    automatic_optimization = self.trainer.train_loop.automatic_optimization

    if not automatic_optimization and self.ddp_plugin is not None:
        if should_sync:
            if torch.is_grad_enabled():
                if self.find_unused_parameters:
                    self.reducer.prepare_for_backward(list(_find_tensors(closure_loss)))
                else:
                    self.reducer.prepare_for_backward([])
            self.require_backward_grad_sync = True
            self.require_forward_param_sync = True

        else:
            self.require_backward_grad_sync = False
            self.require_forward_param_sync = False

And it could be used this way.

def training_step(....):
    ...
    make_optimizer_step = (batch_idx % 2 == 0)
    self.manual_backward(loss, opt, should_sync=make_optimizer_step)
    opt.step(make_optimizer_step=make_optimizer_step)

However, we will have to be careful when the user provides a closure, as the accumulated-gradient logic already lives inside the LightningOptimizer.

def training_step(....):
    ...
    def closure():
        loss = ...
        # should_sync should be inferred from LightningOptimizer
        self.manual_backward(loss, opt)
    opt.step(closure=closure, make_optimizer_step=(batch_idx % 2 == 0))

NB: require_backward_grad_sync and require_forward_param_sync were already present in July 2019 (v1.3.0), so it might work.

Best,
T.C

@Borda Borda added the priority: 1 Medium priority task label Jan 5, 2021
@Borda (Member) commented Jan 5, 2021

@awaelchli is this ready to land/review? 🐰

@tchaton (Contributor) commented Jan 8, 2021

Hey @awaelchli,

Resolved in #5415, which is checked out from this branch.

Best,
T.C

@tchaton tchaton mentioned this pull request Jan 8, 2021
@awaelchli (Contributor Author) commented Jan 8, 2021

@tchaton amazing, thanks!

Any preferences/suggestions for a better name?

  • LightningDistributedWrapper (current)
  • LightningDistributedModule? (maybe more in line with LightningModule)

@awaelchli awaelchli changed the title from "Refactor LightningDistributedDataParallel [skip ci]" to "Refactor LightningDistributedDataParallel" Jan 8, 2021
@awaelchli awaelchli marked this pull request as ready for review January 8, 2021 13:08
codecov bot commented Jan 8, 2021

Codecov Report

Merging #5185 (1c90586) into release/1.2-dev (61f415f) will decrease coverage by 0%.
The diff coverage is 100%.

@@               Coverage Diff                @@
##           release/1.2-dev   #5185    +/-   ##
================================================
- Coverage               93%     93%    -0%     
================================================
  Files                  152     151     -1     
  Lines                10737   10616   -121     
================================================
- Hits                  9950    9828   -122     
- Misses                 787     788     +1     

pytorch_lightning/overrides/data_parallel.py (outdated, resolved)
Override the forward call in lightning so it goes to training and validation step respectively
"""
PREPARE_FOR_BACKWARDS = True
class LightningDistributedWrapper(torch.nn.Module):
Member:

Btw, this smells like an API change ;]

Contributor Author:

OK, add a deprecation warning now and remove it in 1.4?

Contributor Author:

@Borda I also added a deprecation test, but it is not so simple because it needs torch.distributed to be initialized, so it adds about 3-4 seconds just for that test. I don't know how to do it more simply.
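(For reference, a minimal sketch of such a test: it spins up a single-process gloo group, which is where the extra seconds go, and asserts the deprecation warning. The class name, import path and warning text follow this PR and may differ in the final code.)

import pytest
import torch
import torch.distributed as dist

from pytorch_lightning.overrides.data_parallel import LightningDistributedDataParallel


def test_lightning_distributed_data_parallel_is_deprecated(tmp_path):
    # single-process "gloo" group so the DDP constructor has a process group to attach to
    dist.init_process_group(
        "gloo", init_method=f"file://{tmp_path / 'store'}", rank=0, world_size=1
    )
    try:
        with pytest.deprecated_call(match="LightningDistributedDataParallel"):
            LightningDistributedDataParallel(torch.nn.Linear(2, 2))
    finally:
        dist.destroy_process_group()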

@SeanNaren (Contributor) left a comment:

Love this, great cleanup @awaelchli. Something similar could potentially happen to the ShardedDataParallel class in overrides/fairscale.py, but that can be a separate PR.

I don't mind the current name, but I think it's more PyTorch-esque if we go for LightningDistributedModule. Either way is clear :)



@Borda Borda force-pushed the refactor/distrib-wrapper branch from 4eadb89 to be8e11e on January 13, 2021 13:15
Comment on lines +17 to +18
pl_module.training = True
pl_module.testing = False
Member:

Curious: how does the module have these attributes? Isn't that on the Trainer?

Contributor Author:

See my answer here; the property is copied to the model here.
Happy to follow up on this in a next step. The test that I wrote there is just to make sure it works the same as before the refactor.

@tchaton tchaton merged commit e806bb7 into release/1.2-dev Jan 13, 2021
@tchaton tchaton deleted the refactor/distrib-wrapper branch January 13, 2021 19:35
Labels
design (Includes a design discussion), distributed (Generic distributed-related topic), has conflicts, priority: 1 (Medium priority task), ready (PRs ready to be merged), refactor
Development

Successfully merging this pull request may close these issues.

Keeping DDP override in sync with upstream torch
9 participants