Added Horovod distributed backend #1529
Conversation
Hello @tgaddair! Thanks for updating this PR. There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2020-04-22 21:21:47 UTC
@@ -219,6 +220,13 @@ def set_distributed_mode(self, distributed_backend):
            self.use_ddp = True
            self.data_parallel_device_ids = None
            self.on_gpu = False
        elif distributed_backend == 'horovod':
It would be nice to be transparent to the user. Can we automate setting this, so the abstraction doesn't bleed through?
(the mpirun thing)
Just to make sure I understand you correctly: is the idea that when running via `horovodrun` or `mpirun`, if the user has not specified `distributed_backend`, then we will automatically set `distributed_backend='horovod'` here?

We could certainly do that when running with `horovodrun` + our Gloo backend, as we have special environment variables we can check (`HOROVOD_RANK`, for example). Doing so with `mpirun` is trickier, because different MPI implementations set different environment variables. Also, in the future there might be another distributed backend other than Horovod that uses MPI.

So maybe we could automate it for `horovodrun` but still require users to set it explicitly for `mpirun`? (Let me know if I misunderstood your suggestion.)
> Just to make sure I understand you correctly: is the idea that when running via `horovodrun` or `mpirun`, if the user has not specified `distributed_backend`, then we will automatically set `distributed_backend='horovod'` here?

Yes!

> So maybe we could automate it for `horovodrun` but still require users to set it explicitly for `mpirun`? (Let me know if I misunderstood your suggestion.)

Let's do this for now (v1), and for v2 maybe we set it explicitly for `mpirun`? I just don't know enough about `mpirun` yet, but if `mpirun` can run any backend then the user should be forced to set it.
Sounds good! I added a `has_horovodrun()` check in `distrib_data_parallel.py` that checks for Gloo or OpenMPI environment variables set by `horovodrun`. Also added a test. Let me know if that aligns with what you were thinking.
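For reference, a minimal sketch of what such a detection helper could look like, based on the description above (the function body here is an illustration, not necessarily the PR's verbatim code):

```python
import os

def has_horovodrun():
    """Detect whether this process was launched by `horovodrun`.

    Horovod's Gloo controller exports HOROVOD_RANK and its OpenMPI launcher
    exports OMPI_COMM_WORLD_RANK, so either variable indicates a horovodrun launch.
    """
    return "OMPI_COMM_WORLD_RANK" in os.environ or "HOROVOD_RANK" in os.environ
```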
This pull request is now in conflict... :(
@tgaddair I love this! Wondering if we can automate the comment I added, so the user can use Horovod without remembering anything other than turning on the flag.
This pull request is now in conflict... :(
        set_proc_rank(self.proc_rank)

        if hvd.rank() != 0:
            self.logger = None
Is this needed? Loggers should already have `rank_zero_only`; in such a case see #1408 and setting the global rank:
https://github.com/PyTorchLightning/pytorch-lightning/blob/a22a8142ac65668781a6e6f76d3c4e55ea7c249a/pytorch_lightning/trainer/distrib_parts.py#L494
The errors come from a race condition where different ranks will attempt to `mkdir` the same directory, leading to an exception being raised on one of the workers. For example, this can happen when creating a SummaryWriter, which is why in Horovod we only do so on rank 0.
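To illustrate the rank-0-only pattern being described (a generic sketch, not Lightning's own logger code):

```python
import horovod.torch as hvd
from torch.utils.tensorboard import SummaryWriter

hvd.init()

# Only rank 0 creates the log directory and writer; other ranks skip it,
# avoiding the concurrent-mkdir race described above.
writer = SummaryWriter(log_dir="lightning_logs") if hvd.rank() == 0 else None

if writer is not None:
    writer.add_scalar("train/loss", 0.0, global_step=0)
```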
In Lightning we already handle setting loggers etc. only on rank 0, btw.
I see. I updated to set the logger ranks to `hvd.rank()` instead of deleting them outside of rank 0. Let me know if that makes more sense.
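Roughly, the change described is along these lines (a sketch; the exact logger attribute/API in Lightning at the time may differ):

```python
import horovod.torch as hvd

def configure_logger_rank(trainer):
    """Instead of dropping the logger on workers with rank > 0, tell it which
    global rank it runs on, so its rank-zero-only logic suppresses duplicate
    writes (e.g. SummaryWriter creation). Assumes hvd.init() was already called.
    """
    if trainer.logger is not None:
        trainer.logger.rank = hvd.rank()  # attribute name assumed for illustration
```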
    parser.add_argument('--trainer-options', required=True)


def test(trainer_options):
test what?
Renamed for clarity and added a docstring at the top of the file to explain usage.
@@ -0,0 +1,36 @@
import argparse
Is this meant to be a (unit) test? Because with this name it won't be discovered. Why is it under `data/horovod/`? Would it rather be `tests/models/script_train_horovod.py`?
Added a docstring at the top for clarity. This script is meant to be executed from `test_horovod.py`. Reason for this is to test driving the training via `horovodrun` using multiple parallel worker processes.
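To make the setup concrete, here is a sketch of how such a test driver could launch the script under `horovodrun` (the file name, the `-np 2` worker count, and the JSON argument encoding are illustrative assumptions, not the PR's exact code):

```python
import json
import subprocess
import sys

def run_model_under_horovodrun(trainer_options):
    """Launch the standalone training script across two local worker
    processes via horovodrun; a non-zero exit from any worker fails the test."""
    cmd = [
        "horovodrun", "-np", "2",
        sys.executable, "train_default_model.py",
        "--trainer-options", json.dumps(trainer_options),
    ]
    subprocess.check_call(cmd)

if __name__ == "__main__":
    run_model_under_horovodrun({"max_epochs": 1, "distributed_backend": "horovod"})
```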
        # Horovod: wrap optimizers to perform gradient aggregation via allreduce
        self.optimizers = [
            hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())
What happens when I do:

    def configure_optimizers(self):
        return Adam(self.generator.parameters()), Adam(self.discriminator.parameters())
Wouldn't this line here break?

    model.named_parameters()

Should we not instead do:

    [hvd.DistributedOptimizer(opt, named_parameters=opt.named_parameters()) for opt in self.optimizers]

This might be a silly question, as I don't know the details of `DistributedOptimizer`.
Yes, good catch! This was an oversight on my part. I added a fix, and added a unit test specifically for GAN / multi-optimizers.
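For context, one way the per-optimizer wrapping can be scoped correctly looks roughly like this (the helper name and details are illustrative, not necessarily the exact code merged in this PR):

```python
import horovod.torch as hvd

def filter_named_parameters(model, optimizer):
    """Keep only the (name, param) pairs of `model` that this optimizer owns."""
    opt_params = {p for group in optimizer.param_groups for p in group["params"]}
    return [(name, p) for name, p in model.named_parameters() if p in opt_params]

# Wrap each optimizer with just its own parameter subset, so gradient allreduce
# in a multi-optimizer setup (e.g. a GAN) only touches that optimizer's gradients.
optimizers = [
    hvd.DistributedOptimizer(opt, named_parameters=filter_named_parameters(model, opt))
    for opt in optimizers
]
```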
Codecov Report

@@           Coverage Diff            @@
##           master    #1529    +/-   ##
=========================================
  Coverage      89%      89%
=========================================
  Files          68       68
  Lines        3811     3906     +95
=========================================
+ Hits         3385     3471     +86
- Misses        426      435      +9
Hey @williamFalcon, looks like there is an incompatibility between PyTorch Lightning and PyTorch 1.5.0 (released last night) that's causing the CI failures. Is someone on your end looking into this? Happy to file an issue.
Probably an issue is not needed; we are already working on it in #1552.
This pull request is now in conflict... :(
@tgaddair fixed on master. Want to rebase so we can merge this?
Force-pushed from 96a190e to a76736a.
We do not use tox anymore...
I see, I mistook a failure due to a corrupt pip cache for a tox issue. Is there a way to refresh the pip cache? I just commented out that step for now to get tests to pass; not sure what will happen when I restore that line.
I tried dropping the cache some time ago and didn't find a way to do it...
Docker images would be much better, I agree. Looks like I was able to refresh the cache by running
🎆
Hey @Borda @williamFalcon, looks like the Drone GPU test timeout was recently changed from 30 minutes to 15 minutes. Before this PR, those tests took about 4:30 to run, and they were taking about 18 minutes with this PR. However, 10 minutes of that was attributable to the time to build Apex. As I mentioned in a previous comment, it looks like Apex was failing to install correctly before, due to the lack of the nvcc compiler in the image you were using. The new image has nvcc and can successfully build Apex, but it takes a very long time. I just ran a test where I removed the line to install Apex, and the tests now pass in about 8:30 (less than the time for CircleCI to finish). I believe this is consistent with the current test behavior, but I wanted to get your thoughts on Apex: do you feel it's worth building it in these tests and waiting the extra 10 minutes? If so, we can restore it in a follow-up and bump up the test timeout.
Thanks for merging!
In my opinion it's just another reason to create our own test image and use it for all CI, as we do not want to spend most of the machine time on repetitive building/installing of dependencies. Maybe I am missing something, but without Apex there is no AMP support, right? So the test should fail...?
Re #1561 (comment): after talking to @tgaddair and Meet Shah on the PyTorch Slack, when using Horovod's DistributedOptimizer + native AMP, you need to ensure grads are synced across processes before the unscaling + inf-checking. In other words, you need the following pattern:

    scaler.scale(loss).backward()
    opt.synchronize()
    # if a separate scaler.unscale_(optimizer) is needed,
    # e.g. to allow clipping unscaled gradients,
    # it should come here, after opt.synchronize()
    with opt.skip_synchronize():
        scaler.step(opt)
    scaler.update()

I think a similar pattern was needed with Apex.
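As a reference for where that pattern sits, here is a minimal self-contained training-step sketch combining Horovod and torch.cuda.amp (a generic illustration, not Lightning's internal implementation; it assumes a CUDA device and an initialized Horovod job):

```python
import horovod.torch as hvd
import torch

hvd.init()
torch.cuda.set_device(hvd.local_rank())

model = torch.nn.Linear(10, 1).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())
scaler = torch.cuda.amp.GradScaler()

x = torch.randn(32, 10, device="cuda")
y = torch.randn(32, 1, device="cuda")

with torch.cuda.amp.autocast():
    loss = torch.nn.functional.mse_loss(model(x), y)

scaler.scale(loss).backward()
optimizer.synchronize()            # allreduce grads before unscale / inf-check
with optimizer.skip_synchronize():
    scaler.step(optimizer)         # step without triggering a second sync
scaler.update()
```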
@mcarilli mind sending a PR? ❤️
@Borda I contacted @williamFalcon on the PyTorch Slack; he said he was refactoring the Horovod integration already and would ping me for review.
Hey @mcarilli, I think the AMP integration should already be in place with the Horovod backend. Are you seeing issues when trying to use it?
I haven't seen any issues, just wanted to remind about the `synchronize()` pattern. If it's already taken care of, ignore me.
Thanks for clarifying and raising the issue. We should definitely double check!
Fixes #1518.
Make the following change to your Trainer to run on GPU (single or multiple) with Horovod:
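A sketch of what that change looks like (the `gpus` count here is illustrative; the number of processes and hosts is chosen by the launcher):

```python
from pytorch_lightning import Trainer

# One GPU per Horovod worker process.
trainer = Trainer(distributed_backend='horovod', gpus=1)
```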
Or to run on CPU:
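A corresponding sketch, with no `gpus` argument:

```python
trainer = Trainer(distributed_backend='horovod')
```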
Then the training script can be launched via the horovodrun command-line tool, where the host/GPU allocation is specified:
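For example, using horovodrun's standard `-np` / `-H` options (the script name, process counts, and hostnames below are placeholders):

```bash
# 4 worker processes on the local machine
horovodrun -np 4 python train.py

# 8 worker processes across two hosts, 4 per host
horovodrun -np 8 -H host1:4,host2:4 python train.py
```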