trainer.fit() stuck with accelerator set to "ddp" #5961

Closed
ifsheldon opened this issue Feb 13, 2021 · 7 comments · Fixed by #5970
Labels
bug (Something isn't working) · help wanted (Open to be worked on)

Comments

@ifsheldon
Contributor

ifsheldon commented Feb 13, 2021

🐛 Bug

The problem is that trainer.fit() with accelerator set to ddp hangs for an extremely long time before it gets the CPUs and GPUs working. I also cannot interrupt the kernel and have to restart it.

Please reproduce using the BoringModel

To Reproduce

I tried the BoringModel, and I can reproduce the issue.

The only modification I made is in the "Define the test" section. The code is below:

def test_x(tmpdir):
    # init model
    model = BoringModel()

    # Initialize a trainer
    trainer = pl.Trainer(
        max_epochs=1,
        progress_bar_refresh_rate=20,
        gpus=4,             # added to use 4 GPUs
        accelerator='ddp',  # added to use ddp
    )

    # Train the model ⚡
    trainer.fit(model, train, val)

    trainer.test(test_dataloaders=test)

My code that originally ran into this issue is in the discussion post.
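
For reference, the snippet above relies on the scaffolding defined in earlier cells of the BoringModel notebook. A minimal, self-contained sketch of that scaffolding (this approximates the notebook; the exact layer sizes, dataset length, and batch size here are placeholders) looks like this:

import torch
from torch.utils.data import DataLoader, Dataset
import pytorch_lightning as pl

class RandomDataset(Dataset):
    # random tensors standing in for real data, as in the notebook
    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len

class BoringModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        loss = self(batch).sum()
        return {"loss": loss}

    def validation_step(self, batch, batch_idx):
        return {"x": self(batch).sum()}

    def test_step(self, batch, batch_idx):
        return {"y": self(batch).sum()}

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=0.1)

# the dataloaders referenced as `train`, `val` and `test` in test_x
train = DataLoader(RandomDataset(32, 64), batch_size=2)
val = DataLoader(RandomDataset(32, 64), batch_size=2)
test = DataLoader(RandomDataset(32, 64), batch_size=2)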

Expected behavior

The expected behavior is that training starts within a couple of minutes; instead, trainer.fit() gets stuck while the GPUs and CPUs stay idle.

Environment

My environment, as detected by the official collection script, is below. I run my code on a shared GPU cluster after requesting compute resources; I usually request 512 GB of memory, 32 cores and 4 V100s. The environment is managed by my personal conda installation, so it does not interfere with other users' environments. If you want to know more about the configuration, just let me know.

(torch) [liangf@gpu208-14 liangf]$ python collect_env_details.py
* CUDA:
        - GPU:
                - Tesla V100-SXM2-32GB
                - Tesla V100-SXM2-32GB
                - Tesla V100-SXM2-32GB
                - Tesla V100-SXM2-32GB
        - available:         True
        - version:           11.0
* Packages:
        - numpy:             1.20.0
        - pyTorch_debug:     False
        - pyTorch_version:   1.7.1
        - pytorch-lightning: 1.1.8
        - tqdm:              4.56.0
* System:
        - OS:                Linux
        - architecture:
                - 64bit
                - ELF
        - processor:         x86_64
        - python:            3.9.1
        - version:           #1 SMP Tue Nov 17 13:59:11 UTC 2020

Additional context

If I change the code above in the BoringModel to the following, the trainer "works" as expected. With accelerator='dp' it takes less than a minute to get everything set up and keeps the CPUs and GPUs busy, while with accelerator='ddp' it takes 10 minutes or more and still has not started running anything before I lose my patience.

def test_x(tmpdir):
    # init model
    model = BoringModel()

    # Initialize a trainer
    trainer = pl.Trainer(
        max_epochs=1,
        progress_bar_refresh_rate=20,
        gpus=4,            # added to use 4 GPUs
        accelerator='dp',  # changed to use dp instead of ddp
    )

    # Train the model ⚡
    trainer.fit(model, train, val)

    trainer.test(test_dataloaders=test)

By "works" I meant it can get GPUs running, but later a runtime error is thrown. And I think this will be another issue, which maybe that the code in the boring model notebook is not runnable in multi-GPU environment. However, I don't know what is the cause, since I am just transferring from ordinary pytorch to pytorch-lightning, and the code in the notebook looks reasonably good for me.

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-10-1f9f6fbe4f6c> in <module>
----> 1 test_x(tmpdir)

<ipython-input-9-8b8914eff5a4> in test_x(tmpdir)
     12 
     13     # Train the model ⚡
---> 14     trainer.fit(model, train, val)
     15 
     16     trainer.test(test_dataloaders=test)

~/miniconda3/envs/torch/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py in fit(self, model, train_dataloader, val_dataloaders, datamodule)
    508         self.call_hook('on_fit_start')
    509 
--> 510         results = self.accelerator_backend.train()
    511         self.accelerator_backend.teardown()
    512 

~/miniconda3/envs/torch/lib/python3.9/site-packages/pytorch_lightning/accelerators/accelerator.py in train(self)
     55     def train(self):
     56         self.trainer.setup_trainer(self.trainer.model)
---> 57         return self.train_or_test()
     58 
     59     def teardown(self):

~/miniconda3/envs/torch/lib/python3.9/site-packages/pytorch_lightning/accelerators/accelerator.py in train_or_test(self)
     72         else:
     73             self.trainer.train_loop.setup_training()
---> 74             results = self.trainer.train()
     75         return results
     76 

~/miniconda3/envs/torch/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py in train(self)
    559                 with self.profiler.profile("run_training_epoch"):
    560                     # run train epoch
--> 561                     self.train_loop.run_training_epoch()
    562 
    563                 if self.max_steps and self.max_steps <= self.global_step:

~/miniconda3/envs/torch/lib/python3.9/site-packages/pytorch_lightning/trainer/training_loop.py in run_training_epoch(self)
    548             # ------------------------------------
    549             with self.trainer.profiler.profile("run_training_batch"):
--> 550                 batch_output = self.run_training_batch(batch, batch_idx, dataloader_idx)
    551 
    552             # when returning -1 from train_step, we end epoch early

~/miniconda3/envs/torch/lib/python3.9/site-packages/pytorch_lightning/trainer/training_loop.py in run_training_batch(self, batch, batch_idx, dataloader_idx)
    716 
    717                         # optimizer step
--> 718                         self.optimizer_step(optimizer, opt_idx, batch_idx, train_step_and_backward_closure)
    719 
    720                     else:

~/miniconda3/envs/torch/lib/python3.9/site-packages/pytorch_lightning/trainer/training_loop.py in optimizer_step(self, optimizer, opt_idx, batch_idx, train_step_and_backward_closure)
    483 
    484         # model hook
--> 485         model_ref.optimizer_step(
    486             self.trainer.current_epoch,
    487             batch_idx,

~/miniconda3/envs/torch/lib/python3.9/site-packages/pytorch_lightning/core/lightning.py in optimizer_step(self, epoch, batch_idx, optimizer, optimizer_idx, optimizer_closure, on_tpu, using_native_amp, using_lbfgs)
   1296 
   1297         """
-> 1298         optimizer.step(closure=optimizer_closure)
   1299 
   1300     def optimizer_zero_grad(

~/miniconda3/envs/torch/lib/python3.9/site-packages/pytorch_lightning/core/optimizer.py in step(self, closure, make_optimizer_step, *args, **kwargs)
    284 
    285         if make_optimizer_step:
--> 286             self.__optimizer_step(*args, closure=closure, profiler_name=profiler_name, **kwargs)
    287         else:
    288             # make sure to call optimizer_closure when accumulating

~/miniconda3/envs/torch/lib/python3.9/site-packages/pytorch_lightning/core/optimizer.py in __optimizer_step(self, closure, profiler_name, *args, **kwargs)
    142         else:
    143             with trainer.profiler.profile(profiler_name):
--> 144                 optimizer.step(closure=closure, *args, **kwargs)
    145 
    146         accelerator_backend = trainer.accelerator_backend

~/miniconda3/envs/torch/lib/python3.9/site-packages/torch/optim/lr_scheduler.py in wrapper(*args, **kwargs)
     65                 instance._step_count += 1
     66                 wrapped = func.__get__(instance, cls)
---> 67                 return wrapped(*args, **kwargs)
     68 
     69             # Note that the returned function here is no longer a bound method,

~/miniconda3/envs/torch/lib/python3.9/site-packages/torch/autograd/grad_mode.py in decorate_context(*args, **kwargs)
     24         def decorate_context(*args, **kwargs):
     25             with self.__class__():
---> 26                 return func(*args, **kwargs)
     27         return cast(F, decorate_context)
     28 

~/miniconda3/envs/torch/lib/python3.9/site-packages/torch/optim/sgd.py in step(self, closure)
     84         if closure is not None:
     85             with torch.enable_grad():
---> 86                 loss = closure()
     87 
     88         for group in self.param_groups:

~/miniconda3/envs/torch/lib/python3.9/site-packages/pytorch_lightning/trainer/training_loop.py in train_step_and_backward_closure()
    706 
    707                         def train_step_and_backward_closure():
--> 708                             result = self.training_step_and_backward(
    709                                 split_batch,
    710                                 batch_idx,

~/miniconda3/envs/torch/lib/python3.9/site-packages/pytorch_lightning/trainer/training_loop.py in training_step_and_backward(self, split_batch, batch_idx, opt_idx, optimizer, hiddens)
    814                 # backward pass
    815                 with self.trainer.profiler.profile("model_backward"):
--> 816                     self.backward(result, optimizer, opt_idx)
    817 
    818                 # hook - call this hook only

~/miniconda3/envs/torch/lib/python3.9/site-packages/pytorch_lightning/trainer/training_loop.py in backward(self, result, optimizer, opt_idx, *args, **kwargs)
    840             self.trainer.accelerator_backend.backward(result, optimizer, opt_idx, *args, **kwargs)
    841         else:
--> 842             result.closure_loss = self.trainer.accelerator_backend.backward(
    843                 result.closure_loss, optimizer, opt_idx, *args, **kwargs
    844             )

~/miniconda3/envs/torch/lib/python3.9/site-packages/pytorch_lightning/accelerators/accelerator.py in backward(self, closure_loss, optimizer, opt_idx, *args, **kwargs)
    107             # do backward pass
    108             model = self.trainer.get_model()
--> 109             model.backward(closure_loss, optimizer, opt_idx, *args, **kwargs)
    110 
    111             # once backward has been applied, release graph

~/miniconda3/envs/torch/lib/python3.9/site-packages/pytorch_lightning/core/lightning.py in backward(self, loss, optimizer, optimizer_idx, *args, **kwargs)
   1160         """
   1161         if self.trainer.train_loop.automatic_optimization or self._running_manual_backward:
-> 1162             loss.backward(*args, **kwargs)
   1163 
   1164     def toggle_optimizer(self, optimizer: Optimizer, optimizer_idx: int):

~/miniconda3/envs/torch/lib/python3.9/site-packages/torch/tensor.py in backward(self, gradient, retain_graph, create_graph)
    219                 retain_graph=retain_graph,
    220                 create_graph=create_graph)
--> 221         torch.autograd.backward(self, gradient, retain_graph, create_graph)
    222 
    223     def register_hook(self, hook):

~/miniconda3/envs/torch/lib/python3.9/site-packages/torch/autograd/__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables)
    124 
    125     grad_tensors_ = _tensor_or_tensors_to_tuple(grad_tensors, len(tensors))
--> 126     grad_tensors_ = _make_grads(tensors, grad_tensors_)
    127     if retain_graph is None:
    128         retain_graph = create_graph

~/miniconda3/envs/torch/lib/python3.9/site-packages/torch/autograd/__init__.py in _make_grads(outputs, grads)
     48             if out.requires_grad:
     49                 if out.numel() != 1:
---> 50                     raise RuntimeError("grad can be implicitly created only for scalar outputs")
     51                 new_grads.append(torch.ones_like(out, memory_format=torch.preserve_format))
     52             else:

RuntimeError: grad can be implicitly created only for scalar outputs
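
The traceback bottoms out in loss.backward() being called on a non-scalar tensor: under dp, training_step runs on each GPU and the per-GPU losses are gathered into a vector before backward. One common remedy is to reduce the gathered losses in training_step_end. A minimal sketch, assuming the BoringModel scaffolding sketched above (not the notebook's actual code):

class BoringModelForDP(BoringModel):
    # In dp mode, training_step_end receives the outputs gathered from all GPUs,
    # so outputs["loss"] is a tensor with one entry per GPU. Averaging it makes
    # the loss a scalar again before backward() is called.
    def training_step_end(self, outputs):
        return {"loss": outputs["loss"].mean()}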
ifsheldon added the bug and help wanted labels on Feb 13, 2021
@awaelchli
Contributor

I answered in the discussion post about the usage of ddp in a Jupyter environment.
About your second error:

By "works" I meant it can get GPUs running, but later a runtime error is thrown. And I think this will be another issue, which maybe that the code in the boring model notebook is not runnable in multi-GPU environment.

The BoringModel runs fine on multiple GPUs (with the current code from this repo); I can confirm that on master and also with your version, 1.1.8.

Please post your full BoringModel as one script so that I can run it.

@ifsheldon
Contributor Author

The only modifications I made are gpus=4 and the accelerator, but anyway, my notebook is here and you can see the traceback and environment settings. I ran it using Jupyter Lab on a shared GPU cluster. Neither accelerator='dp' nor accelerator='ddp_spawn' works, which is weird.

@sooperset

Have you tried pytorch-lightning 1.1.6? For me, ddp training gets stuck on 1.1.7 and later. I also wonder what the cause is.

@tchaton
Contributor

tchaton commented Feb 15, 2021

Dear @ifsheldon,

ddp doesn't work in a notebook, if that is what you are trying to do.

Best,
T.C
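
For anyone moving off the notebook, here is a minimal sketch of the same run as a standalone script; with accelerator='ddp', Lightning starts the extra GPU processes by re-launching the script, which is why it cannot run from inside Jupyter. The file name and the `boring` module import are hypothetical; the model and dataloaders are assumed to be the ones sketched earlier in this thread.

# train_ddp.py -- run with `python train_ddp.py` from a terminal, not inside Jupyter
import pytorch_lightning as pl

# hypothetical module holding the BoringModel and dataloaders sketched above
from boring import BoringModel, train, val, test

def main():
    model = BoringModel()
    trainer = pl.Trainer(
        max_epochs=1,
        gpus=4,
        accelerator='ddp',
    )
    trainer.fit(model, train, val)
    trainer.test(test_dataloaders=test)

if __name__ == "__main__":
    # guard the entry point so the re-launched processes don't re-run training on import
    main()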

awaelchli linked a pull request on Feb 18, 2021 that will close this issue
@IhabBendidi

IhabBendidi commented Feb 26, 2021

I have had this kind of issue (note: I'm working in a terminal on a server, so I'm not in a notebook). Training just got stuck after two epochs when using ddp. I tried a couple of things that didn't work, including reducing the number of workers in the data loader. This only happened when I used 3 GPUs; with two GPUs it didn't happen. CUDA version is 10.2. I also tried another server, and the same issue repeated itself.

Have you tried pytorch-lightning 1.1.6? For me, ddp training gets stuck on 1.1.7 and later. I also wonder what the cause is.

I tried uninstalling 1.1.7 and installing 1.1.6, and it worked without any issue!

@carmocca
Contributor

carmocca commented Mar 1, 2021

Hi @IhabBendidi, can you check whether this comment in #5604 fixes your problem?

That issue might be a more appropriate place for your problem than this one.

@IhabBendidi

Hi @IhabBendidi, can you check whether this comment in #5604 fixes your problem?

That issue might be a more appropriate place for your problem than this one.

That solved it, thanks!
