trainer.fit() stuck with accelerator set to "ddp" #5961

Closed
ifsheldon opened this issue Feb 13, 2021 · 7 comments · Fixed by #5970
Labels
bug (Something isn't working) · help wanted (Open to be worked on)

Comments

@ifsheldon
Contributor

ifsheldon commented Feb 13, 2021

🐛 Bug

The problem is that trainer.fit() with accelerator set to ddp hangs for an extremely long time before it gets the CPUs and GPUs working. I also cannot interrupt the kernel and have to restart it.

Please reproduce using the BoringModel

To Reproduce

I tried the BoringModel, and I can reproduce the issue.

The only modification I made is in the "Define the test" section. The code is below:

def test_x(tmpdir):
    # init model
    model = BoringModel()

    # Initialize a trainer
    trainer = pl.Trainer(
        max_epochs=1,
        progress_bar_refresh_rate=20,
        gpus=4,             # added to use 4 GPUs
        accelerator='ddp',  # added to use ddp
    )

    # Train the model ⚡
    trainer.fit(model, train, val)

    trainer.test(test_dataloaders=test)

My code that originally ran into this issue is in the discussion post.
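
For reference, the snippet above relies on the scaffolding defined in earlier cells of the BoringModel notebook. A minimal, self-contained sketch of that scaffolding (this approximates the notebook; the exact layer sizes, dataset length, and batch size here are placeholders) looks like this:

import torch
from torch.utils.data import DataLoader, Dataset
import pytorch_lightning as pl

class RandomDataset(Dataset):
    # random tensors standing in for real data, as in the notebook
    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len

class BoringModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        loss = self(batch).sum()
        return {"loss": loss}

    def validation_step(self, batch, batch_idx):
        return {"x": self(batch).sum()}

    def test_step(self, batch, batch_idx):
        return {"y": self(batch).sum()}

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=0.1)

# the dataloaders referenced as `train`, `val` and `test` in test_x
train = DataLoader(RandomDataset(32, 64), batch_size=2)
val = DataLoader(RandomDataset(32, 64), batch_size=2)
test = DataLoader(RandomDataset(32, 64), batch_size=2)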

Expected behavior

The expected behavior is that training starts within a couple of minutes; instead, trainer.fit() gets stuck while the GPUs and CPUs stay idle.

Environment

My environment, as detected by the official collection script, is below. I run my code on a shared GPU cluster after requesting compute resources; I usually request 512 GB of memory, 32 cores and 4 V100s. The environment is managed by my personal conda installation, so it does not interfere with other users' environments. If you want to know more about the configuration, just let me know.

(torch) [liangf@gpu208-14 liangf]$ python collect_env_details.py
* CUDA:
        - GPU:
                - Tesla V100-SXM2-32GB
                - Tesla V100-SXM2-32GB
                - Tesla V100-SXM2-32GB
                - Tesla V100-SXM2-32GB
        - available:         True
        - version:           11.0
* Packages:
        - numpy:             1.20.0
        - pyTorch_debug:     False
        - pyTorch_version:   1.7.1
        - pytorch-lightning: 1.1.8
        - tqdm:              4.56.0
* System:
        - OS:                Linux
        - architecture:
                - 64bit
                - ELF
        - processor:         x86_64
        - python:            3.9.1
        - version:           #1 SMP Tue Nov 17 13:59:11 UTC 2020

Additional context

If I change the code above in the BoringModel to the following, the trainer "works" as expected. With accelerator='dp' it takes less than a minute to get everything set up and keeps the CPUs and GPUs busy, while with accelerator='ddp' it takes 10 minutes or more and still has not started running anything before I lose my patience.

def test_x(tmpdir):
    # init model
    model = BoringModel()

    # Initialize a trainer
    trainer = pl.Trainer(
        max_epochs=1,
        progress_bar_refresh_rate=20,
        gpus=4,            # added to use 4 GPUs
        accelerator='dp',  # changed to use dp instead of ddp
    )

    # Train the model ⚡
    trainer.fit(model, train, val)

    trainer.test(test_dataloaders=test)

By "works" I meant it can get GPUs running, but later a runtime error is thrown. And I think this will be another issue, which maybe that the code in the boring model notebook is not runnable in multi-GPU environment. However, I don't know what is the cause, since I am just transferring from ordinary pytorch to pytorch-lightning, and the code in the notebook looks reasonably good for me.

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-10-1f9f6fbe4f6c> in <module>
----> 1 test_x(tmpdir)

<ipython-input-9-8b8914eff5a4> in test_x(tmpdir)
     12 
     13     # Train the model ⚡
---> 14     trainer.fit(model, train, val)
     15 
     16     trainer.test(test_dataloaders=test)

~/miniconda3/envs/torch/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py in fit(self, model, train_dataloader, val_dataloaders, datamodule)
    508         self.call_hook('on_fit_start')
    509 
--> 510         results = self.accelerator_backend.train()
    511         self.accelerator_backend.teardown()
    512 

~/miniconda3/envs/torch/lib/python3.9/site-packages/pytorch_lightning/accelerators/accelerator.py in train(self)
     55     def train(self):
     56         self.trainer.setup_trainer(self.trainer.model)
---> 57         return self.train_or_test()
     58 
     59     def teardown(self):

~/miniconda3/envs/torch/lib/python3.9/site-packages/pytorch_lightning/accelerators/accelerator.py in train_or_test(self)
     72         else:
     73             self.trainer.train_loop.setup_training()
---> 74             results = self.trainer.train()
     75         return results
     76 

~/miniconda3/envs/torch/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py in train(self)
    559                 with self.profiler.profile("run_training_epoch"):
    560                     # run train epoch
--> 561                     self.train_loop.run_training_epoch()
    562 
    563                 if self.max_steps and self.max_steps <= self.global_step:

~/miniconda3/envs/torch/lib/python3.9/site-packages/pytorch_lightning/trainer/training_loop.py in run_training_epoch(self)
    548             # ------------------------------------
    549             with self.trainer.profiler.profile("run_training_batch"):
--> 550                 batch_output = self.run_training_batch(batch, batch_idx, dataloader_idx)
    551 
    552             # when returning -1 from train_step, we end epoch early

~/miniconda3/envs/torch/lib/python3.9/site-packages/pytorch_lightning/trainer/training_loop.py in run_training_batch(self, batch, batch_idx, dataloader_idx)
    716 
    717                         # optimizer step
--> 718                         self.optimizer_step(optimizer, opt_idx, batch_idx, train_step_and_backward_closure)
    719 
    720                     else:

~/miniconda3/envs/torch/lib/python3.9/site-packages/pytorch_lightning/trainer/training_loop.py in optimizer_step(self, optimizer, opt_idx, batch_idx, train_step_and_backward_closure)
    483 
    484         # model hook
--> 485         model_ref.optimizer_step(
    486             self.trainer.current_epoch,
    487             batch_idx,

~/miniconda3/envs/torch/lib/python3.9/site-packages/pytorch_lightning/core/lightning.py in optimizer_step(self, epoch, batch_idx, optimizer, optimizer_idx, optimizer_closure, on_tpu, using_native_amp, using_lbfgs)
   1296 
   1297         """
-> 1298         optimizer.step(closure=optimizer_closure)
   1299 
   1300     def optimizer_zero_grad(

~/miniconda3/envs/torch/lib/python3.9/site-packages/pytorch_lightning/core/optimizer.py in step(self, closure, make_optimizer_step, *args, **kwargs)
    284 
    285         if make_optimizer_step:
--> 286             self.__optimizer_step(*args, closure=closure, profiler_name=profiler_name, **kwargs)
    287         else:
    288             # make sure to call optimizer_closure when accumulating

~/miniconda3/envs/torch/lib/python3.9/site-packages/pytorch_lightning/core/optimizer.py in __optimizer_step(self, closure, profiler_name, *args, **kwargs)
    142         else:
    143             with trainer.profiler.profile(profiler_name):
--> 144                 optimizer.step(closure=closure, *args, **kwargs)
    145 
    146         accelerator_backend = trainer.accelerator_backend

~/miniconda3/envs/torch/lib/python3.9/site-packages/torch/optim/lr_scheduler.py in wrapper(*args, **kwargs)
     65                 instance._step_count += 1
     66                 wrapped = func.__get__(instance, cls)
---> 67                 return wrapped(*args, **kwargs)
     68 
     69             # Note that the returned function here is no longer a bound method,

~/miniconda3/envs/torch/lib/python3.9/site-packages/torch/autograd/grad_mode.py in decorate_context(*args, **kwargs)
     24         def decorate_context(*args, **kwargs):
     25             with self.__class__():
---> 26                 return func(*args, **kwargs)
     27         return cast(F, decorate_context)
     28 

~/miniconda3/envs/torch/lib/python3.9/site-packages/torch/optim/sgd.py in step(self, closure)
     84         if closure is not None:
     85             with torch.enable_grad():
---> 86                 loss = closure()
     87 
     88         for group in self.param_groups:

~/miniconda3/envs/torch/lib/python3.9/site-packages/pytorch_lightning/trainer/training_loop.py in train_step_and_backward_closure()
    706 
    707                         def train_step_and_backward_closure():
--> 708                             result = self.training_step_and_backward(
    709                                 split_batch,
    710                                 batch_idx,

~/miniconda3/envs/torch/lib/python3.9/site-packages/pytorch_lightning/trainer/training_loop.py in training_step_and_backward(self, split_batch, batch_idx, opt_idx, optimizer, hiddens)
    814                 # backward pass
    815                 with self.trainer.profiler.profile("model_backward"):
--> 816                     self.backward(result, optimizer, opt_idx)
    817 
    818                 # hook - call this hook only

~/miniconda3/envs/torch/lib/python3.9/site-packages/pytorch_lightning/trainer/training_loop.py in backward(self, result, optimizer, opt_idx, *args, **kwargs)
    840             self.trainer.accelerator_backend.backward(result, optimizer, opt_idx, *args, **kwargs)
    841         else:
--> 842             result.closure_loss = self.trainer.accelerator_backend.backward(
    843                 result.closure_loss, optimizer, opt_idx, *args, **kwargs
    844             )

~/miniconda3/envs/torch/lib/python3.9/site-packages/pytorch_lightning/accelerators/accelerator.py in backward(self, closure_loss, optimizer, opt_idx, *args, **kwargs)
    107             # do backward pass
    108             model = self.trainer.get_model()
--> 109             model.backward(closure_loss, optimizer, opt_idx, *args, **kwargs)
    110 
    111             # once backward has been applied, release graph

~/miniconda3/envs/torch/lib/python3.9/site-packages/pytorch_lightning/core/lightning.py in backward(self, loss, optimizer, optimizer_idx, *args, **kwargs)
   1160         """
   1161         if self.trainer.train_loop.automatic_optimization or self._running_manual_backward:
-> 1162             loss.backward(*args, **kwargs)
   1163 
   1164     def toggle_optimizer(self, optimizer: Optimizer, optimizer_idx: int):

~/miniconda3/envs/torch/lib/python3.9/site-packages/torch/tensor.py in backward(self, gradient, retain_graph, create_graph)
    219                 retain_graph=retain_graph,
    220                 create_graph=create_graph)
--> 221         torch.autograd.backward(self, gradient, retain_graph, create_graph)
    222 
    223     def register_hook(self, hook):

~/miniconda3/envs/torch/lib/python3.9/site-packages/torch/autograd/__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables)
    124 
    125     grad_tensors_ = _tensor_or_tensors_to_tuple(grad_tensors, len(tensors))
--> 126     grad_tensors_ = _make_grads(tensors, grad_tensors_)
    127     if retain_graph is None:
    128         retain_graph = create_graph

~/miniconda3/envs/torch/lib/python3.9/site-packages/torch/autograd/__init__.py in _make_grads(outputs, grads)
     48             if out.requires_grad:
     49                 if out.numel() != 1:
---> 50                     raise RuntimeError("grad can be implicitly created only for scalar outputs")
     51                 new_grads.append(torch.ones_like(out, memory_format=torch.preserve_format))
     52             else:

RuntimeError: grad can be implicitly created only for scalar outputs
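
The traceback bottoms out in loss.backward() being called on a non-scalar tensor: under dp, training_step runs on each GPU and the per-GPU losses are gathered into a vector before backward. One common remedy is to reduce the gathered losses in training_step_end. A minimal sketch, assuming the BoringModel scaffolding sketched above (not the notebook's actual code):

class BoringModelForDP(BoringModel):
    # In dp mode, training_step_end receives the outputs gathered from all GPUs,
    # so outputs["loss"] is a tensor with one entry per GPU. Averaging it makes
    # the loss a scalar again before backward() is called.
    def training_step_end(self, outputs):
        return {"loss": outputs["loss"].mean()}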
ifsheldon added the bug and help wanted labels on Feb 13, 2021
@awaelchli
Contributor

I answered in the discussion post about the usage of ddp in a Jupyter environment.
About your second error:

By "works" I meant it can get GPUs running, but later a runtime error is thrown. And I think this will be another issue, which maybe that the code in the boring model notebook is not runnable in multi-GPU environment.

The BoringModel runs fine on multiple GPUs (with the current code from this repo); I can confirm that on master and also with your version, 1.1.8.

Please post your full BoringModel as one script so that I can run it.

@ifsheldon
Contributor Author

The only modifications I made are gpus=4 and the accelerator, but anyway, my notebook is here and you can see the traceback and environment settings. I ran it using Jupyter Lab on a shared GPU cluster. Neither accelerator='dp' nor accelerator='ddp_spawn' works, which is weird.

@sooperset

Have you tried pytorch-lightning 1.1.6? For me, ddp training gets stuck on 1.1.7 and later. I also wonder what the cause is.

@tchaton
Contributor

tchaton commented Feb 15, 2021

Dear @ifsheldon,

ddp doesn't work in a notebook, if that is what you are trying to do.

Best,
T.C
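
For anyone moving off the notebook, here is a minimal sketch of the same run as a standalone script; with accelerator='ddp', Lightning starts the extra GPU processes by re-launching the script, which is why it cannot run from inside Jupyter. The file name and the `boring` module import are hypothetical; the model and dataloaders are assumed to be the ones sketched earlier in this thread.

# train_ddp.py -- run with `python train_ddp.py` from a terminal, not inside Jupyter
import pytorch_lightning as pl

# hypothetical module holding the BoringModel and dataloaders sketched above
from boring import BoringModel, train, val, test

def main():
    model = BoringModel()
    trainer = pl.Trainer(
        max_epochs=1,
        gpus=4,
        accelerator='ddp',
    )
    trainer.fit(model, train, val)
    trainer.test(test_dataloaders=test)

if __name__ == "__main__":
    # guard the entry point so the re-launched processes don't re-run training on import
    main()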

awaelchli linked a pull request on Feb 18, 2021 that will close this issue
@IhabBendidi

IhabBendidi commented Feb 26, 2021

I have had this kind of issue (note: I'm working in a terminal on a server, so I'm not in a notebook). Training just got stuck after two epochs when using ddp. I tried a couple of things that didn't work, including reducing the number of workers in the data loader. This only happened when I used 3 GPUs; with two GPUs it didn't happen. CUDA version is 10.2. I also tried another server, and the same issue repeated itself.

Have you tried pytorch-lightning 1.1.6? For me, ddp training gets stuck on 1.1.7 and later. I also wonder what the cause is.

I tried uninstalling 1.1.7 and installing 1.1.6, and it worked without any issue!

@carmocca
Contributor

carmocca commented Mar 1, 2021

Hi @IhabBendidi, can you check whether this comment in #5604 fixes your problem?

That issue might be a more appropriate place for your problem than this one.

@IhabBendidi

Hi @IhabBendidi, can you check whether this comment in #5604 fixes your problem?

That issue might be a more appropriate place for your problem than this one.

That solved it, thanks!
