
Adagrad not working with GPU and DDP #6824

Closed
qianlivia opened this issue Apr 4, 2021 · 1 comment · Fixed by #7277

qianlivia commented Apr 4, 2021

🐛 Bug

Adagrad doesn't work with GPUs and DDP because the optimizer is created before the model is moved to CUDA, so the optimizer state stays on the CPU. I believe this issue had already been addressed in an earlier version: #554
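
For context, my understanding is that Adagrad (as of PyTorch 1.7) initializes its per-parameter state eagerly in its constructor, so the state tensors are allocated on whatever device the parameters live on at that moment. A minimal sketch outside of Lightning (variable names are only for illustration):

import torch

layer = torch.nn.Linear(32, 2)                            # parameters are still on the CPU here
optimizer = torch.optim.Adagrad(layer.parameters(), lr=0.1)

# Adagrad allocates state["sum"] at construction time, on the parameters' current device.
print(optimizer.state[layer.weight]["sum"].device)        # cpu

if torch.cuda.is_available():
    layer.cuda()                                          # moving the model afterwards does not move the state
    print(layer.weight.device)                            # cuda:0
    print(optimizer.state[layer.weight]["sum"].device)    # still cpu -> device mismatch in step()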

How to reproduce using the BoringModel

https://colab.research.google.com/drive/1HfyL5htoOkPETggTLwYNfh94HrNc6TOS?usp=sharing

The error emerged when I tried using Adagrad with both a single GPU and multiple GPUs. The sketch below shows the relevant setup.
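
The Colab uses the BoringModel template; the relevant part is roughly the following (a minimal sketch from memory, so dataset sizes and trainer arguments may differ slightly from the notebook):

import torch
from torch.utils.data import DataLoader, Dataset
from pytorch_lightning import LightningModule, Trainer


class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return len(self.data)


class BoringModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        return self(batch).sum()

    def validation_step(self, batch, batch_idx):
        self(batch).sum()

    def configure_optimizers(self):
        # The optimizer (and therefore Adagrad's state) is created while the
        # parameters are still on the CPU; Lightning moves the model to the GPU later.
        return torch.optim.Adagrad(self.layer.parameters(), lr=0.1)


if __name__ == "__main__":
    train = DataLoader(RandomDataset(32, 6400), batch_size=2)
    val = DataLoader(RandomDataset(32, 6400), batch_size=2)
    model = BoringModel()
    trainer = Trainer(gpus=2, accelerator="ddp", max_epochs=1)
    trainer.fit(model, train, val)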

Stack trace

LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1]
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/2
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/2

  | Name  | Type   | Params
---------------------------------
0 | layer | Linear | 66    
---------------------------------
66        Trainable params
0         Non-trainable params
66        Total params
0.000     Total estimated model params size (MB)
/home/rajmund/miniconda3/envs/test/lib/python3.6/site-packages/pytorch_lightning/utilities/distributed.py:52: UserWarning: The dataloader, val dataloader 0, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 20 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  warnings.warn(*args, **kwargs)
/home/rajmund/miniconda3/envs/test/lib/python3.6/site-packages/pytorch_lightning/utilities/distributed.py:52: UserWarning: The dataloader, train dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 20 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  warnings.warn(*args, **kwargs)
Epoch 0:   0%|                                          | 0/314 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/rajmund/test.py", line 118, in <module>
    test_x(tmpdir)
  File "/home/rajmund/test.py", line 110, in test_x
    trainer.fit(model, train, val)
  File "/home/rajmund/miniconda3/envs/test/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 499, in fit
    self.dispatch()
  File "/home/rajmund/miniconda3/envs/test/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 546, in dispatch
    self.accelerator.start_training(self)
  File "/home/rajmund/miniconda3/envs/test/lib/python3.6/site-packages/pytorch_lightning/accelerators/accelerator.py", line 73, in start_training
    self.training_type_plugin.start_training(trainer)
  File "/home/rajmund/miniconda3/envs/test/lib/python3.6/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 114, in start_training
    self._results = trainer.run_train()
  File "/home/rajmund/miniconda3/envs/test/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 637, in run_train
    self.train_loop.run_training_epoch()
  File "/home/rajmund/miniconda3/envs/test/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 492, in run_training_epoch
    batch_output = self.run_training_batch(batch, batch_idx, dataloader_idx)
  File "/home/rajmund/miniconda3/envs/test/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 654, in run_training_batch
    self.optimizer_step(optimizer, opt_idx, batch_idx, train_step_and_backward_closure)
  File "/home/rajmund/miniconda3/envs/test/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 433, in optimizer_step
    using_lbfgs=is_lbfgs,
  File "/home/rajmund/miniconda3/envs/test/lib/python3.6/site-packages/pytorch_lightning/core/lightning.py", line 1390, in optimizer_step
    optimizer.step(closure=optimizer_closure)
  File "/home/rajmund/miniconda3/envs/test/lib/python3.6/site-packages/pytorch_lightning/core/optimizer.py", line 214, in step
    self.__optimizer_step(*args, closure=closure, profiler_name=profiler_name, **kwargs)
  File "/home/rajmund/miniconda3/envs/test/lib/python3.6/site-packages/pytorch_lightning/core/optimizer.py", line 134, in __optimizer_step
    trainer.accelerator.optimizer_step(optimizer, self._optimizer_idx, lambda_closure=closure, **kwargs)
  File "/home/rajmund/miniconda3/envs/test/lib/python3.6/site-packages/pytorch_lightning/accelerators/accelerator.py", line 277, in optimizer_step
    self.run_optimizer_step(optimizer, opt_idx, lambda_closure, **kwargs)
  File "/home/rajmund/miniconda3/envs/test/lib/python3.6/site-packages/pytorch_lightning/accelerators/accelerator.py", line 282, in run_optimizer_step
    self.training_type_plugin.optimizer_step(optimizer, lambda_closure=lambda_closure, **kwargs)
  File "/home/rajmund/miniconda3/envs/test/lib/python3.6/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 163, in optimizer_step
    optimizer.step(closure=lambda_closure, **kwargs)
  File "/home/rajmund/miniconda3/envs/test/lib/python3.6/site-packages/torch/optim/lr_scheduler.py", line 67, in wrapper
    return wrapped(*args, **kwargs)
  File "/home/rajmund/miniconda3/envs/test/lib/python3.6/site-packages/torch/autograd/grad_mode.py", line 26, in decorate_context
    return func(*args, **kwargs)
  File "/home/rajmund/miniconda3/envs/test/lib/python3.6/site-packages/torch/optim/adagrad.py", line 90, in step
    group['eps'])
  File "/home/rajmund/miniconda3/envs/test/lib/python3.6/site-packages/torch/optim/functional.py", line 48, in adagrad
    state_sum.addcmul_(grad, grad, value=1)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

The second DDP process fails with the same traceback:

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cpu!
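
As a temporary workaround, something along these lines seems to avoid the crash by moving the optimizer state to the model's device once training starts (a sketch only, added to the LightningModule above; the hook and attributes are standard Lightning/PyTorch ones, but I have not verified this beyond the minimal case):

    def on_train_start(self):
        # Sketch of a workaround: Adagrad's state was created on the CPU in
        # configure_optimizers, so move every state tensor to the model's device
        # before the first optimizer.step().
        for optimizer in self.trainer.optimizers:
            for state in optimizer.state.values():
                for key, value in state.items():
                    if isinstance(value, torch.Tensor):
                        state[key] = value.to(self.device)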

Environment

  • PyTorch Version: 1.7.1
  • PyTorch Lightning: 1.2.6
  • OS: Linux
  • How you installed PyTorch: pip
  • Python version: 3.6
  • CUDA/cuDNN version: 10.1
  • GPU models and configuration: Titan Xp
  • Any other relevant information: -
qianlivia added the bug and help wanted labels Apr 4, 2021
qianlivia changed the title from "Adagrad not working with GPU" to "Adagrad not working with GPU and DDP" Apr 5, 2021

Borda (Member) commented Apr 12, 2021

@awaelchli this looks like some accelerator issue, right?

Borda added the priority: 1 (Medium priority task) label Apr 12, 2021
edenlightning added the distributed and priority: 0 (High priority task) labels and removed the priority: 1 label Apr 13, 2021
edenlightning added this to the v1.3 milestone Apr 27, 2021