
DDPShardedPlugin consolidate_state_dict RuntimeError #5646

Closed
robogast opened this issue Jan 25, 2021 · 35 comments · Fixed by facebookresearch/fairscale#323
Labels
3rd party · bug · distributed · waiting on author · won't fix

Comments

@robogast

robogast commented Jan 25, 2021

🐛 Bug

After a (seemingly arbitrary) number of steps/epochs, DDPShardedPlugin::optimizer_state crashes in its consolidate_state_dict call:

  1. PyTorch's distributed broadcast_object_list tries object_tensor = torch.ByteTensor(torch.sum(object_sizes_tensor).item()) (see the sketch below)
  2. RuntimeError: Trying to create tensor with negative dimension -5193452289200645882: [-5193452289200645882]
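
For illustration, a minimal sketch of that failing line's behaviour, assuming the size tensor already holds garbage bytes (the value below is the one reported in the error, used here as a stand-in, not the actual tensor from the run):

import torch

# broadcast_object_list first exchanges per-object byte sizes, then allocates a
# ByteTensor of their sum; garbage bytes reinterpreted as int64 yield a huge
# negative size, which is exactly the error reported here.
object_sizes_tensor = torch.tensor([-5193452289200645882], dtype=torch.long)
try:
    torch.ByteTensor(torch.sum(object_sizes_tensor).item())
except RuntimeError as err:
    print(err)  # Trying to create tensor with negative dimension ...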

Stacktrace:

Traceback (most recent call last):
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 560, in train
    self.train_loop.run_training_epoch()
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 562, in run_training_epoch
    self.trainer.run_evaluation()
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 667, in run_evaluation
    self.evaluation_loop.on_evaluation_end()
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 110, in on_evaluation_end
    self.trainer.call_hook('on_validation_end', *args, **kwargs)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 924, in call_hook
    trainer_hook(*args, **kwargs)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/callback_hook.py", line 177, in on_validation_end
    callback.on_validation_end(self, self.get_model())
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 203, in on_validation_end
    self.save_checkpoint(trainer, pl_module)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 253, in save_checkpoint
    self._save_last_checkpoint(trainer, pl_module, monitor_candidates)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 567, in _save_last_checkpoint
    self._save_model(last_filepath, trainer, pl_module)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 361, in _save_model
    self.save_function(filepath, self.save_weights_only)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/properties.py", line 257, in save_checkpoint
    self.checkpoint_connector.save_checkpoint(filepath, weights_only)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 392, in save_checkpoint
    checkpoint = self.dump_checkpoint(weights_only)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 283, in dump_checkpoint
    optimizer_state = self.trainer.accelerator_backend.optimizer_state(optimizer)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 206, in optimizer_state
    return self.ddp_plugin.optimizer_state(optimizer)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/plugins/sharded_plugin.py", line 42, in optimizer_state
    optimizer.consolidate_state_dict()
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/fairscale/optim/oss.py", line 320, in consolidate_state_dict
    self._broadcast_state_dict()
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/fairscale/optim/oss.py", line 349, in _broadcast_state_dict
    dist.broadcast_object_list([0], src=global_rank, group=self.group)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1687, in broadcast_object_list
    object_tensor = torch.ByteTensor(torch.sum(object_sizes_tensor).item())
RuntimeError: Trying to create tensor with negative dimension -5193452289200645882: [-5193452289200645882]

Environment

  • CUDA:
    • GPU:
      • TITAN RTX
    • available: True
    • version: 11.0
  • Packages:
    • numpy: 1.18.1
    • pyTorch_debug: False
    • pyTorch_version: 1.8.0.dev20210122
    • pytorch-lightning: 1.1.4
    • tqdm: 4.48.2
  • System:
    • OS: Linux
    • architecture:
      • 64bit
      • ELF
    • processor:
    • python: 3.8.3
    • version: #1 SMP Debian 4.19.160-2 (2020-11-28)

cc @awaelchli @rohitgr7 @akihironitta

@robogast added the bug and help wanted labels Jan 25, 2021
@SeanNaren added the distributed and 3rd party labels Jan 25, 2021
@SeanNaren self-assigned this Jan 25, 2021
@SeanNaren
Contributor

CC @blefaudeux who may have some more insight!

Another point: this happens randomly in the middle of training when saving the model, so it can occur after a few successful iterations. I was wondering if we're running out of memory at all.

@blefaudeux

Oh wow, I never saw that, interesting. In recent fairscale I switched to using PyTorch's broadcast object util instead of a dedicated one; it looks like it can fail somehow. One check would be to try setting _torch_broadcast_object to False. I'll check with Rohan (author of this part) whether there's something I'm doing wrong.

@blefaudeux

I pushed a small change which should fix that, working around this case, which does not seem to be handled very well by this util.

@robogast
Author

Cool, I'll give it a try today and keep you up to date :)

@robogast
Author

I now seem to get some kind of deadlock within the broadcast:
(it was caught by setting export NCCL_ASYNC_ERROR_HANDLING=1)

Validating: 0it [00:00, ?it/s]

Validating:   0%|          | 0/6 [00:00<?, ?it/s]

Validating:  17%|#6        | 1/6 [00:09<00:46,  9.37s/it]
Epoch 1:  97%|#########7| 115/118 [14:00<00:21,  7.31s/it, loss=0.052, v_num=7388088, recon loss=0.0364, commitment loss=0.00812]

Validating:  33%|###3      | 2/6 [00:10<00:27,  6.93s/it]

Validating:  50%|#####     | 3/6 [00:11<00:15,  5.23s/it]
Epoch 1:  99%|#########9| 117/118 [14:02<00:07,  7.20s/it, loss=0.052, v_num=7388088, recon loss=0.0364, commitment loss=0.00812]

Validating:  67%|######6   | 4/6 [00:13<00:08,  4.03s/it]

Validating:  83%|########3 | 5/6 [00:14<00:03,  3.19s/it]
[E ProcessGroupNCCL.cpp:485] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1809029 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:485] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1809035 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:485] [Rank 17] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1809025 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:485] [Rank 16] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1809023 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:485] [Rank 19] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1809027 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:485] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1804391 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:485] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1804393 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:485] [Rank 18] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1809055 milliseconds before timing out.

r35n1:11710:11792 [0] include/socket.h:416 NCCL WARN Net : Connection closed by remote peer
r35n1:11710:11792 [0] NCCL INFO transport/net_socket.cc:405 -> 2
r35n1:11710:11792 [0] NCCL INFO include/net.h:28 -> 2
r35n1:11710:11792 [0] NCCL INFO transport/net.cc:357 -> 2
r35n1:11710:11792 [0] NCCL INFO proxy.cc:198 -> 2 [Proxy Thread]
[E ProcessGroupNCCL.cpp:485] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1804424 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:485] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1804433 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:485] [Rank 12] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1801009 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:485] [Rank 21] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1809152 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:485] [Rank 10] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1804461 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:485] [Rank 20] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1809173 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:485] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1804491 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:485] [Rank 23] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1809211 milliseconds before timing out.
Traceback (most recent call last):
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 561, in train
    self.train_loop.run_training_epoch()
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 562, in run_training_epoch
    self.trainer.run_evaluation()
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 668, in run_evaluation
    self.evaluation_loop.on_evaluation_end()
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 110, in on_evaluation_end
    self.trainer.call_hook('on_validation_end', *args, **kwargs)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 925, in call_hook
    trainer_hook(*args, **kwargs)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/callback_hook.py", line 177, in on_validation_end
    callback.on_validation_end(self, self.get_model())
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 204, in on_validation_end
    self.save_checkpoint(trainer, pl_module)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 254, in save_checkpoint
    self._save_last_checkpoint(trainer, pl_module, monitor_candidates)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 568, in _save_last_checkpoint
    self._save_model(last_filepath, trainer, pl_module)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 362, in _save_model
    self.save_function(filepath, self.save_weights_only)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/properties.py", line 257, in save_checkpoint
    self.checkpoint_connector.save_checkpoint(filepath, weights_only)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 391, in save_checkpoint
    checkpoint = self.dump_checkpoint(weights_only)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 282, in dump_checkpoint
    optimizer_state = self.trainer.accelerator_backend.optimizer_state(optimizer)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 206, in optimizer_state
    return self.ddp_plugin.optimizer_state(optimizer)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/plugins/sharded_plugin.py", line 42, in optimizer_state
    optimizer.consolidate_state_dict()
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/fairscale/optim/oss.py", line 320, in consolidate_state_dict
    self._broadcast_state_dict()
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/fairscale/optim/oss.py", line 352, in _broadcast_state_dict
    dist.broadcast_object_list([dummy_sync_tensor], src=global_rank, group=self.group)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1691, in broadcast_object_list
    broadcast(object_tensor, src=src, group=group)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1039, in broadcast
    work = default_pg.broadcast([tensor], opts)
RuntimeError: NCCL communicator was aborted.

@robogast
Author

robogast commented Jan 26, 2021

cc @blefaudeux

I can hereby (sadly) say that the code change didn't conclusively fix the issue; I ran into the same error again:

Traceback (most recent call last):
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 561, in train
    self.train_loop.run_training_epoch()
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 562, in run_training_epoch
    self.trainer.run_evaluation()
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 668, in run_evaluation
    self.evaluation_loop.on_evaluation_end()
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 110, in on_evaluation_end
    self.trainer.call_hook('on_validation_end', *args, **kwargs)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 925, in call_hook
    trainer_hook(*args, **kwargs)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/callback_hook.py", line 177, in on_validation_end
    callback.on_validation_end(self, self.get_model())
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 204, in on_validation_end
    self.save_checkpoint(trainer, pl_module)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 254, in save_checkpoint
    self._save_last_checkpoint(trainer, pl_module, monitor_candidates)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 568, in _save_last_checkpoint
    self._save_model(last_filepath, trainer, pl_module)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 362, in _save_model
    self.save_function(filepath, self.save_weights_only)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/properties.py", line 257, in save_checkpoint
    self.checkpoint_connector.save_checkpoint(filepath, weights_only)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 391, in save_checkpoint
    checkpoint = self.dump_checkpoint(weights_only)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 282, in dump_checkpoint
    optimizer_state = self.trainer.accelerator_backend.optimizer_state(optimizer)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 206, in optimizer_state
    return self.ddp_plugin.optimizer_state(optimizer)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/plugins/sharded_plugin.py", line 42, in optimizer_state
    optimizer.consolidate_state_dict()
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/fairscale/optim/oss.py", line 320, in consolidate_state_dict
    self._broadcast_state_dict()
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/fairscale/optim/oss.py", line 352, in _broadcast_state_dict
    dist.broadcast_object_list([dummy_sync_tensor], src=global_rank, group=self.group)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1687, in broadcast_object_list
    object_tensor = torch.ByteTensor(torch.sum(object_sizes_tensor).item())
RuntimeError: Trying to create tensor with negative dimension -5185613574447507032: [-5185613574447507032]

@blefaudeux

cc @rohan-varma, author of this part. It's certainly not normal; really sorry about that @robogast. Could you try setting OSS._torch_broadcast_object to False, so that it falls back to the more mundane implementation?
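
A rough sketch of how one might apply that fallback from a LightningModule; the _torch_broadcast_object attribute and the hook placement are taken from this thread rather than a documented API, and MyLightningModule is a hypothetical name:

import pytorch_lightning as pl
from fairscale.optim import OSS

class MyLightningModule(pl.LightningModule):
    def on_train_start(self):
        # The sharded plugin wraps the user's optimizers in fairscale's OSS;
        # flip the (assumed) flag on each of them before any checkpoint is
        # saved, so consolidate_state_dict() falls back to fairscale's own
        # broadcast helper instead of torch's broadcast_object_list.
        for opt in self.trainer.optimizers:
            if isinstance(opt, OSS):
                opt._torch_broadcast_object = False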

@blefaudeux

Also @robogast, do you see the issue on PyTorch stable (1.7.1)? I've never seen it myself, so it would be nice to narrow this down to a specific combination, if any.

@rohan-varma

@robogast Thanks for flagging this! Do you have a runnable script that can reproduce the issue? In the meantime, hopefully the fallback @blefaudeux suggested can unblock.

@robogast
Author

robogast commented Jan 26, 2021

@blefaudeux Yes, this also happened before PyTorch 1.7.1.

Let me pull out some old stack traces; I believe it's the same issue with the old code.
I posted this on the 21st of December in the PyTorch Lightning Slack channel.

I can see if I can reproduce it with a minimal example, but that would take some time.

Stacktrace 1

Validating:  83%|########3 | 5/6 [00:21<00:04,  4.66s/it]
Epoch 10:  50%|#####     | 59/118 [09:39<09:39,  9.82s/it, loss=0.0056, v_num=7158123]

Validating: 100%|##########| 6/6 [00:23<00:00,  3.98s/it]
Epoch 10:  51%|#####     | 60/118 [09:42<09:22,  9.70s/it, loss=0.0056, v_num=7158123]
Epoch 10:  51%|#####     | 60/118 [09:49<09:30,  9.83s/it, loss=0.0056, v_num=7158123]
Traceback (most recent call last):
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 521, in train
    self.train_loop.run_training_epoch()
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 588, in run_training_epoch
    self.trainer.run_evaluation(test_mode=False)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 628, in run_evaluation
    self.evaluation_loop.on_evaluation_end()
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 111, in on_evaluation_end
    self.trainer.call_hook('on_validation_end', *args, **kwargs)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 887, in call_hook
    trainer_hook(*args, **kwargs)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/callback_hook.py", line 177, in on_validation_end
    callback.on_validation_end(self, self.get_model())
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 204, in on_validation_end
    self.save_checkpoint(trainer, pl_module)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 252, in save_checkpoint
    self._save_last_checkpoint(trainer, pl_module, monitor_candidates)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 567, in _save_last_checkpoint
    self._save_model(last_filepath, trainer, pl_module)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 361, in _save_model
    self.save_function(filepath, self.save_weights_only)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/properties.py", line 236, in save_checkpoint
    self.checkpoint_connector.save_checkpoint(filepath, weights_only)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 382, in save_checkpoint
    checkpoint = self.dump_checkpoint(weights_only)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 273, in dump_checkpoint
    optimizer_state = self.trainer.accelerator_backend.optimizer_state(optimizer)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 201, in optimizer_state
    return self.ddp_plugin.optimizer_state(optimizer)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/plugins/sharded_plugin.py", line 42, in optimizer_state
    optimizer.consolidate_state_dict()
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/fairscale/optim/oss.py", line 299, in consolidate_state_dict
    self._broadcast_state_dict()
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/fairscale/optim/oss.py", line 482, in _broadcast_state_dict
    broadcast_object(empty_buffer, src_rank=global_rank, group=self.group, dist_device=self._device)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/fairscale/optim/utils.py", line 123, in broadcast_object
    data_recv_tensor = torch.empty([int(length_tensor.item())], dtype=torch.uint8, device=dist_device)
RuntimeError: CUDA out of memory. Tried to allocate 3968492912.96 GiB (GPU 2; 23.65 GiB total capacity; 64.10 MiB already allocated; 21.70 GiB free; 954.00 MiB reserved in total by PyTorch)

Stacktrace 2

Validating:  83%|########3 | 5/6 [00:16<00:03,  3.68s/it]
Validating: 100%|##########| 6/6 [00:18<00:00,  3.14s/it]
Epoch 11: 100%|##########| 118/118 [20:31<00:00, 10.44s/it, loss=0.00465, v_num=7158119]
Traceback (most recent call last):
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 521, in train
    self.train_loop.run_training_epoch()
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 588, in run_training_epoch
    self.trainer.run_evaluation(test_mode=False)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 628, in run_evaluation
    self.evaluation_loop.on_evaluation_end()
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 111, in on_evaluation_end
    self.trainer.call_hook('on_validation_end', *args, **kwargs)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 887, in call_hook
    trainer_hook(*args, **kwargs)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/callback_hook.py", line 177, in on_validation_end
    callback.on_validation_end(self, self.get_model())
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 204, in on_validation_end
    self.save_checkpoint(trainer, pl_module)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 252, in save_checkpoint
    self._save_last_checkpoint(trainer, pl_module, monitor_candidates)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 567, in _save_last_checkpoint
    self._save_model(last_filepath, trainer, pl_module)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 361, in _save_model
    self.save_function(filepath, self.save_weights_only)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/properties.py", line 236, in save_checkpoint
    self.checkpoint_connector.save_checkpoint(filepath, weights_only)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 382, in save_checkpoint
    checkpoint = self.dump_checkpoint(weights_only)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 273, in dump_checkpoint
    optimizer_state = self.trainer.accelerator_backend.optimizer_state(optimizer)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 201, in optimizer_state
    return self.ddp_plugin.optimizer_state(optimizer)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/plugins/sharded_plugin.py", line 42, in optimizer_state
    optimizer.consolidate_state_dict()
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/fairscale/optim/oss.py", line 299, in consolidate_state_dict
    self._broadcast_state_dict()
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/fairscale/optim/oss.py", line 482, in _broadcast_state_dict
    broadcast_object(empty_buffer, src_rank=global_rank, group=self.group, dist_device=self._device)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/fairscale/optim/utils.py", line 123, in broadcast_object
    data_recv_tensor = torch.empty([int(length_tensor.item())], dtype=torch.uint8, device=dist_device)
RuntimeError: Trying to create tensor with negative dimension -4803151849830123752: [-4803151849830123752]

@robogast
Author

robogast commented Jan 26, 2021

By the way @rohan-varma, since _broadcast_state_dict and _collect_sharded_states are so interconnected, shouldn't they be merged into one _collect_sharded_states function?

  • They perform the same syncing behaviour
  • They depend on one another for this syncing behaviour
  • Therefore a change to one function is a change to the other

You can also see that this already went wrong(!): in the bugfix where the 0 value was replaced with some dummy tensor ( facebookresearch/fairscale#323 ), if some magic happened in broadcast_object_list that depended on the object itself, changing _broadcast_state_dict would have already caused a bug in _collect_sharded_states!

But maybe that's a discussion for the fairscale repo? :)

@blefaudeux

blefaudeux commented Jan 26, 2021

(quoting the two stack traces from the previous comment)

You've seen that it OOMs, right? Anything after that is pretty hard to debug; the collective communication primitives assume that every rank does the same thing, which is obviously not the case here.

edit: wow, I missed that initially, the size of that allocation is nuts

@blefaudeux

blefaudeux commented Jan 26, 2021

You can also see that this already went wrong(!): in the bugfix where the 0 value was replaced with some dummy tensor ( facebookresearch/fairscale#323 ), if some magic happened in broadcast_object_list that depended on the object itself, changing _broadcast_state_dict would have already caused a bug in _collect_sharded_states!

The only change in the bugfix was to pass something (a tensor of size 1), in case the issue was in the serialization of 0 (which a bug could treat as 'null' and break on). That's all; if the util was not broken to begin with, then this change does not break anything either, so I'm not sure why it would be 'broken'. Of course some magic happens in broadcast_object_list which depends on the object; at that point not all ranks have the same object, and for the trace you're seeing to happen, some of this magic needs to be broken.

@robogast
Author

To be clear: those are two different stack traces.
They happen on the exact same line of code, so I assume they're related.

@blefaudeux

By the way @rohan-varma, since _broadcast_state_dict and _collect_sharded_states are so interconnected, shouldn't they be merged into one _collect_sharded_states function?

Coding style preference: keep functions not too big, and consolidate_state_dict is the public interface; it's easier to understand what it does, I believe, at 20 LOC rather than 200. But that's beside the point.

@blefaudeux

@robogast just in case, would you know what optimizer you're using?

@robogast
Author

robogast commented Jan 26, 2021

PyTorch native Adam with amsgrad (wrapped by Lightning, of course)

@blefaudeux

Still trying to wrap my head around that. Just to try to get somewhere: there's something really strange with the trace above, in that there's almost no memory allocated when it fails, even though there should at least be the model plus probably the last or next batch. @SeanNaren @ananthsub, is there something I should know that happens around checkpointing in Lightning? Like some items being deleted?

@SeanNaren
Contributor

Thanks for the investigation, guys; it's totally possible that something on the Lightning side is messing this up. When we save the model we save model states/optimizer states and some metadata (like arguments given to the LightningModule), but that covers most of it. @robogast, are you able to share any code we could use to reproduce? It would help us pinpoint what's causing this issue!
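
For anyone who wants to try, a minimal skeleton one could adapt to attempt a reproduction of the setup described in this thread (PL ~1.1-era API: the ddp_sharded plugin plus a ModelCheckpoint monitoring a validation metric); the model and data here are placeholders, not the reporter's code:

import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint

class ToyModule(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 32)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        (x,) = batch
        return torch.nn.functional.mse_loss(self(x), x)

    def validation_step(self, batch, batch_idx):
        (x,) = batch
        # logged metric that ModelCheckpoint monitors, mirroring the reporter's setup
        self.log("val_loss", torch.nn.functional.mse_loss(self(x), x))

    def configure_optimizers(self):
        # plain Adam with amsgrad, as mentioned later in the thread
        return torch.optim.Adam(self.parameters(), lr=1e-3, amsgrad=True)

def loader():
    return DataLoader(TensorDataset(torch.randn(256, 32)), batch_size=16)

if __name__ == "__main__":
    checkpoint = ModelCheckpoint(monitor="val_loss", save_last=True)
    trainer = pl.Trainer(
        gpus=-1,
        accelerator="ddp",
        plugins="ddp_sharded",
        max_epochs=50,
        callbacks=[checkpoint],
    )
    trainer.fit(ToyModule(), loader(), loader())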

@blefaudeux

@robogast @SeanNaren one option is that it's all because of the use of the torch broadcast helper: it was also being used in 1.7.1, and it's subtly broken or hard to use (it assumes rank == device). This could explain why the GPU has nothing on it at that point; maybe you were using pipe or something and the GPU being used is not the correct one. I've removed that from upstream fairscale, awaiting confirmation, but that could be it. I've never tested this configuration myself, which could explain why I've never seen the issue.

@robogast
Author

Sorry, that might not have been clear: I'm using the nightly version of PyTorch exactly because I ran into the bug where broadcast_object_list was broken due to rank != device.
I've spent the day chasing the error, but nothing so far.

@blefaudeux

Sorry, that might not have been clear: I'm using the nightly version of PyTorch exactly because I ran into the bug where broadcast_object_list was broken due to rank != device.
I've spent the day chasing the error, but nothing so far.

Even with the newer torch, this util assumes that the current torch device has been set to whatever is passed, which I was not doing. Could you try setting _torch_broadcast_object to False, or test this branch, by any chance?
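
For reference, a minimal sketch of the precaution being described here, assuming one process per GPU; the function name and arguments are illustrative, not Lightning or fairscale internals:

import torch
import torch.distributed as dist

def init_distributed(global_rank: int, local_rank: int, world_size: int) -> None:
    # NCCL-backed collectives (including broadcast_object_list) stage their
    # tensors on the *current* CUDA device, so pin it before the first
    # collective to rule out rank != device mismatches.
    torch.cuda.set_device(local_rank)
    dist.init_process_group("nccl", rank=global_rank, world_size=world_size)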

@robogast
Author

robogast commented Feb 5, 2021

Update:
After encountering the same error without enabling ddp_sharded, I did some further digging and found the following:

Currently, on the PL master branch, model_checkpoint.py issues multiple broadcasts on all ranks at every validation_end (because I set up my ModelCheckpoint to track a validation metric) in order to keep the checkpoint filename consistent.
However, the logic that manages this seems a bit smelly to me; my working hypothesis is that there is a circumstance in which model_checkpoint.py doesn't issue an equal number of broadcasts on all ranks, leading to the strange NCCL behaviour described above.

This logic is removed from model_checkpoint.py on the 1.2-dev branch, and I haven't observed any issues (with or without ddp_sharded) since I switched to that version.
So, as far as I'm concerned, I'll close this issue once PL version 1.2 is officially released.
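
A standalone toy sketch of the mismatch hypothesis above (illustrative only, not Lightning code): if one rank issues an extra broadcast, the next collective pairs up the wrong calls across ranks, which typically hangs or, on NCCL, deserializes garbage sizes like the negative dimensions seen earlier.

import os
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank: int, world_size: int) -> None:
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29501"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    if rank == 0:
        # extra collective issued on rank 0 only, e.g. to sync a checkpoint filename
        dist.broadcast_object_list(["checkpoint-name"], src=0)
    payload = [None]
    # this call is now misaligned across ranks: expect a hang or a connection error
    dist.broadcast_object_list(payload, src=0)
    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)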

@blefaudeux

(quoting the update above)

Many thanks for all this work @robogast, and sorry for not having been more reactive; I had a hard time reproducing, which might explain why. Feel free to pull me in again, and I'm glad that works for you.

@SeanNaren
Contributor

Thanks guys! From what I understand after extensive debugging by @robogast, it seems unrelated to sharded. There are a few de-sync issues on our end between 1.2 and what's on master, but hopefully next week it will be resolved!

@edenlightning
Contributor

@robogast mind checking whether this is solved with Lightning 1.2?

@robogast
Author

I haven't had the chance to test the release version of 1.2 myself yet, but I'm running 1.2rc0 and haven't run into this issue anymore.

I'll close the issue, and if anyone runs into the same error, we can always reopen :)

@robogast
Author

For reference, this is probably the PR that also fixed this issue:
#5155

@IsCoelacanth

Reopening this: I'm getting the same OOM error when using ddp_sharded, and it breaks at the exact same point

File "/mnt/anurag/miniconda3/envs/py3/lib/python3.6/site-packages/fairscale/optim/oss.py", line 297, in consolidate_state_dict
    dist_device=dist_device,
  File "/mnt/anurag/miniconda3/envs/py3/lib/python3.6/site-packages/fairscale/utils/params.py", line 82, in broadcast_object
    data_recv_tensor = torch.empty([int(length_tensor.item())], dtype=torch.uint8, device=dist_device)
RuntimeError: CUDA out of memory. Tried to allocate 4664509241.18 GiB (GPU 0; 10.76 GiB total capacity; 694.07 MiB already allocated; 7.56 GiB free; 888.00 MiB reserved in total by PyTorch)
    optimizer.consolidate_state_dict() 

This is using the latest master for both FairScale and PL.

@robogast
Author

Reopening, given @IsCoelacanth's comment

@robogast reopened this Aug 18, 2021
@IsCoelacanth

Making these changes stops the error from happening on the first epoch (still testing whether it comes up afterwards at random).

before (error):

trainer = pl.Trainer(
        max_epochs=epochs,
        gpus=-1,
        logger=logger,
        callbacks=[image_logger, lr_monitor, checkpoint],
        accelerator="ddp",
        gradient_clip_val=0.1,
        track_grad_norm=2,
        amp_level="O1",
        precision=16,
        plugins="ddp_sharded",
    )

after (no-errors yet):

trainer = pl.Trainer(
        max_epochs=epochs,
        gpus=-1,
        logger=logger,
        callbacks=[image_logger, lr_monitor, checkpoint],
        accelerator="ddp",
        # gradient_clip_val=0.1,
        # track_grad_norm=2,
        amp_level="O1",
        precision=16,
        plugins="ddp_sharded",
    )

@ananthsub
Contributor

@IsCoelacanth do you see a difference if you do this?

trainer = pl.Trainer(
        max_epochs=epochs,
        gpus=-1,
        logger=logger,
        callbacks=[image_logger, lr_monitor, checkpoint],
        gradient_clip_val=0.1,
        track_grad_norm=2,
        amp_level="O1",
        precision=16,
        plugins="ddp_sharded",
    )

@carmocca added the waiting on author label and removed the help wanted label Jan 12, 2022
@stale

stale bot commented Feb 18, 2022

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

The stale bot added the won't fix label Feb 18, 2022
@blefaudeux

@IsCoelacanth is that still an issue? Thanks for your observation. The only thing I can think of that could have an effect is that using clip_grad_norm goes through this codepath (plus some Lightning mechanics that I don't know), but the error looks like a bug in NCCL, since the above does not change the parameter size (or there's a bug somewhere else which nukes a parameter somehow).

The stale bot removed the won't fix label Feb 18, 2022
@stale

stale bot commented Apr 17, 2022

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

The stale bot added the won't fix label Apr 17, 2022
The stale bot closed this as completed Apr 27, 2022