
DDPShardedPlugin consolidate_state_dict RuntimeError #5646

Closed
robogast opened this issue Jan 25, 2021 · 35 comments · Fixed by facebookresearch/fairscale#323
Labels
3rd party · bug · distributed · waiting on author · won't fix

Comments

@robogast

robogast commented Jan 25, 2021

🐛 Bug

After a (seemingly arbitrary) number of steps/epochs, DDPShardedPlugin::optimizer_state crashes in its consolidate_state_dict call:

  1. PyTorch's distributed broadcast_object_list tries object_tensor = torch.ByteTensor(torch.sum(object_sizes_tensor).item()) (see the sketch below)
  2. RuntimeError: Trying to create tensor with negative dimension -5193452289200645882: [-5193452289200645882]
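
For illustration, a minimal sketch of that failing line's behaviour, assuming the size tensor already holds garbage bytes (the value below is the one reported in the error, used here as a stand-in, not the actual tensor from the run):

import torch

# broadcast_object_list first exchanges per-object byte sizes, then allocates a
# ByteTensor of their sum; garbage bytes reinterpreted as int64 yield a huge
# negative size, which is exactly the error reported here.
object_sizes_tensor = torch.tensor([-5193452289200645882], dtype=torch.long)
try:
    torch.ByteTensor(torch.sum(object_sizes_tensor).item())
except RuntimeError as err:
    print(err)  # Trying to create tensor with negative dimension ...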

Stacktrace:

Traceback (most recent call last):
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 560, in train
    self.train_loop.run_training_epoch()
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 562, in run_training_epoch
    self.trainer.run_evaluation()
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 667, in run_evaluation
    self.evaluation_loop.on_evaluation_end()
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 110, in on_evaluation_end
    self.trainer.call_hook('on_validation_end', *args, **kwargs)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 924, in call_hook
    trainer_hook(*args, **kwargs)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/callback_hook.py", line 177, in on_validation_end
    callback.on_validation_end(self, self.get_model())
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 203, in on_validation_end
    self.save_checkpoint(trainer, pl_module)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 253, in save_checkpoint
    self._save_last_checkpoint(trainer, pl_module, monitor_candidates)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 567, in _save_last_checkpoint
    self._save_model(last_filepath, trainer, pl_module)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 361, in _save_model
    self.save_function(filepath, self.save_weights_only)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/properties.py", line 257, in save_checkpoint
    self.checkpoint_connector.save_checkpoint(filepath, weights_only)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 392, in save_checkpoint
    checkpoint = self.dump_checkpoint(weights_only)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 283, in dump_checkpoint
    optimizer_state = self.trainer.accelerator_backend.optimizer_state(optimizer)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 206, in optimizer_state
    return self.ddp_plugin.optimizer_state(optimizer)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/plugins/sharded_plugin.py", line 42, in optimizer_state
    optimizer.consolidate_state_dict()
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/fairscale/optim/oss.py", line 320, in consolidate_state_dict
    self._broadcast_state_dict()
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/fairscale/optim/oss.py", line 349, in _broadcast_state_dict
    dist.broadcast_object_list([0], src=global_rank, group=self.group)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1687, in broadcast_object_list
    object_tensor = torch.ByteTensor(torch.sum(object_sizes_tensor).item())
RuntimeError: Trying to create tensor with negative dimension -5193452289200645882: [-5193452289200645882]

Environment

  • CUDA:
    • GPU:
      • TITAN RTX
    • available: True
    • version: 11.0
  • Packages:
    • numpy: 1.18.1
    • pyTorch_debug: False
    • pyTorch_version: 1.8.0.dev20210122
    • pytorch-lightning: 1.1.4
    • tqdm: 4.48.2
  • System:
    • OS: Linux
    • architecture:
      • 64bit
      • ELF
    • processor:
    • python: 3.8.3
    • version: #1 SMP Debian 4.19.160-2 (2020-11-28)

cc @awaelchli @rohitgr7 @akihironitta

@robogast added the bug and help wanted labels Jan 25, 2021
@SeanNaren added the distributed and 3rd party labels Jan 25, 2021
@SeanNaren self-assigned this Jan 25, 2021
@SeanNaren
Contributor

CC @blefaudeux who may have some more insight!

Another point: this happens randomly in the middle of training when saving the model, so it can occur after a few successful iterations. I was wondering if we're running out of memory at all.

@blefaudeux

Oh wow, I never saw that, interesting. In recent fairscale I switched to using PyTorch's broadcast object util instead of a dedicated one; it looks like it can fail somehow. One check would be to try setting _torch_broadcast_object to False. I'll check with Rohan (author of this part) whether there's something I'm doing wrong.

@blefaudeux

I pushed a small change which should fix that, working around this case, which does not seem to be handled very well by this util.

@robogast
Author

Cool, I'll give it a try today and keep you up to date :)

@robogast
Author

I now seem to get some kind of deadlock within the broadcast:
(it was caught by setting export NCCL_ASYNC_ERROR_HANDLING=1)

Validating: 0it [00:00, ?it/s]

Validating:   0%|          | 0/6 [00:00<?, ?it/s]

Validating:  17%|#6        | 1/6 [00:09<00:46,  9.37s/it]
Epoch 1:  97%|#########7| 115/118 [14:00<00:21,  7.31s/it, loss=0.052, v_num=7388088, recon loss=0.0364, commitment loss=0.00812]

Validating:  33%|###3      | 2/6 [00:10<00:27,  6.93s/it]

Validating:  50%|#####     | 3/6 [00:11<00:15,  5.23s/it]
Epoch 1:  99%|#########9| 117/118 [14:02<00:07,  7.20s/it, loss=0.052, v_num=7388088, recon loss=0.0364, commitment loss=0.00812]

Validating:  67%|######6   | 4/6 [00:13<00:08,  4.03s/it]

Validating:  83%|########3 | 5/6 [00:14<00:03,  3.19s/it]
[E ProcessGroupNCCL.cpp:485] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1809029 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:485] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1809035 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:485] [Rank 17] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1809025 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:485] [Rank 16] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1809023 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:485] [Rank 19] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1809027 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:485] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1804391 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:485] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1804393 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:485] [Rank 18] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1809055 milliseconds before timing out.

r35n1:11710:11792 [0] include/socket.h:416 NCCL WARN Net : Connection closed by remote peer
r35n1:11710:11792 [0] NCCL INFO transport/net_socket.cc:405 -> 2
r35n1:11710:11792 [0] NCCL INFO include/net.h:28 -> 2
r35n1:11710:11792 [0] NCCL INFO transport/net.cc:357 -> 2
r35n1:11710:11792 [0] NCCL INFO proxy.cc:198 -> 2 [Proxy Thread]
[E ProcessGroupNCCL.cpp:485] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1804424 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:485] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1804433 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:485] [Rank 12] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1801009 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:485] [Rank 21] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1809152 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:485] [Rank 10] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1804461 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:485] [Rank 20] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1809173 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:485] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1804491 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:485] [Rank 23] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1809211 milliseconds before timing out.
Traceback (most recent call last):
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 561, in train
    self.train_loop.run_training_epoch()
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 562, in run_training_epoch
    self.trainer.run_evaluation()
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 668, in run_evaluation
    self.evaluation_loop.on_evaluation_end()
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 110, in on_evaluation_end
    self.trainer.call_hook('on_validation_end', *args, **kwargs)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 925, in call_hook
    trainer_hook(*args, **kwargs)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/callback_hook.py", line 177, in on_validation_end
    callback.on_validation_end(self, self.get_model())
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 204, in on_validation_end
    self.save_checkpoint(trainer, pl_module)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 254, in save_checkpoint
    self._save_last_checkpoint(trainer, pl_module, monitor_candidates)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 568, in _save_last_checkpoint
    self._save_model(last_filepath, trainer, pl_module)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 362, in _save_model
    self.save_function(filepath, self.save_weights_only)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/properties.py", line 257, in save_checkpoint
    self.checkpoint_connector.save_checkpoint(filepath, weights_only)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 391, in save_checkpoint
    checkpoint = self.dump_checkpoint(weights_only)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 282, in dump_checkpoint
    optimizer_state = self.trainer.accelerator_backend.optimizer_state(optimizer)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 206, in optimizer_state
    return self.ddp_plugin.optimizer_state(optimizer)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/plugins/sharded_plugin.py", line 42, in optimizer_state
    optimizer.consolidate_state_dict()
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/fairscale/optim/oss.py", line 320, in consolidate_state_dict
    self._broadcast_state_dict()
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/fairscale/optim/oss.py", line 352, in _broadcast_state_dict
    dist.broadcast_object_list([dummy_sync_tensor], src=global_rank, group=self.group)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1691, in broadcast_object_list
    broadcast(object_tensor, src=src, group=group)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1039, in broadcast
    work = default_pg.broadcast([tensor], opts)
RuntimeError: NCCL communicator was aborted.

@robogast
Author

robogast commented Jan 26, 2021

cc @blefaudeux

I can hereby (sadly) say that the code change didn't conclusively fix the issue; I ran into the same error again:

Traceback (most recent call last):
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 561, in train
    self.train_loop.run_training_epoch()
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 562, in run_training_epoch
    self.trainer.run_evaluation()
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 668, in run_evaluation
    self.evaluation_loop.on_evaluation_end()
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 110, in on_evaluation_end
    self.trainer.call_hook('on_validation_end', *args, **kwargs)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 925, in call_hook
    trainer_hook(*args, **kwargs)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/callback_hook.py", line 177, in on_validation_end
    callback.on_validation_end(self, self.get_model())
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 204, in on_validation_end
    self.save_checkpoint(trainer, pl_module)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 254, in save_checkpoint
    self._save_last_checkpoint(trainer, pl_module, monitor_candidates)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 568, in _save_last_checkpoint
    self._save_model(last_filepath, trainer, pl_module)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 362, in _save_model
    self.save_function(filepath, self.save_weights_only)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/properties.py", line 257, in save_checkpoint
    self.checkpoint_connector.save_checkpoint(filepath, weights_only)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 391, in save_checkpoint
    checkpoint = self.dump_checkpoint(weights_only)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 282, in dump_checkpoint
    optimizer_state = self.trainer.accelerator_backend.optimizer_state(optimizer)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 206, in optimizer_state
    return self.ddp_plugin.optimizer_state(optimizer)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/plugins/sharded_plugin.py", line 42, in optimizer_state
    optimizer.consolidate_state_dict()
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/fairscale/optim/oss.py", line 320, in consolidate_state_dict
    self._broadcast_state_dict()
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/fairscale/optim/oss.py", line 352, in _broadcast_state_dict
    dist.broadcast_object_list([dummy_sync_tensor], src=global_rank, group=self.group)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1687, in broadcast_object_list
    object_tensor = torch.ByteTensor(torch.sum(object_sizes_tensor).item())
RuntimeError: Trying to create tensor with negative dimension -5185613574447507032: [-5185613574447507032]

@blefaudeux

cc @rohan-varma, author of this part. It's certainly not normal; really sorry about that @robogast. Could you try setting OSS._torch_broadcast_object to False, so that it falls back to the more mundane implementation?
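
A rough sketch of how one might apply that fallback from a LightningModule; the _torch_broadcast_object attribute and the hook placement are taken from this thread rather than a documented API, and MyLightningModule is a hypothetical name:

import pytorch_lightning as pl
from fairscale.optim import OSS

class MyLightningModule(pl.LightningModule):
    def on_train_start(self):
        # The sharded plugin wraps the user's optimizers in fairscale's OSS;
        # flip the (assumed) flag on each of them before any checkpoint is
        # saved, so consolidate_state_dict() falls back to fairscale's own
        # broadcast helper instead of torch's broadcast_object_list.
        for opt in self.trainer.optimizers:
            if isinstance(opt, OSS):
                opt._torch_broadcast_object = False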

@blefaudeux

Also @robogast, do you see the issue on PyTorch stable (1.7.1)? I've never seen it myself, so it would be nice to narrow this down to a specific combination, if any.

@rohan-varma

@robogast Thanks for flagging this! Do you have a runnable script that can reproduce the issue? In the meantime, hopefully the fallback @blefaudeux suggested can unblock.

@robogast
Author

robogast commented Jan 26, 2021

@blefaudeux Yes, this also happened before PyTorch 1.7.1.

Let me pull out some old stack traces; I believe it's the same issue with the old code.
I posted this on the 21st of December in the PyTorch Lightning Slack channel.

I can see if I can reproduce it with a minimal example, but that would take some time.

Stacktrace 1

Validating:  83%|########3 | 5/6 [00:21<00:04,  4.66s/it]
Epoch 10:  50%|#####     | 59/118 [09:39<09:39,  9.82s/it, loss=0.0056, v_num=7158123]

Validating: 100%|##########| 6/6 [00:23<00:00,  3.98s/it]
Epoch 10:  51%|#####     | 60/118 [09:42<09:22,  9.70s/it, loss=0.0056, v_num=7158123]
Epoch 10:  51%|#####     | 60/118 [09:49<09:30,  9.83s/it, loss=0.0056, v_num=7158123]
Traceback (most recent call last):
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 521, in train
    self.train_loop.run_training_epoch()
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 588, in run_training_epoch
    self.trainer.run_evaluation(test_mode=False)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 628, in run_evaluation
    self.evaluation_loop.on_evaluation_end()
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 111, in on_evaluation_end
    self.trainer.call_hook('on_validation_end', *args, **kwargs)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 887, in call_hook
    trainer_hook(*args, **kwargs)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/callback_hook.py", line 177, in on_validation_end
    callback.on_validation_end(self, self.get_model())
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 204, in on_validation_end
    self.save_checkpoint(trainer, pl_module)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 252, in save_checkpoint
    self._save_last_checkpoint(trainer, pl_module, monitor_candidates)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 567, in _save_last_checkpoint
    self._save_model(last_filepath, trainer, pl_module)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 361, in _save_model
    self.save_function(filepath, self.save_weights_only)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/properties.py", line 236, in save_checkpoint
    self.checkpoint_connector.save_checkpoint(filepath, weights_only)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 382, in save_checkpoint
    checkpoint = self.dump_checkpoint(weights_only)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 273, in dump_checkpoint
    optimizer_state = self.trainer.accelerator_backend.optimizer_state(optimizer)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 201, in optimizer_state
    return self.ddp_plugin.optimizer_state(optimizer)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/plugins/sharded_plugin.py", line 42, in optimizer_state
    optimizer.consolidate_state_dict()
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/fairscale/optim/oss.py", line 299, in consolidate_state_dict
    self._broadcast_state_dict()
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/fairscale/optim/oss.py", line 482, in _broadcast_state_dict
    broadcast_object(empty_buffer, src_rank=global_rank, group=self.group, dist_device=self._device)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/fairscale/optim/utils.py", line 123, in broadcast_object
    data_recv_tensor = torch.empty([int(length_tensor.item())], dtype=torch.uint8, device=dist_device)
RuntimeError: CUDA out of memory. Tried to allocate 3968492912.96 GiB (GPU 2; 23.65 GiB total capacity; 64.10 MiB already allocated; 21.70 GiB free; 954.00 MiB reserved in total by PyTorch)

Stacktrace 2

Validating:  83%|########3 | 5/6 [00:16<00:03,  3.68s/it]
Validating: 100%|##########| 6/6 [00:18<00:00,  3.14s/it]
Epoch 11: 100%|##########| 118/118 [20:31<00:00, 10.44s/it, loss=0.00465, v_num=7158119]
Traceback (most recent call last):
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 521, in train
    self.train_loop.run_training_epoch()
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 588, in run_training_epoch
    self.trainer.run_evaluation(test_mode=False)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 628, in run_evaluation
    self.evaluation_loop.on_evaluation_end()
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 111, in on_evaluation_end
    self.trainer.call_hook('on_validation_end', *args, **kwargs)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 887, in call_hook
    trainer_hook(*args, **kwargs)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/callback_hook.py", line 177, in on_validation_end
    callback.on_validation_end(self, self.get_model())
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 204, in on_validation_end
    self.save_checkpoint(trainer, pl_module)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 252, in save_checkpoint
    self._save_last_checkpoint(trainer, pl_module, monitor_candidates)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 567, in _save_last_checkpoint
    self._save_model(last_filepath, trainer, pl_module)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 361, in _save_model
    self.save_function(filepath, self.save_weights_only)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/properties.py", line 236, in save_checkpoint
    self.checkpoint_connector.save_checkpoint(filepath, weights_only)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 382, in save_checkpoint
    checkpoint = self.dump_checkpoint(weights_only)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 273, in dump_checkpoint
    optimizer_state = self.trainer.accelerator_backend.optimizer_state(optimizer)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 201, in optimizer_state
    return self.ddp_plugin.optimizer_state(optimizer)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/plugins/sharded_plugin.py", line 42, in optimizer_state
    optimizer.consolidate_state_dict()
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/fairscale/optim/oss.py", line 299, in consolidate_state_dict
    self._broadcast_state_dict()
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/fairscale/optim/oss.py", line 482, in _broadcast_state_dict
    broadcast_object(empty_buffer, src_rank=global_rank, group=self.group, dist_device=self._device)
  File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/fairscale/optim/utils.py", line 123, in broadcast_object
    data_recv_tensor = torch.empty([int(length_tensor.item())], dtype=torch.uint8, device=dist_device)
RuntimeError: Trying to create tensor with negative dimension -4803151849830123752: [-4803151849830123752]

@robogast
Author

robogast commented Jan 26, 2021

By the way @rohan-varma, since _broadcast_state_dict and _collect_sharded_states are so interconnected, shouldn't they be merged into one _collect_sharded_states function?

  • They perform the same syncing behaviour
  • They depend on one another for this syncing behaviour
  • Therefore a change to one function is a change to the other

You can also see that this already went wrong(!): in the bugfix where the 0 value was replaced with some dummy tensor ( facebookresearch/fairscale#323 ), if some magic happened in broadcast_object_list that depended on the object itself, changing _broadcast_state_dict would have already caused a bug in _collect_sharded_states!

But maybe that's a discussion for the fairscale repo? :)

@blefaudeux

blefaudeux commented Jan 26, 2021

(quoting the two stack traces from the previous comment)

You've seen that it OOMs, right? Anything after that is pretty hard to debug; the collective communication primitives assume that every rank does the same thing, which is obviously not the case here.

edit: wow, I missed that initially, the size of that allocation is nuts

@blefaudeux

blefaudeux commented Jan 26, 2021

You can also see that this already went wrong(!): in the bugfix where the 0 value was replaced with some dummy tensor ( facebookresearch/fairscale#323 ), if some magic happened in broadcast_object_list that depended on the object itself, changing _broadcast_state_dict would have already caused a bug in _collect_sharded_states!

The only change in the bugfix was to pass something (a tensor of size 1), in case the issue was in the serialization of 0 (which a bug could treat as 'null' and break on). That's all; if the util was not broken to begin with, then this change does not break anything either, so I'm not sure why it would be 'broken'. Of course some magic happens in broadcast_object_list which depends on the object; at that point not all ranks have the same object, and for the trace you're seeing to happen, some of this magic needs to be broken.

@robogast
Author

To be clear: those are two different stack traces.
They happen on the exact same line of code, so I assume they're related.

@blefaudeux

By the way @rohan-varma, since _broadcast_state_dict and _collect_sharded_states are so interconnected, shouldn't they be merged into one _collect_sharded_states function?

Coding style preference: keep functions not too big, and consolidate_state_dict is the public interface; it's easier to understand what it does, I believe, at 20 LOC rather than 200. But that's beside the point.

@blefaudeux

@robogast just in case, would you know what optimizer you're using?

@robogast
Author

robogast commented Jan 26, 2021

PyTorch native Adam with amsgrad (wrapped by Lightning, of course)

@blefaudeux

Still trying to wrap my head around that. Just to try to get somewhere: there's something really strange with the trace above, in that there's almost no memory allocated when it fails, even though there should at least be the model plus probably the last or next batch. @SeanNaren @ananthsub, is there something I should know that happens around checkpointing in Lightning? Like some items being deleted?

@SeanNaren
Contributor

Thanks for the investigation, guys; it's totally possible that something on the Lightning side is messing this up. When we save the model we save model states/optimizer states and some metadata (like arguments given to the LightningModule), but that covers most of it. @robogast, are you able to share any code we could use to reproduce? It would help us pinpoint what's causing this issue!
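
For anyone who wants to try, a minimal skeleton one could adapt to attempt a reproduction of the setup described in this thread (PL ~1.1-era API: the ddp_sharded plugin plus a ModelCheckpoint monitoring a validation metric); the model and data here are placeholders, not the reporter's code:

import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint

class ToyModule(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 32)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        (x,) = batch
        return torch.nn.functional.mse_loss(self(x), x)

    def validation_step(self, batch, batch_idx):
        (x,) = batch
        # logged metric that ModelCheckpoint monitors, mirroring the reporter's setup
        self.log("val_loss", torch.nn.functional.mse_loss(self(x), x))

    def configure_optimizers(self):
        # plain Adam with amsgrad, as mentioned later in the thread
        return torch.optim.Adam(self.parameters(), lr=1e-3, amsgrad=True)

def loader():
    return DataLoader(TensorDataset(torch.randn(256, 32)), batch_size=16)

if __name__ == "__main__":
    checkpoint = ModelCheckpoint(monitor="val_loss", save_last=True)
    trainer = pl.Trainer(
        gpus=-1,
        accelerator="ddp",
        plugins="ddp_sharded",
        max_epochs=50,
        callbacks=[checkpoint],
    )
    trainer.fit(ToyModule(), loader(), loader())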

@blefaudeux

@robogast @SeanNaren one option is that it's all because of the use of the torch broadcast helper: it was also being used in 1.7.1, and it's subtly broken or hard to use (it assumes rank == device). This could explain why the GPU has nothing on it at that point; maybe you were using pipe or something and the GPU being used is not the correct one. I've removed that from upstream fairscale, awaiting confirmation, but that could be it. I've never tested this configuration myself, which could explain why I've never seen the issue.

@robogast
Author

Sorry, that might not have been clear: I'm using the nightly version of PyTorch exactly because I ran into the bug where broadcast_object_list was broken due to rank != device.
I've spent the day chasing the error, but nothing so far.

@blefaudeux

Sorry, that might not have been clear: I'm using the nightly version of PyTorch exactly because I ran into the bug where broadcast_object_list was broken due to rank != device.
I've spent the day chasing the error, but nothing so far.

Even with the newer torch, this util assumes that the current torch device has been set to whatever is passed, which I was not doing. Could you try setting _torch_broadcast_object to False, or test this branch, by any chance?
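
For reference, a minimal sketch of the precaution being described here, assuming one process per GPU; the function name and arguments are illustrative, not Lightning or fairscale internals:

import torch
import torch.distributed as dist

def init_distributed(global_rank: int, local_rank: int, world_size: int) -> None:
    # NCCL-backed collectives (including broadcast_object_list) stage their
    # tensors on the *current* CUDA device, so pin it before the first
    # collective to rule out rank != device mismatches.
    torch.cuda.set_device(local_rank)
    dist.init_process_group("nccl", rank=global_rank, world_size=world_size)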

@robogast
Author

robogast commented Feb 5, 2021

Update:
After encountering the same error without enabling ddp_sharded, I did some further digging and found the following:

Currently, on the PL master branch, model_checkpoint.py issues multiple broadcasts on all ranks at every validation_end (because I set up my ModelCheckpoint to track a validation metric) in order to keep the checkpoint filename consistent.
However, the logic that manages this seems a bit smelly to me; my working hypothesis is that there is a circumstance in which model_checkpoint.py doesn't issue an equal number of broadcasts on all ranks, leading to the strange NCCL behaviour described above.

This logic is removed from model_checkpoint.py on the 1.2-dev branch, and I haven't observed any issues (with or without ddp_sharded) since I switched to that version.
So, as far as I'm concerned, I'll close this issue once PL version 1.2 is officially released.
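
A standalone toy sketch of the mismatch hypothesis above (illustrative only, not Lightning code): if one rank issues an extra broadcast, the next collective pairs up the wrong calls across ranks, which typically hangs or, on NCCL, deserializes garbage sizes like the negative dimensions seen earlier.

import os
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank: int, world_size: int) -> None:
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29501"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    if rank == 0:
        # extra collective issued on rank 0 only, e.g. to sync a checkpoint filename
        dist.broadcast_object_list(["checkpoint-name"], src=0)
    payload = [None]
    # this call is now misaligned across ranks: expect a hang or a connection error
    dist.broadcast_object_list(payload, src=0)
    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)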

@blefaudeux

(quoting the update above)

Many thanks for all this work @robogast, and sorry for not having been more reactive; I had a hard time reproducing, which might explain why. Feel free to pull me in again, and I'm glad that works for you.

@SeanNaren
Contributor

Thanks guys! From what I understand after extensive debugging by @robogast, it seems unrelated to sharded. There are a few de-sync issues on our end between 1.2 and what's on master, but hopefully next week it will be resolved!

@edenlightning
Contributor

@robogast mind checking whether this is solved with Lightning 1.2?

@robogast
Author

I haven't had the chance to test the release version of 1.2 myself yet, but I'm running 1.2rc0 and haven't run into this issue anymore.

I'll close the issue, and if anyone runs into the same error, we can always reopen :)

@robogast
Author

For reference, this is probably the PR that also fixed this issue:
#5155

@IsCoelacanth

Reopening this: I'm getting the same OOM error when using ddp_sharded, and it breaks at the exact same point

File "/mnt/anurag/miniconda3/envs/py3/lib/python3.6/site-packages/fairscale/optim/oss.py", line 297, in consolidate_state_dict
    dist_device=dist_device,
  File "/mnt/anurag/miniconda3/envs/py3/lib/python3.6/site-packages/fairscale/utils/params.py", line 82, in broadcast_object
    data_recv_tensor = torch.empty([int(length_tensor.item())], dtype=torch.uint8, device=dist_device)
RuntimeError: CUDA out of memory. Tried to allocate 4664509241.18 GiB (GPU 0; 10.76 GiB total capacity; 694.07 MiB already allocated; 7.56 GiB free; 888.00 MiB reserved in total by PyTorch)
    optimizer.consolidate_state_dict() 

This is using the latest master for both FairScale and PL.

@robogast
Author

Reopening, given @IsCoelacanth's comment

@robogast reopened this Aug 18, 2021
@IsCoelacanth

Making these changes stops the error from happening on the first epoch (still testing whether it comes up afterwards at random).

before (error):

trainer = pl.Trainer(
        max_epochs=epochs,
        gpus=-1,
        logger=logger,
        callbacks=[image_logger, lr_monitor, checkpoint],
        accelerator="ddp",
        gradient_clip_val=0.1,
        track_grad_norm=2,
        amp_level="O1",
        precision=16,
        plugins="ddp_sharded",
    )

after (no-errors yet):

trainer = pl.Trainer(
        max_epochs=epochs,
        gpus=-1,
        logger=logger,
        callbacks=[image_logger, lr_monitor, checkpoint],
        accelerator="ddp",
        # gradient_clip_val=0.1,
        # track_grad_norm=2,
        amp_level="O1",
        precision=16,
        plugins="ddp_sharded",
    )

@ananthsub
Contributor

@IsCoelacanth do you see a difference if you do this?

trainer = pl.Trainer(
        max_epochs=epochs,
        gpus=-1,
        logger=logger,
        callbacks=[image_logger, lr_monitor, checkpoint],
        gradient_clip_val=0.1,
        track_grad_norm=2,
        amp_level="O1",
        precision=16,
        plugins="ddp_sharded",
    )

@carmocca added the waiting on author label and removed the help wanted label Jan 12, 2022
@stale

stale bot commented Feb 18, 2022

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

The stale bot added the won't fix label Feb 18, 2022
@blefaudeux

@IsCoelacanth is that still an issue? Thanks for your observation. The only thing I can think of that could have an effect is that using clip_grad_norm goes through this codepath (plus some Lightning mechanics that I don't know), but the error looks like a bug in NCCL, since the above does not change the parameter size (or there's a bug somewhere else which nukes a parameter somehow).

The stale bot removed the won't fix label Feb 18, 2022
@stale

stale bot commented Apr 17, 2022

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

The stale bot added the won't fix label Apr 17, 2022
The stale bot closed this as completed Apr 27, 2022