DDPShardedPlugin consolidate_state_dict RuntimeError #5646
Comments
CC @blefaudeux, who may have some more insight! Another point to make is that this happens randomly in the middle of training when saving the model, so it can happen after a few successful iterations. I was wondering if we're running out of memory at all.
Oh wow, I never saw that, interesting. In recent fairscale I switched to using PyTorch's broadcast object util instead of a dedicated one; it looks like it can fail somehow. One check could be to try to set
I pushed a small change which should fix that, working around this case, which seems not to be handled very well by this util.
Cool, I'll give it a try today, I'll keep you up to date :)
I now seem to get some kind of deadlock within the broadcast:
cc @blefaudeux I can hereby (sadly) say that the code change didn't conclusively fix the issue; I ran into the same issue again:
Traceback (most recent call last):
File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 561, in train
self.train_loop.run_training_epoch()
File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 562, in run_training_epoch
self.trainer.run_evaluation()
File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 668, in run_evaluation
self.evaluation_loop.on_evaluation_end()
File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 110, in on_evaluation_end
self.trainer.call_hook('on_validation_end', *args, **kwargs)
File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 925, in call_hook
trainer_hook(*args, **kwargs)
File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/callback_hook.py", line 177, in on_validation_end
callback.on_validation_end(self, self.get_model())
File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 204, in on_validation_end
self.save_checkpoint(trainer, pl_module)
File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 254, in save_checkpoint
self._save_last_checkpoint(trainer, pl_module, monitor_candidates)
File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 568, in _save_last_checkpoint
self._save_model(last_filepath, trainer, pl_module)
File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 362, in _save_model
self.save_function(filepath, self.save_weights_only)
File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/properties.py", line 257, in save_checkpoint
self.checkpoint_connector.save_checkpoint(filepath, weights_only)
File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 391, in save_checkpoint
checkpoint = self.dump_checkpoint(weights_only)
File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 282, in dump_checkpoint
optimizer_state = self.trainer.accelerator_backend.optimizer_state(optimizer)
File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 206, in optimizer_state
return self.ddp_plugin.optimizer_state(optimizer)
File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/plugins/sharded_plugin.py", line 42, in optimizer_state
optimizer.consolidate_state_dict()
File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/fairscale/optim/oss.py", line 320, in consolidate_state_dict
self._broadcast_state_dict()
File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/fairscale/optim/oss.py", line 352, in _broadcast_state_dict
dist.broadcast_object_list([dummy_sync_tensor], src=global_rank, group=self.group)
File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1687, in broadcast_object_list
object_tensor = torch.ByteTensor(torch.sum(object_sizes_tensor).item())
RuntimeError: Trying to create tensor with negative dimension -5185613574447507032: [-5185613574447507032]
cc @rohan-varma, author of this part. It's certainly not normal, really sorry about that @robogast. Could you try to set OSS._torch_broadcast_object to False, so that it falls back to the more mundane implementation?
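For anyone else hitting this, a minimal sketch of that fallback, assuming the installed fairscale version still exposes the class-level OSS._torch_broadcast_object flag mentioned above:

# Sketch only: force fairscale's OSS optimizer to fall back to its own
# broadcast_object helper instead of torch.distributed.broadcast_object_list.
# Assumes the installed fairscale version exposes this class-level flag.
from fairscale.optim import OSS

OSS._torch_broadcast_object = False  # set before consolidate_state_dict() runs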
Also @robogast, do you see the issue on PyTorch stable (1.7.1)? I've never seen that myself, so it would be nice to corner this to a specific combination, if any.
@robogast Thanks for flagging this! Do you have a runnable script that can reproduce the issue? In the meantime, hopefully the fallback @blefaudeux suggested can unblock you.
@blefaudeux Yes, this also happened before PyTorch 1.7.1. Let me pull out some old stacktraces; I believe it's the same issue with the old code. I can see if I can reproduce it with a minimal example, but that would take some time.
Stacktrace 1
Validating: 83%|########3 | 5/6 [00:21<00:04, 4.66s/it][A
Epoch 10: 50%|##### | 59/118 [09:39<09:39, 9.82s/it, loss=0.0056, v_num=7158123]
Validating: 100%|##########| 6/6 [00:23<00:00, 3.98s/it][A
Epoch 10: 51%|##### | 60/118 [09:42<09:22, 9.70s/it, loss=0.0056, v_num=7158123]
Epoch 10: 51%|##### | 60/118 [09:49<09:30, 9.83s/it, loss=0.0056, v_num=7158123]
Traceback (most recent call last):
File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 521, in train
self.train_loop.run_training_epoch()
File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 588, in run_training_epoch
self.trainer.run_evaluation(test_mode=False)
File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 628, in run_evaluation
self.evaluation_loop.on_evaluation_end()
File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 111, in on_evaluation_end
self.trainer.call_hook('on_validation_end', *args, **kwargs)
File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 887, in call_hook
trainer_hook(*args, **kwargs)
File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/callback_hook.py", line 177, in on_validation_end
callback.on_validation_end(self, self.get_model())
File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 204, in on_validation_end
self.save_checkpoint(trainer, pl_module)
File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 252, in save_checkpoint
self._save_last_checkpoint(trainer, pl_module, monitor_candidates)
File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 567, in _save_last_checkpoint
self._save_model(last_filepath, trainer, pl_module)
File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 361, in _save_model
self.save_function(filepath, self.save_weights_only)
File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/properties.py", line 236, in save_checkpoint
self.checkpoint_connector.save_checkpoint(filepath, weights_only)
File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 382, in save_checkpoint
checkpoint = self.dump_checkpoint(weights_only)
File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 273, in dump_checkpoint
optimizer_state = self.trainer.accelerator_backend.optimizer_state(optimizer)
File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 201, in optimizer_state
return self.ddp_plugin.optimizer_state(optimizer)
File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/plugins/sharded_plugin.py", line 42, in optimizer_state
optimizer.consolidate_state_dict()
File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/fairscale/optim/oss.py", line 299, in consolidate_state_dict
self._broadcast_state_dict()
File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/fairscale/optim/oss.py", line 482, in _broadcast_state_dict
broadcast_object(empty_buffer, src_rank=global_rank, group=self.group, dist_device=self._device)
File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/fairscale/optim/utils.py", line 123, in broadcast_object
data_recv_tensor = torch.empty([int(length_tensor.item())], dtype=torch.uint8, device=dist_device)
RuntimeError: CUDA out of memory. Tried to allocate 3968492912.96 GiB (GPU 2; 23.65 GiB total capacity; 64.10 MiB already allocated; 21.70 GiB free; 954.00 MiB reserved in total by PyTorch)
Stacktrace 2
Validating: 83%|########3 | 5/6 [00:16<00:03, 3.68s/it][A
Validating: 100%|##########| 6/6 [00:18<00:00, 3.14s/it][A
Epoch 11: 100%|##########| 118/118 [20:31<00:00, 10.44s/it, loss=0.00465, v_num=7158119]
Traceback (most recent call last):
File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 521, in train
self.train_loop.run_training_epoch()
File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 588, in run_training_epoch
self.trainer.run_evaluation(test_mode=False)
File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 628, in run_evaluation
self.evaluation_loop.on_evaluation_end()
File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 111, in on_evaluation_end
self.trainer.call_hook('on_validation_end', *args, **kwargs)
File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 887, in call_hook
trainer_hook(*args, **kwargs)
File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/callback_hook.py", line 177, in on_validation_end
callback.on_validation_end(self, self.get_model())
File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 204, in on_validation_end
self.save_checkpoint(trainer, pl_module)
File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 252, in save_checkpoint
self._save_last_checkpoint(trainer, pl_module, monitor_candidates)
File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 567, in _save_last_checkpoint
self._save_model(last_filepath, trainer, pl_module)
File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 361, in _save_model
self.save_function(filepath, self.save_weights_only)
File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/properties.py", line 236, in save_checkpoint
self.checkpoint_connector.save_checkpoint(filepath, weights_only)
File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 382, in save_checkpoint
checkpoint = self.dump_checkpoint(weights_only)
File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 273, in dump_checkpoint
optimizer_state = self.trainer.accelerator_backend.optimizer_state(optimizer)
File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 201, in optimizer_state
return self.ddp_plugin.optimizer_state(optimizer)
File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/pytorch_lightning/plugins/sharded_plugin.py", line 42, in optimizer_state
optimizer.consolidate_state_dict()
File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/fairscale/optim/oss.py", line 299, in consolidate_state_dict
self._broadcast_state_dict()
File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/fairscale/optim/oss.py", line 482, in _broadcast_state_dict
broadcast_object(empty_buffer, src_rank=global_rank, group=self.group, dist_device=self._device)
File "/home/robertsc/.conda/envs/pytorch1.7/lib/python3.8/site-packages/fairscale/optim/utils.py", line 123, in broadcast_object
data_recv_tensor = torch.empty([int(length_tensor.item())], dtype=torch.uint8, device=dist_device)
RuntimeError: Trying to create tensor with negative dimension -4803151849830123752: [-4803151849830123752]
By the way @rohan-varma, since
You can also see that this already went wrong(!): in the bugfix where the
But maybe that's a discussion for the fairscale repo? :)
You've seen that it OOMs, right? Anything after that is pretty hard to debug; the collective communication primitives assume that every rank does the same thing, and that's obviously not the case here.
edit: wow, I missed that initially, the allocation is nuts
The only change in the bugfix was to try to pass something (a tensor of size 1), in case the issue was in the serialization of 0 (which could be considered 'null' with a bug -> break). That's all; if the util was not broken to begin with, then this change does not break anything either, so I'm not sure why it is "broken". Of course some magic happens in broadcast_object_list which depends on the object; at that point not all ranks have the same object, and for the trace that you're seeing to happen, some of this magic needs to be broken.
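For context, a rough sketch of the size-then-payload pattern that both broadcast_object_list and fairscale's broadcast_object follow. This is not the actual library code, just an illustration of why a rank de-sync surfaces as a nonsense (possibly negative or enormous) length on the receiving side:

import io
import torch
import torch.distributed as dist

def broadcast_one_object(obj, src, group=None, device="cpu"):
    # Phase 1: the source serializes the object and broadcasts its byte length.
    rank = dist.get_rank()
    if rank == src:
        buffer = io.BytesIO()
        torch.save(obj, buffer)
        payload = torch.tensor(list(buffer.getvalue()), dtype=torch.uint8, device=device)
        length = torch.tensor([payload.numel()], dtype=torch.long, device=device)
    else:
        payload = None
        length = torch.tensor([0], dtype=torch.long, device=device)
    dist.broadcast(length, src=src, group=group)
    # Phase 2: receivers allocate a buffer of that length and receive the payload.
    # If ranks are out of step (e.g. one rank is in a different collective, or the
    # bytes land on the wrong device), the allocation below is made from a garbage
    # value -- which is exactly the "negative dimension" / huge-OOM failures above.
    if rank != src:
        payload = torch.empty(int(length.item()), dtype=torch.uint8, device=device)
    dist.broadcast(payload, src=src, group=group)
    if rank != src:
        obj = torch.load(io.BytesIO(bytes(payload.tolist())))
    return obj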
To be clear: those are two different stacktraces.
Coding style preferences: keep the function not too big, and consolidate_state_dict is a public interface; it's easier to understand what it does, I believe, if it's 20 LOC long vs. 200. But that's beside the point.
@robogast just in case, would you know what optimizer you're using?
PyTorch native Adam with AMSGrad (wrapped by Lightning, of course).
Still trying to wrap my head around that. Just to try to get somewhere: there's something really strange about the trace above, in that there's almost no memory allocated to begin with when it fails, which is strange to me since there should at least be the model plus probably the last or next batch. @SeanNaren @ananthsub, is there something I should know that happens around checkpointing in Lightning? Like some items being deleted?
Thanks for the investigation guys, it's totally possible that something on the Lightning side is messing this up. When we save the model we save model states/optimizer states and some metadata (like the arguments given to the LightningModule), but that covers most of it. @robogast, are you able to share any code to work with to reproduce? It would help us pinpoint what's causing this issue!
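For reference, a rough outline of what a Lightning checkpoint dict of that era typically contains; this is approximate and not version-exact, just showing where the consolidated optimizer state ends up:

# Approximate shape of a PyTorch Lightning 1.x checkpoint (illustrative only).
checkpoint = {
    "epoch": 10,
    "global_step": 1180,
    "state_dict": {},          # model weights
    "optimizer_states": [],    # one entry per optimizer -> fed by consolidate_state_dict()
    "lr_schedulers": [],
    "hyper_parameters": {},    # arguments given to the LightningModule
}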
@robogast @SeanNaren one option is that it's all because of the use of the torch broadcast helper: it was also being used in 1.7.1, and it's subtly broken or hard to use (it defaults to rank == device). This could explain why the GPU has nothing at that point; maybe you were using pipe or something and the GPU being used is not the correct one. I've removed that from upstream fairscale, awaiting confirmation, but that could be it. I've never tested this configuration myself, which could explain why I've never seen the issue.
Sorry, that might not have been clear: I'm using the nightly version of PyTorch exactly because I ran into the
Even with the newer torch, this util assumes that the current torch device has been changed to whatever is passed, which I was not doing. Could you try to set
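If it helps, a minimal sketch of pinning the current CUDA device per process, which is the assumption described above; local_rank here is just whatever your launcher exposes (e.g. the LOCAL_RANK environment variable), not something Lightning-specific:

import os
import torch

# Sketch: make each process's "current" CUDA device match the device it actually
# uses for collectives, since the broadcast util assumes the two agree.
local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)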
Update: Currently on PL
This logic is removed from
Many thanks for all this work @robogast, and sorry for not having been more reactive; I had a hard time reproducing, but that might explain why. Feel free to pull me in again, and I'm glad that works for you.
Thanks guys! From what I understand, after extensive debugging from @robogast it seems unrelated to sharded. There are a few de-sync issues on our end between 1.2 and what's on master, but hopefully next week it will be resolved!
@robogast, mind checking whether this is solved with Lightning 1.2?
I haven't had the chance to test the release version of 1.2 yet myself, but I'm running 1.2rc0 and haven't run into this issue anymore. I'll close the issue, and if anyone runs into the same error, we can always reopen :)
For reference, probably the PR that also fixed this issue:
reopening this, getting the same OOM error when using
using the latest master for both FairScale and PL
Reopening, given @IsCoelacanth's comment.
Making these changes stops the error from happening on the first epoch (still testing if it comes up afterwards at random).
before (error):
trainer = pl.Trainer(
max_epochs=epochs,
gpus=-1,
logger=logger,
callbacks=[image_logger, lr_monitor, checkpoint],
accelerator="ddp",
gradient_clip_val=0.1,
track_grad_norm=2,
amp_level="O1",
precision=16,
plugins="ddp_sharded",
)
after (no errors yet):
trainer = pl.Trainer(
max_epochs=epochs,
gpus=-1,
logger=logger,
callbacks=[image_logger, lr_monitor, checkpoint],
accelerator="ddp",
# gradient_clip_val=0.1,
# track_grad_norm=2,
amp_level="O1",
precision=16,
plugins="ddp_sharded",
)
@IsCoelacanth do you see a difference if you do this?
This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, PyTorch Lightning Team!
@IsCoelacanth is that still an issue? Thanks for your observation; the only thing I can think of which could affect this is that using clip_grad_norm goes through this codepath (plus some Lightning mechanics that I don't know), but the error looks like a bug in NCCL, since the above does not change the parameter size (or there's a bug somewhere else which nukes a parameter somehow).
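To make the clip_grad_norm remark concrete, here is a conceptual sketch (not fairscale's actual implementation) of why gradient clipping with a sharded optimizer involves a collective: each rank only sees the gradients of its own shard, so the global norm has to be reduced across ranks before scaling:

import torch
import torch.distributed as dist

def sharded_clip_grad_norm_(local_params, max_norm, group=None):
    # Each rank sums the squared L2 norms of the gradients it owns...
    local_sq = torch.zeros(1, device="cuda")
    for p in local_params:
        if p.grad is not None:
            local_sq += p.grad.detach().norm(2) ** 2
    # ...then the partial sums are combined across the process group.
    dist.all_reduce(local_sq, op=dist.ReduceOp.SUM, group=group)
    total_norm = local_sq.sqrt()
    # Every rank applies the same global coefficient to its local gradients.
    clip_coef = float(max_norm) / (float(total_norm) + 1e-6)
    if clip_coef < 1.0:
        for p in local_params:
            if p.grad is not None:
                p.grad.detach().mul_(clip_coef)
    return total_norm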
This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, PyTorch Lightning Team!
🐛 Bug
After a (seemingly arbitrary) number of steps/epochs, DDPShardedPlugin::optimizer_state crashes on its consolidate_state_dict call: broadcast_object_list tries
object_tensor = torch.ByteTensor(torch.sum(object_sizes_tensor).item())
and fails with
RuntimeError: Trying to create tensor with negative dimension -5193452289200645882: [-5193452289200645882]
Stacktrace:
Environment
cc @awaelchli @rohitgr7 @akihironitta