hydra + multirun + ddp error #226
Hi there. Is there a particular error message? I don't have access to multi-GPU right now, but I can take a look later this week.
I am also struggling to get DDP to work with multirun, though additionally I am on a SLURM cluster, which may be yet another complication. One thing that you may try: I have noticed that PL requires you to compute losses in a specific place in order for DP to work:
```python
def training_step(self, batch: Any, batch_idx: int):
    loss, preds, targets = self.step(batch)

    # we can return here a dict with any tensors
    # and then read it in some callback or in `training_epoch_end()` below
    # remember to always return loss from `training_step()` or else backpropagation will fail!
    return {"loss": loss, "preds": preds, "targets": targets}

def training_step_end(self, outputs):
    # log train metrics
    acc = self.train_acc(outputs["preds"], outputs["targets"])
    self.log("train/loss", outputs["loss"], on_step=False, on_epoch=True, prog_bar=False, sync_dist=True)
    self.log("train/acc", acc, on_step=False, on_epoch=True, prog_bar=True, sync_dist=True)
```

These rules may only apply to DP and not to DDP, but it certainly won't hurt to try it out.
Thank you for the suggestion. I tried that, but the same crash occurs :(
Also, I found out that others are struggling with the same issue, and it doesn't seem to have a simple solution: Lightning-AI/pytorch-lightning#2727
@shim94kr @nils-werner I played a bit with 2x V100 and found that removing the `dirpath` argument from the checkpoint callback prevents the "Killed" error. Try to simply comment out the following line and see if DDP works:
I believe this worked correctly before; it might be due to some changes in a recent Lightning release. I will open an issue about it in the Lightning repo.
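For reference, a minimal sketch of what that change looks like, assuming the checkpoint callback is a standard `pytorch_lightning.callbacks.ModelCheckpoint` (the monitored metric name below is illustrative, and the exact keys in the template's config may differ):

```python
from pytorch_lightning.callbacks import ModelCheckpoint

# Leave out `dirpath` so Lightning falls back to its default checkpoint location;
# passing an explicit dirpath is what appeared to trigger the "Killed" error under DDP.
checkpoint_callback = ModelCheckpoint(
    monitor="val/acc",         # hypothetical metric name, adjust to what your LightningModule logs
    save_top_k=1,
    mode="max",
    # dirpath="checkpoints/",  # commented out: let Lightning choose the default directory
)
```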
I checked, and your solution resolves my problem! Thank you for your investigation!
Hello, I understand that the Hydra + PL DDP configuration works fine. Does it still work when using the Hydra submitit plugin to launch SLURM jobs? Would you have a working example to share, @nils-werner?
I've got the same problem while using SLURM (+ Optuna): the job is killed after the end of the first Optuna run, and unfortunately just commenting out the `dirpath` doesn't work.
All DDP discussions have been moved to #393
Hello. I found that the combination of multirun and DDP doesn't work with the following command:

```bash
python run.py -m datamodule.batch_size=32,64 trainer=ddp trainer.max_epochs=2
```

Can you have a look at this?
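For context, a minimal sketch of the kind of Trainer setup that the `trainer=ddp` config presumably selects, written against the current PyTorch Lightning API; the device count and exact options in the template's config are assumptions and may differ:

```python
import pytorch_lightning as pl

# Hypothetical stand-in for the template's `trainer=ddp` config:
# a Trainer that launches one process per GPU via DistributedDataParallel.
trainer = pl.Trainer(
    accelerator="gpu",
    devices=2,          # assumed GPU count for illustration
    strategy="ddp",
    max_epochs=2,
)
```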