hydra + multirun + ddp error #226

Closed
shim94kr opened this issue Dec 15, 2021 · 9 comments
Labels: bug (Something isn't working)

Comments

@shim94kr

Hello. I found that the combination of multirun and DDP doesn't work with the following command (with -m, Hydra sweeps over the comma-separated batch sizes as two sequential runs):

python run.py -m datamodule.batch_size=32,64 trainer=ddp trainer.max_epochs=2

Can you have a look at this?

@ashleve (Owner) commented Dec 15, 2021

Hi there. Is there any particular error?

I don't have access to a multi-GPU machine right now, but I can take a look later this week.

@shim94kr (Author) commented Dec 15, 2021

It crashes before the second run starts, though weirdly it passes the test phase of the first run.

[screenshot of the console output during the crash]

@nils-werner (Contributor)

I am also struggling to get DDP to work with multirun, though additionally I am on a Slurm cluster, which may be yet another complication.

One thing that you may try: I have noticed that PL requires you to calculate losses in a specific place for DP to work:

  1. Calculate the losses in `*_step_end`, not in `*_step`.
  2. Log them using `self.log(..., sync_dist=True)`.

For example:
    def training_step(self, batch: Any, batch_idx: int):
        loss, preds, targets = self.step(batch)

        # here we can return a dict with any tensors
        # and then read it in some callback or in `training_epoch_end()` below
        # remember to always return the loss from `training_step()`, or else backpropagation will fail!
        return {"loss": loss, "preds": preds, "targets": targets}

    def training_step_end(self, outputs):
        # log train metrics
        acc = self.train_acc(outputs['preds'], outputs['targets'])
        self.log("train/loss", outputs['loss'], on_step=False, on_epoch=True, prog_bar=False, sync_dist=True)
        self.log("train/acc", acc, on_step=False, on_epoch=True, prog_bar=True, sync_dist=True)

These rules may only apply to DP and not to DDP, but it certainly won't hurt to try it out.

@shim94kr (Author) commented Dec 16, 2021

Thank you for the suggestion.

I have tried that, but the same crash occurs :(
I changed the code as shown below, and I'm running it on a single multi-GPU server.

    def training_step(self, batch: Any, batch_idx: int):
        loss, preds, targets = self.step(batch)

        # here we can return a dict with any tensors
        # and then read it in some callback or in `training_epoch_end()` below
        # remember to always return the loss from `training_step()`, or else backpropagation will fail!
        return {"loss": loss, "preds": preds, "targets": targets}

    def training_step_end(self, outputs):
        # log train metrics
        acc = self.train_acc(outputs['preds'], outputs['targets'])
        self.log("train/loss", outputs['loss'], on_step=False, on_epoch=True, prog_bar=False, sync_dist=True)
        self.log("train/acc", acc, on_step=False, on_epoch=True, prog_bar=True, sync_dist=True)

    def validation_step(self, batch: Any, batch_idx: int):
        loss, preds, targets = self.step(batch)
        return {"loss": loss, "preds": preds, "targets": targets}
    
    def validation_step_end(self, outputs):
        # log val metrics
        acc = self.val_acc(outputs['preds'], outputs['targets'])
        self.log("val/loss", outputs['loss'], on_step=False, on_epoch=True, prog_bar=False, sync_dist=True)
        self.log("val/acc", acc, on_step=False, on_epoch=True, prog_bar=True, sync_dist=True)

    def validation_epoch_end(self, outputs: List[Any]):
        acc = self.val_acc.compute()  # get val accuracy from current epoch
        self.val_acc_best.update(acc)
        self.log("val/acc_best", self.val_acc_best.compute(), on_epoch=True, prog_bar=True)

    def test_step(self, batch: Any, batch_idx: int):
        loss, preds, targets = self.step(batch)
        return {"loss": loss, "preds": preds, "targets": targets}

    def test_step_end(self, outputs):
        # log test metrics
        acc = self.test_acc(outputs['preds'], outputs['targets'])
        self.log("test/loss", outputs['loss'], on_step=False, on_epoch=True, sync_dist=True)
        self.log("test/acc", acc, on_step=False, on_epoch=True, sync_dist=True)

Also, I found that others are struggling with the same issue, and it doesn't seem to have a simple solution.
A workaround would be appreciated.

Lightning-AI/pytorch-lightning#2727
https://forums.pytorchlightning.ai/t/using-hydra-ddp/567/6

ashleve added the bug label on Dec 17, 2021
@ashleve (Owner) commented Jan 3, 2022

@shim94kr @nils-werner I played a bit with 2x V100 and found that removing the dirpath argument for the checkpoint callback prevents the "Killed" error.

Try simply commenting out the dirpath line in the checkpoint callback config and see if DDP works.

I believe this worked correctly before, so it might be due to some changes in a recent lightning release. I will open an issue about it in the lightning repo.
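
For illustration, here is roughly what the workaround amounts to at the Lightning API level (a minimal sketch, not the template's actual Hydra config; the monitored metric name is an assumption):

    # Sketch: omit the explicit dirpath so Lightning resolves the checkpoint
    # directory itself; passing an explicit dirpath is what appeared to trigger
    # the "Killed" error under DDP + multirun.
    from pytorch_lightning.callbacks import ModelCheckpoint

    checkpoint_callback = ModelCheckpoint(
        monitor="val/acc",  # assumed metric name
        mode="max",
        save_top_k=1,
        # dirpath="checkpoints/",  # commented out: no explicit checkpoint path
    )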

@shim94kr (Author) commented Jan 5, 2022

I checked, and your solution resolves my problem! Thank you for your investigation!

@LTMeyer commented Jan 6, 2022

Hello, I understand the configuration with Hydra + PL DDP now works fine. Does it still work when using the Hydra Submitit plugin to launch SLURM jobs? Would you have a working example to share, @nils-werner?
I understood that Hydra + the Submitit plugin for SLURM + PL DDP was problematic (Lightning-AI/pytorch-lightning#2727).
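
For reference, the invocation I have in mind is along these lines (a sketch of typical hydra-submitit-launcher usage, not a confirmed working setup; the partition name and GPU count are placeholders):

    # requires: pip install hydra-submitit-launcher
    python run.py -m hydra/launcher=submitit_slurm \
        hydra.launcher.partition=gpu \
        hydra.launcher.gpus_per_node=2 \
        datamodule.batch_size=32,64 trainer=ddp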

ashleve linked a pull request on May 31, 2022 that would close this issue
ashleve removed the link to the pull request on Jun 8, 2022
@michalshap commented Jun 19, 2022

I've got the same problem while using SLURM (+ Optuna): the job is killed after the end of the first Optuna run, and unfortunately just commenting out the dirpath doesn't work.
Are there any other solutions?

@ashleve (Owner) commented Jul 15, 2022

All DDP discussions have been moved to #393.

ashleve closed this as completed on Jul 15, 2022