hydra + multirun + ddp error #226

Closed
shim94kr opened this issue Dec 15, 2021 · 9 comments
Labels: bug (Something isn't working)

Comments

@shim94kr

Hello. I found that the combination of multirun and DDP doesn't work with the following command (with -m, Hydra sweeps over the comma-separated batch sizes as two sequential runs):

python run.py -m datamodule.batch_size=32,64 trainer=ddp trainer.max_epochs=2

Can you have a look at this?

@ashleve (Owner) commented Dec 15, 2021

Hi there. Is there any particular error?

I don't have access to a multi-GPU machine right now, but I can take a look later this week.

@shim94kr (Author) commented Dec 15, 2021

It crashes before the second run starts, though weirdly it passes the test phase of the first run.

[screenshot of the console output during the crash]

@nils-werner (Contributor)

I am also struggling to get DDP to work with multirun, though additionally I am on a Slurm cluster, which may be yet another complication.

One thing that you may try: I have noticed that PL requires you to calculate losses in a specific place for DP to work:

  1. Calculate the losses in `*_step_end`, not in `*_step`.
  2. Log them using `self.log(..., sync_dist=True)`.

For example:
    def training_step(self, batch: Any, batch_idx: int):
        loss, preds, targets = self.step(batch)

        # here we can return a dict with any tensors
        # and then read it in some callback or in `training_epoch_end()` below
        # remember to always return the loss from `training_step()`, or else backpropagation will fail!
        return {"loss": loss, "preds": preds, "targets": targets}

    def training_step_end(self, outputs):
        # log train metrics
        acc = self.train_acc(outputs['preds'], outputs['targets'])
        self.log("train/loss", outputs['loss'], on_step=False, on_epoch=True, prog_bar=False, sync_dist=True)
        self.log("train/acc", acc, on_step=False, on_epoch=True, prog_bar=True, sync_dist=True)

These rules may only apply to DP and not to DDP, but it certainly won't hurt to try it out.

@shim94kr (Author) commented Dec 16, 2021

Thank you for the suggestion.

I have tried that, but the same crash occurs :(
I changed the code as shown below, and I'm running it on a single multi-GPU server.

    def training_step(self, batch: Any, batch_idx: int):
        loss, preds, targets = self.step(batch)

        # here we can return a dict with any tensors
        # and then read it in some callback or in `training_epoch_end()` below
        # remember to always return the loss from `training_step()`, or else backpropagation will fail!
        return {"loss": loss, "preds": preds, "targets": targets}

    def training_step_end(self, outputs):
        # log train metrics
        acc = self.train_acc(outputs['preds'], outputs['targets'])
        self.log("train/loss", outputs['loss'], on_step=False, on_epoch=True, prog_bar=False, sync_dist=True)
        self.log("train/acc", acc, on_step=False, on_epoch=True, prog_bar=True, sync_dist=True)

    def validation_step(self, batch: Any, batch_idx: int):
        loss, preds, targets = self.step(batch)
        return {"loss": loss, "preds": preds, "targets": targets}
    
    def validation_step_end(self, outputs):
        # log val metrics
        acc = self.val_acc(outputs['preds'], outputs['targets'])
        self.log("val/loss", outputs['loss'], on_step=False, on_epoch=True, prog_bar=False, sync_dist=True)
        self.log("val/acc", acc, on_step=False, on_epoch=True, prog_bar=True, sync_dist=True)

    def validation_epoch_end(self, outputs: List[Any]):
        acc = self.val_acc.compute()  # get val accuracy from current epoch
        self.val_acc_best.update(acc)
        self.log("val/acc_best", self.val_acc_best.compute(), on_epoch=True, prog_bar=True)

    def test_step(self, batch: Any, batch_idx: int):
        loss, preds, targets = self.step(batch)
        return {"loss": loss, "preds": preds, "targets": targets}

    def test_step_end(self, outputs):
        # log test metrics
        acc = self.test_acc(outputs['preds'], outputs['targets'])
        self.log("test/loss", outputs['loss'], on_step=False, on_epoch=True, sync_dist=True)
        self.log("test/acc", acc, on_step=False, on_epoch=True, sync_dist=True)

Also, I found that others are struggling with the same issue, and it doesn't seem to have a simple solution.
A workaround would be appreciated.

Lightning-AI/pytorch-lightning#2727
https://forums.pytorchlightning.ai/t/using-hydra-ddp/567/6

ashleve added the bug label on Dec 17, 2021
@ashleve (Owner) commented Jan 3, 2022

@shim94kr @nils-werner I played a bit with 2x V100 and found that removing the dirpath argument for the checkpoint callback prevents the "Killed" error.

Try simply commenting out the dirpath line in the checkpoint callback config and see if DDP works.

I believe this worked correctly before, so it might be due to some changes in a recent lightning release. I will open an issue about it in the lightning repo.
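
For illustration, here is roughly what the workaround amounts to at the Lightning API level (a minimal sketch, not the template's actual Hydra config; the monitored metric name is an assumption):

    # Sketch: omit the explicit dirpath so Lightning resolves the checkpoint
    # directory itself; passing an explicit dirpath is what appeared to trigger
    # the "Killed" error under DDP + multirun.
    from pytorch_lightning.callbacks import ModelCheckpoint

    checkpoint_callback = ModelCheckpoint(
        monitor="val/acc",  # assumed metric name
        mode="max",
        save_top_k=1,
        # dirpath="checkpoints/",  # commented out: no explicit checkpoint path
    )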

@shim94kr (Author) commented Jan 5, 2022

I checked, and your solution resolves my problem! Thank you for your investigation!

@LTMeyer commented Jan 6, 2022

Hello, I understand the configuration with Hydra + PL DDP now works fine. Does it still work when using the Hydra Submitit plugin to launch SLURM jobs? Would you have a working example to share, @nils-werner?
I understood that Hydra + the Submitit plugin for SLURM + PL DDP was problematic (Lightning-AI/pytorch-lightning#2727).
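
For reference, the invocation I have in mind is along these lines (a sketch of typical hydra-submitit-launcher usage, not a confirmed working setup; the partition name and GPU count are placeholders):

    # requires: pip install hydra-submitit-launcher
    python run.py -m hydra/launcher=submitit_slurm \
        hydra.launcher.partition=gpu \
        hydra.launcher.gpus_per_node=2 \
        datamodule.batch_size=32,64 trainer=ddp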

ashleve linked a pull request on May 31, 2022 that would close this issue
ashleve removed the link to the pull request on Jun 8, 2022
@michalshap commented Jun 19, 2022

I've got the same problem while using SLURM (+ Optuna): the job is killed after the end of the first Optuna run, and unfortunately just commenting out the dirpath doesn't work.
Are there any other solutions?

@ashleve (Owner) commented Jul 15, 2022

All DDP discussions have been moved to #393.

ashleve closed this as completed on Jul 15, 2022