-
Notifications
You must be signed in to change notification settings - Fork 3.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. Weβll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DDP with Hydra multirun doesn't work when dirpath in checkpoint callback is specified #11300
Comments
Hey @ashleve, The code in Lightning to enable DDP with Hydra is there: https://github.com/PyTorchLightning/pytorch-lightning/blob/7fa1aebcc99297e4d7eb8dcf2deb22e6da814edf/pytorch_lightning/strategies/ddp.py#L231. Would you be interested in investigating this behavior with sweep and creating a bug fix PR if you manage to solve it.? Otherwise, would you mind providing a reproducible script with the BoringModel + Hydra ? Alternatively, did you try using Best, |
@tchaton Hey, here's a minimal example: I was not able to find an easy fix, but here's what I found:
|
I brought it up in #2727, what has mostly been working for me is to change the lines @tchaton mentioned above to: if _HYDRA_AVAILABLE:
if HydraConfig.initialized():
cwd = get_original_cwd()
os_cwd = f'"{os.getcwd()}"' # this is needed to handle characters like `=` in the directory name
command = command[:2]
command += ["-cp", str(Path.cwd().relative_to(Path(cwd)) / ".hydra"), "-cn", "config.yaml"]
command += [f"hydra.run.dir={os_cwd}", f"hydra.job.name=train_ddp_process_{local_rank}"] |
Dear @jgbos, Would you be willing to make a PR to resolve this bug ? |
@tchaton I'll think about it. I'm running into some issue with multirun that I need to debug, which brings me to the actual challenge of this PR... is there a good way to test Hydra+PL? It's still a bit hacky to assume the |
FYI, if someone wants to fix this before I get to it, I think this code should work (needs testing though): if _HYDRA_AVAILABLE:
if HydraConfig.initialized():
orig_cwd = get_original_cwd()
cwd = os.getcwd()
os_cwd = f'"{cwd}"' # this is needed to handle characters like `=` in the directory name
hydra_cfg = HydraConfig.get()
hydra_output = os.path.join(os.path.relpath(cwd, orig_cwd), hydra_cfg.output_subdir)
command = command_no_args # command[:2] or command[:3] if using "python -m script.main"
command += ["-cp", hydra_output, "-cn", "config.yaml"]
command += [f"hydra.run.dir={os_cwd}", f"hydra.job.name=train_ddp_process_{local_rank}"] |
Hey @jgbos, Great question. I think the simplest is to create a simple test with the config file, not in the right place and see if you can recover from it. Best, |
@tchaton True, I'm also thinking about PL+Hydra properties to test. There are going to be a few edge cases my solution doesn't support. I would rather PL support something like |
Yes, this needs to be revisited :) Feel free to open a PR with your solution, so we can iterate on your findings. |
hi @ashleve - thanks for creating the minimal repro! that was really helpful. Sounds like there are two issues here:
As for 1, in Hydra 1.2 (the one we are currently working on), we added an option to not changing current working dir. If you run your application with |
@carmocca Yes, it should fix this issue. |
Hope this issue will be fixed |
@OZOOOOOH sorry, I've been swamped on other projects to finish the PR. You can checkout a current solution we have here: https://mit-ll-responsible-ai.github.io/responsible-ai-toolbox/how_to/hydraddp.html. |
We had to revert part of this change in #15737. I'm reopening the issue so that we can continue to look for a solution that does not involve changing the current working directory when launching the processes. |
I've been working to come up with a solution for this (and more generally). A robust solution would remove Lightnings need to depend on from typing import Optional
import submitit
from submitit.core.core import Executor
class SubmititTrainer:
def __init__(self, executor: Optional[Executor] = None, devices: int = 1, **kwargs):
"""PyTorch Lightning Trainer Wrapped by Submitit
This class does not inherit `Trainer` because I want to support DDP in the notebook.
Parameters
----------
executor: Executor | None (default: None)
The submitit executor, if `None` it uses a default local executor.
devices: int = 1
The devices for `pl.Trainer`
**kwargs: Any
Arguments for `pl.Trainer`
"""
self.kwargs = kwargs
self.devices = devices
if executor is None:
self.executor = submitit.AutoExecutor("submitit_log", cluster="local")
self.executor.update_parameters(gpus_per_node=devices)
def _init_trainer(self):
return pl.Trainer(devices=self.devices, **self.kwargs)
def _fit(self, *args, **kwargs):
trainer = self._init_trainer()
return trainer.fit(*args, **kwargs)
def _test(self, *args, **kwargs):
trainer = self._init_trainer()
return trainer.test(*args, **kwargs)
def _predict(self, *args, **kwargs):
trainer = self._init_trainer()
return trainer.predict(*args, **kwargs)
def fit(self, *args, **kwargs):
return self.executor.submit(self._fit, *args, **kwargs)
def test(self, *args, **kwargs):
return self.executor.submit(self._test, *args, **kwargs)
def predict(self, *args, **kwargs):
return self.executor.submit(self._predict, *args, **kwargs)
from rai_toolbox.mushin.testing.lightning import SimpleLightningModule, SimpleDataModule
dm = SimpleDataModule()
model = SimpleLightningModule()
trainer = SubmititTrainer(strategy="ddp", accelerator="gpu", devices=2, max_epochs=1)
job = trainer.test(model=model, datamodule=dm)
print(f"Started Local Job: {job.job_id}")
|
π Bug
Running DDP with Hydra multirun ends up with "Killed" error message when launching the second task:
I experience this ONLY when passing the
dirpath
parameter to checkpoint callback:Tested for lightning v1.5.7. I believe this issue wasn't around in one of the previous releases.
This probably has something to do with the way hydra changes working directory for each new run - the directory for storing checkpoints also gets changed. If I remember correctly, there was some workaround implemented in lightning which made DDP possible despite that.
cc @tchaton @rohitgr7 @justusschock @kaushikb11 @awaelchli @akihironitta
The text was updated successfully, but these errors were encountered: