FileNotFoundError for best checkpoint when using DDP with Hydra #5512
Comments
Hi! Thanks for your contribution, great first issue! |
Hey @inzouzouwetrust could you reproduce via the bug_report_model I shared with you and paste it here? Will help me debug. EDIT: it's in the main issue, missed it... I assume it needs the configs to be specified in the repo as well! |
Only had a quick glance at this issue, but it could simply be my fix here: #5155 |
I have also run into the issue @awaelchli mentioned, independent of Hydra launching. Thanks for the fix! I'll try pulling it down to see if it makes a difference in this context 🙂 . |
I tried @awaelchli's fix but it doesn't seem to work. Looking at what happens, this is still related to the output run directory of Hydra (which wraps our output dir). I had to make a few modifications for the bug to appear @inzouzouwetrust:

```yaml
hydra:
  run:
    dir: "data/${now:%Y-%m-%d_%H-%M-%S}"
  sweep:
    dir: "data/${now:%Y-%m-%d_%H-%M-%S}"
    subdir: ${hydra.job.num}
checkpoint:
  monitor: "x"
  mode: "max"
  verbose: True
  save_top_k: 1
```

```python
import os
import hydra
import torch
from pytorch_lightning import LightningModule, Trainer
from pytorch_lightning.callbacks import ModelCheckpoint
from torch.utils.data import Dataset
class RandomDataset(Dataset):
    """
    >>> RandomDataset(size=10, length=20)  # doctest: +ELLIPSIS
    <...bug_report_model.RandomDataset object at ...>
    """

    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len
class BoringModel(LightningModule):
    """
    >>> BoringModel()  # doctest: +ELLIPSIS +NORMALIZE_WHITESPACE
    BoringModel(
      (layer): Linear(...)
    )
    """

    def __init__(self):
        """
        Testing PL Module

        Use as follows:
        - subclass
        - modify the behavior for what you want

        class TestModel(BaseTestModel):
            def training_step(...):
                # do your own thing

        or:

        model = BaseTestModel()
        model.training_epoch_end = None
        """
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def loss(self, batch, prediction):
        # An arbitrary loss to have a loss that updates the model weights during `Trainer.fit` calls
        return torch.nn.functional.mse_loss(prediction, torch.ones_like(prediction))

    def step(self, x):
        x = self.layer(x)
        out = torch.nn.functional.mse_loss(x, torch.ones_like(x))
        return out

    def training_step(self, batch, batch_idx):
        output = self.layer(batch)
        loss = self.loss(batch, output)
        return {"loss": loss}

    def training_step_end(self, training_step_outputs):
        return training_step_outputs

    def training_epoch_end(self, outputs) -> None:
        torch.stack([x["loss"] for x in outputs]).mean()

    def validation_step(self, batch, batch_idx):
        output = self.layer(batch)
        loss = self.loss(batch, output)
        return {"x": loss}

    def test_step(self, batch, batch_idx):
        output = self.layer(batch)
        loss = self.loss(batch, output)
        return {"y": loss}

    def configure_optimizers(self):
        optimizer = torch.optim.SGD(self.layer.parameters(), lr=0.1)
        lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1)
        return [optimizer], [lr_scheduler]
# NOTE: If you are using a cmd line to run your script,
# provide the cmd line as below.
# opt = "--max_epochs 1 --limit_train_batches 1".split(" ")
# parser = ArgumentParser()
# args = parser.parse_args(opt)
@hydra.main(config_path="", config_name="config")
def test_run(cfg):
    print(cfg)

    class TestModel(BoringModel):
        def on_train_epoch_start(self) -> None:
            print("override any method to prove your bug")

    # fake data
    train_data = torch.utils.data.DataLoader(RandomDataset(32, 64))
    val_data = torch.utils.data.DataLoader(RandomDataset(32, 64))
    test_data = torch.utils.data.DataLoader(RandomDataset(32, 64))

    # model
    model = TestModel()
    callbacks = [ModelCheckpoint(**cfg.checkpoint)]
    trainer = Trainer(
        default_root_dir=os.getcwd(),
        limit_train_batches=1,
        limit_val_batches=1,
        max_epochs=1,
        weights_summary=None,
        callbacks=callbacks,
        accelerator="ddp",
        gpus=2,
        deterministic=True,
        benchmark=False,
    )
    trainer.fit(model, train_data, val_data)
    trainer.test(test_dataloaders=test_data)


if __name__ == "__main__":
    test_run()
```

The issue comes from the fact that when DDP internally spins up multiple processes, Hydra creates new output run directories that wrap our output directory. We have some additional Hydra logic before we spin up processes, but nothing handles this. Some potential solutions:
|
Hey @romesco, I experienced this bug before. Here is my hacky solution. Removed
from @SeanNaren's comment. Best, |
Hydra in general does not take environment variables to configure user options.

```yaml
hydra:
  run:
    dir: ${env:HYDRA_RUN_DIR,a_default_value}
```

Maybe a better option is to allow users to pass additional command line arguments to the ddp process.
The idea behind overriding hydra.run.dir to the proper one is that such cleanup or chdir would not be necessary. |
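A minimal sketch of that idea, assuming a simplified launcher (the `spawn_ddp_children` helper is illustrative, not Lightning's actual DDP plugin API): the parent process appends a `hydra.run.dir=...` override pointing at its own, already-interpolated run directory when it re-launches the script for the other ranks.

```python
import os
import subprocess
import sys

from hydra.core.hydra_config import HydraConfig
from hydra.utils import get_original_cwd


def spawn_ddp_children(num_gpus: int) -> None:
    # Re-run this script once per additional GPU, as a DDP launcher would.
    command = [sys.executable, *sys.argv]
    cwd = None
    if HydraConfig.initialized():
        # os.getcwd() is the run dir Hydra already created for rank 0, so the
        # children reuse it instead of resolving ${now:...} a second time.
        command.append(f"hydra.run.dir={os.getcwd()}")
        cwd = get_original_cwd()
    for local_rank in range(1, num_gpus):
        env = {**os.environ, "LOCAL_RANK": str(local_rank)}
        subprocess.Popen(command, env=env, cwd=cwd)
```

Because the override is just another command-line argument, the children need no extra cleanup or chdir.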
Alright, applying @awaelchli's fix #5155 along with the configuration found here: is working for me. This is with:
|
hey @romesco I think it works because your run dir is set to: run_dir = "${now:%Y-%m-%d}". If you were to add finer granularity, it would probably break. I also don't like having to force users to remove it. I think in the perfect world, we'd manually pass the interpolated run directory from the main process to the child processes, but this might be more complicated... |
I am proposing to do it when spawning the DDP processes.
Side issues: |
nice! apologies, just a lack of my own understanding. Testing locally and it seems to work fine (not sure what's going to happen in the logs with multiple processes writing to them, but at least there's one folder containing everything). Will make a PR! |
I'm late to the party but I can confirm that #5629 fixes it for me :) |
Well, I'm getting this error now with pl 1.5.2 and hydra 1.0. It can't actually find the train script and freezes
|
For me the error was related to
and replacing it with
fixes the error. |
I have the same error.
It seems that the file does not exist when I'm trying to load it, and in fact it is created just after the program crashes. I don't know if it is due to Hydra or not... However, I'm debugging the code (the problematic part is the following):

```python
checkpoint_callback = ModelCheckpoint(
    monitor='val_loss',
    dirpath=f'{BASEPATH}/.checkpoints/{cfg.model.name}/{cfg.data.task}/{cfg.seed}-{cfg.data.dataset}/',
    filename='{epoch:02d}-{val_loss:.2f}-{val_acc:.2f}',
    save_top_k=3,
    mode='min'
)

# populate the trainer with configurations
trainer = pl.Trainer(
    **cfg.trainer,
    # when strategy:'ddp' and find_unused_parameters:False,
    strategy=DDPStrategy(find_unused_parameters=False)
    if cfg.custom_trainer.strategy == 'ddp' and
    cfg.custom_trainer.find_unused_parameters == False
    else 'ddp',
    logger=wandb_logger,
    callbacks=[early_stop_callback, checkpoint_callback],
)

trainer.fit(model, train_loader, val_loader)
# Load the best checkpoint
best_checkpoint_path = trainer.checkpoint_callback.best_model_path
print(f"Loading best checkpoint from {best_checkpoint_path}")
model = model.load_from_checkpoint(best_checkpoint_path)
```

When I'm debugging, when one subprocess reaches the |
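One way to sidestep the mismatch described above, sketched here as an assumption rather than a fix confirmed in this thread: reusing the `trainer` and `model` from the snippet, have every rank wait for rank 0 and then adopt rank 0's best checkpoint path before loading.

```python
import os

import torch.distributed as dist

best_checkpoint_path = trainer.checkpoint_callback.best_model_path

if dist.is_available() and dist.is_initialized():
    dist.barrier()  # wait until rank 0 has finished writing the checkpoint
    # Broadcast rank 0's (absolute) path so every rank loads the same file,
    # even if other ranks ended up in a different Hydra run directory.
    paths = [os.path.abspath(best_checkpoint_path)]
    dist.broadcast_object_list(paths, src=0)
    best_checkpoint_path = paths[0]

print(f"Loading best checkpoint from {best_checkpoint_path}")
model = model.load_from_checkpoint(best_checkpoint_path)
```

This only helps when all ranks see the same filesystem, which is the usual single-node DDP setup discussed in this issue.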
🐛 Bug
I am getting a FileNotFoundError for loading the best checkpoint when using `trainer.test()` after `trainer.fit()` in DDP mode with Hydra.
My configuration file specifies that `hydra.run.dir="/path/to/data/${now:%Y-%m-%d_%H-%M-%S}"`.
As a result, the first process (rank 0) spawns in "/path/to/data/datetime1" and creates the "ckpts" and "logs" folders there, while the second process (rank 1) spawns in "/path/to/data/datetime2" and cannot access the "ckpts" and "logs" folders.
It appears that when calling `trainer.test()`, the program looks for "/path/to/data/datetime2/ckpts/best.ckpt", which is indeed not there.
Here is the error stack:
Please reproduce using the BoringModel
The error is triggered by using DDP with at least 2 GPUs, hence I cannot use Colab.
To Reproduce
Use this repository
Have at least 2 GPUs available.
Expected behavior
I would expect the program to use the subfolder spawned by the first process (rank 0) when loading the best checkpoint.
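For illustration only, a sketch of the kind of user-side workaround this expectation points at (the names and paths are assumptions, not part of the report): resolve the checkpoint directory once, outside the per-process `${now:...}` run dir, so that every rank reads and writes the same "ckpts" folder.

```python
import os

from hydra.utils import get_original_cwd
from pytorch_lightning.callbacks import ModelCheckpoint

# Inside the @hydra.main-decorated function: pin the checkpoint dir to a path
# that does not depend on the per-process timestamped run directory.
ckpt_dir = os.path.join(get_original_cwd(), "ckpts")
checkpoint_callback = ModelCheckpoint(dirpath=ckpt_dir, monitor="x", mode="max", save_top_k=1)
```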
Environment
Additional context