ModelCheckpoint can't work properly with filename which contains named formatting options containing dots #12770

Closed
HenryLau0220 opened this issue Apr 15, 2022 · 1 comment · Fixed by #12783
Labels
bug Something isn't working callback: model checkpoint

@HenryLau0220
Contributor

HenryLau0220 commented Apr 15, 2022

🐛 Bug

pytorch_lightning.callbacks.ModelCheckpoint fails when filename contains a named format field whose metric name includes a dot, for example: filename="epoch={epoch}-mAP@0.50={val/mAP@0.50:.4f}"

To Reproduce

# Model
from pytorch_lightning import LightningModule

class MyModel(LightningModule):
	...
	def validation_epoch_end(self, validation_step_outputs):
		self.log('val/mAP@0.50', 0.2)

# config.yaml
trainer:
  accelerator: gpu
  devices: 1
  callbacks:
    - class_path: pytorch_lightning.callbacks.ModelCheckpoint
      init_args:
        filename: "epoch={epoch}-mAP@0.50={val/mAP@0.50:.4f}"
        save_last: True
        save_top_k: 1
        monitor: "val/mAP@0.50"
        mode: "max"
        every_n_epochs: 1
        auto_insert_metric_name: False

Traceback (most recent call last):
  File "main_cli.py", line 6, in <module>
    cli = LightningCLI()
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/utilities/cli.py", line 528, in __init__
    self._run_subcommand(self.subcommand)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/utilities/cli.py", line 783, in _run_subcommand
    fn(**fn_kwargs)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 741, in fit
    self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 685, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 777, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1199, in _run
    self._dispatch()
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1279, in _dispatch
    self.training_type_plugin.start_training(self)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 202, in start_training
    self._results = trainer.run_stage()
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1289, in run_stage
    return self._run_train()
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1319, in _run_train
    self.fit_loop.run()
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/loops/base.py", line 145, in run
    self.advance(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/loops/fit_loop.py", line 234, in advance
    self.epoch_loop.run(data_fetcher)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/loops/base.py", line 151, in run
    output = self.on_run_end()
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 298, in on_run_end
    self.trainer.call_hook("on_train_epoch_end")
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1495, in call_hook
    callback_fx(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/callback_hook.py", line 93, in on_train_epoch_end
    callback.on_train_epoch_end(self, self.lightning_module)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 321, in on_train_epoch_end
    self.save_checkpoint(trainer)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 396, in save_checkpoint
    self._save_top_k_checkpoint(trainer, monitor_candidates)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 683, in _save_top_k_checkpoint
    self._update_best_and_save(current, trainer, monitor_candidates)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 717, in _update_best_and_save
    filepath = self._get_metric_interpolated_filepath_name(monitor_candidates, trainer, del_filepath)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 651, in _get_metric_interpolated_filepath_name
    filepath = self.format_checkpoint_name(monitor_candidates)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 582, in format_checkpoint_name
    filename = self._format_checkpoint_name(filename, metrics, auto_insert_metric_name=self.auto_insert_metric_name)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 543, in _format_checkpoint_name
    filename = filename.format(**metrics)
KeyError: 'val/mAP@0'

Environment

  • PyTorch Lightning Version: 1.5.10
  • PyTorch Version: 1.9.0
  • Python version: 3.7
  • OS: Ubuntu 20.04
  • CUDA/cuDNN version: CUDA 11.1 / cuDNN 8
  • GPU models and configuration: 3090
  • How you installed PyTorch (conda, pip, source): pulled from Docker Hub

cc @carmocca @awaelchli @ninginthecloud @jjenniferdai @rohitgr7

@HenryLau0220 HenryLau0220 added the needs triage Waiting to be triaged by maintainers label Apr 15, 2022
@awaelchli awaelchli added bug Something isn't working callback: model checkpoint and removed needs triage Waiting to be triaged by maintainers labels Apr 16, 2022
@awaelchli awaelchli added this to the 1.6.x milestone Apr 16, 2022
@awaelchli
Contributor

Hello,
Breaking this down, the failure can be reproduced with plain string formatting:

metrics = {"val/mAP@0.50": 0.2}
"{val/mAP@0.50:.4f}".format(**metrics)
  File "<ipython-input-29-c7a5e66c6218>", line 1, in <module>
    "{val/mAP@0.50:.4f}".format(**{"val/mAP@0.50": 0.2})
KeyError: 'val/mAP@0'

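A dot inside a named field is not supported by str.format: the field name is split at the first dot, so {val/mAP@0.50} is parsed as a lookup of the key val/mAP@0 followed by an attribute access, which is why the KeyError reports 'val/mAP@0'.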
Here is a better explanation: https://stackoverflow.com/a/23909682
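
To make the parsing concrete, here is a small sketch in plain Python (no Lightning required). It catches the KeyError to show which key str.format actually looks up, and then shows that an index-style field accepts the same dotted key. The index-style form is only one possible way around the limitation and is an assumption for illustration, not necessarily what any fix in the library does.

```python
metrics = {"val/mAP@0.50": 0.2}

# Named field: str.format splits the field name at the first dot, so it
# looks up the key "val/mAP@0" and never sees the full metric name.
try:
    "{val/mAP@0.50:.4f}".format(**metrics)
    missing_key = None
except KeyError as err:
    missing_key = err.args[0]

# Index-style field: everything between "[" and "]" is used as the key,
# so the dot is kept. The dict is passed positionally as argument 0.
formatted = "{0[val/mAP@0.50]:.4f}".format(metrics)

print(missing_key)
print(formatted)
```

The bracket form works because the format-string grammar allows any character except "]" inside an index, whereas a bare field name ends at the first "." or "[".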

We may want to validate these names in advance and warn the user or raise an error.
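
A rough sketch of what such a check could look like, using only the standard library. The helper name find_unsupported_fields and the warning text are hypothetical and not part of pytorch_lightning; the sketch assumes that any field name containing "." or "[" should be flagged.

```python
import warnings
from string import Formatter


def find_unsupported_fields(template):
    """Return named fields that str.format(**metrics) cannot look up as-is.

    Hypothetical helper for illustration; not a pytorch_lightning API.
    """
    unsupported = []
    for _literal, field_name, _spec, _conversion in Formatter().parse(template):
        # field_name is None when a chunk is literal text with no field.
        if field_name and ("." in field_name or "[" in field_name):
            unsupported.append(field_name)
    return unsupported


template = "epoch={epoch}-mAP@0.50={val/mAP@0.50:.4f}"
bad_fields = find_unsupported_fields(template)
if bad_fields:
    warnings.warn(
        f"Checkpoint filename fields {bad_fields} contain '.' or '[', "
        "which str.format treats as attribute or index access."
    )
```

Running the check when the callback is constructed, rather than at the first checkpoint save, would surface the problem before any training time is spent.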
