fix global step and epoch counters on failed checkpointing
awaelchli committed Aug 25, 2021
1 parent 5000f2e commit 4bfc7bd
Showing 1 changed file with 5 additions and 0 deletions: pytorch_lightning/trainer/trainer.py
```diff
@@ -1379,4 +1379,9 @@ def _on_exception(self):
             return
         # save a checkpoint for fault tolerant training. we don't use `log_dir` to minimize the chances of failure.
         file_path = os.path.join(self.default_root_dir, ".pl_auto_save.ckpt")
+        # CheckpointConnector.dump_checkpoint will bump the counters, but we counteract it here since we failed
+        # and have not actually completed the epoch/step.
+        # TODO: remove when FitLoop and TrainingEpochLoop no longer depend on these counters for the done() condition
+        self.fit_loop.global_step -= 1
+        self.fit_loop.current_epoch -= 1
         self.save_checkpoint(file_path)
```
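The idea behind the fix can be illustrated with a minimal sketch. This is not the real `pytorch_lightning.Trainer`; only the names taken from the diff (`fit_loop.global_step`, `fit_loop.current_epoch`, `save_checkpoint`, `default_root_dir`, `_on_exception`) come from the source, and the stand-in `save_checkpoint` below merely assumes the real `dump_checkpoint` behavior of bumping the counters by one so that a resumed run starts at the next step/epoch. Because the run failed mid-step, the counters are decremented first, so the bump inside checkpointing lands back on the step/epoch that was actually in progress:

```python
import os
from types import SimpleNamespace


class Trainer:
    """Minimal sketch of the relevant Trainer pieces (assumed, simplified)."""

    def __init__(self, default_root_dir="."):
        self.default_root_dir = default_root_dir
        # stand-in for the real FitLoop; only the two counters matter here
        self.fit_loop = SimpleNamespace(global_step=100, current_epoch=3)
        self.saved = None  # records what a checkpoint would contain

    def save_checkpoint(self, file_path):
        # assumption: like CheckpointConnector.dump_checkpoint, bump the
        # counters so resumed training would start at the *next* step/epoch
        self.saved = {
            "global_step": self.fit_loop.global_step + 1,
            "epoch": self.fit_loop.current_epoch + 1,
            "path": file_path,
        }

    def _on_exception(self):
        file_path = os.path.join(self.default_root_dir, ".pl_auto_save.ckpt")
        # counteract the bump: the interrupted step/epoch never completed
        self.fit_loop.global_step -= 1
        self.fit_loop.current_epoch -= 1
        self.save_checkpoint(file_path)


trainer = Trainer()
trainer._on_exception()
# the checkpoint now records the step/epoch that was in progress at failure
print(trainer.saved["global_step"], trainer.saved["epoch"])  # 100 3
```

Without the decrement, the auto-saved checkpoint would claim step 101 / epoch 4 had been reached, and restoring it could make `FitLoop.done()` believe the interrupted work was already finished.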
