Iruos8805 commented Nov 2, 2025

What does this PR do?

This PR introduces a new callback hook, on_checkpoint_write_end, which is triggered after a checkpoint file has been fully written to disk.

It allows users to run custom logic, such as validation, integrity checks, or post-save actions, once checkpoint writing is complete.

Currently, on_save_checkpoint is triggered before the checkpoint file is written to disk, and no callback hook runs after the write operation has fully completed.

This limitation makes it difficult to safely perform actions that depend on the finalized checkpoint file, such as:

  • Running integrity checks or file validations
  • Launching asynchronous or distributed processes that use the newly written checkpoint
  • Triggering external tools (uploading to remote storage, post-save logging)

The motivation behind this change is to enable developers to reliably run logic only after the checkpoint is guaranteed to exist on disk.
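
As an illustration of the integrity-check use case above, here is a minimal sketch of what a user-side handler might look like. The hook's argument list (trainer, pl_module, filepath) is an assumption, since the description does not state the signature, and the class is a plain Python stand-in rather than a real Callback subclass so that the example runs without Lightning installed.

```python
import hashlib
import os
import tempfile


class ChecksumOnWriteEnd:
    """Stand-in for a Lightning Callback (hypothetical; not a real subclass)."""

    def __init__(self):
        self.checksums = {}  # filepath -> SHA-256 hex digest

    # Hypothetical signature: the PR text does not state the hook's arguments.
    def on_checkpoint_write_end(self, trainer, pl_module, filepath):
        # The hook is described as firing after the write, so the file should exist.
        if not os.path.isfile(filepath) or os.path.getsize(filepath) == 0:
            raise RuntimeError(f"checkpoint missing or empty: {filepath}")
        digest = hashlib.sha256()
        with open(filepath, "rb") as f:
            # Read in 1 MiB chunks so large checkpoints are not loaded at once.
            for chunk in iter(lambda: f.read(1 << 20), b""):
                digest.update(chunk)
        self.checksums[filepath] = digest.hexdigest()


def demo():
    """Simulate the hook firing on a fake checkpoint file."""
    cb = ChecksumOnWriteEnd()
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "fake.ckpt")
        with open(path, "wb") as f:
            f.write(b"fake checkpoint")
        cb.on_checkpoint_write_end(trainer=None, pl_module=None, filepath=path)
        return cb.checksums[path]
```

In real use the class would presumably subclass the Lightning Callback base class and be passed to the Trainer's callbacks list; the recorded digest could then be compared after an upload to remote storage.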

Fixes #15795

Before submitting
  • Was this discussed/agreed via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you list all the breaking changes introduced by this pull request?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or minor internal changes/refactors)

PR review

Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the review guidelines. In short, see the following bullet-list:

Reviewer checklist
  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

📚 Documentation preview 📚: https://pytorch-lightning--21323.org.readthedocs.build/en/21323/

github-actions bot added the docs (Documentation related) and pl (Generic label for PyTorch Lightning package) labels Nov 2, 2025
bhimrazy (Collaborator) commented Nov 4, 2025

Hi @Iruos8805,

I noticed there’s an after_save_checkpoint method defined in the logger:

class Logger(FabricLogger, ABC):
    """Base class for experiment loggers."""

    def after_save_checkpoint(self, checkpoint_callback: ModelCheckpoint) -> None:
        """Called after model checkpoint callback saves a new checkpoint."""

Would this method already address the issue being discussed here?
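
For comparison, a rough sketch of how the existing logger hook could be used to inspect a saved checkpoint. The class below is a Lightning-free stand-in, not a real Logger subclass, and the fake callback object is hypothetical; reading best_model_path is one assumed way to find the file, and whether this hook fires only after the file is fully on disk in every configuration is the open question here.

```python
import os
import tempfile
from types import SimpleNamespace


class FileCheckingLogger:
    """Stand-in for a Logger subclass; only the hook under discussion is modeled."""

    def __init__(self):
        self.seen = []  # list of (path, exists_on_disk) tuples

    def after_save_checkpoint(self, checkpoint_callback):
        # best_model_path is an attribute of ModelCheckpoint; it may be "" if unset.
        path = checkpoint_callback.best_model_path
        self.seen.append((path, bool(path) and os.path.isfile(path)))


def demo_logger():
    """Call the hook with a hypothetical object that mimics ModelCheckpoint."""
    logger = FileCheckingLogger()
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "best.ckpt")
        open(path, "wb").close()  # create an empty placeholder file
        fake_callback = SimpleNamespace(best_model_path=path)
        logger.after_save_checkpoint(fake_callback)
    return logger.seen
```

One design difference worth noting: this hook lives on loggers and receives the checkpoint callback, so logic that is unrelated to logging would have to be placed in a logger to use it.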

Development

Successfully merging this pull request may close these issues.

on_checkpoint_write_end callback
