
Support time-based checkpointing trigger #6286

@ananthsub

🚀 Feature

Support time-based checkpointing in model checkpoint callback

Motivation

After #6146, Lightning will support checkpointing after N training batches or after M validation epochs. A useful addition would be checkpointing after a time interval T has elapsed during the training phase (e.g. checkpoint every hour).

Pitch

This would entail:

  • Adding a new optional argument time_interval, of type timedelta, to the callback constructor. In practice, the interval cannot be shorter than the time it takes to process a single training batch, because the check only runs at batch boundaries. The checkpoint is not guaranteed to be saved at the exact time specified, but it should be close.
  • Inside on_train_batch_end, we add the following check:
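import time

# skip_batch is assumed to come from the existing batch-interval check (see #6146).
now = time.monotonic()
time_interval = self.time_interval
# _prev_time_check must be initialized (e.g. when training starts);
# while it is None, the time-based trigger never fires.
prev_time_check = self._prev_time_check
skip_time = (
    time_interval is None
    or prev_time_check is None
    or (now - prev_time_check) < time_interval.total_seconds()
)
# Neither the batch-based nor the time-based trigger fired: nothing to do.
if skip_batch and skip_time:
    return
# The time-based trigger fired: restart the interval from now.
if not skip_time:
    self._prev_time_check = now
...  # commence with saving checkpoint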
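
The time check above can be tried in isolation. The following is a minimal sketch, not Lightning code: TimeIntervalTrigger, start, should_fire and the clock parameter are hypothetical names introduced here for illustration. The clock is injectable so the behaviour can be exercised without actually waiting.

```python
import time
from datetime import timedelta
from typing import Callable, Optional


class TimeIntervalTrigger:
    """Hypothetical helper (not part of Lightning): fires once per elapsed interval."""

    def __init__(
        self,
        time_interval: Optional[timedelta],
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.time_interval = time_interval
        self._clock = clock
        self._prev_time_check: Optional[float] = None

    def start(self) -> None:
        # Call once when training starts so the first interval is measured from here.
        self._prev_time_check = self._clock()

    def should_fire(self) -> bool:
        # Disabled, or start() was never called: the time trigger never fires.
        if self.time_interval is None or self._prev_time_check is None:
            return False
        now = self._clock()
        if now - self._prev_time_check < self.time_interval.total_seconds():
            return False
        # The interval has elapsed: restart it from the current time.
        self._prev_time_check = now
        return True
```

Because should_fire would only be called at batch boundaries, the trigger fires on the first batch that ends after the interval has elapsed, which is why the save time is approximate rather than exact.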

Note: we will need synchronization between ranks so that all ranks enter the checkpoint-saving logic together, in case their timers are slightly off.
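
One possible way to get this agreement is to let rank 0 decide and have every other rank adopt its decision. The sketch below only simulates that idea in a single process: agree_on_save, simulate and the broadcast callable are hypothetical names, and in a real distributed job the broadcast would have to be a genuine collective (for example, something built on torch.distributed.broadcast_object_list), which is an assumption rather than part of this proposal.

```python
from typing import Callable, List


def agree_on_save(local_decision: bool, broadcast: Callable[[bool], bool]) -> bool:
    # `broadcast` is a hypothetical collective: every rank passes its own value
    # and every rank receives the value that rank 0 passed.
    return bool(broadcast(local_decision))


def simulate(elapsed_per_rank: List[float], interval_s: float) -> List[bool]:
    """Single-process stand-in for a multi-rank job (illustration only)."""
    # Each rank evaluates the time check against its own, slightly different, timer.
    local = [elapsed >= interval_s for elapsed in elapsed_per_rank]
    rank0_value = local[0]
    # Every rank ends up with rank 0's decision, whatever its local timer said.
    return [agree_on_save(decision, lambda _own: rank0_value) for decision in local]


# Rank 1's timer is slightly behind, but it still saves because rank 0 decided to.
print(simulate([3600.2, 3599.9, 3600.1], interval_s=3600.0))
```

With this scheme, a rank that adopts rank 0's decision would also need to reset its own previous-check timestamp, so that the timers do not drift further apart over time.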

Labels: callback, feature (Is an improvement or enhancement), help wanted (Open to be worked on)
