Closed
Labels
callback, feature (Is an improvement or enhancement), help wanted (Open to be worked on)
Description
🚀 Feature
Support time-based checkpointing in model checkpoint callback
Motivation
After #6146, Lightning will support checkpointing after N training batches or after M validation epochs. A useful addition would be to checkpoint at a fixed time interval T during the training phase (e.g. checkpoint every 1 hour).
Pitch
This would entail:
- Adding a new optional argument time_interval to the callback constructor. This is of type timedelta. For all practical purposes, this cannot be smaller than the amount of time it takes to process a single training batch. The checkpoint is not guaranteed to be saved at the exact time specified, but it should be close.
- Inside on_train_batch_end, adding the following check:
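# inside on_train_batch_end (requires `import time` at module level)
now = time.monotonic()
time_interval = self.time_interval
# assumed to be initialized when training starts; while it is None the time check never fires
prev_time_check = self._prev_time_check
skip_time = (
    time_interval is None
    or prev_time_check is None
    or (now - prev_time_check) < time_interval.total_seconds()
)
# skip_batch is assumed to be the existing every-N-batches check from #6146
if skip_batch and skip_time:
    return
if not skip_time:
    self._prev_time_check = now
... # commence with saving checkpoint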
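The time check above can be isolated and exercised without a trainer. The following is only a minimal sketch under stated assumptions: IntervalTimer is a hypothetical helper (not part of Lightning), the clock argument is an illustrative injection point for testing, and the timer is assumed to start when the helper is constructed.

```python
import time
from datetime import timedelta
from typing import Callable, Optional


class IntervalTimer:
    """Hypothetical helper mirroring the proposed time check (not a Lightning API)."""

    def __init__(
        self,
        time_interval: Optional[timedelta],
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.time_interval = time_interval
        self._clock = clock
        # Assumption: timing starts at construction, so the first interval
        # is measured from here. With no interval, the timer stays disabled.
        self._prev_time_check: Optional[float] = (
            clock() if time_interval is not None else None
        )

    def should_skip(self) -> bool:
        """Return True while less than time_interval has elapsed since the last save."""
        if self.time_interval is None or self._prev_time_check is None:
            return True
        now = self._clock()
        if (now - self._prev_time_check) < self.time_interval.total_seconds():
            return True
        # Enough time has passed: record the new reference point and allow a save.
        self._prev_time_check = now
        return False
```

Because the check only runs when a batch finishes, the save happens at the first batch boundary after the interval has elapsed, which is why the exact time cannot be guaranteed.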
Note: we will need synchronization between ranks so that all ranks enter the checkpoint-saving logic together, in case their timers are slightly off.
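One possible way to keep ranks in step is to let a single rank decide and have every other rank adopt that decision. The sketch below only illustrates the idea in a single process: sync_skip_time and simulate_ranks are hypothetical names, and the broadcast callable is an assumed stand-in for whatever collective the training backend provides.

```python
from typing import Callable, List


def sync_skip_time(local_skip_time: bool, broadcast: Callable[[bool], bool]) -> bool:
    """Replace this rank's timer decision with the source rank's decision.

    `broadcast` is a hypothetical callable that returns the source rank's
    value on every rank, regardless of the local value passed in.
    """
    return broadcast(local_skip_time)


def simulate_ranks(local_decisions: List[bool], src: int = 0) -> List[bool]:
    """Single-process stand-in for a collective: every rank adopts rank src's value."""
    source_value = local_decisions[src]
    return [
        sync_skip_time(decision, lambda _local: source_value)
        for decision in local_decisions
    ]
```

In this simulation, simulate_ranks([False, True, True]) makes all three ranks agree with rank 0 and proceed to save, even though two local timers had not yet expired.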