Checkpointing by time interval #7621

@hudeven

🚀 Feature

For the ModelCheckpoint callback, support a time_period option to save a checkpoint every X seconds/minutes/hours.

Motivation

Training large models takes days, and a run sometimes crashes in the middle of an epoch because of infrastructure issues. Besides per-epoch checkpointing, I would like to checkpoint at a finer granularity. The ModelCheckpoint callback currently supports "every_n_train_steps", but the time per training step varies with the configuration (batch size, gradient accumulation, etc.), so a fixed step count does not correspond to a fixed wall-clock interval.

Pitch

It would be better to support checkpointing by time period (with validation optional, since the main purpose is resuming training after a failure), alongside the existing checkpointing by epoch or step with validation.
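
To make the request concrete, here is a minimal, framework-agnostic sketch of the time-based trigger I have in mind. The names `TimeIntervalTrigger`, `should_save`, and `train` are hypothetical and are not part of Lightning; the clock is injectable only so the behaviour can be tested without waiting.

```python
import time
from typing import Callable, List


class TimeIntervalTrigger:
    """Hypothetical helper: fires at most once per `interval_seconds`."""

    def __init__(
        self,
        interval_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        if interval_seconds <= 0:
            raise ValueError("interval_seconds must be positive")
        self.interval_seconds = interval_seconds
        self._clock = clock
        # Start timing when the trigger is created.
        self._last_saved = clock()

    def should_save(self) -> bool:
        """Return True and reset the timer once the interval has elapsed."""
        now = self._clock()
        if now - self._last_saved >= self.interval_seconds:
            self._last_saved = now
            return True
        return False


def train(
    num_steps: int,
    trigger: TimeIntervalTrigger,
    run_step: Callable[[int], None],
    save_checkpoint: Callable[[int], None],
) -> List[int]:
    """Toy loop: run a step, then checkpoint if enough time has passed."""
    saved_at = []
    for step in range(num_steps):
        run_step(step)
        if trigger.should_save():
            save_checkpoint(step)
            saved_at.append(step)
    return saved_at
```

In Lightning, a check like `should_save()` could presumably run in a callback's `on_train_batch_end` hook, with `trainer.save_checkpoint(...)` doing the write. That wiring is an assumption on my part, and in distributed training the ranks would also need to agree on the decision.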

Alternatives

Currently I have to start a run to measure the time per training step, then pick a suitable value for "every_n_train_steps".
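
The workaround amounts to the arithmetic below. This is only a sketch: `steps_for_interval` is a hypothetical helper, and the measured seconds per step is whatever the trial run reports. It rounds down so that checkpoints are at least as frequent as the target interval.

```python
import math


def steps_for_interval(interval_seconds: float, seconds_per_step: float) -> int:
    """Approximate a wall-clock interval as a number of training steps."""
    if interval_seconds <= 0 or seconds_per_step <= 0:
        raise ValueError("both arguments must be positive")
    # Round down so n steps take no longer than the interval; never go below 1.
    return max(1, math.floor(interval_seconds / seconds_per_step))


# Example: target a checkpoint every 30 minutes at a measured 2.5 s per step.
every_n_train_steps = steps_for_interval(30 * 60, 2.5)
```

The result has to be recomputed whenever the batch size or gradient accumulation setting changes, which is the inconvenience this feature would remove.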

Additional context

cc: @shuyingsunshine21 @ananthsub

Labels: feature, help wanted
