ModelCheckpoint improvement from "remove & write" to "write & remove" #1986

@devrimcavusoglu

I recently ran into an issue with model checkpointing and wanted to bring it up for discussion. I am not sure whether it counts as a "bug", so I opened it as a "question". In short, when model checkpointing is used with n_saved=1, there is a risk of losing the checkpoint because of the "remove first, then write" logic.

Problem
You can configure a model checkpoint to be saved at specific points during training. However, the checkpoint can be lost to write errors, interruptions, machine shutdowns, outages, and so on. The DiskSaver used by ModelCheckpoint first checks whether the checkpoint count would exceed n_saved, then removes the older checkpoint(s), and only then writes the new checkpoint. With n_saved=1 this becomes a plain "remove existing, then write": if the write fails for any reason, the previous checkpoint has already been deleted, and the training time and resources are wasted. Setting n_saved > 1 avoids this but leaves many dangling, redundant checkpoints behind; with a large number of experiments and huge models, especially on cloud storage, these files take up a lot of space unnecessarily.
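To make the failure mode concrete, here is a toy model of the ordering described above. It is not the actual DiskSaver code; `remove_then_write`, `demo_loss`, the file names, and the `fail` flag are all hypothetical and only simulate an interruption between the removal and the write.

```python
import os
import tempfile


def remove_then_write(directory, old_name, new_name, payload, fail=False):
    """Toy model of the 'remove first, then write' order (not ignite's code)."""
    # The old checkpoint is deleted before the new one exists on disk.
    os.remove(os.path.join(directory, old_name))
    if fail:
        # Simulate a crash, outage, or write error at the worst moment.
        raise OSError("simulated write failure")
    with open(os.path.join(directory, new_name), "wb") as f:
        f.write(payload)


def demo_loss():
    """Return the files left on disk after a failed save."""
    with tempfile.TemporaryDirectory() as d:
        with open(os.path.join(d, "ckpt_1.pt"), "wb") as f:
            f.write(b"epoch-1 weights")
        try:
            remove_then_write(d, "ckpt_1.pt", "ckpt_2.pt", b"epoch-2 weights", fail=True)
        except OSError:
            pass
        return sorted(os.listdir(d))
```

In this simulation `demo_loss()` returns an empty list: neither the old nor the new checkpoint survives the failed save.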

Ideas
The trivial idea is to set n_saved > 1, but this has the negative consequences mentioned above, which people will generally want to avoid. The second idea is to replace the "remove first, then write" logic with "write first, then remove". Are there any other practical ideas?
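A minimal sketch of the "write first, then remove" idea, using only the Python standard library. `SafeSaver`, `demo_safe`, and the `fail` flag are hypothetical names for illustration and are not part of ignite. The sketch assumes the checkpoint is already serialized to bytes; it writes to a temporary file in the same directory, renames it into place with `os.replace`, and only then prunes older checkpoints.

```python
import os
import tempfile


class SafeSaver:
    """Hypothetical saver: write the new checkpoint first, then prune old ones."""

    def __init__(self, directory, n_saved=1):
        self.directory = directory
        self.n_saved = n_saved
        self.saved = []  # checkpoint paths, oldest first

    def save(self, name, payload, fail=False):
        final_path = os.path.join(self.directory, name)
        # Temporary file in the same directory, so the rename stays on one filesystem.
        fd, tmp_path = tempfile.mkstemp(dir=self.directory, suffix=".tmp")
        try:
            with os.fdopen(fd, "wb") as f:
                if fail:
                    # Simulate a crash or write error before the data is complete.
                    raise OSError("simulated write failure")
                f.write(payload)
                f.flush()
                os.fsync(f.fileno())
            # The new checkpoint becomes visible under its final name in one step.
            os.replace(tmp_path, final_path)
        except BaseException:
            # Clean up the partial file; existing checkpoints are untouched.
            if os.path.exists(tmp_path):
                os.remove(tmp_path)
            raise
        if final_path in self.saved:
            # Same name overwritten: avoid tracking (and later deleting) it twice.
            self.saved.remove(final_path)
        self.saved.append(final_path)
        # Only after a successful write are older checkpoints removed.
        while len(self.saved) > self.n_saved:
            os.remove(self.saved.pop(0))


def demo_safe():
    """Return directory listings after a failed save and after a successful one."""
    with tempfile.TemporaryDirectory() as d:
        saver = SafeSaver(d, n_saved=1)
        saver.save("ckpt_1.pt", b"epoch-1 weights")
        try:
            saver.save("ckpt_2.pt", b"epoch-2 weights", fail=True)
        except OSError:
            pass
        after_failure = sorted(os.listdir(d))
        saver.save("ckpt_2.pt", b"epoch-2 weights")
        after_success = sorted(os.listdir(d))
        return after_failure, after_success
```

With this ordering the previous checkpoint survives a failed save, and it is removed only once the new one is safely in place. The trade-off is that the disk must briefly hold n_saved + 1 checkpoints during each save.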

NOTE: This issue is opened to discuss the situation, not meant to imply any feature request.
