ModelCheckpoint improvement from "remove & write" to "write & remove"

I recently ran into an issue with model checkpointing and wanted to bring it up for discussion. I am not sure whether it counts as a "bug", so I opened it as a "question". To sum up: when using model checkpointing with a certain configuration, namely `n_saved=1`, there is a risk of losing the checkpoint because of the "remove first, then write" logic.

**Problem**
You can use ModelCheckpoint to save a checkpoint at specific points during training. However, the checkpoint can be lost because of write errors, interruptions, machine shutdowns, outages, and so on. The DiskSaver used by ModelCheckpoint first checks whether the number of checkpoints would exceed `n_saved`, then removes the oldest checkpoint(s), and only then writes the new checkpoint. With `n_saved=1` this reduces to "remove existing, then write": if the write fails for any reason, the previous checkpoint has already been deleted, and the training time and resources spent so far are wasted. Setting `n_saved > 1` avoids this but leaves many redundant checkpoints behind. With a large number of experiments and huge models, especially in the cloud, these files take up a large amount of storage unnecessarily.
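To make the failure mode concrete, here is a minimal standard-library sketch of the "remove first, then write" ordering with a single retained file. It is not Ignite's actual `DiskSaver` code; `remove_then_write`, `demo_loss`, the file names, and the simulated failure are all hypothetical and exist only for illustration.

```python
import os
import tempfile


def remove_then_write(directory, old_name, new_name, payload, fail=False):
    """Hypothetical illustration of the risky ordering (not Ignite code)."""
    old_path = os.path.join(directory, old_name)
    if os.path.exists(old_path):
        os.remove(old_path)  # from here on, no good checkpoint exists on disk
    if fail:
        # Stand-in for a crash, full disk, or pre-emption during the write.
        raise OSError("simulated write failure")
    with open(os.path.join(directory, new_name), "wb") as f:
        f.write(payload)


def demo_loss():
    """Return the files left on disk after a save that fails mid-way."""
    with tempfile.TemporaryDirectory() as d:
        with open(os.path.join(d, "ckpt_1.pt"), "wb") as f:
            f.write(b"epoch-1 weights")
        try:
            remove_then_write(d, "ckpt_1.pt", "ckpt_2.pt", b"epoch-2 weights", fail=True)
        except OSError:
            pass
        return sorted(os.listdir(d))
```

In this sketch, `demo_loss()` leaves the directory empty: the old file was removed before the failed write, so neither checkpoint survives.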

**Ideas**
The trivial idea is to set `n_saved > 1`, but this has the drawbacks described above, which many people would rather avoid. The second idea is to replace the "remove first, then write" logic with "write first, then remove". Are there any other practical ideas?
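The "write first, then remove" idea could look roughly like the sketch below, which uses only the Python standard library. It is an assumption about one possible approach, not Ignite's implementation: `write_then_remove`, `demo_keep`, the `.pt` suffix, and the pruning by modification time are hypothetical choices. The new checkpoint is written to a temporary file in the same directory, moved into place with `os.replace`, and only then are older checkpoints pruned.

```python
import os
import tempfile


def write_then_remove(directory, new_name, payload, n_saved=1, suffix=".pt"):
    """Hypothetical sketch: write the new checkpoint, then prune old ones."""
    final_path = os.path.join(directory, new_name)
    # Temporary file in the same directory, so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())  # ask the OS to push the data to disk
        os.replace(tmp_path, final_path)  # move the finished file into place
    except BaseException:
        # The write failed: drop the partial file, leave old checkpoints alone.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    # Prune only after the new checkpoint is safely in place (assumes n_saved >= 1).
    others = [p for p in os.listdir(directory) if p.endswith(suffix) and p != new_name]
    others.sort(key=lambda p: os.path.getmtime(os.path.join(directory, p)))
    for stale in others[: max(len(others) - (n_saved - 1), 0)]:
        os.remove(os.path.join(directory, stale))
    return final_path


def demo_keep():
    """Save two checkpoints with n_saved=1 and return what remains on disk."""
    with tempfile.TemporaryDirectory() as d:
        write_then_remove(d, "ckpt_1.pt", b"epoch-1 weights")
        write_then_remove(d, "ckpt_2.pt", b"epoch-2 weights")
        return sorted(os.listdir(d))
```

The trade-off is that the disk briefly holds `n_saved + 1` checkpoints, so the extra space for one checkpoint must be available during each save. In exchange, a failed write leaves the previous checkpoint untouched.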

NOTE: This issue is opened to discuss the situation, not meant to imply any feature request.

ModelCheckpoint improvement from "remove & write" to "write & remove" #1986
