Problem

Models trained with `pl_helper.run_pl_training` with the option `--save` are saved through a PyTorch Lightning callback (`ModelCheckpoint`).

This callback is configured to save:

- the `top_k` best models, using the naming convention `"{epoch}-{step}.ckpt"`
- the last model: `"last.ckpt"`

When loading a trained model with `pl_helper.load_training` with the option `--run_id`, the checkpoint file used to load the model is `"last.ckpt"`. This means the weights used are those obtained after the last training iteration, not the weights of the best model.
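For context, the two sides presumably look roughly like the sketch below; the actual `pl_helper` code may differ, and `MyLightningModule`, `save_top_k=3`, and the directory layout are assumptions for illustration.

```python
from pathlib import Path

import pytorch_lightning as pl
import torch
from pytorch_lightning.callbacks import ModelCheckpoint


class MyLightningModule(pl.LightningModule):  # hypothetical stand-in model
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(4, 1)


# Presumed saving-side configuration inside pl_helper.run_pl_training:
checkpoint_cb = ModelCheckpoint(
    save_top_k=3,               # keep the k best checkpoints
    monitor="val_loss",
    filename="{epoch}-{step}",  # rendered as e.g. "epoch=4-step=500.ckpt"
    save_last=True,             # additionally write "last.ckpt"
)

# Presumed loading-side behavior inside pl_helper.load_training: always
# pick "last.ckpt", regardless of which checkpoint scored best.
ckpt_dir = Path("checkpoints") / "some_run_id"  # hypothetical layout
model = MyLightningModule.load_from_checkpoint(ckpt_dir / "last.ckpt")
```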
Proposition
It would be good to add a parameter to `pl_helper.load_training` to select between the last and the best model.

For this we need to:
- change the naming convention in `ModelCheckpoint`, so that the `val_loss` value is included in the checkpoint file name (first sketch below)
- when loading the model: parse the filenames in the checkpoint folder corresponding to the `run_id`, and keep the one with the best value (second sketch below). Important: we need to be sure of what we mean by "best" value (is it possible that for some models a greater `val_loss` is better, or is that impossible with `ModelCheckpoint`?)
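For the first step, a sketch of the proposed naming convention: with PyTorch Lightning's default `auto_insert_metric_name=True`, adding `{val_loss:...}` to the `filename` template embeds the metric value in the file name (the exact template and `save_top_k` value are assumptions, not the helper's current code).

```python
from pytorch_lightning.callbacks import ModelCheckpoint

# Embed the monitored metric in the file name so it can be recovered
# later by parsing alone.
checkpoint_cb = ModelCheckpoint(
    save_top_k=3,
    monitor="val_loss",
    mode="min",  # "max" would mean a greater val_loss is better
    # With the default auto_insert_metric_name=True, this renders as e.g.
    # "epoch=4-step=500-val_loss=0.4321.ckpt".
    filename="{epoch}-{step}-{val_loss:.4f}",
    save_last=True,
)
```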
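For the second step, a minimal helper that scans a run's checkpoint folder and picks the best-scoring file; `find_best_checkpoint` is a hypothetical name, not an existing `pl_helper` function.

```python
import re
from pathlib import Path

# Matches the val_loss embedded by the filename template sketched above.
_VAL_LOSS_RE = re.compile(r"val_loss=(-?[0-9.]+)\.ckpt$")


def find_best_checkpoint(ckpt_dir: Path, mode: str = "min") -> Path:
    """Return the checkpoint whose parsed val_loss is best.

    `mode` mirrors ModelCheckpoint's own `mode` argument: "min" means a
    lower val_loss is better, "max" the opposite.
    """
    scored = []
    for path in ckpt_dir.glob("*.ckpt"):
        match = _VAL_LOSS_RE.search(path.name)
        if match:  # skips "last.ckpt", which carries no val_loss
            scored.append((float(match.group(1)), path))
    if not scored:
        raise FileNotFoundError(f"no scored checkpoints found in {ckpt_dir}")
    best_score, best_path = min(scored) if mode == "min" else max(scored)
    return best_path
```

On the question above: `ModelCheckpoint` does accept `mode="max"` (a greater monitored value counts as better), so "greater `val_loss` is better" is possible in principle; the loading side therefore needs to know, or be told, which mode was used at training time.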