Identical best monitored metric values across different depths can result in depth-aligned checkpoint metadata corruption #15
Comments
Thanks for noticing the issue and providing the repro @CyprienRicque!
Root Cause
To minimize the dependent parent callback code overridden ( Unfortunately, (irrespective of
Resolution
To address these edge cases, I've just pushed a commit that:
I've also added a few tests to avoid similar issues/regressions in the future. I hope it's okay that I slightly renamed the issue title to reflect the identified bug scope. This fix should be available with the next FTS patch release (
Thanks again for identifying this issue and taking the time to report it with a good repro, you've helped improve FTS for everyone! Feel free to reach out anytime if you have other issues or want to share more about your use case. Best of luck with your work!
Fixed with 0ef51de
Great, thank you! 😊
🐛 Bug
While fitting, if the current level l did not generate new checkpoints, then the trainer will move on to the next level l+1 and reload the last best model saved, likely from level l-1.
Upon further tests, it turns out this happens because save_last is set to True.
This loading fails with the error:
To Reproduce
Expected behavior
Should just reload the checkpoint without error.
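The scenario described above can be illustrated with a small, self-contained toy simulation. This is only a sketch under stated assumptions: `ToyBestTracker` and `simulate` are hypothetical names invented for illustration, a lower monitored metric is assumed to be better, and none of this is the FTS or Lightning implementation. It only shows how a level that produces no new best checkpoint leads the next level to reload a checkpoint saved at an earlier level.

```python
from dataclasses import dataclass


@dataclass
class ToyBestTracker:
    """Hypothetical stand-in for a best-checkpoint callback (not the FTS API)."""

    best_score: float = float("inf")  # assumption: lower is better, e.g. a loss
    best_path: str = ""
    best_depth: int = -1

    def maybe_save(self, depth: int, step: int, score: float) -> bool:
        """Record a new 'checkpoint' only when the monitored metric improves."""
        if score < self.best_score:
            self.best_score = score
            self.best_path = f"depth{depth}-step{step}.ckpt"
            self.best_depth = depth
            return True
        return False


def simulate(scores_by_depth):
    """Return (entered_depth, depth_of_reloaded_checkpoint) for each transition."""
    tracker = ToyBestTracker()
    reloads = []
    for depth in sorted(scores_by_depth):
        if depth > 0:
            # On entering a new level, the current best checkpoint is reloaded.
            reloads.append((depth, tracker.best_depth))
        for step, score in enumerate(scores_by_depth[depth]):
            tracker.maybe_save(depth, step, score)
    return reloads


# Level 1 never beats the best score from level 0, so it saves nothing new.
toy_scores = {0: [0.9, 0.7], 1: [0.8, 0.75], 2: [0.6]}
reloads = simulate(toy_scores)
print(reloads)  # [(1, 0), (2, 0)]
```

In this toy run, entering level 2 reloads a checkpoint that was saved at level 0, because level 1 generated no new checkpoint. That is the situation in which the reported reload is attempted.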
Environment
Installation (conda, pip, source): conda