[bug-fix] Delete .pt checkpoints past keep-checkpoints #5271
Looks good except for the training status version update that I do not yet understand.
@@ -12,7 +12,7 @@

 logger = get_logger(__name__)

-STATUS_FORMAT_VERSION = "0.2.0"
+STATUS_FORMAT_VERSION = "0.2.1"
Why bump this? It does not seem to be used anywhere. If this is needed, you need to check for backwards compatibility at some point, no?
I think we should still tick it. We don't check it because there haven't been any breaking changes to the loading of this file. If we do make a breaking change, I would think we'll have to tick to 1.0 and check. Checking whether the minor versions match was somewhat subsumed by the mlagents version check, so we didn't add a separate check for this file. At the end of the day, this versioning was intended so that Cloud could reliably use this file if needed.
With that said, I did tick it incorrectly: it should be 0.3.0, not a patch version.
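For illustration only, the policy described in this comment (no check today, a major-version check once a breaking change lands) could be sketched as below. `parse_version` and `is_compatible` are hypothetical helpers, not functions that exist in the codebase, and the `"0.3.0"` value simply follows the correction mentioned above.

```python
from typing import Tuple

# Value taken from the review comment; the real constant may differ.
STATUS_FORMAT_VERSION = "0.3.0"


def parse_version(version: str) -> Tuple[int, int, int]:
    """Split a 'major.minor.patch' string into three integers."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def is_compatible(file_version: str, current: str = STATUS_FORMAT_VERSION) -> bool:
    """Hypothetical check: a file is loadable when the major versions match."""
    return parse_version(file_version)[0] == parse_version(current)[0]
```

Under this sketch a status file written as `0.2.0` would still load, while one written after a tick to `1.0.0` would be rejected by a reader that is still on `0.x`.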
Ok, feel free to merge when the tests pass.
Proposed change(s)
Update the checkpoint manager to manage non-onnx files as well. This means that PyTorch checkpoints also show up in the training_status.json, and are deleted when they expire.
Types of change(s)
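As a rough sketch of the proposed change described above (not the actual checkpoint manager code), expiring a checkpoint could mean deleting every file recorded for it, `.onnx` and `.pt` alike, once it falls outside the keep-checkpoints window. The function name `prune_checkpoints` and the `file_paths` field are assumptions made for this example.

```python
import os
from typing import Dict, List


def prune_checkpoints(checkpoints: List[Dict], keep_checkpoints: int) -> List[Dict]:
    """Keep the newest `keep_checkpoints` entries and delete the files of the rest.

    `checkpoints` is assumed to be ordered oldest to newest, and each entry is
    assumed to list all of its files (e.g. .onnx and .pt) under "file_paths".
    """
    if keep_checkpoints < 1:
        raise ValueError("keep_checkpoints must be at least 1")
    expired = checkpoints[:-keep_checkpoints]
    kept = checkpoints[-keep_checkpoints:]
    for checkpoint in expired:
        for path in checkpoint["file_paths"]:
            # Skip files that were already removed by hand.
            if os.path.exists(path):
                os.remove(path)
    return kept
```

The returned list is what would be written back to the status file, so expired entries disappear from it at the same time as their files are removed from disk.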
Checklist
Other comments