Feature/sg 757 resume for spots #870
Looks good. Only one comment: I don't understand why we need resume_from_remote_sg_logger when we already have resumed.
resumed is just a flag the wandb logger needs in order to continue logging properly (it is not passed explicitly but derived from resume in the training params).
Looks good. Do you want to add a section on this feature to the docs? It is not quite clear at first glance how to use it. Some example snippets (for both use cases) would definitely help and smooth the learning curve.
Sure, I added a section to our checkpoints.md in the latest commit.
… into feature/SG-757_resume_for_spots
LGTM
* wip * wip * added functionality to get wandb latest ckpt before launch * renamed param and simplified procedure * added error for wandb * added docs and example * fix tests --------- Co-authored-by: Eugene Khvedchenya <ekhvedchenya@gmail.com> Co-authored-by: Louis-Dupont <35190946+Louis-Dupont@users.noreply.github.com>
Added support for resuming from a remote checkpoint stored by the SG logger during training (that is, when sg_logger_params.save_checkpoints_remote=True).
For the base SG logger this is still problematic, as we don't have run IDs for S3.
Platform loggers: currently we can't download files from the platform except the ones they explicitly allow. I talked to @roikoren755, and once it becomes possible I will add the mechanism for the platform as well.
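For reviewers, here is a rough sketch of how the training params might look for the resume flow described above. Only `resume_from_remote_sg_logger` and `sg_logger_params.save_checkpoints_remote` come from this PR; the logger name, project name, epoch count, and the `can_resume_from_remote` helper are assumptions made for illustration and are not part of the library.

```python
# Illustrative training params as a plain dict (no library import needed).
# Values other than the two flags named in this PR are hypothetical.
training_params = {
    "max_epochs": 100,                         # hypothetical value
    "resume_from_remote_sg_logger": True,      # flag added in this PR
    "sg_logger": "wandb_sg_logger",            # assumed logger name
    "sg_logger_params": {
        "project_name": "my_project",          # hypothetical value
        "save_checkpoints_remote": True,       # must have been True in the original run
    },
}


def can_resume_from_remote(params: dict) -> bool:
    """Hypothetical sanity check, not a library function.

    Returns True only when remote resume is requested AND the logger is
    configured to store checkpoints remotely.
    """
    logger_params = params.get("sg_logger_params", {})
    return bool(params.get("resume_from_remote_sg_logger")) and bool(
        logger_params.get("save_checkpoints_remote")
    )


if __name__ == "__main__":
    print(can_resume_from_remote(training_params))
```

The helper only captures the precondition stated in the description: a remote resume has nothing to download unless the original run saved its checkpoints remotely.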
Regarding PR content: