Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/sg 757 resume for spots #870

Merged
merged 18 commits into from
May 15, 2023
Merged

Conversation

shaydeci
Copy link
Contributor

Added support for resuming from a remote ckpt stored by the SG logger during training (meaning when sg_logger_params.save_checkpoints_remot=True).

For the base sg logger this is still problematic, as we dont have run ids for S3.
Platform loggers- currently we cant download files from the platform except the ones they explicitly allow, I talked to @roikoren755 and once it will be possible- I will add the mechanism for the platform as well.

Regarding PR content:

  • I moved SG Logger initialization to be performed prior to checkpoint loading, so we can download the checkpoint which we wich to resume from.
  • I introduced a "resume_from_remote_sg_logger" training param, that when set will download "ckpt_name" into our checkpoints directory, then rsume training from it.

@dagshub
Copy link

dagshub bot commented Apr 30, 2023

@shaydeci shaydeci marked this pull request as ready for review April 30, 2023 12:27
Copy link
Collaborator

@ofrimasad ofrimasad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good. only one comment: I don't understand why we need resume_from_remote_sg_logger when we already have the resumed

@shaydeci
Copy link
Contributor Author

looks good. only one comment: I don't understand why we need resume_from_remote_sg_logger when we already have the resumed

resumed is just a flag we need for wandb logger to continue logging properly (it is not explicitly passed but rather derrived from resume in training params).
Even if it was specified for that- I think its better to pass this parameter through the training params (for me its more clear, but might be biased) for clarity/simplicity.

@BloodAxe
Copy link
Contributor

BloodAxe commented May 1, 2023

Looks good. Do you want to add a section on this feature to docs? I is not quite clear from the first glance how to use this feature. Some example snippets (For both use cases) would definitely help and smooth learning curve.

@shaydeci
Copy link
Contributor Author

shaydeci commented May 1, 2023

Looks good. Do you want to add a section on this feature to docs? I is not quite clear from the first glance how to use this feature. Some example snippets (For both use cases) would definitely help and smooth learning curve.

Sure, I added a section in our checkpoints.md in the latest commit.

Copy link
Contributor

@BloodAxe BloodAxe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@shaydeci shaydeci merged commit 023ac16 into master May 15, 2023
@shaydeci shaydeci deleted the feature/SG-757_resume_for_spots branch May 15, 2023 10:29
avideci pushed a commit that referenced this pull request May 23, 2023
* wip

* wip

* added functionality to get wandb latest ckpt before launch

* renamed param and simplified procedure

* added error for wandb

* added docs and example

* fix tests

---------

Co-authored-by: Eugene Khvedchenya <ekhvedchenya@gmail.com>
Co-authored-by: Louis-Dupont <35190946+Louis-Dupont@users.noreply.github.com>
avideci pushed a commit that referenced this pull request May 23, 2023
* wip

* wip

* added functionality to get wandb latest ckpt before launch

* renamed param and simplified procedure

* added error for wandb

* added docs and example

* fix tests

---------

Co-authored-by: Eugene Khvedchenya <ekhvedchenya@gmail.com>
Co-authored-by: Louis-Dupont <35190946+Louis-Dupont@users.noreply.github.com>
geoffrey-g-delhomme pushed a commit to geoffrey-g-delhomme/super-gradients that referenced this pull request May 26, 2023
* wip

* wip

* added functionality to get wandb latest ckpt before launch

* renamed param and simplified procedure

* added error for wandb

* added docs and example

* fix tests

---------

Co-authored-by: Eugene Khvedchenya <ekhvedchenya@gmail.com>
Co-authored-by: Louis-Dupont <35190946+Louis-Dupont@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants