
Store logger experiment id in checkpoint to enable correct resuming of experiments #5342

Open

ashleve opened this issue Jan 3, 2021 · 8 comments
Labels: checkpointing (Related to checkpointing), feature (Is an improvement or enhancement), help wanted (Open to be worked on), logger (Related to the Loggers)

ashleve (Contributor) commented Jan 3, 2021

🚀 Feature

Allow each logger to attach custom data to the checkpoint, such as the ID of the experiment defined by the logger.

Motivation

Currently, if you use loggers like Wandb, Comet or Neptune, you have to manually restore the correct experiment ID when resuming from a checkpoint in order to continue logging to the previous experiment instead of creating a new experiment in the logger.

It would be better if the experiment/run ID could be attached by the logger to every checkpoint, and that ID could then be automatically passed to the logger when resuming from those checkpoints.
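
For context, a minimal sketch of the manual workaround this proposal would remove, assuming the wandb backend (the project name and run ID below are placeholders, and resume="allow" is forwarded to wandb.init through the logger's extra kwargs):

from pytorch_lightning import Trainer
from pytorch_lightning.loggers import WandbLogger

# First run: note the generated run ID (e.g. wandb_logger.version) and keep it somewhere yourself.
# Resumed run: pass that ID back by hand so wandb continues the same run instead of creating a new one.
wandb_logger = WandbLogger(project="my-project", id="saved-run-id", resume="allow")
trainer = Trainer(logger=wandb_logger, resume_from_checkpoint="last.ckpt")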

This was recently mentioned in #4935 and it seems to me like a very useful feature.

Solution

So my suggestion is the following:
Existing loggers could accept a parameter like

WandbLogger(store_id_in_ckpt=True)

which would make them attach the experiment ID to the checkpoint.
Alternatively, all loggers could always attach their ID without an extra flag.

Then, when we resume an experiment from a checkpoint, the trainer could accept a parameter like resume_logger_experiment

logger = WandbLogger()
trainer = Trainer(resume_from_checkpoint="last.ckpt", logger=logger, resume_logger_experiment=True)

which would make the trainer set the logger's experiment ID from the checkpoint.

I've so far only used wandb, but I assume other loggers like Neptune or TensorBoard also have an equivalent "id" parameter that enables them to resume an experiment? If so, this change should affect every one of them.

cc @Borda @awaelchli @edward-io @ananthsub @rohitgr7 @kamil-kaczmarek @Raalsky @Blaizzy @ninginthecloud

ashleve added the feature and help wanted labels Jan 3, 2021
tchaton (Contributor) commented Jan 4, 2021

Hey there,

It is a good idea. Would you mind making a PR for it?

Best regards,
T.C

tchaton added this to the 1.2 milestone Jan 4, 2021
tchaton added the priority: 0 (High priority task) label Jan 4, 2021
ashleve (Contributor, Author) commented Jan 4, 2021

Hey @tchaton.
Sure :)
I can try making a PR, but I haven't made any contributions here before, so let me know if my thinking is correct:

  • I don't see any hooks in LightningLoggerBase that would allow for state saving and restoring (like on_save_checkpoint in the base Callback class). I assume we should first implement such hooks somehow? A rough sketch of what I mean is after this list.
  • It seems that CheckpointConnector currently implements state restoration of callbacks, schedulers, optimizers, etc., but not loggers. I assume we can simply extend this functionality to loggers as well?
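
A minimal sketch of the kind of hook pair I have in mind, mirroring what the Callback class already offers (hypothetical names, not existing Lightning API):

from typing import Any, Dict

class LightningLoggerBase:
    # Hypothetical hook: return logger state (e.g. the experiment/run ID) to store in the checkpoint.
    def on_save_checkpoint(self) -> Dict[str, Any]:
        return {}

    # Hypothetical hook: restore logger state from the checkpoint (e.g. re-attach to the previous run).
    def on_load_checkpoint(self, state: Dict[str, Any]) -> None:
        pass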

edenlightning modified the milestones: 1.2, 1.3 Feb 9, 2021
awaelchli self-assigned this Feb 14, 2021
awaelchli (Contributor) commented

@hobogalaxy

Are you still interested? I can help too.

I don't see any hooks in LightningLoggerBase that would allow for state saving and restoring (like on_save_checkpoint in the base Callback class). I assume we should first implement such hooks somehow?

There are no such hooks because loggers are not callbacks.
But of course we could think of a way to restore loggers. This will look very different from logger to logger.

So I suggest the following:

  1. One PR that implements dumping a state_dict of some sort for loggers that goes into the checkpoint dict. Implement that first only for the TensorBoardLogger (no restore). A rough sketch of what this could look like is after this list.
  2. A PR that adds the functionality to restore the TensorBoardLogger from the state dict out of a checkpoint.
  3. Then follow up with other PRs to add this to the other loggers as well.
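
A minimal sketch of step 1, assuming a hypothetical state_dict method on the logger and a hook point where the checkpoint dict is assembled (all names here are illustrative, not existing Lightning API):

from typing import Any, Dict

class TensorBoardLoggerSketch:
    # Illustrative stand-in for the real TensorBoardLogger.
    def __init__(self, version: int) -> None:
        self.version = version  # the piece of state we want to survive a resume

    def state_dict(self) -> Dict[str, Any]:
        # Only small, serializable metadata should end up in the checkpoint.
        return {"version": self.version}

def dump_checkpoint_sketch(logger: TensorBoardLoggerSketch) -> Dict[str, Any]:
    # Where the trainer assembles the checkpoint dict, it would additionally store the logger state.
    checkpoint: Dict[str, Any] = {"epoch": 0, "state_dict": {}}  # placeholder contents
    checkpoint["logger"] = logger.state_dict()
    return checkpoint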

ashleve (Contributor, Author) commented Mar 3, 2021

@awaelchli Thanks for the suggestions. I will try to make a PR this week.

edenlightning removed the priority: 0 label Apr 27, 2021
edenlightning modified the milestones: v1.3, v1.4 Apr 27, 2021
Queuecumber (Contributor) commented

Pretty sure this is (sort of) a duplicate of #6205 and #7104.
Those dealt with HPC checkpoints, but it should really apply to both HPC checkpoints and regular checkpoints.

edenlightning modified the milestones: v1.4, v1.5 Jul 6, 2021
awaelchli modified the milestones: v1.5, v1.6 Nov 4, 2021
carmocca modified the milestones: 1.6, None Feb 1, 2022
carmocca added the checkpointing and logger labels Feb 1, 2022
yuvalkirstain commented

Hey, any news about this? That's a very useful feature.

awaelchli (Contributor) commented

Hello, the status is that we are currently investigating adding a consistent interface for stateful components, #11429. Loggers would also fall into this category, imo, and could inherit from it too. I believe this issue can move forward once we have that in place, which is likely for 1.7. A rough sketch of what such a stateful interface could look like is below.
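
A minimal sketch of a stateful-component interface (illustrative only, not the actual #11429 API):

from typing import Any, Dict, Protocol

class Stateful(Protocol):
    # Illustrative protocol: any component implementing these two methods could be
    # checkpointed and restored by the trainer, loggers included.
    def state_dict(self) -> Dict[str, Any]:
        ...

    def load_state_dict(self, state_dict: Dict[str, Any]) -> None:
        ...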

awaelchli modified the milestones: future, 1.7 Feb 10, 2022
borisdayma (Contributor) commented

Just for info: you can easily find out which run created a specific artifact, or which runs used it, with artifact.logged_by() and artifact.used_by().
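
For example, something along these lines with the wandb public API (the artifact path below is a placeholder):

import wandb

api = wandb.Api()
artifact = api.artifact("my-entity/my-project/model-checkpoint:latest")  # placeholder path

producing_run = artifact.logged_by()  # the run that logged (created) this artifact
consuming_runs = artifact.used_by()   # runs that used this artifact as input
print(producing_run.id, [run.id for run in consuming_runs])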
