
Store logger experiment id in checkpoint to enable correct resuming of experiments #5342

Open

ashleve opened this issue Jan 3, 2021 · 8 comments
Labels: checkpointing (Related to checkpointing), feature (Is an improvement or enhancement), help wanted (Open to be worked on), logger (Related to the Loggers)

ashleve (Contributor) commented Jan 3, 2021

🚀 Feature

Allow each logger to attach custom data to the checkpoint, such as the ID of the experiment defined by the logger.

Motivation

Currently, if you use loggers like Wandb, Comet or Neptune, you have to manually restore the correct experiment ID when resuming from a checkpoint in order to continue logging to the previous experiment instead of creating a new experiment in the logger.

It would be better if the experiment/run ID could be attached by the logger to every checkpoint, and that ID could then be automatically passed to the logger when resuming from those checkpoints.
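
For context, a minimal sketch of the manual workaround this proposal would remove, assuming the wandb backend (the project name and run ID below are placeholders, and resume="allow" is forwarded to wandb.init through the logger's extra kwargs):

from pytorch_lightning import Trainer
from pytorch_lightning.loggers import WandbLogger

# First run: note the generated run ID (e.g. wandb_logger.version) and keep it somewhere yourself.
# Resumed run: pass that ID back by hand so wandb continues the same run instead of creating a new one.
wandb_logger = WandbLogger(project="my-project", id="saved-run-id", resume="allow")
trainer = Trainer(logger=wandb_logger, resume_from_checkpoint="last.ckpt")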

This was recently mentioned in #4935 and it seems to me like a very useful feature.

Solution

So my suggestion is the following:
Existing loggers could accept a parameter like

WandbLogger(store_id_in_ckpt=True)

which would make them attach the experiment ID to the checkpoint.
Alternatively, all loggers could always attach their ID without an extra flag.

Then, when we resume an experiment from a checkpoint, the trainer could accept a parameter like resume_logger_experiment

logger = WandbLogger()
trainer = Trainer(resume_from_checkpoint="last.ckpt", logger=logger, resume_logger_experiment=True)

which would make the trainer set the logger's experiment ID from the checkpoint.

I've so far only used wandb, but I assume other loggers like Neptune or TensorBoard also have an equivalent "id" parameter that enables them to resume an experiment? If so, this change should affect every one of them.

cc @Borda @awaelchli @edward-io @ananthsub @rohitgr7 @kamil-kaczmarek @Raalsky @Blaizzy @ninginthecloud

ashleve added the feature and help wanted labels Jan 3, 2021
tchaton (Contributor) commented Jan 4, 2021

Hey there,

It is a good idea. Would you mind making a PR for it?

Best regards,
T.C

tchaton added this to the 1.2 milestone Jan 4, 2021
tchaton added the priority: 0 (High priority task) label Jan 4, 2021
ashleve (Contributor, Author) commented Jan 4, 2021

Hey @tchaton.
Sure :)
I can try making a PR, but I haven't made any contributions here before, so let me know if my thinking is correct:

  • I don't see any hooks in LightningLoggerBase that would allow for state saving and restoring (like on_save_checkpoint in the base Callback class). I assume we should first implement such hooks somehow? A rough sketch of what I mean is after this list.
  • It seems that CheckpointConnector currently implements state restoration of callbacks, schedulers, optimizers, etc., but not loggers. I assume we can simply extend this functionality to loggers as well?
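
A minimal sketch of the kind of hook pair I have in mind, mirroring what the Callback class already offers (hypothetical names, not existing Lightning API):

from typing import Any, Dict

class LightningLoggerBase:
    # Hypothetical hook: return logger state (e.g. the experiment/run ID) to store in the checkpoint.
    def on_save_checkpoint(self) -> Dict[str, Any]:
        return {}

    # Hypothetical hook: restore logger state from the checkpoint (e.g. re-attach to the previous run).
    def on_load_checkpoint(self, state: Dict[str, Any]) -> None:
        pass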

edenlightning modified the milestones: 1.2, 1.3 Feb 9, 2021
awaelchli self-assigned this Feb 14, 2021
awaelchli (Contributor) commented

@hobogalaxy

Are you still interested? I can help too.

I don't see any hooks in LightningLoggerBase that would allow for state saving and restoring (like on_save_checkpoint in the base Callback class). I assume we should first implement such hooks somehow?

There are no such hooks because loggers are not callbacks.
But of course we could think of a way to restore loggers. This will look very different from logger to logger.

So I suggest the following:

  1. One PR that implements dumping a state_dict of some sort for loggers that goes into the checkpoint dict. Implement that first only for the TensorBoardLogger (no restore). A rough sketch of what this could look like is after this list.
  2. A PR that adds the functionality to restore the TensorBoardLogger from the state dict out of a checkpoint.
  3. Then follow up with other PRs to add this to the other loggers as well.
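
A minimal sketch of step 1, assuming a hypothetical state_dict method on the logger and a hook point where the checkpoint dict is assembled (all names here are illustrative, not existing Lightning API):

from typing import Any, Dict

class TensorBoardLoggerSketch:
    # Illustrative stand-in for the real TensorBoardLogger.
    def __init__(self, version: int) -> None:
        self.version = version  # the piece of state we want to survive a resume

    def state_dict(self) -> Dict[str, Any]:
        # Only small, serializable metadata should end up in the checkpoint.
        return {"version": self.version}

def dump_checkpoint_sketch(logger: TensorBoardLoggerSketch) -> Dict[str, Any]:
    # Where the trainer assembles the checkpoint dict, it would additionally store the logger state.
    checkpoint: Dict[str, Any] = {"epoch": 0, "state_dict": {}}  # placeholder contents
    checkpoint["logger"] = logger.state_dict()
    return checkpoint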

ashleve (Contributor, Author) commented Mar 3, 2021

@awaelchli Thanks for the suggestions. I will try to make a PR this week.

edenlightning removed the priority: 0 label Apr 27, 2021
edenlightning modified the milestones: v1.3, v1.4 Apr 27, 2021
Queuecumber (Contributor) commented

Pretty sure this is (sort of) a duplicate of #6205 and #7104.
Those dealt with HPC checkpoints, but it should really apply to both HPC checkpoints and regular checkpoints.

edenlightning modified the milestones: v1.4, v1.5 Jul 6, 2021
awaelchli modified the milestones: v1.5, v1.6 Nov 4, 2021
carmocca modified the milestones: 1.6, None Feb 1, 2022
carmocca added the checkpointing and logger labels Feb 1, 2022
yuvalkirstain commented

Hey, any news about this? That's a very useful feature.

awaelchli (Contributor) commented

Hello, the status is that we are currently investigating adding a consistent interface for stateful components, #11429. Loggers would also fall into this category, imo, and could inherit from it too. I believe this issue can move forward once we have that in place, which is likely for 1.7. A rough sketch of what such a stateful interface could look like is below.
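
A minimal sketch of a stateful-component interface (illustrative only, not the actual #11429 API):

from typing import Any, Dict, Protocol

class Stateful(Protocol):
    # Illustrative protocol: any component implementing these two methods could be
    # checkpointed and restored by the trainer, loggers included.
    def state_dict(self) -> Dict[str, Any]:
        ...

    def load_state_dict(self, state_dict: Dict[str, Any]) -> None:
        ...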

awaelchli modified the milestones: future, 1.7 Feb 10, 2022
borisdayma (Contributor) commented

Just for info: you can easily find out which run created a specific artifact, or which runs used it, with artifact.logged_by() and artifact.used_by().
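
For example, something along these lines with the wandb public API (the artifact path below is a placeholder):

import wandb

api = wandb.Api()
artifact = api.artifact("my-entity/my-project/model-checkpoint:latest")  # placeholder path

producing_run = artifact.logged_by()  # the run that logged (created) this artifact
consuming_runs = artifact.used_by()   # runs that used this artifact as input
print(producing_run.id, [run.id for run in consuming_runs])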
