Store logger experiment id in checkpoint to enable correct resuming of experiments #5342
Comments
Hey there, it is a good idea. Would you mind making a PR for it? Best regards,
Hey @tchaton.
@hobogalaxy Are you still interested? I can help too.
[…] because loggers are not callbacks. So I suggest the following: […]
@awaelchli Thanks for the suggestions. I will try to make a PR this week.
Hey, any news about this? That's a very useful feature.
Hello, the status is that we are currently investigating adding a consistent interface for stateful components, #11429. Loggers would also fall into this category imo and could inherit from it too. I believe this issue can move forward once we have that in place, which is likely for 1.7.
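For illustration, and assuming the stateful-component interface from #11429 ends up looking like the usual `state_dict`/`load_state_dict` pair (that design is still being discussed, so all names below are hypothetical), a logger could expose its run id as checkpointable state roughly like this:

```python
from typing import Any, Dict, Optional


class StatefulLogger:
    """Illustrative only: a logger exposing its experiment id as checkpointable state."""

    def __init__(self) -> None:
        self._experiment_id: Optional[str] = None  # set once the backend run is created

    def state_dict(self) -> Dict[str, Any]:
        # The trainer would collect this dict into the checkpoint automatically.
        return {"experiment_id": self._experiment_id}

    def load_state_dict(self, state_dict: Dict[str, Any]) -> None:
        # On resume, the trainer would hand the saved state back to the logger,
        # which could then re-attach to the original run instead of creating a new one.
        self._experiment_id = state_dict["experiment_id"]
```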
Just for info, you can easily know which run created or used a specific artifact with […]
🚀 Feature
Give each logger the possibility to attach custom data to the checkpoint, such as the id of the experiment defined by the logger.
Motivation
Currently, if you use loggers like Wandb, Comet or Neptune, you have to manually restore the correct experiment id when resuming from a checkpoint in order to continue logging to the previous experiment instead of creating a new experiment in the logger.
It would be better if the id of the experiment/run could be attached by the logger to every checkpoint, and that id could then be automatically passed back to the logger when resuming from those checkpoints.
This was recently mentioned in #4935 and it seems to me like a very useful feature.
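For context, this is roughly what the manual workaround looks like today with the WandbLogger; the project name, run id and checkpoint path are placeholders you have to track yourself (on newer Lightning versions the checkpoint path goes to `trainer.fit(ckpt_path=...)` instead):

```python
from pytorch_lightning import Trainer
from pytorch_lightning.loggers import WandbLogger

# The run id ("1a2b3c4d" here is a placeholder) has to be noted down from the
# first run and passed back in by hand, otherwise wandb starts a brand-new run.
wandb_logger = WandbLogger(
    project="my-project",   # placeholder project name
    id="1a2b3c4d",          # experiment id restored manually
    resume="allow",         # extra kwargs are forwarded to wandb.init
)

trainer = Trainer(
    logger=wandb_logger,
    resume_from_checkpoint="checkpoints/last.ckpt",  # placeholder path
)
```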
Solution
So my suggestion is the following:
Existing loggers could accept a parameter like […] which would make them attach their experiment id to the checkpoint. Or, alternatively, all loggers could always attach their id without it having to be specified.
Then, when we resume an experiment from a checkpoint, the Trainer could accept a parameter like `resume_logger_experiment` which would make the Trainer restore the logger's experiment id.
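A rough usage sketch of the idea above. The snippet is not runnable today, and both flag names are only illustrative (in particular, the exact spelling of the logger-side flag is not fixed by this proposal):

```python
from pytorch_lightning import Trainer
from pytorch_lightning.loggers import WandbLogger

# Hypothetical logger-side flag: ask the logger to write its experiment id
# into every checkpoint produced during this run.
logger = WandbLogger(project="my-project", save_experiment_id_in_checkpoint=True)

# Proposed Trainer argument from this issue: read the id back from the
# checkpoint and re-attach the logger to the original run on resume.
trainer = Trainer(
    logger=logger,
    resume_from_checkpoint="checkpoints/last.ckpt",  # placeholder path
    resume_logger_experiment=True,
)
```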
I've so far only used wandb, but I assume other loggers like Neptune or TensorBoard also have an equivalent "id" parameter that enables them to resume an experiment? If so, this change should affect every one of them.
cc @Borda @awaelchli @edward-io @ananthsub @rohitgr7 @kamil-kaczmarek @Raalsky @Blaizzy @ninginthecloud