
Resuming from a specific run #3

Closed

hedjm opened this issue Mar 6, 2021 · 4 comments

hedjm commented Mar 6, 2021

Thank you very much for this template. How can we restore the weights of a specific run using wandb? Thanks!

Flegyas (Member) commented Mar 6, 2021

Thank you for using it!

Regarding the issue, we are still figuring out a clean way to do it. It will be implemented soon!
In the meantime, you can use the Trainer's resume_from_checkpoint argument: https://pytorch-lightning.readthedocs.io/en/1.1.0/trainer.html?highlight=resume#resume-from-checkpoint
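For reference, a minimal sketch of what that looks like (the checkpoint path is a placeholder for one saved by a previous run; model and datamodule are assumed to be defined elsewhere in your script):

    import pytorch_lightning as pl

    # Resume Lightning training from a previously saved checkpoint.
    # "path/to/last.ckpt" is a placeholder for the checkpoint an earlier run produced.
    trainer = pl.Trainer(resume_from_checkpoint="path/to/last.ckpt")
    trainer.fit(model, datamodule=datamodule)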

ashleve commented Mar 6, 2021

Hey @hedjm,
If you mean to resume training both for Lightning and for the WandbLogger, then apart from setting the resume_from_checkpoint parameter on the Trainer, you also need to pass the id of the previous run to WandbLogger on initialization (e.g. WandbLogger(id="1cxvmnfn")). I'm currently working on implementing automatic resuming of loggers when loading a checkpoint, so you won't need to pass the id in the future (see this issue), but it won't be out until Lightning 1.3 is released.
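Roughly, a minimal sketch of resuming both together (the run id "1cxvmnfn" and the checkpoint path are placeholders; model and datamodule come from your own script):

    import pytorch_lightning as pl
    from pytorch_lightning.loggers import WandbLogger

    # Reuse the previous wandb run by passing its id, and resume Lightning
    # from the checkpoint that run produced. Both values are placeholders.
    wandb_logger = WandbLogger(project="my-project", id="1cxvmnfn")
    trainer = pl.Trainer(
        logger=wandb_logger,
        resume_from_checkpoint="path/to/last.ckpt",
    )
    trainer.fit(model, datamodule=datamodule)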

hedjm (Author) commented Mar 7, 2021

Thank you @Flegyas @hobogalaxy, I changed the code a little bit:

    # Hydra run directory, nested by dataset name / wandb project / wandb run name
    hydra_dir = (
        Path(HydraConfig.get().run.dir)
        / cfg.data.datamodule.datasets.name
        / cfg.logging.wandb.project
        / cfg.logging.wandb.name
    )
    os.makedirs(hydra_dir, exist_ok=True)

and

if "wandb" in cfg.logging:
        hydra.utils.log.info(f"Instantiating <WandbLogger>")
        wandb_config = cfg.logging.wandb
        wandb_logger = WandbLogger(name=wandb_config.name,
                                   project=wandb_config.project,
                                   entity=wandb_config.entity,
                                   tags=cfg.core.tags,
                                   id=wandb_config.name,
                                   version=wandb_config.name,
                                   save_dir=hydra_dir,
                                   log_model=True)
        hydra.utils.log.info(f"W&B is now watching <{wandb_config.watch.log}>!")
        wandb_logger.watch(model, log=wandb_config.watch.log, log_freq=wandb_config.watch.log_freq)

    # The Lightning core, the Trainer
    hydra.utils.log.info(f"Instantiating the Trainer")
    resume = cfg.train.resume if cfg.train.resume != '' else None

    trainer = pl.Trainer(
        default_root_dir=hydra_dir,
        resume_from_checkpoint=resume,
        logger=wandb_logger,
        callbacks=callbacks,
        deterministic=cfg.train.deterministic,
        val_check_interval=cfg.logging.val_check_interval,
        progress_bar_refresh_rate=cfg.logging.progress_bar_refresh_rate,
        **cfg.train.pl_trainer,
    )

lucmos (Member) commented Mar 26, 2021

Hello @hedjm, the recent changes should make this straightforward:

  • You can resume training (in a new wandb run, at the moment) by adding resume_from_checkpoint=path_to_checkpoint to the pl_trainer training conf.

  • You can restore a checkpoint to use your model at inference time with Model.load_from_checkpoint(...), as done in the streamlit app here (see also the sketch below).
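For illustration, a rough sketch of both options (MyModel, the checkpoint path, and batch are placeholders, not names from the template):

    import torch
    import pytorch_lightning as pl

    # Sketch 1: resume training. The resume_from_checkpoint entry in the pl_trainer conf
    # ends up as a keyword argument of the Trainer, i.e. roughly:
    trainer = pl.Trainer(resume_from_checkpoint="path/to/last.ckpt")

    # Sketch 2: restore a trained model for inference only.
    # MyModel stands in for your LightningModule subclass; the path is a placeholder.
    model = MyModel.load_from_checkpoint("path/to/last.ckpt")
    model.eval()
    with torch.no_grad():
        prediction = model(batch)  # batch is a placeholder for your input tensor(s)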

Happy to re-open if there are other problems

lucmos closed this as completed Mar 26, 2021