
Resuming from a specific run #3

Closed

hedjm opened this issue Mar 6, 2021 · 4 comments

hedjm commented Mar 6, 2021

Thank you very much for this template. How can we restore the weights of a specific run using wandb? Thanks!

Flegyas (Member) commented Mar 6, 2021

Thank you for using it!

Regarding the issue, we are still figuring out a clean way to do it. It will be implemented soon!
In the meantime, you can use the Trainer's resume_from_checkpoint argument: https://pytorch-lightning.readthedocs.io/en/1.1.0/trainer.html?highlight=resume#resume-from-checkpoint
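For reference, a minimal sketch of what that looks like (the checkpoint path is a placeholder for one saved by a previous run; model and datamodule are assumed to be defined elsewhere in your script):

    import pytorch_lightning as pl

    # Resume Lightning training from a previously saved checkpoint.
    # "path/to/last.ckpt" is a placeholder for the checkpoint an earlier run produced.
    trainer = pl.Trainer(resume_from_checkpoint="path/to/last.ckpt")
    trainer.fit(model, datamodule=datamodule)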

ashleve commented Mar 6, 2021

Hey @hedjm,
If you mean to resume training both for Lightning and for the WandbLogger, then apart from setting the resume_from_checkpoint parameter on the Trainer, you also need to pass the id of the previous run to WandbLogger on initialization (e.g. WandbLogger(id="1cxvmnfn")). I'm currently working on implementing automatic resuming of loggers when loading a checkpoint, so you won't need to pass the id in the future (see this issue), but it won't be out until Lightning 1.3 is released.
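Roughly, a minimal sketch of resuming both together (the run id "1cxvmnfn" and the checkpoint path are placeholders; model and datamodule come from your own script):

    import pytorch_lightning as pl
    from pytorch_lightning.loggers import WandbLogger

    # Reuse the previous wandb run by passing its id, and resume Lightning
    # from the checkpoint that run produced. Both values are placeholders.
    wandb_logger = WandbLogger(project="my-project", id="1cxvmnfn")
    trainer = pl.Trainer(
        logger=wandb_logger,
        resume_from_checkpoint="path/to/last.ckpt",
    )
    trainer.fit(model, datamodule=datamodule)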

hedjm (Author) commented Mar 7, 2021

Thank you @Flegyas @hobogalaxy, I changed the code a little bit:

    # Hydra run directory, nested by dataset name / wandb project / wandb run name
    hydra_dir = (
        Path(HydraConfig.get().run.dir)
        / cfg.data.datamodule.datasets.name
        / cfg.logging.wandb.project
        / cfg.logging.wandb.name
    )
    os.makedirs(hydra_dir, exist_ok=True)

and

if "wandb" in cfg.logging:
        hydra.utils.log.info(f"Instantiating <WandbLogger>")
        wandb_config = cfg.logging.wandb
        wandb_logger = WandbLogger(name=wandb_config.name,
                                   project=wandb_config.project,
                                   entity=wandb_config.entity,
                                   tags=cfg.core.tags,
                                   id=wandb_config.name,
                                   version=wandb_config.name,
                                   save_dir=hydra_dir,
                                   log_model=True)
        hydra.utils.log.info(f"W&B is now watching <{wandb_config.watch.log}>!")
        wandb_logger.watch(model, log=wandb_config.watch.log, log_freq=wandb_config.watch.log_freq)

    # The Lightning core, the Trainer
    hydra.utils.log.info(f"Instantiating the Trainer")
    resume = cfg.train.resume if cfg.train.resume != '' else None

    trainer = pl.Trainer(
        default_root_dir=hydra_dir,
        resume_from_checkpoint=resume,
        logger=wandb_logger,
        callbacks=callbacks,
        deterministic=cfg.train.deterministic,
        val_check_interval=cfg.logging.val_check_interval,
        progress_bar_refresh_rate=cfg.logging.progress_bar_refresh_rate,
        **cfg.train.pl_trainer,
    )

lucmos (Member) commented Mar 26, 2021

Hello @hedjm, the recent changes should make this straightforward:

  • You can resume training (in a new wandb run, at the moment) by adding resume_from_checkpoint=path_to_checkpoint to the pl_trainer training conf.

  • You can restore a checkpoint to use your model at inference time with Model.load_from_checkpoint(...), as done in the streamlit app here (see also the sketch below).
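For illustration, a rough sketch of both options (MyModel, the checkpoint path, and batch are placeholders, not names from the template):

    import torch
    import pytorch_lightning as pl

    # Sketch 1: resume training. The resume_from_checkpoint entry in the pl_trainer conf
    # ends up as a keyword argument of the Trainer, i.e. roughly:
    trainer = pl.Trainer(resume_from_checkpoint="path/to/last.ckpt")

    # Sketch 2: restore a trained model for inference only.
    # MyModel stands in for your LightningModule subclass; the path is a placeholder.
    model = MyModel.load_from_checkpoint("path/to/last.ckpt")
    model.eval()
    with torch.no_grad():
        prediction = model(batch)  # batch is a placeholder for your input tensor(s)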

Happy to re-open if there are other problems

lucmos closed this as completed Mar 26, 2021