
Add support for setting the current epoch for resumption of a previous run #6687

Closed
matt3o wants to merge 2 commits from the Current_epoch_support branch

Conversation

@matt3o (Contributor) commented Jul 2, 2023

Description

Adds support for resuming a previous run of a trainer. The only change is to add a current_epoch parameter to Workflow and Trainer (is there any other class that will need this change?).
Maybe there is a better way to resume a previous run, but as of right now I load the network, optimizer and lr_scheduler from the previous run. The last thing missing is to set the current epoch. E.g. my run stopped at epoch 153 and I want to resume from there. If I do that and tell the network to run to epoch 200, my code sets max_epochs for the trainer to 200 - 153 = 47 epochs. However, the trainer does not know about the epochs that have already run, so it prints something like Engine run resuming from iteration 0, epoch 0 until 47 epochs.
With this commit, the output would be Engine run resuming from iteration 0, epoch 153 until 200 epochs. When setting the trainer up, only the parameter current_epoch needs to be added.
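
A minimal sketch of the intended setup (current_epoch is the parameter proposed in this PR and is not part of the released Trainer API; the other arguments are abbreviated placeholders):

trainer = SupervisedTrainer(
    device=device,
    max_epochs=200,                 # the target epoch, unchanged
    train_data_loader=train_loader,
    network=net,
    optimizer=opt,
    loss_function=loss,
    current_epoch=153,              # proposed: 153 epochs have already run
)
trainer.run()  # per this PR, would log: Engine run resuming from iteration 0, epoch 153 until 200 epochs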

Types of changes

  • Non-breaking change (fix or new feature that would not break existing functionality).
  • Breaking change (fix or new feature that would cause existing functionality to change).
  • New tests added to cover the changes.
  • Integration tests passed locally by running ./runtests.sh -f -u --net --coverage.
  • Quick tests passed locally by running ./runtests.sh --quick --unittests --disttests.
  • In-line docstrings updated.
  • Documentation updated, tested make html command in the docs/ folder.

Signed-off-by: Matthias Hadlich <matthiashadlich@posteo.de>
@wyli requested review from ericspod, Nic-Ma and KumoLiu July 2, 2023 21:20

@KumoLiu (Contributor) commented Jul 3, 2023

Hi @matt3o, thanks for your PR.
I just took a look at this issue and found that when using

to_save = {'trainer': trainer, 'model': model, 'optimizer': opt}
handler = CheckpointLoader(load_path=ck, load_dict=to_save)
handler.attach(trainer)
trainer.run()

the output is something like the following, so do you mean the first line of the log is wrong?

INFO:ignite.engine.engine.SupervisedTrainer:Engine run resuming from iteration 0, epoch 0 until 5 epochs
INFO:ignite.engine.engine.SupervisedTrainer:Restored all variables from /workspace/Code/tutorials/2d_classification/runs/checkpoint_epoch=1.pt
INFO:ignite.engine.engine.SupervisedTrainer:Epoch: 2/5, Iter: 1/93 -- label: 4.0000 loss: 0.3763 
INFO:ignite.engine.engine.SupervisedTrainer:Epoch: 2/5, Iter: 2/93 -- label: 3.0000 loss: 0.3782 
INFO:ignite.engine.engine.SupervisedTrainer:Epoch: 2/5, Iter: 3/93 -- label: 0.0000 loss: 0.3938 

If that's right, the likely cause is an ordering problem: handler.attach(trainer) only restores the checkpoint after the engine has started and has already printed its resume message. You can get the correct output log by changing handler.attach(trainer) to handler(trainer) to call the CheckpointLoader directly.
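
That is, roughly (reusing the to_save dict from above):

handler = CheckpointLoader(load_path=ck, load_dict=to_save)
handler(trainer)  # restores the state now, before run() prints its resume message
trainer.run()

Then the output may be right: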

INFO:root:Restored all variables from /workspace/Code/tutorials/2d_classification/runs/checkpoint_epoch=1.pt
INFO:ignite.engine.engine.SupervisedTrainer:Engine run resuming from iteration 93, epoch 1 until 5 epochs
INFO:ignite.engine.engine.SupervisedTrainer:Epoch: 2/5, Iter: 1/93 -- label: 4.0000 loss: 0.4107 
INFO:ignite.engine.engine.SupervisedTrainer:Epoch: 2/5, Iter: 2/93 -- label: 1.0000 loss: 0.3842 
INFO:ignite.engine.engine.SupervisedTrainer:Epoch: 2/5, Iter: 3/93 -- label: 4.0000 loss: 0.3739 
INFO:ignite.engine.engine.SupervisedTrainer:Epoch: 2/5, Iter: 4/93 -- label: 0.0000 loss: 0.3759 

Correct me if I've misunderstood.
Thanks!

@matt3o (Contributor, Author) commented Jul 3, 2023

Hi @KumoLiu! I could not get the CheckpointLoader to work initially, so I resorted to the torch built-in tools for resuming. That is certainly the first reason why this doesn't work.
However, I just switched my code to use the CheckpointLoader as you wrote above. It correctly prints Restored all variables from ../data/100_checkpoint.pt and the Dice score is very good, so the network loaded as well.
However, the epochs are incorrect: the network already ran 153 epochs, but the training nevertheless begins at epoch 0. Is there any other parameter I have to set? This might also have been the source of confusion that made me go back to torch, since I thought it wouldn't work.

[2023-07-03 12:46:45.535][INFO] run - Engine run resuming from iteration 0, epoch 0 until 10 epochs
...
[2023-07-03 12:46:57.076][INFO] _default_iteration_print - Epoch: 1/10, Iter: 1/10 -- train_loss: 0.0234 

@KumoLiu (Contributor) commented Jul 3, 2023

Hi @matt3o, here is my demo. Note that to_save includes the trainer itself; restoring the engine state is what brings the epoch and iteration counters back.

trainer = SupervisedTrainer(...)

# include the trainer itself so the engine state (epoch, iteration) is checkpointed too
to_save = {'trainer': trainer, 'model': model, 'optimizer': opt}
save_handler = CheckpointSaver(save_dir='./runs/', save_dict=to_save, save_interval=1)
save_handler.attach(trainer)  # with save_interval=1, a checkpoint is saved every epoch

load_handler = CheckpointLoader(load_path=ck_path, load_dict=to_save)
load_handler(trainer)  # called directly, so the state is restored before trainer.run()

trainer.run()

@matt3o (Contributor, Author) commented Jul 3, 2023

@KumoLiu Ah, now I see the difference. I have been using the default dict from the CheckpointLoader docs: {'network': net, 'optimizer': optimizer, 'lr_scheduler': lr_scheduler}.
I searched for tutorials on Google but found nothing that actually explains how to set the dict correctly, or even which options are available. The CheckpointLoader docs do not hint at which objects can be saved, and the Ignite documentation was not helpful in this context either.

For my specific context I will now try out the following:
save {'trainer': trainer, 'evaluator': evaluator, 'net': network, 'opt': optimizer, 'lr': lr_scheduler}
and then call handler(evaluator) and handler(trainer), as sketched below.
Would that be correct, or is loading the trainer already enough? Also, I have to initialize all of the objects in the dict first, right? So e.g. optimizer cannot be None if it is to be loaded from the dict?
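
An untested sketch of that plan (ck_path is a placeholder for my checkpoint file):

to_save = {'trainer': trainer, 'evaluator': evaluator, 'net': network, 'opt': optimizer, 'lr': lr_scheduler}
handler = CheckpointLoader(load_path=ck_path, load_dict=to_save)
handler(evaluator)
handler(trainer)
trainer.run()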

Thanks a lot, that info helps a lot! I think we can close this PR then; it is not really useful anymore.

@KumoLiu (Contributor) commented Jul 3, 2023

Hi @matt3o, yes, and you can also refer to the Ignite doc:
https://pytorch.org/ignite/engine.html#resuming-the-training
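
Roughly, the pattern from that section restores the engine, model and optimizer state from a saved checkpoint dict before calling run() (checkpoint_file is a placeholder; the keys must match those used when saving):

import torch
from ignite.handlers import Checkpoint

to_load = {'trainer': trainer, 'model': model, 'optimizer': optimizer}
checkpoint = torch.load(checkpoint_file)
Checkpoint.load_objects(to_load=to_load, checkpoint=checkpoint)
trainer.run()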
(I will close this PR then; if you have any other issues, feel free to create another issue or discussion.)

@KumoLiu closed this Jul 3, 2023
@matt3o deleted the Current_epoch_support branch July 3, 2023 12:28
wyli pushed a commit that referenced this pull request Aug 15, 2023
Related to #6687, this adds some documentation on how to resume a training run. I did not find this documented anywhere in MONAI, and it's rather important in my opinion. Please briefly check the doc syntax, as I don't have any experience with it. The code itself is from my own repository and is working.

What is potentially still missing here is some information on when to use the CheckpointLoader. As far as I understood the docs, it is mostly meant for resuming interrupted training runs. But what about 1) pure inference runs, where the state of the trainer does not matter, only the evaluator, and 2) resuming the training at epoch 200 but with a learning-rate reset (e.g. DeepEdit: train without clicks first for 200 epochs, then 200 epochs with clicks on top)? Case 1) works well in my experience, and case 2) does as well if you modify the load dict to exclude e.g. the learning-rate scheduler (see the sketches below).
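
Untested sketches of both cases (the key names follow the save dict used earlier in this thread; ck_path is a placeholder):

# 1) pure inference: restore only the network, calling the loader on the evaluator
CheckpointLoader(load_path=ck_path, load_dict={'net': network})(evaluator)

# 2) resume training with a fresh learning-rate schedule: omit the 'lr' key
to_load = {'trainer': trainer, 'net': network, 'opt': optimizer}
CheckpointLoader(load_path=ck_path, load_dict=to_load)(trainer)
trainer.run()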


### Types of changes
- [x] Non-breaking change (fix or new feature that would not break existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Integration tests passed locally by running `./runtests.sh -f -u --net --coverage`.
- [ ] Quick tests passed locally by running `./runtests.sh --quick --unittests --disttests`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated, tested `make html` command in the `docs/` folder.

---------

Signed-off-by: Matthias Hadlich <matthiashadlich@posteo.de>