
Add support for setting the current epoch for resumption of a previous run #6687

Closed
matt3o wants to merge 2 commits from the Current_epoch_support branch

Conversation

@matt3o (Contributor) commented Jul 2, 2023

Description

Adds support for resuming a previous run of a trainer. The only change is to add a current_epoch parameter to Workflow and Trainer (is there any other class that will need this change?).
Maybe there is a better way to resume a previous run, but as of right now I load the network, optimizer and lr_scheduler from the previous run. The last thing missing is to set the current epoch. E.g. my run stopped at epoch 153 and I want to resume from there. If I do that and tell the network to run to epoch 200, my code sets max_epochs for the trainer to 200 - 153 = 47 epochs. However, the trainer does not know about the epochs that have already run, so it prints something like Engine run resuming from iteration 0, epoch 0 until 47 epochs.
With this commit, the output would be Engine run resuming from iteration 0, epoch 153 until 200 epochs. When setting the trainer up, only the parameter current_epoch needs to be added.
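
A minimal sketch of the intended setup (current_epoch is the parameter proposed in this PR and is not part of the released Trainer API; the other arguments are abbreviated placeholders):

trainer = SupervisedTrainer(
    device=device,
    max_epochs=200,                 # the target epoch, unchanged
    train_data_loader=train_loader,
    network=net,
    optimizer=opt,
    loss_function=loss,
    current_epoch=153,              # proposed: 153 epochs have already run
)
trainer.run()  # per this PR, would log: Engine run resuming from iteration 0, epoch 153 until 200 epochs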

Types of changes

  • Non-breaking change (fix or new feature that would not break existing functionality).
  • Breaking change (fix or new feature that would cause existing functionality to change).
  • New tests added to cover the changes.
  • Integration tests passed locally by running ./runtests.sh -f -u --net --coverage.
  • Quick tests passed locally by running ./runtests.sh --quick --unittests --disttests.
  • In-line docstrings updated.
  • Documentation updated, tested make html command in the docs/ folder.

Signed-off-by: Matthias Hadlich <matthiashadlich@posteo.de>
@wyli requested review from ericspod, Nic-Ma and KumoLiu July 2, 2023 21:20

@KumoLiu (Contributor) commented Jul 3, 2023

Hi @matt3o, thanks for your PR.
I just took a look at this issue and found that when using

to_save = {'trainer': trainer, 'model': model, 'optimizer': opt}
handler = CheckpointLoader(load_path=ck, load_dict=to_save)
handler.attach(trainer)
trainer.run()

the output is something like the following, so do you mean the first line of the log is wrong?

INFO:ignite.engine.engine.SupervisedTrainer:Engine run resuming from iteration 0, epoch 0 until 5 epochs
INFO:ignite.engine.engine.SupervisedTrainer:Restored all variables from /workspace/Code/tutorials/2d_classification/runs/checkpoint_epoch=1.pt
INFO:ignite.engine.engine.SupervisedTrainer:Epoch: 2/5, Iter: 1/93 -- label: 4.0000 loss: 0.3763 
INFO:ignite.engine.engine.SupervisedTrainer:Epoch: 2/5, Iter: 2/93 -- label: 3.0000 loss: 0.3782 
INFO:ignite.engine.engine.SupervisedTrainer:Epoch: 2/5, Iter: 3/93 -- label: 0.0000 loss: 0.3938 

If that's right, the likely cause is an ordering problem: handler.attach(trainer) only restores the checkpoint after the engine has started and has already printed its resume message. You can get the correct output log by changing handler.attach(trainer) to handler(trainer) to call the CheckpointLoader directly.
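
That is, roughly (reusing the to_save dict from above):

handler = CheckpointLoader(load_path=ck, load_dict=to_save)
handler(trainer)  # restores the state now, before run() prints its resume message
trainer.run()

Then the output may be right: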

INFO:root:Restored all variables from /workspace/Code/tutorials/2d_classification/runs/checkpoint_epoch=1.pt
INFO:ignite.engine.engine.SupervisedTrainer:Engine run resuming from iteration 93, epoch 1 until 5 epochs
INFO:ignite.engine.engine.SupervisedTrainer:Epoch: 2/5, Iter: 1/93 -- label: 4.0000 loss: 0.4107 
INFO:ignite.engine.engine.SupervisedTrainer:Epoch: 2/5, Iter: 2/93 -- label: 1.0000 loss: 0.3842 
INFO:ignite.engine.engine.SupervisedTrainer:Epoch: 2/5, Iter: 3/93 -- label: 4.0000 loss: 0.3739 
INFO:ignite.engine.engine.SupervisedTrainer:Epoch: 2/5, Iter: 4/93 -- label: 0.0000 loss: 0.3759 

Correct me if I've misunderstood.
Thanks!

@matt3o (Contributor, Author) commented Jul 3, 2023

Hi @KumoLiu! I could not get the CheckpointLoader to work initially, so I resorted to the torch built-in tools for resuming. That is certainly the first reason why this doesn't work.
However, I just switched my code to use the CheckpointLoader as you wrote above. It correctly prints Restored all variables from ../data/100_checkpoint.pt and the Dice score is very good, so the network loaded as well.
However, the epochs are incorrect: the network already ran 153 epochs, but the training nevertheless begins at epoch 0. Is there any other parameter I have to set? This might also have been the source of confusion that made me go back to torch, since I thought it wouldn't work.

[2023-07-03 12:46:45.535][INFO] run - Engine run resuming from iteration 0, epoch 0 until 10 epochs
...
[2023-07-03 12:46:57.076][INFO] _default_iteration_print - Epoch: 1/10, Iter: 1/10 -- train_loss: 0.0234 

@KumoLiu (Contributor) commented Jul 3, 2023

Hi @matt3o, here is my demo. Note that to_save includes the trainer itself; restoring the engine state is what brings the epoch and iteration counters back.

trainer = SupervisedTrainer(...)

# include the trainer itself so the engine state (epoch, iteration) is checkpointed too
to_save = {'trainer': trainer, 'model': model, 'optimizer': opt}
save_handler = CheckpointSaver(save_dir='./runs/', save_dict=to_save, save_interval=1)
save_handler.attach(trainer)  # with save_interval=1, a checkpoint is saved every epoch

load_handler = CheckpointLoader(load_path=ck_path, load_dict=to_save)
load_handler(trainer)  # called directly, so the state is restored before trainer.run()

trainer.run()

@matt3o (Contributor, Author) commented Jul 3, 2023

@KumoLiu Ah, now I see the difference. I have been using the default dict from the CheckpointLoader docs: {'network': net, 'optimizer': optimizer, 'lr_scheduler': lr_scheduler}.
I searched for tutorials on Google but found nothing that actually explains how to set the dict correctly, or even which options are available. The CheckpointLoader docs do not hint at which objects can be saved, and the Ignite documentation was not helpful in this context either.

For my specific context I will now try out the following:
save {'trainer': trainer, 'evaluator': evaluator, 'net': network, 'opt': optimizer, 'lr': lr_scheduler}
and then call handler(evaluator) and handler(trainer), as sketched below.
Would that be correct, or is loading the trainer already enough? Also, I have to initialize all of the objects in the dict first, right? So e.g. optimizer cannot be None if it is to be loaded from the dict?
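
An untested sketch of that plan (ck_path is a placeholder for my checkpoint file):

to_save = {'trainer': trainer, 'evaluator': evaluator, 'net': network, 'opt': optimizer, 'lr': lr_scheduler}
handler = CheckpointLoader(load_path=ck_path, load_dict=to_save)
handler(evaluator)
handler(trainer)
trainer.run()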

Thanks a lot, that info helps a lot! I think we can close this PR then; it is not really useful anymore.

@KumoLiu (Contributor) commented Jul 3, 2023

Hi @matt3o, yes, and you can also refer to the Ignite doc:
https://pytorch.org/ignite/engine.html#resuming-the-training
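
Roughly, the pattern from that section restores the engine, model and optimizer state from a saved checkpoint dict before calling run() (checkpoint_file is a placeholder; the keys must match those used when saving):

import torch
from ignite.handlers import Checkpoint

to_load = {'trainer': trainer, 'model': model, 'optimizer': optimizer}
checkpoint = torch.load(checkpoint_file)
Checkpoint.load_objects(to_load=to_load, checkpoint=checkpoint)
trainer.run()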
(I will close this PR then; if you have any other issues, feel free to create another issue or discussion.)

@KumoLiu closed this Jul 3, 2023
@matt3o deleted the Current_epoch_support branch July 3, 2023 12:28
wyli pushed a commit that referenced this pull request Aug 15, 2023
Related to #6687, this adds some documentation on how to resume a training run. I did not find this documented anywhere in MONAI, and it's rather important in my opinion. Please briefly check the doc syntax, as I don't have any experience with it. The code itself is from my own repository and is working.

What is potentially still missing here is some information on when to use the CheckpointLoader. As far as I understood the docs, it is mostly meant for resuming interrupted training runs. But what about 1) pure inference runs, where the state of the trainer does not matter, only the evaluator, and 2) resuming the training at epoch 200 but with a learning-rate reset (e.g. DeepEdit: train without clicks first for 200 epochs, then 200 epochs with clicks on top)? Case 1) works well in my experience, and case 2) does as well if you modify the load dict to exclude e.g. the learning-rate scheduler (see the sketches below).
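
Untested sketches of both cases (the key names follow the save dict used earlier in this thread; ck_path is a placeholder):

# 1) pure inference: restore only the network, calling the loader on the evaluator
CheckpointLoader(load_path=ck_path, load_dict={'net': network})(evaluator)

# 2) resume training with a fresh learning-rate schedule: omit the 'lr' key
to_load = {'trainer': trainer, 'net': network, 'opt': optimizer}
CheckpointLoader(load_path=ck_path, load_dict=to_load)(trainer)
trainer.run()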


### Types of changes
- [x] Non-breaking change (fix or new feature that would not break existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Integration tests passed locally by running `./runtests.sh -f -u --net --coverage`.
- [ ] Quick tests passed locally by running `./runtests.sh --quick --unittests --disttests`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated, tested `make html` command in the `docs/` folder.

---------

Signed-off-by: Matthias Hadlich <matthiashadlich@posteo.de>