Add support for setting the current epoch for resumption of a previou… #6687
…s run Signed-off-by: Matthias Hadlich <matthiashadlich@posteo.de>
Hi @matt3o, thanks for your PR.
The output would be something like the below, so do you mean that the first line of the log is wrong?
If so, a possible cause is an ordering problem; you can get the correct log output by changing
Correct me if I have misunderstood.
Hi @KumoLiu! I could not get the CheckpointLoader to work initially, so I resorted to torch's built-in tools for resuming. That is certainly the first reason why this doesn't work.
Hi @matt3o, here is my demo.
@KumoLiu Ahh, now I see the difference. I have been using the default dict, which can also be found in the docs:
In terms of my specific context, I will now try out the following:
Thanks a lot, that info helps a lot! I think we can close this PR then, since it is not really useful.
Hi @matt3o, yes, you can also refer to the ignite documentation.
Related to #6687, this adds some documentation on how to resume a network. I did not find this documented anywhere in MONAI, and it's rather important imo. Please briefly check the doc syntax; I don't have any experience with it. The code itself is from my own repository and is working.

What is potentially still missing here is some information on when to use the Checkpoint Loader. As far as I understood the docs, it is mostly meant for resuming interrupted training runs. But what about:
1) pure inference runs, where the state of the trainer does not matter, only the evaluator;
2) resuming the training at epoch 200 but with a learning rate reset (e.g. DeepEdit: train without clicks first for 200 epochs, then 200 epochs with clicks on top).

Case 1 works well in my experience, and case 2 does as well if you modify the state_dict to exclude e.g. the learning rate scheduler.

### Description
A few sentences describing the changes proposed in this pull request.

### Types of changes
<!--- Put an `x` in all the boxes that apply, and remove the not applicable items -->
- [x] Non-breaking change (fix or new feature that would not break existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Integration tests passed locally by running `./runtests.sh -f -u --net --coverage`.
- [ ] Quick tests passed locally by running `./runtests.sh --quick --unittests --disttests`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated, tested `make html` command in the `docs/` folder.

---------
Signed-off-by: Matthias Hadlich <matthiashadlich@posteo.de>
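The learning-rate-reset case in the commit message above amounts to restoring only some entries of a saved checkpoint. A minimal sketch with plain dictionaries, assuming the checkpoint is a mapping from names to saved states; the key names `net`, `opt`, and `lr` and the helper `filter_checkpoint` are hypothetical and not part of MONAI or ignite:

```python
from typing import Any, Dict, Iterable


def filter_checkpoint(checkpoint: Dict[str, Any], exclude: Iterable[str]) -> Dict[str, Any]:
    """Return a copy of the checkpoint without the excluded entries."""
    excluded = set(exclude)
    return {key: value for key, value in checkpoint.items() if key not in excluded}


# Hypothetical checkpoint layout; real state dicts would hold tensors.
saved = {
    "net": {"weight": [0.1, 0.2]},
    "opt": {"momentum": 0.9},
    "lr": {"last_epoch": 200, "last_lr": [1e-5]},
}

# Restore network and optimizer, but let the scheduler start fresh.
to_restore = filter_checkpoint(saved, exclude=["lr"])
print(sorted(to_restore))
```

The original checkpoint is left untouched, so the same file could still be used for a full resume later.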
Description
Adds support for resuming a previous run of a trainer. The only change is a new current_epoch parameter on Workflow and Trainer (is there any other class that needs this change?).
There may be a better way to resume a previous run, but as of right now I load the network, optimizer, and lr_scheduler from the previous run. The last thing missing is setting the current epoch. For example, if my run stopped at epoch 153, I want to resume from there. If I do that and tell the network to run to epoch 200, my code passes 200 - 153 = 47 epochs as max_epochs to the trainer. However, the trainer does not know about the epochs that have already run, so it prints something like Engine run resuming from iteration 0, epoch 0 until 47 epochs.
With this commit, the output would be Engine run resuming from iteration 0, epoch 153 until 200 epochs. When setting the trainer up, only the current_epoch parameter needs to be added.
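The difference between the two ways of resuming can be illustrated with a toy model. This is only a sketch: `ToyTrainer` and `ToyState` are hypothetical stand-ins written for this example, not the MONAI `Trainer` or the ignite `Engine`, and the log text is abbreviated.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToyState:
    epoch: int = 0
    max_epochs: int = 0


class ToyTrainer:
    """Hypothetical stand-in for a trainer; not the MONAI or ignite API."""

    def __init__(self, max_epochs: int, current_epoch: int = 0) -> None:
        if not 0 <= current_epoch <= max_epochs:
            raise ValueError("current_epoch must lie between 0 and max_epochs")
        self.state = ToyState(epoch=current_epoch, max_epochs=max_epochs)
        self.log: List[str] = []

    def run(self) -> int:
        """Run the remaining epochs and return how many were executed."""
        start = self.state.epoch
        self.log.append(f"resuming from epoch {start} until {self.state.max_epochs} epochs")
        while self.state.epoch < self.state.max_epochs:
            self.state.epoch += 1  # one (empty) training epoch
        return self.state.epoch - start


# Without a starting epoch the caller has to shrink max_epochs by hand.
workaround = ToyTrainer(max_epochs=200 - 153)
# With a starting epoch the trainer keeps the original epoch numbering.
resumed = ToyTrainer(max_epochs=200, current_epoch=153)

epochs_workaround = workaround.run()
epochs_resumed = resumed.run()
print(workaround.log[0])
print(resumed.log[0])
```

Both trainers execute the same number of epochs; only the second one reports and ends on the epoch numbers of the original run, which matters for anything keyed on the epoch count, such as logging or epoch-based schedules.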
Types of changes
Integration tests: ./runtests.sh -f -u --net --coverage
Quick tests: ./runtests.sh --quick --unittests --disttests
Documentation: make html command in the docs/ folder.