
cfg.resume_from doesn't work as expected #292

Closed
antoniolanza1996 opened this issue Jun 17, 2021 · 6 comments

Labels: bug (Something isn't working), upstream

Comments

@antoniolanza1996 (Contributor)

Describe the bug

Using the tools/train.py script, I've trained DBNet_r50dcn for 50 epochs on my own dataset with SGD (default params here).

Below is the main information from the epoch_50.pth checkpoint:

model = torch.load('epoch_50.pth',map_location=torch.device('cpu'))
print(model['meta']['epoch']) # 50
print(model['meta']['iter']) # 950
print(model['optimizer']['param_groups'][0]['lr']) # 0.00020712311161269083

After that, I've run tools/train.py with cfg.resume_from = 'epoch_50.pth', but I've noticed 2 problems:

  1. The learning rate is not correctly restored.
     Instead of 0.00020712311161269083, the learning rate at epoch 51 is 6.725e-03 (which is actually not even the default value 7e-03 defined here).

  2. The checkpoints created during the resumed training (e.g. epoch_51.pth, epoch_52.pth, epoch_53.pth) still have epoch and iter equal to the values in epoch_50.pth.

For example, here is the main information from the epoch_51.pth checkpoint:

model = torch.load('epoch_51.pth',map_location=torch.device('cpu'))
print(model['meta']['epoch']) # 50
print(model['meta']['iter']) # 950
print(model['optimizer']['param_groups'][0]['lr']) # 0.006725485699476165

and from the epoch_52.pth checkpoint:

model = torch.load('epoch_52.pth',map_location=torch.device('cpu'))
print(model['meta']['epoch']) # 50
print(model['meta']['iter']) # 950
print(model['optimizer']['param_groups'][0]['lr']) # 0.0067199828609755315

Clearly, this becomes a problem if one wants to resume training again from one of these checkpoints.
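
A quick way to spot problem 2 on any saved checkpoint is to compare the epoch stored in meta with the epoch encoded in the filename. Below is a small sketch of such a check (the helper name and file paths are mine, just for illustration; they are not part of mmocr):

import re
import torch

def check_checkpoint_meta(path):
    # Compare the epoch stored in the checkpoint meta with the one in the filename.
    ckpt = torch.load(path, map_location=torch.device('cpu'))
    meta_epoch = ckpt['meta']['epoch']
    file_epoch = int(re.search(r'epoch_(\d+)\.pth', path).group(1))
    print(path, 'meta epoch:', meta_epoch, 'meta iter:', ckpt['meta']['iter'],
          'filename epoch:', file_epoch, 'match:', meta_epoch == file_epoch)

for p in ['epoch_50.pth', 'epoch_51.pth', 'epoch_52.pth']:
    check_checkpoint_meta(p)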

@innerlee (Contributor)

What's the version of mmcv, mmdet and mmocr?

@innerlee (Contributor)

Not sure if the recent change to hook priority is related.

open-mmlab/mmdetection#5343

@antoniolanza1996 (Contributor, Author)

# Check Pytorch installation
import torch, torchvision
print(torch.__version__, torch.cuda.is_available())

# Check MMDetection installation
import mmdet
print(mmdet.__version__)

# Check mmcv installation
import mmcv
from mmcv.ops import get_compiling_cuda_version, get_compiler_version
print(mmcv.__version__)
print(get_compiling_cuda_version())
print(get_compiler_version())

# Check mmocr installation
import mmocr
print(mmocr.__version__)

Output:

1.5.0+cu101 True
2.11.0
1.3.7
10.1
GCC 7.3
0.2.0

innerlee added the bug (Something isn't working) label on Jun 17, 2021

@antoniolanza1996 (Contributor, Author)

On problem 1 (i.e. the learning rate is not correctly restored), I've read the mmcv code more carefully and noticed that with PolyLrUpdaterHook the learning rate is computed using runner.max_epochs. However, I used two different max_epochs values in the two training setups (i.e. the first one and the resumed one), so the resumed run couldn't reproduce the original learning rate.
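
To illustrate, below is a standalone sketch of the poly schedule as I understand it from mmcv (power=0.9 and min_lr=0 are assumptions taken from the default DBNet schedule, and the exact hook internals may differ): with the same epoch but a different max_epochs, the computed learning rate changes.

# Sketch of a poly LR schedule computed by epoch, roughly what PolyLrUpdaterHook does.
# base_lr=7e-3 is the default mentioned above; power=0.9 and min_lr=0 are assumptions.
def poly_lr(epoch, max_epochs, base_lr=7e-3, power=0.9, min_lr=0.0):
    coeff = (1 - epoch / max_epochs) ** power
    return (base_lr - min_lr) * coeff + min_lr

# Same epoch, different max_epochs -> different learning rate, so the value reached
# in the first run is not reproduced after resuming with a different max_epochs.
print(poly_lr(50, max_epochs=51))    # first run (max_epochs here is illustrative)
print(poly_lr(50, max_epochs=1200))  # resumed run with a larger max_epochs (illustrative)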

However, problem 2 still persists: the checkpoints saved during the resumed training don't have correct values for epoch and iter. This error happens only when training is resumed from a checkpoint.

@antoniolanza1996 (Contributor, Author)

I've investigated problem 2 further and found the cause in EpochBasedRunner's save_checkpoint function.

In particular, this dict update should be done before this one. This is because, in a resumed training setup, the epoch and iter from the restored checkpoint are kept in self.meta. Hence, if the updates run in the current order, the epoch and iter of the new checkpoint are overwritten with the values of the restored one.
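
To make the ordering issue concrete, here is a simplified sketch of the two update orders (not the literal mmcv code; the numbers are illustrative):

# self_meta stands in for the runner's self.meta, which after resume_from
# still carries the epoch/iter of the restored checkpoint.
self_meta = dict(epoch=50, iter=950)   # restored from epoch_50.pth
current = dict(epoch=51, iter=969)     # the runner's current progress

# Current (buggy) order: the stale resumed meta overwrites the current values.
meta = dict(**current)
meta.update(self_meta)
print(meta)  # {'epoch': 50, 'iter': 950} -> what epoch_51.pth ends up with

# Proposed order: merge self.meta first, then write the current epoch/iter on top.
meta = dict(**self_meta)
meta.update(current)
print(meta)  # {'epoch': 51, 'iter': 969}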

I'll open a PR in MMCV ASAP (hopefully tomorrow) to fix this behaviour.

@antoniolanza1996 (Contributor, Author)

I'm gonna close this issue because it has been solved in open-mmlab/mmcv#1108
