
cfg.resume_from doesn't work as expected #292

Closed
antoniolanza1996 opened this issue Jun 17, 2021 · 6 comments

Labels: bug (Something isn't working), upstream

Comments

@antoniolanza1996 (Contributor)

Describe the bug

Using the tools/train.py script, I've trained DBNet_r50dcn for 50 epochs on my own dataset with SGD (default params here).

Below is the main information from the epoch_50.pth checkpoint:

model = torch.load('epoch_50.pth',map_location=torch.device('cpu'))
print(model['meta']['epoch']) # 50
print(model['meta']['iter']) # 950
print(model['optimizer']['param_groups'][0]['lr']) # 0.00020712311161269083

After that, I've run tools/train.py with cfg.resume_from = 'epoch_50.pth', but I've noticed 2 problems:

  1. The learning rate is not correctly restored.
     Instead of 0.00020712311161269083, the learning rate at epoch 51 is 6.725e-03 (which is actually not even the default value 7e-03 defined here).

  2. The checkpoints created during the resumed training (e.g. epoch_51.pth, epoch_52.pth, epoch_53.pth) still have epoch and iter equal to the values in epoch_50.pth.

For example, here is the main information from the epoch_51.pth checkpoint:

model = torch.load('epoch_51.pth',map_location=torch.device('cpu'))
print(model['meta']['epoch']) # 50
print(model['meta']['iter']) # 950
print(model['optimizer']['param_groups'][0]['lr']) # 0.006725485699476165

and from the epoch_52.pth checkpoint:

model = torch.load('epoch_52.pth',map_location=torch.device('cpu'))
print(model['meta']['epoch']) # 50
print(model['meta']['iter']) # 950
print(model['optimizer']['param_groups'][0]['lr']) # 0.0067199828609755315

Clearly, this becomes a problem if one wants to resume training again from one of these checkpoints.
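
A quick way to spot problem 2 on any saved checkpoint is to compare the epoch stored in meta with the epoch encoded in the filename. Below is a small sketch of such a check (the helper name and file paths are mine, just for illustration; they are not part of mmocr):

import re
import torch

def check_checkpoint_meta(path):
    # Compare the epoch stored in the checkpoint meta with the one in the filename.
    ckpt = torch.load(path, map_location=torch.device('cpu'))
    meta_epoch = ckpt['meta']['epoch']
    file_epoch = int(re.search(r'epoch_(\d+)\.pth', path).group(1))
    print(path, 'meta epoch:', meta_epoch, 'meta iter:', ckpt['meta']['iter'],
          'filename epoch:', file_epoch, 'match:', meta_epoch == file_epoch)

for p in ['epoch_50.pth', 'epoch_51.pth', 'epoch_52.pth']:
    check_checkpoint_meta(p)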

@innerlee (Contributor)

What's the version of mmcv, mmdet and mmocr?

@innerlee (Contributor)

Not sure if the recent change to hook priority is related.

open-mmlab/mmdetection#5343

@antoniolanza1996 (Contributor, Author)

# Check Pytorch installation
import torch, torchvision
print(torch.__version__, torch.cuda.is_available())

# Check MMDetection installation
import mmdet
print(mmdet.__version__)

# Check mmcv installation
import mmcv
from mmcv.ops import get_compiling_cuda_version, get_compiler_version
print(mmcv.__version__)
print(get_compiling_cuda_version())
print(get_compiler_version())

# Check mmocr installation
import mmocr
print(mmocr.__version__)

Output:

1.5.0+cu101 True
2.11.0
1.3.7
10.1
GCC 7.3
0.2.0

innerlee added the bug (Something isn't working) label on Jun 17, 2021

@antoniolanza1996 (Contributor, Author)

On problem 1 (i.e. the learning rate is not correctly restored), I've read the mmcv code more carefully and noticed that with PolyLrUpdaterHook the learning rate is computed using runner.max_epochs. However, I used two different max_epochs values in the two training setups (i.e. the first one and the resumed one), so the resumed run couldn't reproduce the original learning rate.
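
To illustrate, below is a standalone sketch of the poly schedule as I understand it from mmcv (power=0.9 and min_lr=0 are assumptions taken from the default DBNet schedule, and the exact hook internals may differ): with the same epoch but a different max_epochs, the computed learning rate changes.

# Sketch of a poly LR schedule computed by epoch, roughly what PolyLrUpdaterHook does.
# base_lr=7e-3 is the default mentioned above; power=0.9 and min_lr=0 are assumptions.
def poly_lr(epoch, max_epochs, base_lr=7e-3, power=0.9, min_lr=0.0):
    coeff = (1 - epoch / max_epochs) ** power
    return (base_lr - min_lr) * coeff + min_lr

# Same epoch, different max_epochs -> different learning rate, so the value reached
# in the first run is not reproduced after resuming with a different max_epochs.
print(poly_lr(50, max_epochs=51))    # first run (max_epochs here is illustrative)
print(poly_lr(50, max_epochs=1200))  # resumed run with a larger max_epochs (illustrative)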

However, problem 2 still persists: the checkpoints saved during the resumed training don't have correct values for epoch and iter. This error happens only when training is resumed from a checkpoint.

@antoniolanza1996 (Contributor, Author)

I've investigated problem 2 further and found the cause in EpochBasedRunner's save_checkpoint function.

In particular, this dict update should be done before this one. This is because, in a resumed training setup, the epoch and iter from the restored checkpoint are kept in self.meta. Hence, if the updates run in the current order, the epoch and iter of the new checkpoint are overwritten with the values of the restored one.
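
To make the ordering issue concrete, here is a simplified sketch of the two update orders (not the literal mmcv code; the numbers are illustrative):

# self_meta stands in for the runner's self.meta, which after resume_from
# still carries the epoch/iter of the restored checkpoint.
self_meta = dict(epoch=50, iter=950)   # restored from epoch_50.pth
current = dict(epoch=51, iter=969)     # the runner's current progress

# Current (buggy) order: the stale resumed meta overwrites the current values.
meta = dict(**current)
meta.update(self_meta)
print(meta)  # {'epoch': 50, 'iter': 950} -> what epoch_51.pth ends up with

# Proposed order: merge self.meta first, then write the current epoch/iter on top.
meta = dict(**self_meta)
meta.update(current)
print(meta)  # {'epoch': 51, 'iter': 969}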

I'll open a PR in MMCV ASAP (hopefully tomorrow) to fix this behaviour.

@antoniolanza1996 (Contributor, Author)

I'm gonna close this issue because it has been solved in open-mmlab/mmcv#1108
