Describe the bug
I performed a multi-node, multi-GPU training of my detector on a cluster managed by Slurm, resuming from a checkpoint produced by a previous single-node, multi-GPU training. During the current training, the ETA constantly reads negative values (for example, `eta: -2 days, 22:45:24`).
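For context on how such values can arise: if the ETA is estimated as the average iteration time multiplied by the remaining iterations, it goes negative as soon as the iteration counter restored from the checkpoint exceeds the maximum iteration count recomputed for the resumed run, which shrinks when the world size grows from 4 to 8 GPUs because each epoch then has fewer iterations. The snippet below is only an illustrative sketch of that arithmetic with made-up numbers; it is not the actual logger code, and the variable names are mine.

```python
import datetime
import math

# Illustrative numbers only; not taken from this run.
dataset_size = 118287      # e.g. the COCO 2017 train split
samples_per_gpu = 2
max_epochs = 12

def max_iters(world_size):
    """Total training iterations for a given number of GPUs."""
    iters_per_epoch = math.ceil(dataset_size / (samples_per_gpu * world_size))
    return max_epochs * iters_per_epoch

# Iteration counter stored in a checkpoint written by a 4-GPU run after 8 epochs.
resumed_iter = 8 * math.ceil(dataset_size / (samples_per_gpu * 4))

# The resumed 2-node x 4-GPU run recomputes max_iters with world_size = 8.
avg_iter_time = 5.2  # seconds, roughly matching the log excerpt below
eta_sec = avg_iter_time * (max_iters(8) - resumed_iter)
print(datetime.timedelta(seconds=int(eta_sec)))  # a negative timedelta, e.g. "-2 days, ..."
```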
Reproduction
What command or script did you run?
I ran the following command on each node separately:
```
python -m torch.distributed.launch --nproc_per_node=4 --nnodes=2 --node_rank=x tools/train.py ...
```
The `x` in `--node_rank=x` is set to 0 on the master node and 1 on the worker node (2 nodes in total).
Did you make any modifications to the code or config? Do you understand what you modified?
My detector can be viewed as a modified FCOS, and its single-machine training gave a normal ETA.
What dataset did you use?
MSCOCO2017
Environment
PyTorch version: 1.4.0
Is debug build: No
CUDA used to build PyTorch: 10.1
OS: CentOS Linux 7 (Core)
GCC version: (GCC) 4.8.5 20150623 (Red Hat 4.8.5-39)
CMake version: Could not collect
Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: 10.1.243
GPU models and configuration:
GPU 0: Tesla V100-SXM2-16GB
GPU 1: Tesla V100-SXM2-16GB
GPU 2: Tesla V100-SXM2-16GB
GPU 3: Tesla V100-SXM2-16GB
Nvidia driver version: 440.64.00
cuDNN version: Could not collect
Versions of relevant libraries:
[pip] numpy==1.17.0
[pip] torch==1.4.0
[pip] torchvision==0.4.0a0
[conda] _pytorch_select 0.2 gpu_0 defaults
[conda] blas 1.0 mkl defaults
[conda] mkl 2019.4 243 defaults
[conda] mkl-service 2.3.0 py37he904b0f_0 defaults
[conda] mkl_fft 1.0.15 py37ha843d7b_0 defaults
[conda] mkl_random 1.1.0 py37hd6b4f25_0 defaults
[conda] pytorch 1.2.0 cuda100py37h938c94c_0 defaults
[conda] torch 1.4.0 pypi_0 pypi
[conda] torchvision 0.4.0 cuda100py37hecfc37a_0 defaults
Error traceback
No error is reported; the training runs normally except for the ETA readings:
```
2020-09-03 11:11:49,475 - INFO - Epoch [9][2100/4887] lr: 0.00063, eta: -2 days, 23:02:22, time: 5.316, data_time: 0.011, memory: 10684, loss_cls: 0.2043, loss_bbox: 0.2768, loss_centerness: 0.5867, loss_dens: 0.0010, loss_cov: 0.0525, loss_rescore: 0.1161, loss: 1.2373
2020-09-03 11:16:07,152 - INFO - Epoch [9][2150/4887] lr: 0.00063, eta: -2 days, 22:58:19, time: 5.154, data_time: 0.011, memory: 10684, loss_cls: 0.2323, loss_bbox: 0.3215, loss_centerness: 0.5930, loss_dens: 0.0009, loss_cov: 0.0511, loss_rescore: 0.1320, loss: 1.3307
2020-09-03 11:20:40,660 - INFO - Epoch [9][2200/4887] lr: 0.00063, eta: -2 days, 22:53:38, time: 5.469, data_time: 0.011, memory: 10684, loss_cls: 0.2145, loss_bbox: 0.2919, loss_centerness: 0.5892, loss_dens: 0.0012, loss_cov: 0.0528, loss_rescore: 0.1152, loss: 1.2648
2020-09-03 11:24:57,096 - INFO - Epoch [9][2250/4887] lr: 0.00063, eta: -2 days, 22:49:38, time: 5.127, data_time: 0.038, memory: 10684, loss_cls: 0.2113, loss_bbox: 0.2678, loss_centerness: 0.5879, loss_dens: 0.0010, loss_cov: 0.0580, loss_rescore: 0.1084, loss: 1.2344
2020-09-03 11:29:18,945 - INFO - Epoch [9][2300/4887] lr: 0.00063, eta: -2 days, 22:45:24, time: 5.239, data_time: 0.020, memory: 10684, loss_cls: 0.2052, loss_bbox: 0.2732, loss_centerness: 0.5878, loss_dens: 0.0010, loss_cov: 0.0525, loss_rescore: 0.1167, loss: 1.2364
```
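A quick way to check whether the iteration counter stored in the checkpoint is consistent with the resumed run (a rough sketch, assuming an mmcv-style checkpoint that keeps an epoch/iteration counter in a `meta` dict; the path and key names are placeholders and may differ between versions):

```python
import torch

# Inspect the bookkeeping stored in the checkpoint that was resumed from.
# 'work_dirs/my_fcos/latest.pth' is a placeholder path.
ckpt = torch.load('work_dirs/my_fcos/latest.pth', map_location='cpu')
meta = ckpt.get('meta', {})
print('epoch:', meta.get('epoch'), 'iter:', meta.get('iter'))

# If meta['iter'] is larger than max_epochs * len(data_loader) of the resumed
# 8-GPU run, a remaining-iterations ETA estimate would come out negative.
```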
Bug fix
Not yet.