ETA (estimated time of arrival) being negative numbers during multi-node training resumed from a checkpoint #3676

Closed
HawkRong opened this issue Sep 3, 2020 · 1 comment
HawkRong commented Sep 3, 2020

Describe the bug
I performed multi-node-multi-GPU training of my detector on a cluster managed by Slurm, resuming from a checkpoint produced by an earlier single-node-multi-GPU run. During the current training, the ETA constantly reads negative numbers (for example, `eta: -2 days, 22:45:24`).
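One plausible explanation (my guess, not confirmed): an epoch-based ETA is typically derived from `average iteration time × (max_iters - current_iter)`. If `max_iters` is recomputed from the new dataloader length (8 GPUs means roughly half the iterations per epoch) while the iteration counter is restored as-is from the old single-node checkpoint, the "remaining iterations" term can already be negative right after resuming. The sketch below only illustrates this arithmetic; all names and numbers (`max_epochs`, `iters_per_epoch`, the 4-GPU assumption about the original run, etc.) are hypothetical and this is not mmdetection's actual logger code.

```python
import datetime

# Hypothetical numbers: a checkpoint written after 8 epochs of a 4-GPU run,
# resumed on 2 nodes x 4 GPUs (so each epoch now has half as many iterations).
max_epochs = 12
old_iters_per_epoch = 9774            # assumed for the original 4-GPU run
new_iters_per_epoch = 4887            # per-epoch count seen in the resumed 8-GPU log

resumed_iter = 8 * old_iters_per_epoch        # iteration counter restored from the checkpoint
max_iters = max_epochs * new_iters_per_epoch  # total recomputed from the new dataloader

avg_iter_time = 5.2                   # seconds per iteration, roughly the `time:` field in the log
eta_sec = avg_iter_time * (max_iters - resumed_iter)

# max_iters (58644) is smaller than resumed_iter (78192), so the remaining
# iteration count is negative and the formatted ETA is negative as well.
print(datetime.timedelta(seconds=int(eta_sec)))   # -2 days, 19:45:51
```

Under this assumption the counter keeps increasing past the recomputed total, so each new log line reports an ETA that drifts further below zero, which matches the excerpt further down.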

Reproduction

  1. What command or script did you run?
    I ran the following command on each node separately:
    ```
    python -m torch.distributed.launch --nproc_per_node=4 --nnodes=2 --node_rank=x tools/train.py ...
    ```
    The `x` in `--node_rank=x` is 0 on the master node and 1 on the worker node (2 nodes in total; see the sanity-check sketch after this list).

  2. Did you make any modifications on the code or config? Did you understand what you have modified?
    My detector can be viewed as a modified FCOS, and its training on a single machine gave a normal ETA.

  3. What dataset did you use?
    MSCOCO2017
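
Not part of the original report, just a suggestion for narrowing this down: a tiny sanity check (hypothetical snippet, using only standard `torch.distributed` calls) that could be dropped into `tools/train.py` after distributed initialization to confirm the resumed run really sees 2 × 4 = 8 processes:

```python
import torch.distributed as dist

# Print each process's rank and the total world size once the process
# group has been initialized by torch.distributed.launch.
if dist.is_available() and dist.is_initialized():
    print(f"rank {dist.get_rank()} / world size {dist.get_world_size()}")
```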

Environment

PyTorch version: 1.4.0
Is debug build: No
CUDA used to build PyTorch: 10.1

OS: CentOS Linux 7 (Core)
GCC version: (GCC) 4.8.5 20150623 (Red Hat 4.8.5-39)
CMake version: Could not collect

Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: 10.1.243
GPU models and configuration:
GPU 0: Tesla V100-SXM2-16GB
GPU 1: Tesla V100-SXM2-16GB
GPU 2: Tesla V100-SXM2-16GB
GPU 3: Tesla V100-SXM2-16GB

Nvidia driver version: 440.64.00
cuDNN version: Could not collect

Versions of relevant libraries:
[pip] numpy==1.17.0
[pip] torch==1.4.0
[pip] torchvision==0.4.0a0
[conda] _pytorch_select 0.2 gpu_0 defaults
[conda] blas 1.0 mkl defaults
[conda] mkl 2019.4 243 defaults
[conda] mkl-service 2.3.0 py37he904b0f_0 defaults
[conda] mkl_fft 1.0.15 py37ha843d7b_0 defaults
[conda] mkl_random 1.1.0 py37hd6b4f25_0 defaults
[conda] pytorch 1.2.0 cuda100py37h938c94c_0 defaults
[conda] torch 1.4.0 pypi_0 pypi
[conda] torchvision 0.4.0 cuda100py37hecfc37a_0 defaults

Error traceback
No error was reported. The training proceeded normally except for the ETA readings.
```
2020-09-03 11:11:49,475 - INFO - Epoch [9][2100/4887] lr: 0.00063, eta: -2 days, 23:02:22, time: 5.316, data_time: 0.011, memory: 10684, loss_cls: 0.2043, loss_bbox: 0.2768, loss_centerness: 0.5867, loss_dens: 0.0010, loss_cov: 0.0525, loss_rescore: 0.1161, loss: 1.2373
2020-09-03 11:16:07,152 - INFO - Epoch [9][2150/4887] lr: 0.00063, eta: -2 days, 22:58:19, time: 5.154, data_time: 0.011, memory: 10684, loss_cls: 0.2323, loss_bbox: 0.3215, loss_centerness: 0.5930, loss_dens: 0.0009, loss_cov: 0.0511, loss_rescore: 0.1320, loss: 1.3307
2020-09-03 11:20:40,660 - INFO - Epoch [9][2200/4887] lr: 0.00063, eta: -2 days, 22:53:38, time: 5.469, data_time: 0.011, memory: 10684, loss_cls: 0.2145, loss_bbox: 0.2919, loss_centerness: 0.5892, loss_dens: 0.0012, loss_cov: 0.0528, loss_rescore: 0.1152, loss: 1.2648
2020-09-03 11:24:57,096 - INFO - Epoch [9][2250/4887] lr: 0.00063, eta: -2 days, 22:49:38, time: 5.127, data_time: 0.038, memory: 10684, loss_cls: 0.2113, loss_bbox: 0.2678, loss_centerness: 0.5879, loss_dens: 0.0010, loss_cov: 0.0580, loss_rescore: 0.1084, loss: 1.2344
2020-09-03 11:29:18,945 - INFO - Epoch [9][2300/4887] lr: 0.00063, eta: -2 days, 22:45:24, time: 5.239, data_time: 0.020, memory: 10684, loss_cls: 0.2052, loss_bbox: 0.2732, loss_centerness: 0.5878, loss_dens: 0.0010, loss_cov: 0.0525, loss_rescore: 0.1167, loss: 1.2364
```
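
A side note from me, not from the original report: the odd-looking `-2 days, 22:45:24` string is just how Python's `datetime.timedelta` renders a negative duration (the day count is floored, the remainder kept non-negative), so the value corresponds to roughly minus 25 hours:

```python
import datetime

# -2 days + 22:45:24  ==  -(2 * 86400) + 81924  ==  -90876 seconds
eta = datetime.timedelta(seconds=-90876)
print(eta)                          # -2 days, 22:45:24
print(eta.total_seconds() / 3600)   # about -25.24 hours
```

Read this way, the ETA in the excerpt drifts from roughly -25.0 h to -25.2 h, i.e. it moves further from zero as the iteration counter keeps advancing.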

Bug fix
Not yet.

@Feobi1999

I still encounter this negative ETA when I resume from a checkpoint during multi-node training. My mmdet version is 3.2.0.
