Describe the bug
I performed a multi-node, multi-GPU training of my detector on a cluster managed by Slurm, resuming from a checkpoint produced by a previous single-node, multi-GPU training. During the current training, the ETA constantly reads negative values (for example, `eta: -2 days, 22:45:24`).
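For context on how such values can arise: if the ETA is estimated as the average iteration time multiplied by the remaining iterations, it goes negative as soon as the iteration counter restored from the checkpoint exceeds the maximum iteration count recomputed for the resumed run, which shrinks when the world size grows from 4 to 8 GPUs because each epoch then has fewer iterations. The snippet below is only an illustrative sketch of that arithmetic with made-up numbers; it is not the actual logger code, and the variable names are mine.

```python
import datetime
import math

# Illustrative numbers only; not taken from this run.
dataset_size = 118287      # e.g. the COCO 2017 train split
samples_per_gpu = 2
max_epochs = 12

def max_iters(world_size):
    """Total training iterations for a given number of GPUs."""
    iters_per_epoch = math.ceil(dataset_size / (samples_per_gpu * world_size))
    return max_epochs * iters_per_epoch

# Iteration counter stored in a checkpoint written by a 4-GPU run after 8 epochs.
resumed_iter = 8 * math.ceil(dataset_size / (samples_per_gpu * 4))

# The resumed 2-node x 4-GPU run recomputes max_iters with world_size = 8.
avg_iter_time = 5.2  # seconds, roughly matching the log excerpt below
eta_sec = avg_iter_time * (max_iters(8) - resumed_iter)
print(datetime.timedelta(seconds=int(eta_sec)))  # a negative timedelta, e.g. "-2 days, ..."
```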
Reproduction
What command or script did you run?
I ran the following command on each node separately:
```
python -m torch.distributed.launch --nproc_per_node=4 --nnodes=2 --node_rank=x tools/train.py ...
```
The `x` in `--node_rank=x` is set to 0 on the master node and 1 on the worker node (2 nodes in total).
Did you make any modifications to the code or config? Do you understand what you modified?
My detector can be viewed as a modified FCOS, and its single-machine training gave a normal ETA.
What dataset did you use?
MSCOCO2017
Environment
PyTorch version: 1.4.0
Is debug build: No
CUDA used to build PyTorch: 10.1
OS: CentOS Linux 7 (Core)
GCC version: (GCC) 4.8.5 20150623 (Red Hat 4.8.5-39)
CMake version: Could not collect
Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: 10.1.243
GPU models and configuration:
GPU 0: Tesla V100-SXM2-16GB
GPU 1: Tesla V100-SXM2-16GB
GPU 2: Tesla V100-SXM2-16GB
GPU 3: Tesla V100-SXM2-16GB
Nvidia driver version: 440.64.00
cuDNN version: Could not collect
Versions of relevant libraries:
[pip] numpy==1.17.0
[pip] torch==1.4.0
[pip] torchvision==0.4.0a0
[conda] _pytorch_select 0.2 gpu_0 defaults
[conda] blas 1.0 mkl defaults
[conda] mkl 2019.4 243 defaults
[conda] mkl-service 2.3.0 py37he904b0f_0 defaults
[conda] mkl_fft 1.0.15 py37ha843d7b_0 defaults
[conda] mkl_random 1.1.0 py37hd6b4f25_0 defaults
[conda] pytorch 1.2.0 cuda100py37h938c94c_0 defaults
[conda] torch 1.4.0 pypi_0 pypi
[conda] torchvision 0.4.0 cuda100py37hecfc37a_0 defaults
Error traceback
No error is reported; the training runs normally except for the ETA readings:
```
2020-09-03 11:11:49,475 - INFO - Epoch [9][2100/4887] lr: 0.00063, eta: -2 days, 23:02:22, time: 5.316, data_time: 0.011, memory: 10684, loss_cls: 0.2043, loss_bbox: 0.2768, loss_centerness: 0.5867, loss_dens: 0.0010, loss_cov: 0.0525, loss_rescore: 0.1161, loss: 1.2373
2020-09-03 11:16:07,152 - INFO - Epoch [9][2150/4887] lr: 0.00063, eta: -2 days, 22:58:19, time: 5.154, data_time: 0.011, memory: 10684, loss_cls: 0.2323, loss_bbox: 0.3215, loss_centerness: 0.5930, loss_dens: 0.0009, loss_cov: 0.0511, loss_rescore: 0.1320, loss: 1.3307
2020-09-03 11:20:40,660 - INFO - Epoch [9][2200/4887] lr: 0.00063, eta: -2 days, 22:53:38, time: 5.469, data_time: 0.011, memory: 10684, loss_cls: 0.2145, loss_bbox: 0.2919, loss_centerness: 0.5892, loss_dens: 0.0012, loss_cov: 0.0528, loss_rescore: 0.1152, loss: 1.2648
2020-09-03 11:24:57,096 - INFO - Epoch [9][2250/4887] lr: 0.00063, eta: -2 days, 22:49:38, time: 5.127, data_time: 0.038, memory: 10684, loss_cls: 0.2113, loss_bbox: 0.2678, loss_centerness: 0.5879, loss_dens: 0.0010, loss_cov: 0.0580, loss_rescore: 0.1084, loss: 1.2344
2020-09-03 11:29:18,945 - INFO - Epoch [9][2300/4887] lr: 0.00063, eta: -2 days, 22:45:24, time: 5.239, data_time: 0.020, memory: 10684, loss_cls: 0.2052, loss_bbox: 0.2732, loss_centerness: 0.5878, loss_dens: 0.0010, loss_cov: 0.0525, loss_rescore: 0.1167, loss: 1.2364
```
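A quick way to check whether the iteration counter stored in the checkpoint is consistent with the resumed run (a rough sketch, assuming an mmcv-style checkpoint that keeps an epoch/iteration counter in a `meta` dict; the path and key names are placeholders and may differ between versions):

```python
import torch

# Inspect the bookkeeping stored in the checkpoint that was resumed from.
# 'work_dirs/my_fcos/latest.pth' is a placeholder path.
ckpt = torch.load('work_dirs/my_fcos/latest.pth', map_location='cpu')
meta = ckpt.get('meta', {})
print('epoch:', meta.get('epoch'), 'iter:', meta.get('iter'))

# If meta['iter'] is larger than max_epochs * len(data_loader) of the resumed
# 8-GPU run, a remaining-iterations ETA estimate would come out negative.
```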
Bug fix
Not yet.