Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Training failed with 4 GPUs after first epoch #4409

Closed
kinoute opened this issue Aug 13, 2021 · 6 comments · Fixed by #4422
Closed

Training failed with 4 GPUs after first epoch #4409

kinoute opened this issue Aug 13, 2021 · 6 comments · Fixed by #4422
Labels
bug Something isn't working

Comments

@kinoute
Copy link
Contributor

kinoute commented Aug 13, 2021

🐛 Bug

I was able to train on OVH AI Cloud with 4 classes and 500 images in total three days ago with 4 GPUs but when I try again to train with my full dataset this time (around 9000 images for 4 classes), the training stops after the first epoch, when the validation step is about to finish.

I tried to change different things: get rid of the cache argument, change for a smaller model (I was using 5MP6 at first), changing batch size, changing the number of GPUs, still the same.

To Reproduce (REQUIRED)

First, here is my Dockerfile. It is based on the Official Yolov5 docker image with W&B integrated:

FROM ultralytics/yolov5:latest

# unfortunately, wandb is commented out in the official image
RUN pip3 install wandb

# pass the wandb API key at build time
ARG wandb_key
ENV wandb_api_key=$wandb_key

# setup wandb account
RUN wandb login "$wandb_api_key"

WORKDIR /usr/src/app

RUN chown -R 42420:42420 /usr/src

# do stuff at start
COPY entrypoint.sh /usr/src/app

ENTRYPOINT ["/bin/bash", "-c", "./entrypoint.sh && bash"]

entrypoint.sh with:

  • a call to the autosplit() function ;
  • a call to train.py to start the training.
#!/bin/bash

# split datasets into training, validation & test
python3 -c "from utils.datasets import autosplit; autosplit('../logos/images', annotated_only=True);"

# start the training
python3 -m torch.distributed.launch \
  --nproc_per_node 4 train.py \
  --img-size 1280 \
  --epochs 100 \
  --data ../logos/logo.yaml \
  --weights yolov5m.pt \
  --batch-size 64 \
  --device 0,1,2,3 \
  --project results \
  --name "$(date +'%Y-%m-%d')" \
  --exist-ok \
  --workers 0

Full output from the server:


  0%|          | 0/11481 [00:00<?, ?it/s]
  3%|▎         | 293/11481 [00:00<00:03, 2926.18it/s]
  4%|▍         | 483/11481 [00:00<00:04, 2517.14it/s]
  6%|▌         | 670/11481 [00:00<00:04, 2276.94it/s]
  8%|▊         | 909/11481 [00:00<00:04, 2309.31it/s]
 10%|▉         | 1103/11481 [00:00<00:04, 2181.77it/s]
 11%|█         | 1280/11481 [00:00<00:05, 2035.86it/s]
 13%|█▎        | 1457/11481 [00:00<00:05, 1936.71it/s]
 15%|█▍        | 1699/11481 [00:00<00:04, 2059.93it/s]
 17%|█▋        | 1935/11481 [00:00<00:04, 2139.96it/s]
 19%|█▉        | 2234/11481 [00:01<00:03, 2338.47it/s]
 22%|██▏       | 2469/11481 [00:01<00:04, 2164.21it/s]
 23%|██▎       | 2689/11481 [00:01<00:04, 2037.83it/s]
 25%|██▌       | 2897/11481 [00:01<00:04, 1946.45it/s]
 27%|██▋       | 3096/11481 [00:01<00:04, 1909.82it/s]
 29%|██▊       | 3290/11481 [00:01<00:04, 1840.54it/s]
 30%|███       | 3477/11481 [00:01<00:04, 1772.23it/s]
 32%|███▏      | 3671/11481 [00:01<00:04, 1819.05it/s]
 34%|███▎      | 3855/11481 [00:01<00:04, 1778.36it/s]
 35%|███▌      | 4060/11481 [00:02<00:04, 1851.52it/s]
 37%|███▋      | 4273/11481 [00:02<00:03, 1924.63it/s]
 39%|███▉      | 4468/11481 [00:02<00:03, 1872.27it/s]
 41%|████      | 4658/11481 [00:02<00:03, 1879.67it/s]
 43%|████▎     | 4891/11481 [00:02<00:03, 1994.13it/s]
 44%|████▍     | 5108/11481 [00:02<00:03, 2042.15it/s]
 47%|████▋     | 5435/11481 [00:02<00:02, 2301.21it/s]
 50%|█████     | 5773/11481 [00:02<00:02, 2543.61it/s]
 53%|█████▎    | 6047/11481 [00:02<00:02, 2598.59it/s]
 55%|█████▌    | 6319/11481 [00:02<00:02, 2573.29it/s]
 64%|██████▍   | 7351/11481 [00:03<00:01, 3320.80it/s]
 68%|██████▊   | 7847/11481 [00:03<00:01, 2920.33it/s]
 72%|███████▏  | 8263/11481 [00:03<00:01, 2642.29it/s]
 75%|███████▌  | 8619/11481 [00:03<00:01, 2663.95it/s]
 78%|███████▊  | 8950/11481 [00:03<00:01, 2476.00it/s]
 81%|████████  | 9245/11481 [00:03<00:00, 2286.48it/s]
 83%|████████▎ | 9510/11481 [00:04<00:00, 2122.86it/s]
 85%|████████▍ | 9750/11481 [00:04<00:00, 2170.76it/s]
 87%|████████▋ | 9987/11481 [00:04<00:00, 2163.66it/s]
 89%|████████▉ | 10218/11481 [00:04<00:00, 2051.27it/s]
 91%|█████████ | 10434/11481 [00:04<00:00, 1996.93it/s]
 93%|█████████▎| 10642/11481 [00:04<00:00, 1875.36it/s]
 94%|█████████▍| 10837/11481 [00:04<00:00, 1888.51it/s]
 96%|█████████▌| 11031/11481 [00:04<00:00, 1875.07it/s]
 98%|█████████▊| 11257/11481 [00:04<00:00, 1975.48it/s]
100%|██████████| 11481/11481 [00:04<00:00, 2308.00it/s]
Autosplitting images from ../logos/images, using *.txt labeled images only
/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py:163: DeprecationWarning: The 'warn' method is deprecated, use 'warning' instead
  logger.warn(
The module torch.distributed.launch is deprecated and going to be removed in future.Migrate to torch.distributed.run
WARNING:torch.distributed.run:--use_env is deprecated and will be removed in future releases.
 Please read local_rank from `os.environ('LOCAL_RANK')` instead.
INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
  entrypoint       : train.py
  min_nodes        : 1
  max_nodes        : 1
  nproc_per_node   : 4
  run_id           : none
  rdzv_backend     : static
  rdzv_endpoint    : 127.0.0.1:29500
  rdzv_configs     : {'rank': 0, 'timeout': 900}
  max_restarts     : 3
  monitor_interval : 5
  log_dir          : None
  metrics_cfg      : {}

INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic_wqpgf9x4/none_u1agcdp1
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python3
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/utils/store.py:52: FutureWarning: This is an experimental API and will be changed in future.
  warnings.warn(
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
  restart_count=0
  master_addr=127.0.0.1
  master_port=29500
  group_rank=0
  group_world_size=1
  local_ranks=[0, 1, 2, 3]
  role_ranks=[0, 1, 2, 3]
  global_ranks=[0, 1, 2, 3]
  role_world_sizes=[4, 4, 4, 4]
  global_world_sizes=[4, 4, 4, 4]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_wqpgf9x4/none_u1agcdp1/attempt_0/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic_wqpgf9x4/none_u1agcdp1/attempt_0/1/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker2 reply file to: /tmp/torchelastic_wqpgf9x4/none_u1agcdp1/attempt_0/2/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker3 reply file to: /tmp/torchelastic_wqpgf9x4/none_u1agcdp1/attempt_0/3/error.json
�[34m�[1mtrain: �[0mweights=yolov5m.pt, cfg=, data=../logos/logo.yaml, hyp=data/hyps/hyp.scratch.yaml, epochs=100, batch_size=64, imgsz=1280, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, evolve=None, bucket=, cache=None, image_weights=False, device=0,1,2,3, multi_scale=False, single_cls=False, adam=False, sync_bn=False, workers=0, project=results, entity=None, name=2021-08-13, exist_ok=True, quad=False, linear_lr=False, label_smoothing=0.0, upload_dataset=False, bbox_interval=-1, save_period=-1, artifact_alias=latest, local_rank=0, freeze=0
�[34m�[1mgithub: �[0mskipping check (Docker image), for updates see https://github.com/ultralytics/yolov5
YOLOv5 🚀 v5.0-360-gd9f23ed torch 1.9.0+cu102 CUDA:0 (Tesla V100S-PCIE-32GB, 32510.5MB)
                                              CUDA:1 (Tesla V100S-PCIE-32GB, 32510.5MB)
                                              CUDA:2 (Tesla V100S-PCIE-32GB, 32510.5MB)
                                              CUDA:3 (Tesla V100S-PCIE-32GB, 32510.5MB)

Added key: store_based_barrier_key:1 to store for rank: 0
Rank 0: Completed store-based barrier for 4 nodes.
�[34m�[1mhyperparameters: �[0mlr0=0.01, lrf=0.2, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, cls=0.5, cls_pw=1.0, obj=1.0, obj_pw=1.0, iou_t=0.2, anchor_t=4.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0, copy_paste=0.0
�[34m�[1mTensorBoard: �[0mStart with 'tensorboard --logdir results', view at http://localhost:6006/
[W ProcessGroupNCCL.cpp:1569] Rank 3 using best-guess GPU 3 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
[W ProcessGroupNCCL.cpp:1569] Rank 2 using best-guess GPU 2 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
[W ProcessGroupNCCL.cpp:1569] Rank 1 using best-guess GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
wandb: Currently logged in as: hivacruz (use `wandb login --relogin` to force relogin)

CondaEnvException: Unable to determine environment

Please re-run this command with one of the following options:

* Provide an environment name via --name or -n
* Re-run this command inside an activated conda environment.

wandb: Tracking run with wandb version 0.12.0
wandb: Syncing run 2021-08-13
wandb:  View project at https://wandb.ai/hivacruz/results

wandb:  View run at https://wandb.ai/hivacruz/results/runs/b7blzdq6
wandb: Run data is saved locally in /usr/src/app/wandb/run-20210813_152353-b7blzdq6
wandb: Run `wandb offline` to turn off syncing.
[W ProcessGroupNCCL.cpp:1569] Rank 0 using best-guess GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
Downloading https://github.com/ultralytics/yolov5/releases/download/v5.0/yolov5m.pt to yolov5m.pt...

  0% 0.00/41.1M [00:00<?, ?B/s]
  4% 1.80M/41.1M [00:00<00:02, 18.6MB/s]
 15% 6.01M/41.1M [00:00<00:01, 22.5MB/s]
 40% 16.3M/41.1M [00:00<00:00, 29.6MB/s]
 55% 22.7M/41.1M [00:00<00:00, 34.3MB/s]
 72% 29.6M/41.1M [00:00<00:00, 40.8MB/s]
 88% 36.1M/41.1M [00:00<00:00, 46.3MB/s]
100% 41.1M/41.1M [00:00<00:00, 56.7MB/s]

Overriding model.yaml nc=80 with nc=4

                 from  n    params  module                                  arguments                     
  0                -1  1      5280  models.common.Focus                     [3, 48, 3]                    
  1                -1  1     41664  models.common.Conv                      [48, 96, 3, 2]                
  2                -1  2     65280  models.common.C3                        [96, 96, 2]                   
  3                -1  1    166272  models.common.Conv                      [96, 192, 3, 2]               
  4                -1  6    629760  models.common.C3                        [192, 192, 6]                 
  5                -1  1    664320  models.common.Conv                      [192, 384, 3, 2]              
  6                -1  6   2512896  models.common.C3                        [384, 384, 6]                 
  7                -1  1   2655744  models.common.Conv                      [384, 768, 3, 2]              
  8                -1  1   1476864  models.common.SPP                       [768, 768, [5, 9, 13]]        
  9                -1  2   4134912  models.common.C3                        [768, 768, 2, False]          
 10                -1  1    295680  models.common.Conv                      [768, 384, 1, 1]              
 11                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']          
 12           [-1, 6]  1         0  models.common.Concat                    [1]                           
 13                -1  2   1182720  models.common.C3                        [768, 384, 2, False]          
 14                -1  1     74112  models.common.Conv                      [384, 192, 1, 1]              
 15                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']          
 16           [-1, 4]  1         0  models.common.Concat                    [1]                           
 17                -1  2    296448  models.common.C3                        [384, 192, 2, False]          
 18                -1  1    332160  models.common.Conv                      [192, 192, 3, 2]              
 19          [-1, 14]  1         0  models.common.Concat                    [1]                           
 20                -1  2   1035264  models.common.C3                        [384, 384, 2, False]          
 21                -1  1   1327872  models.common.Conv                      [384, 384, 3, 2]              
 22          [-1, 10]  1         0  models.common.Concat                    [1]                           
 23                -1  2   4134912  models.common.C3                        [768, 768, 2, False]          
 24      [17, 20, 23]  1     36369  models.yolo.Detect                      [4, [[10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119], [116, 90, 156, 198, 373, 326]], [192, 384, 768]]
Model Summary: 391 layers, 21068529 parameters, 21068529 gradients, 50.4 GFLOPs

Transferred 500/506 items from yolov5m.pt
Scaled weight_decay = 0.0005
�[34m�[1moptimizer:�[0m SGD with parameter groups 83 weight, 86 weight (no decay), 86 bias

�[34m�[1mtrain: �[0mScanning '/usr/src/logos/autosplit_train.cache' images and labels... 9461 found, 0 missing, 2701 empty, 4 corrupted: 100% 9465/9465 [00:00<?, ?it/s]�[34m�[1mtrain: �[0mWARNING: Ignoring corrupted image and/or label /usr/src/logos/images/xxx/e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855.png: cannot identify image file '/usr/src/logos/images/xxx/e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855.png'
�[34m�[1mtrain: �[0mWARNING: Ignoring corrupted image and/or label /usr/src/logos/images/xxx/e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855.png: cannot identify image file "/usr/src/logos/images/xxx/e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855.png"
�[34m�[1mtrain: �[0mWARNING: Ignoring corrupted image and/or label /usr/src/logos/images/xxx/e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855.png: cannot identify image file '/usr/src/logos/images/xxx/e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855.png'
�[34m�[1mtrain: �[0mWARNING: Ignoring corrupted image and/or label /usr/src/logos/images/xxx/e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855.png: cannot identify image file '/usr/src/logos/images/xxx/e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855.png'

�[34m�[1mtrain: �[0mScanning '/usr/src/logos/autosplit_train.cache' images and labels... 9461 found, 0 missing, 2701 empty, 4 corrupted: 100% 9465/9465 [00:00<?, ?it/s]

�[34m�[1mval: �[0mScanning '/usr/src/logos/autosplit_val.cache' images and labels... 996 found, 0 missing, 286 empty, 0 corrupted: 100% 996/996 [00:00<?, ?it/s]
�[34m�[1mval: �[0mScanning '/usr/src/logos/autosplit_val.cache' images and labels... 996 found, 0 missing, 286 empty, 0 corrupted: 100% 996/996 [00:00<?, ?it/s]
Plotting labels... 

�[34m�[1mtrain: �[0mScanning '/usr/src/logos/autosplit_train.cache' images and labels... 9461 found, 0 missing, 2701 empty, 4 corrupted: 100%|██████████| 9465/9465 [00:00<?, ?it/s]
�[34m�[1mtrain: �[0mScanning '/usr/src/logos/autosplit_train.cache' images and labels... 9461 found, 0 missing, 2701 empty, 4 corrupted: 100%|██████████| 9465/9465 [00:00<?, ?it/s]

�[34m�[1mtrain: �[0mScanning '/usr/src/logos/autosplit_train.cache' images and labels... 9461 found, 0 missing, 2701 empty, 4 corrupted: 100%|██████████| 9465/9465 [00:00<?, ?it/s]
�[34m�[1mtrain: �[0mScanning '/usr/src/logos/autosplit_train.cache' images and labels... 9461 found, 0 missing, 2701 empty, 4 corrupted: 100%|██████████| 9465/9465 [00:00<?, ?it/s]

�[34m�[1mtrain: �[0mScanning '/usr/src/logos/autosplit_train.cache' images and labels... 9461 found, 0 missing, 2701 empty, 4 corrupted: 100%|██████████| 9465/9465 [00:00<?, ?it/s]
�[34m�[1mtrain: �[0mScanning '/usr/src/logos/autosplit_train.cache' images and labels... 9461 found, 0 missing, 2701 empty, 4 corrupted: 100%|██████████| 9465/9465 [00:00<?, ?it/s]

�[34m�[1mautoanchor: �[0mAnalyzing anchors... anchors/target = 5.73, Best Possible Recall (BPR) = 1.0000
Image sizes 1280 train, 1280 val
Using 0 dataloader workers
Logging results to results/2021-08-13
Starting training for 100 epochs...

     Epoch   gpu_mem       box       obj       cls    labels  img_size

  0% 0/148 [00:00<?, ?it/s]
      0/99     19.8G    0.1254   0.08545   0.04525        30      1280:   0% 0/148 [00:11<?, ?it/s]
      0/99     19.8G    0.1254   0.08545   0.04525        30      1280:   1% 1/148 [00:15<37:35, 15.34s/it]Reducer buckets have been rebuilt in this iteration.

      0/99     21.5G     0.123   0.08545   0.04553        25      1280:   1% 1/148 [00:20<37:35, 15.34s/it]
      0/99     21.5G     0.123   0.08545   0.04553        25      1280:   1% 2/148 [00:20<29:54, 12.29s/it]
      0/99     21.5G    0.1212   0.08502   0.04532        17      1280:   1% 2/148 [00:25<29:54, 12.29s/it]
      0/99     21.5G    0.1212   0.08502   0.04532        17      1280:   2% 3/148 [00:25<24:42, 10.22s/it]
      0/99     21.5G    0.1217   0.08223    0.0452        17      1280:   2% 3/148 [00:31<24:42, 10.22s/it]
      0/99     21.5G    0.1217   0.08223    0.0452        17      1280:   3% 4/148 [00:31<20:53,  8.70s/it]
      0/99     21.5G    0.1213   0.07896   0.04531        20      1280:   3% 4/148 [00:36<20:53,  8.70s/it]
      0/99     21.5G    0.1213   0.07896   0.04531        20      1280:   3% 5/148 [00:36<18:19,  7.69s/it]
      0/99     21.5G    0.1211   0.07538   0.04512        25      1280:   3% 5/148 [00:41<18:19,  7.69s/it]
      0/99     21.5G    0.1211   0.07538   0.04512        25      1280:   4% 6/148 [00:41<16:25,  6.94s/it]
      0/99     21.5G    0.1203   0.07139   0.04532        13      1280:   4% 6/148 [00:46<16:25,  6.94s/it]
      0/99     21.5G    0.1203   0.07139   0.04532        13      1280:   5% 7/148 [00:46<14:59,  6.38s/it]
      0/99     21.5G    0.1193   0.06797   0.04522        20      1280:   5% 7/148 [00:51<14:59,  6.38s/it]
      0/99     21.5G    0.1193   0.06797   0.04522        20      1280:   5% 8/148 [00:51<14:04,  6.03s/it]
      0/99     21.5G    0.1192   0.06457   0.04495        21      1280:   5% 8/148 [00:57<14:04,  6.03s/it]
      0/99     21.5G    0.1192   0.06457   0.04495        21      1280:   6% 9/148 [00:57<13:24,  5.79s/it]
      0/99     21.5G    0.1181   0.06186    0.0449        24      1280:   6% 9/148 [01:02<13:24,  5.79s/it]
      0/99     21.5G    0.1181   0.06186    0.0449        24      1280:   7% 10/148 [01:02<12:55,  5.62s/it]
      0/99     21.5G     0.118   0.05965   0.04474        38      1280:   7% 10/148 [01:07<12:55,  5.62s/it]
      0/99     21.5G     0.118   0.05965   0.04474        38      1280:   7% 11/148 [01:07<12:41,  5.56s/it]
      0/99     21.5G     0.117   0.05714   0.04468        18      1280:   7% 11/148 [01:12<12:41,  5.56s/it]
      0/99     21.5G     0.117   0.05714   0.04468        18      1280:   8% 12/148 [01:12<12:22,  5.46s/it]
      0/99     21.5G    0.1164   0.05487   0.04446        20      1280:   8% 12/148 [01:18<12:22,  5.46s/it]
      0/99     21.5G    0.1164   0.05487   0.04446        20      1280:   9% 13/148 [01:18<12:07,  5.39s/it]
      0/99     21.5G    0.1158   0.05271   0.04426        19      1280:   9% 13/148 [01:23<12:07,  5.39s/it]
      0/99     21.5G    0.1158   0.05271   0.04426        19      1280:   9% 14/148 [01:23<11:56,  5.35s/it]
      0/99     21.5G    0.1152   0.05089    0.0441        22      1280:   9% 14/148 [01:28<11:56,  5.35s/it]
      0/99     21.5G    0.1152   0.05089    0.0441        22      1280:  10% 15/148 [01:28<11:51,  5.35s/it]
      0/99     21.5G    0.1143   0.04923   0.04396        20      1280:  10% 15/148 [01:34<11:51,  5.35s/it]
      0/99     21.5G    0.1143   0.04923   0.04396        20      1280:  11% 16/148 [01:34<11:41,  5.32s/it]
      0/99     21.5G    0.1138   0.04776   0.04378        22      1280:  11% 16/148 [01:39<11:41,  5.32s/it]
      0/99     21.5G    0.1138   0.04776   0.04378        22      1280:  11% 17/148 [01:39<11:37,  5.33s/it]
      0/99     21.5G    0.1132   0.04637    0.0436        19      1280:  11% 17/148 [01:44<11:37,  5.33s/it]
      0/99     21.5G    0.1132   0.04637    0.0436        19      1280:  12% 18/148 [01:44<11:27,  5.29s/it]
      0/99     21.5G    0.1122   0.04538    0.0434        25      1280:  12% 18/148 [01:49<11:27,  5.29s/it]
      0/99     21.5G    0.1122   0.04538    0.0434        25      1280:  13% 19/148 [01:49<11:24,  5.31s/it]
      0/99     21.5G    0.1116   0.04417   0.04321        20      1280:  13% 19/148 [01:55<11:24,  5.31s/it]
      0/99     21.5G    0.1116   0.04417   0.04321        20      1280:  14% 20/148 [01:55<11:19,  5.31s/it]
      0/99     21.5G    0.1109   0.04299   0.04305        16      1280:  14% 20/148 [02:00<11:19,  5.31s/it]
      0/99     21.5G    0.1109   0.04299   0.04305        16      1280:  14% 21/148 [02:00<11:10,  5.28s/it]
      0/99     21.5G    0.1102   0.04209   0.04288        21      1280:  14% 21/148 [02:05<11:10,  5.28s/it]
      0/99     21.5G    0.1102   0.04209   0.04288        21      1280:  15% 22/148 [02:05<11:04,  5.28s/it]
      0/99     21.5G    0.1095   0.04134    0.0427        23      1280:  15% 22/148 [02:10<11:04,  5.28s/it]
      0/99     21.5G    0.1095   0.04134    0.0427        23      1280:  16% 23/148 [02:10<10:53,  5.23s/it]
      0/99     21.5G    0.1084   0.04068   0.04254        22      1280:  16% 23/148 [02:16<10:53,  5.23s/it]
      0/99     21.5G    0.1084   0.04068   0.04254        22      1280:  16% 24/148 [02:16<10:48,  5.23s/it]
      0/99     21.5G    0.1076   0.03996   0.04231        22      1280:  16% 24/148 [02:21<10:48,  5.23s/it]
      0/99     21.5G    0.1076   0.03996   0.04231        22      1280:  17% 25/148 [02:21<10:42,  5.22s/it]
      0/99     21.5G    0.1068   0.03924   0.04201        19      1280:  17% 25/148 [02:26<10:42,  5.22s/it]
      0/99     21.5G    0.1068   0.03924   0.04201        19      1280:  18% 26/148 [02:26<10:34,  5.20s/it]
      0/99     21.5G    0.1059    0.0387   0.04182        21      1280:  18% 26/148 [02:31<10:34,  5.20s/it]
      0/99     21.5G    0.1059    0.0387   0.04182        21      1280:  18% 27/148 [02:31<10:26,  5.18s/it]
      0/99     21.5G    0.1053   0.03812   0.04158        21      1280:  18% 27/148 [02:36<10:26,  5.18s/it]
      0/99     21.5G    0.1053   0.03812   0.04158        21      1280:  19% 28/148 [02:36<10:21,  5.18s/it]
      0/99     21.5G    0.1046   0.03767   0.04139        24      1280:  19% 28/148 [02:41<10:21,  5.18s/it]
      0/99     21.5G    0.1046   0.03767   0.04139        24      1280:  20% 29/148 [02:41<10:15,  5.17s/it]
      0/99     21.5G    0.1041   0.03737   0.04124        33      1280:  20% 29/148 [02:47<10:15,  5.17s/it]
      0/99     21.5G    0.1041   0.03737   0.04124        33      1280:  20% 30/148 [02:47<10:12,  5.19s/it]
      0/99     21.5G    0.1033   0.03673   0.04106        12      1280:  20% 30/148 [02:52<10:12,  5.19s/it]
      0/99     21.5G    0.1033   0.03673   0.04106        12      1280:  21% 31/148 [02:52<10:07,  5.19s/it]
      0/99     21.5G    0.1025   0.03623   0.04103        17      1280:  21% 31/148 [02:57<10:07,  5.19s/it]
      0/99     21.5G    0.1025   0.03623   0.04103        17      1280:  22% 32/148 [02:57<09:59,  5.16s/it]
      0/99     21.5G     0.102   0.03585   0.04075        24      1280:  22% 32/148 [03:02<09:59,  5.16s/it]
      0/99     21.5G     0.102   0.03585   0.04075        24      1280:  22% 33/148 [03:02<09:57,  5.19s/it]
      0/99     21.5G    0.1014   0.03556   0.04061        24      1280:  22% 33/148 [03:08<09:57,  5.19s/it]
      0/99     21.5G    0.1014   0.03556   0.04061        24      1280:  23% 34/148 [03:08<09:58,  5.25s/it]
      0/99     21.5G    0.1009   0.03512   0.04034        18      1280:  23% 34/148 [03:13<09:58,  5.25s/it]
      0/99     21.5G    0.1009   0.03512   0.04034        18      1280:  24% 35/148 [03:13<09:49,  5.22s/it]
      0/99     21.5G    0.1002   0.03484   0.04013        23      1280:  24% 35/148 [03:18<09:49,  5.22s/it]
      0/99     21.5G    0.1002   0.03484   0.04013        23      1280:  24% 36/148 [03:18<09:49,  5.26s/it]
      0/99     21.5G   0.09946   0.03466   0.03991        25      1280:  24% 36/148 [03:23<09:49,  5.26s/it]
      0/99     21.5G   0.09946   0.03466   0.03991        25      1280:  25% 37/148 [03:23<09:42,  5.24s/it]
      0/99     21.5G   0.09885   0.03448   0.03971        27      1280:  25% 37/148 [03:28<09:42,  5.24s/it]
      0/99     21.5G   0.09885   0.03448   0.03971        27      1280:  26% 38/148 [03:28<09:35,  5.23s/it]
      0/99     21.5G   0.09817   0.03415   0.03964        17      1280:  26% 38/148 [03:34<09:35,  5.23s/it]
      0/99     21.5G   0.09817   0.03415   0.03964        17      1280:  26% 39/148 [03:34<09:32,  5.26s/it]
      0/99     21.5G   0.09746   0.03382    0.0396        16      1280:  26% 39/148 [03:39<09:32,  5.26s/it]
      0/99     21.5G   0.09746   0.03382    0.0396        16      1280:  27% 40/148 [03:39<09:25,  5.24s/it]
      0/99     21.5G   0.09692   0.03345    0.0395        18      1280:  27% 40/148 [03:44<09:25,  5.24s/it]
      0/99     21.5G   0.09692   0.03345    0.0395        18      1280:  28% 41/148 [03:44<09:18,  5.22s/it]
      0/99     21.5G   0.09618   0.03334    0.0393        23      1280:  28% 41/148 [03:49<09:18,  5.22s/it]
      0/99     21.5G   0.09618   0.03334    0.0393        23      1280:  28% 42/148 [03:49<09:14,  5.23s/it]
      0/99     21.5G   0.09566   0.03295   0.03918        12      1280:  28% 42/148 [03:55<09:14,  5.23s/it]
      0/99     21.5G   0.09566   0.03295   0.03918        12      1280:  29% 43/148 [03:55<09:10,  5.24s/it]
      0/99     21.5G   0.09514   0.03277   0.03899        21      1280:  29% 43/148 [04:00<09:10,  5.24s/it]
      0/99     21.5G   0.09514   0.03277   0.03899        21      1280:  30% 44/148 [04:00<09:05,  5.24s/it]
      0/99     21.5G   0.09457   0.03248   0.03881        16      1280:  30% 44/148 [04:05<09:05,  5.24s/it]
      0/99     21.5G   0.09457   0.03248   0.03881        16      1280:  30% 45/148 [04:05<08:58,  5.23s/it]
      0/99     21.5G   0.09407   0.03232   0.03866        21      1280:  30% 45/148 [04:10<08:58,  5.23s/it]
      0/99     21.5G   0.09407   0.03232   0.03866        21      1280:  31% 46/148 [04:10<08:55,  5.25s/it]
      0/99     21.5G   0.09346   0.03217   0.03852        20      1280:  31% 46/148 [04:16<08:55,  5.25s/it]
      0/99     21.5G   0.09346   0.03217   0.03852        20      1280:  32% 47/148 [04:16<08:47,  5.23s/it]
      0/99     21.5G   0.09273   0.03227   0.03832        29      1280:  32% 47/148 [04:21<08:47,  5.23s/it]
      0/99     21.5G   0.09273   0.03227   0.03832        29      1280:  32% 48/148 [04:21<08:43,  5.23s/it]
      0/99     21.5G   0.09208    0.0321   0.03811        18      1280:  32% 48/148 [04:26<08:43,  5.23s/it]
      0/99     21.5G   0.09208    0.0321   0.03811        18      1280:  33% 49/148 [04:26<08:41,  5.27s/it]
      0/99     21.5G   0.09158   0.03196   0.03786        20      1280:  33% 49/148 [04:31<08:41,  5.27s/it]
      0/99     21.5G   0.09158   0.03196   0.03786        20      1280:  34% 50/148 [04:31<08:36,  5.27s/it]
      0/99     21.5G   0.09112   0.03185   0.03766        23      1280:  34% 50/148 [04:37<08:36,  5.27s/it]
      0/99     21.5G   0.09112   0.03185   0.03766        23      1280:  34% 51/148 [04:37<08:30,  5.26s/it]
      0/99     21.5G   0.09058   0.03176   0.03749        21      1280:  34% 51/148 [04:42<08:30,  5.26s/it]
      0/99     21.5G   0.09058   0.03176   0.03749        21      1280:  35% 52/148 [04:42<08:20,  5.21s/it]
      0/99     21.5G    0.0901   0.03162   0.03729        22      1280:  35% 52/148 [04:47<08:20,  5.21s/it]
      0/99     21.5G    0.0901   0.03162   0.03729        22      1280:  36% 53/148 [04:47<08:16,  5.22s/it]
      0/99     21.5G   0.08952   0.03155   0.03716        20      1280:  36% 53/148 [04:52<08:16,  5.22s/it]
      0/99     21.5G   0.08952   0.03155   0.03716        20      1280:  36% 54/148 [04:52<08:09,  5.20s/it]
      0/99     21.5G   0.08881   0.03129   0.03692        12      1280:  36% 54/148 [04:57<08:09,  5.20s/it]
      0/99     21.5G   0.08881   0.03129   0.03692        12      1280:  37% 55/148 [04:57<08:00,  5.17s/it]
      0/99     21.5G   0.08828   0.03122   0.03682        23      1280:  37% 55/148 [05:02<08:00,  5.17s/it]
      0/99     21.5G   0.08828   0.03122   0.03682        23      1280:  38% 56/148 [05:02<07:54,  5.16s/it]
      0/99     21.5G    0.0878   0.03132   0.03663        32      1280:  38% 56/148 [05:08<07:54,  5.16s/it]
      0/99     21.5G    0.0878   0.03132   0.03663        32      1280:  39% 57/148 [05:08<07:48,  5.14s/it]
      0/99     21.5G   0.08735   0.03124   0.03645        23      1280:  39% 57/148 [05:13<07:48,  5.14s/it]
      0/99     21.5G   0.08735   0.03124   0.03645        23      1280:  39% 58/148 [05:13<07:41,  5.13s/it]
      0/99     21.5G    0.0869   0.03122   0.03627        27      1280:  39% 58/148 [05:18<07:41,  5.13s/it]
      0/99     21.5G    0.0869   0.03122   0.03627        27      1280:  40% 59/148 [05:18<07:39,  5.16s/it]
      0/99     21.5G   0.08649   0.03112   0.03606        21      1280:  40% 59/148 [05:23<07:39,  5.16s/it]
      0/99     21.5G   0.08649   0.03112   0.03606        21      1280:  41% 60/148 [05:23<07:34,  5.17s/it]
      0/99     21.5G   0.08614     0.031   0.03593        19      1280:  41% 60/148 [05:28<07:34,  5.17s/it]
      0/99     21.5G   0.08614     0.031   0.03593        19      1280:  41% 61/148 [05:28<07:34,  5.23s/it]
      0/99     21.5G   0.08561   0.03092   0.03564        24      1280:  41% 61/148 [05:34<07:34,  5.23s/it]
      0/99     21.5G   0.08561   0.03092   0.03564        24      1280:  42% 62/148 [05:34<07:28,  5.21s/it]
      0/99     21.5G   0.08514   0.03078   0.03549        17      1280:  42% 62/148 [05:39<07:28,  5.21s/it]
      0/99     21.5G   0.08514   0.03078   0.03549        17      1280:  43% 63/148 [05:39<07:22,  5.21s/it]
      0/99     21.5G   0.08474   0.03067   0.03525        19      1280:  43% 63/148 [05:44<07:22,  5.21s/it]
      0/99     21.5G   0.08474   0.03067   0.03525        19      1280:  43% 64/148 [05:44<07:13,  5.16s/it]
      0/99     21.5G   0.08431   0.03063   0.03502        23      1280:  43% 64/148 [05:49<07:13,  5.16s/it]
      0/99     21.5G   0.08431   0.03063   0.03502        23      1280:  44% 65/148 [05:49<07:09,  5.18s/it]
      0/99     21.5G   0.08386    0.0306   0.03476        27      1280:  44% 65/148 [05:54<07:09,  5.18s/it]
      0/99     21.5G   0.08386    0.0306   0.03476        27      1280:  45% 66/148 [05:54<07:03,  5.17s/it]
      0/99     21.5G    0.0834   0.03056   0.03451        23      1280:  45% 66/148 [05:59<07:03,  5.17s/it]
      0/99     21.5G    0.0834   0.03056   0.03451        23      1280:  45% 67/148 [05:59<06:58,  5.17s/it]
      0/99     21.5G   0.08308   0.03052    0.0343        22      1280:  45% 67/148 [06:04<06:58,  5.17s/it]
      0/99     21.5G   0.08308   0.03052    0.0343        22      1280:  46% 68/148 [06:04<06:50,  5.13s/it]
      0/99     21.5G   0.08266   0.03037   0.03418        16      1280:  46% 68/148 [06:09<06:50,  5.13s/it]
      0/99     21.5G   0.08266   0.03037   0.03418        16      1280:  47% 69/148 [06:09<06:42,  5.10s/it]
      0/99     21.5G   0.08229   0.03032   0.03391        22      1280:  47% 69/148 [06:15<06:42,  5.10s/it]
      0/99     21.5G   0.08229   0.03032   0.03391        22      1280:  47% 70/148 [06:15<06:41,  5.15s/it]
      0/99     21.5G   0.08184   0.03027   0.03378        21      1280:  47% 70/148 [06:20<06:41,  5.15s/it]
      0/99     21.5G   0.08184   0.03027   0.03378        21      1280:  48% 71/148 [06:20<06:37,  5.16s/it]
      0/99     21.5G   0.08163   0.03012   0.03356        17      1280:  48% 71/148 [06:25<06:37,  5.16s/it]
      0/99     21.5G   0.08163   0.03012   0.03356        17      1280:  49% 72/148 [06:25<06:32,  5.17s/it]
      0/99     21.5G   0.08134   0.03009   0.03335        24      1280:  49% 72/148 [06:30<06:32,  5.17s/it]
      0/99     21.5G   0.08134   0.03009   0.03335        24      1280:  49% 73/148 [06:30<06:30,  5.21s/it]
      0/99     21.5G   0.08105   0.03004   0.03313        26      1280:  49% 73/148 [06:35<06:30,  5.21s/it]
      0/99     21.5G   0.08105   0.03004   0.03313        26      1280:  50% 74/148 [06:36<06:22,  5.17s/it]
      0/99     21.5G   0.08076   0.02988   0.03302        15      1280:  50% 74/148 [06:41<06:22,  5.17s/it]
      0/99     21.5G   0.08076   0.02988   0.03302        15      1280:  51% 75/148 [06:41<06:18,  5.19s/it]
      0/99     21.5G   0.08024   0.02988   0.03276        27      1280:  51% 75/148 [06:46<06:18,  5.19s/it]
      0/99     21.5G   0.08024   0.02988   0.03276        27      1280:  51% 76/148 [06:46<06:13,  5.19s/it]
      0/99     21.5G   0.07997   0.02982   0.03257        22      1280:  51% 76/148 [06:51<06:13,  5.19s/it]
      0/99     21.5G   0.07997   0.02982   0.03257        22      1280:  52% 77/148 [06:51<06:05,  5.15s/it]
      0/99     21.5G   0.07961   0.02976   0.03236        23      1280:  52% 77/148 [06:56<06:05,  5.15s/it]
      0/99     21.5G   0.07961   0.02976   0.03236        23      1280:  53% 78/148 [06:56<05:58,  5.12s/it]
      0/99     21.5G   0.07921   0.02966   0.03215        18      1280:  53% 78/148 [07:01<05:58,  5.12s/it]
      0/99     21.5G   0.07921   0.02966   0.03215        18      1280:  53% 79/148 [07:01<05:53,  5.13s/it]
      0/99     21.5G   0.07888   0.02955   0.03198        17      1280:  53% 79/148 [07:06<05:53,  5.13s/it]
      0/99     21.5G   0.07888   0.02955   0.03198        17      1280:  54% 80/148 [07:06<05:48,  5.13s/it]
      0/99     21.5G    0.0785   0.02954   0.03176        26      1280:  54% 80/148 [07:11<05:48,  5.13s/it]
      0/99     21.5G    0.0785   0.02954   0.03176        26      1280:  55% 81/148 [07:11<05:44,  5.14s/it]
      0/99     21.5G   0.07826   0.02949   0.03158        27      1280:  55% 81/148 [07:17<05:44,  5.14s/it]
      0/99     21.5G   0.07826   0.02949   0.03158        27      1280:  55% 82/148 [07:17<05:39,  5.14s/it]
      0/99     21.5G   0.07797   0.02942   0.03137        21      1280:  55% 82/148 [07:22<05:39,  5.14s/it]
      0/99     21.5G   0.07797   0.02942   0.03137        21      1280:  56% 83/148 [07:22<05:34,  5.14s/it]
      0/99     21.5G   0.07772   0.02932   0.03122        20      1280:  56% 83/148 [07:27<05:34,  5.14s/it]
      0/99     21.5G   0.07772   0.02932   0.03122        20      1280:  57% 84/148 [07:27<05:29,  5.15s/it]
      0/99     21.5G   0.07741   0.02926   0.03102        22      1280:  57% 84/148 [07:32<05:29,  5.15s/it]
      0/99     21.5G   0.07741   0.02926   0.03102        22      1280:  57% 85/148 [07:32<05:24,  5.14s/it]
      0/99     21.5G   0.07713   0.02919   0.03086        20      1280:  57% 85/148 [07:37<05:24,  5.14s/it]
      0/99     21.5G   0.07713   0.02919   0.03086        20      1280:  58% 86/148 [07:37<05:19,  5.16s/it]
      0/99     21.5G   0.07676   0.02913   0.03067        23      1280:  58% 86/148 [07:42<05:19,  5.16s/it]
      0/99     21.5G   0.07676   0.02913   0.03067        23      1280:  59% 87/148 [07:42<05:14,  5.15s/it]
      0/99     21.5G   0.07667   0.02898   0.03047        20      1280:  59% 87/148 [07:48<05:14,  5.15s/it]
      0/99     21.5G   0.07667   0.02898   0.03047        20      1280:  59% 88/148 [07:48<05:09,  5.15s/it]
      0/99     21.5G   0.07636   0.02885   0.03026        15      1280:  59% 88/148 [07:53<05:09,  5.15s/it]
      0/99     21.5G   0.07636   0.02885   0.03026        15      1280:  60% 89/148 [07:53<05:05,  5.18s/it]
      0/99     21.5G   0.07617   0.02875   0.03012        20      1280:  60% 89/148 [07:58<05:05,  5.18s/it]
      0/99     21.5G   0.07617   0.02875   0.03012        20      1280:  61% 90/148 [07:58<05:01,  5.19s/it]
      0/99     21.5G   0.07593   0.02859   0.02993        13      1280:  61% 90/148 [08:03<05:01,  5.19s/it]
      0/99     21.5G   0.07593   0.02859   0.02993        13      1280:  61% 91/148 [08:03<04:55,  5.18s/it]
      0/99     21.5G   0.07569    0.0285   0.02973        23      1280:  61% 91/148 [08:08<04:55,  5.18s/it]
      0/99     21.5G   0.07569    0.0285   0.02973        23      1280:  62% 92/148 [08:08<04:48,  5.15s/it]
      0/99     21.5G   0.07544   0.02852   0.02957        31      1280:  62% 92/148 [08:13<04:48,  5.15s/it]
      0/99     21.5G   0.07544   0.02852   0.02957        31      1280:  63% 93/148 [08:13<04:42,  5.14s/it]
      0/99     21.5G   0.07528   0.02857   0.02948        44      1280:  63% 93/148 [08:19<04:42,  5.14s/it]
      0/99     21.5G   0.07528   0.02857   0.02948        44      1280:  64% 94/148 [08:19<04:38,  5.17s/it]
      0/99     21.5G   0.07504   0.02846   0.02929        20      1280:  64% 94/148 [08:24<04:38,  5.17s/it]
      0/99     21.5G   0.07504   0.02846   0.02929        20      1280:  64% 95/148 [08:24<04:33,  5.16s/it]
      0/99     21.5G   0.07486   0.02841   0.02914        26      1280:  64% 95/148 [08:29<04:33,  5.16s/it]
      0/99     21.5G   0.07486   0.02841   0.02914        26      1280:  65% 96/148 [08:29<04:29,  5.19s/it]
      0/99     21.5G   0.07459   0.02831   0.02894        18      1280:  65% 96/148 [08:34<04:29,  5.19s/it]
      0/99     21.5G   0.07459   0.02831   0.02894        18      1280:  66% 97/148 [08:34<04:24,  5.19s/it]
      0/99     21.5G   0.07448   0.02822   0.02879        23      1280:  66% 97/148 [08:39<04:24,  5.19s/it]
      0/99     21.5G   0.07448   0.02822   0.02879        23      1280:  66% 98/148 [08:39<04:19,  5.19s/it]
      0/99     21.5G   0.07422   0.02813   0.02864        17      1280:  66% 98/148 [08:45<04:19,  5.19s/it]
      0/99     21.5G   0.07422   0.02813   0.02864        17      1280:  67% 99/148 [08:45<04:14,  5.20s/it]
      0/99     21.5G   0.07409   0.02812   0.02848        31      1280:  67% 99/148 [08:50<04:14,  5.20s/it]
      0/99     21.5G   0.07409   0.02812   0.02848        31      1280:  68% 100/148 [08:50<04:09,  5.19s/it]
      0/99     21.5G   0.07385   0.02799    0.0283        14      1280:  68% 100/148 [08:55<04:09,  5.19s/it]
      0/99     21.5G   0.07385   0.02799    0.0283        14      1280:  68% 101/148 [08:55<04:03,  5.17s/it]
      0/99     21.5G   0.07362   0.02784   0.02811        11      1280:  68% 101/148 [09:00<04:03,  5.17s/it]
      0/99     21.5G   0.07362   0.02784   0.02811        11      1280:  69% 102/148 [09:00<03:58,  5.19s/it]
      0/99     21.5G   0.07344   0.02783   0.02795        28      1280:  69% 102/148 [09:05<03:58,  5.19s/it]
      0/99     21.5G   0.07344   0.02783   0.02795        28      1280:  70% 103/148 [09:05<03:53,  5.19s/it]
      0/99     21.5G   0.07319   0.02775   0.02781        19      1280:  70% 103/148 [09:10<03:53,  5.19s/it]
      0/99     21.5G   0.07319   0.02775   0.02781        19      1280:  70% 104/148 [09:10<03:47,  5.17s/it]
      0/99     21.5G   0.07306   0.02766   0.02767        20      1280:  70% 104/148 [09:16<03:47,  5.17s/it]
      0/99     21.5G   0.07306   0.02766   0.02767        20      1280:  71% 105/148 [09:16<03:42,  5.19s/it]
      0/99     21.5G   0.07292   0.02756   0.02751        20      1280:  71% 105/148 [09:21<03:42,  5.19s/it]
      0/99     21.5G   0.07292   0.02756   0.02751        20      1280:  72% 106/148 [09:21<03:37,  5.18s/it]
      0/99     21.5G   0.07276   0.02755   0.02735        34      1280:  72% 106/148 [09:26<03:37,  5.18s/it]
      0/99     21.5G   0.07276   0.02755   0.02735        34      1280:  72% 107/148 [09:26<03:32,  5.17s/it]
      0/99     21.5G   0.07253   0.02748   0.02719        21      1280:  72% 107/148 [09:31<03:32,  5.17s/it]
      0/99     21.5G   0.07253   0.02748   0.02719        21      1280:  73% 108/148 [09:31<03:26,  5.16s/it]
      0/99     21.5G    0.0724   0.02734   0.02702        15      1280:  73% 108/148 [09:36<03:26,  5.16s/it]
      0/99     21.5G    0.0724   0.02734   0.02702        15      1280:  74% 109/148 [09:36<03:22,  5.18s/it]
      0/99     21.5G    0.0723   0.02721   0.02692        16      1280:  74% 109/148 [09:41<03:22,  5.18s/it]
      0/99     21.5G    0.0723   0.02721   0.02692        16      1280:  74% 110/148 [09:41<03:16,  5.17s/it]
      0/99     21.5G   0.07217   0.02713   0.02676        21      1280:  74% 110/148 [09:47<03:16,  5.17s/it]
      0/99     21.5G   0.07217   0.02713   0.02676        21      1280:  75% 111/148 [09:47<03:12,  5.20s/it]
      0/99     21.5G   0.07203   0.02699   0.02661        13      1280:  75% 111/148 [09:52<03:12,  5.20s/it]
      0/99     21.5G   0.07203   0.02699   0.02661        13      1280:  76% 112/148 [09:52<03:06,  5.18s/it]
      0/99     21.5G   0.07184   0.02689   0.02644        18      1280:  76% 112/148 [09:57<03:06,  5.18s/it]
      0/99     21.5G   0.07184   0.02689   0.02644        18      1280:  76% 113/148 [09:57<03:01,  5.18s/it]
      0/99     21.5G   0.07167   0.02679    0.0263        14      1280:  76% 113/148 [10:02<03:01,  5.18s/it]
      0/99     21.5G   0.07167   0.02679    0.0263        14      1280:  77% 114/148 [10:02<02:56,  5.18s/it]
      0/99     21.5G   0.07165   0.02674   0.02617        29      1280:  77% 114/148 [10:07<02:56,  5.18s/it]
      0/99     21.5G   0.07165   0.02674   0.02617        29      1280:  78% 115/148 [10:07<02:50,  5.17s/it]
      0/99     21.5G    0.0716   0.02665   0.02604        22      1280:  78% 115/148 [10:13<02:50,  5.17s/it]
      0/99     21.5G    0.0716   0.02665   0.02604        22      1280:  78% 116/148 [10:13<02:46,  5.21s/it]
      0/99     21.5G   0.07151   0.02662   0.02591        31      1280:  78% 116/148 [10:18<02:46,  5.21s/it]
      0/99     21.5G   0.07151   0.02662   0.02591        31      1280:  79% 117/148 [10:18<02:41,  5.19s/it]
      0/99     21.5G   0.07148   0.02655   0.02578        29      1280:  79% 117/148 [10:23<02:41,  5.19s/it]
      0/99     21.5G   0.07148   0.02655   0.02578        29      1280:  80% 118/148 [10:23<02:34,  5.16s/it]
      0/99     21.5G   0.07141   0.02648   0.02563        23      1280:  80% 118/148 [10:28<02:34,  5.16s/it]
      0/99     21.5G   0.07141   0.02648   0.02563        23      1280:  80% 119/148 [10:28<02:29,  5.16s/it]
      0/99     21.5G   0.07135   0.02639    0.0255        23      1280:  80% 119/148 [10:33<02:29,  5.16s/it]
      0/99     21.5G   0.07135   0.02639    0.0255        23      1280:  81% 120/148 [10:33<02:23,  5.13s/it]
      0/99     21.5G   0.07127   0.02628   0.02537        16      1280:  81% 120/148 [10:38<02:23,  5.13s/it]
      0/99     21.5G   0.07127   0.02628   0.02537        16      1280:  82% 121/148 [10:38<02:18,  5.14s/it]
      0/99     21.5G   0.07108   0.02623   0.02526        22      1280:  82% 121/148 [10:44<02:18,  5.14s/it]
      0/99     21.5G   0.07108   0.02623   0.02526        22      1280:  82% 122/148 [10:44<02:14,  5.17s/it]
      0/99     21.5G   0.07101   0.02615   0.02513        24      1280:  82% 122/148 [10:49<02:14,  5.17s/it]
      0/99     21.5G   0.07101   0.02615   0.02513        24      1280:  83% 123/148 [10:49<02:08,  5.15s/it]
      0/99     21.5G   0.07088   0.02601   0.02498         7      1280:  83% 123/148 [10:54<02:08,  5.15s/it]
      0/99     21.5G   0.07088   0.02601   0.02498         7      1280:  84% 124/148 [10:54<02:03,  5.15s/it]
      0/99     21.5G   0.07076   0.02593   0.02488        18      1280:  84% 124/148 [10:59<02:03,  5.15s/it]
      0/99     21.5G   0.07076   0.02593   0.02488        18      1280:  84% 125/148 [10:59<01:59,  5.18s/it]
      0/99     21.5G   0.07068   0.02587   0.02475        26      1280:  84% 125/148 [11:04<01:59,  5.18s/it]
      0/99     21.5G   0.07068   0.02587   0.02475        26      1280:  85% 126/148 [11:04<01:53,  5.17s/it]
      0/99     21.5G   0.07062   0.02578   0.02462        24      1280:  85% 126/148 [11:09<01:53,  5.17s/it]
      0/99     21.5G   0.07062   0.02578   0.02462        24      1280:  86% 127/148 [11:09<01:48,  5.17s/it]
      0/99     21.5G   0.07049    0.0257   0.02448        21      1280:  86% 127/148 [11:15<01:48,  5.17s/it]
      0/99     21.5G   0.07049    0.0257   0.02448        21      1280:  86% 128/148 [11:15<01:44,  5.22s/it]
      0/99     21.5G   0.07035   0.02562    0.0244        17      1280:  86% 128/148 [11:20<01:44,  5.22s/it]
      0/99     21.5G   0.07035   0.02562    0.0244        17      1280:  87% 129/148 [11:20<01:38,  5.20s/it]
      0/99     21.5G    0.0702   0.02554   0.02428        20      1280:  87% 129/148 [11:25<01:38,  5.20s/it]
      0/99     21.5G    0.0702   0.02554   0.02428        20      1280:  88% 130/148 [11:25<01:33,  5.19s/it]
      0/99     21.5G   0.07006   0.02554   0.02416        30      1280:  88% 130/148 [11:30<01:33,  5.19s/it]
      0/99     21.5G   0.07006   0.02554   0.02416        30      1280:  89% 131/148 [11:30<01:28,  5.18s/it]
      0/99     21.5G   0.07002   0.02543   0.02406        15      1280:  89% 131/148 [11:35<01:28,  5.18s/it]
      0/99     21.5G   0.07002   0.02543   0.02406        15      1280:  89% 132/148 [11:35<01:23,  5.20s/it]
      0/99     21.5G   0.06992   0.02534   0.02394        17      1280:  89% 132/148 [11:41<01:23,  5.20s/it]
      0/99     21.5G   0.06992   0.02534   0.02394        17      1280:  90% 133/148 [11:41<01:18,  5.20s/it]
      0/99     21.5G   0.06975   0.02527   0.02384        17      1280:  90% 133/148 [11:46<01:18,  5.20s/it]
      0/99     21.5G   0.06975   0.02527   0.02384        17      1280:  91% 134/148 [11:46<01:12,  5.19s/it]
      0/99     21.5G   0.06961   0.02519   0.02371        19      1280:  91% 134/148 [11:51<01:12,  5.19s/it]
      0/99     21.5G   0.06961   0.02519   0.02371        19      1280:  91% 135/148 [11:51<01:07,  5.16s/it]
      0/99     21.5G   0.06958   0.02512    0.0236        24      1280:  91% 135/148 [11:56<01:07,  5.16s/it]
      0/99     21.5G   0.06958   0.02512    0.0236        24      1280:  92% 136/148 [11:56<01:01,  5.16s/it]
      0/99     21.5G   0.06949   0.02507   0.02349        25      1280:  92% 136/148 [12:01<01:01,  5.16s/it]
      0/99     21.5G   0.06949   0.02507   0.02349        25      1280:  93% 137/148 [12:01<00:57,  5.19s/it]
      0/99     21.5G    0.0694   0.02504   0.02339        31      1280:  93% 137/148 [12:06<00:57,  5.19s/it]
      0/99     21.5G    0.0694   0.02504   0.02339        31      1280:  93% 138/148 [12:06<00:51,  5.17s/it]
      0/99     21.5G    0.0693   0.02499   0.02329        23      1280:  93% 138/148 [12:12<00:51,  5.17s/it]
      0/99     21.5G    0.0693   0.02499   0.02329        23      1280:  94% 139/148 [12:12<00:46,  5.19s/it]
      0/99     21.5G   0.06923   0.02492   0.02318        22      1280:  94% 139/148 [12:17<00:46,  5.19s/it]
      0/99     21.5G   0.06923   0.02492   0.02318        22      1280:  95% 140/148 [12:17<00:41,  5.18s/it]
      0/99     21.5G   0.06913   0.02487   0.02307        25      1280:  95% 140/148 [12:22<00:41,  5.18s/it]
      0/99     21.5G   0.06913   0.02487   0.02307        25      1280:  95% 141/148 [12:22<00:36,  5.18s/it]
      0/99     21.5G   0.06908   0.02482   0.02299        22      1280:  95% 141/148 [12:27<00:36,  5.18s/it]
      0/99     21.5G   0.06908   0.02482   0.02299        22      1280:  96% 142/148 [12:27<00:30,  5.15s/it]
      0/99     21.5G   0.06899   0.02473   0.02288        18      1280:  96% 142/148 [12:32<00:30,  5.15s/it]
      0/99     21.5G   0.06899   0.02473   0.02288        18      1280:  97% 143/148 [12:32<00:25,  5.16s/it]
      0/99     21.5G   0.06892    0.0247    0.0228        39      1280:  97% 143/148 [12:37<00:25,  5.16s/it]
      0/99     21.5G   0.06892    0.0247    0.0228        39      1280:  97% 144/148 [12:37<00:20,  5.17s/it]
      0/99     21.5G   0.06881   0.02464    0.0227        24      1280:  97% 144/148 [12:43<00:20,  5.17s/it]
      0/99     21.5G   0.06881   0.02464    0.0227        24      1280:  98% 145/148 [12:43<00:15,  5.20s/it]
      0/99     21.5G   0.06878   0.02462   0.02261        38      1280:  98% 145/148 [12:48<00:15,  5.20s/it]
      0/99     21.5G   0.06878   0.02462   0.02261        38      1280:  99% 146/148 [12:48<00:10,  5.22s/it]
      0/99     21.5G   0.06871   0.02453    0.0225        14      1280:  99% 146/148 [12:53<00:10,  5.22s/it]
      0/99     21.5G   0.06871   0.02453    0.0225        14      1280:  99% 147/148 [12:53<00:05,  5.20s/it]
      0/99     17.4G   0.06859   0.02448    0.0224        24      1280:  99% 147/148 [13:06<00:05,  5.20s/it]
      0/99     17.4G   0.06859   0.02448    0.0224        24      1280: 100% 148/148 [13:06<00:00,  7.56s/it]
      0/99     17.4G   0.06859   0.02448    0.0224        24      1280: 100% 148/148 [13:06<00:00,  5.32s/it]

               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95:   0% 0/32 [00:00<?, ?it/s]
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95:   3% 1/32 [00:05<02:42,  5.23s/it]
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95:   6% 2/32 [00:08<02:16,  4.57s/it]
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95:   9% 3/32 [00:11<01:56,  4.03s/it]
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95:  12% 4/32 [00:13<01:42,  3.65s/it]
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95:  16% 5/32 [00:16<01:31,  3.40s/it]
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95:  19% 6/32 [00:19<01:23,  3.22s/it]
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95:  22% 7/32 [00:22<01:17,  3.11s/it]
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95:  25% 8/32 [00:25<01:12,  3.02s/it]
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95:  28% 9/32 [00:27<01:07,  2.93s/it]
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95:  31% 10/32 [00:30<01:03,  2.89s/it]
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95:  34% 11/32 [00:33<01:00,  2.88s/it]
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95:  38% 12/32 [00:36<00:56,  2.85s/it]
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95:  41% 13/32 [00:39<00:53,  2.84s/it]
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95:  44% 14/32 [00:41<00:50,  2.82s/it]
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95:  47% 15/32 [00:44<00:47,  2.80s/it]
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95:  50% 16/32 [00:47<00:44,  2.80s/it]
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95:  53% 17/32 [00:50<00:42,  2.85s/it]
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95:  56% 18/32 [00:53<00:39,  2.81s/it]
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95:  59% 19/32 [00:55<00:36,  2.81s/it]
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95:  62% 20/32 [00:58<00:33,  2.80s/it]
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95:  66% 21/32 [01:01<00:30,  2.79s/it]
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95:  69% 22/32 [01:04<00:27,  2.80s/it]
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95:  72% 23/32 [01:07<00:25,  2.80s/it]
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95:  75% 24/32 [01:09<00:22,  2.83s/it][E ProcessGroupNCCL.cpp:566] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=60000) ran for 66853 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:566] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=60000) ran for 66854 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:566] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=60000) ran for 66664 milliseconds before timing out.

               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95:  78% 25/32 [01:12<00:20,  2.86s/it][E ProcessGroupNCCL.cpp:325] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=60000) ran for 66664 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:325] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=60000) ran for 66854 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:325] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=60000) ran for 66853 milliseconds before timing out.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 1 (pid: 174) of binary: /opt/conda/bin/python3
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
INFO:torch.distributed.elastic.agent.server.api:[default] Worker group FAILED. 3/3 attempts left; will restart worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Stopping worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
  restart_count=1
  master_addr=127.0.0.1
  master_port=29500
  group_rank=0
  group_world_size=1
  local_ranks=[0, 1, 2, 3]
  role_ranks=[0, 1, 2, 3]
  global_ranks=[0, 1, 2, 3]
  role_world_sizes=[4, 4, 4, 4]
  global_world_sizes=[4, 4, 4, 4]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_wqpgf9x4/none_u1agcdp1/attempt_1/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic_wqpgf9x4/none_u1agcdp1/attempt_1/1/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker2 reply file to: /tmp/torchelastic_wqpgf9x4/none_u1agcdp1/attempt_1/2/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker3 reply file to: /tmp/torchelastic_wqpgf9x4/none_u1agcdp1/attempt_1/3/error.json
�[34m�[1mtrain: �[0mweights=yolov5m.pt, cfg=, data=../logos/logo.yaml, hyp=data/hyps/hyp.scratch.yaml, epochs=100, batch_size=64, imgsz=1280, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, evolve=None, bucket=, cache=None, image_weights=False, device=0,1,2,3, multi_scale=False, single_cls=False, adam=False, sync_bn=False, workers=0, project=results, entity=None, name=2021-08-13, exist_ok=True, quad=False, linear_lr=False, label_smoothing=0.0, upload_dataset=False, bbox_interval=-1, save_period=-1, artifact_alias=latest, local_rank=0, freeze=0
�[34m�[1mgithub: �[0mskipping check (Docker image), for updates see https://github.com/ultralytics/yolov5
YOLOv5 🚀 v5.0-360-gd9f23ed torch 1.9.0+cu102 CUDA:0 (Tesla V100S-PCIE-32GB, 32510.5MB)
                                              CUDA:1 (Tesla V100S-PCIE-32GB, 32510.5MB)
                                              CUDA:2 (Tesla V100S-PCIE-32GB, 32510.5MB)
                                              CUDA:3 (Tesla V100S-PCIE-32GB, 32510.5MB)

Added key: store_based_barrier_key:1 to store for rank: 0
/opt/conda/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 6 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=4, worker_count=8, timeout=0:01:00)
Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=4, worker_count=8, timeout=0:01:00)
Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=4, worker_count=8, timeout=0:01:00)
Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=4, worker_count=8, timeout=0:01:00)
Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=4, worker_count=8, timeout=0:01:00)
Traceback (most recent call last):
  File "train.py", line 600, in <module>
    main(opt)
  File "train.py", line 494, in main
    dist.init_process_group(backend="nccl" if dist.is_nccl_available() else "gloo", timeout=timedelta(seconds=60))
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 547, in init_process_group
    _store_based_barrier(rank, store, timeout)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 219, in _store_based_barrier
    raise RuntimeError(
RuntimeError: Timed out initializing process group in store based barrier on rank: 2, for key: store_based_barrier_key:1 (world_size=4, worker_count=8, timeout=0:01:00)
Traceback (most recent call last):
  File "train.py", line 600, in <module>
    main(opt)
  File "train.py", line 494, in main
    dist.init_process_group(backend="nccl" if dist.is_nccl_available() else "gloo", timeout=timedelta(seconds=60))
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 547, in init_process_group
    _store_based_barrier(rank, store, timeout)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 219, in _store_based_barrier
    raise RuntimeError(
RuntimeError: Timed out initializing process group in store based barrier on rank: 3, for key: store_based_barrier_key:1 (world_size=4, worker_count=8, timeout=0:01:00)
Traceback (most recent call last):
Traceback (most recent call last):
  File "train.py", line 600, in <module>
  File "train.py", line 600, in <module>
    main(opt)
  File "train.py", line 494, in main
    main(opt)
  File "train.py", line 494, in main
    dist.init_process_group(backend="nccl" if dist.is_nccl_available() else "gloo", timeout=timedelta(seconds=60))
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 547, in init_process_group
    dist.init_process_group(backend="nccl" if dist.is_nccl_available() else "gloo", timeout=timedelta(seconds=60))
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 547, in init_process_group
    _store_based_barrier(rank, store, timeout)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 219, in _store_based_barrier
    _store_based_barrier(rank, store, timeout)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 219, in _store_based_barrier
    raise RuntimeError(
    RuntimeErrorraise RuntimeError(: 
Timed out initializing process group in store based barrier on rank: 0, for key: store_based_barrier_key:1 (world_size=4, worker_count=8, timeout=0:01:00)
RuntimeError: Timed out initializing process group in store based barrier on rank: 1, for key: store_based_barrier_key:1 (world_size=4, worker_count=8, timeout=0:01:00)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 298) of binary: /opt/conda/bin/python3
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
INFO:torch.distributed.elastic.agent.server.api:[default] Worker group FAILED. 2/3 attempts left; will restart worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Stopping worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
  restart_count=2
  master_addr=127.0.0.1
  master_port=29500
  group_rank=0
  group_world_size=1
  local_ranks=[0, 1, 2, 3]
  role_ranks=[0, 1, 2, 3]
  global_ranks=[0, 1, 2, 3]
  role_world_sizes=[4, 4, 4, 4]
  global_world_sizes=[4, 4, 4, 4]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_wqpgf9x4/none_u1agcdp1/attempt_2/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic_wqpgf9x4/none_u1agcdp1/attempt_2/1/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker2 reply file to: /tmp/torchelastic_wqpgf9x4/none_u1agcdp1/attempt_2/2/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker3 reply file to: /tmp/torchelastic_wqpgf9x4/none_u1agcdp1/attempt_2/3/error.json
�[34m�[1mtrain: �[0mweights=yolov5m.pt, cfg=, data=../logos/logo.yaml, hyp=data/hyps/hyp.scratch.yaml, epochs=100, batch_size=64, imgsz=1280, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, evolve=None, bucket=, cache=None, image_weights=False, device=0,1,2,3, multi_scale=False, single_cls=False, adam=False, sync_bn=False, workers=0, project=results, entity=None, name=2021-08-13, exist_ok=True, quad=False, linear_lr=False, label_smoothing=0.0, upload_dataset=False, bbox_interval=-1, save_period=-1, artifact_alias=latest, local_rank=0, freeze=0
�[34m�[1mgithub: �[0mskipping check (Docker image), for updates see https://github.com/ultralytics/yolov5
YOLOv5 🚀 v5.0-360-gd9f23ed torch 1.9.0+cu102 CUDA:0 (Tesla V100S-PCIE-32GB, 32510.5MB)
                                              CUDA:1 (Tesla V100S-PCIE-32GB, 32510.5MB)
                                              CUDA:2 (Tesla V100S-PCIE-32GB, 32510.5MB)
                                              CUDA:3 (Tesla V100S-PCIE-32GB, 32510.5MB)

Added key: store_based_barrier_key:1 to store for rank: 0
Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=4, worker_count=12, timeout=0:01:00)
Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=4, worker_count=12, timeout=0:01:00)
Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=4, worker_count=12, timeout=0:01:00)
Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=4, worker_count=12, timeout=0:01:00)
Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=4, worker_count=12, timeout=0:01:00)
Traceback (most recent call last):
  File "train.py", line 600, in <module>
    main(opt)
  File "train.py", line 494, in main
    dist.init_process_group(backend="nccl" if dist.is_nccl_available() else "gloo", timeout=timedelta(seconds=60))
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 547, in init_process_group
    _store_based_barrier(rank, store, timeout)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 219, in _store_based_barrier
    raise RuntimeError(
RuntimeError: Timed out initializing process group in store based barrier on rank: 3, for key: store_based_barrier_key:1 (world_size=4, worker_count=12, timeout=0:01:00)
Traceback (most recent call last):
  File "train.py", line 600, in <module>
Traceback (most recent call last):
  File "train.py", line 600, in <module>
    main(opt)
  File "train.py", line 494, in main
    dist.init_process_group(backend="nccl" if dist.is_nccl_available() else "gloo", timeout=timedelta(seconds=60))
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 547, in init_process_group
        _store_based_barrier(rank, store, timeout)main(opt)

  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 219, in _store_based_barrier
  File "train.py", line 494, in main
    raise RuntimeError(
RuntimeError: Timed out initializing process group in store based barrier on rank: 1, for key: store_based_barrier_key:1 (world_size=4, worker_count=12, timeout=0:01:00)
    dist.init_process_group(backend="nccl" if dist.is_nccl_available() else "gloo", timeout=timedelta(seconds=60))
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 547, in init_process_group
    _store_based_barrier(rank, store, timeout)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 219, in _store_based_barrier
    raise RuntimeError(
RuntimeError: Timed out initializing process group in store based barrier on rank: 2, for key: store_based_barrier_key:1 (world_size=4, worker_count=12, timeout=0:01:00)
Traceback (most recent call last):
  File "train.py", line 600, in <module>
    main(opt)
  File "train.py", line 494, in main
    dist.init_process_group(backend="nccl" if dist.is_nccl_available() else "gloo", timeout=timedelta(seconds=60))
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 547, in init_process_group
    _store_based_barrier(rank, store, timeout)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 219, in _store_based_barrier
    raise RuntimeError(
RuntimeError: Timed out initializing process group in store based barrier on rank: 0, for key: store_based_barrier_key:1 (world_size=4, worker_count=12, timeout=0:01:00)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 346) of binary: /opt/conda/bin/python3
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
INFO:torch.distributed.elastic.agent.server.api:[default] Worker group FAILED. 1/3 attempts left; will restart worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Stopping worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
  restart_count=3
  master_addr=127.0.0.1
  master_port=29500
  group_rank=0
  group_world_size=1
  local_ranks=[0, 1, 2, 3]
  role_ranks=[0, 1, 2, 3]
  global_ranks=[0, 1, 2, 3]
  role_world_sizes=[4, 4, 4, 4]
  global_world_sizes=[4, 4, 4, 4]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_wqpgf9x4/none_u1agcdp1/attempt_3/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic_wqpgf9x4/none_u1agcdp1/attempt_3/1/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker2 reply file to: /tmp/torchelastic_wqpgf9x4/none_u1agcdp1/attempt_3/2/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker3 reply file to: /tmp/torchelastic_wqpgf9x4/none_u1agcdp1/attempt_3/3/error.json
�[34m�[1mtrain: �[0mweights=yolov5m.pt, cfg=, data=../logos/logo.yaml, hyp=data/hyps/hyp.scratch.yaml, epochs=100, batch_size=64, imgsz=1280, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, evolve=None, bucket=, cache=None, image_weights=False, device=0,1,2,3, multi_scale=False, single_cls=False, adam=False, sync_bn=False, workers=0, project=results, entity=None, name=2021-08-13, exist_ok=True, quad=False, linear_lr=False, label_smoothing=0.0, upload_dataset=False, bbox_interval=-1, save_period=-1, artifact_alias=latest, local_rank=0, freeze=0
�[34m�[1mgithub: �[0mskipping check (Docker image), for updates see https://github.com/ultralytics/yolov5
YOLOv5 🚀 v5.0-360-gd9f23ed torch 1.9.0+cu102 CUDA:0 (Tesla V100S-PCIE-32GB, 32510.5MB)
                                              CUDA:1 (Tesla V100S-PCIE-32GB, 32510.5MB)
                                              CUDA:2 (Tesla V100S-PCIE-32GB, 32510.5MB)
                                              CUDA:3 (Tesla V100S-PCIE-32GB, 32510.5MB)

Added key: store_based_barrier_key:1 to store for rank: 0
Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=4, worker_count=16, timeout=0:01:00)
Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=4, worker_count=16, timeout=0:01:00)
Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=4, worker_count=16, timeout=0:01:00)
Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=4, worker_count=16, timeout=0:01:00)
Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=4, worker_count=16, timeout=0:01:00)
Traceback (most recent call last):
  File "train.py", line 600, in <module>
    main(opt)
  File "train.py", line 494, in main
    dist.init_process_group(backend="nccl" if dist.is_nccl_available() else "gloo", timeout=timedelta(seconds=60))
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 547, in init_process_group
    _store_based_barrier(rank, store, timeout)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 219, in _store_based_barrier
    raise RuntimeError(
RuntimeError: Timed out initializing process group in store based barrier on rank: 2, for key: store_based_barrier_key:1 (world_size=4, worker_count=16, timeout=0:01:00)
Traceback (most recent call last):
  File "train.py", line 600, in <module>
    main(opt)
  File "train.py", line 494, in main
    dist.init_process_group(backend="nccl" if dist.is_nccl_available() else "gloo", timeout=timedelta(seconds=60))
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 547, in init_process_group
    _store_based_barrier(rank, store, timeout)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 219, in _store_based_barrier
    raise RuntimeError(
RuntimeError: Timed out initializing process group in store based barrier on rank: 1, for key: store_based_barrier_key:1 (world_size=4, worker_count=16, timeout=0:01:00)
Traceback (most recent call last):
  File "train.py", line 600, in <module>
    main(opt)
  File "train.py", line 494, in main
    dist.init_process_group(backend="nccl" if dist.is_nccl_available() else "gloo", timeout=timedelta(seconds=60))
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 547, in init_process_group
    _store_based_barrier(rank, store, timeout)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 219, in _store_based_barrier
    raise RuntimeError(
RuntimeError: Timed out initializing process group in store based barrier on rank: 3, for key: store_based_barrier_key:1 (world_size=4, worker_count=16, timeout=0:01:00)
Traceback (most recent call last):
  File "train.py", line 600, in <module>
    main(opt)
  File "train.py", line 494, in main
    dist.init_process_group(backend="nccl" if dist.is_nccl_available() else "gloo", timeout=timedelta(seconds=60))
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 547, in init_process_group
    _store_based_barrier(rank, store, timeout)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 219, in _store_based_barrier
    raise RuntimeError(
RuntimeError: Timed out initializing process group in store based barrier on rank: 0, for key: store_based_barrier_key:1 (world_size=4, worker_count=16, timeout=0:01:00)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 390) of binary: /opt/conda/bin/python3
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
INFO:torch.distributed.elastic.agent.server.api:Local worker group finished (FAILED). Waiting 300 seconds for other agents to finish
/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/utils/store.py:70: FutureWarning: This is an experimental API and will be changed in future.
  warnings.warn(
INFO:torch.distributed.elastic.agent.server.api:Done waiting for other agents. Elapsed: 0.0009250640869140625 seconds
{"name": "torchelastic.worker.status.FAILED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 0, "group_rank": 0, "worker_id": "390", "role": "default", "hostname": "job-155b782d-12d0-457b-ada3-ee678ed0e091", "state": "FAILED", "total_run_time": 1071, "rdzv_backend": "static", "raw_error": "{\"message\": \"<NONE>\"}", "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python3\", \"local_rank\": [0], \"role_rank\": [0], \"role_world_size\": [4]}", "agent_restarts": 3}}
{"name": "torchelastic.worker.status.FAILED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 1, "group_rank": 0, "worker_id": "391", "role": "default", "hostname": "job-155b782d-12d0-457b-ada3-ee678ed0e091", "state": "FAILED", "total_run_time": 1071, "rdzv_backend": "static", "raw_error": "{\"message\": \"<NONE>\"}", "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python3\", \"local_rank\": [1], \"role_rank\": [1], \"role_world_size\": [4]}", "agent_restarts": 3}}
{"name": "torchelastic.worker.status.FAILED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 2, "group_rank": 0, "worker_id": "392", "role": "default", "hostname": "job-155b782d-12d0-457b-ada3-ee678ed0e091", "state": "FAILED", "total_run_time": 1071, "rdzv_backend": "static", "raw_error": "{\"message\": \"<NONE>\"}", "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python3\", \"local_rank\": [2], \"role_rank\": [2], \"role_world_size\": [4]}", "agent_restarts": 3}}
{"name": "torchelastic.worker.status.FAILED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 3, "group_rank": 0, "worker_id": "393", "role": "default", "hostname": "job-155b782d-12d0-457b-ada3-ee678ed0e091", "state": "FAILED", "total_run_time": 1071, "rdzv_backend": "static", "raw_error": "{\"message\": \"<NONE>\"}", "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python3\", \"local_rank\": [3], \"role_rank\": [3], \"role_world_size\": [4]}", "agent_restarts": 3}}
{"name": "torchelastic.worker.status.SUCCEEDED", "source": "AGENT", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": null, "group_rank": 0, "worker_id": null, "role": "default", "hostname": "job-155b782d-12d0-457b-ada3-ee678ed0e091", "state": "SUCCEEDED", "total_run_time": 1071, "rdzv_backend": "static", "raw_error": null, "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python3\"}", "agent_restarts": 3}}
/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py:354: UserWarning: 

**********************************************************************
               CHILD PROCESS FAILED WITH NO ERROR_FILE                
**********************************************************************
CHILD PROCESS FAILED WITH NO ERROR_FILE
Child process 390 (local_rank 0) FAILED (exitcode 1)
Error msg: Process failed with exitcode 1
Without writing an error file to <N/A>.
While this DOES NOT affect the correctness of your application,
no trace information about the error will be available for inspection.
Consider decorating your top level entrypoint function with
torch.distributed.elastic.multiprocessing.errors.record. Example:

  from torch.distributed.elastic.multiprocessing.errors import record

  @record
  def trainer_main(args):
     # do train
**********************************************************************
  warnings.warn(_no_error_file_warning_msg(rank, failure))
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 173, in <module>
    main()
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 169, in main
    run(args)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 621, in run
    elastic_launch(
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 116, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
    return f(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
***************************************
            train.py FAILED            
=======================================
Root Cause:
[0]:
  time: 2021-08-13_15:41:42
  rank: 0 (local_rank: 0)
  exitcode: 1 (pid: 390)
  error_file: <N/A>
  msg: "Process failed with exitcode 1"
=======================================
Other Failures:
[1]:
  time: 2021-08-13_15:41:42
  rank: 1 (local_rank: 1)
  exitcode: 1 (pid: 391)
  error_file: <N/A>
  msg: "Process failed with exitcode 1"
[2]:
  time: 2021-08-13_15:41:42
  rank: 2 (local_rank: 2)
  exitcode: 1 (pid: 392)
  error_file: <N/A>
  msg: "Process failed with exitcode 1"
[3]:
  time: 2021-08-13_15:41:42
  rank: 3 (local_rank: 3)
  exitcode: 1 (pid: 393)
  error_file: <N/A>
  msg: "Process failed with exitcode 1"
***************************************

Expected behavior

The training should keep going after the first epoch is over and the first validation step is over.

Environment

I'm using OVH AI Cloud to train. Using the Docker image described above (basically the official one).

  • Yolo v5.0-360-gd9f23ed torch 1.9.0+cu102
  • CUDA:0 (Tesla V100S-PCIE-32GB, 32510.5MB)
  • CUDA:1 (Tesla V100S-PCIE-32GB, 32510.5MB)
  • CUDA:2 (Tesla V100S-PCIE-32GB, 32510.5MB)
  • CUDA:3 (Tesla V100S-PCIE-32GB, 32510.5MB)

Ressources for the job:
Cpu: 13
Memory: 40.0 GiB
Public Network: 1.5 Gbps
Private Network: 0 bps
Ephemeral Storage: 650.0 GiB
Gpu Model: Tesla-V100S
Gpu Brand: NVIDIA
Gpu Memory: 32.0 GiB
Flavor: ai1-1-gpu

Additional context

I didn't encounter this problem with only 500 images a few days ago, with 4 GPUs. I encountered multiple problems today due to the cache argument being used but not that is gone, I can't find the reason why it's failing at the end of the first validation step (around 900 images).

@kinoute kinoute added the bug Something isn't working label Aug 13, 2021
@iceisfun
Copy link

I seem to be able to train fine on 3 or 4 gpus. I don't have a V100 machine to test on however I suspect something else is going on here.

Using old method it works fine -- python train.py --img 640 --batch 30 --epochs 20 --data /data/dataset/coco.yaml --weights yolov5s.pt

GPU 0: NVIDIA TITAN RTX (UUID: GPU-31995e9a-cb40-7de8-e9f3-24da46f6445c)
GPU 1: NVIDIA TITAN RTX (UUID: GPU-be09f2ae-8083-58b1-01df-ebcef7d4a708)
GPU 2: NVIDIA TITAN RTX (UUID: GPU-e56556aa-3e26-53a8-6837-9090380b4564)
GPU 3: NVIDIA TITAN RTX (UUID: GPU-1e84d093-865f-d73c-b7e9-52e82bfa67f5)

Using newer method it works fine python3 -m torch.distributed.launch -- python3 -m torch.distributed.launch --nproc_per_node 4 train.py --epochs 10 --batch-size 64 --device 0,1,2,3

GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-64e9c581-61c9-c378-ef93-7c5a0275c14e)
GPU 1: NVIDIA A100-SXM4-40GB (UUID: GPU-967ece83-6868-e140-305f-c1ae8d0ae3ba)
GPU 2: NVIDIA A100-SXM4-40GB (UUID: GPU-457ccc84-3f32-993e-f83f-7c0c4faa256c)
GPU 3: NVIDIA A100-SXM4-40GB (UUID: GPU-ce4d7cd2-927d-ea55-de86-6e6484a90406)

root@container:/usr/src/app# python3 -m torch.distributed.launch --nproc_per_node 4 train.py --epochs 10 --batch-size 64 --device 0,1,2,3
/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py:163: DeprecationWarning: The 'warn' method is deprecated, use 'warning' instead
  logger.warn(
The module torch.distributed.launch is deprecated and going to be removed in future.Migrate to torch.distributed.run
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
WARNING:torch.distributed.run:`torch.distributed.launch` is Deprecated. Use torch.distributed.run
INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
  entrypoint       : train.py
  min_nodes        : 1
  max_nodes        : 1
  nproc_per_node   : 4
  run_id           : none
  rdzv_backend     : static
  rdzv_endpoint    : 127.0.0.1:29500
  rdzv_configs     : {'rank': 0, 'timeout': 900}
  max_restarts     : 3
  monitor_interval : 5
  log_dir          : None
  metrics_cfg      : {}

INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic_jvtzrjoi/none_yo7k7_ei
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python3
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/utils/store.py:52: FutureWarning: This is an experimental API and will be changed in future.
  warnings.warn(
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
  restart_count=0
  master_addr=127.0.0.1
  master_port=29500
  group_rank=0
  group_world_size=1
  local_ranks=[0, 1, 2, 3]
  role_ranks=[0, 1, 2, 3]
  global_ranks=[0, 1, 2, 3]
  role_world_sizes=[4, 4, 4, 4]
  global_world_sizes=[4, 4, 4, 4]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_jvtzrjoi/none_yo7k7_ei/attempt_0/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic_jvtzrjoi/none_yo7k7_ei/attempt_0/1/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker2 reply file to: /tmp/torchelastic_jvtzrjoi/none_yo7k7_ei/attempt_0/2/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker3 reply file to: /tmp/torchelastic_jvtzrjoi/none_yo7k7_ei/attempt_0/3/error.json
train: weights=yolov5s.pt, cfg=, data=data/coco128.yaml, hyp=data/hyps/hyp.scratch.yaml, epochs=10, batch_size=64, imgsz=640, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, evolve=None, bucket=, cache=None, image_weights=False, device=0,1,2,3, multi_scale=False, single_cls=False, adam=False, sync_bn=False, workers=8, project=runs/train, entity=None, name=exp, exist_ok=False, quad=False, linear_lr=False, label_smoothing=0.0, upload_dataset=False, bbox_interval=-1, save_period=-1, artifact_alias=latest, local_rank=0, freeze=0
github: skipping check (not a git repository), for updates see https://github.com/ultralytics/yolov5
YOLOv5 🚀 2021-8-12 torch 1.9.0a0+c3d40fd CUDA:0 (NVIDIA A100-SXM4-40GB, 40536.1875MB)
                                          CUDA:1 (NVIDIA A100-SXM4-40GB, 40536.1875MB)
                                          CUDA:2 (NVIDIA A100-SXM4-40GB, 40536.1875MB)
                                          CUDA:3 (NVIDIA A100-SXM4-40GB, 40536.1875MB)

Added key: store_based_barrier_key:1 to store for rank: 0
Rank 0: Completed store-based barrier for 4 nodes.
hyperparameters: lr0=0.01, lrf=0.2, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, cls=0.5, cls_pw=1.0, obj=1.0, obj_pw=1.0, iou_t=0.2, anchor_t=4.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0, copy_paste=0.0
[W ProcessGroupNCCL.cpp:1618] Rank 3 using best-guess GPU 3 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
Weights & Biases: run 'pip install wandb' to automatically track and visualize YOLOv5 🚀 runs (RECOMMENDED)
TensorBoard: Start with 'tensorboard --logdir runs/train', view at http://localhost:6006/
[W ProcessGroupNCCL.cpp:1618] Rank 1 using best-guess GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
[W ProcessGroupNCCL.cpp:1618] Rank 2 using best-guess GPU 2 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.

WARNING: Dataset not found, nonexistent paths: ['/usr/src/datasets/coco128/images/train2017']
Downloading https://github.com/ultralytics/yolov5/releases/download/v1.0/coco128.zip ...
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6.66M/6.66M [00:00<00:00, 19.3MB/s]
Dataset autodownload success

[W ProcessGroupNCCL.cpp:1618] Rank 0 using best-guess GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.

                 from  n    params  module                                  arguments                     
  0                -1  1      3520  models.common.Focus                     [3, 32, 3]                    
  1                -1  1     18560  models.common.Conv                      [32, 64, 3, 2]                
  2                -1  1     18816  models.common.C3                        [64, 64, 1]                   
  3                -1  1     73984  models.common.Conv                      [64, 128, 3, 2]               
  4                -1  3    156928  models.common.C3                        [128, 128, 3]                 
  5                -1  1    295424  models.common.Conv                      [128, 256, 3, 2]              
  6                -1  3    625152  models.common.C3                        [256, 256, 3]                 
  7                -1  1   1180672  models.common.Conv                      [256, 512, 3, 2]              
  8                -1  1    656896  models.common.SPP                       [512, 512, [5, 9, 13]]        
  9                -1  1   1182720  models.common.C3                        [512, 512, 1, False]          
 10                -1  1    131584  models.common.Conv                      [512, 256, 1, 1]              
 11                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']          
 12           [-1, 6]  1         0  models.common.Concat                    [1]                           
 13                -1  1    361984  models.common.C3                        [512, 256, 1, False]          
 14                -1  1     33024  models.common.Conv                      [256, 128, 1, 1]              
 15                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']          
 16           [-1, 4]  1         0  models.common.Concat                    [1]                           
 17                -1  1     90880  models.common.C3                        [256, 128, 1, False]          
 18                -1  1    147712  models.common.Conv                      [128, 128, 3, 2]              
 19          [-1, 14]  1         0  models.common.Concat                    [1]                           
 20                -1  1    296448  models.common.C3                        [256, 256, 1, False]          
 21                -1  1    590336  models.common.Conv                      [256, 256, 3, 2]              
 22          [-1, 10]  1         0  models.common.Concat                    [1]                           
 23                -1  1   1182720  models.common.C3                        [512, 512, 1, False]          
 24      [17, 20, 23]  1    229245  models.yolo.Detect                      [80, [[10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119], [116, 90, 156, 198, 373, 326]], [128, 256, 512]]
Model Summary: 283 layers, 7276605 parameters, 7276605 gradients, 17.1 GFLOPs

Transferred 362/362 items from yolov5s.pt
Scaled weight_decay = 0.0005
optimizer: SGD with parameter groups 59 weight, 62 weight (no decay), 62 bias
train: Scanning '../datasets/coco128/labels/train2017' images and labels...128 found, 0 missing, 2 empty, 0 corrupted: 100%|█████████████████████████████████████████████████████████████████████| 128/128 [00:00<00:00, 3834.41it/s]
train: New cache created: ../datasets/coco128/labels/train2017.cache
train: Scanning '../datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|████████████████████████████████████████████████████████████████████████| 128/128 [00:00<?, ?it/s]
train: Scanning '../datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|████████████████████████████████████████████████████████████████████████| 128/128 [00:00<?, ?it/s]
train: Scanning '../datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|████████████████████████████████████████████████████████████████████████| 128/128 [00:00<?, ?it/s]
val: Scanning '../datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|██████████████████████████████████████████████████████████████████████████| 128/128 [00:00<?, ?it/s][W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
val: Scanning '../datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|██████████████████████████████████████████████████████████████████████████| 128/128 [00:00<?, ?it/s]
val: Scanning '../datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|██████████████████████████████████████████████████████████████████████████| 128/128 [00:00<?, ?it/s]
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
val: Scanning '../datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|██████████████████████████████████████████████████████████████████████████| 128/128 [00:00<?, ?it/s]
val: Scanning '../datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|██████████████████████████████████████████████████████████████████████████| 128/128 [00:00<?, ?it/s]
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
Plotting labels... 
val: Scanning '../datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|██████████████████████████████████████████████████████████████████████████| 128/128 [00:00<?, ?it/s]
val: Scanning '../datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|██████████████████████████████████████████████████████████████████████████| 128/128 [00:00<?, ?it/s]
val: Scanning '../datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|██████████████████████████████████████████████████████████████████████████| 128/128 [00:00<?, ?it/s]
val: Scanning '../datasets/coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100%|██████████████████████████████████████████████████████████████████████████| 128/128 [00:00<?, ?it/s]

autoanchor: Analyzing anchors... anchors/target = 4.28, Best Possible Recall (BPR) = 0.9946
Image sizes 640 train, 640 val
Using 8 dataloader workers
Logging results to runs/train/exp
Starting training for 10 epochs...

     Epoch   gpu_mem       box       obj       cls    labels  img_size
       0/9      2.9G   0.04236   0.06898   0.01796       229       640:  50%|█████████████████████████████████████████████████████████████                                                             | 1/2 [00:06<00:06,  6.12s/it]Reducer buckets have been rebuilt in this iteration.
       0/9     6.65G   0.04388    0.0629   0.01675       172       640: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:06<00:00,  3.13s/it]
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:02<00:00,  1.71it/s]
                 all        128        929      0.735      0.552      0.651      0.422

     Epoch   gpu_mem       box       obj       cls    labels  img_size
       1/9     6.17G   0.04767   0.07541   0.01793       247       640: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  8.67it/s]
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00,  7.59it/s]
                 all        128        929      0.697       0.57      0.651      0.421

     Epoch   gpu_mem       box       obj       cls    labels  img_size
       2/9     6.17G   0.04485   0.07673   0.01617       269       640: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 10.76it/s]
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00,  7.44it/s]
                 all        128        929      0.699       0.57      0.652      0.423

     Epoch   gpu_mem       box       obj       cls    labels  img_size
       3/9     6.17G   0.05006   0.06113   0.02123       218       640: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 10.01it/s]
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00,  7.62it/s]
                 all        128        929      0.686      0.576      0.654      0.428


     Epoch   gpu_mem       box       obj       cls    labels  img_size
       4/9     6.17G   0.04093   0.06771   0.01907       191       640: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 10.02it/s]
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00,  8.18it/s]
                 all        128        929      0.717       0.57      0.654      0.429

     Epoch   gpu_mem       box       obj       cls    labels  img_size
       5/9     6.17G   0.04414   0.06721   0.01828       242       640: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  9.58it/s]
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00,  7.48it/s]
                 all        128        929      0.752      0.548      0.652      0.429

     Epoch   gpu_mem       box       obj       cls    labels  img_size
       6/9     6.17G   0.04034   0.05063   0.01799       124       640: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  5.34it/s]
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00,  7.27it/s]
                 all        128        929      0.744      0.556      0.656       0.43

     Epoch   gpu_mem       box       obj       cls    labels  img_size
       7/9     6.17G   0.04331   0.06853     0.017       194       640: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 10.46it/s]
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00,  7.68it/s]
                 all        128        929      0.711      0.564      0.656      0.429

     Epoch   gpu_mem       box       obj       cls    labels  img_size
       8/9     6.17G   0.04199   0.05638   0.01743       177       640: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 10.08it/s]
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00,  7.73it/s]
                 all        128        929      0.572      0.659      0.656      0.429

     Epoch   gpu_mem       box       obj       cls    labels  img_size
       9/9     6.17G    0.0436   0.06589   0.01898       182       640: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 10.47it/s]
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:01<00:00,  2.49it/s]
                 all        128        929      0.636      0.622      0.659      0.433

10 epochs completed in 0.008 hours.
Optimizer stripped from runs/train/exp/weights/last.pt, 14.8MB
Optimizer stripped from runs/train/exp/weights/best.pt, 14.8MB
Results saved to runs/train/exp
Destroying process group... Done.

INFO:torch.distributed.elastic.agent.server.api:[default] worker group successfully finished. Waiting 300 seconds for other agents to finish.
INFO:torch.distributed.elastic.agent.server.api:Local worker group finished (SUCCEEDED). Waiting 300 seconds for other agents to finish
/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/utils/store.py:70: FutureWarning: This is an experimental API and will be changed in future.
  warnings.warn(
INFO:torch.distributed.elastic.agent.server.api:Done waiting for other agents. Elapsed: 0.0005681514739990234 seconds
{"name": "torchelastic.worker.status.SUCCEEDED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 0, "group_rank": 0, "worker_id": "368", "role": "default", "hostname": "container", "state": "SUCCEEDED", "total_run_time": 50, "rdzv_backend": "static", "raw_error": null, "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python3\", \"local_rank\": [0], \"role_rank\": [0], \"role_world_size\": [4]}", "agent_restarts": 0}}
{"name": "torchelastic.worker.status.SUCCEEDED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 1, "group_rank": 0, "worker_id": "369", "role": "default", "hostname": "container", "state": "SUCCEEDED", "total_run_time": 50, "rdzv_backend": "static", "raw_error": null, "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python3\", \"local_rank\": [1], \"role_rank\": [1], \"role_world_size\": [4]}", "agent_restarts": 0}}
{"name": "torchelastic.worker.status.SUCCEEDED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 2, "group_rank": 0, "worker_id": "370", "role": "default", "hostname": "container", "state": "SUCCEEDED", "total_run_time": 50, "rdzv_backend": "static", "raw_error": null, "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python3\", \"local_rank\": [2], \"role_rank\": [2], \"role_world_size\": [4]}", "agent_restarts": 0}}
{"name": "torchelastic.worker.status.SUCCEEDED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 3, "group_rank": 0, "worker_id": "371", "role": "default", "hostname": "container", "state": "SUCCEEDED", "total_run_time": 50, "rdzv_backend": "static", "raw_error": null, "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python3\", \"local_rank\": [3], \"role_rank\": [3], \"role_world_size\": [4]}", "agent_restarts": 0}}
{"name": "torchelastic.worker.status.SUCCEEDED", "source": "AGENT", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": null, "group_rank": 0, "worker_id": null, "role": "default", "hostname": "container", "state": "SUCCEEDED", "total_run_time": 50, "rdzv_backend": "static", "raw_error": null, "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python3\"}", "agent_restarts": 0}}
root@container:/usr/src/app# 

@glenn-jocher
Copy link
Member

@kinoute I see the same error on DDP in Docker with CUDA 11.1 now when training COCO. Not sure what the cause it, but it appears related to the dist.barrier() ops. If I install PyTorch nightly it actually happens before training even starts.

It usually happens 60 seconds after a dist.barrier() call because the DDP ops have a 60 second timeout in place here:

yolov5/train.py

Line 496 in 4e65052

dist.init_process_group(backend="nccl" if dist.is_nccl_available() else "gloo", timeout=timedelta(seconds=60))

I'm not sure what would happen if I remove this line, I think the default timeout was 30 minutes.

@glenn-jocher
Copy link
Member

@kinoute you might also try the gloo DPP backend, though is slower in my experience.

@glenn-jocher glenn-jocher linked a pull request Aug 15, 2021 that will close this issue
@glenn-jocher
Copy link
Member

glenn-jocher commented Aug 15, 2021

@kinoute @iceisfun good news 😃! Your original issue may now be fixed ✅ in PR #4422.

This PR updates the DDP process group, and was verified over 3 epochs of COCO training with 4x A100 DDP NCCL on EC2 P4d instance with official Docker image and CUDA 11.1 pip install from https://pytorch.org/get-started/locally/

d=yolov5 && git clone https://github.com/ultralytics/yolov5 -b master $d && cd $d

python -m torch.distributed.launch --nproc_per_node 4 --master_port 1 train.py --data coco.yaml --batch 64 --weights '' --project study --cfg yolov5l.yaml --epochs 300 --name yolov5l-1280 --img 1280 --linear --device 0,1,2,3
python -m torch.distributed.launch --nproc_per_node 4 --master_port 2 train.py --data coco.yaml --batch 64 --weights '' --project study --cfg yolov5l.yaml --epochs 300 --name yolov5l-1280 --img 1280 --linear --device 4,5,6,7

To receive this update:

  • Gitgit pull from within your yolov5/ directory or git clone https://github.com/ultralytics/yolov5 again
  • PyTorch Hub – Force-reload with model = torch.hub.load('ultralytics/yolov5', 'yolov5s', force_reload=True)
  • Notebooks – View updated notebooks Open In Colab Open In Kaggle
  • Dockersudo docker pull ultralytics/yolov5:latest to update your image Docker Pulls

Thank you for spotting this issue and informing us of the problem. Please let us know if this update resolves the issue for you, and feel free to inform us of any other issues you discover or feature requests that come to mind. Happy trainings with YOLOv5 🚀!

@iceisfun
Copy link

@glenn-jocher I think this issue can be reproduced with the right size dataset and hardware.

In my case I have 250k images and the first epoch took about 15 minutes to train, the first val was taking 1:12 to run and after 60 seconds I get the same timeout ( WorkNCCL(OpType=BROADCAST, Timeout(ms)=60000))

This is 3gpu rtx titan, I have not tried to replicate on A100, they are probably too fast without more data and/or shifting data from train into val

Is there something that can be done about the watchdog on val?

@glenn-jocher
Copy link
Member

@iceisfun I don't understand. What is a watchdog? And what is the problem exactly?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants