Training with Docker and recommended DDP command on multi-GPUs will hang #7336

Closed
1 of 2 tasks
ChongyuNVIDIA opened this issue Apr 7, 2022 · 12 comments
Labels
bug Something isn't working Stale

Comments

@ChongyuNVIDIA

Search before asking

  • I have searched the YOLOv5 issues and found no similar bug report.

YOLOv5 Component

Multi-GPU

Bug

I followed the recommended steps, using the Docker image and DDP multi-GPU training. However, training hangs at the start of the first epoch.

Environment

YOLO version: latest with commit id: 0ca85ed
GPU Type: Tesla V100-SXM2-16GB-N, 16160MiB
GPU Number: 8
Docker: nvidia/pytorch:21.10-py3
PyTorch Version: torch 1.11.0+cu113
Torchvision Version: torchvision 0.12.0+cu113
Driver Version: 470.82.01
CUDA Version: 11.3

Minimal Reproducible Example

The command lines to prepare the env:

apt update && apt install -y zip htop screen libgl1-mesa-glx

python -m pip install --upgrade pip

pip uninstall -y torch torchvision torchtext

pip install --no-cache -r requirements.txt albumentations wandb gsutil notebook \
    torch==1.11.0+cu113 torchvision==0.12.0+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html

export OMP_NUM_THREADS=8

The command is as follows:

python -u -m torch.distributed.launch --nproc_per_node 8 train.py --data coco_Chong_NGC.yaml --cfg yolov5n.yaml --weights '' --sync-bn --batch-size 1024 --imgsz 640 --device 0,1,2,3,4,5,6,7

The log for this training and hang:

root@2789503:/ngc_0/Ultralytics/YOLOv5# python -u -m torch.distributed.launch --nproc_per_node 8 train.py --data coco.yaml --cfg yolov5n.yaml --weights '' --sync-bn --batch-size 1024 --imgsz 640 --device 0,1,2,3,4,5,6,7
/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions

  warnings.warn(
Downloading https://ultralytics.com/assets/Arial.ttf to /root/.config/Ultralytics/Arial.ttf...
train: weights=, cfg=yolov5n.yaml, data=coco.yaml, hyp=data/hyps/hyp.scratch-low.yaml, epochs=300, batch_size=1024, imgsz=640, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, evolve=None, bucket=, cache=None, image_weights=False, device=0,1,2,3,4,5,6,7, multi_scale=False, single_cls=False, optimizer=SGD, sync_bn=True, workers=8, project=runs/train, name=exp, exist_ok=False, quad=False, cos_lr=False, label_smoothing=0.0, patience=100, freeze=[0], save_period=-1, local_rank=0, entity=None, upload_dataset=False, bbox_interval=-1, artifact_alias=latest
github: skipping check (not a git repository), for updates see https://github.com/ultralytics/yolov5
YOLOv5 🚀 287bae0 torch 1.11.0+cu113 CUDA:0 (Tesla V100-SXM2-16GB-N, 16160MiB)
                                     CUDA:1 (Tesla V100-SXM2-16GB-N, 16160MiB)
                                     CUDA:2 (Tesla V100-SXM2-16GB-N, 16160MiB)
                                     CUDA:3 (Tesla V100-SXM2-16GB-N, 16160MiB)
                                     CUDA:4 (Tesla V100-SXM2-16GB-N, 16160MiB)
                                     CUDA:5 (Tesla V100-SXM2-16GB-N, 16160MiB)
                                     CUDA:6 (Tesla V100-SXM2-16GB-N, 16160MiB)
                                     CUDA:7 (Tesla V100-SXM2-16GB-N, 16160MiB)

Added key: store_based_barrier_key:1 to store for rank: 0
Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 8 nodes.
hyperparameters: lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, cls=0.5, cls_pw=1.0, obj=1.0, obj_pw=1.0, iou_t=0.2, anchor_t=4.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0, copy_paste=0.0
TensorBoard: Start with 'tensorboard --logdir runs/train', view at http://localhost:6006/

                 from  n    params  module                                  arguments
  0                -1  1      1760  models.common.Conv                      [3, 16, 6, 2, 2]
  1                -1  1      4672  models.common.Conv                      [16, 32, 3, 2]
  2                -1  1      4800  models.common.C3                        [32, 32, 1]
  3                -1  1     18560  models.common.Conv                      [32, 64, 3, 2]
  4                -1  2     29184  models.common.C3                        [64, 64, 2]
  5                -1  1     73984  models.common.Conv                      [64, 128, 3, 2]
  6                -1  3    156928  models.common.C3                        [128, 128, 3]
  7                -1  1    295424  models.common.Conv                      [128, 256, 3, 2]
  8                -1  1    296448  models.common.C3                        [256, 256, 1]
  9                -1  1    164608  models.common.SPPF                      [256, 256, 5]
 10                -1  1     33024  models.common.Conv                      [256, 128, 1, 1]
 11                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']
 12           [-1, 6]  1         0  models.common.Concat                    [1]
 13                -1  1     90880  models.common.C3                        [256, 128, 1, False]
 14                -1  1      8320  models.common.Conv                      [128, 64, 1, 1]
 15                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']
 16           [-1, 4]  1         0  models.common.Concat                    [1]
 17                -1  1     22912  models.common.C3                        [128, 64, 1, False]
 18                -1  1     36992  models.common.Conv                      [64, 64, 3, 2]
 19          [-1, 14]  1         0  models.common.Concat                    [1]
 20                -1  1     74496  models.common.C3                        [128, 128, 1, False]
 21                -1  1    147712  models.common.Conv                      [128, 128, 3, 2]
 22          [-1, 10]  1         0  models.common.Concat                    [1]
 23                -1  1    296448  models.common.C3                        [256, 256, 1, False]
 24      [17, 20, 23]  1    115005  models.yolo.Detect                      [80, [[10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119], [116, 90, 156, 198, 373, 326]], [64, 128, 256]]
YOLOv5n summary: 270 layers, 1872157 parameters, 1872157 gradients, 4.5 GFLOPs

Scaled weight_decay = 0.008
optimizer: SGD with parameter groups 57 weight (no decay), 60 weight, 60 bias
Using SyncBatchNorm()
albumentations: Blur(always_apply=False, p=0.01, blur_limit=(3, 7)), MedianBlur(always_apply=False, p=0.01, blur_limit=(3, 7)), ToGray(always_apply=False, p=0.01), CLAHE(always_apply=False, p=0.01, clip_limit=(1, 4.0), tile_grid_size=(8, 8))


train: Scanning '/ngc_0/Ultralytics/datasets/coco/train2017.cache' images and label
val: Scanning '/ngc_0/Ultralytics/datasets/coco/val2017.cache' images and labels...
Plotting labels to runs/train/exp/labels.jpg...

AutoAnchor: 4.45 anchors/target, 0.995 Best Possible Recall (BPR). Current anchors are a good fit to dataset ✅
Image sizes 640 train, 640 val
Using 64 dataloader workers
Logging results to runs/train/exp
Starting training for 300 epochs...

     Epoch   gpu_mem       box       obj       cls    labels  img_size
  0%|          | 0/116 [00:00<?, ?it/s]

Additional

No response

Are you willing to submit a PR?

  • Yes I'd like to help by submitting a PR!
@ChongyuNVIDIA ChongyuNVIDIA added the bug Something isn't working label Apr 7, 2022
@github-actions
Contributor

github-actions bot commented Apr 7, 2022

👋 Hello @ChongyuNVIDIA, thank you for your interest in YOLOv5 🚀! Please visit our ⭐️ Tutorials to get started, where you can find quickstart guides for simple tasks like Custom Data Training all the way to advanced concepts like Hyperparameter Evolution.

If this is a 🐛 Bug Report, please provide screenshots and minimum viable code to reproduce your issue, otherwise we can not help you.

If this is a custom training ❓ Question, please provide as much information as possible, including dataset images, training logs, screenshots, and a public link to online W&B logging if available.

For business inquiries or professional support requests please visit https://ultralytics.com or email support@ultralytics.com.

Requirements

Python>=3.7.0 with all requirements.txt installed including PyTorch>=1.7. To get started:

git clone https://github.com/ultralytics/yolov5  # clone
cd yolov5
pip install -r requirements.txt  # install

Environments

YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Status

CI CPU testing

If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are currently passing. CI tests verify correct operation of YOLOv5 training (train.py), validation (val.py), inference (detect.py) and export (export.py) on MacOS, Windows, and Ubuntu every 24 hours and on every commit.

@glenn-jocher
Member

glenn-jocher commented Apr 7, 2022

@ChongyuNVIDIA I've run your exact command with a smaller batch size since our machines are all in use, and I have no problems. Note the warning that torch.distributed.launch is deprecated in favor of torch.distributed.run, but this still worked for me.

python -u -m torch.distributed.launch --nproc_per_node 8 train.py --data coco.yaml --cfg yolov5n.yaml --weights '' --sync-bn --batch-size 64 --imgsz 640 --device 0,1,2,3,4,5,6,7
/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See 
https://pytorch.org/docs/stable/distributed.html#launch-utility for 
further instructions

  warnings.warn(
wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice: (30 second timeout) 3
wandb: You chose 'Don't visualize my results'
train: weights=, cfg=yolov5n.yaml, data=coco.yaml, hyp=data/hyps/hyp.scratch-low.yaml, epochs=300, batch_size=64, imgsz=640, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, evolve=None, bucket=, cache=None, image_weights=False, device=0,1,2,3,4,5,6,7, multi_scale=False, single_cls=False, optimizer=SGD, sync_bn=True, workers=8, project=runs/train, name=exp, exist_ok=False, quad=False, cos_lr=False, label_smoothing=0.0, patience=100, freeze=[0], save_period=-1, local_rank=0, entity=None, upload_dataset=False, bbox_interval=-1, artifact_alias=latest
github: skipping check (Docker image), for updates see https://github.com/ultralytics/yolov5
YOLOv5 🚀 v6.1-111-gb7faeda torch 1.11.0+cu113 CUDA:0 (A100-SXM-80GB, 81251MiB)
                                               CUDA:1 (A100-SXM-80GB, 81251MiB)
                                               CUDA:2 (A100-SXM-80GB, 81251MiB)
                                               CUDA:3 (A100-SXM-80GB, 81251MiB)
                                               CUDA:4 (A100-SXM-80GB, 81251MiB)
                                               CUDA:5 (A100-SXM-80GB, 81251MiB)
                                               CUDA:6 (A100-SXM-80GB, 81251MiB)
                                               CUDA:7 (A100-SXM-80GB, 81251MiB)

Added key: store_based_barrier_key:1 to store for rank: 0
Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 8 nodes.
hyperparameters: lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, cls=0.5, cls_pw=1.0, obj=1.0, obj_pw=1.0, iou_t=0.2, anchor_t=4.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0, copy_paste=0.0
Weights & Biases: run 'pip install wandb' to automatically track and visualize YOLOv5 🚀 runs (RECOMMENDED)
TensorBoard: Start with 'tensorboard --logdir runs/train', view at http://localhost:6006/

                 from  n    params  module                                  arguments                     
  0                -1  1      1760  models.common.Conv                      [3, 16, 6, 2, 2]              
  1                -1  1      4672  models.common.Conv                      [16, 32, 3, 2]                
  2                -1  1      4800  models.common.C3                        [32, 32, 1]                   
  3                -1  1     18560  models.common.Conv                      [32, 64, 3, 2]                
  4                -1  2     29184  models.common.C3                        [64, 64, 2]                   
  5                -1  1     73984  models.common.Conv                      [64, 128, 3, 2]               
  6                -1  3    156928  models.common.C3                        [128, 128, 3]                 
  7                -1  1    295424  models.common.Conv                      [128, 256, 3, 2]              
  8                -1  1    296448  models.common.C3                        [256, 256, 1]                 
  9                -1  1    164608  models.common.SPPF                      [256, 256, 5]                 
 10                -1  1     33024  models.common.Conv                      [256, 128, 1, 1]              
 11                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']          
 12           [-1, 6]  1         0  models.common.Concat                    [1]                           
 13                -1  1     90880  models.common.C3                        [256, 128, 1, False]          
 14                -1  1      8320  models.common.Conv                      [128, 64, 1, 1]               
 15                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']          
 16           [-1, 4]  1         0  models.common.Concat                    [1]                           
 17                -1  1     22912  models.common.C3                        [128, 64, 1, False]           
 18                -1  1     36992  models.common.Conv                      [64, 64, 3, 2]                
 19          [-1, 14]  1         0  models.common.Concat                    [1]                           
 20                -1  1     74496  models.common.C3                        [128, 128, 1, False]          
 21                -1  1    147712  models.common.Conv                      [128, 128, 3, 2]              
 22          [-1, 10]  1         0  models.common.Concat                    [1]                           
 23                -1  1    296448  models.common.C3                        [256, 256, 1, False]          
 24      [17, 20, 23]  1    115005  models.yolo.Detect                      [80, [[10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119], [116, 90, 156, 198, 373, 326]], [64, 128, 256]]
YOLOv5n summary: 270 layers, 1872157 parameters, 1872157 gradients, 4.5 GFLOPs

Scaled weight_decay = 0.0005
optimizer: SGD with parameter groups 57 weight (no decay), 60 weight, 60 bias
Using SyncBatchNorm()
albumentations: Blur(always_apply=False, p=0.01, blur_limit=(3, 7)), MedianBlur(always_apply=False, p=0.01, blur_limit=(3, 7)), ToGray(always_apply=False, p=0.01), CLAHE(always_apply=False, p=0.01, clip_limit=(1, 4.0), tile_grid_size=(8, 8))
train: Scanning '/usr/src/datasets/coco/train2017.cache' images and labels... 117266 found, 1021 missing, 0 empty, 0 corrupt: 100%|██████████| 118287/118287 [00:00<?, ?it/s]                      
val: Scanning '/usr/src/datasets/coco/val2017.cache' images and labels... 4952 found, 48 missing, 0 empty, 0 corrupt: 100%|██████████| 5000/5000 [00:00<?, ?it/s]                                  
Plotting labels to runs/train/exp/labels.jpg... 
train: Scanning '/usr/src/datasets/coco/train2017.cache' images and labels... 117266 found, 1021 missing, 0 empty, 0 corrupt: 100%|██████████| 118287/118287 [00:00<?, ?it/s]                      
train: Scanning '/usr/src/datasets/coco/train2017.cache' images and labels... 117266 found, 1021 missing, 0 empty, 0 corrupt: 100%|██████████| 118287/118287 [00:00<?, ?it/s]                      
train: Scanning '/usr/src/datasets/coco/train2017.cache' images and labels... 117266 found, 1021 missing, 0 empty, 0 corrupt: 100%|██████████| 118287/118287 [00:00<?, ?it/s]                      
train: Scanning '/usr/src/datasets/coco/train2017.cache' images and labels... 117266 found, 1021 missing, 0 empty, 0 corrupt: 100%|██████████| 118287/118287 [00:00<?, ?it/s]                      
train: Scanning '/usr/src/datasets/coco/train2017.cache' images and labels... 117266 found, 1021 missing, 0 empty, 0 corrupt: 100%|██████████| 118287/118287 [00:00<?, ?it/s]                      
train: Scanning '/usr/src/datasets/coco/train2017.cache' images and labels... 117266 found, 1021 missing, 0 empty, 0 corrupt: 100%|██████████| 118287/118287 [00:00<?, ?it/s]                      
train: Scanning '/usr/src/datasets/coco/train2017.cache' images and labels... 117266 found, 1021 missing, 0 empty, 0 corrupt: 100%|██████████| 118287/118287 [00:00<?, ?it/s]                      

AutoAnchor: 4.45 anchors/target, 0.995 Best Possible Recall (BPR). Current anchors are a good fit to dataset ✅
Image sizes 640 train, 640 val
Using 64 dataloader workers
Logging results to runs/train/exp
Starting training for 300 epochs...

     Epoch   gpu_mem       box       obj       cls    labels  img_size
     0/299    0.971G    0.1124    0.0528    0.1085        58       640:   0%|          | 1/1849 [00:03<1:57:24,  3.81s/it]                                                                         Reducer buckets have been rebuilt in this iteration.
     0/299        1G    0.1047   0.08592    0.1023       133       640:   3%|| 58/1849 [00:49<24:51,  1.20it/s]  

@glenn-jocher
Member

@ChongyuNVIDIA with torch.distributed.run all warnings disappear for me and training runs equally well.
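The deprecation warning in the logs above asks scripts to read the local rank from `os.environ['LOCAL_RANK']` rather than from a `--local_rank` argument. A minimal sketch of that lookup follows; the helper name `resolve_local_rank` and its parameters are hypothetical and are not taken from train.py.

```python
import os


def resolve_local_rank(cli_local_rank=-1, environ=None):
    """Return the local rank, preferring the LOCAL_RANK environment variable.

    `cli_local_rank` stands in for a value parsed from a `--local_rank`
    argument (hypothetical); -1 is used here to mean "not running under DDP".
    `environ` defaults to os.environ and is injectable for testing.
    """
    env = os.environ if environ is None else environ
    if "LOCAL_RANK" in env:
        # Launchers that follow the warning's advice export LOCAL_RANK per process.
        return int(env["LOCAL_RANK"])
    # Fall back to whatever the command line supplied.
    return cli_local_rank


if __name__ == "__main__":
    print("local rank:", resolve_local_rank())
```

Passing a plain dict as `environ` makes the fallback behaviour easy to check without launching multiple processes.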

@glenn-jocher
Member

@ChongyuNVIDIA pushed #7337 to resolve multi-dataset scanning printout with DDP. Unrelated to your issue but something I realized when seeing our output.

@ChongyuNVIDIA
Author

In DDP, the batch size given on the command line is the global batch size, right? So in your case, the batch size for each GPU is only 64/8 = 8, right?

@ChongyuNVIDIA
Author

In my training run, the steps at the following point in the log take a long time. Is that expected behavior?

Using SyncBatchNorm()
albumentations: Blur(always_apply=False, p=0.01, blur_limit=(3, 7)), MedianBlur(always_apply=False, p=0.01, blur_limit=(3, 7)), ToGray(always_apply=False, p=0.01), CLAHE(always_apply=False, p=0.01, clip_limit=(1, 4.0), tile_grid_size=(8, 8))
train: Scanning '/usr/src/datasets/coco/train2017.cache' images and labels... 117266 found, 1021 missing, 0 empty, 0 corrupt: 100%|██████████| 118287/118287 [00:00<?, ?it/s]    

@glenn-jocher
Member

@ChongyuNVIDIA yes --batch is total batch size across all GPUs, so in my command each GPU using 64/8 = 8 batch size.
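As a rough check of that arithmetic against the two logs in this thread, here is a small sketch. The function name `ddp_batch_summary` is hypothetical, and the weight-decay scaling rule (a nominal batch size of 64) is an assumption that happens to match the `Scaled weight_decay` lines printed above; it is not quoted from train.py.

```python
import math

NOMINAL_BATCH_SIZE = 64  # assumption: reference batch size for weight-decay scaling


def ddp_batch_summary(total_batch, world_size, num_images, base_weight_decay=0.0005):
    """Derive per-GPU batch size, iterations per epoch and scaled weight decay."""
    if total_batch % world_size != 0:
        raise ValueError("total batch size must be divisible by the number of GPUs")
    per_gpu = total_batch // world_size
    # The progress bars above count one iteration per global batch.
    iters_per_epoch = math.ceil(num_images / total_batch)
    # Assumed rule: accumulate gradients up to the nominal batch, then scale decay.
    accumulate = max(round(NOMINAL_BATCH_SIZE / total_batch), 1)
    scaled_wd = base_weight_decay * total_batch * accumulate / NOMINAL_BATCH_SIZE
    return {
        "per_gpu_batch": per_gpu,
        "iters_per_epoch": iters_per_epoch,
        "scaled_weight_decay": scaled_wd,
    }


if __name__ == "__main__":
    # 118287 is the training image count shown in the scanning lines above.
    print(ddp_batch_summary(64, 8, 118287))
    print(ddp_batch_summary(1024, 8, 118287))
```

With 118287 training images this gives 1849 iterations at batch size 64 and 116 at batch size 1024, matching the progress bars in the logs, and a per-GPU batch of 8 and 128 respectively.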

Some steps may take a long time, especially for large datasets. In my example code above it probably takes about 30 seconds from command to first batches training, but larger datasets, AutoAnchor etc. may take several minutes.
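To tell a slow start-up apart from a real hang, one option is to have each process dump its Python stack traces if it is still running after a timeout. The sketch below uses only the standard-library `faulthandler` module; the helper names and the 600-second default are hypothetical and are not part of YOLOv5.

```python
import faulthandler
import sys


def arm_hang_watchdog(timeout_s=600.0, stream=None):
    """Dump all thread tracebacks to `stream` every `timeout_s` seconds until disarmed.

    `stream` must be a real file object with a file descriptor;
    it defaults to sys.stderr.
    """
    target = sys.stderr if stream is None else stream
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=target, exit=False)
    return timeout_s


def disarm_hang_watchdog():
    """Cancel the pending traceback dump, e.g. once the first batch has completed."""
    faulthandler.cancel_dump_traceback_later()


if __name__ == "__main__":
    arm_hang_watchdog(600.0)
    # ... the training loop would run here ...
    disarm_hang_watchdog()
```

If something like this were armed near the top of the training script in every rank, a stalled run would print where each process is blocked (data loading, a collective call, and so on) instead of sitting silently at 0%.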

@ChongyuNVIDIA
Author

I tried batch size 64, as @glenn-jocher did, but training still hangs at the beginning.

train: Scanning '/ngc_0/Ultralytics/datasets/coco/train2017.cache' images and label
val: Scanning '/ngc_0/Ultralytics/datasets/coco/val2017.cache' images and labels...
Plotting labels to runs/train/exp3/labels.jpg...
AutoAnchor: 4.45 anchors/target, 0.995 Best Possible Recall (BPR). Current anchors are a good fit to dataset ✅
Image sizes 640 train, 640 val
Using 64 dataloader workers
Logging results to runs/train/exp3
Starting training for 300 epochs...

     Epoch   gpu_mem       box       obj       cls    labels  img_size
  0%|          | 0/1849 [00:00<?, ?it/s]

@glenn-jocher
Member

glenn-jocher commented Apr 7, 2022 via email

@github-actions
Contributor

github-actions bot commented May 8, 2022

👋 Hello, this issue has been automatically marked as stale because it has not had recent activity. Please note it will be closed if no further activity occurs.

Access additional YOLOv5 🚀 resources:

Access additional Ultralytics ⚡ resources:

Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!

Thank you for your contributions to YOLOv5 🚀 and Vision AI ⭐!

@tarunsharma1

I had the same issue, and in fact using a smaller batch size did fix it. If it is hanging, your batch size is still too big.

@glenn-jocher
Member

@tarunsharma1 I had the same issue and, indeed, using a smaller batch size resolved it for me. If your training process is hanging, it is likely that your batch size is still too large.
