Training with Docker and recommended DDP command on multi-GPUs will hang #7336

ChongyuNVIDIA · 2022-04-07T14:14:14Z

Search before asking

I have searched the YOLOv5 issues and found no similar bug report.

YOLOv5 Component

Multi-GPU

Bug

I followed the recommended steps, using the Docker image and DDP multi-GPU training. However, training hangs at the start of the first epoch.

Environment

YOLO version: latest with commit id: 0ca85ed
GPU Type: Tesla V100-SXM2-16GB-N, 16160MiB
GPU Number: 8
Docker: nvidia/pytorch:21.10-py3
PyTorch Version: torch 1.11.0+cu113
Torchvision Version: torchvision 0.12.0+cu113
Driver Version: 470.82.01
CUDA Version: 11.3

Minimal Reproducible Example

Commands used to prepare the environment:

apt update && apt install -y zip htop screen libgl1-mesa-glx

python -m pip install --upgrade pip

pip uninstall -y torch torchvision torchtext

pip install --no-cache -r requirements.txt albumentations wandb gsutil notebook \
    torch==1.11.0+cu113 torchvision==0.12.0+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html

export OMP_NUM_THREADS=8

The training command is as follows:

python -u -m torch.distributed.launch --nproc_per_node 8 train.py --data coco_Chong_NGC.yaml --cfg yolov5n.yaml --weights '' --sync-bn --batch-size 1024 --imgsz 640 --device 0,1,2,3,4,5,6,7

The training log up to the point where it hangs:

root@2789503:/ngc_0/Ultralytics/YOLOv5# python -u -m torch.distributed.launch --nproc_per_node 8 train.py --data coco.yaml --cfg yolov5n.yaml --weights '' --sync-bn --batch-size 1024 --imgsz 640 --device 0,1,2,3,4,5,6,7
/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions

  warnings.warn(
Downloading https://ultralytics.com/assets/Arial.ttf to /root/.config/Ultralytics/Arial.ttf...
train: weights=, cfg=yolov5n.yaml, data=coco.yaml, hyp=data/hyps/hyp.scratch-low.yaml, epochs=300, batch_size=1024, imgsz=640, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, evolve=None, bucket=, cache=None, image_weights=False, device=0,1,2,3,4,5,6,7, multi_scale=False, single_cls=False, optimizer=SGD, sync_bn=True, workers=8, project=runs/train, name=exp, exist_ok=False, quad=False, cos_lr=False, label_smoothing=0.0, patience=100, freeze=[0], save_period=-1, local_rank=0, entity=None, upload_dataset=False, bbox_interval=-1, artifact_alias=latest
github: skipping check (not a git repository), for updates see https://github.com/ultralytics/yolov5
YOLOv5 🚀 287bae0 torch 1.11.0+cu113 CUDA:0 (Tesla V100-SXM2-16GB-N, 16160MiB)
                                     CUDA:1 (Tesla V100-SXM2-16GB-N, 16160MiB)
                                     CUDA:2 (Tesla V100-SXM2-16GB-N, 16160MiB)
                                     CUDA:3 (Tesla V100-SXM2-16GB-N, 16160MiB)
                                     CUDA:4 (Tesla V100-SXM2-16GB-N, 16160MiB)
                                     CUDA:5 (Tesla V100-SXM2-16GB-N, 16160MiB)
                                     CUDA:6 (Tesla V100-SXM2-16GB-N, 16160MiB)
                                     CUDA:7 (Tesla V100-SXM2-16GB-N, 16160MiB)

Added key: store_based_barrier_key:1 to store for rank: 0
Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 8 nodes.
hyperparameters: lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, cls=0.5, cls_pw=1.0, obj=1.0, obj_pw=1.0, iou_t=0.2, anchor_t=4.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0, copy_paste=0.0
TensorBoard: Start with 'tensorboard --logdir runs/train', view at http://localhost:6006/

                 from  n    params  module                                  arguments
  0                -1  1      1760  models.common.Conv                      [3, 16, 6, 2, 2]
  1                -1  1      4672  models.common.Conv                      [16, 32, 3, 2]
  2                -1  1      4800  models.common.C3                        [32, 32, 1]
  3                -1  1     18560  models.common.Conv                      [32, 64, 3, 2]
  4                -1  2     29184  models.common.C3                        [64, 64, 2]
  5                -1  1     73984  models.common.Conv                      [64, 128, 3, 2]
  6                -1  3    156928  models.common.C3                        [128, 128, 3]
  7                -1  1    295424  models.common.Conv                      [128, 256, 3, 2]
  8                -1  1    296448  models.common.C3                        [256, 256, 1]
  9                -1  1    164608  models.common.SPPF                      [256, 256, 5]
 10                -1  1     33024  models.common.Conv                      [256, 128, 1, 1]
 11                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']
 12           [-1, 6]  1         0  models.common.Concat                    [1]
 13                -1  1     90880  models.common.C3                        [256, 128, 1, False]
 14                -1  1      8320  models.common.Conv                      [128, 64, 1, 1]
 15                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']
 16           [-1, 4]  1         0  models.common.Concat                    [1]
 17                -1  1     22912  models.common.C3                        [128, 64, 1, False]
 18                -1  1     36992  models.common.Conv                      [64, 64, 3, 2]
 19          [-1, 14]  1         0  models.common.Concat                    [1]
 20                -1  1     74496  models.common.C3                        [128, 128, 1, False]
 21                -1  1    147712  models.common.Conv                      [128, 128, 3, 2]
 22          [-1, 10]  1         0  models.common.Concat                    [1]
 23                -1  1    296448  models.common.C3                        [256, 256, 1, False]
 24      [17, 20, 23]  1    115005  models.yolo.Detect                      [80, [[10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119], [116, 90, 156, 198, 373, 326]], [64, 128, 256]]
YOLOv5n summary: 270 layers, 1872157 parameters, 1872157 gradients, 4.5 GFLOPs

Scaled weight_decay = 0.008
optimizer: SGD with parameter groups 57 weight (no decay), 60 weight, 60 bias
Using SyncBatchNorm()
albumentations: Blur(always_apply=False, p=0.01, blur_limit=(3, 7)), MedianBlur(always_apply=False, p=0.01, blur_limit=(3, 7)), ToGray(always_apply=False, p=0.01), CLAHE(always_apply=False, p=0.01, clip_limit=(1, 4.0), tile_grid_size=(8, 8))


train: Scanning '/ngc_0/Ultralytics/datasets/coco/train2017.cache' images and label
val: Scanning '/ngc_0/Ultralytics/datasets/coco/val2017.cache' images and labels...
Plotting labels to runs/train/exp/labels.jpg...

AutoAnchor: 4.45 anchors/target, 0.995 Best Possible Recall (BPR). Current anchors are a good fit to dataset ✅
Image sizes 640 train, 640 val
Using 64 dataloader workers
Logging results to runs/train/exp
Starting training for 300 epochs...

     Epoch   gpu_mem       box       obj       cls    labels  img_size
  0%|          | 0/116 [00:00<?, ?it/s]

Additional

No response

Are you willing to submit a PR?

Yes I'd like to help by submitting a PR!

github-actions · 2022-04-07T14:14:54Z

👋 Hello @ChongyuNVIDIA, thank you for your interest in YOLOv5 🚀! Please visit our ⭐️ Tutorials to get started, where you can find quickstart guides for simple tasks like Custom Data Training all the way to advanced concepts like Hyperparameter Evolution.

If this is a 🐛 Bug Report, please provide screenshots and minimum viable code to reproduce your issue, otherwise we can not help you.

If this is a custom training ❓ Question, please provide as much information as possible, including dataset images, training logs, screenshots, and a public link to online W&B logging if available.

For business inquiries or professional support requests please visit https://ultralytics.com or email support@ultralytics.com.

Requirements

Python>=3.7.0 with all requirements.txt installed including PyTorch>=1.7. To get started:

git clone https://github.com/ultralytics/yolov5  # clone
cd yolov5
pip install -r requirements.txt  # install

Environments

YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Google Colab and Kaggle notebooks with free GPU
Google Cloud Deep Learning VM. See GCP Quickstart Guide
Amazon Deep Learning AMI. See AWS Quickstart Guide
Docker Image. See Docker Quickstart Guide

Status

If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are currently passing. CI tests verify correct operation of YOLOv5 training (train.py), validation (val.py), inference (detect.py) and export (export.py) on MacOS, Windows, and Ubuntu every 24 hours and on every commit.

glenn-jocher · 2022-04-07T14:24:28Z

@ChongyuNVIDIA I've run your exact command with a smaller batch size since our machines are all in use, and I have no problems. Note the warning that torch.distributed.launch is deprecated in favor of torch.distributed.run, but this still worked for me.

python -u -m torch.distributed.launch --nproc_per_node 8 train.py --data coco.yaml --cfg yolov5n.yaml --weights '' --sync-bn --batch-size 64 --imgsz 640 --device 0,1,2,3,4,5,6,7
/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See 
https://pytorch.org/docs/stable/distributed.html#launch-utility for 
further instructions

  warnings.warn(
wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice: (30 second timeout) 3
wandb: You chose 'Don't visualize my results'
train: weights=, cfg=yolov5n.yaml, data=coco.yaml, hyp=data/hyps/hyp.scratch-low.yaml, epochs=300, batch_size=64, imgsz=640, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, evolve=None, bucket=, cache=None, image_weights=False, device=0,1,2,3,4,5,6,7, multi_scale=False, single_cls=False, optimizer=SGD, sync_bn=True, workers=8, project=runs/train, name=exp, exist_ok=False, quad=False, cos_lr=False, label_smoothing=0.0, patience=100, freeze=[0], save_period=-1, local_rank=0, entity=None, upload_dataset=False, bbox_interval=-1, artifact_alias=latest
github: skipping check (Docker image), for updates see https://github.com/ultralytics/yolov5
YOLOv5 🚀 v6.1-111-gb7faeda torch 1.11.0+cu113 CUDA:0 (A100-SXM-80GB, 81251MiB)
                                               CUDA:1 (A100-SXM-80GB, 81251MiB)
                                               CUDA:2 (A100-SXM-80GB, 81251MiB)
                                               CUDA:3 (A100-SXM-80GB, 81251MiB)
                                               CUDA:4 (A100-SXM-80GB, 81251MiB)
                                               CUDA:5 (A100-SXM-80GB, 81251MiB)
                                               CUDA:6 (A100-SXM-80GB, 81251MiB)
                                               CUDA:7 (A100-SXM-80GB, 81251MiB)

Added key: store_based_barrier_key:1 to store for rank: 0
Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 8 nodes.
hyperparameters: lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, cls=0.5, cls_pw=1.0, obj=1.0, obj_pw=1.0, iou_t=0.2, anchor_t=4.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0, copy_paste=0.0
Weights & Biases: run 'pip install wandb' to automatically track and visualize YOLOv5 🚀 runs (RECOMMENDED)
TensorBoard: Start with 'tensorboard --logdir runs/train', view at http://localhost:6006/

                 from  n    params  module                                  arguments                     
  0                -1  1      1760  models.common.Conv                      [3, 16, 6, 2, 2]              
  1                -1  1      4672  models.common.Conv                      [16, 32, 3, 2]                
  2                -1  1      4800  models.common.C3                        [32, 32, 1]                   
  3                -1  1     18560  models.common.Conv                      [32, 64, 3, 2]                
  4                -1  2     29184  models.common.C3                        [64, 64, 2]                   
  5                -1  1     73984  models.common.Conv                      [64, 128, 3, 2]               
  6                -1  3    156928  models.common.C3                        [128, 128, 3]                 
  7                -1  1    295424  models.common.Conv                      [128, 256, 3, 2]              
  8                -1  1    296448  models.common.C3                        [256, 256, 1]                 
  9                -1  1    164608  models.common.SPPF                      [256, 256, 5]                 
 10                -1  1     33024  models.common.Conv                      [256, 128, 1, 1]              
 11                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']          
 12           [-1, 6]  1         0  models.common.Concat                    [1]                           
 13                -1  1     90880  models.common.C3                        [256, 128, 1, False]          
 14                -1  1      8320  models.common.Conv                      [128, 64, 1, 1]               
 15                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']          
 16           [-1, 4]  1         0  models.common.Concat                    [1]                           
 17                -1  1     22912  models.common.C3                        [128, 64, 1, False]           
 18                -1  1     36992  models.common.Conv                      [64, 64, 3, 2]                
 19          [-1, 14]  1         0  models.common.Concat                    [1]                           
 20                -1  1     74496  models.common.C3                        [128, 128, 1, False]          
 21                -1  1    147712  models.common.Conv                      [128, 128, 3, 2]              
 22          [-1, 10]  1         0  models.common.Concat                    [1]                           
 23                -1  1    296448  models.common.C3                        [256, 256, 1, False]          
 24      [17, 20, 23]  1    115005  models.yolo.Detect                      [80, [[10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119], [116, 90, 156, 198, 373, 326]], [64, 128, 256]]
YOLOv5n summary: 270 layers, 1872157 parameters, 1872157 gradients, 4.5 GFLOPs

Scaled weight_decay = 0.0005
optimizer: SGD with parameter groups 57 weight (no decay), 60 weight, 60 bias
Using SyncBatchNorm()
albumentations: Blur(always_apply=False, p=0.01, blur_limit=(3, 7)), MedianBlur(always_apply=False, p=0.01, blur_limit=(3, 7)), ToGray(always_apply=False, p=0.01), CLAHE(always_apply=False, p=0.01, clip_limit=(1, 4.0), tile_grid_size=(8, 8))
train: Scanning '/usr/src/datasets/coco/train2017.cache' images and labels... 117266 found, 1021 missing, 0 empty, 0 corrupt: 100%|██████████| 118287/118287 [00:00<?, ?it/s]                      
val: Scanning '/usr/src/datasets/coco/val2017.cache' images and labels... 4952 found, 48 missing, 0 empty, 0 corrupt: 100%|██████████| 5000/5000 [00:00<?, ?it/s]                                  
Plotting labels to runs/train/exp/labels.jpg... 
train: Scanning '/usr/src/datasets/coco/train2017.cache' images and labels... 117266 found, 1021 missing, 0 empty, 0 corrupt: 100%|██████████| 118287/118287 [00:00<?, ?it/s]                      
train: Scanning '/usr/src/datasets/coco/train2017.cache' images and labels... 117266 found, 1021 missing, 0 empty, 0 corrupt: 100%|██████████| 118287/118287 [00:00<?, ?it/s]                      
train: Scanning '/usr/src/datasets/coco/train2017.cache' images and labels... 117266 found, 1021 missing, 0 empty, 0 corrupt: 100%|██████████| 118287/118287 [00:00<?, ?it/s]                      
train: Scanning '/usr/src/datasets/coco/train2017.cache' images and labels... 117266 found, 1021 missing, 0 empty, 0 corrupt: 100%|██████████| 118287/118287 [00:00<?, ?it/s]                      
train: Scanning '/usr/src/datasets/coco/train2017.cache' images and labels... 117266 found, 1021 missing, 0 empty, 0 corrupt: 100%|██████████| 118287/118287 [00:00<?, ?it/s]                      
train: Scanning '/usr/src/datasets/coco/train2017.cache' images and labels... 117266 found, 1021 missing, 0 empty, 0 corrupt: 100%|██████████| 118287/118287 [00:00<?, ?it/s]                      
train: Scanning '/usr/src/datasets/coco/train2017.cache' images and labels... 117266 found, 1021 missing, 0 empty, 0 corrupt: 100%|██████████| 118287/118287 [00:00<?, ?it/s]                      

AutoAnchor: 4.45 anchors/target, 0.995 Best Possible Recall (BPR). Current anchors are a good fit to dataset ✅
Image sizes 640 train, 640 val
Using 64 dataloader workers
Logging results to runs/train/exp
Starting training for 300 epochs...

     Epoch   gpu_mem       box       obj       cls    labels  img_size
     0/299    0.971G    0.1124    0.0528    0.1085        58       640:   0%|          | 1/1849 [00:03<1:57:24,  3.81s/it]                                                                         Reducer buckets have been rebuilt in this iteration.
     0/299        1G    0.1047   0.08592    0.1023       133       640:   3%|▎         | 58/1849 [00:49<24:51,  1.20it/s]

glenn-jocher · 2022-04-07T14:27:04Z

@ChongyuNVIDIA with torch.distributed.run all warnings disappear for me and training runs equally well.
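For scripts that still parse `--local_rank`, the deprecation warning quoted above suggests reading `LOCAL_RANK` from the environment instead. A minimal sketch of that pattern is shown here. The helper name `resolve_local_rank` is hypothetical, and this is an illustration of what the warning describes rather than YOLOv5's actual code.

```python
import argparse
import os


def resolve_local_rank(argv=None, environ=None):
    """Return the local rank, preferring the LOCAL_RANK environment variable.

    torchrun / torch.distributed.run export LOCAL_RANK for each worker process,
    while the older torch.distributed.launch passes --local_rank on the command
    line. -1 is used here as a "not running under DDP" default (an assumption).
    """
    environ = os.environ if environ is None else environ

    parser = argparse.ArgumentParser()
    # Accept both spellings so either launcher style is tolerated.
    parser.add_argument("--local_rank", "--local-rank", type=int, default=-1)
    # parse_known_args ignores the other training options on the command line.
    args, _unknown = parser.parse_known_args(argv)

    # The environment variable wins when it is present.
    return int(environ.get("LOCAL_RANK", args.local_rank))


if __name__ == "__main__":
    print("local rank:", resolve_local_rank())
```

With this pattern the same script can be started by either launcher, since the command-line value is only a fallback.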

glenn-jocher · 2022-04-07T14:45:30Z

@ChongyuNVIDIA pushed #7337 to resolve multi-dataset scanning printout with DDP. Unrelated to your issue but something I realized when seeing our output.

ChongyuNVIDIA · 2022-04-07T15:08:13Z

In DDP, the batch size in the command line is the global batch size, right? So in your case, the batch size for each GPU is only 64/8=8 right?

ChongyuNVIDIA · 2022-04-07T15:09:56Z

In my training run, the steps at the following point take a long time. Is that expected behavior?

Using SyncBatchNorm()
albumentations: Blur(always_apply=False, p=0.01, blur_limit=(3, 7)), MedianBlur(always_apply=False, p=0.01, blur_limit=(3, 7)), ToGray(always_apply=False, p=0.01), CLAHE(always_apply=False, p=0.01, clip_limit=(1, 4.0), tile_grid_size=(8, 8))
train: Scanning '/usr/src/datasets/coco/train2017.cache' images and labels... 117266 found, 1021 missing, 0 empty, 0 corrupt: 100%|██████████| 118287/118287 [00:00<?, ?it/s]

glenn-jocher · 2022-04-07T15:21:07Z

@ChongyuNVIDIA yes, --batch is the total batch size across all GPUs, so in my command each GPU uses a batch size of 64/8 = 8.

Some steps may take a long time, especially for large datasets. In my example code above it probably takes about 30 seconds from command to first batches training, but larger datasets, AutoAnchor etc. may take several minutes.
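The batch-size arithmetic above can be checked against the two logs in this thread. The sketch below is an illustration only: the function names are hypothetical, and the weight-decay scaling assumes a nominal batch size of 64 with no gradient accumulation below 1, which is an assumption that happens to reproduce the `Scaled weight_decay` values printed in the logs.

```python
import math

NOMINAL_BATCH_SIZE = 64  # assumption: reference batch size for weight-decay scaling


def per_gpu_batch(total_batch, num_gpus):
    """Split the global (command-line) batch size evenly across GPUs."""
    if total_batch % num_gpus != 0:
        # Assumption: the global batch must divide evenly among the ranks.
        raise ValueError("total batch size must be a multiple of the GPU count")
    return total_batch // num_gpus


def iterations_per_epoch(num_images, total_batch):
    """Number of optimizer steps shown in the progress bar for one epoch."""
    return math.ceil(num_images / total_batch)


def scaled_weight_decay(base_wd, total_batch, nbs=NOMINAL_BATCH_SIZE):
    """Scale weight decay with the effective batch size relative to nbs."""
    accumulate = max(round(nbs / total_batch), 1)
    return base_wd * (total_batch * accumulate / nbs)


if __name__ == "__main__":
    coco_train_images = 118287  # from the "Scanning ... train2017.cache" log line
    for total in (64, 1024):
        print(
            total,
            per_gpu_batch(total, 8),
            iterations_per_epoch(coco_train_images, total),
            scaled_weight_decay(0.0005, total),
        )
```

For 8 GPUs this gives 8 images per GPU and 1849 iterations at a global batch of 64, and 128 images per GPU and 116 iterations at 1024, which agrees with the `1/1849` and `0/116` progress bars in the two logs.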

ChongyuNVIDIA · 2022-04-07T16:30:50Z

I tried batch size 64 like @glenn-jocher, but training still hangs at the beginning.

train: Scanning '/ngc_0/Ultralytics/datasets/coco/train2017.cache' images and label
val: Scanning '/ngc_0/Ultralytics/datasets/coco/val2017.cache' images and labels...
Plotting labels to runs/train/exp3/labels.jpg...
AutoAnchor: 4.45 anchors/target, 0.995 Best Possible Recall (BPR). Current anchors are a good fit to dataset ✅
Image sizes 640 train, 640 val
Using 64 dataloader workers
Logging results to runs/train/exp3
Starting training for 300 epochs...

     Epoch   gpu_mem       box       obj       cls    labels  img_size
  0%|          | 0/1849 [00:00<?, ?it/s]

glenn-jocher · 2022-04-07T17:21:23Z

Sure, batch size is irrelevant; my example was at 64 because our GPUs are already in use training other models.
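The thread does not identify the cause of the hang. One way to gather more information is to rerun with verbose distributed logging and fewer dataloader workers (the logs above report 64 workers in total for `workers=8` on 8 GPUs). The sketch below only builds the command and environment for such a run. `NCCL_DEBUG` and `TORCH_DISTRIBUTED_DEBUG` are standard NCCL and PyTorch environment variables, and `--workers` is the train.py option visible in the logged arguments. The helper name `build_debug_run` and the choice of 2 workers are assumptions for illustration, and this is a diagnostic aid, not a confirmed fix.

```python
import os


def build_debug_run(nproc=8, batch_size=64, workers=2, base_env=None):
    """Build (cmd, env) for a diagnostic DDP run with verbose logging.

    Nothing is executed here; the caller decides whether to launch it.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["NCCL_DEBUG"] = "INFO"                 # NCCL prints communicator setup details
    env["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"  # extra checks/logging in torch.distributed

    cmd = [
        "python", "-u", "-m", "torch.distributed.run",
        "--nproc_per_node", str(nproc),
        "train.py",
        "--data", "coco.yaml",
        "--cfg", "yolov5n.yaml",
        "--weights", "",
        "--sync-bn",
        "--batch-size", str(batch_size),
        "--imgsz", "640",
        "--workers", str(workers),  # assumption: fewer workers to rule out dataloader stalls
        "--device", ",".join(str(i) for i in range(nproc)),
    ]
    return cmd, env


if __name__ == "__main__":
    cmd, env = build_debug_run()
    print(" ".join(repr(c) if c == "" else c for c in cmd))
```

The returned pair can be passed to `subprocess.run(cmd, env=env)` from inside the YOLOv5 checkout. If the run still stalls at `0%`, the extra log output shows whether each rank finished setting up its communicator before the first batch.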

github-actions · 2022-05-08T00:19:55Z

👋 Hello, this issue has been automatically marked as stale because it has not had recent activity. Please note it will be closed if no further activity occurs.

Access additional YOLOv5 🚀 resources:

Wiki – https://github.com/ultralytics/yolov5/wiki
Tutorials – https://docs.ultralytics.com/yolov5
Docs – https://docs.ultralytics.com

Access additional Ultralytics ⚡ resources:

Ultralytics HUB – https://ultralytics.com/hub
Vision API – https://ultralytics.com/yolov5
About Us – https://ultralytics.com/about
Join Our Team – https://ultralytics.com/work
Contact Us – https://ultralytics.com/contact

Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!

Thank you for your contributions to YOLOv5 🚀 and Vision AI ⭐!

tarunsharma1 · 2023-07-10T01:10:54Z

I had the same issue, and in fact using a smaller batch size did fix it. If training is hanging, your batch size is still too big.

glenn-jocher · 2023-07-10T04:39:23Z

@tarunsharma1 I had the same issue and, indeed, using a smaller batch size resolved it for me. If your training process is hanging, it is likely that your batch size is still too large.

ChongyuNVIDIA added the bug label Apr 7, 2022

ChongyuNVIDIA mentioned this issue Apr 7, 2022: Training stuck when using multiple GPUs #7304 (Closed)

glenn-jocher mentioned this issue Apr 7, 2022: Print dataset scan only if RANK in (-1, 0) #7337 (Merged)

github-actions bot added the Stale label May 8, 2022

github-actions bot closed this as completed May 13, 2022
