Training failed with 4 GPUs after first epoch #4409
I seem to be able to train fine on 3 or 4 GPUs. I don't have a V100 machine to test on; however, I suspect something else is going on here.
Using the old method, it works fine:
python train.py --img 640 --batch 30 --epochs 20 --data /data/dataset/coco.yaml --weights yolov5s.pt
GPU 0: NVIDIA TITAN RTX (UUID: GPU-31995e9a-cb40-7de8-e9f3-24da46f6445c)
Using the newer DDP launch method, it also works fine:
python3 -m torch.distributed.launch --nproc_per_node 4 train.py --epochs 10 --batch-size 64 --device 0,1,2,3
GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-64e9c581-61c9-c378-ef93-7c5a0275c14e)
@kinoute I see the same error on DDP in Docker with CUDA 11.1 now when training COCO. Not sure what the cause is, but it appears related to the dist.barrier() ops. If I install PyTorch nightly, it actually happens before training even starts. It usually happens 60 seconds after a dist.barrier() call, because the DDP ops have a 60-second timeout in place here: Line 496 in 4e65052
I'm not sure what would happen if I removed this line; I think the default timeout is 30 minutes.
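For readers without the repo open, the line referenced above is where the process group is created with that 60-second limit. A minimal sketch of the pattern being discussed (not a verbatim copy of train.py; the rendezvous env vars are assumed to come from torch.distributed.launch):

```python
from datetime import timedelta

import torch.distributed as dist

# Sketch only: every collective op, including dist.barrier(), must finish within
# `timeout`, otherwise the NCCL process group tears the job down.
# RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT are assumed to be set by the launcher.
dist.init_process_group(
    backend="nccl",
    timeout=timedelta(seconds=60),  # PyTorch's own default is 30 minutes
)
```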
@kinoute you might also try the Gloo DDP backend, though it is slower in my experience.
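A sketch of what forcing Gloo could look like for debugging (same assumption as above that the launcher provides the rendezvous env vars):

```python
import torch.distributed as dist

# Sketch only: Gloo is slower than NCCL but can help rule out NCCL-specific hangs.
dist.init_process_group(backend="gloo")
```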
@kinoute @iceisfun good news 😃! Your original issue may now be fixed ✅ in PR #4422. This PR updates the DDP process group, and was verified over 3 epochs of COCO training with 4x A100 DDP NCCL on an EC2 P4d instance, with the official Docker image and a CUDA 11.1 pip install from https://pytorch.org/get-started/locally/:
d=yolov5 && git clone https://github.com/ultralytics/yolov5 -b master $d && cd $d
python -m torch.distributed.launch --nproc_per_node 4 --master_port 1 train.py --data coco.yaml --batch 64 --weights '' --project study --cfg yolov5l.yaml --epochs 300 --name yolov5l-1280 --img 1280 --linear --device 0,1,2,3
python -m torch.distributed.launch --nproc_per_node 4 --master_port 2 train.py --data coco.yaml --batch 64 --weights '' --project study --cfg yolov5l.yaml --epochs 300 --name yolov5l-1280 --img 1280 --linear --device 4,5,6,7
To receive this update, update your local copy to the latest master.
Thank you for spotting this issue and informing us of the problem. Please let us know if this update resolves the issue for you, and feel free to inform us of any other issues you discover or feature requests that come to mind. Happy trainings with YOLOv5 🚀!
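For readers following along, a rough sketch of the general direction of such a change (the real diff is in PR #4422; the call below is an assumption, not the exact patch):

```python
import torch.distributed as dist

# Sketch only: prefer NCCL when available, fall back to Gloo, and rely on
# PyTorch's default 30-minute collective timeout instead of a short hard-coded one.
dist.init_process_group(backend="nccl" if dist.is_nccl_available() else "gloo")
```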
@glenn-jocher I think this issue can be reproduced with the right dataset size and hardware. In my case I have 250k images; the first epoch took about 15 minutes to train, the first val pass was on track to take 1:12, and after 60 seconds I get the same timeout (WorkNCCL(OpType=BROADCAST, Timeout(ms)=60000)).
This is on 3x RTX TITAN GPUs. I have not tried to replicate on A100s; they are probably too fast without more data and/or shifting data from train into val.
Is there something that can be done about the watchdog on val?
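For context, the "watchdog" here is the monitoring thread inside PyTorch's NCCL process group that aborts collectives exceeding the configured timeout. A sketch of the environment knobs usually involved (their exact behaviour varies across PyTorch versions, so treat this as an assumption to verify, not a fix):

```python
import os

# Sketch only: these must be set before the process group is created.
os.environ["NCCL_BLOCKING_WAIT"] = "1"          # collectives block and honour the configured timeout
# os.environ["NCCL_ASYNC_ERROR_HANDLING"] = "1" # alternative: surface NCCL errors asynchronously
```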
@iceisfun I don't understand. What is a watchdog? And what is the problem exactly?
🐛 Bug
Three days ago I was able to train on OVH AI Cloud with 4 GPUs, using 4 classes and 500 images in total. But when I try again with my full dataset this time (around 9000 images for 4 classes), the training stops after the first epoch, when the validation step is about to finish.
I tried changing different things: getting rid of the cache argument, switching to a smaller model (I was using YOLOv5m6 at first), changing the batch size, and changing the number of GPUs; still the same.
To Reproduce (REQUIRED)
First, here is my Dockerfile. It is based on the official YOLOv5 Docker image, with W&B integrated:
Then my entrypoint.sh, with:
My dataset is split with YOLOv5's autosplit() function.
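For reference, a minimal sketch of how autosplit() is typically invoked (the image path and split weights below are illustrative assumptions, not my actual settings):

```python
from utils.datasets import autosplit  # YOLOv5 helper; module path may differ between versions

# Sketch only: writes autosplit_train.txt / autosplit_val.txt / autosplit_test.txt
# next to the image folder, using a 90/10/0 train/val/test split.
autosplit(path="/data/dataset/images", weights=(0.9, 0.1, 0.0), annotated_only=False)
```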
Full output from the server:
Expected behavior
The training should keep going after the first epoch and the first validation step are over.
Environment
I'm using OVH AI Cloud to train, with the Docker image described above (basically the official one).
Resources for the job:
CPU: 13
Memory: 40.0 GiB
Public Network: 1.5 Gbps
Private Network: 0 bps
Ephemeral Storage: 650.0 GiB
GPU Model: Tesla-V100S
GPU Brand: NVIDIA
GPU Memory: 32.0 GiB
Flavor: ai1-1-gpu
Additional context
I didn't encounter this problem with only 500 images a few days ago, with 4 GPUs. I encountered multiple problems today due to the cache argument being used, but now that it is gone, I can't find the reason why it's failing at the end of the first validation step (around 900 images).