HUB not working correctly with Multi-GPU custom agent setup #695

sinchinpark · 2024-05-23T10:01:40Z

Search before asking

I have searched the HUB issues and found no similar bug report.

HUB Component

Models, Training

Bug

Description

I am experiencing issues when using the HUB portal to train on a dataset with a multi-GPU custom agent setup. Specifically, I am using 2 GPUs and have modified the default parameters as follows:

device=0,1
workers=16

However, HUB does not seem to process the training data correctly, and the run stays stuck throughout the training process. The issue persists even after training has supposedly finished, as shown in the attached screenshot.

Interestingly, using device=0 on the same machine with the same model works fine!
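For anyone trying to reproduce this, the sketch below shows one way to tell the working and failing configurations apart. It is a minimal illustration, not HUB or Ultralytics code: `parse_device` and `is_multi_gpu` are hypothetical helpers, and the only detail taken from this report is that `device=0,1` shows up in the trainer log as `device=[0, 1]`.

```python
def parse_device(value: str) -> list[int]:
    """Turn a comma-separated device string such as "0,1" into GPU indices.

    Hypothetical helper for illustration only; not part of any library.
    """
    indices = [int(part) for part in value.split(",") if part.strip()]
    if len(set(indices)) != len(indices):
        raise ValueError(f"duplicate GPU index in {value!r}")
    return indices


def is_multi_gpu(value: str) -> bool:
    """Return True when the device string names more than one GPU."""
    return len(parse_device(value)) > 1


# The two configurations compared in this report.
print(parse_device("0,1"))  # [0, 1]
print(is_multi_gpu("0"))    # False
print(is_multi_gpu("0,1"))  # True
```

Under this reading, `device=0` is the single-GPU case that works, and `device=0,1` is the multi-GPU case where the run stalls.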

Logs and Errors:

Here are some potentially useful logs and errors from my custom agents:

Ultralytics HUB: View model at https://hub.ultralytics.com/models/zCnR3gSc9n1xTow1CTpS 🚀
Ultralytics YOLOv8.2.19 🚀 Python-3.10.12 torch-2.3.0+cu121 CUDA:0 (NVIDIA GeForce RTX 3090, 24253MiB)
                                                                CUDA:1 (NVIDIA GeForce RTX 3090, 24253MiB)
engine/trainer: task=detect, mode=train, model=yolov8m.pt, data=***, epochs=10, time=None, patience=100, batch=-1, imgsz=640, save=True, save_period=-1, cache=ram, device=[0, 1], workers=8, project=None, name=train2, exist_ok=False, pretrained=True, optimizer=auto, verbose=True, seed=0, deterministic=True, single_cls=False, rect=True, cos_lr=False, close_mosaic=10, resume=False, amp=True, fraction=1.0, profile=False, freeze=None, multi_scale=False, overlap_mask=True, mask_ratio=4, dropout=0.0, val=True, split=val, save_json=False, save_hybrid=False, conf=None, iou=0.7, max_det=300, half=False, dnn=False, plots=True, source=None, vid_stride=1, stream_buffer=False, visualize=False, augment=False, agnostic_nms=False, classes=None, retina_masks=False, embed=None, show=False, save_frames=False, save_txt=False, save_conf=False, save_crop=False, show_labels=True, show_conf=True, show_boxes=True, line_width=None, format=torchscript, keras=False, optimize=False, int8=False, dynamic=False, simplify=False, opset=None, workspace=4, nms=False, lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=7.5, cls=0.5, dfl=1.5, pose=12.0, kobj=1.0, label_smoothing=0.0, nbs=64, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, bgr=0.0, mosaic=1.0, mixup=0.0, copy_paste=0.0, auto_augment=randaugment, erasing=0.4, crop_fraction=1.0, cfg=None, tracker=botsort.yaml, save_dir=runs/detect/train2

WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.

Also, I encountered the following warnings multiple times:

/usr/local/lib/python3.10/dist-packages/torch/nn/modules/conv.py:456: UserWarning: Plan failed with a cudnnException: CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR: cudnnFinalize Descriptor Failed cudnn_status: CUDNN_STATUS_NOT_SUPPORTED (Triggered internally at ../aten/src/ATen/native/cudnn/Conv_v8.cpp:919.)
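If the repeated cuDNN message makes the log hard to read, Python's standard `warnings` module can filter it by its message prefix. This is a minimal sketch, assuming the warning is raised as a `UserWarning` with the text shown above; `silence_cudnn_plan_warning` is a hypothetical helper name. It only hides the message. Nothing in this report establishes that the warning is related to the hang, and with multi-GPU training each worker process would need to install the filter itself.

```python
import warnings

# Prefix of the warning text seen in the log; matched as a regex
# against the start of the warning message.
CUDNN_PLAN_WARNING = r"Plan failed with a cudnnException"


def silence_cudnn_plan_warning() -> None:
    """Ignore only the cuDNN execution-plan UserWarning (hypothetical helper)."""
    warnings.filterwarnings(
        "ignore",
        message=CUDNN_PLAN_WARNING,
        category=UserWarning,
    )


# Call once, early in the process, before training starts.
silence_cudnn_plan_warning()
```

Other `UserWarning` messages are left untouched, so unrelated problems still surface in the log.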

Expected Behavior:

Training should proceed without getting stuck, show progress and metrics on the Dashboard, and allow deploy/export once training has finished (similar to the behavior observed when using device=0).

Custom Agent Env

Python: 3.10.12
PyTorch: 2.3.0+cu121
GPUs: 2x NVIDIA GeForce RTX 3090
Ultralytics YOLOv8.2.19

Environment

Ultralytics HUB Version
v0.1.43
Client User Agent
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36
Operating System
Linux x86_64
Server Timestamp
1716456982

Minimal Reproducible Example

No response

Additional

No response

github-actions · 2024-05-23T10:02:06Z

👋 Hello @sinchinpark, thank you for raising an issue about Ultralytics HUB 🚀! Please visit our HUB Docs to learn more:

Quickstart. Start training and deploying YOLO models with HUB in seconds.
Datasets: Preparing and Uploading. Learn how to prepare and upload your datasets to HUB in YOLO format.
Projects: Creating and Managing. Group your models into projects for improved organization.
Models: Training and Exporting. Train YOLOv5 and YOLOv8 models on your custom datasets and export them to various formats for deployment.
Integrations. Explore different integration options for your trained models, such as TensorFlow, ONNX, OpenVINO, CoreML, and PaddlePaddle.
Ultralytics HUB App. Learn about the Ultralytics App for iOS and Android, which allows you to run models directly on your mobile device.
- iOS. Learn about YOLO CoreML models accelerated on Apple's Neural Engine on iPhones and iPads.
- Android. Explore TFLite acceleration on mobile devices.
Inference API. Understand how to use the Inference API for running your trained models in the cloud to generate predictions.

If this is a 🐛 Bug Report, please provide screenshots and steps to reproduce your problem to help us get started working on a fix.

If this is a ❓ Question, please provide as much information as possible, including dataset, model, environment details etc. so that we might provide the most helpful response.

We try to respond to all issues as promptly as possible. Thank you for your patience!

sinchinpark · 2024-05-23T10:05:39Z

Sorry, this is a duplicate of #606.

sergiuwaxmann · 2024-05-23T10:11:53Z

@sinchinpark Did you use the Custom option from the Advanced Model Configuration accordion (read more here) to change the device from 0 to 0,1?

sinchinpark · 2024-05-23T10:13:05Z

> @sinchinpark Did you use the Custom option from the Advanced Model Configuration accordion (read more here)

Yes, I'm using the HUB portal for all operations (from importing the dataset to training the model).

sergiuwaxmann · 2024-05-23T10:14:56Z

@sinchinpark Our team will investigate this issue and I will update you as soon as possible.
Thank you for your patience!

sinchinpark · 2024-05-23T10:16:05Z

@sergiuwaxmann Thanks
BTW, this is the model ID, in case it helps the investigation:
https://hub.ultralytics.com/models/zCnR3gSc9n1xTow1CTpS

sergiuwaxmann · 2024-05-23T10:17:02Z

@sinchinpark Thank you!

github-actions · 2024-06-23T00:22:12Z

👋 Hello there! We wanted to give you a friendly reminder that this issue has not had any recent activity and may be closed soon, but don't worry - you can always reopen it if needed. If you still have any questions or concerns, please feel free to let us know how we can help.

For additional resources and information, please see the links below:

Docs: https://docs.ultralytics.com
HUB: https://hub.ultralytics.com
Community: https://community.ultralytics.com

Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!

Thank you for your contributions to YOLO 🚀 and Vision AI ⭐

sergiuwaxmann · 2024-07-04T07:26:57Z

@sinchinpark Hey there!
I apologize for the delay in replying. Multi-GPU training now works correctly with Ultralytics HUB.

sinchinpark added the bug label May 23, 2024

sergiuwaxmann self-assigned this May 23, 2024

Burhan-Q mentioned this issue May 24, 2024

Fix HUB session with DDP training ultralytics/ultralytics#13103

Merged

github-actions bot added the Stale label Jun 23, 2024

github-actions bot closed this as not planned Jul 4, 2024
