multi-gpu is unstable? #51
Comments
I suggest you first try using the existing COCO configs on the COCO dataset, to see whether there is anything special in your dataset that can cause the issue. This would make it much easier to isolate the potential causes. The error messages you saw do indicate that the anchors may have a non-positive size. However, the way the anchors are generated (in …
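For reference, the kind of baseline run suggested here would look something like the following, assuming one of the stock COCO instance-segmentation configs shipped with detectron2 (the exact config the maintainer had in mind is not stated in the thread):

python tools/train_net.py --num-gpus 4 --config-file configs/COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_1x.yaml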
Hi there. Thank you for the reply! I looked at the dataset carefully, but I can't find any differences between the original COCO dataset and mine; I describe both below. Additionally, the bugs come up randomly: when I tried training, the three bugs I mentioned above appeared at random, and training suddenly succeeded on some attempts. It's so weird. The COCO dataset is as follows:
My dataset is as follows:
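One concrete way to compare the two datasets beyond their summaries is to scan the COCO-format annotation file for degenerate boxes, since those can produce the non-positive sizes mentioned above. This is a minimal sketch, not code from the thread; moda.json is the annotation file from the reproduction steps below:

```python
import json

# Hedged example: scan a COCO-format annotation file for boxes whose width or
# height is not strictly positive; such boxes can lead to invalid proposals
# or anchors during training.
with open("moda.json") as f:
    coco = json.load(f)

bad = []
for ann in coco.get("annotations", []):
    x, y, w, h = ann["bbox"]  # COCO format stores [x, y, width, height]
    if w <= 0 or h <= 0:
        bad.append(ann.get("id"))

print(f"{len(bad)} annotations with non-positive width/height")
print("first few ids:", bad[:10])
```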
After I add …
Hello, I have also met this problem.
@zhoudongliang did you check that the number of classes in your config file is correct? I had the same bug and I fixed it by setting …
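The exact setting in that comment is cut off on this page. For what it's worth, the class count for ROI heads in detectron2 is controlled by MODEL.ROI_HEADS.NUM_CLASSES (and MODEL.RETINANET.NUM_CLASSES for RetinaNet). A hedged sketch of checking and overriding it, with a hypothetical class count:

```python
from detectron2.config import get_cfg

# Hedged sketch: load the config used in this issue and make sure the model's
# class count matches the dataset. The value 13 below is hypothetical; replace
# it with the number of categories actually present in moda.json.
cfg = get_cfg()
cfg.merge_from_file("configs/modanet.yaml")
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 13  # hypothetical count
print(cfg.MODEL.ROI_HEADS.NUM_CLASSES)
```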
I set this item in the json file, but it did not work, no matter whether the number of GPUs is 1 or 2. Can anyone answer me? This error always occurs when I train and the training is interrupted, which is frustrating!
If you do not know the root cause of the problem / bug, and wish someone to help you, please include:
To Reproduce
In tools/train_net.py, I add the new dataset at the beginning of the main function.
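The exact registration code from the issue is not reproduced on this page. A minimal sketch of registering a COCO-format dataset with detectron2's built-in helper, with assumed names and paths, might look like this:

```python
from detectron2.data.datasets import register_coco_instances

# Hedged sketch: register a COCO-format dataset before training. The dataset
# name and the paths below are assumptions; moda.json and the image folder
# refer to the download links in this issue.
register_coco_instances(
    "modanet_train",            # assumed dataset name referenced by the config
    {},                         # extra metadata (none needed here)
    "datasets/moda/moda.json",  # assumed path to the COCO-style annotations
    "datasets/moda/images",     # assumed path to the image root
)
```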
You can download moda.json here.
You can also download a subset of the moda images here.
The full image set is here (not recommended due to its large size).
configs/modanet.yaml is as follows:
python tools/train_net.py --num-gpus 4 --config-file configs/modanet.yaml
When I use a single GPU, it always works fine. But when I try to use multiple GPUs, several bugs occur randomly; only rarely does multi-GPU training succeed. What's wrong with it?
The first bug is about Box2BoxTransform: when I debugged it, the anchor's width was less than 0.
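The traceback for this bug is not reproduced on this page. As a small illustration, not taken from the issue: the box-delta regression divides by the source box width/height and takes their log, so a box with non-positive width breaks it. Behavior on invalid boxes varies by detectron2 version, which may raise an assertion instead of returning values:

```python
import torch
from detectron2.modeling.box_regression import Box2BoxTransform

# Hedged illustration: a source box with zero width makes the dx and
# log-width terms of the regression target undefined. Depending on the
# detectron2 version, get_deltas may assert on such boxes instead of
# returning inf/NaN deltas.
transform = Box2BoxTransform(weights=(1.0, 1.0, 1.0, 1.0))
src = torch.tensor([[10.0, 10.0, 10.0, 20.0]])  # x1 == x0 -> width 0
tgt = torch.tensor([[10.0, 10.0, 30.0, 20.0]])
print(transform.get_deltas(src, tgt))
```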
The second bug is as follows:
The third bug is as follows:
When it works, the distribution between GPUs is unbalanced, as follows:
Expected behavior
It should work with both single and multiple GPUs, but it feels quite unstable when I use multiple GPUs.
Environment
Please paste the output of python -m detectron2.utils.collect_env.