(need help) failed to train model with mmdetection #6394
Comments
Please follow the issue template to provide more details.
I followed the tutorial at https://github.com/open-mmlab/mmdetection/blob/master/demo/MMDet_Tutorial.ipynb. My dataset is in VOC format with a single class, but after running the following code from the notebook I get 20 classes instead of one. I edited voc512.py and class_names.py, but I still get 20 classes. `from mmdet.datasets import build_dataset` `# Build dataset` `datasets = [build_dataset(cfg.data.train)]` `# Build the detector` `model = build_detector(` `# Add an attribute for visualization convenience` `datasets[0].CLASSES` `from mmcv import Config` Config:
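For the 20-class symptom described above, one approach commonly used with MMDetection 2.x is to override the class names in the config rather than editing library files. The thread does not confirm that this resolved the problem, so the following is only a sketch. The helper `set_classes` and the class name `my_class` are hypothetical, and plain dicts stand in for `cfg.data.train`. The helper descends through wrapper configs that nest the real dataset under a `dataset` key, which is an assumption about how the VOC config in question is laid out.

```python
def set_classes(dataset_cfg, classes):
    """Set `classes` on a dataset config (hypothetical helper).

    Descends through wrapper configs that keep the real dataset
    config under a "dataset" key, then writes the class tuple there.
    """
    inner = dataset_cfg
    # Wrappers such as a repeat-style dataset nest the real config one level down.
    while "dataset" in inner:
        inner = inner["dataset"]
    inner["classes"] = tuple(classes)
    return dataset_cfg


# Plain dicts stand in for cfg.data.train; "my_class" is a placeholder name.
train_cfg = {
    "type": "RepeatDataset",
    "times": 10,
    "dataset": {"type": "VOCDataset"},
}
set_classes(train_cfg, ["my_class"])
```

With a real config, the same change would presumably be applied to the train, val, and test entries, and the detection head's `num_classes` would also need to be set to 1. The exact key path for `num_classes` depends on the detector, so check it against the config being used.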
I have already updated the issue. Thanks~
There are many potential causes. It may be the CUDA and NVIDIA driver versions, since some versions have compatibility issues on the A100. Try upgrading your GPU driver and CUDA. It is also possible that the GPU itself is faulty.
GPU driver version: 460.27.04. The machine and system worked well when training yolov5. I also tried to compile mmcv locally, but hit a compiler version error. Is there a version restriction for mmcv on the A100?
So the same code works fine on a V100 but fails on an A100, while yolov5 runs in the same environment. The CUDA version is the most likely cause. Try CUDA 11.1 or higher. I cannot be sure, though, because I do not have an A100 to reproduce the error. Just give it a try.
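One way to probe the CUDA-version hypothesis above is to check whether the installed binaries were built for the GPU's compute capability (8.0 for the A100). The following is a minimal sketch, not something taken from the thread. The function names are hypothetical, and the check only covers the basic rule that an `sm_XY` binary runs on devices with the same major version and an equal or higher minor version, while `compute_XY` PTX can be JIT-compiled for any equal or higher capability.

```python
import re

# Matches tags such as "sm_80" or "compute_70"; the last digit is the minor version.
_TAG = re.compile(r"^(sm|compute)_(\d+)(\d)[a-z]?$")


def parse_arch(tag):
    """Split an architecture tag into its kind and (major, minor) capability."""
    match = _TAG.match(tag)
    if match is None:
        raise ValueError(f"unrecognized architecture tag: {tag!r}")
    kind, major, minor = match.groups()
    return kind, (int(major), int(minor))


def is_supported(capability, arch_list):
    """Return True if any tag in arch_list can run on a device with `capability`."""
    major, minor = capability
    for tag in arch_list:
        kind, (a_major, a_minor) = parse_arch(tag)
        # Binary code: same major version, minor not newer than the device.
        if kind == "sm" and a_major == major and a_minor <= minor:
            return True
        # PTX: can be JIT-compiled for any equal or newer capability.
        if kind == "compute" and (a_major, a_minor) <= (major, minor):
            return True
    return False
```

With PyTorch installed, the inputs could come from `torch.cuda.get_device_capability(0)` and `torch.cuda.get_arch_list()`, assuming the installed version provides both. A `False` result would suggest the build lacks kernels for the GPU. A `True` result does not rule out other driver or toolkit problems.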
With PyTorch 1.10 and MMCV 1.3.15, after building and installing MMCV and MMDetection from source by following the guide, everything works now. Thank you~
There is no issue template for General questions, so I used the Error report issue template as follows:
Thanks for your error report and we appreciate it a lot.
Checklist
Describe the bug
On a V100-16G machine everything works, but on an A100 machine errors are reported after a few training steps. (Sorry for my bad English...)
Reproduction
here is the
coco_config.py
coco
Environment
sys.platform: linux
Python: 3.7.9 (default, Aug 31 2020, 12:42:55) [GCC 7.3.0]
CUDA available: True
GPU 0,1: A100-SXM4-40GB
CUDA_HOME: /usr/local/cuda
NVCC: Build cuda_11.0_bu.TC445_37.28845127_0
GCC: gcc (GCC) 5.4.0
PyTorch: 1.7.0
PyTorch compiling details: PyTorch built with:
TorchVision: 0.8.0
OpenCV: 4.5.4-dev
MMCV: 1.3.15
MMCV Compiler: GCC 5.4
MMCV CUDA Compiler: 11.0
MMDetection: 2.17.0+a5054bd
install pytorch method:
Error traceback
If applicable, paste the error traceback here.