(need help)failed to train model with mmdetection #6394

WepLeo · 2021-10-28T06:38:58Z

There is no issue template under General questions, so I am using the Error report template as follows.

Thanks for your error report and we appreciate it a lot.

Checklist

I have searched related issues but cannot get the expected help.
I have read the FAQ documentation but cannot get the expected help.
The bug has not been fixed in the latest version.

Describe the bug
If I use a V100-16G machine, everything works, but on an A100 machine training reports errors after running a few steps. (Sorry for my bad English...)

Reproduction

What command or script did you run?

python tools/train.py configs/coco/coco_config.py

Did you make any modifications on the code or config? Did you understand what you have modified?
here is the coco_config.py

_base_ = '../fcos/fcos_center-normbbox-centeronreg-giou_r50_caffe_fpn_gn-head_1x_coco.py'
model = dict(
    backbone=dict(
        init_cfg=dict(
            type='Pretrained',
            checkpoint='./checkpoints/resnet50_caffe-788b5fa3.pth')))

# Modify dataset related settings
dataset_type = 'CocoDataset'
data_root = '/workdir/wepleo/data/open_datasets/coco/'
data = dict(
    train=dict(
        type=dataset_type,
        ann_file=data_root + 'Annotations/instances_train2017.json',
        img_prefix=data_root + 'images/train'),
    val=dict(
        type=dataset_type,
        ann_file=data_root + 'Annotations/instances_val2017.json',
        img_prefix=data_root + 'images/val'),
    test=dict(
        type=dataset_type,
        ann_file=data_root + 'Annotations/instances_val2017.json',
        img_prefix=data_root + 'images/val'))

optimizer_config = dict(_delete_=True, grad_clip=dict(max_norm=35, norm_type=2))

# We can use the pre-trained model to obtain higher performance
load_from = 'checkpoints/fcos_center-normbbox-centeronreg-giou_r50_caffe_fpn_gn-head_1x_coco-0a0d75a8.pth'

What dataset did you use?
coco

Environment
sys.platform: linux
Python: 3.7.9 (default, Aug 31 2020, 12:42:55) [GCC 7.3.0]
CUDA available: True
GPU 0,1: A100-SXM4-40GB
CUDA_HOME: /usr/local/cuda
NVCC: Build cuda_11.0_bu.TC445_37.28845127_0
GCC: gcc (GCC) 5.4.0
PyTorch: 1.7.0
PyTorch compiling details: PyTorch built with:

GCC 7.3
C++ Version: 201402
Intel(R) oneAPI Math Kernel Library Version 2021.3-Product Build 20210617 for Intel(R) 64 architecture applications
Intel(R) MKL-DNN v1.6.0 (Git Hash 5ef631a030a6f73131c77892041042805a06064f)
OpenMP 201511 (a.k.a. OpenMP 4.5)
NNPACK is enabled
CPU capability usage: AVX2
CUDA Runtime 11.0
NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_37,code=compute_37
CuDNN 8.0.3
Magma 2.5.2
Build settings: BLAS=MKL, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DUSE_VULKAN_WRAPPER -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,

TorchVision: 0.8.0
OpenCV: 4.5.4-dev
MMCV: 1.3.15
MMCV Compiler: GCC 5.4
MMCV CUDA Compiler: 11.0
MMDetection: 2.17.0+a5054bd

PyTorch install method:

conda install pytorch==1.7.0 torchvision==0.8.0 torchaudio==0.7.0 cudatoolkit=11.0

Error traceback
If applicable, paste the error traceback here.


2021-10-29 11:32:21,262 - mmdet - INFO - Epoch [1][70/58633]	lr: 4.253e-03, eta: 2 days, 14:19:30, time: 0.273, data_time: 0.045, memory: 3046, loss_cls: 0.8306, loss_bbox: 0.5608, loss_centerness: 0.6810, loss: 2.0725
Traceback (most recent call last):
  File "tools/train.py", line 189, in <module>
    main()
  File "tools/train.py", line 185, in main
    meta=meta)
  File "/home/wepleo/code/mmdetection/mmdetection-2.17.0/mmdet/apis/train.py", line 174, in train_detector
    runner.run(data_loaders, cfg.workflow)
  File "/home/wepleo/code/mmdetection/mmcv-1.3.15/mmcv/runner/epoch_based_runner.py", line 127, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/home/wepleo/code/mmdetection/mmcv-1.3.15/mmcv/runner/epoch_based_runner.py", line 51, in train
    self.call_hook('after_train_iter')
  File "/home/wepleo/code/mmdetection/mmcv-1.3.15/mmcv/runner/base_runner.py", line 307, in call_hook
    getattr(hook, fn_name)(self)
  File "/home/wepleo/code/mmdetection/mmcv-1.3.15/mmcv/runner/hooks/optimizer.py", line 35, in after_train_iter
    runner.outputs['loss'].backward()
  File "/workdir/anaconda3/envs/mmlab/lib/python3.7/site-packages/torch/tensor.py", line 221, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/workdir/anaconda3/envs/mmlab/lib/python3.7/site-packages/torch/autograd/__init__.py", line 132, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: CUDA error: an illegal memory access was encountered
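
A note for readers debugging the same traceback: CUDA kernels are launched asynchronously, so the Python frame where an illegal memory access surfaces (here `loss.backward()`) is not necessarily the operation that caused it. Below is a minimal sketch, assuming the training script is launched as in the Reproduction section, that reruns it with `CUDA_LAUNCH_BLOCKING=1`, the standard CUDA environment variable that forces synchronous launches. The helper names `build_debug_env`, `train_command` and `run_debug_training` are hypothetical and not part of MMDetection.

```python
import os
import subprocess
import sys


def build_debug_env(base_env=None):
    """Copy the environment and force synchronous CUDA kernel launches."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def train_command(config_path):
    """Command line from the Reproduction section, using the current interpreter."""
    return [sys.executable, "tools/train.py", config_path]


def run_debug_training(config_path):
    """Launch training with synchronous launches (run from the mmdetection checkout)."""
    return subprocess.run(train_command(config_path), env=build_debug_env())
```

Calling `run_debug_training("configs/coco/coco_config.py")` trains more slowly, but the resulting traceback should point closer to the failing operation.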


RangiLyu · 2021-10-28T08:51:41Z

Please follow the issue template to provide more details.

riyaj8888 · 2021-10-28T14:05:06Z

I followed this tutorial "https://github.com/open-mmlab/mmdetection/blob/master/demo/MMDet_Tutorial.ipynb"

I have a dataset in VOC format with one class, but after running the following code from the notebook I get 20 classes instead of one. I made changes in voc512.py and class_names.py, but I still get 20 classes.

from mmdet.datasets import build_dataset
from mmdet.models import build_detector
from mmdet.apis import train_detector

# Build dataset
datasets = [build_dataset(cfg.data.train)]

# Build the detector
model = build_detector(
    cfg.model, train_cfg=cfg.get('train_cfg'), test_cfg=cfg.get('test_cfg'))

# Add an attribute for visualization convenience
datasets[0].CLASSES

from mmcv import Config
cfg = Config.fromfile('/content/mmdetection/configs/pascal_voc/faster_rcnn_r50_fpn_1x_voc0712.py')
print(f'Config:\n{cfg.pretty_text}')

Config:
model = dict(
    type='FasterRCNN',
    backbone=dict(
        type='ResNet',
        depth=50,
        num_stages=4,
        out_indices=(0, 1, 2, 3),
        frozen_stages=1,
        norm_cfg=dict(type='BN', requires_grad=True),
        norm_eval=True,
        style='pytorch',
        init_cfg=dict(type='Pretrained', checkpoint='torchvision://resnet50')),
    neck=dict(
        type='FPN',
        in_channels=[256, 512, 1024, 2048],
        out_channels=256,
        num_outs=5),
    rpn_head=dict(
        type='RPNHead',
        in_channels=256,
        feat_channels=256,
        anchor_generator=dict(
            type='AnchorGenerator',
            scales=[8],
            ratios=[0.5, 1.0, 2.0],
            strides=[4, 8, 16, 32, 64]),
        bbox_coder=dict(
            type='DeltaXYWHBBoxCoder',
            target_means=[0.0, 0.0, 0.0, 0.0],
            target_stds=[1.0, 1.0, 1.0, 1.0]),
        loss_cls=dict(
            type='CrossEntropyLoss', use_sigmoid=True, loss_weight=1.0),
        loss_bbox=dict(type='L1Loss', loss_weight=1.0)),
    roi_head=dict(
        type='StandardRoIHead',
        bbox_roi_extractor=dict(
            type='SingleRoIExtractor',
            roi_layer=dict(type='RoIAlign', output_size=7, sampling_ratio=0),
            out_channels=256,
            featmap_strides=[4, 8, 16, 32]),
        bbox_head=dict(
            type='Shared2FCBBoxHead',
            in_channels=256,
            fc_out_channels=1024,
            roi_feat_size=7,
            num_classes=1,
            bbox_coder=dict(
                type='DeltaXYWHBBoxCoder',
                target_means=[0.0, 0.0, 0.0, 0.0],
                target_stds=[0.1, 0.1, 0.2, 0.2]),
            reg_class_agnostic=False,
            loss_cls=dict(
                type='CrossEntropyLoss', use_sigmoid=False, loss_weight=1.0),
            loss_bbox=dict(type='L1Loss', loss_weight=1.0))),
    train_cfg=dict(
        rpn=dict(
            assigner=dict(
                type='MaxIoUAssigner',
                pos_iou_thr=0.7,
                neg_iou_thr=0.3,
                min_pos_iou=0.3,
                match_low_quality=True,
                ignore_iof_thr=-1),
            sampler=dict(
                type='RandomSampler',
                num=256,
                pos_fraction=0.5,
                neg_pos_ub=-1,
                add_gt_as_proposals=False),
            allowed_border=-1,
            pos_weight=-1,
            debug=False),
        rpn_proposal=dict(
            nms_pre=2000,
            max_per_img=1000,
            nms=dict(type='nms', iou_threshold=0.7),
            min_bbox_size=0),
        rcnn=dict(
            assigner=dict(
                type='MaxIoUAssigner',
                pos_iou_thr=0.5,
                neg_iou_thr=0.5,
                min_pos_iou=0.5,
                match_low_quality=False,
                ignore_iof_thr=-1),
            sampler=dict(
                type='RandomSampler',
                num=512,
                pos_fraction=0.25,
                neg_pos_ub=-1,
                add_gt_as_proposals=True),
            pos_weight=-1,
            debug=False)),
    test_cfg=dict(
        rpn=dict(
            nms_pre=1000,
            max_per_img=1000,
            nms=dict(type='nms', iou_threshold=0.7),
            min_bbox_size=0),
        rcnn=dict(
            score_thr=0.05,
            nms=dict(type='nms', iou_threshold=0.5),
            max_per_img=100)))
dataset_type = 'VOCDataset'
data_root = '/content/mmdetection/'
img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations', with_bbox=True),
    dict(type='Resize', img_scale=(1000, 600), keep_ratio=True),
    dict(type='RandomFlip', flip_ratio=0.5),
    dict(
        type='Normalize',
        mean=[123.675, 116.28, 103.53],
        std=[58.395, 57.12, 57.375],
        to_rgb=True),
    dict(type='Pad', size_divisor=32),
    dict(type='DefaultFormatBundle'),
    dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels'])
]
test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(
        type='MultiScaleFlipAug',
        img_scale=(1000, 600),
        flip=False,
        transforms=[
            dict(type='Resize', keep_ratio=True),
            dict(type='RandomFlip'),
            dict(
                type='Normalize',
                mean=[123.675, 116.28, 103.53],
                std=[58.395, 57.12, 57.375],
                to_rgb=True),
            dict(type='Pad', size_divisor=32),
            dict(type='ImageToTensor', keys=['img']),
            dict(type='Collect', keys=['img'])
        ])
]
data = dict(
    samples_per_gpu=2,
    workers_per_gpu=2,
    train=dict(
        type='RepeatDataset',
        times=3,
        dataset=dict(
            type='VOCDataset',
            ann_file=[
                '/content/mmdetection/VOC2007/ImageSets/Main/trainval.txt'
            ],
            img_prefix=['/content/mmdetection/VOC2007/'],
            pipeline=[
                dict(type='LoadImageFromFile'),
                dict(type='LoadAnnotations', with_bbox=True),
                dict(type='Resize', img_scale=(1000, 600), keep_ratio=True),
                dict(type='RandomFlip', flip_ratio=0.5),
                dict(
                    type='Normalize',
                    mean=[123.675, 116.28, 103.53],
                    std=[58.395, 57.12, 57.375],
                    to_rgb=True),
                dict(type='Pad', size_divisor=32),
                dict(type='DefaultFormatBundle'),
                dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels'])
            ])),
    val=dict(
        type='VOCDataset',
        ann_file='/content/mmdetection/VOC2007/ImageSets/Main/test.txt',
        img_prefix='/content/mmdetection/VOC2007/',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(
                type='MultiScaleFlipAug',
                img_scale=(1000, 600),
                flip=False,
                transforms=[
                    dict(type='Resize', keep_ratio=True),
                    dict(type='RandomFlip'),
                    dict(
                        type='Normalize',
                        mean=[123.675, 116.28, 103.53],
                        std=[58.395, 57.12, 57.375],
                        to_rgb=True),
                    dict(type='Pad', size_divisor=32),
                    dict(type='ImageToTensor', keys=['img']),
                    dict(type='Collect', keys=['img'])
                ])
        ]),
    test=dict(
        type='VOCDataset',
        ann_file='/content/mmdetection/VOC2007/ImageSets/Main/test.txt',
        img_prefix='/content/mmdetection/VOC2007/',
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(
                type='MultiScaleFlipAug',
                img_scale=(1000, 600),
                flip=False,
                transforms=[
                    dict(type='Resize', keep_ratio=True),
                    dict(type='RandomFlip'),
                    dict(
                        type='Normalize',
                        mean=[123.675, 116.28, 103.53],
                        std=[58.395, 57.12, 57.375],
                        to_rgb=True),
                    dict(type='Pad', size_divisor=32),
                    dict(type='ImageToTensor', keys=['img']),
                    dict(type='Collect', keys=['img'])
                ])
        ]))
evaluation = dict(interval=1, metric='mAP')
checkpoint_config = dict(interval=1)
log_config = dict(interval=50, hooks=[dict(type='TextLoggerHook')])
dist_params = dict(backend='nccl')
log_level = 'INFO'
load_from = None
resume_from = None
workflow = [('train', 1)]
optimizer = dict(type='SGD', lr=0.01, momentum=0.9, weight_decay=0.0001)
optimizer_config = dict(grad_clip=None)
lr_config = dict(policy='step', step=[3])
runner = dict(type='EpochBasedRunner', max_epochs=4)
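
A possible cause, not confirmed anywhere in this thread: the dumped config sets `num_classes=1` on the model, but none of the dataset dicts carries a `classes` entry, so `VOCDataset` would fall back to its built-in 20 VOC class names. A minimal sketch of setting the MMDetection 2.x dataset option `classes`, using plain dicts that mirror only the relevant keys of the config above. The class name `'my_object'` is a placeholder assumption, not taken from the post.

```python
# Placeholder class name; replace it with the label used in the VOC XML files.
classes = ('my_object',)  # the trailing comma keeps this a one-element tuple

# Plain-dict stand-in that mirrors only the relevant keys of the dumped config.
data = dict(
    train=dict(
        type='RepeatDataset',
        times=3,
        dataset=dict(type='VOCDataset')),
    val=dict(type='VOCDataset'),
    test=dict(type='VOCDataset'))

# RepeatDataset is only a wrapper, so the option goes on the inner dataset.
data['train']['dataset']['classes'] = classes
data['val']['classes'] = classes
data['test']['classes'] = classes
```

With the mmcv `Config` object from the notebook, the equivalent assignments would be `cfg.data.train.dataset.classes = classes` (and likewise for `cfg.data.val` and `cfg.data.test`), made before `build_dataset` is called.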

WepLeo · 2021-10-29T03:36:28Z

> Please follow the issue template to provide more details.

I have already updated the issue. Thanks~

RangiLyu · 2021-11-01T02:40:37Z

> Describe the bug
> If I use a V100-16G machine, everything works, but on an A100 machine training reports errors after running a few steps. (Sorry for my bad English...)

There are many possible causes, most likely the CUDA and NVIDIA driver versions, since some versions have compatibility issues on the A100. Try upgrading your GPU driver and CUDA. It is also possible that the GPU itself is faulty.

WepLeo · 2021-11-01T03:00:13Z

>> Describe the bug
>> If I use a V100-16G machine, everything works, but on an A100 machine training reports errors after running a few steps. (Sorry for my bad English...)
>
> There are many possible causes, most likely the CUDA and NVIDIA driver versions, since some versions have compatibility issues on the A100. Try upgrading your GPU driver and CUDA. It is also possible that the GPU itself is faulty.

GPU driver version: 460.27.04. The machine and the system work fine when training yolov5. I also tried to compile mmcv locally, but hit a compiler version error. Is there an mmcv version restriction on the A100?

RangiLyu · 2021-11-01T03:17:21Z

>>> Describe the bug
>>> If I use a V100-16G machine, everything works, but on an A100 machine training reports errors after running a few steps. (Sorry for my bad English...)
>>
>> There are many possible causes, most likely the CUDA and NVIDIA driver versions, since some versions have compatibility issues on the A100. Try upgrading your GPU driver and CUDA. It is also possible that the GPU itself is faulty.
>
> GPU driver version: 460.27.04. The machine and the system work fine when training yolov5. I also tried to compile mmcv locally, but hit a compiler version error. Is there an mmcv version restriction on the A100?

So the same code works fine on the V100 but fails on the A100, while yolov5 runs in the same environment. The CUDA version is the most likely cause. Try CUDA 11.1 or a higher version. I cannot be sure, because I do not have an A100 to reproduce this error, so just give it a try.
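
The suggestion above can be turned into a quick check against an environment report like the one in the issue. A minimal sketch under stated assumptions: the `(11, 1)` threshold is only the suggestion made in this thread, not a documented requirement, and the helper names `parse_version`, `meets_minimum` and `targets_arch` are hypothetical.

```python
import re


def parse_version(text):
    """Return the first dotted number found in text as a tuple of ints."""
    match = re.search(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError(f"no version number in {text!r}")
    return tuple(int(part) for part in match.group().split("."))


def meets_minimum(version_text, minimum=(11, 1)):
    """True if the reported version is at least the suggested minimum."""
    return parse_version(version_text) >= minimum


def targets_arch(nvcc_flags, arch="sm_80"):
    """True if a semicolon-separated NVCC flag string generates code for arch."""
    return any(flag.endswith(f"code={arch}") for flag in nvcc_flags.split(";"))
```

Applied to the original report, `targets_arch` finds `sm_80` (the A100 architecture) in the PyTorch NVCC flags, while `meets_minimum("CUDA Runtime 11.0")` is false. So the binary nominally targets the A100 but sits below the suggested CUDA version.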

WepLeo · 2021-11-01T12:09:00Z

pytorch 1.10
cudatoolkit=11.3

MMCV: 1.3.15
MMCV Compiler: GCC 5.4
MMCV CUDA Compiler: 11.0
MMDetection: 2.17.0+a5054bd

After building and installing MMCV and MMDetection from source following the guide, everything works now. Thank you~

openmmlab-bot assigned RangiLyu Oct 28, 2021

WepLeo closed this as completed Nov 1, 2021
