CUDA error: an illegal memory access was encountered #270

Closed · bkbjsd opened this issue Nov 21, 2020 · 8 comments
bkbjsd commented Nov 21, 2020

xvjiarui (Collaborator) commented

Hi @bkbjsd
Which config are you using?
You may check whether you set num_classes correctly.
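
For instance, a minimal sketch to print what the config resolves to (the config path below is a placeholder, substitute your own file):

from mmcv import Config

cfg = Config.fromfile('path/to/your_config.py')  # placeholder path
# Both heads must predict exactly as many channels as your dataset has classes.
print('decode_head num_classes:', cfg.model.decode_head.num_classes)
print('auxiliary_head num_classes:', cfg.model.auxiliary_head.num_classes)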

@xvjiarui xvjiarui self-assigned this Nov 23, 2020
bkbjsd (Author) commented Nov 24, 2020

I checked it and retried several times; it does not seem to be the num_classes problem.
I am pasting the whole environment, config, and error message:

Run:

(py38_source) # mmsegmentation_ python tools/train.py --no-validate configs/luke/pspnet_ABCDataset.py

2020-11-24 14:53:47,964 - mmseg - INFO - Environment info:

sys.platform: linux
Python: 3.8.5 (default, Sep 4 2020, 07:30:14) [GCC 7.3.0]
CUDA available: True
GPU 0: GeForce RTX 3090
CUDA_HOME: /usr/local/cuda-11.1
NVCC: Build cuda_11.1.TC455_06.29069683_0
GCC: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
PyTorch: 1.8.0a0+8819bad
PyTorch compiling details: PyTorch built with:

  • GCC 9.3
  • C++ Version: 201402
  • Intel(R) Math Kernel Library Version 2020.0.2 Product Build 20200624 for Intel(R) 64 architecture applications
  • Intel(R) MKL-DNN v1.6.0 (Git Hash 5ef631a030a6f73131c77892041042805a06064f)
  • OpenMP 201511 (a.k.a. OpenMP 4.5)
  • NNPACK is enabled
  • CPU capability usage: AVX2
  • CUDA Runtime 11.1
  • NVCC architecture flags: -gencode;arch=compute_86,code=sm_86
  • CuDNN 8.0.5
  • Magma 2.5.2
  • Build settings: BLAS=MKL, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DUSE_VULKAN_WRAPPER -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,

TorchVision: 0.9.0.dev20201118
OpenCV: 4.4.0
MMCV: 1.2.6
MMCV Compiler: GCC 9.3
MMCV CUDA Compiler: 11.1
MMSegmentation: 0.8.0+


2020-11-24 14:53:47,964 - mmseg - INFO - Distributed training: False
2020-11-24 14:53:48,710 - mmseg - INFO - Config:
ABC_data_root = '/home//code/remote/mmsegmentation_/data/ABC_58G'
ABC_img_dir = 'images'
ABC_ann_dir = 'labels'
ABC_split_dir = 'splits'
ABC_work_dir = '/home//code/remote/mmsegmentation_/work_dirs/ABC_exp'
ABC_cfg_from_file = '/home//code/remote/mmsegmentation_/configs/pspnet/pspnet_r50-d8_512x1024_40k_cityscapes.py'
ABC_checkpoint_load_from = '/home//code/remote/mmsegmentation_/checkpoints/pspnet_r50-d8_512x1024_40k_cityscapes_20200605_003338-2966598c.pth'
ABC_classes = ('void', 's_w_d', 's_y_d', 'ds_w_dn', 'ds_y_dn', 'sb_w_do',
'sb_y_do', 'b_w_g', 'b_y_g', 'db_w_g', 'db_y_g', 'db_w_s',
's_w_s', 'ds_w_s', 's_w_c', 's_y_c', 's_w_p', 's_n_p',
'c_wy_z', 'a_w_u', 'a_w_t', 'a_w_tl', 'a_w_tr', 'a_w_tlr',
'a_w_l', 'a_w_r', 'a_w_lr', 'a_n_lu', 'a_w_tu', 'a_w_m',
'a_y_t', 'b_n_sr', 'd_wy_za', 'r_wy_np', 'vom_wy_n',
'om_n_n', 'noise', 'ignored')
ABC_palette = [[0, 0, 0], [70, 130, 180], [220, 20, 60], [128, 0, 128],
[255, 0, 0], [0, 0, 60], [0, 60, 100], [0, 0, 142],
[119, 11, 32], [244, 35, 232], [0, 0, 160], [153, 153, 153],
[220, 220, 0], [250, 170, 30], [102, 102, 156], [128, 0, 0],
[128, 64, 128],
[238, 232, 170], [190, 153, 153], [0, 0, 230], [128, 128, 0],
[128, 78, 160], [150, 100, 100], [255, 165, 0],
[180, 165, 180], [107, 142, 35], [201, 255, 229],
[0, 191, 255], [51, 255, 51], [250, 128, 114], [127, 255, 0],
[255, 128, 0], [0, 255, 255], [178, 132,
190], [128, 128, 64],
[102, 0, 204], [0, 153, 153], [255, 255, 255]]
ABC_num_classes = 38
ABC_samples_per_gpu = 8
ABC_workers_per_gpu = 1
ABC_crop_size = (758, 768)
dataset_type = 'ABCDataset'
data_root = '/home//code/remote/mmsegmentation_/data/ABC_58G'
work_dir = '/home//code/remote/mmsegmentation_/work_dirs/ABC_exp_202011241453'
img_norm_cfg = dict(
mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
crop_size = (758, 768)
train_pipeline = [
dict(type='LoadImageFromFile'),
dict(type='LoadAnnotations'),
dict(type='Resize', img_scale=(3384, 2710), ratio_range=(0.5, 2.0)),
dict(type='RandomCrop', crop_size=(758, 768), cat_max_ratio=0.75),
dict(type='RandomFlip', prob=0.5),
dict(type='PhotoMetricDistortion'),
dict(
type='Normalize',
mean=[123.675, 116.28, 103.53],
std=[58.395, 57.12, 57.375],
to_rgb=True),
dict(type='Pad', size=(758, 768), pad_val=0, seg_pad_val=255),
dict(type='DefaultFormatBundle'),
dict(type='Collect', keys=['img', 'gt_semantic_seg'])
]
test_pipeline = [
dict(type='LoadImageFromFile'),
dict(
type='MultiScaleFlipAug',
img_scale=(3384, 2710),
flip=False,
transforms=[
dict(type='Resize', keep_ratio=True),
dict(type='RandomFlip'),
dict(
type='Normalize',
mean=[123.675, 116.28, 103.53],
std=[58.395, 57.12, 57.375],
to_rgb=True),
dict(type='ImageToTensor', keys=['img']),
dict(type='Collect', keys=['img'])
])
]
data = dict(
samples_per_gpu=8,
workers_per_gpu=1,
train=dict(
type='ABCDataset',
data_root=
'/home//code/remote/mmsegmentation_/data/ABC_58G',
img_dir='images',
ann_dir='labels',
pipeline=[
dict(type='LoadImageFromFile'),
dict(type='LoadAnnotations'),
dict(
type='Resize', img_scale=(3384, 2710), ratio_range=(0.5, 2.0)),
dict(type='RandomCrop', crop_size=(758, 768), cat_max_ratio=0.75),
dict(type='RandomFlip', prob=0.5),
dict(type='PhotoMetricDistortion'),
dict(
type='Normalize',
mean=[123.675, 116.28, 103.53],
std=[58.395, 57.12, 57.375],
to_rgb=True),
dict(type='Pad', size=(758, 768), pad_val=0, seg_pad_val=255),
dict(type='DefaultFormatBundle'),
dict(type='Collect', keys=['img', 'gt_semantic_seg'])
],
split='splits/train.txt'),
val=dict(
type='ABCDataset',
data_root=
'/home//code/remote/mmsegmentation_/data/ABC_58G',
img_dir='images',
ann_dir='labels',
pipeline=[
dict(type='LoadImageFromFile'),
dict(
type='MultiScaleFlipAug',
img_scale=(3384, 2710),
flip=False,
transforms=[
dict(type='Resize', keep_ratio=True),
dict(type='RandomFlip'),
dict(
type='Normalize',
mean=[123.675, 116.28, 103.53],
std=[58.395, 57.12, 57.375],
to_rgb=True),
dict(type='ImageToTensor', keys=['img']),
dict(type='Collect', keys=['img'])
])
],
split='splits/val.txt'),
test=dict(
type='ABCDataset',
data_root=
'/home//code/remote/mmsegmentation_/data/ABC_58G',
img_dir='images',
ann_dir='labels',
pipeline=[
dict(type='LoadImageFromFile'),
dict(
type='MultiScaleFlipAug',
img_scale=(3384, 2710),
flip=False,
transforms=[
dict(type='Resize', keep_ratio=True),
dict(type='RandomFlip'),
dict(
type='Normalize',
mean=[123.675, 116.28, 103.53],
std=[58.395, 57.12, 57.375],
to_rgb=True),
dict(type='ImageToTensor', keys=['img']),
dict(type='Collect', keys=['img'])
])
],
split='splits/val.txt'))
norm_cfg = dict(type='BN', requires_grad=True)
model = dict(
type='EncoderDecoder',
pretrained='open-mmlab://resnet50_v1c',
backbone=dict(
type='ResNetV1c',
depth=50,
num_stages=4,
out_indices=(0, 1, 2, 3),
dilations=(1, 1, 2, 4),
strides=(1, 2, 1, 1),
norm_cfg=dict(type='BN', requires_grad=True),
norm_eval=False,
style='pytorch',
contract_dilation=True),
decode_head=dict(
type='PSPHead',
in_channels=2048,
in_index=3,
channels=512,
pool_scales=(1, 2, 3, 6),
dropout_ratio=0.1,
num_classes=38,
norm_cfg=dict(type='BN', requires_grad=True),
align_corners=False,
loss_decode=dict(
type='CrossEntropyLoss', use_sigmoid=False, loss_weight=1.0)),
auxiliary_head=dict(
type='FCNHead',
in_channels=1024,
in_index=2,
channels=256,
num_convs=1,
concat_input=False,
dropout_ratio=0.1,
num_classes=38,
norm_cfg=dict(type='BN', requires_grad=True),
align_corners=False,
loss_decode=dict(
type='CrossEntropyLoss', use_sigmoid=False, loss_weight=0.4)))
train_cfg = dict()
test_cfg = dict(mode='whole')
optimizer = dict(type='SGD', lr=0.01, momentum=0.9, weight_decay=0.0005)
optimizer_config = dict()
lr_config = dict(policy='poly', power=0.9, min_lr=0.0001, by_epoch=False)
runner = dict(type='IterBasedRunner', max_iters=40000)
checkpoint_config = dict(by_epoch=False, interval=200)
evaluation = dict(interval=200, metric='mIoU')
log_config = dict(
interval=20, hooks=[dict(type='TextLoggerHook', by_epoch=False)])
dist_params = dict(backend='nccl')
log_level = 'INFO'
load_from = '/home//code/remote/mmsegmentation_/checkpoints/pspnet_r50-d8_512x1024_40k_cityscapes_20200605_003338-2966598c.pth'
resume_from = None
workflow = [('train', 1)]
cudnn_benchmark = True
gpu_ids = range(0, 1)

2020-11-24 14:53:49,047 - mmseg - WARNING - The model and loaded state dict do not match exactly

unexpected key in source state_dict: fc.weight, fc.bias


2020-11-24 14:53:49,049 - mmseg - INFO - EncoderDecoder(
(backbone): ResNetV1c(
(stem): Sequential(
(0): Conv2d(3, 32, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
(1): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): ReLU(inplace=True)
(3): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(4): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(5): ReLU(inplace=True)
(6): Conv2d(32, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(7): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(8): ReLU(inplace=True)
)
(maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
(layer1): ResLayer(
(0): Bottleneck(
(conv1): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(downsample): Sequential(
(0): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(1): Bottleneck(
(conv1): Conv2d(256, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
)
(2): Bottleneck(
(conv1): Conv2d(256, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
)
)
(layer2): ResLayer(
(0): Bottleneck(
(conv1): Conv2d(256, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): Conv2d(128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(downsample): Sequential(
(0): Conv2d(256, 512, kernel_size=(1, 1), stride=(2, 2), bias=False)
(1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(1): Bottleneck(
(conv1): Conv2d(512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): Conv2d(128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
)
(2): Bottleneck(
(conv1): Conv2d(512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): Conv2d(128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
)
(3): Bottleneck(
(conv1): Conv2d(512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): Conv2d(128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
)
)
(layer3): ResLayer(
(0): Bottleneck(
(conv1): Conv2d(512, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(downsample): Sequential(
(0): Conv2d(512, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
(1): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(1): Bottleneck(
(conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(2, 2), dilation=(2, 2), bias=False)
(bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
)
(2): Bottleneck(
(conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(2, 2), dilation=(2, 2), bias=False)
(bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
)
(3): Bottleneck(
(conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(2, 2), dilation=(2, 2), bias=False)
(bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
)
(4): Bottleneck(
(conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(2, 2), dilation=(2, 2), bias=False)
(bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
)
(5): Bottleneck(
(conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(2, 2), dilation=(2, 2), bias=False)
(bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
)
)
(layer4): ResLayer(
(0): Bottleneck(
(conv1): Conv2d(1024, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(2, 2), dilation=(2, 2), bias=False)
(bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): Conv2d(512, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): BatchNorm2d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(downsample): Sequential(
(0): Conv2d(1024, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False)
(1): BatchNorm2d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(1): Bottleneck(
(conv1): Conv2d(2048, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(4, 4), dilation=(4, 4), bias=False)
(bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): Conv2d(512, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): BatchNorm2d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
)
(2): Bottleneck(
(conv1): Conv2d(2048, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(4, 4), dilation=(4, 4), bias=False)
(bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): Conv2d(512, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): BatchNorm2d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
)
)
)
(decode_head): PSPHead(
input_transform=None, ignore_index=255, align_corners=False
(loss_decode): CrossEntropyLoss()
(conv_seg): Conv2d(512, 38, kernel_size=(1, 1), stride=(1, 1))
(dropout): Dropout2d(p=0.1, inplace=False)
(psp_modules): PPM(
(0): Sequential(
(0): AdaptiveAvgPool2d(output_size=1)
(1): ConvModule(
(conv): Conv2d(2048, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(activate): ReLU(inplace=True)
)
)
(1): Sequential(
(0): AdaptiveAvgPool2d(output_size=2)
(1): ConvModule(
(conv): Conv2d(2048, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(activate): ReLU(inplace=True)
)
)
(2): Sequential(
(0): AdaptiveAvgPool2d(output_size=3)
(1): ConvModule(
(conv): Conv2d(2048, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(activate): ReLU(inplace=True)
)
)
(3): Sequential(
(0): AdaptiveAvgPool2d(output_size=6)
(1): ConvModule(
(conv): Conv2d(2048, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(activate): ReLU(inplace=True)
)
)
)
(bottleneck): ConvModule(
(conv): Conv2d(4096, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(activate): ReLU(inplace=True)
)
)
(auxiliary_head): FCNHead(
input_transform=None, ignore_index=255, align_corners=False
(loss_decode): CrossEntropyLoss()
(conv_seg): Conv2d(256, 38, kernel_size=(1, 1), stride=(1, 1))
(dropout): Dropout2d(p=0.1, inplace=False)
(convs): Sequential(
(0): ConvModule(
(conv): Conv2d(1024, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(activate): ReLU(inplace=True)
)
)
)
)

Running Error Message:


2020-11-24 14:53:49,090 - mmseg - INFO - Loaded 26494 images
fatal: not a git repository (or any of the parent directories): .git
2020-11-24 14:53:50,657 - mmseg - INFO - load checkpoint from /home//code/remote/mmsegmentation_/checkpoints/pspnet_r50-d8_512x1024_40k_cityscapes_20200605_003338-2966598c.pth
2020-11-24 14:53:50,740 - mmseg - WARNING - The model and loaded state dict do not match exactly

size mismatch for decode_head.conv_seg.weight: copying a param with shape torch.Size([19, 512, 1, 1]) from checkpoint, the shape in current model is torch.Size([38, 512, 1, 1]).
size mismatch for decode_head.conv_seg.bias: copying a param with shape torch.Size([19]) from checkpoint, the shape in current model is torch.Size([38]).
size mismatch for auxiliary_head.conv_seg.weight: copying a param with shape torch.Size([19, 256, 1, 1]) from checkpoint, the shape in current model is torch.Size([38, 256, 1, 1]).
size mismatch for auxiliary_head.conv_seg.bias: copying a param with shape torch.Size([19]) from checkpoint, the shape in current model is torch.Size([38]).
2020-11-24 14:53:50,745 - mmseg - INFO - Start running, host: @ai-server24G, work_dir: /home//code/remote/mmsegmentation_/work_dirs/ABC_exp_202011241453
2020-11-24 14:53:50,745 - mmseg - INFO - workflow: [('train', 1)], max: 40000 iters
Traceback (most recent call last):
File "tools/train.py", line 166, in
main()
File "tools/train.py", line 155, in main
train_segmentor(
File "/home//code/remote/mmsegmentation_/mmseg/apis/train.py", line 116, in train_segmentor
runner.run(data_loaders, cfg.workflow)
File "/home//code/remote/mmcv_/mmcv/runner/iter_based_runner.py", line 130, in run
iter_runner(iter_loaders[i], **kwargs)
File "/home//code/remote/mmcv_/mmcv/runner/iter_based_runner.py", line 60, in train
outputs = self.model.train_step(data_batch, self.optimizer, **kwargs)
File "/home//code/remote/mmcv_/mmcv/parallel/data_parallel.py", line 67, in train_step
return self.module.train_step(*inputs[0], **kwargs[0])
File "/home//code/remote/mmsegmentation_/mmseg/models/segmentors/base.py", line 152, in train_step
losses = self(**data_batch)
File "/home//anaconda3/envs/py38_source/lib/python3.8/site-packages/torch/nn/modules/module.py", line 744, in call_impl
result = self.forward(*input, **kwargs)
File "/home//code/remote/mmcv
/mmcv/runner/fp16_utils.py", line 84, in new_func
return old_func(*args, **kwargs)
File "/home//code/remote/mmsegmentation_/mmseg/models/segmentors/base.py", line 122, in forward
return self.forward_train(img, img_metas, **kwargs)
File "/home//code/remote/mmsegmentation_/mmseg/models/segmentors/encoder_decoder.py", line 162, in forward_train
loss_aux = self.auxiliary_head_forward_train(
File "/home//code/remote/mmsegmentation
/mmseg/models/segmentors/encoder_decoder.py", line 124, in auxiliary_head_forward_train
loss_aux = self.auxiliary_head.forward_train(
File "/home//code/remote/mmsegmentation
/mmseg/models/decode_heads/decode_head.py", line 186, in forward_train
seg_logits = self.forward(inputs)
File "/home//code/remote/mmsegmentation_/mmseg/models/decode_heads/fcn_head.py", line 72, in forward
output = self.convs(x)
File "/home//anaconda3/envs/py38_source/lib/python3.8/site-packages/torch/nn/modules/module.py", line 744, in _call_impl
result = self.forward(*input, **kwargs)
File "/home//anaconda3/envs/py38_source/lib/python3.8/site-packages/torch/nn/modules/container.py", line 117, in forward
input = module(input)
File "/home//anaconda3/envs/py38_source/lib/python3.8/site-packages/torch/nn/modules/module.py", line 744, in call_impl
result = self.forward(*input, **kwargs)
File "/home//code/remote/mmcv
/mmcv/cnn/bricks/conv_module.py", line 192, in forward
x = self.conv(x)
File "/home//anaconda3/envs/py38_source/lib/python3.8/site-packages/torch/nn/modules/module.py", line 744, in _call_impl
result = self.forward(*input, **kwargs)
File "/home//anaconda3/envs/py38_source/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 393, in forward
return self._conv_forward(input, self.weight)
File "/home//anaconda3/envs/py38_source/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 389, in _conv_forward
return F.conv2d(input, weight, self.bias, self.stride,
RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR
You can try to repro this exception using the following code snippet. If that doesn't trigger the error, please include your original repro script when reporting this issue.

import torch
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.benchmark = True
torch.backends.cudnn.deterministic = False
torch.backends.cudnn.allow_tf32 = True
data = torch.randn([8, 1024, 95, 96], dtype=torch.float, device='cuda', requires_grad=True)
net = torch.nn.Conv2d(1024, 256, kernel_size=[3, 3], padding=[1, 1], stride=[1, 1], dilation=[1, 1], groups=1)
net = net.cuda().float()
out = net(data)
out.backward(torch.randn_like(out))
torch.cuda.synchronize()

ConvolutionParams
data_type = CUDNN_DATA_FLOAT
padding = [1, 1, 0]
stride = [1, 1, 0]
dilation = [1, 1, 0]
groups = 1
deterministic = false
allow_tf32 = true
input: TensorDescriptor 0x5605aca58f20
type = CUDNN_DATA_FLOAT
nbDims = 4
dimA = 8, 1024, 95, 96,
strideA = 9338880, 9120, 96, 1,
output: TensorDescriptor 0x5605aca54d10
type = CUDNN_DATA_FLOAT
nbDims = 4
dimA = 8, 256, 95, 96,
strideA = 2334720, 9120, 96, 1,
weight: FilterDescriptor 0x5605aca59ee0
type = CUDNN_DATA_FLOAT
tensor_format = CUDNN_TENSOR_NCHW
nbDims = 4
dimA = 256, 1024, 3, 3,
Pointer addresses:
input: 0x7f9642e60000
output: 0x7f97f8298000
weight: 0x7f9ae2800000

terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: an illegal memory access was encountered
Exception raised from create_event_internal at ../c10/cuda/CUDACachingAllocator.cpp:687 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6c (0x7f9bda09092c in /home//anaconda3/envs/py38_source/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: + 0x1cb08 (0x7f9bda0d1b08 in /home//anaconda3/envs/py38_source/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x51 (0x7f9bda07ab21 in /home//anaconda3/envs/py38_source/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #3: + 0x89731a (0x7f9bf2b3831a in /home//anaconda3/envs/py38_source/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #4: + 0x8973e5 (0x7f9bf2b383e5 in /home//anaconda3/envs/py38_source/lib/python3.8/site-packages/torch/lib/libtorch_python.so)

frame #22: __libc_start_main + 0xf3 (0x7f9c2cc240b3 in /lib/x86_64-linux-gnu/libc.so.6)

[2] 9376 abort (core dumped) python tools/train.py --no-validate configs/luke/pspnet_ABCDataset.py

Shiming94 commented

Hi @bkbjsd,

I ran into the same problem. Did you solve it? I get exactly the same error message.

@bkbjsd bkbjsd changed the title Many days still can't solve it: CUDA error: an illegal memory access was encountered still can't solve it: CUDA error: an illegal memory access was encountered Nov 27, 2020
wanghao9610 commented

I have met the same issue; it is almost always caused by num_classes not matching your label indices. I suggest you carefully check the label indices of your ground truth against the num_classes in the model config. For example, if your ground-truth label indices range from 0 to 19 (inclusive), you have to set num_classes to 20.
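
A quick way to verify this is to scan the annotation files for the pixel values they actually contain (a minimal sketch; the labels directory and the .png suffix are assumptions based on the config above, adjust them to your dataset):

import glob
import numpy as np
from PIL import Image

values = set()
for path in glob.glob('data/ABC_58G/labels/**/*.png', recursive=True):
    # Collect every distinct pixel value appearing in the ground-truth masks.
    values |= set(np.unique(np.array(Image.open(path))).tolist())

print('label values found:', sorted(values))
# Apart from the ignore value (255), everything should fall in [0, num_classes - 1].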

@bkbjsd bkbjsd changed the title still can't solve it: CUDA error: an illegal memory access was encountered Still can't solve it: CUDA error: an illegal memory access was encountered Nov 29, 2020
@bkbjsd bkbjsd changed the title Still can't solve it: CUDA error: an illegal memory access was encountered Surely can't solve it: CUDA error: an illegal memory access was encountered Dec 3, 2020
JPLAY0 commented Dec 4, 2020

I have encountered the same problem and solved it.

I'm sure the problem comes from the dataset. Take a closer look at the range of category labels in the dataset and at num_classes in the configuration file. Also note how ignore_index is used in the dataset implementation: does the dataset contain unlabeled data?
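
For example, if unlabeled pixels are stored as 0 and the real classes start at 1, one option is to shift the labels so that the unlabeled value becomes the 255 ignore index (a hypothetical remapping sketch, not taken from this repo; if I remember correctly, mmsegmentation's CustomDataset also exposes a reduce_zero_label option for this case):

import numpy as np
from PIL import Image

def remap_mask(in_path, out_path, ignore_index=255):
    mask = np.array(Image.open(in_path)).astype(np.int32)
    mask = mask - 1                    # classes 1..N become 0..N-1
    mask[mask < 0] = ignore_index      # former 0 (unlabeled) is now ignored
    Image.fromarray(mask.astype(np.uint8)).save(out_path)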

@bkbjsd bkbjsd changed the title Surely can't solve it: CUDA error: an illegal memory access was encountered CUDA error: an illegal memory access was encountered Dec 7, 2020
NickChang97 commented Dec 10, 2020

I also encountered it and solved it. You may check whether your label data's indices are consistent with your num_classes. Except for the ignore index, all of them should be in 0..num_classes-1.

Shiming94 commented

> I also encountered it and solved it. You may check whether your label data's indices are consistent with your num_classes. Except for the ignore index, all of them should be in 0..num_classes-1.

Hi Nick, thanks a lot for your reply. It really solved my problem. But could you please explain further what "except for the ignore index, all of them should be in 0..num_classes-1" means? I mean, how do I implement the ignore index? Thanks a lot.

NickChang97 commented Dec 11, 2020

> > I also encountered it and solved it. You may check whether your label data's indices are consistent with your num_classes. Except for the ignore index, all of them should be in 0..num_classes-1.
>
> Hi Nick, thanks a lot for your reply. It really solved my problem. But could you please explain further what "except for the ignore index, all of them should be in 0..num_classes-1" means? I mean, how do I implement the ignore index? Thanks a lot.

I did not use the ignore index myself, but you can check the code for how ignore_index is handled; I remember it is 255.
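
For reference, the decode heads in the model dump above show ignore_index=255, which (as far as I can tell) ends up as the ignore_index of the cross-entropy loss. A minimal sketch of the behaviour in plain PyTorch, not mmsegmentation-specific:

import torch
import torch.nn.functional as F

num_classes = 38
logits = torch.randn(2, num_classes, 4, 4)
labels = torch.randint(0, num_classes, (2, 4, 4))   # valid labels: 0..37
labels[0, 0, 0] = 255                               # a pixel marked as "ignore"

# Pixels labelled 255 are skipped by the loss. Any other value >= num_classes
# indexes past the logits and, on GPU, typically surfaces as an illegal memory access.
loss = F.cross_entropy(logits, labels, ignore_index=255)
print(loss)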

@xvjiarui xvjiarui closed this as completed Jan 5, 2021
@Junjun2016 Junjun2016 added the FAQ label Aug 4, 2021