Checklist
- I have searched related issues but cannot get the expected help (see #270 and #42, both "CUDA error: an illegal memory access was encountered").
- The bug has not been fixed in the latest version (mmseg 0.21.1).
Describe the bug
RuntimeError: CUDA error: an illegal memory access was encountered
- I was training my custom dataset on all the available models, but hit the following error with several decode heads/models, including Segmenter, STDC, ISANet, LRASPP, FastFCN, Disentangled/Asymmetric Non-local Networks (DNLNet/ANN), CCNet, and DANet.
- I faced the same issue with CGNet and FastSCNN, but resolved it by changing norm_cfg from "SyncBN" to plain "BN" (see the snippet below).
- Training/evaluation works fine with the remaining models, so I suspect the problem is not the custom dataset itself.
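For reference, the SyncBN-to-BN workaround above is a one-line config change. A plausible (unconfirmed) explanation is that SyncBN only works in distributed jobs, while this is a single-GPU Windows setup whose PyTorch build has USE_NCCL=OFF:

# replace the distributed-only SyncBN with plain BN for single-GPU training
norm_cfg = dict(type='BN', requires_grad=True)  # was: dict(type='SyncBN', requires_grad=True)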
Reproduction
What command or script did you run?
python tools/train.py [config_path]
Did you make any modifications on the code or config? Did you understand what you have modified?
- Only changed num_classes in the decode heads (wherever required) to fit my custom dataset (19 classes).
What dataset did you use?
- A custom dataset where I create a 2D mask with labelled pixels in the range (0, num_classes-1) and 255 as unlabelled.
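Since an illegal memory access in the loss computation is often caused by a mask value that is >= num_classes but not the ignore index, a quick scan of the masks may help rule that out. A minimal sketch, assuming single-channel PNG masks under a hypothetical labels/ directory:

import glob

import numpy as np
from PIL import Image

num_classes = 19
for path in glob.glob('datasets/custom_cityscapes/labels/*.png'):  # hypothetical mask location
    values = np.unique(np.array(Image.open(path)))
    bad = values[(values >= num_classes) & (values != 255)]
    if bad.size:
        print(path, 'contains out-of-range labels:', bad)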
Environment
- Please run python mmseg/utils/collect_env.py to collect the necessary environment information and paste it here.
'tail' is not recognized as an internal or external command,
operable program or batch file.
'gcc' is not recognized as an internal or external command,
operable program or batch file.
sys.platform: win32
Python: 3.7.11 (default, Jul 27 2021, 09:42:29) [MSC v.1916 64 bit (AMD64)]
CUDA available: True
GPU 0: NVIDIA GeForce RTX 3060 Laptop GPU
CUDA_HOME: C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.6
NVCC: Not Available
GCC: n/a
PyTorch: 1.10.2
PyTorch compiling details: PyTorch built with:
- C++ Version: 199711
- MSVC 192829337
- Intel(R) Math Kernel Library Version 2020.0.2 Product Build 20200624 for Intel(R) 64 architecture applications
- Intel(R) MKL-DNN v2.2.3 (Git Hash 7336ca9f055cf1bfa13efb658fe15dc9b41f0740)
- OpenMP 2019
- LAPACK is enabled (usually provided by MKL)
- CPU capability usage: AVX512
- CUDA Runtime 11.3
- NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=compute_37
- CuDNN 8.2
- Magma 2.5.4
- Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.3, CUDNN_VERSION=8.2.0, CXX_COMPILER=C:/cb/pytorch_1000000000000/work/tmp_bin/sccache-cl.exe, CXX_FLAGS=/DWIN32 /D_WINDOWS /GR /EHsc /w /bigobj -DUSE_PTHREADPOOL -openmp:experimental -IC:/cb/pytorch_1000000000000/work/mkl/include -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOCUPTI -DUSE_FBGEMM -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.10.2, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=OFF, USE_NNPACK=OFF, USE_OPENMP=ON,
TorchVision: 0.11.3
OpenCV: 4.5.5
MMCV: 1.4.4
MMCV Compiler: MSVC 193030709
MMCV CUDA Compiler: 11.6
MMSegmentation: 0.21.1+b163101
- You may add additional information that may be helpful for locating the problem, such as:
- How you installed PyTorch [e.g., pip, conda, source]: pip install torch torchvision
- Other environment variables that may be related (such as $PATH, $LD_LIBRARY_PATH, $PYTHONPATH, etc.)
Error traceback
If applicable, paste the error traceback here.
File "C:\Users\Sai_Nivedh\Projects\mmsegmentation\mmseg\apis\train.py", line 174, in train_segmentor
runner.run(data_loaders, cfg.workflow)
return self.module.train_step(*inputs[0], **kwargs[0])
File "C:\Users\Sai_Nivedh\Projects\mmsegmentation\mmseg\models\segmentors\base.py", line 139, in train_step
loss, log_vars = self._parse_losses(losses)
File "C:\Users\Sai_Nivedh\Projects\mmsegmentation\mmseg\models\segmentors\base.py", line 208, in _parse_losses
log_vars[loss_name] = loss_value.item()
RuntimeError: CUDA error: an illegal memory access was encountered
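Note that CUDA errors are reported asynchronously, so the frame above is usually not where the illegal access actually occurred; loss_value.item() is merely the first call that synchronizes with the GPU. Re-running with synchronous kernel launches gives a more accurate traceback, e.g. on Windows cmd:
set CUDA_LAUNCH_BLOCKING=1
python tools/train.py [config_path]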
Whole Config
# model settings
norm_cfg = dict(type='SyncBN', requires_grad=True)
model = dict(
    type='EncoderDecoder',
    pretrained='open-mmlab://resnet50_v1c',
    backbone=dict(
        type='ResNetV1c',
        depth=50,
        num_stages=4,
        out_indices=(0, 1, 2, 3),
        dilations=(1, 1, 2, 4),
        strides=(1, 2, 1, 1),
        norm_cfg=norm_cfg,
        norm_eval=False,
        style='pytorch',
        contract_dilation=True),
    decode_head=dict(
        type='DNLHead',
        in_channels=2048,
        in_index=3,
        channels=512,
        dropout_ratio=0.1,
        reduction=2,
        use_scale=True,
        mode='embedded_gaussian',
        num_classes=19,
        norm_cfg=norm_cfg,
        align_corners=False,
        loss_decode=dict(
            type='CrossEntropyLoss', use_sigmoid=False, loss_weight=1.0)),
    auxiliary_head=dict(
        type='FCNHead',
        in_channels=1024,
        in_index=2,
        channels=256,
        num_convs=1,
        concat_input=False,
        dropout_ratio=0.1,
        num_classes=19,
        norm_cfg=norm_cfg,
        align_corners=False,
        loss_decode=dict(
            type='CrossEntropyLoss', use_sigmoid=False, loss_weight=0.4)),
    # model training and testing settings
    train_cfg=dict(),
    test_cfg=dict(mode='whole'))
dataset_type = 'CustomDataset'
data_root = r'datasets\custom_cityscapes'
img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
crop_size = (512, 512)
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations', reduce_zero_label=True),
    dict(type='Resize', img_scale=(2048, 512), ratio_range=(0.5, 2.0)),
    dict(type='RandomCrop', crop_size=crop_size, cat_max_ratio=0.75),
    dict(type='RandomFlip', prob=0.5),
    dict(type='PhotoMetricDistortion'),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='Pad', size=crop_size, pad_val=0, seg_pad_val=255),
    dict(type='DefaultFormatBundle'),
    dict(type='Collect', keys=['img', 'gt_semantic_seg']),
]
test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(
        type='MultiScaleFlipAug',
        img_scale=(2048, 512),
        # img_ratios=[0.5, 0.75, 1.0, 1.25, 1.5, 1.75],
        flip=False,
        transforms=[
            dict(type='Resize', keep_ratio=True),
            dict(type='RandomFlip'),
            dict(type='Normalize', **img_norm_cfg),
            dict(type='ImageToTensor', keys=['img']),
            dict(type='Collect', keys=['img']),
        ])
]
data = dict(
    samples_per_gpu=2,
    workers_per_gpu=2,
    train=dict(
        type=dataset_type,
        data_root=data_root,
        reduce_zero_label=True,
        img_dir='images',
        ann_dir='labels',
        pipeline=train_pipeline),
    val=dict(
        type=dataset_type,
        data_root=data_root,
        reduce_zero_label=True,
        img_dir='images',
        ann_dir='labels',
        pipeline=test_pipeline),
    test=dict(
        type=dataset_type,
        data_root=data_root,
        reduce_zero_label=True,
        img_dir='images',
        ann_dir='labels',
        pipeline=test_pipeline))
# optimizer
optimizer = dict(type='SGD', lr=0.01, momentum=0.9, weight_decay=0.0005)
optimizer_config = dict()
# learning policy
lr_config = dict(policy='poly', power=0.9, min_lr=1e-4, by_epoch=False)
# runtime settings
runner = dict(type='IterBasedRunner', max_iters=40000)
checkpoint_config = dict(by_epoch=False, interval=4000)
evaluation = dict(interval=400, metric='mIoU', pre_eval=True)
log_config = dict(
    interval=50,
    hooks=[
        dict(type='TextLoggerHook', by_epoch=False),
        # dict(type='TensorboardLoggerHook')
    ])
# yapf:enable
dist_params = dict(backend='nccl')
log_level = 'INFO'
load_from = r'checkpoints\dnl_r50-d8_512x1024_40k_cityscapes_20200904_233629-53d4ea93.pth'
resume_from = None
workflow = [('train', 1)]
cudnn_benchmark = True
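A side note on the config above, given the dataset description: reduce_zero_label=True remaps label 0 to the ignore index and shifts every remaining label down by one, so masks annotated as (0, num_classes-1) with 255 as unlabelled end up with one fewer trainable class than num_classes=19. Whether this mismatch is related to the crash is unconfirmed, but it is easy to check; a minimal numpy sketch of roughly what LoadAnnotations does with this flag:

import numpy as np

gt = np.array([[0, 1, 18, 255]], dtype=np.uint8)  # example mask values
gt[gt == 0] = 255    # label 0 becomes ignore
gt = gt - 1          # shift the rest down by one (uint8: 255 wraps to 254)
gt[gt == 254] = 255  # restore 255 as the ignore index
print(gt)            # [[255   0  17 255]]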
Bug fix
If you have already identified the reason, you can provide the information here. If you are willing to create a PR to fix it, please also leave a comment here; that would be much appreciated!