CUDA error with several attention heads #1330

Closed
sainivedh19pt opened this issue Feb 28, 2022 · 3 comments
@sainivedh19pt

Checklist

  1. I have searched related issues but cannot get the expected help. (CUDA error: an illegal memory access was encountered #270, CUDA error: an illegal memory access was encountered #42)
  2. The bug has not been fixed in the latest version. (mmseg - 0.21.1)

Describe the bug

RuntimeError: CUDA error: an illegal memory access was encountered

  • I was training all the available models on my custom dataset, but I face the following error with several attention heads such as Segmenter, STDC, ISANet, LRASPP, FastFCN, Disentangled/Asymmetric Non-local Networks, CCNet, and DANet
  • Faced the same issue with CGNet and Fast-SCNN, but resolved it by changing norm_cfg from "SyncBN" to plain "BN" (see the sketch after this list)
  • Training/evaluation is working well with the remaining models, so I guess the problem is not with the custom dataset
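
For reference, the only change that resolved the CGNet/Fast-SCNN cases was the norm layer setting; a minimal sketch, assuming a single-GPU, non-distributed run:

# Original setting (SyncBN needs a distributed process group):
# norm_cfg = dict(type='SyncBN', requires_grad=True)
# Replacement that worked on the single-GPU Windows setup:
norm_cfg = dict(type='BN', requires_grad=True)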

Reproduction

  1. What command or script did you run?

    python tools/train.py [config_path]
    
  2. Did you make any modifications on the code or config? Did you understand what you have modified?

  • Only changed num_classes in the decode heads (and wherever else required) to fit my custom dataset (19 classes)
  3. What dataset did you use?
  • A custom dataset where I create a 2D mask with labelled pixels in the range (0, num_classes-1) and 255 as unlabelled (a quick sanity check on these masks is sketched below)
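
As a quick sanity check on those masks (a hypothetical helper, not part of the repo; it assumes single-channel PNG label maps), something like this confirms every pixel is either in [0, num_classes-1] or the ignore value 255:

import numpy as np
from pathlib import Path
from PIL import Image

NUM_CLASSES = 19    # matches num_classes in the config
IGNORE_INDEX = 255  # unlabelled pixels

def check_masks(ann_dir):
    """Report any mask containing labels outside the expected range."""
    for mask_path in sorted(Path(ann_dir).glob('*.png')):
        values = np.unique(np.array(Image.open(mask_path)))
        bad = values[(values >= NUM_CLASSES) & (values != IGNORE_INDEX)]
        if bad.size:
            print(f'{mask_path.name}: unexpected labels {bad.tolist()}')

check_masks('datasets/custom_cityscapes/labels')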

Environment

  1. Please run python mmseg/utils/collect_env.py to collect necessary environment information and paste it here.
'tail' is not recognized as an internal or external command,
operable program or batch file.
'gcc' is not recognized as an internal or external command,
operable program or batch file.
sys.platform: win32
Python: 3.7.11 (default, Jul 27 2021, 09:42:29) [MSC v.1916 64 bit (AMD64)]
CUDA available: True
GPU 0: NVIDIA GeForce RTX 3060 Laptop GPU
CUDA_HOME: C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.6
NVCC: Not Available
GCC: n/a
PyTorch: 1.10.2
PyTorch compiling details: PyTorch built with:
  - C++ Version: 199711
  - MSVC 192829337
  - Intel(R) Math Kernel Library Version 2020.0.2 Product Build 20200624 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v2.2.3 (Git Hash 7336ca9f055cf1bfa13efb658fe15dc9b41f0740)
  - OpenMP 2019
  - LAPACK is enabled (usually provided by MKL)
  - CPU capability usage: AVX512
  - CUDA Runtime 11.3
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=compute_37
  - CuDNN 8.2
  - Magma 2.5.4
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.3, CUDNN_VERSION=8.2.0, CXX_COMPILER=C:/cb/pytorch_1000000000000/work/tmp_bin/sccache-cl.exe, CXX_FLAGS=/DWIN32 /D_WINDOWS /GR /EHsc /w /bigobj -DUSE_PTHREADPOOL -openmp:experimental -IC:/cb/pytorch_1000000000000/work/mkl/include -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOCUPTI -DUSE_FBGEMM -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.10.2, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=OFF, USE_NNPACK=OFF, USE_OPENMP=ON,

TorchVision: 0.11.3
OpenCV: 4.5.5
MMCV: 1.4.4
MMCV Compiler: MSVC 193030709
MMCV CUDA Compiler: 11.6
MMSegmentation: 0.21.1+b163101
  2. You may add additional information that may be helpful for locating the problem, such as
    • How you installed PyTorch [e.g., pip, conda, source]
      pip install torch torchvision
    • Other environment variables that may be related (such as $PATH, $LD_LIBRARY_PATH, $PYTHONPATH, etc.)

Error traceback

If applicable, paste the error traceback here.

File "C:\Users\Sai_Nivedh\Projects\mmsegmentation\mmseg\apis\train.py", line 174, in train_segmentor
    runner.run(data_loaders, cfg.workflow)
    return self.module.train_step(*inputs[0], **kwargs[0])
  File "C:\Users\Sai_Nivedh\Projects\mmsegmentation\mmseg\models\segmentors\base.py", line 139, in train_step
    loss, log_vars = self._parse_losses(losses)
  File "C:\Users\Sai_Nivedh\Projects\mmsegmentation\mmseg\models\segmentors\base.py", line 208, in _parse_losses
    log_vars[loss_name] = loss_value.item()
RuntimeError: CUDA error: an illegal memory access was encountered

Whole Config

# model settings
norm_cfg = dict(type='SyncBN', requires_grad=True)
model = dict(
    type='EncoderDecoder',
    pretrained='open-mmlab://resnet50_v1c',
    backbone=dict(
        type='ResNetV1c',
        depth=50,
        num_stages=4,
        out_indices=(0, 1, 2, 3),
        dilations=(1, 1, 2, 4),
        strides=(1, 2, 1, 1),
        norm_cfg=norm_cfg,
        norm_eval=False,
        style='pytorch',
        contract_dilation=True),
    decode_head=dict(
        type='DNLHead',
        in_channels=2048,
        in_index=3,
        channels=512,
        dropout_ratio=0.1,
        reduction=2,
        use_scale=True,
        mode='embedded_gaussian',
        num_classes=19,
        norm_cfg=norm_cfg,
        align_corners=False,
        loss_decode=dict(
            type='CrossEntropyLoss', use_sigmoid=False, loss_weight=1.0)),
    auxiliary_head=dict(
        type='FCNHead',
        in_channels=1024,
        in_index=2,
        channels=256,
        num_convs=1,
        concat_input=False,
        dropout_ratio=0.1,
        num_classes=19,
        norm_cfg=norm_cfg,
        align_corners=False,
        loss_decode=dict(
            type='CrossEntropyLoss', use_sigmoid=False, loss_weight=0.4)),
    # model training and testing settings
    train_cfg=dict(),
    test_cfg=dict(mode='whole'))



dataset_type = 'CustomDataset'
data_root = 'datasets\custom_cityscapes'
img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
crop_size = (512, 512)
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations', reduce_zero_label=True),
    dict(type='Resize', img_scale=(2048, 512), ratio_range=(0.5, 2.0)),
    dict(type='RandomCrop', crop_size=crop_size, cat_max_ratio=0.75),
    dict(type='RandomFlip', prob=0.5),
    dict(type='PhotoMetricDistortion'),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='Pad', size=crop_size, pad_val=0, seg_pad_val=255),
    dict(type='DefaultFormatBundle'),
    dict(type='Collect', keys=['img', 'gt_semantic_seg']),
]
test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(
        type='MultiScaleFlipAug',
        img_scale=(2048, 512),
        # img_ratios=[0.5, 0.75, 1.0, 1.25, 1.5, 1.75],
        flip=False,
        transforms=[
            dict(type='Resize', keep_ratio=True),
            dict(type='RandomFlip'),
            dict(type='Normalize', **img_norm_cfg),
            dict(type='ImageToTensor', keys=['img']),
            dict(type='Collect', keys=['img']),
        ])
]
data = dict(
    samples_per_gpu=2,
    workers_per_gpu=2,
    train=dict(
        type=dataset_type,
        data_root=data_root,
        reduce_zero_label=True,
        img_dir='images',
        ann_dir='labels',
        pipeline=train_pipeline),
    val=dict(
        type=dataset_type,
        data_root=data_root,
        reduce_zero_label=True,
        img_dir='images',
        ann_dir='labels',
        pipeline=test_pipeline),
    test=dict(
        type=dataset_type,
        data_root=data_root,
        reduce_zero_label=True,
        img_dir='images',
        ann_dir='labels',
        pipeline=test_pipeline))

# optimizer
optimizer = dict(type='SGD', lr=0.01, momentum=0.9, weight_decay=0.0005)
optimizer_config = dict()
# learning policy
lr_config = dict(policy='poly', power=0.9, min_lr=1e-4, by_epoch=False)
# runtime settings
runner = dict(type='IterBasedRunner', max_iters=40000)
checkpoint_config = dict(by_epoch=False, interval=4000)
evaluation = dict(interval=400, metric='mIoU', pre_eval=True)


log_config = dict(
    interval=50,
    hooks=[
        dict(type='TextLoggerHook', by_epoch=False),
        # dict(type='TensorboardLoggerHook')
    ])
# yapf:enable
dist_params = dict(backend='nccl')
log_level = 'INFO'
load_from = r'checkpoints\dnl_r50-d8_512x1024_40k_cityscapes_20200904_233629-53d4ea93.pth'
resume_from = None
workflow = [('train', 1)]
cudnn_benchmark = True

Bug fix

If you have already identified the reason, you can provide the information here. If you are willing to create a PR to fix it, please also leave a comment here and that would be much appreciated!

@MengzhangLI
Contributor

I have two questions or comments.

(1) Could you try training these failing models on our provided datasets, such as Cityscapes and ADE20K, to test whether the CUDA out of memory error is still encountered?

(2) Before your issue, I had faced some problems because I missed 'RandomCrop' and 'Pad', as here (a minimal pipeline sketch follows the link):

#955 (comment)
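
For context, a minimal sketch of where those two transforms sit in a training pipeline (crop size and pad values are illustrative; other transforms such as Resize and Normalize are omitted):

crop_size = (512, 512)
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations'),
    dict(type='RandomCrop', crop_size=crop_size, cat_max_ratio=0.75),
    dict(type='Pad', size=crop_size, pad_val=0, seg_pad_val=255),
    dict(type='DefaultFormatBundle'),
    dict(type='Collect', keys=['img', 'gt_semantic_seg']),
]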

Hope my experience helps you locate the problem.

Best,

@MengzhangLI MengzhangLI self-assigned this Feb 28, 2022
@sainivedh19pt
Author

Hi @MengzhangLI ,

Thanks for the response.

The error I faced is not CUDA out of memory. Posting the extended stack trace for better insight:

File "C:\Users\Sai_Nivedh\Projects\mmsegmentation\mmseg\apis\train.py", line 174, in train_segmentor
    runner.run(data_loaders, cfg.workflow)
  File "c:\users\sai_nivedh\projects\mmcv\mmcv\runner\iter_based_runner.py", line 134, in run
    iter_runner(iter_loaders[i], **kwargs)
  File "c:\users\sai_nivedh\projects\mmcv\mmcv\runner\iter_based_runner.py", line 61, in train
    outputs = self.model.train_step(data_batch, self.optimizer, **kwargs)
  File "c:\users\sai_nivedh\projects\mmcv\mmcv\parallel\data_parallel.py", line 74, in train_step
    inputs, kwargs = self.scatter(inputs, kwargs, self.device_ids)
  File "c:\users\sai_nivedh\projects\mmcv\mmcv\parallel\data_parallel.py", line 53, in scatter
    return scatter_kwargs(inputs, kwargs, device_ids, dim=self.dim)
  File "c:\users\sai_nivedh\projects\mmcv\mmcv\parallel\scatter_gather.py", line 51, in scatter_kwargs
    inputs = scatter(inputs, target_gpus, dim) if inputs else []
  File "c:\users\sai_nivedh\projects\mmcv\mmcv\parallel\scatter_gather.py", line 44, in scatter
    return scatter_map(inputs)
  File "c:\users\sai_nivedh\projects\mmcv\mmcv\parallel\scatter_gather.py", line 29, in scatter_map
    return list(zip(*map(scatter_map, obj)))
  File "c:\users\sai_nivedh\projects\mmcv\mmcv\parallel\scatter_gather.py", line 34, in scatter_map
    out = list(map(type(obj), zip(*map(scatter_map, obj.items()))))
  File "c:\users\sai_nivedh\projects\mmcv\mmcv\parallel\scatter_gather.py", line 29, in scatter_map
    return list(zip(*map(scatter_map, obj)))
  File "c:\users\sai_nivedh\projects\mmcv\mmcv\parallel\scatter_gather.py", line 27, in scatter_map
    return Scatter.forward(target_gpus, obj.data)
  File "c:\users\sai_nivedh\projects\mmcv\mmcv\parallel\_functions.py", line 71, in forward
    outputs = scatter(input, target_gpus, streams)
  File "c:\users\sai_nivedh\projects\mmcv\mmcv\parallel\_functions.py", line 15, in scatter
    [streams[i // chunk_size]]) for i in range(len(input))
  File "c:\users\sai_nivedh\projects\mmcv\mmcv\parallel\_functions.py", line 15, in <listcomp>
    [streams[i // chunk_size]]) for i in range(len(input))
  File "c:\users\sai_nivedh\projects\mmcv\mmcv\parallel\_functions.py", line 24, in scatter
    output = output.cuda(devices[0], non_blocking=True)
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
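
For reference, one way to act on that hint and get a synchronous, accurate trace is to set the variable before torch is imported; a minimal sketch (placing it at the very top of tools/train.py is an assumption on my part):

# Force synchronous CUDA kernel launches so the Python stack trace points at
# the operation that actually failed. Must run before CUDA is initialised,
# i.e. before torch is imported anywhere in the process.
import os
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'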

@MengzhangLI
Contributor

It is usually caused by a wrong num_classes in the config: it should be n = number of foreground classes + background (background is usually label 0). For example, if you have only one kind of foreground, it should be num_classes=2; see the sketch below.
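
A minimal sketch of the relevant decode-head keys for that binary case (the head type and channel sizes are illustrative only):

decode_head = dict(
    type='FCNHead',   # illustrative head; only num_classes matters here
    in_channels=2048,
    in_index=3,
    channels=512,
    num_classes=2,    # 1 foreground class + background (label 0)
    loss_decode=dict(type='CrossEntropyLoss', use_sigmoid=False, loss_weight=1.0))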
