
error in ms_deformable_col2im_cuda: an illegal memory access was encountered #7186

makifozkanoglu opened this issue Feb 17, 2022 · 7 comments


@makifozkanoglu

Describe the bug
I'm getting the following error when trying to train Deformable DETR.

Reproduction

  1. What command or script did you run?
    I tried to train the config file below (see the command sketch after this list):

https://github.com/open-mmlab/mmdetection/blob/7a9bc498d5cc972171ec4f7332afcd70bb50e60e/configs/deformable_detr/deformable_detr_r50_16x2_50e_coco.py

  2. Did you make any modifications to the code or config? Do you understand what you modified?
    No, I did not make any modifications.
  3. What dataset did you use?
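
A minimal sketch of the standard single-GPU MMDetection training invocation for this config, assuming it is run from the mmdetection repo root (the --work-dir path is an example, not something from the report):

python tools/train.py configs/deformable_detr/deformable_detr_r50_16x2_50e_coco.py --work-dir work_dirs/deformable_detr_r50_16x2_50e_coco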

Environment
sys.platform: linux
Python: 3.8.12 (default, Oct 12 2021, 13:49:34) [GCC 7.5.0]
CUDA available: True
GPU 0: TITAN RTX
CUDA_HOME: /usr/local/cuda-11.0
NVCC: Build cuda_11.0_bu.TC445_37.28845127_0
GCC: gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
PyTorch: 1.7.0
PyTorch compiling details: PyTorch built with:

  • GCC 7.3
  • C++ Version: 201402
  • Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
  • Intel(R) MKL-DNN v1.6.0 (Git Hash 5ef631a030a6f73131c77892041042805a06064f)
  • OpenMP 201511 (a.k.a. OpenMP 4.5)
  • NNPACK is enabled
  • CPU capability usage: AVX2
  • CUDA Runtime 10.2
  • NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75
  • CuDNN 7.6.5
  • Magma 2.5.2
  • Build settings: BLAS=MKL, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DUSE_VULKAN_WRAPPER -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,

TorchVision: 0.8.0
OpenCV: 4.5.5
MMCV: 1.4.4
MMCV Compiler: GCC 7.5
MMCV CUDA Compiler: 11.0
MMDetection: 2.20.0+

torch was installed via pip.

Error traceback

error in ms_deformable_col2im_cuda: an illegal memory access was encountered
Traceback (most recent call last):
  File "tools/train.py", line 200, in <module>
    main()
  File "tools/train.py", line 188, in main
    train_detector(
  File "/cta/users/mehmet/CenterNetMMCV/ssod/apis/train.py", line 206, in train_detector
    runner.run(data_loaders, cfg.workflow)
  File "/cta/users/mehmet/CenterNetMMCV/thirdparty/mmcv/mmcv/runner/epoch_based_runner.py", line 127, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/cta/users/mehmet/CenterNetMMCV/thirdparty/mmcv/mmcv/runner/epoch_based_runner.py", line 51, in train
    self.call_hook('after_train_iter')
  File "/cta/users/mehmet/CenterNetMMCV/thirdparty/mmcv/mmcv/runner/base_runner.py", line 309, in call_hook
    getattr(hook, fn_name)(self)
  File "/cta/users/mehmet/CenterNetMMCV/thirdparty/mmcv/mmcv/runner/hooks/optimizer.py", line 56, in after_train_iter
    runner.outputs['loss'].backward()
  File "/cta/users/mehmet/.conda/envs/centernetmmcv/lib/python3.8/site-packages/torch/tensor.py", line 221, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/cta/users/mehmet/.conda/envs/centernetmmcv/lib/python3.8/site-packages/torch/autograd/__init__.py", line 130, in backward
    Variable._execution_engine.run_backward(
RuntimeError: CUDA error: an illegal memory access was encountered.

Bug fix
If you have already identified the reason, you can provide the information here. If you are willing to create a PR to fix it, please also leave a comment here and that would be much appreciated!

@PeterVennerstrom
Contributor

PeterVennerstrom commented Feb 25, 2022

Experienced the same issue and tested a few environments and GPU models.

Fixed by using an earlier version of mmcv-full. 1.4.2 is the latest version of mmcv-full that worked for me.

Running with CUDA_LAUNCH_BLOCKING=1 makes kernel launches synchronous, so the traceback points at the failing op:

CUDA_LAUNCH_BLOCKING=1 python ./tools/train.py configs/config.....
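
For reference, a sketch of the downgrade workaround, assuming the usual mmcv-full prebuilt-wheel index (replace cu110/torch1.7.0 with the CUDA and PyTorch versions in your environment):

pip uninstall -y mmcv-full
pip install mmcv-full==1.4.2 -f https://download.openmmlab.com/mmcv/dist/cu110/torch1.7.0/index.html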

@imkzh

imkzh commented Mar 27, 2022

Exactly the same error:

error in ms_deformable_col2im_cuda: an illegal memory access was encountered
Traceback (most recent call last):
  File "./mmdetection/tools/train.py", line 209, in <module>
    main()
  File "./mmdetection/tools/train.py", line 198, in main
    train_detector(
  File "/home/user/.local/lib/python3.8/site-packages/mmdet/apis/train.py", line 208, in train_detector
    runner.run(data_loaders, cfg.workflow)
  File "/home/user/.local/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 127, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/home/user/.local/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 51, in train
    self.call_hook('after_train_iter')
  File "/home/user/.local/lib/python3.8/site-packages/mmcv/runner/base_runner.py", line 309, in call_hook
    getattr(hook, fn_name)(self)
  File "/home/user/.local/lib/python3.8/site-packages/mmcv/runner/hooks/optimizer.py", line 56, in after_train_iter
    runner.outputs['loss'].backward()
  File "/home/user/.local/lib/python3.8/site-packages/torch/_tensor.py", line 255, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/user/.local/lib/python3.8/site-packages/torch/autograd/__init__.py", line 147, in backward
    Variable._execution_engine.run_backward(
RuntimeError: CUDA error: an illegal memory access was encountered

I'm on:

  • Ubuntu 20.04
  • CUDA 11.2 (RTX3090)
  • torch 1.9.0+cu111
  • mmcv-full 1.4.7
  • mmdet 2.22.0
  • python 3.8.10
  • nvcc V11.2.67
  • gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0

P.S.: downgrading mmcv-full to 1.4.2 solved the problem, as @PeterVennerstrom mentioned above.

@Manningchan

I met the same issue. In my environment there are 8 GPUs; if I use GPU 0 the error does not happen, but if I use any of the other GPUs it occurs.
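
A sketch of a possible workaround based on the observation above: restrict the visible devices so training only uses the GPU that works (the device index and config path here are examples, not from the report):

CUDA_VISIBLE_DEVICES=0 python tools/train.py configs/deformable_detr/deformable_detr_r50_16x2_50e_coco.py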

@xuqingyu26

Hello, I met the same issue as you. Have you solved it?

@imkzh

imkzh commented Mar 23, 2023

@xuqingyu26 a workaround is downgrading mmcv-full to 1.4.2, which solved the problem in my case, as mentioned in my comment above.

@xbkaishui

Hi, any update on this?

@PeterVennerstrom
Contributor

It was fixed. Here's a link to the issue with a link to the PR.
