RuntimeError: SigmoidFocalLoss is not compiled with GPU support #21

Closed
Morizb opened this issue Sep 14, 2023 · 6 comments

Comments


Morizb commented Sep 14, 2023

Hello, I downloaded the fusion_voxel0075_R50.pth file you provided and ran sh ./tools/dist_train.sh ./configs/MSMDFusion_nusc_voxel_LC.py 2 for the second-stage training, which reports the error below. I tried several solutions found online without success. I hope you can point me in the right direction, thank you!

2023-09-14 10:43:15,801 - mmdet - INFO - Start running, host: xzluo@b5163d5d11c9, work_dir: /public/home/xzluo/zc/MSMDFusion-main/work_dirs/MSMDFusion_nusc_voxel_LC
2023-09-14 10:43:15,801 - mmdet - INFO - workflow: [('train', 1)], max: 6 epochs
Traceback (most recent call last):
File "./tools/train.py", line 283, in <module>
main()
File "./tools/train.py", line 272, in main
train_detector(
File "/public/home/xzluo/anaconda3/envs/zc/lib/python3.8/site-packages/mmdet/apis/train.py", line 170, in train_detector
runner.run(data_loaders, cfg.workflow)
File "/public/home/xzluo/anaconda3/envs/zc/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 125, in run
epoch_runner(data_loaders[i], **kwargs)
File "/public/home/xzluo/anaconda3/envs/zc/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 50, in train
self.run_iter(data_batch, train_mode=True)
File "/public/home/xzluo/anaconda3/envs/zc/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 29, in run_iter
outputs = self.model.train_step(data_batch, self.optimizer,
File "/public/home/xzluo/anaconda3/envs/zc/lib/python3.8/site-packages/mmcv/parallel/distributed.py", line 46, in train_step
output = self.module.train_step(*inputs[0], **kwargs[0])
File "/public/home/xzluo/anaconda3/envs/zc/lib/python3.8/site-packages/mmdet/models/detectors/base.py", line 247, in train_step
losses = self(**data)
File "/public/home/xzluo/anaconda3/envs/zc/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/public/home/xzluo/anaconda3/envs/zc/lib/python3.8/site-packages/mmcv/runner/fp16_utils.py", line 84, in new_func
return old_func(*args, **kwargs)
File "/public/home/xzluo/zc/MSMDFusion-main/mmdet3d/models/detectors/base.py", line 58, in forward
return self.forward_train(**kwargs)
File "/public/home/xzluo/zc/MSMDFusion-main/mmdet3d/models/detectors/MSMDFusion.py", line 534, in forward_train
losses_pts = self.forward_pts_train(pts_feats, img_feats, gt_bboxes_3d,
File "/public/home/xzluo/zc/MSMDFusion-main/mmdet3d/models/detectors/MSMDFusion.py", line 574, in forward_pts_train
losses = self.pts_bbox_head.loss(*loss_inputs)
File "/public/home/xzluo/anaconda3/envs/zc/lib/python3.8/site-packages/mmcv/runner/fp16_utils.py", line 164, in new_func
return old_func(*args, **kwargs)
File "/public/home/xzluo/zc/MSMDFusion-main/mmdet3d/models/dense_heads/transfusion_head.py", line 1260, in loss
layer_loss_cls = self.loss_cls(layer_cls_score, layer_labels, layer_label_weights, avg_factor=max(num_pos, 1))
File "/public/home/xzluo/anaconda3/envs/zc/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/public/home/xzluo/anaconda3/envs/zc/lib/python3.8/site-packages/mmdet/models/losses/focal_loss.py", line 170, in forward
loss_cls = self.loss_weight * calculate_loss_func(
File "/public/home/xzluo/anaconda3/envs/zc/lib/python3.8/site-packages/mmdet/models/losses/focal_loss.py", line 85, in sigmoid_focal_loss
loss = _sigmoid_focal_loss(pred.contiguous(), target, gamma, alpha, None,
File "/public/home/xzluo/anaconda3/envs/zc/lib/python3.8/site-packages/mmcv/ops/focal_loss.py", line 54, in forward
ext_module.sigmoid_focal_loss_forward(
RuntimeError: SigmoidFocalLoss is not compiled with GPU support
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 29983) of binary: /public/home/xzluo/anaconda3/envs/zc/bin/python
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
INFO:torch.distributed.elastic.agent.server.api:[default] Worker group FAILED. 3/3 attempts left; will restart worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Stopping worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group

Owner

SxJyJay commented Sep 14, 2023

How did you install the mmcv library? If you compiled it locally, please check whether CUDA/nvcc was available during compilation.
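One way to check this before rebuilding mmcv is to confirm that a CUDA compiler is discoverable from the shell that runs the build. The following is a minimal sketch, assuming the build tooling locates nvcc through the CUDA_HOME environment variable or PATH; the helper name `find_nvcc` is hypothetical and not part of mmcv.

```python
import os
import shutil


def find_nvcc(env=None, which=shutil.which):
    """Return a path to nvcc, or None if no CUDA compiler is discoverable.

    Hypothetical helper: checks CUDA_HOME first, then falls back to PATH.
    """
    env = os.environ if env is None else env
    cuda_home = env.get("CUDA_HOME")
    if cuda_home:
        candidate = os.path.join(cuda_home, "bin", "nvcc")
        if os.path.isfile(candidate):
            return candidate
    # Fall back to whatever `nvcc` resolves to on PATH, if anything.
    return which("nvcc")


if __name__ == "__main__":
    nvcc = find_nvcc()
    if nvcc is None:
        print("nvcc not found: a local mmcv build cannot compile the CUDA ops")
    else:
        print(f"nvcc found at {nvcc}")
```

If this reports that nvcc is missing in the environment used for the build, that would be consistent with ops such as SigmoidFocalLoss ending up without GPU support.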

Author

Morizb commented Sep 14, 2023

Thanks for your reply. I found the problem: when I run python mmdet3d/utils/collect_env.py, it shows:
TorchVision: 0.10.0+cu111
OpenCV: 4.8.0
MMCV: 1.2.7
MMCV Compiler: GCC 8.4
MMCV CUDA Compiler: not available
MMDetection: 2.10.0
MMDetection3D: 0.11.0+
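
The telling line in this report is "MMCV CUDA Compiler: not available", which matches the "not compiled with GPU support" error. As a small sketch, a report like this can be scanned automatically; `mmcv_has_cuda_ops` is a hypothetical helper, and the "Key: value" layout is assumed from the output above.

```python
def mmcv_has_cuda_ops(report: str) -> bool:
    """Return True if the report names a CUDA compiler for MMCV."""
    for line in report.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "MMCV CUDA Compiler":
            return value.strip().lower() != "not available"
    # No such line at all: treat CUDA support as not confirmed.
    return False


report = """\
MMCV: 1.2.7
MMCV Compiler: GCC 8.4
MMCV CUDA Compiler: not available
"""
print(mmcv_has_cuda_ops(report))
```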

Author

Morizb commented Sep 15, 2023

Hi, I fixed the previous bug:
[image: 725c7603607ff52e1ece8d5c519f7ac]

But when I run sh ./tools/dist_train.sh ./configs/MSMDFusion_nusc_voxel_LC.py 2 again, it reports the following error:
[image: 6acddd9bce1281e0d09aa8b179f2c52]

The environment for installation is as follows:
(msmd) xzluo@d037a065fa35:~/zc/MSMDFusion-main$ conda list
# packages in environment at /public/home/xzluo/anaconda3/envs/msmd:
#
# Name Version Build Channel
_libgcc_mutex 0.1 main https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
_sysroot_linux-64_curr_repodata_hack 3 haa98f57_10 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
absl-py 1.4.0
addict 2.4.0
aiofiles 22.1.0
aiosqlite 0.19.0
anyio 3.7.1
argon2-cffi 23.1.0
argon2-cffi-bindings 21.2.0
arrow 1.2.3
astor 0.8.1
attrs 23.1.0
Babel 2.12.1
backcall 0.2.0
beautifulsoup4 4.12.2
binutils_impl_linux-64 2.35.1 h27ae35d_9 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
binutils_linux-64 2.35.1 h454624a_30 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
black 23.3.0
blas 1.0 mkl https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
bleach 6.0.0
ca-certificates 2019.11.28 hecc5488_0 moussi
cached-property 1.5.2
cachetools 4.2.4
ccimport 0.4.2
certifi 2019.11.28 py37_0 moussi
cffi 1.15.1
charset-normalizer 3.2.0
click 8.1.7
comm 0.1.4
cumm-cu117 0.4.11
cycler 0.11.0
Cython 3.0.2
dataclasses 0.6
debugpy 1.7.0
decorator 5.1.1
defusedxml 0.7.1
deprecation 2.1.0
descartes 1.1.0
entrypoints 0.4
exceptiongroup 1.1.3
fastjsonschema 2.18.0
fire 0.5.0
flake8 5.0.4
fonttools 4.38.0
fqdn 1.5.1
future 0.18.3
gast 0.2.2
gcc_impl_linux-64 8.4.0 he7ac559_17 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
gcc_linux-64 8.4.0 he201b7d_30 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
google-auth 1.35.0
google-auth-oauthlib 0.4.6
google-pasta 0.2.0
grpcio 1.58.0
gxx_impl_linux-64 8.4.0 h9ce2e92_17 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
gxx_linux-64 8.4.0 h85ed34b_30 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
h5py 3.8.0
idna 3.4
imageio 2.27.0
importlib-metadata 4.2.0
importlib-resources 5.12.0
iniconfig 2.0.0
intel-openmp 2022.0.1 h06a4308_3633 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
ipdb 0.13.13
ipykernel 6.16.2
ipython 7.34.0
ipython-genutils 0.2.0
ipywidgets 8.1.1
isoduration 20.11.0
jedi 0.19.0
Jinja2 3.1.2
joblib 1.3.2
json5 0.9.14
jsonpointer 2.4
jsonschema 4.17.3
jupyter 1.0.0
jupyter-console 6.6.3
jupyter-events 0.6.3
jupyter-server 1.24.0
jupyter-ydoc 0.2.5
jupyter_client 7.4.9
jupyter_core 4.12.0
jupyter_packaging 0.12.3
jupyter_server_fileid 0.9.0
jupyter_server_ydoc 0.8.0
jupyterlab 3.6.5
jupyterlab-pygments 0.2.2
jupyterlab-widgets 3.0.9
jupyterlab_server 2.24.0
Keras-Applications 1.0.8
Keras-Preprocessing 1.1.2
kernel-headers_linux-64 3.10.0 h57e8cba_10 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
kiwisolver 1.4.5
lark 1.1.7
ld_impl_linux-64 2.35.1 h7274673_9 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
libffi 3.2.1 he1b5a44_1007 moussi
libgcc-devel_linux-64 8.4.0 hd257e2f_17 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
libgcc-ng 9.1.0 hdf63c60_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
libgfortran-ng 7.3.0 hdf63c60_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
libgomp 11.2.0 h1234567_1 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
libstdcxx-devel_linux-64 8.4.0 hf0c5c8d_17 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
libstdcxx-ng 9.1.0 hdf63c60_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
llvmlite 0.31.0
lyft-dataset-sdk 0.0.8
Markdown 3.3.4
MarkupSafe 2.1.3
matplotlib 3.5.2
matplotlib-inline 0.1.6
mccabe 0.7.0
mistune 3.0.1
mkl 2019.4 243 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
mkl-service 2.3.0 py37he8ac12f_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
mkl_fft 1.0.14 py37hd81dba3_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/r
mkl_random 1.0.4 py37hd81dba3_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/r
mmcv-full 1.2.7
mmdet 2.10.0
mmdet3d 0.11.0
mmpycocotools 12.0.3
mypy-extensions 1.0.0
nbclassic 1.0.0
nbclient 0.7.4
nbconvert 7.6.0
nbformat 5.8.0
ncurses 6.3 h7f8727e_2 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
nest-asyncio 1.5.7
networkx 2.2
ninja 1.11.1
notebook 6.5.5
notebook_shim 0.2.3
numba 0.48.0
numpy 1.17.0 py37h7e9f1db_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/r
numpy 1.19.5
numpy-base 1.17.0 py37hde5b4d6_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/r
nuscenes-devkit 1.1.10
oauthlib 3.2.2
open3d 0.13.0
opencv-python 4.5.5.64
openssl 1.1.1e h516909a_0 moussi
opt-einsum 3.3.0
packaging 23.1
pandas 1.3.5
pandocfilters 1.5.0
parso 0.8.3
pathspec 0.11.2
pccm 0.4.8
pexpect 4.8.0
pickleshare 0.7.5
Pillow 9.5.0
pip 22.3.1 py37h06a4308_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
pkgutil_resolve_name 1.3.10
platformdirs 3.10.0
plotly 5.16.1
pluggy 1.2.0
plyfile 0.8.1
portalocker 2.7.0
prometheus-client 0.17.1
prompt-toolkit 3.0.39
protobuf 4.24.3
psutil 5.9.5
ptyprocess 0.7.0
pyasn1 0.5.0
pyasn1-modules 0.3.0
pybind11 2.11.1
pycodestyle 2.9.1
pycparser 2.21
pyflakes 2.5.0
Pygments 2.16.1
pyparsing 3.1.1
pyquaternion 0.9.9
pyrsistent 0.19.3
pytest 7.4.2
python 3.7.7 hcf32534_0_cpython https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
python-dateutil 2.8.2
python-json-logger 2.0.7
pytz 2023.3.post1
PyWavelets 1.3.0
PyYAML 6.0.1
pyzmq 24.0.1
qtconsole 5.4.4
QtPy 2.4.0
readline 8.1.2 h7f8727e_1 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
requests 2.31.0
requests-oauthlib 1.3.1
rfc3339-validator 0.1.4
rfc3986-validator 0.1.1
rsa 4.9
scikit-image 0.19.3
scikit-learn 1.0.2
scipy 1.4.1
Send2Trash 1.8.2
setuptools 65.6.3 py37h06a4308_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
Shapely 1.8.5
six 1.16.0 pyhd3eb1b0_1 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
sniffio 1.3.0
soupsieve 2.4.1
spconv-cu117 2.3.6
sqlite 3.38.5 hc218d9a_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
sysroot_linux-64 2.17 h57e8cba_10 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
tenacity 8.2.3
tensorboard 2.1.1
tensorflow-estimator 2.1.0
tensorflow-gpu 2.1.0
termcolor 2.3.0
terminado 0.17.1
terminaltables 3.1.10
threadpoolctl 3.1.0
tifffile 2021.11.2
tinycss2 1.2.1
tk 8.6.12 h1ccaba5_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
tomli 2.0.1
tomlkit 0.12.1
torch 1.7.0+cu110
torch-scatter 2.0.7
torchaudio 0.7.0
torchvision 0.8.1+cu110
tornado 6.2
tqdm 4.66.1
traitlets 5.9.0
trimesh 2.35.39
typed-ast 1.5.5
typing_extensions 4.7.1
uri-template 1.3.0
urllib3 2.0.4
waymo-open-dataset-tf-2-1-0 1.2.0
wcwidth 0.2.6
webcolors 1.13
webencodings 0.5.1
websocket-client 1.6.1
Werkzeug 2.2.3
wheel 0.38.4 py37h06a4308_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
widgetsnbextension 4.0.9
wrapt 1.15.0
xz 5.2.5 h7f8727e_1 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
y-py 0.6.0
yapf 0.40.1
ypy-websocket 0.8.4
zipp 3.15.0
zlib 1.2.12 h7f8727e_2 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main

Do you know what the problem is, please?

Owner

SxJyJay commented Sep 15, 2023

The error "numpy.ndarray size changed, may indicate binary incompatibility. Expected 88 from C header, got 80 from PyObject" indicates that your numpy version is incompatible with another library that was compiled against a different numpy version. To solve this problem, you can refer to this site. However, since numpy is a foundation for other libraries such as torch and scipy, changing the numpy version may cause further version conflicts. Therefore, I suggest you either find the library that is incompatible with the current numpy version, or set up a new environment by referring to my environment details.
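
One thing worth checking in the package list above is that numpy appears twice (1.17.0 from a conda channel and a second entry at 1.19.5). Whether that duplicate is the actual cause here is an assumption, but two numpy installs in one environment can produce this kind of mismatch. A minimal sketch that flags packages listed more than once in `conda list` output; the helper name `duplicated_packages` is hypothetical.

```python
from collections import defaultdict


def duplicated_packages(conda_list_text: str) -> dict:
    """Map each package name listed more than once to its versions."""
    versions = defaultdict(list)
    for line in conda_list_text.splitlines():
        fields = line.split()
        # Skip blank lines and '#' header lines printed by `conda list`.
        if len(fields) < 2 or fields[0].startswith("#"):
            continue
        versions[fields[0].lower()].append(fields[1])
    return {name: vers for name, vers in versions.items() if len(vers) > 1}


# A few lines copied from the environment listing in this thread.
sample = """\
numpy 1.17.0 py37h7e9f1db_0
numpy 1.19.5
numpy-base 1.17.0 py37hde5b4d6_0
scipy 1.4.1
"""
print(duplicated_packages(sample))
```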

Author

Morizb commented Sep 15, 2023

What is your graphics card model, and how much memory does it have? I only have access to two cards, both GeForce RTX 2080 Ti with 11 GB of video memory. When I set samples_per_gpu=2 and workers_per_gpu=2, running the code reports this error:
[image: cf80622e18c5ee28f841b1dd2e57e70]

Do you know how to solve this issue?

Owner

SxJyJay commented Sep 15, 2023

We use RTX 3090 GPUs with 24 GB of memory. You can try memory-saving techniques such as fp16 training or PyTorch gradient checkpointing.
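
As a rough illustration of the fp16 suggestion, MMDetection 2.x configs usually enable mixed precision through a top-level `fp16` entry. The fragment below is only a sketch: it assumes this repository's training script honors that key (the traceback earlier in the thread passes through mmcv's fp16_utils, but this was not verified), and the values, including lowering samples_per_gpu to 1, are illustrative rather than the maintainers' settings.

```python
# Hypothetical additions to configs/MSMDFusion_nusc_voxel_LC.py
# (not verified against this repository).

# Mixed-precision training with a static loss scale, following the
# MMDetection 2.x config convention.
fp16 = dict(loss_scale=512.0)

# A smaller per-GPU batch also reduces peak memory, at the cost of a smaller
# effective batch size. In a real config, merge these keys into the existing
# `data` dict instead of redefining it.
data = dict(samples_per_gpu=1, workers_per_gpu=2)
```

Changing the batch size this way may also call for adjusting the learning rate, which is outside the scope of this sketch.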

Morizb closed this as completed Oct 14, 2023