
[Bug] CUBLAS_STATUS_EXECUTION_FAILED error for BasicVSR_PP for tasks with resolutions >~1700x1080 #2124

jacob-stein opened this issue Mar 9, 2024 · 0 comments

Prerequisite

Task

I'm using the official example scripts/configs for the officially supported tasks/models/datasets.

Branch

main branch https://github.com/open-mmlab/mmagic

Environment

[2024-03-09 01:01:02,325] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
sys.platform: linux
Python: 3.11.7 (main, Dec  8 2023, 18:56:58) [GCC 11.4.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 2147483648
GPU 0,1: NVIDIA A100-SXM4-80GB
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.1, V12.1.105
GCC: x86_64-linux-gnu-gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
PyTorch: 2.1.1+cu121
PyTorch compiling details: PyTorch built with:
  - GCC 9.3
  - C++ Version: 201703
  - Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v3.1.1 (Git Hash 64f6bcbcbab628e96f33a62c3e975f8535a7bde4)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 12.1
  - NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90
  - CuDNN 8.9.2
  - Magma 2.6.1
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=12.1, CUDNN_VERSION=8.9.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-invalid-partial-specialization -Wno-unused-private-field -Wno-aligned-allocation-unavailable -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_DISABLE_GPU_ASSERTS=ON, TORCH_VERSION=2.1.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF,

TorchVision: 0.16.1+cu121
OpenCV: 4.9.0
MMEngine: 0.10.3
MMCV: 2.1.0
MMCV Compiler: GCC 9.3
MMCV CUDA Compiler: 12.1
MMagic: 1.2.0+0a560bb

Reproduces the problem - code sample

        return modulated_deform_conv2d(x, offset, mask, self.weight, self.bias,
                                       self.stride, self.padding,
                                       self.dilation, self.groups,
                                       self.deform_groups)

The call to modulated_deform_conv2d above, made from the deformable-alignment module's forward in basicvsr_plusplus_net.py (line 416 in the traceback below), appears to be what triggers the error.
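
For isolation, a rough standalone reproduction of just the mmcv op is sketched below. The channel counts only approximate the BasicVSR++ alignment module (mid_channels=64 and deform_groups=16 are assumptions on my part); the spatial size matches the failing clips.

# Hypothetical standalone repro: drive mmcv's modulated_deform_conv2d directly
# at the failing spatial size, independent of the rest of BasicVSR++.
import torch
from mmcv.ops import ModulatedDeformConv2d, modulated_deform_conv2d

h, w = 1080, 1755  # roughly the resolution of partial3.mov
conv = ModulatedDeformConv2d(64, 64, 3, padding=1, deform_groups=16).cuda()

x = torch.randn(1, 64, h, w, device='cuda')
# offset has deform_groups * 2 * kH * kW channels, mask has deform_groups * kH * kW
offset = torch.randn(1, 16 * 2 * 9, h, w, device='cuda')
mask = torch.sigmoid(torch.randn(1, 16 * 9, h, w, device='cuda'))

out = modulated_deform_conv2d(x, offset, mask, conv.weight, conv.bias,
                              conv.stride, conv.padding, conv.dilation,
                              conv.groups, conv.deform_groups)
print(out.shape)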

Reproduces the problem - command or script

python demo/mmagic_inference_demo.py --model-name basicvsr_pp --video /home/paperspace/BasicVSR_PlusPlus/demo/input/full1.mov --result-out-dir ./resources/output/video_restoration/demo_video_restoration_basicvsr_res.mp4 --extra-parameters max_seq_len=5

Causes the error; full1.mov is 1920 × 1080.

python demo/mmagic_inference_demo.py --model-name basicvsr_pp --video /home/paperspace/BasicVSR_PlusPlus/demo/input/partial3.mov --result-out-dir ./resources/output/video_restoration/demo_video_restoration_basicvsr_res.mp4 --extra-parameters max_seq_len=5
python demo/mmagic_inference_demo.py --model-name basicvsr_pp --video /home/paperspace/BasicVSR_PlusPlus/demo/input/partial3.mov --result-out-dir ./resources/output/video_restoration/demo_video_restoration_basicvsr_res.mp4 --extra-parameters max_seq_len=2

Both cause the error; partial3.mov is 1755 × 1080.

python demo/mmagic_inference_demo.py --model-name basicvsr_pp --video /home/paperspace/BasicVSR_PlusPlus/demo/input/partial3.mov --result-out-dir ./resources/output/video_restoration/demo_video_restoration_basicvsr_res.mp4 --extra-parameters max_seq_len=1

Does not cause the error: same video, but with max_seq_len=1 each sequence contains a single frame, so the recurrent propagation (and with it the deformable alignment) never runs.

python demo/mmagic_inference_demo.py --model-name basicvsr_pp --video /home/paperspace/BasicVSR_PlusPlus/demo/input/partial4.mov --result-out-dir ./resources/output/video_restoration/demo_video_restoration_basicvsr_res.mp4 --extra-parameters max_seq_len=5

Does not cause the error; partial4.mov is 1646 × 1080.
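
For completeness, the same reproduction through the Python API (a sketch; I'm assuming the keyword names that demo/mmagic_inference_demo.py forwards to MMagicInferencer) looks roughly like:

# Rough Python-API equivalent of the failing partial3.mov command above.
from mmagic.apis import MMagicInferencer

editor = MMagicInferencer(
    model_name='basicvsr_pp',
    extra_parameters={'max_seq_len': 5})
editor.infer(
    video='/home/paperspace/BasicVSR_PlusPlus/demo/input/partial3.mov',
    result_out_dir='./resources/output/video_restoration/demo_video_restoration_basicvsr_res.mp4')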

Reproduces the problem - error message

Traceback (most recent call last):
  File "/home/paperspace/mmagic/demo/mmagic_inference_demo.py", line 142, in <module>
    main()
  File "/home/paperspace/mmagic/demo/mmagic_inference_demo.py", line 138, in main
    editor.infer(**user_defined)
  File "/home/paperspace/.local/lib/python3.11/site-packages/mmagic/apis/mmagic_inferencer.py", line 231, in infer
    return self.inferencer(
           ^^^^^^^^^^^^^^^^
  File "/home/paperspace/.local/lib/python3.11/site-packages/mmagic/apis/inferencers/__init__.py", line 110, in __call__
    return self.inferencer(**kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/paperspace/.local/lib/python3.11/site-packages/mmagic/apis/inferencers/base_mmagic_inferencer.py", line 139, in __call__
    results = self.base_call(**kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/paperspace/.local/lib/python3.11/site-packages/mmagic/apis/inferencers/base_mmagic_inferencer.py", line 165, in base_call
    preds = self.forward(data, **forward_kwargs)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/paperspace/.local/lib/python3.11/site-packages/mmagic/apis/inferencers/video_restoration_inferencer.py", line 134, in forward
    self.model(
  File "/home/paperspace/.local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/paperspace/.local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/paperspace/.local/lib/python3.11/site-packages/mmagic/models/base_models/base_edit_model.py", line 109, in forward
    return self.forward_tensor(inputs, data_samples, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/paperspace/.local/lib/python3.11/site-packages/mmagic/models/base_models/base_edit_model.py", line 167, in forward_tensor
    feats = self.generator(inputs, **kwargs)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/paperspace/.local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/paperspace/.local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/paperspace/.local/lib/python3.11/site-packages/mmagic/models/editors/basicvsr_plusplus_net/basicvsr_plusplus_net.py", line 348, in forward
    feats = self.propagate(feats, flows, module)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/paperspace/.local/lib/python3.11/site-packages/mmagic/models/editors/basicvsr_plusplus_net/basicvsr_plusplus_net.py", line 218, in propagate
    feat_prop = self.deform_align[module_name](feat_prop, cond,
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/paperspace/.local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/paperspace/.local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/paperspace/.local/lib/python3.11/site-packages/mmagic/models/editors/basicvsr_plusplus_net/basicvsr_plusplus_net.py", line 416, in forward
    return modulated_deform_conv2d(x, offset, mask, self.weight, self.bias,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/paperspace/.local/lib/python3.11/site-packages/torch/autograd/function.py", line 539, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/paperspace/.local/lib/python3.11/site-packages/mmcv/ops/modulated_deform_conv.py", line 149, in forward
    ext_module.modulated_deform_conv_forward(
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`

Additional information

I keep running into the error above when trying to run BasicVSR_PP on videos of 1746 × 1080 or larger. The last confirmed working resolution is 1646 × 1080. The error seems to be occurring during forward propagation.

I've tried testing this with multiple versions of PyTorch and mmcv/mmcv-full, and they all fail in a similar way.

The GPU has plenty of memory headroom when max_seq_len=2 (only ~25 GB of 80 GB in use). Is there any workaround available without resorting to methods like tiling the video?
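
For reference, the tiling fallback I'd like to avoid would look roughly like the sketch below. restore() is a hypothetical callable that runs BasicVSR++ on a (t, c, h, w) clip and returns the 4x-upscaled result; the tile and overlap sizes are arbitrary placeholders.

# Rough sketch of spatial tiling with overlap-averaged stitching (the workaround
# I'm hoping to avoid). `restore` is a stand-in for a per-tile BasicVSR++ call.
import torch

def restore_tiled(frames, restore, tile=(540, 640), overlap=32, scale=4):
    t, c, h, w = frames.shape
    out = torch.zeros(t, c, h * scale, w * scale)
    weight = torch.zeros_like(out)
    th, tw = tile
    for y in range(0, h, th - overlap):
        for x in range(0, w, tw - overlap):
            y0, x0 = min(y, h - th), min(x, w - tw)  # clamp last tile to the border
            patch = frames[..., y0:y0 + th, x0:x0 + tw]
            sr = restore(patch)                      # (t, c, th*scale, tw*scale)
            out[..., y0 * scale:(y0 + th) * scale,
                x0 * scale:(x0 + tw) * scale] += sr
            weight[..., y0 * scale:(y0 + th) * scale,
                   x0 * scale:(x0 + tw) * scale] += 1
    return out / weight.clamp(min=1)                 # average the overlapping regions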

jacob-stein added the kind/bug label on Mar 9, 2024