[Bug] Failed to build mmcv in rocm/pytorch docker image: call to '__shfl_down' is ambiguous #2919

choyuansu · 2023-09-01T18:43:08Z

Prerequisite

I have searched Issues and Discussions but cannot get the expected help.
The bug has not been fixed in the latest version (https://github.com/open-mmlab/mmcv).

Environment

OrderedDict([('sys.platform', 'linux'), ('Python', '3.8.16 (default, Jun 12 2023, 18:09:05) [GCC 11.2.0]'), ('CUDA available', True), ('numpy_random_seed', 2147483648), ('GPU 0', 'AMD Radeon RX 6600'), ('CUDA_HOME', '/opt/rocm'), ('NVCC', 'HIP version: 5.6.31061-8c743ae5d\nAMD clang version 16.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-5.6.0 23243 be997b2f3651a41597d7a41441fff8ade4ac59ac)\nTarget: x86_64-unknown-linux-gnu\nThread model: posix\nInstalledDir: /opt/rocm/llvm/bin'), ('GCC', 'gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0'), ('PyTorch', '2.0.0a0+git70f6d0c'), ('PyTorch compiling details', 'PyTorch built with:\n - GCC 9.4\n - C++ Version: 201703\n - Intel(R) oneAPI Math Kernel Library Version 2022.0-Product Build 20211112 for Intel(R) 64 architecture applications\n - Intel(R) MKL-DNN v2.7.2 (Git Hash fbec3e25a559ee252022ae066817b204e106a6ba)\n - OpenMP 201511 (a.k.a. OpenMP 4.5)\n - LAPACK is enabled (usually provided by MKL)\n - NNPACK is enabled\n - CPU capability usage: AVX2\n - HIP Runtime 5.6.31061\n - MIOpen 2.20.0\n - Magma 2.6.2\n - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CXX_COMPILER=/opt/cache/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOCUPTI -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wunused-local-typedefs -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno 
-fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_DISABLE_GPU_ASSERTS=ON, TORCH_VERSION=2.0.0, USE_CUDA=OFF, USE_CUDNN=OFF, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=ON, \n'), ('TorchVision', '0.15.0a0+c206a47'), ('OpenCV', '4.8.0'), ('MMEngine', '0.8.4'), ('MMCV', '2.0.1'), ('MMCV Compiler', 'n/a'), ('MMCV CUDA Compiler', 'n/a')])

Reproduces the problem - code sample

version: '3'
services:
  main:
    image: rocm/pytorch:latest
    command:
      - bash
      - -c
      - |
        pip install -U openmim
        mim install mmengine
        python -c 'from mmengine.utils.dl_utils import collect_env;print(collect_env())'

        git clone --single-branch --branch=v2.0.1 --depth=1 https://github.com/open-mmlab/mmcv.git
        cd mmcv
        pip install -r requirements/optional.txt
        MMCV_WITH_OPS=1 ROCM_HOME=/opt/rocm-5.6.0 python setup.py install
        python -c "from mmcv.utils import collect_env; print(collect_env())"
    environment:
      - HSA_OVERRIDE_GFX_VERSION=10.3.0
    cap_add:
      - SYS_PTRACE
    security_opt:
      - seccomp=unconfined
    devices:
      - /dev/kfd
      - /dev/dri
    group_add:
      - video
    ipc: host
    shm_size: 8G

Reproduces the problem - command or script

docker compose up

Reproduces the problem - error message

Part of the log:

In file included from /var/lib/jenkins/mmcv/mmcv/ops/csrc/pytorch/hip/carafe_hip.hip:4:
/var/lib/jenkins/mmcv/mmcv/ops/csrc/common/cuda/../hip/carafe_hip_kernel.cuh:61:21: error: call to '__shfl_down' is ambiguous
    __PHALF(val) += __shfl_down(val, offset);
                    ^~~~~~~~~~~
/opt/rocm-5.6.0/include/hip/amd_detail/amd_warp_functions.h:315:7: note: candidate function
float __shfl_down(float var, unsigned int lane_delta, int width = warpSize) {
      ^
/opt/rocm-5.6.0/include/hip/amd_detail/amd_warp_functions.h:322:8: note: candidate function
double __shfl_down(double var, unsigned int lane_delta, int width = warpSize) {
       ^
/opt/rocm-5.6.0/include/hip/amd_detail/amd_warp_functions.h:300:5: note: candidate function
int __shfl_down(int var, unsigned int lane_delta, int width = warpSize) {
    ^
/opt/rocm-5.6.0/include/hip/amd_detail/amd_warp_functions.h:308:14: note: candidate function
unsigned int __shfl_down(unsigned int var, unsigned int lane_delta, int width = warpSize) {
             ^
/opt/rocm-5.6.0/include/hip/amd_detail/amd_warp_functions.h:336:6: note: candidate function
long __shfl_down(long var, unsigned int lane_delta, int width = warpSize)
     ^
/opt/rocm-5.6.0/include/hip/amd_detail/amd_warp_functions.h:356:15: note: candidate function
unsigned long __shfl_down(unsigned long var, unsigned int lane_delta, int width = warpSize)
              ^
/opt/rocm-5.6.0/include/hip/amd_detail/amd_warp_functions.h:376:11: note: candidate function
long long __shfl_down(long long var, unsigned int lane_delta, int width = warpSize)
          ^
/opt/rocm-5.6.0/include/hip/amd_detail/amd_warp_functions.h:389:20: note: candidate function
unsigned long long __shfl_down(unsigned long long var, unsigned int lane_delta, int width = warpSize)
                   ^
/opt/rocm-5.6.0/include/hip/amd_detail/amd_hip_fp16.h:1759:17: note: candidate function
         __half __shfl_down(__half var, unsigned int lane_delta, int width = warpSize) {
                ^

Entire log: mmcv-log.tar.gz

Additional information

What's your expected result? The build succeeds.
What dataset did you use? N/A
What do you think might be the reason? No idea.


choyuansu · 2023-09-02T14:09:24Z

This only happens with the rocm/pytorch:rocm5.6_ubuntu20.04_py3.8_pytorch_2.0.1 image and not the rocm/pytorch:rocm5.5_ubuntu20.04_py3.8_pytorch_1.13.1 image.

choyuansu changed the title ~~[Bug] Failed to build mmcv in rocm/pytorch docker image~~ [Bug] Failed to build mmcv in rocm/pytorch docker image: call to '__shfl_down' is ambiguous Sep 1, 2023

zhouzaida added the ROCm label Sep 3, 2023

zhouzaida linked a pull request Sep 3, 2023 that will close this issue: using PyTorch WARP_SHFL_DOWN macro for half support #2843 (Merged)

zhouzaida closed this as completed in #2843 Sep 3, 2023
