Fix repository URL in ubuntu_install_rocm.sh #9425

Merged 2 commits on Nov 11, 2021
6 changes: 3 additions & 3 deletions docker/install/ubuntu_install_rocm.sh
@@ -21,10 +21,10 @@ set -u
 set -o pipefail

 # Install ROCm cross compilation toolchain.
-wget -qO - http://repo.radeon.com/rocm/apt/debian/rocm.gpg.key | sudo apt-key add -
-echo deb [arch=amd64] http://repo.radeon.com/rocm/apt/debian/ xenial main > /etc/apt/sources.list.d/rocm.list
+wget -qO - https://repo.radeon.com/rocm/rocm.gpg.key | sudo apt-key add -
+echo 'deb [arch=amd64] https://repo.radeon.com/rocm/apt/4.3/ ubuntu main' | sudo tee /etc/apt/sources.list.d/rocm.list
Contributor

I think we should use ROCM version 4.2 here. ROCM version 4.3 includes LLVM 13, which doesn't build with TVM, see https://discuss.tvm.apache.org/t/rocm-target-fails-with-llvm-error/11208/2. ROCM version 4.2 includes LLVM 12, which works.

Contributor Author

Interesting, because I checked the version currently in use, and it looks like it is rocm 4.3.0. That is why I proposed this version specifically:

$ docker run -it --rm tlcpack/ci-gpu:v0.78 bash
root@488adea49541:/# dpkg -l | grep rocm
ii  rocm-clang-ocl                        0.5.0.40300-52                                                   amd64        OpenCL compilation with clang compiler.
ii  rocm-cmake                            0.5.0.40300-52                                                   amd64        rocm-cmake built using CMake
ii  rocm-dbgapi                           0.48.0.40300-52                                                  amd64        Library to provide AMD GPU debugger API
ii  rocm-debug-agent                      2.0.1.40300-52                                                   amd64        Radeon Open Compute Debug Agent (ROCdebug-agent)
ii  rocm-dev                              4.3.0.40300-52                                                   amd64        Radeon Open Compute (ROCm) Runtime software stack
ii  rocm-device-libs                      1.0.0.40300-52                                                   amd64        Radeon Open Compute - device libraries
ii  rocm-gdb                              10.2.40300-52                                                    amd64        ROCgdb
ii  rocm-opencl                           2.0.0.40300-52                                                   amd64        OpenCL: Open Computing Language on ROCclr
ii  rocm-opencl-dev                       2.0.0.40300-52                                                   amd64        OpenCL: Open Computing Language on ROCclr
ii  rocm-smi-lib                          4.0.0.40300-52                                                   amd64        AMD System Management libraries
ii  rocm-utils                            4.3.0.40300-52                                                   amd64        Radeon Open Compute (ROCm) Runtime software stack
ii  rocminfo                              1.0.0.40300-52                                                   amd64        Radeon Open Compute (ROCm) Runtime rocminfo tool
root@488adea49541:/# 

I'm not very familiar with ROCm in general, so can you have a look and see what's best for us to do?
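
For a scriptable version of the check above, the release can be read off the `rocm-dev` package version. This is a minimal sketch, assuming the `major.minor.patch.build-revision` layout seen in the listing also holds for other releases; `rocm_release` is a hypothetical helper name, not something in the TVM scripts.

```shell
# Hypothetical helper: reduce a ROCm package version such as
# "4.3.0.40300-52" to its "major.minor" release ("4.3").
rocm_release() {
    printf '%s\n' "$1" | cut -d. -f1,2
}

# Usage inside the image (needs dpkg, so it is left commented out here):
#   rocm_release "$(dpkg-query -W -f='${Version}' rocm-dev)"
```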

Member

We don't have to use rocm's fork of llvm 13. Rocm 4.3 works fine.

Contributor

Then the TVM build is fine, but I am seeing the following issue when running an example because we have LLVM 12 in TVM:

E           TVMError: Fail to load bitcode file /opt/rocm/amdgcn/bitcode/hc.bc
E           line -1:Invalid record (Producer: 'LLVM13.0.0git' Reader: 'LLVM 12.0.1')
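
The error reports both sides of the mismatch: the LLVM that wrote the bitcode (Producer) and the LLVM trying to read it (Reader). As a rough sketch, the two major versions can be pulled out of such a line and compared. It assumes the message keeps the `Producer: 'LLVM…' Reader: 'LLVM…'` shape shown here; both function names are hypothetical.

```shell
# Hypothetical helper: extract the LLVM major version for a given role
# ("Producer" or "Reader") from an "Invalid record" error line.
llvm_major_of() {
    role="$1"
    message="$2"
    printf '%s\n' "$message" |
        sed -n "s/.*${role}: 'LLVM *\([0-9][0-9]*\).*/\1/p"
}

# Succeed only if the bitcode was written by a newer LLVM major version
# than the one reading it; fail if either version cannot be parsed.
bitcode_too_new() {
    producer=$(llvm_major_of Producer "$1")
    reader=$(llvm_major_of Reader "$1")
    [ -n "$producer" ] && [ -n "$reader" ] && [ "$producer" -gt "$reader" ]
}
```

For the message above this gives 13 for the producer and 12 for the reader, so `bitcode_too_new` succeeds.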

I went back and tried a bunch of versions of ROCM and LLVM (upstream, not the one included in ROCM), and this is what I got when running an example with each combination:

ROCM 4.3
+ lld-9 + llvm-config-9   -> LLVM ERROR: Unknown specifier in datalayout string
+ lld-10 + llvm-config-10 -> LLVM ERROR: Unknown specifier in datalayout string
+ lld-11 + llvm-config-11 -> LLVM ERROR: Unknown specifier in datalayout string
+ lld-12 + llvm-config-12 -> TVMError: Fail to load bitcode file /opt/rocm/amdgcn/bitcode/hc.bc    line -1:Invalid record (Producer: 'LLVM13.0.0git' Reader: 'LLVM 12.0.1')

ROCM 4.2
+ lld-9 + llvm-config-9   -> LLVM ERROR: Unknown specifier in datalayout string
+ lld-10 + llvm-config-10 -> LLVM ERROR: Unknown specifier in datalayout string
+ lld-11 + llvm-config-11 -> LLVM ERROR: Unknown specifier in datalayout string
+ lld-12 + llvm-config-12 -> Check failed: ret == 0 (-1 vs. 0) : TVMError: ROCM HIP Error: hipModuleLoadData(&(module_[device_id]), data_.c_str()) failed with error: hipErrorSharedObjectInitFailed

ROCM 4.1
+ lld-9 + llvm-config-9   -> Works
+ lld-10 + llvm-config-10 -> Works
+ lld-11 + llvm-config-11 -> Works
+ lld-12 + llvm-config-12 -> Check failed: ret == 0 (-1 vs. 0) : TVMError: ROCM HIP Error: hipModuleLoadData(&(module_[device_id]), data_.c_str()) failed with error: hipErrorSharedObjectInitFaile


ROCM 4.0
+ lld-9 + llvm-config-9   -> Works
+ lld-10 + llvm-config-10 -> Works
+ lld-11 + llvm-config-11 -> Works
+ lld-12 + llvm-config-12 -> Works

It looks like the last version of ROCM that works across the board is v4.0.
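
The results above can be captured as a small lookup, for example to guard an install script against a known-bad pairing. This is only a sketch of the reported observations: the function name and the short outcome labels are made up for illustration, and combinations outside the table are reported as untested.

```shell
# Outcome of running an example, as reported in the table above.
# $1: ROCm release (4.0 to 4.3), $2: upstream LLVM major version (9 to 12).
# The outcome labels are illustrative shorthand, not real error codes.
rocm_llvm_result() {
    case "$1:$2" in
        4.0:9|4.0:10|4.0:11|4.0:12) echo works ;;
        4.1:9|4.1:10|4.1:11)        echo works ;;
        4.1:12|4.2:12)              echo hip-module-load-failed ;;
        4.2:9|4.2:10|4.2:11)        echo datalayout-error ;;
        4.3:9|4.3:10|4.3:11)        echo datalayout-error ;;
        4.3:12)                     echo bitcode-too-new ;;
        *)                          echo untested ;;
    esac
}
```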

Member

Yes rocm 4.3 apparently requires llvm 13. I confirmed that rocm 4.3 + the upstream llvm 13 works, but to build TVM with the upstream llvm 13, we need to fix one line in codegen_llvm.cc:
https://discuss.tvm.apache.org/t/rocm-target-fails-with-llvm-error/11208/6

There is also an open issue for building with llvm 13 #9319

@leandron can you also add the llvm 13 build fix in my discuss post above?

Member

OK, I tried building with llvm-13 and actually didn't hit an error. The discussions in https://discuss.tvm.apache.org/t/rocm-target-fails-with-llvm-error/11208/8 and #9319 confused me, but they were probably using an older TVM.

Contributor

If we want to use ROCM 4.3, I think we should add llvm-13 to the install script (https://github.com/apache/tvm/blob/main/docker/install/ubuntu1804_install_llvm.sh) before merging. Otherwise, running TVM with ROCM inside the docker images won't work, because the latest version installed there is llvm-12.
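
As a rough illustration of what that addition might look like, the sketch below prints apt source entries for a given LLVM release. It assumes the apt.llvm.org naming scheme for Ubuntu 18.04 (`llvm-toolchain-bionic-<release>`) and is not copied from the install script; `llvm_apt_sources` is a hypothetical helper.

```shell
# Hypothetical helper: print apt source entries for one LLVM release,
# assuming the apt.llvm.org layout for Ubuntu 18.04 (bionic).
llvm_apt_sources() {
    release="$1"
    printf 'deb http://apt.llvm.org/bionic/ llvm-toolchain-bionic-%s main\n' "$release"
    printf 'deb-src http://apt.llvm.org/bionic/ llvm-toolchain-bionic-%s main\n' "$release"
}

# Possible use in an install script (an assumption, not part of this PR):
#   llvm_apt_sources 13 >> /etc/apt/sources.list.d/llvm.list
#   apt-get update && apt-get install -y llvm-13 clang-13
```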

Contributor

Btw, there are some rocm tests here: https://github.com/apache/tvm/blob/main/tests/python/unittest/test_target_codegen_rocm.py, but I don't think they are actually being run in the ci-gpu regressions. Otherwise, we would have hit these version issues earlier.

Member

I've never used ci-gpu locally, so I'm not sure. But I wouldn't be surprised if rocm tests are not exercised at all.

Yes, if rocm 4.3 is intended to be used with TVM in a docker, llvm should also be upgraded to 13.

Contributor

from tests/scripts/task_python_unittest_gpuonly.sh:

export TVM_TEST_TARGETS="cuda;opencl;metal;rocm;nvptx;opencl -device=mali,aocl_sw_emu"

These are the targets exercised on ci-gpu, so it looks like rocm should be among them.
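
A quick way to confirm that a target is listed is to split the variable on `;` and compare the first word of each entry. This is a sketch with a hypothetical helper name. It only inspects the list and says nothing about whether a ROCm device is actually present when the tests run.

```shell
# Hypothetical helper: succeed if target kind $1 appears in the
# semicolon-separated target list $2. Only the first word of each
# entry is compared, so "opencl -device=mali,aocl_sw_emu" counts as "opencl".
has_test_target() {
    printf '%s\n' "$2" | tr ';' '\n' |
        awk -v want="$1" '$1 == want { found = 1 } END { exit !found }'
}

# Example:
#   has_test_target rocm "$TVM_TEST_TARGETS" && echo "rocm is listed"
```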

 apt-get update && apt-get install -y \
     rocm-dev \
-    lld && \
+    lld-12 && \
     apt-get clean && \
     rm -rf /var/lib/apt/lists/*