-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix repository URL in ubuntu_install_rocm.sh #9425
Conversation
* ROCm dependency installation process was following outdated procedures. This PR makes the installation script to point to the correct repository * Installation documentation is at: https://rocmdocs.amd.com/en/latest/Installation_Guide/Installation-Guide.html * Fixes apache#9413
I was just looking at this as well. However, I had to update the lld to lld-12 as well. Otherwise, I am seeing following issue when building the demo_rocm docker image:
|
I see. So what do we need to do? Migrate the whole CI to LLVM-12? Can you help me to understand what needs to be done? perhaps in separate patches, so that we can safely update ROCm and unblock the update of Docker images? |
Co-authored-by: Christopher Sidebottom <git@damouse.co.uk>
I think we fixed it now, can you have a look @jtuyls? |
wget -qO - http://repo.radeon.com/rocm/apt/debian/rocm.gpg.key | sudo apt-key add - | ||
echo deb [arch=amd64] http://repo.radeon.com/rocm/apt/debian/ xenial main > /etc/apt/sources.list.d/rocm.list | ||
wget -qO - https://repo.radeon.com/rocm/rocm.gpg.key | sudo apt-key add - | ||
echo 'deb [arch=amd64] https://repo.radeon.com/rocm/apt/4.3/ ubuntu main' | sudo tee /etc/apt/sources.list.d/rocm.list |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should use ROCM version 4.2 here. ROCM version 4.3 includes LLVM 13, which doesn't build with TVM, see https://discuss.tvm.apache.org/t/rocm-target-fails-with-llvm-error/11208/2. ROCM version 4.2 includes LLVM 12, which works.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting because I checked the current version being used, and it looks like it is rocm 4.3.0. That is why I proposed to use this version specifically:
$ docker run -it --rm tlcpack/ci-gpu:v0.78 bash
root@488adea49541:/# dpkg -l | grep rocm
ii rocm-clang-ocl 0.5.0.40300-52 amd64 OpenCL compilation with clang compiler.
ii rocm-cmake 0.5.0.40300-52 amd64 rocm-cmake built using CMake
ii rocm-dbgapi 0.48.0.40300-52 amd64 Library to provide AMD GPU debugger API
ii rocm-debug-agent 2.0.1.40300-52 amd64 Radeon Open Compute Debug Agent (ROCdebug-agent)
ii rocm-dev 4.3.0.40300-52 amd64 Radeon Open Compute (ROCm) Runtime software stack
ii rocm-device-libs 1.0.0.40300-52 amd64 Radeon Open Compute - device libraries
ii rocm-gdb 10.2.40300-52 amd64 ROCgdb
ii rocm-opencl 2.0.0.40300-52 amd64 OpenCL: Open Computing Language on ROCclr
ii rocm-opencl-dev 2.0.0.40300-52 amd64 OpenCL: Open Computing Language on ROCclr
ii rocm-smi-lib 4.0.0.40300-52 amd64 AMD System Management libraries
ii rocm-utils 4.3.0.40300-52 amd64 Radeon Open Compute (ROCm) Runtime software stack
ii rocminfo 1.0.0.40300-52 amd64 Radeon Open Compute (ROCm) Runtime rocminfo tool
root@488adea49541:/#
I'm not very familiar with ROCm in general, so can you have a look and see what's best for us to do?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't have to use rocm's fork of llvm 13. Rocm 4.3 works fine
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then the TVM build is fine but I am seeing following issue when running an example because we have LLVM 12 in TVM:
E TVMError: Fail to load bitcode file /opt/rocm/amdgcn/bitcode/hc.bc
E line -1:Invalid record (Producer: 'LLVM13.0.0git' Reader: 'LLVM 12.0.1')
I went back and tried a bunch versions of ROCM and LLVM (upstream, not the one included in ROCM) and this is what I got when running an example with every combination:
ROCM 4.3
+ lld-9 + llvm-config-9 -> LLVM ERROR: Unknown specifier in datalayout string
+ lld-10 + llvm-config-10 -> LLVM ERROR: Unknown specifier in datalayout string
+ lld-11 + llvm-config-11 -> LLVM ERROR: Unknown specifier in datalayout string
+ lld-12 + llvm-config-12 -> TVMError: Fail to load bitcode file /opt/rocm/amdgcn/bitcode/hc.bc line -1:Invalid record (Producer: 'LLVM13.0.0git' Reader: 'LLVM 12.0.1')
ROCM 4.2
+ lld-9 + llvm-config-9 -> LLVM ERROR: Unknown specifier in datalayout string
+ lld-10 + llvm-config-10 -> LLVM ERROR: Unknown specifier in datalayout string
+ lld-11 + llvm-config-11 -> LLVM ERROR: Unknown specifier in datalayout string
+ lld-12 + llvm-config-12 -> Check failed: ret == 0 (-1 vs. 0) : TVMError: ROCM HIP Error: hipModuleLoadData(&(module_[device_id]), data_.c_str()) failed with error: hipErrorSharedObjectInitFailed
ROCM 4.1
+ lld-9 + llvm-config-9 -> Works
+ lld-10 + llvm-config-10 -> Works
+ lld-11 + llvm-config-11 -> Works
+ lld-12 + llvm-config-12 -> Check failed: ret == 0 (-1 vs. 0) : TVMError: ROCM HIP Error: hipModuleLoadData(&(module_[device_id]), data_.c_str()) failed with error: hipErrorSharedObjectInitFaile
ROCM 4.0
+ lld-9 + llvm-config-9 -> Works
+ lld-10 + llvm-config-10 -> Works
+ lld-11 + llvm-config-11 -> Works
+ lld-12 + llvm-config-12 -> Works
It looks like the last version of ROCM that works across the board is v4.0.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes rocm 4.3 apparently requires llvm 13. I confirmed that rocm 4.3 + the upstream llvm 13 works, but to build TVM with the upstream llvm 13, we need to fix one line in codegen_llvm.cc
:
https://discuss.tvm.apache.org/t/rocm-target-fails-with-llvm-error/11208/6
There is also an open issue for building with llvm 13 #9319
@leandron can you also add the llvm 13 build fix in my discuss post above?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok I tried building with llvm-13 but actually didn't hit an error. The discussions in https://discuss.tvm.apache.org/t/rocm-target-fails-with-llvm-error/11208/8 and #9319 confused me, but probably they were using on older TVM.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we want to use ROCM 4.3, I think we should add llvm-13 to the install script: https://github.com/apache/tvm/blob/main/docker/install/ubuntu1804_install_llvm.sh before merging. Otherwise, running TVM with ROCM inside the docker images won't work because the last version there is llvm-12.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Btw, there are some rocm tests here: https://github.com/apache/tvm/blob/main/tests/python/unittest/test_target_codegen_rocm.py but I don't think they are actually being run inside the ci-gpu regressions? Otherwise, we should have encountered these version issues earlier?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've never used ci-gpu
locally, so I'm not sure. But I wouldn't be surprised if rocm tests are not exercised at all.
Yes, if rocm 4.3 is intended to be used with TVM in a docker, llvm should also be upgraded to 13.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
from tests/scripts/task_python_unittest_gpuonly.sh:
export TVM_TEST_TARGETS="cuda;opencl;metal;rocm;nvptx;opencl -device=mali,aocl_sw_emu"
these are the targets exercised on ci-gpu. looks like rocm should be.
I should clarify that the patch to fix LLVM 13 is not required. As @jtuyls suggested, we should install llvm 13 in https://github.com/apache/tvm/blob/main/docker/install/ubuntu1804_install_llvm.sh to avoid problem when TVM is built with rocm 4.3. @leandron So can you update |
* Installs LLVM 13 in for Ubuntu 18.04 Docker images * This is needed as a requirement for apache#9425
Yes, I've done this in #9498, as I consider it is a separate change. |
* Installs LLVM 13 in for Ubuntu 18.04 Docker images * This is needed as a requirement for #9425
* main: (119 commits) [Topi][Op][PyTorch][Vitas] Fix inconsistent kernel layout conventions for conv2d_transpose (apache#9336) Fix repository URL in ubuntu_install_rocm.sh (apache#9425) Add LLVM-13 installation to Docker setup (apache#9498) [Relay] Use target_host determined at Relay level instead of recalculating it (apache#9499) Arm(R) Ethos(TM)-U NPU BinaryElementwise operators support (apache#9442) [COMMUNITY] Junru's and Wuwei's PGP key for ASF release (apache#9488) Add default for split op (apache#9489) [HOTFIX][TARGET] Change LOG in compilation config to DLOG (apache#9486) Fixed some warnings about lambda's closures that are bigger than necessary (apache#9481) [Support] Add libinfo into the runtime build (apache#9310) Change Call with TIRCallAttrs to call_lowered op (apache#9312) [ETHOSN] Streamline Ethos(TM)-N cross-compile rpc usage (apache#9477) [CMSIS-NN] Assert correct amount of CMSIS-NN artifacts in MLF (apache#9480) [MicroTVM][PyTest] Explicitly skip MicroTVM unittests. (apache#9335) [microNPU] Replace ICHECK with diagnostic context in type inference (apache#9470) Better host handling in CompilationConfig & debug printing (apache#9460) [AOT][Tests] Use pre-built libraries in Reference System tests (apache#9271) [TIR] Add type hint for TIR (apache#9432) [TVMC] Add test for quantized pytorch model (apache#9467) [CMSIS-NN] Convert CMSIS-NN to use Target Hooks (apache#9397) ...
So it seems the installation is fixed now: https://ci.tlcpack.ai/blue/organizations/jenkins/docker-images-ci%2Fdaily-docker-image-rebuild/detail/daily-docker-image-rebuild/109/pipeline |
* Installs LLVM 13 in for Ubuntu 18.04 Docker images * This is needed as a requirement for apache#9425
* Fix repository URL in ubuntu_install_rocm.sh * ROCm dependency installation process was following outdated procedures. This PR makes the installation script to point to the correct repository * Installation documentation is at: https://rocmdocs.amd.com/en/latest/Installation_Guide/Installation-Guide.html * Fixes apache#9413 * Update docker/install/ubuntu_install_rocm.sh Co-authored-by: Christopher Sidebottom <git@damouse.co.uk> Co-authored-by: Christopher Sidebottom <git@damouse.co.uk>
* Installs LLVM 13 in for Ubuntu 18.04 Docker images * This is needed as a requirement for apache#9425
* Fix repository URL in ubuntu_install_rocm.sh * ROCm dependency installation process was following outdated procedures. This PR makes the installation script to point to the correct repository * Installation documentation is at: https://rocmdocs.amd.com/en/latest/Installation_Guide/Installation-Guide.html * Fixes apache#9413 * Update docker/install/ubuntu_install_rocm.sh Co-authored-by: Christopher Sidebottom <git@damouse.co.uk> Co-authored-by: Christopher Sidebottom <git@damouse.co.uk>
* Installs LLVM 13 in for Ubuntu 18.04 Docker images * This is needed as a requirement for apache#9425
* Fix repository URL in ubuntu_install_rocm.sh * ROCm dependency installation process was following outdated procedures. This PR makes the installation script to point to the correct repository * Installation documentation is at: https://rocmdocs.amd.com/en/latest/Installation_Guide/Installation-Guide.html * Fixes apache#9413 * Update docker/install/ubuntu_install_rocm.sh Co-authored-by: Christopher Sidebottom <git@damouse.co.uk> Co-authored-by: Christopher Sidebottom <git@damouse.co.uk>
* Installs LLVM 13 in for Ubuntu 18.04 Docker images * This is needed as a requirement for apache#9425
* Fix repository URL in ubuntu_install_rocm.sh * ROCm dependency installation process was following outdated procedures. This PR makes the installation script to point to the correct repository * Installation documentation is at: https://rocmdocs.amd.com/en/latest/Installation_Guide/Installation-Guide.html * Fixes apache#9413 * Update docker/install/ubuntu_install_rocm.sh Co-authored-by: Christopher Sidebottom <git@damouse.co.uk> Co-authored-by: Christopher Sidebottom <git@damouse.co.uk>
* Installs LLVM 13 in for Ubuntu 18.04 Docker images * This is needed as a requirement for apache#9425
* Fix repository URL in ubuntu_install_rocm.sh * ROCm dependency installation process was following outdated procedures. This PR makes the installation script to point to the correct repository * Installation documentation is at: https://rocmdocs.amd.com/en/latest/Installation_Guide/Installation-Guide.html * Fixes apache#9413 * Update docker/install/ubuntu_install_rocm.sh Co-authored-by: Christopher Sidebottom <git@damouse.co.uk> Co-authored-by: Christopher Sidebottom <git@damouse.co.uk>
Fix repository URL in ubuntu_install_rocm.sh:
This PR makes the installation script to point to the correct repository
https://rocmdocs.amd.com/en/latest/Installation_Guide/Installation-Guide.html
cc @tqchen @jtuyls @Mousius