Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix repository URL in ubuntu_install_rocm.sh #9425

Merged
merged 2 commits into from
Nov 11, 2021

Conversation

leandron
Copy link
Contributor

@leandron leandron commented Nov 2, 2021

Fix repository URL in ubuntu_install_rocm.sh:

cc @tqchen @jtuyls @Mousius

* ROCm dependency installation process was following outdated procedures.
  This PR makes the installation script to point to the correct repository

* Installation documentation is at:
  https://rocmdocs.amd.com/en/latest/Installation_Guide/Installation-Guide.html

* Fixes apache#9413
@jtuyls
Copy link
Contributor

jtuyls commented Nov 2, 2021

I was just looking at this as well. However, I had to update the lld to lld-12 as well. Otherwise, I am seeing following issue when building the demo_rocm docker image:

The following packages have unmet dependencies:
 lld : Depends: lld-14 (>= 14~) but it is not going to be installed
E: Unable to correct problems, you have held broken packages.
The command '/bin/sh -c bash /install/ubuntu_install_rocm.sh' returned a non-zero code: 100
ERROR: docker build failed.

@leandron
Copy link
Contributor Author

leandron commented Nov 4, 2021

I was just looking at this as well. However, I had to update the lld to lld-12 as well. Otherwise, I am seeing following issue when building the demo_rocm docker image:

The following packages have unmet dependencies:
 lld : Depends: lld-14 (>= 14~) but it is not going to be installed
E: Unable to correct problems, you have held broken packages.
The command '/bin/sh -c bash /install/ubuntu_install_rocm.sh' returned a non-zero code: 100
ERROR: docker build failed.

I see. So what do we need to do? Migrate the whole CI to LLVM-12? Can you help me to understand what needs to be done? perhaps in separate patches, so that we can safely update ROCm and unblock the update of Docker images?

Co-authored-by: Christopher Sidebottom <git@damouse.co.uk>
@leandron
Copy link
Contributor Author

leandron commented Nov 4, 2021

I think we fixed it now, can you have a look @jtuyls?

wget -qO - http://repo.radeon.com/rocm/apt/debian/rocm.gpg.key | sudo apt-key add -
echo deb [arch=amd64] http://repo.radeon.com/rocm/apt/debian/ xenial main > /etc/apt/sources.list.d/rocm.list
wget -qO - https://repo.radeon.com/rocm/rocm.gpg.key | sudo apt-key add -
echo 'deb [arch=amd64] https://repo.radeon.com/rocm/apt/4.3/ ubuntu main' | sudo tee /etc/apt/sources.list.d/rocm.list
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should use ROCM version 4.2 here. ROCM version 4.3 includes LLVM 13, which doesn't build with TVM, see https://discuss.tvm.apache.org/t/rocm-target-fails-with-llvm-error/11208/2. ROCM version 4.2 includes LLVM 12, which works.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting because I checked the current version being used, and it looks like it is rocm 4.3.0. That is why I proposed to use this version specifically:

$ docker run -it --rm tlcpack/ci-gpu:v0.78 bash
root@488adea49541:/# dpkg -l | grep rocm
ii  rocm-clang-ocl                        0.5.0.40300-52                                                   amd64        OpenCL compilation with clang compiler.
ii  rocm-cmake                            0.5.0.40300-52                                                   amd64        rocm-cmake built using CMake
ii  rocm-dbgapi                           0.48.0.40300-52                                                  amd64        Library to provide AMD GPU debugger API
ii  rocm-debug-agent                      2.0.1.40300-52                                                   amd64        Radeon Open Compute Debug Agent (ROCdebug-agent)
ii  rocm-dev                              4.3.0.40300-52                                                   amd64        Radeon Open Compute (ROCm) Runtime software stack
ii  rocm-device-libs                      1.0.0.40300-52                                                   amd64        Radeon Open Compute - device libraries
ii  rocm-gdb                              10.2.40300-52                                                    amd64        ROCgdb
ii  rocm-opencl                           2.0.0.40300-52                                                   amd64        OpenCL: Open Computing Language on ROCclr
ii  rocm-opencl-dev                       2.0.0.40300-52                                                   amd64        OpenCL: Open Computing Language on ROCclr
ii  rocm-smi-lib                          4.0.0.40300-52                                                   amd64        AMD System Management libraries
ii  rocm-utils                            4.3.0.40300-52                                                   amd64        Radeon Open Compute (ROCm) Runtime software stack
ii  rocminfo                              1.0.0.40300-52                                                   amd64        Radeon Open Compute (ROCm) Runtime rocminfo tool
root@488adea49541:/# 

I'm not very familiar with ROCm in general, so can you have a look and see what's best for us to do?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't have to use rocm's fork of llvm 13. Rocm 4.3 works fine

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then the TVM build is fine but I am seeing following issue when running an example because we have LLVM 12 in TVM:

E           TVMError: Fail to load bitcode file /opt/rocm/amdgcn/bitcode/hc.bc
E           line -1:Invalid record (Producer: 'LLVM13.0.0git' Reader: 'LLVM 12.0.1')

I went back and tried a bunch versions of ROCM and LLVM (upstream, not the one included in ROCM) and this is what I got when running an example with every combination:

ROCM 4.3
+ lld-9 + llvm-config-9   -> LLVM ERROR: Unknown specifier in datalayout string
+ lld-10 + llvm-config-10 -> LLVM ERROR: Unknown specifier in datalayout string
+ lld-11 + llvm-config-11 -> LLVM ERROR: Unknown specifier in datalayout string
+ lld-12 + llvm-config-12 -> TVMError: Fail to load bitcode file /opt/rocm/amdgcn/bitcode/hc.bc    line -1:Invalid record (Producer: 'LLVM13.0.0git' Reader: 'LLVM 12.0.1')

ROCM 4.2
+ lld-9 + llvm-config-9   -> LLVM ERROR: Unknown specifier in datalayout string
+ lld-10 + llvm-config-10 -> LLVM ERROR: Unknown specifier in datalayout string
+ lld-11 + llvm-config-11 -> LLVM ERROR: Unknown specifier in datalayout string
+ lld-12 + llvm-config-12 -> Check failed: ret == 0 (-1 vs. 0) : TVMError: ROCM HIP Error: hipModuleLoadData(&(module_[device_id]), data_.c_str()) failed with error: hipErrorSharedObjectInitFailed

ROCM 4.1
+ lld-9 + llvm-config-9   -> Works
+ lld-10 + llvm-config-10 -> Works
+ lld-11 + llvm-config-11 -> Works
+ lld-12 + llvm-config-12 -> Check failed: ret == 0 (-1 vs. 0) : TVMError: ROCM HIP Error: hipModuleLoadData(&(module_[device_id]), data_.c_str()) failed with error: hipErrorSharedObjectInitFaile


ROCM 4.0
+ lld-9 + llvm-config-9   -> Works
+ lld-10 + llvm-config-10 -> Works
+ lld-11 + llvm-config-11 -> Works
+ lld-12 + llvm-config-12 -> Works

It looks like the last version of ROCM that works across the board is v4.0.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes rocm 4.3 apparently requires llvm 13. I confirmed that rocm 4.3 + the upstream llvm 13 works, but to build TVM with the upstream llvm 13, we need to fix one line in codegen_llvm.cc:
https://discuss.tvm.apache.org/t/rocm-target-fails-with-llvm-error/11208/6

There is also an open issue for building with llvm 13 #9319

@leandron can you also add the llvm 13 build fix in my discuss post above?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok I tried building with llvm-13 but actually didn't hit an error. The discussions in https://discuss.tvm.apache.org/t/rocm-target-fails-with-llvm-error/11208/8 and #9319 confused me, but probably they were using on older TVM.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we want to use ROCM 4.3, I think we should add llvm-13 to the install script: https://github.com/apache/tvm/blob/main/docker/install/ubuntu1804_install_llvm.sh before merging. Otherwise, running TVM with ROCM inside the docker images won't work because the last version there is llvm-12.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw, there are some rocm tests here: https://github.com/apache/tvm/blob/main/tests/python/unittest/test_target_codegen_rocm.py but I don't think they are actually being run inside the ci-gpu regressions? Otherwise, we should have encountered these version issues earlier?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've never used ci-gpu locally, so I'm not sure. But I wouldn't be surprised if rocm tests are not exercised at all.

Yes, if rocm 4.3 is intended to be used with TVM in a docker, llvm should also be upgraded to 13.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from tests/scripts/task_python_unittest_gpuonly.sh:

export TVM_TEST_TARGETS="cuda;opencl;metal;rocm;nvptx;opencl -device=mali,aocl_sw_emu"

these are the targets exercised on ci-gpu. looks like rocm should be.

@areusch
Copy link
Contributor

areusch commented Nov 10, 2021

per discussion with @masahi , waiting for @leandron to post a patch to fix LLVM 13 builds.

@masahi
Copy link
Member

masahi commented Nov 10, 2021

per discussion with @masahi , waiting for @leandron to post a patch to fix LLVM 13 builds.

I should clarify that the patch to fix LLVM 13 is not required. As @jtuyls suggested, we should install llvm 13 in https://github.com/apache/tvm/blob/main/docker/install/ubuntu1804_install_llvm.sh to avoid problem when TVM is built with rocm 4.3.

@leandron So can you update ubuntu1804_install_llvm.sh as well? Or update to rocm 4.2 instead, which doesn't require llvm 13.

leandron added a commit to leandron/tvm that referenced this pull request Nov 11, 2021
 * Installs LLVM 13 in for Ubuntu 18.04 Docker images
 * This is needed as a requirement for apache#9425
@leandron
Copy link
Contributor Author

per discussion with @masahi , waiting for @leandron to post a patch to fix LLVM 13 builds.

I should clarify that the patch to fix LLVM 13 is not required. As @jtuyls suggested, we should install llvm 13 in https://github.com/apache/tvm/blob/main/docker/install/ubuntu1804_install_llvm.sh to avoid problem when TVM is built with rocm 4.3.

@leandron So can you update ubuntu1804_install_llvm.sh as well? Or update to rocm 4.2 instead, which doesn't require llvm 13.

Yes, I've done this in #9498, as I consider it is a separate change.

masahi pushed a commit that referenced this pull request Nov 11, 2021
* Installs LLVM 13 in for Ubuntu 18.04 Docker images
 * This is needed as a requirement for #9425
@masahi masahi merged commit 4bebfd8 into apache:main Nov 11, 2021
AndrewZhaoLuo added a commit to AndrewZhaoLuo/tvm that referenced this pull request Nov 12, 2021
* main: (119 commits)
  [Topi][Op][PyTorch][Vitas] Fix inconsistent kernel layout conventions for conv2d_transpose (apache#9336)
  Fix repository URL in ubuntu_install_rocm.sh (apache#9425)
  Add LLVM-13 installation to Docker setup (apache#9498)
  [Relay] Use target_host determined at Relay level instead of recalculating it (apache#9499)
  Arm(R) Ethos(TM)-U NPU BinaryElementwise operators support (apache#9442)
  [COMMUNITY] Junru's and Wuwei's PGP key for ASF release (apache#9488)
  Add default for split op (apache#9489)
  [HOTFIX][TARGET] Change LOG in compilation config to DLOG (apache#9486)
  Fixed some warnings about lambda's closures that are bigger than necessary (apache#9481)
  [Support] Add libinfo into the runtime build (apache#9310)
  Change Call with TIRCallAttrs to call_lowered op (apache#9312)
  [ETHOSN] Streamline Ethos(TM)-N cross-compile rpc usage (apache#9477)
  [CMSIS-NN] Assert correct amount of CMSIS-NN artifacts in MLF (apache#9480)
  [MicroTVM][PyTest] Explicitly skip MicroTVM unittests. (apache#9335)
  [microNPU] Replace ICHECK with diagnostic context in type inference (apache#9470)
  Better host handling in CompilationConfig & debug printing (apache#9460)
  [AOT][Tests] Use pre-built libraries in Reference System tests (apache#9271)
  [TIR] Add type hint for TIR  (apache#9432)
  [TVMC] Add test for quantized pytorch model (apache#9467)
  [CMSIS-NN] Convert CMSIS-NN to use Target Hooks (apache#9397)
  ...
@leandron
Copy link
Contributor Author

@leandron leandron deleted the fix-rocm-docker branch November 12, 2021 09:21
mehrdadh pushed a commit to mehrdadh/tvm that referenced this pull request Dec 1, 2021
* Installs LLVM 13 in for Ubuntu 18.04 Docker images
 * This is needed as a requirement for apache#9425
mehrdadh pushed a commit to mehrdadh/tvm that referenced this pull request Dec 1, 2021
* Fix repository URL in ubuntu_install_rocm.sh

* ROCm dependency installation process was following outdated procedures.
  This PR makes the installation script to point to the correct repository

* Installation documentation is at:
  https://rocmdocs.amd.com/en/latest/Installation_Guide/Installation-Guide.html

* Fixes apache#9413

* Update docker/install/ubuntu_install_rocm.sh

Co-authored-by: Christopher Sidebottom <git@damouse.co.uk>

Co-authored-by: Christopher Sidebottom <git@damouse.co.uk>
mehrdadh pushed a commit to mehrdadh/tvm that referenced this pull request Dec 1, 2021
* Installs LLVM 13 in for Ubuntu 18.04 Docker images
 * This is needed as a requirement for apache#9425
mehrdadh pushed a commit to mehrdadh/tvm that referenced this pull request Dec 1, 2021
* Fix repository URL in ubuntu_install_rocm.sh

* ROCm dependency installation process was following outdated procedures.
  This PR makes the installation script to point to the correct repository

* Installation documentation is at:
  https://rocmdocs.amd.com/en/latest/Installation_Guide/Installation-Guide.html

* Fixes apache#9413

* Update docker/install/ubuntu_install_rocm.sh

Co-authored-by: Christopher Sidebottom <git@damouse.co.uk>

Co-authored-by: Christopher Sidebottom <git@damouse.co.uk>
ylc pushed a commit to ylc/tvm that referenced this pull request Jan 7, 2022
* Installs LLVM 13 in for Ubuntu 18.04 Docker images
 * This is needed as a requirement for apache#9425
ylc pushed a commit to ylc/tvm that referenced this pull request Jan 7, 2022
* Fix repository URL in ubuntu_install_rocm.sh

* ROCm dependency installation process was following outdated procedures.
  This PR makes the installation script to point to the correct repository

* Installation documentation is at:
  https://rocmdocs.amd.com/en/latest/Installation_Guide/Installation-Guide.html

* Fixes apache#9413

* Update docker/install/ubuntu_install_rocm.sh

Co-authored-by: Christopher Sidebottom <git@damouse.co.uk>

Co-authored-by: Christopher Sidebottom <git@damouse.co.uk>
yangulei pushed a commit to yangulei/tvm that referenced this pull request Jan 11, 2022
* Installs LLVM 13 in for Ubuntu 18.04 Docker images
 * This is needed as a requirement for apache#9425
yangulei pushed a commit to yangulei/tvm that referenced this pull request Jan 11, 2022
* Fix repository URL in ubuntu_install_rocm.sh

* ROCm dependency installation process was following outdated procedures.
  This PR makes the installation script to point to the correct repository

* Installation documentation is at:
  https://rocmdocs.amd.com/en/latest/Installation_Guide/Installation-Guide.html

* Fixes apache#9413

* Update docker/install/ubuntu_install_rocm.sh

Co-authored-by: Christopher Sidebottom <git@damouse.co.uk>

Co-authored-by: Christopher Sidebottom <git@damouse.co.uk>
ylc pushed a commit to ylc/tvm that referenced this pull request Jan 13, 2022
* Installs LLVM 13 in for Ubuntu 18.04 Docker images
 * This is needed as a requirement for apache#9425
ylc pushed a commit to ylc/tvm that referenced this pull request Jan 13, 2022
* Fix repository URL in ubuntu_install_rocm.sh

* ROCm dependency installation process was following outdated procedures.
  This PR makes the installation script to point to the correct repository

* Installation documentation is at:
  https://rocmdocs.amd.com/en/latest/Installation_Guide/Installation-Guide.html

* Fixes apache#9413

* Update docker/install/ubuntu_install_rocm.sh

Co-authored-by: Christopher Sidebottom <git@damouse.co.uk>

Co-authored-by: Christopher Sidebottom <git@damouse.co.uk>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Bug] ubuntu_install_rocm.sh fails when generating ci-gpu image
5 participants