Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should use ROCM version 4.2 here. ROCM version 4.3 includes LLVM 13, which doesn't build with TVM, see https://discuss.tvm.apache.org/t/rocm-target-fails-with-llvm-error/11208/2. ROCM version 4.2 includes LLVM 12, which works.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting because I checked the current version being used, and it looks like it is rocm 4.3.0. That is why I proposed to use this version specifically:
I'm not very familiar with ROCm in general, so can you have a look and see what's best for us to do?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't have to use rocm's fork of llvm 13. Rocm 4.3 works fine
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then the TVM build is fine but I am seeing following issue when running an example because we have LLVM 12 in TVM:
I went back and tried a bunch versions of ROCM and LLVM (upstream, not the one included in ROCM) and this is what I got when running an example with every combination:
It looks like the last version of ROCM that works across the board is v4.0.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes rocm 4.3 apparently requires llvm 13. I confirmed that rocm 4.3 + the upstream llvm 13 works, but to build TVM with the upstream llvm 13, we need to fix one line in
codegen_llvm.cc
:https://discuss.tvm.apache.org/t/rocm-target-fails-with-llvm-error/11208/6
There is also an open issue for building with llvm 13 #9319
@leandron can you also add the llvm 13 build fix in my discuss post above?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok I tried building with llvm-13 but actually didn't hit an error. The discussions in https://discuss.tvm.apache.org/t/rocm-target-fails-with-llvm-error/11208/8 and #9319 confused me, but probably they were using on older TVM.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we want to use ROCM 4.3, I think we should add llvm-13 to the install script: https://github.com/apache/tvm/blob/main/docker/install/ubuntu1804_install_llvm.sh before merging. Otherwise, running TVM with ROCM inside the docker images won't work because the last version there is llvm-12.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Btw, there are some rocm tests here: https://github.com/apache/tvm/blob/main/tests/python/unittest/test_target_codegen_rocm.py but I don't think they are actually being run inside the ci-gpu regressions? Otherwise, we should have encountered these version issues earlier?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've never used
ci-gpu
locally, so I'm not sure. But I wouldn't be surprised if rocm tests are not exercised at all.Yes, if rocm 4.3 is intended to be used with TVM in a docker, llvm should also be upgraded to 13.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
from tests/scripts/task_python_unittest_gpuonly.sh:
these are the targets exercised on ci-gpu. looks like rocm should be.