boost cmake version for Tensile 4.2 #961
@jerryyin could you comment on the selection of cmake version for libMLIRMIOpen?
Dockerfile
Outdated
@@ -80,7 +80,7 @@ RUN cget -p $PREFIX init --cxx /opt/rocm/llvm/bin/clang++ --std=c++14 -DAMDGPU_T
 # Install dependencies
 RUN cget -p $PREFIX install pfultz2/rocm-recipes
 # Install a newer version of cmake for libMLIRMIOpen
-RUN cget -p $PREFIX install kitware/cmake@v3.13.4
+RUN if [ "$USE_TARGETID" = "ON" ] ; then cget -p $PREFIX install kitware/cmake@v3.15.1; else cget -p $PREFIX install kitware/cmake@v3.13.4; fi
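The one-line conditional in the diff above is easier to review when unrolled. The following is a minimal sketch of the same selection logic as a standalone shell function; the function name `pick_cmake_tag` is hypothetical and not part of the PR, and the `cget` command is only printed, not executed.

```shell
#!/bin/sh
# Hypothetical helper (not in the PR): maps the USE_TARGETID setting
# to the cmake tag chosen by the Dockerfile change above.
pick_cmake_tag() {
    if [ "$1" = "ON" ]; then
        echo "v3.15.1"   # newer cmake, needed for the Tensile build
    else
        echo "v3.13.4"   # previous default, kept for libMLIRMIOpen
    fi
}

# Dry run: show the install command that would be issued.
USE_TARGETID="${USE_TARGETID:-OFF}"
echo "cget -p \$PREFIX install kitware/cmake@$(pick_cmake_tag "$USE_TARGETID")"
```

Any value other than `ON`, including an unset variable, falls through to v3.13.4, which matches the `else` branch of the original `RUN` line.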
@ce1adon This is more of a question for discussion: how do we determine the versions for the Common Development Environment (CDE) and the Common Runtime Environment (CRE)?
https://pkgs.org/download/cmake and https://cliutils.gitlab.io/modern-cmake/chapters/intro/installing.html as references for cmake versions:
Ubuntu 18.04 LTS has 3.10.2
Ubuntu 20.04 LTS has 3.16.3
RHEL/CentOS 7, well, being RHEL/CentOS, has 2.8.12.2
The two versions mentioned in this change, however:
3.13.4 seems to ship with Ubuntu 19.04 (which reached EOL on Jan 23rd 2020) or Debian 10
Why 3.15.1? Is it selected based on a specific feature, or in other words, why not 3.16.3 or even 3.20.3 directly?
3.15.1 is the minimum version required to build the latest Tensile.
3.20.x is not working, too new for latest tensile build.
3.16.x may work, but there may be concerns about its compatibility with Ubuntu 18.04.
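The version window described in this comment (at least 3.15.1, below 3.20) can be checked mechanically. This is a minimal sketch, assuming GNU `sort -V` is available for version ordering; the function names `version_ge` and `cmake_ok_for_tensile` are hypothetical, and the bounds only encode the claims made in this comment, not an official Tensile requirement.

```shell
#!/bin/sh
# True when version $1 >= version $2. Relies on `sort -V` (GNU coreutils).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Assumed window from the comment above: 3.15.1 <= version < 3.20.0.
cmake_ok_for_tensile() {
    version_ge "$1" "3.15.1" && ! version_ge "$1" "3.20.0"
}

# Classify the versions mentioned in this thread.
for v in 3.10.2 3.13.4 3.15.1 3.16.3 3.20.3; do
    if cmake_ok_for_tensile "$v"; then
        echo "$v: inside the window"
    else
        echo "$v: outside the window"
    fi
done
```

A check like this could be fed the output of `cmake --version` in a Docker build step, so that a version mismatch fails early rather than surfacing as a Tensile build error.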
Okay. No objections to this PR, just curious how we determine these dependencies and their long-term effects on support.
Hopefully @pfultz2 can give some comment on this question.
@junliume Sorry, I was busy with the TF/MLIR triaging and didn't get a chance to look at this. I'm a bit concerned about this comment:
3.20.x is not working, too new for latest tensile build.
The cmake version required to build the latest LLVM is usually bumped to the latest release. I can foresee this causing a conflict between libMLIRMIOpen and Tensile.
There's a possibility that we will have to bump it to 3.15.1 in all scenarios for the next MLIR PRs; we'll see.
Yes. I do not see anything that forces us to stay with 3.13.4. Let's switch to 3.15 globally.
#978 (that bumps CMake to 3.15.1 unconditionally) is merged in.
solved
If the rest of the stack is using cmake 3.15.1, then it is fine with us. Consistency and stability are my only concerns.
Ideally, all of our components should use the cmake version in Ubuntu 18.04 (with the exception of LLVM), and most of our components follow that. It is concerning that Tensile requires such a new version. The entire ROCm stack needs to decide on the minimum version of cmake we will use, and then there should be a process in place, involving all component owners, for considering a newer version. For now, we have no choice but to upgrade the cmake version in order to build Tensile, but there should be a larger discussion about what minimum version of cmake we plan to support in ROCm.
Comments from MLIR, blocking this PR for now
Unblock it for comments from MLIR.
3.15.1 is still a little bit awkward, but again, it is fine as long as overall consistency is kept.
P.S. this page is outdated: https://github.com/ROCmSoftwarePlatform/Tensile/wiki/Dependencies
Y'all could go for any 3.15.x. We're good through 3.20.
CI errors.
Strange that boosting the cmake version can cause a memory access fault.
@atamazov For some reason the CI no longer prints out the configs that failed in conv2d and immed_conv2d, which makes tracking the issue very difficult. Is it possible to enable printing of the driver cmd in CI?
WRT "Fp16 Hip Tensile All gfx908" that failed at run
to this line: https://github.com/ROCmSoftwarePlatform/MIOpen/blob/01ff3da68b68b2ffbc31cb681184a8082ac7ce26/Jenkinsfile#L843 should add necessary printing. The line was added in #556. 🔴 Please note that we should use
Please review.
Therefore I recommend using MIOPEN_LOG_LEVEL=6 for triaging the issue. Is it reproducible locally?
Please keep both builds 7 and 8 alive. They are testing different stages.
@ce1adon I suspect there could be a problem with some other algorithm. Maybe it's worth disabling ALL algorithms except GEMM for MIOpenTensile "Full" stages? This way, the problems in other algorithms won't affect these stages. I am looking into ways of implementing this.
Good thoughts. Will try to prove this theory locally.
As @atamazov said, cmd "MIOPEN_DEBUG_HIP_KERNELS=0 MIOPEN_LOG_LEVEL=5 CTEST_PARALLEL_LEVEL=4 MIOPEN_CONV_PRECISE_ROCBLAS_TIMING=0 MIOPEN_DEBUG_CONV_FFT=0 MIOPEN_DEBUG_CONV_DIRECT=0 MIOPEN_DEBUG_CONV_WINOGRAD=0 MIOPEN_DEBUG_CONV_IMPLICIT_GEMM=0 MIOPEN_DEBUG_CONV_SCGEMM=0 make -j check" actually passed the test so the issue is not in GEMM. I ran the tests with log level at 6 and got the following error msg. I will share the full log through email.
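For readability, the GEMM-only command quoted above can be split into one variable per line. This is a minimal sketch: the variable names and values are copied from the quoted command, while the function name `gemm_only_env` is hypothetical. The script only prints the resulting command line; it does not run the tests.

```shell
#!/bin/sh
# Hypothetical helper: emits the environment settings from the command
# quoted above, one per line. All non-GEMM convolution algorithms are
# disabled, so only the GEMM path is exercised.
gemm_only_env() {
    cat <<'EOF'
MIOPEN_DEBUG_HIP_KERNELS=0
MIOPEN_LOG_LEVEL=5
CTEST_PARALLEL_LEVEL=4
MIOPEN_CONV_PRECISE_ROCBLAS_TIMING=0
MIOPEN_DEBUG_CONV_FFT=0
MIOPEN_DEBUG_CONV_DIRECT=0
MIOPEN_DEBUG_CONV_WINOGRAD=0
MIOPEN_DEBUG_CONV_IMPLICIT_GEMM=0
MIOPEN_DEBUG_CONV_SCGEMM=0
EOF
}

# Dry run: print the full command line instead of executing it.
echo "env $(gemm_only_env | tr '\n' ' ')make -j check"
```

To actually run the tests, `env $(gemm_only_env) make -j check` could be invoked from the build directory; the unquoted expansion is intentional so that each assignment becomes a separate argument to `env`.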
@ce1adon I will provide you with the necessary patch (that disables all but GEMM for conv tests) soon.
@ce1adon Please let me know if you would like me to update this branch directly. I will give you a .diff file otherwise.
@ce1adon Here it is: test_conv_all_miotensile_only_gemm.diff.txt. Currently it affects test_conv2d/3d and test_immed_conv2d/3d only.
@atamazov This is not going to work, since some configs require a workspace that exceeds the memory limit in the GEMM path. Let's ping the relevant author to solve the root cause in ConvAsmImplicitGemmGTCDynamicWrwXdlops, unless there are other concerns.
Please use
Thank you Artem. These are all good thoughts. But I'm afraid I'm not planning to add workarounds to let this test pass. I would prefer to wait for a solution to the blocker. Thanks for understanding!
No problem. But please note that "my" solution improves the quality of miopentensile testing. It makes sure that GEMM will always be fully validated. Without it, only the "best" solver will be verified.
Sure. This is in fact my original proposal from one year ago for running the miopentensile test. Since the test rule has been settled, let's follow it.
Merge conflicts.
To issue #960
2021/6/11 blocker #980