Fix CD failure due to illegal instruction in OpenBLAS #18408
Hey @leezu, thanks for submitting the PR
CI supported jobs: [windows-cpu, centos-gpu, miscellaneous, edge, unix-cpu, website, sanity, centos-cpu, windows-gpu, unix-gpu, clang] Note:
@mxnet-bot run ci [unix-cpu]
Jenkins CI successfully triggered : [unix-cpu]
@mxnet-bot run ci [unix-cpu]
Jenkins CI successfully triggered : [unix-cpu]
Hi @leezu, can you help backport this PR to v1.7.x and v1.x as well? Thanks.
@ciyongch I intend to first make the staticbuild script more maintainable, then backport the complete change. I'll CC you in that PR.
Sure, thanks @leezu. As we're going to tag the branch in the next few days if there are no other comments or pending issues, please arrange your time accordingly :)
For this change, we may need to wait until OpenBLAS releases a stable version containing the fix. Currently we rely on a pre-release version. But this doesn't need to be a dependency for the 1.7 release.
OK @leezu, just to confirm: a formal OpenBLAS release is preferred for this fix, and we're not targeting the upcoming 1.7 with it, right? Thanks!
Yes, let's not keep this as a blocker. I'll monitor the OpenBLAS release situation and will open a PR if OpenBLAS releases in time.
* Update to OpenBLAS 0.3.10 pre-release. Includes OpenMathLib/OpenBLAS#2527
* Enable support for older architectures in OpenBLAS dynamic architecture feature
The first pipeline that fails with illegal instruction errors is http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/restricted-mxnet-cd%2Fmxnet-cd-release-job/detail/mxnet-cd-release-job/1145/pipeline/305
The last working one is http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/restricted-mxnet-cd%2Fmxnet-cd-release-job/detail/mxnet-cd-release-job/1142/pipeline/483
As 47b0bdd was merged in between the two runs, one hypothesis is that the OpenBLAS build on CD is including instructions that are only available on the CPU arch used for the build. That shouldn't happen, as OpenBLAS is built with `DYNAMIC_ARCH=1` (https://github.com/apache/incubator-mxnet/blob/2219f1ad77b685d4e615fb8cd7f1992e9764ca7c/tools/dependencies/openblas.sh#L36), but it turns out there is an OpenBLAS bug that causes this.
I reproduced the issue locally by building libopenblas.so and libmxnet.so via the staticbuild script on a c5 instance, then changing the instance type to c1 and running the onnx unittests (which are shown as segfaulting in the CD log).
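A build-on-c5, test-on-c1 setup like this only reproduces the problem if the two hosts actually differ in instruction-set support. The sketch below is one way to check that on each host. It assumes an x86 Linux machine exposing `/proc/cpuinfo`, and it assumes AVX512 (reported as the `avx512f` flag) is the extension of interest; `has_cpu_flag` is a hypothetical helper, not part of the MXNet or OpenBLAS tooling.

```shell
#!/bin/sh
# has_cpu_flag FLAG [CPUINFO_FILE]
# Hypothetical helper: succeeds if FLAG appears in the first "flags" line
# of CPUINFO_FILE (default: /proc/cpuinfo, x86 Linux only).
has_cpu_flag() {
    flag="$1"
    cpuinfo="${2:-/proc/cpuinfo}"
    # Take the first "flags" line; fail if the file or the line is missing.
    flags=$(grep -m1 '^flags' "$cpuinfo" 2>/dev/null) || return 1
    # Normalize tabs to spaces and pad with spaces so that only whole
    # flag names match (e.g. "avx" must not match "avx2").
    case " $(printf '%s' "$flags" | tr '\t' ' ') " in
        *" $flag "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Report a few flags for the current host.
for f in avx2 avx512f; do
    if has_cpu_flag "$f"; then
        echo "$f: yes"
    else
        echo "$f: no"
    fi
done
```

Running this on both the build host and the test host shows whether the test host lacks an extension that the build host has.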
Looking at the coredump, I find that the illegal instruction occurs in the `cblas_sgemm` OpenBLAS function. This is a longstanding issue upstream and was fortunately fixed a few weeks ago (though no release containing the fix exists yet): OpenMathLib/OpenBLAS#2533
Rather than backporting the fix to the latest stable release (the fix relies on some other new functionality), updating to the 0.3.10 pre-release version and updating our static build scripts to make use of it fixes the issue in my local c5-build, c1-test setup.
Further, I add the `DYNAMIC_OLDER=1` flag to the OpenBLAS build, to support the dynamic architecture selection feature in OpenBLAS for older CPUs.
BTW, the reason this bug didn't cause any issues earlier on is that the CD used to run with gcc 4.8 on Ubuntu 14.04, and gcc 4.8 does not support AVX512 instructions. Since the toolchain update in #17984, gcc 7 is used on CD.
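For reference, the two OpenBLAS make options named in this description can be combined as in the sketch below. This is only an illustration, not the content of `tools/dependencies/openblas.sh`: the `build_openblas` wrapper, the `RUN` switch, and the source directory argument are hypothetical, and the real script passes additional options not shown here.

```shell
#!/bin/sh
# Make options named in this PR description.
# DYNAMIC_ARCH=1 builds kernels for multiple CPU types and selects one at runtime;
# DYNAMIC_OLDER=1 extends that set to older CPU types.
OPENBLAS_FLAGS="DYNAMIC_ARCH=1 DYNAMIC_OLDER=1"

# build_openblas SRC_DIR
# Hypothetical wrapper: prints the make command by default (dry run)
# and only executes it when RUN=1 is set in the environment.
build_openblas() {
    src_dir="$1"
    if [ "${RUN:-0}" = "1" ]; then
        # Intentionally unquoted so each flag becomes its own argument.
        make -C "$src_dir" $OPENBLAS_FLAGS
    else
        echo "make -C $src_dir $OPENBLAS_FLAGS"
    fi
}
```

Calling `build_openblas /path/to/OpenBLAS` prints the command for inspection; prefixing the call with `RUN=1` executes it against an OpenBLAS source checkout.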