This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Fix CD failure due to illegal instruction in OpenBLAS #18408

Merged (3 commits) on May 27, 2020


leezu
Contributor

@leezu leezu commented May 26, 2020

The first pipeline that fails with illegal instruction errors is http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/restricted-mxnet-cd%2Fmxnet-cd-release-job/detail/mxnet-cd-release-job/1145/pipeline/305
The last working one is http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/restricted-mxnet-cd%2Fmxnet-cd-release-job/detail/mxnet-cd-release-job/1142/pipeline/483

As 47b0bdd was merged between the two runs, one hypothesis is that the OpenBLAS build on CD includes instructions that are only available on the CPU architecture used for the build. That shouldn't happen, as OpenBLAS is built with DYNAMIC_ARCH=1 (https://github.com/apache/incubator-mxnet/blob/2219f1ad77b685d4e615fb8cd7f1992e9764ca7c/tools/dependencies/openblas.sh#L36), but it turns out there is an OpenBLAS bug that causes this.

I reproduced the issue locally by building libopenblas.so and libmxnet.so via the staticbuild script on a c5 instance, then changing the instance type to c1 and running the onnx unittests (which are shown as segfaulting in the CD log).
Looking at the coredump, I find that the illegal instruction occurs in sgemm_kernel_direct, called from the OpenBLAS function cblas_sgemm:

#0  raise (sig=<optimized out>) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  <signal handler called>
#2  0x00007f91d4252ecf in sgemm_kernel_direct () from /home/ubuntu/src/mxnet-master/python/mxnet/../../lib/libopenblas.so.0
#3  0x00007f91d2a6247c in cblas_sgemm () from /home/ubuntu/src/mxnet-master/python/mxnet/../../lib/libopenblas.so.0
#4  0x00007f91d6b5eaa0 in void linalg_batch_gemm<mshadow::cpu, float>(mshadow::Tensor<mshadow::cpu, 3, float> const&, mshadow::Tensor<mshadow::cpu, 3, float> const&, mshadow::Tensor<mshadow::cpu, 3, float> const&, float, float, bool, bool, mshadow::Stream<mshadow::cpu>*) () from /home/ubuntu/src/mxnet-master/python/mxnet/../../lib/libmxnet.so
#5  0x00007f91daacd2bc in void mxnet::op::LaOpGemmForward<mshadow::cpu, 2, 2, 2, 1, mxnet::op::gemm2>(nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&) ()
   from /home/ubuntu/src/mxnet-master/python/mxnet/../../lib/libmxnet.so
#6  0x00007f91d5d3bf4b in mxnet::exec::FComputeExecutor::Run(mxnet::RunContext, bool) () from /home/ubuntu/src/mxnet-master/python/mxnet/../../lib/libmxnet.so
#7  0x00007f91d5d498ed in ?? () from /home/ubuntu/src/mxnet-master/python/mxnet/../../lib/libmxnet.so
#8  0x00007f91d5d4998f in ?? () from /home/ubuntu/src/mxnet-master/python/mxnet/../../lib/libmxnet.so
#9  0x00007f91d5d2801c in mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext, mxnet::engine::OprBlock*) ()
   from /home/ubuntu/src/mxnet-master/python/mxnet/../../lib/libmxnet.so
#10 0x00007f91d5d288e7 in std::_Function_handler<void (std::shared_ptr<dmlc::ManualEvent>), mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#1}::operator()() const::{lambda(std::shared_ptr<dmlc::ManualEvent>)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptr<dmlc::ManualEvent>&&) ()
   from /home/ubuntu/src/mxnet-master/python/mxnet/../../lib/libmxnet.so
#11 0x00007f91d5d24eaa in std::thread::_State_impl<std::thread::_Invoker<std::tuple<std::function<void (std::shared_ptr<dmlc::ManualEvent>)>, std::shared_ptr<dmlc::ManualEvent> > > >::_M_run() () from /home/ubuntu/src/mxnet-master/python/mxnet/../../lib/libmxnet.so
#12 0x00007f91dc5ca1ff in ?? () from /home/ubuntu/src/mxnet-master/python/mxnet/../../lib/libmxnet.so
#13 0x00007f91f8bac6db in start_thread (arg=0x7f91ad7b6700) at pthread_create.c:463
#14 0x00007f91f813088f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
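
A quick way to check whether a test host (such as the c1 instance used above) advertises AVX512 support is to inspect the CPU feature flags. The following is a minimal sketch, assuming a Linux x86 host where /proc/cpuinfo contains a "flags" line; the helper names `cpu_flags` and `supports_avx512f` are hypothetical and not part of MXNet or OpenBLAS.

```python
def cpu_flags(cpuinfo_text):
    """Return the feature flags from the first 'flags' line of /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    # No 'flags' line found (e.g. non-x86 host or empty input).
    return set()


def supports_avx512f(cpuinfo_text):
    """True if the AVX512 foundation flag is advertised by the CPU."""
    return "avx512f" in cpu_flags(cpuinfo_text)


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            text = f.read()
    except OSError:
        # Assumption: on hosts without /proc/cpuinfo we simply report no support.
        text = ""
    print("avx512f supported:", supports_avx512f(text))
```

If this reports no support on the test host while the library was built on a host that does have it, an illegal instruction inside an AVX512 kernel is a plausible explanation for the crash.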

This is a longstanding issue upstream and was fortunately fixed a few weeks ago (though no release containing the fix exists yet): OpenMathLib/OpenBLAS#2533

Backporting the fix to the latest stable release is not practical, as the fix relies on some other new functionality. Instead, updating to a 0.3.10 pre-release version and updating our static build scripts to make use of it fixes the issue in my local c5-build, c1-test setup.

Further, I add the DYNAMIC_OLDER=1 flag to the OpenBLAS build, to support the dynamic architecture selection feature in OpenBLAS for older CPUs.
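
As an illustration of the two build flags mentioned here, the sketch below assembles a make command line. It is not the content of tools/dependencies/openblas.sh; the function name `openblas_make_args` is hypothetical, and the real script passes further options that are not shown.

```python
def openblas_make_args(dynamic_older=True):
    """Build an illustrative OpenBLAS make invocation as an argument list."""
    # DYNAMIC_ARCH=1 builds kernels for multiple CPU targets and selects one at runtime.
    args = ["make", "DYNAMIC_ARCH=1"]
    if dynamic_older:
        # DYNAMIC_OLDER=1 extends the set of runtime-selectable targets to older CPUs.
        args.append("DYNAMIC_OLDER=1")
    return args


print(" ".join(openblas_make_args()))
```

The list form can be passed directly to `subprocess.run` from within an OpenBLAS source checkout, which avoids shell quoting issues.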

BTW, the reason this bug didn't cause any issues earlier is that the CD used to run with gcc 4.8 on Ubuntu 14.04, and gcc 4.8 does not support AVX512 instructions. Since the toolchain update in #17984, gcc 7 is used on CD.

@leezu leezu requested a review from szha as a code owner May 26, 2020 19:34
@mxnet-bot

Hey @leezu , Thanks for submitting the PR
All tests are already queued to run once. If tests fail, you can trigger one or more tests again with the following commands:

  • To trigger all jobs: @mxnet-bot run ci [all]
  • To trigger specific jobs: @mxnet-bot run ci [job1, job2]

CI supported jobs: [windows-cpu, centos-gpu, miscellaneous, edge, unix-cpu, website, sanity, centos-cpu, windows-gpu, unix-gpu, clang]


Note:
Only the following 3 categories can trigger CI: PR Author, MXNet Committer, Jenkins Admin.
All CI tests must pass before the PR can be merged.

@leezu
Contributor Author

leezu commented May 27, 2020

@mxnet-bot run ci [unix-cpu]

@mxnet-bot

Jenkins CI successfully triggered : [unix-cpu]

@leezu
Contributor Author

leezu commented May 27, 2020

@mxnet-bot run ci [unix-cpu]

@mxnet-bot

Jenkins CI successfully triggered : [unix-cpu]

@leezu leezu mentioned this pull request May 27, 2020
@leezu leezu merged commit 382279e into apache:master May 27, 2020
@leezu leezu deleted the fixcd branch May 27, 2020 20:34
@ciyongch
Contributor

Hi @leezu , can you help to backport this PR to v1.7.x and v1.x as well? Thanks.

@leezu
Contributor Author

leezu commented Jun 1, 2020

@ciyongch I intend to first make the staticbuild script more maintainable, then backport the complete change. I'll CC you in that PR

@ciyongch
Contributor

ciyongch commented Jun 2, 2020

Sure, thanks @leezu. As we're going to tag the branch in the coming days if there are no other comments or pending issues, please help to arrange your time :)

@leezu
Contributor Author

leezu commented Jun 2, 2020

For this change, we may need to wait until OpenBLAS releases a stable version containing the fix. Currently we rely on a pre-release version. But this doesn't need to be a dependency for 1.7 release

@ciyongch
Contributor

ciyongch commented Jun 2, 2020

Ok @leezu, just want to confirm with you, the formal release of OpenBLAS is preferred for this fix, and we're not targeting to include this fix in the upcoming 1.7, right? Thanks!

@leezu
Contributor Author

leezu commented Jun 3, 2020

Yes, let's not keep this as a blocker. I'll monitor the OpenBLAS release situation and will open a PR if OpenBLAS releases in time.

AntiZpvoh pushed a commit to AntiZpvoh/incubator-mxnet that referenced this pull request Jul 6, 2020
* Update to OpenBlas 0.3.10 pre-release

Includes OpenMathLib/OpenBLAS#2527

* Enable support for older architectures in OpenBLAS dynamic architecture feature