This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

CentOS GPU tests failing in master #16951

Closed
larroy opened this issue Nov 30, 2019 · 4 comments

larroy (Contributor) commented Nov 30, 2019

Description

CentOS GPU tests are failing in master:

http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Fcentos-gpu/detail/master/1341/

I couldn't reproduce it on a p3 instance running Ubuntu 18.04. Trying on the CI AMI now.

It seems to be a problem in the base AMI; it can be reproduced by running the following commands:

time ci/build.py --docker-registry mxnetci --platform centos7_gpu --docker-build-retries 3 --shm-size 500m /work/runtime_functions.sh build_centos7_gpu
time ci/build.py --docker-registry mxnetci --nvidiadocker --platform centos7_gpu --docker-build-retries 3 --shm-size 500m /work/runtime_functions.sh unittest_centos7_gpu

Failure is:

[07:03:53] src/nnvm/legacy_json_util.cc:217: Symbol successfully upgraded!
terminate called after throwing an instance of 'dmlc::Error'
  what():  [07:03:59] /work/mxnet/3rdparty/mshadow/mshadow/./stream_gpu-inl.h:107: Check failed: err == CUBLAS_STATUS_SUCCESS (7 vs. 0) : Destory cublas handle failed
Stack trace:
  [bt] (0) /work/mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x2b) [0x7f0376aa865b]
  [bt] (1) /work/mxnet/python/mxnet/../../lib/libmxnet.so(void mshadow::DeleteStream<mshadow::gpu>(mshadow::Stream<mshadow::gpu>*)+0x227) [0x7f037aa308e7]
  [bt] (2) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mshadow::Stream<mshadow::gpu>* mshadow::NewStream<mshadow::gpu>(bool, bool, int)+0x244) [0x7f037aa30e14]
  [bt] (3) /work/mxnet/python/mxnet/../../lib/libmxnet.so(void mxnet::engine::ThreadedEnginePerDevice::GPUWorker<(dmlc::ConcurrentQueueType)0>(mxnet::Context, bool, mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0>*, std::shared_ptr<dmlc::ManualEvent> const&)+0x19f) [0x7f037aa513ef]
  [bt] (4) /work/mxnet/python/mxnet/../../lib/libmxnet.so(std::_Function_handler<void (std::shared_ptr<dmlc::ManualEvent>), mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#4}::operator()() const::{lambda(std::shared_ptr<dmlc::ManualEvent>)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptr<dmlc::ManualEvent>)+0x46) [0x7f037aa51626]
  [bt] (5) /work/mxnet/python/mxnet/../../lib/libmxnet.so(std::thread::_Impl<std::_Bind_simple<std::function<void (std::shared_ptr<dmlc::ManualEvent>)> (std::shared_ptr<dmlc::ManualEvent>)> >::_M_run()+0x44) [0x7f037aa3d1c4]
  [bt] (6) /usr/lib64/libstdc++.so.6(+0xb5070) [0x7f03e2478070]
  [bt] (7) /usr/lib64/libpthread.so.0(+0x7e65) [0x7f03f4f92e65]
  [bt] (8) /usr/lib64/libc.so.6(clone+0x6d) [0x7f03f45b288d]


/work/runtime_functions.sh: line 1312:     6 Aborted                 (core dumped) python3.6 -m "nose" $NOSE_COVERAGE_ARGUMENTS $NOSE_TIMER_ARGUMENTS --with-xunit --xunit-file nosetests_gpu.xml --verbose tests/python/gpu
2019-11-30 07:03:59,955 - root - INFO - Waiting for status of container ea33d765417a for 600 s.
2019-11-30 07:04:00,117 - root - INFO - Container exit status: {'StatusCode': 134, 'Error': None}
2019-11-30 07:04:00,117 - root - ERROR - Container exited with an error 😞
2019-11-30 07:04:00,117 - root - INFO - Executed command for reproduction:

ci/build.py --docker-registry mxnetci --nvidiadocker --platform centos7_gpu --docker-build-retries 3 --shm-size 500m /work/runtime_functions.sh unittest_centos7_gpu

2019-11-30 07:04:00,117 - root - INFO - Stopping container: ea33d765417a
2019-11-30 07:04:00,119 - root - INFO - Removing container: ea33d765417a
2019-11-30 07:04:00,140 - root - CRITICAL - Execution of ['/work/runtime_functions.sh', 'unittest_centos7_gpu'] failed with status: 134

A solution would be to update the AMI.
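
For context, the message comes from the cuBLAS handle teardown in mshadow's GPU stream, and the `(7 vs. 0)` in the check is the raw `cublasStatus_t` value. The failing pattern is roughly the following (simplified sketch, not the exact mshadow source):

```cpp
// Simplified sketch of the teardown check that produces the error above;
// names and structure are approximate, not the actual mshadow code.
#include <cublas_v2.h>
#include <cstdio>
#include <cstdlib>

void DestroyBlasHandleSketch(cublasHandle_t handle) {
  cublasStatus_t err = cublasDestroy(handle);
  if (err != CUBLAS_STATUS_SUCCESS) {
    // In mshadow this is a CHECK_EQ that throws dmlc::Error, which is what
    // terminates the GPU worker thread in the stack trace above.
    std::fprintf(stderr, "Destroy cublas handle failed, status %d\n",
                 static_cast<int>(err));
    std::abort();
  }
}
```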

larroy added the Bug label Nov 30, 2019
larroy (Contributor, Author) commented Nov 30, 2019

@mxnet-label-bot add [CI]

lanking520 added the CI label Nov 30, 2019
haojin2 (Contributor) commented Dec 1, 2019

For more info, I've made a change to print out the cuBLAS error's message:

terminate called after throwing an instance of 'dmlc::Error'

  what():  [05:07:32] /work/mxnet/include/mshadow/./stream_gpu-inl.h:125: Check failed: err == CUBLAS_STATUS_SUCCESS (7 vs. 0) : Destory cublas handle failed with error CUBLAS_STATUS_INVALID_VALUE

Stack trace:

  [bt] (0) build/tests/mxnet_unit_tests(dmlc::LogMessageFatal::~LogMessageFatal()+0x32) [0x194a5f2]

  [bt] (1) build/tests/mxnet_unit_tests(mshadow::Stream<mshadow::gpu>::DestroyBlasHandle()+0x14f) [0x1985b2f]

  [bt] (2) build/tests/mxnet_unit_tests(void mshadow::DeleteStream<mshadow::gpu>(mshadow::Stream<mshadow::gpu>*)+0xb7) [0x1986617]

  [bt] (3) build/tests/mxnet_unit_tests(mshadow::Stream<mshadow::gpu>* mshadow::NewStream<mshadow::gpu>(bool, bool, int)+0x30b) [0x1986c4b]

  [bt] (4) build/tests/mxnet_unit_tests(mxnet::test::op::GPUStreamScope::GPUStreamScope(mxnet::OpContext*)+0xfd) [0x198888d]

  [bt] (5) build/tests/mxnet_unit_tests(std::__shared_ptr<mxnet::test::op::CoreOpExecutor<float, float>, (__gnu_cxx::_Lock_policy)2>::__shared_ptr<std::allocator<mxnet::test::op::CoreOpExecutor<float, float> >, bool, std::vector<mxnet::TShape, std::allocator<mxnet::TShape> > >(std::_Sp_make_shared_tag, std::allocator<mxnet::test::op::CoreOpExecutor<float, float> > const&, bool&&, std::vector<mxnet::TShape, std::allocator<mxnet::TShape> >&&)+0x3c7) [0x19a1e57]

  [bt] (6) build/tests/mxnet_unit_tests(mxnet::test::OperatorRunner<mxnet::test::op::CoreOpProp, mxnet::test::op::CoreOpExecutor<float, float> >::RunGenericOperatorForward(bool, std::vector<mxnet::TShape, std::allocator<mxnet::TShape> > const&, std::vector<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > const&, unsigned long)+0xb6) [0x19a8546]

  [bt] (7) build/tests/mxnet_unit_tests(ACTIVATION_PERF_ExecuteBidirectional_Test::TestBody()+0x74e) [0x197ebbe]

  [bt] (8) build/tests/mxnet_unit_tests(void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*)+0x43) [0x1ab60d3]

The error type is CUBLAS_STATUS_INVALID_VALUE.
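
Since older CUDA toolkits don't ship a status-to-string helper for cuBLAS, surfacing the name typically needs a small switch along these lines (illustrative sketch only, not the actual change):

```cpp
// Illustrative helper that maps a raw cublasStatus_t to its enum name;
// the actual change in mshadow may differ.
#include <cublas_v2.h>

const char* CublasStatusString(cublasStatus_t status) {
  switch (status) {
    case CUBLAS_STATUS_SUCCESS:          return "CUBLAS_STATUS_SUCCESS";
    case CUBLAS_STATUS_NOT_INITIALIZED:  return "CUBLAS_STATUS_NOT_INITIALIZED";
    case CUBLAS_STATUS_ALLOC_FAILED:     return "CUBLAS_STATUS_ALLOC_FAILED";
    case CUBLAS_STATUS_INVALID_VALUE:    return "CUBLAS_STATUS_INVALID_VALUE";  // status 7, seen here
    case CUBLAS_STATUS_ARCH_MISMATCH:    return "CUBLAS_STATUS_ARCH_MISMATCH";
    case CUBLAS_STATUS_MAPPING_ERROR:    return "CUBLAS_STATUS_MAPPING_ERROR";
    case CUBLAS_STATUS_EXECUTION_FAILED: return "CUBLAS_STATUS_EXECUTION_FAILED";
    case CUBLAS_STATUS_INTERNAL_ERROR:   return "CUBLAS_STATUS_INTERNAL_ERROR";
    case CUBLAS_STATUS_NOT_SUPPORTED:    return "CUBLAS_STATUS_NOT_SUPPORTED";
    default:                             return "unknown cuBLAS status";
  }
}
```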

larroy (Contributor, Author) commented Dec 1, 2019

I don't think the error message says much by itself. I believe the issue is that the driver inside the Docker image causes problems; I've seen NVIDIA engineers acknowledge such an issue. In one of my PRs the failure goes away, but some jobs require the CUDA libraries inside the container.
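
One way to sanity-check that theory is to compare the CUDA driver and runtime versions visible inside the container; a minimal probe (illustration only, not part of the CI scripts) could look like:

```cpp
// Minimal probe comparing the CUDA driver and runtime versions the binary
// actually sees inside the container. A driver older than the runtime would
// be consistent with the kind of driver/library skew described above.
#include <cuda_runtime.h>
#include <cstdio>

int main() {
  int driver = 0, runtime = 0;
  cudaDriverGetVersion(&driver);    // driver API version exposed to the container
  cudaRuntimeGetVersion(&runtime);  // CUDA runtime version the binary was built against
  std::printf("driver API version: %d, runtime version: %d\n", driver, runtime);
  return driver < runtime ? 1 : 0;  // non-zero if the runtime is newer than the driver
}
```

Running this inside the container started by ci/build.py would show whether the driver the image exposes matches the runtime the tests were built against.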

larroy (Contributor, Author) commented Dec 6, 2019

Fixed by #16968

larroy closed this as completed Dec 6, 2019