This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

CentOS GPU tests failing in master #16951

Closed
larroy opened this issue Nov 30, 2019 · 4 comments

larroy (Contributor) commented Nov 30, 2019

Description

CentOS GPU tests are failing in master:

http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Fcentos-gpu/detail/master/1341/

I couldn't reproduce it on a p3 instance running Ubuntu 18.04. Trying on the CI AMI now.

It seems to be a problem in the base AMI; it can be reproduced by running the following commands:

time ci/build.py --docker-registry mxnetci --platform centos7_gpu --docker-build-retries 3 --shm-size 500m /work/runtime_functions.sh build_centos7_gpu
time ci/build.py --docker-registry mxnetci --nvidiadocker --platform centos7_gpu --docker-build-retries 3 --shm-size 500m /work/runtime_functions.sh unittest_centos7_gpu

Failure is:

[07:03:53] src/nnvm/legacy_json_util.cc:217: Symbol successfully upgraded!
terminate called after throwing an instance of 'dmlc::Error'
  what():  [07:03:59] /work/mxnet/3rdparty/mshadow/mshadow/./stream_gpu-inl.h:107: Check failed: err == CUBLAS_STATUS_SUCCESS (7 vs. 0) : Destory cublas handle failed
Stack trace:
  [bt] (0) /work/mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x2b) [0x7f0376aa865b]
  [bt] (1) /work/mxnet/python/mxnet/../../lib/libmxnet.so(void mshadow::DeleteStream<mshadow::gpu>(mshadow::Stream<mshadow::gpu>*)+0x227) [0x7f037aa308e7]
  [bt] (2) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mshadow::Stream<mshadow::gpu>* mshadow::NewStream<mshadow::gpu>(bool, bool, int)+0x244) [0x7f037aa30e14]
  [bt] (3) /work/mxnet/python/mxnet/../../lib/libmxnet.so(void mxnet::engine::ThreadedEnginePerDevice::GPUWorker<(dmlc::ConcurrentQueueType)0>(mxnet::Context, bool, mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0>*, std::shared_ptr<dmlc::ManualEvent> const&)+0x19f) [0x7f037aa513ef]
  [bt] (4) /work/mxnet/python/mxnet/../../lib/libmxnet.so(std::_Function_handler<void (std::shared_ptr<dmlc::ManualEvent>), mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#4}::operator()() const::{lambda(std::shared_ptr<dmlc::ManualEvent>)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptr<dmlc::ManualEvent>)+0x46) [0x7f037aa51626]
  [bt] (5) /work/mxnet/python/mxnet/../../lib/libmxnet.so(std::thread::_Impl<std::_Bind_simple<std::function<void (std::shared_ptr<dmlc::ManualEvent>)> (std::shared_ptr<dmlc::ManualEvent>)> >::_M_run()+0x44) [0x7f037aa3d1c4]
  [bt] (6) /usr/lib64/libstdc++.so.6(+0xb5070) [0x7f03e2478070]
  [bt] (7) /usr/lib64/libpthread.so.0(+0x7e65) [0x7f03f4f92e65]
  [bt] (8) /usr/lib64/libc.so.6(clone+0x6d) [0x7f03f45b288d]


/work/runtime_functions.sh: line 1312:     6 Aborted                 (core dumped) python3.6 -m "nose" $NOSE_COVERAGE_ARGUMENTS $NOSE_TIMER_ARGUMENTS --with-xunit --xunit-file nosetests_gpu.xml --verbose tests/python/gpu
2019-11-30 07:03:59,955 - root - INFO - Waiting for status of container ea33d765417a for 600 s.
2019-11-30 07:04:00,117 - root - INFO - Container exit status: {'StatusCode': 134, 'Error': None}
2019-11-30 07:04:00,117 - root - ERROR - Container exited with an error 😞
2019-11-30 07:04:00,117 - root - INFO - Executed command for reproduction:

ci/build.py --docker-registry mxnetci --nvidiadocker --platform centos7_gpu --docker-build-retries 3 --shm-size 500m /work/runtime_functions.sh unittest_centos7_gpu

2019-11-30 07:04:00,117 - root - INFO - Stopping container: ea33d765417a
2019-11-30 07:04:00,119 - root - INFO - Removing container: ea33d765417a
2019-11-30 07:04:00,140 - root - CRITICAL - Execution of ['/work/runtime_functions.sh', 'unittest_centos7_gpu'] failed with status: 134

A solution would be to update the AMI.
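
For context, the message comes from the cuBLAS handle teardown in mshadow's GPU stream, and the `(7 vs. 0)` in the check is the raw `cublasStatus_t` value. The failing pattern is roughly the following (simplified sketch, not the exact mshadow source):

```cpp
// Simplified sketch of the teardown check that produces the error above;
// names and structure are approximate, not the actual mshadow code.
#include <cublas_v2.h>
#include <cstdio>
#include <cstdlib>

void DestroyBlasHandleSketch(cublasHandle_t handle) {
  cublasStatus_t err = cublasDestroy(handle);
  if (err != CUBLAS_STATUS_SUCCESS) {
    // In mshadow this is a CHECK_EQ that throws dmlc::Error, which is what
    // terminates the GPU worker thread in the stack trace above.
    std::fprintf(stderr, "Destroy cublas handle failed, status %d\n",
                 static_cast<int>(err));
    std::abort();
  }
}
```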

larroy added the Bug label Nov 30, 2019
larroy (Contributor, Author) commented Nov 30, 2019

@mxnet-label-bot add [CI]

lanking520 added the CI label Nov 30, 2019
haojin2 (Contributor) commented Dec 1, 2019

For more info, I've made a change to print out the cuBLAS error's message:

terminate called after throwing an instance of 'dmlc::Error'

  what():  [05:07:32] /work/mxnet/include/mshadow/./stream_gpu-inl.h:125: Check failed: err == CUBLAS_STATUS_SUCCESS (7 vs. 0) : Destory cublas handle failed with error CUBLAS_STATUS_INVALID_VALUE

Stack trace:

  [bt] (0) build/tests/mxnet_unit_tests(dmlc::LogMessageFatal::~LogMessageFatal()+0x32) [0x194a5f2]

  [bt] (1) build/tests/mxnet_unit_tests(mshadow::Stream<mshadow::gpu>::DestroyBlasHandle()+0x14f) [0x1985b2f]

  [bt] (2) build/tests/mxnet_unit_tests(void mshadow::DeleteStream<mshadow::gpu>(mshadow::Stream<mshadow::gpu>*)+0xb7) [0x1986617]

  [bt] (3) build/tests/mxnet_unit_tests(mshadow::Stream<mshadow::gpu>* mshadow::NewStream<mshadow::gpu>(bool, bool, int)+0x30b) [0x1986c4b]

  [bt] (4) build/tests/mxnet_unit_tests(mxnet::test::op::GPUStreamScope::GPUStreamScope(mxnet::OpContext*)+0xfd) [0x198888d]

  [bt] (5) build/tests/mxnet_unit_tests(std::__shared_ptr<mxnet::test::op::CoreOpExecutor<float, float>, (__gnu_cxx::_Lock_policy)2>::__shared_ptr<std::allocator<mxnet::test::op::CoreOpExecutor<float, float> >, bool, std::vector<mxnet::TShape, std::allocator<mxnet::TShape> > >(std::_Sp_make_shared_tag, std::allocator<mxnet::test::op::CoreOpExecutor<float, float> > const&, bool&&, std::vector<mxnet::TShape, std::allocator<mxnet::TShape> >&&)+0x3c7) [0x19a1e57]

  [bt] (6) build/tests/mxnet_unit_tests(mxnet::test::OperatorRunner<mxnet::test::op::CoreOpProp, mxnet::test::op::CoreOpExecutor<float, float> >::RunGenericOperatorForward(bool, std::vector<mxnet::TShape, std::allocator<mxnet::TShape> > const&, std::vector<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > const&, unsigned long)+0xb6) [0x19a8546]

  [bt] (7) build/tests/mxnet_unit_tests(ACTIVATION_PERF_ExecuteBidirectional_Test::TestBody()+0x74e) [0x197ebbe]

  [bt] (8) build/tests/mxnet_unit_tests(void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*)+0x43) [0x1ab60d3]

The error type is CUBLAS_STATUS_INVALID_VALUE.
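
Since older CUDA toolkits don't ship a status-to-string helper for cuBLAS, surfacing the name typically needs a small switch along these lines (illustrative sketch only, not the actual change):

```cpp
// Illustrative helper that maps a raw cublasStatus_t to its enum name;
// the actual change in mshadow may differ.
#include <cublas_v2.h>

const char* CublasStatusString(cublasStatus_t status) {
  switch (status) {
    case CUBLAS_STATUS_SUCCESS:          return "CUBLAS_STATUS_SUCCESS";
    case CUBLAS_STATUS_NOT_INITIALIZED:  return "CUBLAS_STATUS_NOT_INITIALIZED";
    case CUBLAS_STATUS_ALLOC_FAILED:     return "CUBLAS_STATUS_ALLOC_FAILED";
    case CUBLAS_STATUS_INVALID_VALUE:    return "CUBLAS_STATUS_INVALID_VALUE";  // status 7, seen here
    case CUBLAS_STATUS_ARCH_MISMATCH:    return "CUBLAS_STATUS_ARCH_MISMATCH";
    case CUBLAS_STATUS_MAPPING_ERROR:    return "CUBLAS_STATUS_MAPPING_ERROR";
    case CUBLAS_STATUS_EXECUTION_FAILED: return "CUBLAS_STATUS_EXECUTION_FAILED";
    case CUBLAS_STATUS_INTERNAL_ERROR:   return "CUBLAS_STATUS_INTERNAL_ERROR";
    case CUBLAS_STATUS_NOT_SUPPORTED:    return "CUBLAS_STATUS_NOT_SUPPORTED";
    default:                             return "unknown cuBLAS status";
  }
}
```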

larroy (Contributor, Author) commented Dec 1, 2019

I don't think the error message says much by itself. I believe the issue is that the driver inside the Docker image causes problems; I've seen NVIDIA engineers acknowledge such an issue. In one of my PRs the failure goes away, but some jobs require the CUDA libraries inside the container.
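
One way to sanity-check that theory is to compare the CUDA driver and runtime versions visible inside the container; a minimal probe (illustration only, not part of the CI scripts) could look like:

```cpp
// Minimal probe comparing the CUDA driver and runtime versions the binary
// actually sees inside the container. A driver older than the runtime would
// be consistent with the kind of driver/library skew described above.
#include <cuda_runtime.h>
#include <cstdio>

int main() {
  int driver = 0, runtime = 0;
  cudaDriverGetVersion(&driver);    // driver API version exposed to the container
  cudaRuntimeGetVersion(&runtime);  // CUDA runtime version the binary was built against
  std::printf("driver API version: %d, runtime version: %d\n", driver, runtime);
  return driver < runtime ? 1 : 0;  // non-zero if the runtime is newer than the driver
}
```

Running this inside the container started by ci/build.py would show whether the driver the image exposes matches the runtime the tests were built against.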

larroy (Contributor, Author) commented Dec 6, 2019

Fixed by #16968

larroy closed this as completed Dec 6, 2019