
Unit test test_CompareTwoNets and test_CompareSparse failed on NVIDIA DRIVE PX2. #2304

Closed

Xreki (Contributor) opened this issue May 27, 2017 · 0 comments

Xreki commented May 27, 2017

I built Paddle on an NVIDIA DRIVE PX2 with WITH_GPU=ON, and most of the unit tests passed.
test_CompareTwoNets and test_CompareSparse failed with the same problem.

test_CompareTwoNets:

57: I0527 00:48:58.920536 18051 GradientMachine.cpp:92] Init parameters done.
57: I0527 00:48:59.326797 18051 test_CompareTwoNets.cpp:175] 
57: 
57: forwardBackward of the Network B is finished
57: 
57: I0527 00:48:59.327240 18051 test_CompareTwoNets.cpp:120] 
57: -------------------------------- Check Network Output_0: -------------------------------------
57: I0527 00:48:59.327301 18051 test_CompareTwoNets.cpp:104] maxValue=1.14861 maxDiff=0
57: 
57: I0527 00:48:59.327356 18051 test_CompareTwoNets.cpp:120] 
57: -------------------------------- Check Network Output_1: -------------------------------------
57: I0527 00:48:59.327606 18051 test_CompareTwoNets.cpp:104] maxValue=0.83593 maxDiff=0
57: 
57: I0527 00:48:59.327639 18051 test_CompareTwoNets.cpp:134] 
57: 
57: -------------------------------- Check Gradient Machine Parameters: -------------------------------------
57: F0527 00:48:59.327821 18051 Allocator.h:51] Check failed: posix_memalign(&ptr, 32ul, size) == 0 (12 vs. 0) 
57: *** Check failure stack trace: ***
57:     @           0x9bcf28  google::LogMessage::Fail()
57:     @           0x9be88c  google::LogMessage::SendToLog()
57:     @           0x9bca24  google::LogMessage::Flush()
57:     @           0x9c01ec  google::LogMessageFatal::~LogMessageFatal()
57:     @           0x808518  paddle::CpuAllocator::alloc()
57:     @           0x8073a8  paddle::PoolAllocator::alloc()
57:     @           0x800fc0  paddle::CpuMemoryHandle::CpuMemoryHandle()
57:     @           0x7f9c94  paddle::CpuVectorT<>::CpuVectorT()
57:     @           0x5cca7c  compareGradient()
57:     @           0x5cdfe8  Trainer_create_Test::TestBody()
57:     @           0xad4cf4  testing::internal::HandleExceptionsInMethodIfSupported<>()
57:     @           0xacb3ec  testing::Test::Run()
57:     @           0xacb528  testing::TestInfo::Run()
57:     @           0xacb634  testing::TestCase::Run()
57:     @           0xacd888  testing::internal::UnitTestImpl::RunAllTests()
57:     @           0xacdbb8  testing::UnitTest::Run()
57:     @           0x5b81b4  main
57:     @       0x7f8e8448a0  __libc_start_main
57: /home/ubuntu/liuyiqun01/Paddle/paddle/.set_python_path.sh: line 42: 18051 Aborted                 (core dumped) $@
1/1 Test #57: test_CompareTwoNets ..............***Failed   52.92 sec

test_CompareSparse:

59: [==========] Running 5 tests from 1 test case.
59: [----------] Global test environment set-up.
59: [----------] 5 tests from compareSparse
59: [ RUN      ] compareSparse.cpu
59: I0527 00:51:39.578558 18880 test_CompareSparse.cpp:56]  useGpu=0 trainerCount=1 configFile=trainer/tests/sample_trainer_config_qb_rnn.conf sparseUpdate=1
59: I0527 00:51:40.245077 18880 Trainer.cpp:114] ignore sparse_remote_update=true due to  --local=true
59: I0527 00:51:40.245674 18880 Trainer.cpp:162] trainer mode: SgdSparseCpuTraining
59: I0527 00:51:42.028664 18880 ProtoDataProvider.cpp:55] load data file trainer/tests/data_bin_part
59: I0527 00:51:42.037497 18880 ProtoDataProvider.cpp:70] read done, num of instance=1000
59: I0527 00:51:42.037689 18880 ProtoDataProvider.cpp:367] slot0:avgNNZ=6.678; slot1:avgNNZ=5.47; slot2:avgNNZ=15.924; slot3:avgNNZ=12.808; slot4:avgNNZ=6.713; slot5:avgNNZ=5.489; slot6:avgNNZ=16.915; slot7:avgNNZ=13.482; 
59: I0527 00:51:42.038173 18880 GradientMachine.cpp:85] Initing parameters..
59: I0527 00:52:03.085042 18880 GradientMachine.cpp:92] Init parameters done.
59: ..........I0527 00:52:08.090608 18880 CostLayer.cpp:337] calc pos/neg: 1.12314 pos= 529 neg= 471
59: I0527 00:52:08.090728 18880 TrainerInternal.cpp:181]  Pass=0 Batch=10 samples=1000 AvgCost=0.859857 Eval: 
59: I0527 00:52:08.094395 18880 GradientMachine.cpp:63] Saving parameters to ./output/model/pass-00000
59: I0527 00:52:13.917371 18880 test_CompareSparse.cpp:56]  useGpu=0 trainerCount=1 configFile=trainer/tests/sample_trainer_config_qb_rnn.conf sparseUpdate=0
59: I0527 00:52:13.950443 18880 Trainer.cpp:114] ignore sparse_remote_update=true due to  --local=true
59: I0527 00:52:13.950489 18880 Trainer.cpp:165] trainer mode: Normal
59: I0527 00:52:16.702788 18880 ProtoDataProvider.cpp:55] load data file trainer/tests/data_bin_part
59: I0527 00:52:16.708703 18880 ProtoDataProvider.cpp:70] read done, num of instance=1000
59: I0527 00:52:16.708886 18880 ProtoDataProvider.cpp:367] slot0:avgNNZ=6.678; slot1:avgNNZ=5.47; slot2:avgNNZ=15.924; slot3:avgNNZ=12.808; slot4:avgNNZ=6.713; slot5:avgNNZ=5.489; slot6:avgNNZ=16.915; slot7:avgNNZ=13.482; 
59: I0527 00:52:16.709239 18880 GradientMachine.cpp:85] Initing parameters..
59: I0527 00:52:37.756114 18880 GradientMachine.cpp:92] Init parameters done.
59: ..........I0527 00:52:44.449095 18880 CostLayer.cpp:337] calc pos/neg: 1.12314 pos= 529 neg= 471
59: I0527 00:52:44.449199 18880 TrainerInternal.cpp:181]  Pass=0 Batch=10 samples=1000 AvgCost=0.859857 Eval: 
59: I0527 00:52:44.449470 18880 GradientMachine.cpp:63] Saving parameters to ./output/model/pass-00000
59: I0527 00:52:51.500138 18880 test_CompareSparse.cpp:115] 
59: 
59: -------------------------------- Check Gradient Machine Parameters: -------------------------------------
59: F0527 00:52:51.509215 18880 Allocator.h:51] Check failed: posix_memalign(&ptr, 32ul, size) == 0 (12 vs. 0) 
59: *** Check failure stack trace: ***
59:     @           0x9fa038  google::LogMessage::Fail()
59:     @           0x9fb99c  google::LogMessage::SendToLog()
59:     @           0x9f9b34  google::LogMessage::Flush()
59:     @           0x9fd2fc  google::LogMessageFatal::~LogMessageFatal()
59:     @           0x8433c8  paddle::CpuAllocator::alloc()
59:     @           0x842258  paddle::PoolAllocator::alloc()
59:     @           0x83be70  paddle::CpuMemoryHandle::CpuMemoryHandle()
59:     @           0x834b44  paddle::CpuVectorT<>::CpuVectorT()
59:     @           0x5edea4  compareValue()
59:     @           0x5ef1ec  compareSparse_cpu_Test::TestBody()
59:     @           0xb11ad4  testing::internal::HandleExceptionsInMethodIfSupported<>()
59:     @           0xb081cc  testing::Test::Run()
59:     @           0xb08308  testing::TestInfo::Run()
59:     @           0xb08414  testing::TestCase::Run()
59:     @           0xb0a668  testing::internal::UnitTestImpl::RunAllTests()
59:     @           0xb0a998  testing::UnitTest::Run()
59:     @           0x5db0b4  main
59:     @       0x7f7d8e88a0  __libc_start_main
59: ./.common_test_util.sh: line 72: 18880 Aborted                 (core dumped) $cmd --$port_type=$port
59: /home/ubuntu/liuyiqun01/Paddle/build/paddle/trainer/tests/test_CompareSparse run wrong
1/1 Test #59: test_CompareSparse ...............***Failed   76.19 sec

When running test_CompareTwoNets, I tracked the memory usage with the top command and suspect the failure is caused by memory exhaustion: posix_memalign returns error code 12 (ENOMEM).

(screenshot of top output attached)
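To confirm the exhaustion hypothesis without watching top interactively, one could sample free memory while the test runs. A rough sketch, assuming a Linux system with procps `free` and the test binary at the path shown (adjust to your build tree):

```shell
# Launch the failing test in the background (hypothetical path).
./build/paddle/trainer/tests/test_CompareTwoNets &
pid=$!

# Print used/available memory once per second until the test exits,
# so the ENOMEM abort can be correlated with the memory level.
while kill -0 "$pid" 2>/dev/null; do
  free -m | awk 'NR==2 {printf "used: %s MiB, available: %s MiB\n", $3, $7}'
  sleep 1
done
```

If the available column drops toward zero just before the abort, that supports the out-of-memory diagnosis (the DRIVE PX2 shares RAM between CPU and GPU, so headroom is limited).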

Xreki closed this as completed Mar 21, 2018