Bug in GreedyLB (assertion failure) #958

Closed
lifflander opened this issue Jul 29, 2020 · 4 comments

Comments

@lifflander
Collaborator

Describe the bug
I've reproduced this in two contexts: in the Docker container (GNU gcc-7, debug) and on my Mac (clang-5).

Run this to reproduce:

ctest -I 157,158 --repeat-until-fail 1000 --output-on-failure .

Assertion that breaks:

vt: [0] lb: LBManager::releaseNow: finished LB, phase=3, invocations=1
vt: [0] lb: BaseLB: Statistic=P_l:  max=5.10, min=4.55, sum=19.24, avg=4.81, var=0.04, stdev=0.20, nproc=4, cardinality=4 skewness=0.17, kurtosis=-1.87, npr=4, imb=0.06, num_stats=1
vt: [0] lb: BaseLB: Statistic=O_l:  max=0.001, min=0.000, sum=0.02, avg=0.000, var=0.000, stdev=0.000, nproc=64, cardinality=64 skewness=0.02, kurtosis=-1.25, npr=64, imb=1.06, num_stats=2
vt: [0] lb: loadStats: load=4.55, total=19.24, avg=4.81, I=0.06,should_lb=true, auto=true, threshold=0.9390901317338556
vt: [1] ------------------------------------------------------------------------------------------------------------------------
vt: [1] ------------------------------------------- Runtime Error: System Aborting! --------------------------------------------
vt: [1] ------------------------------------------------ Fatal Error on Node 1 -------------------------------------------------
vt: [1] ------------------------------------------------------------------------------------------------------------------------
vt: [1]
vt: [1]              Reason: Must have object
vt: [1]    Assertion failed: (theProcStats()->hasObjectToMigrate(obj_id))
vt: [1]                Node: 1
vt: [1]           Num Nodes: 4
vt: [1]                File: /vt/src/vt/vrt/collection/balance/baselb/baselb.cc
vt: [1]                Line: 230
vt: [1]            Function: transferMigrations
vt: [1]                Code: 1
vt: [1]           Build SHA: 181e188d3fca91bab0a2d0efc765d8366031e5da
vt: [1]           Build Ref: refs/heads/develop
vt: [1]         Description: heads/develop-0-g181e188d3f
vt: [1]            GIT Repo: *dirty*
vt: [1]            Hostname: 41fe2b81da16
vt: [1]
vt: [2] ------------------------------------------------------------------------------------------------------------------------
vt: [2] ------------------------------------------- Runtime Error: System Aborting! --------------------------------------------
vt: [2] ------------------------------------------------ Fatal Error on Node 2 -------------------------------------------------
vt: [2] ------------------------------------------------------------------------------------------------------------------------
vt: [2]
vt: [2]              Reason: Must have object
vt: [2]    Assertion failed: (theProcStats()->hasObjectToMigrate(obj_id))
vt: [2]                Node: 2
vt: [2]           Num Nodes: 4
vt: [2]                File: /vt/src/vt/vrt/collection/balance/baselb/baselb.cc
vt: [2]                Line: 230
vt: [2]            Function: transferMigrations
vt: [2]                Code: 1
vt: [2]           Build SHA: 181e188d3fca91bab0a2d0efc765d8366031e5da
vt: [2]           Build Ref: refs/heads/develop
vt: [2]         Description: heads/develop-0-g181e188d3f
vt: [2]            GIT Repo: *dirty*
vt: [2]            Hostname: 41fe2b81da16
vt: [2]
vt: [3] ------------------------------------------------------------------------------------------------------------------------
vt: [3] ------------------------------------------- Runtime Error: System Aborting! --------------------------------------------
vt: [3] ------------------------------------------------ Fatal Error on Node 3 -------------------------------------------------
vt: [3] ------------------------------------------------------------------------------------------------------------------------
vt: [3]
vt: [3]              Reason: Must have object
vt: [3]    Assertion failed: (theProcStats()->hasObjectToMigrate(obj_id))
vt: [3]                Node: 3
vt: [3]           Num Nodes: 4
vt: [3]                File: /vt/src/vt/vrt/collection/balance/baselb/baselb.cc
vt: [3]                Line: 230
vt: [3]            Function: transferMigrations
vt: [3]                Code: 1
vt: [3]           Build SHA: 181e188d3fca91bab0a2d0efc765d8366031e5da
vt: [3]           Build Ref: refs/heads/develop
vt: [3]         Description: heads/develop-0-g181e188d3f
vt: [3]            GIT Repo: *dirty*
vt: [3]            Hostname: 41fe2b81da16
vt: [3]
vt: [3] ------------------------------------------------------------------------------------------------------------------------
vt: [3] -------------------------------------------- Dump Stack Backtrace on Node 3 --------------------------------------------
vt: [3] ------------------------------------------------------------------------------------------------------------------------
vt: [3] 0   18  0x55be2ff00548 vt::debug::stack::dumpStack[abi:cxx11](int) + 83
vt: [3] 1   18  0x55be2fb00c98 vt::runtime::Runtime::output(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, int, bool, bool, bool) + 1868
vt: [3] 2   18  0x55be2f99e3cf vt::CollectiveAnyOps<(vt::runtime::eRuntimeInstance)0>::output(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, int, bool, bool, bool, bool) + 209
vt: [3] 3   18  0x55be2f99d163 vt::output(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, int, bool, bool, bool, bool) + 143
vt: [3] 4   18  0x55be2f78b85f std::enable_if<std::tuple_size<std::tuple<> >::value==(0), void>::type vt::debug::assert::assertOut<>(bool, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int, std::tuple<>&&) + 359
vt: [3] 5   18  0x55be30102aae vt::vrt::collection::lb::BaseLB::transferMigrations(vt::vrt::collection::lb::TransferMsg<std::vector<std::tuple<unsigned long, short>, std::allocator<std::tuple<unsigned long, short> > > >*) + 682
vt: [3] 6   18  0x55be2fcfe1e6 vt::objgroup::dispatch::Dispatch<vt::vrt::collection::lb::BaseLB>::run(long, vt::messaging::BaseMsg*) + 920
vt: [3] 7   18  0x55be2fd137b6 vt::objgroup::ObjGroupManager::dispatch(vt::messaging::MsgSharedPtr<vt::messaging::ActiveMsg<vt::messaging::ActiveEnvelope> >, long) + 860
vt: [3] 8   18  0x55be2fd142c8 vt::objgroup::dispatchObjGroup(vt::messaging::MsgSharedPtr<vt::messaging::ActiveMsg<vt::messaging::ActiveEnvelope> >, long) + 150
vt: [3] 9   18  0x55be2f7fcb1f vt::runnable::Runnable<vt::messaging::ActiveMsg<vt::messaging::ActiveEnvelope> >::runObj(long, vt::messaging::ActiveMsg<vt::messaging::ActiveEnvelope>*, short) + 725
vt: [3] 10  18  0x55be2fd143ab ./test_lb_extended(+0x1d393ab) [0x55be2fd143ab] + 0
vt: [3] 11  18  0x55be2fd147bf ./test_lb_extended(+0x1d397bf) [0x55be2fd147bf] + 0
vt: [3] 12  18  0x55be2f79136b std::function<void ()>::operator()() const + 77
vt: [3] 13  18  0x55be2feaee2d vt::sched::PriorityUnit::execute() + 467
vt: [3] 14  18  0x55be2feaec4d vt::sched::PriorityUnit::operator()() + 33
vt: [3] 15  18  0x55be2fea803f vt::sched::Scheduler::runWorkUnit(vt::sched::PriorityUnit&) + 691
vt: [3] 16  18  0x55be2fea8a1e vt::sched::Scheduler::scheduler(bool) + 566
vt: [3] 17  18  0x55be2fea8f75 vt::sched::Scheduler::runSchedulerWhile(std::function<bool ()>) + 845
vt: [3] 18  18  0x55be2feaa06b vt::runSchedulerThrough(unsigned long) + 145
vt: [3] 19  18  0x55be2feaa4f1 vt::runInEpochCollective(std::function<void ()>&&) + 437
vt: [3] 20  18  0x55be2fcc379c void vt::vrt::collection::balance::LBManager::makeLB<vt::vrt::collection::lb::GreedyLB>(vt::messaging::MsgSharedPtr<vt::vrt::collection::balance::StartLBMsg>) + 702
vt: [3] 21  18  0x55be2fc98fd0 vt::vrt::collection::balance::LBManager::collectiveImpl(unsigned long, vt::vrt::collection::balance::LBType, bool, unsigned long) + 738
vt: [3] 22  18  0x55be2f86c742 void vt::vrt::collection::balance::LBManager::sysLB<vt::vrt::collection::balance::InvokeBaseMsg<vt::collective::reduce::operators::ReduceTMsg<char> > >(vt::vrt::collection::balance::InvokeBaseMsg<vt::collective::reduce::operators::ReduceTMsg<char> >*) + 214
vt: [3] 23  18  0x55be2fcfe7cc vt::objgroup::dispatch::Dispatch<vt::vrt::collection::balance::LBManager>::run(long, vt::messaging::BaseMsg*) + 920
vt: [3] 24  18  0x55be2fd137b6 vt::objgroup::ObjGroupManager::dispatch(vt::messaging::MsgSharedPtr<vt::messaging::ActiveMsg<vt::messaging::ActiveEnvelope> >, long) + 860
vt: [3] 25  18  0x55be2fd142c8 vt::objgroup::dispatchObjGroup(vt::messaging::MsgSharedPtr<vt::messaging::ActiveMsg<vt::messaging::ActiveEnvelope> >, long) + 150
vt: [3] 26  18  0x55be2f7fcb1f vt::runnable::Runnable<vt::messaging::ActiveMsg<vt::messaging::ActiveEnvelope> >::runObj(long, vt::messaging::ActiveMsg<vt::messaging::ActiveEnvelope>*, short) + 725
vt: [3] 27  18  0x55be2f7e8924 vt::runnable::Runnable<vt::messaging::ActiveMsg<vt::messaging::ActiveEnvelope> >::run(long, void (*)(vt::messaging::BaseMsg*), vt::messaging::ActiveMsg<vt::messaging::ActiveEnvelope>*, short, int) + 144
vt: [3] 28  18  0x55be2fd4e7b3 vt::messaging::ActiveMessenger::deliverActiveMsg(vt::messaging::MsgSharedPtr<vt::messaging::ActiveMsg<vt::messaging::ActiveEnvelope> > const&, short const&, bool, std::function<void ()>) + 1821
vt: [3] 29  18  0x55be2fd4dfa6 vt::messaging::ActiveMessenger::processActiveMsg(vt::messaging::MsgSharedPtr<vt::messaging::ActiveMsg<vt::messaging::ActiveEnvelope> > const&, short const&, int const&, bool, std::function<void ()>) + 476
vt: [3] 30  18  0x55be2fd4d853 ./test_lb_extended(+0x1d72853) [0x55be2fd4d853] + 0
vt: [3] 31  18  0x55be2fd51a52 ./test_lb_extended(+0x1d76a52) [0x55be2fd51a52] + 0
vt: [3] 32  18  0x55be2f79136b std::function<void ()>::operator()() const + 77
vt: [3] 33  18  0x55be2feaee2d vt::sched::PriorityUnit::execute() + 467
vt: [3] 34  18  0x55be2feaec4d vt::sched::PriorityUnit::operator()() + 33
vt: [3] 35  18  0x55be2fea803f vt::sched::Scheduler::runWorkUnit(vt::sched::PriorityUnit&) + 691
vt: [3] 36  18  0x55be2fea8a1e vt::sched::Scheduler::scheduler(bool) + 566
vt: [3] 37  18  0x55be2fea8f75 vt::sched::Scheduler::runSchedulerWhile(std::function<bool ()>) + 845
vt: [3] 38  18  0x55be2fc99abd vt::vrt::collection::balance::LBManager::waitLBCollective() + 181
vt: [3] 39  18  0x55be2fbe34f6 vt::vrt::collection::CollectionManager::startPhaseCollective(std::function<void ()>, unsigned long) + 196
vt: [3] 40  18  0x55be2f6a2ef4 ./test_lb_extended(+0x16c7ef4) [0x55be2f6a2ef4] + 0
vt: [3] 41  18  0x55be2f6a4054 ./test_lb_extended(+0x16c9054) [0x55be2f6a4054] + 0
vt: [3] 42  18  0x55be2f79136b std::function<void ()>::operator()() const + 77
vt: [3] 43  18  0x55be2feaa444 vt::runInEpochCollective(std::function<void ()>&&) + 264
vt: [3] 44  18  0x55be2f6a3232 vt::tests::unit::TestLoadBalancer_test_load_balancer_1_Test::TestBody() + 726
vt: [3] 45  18  0x55be2f90cefb void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) + 101
vt: [3] 46  18  0x55be2f906ef7 void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) + 90
vt: [3] 47  18  0x55be2f8e41d4 testing::Test::Run() + 238
vt: [3] 48  18  0x55be2f8e4b59 testing::TestInfo::Run() + 271
vt: [3] 49  18  0x55be2f8e524f testing::TestSuite::Run() + 297
vt: [3] 50  18  0x55be2f8f0c61 testing::internal::UnitTestImpl::RunAllTests() + 1029
vt: [3] 51  18  0x55be2f90dff3 bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) + 101
vt: [3] 52  18  0x55be2f907dd3 bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) + 90
vt: [3] 53  18  0x55be2f8ef55a testing::UnitTest::Run() + 192
vt: [3] 54  18  0x55be2f677c1e RUN_ALL_TESTS() + 35
vt: [3] 55  18  0x55be2f6769aa main + 109
vt: [3] 56  18  0x7fe93d5a6b97 __libc_start_main + 231
vt: [3] 57  18  0x55be2f6761aa _start + 42
vt: [3] ------------------------------------------------------------------------------------------------------------------------
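For readers unfamiliar with the failing check, here is a minimal sketch of the invariant it appears to enforce: every object named in a transfer list received by a node must be one that node currently holds and can migrate. This is not vt's implementation. `LocalObjects`, `findMissingObjects`, and `checkTransfers` are hypothetical names made up for illustration, and reading the `std::tuple<unsigned long, short>` in the backtrace as (object ID, destination node) is an assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <tuple>
#include <unordered_set>
#include <vector>

// Hypothetical stand-ins for illustration; these are not vt's real types.
using ObjId = std::uint64_t;  // assumed: first tuple element is an object ID
using NodeId = std::int16_t;  // assumed: second tuple element is a node ID
using Transfer = std::tuple<ObjId, NodeId>;

// Hypothetical model of the per-node record of objects that may be migrated.
struct LocalObjects {
  std::unordered_set<ObjId> migratable;

  bool hasObjectToMigrate(ObjId id) const {
    return migratable.count(id) != 0;
  }
};

// Returns the object IDs in `transfers` that this node does not hold.
// An empty result means the check would pass for every entry.
std::vector<ObjId> findMissingObjects(
  LocalObjects const& local, std::vector<Transfer> const& transfers
) {
  std::vector<ObjId> missing;
  for (auto const& t : transfers) {
    ObjId const id = std::get<0>(t);
    if (!local.hasObjectToMigrate(id)) {
      missing.push_back(id);
    }
  }
  return missing;
}

// Toy version of the check: aborts if any listed object is not held locally.
void checkTransfers(
  LocalObjects const& local, std::vector<Transfer> const& transfers
) {
  assert(findMissingObjects(local, transfers).empty() && "Must have object");
}
```

In this toy model, a transfer list that names an object the receiving node does not hold trips the assertion. Collecting the missing IDs first, as `findMissingObjects` does, is one way to log which objects were unexpected before aborting.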
@lifflander
Collaborator Author

This is causing test failures on develop regularly now.

@PhilMiller
Member

https://github.com/DARMA-tasking/vt/pull/1013/checks?check_run_id=1418863757 again, though the assertion output is different.

@lifflander
Collaborator Author

This is fixed. YAY
