You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
make storage change leader quickly and keep inserting edge with 1024 concurrent clients
then kill all the services with SIGTERM
Expected behavior
all nebula services exit gracefully, but we found two storages instance stuck and can only be killed with -9 signal.
Additional context
part of storage stack, the full process stack can be found here : 2927120.txt
Thread 123 (Thread 0x7f7fa91ff700 (LWP 2930639) "hazptr-tpe-0"):
#0 0x00007f7ff809fd19 in syscall () from /lib64/libc.so.6#1 0x000000000410cc01 in folly::detail::futexWaitImpl(std::atomic<unsigned int> const*, unsigned int, std::chrono::time_point<std::chrono::_V2::system_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > > const*, std::chrono::time_point<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > > const*, unsigned int) ()#2 0x0000000002a71508 in folly::detail::futexWaitImpl<std::atomic<unsigned int> const, std::chrono::time_point<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > > > (futex=0x7f7ff2de8488, expected=2, deadline=..., waitMask=4294967295) at /root/src/wwl/nebula/build/third-party/install/include/folly/detail/Futex-inl.h:81#3 0x0000000002a6cf6e in folly::detail::futexWaitUntil<std::atomic<unsigned int>, std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > > (futex=0x7f7ff2de8488, expected=2, deadline=..., waitMask=4294967295) at /root/src/wwl/nebula/build/third-party/install/include/folly/detail/Futex-inl.h:119#4 0x0000000002a6764d in folly::detail::MemoryIdler::futexWaitUntil<std::atomic<unsigned int>, std::chrono::time_point<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > >, std::chrono::duration<long, std::ratio<1l, 1000000000l> > > (fut=..., expected=2, deadline=..., waitMask=4294967295, idleTimeout=..., stackToRetain=1024, timeoutVariationFrac=0.5) at /root/src/wwl/nebula/build/third-party/install/include/folly/detail/MemoryIdler.h:164#5 0x0000000002aa86c9 in folly::SaturatingSemaphore<true, std::atomic>::tryWaitSlow<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > > (this=0x7f7ff2de8488, deadline=..., opt=...) at /root/src/wwl/nebula/build/third-party/install/include/folly/synchronization/SaturatingSemaphore.h:312#6 0x0000000002aa5d28 in folly::SaturatingSemaphore<true, std::atomic>::try_wait_until<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > > (opt=..., deadline=..., this=0x7f7ff2de8488) at /root/src/wwl/nebula/build/third-party/install/include/folly/synchronization/SaturatingSemaphore.h:173#7 folly::detail::LifoSemBase<folly::SaturatingSemaphore<true, std::atomic>, std::atomic>::try_wait_until<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > > (this=0x7f7fb18ee2c0, deadline=...) at /root/src/wwl/nebula/build/third-party/install/include/folly/synchronization/LifoSem.h:512#8 0x00000000046445e0 in folly::UnboundedBlockingQueue<folly::CPUThreadPoolExecutor::CPUTask>::try_take_for(std::chrono::duration<long, std::ratio<1l, 1000l> >) ()#9 0x000000000463e427 in folly::CPUThreadPoolExecutor::threadRun(std::shared_ptr<folly::ThreadPoolExecutor::Thread>) ()#10 0x0000000004586a19 in void folly::detail::function::FunctionTraits<void ()>::callBig<std::_Bind<void (folly::ThreadPoolExecutor::*(folly::ThreadPoolExecutor*, std::shared_ptr<folly::ThreadPoolExecutor::Thread>))(std::shared_ptr<folly::ThreadPoolExecutor::Thread>)> >(folly::detail::function::Data&) ()#11 0x0000000002a3e85c in folly::detail::function::FunctionTraits<void ()>::operator()() (this=0x7f7fe7561470) at /root/src/wwl/nebula/build/third-party/install/include/folly/Function.h:400#12 0x0000000002a4ea06 in folly::NamedThreadFactory::newThread(folly::Function<void ()>&&)::{lambda()#1}::operator()() (__closure=0x7f7fe7561470) at /root/src/wwl/nebula/build/third-party/install/include/folly/executors/thread_factory/NamedThreadFactory.h:40#13 0x0000000002a60400 in std::__invoke_impl<void, folly::NamedThreadFactory::newThread(folly::Function<void ()>&&)::{lambda()#1}>(std::__invoke_other, folly::NamedThreadFactory::newThread(folly::Function<void ()>&&)::{lambda()#1}&&) (__f=...) at /data/vesoft/toolset/gcc/7.5.0/include/c++/7.5.0/bits/invoke.h:60#14 0x0000000002a5994a in std::__invoke<folly::NamedThreadFactory::newThread(folly::Function<void ()>&&)::{lambda()#1}>(folly::NamedThreadFactory::newThread(folly::Function<void ()>&&)::{lambda()#1}&&) (__fn=...) at /data/vesoft/toolset/gcc/7.5.0/include/c++/7.5.0/bits/invoke.h:95#15 0x0000000002aa52c2 in std::thread::_Invoker<std::tuple<folly::NamedThreadFactory::newThread(folly::Function<void ()>&&)::{lambda()#1}> >::_M_invoke<0ul>(std::_Index_tuple<0ul>) (this=0x7f7fe7561470) at /data/vesoft/toolset/gcc/7.5.0/include/c++/7.5.0/thread:234#16 0x0000000002aa47b7 in std::thread::_Invoker<std::tuple<folly::NamedThreadFactory::newThread(folly::Function<void ()>&&)::{lambda()#1}> >::operator()() (this=0x7f7fe7561470) at /data/vesoft/toolset/gcc/7.5.0/include/c++/7.5.0/thread:243#17 0x0000000002aa1a54 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<folly::NamedThreadFactory::newThread(folly::Function<void ()>&&)::{lambda()#1}> > >::_M_run() (this=0x7f7fe7561460) at /data/vesoft/toolset/gcc/7.5.0/include/c++/7.5.0/thread:186#18 0x0000000004b4143f in execute_native_thread_routine ()#19 0x00007f7ff837cea5 in start_thread () from /lib64/libpthread.so.0#20 0x00007f7ff80a59fd in clone () from /lib64/libc.so.6
...
Thread 1 (Thread 0x7f7ff8ec49c0 (LWP 2927120) "nebula-storaged"):
#0 0x00007f7ff809fd19 in syscall () from /lib64/libc.so.6#1 0x000000000410cc01 in folly::detail::futexWaitImpl(std::atomic<unsigned int> const*, unsigned int, std::chrono::time_point<std::chrono::_V2::system_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > > const*, std::chrono::time_point<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > > const*, unsigned int) ()#2 0x0000000002a6cf52 in folly::detail::futexWaitUntil<std::atomic<unsigned int>, std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > > (futex=0x7f7ff7a53b20, expected=2, deadline=..., waitMask=4294967295) at /root/src/wwl/nebula/build/third-party/install/include/folly/detail/Futex-inl.h:119#3 0x0000000002a6764d in folly::detail::MemoryIdler::futexWaitUntil<std::atomic<unsigned int>, std::chrono::time_point<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > >, std::chrono::duration<long, std::ratio<1l, 1000000000l> > > (fut=..., expected=2, deadline=..., waitMask=4294967295, idleTimeout=..., stackToRetain=1024, timeoutVariationFrac=0.5) at /root/src/wwl/nebula/build/third-party/install/include/folly/detail/MemoryIdler.h:164#4 0x0000000002b2504c in folly::Baton<true, std::atomic>::tryWaitSlow<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > > (this=0x7f7ff7a53b20, deadline=..., opt=...) at /root/src/wwl/nebula/build/third-party/install/include/folly/synchronization/Baton.h:305#5 0x00000000043373eb in folly::DefaultKeepAliveExecutor::joinKeepAlive() ()#6 0x0000000004331559 in apache::thrift::concurrency::ThreadManager::Impl::stopImpl(bool) ()#7 0x00000000043376f9 in apache::thrift::concurrency::PriorityThreadManager::PriorityImpl::join() ()#8 0x0000000004239054 in apache::thrift::ThriftServer::~ThriftServer() ()#9 0x0000000004239481 in apache::thrift::ThriftServer::~ThriftServer() ()#10 0x0000000002a645f9 in std::default_delete<apache::thrift::ThriftServer>::operator() (this=0x7f7ff7a323c0, __ptr=0x7f7ff7a3e000) at /data/vesoft/toolset/gcc/7.5.0/include/c++/7.5.0/bits/unique_ptr.h:78#11 0x0000000002a66b55 in std::unique_ptr<apache::thrift::ThriftServer, std::default_delete<apache::thrift::ThriftServer> >::reset (this=0x7f7ff7a323c0, __p=0x7f7ff7a3e000) at /data/vesoft/toolset/gcc/7.5.0/include/c++/7.5.0/bits/unique_ptr.h:371#12 0x0000000003d9591d in nebula::raftex::RaftexService::waitUntilStop (this=0x7f7ff7a323b0) at /root/src/wwl/nebula/src/kvstore/raftex/RaftexService.cpp:141#13 0x0000000003cf738b in nebula::kvstore::NebulaStore::~NebulaStore (this=0x7f7ff7a57400, __in_chrg=<optimized out>) at /root/src/wwl/nebula/src/kvstore/NebulaStore.cpp:42#14 0x0000000003cf75e4 in nebula::kvstore::NebulaStore::~NebulaStore (this=0x7f7ff7a57400, __in_chrg=<optimized out>) at /root/src/wwl/nebula/src/kvstore/NebulaStore.cpp:50#15 0x0000000002a647b7 in std::default_delete<nebula::kvstore::KVStore>::operator() (this=0x7f7ff7a4f1f0, __ptr=0x7f7ff7a57400) at /data/vesoft/toolset/gcc/7.5.0/include/c++/7.5.0/bits/unique_ptr.h:78#16 0x0000000002a60073 in std::unique_ptr<nebula::kvstore::KVStore, std::default_delete<nebula::kvstore::KVStore> >::reset (this=0x7f7ff7a4f1f0, __p=0x7f7ff7a57400) at /data/vesoft/toolset/gcc/7.5.0/include/c++/7.5.0/bits/unique_ptr.h:371#17 0x0000000002a59326 in nebula::storage::StorageServer::stop (this=0x7f7ff7a4f180) at /root/src/wwl/nebula/src/storage/StorageServer.cpp:349#18 0x0000000002a3ded8 in signalHandler (sig=15) at /root/src/wwl/nebula/src/daemons/StorageDaemon.cpp:175#19 0x0000000002a3dd1c in <lambda(nebula::SignalHandler::GeneralSignalInfo*)>::operator()(nebula::SignalHandler::GeneralSignalInfo *) const (__closure=0x5d9b020 <nebula::SignalHandler::get()::instance+448>, info=0x7ffff8a15630) at /root/src/wwl/nebula/src/daemons/StorageDaemon.cpp:166#20 0x0000000002a45b2e in std::_Function_handler<void(nebula::SignalHandler::GeneralSignalInfo*), setupSignalHandler()::<lambda(nebula::SignalHandler::GeneralSignalInfo*)> >::_M_invoke(const std::_Any_data &, nebula::SignalHandler::GeneralSignalInfo *&&) (__functor=..., __args#0=@0x7ffff8a155f0: 0x7ffff8a15630) at /data/vesoft/toolset/gcc/7.5.0/include/c++/7.5.0/bits/std_function.h:316#21 0x0000000003c0b842 in std::function<void (nebula::SignalHandler::GeneralSignalInfo*)>::operator()(nebula::SignalHandler::GeneralSignalInfo*) const (this=0x5d9b020 <nebula::SignalHandler::get()::instance+448>, __args#0=0x7ffff8a15630) at /data/vesoft/toolset/gcc/7.5.0/include/c++/7.5.0/bits/std_function.h:706#22 0x0000000003c0b41d in nebula::SignalHandler::handleGeneralSignal (this=0x5d9ae60 <nebula::SignalHandler::get()::instance>, sig=15, info=0x7ffff8a157f0) at /root/src/wwl/nebula/src/common/base/SignalHandler.cpp:107#23 0x0000000003c0b3b9 in nebula::SignalHandler::doHandle (this=0x5d9ae60 <nebula::SignalHandler::get()::instance>, sig=15, info=0x7ffff8a157f0, uctx=0x7ffff8a156c0) at /root/src/wwl/nebula/src/common/base/SignalHandler.cpp:101#24 0x0000000003c0b337 in nebula::SignalHandler::handlerHook (sig=15, info=0x7ffff8a157f0, uctx=0x7ffff8a156c0) at /root/src/wwl/nebula/src/common/base/SignalHandler.cpp:82#25 <signal handler called>#26 0x00007f7ff837e017 in pthread_join () from /lib64/libpthread.so.0#27 0x0000000004b414d3 in std::thread::join() ()#28 0x0000000002a59090 in nebula::storage::StorageServer::waitUntilStop (this=0x7f7ff7a4f180) at /root/src/wwl/nebula/src/storage/StorageServer.cpp:310#29 0x0000000002a3d9f4 in main (argc=2, argv=0x7ffff8a168c0) at /root/src/wwl/nebula/src/daemons/StorageDaemon.cpp:158
and the log shows that raft is trying hard to catch up log index but no luck for more nearly 5mins(and then we send sigterm to to process but it stuck):
14278 I1014 14:37:54.739661 2927154 RaftPart.cpp:1549] [Port: 42648, Space: 1, Part: 1] Local is missing logs from id 251343. Need to catch up
14279 I1014 14:37:54.739706 2930624 ChainAddEdgesProcessorLocal.cpp:404] e0645f4e-23fe-42b5-a63f-4d807d06e1dc 104->118, val: "76->13430" 2021-10-14T06:37:54.836000 commitLocal(), code_ =
...
8409188 I1014 14:42:35.491660 2927168 MetaClient.cpp:590] Send request to meta "192.168.15.11":9559
8409189 I1014 14:42:35.492377 2927168 MetaClient.cpp:2379] Metad last update time: 1634193507184
8409190 I1014 14:42:35.495230 2927157 RaftPart.cpp:1549] [Port: 42648, Space: 1, Part: 1] Local is missing logs from id 251343. Need to catch up
The text was updated successfully, but these errors were encountered:
Please check the FAQ documentation before raising an issue
Please check the FAQ documentation and old issues before raising an issue in case someone has asked the same question that you are asking.
Describe the bug (must be provided)
storage block on exit, stack and log are provided below, this is a different kind of storage stuck, we've never seen before during our previous test.
Your Environments (must be provided)
uname -a
g++ --version
orclang++ --version
lscpu
How To Reproduce(must be provided)
Steps to reproduce the behavior:
Expected behavior
all nebula services exit gracefully, but we found two storages instance stuck and can only be killed with -9 signal.
Additional context
part of storage stack, the full process stack can be found here :
2927120.txt
and the log shows that raft is trying hard to catch up log index but no luck for more nearly 5mins(and then we send sigterm to to process but it stuck):
The text was updated successfully, but these errors were encountered: