
fix crash problem when stop process. #705

Closed · wants to merge 3 commits

Conversation

wadeliuyi (Contributor)

Fix a crash on process shutdown: the meta client and the raft part depend on the IO thread pool, but gServer stops that pool first.

1. Raft backtrace:
(gdb) bt
#0 0x000000000208f587 in folly::IOThreadPoolExecutor::getEventBase (this=) at /usr/include/c++/8/bits/shared_ptr_base.h:1018
#1 0x0000000001b48970 in nebula::raftex::RaftPart::appendLogsInternal (this=0x7fbfdc168c10, iter=..., termId=8) at /home/wade.liu/rd/nebula/src/kvstore/raftex/RaftPart.cpp:512
#2 0x0000000001b47dd0 in nebula::raftex::RaftPart::appendLogAsync (this=0x7fbfdc168c10, source=0 '\000', logType=nebula::raftex::LogType::NORMAL, log="")
at /home/wade.liu/rd/nebula/src/kvstore/raftex/RaftPart.cpp:452
#3 0x0000000001b501ff in nebula::raftex::RaftPart::sendHeartbeat (this=0x7fbfdc168c10) at /home/wade.liu/rd/nebula/src/kvstore/raftex/RaftPart.cpp:1270
#4 0x0000000001b4ce4f in nebula::raftex::RaftPart::statusPolling (this=0x7fbfdc168c10) at /home/wade.liu/rd/nebula/src/kvstore/raftex/RaftPart.cpp:940
#5 0x0000000001b4c9d8 in nebula::raftex::RaftPart::<lambda()>::operator()(void) const (__closure=0x7fbfd7ce7400) at /home/wade.liu/rd/nebula/src/kvstore/raftex/RaftPart.cpp:949
#6 0x0000000001b5d36c in std::__invoke_impl<void, nebula::raftex::RaftPart::statusPolling()::<lambda()>&>(std::__invoke_other, nebula::raftex::RaftPart::<lambda()> &) (__f=...)
at /usr/include/c++/8/bits/invoke.h:60
#7 0x0000000001b5d2a1 in std::__invoke<nebula::raftex::RaftPart::statusPolling()::<lambda()>&>(nebula::raftex::RaftPart::<lambda()> &) (__fn=...) at /usr/include/c++/8/bits/invoke.h:95
#8 0x0000000001b5d19a in std::_Bind<nebula::raftex::RaftPart::statusPolling()::<lambda()>()>::__call(std::tuple<> &&, std::_Index_tuple<>) (this=0x7fbfd7ce7400, __args=...)
at /usr/include/c++/8/functional:400
#9 0x0000000001b5ccaa in std::_Bind<nebula::raftex::RaftPart::statusPolling()::<lambda()>()>::operator()<>(void) (this=0x7fbfd7ce7400) at /usr/include/c++/8/functional:484
#10 0x0000000001b5c647 in std::_Function_handler<void(), std::_Bind<nebula::raftex::RaftPart::statusPolling()::<lambda()>()> >::_M_invoke(const std::_Any_data &) (__functor=...)
at /usr/include/c++/8/bits/std_function.h:297

2. Meta backtrace:
#0 0x000000000208dc37 in folly::IOThreadPoolExecutor::getEventBase (this=) at /usr/include/c++/8/bits/shared_ptr_base.h:1018
#1 0x00000000018169eb in nebula::meta::MetaClient::getResponse<nebula::meta::cpp2::HBReq, nebula::meta::MetaClient::heartbeat()::<lambda(auto:110, auto:111)>, nebula::meta::MetaClient::heartbeat()::<lambda(nebula::meta::cpp2::HBResp&&)> >(nebula::meta::cpp2::HBReq, nebula::meta::MetaClient::<lambda(auto:110, auto:111)>, nebula::meta::MetaClient::<lambda(nebula::meta::cpp2::HBResp&&)>, bool) (
this=0x7f7642d60600, req=..., remoteFunc=..., respGen=..., toLeader=true) at /home/wade.liu/rd/nebula/src/meta/client/MetaClient.cpp:254
#2 0x000000000180d6f6 in nebula::meta::MetaClient::heartbeat (this=0x7f7642d60600) at /home/wade.liu/rd/nebula/src/meta/client/MetaClient.cpp:987
#3 0x0000000001806204 in nebula::meta::MetaClient::heartBeatThreadFunc (this=0x7f7642d60600) at /home/wade.liu/rd/nebula/src/meta/client/MetaClient.cpp:85
#4 0x00000000018876da in std::__invoke_impl<void, void (nebula::meta::MetaClient::&)(), nebula::meta::MetaClient&> (
__f=@0x7f7642dc1ea0: (void (nebula::meta::MetaClient::*)(nebula::meta::MetaClient * const)) 0x18061e0 nebula::meta::MetaClient::heartBeatThreadFunc(), __t=@0x7f7642dc1eb0: 0x7f7642d60600)
at /usr/include/c++/8/bits/invoke.h:73

@wadeliuyi (Contributor Author)

The reason is that when gServer->stop() is called, gServer stops all of the IO threads, but the meta client and the raft part still need the IO thread pool to send messages out, so the process crashes.
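The fix amounts to a shutdown-ordering rule: stop every consumer of the IO thread pool before tearing the pool down. Here is a minimal sketch of that rule using plain std::thread stand-ins; FakeIoPool and orderlyShutdown are hypothetical names, not the actual folly/nebula types:

```cpp
#include <atomic>
#include <chrono>
#include <memory>
#include <thread>

// Hypothetical stand-in for the IO pool: only tracks whether it is still alive.
class FakeIoPool {
 public:
  bool alive() const { return alive_.load(); }
  void stop() { alive_.store(false); }
 private:
  std::atomic<bool> alive_{true};
};

// Returns true when shutdown completes without the background thread ever
// touching a dead pool.
inline bool orderlyShutdown() {
  auto pool = std::make_shared<FakeIoPool>();
  std::atomic<bool> running{true};
  std::atomic<bool> usedDeadPool{false};

  // Background "heartbeat" thread, analogous to MetaClient's heartbeat loop.
  std::thread heartbeat([&] {
    while (running.load()) {
      if (!pool->alive()) {          // this is where getEventBase() would crash
        usedDeadPool.store(true);
      }
      std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
  });

  // Correct order: stop the consumer first, then the pool it depends on.
  running.store(false);
  heartbeat.join();
  pool->stop();
  return !usedDeadPool.load();
}
```

The crash in this PR is the reverse order: the pool is stopped by gServer while the heartbeat thread is still running.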

@wadeliuyi wadeliuyi added the ready-for-testing PR: ready for the CI test label Jul 31, 2019
@wadeliuyi (Contributor Author)

Jenkins go.

@nebula-community-bot (Member)

Unit testing passed.

@nebula-community-bot (Member)

Unit testing failed.

@nebula-community-bot (Member)

Unit testing passed.

@dutor (Contributor) left a comment


Could you please extract this logic out for us?

@wadeliuyi (Contributor Author)

> Could you please extract this logic out for us?

The meta client has a background thread that sends heartbeats, and it depends on the IO thread pool created in main. The thrift server depends on the same IO thread pool, but when we stop the thrift server, it stops all threads in that pool while the meta client's background thread keeps running. When the meta client then sends a heartbeat, it calls `auto* evb = ioThreadPool_->getEventBase();` to get an event base. The functions look like this:
EventBase* IOThreadPoolExecutor::getEventBase() {
  ensureActiveThreads();
  SharedMutex::ReadHolder r{&threadListLock_};
  return pickThread()->eventBase;
}

std::shared_ptr<IOThreadPoolExecutor::IOThread>
IOThreadPoolExecutor::pickThread() {
  auto& me = *thisThread_;
  auto& ths = threadList_.get();
  // When new task is added to IOThreadPoolExecutor, a thread is chosen for it
  // to be executed on, thisThread_ is by default chosen, however, if the new
  // task is added by the clean up operations on thread destruction, thisThread_
  // is not an available thread anymore, thus, always check whether or not
  // thisThread_ is an available thread before choosing it.
  if (me && std::find(ths.cbegin(), ths.cend(), me) != ths.cend()) {
    return me;
  }
  auto n = ths.size();
  if (n == 0) {
    return me;
  }
  auto thread = ths[nextThread_.fetch_add(1, std::memory_order_relaxed) % n];
  return std::static_pointer_cast<IOThread>(thread);
}

Here `me` is nullptr and `ths` is empty, so `pickThread()` returns a null pointer, and dereferencing it in `getEventBase()` crashes the process.
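To make the failure mode concrete, here is a minimal mimic of that fallback path (hypothetical types, not folly's): once the pool's thread list is drained, the pickThread-style lookup falls back to the caller's slot, which is a null shared_ptr.

```cpp
#include <memory>
#include <vector>

// Hypothetical stand-in for folly's internal IOThread.
struct IoThread {
  int eventBase = 0;
};

// Mimics pickThread()'s control flow: prefer "me" (the calling thread's slot),
// fall back to "me" again when the thread list is empty.
inline std::shared_ptr<IoThread> pickThreadLike(
    const std::shared_ptr<IoThread>& me,
    const std::vector<std::shared_ptr<IoThread>>& threads) {
  if (me) {
    return me;             // current thread is still part of the pool
  }
  if (threads.empty()) {
    return me;             // nullptr after the pool has been stopped
  }
  return threads.front();
}
```

After shutdown both branches return a null pointer, so a `pickThreadLike(...)->eventBase`-style dereference is exactly the crash shown in the backtraces.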

@wadeliuyi (Contributor Author)

While fixing this problem, I found some complex cyclic dependencies, such as between the worker threads in thrift and the kvstore in main, and between the accept thread pool and the IO thread pool.

@wadeliuyi wadeliuyi closed this Aug 1, 2019
@wadeliuyi wadeliuyi reopened this Aug 1, 2019
@nebula-community-bot (Member)

Unit testing passed.

@dangleptr (Contributor) left a comment


The PR looks good to me. Thanks for taking care of it.

Did you try it? Does it work?

auto ioPool = std::make_shared<folly::IOThreadPoolExecutor>(FLAGS_num_io_threads);
auto acceptThreadPool = std::make_shared<folly::IOThreadPoolExecutor>(1);
Contributor

Make it configurable.

Contributor Author

Good point. I thought about this, but I concluded there is no need to make it configurable; 1 thread is already more than enough. The accept thread pool only accepts connections from clients, which are our own processes such as graphd or meta, and those are long-lived connections, so the load there is low. What do you think?

void RaftexService::initThriftServer(std::shared_ptr<folly::IOThreadPoolExecutor> pool,
                                     uint16_t port) {
void RaftexService::initThriftServer(std::shared_ptr<folly::IOThreadPoolExecutor> ioPool,
                                     std::shared_ptr<folly::IOThreadPoolExecutor> acceptPool,
Contributor

alignment

@wadeliuyi (Contributor Author)

> The PR looks good to me. Thanks for taking care of it.
>
> Did you try it? Does it work?

Yes. I tested the case where main sleeps for a few seconds after returning from gServer->serve(); when the meta client then sends a heartbeat, the process no longer crashes.

, localHost_(localHost)
, sendHeartBeat_(sendHeartBeat) {
ioThreadPool_ = std::make_unique<folly::IOThreadPoolExecutor>(FLAGS_meta_client_io_thread_num);
CHECK(ioThreadPool_ != nullptr) << "IOThreadPool is required";
Contributor

It doesn't seem necessary anymore.

auto ioPool = std::make_shared<folly::IOThreadPoolExecutor>(FLAGS_num_io_threads);
auto acceptThreadPool = std::make_shared<folly::IOThreadPoolExecutor>(1);
Contributor

  1. Why does MetaDaemon share the acceptors with NebulaStore? Same question for StorageDaemon.
  2. Why do we create acceptors with 1 worker? The ThriftServer creates acceptors with 1 worker by default, after all.
  3. Can acceptors with 1 worker meet our performance requirements?

@dutor (Contributor)

dutor commented Aug 1, 2019

> Could you please extract this logic out for us?
>
> The meta client has a background thread that sends heartbeats, and it depends on the IO thread pool in main. […]

I knew almost all of this. But what I meant is that we cannot live with fix after fix after fix forever.

Generally, fixing it this way is not clean and not OK to me.

@wadeliuyi (Contributor Author)


> I knew almost all of this. But what I meant is that we cannot live with fix after fix after fix forever.
>
> Generally, fixing it this way is not clean and not OK to me.

Thanks very much, that is a very good point. I have thought about it too: we cannot just fix problems for the sake of fixing them; we need to find the underlying design problem and avoid the dependency issue at the architecture level.
I can think of two ways to solve it, maybe not the best:
First, one process runs just one RPC server; maybe we can combine the raft service and the main service, but then we still have to take care of the resource release order when the process exits.
Second, the raft service and the main service stop sharing the IO thread pool and raft uses its own. But the raft service has heavy IO traffic, so it would need many threads, which leaves the whole process with too many threads.

@wadeliuyi wadeliuyi closed this Aug 2, 2019
@wadeliuyi (Contributor Author)

As dutor says, this fix is very ugly, and I am not happy with it myself, so I am closing this PR and opening an issue instead. This is an open source project, and we should bring good ideas into our code.

yixinglu pushed a commit to yixinglu/nebula that referenced this pull request Mar 21, 2022
* persist peer info when balancing

* add part with peers

* add cache lib dependency on common test

Co-authored-by: pengwei.song <90180021+pengweisong@users.noreply.github.com>
Co-authored-by: Sophie <84560950+Sophie-Xie@users.noreply.github.com>
6 participants