
fix crash problem when stop process. #705

Closed · wants to merge 3 commits

Conversation

wadeliuyi (Contributor)

Fix a crash on process shutdown: the meta client and the raft part depend on the IO thread pool, but gServer stops that pool first.

1. Raft backtrace:
(gdb) bt
#0 0x000000000208f587 in folly::IOThreadPoolExecutor::getEventBase (this=) at /usr/include/c++/8/bits/shared_ptr_base.h:1018
#1 0x0000000001b48970 in nebula::raftex::RaftPart::appendLogsInternal (this=0x7fbfdc168c10, iter=..., termId=8) at /home/wade.liu/rd/nebula/src/kvstore/raftex/RaftPart.cpp:512
#2 0x0000000001b47dd0 in nebula::raftex::RaftPart::appendLogAsync (this=0x7fbfdc168c10, source=0 '\000', logType=nebula::raftex::LogType::NORMAL, log="")
at /home/wade.liu/rd/nebula/src/kvstore/raftex/RaftPart.cpp:452
#3 0x0000000001b501ff in nebula::raftex::RaftPart::sendHeartbeat (this=0x7fbfdc168c10) at /home/wade.liu/rd/nebula/src/kvstore/raftex/RaftPart.cpp:1270
#4 0x0000000001b4ce4f in nebula::raftex::RaftPart::statusPolling (this=0x7fbfdc168c10) at /home/wade.liu/rd/nebula/src/kvstore/raftex/RaftPart.cpp:940
#5 0x0000000001b4c9d8 in nebula::raftex::RaftPart::<lambda()>::operator()(void) const (__closure=0x7fbfd7ce7400) at /home/wade.liu/rd/nebula/src/kvstore/raftex/RaftPart.cpp:949
#6 0x0000000001b5d36c in std::__invoke_impl<void, nebula::raftex::RaftPart::statusPolling()::<lambda()>&>(std::__invoke_other, nebula::raftex::RaftPart::<lambda()> &) (__f=...)
at /usr/include/c++/8/bits/invoke.h:60
#7 0x0000000001b5d2a1 in std::__invoke<nebula::raftex::RaftPart::statusPolling()::<lambda()>&>(nebula::raftex::RaftPart::<lambda()> &) (__fn=...) at /usr/include/c++/8/bits/invoke.h:95
#8 0x0000000001b5d19a in std::_Bind<nebula::raftex::RaftPart::statusPolling()::<lambda()>()>::__call(std::tuple<> &&, std::_Index_tuple<>) (this=0x7fbfd7ce7400, __args=...)
at /usr/include/c++/8/functional:400
#9 0x0000000001b5ccaa in std::_Bind<nebula::raftex::RaftPart::statusPolling()::<lambda()>()>::operator()<>(void) (this=0x7fbfd7ce7400) at /usr/include/c++/8/functional:484
#10 0x0000000001b5c647 in std::_Function_handler<void(), std::_Bind<nebula::raftex::RaftPart::statusPolling()::<lambda()>()> >::_M_invoke(const std::_Any_data &) (__functor=...)
at /usr/include/c++/8/bits/std_function.h:297

2. Meta backtrace:
#0 0x000000000208dc37 in folly::IOThreadPoolExecutor::getEventBase (this=) at /usr/include/c++/8/bits/shared_ptr_base.h:1018
#1 0x00000000018169eb in nebula::meta::MetaClient::getResponse<nebula::meta::cpp2::HBReq, nebula::meta::MetaClient::heartbeat()::<lambda(auto:110, auto:111)>, nebula::meta::MetaClient::heartbeat()::<lambda(nebula::meta::cpp2::HBResp&&)> >(nebula::meta::cpp2::HBReq, nebula::meta::MetaClient::<lambda(auto:110, auto:111)>, nebula::meta::MetaClient::<lambda(nebula::meta::cpp2::HBResp&&)>, bool) (
this=0x7f7642d60600, req=..., remoteFunc=..., respGen=..., toLeader=true) at /home/wade.liu/rd/nebula/src/meta/client/MetaClient.cpp:254
#2 0x000000000180d6f6 in nebula::meta::MetaClient::heartbeat (this=0x7f7642d60600) at /home/wade.liu/rd/nebula/src/meta/client/MetaClient.cpp:987
#3 0x0000000001806204 in nebula::meta::MetaClient::heartBeatThreadFunc (this=0x7f7642d60600) at /home/wade.liu/rd/nebula/src/meta/client/MetaClient.cpp:85
#4 0x00000000018876da in std::__invoke_impl<void, void (nebula::meta::MetaClient::&)(), nebula::meta::MetaClient&> (
__f=@0x7f7642dc1ea0: (void (nebula::meta::MetaClient::*)(nebula::meta::MetaClient * const)) 0x18061e0 nebula::meta::MetaClient::heartBeatThreadFunc(), __t=@0x7f7642dc1eb0: 0x7f7642d60600)
at /usr/include/c++/8/bits/invoke.h:73

@wadeliuyi (Contributor Author)

The reason is that when gServer->stop() is called, gServer stops all of the IO threads, but the meta client and the raft part still need the IO thread pool to send messages out, so the process crashes.
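The fix amounts to a shutdown-ordering rule: stop every consumer of the IO thread pool before tearing the pool down. Here is a minimal sketch of that rule using plain std::thread stand-ins; FakeIoPool and orderlyShutdown are hypothetical names, not the actual folly/nebula types:

```cpp
#include <atomic>
#include <chrono>
#include <memory>
#include <thread>

// Hypothetical stand-in for the IO pool: only tracks whether it is still alive.
class FakeIoPool {
 public:
  bool alive() const { return alive_.load(); }
  void stop() { alive_.store(false); }
 private:
  std::atomic<bool> alive_{true};
};

// Returns true when shutdown completes without the background thread ever
// touching a dead pool.
inline bool orderlyShutdown() {
  auto pool = std::make_shared<FakeIoPool>();
  std::atomic<bool> running{true};
  std::atomic<bool> usedDeadPool{false};

  // Background "heartbeat" thread, analogous to MetaClient's heartbeat loop.
  std::thread heartbeat([&] {
    while (running.load()) {
      if (!pool->alive()) {          // this is where getEventBase() would crash
        usedDeadPool.store(true);
      }
      std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
  });

  // Correct order: stop the consumer first, then the pool it depends on.
  running.store(false);
  heartbeat.join();
  pool->stop();
  return !usedDeadPool.load();
}
```

The crash in this PR is the reverse order: the pool is stopped by gServer while the heartbeat thread is still running.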

@wadeliuyi wadeliuyi added the ready-for-testing PR: ready for the CI test label Jul 31, 2019
@wadeliuyi (Contributor Author)

Jenkins go.

@nebula-community-bot (Member)

Unit testing passed.

@nebula-community-bot (Member)

Unit testing failed.

@nebula-community-bot (Member)

Unit testing passed.

@dutor (Contributor) left a comment


Could you please extract this logic out for us?

@wadeliuyi (Contributor Author)

> Could you please extract this logic out for us?

The meta client has a background thread that sends heartbeats, and it depends on the IO thread pool created in main. The thrift server depends on the same IO thread pool, but when we stop the thrift server, it stops all threads in that pool while the meta client's background thread keeps running. When the meta client then sends a heartbeat, it calls `auto* evb = ioThreadPool_->getEventBase();` to get an event base. The functions look like this:
EventBase* IOThreadPoolExecutor::getEventBase() {
  ensureActiveThreads();
  SharedMutex::ReadHolder r{&threadListLock_};
  return pickThread()->eventBase;
}

std::shared_ptr<IOThreadPoolExecutor::IOThread>
IOThreadPoolExecutor::pickThread() {
  auto& me = *thisThread_;
  auto& ths = threadList_.get();
  // When new task is added to IOThreadPoolExecutor, a thread is chosen for it
  // to be executed on, thisThread_ is by default chosen, however, if the new
  // task is added by the clean up operations on thread destruction, thisThread_
  // is not an available thread anymore, thus, always check whether or not
  // thisThread_ is an available thread before choosing it.
  if (me && std::find(ths.cbegin(), ths.cend(), me) != ths.cend()) {
    return me;
  }
  auto n = ths.size();
  if (n == 0) {
    return me;
  }
  auto thread = ths[nextThread_.fetch_add(1, std::memory_order_relaxed) % n];
  return std::static_pointer_cast<IOThread>(thread);
}

Here `me` is nullptr and `ths` is empty, so `pickThread()` returns a null pointer, and dereferencing it in `getEventBase()` crashes the process.
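To make the failure mode concrete, here is a minimal mimic of that fallback path (hypothetical types, not folly's): once the pool's thread list is drained, the pickThread-style lookup falls back to the caller's slot, which is a null shared_ptr.

```cpp
#include <memory>
#include <vector>

// Hypothetical stand-in for folly's internal IOThread.
struct IoThread {
  int eventBase = 0;
};

// Mimics pickThread()'s control flow: prefer "me" (the calling thread's slot),
// fall back to "me" again when the thread list is empty.
inline std::shared_ptr<IoThread> pickThreadLike(
    const std::shared_ptr<IoThread>& me,
    const std::vector<std::shared_ptr<IoThread>>& threads) {
  if (me) {
    return me;             // current thread is still part of the pool
  }
  if (threads.empty()) {
    return me;             // nullptr after the pool has been stopped
  }
  return threads.front();
}
```

After shutdown both branches return a null pointer, so a `pickThreadLike(...)->eventBase`-style dereference is exactly the crash shown in the backtraces.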

@wadeliuyi (Contributor Author)

While fixing this problem, I found some complex cyclic dependencies, such as between the worker threads in thrift and the kvstore in main, and between the accept thread pool and the IO thread pool.

@wadeliuyi wadeliuyi closed this Aug 1, 2019
@wadeliuyi wadeliuyi reopened this Aug 1, 2019
@nebula-community-bot (Member)

Unit testing passed.

@dangleptr (Contributor) left a comment


The PR looks good to me. Thanks for taking care of it.

Did you try it? Does it work?

auto ioPool = std::make_shared<folly::IOThreadPoolExecutor>(FLAGS_num_io_threads);
auto acceptThreadPool = std::make_shared<folly::IOThreadPoolExecutor>(1);
Contributor

Make it configurable.

Contributor Author

Good point. I thought about this, but I concluded there is no need to make it configurable; 1 thread is already more than enough. The accept thread pool only accepts connections from clients, which are our own processes such as graphd or meta, and those are long-lived connections, so the load there is low. What do you think?

void RaftexService::initThriftServer(std::shared_ptr<folly::IOThreadPoolExecutor> pool,
                                     uint16_t port) {
void RaftexService::initThriftServer(std::shared_ptr<folly::IOThreadPoolExecutor> ioPool,
                                     std::shared_ptr<folly::IOThreadPoolExecutor> acceptPool,
Contributor

alignment

@wadeliuyi (Contributor Author)

> The PR looks good to me. Thanks for taking care of it.
>
> Did you try it? Does it work?

Yes. I tested the case where main sleeps for a few seconds after returning from gServer->serve(); when the meta client then sends a heartbeat, the process no longer crashes.

, localHost_(localHost)
, sendHeartBeat_(sendHeartBeat) {
ioThreadPool_ = std::make_unique<folly::IOThreadPoolExecutor>(FLAGS_meta_client_io_thread_num);
CHECK(ioThreadPool_ != nullptr) << "IOThreadPool is required";
Contributor

It doesn't seem necessary anymore.

auto ioPool = std::make_shared<folly::IOThreadPoolExecutor>(FLAGS_num_io_threads);
auto acceptThreadPool = std::make_shared<folly::IOThreadPoolExecutor>(1);
Contributor

  1. Why does MetaDaemon share the acceptors with NebulaStore? Same question for StorageDaemon.
  2. Why do we create acceptors with 1 worker? The ThriftServer creates acceptors with 1 worker by default, after all.
  3. Can acceptors with 1 worker meet our performance requirements?

@dutor (Contributor)

dutor commented Aug 1, 2019

> Could you please extract this logic out for us?
>
> The meta client has a background thread that sends heartbeats, and it depends on the IO thread pool in main. […]

I knew almost all of this. But what I meant is that we cannot live with fix after fix after fix forever.

Generally, fixing it this way is not clean and not OK to me.

@wadeliuyi (Contributor Author)


> I knew almost all of this. But what I meant is that we cannot live with fix after fix after fix forever.
>
> Generally, fixing it this way is not clean and not OK to me.

Thanks very much, that is a very good point. I have thought about it too: we cannot just fix problems for the sake of fixing them; we need to find the underlying design problem and avoid the dependency issue at the architecture level.
I can think of two ways to solve it, maybe not the best:
First, one process runs just one RPC server; maybe we can combine the raft service and the main service, but then we still have to take care of the resource release order when the process exits.
Second, the raft service and the main service stop sharing the IO thread pool and raft uses its own. But the raft service has heavy IO traffic, so it would need many threads, which leaves the whole process with too many threads.

@wadeliuyi wadeliuyi closed this Aug 2, 2019
@wadeliuyi (Contributor Author)

As dutor says, this fix is very ugly, and I am not happy with it myself, so I am closing this PR and opening an issue instead. This is an open source project, and we should bring good ideas into our code.

yixinglu pushed a commit to yixinglu/nebula that referenced this pull request Mar 21, 2022
* persist peer info when balancing

* add part with peers

* add cache lib dependency on common test

Co-authored-by: pengwei.song <90180021+pengweisong@users.noreply.github.com>
Co-authored-by: Sophie <84560950+Sophie-Xie@users.noreply.github.com>
6 participants