
brpc + user-defined thread pool hangs #1428

Closed
ChenChuang opened this issue Jun 10, 2021 · 7 comments

Comments

@ChenChuang

Describe the bug (描述bug)
We have a compute engine that must be invoked from its own thread pool, so we adopted the following design:

  1. brpc handles message I/O. In the RPC handler we convert the incoming message into a compute task, push it onto a global queue, and then wait for completion via bthread::Mutex + bthread::ConditionVariable (the Wait method in the code below).
  2. The compute thread pool (N pthreads) keeps pulling tasks from the global queue; after finishing the computation, it notifies the waiting RPC-handling bthread via the Done method.
class Task {
 public:
  // Called by a compute pthread to signal completion.
  void Done() {
    {
      std::unique_lock<bthread::Mutex> lock(mutex_);
      done_ = true;
    }
    cond_.notify_one();
  }

  // Called by the RPC-handling bthread to block until Done() runs.
  void Wait() {
    std::unique_lock<bthread::Mutex> lock(mutex_);
    while (!done_) {
      cond_.wait(lock);
    }
  }

 private:
  bthread::Mutex mutex_;
  bthread::ConditionVariable cond_;
  bool done_ = false;
};

We would like to ask two questions:

  1. Is it reasonable to share the same bthread::Mutex/ConditionVariable between a pthread and a bthread like this?
  2. Under high load the process hangs completely. Could that be related to this usage pattern?

While hung, the following log lines scroll continuously:
[ERROR] [2021-06-10 11:00:25.103] [52858#57225] [task_group_inl.h:92(push_rq)] _rq is full, capacity=4096
[ERROR] [2021-06-10 11:00:26.082] [52858#57107] [task_group.cpp:673(ready_to_run_remote)] _remote_rq is full, capacity=2048
[ERROR] [2021-06-10 11:00:26.103] [52858#57195] [task_group_inl.h:92(push_rq)] _rq is full, capacity=4096
[ERROR] [2021-06-10 11:00:27.082] [52858#57152] [task_group.cpp:673(ready_to_run_remote)] _remote_rq is full, capacity=2048
[ERROR] [2021-06-10 11:00:27.103] [52858#57225] [task_group_inl.h:92(push_rq)] _rq is full, capacity=4096
[ERROR] [2021-06-10 11:00:28.082] [52858#57122] [task_group.cpp:673(ready_to_run_remote)] _remote_rq is full, capacity=2048
[ERROR] [2021-06-10 11:00:28.103] [52858#57225] [task_group_inl.h:92(push_rq)] _rq is full, capacity=4096

Typical stack [1] (an RPC worker thread stuck in the Wait method above):
Thread 20 (Thread 0x7f84ee7fc700 (LWP 57277)):
#0 0x00007f8c40dbe809 in syscall () from /lib64/libc.so.6
#1 0x0000000001268b23 in futex_wait_private (timeout=0x0, expected=0, addr1=0x7f84ee7f5a40) at ./src/bthread/sys_futex.h:42
#2 bthread::wait_pthread (pw=..., ptimeout=ptimeout@entry=0x0) at src/bthread/butex.cpp:142
#3 0x0000000001269abc in butex_wait_from_pthread (abstime=0x0, expected_value=0, b=0x7f84dc801a40, g=) at src/bthread/butex.cpp:589
#4 bthread::butex_wait (arg=0x7f84dc801a40, expected_value=expected_value@entry=0, abstime=abstime@entry=0x0) at src/bthread/butex.cpp:622
#5 0x000000000118910e in bthread_cond_wait (c=0x7f84dc84d590, m=0x7f84dc84d578) at src/bthread/condition_variable.cpp:101
#6 0x0000000000c70310 in bthread::ConditionVariable::wait (this=0x7f84dc84d590, lock=...) at /brpc/include/bthread/condition_variable.h:60
#7 0x0000000000c7034b in common::Task::Wait (this=0x7f84dc84d578) at /src/common/pool/execute_queue.h:39
Python Exception <type 'exceptions.IndexError'> list index out of range:
#8 0x0000000000c6d38f in Searcher::Search (this=0x7f84ee7f5f80, group_candidates=std::map with 0 elements) at /src/retrieve/searcher.cpp:229
#9 0x0000000000c5e6d5 in SearchLogic::Retrieve (this=0x7ffd15ff74f8, request=0x7f84dc84bcc0, response=0x7f84dc84cea0) at /src/retrieve/search_logic.cpp:127
#10 0x0000000000c848c4 in RetrieveServiceImpl::Retrieve (this=0x7ffd15ff74f0, controller=0x7f84dc84ba90, request=0x7f84dc84bcc0, response=0x7f84dc84cea0, done=0x7f84dc84cef0)
at /src/retrieve/service_impl.cpp:16
#11 0x0000000000d5f47d in RetrieveService::CallMethod (this=0x7ffd15ff74f0, method=0x49f9570, controller=0x7f84dc84ba90, request=0x7f84dc84bcc0, response=0x7f84dc84cea0, done=0x7f84dc84cef0)
at /src/proto/retrieve_api.pb.cc:245
#12 0x0000000001323755 in brpc::policy::ProcessRpcRequest (msg_base=) at src/brpc/policy/baidu_rpc_protocol.cpp:499
#13 0x00000000012cb8ba in brpc::ProcessInputMessage (void_arg=) at src/brpc/input_messenger.cpp:136
#14 0x000000000118fb5f in bthread::TaskGroup::task_runner (skip_remained=skip_remained@entry=1) at src/bthread/task_group.cpp:297
#15 0x000000000119001b in bthread::TaskGroup::run_main_task (this=this@entry=0x7f84dc0008c0) at src/bthread/task_group.cpp:158
#16 0x0000000001266536 in bthread::TaskControl::worker_thread (arg=0x49df570) at src/bthread/task_control.cpp:77
#17 0x00007f8c41cd2e25 in start_thread () from /lib64/libpthread.so.0
#18 0x00007f8c40dc435d in clone () from /lib64/libc.so.6

Typical stack [2] (a compute thread stuck in the Done method above):
Thread 196 (Thread 0x7f8bb080d700 (LWP 57094)):
#0 0x00007f8c40d8b1bd in nanosleep () from /lib64/libc.so.6
#1 0x00007f8c40dbbed4 in usleep () from /lib64/libc.so.6
#2 0x000000000118e046 in bthread::TaskGroup::ready_to_run_remote (this=0x7f85980008c0, tid=tid@entry=51539635585, nosignal=nosignal@entry=false) at src/bthread/task_group.cpp:675
#3 0x000000000126910a in bthread::butex_wake (arg=) at src/bthread/butex.cpp:287
#4 0x0000000001189071 in bthread_cond_signal (c=) at src/bthread/condition_variable.cpp:69
#5 0x0000000000bf85b8 in bthread::ConditionVariable::notify_one (this=0x7f85dc28f680) at /data/devops/workspace/yt-industry-ai/zeus/p-8ab35777b3814c8e843aa982bee6e16a/third_path/brpc/include/bthread/condition_variable.h:94
#6 0x0000000000bf86e6 in common::Task::Done (this=0x7f85dc28f668, task_ret=0) at /src/common/pool/execute_queue.h:33
#7 0x0000000000c75cb5 in common::ExecuteQueue::ThreadLoop (this=0x4bf4d90, idx=3) at /src/common/pool/execute_queue.h:229
#8 0x0000000000c72608 in common::ExecuteQueue::InitAndStartThreads()::{lambda()#1}::operator()() const (__closure=0x4c2c130)
at /src/common/pool/execute_queue.h:142
#9 0x0000000000c7f8b2 in std::_Bind_simple<common::ExecuteQueue::InitAndStartThreads()::{lambda()#1} ()>::_M_invoke<>(std::_Index_tuple<>) (this=0x4c2c130) at /usr/include/c++/4.8.2/functional:1732
#10 0x0000000000c7f7bf in std::_Bind_simple<common::ExecuteQueue::InitAndStartThreads()::{lambda()#1} ()>::operator()() (this=0x4c2c130) at /usr/include/c++/4.8.2/functional:1720
#11 0x0000000000c7f61e in std::thread::_Impl<std::_Bind_simple<common::ExecuteQueueyoutu::zeus::SearchTask::InitAndStartThreads()::{lambda()#1} ()> >::_M_run() (this=0x4c2c118) at /usr/include/c++/4.8.2/thread:115
#12 0x00007f8c4165d220 in ?? () from /lib64/libstdc++.so.6
#13 0x00007f8c41cd2e25 in start_thread () from /lib64/libpthread.so.0
#14 0x00007f8c40dc435d in clone () from /lib64/libc.so.6

To Reproduce (复现方法)
May occur under sustained high load.

Expected behavior (期望行为)
Once the load drops, the service should recover automatically instead of staying stuck.

Versions (各种版本)
OS: centos7
Compiler: gcc 4.8.5
brpc: 0.9.6
protobuf: 3.6.1

Additional context/screenshots (更多上下文/截图)

@ChenChuang (Author)

Additionally, in frame 4 of typical stack [1] we observed:
(gdb) frame 4
#4 bthread::butex_wait (arg=0x7f84d508e230, expected_value=expected_value@entry=0, abstime=abstime@entry=0x0) at src/bthread/butex.cpp:622
(gdb) p *g->_cur_meta
$6 = {current_waiter = {<std::atomicbthread::ButexWaiter*> = {_M_b = {_M_p = 0x7f84ed7f3a20}}, }, current_sleep = 0, stop = false, interrupted = false, about_to_quit = false, version_lock = 1, version_butex = 0x7f8640124f70, tid = 60129595668,
fn = 0x12cb8b0 brpc::ProcessInputMessage(void*), arg = 0x7f8bf8063210, stack = 0x7f84d400ca70, attr = {stack_type = 1, flags = 32, keytable_pool = 0x4bdebe0}, cpuwide_start_ns = 29493140553366624, stat = {cputime_ns = 0, nswitch = 0}, local_storage = {keytable =
0x7f84d509f9a0, assigned_data = 0x0, rpcz_parent_span = 0x0}}
The stack type of the currently running bthread is BTHREAD_STACKTYPE_PTHREAD, which does not match my expectation: I expected RPC requests to be handled on bthreads with bthread stacks.
The machine's memory load was not high at the time, so it seems unlikely that an out-of-memory condition forced the bthread stack type down to pthread.

@cool-colo

Judging from this thread stack, Done pushes onto a _remote_rq, and which task group the wakeup lands in is essentially random. If it lands in the task group that is running the corresponding RPC, and that RPC task is running in pthread mode, and the other task groups happen to be in the same situation so work stealing cannot make progress, then you get a deadlock.

@ChenChuang (Author)

Solved. It was indeed because the process's map count reached /proc/sys/vm/max_map_count, which made allocate_stack_storage fail. All bthreads then fell back to the pthread model and ended up blocked in futex_wait_private.
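For anyone hitting the same wall, the mapping limit and a process's current usage can be inspected from /proc; a sketch (the sysctl value shown is illustrative, and raising it requires root):

```shell
# System-wide per-process limit on memory mappings
cat /proc/sys/vm/max_map_count

# Number of mappings the current process uses; substitute the brpc
# server's PID for $$ when diagnosing a real server
wc -l < /proc/$$/maps

# Raise the limit (requires root; the value here is illustrative):
#   sysctl -w vm.max_map_count=1048576
# To persist across reboots:
#   echo 'vm.max_map_count=1048576' >> /etc/sysctl.conf
```

Each bthread stack is a separate mapping (plus a guard page), so a server running many concurrent bthreads can exhaust the default limit of 65530 well before running out of memory.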

@Cuttstage

> Solved. It was indeed because the process's map count reached /proc/sys/vm/max_map_count, which made allocate_stack_storage fail. All bthreads then fell back to the pthread model and ended up blocked in futex_wait_private.

How did you end up fixing this?

@gyd-a

gyd-a commented May 23, 2022

> Solved. It was indeed because the process's map count reached /proc/sys/vm/max_map_count, which made allocate_stack_storage fail. All bthreads then fell back to the pthread model and ended up blocked in futex_wait_private.

Hello, was there any conclusion on this? Did you end up actually using the custom thread pool?

@FTSSYY

FTSSYY commented Jun 14, 2022

Has this issue been resolved?

@AdiaLoveTrance

Any updates?
