brpc + user-defined thread pool hangs #1428
Comments
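In addition, in frame 4 of typical stack [1] we observed:
Judging from this thread stack, Done appears to push work into _remote_rq, and which task group receives it is random. If it lands in the task_group that is running the corresponding RPC, that RPC task is running in pthread mode, and the other task_groups happen to be in the same state, then they cannot work-steal from one another and the process probably deadlocks.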
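The retry-on-full behaviour described in this comment can be modelled with a toy bounded queue. This is only a sketch under stated assumptions: `BoundedQueue`, `push_with_retry` and `max_attempts` are hypothetical names, not brpc code. Stack [2] shows the compute thread sleeping inside `ready_to_run_remote`, which suggests a retry loop around a full `_remote_rq`; the attempt cap here exists only so the example terminates.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <deque>
#include <mutex>
#include <thread>

// Toy stand-in for a fixed-capacity run queue (NOT brpc's implementation).
class BoundedQueue {
public:
    explicit BoundedQueue(std::size_t capacity) : capacity_(capacity) {}

    // Non-blocking push; fails when the queue is at capacity.
    bool try_push(int task) {
        std::lock_guard<std::mutex> lock(mu_);
        if (items_.size() >= capacity_) return false;
        items_.push_back(task);
        return true;
    }

    // Non-blocking pop; fails when the queue is empty.
    bool try_pop(int* task) {
        assert(task != nullptr);
        std::lock_guard<std::mutex> lock(mu_);
        if (items_.empty()) return false;
        *task = items_.front();
        items_.pop_front();
        return true;
    }

private:
    std::mutex mu_;
    std::deque<int> items_;
    const std::size_t capacity_;
};

// Retry a push with a short sleep between attempts.
// max_attempts is an assumption added so that this sketch always returns.
bool push_with_retry(BoundedQueue& q, int task, int max_attempts) {
    for (int i = 0; i < max_attempts; ++i) {
        if (q.try_push(task)) return true;
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
    return false;  // nobody drained the queue in time
}
```

If every consumer of the queue is itself blocked waiting for the producer, the push can never succeed, which matches the symptom of the "is full" log lines repeating indefinitely.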
Resolved. The cause was indeed that the process's map count reached /proc/sys/vm/max_map_count, so allocate_stack_storage failed, all bthreads fell back to the pthread model, and they finally blocked in futex_wait_private.
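A process can check how close it is to that limit by comparing the number of lines in /proc/self/maps (one per mapping) with the value in /proc/sys/vm/max_map_count. The following is a minimal Linux-only sketch; `count_lines`, `near_limit`, `near_map_limit` and the `headroom` parameter are hypothetical helpers for illustration, not part of brpc.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <istream>
#include <string>

// Count the lines in a stream; /proc/<pid>/maps has one line per mapping.
std::size_t count_lines(std::istream& in) {
    std::size_t n = 0;
    std::string line;
    while (std::getline(in, line)) ++n;
    return n;
}

// True when `mappings` is within `headroom` of `limit`.
bool near_limit(std::size_t mappings, std::size_t limit, std::size_t headroom) {
    return mappings + headroom >= limit;
}

// Check the current process against the kernel limit.
// Returns false if either file cannot be read (for example on non-Linux).
bool near_map_limit(std::size_t headroom) {
    std::ifstream maps("/proc/self/maps");
    std::ifstream limit_file("/proc/sys/vm/max_map_count");
    std::size_t limit = 0;
    if (!maps || !(limit_file >> limit)) return false;
    return near_limit(count_lines(maps), limit, headroom);
}
```

Calling `near_map_limit(1000)` periodically and logging a warning when it returns true would surface the condition before stack allocation starts failing; the threshold of 1000 is an arbitrary example value.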
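Could you share how you solved this problem?
Hello, is there any outcome on this issue? Did you actually end up using a custom thread pool?
Has this problem been resolved?
Any update, please?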
Describe the bug
We have a compute engine that must be invoked from a dedicated thread pool, so we adopted the following design:
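The design listing itself is not shown here. As a rough illustration only, the frame names in the stacks (`common::Task::Wait`, `common::Task::Done(task_ret)`, `ExecuteQueue::ThreadLoop`) suggest a handoff along the following lines. Everything in this sketch is an assumption: `std::mutex` and `std::condition_variable` stand in for the bthread primitives that appear in the stacks, and `RunOnWorker` is a hypothetical helper replacing the real pool.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical reconstruction of the Wait/Done handoff; not the author's code.
class Task {
public:
    // Called by the compute thread when the work has finished.
    void Done(int task_ret) {
        {
            std::lock_guard<std::mutex> lock(mu_);
            ret_ = task_ret;
            done_ = true;
        }
        cv_.notify_one();
    }

    // Called by the RPC handler; blocks until Done() has run.
    int Wait() {
        std::unique_lock<std::mutex> lock(mu_);
        cv_.wait(lock, [this] { return done_; });
        assert(done_);
        return ret_;
    }

private:
    std::mutex mu_;
    std::condition_variable cv_;
    bool done_ = false;
    int ret_ = 0;
};

// Run `compute` on a separate thread and block the caller until it finishes.
template <typename Fn>
int RunOnWorker(Fn compute) {
    Task task;
    std::thread worker([&] { task.Done(compute()); });
    int ret = task.Wait();
    worker.join();
    return ret;
}
```

In this pattern the RPC handler is parked inside `Wait` for the whole duration of the computation, so the handler and the compute thread depend on each other to make progress.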
We would like to ask two questions:
While hung, the process continuously prints log lines like these:
[ERROR] [2021-06-10 11:00:25.103] [52858#57225] [task_group_inl.h:92(push_rq)] _rq is full, capacity=4096
[ERROR] [2021-06-10 11:00:26.082] [52858#57107] [task_group.cpp:673(ready_to_run_remote)] _remote_rq is full, capacity=2048
[ERROR] [2021-06-10 11:00:26.103] [52858#57195] [task_group_inl.h:92(push_rq)] _rq is full, capacity=4096
[ERROR] [2021-06-10 11:00:27.082] [52858#57152] [task_group.cpp:673(ready_to_run_remote)] _remote_rq is full, capacity=2048
[ERROR] [2021-06-10 11:00:27.103] [52858#57225] [task_group_inl.h:92(push_rq)] _rq is full, capacity=4096
[ERROR] [2021-06-10 11:00:28.082] [52858#57122] [task_group.cpp:673(ready_to_run_remote)] _remote_rq is full, capacity=2048
[ERROR] [2021-06-10 11:00:28.103] [52858#57225] [task_group_inl.h:92(push_rq)] _rq is full, capacity=4096
Typical stack [1] (the RPC handler thread is stuck in the Wait method above):
Thread 20 (Thread 0x7f84ee7fc700 (LWP 57277)):
#0 0x00007f8c40dbe809 in syscall () from /lib64/libc.so.6
#1 0x0000000001268b23 in futex_wait_private (timeout=0x0, expected=0, addr1=0x7f84ee7f5a40) at ./src/bthread/sys_futex.h:42
#2 bthread::wait_pthread (pw=..., ptimeout=ptimeout@entry=0x0) at src/bthread/butex.cpp:142
#3 0x0000000001269abc in butex_wait_from_pthread (abstime=0x0, expected_value=0, b=0x7f84dc801a40, g=) at src/bthread/butex.cpp:589
#4 bthread::butex_wait (arg=0x7f84dc801a40, expected_value=expected_value@entry=0, abstime=abstime@entry=0x0) at src/bthread/butex.cpp:622
#5 0x000000000118910e in bthread_cond_wait (c=0x7f84dc84d590, m=0x7f84dc84d578) at src/bthread/condition_variable.cpp:101
#6 0x0000000000c70310 in bthread::ConditionVariable::wait (this=0x7f84dc84d590, lock=...) at /brpc/include/bthread/condition_variable.h:60
#7 0x0000000000c7034b in common::Task::Wait (this=0x7f84dc84d578) at /src/common/pool/execute_queue.h:39
Python Exception <type 'exceptions.IndexError'> list index out of range:
#8 0x0000000000c6d38f in Searcher::Search (this=0x7f84ee7f5f80, group_candidates=std::map with 0 elements) at /src/retrieve/searcher.cpp:229
#9 0x0000000000c5e6d5 in SearchLogic::Retrieve (this=0x7ffd15ff74f8, request=0x7f84dc84bcc0, response=0x7f84dc84cea0) at /src/retrieve/search_logic.cpp:127
#10 0x0000000000c848c4 in RetrieveServiceImpl::Retrieve (this=0x7ffd15ff74f0, controller=0x7f84dc84ba90, request=0x7f84dc84bcc0, response=0x7f84dc84cea0, done=0x7f84dc84cef0)
at /src/retrieve/service_impl.cpp:16
#11 0x0000000000d5f47d in RetrieveService::CallMethod (this=0x7ffd15ff74f0, method=0x49f9570, controller=0x7f84dc84ba90, request=0x7f84dc84bcc0, response=0x7f84dc84cea0, done=0x7f84dc84cef0)
at /src/proto/retrieve_api.pb.cc:245
#12 0x0000000001323755 in brpc::policy::ProcessRpcRequest (msg_base=) at src/brpc/policy/baidu_rpc_protocol.cpp:499
#13 0x00000000012cb8ba in brpc::ProcessInputMessage (void_arg=) at src/brpc/input_messenger.cpp:136
#14 0x000000000118fb5f in bthread::TaskGroup::task_runner (skip_remained=skip_remained@entry=1) at src/bthread/task_group.cpp:297
#15 0x000000000119001b in bthread::TaskGroup::run_main_task (this=this@entry=0x7f84dc0008c0) at src/bthread/task_group.cpp:158
#16 0x0000000001266536 in bthread::TaskControl::worker_thread (arg=0x49df570) at src/bthread/task_control.cpp:77
#17 0x00007f8c41cd2e25 in start_thread () from /lib64/libpthread.so.0
#18 0x00007f8c40dc435d in clone () from /lib64/libc.so.6
Typical stack [2] (the compute thread is stuck in the Done method above):
Thread 196 (Thread 0x7f8bb080d700 (LWP 57094)):
#0 0x00007f8c40d8b1bd in nanosleep () from /lib64/libc.so.6
#1 0x00007f8c40dbbed4 in usleep () from /lib64/libc.so.6
#2 0x000000000118e046 in bthread::TaskGroup::ready_to_run_remote (this=0x7f85980008c0, tid=tid@entry=51539635585, nosignal=nosignal@entry=false) at src/bthread/task_group.cpp:675
#3 0x000000000126910a in bthread::butex_wake (arg=) at src/bthread/butex.cpp:287
#4 0x0000000001189071 in bthread_cond_signal (c=) at src/bthread/condition_variable.cpp:69
#5 0x0000000000bf85b8 in bthread::ConditionVariable::notify_one (this=0x7f85dc28f680) at /data/devops/workspace/yt-industry-ai/zeus/p-8ab35777b3814c8e843aa982bee6e16a/third_path/brpc/include/bthread/condition_variable.h:94
#6 0x0000000000bf86e6 in common::Task::Done (this=0x7f85dc28f668, task_ret=0) at /src/common/pool/execute_queue.h:33
#7 0x0000000000c75cb5 in common::ExecuteQueue::ThreadLoop (this=0x4bf4d90, idx=3) at /src/common/pool/execute_queue.h:229
#8 0x0000000000c72608 in common::ExecuteQueue::InitAndStartThreads()::{lambda()#1}::operator()() const (__closure=0x4c2c130)
at /src/common/pool/execute_queue.h:142
#9 0x0000000000c7f8b2 in std::_Bind_simple<common::ExecuteQueue::InitAndStartThreads()::{lambda()#1} ()>::_M_invoke<>(std::_Index_tuple<>) (this=0x4c2c130) at /usr/include/c++/4.8.2/functional:1732
#10 0x0000000000c7f7bf in std::_Bind_simple<common::ExecuteQueue::InitAndStartThreads()::{lambda()#1} ()>::operator()() (this=0x4c2c130) at /usr/include/c++/4.8.2/functional:1720
#11 0x0000000000c7f61e in std::thread::_Impl<std::_Bind_simple<common::ExecuteQueueyoutu::zeus::SearchTask::InitAndStartThreads()::{lambda()#1} ()> >::_M_run() (this=0x4c2c118) at /usr/include/c++/4.8.2/thread:115
#12 0x00007f8c4165d220 in ?? () from /lib64/libstdc++.so.6
#13 0x00007f8c41cd2e25 in start_thread () from /lib64/libpthread.so.0
#14 0x00007f8c40dc435d in clone () from /lib64/libc.so.6
To Reproduce
May occur after a period of high load.
Expected behavior
Once the load drops, the service should recover automatically instead of staying hung.
Versions
OS: centos7
Compiler: gcc 4.8.5
brpc: 0.9.6
protobuf: 3.6.1
Additional context/screenshots