
Deadlock between brpc::Acceptor::StartAccept and brpc::Acceptor::BeforeRecycle #1772

Open
weingithub opened this issue Jun 1, 2022 · 4 comments · May be fixed by #1791
Labels: bug (the code does not work as expected)

Comments

weingithub commented Jun 1, 2022

Describe the bug
A child process is created and the start_brpc_server interface is called inside it. Afterwards, a deadlock forms between brpc::Acceptor::StartAccept and brpc::Acceptor::BeforeRecycle, and a curl request to the listening port hangs. See the stack trace below for details.

To Reproduce
1. Create a child process and call the start_brpc_server interface inside it.
2. Kill the child process. The parent process has a monitoring thread; when it detects that the child has died, it restarts the child, which then repeats step 1 (a minimal sketch of this setup follows below).
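
A minimal, self-contained sketch of this setup, under the assumptions that no user services are added (only brpc's builtin services), port 8000 is free, and the real start_brpc_server is replaced by an inline server start:

```cpp
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <brpc/server.h>

// Child: roughly what start_brpc_server does -- start a brpc::Server and
// block. Only builtin services are registered here, which is assumed to be
// enough for the port to be listened on.
static void run_child() {
    brpc::Server server;
    if (server.Start(8000, NULL) != 0) {   // port 8000 is an arbitrary choice
        _exit(1);
    }
    server.RunUntilAskedToQuit();          // blocks until the process is killed
    _exit(0);
}

int main() {
    // Parent: the monitoring logic -- fork a worker, wait for it to die,
    // then fork it again (step 2 above repeats step 1).
    for (;;) {
        const pid_t pid = fork();
        if (pid == 0) {
            run_child();                   // never returns
        }
        int status = 0;
        waitpid(pid, &status, 0);          // the child was killed externally
    }
}
```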

Expected behavior
After the child process starts, the port should be listened on normally.

Versions
OS: a custom system based on Linux kernel 3.10.0
Compiler: gcc 4.7
brpc: a version forked in 2019
protobuf:

Additional context/screenshots
(gdb) bt
#0 0x00007fdcb176042d in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x00007fdcb175bdcb in _L_lock_812 () from /lib64/libpthread.so.0
#2 0x00007fdcb175bc98 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x00007fdcafe9650f in lock (this=0x55920555bc90) at /test/src/brpc/src/butil/synchronization/lock.h:55
#4 lock_guard (__m=..., this=) at /usr/include/c++/4.8.2/mutex:414
#5 brpc::Acceptor::BeforeRecycle (this=0x55920555bc20, sock=0x55920ac626c0) at /test/src/brpc/src/brpc/acceptor.cpp:325
#6 0x00007fdcafebc4ea in brpc::Socket::OnRecycle (this=0x55920ac626c0) at /test/src/brpc/src/brpc/socket.cpp:1015
#7 0x00007fdcafebccad in Dereference (this=0x238) at /test/src/brpc/src/brpc/socket_inl.h:110
#8 brpc::Socket::ReleaseAdditionalReference (this=this@entry=0x55920ac626c0) at /test/src/brpc/src/brpc/socket.cpp:783
#9 0x00007fdcafebd1ee in brpc::Socket::SetFailed (this=this@entry=0x55920ac626c0, error_code=error_code@entry=9, error_fmt=error_fmt@entry=0x7fdcb00c66c0 "Fail to ResetFileDescriptor: %s")
at /test/src/brpc/src/brpc/socket.cpp:848
#10 0x00007fdcafebdcbd in brpc::Socket::Create (options=..., id=id@entry=0x55920555bc88) at /test/src/brpc/src/brpc/socket.cpp:667
#11 0x00007fdcafe96a30 in brpc::Acceptor::StartAccept (this=0x55920555bc20, listened_fd=listened_fd@entry=3, idle_timeout_sec=-1, ssl_ctx=)
at /test/src/brpc/src/brpc/acceptor.cpp:82
#12 0x00007fdcafd99cd7 in brpc::Server::StartInternal (this=this@entry=0x5592009cf080, ip=..., port_range=..., opt=opt@entry=0x0) at /test/src/brpc/src/brpc/server.cpp:919
#13 0x00007fdcafd9b020 in brpc::Server::Start (this=this@entry=0x5592009cf080, endpoint=..., opt=opt@entry=0x0) at /test/src/brpc/src/brpc/server.cpp:997
#14 0x00007fdcb58061af in test::start_brpc_server (this=this@entry=0x5592009cf040) at /test//src/test/test_manager.cpp:194
#15 0x00007fdcb580626a in test::start (this=this@entry=0x5592009cf040) at /test//src/test/test_manager.cpp:106
#16 0x00007fdcb58073da in test::run (this=this@entry=0x5592009cf000) at /test//src/test/test.cpp:189
#17 0x00007fdcb5807a1f in test::start_work_process (this=this@entry=0x5592009cf000) at /test//src/test/test.cpp:177
#18 0x00007fdcb5808257 in test::daemon_thread (arg=0x5592009cf000) at /test//src/test/test.cpp:80
#19 0x00007fdcb1759e25 in start_thread () from /lib64/libpthread.so.0
#20 0x00007fdcae6d834d in clone () from /lib64/libc.so.6
(gdb) f 5
#5 brpc::Acceptor::BeforeRecycle (this=0x55920555bc20, sock=0x55920ac626c0) at /test/src/brpc/src/brpc/acceptor.cpp:325
325 /test/src/brpc/src/brpc/acceptor.cpp: No such file or directory.
(gdb) p _map_mutex
$1 = {_native_handle = {__data = {__lock = 2, __count = 0, __owner = 88919, __nusers = 1, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}},
__size = "\002\000\000\000\000\000\000\000W[\001\000\001", '\000' <repeats 26 times>, __align = 2}}
(gdb) info thr 1
Id   Target Id                                   Frame
* 1  Thread 0x7fdc95e0f700 (LWP 88919) "test"    brpc::Acceptor::BeforeRecycle (this=0x55920555bc20, sock=0x55920ac626c0) at /test/src/brpc/src/brpc/acceptor.cpp:325
(gdb)

--- Program runtime logs ---
2022-06-01 10:54:08.367247 - info test-5d6f30c3 W0601 10:54:08.367146 186761 socket.cpp:1219] Fail to add fd=4 into epoll: Bad file descriptor
2022-06-01 10:54:08.367722 - info test-5d6f30c3 E0601 10:54:08.367461 186746 socket.cpp:589] Fail to add SocketId=455 into EventDispatcher, fd 3 ret -1 errno 9 reason Bad file descriptor: Bad file descriptor
2022-06-01 10:54:08.367728 - info test-5d6f30c3 E0601 10:54:08.367470 186746 socket.cpp:669] Fail to ResetFileDescriptor: Bad file descriptor

From the logs I have not yet found the reason why epoll_ctl failed; at the moment I can only see the deadlock that follows the epoll_ctl failure.
@JiaoZiLang
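
For reference, the re-entrant locking pattern implied by the backtrace can be reduced to the self-contained sketch below (class and method names are illustrative, not the actual brpc code): StartAccept holds _map_mutex while Socket::Create runs, and the failure path calls back into BeforeRecycle, which tries to take the same non-recursive mutex on the same thread.

```cpp
#include <cstdio>
#include <mutex>

// Illustrative reduction of the locking pattern from the backtrace above.
class AcceptorLike {
public:
    void StartAccept() {
        // Frame #11: StartAccept holds _map_mutex across Socket::Create.
        std::unique_lock<std::mutex> lock(_map_mutex);
        CreateSocket();
    }

private:
    void CreateSocket() {
        // Frames #10..#6: ResetFileDescriptor fails (EBADF in the logs),
        // SetFailed -> ReleaseAdditionalReference -> OnRecycle, which calls
        // back into BeforeRecycle on the same thread.
        BeforeRecycle();
    }

    void BeforeRecycle() {
        // Frame #5: tries to take the mutex that StartAccept still holds on
        // this very thread; with a default (non-recursive) mutex the thread
        // blocks on itself forever.
        std::lock_guard<std::mutex> guard(_map_mutex);
        std::printf("never reached\n");
    }

    std::mutex _map_mutex;
};

int main() {
    AcceptorLike acceptor;
    acceptor.StartAccept();   // hangs, mirroring __lll_lock_wait in frame #0
}
```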

wwbmmm (Contributor) commented Jun 6, 2022

The lock granularity on _map_mutex in Acceptor::StartAccept is too coarse; if the lock were released while Socket::Create runs, the deadlock should not occur.
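
A sketch of what this suggestion would look like (illustrative only; the follow-up comment below explains why brpc cannot actually drop the lock here): release _map_mutex before the call that may re-enter BeforeRecycle, then re-acquire it afterwards.

```cpp
#include <mutex>

// Illustrative only; per the referenced comment in acceptor.cpp, the real
// Socket::Create must run under the lock, so brpc did not take this route.
class AcceptorLike {
public:
    int StartAccept() {
        std::unique_lock<std::mutex> lock(_map_mutex);
        // ... prepare the internal socket map under the lock ...
        lock.unlock();                  // drop the lock before the re-entrant call
        const int rc = CreateSocket();  // may call BeforeRecycle() on failure
        lock.lock();                    // re-acquire to finish the bookkeeping
        return rc;
    }

    void BeforeRecycle() {
        // No self-deadlock: StartAccept is not holding the mutex here.
        std::lock_guard<std::mutex> guard(_map_mutex);
        // ... erase the socket from the map ...
    }

private:
    int CreateSocket() {
        BeforeRecycle();                // simulated failure path re-enters
        return -1;
    }

    std::mutex _map_mutex;
};

int main() {
    AcceptorLike acceptor;
    return acceptor.StartAccept() == -1 ? 0 : 1;  // completes without hanging
}
```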

wwbmmm (Contributor) commented Jun 10, 2022

> The lock granularity on _map_mutex in Acceptor::StartAccept is too coarse; if the lock were released while Socket::Create runs, the deadlock should not occur.

According to this code comment: https://github.com/apache/incubator-brpc/blob/master/src/brpc/acceptor.cpp#L77
the lock still needs to be held while Socket::Create runs, so that approach cannot be used.

I've switched to another approach; you can try this PR: #1791 @weingithub

weingithub (Author) commented:
> The lock granularity on _map_mutex in Acceptor::StartAccept is too coarse; if the lock were released while Socket::Create runs, the deadlock should not occur.
>
> According to this code comment: https://github.com/apache/incubator-brpc/blob/master/src/brpc/acceptor.cpp#L77 the lock still needs to be held while Socket::Create runs, so that approach cannot be used.
>
> I've switched to another approach; you can try this PR: #1791 @weingithub

Thanks for your help. I see the change modifies the failure path of Socket::Create, so the current deadlock will certainly be fixed. However, I'm not sure whether it might introduce new problems elsewhere; this interface seems to be called from quite a few places.

wwbmmm (Contributor) commented Jun 13, 2022

There are currently several places that inherit from SocketUser:

The logic of that PR is that the Create process no longer invokes the BeforeRecycle callback. In summary, the PR fixes two potential double frees and one deadlock; other than that it should have no other impact.
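
An illustrative sketch of the idea described here (not the actual PR #1791 diff): the user callback is only armed once creation has fully succeeded, so a failure inside Create cleans up without invoking BeforeRecycle and thus without re-entering the caller's lock.

```cpp
#include <cstdio>

// Illustrative sketch only -- type names are hypothetical stand-ins.
struct SocketUserLike {                  // stand-in for brpc::SocketUser
    virtual ~SocketUserLike() {}
    virtual void BeforeRecycle() = 0;
};

class SocketLike {
public:
    // Returns 0 on success; 'user' is attached only after creation succeeds.
    static int Create(SocketUserLike* user, bool simulate_failure, SocketLike** out) {
        SocketLike* s = new SocketLike;
        if (simulate_failure) {
            delete s;                    // plain cleanup, no BeforeRecycle callback
            return -1;
        }
        s->_user = user;                 // callback is armed from here on
        *out = s;
        return 0;
    }
    void SetFailed() {                   // failure after a successful Create
        if (_user) _user->BeforeRecycle();
        delete this;
    }
private:
    SocketUserLike* _user = nullptr;
};

struct AcceptorLike : public SocketUserLike {
    void BeforeRecycle() { std::printf("recycled a fully created socket\n"); }
};

int main() {
    AcceptorLike acceptor;
    SocketLike* s = nullptr;
    if (SocketLike::Create(&acceptor, /*simulate_failure=*/true, &s) != 0) {
        std::printf("Create failed without calling BeforeRecycle\n");
    }
    if (SocketLike::Create(&acceptor, /*simulate_failure=*/false, &s) == 0) {
        s->SetFailed();                  // now the callback runs as expected
    }
}
```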

wwbmmm added the label "bug (the code does not work as expected)" on Nov 23, 2022