
Suspected deadlock with jemalloc after overriding pthread_mutex_trylock #2726

Closed
fausturs opened this issue Aug 6, 2024 · 5 comments · Fixed by #2727

Comments

@fausturs
fausturs commented Aug 6, 2024

Describe the bug (描述bug)
We tried the latest brpc code, which includes this commit:
e0c9c44

This commit overrides the pthread_mutex_trylock symbol. We are also using jemalloc (5.2.1), and this appears to cause a deadlock.

Here is our stack trace:

(gdb) bt
#0  futex_wait (private=0, expected=1, futex_word=0x25a1b1c <bthread::init_sys_mutex_lock_once>) at ../sysdeps/nptl/futex-internal.h:141
#1  futex_wait_simple (private=0, expected=1, futex_word=0x25a1b1c <bthread::init_sys_mutex_lock_once>) at ../sysdeps/nptl/futex-internal.h:172
#2  __pthread_once_slow (once_control=0x25a1b1c <bthread::init_sys_mutex_lock_once>, init_routine=0x12cfb30 <bthread::init_sys_mutex_lock()>) at pthread_once.c:105
#3  0x00000000012cfc1c in bthread::first_sys_pthread_mutex_trylock (mutex=0x2f4e2a0 <init_lock+64>) at src/bthread/mutex.cpp:453
#4  0x00000000012d109c in bthread::internal::pthread_mutex_trylock_internal (mutex=0x25a1b1c <bthread::init_sys_mutex_lock_once>) at src/bthread/mutex.cpp:583
#5  bthread::internal::pthread_mutex_trylock_impl<pthread_mutex_t> (mutex=0x25a1b1c <bthread::init_sys_mutex_lock_once>) at src/bthread/mutex.cpp:664
#6  bthread::pthread_mutex_trylock_impl (mutex=0x25a1b1c <bthread::init_sys_mutex_lock_once>) at src/bthread/mutex.cpp:717
#7  pthread_mutex_trylock (__mutex=0x25a1b1c <bthread::init_sys_mutex_lock_once>) at src/bthread/mutex.cpp:939
#8  0x00000000016905a9 in malloc_mutex_trylock_final (mutex=<optimized out>) at /root/.conan/data/jemalloc/5.2.1/_/_/build/55d721bf422a34e3db4f17a58c2f8d839c0b6932/src/include/jemalloc/internal/mutex.h:161
#9  malloc_mutex_lock (tsdn=0x0, mutex=<optimized out>) at /root/.conan/data/jemalloc/5.2.1/_/_/build/55d721bf422a34e3db4f17a58c2f8d839c0b6932/src/include/jemalloc/internal/mutex.h:220
#10 malloc_init_hard () at /root/.conan/data/jemalloc/5.2.1/_/_/build/55d721bf422a34e3db4f17a58c2f8d839c0b6932/src/src/jemalloc.c:1739
#11 malloc_init () at /root/.conan/data/jemalloc/5.2.1/_/_/build/55d721bf422a34e3db4f17a58c2f8d839c0b6932/src/src/jemalloc.c:223
#12 imalloc_init_check (sopts=<optimized out>, dopts=<optimized out>) at /root/.conan/data/jemalloc/5.2.1/_/_/build/55d721bf422a34e3db4f17a58c2f8d839c0b6932/src/src/jemalloc.c:2229
#13 imalloc (sopts=<optimized out>, dopts=<optimized out>) at /root/.conan/data/jemalloc/5.2.1/_/_/build/55d721bf422a34e3db4f17a58c2f8d839c0b6932/src/src/jemalloc.c:2260
#14 calloc (num=num@entry=1, size=size@entry=32) at /root/.conan/data/jemalloc/5.2.1/_/_/build/55d721bf422a34e3db4f17a58c2f8d839c0b6932/src/src/jemalloc.c:2494
#15 0x00007f8b0cbcec05 in _dlerror_run (operate=operate@entry=0x7f8b0cbce490 <dlsym_doit>, args=args@entry=0x7ffde1e4d880) at dlerror.c:148
#16 0x00007f8b0cbce525 in __dlsym (handle=<optimized out>, name=0x1b97c44 "pthread_mutex_trylock") at dlsym.c:70
#17 0x00000000012cfbbc in bthread::init_sys_mutex_lock () at src/bthread/mutex.cpp:435
#18 0x00007f8b0cd3447f in __pthread_once_slow (once_control=0x25a1b1c <bthread::init_sys_mutex_lock_once>, init_routine=0x12cfb30 <bthread::init_sys_mutex_lock()>) at pthread_once.c:116
#19 0x00000000012cfc1c in bthread::first_sys_pthread_mutex_trylock (mutex=0x2f4e2a0 <init_lock+64>) at src/bthread/mutex.cpp:453
#20 0x00000000012d109c in bthread::internal::pthread_mutex_trylock_internal (mutex=0x25a1b1c <bthread::init_sys_mutex_lock_once>) at src/bthread/mutex.cpp:583
#21 bthread::internal::pthread_mutex_trylock_impl<pthread_mutex_t> (mutex=0x25a1b1c <bthread::init_sys_mutex_lock_once>) at src/bthread/mutex.cpp:664
#22 bthread::pthread_mutex_trylock_impl (mutex=0x25a1b1c <bthread::init_sys_mutex_lock_once>) at src/bthread/mutex.cpp:717
#23 pthread_mutex_trylock (__mutex=0x25a1b1c <bthread::init_sys_mutex_lock_once>) at src/bthread/mutex.cpp:939
#24 0x000000000168e302 in malloc_mutex_trylock_final (mutex=<optimized out>) at /root/.conan/data/jemalloc/5.2.1/_/_/build/55d721bf422a34e3db4f17a58c2f8d839c0b6932/src/include/jemalloc/internal/mutex.h:161
#25 malloc_mutex_lock (tsdn=0x0, mutex=<optimized out>) at /root/.conan/data/jemalloc/5.2.1/_/_/build/55d721bf422a34e3db4f17a58c2f8d839c0b6932/src/include/jemalloc/internal/mutex.h:220
#26 malloc_init_hard () at /root/.conan/data/jemalloc/5.2.1/_/_/build/55d721bf422a34e3db4f17a58c2f8d839c0b6932/src/src/jemalloc.c:1739
#27 malloc_init () at /root/.conan/data/jemalloc/5.2.1/_/_/build/55d721bf422a34e3db4f17a58c2f8d839c0b6932/src/src/jemalloc.c:223
#28 imalloc_init_check (sopts=<optimized out>, dopts=<optimized out>) at /root/.conan/data/jemalloc/5.2.1/_/_/build/55d721bf422a34e3db4f17a58c2f8d839c0b6932/src/src/jemalloc.c:2229
#29 imalloc (sopts=<optimized out>, dopts=<optimized out>) at /root/.conan/data/jemalloc/5.2.1/_/_/build/55d721bf422a34e3db4f17a58c2f8d839c0b6932/src/src/jemalloc.c:2260
#30 je_malloc_default (size=72704) at /root/.conan/data/jemalloc/5.2.1/_/_/build/55d721bf422a34e3db4f17a58c2f8d839c0b6932/src/src/jemalloc.c:2289
#31 0x00007f8b0cdeba9a in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#32 0x00007f8b0cf4cb8a in ?? () from /lib64/ld-linux-x86-64.so.2
#33 0x00007f8b0cf4cc91 in ?? () from /lib64/ld-linux-x86-64.so.2
#34 0x00007f8b0cf3c13a in ?? () from /lib64/ld-linux-x86-64.so.2
#35 0x0000000000000007 in ?? ()
#36 0x00007ffde1e4f62c in ?? ()
#37 0x00007ffde1e4f64c in ?? ()
#38 0x00007ffde1e4f689 in ?? ()
#39 0x00007ffde1e4f6b4 in ?? ()
#40 0x00007ffde1e4f6cd in ?? ()
#41 0x00007ffde1e4f6e3 in ?? ()
#42 0x00007ffde1e4f701 in ?? ()
#43 0x0000000000000000 in ?? ()

As the trace shows:

In frame #24, jemalloc calls pthread_mutex_trylock, which lands in brpc's hook at frame #23 and eventually reaches __dlsym at frame #16; frame #15 then runs _dlerror_run.
Frame #14 calls calloc, re-entering jemalloc, and frame #8 reaches the same trylock site as frame #24. At that point the process should be deadlocked.

Judging from a comment in the code (https://github.com/apache/brpc/blob/b4d4acb7cd9a677039f662f18df37d4be7172ed3/src/bthread/mutex.cpp#L390),
this looks like similar behavior.
Is there a way to fix this?

To Reproduce (复现方法)

Expected behavior (期望行为)

Versions (各种版本)
OS:
Compiler:
brpc:
protobuf:

Additional context/screenshots (更多上下文/截图)

This is where line 161 sits in the jemalloc source we use.
(screenshot of jemalloc include/jemalloc/internal/mutex.h:161)

@chenBright
Contributor

chenBright commented Aug 6, 2024

Try #2727 and see whether the problem persists.

@fausturs
Author

fausturs commented Aug 7, 2024

Try #2727 and see whether the problem persists.
@chenBright

First of all, thank you for your reply and support.

https://github.com/fausturs/test-brpc-jemalloc
Here is a small demo that reproduces this stack trace.

Judging from the proposed fix, the macro NO_PTHREAD_MUTEX_HOOK controls whether the pthread symbols are overridden. When the pthread mutex symbols are not overridden at all, I expect the problem in this issue to go away.

I am also a bit curious: why does brpc hook the pthread mutex symbols in the first place? What is the benefit? Is it to collect extra statistics before actually calling into pthread? And if so, when I define NO_PTHREAD_MUTEX_HOOK, could pthread mutex operations be faster?

@chenBright
Contributor

chenBright commented Aug 7, 2024

The fix mainly hooks via __dl_sym, so that _dlerror_run never allocates memory and the malloc library cannot deadlock. It should also work without the NO_PTHREAD_MUTEX_HOOK macro. NO_PTHREAD_MUTEX_HOOK is largely unrelated to this issue; it mainly addresses the case where the dynamic-library load order cannot be adjusted and the pthread_mutex_* symbols cannot be found.

Why does brpc hack the pthread mutex symbols?

Hooking pthread mutex is needed to support the contention profiler and the detection of potential worker deadlocks.

So when I define NO_PTHREAD_MUTEX_HOOK, could pthread mutex operations be faster?

With the contention profiler disabled, the hook only adds one branch plus a thread-local increment/decrement, so there should be no measurable performance difference.
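As a rough illustration of why the disabled-profiler cost is negligible, the fast path amounts to something like the hypothetical sketch below (names are invented; brpc's real hook lives in src/bthread/mutex.cpp): the real trylock, one branch, and a thread-local counter update.

```c
// Hypothetical sketch of a mutex hook's fast path when the contention
// profiler is off. Not brpc's actual code.
#include <pthread.h>
#include <stdbool.h>

static _Thread_local int tls_held_lock_count = 0;   // thread-local bookkeeping
static bool g_contention_profiler_enabled = false;  // assumed global switch

int hooked_trylock(pthread_mutex_t *m) {
    int rc = pthread_mutex_trylock(m);  // the real work
    if (rc == 0) {
        ++tls_held_lock_count;          // cheap thread-local increment
        if (g_contention_profiler_enabled) {
            /* record a contention sample here (omitted) */
        }
    }
    return rc;
}
```

With the flag false, the extra work per call is one predictable branch and one thread-local add, which is why removing the hook entirely should not make trylock measurably faster.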

@fausturs
Author

fausturs commented Aug 7, 2024

Using __dl_sym instead of dlsym in my demo above, the program indeed behaves as expected now.

Thanks again for your reply and support.

@chenBright
Contributor

Good to hear, thanks for the feedback!

yiguolei pushed a commit to apache/doris that referenced this issue Oct 15, 2024
…ock (#41891)

BRPC's contention profiler hooks pthread mutex, which may deadlock when
used with jemalloc.
This PR removes the pthread mutex hook and disables the BRPC contention profiler.


![image](https://github.com/user-attachments/assets/62ccc04c-718a-43db-8354-b1bbc0565958)

similar issue: apache/brpc#2726
reference fix: apache/brpc#2727
xinyiZzz added a commit to xinyiZzz/incubator-doris that referenced this issue Oct 16, 2024
…ock (apache#41891)

BRPC's contention profiler hooks pthread mutex, which may deadlock when
used with jemalloc.
This PR removes the pthread mutex hook and disables the BRPC contention profiler.

![image](https://github.com/user-attachments/assets/62ccc04c-718a-43db-8354-b1bbc0565958)

similar issue: apache/brpc#2726
reference fix: apache/brpc#2727