Swoole 4.5.11: service stops responding to requests #4716

Closed
Frank-JY opened this issue May 18, 2022 · 42 comments

@Frank-JY

Frank-JY commented May 18, 2022

Please answer these questions before submitting your issue.

  1. What did you do? If possible, provide a simple script for reproducing the error.
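    The Swoole version is 4.5.11, and the service has preemptive coroutine scheduling enabled:
Swoole\Coroutine::set([
    'enable_preemptive_scheduler' => 1 // enable preemptive coroutine scheduling; the maximum execution time of a coroutine is 10 ms
]);

After the service has been running for a while, requests stop getting through and receive no response. The master and worker processes are all still alive and the port is still in use, but requests fail.
Investigation showed that the workers are stuck on a lock.
[screenshot]
After killing one worker, requests succeed some of the time.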

  2. What did you expect to see?

I would like to understand why the workers get locked and how to fix it.
The Swoole service should run normally and respond to requests.

  3. What did you see instead?

The workers are deadlocked and requests get no response.

  4. What version of Swoole are you using (show your php --ri swoole)?
swoole

Swoole => enabled
Author => Swoole Team <team@swoole.com>
Version => 4.5.11
Built => Jan 23 2021 19:07:20
coroutine => enabled
epoll => enabled
eventfd => enabled
signalfd => enabled
cpu_affinity => enabled
spinlock => enabled
rwlock => enabled
openssl => OpenSSL 1.0.1j 15 Oct 2014
pcre => enabled
zlib => 1.2.7
mutex_timedlock => enabled
pthread_barrier => enabled
futex => enabled
async_redis => enabled

Directive => Local Value => Master Value
swoole.enable_coroutine => On => On
swoole.enable_library => On => On
swoole.enable_preemptive_scheduler => Off => Off
swoole.display_errors => On => On
swoole.use_shortname => On => On
swoole.unixsock_buffer_size => 8388608 => 8388608
  5. What is your machine environment (show your uname -a & php -v & gcc -v)?
[root@qbilling-price-idc-0 /data/log/billing]# uname -a
Linux qbilling-price-idc-0 5.4.119-1-tlinux4-0009-eks #1 SMP Wed Feb 16 14:03:22 CST 2022 x86_64 x86_64 x86_64 GNU/Linux
[root@qbilling-price-idc-0 /data/log/billing]# 
[root@qbilling-price-idc-0 /data/log/billing]# gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-redhat-linux/4.8.5/lto-wrapper
Target: x86_64-redhat-linux
Configured with: ../configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-bootstrap --enable-shared --enable-threads=posix --enable-checking=release --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-linker-build-id --with-linker-hash-style=gnu --enable-languages=c,c++,objc,obj-c++,java,fortran,ada,go,lto --enable-plugin --enable-initfini-array --disable-libgcj --with-isl=/builddir/build/BUILD/gcc-4.8.5-20150702/obj-x86_64-redhat-linux/isl-install --with-cloog=/builddir/build/BUILD/gcc-4.8.5-20150702/obj-x86_64-redhat-linux/cloog-install --enable-gnu-indirect-function --with-tune=generic --with-arch_32=x86-64 --build=x86_64-redhat-linux
Thread model: posix
gcc version 4.8.5 20150623 (Red Hat 4.8.5-39) (GCC) 
@matyhtf
Member

matyhtf commented May 18, 2022

Please use strace or gdb to trace the state of the process.

@Frank-JY
Author

> Please use strace or gdb to trace the state of the process.

[root/]# strace -p 200162
Process 200162 attached
flock(39, LOCK_EX

@Frank-JY
Author

gstack gives some more information:
1. std::condition_variable::wait
[screenshot]
2. flock
[screenshot]

@NathanFreeman
Member

NathanFreeman commented May 18, 2022

It looks like the deadlock is caused by flock in the skywalking-php-sdk extension.

@Frank-JY
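Author

> It looks like the deadlock is caused by flock in the skywalking-php-sdk extension.

When my Swoole version was 4.0, the skywalking-php-sdk I compiled worked fine. After I upgraded to 4.5 and compiled the matching version of skywalking, this problem appeared.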

@NathanFreeman
Member

NathanFreeman commented May 18, 2022

Is there a download URL for skywalking-php-sdk?

@Frank-JY
Author

Shall I send you a copy of the source code?

@NathanFreeman
Member

My email address is mariasocute@163.com. If the source contains private information, please remember to remove it.

@Frank-JY
Author

> My email address is mariasocute@163.com. If the source contains private information, please remember to remove it.

Sent, thanks.

@Frank-JY
Author

Frank-JY commented May 18, 2022

[root]# gstack 1647

Thread 6 (Thread 0x7fba19088700 (LWP 1654)):
#0 0x00007fba27b6a8ed in nanosleep () from /lib64/libc.so.6
#1 0x00007fba27b9b1c4 in usleep () from /lib64/libc.so.6
#2 0x00007fba23f04d6e in sleep_for<long, std::ratio<1l, 1000l> > (__rtime=...) at /usr/include/c++/4.8.2/thread:281
#3 operator() (__closure=) at /data/release/jasper/swoole-src-4.5.11/ext-src/swoole_coroutine.cc:365
#4 _M_invoke<> (this=) at /usr/include/c++/4.8.2/functional:1732
#5 operator() (this=) at /usr/include/c++/4.8.2/functional:1720
#6 std::thread::_Impl<std::_Bind_simple<swoole::PHPCoroutine::interrupt_thread_start()::__lambda1()> >::_M_run(void) (this=) at /usr/include/c++/4.8.2/thread:115
#7 0x00007fba23bc6340 in ?? () from /lib64/libstdc++.so.6
#8 0x00007fba2768dea5 in start_thread () from /lib64/libpthread.so.0
#9 0x00007fba27ba39fd in clone () from /lib64/libc.so.6
Thread 5 (Thread 0x7fba17c86700 (LWP 1779)):
#0 0x00007fba27691a35 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00007fba23bc2aec in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /lib64/libstdc++.so.6
#2 0x00007fba23f9b0de in operator() (__closure=0x22f1990) at /data/release/jasper/swoole-src-4.5.11/src/os/async_thread.cc:298
#3 _M_invoke<> (this=0x22f1990) at /usr/include/c++/4.8.2/functional:1732
#4 operator() (this=0x22f1990) at /usr/include/c++/4.8.2/functional:1720
#5 std::thread::_Impl<std::_Bind_simple<swoole::async::ThreadPool::create_thread(bool)::__lambda0()> >::_M_run(void) (this=0x22f1978) at /usr/include/c++/4.8.2/thread:115
#6 0x00007fba23bc6340 in ?? () from /lib64/libstdc++.so.6
#7 0x00007fba2768dea5 in start_thread () from /lib64/libpthread.so.0
#8 0x00007fba27ba39fd in clone () from /lib64/libc.so.6
Thread 4 (Thread 0x7fba17485700 (LWP 1780)):
#0 0x00007fba27691a35 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00007fba23bc2aec in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /lib64/libstdc++.so.6
#2 0x00007fba23f9b0de in operator() (__closure=0x22f1cd0) at /data/release/jasper/swoole-src-4.5.11/src/os/async_thread.cc:298
#3 _M_invoke<> (this=0x22f1cd0) at /usr/include/c++/4.8.2/functional:1732
#4 operator() (this=0x22f1cd0) at /usr/include/c++/4.8.2/functional:1720
#5 std::thread::_Impl<std::_Bind_simple<swoole::async::ThreadPool::create_thread(bool)::__lambda0()> >::_M_run(void) (this=0x22f1cb8) at /usr/include/c++/4.8.2/thread:115
#6 0x00007fba23bc6340 in ?? () from /lib64/libstdc++.so.6
#7 0x00007fba2768dea5 in start_thread () from /lib64/libpthread.so.0
#8 0x00007fba27ba39fd in clone () from /lib64/libc.so.6
Thread 3 (Thread 0x7fba16c84700 (LWP 1781)):
#0 0x00007fba27691a35 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00007fba23bc2aec in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /lib64/libstdc++.so.6
#2 0x00007fba23f9b0de in operator() (__closure=0x22f2010) at /data/release/jasper/swoole-src-4.5.11/src/os/async_thread.cc:298
#3 _M_invoke<> (this=0x22f2010) at /usr/include/c++/4.8.2/functional:1732
#4 operator() (this=0x22f2010) at /usr/include/c++/4.8.2/functional:1720
#5 std::thread::_Impl<std::_Bind_simple<swoole::async::ThreadPool::create_thread(bool)::__lambda0()> >::_M_run(void) (this=0x22f1ff8) at /usr/include/c++/4.8.2/thread:115
#6 0x00007fba23bc6340 in ?? () from /lib64/libstdc++.so.6
#7 0x00007fba2768dea5 in start_thread () from /lib64/libpthread.so.0
#8 0x00007fba27ba39fd in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x7fba16483700 (LWP 1782)):
#0 0x00007fba27691a35 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00007fba23bc2aec in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /lib64/libstdc++.so.6
#2 0x00007fba23f9b0de in operator() (__closure=0x22f2350) at /data/release/jasper/swoole-src-4.5.11/src/os/async_thread.cc:298
#3 _M_invoke<> (this=0x22f2350) at /usr/include/c++/4.8.2/functional:1732
#4 operator() (this=0x22f2350) at /usr/include/c++/4.8.2/functional:1720
#5 std::thread::_Impl<std::_Bind_simple<swoole::async::ThreadPool::create_thread(bool)::__lambda0()> >::_M_run(void) (this=0x22f2338) at /usr/include/c++/4.8.2/thread:115
#6 0x00007fba23bc6340 in ?? () from /lib64/libstdc++.so.6
#7 0x00007fba2768dea5 in start_thread () from /lib64/libpthread.so.0
#8 0x00007fba27ba39fd in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7fba29dc7840 (LWP 1647)):
#0 0x00007fba27b94f97 in flock () from /lib64/libc.so.6
#1 0x00000000007fa4ad in php_stdiop_set_option (stream=0x7fba159c51c0, option=, value=2, ptrparam=0x0) at /data/php/php-7.2.1/main/streams/plain_wrapper.c:680
#2 0x00000000007f611d in _php_stream_set_option (stream=stream@entry=0x7fba159c51c0, option=option@entry=6, value=2, ptrparam=ptrparam@entry=0x0) at /data/php/php-7.2.1/main/streams/streams.c:1322
#3 0x0000000000749e81 in zif_flock (execute_data=, return_value=0x7fb9b57aee40) at /data/php/php-7.2.1/ext/standard/file.c:362
#4 0x00000000008db8ca in ZEND_DO_ICALL_SPEC_RETVAL_USED_HANDLER () at /data/php/php-7.2.1/Zend/zend_vm_execute.h:617
#5 execute_ex (ex=0x22) at /data/php/php-7.2.1/Zend/zend_vm_execute.h:59737
#6 0x00007fba23f07020 in swoole::PHPCoroutine::main_func (arg=) at /data/release/jasper/swoole-src-4.5.11/ext-src/swoole_coroutine.cc:658
#7 0x00007fba23f7b8e2 in operator() (__args#0=, this=0x20a5fc0) at /usr/include/c++/4.8.2/functional:2471
#8 swoole::coroutine::Context::context_func (arg=0x20a5fc0) at /data/release/jasper/swoole-src-4.5.11/src/coroutine/context.cc:136
#9 0x00007fba23fd87c1 in make_fcontext () at /data/release/jasper/swoole-src-4.5.11/thirdparty/boost/asm/make_x86_64_sysv_elf_gas.S:64
#10 0x0000000000200010 in ?? ()
#11 0x0000000000259a71 in ?? ()
#12 0x000000000280e1e0 in ?? ()
#13 0x00007fba27e6cf88 in main_arena () from /lib64/libc.so.6

----------
After removing skywalking, the workers still get locked.

@NathanFreeman
Member

NathanFreeman commented May 18, 2022

The cause here is that sleep() was called after a flock(), which led to a deadlock.
These articles are worth reading:
https://course.swoole-cloud.com/article/2
https://wiki.swoole.com/#/memory/lock?id=错误示例

@Frank-JY
Author

My service does not use Swoole\Lock(), but it does contain logic that starts coroutines, roughly like this:
Coroutine::create(function() use ($param) {
    Coroutine::sleep(0.001);
    // other logic
});

@NathanFreeman
Member

It is not caused by Swoole\Lock(); it is caused by PHP's built-in flock() function.

@Frank-JY
Author

> It is not caused by Swoole\Lock(); it is caused by PHP's built-in flock() function.

Is it caused by flock + sleep in the log-writing code?

@Frank-JY
Author

'enable_preemptive_scheduler' => 1
Could it be that, because I enabled preemptive scheduling, the CPU switched to another coroutine right after the log writer called flock?

@NathanFreeman
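Member

No. Preemptive scheduling exists to keep a coroutine stuck in a busy loop from monopolizing the CPU.
flock() and sleep() deadlock like this: after flock on the file followed by sleep, process A yields its coroutine. Processes B and C then try to flock() the same file, but process A still holds the lock, so they block. Meanwhile process A, because it runs coroutines, picks up a new request and tries to lock the same file again, so process A blocks as well.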
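
The last step of that explanation, a process blocking on a lock it already holds, can be reproduced without Swoole. The sketch below is only an illustration: it uses two file handles in one plain PHP process to stand in for two coroutines in one worker, assumes Linux (where PHP's flock() maps to flock(2) and each fopen() gets its own lock owner), and uses a made-up scratch file path.

```php
<?php
// Hypothetical scratch file, used only for this illustration.
$path = sys_get_temp_dir() . '/flock_demo_' . getmypid() . '.log';

// Handle 1 stands in for the coroutine that takes the lock and is then switched out.
$first = fopen($path, 'c');
$firstLocked = flock($first, LOCK_EX);

// Handle 2 stands in for another coroutine in the same worker logging to the same file.
// LOCK_NB makes flock() return instead of blocking so this script can finish;
// a plain LOCK_EX here would block the whole process, as in the strace output.
$second = fopen($path, 'c');
$secondLocked = flock($second, LOCK_EX | LOCK_NB, $wouldBlock);

// Once the first handle unlocks, the second one can take the lock.
flock($first, LOCK_UN);
$lockedAfterRelease = flock($second, LOCK_EX | LOCK_NB);

var_dump($firstLocked, $secondLocked, $lockedAfterRelease);

// Clean up the scratch file.
flock($second, LOCK_UN);
fclose($first);
fclose($second);
unlink($path);
```

On Linux I would expect this to print true, false, true: the second handle cannot take the lock while the first still holds it, even though both belong to the same process. In a worker, the coroutine that holds the lock can only release it if it gets to run again, which it cannot once the process is blocked inside flock().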

@Frank-JY
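Author

> No. Preemptive scheduling exists to keep a coroutine stuck in a busy loop from monopolizing the CPU. flock() and sleep() deadlock like this: after flock on the file followed by sleep, process A yields its coroutine. Processes B and C then try to flock() the same file, but process A still holds the lock, so they block. Meanwhile process A, because it runs coroutines, picks up a new request and tries to lock the same file again, so process A blocks as well.

In my service only log writing uses flock, and nothing changed apart from upgrading Swoole from 4.0 to 4.5. Could this be a version problem?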

@Frank-JY
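Author

> No. Preemptive scheduling exists to keep a coroutine stuck in a busy loop from monopolizing the CPU. flock() and sleep() deadlock like this: after flock on the file followed by sleep, process A yields its coroutine. Processes B and C then try to flock() the same file, but process A still holds the lock, so they block. Meanwhile process A, because it runs coroutines, picks up a new request and tries to lock the same file again, so process A blocks as well.

1. Process A calls flock to write a log entry.
2. enable_preemptive_scheduler suspends the coroutine in process A.
3. Processes B and C try to flock() the same file.
4. Process A still holds the lock.
----------
Could enable_preemptive_scheduler be what causes the deadlock?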

@NathanFreeman
Member

NathanFreeman commented May 18, 2022

We need to look at the code to see whether it contains sleep() + flock(). If there is no sleep(), then it may be that enable_preemptive_scheduler made the coroutine give up the CPU.

@Frank-JY
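Author

> We need to look at the code to see whether it contains sleep() + flock(). If there is no sleep(), then it may be that enable_preemptive_scheduler made the coroutine give up the CPU.

I have confirmed there is no sleep() + flock(). Doesn't that mean enable_preemptive_scheduler can trigger this problem quite easily? Is there a way to solve it?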

@NathanFreeman
Member

If the coroutine code contains no CPU-intensive work, enable_preemptive_scheduler can be left off.

@Frank-JY
Author

> If the coroutine code contains no CPU-intensive work, enable_preemptive_scheduler can be left off.

There is CPU-intensive code, which is why we upgraded Swoole and adopted enable_preemptive_scheduler.

@NathanFreeman
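Member

NathanFreeman commented May 19, 2022

Most likely some coroutine held the lock for too long without releasing it, so the threads in the thread pool were all suspended on flock() and the processes ended up deadlocked.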

@Frank-JY
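Author

Frank-JY commented May 19, 2022

> Most likely some coroutine held the lock for too long without releasing it, so the threads in the thread pool were all suspended on flock() and the processes ended up deadlocked.

[screenshot]

So far it does look like the flock in the logging code is causing the problem. I am verifying whether the service returns to normal with enable_preemptive_scheduler disabled.
It also seems that, since logging components are so widely used, combining one with enable_preemptive_scheduler carries a fairly high risk of deadlock.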

@Frank-JY
Author

If I still want to use enable_preemptive_scheduler, is there a way to solve this?

@NathanFreeman
Member

See whether you can release the lock immediately after the log entry is written.
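
A minimal sketch of that idea in plain PHP, assuming a helper of our own (appendLogLine is a hypothetical name, not part of log4php or Swoole): take the lock, write one line, and release the lock in a finally block so it is dropped even if the write fails.

```php
<?php
// Hypothetical helper for illustration; not part of log4php or Swoole.
function appendLogLine(string $path, string $line): bool
{
    $handle = fopen($path, 'ab');
    if ($handle === false) {
        return false;
    }
    try {
        if (!flock($handle, LOCK_EX)) {
            return false;
        }
        try {
            // Keep the locked region as small as possible: one write, one flush.
            $written = fwrite($handle, $line . PHP_EOL);
            fflush($handle);
            return $written !== false;
        } finally {
            // Always release the lock, even if the write fails.
            flock($handle, LOCK_UN);
        }
    } finally {
        fclose($handle);
    }
}
```

This only narrows the window in which the lock is held. With enable_preemptive_scheduler on, a switch could presumably still land between flock() and the unlock, so it reduces the risk rather than removing it.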

@Frank-JY
Author

> See whether you can release the lock immediately after the log entry is written.

[screenshot]

This is the base class of log4php.

@NathanFreeman
Member

See whether you can hand the log-writing work over to a task process.
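
A rough sketch of what that could look like, assuming an HTTP server; the host, port, log path, and message format are placeholders, and the real service would call its own logging library inside the task callback. Task processes are synchronous unless task_enable_coroutine is turned on, so the flock taken during the write is not interleaved with other coroutines in the request workers.

```php
<?php
use Swoole\Http\Request;
use Swoole\Http\Response;
use Swoole\Http\Server;

// Host, port and log path are placeholders for illustration.
$server = new Server('127.0.0.1', 9501);
$server->set([
    'worker_num'      => 4,
    'task_worker_num' => 1, // a single task process serializes all log writes
]);

$server->on('request', function (Request $request, Response $response) use ($server) {
    // Hand the log line to the task process instead of calling flock() in the worker.
    $server->task(['line' => 'handled ' . $request->server['request_uri']]);
    $response->end("ok\n");
});

$server->on('task', function (Server $server, int $taskId, int $srcWorkerId, $data) {
    // Runs in the task process; LOCK_EX is taken and released inside this one call.
    file_put_contents('/tmp/app.log', $data['line'] . PHP_EOL, FILE_APPEND | LOCK_EX);
});

$server->on('finish', function (Server $server, int $taskId, $data) {
    // Nothing to do: the task returns no result.
});

$server->start();
```

The trade-off is that log lines are written asynchronously, so a worker crash can lose entries that were queued but not yet written.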

@Frank-JY
Author

> See whether you can hand the log-writing work over to a task process.

Separately, Swoole 4.5.11 produces the warning below. Could you take a look?
WARNING swManager_check_exit_status: worker#4 abnormal exit, status=0, signal=11

@NathanFreeman
Member

https://wiki.swoole.com/#/other/issue?id=关于段错误核心转储
Please follow the steps there and post the key information.

@Frank-JY
Author

(gdb)
#0 0x00000000008872b0 in zend_assign_to_variable (value_type=1 '\001', value=0x7f7df1932910, variable_ptr=0x7f7d9cf98680) at /data/php/php-7.2.1/Zend/zend_execute.h:81
#1 ZEND_ASSIGN_SPEC_CV_CONST_RETVAL_UNUSED_HANDLER () at /data/php/php-7.2.1/Zend/zend_vm_execute.h:37342
#2 0x00000000008df0f9 in execute_ex (ex=0x7f7de39401f8) at /data/php/php-7.2.1/Zend/zend_vm_execute.h:62284
#3 0x00007f7dfe9092af in php_coro_create (php_arg=) at /data/extension/swoole/4.0.0/swoole-src-4.0.0-rc1/swoole_coroutine.cc:175
#4 0x00007f7dfe978b0b in swoole::Context::context_func (arg=0x2a335f0) at /data/extension/swoole/4.0.0/swoole-src-4.0.0-rc1/src/coroutine/context.cc:66
#5 0x00007f7dfe9a07d1 in make_fcontext () at /data/extension/swoole/4.0.0/swoole-src-4.0.0-rc1/thirdparty/boost/asm/make_x86_64_sysv_elf_gas.S:64
#6 0x0000000000200010 in ?? ()
#7 0x0000000000000021 in ?? ()
#8 0x0000000000000000 in ?? ()

@Frank-JY
Author

Frank-JY commented May 19, 2022

(gdb) f 0
#0 0x00000000008872b0 in zend_assign_to_variable (value_type=1 '\001', value=0x7f7df1932910, variable_ptr=0x7f7d9cf98680) at /data/php/php-7.2.1/Zend/zend_execute.h:81
81 /data/php/php-7.2.1/Zend/zend_execute.h: No such file or directory.
(gdb) f 1
#1 ZEND_ASSIGN_SPEC_CV_CONST_RETVAL_UNUSED_HANDLER () at /data/php/php-7.2.1/Zend/zend_vm_execute.h:37342
37342 /data/php/php-7.2.1/Zend/zend_vm_execute.h: No such file or directory.
(gdb) f 2
#2 0x00000000008df0f9 in execute_ex (ex=0x7f7de39401f8) at /data/php/php-7.2.1/Zend/zend_vm_execute.h:62284
62284 in /data/php/php-7.2.1/Zend/zend_vm_execute.h
(gdb) f 3
#3 0x00007f7dfe9092af in php_coro_create (php_arg=) at /data/extension/swoole/4.0.0/swoole-src-4.0.0-rc1/swoole_coroutine.cc:175
175 /data/extension/swoole/4.0.0/swoole-src-4.0.0-rc1/swoole_coroutine.cc: No such file or directory.
(gdb) f 4
#4 0x00007f7dfe978b0b in swoole::Context::context_func (arg=0x2a335f0) at /data/extension/swoole/4.0.0/swoole-src-4.0.0-rc1/src/coroutine/context.cc:66
66 /data/extension/swoole/4.0.0/swoole-src-4.0.0-rc1/src/coroutine/context.cc: No such file or directory.
(gdb) f 5
#5 0x00007f7dfe9a07d1 in make_fcontext () at /data/extension/swoole/4.0.0/swoole-src-4.0.0-rc1/thirdparty/boost/asm/make_x86_64_sysv_elf_gas.S:64
64 /data/extension/swoole/4.0.0/swoole-src-4.0.0-rc1/thirdparty/boost/asm/make_x86_64_sysv_elf_gas.S: No such file or directory.

@Frank-JY
Author

[screenshot]

@NathanFreeman
Member

It looks as though Swoole was not rebuilt after a make clean: the installed version is 4.5.11, but the backtrace shows 4.0.0 paths.

@Frank-JY
Author

Sorry, this is indeed Swoole 4.0.0, but the error still occurs. What usually causes this kind of error?
[screenshot]

@NathanFreeman
Member

Has the PHP version been updated?

@Frank-JY
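Author

> Has the PHP version been updated?

Whenever a coroutine is started inside a request handler, the worker exits. The code below is an example. What is the cause?
Coroutine::create(function() use ($pid, $conf) {
    Coroutine::sleep(0.001);
    // processing logic
});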

@Frank-JY
Author

[screenshot]

Is the error caused by starting the coroutine?

@NathanFreeman
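Member

> [screenshot]
>
> Is the error caused by starting the coroutine?

This still looks like Swoole 4.0.0. Is Swoole 4.0.0 what you are running now, and is that what triggers this problem?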

@matyhtf
Member

matyhtf commented May 30, 2022

Please stop using version 4.0.0 and update to the latest release, 4.8.10.

@Frank-JY
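Author

> Please stop using version 4.0.0 and update to the latest release, 4.8.10.

Upgrading to the new version does fix the abnormal process exits. Separately, the deadlock caused by enable_preemptive_scheduler combined with flock in log writing (the logger comes from a shared library, so we will not make major changes to it) is fairly easy to hit under high concurrency.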

@matyhtf
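Member

matyhtf commented Jun 22, 2022

@Frank-JY Please disable enable_preemptive_scheduler for now. This feature needs to be used with caution, especially when you rely heavily on third-party libraries: a coroutine switch may happen in the middle of executing any line of PHP code, and some third-party library code may not be reentrant, which leads to bugs.

With enable_preemptive_scheduler disabled, coroutine switches only happen where IO blocks, which is safer.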
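
As a sketch of how that advice might be combined with the CPU-intensive code mentioned earlier in the thread: turn the preemptive scheduler off and give long loops an explicit switch point instead. The function name, loop body, and yield interval below are made up for illustration; Coroutine::sleep() is the same call already used in the snippets above and must run inside a coroutine.

```php
<?php
use Swoole\Coroutine;

// Switch coroutines only at IO or explicit yield points.
Coroutine::set([
    'enable_preemptive_scheduler' => false,
]);

// Hypothetical CPU-bound routine; call it from inside a coroutine.
function sumSquares(int $n, int $yieldEvery = 10000): int
{
    $total = 0;
    for ($i = 1; $i <= $n; $i++) {
        $total += $i * $i;
        if ($i % $yieldEvery === 0) {
            // Explicit switch point, chosen where no lock is held.
            Coroutine::sleep(0.001);
        }
    }
    return $total;
}
```

The point of the design is that the code decides where a switch may happen, so a logger holding flock() is never interrupted in the middle of its critical section.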
