Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Random crash of FPM master process in fpm_stdio_child_said #8517

Closed
triplesixman opened this issue May 8, 2022 · 42 comments
Closed

Random crash of FPM master process in fpm_stdio_child_said #8517

triplesixman opened this issue May 8, 2022 · 42 comments

Comments

@triplesixman
Copy link

triplesixman commented May 8, 2022

Description

Hello,

Since the switch to php8.0-fpm, I have random php crashes which force me to restart it manually and cause a 502 bad gateaway error on my site.

The crashes are random and sometimes occur after several days.

The logs indicate this:

May 8 16:21:14 server kernel: [817145.478932] php-fpm8.0[3482145]: segfault at 105 ip 000055c8afc40309 sp 00007ffd430d3860 error 4 in php-fpm8.0[55c8af9c1000+285000]
May 8 16:21:14 server kernel: [817145.550172] Code: ac 4c 8b a0 f0 00 00 00 41 0f 94 c5 89 4d a8 0f 84 e6 02 00 00 48 8b 45 a0 48 83 c0 70 48 89 45 88 4d 85 e4 0f 84 2d 03 00 00 <41> f6 44 24 05 04 75 11 48 8b 45 a0 8b 80 e4 00 00 00 41 39 44 24
May 8 16:21:14 server systemd[1]: php8.0-fpm.service: Main process exited, code=killed, status=11/SEGV
May 8 16:21:14 server systemd[1]: php8.0-fpm.service: Failed with result 'signal'.
May 8 16:21:14 server systemd[1]: php8.0-fpm.service: Consumed 3d 57min 21.700s CPU time.

Unfortunately I don't have more information and it is impossible to identify the problem. Do you know how I can get more details?

I have thousands of active users on my site and it will not be possible for me to reproduce the crash by myself.

Thank you

PHP Version

PHP 8.0.18

Operating System

Debian 11

@triplesixman triplesixman changed the title Random crash of php8.0-fpm - segfault at 105 ip 000055c8afc40309 sp 00007ffd430d3860 error 4 in php-fpm8.0[55c8af9c1000+285000] Random crash of php8 - segfault at 105 ip 000055c8afc40309 sp 00007ffd430d3860 error 4 in php-fpm8.0[55c8af9c1000+285000] May 8, 2022
@Girgias
Copy link
Member

Girgias commented May 8, 2022

Possible duplicate of #8494 ?

@cmb69
Copy link
Member

cmb69 commented May 9, 2022

Do you know how I can get more details?

I stack backtrace might be helpful.

@iluuu1994
Copy link
Member

@triplesixman Would it be possible to generate a backtrace? Otherwise we have pretty much no chance of finding the actual issue.

@Poulpatine
Copy link

Hi,

Sorry for the delay we finally been able to reproduce the issue and to gather a backtrace:

gdb) bt
#0  fpm_stdio_child_said (ev=0x55bf8d3901f0, which=2, arg=0x55bf8d390180) at ./sapi/fpm/fpm/fpm_stdio.c:156
#1  0x000055bf8cc1ef66 in fpm_event_epoll_wait (queue=<optimized out>, timeout=<optimized out>) at ./sapi/fpm/fpm/events/epoll.c:143
#2  0x000055bf8cc11a85 in fpm_event_loop (err=err@entry=0) at ./sapi/fpm/fpm/fpm_events.c:427
#3  0x000055bf8cc0bea7 in fpm_run (max_requests=max_requests@entry=0x7ffefea58c3c) at ./sapi/fpm/fpm/fpm.c:113
#4  0x000055bf8c9d40f8 in main (argc=4, argv=0x7ffefea59178) at ./sapi/fpm/fpm/fpm_main.c:1828

@iluuu1994
Copy link
Member

iluuu1994 commented May 20, 2022

@bukka Looks like log_stream here is NULL (or an invalid pointer).

if (log_stream->is_stdout != (unsigned int)is_stdout || log_stream->child_pid != (int)child->pid) {

I'm completely unfamiliar with fpm, do you know how this could happen?

@bukka
Copy link
Member

bukka commented May 20, 2022

I don't see anything suspicious from looking to the code. If it was NULL, then it would fall to the different branch and was created so it would need to be some invalid pointer but again that should not happen based on the code. The only thing that comes to my mind is some sort of memory corruption somewhere. I will add this on my list and will try to do more careful review.

@Poulpatine could you share your FPM config?

@Poulpatine
Copy link

Poulpatine commented May 20, 2022

Sure, here it is.
It's the Debian default configuration, just modified to be able to generate coredumps.

;;;;;;;;;;;;;;;;;;;;;
; FPM Configuration ;
;;;;;;;;;;;;;;;;;;;;;

; All relative paths in this configuration file are relative to PHP's install
; prefix (/usr). This prefix can be dynamically changed by using the
; '-p' argument from the command line.

;;;;;;;;;;;;;;;;;;
; Global Options ;
;;;;;;;;;;;;;;;;;;

[global]
; Pid file
; Note: the default prefix is /var
; Default Value: none
; Warning: if you change the value here, you need to modify systemd
; service PIDFile= setting to match the value here.
pid = /run/php/php8.0-fpm.pid

; Error log file
; If it's set to "syslog", log is sent to syslogd instead of being written
; into a local file.
; Note: the default prefix is /var
; Default Value: log/php-fpm.log
error_log = /var/log/php8.0-fpm.log

; syslog_facility is used to specify what type of program is logging the
; message. This lets syslogd specify that messages from different facilities
; will be handled differently.
; See syslog(3) for possible values (ex daemon equiv LOG_DAEMON)
; Default Value: daemon
;syslog.facility = daemon

; syslog_ident is prepended to every message. If you have multiple FPM
; instances running on the same server, you can change the default value
; which must suit common needs.
; Default Value: php-fpm
;syslog.ident = php-fpm

; Log level
; Possible Values: alert, error, warning, notice, debug
; Default Value: notice
;log_level = notice

; Log limit on number of characters in the single line (log entry). If the
; line is over the limit, it is wrapped on multiple lines. The limit is for
; all logged characters including message prefix and suffix if present. However
; the new line character does not count into it as it is present only when
; logging to a file descriptor. It means the new line character is not present
; when logging to syslog.
; Default Value: 1024
;log_limit = 4096

; Log buffering specifies if the log line is buffered which means that the
; line is written in a single write operation. If the value is false, then the
; data is written directly into the file descriptor. It is an experimental
; option that can potentionaly improve logging performance and memory usage
; for some heavy logging scenarios. This option is ignored if logging to syslog
; as it has to be always buffered.
; Default value: yes
;log_buffering = no

; If this number of child processes exit with SIGSEGV or SIGBUS within the time
; interval set by emergency_restart_interval then FPM will restart. A value
; of '0' means 'Off'.
; Default Value: 0
;emergency_restart_threshold = 0

; Interval of time used by emergency_restart_interval to determine when
; a graceful restart will be initiated.  This can be useful to work around
; accidental corruptions in an accelerator's shared memory.
; Available Units: s(econds), m(inutes), h(ours), or d(ays)
; Default Unit: seconds
; Default Value: 0
;emergency_restart_interval = 0

; Time limit for child processes to wait for a reaction on signals from master.
; Available units: s(econds), m(inutes), h(ours), or d(ays)
; Default Unit: seconds
; Default Value: 0
;process_control_timeout = 0

; The maximum number of processes FPM will fork. This has been designed to control
; the global number of processes when using dynamic PM within a lot of pools.
; Use it with caution.
; Note: A value of 0 indicates no limit
; Default Value: 0
; process.max = 128

; Specify the nice(2) priority to apply to the master process (only if set)
; The value can vary from -19 (highest priority) to 20 (lowest priority)
; Note: - It will only work if the FPM master process is launched as root
;       - The pool process will inherit the master process priority
;         unless specified otherwise
; Default Value: no set
; process.priority = -19

; Send FPM to background. Set to 'no' to keep FPM in foreground for debugging.
; Default Value: yes
;daemonize = yes

; Set open file descriptor rlimit for the master process.
; Default Value: system defined value
;rlimit_files = 1024

; Set max core size rlimit for the master process.
; Possible Values: 'unlimited' or an integer greater or equal to 0
; Default Value: system defined value
;rlimit_core = 0
rlimit_core = unlimited

; Specify the event mechanism FPM will use. The following is available:
; - select     (any POSIX os)
; - poll       (any POSIX os)
; - epoll      (linux >= 2.5.44)
; - kqueue     (FreeBSD >= 4.1, OpenBSD >= 2.9, NetBSD >= 2.0)
; - /dev/poll  (Solaris >= 7)
; - port       (Solaris >= 10)
; Default Value: not set (auto detection)
;events.mechanism = epoll

; When FPM is built with systemd integration, specify the interval,
; in seconds, between health report notification to systemd.
; Set to 0 to disable.
; Available Units: s(econds), m(inutes), h(ours)
; Default Unit: seconds
; Default value: 10
;systemd_interval = 10

;;;;;;;;;;;;;;;;;;;;
; Pool Definitions ;
;;;;;;;;;;;;;;;;;;;;

; Multiple pools of child processes may be started with different listening
; ports and different management options.  The name of the pool will be
; used in logs and stats. There is no limitation on the number of pools which
; FPM can handle. Your system will tell you anyway :)

; Include one or more files. If glob(3) exists, it is used to include a bunch of
; files from a glob(3) pattern. This directive can be used everywhere in the
; file.
; Relative path can also be used. They will be prefixed by:
;  - the global prefix if it's been set (-p argument)
;  - /usr otherwise
include=/etc/php/8.0/fpm/pool.d/*.conf

And /etc/php/8.0/fpm/pool.d/www.conf :

[www]
user = www-data
group = www-data
listen = 127.0.0.1:9000
listen.owner = www-data
listen.group = www-data
pm = dynamic
pm.max_children = 400
pm.start_servers = 60
pm.min_spare_servers = 40
pm.max_spare_servers = 70
pm.max_requests = 20000
request_terminate_timeout = 3600
request_slowlog_timeout = 300s
rlimit_files = 131072
rlimit_core = unlimited
catch_workers_output = yes
pm.status_path = /status
slowlog = /var/log/php/slowlog-site.log

php_admin_value[error_log] = /var/log/php/php7.2-fpm.log
php_admin_flag[log_errors] = off

@bukka
Copy link
Member

bukka commented May 20, 2022

@Poulpatine would you be able to try it with commented out pm.max_requests and see if you still observe the issue, see it happening more often or even see some other problems (e.g. continuous memory increase in children that will eventually run out of memory)?

@Poulpatine
Copy link

Ok, I'm doing the test. I'll keep you informed.

@Poulpatine
Copy link

Hi @bukka , it seems that there is no difference regarding the crash frequency.
2 crashes occured last night.

@bukka
Copy link
Member

bukka commented May 28, 2022

@Poulpatine Thanks for checking. Would you be able to also try to temporarily disable slowlog (commenting out slowlog and request_slowlog_timeout). Could you watch the memory of the master process if you for example see continuous increase. Might be also worth to temporarily disable catch_workers_output to see how things change (it should prevent the crash but doesn't mean that it's the cause). Basically just trying to do reduce the problem search space...

@Poulpatine
Copy link

Poulpatine commented May 29, 2022

@bukka, Sure, I'll do that and tell you.

@Poulpatine
Copy link

Hi, a new crash occured last night.

The backtrace is similar to previous ones:

gdb /usr/sbin/php-fpm8.0 /core
GNU gdb (Debian 10.1-1.7) 10.1.90.20210103-git
Copyright (C) 2021 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/sbin/php-fpm8.0...
Reading symbols from /usr/lib/debug/.build-id/b7/a2f4c0126b11ebc10eb179e1e24adf97a938dd.debug...

warning: Can't open file /dev/zero (deleted) during file-backed mapping note processing
[New LWP 495223]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `php-fpm: master process (/etc/php/8.0/fpm/php-fpm.conf)                      '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  fpm_stdio_child_said (ev=0x557d69f763f0, which=2, arg=0x557d69f76380) at ./sapi/fpm/fpm/fpm_stdio.c:156
156     ./sapi/fpm/fpm/fpm_stdio.c: No such file or directory.
(gdb) bt
#0  fpm_stdio_child_said (ev=0x557d69f763f0, which=2, arg=0x557d69f76380) at ./sapi/fpm/fpm/fpm_stdio.c:156
#1  0x0000557d68f45f66 in fpm_event_epoll_wait (queue=<optimized out>, timeout=<optimized out>) at ./sapi/fpm/fpm/events/epoll.c:143
#2  0x0000557d68f38a85 in fpm_event_loop (err=err@entry=0) at ./sapi/fpm/fpm/fpm_events.c:427
#3  0x0000557d68f32ea7 in fpm_run (max_requests=max_requests@entry=0x7ffd121e05cc) at ./sapi/fpm/fpm/fpm.c:113
#4  0x0000557d68cfb0f8 in main (argc=4, argv=0x7ffd121e0b08) at ./sapi/fpm/fpm/fpm_main.c:1828

Here is the fpm configuration:

at /etc/php/8.0/fpm/pool.d/www.conf
[www]
user = www-data
group = www-data
listen = 127.0.0.1:9000
listen.owner = www-data
listen.group = www-data
pm = dynamic
pm.max_children = 400
pm.start_servers = 60
pm.min_spare_servers = 40
pm.max_spare_servers = 70
; pm.max_requests = 20000
request_terminate_timeout = 3600
; request_slowlog_timeout = 300s
rlimit_files = 131072
rlimit_core = unlimited
catch_workers_output = yes
pm.status_path = /status
; slowlog = /var/log/php/slowlog-site.log

php_admin_value[error_log] = /var/log/php/php7.2-fpm.log
php_admin_flag[log_errors] = off

@bukka
Copy link
Member

bukka commented Jun 15, 2022

@Poulpatine I have been searching for a possible source of this issue without any luck. But co-incidentally there is just a new PR fixing one problem in zlog that might cause this: see #8785 . If it's not easy for you to apply the patch, can you just disable docoration in the pool - add decorate_workers_output = no to the pool config?

hwde added a commit to hwde/php-src that referenced this issue Jun 15, 2022
@Poulpatine
Copy link

I've patched the Debian package and installed it.
I will keep you informed.

@Poulpatine
Copy link

Hi,

unfortunately and even with the patch, a new crash occured last night.
Stacktrace is the same as usual.

@bukka
Copy link
Member

bukka commented Jun 19, 2022

Can you try it with decorate_workers_output = no?

@bukka
Copy link
Member

bukka commented Jun 19, 2022

In case there is another issue related to this...

@triplesixman
Copy link
Author

I just applied the change, we will keep you informed

@bukka
Copy link
Member

bukka commented Jun 27, 2022

@Poulpatine @triplesixman Any update on this? Do you still see the issue even without decorated output?

@triplesixman
Copy link
Author

A new crash occurred yesterday afternoon, here is the trace:

(gdb) bt
#0  fpm_stdio_child_said (ev=0x55f73e73c130, which=2, arg=0x55f73e73c0c0) at ./sapi/fpm/fpm/fpm_stdio.c:156
#1  0x000055f73dc0f066 in fpm_event_epoll_wait (queue=<optimized out>, timeout=<optimized out>) at ./sapi/fpm/fpm/events/epoll.c:143
#2  0x000055f73dc01b95 in fpm_event_loop (err=err@entry=0) at ./sapi/fpm/fpm/fpm_events.c:427
#3  0x000055f73dbfbfb7 in fpm_run (max_requests=max_requests@entry=0x7ffc828e9c2c) at ./sapi/fpm/fpm/fpm.c:113
#4  0x000055f73d9c4148 in main (argc=4, argv=0x7ffc828ea168) at ./sapi/fpm/fpm/fpm_main.c:1828

@bukka
Copy link
Member

bukka commented Jul 3, 2022

I'm running out of ideas in terms of options. I think I might need to take a bit deeper look to the core dump if there's anything useful. Would you be able to share it? If so, feel free to email me privately. It would be also great if you could set log_level to debug and share with me some time before the crash depending on its size.

@triplesixman
Copy link
Author

I just sent you an email with the core dump file.

I also changed the log level to debug. I will keep you informed in case of new crash.

@triplesixman
Copy link
Author

Mail sent! Let me know if you need anything else to help you in your search.

bukka added a commit to bukka/php-src that referenced this issue Aug 28, 2022
@bukka
Copy link
Member

bukka commented Aug 28, 2022

@triplesixman I finally managed look into this properly. I have been actually looking and thinking about it last few days (on and off :) ) and I got an idea what could be potentially the issue and added an initial fix for that to #9444 . If you could try it, that would be great! If you don't see any further crash, could you also compare the load on master process before and after the patch?

@triplesixman
Copy link
Author

triplesixman commented Aug 29, 2022

Thanks for the update! We will try it out this weekend.

EDIT: We put the patch in place today 09/09. I will keep you informed

@bukka
Copy link
Member

bukka commented Sep 21, 2022

Hi @triplesixman , have you had any crash since deploying the patch?

@triplesixman
Copy link
Author

Hi @triplesixman , have you had any crash since deploying the patch?

No crashes observed since September 9 following the application of the patch. I guess we can consider that the problem is solved if there is no new crash by October 9th (I see wide because the crash frequencies are very random).

By curiosity I installed php 8.1 to see if the problem was reproduced on this version and it is indeed the case.

@bukka
Copy link
Member

bukka commented Sep 25, 2022

Thanks, did you notice any difference in the load of the master process. It should probably not be visible if you don't run excessive number of pools with high number of children. Anyway I will do some optimization if this is confirmed working. Was more wondering if there's anything visible in terms of load already?

@Poulpatine
Copy link

Hello @bukka , unfortunately the main load of the system does not come from PHP instances but from processes triggered from PHP so we can't really distinguish a trend.

Regarding the crashes, no crash have occured since last patch :).

Thanks !

@triplesixman
Copy link
Author

Hi @bukka, we would like to make all our servers benefit from this fix, our other servers are under php 8.1.11, we go through the deb.sury.org packages for updates. Will the next versions integrate the fix and from which version ? Thank you in advance

bukka added a commit to bukka/php-src that referenced this issue Oct 24, 2022
We should delete event before freeing child so it cannot be triggered
after the child is freed.
bukka added a commit to bukka/php-src that referenced this issue Oct 24, 2022
We should delete event before freeing child so it cannot be triggered
after the child is freed.
bukka added a commit to bukka/php-src that referenced this issue Oct 24, 2022
We should delete event before freeing child so it cannot be triggered
after the child is freed.
@bukka
Copy link
Member

bukka commented Oct 24, 2022

@Poulpatine @triplesixman Apology for the delay. The fix is not committed because I'm not too happy with the overhead it introduces. I spent some time to think aboutit and come up with a different approach that is in #9817 . Would you be able to test it?

@bukka
Copy link
Member

bukka commented Nov 22, 2022

@Poulpatine @triplesixman Did you get a chance to try the fix in #9817 ?

@Poulpatine
Copy link

Hi @bukka, we have been able to deploy the fix last week-end and we have faced 3 segfault since.

Here's the backtrace of one of them:

#0  fpm_stdio_child_said (ev=0x55b7343656d0, which=2, arg=0x55b734365660) at ./sapi/fpm/fpm/fpm_stdio.c:209
#1  0x000055b732e57c06 in fpm_event_epoll_wait (queue=<optimized out>, timeout=<optimized out>) at ./sapi/fpm/fpm/events/epoll.c:141
#2  0x000055b732e49f55 in fpm_event_loop (err=err@entry=0) at ./sapi/fpm/fpm/fpm_events.c:432
#3  0x000055b732e43e07 in fpm_run (max_requests=max_requests@entry=0x7fff96d357cc) at ./sapi/fpm/fpm/fpm.c:113
#4  0x000055b732b93d08 in main (argc=4, argv=0x7fff96d35d18) at ./sapi/fpm/fpm/fpm_main.c:1805

@bukka bukka changed the title Random crash of php8 - segfault at 105 ip 000055c8afc40309 sp 00007ffd430d3860 error 4 in php-fpm8.0[55c8af9c1000+285000] Random crash of FPM master process in fpm_stdio_child_said Nov 23, 2022
@bukka bukka closed this as completed in c9c1934 Nov 23, 2022
@bukka
Copy link
Member

bukka commented Nov 23, 2022

Ok it seems like #9817 still leaves the race condition so it doesn't resolve the issue. It might still be logical thing to do as there is no point to keep that event but that would be more optimization. Will leave it open for now and think about it later.

It means that I needed to go back #9444 which actually fixes the issue. After thinking about its impact, I realized it might not be that bad. The thing is that setups with many pools won't likely catch worker output but use logs instead so most users will have probably just one pool and if you didn't notice problems with master in your setup when I see you run quite a lot of children, then it's likely others won't see the issue either. Of course this is still not for sure so I merged it to 8.1+ and left out 8.0 because its normal bug fixing support ends in 2 days and I would have no way to address potential issues related to the fix. It means it will be part of 8.1.14 and will also go to 8.2.1.

@triplesixman
Copy link
Author

@bukka Hello, I wanted to inform you that we have had new segfaults recently despite the upgrade. However, the segfaults seem very rare compared to before the upgrade. I will try to give you more informations about the crashes in the next weeks (our server is not configured to get the backtrace anymore, but we will do it soon).

@bukka
Copy link
Member

bukka commented Feb 9, 2023

Ok would be great to get backtrace so we can see if it's the same thing as #10461 . If you could also try the proposed #10510 and see if it makes any difference, that would be great!

@bukka
Copy link
Member

bukka commented Apr 15, 2023

@triplesixman @Poulpatine So after more investigation, that previous commit wasn't really a right fix. It might have hidden the the problem but didn't fix the cause of the crash. It actually made the problem more visible under another workload. The main race condition seems to be about possibility to access the child after it's freed (delivery the input event after the signal). I just created a new PR which basically reverts the previous fix and instead tries to delay the freeing of the child by 1 second. It is available here: #11084 . If you were be able to test it, that would be appreciated!

@bukka bukka reopened this Apr 15, 2023
@bukka bukka closed this as completed in 1029537 May 13, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants