Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Segmentation fault when OPcache is enabled along with an extension utilising the Observer API #13817

Closed
mtrop-godaddy opened this issue Mar 27, 2024 · 17 comments

Comments

@mtrop-godaddy
Copy link

Description

When OPcache is enabled and either NewRelic or the zend_test extensions are active, we're seeing a segmentation fault on a fairly complicated WordPress app. We've unfortunately not been able to come up with a simple PHP script that would allow us to simplify reproduction steps. The site's code is maintained by a customer of ours and has a large number of plugins so it's difficult to pinpoint the code that's causing the issue.

The issue has also been reported in NewRelic's PHP agent repo, but it looks like it's most likely not caused by their agent, since it's possible to reproduce it with the zend_test extension. The original reporter mentioned that the issue may have been caused by the font-awesome plugin but we've not been able to reproduce it with that so far:

It's not clear what's changed in NewRelic's PHP agent that started triggering this bug (it started happening in version 10.18.0.8), but we believe they've stopped using the old tracing method of replacing the zend_execute_ex function in minit/mshutdown and switched to using the Observer API in all cases.

A similar issue has also been reported in this repo so hopefully this isn't a duplicate, our crash seems to be different enough:

OPcache and JIT

The issue goes away completely if we disabled OPcache:

opcache.enable = 0

We also have JIT disabled by default (buffer size is set to 0):

opcache.jit => tracing => tracing
opcache.jit_buffer_size => 0 => 0

However, the issue still occurs even if we completely disable it while OPcache is still enabled:

opcache.jit = disable

Core dump debugging

We've looked at a few core dumps, and when OPcache is enabled, the func pointer in the execute_data variable ends up pointing at an invalid address which then crashes PHP when it dereferences it in the Observer API's exit handler:

Here's the backtrace and a some more debugging information that we've been able to extract from a core dump:

Core was generated by `php-fpm:'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  nr_php_backtrace_get_call_site () at /usr/local/src/newrelic-php-agent/agent/php_stack.c:220
220     /usr/local/src/newrelic-php-agent/agent/php_stack.c: No such file or directory.

(gdb) bt
#0  nr_php_backtrace_get_call_site () at /usr/local/src/newrelic-php-agent/agent/php_stack.c:220
#1  nr_php_frame_info () at /usr/local/src/newrelic-php-agent/agent/php_stack.c:267
#2  nr_php_backtrace_fd () at /usr/local/src/newrelic-php-agent/agent/php_stack.c:462
#3  0x00007f05b23172dc in nr_php_fatal_signal_handler () at /usr/local/src/newrelic-php-agent/agent/php_minit.c:740
#4  <signal handler called>
#5  call_end_observers (return_value=0x0, execute_data=0x55734da87d48) at /usr/src/php/Zend/zend_observer.c:265
#6  zend_observer_fcall_end_all () at /usr/src/php/Zend/zend_observer.c:293
#7  0x0000557382500b99 in php_request_shutdown (dummy=dummy@entry=0x0) at /usr/src/php/main/main.c:1841
#8  0x0000557382248a39 in main (argc=<optimized out>, argv=<optimized out>) at /usr/src/php/sapi/fpm/fpm/fpm_main.c:1970

(gdb) up 5
#5  call_end_observers (return_value=0x0, execute_data=0x55734da87d48) at /usr/src/php/Zend/zend_observer.c:265
265     /usr/src/php/Zend/zend_observer.c: No such file or directory.

(gdb) source /core-dumps/php-gdbinit

(gdb) zbacktrace
[0x55734da87d48] Cannot access memory at address 0x23e00000357

(gdb) print execute_data
$1 = (zend_execute_data *) 0x55734da87d48

(gdb) set $ed = (zend_execute_data *) 0x55734da87d48

(gdb) print $ed->func
$2 = (zend_function *) 0x23e00000347

(gdb) set $edf = (zend_function *) 0x23e00000347

(gdb) print $edf->common.function_name->val
Cannot access memory at address 0x23e0000034f

(gdb) print $edf->type
Cannot access memory at address 0x23e00000347

(gdb) maintenance info sections
Exec file:
    `/usr/local/sbin/php-fpm', file type elf64-x86-64.
 [0]      0x5573820002a8->0x5573820002c4 at 0x000002a8: .interp ALLOC LOAD READONLY DATA HAS_CONTENTS
[...]
 [25]     0x5573836064a0->0x557383627990 at 0x01406498: .bss ALLOC
 [26]     0x00000000->0x00000027 at 0x01406498: .comment READONLY HAS_CONTENTS
 [27]     0x00000000->0x00007340 at 0x014064c0: .debug_aranges READONLY HAS_CONTENTS
 [28]     0x00000000->0x00bf3119 at 0x0140d800: .debug_info READONLY HAS_CONTENTS
 [29]     0x00000000->0x00088872 at 0x02000919: .debug_abbrev READONLY HAS_CONTENTS
 [30]     0x00000000->0x0066edef at 0x0208918b: .debug_line READONLY HAS_CONTENTS
 [31]     0x00000000->0x00220f7c at 0x026f7f7a: .debug_str READONLY HAS_CONTENTS
 [32]     0x00000000->0x00e610db at 0x02918ef6: .debug_loc READONLY HAS_CONTENTS
 [33]     0x00000000->0x002e3a90 at 0x03779fd1: .debug_ranges READONLY HAS_CONTENTS
 [34]     0x00000000->0x001b2373 at 0x03a5da61: .debug_macro READONLY HAS_CONTENTS
Core file:
    `/core-dumps/core-php-fpm.68.php-43789.1711460016', file type elf64-x86-64.
 [0]       0x00000000->0x0000d358 at 0x0000a108: note0 READONLY HAS_CONTENTS
 [1]       0x00000000->0x000000d8 at 0x0000a18c: .reg/68 HAS_CONTENTS
 [2]       0x00000000->0x000000d8 at 0x0000a18c: .reg HAS_CONTENTS
 [3]       0x00000000->0x00000080 at 0x0000a31c: .note.linuxcore.siginfo/68 HAS_CONTENTS
 [4]       0x00000000->0x00000080 at 0x0000a31c: .note.linuxcore.siginfo HAS_CONTENTS
 [5]       0x00000000->0x00000150 at 0x0000a3b0: .auxv HAS_CONTENTS
 [6]       0x00000000->0x0000c299 at 0x0000a514: .note.linuxcore.file/68 HAS_CONTENTS
 [7]       0x00000000->0x0000c299 at 0x0000a514: .note.linuxcore.file HAS_CONTENTS
 [8]       0x00000000->0x00000200 at 0x000167c4: .reg2/68 HAS_CONTENTS
 [9]       0x00000000->0x00000200 at 0x000167c4: .reg2 HAS_CONTENTS
 [10]      0x00000000->0x00000a88 at 0x000169d8: .reg-xstate/68 HAS_CONTENTS
 [11]      0x00000000->0x00000a88 at 0x000169d8: .reg-xstate HAS_CONTENTS
 [12]      0x557344200000->0x557382000000 at 0x00018000: load1 ALLOC LOAD HAS_CONTENTS
[...]
 [859]     0x7fffebdc6000->0x7fffebdca000 at 0x497c5000: load732 ALLOC LOAD READONLY HAS_CONTENTS
 [860]     0x7fffebdca000->0x7fffebdcc000 at 0x497c9000: load733 ALLOC LOAD READONLY CODE HAS_CONTENTS
 [861]     0xffffffffff600000->0xffffffffff601000 at 0x497cb000: load734 ALLOC LOAD READONLY CODE HAS_CONTENTS

The backtrace with zend_test enabled is almost identical, the only difference is that NewRelic's signal handler isn't invoked after the segmentation fault:

Core was generated by `php-fpm:'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  call_end_observers (return_value=0x0, execute_data=0x56205f087ec8) at /usr/src/php/Zend/zend_observer.c:265
265     /usr/src/php/Zend/zend_observer.c: No such file or directory.
(gdb) bt
#0  call_end_observers (return_value=0x0, execute_data=0x56205f087ec8) at /usr/src/php/Zend/zend_observer.c:265
#1  zend_observer_fcall_end_all () at /usr/src/php/Zend/zend_observer.c:293
#2  0x0000562093b00b99 in php_request_shutdown (dummy=dummy@entry=0x0) at /usr/src/php/main/main.c:1841
#3  0x0000562093848a39 in main (argc=<optimized out>, argv=<optimized out>) at /usr/src/php/sapi/fpm/fpm/fpm_main.c:1970
(gdb) source /core-dumps/php-gdbinit
(gdb) zbacktrace
[0x56205f087ec8] Cannot access memory at address 0x23e00000357

The following config options are set for the zend_test extension:

zend_test.observer.enabled=1
zend_test.observer.observe_all=1
zend_test.observer.show_output=0

More info about the environment

The site's running in a Docker container, using a slightly modified image that's based on the official PHP Docker images - https://github.com/docker-library/php/tree/master/8.3/bullseye/fpm.

Some notable changes:

  • PHP is compiled with --enable-embed
  • We're installing jemalloc but it's not in use by default and isn't enabled for the site that's crashing
  • A few additional extensions are baked into the image

List of PHP modules:

php-modules.txt

PHP Version

PHP 8.3.4

Operating System

Ubuntu 20.04

@iluuu1994
Copy link
Member

@mtrop-godaddy It looks like this may be related to #13735. Would it be possible to check whether this issue is resolved on the PHP-8.3 branch?

@mtrop-godaddy
Copy link
Author

I've done a quick test with the 8.3.5RC1 release, which contains the patch that you've mentioned, but it doesn't seem to have helped. I'm still seeing the same segfault. I'll try and do some more testing with the code changes in that PR, just in case I missed something, but so far it looks like this might be a different issue.

@bwoebi
Copy link
Member

bwoebi commented Mar 28, 2024

This particular crash is in zend_observer_fcall_end_all, so it's likely not related to the other observer issues.

This kind of crash however is virtually impossible to track down without observing it in the debugger, I think. Is the crash reliably reproducible? I think we need a reproducer here. Feel free to reach out privately, if you cannot share it publicly.

@mtrop-godaddy
Copy link
Author

We'll see what we can come up with. The crash is easy to reproduce (any web request to the site in question) but it's happening on a small set of very complex WordPress sites (custom theme, lots of plugins, etc.) so getting a simple reproduction case, or just trying to track it down in a debugger isn't easy.

@walthowd
Copy link

We can also reproduce on complex WordPress sites, all on PHP 8.2.16, some running font-awesome plugin but other sites running a different set of plugins.

@mtrop-godaddy
Copy link
Author

Finally got around to debugging this some more.. Still don't have a reproduction case but was just wondering if some of the data we're seeing in executor_globals could be relevant, or if these are just leftovers from earlier in the execution.

When the SIGSEGV signal is raised, we're in zend_observer_fcall_end_all function. The current_execute_data field doesn't seem to contain any valid data at this point but we're seeing some potential clues in the trampoline field:

(gdb) print_zstr executor_globals.trampoline.op_array.function_name
string(15) "newly_installed"
(gdb) print_zstr executor_globals.trampoline.op_array.filename
string(89) "/[REDACTED]/wp-content/plugins/woocommerce/src/Internal/Traits/AccessiblePrivateMethods.php"
(gdb) print executor_globals.trampoline.op_array.line_start
$10 = 176

The filename/line_start point at the following __callStatic function in WooCommerce - https://github.com/woocommerce/woocommerce/blob/8.6.1/plugins/woocommerce/src/Internal/Traits/AccessiblePrivateMethods.php#L176-L189

The trait is used in a bunch of different places but AFAICT that function isn't executed before the crash (added a bunch of error_log calls).

newly_installed is also defined in WC but not sure how these relate to each other yet - https://github.com/woocommerce/woocommerce/blob/8.6.1/plugins/woocommerce/includes/class-wc-install.php#L287-L299.


It also looks like an exception may have been raised before the crash since opline_before_exception isn't NULL:

(gdb) print *executor_globals.opline_before_exception                                                                                                                                                                                                                                                                          $9 = {handler = 0x55ad21c3d6a2 <execute_ex-3775406>, op1 = {constant = 4294967295, var = 4294967295, num = 4294967295, opline_num = 4294967295, jmp_offset = 4294967295}, op2 = {constant = 4294967295, var = 4294967295, num = 4294967295, opline_num = 4294967295, jmp_offset = 4294967295}, result = {
    constant = 4294967295, var = 4294967295, num = 4294967295, opline_num = 4294967295, jmp_offset = 4294967295}, extended_value = 0, lineno = 46, opcode = 60 '<', op1_type = 0 '\000', op2_type = 0 '\000', result_type = 0 '\000'}

I'm not familiar enough with PHP internals to know if these are relevant so any pointers would be appreciated.

@nielsdos
Copy link
Member

nielsdos commented Apr 9, 2024

This could be a stupid question, but let's try.
In zend_get_call_trampoline_func I see func->T = (fbc->type == ZEND_USER_FUNCTION)? MAX(fbc->op_array.last_var + fbc->op_array.T, 2) : 2;, which does not account for the temporary that is added post startup for observers. Isn't that wrong? And since opcache reduce the number of temporaries, I think it may be possible that the previous observer frame gets written to a temporary-in-use.
If that's the case, then I think a patch like this may solve the issue: https://gist.github.com/nielsdos/4f43354948de79a82fe2ea0b813ddcf2 EDIT: or well, we don't have to account for the additional temporary in this case fbc->op_array.last_var + fbc->op_array.T, so perhaps this patch is better: https://gist.github.com/nielsdos/d26afa61d00a261893709b3c1e6497e0)

But I'm still trying to come up with a test...

@bwoebi
Copy link
Member

bwoebi commented Apr 9, 2024

@nielsdos trampoline funcs are explicitly exempt from being observed (see _zend_observe_fcall_begin) and as such should not need the extra temporary. Only actually observed functions ever have something written to that temporary.
Optimizer considers the observer temporary, see op_array->T += ZEND_OBSERVER_ENABLED; // reserve last temporary for observers if enabled in zend_redo_pass_two.

@bwoebi
Copy link
Member

bwoebi commented Apr 10, 2024

@mtrop-godaddy As you see zend_observer_fcall_end_all gets called in shutdown, more precisely when there was a fatal error (bailout) before (it gets always called, but shouldn't do anything otherwise).
So the debugging steps would be breaking on _zend_bailout, getting a zbacktrace from there, as well as dumping current_observed_frame. Given set $last_observed_frame = current_observed_frame, you can follow the chain of observed frames by calling prev_observed_frame(). Print set $last_observed_frame = *prev_observed_frame($last_observed_frame) until it's NULL.
Compare it against the execute_data pointers listed in zbacktrace. Some frames in the zbacktrace may be skipped. Check where addresses appear which are not present in the zbacktrace at all.

If you have that, you have tracked down in which function it goes sideways. Then that functions properties need to be inspected, possibly watchpoints added to track where the temporary was assigned this wrong value etc.

This is as said not a simple investigation and my offer still stands if you want to reach out privately to me for debugging over screen sharing or providing the applictaion or such.

@mtrop-godaddy
Copy link
Author

mtrop-godaddy commented Apr 10, 2024

Thanks @bwoebi, that helped a bunch! Just tried setting a breakpoint on _zend_bailout and it unfortunately never gets hit so we might not be hitting a fatal here.

I also played with a r/w watchpoint on the current_observed_frame global and it looks like it may be getting corrupted fairly early in the execution, gets stored as the previous frame and then, much later, ends up causing a seg fault when zend_observer_fcall_end_all tries to read from it.

The following plugin/code seems to be related - https://plugins.trac.wordpress.org/browser/kadence-woocommerce-email-designer/trunk/kadence-woocommerce-email-designer.php#L847. Disabling the plugin also seems to stop the seg faults from happening.

(gdb) awatch current_observed_frame if current_observed_frame == 0x55d398287d48
(gdb) c
Old value = (zend_execute_data *) 0x7f2659c16720
New value = (zend_execute_data *) 0x55d398287d48
0x000055d3cce0c18e in zend_observer_fcall_end (execute_data=0x7f2659c16720, return_value=0x7f2659c166e0) at /usr/src/php/Zend/zend_observer.c:285
285     /usr/src/php/Zend/zend_observer.c: No such file or directory.
(gdb) zbacktrace
[0x7f2659c16720] kadence_woomail_is_woo_active() [REDACTED]/wp-content/plugins/kadence-woocommerce-email-designer/kadence-woocommerce-email-designer.php:847
[...]

I agree it'd be easier to debug this over a screen share, but we need to check if the affected customer is okay with that first, so I'll reach out when we get confirmation from them. In the meantime I'll try collecting some more information.

@bwoebi
Copy link
Member

bwoebi commented Apr 18, 2024

After investiagting together with @mtrop-godaddy, we figured that the stack size of some observed function calls (INIT_FCALL) was computed wrongly.

Disabling pass 4 fixes it (opcache.optimization_level=0xFFFFFFFF7). Normally pass 12 fixes these up again this, but for some unknown reason it didn't here.

In any case, here pass 4 didn't take ZEND_OBSERVER_ENABLED into account and will compute a stack size missing the observer temporary.

@bwoebi bwoebi self-assigned this Apr 20, 2024
bwoebi added a commit to bwoebi/php-src that referenced this issue Apr 20, 2024
Instead of fixing up temporaries count in between observer steps, just take ZEND_ACC_DONE_PASS_TWO into account during stack_size calculation.
Introducing zend_vm_calc_ct_used_stack for that use case.

This should be much less susceptible to forgetting to handle the ZEND_OBSERVER_ENABLED temporary explicitly.
bwoebi added a commit to bwoebi/php-src that referenced this issue Apr 20, 2024
Instead of fixing up temporaries count in between observer steps, just take ZEND_ACC_DONE_PASS_TWO into account during stack_size calculation.
Introducing zend_vm_calc_ct_used_stack for that use case.

This should be much less susceptible to forgetting to handle the ZEND_OBSERVER_ENABLED temporary explicitly.
@mtrop-godaddy
Copy link
Author

Just wanted to confirm we're no longer seeing any crashes in the affected app after smoke testing it with the 8.3.4 release of PHP and patches in #14018 applied on top of that.

@scottbuscemi
Copy link

@mtrop-godaddy Are there discussions of this resolution being worked on for PHP 8.2 elsewhere, or is this the only place?

@mtrop-godaddy
Copy link
Author

@scottbuscemi This is the only place as far as I'm aware.

bwoebi added a commit to bwoebi/php-src that referenced this issue Jun 1, 2024
Instead of fixing up temporaries count in between observer steps, just take ZEND_ACC_DONE_PASS_TWO into account during stack_size calculation.
Introducing zend_vm_calc_ct_used_stack for that use case.

This should be much less susceptible to forgetting to handle the ZEND_OBSERVER_ENABLED temporary explicitly.
@bazza
Copy link

bazza commented Jun 29, 2024

Hey @bwoebi! Thanks for the patch. Are there plans to have this merged upstream? I guess it's starting to create more problems as the Observer API adoption/usage increases.

@bwoebi bwoebi closed this as completed in a18df90 Jul 22, 2024
@dsmith4-godaddy
Copy link

New versions of 8.2 and 8.3 were released a few days ago, but neither contained this fix. Is that because the merge happened just a day or two before the release? (While the fix is clearly in 8.4, I think some of us are still hoping to see it in 8.2 and 8.3.)

@bwoebi
Copy link
Member

bwoebi commented Aug 12, 2024

Yes, the bugfix is in the upcoming release. It did not make it in time for the RC of the current release.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

8 participants