i#4271: Fix signal-syscall race handling on AArch64. #4453

Merged (13 commits) on Oct 2, 2020

Conversation

@abhinav92003 (Contributor) commented on Sep 28, 2020:

Fixes the logic to detect the two jmps preceding a syscall on AArch64. These jmps are used to bypass the syscall, which bounds the time before exiting the code cache and hence before delivering the signal.

Enables the linux.signal_racesys test on AArch64, which reproduces the assert failure in the current implementation. This required adding the missing check for pending signals before entering the fcache in append_fcache_enter_prologue.

Makes the signals_pending char explicitly signed, using the existing sbyte alias. This is required because char is unsigned by default on ARM. Also adds support for OP_ldrsb instructions on AArch64 to load this value.

Fixes the skip_pc detection logic on ARM32. The existing logic does not work when skip_pc points to the exit cti_pc, which prevents fragment re-linking and may cause performance issues.

Fixes: #4271, #2043

Fixes logic to detect the two jmps preceding a syscall. These jmps are used to bypass the syscall, which bounds the time before exiting code cache and hence delivering the signal.

Also, enables the linux.signal_racesys test on AArch64. This required adding the missing implementation to check for pending signals before entering fcache.

Issue: #2043, #4271
Fixes: #4271
char is unsigned by default on ARM-based machines (https://www.arm.linux.org.uk/docs/faqs/signedchar.php). This leads to incorrect handling of the signals_pending variable, which is negative when a signal is being handled.
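
As a minimal illustration of this signedness pitfall (hypothetical variable names, not DynamoRIO source): on the ARM/AArch64 Linux ABIs a plain char is unsigned, so a negative flag stored in one can never test below zero, whereas an explicitly signed char (what the sbyte alias provides) can:

    #include <stdio.h>

    int main(void)
    {
        char plain_pending = -1;        /* stores 255 where char is unsigned (ARM/AArch64) */
        signed char sbyte_pending = -1; /* stores -1 on every ABI */

        if (plain_pending < 0)
            printf("plain char: negative\n");   /* never printed on ARM/AArch64 Linux */
        if (sbyte_pending < 0)
            printf("signed char: negative\n");  /* always printed */
        return 0;
    }
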
@abhinav92003 (Contributor, Author) commented:

run arm tests

@abhinav92003 (Contributor, Author) commented:

I was able to reproduce the original assert failure from #4271 on linux.signal_racesys, and fix it. However, I'm now seeing a new failure in some other tests:

Internal Error: DynamoRIO debug check failure: /var/lib/jenkins/.jenkins/workspace/DynamoRIO-AArch64-Precommit/core/unix/signal.c:6354 (app_pc)sc->SC_XIP != next_pc

This comes up maybe ~half of the time on linux.signal0111.

Based on the comment at the assert, this is probably because the app stack is too small.

ASSERT((app_pc)sc->SC_XIP != next_pc);

@derekbruening any feedback on the current set of changes in this PR, or how they could be related to this assert failure?

@abhinav92003 marked this pull request as ready for review on September 29, 2020 at 08:57
@derekbruening (Contributor) commented:

> I was able to reproduce the original assert failure from #4271 on linux.signal_racesys, and fix it. However, I'm now seeing a new failure in some other tests:
>
> Internal Error: DynamoRIO debug check failure: /var/lib/jenkins/.jenkins/workspace/DynamoRIO-AArch64-Precommit/core/unix/signal.c:6354 (app_pc)sc->SC_XIP != next_pc
>
> This comes up maybe ~half of the time on linux.signal0111.
>
> Based on the comment at the assert, this is probably because the app stack is too small.
>
> ASSERT((app_pc)sc->SC_XIP != next_pc);
>
> @derekbruening any feedback on the current set of changes in this PR, or how they could be related to this assert failure?

Not sure I understand that comment... if the stack is too small to fit the frame, the frame will overflow the stack and either hit unreadable memory like a guard page, in which case we would send SIGSEGV and likely kill the process, or, if there really is no guard or adjacent unreadable memory, just clobber something.

@derekbruening (Contributor) commented:

> abhinav92003 requested a review from derekbruening 8 hours ago

Sorry, missed this in the thread of other notifications. Looking now.

@derekbruening (Contributor) left a review:

Could we share the append_fcache_enter_prologue code?

Inline review threads:
core/arch/aarch64/emit_utils.c (outdated)
core/globals.h (3 outdated threads)
suite/tests/CMakeLists.txt (2 outdated threads)
core/arch/mangle_shared.c
@derekbruening (Contributor) commented:

> Internal Error: DynamoRIO debug check failure: /var/lib/jenkins/.jenkins/workspace/DynamoRIO-AArch64-Precommit/core/unix/signal.c:6354 (app_pc)sc->SC_XIP != next_pc
>
> Not sure I understand that comment... if the stack is too small to fit the frame, the frame will overflow the stack and either hit unreadable memory like a guard page, in which case we would send SIGSEGV and likely kill the process, or, if there really is no guard or adjacent unreadable memory, just clobber something.

Trying to understand the assert: so it's making sure the passed-to-app sc->pc != fcache_return? Confused.

…es."

The earlier failure I saw was probably an unrelated one: i#4456.

This reverts commit 4bfe2b2.
The existing logic does not work when skip_pc is pointing to the exit cti_pc. This may cause issues during re-linking.
@abhinav92003 (Contributor, Author) commented:

> Trying to understand the assert: so it's making sure the passed-to-app sc->pc != fcache_return? Confused.

It seems so. Based on the comment there, it signifies that the inner frame was overwritten during handling of a nested signal?

I added some logs to handle_sigreturn:

MYLOG: incoming sc->SC_XIP is 281473184303696,  sc->fault_address is 281473185361920
MYLOG BEFORE ASSERT: sc->SC_XIP is 281472031181952, next_pc is 281473184303696

MYLOG: incoming sc->SC_XIP is 281473184474844,  sc->fault_address is 281473185361920
MYLOG BEFORE ASSERT: sc->SC_XIP is 281472031181952, next_pc is 281473184474844

MYLOG: incoming sc->SC_XIP is 281473184467436,  sc->fault_address is 281473185361920
MYLOG BEFORE ASSERT: sc->SC_XIP is 281472031181952, next_pc is 281473184467436

Sending SIGUSR2
MYLOG: incoming sc->SC_XIP is 281473184314064,  sc->fault_address is 281473185361920
MYLOG BEFORE ASSERT: sc->SC_XIP is 281472031181952, next_pc is 281473184314064

Sending SIGUSR1
in signal handler
Got SIGUSR1
MYLOG: incoming sc->SC_XIP is 281473184226456,  sc->fault_address is 281473185361920
MYLOG BEFORE ASSERT: sc->SC_XIP is 281472031181952, next_pc is 281473184226456

Sending SIGRTMAX
in signal handler
Got SIGRTMAX
MYLOG: incoming sc->SC_XIP is 281473184226456,  sc->fault_address is 281473185361920
MYLOG BEFORE ASSERT: sc->SC_XIP is 281472031181952, next_pc is 281473184226456

// the sc->SC_XIP here seems far away from the ones in other logs
MYLOG: incoming sc->SC_XIP is 1899019016,  sc->fault_address is 281473185361920
MYLOG BEFORE ASSERT: sc->SC_XIP is 281472031181952, next_pc is 1899019016

MYLOG: incoming sc->SC_XIP is 281472031181952,  sc->fault_address is 281473185361920
MYLOG BEFORE ASSERT: sc->SC_XIP is 281472031181952, next_pc is 281472031181952

The second-to-last sc->SC_XIP (or sc->pc) stands out: it's much farther away from the other SC_XIPs in the logs. In runs where the assertion is not hit, this address doesn't appear.
gdb shows the contents around 1899019016 (0x7130bb08):

(gdb) x/10i 1899019008
   0x7130bb00 <dynamorio_clone>:	bl	0x7130b7c0 <unexpected_return>
   0x7130bb04 <dynamorio_sigreturn>:	mov	w8, #0x8b                  	// #139
   0x7130bb08 <dynamorio_sigreturn+4>:	svc	#0x0 // <-----
   0x7130bb0c <dynamorio_sigreturn+8>:	bl	0x7130b7c0 <unexpected_return>
   0x7130bb10 <dynamorio_sys_exit>:	mov	w0, #0x0                   	// #0
   0x7130bb14 <dynamorio_sys_exit+4>:	mov	w8, #0x5d                  	// #93
   0x7130bb18 <dynamorio_sys_exit+8>:	svc	#0x0
   0x7130bb1c <dynamorio_sys_exit+12>:	bl	0x7130b7c0 <unexpected_return>

Trying to find out what that signifies.

Also, there's an NYI comment where SC_XIP is defined. Not sure what that's for.

/* FIXME i#1569: NYI */

@derekbruening (Contributor) commented:

So a signal came in while executing dynamorio_sigreturn? What are the signal #'s of the prior one and the new one? The new one is asynchronous (right?), so why wasn't it delayed? And wouldn't the interrupted context still be dynamorio_sigreturn and not fcache_return? Etc.

> Also, there's an NYI comment where SC_XIP is defined. Not sure what that's for.

Looks like an obsolete comment.

@abhinav92003 (Contributor, Author) commented:

> Internal Error: DynamoRIO debug check failure: /var/lib/jenkins/.jenkins/workspace/DynamoRIO-AArch64-Precommit/core/unix/signal.c:6354 (app_pc)sc->SC_XIP != next_pc

Following up on this failure in linux.signal0111:

It seems that the new append_fcache_enter_prologue code I added in this PR isn't working as intended for negative signals_pending values: it should enter the fcache, but doesn't. The CMP signals_pending LE 0 check is still not working.

A hang in linux.signest was fixed when I made signals_pending an explicitly signed char in 75b1231. But it seems that doesn't help with the comparison in assembly, which is causing the ASSERT(sc->SC_XIP != next_pc) failure in linux.signal0111 (and a few others too).

To verify this, I added a CMP signals_pending EQ 255 check too, which fixes linux.signal0111 (no failure in 50 runs).

Let me figure out a good fix to push. I'll also add an assert(signals_pending > 0) at:

if (dcontext->signals_pending) {
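
To illustrate why the emitted check misbehaves, here is a minimal C sketch (hypothetical variable names, not DynamoRIO source) contrasting a zero-extending byte load (ldrb-style) with a sign-extending one (ldrsb-style, what OP_ldrsb provides), under the assumption that signals_pending holds a negative value such as -1 while a signal is being handled:

    #include <assert.h>

    int main(void)
    {
        /* Hypothetical stand-in for the one-byte signals_pending field,
         * holding a negative value (e.g. -1) while a signal is being handled. */
        unsigned char stored_byte = (unsigned char)-1; /* 0xFF in memory */

        /* Zero-extending load (ldrb-style): the register sees 255, so a
         * "compare <= 0, then enter fcache" check never takes the branch. */
        int zero_extended = stored_byte;
        assert(zero_extended == 255);

        /* Sign-extending load (ldrsb-style): the register sees -1 (on the
         * usual two's-complement ABIs), so the "<= 0" comparison behaves
         * as intended. */
        int sign_extended = (signed char)stored_byte;
        assert(sign_extended == -1);
        return 0;
    }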

@abhinav92003 (Contributor, Author) commented:

> Let me figure out a good fix to push.

I added support for OP_ldrsb on AArch64, and used that to load the signals_pending signed char in append_fcache_enter_prologue. That fixed it. Will commit the fix soon.

@derekbruening (Contributor) commented:

As requested offline, I ran 3 tests 20x each. The 2 signalnnnn ones passed all 20 times, but ptsig failed on the 10th run:

$ ctest --repeat-until-fail 20 -R 'signal0111|signal1111|client\.timer|ptsig'
Test project /home/derek/dr/git/build_dbg_tests
    Start 35: code_api|linux.signal0111
    Test #35: code_api|linux.signal0111 ........   Passed    3.03 sec
    Start 35: code_api|linux.signal0111
    Test #35: code_api|linux.signal0111 ........   Passed    3.11 sec
    Start 35: code_api|linux.signal0111
    Test #35: code_api|linux.signal0111 ........   Passed    3.05 sec
    Start 35: code_api|linux.signal0111
    Test #35: code_api|linux.signal0111 ........   Passed    3.19 sec
    Start 35: code_api|linux.signal0111
    Test #35: code_api|linux.signal0111 ........   Passed    3.09 sec
    Start 35: code_api|linux.signal0111
    Test #35: code_api|linux.signal0111 ........   Passed    3.06 sec
    Start 35: code_api|linux.signal0111
    Test #35: code_api|linux.signal0111 ........   Passed    3.08 sec
    Start 35: code_api|linux.signal0111
    Test #35: code_api|linux.signal0111 ........   Passed    3.03 sec
    Start 35: code_api|linux.signal0111
    Test #35: code_api|linux.signal0111 ........   Passed    3.04 sec
    Start 35: code_api|linux.signal0111
    Test #35: code_api|linux.signal0111 ........   Passed    3.10 sec
    Start 35: code_api|linux.signal0111
    Test #35: code_api|linux.signal0111 ........   Passed    3.08 sec
    Start 35: code_api|linux.signal0111
    Test #35: code_api|linux.signal0111 ........   Passed    3.04 sec
    Start 35: code_api|linux.signal0111
    Test #35: code_api|linux.signal0111 ........   Passed    3.06 sec
    Start 35: code_api|linux.signal0111
    Test #35: code_api|linux.signal0111 ........   Passed    3.05 sec
    Start 35: code_api|linux.signal0111
    Test #35: code_api|linux.signal0111 ........   Passed    3.05 sec
    Start 35: code_api|linux.signal0111
    Test #35: code_api|linux.signal0111 ........   Passed    3.11 sec
    Start 35: code_api|linux.signal0111
    Test #35: code_api|linux.signal0111 ........   Passed    3.13 sec
    Start 35: code_api|linux.signal0111
    Test #35: code_api|linux.signal0111 ........   Passed    3.16 sec
    Start 35: code_api|linux.signal0111
    Test #35: code_api|linux.signal0111 ........   Passed    3.15 sec
    Start 35: code_api|linux.signal0111
1/3 Test #35: code_api|linux.signal0111 ........   Passed    3.17 sec
    Start 43: code_api|linux.signal1111
    Test #43: code_api|linux.signal1111 ........   Passed    3.21 sec
    Start 43: code_api|linux.signal1111
    Test #43: code_api|linux.signal1111 ........   Passed    3.34 sec
    Start 43: code_api|linux.signal1111
    Test #43: code_api|linux.signal1111 ........   Passed    3.27 sec
    Start 43: code_api|linux.signal1111
    Test #43: code_api|linux.signal1111 ........   Passed    3.21 sec
    Start 43: code_api|linux.signal1111
    Test #43: code_api|linux.signal1111 ........   Passed    3.29 sec
    Start 43: code_api|linux.signal1111
    Test #43: code_api|linux.signal1111 ........   Passed    3.30 sec
    Start 43: code_api|linux.signal1111
    Test #43: code_api|linux.signal1111 ........   Passed    3.27 sec
    Start 43: code_api|linux.signal1111
    Test #43: code_api|linux.signal1111 ........   Passed    3.21 sec
    Start 43: code_api|linux.signal1111
    Test #43: code_api|linux.signal1111 ........   Passed    3.35 sec
    Start 43: code_api|linux.signal1111
    Test #43: code_api|linux.signal1111 ........   Passed    3.21 sec
    Start 43: code_api|linux.signal1111
    Test #43: code_api|linux.signal1111 ........   Passed    3.25 sec
    Start 43: code_api|linux.signal1111
    Test #43: code_api|linux.signal1111 ........   Passed    3.28 sec
    Start 43: code_api|linux.signal1111
    Test #43: code_api|linux.signal1111 ........   Passed    3.27 sec
    Start 43: code_api|linux.signal1111
    Test #43: code_api|linux.signal1111 ........   Passed    3.23 sec
    Start 43: code_api|linux.signal1111
    Test #43: code_api|linux.signal1111 ........   Passed    3.24 sec
    Start 43: code_api|linux.signal1111
    Test #43: code_api|linux.signal1111 ........   Passed    3.35 sec
    Start 43: code_api|linux.signal1111
    Test #43: code_api|linux.signal1111 ........   Passed    3.19 sec
    Start 43: code_api|linux.signal1111
    Test #43: code_api|linux.signal1111 ........   Passed    3.19 sec
    Start 43: code_api|linux.signal1111
    Test #43: code_api|linux.signal1111 ........   Passed    3.28 sec
    Start 43: code_api|linux.signal1111
2/3 Test #43: code_api|linux.signal1111 ........   Passed    3.22 sec
    Start 78: code_api|pthreads.ptsig
    Test #78: code_api|pthreads.ptsig ..........   Passed    1.36 sec
    Start 78: code_api|pthreads.ptsig
    Test #78: code_api|pthreads.ptsig ..........   Passed    1.33 sec
    Start 78: code_api|pthreads.ptsig
    Test #78: code_api|pthreads.ptsig ..........   Passed    1.37 sec
    Start 78: code_api|pthreads.ptsig
    Test #78: code_api|pthreads.ptsig ..........   Passed    1.36 sec
    Start 78: code_api|pthreads.ptsig
    Test #78: code_api|pthreads.ptsig ..........   Passed    1.34 sec
    Start 78: code_api|pthreads.ptsig
    Test #78: code_api|pthreads.ptsig ..........   Passed    1.35 sec
    Start 78: code_api|pthreads.ptsig
    Test #78: code_api|pthreads.ptsig ..........   Passed    1.35 sec
    Start 78: code_api|pthreads.ptsig
    Test #78: code_api|pthreads.ptsig ..........   Passed    1.36 sec
    Start 78: code_api|pthreads.ptsig
    Test #78: code_api|pthreads.ptsig ..........   Passed    1.30 sec
    Start 78: code_api|pthreads.ptsig
    Test #78: code_api|pthreads.ptsig ..........***Failed  Required regular expression not found.Regex=[^Estimation of pi is 3\.142425985001098
$
]  0.96 sec

Let me run HEAD and see whether ptsig is flaky there too.

@derekbruening (Contributor) commented:

Here is the log for that failure:

78/182 Test: code_api|pthreads.ptsig
Command: "/home/derek/dr/git/build_dbg_tests/bin32/drrun" "-s" "90" "-quiet" "-debug" "-killpg" "-stderr_mask" "0xC" "-dumpcore_mask" "0" "-code_api" "--" "/home/derek/dr/git/build_dbg_tests/suite/tests/bin/pthreads.ptsig"
Directory: /home/derek/dr/git/build_dbg_tests/suite/tests
"code_api|pthreads.ptsig" start time: Oct 01 23:35 EDT
Output:
----------------------------------------------------------
<ERROR: master_signal_handler with no siginfo (i#26?): tid=18641, sig=11>
<Application /home/derek/dr/git/build_dbg_tests/suite/tests/bin/pthreads.ptsig (18640). Cannot correctly handle received signal 11 in thread 18641.>
<end of output>
Test time =   0.96 sec
----------------------------------------------------------
Test Fail Reason:
Required regular expression not found.Regex=[^Estimation of pi is 3\.142425985001098
$
]
"code_api|pthreads.ptsig" end time: Oct 01 23:35 EDT
"code_api|pthreads.ptsig" time elapsed: 00:00:00
----------------------------------------------------------

@derekbruening (Contributor) commented:

Going back to 6c63819 in this branch, 20x of ptsig all passed. But running 50x did hit the same #26 failure as above. So probably that is unrelated to this PR.

@derekbruening (Contributor) commented:

OK, so I ran ptsig 100x and had 2/100 fail without this PR and 1/100 fail with it, so I think we're good.

@abhinav92003 (Contributor, Author) commented:

> As requested offline, I ran 3 tests 20x each. The 2 signalnnnn ones passed all 20 times, but ptsig failed on the 10th run:

Thanks a lot for checking, Derek!

I just recalled there were two other tests that were important for this PR:
linux.signal_racesys
linux.signest

These were part of the earlier failures I hit in this PR, so they slipped my mind before :(. Please run them whenever you get the chance. Thanks again.

@derekbruening (Contributor) commented:

> I just recalled there were two other tests that were important for this PR:
> linux.signal_racesys
> linux.signest

For i4271-aarch64-linux-signal-race built with DEBUG=ON:

$ ctest --repeat-until-fail 20 -R 'signal_racesys|signest'
...
1/2 Test #61: code_api|linux.signal_racesys ....   Passed    1.45 sec
...
2/2 Test #66: code_api|linux.signest ...........   Passed    2.66 sec

100% tests passed, 0 tests failed out of 2

@abhinav92003 (Contributor, Author) commented:

> For i4271-aarch64-linux-signal-race built with DEBUG=ON:

Great!

Successfully merging this pull request may close these issues.

ASSERT skip_pc != NULL && cti_pc != NULL on AArch64