Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

64-bit Travis tests failing non-deterministically after Travis upgrade #2641

Closed
derekbruening opened this issue Sep 19, 2017 · 10 comments
Closed

Comments

@derekbruening
Copy link
Contributor

Split from the 32-bit crashes of #2634

For 64-bit, a completely different subset of tests fails each run. The tests have zero output as #2640 shows. I cannot reproduce in a VM that is similar to the Travis VM.

debug-internal-64: 273 tests passed, **** 2 tests failed: ****
	code_api|linux.clone 
	code_api|linux.exit 

https://travis-ci.org/DynamoRIO/dynamorio/jobs/276032337

debug-internal-64: 273 tests passed, **** 2 tests failed: ****
	code_api|sample.prefetch 
	code_api|sample.stl_test 

https://travis-ci.org/DynamoRIO/dynamorio/jobs/277148078

debug-internal-64: 265 tests passed, **** 10 tests failed, but ignoring 0 for i: ****
	code_api|linux.signal0000 
	code_api|float_vmbase 
	code_api|client.count-ctis 
	code_api|client.syscall 
	code_api|client.count-bbs 
	code_api|client.exception 
	code_api|client.annotation-concurrency-opt 
	code_api|client.annotation-detection 
	code_api|sample.prefetch 
	code_api|tool.reuse 
@derekbruening
Copy link
Contributor Author

Running the full suite I can't reproduce quite the same thing, but on a Trusty VM with 4.4.0-93 the linux.fib-conflict* tests (and unit_tests) fail consistently.

We have several issues revealed by looking at those:

  1. old_libdr_base is not NULL the 1st time calling privload_early_inject
=> 0x7ffff7c71a1e <reloaded_xfer+12>:	callq  0x7ffff7cbd465 <privload_early_inject>
(gdb) info reg
rax            0x0	0
rbx            0x0	0
rcx            0x7fffffffdd20	140737488346400
rdx            0x7fffffffdd20	140737488346400
rsi            0x7ffff79d8c50	140737347685456
rdi            0x7fffffffdee0	140737488346848
rbp            0x0	0x0
rsp            0x7fffffffdee0	0x7fffffffdee0
r8             0x0	0
r9             0x0	0
r10            0x0	0
r11            0x0	0
r12            0x0	0
r13            0x0	0
r14            0x0	0
r15            0x0	0
rip            0x7ffff7c71a1e	0x7ffff7c71a1e <reloaded_xfer+12>
eflags         0x246	[ PF ZF IF ]
cs             0x33	51
ss             0x2b	43
ds             0x0	0
es             0x0	0
fs             0x0	0
gs             0x0	0
(gdb) c
Continuing.

Breakpoint 3, privload_early_inject (sp=0x7fffffffdee0, old_libdr_base=0x7ffff79d8c50 "\250$a", old_libdr_size=140737488346400)
    at /home/derek/dr/git/src/core/unix/loader.c:1699
1699	    ptr_int_t *argc = (ptr_int_t *)sp;  /* Kernel writes an elf_addr_t. */
1: x/i $pc
=> 0x7ffff7cbd486 <privload_early_inject+33>:	mov    -0xa18(%rbp),%rax
(gdb) p /x old_libdr_size
$3 = 0x7fffffffdd20

relocate_dynamorio is not preserving rsi or rdi even though they're
callee-save.

This is true on laptop too: but the munmap of old_libdr_base just fails b/c
there's nothing there:

Breakpoint 3, privload_early_inject (sp=0x7ffcffba8440, old_libdr_base=0x55bb5ffb6c50 "\250$aq", old_libdr_size=140724598899328)
(gdb) 
1731	        os_unmap_file(old_libdr_base, old_libdr_size);
(gdb) step
os_unmap_file (map=0x55bb5ffb6c50 "\250$aq", size=140724598899328) at src/core/unix/os.c:4459
4459	    long res = munmap_syscall(map, size);
(gdb) next
4460	    return (res == 0);
(gdb) p res
$1 = -22
  1. Note how vvar and vdso are between the segments of libdynamorio:
7ffff79ac000-7ffff7d86000 r-xp 00000000 08:01 159329                     /home/derek/dr/git/build_suite/build_debug-internal-64/lib64/debug/libdynamorio.so
7ffff7f82000-7ffff7f84000 r--p 00000000 00:00 0                          [vvar]
7ffff7f84000-7ffff7f86000 r-xp 00000000 00:00 0                          [vdso]
7ffff7f86000-7ffff7fcf000 rw-p 003da000 08:01 159329                     /home/derek/dr/git/build_suite/build_debug-internal-64/lib64/debug/libdynamorio.so
7ffff7fcf000-7ffff7fff000 rw-p 00000000 00:00 0                          [heap]
7ffffffde000-7ffffffff000 rw-p 00000000 00:00 0                          [stack]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]
(gdb) bt
#0  memquery_library_bounds_by_iterator (name=0x7ffff7d5340a "libdynamorio.so", start=0x7fffffffd468, end=0x7fffffffd460, 
    fullpath=0x7ffff7ff43c0 <dynamorio_library_path> "", path_size=512) at /home/derek/dr/git/src/core/unix/memquery.c:80
#1  0x00007ffff7cbeeb0 in memquery_library_bounds (name=0x7ffff7d5340a "libdynamorio.so", start=0x7fffffffd468, end=0x7fffffffd460, 
    fullpath=0x7ffff7ff43c0 <dynamorio_library_path> "", path_size=512) at /home/derek/dr/git/src/core/unix/memquery_linux.c:316
#2  0x00007ffff7c9df8f in get_dynamo_library_bounds () at /home/derek/dr/git/src/core/unix/os.c:8676
#3  0x00007ffff7c9e5f8 in get_dynamorio_dll_start () at /home/derek/dr/git/src/core/unix/os.c:8847
#4  0x00007ffff7cbd612 in privload_early_inject (sp=0x7fffffffdf10, old_libdr_base=0x0, old_libdr_size=140737488346448)
    at /home/derek/dr/git/src/core/unix/loader.c:1760
#5  0x00007ffff7c71a2a in reloaded_xfer () at /home/derek/dr/git/src/core/arch/x86/x86.asm:1187
#6  0x0000000000000001 in ?? ()
#7  0x00007fffffffe28e in ?? ()
#8  0x0000000000000000 in ?? ()
<...>
(gdb) 
204	        ASSERT_CURIOSITY(iter.vm_start == cur_end /* no gaps, FIXME might there be
(gdb) p iter.vm_start
$13 = (app_pc) 0x7ffff7f82000 <error: Cannot access memory at address 0x7ffff7f82000>
(gdb) p cur_end
$14 = (app_pc) 0x7ffff7d86000 <error: Cannot access memory at address 0x7ffff7d86000>
(gdb) info local
count = 1
found_library = true
libname = "/home/derek/dr/git/build_x64_dbg_tests/lib64/debug/libdynamorio.so", '\000' <repeats 445 times>
name_cmp = 0x7ffff7d5340a "libdynamorio.so"
iter = {vm_start = 0x7ffff7f82000 <error: Cannot access memory at address 0x7ffff7f82000>, vm_end = 0x7ffff7f84000 "\177ELF\002\001\001", 
  prot = 1, offset = 0, inode = 0, comment = 0x7ffff7ff81c0 <comment_buf_scratch> "[vvar]", may_alloc = false, 
  internal = "\004\000\000\000\000\000\000\000{r\377\367\377\177\000\000\255\002\000\000\a\020\000\000\240q\377\367\377\177\000\000\300\201\377\367\377\177", '\000' <repeats 57 times>}
target = 0x7ffff79ac000 "\177ELF\002\001\001"
last_lib_base = 0x7ffff79ac000 "\177ELF\002\001\001"
last_lib_end = 0x7ffff7d86000 <error: Cannot access memory at address 0x7ffff7d86000>
prev_base = 0x7ffff79ac000 "\177ELF\002\001\001"
prev_end = 0x7ffff7d86000 <error: Cannot access memory at address 0x7ffff7d86000>
prev_prot = 5
image_size = 6631424
cur_end = 0x7ffff7d86000 <error: Cannot access memory at address 0x7ffff7d86000>
mod_start = 0x7ffff79ac000 "\177ELF\002\001\001"
__FUNCTION__ = "memquery_library_bounds_by_iterator"
(gdb) next
[Inferior 1 (process 46247) exited with code 0377]

So the assert can't run that early.

  1. linux.fib-conflict crashes, incl natively
derek@ubuntu:~/dr/git/build_x64_dbg_tests$ bin64/drrun -- suite/tests/bin/linux.fib-conflict
<Starting application /home/derek/dr/git/build_x64_dbg_tests/suite/tests/bin/linux.fib-conflict (47384)>
<Paste into GDB to debug DynamoRIO clients:
set confirm off
add-symbol-file '/home/derek/dr/git/build_x64_dbg_tests/lib64/debug/libdynamorio.so' 0x00007fe47dd35c68
>
<Initial options = -no_dynamic_options -code_api -stack_size 56K -signal_stack_size 32K -max_elide_jmp 0 -max_elide_call 0 -early_inject -emulate_brk -no_inline_ignored_syscalls -native_exec_default_list '' -no_native_exec_managed_code -no_indcall2direct >
<(1+x) Handling our fault in a TRY at 0x00007fe47dfcee0a>
<Application tried to execute from unreadable memory 0x0000000000002330.
This may be a result of an unsuccessful attack or a potential application vulnerability.>
<Application /home/derek/dr/git/build_x64_dbg_tests/suite/tests/bin/linux.fib-conflict (47384).  Application exception at PC 0x0000000000002330.  
Signal 11 delivered to application as default action.
Callstack:
	0x0000000000002330  
>
<Stopping application /home/derek/dr/git/build_x64_dbg_tests/suite/tests/bin/linux.fib-conflict (47384)>
Segmentation fault (core dumped)
derek@ubuntu:~/dr/git/build_x64_dbg_tests$ suite/tests/bin/linux.fib-conflict
Segmentation fault
derek@ubuntu:~/dr/git/build_x64_dbg_tests$ gdb suite/tests/bin/linux.fib-conflict
(gdb) run
Starting program: /home/derek/dr/git/build_x64_dbg_tests/suite/tests/bin/linux.fib-conflict 
During startup program terminated with signal SIGSEGV, Segmentation fault.

@derekbruening derekbruening self-assigned this Sep 24, 2017
@derekbruening
Copy link
Contributor Author

derekbruening commented Sep 24, 2017

Another issue from vvar+vdso inside libdynamorio: handle is_in_dynamo_dll(). The problem is that this is all pre-heap, which is why we aren't using the general code for app libs that makes an array of segments and handles gaps.

@derekbruening
Copy link
Contributor Author

I tried a Travis run fixing the non-NULL old_libdr_base and disabling the asserts and it does get further: the handful of failures are now like this:

debug-internal-64: 271 tests passed, **** 4 tests failed: ****
	code_api|linux.signalfd =>  Application /home/travis/build/DynamoRIO/dynamorio/build_debug-internal-64/suite/tests/bin/linux.signalfd (20541).  Internal Error: DynamoRIO debug check failure: /home/travis/build/DynamoRIO/dynamorio/core/heap.c:1478 IF_WINDOWS(doing_detach || ) IF_DEBUG(dynamo_thread_init_during_process_exit || ) heapmgt-vmheap.num_free_blocks == heapmgt-vmheap.num_blocks - unfreed_blocks || ((ever_beyond_vmm IF_WINDOWS(|| get_os_version() = WINDOWS_VERSION_8_1))  heapmgt-vmheap.num_free_blocks = heapmgt-vmheap.num_blocks - unfreed_blocks) 
	code_api|linux.clone-reset =>  Application /home/travis/build/DynamoRIO/dynamorio/build_debug-internal-64/suite/tests/bin/linux.clone (20582).  Internal Error: DynamoRIO debug check failure: /home/travis/build/DynamoRIO/dynamorio/core/heap.c:1478 IF_WINDOWS(doing_detach || ) IF_DEBUG(dynamo_thread_init_during_process_exit || ) heapmgt-vmheap.num_free_blocks == heapmgt-vmheap.num_blocks - unfreed_blocks || ((ever_beyond_vmm IF_WINDOWS(|| get_os_version() = WINDOWS_VERSION_8_1))  heapmgt-vmheap.num_free_blocks = heapmgt-vmheap.num_blocks - unfreed_blocks) 
	code_api|client.cbr-retarget =>  Application /home/travis/build/DynamoRIO/dynamorio/build_debug-internal-64/suite/tests/bin/client.cbr-retarget (21086).  Internal Error: DynamoRIO debug check failure: /home/travis/build/DynamoRIO/dynamorio/core/heap.c:1478 IF_WINDOWS(doing_detach || ) IF_DEBUG(dynamo_thread_init_during_process_exit || ) heapmgt-vmheap.num_free_blocks == heapmgt-vmheap.num_blocks - unfreed_blocks || ((ever_beyond_vmm IF_WINDOWS(|| get_os_version() = WINDOWS_VERSION_8_1))  heapmgt-vmheap.num_free_blocks = heapmgt-vmheap.num_blocks - unfreed_blocks) 
	code_api|sample.countcalls =>  Application /home/travis/build/DynamoRIO/dynamorio/build_debug-internal-64/suite/tests/bin/common.eflags (21181).  Internal Error: DynamoRIO debug check failure: /home/travis/build/DynamoRIO/dynamorio/core/heap.c:1478 IF_WINDOWS(doing_detach || ) IF_DEBUG(dynamo_thread_init_during_process_exit || ) heapmgt-vmheap.num_free_blocks == heapmgt-vmheap.num_blocks - unfreed_blocks || ((ever_beyond_vmm IF_WINDOWS(|| get_os_version() = WINDOWS_VERSION_8_1))  heapmgt-vmheap.num_free_blocks = heapmgt-vmheap.num_blocks - unfreed_blocks) 
69: Test command: /home/travis/build/DynamoRIO/dynamorio/build_debug-internal-64/bin64/drrun "-s" "90" "-quiet" "-debug" "-killpg" "-stderr_mask" "0xC" "-dumpcore_mask" "0" "-code_api" "--" "/home/travis/build/DynamoRIO/dynamorio/build_debug-internal-64/suite/tests/bin/linux.signalfd"
69: Test timeout computed to be: 600
69: <Application /home/travis/build/DynamoRIO/dynamorio/build_debug-internal-64/suite/tests/bin/linux.signalfd (20541).  Internal Error: DynamoRIO debug check failure: /home/travis/build/DynamoRIO/dynamorio/core/heap.c:1478 IF_WINDOWS(doing_detach || ) IF_DEBUG(dynamo_thread_init_during_process_exit || ) heapmgt->vmheap.num_free_blocks == heapmgt->vmheap.num_blocks - unfreed_blocks || ((ever_beyond_vmm IF_WINDOWS(|| get_os_version() >= WINDOWS_VERSION_8_1)) && heapmgt->vmheap.num_free_blocks >= heapmgt->vmheap.num_blocks - unfreed_blocks)
69: (Error occurred @224 frags)
69: version 6.2.17433, custom build
69: -no_dynamic_options -code_api -stderr_mask 12 -stack_size 56K -signal_stack_size 32K -max_elide_jmp 0 -max_elide_call 0 -early_inject -emulate_brk -no_inline_ignored_syscalls -native_exec_default_list '' -no_native_exec_managed_code -no_indcall2direct 
69: 0x000000004e77d240 0x00007fcb197a8d19
69: 0x000000004e77d480 0x00007fcb19837f12
69: 0x000000004e77d4b0 0x00007fcb1972028c
69: 0x000000004e77d4f0 0x00007fcb197205c0
69: 0x000000004e77d520 0x00007fcb19720749
69: 0x000000004e77d550 0x00007fcb19991ef2
69: 0x000000004e77d6a0 0x00007fcb19992227
69: 0x000000004e77d6d0 0x00007fcb19989d45
69: 0x000000004e77d740 0x00007fcb1998f480
69: 0x000000004e77d7c0 0x00007fcb1998c714
69: 0x000000004e77da00 0x00007fcb1998f139
69: 0x000000004e77dab0 0x00007fcb19951a40
69: 0x00007fffc97e51e0 0x00007fcb1947d6f5>

They're all very similar: they exit around 224 fragments in.
Callstacks are all similar.
Trying to figure out callstack using local build.

@derekbruening
Copy link
Contributor Author

derekbruening commented Sep 26, 2017

Since I could not reproduce anywhere else, I resorted to running all tests at -loglevel 2 and adding scripting in runsuite to look through all the logs and print relevant info from the failing tests.

Here's the info:

test 13
        Start  13: code_api|common.eflags

13: Test command: /home/travis/build/DynamoRIO/dynamorio/build_debug-internal-64/bin64/drrun "-s" "90" "-quiet" "-debug" "-killpg" "-loglevel" "2" "-stderr_mask" "0xC" "-dumpcore_mask" "0" "-code_api" "--" "/home/travis/build/DynamoRIO/dynamorio/build_debug-internal-64/suite/tests/bin/common.eflags"
13: Test timeout computed to be: 600
13: <Application /home/travis/build/DynamoRIO/dynamorio/build_debug-internal-64/suite/tests/bin/common.eflags (19777).  Internal Error: DynamoRIO debug check failure: /home/travis/build/DynamoRIO/dynamorio/core/heap.c:1478 IF_WINDOWS(doing_detach || ) IF_DEBUG(dynamo_thread_init_during_process_exit || ) heapmgt->vmheap.num_free_blocks == heapmgt->vmheap.num_blocks - unfreed_blocks || ((ever_beyond_vmm IF_WINDOWS(|| get_os_version() >= WINDOWS_VERSION_8_1)) && heapmgt->vmheap.num_free_blocks >= heapmgt->vmheap.num_blocks - unfreed_blocks)
13: (Error occurred @224 frags; DR base = 0x00007f45091fb000, dispatch = 0x00007f45092fffaf)
13: version 6.2.17434, custom build
13: -no_dynamic_options -loglevel 2 -code_api -stderr_mask 12 -stack_size 56K -signal_stack_size 32K -max_elide_jmp 0 -max_elide_call 0 -early_inject -emulate_brk -no_inline_ignored_syscalls -native_exec_default_list '' -no_native_exec_managed_code -no_indcall2direct 
13: 0x0000000046c76230 0x00007f4509317d32
13: 0x0000000046c76480 0x00007f45093a6f2b
13: 0x0000000046c764b0 0x00007f450928f28c
13: 0x0000000046c764f0 0x00007f450928f5c0
13: 0x0000000046c76520 0x00007f450928f749
13: 0x0000000046c76550 0x00007f4509500f0b
13: 0x0000000046c766a0 0x00007f4509501240
13: 0x0000000046c766d0 0x00007f45094f8d5e
13: 0x0000000046c76740 0x00007f45094fe499
13: 0x0000000046c767c0 0x00007f45094fb72d
13: 0x0000000046c76a00 0x00007f45094fe152
13: 0x0000000046c76ab0 0x00007f45094c0a59
13: 0x00007ffd039fb180 0x00007f4508fec6f5>
 15/275 Test  #13: code_api|common.eflags .......................................***Failed 
Fragment 224, tag 0x00007f4508fd80f0, flags 0x9000630, shared, size 57:
        
Entry into F224(0x00007f4508fd80f0).0x0000000046c0897c (shared)

master_signal_handler: sig=11, retaddr=0x00007f45094c0a59
computing memory target for 0x0000000046c0898b causing SIGSEGV, kernel claims it is 0x00007f45097d5038
memory operand 0 has address 0x00007f45097d5038 and size 2
For SIGSEGV at cache pc 0x0000000046c0898b, computed target read 0x00007f45097d5038
        faulting instr: movzx  0x38(%rdx)[2byte] -> %edi
+* Received SIGSEGV at cache pc 0x0000000046c0898b in thread 19777
record_pending_signal(11) from cache pc 0x0000000046c0898b
        not certain can delay so handling now
translate context, thread 19777 at pc_recreatable spot translating
recreate_app_state -- translating from:
priv_mcontext_t @0x0000000046c76510
        xax = 0x00007f45091f9760
        xbx = 0x0000000000000000
        xcx = 0x0000000000000000
        xdx = 0x00007f45097d5000
        xsi = 0x00007f4508ff1583
        xdi = 0x00007f45091f9bf1
        xbp = 0x00007ffd039fb180
        xsp = 0x00007ffd039fb080
        r8  = 0x0000000000000000
        r9  = 0x0000000000000000
        r10 = 0x000000037ffffa00
        r11 = 0x000000006ffffeff
        r12 = 0x00007f45091f9760
        r13 = 0x0000000000000001
        r14 = 0x000000006fffffff
        r15 = 0x00007f45091f91c8
        ymm0= 0x0000000000000000000000000000000000000000000000000000000000000000
        ymm1= 0x0000000000000000000000000000000000000000000000000000000000000000
        ymm2= 0x0000000000000000000000000000000000000000000000000000000000000000
        ymm3= 0x0000000000000000000000000000000000000000000000000000000000000000
        ymm4= 0x0000000000000000000000000000000000000000000000000000000000000000
        ymm5= 0x0000000000000000000000000000000000000000000000000000000000000000
        ymm6= 0x0000000000000000000000000000000000000000000000000000000000000000
        ymm7= 0x0000000000000000000000000000000000000000000000000000000000000000
        ymm8= 0x0000000000000000000000000000000000000000000000000000000000000000
        ymm9= 0x0000000000000000000000000000000000000000000000000000000000000000
        ymm10= 0x0000000000000000000000000000000000000000000000000000000000000000
        ymm11= 0x0000000000000000000000000000000000000000000000000000000000000000
        ymm12= 0x0000000000000000000000000000000000000000000000000000000000000000
        ymm13= 0x0000000000000000000000000000000000000000000000000000000000000000
        ymm14= 0x0000000000000000000000000000000000000000000000000000000000000000
        ymm15= 0x0000000000000000000000000000000000000000000000000000000000000000
        mxcsr=0x00001f80
        eflags = 0x0000000000010246
        pc     = 0x0000000046c0898b
recreate_app : pc is in F224(0x00007f4508fd80f0)
ilist for recreation:
TAG  0x00007f4508fd80f0
 +0    m4 @0x0000000046c994a0  48 ba 78 7d 1f 09 45 mov    $0x00007f45091f7d78 -> %rdx
                               7f 00 00
 +10   L3                      48 8b 12             mov    (%rdx)[8byte] -> %rdx
 +13   L3                      31 c9                xor    %ecx %ecx -> %ecx
 +15   L3                      0f b7 7a 38          movzx  0x38(%rdx)[2byte] -> %edi
 +19   L3                      48 89 d0             mov    %rdx -> %rax
 +22   L3                      48 03 42 20          add    0x20(%rdx)[8byte] %rax -> %rax
 +26   L3                      48 85 ff             test   %rdi %rdi
 +29   L3                      49 89 84 24 a0 02 00 mov    %rax -> 0x000002a0(%r12)[8byte]
                               00
 +37   L3                      66 41 89 bc 24 b0 02 data16 mov    %di -> 0x000002b0(%r12)[2byte]
                               00 00
 +46   L4 @0x0000000046c99518  0f 85 ed 23 36 c2    jnz    $0x00007f4508fd8143
 +52   L4 @0x0000000046c5de08  e9 c5 23 36 c2       jmp    $0x00007f4508fd811a
END 0x00007f4508fd80f0
recreate_app -- found valid state pc 0x00007f4508fd80f9
        translation 0x00007f4508fd80f9 is post-walk 0x00007f4508fd80f0 so not fixing xsp
recreate_app -- found ok pc 0x00007f4508fd80f9
recreate_app_state -- translation is:
priv_mcontext_t @0x0000000046c76510
        xax = 0x00007f45091f9760
        xbx = 0x0000000000000000
        xcx = 0x0000000000000000
        xdx = 0x00007f45097d5000
        xsi = 0x00007f4508ff1583
        xdi = 0x00007f45091f9bf1
        xbp = 0x00007ffd039fb180
        xsp = 0x00007ffd039fb080
        r8  = 0x0000000000000000
        r9  = 0x0000000000000000
        r10 = 0x000000037ffffa00
        r11 = 0x000000006ffffeff
        r12 = 0x00007f45091f9760
        r13 = 0x0000000000000001
        r14 = 0x000000006fffffff
        r15 = 0x00007f45091f91c8
        ymm0= 0x0000000000000000000000000000000000000000000000000000000000000000
        ymm1= 0x0000000000000000000000000000000000000000000000000000000000000000
        ymm2= 0x0000000000000000000000000000000000000000000000000000000000000000
        ymm3= 0x0000000000000000000000000000000000000000000000000000000000000000
        ymm4= 0x0000000000000000000000000000000000000000000000000000000000000000
        ymm5= 0x0000000000000000000000000000000000000000000000000000000000000000
        ymm6= 0x0000000000000000000000000000000000000000000000000000000000000000
        ymm7= 0x0000000000000000000000000000000000000000000000000000000000000000
        ymm8= 0x0000000000000000000000000000000000000000000000000000000000000000
        ymm9= 0x0000000000000000000000000000000000000000000000000000000000000000
        ymm10= 0x0000000000000000000000000000000000000000000000000000000000000000
        ymm11= 0x0000000000000000000000000000000000000000000000000000000000000000
        ymm12= 0x0000000000000000000000000000000000000000000000000000000000000000
        ymm13= 0x0000000000000000000000000000000000000000000000000000000000000000
        ymm14= 0x0000000000000000000000000000000000000000000000000000000000000000
        ymm15= 0x0000000000000000000000000000000000000000000000000000000000000000
        mxcsr=0x00001f80
        eflags = 0x0000000000010246
        pc     = 0x00007f4508fd80f9
Got signal at pc 0x0000000046c0898b in this fragment:
Fragment 224, tag 0x00007f4508fd80f0, flags 0x9000630, shared, size 57:
        
  0x0000000046c0897c  48 ba 78 7d 1f 09 45 mov    $0x00007f45091f7d78 -> %rdx
                      7f 00 00
  0x0000000046c08986  48 8b 12             mov    (%rdx)[8byte] -> %rdx
  0x0000000046c08989  31 c9                xor    %ecx %ecx -> %ecx
  0x0000000046c0898b  0f b7 7a 38          movzx  0x38(%rdx)[2byte] -> %edi
  0x0000000046c0898f  48 89 d0             mov    %rdx -> %rax
  0x0000000046c08992  48 03 42 20          add    0x20(%rdx)[8byte] %rax -> %rax
  0x0000000046c08996  48 85 ff             test   %rdi %rdi
  0x0000000046c08999  49 89 84 24 a0 02 00 mov    %rax -> 0x000002a0(%r12)[8byte]
                      00
  0x0000000046c089a1  66 41 89 bc 24 b0 02 data16 mov    %di -> 0x000002b0(%r12)[2byte]
                      00 00
  0x0000000046c089aa  0f 85 97 e8 01 00    jnz    $0x0000000046c27247
  0x0000000046c089b0  e9 92 e8 01 00       jmp    $0x0000000046c27247
  -------- exit stub 0: -------- <target: 0x00007f4508fd8143> type: jmp/jcc
  0x0000000046c27247  67 65 48 a3 00 00 00 addr32 mov    %rax -> %gs:0x00[8byte]
                      00
  0x0000000046c2724f  48 b8 20 f0 c3 46 00 mov    $0x0000000046c3f020 -> %rax
                      00 00 00
  0x0000000046c27259  e9 62 9b fb ff       jmp    $0x0000000046be0dc0 <fcache_return> 
  -------- exit stub 1: -------- <target: 0x00007f4508fd811a> type: fall-through/speculated/IAT
  0x0000000046c27247  67 65 48 a3 00 00 00 addr32 mov    %rax -> %gs:0x00[8byte]
                      00
  0x0000000046c2724f  48 b8 20 f0 c3 46 00 mov    $0x0000000046c3f020 -> %rax
                      00 00 00
  0x0000000046c27259  e9 62 9b fb ff       jmp    $0x0000000046be0dc0 <fcache_return> 

Going to receive signal now
app signal handler is SIG_DFL: executing default action
SYSLOG_WARNING: Application /home/travis/build/DynamoRIO/dynamorio/build_debug-internal-64/suite/tests/bin/common.eflags (19777).  Application exception at PC 0x00007f4508fd80f9.  
Signal 11 delivered to application as default action.
Callstack:
        0x00007f4508fd80f9   </lib/x86_64-linux-gnu/ld-2.19.so+0x30f9>
        0x00007f4508fec6f5   </lib/x86_64-linux-gnu/ld-2.19.so+0x176f5>

SYSLOG_INFORMATION: Stopping application /home/travis/build/DynamoRIO/dynamorio/build_debug-internal-64/suite/tests/bin/common.eflags (19777)

I found that block in my VM:

interp: start_pc = 0x00007fd11eec30f0
check_thread_vm_area: pc = 0x00007fd11eec30f0
prepend_entry_to_fraglist: putting fragment @0x00007fd11eec30f0 (shared) on vmarea 0x00007fd11eec0000-0x00007fd11eee3000
  0x00007fd11eec30f0  48 8b 15 81 fc 21 00 mov    <rel> 0x00007fd11f0e2d78[8byte] -> %rdx
  0x00007fd11eec30f7  31 c9                xor    %ecx %ecx -> %ecx
        wrote all 6 flags now!
  0x00007fd11eec30f9  0f b7 7a 38          movzx  0x38(%rdx)[2byte] -> %edi
  0x00007fd11eec30fd  48 89 d0             mov    %rdx -> %rax
  0x00007fd11eec3100  48 03 42 20          add    0x20(%rdx)[8byte] %rax -> %rax
  0x00007fd11eec3104  48 85 ff             test   %rdi %rdi
  0x00007fd11eec3107  49 89 84 24 a0 02 00 mov    %rax -> 0x000002a0(%r12)[8byte]
                      00
  0x00007fd11eec310f  66 41 89 bc 24 b0 02 data16 mov    %di -> 0x000002b0(%r12)[2byte]
                      00 00
  0x00007fd11eec3118  75 29                jnz    $0x00007fd11eec3143
end_pc = 0x00007fd11eec311a

<on entry to subsequent bb:>
       xdx = 0x00007ffe2adfe000

It's ld-linux.so, reading from vdso + 0x38:

start=0x00007fd11eec0000 end=0x00007fd11eee3000 prot=5 comment=/lib/x86_64-linux-gnu/ld-2.19.so
start=0x00007fd11eee3000 end=0x00007fd11f0e2000 prot=0 comment=
start=0x00007fd11f0e2000 end=0x00007fd11f0e4000 prot=3 comment=/lib/x86_64-linux-gnu/ld-2.19.so
start=0x00007fd11f0e4000 end=0x00007fd11f0e5000 prot=3 comment=
start=0x00007fd11f0e5000 end=0x00007fd11f0e6000 prot=0 comment=
start=0x00007fd11f0e6000 end=0x00007fd11f4c0000 prot=5 comment=/home/derek/dr/git/build_x64_dbg_tests/lib64/debug/libdynamorio.so
start=0x00007fd11f4c0000 end=0x00007fd11f6bf000 prot=0 comment=
start=0x00007fd11f6bf000 end=0x00007fd11f708000 prot=3 comment=/home/derek/dr/git/build_x64_dbg_tests/lib64/debug/libdynamorio.so
start=0x00007fd11f708000 end=0x00007fd11f738000 prot=3 comment=
start=0x00007ffe2add6000 end=0x00007ffe2adf7000 prot=3 comment=[stack]
start=0x00007ffe2adfc000 end=0x00007ffe2adfe000 prot=1 comment=[vvar]
start=0x00007ffe2adfe000 end=0x00007ffe2ae00000 prot=5 comment=[vdso]
start=0xffffffffff600000 end=0xffffffffff601000 prot=5 comment=[vsyscall]

Adding the maps info to what my script gathers:

Entry into F224(0x00007f18645ea0f0).0x000000004a00997c (shared)

master_signal_handler: sig=11, retaddr=0x00007f1864ad2ad9
computing memory target for 0x000000004a00998b causing SIGSEGV, kernel claims it is 0x00007f1864de7038
memory operand 0 has address 0x00007f1864de7038 and size 2
For SIGSEGV at cache pc 0x000000004a00998b, computed target read 0x00007f1864de7038
        faulting instr: movzx  0x38(%rdx)[2byte] -> %edi
+* Received SIGSEGV at cache pc 0x000000004a00998b in thread 20610
record_pending_signal(11) from cache pc 0x000000004a00998b
        not certain can delay so handling now
translate context, thread 20610 at pc_recreatable spot translating
recreate_app_state -- translating from:
priv_mcontext_t @0x000000004a077510
        xax = 0x00007f186480b760
        xbx = 0x0000000000000000
        xcx = 0x0000000000000000
        xdx = 0x00007f1864de7000

ilist for recreation:
TAG  0x00007f18645ea0f0
 +0    m4 @0x000000004a09a4a0  48 ba 78 9d 80 64 18 mov    $0x00007f1864809d78 -> %rdx
                               7f 00 00
 +10   L3                      48 8b 12             mov    (%rdx)[8byte] -> %rdx
 +13   L3                      31 c9                xor    %ecx %ecx -> %ecx
 +15   L3                      0f b7 7a 38          movzx  0x38(%rdx)[2byte] -> %edi
 +19   L3                      48 89 d0             mov    %rdx -> %rax
 +22   L3                      48 03 42 20          add    0x20(%rdx)[8byte] %rax -> %rax
 +26   L3                      48 85 ff             test   %rdi %rdi
 +29   L3                      49 89 84 24 a0 02 00 mov    %rax -> 0x000002a0(%r12)[8byte]
                               00
 +37   L3                      66 41 89 bc 24 b0 02 data16 mov    %di -> 0x000002b0(%r12)[2byte]
                               00 00
 +46   L4 @0x000000004a09a518  0f 85 ed 33 57 1a    jnz    $0x00007f18645ea143
 +52   L4 @0x000000004a05ee08  e9 c5 33 57 1a       jmp    $0x00007f18645ea11a
END 0x00007f18645ea0f0

start=0x0000000000400000 end=0x0000000000405000 prot=5 comment=/home/travis/build/DynamoRIO/dynamorio/build_debug-internal-64/suite/tests/bin/linux.sigaction_nosignals
start=0x0000000000405000 end=0x0000000000604000 prot=0 comment=
start=0x0000000000604000 end=0x0000000000606000 prot=3 comment=/home/travis/build/DynamoRIO/dynamorio/build_debug-internal-64/suite/tests/bin/linux.sigaction_nosignals
start=0x0000000000606000 end=0x0000000000607000 prot=0 comment=
start=0x000000004a078000 end=0x0000000059fa7000 prot=0 comment=
start=0x00007f18645e7000 end=0x00007f186460a000 prot=5 comment=/lib/x86_64-linux-gnu/ld-2.19.so
start=0x00007f186460a000 end=0x00007f1864809000 prot=0 comment=
start=0x00007f1864809000 end=0x00007f186480b000 prot=3 comment=/lib/x86_64-linux-gnu/ld-2.19.so
start=0x00007f186480b000 end=0x00007f186480c000 prot=3 comment=
start=0x00007f186480c000 end=0x00007f186480d000 prot=0 comment=
start=0x00007f186480d000 end=0x00007f1864be9000 prot=5 comment=/home/travis/build/DynamoRIO/dynamorio/build_debug-internal-64/lib64/debug/libdynamorio.so
=> fault is here:
start=0x00007f1864be9000 end=0x00007f1864de9000 prot=0 comment=
start=0x00007f1864de9000 end=0x00007f1864e32000 prot=3 comment=/home/travis/build/DynamoRIO/dynamorio/build_debug-internal-64/lib64/debug/libdynamorio.so
start=0x00007f1864e32000 end=0x00007f1864e62000 prot=3 comment=
start=0x00007ffc79ddb000 end=0x00007ffc79dfd000 prot=3 comment=[stack]
start=0xffffffffff600000 end=0xffffffffff601000 prot=5 comment=[vsyscall]

Where are vvar and vdso? The crash is referencing the gap in libdynamorio
where you'd expect vdso to be if it matched my fib-conflict from above: so
did we clobber the vvar+vdso that used to be there?

It looks like code from #1659 is the culprit:

        /* i#1659: fill in the text-data segment gap to ensure no mmaps in between.
         * The kernel does not do this.  Our private loader does, so if we reloaded
         * ourselves this is already in place, but it's expensive to query so
         * we blindly clobber w/ another no-access mapping.
         */

The non-determinism must be from the kernel, whether or not
vvar+vdso are placed in the gap. (I revisited my VM and it
happened every time there only under gdb, where ASLR is
disabled.)

I don't think we want to try and avoid the gap via linker flags
or a linker script: best to go with recommended alignment.

I put in a hack to reload DR if vvar is in the gap (plus a hack
to not unload the old lib b/c that blindly clobbers vvar too) and it fixed the problem as Travis became green:
https://travis-ci.org/DynamoRIO/dynamorio/builds/279780578

So now to figure out a clean way to do this. If it's actually
rare then reloading DR is not an unreasonable solution if we
think it's too hard to actually handle stuff in the gap.

derekbruening added a commit that referenced this issue Sep 26, 2017
Fixes 3 problems related to vvar+vdso in libdynamorio's text-data gap:

1) Ensure old_libdr_base is NULL by not relying on the calling convention
   during early injection.

2) Handle asserts/curiosities inside memquery code by adding
   is_readable_without_exception_query_os_noblock()
   and memquery_from_os_will_block() to avoid deadlock on locks
   used on UNIX for memory queries.

3) Detect something in the text-data gap and reload libdynamorio during
   early injection as the rest of DR assumes there's nothing there.

Fixes #2641
@derekbruening
Copy link
Contributor Author

While the PR run had everything pass, the final commit run had one test fail with the same unmapped-vdso app-crash symptoms!

https://travis-ci.org/DynamoRIO/dynamorio/jobs/280172168

<Application /home/travis/build/DynamoRIO/dynamorio/build_debug-internal-64/suite/tests/bin/common.eflags (21262).  Internal Error: DynamoRIO debug check failure: /home/travis/build/DynamoRIO/dynamorio/core/heap.c:1493 IF_WINDOWS(doing_detach || ) IF_DEBUG(dynamo_thread_init_during_process_exit || ) heapmgt->vmheap.num_free_blocks == heapmgt->vmheap.num_blocks - unfreed_blocks || ((ever_beyond_vmm IF_WINDOWS(|| get_os_version() >= WINDOWS_VERSION_8_1)) && heapmgt->vmheap.num_free_blocks >= heapmgt->vmheap.num_blocks - unfreed_blocks)
(Error occurred @224 frags)

So there's still some scenario my fix doesn't handle??

@derekbruening
Copy link
Contributor Author

Re-opening to finish the missing piece here.

First I added an assert that the gap is empty when when clobber it for #1659.
That assert fires even when vvar+vdso aren't in the gap -- b/c there's some mysterious +rw page there:

in DR gap 1 0x00007f64a98d5000-0x00007f64a9ad4000: 0x00007f64a98d5000-0x00007f64a9ad3000 0
<...>
MAPS: 0x00007f64a94f5000-0x00007f64a94f7000 3 /home/travis/build/DynamoRIO/dynamorio/build_debug-internal-64/suite/tests/bin/linux.clone-pie
MAPS: 0x00007f64a94f7000-0x00007f64a94f8000 0 
MAPS: 0x00007f64a94f8000-0x00007f64a98d5000 5 /home/travis/build/DynamoRIO/dynamorio/build_debug-internal-64/lib64/debug/libdynamorio.so
MAPS: 0x00007f64a9ad3000-0x00007f64a9ad4000 3 
MAPS: 0x00007f64a9ad4000-0x00007f64a9b1d000 3 /home/travis/build/DynamoRIO/dynamorio/build_debug-internal-64/lib64/debug/libdynamorio.so
MAPS: 0x00007f64a9b1d000-0x00007f64a9b4d000 3 
MAPS: 0x00007ffc32cac000-0x00007ffc32cce000 3 [stack]
MAPS: 0x00007ffc32de4000-0x00007ffc32de6000 1 [vvar]
MAPS: 0x00007ffc32de6000-0x00007ffc32de8000 5 [vdso]
MAPS: 0xffffffffff600000-0xffffffffff601000 5 [vsyscall]
<Application /home/travis/build/DynamoRIO/dynamorio/build_debug-internal-64/suite/tests/bin/linux.clone-pie (19870).  Internal Error: DynamoRIO debug check failure: /home/travis/build/DynamoRIO/dynamorio/core/unix/loader.c:293 info.base_pc == libdr_opd->os_data.segments[i].end && info.size == sz && info.type == DR_MEMTYPE_FREE

What is that rw anon page:
MAPS: 0x00007f64a9ad3000-0x00007f64a9ad4000 3

To compare, here's from my VM:

          Start Addr           End Addr       Size     Offset objfile
      0x7ffff79ad000     0x7ffff7d87000   0x3da000        0x0 /home/derek/dr/git/build_x64_dbg_tests/lib64/debug/libdynamorio.so
      0x7ffff7f82000     0x7ffff7f84000     0x2000        0x0 [vvar]
      0x7ffff7f84000     0x7ffff7f86000     0x2000        0x0 [vdso]
      0x7ffff7f86000     0x7ffff7fcf000    0x49000   0x3d9000 /home/derek/dr/git/build_x64_dbg_tests/lib64/debug/libdynamorio.so
      0x7ffff7fcf000     0x7ffff7fff000    0x30000        0x0 [heap]

Program Header:
    LOAD off    0x0000000000000000 vaddr 0x0000000000000000 paddr 0x0000000000000000 align 2**21
         filesz 0x00000000003d90c0 memsz 0x00000000003d90c0 flags r-x
    LOAD off    0x00000000003d9360 vaddr 0x00000000005d9360 paddr 0x00000000005d9360 align 2**21
         filesz 0x00000000000489a0 memsz 0x0000000000078188 flags rw-
 DYNAMIC off    0x00000000003fceb0 vaddr 0x00000000005fceb0 paddr 0x00000000005fceb0 align 2**3
         filesz 0x0000000000000140 memsz 0x0000000000000140 flags rw-

(gdb) p /x 0x00007f64a9b1d000-0x00007f64a9ad4000
$1 = 0x49000
(gdb) p /x 0x00007f64a9b4d000-0x00007f64a9b1d000
$2 = 0x30000

So the .data and .bss have the same sizes: this page seems separate from them.

@derekbruening derekbruening reopened this Sep 27, 2017
@derekbruening
Copy link
Contributor Author

Another issue is that on a reload the old .bss is not being freed. Even a check for an empty comment is not sufficient because the kernel incorrectly labels our .bss as "[heap]" in some cases, as shown here:

          Start Addr           End Addr       Size     Offset objfile
      0x7ffff79ad000     0x7ffff7d87000   0x3da000        0x0 /home/derek/dr/git/build_x64_dbg_tests/lib64/debug/libdynamorio.so
      0x7ffff7f82000     0x7ffff7f84000     0x2000        0x0 [vvar]
      0x7ffff7f84000     0x7ffff7f86000     0x2000        0x0 [vdso]
      0x7ffff7f86000     0x7ffff7fcf000    0x49000   0x3d9000 /home/derek/dr/git/build_x64_dbg_tests/lib64/debug/libdynamorio.so
      0x7ffff7fcf000     0x7ffff7fff000    0x30000        0x0 [heap]
```

@derekbruening
Copy link
Contributor Author

I found a machine where I can reproduce the vvar clobber even with my fix (if I run hundreds of times in a loop): I don't see the weird single page listed above, but I do see why it's being clobbered. I merged the reload_dynamorio case for the gap with the app conflict but didn't adjust the conflict bounds, so the reload for the gap ends up making a temp mmap that is very large, covering from the app to the old lib. This can clobber vvar and vdso.

@derekbruening
Copy link
Contributor Author

derekbruening commented Sep 28, 2017

I also reproduced the mysterious extra page at the end of the gap. It's our emulated brk!

Breaking at mmap_syscall: it's init_emulated_brk() called from
reserve_brk(). It asks for right after the app, but libdynamorio is there,
so it gets back the next thing which is in the libdynamorio gap:

mmap(0x7f95e2e4f000, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f95e342b000

MAPS: 0x00007f95e2a23000-0x00007f95e2a46000 5 /lib/x86_64-linux-gnu/ld-2.19.so
MAPS: 0x00007f95e2a46000-0x00007f95e2c45000 0
MAPS: 0x00007f95e2c45000-0x00007f95e2c47000 3 /lib/x86_64-linux-gnu/ld-2.19.so
MAPS: 0x00007f95e2c47000-0x00007f95e2c48000 3
MAPS: 0x00007f95e2c48000-0x00007f95e2c49000 0
MAPS: 0x00007f95e2c49000-0x00007f95e2c4c000 5 build_x64_dbg_tests/suite/tests/bin/linux.fib-pie
MAPS: 0x00007f95e2c4c000-0x00007f95e2e4c000 0
MAPS: 0x00007f95e2e4c000-0x00007f95e2e4e000 3 build_x64_dbg_tests/suite/tests/bin/linux.fib-pie
MAPS: 0x00007f95e2e4e000-0x00007f95e2e4f000 0
MAPS: 0x00007f95e2e4f000-0x00007f95e322c000 5 build_x64_dbg_tests/lib64/debug/libdynamorio.so
MAPS: 0x00007f95e342b000-0x00007f95e342c000 3
MAPS: 0x00007f95e342c000-0x00007f95e3475000 3 build_x64_dbg_tests/lib64/debug/libdynamorio.so
MAPS: 0x00007f95e3475000-0x00007f95e34a5000 3

@derekbruening
Copy link
Contributor Author

The top-down packing of the libs by the kernel means we should request extra space along with the exe's ELF space to put the brk after it.

fhahn pushed a commit that referenced this issue Dec 4, 2017
Fixes 3 problems related to the kernel placing vvar+vdso in libdynamorio's text-data gap:

1) Ensures old_libdr_base is NULL by not relying on the calling convention
   during early injection.

2) Handles asserts/curiosities inside memquery code by adding
   is_readable_without_exception_query_os_noblock()
   and memquery_from_os_will_block() to avoid deadlock on locks
   used on UNIX for memory queries.

3) Detects something in the text-data gap and reloads libdynamorio during
   early injection as the rest of DR assumes there's nothing there.

Fixes #2641
fhahn pushed a commit that referenced this issue Dec 4, 2017
Solves several remaining issues beyond what the first #2641 commit addressed:

Moves the libdynamorio text-data gap filling earlier to prevent our own mmaps, such as for the brk, from landing there.

Adds explicit space for the brk in the initial app mmap, to deal with top-down packing by the kernel for PIE which otherwise leaves no space for the brk.

Refactors the scheme of "conflict mmaps" used for reloading libdynamorio to avoid clobbering existing mappings.

Adds more sanity check asserts regarding the libdynamorio text-data gap.

On a reload, properly unmaps the libdynamorio .bss, even when the kernel mis-labels it as "[heap]".

Fixes #2641
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant