Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unexplained crashes in WinAFL #2904

Open
yoava333 opened this issue Mar 27, 2018 · 24 comments
Open

Unexplained crashes in WinAFL #2904

yoava333 opened this issue Mar 27, 2018 · 24 comments

Comments

@yoava333
Copy link

It's not an issue (it used to be an issue), it's more of an interest, I've been experiencing strange crashes with Release 6.2.0 and 7.0.0-rc1 that were solved in recent builds. Do you recognize what change caused this issues to go away? it's important to note that they were quite rare.
Thanks

@derekbruening
Copy link
Contributor

What version of Windows were they on?

@yoava333
Copy link
Author

Windows 10 (1709) Fall Creators Update x64

@derekbruening
Copy link
Contributor

Right, I see Windows 10 1709 is listed in your tracker. 1709 official support was added since those releases (naturally since it didn't exist when they were released) though IIRC there was no big change and I would expect 7.0.0 to work on 1709 unless there are a ton of hooks preventing dynamic syscall discovery.

Re: the interception hook code being -x: that might have been fixed in 73122dc "Fixes a race where we put interception_code hooks in place before marking them +x".

For the first problem in that tracker, more information is needed. It's not clear what happened and what the problem is: you're attaching to the parent at the time of ''Cannot kill child process" message? If you could get symbols for that callstack that would help (run our load_syms windbg script: see https://github.com/DynamoRIO/dynamorio/wiki/Debugging).

@yoava333
Copy link
Author

yoava333 commented Mar 28, 2018

It seems that the problem didn't go away, it was just hiding - I ran a fuzzing test I found several hung processes, I attached to one of the processes and this is the call stacks of all the threads:

0:000> ~*kb

.  0  Id: 2e34.3e64 Suspend: 1 Teb: 00350000 Unfrozen
 # ChildEBP RetAddr  Args to Child              
00 230deb90 5ed823d0 0000011c 00000000 00000000 ntdll!NtWaitForSingleObject+0xc
01 230deba0 5ed70acd 0000011c 00000000 5ee15038 dynamorio!nt_wait_event_with_timeout+0x10 [c:\projects\dynamorio\core\win32\ntdll.c @ 3895] 
02 230debc4 5ed6d154 0000011c 00000000 5ee15000 dynamorio!os_wait_event+0x58d [c:\projects\dynamorio\core\win32\os.c @ 8420] 
03 230debe8 5ed0dfb7 5ee15000 00000000 5ed278ff dynamorio!mutex_wait_contended_lock+0x94 [c:\projects\dynamorio\core\win32\os.c @ 8477] 
04 230debf4 5ed278ff 5ee15034 230ded6c 00000001 dynamorio!mutex_lock+0x47 [c:\projects\dynamorio\core\utils.c @ 904] 
05 230dec44 5ecf437b 00000002 230dec68 230dec64 dynamorio!synch_with_all_threads+0x13f [c:\projects\dynamorio\core\synch.c @ 1294] 
06 230dec68 5ed68daa 5ed68e1e 230df000 5ed6fe6b dynamorio!dynamo_process_exit+0x13b [c:\projects\dynamorio\core\dynamo.c @ 1428] 
07 230deca4 5ed70043 2308e300 5ede22a4 00000001 dynamorio!cleanup_and_terminate+0x46 [C:\projects\dynamorio\build\build_release-external-32\core\CMakeFiles\dynamorio.dir\arch\x86\x86.asm.obj.s @ 1929] 
08 230decb8 5ed2d057 2308e300 00000005 00000000 dynamorio!os_terminate_with_code+0x13 [c:\projects\dynamorio\core\win32\os.c @ 1566] 
09 230deccc 64222a69 00000000 230e2cc0 00000001 dynamorio!dr_exit_process+0xb7 [c:\projects\dynamorio\core\lib\instrument.c @ 2461] 
WARNING: Frame IP not in any known module. Following frames may be wrong.
0a 230dece0 60de129e 230ded24 00000000 0053fe48 0x64222a69
0b 00000000 00000000 00000000 00000000 00000000 0x60de129e

   1  Id: 2e34.3f20 Suspend: 1 Teb: 00353000 Unfrozen
 # ChildEBP RetAddr  Args to Child              
00 00a3f9b8 76798484 008443d0 76798460 5f74617e dynamorio!interception_code_array+0xe2a
01 00a3f9cc 76f32ec0 008443d0 7999c398 00000000 KERNEL32!BaseThreadInitThunk+0x24
02 00a3fa14 76f32e90 ffffffff 76f4debc 00000000 ntdll!__RtlUserThreadStart+0x2f
03 00a3fa24 00000000 76f1c740 008443d0 00000000 ntdll!_RtlUserThreadStart+0x1b

   2  Id: 2e34.39c0 Suspend: 1 Teb: 00356000 Unfrozen
WARNING: Stack pointer is outside the normal stack bounds. Stack unwinding can be inaccurate.
 # ChildEBP RetAddr  Args to Child              
00 2306ee0c 5ed823d0 0000011c 00000000 00000000 ntdll!NtWaitForSingleObject+0xc
01 2306ee1c 5ed70acd 0000011c 00000000 5ee15038 dynamorio!nt_wait_event_with_timeout+0x10 [c:\projects\dynamorio\core\win32\ntdll.c @ 3895] 
02 2306ee40 5ed6d154 0000011c 00000000 5ee15000 dynamorio!os_wait_event+0x58d [c:\projects\dynamorio\core\win32\os.c @ 8420] 
03 2306ee64 5ed0dfb7 5ee15000 00000000 5ecf4afb dynamorio!mutex_wait_contended_lock+0x94 [c:\projects\dynamorio\core\win32\os.c @ 8477] 
04 2306ee70 5ecf4afb 5ee15034 010003f1 ffffffff dynamorio!mutex_lock+0x47 [c:\projects\dynamorio\core\utils.c @ 904] 
05 2306ee80 5ed71f52 00000000 2306eeb8 00000000 dynamorio!dynamo_thread_init+0x8b [c:\projects\dynamorio\core\dynamo.c @ 2215] 
06 2306ee9c 5ed7e46d 2306eeb8 5ee02f1f 2306eeb0 dynamorio!thread_attach_setup+0xb2 [c:\projects\dynamorio\core\win32\os.c @ 2665] 
07 2306eea4 5ee02f1f 2306eeb0 00000000 5ee02fa3 dynamorio!thread_attach_takeover_callee+0xd [c:\projects\dynamorio\core\win32\callback.c @ 8749] 
08 00000000 00000000 00000000 00000000 00000000 dynamorio!interception_code_array+0xf1f

   3  Id: 2e34.336c Suspend: 1 Teb: 00359000 Unfrozen
 # ChildEBP RetAddr  Args to Child              
00 00f2f748 5ed823d0 0000011c 00000000 00000000 ntdll!NtWaitForSingleObject+0xc
01 00f2f758 5ed70acd 0000011c 00000000 5ee15038 dynamorio!nt_wait_event_with_timeout+0x10 [c:\projects\dynamorio\core\win32\ntdll.c @ 3895] 
02 00f2f77c 5ed6d154 0000011c 00000000 5ee15000 dynamorio!os_wait_event+0x58d [c:\projects\dynamorio\core\win32\os.c @ 8420] 
03 00f2f7a0 5ed0dfb7 5ee15000 00000000 5ecf4afb dynamorio!mutex_wait_contended_lock+0x94 [c:\projects\dynamorio\core\win32\os.c @ 8477] 
04 00f2f7ac 5ecf4afb 5ee15034 01000000 00000000 dynamorio!mutex_lock+0x47 [c:\projects\dynamorio\core\utils.c @ 904] 
05 00f2f7bc 5ed7c911 00000000 00f2f810 00000000 dynamorio!dynamo_thread_init+0x8b [c:\projects\dynamorio\core\dynamo.c @ 2215] 
06 00f2f94c 5ed7c789 00f2fac8 00000000 00000000 dynamorio!intercept_new_thread+0x71 [c:\projects\dynamorio\core\win32\callback.c @ 3141] 
07 00f2f95c 5ee028a6 00f2f968 00000000 5ee02903 dynamorio!intercept_ldr_init+0x89 [c:\projects\dynamorio\core\win32\callback.c @ 3390] 
08 00000000 00000000 00000000 00000000 00000000 dynamorio!interception_code_array+0x8a6

   4  Id: 2e34.2bb4 Suspend: 1 Teb: 0035c000 Unfrozen
 # ChildEBP RetAddr  Args to Child              
00 0102f710 5ed823d0 0000011c 00000000 00000000 ntdll!NtWaitForSingleObject+0xc
01 0102f720 5ed70acd 0000011c 00000000 5ee15038 dynamorio!nt_wait_event_with_timeout+0x10 [c:\projects\dynamorio\core\win32\ntdll.c @ 3895] 
02 0102f744 5ed6d154 0000011c 00000000 5ee15000 dynamorio!os_wait_event+0x58d [c:\projects\dynamorio\core\win32\os.c @ 8420] 
03 0102f768 5ed0dfb7 5ee15000 00000000 5ecf4afb dynamorio!mutex_wait_contended_lock+0x94 [c:\projects\dynamorio\core\win32\os.c @ 8477] 
04 0102f774 5ecf4afb 5ee15034 01000000 00000000 dynamorio!mutex_lock+0x47 [c:\projects\dynamorio\core\utils.c @ 904] 
05 0102f784 5ed7c911 00000000 0102f7d8 00000000 dynamorio!dynamo_thread_init+0x8b [c:\projects\dynamorio\core\dynamo.c @ 2215] 
06 0102f914 5ed7c789 0102fa90 00000000 00000000 dynamorio!intercept_new_thread+0x71 [c:\projects\dynamorio\core\win32\callback.c @ 3141] 
07 0102f924 5ee028a6 0102f930 00000000 5ee02903 dynamorio!intercept_ldr_init+0x89 [c:\projects\dynamorio\core\win32\callback.c @ 3390] 
08 00000000 00000000 00000000 00000000 00000000 dynamorio!interception_code_array+0x8a6

My setup:
DynamoRIO-Windows-7.0.17614-0
Windows 10 1709
32-bit application on a 64-bit machine
I ran the code with -no_hide option

@derekbruening
Copy link
Contributor

They're all waiting on thread_initexit_lock -- but who holds that? That thread got killed? Is there any termination of threads from other processes going on?

@yoava333
Copy link
Author

I used procmon, no thread was killed, so the stack dump is for all threads.
So, the process is in tear-down (that's why the windows loader lock is held) and one of the threads is holding the thread_initexit_lock.
Is there a way of telling which thread is holding the lock?

@derekbruening
Copy link
Contributor

From the callstacks, it looks like none of these threads holds the lock. In debug build the owner is stored in the lock data structure, but not in release build. The lock owner has presumably exited or been terminated, which shouldn't happen.

It may not be easy to figure out from this snapshot. One avenue is to see if DR thinks there's another thread: what is num_known_threads?

@derekbruening
Copy link
Contributor

Thread 1 is in KiRaiseUserExceptionDispatcher I think -- maybe worth snooping into that thread to see if there's some clue.

@yoava333
Copy link
Author

The application itself is a single threaded application, DR thinks there's only one thread (which makes sense).

Microsoft (R) Windows Debugger Version 10.0.16299.91 X86
Copyright (c) Microsoft Corporation. All rights reserved.

*** wait with pending attach

************* Path validation summary **************
Response                         Time (ms)     Location
Deferred                                       srv*c:\Symbols*http://msdl.microsoft.com/download/symbols
Symbol search path is: srv*c:\Symbols*http://msdl.microsoft.com/download/symbols
Executable search path is: 
ModLoad: 00dd0000 00dd6000   c:\work\fuzzing\dynamorio_bugs\crash_test.exe
ModLoad: 76ed0000 77060000   C:\WINDOWS\SYSTEM32\ntdll.dll
ModLoad: 76780000 76860000   C:\WINDOWS\System32\KERNEL32.DLL
ModLoad: 73bf0000 73dd3000   C:\WINDOWS\System32\KERNELBASE.dll
ModLoad: 76870000 7698e000   C:\WINDOWS\System32\ucrtbase.dll
ModLoad: 6e0b0000 6e0c5000   C:\WINDOWS\SYSTEM32\VCRUNTIME140.dll
ModLoad: 60420000 6055a000   C:\work\fuzzing\DynamoRIO-Windows-7.0.17605-0\lib32\release\dynamorio.dll
Break-in sent, waiting 30 seconds...
WARNING: Break-in timed out, suspending.
         This is usually caused by another thread holding the loader lock
(8d0.37cc): Wake debugger - code 80000007 (first chance)
eax=00000000 ebx=1891eac0 ecx=00000001 edx=00000004 esi=00000000 edi=000000d4
eip=76f3952c esp=1896eb94 ebp=00000000 iopl=0         nv up ei pl nz na po nc
cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00000202
ntdll!NtWaitForSingleObject+0xc:
76f3952c c20c00          ret     0Ch
0:000> ~*kb

.  0  Id: 8d0.37cc Suspend: 1 Teb: 00e63000 Unfrozen
 # ChildEBP RetAddr  Args to Child              
00 1896eb90 604b23d0 000000d4 00000000 00000000 ntdll!NtWaitForSingleObject+0xc
01 1896eba0 604a0acd 000000d4 00000000 60545038 dynamorio!nt_wait_event_with_timeout+0x10 [c:\projects\dynamorio\core\win32\ntdll.c @ 3895] 
02 1896ebc4 6049d154 000000d4 00000000 60545000 dynamorio!os_wait_event+0x58d [c:\projects\dynamorio\core\win32\os.c @ 8420] 
03 1896ebe8 6043dfb7 60545000 00000000 604578ff dynamorio!mutex_wait_contended_lock+0x94 [c:\projects\dynamorio\core\win32\os.c @ 8477] 
04 1896ebf4 604578ff 60545034 1896ed6c 00000001 dynamorio!mutex_lock+0x47 [c:\projects\dynamorio\core\utils.c @ 904] 
05 1896ec44 6042437b 00000002 1896ec68 1896ec64 dynamorio!synch_with_all_threads+0x13f [c:\projects\dynamorio\core\synch.c @ 1294] 
06 1896ec68 60498daa 60498e1e 1896f000 6049fe6b dynamorio!dynamo_process_exit+0x13b [c:\projects\dynamorio\core\dynamo.c @ 1428] 
07 1896eca4 604a0043 1891eac0 605122a4 00000001 dynamorio!cleanup_and_terminate+0x46 [C:\projects\dynamorio\build\build_release-external-32\core\CMakeFiles\dynamorio.dir\arch\x86\x86.asm.obj.s @ 1929] 
08 1896ecb8 6045d057 1891eac0 00000005 00000000 dynamorio!os_terminate_with_code+0x13 [c:\projects\dynamorio\core\win32\os.c @ 1566] 
09 1896eccc 6e092a69 00000000 18972cc0 00000001 dynamorio!dr_exit_process+0xb7 [c:\projects\dynamorio\core\lib\instrument.c @ 2461] 
WARNING: Frame IP not in any known module. Following frames may be wrong.
0a 1896ece0 0fa7129e 1896ed24 00000000 00dbfaac 0x6e092a69
0b 00000000 00000000 00000000 00000000 00000000 0xfa7129e

   1  Id: 8d0.fb4 Suspend: 1 Teb: 00e66000 Unfrozen
 # ChildEBP RetAddr  Args to Child              
00 013ffcbc 76798484 01204678 76798460 0150d5a6 dynamorio!interception_code_array+0xe2a
01 013ffcd0 76f32ec0 01204678 29e9847f 00000000 KERNEL32!BaseThreadInitThunk+0x24
02 013ffd18 76f32e90 ffffffff 76f4dea7 00000000 ntdll!__RtlUserThreadStart+0x2f
03 013ffd28 00000000 76f1c740 01204678 00000000 ntdll!_RtlUserThreadStart+0x1b

   2  Id: 8d0.3a00 Suspend: 1 Teb: 00e69000 Unfrozen
WARNING: Stack pointer is outside the normal stack bounds. Stack unwinding can be inaccurate.
 # ChildEBP RetAddr  Args to Child              
00 188fee0c 604b23d0 000000d4 00000000 00000000 ntdll!NtWaitForSingleObject+0xc
01 188fee1c 604a0acd 000000d4 00000000 60545038 dynamorio!nt_wait_event_with_timeout+0x10 [c:\projects\dynamorio\core\win32\ntdll.c @ 3895] 
02 188fee40 6049d154 000000d4 00000000 60545000 dynamorio!os_wait_event+0x58d [c:\projects\dynamorio\core\win32\os.c @ 8420] 
03 188fee64 6043dfb7 60545000 00000000 60424afb dynamorio!mutex_wait_contended_lock+0x94 [c:\projects\dynamorio\core\win32\os.c @ 8477] 
04 188fee70 60424afb 60545034 010003f1 ffffffff dynamorio!mutex_lock+0x47 [c:\projects\dynamorio\core\utils.c @ 904] 
05 188fee80 604a1f52 00000000 188feeb8 00000000 dynamorio!dynamo_thread_init+0x8b [c:\projects\dynamorio\core\dynamo.c @ 2215] 
06 188fee9c 604ae46d 188feeb8 60532f1f 188feeb0 dynamorio!thread_attach_setup+0xb2 [c:\projects\dynamorio\core\win32\os.c @ 2665] 
07 188feea4 60532f1f 188feeb0 00000000 60532fa3 dynamorio!thread_attach_takeover_callee+0xd [c:\projects\dynamorio\core\win32\callback.c @ 8749] 
08 00000000 00000000 00000000 00000000 00000000 dynamorio!interception_code_array+0xf1f

   3  Id: 8d0.1dc4 Suspend: 1 Teb: 00e6c000 Unfrozen
 # ChildEBP RetAddr  Args to Child              
00 0190f82c 604b23d0 000000d4 00000000 00000000 ntdll!NtWaitForSingleObject+0xc
01 0190f83c 604a0acd 000000d4 00000000 60545038 dynamorio!nt_wait_event_with_timeout+0x10 [c:\projects\dynamorio\core\win32\ntdll.c @ 3895] 
02 0190f860 6049d154 000000d4 00000000 60545000 dynamorio!os_wait_event+0x58d [c:\projects\dynamorio\core\win32\os.c @ 8420] 
03 0190f884 6043dfb7 60545000 00000000 60424afb dynamorio!mutex_wait_contended_lock+0x94 [c:\projects\dynamorio\core\win32\os.c @ 8477] 
04 0190f890 60424afb 60545034 01000000 00000000 dynamorio!mutex_lock+0x47 [c:\projects\dynamorio\core\utils.c @ 904] 
05 0190f8a0 604ac911 00000000 0190f8f4 00000000 dynamorio!dynamo_thread_init+0x8b [c:\projects\dynamorio\core\dynamo.c @ 2215] 
06 0190fa30 604ac789 0190fbac 00000000 00000000 dynamorio!intercept_new_thread+0x71 [c:\projects\dynamorio\core\win32\callback.c @ 3141] 
07 0190fa40 605328a6 0190fa4c 00000000 60532903 dynamorio!intercept_ldr_init+0x89 [c:\projects\dynamorio\core\win32\callback.c @ 3390] 
08 00000000 00000000 00000000 00000000 00000000 dynamorio!interception_code_array+0x8a6
0:000> x dynamorio!num_known_threads
60546800          dynamorio!num_known_threads = 0n1

I've built a debug version of DynamoRIO and ran a single file. I'm getting a memory leak warning and then a crash (access violation), the crash reproduces every time.


Microsoft (R) Windows Debugger Version 10.0.16299.91 X86
Copyright (c) Microsoft Corporation. All rights reserved.

*** wait with pending attach

************* Path validation summary **************
Response                         Time (ms)     Location
Deferred                                       srv*c:\Symbols*http://msdl.microsoft.com/download/symbols
Symbol search path is: srv*c:\Symbols*http://msdl.microsoft.com/download/symbols
Executable search path is: 
ModLoad: 00dd0000 00dd6000   c:\work\fuzzing\dynamorio_bugs\crash_test.exe
ModLoad: 76ed0000 77060000   C:\WINDOWS\SYSTEM32\ntdll.dll
ModLoad: 76780000 76860000   C:\WINDOWS\System32\KERNEL32.DLL
ModLoad: 73bf0000 73dd3000   C:\WINDOWS\System32\KERNELBASE.dll
ModLoad: 76870000 7698e000   C:\WINDOWS\System32\ucrtbase.dll
ModLoad: 6e0b0000 6e0c5000   C:\WINDOWS\SYSTEM32\VCRUNTIME140.dll
ModLoad: 143d0000 1484a000   c:\work\fuzzing\dynamorio\build\lib32\debug\dynamorio.dll
(92c.4d00): Access violation - code c0000005 (!!! second chance !!!)
eax=147f8670 ebx=1c55f000 ecx=147f8c68 edx=1c494790 esi=00000001 edi=00000001
eip=144fe9df esp=1c55e3c0 ebp=1c55e41c iopl=0         nv up ei pl nz ac pe nc
cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00010216
dynamorio!remove_process_lock+0x6af:
144fe9df 894a34          mov     dword ptr [edx+34h],ecx ds:002b:1c4947c4=????????
0:000> kb
 # ChildEBP RetAddr  Args to Child              
00 1c55e41c 1449d3dc 147f8670 14806408 1c55e444 dynamorio!remove_process_lock+0x6af [c:\work\fuzzing\dynamorio\core\utils.c @ 516] 
01 1c55e448 14502630 147f8670 00000000 1c55e484 dynamorio!mutex_delete+0x21c [c:\work\fuzzing\dynamorio\core\utils.c @ 959] 
02 1c55e458 143e0222 00000000 00000000 00000001 dynamorio!kstat_exit+0x90 [c:\work\fuzzing\dynamorio\core\stats.c @ 231] 
03 1c55e484 143eca8c 00000000 00000000 1c535040 dynamorio!dynamo_shared_exit+0x4b2 [c:\work\fuzzing\dynamorio\core\dynamo.c @ 1126] 
04 1c55e4ac 143dfb25 00000000 00000000 1c55e4cc dynamorio!dynamo_process_exit_cleanup+0x19c [c:\work\fuzzing\dynamorio\core\dynamo.c @ 1327] 
05 1c55e4bc 146d0b4a 146d0bc4 1c55f000 1465a728 dynamorio!dynamo_process_exit+0x135 [c:\work\fuzzing\dynamorio\core\dynamo.c @ 1386] 
06 1c55e4fc 1464fd13 1c535040 147c3348 00000001 dynamorio!cleanup_and_terminate+0x46 [C:\work\fuzzing\dynamorio\build\core\x86.asm_core.s @ 1929] 
07 1c55e514 1458a567 1c535040 00000005 00000001 dynamorio!os_terminate_with_code+0x13 [c:\work\fuzzing\dynamorio\core\win32\os.c @ 1566] 
08 1c55e544 6e091fbc 00000001 00000002 1c55e6fc dynamorio!dr_exit_process+0x207 [c:\work\fuzzing\dynamorio\core\lib\instrument.c @ 2461] 
WARNING: Frame IP not in any known module. Following frames may be wrong.
09 1c55e5bc 1449dfc9 00000000 1c55e5f0 1454afc2 0x6e091fbc
0a 1c55e6fc 14596be9 1c535040 1c55ea1c 605c3700 dynamorio!self_owns_write_lock+0x9 [c:\work\fuzzing\dynamorio\core\utils.c @ 1352] 
0b 1c55e738 146856ff 1c535040 1c55ea1c 00000148 dynamorio!instrument_exception+0xb9 [c:\work\fuzzing\dynamorio\core\lib\instrument.c @ 2127] 
0c 1c55ea44 14687f64 1c535040 006ff5d8 006ff588 dynamorio!client_exception_event+0x12f [c:\work\fuzzing\dynamorio\core\win32\callback.c @ 5088] 
0d 1c55eea4 147e4bec 1c55eeb0 00000000 147e4c70 dynamorio!intercept_exception+0x2084 [c:\work\fuzzing\dynamorio\core\win32\callback.c @ 5978] 
0e 1c55eeac 00000000 147e4c70 009f0000 00000002 dynamorio!interception_code_array+0xbec

The crash is here https://github.com/DynamoRIO/dynamorio/blob/master/core/utils.c#L516, lock->prev_process_lock is an invalid pointer, next_process_lock is valid

I'm on commit dd1589c on Windows 10 x64.

@derekbruening
Copy link
Contributor

The app is single-threaded? Hmm, the debugger usually injects just one thread -- what are the other threads then? Do you nudge from the outside or do anything to inject threads?

I didn't notice the dr_exit_process in the first callstacks before: that must be called in your code. In which contexts do you call it? The thread_initexit_lock shouldn't be held in normal client event callbacks but it is worth double-checking where it's called from.

The debug build callstack -- if you could investigate the real frames in between 08 and 0b. Did you run the load_syms script? Is your library in there?

@yoava333
Copy link
Author

The app is single threaded, the source code is here, the application is a test program for the fuzzer.

In my test scenario WinAFL calls dr_exit_process after the main function had finished running (there are more places for invoking the dr_exit_process, but they are not exercised in our case).
The code for the client is here.

In the debug build I was using the -no_hide option since the load_syms didn't work for me - using x86 windbg on 32 bit app on 64 bit (windows 10) machine I'm getting the following error:
Memory access error at ') + @$t0+5) & 0xfffff000);;$$ Check magic values

Steps to reproduce:

afl-showmap.exe -o output.bin -m none -q -t 100000 -D [dynamorio_dir]\bin32\ -- -target_module test.exe -target_offset 0x1000 -nargs 2 -covtype edge -call_convention stdcall -coverage_module test.exe -- crash_test.exe [path to an existing input file]

@Zhentar
Copy link

Zhentar commented Nov 13, 2019

I've been running into this at a pretty high rate with some code I'm trying to fuzz; it's hung 0.5%-5% of all target processes spawned in various runs. Given the inconsistency, I've mostly been using ETW to debug things, which fortunately gives me a pretty good set of samples across both hung & not hung processes (it's also fairly resilient to dr messing with things, though the lack of UNWIND_INFO for dr generated code cuts off stack traces).

So to answer some of the questions raised:

  • The extra threads are all ntdll!TppWorkerThread threads, so they're created by Windows to do whatever windows-y things windows does.
  • No threads are killed before the process terminates, much less while holding any mutexes
  • The rightful owner of the mutex is one of the TppWorkerThread threads that's waiting on the contended_event

Specifically, the issue occurs for me when one of the TppWorkerThread threads starts while take_over_threads holds the thread_initexit_mutex
(full stack if that's unclear:

dynamorio.dll!dynamo_auto_start
dynamorio.dll!auto_setup
dynamorio.dll!dynamorio_app_init
dynamorio.dll!dynamorio_take_over_threads
dynamorio.dll!os_take_over_all_unknown_threads

)

When that happens, the newly created thread hits contention on thread_initexit_mutex...

dynamorio.dll!__Data__
dynamorio.dll!intercept_ldr_init
dynamorio.dll!intercept_new_thread
dynamorio.dll!dynamo_thread_init
dynamorio.dll!d_r_mutex_lock_ev       //(wrapper I added to flag thread_initexit_mutex calls for event logging)
dynamorio.dll!d_r_mutex_lock_app
ThreadId Event Name lock (Field 1) lock_requests (Field 2) event (Field 3)
4212 mutex_wait_contended_lock 0x00000000711A5040 1 0x0000000000000000

(note that this is after lock_requests is incremented but before contended_event has been initialized)

Then a few hundred microseconds later, os_take_over_all_unknown_threads releases the lock and raises the sets the event:

ThreadId Event Name lock (Field 1) lock_requests (Field 2) event (Field 3)
1196 mutex_notify_released_lock 0x00000000711A5040 0 0x00000000000000E0

But the TppWorkerThread never gets it. It's still waiting on that event to acquire that mutex (and WinDbg confirms it isn't set)

Everything works fine when take_over_threads is done with the lock before a TppWorkerThread starts; it just never works when it hits the wait_contended lock.

Any ideas on why the nt_set_event wouldn't work? It seems... odd. The two are spaced out enough that it seems unlikely to be hitting a weird race condition, and I can't believe the Nt*Event functions would be that unreliable. Could DynamoRIO be interfering with it somehow?

@derekbruening
Copy link
Contributor

Crazy theory: could there have been a 3rd thread (another TppWorkerThread) who was also waiting on the thread_initexit_lock contention event, and it was the one woken up by the release of the mutex by os_take_over_all_unknown_threads, and then something external killed that thread before it could then in turn release the mutex it now holds and thus wake up the worker thread you are looking at?Does this reproduce in debug build where the owner of a mutex is stored? Although the thread could be killed between wake-up and writing the owner field.

If the event is changed to wake up all waiting threads instead of just one, does the hang never happen?

The main issue we have hit in os_take_over_all_unknown_threads is in identifying threads we have started to take over through the hook, and avoiding double-takeover. Maybe there was a double-takeover here: though I'm not sure how it would lead to these exact hang symptoms.

@Zhentar
Copy link

Zhentar commented Nov 14, 2019

There is a third thread (6748, for this process):

Process ID Thread ID Thread Start Module Thread Start Function Start Time (µs) End Time (µs)
3396 3020 fuzzapp.exe ? 17,019,499.70 n/a
3396 6748 ntdll.dll TppWorkerThread 17,054,599.10 n/a
3396 1676 ntdll.dll TppWorkerThread 17,057,893.80 n/a

However that thread never does anything. Literally, after it starts the context switch data shows it only spent a couple dozen microseconds running, it looks like something to do with APC. Never even attempts to acquire a mutex.

The full list of thread_initexit_lock activity for a process that gets hung:

Process (ID) ThreadId Event Name Opcode Name lock (Field 1) lock_requests (Field 2) contended_event (Field 3) Time (µs)
3396 3020 mutex_lock win:Start 0x00000000711A5040 -1 17,224,174.70
3396 3020 mutex_lock win:Stop 0x00000000711A5040 0 17,224,178.80
3396 3020 mutex_unlock win:Info 0x00000000711A5040 0 17,224,496.60
3396 3020 mutex_lock win:Start 0x00000000711A5040 -1 17,235,916.90
3396 3020 mutex_lock win:Stop 0x00000000711A5040 0 17,235,921.40
3396 1676 mutex_lock win:Start 0x00000000711A5040 0 17,235,981.20
3396 1676 mutex_wait_contended_lock win:Info 0x00000000711A5040 1 0x0000000000000000 17,236,105.70
3396 3020 mutex_unlock win:Info 0x00000000711A5040 1 17,236,738.70
3396 3020 mutex_notify_released_lock win:Info 0x00000000711A5040 0 0x00000000000000E8 17,236,742.60
3396 1676 mutex_lock win:Start 0x00000000711A5040 0 17,236,752.10
3396 1676 mutex_wait_contended_lock win:Info 0x00000000711A5040 1 0x00000000000000E8 17,236,878.60
3396 3020 mutex_lock win:Start 0x00000000711A5040 1 17,296,995.50
3396 3020 mutex_wait_contended_lock win:Info 0x00000000711A5040 2 0x00000000000000E8 17,296,999.40

(mutex_lock with win:Start is before it attempts to acquire the lock, mutex_lock with win:Stop is after the lock has been acquired)

And wait a tick... 10µs after the main thread unlocks, the waiting thread tries to get the lock a second time?? And there's a 100% failure rate on collecting stack traces for the nested lock & wait, quite inconvenient and very unusual.

@derekbruening
Copy link
Contributor

Is that double-takeover leading to the recursive lock attempt? Is os_take_over_thread()'s check for is_in_dynamo_dll() somehow failing (unclear how for this case)?

@Zhentar
Copy link

Zhentar commented Nov 14, 2019

Doesn't look like there is anything unusual going on in os_takeover_all_unknown_threads:

ThreadId Event Name Target Thread Time (µs)
7480 os_take_over_threads/new_thread 7480 11,497,390.60
7480 os_take_over_threads/new_thread 628 11,497,393.30
8572 mutex_lock   11,497,547.10
8572 mutex_wait_contended_lock   11,497,672.10
7480 os_take_over_threads/suspended 628 11,498,628.00
7480 os_take_over_threads/context_set 628 11,498,684.80
7480 os_take_over_threads/new_thread 8572 11,498,686.80
7480 os_take_over_threads/suspended 8572 11,498,709.70
7480 os_take_over_threads/context_set 8572 11,498,763.40
7480 os_take_over_threads/resume_thread 628 11,498,782.00
7480 os_take_over_threads/resume_thread 8572 11,498,785.40
7480 mutex_unlock   11,498,798.70
7480 mutex_notify_released_lock   11,498,802.70

The 1.3ms delay while suspending the first thread found seems somewhat common, but it looks like most of the time the other thread just doesn't get scheduled.

@Zhentar
Copy link

Zhentar commented Nov 14, 2019

Is os_take_over_thread()'s check for is_in_dynamo_dll() somehow failing

You mean the case where "\tthread " TIDFMT " is already waiting\n" gets logged? Across 332 test processes, only three of them had a thread hit that branch. (and of the other 329 processes, only 3 hung)

edit: dynamo_initialize has already been set before os_take_over_all_unknown_threads gets called, so they're only waiting if they haven't been scheduled since that was set

@derekbruening
Copy link
Contributor

Is os_take_over_thread()'s check for is_in_dynamo_dll() somehow failing

You mean the case where "\tthread " TIDFMT " is already waiting\n" gets logged?

Yes, the "is already waiting" code. For your described scenario of a thread waiting for the lock while the main thread tries to take over all the threads, if that "is already waiting" code is not hit, then I could see this hang happening with the recursive lock when DR tries to "interpret itself".

Across 332 test processes, only three of them had a thread hit that branch. (and of the other 329 processes, only 3 hung)

That sounds like confirmation that the underlying problem is double-takeover: the 3 cases of a hang did not hit the branch they needed to hit.

It looks to me like the next step is to figure out why neither is_in_dynamo_dll nor new_thread_is_waiting_for_dr_init properly identified the thread as already having hit the takeover hook.

@Zhentar
Copy link

Zhentar commented Nov 14, 2019

It looks to me like the next step is to figure out why neither is_in_dynamo_dll nor new_thread_is_waiting_for_dr_init properly identified the thread as already having hit the takeover hook.

err... I don't see any reason but luck that they would do that. The thread isn't waiting for dr_init because dr_init has already happened.
(os_mark_dynamo_initialized is my addition that's called immediately after setting dynamo_initilized = true, and os_thread_done_waiting is called immediately after possible_new_thread_wait_for_dr_init() returns)

ThreadId Event Name thread (Field 1) Time (µs)
7548 mutex_lock 0x00000000711A6040 11,710,207.40
7548 mutex_lock 0x00000000711A6040 11,710,212.30
7548 mutex_unlock 0x00000000711A6040 11,710,510.20
7548 os_mark_dynamo_initialized 11,725,595.50
7548 os_process_under_dynamorio_initiate 11,725,625.00
7548 os_thread_under_dynamo 11,725,656.20
7548 mutex_lock 0x00000000711A6040 11,725,698.20
7548 mutex_lock 0x00000000711A6040 11,725,702.70
7548 os_take_over_threads/new_thread 7548 11,725,718.20
7548 os_take_over_threads/new_thread 8732 11,725,720.50
7236 os_thread_done_waiting 11,726,013.20
7236 mutex_lock 0x00000000711A6040 11,726,039.50
7236 mutex_wait_contended_lock 0x00000000711A6040 11,726,164.80
7548 os_take_over_threads/suspended 8732 11,726,221.70
7548 os_take_over_threads/context_set 8732 11,726,236.80
7548 os_take_over_threads/new_thread 7236 11,726,238.00
7548 os_take_over_threads/suspended 7236 11,726,257.70
7548 os_take_over_threads/context_set 7236 11,726,268.30

@derekbruening
Copy link
Contributor

I think you're confusing process init and thread init. Here we are talking about thread init for the new thread. We are not talking about process init for DR itself. dynamo_initialized is not relevant here.

The thread isn't waiting for dr_init because dr_init has already happened.

No, DR initialization of this thread has not happened: in the scenario you described, this thread is sitting and waiting for the thread_initexit_lock so that it can begin its initialization.

@Zhentar
Copy link

Zhentar commented Nov 14, 2019

I'm equating dynamo_initialized with dr_init because that's what the code does:

while (!dynamo_initialized && !dynamo_exited) {
STATS_INC(apc_yields_while_initializing);
os_thread_yield();
}
if (idx < MAX_THREADS_WAITING_FOR_DR_INIT) {
/* os_take_over_all_unknown_threads()'s context check will work from here */
threads_waiting_for_dr_init[idx] = INVALID_THREAD_ID;
}

And I assume that is_in_dynamo_dll() fails because the thread is in ntdll!NtWaitForSingleObject

@derekbruening
Copy link
Contributor

Sorry, not the "is already waiting": it must be the sequence after that in os_take_over_thread. I have not looked at this code in a while, but somewhere there is code that is trying to identify a thread that is already on the path to DR control, where if we took it over at its current context, we're going to interpret our own code, which is not supported. The point is, os_take_over_all_unknown_threads() should leave this particular thread alone, b/c it's already going to be taken over. I am saying that the bug seems to be that DR incorrectly takes it over, resulting in the recursive lock. Double takeover.

@derekbruening
Copy link
Contributor

Xref #1443

@Zhentar
Copy link

Zhentar commented Nov 16, 2019

I tried rearranging things a bit so that the thread wasn't removed from threads_waiting_for_dr_init until right before intercept_ldr_init, and it's at least working well enough for my needs; fuzzer has been running for an hour now without any hung processes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants