Unexplained crashes in WinAFL #2904
What version of Windows were they on?
Windows 10 (1709) Fall Creators Update x64
Right, I see Windows 10 1709 is listed in your tracker. Official 1709 support was added after those releases (naturally, since it didn't exist when they were released), though IIRC there was no big change and I would expect 7.0.0 to work on 1709 unless there are a ton of hooks preventing dynamic syscall discovery. Re: the interception hook code being -x: that might have been fixed in 73122dc "Fixes a race where we put interception_code hooks in place before marking them +x". For the first problem in that tracker, more information is needed. It's not clear what happened and what the problem is: are you attaching to the parent at the time of the "Cannot kill child process" message? If you could get symbols for that callstack, that would help (run our load_syms windbg script: see https://github.com/DynamoRIO/dynamorio/wiki/Debugging).
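For illustration only, a minimal sketch of the ordering that avoids that kind of race: make the trampoline page executable before patching the target to jump into it. This is not DR's actual interception code; the function, buffers, and sizes are hypothetical, and plain Win32 calls are used.

```c
#include <windows.h>
#include <string.h>

/* Hypothetical sketch: the trampoline bytes must be executable *before* the
 * target is patched to jump into them; otherwise another thread can take the
 * jump while the trampoline page is still non-executable (-x). */
static BOOL
emplace_hook(void *trampoline, size_t trampoline_size,
             void *patch_site, const BYTE *patch_bytes, size_t patch_size)
{
    DWORD prev;
    /* 1. Mark the trampoline +x first. */
    if (!VirtualProtect(trampoline, trampoline_size, PAGE_EXECUTE_READ, &prev))
        return FALSE;
    FlushInstructionCache(GetCurrentProcess(), trampoline, trampoline_size);
    /* 2. Only now patch the target, so any thread that hits the hook lands on
     * executable memory. */
    if (!VirtualProtect(patch_site, patch_size, PAGE_EXECUTE_READWRITE, &prev))
        return FALSE;
    memcpy(patch_site, patch_bytes, patch_size);
    FlushInstructionCache(GetCurrentProcess(), patch_site, patch_size);
    VirtualProtect(patch_site, patch_size, prev, &prev);
    return TRUE;
}
```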
It seems that the problem didn't go away; it was just hiding. I ran a fuzzing test and found several hung processes. I attached to one of them, and these are the call stacks of all the threads:
My setup:
They're all waiting on thread_initexit_lock -- but who holds that? That thread got killed? Is there any termination of threads from other processes going on?
I used procmon; no thread was killed, so the stack dump is for all threads.
From the callstacks, it looks like none of these threads holds the lock. In debug build the owner is stored in the lock data structure, but not in release build. The lock owner has presumably exited or been terminated, which shouldn't happen. It may not be easy to figure out from this snapshot. One avenue is to see if DR thinks there's another thread: what is
Thread 1 is in KiRaiseUserExceptionDispatcher I think -- maybe worth snooping into that thread to see if there's some clue.
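(As a rough sketch of the kind of lock structure being described, with the debug-only owner field mentioned above; the field names are illustrative guesses, not DR's actual definitions:)

```c
#include <windows.h>

/* Illustrative sketch only -- not DR's actual mutex definition. */
typedef struct _sketch_mutex_t {
    volatile LONG lock_requests; /* -1 = free; >= 0 = held, possibly contended */
    HANDLE contended_event;      /* created lazily on first contention */
#ifdef DEBUG
    DWORD owner;                 /* thread id of the current holder, 0 if free;
                                  * tracked only in debug build, which is why a
                                  * release-build dump cannot name the owner */
#endif
} sketch_mutex_t;
```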
The application itself is single-threaded; DR thinks there's only one thread (which makes sense).
I built a debug version of DynamoRIO and ran a single file. I'm getting a memory-leak warning and then a crash (access violation); the crash reproduces every time.
The crash is here: https://github.com/DynamoRIO/dynamorio/blob/master/core/utils.c#L516. I'm on commit dd1589c, on Windows 10 x64.
The app is single-threaded? Hmm, the debugger usually injects just one thread -- what are the other threads, then? Do you nudge from the outside or do anything to inject threads? I didn't notice the… The debug build callstack: if you could investigate the real frames in between 08 and 0b, that would help. Did you run the load_syms script? Is your library in there?
The app is single-threaded; the source code is here. The application is a test program for the fuzzer. In my test scenario WinAFL calls… In the debug build I was using the… Steps to reproduce:
I've been running into this at a pretty high rate with some code I'm trying to fuzz; it hangs 0.5%-5% of all target processes spawned in various runs. Given the inconsistency, I've mostly been using ETW to debug things, which fortunately gives me a pretty good set of samples across both hung and not-hung processes (it's also fairly resilient to DR messing with things, though the lack of UNWIND_INFO for DR-generated code cuts off stack traces). So to answer some of the questions raised:
Specifically, the issue occurs for me when one of the TppWorkerThread threads starts while take_over_threads holds the thread_initexit_mutex. When that happens, the newly created thread hits contention on thread_initexit_mutex...
(Note that this is after lock_requests is incremented but before contended_event has been initialized.) Then a few hundred microseconds later, os_take_over_all_unknown_threads releases the lock and sets the event:
But the TppWorkerThread never gets it. It's still waiting on that event to acquire that mutex (and WinDbg confirms it isn't set). Everything works fine when take_over_threads is done with the lock before a TppWorkerThread starts; it just never works when it hits the wait_contended lock. Any ideas on why the nt_set_event wouldn't work? It seems... odd. The two are spaced out enough that it seems unlikely to be hitting a weird race condition, and I can't believe the Nt*Event functions would be that unreliable. Could DynamoRIO be interfering with it somehow?
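To make the window being described more concrete, here is a simplified sketch of a contended-lock slow path of this general shape, with Win32 calls standing in for the Nt* syscalls DR uses directly. The names and details are illustrative, not DR's actual implementation.

```c
#include <windows.h>

typedef struct {
    volatile LONG lock_requests;     /* -1 = free; >= 0 = held/contended */
    HANDLE volatile contended_event; /* auto-reset event, created lazily */
} sketch_mutex_t;

static void
sketch_lock(sketch_mutex_t *m)
{
    if (InterlockedIncrement(&m->lock_requests) == 0)
        return; /* uncontended fast path */
    /* Contended path: lock_requests is already visible to the releaser here,
     * but contended_event may not exist yet -- the window noted above. */
    if (m->contended_event == NULL) {
        HANDLE e = CreateEventW(NULL, FALSE /* auto-reset */, FALSE, NULL);
        if (InterlockedCompareExchangePointer(
                (PVOID volatile *)&m->contended_event, e, NULL) != NULL)
            CloseHandle(e); /* another waiter installed the event first */
    }
    WaitForSingleObject(m->contended_event, INFINITE);
}

static void
sketch_unlock(sketch_mutex_t *m)
{
    if (InterlockedDecrement(&m->lock_requests) >= 0) {
        /* Waiters exist (or are arriving): wake one.  An auto-reset event
         * releases exactly one waiting thread per SetEvent call. */
        SetEvent(m->contended_event);
    }
}
```

A real implementation has to close the gap between making lock_requests visible and having a waitable contended_event (e.g. by pre-creating the event or re-checking after creating it); the sketch deliberately leaves that window open to show where a waiter like the TppWorkerThread above ends up parked.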
Crazy theory: could there have been a 3rd thread (another TppWorkerThread) that was also waiting on the thread_initexit_lock contention event, was the one woken up when os_take_over_all_unknown_threads released the mutex, and was then killed by something external before it could in turn release the mutex it now held and wake up the worker thread you are looking at?

Does this reproduce in debug build, where the owner of a mutex is stored? Although the thread could be killed between wake-up and writing the owner field. If the event is changed to wake up all waiting threads instead of just one, does the hang never happen?

The main issue we have hit in os_take_over_all_unknown_threads is in identifying threads we have already started to take over through the hook, and avoiding double-takeover. Maybe there was a double-takeover here, though I'm not sure how it would lead to these exact hang symptoms.
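(For reference, the wake-one vs. wake-all distinction being suggested, shown at the Win32 level; DR itself would use the Nt-level equivalents:)

```c
#include <windows.h>

/* Auto-reset: SetEvent releases exactly one waiting thread and the event then
 * resets itself, so any other waiters stay blocked. */
HANDLE wake_one = CreateEventW(NULL, FALSE, FALSE, NULL);

/* Manual-reset: SetEvent releases every current waiter and the event stays
 * signaled until ResetEvent is called -- the "wake all threads" variant. */
HANDLE wake_all = CreateEventW(NULL, TRUE, FALSE, NULL);
```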
There is a third thread (6748, for this process):
However, that thread never does anything. Literally: after it starts, the context-switch data shows it only spent a couple dozen microseconds running, and it looks like something to do with an APC. It never even attempts to acquire a mutex. The full list of thread_initexit_lock activity for a process that gets hung:
(mutex_lock with win:Start is before it attempts to acquire the lock; mutex_lock with win:Stop is after the lock has been acquired.) And wait a tick... 10µs after the main thread unlocks, the waiting thread tries to get the lock a second time?? And there's a 100% failure rate on collecting stack traces for the nested lock & wait, which is quite inconvenient and very unusual.
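(If that second acquire really is the same thread re-entering the same non-reentrant lock, the hang follows directly; a trivial sketch, reusing the hypothetical sketch_mutex_t shape from the earlier snippet:)

```c
/* Hypothetical: a non-reentrant lock self-deadlocks on a nested acquire. */
sketch_mutex_t lock = { -1, NULL };

sketch_lock(&lock); /* first acquire succeeds */
sketch_lock(&lock); /* second acquire by the same thread blocks forever:
                     * nothing is left to call sketch_unlock() */
```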
Is that double-takeover leading to the recursive lock attempt? Is os_take_over_thread()'s check for is_in_dynamo_dll() somehow failing (unclear how for this case)?
Doesn't look like there is anything unusual going on in os_take_over_all_unknown_threads:
The 1.3ms delay while suspending the first thread found seems somewhat common, but it looks like most of the time the other thread just doesn't get scheduled.
You mean the case where… Edit: dynamo_initialize has already been set before os_take_over_all_unknown_threads gets called, so they're only waiting if they haven't been scheduled since that was set.
Yes, the "is already waiting" code. For your described scenario of a thread waiting for the lock while the main thread tries to take over all the threads, if that "is already waiting" code is not hit, then I could see this hang happening with the recursive lock when DR tries to "interpret itself".
That sounds like confirmation that the underlying problem is double-takeover: the 3 cases of a hang did not hit the branch they needed to hit. It looks to me like the next step is to figure out why neither is_in_dynamo_dll nor new_thread_is_waiting_for_dr_init properly identified the thread as already having hit the takeover hook. |
err... I don't see any reason but luck that they would do that. The thread isn't waiting for dr_init because dr_init has already happened.
I think you're confusing process init and thread init. Here we are talking about thread init for the new thread. We are not talking about process init for DR itself.
No, DR initialization of this thread has not happened: in the scenario you described, this thread is sitting and waiting for the thread_initexit_lock so that it can begin its initialization. |
I'm equating dynamorio/core/win32/callback.c lines 3059 to 3067 (commit e20fec8)…
And I assume that…
Sorry, not the "is already waiting": it must be the sequence after that in os_take_over_thread. I have not looked at this code in a while, but somewhere there is code that is trying to identify a thread that is already on the path to DR control, where if we took it over at its current context, we're going to interpret our own code, which is not supported. The point is, os_take_over_all_unknown_threads() should leave this particular thread alone, b/c it's already going to be taken over. I am saying that the bug seems to be that DR incorrectly takes it over, resulting in the recursive lock. Double takeover. |
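A rough sketch of the skip decision being described, using the two checks named in this thread; the helper types and signatures here are guesses for illustration, not DR's actual ones:

```c
#include <stdbool.h>

typedef void *app_pc;          /* illustrative stand-ins for DR's types */
typedef unsigned int thread_id_t;

/* Checks named in this thread; these signatures are hypothetical. */
bool is_in_dynamo_dll(app_pc pc);
bool new_thread_is_waiting_for_dr_init(thread_id_t tid, app_pc pc);

/* Sketch: os_take_over_all_unknown_threads() should skip any thread that is
 * already on its way into DR (it has hit the takeover hook or is executing
 * inside DR's DLL); taking it over at its current PC would make DR interpret
 * its own code, and a second takeover produces the recursive lock attempt. */
static bool
should_take_over_unknown_thread(thread_id_t tid, app_pc pc)
{
    if (is_in_dynamo_dll(pc))
        return false;
    if (new_thread_is_waiting_for_dr_init(tid, pc))
        return false;
    return true;
}
```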
Xref #1443
I tried rearranging things a bit so that the thread wasn't removed from…
It's not an issue (it used to be an issue); it's more of an interest. I've been experiencing strange crashes with Release 6.2.0 and 7.0.0-rc1 that were solved in recent builds. Do you recognize what change caused these issues to go away? It's important to note that they were quite rare.
Thanks