ASSERT (drmgr-test) meta-instr faulted? must set translation field and handle fault! #1369
Related assert in translate.c; non-deterministic (can't repro):
This happened in AppVeyor: https://ci.appveyor.com/project/DynamoRIO/dynamorio/build/1.0.157/job/q6loeg0tadiuf3u9
The simpler translation failure also happened in AppVeyor: https://ci.appveyor.com/project/DynamoRIO/dynamorio/build/1.0.180/job/2ufv6l8duo8m5bic
Reviving this thread as this issue is seen frequently on a proprietary app running on AArch64.
The assert failure occurs at line 910 in 1bc72b6.
The code iterates over the ilist; the ilist is created at line 1292 in 1bc72b6.
DR attempts to create the Line 1303 in 08f48f3
During Line 4125 in 08f48f3
I found that two different sets of instrumentation are being applied by the client (drcachesim) -- using ... To verify: 3 consecutive attempts with ...
I see that drcachesim's tracer indeed invokes dr_unlink_flush_region: dynamorio/clients/drcachesim/tracer/tracer.cpp line 1393 in 5cbe811.
However, based on the documentation (dynamorio/core/lib/instrument_api.h line 6265 in 5cbe811), it shouldn't be invoked with size=0, which seems to be the case in the call above.
The documentation also mentions some restrictions on the options with which this routine is available. Verifying whether all these conditions are met...
I thought you had printed out the instrumentation applied to the original block and to the recreated block and they were identical?
And that the recreate walk was starting at a shifted point, beyond which the code was identical between the two: a clean call. I thought the instrumentation prior to the threshold was an inlined counter increment, ruling out the delay.
Regarding dr_unlink_flush_region: it is not synchronous, so there is a problem there with the instru switch. One solution is to use dr_flush_region instead, which is synchronous but more expensive -- but given that this only happens once, that's probably ok. Another is to store translations, which adds its own overhead. Another is to track which blocks used which type of instru inside the tracer.
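The last option -- tracking which blocks got which type of instru inside the tracer -- could look roughly like a tag-keyed table. Below is a minimal standalone sketch, assuming a fixed-size open-addressing table; `block_phase_table_t`, `table_set`, `table_get`, and the phase constants are all illustrative names, not DR or drcachesim code:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical phases, mirroring drcachesim's counting vs. tracing modes. */
enum { PHASE_COUNT = 1, PHASE_TRACE = 2 };

#define TABLE_SIZE 1024 /* power of two; a real tracer would resize */

typedef struct {
    uintptr_t tags[TABLE_SIZE]; /* 0 == empty slot */
    int phases[TABLE_SIZE];
} block_phase_table_t;

static void
table_init(block_phase_table_t *t)
{
    memset(t, 0, sizeof(*t));
}

/* Record which instrumentation phase was applied to the block with this tag. */
static void
table_set(block_phase_table_t *t, uintptr_t tag, int phase)
{
    size_t i = (tag * 2654435761u) & (TABLE_SIZE - 1);
    /* Linear probing: walk past occupied slots for other tags. */
    while (t->tags[i] != 0 && t->tags[i] != tag)
        i = (i + 1) & (TABLE_SIZE - 1);
    t->tags[i] = tag;
    t->phases[i] = phase;
}

/* Returns the recorded phase, or 0 if the tag is unknown. */
static int
table_get(block_phase_table_t *t, uintptr_t tag)
{
    size_t i = (tag * 2654435761u) & (TABLE_SIZE - 1);
    while (t->tags[i] != 0) {
        if (t->tags[i] == tag)
            return t->phases[i];
        i = (i + 1) & (TABLE_SIZE - 1);
    }
    return 0;
}
```

As noted below, duplicate tags for thread-private and intra-trace blocks would mean a single tag-keyed table like this is not sufficient as-is.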
This may not be simple with duplicate tags for thread-private blocks and intra-trace blocks. |
I'm first trying to replace dr_unlink_flush_region with dr_flush_region. Seeing a new debug assert failure:
This seems to come up when I add the ...
Here's a stacktrace for this:
I was able to resolve this by relaxing the assert at dynamorio/core/arch/clean_call_opt_shared.c line 216 in 5cbe811 to use TRY_EXCEPT_ALLOW_NO_DCONTEXT with GLOBAL_DCONTEXT, instead of TRY_EXCEPT. This seems reasonable to me because the current thread's dcontext is no longer available from frame #21 onwards above (see the ignored arg).
After this patch, I'm still seeing the "meta-instr faulted" assert failure. This happens before ...
The current thread's dcontext is always available through its TLS pointer (get_thread_private_dcontext()), unless you're in early process init or late process exit code.
@derekbruening Any thoughts about this? This seems like another race to me. It occurs during handling of ... As I understand it, this is the sequence of events for the thread T that sees the "meta-instr faulted" assert failure:
Yes, it's safer in code that might be called in early init or late exit, but I don't think that clean call analysis is such code.
Did you do the flush first and only when it's finished enable the second-phase instrumentation?
Hmm, it does seem to be impossible, doesn't it, with the current interface. The flushing is all geared toward the client being able to free bookkeeping. It may be that the only solution is to store translations, or to add a callback to the synchall flush, called at the all-suspended point. Interesting that this has not come up before with phased instrumentation.
Earlier, I hadn't tried changing the current order of those two: first the change in instrumentation and then the flush. My thought was that there'd be a race even if we changed the order (flush first, then change instrumentation): since other threads are still running, some bbs may get the old instrumentation again.
Storing translations leads to more memory overhead (up to 20% per the documentation). Is that acceptable?
And change the instrumentation in the callback, right? But won't DR continue to build new fragments while the flush and callback are in progress? Or can we pause all threads somehow until the callback completes?
Yes, that is what I meant by it being impossible.
For most uses, like standalone tracing of SPEC or something, it's fine. For resource-budgeted uses on cloud/datacenter: maybe measure it on a large app to see a real number. The other concern is that storing is not used much, so there may be stability risks -- but it would be good long-term to find any such issues anyway.
Yes.
This "synchall" type of flush suspends all the threads.
I added an optional callback arg to dr_flush_region.
Back to the seg fault that I saw with the other workaround (reversing the flush and change-instrumentation order) too: I'm currently trying to understand what's different between the threads that segfault and the ones that don't. Initially, all threads are suspended in ...
I'll create a draft PR to show intermediate progress, so that it's easier to provide feedback.
DR translates a fault in the code cache to a fault at the corresponding application address. This is done using ilist reconstruction for the fragment where the fault occurred. But this does not work as expected when the DR client changes instrumentation during execution; currently, drcachesim does this to enable tracing after -trace_after_instrs. The reconstructed basic block gets the new instrumentation whereas the one in the code cache has the old one. This causes issues during fault handling.

In the current drcachesim case, it appears as though a meta-instr has faulted, because the reconstructed ilist has a meta-instr at the code cache fault pc. This issue may manifest differently if the basic block with the new instrumentation is smaller than the old one (unlike the drcachesim "meta-instr faulted" case) and the faulting address lies beyond the end of the new instrumented basic block: we may see an ASSERT_NOT_REACHED due to the ilist walk ending before the faulting code cache pc was found in the reconstructed ilist.

In the existing code, drcachesim attempts to avoid this by flushing old fragments using dr_unlink_flush_region after it switches to the tracing instrumentation. However, because the flush is asynchronous, there's a race and the flush does not complete in time.

This PR adds support for a callback in the synchronous dr_flush_region API. The callback is executed after the flush but before the threads are resumed. Using the dr_flush_region callback to change drcachesim instrumentation ensures that the old instrumentation is not applied after the flush and the new one is not applied before.

Issue: #1369
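From the tracer's side, the callback-based flow might be used roughly as follows. This is a hedged sketch: `stub_flush_region_with_callback` is a local stub standing in for the extended synchronous flush this PR adds (the real DR name and signature may differ), and `hit_instr_count_threshold` plus the mode constants are hypothetical:

```c
#include <assert.h>
#include <stddef.h>

/* Tracer modes, mirroring drcachesim's pre-trace counting vs. tracing. */
enum { MODE_COUNT, MODE_TRACE };
static int tracer_mode = MODE_COUNT;

typedef void (*flush_cb_t)(void *user_data);

/* Stub standing in for the PR's extended synchronous flush: the real DR
 * routine suspends all threads, flushes [start, start+size), runs the
 * callback at the all-suspended point, then resumes the threads. */
static void
stub_flush_region_with_callback(void *start, size_t size, flush_cb_t cb,
                                void *user_data)
{
    (void)start;
    (void)size;
    /* ...unlink and delete old fragments while everyone is suspended... */
    cb(user_data); /* no thread can build or run a block during this window */
    /* ...resume threads... */
}

/* Callback: switch instrumentation while no old fragment can run. */
static void
enable_tracing_cb(void *user_data)
{
    (void)user_data;
    tracer_mode = MODE_TRACE;
}

/* Invoked when -trace_after_instrs triggers. */
static void
hit_instr_count_threshold(void)
{
    stub_flush_region_with_callback(NULL, ~(size_t)0, enable_tracing_cb, NULL);
}
```

The point of the design is the ordering: because the mode flip happens inside the callback, no thread can build a block with the old instrumentation after the flush, nor with the new instrumentation before it.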
@abhinav92003 -- are you sure that PR #4491 was related to this issue? Stepping back, this was filed for this assert happening on x86 in the drmgr-test. Your fix was for phased instrumentation doing a racy phase switch, right? I don't think that's related to drmgr-test on x86? Probably should re-open?
PR #4491 definitely fixed this assert on the proprietary app. I didn't reproduce the issue on those tests, as (based on the comments) it seemed to occur very infrequently, whereas it occurred 4/5 runs on the proprietary app. I guess it's possible that the tests hit the assert for a different reason than what PR #4491 fixes.
This same assert appeared on the cleancallsig test:
Again on cleancallsig:
From bruen...@google.com on February 19, 2014 10:57:54
This has happened at least twice in the last year or so when I've run the
test suite on win7. I can't reproduce it when run manually.
It happened on 32-bit once too -- from my notes from long ago, apparently never filed:
Original issue: http://code.google.com/p/dynamorio/issues/detail?id=1369