tool.drcachesim.delay-global and tool.drcacheoff.max-global tests failing: did not hit max #4711

Closed
derekbruening opened this issue Feb 4, 2021 · 16 comments


@derekbruening (Contributor)

Looks like it didn't hit the max:
https://github.com/DynamoRIO/dynamorio/runs/1833435176?check_suite_focus=true

41/80 Test #252: code_api|tool.drcachesim.delay-global ........................***Failed  Required regular expression not found. Regex=[^Hit delay threshold: enabling tracing.
Hit -max_global_trace_refs: disabling tracing.
.*
     Total Number Of iterations   :  3
    ...................................................................
---- <application exited with code 0> ----
Basic counts tool results:
Total counts:
.*
           1 total threads
.*
$
]  0.18 sec
Hit delay threshold: enabling tracing.
---- <application exited with code 139> ----
Basic counts tool results:

Same thing for offline:

276/357 Test #277: code_api|tool.drcacheoff.max-global ..........................***Failed    0.64 sec
Running cmd |/home/runner/work/dynamorio/dynamorio/build_debug-internal-64/bin64/drrun;-s;90;-quiet;-debug;-killpg;-stderr_mask;0xC;-dumpcore_mask;0;-code_api;-t;drcachesim;-offline;-subdir_prefix;tool.drcacheoff.max-global;-trace_after_instrs;20K;-max_global_trace_refs;10K;--;/home/runner/work/dynamorio/dynamorio/build_debug-internal-64/suite/tests/bin/client.annotation-concurrency;libclient.annotation-concurrency.appdll.so;A;4;64;3|
Running postcmd |/home/runner/work/dynamorio/dynamorio/build_debug-internal-64/clients/bin64/drcachesim;-indir;/home/runner/work/dynamorio/dynamorio/build_debug-internal-64/suite/tests/tool.drcacheoff.max-global.client.annotation-concurrency.22882.5929.dir;-simulator_type;basic_counts|
CMake Error at /home/runner/work/dynamorio/dynamorio/suite/tests/runmulti.cmake:130 (message):
  output |Hit delay threshold: enabling tracing.

  Basic counts tool results:
<...>
  | failed to match expected output |Hit delay threshold: enabling tracing.

  Hit -max_global_trace_refs: disabling tracing.

  .*
@derekbruening (Contributor Author)

Happened again, on 2 runs in a row: https://github.com/DynamoRIO/dynamorio/pull/4721/checks?check_run_id=1841880860

@sapostolakis (Contributor)

Unfortunately, I have so far been unable to reproduce this failure manually.
I tested both tests with both a release and a debug build on a Linux x86_64 machine.
For each of the two tests and builds, I did 1000 runs with a bash-script loop calling ctest -VV -I XXX,XXX for each run, and I also did 300 runs with ctest --repeat-until-fail 300 -V -R delay-global (and likewise for max-global). There was no failure in any of these runs. For delay-global I also did some manual runs invoking the command directly without ctest. Not sure what else to try.

For the observed failures, the likely culprit is the x86-related changes in #4677.
Since I did not have a failing case to debug, I manually investigated the effect of these changes.
In particular, I looked into the logs of some runs and compared the instrumented code before and after #4677, and the behavior is what I would expect. #4677 addressed two potential issues with clobbered aflags on x86:

  1. In the one case that creates no code changes, it wraps a previously unprotected inserted cmp instruction with drreg aflags reserve/unreserve calls (see the sketch after this list). Since the aflags are already reserved earlier (when updating the counter in drx.c), this change does not alter the instrumented code: the aflags are already clobbered (!native) and the restoration happens toward the end of the basic block, so the new reservation does not save the aflags again and there is only one restoration at the end. Without this change there was no bug, since even the unprotected cmp was indirectly protected, but I think the change is warranted: it is fragile for tracer.cpp to rely on the implementation of a function in drx.c (e.g., there is a code path where drreg is not used to save/restore the aflags).
  2. The other case is restoring the aflags and scratch registers on the path before the clean call to hit_instr_count_threshold(), which might not return. For the two tests mentioned here, the threshold is less than INT_MAX, so no scratch register is restored; only the aflags are. In the logs I see the expected behavior: on this path the aflags are restored (newly inserted instructions: cmp %al $0x81; sahf %ah), and this is the only difference from before #4677 (i#4487: inline instr count for trace_after_instrs in drcachesim, AArch64).
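
For concreteness, here is a minimal sketch of the item-1 pattern, assuming the standard drreg API; the function name, scratch register, and operands are illustrative, not the actual tracer.cpp code:

```c
#include "dr_api.h"
#include "drreg.h"

/* Illustrative only: make the flags clobber explicit by wrapping the inserted
 * cmp in a drreg aflags reservation, rather than relying on drx.c having
 * already reserved the aflags earlier in the block. */
static void
insert_threshold_check(void *drcontext, instrlist_t *ilist, instr_t *where,
                       reg_id_t scratch /* hypothetical: holds the counter */)
{
    if (drreg_reserve_aflags(drcontext, ilist, where) != DRREG_SUCCESS)
        DR_ASSERT(false);
    /* The inserted comparison that clobbers the arithmetic flags. */
    instrlist_meta_preinsert(
        ilist, where,
        INSTR_CREATE_cmp(drcontext, opnd_create_reg(scratch),
                         OPND_CREATE_INT32(0)));
    /* ... a conditional jump consuming the flags would be inserted here ... */
    if (drreg_unreserve_aflags(drcontext, ilist, where) != DRREG_SUCCESS)
        DR_ASSERT(false);
}
```

As item 1 notes, if drreg already holds the aflags unrestored at this point, the extra reservation generates no additional save; it just removes the implicit dependence on drx.c's internals.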

I also manually checked the counter updates and whether the threshold is hit at the correct time, and it seems to work correctly. Of course, this holds only for runs that don't fail, but at least I confirmed that the threshold is hit after 20K instructions.

One option going forward might be to revert the x86-related changes in #4677 and see whether these failures persist. Or find a way to reproduce these failures in manual runs to facilitate more debugging.

@derekbruening (Contributor Author)

The output shows that the tests are hitting the -trace_after_instrs threshold, since they print "Hit delay threshold: enabling tracing." What they are not reaching is the -max_global_trace_refs threshold, which is not related to #4677. I thought maybe the compiler version or something else changed on GA CI and there are just too few refs, so the app exits before hitting the threshold: though I thought the value chosen was not close to the end. I think to reproduce we would want to use an Ubuntu 16.04 VM to more closely match GA CI.

@sapostolakis (Contributor)

I see. Yeah, I noticed the problem was with hitting the max, but I thought maybe some program state (potentially the flags) was corrupted before or at the time of reaching the threshold, leading to a problem later on. That was mostly because, AFAIK, #4677 was the most recent change somewhat related to this issue. In any case, it was good to do some sanity checks to exclude (to some extent) this scenario.

@derekbruening (Contributor Author)

I have an Ubuntu 16.04 VM with gcc 5.4.0, which is close to the GA CI setup (that has 5.5.0). I built and ran both tests 1000x (a loop around individual ctest invocations) and did not reproduce it.

@derekbruening (Contributor Author) commented Feb 9, 2021

Actually, looking more closely at the pasted output in the top comment, delay-global crashed with SIGSEGV! The exit code is 139 (128 + 11, i.e., the process died on signal 11, SIGSEGV), which I did not notice at first:

---- <application exited with code 139> ----

For the offline version, max-global, it's not clear: is it possible it crashed too, and runmulti.cmake doesn't care because it looks only at the output?

So maybe #4677 is a suspect after all, since this now seems to be an app crash?

The most recent case fails both tests in both debug and release -- which makes it seem like it's something about a particular type of VM: if the job is scheduled on that type of VM, they all fail?
https://github.com/DynamoRIO/dynamorio/pull/4726/checks?check_run_id=1865162842

2021-02-09T17:58:30.6369267Z debug-internal-64: 353 tests passed, **** 4 tests failed, of which 2 were flaky: ****
2021-02-09T17:58:30.6370340Z 	code_api|tool.drcachesim.delay-global 
2021-02-09T17:58:30.6371322Z 	code_api|tool.drcacheoff.max-global 
2021-02-09T17:58:30.6372121Z 	code_api|api.detach_spawn_stress_FLAKY 
2021-02-09T17:58:30.6372728Z 	code_api|api.static_sideline_FLAKY 
2021-02-09T17:58:30.6373672Z release-external-64: 78 tests passed, **** 2 tests failed: ****
2021-02-09T17:58:30.6374689Z 	code_api|tool.drcachesim.delay-global 
2021-02-09T17:58:30.6375683Z 	code_api|tool.drcacheoff.max-global 

The prior ones are almost the same but only have 1 debug failure -- hmm:

debug-internal-64: 354 tests passed, **** 3 tests failed, of which 2 were flaky, but ignoring 0 for i#2941: ****
	code_api|tool.drcacheoff.max-global 
	code_api|api.detach_spawn_stress_FLAKY 
	code_api|api.static_sideline_FLAKY 
release-external-64: 78 tests passed, **** 2 tests failed, but ignoring 0 for i#2941: ****
	code_api|tool.drcachesim.delay-global 
	code_api|tool.drcacheoff.max-global 
2021-02-05T21:19:04.3306811Z debug-internal-64: 354 tests passed, **** 3 tests failed, of which 2 were flaky: ****
2021-02-05T21:19:04.3307803Z 	code_api|tool.drcacheoff.max-global 
2021-02-05T21:19:04.3308471Z 	code_api|api.detach_spawn_stress_FLAKY 
2021-02-05T21:19:04.3309023Z 	code_api|api.static_sideline_FLAKY 
2021-02-05T21:19:04.3309956Z release-external-64: 78 tests passed, **** 2 tests failed: ****
2021-02-05T21:19:04.3310926Z 	code_api|tool.drcachesim.delay-global 
2021-02-05T21:19:04.3311876Z 	code_api|tool.drcacheoff.max-global 

@derekbruening (Contributor Author)

@sapostolakis: One idea is to create two branches, one with #4677 and one without, with each of them disabling all other tests except these two, and run them many times. Then we repeatedly re-run the GA CI tests on those branches (we could script it through the GitHub REST API or whatnot, or just keep pushing empty commits, or even manually click several times a day) and see whether we get zero failures on the branch without #4677 and non-zero on the other.

@sapostolakis (Contributor)

Sounds reasonable. I created the two branches. The i4711-with-4677 branch is the same as master, while i4711-without-4677 has all the x86-related changes in #4677 reverted.

@sapostolakis (Contributor)

I can see how to manually run the whole ci-x86 workflow for these branches from the Actions window, but I don't know how to selectively run only the two tests of interest. Do we have to create a new workflow?

@derekbruening (Contributor Author) commented Feb 10, 2021

I would go in and change the code (just temporarily on this branch, of course) to target only those tests.
One way is via a label: in runsuite.cmake, see how OSX uses one:

set(extra_ctest_args INCLUDE_LABEL OSX)

And how some tests are given that label in suite/tests/CMakeLists.txt:

set_tests_properties(code_api|client.flush PROPERTIES LABELS OSX)

I would assign a new label to these two tests and have runsuite.cmake target just that label.

If using a push to trigger, I would delete all the other workflow files so only the x86 one runs.

@sapostolakis (Contributor)

In PR #4730 we are reverting an x86_64 fix (restoring the arithmetic flags and, if used, the scratch register before a clean call to hit_instr_count_threshold that might not return) to prevent these test failures. Essentially, we are leaving x86_64 with buggy code, since the fix seems to cause problems that we do not yet understand.

sapostolakis added a commit that referenced this issue Feb 17, 2021
Revert certain x86-related changes from #4677. In particular, we avoid restoring the arithmetic flags and (if used) the scratch register before the call to `hit_instr_count_threshold`, which might not return.
The reverted fix is necessary in principle, but it seems to be causing failures in `drcachesim.delay-global` and `drcacheoff.max-global` in x86-64 CI testing. These failures are not reproducible outside CI testing and are (so far) inexplicable based on manual analysis of the instrumented assembly code (see #4711 (comment)).
Thus, we are leaving x86_64 technically broken to keep our tests green until the source of the instability is found.

Issue: #4711
@derekbruening (Contributor Author)

For #4128 we're adding automated restoration of all app values, and my plan is to remove the deliberately-not-restored buggy code and see whether the problems return.

@derekbruening (Contributor Author)

No failure in PR #5164 run. Closing as non-repro.

@derekbruening (Contributor Author)

We noticed in PR #5164 that the drreg_statelessly_restore_app_value() calls in tracer.cpp were passing the same `where` for both where_respill and where_restore, which will do the wrong thing for aflags stored in rax. Maybe this was the cause of the failures.
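
If that diagnosis is right, the fix is essentially a one-argument change. A hedged sketch (assuming, per the drreg docs, that DR_REG_NULL selects the aflags, and with `where` standing for the clean-call instruction):

```c
bool restore_needed, respill_needed;
/* Suspected buggy form: the respill point equals the restore point, which
 * mishandles aflags that drreg is holding in xax. */
drreg_statelessly_restore_app_value(drcontext, ilist, DR_REG_NULL /* aflags */,
                                    where, where,
                                    &restore_needed, &respill_needed);
/* Corrected form: respill after the (possibly non-returning) clean call. */
drreg_statelessly_restore_app_value(drcontext, ilist, DR_REG_NULL /* aflags */,
                                    where, instr_get_next(where),
                                    &restore_needed, &respill_needed);
```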

derekbruening added a commit that referenced this issue Oct 20, 2021
Adds new dr_cleancall_save_t flags which are required for proper
interaction between clean calls and drreg:
DR_CLEANCALL_READS_APP_CONTEXT must be set for dr_get_mcontext() to
obtain the proper values, and DR_CLEANCALL_WRITES_APP_CONTEXT must be
set to ensure that dr_set_mcontext() is persistent.
DR_CLEANCALL_MULTIPATH must additionally be set for might-skip calls.

Adds a clean call insertion event to enable drreg to know about clean
calls at the time they are inserted.  dr_insert_clean_call_ex()
invokes the callback and passes the flags to drreg, which then treats
the clean call as an app instruction.

For annotations, for now we leave drreg looking for the annotation
label (possible future changes #5160 or #5161 would eliminate this
special case).

dr_insert_{cbr,ubr,mbr,call}_instrumentation() always set both flags.

drwrap always sets both flags for pre and post callbacks.

Updates uses throughout our tests and samples to use the new flags as
appropriate.

Adds drreg_statelessly_restore_all() for clean call multipath restoration.

Adds a new dedicated test client.drwrap-drreg-test which tests both a
drwrap call and direct clean calls.

Fixes a missing drwrap cache invalidation on module unload that the
new test uncovers.

This likely fixes #4711, as its code was passing the same location for
where_respill as for where_restore in the stateless drreg restoration;
the automated restore here correctly passes the post-instr location.

Issue: #4128, #4711
Fixes #4128
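
As a rough usage sketch (the argument list is illustrative, not the exact tracer.cpp call), a might-skip clean call such as hit_instr_count_threshold would now be inserted with the new flags so drreg can treat it like an app instruction:

```c
dr_insert_clean_call_ex(drcontext, ilist, where,
                        (void *)hit_instr_count_threshold,
                        (dr_cleancall_save_t)(DR_CLEANCALL_READS_APP_CONTEXT |
                                              DR_CLEANCALL_WRITES_APP_CONTEXT |
                                              DR_CLEANCALL_MULTIPATH),
                        1 /* num_args */,
                        OPND_CREATE_INTPTR(instr_get_app_pc(where)));
```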