tool.drcachesim.delay-global and tool.drcacheoff.max-global tests failing: did not hit max #4711

Closed
derekbruening opened this issue Feb 4, 2021 · 16 comments


@derekbruening (Contributor)

Looks like it didn't hit the max:
https://github.com/DynamoRIO/dynamorio/runs/1833435176?check_suite_focus=true

41/80 Test #252: code_api|tool.drcachesim.delay-global ........................***Failed  Required regular expression not found. Regex=[^Hit delay threshold: enabling tracing.
Hit -max_global_trace_refs: disabling tracing.
.*
     Total Number Of iterations   :  3
    ...................................................................
---- <application exited with code 0> ----
Basic counts tool results:
Total counts:
.*
           1 total threads
.*
$
]  0.18 sec
Hit delay threshold: enabling tracing.
---- <application exited with code 139> ----
Basic counts tool results:

Same thing for offline:

276/357 Test #277: code_api|tool.drcacheoff.max-global ..........................***Failed    0.64 sec
Running cmd |/home/runner/work/dynamorio/dynamorio/build_debug-internal-64/bin64/drrun;-s;90;-quiet;-debug;-killpg;-stderr_mask;0xC;-dumpcore_mask;0;-code_api;-t;drcachesim;-offline;-subdir_prefix;tool.drcacheoff.max-global;-trace_after_instrs;20K;-max_global_trace_refs;10K;--;/home/runner/work/dynamorio/dynamorio/build_debug-internal-64/suite/tests/bin/client.annotation-concurrency;libclient.annotation-concurrency.appdll.so;A;4;64;3|
Running postcmd |/home/runner/work/dynamorio/dynamorio/build_debug-internal-64/clients/bin64/drcachesim;-indir;/home/runner/work/dynamorio/dynamorio/build_debug-internal-64/suite/tests/tool.drcacheoff.max-global.client.annotation-concurrency.22882.5929.dir;-simulator_type;basic_counts|
CMake Error at /home/runner/work/dynamorio/dynamorio/suite/tests/runmulti.cmake:130 (message):
  output |Hit delay threshold: enabling tracing.

  Basic counts tool results:
<...>
  | failed to match expected output |Hit delay threshold: enabling tracing.

  Hit -max_global_trace_refs: disabling tracing.

  .*
@derekbruening (Contributor Author)

Happened again, on 2 runs in a row: https://github.com/DynamoRIO/dynamorio/pull/4721/checks?check_run_id=1841880860

@sapostolakis (Contributor)

Unfortunately, I have so far been unable to reproduce this failure manually.
I tested both tests with both a release and a debug build on a Linux x86_64 machine.
For each of the two tests and builds, I did 1000 runs with a bash-script loop calling ctest -VV -I XXX,XXX for each run, and I also did 300 runs with ctest --repeat-until-fail 300 -V -R delay-global (and likewise for max-global). There was no failure in any of these runs. For delay-global I also did some manual runs invoking the command directly without ctest. Not sure what else to try.

For the observed failures, the likely culprit is the x86-related changes in #4677.
Since I did not have a failing case to debug, I manually investigated the effect of these changes.
In particular, I looked into the logs of some runs and compared the instrumented code before and after #4677, and the behavior is what I would expect. #4677 addressed two potential issues with clobbered aflags on x86:

  1. In the one case that creates no code changes, it wraps a previously unprotected inserted cmp instruction with drreg aflags reserve/unreserve calls (see the sketch after this list). Since the aflags are already reserved earlier (when updating the counter in drx.c), this change does not alter the instrumented code: the aflags are already clobbered (!native) and the restoration happens toward the end of the basic block, so the new reservation does not save the aflags again and there is only one restoration at the end. Without this change there was no bug, since even the unprotected cmp was indirectly protected, but I think the change is warranted: it is fragile for tracer.cpp to rely on the implementation of a function in drx.c (e.g., there is a code path where drreg is not used to save/restore the aflags).
  2. The other case is restoring the aflags and scratch registers on the path before the clean call to hit_instr_count_threshold(), which might not return. For the two tests mentioned here, the threshold is less than INT_MAX, so no scratch register is restored; only the aflags are. In the logs I see the expected behavior: on this path the aflags are restored (newly inserted instructions: cmp %al $0x81; sahf %ah), and this is the only difference from before #4677 (i#4487: inline instr count for trace_after_instrs in drcachesim, AArch64).
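
For concreteness, here is a minimal sketch of the item-1 pattern, assuming the standard drreg API; the function name, scratch register, and operands are illustrative, not the actual tracer.cpp code:

```c
#include "dr_api.h"
#include "drreg.h"

/* Illustrative only: make the flags clobber explicit by wrapping the inserted
 * cmp in a drreg aflags reservation, rather than relying on drx.c having
 * already reserved the aflags earlier in the block. */
static void
insert_threshold_check(void *drcontext, instrlist_t *ilist, instr_t *where,
                       reg_id_t scratch /* hypothetical: holds the counter */)
{
    if (drreg_reserve_aflags(drcontext, ilist, where) != DRREG_SUCCESS)
        DR_ASSERT(false);
    /* The inserted comparison that clobbers the arithmetic flags. */
    instrlist_meta_preinsert(
        ilist, where,
        INSTR_CREATE_cmp(drcontext, opnd_create_reg(scratch),
                         OPND_CREATE_INT32(0)));
    /* ... a conditional jump consuming the flags would be inserted here ... */
    if (drreg_unreserve_aflags(drcontext, ilist, where) != DRREG_SUCCESS)
        DR_ASSERT(false);
}
```

As item 1 notes, if drreg already holds the aflags unrestored at this point, the extra reservation generates no additional save; it just removes the implicit dependence on drx.c's internals.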

I also manually checked the counter updates and whether the threshold is hit at the correct time, and it seems to work correctly. Of course, this holds only for runs that don't fail, but at least I confirmed that the threshold is hit after 20K instructions.

One option going forward might be to revert the x86-related changes in #4677 and see whether these failures persist. Or find a way to reproduce these failures in manual runs to facilitate more debugging.

@derekbruening (Contributor Author)

The output shows that the tests are hitting the -trace_after_instrs threshold, since they print "Hit delay threshold: enabling tracing." What they are not reaching is the -max_global_trace_refs threshold, which is not related to #4677. I thought maybe the compiler version or something else changed on GA CI and there are just too few refs, so the app exits before hitting the threshold: though I thought the value chosen was not close to the end. I think to reproduce we would want to use an Ubuntu 16.04 VM to more closely match GA CI.

@sapostolakis (Contributor)

I see. Yeah, I noticed the problem was with hitting the max, but I thought maybe some program state (potentially the flags) was corrupted before or at the time of reaching the threshold, leading to a problem later on. That was mostly because, AFAIK, #4677 was the most recent change somewhat related to this issue. In any case, it was good to do some sanity checks to exclude (to some extent) this scenario.

@derekbruening (Contributor Author)

I have an Ubuntu 16.04 VM with gcc 5.4.0, which is close to the GA CI setup (that has 5.5.0). I built and ran both tests 1000x (a loop around individual ctest invocations) and did not reproduce it.

@derekbruening (Contributor Author) commented Feb 9, 2021

Actually, looking more closely at the pasted output in the top comment, delay-global crashed with SIGSEGV! The exit code is 139 (128 + 11, i.e., the process died on signal 11, SIGSEGV), which I did not notice at first:

---- <application exited with code 139> ----

For the offline version, max-global, it's not clear: is it possible it crashed too, and runmulti.cmake doesn't care because it looks only at the output?

So maybe #4677 is a suspect after all, since this now seems to be an app crash?

The most recent case fails both tests in both debug and release -- which makes it seem like it's something about a particular type of VM: if the job is scheduled on that type of VM, they all fail?
https://github.com/DynamoRIO/dynamorio/pull/4726/checks?check_run_id=1865162842

2021-02-09T17:58:30.6369267Z debug-internal-64: 353 tests passed, **** 4 tests failed, of which 2 were flaky: ****
2021-02-09T17:58:30.6370340Z 	code_api|tool.drcachesim.delay-global 
2021-02-09T17:58:30.6371322Z 	code_api|tool.drcacheoff.max-global 
2021-02-09T17:58:30.6372121Z 	code_api|api.detach_spawn_stress_FLAKY 
2021-02-09T17:58:30.6372728Z 	code_api|api.static_sideline_FLAKY 
2021-02-09T17:58:30.6373672Z release-external-64: 78 tests passed, **** 2 tests failed: ****
2021-02-09T17:58:30.6374689Z 	code_api|tool.drcachesim.delay-global 
2021-02-09T17:58:30.6375683Z 	code_api|tool.drcacheoff.max-global 

The prior ones are almost the same but only have 1 debug failure -- hmm:

debug-internal-64: 354 tests passed, **** 3 tests failed, of which 2 were flaky, but ignoring 0 for i#2941: ****
	code_api|tool.drcacheoff.max-global 
	code_api|api.detach_spawn_stress_FLAKY 
	code_api|api.static_sideline_FLAKY 
release-external-64: 78 tests passed, **** 2 tests failed, but ignoring 0 for i#2941: ****
	code_api|tool.drcachesim.delay-global 
	code_api|tool.drcacheoff.max-global 
2021-02-05T21:19:04.3306811Z debug-internal-64: 354 tests passed, **** 3 tests failed, of which 2 were flaky: ****
2021-02-05T21:19:04.3307803Z 	code_api|tool.drcacheoff.max-global 
2021-02-05T21:19:04.3308471Z 	code_api|api.detach_spawn_stress_FLAKY 
2021-02-05T21:19:04.3309023Z 	code_api|api.static_sideline_FLAKY 
2021-02-05T21:19:04.3309956Z release-external-64: 78 tests passed, **** 2 tests failed: ****
2021-02-05T21:19:04.3310926Z 	code_api|tool.drcachesim.delay-global 
2021-02-05T21:19:04.3311876Z 	code_api|tool.drcacheoff.max-global 

@derekbruening (Contributor Author)

@sapostolakis: One idea is to create two branches, one with #4677 and one without, with each of them disabling all other tests except these two, and run them many times. Then we repeatedly re-run the GA CI tests on those branches (we could script it through the GitHub REST API or whatnot, or just keep pushing empty commits, or even manually click several times a day) and see whether we get zero failures on the branch without #4677 and non-zero on the other.

@sapostolakis (Contributor)

Sounds reasonable. I created the two branches. The i4711-with-4677 branch is the same as master, while i4711-without-4677 has all the x86-related changes in #4677 reverted.

@sapostolakis (Contributor)

I can see how to manually run the whole ci-x86 workflow for these branches from the Actions window, but I don't know how to selectively run only the two tests of interest. Do we have to create a new workflow?

@derekbruening (Contributor Author) commented Feb 10, 2021

I would go in and change the code (just temporarily on this branch, of course) to target only those tests.
One way is via a label: in runsuite.cmake, see how OSX uses one:

set(extra_ctest_args INCLUDE_LABEL OSX)

And how some tests are given that label in suite/tests/CMakeLists.txt:

set_tests_properties(code_api|client.flush PROPERTIES LABELS OSX)

I would assign a new label to these two tests and have runsuite.cmake target just that label.

If using a push to trigger, I would delete all the other workflow files so only the x86 one runs.

@sapostolakis (Contributor)

In PR #4730 we are reverting an x86_64 fix (restoring the arithmetic flags and, if used, the scratch register before a clean call to hit_instr_count_threshold that might not return) to prevent these test failures. Essentially, we are leaving x86_64 with buggy code, since the fix seems to cause problems that we do not yet understand.

sapostolakis added a commit that referenced this issue Feb 17, 2021
Revert certain x86-related changes from #4677. In particular, we avoid restoring the arithmetic flags and (if used) the scratch register before the call to `hit_instr_count_threshold`, which might not return.
The reverted fix is necessary in principle, but it seems to be causing failures in `drcachesim.delay-global` and `drcacheoff.max-global` in x86-64 CI testing. These failures are not reproducible outside CI testing and are (so far) inexplicable based on manual analysis of the instrumented assembly code (see #4711 (comment)).
Thus, we are leaving x86_64 technically broken to keep our tests green until the source of the instability is found.

Issue: #4711
@derekbruening (Contributor Author)

For #4128 we're adding automated restoration of all app values, and my plan is to remove the deliberately-not-restored buggy code and see whether the problems return.

@derekbruening (Contributor Author)

No failure in PR #5164 run. Closing as non-repro.

@derekbruening (Contributor Author)

We noticed in PR #5164 that the drreg_statelessly_restore_app_value() calls in tracer.cpp were passing the same `where` for both where_respill and where_restore, which will do the wrong thing for aflags stored in rax. Maybe this was the cause of the failures.
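
If that diagnosis is right, the fix is essentially a one-argument change. A hedged sketch (assuming, per the drreg docs, that DR_REG_NULL selects the aflags, and with `where` standing for the clean-call instruction):

```c
bool restore_needed, respill_needed;
/* Suspected buggy form: the respill point equals the restore point, which
 * mishandles aflags that drreg is holding in xax. */
drreg_statelessly_restore_app_value(drcontext, ilist, DR_REG_NULL /* aflags */,
                                    where, where,
                                    &restore_needed, &respill_needed);
/* Corrected form: respill after the (possibly non-returning) clean call. */
drreg_statelessly_restore_app_value(drcontext, ilist, DR_REG_NULL /* aflags */,
                                    where, instr_get_next(where),
                                    &restore_needed, &respill_needed);
```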

derekbruening added a commit that referenced this issue Oct 20, 2021
Adds new dr_cleancall_save_t flags which are required for proper
interaction between clean calls and drreg:
DR_CLEANCALL_READS_APP_CONTEXT must be set for dr_get_mcontext() to
obtain the proper values, and DR_CLEANCALL_WRITES_APP_CONTEXT must be
set to ensure that dr_set_mcontext() is persistent.
DR_CLEANCALL_MULTIPATH must additionally be set for might-skip calls.

Adds a clean call insertion event to enable drreg to know about clean
calls at the time they are inserted.  dr_insert_clean_call_ex()
invokes the callback and passes the flags to drreg, which then treats
the clean call as an app instruction.

For annotations, for now we leave drreg looking for the annotation
label (possible future changes #5160 or #5161 would eliminate this
special case).

dr_insert_{cbr,ubr,mbr,call}_instrumentation() always set both flags.

drwrap always sets both flags for pre and post callbacks.

Updates uses throughout our tests and samples to use the new flags as
appropriate.

Adds drreg_statelessly_restore_all() for clean call multipath restoration.

Adds a new dedicated test client.drwrap-drreg-test which tests both a
drwrap call and direct clean calls.

Fixes a missing drwrap cache invalidation on module unload that the
new test uncovers.

This likely fixes #4711, as its code was passing the same location for
where_respill as for where_restore in the stateless drreg restoration;
the automated restore here correctly passes the post-instr location.

Issue: #4128, #4711
Fixes #4128
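
As a rough usage sketch (the argument list is illustrative, not the exact tracer.cpp call), a might-skip clean call such as hit_instr_count_threshold would now be inserted with the new flags so drreg can treat it like an app instruction:

```c
dr_insert_clean_call_ex(drcontext, ilist, where,
                        (void *)hit_instr_count_threshold,
                        (dr_cleancall_save_t)(DR_CLEANCALL_READS_APP_CONTEXT |
                                              DR_CLEANCALL_WRITES_APP_CONTEXT |
                                              DR_CLEANCALL_MULTIPATH),
                        1 /* num_args */,
                        OPND_CREATE_INTPTR(instr_get_app_pc(where)));
```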