tool.drcachesim.delay-global and tool.drcacheoff.max-global tests failing: did not hit max #4711
Comments
Happened again, on 2 runs in a row: https://github.com/DynamoRIO/dynamorio/pull/4721/checks?check_run_id=1841880860 |
Unfortunately, I have so far been unable to reproduce this failure manually. For the observed failures, the culprit is probably the x86-related changes in #4677.
I also manually checked the counter updates and whether the threshold is hit at the correct time, and it seems to work correctly. Of course, this only covers runs that don't fail, but at least I confirmed that the threshold is hit after 20K instructions. One option going forward might be to revert the x86-related changes in #4677 and see whether these failures persist. Another is to find a way to reproduce these failures in manual runs to facilitate more debugging. |
The output shows that the tests are hitting the |
I see. Yeah I noticed the problem was hitting the max but I thought maybe some program state (potentially the flags) was corrupted before reaching the threshold or at the time of reaching the threshold leading to a problem later on. That was mostly because AFAIK #4677 was the most recent change that was somewhat related to this issue. In any case, it was good to do some sanity checks to exclude (to some extent) this scenario. |
I have an Ubuntu 16.04 VM with gcc 5.4.0 which is close to the GA CI setup (it has 5.5.0). I built and ran both tests 1000x (loop around individual ctest) and did not reproduce it. |
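A repeat-run loop like the one described above can be sketched as follows. This is only an illustration, not the script actually used: the helper name `count_failures` is hypothetical, and it assumes the tests are invoked through `ctest -R <regex>` from the build directory.

```python
import subprocess


def count_failures(cmd, runs):
    """Run cmd `runs` times and return the non-zero exit codes that were seen."""
    failures = []
    for _ in range(runs):
        # Discard the test output; only the exit status matters for this tally.
        result = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        if result.returncode != 0:
            failures.append(result.returncode)
    return failures
```

Assumed usage from a build directory: `count_failures(["ctest", "-R", "drcachesim.delay-global"], 1000)`. An empty list means no run failed; recording the exit codes rather than a bare count keeps a crash distinguishable from an ordinary test failure.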
Actually, looking more closely at the pasted output from the top comment, delay-global crashed with SIGSEGV! The exit code is 139 which I did not notice at first:
For the offline version, max-global, it's not clear: is it possible it crashed too, and runmulti.cmake doesn't care and looks only at the output? So maybe #4677 is a possible suspect, since this now seems to be an app crash? The most recent case fails both tests on both debug and release, which makes it seem like it's something on a particular type of VM: if the job is scheduled on that type of VM, do they all fail?
The prior ones are almost the same but only have 1 debug failure -- hmm:
|
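For reference, the reading of exit code 139 as SIGSEGV follows the common shell convention of reporting a signal death as 128 plus the signal number. A minimal sketch of that decoding (the helper name `describe_exit_code` is hypothetical, and the convention is assumed to apply to how the test harness reports the code):

```python
import signal


def describe_exit_code(code):
    """Interpret a shell-style exit code; values above 128 mean 128 + signal number."""
    if code > 128:
        try:
            return f"killed by {signal.Signals(code - 128).name}"
        except ValueError:
            # No signal with that number exists on this platform.
            return f"exit code {code} (no matching signal)"
    return f"exited with status {code}"


print(describe_exit_code(139))  # prints "killed by SIGSEGV" (139 - 128 = 11)
```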
Happened again on master: https://github.com/DynamoRIO/dynamorio/runs/1868575670?check_suite_focus=true |
@sapostolakis : One idea is to create two branches, one with #4677 and one without, with each of them disabling all the other tests except these two and running them many times. Then we repeatedly re-run the GA CI tests (could script it through the github REST or whatnot; or just keep pushing empty commits; or even manually clicking several times a day) on those branches and see whether we get zero cases on the one without #4677 and non-zero on the other. |
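The "keep pushing empty commits" option could be scripted roughly as below. This is a sketch under assumptions: the remote is assumed to be named `origin`, the function names are hypothetical, and the branch names passed in are placeholders rather than the branches actually created for this experiment.

```python
import subprocess


def empty_commit_commands(branch, message="Re-trigger CI"):
    """Build the git commands that push one empty commit to `branch`."""
    return [
        ["git", "checkout", branch],
        ["git", "commit", "--allow-empty", "-m", message],
        ["git", "push", "origin", branch],  # assumes the remote is called "origin"
    ]


def retrigger(branches, dry_run=True):
    """Print the commands for each branch and, unless dry_run, execute them."""
    for branch in branches:
        for cmd in empty_commit_commands(branch):
            print(" ".join(cmd))
            if not dry_run:
                subprocess.run(cmd, check=True)
```

Calling `retrigger(["with-change", "without-change"])` (placeholder branch names) only prints the commands; passing `dry_run=False` runs them, with `check=True` stopping at the first git failure.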
Sounds reasonable. I created the two branches. The |
I can see how to manually run the whole ci-x86 workflow for these branches from the Actions window, but I don't know how to selectively run only the two tests of interest. Do we have to create a new workflow? |
I would go in and change the code (just temporarily on this branch of course) to only target those tests.
And how some tests are given that label in suite/tests/CMakeLists.txt:
I would assign a new label to these two tests and have runsuite.cmake target just that label. If using a push to trigger, I would delete all the other workflow files so only the x86 one runs. |
Happened again: all 4 tests again. https://github.com/DynamoRIO/dynamorio/pull/4729/checks?check_run_id=1876776357 |
In PR #4730 we are reverting an x86_64 fix (restoring the arithmetic flags and, if used, the scratch register before a clean call to |
Revert certain x86-related changes from #4677. In particular, we avoid restoring the arithmetic flags and (if used) the scratch register before the call to `hit_instr_count_threshold`, which might not return. This fix is necessary but it seems to be causing failures in `drcachesim.delay-global` and `drcacheoff.max-global` on x86-64 CI testing. These failures are not reproducible outside CI testing and are (so far) inexplicable based on manual analysis of the instrumented assembly code (see #4711 (comment)). Thus, we are leaving x86_64 as technically broken to keep our tests green until the source of instability is found. Issue: #4711
For #4128 we're adding automated restoring of all app values and my plan is to remove the deliberately-not-restored buggy code and see whether the problems return. |
No failure in PR #5164 run. Closing as non-repro. |
We noticed in PR #5164 that the drreg_statelessly_restore_app_value() calls in tracer.cpp were passing the same |
Adds new dr_cleancall_save_t flags which are required for proper interaction between clean calls and drreg: DR_CLEANCALL_READS_APP_CONTEXT must be set for dr_get_mcontext() to obtain the proper values, and DR_CLEANCALL_WRITES_APP_CONTEXT must be set to ensure that dr_set_mcontext() is persistent. DR_CLEANCALL_MULTIPATH must be additionally set for might-skip calls. Adds a clean call insertion event to enable drreg to know about clean calls at the time they are inserted. dr_insert_clean_call_ex() invokes the callback and passes the flags to drreg, which then treats the clean call as an app instruction. For annotations, for now we leave drreg looking for the annotation label (possible future changes #5160 or #5161 would eliminate this special case). dr_insert_{cbr,ubr,mbr,call}_instrumentation() always set both labels. drwrap always sets both labels for pre and post callbacks. Updates uses throughout our tests and samples to use the new flags as appropriate. Adds drreg_statelessly_restore_all() for clean call multipath restoration. Adds a new dedicated test client.drwrap-drreg-test which tests both a drwrap call and direct clean calls. Fixes a missing drwrap cache invalidation on module unload that the new test uncovers. This likely fixes #4711 as its code was passing the same location for the where_respill as where_restore for stateless drreg restoration; the automated restore here correctly passes the post-instr location. Issue: #4128, #4711 Fixes #4128
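To illustrate why passing the same location for where_respill as for where_restore is a problem, here is a toy model in Python. It is not drreg's implementation: the instruction list is just a list of strings, and `stateless_restore` is a hypothetical stand-in that assumes each piece of code is inserted immediately before the location it is given.

```python
def insert_before(ilist, where, new_instr):
    """Insert new_instr immediately before the entry `where` in ilist."""
    ilist.insert(ilist.index(where), new_instr)


def stateless_restore(ilist, where_restore, where_respill):
    """Toy stand-in: emit a restore before where_restore and a respill before where_respill."""
    insert_before(ilist, where_restore, "restore_app_value")
    insert_before(ilist, where_respill, "respill_tool_value")
    return ilist


# Same location for both: the tool value is put back before the clean call runs.
buggy = stateless_restore(
    ["app_instr", "clean_call", "next_instr"], "clean_call", "clean_call"
)
# Post-instruction location for the respill: the app value stays live across the call.
fixed = stateless_restore(
    ["app_instr", "clean_call", "next_instr"], "clean_call", "next_instr"
)

print(buggy)  # ['app_instr', 'restore_app_value', 'respill_tool_value', 'clean_call', 'next_instr']
print(fixed)  # ['app_instr', 'restore_app_value', 'clean_call', 'respill_tool_value', 'next_instr']
```

In the first ordering the restore is undone before the clean call executes, so in this model the call would observe the tool's value rather than the application's.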
Looks like it didn't hit the max:
https://github.com/DynamoRIO/dynamorio/runs/1833435176?check_suite_focus=true
Same thing for offline: