Multi-threading failure (HANG) on AArch64. ASSERT in utils.c: !lock->owner #3956
Comments
So a thread acquires a lock and its #2502 is a known race issue which affects both AArch{32,64} AFAIK but would likely result in different symptoms than this. |
IIRC |
@AssadHashmi were you able to determine any further clues? We are also seeing this same assert in debug builds, and in release builds we see crashes and hangs that look like they may come from races/bugs in DR's lock implementation, on ThunderX2. |
No further clues but I will re-visit this and post an update. |
A first pass review shows that the atomic macros in arch_exports.h for AArch64 need to be changed to meet the release-acquire memory model that DR is assuming there (xref my recent changes to non-mutex atomics which also want release-acquire, such as 35cbfc4#diff-e74a10e858f5f2dcd45c8afe825e8b26R642). Basically, |
@derekbruening I have a small test case (linked to the Arm Performance Libraries and built with the Arm compiler), which crashes after around 200 runs. If you have a patch, I can try it on my build. I'm on the latest head. |
Replaces ldxr..stxr with ldaxr..stlxr in the atomic sequences used to implement DR's mutexes. DR's mutexes, and its other atomic operations, assume release-acquire memory ordering. Issue: #3956
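For readers less familiar with the ordering requirement described in that commit message, here is a minimal sketch in portable C11 atomics rather than DR's actual macros. The names sketch_lock_t, sketch_lock, sketch_unlock, and sketch_run are hypothetical and exist only for illustration. On AArch64, compilers typically lower the acquire and release operations below to instructions such as ldaxr and stlr, which is roughly the property the ldaxr..stlxr change is after.

```c
#include <pthread.h>
#include <stdatomic.h>

/* Hypothetical spinlock for illustration; this is not DynamoRIO's mutex type. */
typedef struct {
    atomic_int state; /* 0 = free, 1 = held */
} sketch_lock_t;

static void sketch_lock(sketch_lock_t *l)
{
    int expected = 0;
    /* Acquire on success: accesses in the critical section cannot be
     * reordered before the point where the lock is taken. */
    while (!atomic_compare_exchange_weak_explicit(&l->state, &expected, 1,
                                                  memory_order_acquire,
                                                  memory_order_relaxed)) {
        expected = 0; /* a failed CAS overwrites expected with the observed value */
    }
}

static void sketch_unlock(sketch_lock_t *l)
{
    /* Release: writes made in the critical section become visible before
     * another thread can observe the lock as free. */
    atomic_store_explicit(&l->state, 0, memory_order_release);
}

enum { SKETCH_THREADS = 4, SKETCH_ITERS = 100000 };

static sketch_lock_t counter_lock;
static long counter; /* plain, non-atomic data protected by counter_lock */

static void *sketch_worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < SKETCH_ITERS; i++) {
        sketch_lock(&counter_lock);
        counter++;
        sketch_unlock(&counter_lock);
    }
    return NULL;
}

/* Runs the workers to completion and returns the final counter value. */
static long sketch_run(void)
{
    pthread_t threads[SKETCH_THREADS];
    counter = 0;
    atomic_store(&counter_lock.state, 0);
    for (int i = 0; i < SKETCH_THREADS; i++)
        pthread_create(&threads[i], NULL, sketch_worker, NULL);
    for (int i = 0; i < SKETCH_THREADS; i++)
        pthread_join(threads[i], NULL);
    return counter;
}
```

If both orderings were weakened to memory_order_relaxed, a weakly ordered machine would be free to move the counter update outside the locked region, which is the general class of failure this issue is about.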
I just pushed my simple Adding a link to the branch: https://github.com/DynamoRIO/dynamorio/tree/i3956-arm-locks |
I went ahead and made it into a PR for simple diff browsing and commenting on the changes: #4254. |
A partial update on #4254's effect on our app: it does seem to be better. Running it with a small-ish thread pool on HEAD, it succeeded only 2 out of 10 times, with the rest being various crashes ("Cannot correctly handle received signal 11", internal DR crash, occasionally app crash). With #4254, 10 out of 10 succeed. However, upping the thread pool to a normal level it seems to be hanging. Let me evaluate further in debug build to ensure the assert in the title is gone. Probably there are multiple underlying issues (certainly there are several debug asserts/warnings I will file soon). |
Without the #4254 patch, my test case crashes after 161, 529, 663, 195 and 230 runs. With the patch it gets to 1000 without crashing. I will do some more, longer runs over the weekend, maybe with different OpenMP and build flags. |
For extra info, without the patch, the test case crashed in different ways:
|
So that matches what I'm seeing on the small version: those same types of crashes w/o the fix, and success with the fix. I'm seeing hangs when I scale up the app, and with drcachesim debug build I see some other asserts and warnings and app crashes, but they are presumably separate issues. It is looking like #4254 is worth putting in and that it does indeed fix the DR lock bugs this issue covers. I'll file separate issues on the other problems I'm seeing: unfortunately there are quite a few, and some are blocking running this app reliably, but fixing this is a good step. |
Adds a "dmb ish" barrier prior to the "ldrex..strex" loops in the atomics used to implement mutexes and other operations on 32-bit ARM where we need release-acquire semantics. Issue: #3956
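The idea of putting a barrier ahead of an otherwise unordered atomic update can be sketched with C11 fences. This is an illustration under stated assumptions, not DR's 32-bit ARM code: payload, ready, fence_producer, fence_consume, and fence_demo are hypothetical names, and the claim that a release fence typically compiles to "dmb ish" on ARMv7 reflects common compiler behavior rather than anything stated in this thread.

```c
#include <pthread.h>
#include <stdatomic.h>

static int payload;      /* plain data published by the producer */
static atomic_int ready; /* hypothetical flag: 0 = not yet published */

static void *fence_producer(void *arg)
{
    (void)arg;
    payload = 42;
    /* Barrier placed before the atomic update, analogous to putting a
     * barrier ahead of the ldrex..strex loop. */
    atomic_thread_fence(memory_order_release);
    atomic_exchange_explicit(&ready, 1, memory_order_relaxed);
    return NULL;
}

static int fence_consume(void)
{
    /* Spin with a relaxed load until the flag is set. */
    while (atomic_load_explicit(&ready, memory_order_relaxed) == 0)
        ;
    /* Acquire fence pairs with the producer's release fence, so the
     * payload write is guaranteed to be visible here. */
    atomic_thread_fence(memory_order_acquire);
    return payload;
}

/* Starts one producer thread and returns the value the consumer observed. */
static int fence_demo(void)
{
    pthread_t t;
    payload = 0;
    atomic_store(&ready, 0);
    pthread_create(&t, NULL, fence_producer, NULL);
    int seen = fence_consume();
    pthread_join(t, NULL);
    return seen;
}
```

Without the two fences, the relaxed flag update alone would not order the payload write, and the consumer could legally observe the flag set while still reading a stale payload.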
Weekend test runs show that my small test case always passes. I will apply #4254 and run the full applications test suite to see what happens. Results may take a couple of days to come through. |
We're seeing intermittent hangs with an AArch64 guest binary compiled with OpenMP. All the indications are that it is a multi-threading bug in DR. The hang happens on a DR release build. With the DEBUG build, the following assert fires and exits so doesn't get as far as hanging:
Which happens in:
The guest binary is built with armclang and linked to the Arm Performance Libraries on RHEL7.5:
armclang -fopenmp -armpl=lp64,parallel test_case.c -o test_case.exe
It fails without clients:
drrun ./test_case.exe
These also appear during the -debug run:
It takes between 3 and 60 runs to get the assert to fire and only seems to fail on ThunderX2 machines.
Running with -loglevel 3 gives the following thread statistics:
Does anything look unusual?
There's lots of other thread related tracing in the logs but I don't know what to look for.
Clearly ownable and lock->owner are contradicting each other.
Where and when could that be happening?
Thanks