[NativeAOT] Linux/ARM bring-up (4/n) #97269
Tagging subscribers to this area: @agocke, @MichalStrehovsky, @jkotas
Issue Details
Implement a large part of the thread hijacking; fixes tests with recursive generics.
...ILCompiler.Compiler/Compiler/DependencyAnalysis/Target_ARM/ARMReadyToRunGenericHelperNode.cs
I managed to get a stack trace from the last failing smoke test:
It is flaky though, and it may be failing with other errors at different times. I saw this assert multiple times though, with the
Just had another failed test run, this time two tests failed:
I have seen this before. The
The failures look like GC holes. It can be because of hijacking/stackwalking, or because of something else. I am not sure the write barriers are up to date, for example.
For the write barriers, you may want to try disabling the following - at least to see if that accounts for all the remaining issues.
It looks like the ARM32 write barriers do not have support for that. I am not sure if that works on 32-bit at all.
Thanks for the debugging tips. I didn't do enough runs to verify or confirm it, but returning (I intend to fix the DWARF info so that it does not contain the Thumb bit in the addresses and aligns correctly with the instruction boundaries, but that requires fixing up the managed stack trace logic to match.)
I finally managed to catch the
Let's start with the back trace:
The instruction on frame 1 is
The faulting address is
There's one more weird thing. The CPU seems not to be in Thumb mode (could be a debugger quirk, though):
...and, for posterity, the thread did receive the activation signal for hijack just before the crash.
I captured a few more of the crashes. It seems they don't happen with
Multiple runs on qemu-arm-user and Raspberry Pi with QEMU:
Raspberry Pi 4:
More re-runs still showed some crashes, so this is not conclusive.
…. Save FP return values on hijack.
…erence (eg. boxed int)
I managed to find the GC hole. The fix is committed; rebuilding and rerunning tests now.
The tests pass on Raspberry Pi now:
Postmortem: The main bug was that thread hijacking failed to report the method return value in the R0 register as a GC reference. It was reproducible with the TLS test in DynamicGenerics. I enabled
The second bug is that the Thumb bit was often not masked out of the IP during stack walks. The LR values (or custom frames) saved on the stack have the Thumb bit set. It needs to be cleared before doing the GC enumeration to ensure that subtracting 1 here actually gets to the previous instruction.
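To illustrate the second bug, here is a minimal C sketch of why the Thumb bit has to be cleared before subtracting 1 from a saved return address. It is not the runtime's actual code: the helper names (`strip_thumb_bit`, `call_site_lookup_addr`, `call_site_lookup_addr_unmasked`) and the example addresses are hypothetical, and the only assumption taken from the discussion is that saved LR values carry the Thumb bit in bit 0.

```c
#include <stdint.h>

/* On ARM32, bit 0 of a code address selects Thumb state on a branch;
   it is not part of the instruction's actual location. */
#define THUMB_BIT ((uintptr_t)1)

/* Hypothetical helper: clear the Thumb bit from a saved return address. */
static uintptr_t strip_thumb_bit(uintptr_t code_addr)
{
    return code_addr & ~THUMB_BIT;
}

/* Hypothetical helper: address used to look up GC info for a call site.
   The return address points just past the call instruction, so after
   masking, subtracting 1 lands inside the call instruction itself. */
static uintptr_t call_site_lookup_addr(uintptr_t saved_lr)
{
    return strip_thumb_bit(saved_lr) - 1;
}

/* Buggy variant for comparison: without masking, subtracting 1 only
   removes the Thumb bit, so the result is still the return address,
   i.e. the start of the instruction after the call. */
static uintptr_t call_site_lookup_addr_unmasked(uintptr_t saved_lr)
{
    return saved_lr - 1;
}
```

For a saved LR of `0x1001` (a Thumb return to `0x1000`), the masked helper yields `0x0FFF`, which falls inside the preceding call instruction, while the unmasked variant yields `0x1000`, which belongs to the next instruction and may map to different GC liveness information.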
Looks great to me. Thank you!
@VSadov Could you please sign-off on this change as well?
Very nice! Thank you!!
A couple of things that do not need to go in this change (perhaps better logged as issues to follow up):
Maybe it gets disabled somehow anyways, but it seems better to be explicit about it.
I think there could be some subtle NYIs around that, for example:
Contributes to dotnet/runtimelab#833
Implements thread hijacking, fixes tests with recursive generics.