Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix ARM64 floating point registers unwinding in PAL #56919

Merged
merged 4 commits into from
Aug 8, 2021

Conversation

janvorli
Copy link
Member

@janvorli janvorli commented Aug 5, 2021

We were not unwinding the non-volatile floating point registers at all
(not transferring them between the CONTEXT and ucontext_t before and
after the unw_step). That causes crashes on arm64 Unix in some of
the tests since JIT now generates code that uses e.g. the D8 register
and a runtime code that was throwing an exception was using it too.

Close #56522

We were not unwinding the non-volatile floating point registers at all
(not transferring them between the CONTEXT and ucontext_t before and
after the unw_step). That causes crashes on arm64 Unix in some of
the tests since JIT now generates code that uses e.g. the D8 register
and a runtime code that was throwing an exception was using it too.
@janvorli janvorli added this to the 6.0.0 milestone Aug 5, 2021
@janvorli janvorli self-assigned this Aug 5, 2021
@sandreenko
Copy link
Contributor

sandreenko commented Aug 5, 2021

cc @echesakovMSFT @briansull fix for the issue that we were looking at.

@sandreenko
Copy link
Contributor

Could you please delete this as well:

<PropertyGroup Condition="'$(TargetArchitecture)' == 'arm64'">
<!-- https://github.com/dotnet/runtime/issues/56522 -->
<CLRTestBatchPreCommands><![CDATA[
$(CLRTestBatchPreCommands)
set COMPlus_JitDoValueNumber=0
]]></CLRTestBatchPreCommands>
<BashCLRTestPreCommands><![CDATA[
$(BashCLRTestPreCommands)
export COMPlus_JitDoValueNumber=0
]]></BashCLRTestPreCommands>
</PropertyGroup>

it was a workaround for this issue.

@janvorli
Copy link
Member Author

janvorli commented Aug 5, 2021

I haven't noticed that the libunwind on Linux uses V instead of D in the UNW_AARCH64_* constants. Fixing it now.

@janvorli
Copy link
Member Author

janvorli commented Aug 5, 2021

Could you please delete this as well:

Done

@sandreenko
Copy link
Contributor

/azp run runtime-coreclr outerloop

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@davidwrighton
Copy link
Member

How is it that we don't need to unwind the saved floating point registers on other architectures like ARM and X64?

@sandreenko
Copy link
Contributor

sandreenko commented Aug 5, 2021

How is it that we don't need to unwind the saved floating point registers on other architectures like ARM and X64?

The test is failing on arm Linux but only in JitStress that I was not checking while merging (I ran only outerloop) there.
Based on the output it is the same problem there:
https://helixre8s23ayyeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-heads-main-be4915f89d4c4f0a9b/JIT.IL_Conformance/1/console.41ed0c51.log?sv=2019-07-07&se=2021-08-24T05%3A11%3A46Z&sr=c&sp=rl&sig=R7YcMBEMOZeOZEGlbEAGCkhSNfRsqWmyPxCw4H4Y89A%3D

      No exception in DynamicConvertFromSystem.SingleToSystem.UInt640Opconv.ovf.u8
      No exception in DynamicConvertFromSystem.SingleToSystem.UInt640Opconv.ovf.u8.un

but without stress testing we don't allocate d8.

So my guess is that we need the same fix for arm.

@janvorli
Copy link
Member Author

janvorli commented Aug 5, 2021

x64 Unix ABI doesn't have any non-volatile floating point registers. But for arm, it seems we will need the same fix.

@janvorli
Copy link
Member Author

janvorli commented Aug 5, 2021

Interestingly, the Linux arm64 has now many legs failing with this change. I am investigating ...

@briansull
Copy link
Contributor

And what about ARM64 Windows?
Do we save the non-volatile floating point registers, or is this fix only for Linix/PAL

@janvorli
Copy link
Member Author

janvorli commented Aug 5, 2021

And what about ARM64 Windows?

This is need for Unix. On Windows, unwinding is done using the OS.

@briansull
Copy link
Contributor

It is odd that we are just now noticing this, as the JIT will often use the non-volatile floating point registers.
I guess that an exception has to be thrown and caught, for the corruption to occur. Maybe that test case is rare/missing.

@janvorli
Copy link
Member Author

janvorli commented Aug 5, 2021

It is odd that we are just now noticing this

It is also needed to modify a non-volatile FP register in the native code that throws the exception that's propagated to managed code. So I guess the C++ compiler just started doing that at some places.

@janvorli
Copy link
Member Author

janvorli commented Aug 5, 2021

Bummer. Libunwind doesn't support unwinding floating point registers on ARM.

@am11
Copy link
Member

am11 commented Aug 5, 2021

Bummer. Libunwind doesn't support unwinding floating point registers on ARM.

(maybe the future minipal can share mono's unwind implementation so we can get rid of no-gnu libunwind dependency altogether 🙂)

@janvorli
Copy link
Member Author

janvorli commented Aug 6, 2021

@am11 IIRC, the mono unwind library is just very minimal one supporting only what mono generates. We need to be able to unwind whatever C++ compiler decides to generate.

@am11
Copy link
Member

am11 commented Aug 6, 2021

We need to be able to unwind whatever C++ compiler decides to generate.

@janvorli, in this case, would it make more sense to use the unwinder from one of the compiler toolchain? If llvm's libunwind is missing context pointers support, and pinning is an option in GC (as used for macOS and FreeBSD, btw, both of which actually use llvm's libunwind), would it cause a significant degradation in quality? In 2013, I think Apple opensourced their system unwinder to llvm and it became llvm libunwind.

I am asking because nognu libunwind does not have many consumers and lacks proper/decicated maintainership. In Debian, only three (1,2,3 not 3% 🙂) packages depend on nognu libunwind and even they have option to switch to other unwinders. Distro maintainers of nognu libunwind package, as far as I know (Alpine, Gentoo, Debian/Ubuntu) are not really thrilled about the fact it has longstanding unresolved bugs and the fact that it is complicated implementation and tries to do many things. On the other hand, the compiler toolchains dogfood on their unwinder implementation and they keep the implementation up to date.

@janvorli
Copy link
Member Author

janvorli commented Aug 6, 2021

Unfortunately, the LLVM libunwind has one significant deficiency - it doesn't support unw_get_save_loc which is important for our GC performance. When GC needs to update a reference to an object in some call frame in the middle of a stack and the reference is in a non-volatile register, it needs to know where the register was later saved so that it can update the reference in that stack slot. Later on when the code execution returns to that frame, the reference is then restored to the correct updated value.
When we cannot get the save location for a register (like on macOS where we use their libunwind), we cannot relocate any objects that are referenced by non-volatile registers in the whole stack. That makes those effectively pinned, which increases fragmentation of the GC heap and can prevent its shrinking.

@janvorli
Copy link
Member Author

janvorli commented Aug 6, 2021

Actually, I have forgotten to mention that it is related to the cases when we actually unwind using the libunwind, which excludes unwinding from managed code frames. So the problem I have mentioned only occurs on native / managed boundaries.

@janvorli
Copy link
Member Author

janvorli commented Aug 6, 2021

In the past, I was already thinking about adding support for the unw_get_save_loc to the LLVM libunwind and after taking a brief look, it seemed it would not be that complicated. However, I think would add small penalty to all unwind operations, so I am not sure how difficult it would be to get such a change into the upstream libunwind.

As for how to solve the current problem for ARM, I can see several approaches:

  • Disable using the VFP registers in the whole native runtime for Linux ARM (compiler option -mfpu=none or -mfloat-abi=soft).
  • Fix the current libunwind to support unwinding the floating point registers. I have no idea yet how complicated that would be, I assume comparing arm and arm64 stuff in the libunwind would give some hints.
  • Prevent JIT from using non-volatile D8..D15 on Unix ARM. I have no idea how complicated that would be.
  • Migrate to the LLVM libunwind and either add support for unw_get_save_loc or live with the issue around some of the objects being effectively pinned as I've described before.

@janvorli
Copy link
Member Author

janvorli commented Aug 6, 2021

Hmm, looking at our libunwind, there seems to be actually support for floating point unwinding, just the unw_context_t doesn't contain the FP regs. I need to understand it better then. There is a concept of context and cursor in the libunwind. It seems that the context is used for the initialization while the cursor for walking the stack. So there is a chance that the fix would be easy.

@am11
Copy link
Member

am11 commented Aug 6, 2021

just the unw_context_t doesn't contain the FP regs.

A few related commits were not included in v1.5:

Can we take their latest master (libunwind/libunwind@c720133) and try out if it solves problems? There were some fixes for x86 and x64 since the last release as well.

@janvorli
Copy link
Member Author

janvorli commented Aug 6, 2021

I have looked at the main branch there before and the context still didn't have the FP registers for ARM.

@janvorli janvorli force-pushed the fix-arm64-fp-regs-unwinding branch from 758ed86 to 91b7bef Compare August 6, 2021 15:25
Copy link
Member

@davidwrighton davidwrighton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fix looks good now. I've opened an issue to track fixing this on arm32.

@davidwrighton davidwrighton merged commit c7eb3a8 into dotnet:main Aug 8, 2021
@ghost ghost locked as resolved and limited conversation to collaborators Sep 7, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Arm64 codegen bug.
5 participants