Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[NativeAOT] System.Runtime.Tests occasionally crashes on linux-arm-NativeAOT #99079

Closed
Tracked by #97729
VSadov opened this issue Feb 28, 2024 · 23 comments
Closed
Tracked by #97729

Comments

@VSadov
Copy link
Member

VSadov commented Feb 28, 2024

The crash does not happen often - just once in a few CI runs.
So far I was unable to get a dump or any other info for the failure. I'd need to set up a local repro.

For example seen in #99031 :

https://helixre8s23ayyeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-99031-merge-ad45d5f9cac945e0af/System.Runtime.Tests/1/console.a6a20f16.log?helixlogtype=result

@ghost ghost added the untriaged New issue has not been triaged by the area owner label Feb 28, 2024
@ghost
Copy link

ghost commented Feb 28, 2024

Tagging subscribers to this area: @agocke, @MichalStrehovsky, @jkotas
See info in area-owners.md if you want to be subscribed.

Issue Details

The crash does not happen often - just once in a few CI runs.
So far I was unable to get a dump or any other info for the failure. I'd need to set up a local repro.

For example seen in #99031 :

https://helixre8s23ayyeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-99031-merge-ad45d5f9cac945e0af/System.Runtime.Tests/1/console.a6a20f16.log?helixlogtype=result

Author: VSadov
Assignees: -
Labels:

area-NativeAOT-coreclr

Milestone: -

@filipnavara
Copy link
Member

Preliminary notes:

@VSadov
Copy link
Member Author

VSadov commented Feb 28, 2024

We may be missing some of the Align8 fixes done

I once saw a failure at this stack:

[SKIP] System.Collections.Concurrent.Tests.ConcurrentBagTests.DebuggerTypeProxy_Ctor_NullArgument_Throws
Process terminated. Access Violation: Attempted to read or write protected memory. This is often an indication that other memory is corrupt. The application will be terminated since this platform does not support throwing an AccessViolationException.
   at System.RuntimeExceptionHelpers.FailFast(String, Exception, String, RhFailFastReason, IntPtr, IntPtr) + 0x161
   at System.RuntimeExceptionHelpers.GetRuntimeException(ExceptionIDs) + 0x2a1
   at System.Runtime.EH.GetClasslibException(ExceptionIDs, IntPtr) + 0x27
   at System.Runtime.EH.RhThrowHwEx(UInt32, EH.ExInfo&) + 0x8d
   at System.Threading.PortableThreadPool.HillClimbing.GetWaveComponent(Double[], Int32, Double) + 0x99
   at System.Threading.PortableThreadPool.HillClimbing.Update(Int32, Double, Int32) + 0x2fb
   at System.Threading.PortableThreadPool.AdjustMaxWorkersActive() + 0x133
   at System.Threading.ThreadPoolWorkQueue.Dispatch() + 0x335
   at System.Threading.PortableThreadPool.WorkerThread.WorkerThreadStart() + 0x12b
   at System.Threading.Thread.StartThread(IntPtr) + 0xd5
   at System.Threading.Thread.ThreadEntryPoint(IntPtr) + 0xf
./RunTests.sh: line 179:    86 Aborted                 (core dumped) ./System.Collections.Concurrent.Tests -notrait category=IgnoreForCI -notrait category=OuterLoop -notrait category=failing -xml testResults.xml $RSP_FILE

Not necessarily because of double[] alignment, but there are some double[] involved.

I kind of think if this system was configured to fail on misaligned accesses, it would fail in more tests. Also this looks like an AV, not alignment problem, so maybe it was something else.

@EgorBo
Copy link
Member

EgorBo commented Feb 28, 2024

Speaking of align8 - does NAOT respect align8 for e.g.

static readonly double[] arr = new double[100];

during preinitialization?

@VSadov
Copy link
Member Author

VSadov commented Feb 28, 2024

Speaking of align8 - does NAOT respect align8 for e.g.

static readonly double[] arr = new double[100];
during preinitialization?

Last time I checked RequiresAlign8Flag was not set for anything so types that want 8-align could not report it. So I doubt it.

It is not a lot of changes to enable this, but it would be "a bit here, a bit there."
Respecting 8-align is basically an ARM32 feature, so it was not implemented and would not be testable, even if implemented until there is ARM32 support.

@VSadov
Copy link
Member Author

VSadov commented Feb 29, 2024

                // Currently we don't support frozen objects with special alignment requirements
                // TODO: We should also give up on arrays of doubles on 32-bit platforms.
                // (we currently never allocate them on frozen segments)
#if FEATURE_64BIT_ALIGNMENT
                if (type->RequiresAlign8)
                {
                    // Align8 objects are not supported yet
                    return null;
                }
#endif

So, nope.

@EgorBo
Copy link
Member

EgorBo commented Feb 29, 2024

So, nope.

The code you quoted is used to allocate objects on nongc in run-time, it's different from preinitialization that also allocates on nongc.

@EgorBo
Copy link
Member

EgorBo commented Feb 29, 2024

I kind of think if this system was configured to fail on misaligned accesses, it would fail in more tests. Also this looks like an AV, not alignment problem, so maybe it was something else.

Afair, some floating-point related loads/store (as well as atomicity-related) are not fixable by OS and may fail

@VSadov
Copy link
Member Author

VSadov commented Feb 29, 2024

Right, that is not even in NAOT. Sorry.

Generally aligning frozen arrays should be doable. Just do what GC does - overallocate space and pad in front with empty objects (dummy byte[] would work too). I do not see any code attempting to do that.

@VSadov
Copy link
Member Author

VSadov commented Feb 29, 2024

Afair, some floating-point related loads/store (as well as atomicity-related) are not fixable by OS and may fail

That what I suspected too. Perhaps OS was configured to handle misaligned accesses, but could not handle all the cases. I am not very familiar with how robust the emulation is. I'd assume it is robust, since it is in OS, but would not be too surprised if there are limitations.

@VSadov
Copy link
Member Author

VSadov commented Feb 29, 2024

Generally aligning frozen arrays should be doable. Just do what GC does - overallocate space and pad in front with empty objects (dummy byte[] would work too). I do not see any code attempting to do that.

Actually, since it is not in the real heap, I am not sure if it needs to be parseable/walkable, so maybe just leaving gaps in front will work too. One way to find out. :-)

@EgorBo
Copy link
Member

EgorBo commented Feb 29, 2024

Actually, since it is not in the real heap, I am not sure if it needs to be parseable/walkable,

In CoreCLR it's still walkable in two cases:

  1. Profiler API, I presume those aren't (yet?) supported by NAOT
  2. There is an edge case in GC (Segment-mode) when if nongc segments somehow sneak into "in-range" GC will walk (and do mark phase) them too

@EgorBo
Copy link
Member

EgorBo commented Feb 29, 2024

So here is a test we can try:

using System.Runtime.CompilerServices;

public class Prog
{
    static void Main()
    {
        Test(doubles);
    }

    static readonly double[] doubles = new double[100];

    [MethodImpl(MethodImplOptions.NoInlining)]
    static double Test(double[] d)
    {
        return d[0];
    }
}

I can't check real codegen for NAOT-arm32 (perhaps, @filipnavara can check?) but on CoreCLR AltJIT I am seeing:

; Assembly listing for method Prog:Test(double[]):double (FullOpts)
G_M31521_IG01:  ;; offset=0x0000
000000      push    {r11,lr}
000004      mov     r11, sp
G_M31521_IG02:  ;; offset=0x0006
000006      movs    r3, 0
000008      ldr     r2, [r0+0x04]
00000A      cmp     r3, r2
00000C      bhs     SHORT G_M31521_IG04
00000E      vldr    d0, [r0+0x08]   ;;;;;;; <--------------------
G_M31521_IG03:  ;; offset=0x0012
000012      pop     {r11,pc}
G_M31521_IG04:  ;; offset=0x0016
000016      movw    r3, 0x7bd0
00001A      movt    r3, 0x8248
00001E      blx     r3		// CORINFO_HELP_RNGCHKFAIL
000020      bkpt    
; Total bytes of code 34

And that vldr is not fixable, e.g. see raspberrypi/linux#3099 (comment)

@VSadov
Copy link
Member Author

VSadov commented Feb 29, 2024

and do mark phase

That is mildly disturbing.
Yes, if there is marking, especially of byrefs, it will need to walk at least to figure where the containing object starts.

@MichalPetryka
Copy link
Contributor

MichalPetryka commented Feb 29, 2024

Wouldn't Align8 be also needed for 8B Exchange/CompareExchange on 32 bit platforms?
EDIT: ldrexd/strexd on ARM32 do require it so it's probably broken.

@filipnavara
Copy link
Member

FWIW ARMv7+ handles unaligned accesses in hardware with few notable exceptions:

  • Floating point memory loads/stores (vldr mentioned above)
  • 8-byte atomics
  • Double word load/store (not really generated by RyuJIT iirc)

Several of the test crashes point to misalignment of the double on memory load.

@MichalStrehovsky
Copy link
Member

and do mark phase

That is mildly disturbing. Yes, if there is marking, especially of byrefs, it will need to walk at least to figure where the containing object starts.

I'll simply block these in #99104. It's not clear if this is the problem for these tests, but it's necessary for correctness anyway.

@jkotas
Copy link
Member

jkotas commented Feb 29, 2024

https://helixre8s23ayyeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-99031-merge-ad45d5f9cac945e0af/System.Runtime.Tests/1/console.a6a20f16.log?helixlogtype=result

This collected a dump. You can get it by running:

dotnet tool install --global runfo
runfo get-helix-payload -j ad45d5f9-cac9-45e0-af17-19acbf27c4bf -w System.Runtime.Tests -o c:\helix_payload\System.Runtime.Tests

The stacktrace of the crash is:

libc_2_31+0x177e6
System_Runtime!System.Globalization.CalendricalCalculationsHelper::Compute+0x1e [/_/src/libraries/System.Private.CoreLib/src/System/Globalization/CalendricalCalculationsHelper.cs @ 354]
System_Runtime!System.Globalization.CalendricalCalculationsHelper::PersianNewYearOnOrBefore+0x108 [/_/src/libraries/System.Private.CoreLib/src/System/Globalization/CalendricalCalculationsHelper.cs @ 384]
System_Runtime!System.Globalization.PersianCalendar::GetAbsoluteDatePersian+0x6a [/_/src/libraries/System.Private.CoreLib/src/System/Globalization/PersianCalendar.cs @ 70]
System_Runtime!System.Globalization.PersianCalendar::ToDateTime+0x34 [/_/src/libraries/System.Private.CoreLib/src/System/Globalization/PersianCalendar.cs @ 382]
...

Looks like it crashed somewhere in Math.Sin.

@filipnavara
Copy link
Member

Looks like it crashed somewhere in Math.Sin.

That's the issue tracked in #98795. Dump is useful.

@filipnavara
Copy link
Member

I got the payload from ad45d5f9-cac9-45e0-af17-19acbf27c4bf mentioned above and run it couple of times on my Raspberry Pi. It crashes quite often. Turns out that the linker produced the long thunks inside the managed code section and there's no unwinding information available for them:

      Address: System.Runtime.Tests[0x014990f8] (System.Runtime.Tests.PT_LOAD[1].__managedcode + 56)
      Summary: System.Runtime.Tests`__ThumbV7PILongThunk_log + 8
       Module: file = "/home/navara/srt/workitems/System.Runtime.Tests/System.Runtime.Tests", arch = "arm"
       Symbol: id = {0x000003ed}, range = [0x019f90f0-0x019f90fc), name="__ThumbV7PILongThunk_log"

I'll need to figure out which linker [version] did that. In my experiments the thunks were generated in a separate section and LLD source code shows that was the intended behavior.

@filipnavara
Copy link
Member

filipnavara commented Mar 3, 2024

While analyzing this I found one pattern that we don't recognize in IsInProlog correctly:

System.Runtime.Tests`System.SpanHelpers__NonPackedIndexOfValueType<Int32__System.SpanHelpers_Negate`1<Int32>>
    0x4150a30 <+0>:    push.w {r4, r5, r6, r10, r11, lr}
    0x4150a34 <+3>:    add.w  r11, sp, #0x10     <--------------
    0x4150a38 <+7>:    sub.w  sp, sp, #0x778

Furthermore, some of sub sp, X and add sp, X patterns are incorrectly recognized as prolog when appearing in the middle of a method with frame register (r9).

@filipnavara
Copy link
Member

I think this it resolved now.

@MichalStrehovsky
Copy link
Member

Yes, i don't see crashes in the arm legs in the past week. Thank you Filip!

@dotnet-policy-service dotnet-policy-service bot removed the untriaged New issue has not been triaged by the area owner label Mar 25, 2024
@github-actions github-actions bot locked and limited conversation to collaborators Apr 25, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
Archived in project
Development

No branches or pull requests

6 participants