ARM64-SVE: Add SVE registers to pal context #103801

Merged on Jun 29, 2024 (29 commits)

Conversation

a74nh
Contributor

@a74nh a74nh commented Jun 21, 2024

Adds Linux support for SVE state on signals.

Testing:
I forced a SIGILL (by making one of the hwintrinsic API calls generate a bad instruction), checked the SVE registers when the signal occurred, and stepped through to make sure the lpContext is filled in correctly.
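As a rough illustration of that kind of check, here is a minimal sketch of installing a SIGILL handler that receives the machine context. It is not the PR's test: `raise()` stands in for the deliberately bad instruction, and the names `on_sigill`, `probe_sigill_delivery`, `g_sigill_seen` and `g_had_ucontext` are hypothetical.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <signal.h>
#include <string.h>

/* Set by the handler so the caller can tell the signal was delivered. */
static volatile sig_atomic_t g_sigill_seen = 0;
/* Records whether the handler was given a non-null machine context. */
static volatile sig_atomic_t g_had_ucontext = 0;

static void on_sigill(int signo, siginfo_t *info, void *ucontext)
{
    (void)info;
    if (signo == SIGILL) {
        g_sigill_seen = 1;
        /* With SA_SIGINFO the third argument points at the interrupted
           register state; a runtime would translate it into its own
           context structure at this point. */
        g_had_ucontext = (ucontext != NULL);
    }
}

/* Installs the handler, raises SIGILL once, restores the old handler.
   Returns 1 if the handler ran, 0 if it did not, -1 on setup failure. */
static int probe_sigill_delivery(void)
{
    struct sigaction sa, old;

    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = on_sigill;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGILL, &sa, &old) != 0)
        return -1;

    g_sigill_seen = 0;
    raise(SIGILL);               /* stand-in for executing a bad instruction */
    sigaction(SIGILL, &old, NULL);
    return g_sigill_seen ? 1 : 0;
}
```

In a debugger, a breakpoint inside the handler is where the SVE registers and the translated context could be compared.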

@dotnet-policy-service dotnet-policy-service bot added the community-contribution Indicates that the PR has been added by a community member label Jun 21, 2024
//
// Sve Registers
//
//TODO-SVE: How does this structure handle variable sized Z/P/FFR registers?
Contributor Author

AIUI, this should match the same structure in Windows. I don't have any documentation, so I've made a guess at what the fields should be for SVE, and I expect that it's wrong.
For convenience I've only used a vector length of 128 bits. I'd be surprised if Windows supports a full 2048-bit vector length without doing anything special.
(I'll fix the offsets below marked with a ? once the structure is correct.)

Member

Windows in general doesn't store extended context parts in the CONTEXT data structure itself. There is a flag, CONTEXT_XSTATE, that indicates the presence of extra data attached to the CONTEXT. The InitializeContext and InitializeContext2 APIs allow setting up a context for the extended state, and can also be used to get the size of the memory needed for the extended context. InitializeContext2 is the newer one; it allows selecting only a subset of the extended state using the XStateCompactionMask argument.

We have done this differently for AVX512 for the sake of simplicity: we have included the extra registers in the CONTEXT structure itself. I think it would be better to move to the way Windows handles it, so that we don't waste time initializing and copying extra fields in places where we don't care about the extended state or when the current CPU doesn't support them. That would also allow sizing the storage for the Z/P registers dynamically based on the current CPU.
Having said that, for this PR we can follow suit and do the same thing we did for Intel AVX512, and migrate both to the better model later. Based on what @kunalspathak told me, starting with 128 bits of space for the registers should be sufficient for now.

I would add them to the very end of the CONTEXT after the debug registers so that the layout of the part that's common with Windows is the same.
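To make the cost of fixed versus dynamic sizing concrete, here is a small sketch that computes how many bytes the SVE register file needs at a given vector length. It assumes the architectural layout of 32 Z registers of VL bytes, 16 P registers of VL/8 bytes and one FFR of VL/8 bytes, packed with no padding. The name `demo_sve_state_bytes` is hypothetical, and this is not how the PR or Windows lays out its storage.

```c
#include <assert.h>
#include <stddef.h>

enum {
    DEMO_SVE_NUM_ZREGS = 32,  /* Z0-Z31, each VL bytes wide */
    DEMO_SVE_NUM_PREGS = 16   /* P0-P15, each VL/8 bytes wide */
};

/* Returns the packed size in bytes of Z, P and FFR state for a vector
   length given in bytes, or 0 if the length is not a valid SVE length
   (a multiple of 128 bits between 128 and 2048 bits). */
static size_t demo_sve_state_bytes(size_t vl_bytes)
{
    if (vl_bytes < 16 || vl_bytes > 256 || vl_bytes % 16 != 0)
        return 0;

    size_t pred_bytes = vl_bytes / 8;  /* one predicate bit per vector byte */
    return DEMO_SVE_NUM_ZREGS * vl_bytes   /* Z registers */
         + DEMO_SVE_NUM_PREGS * pred_bytes /* P registers */
         + pred_bytes;                     /* FFR */
}
```

Under these assumptions a 128-bit implementation needs 546 bytes while a 2048-bit one needs 8736, which is the gap a dynamically sized buffer would avoid paying on every context copy.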

Contributor

Yep, for the OS, extended context such as SVE and AVX is stored in a variable-sized buffer separate from the CONTEXT. The CONTEXT_EX structure immediately follows the CONTEXT structure, and contains pointers to the variable-sized XSTATE buffer. On x64, the XSTATE buffer is in the exact format that is supported by the hardware via the XSAVE and XRSTOR instructions. On ARM64, there are no XSAVE/XRSTOR instructions, but the XSTATE buffer is laid out in a similar fashion to x64 (including Header->Mask, Header->CompactionMask etc.), to allow for maximum code sharing with x64.
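A simplified sketch of that arrangement follows. The `DemoContextChunk` and `DemoContextEx` types are stand-ins modeled, to my understanding, on the CONTEXT_CHUNK and CONTEXT_EX definitions in winnt.h, where each region is described by an offset relative to the CONTEXT_EX itself plus a length. The frame sizes (64 and 32 bytes) and all `demo_` names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    int32_t  Offset;  /* relative to the start of the descriptor; may be negative */
    uint32_t Length;  /* size of the region in bytes */
} DemoContextChunk;

typedef struct {
    DemoContextChunk All;     /* the whole extended context */
    DemoContextChunk Legacy;  /* the fixed-size CONTEXT that precedes the descriptor */
    DemoContextChunk XState;  /* the variable-sized extended state that follows it */
} DemoContextEx;

/* One allocation: fixed context, then the descriptor, then the XSTATE bytes. */
typedef struct {
    unsigned char legacy[64];
    DemoContextEx ex;
    unsigned char xstate[32];
} DemoFrame;

/* Returns the XSTATE buffer, or NULL when no extended state is attached. */
static void *demo_locate_xstate(DemoContextEx *ex, uint32_t *length_out)
{
    assert(ex != NULL);
    if (ex->XState.Length == 0)
        return NULL;
    if (length_out != NULL)
        *length_out = ex->XState.Length;
    return (unsigned char *)ex + ex->XState.Offset;
}

/* Returns the fixed-size context that sits in front of the descriptor. */
static void *demo_locate_legacy(DemoContextEx *ex)
{
    assert(ex != NULL);
    return (unsigned char *)ex + ex->Legacy.Offset;
}

/* Fills in the descriptor for a DemoFrame and checks both lookups. */
static int demo_context_ex_roundtrip(void)
{
    DemoFrame frame = {0};
    uint32_t len = 0;

    frame.ex.Legacy.Offset = -(int32_t)offsetof(DemoFrame, ex);
    frame.ex.Legacy.Length = (uint32_t)sizeof(frame.legacy);
    frame.ex.XState.Offset =
        (int32_t)(offsetof(DemoFrame, xstate) - offsetof(DemoFrame, ex));
    frame.ex.XState.Length = (uint32_t)sizeof(frame.xstate);
    frame.ex.All.Offset = frame.ex.Legacy.Offset;
    frame.ex.All.Length = (uint32_t)sizeof(frame);

    return demo_locate_legacy(&frame.ex) == (void *)frame.legacy
        && demo_locate_xstate(&frame.ex, &len) == (void *)frame.xstate
        && len == sizeof(frame.xstate);
}
```

Because the descriptor only records offsets and lengths, the XSTATE region can grow with the vector length without changing the fixed part of the layout.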

Contributor

The OS kernel does support all SVE vector lengths, up to 2048-bit SVE, with the caveat that Hyper-V only supports 128-bit SVE. So when running on hardware that supports SVE wider than 128 bits, you'll only see 128-bit SVE if Hyper-V is enabled; with Hyper-V off you'll be able to take advantage of the full SVE width supported by the CPU. To my understanding, future hardware is likely to support SVE lengths larger than 128 bits, though I don't know any specifics on timelines.

Contributor Author

Thanks for the comments. Updated with the following:

  • Added context2.S changes
  • Removed store/restore of Z registers (as we only support 128 bits for now, which fully overlap the V registers)
  • Added an XStateFeaturesMask to the Arm64 context so that we can tell whether to use SVE or not.

I'm currently unsure where else SVE state might need saving/restoring

@a74nh
Contributor Author

a74nh commented Jun 21, 2024

Things I'm unsure about:

  • What the LPCONTEXT should look like
  • If there are extra areas in coreclr that need covering
    • I think AOT needs covering too. But SVE is not yet supported in AOT
    • Do the SVE registers need propagating anywhere else?
  • Is there a testsuite for this?

@a74nh a74nh marked this pull request as ready for review June 21, 2024 13:22
@a74nh
Contributor Author

a74nh commented Jun 21, 2024

Build failures on Windows, but I expected that as that all still needs doing. Marking as ready as I could do with comments, especially on the Windows side.

@dotnet/arm64-contrib @kunalspathak @tannergooding

@kunalspathak kunalspathak requested a review from janvorli June 21, 2024 14:06
@kunalspathak kunalspathak added the arm-sve Work related to arm64 SVE/SVE2 support label Jun 21, 2024
@kunalspathak
Member

@JasonLinMS

src/coreclr/pal/inc/pal.h (outdated, resolved)
src/coreclr/pal/src/arch/arm64/context2.S (outdated, resolved)
@@ -718,6 +720,41 @@ void CONTEXTToNativeContext(CONST CONTEXT *lpContext, native_context_t *native)
*(NEON128*) &fp->vregs[i] = lpContext->V[i];
}
}

if (sve)
Member

@janvorli janvorli Jun 27, 2024

Similar to x64, we should copy the state only if contextFlags has the CONTEXT_XSTATE flag set. The passed-in contextFlags lists the parts of the state that are valid and that the caller is interested in.

It seems it would make sense to move this to the end of the function next to where we extract xstate for amd64 and put it under the same if ((contextFlags & CONTEXT_XSTATE) == CONTEXT_XSTATE).
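The check being asked for can be sketched as below. The constant values are placeholders chosen for illustration (the real definitions live in pal.h and may differ), and `demo_should_copy_sve` is a hypothetical helper, not a function from the PR.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Placeholder values for illustration only; see pal.h for the real ones. */
#define DEMO_CONTEXT_ARCH     0x00400000u              /* architecture tag bit */
#define DEMO_CONTEXT_XSTATE   (DEMO_CONTEXT_ARCH | 0x40u)
#define DEMO_XSTATE_MASK_SVE  (UINT64_C(1) << 0)       /* hypothetical SVE feature bit */

/* Decides whether SVE state should be copied between contexts. */
static bool demo_should_copy_sve(uint32_t context_flags,
                                 uint64_t xstate_features_mask)
{
    /* Compare against the whole constant. It includes the architecture
       bit, so a plain "!= 0" test would match every context of this
       architecture, whether or not extended state was requested. */
    if ((context_flags & DEMO_CONTEXT_XSTATE) != DEMO_CONTEXT_XSTATE)
        return false;

    /* Only once the caller has asked for extended state is the
       features mask meaningful. */
    return (xstate_features_mask & DEMO_XSTATE_MASK_SVE) != 0;
}
```

Checking the flag before the mask means an uninitialized or stale features mask is never consulted for a context that did not request extended state.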

Contributor Author

> Similar to x64, we should copy the state only if the contextFlags has the CONTEXT_XSTATE flag set.

There is one remaining test failure I've just debugged to being due to this. Will fix it up.

@@ -104,6 +105,32 @@ LOCAL_LABEL(Done_CONTEXT_INTEGER):
sub x0, x0, CONTEXT_FLOAT_CONTROL_OFFSET + CONTEXT_NEON_OFFSET

LOCAL_LABEL(Done_CONTEXT_FLOATING_POINT):
ldr x1, [x0, CONTEXT_XSTATEFEATURESMASK_OFFSET]
Member

It should check the CONTEXT_XSTATE in CONTEXT_ContextFlags first and check the features mask only if the CONTEXT_XSTATE is set.

@@ -154,6 +184,31 @@ LOCAL_LABEL(Restore_CONTEXT_FLOATING_POINT):
// since we potentially clobber x0 below, we'll bank it in x16
mov x16, x0

ldr w17, [x16, CONTEXT_XSTATEFEATURESMASK_OFFSET]
Member

It should check the CONTEXT_XSTATE in CONTEXT_ContextFlags first and check the features mask only if the CONTEXT_XSTATE is set.

@@ -60,6 +60,90 @@ using asm_sigcontext::_xstate;
bool Xstate_IsAvx512Supported();
#endif // XSTATE_SUPPORTED || (HOST_AMD64 && HAVE_MACH_EXCEPTIONS)

#if defined(HOST_64BIT) && defined(HOST_ARM64) && !defined(TARGET_FREEBSD) && !defined(TARGET_OSX)
#if !defined(SVE_MAGIC)
Member

Is this define not present when building in our CI?

Contributor Author

> Is this define not present when building in our CI?

Yes, they are missing in the CI. If I remove any of them the build falls over. I think this happens when cross compiling. Either way, an old Linux must be in use, because these defines have been present in Linux since about 2017.
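For readers unfamiliar with what these defines are used for, here is a sketch of a guarded fallback define together with a walk over the tagged records that a Linux arm64 signal frame stores after the general registers. The SVE_MAGIC value matches the Linux uapi header as far as I know. `DemoCtxHeader` is a local mirror of the kernel's record header, and `demo_find_record`, `demo_find_sve` and the 0x11111111 magic are made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fallback for toolchains whose kernel headers predate SVE support. */
#if !defined(SVE_MAGIC)
#define SVE_MAGIC 0x53564501
#endif

/* Local mirror of the header that starts every record. */
typedef struct {
    uint32_t magic;  /* identifies the record type; 0 terminates the list */
    uint32_t size;   /* total record size in bytes, header included */
} DemoCtxHeader;

/* Walks the records in `area` and returns the byte offset of the first
   one whose magic equals `wanted`, or -1 if there is none. */
static long demo_find_record(const unsigned char *area, size_t area_len,
                             uint32_t wanted)
{
    size_t off = 0;

    while (off + sizeof(DemoCtxHeader) <= area_len) {
        DemoCtxHeader h;
        memcpy(&h, area + off, sizeof(h));  /* avoids unaligned access */

        if (h.magic == 0)
            break;                          /* terminator record */
        if (h.size < sizeof(h) || h.size > area_len - off)
            break;                          /* malformed; stop rather than overrun */
        if (h.magic == wanted)
            return (long)off;
        off += h.size;
    }
    return -1;
}

/* Builds a tiny area holding an unrelated record followed by an SVE
   record, then searches it. */
static long demo_find_sve(void)
{
    unsigned char area[64] = {0};
    DemoCtxHeader other = { 0x11111111u, 16 };  /* arbitrary non-SVE record */
    DemoCtxHeader sve   = { SVE_MAGIC, 16 };

    memcpy(area, &other, sizeof(other));
    memcpy(area + 16, &sve, sizeof(sve));
    return demo_find_record(area, sizeof(area), SVE_MAGIC);
}
```

The `#if !defined` guard keeps the fallback from clashing with newer headers that already provide the constant.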

@janvorli
Member

Besides the few comments, it looks good.

@a74nh
Contributor Author

a74nh commented Jun 28, 2024

Fixed up so that XSTATE is set and checked as suggested.
Everything in run.sh passes and I can see state getting copied when debugging around signals.

This PR enables -DXSTATE_SUPPORTED on Arm64. Am I correct in thinking that on AMD64 Windows this flag is also used in certain scenarios to ensure the xstate data is block copied? If so, then on Arm64 Windows this support still needs adding, probably by just enabling some AMD64 defines for ARM64. However, I don't have a Windows setup, so I don't want to blindly change anything here; I recommend someone checks Windows after this PR is merged.

@a74nh
Contributor Author

a74nh commented Jun 28, 2024

Running all priority 1 tests in checked on SVE Linux....

Time [secs] | Total | Passed | Failed | Skipped | Assembly Execution Summary
============================================================================
     30.836 |   131 |    131 |      0 |       0 | JIT.Regression.Regression_4
     13.873 |   345 |    342 |      0 |       3 | JIT.Regression.Regression_3
     13.446 |    53 |     53 |      0 |       0 | CoreMangLib.CoreMangLib
     11.249 |   100 |    100 |      0 |       0 | Loader.classloader.TypeGeneratorTests.TypeGeneratorTests900-999
     13.538 |    76 |     76 |      0 |       0 | GC.API.XUnitWrapper.dll
      1.225 |     3 |      3 |      0 |       0 | GC.Coverage.XUnitWrapper.dll
    183.980 |    42 |     42 |      0 |       0 | GC.Features.XUnitWrapper.dll
      3.100 |     6 |      6 |      0 |       0 | GC.LargeMemory.XUnitWrapper.dll
      4.731 |    12 |     12 |      0 |       0 | GC.Regressions.XUnitWrapper.dll
     22.314 |   481 |    481 |      0 |       0 | GC.Scenarios.XUnitWrapper.dll
      0.039 |     1 |      1 |      0 |       0 | GC.Stress.XUnitWrapper.dll
      0.534 |     1 |      1 |      0 |       0 | ilasm.PortablePdb.XUnitWrapper.dll
      0.619 |     1 |      1 |      0 |       0 | ilasm.System.XUnitWrapper.dll
      2.668 |     1 |      1 |      0 |       0 | ilverify.XUnitWrapper.dll
      0.537 |     2 |      2 |      0 |       0 | profiler.assembly.XUnitWrapper.dll
      1.236 |     2 |      2 |      0 |       0 | profiler.elt.XUnitWrapper.dll
      0.793 |     3 |      3 |      0 |       0 | profiler.eventpipe.XUnitWrapper.dll
      0.983 |     4 |      4 |      0 |       0 | profiler.gc.XUnitWrapper.dll
      0.557 |     1 |      1 |      0 |       0 | profiler.handles.XUnitWrapper.dll
      0.038 |     1 |      1 |      0 |       0 | profiler.multiple.XUnitWrapper.dll
      0.037 |     1 |      1 |      0 |       0 | profiler.rejit.XUnitWrapper.dll
      1.360 |     1 |      1 |      0 |       0 | profiler.transitions.XUnitWrapper.dll
      4.133 |     5 |      5 |      0 |       0 | profiler.unittest.XUnitWrapper.dll
      0.117 |     1 |      0 |      0 |       1 | JIT.jit64.jit64_2
      5.451 |   116 |    116 |      0 |       0 | JIT.SIMD.JIT.SIMD
      6.433 |   101 |    101 |      0 |       0 | Loader.classloader.TypeGeneratorTests.TypeGeneratorTests1400-1599
      4.025 |   100 |    100 |      0 |       0 | Loader.classloader.TypeGeneratorTests.TypeGeneratorTests200-299
     10.097 |   230 |    227 |      1 |       2 | JIT.Directed.Directed_3
     19.526 |   214 |    214 |      0 |       0 | JIT.Directed.Directed_1
      5.211 |   216 |    216 |      0 |       0 | JIT.Generics.JIT.Generics
      8.080 |   100 |    100 |      0 |       0 | Loader.classloader.TypeGeneratorTests.TypeGeneratorTests1200-1299
     12.593 |    16 |     16 |      0 |       0 | reflection.reflection
    138.821 |   100 |    100 |      0 |       0 | JIT.Performance.JIT.performance
    172.627 |    11 |     11 |      0 |       0 | JIT.JIT_others
      2.365 |   100 |    100 |      0 |       0 | Loader.classloader.TypeGeneratorTests.TypeGeneratorTests0-99
      1.128 |     2 |      1 |      0 |       1 | Exceptions.Exceptions
     60.075 |    88 |     88 |      0 |       0 | baseservices.threading.threading_group2
      2.885 |   100 |    100 |      0 |       0 | Loader.classloader.TypeGeneratorTests.TypeGeneratorTests500-599
    158.641 |   322 |    321 |      0 |       1 | Loader.Loader
     17.855 |   226 |    207 |      0 |      19 | Interop.Interop
      5.244 |   437 |    433 |      0 |       4 | JIT.Regression.Regression_6
      8.702 |   641 |    641 |      0 |       0 | JIT.CodeGenBringUpTests.JIT.CodeGenBringUpTests
      6.596 |    16 |     16 |      0 |       0 | JIT.JIT_r
     23.990 |   482 |    480 |      0 |       2 | JIT.Regression.Regression_1
      8.302 |   100 |    100 |      0 |       0 | Loader.classloader.TypeGeneratorTests.TypeGeneratorTests700-799
      0.522 |     3 |      3 |      0 |       0 | JIT.JIT_do
      9.039 |   100 |    100 |      0 |       0 | Loader.classloader.TypeGeneratorTests.TypeGeneratorTests1100-1199
      2.658 |   100 |    100 |      0 |       0 | Loader.classloader.TypeGeneratorTests.TypeGeneratorTests400-499
     48.146 |   112 |    112 |      0 |       0 | JIT.jit64.jit64_1
     84.672 |     1 |      1 |      0 |       0 | readytorun.coreroot_determinism.readytorun_coreroot_determinism
     13.396 |   215 |    215 |      0 |       0 | Loader.classloader.generics.LoaderClassloaderGenerics
      2.474 |   100 |    100 |      0 |       0 | Loader.classloader.TypeGeneratorTests.TypeGeneratorTests100-199
     43.988 |  2551 |   2551 |      0 |       0 | JIT.HardwareIntrinsics.HardwareIntrinsics_General_ro
      9.626 |   100 |    100 |      0 |       0 | Loader.classloader.TypeGeneratorTests.TypeGeneratorTests1300-1399
     76.360 |    75 |     75 |      0 |       0 | Regressions.Regressions
      5.244 |   211 |    210 |      0 |       1 | JIT.Methodical.Methodical_r2
      5.458 |    85 |     85 |      0 |       0 | JIT.Methodical.Methodical_ro
      9.593 |   100 |    100 |      0 |       0 | Loader.classloader.TypeGeneratorTests.TypeGeneratorTests800-899
     75.522 |    20 |     18 |      0 |       2 | readytorun.readytorun
     12.881 |   279 |    279 |      0 |       0 | JIT.Methodical.Methodical_r1
      3.120 |   343 |    343 |      0 |       0 | JIT.jit64.jit64_4
      1.532 |    47 |     47 |      0 |       0 | JIT.Regression.Regression_5
     41.352 |   144 |    142 |      0 |       2 | baseservices.exceptions.baseservices-exceptions
      9.289 |   100 |    100 |      0 |       0 | Loader.classloader.TypeGeneratorTests.TypeGeneratorTests1000-1099
     18.946 |    67 |     67 |      0 |       0 | Loader.classloader.regressions.LoaderClassloaderRegressions
     83.636 |   355 |    353 |      0 |       2 | JIT.opt.JIT.opt
      5.979 |    85 |     85 |      0 |       0 | JIT.Methodical.Methodical_do
      4.367 |    35 |     33 |      0 |       2 | tracing.tracing
     37.561 |   485 |    484 |      0 |       1 | JIT.Regression.Regression_2
     13.844 |   276 |    276 |      0 |       0 | JIT.Methodical.Methodical_d1
      0.467 |     2 |      2 |      0 |       0 | JIT.JIT_d
     92.261 |   143 |    139 |      0 |       4 | JIT.jit64.jit64_3
      6.514 |    16 |     16 |      0 |       0 | JIT.JIT_ro
     50.519 |    39 |     35 |      0 |       4 | baseservices.baseservices
      7.336 |   100 |    100 |      0 |       0 | Loader.classloader.TypeGeneratorTests.TypeGeneratorTests600-699
      5.981 |   208 |    207 |      0 |       1 | JIT.Methodical.Methodical_d2
     35.809 |   231 |    226 |      0 |       5 | JIT.jit64.jit64_5
     49.200 |  2584 |   2584 |      0 |       0 | JIT.HardwareIntrinsics.HardwareIntrinsics_General_r
      1.555 |    55 |     54 |      0 |       1 | JIT.Methodical.Methodical_others
      3.838 |   100 |    100 |      0 |       0 | Loader.classloader.TypeGeneratorTests.TypeGeneratorTests300-399
      0.088 |     0 |      0 |      0 |       0 | managed.Managed
     54.354 |   203 |    199 |      0 |       4 | JIT.Directed.Directed_2
      2.741 |   405 |    405 |      0 |       0 | JIT.IL_Conformance.IL_Conformance
     77.532 |    82 |     81 |      0 |       1 | baseservices.threading.threading_group1
----------------------------------------------------------------------------
   1997.991 | 15149 |  15085 |      1 |      63 | (total)

I get that single failure on the latest head as well, so I'm not worried about it.

@jkotas
Member

jkotas commented Jun 28, 2024

I see a lot of TODOs about SVE size being hardcoded to 128 bit.

What is going to be the experience when somebody runs .NET 9 binary on a machine with 256 bit SVE? It is important that it just works, without crashing, buffer overruns, etc.

@a74nh
Contributor Author

a74nh commented Jun 28, 2024

> I see a lot of TODOs about SVE size being hardcoded to 128 bit.
>
> What is going to be the experience when somebody runs .NET 9 binary on a machine with 256 bit SVE? It is important that it just works, without crashing, buffer overruns, etc.

Running the entire testsuite on 256-bit hardware, all the tests pass both with and without my latest fix. That's because there is no SVE state in the kernel, so the structure that comes back from the OS has no SVE state (sve.size is 16: just the header with no data).
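One way to express the "header only" observation, plus a guard against the overrun concern raised above, is sketched here. `DemoSveRecord` is a local 16-byte mirror of the kernel's SVE record header, and the decision enum and function names are hypothetical. This is not necessarily what the PR does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_SVE_MAGIC 0x53564501u

/* Local mirror of the 16-byte SVE record header. */
typedef struct {
    uint32_t magic;
    uint32_t size;         /* total record size, header included */
    uint16_t vl;           /* vector length in bytes */
    uint16_t reserved[3];
} DemoSveRecord;

typedef enum {
    DEMO_SVE_ABSENT,       /* no SVE record at all */
    DEMO_SVE_HEADER_ONLY,  /* record present but carries no register data */
    DEMO_SVE_VL_MISMATCH,  /* data present but wider/narrower than our storage */
    DEMO_SVE_COPY_OK       /* data present and it fits our storage */
} DemoSveDecision;

/* Decides whether register data may be copied out of `rec` into storage
   that was sized for `dest_vl_bytes`. */
static DemoSveDecision demo_classify(const DemoSveRecord *rec,
                                     size_t dest_vl_bytes)
{
    if (rec == NULL || rec->magic != DEMO_SVE_MAGIC)
        return DEMO_SVE_ABSENT;
    if (rec->size <= sizeof(DemoSveRecord))
        return DEMO_SVE_HEADER_ONLY;   /* e.g. size == 16 */
    if (rec->vl != dest_vl_bytes)
        return DEMO_SVE_VL_MISMATCH;   /* refuse rather than overrun */
    return DEMO_SVE_COPY_OK;
}

/* Convenience wrapper that builds a record from plain values. */
static DemoSveDecision demo_classify_values(uint32_t size, uint16_t vl,
                                            size_t dest_vl_bytes)
{
    DemoSveRecord rec = { DEMO_SVE_MAGIC, size, vl, {0, 0, 0} };
    return demo_classify(&rec, dest_vl_bytes);
}
```

With storage sized for 128 bits, a 256-bit record that does carry data would be classified as a mismatch and skipped instead of being copied past the end of the destination.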

Member

@kunalspathak kunalspathak left a comment

Thanks @a74nh for your contribution. LGTM.

@kunalspathak
Member

/ba-g failure is #103550

@kunalspathak kunalspathak merged commit 9528c15 into dotnet:main Jun 29, 2024
84 of 89 checks passed
akoeplinger added a commit to akoeplinger/runtime that referenced this pull request Jul 8, 2024
It got broken by dotnet#103801 due to a host vs. target arch typo.

This showed up in the VMR since we use arm64 macOS build agents there.
akoeplinger added a commit to akoeplinger/runtime that referenced this pull request Jul 8, 2024
It got uncovered by dotnet#103801.

This showed up in the VMR since we use arm64 macOS build agents there.
akoeplinger added a commit that referenced this pull request Jul 8, 2024
It got uncovered by #103801.

This showed up in the VMR since we use arm64 macOS build agents there.
@github-actions github-actions bot locked and limited conversation to collaborators Jul 29, 2024
Labels
area-PAL-coreclr arm-sve Work related to arm64 SVE/SVE2 support community-contribution Indicates that the PR has been added by a community member
5 participants