Arm64 apple vm fixes for arg alignment. #46665

sandreenko · 2021-01-07T04:23:44Z

Contributes to #46456.

Move to byte sizes/offsets in VM so it works correctly with arguments that do not occupy TARGET_POINTER_SIZE stack slots on arm64.

It also unifies the platforms: previously ArgIteratorBase used byte offsets on some platforms and stack slot indices on others; now it uses byte offsets on all of them.

Use TARGET_POINTER_SIZE instead of STACK_ELEM_SIZE because:

- there were many places where TARGET_POINTER_SIZE was already used in this context;
- it is confusing for arm64 apple to have STACK_ELEM_SIZE, and I could not come up with a better name.
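To illustrate why slot indices stop working, here is a minimal sketch (not code from this PR) comparing a layout where every stack argument occupies whole pointer-sized slots with a layout where arguments are packed at their natural alignment, as the Apple arm64 calling convention does for stack arguments. The helper names `AlignUp`, `SlotBasedOffsets` and `PackedOffsets` are hypothetical, and the sketch assumes primitive arguments whose alignment equals their size (a power of two).

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: round `value` up to a multiple of `alignment`.
// `alignment` must be a power of two.
constexpr std::size_t AlignUp(std::size_t value, std::size_t alignment)
{
    return (value + alignment - 1) & ~(alignment - 1);
}

// Every argument occupies one or more whole pointer-sized slots.
std::vector<std::size_t> SlotBasedOffsets(const std::vector<std::size_t>& argSizes,
                                          std::size_t pointerSize = 8)
{
    std::vector<std::size_t> offsets;
    std::size_t cur = 0;
    for (std::size_t size : argSizes)
    {
        offsets.push_back(cur);
        cur += AlignUp(size, pointerSize);
    }
    return offsets;
}

// Arguments are packed at their natural alignment (assumed equal to their size).
std::vector<std::size_t> PackedOffsets(const std::vector<std::size_t>& argSizes)
{
    std::vector<std::size_t> offsets;
    std::size_t cur = 0;
    for (std::size_t size : argSizes)
    {
        cur = AlignUp(cur, size);
        offsets.push_back(cur);
        cur += size;
    }
    return offsets;
}
```

For argument sizes {4, 4, 8} the slot-based layout yields offsets 0, 8, 16, while the packed layout yields 0, 4, 8. The second argument's offset of 4 cannot be expressed as a slot index, which is why byte offsets are needed.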

sandreenko · 2021-01-12T08:49:34Z

PTAL @janvorli, @jkotas, @sdmaclea. With these changes arm64 passes my small repro from sandreenko@678ce4b and shows no new failures in my local run. The generic reflection test gets further but still fails, probably because of an independent issue.

janvorli

There is one more place to change: https://github.com/dotnet/runtime/blob/master/src/coreclr/tools/aot/ILCompiler.ReadyToRun/Compiler/DependencyAnalysis/ReadyToRun/ArgIterator.cs. This one is basically a port of callingconvention.h to managed code.

src/coreclr/vm/arm64/cgencpu.h

janvorli · 2021-01-12T10:45:39Z

src/coreclr/vm/callingconvention.h

        if (TransitionBlock::IsFloatArgumentRegisterOffset(argOffset))
        {
-            pLoc->m_idxFloatReg = (argOffset - TransitionBlock::GetOffsetOfFloatArgumentRegisters()) / 4;
-            pLoc->m_cFloatReg = cSlots;
+            pLoc->m_idxFloatReg = (argOffset - TransitionBlock::GetOffsetOfFloatArgumentRegisters()) / TARGET_POINTER_SIZE;


This is a bit misleading: the size is not a target pointer size but a floating point register size. I know they are the same on arm, but they are not the same on arm64 and x64, and the equivalent functions for those two use / 16. I would prefer defining FLOAT_REGISTER_SIZE in the cgencpu.h files and using it here and in the equivalent arm64 and x64 methods.

Personally I would avoid the term FLOAT_REGISTER_SIZE on arm/arm64. I would prefer FLOAT_SIZE or SIZEOF_FLOAT.

Added FLOAT_REGISTER_SIZE.

> I would prefer FLOAT_SIZE or SIZEOF_FLOAT.

Wouldn't it be confusing to have #define FLOAT_SIZE 16 on arm64?

I guess I didn't expect the value to be 16.

For Arm32, a float register i.e. s0 would be 4 bytes, but it would be one of many in the q0/v0 register.

For arm64 s0 is the lower 4 bytes of the v0 SIMD register which is 16 bytes.

FLOAT_SIZE is definitely worse, but I still think FLOAT_REGISTER_SIZE might be ambiguous for ARM architectures. VECTOR_REGISTER_SIZE might be better for ARM/ARM64.

The thing is that we use m_idxFloatReg, m_cFloatReg fields in the surrounding code, thus the FLOAT_REGISTER_SIZE matches those.
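As a rough sketch of the idea discussed in this thread (not the actual cgencpu.h or callingconvention.h code), the register index can be derived from a byte offset using a per-architecture register size. The constant names and the `FloatRegIndex` helper below are hypothetical; the values 4 (arm) and 16 (arm64, x64) are the ones mentioned in the comments above.

```cpp
#include <cassert>
#include <cstddef>

// Illustrative constants; the real definitions would live in each cgencpu.h.
constexpr std::size_t kFloatRegisterSizeArm   = 4;   // s0 is 4 bytes
constexpr std::size_t kFloatRegisterSizeArm64 = 16;  // v0 is 16 bytes

// Convert a byte offset inside the saved floating-point argument register
// area into a register index. `floatRegsBase` is the byte offset at which
// that area starts.
inline std::size_t FloatRegIndex(std::size_t argOffset,
                                 std::size_t floatRegsBase,
                                 std::size_t floatRegisterSize)
{
    assert(argOffset >= floatRegsBase);
    const std::size_t floatRegOfsInBytes = argOffset - floatRegsBase;
    // The offset is expected to point at the start of a register.
    assert(floatRegOfsInBytes % floatRegisterSize == 0);
    return floatRegOfsInBytes / floatRegisterSize;
}
```

With a base of 16, an offset of 48 maps to index 2 when the register size is 16, and an offset of 24 maps to index 2 when the register size is 4.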

janvorli · 2021-01-12T11:06:39Z

src/coreclr/vm/callingconvention.h

@@ -620,35 +632,35 @@ class ArgIteratorTemplate : public ARGITERATOR_BASE

        pLoc->m_fRequires64BitAlignment = m_fRequires64BitAlignment;

+        const unsigned byteArgSize = StackElemSize(GetArgSize());


This variable name is misleading: it looks as if it were the size of the argument in bytes, but it is the size aligned to the stack element size. The more I look at the code below, the more I feel we should just keep using cSlots except for the places where we place arguments on the stack (where we set m_byteStackSize). This comment is meant for all architectures.

Does cSlots mean slot count? Does the slot size depend on context, i.e. could it be a general register size, a float register size, or a stack slot size?

cSlots meant slot count where slots were considered both register slots in the transition block and stack slots. Only on ARM32 the term has leaked into the floating point register count setting, which may be misleading. Now the stack slot concept is kind of broken by the Apple ARM64 calling convention, but the register slot still holds.

I have changed this code so we don't create cSlots and don't call StackElemSize unless we put byteArgSize on the stack.
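The distinction in this thread is between the raw argument size and the size it occupies once rounded up for the stack. A minimal sketch, assuming a 64-bit target where stack elements are pointer-sized (which, per the discussion, does not hold for every case on arm64 apple); `kTargetPointerSize`, `StackAlignedSize` and `SlotCount` are hypothetical names, not the runtime's helpers:

```cpp
#include <cstddef>

// Assumption: 64-bit target with pointer-sized stack elements.
constexpr std::size_t kTargetPointerSize = 8;

// Size the argument occupies on the stack after rounding up.
constexpr std::size_t StackAlignedSize(std::size_t byteArgSize)
{
    return (byteArgSize + kTargetPointerSize - 1) / kTargetPointerSize * kTargetPointerSize;
}

// Number of pointer-sized slots needed to hold the argument.
constexpr std::size_t SlotCount(std::size_t byteArgSize)
{
    return (byteArgSize + kTargetPointerSize - 1) / kTargetPointerSize;
}

static_assert(StackAlignedSize(17) == 24, "17 bytes round up to 24");
static_assert(SlotCount(17) == 3, "17 bytes need 3 slots");
```

Keeping the raw size in one variable and applying the rounding only where the value is stored as a stack size avoids a name that reads as "size in bytes" while actually holding the aligned size.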

src/coreclr/vm/callingconvention.h

janvorli · 2021-01-12T11:34:57Z

src/coreclr/vm/callingconvention.h


+        unsigned cSlots = (byteArgSize + TARGET_POINTER_SIZE - 1) / TARGET_POINTER_SIZE;
+
+        // Question: why do not arm and x86 have similar checks?


I am not sure if this check is needed here. The ArgIteratorTemplate<ARGITERATOR_BASE>::GetNextOffset() is responsible for getting the right offset depending on the available registers and argument type and size.

There are cases where we get into this check, for example, JIT\Stress\ABI\pinvokes_d:

CoreCLR!ArgIteratorTemplate<ArgIteratorBase>::GetArgLoc+0x10c
CoreCLR!GenerateShuffleArrayPortable+0x27c
CoreCLR!GenerateShuffleArray+0x14

It arrives here with byteArgSize = 17 and we change its size to 8 here; if we don't do this, we get a wrong result.

src/coreclr/vm/callingconvention.h

src/coreclr/vm/arm/cgencpu.h

sandreenko · 2021-01-19T22:55:43Z

> There is one more place to change, the https://github.com/dotnet/runtime/blob/master/src/coreclr/tools/aot/ILCompiler.ReadyToRun/Compiler/DependencyAnalysis/ReadyToRun/ArgIterator.cs. This one is basically a port of the callingconvention.h to managed code.

I would do it in a separate PR because it looks like the same amount of work there.

janvorli · 2021-01-19T22:57:57Z

> would do it in a separate PR because it looks like the same amount of work there.

Makes sense

janvorli

LGTM modulo the comments.

janvorli · 2021-01-20T01:29:35Z

src/coreclr/vm/comdelegate.cpp


            // Delegates cannot handle overly large argument stacks due to shuffle entry encoding limitations.
            if (index >= ShuffleEntry::REGMASK)
            {
                COMPlusThrow(kNotSupportedException);
            }

-            return (UINT16)index;
+            return -(int)byteIndex;


I don't think we should make this change and the one in the caller when we can only handle the case where the offsets are multiples of TARGET_POINTER_SIZE anyway. I believe it would require quite a complex change to the shuffle thunk generation to make it work for non-aligned offsets, so doing it halfway doesn't seem to bring any benefit.

It is now exactly halfway, and that is required to pass the existing tests.
We have cases on arm64 apple where the byteIndex we return is not a multiple of 8 (byteIndex % 8 != 0) but is the same for source and destination, so we don't need to generate a move and don't hit the NYI assert.

Ah, ok, makes sense then.
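One way to read the explanation above, sketched as a small classification function. This is only an interpretation of the comment, not the logic in comdelegate.cpp; `ShuffleAction`, `ClassifyStackShuffle` and `kPointerSize` are hypothetical names.

```cpp
#include <cstddef>

enum class ShuffleAction
{
    NoMove,          // source and destination coincide, nothing to emit
    Move,            // pointer-aligned offsets, a regular move works
    NotImplemented   // unaligned offsets that differ: the NYI case
};

// Assumption: 64-bit target.
constexpr std::size_t kPointerSize = 8;

constexpr ShuffleAction ClassifyStackShuffle(std::size_t srcByteIndex,
                                             std::size_t dstByteIndex)
{
    // Identical offsets need no move, even when they are not pointer-aligned.
    if (srcByteIndex == dstByteIndex)
        return ShuffleAction::NoMove;

    // Differing offsets are only handled when both are pointer-aligned.
    if (srcByteIndex % kPointerSize != 0 || dstByteIndex % kPointerSize != 0)
        return ShuffleAction::NotImplemented;

    return ShuffleAction::Move;
}
```

Under this reading, returning byte indices lets an unaligned argument at, say, offset 12 on both sides pass through untouched, while the shuffle thunk generation itself still only supports pointer-aligned moves.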

src/coreclr/vm/amd64/cgencpu.h

sdmaclea · 2021-01-20T02:42:21Z

@sandreenko Looks like this is breaking a pinvoke test on linux-arm64 & windows-arm64

janvorli · 2021-01-20T12:11:04Z

src/coreclr/vm/callingconvention.h

-            pLoc->m_cFloatReg = cSlots;
+            const int floatRegOfsInBytes = argOffset - TransitionBlock::GetOffsetOfFloatArgumentRegisters();
+            _ASSERTE((floatRegOfsInBytes % FLOAT_REGISTER_SIZE) == 0);
+            pLoc->m_idxFloatReg = (argOffset - TransitionBlock::GetOffsetOfFloatArgumentRegisters()) / FLOAT_REGISTER_SIZE;


A nit - can you please use the floatRegOfsInBytes constant here too?

My bad, thanks for catching

sandreenko · 2021-01-21T21:37:39Z

I have checked that the updated version still works on arm64 apple and fixes JIT/HardwareIntrinsics/General/Vector* tests.

Dotnet-GitSync-Bot added the area-CodeGen-coreclr label Jan 7, 2021

sandreenko marked this pull request as draft January 7, 2021 04:30

runfoapp bot mentioned this pull request Jan 7, 2021

Inability to unzip assets during build on Unix x64 #32805

Closed

sandreenko added area-VM-coreclr and removed area-CodeGen-coreclr labels Jan 7, 2021

sandreenko force-pushed the arm64AppleVM branch 3 times, most recently from ade1e53 to c50a5d6 Compare January 11, 2021 22:02

sandreenko changed the title ~~Arm64 apple vm~~ Arm64 apple vm fixes for arg alignment. Jan 11, 2021

sandreenko marked this pull request as ready for review January 11, 2021 22:06

janvorli reviewed Jan 12, 2021

JulieLeeMSFT added this to the 6.0.0 milestone Jan 13, 2021

JulieLeeMSFT assigned sandreenko Jan 13, 2021

sandreenko mentioned this pull request Jan 16, 2021

Fix the GenericTest.Vector*Boolean #47070

Merged

sandreenko force-pushed the arm64AppleVM branch from c50a5d6 to 302a379 Compare January 20, 2021 00:20

janvorli approved these changes Jan 20, 2021

sdmaclea approved these changes Jan 20, 2021

sandreenko force-pushed the arm64AppleVM branch 2 times, most recently from 16c8b29 to 69195a9 Compare January 20, 2021 09:38

janvorli reviewed Jan 20, 2021

Sergey added 6 commits January 21, 2021 00:07

Add MarshalInfo::IsValueClass.

7b1024c

Add TypeHandle* pTypeHandle to SizeOf.

30c0a33

Add a few asserts/start using inline function instead of macro.

44c85e8

use TARGET_POINTER_SIZE instead of STACK_ELEM_SIZE.

6824218

Use m_curOfs instead of m_idxStack in ArgIteratorBase on all pl…

2ce2c53

…atforms. Before some platforms were using stackSlots, some curOfs in bytes.

Use byte sizes and offsets in ArgLocDesc.

1ff7b76

Fix arm32. x86 fixes. use StackSize on ArgSizes Add `GetStackArgumentByteIndexFromOffset` and return back the old values for asserts. another fix

Sergey added 19 commits January 21, 2021 00:07

Stop using #define STACK_ELEM_SIZE

d4237b7

Add isFloatHfa.

e130b18

delete checking code.

28fc554

because it won't pass on arm64 apple.

arm64 apple fixes.

8364408

roundUp the stack size.

18340b4

Add a reflection test.

bb99453

Return byte offset from GetNextOfs.

7e00748

It is not a complete fix for arm64 apple, but covers most cases.

Add FLOAT_REGISTER_SIZE

2026ab7

Use StackElemSize for pLoc->m_byteStackSize.

78517bd

replace assert with _ASSERTE.

215178a

Use ALIGN_UP in the code that I have changed.

35202d5

rename m_curOfs as m_ofsStack.

2ea716b

delete "ceremony " from StackElemSize.

164f54f

Delete cSlots and don't call StackElemSize on GetArgSize.

cc5838e

Fix an assert.

9948f4b

Fix nit.

80d16c3

fix wrong return for hfa<float>.

e42895f

fix nit.

2b9aa39

Fix crossgen job.

e6c5cc6

sandreenko force-pushed the arm64AppleVM branch from db04d24 to e6c5cc6 Compare January 21, 2021 08:07

sandreenko merged commit 648437b into dotnet:master Jan 21, 2021

sandreenko deleted the arm64AppleVM branch January 21, 2021 21:37

sandreenko mentioned this pull request Jan 21, 2021

Failure in Regressions/coreclr/GitHub_35000/test35000/test35000.sh #47294

Closed

JulieLeeMSFT mentioned this pull request Jan 28, 2021

What's new in .NET 6 Preview 1 dotnet/core#5853

Closed

ghost locked as resolved and limited conversation to collaborators Feb 20, 2021
