-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Arm64 apple vm fixes for arg alignment. #46665
Conversation
ade1e53
to
c50a5d6
Compare
PTAL @janvorli , @jkotas , @sdmaclea , with these changes arm64 passes my small repro from sandreenko@678ce4b and does not show any new failures in my local run, generic test reflection is passing further but still failing, probably with an independent issue. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is one more place to change, the https://github.com/dotnet/runtime/blob/master/src/coreclr/tools/aot/ILCompiler.ReadyToRun/Compiler/DependencyAnalysis/ReadyToRun/ArgIterator.cs. This one is basically a port of the callingconvention.h to managed code.
src/coreclr/vm/callingconvention.h
Outdated
if (TransitionBlock::IsFloatArgumentRegisterOffset(argOffset)) | ||
{ | ||
pLoc->m_idxFloatReg = (argOffset - TransitionBlock::GetOffsetOfFloatArgumentRegisters()) / 4; | ||
pLoc->m_cFloatReg = cSlots; | ||
pLoc->m_idxFloatReg = (argOffset - TransitionBlock::GetOffsetOfFloatArgumentRegisters()) / TARGET_POINTER_SIZE; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a bit misleading, the size is not a target pointer size, but a floating point size. I know they are the same on arm, but they are not the same on arm64 and x64 and the equivalent functions for those two use / 16. I would prefer defining FLOAT_REGISTER_SIZE in the cgencpu.h files and using it here and in the same arm64 and x64 methods.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Personally I would avoid the term FLOAT_REGISTER_SIZE
on arm/arm64. I would prefer FLOAT_SIZE
or SIZEOF_FLOAT
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added FLOAT_REGISTER_SIZE
,
I would prefer FLOAT_SIZE or SIZEOF_FLOAT.
would not it be confusing to have #define FLOAT_SIZE 16
on arm64?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess I didn't expect the value to be 16.
- For Arm32, a float register i.e.
s0
would be 4 bytes, but it would be one of many in theq0
/v0
register. - For arm64
s0
is the lower 4 bytes of thev0
SIMD register which is 16 bytes.
FLOAT_SIZE
is definitely, worse, but I still think FLOAT_REGISTER_SIZE
might be ambiguous for ARM architectures. VECTOR_REGISTER_SIZE
might be better for ARM/ARM64.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The thing is that we use m_idxFloatReg, m_cFloatReg fields in the surrounding code, thus the FLOAT_REGISTER_SIZE matches those.
src/coreclr/vm/callingconvention.h
Outdated
@@ -620,35 +632,35 @@ class ArgIteratorTemplate : public ARGITERATOR_BASE | |||
|
|||
pLoc->m_fRequires64BitAlignment = m_fRequires64BitAlignment; | |||
|
|||
const unsigned byteArgSize = StackElemSize(GetArgSize()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This variable name is misleading, as it looks as if it was the size of argument in bytes, but is it the size aligned to stack element size. The more I look at the code below, the more I feel like we should just keep using the cSlots except for the places where we place arguments on stack (we set the m_byteStackSize). This comment is meant for all architectures.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does cSlots
meant slots count? Does slot size depend on context, like it could be a general reg size, float reg size, stack slot size?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cSlots
meant slot count where slots were considered both register slots in the transition block and stack slots. Only on ARM32 the term has leaked into the floating point register count setting, which may be misleading. Now the stack slot concept is kind of broken by the Apple ARM64 calling convention, but the register slot still holds.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have changed this code so we don't create cSlots
and don't call StackElemSize
unless we put byteArgSize
on the stack.
|
||
unsigned cSlots = (byteArgSize + TARGET_POINTER_SIZE - 1) / TARGET_POINTER_SIZE; | ||
|
||
// Question: why do not arm and x86 have similar checks? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure if this check is needed here. The ArgIteratorTemplate<ARGITERATOR_BASE>::GetNextOffset()
is responsible for getting the right offset depending on the available registers and argument type and size.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are cases where we get into this check, for example, JIT\Stress\ABI\pinvokes_d
:
CoreCLR!ArgIteratorTemplate<ArgIteratorBase>::GetArgLoc+0x10c
CoreCLR!GenerateShuffleArrayPortable+0x27c
CoreCLR!GenerateShuffleArray+0x14
comes here with byteArgSize = 17 and we change its size to 8 here, if we don't do this we will get a wrong result.
would do it in a separate PR because it looks like the same amount of work there. |
Makes sense |
c50a5d6
to
302a379
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM modulo the comments.
|
||
// Delegates cannot handle overly large argument stacks due to shuffle entry encoding limitations. | ||
if (index >= ShuffleEntry::REGMASK) | ||
{ | ||
COMPlusThrow(kNotSupportedException); | ||
} | ||
|
||
return (UINT16)index; | ||
return -(int)byteIndex; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should make this change and the one in the caller when we can only handle the case when the offsets are multiples of TARGET_POINTER_SIZE anyways. I believe it would require quite complex change to the shuffle thunk generation to make it work for non-aligned stuff, so it seems that doing it half way doesn't bring any benefit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is now exactly half-way and it is required to pass existing tests.
We have cases on arm64 apple when we return byteIndex
that is not byteIndex % 8 == 0
but it is the same for source and destination, so we don't need to generate a move and don't hit the NYI assert.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, ok, makes sense then.
@sandreenko Looks like this is breaking a pinvoke test on linux-arm64 & windows-arm64 |
16c8b29
to
69195a9
Compare
src/coreclr/vm/callingconvention.h
Outdated
pLoc->m_cFloatReg = cSlots; | ||
const int floatRegOfsInBytes = argOffset - TransitionBlock::GetOffsetOfFloatArgumentRegisters(); | ||
_ASSERTE((floatRegOfsInBytes % FLOAT_REGISTER_SIZE) == 0); | ||
pLoc->m_idxFloatReg = (argOffset - TransitionBlock::GetOffsetOfFloatArgumentRegisters()) / FLOAT_REGISTER_SIZE; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A nit - can you please use the floatRegOfsInBytes constant here too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My bad, thanks for catching
…atforms. Before some platforms were using stackSlots, some curOfs in bytes.
Fix arm32. x86 fixes. use StackSize on ArgSizes Add `GetStackArgumentByteIndexFromOffset` and return back the old values for asserts. another fix
because it won't pass on arm64 apple.
It is not a complete fix for arm64 apple, but covers most cases.
db04d24
to
e6c5cc6
Compare
I have checked that the updated version still works on arm64 apple and fixes JIT/HardwareIntrinsics/General/Vector* tests. |
Contributes to #46456.
Move to byte sizes/offsets in VM so it works correctly with arguments that do not occupy TARGET_POINTER_SIZE stack slots on arm64.
It also unifies some platforms, for example, before
ArgIteratorBase
was using byte offsets for some and stack slot index for the others, now it uses byte offset for all.Use
TARGET_POINTER_SIZE
instead ofSTACK_ELEM_SIZE
because:TARGET_POINTER_SIZE
was already used in such context;STACK_ELEM_SIZE
and I did not come up with a better name;