[Arm64] Use stp and str (SIMD) for stack prolog zeroing #43789
Comments
Indeed, this would be a good optimization. You might want to consider treating this as a general inlined memset/memcpy expansion. In particular, look at how the compiler behaves when the number of bytes to copy is not a power of 2: we then just issue overlapping instructions, i.e. to fill 46 bytes we use one stp of two q registers plus one overlapping unaligned stur (see memsetUneven below).

#include <string.h>
void ZeroInt32x4(void* pDst, void* pSrc)
{
memset(pDst, 0, 16);
}
void ZeroInt32x8(void* pDst, void* pSrc)
{
memset(pDst, 0, 32);
}
void ZeroInt32x16(void* pDst, void* pSrc)
{
memset(pDst, 0, 64);
}
void ZeroInt32x5(void* pDst, void* pSrc)
{
memset(pDst, 0, 24);
}
void memsetCharUneven(void* pDst)
{
memset(pDst, 8, 30);
}
void memsetCharUnevenComplex(void* pDst)
{
memset(pDst, 'c', 30);
}
void memsetChar(void* pDst)
{
memset(pDst, 'c', 32);
}
void memsetCharUnevenUnknown(void* pDst, char c)
{
memset(pDst, c, 30);
}
void memsetCharUnknown(void* pDst, char c)
{
memset(pDst, c, 30);
}
void memsetUneven(void* pDst, void* pSrc)
{
memset(pDst, 0, 46);
}
void MemcpyInt32x16(void* pDst, void* pSrc)
{
memcpy(pDst, pSrc, 64);
}

and its assembly:

ZeroInt32x4(void*, void*): // @ZeroInt32x4(void*, void*)
stp xzr, xzr, [x0]
ret
ZeroInt32x8(void*, void*): // @ZeroInt32x8(void*, void*)
movi v0.2d, #0000000000000000
stp q0, q0, [x0]
ret
ZeroInt32x16(void*, void*): // @ZeroInt32x16(void*, void*)
movi v0.2d, #0000000000000000
stp q0, q0, [x0, #32]
stp q0, q0, [x0]
ret
ZeroInt32x5(void*, void*): // @ZeroInt32x5(void*, void*)
stp xzr, xzr, [x0]
str xzr, [x0, #16]
ret
memsetCharUneven(void*): // @memsetCharUneven(void*)
mov x8, #578721382704613384
stur x8, [x0, #22]
stp x8, x8, [x0, #8]
str x8, [x0]
ret
memsetCharUnevenComplex(void*): // @memsetCharUnevenComplex(void*)
mov x8, #25443
movk x8, #25443, lsl #16
movk x8, #25443, lsl #32
movk x8, #25443, lsl #48
stur x8, [x0, #22]
stp x8, x8, [x0, #8]
str x8, [x0]
ret
memsetChar(void*): // @memsetChar(void*)
movi v0.16b, #99
stp q0, q0, [x0]
ret
memsetCharUnevenUnknown(void*, char): // @memsetCharUnevenUnknown(void*, char)
and x8, x1, #0xff
mov x9, #72340172838076673
mul x8, x8, x9
stur x8, [x0, #22]
stp x8, x8, [x0, #8]
str x8, [x0]
ret
memsetCharUnknown(void*, char): // @memsetCharUnknown(void*, char)
and x8, x1, #0xff
mov x9, #72340172838076673
mul x8, x8, x9
stur x8, [x0, #22]
stp x8, x8, [x0, #8]
str x8, [x0]
ret
memsetUneven(void*, void*): // @memsetUneven(void*, void*)
movi v0.2d, #0000000000000000
stur q0, [x0, #30]
stp q0, q0, [x0]
ret
MemcpyInt32x16(void*, void*): // @MemcpyInt32x16(void*, void*)
ldp q1, q0, [x1, #32]
ldp q3, q2, [x1]
stp q1, q0, [x0, #32]
stp q3, q2, [x0]
ret
As an additional side note, see also the loop in runtime/src/coreclr/src/jit/codegencommon.cpp, line 6160 in 54906ea.
@TamarChristinaArm Thank you for all the pointers! Let me read through the documentation you mentioned and I will come back with the algorithm before implementing.
Thanks for pointing that out @TamarChristinaArm. As part of #43227, we are also working on aligning loop bodies to a 32B boundary.
@echesakovMSFT coming back to this, particularly for zeroing you have an additional option: on AArch64 you have the data cache zero instruction dc zva. Essentially, using this you can clear large blocks of memory one ZVA block at a time. The expectation is that most systems will be configured with a usable block size such as 64 bytes.
Thank you for following up, Tamar! I've heard about this instruction but, for some reason, I was under the (wrong) impression that it required EL1 or higher. Is there any alignment requirement for the address passed to dc zva?
There is, but it's a bit sneakily described. The operation works on an entire cache line. If the address is not aligned to the cache line it will silently align it (ignore the lower bits) and clear the entire cache line, which means it will clear data you didn't intend to. So there is an alignment constraint for what you probably want, but not for the actual use of the instruction (as in, you won't get an alignment fault). This requirement makes it impossible to use in static compilers (though we do use it in AoR's memset https://github.com/ARM-software/optimized-routines/blob/0f4ae0c5b561de25acb10130fd5e473ec038f89d/string/aarch64/memset.S#L79), but for JITs it may still be useful.
Small correction here: the ZVA works on a ZVA region, which can be set independently of the cache line size. In practice, on all current Arm-designed cores the region is the same size as the cache line, but they don't need to be. So you have to be aligned to the ZVA region, not the cache line. The code you wrote in #46609 is still correct, but I wanted to clarify the statement above :)
This was my understanding as well - that the instruction block size is reported by DCZID_EL0 and need not match the cache line size. As you probably noticed in #46609, I assumed that 64 bytes is the most common choice for the ZVA block size.
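For illustration only (not code from the thread), here is a minimal AArch64 sketch of querying the ZVA block size at run time; the label name and the convention of returning 0 when dc zva is prohibited are made up for this example:

query_zva_block_size:                   // hypothetical helper: block size in bytes, or 0 if dc zva is prohibited
    mrs     x0, dczid_el0               // DCZID_EL0: BS field in bits [3:0], DZP flag in bit 4
    tbnz    x0, #4, 1f                  // DZP set -> dc zva must not be used
    and     x1, x0, #0xf                // BS = log2(block size in 4-byte words)
    mov     x0, #4
    lsl     x0, x0, x1                  // block size in bytes = 4 << BS
    ret
1:
    mov     x0, #0                      // signal "don't use dc zva"
    ret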
Possibly, but as you noted in the PR, the problem with a larger ZVA is that as the ZVA region grows, your alignment and remainder overhead grows, which also pushes your profitability threshold for ZVA usage higher. And as that goes up, it means you regress your smaller memsets (vs a smaller ZVA region size). So this will always be a balancing act. I don't have the data to back it up, but my personal opinion is that in the average consumer/server workload you'll find smaller sets more often than larger ones (the only exceptions I know of are in the HPC market, but that's pretty specialized). So support for this wouldn't really be a priority in the near future, in my opinion.
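To make the idea concrete, here is a minimal sketch (again, not actual JIT output) of clearing a large block with dc zva, assuming a 64-byte ZVA block, that dc zva is permitted, that x0 has already been aligned up to the block size, and that x1 is a byte count that is a multiple of 64; the unaligned head and the tail would be zeroed separately with stp/str:

zero_with_dc_zva:                       // hypothetical helper, x0 = 64-byte-aligned dst, x1 = length
    add     x2, x0, x1                  // x2 = end of the region
1:
    dc      zva, x0                     // zero one 64-byte ZVA block
    add     x0, x0, #64                 // advance by the ZVA block size
    cmp     x0, x2
    b.lo    1b                          // loop until the whole region is cleared
    ret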
@echesakovMSFT btw, this issue ended up focusing only on zeroing, but the approach outlined in #43789 (comment) should help with general block initialization and copies as well. Is it worth splitting that out?
@TamarChristinaArm Agree, I will try implementing the approach after finishing the stack probing work on arm64.
Hi @TamarChristinaArm, I've given your idea more thought recently. In the general case the exact size isn't known at JIT time. However, I think we can optimize CodeGen::genCodeForInitBlkUnroll (runtime/src/coreclr/jit/codegenarmarch.cpp, line 1979 in 9102b07) and CodeGen::genCodeForCpBlkUnroll (runtime/src/coreclr/jit/codegenarmarch.cpp, line 2102 in 9102b07) to use stp (SIMD) instead of stp (GpReg).

cc @sandreenko
I see.. If you never know the exact size you can still do somewhat better than a scalar loop though. You can still emit a loop that uses STP of Q registers to set 64 bytes at a time for some cases.

Looking at https://docs.microsoft.com/en-us/dotnet/api/system.span-1.fill?view=netcore-2.2 — since Span<T>.Fill takes an arbitrary T, does it end up going through a memset-style helper? Hmmm, looking at https://github.com/dotnet/runtime/blob/79ae74f5ca5c8a6fe3a48935e85bd7374959c570/src/coreclr/vm/arm64/crthelpers.asm it looks like the memset helper there is written by hand. Since these are inline assembly, is there any reason you can't just use the AoR implementations? Those would be most optimal and save you a lot of work in this case :)
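A minimal sketch (not from the thread) of the "STP of Q registers, 64 bytes at a time" loop, assuming x0 = destination, x1 = byte count (a multiple of 64 for simplicity) and the fill value already broadcast into v0; remainder handling is omitted:

fill_64_per_iteration:                  // hypothetical helper
    add     x2, x0, x1                  // x2 = end of the destination
1:
    stp     q0, q0, [x0]                // store 32 bytes
    stp     q0, q0, [x0, #32]           // store 32 more bytes
    add     x0, x0, #64
    cmp     x0, x2
    b.lo    1b                          // 64 bytes per iteration
    ret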
@echesakovMSFT Also I'm wondering, unless I misread the doc, doesn't ...
@TamarChristinaArm I believe there is an opportunity to optimize the non-byte-sized element version of Span<T>.Fill using such an approach. Presumably, this would mean that on Arm64 we would broadcast the value into a SIMD register with dup and store it with q registers.

I don't know of any reason that could prevent us from using the AoR implementation. Note that for Linux, https://github.com/dotnet/runtime/blob/79ae74f5ca5c8a6fe3a48935e85bd7374959c570/src/coreclr/vm/arm64/crthelpers.S shows we call libc's memset.

In the non-byte-sized element case, we use unrolled loops. For the byte-sized element case, we call the helper in runtime/src/coreclr/vm/jitinterface.cpp, lines 7062 to 7071 in 79ae74f, which internally uses memset.
Thanks for the detailed response @echesakovMSFT!
Yes, but see below.
The lanes themselves are in array order, so the dup of the right size should handle it. Whether the constant itself needs any handling is a good question, but the rest of the runtime should have ensured the value in a single register is already correct.
Yes, although you need a sufficiently new glibc to take advantage of the optimized routines. That said, it's probably still a good idea to do this, as glibc can ifunc to optimized implementations for various different uarches. That said, the AoR guys had the idea that you can use the current memset in AoR for the non-byte case as well by creating a new entry point here https://github.com/ARM-software/optimized-routines/blob/master/string/aarch64/memset.S#L29, just below the dup. To call it you need to do a couple of things first; with those conditions met you can remove the unrolled loop in the CLR and get the optimized memset for everything from short to long sizes.
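As an editorial illustration of the "dup of the right size" point above (a sketch under assumed register conventions, x0 = destination and w1 = fill value, not code from the thread), filling 8 int32 elements could look like this; only the dup arrangement (16b, 8h, 4s or 2d) changes with the element size:

fill_int32x8_value:                     // hypothetical: fill 8 int32 elements with the value in w1
    dup     v0.4s, w1                   // broadcast the 32-bit value to all four lanes (lanes are in array order)
    stp     q0, q0, [x0]                // write 32 bytes in one instruction
    ret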
Thank you for the follow-up @TamarChristinaArm!
Currently, void CodeGen::genZeroInitFrame(int untrLclHi, int untrLclLo, regNumber initReg, bool* pInitRegZeroed) inlines a zeroing loop for frames larger than 10 machine words (80 bytes on Arm64). The loop uses a wzr or xzr register with stp or str instructions and can write at most 16 bytes of zeros at once.

Following the ideas in #32538, we could zero a qReg and use it instead of xzr with stp qReg, qReg, [mem], allowing us to write up to 32 bytes of zeros to memory in one instruction. We can also consider increasing the upper boundary (i.e. 10 machine words) to some larger number.
It seems that Clang/LLVM uses a similar approach for initializing stack-allocated structs: https://godbolt.org/z/8rKxvn. For example, zero-initializing a stack-allocated struct is compiled down to movi plus stp of q registers, along the lines of the sketch below.
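A minimal hedged sketch of the proposed prolog zeroing (the frame layout and offsets here are made up for illustration): zeroing 64 bytes of untracked locals below the frame pointer with a zeroed q register instead of xzr:

    movi    v0.2d, #0000000000000000    // zero a SIMD register once
    stp     q0, q0, [x29, #-64]         // 32 bytes of zeros per stp (vs 16 bytes with stp xzr, xzr)
    stp     q0, q0, [x29, #-32]         // covers the remaining 32 bytes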
@dotnet/jit-contrib @TamarChristinaArm