Unroll Buffer.Memmove for constant lengths #83638
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch, @kunalspathak
Issue Details
If pointers don't overlap and the size is known at JIT time, try to unroll the copy. Various shapes are supported, e.g.:
void WriteFalse(Span<char> span)
{
"False".CopyTo(span);
}
void WriteHelloWorldUtf8(Span<byte> span)
{
"Hello World"u8.CopyTo(span);
}
void CopyFirst128Bytes(ReadOnlySpan<byte> src, Span<byte> dest)
{
src.Slice(0, 128).CopyTo(dest);
}
E.g. codegen for
Benchmarks
static readonly char[] charBuffer = new char[1024];
static readonly byte[] byteBuffer1 = new byte[1024];
static readonly byte[] byteBuffer2 = new byte[1024];
[Benchmark]
public void CopyConstSlice1() => byteBuffer1.AsSpan(0, 1).CopyTo(byteBuffer2);
[Benchmark]
public void CopyConstSlice8() => byteBuffer1.AsSpan(0, 8).CopyTo(byteBuffer2);
[Benchmark]
public void CopyConstSlice16() => byteBuffer1.AsSpan(0, 16).CopyTo(byteBuffer2);
[Benchmark]
public void CopyConstSlice64() => byteBuffer1.AsSpan(0, 64).CopyTo(byteBuffer2);
[Benchmark]
public void CopyConstSlice128() => byteBuffer1.AsSpan(0, 128).CopyTo(byteBuffer2);
Motivation
To simplify hacks like these: runtime/src/libraries/System.Private.CoreLib/src/System/Boolean.cs Lines 118 to 121 in 87b73f0
cc @jkotas @stephentoub thoughts? From my understanding the "are overlapped" check is GC-safe and can be further improved (but not with
|
Can we remove things like this? Lines 139 to 197 in 4a1dd91
|
Oh, good example, let me check. |
It would be nice to turn this into just this:
|
Right, but, presumably, it's not a simple change since we also need to modify LSRA to tell it we're going to need up to 4 spare SIMD registers (+ scalars). Also, we need some separate phase after VN (but before lower, since VNs will be invalidated) where we'll be doing these transformations (or inside EarlyProp with some modifications). I plan to prototype such a phase eventually because it's the only way I can unroll |
Could we for now assume that using up to 4 regs is fine and change it to only unroll if there are free ones later? |
Ok, I'll try to do the whole thing in JIT. It probably makes sense to still merge improvements for |
Does it make sense to split them off into a separate PR? |
Implemented in JIT, e.g.:
void Test(ReadOnlySpan<int> dst, Span<int> src) =>
dst.Slice(0, 30).CopyTo(src);
it now emits (linux-x64):
; Method P:Test
G_M7025_IG01:
push rbp
vzeroupper
mov rbp, rsp
G_M7025_IG02:
cmp edx, 30
jb SHORT G_M7025_IG04
cmp r8d, 30
jb SHORT G_M7025_IG05
vmovdqu ymm0, ymmword ptr[rsi]
vmovdqu ymm1, ymmword ptr[rsi+20H]
vmovdqu ymm2, ymmword ptr[rsi+40H]
vmovdqu ymm3, ymmword ptr[rsi+58H]
vmovdqu ymmword ptr[rcx], ymm0
vmovdqu ymmword ptr[rcx+20H], ymm1
vmovdqu ymmword ptr[rcx+40H], ymm2
vmovdqu ymmword ptr[rcx+58H], ymm3
G_M7025_IG03:
pop rbp
ret
G_M7025_IG04:
mov rax, 0xD1FFAB1E ; code for System.ThrowHelper:ThrowArgumentOutOfRangeException()
call [rax]System.ThrowHelper:ThrowArgumentOutOfRangeException()
int3
G_M7025_IG05:
mov rax, 0xD1FFAB1E ; code for System.ThrowHelper:ThrowArgumentException_DestinationTooShort()
call [rax]System.ThrowHelper:ThrowArgumentException_DestinationTooShort()
int3
; Total bytes of code: 84
30*sizeof(int) = 120 bytes = 4 ymm moves (the last one, at offset 58H, overlaps the previous one by 8 bytes)
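The offsets in the listing above (0, 20H, 40H, 58H) follow a simple rule: use full 32-byte chunks while they fit, then slide one final chunk back so that it ends exactly at the copy length. The following is a minimal sketch of that rule, not the JIT's actual code; `ChunkOffsets` is a hypothetical helper name introduced only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not taken from the JIT sources): returns the byte
// offsets of the fixed-width moves needed to cover `size` bytes using
// `width`-byte chunks. Requires size >= width so that the final chunk can be
// shifted back without reading before the start of the buffer.
std::vector<std::size_t> ChunkOffsets(std::size_t size, std::size_t width)
{
    assert(width > 0 && size >= width);

    std::vector<std::size_t> offsets;
    std::size_t offset = 0;

    // Full chunks that fit without running past the end.
    for (; offset + width <= size; offset += width)
    {
        offsets.push_back(offset);
    }

    // Remainder: one extra chunk, shifted back so it ends exactly at `size`.
    // It overlaps the previous chunk, which is harmless for a plain copy.
    if (offset < size)
    {
        offsets.push_back(size - width);
    }
    return offsets;
}
```

For `size = 120` and `width = 32` this yields 0, 32, 64 and 88, i.e. the 0, 20H, 40H and 58H displacements seen in the disassembly.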
Yeah, will push the improvements for IsKnownConstant separately once/if this lands |
Or in case of Boolean.cs hack: bool TryWriteFalseUtf8(Span<byte> span)
{
ReadOnlySpan<byte> str = "False"u8; // utf8 literal
return str.TryCopyTo(span);
}
void WriteFalse(Span<char> span)
{
string str = "False"; // utf16
str.AsSpan().CopyTo(span);
}
; Method P:TryWriteFalseUtf8
G_M40086_IG01:
push rax
G_M40086_IG02:
mov rax, 0xD1FFAB1E ;; RVA data for utf8 literal
xor edi, edi
cmp edx, 5
jb SHORT G_M40086_IG04
G_M40086_IG03:
mov edx, dword ptr [rax] ;; TODO: could be folded to a constant (RVA[cns])
mov edi, dword ptr [rax+01H] ;; TODO: could be folded to a constant (RVA[cns])
mov dword ptr [rsi], edx
mov dword ptr [rsi+01H], edi
mov edi, 1
G_M40086_IG04:
mov eax, edi
G_M40086_IG05:
add rsp, 8
ret
; Total bytes of code: 40
; Method P:WriteFalse
G_M10586_IG01:
push rbp
mov rbp, rsp
G_M10586_IG02:
mov rax, 0xD1FFAB1E ;; "False" frozen string literal
add rax, 12
cmp edx, 5
jb SHORT G_M10586_IG04
mov rdx, qword ptr [rax] ;; TODO: could be folded to a constant (string[cns])
mov rdi, qword ptr [rax+02H] ;; TODO: could be folded to a constant (string[cns])
mov qword ptr [rsi], rdx
mov qword ptr [rsi+02H], rdi
G_M10586_IG03:
pop rbp
ret
G_M10586_IG04:
mov rax, 0xD1FFAB1E
call [rax]System.ThrowHelper:ThrowArgumentException_DestinationTooShort()
int3
; Total bytes of code: 52 (same for |
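Both listings above use the same trick for lengths that are not a power of two: two overlapping fixed-width moves instead of a byte loop. For the 5-byte UTF-8 literal these are two 4-byte moves at offsets 0 and 1; for the 10-byte UTF-16 string they are two 8-byte moves at offsets 0 and 2. Below is a small sketch of the 5-byte case, assuming a hypothetical `Copy5` helper; it only mirrors the shape of the generated code and is not how the JIT expresses it.

```cpp
#include <cstdint>
#include <cstring>

// Illustrative only: copies exactly 5 bytes with two overlapping 32-bit
// moves, mirroring the shape of the TryWriteFalseUtf8 codegen.
inline void Copy5(void* dst, const void* src)
{
    const unsigned char* s = static_cast<const unsigned char*>(src);
    unsigned char* d = static_cast<unsigned char*>(dst);

    std::uint32_t first;
    std::uint32_t last;

    // Both loads are done before either store.
    std::memcpy(&first, s, 4);     // bytes [0, 4)
    std::memcpy(&last, s + 1, 4);  // bytes [1, 5)

    std::memcpy(d, &first, 4);     // writes bytes [0, 4)
    std::memcpy(d + 1, &last, 4);  // writes bytes [1, 5), rewriting 1..3 with the same values
}
```

Bytes 1 to 3 are written twice with identical values, so the result equals a plain 5-byte copy and nothing past the fifth byte is touched.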
Should be ready for review once CI passes; I checked GC info and tests. Will run jitstressregs and other outerloops including gcstress. New benchmarks:
public static IEnumerable<object[]> TestArgs()
{
yield return new object[] { new byte[128], new byte[128] };
}
[Benchmark]
[ArgumentsSource(nameof(TestArgs))]
public void CopyConstSlice4(byte[] dst, byte[] src) => src.AsSpan(0, 4).CopyTo(dst);
[Benchmark]
[ArgumentsSource(nameof(TestArgs))]
public void CopyConstSlice10(byte[] dst, byte[] src) => src.AsSpan(0, 10).CopyTo(dst);
[Benchmark]
[ArgumentsSource(nameof(TestArgs))]
public void CopyConstSlice26(byte[] dst, byte[] src) => src.AsSpan(0, 26).CopyTo(dst);
[Benchmark]
[ArgumentsSource(nameof(TestArgs))]
public void CopyConstSlice64(byte[] dst, byte[] src) => src.AsSpan(0, 64).CopyTo(dst);
[Benchmark]
[ArgumentsSource(nameof(TestArgs))]
public void CopyConstSlice120(byte[] dst, byte[] src) => src.AsSpan(0, 120).CopyTo(dst);
// Overlapping
IntPtr nativeAlloc;
[GlobalSetup]
public void GlobalSetup() => nativeAlloc = (IntPtr)NativeMemory.AlignedAlloc(1024, 32); // aligned to reduce noise
[GlobalCleanup]
public void GlobalCleanup() => NativeMemory.AlignedFree((void*)nativeAlloc);
[Benchmark]
public void CopyConstSlice8_overlap() =>
new Span<byte>((void*)nativeAlloc, 8).CopyTo(new Span<byte>((void*)IntPtr.Add(nativeAlloc, 4), 8));
[Benchmark]
public void CopyConstSlice32_overlap() =>
new Span<byte>((void*)nativeAlloc, 32).CopyTo(new Span<byte>((void*)IntPtr.Add(nativeAlloc, 4), 32));
[Benchmark]
public void CopyConstSlice120_overlap() =>
new Span<byte>((void*)nativeAlloc, 120).CopyTo(new Span<byte>((void*)IntPtr.Add(nativeAlloc, 4), 120));
Codegen diff for them: https://www.diffchecker.com/YLQJhBTE/
Note that I was benchmarking it on a very fast CPU, Ryzen 7950X (Windows 11 x64). There are a few nit TODO-CQ in the code for future improvements (e.g. don't use two |
TestMemmove((dst, src) => src.AsSpan(0, 65).CopyTo(dst), (dst, src) => src.AsSpan(0, ToVar(65)).CopyTo(dst));
TestMemmove((dst, src) => src.AsSpan(0, 127).CopyTo(dst), (dst, src) => src.AsSpan(0, ToVar(127)).CopyTo(dst));
Suggested change:
TestMemmove((dst, src) => src.AsSpan(0, 65).CopyTo(dst), (dst, src) => src.AsSpan(0, ToVar(65)).CopyTo(dst));
TestMemmove((dst, src) => src.AsSpan(0, 95).CopyTo(dst), (dst, src) => src.AsSpan(0, ToVar(95)).CopyTo(dst));
TestMemmove((dst, src) => src.AsSpan(0, 96).CopyTo(dst), (dst, src) => src.AsSpan(0, ToVar(96)).CopyTo(dst));
TestMemmove((dst, src) => src.AsSpan(0, 97).CopyTo(dst), (dst, src) => src.AsSpan(0, ToVar(97)).CopyTo(dst));
TestMemmove((dst, src) => src.AsSpan(0, 127).CopyTo(dst), (dst, src) => src.AsSpan(0, ToVar(127)).CopyTo(dst));
Will apply in the arm64 follow-up PR if no further feedback is provided, to avoid spinning CI again.
TestMemmove((dst, src) => src.AsSpan(0, 33).CopyTo(dst), (dst, src) => src.AsSpan(0, ToVar(33)).CopyTo(dst));
TestMemmove((dst, src) => src.AsSpan(0, 63).CopyTo(dst), (dst, src) => src.AsSpan(0, ToVar(63)).CopyTo(dst));
Suggested change:
TestMemmove((dst, src) => src.AsSpan(0, 33).CopyTo(dst), (dst, src) => src.AsSpan(0, ToVar(33)).CopyTo(dst));
TestMemmove((dst, src) => src.AsSpan(0, 47).CopyTo(dst), (dst, src) => src.AsSpan(0, ToVar(47)).CopyTo(dst));
TestMemmove((dst, src) => src.AsSpan(0, 48).CopyTo(dst), (dst, src) => src.AsSpan(0, ToVar(48)).CopyTo(dst));
TestMemmove((dst, src) => src.AsSpan(0, 49).CopyTo(dst), (dst, src) => src.AsSpan(0, ToVar(49)).CopyTo(dst));
TestMemmove((dst, src) => src.AsSpan(0, 63).CopyTo(dst), (dst, src) => src.AsSpan(0, ToVar(63)).CopyTo(dst));
TestMemmove((dst, src) => src.AsSpan(0, 161).CopyTo(dst), (dst, src) => src.AsSpan(0, ToVar(161)).CopyTo(dst));
TestMemmove((dst, src) => src.AsSpan(0, 255).CopyTo(dst), (dst, src) => src.AsSpan(0, ToVar(255)).CopyTo(dst));
Suggested change:
TestMemmove((dst, src) => src.AsSpan(0, 161).CopyTo(dst), (dst, src) => src.AsSpan(0, ToVar(161)).CopyTo(dst));
TestMemmove((dst, src) => src.AsSpan(0, 191).CopyTo(dst), (dst, src) => src.AsSpan(0, ToVar(191)).CopyTo(dst));
TestMemmove((dst, src) => src.AsSpan(0, 192).CopyTo(dst), (dst, src) => src.AsSpan(0, ToVar(192)).CopyTo(dst));
TestMemmove((dst, src) => src.AsSpan(0, 193).CopyTo(dst), (dst, src) => src.AsSpan(0, ToVar(193)).CopyTo(dst));
TestMemmove((dst, src) => src.AsSpan(0, 255).CopyTo(dst), (dst, src) => src.AsSpan(0, ToVar(255)).CopyTo(dst));
Thanks, will apply these in the follow up
Failures seem to be #83655 (Mono-wasm) |
@@ -1911,6 +1981,7 @@ void Lowering::LowerCall(GenTree* node)
JITDUMP("lowering call (after):\n");
DISPTREERANGE(BlockRange(), call);
JITDUMP("\n");
return nullptr;
Couldn't this return call->gtNext and then we could avoid all the special-case checking for nullptr?
I can't return call->gtNext here because gtNext can be nullptr itself, so then it won't be clear whether this method made any changes or not. In fact, I did it initially and hit an assert somewhere.
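The ambiguity described in this reply can be shown with a toy example. The sketch below uses a hypothetical `Node` type and two made-up functions; it is not the JIT's `GenTree` or `LowerCall`, and the `std::optional` variant is only one possible way to separate the two outcomes, not something the PR does.

```cpp
#include <optional>

// Hypothetical stand-in for a linked IR node; not the JIT's GenTree.
struct Node
{
    Node* next = nullptr;
};

// Ambiguous contract: "return the node to continue from, or nullptr if
// nothing was changed". When the rewritten node is the last one in the list,
// its `next` is also nullptr, so the caller cannot tell the two cases apart.
Node* LowerAmbiguous(Node* call, bool rewrote)
{
    return rewrote ? call->next : nullptr;
}

// One way to keep both outcomes distinguishable: an empty optional means
// "no change", an engaged optional carries the (possibly null) next node.
std::optional<Node*> LowerExplicit(Node* call, bool rewrote)
{
    if (!rewrote)
    {
        return std::nullopt;
    }
    return call->next;
}
```

With a node whose `next` is null, `LowerAmbiguous` returns the same value whether or not a rewrite happened, which is the situation the reply describes.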
@BruceForstall thanks for the feedback, addressed in #83740 |
Unroll Buffer.Memmove for constant lengths, various patterns improve, e.g.:
Codegen example:
Was:
Now:
The whole src is saved into temp SIMD regs so we can ignore the fact that src and dst might overlap. Currently, we support unrolling from 1 to 128 bytes (or 256 bytes with AVX-512 once it's enabled).
Benchmarks:
Codegen diff for them: https://www.diffchecker.com/YLQJhBTE/
Note that I was benchmarking it on a very fast CPU, Ryzen 7950X (Windows 11 x64).
There are a few nit TODO-CQ in the code for future improvements (e.g. don't use two ymm to unroll 33 bytes and do ymm + GPR instead), and it's x64-only for now to simplify code review; I'll do arm64 right after this lands. 32-bit can also be implemented, but it adds extra complexity: on x86 we can only do byte-wide loads on specific registers, and we need up to 4 GPR regs to handle 15 bytes, etc.
Motivation
To simplify hacks like these:
runtime/src/libraries/System.Private.CoreLib/src/System/Boolean.cs
Lines 118 to 121 in 87b73f0
cc @jkotas @stephentoub thoughts?
From my understanding the "are overlapped" check is GC-safe and can be further improved (but not with Math.Abs). Also, I had to slightly improve IsKnownConstant because when spans are involved we can only properly detect constants in late phases (after SSA/VN).
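The overlap argument in the description (the whole src is read into temporaries before anything is written to dst) can be checked outside the JIT. The sketch below is an illustration under assumptions: `Chunk32` stands in for a ymm register, `UnrolledMove` is a hypothetical name, and only sizes from 32 to 128 bytes are handled.

```cpp
#include <cstddef>
#include <cstring>

// Stand-in for one 32-byte ymm register.
struct Chunk32
{
    unsigned char bytes[32];
};

// Sketch of a constant-size move with memmove semantics: every chunk of src
// is loaded into a temporary before any store to dst happens, so overlapping
// ranges in either direction are handled without an explicit overlap check.
template <std::size_t Size>
void UnrolledMove(unsigned char* dst, const unsigned char* src)
{
    static_assert(Size >= 32 && Size <= 128, "sketch covers 32..128 bytes only");
    constexpr std::size_t count = (Size + 31) / 32;

    Chunk32 temps[count];

    // Phase 1: loads only. The last chunk is shifted back to end at Size.
    for (std::size_t i = 0; i < count; i++)
    {
        const std::size_t offset = ((i + 1) * 32 <= Size) ? i * 32 : Size - 32;
        std::memcpy(temps[i].bytes, src + offset, 32);
    }

    // Phase 2: stores only, using the same offsets.
    for (std::size_t i = 0; i < count; i++)
    {
        const std::size_t offset = ((i + 1) * 32 <= Size) ? i * 32 : Size - 32;
        std::memcpy(dst + offset, temps[i].bytes, 32);
    }
}
```

The temporaries are the reason the size is capped: with four 32-byte registers at most 128 bytes can be held at once, matching the 1 to 128 byte range quoted in the description.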