[no merge] Further tweak Buffer.MemoryCopy performance #6638
Conversation
#endif // BIT64

int alignment = IntPtr.Size - 1;
It may be better to just inline sizeof(nuint) - 1 in the expression below instead of creating a local variable that is initialized by a call (IntPtr.Size is a function call in IL) that the JIT has to inline and optimize out. Same for the other uses of IntPtr.Size in this function.
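A minimal sketch of that suggestion, assuming the nuint alias (UInt64/UInt32 under #if BIT64) that Buffer.cs defines at the top of the file, so that sizeof(nuint) folds to a compile-time constant; the surrounding usage is illustrative, not the file's exact code:

// Before: a local initialized from IntPtr.Size, which the JIT must inline and fold away.
int alignment = IntPtr.Size - 1;
if (((int)dest & alignment) != 0)
{
    // ... align dest ...
}

// After: the constant appears directly in the expression, keeping the IL trivially simple.
if (((nuint)dest & (nuint)(sizeof(nuint) - 1)) != 0)
{
    // ... align dest ...
}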
@jkotas I thought mscorlib was crossgened, so this would have no runtime overhead?
My comment was more about keeping the IL simple so that there is nothing that can derail the JIT from producing good code for it.
@jamesqo, what are you using to bench? I think the biggest issue with copying so few bytes (0-1024) is that we are hitting the lower limits of what any onboard timer can actually measure. I have written a very simple bench here (https://gist.github.com/tannergooding/bb256733943fc5afecce6a51a640910d) for the existing code. I did it this way to ensure that only the data we care about is tracked. It is also trivial to modify the sample to start from an offset of source/destination so unaligned reads/writes can be tested.
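For reference, a minimal sketch of this style of micro-benchmark (illustrative only, not the linked gist; the class name, sizes and iteration counts are made up). The idea is to time many iterations per size so that the copy, rather than the timer resolution, dominates the measurement:

using System;
using System.Diagnostics;

unsafe class CopyBench
{
    const int Iterations = 1_000_000;

    static void Main()
    {
        byte[] src = new byte[1024];
        byte[] dst = new byte[1024];

        fixed (byte* pSrc = src, pDst = dst)
        {
            for (int size = 0; size <= 1024; size += 64)
            {
                // Warm up so JIT and paging costs are not part of the timed region.
                Buffer.MemoryCopy(pSrc, pDst, dst.Length, size);

                var sw = Stopwatch.StartNew();
                for (int i = 0; i < Iterations; i++)
                {
                    Buffer.MemoryCopy(pSrc, pDst, dst.Length, size);
                }
                sw.Stop();

                Console.WriteLine($"{size,5} bytes: {sw.Elapsed.TotalMilliseconds:F2} ms");
            }
        }
    }
}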
if ((len & 8) != 0)

// We have <= 15 bytes to copy at this point.

if ((mask & 8) != 0)
Consider if (mask == 0) { return; }; otherwise we are doing four pointless comparisons.
Also consider something like the following (the math should be correct; I validated it does what I expect for some simple scenarios). The core trick is sketched in code after the case list below.

- First Write
  - nuint i = 8u - (dest % sizeof(nuint)) (adjust for misaligned destination)
  - len -= i; (adjust for bytes already copied, not including initial bytes that will be double copied so we stay aligned)
  - nuint r = len % 16 (number of bytes remaining after we loop)
  - nuint l = sizeof(nuint) - (r % sizeof(nuint)); (number of bytes remaining that need to be written misaligned)
- Loop
- Final
cases 13-15:
*(int*)(dest + i) = *(int*)(src + i)
*(int*)(dest + i + 4) = *(int*)(src + i + 4)
*(int*)(dest + i + 8) = *(int*)(src + i + 8)
*(int*)(dest + i + 12 - l) = *(int*)(src + i + 12 - l)
case 12:
*(int*)(dest + i) = *(int*)(src + i)
*(int*)(dest + i + 4) = *(int*)(src + i + 4)
*(int*)(dest + i + 8) = *(int*)(src + i + 8)
cases 9-11:
*(int*)(dest + i) = *(int*)(src + i)
*(int*)(dest + i + 4 - l) = *(int*)(src + i + 4 - l)
case 8:
*(int*)(dest + i) = *(int*)(src + i)
*(int*)(dest + i + 4) = *(int*)(src + i + 4)
cases 5-7:
*(int*)(dest + i) = *(int*)(src + i)
*(int*)(dest + i + 4 - l) = *(int*)(src + i + 4 - l)
case 4:
*(int*)(dest + i) = *(int*)(src + i)
cases 1-3:
*(int*)(dest + i - l) = *(int*)(src + i - l)
case 0:
return
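The core trick in the "Final" step above, reduced to its simplest form, is a trailing write that is allowed to overlap bytes that were already copied, which removes the per-bit branching on the remaining count. A rough sketch of just that trick (a hypothetical helper, assuming a 64-bit process and a total length of at least 8 bytes):

// Finish the copy with a single unaligned 8-byte write that ends exactly at the
// last byte; up to 7 bytes before it are copied a second time, which is harmless.
static unsafe void CopyLast8Bytes(byte* dest, byte* src, ulong len)
{
    *(ulong*)(dest + len - 8) = *(ulong*)(src + len - 8);
}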
Excellent idea, we can use a switch statement here to avoid the branches/unaligned write (and since code size doesn't matter, as all this is crossgened and there is no runtime overhead for larger methods). I'm going to experiment with this and see if I can tease out some even better perf.
@jamesqo Regarding the 16-byte writes: the JIT copies a decimal with a single xmm load/store, so writing through a decimal* moves 16 bytes at a time on x64.
@omariom Interesting... I didn't know that before. I'm going to try replacing the inner loop with a *(decimal*)(dest + i) = *(decimal*)(src + i); for x64 and see how that goes.
@tannergooding I was using tests from the previous PR to benchmark: https://gist.github.com/jamesqo/337852c8ce09205a8289ce1f1b9b5382. I didn't get a chance to take a full look at your benchmark yet, but I noticed that you are making a call inside your inner loop.
@jamesqo what if you try to unroll to 2 xmm copies?

struct Dummy32 { long l1, l2, l3, l4; }
@jamesqo great work.
@omariom yes, I thought I read something about that too from @mikedn, so any struct defined as:

struct LongLong
{
    long l0;
    long l1;
}

should be usable for copying 16 bytes at a time. Haven't tested it myself, though. I do not know all the background for this, but couldn't
It works if the JIT knows the length, otherwise it calls a helper.
Would that mean something like this would work?

struct Bytes16
{
long bytes0;
long bytes1;
}
struct Bytes32
{
long bytes0;
long bytes1;
long bytes2;
long bytes3;
}
struct Bytes64
{
long bytes0;
long bytes1;
long bytes2;
long bytes3;
long bytes4;
long bytes5;
long bytes6;
long bytes7;
}

nuint end = len - 64;
nuint counter;
while (i <= end)
{
counter = i + 64;
*(Bytes64*)(dest + i) = *(Bytes64*)(src + i);
i = counter;
}
if ((len & 32) != 0)
{
counter = i + 32;
*(Bytes32*)(dest + i) = *(Bytes32*)(src + i);
i = counter;
}
if ((len & 16) != 0)
{
counter = i + 16;
*(Bytes16*)(dest + i) = *(Bytes16*)(src + i);
i = counter;
}
@jamesqo: https://gist.github.com/tannergooding/08702b99b26447b9e30e2126bba2c966, a somewhat optimized version. x64 is showing a 15% improvement (this is a general improvement when tested against all sizes and alignments from 0-1024 bytes, where alignments have source or destination misaligned by 1-15 bytes). x86 is showing a 7% improvement under the same scenarios. It may be interesting to note that the native implementation doesn't really start showing any benefits until around 8192 bytes, where (at least on Windows) the native implementation will begin using the prefetch instruction. In either case, the principles behind the code for doing unaligned writes as a pre/post step remain the same (and there are probably more optimizations to be had).
@benaadams, yes, that works as expected (unrolling to 4 xmm copies).
@tannergooding awesome. Could I suggest a code change? Since we are only using decimal to move 16 bytes at a time, put using Reg16 = System.Decimal; at the beginning of the code file and use that alias in the copy code. @tannergooding would you mind putting up the assembly as well? On that note, I believe defining these types should work, though I'm not sure there is any benefit perf-wise.
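A sketch of what that aliasing might look like (illustrative only; Copy16ByteBlocks is a hypothetical helper, and the alias has to sit at the top of the file). Reg16 is used purely as a 16-byte-wide "register" type and never as a numeric value:

using Reg16 = System.Decimal;   // 16 bytes; the JIT copies it with one xmm load/store on x64

// Mirrors the inner loop above; i and end are byte offsets into the buffers.
static unsafe void Copy16ByteBlocks(byte* dest, byte* src, ulong i, ulong end)
{
    while (i <= end)
    {
        *(Reg16*)(dest + i) = *(Reg16*)(src + i);
        i += 16;
    }
}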
Loop unrolling without the unroll :) Maybe at some point it will automagically take advantage of AVX-512? (and maybe AVX2 along the way)
Yes I agree it seems to be worth doing just for improving readability and perhaps future improvements in the JIT. @jkotas didn't we talk about perhaps adding
Shouldn't
I agree, but wasn't sure whether this was due to whether or not "custom structs" like
I'm not so sure if we should do that; it might be good for large buffers (avoid an extra branch/write every 16 bytes), but for smaller ones (in the ~30-50 range) it could be detrimental. I may attempt it if we increase the threshold at which we call the native memmove.
See the comment here; it looks like the cpblk implementation isn't as efficient as a manual copy, though I'm not sure whether the situation has changed since that comment was written.
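For anyone wanting to compare against cpblk directly (not something this PR does), one way to get the JIT's block-copy codegen from C# is System.Runtime.CompilerServices.Unsafe, whose copy helpers compile down to cpblk; a sketch with a hypothetical wrapper name:

using System.Runtime.CompilerServices;

static unsafe void CopyViaCpblk(byte* dest, byte* src, uint byteCount)
{
    // CopyBlockUnaligned emits cpblk without assuming pointer alignment.
    Unsafe.CopyBlockUnaligned(dest, src, byteCount);
}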
I'll make an
Really? I was under the impression it would have to use 4 GPRs and do something like move each int in the struct into a register, and then move those to the destination (8 movs total). At least, I think something like that happened when I looked at the disassembly for
Alright, here is the code of the inner loop when using a 16-byte struct:

G_M6842_IG28:
4C8D5810 lea r11, [rax+16]
488D3402 lea rsi, [rdx+rax]
4803C1 add rax, rcx
F30F6F06 movdqu xmm0, qword ptr [rsi]
F30F7F00 movdqu qword ptr [rax], xmm0
498BC3 mov rax, r11
493BC1 cmp rax, r9
76E5 jbe SHORT G_M6842_IG28

It's not as efficient as it could be. E.g. I think it should be something like:

LOOP:
; r11 => counter, rax => i, rdx => src, rcx => dest, r9 => end
lea r11, [rax+16] ; calculate counter
movdqu xmm0, qword ptr [rdx+rax] ; should be movdqa, but as nietras mentioned we need Unsafe.Assume
movdqu qword ptr [rcx+rax], xmm0
mov rax, r11
cmp rax, r9
jbe LOOP

edit: Reverting the changes @benaadams suggested to avoid making the loop condition depend on the writes still results in one additional instruction:

G_M6842_IG28:
4C8D1C02 lea r11, [rdx+rax]
488D3401 lea rsi, [rcx+rax]
F3410F6F03 movdqu xmm0, qword ptr [r11]
F30F7F06 movdqu qword ptr [rsi], xmm0
4883C010 add rax, 16
493BC1 cmp rax, r9
76E6 jbe SHORT G_M6842_IG28

I think the JIT tries to eliminate these dependencies by itself. The version that was just merged seemed to work with this trick; I'm not sure why it's not working now. Regardless, even though an extra
@jamesqo And as it is not your last optimization, you could reuse that Excel sheet in the future :)
The problem is we never really got a great definitive answer for this and other questions related to memory copying on .NET. I am currently running https://github.com/DotNetCross/Memory.Copies (which contains a single BenchmarkDotNet project), measuring many different memory copy variants such as the ones discussed in https://github.com/dotnet/coreclr/issues/2430 and aspnet/KestrelHttpServer#511. This is based on a benchmark project that @benaadams did. I added the original Buffer.Memmove, the @jamesqo variant merged into master and the @tannergooding variant. For the former two I revert to the native memmove. This benchmark, however, only runs on "normal" .NET, that is, the three JITs available there. I have yet to succeed in building coreclr and getting tests run, just haven't had time ;) It should be possible to add .NET Core benchmarks too. Of course, the benchmark will take a long, long time to run (I'm working on an alternative to BenchmarkDotNet which will focus on doing the minimal possible to get good measurements, since BDN simply takes too long for quick brainstorming); it will probably finish sometime tomorrow morning CET. And it doesn't run under the exact same conditions, but it would be good to see what advantages/disadvantages the different variants have. An early run of this with fewer parameters can be found here (I did some minor changes after this, though): https://gist.github.com/nietras/400dfe8954450825c1033e36ae35a6a4
I have updated my gist with the disassembly for both the 32-bit and 64-bit versions (this is disassembly generated using Desktop 4.6.2, not using CoreCLR, but they should produce very similar results on Windows). I can also add a version that shows the source code inline if desirable. @jamesqo, I handled the case of unrolling by doing 128-byte blocks and special casing anything under 128 bytes (the same code used for the special case is also used to handle any leftover bytes at the end of the large block copy loop). Edit: Updated the gist to have assembly and assembly w/ inline source for both. Additionally, updated to use a
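A rough sketch of the 128-byte block strategy described above (illustrative, not the gist's exact code; Block16 and CopyLargeBlocks are hypothetical names, with Block16 playing the role of the Bytes16 struct earlier in this thread):

struct Block16 { long a, b; }   // 16 bytes; one xmm load/store per assignment on x64

// Copies as many whole 128-byte blocks as possible; the same small-copy path that
// handles < 128 bytes (not shown) then finishes the remaining len - i bytes.
static unsafe void CopyLargeBlocks(byte* dest, byte* src, ref ulong i, ulong len)
{
    while (i + 128 <= len)
    {
        *(Block16*)(dest + i)       = *(Block16*)(src + i);
        *(Block16*)(dest + i + 16)  = *(Block16*)(src + i + 16);
        *(Block16*)(dest + i + 32)  = *(Block16*)(src + i + 32);
        *(Block16*)(dest + i + 48)  = *(Block16*)(src + i + 48);
        *(Block16*)(dest + i + 64)  = *(Block16*)(src + i + 64);
        *(Block16*)(dest + i + 80)  = *(Block16*)(src + i + 80);
        *(Block16*)(dest + i + 96)  = *(Block16*)(src + i + 96);
        *(Block16*)(dest + i + 112) = *(Block16*)(src + i + 112);
        i += 128;
    }
}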
Pessimistic I am: it actually finished just now; it took 10558059 ms, aka ~2.9 hours, to run on my desktop PC. See https://gist.github.com/nietras/22b8efd26af715ac32ccaf1a57a465da. I've found it is easiest to view by downloading the html, opening it in a browser and zooming out as you please. A graph would be welcome though ;) Here are a few random samples for RyuJIT 64-bit:
@nietras, it might be worth noting (for my sample) that it didn't properly handle overlapping buffers (throwing an exception in the original, which I believe you grabbed, and falling back to Buffer.MemoryCopy now). This isn't a problem if the bench didn't cover that, but it is if it did. I think another important distinction is that my code did not fall back to the native implementation for any size. But, as mentioned above, it doesn't seem like this is really important until the underlying implementation begins using the prefetch instruction (around 8192 bytes).
That's why I usually run such benchmarks in LinqPad. BDN is not (yet) convenient for that.
@tannergooding Your version looks interesting. I'm going to copy your approach partially and replace the double-word writes with XMM in the switch cases as well. Also, it looks like the JIT is generating some redundant code for the UInt128 writes:

*(UInt128*)(dst + sizeof_UInt128) = *(UInt128*)(src + sizeof_UInt128);
00007FF953CE4A85 lea rax,[rsi+10h]
00007FF953CE4A89 lea r8,[rcx+10h]
00007FF953CE4A8D movdqu xmm0,xmmword ptr [rax]
00007FF953CE4A91 movdqu xmmword ptr [r8],xmm0
*(UInt128*)(dst + (sizeof_UInt128 * 2)) = *(UInt128*)(src + (sizeof_UInt128 * 2));
00007FF953CE4A96 lea rax,[rsi+20h]
00007FF953CE4A9A lea r8,[rcx+20h]
00007FF953CE4A9E movdqu xmm0,xmmword ptr [rax]
00007FF953CE4AA2 movdqu xmmword ptr [r8],xmm0

This looks like it should just be

movdqu xmm0,xmmword ptr [rsi+10h]
movdqu xmmword ptr [rcx+10h],xmm0
movdqu xmm0,xmmword ptr [rsi+20h]
movdqu xmmword ptr [rcx+20h],xmm0

without the extra lea instructions. Also, I don't know how you're getting

G_M6842_IG28:
8B75F0 mov esi, dword ptr [ebp-10H]
0375EC add esi, dword ptr [ebp-14H]
8B7DEC mov edi, dword ptr [ebp-14H]
03FB add edi, ebx
A5 movsd
A5 movsd
A5 movsd
A5 movsd
8345EC10 add dword ptr [ebp-14H], 16
3955EC cmp dword ptr [ebp-14H], edx
76E8 jbe SHORT G_M6842_IG28

(edit: was responding to your comment 3 comments up)
I did a simple sum over the Scaled column for each method for 64-bit RyuJIT to get some idea about each method's applicability:

@tannergooding yes, I took your version verbatim, so it does not revert to the native implementation for any size.

My Excel skills are really being tested today (had to google http://superuser.com/questions/750353/excel-scatter-plot-with-multiple-series-from-1-table), but here is a plot for small byte copies, where I again plot the Scaled result:

The benchmarks do have some issues with some methods doing input parameter checks and others not. If you guys have suggestions for changes to the benchmarks, like making it all unsafe or omitting the checks or something, I can make the changes. And, of course, include new variants.
@Anderman shouldn't we assume that data is cached? If it's uncached, performance will be completely dominated by cache line loads and what we do does not matter much. Optimizations should matter almost exclusively for the cached case. If memory can be delivered at 10 GB/s in a streaming fashion, that's about 3 bytes per cycle on a ~3 GHz core. We therefore have ~20 cycles to spend on copying one 64-byte cache line. That's easy to do in 20 cycles; we actually get even more than 20 instructions, maybe 2-3x.
@GSPP not really. Given that memory loads into the lower levels of the memory hierarchy dominate the whole runtime, the real global minimum happens when you can retire as many instructions as possible even considering the latency caused by memory access. The real trick is in overlapping the memory accesses on the cache misses with the operations on L1 hits (without consuming all execution ports), or paying for the prefetch of multiple cache misses in batches (you can copy kilobytes at a time essentially for free if you don't pollute the cache with the written data, using non-temporal writes).
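As an aside, the non-temporal-write idea can be expressed with the hardware intrinsics that later shipped in System.Runtime.Intrinsics.X86 (not available at the time of this thread); a sketch under the stated assumptions, with a hypothetical helper name:

using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Streams 16-byte blocks to dest without polluting the cache with the written data.
// Assumes Sse2.IsSupported, dest is 16-byte aligned, and len is a multiple of 16.
static unsafe void CopyNonTemporal(byte* dest, byte* src, ulong len)
{
    for (ulong i = 0; i < len; i += 16)
    {
        Vector128<byte> v = Sse2.LoadVector128(src + i);
        Sse2.StoreAlignedNonTemporal(dest + i, v);
    }
    Sse.StoreFence();   // make the streaming stores globally visible before returning
}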
@GSPP @redknightlois
This will slow down your performance by a factor of 10-20 for small sizes, but not so much for big arrays. Prefetching seems to work well. It is interesting that the best copy method depends on cached vs. not-cached data.
@GSPP I think you are right. There will be programs where all data is already in the L1 cache before the copy method is called, so the current tests are good. Maybe we can extend the tests with not-cached data to see if that leads to other solutions.
@redknightlois I did some tests with a DIY performance test program. The results for the cached version are the same, but the tests run faster. I also did a test with random array access. Will post the results soon.
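A sketch of how that kind of not-cached / random-access test can be set up (illustrative names and sizes, not the actual test program): copy from random offsets within a pool much larger than the last-level cache, so most iterations miss in cache rather than re-hitting the same lines.

using System;
using System.Diagnostics;

unsafe class UncachedCopyBench
{
    const int PoolSize = 256 * 1024 * 1024;   // much larger than a typical LLC
    const int CopySize = 4096;

    static void Main()
    {
        byte[] pool = new byte[PoolSize];
        byte[] dst = new byte[CopySize];
        var rng = new Random(42);

        fixed (byte* pPool = pool, pDst = dst)
        {
            var sw = Stopwatch.StartNew();
            for (int i = 0; i < 100_000; i++)
            {
                // A random source offset per iteration approximates a cold-cache workload
                // (the Random call itself adds a small, roughly constant overhead).
                int offset = rng.Next(0, PoolSize - CopySize);
                Buffer.MemoryCopy(pPool + offset, pDst, CopySize, CopySize);
            }
            sw.Stop();

            Console.WriteLine($"{sw.Elapsed.TotalMilliseconds:F1} ms for 100,000 random {CopySize}-byte copies");
        }
    }
}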
@Anderman do you have the full code for this anywhere? I'd be happy to put it in the repo if you like.
Yes, I have to clean it up a little bit. The results are interesting. It looks like buffer.array and memmove use an extra data access, and when the data is not cached this causes an extra delay.
@Anderman is that ns per operation?
@redknightlois that is ns per operation. Array.Copy will copy 32 bytes in 100 ns (random access).
@Anderman as far as I understand, your benchmarks are based on your own basic measurements on .NET Core, with no pre-jit or warmup iterations, so there probably is a bias (although probably a small one) in these measurements. Anyway, that is not my main question; I don't understand how your measurements can show that
I included an Excel sheet with the four different tests, from 0-8192. Test results from an i7-3610QM. (Random, cached, sequence, align test)
@nietras I have run the test with the original benchmark, and there msvcrt.dll was slower for 1-15. Could it be another version of msvcrt.dll? How can I check that? Procmon, or are there easier ways?
@Anderman I think it is because your version of

if (src == null || dst == null) throw new ArgumentNullException(nameof(src));
if (count < 0 || srcOffset < 0 || dstOffset < 0) throw new ArgumentOutOfRangeException(nameof(count));
if (srcOffset + count > src.Length) throw new ArgumentException(nameof(src));
if (dstOffset + count > dst.Length) throw new ArgumentException(nameof(dst));
@nietras Yep, that's why. Do you know how you can use
Test on an i5-4590
@nietras I sent you a pull request with a new test project.

My testing summary on 2 different test machines:
Might have to retest with #7198
Closing due to length issues. I may open a new PR if I get a chance to work on this in the future.
This is a follow-up to #6627, since that was accidentally merged. I've experimented with @tannergooding's idea of doing word writes at the beginning/end of the buffer before actually aligning dest, which shaves off a couple of branches and seems to improve performance. (A rough sketch of this idea is included at the end of this description.)

This is the performance data I have for now (the gist will be updated continually): https://gist.github.com/jamesqo/18b61a17a65489b5dd8eaf0617b9099d. You need to press Ctrl+F and search for ---BASELINE--- to see the times for the existing version (the one that just got merged). During the other PR I was unable to get consistent timings between benchmark runs (and BenchmarkDotNet takes waaay too long), so what I've done instead is run the tests multiple times and see which numbers keep popping up. Here are the 'average' times for each configuration:

Note that this should not be merged yet, as I haven't tested for i386 and have not collected data for when dest is unaligned. Also, I'm thinking of possibly raising the threshold at which we do a QCall into the native memmove, since copying 511 bytes seems to be significantly faster than 512 (see my comment in the other PR).

cc @jkotas @tannergooding @GSPP @benaadams
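A rough sketch of the word-writes-before-aligning idea (illustrative, 64-bit, assumes len >= 8; CopyHead is a hypothetical helper, not this PR's code): do one unaligned 8-byte write covering the start of the buffer, then bump dest up to the next 8-byte boundary and continue with aligned writes; the few bytes between the boundary and the end of that first write simply get copied twice.

static unsafe void CopyHead(ref byte* dest, ref byte* src, ref ulong len)
{
    // Unaligned write covering the first 8 bytes.
    *(ulong*)dest = *(ulong*)src;

    // Advance to the next 8-byte boundary of dest (1..8 bytes consumed); the first
    // aligned write that follows may overlap bytes already written above, which is harmless.
    ulong adjust = 8 - ((ulong)dest & 7);
    dest += adjust;
    src  += adjust;
    len  -= adjust;
}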