
[no merge] Further tweak Buffer.MemoryCopy performance #6638

Closed · wants to merge 19 commits

Conversation

@jamesqo commented Aug 6, 2016

This is a follow-up to #6627, since that was accidentally merged. I've experimented with @tannergooding's idea of doing word writes at the beginning/end of the buffer before actually aligning dest, which shaves off a couple of branches and seems to improve performance.

This is the performance data I have for now (the Gist will be updated continually): https://gist.github.com/jamesqo/18b61a17a65489b5dd8eaf0617b9099d You need to press Ctrl+F and search for ---BASELINE--- to see the times for the existing version (the one that just got merged).

During the other PR I was unable to get consistent timings between benchmark runs (and BenchmarkDotNet takes waaay too long), so what I've done instead is run the tests multiple times and see which numbers keep popping up. Here are the 'average' times for each configuration:

| # of Bytes | 30 | 40 | 50 |
| --- | --- | --- | --- |
| Baseline | ~2.0 | ~2.7 | ~3.3 |
| Experimental | ~1.7/8 | ~2.4 | ~2.0 |

Note that this should not be merged yet, as I haven't tested for i386 and have not collected data for when dest is unaligned. Also I'm thinking of possibly raising the threshold for when we do a QCall into the native memmove, since copying 511 bytes seems to be significantly faster than 512 (see my comment in the other PR).

cc @jkotas @tannergooding @GSPP @benaadams

#endif // BIT64


int alignment = IntPtr.Size - 1;

Member:

It may be better to just inline sizeof(nuint) - 1 in the expression below instead of creating a local variable that is initialized by a call (IntPtr.Size is a function call in IL) that the JIT has to inline and optimize out. Same for the other uses of IntPtr.Size in this function.
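A minimal sketch of what that suggestion could look like, assuming the per-bitness nuint alias that Buffer.cs already defines (illustrative only, not the committed change):

```csharp
using System;
#if BIT64
using nuint = System.UInt64;
#else
using nuint = System.UInt32;
#endif

internal static unsafe class AlignmentSketch
{
    // Before: a local initialized from IntPtr.Size, which is a call in IL that
    // the JIT has to inline and optimize away.
    internal static bool IsDestAlignedOld(byte* dest)
    {
        int alignment = IntPtr.Size - 1;
        return ((int)dest & alignment) == 0;
    }

    // After: sizeof(nuint) - 1 is a compile-time constant folded straight into
    // the expression, so the IL stays trivial.
    internal static bool IsDestAlignedNew(byte* dest)
    {
        return ((int)dest & (sizeof(nuint) - 1)) == 0;
    }
}
```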

Author:

@jkotas I thought mscorlib was crossgened, so this would have no runtime overhead?

Member:

My comment was more to make the IL simple so that there is nothing that can derail the JIT from producing good code for it.

@tannergooding (Member) commented Aug 6, 2016

@jamesqo, what are you using to bench?

I think the biggest issue with copying so few bytes (0-1024) is that we are hitting the lower limits of what any onboard timer can actually measure.

I have written a very simple bench here (https://gist.github.com/tannergooding/bb256733943fc5afecce6a51a640910d) for the existing code.

In the sample, it:

  1. Allocates Source
  2. Randomizes Bytes in Source
  3. Allocates Destination
  4. Zeros Bytes in Destination
  5. Begins timing
  6. Executes 1,000 iterations of Buffer.MemoryCopy
  7. Ends timing
  8. Repeats steps 5-7, 100,000 times
  9. Tracks the minimum, maximum, and total average time for all iterations
  10. Prints the information

I did it this way to ensure that only the data we care about is tracked (Buffer.MemoryCopy) and to try and help normalize any discrepancies from operations which are too small to be measured and from any caching of the source memory.

It is also trivial to modify the sample to start from an offset of source/destination so unaligned read/writes can be tested.
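For reference, a minimal sketch of a harness along those lines (sizes, iteration counts, and names are illustrative, not taken from the gist):

```csharp
using System;
using System.Diagnostics;

internal static class CopyBench
{
    private const int ByteCount = 64;
    private const int InnerIterations = 1000;    // step 6
    private const int OuterIterations = 100000;  // step 8

    private static unsafe void Main()
    {
        var src = new byte[ByteCount];           // 1. allocate source
        new Random(42).NextBytes(src);           // 2. randomize source bytes
        var dst = new byte[ByteCount];           // 3-4. allocate destination (already zeroed)

        double min = double.MaxValue, max = 0, total = 0;
        var sw = new Stopwatch();

        fixed (byte* pSrc = src, pDst = dst)
        {
            for (int outer = 0; outer < OuterIterations; outer++)
            {
                sw.Restart();                    // 5. begin timing
                for (int i = 0; i < InnerIterations; i++)
                {
                    Buffer.MemoryCopy(pSrc, pDst, ByteCount, ByteCount);
                }
                sw.Stop();                       // 7. end timing

                double ns = sw.Elapsed.TotalMilliseconds * 1000000.0 / InnerIterations;
                min = Math.Min(min, ns);         // 9. track minimum, maximum, average
                max = Math.Max(max, ns);
                total += ns;
            }
        }

        // 10. print the information
        Console.WriteLine($"min {min:F2} ns, max {max:F2} ns, avg {total / OuterIterations:F2} ns");
    }
}
```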

if ((len & 8) != 0)
// We have <= 15 bytes to copy at this point.

if ((mask & 8) != 0)

Member:

Consider adding if (mask == 0) { return; }; otherwise we are doing four pointless comparisons.

@tannergooding (Member), Aug 6, 2016:

Also consider something like the following (the math should be correct; I validated that it does what I expect for some simple scenarios).

  • First write:
  • nuint i = 8u - (dest % sizeof(nuint)); (adjust for a misaligned destination)
  • len -= i; (adjust for bytes already copied, not including the initial bytes that will be double-copied so we stay aligned)
  • nuint r = len % 16; (number of bytes remaining after we loop)
  • nuint l = sizeof(nuint) - (r % sizeof(nuint)); (number of bytes remaining that need to be written misaligned)
  • Loop
  • Final
    cases 13-15:
        *(int*)(dest + i) = *(int*)(src + i)
        *(int*)(dest + i + 4) = *(int*)(src + i + 4)
        *(int*)(dest + i + 8) = *(int*)(src + i + 8)
        *(int*)(dest + i + 12 - l) = *(int*)(src + i + 12 - l)
    case 12:
        *(int*)(dest + i) = *(int*)(src + i)
        *(int*)(dest + i + 4) = *(int*)(src + i + 4)
        *(int*)(dest + i + 8) = *(int*)(src + i + 8)
    cases 9-11:
        *(int*)(dest + i) = *(int*)(src + i)
        *(int*)(dest + i + 4 - l) = *(int*)(src + i + 4 - l)
    case 8:
        *(int*)(dest + i) = *(int*)(src + i)
        *(int*)(dest + i + 4) = *(int*)(src + i + 4)
    cases 5-7:
        *(int*)(dest + i) = *(int*)(src + i)
        *(int*)(dest + i + 4 - l) = *(int*)(src + i + 4 - l)
    case 4:
        *(int*)(dest + i) = *(int*)(src + i)
    cases 1-3:
        *(int*)(dest + i - l) = *(int*)(src + i - l)
    case 0:
        return

Author:

Excellent idea; we can use a switch statement here to avoid the branches/unaligned write, since code size doesn't matter much (all of this is crossgened, so there is no runtime overhead for larger methods). I'm going to experiment with this and see if I can tease out some even better perf.
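A rough sketch of the switch-based tail copy being discussed (a hypothetical shape for illustration, not the code that was eventually committed; remaining is assumed to be < 16, and the trailing unaligned writes deliberately overlap bytes already copied):

```csharp
private static unsafe void CopyTail(byte* dest, byte* src, int remaining)
{
    // One branch on the remaining byte count, then straight-line writes.
    switch (remaining)
    {
        case 15:
        case 14:
        case 13:
            *(long*)dest = *(long*)src;
            *(int*)(dest + 8) = *(int*)(src + 8);
            *(int*)(dest + remaining - 4) = *(int*)(src + remaining - 4);
            break;
        case 12:
            *(long*)dest = *(long*)src;
            *(int*)(dest + 8) = *(int*)(src + 8);
            break;
        case 11:
        case 10:
        case 9:
            *(long*)dest = *(long*)src;
            *(int*)(dest + remaining - 4) = *(int*)(src + remaining - 4);
            break;
        case 8:
            *(long*)dest = *(long*)src;
            break;
        case 7:
        case 6:
        case 5:
            *(int*)dest = *(int*)src;
            *(int*)(dest + remaining - 4) = *(int*)(src + remaining - 4);
            break;
        case 4:
            *(int*)dest = *(int*)src;
            break;
        case 3:
            *(short*)dest = *(short*)src;
            dest[2] = src[2];
            break;
        case 2:
            *(short*)dest = *(short*)src;
            break;
        case 1:
            dest[0] = src[0];
            break;
        // case 0: nothing to do
    }
}
```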

@omariom commented Aug 6, 2016

@jamesqo Regarding the xmm registers and Vector<T> you mentioned in the previous PR discussion: the JIT can actually use xmm regs when copying structs of size >= 16. I noticed that in the Slices project.

@jamesqo (Author) commented Aug 6, 2016

@omariom Interesting... I didn't know that before. I'm going to try replacing the inner loop with a

*(decimal*)(dest + i) = *(decimal*)(src + i);

for x64 and see how that goes.

@jamesqo (Author) commented Aug 6, 2016

@tannergooding I was using tests from the previous PR to benchmark: https://gist.github.com/jamesqo/337852c8ce09205a8289ce1f1b9b5382 There's a .cs file in there which contains the source code for the benchmark.

I didn't get a chance to take a full look at your benchmark, but I noticed that in your inner loop you are calling MemoryCopy with an int argument. That may be suboptimal since that's going to call the long overload which does a checked cast before calling Memmove. I'm not sure if the JIT is able to optimize that out but it may be better to cast to ulong beforehand.
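To illustrate the overload-selection point (the perf impact is the claim above, not something verified here):

```csharp
using System;

internal static unsafe class OverloadSketch
{
    // With an int count, overload resolution picks MemoryCopy(void*, void*, long, long);
    // per the comment above, that path does a checked cast before calling Memmove.
    // Casting to ulong up front binds to the (ulong, ulong) overload directly.
    internal static void Copy(byte* pSrc, byte* pDst, int count)
    {
        Buffer.MemoryCopy(pSrc, pDst, count, count);                 // (long, long) overload
        Buffer.MemoryCopy(pSrc, pDst, (ulong)count, (ulong)count);   // (ulong, ulong) overload
    }
}
```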

@omariom commented Aug 6, 2016

@jamesqo what if you try to unroll to 2 xmm copies?

struct Dummy32 { long l1, l2, l3, l4; }

@nietras commented Aug 6, 2016

@jamesqo great work.

> Regarding xmm registers and Vector you mentioned in the previous PR discussion. Actually JIT can use xmm regs when copying structs of size >= 16.

@omariom yes, I thought I read something about that too from @mikedn, so any struct defined as:

struct LongLong
{
    long l0;
    long l1;
}

should be usable for copying 16 bytes at a time. I haven't tested it myself, though.

I do not know all the background for this, but couldn't Buffer.MemoryCopy defer to .cpblk, and then the JIT could insert a copy loop or a call optimized for a given architecture? That is, specifically use enhanced rep movsb (ERMSB) when this would be more optimal, and not least result in a much smaller code size? (Not sure why code size is not considered important?)

@omariom commented Aug 6, 2016

> I do not know all the background for this, but couldn't Buffer.MemoryCopy defer to .cpblk and then the JIT could insert a copy loop or call optimized for a given architecture?

It works if the JIT knows the length; otherwise it calls a helper.
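For illustration, here is roughly what that looks like through System.Runtime.CompilerServices.Unsafe.CopyBlock (from the CoreFX Unsafe package), which compiles down to cpblk; whether the JIT expands it inline or calls a helper is the behavior described above, not something guaranteed by the API:

```csharp
using System.Runtime.CompilerServices;

internal static unsafe class CpblkSketch
{
    // Constant length: the JIT can see the size and may expand the cpblk inline.
    internal static void FixedSize(byte* dst, byte* src)
    {
        Unsafe.CopyBlock(dst, src, 64);
    }

    // Runtime length: typically ends up as a memcpy-style helper call.
    internal static void VariableSize(byte* dst, byte* src, uint len)
    {
        Unsafe.CopyBlock(dst, src, len);
    }
}
```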

@benaadams (Member) commented Aug 6, 2016

Would that mean something like this would work?

struct Bytes16
{
    long bytes0;
    long bytes1;
}
struct Bytes32
{
    long bytes0;
    long bytes1;
    long bytes2;
    long bytes3;
}
struct Bytes64
{
    long bytes0;
    long bytes1;
    long bytes2;
    long bytes3;
    long bytes4;
    long bytes5;
    long bytes6;
    long bytes7;
}
nuint end = len - 64;
nuint counter; 

while (i <= end)
{
    counter = i + 64;
    *(Bytes64*)(dest + i) = *(Bytes64*)(src + i);
    i = counter;
}

if ((len & 32) != 0)
{
    counter = i + 32;
    *(Bytes32*)(dest + i) = *(Bytes32*)(src + i);
    i = counter;
}

if ((len & 16) != 0)
{
    counter = i + 16;
    *(Bytes16*)(dest + i) = *(Bytes16*)(src + i);
    i = counter;
}

@tannergooding (Member):

@jamesqo: https://gist.github.com/tannergooding/08702b99b26447b9e30e2126bba2c966, a somewhat optimized version that uses decimal to copy 16 bytes at a time. The generated assembly (on x64) is using movdqu (which is slightly undesirable, as it means we lose some, but not all, of the benefits of ensuring we are aligned). For x86, it is generating two movq instructions.

x64 is showing a 15% improvement. This is a general improvement when tested against all sizes and alignments from 0-1024 bytes (where the alignment tests have the source or destination misaligned by 1-15 bytes).

x86 is showing a 7% improvement under the same scenarios.

It may be interesting to note that the native implementation doesn't really start showing any benefits until around 8192 bytes, where (at least on Windows) the native implementation will begin using the prefetch instruction.

In either case, the principles behind the code for doing unaligned writes as a pre/post step remain the same (and there are probably more optimizations to be had).

@tannergooding (Member):

@benaadams, yes, that works as expected (unrolling to 4 movdqu instructions). Sadly, 128-byte structs end up being a call instead.

@nietras commented Aug 6, 2016

@tannergooding awesome. Could I suggest a code change? Since we are only using decimal as a 16-byte type, putting:

using Reg16 = System.Decimal;

at the beginning of the code file and using Reg16 would perhaps make it clearer? Just as using nuint helped a lot with readability and code complexity. Or call it Bytes16 as @benaadams suggests.

@tannergooding would you mind putting up the assembly as well?

On that note, I believe defining these types should work. Not sure there is any perf benefit for >= Bytes32 though, as these just do multiple 16-byte copies then, I guess ;) But code readability would be better in my view.

@benaadams (Member) commented Aug 6, 2016

> not sure there is any benefit perf wise for >=Bytes32 though

Loop unrolling without the unroll :)

Maybe at some point it will automagically take advantage of AVX-512? (and maybe AVX2 along the way)

@nietras commented Aug 6, 2016

> Loop unrolling without the unroll :) And maybe at some point it will automagically take advantage of AVX-512

Yes, I agree it seems to be worth doing, just for improving readability and perhaps for future improvements in the JIT.

@jkotas didn't we talk about perhaps adding Aligned or something to Unsafe to try to tell the JIT when we know an address is aligned? I couldn't find the reference for this... found it: https://github.com/dotnet/coreclr/issues/2725, it was Unsafe.Assume.
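Purely as a sketch of the kind of hint being discussed in that issue (Unsafe.Assume does not exist; the Assume below is a local stand-in so the example compiles, and today it gives the JIT no information):

```csharp
using System.Diagnostics;

internal static unsafe class AlignmentHintSketch
{
    // Stand-in for the proposed Unsafe.Assume from dotnet/coreclr#2725.
    [Conditional("DEBUG")]
    private static void Assume(bool condition) => Debug.Assert(condition);

    internal static void CopyAligned16(byte* dest, byte* src)
    {
        // With a real Unsafe.Assume, the idea is that the JIT could rely on these
        // facts and emit movdqa instead of movdqu for the 16-byte copy below.
        Assume((ulong)dest % 16 == 0);
        Assume((ulong)src % 16 == 0);

        *(decimal*)dest = *(decimal*)src;
    }
}
```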

@GSPP commented Aug 6, 2016

Shouldn't System.Decimal be avoided entirely? Seems like a code smell. And it's a dependency on the JIT never mishandling that type (due to some confusion or bug). Is its size and packing even guaranteed? Might be, but better not to do this.

@nietras commented Aug 6, 2016

> Shouldn't System.Decimal be avoided entirely?

I agree, but I wasn't sure whether this was because "custom structs" like Bytes16 are considered acceptable or not.

@jamesqo (Author) commented Aug 6, 2016

@omariom

> what if you try to unroll to 2 xmm copies?

I'm not so sure if we should do that; it might be good for large buffers (avoid an extra branch/write every 16 bytes), but for smaller ones (in the ~30-50 range) it could be detrimental. I may attempt it if we increase the threshold at which we call the native memmove, though.

@nietras

> I do not know all the background for this, but couldn't Buffer.MemoryCopy defer to .cpblk and then the JIT could insert a copy loop or call optimized for a given architecture?

See the comment here; it looks like the cpblk implementation isn't as efficient as a manual copy, though I'm not sure whether the situation has changed since that comment was written.

> could I suggest a code change, since we are only using decimal as a 16-byte type putting:

I'll make an Int128/Buffer16 type and use that instead of decimal. I preferred the latter since it was a built-in type and got syntax highlighting, but I guess the former more clearly expresses the intent.

@tannergooding

> For x86, it is generating two movq instructions.

Really? I was under the impression it would have to use 4 GPRs and do something like move each int in the struct into a register, and then move those to the destination (8 movs total). At least, I think something like that happened when I looked at the disassembly for long* writes on x86. I'll have to test this out for myself.

@jamesqo (Author) commented Aug 6, 2016

Alright, here is the code of the inner loop when using a 16-byte struct:

G_M6842_IG28:
       4C8D5810             lea      r11, [rax+16]
       488D3402             lea      rsi, [rdx+rax]
       4803C1               add      rax, rcx
       F30F6F06             movdqu   xmm0, qword ptr [rsi]
       F30F7F00             movdqu   qword ptr [rax], xmm0
       498BC3               mov      rax, r11
       493BC1               cmp      rax, r9
       76E5                 jbe      SHORT G_M6842_IG28

It's not as efficient as it could be. E.g. I think it should be something like:

LOOP:

; r11 => counter, rax => i, rdx => src, rcx => dest, r9 => end
lea r11, [rax+16] ; calculate counter
movdqu xmm0, qword ptr [rdx+rax] ; should be movdqa, but as nietras mentioned we need Unsafe.Assume
movdqu qword ptr [rcx+rax], xmm0
mov rax, r11
cmp rax, r9
jbe LOOP

edit: Reverting the changes @benaadams suggested to avoid making the loop condition depend on the writes still results in 1 additional lea:

G_M6842_IG28:
       4C8D1C02             lea      r11, [rdx+rax]
       488D3401             lea      rsi, [rcx+rax]
       F3410F6F03           movdqu   xmm0, qword ptr [r11]
       F30F7F06             movdqu   qword ptr [rsi], xmm0
       4883C010             add      rax, 16
       493BC1               cmp      rax, r9
       76E6                 jbe      SHORT G_M6842_IG28

I think the JIT tries to eliminate these dependencies by itself (hence why the movdqus don't just do [rdx+rax] directly), but it should be moving i + 16 or i into another register, not the pointer offsets.

The version that was just merged seemed to work with this trick; I'm not sure why it's not working now.

Regardless, even though an extra lea is being generated, it's still an improvement over the old version by ~4 bytes, so I'll take it. I'm going to raise an issue for this when I can get a more generalized repro, and maybe this can be revisited later.

@omariom commented Aug 6, 2016

@jamesqo
It would be interesting to see results for the whole length range, from 1 to 512.
If you save the results to a CSV file you will be able to create nice-looking charts in Excel.
Like this, for example:
https://cloud.githubusercontent.com/assets/1781701/14231309/fdb290e8-f986-11e5-9d47-0e2c3467f612.png

And as this is not your last optimization, you could reuse that Excel file in the future :)

@nietras commented Aug 6, 2016

> looks like the cpblk implementation isn't as efficient as a manual copy, though I'm not sure the situation has changed

The problem is we never really got a great definitive answer for this and other questions related to memory copying on .NET. I am currently running https://github.com/DotNetCross/Memory.Copies (which contains a single BenchmarkDotNet project) measuring many different memory copy variants such as the ones discussed in https://github.com/dotnet/coreclr/issues/2430 and aspnet/KestrelHttpServer#511 . This is based on a benchmark project that @benaadams did.

I added the original Buffer.Memmove, the @jamesqo variant merged into master, and the @tannergooding variant. For the former two I revert to msvcrt.dll memmove instead of the internal coreclr call; the @tannergooding variant does not revert to msvcrt.dll. See the benchmark file https://github.com/DotNetCross/Memory.Copies/blob/master/src/DotNetCross.Memory.Copies.Benchmarks/CopiesBenchmark.cs

This benchmark, however, only runs on "normal" .NET, that is, the three JITs available there. I have yet to succeed in building coreclr and getting tests run; just haven't had time ;) It should be possible to add .NET Core benchmarks too. Of course, the benchmark will take a long, long time to run (I am working on an alternative to BenchmarkDotNet which will focus on doing the minimum possible to get good measurements, since BDN simply takes too long for quick brainstorming); it will probably finish sometime tomorrow morning CET. And it doesn't run under the exact same conditions, but it would be good to see what advantages/disadvantages the different variants have.

An early run of this with fewer parameters can be found here (I did some minor changes after this, though): https://gist.github.com/nietras/400dfe8954450825c1033e36ae35a6a4

@tannergooding (Member) commented Aug 6, 2016

I have updated my gist with the disassembly for both the 32-bit and 64-bit versions (this is disassembly generated using Desktop 4.6.2, not using CoreCLR, but they should produce very similar results on Windows).

I can also add a version that shows the source code inline if desirable.

@jamesqo, I handled the case of unrolling by doing 128-byte blocks and special-casing anything under 128 bytes (the same code used for the special case is also used to handle any leftover bytes at the end of the large block copy loop).

Edit: Updated the gist to have assembly and assembly w/ inline source for both. Additionally, updated to use a UInt128 struct, rather than the decimal type.
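A rough sketch of the 128-byte block shape described above (UInt128 here is just a 16-byte struct like the ones earlier in the thread, not an existing framework type, and the small-copy path is omitted):

```csharp
internal struct UInt128
{
#pragma warning disable 0169 // fields exist only to give the struct its 16-byte size
    private ulong _lo;
    private ulong _hi;
#pragma warning restore 0169
}

internal static unsafe class BlockCopySketch
{
    // Copies 128 bytes per iteration; the caller handles the < 128-byte remainder
    // with the same code used for small copies.
    internal static void CopyLargeBlocks(byte* dest, byte* src, ref ulong i, ulong len)
    {
        while (len - i >= 128)
        {
            *(UInt128*)(dest + i)       = *(UInt128*)(src + i);
            *(UInt128*)(dest + i + 16)  = *(UInt128*)(src + i + 16);
            *(UInt128*)(dest + i + 32)  = *(UInt128*)(src + i + 32);
            *(UInt128*)(dest + i + 48)  = *(UInt128*)(src + i + 48);
            *(UInt128*)(dest + i + 64)  = *(UInt128*)(src + i + 64);
            *(UInt128*)(dest + i + 80)  = *(UInt128*)(src + i + 80);
            *(UInt128*)(dest + i + 96)  = *(UInt128*)(src + i + 96);
            *(UInt128*)(dest + i + 112) = *(UInt128*)(src + i + 112);
            i += 128;
        }
    }
}
```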

@nietras commented Aug 6, 2016

I was being pessimistic; it actually finished just now. It took 10558059 ms, aka ~2.9 hours, to run on my desktop PC. See https://gist.github.com/nietras/22b8efd26af715ac32ccaf1a57a465da

I've found it is easiest to view by downloading the HTML, opening it in a browser, and zooming out as you please. A graph would be welcome though ;)

Here are a few random samples for RyuJIT 64-bit:

```
Method Platform Jit BytesCopied Median StdDev Scaled Mean StdError StdDev Op/s Min Q1 Median Q3 Max
ArrayCopy X64 RyuJit 3 11.8486 ns 0.0732 ns 1.00 11.8697 ns 0.0328 ns 0.0732 ns 84248344.93 11.7916 ns 11.8112 ns 11.8486 ns 11.9387 ns 11.9827 ns
BufferBlockCopy X64 RyuJit 3 11.8605 ns 0.0471 ns 1.00 11.8545 ns 0.0211 ns 0.0471 ns 84356469.07 11.8099 ns 11.8104 ns 11.8605 ns 11.8955 ns 11.9237 ns
IllyriadVectorizedCopy X64 RyuJit 3 11.8480 ns 0.2041 ns 1.00 11.9497 ns 0.0913 ns 0.2041 ns 83684379.92 11.7781 ns 11.7896 ns 11.8480 ns 12.1606 ns 12.2495 ns
AndermanVectorizedCopy X64 RyuJit 3 9.3606 ns 0.1497 ns 0.79 9.3392 ns 0.0670 ns 0.1497 ns 107075766.64 9.1550 ns 9.1922 ns 9.3606 ns 9.4755 ns 9.5305 ns
UnsafeIllyriadVectorizedCopy X64 RyuJit 3 12.1958 ns 0.2594 ns 1.03 12.2369 ns 0.1160 ns 0.2594 ns 81720202.83 11.9971 ns 12.0441 ns 12.1958 ns 12.4502 ns 12.6713 ns
UnsafeAndermanVectorizedCopy X64 RyuJit 3 7.9733 ns 0.1325 ns 0.67 7.9373 ns 0.0593 ns 0.1325 ns 125986743.59 7.7770 ns 7.7994 ns 7.9733 ns 8.0573 ns 8.0802 ns
UnsafeCopyBlock X64 RyuJit 3 8.0212 ns 0.0503 ns 0.68 8.0069 ns 0.0225 ns 0.0503 ns 124892864.98 7.9433 ns 7.9563 ns 8.0212 ns 8.0503 ns 8.0688 ns
Buffer_MemmoveOriginal X64 RyuJit 3 8.5832 ns 0.0677 ns 0.72 8.6095 ns 0.0303 ns 0.0677 ns 116150892.74 8.5275 ns 8.5541 ns 8.5832 ns 8.6780 ns 8.6969 ns
Buffer_MemmoveJamesqo X64 RyuJit 3 8.9447 ns 0.0615 ns 0.75 8.9039 ns 0.0275 ns 0.0615 ns 112310642.02 8.8187 ns 8.8383 ns 8.9447 ns 8.9490 ns 8.9516 ns
Buffer_MemmoveTannerGooding X64 RyuJit 3 8.2688 ns 0.1275 ns 0.70 8.2158 ns 0.0570 ns 0.1275 ns 121717114.81 8.0688 ns 8.0830 ns 8.2688 ns 8.3220 ns 8.3666 ns
ArrayCopy X64 RyuJit 21 21.9709 ns 0.2818 ns 1.00 21.9018 ns 0.1260 ns 0.2818 ns 45658425.73 21.4363 ns 21.6691 ns 21.9709 ns 22.0999 ns 22.1938 ns
BufferBlockCopy X64 RyuJit 21 22.8252 ns 0.3546 ns 1.04 22.8957 ns 0.1586 ns 0.3546 ns 43676276.72 22.4627 ns 22.6353 ns 22.8252 ns 23.1914 ns 23.4442 ns
IllyriadVectorizedCopy X64 RyuJit 21 10.5605 ns 0.0813 ns 0.48 10.5823 ns 0.0364 ns 0.0813 ns 94497284.1 10.4862 ns 10.5171 ns 10.5605 ns 10.6584 ns 10.7029 ns
AndermanVectorizedCopy X64 RyuJit 21 5.4889 ns 0.0795 ns 0.25 5.4991 ns 0.0355 ns 0.0795 ns 181848044.18 5.4050 ns 5.4240 ns 5.4889 ns 5.5793 ns 5.5911 ns
UnsafeIllyriadVectorizedCopy X64 RyuJit 21 9.9539 ns 0.0634 ns 0.45 9.9597 ns 0.0283 ns 0.0634 ns 100405097.81 9.8839 ns 9.8987 ns 9.9539 ns 10.0235 ns 10.0265 ns
UnsafeAndermanVectorizedCopy X64 RyuJit 21 6.8868 ns 0.0319 ns 0.31 6.8939 ns 0.0142 ns 0.0319 ns 145056194.27 6.8638 ns 6.8706 ns 6.8868 ns 6.9207 ns 6.9471 ns
UnsafeCopyBlock X64 RyuJit 21 10.3474 ns 0.0982 ns 0.47 10.3240 ns 0.0439 ns 0.0982 ns 96861778.16 10.1843 ns 10.2289 ns 10.3474 ns 10.4073 ns 10.4395 ns
Buffer_MemmoveOriginal X64 RyuJit 21 12.5054 ns 0.1772 ns 0.57 12.4501 ns 0.0792 ns 0.1772 ns 80320527.72 12.2471 ns 12.2688 ns 12.5054 ns 12.6038 ns 12.6693 ns
Buffer_MemmoveJamesqo X64 RyuJit 21 9.3066 ns 0.0323 ns 0.42 9.3176 ns 0.0145 ns 0.0323 ns 107324239.33 9.2873 ns 9.2934 ns 9.3066 ns 9.3472 ns 9.3702 ns
Buffer_MemmoveTannerGooding X64 RyuJit 21 8.6628 ns 0.1095 ns 0.39 8.6026 ns 0.0490 ns 0.1095 ns 116243273.06 8.4256 ns 8.4961 ns 8.6628 ns 8.6792 ns 8.6826 ns
ArrayCopy X64 RyuJit 65 24.6007 ns 0.3236 ns 1.00 24.6037 ns 0.1447 ns 0.3236 ns 40644267.97 24.2618 ns 24.3087 ns 24.6007 ns 24.9003 ns 25.0799 ns
BufferBlockCopy X64 RyuJit 65 25.7098 ns 0.3814 ns 1.05 25.5493 ns 0.1706 ns 0.3814 ns 39140036.49 24.8857 ns 25.2327 ns 25.7098 ns 25.7857 ns 25.8258 ns
IllyriadVectorizedCopy X64 RyuJit 65 15.6770 ns 0.2101 ns 0.64 15.5863 ns 0.0940 ns 0.2101 ns 64158881.64 15.2112 ns 15.4354 ns 15.6770 ns 15.6918 ns 15.6948 ns
AndermanVectorizedCopy X64 RyuJit 65 9.6312 ns 0.0904 ns 0.39 9.5969 ns 0.0404 ns 0.0904 ns 104200247.67 9.4601 ns 9.5109 ns 9.6312 ns 9.6657 ns 9.6981 ns
UnsafeIllyriadVectorizedCopy X64 RyuJit 65 11.1950 ns 0.1201 ns 0.46 11.2211 ns 0.0537 ns 0.1201 ns 89117785.46 11.0570 ns 11.1176 ns 11.1950 ns 11.3377 ns 11.3605 ns
UnsafeAndermanVectorizedCopy X64 RyuJit 65 9.3821 ns 0.1452 ns 0.38 9.3171 ns 0.0649 ns 0.1452 ns 107329738.59 9.1599 ns 9.1612 ns 9.3821 ns 9.4405 ns 9.4628 ns
UnsafeCopyBlock X64 RyuJit 65 9.9720 ns 0.0540 ns 0.41 9.9936 ns 0.0241 ns 0.0540 ns 100063697.03 9.9478 ns 9.9585 ns 9.9720 ns 10.0396 ns 10.0858 ns
Buffer_MemmoveOriginal X64 RyuJit 65 15.3230 ns 0.1963 ns 0.62 15.3714 ns 0.0878 ns 0.1963 ns 65055788.97 15.1583 ns 15.1921 ns 15.3230 ns 15.5749 ns 15.6093 ns
Buffer_MemmoveJamesqo X64 RyuJit 65 14.4864 ns 0.1597 ns 0.59 14.5143 ns 0.0714 ns 0.1597 ns 68897689.19 14.3478 ns 14.3654 ns 14.4864 ns 14.6771 ns 14.7175 ns
Buffer_MemmoveTannerGooding X64 RyuJit 65 12.6768 ns 0.0776 ns 0.52 12.6946 ns 0.0347 ns 0.0776 ns 78773425.2 12.6169 ns 12.6375 ns 12.6768 ns 12.7607 ns 12.8226 ns
ArrayCopy X64 RyuJit 128 27.7511 ns 0.2175 ns 1.00 27.6736 ns 0.0973 ns 0.2175 ns 36135474.82 27.4014 ns 27.4458 ns 27.7511 ns 27.8627 ns 27.9071 ns
BufferBlockCopy X64 RyuJit 128 26.4619 ns 0.3276 ns 0.95 26.3389 ns 0.1465 ns 0.3276 ns 37966666.2 25.7567 ns 26.0969 ns 26.4619 ns 26.5194 ns 26.5341 ns
IllyriadVectorizedCopy X64 RyuJit 128 14.2371 ns 0.1849 ns 0.51 14.2701 ns 0.0827 ns 0.1849 ns 70076491.04 14.0883 ns 14.1423 ns 14.2371 ns 14.4144 ns 14.5810 ns
AndermanVectorizedCopy X64 RyuJit 128 16.8091 ns 0.1228 ns 0.61 16.8434 ns 0.0549 ns 0.1228 ns 59370550.11 16.6976 ns 16.7344 ns 16.8091 ns 16.9695 ns 16.9900 ns
UnsafeIllyriadVectorizedCopy X64 RyuJit 128 9.1695 ns 0.2100 ns 0.33 9.0657 ns 0.0939 ns 0.2100 ns 110306043.73 8.8190 ns 8.8418 ns 9.1695 ns 9.2377 ns 9.2881 ns
UnsafeAndermanVectorizedCopy X64 RyuJit 128 12.7646 ns 0.2398 ns 0.46 12.7311 ns 0.1072 ns 0.2398 ns 78547667.5 12.3533 ns 12.5115 ns 12.7646 ns 12.9341 ns 12.9344 ns
UnsafeCopyBlock X64 RyuJit 128 10.9182 ns 0.2382 ns 0.39 10.8837 ns 0.1065 ns 0.2382 ns 91880141.97 10.5941 ns 10.6492 ns 10.9182 ns 11.1011 ns 11.1881 ns
Buffer_MemmoveOriginal X64 RyuJit 128 19.8758 ns 0.2395 ns 0.72 19.9155 ns 0.1071 ns 0.2395 ns 50212255.71 19.5789 ns 19.7217 ns 19.8758 ns 20.1290 ns 20.2311 ns
Buffer_MemmoveJamesqo X64 RyuJit 128 19.7783 ns 0.2726 ns 0.71 19.6836 ns 0.1219 ns 0.2726 ns 50803812.96 19.2007 ns 19.4871 ns 19.7783 ns 19.8327 ns 19.8677 ns
Buffer_MemmoveTannerGooding X64 RyuJit 128 12.6614 ns 0.3868 ns 0.46 12.5310 ns 0.1730 ns 0.3868 ns 79801909.28 11.9916 ns 12.1419 ns 12.6614 ns 12.8549 ns 12.9635 ns
ArrayCopy X64 RyuJit 256 27.0799 ns 0.1014 ns 1.00 27.0722 ns 0.0454 ns 0.1014 ns 36938273.37 26.9537 ns 26.9752 ns 27.0799 ns 27.1653 ns 27.2100 ns
BufferBlockCopy X64 RyuJit 256 28.1309 ns 0.3786 ns 1.04 28.0623 ns 0.1693 ns 0.3786 ns 35635018.71 27.6200 ns 27.6765 ns 28.1309 ns 28.4138 ns 28.5073 ns
IllyriadVectorizedCopy X64 RyuJit 256 25.8186 ns 0.5686 ns 0.95 26.1196 ns 0.2543 ns 0.5686 ns 38285427.39 25.6403 ns 25.7007 ns 25.8186 ns 26.6890 ns 27.0020 ns
AndermanVectorizedCopy X64 RyuJit 256 29.1532 ns 0.5057 ns 1.08 29.0232 ns 0.2262 ns 0.5057 ns 34455141.26 28.1601 ns 28.6074 ns 29.1532 ns 29.3741 ns 29.4576 ns
UnsafeIllyriadVectorizedCopy X64 RyuJit 256 14.1985 ns 0.2394 ns 0.52 14.2448 ns 0.1070 ns 0.2394 ns 70200856.14 13.9674 ns 14.0787 ns 14.1985 ns 14.4341 ns 14.6283 ns
UnsafeAndermanVectorizedCopy X64 RyuJit 256 19.7972 ns 0.2998 ns 0.73 19.9209 ns 0.1341 ns 0.2998 ns 50198413.41 19.5571 ns 19.6732 ns 19.7972 ns 20.2305 ns 20.2673 ns
UnsafeCopyBlock X64 RyuJit 256 14.6567 ns 0.2135 ns 0.54 14.5426 ns 0.0955 ns 0.2135 ns 68763503.94 14.1961 ns 14.3370 ns 14.6567 ns 14.6912 ns 14.7127 ns
Buffer_MemmoveOriginal X64 RyuJit 256 26.6222 ns 0.3156 ns 0.98 26.7835 ns 0.1411 ns 0.3156 ns 37336485.43 26.4921 ns 26.5223 ns 26.6222 ns 27.1252 ns 27.1428 ns
Buffer_MemmoveJamesqo X64 RyuJit 256 26.1589 ns 0.2413 ns 0.97 26.1227 ns 0.1079 ns 0.2413 ns 38280875.61 25.7992 ns 25.8860 ns 26.1589 ns 26.3413 ns 26.4083 ns
Buffer_MemmoveTannerGooding X64 RyuJit 256 15.6074 ns 0.1285 ns 0.58 15.5355 ns 0.0575 ns 0.1285 ns 64368702.74 15.3700 ns 15.3976 ns 15.6074 ns 15.6374 ns 15.6542 ns
ArrayCopy X64 RyuJit 544 38.4386 ns 0.3362 ns 1.00 38.5297 ns 0.1504 ns 0.3362 ns 25953978.74 38.1637 ns 38.2388 ns 38.4386 ns 38.8662 ns 39.0021 ns
BufferBlockCopy X64 RyuJit 544 38.1120 ns 0.4219 ns 0.99 38.2131 ns 0.1887 ns 0.4219 ns 26169025.95 37.6772 ns 37.8589 ns 38.1120 ns 38.6179 ns 38.7868 ns
IllyriadVectorizedCopy X64 RyuJit 544 52.1134 ns 0.7588 ns 1.36 52.1800 ns 0.3394 ns 0.7588 ns 19164414.12 51.0906 ns 51.5215 ns 52.1134 ns 52.8720 ns 53.0984 ns
AndermanVectorizedCopy X64 RyuJit 544 60.9400 ns 0.8133 ns 1.59 60.9339 ns 0.3637 ns 0.8133 ns 16411212.99 59.9444 ns 60.1344 ns 60.9400 ns 61.7305 ns 61.8440 ns
UnsafeIllyriadVectorizedCopy X64 RyuJit 544 24.0320 ns 0.3387 ns 0.63 24.0222 ns 0.1515 ns 0.3387 ns 41628129.27 23.6642 ns 23.6862 ns 24.0320 ns 24.3533 ns 24.4330 ns
UnsafeAndermanVectorizedCopy X64 RyuJit 544 38.1652 ns 0.4368 ns 0.99 38.3020 ns 0.1954 ns 0.4368 ns 26108324.12 37.7511 ns 37.9395 ns 38.1652 ns 38.7328 ns 38.8734 ns
UnsafeCopyBlock X64 RyuJit 544 26.3371 ns 0.4298 ns 0.69 26.1284 ns 0.1922 ns 0.4298 ns 38272556.01 25.5445 ns 25.6808 ns 26.3371 ns 26.4716 ns 26.5768 ns
Buffer_MemmoveOriginal X64 RyuJit 544 36.5762 ns 0.6803 ns 0.95 36.8039 ns 0.3042 ns 0.6803 ns 27171035.38 36.1182 ns 36.1962 ns 36.5762 ns 37.5255 ns 37.5845 ns
Buffer_MemmoveJamesqo X64 RyuJit 544 36.1769 ns 0.3230 ns 0.94 36.2245 ns 0.1444 ns 0.3230 ns 27605659.02 35.8542 ns 35.9416 ns 36.1769 ns 36.5310 ns 36.6919 ns
Buffer_MemmoveTannerGooding X64 RyuJit 544 24.9767 ns 0.3292 ns 0.65 25.0289 ns 0.1472 ns 0.3292 ns 39953764.86 24.6877 ns 24.7214 ns 24.9767 ns 25.3625 ns 25.4539 ns
ArrayCopy X64 RyuJit 1024 52.5563 ns 0.4810 ns 1.00 52.4264 ns 0.2151 ns 0.4810 ns 19074363.03 51.7578 ns 51.9354 ns 52.5563 ns 52.8524 ns 52.8638 ns
BufferBlockCopy X64 RyuJit 1024 52.1743 ns 0.3734 ns 0.99 52.3227 ns 0.1670 ns 0.3734 ns 19112163.5 52.0713 ns 52.0877 ns 52.1743 ns 52.6319 ns 52.9735 ns
IllyriadVectorizedCopy X64 RyuJit 1024 53.5631 ns 0.8648 ns 1.02 53.2958 ns 0.3867 ns 0.8648 ns 18763190.18 52.2829 ns 52.3801 ns 53.5631 ns 54.0780 ns 54.0900 ns
AndermanVectorizedCopy X64 RyuJit 1024 56.5467 ns 0.7837 ns 1.08 56.2059 ns 0.3505 ns 0.7837 ns 17791727.81 55.0410 ns 55.4314 ns 56.5467 ns 56.8100 ns 57.0404 ns
UnsafeIllyriadVectorizedCopy X64 RyuJit 1024 46.5726 ns 0.9408 ns 0.89 45.9791 ns 0.4207 ns 0.9408 ns 21749006.4 44.8342 ns 44.9558 ns 46.5726 ns 46.7057 ns 46.7709 ns
UnsafeAndermanVectorizedCopy X64 RyuJit 1024 56.0373 ns 0.3810 ns 1.07 55.9425 ns 0.1704 ns 0.3810 ns 17875482.8 55.3060 ns 55.6212 ns 56.0373 ns 56.2165 ns 56.3067 ns
UnsafeCopyBlock X64 RyuJit 1024 44.4629 ns 0.6619 ns 0.85 44.2982 ns 0.2960 ns 0.6619 ns 22574257.2 43.2472 ns 43.6771 ns 44.4629 ns 44.8371 ns 44.8767 ns
Buffer_MemmoveOriginal X64 RyuJit 1024 46.6081 ns 0.6426 ns 0.89 46.6340 ns 0.2874 ns 0.6426 ns 21443590.92 45.7607 ns 46.1689 ns 46.6081 ns 47.1120 ns 47.5748 ns
Buffer_MemmoveJamesqo X64 RyuJit 1024 49.5556 ns 0.9978 ns 0.94 49.1193 ns 0.4462 ns 0.9978 ns 20358577.71 47.5254 ns 48.1520 ns 49.5556 ns 49.8686 ns 49.9482 ns
Buffer_MemmoveTannerGooding X64 RyuJit 1024 40.3597 ns 0.5364 ns 0.77 40.5760 ns 0.2399 ns 0.5364 ns 24645125.94 39.9316 ns 40.1276 ns 40.3597 ns 41.1325 ns 41.1856 ns
```

@tannergooding (Member):

@nietras, it might be worth noting (for my sample) that it didn't properly handle overlapping buffers (it threw an exception in the original, which I believe you grabbed, and falls back to Buffer.MemoryCopy now).

This isn't a problem if the bench didn't cover that, but is if it did.

I think another important distinction is that my code did not fall back to the native implementation for any size. But, as mentioned above, that doesn't seem to really matter until the underlying implementation begins using prefetch, for sizes greater than 8192.

@omariom commented Aug 6, 2016

> I was being pessimistic; it actually finished just now. It took 10558059 ms, aka ~2.9 hours, to run on my desktop PC.

That's why I usually run such benchmarks in LinqPad. BDN is not (yet) convenient for that.

@jamesqo (Author) commented Aug 6, 2016

@tannergooding Your version looks interesting. I'm going to copy your approach partially and replace the double word writes w/ XMM in the switch-cases as well.

Also it looks like the JIT is generating some redundant code for the movdqu on x64? For example:

                    *(UInt128*)(dst + sizeof_UInt128) = *(UInt128*)(src + sizeof_UInt128);
00007FF953CE4A85  lea         rax,[rsi+10h]  
00007FF953CE4A89  lea         r8,[rcx+10h]  
00007FF953CE4A8D  movdqu      xmm0,xmmword ptr [rax]  
00007FF953CE4A91  movdqu      xmmword ptr [r8],xmm0  
                    *(UInt128*)(dst + (sizeof_UInt128 * 2)) = *(UInt128*)(src + (sizeof_UInt128 * 2));
00007FF953CE4A96  lea         rax,[rsi+20h]  
00007FF953CE4A9A  lea         r8,[rcx+20h]  
00007FF953CE4A9E  movdqu      xmm0,xmmword ptr [rax]  
00007FF953CE4AA2  movdqu      xmmword ptr [r8],xmm0  

This looks like it should just be

movdqu xmm0,xmmword ptr [rsi+10h]
movdqu xmmword ptr [rcx+10h],xmm0
movdqu xmm0,xmmword ptr [rsi+20h]
movdqu xmmword ptr [rcx+20h],xmm0

without the leas.

Also, I don't know how you're getting movqs over there on x86; this is the (very inefficient) code I'm getting for the x86 inner loop:

G_M6842_IG28:
       8B75F0       mov      esi, dword ptr [ebp-10H]
       0375EC       add      esi, dword ptr [ebp-14H]
       8B7DEC       mov      edi, dword ptr [ebp-14H]
       03FB         add      edi, ebx
       A5           movsd    
       A5           movsd    
       A5           movsd    
       A5           movsd    
       8345EC10     add      dword ptr [ebp-14H], 16
       3955EC       cmp      dword ptr [ebp-14H], edx
       76E8         jbe      SHORT G_M6842_IG28

(edit: was responding to your comment 3 comments up)

@nietras commented Aug 6, 2016

I did a simple sum over the Scaled column for each method for 64-bit RyuJIT to get some idea of each method's applicability:
[image]

@tannergooding yes, I took your version verbatim, so it does not revert to the memmove. Also, the benchmark does nothing fancy; it simply copies from one managed array to another, so there should be no overlap or anything. Also no offset from the beginning of the array. That is another parameter that I have for now uncommented.

My Excel skills are really being tested today (I had to google http://superuser.com/questions/750353/excel-scatter-plot-with-multiple-series-from-1-table), but here is a plot for small byte copies, where I again plot the Scaled result:
[image]

The benchmarks do have an issue with some methods doing input parameter checks and others not. If you guys have suggestions for changes to the benchmarks, like making it all unsafe or omitting the checks, I can make the changes. And, of course, include new variants.

@GSPP commented Aug 21, 2016

@Anderman shouldn't we assume that data is cached? If it's uncached, performance will be completely dominated by cache line loads and what we do does not matter much. Optimizations should matter almost exclusively for the cached case.

If memory can be delivered at 10 GB/s in a streaming fashion, that's about 3 bytes per cycle. We therefore have ~20 cycles to spend on copying one 64-byte cache line. That's easy to do in 20 cycles; we get even more than 20 instructions, maybe 2-3x that.
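Spelling out that arithmetic (the ~3.5 GHz clock here is an assumption; the comment above doesn't state one):

    10 GB/s ÷ ~3.5 GHz ≈ 2.9 bytes per cycle
    64 bytes per cache line ÷ ~2.9 bytes/cycle ≈ 22 cycles per cache line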

@redknightlois:

@GSPP not really. Given that memory loads into the lower levels of the memory hierarchy dominate the whole runtime, the real global minimum happens when you can retire as many instructions as possible, even considering the latency caused by memory access. The real trick is in overlapping the memory accesses on the cache misses with the operations on L1 hits (without consuming all execution ports), or paying the prefetch of multiple cache misses in batches (you can copy kilobytes at a time essentially for free if you don't pollute the cache with the written data, using non-temporal writes).

@Anderman:

@GSPP @redknightlois
I did some tests with caching disabled. I changed the index for every test:

public void ArrayCopy()
{
    // Stride the start index by the cache size on every call so the copied region
    // is unlikely to still be in cache (assumes BufferSize is a power of two, so
    // the & wraps the index around the buffer).
    Index = (Index + cacheSize) & (BufferSize - 1);
    Array.Copy(bufferFrom, Index, bufferTo, Index, BytesCopied);
}

This will slow down performance by a factor of 10-20 for small sizes, but not so much for big arrays; prefetching seems to work well.

Interestingly, the best copy method depends on cached vs. uncached.

@Anderman:

@GSPP I think you are right. There will be programs where all data is already in the L1 cache before the copy method is called, so the current tests are good. Maybe we can extend the tests with uncached data to see if that leads to other solutions.

@redknightlois:

@Anderman it would be interesting to see @nietras's graphs for the uncached versions. We, for one, have lots of the "uncached" kind. For algorithm selection purposes we should probably look for a tradeoff: probably not the fastest for each case, but faster for both on average.

@Anderman:

@redknightlois I did some tests with a DIY performance test program. The results for the cached version are the same, but the tests run faster. I also did tests with random array access. Will post the results soon.

@Anderman commented Aug 22, 2016

Cached test results:
[image: 3610qm scale cached]

I made some improvements in the alignment handling and faster copying of large blocks by using some tricks from MsvcrtMemmove.

@Anderman commented Aug 22, 2016

Updated with the correct graph.

With cache, but now showing the time it takes to do a copy operation:
[image: 3610qm ns-op]

Random access (edited graph), scale:
[image: 3610qm scale]

Random access (edited graph), ns per operation:
[image: 3610qm ns-op]

I can't explain what causes the delay for ArrayCopy and msMemmove.

@nietras commented Aug 22, 2016

@Anderman do you have the full code for this anywhere? I'd be happy to put it in the repo if you like.

@Anderman:

Yes, I have to clean it up a little bit. The results are interesting. It looks like Array.Copy and memmove do an extra data access, and when that data is not cached it causes an extra delay.

@redknightlois:

@Anderman is that operations per nanosecond or nanoseconds per operation in the last graph?

@Anderman:

@redknightlois that is ns per operation. Array.Copy will copy 32 bytes in 100 ns (random access).

@Anderman:

Sequential access, scale:
[image: 3610qm scale seq]

Sequential access, ns per operation:
[image: 3610qm ns-op seq]

@nietras commented Aug 23, 2016

@Anderman as far as I understand, your benchmarks are based on your own basic measurements on .NET Core, with no pre-jit or warmup iterations, so there probably is a bias (although probably a small one) in these measurements. Anyway, that is not my main question. I don't understand how your measurements can show that MsvcrtMemmove is faster than Array.Copy when almost all our other measurements show the opposite, at least for sizes 0-15, as seen below. How can this be explained?

[image: CopiesBenchmark-report-i7-5820K]

@Anderman commented Aug 23, 2016

> MsvcrtMemmove is faster than Array.Copy

@nietras I don't know. In cached mode MsvcrtMemmove is faster, but in seq mode it is not. Array.Copy does somewhat well with precaching.

I included an Excel file with the four different tests, from 0-8192.

Test results from an i7-3610QM (random, cached, sequence, align test):
I7-3610QM.xlsx
The most interesting is seq/cached.

@Anderman:

Cached vs. sequential:
[image: 3610qm ns cached]
[image: 3610qm ns-op seq]

@Anderman:

@nietras I have run the test with the original benchmark, and there msvcrt.dll was slower for 1-15. Could it be another version of msvcrt.dll? How can I check that? Procmon, or are there easier ways?

@nietras commented Aug 23, 2016

@Anderman I think it is because your version of Msvcrt omits the checks that are in the original benchmark and in the other variants in your own benchmark, i.e.

if (src == null || dst == null) throw new ArgumentNullException(nameof(src));
if (count < 0 || srcOffset < 0 || dstOffset < 0) throw new ArgumentOutOfRangeException(nameof(count));
if (srcOffset + count > src.Length) throw new ArgumentException(nameof(src));
if (dstOffset + count > dst.Length) throw new ArgumentException(nameof(dst));

@Anderman:

@nietras Yep, that's why. Do you know how you can use rep movs byte ptr [rdi],byte ptr [rsi] (as at 000007FB44D49FF7)? Array.Copy uses this. I tried Unsafe.CopyBlk.
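As far as I know there is no C#-level way to force a rep movs; the closest managed route is cpblk via Unsafe.CopyBlock (instruction selection stays with the JIT), or calling the CRT directly, roughly like this (a hypothetical P/Invoke sketch, similar in spirit to the MsvcrtMemmove variant in the benchmarks):

```csharp
using System;
using System.Runtime.InteropServices;

internal static class MsvcrtInterop
{
    // msvcrt's memmove; the CRT typically decides internally whether to use
    // rep movs, SSE copies, or prefetching based on size and alignment.
    [DllImport("msvcrt.dll", EntryPoint = "memmove", CallingConvention = CallingConvention.Cdecl)]
    internal static extern IntPtr Memmove(IntPtr dest, IntPtr src, UIntPtr count);
}
```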

@Anderman:

Test on an i5-4590. With a faster processor than my laptop's i7 it really makes sense to optimize the code.
I5-4590.xlsx

[image: i5-4590 ns cached]

@Anderman:

@nietras I sent you a pull request with a new test project.

  • The new performance tests are much faster (100 ms per test)
  • The tests are now measured in clock cycles
  • The tests generate a JSON file to show the results in a Google line chart
  • The tests are more reliable: if you run a test twice you get almost the same results for the complete test range
  • I found a way to write assembly in C# (for testing only)
  • 3 available test modes (cached, sequence, random)

My testing summary on 2 different test machines:

  • Rebooting the computer with fresh memory removes a lot of noise
  • rep movsb has some startup cost
  • rep movsb has no alignment problems

@benaadams (Member):

Might have to retest with #7198

@jamesqo (Author) commented Oct 10, 2016

Closing due to length issues. I may open a new PR if I get a chance to work on this in the future.

Labels: * NO MERGE * The PR is not ready for merge yet (see discussion for detailed reasons)