[no merge] Further tweak Buffer.MemoryCopy performance #6638
Conversation
#endif // BIT64

int alignment = IntPtr.Size - 1;
It may be better to just inline sizeof(nuint) - 1 in the expression below instead of creating a local variable that is initialized by a call (IntPtr.Size is a function call in IL) that the JIT has to inline and optimize out. Same for the other uses of IntPtr.Size in this function.
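A minimal sketch of that suggestion, assuming the nuint alias (UInt64/UInt32 under #if BIT64) that Buffer.cs defines at the top of the file, so that sizeof(nuint) folds to a compile-time constant; the surrounding usage is illustrative, not the file's exact code:

// Before: a local initialized from IntPtr.Size, which the JIT must inline and fold away.
int alignment = IntPtr.Size - 1;
if (((int)dest & alignment) != 0)
{
    // ... align dest ...
}

// After: the constant appears directly in the expression, keeping the IL trivially simple.
if (((nuint)dest & (nuint)(sizeof(nuint) - 1)) != 0)
{
    // ... align dest ...
}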
@jkotas I thought mscorlib was crossgened, so this would have no runtime overhead?
My comment was more about keeping the IL simple so that there is nothing that can derail the JIT from producing good code for it.
@jamesqo, what are you using to bench? I think the biggest issue with copying so few bytes (0-1024) is that we are hitting the lower limits of what any onboard timer can actually measure. I have written a very simple bench here (https://gist.github.com/tannergooding/bb256733943fc5afecce6a51a640910d) for the existing code. I did it this way to ensure that only the data we care about is tracked. It is also trivial to modify the sample to start from an offset of source/destination so unaligned reads/writes can be tested.
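For reference, a minimal sketch of this style of micro-benchmark (illustrative only, not the linked gist; the class name, sizes and iteration counts are made up). The idea is to time many iterations per size so that the copy, rather than the timer resolution, dominates the measurement:

using System;
using System.Diagnostics;

unsafe class CopyBench
{
    const int Iterations = 1_000_000;

    static void Main()
    {
        byte[] src = new byte[1024];
        byte[] dst = new byte[1024];

        fixed (byte* pSrc = src, pDst = dst)
        {
            for (int size = 0; size <= 1024; size += 64)
            {
                // Warm up so JIT and paging costs are not part of the timed region.
                Buffer.MemoryCopy(pSrc, pDst, dst.Length, size);

                var sw = Stopwatch.StartNew();
                for (int i = 0; i < Iterations; i++)
                {
                    Buffer.MemoryCopy(pSrc, pDst, dst.Length, size);
                }
                sw.Stop();

                Console.WriteLine($"{size,5} bytes: {sw.Elapsed.TotalMilliseconds:F2} ms");
            }
        }
    }
}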
if ((len & 8) != 0)

// We have <= 15 bytes to copy at this point.

if ((mask & 8) != 0)
Consider if (mask == 0) { return; }; otherwise we are doing four pointless comparisons.
Also consider something like the following (the math should be correct; I validated it does what I expect for some simple scenarios). The core trick is sketched in code after the case list below.

- First Write
  - nuint i = 8u - (dest % sizeof(nuint)) (adjust for misaligned destination)
  - len -= i; (adjust for bytes already copied, not including initial bytes that will be double copied so we stay aligned)
  - nuint r = len % 16 (number of bytes remaining after we loop)
  - nuint l = sizeof(nuint) - (r % sizeof(nuint)); (number of bytes remaining that need to be written misaligned)
- Loop
- Final
cases 13-15:
*(int*)(dest + i) = *(int*)(src + i)
*(int*)(dest + i + 4) = *(int*)(src + i + 4)
*(int*)(dest + i + 8) = *(int*)(src + i + 8)
*(int*)(dest + i + 12 - l) = *(int*)(src + i + 12 - l)
case 12:
*(int*)(dest + i) = *(int*)(src + i)
*(int*)(dest + i + 4) = *(int*)(src + i + 4)
*(int*)(dest + i + 8) = *(int*)(src + i + 8)
cases 9-11:
*(int*)(dest + i) = *(int*)(src + i)
*(int*)(dest + i + 4 - l) = *(int*)(src + i + 4 - l)
case 8:
*(int*)(dest + i) = *(int*)(src + i)
*(int*)(dest + i + 4) = *(int*)(src + i + 4)
cases 5-7:
*(int*)(dest + i) = *(int*)(src + i)
*(int*)(dest + i + 4 - l) = *(int*)(src + i + 4 - l)
case 4:
*(int*)(dest + i) = *(int*)(src + i)
cases 1-3:
*(int*)(dest + i - l) = *(int*)(src + i - l)
case 0:
return
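The core trick in the "Final" step above, reduced to its simplest form, is a trailing write that is allowed to overlap bytes that were already copied, which removes the per-bit branching on the remaining count. A rough sketch of just that trick (a hypothetical helper, assuming a 64-bit process and a total length of at least 8 bytes):

// Finish the copy with a single unaligned 8-byte write that ends exactly at the
// last byte; up to 7 bytes before it are copied a second time, which is harmless.
static unsafe void CopyLast8Bytes(byte* dest, byte* src, ulong len)
{
    *(ulong*)(dest + len - 8) = *(ulong*)(src + len - 8);
}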
Excellent idea, we can use a switch statement here to avoid the branches/unaligned write (and since code size doesn't matter, as all this is crossgened and there is no runtime overhead for larger methods). I'm going to experiment with this and see if I can tease out some even better perf.
@jamesqo Regarding the 16-byte writes: the JIT copies a decimal with a single xmm load/store, so writing through a decimal* moves 16 bytes at a time on x64.
@omariom Interesting... I didn't know that before. I'm going to try replacing the inner loop with a *(decimal*)(dest + i) = *(decimal*)(src + i); for x64 and see how that goes.
@tannergooding I was using tests from the previous PR to benchmark: https://gist.github.com/jamesqo/337852c8ce09205a8289ce1f1b9b5382. I didn't get a chance to take a full look at your benchmark yet, but I noticed that you are making a call inside your inner loop.
@jamesqo what if you try to unroll to 2 xmm copies?

struct Dummy32 { long l1, l2, l3, l4; }
@jamesqo great work.
@omariom yes, I thought I read something about that too from @mikedn, so any struct defined as:

struct LongLong
{
    long l0;
    long l1;
}

should be usable for copying 16 bytes at a time. Haven't tested it myself, though. I do not know all the background for this, but couldn't
It works if the JIT knows the length, otherwise it calls a helper.
Would that mean something like this would work?

struct Bytes16
{
long bytes0;
long bytes1;
}
struct Bytes32
{
long bytes0;
long bytes1;
long bytes2;
long bytes3;
}
struct Bytes64
{
long bytes0;
long bytes1;
long bytes2;
long bytes3;
long bytes4;
long bytes5;
long bytes6;
long bytes7;
}

nuint end = len - 64;
nuint counter;
while (i <= end)
{
counter = i + 64;
*(Bytes64*)(dest + i) = *(Bytes64*)(src + i);
i = counter;
}
if ((len & 32) != 0)
{
counter = i + 32;
*(Bytes32*)(dest + i) = *(Bytes32*)(src + i);
i = counter;
}
if ((len & 16) != 0)
{
counter = i + 16;
*(Bytes16*)(dest + i) = *(Bytes16*)(src + i);
i = counter;
}
@jamesqo: https://gist.github.com/tannergooding/08702b99b26447b9e30e2126bba2c966, a somewhat optimized version. x64 is showing a 15% improvement (this is a general improvement when tested against all sizes and alignments from 0-1024 bytes, where alignments have source or destination misaligned by 1-15 bytes). x86 is showing a 7% improvement under the same scenarios. It may be interesting to note that the native implementation doesn't really start showing any benefits until around 8192 bytes, where (at least on Windows) the native implementation will begin using the prefetch instruction. In either case, the principles behind the code for doing unaligned writes as a pre/post step remain the same (and there are probably more optimizations to be had).
@benaadams, yes, that works as expected (unrolling to 4 xmm copies).
@tannergooding awesome. Could I suggest a code change? Since we are only using decimal to move 16 bytes at a time, put using Reg16 = System.Decimal; at the beginning of the code file and use that alias in the copy code. @tannergooding would you mind putting up the assembly as well? On that note, I believe defining these types should work, though I'm not sure there is any benefit perf-wise.
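A sketch of what that aliasing might look like (illustrative only; Copy16ByteBlocks is a hypothetical helper, and the alias has to sit at the top of the file). Reg16 is used purely as a 16-byte-wide "register" type and never as a numeric value:

using Reg16 = System.Decimal;   // 16 bytes; the JIT copies it with one xmm load/store on x64

// Mirrors the inner loop above; i and end are byte offsets into the buffers.
static unsafe void Copy16ByteBlocks(byte* dest, byte* src, ulong i, ulong end)
{
    while (i <= end)
    {
        *(Reg16*)(dest + i) = *(Reg16*)(src + i);
        i += 16;
    }
}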
Loop unrolling without the unroll :) Maybe at some point it will automagically take advantage of AVX-512? (and maybe AVX2 along the way)
Yes I agree it seems to be worth doing just for improving readability and perhaps future improvements in the JIT. @jkotas didn't we talk about perhaps adding
Shouldn't
I agree, but wasn't sure whether this was due to whether or not "custom structs" like
I'm not so sure if we should do that; it might be good for large buffers (avoid an extra branch/write every 16 bytes), but for smaller ones (in the ~30-50 range) it could be detrimental. I may attempt it if we increase the threshold at which we call the native memmove.
See the comment here; it looks like the cpblk implementation isn't as efficient as a manual copy, though I'm not sure whether the situation has changed since that comment was written.
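For anyone wanting to compare against cpblk directly (not something this PR does), one way to get the JIT's block-copy codegen from C# is System.Runtime.CompilerServices.Unsafe, whose copy helpers compile down to cpblk; a sketch with a hypothetical wrapper name:

using System.Runtime.CompilerServices;

static unsafe void CopyViaCpblk(byte* dest, byte* src, uint byteCount)
{
    // CopyBlockUnaligned emits cpblk without assuming pointer alignment.
    Unsafe.CopyBlockUnaligned(dest, src, byteCount);
}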
I'll make an
Really? I was under the impression it would have to use 4 GPRs and do something like move each int in the struct into a register, and then move those to the destination (8 movs total). At least, I think something like that happened when I looked at the disassembly for
Alright, here is the code of the inner loop when using a 16-byte struct:

G_M6842_IG28:
4C8D5810 lea r11, [rax+16]
488D3402 lea rsi, [rdx+rax]
4803C1 add rax, rcx
F30F6F06 movdqu xmm0, qword ptr [rsi]
F30F7F00 movdqu qword ptr [rax], xmm0
498BC3 mov rax, r11
493BC1 cmp rax, r9
76E5 jbe SHORT G_M6842_IG28

It's not as efficient as it could be. E.g. I think it should be something like:

LOOP:
; r11 => counter, rax => i, rdx => src, rcx => dest, r9 => end
lea r11, [rax+16] ; calculate counter
movdqu xmm0, qword ptr [rdx+rax] ; should be movdqa, but as nietras mentioned we need Unsafe.Assume
movdqu qword ptr [rcx+rax], xmm0
mov rax, r11
cmp rax, r9
jbe LOOP

edit: Reverting the changes @benaadams suggested to avoid making the loop condition depend on the writes still results in one additional instruction:

G_M6842_IG28:
4C8D1C02 lea r11, [rdx+rax]
488D3401 lea rsi, [rcx+rax]
F3410F6F03 movdqu xmm0, qword ptr [r11]
F30F7F06 movdqu qword ptr [rsi], xmm0
4883C010 add rax, 16
493BC1 cmp rax, r9
76E6 jbe SHORT G_M6842_IG28

I think the JIT tries to eliminate these dependencies by itself. The version that was just merged seemed to work with this trick; I'm not sure why it's not working now. Regardless, even though an extra
@jamesqo And as it is not your last optimization, you could reuse that Excel sheet in the future :)
The problem is we never really got a great definitive answer for this and other questions related to memory copying on .NET. I am currently running https://github.com/DotNetCross/Memory.Copies (which contains a single BenchmarkDotNet project), measuring many different memory copy variants such as the ones discussed in https://github.com/dotnet/coreclr/issues/2430 and aspnet/KestrelHttpServer#511. This is based on a benchmark project that @benaadams did. I added the original Buffer.Memmove, the @jamesqo variant merged into master and the @tannergooding variant. For the former two I revert to the native memmove. This benchmark, however, only runs on "normal" .NET, that is, the three JITs available there. I have yet to succeed in building coreclr and getting tests run, just haven't had time ;) It should be possible to add .NET Core benchmarks too. Of course, the benchmark will take a long, long time to run (I'm working on an alternative to BenchmarkDotNet which will focus on doing the minimal possible to get good measurements, since BDN simply takes too long for quick brainstorming); it will probably finish sometime tomorrow morning CET. And it doesn't run under the exact same conditions, but it would be good to see what advantages/disadvantages the different variants have. An early run of this with fewer parameters can be found here (I did some minor changes after this, though): https://gist.github.com/nietras/400dfe8954450825c1033e36ae35a6a4
I have updated my gist with the disassembly for both the 32-bit and 64-bit versions (this is disassembly generated using Desktop 4.6.2, not using CoreCLR, but they should produce very similar results on Windows). I can also add a version that shows the source code inline if desirable. @jamesqo, I handled the case of unrolling by doing 128-byte blocks and special casing anything under 128 bytes (the same code used for the special case is also used to handle any leftover bytes at the end of the large block copy loop). Edit: Updated the gist to have assembly and assembly w/ inline source for both. Additionally, updated to use a
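A rough sketch of the 128-byte block strategy described above (illustrative, not the gist's exact code; Block16 and CopyLargeBlocks are hypothetical names, with Block16 playing the role of the Bytes16 struct earlier in this thread):

struct Block16 { long a, b; }   // 16 bytes; one xmm load/store per assignment on x64

// Copies as many whole 128-byte blocks as possible; the same small-copy path that
// handles < 128 bytes (not shown) then finishes the remaining len - i bytes.
static unsafe void CopyLargeBlocks(byte* dest, byte* src, ref ulong i, ulong len)
{
    while (i + 128 <= len)
    {
        *(Block16*)(dest + i)       = *(Block16*)(src + i);
        *(Block16*)(dest + i + 16)  = *(Block16*)(src + i + 16);
        *(Block16*)(dest + i + 32)  = *(Block16*)(src + i + 32);
        *(Block16*)(dest + i + 48)  = *(Block16*)(src + i + 48);
        *(Block16*)(dest + i + 64)  = *(Block16*)(src + i + 64);
        *(Block16*)(dest + i + 80)  = *(Block16*)(src + i + 80);
        *(Block16*)(dest + i + 96)  = *(Block16*)(src + i + 96);
        *(Block16*)(dest + i + 112) = *(Block16*)(src + i + 112);
        i += 128;
    }
}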
Pessimistic I am: it actually finished just now; it took 10558059 ms, aka ~2.9 hours, to run on my desktop PC. See https://gist.github.com/nietras/22b8efd26af715ac32ccaf1a57a465da. I've found it is easiest to view by downloading the html, opening it in a browser and zooming out as you please. A graph would be welcome though ;) Here are a few random samples for RyuJIT 64-bit:
@nietras, it might be worth noting (for my sample) that it didn't properly handle overlapping buffers (throwing an exception in the original, which I believe you grabbed, and falling back to Buffer.MemoryCopy now). This isn't a problem if the bench didn't cover that, but it is if it did. I think another important distinction is that my code did not fall back to the native implementation for any size. But, as mentioned above, it doesn't seem like this is really important until the underlying implementation begins using the prefetch instruction (around 8192 bytes).
That's why I usually run such benchmarks in LinqPad. BDN is not (yet) convenient for that.
@tannergooding Your version looks interesting. I'm going to copy your approach partially and replace the double-word writes with XMM in the switch cases as well. Also, it looks like the JIT is generating some redundant code for the UInt128 writes:

*(UInt128*)(dst + sizeof_UInt128) = *(UInt128*)(src + sizeof_UInt128);
00007FF953CE4A85 lea rax,[rsi+10h]
00007FF953CE4A89 lea r8,[rcx+10h]
00007FF953CE4A8D movdqu xmm0,xmmword ptr [rax]
00007FF953CE4A91 movdqu xmmword ptr [r8],xmm0
*(UInt128*)(dst + (sizeof_UInt128 * 2)) = *(UInt128*)(src + (sizeof_UInt128 * 2));
00007FF953CE4A96 lea rax,[rsi+20h]
00007FF953CE4A9A lea r8,[rcx+20h]
00007FF953CE4A9E movdqu xmm0,xmmword ptr [rax]
00007FF953CE4AA2 movdqu xmmword ptr [r8],xmm0

This looks like it should just be

movdqu xmm0,xmmword ptr [rsi+10h]
movdqu xmmword ptr [rcx+10h],xmm0
movdqu xmm0,xmmword ptr [rsi+20h]
movdqu xmmword ptr [rcx+20h],xmm0

without the extra lea instructions. Also, I don't know how you're getting

G_M6842_IG28:
8B75F0 mov esi, dword ptr [ebp-10H]
0375EC add esi, dword ptr [ebp-14H]
8B7DEC mov edi, dword ptr [ebp-14H]
03FB add edi, ebx
A5 movsd
A5 movsd
A5 movsd
A5 movsd
8345EC10 add dword ptr [ebp-14H], 16
3955EC cmp dword ptr [ebp-14H], edx
76E8 jbe SHORT G_M6842_IG28

(edit: was responding to your comment 3 comments up)
I did a simple sum over the Scaled column for each method for 64-bit RyuJIT to get some idea about each method's applicability:

@tannergooding yes, I took your version verbatim, so it does not revert to the native implementation for any size.

My Excel skills are really being tested today (had to google http://superuser.com/questions/750353/excel-scatter-plot-with-multiple-series-from-1-table), but here is a plot for small byte copies, where I again plot the Scaled result:

The benchmarks do have some issues with some methods doing input parameter checks and others not. If you guys have suggestions for changes to the benchmarks, like making it all unsafe or omitting the checks or something, I can make the changes. And, of course, include new variants.
@Anderman shouldn't we assume that data is cached? If it's uncached, performance will be completely dominated by cache line loads and what we do does not matter much. Optimizations should matter almost exclusively for the cached case. If memory can be delivered at 10 GB/s in a streaming fashion, that's about 3 bytes per cycle on a ~3 GHz core. We therefore have ~20 cycles to spend on copying one 64-byte cache line. That's easy to do in 20 cycles; we actually get even more than 20 instructions, maybe 2-3x.
@GSPP not really. Given that memory loads into the lower levels of the memory hierarchy dominate the whole runtime, the real global minimum happens when you can retire as many instructions as possible even considering the latency caused by memory access. The real trick is in overlapping the memory accesses on the cache misses with the operations on L1 hits (without consuming all execution ports), or paying for the prefetch of multiple cache misses in batches (you can copy kilobytes at a time essentially for free if you don't pollute the cache with the written data, using non-temporal writes).
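As an aside, the non-temporal-write idea can be expressed with the hardware intrinsics that later shipped in System.Runtime.Intrinsics.X86 (not available at the time of this thread); a sketch under the stated assumptions, with a hypothetical helper name:

using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Streams 16-byte blocks to dest without polluting the cache with the written data.
// Assumes Sse2.IsSupported, dest is 16-byte aligned, and len is a multiple of 16.
static unsafe void CopyNonTemporal(byte* dest, byte* src, ulong len)
{
    for (ulong i = 0; i < len; i += 16)
    {
        Vector128<byte> v = Sse2.LoadVector128(src + i);
        Sse2.StoreAlignedNonTemporal(dest + i, v);
    }
    Sse.StoreFence();   // make the streaming stores globally visible before returning
}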
@GSPP @redknightlois
This will slow down your performance by a factor of 10-20 for small sizes, but not so much for big arrays. Prefetching seems to work well. It is interesting that the best copy method depends on cached vs. not-cached data.
@GSPP I think you are right. There will be programs where all data is already in the L1 cache before the copy method is called, so the current tests are good. Maybe we can extend the tests with not-cached data to see if that leads to other solutions.
@redknightlois I did some tests with a DIY performance test program. The results for the cached version are the same, but the tests run faster. I also did a test with random array access. Will post the results soon.
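A sketch of how that kind of not-cached / random-access test can be set up (illustrative names and sizes, not the actual test program): copy from random offsets within a pool much larger than the last-level cache, so most iterations miss in cache rather than re-hitting the same lines.

using System;
using System.Diagnostics;

unsafe class UncachedCopyBench
{
    const int PoolSize = 256 * 1024 * 1024;   // much larger than a typical LLC
    const int CopySize = 4096;

    static void Main()
    {
        byte[] pool = new byte[PoolSize];
        byte[] dst = new byte[CopySize];
        var rng = new Random(42);

        fixed (byte* pPool = pool, pDst = dst)
        {
            var sw = Stopwatch.StartNew();
            for (int i = 0; i < 100_000; i++)
            {
                // A random source offset per iteration approximates a cold-cache workload
                // (the Random call itself adds a small, roughly constant overhead).
                int offset = rng.Next(0, PoolSize - CopySize);
                Buffer.MemoryCopy(pPool + offset, pDst, CopySize, CopySize);
            }
            sw.Stop();

            Console.WriteLine($"{sw.Elapsed.TotalMilliseconds:F1} ms for 100,000 random {CopySize}-byte copies");
        }
    }
}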
@Anderman do you have the full code for this anywhere? I'd be happy to put it in the repo if you like.
Yes, I have to clean it up a little bit. The results are interesting. It looks like buffer.array and memmove use an extra data access, and when the data is not cached this causes an extra delay.
@Anderman is that ns per operation?
@redknightlois that is ns per operation. Array.Copy will copy 32 bytes in 100 ns (random access).
@Anderman as far as I understand, your benchmarks are based on your own basic measurements on .NET Core, with no pre-jit or warmup iterations, so there probably is a bias (although probably a small one) in these measurements. Anyway, that is not my main question; I don't understand how your measurements can show that
I included an Excel sheet with the four different tests, from 0-8192. Test results from an i7-3610QM. (Random, cached, sequence, align test)
@nietras I have run the test with the original benchmark, and there msvcrt.dll was slower for 1-15. Could it be another version of msvcrt.dll? How can I check that? Procmon, or are there easier ways?
@Anderman I think it is because your version of

if (src == null || dst == null) throw new ArgumentNullException(nameof(src));
if (count < 0 || srcOffset < 0 || dstOffset < 0) throw new ArgumentOutOfRangeException(nameof(count));
if (srcOffset + count > src.Length) throw new ArgumentException(nameof(src));
if (dstOffset + count > dst.Length) throw new ArgumentException(nameof(dst));
@nietras Yep, that's why. Do you know how you can use
Test on an i5-4590
@nietras I sent you a pull request with a new test project.

My testing summary on 2 different test machines:
Might have to retest with #7198
Closing due to length issues. I may open a new PR if I get a chance to work on this in the future.
This is a follow-up to #6627, since that was accidentally merged. I've experimented with @tannergooding's idea of doing word writes at the beginning/end of the buffer before actually aligning dest, which shaves off a couple of branches and seems to improve performance. (A rough sketch of this idea is included at the end of this description.)

This is the performance data I have for now (the gist will be updated continually): https://gist.github.com/jamesqo/18b61a17a65489b5dd8eaf0617b9099d. You need to press Ctrl+F and search for ---BASELINE--- to see the times for the existing version (the one that just got merged). During the other PR I was unable to get consistent timings between benchmark runs (and BenchmarkDotNet takes waaay too long), so what I've done instead is run the tests multiple times and see which numbers keep popping up. Here are the 'average' times for each configuration:

Note that this should not be merged yet, as I haven't tested for i386 and have not collected data for when dest is unaligned. Also, I'm thinking of possibly raising the threshold at which we do a QCall into the native memmove, since copying 511 bytes seems to be significantly faster than 512 (see my comment in the other PR).

cc @jkotas @tannergooding @GSPP @benaadams
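A rough sketch of the word-writes-before-aligning idea (illustrative, 64-bit, assumes len >= 8; CopyHead is a hypothetical helper, not this PR's code): do one unaligned 8-byte write covering the start of the buffer, then bump dest up to the next 8-byte boundary and continue with aligned writes; the few bytes between the boundary and the end of that first write simply get copied twice.

static unsafe void CopyHead(ref byte* dest, ref byte* src, ref ulong len)
{
    // Unaligned write covering the first 8 bytes.
    *(ulong*)dest = *(ulong*)src;

    // Advance to the next 8-byte boundary of dest (1..8 bytes consumed); the first
    // aligned write that follows may overlap bytes already written above, which is harmless.
    ulong adjust = 8 - ((ulong)dest & 7);
    dest += adjust;
    src  += adjust;
    len  -= adjust;
}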