
Decrease writes to local variables in Buffer.MemoryCopy #6627

Merged: 18 commits merged into dotnet:master on Aug 6, 2016

Conversation

@jamesqo commented Aug 5, 2016

In Buffer.MemoryCopy we currently make 4 writes every time we copy some data: 1 to update *dest, 1 to update dest, 1 to update src, and 1 to update len. I've decreased it to 2: one for the actual data write and one to update a new local variable i, which keeps track of how many bytes we are into the buffer. All writes are now made using

```cs
*(dest + i + x) = *(src + i + x)
```

which has no additional overhead, since these are converted to memory addressing operands by the JIT.

Another change I made was to add a few extra cases to the switch-case at the beginning that does copying for small sizes without any branches. It now covers sizes 0-22. This is beneficial to the main codepath, since we can convert the unrolled loop to a do..while loop and save an extra branch at the beginning. (Max 7 bytes for alignment plus 16 for 1 iteration of the loop, so the minimum number of bytes we can copy without checking whether we should stop is 23.)
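To make that concrete, here is a minimal sketch of the loop shape being described. It is illustrative only, not the actual corelib code: the helper name CopyCore is made up, it assumes len >= 23 so the do..while can always run at least once, and it elides the small-size switch and the code that aligns dest.

```cs
// Illustrative sketch only, not the actual corelib implementation.
internal static unsafe class MemoryCopySketch
{
    // Assumes len >= 23 (up to 7 bytes for alignment + 16 for one loop iteration)
    // and elides the small-size switch and the code that aligns dest.
    public static void CopyCore(byte* dest, byte* src, ulong len)
    {
        ulong i = 0;

        // ... align dest here, advancing i by up to 7 bytes ...

        ulong end = len - 16;
        do
        {
            // The address computations (dest + i + x) fold into memory addressing
            // operands (base + index + displacement), so the only local variable
            // written per iteration is i.
            *(int*)(dest + i) = *(int*)(src + i);
            *(int*)(dest + i + 4) = *(int*)(src + i + 4);
            *(int*)(dest + i + 8) = *(int*)(src + i + 8);
            *(int*)(dest + i + 12) = *(int*)(src + i + 12);
            i += 16;
        }
        while (i <= end);

        len -= i; // bytes left after the unrolled loop
        // ... copy the remaining 0-15 bytes here ...
    }
}
```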

This PR increases the performance of MemoryCopy by 10-20% for most buffer sizes on x86; you can see the performance test/results (and the generated assembly for each version) here: https://gist.github.com/jamesqo/337852c8ce09205a8289ce1f1b9b5382. (Note that this codepath is also used by wstrcpy at the moment, so this directly affects many common String operations.)

Note that I haven't tested this on ARM or anything (only i386 and x86_64), so I'm not sure how these changes will affect other platforms.

I have yet to write up tests for this in CoreFX, but I will later when I get the time.

cc @jkotas @nietras @mikedn @GSPP @omariom @bbowyersmyth

@GSPP commented Aug 5, 2016

Is there a policy/guideline for squashing commits? All these intermediate commits do not add value to the history.

@jamesqo (Author) commented Aug 5, 2016

Also tagging @benaadams because of #2430

@jamesqo (Author) commented Aug 5, 2016

@GSPP That's true. I think a while back GitHub added a feature to automatically squash commits when merging PRs: https://github.com/blog/2141-squash-your-commits. So they'll really only be taking up 1 entry in the commit log.

@ghost commented Aug 5, 2016

@GSPP, on GitHub, repo maintainers can now squash the commits at the time of merge from the web UI. See https://help.github.com/articles/about-pull-request-merge-squashing/. I think @jkotas et al. use this feature, so the contributors don't need to worry about squashing. I personally prefer one commit per PR, unless there is a test case which warrants a separate commit.

@jamesqo, great work on the performance front, mate! Keep it up. 🚀 👍

@ghost commented Aug 5, 2016

@jamesqo, you beat me by seconds..! 😄

// these to use memory addressing operands.
// So the only cost is a bit of code size,
// which is made up for by the fact that
// we save on writes to dest/src.

The lines seem to be a little short; you should break only after at least 80 chars.

@jkotas (Member) commented Aug 5, 2016

Note that I haven't tested this on ARM or anything

@rahku Could you or somebody else with an arm64 machine please measure the perf impact of this change on arm64? I would like to make sure that it is not making arm64 slower. (The link to the source code of the microbenchmark is in the PR description.)

@jkotas (Member) commented Aug 5, 2016

@jamesqo Very nice improvements - thank you!

@tannergooding (Member)
Is always doing an fcall to always use the CRT implementation of memcpy really so much more expensive (it seems to work plenty well for System.Math)?

The CRT implementation can do a lot of things this can't, such as (on modern CPUs where it is faster) just calling rep movsb.

Where rep movsb is not faster, it can do all cases of 1-16 bytes in at most 4 writes (which we match). It can then do the case of 17-32 bytes in 2 writes (always) using XMM. For all cases past 32 bytes, it can do it in at most (bytecount / 16) + 2 writes.

  • For an aligned destination where byte count is divisible by 16, it will just copy 16 bytes at a time using XMM until complete.
  • For an aligned destination where byte count is not divisible by 16, it will copy 16 bytes at a time, then it will move back 16 - (bytecount % 16) bytes and do a single unaligned write (you write (16 - bytecount % 16) bytes twice).
  • For an unaligned destination where byte count is divisible by 16, the first write will be unaligned and then it will move forward (bytecount % 16) bytes and begin doing aligned writes (you write (bytecount % 16) bytes twice).
  • For an unaligned destination where byte count is not divisible by 16, it will do a combination of the previous two (unaligned write, aligned writes, unaligned write), causing (bytecount % 16) + (16 - (bytecount % 16)) to be written twice.

For larger loops, it can also do some other intelligent things, such as using XMM and doing partially unrolled loops copying in 128, 64, 32, and 16 byte blocks (depending on the size of bytecount).

It also doesn't have to do 4 checks for the remaining 1-15 bytes with up to 4 separate writes to handle them, as it can do the same in a single unaligned XMM write (although we could optimize this slightly by exiting if there are no remaining bytes).
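A minimal sketch of the overlapping-write idea described above, scaled down to 8-byte words instead of 16-byte XMM stores; the helper name and structure are illustrative, not the CRT's actual code.

```cs
// Illustrative sketch of the overlapping-tail trick, scaled down to 8-byte words
// instead of 16-byte XMM stores; not the CRT's actual code.
internal static unsafe class OverlappingTailCopySketch
{
    // Requires len >= 8: the final store ends exactly at len and overlaps bytes
    // the loop already wrote, so no per-byte tail handling (or branches) is needed.
    public static void Copy(byte* dest, byte* src, int len)
    {
        int i = 0;
        while (i + 8 <= len)
        {
            *(ulong*)(dest + i) = *(ulong*)(src + i);
            i += 8;
        }

        if (i < len)
        {
            // One final unaligned word; its first (8 - (len - i)) bytes duplicate
            // bytes already copied above.
            *(ulong*)(dest + (len - 8)) = *(ulong*)(src + (len - 8));
        }
    }
}
```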

}
while (i <= end);

len -= i; // len now represents how many bytes are left after the unrolled loop
@benaadams (Member) Aug 5, 2016

len -= end;

Before the loop? Is it dependent on the result in i? (Not sure, as the loop condition is <=.)

If not, its result can be prepared well ahead of its next uses on the lines below.

@jamesqo (Author)

@benaadams end is just len - 16, so that would always be 16. I suppose this could come before the loop, however, since i is always incremented by 16, so the lower 4 bits would still be the same.

@jkotas (Member) commented Aug 5, 2016

fcall to always use the CRT implementation of memcpy

It is not possible to fcall the CRT implementation directly. It does not work because of GC interaction (it would result in potential GC starvation), EH interaction, and a calling convention mismatch (on x86).

@tannergooding (Member)
@jkotas, and doing the qcall is more expensive for all of the cases that are covered by the manual implementation?

@rahku commented Aug 5, 2016

@jkotas will measure on arm64

@jkotas (Member) commented Aug 5, 2016

A qcall has a fixed overhead that it has to pay for. The cut-off is chosen for the average case, where the native implementation starts paying off despite that overhead. There may be an opportunity for tuning the cut-off point, but it is close to impossible to tune it perfectly. There will always be some cases where the other implementation is better.

@tannergooding (Member)
I think this is one of the places where tuning the cut-off point is desirable. I would find it difficult to believe that doing 4x the number of writes (with no consideration to alignment or page boundaries) is faster for all of the cases covered by the manual implementation (up to 512 bytes).

@GSPP commented Aug 5, 2016

@jkotas GC starvation could be solved by calling the native method in chunks of, say, 4MB. A 4MB memcpy dominates any native call cost by far. And a 4MB chunk is enough to pull off any tricks such as alignment peeling and XMM stuff.

That way the CRT version could be used.
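A rough sketch of that chunking idea. It is purely illustrative: it P/Invokes the Windows CRT's memmove directly, which is not how CoreCLR reaches the CRT, and the 4MB chunk size is just the figure suggested above.

```cs
using System;
using System.Runtime.InteropServices;

// Illustrative only: CoreCLR does not call the CRT this way; this just shows the
// chunking idea (bounded native calls so a pending GC is never delayed for long).
internal static unsafe class ChunkedNativeCopySketch
{
    [DllImport("msvcrt.dll", EntryPoint = "memmove", CallingConvention = CallingConvention.Cdecl)]
    private static extern IntPtr Memmove(byte* dest, byte* src, UIntPtr count);

    private const uint ChunkSize = 4 * 1024 * 1024; // the 4MB figure suggested above

    public static void Copy(byte* dest, byte* src, ulong len)
    {
        while (len > 0)
        {
            uint chunk = len > ChunkSize ? ChunkSize : (uint)len;
            Memmove(dest, src, (UIntPtr)chunk);
            dest += chunk;
            src += chunk;
            len -= chunk;
        }
    }
}
```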

@jkotas (Member) commented Aug 5, 2016

Right, that is why we use the CRT version for anything greater than 512 bytes - it is even better than doing the chunking.

@rahku commented Aug 5, 2016

There is an improvement on arm64 when copying fewer bytes, but not otherwise. There is no degradation, though.
https://gist.github.com/rahku/00fc18932666fb3f92271ea1ae07026a

@jkotas (Member) commented Aug 5, 2016

@rahku Thanks!

len--;
if (((int)dest & 2) == 0)
goto Aligned;
*(dest + i) = *(src + i);

@bbowyersmyth

Nit: i is always zero here

@jamesqo (Author)

@bbowyersmyth Yep, I included that just in case someone changes the code in the future. From what I can see it doesn't make a difference, since the JIT is able to do constant propagation here and i += 1 is just compiled to mov i, 1.

@jamesqo (Author) commented Aug 6, 2016

@tannergooding I think the idea of doing unaligned word writes at the beginning/end of the buffer sounds interesting, it should let us shave off a couple of branches. I'm going to try that in this PR and see what the benchmarks say.

I think this is one of the places where tuning the cut-off point is desirable. I would find it difficult to believe that doing 4x the number of writes (with no consideration to alignment or page boundaries) is faster for all of the cases covered by the manual implementation (up to 512 bytes).

It's only 2x the number of writes for x64, and with these changes copying 511 bytes seems to be significantly faster than 512 (~3.4 vs 3.8). I've yet to garner any perf data for this vs. the native implementation on x86, though.

Also, the managed implementation does pay attention to alignment: it tries to align dest before doing full-word writes.
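For reference, a sketch of what aligning dest up front can look like; the helper name and the modulus arithmetic are illustrative, not the exact corelib code.

```cs
// Illustrative sketch: copy leading bytes one at a time until dest is word-aligned,
// then the caller continues with full-word writes. Only dest is aligned; reads from
// src may remain unaligned.
internal static unsafe class AlignmentSketch
{
    public static int CopyUntilDestAligned(byte* dest, byte* src, int len)
    {
        const int WordSize = sizeof(ulong); // 8 bytes here; the real code uses the native word size
        int misalignment = (int)((ulong)dest & (ulong)(WordSize - 1));
        int fixup = misalignment == 0 ? 0 : WordSize - misalignment;
        if (fixup > len)
            fixup = len;

        for (int j = 0; j < fixup; j++)
            dest[j] = src[j];

        return fixup; // caller resumes the word-sized copy at dest + fixup / src + fixup
    }
}
```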

@jkotas @GSPP Regarding the discussion about using xmm registers, I think it would be very interesting if we could move the implementation of MemoryCopy to the CoreFX repo while keeping the other methods of Buffer here, as was done with Environment. That way we could make things much faster, since we have access to things like Unsafe.Read / Vector<T> over there, while in CoreCLR we're stuck with word-sized copies.

@jkotas (Member) commented Aug 6, 2016

while in CoreCLR we're stuck

This should be done the other way around, because memory copy is always going to be needed for corelib: the small parts of CoreFX required for the corelib implementation can be duplicated in CoreCLR, and large parts can be moved over if necessary. We want to minimize dependencies from CoreCLR to CoreFX. The other way is fine - that is the case with Environment.

@jkotas merged commit 32fe063 into dotnet:master on Aug 6, 2016
@jkotas (Member) commented Aug 6, 2016

This was a really nice change. Could you please make the same change in CoreRT as well?

@jamesqo (Author) commented Aug 6, 2016

@jkotas Actually, this wasn't quite ready for merging yet; I mentioned in my other comment that I was experimenting with eliminating a couple of branches at the beginning/end of the method ;)

I'll port this to CoreRT once that's done. Would it be possible to un-merge this temporarily and reopen (don't know if it's possible to do that with GitHub), or should I submit a new PR for that?

@jkotas (Member) commented Aug 6, 2016

Sorry about merging this prematurely ... please submit a new PR if there are additional tweaks you would like to make.

@jamesqo (Author) commented Sep 2, 2016

@jkotas You mentioned a few comments up:

The small parts of CoreFX required for the corelib implementation can be duplicated in CoreCLR, and large parts can be moved over if necessary.

How large is "large"? To be able to use Vector.Equals in e.g. String.Equals, it looks like we'd have to pull in this gigantic .tt file which presumably adds another step to the build process, and then update it every time the CoreFX version is updated?

@jkotas (Member) commented Sep 2, 2016

You do not need the whole file; just the one method you care about. A trickier problem with Vector is that it is a special-cased type, based on name & assembly name, in several places, e.g.: https://github.com/dotnet/coreclr/blob/master/src/vm/assembly.cpp#L307 or https://github.com/dotnet/coreclr/blob/master/src/vm/methodtablebuilder.cpp#L1206. This special casing would not apply automatically to your corelib copy; it would need to be tweaked to make it work.

picenka21 pushed a commit to picenka21/runtime that referenced this pull request Feb 18, 2022
Decrease writes to local variables in Buffer.MemoryCopy (dotnet/coreclr#6627)

Commit migrated from dotnet/coreclr@32fe063