
Decrease writes to local variables in Buffer.MemoryCopy #6627

Merged: 18 commits merged into dotnet:master on Aug 6, 2016

Conversation

@jamesqo commented Aug 5, 2016

In Buffer.MemoryCopy we currently make 4 writes every time we copy some data: 1 to update *dest, 1 to update dest, 1 to update src, and 1 to update len. I've decreased it to 2: one for the actual data write and one to update a new local variable i, which keeps track of how many bytes we are into the buffer. All writes are now made using

```cs
*(dest + i + x) = *(src + i + x)
```

which has no additional overhead, since these are converted to memory addressing operands by the JIT.

Another change I made was to add a few extra cases to the switch-case at the beginning that does copying for small sizes without any branches. It now covers sizes 0-22. This is beneficial to the main codepath, since we can convert the unrolled loop to a do..while loop and save an extra branch at the beginning. (Max 7 bytes for alignment plus 16 for 1 iteration of the loop, so the minimum number of bytes we can copy without checking whether we should stop is 23.)
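To make that concrete, here is a minimal sketch of the loop shape being described. It is illustrative only, not the actual corelib code: the helper name CopyCore is made up, it assumes len >= 23 so the do..while can always run at least once, and it elides the small-size switch and the code that aligns dest.

```cs
// Illustrative sketch only, not the actual corelib implementation.
internal static unsafe class MemoryCopySketch
{
    // Assumes len >= 23 (up to 7 bytes for alignment + 16 for one loop iteration)
    // and elides the small-size switch and the code that aligns dest.
    public static void CopyCore(byte* dest, byte* src, ulong len)
    {
        ulong i = 0;

        // ... align dest here, advancing i by up to 7 bytes ...

        ulong end = len - 16;
        do
        {
            // The address computations (dest + i + x) fold into memory addressing
            // operands (base + index + displacement), so the only local variable
            // written per iteration is i.
            *(int*)(dest + i) = *(int*)(src + i);
            *(int*)(dest + i + 4) = *(int*)(src + i + 4);
            *(int*)(dest + i + 8) = *(int*)(src + i + 8);
            *(int*)(dest + i + 12) = *(int*)(src + i + 12);
            i += 16;
        }
        while (i <= end);

        len -= i; // bytes left after the unrolled loop
        // ... copy the remaining 0-15 bytes here ...
    }
}
```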

This PR increases the performance of MemoryCopy by 10-20% for most buffer sizes on x86; you can see the performance test/results (and the generated assembly for each version) here: https://gist.github.com/jamesqo/337852c8ce09205a8289ce1f1b9b5382. (Note that this codepath is also used by wstrcpy at the moment, so this directly affects many common String operations.)

Note that I haven't tested this on ARM or anything (only i386 and x86_64), so I'm not sure how these changes will affect other platforms.

I have yet to write up tests for this in CoreFX, but I will later when I get the time.

cc @jkotas @nietras @mikedn @GSPP @omariom @bbowyersmyth

@GSPP commented Aug 5, 2016

Is there a policy/guideline for squashing commits? All these intermediate commits do not add value to the history.

@jamesqo (Author) commented Aug 5, 2016

Also tagging @benaadams because of #2430

@jamesqo (Author) commented Aug 5, 2016

@GSPP That's true. I think a while back GitHub added a feature to automatically squash commits when merging PRs: https://github.com/blog/2141-squash-your-commits. So they'll really only be taking up 1 entry in the commit log.

@ghost commented Aug 5, 2016

@GSPP, on GitHub, repo maintainers can now squash the commits at the time of merge from the web UI. See https://help.github.com/articles/about-pull-request-merge-squashing/. I think @jkotas et al. use this feature, so the contributors don't need to worry about squashing. I personally prefer one commit per PR, unless there is a test case which warrants a separate commit.

@jamesqo, great work on the performance front, mate! Keep it up. 🚀 👍

@ghost commented Aug 5, 2016

@jamesqo, you beat me by seconds..! 😄

// these to use memory addressing operands.
// So the only cost is a bit of code size,
// which is made up for by the fact that
// we save on writes to dest/src.

The lines seem to be a little short; you should break only after at least 80 chars.

@jkotas (Member) commented Aug 5, 2016

Note that I haven't tested this on ARM or anything

@rahku Could you or somebody else with an arm64 machine please measure the perf impact of this change on arm64? I would like to make sure that it is not making arm64 slower. (The link to the source code of the microbenchmark is in the PR description.)

@jkotas (Member) commented Aug 5, 2016

@jamesqo Very nice improvements - thank you!

@tannergooding (Member)
Is always doing an fcall to always use the CRT implementation of memcpy really so much more expensive (it seems to work plenty well for System.Math)?

The CRT implementation can do a lot of things this can't, such as (on modern CPUs where it is faster) just calling rep movsb.

Where rep movsb is not faster, it can do all cases of 1-16 bytes in at most 4 writes (which we match). It can then do the case of 17-32 bytes in 2 writes (always) using XMM. For all cases past 32 bytes, it can do it in at most (bytecount / 16) + 2 writes.

  • For an aligned destination where byte count is divisible by 16, it will just copy 16 bytes at a time using XMM until complete.
  • For an aligned destination where byte count is not divisible by 16, it will copy 16 bytes at a time, then it will move back 16 - (bytecount % 16) bytes and do a single unaligned write (you write (16 - bytecount % 16) bytes twice).
  • For an unaligned destination where byte count is divisible by 16, the first write will be unaligned and then it will move forward (bytecount % 16) bytes and begin doing aligned writes (you write (bytecount % 16) bytes twice).
  • For an unaligned destination where byte count is not divisible by 16, it will do a combination of the previous two (unaligned write, aligned writes, unaligned write), causing (bytecount % 16) + (16 - (bytecount % 16)) to be written twice.

For larger loops, it can also do some other intelligent things, such as using XMM and doing partially unrolled loops copying in 128, 64, 32, and 16 byte blocks (depending on the size of bytecount).

It also doesn't have to do 4 checks for the remaining 1-15 bytes with up to 4 separate writes to handle them, as it can do the same in a single unaligned XMM write (although we could optimize this slightly by exiting if there are no remaining bytes).
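A minimal sketch of the overlapping-write idea described above, scaled down to 8-byte words instead of 16-byte XMM stores; the helper name and structure are illustrative, not the CRT's actual code.

```cs
// Illustrative sketch of the overlapping-tail trick, scaled down to 8-byte words
// instead of 16-byte XMM stores; not the CRT's actual code.
internal static unsafe class OverlappingTailCopySketch
{
    // Requires len >= 8: the final store ends exactly at len and overlaps bytes
    // the loop already wrote, so no per-byte tail handling (or branches) is needed.
    public static void Copy(byte* dest, byte* src, int len)
    {
        int i = 0;
        while (i + 8 <= len)
        {
            *(ulong*)(dest + i) = *(ulong*)(src + i);
            i += 8;
        }

        if (i < len)
        {
            // One final unaligned word; its first (8 - (len - i)) bytes duplicate
            // bytes already copied above.
            *(ulong*)(dest + (len - 8)) = *(ulong*)(src + (len - 8));
        }
    }
}
```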

}
while (i <= end);

len -= i; // len now represents how many bytes are left after the unrolled loop
@benaadams (Member) Aug 5, 2016

len -= end;

Before the loop? Is it dependent on the result in i? (Not sure, as the loop condition is <=.)

If not, its result can be prepared well ahead of its next uses on the lines below.

@jamesqo (Author)

@benaadams end is just len - 16, so that would always be 16. I suppose this could come before the loop, however, since i is always incremented by 16, so the lower 4 bits would still be the same.

@jkotas (Member) commented Aug 5, 2016

fcall to always use the CRT implementation of memcpy

It is not possible to fcall the CRT implementation directly. It does not work because of GC interaction (it would result in potential GC starvation), EH interaction, and a calling convention mismatch (on x86).

@tannergooding (Member)
@jkotas, and doing the qcall is more expensive for all of the cases that are covered by the manual implementation?

@rahku commented Aug 5, 2016

@jkotas will measure on arm64

@jkotas (Member) commented Aug 5, 2016

A qcall has a fixed overhead that it has to pay for. The cut-off is chosen for the average case, where the native implementation starts paying off despite that overhead. There may be an opportunity for tuning the cut-off point, but it is close to impossible to tune it perfectly. There will always be some cases where the other implementation is better.

@tannergooding (Member)
I think this is one of the places where tuning the cut-off point is desirable. I would find it difficult to believe that doing 4x the number of writes (with no consideration to alignment or page boundaries) is faster for all of the cases covered by the manual implementation (up to 512 bytes).

@GSPP commented Aug 5, 2016

@jkotas GC starvation could be solved by calling the native method in chunks of, say, 4MB. A 4MB memcpy dominates any native call cost by far. And a 4MB chunk is enough to pull off any tricks such as alignment peeling and XMM stuff.

That way the CRT version could be used.
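A rough sketch of that chunking idea. It is purely illustrative: it P/Invokes the Windows CRT's memmove directly, which is not how CoreCLR reaches the CRT, and the 4MB chunk size is just the figure suggested above.

```cs
using System;
using System.Runtime.InteropServices;

// Illustrative only: CoreCLR does not call the CRT this way; this just shows the
// chunking idea (bounded native calls so a pending GC is never delayed for long).
internal static unsafe class ChunkedNativeCopySketch
{
    [DllImport("msvcrt.dll", EntryPoint = "memmove", CallingConvention = CallingConvention.Cdecl)]
    private static extern IntPtr Memmove(byte* dest, byte* src, UIntPtr count);

    private const uint ChunkSize = 4 * 1024 * 1024; // the 4MB figure suggested above

    public static void Copy(byte* dest, byte* src, ulong len)
    {
        while (len > 0)
        {
            uint chunk = len > ChunkSize ? ChunkSize : (uint)len;
            Memmove(dest, src, (UIntPtr)chunk);
            dest += chunk;
            src += chunk;
            len -= chunk;
        }
    }
}
```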

@jkotas (Member) commented Aug 5, 2016

Right, that is why we use the CRT version for anything greater than 512 bytes - it is even better than doing the chunking.

@rahku commented Aug 5, 2016

There is an improvement on arm64 when copying fewer bytes, but not otherwise. There is no degradation, though.
https://gist.github.com/rahku/00fc18932666fb3f92271ea1ae07026a

@jkotas (Member) commented Aug 5, 2016

@rahku Thanks!

len--;
if (((int)dest & 2) == 0)
goto Aligned;
*(dest + i) = *(src + i);

@bbowyersmyth

Nit: i is always zero here

@jamesqo (Author)

@bbowyersmyth Yep, I included that just in case someone changes the code in the future. From what I can see it doesn't make a difference, since the JIT is able to do constant propagation here and i += 1 is just compiled to mov i, 1.

@jamesqo (Author) commented Aug 6, 2016

@tannergooding I think the idea of doing unaligned word writes at the beginning/end of the buffer sounds interesting, it should let us shave off a couple of branches. I'm going to try that in this PR and see what the benchmarks say.

I think this is one of the places where tuning the cut-off point is desirable. I would find it difficult to believe that doing 4x the number of writes (with no consideration to alignment or page boundaries) is faster for all of the cases covered by the manual implementation (up to 512 bytes).

It's only 2x the number of writes for x64, and with these changes copying 511 bytes seems to be significantly faster than 512 (~3.4 vs 3.8). I've yet to garner any perf data for this vs. the native implementation on x86, though.

Also, the managed implementation does pay attention to alignment: it tries to align dest before doing full-word writes.
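For reference, a sketch of what aligning dest up front can look like; the helper name and the modulus arithmetic are illustrative, not the exact corelib code.

```cs
// Illustrative sketch: copy leading bytes one at a time until dest is word-aligned,
// then the caller continues with full-word writes. Only dest is aligned; reads from
// src may remain unaligned.
internal static unsafe class AlignmentSketch
{
    public static int CopyUntilDestAligned(byte* dest, byte* src, int len)
    {
        const int WordSize = sizeof(ulong); // 8 bytes here; the real code uses the native word size
        int misalignment = (int)((ulong)dest & (ulong)(WordSize - 1));
        int fixup = misalignment == 0 ? 0 : WordSize - misalignment;
        if (fixup > len)
            fixup = len;

        for (int j = 0; j < fixup; j++)
            dest[j] = src[j];

        return fixup; // caller resumes the word-sized copy at dest + fixup / src + fixup
    }
}
```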

@jkotas @GSPP Regarding the discussion about using xmm registers, I think it would be very interesting if we could move the implementation of MemoryCopy to the CoreFX repo while keeping the other methods of Buffer here, as was done with Environment. That way we could make things much faster, since we have access to things like Unsafe.Read / Vector<T> over there, while in CoreCLR we're stuck with word-sized copies.

@jkotas (Member) commented Aug 6, 2016

while in CoreCLR we're stuck

This should be done the other way around, because memory copy is always going to be needed for corelib: the small parts of CoreFX required for the corelib implementation can be duplicated in CoreCLR, and large parts can be moved over if necessary. We want to minimize dependencies from CoreCLR to CoreFX. The other way is fine - that is the case with Environment.

@jkotas merged commit 32fe063 into dotnet:master on Aug 6, 2016
@jkotas (Member) commented Aug 6, 2016

This was a really nice change. Could you please make the same change in CoreRT as well?

@jamesqo (Author) commented Aug 6, 2016

@jkotas Actually, this wasn't quite ready for merging yet; I mentioned in my other comment that I was experimenting with eliminating a couple of branches at the beginning/end of the method ;)

I'll port this to CoreRT once that's done. Would it be possible to un-merge this temporarily and reopen (don't know if it's possible to do that with GitHub), or should I submit a new PR for that?

@jkotas (Member) commented Aug 6, 2016

Sorry about merging this prematurely ... please submit a new PR if there are additional tweaks you would like to make.

@jamesqo (Author) commented Sep 2, 2016

@jkotas You mentioned a few comments up:

The small parts of CoreFX required for the corelib implementation can be duplicated in CoreCLR, and large parts can be moved over if necessary.

How large is "large"? To be able to use Vector.Equals in e.g. String.Equals, it looks like we'd have to pull in this gigantic .tt file which presumably adds another step to the build process, and then update it every time the CoreFX version is updated?

@jkotas (Member) commented Sep 2, 2016

You do not need the whole file; just the one method you care about. A trickier problem with Vector is that it is a special-cased type, based on name & assembly name, in several places, e.g.: https://github.com/dotnet/coreclr/blob/master/src/vm/assembly.cpp#L307 or https://github.com/dotnet/coreclr/blob/master/src/vm/methodtablebuilder.cpp#L1206. This special casing would not apply automatically to your corelib copy; it would need to be tweaked to make it work.

picenka21 pushed a commit to picenka21/runtime that referenced this pull request Feb 18, 2022
Decrease writes to local variables in Buffer.MemoryCopy (dotnet/coreclr#6627)

Commit migrated from dotnet/coreclr@32fe063