Decrease writes to local variables in Buffer.MemoryCopy #6627
Conversation
Is there a policy/guideline for squashing commits? All these intermediate commits do not add value to the history.
Also tagging @benaadams because of #2430
@GSPP That's true. I think a while back GitHub added a feature to automatically squash commits when merging PRs: https://github.com/blog/2141-squash-your-commits So they'll really only take up 1 entry in the commit log.
@GSPP, in GitHub, the repo maintainers can now squash the commits at the time of merge from the web UI. See https://help.github.com/articles/about-pull-request-merge-squashing/. I think @jkotas et al. use this feature, so the contributors don't need to worry about squashing. I personally prefer one commit per PR, unless there is a test case which warrants a separate commit. @jamesqo, great work on the performance front, mate! Keep it up. 🚀 👍
@jamesqo, you beat me by seconds..! 😄
```cs
// these to use memory addressing operands.
// So the only cost is a bit of code size,
// which is made up for by the fact that
// we save on writes to dest/src.
```
The lines seem to be a little short; you should break only after at least 80 chars.
@rahku Could you or somebody else with an arm64 machine please measure the perf impact of this change on arm64? I would like to make sure that it is not making arm64 slower. (The link to the source code of the microbenchmark is in the PR description.)
@jamesqo Very nice improvements - thank you!
Is always doing an …? The CRT implementation can do a lot of things this can't. Such as (on modern CPUs where it is faster) just calling …
For larger loops, it can also do some other intelligent things, such as using XMM and doing partially unrolled loops copying in 128-, 64-, 32-, and 16-byte blocks (depending on the size of bytecount). It also doesn't have to do 4 checks for the remaining 1-15 bytes, doing up to 4 separate writes to handle it, as it can do the same in a single unaligned XMM write (although we could optimize this slightly by exiting if there are no remaining bytes).
```cs
}
while (i <= end);

len -= i; // len now represents how many bytes are left after the unrolled loop
```
`len -= end;` before the loop? Is it dependent on the result in `i`? (Not sure, as the loop is `<=`.) If not, its result can be prepared well ahead of the next uses on the lines below.
@benaadams `end` is just `len - 16`, so that would always be 16. I suppose this could come before the loop, however, since `i` is always incremented by 16, so the lower 4 bits would still be the same.
It is not possible to fcall the CRT implementation directly. It does not work because of GC interaction (it would result in potential GC starvation), EH interaction, and calling convention mismatch (on x86).
@jkotas, and doing the …
@jkotas will measure on arm64
QCall has fixed overhead that it has to pay for. The cut-off is for the average case where it starts paying off for the overhead. There may be opportunity for tuning the cut-off point, but it is close to impossible to tune it perfectly. There will always be some cases where the other implementation is better.
I think this is one of the places where tuning the cut-off point is desirable. I would find it difficult to believe that doing 4x the number of writes (with no consideration of alignment or page boundaries) is faster for all of the cases covered by the manual implementation (up to 512 bytes).
@jkotas GC starvation could be solved by calling the native method in chunks of, say, 4MB. A 4MB memcpy dominates any native call cost by far. And a 4MB chunk is enough to pull off any tricks such as alignment peeling and XMM stuff. That way the CRT version could be used.
Right, that is why we use the CRT version for anything greater than 512 bytes - it is even better than doing the chunking.
There is improvement for arm64 when copying fewer bytes but not otherwise. There is no degradation though.
@rahku Thanks!
```cs
len--;
if (((int)dest & 2) == 0)
    goto Aligned;

*(dest + i) = *(src + i);
```
Nit: `i` is always zero here
@bbowyersmyth Yep, I included that just in case someone changes the code in the future. From what I can see it doesn't make a difference, since the JIT is able to do constant propagation here and `i += 1` is just compiled to `mov i, 1`.
@tannergooding I think the idea of doing unaligned word writes at the beginning/end of the buffer sounds interesting, it should let us shave off a couple of branches. I'm going to try that in this PR and see what the benchmarks say.
It's only 2x the number of writes for x64, and with these changes copying 511 bytes seems to be significantly faster than 512 (~3.4 vs 3.8). I've yet to garner any perf data for this vs. the native implementation on x86, though. Also, the managed implementation does pay attention to alignment: it tries to align dest before doing full-word writes. @jkotas @GSPP Regarding the discussion about using …
This should be done the other way around, because MemoryCopy is always going to be needed for corelib: the small parts of CoreFX required for the corelib implementation can be duplicated in CoreCLR, and large parts can be moved over if necessary. We want to minimize dependencies from CoreCLR to CoreFX. The other way is fine - it is the case with Environment.
This was a really nice change. Could you please make the same change in CoreRT as well?
@jkotas Actually, this wasn't quite ready for merging yet; I mentioned in my other comment that I was experimenting with eliminating a couple of branches at the beginning/end of the method ;) I'll port this to CoreRT once that's done. Would it be possible to un-merge this temporarily and reopen (I don't know if it's possible to do that with GitHub), or should I submit a new PR for that?
Sorry about merging this prematurely ... please submit a new PR if there are additional tweaks you would like to make.
@jkotas You mentioned a few comments up:
How large is "large"? To be able to use …
You do not need the whole file, just the one method you care about. A more tricky problem with Vector is that it is a special-cased type - based on name & assembly name, in several places, e.g.: https://github.com/dotnet/coreclr/blob/master/src/vm/assembly.cpp#L307 or https://github.com/dotnet/coreclr/blob/master/src/vm/methodtablebuilder.cpp#L1206. This special casing would not apply automatically to your corelib copy - it would need to be tweaked to make it work.
Commit migrated from dotnet/coreclr@32fe063
In `Buffer.MemoryCopy` currently we are making 4 writes every time we copy some data; 1 to update `*dest`, 1 to update `dest`, 1 to update `src` and 1 to update `len`. I've decreased it to 2; one to update a new local variable `i`, which keeps track of how many bytes we are into the buffer. All writes are now made using

```cs
*(dest + i + x) = *(src + i + x)
```

which has no additional overhead since they're converted to using memory addressing operands by the JIT.

Another change I made was to add a few extra cases for the switch-case at the beginning that does copying for small sizes without any branches. It now covers sizes 0-22. This is beneficial to the main codepath, since we can convert the unrolled loop to a `do..while` loop and save an extra branch at the beginning. (max 7 bytes for alignment, 16 for 1 iteration of the loop, so the min bytes we can copy without checking whether we should stop is 23.)

This PR increases the performance of `MemoryCopy` by 10-20% for most buffer sizes on x86; you can see the performance test/results (and the generated assembly for each version) [here](https://gist.github.com/jamesqo/337852c8ce09205a8289ce1f1b9b5382). (Note that this codepath is also used by `wstrcpy` at the moment, so this directly affects many common String operations.)

Note that I haven't tested this on ARM or anything (only i386 and x86_64), so I'm not sure how these changes will affect other platforms.

I have yet to write up tests for this in CoreFX, but I will later when I get the time.
cc @jkotas @nietras @mikedn @GSPP @omariom @bbowyersmyth