Array.Copy & Buffer.BlockCopy x2 to x3 slower < 1kB #4847
That's a bit of an apples and oranges comparison; your version only handles `byte[]`.
Those (struct copies passed as parameters) are handled by the JIT; they have nothing to do with `Buffer.BlockCopy`.
I'm not entirely sure how, unless this is particularly onerous:

```cpp
// Size of the Arrays in bytes
SIZE_T srcLen = src->GetNumComponents() * src->GetComponentSize();
SIZE_T dstLen = srcLen;

// We only want to allow arrays of primitives, no Objects.
const CorElementType srcET = src->GetArrayElementType();
if (!CorTypeInfo::IsPrimitiveType_NoThrow(srcET))
    FCThrowArgumentVoid(W("src"), W("Arg_MustBePrimArray"));

if (src != dst) {
    const CorElementType dstET = dst->GetArrayElementType();
    if (!CorTypeInfo::IsPrimitiveType_NoThrow(dstET))
        FCThrowArgumentVoid(W("dest"), W("Arg_MustBePrimArray"));
    dstLen = dst->GetNumComponents() * dst->GetComponentSize();
}
```

If that is heavy, it could move into a jitted constant for the type (or exception) and be essentially eliminated? The CPU vector load and store instructions don't care what the data types are, though there would be some overhead in handling overlapping data, which I don't do. However, that is an identical cost for all calls and doesn't explain the dramatic fall-off at 16 bytes, or the slightly smaller fall-off at 64 bytes?
Yes, but my point was that the code that does the copying is a call to `memmove`:

```cpp
if ((srcPtr != dstPtr) && (count > 0)) {
    memmove(dstPtr, srcPtr, count);
}
```

So anything calling `Buffer.BlockCopy` pays for the `memmove` call too.
At least for small sizes like 16, yes that piece of code's overhead is high.
I'm not entirely sure what you are suggesting. One possible solution is to have a generic version and then use code like:

```csharp
if (typeof(T).IsBlittable)
    Buffer.__memmove(...);
else
    Array.Copy(...);
```

where `IsBlittable` is a hypothetical property that returns true if `T` is neither a reference type nor a struct that contains reference-type fields.
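(For comparison: later runtimes did grow an equivalent of that hypothetical check, `RuntimeHelpers.IsReferenceOrContainsReferences<T>()`. A minimal sketch of the dispatch using today's APIs, none of which existed when this thread was written; the `GenericCopy` name is mine:)

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

static class GenericCopy
{
    // Sketch: bulk byte copy for blittable element types, Array.Copy otherwise.
    public static void Copy<T>(T[] src, int srcIndex, T[] dst, int dstIndex, int length)
        where T : struct
    {
        if (!RuntimeHelpers.IsReferenceOrContainsReferences<T>())
        {
            // No GC references inside T: safe to copy the raw element bytes in bulk.
            MemoryMarshal.AsBytes(src.AsSpan(srcIndex, length))
                         .CopyTo(MemoryMarshal.AsBytes(dst.AsSpan(dstIndex, length)));
        }
        else
        {
            // T contains object references: element-wise copy keeps the GC informed.
            Array.Copy(src, srcIndex, dst, dstIndex, length);
        }
    }
}
```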
Which calls? Your version has only one managed-managed call. The framework version has one managed-native call and one native-native call.
Are you saying that the problem is in fact caused by `memmove`? For 16 bytes `memmove` uses only 2 SSE instructions, a load and a store. Even for larger sizes the `memmove` code itself shouldn't be the bottleneck.
Yeah, essentially the pattern `System.Numerics.Vector` uses:

```csharp
/*
 * PATTERN:
 *    if (typeof(T) == typeof(Int32)) { ... }
 *    else if (typeof(T) == typeof(Single)) { ... }
 *
 * EXPLANATION:
 *    At runtime, each instantiation of Vector<T> will be type-specific, and each of these typeof
 *    blocks will be eliminated, as typeof(T) is a (JIT) compile-time constant for each
 *    instantiation. This design was chosen to eliminate any overhead from delegates and
 *    other patterns.
 */
```
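(To make that concrete, here is the shape of the dispatch in a standalone method; a minimal sketch, not the actual Vector&lt;T&gt; source. Each instantiation JITs down to a single branch, and the box/unbox casts in this exact pattern are typically optimized away as well:)

```csharp
using System;

static class JitConstantDispatch
{
    public static T AddOne<T>(T value) where T : struct
    {
        // typeof(T) is a JIT-time constant per instantiation, so for
        // AddOne<int> only the first branch survives in the generated code.
        if (typeof(T) == typeof(int))
            return (T)(object)((int)(object)value + 1);
        if (typeof(T) == typeof(float))
            return (T)(object)((float)(object)value + 1f);
        throw new NotSupportedException($"{typeof(T)} not supported");
    }
}
```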
The graph included in the issue summary (and the full data table on IllyriadGames/ByteArrayExtension) shows the average times and operations per second for lengths 0-96, 508-513, 543, 544, 547, 576, 1024, 4096 and 8512. I can see there being overhead in the call itself.

It would suggest to me it was, as the actual copy should be cheap. I.e. 17 bytes takes double the time of 16 bytes; if it were alignment alone it should have recovered somewhat by 32 bytes, but it doesn't (snip from the table below).

At 16 bytes something is odd. At a guess, it's either not taking advantage of wider data types or has instruction result interdependence preventing OoOE and pipelining; the data should all be in L1 cache at this point?
I see, I was looking at the graph upside down, so to speak 😄. VC++'s memcpy (and memmove) implementation has special cases for less than 16 bytes and for 16-32 bytes. See the code in "C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\crt\src\amd64\memcpy.asm".
It looks like it's a dependent read causing a pipeline stall for 17-32 bytes and for every higher byte count: the next step in the `memmove` depends on the result of the previous one, whereas the two 16-byte copies are independent in the vectorized version. Is there an open source GitHub repo for memcpy.asm? 😉
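(For what it's worth, the independent-copies idea for the 17-32 byte range can be written directly. A sketch; `Unsafe.CopyBlockUnaligned` comes from a package that postdates this thread, and the method name is mine:)

```csharp
using System.Runtime.CompilerServices;

static class SmallCopy
{
    // Copy 17..32 bytes as two 16-byte blocks: one from the start, one
    // ending exactly at the end. The blocks may overlap in the middle but
    // have no dependence on each other (for non-overlapping buffers),
    // so the CPU can execute them in parallel.
    public static unsafe void Copy17To32(byte* src, byte* dst, int count)
    {
        Unsafe.CopyBlockUnaligned(dst, src, 16);
        Unsafe.CopyBlockUnaligned(dst + count - 16, src + count - 16, 16);
    }
}
```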
@mikedn looking into this a bit further, it looks like I can extend the performance for the full range (getting 304 GBit/s read + 304 GBit/s write) for 16 kB aligned arrays. I'm just testing with unaligned copies to see what effect that has. I could add extensions such as:

```csharp
public static void CopyTo(this byte[] src, byte[] array, int index) { }
public static void CopyTo(this char[] src, char[] array, int index) { }
public static void CopyTo(this byte[] sourceArray, int sourceIndex, byte[] destinationArray, int destinationIndex, int length) { }
```

As they would have a dependency on `System.Numerics.Vectors`, which lives higher up the stack, I couldn't do this in coreclr. Would this be worth pursuing as a contribution to corefx?
Agree. To compare apples to apples, it would be more accurate to compare the vectorized implementation with something like:
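(The snippet itself didn't survive the thread capture; presumably it was a `byte[]`-specific wrapper over the same native copy path, something along these lines. My reconstruction, not the original:)

```csharp
using System;

static class BaselineCopy
{
    // byte[]-specific baseline: pin both arrays and call the same bulk
    // copy the framework uses, so only the per-call overhead differs.
    public static unsafe void MemCopy(byte[] src, int srcIndex, byte[] dst, int dstIndex, int count)
    {
        fixed (byte* pSrc = &src[srcIndex])
        fixed (byte* pDst = &dst[dstIndex])
        {
            Buffer.MemoryCopy(pSrc, pDst, count, count);
        }
    }
}
```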
The results will vary in both directions based on the number of dimensions:
I am getting pretty different numbers on my config (VS2015 Update 1, Intel Core i7-3520M @ 2.90GHz); the difference is a lot smaller than what you are seeing. Since Array.Copy and Buffer.MemoryCopy are basically just wrappers around memcpy from the C runtime, it is really a perf issue in the C runtime, specific to your config. It is actually a pretty hard problem to build a memcpy with great performance across all dimensions. I have seen an analysis of the problem several years ago, comparing the performance of different strategies; it was a multidimensional cube that was hard to make sense of.
Interesting, didn't know that.
Yeah :-/ Tis very true
Using .NET Framework 4.6.1 (4.6.01038); RyuJIT, x64, Win 10 (10586.29), VS 14.0.24720 Update 1 - though just using the provided framework, not building it myself. Also have CPU power saving off and the CPU fan locked on full to avoid throttling, and nothing else running on the test machine; just letting it do its thing. Using static readonly arrays to minimise GC impact, which will mean it will all quickly be hitting cache. Agreed that, since it's the C runtime's memcpy underneath, it will be config-specific.

But yes, benchmarks are hard, which is also why I am using BenchmarkDotNet rather than trying to roll my own. It goes through warm-up cycles for jitting; idles to get a measure of an empty loop; tries to calculate a good number of operations to run to get a good measure; repeats the benchmark 10 times; then starts from scratch 3 times, and from the 30 runs it derives the standard deviation. For the example below that ends up being around 591 billion function executions for the 16 kB size - alas, it takes a long time.

```
// Run, Process: 2 / 3
// IntParam=16384
// Method=AlignedVectorizedCopy
// BenchmarkDotNet-Dev=v0.8.0.0+
// OS=Microsoft Windows NT 6.2.9200.0
// Processor=Intel(R) Core(TM) i7-4720HQ CPU @ 2.60GHz, ProcessorCount=8
// CLR=MS.NET 4.0.30319.42000, Arch=64-bit [RyuJIT]
// Pre-Warmup: 2 op, 0 ms, 3158.1 ns, 8 ticks, 1579.0267 ns/op, 633301.5 op/s
// Pre-Warmup: 2000 op, 1 ms, 1015314.2 ns, 2572 ticks, 507.6571 ns/op, 1969833.6 op/s
// Pre-Warmup: 1970000 op, 771 ms, 770992173.6 ns, 1953082 ticks, 391.3666 ns/op, 2555149.2 op/s
// Pre-Warmup: 3940000 op, 1551.9 ms, 1551853659 ns, 3931165 ticks, 393.8715 ns/op, 2538899.2 op/s
// Warmup (idle): 3940000 op, 4.4 ms, 4357324.3 ns, 11038 ticks, 1.1059 ns/op, 904224645.8 op/s
// Warmup (idle): 3940000 op, 4.2 ms, 4245608.1 ns, 10755 ticks, 1.0776 ns/op, 928017818.7 op/s
// Warmup (idle): 3940000 op, 4.4 ms, 4384167.7 ns, 11106 ticks, 1.1127 ns/op, 898688244.2 op/s
// IterationCount = 1970000
// Target (idle): 3940000 op, 4.5 ms, 4480488.4 ns, 11350 ticks, 1.1372 ns/op, 879368426.4 op/s
// Target (idle): 3940000 op, 4.2 ms, 4232186.4 ns, 10721 ticks, 1.0742 ns/op, 930960884.2 op/s
// Target (idle): 3940000 op, 4.3 ms, 4265740.7 ns, 10806 ticks, 1.0827 ns/op, 923637945.6 op/s
// Target (idle): 3940000 op, 4.4 ms, 4416932.5 ns, 11189 ticks, 1.121 ns/op, 892021775 op/s
// Target (idle): 3940000 op, 4.3 ms, 4269688.3 ns, 10816 ticks, 1.0837 ns/op, 922783990.4 op/s
// Warmup 1: 3940000 op, 1547.1 ms, 1547107499.4 ns, 3919142 ticks, 392.6669 ns/op, 2546687.9 op/s
// Warmup 2: 3940000 op, 1550.6 ms, 1550621228.6 ns, 3928043 ticks, 393.5587 ns/op, 2540917.1 op/s
// Warmup 3: 3940000 op, 1548.1 ms, 1548139393.3 ns, 3921756 ticks, 392.9288 ns/op, 2544990.5 op/s
// Warmup 4: 3940000 op, 1548.8 ms, 1548824296.2 ns, 3923491 ticks, 393.1026 ns/op, 2543865.1 op/s
// Warmup 5: 3940000 op, 1548.4 ms, 1548367562.7 ns, 3922334 ticks, 392.9867 ns/op, 2544615.4 op/s
Target 1: 3940000 op, 1548 ms, 1548044651.7 ns, 3921516 ticks, 392.9047 ns/op, 2545146.2 op/s
Target 2: 3940000 op, 1547.3 ms, 1547276060.5 ns, 3919569 ticks, 392.7097 ns/op, 2546410.5 op/s
Target 3: 3940000 op, 1546.8 ms, 1546808273.8 ns, 3918384 ticks, 392.5909 ns/op, 2547180.6 op/s
Target 4: 3940000 op, 1549.6 ms, 1549584597.5 ns, 3925417 ticks, 393.2956 ns/op, 2542616.9 op/s
Target 5: 3940000 op, 1547.9 ms, 1547870958.8 ns, 3921076 ticks, 392.8606 ns/op, 2545431.8 op/s
Target 6: 3940000 op, 1546.7 ms, 1546704452.8 ns, 3918121 ticks, 392.5646 ns/op, 2547351.6 op/s
Target 7: 3940000 op, 1548.5 ms, 1548512438.4 ns, 3922701 ticks, 393.0235 ns/op, 2544377.4 op/s
Target 8: 3940000 op, 1547.9 ms, 1547853194.7 ns, 3921031 ticks, 392.8561 ns/op, 2545461 op/s
Target 9: 3940000 op, 1546.6 ms, 1546550892.4 ns, 3917732 ticks, 392.5256 ns/op, 2547604.5 op/s
Target 10: 3940000 op, 1546.9 ms, 1546852881.3 ns, 3918497 ticks, 392.6023 ns/op, 2547107.1 op/s
```
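(As an aside, for anyone re-running this today: the modern BenchmarkDotNet API is much terser than the v0.8 output above, whose attribute names differed. A minimal equivalent:)

```csharp
using System;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class CopyBenchmarks
{
    // Static readonly arrays, as in the original setup, so the data
    // stays hot in cache and the GC is out of the picture.
    private static readonly byte[] Src = new byte[16384];
    private static readonly byte[] Dst = new byte[16384];

    [Benchmark(Baseline = true)]
    public void ArrayCopy() => Array.Copy(Src, 0, Dst, 0, Src.Length);

    [Benchmark]
    public void BlockCopy() => Buffer.BlockCopy(Src, 0, Dst, 0, Src.Length);
}

public class Program
{
    public static void Main() => BenchmarkRunner.Run<CopyBenchmarks>();
}
```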
Also, it needs to be RyuJIT x64 for Vectors to have meaning beyond convenience, I believe.
That has nothing to do with the 17-32 case, that code is for the 32-128 case.
16 kB aligned arrays? Maybe you mean 16 byte? And what does alignment have to do with static readonly?! Anyway, I don't understand where you are trying to go with all this.
MemoryCopy uses memcpy only for size > 512; you're probably talking about BlockCopy.
Ah my mistake, so it's doing 2x 16-byte copies for 17-32 in the VC++ 2015 memcpy.
16 kB arrays, which I assumed to be 16-byte aligned, but probably aren't due to the type and length preamble. Static readonly was to provide a fuller picture of the benchmark and the CPU caching state + GC interplay.
Sorry... got a bit side-tracked; thank you for following up. Basically: a faster copy for primitive arrays at small sizes, particularly `byte[]`.
Yup. In my C++ tests the 17-32 case is actually faster than the rest, and given how the code looks I think that's normal. But this case seems to be new in VC++ 2015, and the desktop version of .NET uses the VC++ 2013 CRT; that may explain why you're seeing that sharp decrease at 16 bytes.
Why would the length of an array imply anything about its alignment?
I still don't understand what static readonly has to do with all this. Those modifiers only affect the array reference, not the array itself.
Adding another API beyond the existing ones doesn't seem such a good idea. That said, generic versions of the existing APIs may be useful, and I wouldn't exactly consider them new APIs. Some of the stuff mentioned here has already appeared in other issues. Off the top of my head:
Still, I think that the best thing to do would be to treat these methods as JIT intrinsics, the same way native compilers do. Unfortunately that probably requires more work.
Could you help me understand the costs (i.e. which costs are worth trying to mitigate)? Does the extern call have a high fixed cost? There are two size multiplications; I assume the other parameter validation isn't more expensive in native than in managed? I also assume the call to `memmove` itself is cheap?
Cool @benaadams, it looks like I have the same results. Vector copy is faster.
Results from the improved vector copy compared with @benaadams' vector copy:
The best thing to do would be to use a profiler to see what happens. I've done that already, so here's a quick overview:
Now, a profiler's claims should be taken with a grain of salt. But it's pretty certain that in my test (with lengths from 0 to 64) 76% of the time is spent in coreclr.dll, 17% in vcruntime140.dll and 5% in the test executable. The influence of `memmove` itself is comparatively small. Keep in mind that my test is using CoreCLR compiled with VS2015. You may get different results if you use the desktop version of the CLR, because that one is compiled using VS2013, which appears to have a slightly different memcpy implementation.
Wow, ok - so sticking to managed code is probably best. If I introduced some methods like the `CopyTo` extensions above and had them call through to a managed vectorized copy? I had a look at what could be done with native vectors and intrinsics, but I'll admit what goes on there is a little exotic for me to confidently make those changes at this time. We are exclusively CoreCLR x64 for production use now, so this is the area I'd most like to target. @Anderman's version is similar to what I do.
@benaadams Calling via P/Invoke is very costly for small array sizes, while fixing the array in memory and copying it is not. This is the most optimized routine for aligned buffers we could create for JIT64 (haven't tested it yet on CoreCLR and/or unaligned, but the difference for Intel architectures shouldn't be too different): https://github.com/Corvalius/ravendb/blob/master/Raven.Sparrow/Sparrow/Memory.cs . When we tested back in .NET 4.6, the configuration was:

```
BenchmarkDotNet=v0.7.8.0
OS=Microsoft Windows NT 6.2.9200.0
Processor=Intel(R) Core(TM) i5-2500K CPU @ 3.30GHz, ProcessorCount=4
HostCLR=MS.NET 4.0.30319.42000, Arch=64-bit [AttachedDebugger] [RyuJIT]
Type=MemoryPerf  Mode=Throughput  Platform=X64  Jit=RyuJit  .NET=HostFramework
```

Each platform has a different threshold where it makes sense to just resort to the native `memcpy`. I would also agree with @mikedn that MemoryCopy should be a JIT intrinsic.
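(A sketch of that threshold dispatch; the cutoff value below is illustrative only, and as noted, the real crossover point varies per CPU and CRT and must be measured:)

```csharp
using System;
using System.Numerics;

static class ThresholdCopy
{
    // Illustrative cutoff; tune per platform.
    private const int NativeThreshold = 512;

    public static void Copy(byte[] src, byte[] dst, int count)
    {
        if (count >= NativeThreshold)
        {
            // Large copies: the native call's fixed cost amortizes away.
            Buffer.BlockCopy(src, 0, dst, 0, count);
            return;
        }

        // Small copies: a managed Vector<byte> loop avoids the native call.
        int i = 0;
        for (; i <= count - Vector<byte>.Count; i += Vector<byte>.Count)
            new Vector<byte>(src, i).CopyTo(dst, i);
        for (; i < count; i++)
            dst[i] = src[i];
    }
}
```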
@redknightlois interesting; I also agree on the JIT intrinsic, I just don't have the domain knowledge for it, and with the dotnet cli approaching RC2/RTM I doubt the dotnet team would have a chance to work on it in the near term. However, I can do the managed generics contribution for the copy methods. It would be good to get this into coreclr, as it certainly looks like there are lots of people rolling their own copy routines for improved performance, when they shouldn't have to.
Maybe that's not actually needed. The whole point of a "copy" JIT intrinsic would be that certain cases can be better optimized if the JIT understands what's going on. The most important case that can benefit is the copy length being a constant, if the JIT would be more willing to consider such implementations for inlining.

I'm not sure how you could make generic versions work efficiently without something like the hypothetical `IsBlittable` above.
Custom implementations will likely always exist because it's impossible to optimize for every case under the sun. For example, why should I pay the cost of checks for small copy length if I know that copy length will never be smaller than a certain threshold in the application I'm writing?
@mikedn But the question still lingers: could the JIT as it currently stands (without a dedicated codepath for copies) generate code that good? And if a dedicated codepath were necessary, what's the difference from treating it as an intrinsic? That is probably something only a handful of people can actually give input on, like @CarolEidt, @erozenfeld or @cmckinsey.
What complex code? Most of the code involved is pretty simple. For example:

```csharp
int a = 3;
switch (a) {
    case 0:
        Console.WriteLine(0);
        break;
    ...
    case 15:
        Console.WriteLine(15);
        break;
}
```

generates:
Not much, I'd say. Maybe it could emit a shorter sequence for the common cases.
It would do for byte through double, with a fallback to the current methods for the rest. `Vector<T>` scales across SSE -> AVX -> AVX-512 instructions; but the Vectors lib is further up the stack in corefx, so coreclr couldn't take a dependency on it (you'd also want to skip the bounds checks), and the wide types aren't native in the base runtime.
@mikedn Actually I was thinking of cases where you need to accommodate things like "always copy 1033 bytes", which will probably take a few loops of 512-byte copies and a few more based on a switch; the typical fixed-size copy. @benaadams That is basically what I implied: if you have to implement custom paths for each instruction set, it gets complicated fast.
@redknightlois it's more complicated than that; for example, with `CopyTo` (a good example) there is also the managed non-accelerated version to handle.
I think you're assigning a slightly different meaning to intrinsic; usually intrinsic means that the function call is replaced with an inline version. What you seem to be suggesting is that the JIT compiler should generate the code for functions like `CopyTo` itself.

Hmm, put 2 longs in a struct and the JIT usually copies such structs using SSE instructions. Or put in 4 and you get a form of unrolling as well.
With a bit more optimization on the JIT side that could probably work as well. And the additional optimizations could handle other situations unrelated to copying, whereas an intrinsic is useful only in this limited circumstance.
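(A sketch of that struct trick; the type and names are mine, and whether the JIT actually emits 16-byte SIMD moves for the struct assignment depends on the JIT version, so check the disassembly before relying on it:)

```csharp
using System.Runtime.InteropServices;

[StructLayout(LayoutKind.Sequential)]
struct Block16
{
    public long A, B; // 16 bytes; the JIT can often copy this with one SSE move
}

static class StructCopy
{
    public static unsafe void Copy(byte* src, byte* dst, int count)
    {
        // Copy 16 bytes per struct assignment, then a scalar tail.
        var s = (Block16*)src;
        var d = (Block16*)dst;
        int blocks = count / 16;
        for (int i = 0; i < blocks; i++)
            d[i] = s[i];
        for (int i = blocks * 16; i < count; i++)
            dst[i] = src[i];
    }
}
```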
@mikedn Interesting. So, if I'm correct, the VM will mark the Vector assembly as a special class. The importer scans every CALL and will try to convert the call via impSIMDIntrinsic. impSIMDIntrinsic will first find out if the call is to a special class (here, here and here). So it must be possible to do the same for Array.CopyTo or Buffer.BlockCopy.

Skipping the bounds check improved the copy performance for 128 bytes by 50-80%.
What about something like the following? It assumes the argument checks are placed in a separate method.
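(That snippet didn't survive here either, but the shape being described is presumably this; names hypothetical. Keep the hot path tiny and inlineable, and push the cold throw out of line:)

```csharp
using System;
using System.Runtime.CompilerServices;

static class CheckedCopy
{
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static void Copy(byte[] src, byte[] dst, int count)
    {
        // Unsigned compares fold the negative-count check into the length checks.
        if (src == null || dst == null ||
            (uint)count > (uint)src.Length || (uint)count > (uint)dst.Length)
            ThrowInvalidArguments();

        Buffer.BlockCopy(src, 0, dst, 0, count);
    }

    // Separate, never-inlined method keeps the caller's code size small.
    [MethodImpl(MethodImplOptions.NoInlining)]
    private static void ThrowInvalidArguments() =>
        throw new ArgumentException("Invalid copy arguments.");
}
```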
Yes, but a hypothetical `IsBlittable` would still be needed to make a generic version work; currently nothing like it exists.
This issue is getting long and IMO a bit unfocused. Let's try to sum it up; we'll see there are actually 2 different related issues. It's clear that small copies pay a significant fixed overhead, and the CLR may end up using no less than 3 different versions of `memmove`, each behaving differently at small sizes.

Issue 1: the fixed overhead on small copies. Are there workarounds? Well, if you don't mind using unsafe code, yes.

Issue 2: …

Issue 3: …

Hmm, looks like I ended up with 3 different issues :)
Need to remeasure post dotnet/coreclr#3118.
Yeah, I love how a change was sneaked in without any reference to an existing long discussion about the affected piece of code 😀
dotnet/coreclr#3118 was @geoffkizer's suggestion that I just helped to happen, so it did not immediately occur to me to link it to this one as I was typing the commit description ... sorry about that. Geoff is analyzing performance of the whole .NET Core server stack.
@Anderman could you explain how this works:

```csharp
if (orgCount > Vector<byte>.Count)
{
    new Vector<byte>(src, orgCount - Vector<byte>.Count).CopyTo(dst, orgCount - Vector<byte>.Count);
    return;
}
```

I don't understand how this works when it is not taking offsets into account? I am redoing the different copy benchmarks (on all JITs) and wanted to include your take too. As far as I understand, Kestrel now uses something similar.
We should look into fixing this using Span. |
Closing as very out of date, though I don't know the current situation.
Array.Copy & Buffer.BlockCopy are up to 3x slower than what can currently be written using managed C# for `byte[]` copies < 1024 bytes. As these are the low-level copy APIs, they should be faster than anything that can be written using managed code.

The IllyriadGames/ByteArrayExtension repository adds a `VectorizedCopy` extension method for `byte[]` that demonstrates this speed-up, and I assume potentially more gains could be made at the CLR level, as `Buffer.BlockCopy` is essentially calling `memmove`.

Would this affect other copies, e.g. structs passed as parameters, as they are essentially small copy operations?
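(For readers without the repo to hand, the core of such a vectorized copy looks roughly like this; a simplification, not the actual ByteArrayExtension source, which is more elaborate about small lengths:)

```csharp
using System;
using System.Numerics;

static class ByteArrayExtensions
{
    // Simplified: copy Vector<byte>-sized chunks, then a scalar tail.
    public static void VectorizedCopy(this byte[] src, byte[] dst, int dstIndex, int count)
    {
        int srcIndex = 0;
        while (count >= Vector<byte>.Count)
        {
            new Vector<byte>(src, srcIndex).CopyTo(dst, dstIndex);
            srcIndex += Vector<byte>.Count;
            dstIndex += Vector<byte>.Count;
            count -= Vector<byte>.Count;
        }
        while (count-- > 0)
            dst[dstIndex++] = src[srcIndex++];
    }
}
```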