update JIT_MemSet/MemCpy, Buffer::BlockCopy and Buffer::InternalBlock… #7198
Conversation
Also related issue: https://github.com/dotnet/coreclr/issues/6661
@russellhadley @swaroop-sridhar PTAL
;JIT_MemCpy - Copy source buffer to destination buffer
;
;Purpose:
;JIT_MemCpy - Copy source buffer to destination buffer
Similarly, please add a header comment to memcpy.
Done.
;char *memset(dst, value, count) - sets "count" bytes at "dst" to "value"
;
;Purpose:
Can you please retain the header comments that describe the purpose of the method?
Please also update the algorithm described in the comments to reflect the new strategy.
Thanks for the comment, done.
Very happy to see this contribution. Small copies of lengths 0-16 are a very common range, and it appears that some paths in the implementation are gated on these small sizes. Can we see the experimental results including 0-16 lengths? Also, can we see similar results for a JIT_MemSet benchmark at varying lengths? I'd also like to see a future contribution that includes changes to the Linux .s implementation. Also, I'd suggest running a pass of this on the full framework, since there are many more extensive cpblk/initblk tests there from the managed C++ compiler.
@@ -1524,7 +1528,11 @@ FCIMPL5(VOID, Buffer::InternalBlockCopy, ArrayBase *src, int srcOffset, ArrayBas
    _ASSERTE(count >= 0);

    // Copy the data.
#if defined(_WIN64)
This should be #if defined(_AMD64_) && !defined(PLATFORM_UNIX). Otherwise, there will be a build break on ARM64, and Unix will get slower for no good reason.
Thanks for pointing it out. It's updated.
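For reference, a minimal sketch of what the corrected guard around the copy looks like once the suggestion above is applied (the #else branch falling back to the CRT memmove is inferred from the PR description rather than quoted from the diff):

    // Copy the data.
#if defined(_AMD64_) && !defined(PLATFORM_UNIX)
    // Windows AMD64: use the optimized assembly helper, which, like memmove,
    // handles overlapping source/destination ranges.
    JIT_MemCpy(dst->GetDataPtr() + dstOffset, src->GetDataPtr() + srcOffset, count);
#else
    // Other architectures and Unix keep the stock CRT memmove.
    memmove(dst->GetDataPtr() + dstOffset, src->GetDataPtr() + srcOffset, count);
#endif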
@@ -1524,7 +1528,11 @@ FCIMPL5(VOID, Buffer::InternalBlockCopy, ArrayBase *src, int srcOffset, ArrayBas
    _ASSERTE(count >= 0);

    // Copy the data.
#if defined(_WIN64)
    JIT_MemCpy(dst->GetDataPtr() + dstOffset, src->GetDataPtr() + srcOffset, count);
#else
There are a number of other uses of memmove. If this is faster than the stock memmove implementation, should it be redefined in some more central header instead?
Or ideally ... fixed in the stock memmove implementation, so that all programs benefit from it and we do not have to have a private copy that has to be kept up to date with the latest tweaks.
I opened issue #7234 to extend this work to other instances of memcpy/memmove.
        shr r9, 6 ; count/64

        align 16
mcpy09: movdqu xmm0, [rdx]
When AVX is available, shouldn't we use YMM-wide copies/initializations?
Good point. Based on the experimental data, we do not see an advantage when using YMM. Basically, when the copy length is large, 'rep movsb' or 'rep stosb' is a good choice. In addition, a CPUID check is needed if we choose YMM.
For certain copy lengths, we do see an improvement when using YMM with vzeroupper. However, considering the CPU dispatch cost, we may consider YMM as a future improvement.
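For context, a YMM-wide copy of the kind being weighed here would look roughly like the sketch below. This is illustrative only and not code from this PR; the function name and loop structure are assumptions, and a real implementation would also need an AVX CPUID check and a better remainder path.

#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

// Illustrative 32-byte-at-a-time copy using YMM registers.
// _mm256_zeroupper() (vzeroupper) avoids the AVX-to-SSE transition
// penalty after 256-bit registers have been used.
static void CopyYmm(uint8_t* dst, const uint8_t* src, size_t count)
{
    size_t i = 0;
    for (; i + 32 <= count; i += 32)
    {
        __m256i v = _mm256_loadu_si256((const __m256i*)(src + i));
        _mm256_storeu_si256((__m256i*)(dst + i), v);
    }
    _mm256_zeroupper();           // clear the upper halves of the YMM registers
    for (; i < count; i++)        // byte tail
        dst[i] = src[i];
}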
For the future, but could the CPUID check result be passed in as a cached parameter?
        mov eax, edx ; eax is value
        mov rdi, rcx ; rdi is dst
        mov rcx, r8  ; rcx is count
        rep stosb
For cases larger than 512 bytes, do we expect rep stosb/movsb to be faster than XMM/YMM wide copies?
If CPUID EAX=7, EBX=0 => bit 9 (Enhanced REP MOVSB/STOSB) is set, it is faster; however, it has a setup cost which needs to be amortized vs. vector copies (https://github.com/dotnet/coreclr/issues/6661#issuecomment-242547024), so there is a cut-over point.
Thanks @benaadams for the clarification.
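For reference, the feature bit @benaadams mentions can be queried like this with the MSVC intrinsics (a sketch only; the helper name is made up, and this PR itself does not perform a CPUID-based dispatch):

#include <intrin.h>

// Sketch: detect Enhanced REP MOVSB/STOSB (ERMS).
// CPUID leaf 7, sub-leaf 0 reports ERMS support in EBX bit 9.
static bool HasEnhancedRepMovsbStosb()
{
    int regs[4] = { 0, 0, 0, 0 };      // EAX, EBX, ECX, EDX
    __cpuidex(regs, 7, 0);
    return (regs[1] & (1 << 9)) != 0;  // EBX bit 9
}

Such a check would presumably be done once and cached, along the lines of the "cached param" idea raised earlier, rather than executed on every copy.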
@helloguo why's the cutoff 512B? Is this based on measurement, or because of existing implementation (ex: in CRT)? Can you please add a comment about it?
Also, it looks like aligning the src/dest will reduce the startup overhead for movsb. Is it worth trying to align the addresses before movsb?
Yes, 512 is chosen based on the experimental data. Above 512 bytes, 'rep movsb' is better.
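To make the cut-over strategy discussed above concrete, here is an illustrative C++ sketch (not this PR's assembly): XMM-wide copies below the threshold, rep movsb at or above it. The 512-byte constant reflects the measurements mentioned above; the function name and structure are assumptions for illustration.

#include <intrin.h>
#include <emmintrin.h>
#include <stddef.h>
#include <stdint.h>

// Illustrative forward copy with a size cut-over.
static void CopyForward(uint8_t* dst, const uint8_t* src, size_t count)
{
    if (count >= 512)
    {
        __movsb(dst, src, count);   // emits rep movsb
        return;
    }
    size_t i = 0;
    for (; i + 16 <= count; i += 16)
    {
        __m128i v = _mm_loadu_si128((const __m128i*)(src + i));
        _mm_storeu_si128((__m128i*)(dst + i), v);
    }
    for (; i < count; i++)          // byte tail
        dst[i] = src[i];
}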
        shr r9, 7 ; count/128

        align 16
mset01: movdqu [rcx], xmm0
Is it worth considering src/dest alignment? For CpBlk/InitBlk, the sources/destinations are usually expected to be aligned, and can therefore benefit from the aligned write instructions (movdqa instead of movdqu).
ECMA-335 says:
"cpblk assumes that both destaddr and srcaddr are aligned to the natural size of the machine (but see the unaligned. prefix instruction). The operation of the cpblk instruction can be altered by an immediately preceding volatile. or unaligned. prefix instruction.
[Rationale: cpblk is intended for copying structures (rather than arbitrary byte-runs). All such structures, allocated by the CLI, are naturally aligned for the current platform. Therefore, there is no need for the compiler that generates cpblk instructions to be aware of whether the code will eventually execute on a 32-bit or 64-bit platform. end rationale]"
Thanks for pointing it out. It is safe to use movdqu, because movdqa can raise an exception if the address is not aligned. Also, we do not see much difference between movdqa and movdqu when running the experiments.
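To illustrate the distinction being discussed (again, not code from this PR): with the SSE2 intrinsics, the unaligned store maps to movdqu and tolerates any address, while the aligned store maps to movdqa and faults on a misaligned one.

#include <emmintrin.h>
#include <stdint.h>

// Illustrative: aligned vs. unaligned 16-byte stores.
void Store16(uint8_t* dst, __m128i value, bool dstIs16ByteAligned)
{
    if (dstIs16ByteAligned)
        _mm_store_si128((__m128i*)dst, value);    // movdqa: requires a 16-byte aligned address
    else
        _mm_storeu_si128((__m128i*)dst, value);   // movdqu: safe for any address
}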
@helloguo Overall the change is good. Thanks for the fix, and the detailed measurements.
@dotnet-bot test Linux ARM Emulator Cross Debug Build
@DrewScoggins PTAL
@swaroop-sridhar In terms of "Is it worth having an optimized version of JIT_MemCpy without the overlap checks": we do this overlap check because the default implementation has the check, and the CRT implementation has it as well, so we follow the same approach. But this is a good point. Maybe we can open an issue to follow up.
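For readers unfamiliar with the overlap check being discussed: a memmove-style copy only needs a backward pass when the destination starts inside the source range. A minimal sketch of that logic in C++ (illustrative, not this PR's assembly):

#include <stddef.h>
#include <stdint.h>

// Sketch of the classic overlap check behind memmove-like copies.
// If dst is not within [src, src + count), a forward copy is safe;
// otherwise copy backward so source bytes are read before being overwritten.
void CopyWithOverlapCheck(uint8_t* dst, const uint8_t* src, size_t count)
{
    if ((uintptr_t)dst - (uintptr_t)src >= (uintptr_t)count)
    {
        for (size_t i = 0; i < count; i++)        // forward copy
            dst[i] = src[i];
    }
    else
    {
        for (size_t i = count; i > 0; i--)        // backward copy
            dst[i - 1] = src[i - 1];
    }
}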
Thanks for the clarifications.
Please update the check-in message with the work items for all the follow-on work (extension to other memmove/memcpy uses, removing overlap checks, use of wide copies, etc.).
@cmckinsey New data is updated. In order to test JIT_MemSet and JIT_MemCpy, we use DynamicMethod to generate Initblk and Cpblk. The following pseudo code shows how the test works.
The test is run on a Skylake machine: Intel(R) Core(TM) i7-5960X CPU @ 3.00GHz with 32GB RAM; the OS is Windows 10 Enterprise. The results show good improvement. The raw data of JIT_MemSet (in ticks): (table not reproduced here). The raw data of JIT_MemCpy (in ticks): (table not reproduced here).
Thanks for the new numbers. Things look positive for all lengths. LGTM pending clean test results from Swaroop on desktop and opening issues for the things he and Jan called out.
LGTM - I'm happy with the change and the follow-on items. Let's get this win locked in before we go after any further items.
Desktop DDR tests have passed, so the change looks good to me too. Thanks.
@dotnet-bot test Windows_NT arm Cross Checked Build please
not sure "Windows_NT arm Cross XXX" are tests which is why they error? |
Created #7282 and #7283 as follow on items. I believe all the feedback have been addressed. |
@helloguo you could use
@dotnet-bot test this please
@dotnet-bot test Linux ARM Emulator Cross Debug Build
This PR provides an optimized version of the JIT_MemSet/JIT_MemCpy assembly helper functions on 64-bit Windows. JIT_MemSet is invoked when the bytecode Initblk is executed, while JIT_MemCpy is invoked when the bytecode Cpblk is executed. JIT_MemCpy takes care of both overlapping and non-overlapping scenarios. The use of this optimized JIT_MemCpy is extended to Buffer::BlockCopy and Buffer::InternalBlockCopy by replacing the CRT memmove. This PR addresses issue #7146.
The unit test https://github.com/dotnet/coreclr/blob/master/tests/src/performance/perflab/BlockCopyPerf.cs is used as a reference and modified so that the copy length varies from 0 bytes to 520 bytes. This micro benchmark tests Buffer::BlockCopy (JIT_MemCpy). The following chart and table (not reproduced here) show the result on an Intel(R) Core(TM) i7-5960X CPU @ 3.00GHz with 32GB RAM; the OS is Windows 10 Enterprise.
We also measured the performance of TechEmpower (JSON serialization and Plaintext tests). The experimental configuration for the app server is: Intel(R) Xeon(R) CPU E5-2697 v3 @ 2.60GHz with 32GB RAM, running Windows Server 2012 R2 64-bit. The difference between the baseline and the prototype is within the run-to-run variation. For the JSON serialization test, VTune profiling shows the default JIT_MemSet takes 0.6% of overall cycles, while the optimized JIT_MemSet takes 0.4% of cycles in our prototype. The same holds for the Plaintext test: the default JIT_MemSet takes 1.2% of overall cycles, while the optimized JIT_MemSet takes 0.8%. For both the JSON serialization and Plaintext tests, JIT_MemSet therefore improves by about 33%. Neither the JSON serialization nor the Plaintext test uses JIT_MemCpy. In terms of Buffer::BlockCopy, the profiling shows most of the copy lengths are from 16 bytes to 64 bytes, for which CRT memmove and the optimized JIT_MemCpy have similar performance, so the improvement is negligible for TechEmpower.
These optimized routines may need to be revisited when the VC/Universal CRTs get updated, from both performance and maintenance points of view.