
update JIT_MemSet/MemCpy, Buffer::BlockCopy and Buffer::InternalBlock… #7198

Merged · 2 commits · Sep 23, 2016

Conversation

helloguo

This PR provides an optimized version of the JIT_MemSet/JIT_MemCpy assembly helper functions on 64-bit Windows. JIT_MemSet is invoked when the Initblk bytecode is executed, while JIT_MemCpy is invoked when the Cpblk bytecode is executed. JIT_MemCpy handles both overlapping and non-overlapping buffers. The use of this optimized JIT_MemCpy is extended to Buffer::BlockCopy and Buffer::InternalBlockCopy by replacing the CRT memmove. This PR addresses issue #7146.

The unit test https://github.com/dotnet/coreclr/blob/master/tests/src/performance/perflab/BlockCopyPerf.cs is used as a reference and modified so that the copy length varies from 0 bytes to 520 bytes. This micro benchmark exercises Buffer::BlockCopy (JIT_MemCpy). The following chart and table show the results on an Intel(R) Core(TM) i7-5960X CPU @ 3.00GHz with 32GB RAM, running Windows 10 Enterprise.

[chart image]

[table image]
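For readers who want to reproduce a similar measurement, the sketch below shows one way such a varying-length micro-benchmark can be structured. It is not the actual BlockCopyPerf.cs; the buffer sizes, length step, and iteration count are illustrative assumptions.

using System;
using System.Diagnostics;

class BlockCopyBench
{
    static void Main()
    {
        byte[] src = new byte[1024];
        byte[] dst = new byte[1024];
        const int iterations = 10000000;

        // Sweep the copy length from 0 to 520 bytes, timing Buffer.BlockCopy,
        // the managed entry point that this PR redirects to JIT_MemCpy on 64-bit Windows.
        for (int length = 0; length <= 520; length += 8)
        {
            var sw = Stopwatch.StartNew();
            for (int i = 0; i < iterations; i++)
            {
                Buffer.BlockCopy(src, 0, dst, 0, length);
            }
            sw.Stop();
            Console.WriteLine("{0,4} bytes: {1} ms", length, sw.ElapsedMilliseconds);
        }
    }
}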

We also measured the performance of TechEmpower (JSON serialization and Plaintext tests). The experimental configuration for the app server is: Intel(R) Xeon(R) CPU E5-2697 v3 @ 2.60GHz with 32GB RAM running 64-bit Windows Server 2012 R2. The difference between the baseline and the prototype is within the run-to-run variation. For the JSON serialization test, VTune profiling shows the default JIT_MemSet takes 0.6% of overall cycles, while the optimized JIT_MemSet takes 0.4% of overall cycles in our prototype. The same holds for the Plaintext test: the default JIT_MemSet takes 1.2% of overall cycles, while the optimized JIT_MemSet takes 0.8%. So for both the JSON serialization and Plaintext tests, JIT_MemSet improves by about 33%. Neither test uses JIT_MemCpy. In terms of Buffer::BlockCopy, the profiling shows most copy lengths are between 16 and 64 bytes, for which CRT memmove and the optimized JIT_MemCpy have similar performance, so the improvement on TechEmpower is negligible.

These optimized routines may need to be revisited, from both performance and maintenance points of view, when the VC/universal CRTs get updated.


@cmckinsey

@russellhadley @swaroop-sridhar PTAL
/cc @dotnet/jit-contrib

;JIT_MemCpy - Copy source buffer to destination buffer
;
;Purpose:
;JIT_MemCpy - Copy source buffer to destination buffer
@swaroop-sridhar (Sep 15, 2016)

Similarly, please add a header comment to memcpy.

@helloguo (Author)

Done.

;char *memset(dst, value, count) - sets "count" bytes at "dst" to "value"
;
;Purpose:


Can you please retain the header comments that describe the purpose of the method? Please update the algorithm description in the comments to reflect the new strategy.

@helloguo (Author)

Thanks for the comment, done.

@cmckinsey

Very happy to see this contribution. Small copies of lengths 0-16 are a very common range, and it appears that some paths in the implementation are gated on these small sizes. Can we see the experimental results including the 0-16 lengths? Also, can we see similar results for a JIT_MemSet benchmark at varying lengths?

I'd also like to see a future contribution that includes changes to the Linux .s implementation. Also, I'd suggest running a pass of this on the full framework, since there are many more extensive cpblk/initblk tests there from the managed C++ compiler.

@@ -1524,7 +1528,11 @@ FCIMPL5(VOID, Buffer::InternalBlockCopy, ArrayBase *src, int srcOffset, ArrayBas
_ASSERTE(count >= 0);

// Copy the data.
#if defined(_WIN64)
Member:

This should be #if defined(_AMD64_) && !defined(PLATFORM_UNIX). Otherwise, there will be a build break on ARM64, and Unix will get slower for no good reason.

@helloguo (Author)

Thanks for pointing it out. It's updated.

@@ -1524,7 +1528,11 @@ FCIMPL5(VOID, Buffer::InternalBlockCopy, ArrayBase *src, int srcOffset, ArrayBas
_ASSERTE(count >= 0);

// Copy the data.
#if defined(_WIN64)
JIT_MemCpy(dst->GetDataPtr() + dstOffset, src->GetDataPtr() + srcOffset, count);
#else
Member:

There are a number of other uses of memmove. If this is faster than the stock memmove implementation, should it be re-defined in some more central header instead?

Member:

Or ideally ... fixed in the stock memmove implementation, so that all programs benefit from it and we do not have to have a private copy that has to be kept up to date with the latest tweaks.


I opened issue #7234 to extend this work to other instances of memcpy/memmove.

@swaroop-sridhar

ECMA-335 says of the cpblk instruction: "The behavior of cpblk is unspecified if the source and destination areas overlap." That is, cpblk is permitted to behave like memcpy() even for overlapping regions.
Is it worth having an optimized version of JIT_MemCpy without the overlap checks?

shr r9, 6 ; count/64

align 16
mcpy09: movdqu xmm0, [rdx]


When AVX is available, shouldn't we use YMM wide copy/initializations?

@helloguo (Author), Sep 16, 2016

Good point. Based on the experimental data, we do not see an advantage from using YMM. Basically, when the copy length is large, 'rep movsb' or 'rep stosb' is a good choice. In addition, a CPUID check would be needed if we chose YMM.

@helloguo (Author)

For certain copy lengths, we do see an improvement when using YMM with vzeroupper. However, considering the CPU dispatch cost, we may treat YMM as a future improvement.

Member:

For the future, but could the CPUID check be passed in as a cached parameter?

mov eax, edx ; eax is value
mov rdi, rcx ; rdi is dst
mov rcx, r8 ; rcx is count
rep stosb


For cases larger than 512 bytes, do we expect rep stosb/movsb to be faster than XMM/YMM wide copies?

@benaadams (Member), Sep 15, 2016

If CPUID EAX=7, EBX=0 reports bit 9 (Enhanced REP MOVSB/STOSB), it is faster; however, it has a setup cost that needs to be amortized versus vector copies (https://github.com/dotnet/coreclr/issues/6661#issuecomment-242547024), so there is a cut-over point.

@swaroop-sridhar (Sep 15, 2016)

Thanks @benaadams for the clarification.
@helloguo why is the cutoff 512 bytes? Is this based on measurement, or on an existing implementation (e.g. in the CRT)? Can you please add a comment about it?

Also, it looks like aligning the src/dest will reduce the startup overhead for movsb. Is it worth trying to align the addresses before movsb?

@helloguo (Author)

Yes, 512 was chosen based on experimental data. Above 512 bytes, 'rep movsb' performs better.

shr r9, 7 ; count/128

align 16
mset01: movdqu [rcx], xmm0
@swaroop-sridhar (Sep 15, 2016)

Is it worth considering src/dest alignment? For CpBlk/InitBlk, the sources/destinations are usually expected to be aligned, and therefore can benefit from the aligned write instructions (movdqa instead of movdqu).

ECMA-335 says:
"cpblk assumes that both destaddr and srcaddr are aligned to the natural size of the machine (but
see the unaligned. prefix instruction). The operation of the cpblk instruction can be altered by
an immediately preceding volatile. or unaligned. prefix instruction.
[Rationale: cpblk is intended for copying structures (rather than arbitrary byte-runs). All such
structures, allocated by the CLI, are naturally aligned for the current platform. Therefore, there is
no need for the compiler that generates cpblk instructions to be aware of whether the code will
eventually execute on a 32-bit or 64-bit platform. end rationale]"

@helloguo (Author), Sep 16, 2016

Thanks for pointing it out. It is safe to use movdqu, whereas movdqa can raise an exception if the address is not aligned. Also, we do not see much difference between movdqa and movdqu in our experiments.

@swaroop-sridhar left a comment

@helloguo Overall the change is good. Thanks for the fix, and the detailed measurements.

@helloguo (Author)

@dotnet-bot test Linux ARM Emulator Cross Debug Build
@dotnet-bot test Linux ARM Emulator Cross Release Build
@dotnet-bot test Ubuntu x64 Checked Build and Test
@dotnet-bot test Windows_NT arm Cross Checked Build
@dotnet-bot test Windows_NT arm Cross Debug Build
@dotnet-bot test Windows_NT arm Cross Release Build

@helloguo (Author)

@DrewScoggins PTAL

@helloguo (Author)

@swaroop-sridhar Regarding "Is it worth having an optimized version of JIT_MemCpy without the overlap checks": we perform the overlap check because the default implementation has it, and the CRT implementation has it as well, so we follow the same approach. But this is a good point; maybe we can open an issue to follow up.
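As a small illustration (not from the PR) of why the overlap handling matters on the Buffer::BlockCopy path: the same array may legally be passed as both source and destination with overlapping ranges, and the existing implementation (CRT memmove) handles that case, so the replacement helper is expected to as well.

byte[] buffer = { 0, 1, 2, 3, 4, 5, 6, 7 };
// Copy bytes 0..5 onto bytes 2..7 of the same array; the ranges overlap.
Buffer.BlockCopy(buffer, 0, buffer, 2, 6);
// With memmove-style overlap handling, buffer is now { 0, 1, 0, 1, 2, 3, 4, 5 }.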

@swaroop-sridhar left a comment

Thanks for the clarifications.
Please update the check-in message with the work items for all the follow-on work (extension to other memmove/memcpy uses, removing overlap checks, use of wide copies, etc.).

@helloguo (Author)

@cmckinsey The new data has been added below.

In order to test JIT_MemSet and JIT_MemCpy, we use DynamicMethod to generate Initblk and Cpblk. The following pseudo code shows how the test works.

var memset_dm = new DynamicMethod();
var il1 = memset_dm.GetILGenerator();
il1.Emit(OpCodes.Ldarg_0); // dst address
il1.Emit(OpCodes.Ldarg_1); // value
il1.Emit(OpCodes.Ldarg_2); // number of bytes
il1.Emit(OpCodes.Initblk);
il1.Emit(OpCodes.Ret);
MemSet = memset_dm.CreateDelegate();

var memcpy_dm = new DynamicMethod();
var il2 = memcpy_dm.GetILGenerator();
il2.Emit(OpCodes.Ldarg_0); // dst address
il2.Emit(OpCodes.Ldarg_1); // src address
il2.Emit(OpCodes.Ldarg_2); // number of bytes
il2.Emit(OpCodes.Cpblk);
il2.Emit(OpCodes.Ret);
MemCpy = memcpy_dm.CreateDelegate();

// arr_count is an array containing 1000 pseudo-random lengths within one bucket.
// For example, when testing the bucket from 0 to 8 bytes, all 1000 elements of
// arr_count are between 0 and 8.
begin = DateTime.Now.Ticks;
for (i = 0; i < 1000000; i++)
    for (j = 0; j < 1000; j++)
        MemSet(dst_address, value, arr_count[j]);
        // or: MemCpy(dst_address, src_address, arr_count[j]);
end = DateTime.Now.Ticks;
time = end - begin;

The test runs on an Intel(R) Core(TM) i7-5960X CPU @ 3.00GHz with 32GB RAM, running Windows 10 Enterprise. The results show good improvement.

[chart images]

The raw data for JIT_MemSet is as follows (numbers are ticks):

[table image]

The raw data for JIT_MemCpy is as follows (numbers are ticks):

[table image]

@cmckinsey

Thanks for the new numbers. Things look positive for all lengths. LGTM pending clean test results from Swaroop on desktop and opening issues for the things he and Jan called out.

@russellhadley

LGTM - I'm happy with the change and the follow-on items. Let's get this win locked in before we go after any further items.

@swaroop-sridhar

Desktop DDR tests have passed, so the change looks good to me too. Thanks.

@helloguo (Author)

@dotnet-bot test Windows_NT arm Cross Checked Build please
@dotnet-bot test Windows_NT arm Cross Debug Build please
@dotnet-bot test Windows_NT arm Cross Release Build please

@benaadams (Member) commented on Sep 20, 2016

Not sure the "Windows_NT arm Cross XXX" ones are actual tests, which may be why they error?

@helloguo (Author)

Created #7282 and #7283 as follow-on items. I believe all the feedback has been addressed.

@nietras commented on Sep 21, 2016

In order to test JIT_MemSet and JIT_MemCpy, we use DynamicMethod to generate Initblk and Cpblk.

@helloguo you could use System.Runtime.CompilerServices.Unsafe also.
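A rough sketch of the alternative @nietras mentions, assuming the System.Runtime.CompilerServices.Unsafe package is referenced: Unsafe.InitBlock and Unsafe.CopyBlock are implemented in terms of the initblk and cpblk IL opcodes, so a harness like the following could drive the same code paths without DynamicMethod (the method name and parameters here are illustrative).

using System.Runtime.CompilerServices;

static unsafe void Exercise(byte* dst, byte* src, uint count)
{
    // Unsafe.InitBlock emits initblk; Unsafe.CopyBlock emits cpblk.
    Unsafe.InitBlock(dst, 0xAB, count);
    Unsafe.CopyBlock(dst, src, count);
}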

@swaroop-sridhar

@dotnet-bot test this please

@helloguo (Author)

@dotnet-bot test Linux ARM Emulator Cross Debug Build
@dotnet-bot test Linux ARM Emulator Cross Release Build

@swaroop-sridhar merged commit 3bfe9a0 into dotnet:master on Sep 23, 2016
swaroop-sridhar pushed a commit to swaroop-sridhar/coreclr that referenced this pull request Sep 24, 2016
This PR provides an optimized version of JIT_MemSet/JIT_MemCpy assembly helper
functions on 64-bit Windows. JIT_MemSet gets invoked when bytecode Initblk is
executed while JIT_MemCpy gets invoked when bytecode Cpblk is executed.
JIT_MemCpy takes care of both overlap and non-overlap scenarios.
The use of this optimized JIT_MemCpy is extended to Buffer::BlockCopy
and Buffer::InternalBlockCopy by replacing the CRT memmove.

The unit test BlockCopyPerf.cs is used as reference and modified so that the
copy length varies from 0 byte to 520 bytes.
This micro benchmark tests Buffer::BlockCopy (JIT_MemCpy).
The following chart and table show the result on
Intel(R) Core(TM) i7-5960X CPU @ 3.00GHz with 32GB RAM. OS is
Windows 10 Enterprise.

Further details about performance improvements are available at
dotnet#7198

Fixes #7146.
@benaadams mentioned this pull request on Feb 15, 2017
picenka21 pushed a commit to picenka21/runtime that referenced this pull request Feb 18, 2022
update JIT_MemSet/MemCpy, Buffer::BlockCopy and Buffer::InternalBlockCopy (dotnet/coreclr#7198)

* update JIT_MemSet/MemCpy, Buffer::BlockCopy and Buffer::InternalBlockCopy

* add header comments


Commit migrated from dotnet/coreclr@3bfe9a0