update JIT_MemSet/MemCpy, Buffer::BlockCopy and Buffer::InternalBlock… #7198
Conversation
Also related issue: https://github.com/dotnet/coreclr/issues/6661
@russellhadley @swaroop-sridhar PTAL
;JIT_MemCpy - Copy source buffer to destination buffer
;
;Purpose:
;JIT_MemCpy - Copy source buffer to destination buffer
Similarly, please add a header comment to memcpy.
Done.
;char *memset(dst, value, count) - sets "count" bytes at "dst" to "value"
;
;Purpose:
Can you please retain the header comments that describe the purpose of the method?
Please also update the algorithm described in the comments to reflect the new strategy.
Thanks for the comment, done.
Very happy to see this contribution. Small copies of lengths 0-16 are a very common range, and it appears that some paths in the implementation are gated on these small sizes. Can we see the experimental results including 0-16 lengths? Also, can we see similar results for a JIT_MemSet benchmark at varying lengths? I'd also like to see a future contribution that includes changes to the Linux .s implementation. Also, I'd suggest running a pass of this on the full framework, since there are many more extensive cpblk/initblk tests there from the managed C++ compiler.
@@ -1524,7 +1528,11 @@ FCIMPL5(VOID, Buffer::InternalBlockCopy, ArrayBase *src, int srcOffset, ArrayBas
    _ASSERTE(count >= 0);

    // Copy the data.
#if defined(_WIN64)
This should be #if defined(_AMD64_) && !defined(PLATFORM_UNIX). Otherwise, there will be a build break on ARM64, and Unix will get slower for no good reason.
Thanks for pointing it out. It's updated.
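For reference, a minimal sketch of what the corrected guard around the copy looks like once the suggestion above is applied (the #else branch falling back to the CRT memmove is inferred from the PR description rather than quoted from the diff):

    // Copy the data.
#if defined(_AMD64_) && !defined(PLATFORM_UNIX)
    // Windows AMD64: use the optimized assembly helper, which, like memmove,
    // handles overlapping source/destination ranges.
    JIT_MemCpy(dst->GetDataPtr() + dstOffset, src->GetDataPtr() + srcOffset, count);
#else
    // Other architectures and Unix keep the stock CRT memmove.
    memmove(dst->GetDataPtr() + dstOffset, src->GetDataPtr() + srcOffset, count);
#endif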
@@ -1524,7 +1528,11 @@ FCIMPL5(VOID, Buffer::InternalBlockCopy, ArrayBase *src, int srcOffset, ArrayBas
    _ASSERTE(count >= 0);

    // Copy the data.
#if defined(_WIN64)
    JIT_MemCpy(dst->GetDataPtr() + dstOffset, src->GetDataPtr() + srcOffset, count);
#else
There are a number of other uses of memmove. If this is faster than the stock memmove implementation, should it be redefined in some more central header instead?
Or ideally ... fixed in the stock memmove implementation, so that all programs benefit from it and we do not have to have a private copy that has to be kept up to date with the latest tweaks.
I opened issue #7234 to extend this work to other instances of memcpy/memmove.
        shr r9, 6 ; count/64

        align 16
mcpy09: movdqu xmm0, [rdx]
When AVX is available, shouldn't we use YMM-wide copies/initializations?
Good point. Based on the experimental data, we do not see an advantage when using YMM. Basically, when the copy length is large, 'rep movsb' or 'rep stosb' is a good choice. In addition, a CPUID check is needed if we choose YMM.
For certain copy lengths, we do see an improvement when using YMM with vzeroupper. However, considering the CPU dispatch cost, we may consider YMM as a future improvement.
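For context, a YMM-wide copy of the kind being weighed here would look roughly like the sketch below. This is illustrative only and not code from this PR; the function name and loop structure are assumptions, and a real implementation would also need an AVX CPUID check and a better remainder path.

#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

// Illustrative 32-byte-at-a-time copy using YMM registers.
// _mm256_zeroupper() (vzeroupper) avoids the AVX-to-SSE transition
// penalty after 256-bit registers have been used.
static void CopyYmm(uint8_t* dst, const uint8_t* src, size_t count)
{
    size_t i = 0;
    for (; i + 32 <= count; i += 32)
    {
        __m256i v = _mm256_loadu_si256((const __m256i*)(src + i));
        _mm256_storeu_si256((__m256i*)(dst + i), v);
    }
    _mm256_zeroupper();           // clear the upper halves of the YMM registers
    for (; i < count; i++)        // byte tail
        dst[i] = src[i];
}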
For the future, but could the CPUID check result be passed in as a cached parameter?
        mov eax, edx ; eax is value
        mov rdi, rcx ; rdi is dst
        mov rcx, r8  ; rcx is count
        rep stosb
For cases larger than 512 bytes, do we expect rep stosb/movsb to be faster than XMM/YMM wide copies?
If CPUID EAX=7, EBX=0 => bit 9 (Enhanced REP MOVSB/STOSB) is set, it is faster; however, it has a setup cost which needs to be amortized vs. vector copies (https://github.com/dotnet/coreclr/issues/6661#issuecomment-242547024), so there is a cut-over point.
Thanks @benaadams for the clarification.
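For reference, the feature bit @benaadams mentions can be queried like this with the MSVC intrinsics (a sketch only; the helper name is made up, and this PR itself does not perform a CPUID-based dispatch):

#include <intrin.h>

// Sketch: detect Enhanced REP MOVSB/STOSB (ERMS).
// CPUID leaf 7, sub-leaf 0 reports ERMS support in EBX bit 9.
static bool HasEnhancedRepMovsbStosb()
{
    int regs[4] = { 0, 0, 0, 0 };      // EAX, EBX, ECX, EDX
    __cpuidex(regs, 7, 0);
    return (regs[1] & (1 << 9)) != 0;  // EBX bit 9
}

Such a check would presumably be done once and cached, along the lines of the "cached param" idea raised earlier, rather than executed on every copy.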
@helloguo why's the cutoff 512B? Is this based on measurement, or because of existing implementation (ex: in CRT)? Can you please add a comment about it?
Also, it looks like aligning the src/dest will reduce the startup overhead for movsb. Is it worth trying to align the addresses before movsb?
Yes, 512 is chosen based on the experimental data. Above 512 bytes, 'rep movsb' is better.
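To make the cut-over strategy discussed above concrete, here is an illustrative C++ sketch (not this PR's assembly): XMM-wide copies below the threshold, rep movsb at or above it. The 512-byte constant reflects the measurements mentioned above; the function name and structure are assumptions for illustration.

#include <intrin.h>
#include <emmintrin.h>
#include <stddef.h>
#include <stdint.h>

// Illustrative forward copy with a size cut-over.
static void CopyForward(uint8_t* dst, const uint8_t* src, size_t count)
{
    if (count >= 512)
    {
        __movsb(dst, src, count);   // emits rep movsb
        return;
    }
    size_t i = 0;
    for (; i + 16 <= count; i += 16)
    {
        __m128i v = _mm_loadu_si128((const __m128i*)(src + i));
        _mm_storeu_si128((__m128i*)(dst + i), v);
    }
    for (; i < count; i++)          // byte tail
        dst[i] = src[i];
}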
        shr r9, 7 ; count/128

        align 16
mset01: movdqu [rcx], xmm0
Is it worth considering src/dest alignment? For CpBlk/InitBlk, the sources/destinations are usually expected to be aligned, and can therefore benefit from the aligned write instructions (movdqa instead of movdqu).
ECMA-335 says:
"cpblk assumes that both destaddr and srcaddr are aligned to the natural size of the machine (but see the unaligned. prefix instruction). The operation of the cpblk instruction can be altered by an immediately preceding volatile. or unaligned. prefix instruction.
[Rationale: cpblk is intended for copying structures (rather than arbitrary byte-runs). All such structures, allocated by the CLI, are naturally aligned for the current platform. Therefore, there is no need for the compiler that generates cpblk instructions to be aware of whether the code will eventually execute on a 32-bit or 64-bit platform. end rationale]"
Thanks for pointing it out. It is safe to use movdqu, because movdqa can raise an exception if the address is not aligned. Also, we do not see much difference between movdqa and movdqu when running the experiments.
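To illustrate the distinction being discussed (again, not code from this PR): with the SSE2 intrinsics, the unaligned store maps to movdqu and tolerates any address, while the aligned store maps to movdqa and faults on a misaligned one.

#include <emmintrin.h>
#include <stdint.h>

// Illustrative: aligned vs. unaligned 16-byte stores.
void Store16(uint8_t* dst, __m128i value, bool dstIs16ByteAligned)
{
    if (dstIs16ByteAligned)
        _mm_store_si128((__m128i*)dst, value);    // movdqa: requires a 16-byte aligned address
    else
        _mm_storeu_si128((__m128i*)dst, value);   // movdqu: safe for any address
}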
@helloguo Overall the change is good. Thanks for the fix, and the detailed measurements.
@dotnet-bot test Linux ARM Emulator Cross Debug Build
@DrewScoggins PTAL
@swaroop-sridhar In terms of "Is it worth having an optimized version of JIT_MemCpy without the overlap checks": we do this overlap check because the default implementation has the check, and the CRT implementation has it as well, so we follow the same approach. But this is a good point. Maybe we can open an issue to follow up.
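For readers unfamiliar with the overlap check being discussed: a memmove-style copy only needs a backward pass when the destination starts inside the source range. A minimal sketch of that logic in C++ (illustrative, not this PR's assembly):

#include <stddef.h>
#include <stdint.h>

// Sketch of the classic overlap check behind memmove-like copies.
// If dst is not within [src, src + count), a forward copy is safe;
// otherwise copy backward so source bytes are read before being overwritten.
void CopyWithOverlapCheck(uint8_t* dst, const uint8_t* src, size_t count)
{
    if ((uintptr_t)dst - (uintptr_t)src >= (uintptr_t)count)
    {
        for (size_t i = 0; i < count; i++)        // forward copy
            dst[i] = src[i];
    }
    else
    {
        for (size_t i = count; i > 0; i--)        // backward copy
            dst[i - 1] = src[i - 1];
    }
}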
Thanks for the clarifications.
Please update the check-in message with the work items for all the follow-on work (extension to other memmove/memcpy uses, removing overlap checks, use of wide copies, etc.).
@cmckinsey New data is updated. In order to test JIT_MemSet and JIT_MemCpy, we use DynamicMethod to generate Initblk and Cpblk. The following pseudo code shows how the test works.
The test is run on a Skylake machine: Intel(R) Core(TM) i7-5960X CPU @ 3.00GHz with 32GB RAM; the OS is Windows 10 Enterprise. The results show good improvement. The raw data of JIT_MemSet (in ticks): (table not reproduced here). The raw data of JIT_MemCpy (in ticks): (table not reproduced here).
Thanks for the new numbers. Things look positive for all lengths. LGTM pending clean test results from Swaroop on desktop and opening issues for the things he and Jan called out.
LGTM - I'm happy with the change and the follow-on items. Let's get this win locked in before we go after any further items.
Desktop DDR tests have passed, so the change looks good to me too. Thanks.
@dotnet-bot test Windows_NT arm Cross Checked Build please
not sure "Windows_NT arm Cross XXX" are tests which is why they error? |
Created #7282 and #7283 as follow on items. I believe all the feedback have been addressed. |
@helloguo you could use
@dotnet-bot test this please
@dotnet-bot test Linux ARM Emulator Cross Debug Build
This PR provides an optimized version of the JIT_MemSet/JIT_MemCpy assembly helper functions on 64-bit Windows. JIT_MemSet is invoked when the bytecode Initblk is executed, while JIT_MemCpy is invoked when the bytecode Cpblk is executed. JIT_MemCpy takes care of both overlapping and non-overlapping scenarios. The use of this optimized JIT_MemCpy is extended to Buffer::BlockCopy and Buffer::InternalBlockCopy by replacing the CRT memmove. This PR addresses issue #7146.
The unit test https://github.com/dotnet/coreclr/blob/master/tests/src/performance/perflab/BlockCopyPerf.cs is used as a reference and modified so that the copy length varies from 0 bytes to 520 bytes. This micro benchmark tests Buffer::BlockCopy (JIT_MemCpy). The following chart and table (not reproduced here) show the result on an Intel(R) Core(TM) i7-5960X CPU @ 3.00GHz with 32GB RAM; the OS is Windows 10 Enterprise.
We also measured the performance of TechEmpower (JSON serialization and Plaintext tests). The experimental configuration for the app server is: Intel(R) Xeon(R) CPU E5-2697 v3 @ 2.60GHz with 32GB RAM, running Windows Server 2012 R2 64-bit. The difference between the baseline and the prototype is within the run-to-run variation. For the JSON serialization test, VTune profiling shows the default JIT_MemSet takes 0.6% of overall cycles, while the optimized JIT_MemSet takes 0.4% of cycles in our prototype. The same holds for the Plaintext test: the default JIT_MemSet takes 1.2% of overall cycles, while the optimized JIT_MemSet takes 0.8%. For both the JSON serialization and Plaintext tests, JIT_MemSet therefore improves by about 33%. Neither the JSON serialization nor the Plaintext test uses JIT_MemCpy. In terms of Buffer::BlockCopy, the profiling shows most of the copy lengths are from 16 bytes to 64 bytes, for which CRT memmove and the optimized JIT_MemCpy have similar performance, so the improvement is negligible for TechEmpower.
These optimized routines may need to be revisited when the VC/Universal CRTs get updated, from both performance and maintenance points of view.