[vm] Optimize MemoryCopyInstr with constant parameters #51031
Comments
We should optimise it for unboxed parameters as well.
As far as I can see, we don't have a
@dcharkes
This indicates that this operation should just take an unboxed integer of appropriate size (e.g. (There might be some fusing opportunities on ARM - which does not have something like
I'll give it a spin, but I kind of expect more boxing/unboxing operations, let's see. edit: For the length, yes. For the start and destination offsets with element size 2, 4, or 8, no: these avoid untagging by using a 0.5x scale in the offset calculation.
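The 0.5x scale can be illustrated with plain integer arithmetic. This is only a sketch: it assumes a Smi is the integer value shifted left by one bit with a zero tag bit, and the function names are hypothetical, not the VM's.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: a Smi stores the integer shifted left by one (low tag bit = 0).
inline int64_t TagSmi(int64_t value) { return value * 2; }

// Byte offset computed directly from a still-tagged element index.
// For element sizes 2, 4 and 8 the untagging shift is folded into the
// scale factor: tagged_index * (element_size / 2).
inline int64_t ByteOffsetFromTagged(int64_t tagged_index,
                                    int64_t element_size) {
  assert(element_size == 2 || element_size == 4 || element_size == 8);
  return tagged_index * (element_size / 2);
}

// Element size 1 has no smaller scale to fold into, so the index is
// untagged explicitly before scaling.
inline int64_t ByteOffsetUntagFirst(int64_t tagged_index,
                                    int64_t element_size) {
  assert(tagged_index >= 0);
  const int64_t index = tagged_index >> 1;  // explicit untag
  return index * element_size;
}
```

Both functions agree wherever both apply; the first one simply saves the separate shift.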
https://dart-review.googlesource.com/c/sdk/+/279172 Oh, yes, we need to force-optimize to use unboxed. This causes the recognized methods that do copy to be outlined and adds static calls. So the code becomes slower unless we start supporting the unboxed representations in unoptimized mode or support inlining force optimized functions (I need to think a bit if those are idempotent).
Alternatively you could just explicitly add inlining code for this function in the |
When the
We should consider doing one of the following:
The potential use case here is the
@askeksa-google Do you remember the rationale for making the src/dest offsets be in element size rather than in bytes?
Done in https://dart-review.googlesource.com/c/sdk/+/279172. Feel free to leave further suggestions on the stack of CLs.
TEST=runtime/vm/compiler/backend/memory_copy_test.cc Bug: #51031 Change-Id: I6b0b2eb63f97ae9d7d3c9c80d998929f657ef482 Cq-Include-Trybots: luci.dart.try:vm-precomp-ffi-qemu-linux-release-riscv64-try,vm-precomp-ffi-qemu-linux-release-arm-try,vm-ffi-android-debug-arm64c-try,vm-ffi-android-debug-arm-try,vm-kernel-nnbd-mac-debug-arm64-try,vm-kernel-nnbd-win-debug-x64-try,vm-kernel-win-debug-x64c-try,vm-kernel-win-debug-ia32-try,vm-kernel-nnbd-linux-debug-ia32-try,vm-reload-rollback-linux-debug-x64-try Reviewed-on: https://dart-review.googlesource.com/c/sdk/+/279387 Reviewed-by: Alexander Markov <alexmarkov@google.com>
When constants are passed in to the source and destination start, no registers are needed for these. The constants are directly compiled into the machine code. If the constant happens to be 0, no machine code is emitted at all. I did not measure any speed improvements. Likely the micro-code schedulers in the CPUs already noticed the no-ops. I have verified manually that we emit smaller machine code with these changes on x64. TEST=runtime/vm/compiler/backend/memory_copy_test.cc Bug: #51031 Change-Id: I70f12c9ae299b44a8f5007ca3a8c5ee56a9aff40 Cq-Include-Trybots: luci.dart.try:vm-precomp-ffi-qemu-linux-release-riscv64-try,vm-precomp-ffi-qemu-linux-release-arm-try,vm-ffi-android-debug-arm64c-try,vm-ffi-android-debug-arm-try,vm-kernel-nnbd-mac-debug-arm64-try,vm-kernel-nnbd-win-debug-x64-try,vm-kernel-win-debug-x64c-try,vm-kernel-win-debug-ia32-try,vm-kernel-nnbd-linux-debug-ia32-try,vm-reload-rollback-linux-debug-x64-try Reviewed-on: https://dart-review.googlesource.com/c/sdk/+/279170 Reviewed-by: Alexander Markov <alexmarkov@google.com>
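A rough sketch of the idea in this commit message, not the actual VM code: the Operand type, the helper names, and the pseudo-assembly strings are all hypothetical, and Smi tagging is ignored for simplicity. A constant offset becomes an immediate, and a zero byte offset emits nothing.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Hypothetical operand model: either a compile-time constant or a register.
struct Operand {
  std::optional<int64_t> constant;  // set when the value is known statically
  std::string reg;                  // register name otherwise
};

inline Operand Constant(int64_t value) { return Operand{value, ""}; }
inline Operand Register(const std::string& name) {
  return Operand{std::nullopt, name};
}

// Returns pseudo-assembly for "base += offset * element_size".
inline std::vector<std::string> EmitOffsetAdd(const std::string& base,
                                              const Operand& offset,
                                              int64_t element_size) {
  assert(element_size > 0);
  std::vector<std::string> out;
  if (offset.constant.has_value()) {
    const int64_t bytes = *offset.constant * element_size;
    // A zero byte offset needs no instruction at all.
    if (bytes != 0) {
      out.push_back("add " + base + ", " + std::to_string(bytes));
    }
    return out;
  }
  // Non-constant offset: it has to live in a register.
  out.push_back("lea " + base + ", [" + base + " + " + offset.reg + "*" +
                std::to_string(element_size) + "]");
  return out;
}
```

For example, EmitOffsetAdd("rsi", Constant(0), 4) returns an empty list, while a register operand still produces one address computation.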
This CL removes a Smi untagging by taking an unboxed input for the length in the memory copy instruction or changing the loop decrementor to 2 instead of 1. Unboxing required implementing the method in the inliner because unboxed representations cannot be used in the method recognizer. Reduces code size slightly. Does not measurably improve speed. TEST=runtime/vm/compiler/backend/memory_copy_test.cc Bug: #51031 Change-Id: Ie311929af25b76c3b899ff2791bfaf4e40b1f06f Cq-Include-Trybots: luci.dart.try:vm-precomp-ffi-qemu-linux-release-riscv64-try,vm-precomp-ffi-qemu-linux-release-arm-try,vm-ffi-android-debug-arm64c-try,vm-ffi-android-debug-arm-try,vm-kernel-nnbd-mac-debug-arm64-try,vm-kernel-nnbd-win-debug-x64-try,vm-kernel-win-debug-x64c-try,vm-kernel-win-debug-ia32-try,vm-kernel-nnbd-linux-debug-ia32-try,vm-reload-rollback-linux-debug-x64-try Reviewed-on: https://dart-review.googlesource.com/c/sdk/+/279172 Reviewed-by: Alexander Markov <alexmarkov@google.com>
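The "decrement by 2" alternative can be sketched as follows, again assuming a Smi is the value shifted left by one bit. The function name is hypothetical, and the real instruction works on machine registers rather than C++ pointers.

```cpp
#include <cassert>
#include <cstdint>

// Copies (tagged_length / 2) bytes while keeping the loop counter tagged.
// Assumption: tagged_length is a Smi, i.e. the element count times two.
inline void CopyBytesTaggedLength(uint8_t* dest, const uint8_t* src,
                                  int64_t tagged_length) {
  assert(tagged_length >= 0 && (tagged_length & 1) == 0);
  while (tagged_length != 0) {
    *dest++ = *src++;
    tagged_length -= 2;  // one element per iteration, in tagged units
  }
}
```

Stepping the tagged counter by 2 visits the same number of iterations as untagging first and stepping by 1, so the untagging instruction before the loop can be dropped.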
Removes loops with constant length taking code-size into account. On ia32 and x64 only removes single iteration (removing the branch). This speeds up single byte copies. On arm, arm64, and risc-v, removes loops up to 4 iterations, shrinking code size. No speedups were measured on these platforms. TEST=runtime/vm/compiler/backend/memory_copy_test.cc Bug: #51031 Change-Id: I292ebde023b3ec2c3a9ce872e0c9543ac43371b9 Cq-Include-Trybots: luci.dart.try:vm-precomp-ffi-qemu-linux-release-riscv64-try,vm-precomp-ffi-qemu-linux-release-arm-try,vm-ffi-android-debug-arm64c-try,vm-ffi-android-debug-arm-try,vm-kernel-nnbd-mac-debug-arm64-try,vm-kernel-nnbd-win-debug-x64-try,vm-kernel-win-debug-x64c-try,vm-kernel-win-debug-ia32-try,vm-kernel-nnbd-linux-debug-ia32-try,vm-reload-rollback-linux-debug-x64-try Reviewed-on: https://dart-review.googlesource.com/c/sdk/+/279178 Reviewed-by: Aske Simon Christensen <askesc@google.com> Reviewed-by: Alexander Markov <alexmarkov@google.com>
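The per-architecture decision described in this commit message can be written as a small predicate. This is a sketch: the thresholds (1 and 4) come from the message above, but the enum and function names are hypothetical and do not reflect how the VM structures this check.

```cpp
#include <cassert>
#include <cstdint>

enum class Arch { kIA32, kX64, kARM, kARM64, kRISCV };

// Largest constant iteration count for which the loop is replaced by
// straight-line moves: 1 on ia32/x64, 4 on arm, arm64 and risc-v.
inline int64_t MaxUnrolledIterations(Arch arch) {
  return (arch == Arch::kIA32 || arch == Arch::kX64) ? 1 : 4;
}

// True when a copy with a compile-time-constant length should be emitted
// without a loop.
inline bool ShouldUnroll(Arch arch, int64_t constant_length) {
  assert(constant_length >= 0);
  return constant_length <= MaxUnrolledIterations(arch);
}
```

Under these thresholds ShouldUnroll(Arch::kX64, 2) is false while ShouldUnroll(Arch::kARM64, 4) is true, matching the code-size trade-off the message describes.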
A larger element size results in larger mov instructions. This reduces code size and increases performance. The element size can only be increased if the src_offset, dest_offset, and length parameters are const, and if they share a common power-of-two factor. TEST=runtime/vm/compiler/backend/memory_copy_test.cc Bug: #51031 Change-Id: If35fb419aa118c497b15c122bdf6279266e2294a Cq-Include-Trybots: luci.dart.try:vm-precomp-ffi-qemu-linux-release-riscv64-try,vm-precomp-ffi-qemu-linux-release-arm-try,vm-ffi-android-debug-arm64c-try,vm-ffi-android-debug-arm-try,vm-kernel-nnbd-mac-debug-arm64-try,vm-kernel-nnbd-win-debug-x64-try,vm-kernel-win-debug-x64c-try,vm-kernel-win-debug-ia32-try,vm-kernel-nnbd-linux-debug-ia32-try,vm-reload-rollback-linux-debug-x64-try Reviewed-on: https://dart-review.googlesource.com/c/sdk/+/279506 Commit-Queue: Daco Harkes <dacoharkes@google.com> Reviewed-by: Aske Simon Christensen <askesc@google.com> Reviewed-by: Alexander Markov <alexmarkov@google.com>
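One way to picture the widening rule is the sketch below. It assumes offsets and length are expressed in elements and are all compile-time constants; the struct, the function name, and the max_element_size parameter are hypothetical, and alignment requirements of the target are ignored.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical description of a copy with constant parameters.
struct CopyParams {
  int64_t element_size;  // in bytes, a power of two
  int64_t src_offset;    // in elements
  int64_t dest_offset;   // in elements
  int64_t length;        // in elements
};

// Doubles the element size for as long as both offsets and the length are
// even, halving them in step so the byte range being copied is unchanged.
inline CopyParams WidenElementSize(CopyParams p, int64_t max_element_size) {
  assert(p.element_size > 0 && p.length >= 0);
  while (p.element_size * 2 <= max_element_size &&
         p.src_offset % 2 == 0 && p.dest_offset % 2 == 0 &&
         p.length % 2 == 0) {
    p.element_size *= 2;
    p.src_offset /= 2;
    p.dest_offset /= 2;
    p.length /= 2;
  }
  return p;
}
```

For instance, a copy of 12 one-byte elements from offset 4 to offset 8 becomes a copy of 3 four-byte elements from offset 1 to offset 2: the same bytes, moved with wider instructions.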
534453e improved single-byte-struct copy throughput on ia32/x64. 360fd45 improved larger struct copies on all architectures. (Relevant benchmarks: https://github.com/dart-lang/sdk/blob/main/benchmarks/FfiStructCopy/dart/FfiStructCopy.dart) Further improvements for struct copies (known length, variable offsets) would entail unboxing pointers and folding more of the pointer arithmetic within structs (which is currently interleaved with pointer boxing and unboxing). Another possible improvement is for struct copies with odd length (e.g. non-wordsize multiples). Those are currently not affected by 360fd45. Possible further improvements for string operations (variable lengths) could entail loop unrolling, but it would need to be experimented with to see if modern CPUs require such logic. For example, on the 32 KB copy we hit over 20GB/s on x64, which is around the max bandwidth of (single channel) DDR4, so it's likely the x64 CPUs recognize the
5d0325d started using MemoryCopyInstr for copying data between Pointers and TypedDatas.
In dart:ffi it's rather common to have constant values for
Currently, the implementation of MemoryCopyInstr does not take advantage of constants, rather it forces these 3 parameters to be in registers:
sdk/runtime/vm/compiler/backend/il_x64.cc, lines 163 to 165 in 5a5a830
This means their Locations cannot be Constant.
We don't have a Location::WritableRegisterOrConstant and Location::RegisterLocationOrConstant(RCX) afaik. We would likely also want some more fine-grained control over the different use cases. E.g., even if we have constants we still need temp registers. (edit: we can branch on the values in MakeLocationSummary.)
Currently, the code on x64 for copying a single byte between two Pointers with 0 as start and dest offset is:
The offset instructions can be completely omitted if offsets are zero.
For small constant lengths it might be smaller to unroll the loop instead of doing rep.
cc @askeksa-google @mraleph