
[vm] Optimize MemoryCopyInstr with constant parameters #51031

Open
dcharkes opened this issue Jan 17, 2023 · 9 comments
Labels
area-vm Use area-vm for VM related issues, including code coverage, and the AOT and JIT backends. P3 A lower priority bug or feature request triaged Issue has been triaged by sub team type-performance Issue relates to performance or code size

Comments

@dcharkes
Contributor

dcharkes commented Jan 17, 2023

5d0325d started using MemoryCopyInstr for copying data between Pointers and TypedDatas. In dart:ffi it's rather common to have constant values for

  1. the length (for example the length of a struct),
  2. the source start (often 0), and
  3. the destination start (often 0).

Currently, the implementation of MemoryCopyInstr does not take advantage of constants; instead, it forces these 3 parameters to be in registers:

locs->set_in(kSrcStartPos, Location::WritableRegister());
locs->set_in(kDestStartPos, Location::WritableRegister());
locs->set_in(kLengthPos, Location::RegisterLocation(RCX));

This means their Locations cannot be Constant.

We don't have a Location::WritableRegisterOrConstant or Location::RegisterLocationOrConstant(RCX) afaik. We would likely also want some more fine-grained control over the different use cases. E.g., even if we have constants we still need temp registers. (edit: we can branch on the values in MakeLocationSummary.)

Currently, the code on x64 for copying a single byte between two Pointers with 0 as start and dest offset is:

;; out/DebugX64/dart --print-flow-graph --print-flow-graph-filter=Bytes_run --disassemble benchmarks/FfiStructCopy/dart/FfiStructCopy.dart &> log.txt
        ;; MemoryCopy(v85 T{Object}, v110 T{Object}, v99, v99, v21)
        ;; Inlined [Struct1BytesWrapper.set_nested]
0x7fc8a11c0f52    488b7607               movq rsi,[rsi+0x7]    // load address out of source Pointer/TypedData
0x7fc8a11c0f56    48d1fb                 sarq rbx,1            // smi untag constant 0
0x7fc8a11c0f59    488d341e               leaq rsi,[rsi+rbx*1]  // add constant 0 to address
0x7fc8a11c0f5d    488b7f07               movq rdi,[rdi+0x7]    // load address out of target Pointer/TypedData
0x7fc8a11c0f61    48d1fa                 sarq rdx,1            // smi untag constant zero
0x7fc8a11c0f64    488d3c17               leaq rdi,[rdi+rdx*1]  // add constant 0 to address
0x7fc8a11c0f68    48d1f9                 sarq rcx,1            // smi untag constant 1 for the length
0x7fc8a11c0f6b    f3a4                   rep movsb             // actual copy of bytes

The offset instructions can be completely omitted if the offsets are zero.

For small constant lengths it might produce smaller code to unroll the loop instead of using rep movs.

cc @askeksa-google @mraleph

@dcharkes dcharkes added the area-vm Use area-vm for VM related issues, including code coverage, and the AOT and JIT backends. label Jan 17, 2023
@dcharkes dcharkes added the type-performance Issue relates to performance or code size label Jan 17, 2023
@mraleph
Member

mraleph commented Jan 17, 2023

We should optimise it for unboxed parameters as well.

@dcharkes
Contributor Author

We should optimise it for unboxed parameters as well.

As far as I can see, we don't have a Representation that says "either be tagged (Smi) or unboxed". All instructions either take a representation as input or are specialized per instruction:

  • IntConverterInstr takes a representation as input.
  • LoadIndexedInstr takes a boolean indicating whether the index is unboxed.
  • BinaryInt32OpInstr takes unboxed ints, and BinarySmiOpInstr tagged ints.

@mraleph

  1. Adding Representation parameters to MemoryCopyInstr will make us only use it in the places where we specify it (which would be the FFI for now).
  2. Changing the Representation to always be unboxed will likely regress things. See the discussion on LoadIndexedInstr on 4081235.
  3. Adding a new "either be unboxed or tagged" Representation to RequiredInputRepresentation is likely a large refactoring. It would also require a non-trivial analysis to select the representation that creates the fewest unboxing and int conversion instructions.

@mraleph
Member

mraleph commented Jan 18, 2023

@dcharkes LoadIndexedInstr and MemoryCopyInstr are somewhat different. With LoadIndexedInstr you can attempt to fuse untagging and effective address calculation. Things can be completely fused into addressing mode on X86. On ARM there are often additional instructions required - but then you can fuse untagging into those calculations. There is no such possibility with MemoryCopyInstr - the first thing it does is (🥁)

0x7fc8a11c0f52    488b7607               movq rsi,[rsi+0x7]    // load address out of source Pointer/TypedData
0x7fc8a11c0f56    48d1fb                 sarq rbx,1            // smi untag constant 0

This indicates that this operation should just take an unboxed integer of appropriate size (e.g. UnboxedInt32 on 32-bit platforms and a fully-blown UnboxedInt64 on 64-bit platforms). No need for hybrid representations.

(There might be some fusing opportunities on ARM - which does not have something like rep mov, so there we might need to be a bit smarter).

@dcharkes
Contributor Author

dcharkes commented Jan 18, 2023

This indicates that this operation should just take an unboxed integer of appropriate size (e.g. UnboxedInt32 on 32-bit platforms and a fully-blown UnboxedInt64 on 64-bit platforms). No need for hybrid representations.

I'll give it a spin, but I kind of expect more boxing/unboxing operations; let's see.

edit: For the length, yes; for the source and destination offsets with element size 2, 4, or 8, no. Those avoid untagging by using a 0.5x scale in the offset calculation.

@dcharkes
Contributor Author

I'll give it a spin, but I kind of expect more boxing/unboxing operations; let's see.

https://dart-review.googlesource.com/c/sdk/+/279172

Oh, yes, we need to force-optimize to use unboxed representations. This causes the recognized methods that do the copy to be outlined, adding static calls. So the code becomes slower unless we start supporting the unboxed representations in unoptimized mode or support inlining force optimized functions (I need to think a bit if those are idempotent).

@mraleph
Member

mraleph commented Jan 18, 2023

So the code becomes slower unless we start supporting the unboxed representations in unoptimized mode or support inlining force optimized functions (I need to think a bit if those are idempotent).

Alternatively you could just explicitly add inlining code for this function in the inliner.cc.

@dcharkes
Contributor Author

dcharkes commented Jan 23, 2023

When the MemoryCopyInstr was introduced:

  • It was only used with single-byte element sizes.
  • The length, src_offset, and dest_offset are measured in elements (the alternative is bytes).
    • length has to be in elements (the element size determines the copy chunk size).
    • With element sizes >= 2 this means the untagging of those 3 parameters can often be avoided.
    • However, it also prevents promoting the element size to a larger size if the length is known as a constant but the src_offset or dest_offset is not.

We should consider doing one of the following:

  1. Make the src/dest offsets be in bytes and always pay the untagging cost (currently we have 0 uses with element size larger than 1)
  2. Add a bool to the instruction that specifies whether the src/dest offsets are in element size or bytes.

The potential use case here is the Structs in dart:ffi where nested structs are copied: the length is const and the element size could be promoted, but the offsets are not known (for example with indexed access). (Not a current use case, as the Pointer offset calculations currently do boxing with Pointers and unboxing again.)

@askeksa-google Do you remember the rationale of making the src/dest offsets be in element size rather than in bytes?

@dcharkes
Contributor Author

Alternatively you could just explicitly add inlining code for this function in the inliner.cc.

Done in https://dart-review.googlesource.com/c/sdk/+/279172.

Feel free to leave further suggestions on the stack of CLs.

copybara-service bot pushed a commit that referenced this issue Feb 3, 2023
TEST=runtime/vm/compiler/backend/memory_copy_test.cc

Bug: #51031
Change-Id: I6b0b2eb63f97ae9d7d3c9c80d998929f657ef482
Cq-Include-Trybots: luci.dart.try:vm-precomp-ffi-qemu-linux-release-riscv64-try,vm-precomp-ffi-qemu-linux-release-arm-try,vm-ffi-android-debug-arm64c-try,vm-ffi-android-debug-arm-try,vm-kernel-nnbd-mac-debug-arm64-try,vm-kernel-nnbd-win-debug-x64-try,vm-kernel-win-debug-x64c-try,vm-kernel-win-debug-ia32-try,vm-kernel-nnbd-linux-debug-ia32-try,vm-reload-rollback-linux-debug-x64-try
Reviewed-on: https://dart-review.googlesource.com/c/sdk/+/279387
Reviewed-by: Alexander Markov <alexmarkov@google.com>
copybara-service bot pushed a commit that referenced this issue Feb 3, 2023
When constants are passed in to the source and destination start,
no registers are needed for these. The constants are directly compiled
into the machine code. If the constant happens to be 0, no machine
code is emitted at all.

I did not measure any speed improvements. Likely the micro-code
schedulers in the CPUs already noticed the no-ops.

I have verified manually that we emit smaller machine code with
these changes on x64.

TEST=runtime/vm/compiler/backend/memory_copy_test.cc

Bug: #51031

Change-Id: I70f12c9ae299b44a8f5007ca3a8c5ee56a9aff40
Cq-Include-Trybots: luci.dart.try:vm-precomp-ffi-qemu-linux-release-riscv64-try,vm-precomp-ffi-qemu-linux-release-arm-try,vm-ffi-android-debug-arm64c-try,vm-ffi-android-debug-arm-try,vm-kernel-nnbd-mac-debug-arm64-try,vm-kernel-nnbd-win-debug-x64-try,vm-kernel-win-debug-x64c-try,vm-kernel-win-debug-ia32-try,vm-kernel-nnbd-linux-debug-ia32-try,vm-reload-rollback-linux-debug-x64-try
Reviewed-on: https://dart-review.googlesource.com/c/sdk/+/279170
Reviewed-by: Alexander Markov <alexmarkov@google.com>
copybara-service bot pushed a commit that referenced this issue Feb 3, 2023
This CL removes a Smi untagging by taking an unboxed input for the
length in the memory copy instruction, or by changing the loop
decrement to 2 instead of 1.

Unboxing required implementing the method in the inliner because
unboxed representations cannot be used in the method recognizer.

Reduces code size slightly.
Does not measurably improve speed.

TEST=runtime/vm/compiler/backend/memory_copy_test.cc

Bug: #51031
Change-Id: Ie311929af25b76c3b899ff2791bfaf4e40b1f06f
Cq-Include-Trybots: luci.dart.try:vm-precomp-ffi-qemu-linux-release-riscv64-try,vm-precomp-ffi-qemu-linux-release-arm-try,vm-ffi-android-debug-arm64c-try,vm-ffi-android-debug-arm-try,vm-kernel-nnbd-mac-debug-arm64-try,vm-kernel-nnbd-win-debug-x64-try,vm-kernel-win-debug-x64c-try,vm-kernel-win-debug-ia32-try,vm-kernel-nnbd-linux-debug-ia32-try,vm-reload-rollback-linux-debug-x64-try
Reviewed-on: https://dart-review.googlesource.com/c/sdk/+/279172
Reviewed-by: Alexander Markov <alexmarkov@google.com>
copybara-service bot pushed a commit that referenced this issue Feb 3, 2023
Removes loops with constant length, taking code size into account.

On ia32 and x64 this only removes single-iteration loops (eliminating
the branch). This speeds up single-byte copies.

On arm, arm64, and risc-v, removes loops up to 4 iterations,
shrinking code size.
No speedups were measured on these platforms.

TEST=runtime/vm/compiler/backend/memory_copy_test.cc

Bug: #51031
Change-Id: I292ebde023b3ec2c3a9ce872e0c9543ac43371b9
Cq-Include-Trybots: luci.dart.try:vm-precomp-ffi-qemu-linux-release-riscv64-try,vm-precomp-ffi-qemu-linux-release-arm-try,vm-ffi-android-debug-arm64c-try,vm-ffi-android-debug-arm-try,vm-kernel-nnbd-mac-debug-arm64-try,vm-kernel-nnbd-win-debug-x64-try,vm-kernel-win-debug-x64c-try,vm-kernel-win-debug-ia32-try,vm-kernel-nnbd-linux-debug-ia32-try,vm-reload-rollback-linux-debug-x64-try
Reviewed-on: https://dart-review.googlesource.com/c/sdk/+/279178
Reviewed-by: Aske Simon Christensen <askesc@google.com>
Reviewed-by: Alexander Markov <alexmarkov@google.com>
copybara-service bot pushed a commit that referenced this issue Feb 3, 2023
A larger element size results in larger mov instructions. This reduces
code size and increases performance.

The element size can only be increased if
* the src_offset, dest_offset, and length parameters are const,
* and they share a common power-of-two divisor.

TEST=runtime/vm/compiler/backend/memory_copy_test.cc

Bug: #51031
Change-Id: If35fb419aa118c497b15c122bdf6279266e2294a
Cq-Include-Trybots: luci.dart.try:vm-precomp-ffi-qemu-linux-release-riscv64-try,vm-precomp-ffi-qemu-linux-release-arm-try,vm-ffi-android-debug-arm64c-try,vm-ffi-android-debug-arm-try,vm-kernel-nnbd-mac-debug-arm64-try,vm-kernel-nnbd-win-debug-x64-try,vm-kernel-win-debug-x64c-try,vm-kernel-win-debug-ia32-try,vm-kernel-nnbd-linux-debug-ia32-try,vm-reload-rollback-linux-debug-x64-try
Reviewed-on: https://dart-review.googlesource.com/c/sdk/+/279506
Commit-Queue: Daco Harkes <dacoharkes@google.com>
Reviewed-by: Aske Simon Christensen <askesc@google.com>
Reviewed-by: Alexander Markov <alexmarkov@google.com>
@dcharkes
Contributor Author

dcharkes commented Feb 6, 2023

534453e improved single-byte-struct copy throughput on ia32/x64.

360fd45 improved larger struct copies on all architectures.

(Relevant benchmarks: https://github.com/dart-lang/sdk/blob/main/benchmarks/FfiStructCopy/dart/FfiStructCopy.dart)

Further improvements for struct copies (known length, variable offsets) would entail unboxing pointers and folding more of the pointer arithmetic within structs (which is currently interleaved with pointer boxing and unboxing).

Another possible improvement is for struct copies with odd lengths (i.e. lengths that are not word-size multiples). Those are currently not affected by 360fd45.

Possible further improvements for string operations (variable lengths) could entail loop unrolling, but we would need to experiment to see whether modern CPUs still benefit from such logic. For example, on the 32 KB copy we hit over 20 GB/s on x64, which is around the maximum bandwidth of (single-channel) DDR4; it's likely that x64 CPUs recognize the rep mov instructions. The ARM/RISC-V architectures might benefit from loop unrolling, but that should be measured.

@a-siva a-siva added P3 A lower priority bug or feature request triaged Issue has been triaged by sub team labels Dec 14, 2023