
[vm] Optimize MemoryCopyInstr with constant parameters #51031

Open
dcharkes opened this issue Jan 17, 2023 · 9 comments
Labels
area-vm Use area-vm for VM related issues, including code coverage, and the AOT and JIT backends. P3 A lower priority bug or feature request triaged Issue has been triaged by sub team type-performance Issue relates to performance or code size

Comments

@dcharkes
Contributor

dcharkes commented Jan 17, 2023

5d0325d started using MemoryCopyInstr for copying data between Pointers and TypedDatas. In dart:ffi it's rather common to have constant values for

  1. the length (for example the length of a struct),
  2. the source start (often 0), and
  3. the destination start (often 0).

Currently, the implementation of MemoryCopyInstr does not take advantage of constants; instead, it forces these 3 parameters to be in registers:

locs->set_in(kSrcStartPos, Location::WritableRegister());
locs->set_in(kDestStartPos, Location::WritableRegister());
locs->set_in(kLengthPos, Location::RegisterLocation(RCX));

This means their Locations cannot be Constant.

We don't have a Location::WritableRegisterOrConstant or Location::RegisterLocationOrConstant(RCX) afaik. We would likely also want some more fine-grained control over the different use cases. E.g., even if we have constants we still need temp registers. (edit: we can branch on the values in MakeLocationSummary.)

Currently, the code on x64 for copying a single byte between two Pointers with 0 as start and dest offset is:

;; out/DebugX64/dart --print-flow-graph --print-flow-graph-filter=Bytes_run --disassemble benchmarks/FfiStructCopy/dart/FfiStructCopy.dart &> log.txt
        ;; MemoryCopy(v85 T{Object}, v110 T{Object}, v99, v99, v21)
        ;; Inlined [Struct1BytesWrapper.set_nested]
0x7fc8a11c0f52    488b7607               movq rsi,[rsi+0x7]    // load address out of source Pointer/TypedData
0x7fc8a11c0f56    48d1fb                 sarq rbx,1            // smi untag constant 0
0x7fc8a11c0f59    488d341e               leaq rsi,[rsi+rbx*1]  // add constant 0 to address
0x7fc8a11c0f5d    488b7f07               movq rdi,[rdi+0x7]    // load address out of target Pointer/TypedData
0x7fc8a11c0f61    48d1fa                 sarq rdx,1            // smi untag constant zero
0x7fc8a11c0f64    488d3c17               leaq rdi,[rdi+rdx*1]  // add constant 0 to address
0x7fc8a11c0f68    48d1f9                 sarq rcx,1            // smi untag constant 1 for the length
0x7fc8a11c0f6b    f3a4                   rep movsb             // actual copy of bytes

The offset instructions can be completely omitted if the offsets are zero.

For small constant lengths it might produce smaller code to unroll the loop instead of using rep movs.

cc @askeksa-google @mraleph

@dcharkes dcharkes added the area-vm Use area-vm for VM related issues, including code coverage, and the AOT and JIT backends. label Jan 17, 2023
@dcharkes dcharkes added the type-performance Issue relates to performance or code size label Jan 17, 2023
@mraleph
Member

mraleph commented Jan 17, 2023

We should optimise it for unboxed parameters as well.

@dcharkes
Contributor Author

We should optimise it for unboxed parameters as well.

As far as I can see, we don't have a Representation that says "either be tagged (Smi) or unboxed". All instructions either take a representation as input or are specialized per instruction:

  • IntConverterInstr takes a representation as input.
  • LoadIndexedInstr takes a boolean indicating whether the index is unboxed.
  • BinaryInt32OpInstr takes unboxed ints, and BinarySmiOpInstr tagged ints.

@mraleph

  1. Adding Representation parameters to MemoryCopyInstr will make us only use it in the places where we specify it (which would be the FFI for now).
  2. Changing the Representation to always be unboxed will likely regress things. See the discussion on LoadIndexedInstr on 4081235.
  3. Adding a new "either be unboxed or tagged" Representation to RequiredInputRepresentation is likely a large refactoring. It would also require a non-trivial analysis to select the representation that creates the fewest unboxing and int conversion instructions.

@mraleph
Member

mraleph commented Jan 18, 2023

@dcharkes LoadIndexedInstr and MemoryCopyInstr are somewhat different. With LoadIndexedInstr you can attempt to fuse untagging and effective address calculation. Things can be completely fused into addressing mode on X86. On ARM there are often additional instructions required - but then you can fuse untagging into those calculations. There is no such possibility with MemoryCopyInstr - the first thing it does is (🥁)

0x7fc8a11c0f52    488b7607               movq rsi,[rsi+0x7]    // load address out of source Pointer/TypedData
0x7fc8a11c0f56    48d1fb                 sarq rbx,1            // smi untag constant 0

This indicates that this operation should just take an unboxed integer of appropriate size (e.g. UnboxedInt32 on 32-bit platforms and a fully-blown UnboxedInt64 on 64-bit platforms). No need for hybrid representations.

(There might be some fusing opportunities on ARM - which does not have something like rep mov, so there we might need to be a bit smarter).

@dcharkes
Contributor Author

dcharkes commented Jan 18, 2023

This indicates that this operation should just take an unboxed integer of appropriate size (e.g. UnboxedInt32 on 32-bit platforms and a fully-blown UnboxedInt64 on 64-bit platforms). No need for hybrid representations.

I'll give it a spin, but I kind of expect more boxing/unboxing operations; let's see.

edit: For the length, yes; for the source and destination offsets with element size 2, 4, or 8, no. Those avoid untagging by using a 0.5x scale in the offset calculation.

@dcharkes
Contributor Author

I'll give it a spin, but I kind of expect more boxing/unboxing operations; let's see.

https://dart-review.googlesource.com/c/sdk/+/279172

Oh, yes, we need to force-optimize to use unboxed representations. This causes the recognized methods that do the copy to be outlined, adding static calls. So the code becomes slower unless we start supporting the unboxed representations in unoptimized mode or support inlining force optimized functions (I need to think a bit if those are idempotent).

@mraleph
Member

mraleph commented Jan 18, 2023

So the code becomes slower unless we start supporting the unboxed representations in unoptimized mode or support inlining force optimized functions (I need to think a bit if those are idempotent).

Alternatively you could just explicitly add inlining code for this function in the inliner.cc.

@dcharkes
Contributor Author

dcharkes commented Jan 23, 2023

When the MemoryCopyInstr was introduced:

  • It was only used with single-byte element sizes.
  • The length, src_offset, and dest_offset are measured in elements (the alternative is bytes).
    • length has to be in elements (the element size determines the copy chunk size).
    • With element sizes >= 2 this means the untagging of those 3 parameters can often be avoided.
    • However, it also prevents promoting the element size to a larger size if the length is known as a constant but the src_offset or dest_offset is not.

We should consider doing one of the following:

  1. Make the src/dest offsets be in bytes and always pay the untagging cost (currently we have 0 uses with element size larger than 1)
  2. Add a bool to the instruction that specifies whether the src/dest offsets are in element size or bytes.

The potential use case here is the Structs in dart:ffi where nested structs are copied: the length is const and the element size could be promoted, but the offsets are not known (for example with indexed access). (Not a current use case, as the Pointer offset calculations currently do boxing with Pointers and unboxing again.)

@askeksa-google Do you remember the rationale of making the src/dest offsets be in element size rather than in bytes?

@dcharkes
Contributor Author

Alternatively you could just explicitly add inlining code for this function in the inliner.cc.

Done in https://dart-review.googlesource.com/c/sdk/+/279172.

Feel free to leave further suggestions on the stack of CLs.

copybara-service bot pushed a commit that referenced this issue Feb 3, 2023
TEST=runtime/vm/compiler/backend/memory_copy_test.cc

Bug: #51031
Change-Id: I6b0b2eb63f97ae9d7d3c9c80d998929f657ef482
Cq-Include-Trybots: luci.dart.try:vm-precomp-ffi-qemu-linux-release-riscv64-try,vm-precomp-ffi-qemu-linux-release-arm-try,vm-ffi-android-debug-arm64c-try,vm-ffi-android-debug-arm-try,vm-kernel-nnbd-mac-debug-arm64-try,vm-kernel-nnbd-win-debug-x64-try,vm-kernel-win-debug-x64c-try,vm-kernel-win-debug-ia32-try,vm-kernel-nnbd-linux-debug-ia32-try,vm-reload-rollback-linux-debug-x64-try
Reviewed-on: https://dart-review.googlesource.com/c/sdk/+/279387
Reviewed-by: Alexander Markov <alexmarkov@google.com>
copybara-service bot pushed a commit that referenced this issue Feb 3, 2023
When constants are passed in to the source and destination start,
no registers are needed for these. The constants are directly compiled
into the machine code. If the constant happens to be 0, no machine
code is emitted at all.

I did not measure any speed improvements. Likely the micro-code
schedulers in the CPUs already noticed the no-ops.

I have verified manually that we emit smaller machine code with
these changes on x64.

TEST=runtime/vm/compiler/backend/memory_copy_test.cc

Bug: #51031

Change-Id: I70f12c9ae299b44a8f5007ca3a8c5ee56a9aff40
Cq-Include-Trybots: luci.dart.try:vm-precomp-ffi-qemu-linux-release-riscv64-try,vm-precomp-ffi-qemu-linux-release-arm-try,vm-ffi-android-debug-arm64c-try,vm-ffi-android-debug-arm-try,vm-kernel-nnbd-mac-debug-arm64-try,vm-kernel-nnbd-win-debug-x64-try,vm-kernel-win-debug-x64c-try,vm-kernel-win-debug-ia32-try,vm-kernel-nnbd-linux-debug-ia32-try,vm-reload-rollback-linux-debug-x64-try
Reviewed-on: https://dart-review.googlesource.com/c/sdk/+/279170
Reviewed-by: Alexander Markov <alexmarkov@google.com>
copybara-service bot pushed a commit that referenced this issue Feb 3, 2023
This CL removes a Smi untagging by taking an unboxed input for the
length in the memory copy instruction, or by changing the loop
decrement to 2 instead of 1.

Unboxing required implementing the method in the inliner because
unboxed representations cannot be used in the method recognizer.

Reduces code size slightly.
Does not measurably improve speed.

TEST=runtime/vm/compiler/backend/memory_copy_test.cc

Bug: #51031
Change-Id: Ie311929af25b76c3b899ff2791bfaf4e40b1f06f
Cq-Include-Trybots: luci.dart.try:vm-precomp-ffi-qemu-linux-release-riscv64-try,vm-precomp-ffi-qemu-linux-release-arm-try,vm-ffi-android-debug-arm64c-try,vm-ffi-android-debug-arm-try,vm-kernel-nnbd-mac-debug-arm64-try,vm-kernel-nnbd-win-debug-x64-try,vm-kernel-win-debug-x64c-try,vm-kernel-win-debug-ia32-try,vm-kernel-nnbd-linux-debug-ia32-try,vm-reload-rollback-linux-debug-x64-try
Reviewed-on: https://dart-review.googlesource.com/c/sdk/+/279172
Reviewed-by: Alexander Markov <alexmarkov@google.com>
copybara-service bot pushed a commit that referenced this issue Feb 3, 2023
Removes loops with constant length, taking code size into account.

On ia32 and x64 this only removes single-iteration loops (eliminating
the branch). This speeds up single-byte copies.

On arm, arm64, and risc-v, removes loops up to 4 iterations,
shrinking code size.
No speedups were measured on these platforms.

TEST=runtime/vm/compiler/backend/memory_copy_test.cc

Bug: #51031
Change-Id: I292ebde023b3ec2c3a9ce872e0c9543ac43371b9
Cq-Include-Trybots: luci.dart.try:vm-precomp-ffi-qemu-linux-release-riscv64-try,vm-precomp-ffi-qemu-linux-release-arm-try,vm-ffi-android-debug-arm64c-try,vm-ffi-android-debug-arm-try,vm-kernel-nnbd-mac-debug-arm64-try,vm-kernel-nnbd-win-debug-x64-try,vm-kernel-win-debug-x64c-try,vm-kernel-win-debug-ia32-try,vm-kernel-nnbd-linux-debug-ia32-try,vm-reload-rollback-linux-debug-x64-try
Reviewed-on: https://dart-review.googlesource.com/c/sdk/+/279178
Reviewed-by: Aske Simon Christensen <askesc@google.com>
Reviewed-by: Alexander Markov <alexmarkov@google.com>
copybara-service bot pushed a commit that referenced this issue Feb 3, 2023
A larger element size results in larger mov instructions. This reduces
code size and increases performance.

The element size can only be increased if
* the src_offset, dest_offset, and length parameters are const,
* and they share a common power-of-two divisor.

TEST=runtime/vm/compiler/backend/memory_copy_test.cc

Bug: #51031
Change-Id: If35fb419aa118c497b15c122bdf6279266e2294a
Cq-Include-Trybots: luci.dart.try:vm-precomp-ffi-qemu-linux-release-riscv64-try,vm-precomp-ffi-qemu-linux-release-arm-try,vm-ffi-android-debug-arm64c-try,vm-ffi-android-debug-arm-try,vm-kernel-nnbd-mac-debug-arm64-try,vm-kernel-nnbd-win-debug-x64-try,vm-kernel-win-debug-x64c-try,vm-kernel-win-debug-ia32-try,vm-kernel-nnbd-linux-debug-ia32-try,vm-reload-rollback-linux-debug-x64-try
Reviewed-on: https://dart-review.googlesource.com/c/sdk/+/279506
Commit-Queue: Daco Harkes <dacoharkes@google.com>
Reviewed-by: Aske Simon Christensen <askesc@google.com>
Reviewed-by: Alexander Markov <alexmarkov@google.com>
@dcharkes
Contributor Author

dcharkes commented Feb 6, 2023

534453e improved single-byte-struct copy throughput on ia32/x64.

360fd45 improved larger struct copies on all architectures.

(Relevant benchmarks: https://github.com/dart-lang/sdk/blob/main/benchmarks/FfiStructCopy/dart/FfiStructCopy.dart)

Further improvements for struct copies (known length, variable offsets) would entail unboxing pointers and folding more of the pointer arithmetic within structs (which is currently interleaved with pointer boxing and unboxing).

Another possible improvement is for struct copies with odd lengths (i.e. lengths that are not word-size multiples). Those are currently not affected by 360fd45.

Possible further improvements for string operations (variable lengths) could entail loop unrolling, but we would need to experiment to see whether modern CPUs still benefit from such logic. For example, on the 32 KB copy we hit over 20 GB/s on x64, which is around the maximum bandwidth of (single-channel) DDR4; it's likely that x64 CPUs recognize the rep mov instructions. The ARM/RISC-V architectures might benefit from loop unrolling, but that should be measured.

@a-siva a-siva added P3 A lower priority bug or feature request triaged Issue has been triaged by sub team labels Dec 14, 2023