[vm/compiler] Further optimize setRange on TypedData receivers.
When setRange is called on a TypedData receiver and the source is also
a TypedData object with the same element size and clamping is not
required, the VM implementation now calls _boundsCheckAndMemcpyN for
element size N. The generated IL for these methods performs the copy
using the MemoryCopy instruction (mostly, see the note below).
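
As a rough sketch, the fast path's applicability check amounts to the
following (illustrative C++ only; IsClamped and IsUint8 stand in for the
predicates defined in runtime/lib/typed_data.cc in the diff below):

  #include <cstdint>

  bool IsClamped(intptr_t cid);  // As in runtime/lib/typed_data.cc.
  bool IsUint8(intptr_t cid);    // As in runtime/lib/typed_data.cc.

  // Clamping can be skipped only when the destination does not clamp, or
  // when the source is a Uint8 variant whose values never need clamping.
  bool CanUseMemoryCopyFastPath(intptr_t dst_cid, intptr_t src_cid,
                                intptr_t dst_element_size,
                                intptr_t src_element_size) {
    return dst_element_size == src_element_size &&
           !(IsClamped(dst_cid) && !IsUint8(src_cid));
  }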

Since the two TypedData objects might share the same underlying buffer,
the CL adds a can_overlap flag to the MemoryCopy instruction. When
can_overlap is set, the generated code checks for overlapping regions
and performs the copy backwards instead of forwards when needed, so that
elements of the source region are read before they are overwritten.
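
A minimal sketch of the direction choice, assuming a raw byte copy and
illustrative names rather than the VM's generated code:

  #include <cstddef>
  #include <cstdint>

  void CopyBytes(uint8_t* dst, const uint8_t* src, size_t length,
                 bool can_overlap) {
    if (can_overlap && dst > src && dst < src + length) {
      // Destination overlaps the tail of the source: copy backwards so
      // each source byte is read before it is overwritten.
      for (size_t i = length; i > 0; --i) dst[i - 1] = src[i - 1];
    } else {
      // No overlap, or dst precedes src: a forward copy is correct.
      for (size_t i = 0; i < length; ++i) dst[i] = src[i];
    }
  }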

The existing uses of the MemoryCopy instruction are adjusted as
follows:
* The IL generated for copyRangeFromUint8ListToOneByteString
  passes false for can_overlap, as all uses currently ensure that
  the OneByteString is non-external and thus cannot overlap.
* The IL generated for _memCopy, used by the FFI library, passes
  true for can_overlap, as there is no guarantee that the regions
  pointed at by the Pointer objects do not overlap.

The MemoryCopy instruction has also been adjusted so that all numeric
inputs (the two start offsets and the length), not just the length, may
be either boxed or unboxed. This exposed an issue in the inliner: unboxed
constants in the callee graph were replaced with boxed constants when
inlining into the caller graph, which surfaced because withList calls
setRange with a constant starting offset of 0. Now the representation of
constants in the callee graph is preserved when inlining the callee graph
into the caller graph.
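
Conceptually, the inliner fix looks like the following sketch; the helper
name and call site are illustrative, assuming a FlowGraph::GetConstant
that can materialize a constant in a requested representation:

  // Map a callee constant into the caller graph while keeping its
  // (possibly unboxed) representation instead of defaulting to a boxed
  // (tagged) constant.
  Definition* MapConstantToCaller(ConstantInstr* callee_constant,
                                  FlowGraph* caller_graph) {
    return caller_graph->GetConstant(callee_constant->value(),
                                     callee_constant->representation());
  }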

Fixes #51237 by using TMP
and TMP2 for the LDP/STP instructions in the 16-byte element size case,
so no temporaries need to be allocated for the instruction.

On ARM, when the memory copy loop is not unrolled, TMP and a single
additional temporary are used for the LDM/STM instructions in the 8-byte
and 16-byte element cases, with the latter simply using two LDM/STM
instructions within the loop, a different approach from the one described
in #51229.

Note: Once the number of elements being copied reaches a certain
threshold (1048576 on X86, 256 otherwise), _boundsCheckAndMemcpyN
instead calls _nativeSetRange, which is a native call that uses memmove
from the standard C library for non-clamped inputs. It does this
because the code currently emitted for MemoryCopy performs poorly
compared to the more optimized memmove implementation when copying
larger regions of memory.
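
A sketch of that size-based dispatch, with illustrative names and the
thresholds from the note above:

  #include <cstdint>
  #include <cstring>

  #if defined(TARGET_ARCH_IS_X86)  // Illustrative guard, not the VM's.
  constexpr intptr_t kMemmoveThresholdInElements = 1048576;
  #else
  constexpr intptr_t kMemmoveThresholdInElements = 256;
  #endif

  void CopyRange(void* dst, const void* src, intptr_t num_elements,
                 intptr_t element_size_in_bytes) {
    const size_t length = num_elements * element_size_in_bytes;
    if (num_elements >= kMemmoveThresholdInElements) {
      // Large copies: a native memmove call beats the generated code.
      memmove(dst, src, length);
    } else {
      // Small copies: stands in for the inline, overlap-aware code
      // emitted for the MemoryCopy instruction.
      memmove(dst, src, length);
    }
  }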

Notable benchmark changes for dart-aot:
* X64
  * TypedDataDuplicate.*.fromList improvement of ~13%-~250%
  * Utf8Encode.*.10 improvement of ~50%-~75%
  * MapCopy.Map.*.of.Map.* improvement of ~13%-~65%
  * MemoryCopy.*.setRange.* improvement of ~13%-~500%
* ARM7
  * Utf8Encode.*.10 improvement of ~35%-~70%
  * MapCopy.Map.*.of.Map.* improvement of ~6%-~75%
  * MemoryCopy.*.setRange.{8,64} improvement of ~22%-~500%
    * Improvement of ~100%-~200% for MemoryCopy.512.setRange.*.Double
    * Regression of ~40% for MemoryCopy.512.setRange.*.Uint8
    * Regression of ~85% for MemoryCopy.4096.setRange.*.Uint8
* ARM8
  * Utf8Encode.*.10 improvement of ~35%-~70%
  * MapCopy.Map.*.of.Map.* improvement of ~7%-~75%
  * MemoryCopy.*.setRange.{8,64} improvement of ~22%-~500%
    * Improvement of ~75%-~160% for MemoryCopy.512.setRange.*.Double
    * Regression of ~40% for MemoryCopy.512.setRange.*.Uint8
    * Regression of ~85% for MemoryCopy.4096.setRange.*.Uint8

TEST=vm/cc/IRTest_Memory, co19{,_2}/LibTest/typed_data,
     lib{,_2}/typed_data, corelib{,_2}/list_test

Issue: #42072
Issue: b/294114694
Issue: b/259315681

Change-Id: Ic75521c5fe10b952b5b9ce5f2020c7e3f03672a9
Cq-Include-Trybots: luci.dart.try:vm-aot-linux-debug-simarm_x64-try,vm-aot-linux-debug-simriscv64-try,vm-aot-linux-debug-x64-try,vm-aot-linux-debug-x64c-try,vm-kernel-linux-debug-x64-try,vm-kernel-precomp-linux-debug-x64-try,vm-linux-debug-ia32-try,vm-linux-debug-simriscv64-try,vm-linux-debug-x64-try,vm-linux-debug-x64c-try,vm-mac-debug-arm64-try,vm-mac-debug-x64-try,vm-aot-linux-release-simarm64-try,vm-aot-linux-release-simarm_x64-try,vm-aot-linux-release-x64-try,vm-aot-mac-release-arm64-try,vm-aot-mac-release-x64-try,vm-ffi-qemu-linux-release-riscv64-try,vm-ffi-qemu-linux-release-arm-try,vm-aot-msan-linux-release-x64-try,vm-msan-linux-release-x64-try,vm-aot-tsan-linux-release-x64-try,vm-tsan-linux-release-x64-try,vm-linux-release-ia32-try,vm-linux-release-simarm-try,vm-linux-release-simarm64-try,vm-linux-release-x64-try,vm-mac-release-arm64-try,vm-mac-release-x64-try,vm-kernel-precomp-linux-release-x64-try,vm-aot-android-release-arm64c-try,vm-ffi-android-debug-arm64c-try
Reviewed-on: https://dart-review.googlesource.com/c/sdk/+/319521
Reviewed-by: Daco Harkes <dacoharkes@google.com>
Reviewed-by: Alexander Markov <alexmarkov@google.com>
Commit-Queue: Tess Strickland <sstrickl@google.com>
sstrickl authored and Commit Queue committed Sep 4, 2023
1 parent b745fa8 commit c93f924
Showing 29 changed files with 1,768 additions and 983 deletions.
145 changes: 74 additions & 71 deletions runtime/lib/typed_data.cc
@@ -66,90 +66,93 @@ DEFINE_NATIVE_ENTRY(TypedDataView_typedData, 0, 1) {
return TypedDataView::Cast(instance).typed_data();
}

static BoolPtr CopyData(const TypedDataBase& dst_array,
const TypedDataBase& src_array,
const Smi& dst_start,
const Smi& src_start,
const Smi& length,
bool clamped) {
const intptr_t dst_offset_in_bytes = dst_start.Value();
const intptr_t src_offset_in_bytes = src_start.Value();
const intptr_t length_in_bytes = length.Value();
ASSERT(Utils::RangeCheck(src_offset_in_bytes, length_in_bytes,
src_array.LengthInBytes()));
ASSERT(Utils::RangeCheck(dst_offset_in_bytes, length_in_bytes,
dst_array.LengthInBytes()));
if (length_in_bytes > 0) {
NoSafepointScope no_safepoint;
if (clamped) {
uint8_t* dst_data =
reinterpret_cast<uint8_t*>(dst_array.DataAddr(dst_offset_in_bytes));
int8_t* src_data =
reinterpret_cast<int8_t*>(src_array.DataAddr(src_offset_in_bytes));
for (intptr_t ix = 0; ix < length_in_bytes; ix++) {
int8_t v = *src_data;
if (v < 0) v = 0;
*dst_data = v;
src_data++;
dst_data++;
}
} else {
memmove(dst_array.DataAddr(dst_offset_in_bytes),
src_array.DataAddr(src_offset_in_bytes), length_in_bytes);
}
}
return Bool::True().ptr();
}

static bool IsClamped(intptr_t cid) {
switch (cid) {
case kTypedDataUint8ClampedArrayCid:
case kExternalTypedDataUint8ClampedArrayCid:
case kTypedDataUint8ClampedArrayViewCid:
case kUnmodifiableTypedDataUint8ClampedArrayViewCid:
return true;
default:
return false;
}
COMPILE_ASSERT((kTypedDataUint8ClampedArrayCid + 1 ==
kTypedDataUint8ClampedArrayViewCid) &&
(kTypedDataUint8ClampedArrayCid + 2 ==
kExternalTypedDataUint8ClampedArrayCid) &&
(kTypedDataUint8ClampedArrayCid + 3 ==
kUnmodifiableTypedDataUint8ClampedArrayViewCid));
return cid >= kTypedDataUint8ClampedArrayCid &&
cid <= kUnmodifiableTypedDataUint8ClampedArrayViewCid;
}

static bool IsUint8(intptr_t cid) {
switch (cid) {
case kTypedDataUint8ClampedArrayCid:
case kExternalTypedDataUint8ClampedArrayCid:
case kTypedDataUint8ClampedArrayViewCid:
case kUnmodifiableTypedDataUint8ClampedArrayViewCid:
case kTypedDataUint8ArrayCid:
case kExternalTypedDataUint8ArrayCid:
case kTypedDataUint8ArrayViewCid:
case kUnmodifiableTypedDataUint8ArrayViewCid:
return true;
default:
return false;
}
COMPILE_ASSERT(
(kTypedDataUint8ArrayCid + 1 == kTypedDataUint8ArrayViewCid) &&
(kTypedDataUint8ArrayCid + 2 == kExternalTypedDataUint8ArrayCid) &&
(kTypedDataUint8ArrayCid + 3 ==
kUnmodifiableTypedDataUint8ArrayViewCid) &&
(kTypedDataUint8ArrayCid + 4 == kTypedDataUint8ClampedArrayCid));
return cid >= kTypedDataUint8ArrayCid &&
cid <= kUnmodifiableTypedDataUint8ClampedArrayViewCid;
}
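
Both predicates above now rely on the relevant class ids being contiguous,
which the COMPILE_ASSERTs enforce, so membership reduces to a single range
check. The same pattern in isolation, with made-up ids:

  // Contiguous-id range-check pattern; the enum values are hypothetical.
  enum HypotheticalCid { kFirst = 10, kSecond = 11, kThird = 12, kLast = 13 };

  static_assert(kFirst + 1 == kSecond && kFirst + 2 == kThird &&
                    kFirst + 3 == kLast,
                "ids must stay contiguous for the range check to hold");

  bool IsInHypotheticalRange(int cid) {
    return cid >= kFirst && cid <= kLast;
  }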

DEFINE_NATIVE_ENTRY(TypedDataBase_setRange, 0, 7) {
DEFINE_NATIVE_ENTRY(TypedDataBase_setRange, 0, 5) {
const TypedDataBase& dst =
TypedDataBase::CheckedHandle(zone, arguments->NativeArgAt(0));
const Smi& dst_start = Smi::CheckedHandle(zone, arguments->NativeArgAt(1));
const Smi& length = Smi::CheckedHandle(zone, arguments->NativeArgAt(2));
const Smi& dst_start_smi =
Smi::CheckedHandle(zone, arguments->NativeArgAt(1));
const Smi& dst_end_smi = Smi::CheckedHandle(zone, arguments->NativeArgAt(2));
const TypedDataBase& src =
TypedDataBase::CheckedHandle(zone, arguments->NativeArgAt(3));
const Smi& src_start = Smi::CheckedHandle(zone, arguments->NativeArgAt(4));
const Smi& to_cid_smi = Smi::CheckedHandle(zone, arguments->NativeArgAt(5));
const Smi& from_cid_smi = Smi::CheckedHandle(zone, arguments->NativeArgAt(6));
const Smi& src_start_smi =
Smi::CheckedHandle(zone, arguments->NativeArgAt(4));

if (length.Value() < 0) {
const String& error = String::Handle(String::NewFormatted(
"length (%" Pd ") must be non-negative", length.Value()));
Exceptions::ThrowArgumentError(error);
const intptr_t element_size_in_bytes = dst.ElementSizeInBytes();
ASSERT_EQUAL(src.ElementSizeInBytes(), element_size_in_bytes);

const intptr_t dst_start_in_bytes =
dst_start_smi.Value() * element_size_in_bytes;
const intptr_t dst_end_in_bytes = dst_end_smi.Value() * element_size_in_bytes;
const intptr_t src_start_in_bytes =
src_start_smi.Value() * element_size_in_bytes;

const intptr_t length_in_bytes = dst_end_in_bytes - dst_start_in_bytes;

if (!IsClamped(dst.ptr()->GetClassId()) || IsUint8(src.ptr()->GetClassId())) {
// We've already performed range checking in _boundsCheckAndMemcpyN prior
// to the call to _nativeSetRange, so just perform the memmove.
//
// TODO(dartbug.com/42072): We do this when the copy length gets large
// enough that a native call to invoke memmove is faster than the generated
// code from MemoryCopy. Replace the static call to _nativeSetRange with
// a CCall() to a memmove leaf runtime entry and remove the possibility of
// calling _nativeSetRange except in the clamping case.
NoSafepointScope no_safepoint;
memmove(dst.DataAddr(dst_start_in_bytes), src.DataAddr(src_start_in_bytes),
length_in_bytes);
return Object::null();
}

// This is called on the fast path prior to bounds checking, so perform
// the bounds check even if the length is 0.
const intptr_t dst_length_in_bytes = dst.LengthInBytes();
RangeCheck(dst_start_in_bytes, length_in_bytes, dst_length_in_bytes,
element_size_in_bytes);

const intptr_t src_length_in_bytes = src.LengthInBytes();
RangeCheck(src_start_in_bytes, length_in_bytes, src_length_in_bytes,
element_size_in_bytes);

ASSERT_EQUAL(element_size_in_bytes, 1);

if (length_in_bytes > 0) {
NoSafepointScope no_safepoint;
uint8_t* dst_data =
reinterpret_cast<uint8_t*>(dst.DataAddr(dst_start_in_bytes));
int8_t* src_data =
reinterpret_cast<int8_t*>(src.DataAddr(src_start_in_bytes));
for (intptr_t ix = 0; ix < length_in_bytes; ix++) {
int8_t v = *src_data;
if (v < 0) v = 0;
*dst_data = v;
src_data++;
dst_data++;
}
}
const intptr_t to_cid = to_cid_smi.Value();
const intptr_t from_cid = from_cid_smi.Value();

const bool needs_clamping = IsClamped(to_cid) && !IsUint8(from_cid);
return CopyData(dst, src, dst_start, src_start, length, needs_clamping);
return Object::null();
}

// Native methods for typed data allocation are recognized and implemented
2 changes: 1 addition & 1 deletion runtime/vm/bootstrap_natives.h
@@ -173,7 +173,7 @@ namespace dart {
V(TypedData_Int32x4Array_new, 2) \
V(TypedData_Float64x2Array_new, 2) \
V(TypedDataBase_length, 1) \
V(TypedDataBase_setRange, 7) \
V(TypedDataBase_setRange, 5) \
V(TypedData_GetInt8, 2) \
V(TypedData_SetInt8, 3) \
V(TypedData_GetUint8, 2) \
17 changes: 17 additions & 0 deletions runtime/vm/compiler/assembler/assembler_base.h
@@ -629,6 +629,23 @@ class AssemblerBase : public StackResource {

virtual void SmiTag(Register r) = 0;

// If Smis are compressed and the Smi value in dst is non-negative, ensures
// the upper bits are cleared. If Smis are not compressed, is a no-op.
//
// Since this operation only affects the unused upper bits when Smis are
// compressed, it can be used on registers not allocated as writable.
//
// The behavior on the upper bits of signed compressed Smis is undefined.
#if defined(DART_COMPRESSED_POINTERS)
virtual void ExtendNonNegativeSmi(Register dst) {
// Default to sign extension and allow architecture-specific assemblers
// where an alternative like zero-extension is preferred to override this.
ExtendValue(dst, dst, kObjectBytes);
}
#else
void ExtendNonNegativeSmi(Register dst) {}
#endif

// Extends a value of size sz in src to a value of size kWordBytes in dst.
// That is, bits in the source register that are not part of the sz-sized
// value are ignored, and if sz is signed, then the value is sign extended.
50 changes: 10 additions & 40 deletions runtime/vm/compiler/assembler/assembler_ia32.cc
@@ -1776,6 +1776,16 @@ void Assembler::cmpxchgl(const Address& address, Register reg) {
EmitOperand(reg, address);
}

void Assembler::cld() {
AssemblerBuffer::EnsureCapacity ensured(&buffer_);
EmitUint8(0xFC);
}

void Assembler::std() {
AssemblerBuffer::EnsureCapacity ensured(&buffer_);
EmitUint8(0xFD);
}

void Assembler::cpuid() {
AssemblerBuffer::EnsureCapacity ensured(&buffer_);
EmitUint8(0x0F);
@@ -3126,46 +3136,6 @@ Address Assembler::ElementAddressForIntIndex(bool is_external,
}
}

static ScaleFactor ToScaleFactor(intptr_t index_scale, bool index_unboxed) {
if (index_unboxed) {
switch (index_scale) {
case 1:
return TIMES_1;
case 2:
return TIMES_2;
case 4:
return TIMES_4;
case 8:
return TIMES_8;
case 16:
return TIMES_16;
default:
UNREACHABLE();
return TIMES_1;
}
} else {
// Note that index is expected smi-tagged, (i.e, times 2) for all arrays
// with index scale factor > 1. E.g., for Uint8Array and OneByteString the
// index is expected to be untagged before accessing.
ASSERT(kSmiTagShift == 1);
switch (index_scale) {
case 1:
return TIMES_1;
case 2:
return TIMES_1;
case 4:
return TIMES_2;
case 8:
return TIMES_4;
case 16:
return TIMES_8;
default:
UNREACHABLE();
return TIMES_1;
}
}
}

Address Assembler::ElementAddressForRegIndex(bool is_external,
intptr_t cid,
intptr_t index_scale,
3 changes: 3 additions & 0 deletions runtime/vm/compiler/assembler/assembler_ia32.h
@@ -572,6 +572,9 @@ class Assembler : public AssemblerBase {
void lock();
void cmpxchgl(const Address& address, Register reg);

void cld();
void std();

void cpuid();

/*
40 changes: 0 additions & 40 deletions runtime/vm/compiler/assembler/assembler_x64.cc
@@ -2683,46 +2683,6 @@ Address Assembler::ElementAddressForIntIndex(bool is_external,
}
}

static ScaleFactor ToScaleFactor(intptr_t index_scale, bool index_unboxed) {
if (index_unboxed) {
switch (index_scale) {
case 1:
return TIMES_1;
case 2:
return TIMES_2;
case 4:
return TIMES_4;
case 8:
return TIMES_8;
case 16:
return TIMES_16;
default:
UNREACHABLE();
return TIMES_1;
}
} else {
// Note that index is expected smi-tagged, (i.e, times 2) for all arrays
// with index scale factor > 1. E.g., for Uint8Array and OneByteString the
// index is expected to be untagged before accessing.
ASSERT(kSmiTagShift == 1);
switch (index_scale) {
case 1:
return TIMES_1;
case 2:
return TIMES_1;
case 4:
return TIMES_2;
case 8:
return TIMES_4;
case 16:
return TIMES_8;
default:
UNREACHABLE();
return TIMES_1;
}
}
}

Address Assembler::ElementAddressForRegIndex(bool is_external,
intptr_t cid,
intptr_t index_scale,
8 changes: 8 additions & 0 deletions runtime/vm/compiler/assembler/assembler_x64.h
@@ -1024,6 +1024,14 @@ class Assembler : public AssemblerBase {
Register scratch,
bool can_be_null = false) override;

#if defined(DART_COMPRESSED_POINTERS)
void ExtendNonNegativeSmi(Register dst) override {
// Zero-extends and is a smaller instruction to output than sign
// extension (movsxd).
orl(dst, dst);
}
#endif

// CheckClassIs fused with optimistic SmiUntag.
// Value in the register object is untagged optimistically.
void SmiUntagOrCheckClass(Register object, intptr_t class_id, Label* smi);