
ffi: working with struct fields is slow #43142

Closed
mraleph opened this issue Aug 21, 2020 · 6 comments
Labels: area-vm, library-ffi

Comments


mraleph commented Aug 21, 2020

Given code like this:

import 'dart:ffi' as ffi;
import 'package:ffi/ffi.dart'; // provides allocate<T>()

class _Counter extends ffi.Struct {
  @ffi.Int64()
  int counter;
}

class Counter {
  final ffi.Pointer<_Counter> _impl = allocate<_Counter>();
  // ...
  void increment(int value) {
    _impl.ref.counter += value;
  }
}

void main() {
  final c = Counter();
  for (int i = 0; i < 10000; i++) c.increment(0);
}

I expect Counter.increment to translate into a small number of direct memory operations on the value of the _impl pointer.

However, instead I see a very large graph with three calls, one of which (_loadStruct), if I am not mistaken, bottoms out in a runtime call doing really complex stuff; at least I can't find IL optimization rules that would specialise _loadStruct for known arguments.
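For comparison, here is a hypothetical C sketch of the machine-level work the field increment should boil down to: one load, one add, and one store through the pointer (names are illustrative, not from the benchmark source):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sketch: what Counter.increment should lower to,
 * operating directly on the memory behind the _impl pointer. */
static void increment(int64_t* impl, int64_t value) {
  *impl += value;  /* load, add, store */
}
```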

*** BEGIN CFG
After AllocateRegisters
==== file:///usr/local/google/home/vegorov/src/dartvm_api_vs_ffi/impl_ffi.dart_Counter_rawIncrement (RegularFunction)
  0: B0[graph]:0 {
      v0 <- Constant(#null) T{Null?}
      v1 <- Constant(#<optimized out>) T{_OneByteString}
      v4 <- Constant(#TypeArguments: (H3095e7b3) [Type: _Counter@23328508*]) T{TypeArguments}
      v25 <- Constant(#0) [0, 0] T{_Smi}
      v26 <- Constant(#_ImmutableList len:5) T{_ImmutableList}
      v48 <- Constant(#_ImmutableList len:3) T{_ImmutableList}
      v49 <- Constant(#3) [3, 3] T{_Smi}
}
  2: B1[function entry]:2 {
      v2 <- Parameter(0) T{Counter}
      v3 <- Parameter(1) T{int?}
}
  4:     CheckStackOverflow:8(stack=0, loop=0)
  5:     ParallelMove rax <- S+3
  6:     v5 <- LoadField(v2 . _impl@23328508 {final}) T{Pointer}
  8:     PushArgument(v4)
 10:     PushArgument(v5)
 12:     PushArgument(v25)
 14:     v23 <- StaticCall:26( _loadStruct@8050071<1> v4, v5, v25) T{*?}
 15:     ParallelMove rax <- rax
 16:     ParallelMove S-4 <- rax
 16:     CheckClass:16(v23 T{*?} Cids[1: _Counter@23328508 etc.  cid 1139])
 18:     v32 <- LoadField(v23 T{_Counter} . _addressOf@8050071 {final}) T{Pointer}
 19:     ParallelMove S-3 <- rcx
 20:     PushArgument(v32)
 22:     PushArgument(v25)
 24:     v39 <- StaticCall:16( _loadInt64@8050071<0> v32, v25, recognized_kind = FfiLoadInt64) T{int?}
 25:     ParallelMove rax <- rax
 26:     CheckSmi:18(v39 T{int?})
 27:     ParallelMove rcx <- S+2
 28:     CheckSmi:18(v3)
 30:     ParallelMove rdx <- rax
 30:     v11 <- BinarySmiOp:18(+, v39 T{_Smi}, v3 T{_Smi}) [-4611686018427387904, 4611686018427387903] T{_Smi}
 31:     ParallelMove S-5 <- rdx
 32:     PushArgument(v32)
 34:     PushArgument(v25)
 36:     PushArgument(v11)
 38:     v65 <- StaticCall:16( _storeInt64@8050071<0> v32, v25, v11, recognized_kind = FfiStoreInt64) T{void?}
 39:     ParallelMove rax <- C
 40:     Return:24(v0)
*** END CFG

This ends up being around 50x slower than calling a C function through FFI to do the same increment.
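For reference, the C side of that comparison would be along these lines (a hypothetical counterpart; the actual benchmark source is not shown here):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical C counterpart of _Counter: a struct with a single
 * int64 field and an increment function invoked through dart:ffi. */
typedef struct {
  int64_t counter;
} Counter;

void counter_increment(Counter* c, int64_t value) {
  c->counter += value;
}
```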

/cc @mkustermann

mraleph added the area-vm and library-ffi labels on Aug 21, 2020

mraleph commented Aug 21, 2020

Similarly, if I use the code below I expect 0 calls to be left in the graph:

  void rawIncrement(int value) {
    _impl.cast<Int64>().value += value;
  }

However, I get 4 calls:

==== file:///usr/local/google/home/vegorov/src/dartvm_api_vs_ffi/impl_ffi.dart_Counter_rawrawIncrement (RegularFunction)
  0: B0[graph]:0 {
      v0 <- Constant(#null) T{Null?}
      v1 <- Constant(#<optimized out>) T{_OneByteString}
      v4 <- Constant(#TypeArguments: (H146ff0f0) [Type: Int64*]) T{TypeArguments}
      v32 <- Constant(#0) [0, 0] T{_Smi}
      v33 <- Constant(#TypeArguments: (H3489f82c) [Type: NativeType]) T{TypeArguments}
      v34 <- Constant(#_ImmutableList len:5) T{_ImmutableList}
      v64 <- Constant(#1) [1, 1] T{_Smi}
}
  2: B10[function entry]:2 {
      v2 <- Parameter(0) T{Counter}
      v3 <- Parameter(1) T{int?}
}
  4:     CheckStackOverflow:8(stack=0, loop=0)
  5:     ParallelMove rax <- S+3
  6:     v5 <- LoadField(v2 . _impl@23328508 {final}) T{Pointer}
  8:     PushArgument(v5)
 10:     v27 <- StaticCall:40( get:address<0> v5, recognized_kind = FfiGetAddress) T{int?}
 11:     ParallelMove rax <- rax
 12:     PushArgument(v4)
 14:     PushArgument(v27 T{int?})
 16:     v62 <- StaticCall:12( _fromAddress@8050071<1> v4, v27 T{int?}, recognized_kind = FfiFromAddress) T{*?}
 17:     ParallelMove rax <- rax
 18:     ParallelMove S-3 <- rax
 18:     PushArgument(v62 T{*?})
 20:     PushArgument(v32)
 22:     v41 <- StaticCall:12( _loadInt64@8050071<0> v62 T{*?}, v32, recognized_kind = FfiLoadInt64) T{int?}
 23:     ParallelMove rax <- rax
 24:     CheckSmi:18(v41 T{int?})
 25:     ParallelMove rcx <- S+2
 26:     CheckSmi:18(v3)
 28:     ParallelMove rdx <- rax
 28:     v11 <- BinarySmiOp:18(+, v41 T{_Smi}, v3 T{_Smi}) [-4611686018427387904, 4611686018427387903] T{_Smi}
 30:     PushArgument(v62 T{*?})
 32:     PushArgument(v32)
 34:     PushArgument(v11)
 36:     v52 <- StaticCall:12( _storeInt64@8050071<0> v62 T{*?}, v32, v11, recognized_kind = FfiStoreInt64) T{void?}
 37:     ParallelMove rax <- C
 38:     Return:24(v0)
*** END CFG

Though I think everything here is intrinsified, so overall performance is not that bad. However, it can probably be made N times better.


lfkdsk commented Aug 23, 2020

@mraleph Perhaps we could add a cache in front of the loadClass and lookUpFunction calls in ffi.cc (like we did in interpreter.cc), which might remove some of the performance overhead of class and method lookups.


mraleph commented Aug 24, 2020

@lfkdsk I think we should treat the code in ffi.cc as a slow path. Instead we should provide a specialisation for _loadStruct (e.g. in the CallSpecializer, or as a canonicalisation or inlining rule) which takes the type argument passed to _loadStruct and resolves everything needed at compile time.

I remember now that _storeX/_loadX are not inlined in the JIT because they are force-optimized (there were some questions about deoptimization targets to be addressed). For the second variant the situation in AOT is a bit better, but calls are still there; e.g. cast is not inlined:

$ ~/src/dart/sdk/pkg/vm/tool/precompiler2 --print-flow-graph-optimized --print-flow-graph-filter=RawRaw_run bench.dart bench.snap
*** BEGIN CFG
After AllocateRegisters
==== file:///usr/local/google/home/vegorov/src/dartvm_api_vs_ffi/bench.dart_CounterViaFfiRawRaw_run (RegularFunction)
  0: B0[graph]:0 {
      v0 <- Constant(#null) T{Null?}
      v1 <- Constant(#<optimized out>) T{_OneByteString}
      v5 <- Constant(#0) [0, 0] T{_Smi}
      v8 <- Constant(#100) [100, 100] T{_Smi}
      v11 <- Constant(#true) T{bool}
      v14 <- Constant(#99) [99, 99] T{_Smi}
      v15 <- Constant(#50) [50, 50] T{_Smi}
      v20 <- Constant(#error) T{_OneByteString}
      v21 <- Constant(#1) [1, 1] T{_Smi}
      v47 <- Constant(#TypeArguments: (H146ff0f0) [Type: Int64*]) T{TypeArguments}
      v114 <- Constant(#4950) [4950, 4950] T{_Smi}
}
  2: B1[function entry]:2 {
      v2 <- Parameter(0) T{CounterViaFfiRawRaw}
}
  4:     CheckStackOverflow:8(stack=0, loop=0)
  6:     v3 <- AllocateObject(Counter) T{Counter}
  7:     ParallelMove rax <- rax
  8:     ParallelMove S-1 <- rax
  8:     PushArgument(v3)
 10:     StaticCall:12( Counter.<0> v3)
 12:     v137 <- UnboxedConstant:34(#100) [100, 100] T{_Smi}
 13:     ParallelMove rax <- S-1
 14:     v35 <- LoadField:34(v3 . _impl@18328508 {final}) T{Pointer?}
 15:     ParallelMove S-3 <- rcx
 16:     CheckNull:34(v35, NoSuchMethodError) T{Pointer}
 18:     v87 <- UnboxedConstant:34(#0) [0, 0] T{_Smi}
 20:     v139 <- UnboxedConstant:34(#1) [1, 1] T{_Smi}
 22:     v149 <- UnboxedConstant(#0) [0, 0] T{_Smi}
 24:     ParallelMove rdx <- C goto:34 B5
 26: B5[join]:32 pred(B1, B3) {
      v6 <- phi(v149, v22 T{int}) alive [-9223372036854775808, 9223372036854775807] int64 T{int}
}
 27:     ParallelMove S-2 <- rdx
 28:     CheckStackOverflow:38(stack=0, loop=1)
 30:     Branch if RelationalOp(<, v6 T{int}, v137) T{bool} goto (3, 4)
 32: B3[target]:24
 34:     PushArgument(v47)
 36:     PushArgument(v35 T{Pointer})
 38:     v37 <- StaticCall:16( cast<1> v47, v35 T{Pointer}, using unchecked entrypoint, result_type = T{Pointer?}) T{Pointer?}
 39:     ParallelMove rax <- rax
 40:     t1 <- CheckNull:12(v37 T{Pointer?}, NoSuchMethodError) T{Pointer}
 42:     v79 <- LoadUntagged(v37 T{Pointer}, 8) T{*?}
 44:     v83 <- LoadIndexed(v79, v87) [-9223372036854775808, 9223372036854775807] T{int}
 45:     ParallelMove rcx <- S-2
 46:     ParallelMove rdx <- rdx
 46:     v41 <- BinaryInt64Op(+ [tr], v83 T{int}, v6 T{int}) [-9223372036854775808, 9223372036854775807] T{int}
 48:     v106 <- LoadUntagged(v37 T{Pointer}, 8) T{*?}
 50:     StoreIndexed(v106, v87, v41 T{int}, NoStoreBarrier)
 52:     ParallelMove rcx <- rcx
 52:     v22 <- BinaryInt64Op(+ [tr], v6 T{int}, v139) [-9223372036854775808, 9223372036854775807] T{int}
 54:     ParallelMove rdx <- rcx, rax <- S-1, rcx <- S-3 goto:36 B5
 56: B4[target]:26 ParallelMove rax <- C
 58:     PushArgument(v3)
 60:     v12 <- StaticCall:40( get:rawrawCounter<0> v3, result_type = T{int??}) T{int?}
 61:     ParallelMove rcx <- rax, rax <- C
 62:     Branch if CheckedSmiComparison:44(!=, v12, v114) T{bool} goto (6, 7)
 64: B6[target]:52
 65:     ParallelMove rax <- C
 66:     Throw:56(v20)
 68: B7[target]:58
 69:     ParallelMove rax <- C
 70:     Return:62(v0)
*** END CFG


mraleph commented Aug 24, 2020

If I force-inline Pointer.cast via @pragma('vm:prefer-inline'), then I get the following code:

*** BEGIN CFG
After AllocateRegisters
==== file:///usr/local/google/home/vegorov/src/dartvm_api_vs_ffi/bench.dart_CounterViaFfiRawRaw_run (RegularFunction)
  0: B0[graph]:0 {
      v0 <- Constant(#null) T{Null?}
      v1 <- Constant(#<optimized out>) T{_OneByteString}
      v5 <- Constant(#0) [0, 0] T{_Smi}
      v8 <- Constant(#100) [100, 100] T{_Smi}
      v11 <- Constant(#true) T{bool}
      v14 <- Constant(#99) [99, 99] T{_Smi}
      v15 <- Constant(#50) [50, 50] T{_Smi}
      v20 <- Constant(#error) T{_OneByteString}
      v21 <- Constant(#1) [1, 1] T{_Smi}
      v47 <- Constant(#TypeArguments: (H146ff0f0) [Type: Int64*]) T{TypeArguments}
      v67 <- Constant(#TypeArguments: (H3489f82c) [Type: NativeType]) T{TypeArguments}
      v68 <- Constant(#_ImmutableList len:5) T{_ImmutableList}
      v175 <- Constant(#4950) [4950, 4950] T{_Smi}
}
  2: B1[function entry]:2 {
      v2 <- Parameter(0) T{CounterViaFfiRawRaw}
}
  4:     CheckStackOverflow:8(stack=0, loop=0)
  6:     v3 <- AllocateObject(Counter) T{Counter}
  7:     ParallelMove rax <- rax
  8:     ParallelMove S-1 <- rax
  8:     PushArgument(v3)
 10:     StaticCall:12( Counter.<0> v3)
 12:     v198 <- UnboxedConstant:34(#100) [100, 100] T{_Smi}
 13:     ParallelMove rax <- S-1
 14:     v35 <- LoadField:34(v3 . _impl@18328508 {final}) T{Pointer?}
 15:     ParallelMove S-4 <- rcx
 16:     CheckNull:34(v35, NoSuchMethodError) T{Pointer}
 18:     v131 <- UnboxedConstant:34(#0) [0, 0] T{_Smi}
 20:     v200 <- UnboxedConstant:34(#1) [1, 1] T{_Smi}
 22:     v210 <- UnboxedConstant(#0) [0, 0] T{_Smi}
 24:     ParallelMove rbx <- C goto:34 B5
 26: B5[join]:32 pred(B1, B17) {
      v6 <- phi(v210, v22 T{int}) alive [-9223372036854775808, 9223372036854775807] int64 T{int}
}
 27:     ParallelMove S-3 <- rbx
 28:     CheckStackOverflow:38(stack=0, loop=1)
 30:     Branch if RelationalOp(<, v6 T{int}, v198) T{bool} goto (17, 4)
 32: B17[target]:24
 34:     v95 <- LoadUntagged(v35 T{Pointer}, 8) T{*?}
 35:     ParallelMove rdx <- C, S-2 <- rsi
 36:     v167 <- AllocateObject(Pointer, v47) T{Pointer}
 37:     ParallelMove rcx <- rax, rax <- S-2
 38:     StoreUntagged(v167, v95 T{int})
 40:     v123 <- LoadUntagged(v167, 8) T{*?}
 42:     v127 <- LoadIndexed(v123, v131) [-9223372036854775808, 9223372036854775807] T{int}
 43:     ParallelMove rax <- S-3
 44:     ParallelMove rdx <- rdx
 44:     v41 <- BinaryInt64Op(+ [tr], v127 T{int}, v6 T{int}) [-9223372036854775808, 9223372036854775807] T{int}
 46:     v150 <- LoadUntagged(v167, 8) T{*?}
 48:     StoreIndexed(v150, v131, v41 T{int}, NoStoreBarrier)
 50:     ParallelMove rax <- rax
 50:     v22 <- BinaryInt64Op(+ [tr], v6 T{int}, v200) [-9223372036854775808, 9223372036854775807] T{int}
 52:     ParallelMove rbx <- rax, rax <- S-1, rcx <- S-4 goto:36 B5
 54: B4[target]:26 ParallelMove rax <- C
 56:     PushArgument(v3)
 58:     v12 <- StaticCall:40( get:rawrawCounter<0> v3, result_type = T{int??}) T{int?}
 59:     ParallelMove rcx <- rax, rax <- C
 60:     Branch if CheckedSmiComparison:44(!=, v12, v175) T{bool} goto (6, 7)
 62: B6[target]:52
 63:     ParallelMove rax <- C
 64:     Throw:56(v20)
 66: B7[target]:58
 67:     ParallelMove rax <- C
 68:     Return:62(v0)
*** END CFG

We should probably change LoadUntagged/StoreUntagged to use Load/StoreInstanceField with an untagged slot, which would make them subject to proper load/store forwarding and subsequently make Pointer objects subject to allocation sinking. This would also decrease the surface area of our IL.


dcharkes commented Aug 24, 2020

However instead I see a very large graph with three calls - one of which _loadStruct, if I am not mistaken, bottoms in the runtime call doing really complex stuff - at least I can't find IL optimization rules that would specialise _loadStruct for known arguments.

That is correct. .ref has not yet been optimized because it uses the runtime type argument of Pointer<MyStruct>. We should change this to the static type argument so that we can replace it, in the front-end transformation, with a normal constructor call. (This is related to tree-shaking as well: #38721.)

Edit: Alternatively, we could optimize only the call sites for which we know the type statically and leave the dynamic call sites slow. However, that would not solve the tree-shaking issue.


dcharkes commented Sep 7, 2020

Duplicate of #38648. Thanks for reporting and for the analysis, @mraleph.
