
ffi: working with struct fields is slow #43142

Closed
mraleph opened this issue Aug 21, 2020 · 6 comments
Labels: area-vm, library-ffi

Comments


mraleph commented Aug 21, 2020

Given code like this:

import 'dart:ffi' as ffi;
import 'package:ffi/ffi.dart'; // provides allocate<T>()

class _Counter extends ffi.Struct {
  @ffi.Int64()
  int counter;
}

class Counter {
  final ffi.Pointer<_Counter> _impl = allocate<_Counter>();
  // ...
  void increment(int value) {
    _impl.ref.counter += value;
  }
}

void main() {
  final c = Counter();
  for (int i = 0; i < 10000; i++) c.increment(0);
}

I expect Counter.increment to translate into a small number of direct memory operations on the value of the _impl pointer.

However, instead I see a very large graph with three calls, one of which (_loadStruct), if I am not mistaken, bottoms out in a runtime call doing really complex stuff; at least I can't find IL optimization rules that would specialise _loadStruct for known arguments.
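For comparison, here is a hypothetical C sketch of the machine-level work the field increment should boil down to: one load, one add, and one store through the pointer (names are illustrative, not from the benchmark source):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sketch: what Counter.increment should lower to,
 * operating directly on the memory behind the _impl pointer. */
static void increment(int64_t* impl, int64_t value) {
  *impl += value;  /* load, add, store */
}
```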

*** BEGIN CFG
After AllocateRegisters
==== file:///usr/local/google/home/vegorov/src/dartvm_api_vs_ffi/impl_ffi.dart_Counter_rawIncrement (RegularFunction)
  0: B0[graph]:0 {
      v0 <- Constant(#null) T{Null?}
      v1 <- Constant(#<optimized out>) T{_OneByteString}
      v4 <- Constant(#TypeArguments: (H3095e7b3) [Type: _Counter@23328508*]) T{TypeArguments}
      v25 <- Constant(#0) [0, 0] T{_Smi}
      v26 <- Constant(#_ImmutableList len:5) T{_ImmutableList}
      v48 <- Constant(#_ImmutableList len:3) T{_ImmutableList}
      v49 <- Constant(#3) [3, 3] T{_Smi}
}
  2: B1[function entry]:2 {
      v2 <- Parameter(0) T{Counter}
      v3 <- Parameter(1) T{int?}
}
  4:     CheckStackOverflow:8(stack=0, loop=0)
  5:     ParallelMove rax <- S+3
  6:     v5 <- LoadField(v2 . _impl@23328508 {final}) T{Pointer}
  8:     PushArgument(v4)
 10:     PushArgument(v5)
 12:     PushArgument(v25)
 14:     v23 <- StaticCall:26( _loadStruct@8050071<1> v4, v5, v25) T{*?}
 15:     ParallelMove rax <- rax
 16:     ParallelMove S-4 <- rax
 16:     CheckClass:16(v23 T{*?} Cids[1: _Counter@23328508 etc.  cid 1139])
 18:     v32 <- LoadField(v23 T{_Counter} . _addressOf@8050071 {final}) T{Pointer}
 19:     ParallelMove S-3 <- rcx
 20:     PushArgument(v32)
 22:     PushArgument(v25)
 24:     v39 <- StaticCall:16( _loadInt64@8050071<0> v32, v25, recognized_kind = FfiLoadInt64) T{int?}
 25:     ParallelMove rax <- rax
 26:     CheckSmi:18(v39 T{int?})
 27:     ParallelMove rcx <- S+2
 28:     CheckSmi:18(v3)
 30:     ParallelMove rdx <- rax
 30:     v11 <- BinarySmiOp:18(+, v39 T{_Smi}, v3 T{_Smi}) [-4611686018427387904, 4611686018427387903] T{_Smi}
 31:     ParallelMove S-5 <- rdx
 32:     PushArgument(v32)
 34:     PushArgument(v25)
 36:     PushArgument(v11)
 38:     v65 <- StaticCall:16( _storeInt64@8050071<0> v32, v25, v11, recognized_kind = FfiStoreInt64) T{void?}
 39:     ParallelMove rax <- C
 40:     Return:24(v0)
*** END CFG

This ends up being around 50x slower than calling a C function through FFI to do the same increment.
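For reference, the C side of that comparison would be along these lines (a hypothetical counterpart; the actual benchmark source is not shown here):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical C counterpart of _Counter: a struct with a single
 * int64 field and an increment function invoked through dart:ffi. */
typedef struct {
  int64_t counter;
} Counter;

void counter_increment(Counter* c, int64_t value) {
  c->counter += value;
}
```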

/cc @mkustermann

mraleph added the area-vm and library-ffi labels on Aug 21, 2020

mraleph commented Aug 21, 2020

Similarly, if I use the code below I expect 0 calls to be left in the graph:

  void rawIncrement(int value) {
    _impl.cast<Int64>().value += value;
  }

However, I get 4 calls:

==== file:///usr/local/google/home/vegorov/src/dartvm_api_vs_ffi/impl_ffi.dart_Counter_rawrawIncrement (RegularFunction)
  0: B0[graph]:0 {
      v0 <- Constant(#null) T{Null?}
      v1 <- Constant(#<optimized out>) T{_OneByteString}
      v4 <- Constant(#TypeArguments: (H146ff0f0) [Type: Int64*]) T{TypeArguments}
      v32 <- Constant(#0) [0, 0] T{_Smi}
      v33 <- Constant(#TypeArguments: (H3489f82c) [Type: NativeType]) T{TypeArguments}
      v34 <- Constant(#_ImmutableList len:5) T{_ImmutableList}
      v64 <- Constant(#1) [1, 1] T{_Smi}
}
  2: B10[function entry]:2 {
      v2 <- Parameter(0) T{Counter}
      v3 <- Parameter(1) T{int?}
}
  4:     CheckStackOverflow:8(stack=0, loop=0)
  5:     ParallelMove rax <- S+3
  6:     v5 <- LoadField(v2 . _impl@23328508 {final}) T{Pointer}
  8:     PushArgument(v5)
 10:     v27 <- StaticCall:40( get:address<0> v5, recognized_kind = FfiGetAddress) T{int?}
 11:     ParallelMove rax <- rax
 12:     PushArgument(v4)
 14:     PushArgument(v27 T{int?})
 16:     v62 <- StaticCall:12( _fromAddress@8050071<1> v4, v27 T{int?}, recognized_kind = FfiFromAddress) T{*?}
 17:     ParallelMove rax <- rax
 18:     ParallelMove S-3 <- rax
 18:     PushArgument(v62 T{*?})
 20:     PushArgument(v32)
 22:     v41 <- StaticCall:12( _loadInt64@8050071<0> v62 T{*?}, v32, recognized_kind = FfiLoadInt64) T{int?}
 23:     ParallelMove rax <- rax
 24:     CheckSmi:18(v41 T{int?})
 25:     ParallelMove rcx <- S+2
 26:     CheckSmi:18(v3)
 28:     ParallelMove rdx <- rax
 28:     v11 <- BinarySmiOp:18(+, v41 T{_Smi}, v3 T{_Smi}) [-4611686018427387904, 4611686018427387903] T{_Smi}
 30:     PushArgument(v62 T{*?})
 32:     PushArgument(v32)
 34:     PushArgument(v11)
 36:     v52 <- StaticCall:12( _storeInt64@8050071<0> v62 T{*?}, v32, v11, recognized_kind = FfiStoreInt64) T{void?}
 37:     ParallelMove rax <- C
 38:     Return:24(v0)
*** END CFG

Though I think everything here is intrinsified, so overall performance is not that bad. However, it can probably be made N times better.


lfkdsk commented Aug 23, 2020

@mraleph Perhaps we could add a cache in front of the loadClass and lookUpFunction calls in ffi.cc (like we did in interpreter.cc), which might remove some of the performance overhead of class and method lookups.


mraleph commented Aug 24, 2020

@lfkdsk I think we should treat the code in ffi.cc as a slow path. Instead we should provide a specialisation for _loadStruct (e.g. in the CallSpecializer, or as a canonicalisation or inlining rule) which takes the type argument passed to _loadStruct and resolves everything needed at compile time.

I remember now that _storeX/_loadX are not inlined in the JIT because they are force-optimized (there were some questions about deoptimization targets to be addressed). For the second variant the situation in AOT is a bit better, but calls are still there; e.g. cast is not inlined:

$ ~/src/dart/sdk/pkg/vm/tool/precompiler2 --print-flow-graph-optimized --print-flow-graph-filter=RawRaw_run bench.dart bench.snap
*** BEGIN CFG
After AllocateRegisters
==== file:///usr/local/google/home/vegorov/src/dartvm_api_vs_ffi/bench.dart_CounterViaFfiRawRaw_run (RegularFunction)
  0: B0[graph]:0 {
      v0 <- Constant(#null) T{Null?}
      v1 <- Constant(#<optimized out>) T{_OneByteString}
      v5 <- Constant(#0) [0, 0] T{_Smi}
      v8 <- Constant(#100) [100, 100] T{_Smi}
      v11 <- Constant(#true) T{bool}
      v14 <- Constant(#99) [99, 99] T{_Smi}
      v15 <- Constant(#50) [50, 50] T{_Smi}
      v20 <- Constant(#error) T{_OneByteString}
      v21 <- Constant(#1) [1, 1] T{_Smi}
      v47 <- Constant(#TypeArguments: (H146ff0f0) [Type: Int64*]) T{TypeArguments}
      v114 <- Constant(#4950) [4950, 4950] T{_Smi}
}
  2: B1[function entry]:2 {
      v2 <- Parameter(0) T{CounterViaFfiRawRaw}
}
  4:     CheckStackOverflow:8(stack=0, loop=0)
  6:     v3 <- AllocateObject(Counter) T{Counter}
  7:     ParallelMove rax <- rax
  8:     ParallelMove S-1 <- rax
  8:     PushArgument(v3)
 10:     StaticCall:12( Counter.<0> v3)
 12:     v137 <- UnboxedConstant:34(#100) [100, 100] T{_Smi}
 13:     ParallelMove rax <- S-1
 14:     v35 <- LoadField:34(v3 . _impl@18328508 {final}) T{Pointer?}
 15:     ParallelMove S-3 <- rcx
 16:     CheckNull:34(v35, NoSuchMethodError) T{Pointer}
 18:     v87 <- UnboxedConstant:34(#0) [0, 0] T{_Smi}
 20:     v139 <- UnboxedConstant:34(#1) [1, 1] T{_Smi}
 22:     v149 <- UnboxedConstant(#0) [0, 0] T{_Smi}
 24:     ParallelMove rdx <- C goto:34 B5
 26: B5[join]:32 pred(B1, B3) {
      v6 <- phi(v149, v22 T{int}) alive [-9223372036854775808, 9223372036854775807] int64 T{int}
}
 27:     ParallelMove S-2 <- rdx
 28:     CheckStackOverflow:38(stack=0, loop=1)
 30:     Branch if RelationalOp(<, v6 T{int}, v137) T{bool} goto (3, 4)
 32: B3[target]:24
 34:     PushArgument(v47)
 36:     PushArgument(v35 T{Pointer})
 38:     v37 <- StaticCall:16( cast<1> v47, v35 T{Pointer}, using unchecked entrypoint, result_type = T{Pointer?}) T{Pointer?}
 39:     ParallelMove rax <- rax
 40:     t1 <- CheckNull:12(v37 T{Pointer?}, NoSuchMethodError) T{Pointer}
 42:     v79 <- LoadUntagged(v37 T{Pointer}, 8) T{*?}
 44:     v83 <- LoadIndexed(v79, v87) [-9223372036854775808, 9223372036854775807] T{int}
 45:     ParallelMove rcx <- S-2
 46:     ParallelMove rdx <- rdx
 46:     v41 <- BinaryInt64Op(+ [tr], v83 T{int}, v6 T{int}) [-9223372036854775808, 9223372036854775807] T{int}
 48:     v106 <- LoadUntagged(v37 T{Pointer}, 8) T{*?}
 50:     StoreIndexed(v106, v87, v41 T{int}, NoStoreBarrier)
 52:     ParallelMove rcx <- rcx
 52:     v22 <- BinaryInt64Op(+ [tr], v6 T{int}, v139) [-9223372036854775808, 9223372036854775807] T{int}
 54:     ParallelMove rdx <- rcx, rax <- S-1, rcx <- S-3 goto:36 B5
 56: B4[target]:26 ParallelMove rax <- C
 58:     PushArgument(v3)
 60:     v12 <- StaticCall:40( get:rawrawCounter<0> v3, result_type = T{int??}) T{int?}
 61:     ParallelMove rcx <- rax, rax <- C
 62:     Branch if CheckedSmiComparison:44(!=, v12, v114) T{bool} goto (6, 7)
 64: B6[target]:52
 65:     ParallelMove rax <- C
 66:     Throw:56(v20)
 68: B7[target]:58
 69:     ParallelMove rax <- C
 70:     Return:62(v0)
*** END CFG


mraleph commented Aug 24, 2020

If I force-inline Pointer.cast via @pragma('vm:prefer-inline'), then I get the following code:

*** BEGIN CFG
After AllocateRegisters
==== file:///usr/local/google/home/vegorov/src/dartvm_api_vs_ffi/bench.dart_CounterViaFfiRawRaw_run (RegularFunction)
  0: B0[graph]:0 {
      v0 <- Constant(#null) T{Null?}
      v1 <- Constant(#<optimized out>) T{_OneByteString}
      v5 <- Constant(#0) [0, 0] T{_Smi}
      v8 <- Constant(#100) [100, 100] T{_Smi}
      v11 <- Constant(#true) T{bool}
      v14 <- Constant(#99) [99, 99] T{_Smi}
      v15 <- Constant(#50) [50, 50] T{_Smi}
      v20 <- Constant(#error) T{_OneByteString}
      v21 <- Constant(#1) [1, 1] T{_Smi}
      v47 <- Constant(#TypeArguments: (H146ff0f0) [Type: Int64*]) T{TypeArguments}
      v67 <- Constant(#TypeArguments: (H3489f82c) [Type: NativeType]) T{TypeArguments}
      v68 <- Constant(#_ImmutableList len:5) T{_ImmutableList}
      v175 <- Constant(#4950) [4950, 4950] T{_Smi}
}
  2: B1[function entry]:2 {
      v2 <- Parameter(0) T{CounterViaFfiRawRaw}
}
  4:     CheckStackOverflow:8(stack=0, loop=0)
  6:     v3 <- AllocateObject(Counter) T{Counter}
  7:     ParallelMove rax <- rax
  8:     ParallelMove S-1 <- rax
  8:     PushArgument(v3)
 10:     StaticCall:12( Counter.<0> v3)
 12:     v198 <- UnboxedConstant:34(#100) [100, 100] T{_Smi}
 13:     ParallelMove rax <- S-1
 14:     v35 <- LoadField:34(v3 . _impl@18328508 {final}) T{Pointer?}
 15:     ParallelMove S-4 <- rcx
 16:     CheckNull:34(v35, NoSuchMethodError) T{Pointer}
 18:     v131 <- UnboxedConstant:34(#0) [0, 0] T{_Smi}
 20:     v200 <- UnboxedConstant:34(#1) [1, 1] T{_Smi}
 22:     v210 <- UnboxedConstant(#0) [0, 0] T{_Smi}
 24:     ParallelMove rbx <- C goto:34 B5
 26: B5[join]:32 pred(B1, B17) {
      v6 <- phi(v210, v22 T{int}) alive [-9223372036854775808, 9223372036854775807] int64 T{int}
}
 27:     ParallelMove S-3 <- rbx
 28:     CheckStackOverflow:38(stack=0, loop=1)
 30:     Branch if RelationalOp(<, v6 T{int}, v198) T{bool} goto (17, 4)
 32: B17[target]:24
 34:     v95 <- LoadUntagged(v35 T{Pointer}, 8) T{*?}
 35:     ParallelMove rdx <- C, S-2 <- rsi
 36:     v167 <- AllocateObject(Pointer, v47) T{Pointer}
 37:     ParallelMove rcx <- rax, rax <- S-2
 38:     StoreUntagged(v167, v95 T{int})
 40:     v123 <- LoadUntagged(v167, 8) T{*?}
 42:     v127 <- LoadIndexed(v123, v131) [-9223372036854775808, 9223372036854775807] T{int}
 43:     ParallelMove rax <- S-3
 44:     ParallelMove rdx <- rdx
 44:     v41 <- BinaryInt64Op(+ [tr], v127 T{int}, v6 T{int}) [-9223372036854775808, 9223372036854775807] T{int}
 46:     v150 <- LoadUntagged(v167, 8) T{*?}
 48:     StoreIndexed(v150, v131, v41 T{int}, NoStoreBarrier)
 50:     ParallelMove rax <- rax
 50:     v22 <- BinaryInt64Op(+ [tr], v6 T{int}, v200) [-9223372036854775808, 9223372036854775807] T{int}
 52:     ParallelMove rbx <- rax, rax <- S-1, rcx <- S-4 goto:36 B5
 54: B4[target]:26 ParallelMove rax <- C
 56:     PushArgument(v3)
 58:     v12 <- StaticCall:40( get:rawrawCounter<0> v3, result_type = T{int??}) T{int?}
 59:     ParallelMove rcx <- rax, rax <- C
 60:     Branch if CheckedSmiComparison:44(!=, v12, v175) T{bool} goto (6, 7)
 62: B6[target]:52
 63:     ParallelMove rax <- C
 64:     Throw:56(v20)
 66: B7[target]:58
 67:     ParallelMove rax <- C
 68:     Return:62(v0)
*** END CFG

We should probably change LoadUntagged/StoreUntagged to use Load/StoreInstanceField with an untagged slot, which would make them subject to proper load/store forwarding and subsequently make Pointer objects subject to allocation sinking. This would also decrease the surface area of our IL.


dcharkes commented Aug 24, 2020

However instead I see a very large graph with three calls - one of which _loadStruct, if I am not mistaken, bottoms in the runtime call doing really complex stuff - at least I can't find IL optimization rules that would specialise _loadStruct for known arguments.

That is correct. .ref has not yet been optimized because it uses the runtime type argument of Pointer<MyStruct>. We should change this to the static type argument so that we can replace it, in the front-end transformation, with a normal constructor call. (This is related to tree-shaking as well: #38721.)

Edit: Alternatively, we could optimize only the call sites for which we know the type statically and leave the dynamic call sites slow. However, that would not solve the tree-shaking issue.


dcharkes commented Sep 7, 2020

Duplicate of #38648. Thanks for reporting and for the analysis, @mraleph.
