[vm] Smaller list literals #44391

Closed
rakudrama opened this issue Dec 4, 2020 · 6 comments
Assignees: alexmarkov
Labels: area-vm, vm-aot-code-size

rakudrama commented Dec 4, 2020

The current VM code pattern for a list literal like <T>[a, b()] does the following:

  • call to allocate an array (CreateArray) (~5 instructions)
  • store a in array (~4 instructions)
  • call b()
  • store result in array with a write barrier (~10 instructions). The barrier is needed because the call to b() could have triggered a GC and moved the array.
  • call to _GrowableList<T>._fromLiteral(array) (~9 instructions)

For small list literals (up to, say, 8 elements) it might be better to call a helper, e.g. _GrowableList<T>._literal2(a, b()).

  • The call to the helper is much smaller and the helper is often shared between several literals.
  • The helper can avoid write barriers since all stores immediately follow the new-space array allocation.
  • _GrowableList<T>._fromLiteral can be aggressively inlined into the helper to 'pay down' the additional call cost.
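
A compilable sketch of such a helper follows (the class shape and names below are assumptions modeled on the _GrowableList/_withData names that appear later in this issue, not the actual SDK source):

// Standalone sketch of the proposed arity-2 helper. _GrowableListSketch
// stands in for the VM-internal _GrowableList.
class _GrowableListSketch<T> {
  List<T?> _data;
  int _length = 0;

  _GrowableListSketch._withData(this._data);

  factory _GrowableListSketch._literal2(T e0, T e1) {
    // Every store immediately follows the fresh new-space allocation,
    // with no intervening Dart call that could trigger a GC and move
    // the array, so the VM can omit the write barriers.
    final data = List<T?>.filled(2, null);
    data[0] = e0;
    data[1] = e1;
    return _GrowableListSketch<T>._withData(data).._length = 2;
  }
}

Since there is one helper per element count, every two-element literal in the program reuses the same code.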

The benefit of eliminating write barriers can also be achieved by evaluating the element expressions before CreateArray.
The values would tend to spill, so it would need to be clear that a write barrier was indeed avoided by the early evaluation.
It should be possible to rematerialize constants at the indexed store rather than spill them. Other expressions (e.g. variable values) might need a write barrier but no spill slot, so a policy here would weigh the number of spill slots added against the number of write barriers avoided.

Example: [a(), const b, const c, d(), e, f]:

  • a() and d() return values that don't require a write barrier.
  • moving a() ahead of the array allocation does not help remove a write barrier for const b since there is none.
  • moving d() ahead of the allocation forces a() ahead of allocation, but might allow e and f to avoid a write barrier.
  • so two spill slots for a() and d() remove two write barriers for e and f.

For larger literals this kind of analysis might be better than using a _literalN constructor, which in effect indiscriminately spills all the elements by pushing them as arguments.
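
A Dart-level sketch of the evaluation order this analysis would pick for the example above (illustrative only; the actual transform would operate on the flow graph):

// Hypothetical evaluation order for [a(), b, c, d(), e, f] after the
// transform, with b and c standing in for the const elements. a() and
// d() are hoisted above the allocation at the cost of two spill slots.
List<Object?> buildLiteral(
    Object? Function() a, Object? Function() d, Object? e, Object? f) {
  const b = 1, c = 2; // stand-ins for the const elements
  final t0 = a(); // spilled across the allocation
  final t1 = d(); // spilled across the allocation
  final array = List<Object?>.filled(6, null); // CreateArray
  array[0] = t0; // no barrier: a() returns a barrier-safe value
  array[1] = b;  // no barrier: constants never need one
  array[2] = c;
  array[3] = t1; // no barrier: d() returns a barrier-safe value
  array[4] = e;  // no barrier: no Dart call since the allocation
  array[5] = f;  // no barrier
  return array;
}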

@rakudrama added the area-vm and vm-aot-code-size labels Dec 4, 2020
@alexmarkov self-assigned this Dec 7, 2020
rakudrama commented:

I wonder whether this transform was less effective than anticipated (only a 0.22% total size improvement) because the _literalN helpers are inlined.

From precompile2 --trace-inlining:

  => new _GrowableList._literal2 (deopt count 0)
     Success
       with reason need to count first, code size 10, call sites: 1

AllocateArrayStub is called in the inlined code, so there are two call sites at the machine level. Perhaps the inliner heuristics should count allocation stub calls too.

I saw one example of something like [a, b, c] where the elements are integers that get boxed.
The boxing originally happened before the inlined call to _literalN, and hence before the AllocateArrayStub call, but was moved to after the allocation. This reordering of the boxing allocations towards their use sites introduces write barriers in the inlined code. It would be fixed trivially by not inlining, but it also suggests a more general strategy for reducing write barriers: do not move an allocation into the region between a second allocation and the stores into that second allocation.

I did the following experiment: compile pkg/compiler/lib/src/dart2js.dart:

  • With the stock SDK
  • With the SDK plus @pragma('vm:never-inline') added to the _literalN methods

The total file size was reduced by 0.535%, from 25139776 to 25005208.


mraleph commented Jan 19, 2022

@rakudrama In general we should be able to eliminate write barriers even when there is an interfering boxing; we have a special optimization which does that (the WB is eliminated, and if the boxing triggers a GC we apply special GC invariant restoration code, which compensates for the potentially removed barriers).

It would be good to check why this optimization does not kick in in this particular case.

rakudrama commented:

In trying to make a repro, I needed an optional argument that is not optimized away:

main() {
  foo(1, 1, 1);
  foo(2, 2, 2, 2); // also passes the optional argument
}

@pragma('vm:never-inline')
foo(int i1, int i2, int i3, [int i4 = 0]) {
  print([i1, i2, i3]); // first literal: gets boxing and write barriers
  print([i1, i2, i3]); // second literal: barrier-free, reuses the boxes
  print(i4);
}

The generated code for the first [i1,i2,i3] has boxing and write barriers.
The code for the second list is barrier-free, reusing previous boxings.

I think there might be two issues here:

  1. Boxing is moved after the original allocation, causing write barriers - my original complaint (two comments up). Perhaps boxing is inserted after inlining at the point of use, and should be inserted earlier.
  2. AOT knows some arguments are in a safe (Smi) range, but forgets that when the function has optional arguments.
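
A hypothetical variant of the repro illustrating the second point (assuming, per the observation above, that AOT can prove the arguments are Smis when there is no optional parameter):

// Same function as foo but without the optional parameter. If every
// call site passes small integer constants, global type flow should
// keep the arguments unboxed Smis, so the list stores need neither
// Mint boxing nor write barriers.
@pragma('vm:never-inline')
void bar(int i1, int i2, int i3) {
  print([i1, i2, i3]); // expected: no Mint boxing, no write barriers
}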

I'm interested to hear more about the GC invariant restoration code. I can imagine that, when scanning the stack for a GC, there might be an indication that some slot holds a new-ish object with pending barrier-removed writes, so the object is preemptively marked if it moves to a generation that needs marking. Perhaps this exists, but not for arrays, since their marking is different (to prevent scanning the whole array, which was a problem in the past). The general case for arrays is hard (it requires knowing the set of indexes written since the last GC by inspecting the stack), but it should be tractable for constant indexes within the first N elements.

<Precompiled____foo_3809>:
Precompiled____foo_3809():
	          55                   	push   %rbp
	          48 89 e5             	mov    %rsp,%rbp
	          48 83 ec 30          	sub    $0x30,%rsp
	          4c 89 d0             	mov    %r10,%rax
	          48 8b 48 1f          	mov    0x1f(%rax),%rcx
	          48 83 e9 06          	sub    $0x6,%rcx
	          48 8b 54 8d 20       	mov    0x20(%rbp,%rcx,4),%rdx
	          48 8b 74 8d 18       	mov    0x18(%rbp,%rcx,4),%rsi
	          48 89 75 e0          	mov    %rsi,-0x20(%rbp)
	          48 8b 7c 8d 10       	mov    0x10(%rbp,%rcx,4),%rdi
	          48 89 7d e8          	mov    %rdi,-0x18(%rbp)
	          48 83 f9 02          	cmp    $0x2,%rcx
	   /----- 0f 8c 1a 00 00 00    	jl     <Precompiled____foo_3809+0x4e>
	   |      48 8b 44 8d 08       	mov    0x8(%rbp,%rcx,4),%rax
	   |      48 d1 f8             	sar    %rax
	   |  /-- 73 08                	jae    <Precompiled____foo_3809+0x46>
	   |  |   48 8b 04 45 08 00 00 	mov    0x8(,%rax,2),%rax
	   |  |   00 
	   |  \-> 48 89 c1             	mov    %rax,%rcx
	   |  /-- e9 02 00 00 00       	jmp    <Precompiled____foo_3809+0x50>
	   \--|-> 33 c9                	xor    %ecx,%ecx
	      \-> 48 89 4d f0          	mov    %rcx,-0x10(%rbp)
	          49 3b 66 38          	cmp    0x38(%r14),%rsp
	   /----- 0f 86 5b 01 00 00    	jbe    <Precompiled____foo_3809+0x1b9>
	/--|----> 48 89 d0             	mov    %rdx,%rax
	|  |      48 03 c0             	add    %rax,%rax
	|  |  /-- 0f 81 09 00 00 00    	jno    <Precompiled____foo_3809+0x73>
	|  |  |   e8 71 f0 ff ff       	call   <Precompiled_Stub__iso_stub_AllocateMintSharedWithoutFPURegsStub>
	|  |  |   48 89 50 07          	mov    %rdx,0x7(%rax)
	|  |  \-> 49 8b 9e c8 00 00 00 	mov    0xc8(%r14),%rbx
	|  |      41 ba 06 00 00 00    	mov    $0x6,%r10d
	|  |      48 89 45 f8          	mov    %rax,-0x8(%rbp)
dart:core-patch/growable_array.dart:592
	|  |      e8 c3 ec ff ff       	call   <Precompiled_Stub__iso_stub_AllocateArrayStub>
new _GrowableList._literal3():
	|  |      48 89 c6             	mov    %rax,%rsi
	|  |      48 8b 4d f8          	mov    -0x8(%rbp),%rcx
	|  |      48 89 4e 17          	mov    %rcx,0x17(%rsi)
	|  |      48 8b 55 e0          	mov    -0x20(%rbp),%rdx
	|  |      48 89 d0             	mov    %rdx,%rax
	|  |      48 03 c0             	add    %rax,%rax
	|  |  /-- 0f 81 09 00 00 00    	jno    <Precompiled____foo_3809+0xad>
	|  |  |   e8 37 f0 ff ff       	call   <Precompiled_Stub__iso_stub_AllocateMintSharedWithoutFPURegsStub>
	|  |  |   48 89 50 07          	mov    %rdx,0x7(%rax)
	|  |  \-> 48 89 f2             	mov    %rsi,%rdx
	|  |      48 89 c7             	mov    %rax,%rdi
	|  |      48 89 7d d0          	mov    %rdi,-0x30(%rbp)
	|  |      4c 8d 6a 1f          	lea    0x1f(%rdx),%r13
	|  |      49 89 45 00          	mov    %rax,0x0(%r13)
	|  |      a8 01                	test   $0x1,%al
	|  |  /-- 74 17                	je     <Precompiled____foo_3809+0xda>
	|  |  |   44 8a 5a ff          	mov    -0x1(%rdx),%r11b
	|  |  |   41 c1 eb 02          	shr    $0x2,%r11d
	|  |  |   45 23 5e 40          	and    0x40(%r14),%r11d
	|  |  |   44 84 58 ff          	test   %r11b,-0x1(%rax)
	|  |  +-- 74 05                	je     <Precompiled____foo_3809+0xda>
	|  |  |   e8 aa d6 ff ff       	call   <Precompiled_Stub__iso_stub_ArrayWriteBarrierStub>
	|  |  \-> 48 8b 55 e8          	mov    -0x18(%rbp),%rdx
	|  |      48 89 d0             	mov    %rdx,%rax
	|  |      48 03 c0             	add    %rax,%rax
	|  |  /-- 0f 81 09 00 00 00    	jno    <Precompiled____foo_3809+0xf3>
	|  |  |   e8 f1 ef ff ff       	call   <Precompiled_Stub__iso_stub_AllocateMintSharedWithoutFPURegsStub>
	|  |  |   48 89 50 07          	mov    %rdx,0x7(%rax)
	|  |  \-> 48 89 f2             	mov    %rsi,%rdx
	|  |      48 89 c3             	mov    %rax,%rbx
	|  |      48 89 5d d8          	mov    %rbx,-0x28(%rbp)
	|  |      4c 8d 6a 27          	lea    0x27(%rdx),%r13
	|  |      49 89 45 00          	mov    %rax,0x0(%r13)
	|  |      a8 01                	test   $0x1,%al
	|  |  /-- 74 17                	je     <Precompiled____foo_3809+0x120>
	|  |  |   44 8a 5a ff          	mov    -0x1(%rdx),%r11b
	|  |  |   41 c1 eb 02          	shr    $0x2,%r11d
	|  |  |   45 23 5e 40          	and    0x40(%r14),%r11d
	|  |  |   44 84 58 ff          	test   %r11b,-0x1(%rax)
	|  |  +-- 74 05                	je     <Precompiled____foo_3809+0x120>
	|  |  |   e8 64 d6 ff ff       	call   <Precompiled_Stub__iso_stub_ArrayWriteBarrierStub>
dart:core-patch/growable_array.dart:596
	|  |  \-> 4d 8b 9f 6f 02 00 00 	mov    0x26f(%r15),%r11
	|  |      41 53                	push   %r11
	|  |      56                   	push   %rsi
	|  |      e8 e1 64 f5 ff       	call   <Precompiled__GrowableList_0150898__GrowableList_0150898__withData_0150898_668>
	|  |      59                   	pop    %rcx
	|  |      59                   	pop    %rcx
dart:core-patch/growable_array.dart:597
	|  |      48 c7 40 0f 06 00 00 	movq   $0x6,0xf(%rax)
	|  |      00 
	|  |      50                   	push   %rax
Precompiled____foo_3809():
file:///usr/local/google/home/sra/play1/bug1c3m.dart:8
	|  |      e8 59 b4 f6 ff       	call   <Precompiled____print_972>
	|  |      59                   	pop    %rcx
	|  |      49 8b 9e c8 00 00 00 	mov    0xc8(%r14),%rbx
	|  |      41 ba 06 00 00 00    	mov    $0x6,%r10d
dart:core-patch/growable_array.dart:592
	|  |      e8 fa eb ff ff       	call   <Precompiled_Stub__iso_stub_AllocateArrayStub>
new _GrowableList._literal3():
	|  |      48 89 c1             	mov    %rax,%rcx
	|  |      48 8b 45 f8          	mov    -0x8(%rbp),%rax
	|  |      48 89 41 17          	mov    %rax,0x17(%rcx)
	|  |      48 8b 45 d0          	mov    -0x30(%rbp),%rax
	|  |      48 89 41 1f          	mov    %rax,0x1f(%rcx)
	|  |      48 8b 45 d8          	mov    -0x28(%rbp),%rax
	|  |      48 89 41 27          	mov    %rax,0x27(%rcx)
dart:core-patch/growable_array.dart:596
	|  |      4d 8b 9f 6f 02 00 00 	mov    0x26f(%r15),%r11
	|  |      41 53                	push   %r11
	|  |      51                   	push   %rcx
	|  |      e8 94 64 f5 ff       	call   <Precompiled__GrowableList_0150898__GrowableList_0150898__withData_0150898_668>
	|  |      59                   	pop    %rcx
	|  |      59                   	pop    %rcx
dart:core-patch/growable_array.dart:597
	|  |      48 c7 40 0f 06 00 00 	movq   $0x6,0xf(%rax)
	|  |      00 
	|  |      50                   	push   %rax
Precompiled____foo_3809():
file:///usr/local/google/home/sra/play1/bug1c3m.dart:9
	|  |      e8 0c b4 f6 ff       	call   <Precompiled____print_972>
	|  |      59                   	pop    %rcx
	|  |      48 8b 4d f0          	mov    -0x10(%rbp),%rcx
	|  |      48 89 c8             	mov    %rcx,%rax
	|  |      48 03 c0             	add    %rax,%rax
	|  |  /-- 0f 81 09 00 00 00    	jno    <Precompiled____foo_3809+0x1a6>
	|  |  |   e8 3e ef ff ff       	call   <Precompiled_Stub__iso_stub_AllocateMintSharedWithoutFPURegsStub>
	|  |  |   48 89 48 07          	mov    %rcx,0x7(%rax)
file:///usr/local/google/home/sra/play1/bug1c3m.dart:10
	|  |  \-> 50                   	push   %rax
	|  |      e8 ec b3 f6 ff       	call   <Precompiled____print_972>
	|  |      59                   	pop    %rcx
	|  |      49 8b 86 c8 00 00 00 	mov    0xc8(%r14),%rax
file:///usr/local/google/home/sra/play1/bug1c3m.dart:6
	|  |      48 89 ec             	mov    %rbp,%rsp
	|  |      5d                   	pop    %rbp
	|  |      c3                   	ret    
	|  \----> 41 ff 96 48 02 00 00 	call   *0x248(%r14)
	\-------- e9 99 fe ff ff       	jmp    <Precompiled____foo_3809+0x5e>
	          cc                   	int3   


mraleph commented Jan 19, 2022

Okay, I have reread the EliminateWriteBarriers pass to refresh my memory. It does not actually work for arrays; the reasoning given in the comment is to avoid adding too much work to subsequent marking/root scanning. However, I don't see any reason why this should not be extended to small arrays. I think something like this does the trick:

diff --git a/runtime/vm/compiler/write_barrier_elimination.cc b/runtime/vm/compiler/write_barrier_elimination.cc
index 855cffb37a0..ad32ae1bdd9 100644
--- a/runtime/vm/compiler/write_barrier_elimination.cc
+++ b/runtime/vm/compiler/write_barrier_elimination.cc
@@ -114,7 +114,7 @@ class WriteBarrierElimination : public ValueObject {
 
   // Bitvector with all non-Array-allocation instructions set. Used to
   // un-mark Array allocations as usable.
-  BitVector* array_allocations_mask_;
+  BitVector* large_array_allocations_mask_;
 
   // Bitvectors for each block of which allocations are new or remembered
   // at the start (after Phis).
@@ -189,8 +189,15 @@ void WriteBarrierElimination::SaveResults() {
   }
 }
 
+static bool IsCreateLargeArray(Definition* defn) {
+  if (auto create_array = defn->AsCreateArray()) {
+    return create_array->GetConstantNumElements() >= Thread::kArrayLengthLimitForWriteBarrierElimination;
+  }
+  return false;
+}
+
 void WriteBarrierElimination::IndexDefinitions(Zone* zone) {
-  BitmapBuilder array_allocations;
+  BitmapBuilder large_array_allocations;
 
   GrowableArray<Definition*> create_array_worklist;
 
@@ -198,7 +205,7 @@ void WriteBarrierElimination::IndexDefinitions(Zone* zone) {
     BlockEntryInstr* const block = block_order_->At(i);
     if (auto join_block = block->AsJoinEntry()) {
       for (PhiIterator it(join_block); !it.Done(); it.Advance()) {
-        array_allocations.Set(definition_count_, false);
+        large_array_allocations.Set(definition_count_, false);
         definition_indices_.Insert({it.Current(), definition_count_++});
 #if defined(DEBUG)
         if (tracing_) {
@@ -211,10 +218,10 @@ void WriteBarrierElimination::IndexDefinitions(Zone* zone) {
     for (ForwardInstructionIterator it(block); !it.Done(); it.Advance()) {
       if (Definition* current = it.Current()->AsDefinition()) {
         if (IsUsable(current)) {
-          const bool is_create_array = current->IsCreateArray();
-          array_allocations.Set(definition_count_, is_create_array);
+          const bool is_create_large_array = IsCreateLargeArray(current);
+          large_array_allocations.Set(definition_count_, is_create_large_array);
           definition_indices_.Insert({current, definition_count_++});
-          if (is_create_array) {
+          if (is_create_large_array) {
             create_array_worklist.Add(current);
           }
 #if defined(DEBUG)
@@ -234,8 +241,8 @@ void WriteBarrierElimination::IndexDefinitions(Zone* zone) {
          it.Advance()) {
       if (auto phi_use = it.Current()->instruction()->AsPhi()) {
         const intptr_t index = Index(phi_use);
-        if (!array_allocations.Get(index)) {
-          array_allocations.Set(index, /*can_be_create_array=*/true);
+        if (!large_array_allocations.Get(index)) {
+          large_array_allocations.Set(index, /*can_be_create_large_array=*/true);
           create_array_worklist.Add(phi_use);
         }
       }
@@ -244,9 +251,9 @@ void WriteBarrierElimination::IndexDefinitions(Zone* zone) {
 
   vector_ = new (zone) BitVector(zone, definition_count_);
   vector_->SetAll();
-  array_allocations_mask_ = new (zone) BitVector(zone, definition_count_);
+  large_array_allocations_mask_ = new (zone) BitVector(zone, definition_count_);
   for (intptr_t i = 0; i < definition_count_; ++i) {
-    if (!array_allocations.Get(i)) array_allocations_mask_->Add(i);
+    if (!large_array_allocations.Get(i)) large_array_allocations_mask_->Add(i);
   }
 }
 
@@ -388,9 +395,9 @@ void WriteBarrierElimination::UpdateVectorForBlock(BlockEntryInstr* entry,
     if (current->CanCallDart()) {
       vector_->Clear();
     } else if (current->CanTriggerGC()) {
-      // Clear array allocations. These are not added to the remembered set
-      // by Thread::RememberLiveTemporaries() after a scavenge.
-      vector_->Intersect(array_allocations_mask_);
+      // Clear large array allocations. These are not added to the remembered
+      // set by Thread::RememberLiveTemporaries() after a scavenge.
+      vector_->Intersect(large_array_allocations_mask_);
     }
 
     if (AllocationInstr* const alloc = current->AsAllocation()) {
diff --git a/runtime/vm/thread.cc b/runtime/vm/thread.cc
index 6efc403e9c0..e7d6ddf14d2 100644
--- a/runtime/vm/thread.cc
+++ b/runtime/vm/thread.cc
@@ -659,10 +659,15 @@ class RestoreWriteBarrierInvariantVisitor : public ObjectPointerVisitor {
       // Stores into new-space objects don't need a write barrier.
       if (obj->IsSmiOrNewObject()) continue;
 
-      // To avoid adding too much work into the remembered set, skip
+      // To avoid adding too much work into the remembered set, skip large
       // arrays. Write barrier elimination will not remove the barrier
       // if we can trigger GC between array allocation and store.
-      if (obj->GetClassId() == kArrayCid) continue;
+      if (obj->GetClassId() == kArrayCid) {
+        const auto length = Smi::Value(Array::RawCast(obj)->untag()->length());
+        if (length >= Thread::kArrayLengthLimitForWriteBarrierElimination) {
+          continue;
+        }
+      }
 
       // Dart code won't store into VM-internal objects except Contexts and
       // UnhandledExceptions. This assumption is checked by an assertion in
diff --git a/runtime/vm/thread.h b/runtime/vm/thread.h
index be48283f062..7b76f8f66b6 100644
--- a/runtime/vm/thread.h
+++ b/runtime/vm/thread.h
@@ -1073,6 +1073,8 @@ class Thread : public ThreadState {
                : SafepointLevel::kGCAndDeopt;
   }
 
+  static constexpr intptr_t kArrayLengthLimitForWriteBarrierElimination = 16;
+
  private:
   template <class T>
   T* AllocateReusableHandle();
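
As a Dart-level illustration of the threshold's intent (hypothetical example; the real decision is made per CreateArray instruction in the flow graph, using the limit of 16 from the patch above):

// With the 16-element limit, barriers on stores into small fresh arrays
// can be eliminated even if an allocation between CreateArray and the
// store triggers a GC; the restoration code just rescans the small array
// wholesale. Large arrays keep their barriers, since rescanning them
// would add too much work to the remembered set.
void demo(Object x) {
  final small = List<Object?>.filled(8, null);
  small[0] = x; // length 8 < 16: barrier eligible for elimination

  final large = List<Object?>.filled(64, null);
  large[0] = x; // length 64 >= 16: barrier retained
}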

Regarding "I'm interested to hear more about GC invariant restoration code":

This code is much simpler. We don't need to track which fields will potentially be written; we just treat the whole object in a special way and rescan it as a whole. The code takes all the objects in the bottom frame which could potentially violate the invariant and puts them into the store buffer / deferred marking queue preemptively.


mraleph commented Jan 25, 2022

Uploaded my fix for review; it eliminates the WB when an interfering boxing is involved: https://dart-review.googlesource.com/c/sdk/+/229903

rakudrama commented:

@mraleph dart2js static calls to ArrayWriteBarrierStub: 3357 ⟶ 3177, a reduction of 180/3357 ≈ 5.4%.

Uri.parse has multiple hits.

Eyeballing the objdump --no-addresses --no-show-raw-insn disassembly (which diffs reasonably well), most of the removed write barriers are in the context of StringBase._interpolate, where a non-string component is mint-boxed.

I don't know if we can generalize from dart2js, but most interpolations are small, and most of the remaining StoreIndexed write barriers coming from interpolation (about 500) could be removed by reordering via an inlined interpolation helper analogous to _GrowableList._literalN, sketched below.
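
A standalone sketch of what such a helper could look like (the name interpolate3 and its exact shape are assumptions by analogy with _GrowableList._literalN, not an existing SDK API):

// Sketch: the component expressions arrive as already-evaluated (and,
// where needed, already-boxed) arguments, so the stores into the freshly
// allocated parts list directly follow the allocation and can skip
// write barriers, just as in the _literalN helpers.
String interpolate3(Object? a, Object? b, Object? c) {
  final parts = List<Object?>.filled(3, null);
  parts[0] = a;
  parts[1] = b;
  parts[2] = c;
  final sb = StringBuffer();
  for (final part in parts) {
    sb.write(part);
  }
  return sb.toString();
}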
