
Implement GC mark loop #21590

Merged: yuyichao merged 3 commits into master from yyc/gc/mark2 on May 7, 2017

Conversation

@yuyichao (Contributor) commented Apr 27, 2017:

This implements @vtjnash 's idea of a DFS mark stack without a max depth. It was originally proposed as a performance improvement, though I was more interested in it because it's very compatible with parallel and incremental GC. I believe at least Firefox has implemented something similar (can't find the link right now...).

In summary, the advantages include

  1. Easier for parallel and incremental marking
  2. Minimum stack usage and very shallow call stack in GC
  3. Performance improvement

For detailed explanations, see the comments in the code.

Performance-wise, I measured a ~30% performance improvement on repeated full GC, which seems to be due to a ~35% cut in instruction count. A similar effect has been observed on AArch64 too. It seems we are farther from being cache-miss limited than I thought. Of course this doesn't reflect real-world GC performance, but this number seems to be relatively representative of the worst case, and I haven't observed any performance regressions in other cases either.

The implementation uses local blocks as functions and is completely flattened, which I called "the most readable goto spaghetti". It should be fully commented, so I hope it's not too hard to understand.
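
To make the shape concrete, here is a minimal, self-contained sketch of the idea, not the PR's actual code: an iterative mark loop driven by manually managed pc/data stacks, dispatching through label addresses with the GCC/Clang computed-goto extension (the PR's `gc_mark_label_addrs` table is built the same way). All names below are made up for illustration; the real code in `src/gc.c` is far more involved.

```c
#include <stddef.h>

typedef struct node {
    int marked;
    struct node *left, *right;
} node_t;

static void mark_loop(node_t *root)
{
    /* Two manually managed stacks: `pc_stack` says what to do with the
     * matching entry of `data_stack`. Fixed capacity, no overflow check,
     * for brevity only. */
    void   *pc_stack[256];
    node_t *data_stack[256];
    size_t  sp = 0;
    void   *mark_node_label = &&mark_node;

    pc_stack[sp] = mark_node_label;
    data_stack[sp] = root;
    sp++;

pop:
    if (sp == 0)
        return;              /* stack empty: marking is done */
    sp--;
    goto *pc_stack[sp];      /* one indirect jump to the item's handler */

mark_node: {
        node_t *n = data_stack[sp];
        if (n == NULL || n->marked)
            goto pop;
        n->marked = 1;
        /* Push children as new work items instead of recursing, so the
         * native call stack stays flat no matter how deep the graph is. */
        pc_stack[sp] = mark_node_label; data_stack[sp] = n->left;  sp++;
        pc_stack[sp] = mark_node_label; data_stack[sp] = n->right; sp++;
        goto pop;
    }
}
```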

Since the stack is manually managed, this also makes it possible, in some cases, to print a backtrace with local variables without a debugger:

julia> function f()
           a = Ref(1)
           ptr = pointer_from_objref(a) - 8
           unsafe_store!(Ptr{Int}(ptr), ptr)
           gc()
       end
f (generic function with 1 method)

julia> f()
GC error (probable corruption) :
Allocations: 2706311 (Pool: 2704994; Big: 1317); GC: 3
getfield(Core.Inference, Symbol("##38#39")){Symbol}(sym=:noinline)
0x2fb8e30: Root object: 0x7feae4744010 :: 0x7feae47118d0 (bits: 3)
        of type Module
0x2fb8e48:  r-- Object (8bit) 0x7feae81187f0 :: 0x7feae4711ff3 -- [0, 9)
        of type Task
0x2fb8e68:   `- Stack frame 0x7ffe7a2b4e68 -- 1 of 2 (direct)

A more general version can also be implemented once we have parallel marking (and therefore have synced the stack pointers to global state).

This has gone through >200 CPU hours of GC stress testing, exposing mostly LLVM / rr bugs...

This should also make it possible to do more GC work concurrently but that requires some more thought and tests. Since there's already a lot of changes in this PR, that'll probably go into another one...

Not sure how this fits in the release timeline. It's not a new feature, but it's not a bug fix either. Whether or not this can get into 0.6, I at least want to merge this before #21185 or similar PRs to minimize conflicts. I can help rebase those PRs if necessary.

@yuyichao yuyichao added GC Garbage collector multithreading Base.Threads and related functionality performance Must go faster labels Apr 27, 2017
@StefanKarpinski (Member) commented:

In my experience, making low-level changes like this right before a release is not a great idea. Let's merge this right after the 0.6 branch occurs so it will be on master during the entire 1.0 development cycle. If it turns out to be really stable, it could be backported to 0.6.1 or something.

@yuyichao (Contributor Author) commented:

Sure, that sounds good to me. (In fact, implementing this took less time than I had planned, and I was going to submit the PR after branching anyway...)

@ararslan (Member) commented:
This is awesome. Nice work! Just for fun, let's invite our good friend Nanosoldier to the party.

@nanosoldier runbenchmarks(ALL, vs=":master")

src/gc.c Outdated
// If there isn't enough space on the stack anymore, the stack will be resized with the stack
// lock held. The caller should invalidate any local cache of the stack addresses that's not
// in `gc_cache` or `sp`
// The caller is also responsible for increasing `pc`.

@yuyichao (Contributor Author) commented Apr 27, 2017:

Note to myself: this line of comment is out of date. This is now controlled by an argument.

src/gc.c Outdated
JL_UNLOCK_NOGC(&gc_cache->stack_lock);
}

// Push a work to the stack. The type of the work is marked with `pc` and the data needed

Member: work item?

@nanosoldier (Collaborator) commented:
Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @jrevels

@yuyichao (Contributor Author) commented:

Accidentally restarted the lin64 job when I went to check the failure log (probably shouldn't be doing that on a phone...). Is there any way to recover the log?

@Keno (Member) left a comment:

I haven't gotten through all of it, but here are some comments.

src/gc.c Outdated
// Push a work to the stack. The type of the work is marked with `pc` and the data needed
// is in `data` and is of size `data_size`.
// The `sp` keeps track of the current stack pointer and will be updated on return.
// If there isn't enough space on the stack anymore, the stack will be resized with the stack

Member: drop anymore?

void *pc, void *data, size_t data_size, int inc)
{
assert(data_size <= sizeof(gc_mark_data_t));
if (__unlikely(sp->pc == sp->pc_end))

Member: So this means there's a bunch of space on the data stack that's not used? Do we gain any performance by that, or is that just a simplification for now?

@yuyichao (Contributor Author): Yes, though not that much. The total size is tens of KB and most of the items are 4 pointers.
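
To illustrate the tradeoff being discussed, here is a hedged sketch with made-up names, not the PR's code: if the data stack is sized for the worst-case item per `pc` slot, a single bounds check on the pc stack guarantees room on both stacks, at the cost of some unused data-stack space when items are smaller than the maximum.

```c
#include <assert.h>
#include <string.h>

#define STACK_CAP 1024

typedef struct { void *fields[4]; } mark_data_t;   /* largest work item */

typedef struct {
    void **pc_top;    /* assumed initialized to &pc[0] */
    char  *data_top;  /* assumed initialized to &data[0] */
    void  *pc[STACK_CAP];
    /* Sized so every pc slot can hold a maximum-size data item; this is
     * the "unused space" the question above refers to. */
    char   data[STACK_CAP * sizeof(mark_data_t)];
} mark_stack_t;

static void stack_push(mark_stack_t *s, void *pc,
                       const void *data, size_t data_size)
{
    assert(data_size <= sizeof(mark_data_t));
    if (s->pc_top == s->pc + STACK_CAP) {
        /* resize with the stack lock held, as the real code does */
        return;
    }
    *s->pc_top++ = pc;                 /* checking pc alone suffices... */
    memcpy(s->data_top, data, data_size);
    s->data_top += data_size;          /* ...data can never overflow first */
}
```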

}
}

// Check if the reference is non-NULL and atomically set the mark bit.

Member: In what sense is this atomic?

@yuyichao (Contributor Author): Atomic operation kind of atomic?
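
For readers wondering what that means in practice, a hedged sketch (illustrative names, not the PR's code): the mark bit is set with an atomic read-modify-write, so if two markers race on the same object, exactly one of them observes the bit as previously clear and wins the right to scan it.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define GC_MARKED_BIT ((uintptr_t)1)

/* Returns true iff this call transitioned the object from unmarked to
 * marked, i.e. this thread "owns" scanning the object's children. */
static bool try_set_mark(_Atomic uintptr_t *header)
{
    uintptr_t old = atomic_fetch_or(header, GC_MARKED_BIT);
    return (old & GC_MARKED_BIT) == 0;
}
```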

src/gc.c Outdated
return;
jl_value_t **items = (jl_value_t**)list->items;
gc_mark_finlist_t markdata = {items + start, items + len};
gc_mark_stack_push(gc_cache, sp, gc_mark_label_addrs[2], &markdata, sizeof(markdata), 1);

Member: The gc_mark_label_addrs seems a bit magic to me. Can we give them descriptive names (at least for the indices)?

@yuyichao (Contributor Author): Sure, I guess I can just use the enum for MSVC...
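
A short illustration of what that could look like (the enum names are hypothetical): indices into `gc_mark_label_addrs` get symbolic names, and the same enum can drive a `switch` fallback on compilers without computed goto, such as MSVC.

```c
/* Hypothetical names; only the idea of naming the indices is from the PR. */
enum gc_mark_label {
    GC_MARK_MARKED_OBJ = 0,
    GC_MARK_SCAN_ONLY  = 1,
    GC_MARK_FINLIST    = 2,
};

/* The call site above then becomes self-describing: */
/* gc_mark_stack_push(gc_cache, sp, gc_mark_label_addrs[GC_MARK_FINLIST],
 *                    &markdata, sizeof(markdata), 1); */
```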

STATIC_INLINE void gc_mark_push_remset(jl_ptls_t ptls, jl_value_t *obj, uintptr_t nptr)
{
if (__unlikely((nptr & 0x3) == 0x3)) {
ptls->heap.remset_nptr += nptr >> 2;

Member: This seems weird. Isn't nptr a ptr?

@yuyichao yuyichao force-pushed the yyc/gc/mark2 branch 3 times, most recently from 5f823ab to 8b51fc1 on May 2, 2017
@yuyichao (Contributor Author) commented May 3, 2017:

I fixed a size-calculation issue that was causing the CI failure in the channels test, and now all CI passes. I'll probably still do more tests today, but any other comments? Given that we've branched, I'll probably merge this weekend.

@yuyichao yuyichao force-pushed the yyc/gc/mark2 branch 2 times, most recently from b893e02 to 710041e on May 5, 2017
@yuyichao (Contributor Author) commented May 6, 2017:

Requesting final review; I'll merge this tomorrow.

yuyichao added 3 commits May 6, 2017 23:21
* Rename and clean up thread heap initialization

* More reliable inlining of `gc_setmark_buf` in `gc.c`

* Move assertion part of `gc_datatype_assert` outline.
* Uses an iterative (mostly) Depth-first search (DFS) to mark all the objects.
* Use two manually managed stacks instead of the native stack for better
  performance and compatibility with incremental/parallel scanning.
@yuyichao yuyichao merged commit 38417ec into master May 7, 2017
@yuyichao yuyichao deleted the yyc/gc/mark2 branch May 7, 2017 12:42
@d-netto d-netto mentioned this pull request Jun 7, 2022
@d-netto d-netto mentioned this pull request Nov 28, 2022
vtjnash pushed a commit that referenced this pull request Jan 24, 2023
## Previous work

Since #21590, the GC mark-loop was implemented by keeping two manually managed stacks, one of which contained iterator states used to keep track of the object currently being marked. As an example, to mark arrays, we would pop the corresponding iterator state from the stack and iterate over the array until we found an unmarked reference; if we found one, we would update the iterator state (to reflect the index where we left off), "repush" it onto the stack, and proceed with marking the reference we just found.
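
A hedged sketch of that pre-PR scheme (the helpers `is_marked`, `repush_iter`, and `push_obj` are made up for illustration):

```c
typedef struct _jl_value_t jl_value_t;

/* Assumed helpers, declared only to make the sketch self-contained. */
typedef struct {
    jl_value_t **begin;   /* next element to scan */
    jl_value_t **end;
} objarray_iter_t;
extern int  is_marked(jl_value_t *v);
extern void repush_iter(objarray_iter_t it);
extern void push_obj(jl_value_t *v);

static void scan_objarray(objarray_iter_t it)
{
    for (; it.begin < it.end; it.begin++) {
        jl_value_t *child = *it.begin;
        if (child != NULL && !is_marked(child)) {
            it.begin++;
            repush_iter(it);    /* remember where we left off */
            push_obj(child);    /* mark the child first (depth-first) */
            return;
        }
    }
    /* iterator exhausted: nothing to repush */
}
```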

## This PR

This PR eliminates the need to keep iterator states by modifying the object-graph traversal code. We keep a single stack of the `jl_value_t *` currently being processed. To mark an object, we first pop it from the stack, push all of its unmarked references onto the stack, and proceed with marking.
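
A hedged sketch of the new traversal (the queue helpers and field accessors below are invented for illustration; `jl_gc_markqueue_t` is the PR's type):

```c
#include <stddef.h>

typedef struct _jl_value_t jl_value_t;
typedef struct _jl_gc_markqueue_t jl_gc_markqueue_t;

/* Assumed helpers, for illustration only. */
extern jl_value_t *queue_pop(jl_gc_markqueue_t *mq);   /* NULL when empty */
extern void queue_push(jl_gc_markqueue_t *mq, jl_value_t *v);
extern size_t nfields(jl_value_t *v);
extern jl_value_t *field(jl_value_t *v, size_t i);
extern int try_set_mark_bit(jl_value_t *v);  /* 1 iff previously unmarked */

static void mark_loop(jl_gc_markqueue_t *mq)
{
    jl_value_t *obj;
    while ((obj = queue_pop(mq)) != NULL) {
        /* No iterator states: just push every not-yet-marked reference
         * held by `obj` and move on to the next queue entry. */
        for (size_t i = 0; i < nfields(obj); i++) {
            jl_value_t *child = field(obj, i);
            if (child != NULL && try_set_mark_bit(child))
                queue_push(mq, child);
        }
    }
}
```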

I believe this doesn't break any invariant of the generational GC. Indeed, the age bits are set after marking (if the object survived one GC cycle, it's moved to the old generation), so this new traversal scheme doesn't change whether an object had references to old objects or not. Furthermore, we must not update GC metadata for objects in the `remset`, and we ensure this by calling `gc_mark_outrefs` in `gc_queue_remset` with `meta_updated` set to 1.

## Additional advantages

1. There are no recursive function calls in the GC mark-loop code (one of the reasons why #21590 was implemented).
2. Keeping a single GC queue will **greatly** simplify work-stealing in the multi-threaded GC we are working on (cf. #45639).
3. Arrays of references, for example, are now marked in a regular-stride fashion, which could help with hardware prefetching.
4. We can easily change the traversal order (to breadth-first, for example) by only changing the `jl_gc_markqueue_t` methods (from LIFO to FIFO, for example) without touching the mark-loop itself, which could enable further exploration of the GC in the future; see the sketch after this list.
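
To illustrate point 4 (a hedged sketch; the buffer layout and helper names are invented): swapping which end the queue pops from flips the traversal between depth-first and breadth-first without touching the mark-loop.

```c
#include <stddef.h>

typedef struct _jl_value_t jl_value_t;

typedef struct {
    jl_value_t **buf;
    size_t head, tail;    /* pop/push positions into buf */
} jl_gc_markqueue_t;

/* LIFO pop (depth-first traversal): take from the push end. */
static jl_value_t *queue_pop_lifo(jl_gc_markqueue_t *q)
{
    return q->head == q->tail ? NULL : q->buf[--q->tail];
}

/* FIFO pop (breadth-first traversal): take from the opposite end. */
static jl_value_t *queue_pop_fifo(jl_gc_markqueue_t *q)
{
    return q->head == q->tail ? NULL : q->buf[q->head++];
}
```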

Since this PR changes the mark-loop's graph traversal code, there are some changes to the heap snapshot, though I'm not familiar with that PR.

Some benchmark results are here: https://hackmd.io/@Idnmfpb3SxK98-OsBtRD5A/H1V6QSzvs.
@vtjnash vtjnash mentioned this pull request Feb 2, 2023