-
-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement GC mark loop #21590
Implement GC mark loop #21590
Conversation
In my experience, making low level changes like this right before a release is not a great idea. Let's merge this right after the 0.6 branch occurs so it will be on master during the entire 1.0 development cycle. If it turns out to be really stable, it could be backported to 0.6.1 or something. |
Sure, that sounds good to me. (In fact implementing this takes shorter time than I planned and I was going to submit a PR after branching.......) |
This is awesome. Nice work! Just for fun, let's invite our good friend Nanosoldier to the party. @nanosoldier |
src/gc.c
Outdated
// If there isn't enough space on the stack anymore, the stack will be resized with the stack | ||
// lock held. The caller should invalidate any local cache of the stack addresses that's not | ||
// in `gc_cache` or `sp` | ||
// The caller is also responsible for increasing `pc`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note to myself. This line of comment is out-of-date. This is now controled by an argument.
src/gc.c
Outdated
JL_UNLOCK_NOGC(&gc_cache->stack_lock); | ||
} | ||
|
||
// Push a work to the stack. The type of the work is marked with `pc` and the data needed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
work item
?
Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @jrevels |
................. Accidentally restarted the lin64 job when I want to check for the failure log (.... probably shouuldn't be doing that on a phone.....) Is there anyway to recover the log?........................ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't gotten through all of it, but here's some comments.
src/gc.c
Outdated
// Push a work to the stack. The type of the work is marked with `pc` and the data needed | ||
// is in `data` and is of size `data_size`. | ||
// The `sp` keeps track of the current stack pointer and will be updated on return. | ||
// If there isn't enough space on the stack anymore, the stack will be resized with the stack |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
drop anymore
?
void *pc, void *data, size_t data_size, int inc) | ||
{ | ||
assert(data_size <= sizeof(gc_mark_data_t)); | ||
if (__unlikely(sp->pc == sp->pc_end)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this means there's a bunch of space on the data stack that's not used? Do we gain any performance by that, or is that just a simplification for now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, though not that much. The total size is tens of KB and most of the items are 4 pointers.
} | ||
} | ||
|
||
// Check if the reference is non-NULL and atomically set the mark bit. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In what sense is this atomic?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Atomic operation kind of atomic?
src/gc.c
Outdated
return; | ||
jl_value_t **items = (jl_value_t**)list->items; | ||
gc_mark_finlist_t markdata = {items + start, items + len}; | ||
gc_mark_stack_push(gc_cache, sp, gc_mark_label_addrs[2], &markdata, sizeof(markdata), 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The gc_mark_label_addrs
seems a big magic to me. Can we give them descriptive names (at least for the indices)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I guess I can just use the enum for MSVC...
STATIC_INLINE void gc_mark_push_remset(jl_ptls_t ptls, jl_value_t *obj, uintptr_t nptr) | ||
{ | ||
if (__unlikely((nptr & 0x3) == 0x3)) { | ||
ptls->heap.remset_nptr += nptr >> 2; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems weird. Isn't nptr a ptr?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is documented in great detail
5f823ab
to
8b51fc1
Compare
I fixed a size calculation issue that was causing the CI failure in the channels test and now all CI pass. I'll probably still do more tests today but any other comments? Given that we've branched I'll probably merge this weekend. |
b893e02
to
710041e
Compare
Request for final review, I'll merge this tomorrow. |
* Rename and clean up thread heap initialization * More reliable inlining of `gc_setmark_buf` in `gc.c` * Move assertion part of `gc_datatype_assert` outline.
* Uses an iterative (mostly) Depth-first search (DFS) to mark all the objects. * Use two manually managed stacks instead of the native stack for better performance and compatibility with incremental/parallel scanning.
## Previous work Since #21590, the GC mark-loop was implemented by keeping two manually managed stacks: one of which contained iterator states used to keep track of the object currently being marked. As an example, to mark arrays, we would pop the corresponding iterator state from the stack, iterate over the array until we found an unmarked reference, and if so, we would update the iterator state (to reflect the index we left off), "repush" it into the stack and proceed with marking the reference we just found. ## This PR This PR eliminates the need of keeping the iterator states by modifying the object graph traversal code. We keep a single stack of `jl_value_t *` currently being processed. To mark an object, we first pop it from the stack, push all unmarked references into the stack and proceed with marking. I believe this doesn't break any invariant from the generational GC. Indeed, the age bits are set after marking (if the object survived one GC cycle it's moved to the old generation), so this new traversal scheme wouldn't change the fact of whether an object had references to old objects or not. Furthermore, we must not update GC metadata for objects in the `remset`, and we ensure this by calling `gc_mark_outrefs` in `gc_queue_remset` with `meta_updated` set to 1. ## Additional advantages 1. There are no recursive function calls in the GC mark-loop code (one of the reasons why #21590 was implemented). 2. Keeping a single GC queue will **greatly** simplify work-stealing in the multi-threaded GC we are working on (c.f. #45639). 3. Arrays of references, for example, are now marked on a regular stride fashion, which could help with hardware prefetching. 4. We can easily modify the traversal mode (to breadth first, for example) by only changing the `jl_gc_markqueue_t`(from LIFO to FIFO, for example) methods without touching the mark-loop itself, which could enable further exploration on the GC in the future. Since this PR changes the mark-loop graph traversal code, there are some changes in the heap-snapshot, though I'm not familiar with that PR. Some benchmark results are here: https://hackmd.io/@Idnmfpb3SxK98-OsBtRD5A/H1V6QSzvs.
This implements @vtjnash 's idea of a DFS mark stack without a max depth. It was originally proposed as an performance improvement though I was more interested in it since it's very compatible with parallel and incremental GC. I believe at least firefox implemented something similar (can't find the link right now...)
In summary, the advantages include
For detail explainations, see the comment in the code.
Performance-wise, I measured a ~30% performance improvement on repeated full GC which seems to be due to a ~35% cut in instruction count. Similar effect have been observed on aarch64 too. Seems that we are farther away from cache-miss limited than I though. Of course this doesn't reflect real world GC performance but this number seems to be relatively representitive for the worst case and I haven't observed any performance regressions in other cases either.
The implementation uses local blocks as functions and is completely flattened, what I called "the most readable goto spaghetti". It should be fully commented so I hope it's not too hard to understand.
Since the stack is manually managed, this also make it possible to print a backtrace with local variables without a debugger in some cases
More general version can also be implemented once we have parallel marking (therefore having sync'd the stack pointers to global states)
This has gone through >200 CPU hours of GC stress test exposing mostly LLVM / rr bugs......
This should also make it possible to do more GC work concurrently but that requires some more thought and tests. Since there's already a lot of changes in this PR, that'll probably go into another one...
Not sure how this fits in the release timeline. It's not a new feature but it's not a bug fix either. Whether or not this can get into 0.6 I at least want to merge this before #21185 or similar PR's to minimize conflicts. I can help rebase those PR's if necessary.