
Implement GC mark loop #21590

Merged: yuyichao merged 3 commits into master from yyc/gc/mark2 on May 7, 2017

Conversation

@yuyichao (Contributor) commented Apr 27, 2017:

This implements @vtjnash 's idea of a DFS mark stack without a max depth. It was originally proposed as a performance improvement, though I was more interested in it because it's very compatible with parallel and incremental GC. I believe at least Firefox has implemented something similar (can't find the link right now...).

In summary, the advantages include

  1. Easier for parallel and incremental marking
  2. Minimum stack usage and very shallow call stack in GC
  3. Performance improvement

For detailed explanations, see the comments in the code.

Performance-wise, I measured a ~30% performance improvement on repeated full GC, which seems to be due to a ~35% cut in instruction count. A similar effect has been observed on AArch64 too. It seems we are farther from being cache-miss limited than I thought. Of course this doesn't reflect real-world GC performance, but this number seems to be relatively representative of the worst case, and I haven't observed any performance regressions in other cases either.

The implementation uses local blocks as functions and is completely flattened, which I called "the most readable goto spaghetti". It should be fully commented, so I hope it's not too hard to understand.
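
To make the shape concrete, here is a minimal, self-contained sketch of the idea, not the PR's actual code: an iterative mark loop driven by manually managed pc/data stacks, dispatching through label addresses with the GCC/Clang computed-goto extension (the PR's `gc_mark_label_addrs` table is built the same way). All names below are made up for illustration; the real code in `src/gc.c` is far more involved.

```c
#include <stddef.h>

typedef struct node {
    int marked;
    struct node *left, *right;
} node_t;

static void mark_loop(node_t *root)
{
    /* Two manually managed stacks: `pc_stack` says what to do with the
     * matching entry of `data_stack`. Fixed capacity, no overflow check,
     * for brevity only. */
    void   *pc_stack[256];
    node_t *data_stack[256];
    size_t  sp = 0;
    void   *mark_node_label = &&mark_node;

    pc_stack[sp] = mark_node_label;
    data_stack[sp] = root;
    sp++;

pop:
    if (sp == 0)
        return;              /* stack empty: marking is done */
    sp--;
    goto *pc_stack[sp];      /* one indirect jump to the item's handler */

mark_node: {
        node_t *n = data_stack[sp];
        if (n == NULL || n->marked)
            goto pop;
        n->marked = 1;
        /* Push children as new work items instead of recursing, so the
         * native call stack stays flat no matter how deep the graph is. */
        pc_stack[sp] = mark_node_label; data_stack[sp] = n->left;  sp++;
        pc_stack[sp] = mark_node_label; data_stack[sp] = n->right; sp++;
        goto pop;
    }
}
```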

Since the stack is manually managed, this also makes it possible, in some cases, to print a backtrace with local variables without a debugger:

julia> function f()
           a = Ref(1)
           ptr = pointer_from_objref(a) - 8
           unsafe_store!(Ptr{Int}(ptr), ptr)
           gc()
       end
f (generic function with 1 method)

julia> f()
GC error (probable corruption) :
Allocations: 2706311 (Pool: 2704994; Big: 1317); GC: 3
getfield(Core.Inference, Symbol("##38#39")){Symbol}(sym=:noinline)
0x2fb8e30: Root object: 0x7feae4744010 :: 0x7feae47118d0 (bits: 3)
        of type Module
0x2fb8e48:  r-- Object (8bit) 0x7feae81187f0 :: 0x7feae4711ff3 -- [0, 9)
        of type Task
0x2fb8e68:   `- Stack frame 0x7ffe7a2b4e68 -- 1 of 2 (direct)

A more general version can also be implemented once we have parallel marking (and therefore have synced the stack pointers to global state).

This has gone through >200 CPU hours of GC stress testing, exposing mostly LLVM / rr bugs...

This should also make it possible to do more GC work concurrently but that requires some more thought and tests. Since there's already a lot of changes in this PR, that'll probably go into another one...

Not sure how this fits in the release timeline. It's not a new feature, but it's not a bug fix either. Whether or not this can get into 0.6, I at least want to merge this before #21185 or similar PRs to minimize conflicts. I can help rebase those PRs if necessary.

@yuyichao yuyichao added GC Garbage collector multithreading Base.Threads and related functionality performance Must go faster labels Apr 27, 2017
@StefanKarpinski (Member) commented:

In my experience, making low-level changes like this right before a release is not a great idea. Let's merge this right after the 0.6 branch occurs so it will be on master during the entire 1.0 development cycle. If it turns out to be really stable, it could be backported to 0.6.1 or something.

@yuyichao (Contributor Author) commented:

Sure, that sounds good to me. (In fact, implementing this took less time than I had planned, and I was going to submit the PR after branching anyway...)

@ararslan (Member) commented:
This is awesome. Nice work! Just for fun, let's invite our good friend Nanosoldier to the party.

@nanosoldier runbenchmarks(ALL, vs=":master")

src/gc.c Outdated
// If there isn't enough space on the stack anymore, the stack will be resized with the stack
// lock held. The caller should invalidate any local cache of the stack addresses that's not
// in `gc_cache` or `sp`
// The caller is also responsible for increasing `pc`.

@yuyichao (Contributor Author) commented Apr 27, 2017:

Note to myself: this line of comment is out of date. This is now controlled by an argument.

src/gc.c Outdated
JL_UNLOCK_NOGC(&gc_cache->stack_lock);
}

// Push a work to the stack. The type of the work is marked with `pc` and the data needed

Member: work item?

@nanosoldier (Collaborator) commented:
Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @jrevels

@yuyichao (Contributor Author) commented:

Accidentally restarted the lin64 job when I went to check the failure log (probably shouldn't be doing that on a phone...). Is there any way to recover the log?

@Keno (Member) left a comment:

I haven't gotten through all of it, but here are some comments.

src/gc.c Outdated
// Push a work to the stack. The type of the work is marked with `pc` and the data needed
// is in `data` and is of size `data_size`.
// The `sp` keeps track of the current stack pointer and will be updated on return.
// If there isn't enough space on the stack anymore, the stack will be resized with the stack

Member: drop anymore?

void *pc, void *data, size_t data_size, int inc)
{
assert(data_size <= sizeof(gc_mark_data_t));
if (__unlikely(sp->pc == sp->pc_end))

Member: So this means there's a bunch of space on the data stack that's not used? Do we gain any performance by that, or is that just a simplification for now?

@yuyichao (Contributor Author): Yes, though not that much. The total size is tens of KB and most of the items are 4 pointers.
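
To illustrate the tradeoff being discussed, here is a hedged sketch with made-up names, not the PR's code: if the data stack is sized for the worst-case item per `pc` slot, a single bounds check on the pc stack guarantees room on both stacks, at the cost of some unused data-stack space when items are smaller than the maximum.

```c
#include <assert.h>
#include <string.h>

#define STACK_CAP 1024

typedef struct { void *fields[4]; } mark_data_t;   /* largest work item */

typedef struct {
    void **pc_top;    /* assumed initialized to &pc[0] */
    char  *data_top;  /* assumed initialized to &data[0] */
    void  *pc[STACK_CAP];
    /* Sized so every pc slot can hold a maximum-size data item; this is
     * the "unused space" the question above refers to. */
    char   data[STACK_CAP * sizeof(mark_data_t)];
} mark_stack_t;

static void stack_push(mark_stack_t *s, void *pc,
                       const void *data, size_t data_size)
{
    assert(data_size <= sizeof(mark_data_t));
    if (s->pc_top == s->pc + STACK_CAP) {
        /* resize with the stack lock held, as the real code does */
        return;
    }
    *s->pc_top++ = pc;                 /* checking pc alone suffices... */
    memcpy(s->data_top, data, data_size);
    s->data_top += data_size;          /* ...data can never overflow first */
}
```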

}
}

// Check if the reference is non-NULL and atomically set the mark bit.

Member: In what sense is this atomic?

@yuyichao (Contributor Author): Atomic operation kind of atomic?
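
For readers wondering what that means in practice, a hedged sketch (illustrative names, not the PR's code): the mark bit is set with an atomic read-modify-write, so if two markers race on the same object, exactly one of them observes the bit as previously clear and wins the right to scan it.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define GC_MARKED_BIT ((uintptr_t)1)

/* Returns true iff this call transitioned the object from unmarked to
 * marked, i.e. this thread "owns" scanning the object's children. */
static bool try_set_mark(_Atomic uintptr_t *header)
{
    uintptr_t old = atomic_fetch_or(header, GC_MARKED_BIT);
    return (old & GC_MARKED_BIT) == 0;
}
```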

src/gc.c Outdated
return;
jl_value_t **items = (jl_value_t**)list->items;
gc_mark_finlist_t markdata = {items + start, items + len};
gc_mark_stack_push(gc_cache, sp, gc_mark_label_addrs[2], &markdata, sizeof(markdata), 1);

Member: The gc_mark_label_addrs seems a bit magic to me. Can we give them descriptive names (at least for the indices)?

@yuyichao (Contributor Author): Sure, I guess I can just use the enum for MSVC...
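
A short illustration of what that could look like (the enum names are hypothetical): indices into `gc_mark_label_addrs` get symbolic names, and the same enum can drive a `switch` fallback on compilers without computed goto, such as MSVC.

```c
/* Hypothetical names; only the idea of naming the indices is from the PR. */
enum gc_mark_label {
    GC_MARK_MARKED_OBJ = 0,
    GC_MARK_SCAN_ONLY  = 1,
    GC_MARK_FINLIST    = 2,
};

/* The call site above then becomes self-describing: */
/* gc_mark_stack_push(gc_cache, sp, gc_mark_label_addrs[GC_MARK_FINLIST],
 *                    &markdata, sizeof(markdata), 1); */
```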

STATIC_INLINE void gc_mark_push_remset(jl_ptls_t ptls, jl_value_t *obj, uintptr_t nptr)
{
if (__unlikely((nptr & 0x3) == 0x3)) {
ptls->heap.remset_nptr += nptr >> 2;

Member: This seems weird. Isn't nptr a ptr?

@yuyichao yuyichao force-pushed the yyc/gc/mark2 branch 3 times, most recently from 5f823ab to 8b51fc1 on May 2, 2017
@yuyichao (Contributor Author) commented May 3, 2017:

I fixed a size-calculation issue that was causing the CI failure in the channels test, and now all CI passes. I'll probably still do more tests today, but any other comments? Given that we've branched, I'll probably merge this weekend.

@yuyichao yuyichao force-pushed the yyc/gc/mark2 branch 2 times, most recently from b893e02 to 710041e on May 5, 2017
@yuyichao (Contributor Author) commented May 6, 2017:

Requesting final review; I'll merge this tomorrow.

yuyichao added 3 commits May 6, 2017 23:21
* Rename and clean up thread heap initialization

* More reliable inlining of `gc_setmark_buf` in `gc.c`

* Move assertion part of `gc_datatype_assert` outline.
* Uses an iterative (mostly) Depth-first search (DFS) to mark all the objects.
* Use two manually managed stacks instead of the native stack for better
  performance and compatibility with incremental/parallel scanning.
@yuyichao yuyichao merged commit 38417ec into master May 7, 2017
@yuyichao yuyichao deleted the yyc/gc/mark2 branch May 7, 2017 12:42
@d-netto d-netto mentioned this pull request Jun 7, 2022
@d-netto d-netto mentioned this pull request Nov 28, 2022
vtjnash pushed a commit that referenced this pull request Jan 24, 2023
## Previous work

Since #21590, the GC mark-loop was implemented by keeping two manually managed stacks, one of which contained iterator states used to keep track of the object currently being marked. As an example, to mark arrays, we would pop the corresponding iterator state from the stack and iterate over the array until we found an unmarked reference; if we found one, we would update the iterator state (to reflect the index where we left off), "repush" it onto the stack, and proceed with marking the reference we just found.
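
A hedged sketch of that pre-PR scheme (the helpers `is_marked`, `repush_iter`, and `push_obj` are made up for illustration):

```c
typedef struct _jl_value_t jl_value_t;

/* Assumed helpers, declared only to make the sketch self-contained. */
typedef struct {
    jl_value_t **begin;   /* next element to scan */
    jl_value_t **end;
} objarray_iter_t;
extern int  is_marked(jl_value_t *v);
extern void repush_iter(objarray_iter_t it);
extern void push_obj(jl_value_t *v);

static void scan_objarray(objarray_iter_t it)
{
    for (; it.begin < it.end; it.begin++) {
        jl_value_t *child = *it.begin;
        if (child != NULL && !is_marked(child)) {
            it.begin++;
            repush_iter(it);    /* remember where we left off */
            push_obj(child);    /* mark the child first (depth-first) */
            return;
        }
    }
    /* iterator exhausted: nothing to repush */
}
```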

## This PR

This PR eliminates the need to keep iterator states by modifying the object-graph traversal code. We keep a single stack of the `jl_value_t *` currently being processed. To mark an object, we first pop it from the stack, push all of its unmarked references onto the stack, and proceed with marking.
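
A hedged sketch of the new traversal (the queue helpers and field accessors below are invented for illustration; `jl_gc_markqueue_t` is the PR's type):

```c
#include <stddef.h>

typedef struct _jl_value_t jl_value_t;
typedef struct _jl_gc_markqueue_t jl_gc_markqueue_t;

/* Assumed helpers, for illustration only. */
extern jl_value_t *queue_pop(jl_gc_markqueue_t *mq);   /* NULL when empty */
extern void queue_push(jl_gc_markqueue_t *mq, jl_value_t *v);
extern size_t nfields(jl_value_t *v);
extern jl_value_t *field(jl_value_t *v, size_t i);
extern int try_set_mark_bit(jl_value_t *v);  /* 1 iff previously unmarked */

static void mark_loop(jl_gc_markqueue_t *mq)
{
    jl_value_t *obj;
    while ((obj = queue_pop(mq)) != NULL) {
        /* No iterator states: just push every not-yet-marked reference
         * held by `obj` and move on to the next queue entry. */
        for (size_t i = 0; i < nfields(obj); i++) {
            jl_value_t *child = field(obj, i);
            if (child != NULL && try_set_mark_bit(child))
                queue_push(mq, child);
        }
    }
}
```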

I believe this doesn't break any invariant of the generational GC. Indeed, the age bits are set after marking (if the object survived one GC cycle, it's moved to the old generation), so this new traversal scheme doesn't change whether an object had references to old objects or not. Furthermore, we must not update GC metadata for objects in the `remset`, and we ensure this by calling `gc_mark_outrefs` in `gc_queue_remset` with `meta_updated` set to 1.

## Additional advantages

1. There are no recursive function calls in the GC mark-loop code (one of the reasons why #21590 was implemented).
2. Keeping a single GC queue will **greatly** simplify work-stealing in the multi-threaded GC we are working on (cf. #45639).
3. Arrays of references, for example, are now marked in a regular-stride fashion, which could help with hardware prefetching.
4. We can easily change the traversal order (to breadth-first, for example) by only changing the `jl_gc_markqueue_t` methods (from LIFO to FIFO, for example) without touching the mark-loop itself, which could enable further exploration of the GC in the future; see the sketch after this list.
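
To illustrate point 4 (a hedged sketch; the buffer layout and helper names are invented): swapping which end the queue pops from flips the traversal between depth-first and breadth-first without touching the mark-loop.

```c
#include <stddef.h>

typedef struct _jl_value_t jl_value_t;

typedef struct {
    jl_value_t **buf;
    size_t head, tail;    /* pop/push positions into buf */
} jl_gc_markqueue_t;

/* LIFO pop (depth-first traversal): take from the push end. */
static jl_value_t *queue_pop_lifo(jl_gc_markqueue_t *q)
{
    return q->head == q->tail ? NULL : q->buf[--q->tail];
}

/* FIFO pop (breadth-first traversal): take from the opposite end. */
static jl_value_t *queue_pop_fifo(jl_gc_markqueue_t *q)
{
    return q->head == q->tail ? NULL : q->buf[q->head++];
}
```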

Since this PR changes the mark-loop's graph traversal code, there are some changes to the heap snapshot, though I'm not familiar with that PR.

Some benchmark results are here: https://hackmd.io/@Idnmfpb3SxK98-OsBtRD5A/H1V6QSzvs.
@vtjnash vtjnash mentioned this pull request Feb 2, 2023