Generational plans set `next_gc_full_heap` in `end_of_gc` #714
Conversation
I'm just a bit confused as to why this change is necessary. What is the "other parallel release work when we execute `Plan.release()`"? From the scheduler code, it seems like the reference processing is done in the `Release` work bucket as well; is that the issue? If so, could you document it somewhere?
```diff
@@ -228,10 +228,15 @@ pub trait Plan: 'static + Sync + Downcast {
     /// per-worker preparation. This method is invoked once per worker by the worker thread passed as the argument.
     fn prepare_worker(&self, _worker: &mut GCWorker<Self::VM>) {}
 
-    /// Release the plan after a GC. This is invoked at the end of a GC when most GC work is finished.
+    /// Release the plan after transitive closure. A plan can implement this method to call each policy's release,
+    /// or create any work packet that should be done in release.
     /// This is invoked once per GC by one worker thread. 'tls' is the worker thread that executes this method.
     fn release(&mut self, tls: VMWorkerThread);
```
Personally, I find the "release" terminology a bit confusing then. What are the semantics of "release" supposed to be, exactly? Most (if not all?) plans just release pages from spaces. Is the issue that `plan.release()` is called at the start of the `Release` work bucket instead of the end? The comment for the `Release` work packet implies that there should only ever be one [1] (and only one is ever made in the `scheduler_common_work` function). What is the source of the issue you have encountered? Reference processing?

[1]: mmtk-core/src/scheduler/gc_work.rs, Lines 91 to 97 in fc144cf

```rust
/// The global GC release Work
/// This work packet invokes release() for the plan (which will invoke release() for each space), and
/// pushes work packets for releasing mutators and collectors.
/// We should only have one such work packet per GC, after all actual GC work ends.
/// We assume this work packet is the only running work packet that accesses plan, and there should
/// be no other concurrent work packet that accesses plan (read or write). Otherwise, there may
/// be a race condition.
```
Ah I see -- it seems like MarkSweep and Immix use the `Release` work bucket for sweeping [1, 2]. My question is then why not have a separate `Sweep` work bucket like we have with the `Compact` work bucket [3]?

[1]: https://github.com/mmtk/mmtk-core/blob/master/src/policy/marksweepspace/malloc_ms/global.rs#L472-L501
[2]: https://github.com/mmtk/mmtk-core/blob/master/src/policy/immix/immixspace.rs#L316-L340
[3]: mmtk-core/src/scheduler/work_bucket.rs, Line 220 in 43a6755 (`Compact,`)
Yeah, right. That is the motivation. The problem is that some plans generate work in the release bucket, and we need a hook to run some things after every work packet is done.

What you pointed out reminds me that our current code is actually incorrect. There is a race condition: we run the `Release` work packet (which calls `Plan.release()` with a mutable plan) concurrently with those sweeping work packets (which can access `Plan`).
The comment for the `Release` work packet above seems to imply that it should not be executing with anything else, and that "all the GC work" should have been finished before the `Release` work packet is called. If MS and Immix are performing sweeping in `Release`, then all the GC work hasn't really been finished, right? Either we should fix the comment for the `Release` work packet, or make a new work bucket called `Sweep` where MS and Immix generate sweep work packets (in the latter case, the issue of "parallel release work" ceases to exist, unless reference processing is still an issue).
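A rough sketch of that second option, purely for illustration: `WorkBucketStage::Sweep` and `SweepChunk` below are made-up names (not mmtk-core items); the point is only that a policy would queue its sweeping packets into a dedicated bucket that runs after `Release`, so the `Release` bucket keeps exactly one work packet with exclusive plan access.

```rust
// Hypothetical: assumes a new WorkBucketStage::Sweep variant and a SweepChunk
// work packet. Sweeping work goes into the Sweep bucket instead of Release.
let packets: Vec<Box<dyn GCWork<VM>>> = chunks
    .iter()
    .map(|&chunk| Box::new(SweepChunk { chunk }) as Box<dyn GCWork<VM>>)
    .collect();
scheduler.work_buckets[WorkBucketStage::Sweep].bulk_add(packets);
```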
I think that "fixing" the comment is not actually a fix. Yes -- there is a race condition. The Sweep
work bucket is the more sensible solution (unless you have a better proposal).
> In the `Prepare` and `Release` work packets, we assume we are the only work that is being executed, thus it is safe for us to get a mutable reference to the plan. If we have any other work packet, they can access `mmtk` in their `do_work`. And that is racy.

That makes sense.

```rust
#[allow(clippy::cast_ref_to_mut)]
let plan_mut: &mut C::PlanType = unsafe { &mut *(self.plan as *const _ as *mut _) };
plan_mut.prepare(worker.tls);
```

I think this is the same issue.

It doesn't look like an issue that can be solved in one or two days. To ensure we have only one thread with access to the plan, we must stop spreading `&'static Plan` (and `&'static MMTK`, too) everywhere, and have precise control of who holds a reference to `Plan` (and `MMTK` as well). A key to the solution will be the ability to "revoke" all such references during STW. In fact, this is what STW is really meant to do, i.e. stop mutators from accessing the heap, as a form of synchronisation.

The good thing for RefEnqueue is that it never needs to access the MMTk instance. It can be a pure VM-side operation.
Yes, I agree. Though with regard to this PR (or more specifically, the issue this PR is trying to address), I think the most pragmatic approach is to add a new `Sweep` or `Collect` work bucket and make it so that MS and Immix generate their sweeping work packets into this bucket. In that case we don't really need this PR, as all the GC work will have been done before `should_next_gc_full_heap()` is called.

It will also be helpful to clarify exactly what the semantics of the "Release" work bucket are, i.e. should we send the list of processed references to the binding in the "Release" work bucket, the "Sweep"/"Collect" work bucket, or even the "EndOfGC" work packet.
> The good thing for RefEnqueue is that it never needs to access the MMTk instance. It can be a pure VM-side operation.

It technically accesses the `mmtk` instance (right now at least -- unsure of your proposed changes):
https://github.com/mmtk/mmtk-core/blob/master/src/util/reference_processor.rs#L556
> It will also be helpful to clarify exactly what the semantics of the "Release" work bucket are, i.e. should we send the list of processed references to the binding in the "Release" work bucket, the "Sweep"/"Collect" work bucket, or even the "EndOfGC" work packet.

I have made an attempt to do this in my branch (see below), but not as precisely as you described.
https://github.com/wks/mmtk-core/blob/165b3fab07cc788ad88876effdeefe567697aaf2/src/scheduler/work_bucket.rs#L258
> > The good thing for RefEnqueue is that it never needs to access the MMTk instance. It can be a pure VM-side operation.
>
> It technically accesses the `mmtk` instance (right now at least -- unsure of your proposed changes):
> https://github.com/mmtk/mmtk-core/blob/master/src/util/reference_processor.rs#L556

It's accessing the `mmtk.reference_processor` field. Currently it is a field of the `MMTK` struct. In my branch, it is part of the VM binding instance.
```diff
@@ -223,6 +223,11 @@ impl<VM: VMBinding> GCWork<VM> for EndOfGC {
     fn do_work(&mut self, worker: &mut GCWorker<VM>, mmtk: &'static MMTK<VM>) {
         info!("End of GC");
 
+        // We assume this is the only running work packet that accesses plan at the point of execution
+        #[allow(clippy::cast_ref_to_mut)]
+        let plan_mut: &mut dyn Plan<VM = VM> = unsafe { &mut *(&*mmtk.plan as *const _ as *mut _) };
```
I see this everywhere -- maybe we should just change the `mmtk` variable to `mut MMTK` or something in the work packets. Or am I missing something which means we can't do that?
This is the same pattern as our prepare/release code. This is unfortunate.

mmtk-core/src/scheduler/gc_work.rs, Lines 44 to 46 in 2300ce1:

```rust
#[allow(clippy::cast_ref_to_mut)]
let plan_mut: &mut C::PlanType = unsafe { &mut *(self.plan as *const _ as *mut _) };
plan_mut.prepare(worker.tls);
```

mmtk-core/src/scheduler/gc_work.rs, Lines 113 to 115 in 2300ce1:

```rust
#[allow(clippy::cast_ref_to_mut)]
let plan_mut: &mut C::PlanType = unsafe { &mut *(self.plan as *const _ as *mut _) };
plan_mut.release(worker.tls);
```

We can fix it, but it is not as simple as making it `mut mmtk`. Basically, `mmtk: &'static MMTK<VM>` is an immutable reference, but we want to get a mutable reference from it -- we do so under the assumption that we are the only thread that is accessing `mmtk`. As long as the assumption holds, the code is correct.

Basically, what we want is something like `RefCell` but without the overhead on borrowing. I don't think there is a safe way to do it.
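For illustration, a minimal sketch of such a "RefCell without borrow tracking"; `PlanCell` is a hypothetical name, not an mmtk-core type, and its soundness rests entirely on the scheduler guaranteeing that only one work packet touches the plan while a mutable reference is live.

```rust
use std::cell::UnsafeCell;

/// Hypothetical zero-overhead cell: like RefCell, but with no runtime borrow
/// counting. All of the safety burden moves to the caller/scheduler.
pub struct PlanCell<T>(UnsafeCell<T>);

// Sharing across GC worker threads is only sound because of the external
// "one accessor at a time for mutation" discipline discussed above.
unsafe impl<T: Sync> Sync for PlanCell<T> {}

impl<T> PlanCell<T> {
    pub fn new(value: T) -> Self {
        Self(UnsafeCell::new(value))
    }

    /// Shared access; fine as long as no mutable borrow is live.
    pub fn get(&self) -> &T {
        unsafe { &*self.0.get() }
    }

    /// Exclusive access. Caller must guarantee no other reference (shared or
    /// mutable) exists, e.g. inside the Prepare/Release/EndOfGC packets.
    #[allow(clippy::mut_from_ref)]
    pub unsafe fn get_mut(&self) -> &mut T {
        &mut *self.0.get()
    }
}
```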
Why is it not something like an `Arc`? Multiple work packets may be executing at the same time, all of them accessing the `mmtk` variable. Overhead of reference counting? I guess if 99% of the time it is only read, as in most work packets, and this is the hot loop, ugly hacks like this may be required.
Technically an `RwLock` could help, but I feel like adding a lock in the hottest part of our code may not be ideal.
`Arc` can help manage the lifetime of plan/mmtk. But it comes at the cost of a fat pointer (unknown cost) and reference counting (which should be minimal cost). And it does not solve the issue of mutability.

There are many things that we need to tidy up.
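A small self-contained illustration of that last point (the names below are made up for the example, not mmtk-core types): `Arc` only hands out shared ownership, so mutation still needs a lock such as `RwLock`, which is exactly the overhead being debated.

```rust
use std::sync::{Arc, RwLock};

// Stand-in for the plan; not an mmtk-core type.
struct DemoPlan {
    next_gc_full_heap: bool,
}

fn main() {
    let plan = Arc::new(RwLock::new(DemoPlan { next_gc_full_heap: false }));

    // Many "work packets" can cheaply clone the Arc and take read locks...
    let reader = Arc::clone(&plan);
    assert!(!reader.read().unwrap().next_gc_full_heap);

    // ...but any mutation must go through the write lock, which would sit on
    // the hottest path of the GC if every packet did this.
    plan.write().unwrap().next_gc_full_heap = true;
    assert!(plan.read().unwrap().next_gc_full_heap);
}
```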
Ah right, sorry -- you need a `RefCell` or `Mutex`/`RwLock` to have mutability. I was thinking of `Arc<Mutex<T>>`, which is probably a bad idea.

I guess we can flag this issue for the future. Could you add a comment here (and maybe in the other places as well) saying that this is only safe under the assumption that we're the only executing work packet?

I think it mainly looks good to me -- now that I understand what the issue is (MarkSweep and Immix sweeping in the `Release` bucket).
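A sketch of the kind of comment being requested, shown on the `EndOfGC` cast quoted above; the `SAFETY` wording here is illustrative, not the exact text that ended up in the PR.

```rust
// Illustrative only: the same cast as in EndOfGC::do_work, with the assumption
// spelled out where the unsafe cast happens.
// SAFETY: EndOfGC is scheduled after all other buckets have drained, so at this
// point no other work packet (and no mutator) can read or write the plan.
#[allow(clippy::cast_ref_to_mut)]
let plan_mut: &mut dyn Plan<VM = VM> =
    unsafe { &mut *(&*mmtk.plan as *const _ as *mut _) };
```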
LGTM. Just make sure the spelling issues @k-sareen pointed out are solved.

I still think we should discuss the semantics of "Release" as a group before we merge this PR.
Just noticed the commit message says `next_gc_full_size` in `end_of_gc` instead of `next_gc_full_heap` in `end_of_gc`.
I will just do a squash merge, and that commit message will not be seen in the commit history.
This PR adds `Plan.end_of_gc()`. Our generational plans compute whether we should do a full-heap GC at the end of `Plan.release()`, which is incorrect: there could be other parallel release work when we execute `Plan.release()`. With this PR, we compute `next_gc_full_heap` in `end_of_gc()` instead.
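For readers skimming the thread, a minimal and purely hypothetical sketch of the shape this describes (the type and method names below are made up, not the actual generational plan code): the full-heap decision moves out of `release()` and into `end_of_gc()`, which the `EndOfGC` work packet calls after every Release-bucket packet has finished.

```rust
// Hypothetical illustration; not mmtk-core's GenCopy/GenImmix code.
struct HypotheticalGenPlan {
    next_gc_full_heap: bool,
    // ... nursery/mature spaces, statistics, etc.
}

impl HypotheticalGenPlan {
    fn release(&mut self) {
        // Release per-space resources only. Other Release-bucket packets
        // (e.g. sweeping) may still be running, so no global decision is
        // taken here any more.
    }

    fn end_of_gc(&mut self) {
        // All GC work has finished, so survivor/usage statistics are final
        // and it is safe to decide the scope of the next collection.
        self.next_gc_full_heap = self.should_next_gc_be_full_heap();
    }

    fn should_next_gc_be_full_heap(&self) -> bool {
        // e.g. compare available pages against an estimate of nursery growth.
        false
    }
}
```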