
Generational plans set next_gc_full_heap in end_of_gc #714

Merged (4 commits) on Jan 2, 2023

Conversation

qinsoon (Member) commented Dec 19, 2022

This PR adds Plan.end_of_gc(). Our generational plans currently decide whether the next GC should be a full-heap GC at the end of Plan.release(), which is incorrect: other release work may still be running in parallel while Plan.release() executes. With this PR, next_gc_full_heap is computed in end_of_gc() instead.
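
A rough sketch of the shape of this change (stand-in types and a made-up full-heap heuristic only; the real mmtk-core trait and the generational plans' full-heap predicate differ in detail):

// Sketch only: simplified stand-in types, not the actual mmtk-core definitions.
struct VMWorkerThread; // placeholder for the GC worker thread handle

trait Plan {
    /// Runs in the Release work bucket; other release/sweep packets may run in parallel.
    fn release(&mut self, tls: VMWorkerThread);
    /// New hook: runs once per GC, after all GC work (including sweeping) has finished.
    /// The default is a no-op; generational plans override it.
    fn end_of_gc(&mut self, _tls: VMWorkerThread) {}
}

struct GenPlanState {
    next_gc_full_heap: bool,
    virtual_memory_exhausted: bool, // made-up stand-in for the real heuristic
}

impl Plan for GenPlanState {
    fn release(&mut self, _tls: VMWorkerThread) {
        // Release spaces here, but do NOT decide the next GC kind yet:
        // parallel release/sweep packets may still be updating the state we read.
    }
    fn end_of_gc(&mut self, _tls: VMWorkerThread) {
        // All GC work is done, so the statistics read here are stable.
        self.next_gc_full_heap = self.virtual_memory_exhausted;
    }
}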

qinsoon requested review from k-sareen and wks on December 19, 2022 04:38
qinsoon marked this pull request as ready for review on December 19, 2022 04:38

k-sareen (Collaborator) left a comment:

I'm just a bit confused as to why this change is necessary. What is the "other parallel release work" that could run while we execute Plan.release()? From the scheduler code, it seems like reference processing is done in the Release work bucket as well -- is that the issue? If so, could you document it somewhere?

src/plan/global.rs (Outdated)
@@ -228,10 +228,15 @@ pub trait Plan: 'static + Sync + Downcast {
     /// per-worker preparation. This method is invoked once per worker by the worker thread passed as the argument.
     fn prepare_worker(&self, _worker: &mut GCWorker<Self::VM>) {}
 
-    /// Release the plan after a GC. This is invoked at the end of a GC when most GC work is finished.
+    /// Release the plan after transitive closure. A plan can implement this method to call each policy's release,
+    /// or create any work packet that should be done in release.
+    /// This is invoked once per GC by one worker thread. 'tls' is the worker thread that executes this method.
     fn release(&mut self, tls: VMWorkerThread);

Collaborator
Personally, I find the "release" terminology a bit confusing then. What are the semantics of "release" supposed to be, exactly? Most (if not all?) plans just release pages from spaces. Is the issue that plan.release() is called at the start of the Release work bucket instead of the end? The comment for the Release work packet implies that there should only ever be one [1] (and only one is ever created, in the scheduler_common_work function). What is the source of the issue you have encountered? Reference processing?

[1]:

/// The global GC release Work
/// This work packet invokes release() for the plan (which will invoke release() for each space), and
/// pushes work packets for releasing mutators and collectors.
/// We should only have one such work packet per GC, after all actual GC work ends.
/// We assume this work packet is the only running work packet that accesses plan, and there should
/// be no other concurrent work packet that accesses plan (read or write). Otherwise, there may
/// be a race condition.

k-sareen (Collaborator) commented Dec 19, 2022

Ah I see -- it seems like MarkSweep and Immix use the Release work bucket for sweeping [1, 2]. My question is then: why not have a separate Sweep work bucket, like we have with the Compact work bucket [3]?

[1]: https://github.com/mmtk/mmtk-core/blob/master/src/policy/marksweepspace/malloc_ms/global.rs#L472-L501
[2]: https://github.com/mmtk/mmtk-core/blob/master/src/policy/immix/immixspace.rs#L316-L340
[3]:

Member Author

Yeah, right. That is the motivation. The problem is that some plans generate work in the Release bucket, and we need a hook that runs after all of those work packets are done.

What you pointed out reminds me that our current code is actually incorrect. There is a race condition: we run the Release work packet (which calls Plan.release() through a mutable reference to the plan) in parallel with the sweeping work packets (which can also access the plan).

Collaborator

The comment for the Release work packet above seems to imply that it should not be executing with anything else, and that "all the GC work" should have been finished before the Release work packet is called. If MS and Immix are performing sweeping in Release, then all the GC work hasn't really been finished, right? Either we should fix the comment for the Release work packet, or make a new work bucket called Sweep where MS and Immix generate their sweep work packets (in the latter case, the issue of "parallel release work" ceases to exist, unless reference processing is still an issue).

Collaborator

I think that "fixing" the comment is not actually a fix. Yes -- there is a race condition. The Sweep work bucket is the more sensible solution (unless you have a better proposal).

wks (Collaborator) commented Dec 19, 2022

> In the Prepare and Release work packet, we assume we are the only work that is being executed, thus it is safe for us to get a mutable reference of plan. If we have any other work packet, they can access mmtk in their do_work. And that is racy.

That makes sense.

 #[allow(clippy::cast_ref_to_mut)] 
 let plan_mut: &mut C::PlanType = unsafe { &mut *(self.plan as *const _ as *mut _) }; 
 plan_mut.prepare(worker.tls); 

I think this is the same issue.

It doesn't look like an issue that can be solved in one or two days. To ensure only one thread has access to the plan, we must stop spreading &'static Plan (and &'static MMTk, too) everywhere, and have precise control over who holds references to Plan (and MMTk as well). A key to the solution will be the ability to "revoke" all such references during STW. In fact, this is what STW is really meant to do, i.e. stop mutators from accessing the heap, as a form of synchronisation.

The good thing for RefEnqueue is that it never needs to access the MMTk instance. It can be a pure VM-side operation.

k-sareen (Collaborator) commented Dec 19, 2022

Yes, I agree. Though with regard to this PR (or more specifically, the issue this PR is trying to address), I think the most pragmatic approach is to add a new Sweep or Collect work bucket and make it so that MS and Immix generate their sweeping work packets into this bucket. In that case we don't really need this PR, as all the GC work will have been done before should_next_gc_full_heap() is called.

It would also be helpful to clarify exactly what the semantics of the "Release" work bucket are, i.e. should we send the list of processed references to the binding in the "Release" work bucket, in the "Sweep"/"Collect" work bucket, or even in the "EndOfGC" work packet.
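
(For concreteness, the proposal amounts to inserting a stage into the bucket ordering, roughly as sketched below. This is hypothetical and heavily simplified; mmtk-core's actual WorkBucketStage enum has more stages and no Sweep stage at the time of this discussion.)

// Hypothetical, simplified bucket ordering -- not the real mmtk-core enum.
#[allow(dead_code)]
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
enum WorkBucketStage {
    Prepare,
    Closure,
    Release, // only the single Release packet that mutates the plan
    Sweep,   // proposed: MarkSweep/Immix sweeping packets go here instead of Release
    Final,   // EndOfGC: compute next_gc_full_heap, resume mutators
}

fn main() {
    // Buckets open in declaration order, so Sweep work can no longer overlap with Release.
    assert!(WorkBucketStage::Release < WorkBucketStage::Sweep);
    assert!(WorkBucketStage::Sweep < WorkBucketStage::Final);
}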

Collaborator

> The good thing for RefEnqueue is that it never needs to access the MMTk instance. It can be a pure VM-side operation.

It technically accesses the mmtk instance (right now at least -- unsure of your proposed changes):

https://github.com/mmtk/mmtk-core/blob/master/src/util/reference_processor.rs#L556

Collaborator

> It would also be helpful to clarify exactly what the semantics of the "Release" work bucket are, i.e. should we send the list of processed references to the binding in the "Release" work bucket, in the "Sweep"/"Collect" work bucket, or even in the "EndOfGC" work packet.

I have made an attempt to do this in my branch (see below), but it is not as precise as you described.
https://github.com/wks/mmtk-core/blob/165b3fab07cc788ad88876effdeefe567697aaf2/src/scheduler/work_bucket.rs#L258

Collaborator

>> The good thing for RefEnqueue is that it never needs to access the MMTk instance. It can be a pure VM-side operation.

> It technically accesses the mmtk instance (right now at least -- unsure of your proposed changes):
> https://github.com/mmtk/mmtk-core/blob/master/src/util/reference_processor.rs#L556

It's accessing the mmtk.reference_processor field. Currently it is a field of the MMTK struct. In my branch, it is part of the VM binding instance.

@@ -223,6 +223,11 @@ impl<VM: VMBinding> GCWork<VM> for EndOfGC {
     fn do_work(&mut self, worker: &mut GCWorker<VM>, mmtk: &'static MMTK<VM>) {
         info!("End of GC");
 
+        // We assume this is the only running work packet that accesses plan at the point of execution
+        #[allow(clippy::cast_ref_to_mut)]
+        let plan_mut: &mut dyn Plan<VM = VM> = unsafe { &mut *(&*mmtk.plan as *const _ as *mut _) };

k-sareen (Collaborator) commented Dec 19, 2022

I see this everywhere -- maybe we should just change the mmtk variable to &mut MMTK or something in the work packets. Or am I missing something that means we can't do that?

Member Author

This is the same pattern as our prepare/release code. This is unfortunate.

#[allow(clippy::cast_ref_to_mut)]
let plan_mut: &mut C::PlanType = unsafe { &mut *(self.plan as *const _ as *mut _) };
plan_mut.prepare(worker.tls);

#[allow(clippy::cast_ref_to_mut)]
let plan_mut: &mut C::PlanType = unsafe { &mut *(self.plan as *const _ as *mut _) };
plan_mut.release(worker.tls);

We can fix it, but it is not as simple as making it mut mmtk. Basically, mmtk: &'static MMTK<VM> is an immutable reference, but we want to get a mutable reference from it -- we do so under the assumption that we are the only thread accessing mmtk at that point. As long as the assumption holds, the code is correct.

Basically what we want is something like RefCell, but without the overhead of borrow tracking. I don't think there is a safe way to do that.
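
(As an illustration of that idea, a hypothetical wrapper -- not an existing mmtk-core type -- could look like the sketch below. It merely moves the unsafe cast behind a named API; soundness still rests entirely on the "only one work packet touches the plan" invariant.)

use std::cell::UnsafeCell;

// Hypothetical helper, not part of mmtk-core: interior mutability with no
// runtime borrow tracking, so the exclusivity assumption is the caller's burden.
struct ExclusiveCell<T>(UnsafeCell<T>);

// Sharing this across GC workers is only sound if the documented invariant holds;
// this encodes the same promise as the cast_ref_to_mut pattern above.
unsafe impl<T: Send + Sync> Sync for ExclusiveCell<T> {}

impl<T> ExclusiveCell<T> {
    fn new(value: T) -> Self {
        Self(UnsafeCell::new(value))
    }
    fn get(&self) -> &T {
        unsafe { &*self.0.get() }
    }
    /// Safety: the caller must guarantee that no other reference to the contents
    /// is alive for the returned borrow's lifetime, e.g. "this is the only
    /// running work packet that accesses the plan".
    unsafe fn get_mut_unchecked(&self) -> &mut T {
        &mut *self.0.get()
    }
}

fn main() {
    let plan_state = ExclusiveCell::new(0u32);
    // A single responsible packet (e.g. Prepare/Release/EndOfGC) mutates the state:
    unsafe { *plan_state.get_mut_unchecked() += 1 };
    assert_eq!(*plan_state.get(), 1);
}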

Collaborator

Why is it not something like an Arc? Multiple work packets can be executing at the same time, all of them accessing the mmtk variable. Is it the overhead of reference counting? I guess if 99% of the time it is only read, as in most work packets, and this is the hot loop, ugly hacks like this may be required.

Collaborator

Technically a RWLock could help, but I feel like adding a lock in the hottest part of our code may not be ideal.

Member Author

Arc can help manage the lifetime of the plan/mmtk, but it comes at the cost of a fat pointer (unknown cost) and reference counting (which should be a minimal cost). And it does not solve the issue of mutability.

There are many things that we need to tidy up.

k-sareen (Collaborator) commented Dec 19, 2022

Ah right, sorry -- you need a RefCell or Mutex/RwLock to have mutability. I was thinking of Arc<Mutex<T>>, which is probably a bad idea.

I guess we can flag this issue for the future. Could you add a comment here (and maybe in the other places as well) saying that this is only safe under the assumption that we're the only executing work packet?

k-sareen (Collaborator)

I think it mainly looks good to me -- now that I understand what the issue is (MarkSweep and Immix sweeping in the Release work bucket). But it would be good to document it, if possible. Thanks!

wks (Collaborator) left a comment:

LGTM. Just make sure the spelling issues @k-sareen pointed out are solved.

k-sareen (Collaborator)

I still think we should discuss the semantics of "Release" as a group before we merge this PR.

k-sareen (Collaborator)

Just noticed the commit message says next_gc_full_size instead of next_gc_full_heap -- maybe force push an amended commit when you update some of the documentation as mentioned above ^

k-sareen changed the title from "Generational plans set next_gc_full_size in end_of_gc" to "Generational plans set next_gc_full_heap in end_of_gc" on Dec 21, 2022

qinsoon (Member, Author) commented Dec 21, 2022

> Just noticed the commit message says next_gc_full_size instead of next_gc_full_heap -- maybe force push an amended commit when you update some of the documentation as mentioned above ^

I will just do a squash merge, and that commit message will not be seen in the commit history.

qinsoon merged commit 06237f1 into mmtk:master on Jan 2, 2023
wenyuzhao pushed a commit to wenyuzhao/mmtk-core that referenced this pull request Mar 20, 2023
This PR adds `Plan.end_of_gc()`. Our generational plans compute whether we should do full heap GC at the end of `Plan.release()` which is incorrect. There could be other parallel release work when we execute `Plan.release()`. This PR adds `Plan.end_of_gc()`, and we compute `next_gc_full_heap` in `end_of_gc()`.