Let MarkSweepSpace use BlockPageResource #1150
Conversation
The time taken to execute […]

The time spent flushing the […]. When the heap is bigger, the time to flush will increase. The following timeline is from the PMD benchmark (600M heap size). The time to flush […]
Let MarkSweepSpace use BlockPageResource to eliminate the lock contention when releasing pages in parallel.
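As a rough illustration of the contention being removed, here is a minimal sketch (hypothetical types, not mmtk-core's API): when every GC worker releases pages through one shared mutex, the Release stage serializes on that lock; buffering per worker and flushing once per GC avoids taking the lock on the hot path.

```rust
use std::sync::Mutex;

// Hypothetical stand-in for a free-list page resource protected by one mutex.
struct SharedFreeList {
    freed_pages: Vec<usize>, // page indices returned to the free list
}

// Contended path: every release call takes the same lock, so N workers
// releasing blocks in parallel serialize on it.
fn release_pages_contended(sync: &Mutex<SharedFreeList>, page: usize) {
    let mut fl = sync.lock().unwrap();
    fl.freed_pages.push(page);
}

// Buffered path: each worker pushes into its own queue without locking,
// and the queues are merged into the shared free list once per GC.
struct WorkerQueue {
    local: Vec<usize>,
}

impl WorkerQueue {
    fn release_pages(&mut self, page: usize) {
        self.local.push(page); // no lock taken on the hot path
    }

    fn flush(&mut self, sync: &Mutex<SharedFreeList>) {
        let mut fl = sync.lock().unwrap(); // lock taken once per flush
        fl.freed_pages.append(&mut self.local);
    }
}
```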
Why can't we use a sentinel to flush it at the end?
There are two difficulties.

(1) To know if it is at the end, we need to know the number of units of work beforehand. In […]:

```rust
fn generate_sweep_tasks(&self) -> Vec<Box<dyn GCWork<VM>>> {
    self.defrag.mark_histograms.lock().clear();
    // # Safety: ImmixSpace reference is always valid within this collection cycle.
    let space = unsafe { &*(self as *const Self) };
    let epilogue = Arc::new(FlushPageResource {
        space,
        counter: AtomicUsize::new(0),
    });
    let tasks = self.chunk_map.generate_tasks(|chunk| {
        Box::new(SweepChunk {
            space,
            chunk,
            epilogue: epilogue.clone(),
        })
    });
    epilogue.counter.store(tasks.len(), Ordering::SeqCst);
    tasks
}
```

In […], we can't let some threads (such as workers executing […]) […].

(2) In order to use the sentinel, we need to inject an […]. So the easiest way to do it is to do it in […]. Alternatively, we can make a "sentinel" that is executed when the […].
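For illustration, here is a hedged sketch of the epilogue-counter pattern used above: the counter is preset to the number of work packets, each packet decrements it when it finishes, and the packet that brings it to zero performs the flush. The `FlushEpilogue` name and the `finish_one_work_packet` signature below are illustrative, not mmtk-core's exact code.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Shared epilogue: `counter` is set to the number of SweepChunk packets
// before they are scheduled.
struct FlushEpilogue {
    counter: AtomicUsize,
}

impl FlushEpilogue {
    // Called by each work packet when it completes.
    fn finish_one_work_packet(&self, flush: impl FnOnce()) {
        // fetch_sub returns the previous value, so the packet that
        // observes 1 is the last one to finish and performs the flush.
        if self.counter.fetch_sub(1, Ordering::SeqCst) == 1 {
            flush();
        }
    }
}
```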
Performance evaluation. Three builds: […]

I added the […]. I ran […].

Here are the line plots (plotty link). Each line is a build-plan tuple. The x axis is the heap factor (N/1000 min heap size). The "geomean" near 0 should be "2428", and that's probably a bug in plotty.

We can see a tiny amount of difference between the builds for MarkSweep, but almost no difference for Immix. The number of GCs overflowed and got stuck at 2048 at the smallest heap factor. The […]

Here is a zoomed-in line plot for […], and the histogram of […].

When the heap is small, the noise is large; but as the heap grows larger, this PR has smaller and smaller STW time. That's because the larger the heap is, the more blocks there are to sweep, and the more blocks there are to release. The lock contention problem with […]

FYI, the STW time when running Immix shows no differences beyond noise between the three builds. Here is a histogram (plotty link).

I'll try running the […]
The improvement looks marginal. I wonder if it is because you were using only 8 GC threads and there wasn't much contention in the release phase. Did you check the pathological case you found in #1145? Does this PR improve it? Nonetheless, I think switching to […]
It was 16 GC threads. I used […].

The pathological case in #1145 was using a 100M heap size, so the difference is greater in that run.

master, 72M (ran on my laptop): […]

master, 100M (you can see the time to release mutators became much longer than at 72M): […]

And yes, this PR improves the speed of releasing mutators at 100M, too.
The […]

Line plots: (plotty link)

Histogram (MarkSweep-only), normalized to […]

The result isn't as exciting as I expected. At 6x min heap size, this PR only has a 6%-7% STW time reduction. It may be because […]

We observe a small increase in STW time from 1.8x to 3x min heap sizes, as a result of an increased number of GCs. It may be a result of […]:

```rust
pub fn flush_all(&self) {
    self.block_queue.flush_all()
    // TODO: For 32-bit space, we may want to free some contiguous chunks.
}
```
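For context, here is a hedged sketch of what flushing the block queue conceptually does (the real `BlockQueue` in mmtk-core is lock-free and generic over the block type; the types below are made up for illustration): each worker accumulates released blocks in a local buffer, and `flush_all` drains every buffer into the shared pool so the blocks become allocatable again.

```rust
// Hypothetical, simplified stand-in for a per-worker block queue.
struct BlockQueueSketch {
    worker_local: Vec<Vec<usize>>, // one buffer of block addresses per GC worker
    shared_pool: Vec<usize>,       // globally visible reusable blocks
}

impl BlockQueueSketch {
    fn flush_all(&mut self) {
        for local in &mut self.worker_local {
            // Move rather than copy, so the local buffers end up empty.
            self.shared_pool.append(local);
        }
    }
}
```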
The code looks good. We shall wait for Kunshan to run another round of evaluation for the PR to include all the benchmarks.
I ran many other benchmarks in DaCapo Chopin (excluding some that are known to produce inconsistent results as we increase the number of iterations, likely due to the lack of reference processing). I ran with two builds ([…]). There are too many plots, so I only post Plotty links.
Most benchmarks have reduced STW time with this PR. The benchmark that has the highest STW time reduction is graphchi, with more than 60% reduction at 4.392x min heap size (w.r.t. G1 in OpenJDK). The eBPF timeline shows that graphchi spent a lot of time releasing mutators, likely because it simply has too many blocks to release in each GC, and the effect is amplified by the lock contention in […]. The time to compute the transitive closure is also reduced, which is likely a result of BlockPageResource reusing existing blocks, but I am not certain about it.

There are some benchmarks that have increased STW time instead: cassandra, fop, jython and spring. Their error bars are large, and I cannot reproduce their slowdown on my laptop. The number of GCs is small for both cassandra and fop (see this plotty link for the number of GCs), and the GCs concentrate near the end of the benchmark execution. Such benchmarks are more susceptible to random effects. For fop, some GCs may spend 50% of the time tracing descendants of finalizable objects, so the variation of STW time is more related to when and how much finalization happened than to the time spent executing ReleaseMutator. See the timeline below: […]

The effect on total time is observable, but not that dramatic.

Overall, I think this PR is effective in reducing the time of ReleaseMutator, especially in the pathological cases where the heap is large and the number of blocks to be released is high.
The performance looks good. We see up to 10% improvement in geomean STW time across all the measured benchmarks at 3x/3.6x/4.3x heap sizes, and about 1% improvement in total time. You can merge the PR at any time.
Let `MarkSweepSpace` use the `BlockPageResource` instead of the raw `FreeListPageResource`. This solves a problem where multiple GC workers contend for the mutex `FreeListPageResource::sync` when calling `FreeListPageResource::release_pages` concurrently during the `Release` stage.

We flush the `BlockPageResource` in `MarkSweepSpace::end_of_gc`. To monitor the performance, we added a pair of USDT trace points before and after calling `Plan.end_of_gc`. We also count the invocation of `end_of_gc` into the GC time.

We changed `BlockQueue` so that it uses `MaybeUninit` to hold uninitialized elements instead of relying on `B::from_aligned_address(Address::ZERO)`, because the `Block` type used by `MarkSweepSpace` is based on `NonZeroUsize` and cannot be zero.

Fixes: #1145
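The `MaybeUninit` change can be illustrated with a small, self-contained sketch (the real `BlockQueue` in mmtk-core is lock-free and generic over the block type; `SlotBuffer` and its capacity below are made up for illustration): a fixed-capacity buffer keeps its unused slots uninitialized instead of filling them with a zero value, which would be invalid for a `NonZeroUsize`-based block type.

```rust
use std::mem::MaybeUninit;
use std::num::NonZeroUsize;

// Stand-in for a Block type wrapping a non-zero address, so "zero" cannot be
// used as a placeholder for an empty slot.
#[derive(Clone, Copy)]
struct Block(NonZeroUsize);

const CAPACITY: usize = 8;

struct SlotBuffer {
    // Unused slots stay uninitialized; only slots below `len` are valid.
    slots: [MaybeUninit<Block>; CAPACITY],
    len: usize,
}

impl SlotBuffer {
    fn new() -> Self {
        Self {
            // An array of MaybeUninit needs no initialization itself, so
            // `uninit().assume_init()` is sound here.
            slots: unsafe { MaybeUninit::uninit().assume_init() },
            len: 0,
        }
    }

    fn push(&mut self, b: Block) -> Result<(), Block> {
        if self.len == CAPACITY {
            return Err(b); // buffer full; caller must handle the overflow
        }
        self.slots[self.len].write(b);
        self.len += 1;
        Ok(())
    }

    fn pop(&mut self) -> Option<Block> {
        if self.len == 0 {
            return None;
        }
        self.len -= 1;
        // Safety: every slot below the old `len` was written by `push`.
        Some(unsafe { self.slots[self.len].assume_init() })
    }
}
```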