Remove fl_map and refactor FreeListPageResource #953
Conversation
`ScanObjectsWork::make_another` is unused. It was used when `ScanObjectsWork` was used for scanning node roots. Now that node roots are scanned by the dedicated `ProcessRootNode` work packet, we can remove it. `WorkCounter::get_base_mut` is never used. All derived counters use `merge_val` to update all fields at the same time. We use `Box::as_ref()` to get the reference to its underlying element. This fixes a compilation error related to `CommonFreeListPageResource`. But we should eventually remove `CommonFreeListPageResource` completely, as it is a workaround for mimicking the legacy design from JikesRVM that allows `VMMap` to enumerate and patch existing `FreeListPageResource` instances by registering them in a global list, which is not idiomatic in Rust. See #934 and #953.
I ran lusearch from DaCapo Chopin on vole.moma, 3x min heap size w.r.t. G1, 40 invocations, 5 iterations each, comparing master against this PR. To my surprise, the STW time increased for all plans except MarkSweep. I thought the increased range of the mutex in `FreeListPageResource::release_pages` might be the cause.
```rust
}
common_flpr
let free_list = vm_map.create_parent_freelist(start, pages, PAGES_IN_REGION as _);
let actual_start = if let Some(rmfl) = free_list.downcast_ref::<RawMemoryFreeList>() {
```
We could let `dyn FreeList` return a value based on the implementation rather than downcasting the type here. The `FreeList` implementation is internal to the `VMMap`, and we should not break the abstraction and let `PageResource` do things based on the `FreeList` implementation.
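For concreteness, here is a minimal, self-contained sketch of that suggestion (stand-in types, not MMTk's real API, and this alternative was ultimately not the one adopted): a default method on the trait reports the displacement, and `RawMemoryFreeList` overrides it.

```rust
// Stand-in types to make the sketch compile; not MMTk's real definitions.
trait FreeList {
    // Bytes this free list occupies at the start of the space it manages.
    // Defaults to zero; only RawMemoryFreeList would override it.
    fn space_displacement(&self) -> usize {
        0
    }
}

struct IntArrayFreeList;
impl FreeList for IntArrayFreeList {}

struct RawMemoryFreeList {
    list_bytes: usize,
}
impl FreeList for RawMemoryFreeList {
    fn space_displacement(&self) -> usize {
        self.list_bytes
    }
}

fn main() {
    let lists: Vec<Box<dyn FreeList>> = vec![
        Box::new(IntArrayFreeList),
        Box::new(RawMemoryFreeList { list_bytes: 4096 }),
    ];
    for fl in &lists {
        // The caller asks the trait object directly instead of downcasting.
        println!("displacement = {}", fl.space_displacement());
    }
}
```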
I don't think `FreeList` is an implementation detail of `VMMap`. It is a very general free list abstraction, and has two implementations, namely `IntArrayFreeList` and `RawMemoryFreeList`. In this sense, I don't think trait `FreeList` should contain a method that returns the number of bytes it occupies in the space, because a `FreeList` is not necessarily used by a space. It can be used to allocate any kind of resource. But it is true that `VMMap` selects the implementation. It is `VMMap::create_freelist` that decides whether to reserve a proportion of memory at the beginning of the space for the `RawMemoryFreeList`. I think it is natural for `VMMap::create_freelist` to return the number of bytes reserved for the `FreeList` implementation alongside the `dyn FreeList`, i.e. it should return a tuple.
In my previous version of the PR, I introduced the `CreateFreeListResult` type:

```rust
pub struct CreateFreeListResult {
    pub free_list: Box<dyn FreeList>,
    pub space_displacement: usize,
}
```
I think I can move it to this PR, too. If that's too verbose, I can replace it with a simple tuple instead.
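As a rough illustration of how the caller side could look with that type (stand-in types and a made-up `create_freelist` signature, not the actual PR code):

```rust
// Stand-in types; only the shape of the API matters here.
trait FreeList {}
struct IntArrayFreeList;
impl FreeList for IntArrayFreeList {}

pub struct CreateFreeListResult {
    pub free_list: Box<dyn FreeList>,
    pub space_displacement: usize,
}

// Stand-in for VMMap::create_freelist: the VMMap implementation decides both
// the concrete free list and the displacement (0 for IntArrayFreeList,
// non-zero when a RawMemoryFreeList sits at the start of the space).
fn create_freelist(_start: usize) -> CreateFreeListResult {
    CreateFreeListResult {
        free_list: Box::new(IntArrayFreeList),
        space_displacement: 0,
    }
}

fn main() {
    let start: usize = 0x1000_0000;
    let result = create_freelist(start);
    // The page resource applies the displacement without knowing which
    // FreeList implementation it received, so no downcast is needed.
    let actual_start = start + result.space_displacement;
    assert_eq!(actual_start, 0x1000_0000);
}
```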
Returning an offset from the create functions sounds good.
I made `create_freelist` return `CreateFreeListResult` so that `FreeListPageResource` no longer performs the downcast.
```rust
let start = vm_layout().available_start();
let free_list = vm_map.create_freelist(start);
debug_assert!(
    free_list.downcast_ref::<RawMemoryFreeList>().is_none(),
```
Same here. We could let `dyn FreeList` tell us whether discontiguous spaces are allowed.
Like I said, I don't think `FreeList` should be aware of whether it is used by one kind of space or another. Our current implementation happens to disallow discontiguous spaces when using `Map64`, and `Map64` happens to select `RawMemoryFreeList`. I think we can remove this assertion because it's only the status quo, and there is no reason why we can't use `RawMemoryFreeList` in discontiguous spaces.
On second thought, I think we should leave the assertion as is, because it is an implementation detail of the current `FreeListPageResource`. The way we deal with discontiguous spaces is that the `start` variable here is meaningless. But when the address range of the discontiguous space is determined, we re-assign the `address` of all discontiguous page resources. (That's the callback you saw in `mmtk.rs`.) If `start` needs an offset to make room for the `RawMemoryFreeList`, it will be overwritten. We could adapt to that possibility by requiring `start` to be initially zero (when using `IntArrayFreeList`) or the offset (when using `RawMemoryFreeList`) for discontiguous spaces. When the discontiguous range is determined, we would add the starting address of the discontiguous range to the `start` variable instead of overwriting it. But we have no way to test that because we never have that use case, and I already proposed to revamp the way we create spaces (see here and here). So we can leave this assertion here to reflect the status quo. `downcast_ref` is not as elegant as virtual methods from OOP's point of view, but we are just inspecting an implementation detail, and it is in an assertion. I'll add a comment to clarify that this is only the status quo, not something that has to be true in general.
I added comments to this assertion.
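To illustrate the conclusion, here is a minimal compilable sketch of such a commented assertion (stand-in types; the comment wording is illustrative, not the actual comment added in the PR, and the real code downcasts the trait object directly):

```rust
use std::any::Any;

// Stand-ins so the sketch compiles on its own.
trait FreeList: Any {
    fn as_any(&self) -> &dyn Any;
}
struct IntArrayFreeList;
impl FreeList for IntArrayFreeList {
    fn as_any(&self) -> &dyn Any {
        self
    }
}
struct RawMemoryFreeList;
impl FreeList for RawMemoryFreeList {
    fn as_any(&self) -> &dyn Any {
        self
    }
}

fn main() {
    let free_list: Box<dyn FreeList> = Box::new(IntArrayFreeList);
    // Status quo, not a general truth: currently only Map64 selects
    // RawMemoryFreeList, and Map64 does not support discontiguous spaces,
    // so a discontiguous space is never backed by a RawMemoryFreeList today.
    debug_assert!(
        free_list.as_any().downcast_ref::<RawMemoryFreeList>().is_none(),
        "discontiguous spaces currently never use RawMemoryFreeList"
    );
}
```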
```rust
let maybe_rmfl = sync.free_list.downcast_mut::<RawMemoryFreeList>();
let region = self
    .common
    .grow_discontiguous_space(space_descriptor, required_chunks, maybe_rmfl);
```
We could use `Option<&mut dyn FreeList>` instead of `Option<&mut RawMemoryFreeList>` for `grow_discontiguous_space` and `allocate_contiguous_chunks`. In that case, the downcast happens in `VMMap`.
I made the change as you suggested. It makes sense because if the page resource uses a free list and `Map64` can grow the space, it always needs to grow the free list anyway. It just requires `Map64` to choose a growable `FreeList` implementation.
The slow-down can be easily reproduced on my personal computer. Just compile MMTk and OpenJDK as usual, and we can observe the performance difference in a microbenchmark that triggers GC many times. The performance difference is due to some code change that subtly changed the inlining decisions of the Rust compiler. The slowdown is only reproducible with certain compiler settings; compile both versions using PGO and the difference will go away. I'll run the benchmarks again with PGO after addressing the review comments.
You can mark the function as `#[inline]`.
But that goes against our policy of not interfering with the compiler's inlining decisions and relying on PGO to make more informed decisions.
If a `VMMap` implementation chooses `RawMemoryFreeList`, it will report how many bytes the `RawMemoryFreeList` occupies at the start of the `Space`.
The policy is not to never add them, but to add them judiciously and for good reasons, especially if they make the PGO build itself faster. PGO is not always perfect.
To be clear, when I observed the performance difference, I was not using PGO. I always worried that PGO might be another source of non-determinism. But this time, PGO worked properly and eliminated the performance difference.
LGTM
I am running the benchmark again on vole.moma, with PGO enabled.
I re-ran lusearch from DaCapo Chopin on vole.moma: 40 invocations, 5 iterations. This time I compiled mmtk-openjdk with PGO (training with fop), at 3x min heap size w.r.t. G1 (plotty) and 4.4x min heap size w.r.t. G1 (plotty). The total time and STW time still get worse. The difference is bigger with the larger heap size. Although the confidence interval is large, the difference is still obvious.
Does this PR affect performance for StickyImmix?
I ran it again. This time I trained the PGO using only StickyImmix, without stress GC, and re-ran lusearch with StickyImmix at 4.4x min heap size. (plotty) From the result, this PR is faster than the master branch. My conclusion is that this is extremely sensitive to the profile. Previously, I did PGO using MarkSweep, SemiSpace, MarkCompact, Immix, GenImmix, and StickyImmix, and set a stress factor of 4M. Maybe that made the Rust compiler think some functions on the hot paths are not worth inlining, or maybe the stress GC made it miss some hot paths not executed during non-stress GC. I also tried locally and found that when I set the optimization level to 2, it failed to inline some functions and caused a slowdown. This also shows the sensitivity to optimization.
I don't think we can say it is faster. The difference is so tiny that it could be just noise. We should probably run all the benchmarks for the plans that may be affected by this PR.
Yes. That's probably just noise. This PR changes the page resource and the VMMap. The difference only manifests when the PageResource acquires chunks from the VMMap, so there should be virtually no visible performance difference because it only affects the slow path of the slow paths. To be safe, I'll run other benchmarks with StickyImmix, using the binary I compiled with PGO trained on StickyImmix. If the difference is just caused by optimization, the other benchmarks should perform the same on master and on this PR.
The same binary as the last one, the same settings, except running other benchmarks, too. (I omitted some benchmarks as before since their times increase as the number of iterations goes up, probably due to the lack of reference processing.) Results: (plotty) The means are between -1.5% and +0.6% of master, but the error bars are larger than the difference, which means the performance difference is insignificant. The STW time of graphchi is reduced from 57 ms to 54 ms (see the un-normalized version), a 5% reduction, and the error bars are narrow. The same is true for kafka.
@wks You can merge the PR when you think it is ready.
This PR removes the `shared_fl_map` field from `Map32`, and removes the `fl_map` and `fl_page_resource` fields from `Map64`. These global maps were used to enumerate and update the starting address of spaces (stored in `CommonFreeListPageResource` instances), for several different purposes.

- `Map32` can only decide the starting address of discontiguous spaces after all contiguous spaces are placed, therefore it needs to modify the start address of all discontiguous `FreeListPageResource` instances. `shared_fl_map` was used to locate `CommonFreeListPageResource` instances by the space ordinal.
  - Now we use `Plan::for_each_space_mut` to enumerate spaces, and use the newly added `Space::maybe_get_page_resource_mut` method to get the page resource (if it has one) in order to update their starting addresses. (See the sketch after this list.)
- `Map64` allocates a `RawMemoryFreeList` at the beginning of each space that uses `FreeListPageResource`. It needs to update the start address of each space to take the address range occupied by the `RawMemoryFreeList` into account. `fl_page_resource` was used to locate `CommonFreeListPageResource` instances by the space ordinal, and `fl_map` was used to locate the underlying `RawMemoryFreeList` instances of the corresponding `FreeListPageResource` by the space ordinal.
  - Now we compute the start address of the `FreeListPageResource` immediately after a `RawMemoryFreeList` is created, taking the limit of the `RawMemoryFreeList` into account. Therefore, it is no longer necessary to update the starting address later.
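A minimal, self-contained sketch of the enumeration approach described above (stand-in types; the real `Plan`, `Space`, and page resource types are far richer):

```rust
// Stand-in types mirroring the shape of the new approach.
struct PageResource {
    start: usize,
}

trait Space {
    // Returns the page resource if this space has one.
    fn maybe_get_page_resource_mut(&mut self) -> Option<&mut PageResource>;
}

struct DiscontiguousSpace {
    pr: PageResource,
}
impl Space for DiscontiguousSpace {
    fn maybe_get_page_resource_mut(&mut self) -> Option<&mut PageResource> {
        Some(&mut self.pr)
    }
}

struct Plan {
    spaces: Vec<Box<dyn Space>>,
}
impl Plan {
    fn for_each_space_mut(&mut self, mut f: impl FnMut(&mut dyn Space)) {
        for space in &mut self.spaces {
            f(space.as_mut());
        }
    }
}

fn main() {
    let mut plan = Plan {
        spaces: vec![Box::new(DiscontiguousSpace {
            pr: PageResource { start: 0 },
        })],
    };
    // Once the discontiguous range is decided, patch every page resource in
    // place, instead of going through a global map keyed by space ordinal.
    let discontiguous_start = 0x2000_0000;
    plan.for_each_space_mut(|space| {
        if let Some(pr) = space.maybe_get_page_resource_mut() {
            pr.start = discontiguous_start;
        }
    });
}
```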
Because those global maps are removed, several changes are made to `VMMap` and its implementations `Map32` and `Map64`.

- `VMMap::boot` and `VMMap::bind_freelist` no longer serve any purpose, and are removed.
- `Map64::finalize_static_space_map` now does nothing but set `finalized` to true.
The `CommonFreeListPageResource` data structure was introduced for the sole purpose of being referenced by those global maps. This PR removes `CommonFreeListPageResource` and `FreeListPageResourceInner`, and relocates their fields.

- `common: CommonPageResource` is now a field of `FreeListPageResource`.
- `free_list: Box<dyn FreeList>` and `start: Address` are moved into `FreeListPageResourceSync` because they have always been accessed while holding the mutex `FreeListPageResource::sync`.
  - `FreeListPageResource::release_pages` used to call `free_list.size()` without holding the `sync` mutex, and subsequently called `free_list.free()` with the `sync` mutex held. The algorithm of `FreeList` guarantees `free()` will not race with `size()`. But the `Mutex` forbids calling the read-only operation `size()` without holding the lock, because in general it is possible that read-write operations race with read-only operations, too. For now, we extended the range of the mutex to cover the whole function. (See the sketch after this list.)
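A compressed, self-contained sketch of the locking change described in the last bullet (stand-in types, not the real `FreeListPageResource`):

```rust
use std::sync::Mutex;

// Stand-in free list with a read-only size() and a mutating free().
struct FreeListStub {
    units: Vec<i32>,
}
impl FreeListStub {
    fn size(&self) -> usize {
        self.units.len()
    }
    fn free(&mut self, unit: i32) {
        self.units.retain(|&u| u != unit);
    }
}

struct FreeListPageResourceSync {
    free_list: FreeListStub,
}
struct FreeListPageResource {
    sync: Mutex<FreeListPageResourceSync>,
}

impl FreeListPageResource {
    fn release_pages(&self, unit: i32) {
        // Previously size() was read before taking the lock and free() was
        // called after; now a single lock covers both, so the read-only
        // size() can never race with a concurrent read-write operation.
        let mut sync = self.sync.lock().unwrap();
        let _pages = sync.free_list.size();
        sync.free_list.free(unit);
    }
}

fn main() {
    let pr = FreeListPageResource {
        sync: Mutex::new(FreeListPageResourceSync {
            free_list: FreeListStub { units: vec![1, 2, 3] },
        }),
    };
    pr.release_pages(2);
    assert_eq!(pr.sync.lock().unwrap().free_list.size(), 2);
}
```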
After the changes above, `FreeListPageResource` no longer needs `UnsafeCell`.

This PR is one step of the greater goal of removing unsafe operations related to FreeList, PageResource and VMMap. See: #853
Related PRs: