
Remove fl_map and refactor FreeListPageResource #953

Merged: 12 commits into mmtk:master from wks:fix/no-fl-table on May 10, 2024

Conversation

@wks (Collaborator) commented Sep 12, 2023

This PR removes the shared_fl_map field from Map32, and the fl_map and fl_page_resource fields from Map64. These global maps were used to enumerate and update the starting addresses of spaces (stored in CommonFreeListPageResource instances) for several different purposes.

  1. Map32 can only decide the starting address of discontiguous spaces after all contiguous spaces are placed; therefore it needs to modify the start address of all discontiguous FreeListPageResource instances.
    • shared_fl_map was used to locate CommonFreeListPageResource instances by the space ordinal.
    • With this PR, we use Plan::for_each_space_mut to enumerate spaces, and the newly added Space::maybe_get_page_resource_mut method to get the page resource (if the space has one) in order to update its starting address (see the sketch after this list).
  2. Map64 allocates RawMemoryFreeList in the beginning of each space that uses FreeListPageResource. It needs to update the start address of each space to take the address range occupied by RawMemoryFreeList into account.
    • fl_page_resource was used to locate CommonFreeListPageResource instances by the space ordinal.
    • fl_map was used to locate the underlying RawMemoryFreeList instances of the corresponding FreeListPageResource by the space ordinal.
    • With this PR, we initialize the start address of the FreeListPageResource immediately after a RawMemoryFreeList is created, taking the limit of the RawMemoryFreeList into account. Therefore, the starting address no longer needs to be updated later.
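
To illustrate item 1, here is a minimal sketch of how the start addresses can now be patched by enumerating spaces instead of consulting a global map. The closure-based signature is assumed, and update_discontiguous_start is a hypothetical setter for illustration, not necessarily the exact method in this PR:

use crate::plan::Plan;
use crate::policy::space::Space;
use crate::util::Address;
use crate::vm::VMBinding;

fn update_discontiguous_starts<VM: VMBinding>(plan: &mut dyn Plan<VM = VM>, start: Address) {
    // Visit every space of the plan; no shared_fl_map lookup is needed.
    plan.for_each_space_mut(&mut |space: &mut dyn Space<VM>| {
        // Only spaces that actually own a page resource are updated.
        if let Some(pr) = space.maybe_get_page_resource_mut() {
            pr.update_discontiguous_start(start); // hypothetical setter
        }
    });
}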

Because those global maps are removed, several changes are made to VMMap and its implementations Map32 and Map64.

  • VMMap::boot and VMMap::bind_freelist no longer serve any purpose and are removed.
  • Map64::finalize_static_space_map now does nothing but set finalized to true.

The CommonFreeListPageResource data structure was introduced for the sole purpose of being referenced by those global maps. This PR removes CommonFreeListPageResource and FreeListPageResourceInner, and relocates their fields.

  • common: CommonPageResource is now a field of FreeListPageResource.
  • free_list: Box<dyn FreeList> and start: Address are moved into FreeListPageResourceSync because they have always been accessed while holding the mutex FreeListPageResource::sync.
    • Note that the method FreeListPageResource::release_pages used to call free_list.size() without holding the sync mutex, and subsequently called free_list.free() while holding it. The algorithm of FreeList guarantees free() will not race with size(). But the Mutex forbids calling even the read-only operation size() without holding the lock, because in general a read-write operation may race with read-only operations, too. For now, we extended the scope of the mutex to cover the whole function (see the sketch below).
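
A self-contained toy sketch of the widened lock scope (a Vec stands in for Box<dyn FreeList>, and the method bodies are simplified; this is not the exact code in the PR):

use std::sync::Mutex;

struct FreeListPageResourceSync {
    free_list: Vec<usize>, // stand-in for free_list: Box<dyn FreeList>
}

struct FreeListPageResource {
    sync: Mutex<FreeListPageResourceSync>,
}

impl FreeListPageResource {
    fn release_pages(&self, unit: usize) {
        // One guard now covers both the read (size) and the write (free),
        // so the read-only call can no longer race with a concurrent writer.
        let mut sync = self.sync.lock().unwrap();
        let pages = sync.free_list[unit]; // formerly read without the lock
        debug_assert!(pages > 0);
        sync.free_list[unit] = 0; // the mutating free() step, under the same lock
    } // the guard is dropped here, so the mutex covers the whole function
}

fn main() {
    let pr = FreeListPageResource {
        sync: Mutex::new(FreeListPageResourceSync { free_list: vec![4] }),
    };
    pr.release_pages(0);
}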

After the changes above, the FreeListPageResource no longer needs UnsafeCell.

This PR is one step of the greater goal of removing unsafe operations related to FreeList, PageResource and VMMap. See: #853

Related PRs:

@wks wks mentioned this pull request Oct 30, 2023
@qinsoon qinsoon mentioned this pull request Jan 8, 2024
wks added a commit to wks/mmtk-core that referenced this pull request May 6, 2024
`ScanObjectsWork::make_another` is unused.  It was used when
`ScanObjectsWork` was used for scanning node roots.  Now that node roots
are scanned by the dedicated `ProcessRootNode` work packet, we can
remove it.

`WorkCounter::get_base_mut` is never used.  All derived counters use
`merge_val` to update all fields at the same time.

We use `Box::as_ref()` to get the reference to its underlying element.
It fixes a compilation error related to CommonFreeListPageResource.  But
we should eventually remove CommonFreeListPageResource completely as it
is a workaround for mimicking the legacy design from JikesRVM that allows
VMMap to enumerate and patch existing FreeListPageResource instances by
registering them in a global list, which is not idiomatic in Rust.  See
mmtk#934 and
mmtk#953
github-merge-queue bot pushed a commit that referenced this pull request May 6, 2024 (same commit message as above)
@wks wks marked this pull request as ready for review May 6, 2024 13:04
@wks wks requested a review from qinsoon May 6, 2024 13:04
@wks (Collaborator, Author) commented May 7, 2024

I ran lusearch from DaCapo Chopin on vole.moma, 3x min heap size w.r.t. G1, 40 invocations, 5 iterations each, comparing master against this PR.

[image: benchmark results]

plotty link

To my surprise, the STW time increased for all plans except MarkSweep. I expected the increased mutex scope in FreeListPageResource::release_pages to negatively impact MarkSweep, and the other plans to be unaffected by this change. I'll investigate why it became slower.

src/mmtk.rs (resolved review thread)
}
common_flpr
let free_list = vm_map.create_parent_freelist(start, pages, PAGES_IN_REGION as _);
let actual_start = if let Some(rmfl) = free_list.downcast_ref::<RawMemoryFreeList>() {
Member:

We could let the dyn FreeList return a value based on the implementation rather than downcasting the type here. The FreeList implementation is internal to the VMMap, and we should not break the abstraction by letting PageResource do things based on the FreeList implementation.
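
For illustration, a toy sketch of that suggestion (the method name space_displacement is hypothetical): each implementation reports the bytes it occupies in the space, so the page resource never needs to know the concrete type.

trait FreeList {
    /// Bytes this free list occupies at the start of the space (0 if none).
    fn space_displacement(&self) -> usize {
        0 // default: the free list lives outside the space (e.g. IntArrayFreeList)
    }
}

struct RawMemoryFreeList {
    limit_bytes: usize,
}

impl FreeList for RawMemoryFreeList {
    fn space_displacement(&self) -> usize {
        self.limit_bytes // the list itself is embedded at the start of the space
    }
}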

Collaborator (Author):

I don't think FreeList is an implementation detail of VMMap. It is a very general free list abstraction, and has two implementations, namely IntArrayFreeList and RawMemoryFreeList. In this sense, I don't think trait FreeList should contain a method that returns the number of bytes it occupies in the space, because a FreeList is not necessarily used by a space. It can be used to allocate any kind of resource.

But it is true that VMMap selects the implementation. It is VMMap::create_freelist that decides whether to reserve a proportion of memory at the beginning of the space for the RawMemoryFreeList. I think it is natural for VMMap::create_freelist to return the number of bytes reserved for the FreeList implementation alongside the dyn FreeList, i.e. it should return a tuple.

In my previous version of the PR, I introduced the CreateFreeListResult type:

pub struct CreateFreeListResult {
    pub free_list: Box<dyn FreeList>,
    pub space_displacement: usize,
}

I think I can move it to this PR, too. If that's too verbose, I can replace it with a simple tuple instead.
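
For reference, a hedged caller-side fragment (variable names follow the diff excerpts in this thread; the surrounding code is assumed) showing how the struct removes the downcast:

let CreateFreeListResult { free_list, space_displacement } =
    vm_map.create_freelist(start);
// 0 for IntArrayFreeList; the RawMemoryFreeList's footprint otherwise.
let actual_start = start + space_displacement;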

Member:

Returning an offset from the create functions sounds good.

Collaborator (Author):

I made create_freelist return CreateFreeListResult so that FreeListPageResource no longer performs downcast.

let start = vm_layout().available_start();
let free_list = vm_map.create_freelist(start);
debug_assert!(
    free_list.downcast_ref::<RawMemoryFreeList>().is_none(),
Member:

Same here. We could let the dyn FreeList tell us whether discontiguous spaces are allowed.

@wks (Collaborator, Author), May 7, 2024:

Like I said, I don't think FreeList should be aware of whether it is used by one kind of space or another.

Our current implementation happens to disallow discontiguous spaces when using Map64, and Map64 happens to select RawMemoryFreeList. I think we can remove this assertion because it only reflects the status quo, and there is no reason why we can't use RawMemoryFreeList in discontiguous spaces.

Collaborator (Author):

On second thought, I think we should leave the assertion as is, because it is an implementation detail of the current FreeListPageResource. The way we currently handle discontiguous spaces is that the start variable here is meaningless. When the address range of the discontiguous space is determined, we re-assign the address of all discontiguous page resources. (That's the callback you saw in mmtk.rs.) If start needed an offset to make room for the RawMemoryFreeList, it would be overwritten.

We could adapt to that possibility by requiring start to be initially zero (when using IntArrayFreeList) or the offset (when using RawMemoryFreeList) for discontiguous spaces. When the discontiguous range is determined, we would add the starting address of the discontiguous range to the start variable instead of overwriting it. But we have no way to test that because we never have that use case, and I have already proposed to revamp the way we create spaces (see here and here). So we can leave this assertion here to reflect the status quo. downcast_ref is not as elegant as virtual methods from OOP's point of view, but we are just inspecting an implementation detail, and it is in an assertion. I'll add a comment to clarify that this is only the status quo, not something that has to be true in general.

Collaborator (Author):

I added comments to this assertion.

let maybe_rmfl = sync.free_list.downcast_mut::<RawMemoryFreeList>();
let region = self
    .common
    .grow_discontiguous_space(space_descriptor, required_chunks, maybe_rmfl);
Member:

We could use Option<&mut dyn FreeList> instead of Option<&mut RawMemoryFreeList> for grow_discontiguous_space and allocate_contiguous_chunks. In that case, the downcast happens in VMMap.

Collaborator (Author):

I made the change as you suggested. It makes sense: if the page resource uses a free list and Map64 can grow the space, it always needs to grow the free list anyway. It just requires Map64 to choose a growable FreeList implementation.
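
A self-contained toy sketch of that change (trait and names simplified; std::any::Any stands in here for whatever downcast mechanism the real FreeList trait uses): the caller passes Option<&mut dyn FreeList>, and the VMMap side downcasts to the growable implementation it selected.

use std::any::Any;

trait FreeList: Any {
    fn as_any_mut(&mut self) -> &mut dyn Any;
}

struct RawMemoryFreeList {
    units: usize,
}

impl FreeList for RawMemoryFreeList {
    fn as_any_mut(&mut self) -> &mut dyn Any {
        self
    }
}

// VMMap side: the downcast now happens here, not in FreeListPageResource.
fn grow_discontiguous_space(maybe_fl: Option<&mut dyn FreeList>, required_chunks: usize) {
    if let Some(rmfl) = maybe_fl
        .map(|fl| fl.as_any_mut())
        .and_then(|any| any.downcast_mut::<RawMemoryFreeList>())
    {
        // Map64 selected RawMemoryFreeList, so it knows it is growable.
        rmfl.units += required_chunks;
    }
}

fn main() {
    let mut fl = RawMemoryFreeList { units: 0 };
    grow_discontiguous_space(Some(&mut fl), 4);
    assert_eq!(fl.units, 4);
}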

@wks (Collaborator, Author) commented May 7, 2024

> plotty link
>
> To my surprise, the STW time increased for all plans except MarkSweep. I expected the increased mutex scope in FreeListPageResource::release_pages to negatively impact MarkSweep, and the other plans to be unaffected by this change. I'll investigate why it became slower.

The slow-down can be easily reproduced on my personal computer. Just compile MMTk and OpenJDK as usual and we can observe the performance difference in a microbenchmark that triggers GC many times.

The performance difference is due to some code change that subtly changed the inlining decisions of the Rust compiler. perf shows that after this PR, one of the compare-exchange operations in object_forwarding::attempt_to_forward is not inlined, and is taking a significant amount of time. The inlining decisions on the hot path, including the do_work of ProcessEdgesWork implementations and the trace_object of the spaces, are different in the two builds.

The slowdown is only reproducible by setting profile.release.opt-level to 2 or above in Cargo.toml. If I set it to 1 or lower, there is no performance difference regardless of whether this PR is applied.
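
For reference, the Cargo.toml setting in question:

[profile.release]
opt-level = 2  # the slowdown reproduces at 2 or above; at 1 or lower there is no difference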

Compiling both versions with PGO makes the difference go away.

I'll run the benchmarks again with PGO after addressing the review comments.

@k-sareen (Collaborator) commented May 8, 2024

You can mark the function as #[hot] then.

@wks (Collaborator, Author) commented May 8, 2024

> You can mark the function as #[hot] then.

But that goes against our policy of not interfering with the compiler's inlining decisions and relying on PGO to make more informed decisions.

If a VMMap implementation chooses RawMemoryFreeList, it will report how
many bytes the RawMemoryFreeList occupies at the start of the Space.
@k-sareen (Collaborator) commented May 8, 2024

The policy is not to never add them, but to add them judiciously and for proper reasons, especially if they make even the PGO build faster. PGO is not always perfect.

@wks (Collaborator, Author) commented May 8, 2024

> The policy is not to never add them, but to add them judiciously and for proper reasons, especially if they make even the PGO build faster. PGO is not always perfect.

To be clear, when I observed the performance difference, I was not using PGO. I have always worried that PGO may be another source of non-determinism, but this time PGO worked properly and eliminated the performance difference.

@qinsoon (Member) left a comment:

LGTM

@qinsoon qinsoon added the PR-extended-testing Run extended tests for the pull request label May 8, 2024
@wks (Collaborator, Author) commented May 8, 2024

I am running the benchmark again on vole.moma, with PGO enabled.

@wks (Collaborator, Author) commented May 9, 2024

I re-ran lusearch from DaCapo Chopin on vole.moma: 40 invocations, 5 iterations. This time I compiled mmtk-openjdk with PGO (training with fop).

3x min heap size w.r.t. G1 (plotty)

[image: benchmark results]

4.4x min heap size w.r.t. G1 (plotty)

[image: benchmark results]

The total time and the STW time still got worse, and the difference is bigger with the larger heap size. Although the confidence intervals are large, the difference is still obvious.

@qinsoon (Member) commented May 9, 2024

Does this PR affect performance for SemiSpace and GenCopy? They mostly use MonotonePageResource which seems unchanged in this PR.

@wks (Collaborator, Author) commented May 9, 2024

I ran it again. This time I trained the PGO build using only StickyImmix without stress GC, and re-ran lusearch with StickyImmix at 4.4x min heap size. (plotty)

[image: benchmark results]

From the result, this PR is faster than the master branch.

My conclusion is that this is extremely sensitive to the profile. Previously, I did PGO using MarkSweep, SemiSpace, MarkCompact, Immix, GenImmix, and StickyImmix, with a stress factor of 4M. Maybe that made the Rust compiler think some functions on the hot paths were not worth inlining, or maybe the stress GC made it miss hot paths that are not executed during non-stress GC.

I also tried locally and found that when I set the optimization level to 2, the compiler failed to inline some functions, causing a slowdown. This also shows the sensitivity to optimization settings.

@qinsoon (Member) commented May 9, 2024

> From the result, this PR is faster than the master branch.

I don't think we can say it is faster. The difference is so tiny that it could be just noise. We should probably run all the benchmarks for the plans that may be affected by this PR.

@wks (Collaborator, Author) commented May 9, 2024

> I don't think we can say it is faster. The difference is so tiny that it could be just noise. We should probably run all the benchmarks for the plans that may be affected by this PR.

Yes. That's probably just noise.

This PR changes the page resource and the VMMap. The difference only manifests when the PageResource acquires chunks from the VMMap, so there should be virtually no visible performance difference: it only affects the slow path of the slow paths.

To be safe, I'll run other benchmarks with StickyImmix with the binary I compiled with PGO trained from StickyImmix. If the difference is just caused by optimization, other benchmarks should perform the same with the master or this PR.

@wks (Collaborator, Author) commented May 10, 2024

The same binary as the last one, and the same settings, except running the other benchmarks, too. (As before, I omitted some benchmarks because their times increase as the number of iterations goes up, probably due to the lack of reference processing.)

Results: (plotty)

[image: benchmark results]

The means are between -1.5% and +0.6% of the master, but the error bars are larger than the difference. That means the performance difference is insignificant.

The STW time of graphchi is reduced from 57 ms to 54 ms (see the un-normalized version), a 5% reduction, and the error bars are narrow. The same is true for kafka.

@qinsoon (Member) commented May 10, 2024

@wks You can merge the PR when you think it is ready.

@wks wks added this pull request to the merge queue May 10, 2024
Merged via the queue into mmtk:master with commit 698a737 May 10, 2024
30 checks passed
@wks wks deleted the fix/no-fl-table branch May 10, 2024 07:01