
Remove fl_map and refactor FreeListPageResource #953

Merged: 12 commits into mmtk:master from wks:fix/no-fl-table on May 10, 2024

Conversation

@wks (Collaborator) commented Sep 12, 2023

This PR removes the shared_fl_map field from Map32, and the fl_map and fl_page_resource fields from Map64. These global maps were used to enumerate and update the starting addresses of spaces (stored in CommonFreeListPageResource instances) for several different purposes.

  1. Map32 can only decide the starting address of discontiguous spaces after all contiguous spaces are placed; therefore it needs to modify the start address of all discontiguous FreeListPageResource instances.
    • shared_fl_map was used to locate CommonFreeListPageResource instances by the space ordinal.
    • With this PR, we use Plan::for_each_space_mut to enumerate spaces, and the newly added Space::maybe_get_page_resource_mut method to get the page resource (if the space has one) in order to update its starting address (see the sketch after this list).
  2. Map64 allocates RawMemoryFreeList in the beginning of each space that uses FreeListPageResource. It needs to update the start address of each space to take the address range occupied by RawMemoryFreeList into account.
    • fl_page_resource was used to locate CommonFreeListPageResource instances by the space ordinal.
    • fl_map was used to locate the underlying RawMemoryFreeList instances of the corresponding FreeListPageResource by the space ordinal.
    • With this PR, we initialize the start address of the FreeListPageResource immediately after a RawMemoryFreeList is created, taking the limit of the RawMemoryFreeList into account. Therefore, the starting address no longer needs to be updated later.
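
To illustrate item 1, here is a minimal sketch of how the start addresses can now be patched by enumerating spaces instead of consulting a global map. The closure-based signature is assumed, and update_discontiguous_start is a hypothetical setter for illustration, not necessarily the exact method in this PR:

use crate::plan::Plan;
use crate::policy::space::Space;
use crate::util::Address;
use crate::vm::VMBinding;

fn update_discontiguous_starts<VM: VMBinding>(plan: &mut dyn Plan<VM = VM>, start: Address) {
    // Visit every space of the plan; no shared_fl_map lookup is needed.
    plan.for_each_space_mut(&mut |space: &mut dyn Space<VM>| {
        // Only spaces that actually own a page resource are updated.
        if let Some(pr) = space.maybe_get_page_resource_mut() {
            pr.update_discontiguous_start(start); // hypothetical setter
        }
    });
}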

Because those global maps are removed, several changes are made to VMMap and its implementations Map32 and Map64.

  • VMMap::boot and VMMap::bind_freelist no longer serve any purpose and are removed.
  • Map64::finalize_static_space_map now does nothing but set finalized to true.

The CommonFreeListPageResource data structure was introduced for the sole purpose of being referenced by those global maps. This PR removes CommonFreeListPageResource and FreeListPageResourceInner, and relocates their fields.

  • common: CommonPageResource is now a field of FreeListPageResource.
  • free_list: Box<dyn FreeList> and start: Address are moved into FreeListPageResourceSync because they have always been accessed while holding the mutex FreeListPageResource::sync.
    • Note that the method FreeListPageResource::release_pages used to call free_list.size() without holding the sync mutex, and subsequently called free_list.free() while holding it. The algorithm of FreeList guarantees free() will not race with size(). But the Mutex forbids calling even the read-only operation size() without holding the lock, because in general a read-write operation may race with read-only operations, too. For now, we extended the scope of the mutex to cover the whole function (see the sketch below).
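
A self-contained toy sketch of the widened lock scope (a Vec stands in for Box<dyn FreeList>, and the method bodies are simplified; this is not the exact code in the PR):

use std::sync::Mutex;

struct FreeListPageResourceSync {
    free_list: Vec<usize>, // stand-in for free_list: Box<dyn FreeList>
}

struct FreeListPageResource {
    sync: Mutex<FreeListPageResourceSync>,
}

impl FreeListPageResource {
    fn release_pages(&self, unit: usize) {
        // One guard now covers both the read (size) and the write (free),
        // so the read-only call can no longer race with a concurrent writer.
        let mut sync = self.sync.lock().unwrap();
        let pages = sync.free_list[unit]; // formerly read without the lock
        debug_assert!(pages > 0);
        sync.free_list[unit] = 0; // the mutating free() step, under the same lock
    } // the guard is dropped here, so the mutex covers the whole function
}

fn main() {
    let pr = FreeListPageResource {
        sync: Mutex::new(FreeListPageResourceSync { free_list: vec![4] }),
    };
    pr.release_pages(0);
}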

After the changes above, the FreeListPageResource no longer needs UnsafeCell.

This PR is one step of the greater goal of removing unsafe operations related to FreeList, PageResource and VMMap. See: #853

Related PRs:

@wks wks mentioned this pull request Oct 30, 2023
@qinsoon qinsoon mentioned this pull request Jan 8, 2024
wks added a commit to wks/mmtk-core that referenced this pull request May 6, 2024
`ScanObjectsWork::make_another` is unused.  It was used when
`ScanObjectsWork` was used for scanning node roots.  Now that node roots
are scanned by the dedicated `ProcessRootNode` work packet, we can
remove it.

`WorkCounter::get_base_mut` is never used.  All derived counters use
`merge_val` to update all fields at the same time.

We use `Box::as_ref()` to get the reference to its underlying element.
It fixes a compilation error related to CommonFreeListPageResource.  But
we should eventually remove CommonFreeListPageResource completely as it
is a workaround for mimicking the legacy design from JikesRVM that allows
VMMap to enumerate and patch existing FreeListPageResource instances by
registering them in a global list, which is not idiomatic in Rust.  See
mmtk#934 and
mmtk#953
github-merge-queue bot pushed a commit that referenced this pull request May 6, 2024 (same commit message as above)
@wks wks marked this pull request as ready for review May 6, 2024 13:04
@wks wks requested a review from qinsoon May 6, 2024 13:04
@wks (Collaborator, Author) commented May 7, 2024

I ran lusearch from DaCapo Chopin on vole.moma, 3x min heap size w.r.t. G1, 40 invocations, 5 iterations each, comparing master against this PR.

[image: benchmark results]

plotty link

To my surprise, the STW time increased for all plans except MarkSweep. I expected the increased mutex scope in FreeListPageResource::release_pages to negatively impact MarkSweep, and the other plans to be unaffected by this change. I'll investigate why it became slower.

src/mmtk.rs (resolved review thread)
}
common_flpr
let free_list = vm_map.create_parent_freelist(start, pages, PAGES_IN_REGION as _);
let actual_start = if let Some(rmfl) = free_list.downcast_ref::<RawMemoryFreeList>() {
Member:

We could let the dyn FreeList return a value based on the implementation rather than downcasting the type here. The FreeList implementation is internal to the VMMap, and we should not break the abstraction by letting PageResource do things based on the FreeList implementation.
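
For illustration, a toy sketch of that suggestion (the method name space_displacement is hypothetical): each implementation reports the bytes it occupies in the space, so the page resource never needs to know the concrete type.

trait FreeList {
    /// Bytes this free list occupies at the start of the space (0 if none).
    fn space_displacement(&self) -> usize {
        0 // default: the free list lives outside the space (e.g. IntArrayFreeList)
    }
}

struct RawMemoryFreeList {
    limit_bytes: usize,
}

impl FreeList for RawMemoryFreeList {
    fn space_displacement(&self) -> usize {
        self.limit_bytes // the list itself is embedded at the start of the space
    }
}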

Collaborator (Author):

I don't think FreeList is an implementation detail of VMMap. It is a very general free list abstraction, and has two implementations, namely IntArrayFreeList and RawMemoryFreeList. In this sense, I don't think trait FreeList should contain a method that returns the number of bytes it occupies in the space, because a FreeList is not necessarily used by a space. It can be used to allocate any kind of resource.

But it is true that VMMap selects the implementation. It is VMMap::create_freelist that decides whether to reserve a proportion of memory at the beginning of the space for the RawMemoryFreeList. I think it is natural for VMMap::create_freelist to return the number of bytes reserved for the FreeList implementation alongside the dyn FreeList, i.e. it should return a tuple.

In my previous version of the PR, I introduced the CreateFreeListResult type:

pub struct CreateFreeListResult {
    pub free_list: Box<dyn FreeList>,
    pub space_displacement: usize,
}

I think I can move it to this PR, too. If that's too verbose, I can replace it with a simple tuple instead.
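
For reference, a hedged caller-side fragment (variable names follow the diff excerpts in this thread; the surrounding code is assumed) showing how the struct removes the downcast:

let CreateFreeListResult { free_list, space_displacement } =
    vm_map.create_freelist(start);
// 0 for IntArrayFreeList; the RawMemoryFreeList's footprint otherwise.
let actual_start = start + space_displacement;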

Member:

Returning an offset from the create functions sounds good.

Collaborator (Author):

I made create_freelist return CreateFreeListResult so that FreeListPageResource no longer performs downcast.

let start = vm_layout().available_start();
let free_list = vm_map.create_freelist(start);
debug_assert!(
    free_list.downcast_ref::<RawMemoryFreeList>().is_none(),
Member:

Same here. We could let the dyn FreeList tell us whether discontiguous spaces are allowed.

@wks (Collaborator, Author), May 7, 2024:

Like I said, I don't think FreeList should be aware of whether it is used by one kind of space or another.

Our current implementation happens to disallow discontiguous spaces when using Map64, and Map64 happens to select RawMemoryFreeList. I think we can remove this assertion because it only reflects the status quo, and there is no reason why we can't use RawMemoryFreeList in discontiguous spaces.

Collaborator (Author):

On second thought, I think we should leave the assertion as is, because it is an implementation detail of the current FreeListPageResource. The way we currently handle discontiguous spaces is that the start variable here is meaningless. When the address range of the discontiguous space is determined, we re-assign the address of all discontiguous page resources. (That's the callback you saw in mmtk.rs.) If start needed an offset to make room for the RawMemoryFreeList, it would be overwritten.

We could adapt to that possibility by requiring start to be initially zero (when using IntArrayFreeList) or the offset (when using RawMemoryFreeList) for discontiguous spaces. When the discontiguous range is determined, we would add the starting address of the discontiguous range to the start variable instead of overwriting it. But we have no way to test that because we never have that use case, and I have already proposed to revamp the way we create spaces (see here and here). So we can leave this assertion here to reflect the status quo. downcast_ref is not as elegant as virtual methods from OOP's point of view, but we are just inspecting an implementation detail, and it is in an assertion. I'll add a comment to clarify that this is only the status quo, not something that has to be true in general.

Collaborator (Author):

I added comments to this assertion.

let maybe_rmfl = sync.free_list.downcast_mut::<RawMemoryFreeList>();
let region = self
    .common
    .grow_discontiguous_space(space_descriptor, required_chunks, maybe_rmfl);
Member:

We could use Option<&mut dyn FreeList> instead of Option<&mut RawMemoryFreeList> for grow_discontiguous_space and allocate_contiguous_chunks. In that case, the downcast happens in VMMap.

Collaborator (Author):

I made the change as you suggested. It makes sense: if the page resource uses a free list and Map64 can grow the space, it always needs to grow the free list anyway. It just requires Map64 to choose a growable FreeList implementation.
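
A self-contained toy sketch of that change (trait and names simplified; std::any::Any stands in here for whatever downcast mechanism the real FreeList trait uses): the caller passes Option<&mut dyn FreeList>, and the VMMap side downcasts to the growable implementation it selected.

use std::any::Any;

trait FreeList: Any {
    fn as_any_mut(&mut self) -> &mut dyn Any;
}

struct RawMemoryFreeList {
    units: usize,
}

impl FreeList for RawMemoryFreeList {
    fn as_any_mut(&mut self) -> &mut dyn Any {
        self
    }
}

// VMMap side: the downcast now happens here, not in FreeListPageResource.
fn grow_discontiguous_space(maybe_fl: Option<&mut dyn FreeList>, required_chunks: usize) {
    if let Some(rmfl) = maybe_fl
        .map(|fl| fl.as_any_mut())
        .and_then(|any| any.downcast_mut::<RawMemoryFreeList>())
    {
        // Map64 selected RawMemoryFreeList, so it knows it is growable.
        rmfl.units += required_chunks;
    }
}

fn main() {
    let mut fl = RawMemoryFreeList { units: 0 };
    grow_discontiguous_space(Some(&mut fl), 4);
    assert_eq!(fl.units, 4);
}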

@wks (Collaborator, Author) commented May 7, 2024

> plotty link
>
> To my surprise, the STW time increased for all plans except MarkSweep. I expected the increased mutex scope in FreeListPageResource::release_pages to negatively impact MarkSweep, and the other plans to be unaffected by this change. I'll investigate why it became slower.

The slow-down can be easily reproduced on my personal computer. Just compile MMTk and OpenJDK as usual and we can observe the performance difference in a microbenchmark that triggers GC many times.

The performance difference is due to some code change that subtly changed the inlining decisions of the Rust compiler. perf shows that after this PR, one of the compare-exchange operations in object_forwarding::attempt_to_forward is not inlined, and is taking a significant amount of time. The inlining decisions on the hot path, including the do_work of ProcessEdgesWork implementations and the trace_object of the spaces, are different in the two builds.

The slowdown is only reproducible by setting profile.release.opt-level to 2 or above in Cargo.toml. If I set it to 1 or lower, there is no performance difference regardless of whether this PR is applied.
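
For reference, the Cargo.toml setting in question:

[profile.release]
opt-level = 2  # the slowdown reproduces at 2 or above; at 1 or lower there is no difference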

Compiling both versions with PGO makes the difference go away.

I'll run the benchmarks again with PGO after addressing the review comments.

@k-sareen (Collaborator) commented May 8, 2024

You can mark the function as #[hot] then.

@wks (Collaborator, Author) commented May 8, 2024

> You can mark the function as #[hot] then.

But that goes against our policy of not interfering with the compiler's inlining decisions and relying on PGO to make more informed decisions.

If a VMMap implementation chooses RawMemoryFreeList, it will report how
many bytes the RawMemoryFreeList occupies at the start of the Space.
@k-sareen (Collaborator) commented May 8, 2024

The policy is not to never add them, but to add them judiciously and for proper reasons, especially if they make even the PGO build faster. PGO is not always perfect.

@wks (Collaborator, Author) commented May 8, 2024

> The policy is not to never add them, but to add them judiciously and for proper reasons, especially if they make even the PGO build faster. PGO is not always perfect.

To be clear, when I observed the performance difference, I was not using PGO. I have always worried that PGO may be another source of non-determinism, but this time PGO worked properly and eliminated the performance difference.

@qinsoon (Member) left a comment:

LGTM

@qinsoon qinsoon added the PR-extended-testing Run extended tests for the pull request label May 8, 2024
@wks (Collaborator, Author) commented May 8, 2024

I am running the benchmark again on vole.moma, with PGO enabled.

@wks (Collaborator, Author) commented May 9, 2024

I re-ran lusearch from DaCapo Chopin on vole.moma: 40 invocations, 5 iterations. This time I compiled mmtk-openjdk with PGO (training with fop).

3x min heap size w.r.t. G1 (plotty)

[image: benchmark results]

4.4x min heap size w.r.t. G1 (plotty)

[image: benchmark results]

The total time and the STW time still got worse, and the difference is bigger with the larger heap size. Although the confidence intervals are large, the difference is still obvious.

@qinsoon (Member) commented May 9, 2024

Does this PR affect performance for SemiSpace and GenCopy? They mostly use MonotonePageResource which seems unchanged in this PR.

@wks (Collaborator, Author) commented May 9, 2024

I ran it again. This time I trained the PGO build using only StickyImmix without stress GC, and re-ran lusearch with StickyImmix at 4.4x min heap size. (plotty)

[image: benchmark results]

From the result, this PR is faster than the master branch.

My conclusion is that this is extremely sensitive to the profile. Previously, I did PGO using MarkSweep, SemiSpace, MarkCompact, Immix, GenImmix, and StickyImmix, with a stress factor of 4M. Maybe that made the Rust compiler think some functions on the hot paths were not worth inlining, or maybe the stress GC made it miss hot paths that are not executed during non-stress GC.

I also tried locally and found that when I set the optimization level to 2, the compiler failed to inline some functions, causing a slowdown. This also shows the sensitivity to optimization settings.

@qinsoon (Member) commented May 9, 2024

> From the result, this PR is faster than the master branch.

I don't think we can say it is faster. The difference is so tiny that it could be just noise. We should probably run all the benchmarks for the plans that may be affected by this PR.

@wks (Collaborator, Author) commented May 9, 2024

> I don't think we can say it is faster. The difference is so tiny that it could be just noise. We should probably run all the benchmarks for the plans that may be affected by this PR.

Yes. That's probably just noise.

This PR changes the page resource and the VMMap. The difference only manifests when the PageResource acquires chunks from the VMMap, so there should be virtually no visible performance difference: it only affects the slow path of the slow paths.

To be safe, I'll run other benchmarks with StickyImmix with the binary I compiled with PGO trained from StickyImmix. If the difference is just caused by optimization, other benchmarks should perform the same with the master or this PR.

@wks (Collaborator, Author) commented May 10, 2024

The same binary as the last one, and the same settings, except running the other benchmarks, too. (As before, I omitted some benchmarks because their times increase as the number of iterations goes up, probably due to the lack of reference processing.)

Results: (plotty)

[image: benchmark results]

The means are between -1.5% and +0.6% of the master, but the error bars are larger than the difference. That means the performance difference is insignificant.

The STW time of graphchi is reduced from 57 ms to 54 ms (see the un-normalized version), a 5% reduction, and the error bars are narrow. The same is true for kafka.

@qinsoon (Member) commented May 10, 2024

@wks You can merge the PR when you think it is ready.

@wks wks added this pull request to the merge queue May 10, 2024
Merged via the queue into mmtk:master with commit 698a737 May 10, 2024
30 checks passed
@wks wks deleted the fix/no-fl-table branch May 10, 2024 07:01