
Remove descriptor_map #952

Closed
wants to merge 1 commit into from

Conversation

@wks (Collaborator) commented Sep 12, 2023

DRAFT: This PR has two problems:

  1. The current implementation of address_in_space in this PR has noticeable performance overhead. From the benchmark results, it looks like the check that the address has an SFT entry and the 128-bit load are the bottleneck.
  2. The semantics of casting the 128-bit *const dyn SFT to *const () are unclear and may be unreliable.

So I have decided to postpone this PR until two other things are done:

  1. Refactoring the SFT_MAP implementation to eliminate the 128-bit atomic read. See "Should we avoid using fat pointers for SFT?" #945
  2. Merging the compressed pointer support in the OpenJDK binding so that we have a more realistic use case to evaluate against. See "Compressed Oops Support" mmtk-openjdk#235

Description

This PR removes the descriptor_map from both Map32 and Map64.

Currently, the descriptor_map is only used by Space::address_in_space when using Map32, and not used (except in some assertions about newly acquired pages) when using Map64. With descriptor_map removed, Space::address_in_space will use the SFT to find the space of a given address.
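
As an illustration of the approach (a minimal, self-contained sketch with toy types and a hypothetical sft_lookup helper, not mmtk-core's actual Space/SFT code), address_in_space boils down to looking up the SFT entry covering the address and comparing it with self by data-pointer identity:

trait Sft {
    fn name(&self) -> &'static str;
}

struct ToySpace {
    name: &'static str,
}

impl Sft for ToySpace {
    fn name(&self) -> &'static str {
        self.name
    }
}

// Stand-in for SFT_MAP.get_checked(addr): each chunk-sized slot maps to a space.
fn sft_lookup<'a>(table: &'a [&'a dyn Sft], addr: usize, chunk: usize) -> &'a dyn Sft {
    table[addr / chunk]
}

// SFT-based address_in_space: compare the space recorded in the SFT with
// `space` by pointer identity, dropping the vtable half of the fat pointer.
fn address_in_space(space: &dyn Sft, table: &[&dyn Sft], addr: usize, chunk: usize) -> bool {
    let self_ptr = space as *const dyn Sft as *const ();
    let other_ptr = sft_lookup(table, addr, chunk) as *const dyn Sft as *const ();
    std::ptr::eq(self_ptr, other_ptr)
}

fn main() {
    const CHUNK: usize = 4 << 20; // 4 MiB chunks, just for this toy table.
    let cs1 = ToySpace { name: "copyspace1" };
    let cs2 = ToySpace { name: "copyspace2" };
    let table: Vec<&dyn Sft> = vec![&cs1, &cs2]; // chunk 0 -> cs1, chunk 1 -> cs2
    assert!(address_in_space(&cs1, &table, 0x1000, CHUNK));
    assert!(!address_in_space(&cs1, &table, CHUNK + 0x1000, CHUNK));
    println!("0x1000 is in {}", sft_lookup(&table, 0x1000, CHUNK).name());
}

In mmtk-core the lookup goes through SFT_MAP, and get_checked additionally verifies that the address has an SFT entry at all; that extra check and the fat-pointer load are what the benchmarks below examine.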

Performance: The Space::in_space function (which calls Space::address_in_space) is used by the derive macro of the trait PlanTraceObject. Therefore, this PR will affect the performance of tracing for plans that use PlanTraceObject when using Map32. This needs to be tested.

Related PRs:

@qinsoon (Member) left a comment

The refactoring looks good. We need some performance data before we can confidently merge the PR.

@wks (Collaborator, Author) commented Sep 12, 2023

I tested on bobcat.moma, comparing master and this PR, running lusearch, with the Immix plan, at 2.5x min heap size, 20 iterations. The result shows less than 1% impact on total time, STW time, and mutator time. This looks promising. I'll retry this test with other plans running lusearch.


Plotty: https://squirrel.anu.edu.au/plotty/wks/noproject/#0|bobcat-2023-09-12-Tue-093227&benchmark^build^invocation^iteration^mmtk_gc&GC^time^time.other^time.stw&|10&iteration^1^4|20&1^invocation|30&1&benchmark&build;build1|41&Histogram%20(with%20CI)^build^benchmark&

@qinsoon (Member) commented Sep 13, 2023

I tested on bobcat.moma, comparing master and this PR, running lusearch, with the Immix plan, at 2.5x min heap size, 20 iterations. The result shows less than 1% impact on total time, STW time, and mutator time. This looks promising. I'll retry this test with other plans running lusearch.


Is this measured with a 64-bit build? If that's the case, the new code in address_in_space is not used.

@wks (Collaborator, Author) commented Sep 13, 2023

Is this measured with a 64-bit build? If that's the case, the new code in address_in_space is not used.

I built and ran it on x86-64. But I used the following patch to force it to use Map32.

diff --git a/src/util/heap/layout/vm_layout.rs b/src/util/heap/layout/vm_layout.rs
index ddf4472a5..66a7ee083 100644
--- a/src/util/heap/layout/vm_layout.rs
+++ b/src/util/heap/layout/vm_layout.rs
@@ -178,14 +178,14 @@ impl std::default::Default for VMLayout {
 
     #[cfg(target_pointer_width = "64")]
     fn default() -> Self {
-        Self::new_64bit()
+        Self::new_32bit()
     }
 }
 
 #[cfg(target_pointer_width = "32")]
 static mut VM_LAYOUT: VMLayout = VMLayout::new_32bit();
 #[cfg(target_pointer_width = "64")]
-static mut VM_LAYOUT: VMLayout = VMLayout::new_64bit();
+static mut VM_LAYOUT: VMLayout = VMLayout::new_32bit();
 
 static VM_LAYOUT_FETCHED: AtomicBool = AtomicBool::new(false);

The log shows it is using the new code.

@wks (Collaborator, Author) commented Sep 13, 2023

I ran again on bobcat.moma, 2.5x min heap size, 40 iterations, with several plans (MarkSweep and MarkCompact couldn't run with that heap size), and used the patch above to force it to use Map32.

The result shows a noticeable slowdown for all plans. Immix has the smallest slowdown. The others have about 1% slowdown in STW time.


Plotty: https://squirrel.anu.edu.au/plotty/wks/noproject/#0|bobcat-2023-09-12-Tue-120535&benchmark^build^invocation^iteration^mmtk_gc&GC^time^time.other^time.stw&|10&iteration^1^4|20&1^invocation|30&1&benchmark^mmtk_gc&build;build1|40&Histogram%20(with%20CI)^build^mmtk_gc&

@qinsoon (Member) commented Sep 13, 2023

Your build should be using Map32 and SFTSparseChunkMap. Getting SFT from SFTSparseChunkMap should be as simple as indexing into a vec, and it should be no different than the previous descriptor_map code. Is it possible to use SFTMap::get_unchecked() instead? It could be the extra check that caused the slowdown.

@wks (Collaborator, Author) commented Sep 13, 2023

Your build should be using Map32 and SFTSparseChunkMap. Getting SFT from SFTSparseChunkMap should be as simple as indexing into a vec, and it should be no different than the previous descriptor_map code. Is it possible to use SFTMap::get_unchecked() instead? It could be the extra check that caused the slowdown.

We can't use SFTMap::get_unchecked() because the object may not be in any space in MMTk. That's what ActivePlan::vm_trace_object handles. But I'll try running get_unchecked on OpenJDK to see if it is causing the slowdown.

I guess the 128-bit atomic load may also be a reason for the slowdown. I'll check that, too.

@qinsoon (Member) commented Sep 13, 2023

Your build should be using Map32 and SFTSparseChunkMap. Getting SFT from SFTSparseChunkMap should be as simple as indexing into a vec, and it should be no different than the previous descriptor_map code. Is it possible to use SFTMap::get_unchecked() instead? It could be the extra check that caused the slowdown.

We can't use SFTMap::get_unchecked() because the object may not be in any space in MMTk. That's what ActivePlan::vm_trace_object handles. But I'll try running get_unchecked on OpenJDK to see if it is causing the slowdown.

I guess the 128-bit atomic load may also be a reason for the slowdown. I'll check that, too.

The address does not have to be in an MMTk space. It just needs an entry in the SFT (which could be mapped to SFT_EMPTY_SPACE). For SFTSpaceMap and SFTSparseChunkMap, we should be fine, as we map all the address space we may use into the SFT map. But for SFTDenseChunkMap, where we use side metadata, we cannot guarantee that the side metadata is available for the entire address space.

@wks (Collaborator, Author) commented Sep 13, 2023

This time I added build3 and build4. Build3 uses SFT_MAP::get_unchecked. Build4 does not do the check either, and it also loads only the lowest 64 bits of the SFT entry from the SFTSparseChunkMap, using an unsafe, non-atomic 64-bit load. In the current Rust version on x86-64, the lower 64 bits of a &dyn are the object pointer.
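
For illustration, here is a hedged, standalone sketch of the kind of trick build4 relies on (toy trait and a hypothetical data_pointer_half helper; the layout of a &dyn fat pointer is not guaranteed by Rust, which is exactly the concern with this approach):

trait Sft {}

struct Space;

impl Sft for Space {}

// Hypothetical helper: reinterpret the 128-bit fat reference as two 64-bit
// halves and return only the first one. On current rustc for x86-64 the first
// half is the data pointer, but this layout is unspecified.
fn data_pointer_half(entry: &&dyn Sft) -> *const () {
    let halves: &[usize; 2] = unsafe { &*(entry as *const &dyn Sft as *const [usize; 2]) };
    halves[0] as *const ()
}

fn main() {
    let space = Space;
    let entry: &dyn Sft = &space;
    // The usual way to get the data pointer: cast the fat pointer down to a thin one.
    let thin = entry as *const dyn Sft as *const ();
    // The build4-style way: a plain 64-bit load of the lower half.
    assert_eq!(data_pointer_half(&entry), thin);
    println!("lower 64 bits match the thin pointer: {:p}", thin);
}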

From the plot, build3 is noticeably faster than build2 in STW time. The STW time of build4 is about the same as build3's for StickyImmix, but slightly higher for Immix. This may indicate that the "check" is the bottleneck. But it is hard to explain why build4 is slower than build3 for Immix, since it does strictly less memory loading. It may be noise.

But since the number is still a bit noisy, I'll do some extra experiments with micro-benchmarks.


Plotty: https://squirrel.anu.edu.au/plotty/wks/noproject/#0|bobcat-2023-09-13-Wed-061313&benchmark^build^invocation^iteration^mmtk_gc&GC^time^time.other^time.stw&|10&iteration^1^4|20&1^invocation|30&1&benchmark^mmtk_gc&build;build1|41&Histogram%20(with%20CI)^build^mmtk_gc&

@wks (Collaborator, Author) commented Sep 14, 2023

lusearch

This is the same setting, but with tests added for other plans and the number of invocations increased to 40.


Plotty: https://squirrel.anu.edu.au/plotty/wks/noproject/#0|bobcat-2023-09-13-Wed-144053&benchmark^build^invocation^iteration^mmtk_gc&GC^time^time.other^time.stw&|10&iteration^1^4|20&1^invocation|30&1&benchmark^mmtk_gc&build;build1|41&Histogram%20(with%20CI)^build^mmtk_gc&

From this plot, we can see:

  • Build2 (this PR) is consistently slower than build1 w.r.t. STW time.
  • Build3 (this PR, but using get_unchecked) also has some overhead, but not as much as build2. For Immix, the mean STW time is the same as build1's.
  • Build4 exhibits a speed-up in STW time over build1 in GenCopy, GenImmix and SemiSpace, but a slowdown in Immix and StickyImmix.

Microbenchmark

I also tested with a microbenchmark. It is similar to GCBench, but

  1. it only has one long-lived tree,
  2. it only triggers GC and does nothing else, and
  3. it prints out the time of each GC (not the total STW time) at the end:
class Node {
    int n;
    Node left;
    Node right;

    Node(int n, Node left, Node right) {
        this.n = n;
        this.left = left;
        this.right = right;
    }

    static Node makeTree(int depth) {
        if (depth == 0) {
            return null;
        } else {
            return new Node(depth, makeTree(depth - 1), makeTree(depth - 1));
        }
    }
}

public class TraceTest {
    public static void main(String[] args) {
        int depth = Integer.parseInt(args[0]);
        int iterations = Integer.parseInt(args[1]);
        int warmups = Integer.parseInt(args[2]);
        long[] gctimes = new long[iterations];

        Node tree = Node.makeTree(depth);

        for (int i = 0; i < warmups; i++) {
            System.gc();
        }

        for (int i = 0; i < iterations; i++) {
            long time1 = System.nanoTime();
            System.gc();
            long time2 = System.nanoTime();

            gctimes[i] = time2 - time1;
        }

        for (long gctime: gctimes) {
            System.out.println(gctime);
        }
    }
}

I ran it on bobcat.moma with the following script:

for plan in SemiSpace GenCopy GenImmix StickyImmix Immix; do
        for j in {1..5}; do
                for i in {1..4}; do
                        echo $plan build$i iter$j
                        MMTK_THREADS=1 MMTK_PLAN=${plan} ~/compare/build${i}/openjdk/build/linux-x86_64-normal-server-release/images/jdk/bin/java -XX:+UseThirdPartyHeap -server -XX:ParallelGCThreads=1 -XX:MetaspaceSize=100M -Xm{s,x}500M TraceTest 22 100 10 > out/result-${plan}-build${i}-iter${j}.txt
                done
        done
done

The number of GC workers is set to 1. For each plan-build pair, it runs 5 times. Each time, it creates a 22-level tree, triggers GC 10 times for warm-up, and then triggers GC 100 more times, recording the time of each GC.

The results are plotted in the following violin plot + scattered point plot. Each cell corresponds to a plan-build pair and contains 5 bars, each corresponding to one of the 5 runs. The horizontal dash "-" in the middle of each "violin" is the median.

[violin plot of per-GC times for each plan-build pair]

With outliers (zscore >= 3) removed, the result is:

[the same violin plot with outliers removed]

The GC times exhibit an interesting bi-modal distribution in SemiSpace and StickyImmix.

For the two non-generational plans, namely SemiSpace and Immix, the median of build2 is significantly greater than that of build1. Build3 is slightly faster than build2, but still noticeably slower than build1. Build4 is close to build1 in both plans: slightly faster than build1 in SemiSpace, but slightly slower than build1 in Immix.

For the two GenXxxxx plans, namely GenCopy and GenImmix, the plot doesn't show significant differences in GC time. The result varies between runs, and the noise is larger than the differences between the medians.

StickyImmix is a bit interesting. The bi-modal distribution disappeared in build3 (like this PR, but using get_unchecked). Since the script runs each build in turn (build1, build2, build3, build4, then build1 and build2 again, and so on), the difference should be intrinsic to build3.

This result is hard to interpret, but it looks like the cost of the 128-bit load is significant, and the check in get_checked also contributes a small amount to the overhead.

// TODO: For `DenseChunkMap`, it is possible to reduce one level of dereferencing by
// letting each space remember its index in `DenseChunkMap::index_map`.
let self_ptr = self as *const _ as *const ();
let other_ptr = SFT_MAP.get_checked(start) as *const _ as *const ();
@wenyuzhao (Member) commented Sep 15, 2023


Better to use as *const Self instead of *const _. We had a bug (#750) where, if the return type of get_checked is changed, the *const _ cast may still compile but silently fail the pointer comparison or dereference later.

Same for L229 as well.

@wks (Collaborator, Author) replied:

Yes. We can do that on line 229. But on L230, the returned SFT instance may not have the same type as Self.

But I think a deeper problem is that SFT_MAP.get_checked(start) as *const _ is a *const dyn SFT, which is 16 bytes long, while *const () is 8 bytes long. Embarrassingly, neither the Rust Reference nor the Rustonomicon specifies the semantics of casting a pointer to a pointer. This means casting pointers this way is really no better than loading only half of the 128-bit &dyn from the SFT, which is what build4 did.
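
A quick standalone check of the size mismatch being described here (not mmtk-core code, just an illustration):

use std::mem::size_of;

trait Sft {}

fn main() {
    // *const dyn Sft (standing in for mmtk's *const dyn SFT) is a fat pointer:
    // a data pointer plus a vtable pointer.
    assert_eq!(size_of::<*const dyn Sft>(), 2 * size_of::<usize>());
    // *const () is a thin pointer: just the data address.
    assert_eq!(size_of::<*const ()>(), size_of::<usize>());
    // The `as *const ()` cast therefore has to drop the vtable half; it cannot
    // be a bit-for-bit transmute of the whole fat pointer.
    println!(
        "fat pointer: {} bytes, thin pointer: {} bytes",
        size_of::<*const dyn Sft>(),
        size_of::<*const ()>()
    );
}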

A collaborator replied:

Embarrassingly, neither the Rust Reference nor the Rustonomicon specifies the semantics of casting a pointer to a pointer.

Not sure I follow. Doesn't the first rule here describe when it's allowed? Or do you mean the semantics of then dereferencing the pointer, which is most likely undefined?

@wks (Collaborator, Author) replied:

Not sure I follow. Doesn't the first rule here describe when it's allowed? Or do you mean the semantics of then dereferencing the pointer, which is most likely undefined?

The link you provided points to an ancient version of the Rust Book for v1.25. The newest version of the Rust Book does not contain that section.

I did mean the casting itself. It is not a no-op, bit-preserving cast like transmute, as it changes the number of bits.

The same collaborator replied:

Ah right. That was the first link on Google for me when I searched for "raw pointer casting Rust".

@wks marked this pull request as draft on September 15, 2023, 02:50
@k-sareen (Collaborator) commented Sep 15, 2023

Could you run a smaller experiment on a different machine? bobcat is asymmetric, so scheduling may affect the results. (Or just run on the performance cores on bobcat.)

@wks (Collaborator, Author) commented Sep 15, 2023

To further investigate the bi-modal distribution in my microbenchmark, I plotted the GC time of each GC in the first SemiSpace run with build2.

[plot of per-GC time, SemiSpace, build2, run 1]

The GC time is jumping back and forth between two values. I suspect the difference is the cost of a failed in_space test. When running SemiSpace with PlanTraceObject, the macro-generated trace_object checks whether an object is in copyspace1 before checking whether it is in copyspace2. The two copy spaces swap roles every GC, so in odd GCs copyspace1 is the from-space, while in even GCs copyspace2 is the from-space. In my microbenchmark, the data structure is a large binary tree, so every from-space object is visited via exactly one edge (i.e. the GC will not see edges pointing to objects that are already forwarded).

  • When copyspace1 is the from-space, most objects will be in copyspace1, and the first copyspace1.in_space(object) check will usually succeed.
  • When copyspace2 is the from-space, most objects will be in copyspace2, but the generated trace_object function will still call copyspace1.in_space(object) first, which will return false. Then it will call copyspace2.in_space(object), which will return true.

So it will call in_space once in odd GCs, but twice in even GCs. That's probably the reason behind the bimodal distribution.
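
A tiny standalone sketch of that asymmetry (toy types and a hypothetical which_space helper, not the actual derive-generated code): spaces are tested in a fixed order, so objects in the second space always cost one extra failed in_space check:

struct ToySpace {
    lo: usize,
    hi: usize,
}

impl ToySpace {
    fn in_space(&self, addr: usize) -> bool {
        self.lo <= addr && addr < self.hi
    }
}

// Shape of the generated dispatch: copyspace1 is always tested before copyspace2.
// Returns the name of the space and the number of in_space calls it took.
fn which_space(cs1: &ToySpace, cs2: &ToySpace, addr: usize) -> (&'static str, u32) {
    if cs1.in_space(addr) {
        return ("copyspace1", 1);
    }
    if cs2.in_space(addr) {
        return ("copyspace2", 2);
    }
    ("other", 2)
}

fn main() {
    let cs1 = ToySpace { lo: 0x1000, hi: 0x2000 };
    let cs2 = ToySpace { lo: 0x2000, hi: 0x3000 };
    // "Odd" GC: from-space objects are in copyspace1 -> one lookup each.
    println!("{:?}", which_space(&cs1, &cs2, 0x1800));
    // "Even" GC: from-space objects are in copyspace2 -> two lookups each.
    println!("{:?}", which_space(&cs1, &cs2, 0x2800));
}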

The curve for StickyImmix is different. The following is the first run of build2.

[plot of per-GC time, StickyImmix, build2, run 1]

This may indicate that the bimodal distribution is caused by something else that is periodic.

And the first run of build3:

[plot of per-GC time, StickyImmix, build3, run 1]

This indicates that the GC time still oscillates, but with a lower "AC" amplitude and a higher "DC" component. This is hard to explain, because omitting a check should only make things faster.

@wks (Collaborator, Author) commented Sep 15, 2023

I ran the microbenchmark again on fisher.moma.

The overall result is consistent with bobcat.moma. For the non-generational plans, build2 is clearly slower; build3 is somewhat less slow; build4 is almost as good as build1. SemiSpace still exhibits the bi-modal distribution, but StickyImmix no longer does.

[violin plot of per-GC times on fisher.moma]

SemiSpace build2, run 2:

[plot of per-GC time, SemiSpace, build2, run 2]

StickyImmix build2, run 2:

[plot of per-GC time, StickyImmix, build2, run 2]

@wks (Collaborator, Author) commented Sep 26, 2023

From our previous discussion: although the descriptor_map is not used in Map64 now, it is still useful. One possible use case is identifying which space an object is in and enqueueing edges into specialised work packets for the space they point into. So we shouldn't simply remove descriptor_map. I am closing this PR without merging.
