
Remove TransitiveClosure param from Space::trace_object and Scanning::scan_object #559

Closed
3 tasks done
wks opened this issue Mar 14, 2022 · 20 comments · Fixed by #607
Labels
A-work-queue Area: Work queue C-cleanup Category: Cleanup F-question Call For Participation: Unanswered question (need more information)

Comments

@wks
Collaborator

wks commented Mar 14, 2022

TL;DR: The type parameter of XxxxxSpace::trace_object<T: TransitiveClosure> and the trace: &mut T parameter can be removed; the information should be carried by the return value.

TL;DR2: The trace: &mut T parameter of Scanning::scan_object should also be removed and replaced by a lambda. (See comment below)

TODO list:

  • Remove the <T: TransitiveClosure> parameter from XxxxxSpace::trace_object
  • Remove the <T: TransitiveClosure> parameter from Scanning::scan_object
  • Remove the TransitiveClosure trait completely

What?

Each space (like this one) has a trace_object<T: TransitiveClosure> method that takes a trace: &mut T parameter.

We mentioned the removal of the <T: TransitiveClosure> type parameter in #110. In that issue, @qinsoon mentioned that once we remove plan-specific ProcessEdgesWork, we can remove it.

But I think we can remove it earlier than that, and plan-specific edge-processing is not even the core of the problem. The trace parameter is more like an "out parameter" in some languages.

What does trace: &mut T do?

It only serves one purpose for any Space: to enqueue the newly marked object.

Space::trace_object does two things:

  1. Mark an object if it is not already marked. When an object is marked (in GC terms, when a "grey" object becomes "black"), its edges should be enqueued (i.e. adjacent objects turn "grey").
  2. Forward an object if it is not already forwarded. If the current object reference points to a tombstone, find the address of the forwarded object.

So the result of the function should also consist of two parts:

  1. If the object is newly marked, and
  2. if the object is forwarded and, if it is, what is the new address.

The current trace_object signature is

    pub fn trace_object<T: TransitiveClosure>(
        &self,
        trace: &mut T,
        object: ObjectReference,
        semantics: Option<CopySemantics>,
        worker: &mut GCWorker<VM>,
    ) -> ObjectReference {

With this signature, the return value ObjectReference only carries the information of part (2). Part (1) is done by calling trace.process_node(object) just before returning.

Currently, the T: TransitiveClosure type parameter is only ever instantiated with ProcessEdgesWork implementations, and ProcessEdgesWork::process_node(object) has only one implementation: it simply adds object into ProcessEdgesBase::nodes. This makes the polymorphism over T: TransitiveClosure pointless, for it is not polymorphic at all.

What should it return?

trace_object should instead return something like this:

pub struct TraceObjectResult(
    pub bool,                     // true if the object is newly marked
    pub Option<ObjectReference>,  // Some(newobj) if the object is forwarded, otherwise None
);

MallocSpace::trace_object should return TraceObjectResult(true, None) if called on a "grey" object, and TraceObjectResult(false, None) if the object is "black".

If an object is in a from-space, CopySpace::trace_object should return TraceObjectResult(true, Some(newobj)) the first time it visits an object, but TraceObjectResult(false, Some(newobj)) for subsequent visits. For objects in to-spaces, the second part should be None.

The Immix space, depending on whether it is defragging, may either return TraceObjectResult(xxx, None) or TraceObjectResult(xxx, Some(newobj)).
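The per-policy behaviours above can be sketched with a toy non-moving policy. This is a hypothetical illustration, not MMTk's real API: `usize` stands in for `ObjectReference`, `ToyMarkSpace` is an invented stand-in for a mark-only policy like MallocSpace, and the `TraceObjectResult` shape follows the proposal above.

```rust
use std::collections::HashSet;

// Hypothetical result type: (newly marked?, forwarded address if any).
#[derive(Debug, PartialEq)]
pub struct TraceObjectResult(
    pub bool,           // true if the object is newly marked
    pub Option<usize>,  // Some(new_address) if forwarded, otherwise None
);

// A toy mark-only "space": the first visit marks an object, later visits do not.
struct ToyMarkSpace {
    marked: HashSet<usize>,
}

impl ToyMarkSpace {
    fn trace_object(&mut self, object: usize) -> TraceObjectResult {
        // HashSet::insert returns true only when the value was not present.
        let newly_marked = self.marked.insert(object);
        TraceObjectResult(newly_marked, None) // non-moving: never forwards
    }
}
```

A copying policy would differ only in the second component, returning `Some(new_address)` for objects in a from-space.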

How should it be used?

trace_object is called in ProcessEdgesWork::process_edge. The default implementation is:

    #[inline]
    fn process_edge(&mut self, slot: Address) {
        let object = unsafe { slot.load::<ObjectReference>() };
        let new_object = self.trace_object(object);
        if Self::OVERWRITE_REFERENCE {
            unsafe { slot.store(new_object) };
        }
    }

ImmixProcessEdges and GenNurseryProcessEdges override it, but the only difference is the way they decide whether to store back to the slot. With this change, it only needs to be implemented in one way:

    #[inline]
    fn process_edge(&mut self, slot: Address) {
        let object = unsafe { slot.load::<ObjectReference>() };
        let TraceObjectResult(newly_marked, forwarded) = self.trace_object(object);
        if newly_marked {
            self.process_node(forwarded.unwrap_or(object));  // enqueue IFF newly marked
        }
        if let Some(new_object) = forwarded {
            unsafe { slot.store(new_object) };  // store IFF forwarded.
        }
    }
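The unified process_edge can be modelled end-to-end with a self-contained toy tracer. Everything here is a stand-in, not MMTk's API: the "heap" is a `Vec<usize>`, slots are indices into it, and `ToyTracer` is invented for the demo; only the enqueue-iff-newly-marked / store-iff-forwarded logic mirrors the proposal.

```rust
use std::collections::{HashMap, HashSet};

struct TraceObjectResult(bool, Option<usize>);

struct ToyTracer {
    marked: HashSet<usize>,
    forwarding: HashMap<usize, usize>, // old address -> new address
    nodes: Vec<usize>,                 // queue of newly marked objects
}

impl ToyTracer {
    fn trace_object(&mut self, object: usize) -> TraceObjectResult {
        let forwarded = self.forwarding.get(&object).copied();
        // Mark the canonical (post-forwarding) address.
        let canonical = forwarded.unwrap_or(object);
        let newly_marked = self.marked.insert(canonical);
        TraceObjectResult(newly_marked, forwarded)
    }

    fn process_edge(&mut self, heap: &mut [usize], slot: usize) {
        let object = heap[slot];
        let TraceObjectResult(newly_marked, forwarded) = self.trace_object(object);
        if newly_marked {
            self.nodes.push(forwarded.unwrap_or(object)); // enqueue iff newly marked
        }
        if let Some(new_object) = forwarded {
            heap[slot] = new_object; // store back iff forwarded
        }
    }
}
```

Processing the same slot twice enqueues its target only once, and a forwarded reference is written back to the slot on the first visit.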

Discussion

Is it compatible with mark-compact? @tianleq

Is there any performance issue?

@qinsoon
Member

qinsoon commented Mar 14, 2022

I feel we should think more carefully about this. I thought ProcessEdgesWork was replacing TransitiveClosure. However, on second thought, I think maybe ProcessEdgesWork is just one implementation of TransitiveClosure. If you search for TransitiveClosure in Java MMTk, you will see there are many types implementing TransitiveClosure that do not yet exist in mmtk-core, such as reference counting's mod buffer and some write barrier implementations. It is unclear to me whether we will use ProcessEdgesWork for those cases, or whether we will eventually replace some uses of ProcessEdgesWork in our current code base with TransitiveClosure (which is more general).

I agree that TransitiveClosure seems unnecessary in our current code base. I think this needs to be carefully considered before we make any further change.

For your proposal:

  • is_newly_marked suggests marking is happening, which is not true for policies like CopySpace.
  • The performance should be measured, as TraceObjectResult introduces more branches to the performance critical code.

@qinsoon qinsoon added F-question Call For Participation: Unanswered question (need more information) C-cleanup Category: Cleanup A-work-queue Area: Work queue labels Mar 14, 2022
@wks
Collaborator Author

wks commented Mar 15, 2022

I looked at various implementations. While most use cases use it as a real transitive closure in the sense of graph theory, other use cases are, in my opinion, abuses. They actually need callback functions that consume either an object or an edge.

What is TransitiveClosure

In Java MMTk (and Rust MMTk alike), TransitiveClosure is an amalgamation of two call-back functions:

  • processEdge(ObjectReference reference, Address slot)
  • processNode(ObjectReference reference)

Interestingly, both functions have failing default implementations, allowing subclasses to implement only one of the two functions. This alone shows that they are two independent callbacks. From my observation, I find that:

Where are they called:

  • processEdge is always used by Scanning.scanObject or Scanning.specializedScanObject. They call processEdge on each edge of the object.
  • processNode is called in various places:
    • mostly by XxxSpace.traceObject
    • during PROCESS_NEWROOTBUFFER phase in RC collectors, it is called to enqueue objects from the mod buffer
    • in StickyImmix and StickyMS, it is called to enqueue mod buffer objects

I looked at the classes that inherit TransitiveClosure in Java MMTk.

subclasses of TransitiveClosure

  • TraceLocal represents the process of finding reachable objects. I think it is the only legitimate subclass of TransitiveClosure. It actually implements both processEdge (scanning the pointed-to object and forwarding the ref field) and processNode (enqueueing objects).
  • RCZero implements processEdge alone. It just stores NULL to a field. RC collectors call Scanning.scanObject(RCZero) to find all ref fields and store NULL into them. Strictly speaking, RCZero is just a per-field callback, not really a "transitive closure" in the graph-theory sense.
    • RCModifiedProcessor and GenRCModifiedProcessor implement processEdge alone. Like RCZero, RC collectors also call Scanning.scanObject(RCModifiedProcessor) for each object in the mod buffer.
    • RCDecBuffer is also the same: implements processEdge alone, and used as callback of scanObject.
    • ImmortalSpaceDriver.Closure is the same.
  • TraceWriteBuffer implements processNode alone, which adds a node into a buffer. It is instantiated in CMSMutator as the remset. In write barrier, CMSMutator.checkAndEnqueueReference calls XxxSpace.traceObject(TraceWriteBuffer, obj) and, if it is newly marked, it will add the obj into remset.

What have we learned?

From the observation above, we see that

  1. TransitiveClosure means (at least it is originally supposed to mean) "finding all reachable objects". It processes edges and nodes as it progresses, hence the interface. scanObject enqueues edges, and traceObject marks reachable objects and expands the transitive closure.
  2. However, scanObject does not necessarily provide roots of a transitive closure. scanObject may just be used to enumerate object fields.
  3. Similarly, traceObject does not necessarily add objects of a transitive closure. traceObject just sees if an object is first visited, and may forward objects.

Generalising scan_object and trace_object

My original post in this issue described a refactoring for XxxSpace::trace_object that decouples it from TransitiveClosure.

We can do the same for Scanning::scan_object, too. From my experience with Ruby-MMTk, I think it is also necessary. Currently, the signature is:

    fn scan_object<T: TransitiveClosure>(
        trace: &mut T,
        object: ObjectReference,
        tls: VMWorkerThread,
    );

We can refactor it to:

    fn scan_object<F: FnMut(Address)>(
        object: ObjectReference,
        tls: VMWorkerThread,
        callback: F,
    );

F is equivalent to the process_edge method in the TransitiveClosure, except that it is not a &mut, but a value, and the compiler can inline it if it is small. As an FnMut, it can also capture the &mut TransitiveClosure if necessary.
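A minimal sketch of the callback-based scan_object, under stated assumptions: `usize` stands in for Address, and `ToyObject` with its `ref_fields` layout is invented purely for the demo (real object scanning is VM-specific).

```rust
// Hypothetical object: it just knows the addresses of its reference fields.
struct ToyObject {
    ref_fields: Vec<usize>, // usize stands in for Address
}

// Proposed shape: the VM binding invokes `callback` once per edge.
fn scan_object<F: FnMut(usize)>(object: &ToyObject, mut callback: F) {
    for &slot in &object.ref_fields {
        callback(slot);
    }
}
```

A caller can pass a closure that captures mutable state (e.g. an edge buffer), which is exactly what lets an FnMut replace the `&mut TransitiveClosure` parameter.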

Generalising "edge"

Treating an edge as Address is sufficient for OpenJDK and so on. For Ruby, and maybe other VMs, too, it may be necessary to generalise "edge", too, because sometimes an "edge" may not be simply a memory location.

    fn scan_object<E: Edge, F: FnMut(E)>(
        object: ObjectReference,
        tls: VMWorkerThread,
        callback: F,
    );

Detailed in #573
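One way the generalised Edge could look, as a hedged sketch (the trait name follows the signature above, but the methods and the `SimpleSlot` impl are assumptions, not #573's actual design): an edge knows how to load and store the reference it holds, and a plain memory slot is just the simplest implementation.

```rust
use std::cell::Cell;

// Hypothetical Edge abstraction; usize stands in for ObjectReference.
trait Edge {
    fn load(&self) -> usize;
    fn store(&self, object: usize);
}

// The common case: an edge is simply a memory location
// (a Cell here, so the demo stays in safe Rust).
struct SimpleSlot<'a>(&'a Cell<usize>);

impl<'a> Edge for SimpleSlot<'a> {
    fn load(&self) -> usize {
        self.0.get()
    }
    fn store(&self, object: usize) {
        self.0.set(object);
    }
}

// A scan loop is generic over the edge representation.
fn visit_edges<E: Edge, F: FnMut(&E)>(edges: &[E], mut callback: F) {
    for edge in edges {
        callback(edge);
    }
}
```

A VM whose "edges" are not raw addresses (e.g. tagged or offsetted references) would provide its own Edge impl without changing the scan loop.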

@tianleq
Collaborator

tianleq commented Mar 15, 2022

@wks I believe it should be compatible with mark-compact.

@wks
Collaborator Author

wks commented Mar 16, 2022

Some history of the TransitiveClosure type in Java MMTk and Rust MMTk:

  1. In the beginning, there was only TraceLocal.
    • It was semantically the same TraceLocal as it is today. It did not have any super classes.
    • Scanning.scanObject took TraceLocal as argument.
  2. Someone introduced a new RC algorithm.
    • It introduced a base class TraceStep for TraceLocal that only had processEdge (was named traceObjectLocation), and used it as the base class of RC trial-deletion steps that visited object fields (i.e. needed to be passed to scanObject), but were not doing tracing as in tracing collectors.
    • Scanning.scanObject then took TraceStep as parameter.
  3. Someone then did a refactoring to facilitate concurrent GC and pre-fetching.
    • It introduced TransitiveClosure that had both processEdge and processNode. Both processEdge and processNode provided panicking default implementations, so existing subclasses that were not TraceLocal only needed to implement one of them.
    • Scanning.scanObject then took TransitiveClosure as parameter.
  4. Trial deletion was later removed, but TransitiveClosure remained.
  5. We ported MMTk from Java to Rust.
    • We kept the TransitiveClosure trait, although we do not have any RC-based collectors.
    • We ported the Scanning interface "literally" to Rust, keeping the TransitiveClosure parameter.
  6. We introduced the work packet framework.
    • The Scheduler superseded the TraceLocal.
    • We discussed removing TransitiveClosure in that PR, but it was not removed.

From step 2, we see TraceStep was initially intended to be used as a callback for scanObject. TransitiveClosure is its successor, but it conflated two concerns into one:

  1. to support both edge enqueueing and object enqueueing, and
  2. to be the callback of scanObject.

And they should be separated.

@wks wks changed the title Remove TransitiveClosure param from Space::trace_object Remove TransitiveClosure param from Space::trace_object and Scanning::scan_object Mar 16, 2022
@wks
Collaborator Author

wks commented May 17, 2022

I tried to change the return value of trace_object to something equivalent to Option<ObjectReference>. Running it with lusearch, a GC-intensive benchmark, the effect on execution time and STW time is negligible.

Key: execution_times
|Benchmark|Group            |Trunk(ms)       |                     |       |Branch(ms)      |                                   |       |Diff  |                     |
|:-------:|:---------------:|:--------------:|:-------------------:|:-----:|:--------------:|:---------------------------------:|:-----:|:----:|:-------------------:|
|         |                 |mean            |mean without outliers|median |mean            |mean without outliers              |median |mean  |mean without outliers|
|:-------:|:---------------:|:--------------:|:-------------------:|:-----:|:--------------:|:---------------------------------:|:-----:|:----:|:-------------------:|
|lusearch |mmtk_gc-SemiSpace|20073.55 ±88.02 |20073.55 ±88.02      |20095.0|20103.75 ±123.07|20103.75 ±123.07                   |20013.0|+0.15%|+0.15%               |
|lusearch |mmtk_gc-Immix    |15770.95 ±116.82|15770.95 ±116.82     |15713.5|15773.70 ±140.14|15773.70 ±140.14                   |15714.5|+0.02%|+0.02%               |
|lusearch |mmtk_gc-GenImmix |16416.10 ±36.48 |16416.10 ±36.48      |16428.0|16416.20 ±45.48 |16401.16 ±34.73 :warning: 1 removed|16405.0|+0.00%|-0.09%               |

Key: time.stw
|Benchmark|Group            |Trunk(ms)     |                     |                  |Branch(ms)     |                                  |        |Diff  |                     |
|:-------:|:---------------:|:------------:|:-------------------:|:----------------:|:-------------:|:--------------------------------:|:------:|:----:|:-------------------:|
|         |                 |mean          |mean without outliers|median            |mean           |mean without outliers             |median  |mean  |mean without outliers|
|:-------:|:---------------:|:------------:|:-------------------:|:----------------:|:-------------:|:--------------------------------:|:------:|:----:|:-------------------:|
|lusearch |mmtk_gc-SemiSpace|8453.94 ±69.94|8453.94 ±69.94       |8452.49           |8461.10 ±106.85|8461.10 ±106.85                   |8346.715|+0.08%|+0.08%               |
|lusearch |mmtk_gc-Immix    |3994.37 ±89.71|3994.37 ±89.71       |3982.8450000000003|3984.88 ±91.55 |3984.88 ±91.55                    |3928.17 |-0.24%|-0.24%               |
|lusearch |mmtk_gc-GenImmix |4478.06 ±9.46 |4478.06 ±9.46        |4480.345          |4464.85 ±15.73 |4459.55 ±11.79 :warning: 1 removed|4454.125|-0.29%|-0.41%               |

Key: time.other
|Benchmark|Group            |Trunk(ms)      |                     |         |Branch(ms)     |                                   |        |Diff  |                     |
|:-------:|:---------------:|:-------------:|:-------------------:|:-------:|:-------------:|:---------------------------------:|:------:|:----:|:-------------------:|
|         |                 |mean           |mean without outliers|median   |mean           |mean without outliers              |median  |mean  |mean without outliers|
|:-------:|:---------------:|:-------------:|:-------------------:|:-------:|:-------------:|:---------------------------------:|:------:|:----:|:-------------------:|
|lusearch |mmtk_gc-SemiSpace|11619.89 ±32.51|11619.89 ±32.51      |11629.4  |11642.74 ±40.40|11642.74 ±40.40                    |11634.66|+0.20%|+0.20%               |
|lusearch |mmtk_gc-Immix    |11776.72 ±41.63|11776.72 ±41.63      |11796.96 |11788.97 ±73.40|11760.00 ±43.77 :warning: 1 removed|11791.55|+0.10%|-0.14%               |
|lusearch |mmtk_gc-GenImmix |11938.21 ±36.51|11938.21 ±36.51      |11951.695|11951.49 ±48.90|11934.93 ±36.51 :warning: 1 removed|11943.72|+0.11%|-0.03%               |

@k-sareen
Collaborator

Sorry, I was wondering what machine did you use and how many invocations did you run it for?

@wks
Collaborator Author

wks commented May 17, 2022

Sorry, I was wondering what machine did you use and how many invocations did you run it for?

20 invocations on hare.moma

@wks
Collaborator Author

wks commented May 22, 2022

I developed two solutions for removing `TransitiveClosure`.

  1. trace-object-return2: Removed the trace: T parameter from Space::trace_object. Use the return value to indicate which object to enqueue.
  2. object-queue-trait: Rename TransitiveClosure to ObjectQueue; rename process_node to enqueue. ProcessEdgesWork no longer implements TransitiveClosure. Create a dedicated VectorObjectQueue (holding a Vec<ObjectReference>) that implements the ObjectQueue trait.
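The object-queue-trait solution can be sketched as follows. This is a hypothetical reconstruction from the description above, not the branch's actual code: `usize` stands in for `ObjectReference`, and the mark-set parameter of `trace_object` is invented to keep the demo self-contained.

```rust
use std::collections::HashSet;

// Narrow trait replacing TransitiveClosure: only object enqueueing.
trait ObjectQueue {
    fn enqueue(&mut self, object: usize); // usize stands in for ObjectReference
}

// Dedicated vector-backed queue, as described for VectorObjectQueue.
struct VectorObjectQueue {
    buffer: Vec<usize>,
}

impl ObjectQueue for VectorObjectQueue {
    fn enqueue(&mut self, object: usize) {
        self.buffer.push(object);
    }
}

// A policy's trace_object keeps a queue parameter, but typed against the
// narrow ObjectQueue trait rather than a ProcessEdgesWork-specific closure.
fn trace_object(
    queue: &mut impl ObjectQueue,
    marked: &mut HashSet<usize>,
    object: usize,
) -> usize {
    if marked.insert(object) {
        queue.enqueue(object); // enqueue only on the first visit
    }
    object // non-moving sketch: the same reference is returned
}
```

Compared with solution 1, trace_object still receives a queue, but the queue's type now says nothing about work packets.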

I prepared three builds for testing. All used OpenJDK, but with different mmtk-core revisions.

I ran lusearch from DaCapo Chopin on hare.moma using the SemiSpace, Immix, GenCopy and GenImmix plans, and compared the stop-the-world time.

I ran the test three times, with slightly different configurations.

The first time

In the beginning, I only developed the trace-object-return2 branch (build2), so I only ran it with build1 and build2.

I used 6x heap size, 20 iterations, 5 invocations each.

Comparing build1 and build2:

Key: time.stw
|Benchmark|Group            |Trunk(ms)     |                                 |                  |Branch(ms)    |                                  |                  |Diff  |                     |
|:-------:|:---------------:|:------------:|:-------------------------------:|:----------------:|:------------:|:--------------------------------:|:----------------:|:----:|:-------------------:|
|         |                 |mean          |mean without outliers            |median            |mean          |mean without outliers             |median            |mean  |mean without outliers|
|:-------:|:---------------:|:------------:|:-------------------------------:|:----------------:|:------------:|:--------------------------------:|:----------------:|:----:|:-------------------:|
|lusearch |mmtk_gc-SemiSpace|6019.62 ±66.29|6019.62 ±66.29                   |5980.605          |6044.05 ±51.99|6026.28 ±38.43 :warning: 1 removed|6014.665          |+0.41%|+0.11%               |
|lusearch |mmtk_gc-Immix    |3088.54 ±87.36|3088.54 ±87.36                   |3103.9300000000003|3123.66 ±82.21|3123.66 ±82.21                    |3165.35           |+1.14%|+1.14% :red_square:  |
|lusearch |mmtk_gc-GenCopy  |3455.07 ±6.28 |3455.07 ±6.28                    |3455.17           |3468.93 ±6.98 |3468.93 ±6.98                     |3466.75           |+0.40%|+0.40%               |
|lusearch |mmtk_gc-GenImmix |3311.27 ±10.13|3307.21 ±5.85 :warning: 1 removed|3308.38           |3304.93 ±3.62 |3304.93 ±3.62                     |3305.3549999999996|-0.19%|-0.07%               |

Plot: Both box plot and violin plot are drawn. The yellow line in the box is the median, and the yellow line in the violin is the mean. The data points of all 20 iterations are also drawn.

(plot image)

From the result of the first time, I thought build2 had a significant performance impact, especially with Immix.

Then I coded a more conservative approach (still passing the queue, not using return value for queueing) in the object-queue-trait branch (build3). However, to further confirm the impact of build2, I ran the test with build1 and build2 again.

The second time

I ran the same tests (with the same built binaries of build1 and build2) with 8x heap size (some tests failed with 6x heap size) and 40 invocations, and the results are a bit different:

Comparing build1 and build2:

Key: time.stw
|Benchmark|Group            |Trunk(ms)     |                     |                  |Branch(ms)    |                     |       |Diff  |                     |
|:-------:|:---------------:|:------------:|:-------------------:|:----------------:|:------------:|:-------------------:|:-----:|:----:|:-------------------:|
|         |                 |mean          |mean without outliers|median            |mean          |mean without outliers|median |mean  |mean without outliers|
|:-------:|:---------------:|:------------:|:-------------------:|:----------------:|:------------:|:-------------------:|:-----:|:----:|:-------------------:|
|lusearch |mmtk_gc-SemiSpace|3966.75 ±24.36|3966.75 ±24.36       |3947.4049999999997|3979.45 ±24.90|3979.45 ±24.90       |3978.84|+0.32%|+0.32%               |
|lusearch |mmtk_gc-Immix    |2220.77 ±43.71|2220.77 ±43.71       |2220.4399999999996|2194.67 ±37.18|2194.67 ±37.18       |2199.27|-1.18%|-1.18% :green_square:|
|lusearch |mmtk_gc-GenCopy  |3271.14 ±2.97 |3271.14 ±2.97        |3272.5            |3271.48 ±2.73 |3271.48 ±2.73        |3273.04|+0.01%|+0.01%               |
|lusearch |mmtk_gc-GenImmix |3228.16 ±2.69 |3228.16 ±2.69        |3229.24           |3227.09 ±2.59 |3227.09 ±2.59        |3227.4 |-0.03%|-0.03%               |

Plot: Both box plot and violin plot are drawn. The yellow line in the box is the median, and the yellow line in the violin is the mean. The data points of all 40 iterations are also drawn.

(plot image)

Strangely, this time, build2 was faster than build1 with Immix and GenImmix, although the binaries are the same as before.

The third time

Then I ran the test with all of build1, build2 and build3 (build1 and build2 are still the same binaries as before), this time using 6x heap size (no runs failed) and 40 iterations.

Comparing build1 and build2:

Key: time.stw
|Benchmark|Group            |Trunk(ms)     |                     |                  |Branch(ms)    |                     |        |Diff  |                     |
|:-------:|:---------------:|:------------:|:-------------------:|:----------------:|:------------:|:-------------------:|:------:|:----:|:-------------------:|
|         |                 |mean          |mean without outliers|median            |mean          |mean without outliers|median  |mean  |mean without outliers|
|:-------:|:---------------:|:------------:|:-------------------:|:----------------:|:------------:|:-------------------:|:------:|:----:|:-------------------:|
|lusearch |mmtk_gc-SemiSpace|5940.29 ±31.94|5940.29 ±31.94       |5927.175          |5966.03 ±26.46|5966.03 ±26.46       |5955.52 |+0.43%|+0.43%               |
|lusearch |mmtk_gc-Immix    |3015.77 ±68.98|3015.77 ±68.98       |3042.665          |3019.62 ±63.61|3019.62 ±63.61       |3050.08 |+0.13%|+0.13%               |
|lusearch |mmtk_gc-GenCopy  |3455.95 ±5.12 |3455.95 ±5.12        |3456.34           |3462.73 ±3.57 |3462.73 ±3.57        |3461.435|+0.20%|+0.20%               |
|lusearch |mmtk_gc-GenImmix |3309.24 ±3.93 |3309.24 ±3.93        |3311.6549999999997|3308.08 ±3.16 |3308.08 ±3.16        |3308.435|-0.04%|-0.04%               |

Comparing build1 and build3:

Key: time.stw
|Benchmark|Group            |Trunk(ms)     |                     |                  |Branch(ms)    |                     |                  |Diff  |                     |
|:-------:|:---------------:|:------------:|:-------------------:|:----------------:|:------------:|:-------------------:|:----------------:|:----:|:-------------------:|
|         |                 |mean          |mean without outliers|median            |mean          |mean without outliers|median            |mean  |mean without outliers|
|:-------:|:---------------:|:------------:|:-------------------:|:----------------:|:------------:|:-------------------:|:----------------:|:----:|:-------------------:|
|lusearch |mmtk_gc-SemiSpace|5940.29 ±31.94|5940.29 ±31.94       |5927.175          |6061.95 ±34.06|6061.95 ±34.06       |6038.43           |+2.05%|+2.05% :red_square:  |
|lusearch |mmtk_gc-Immix    |3015.77 ±68.98|3015.77 ±68.98       |3042.665          |3081.19 ±70.57|3081.19 ±70.57       |3086.3199999999997|+2.17%|+2.17% :red_square:  |
|lusearch |mmtk_gc-GenCopy  |3455.95 ±5.12 |3455.95 ±5.12        |3456.34           |3463.21 ±4.44 |3463.21 ±4.44        |3460.395          |+0.21%|+0.21%               |
|lusearch |mmtk_gc-GenImmix |3309.24 ±3.93 |3309.24 ±3.93        |3311.6549999999997|3308.02 ±5.52 |3308.02 ±5.52        |3307.995          |-0.04%|-0.04%               |

Plot: Both box plot and violin plot are drawn. The yellow line in the box is the median, and the yellow line in the violin is the mean. The data points of all 40 iterations are also drawn.

(plot image)

This time, build3 is the slowest, although I thought build3 should be very similar to build1 while build2 may have some visible overhead.

Conclusion

It seems that the STW time varies too much, and even 40 invocations are too few to show which build is the clear winner. Either the impact is insignificant, or I need a better way to do the experiment.

@k-sareen
Collaborator

k-sareen commented May 22, 2022

Might be worth using a different benchmark such as biojava or xalan, which I believe are still quite generational (though not as much as lusearch, of course). If 40 invocations are not showing a substantial difference, then there may be no statistically significant difference. I will note that you are using a fairly old machine/CPU microarchitecture. If possible, I would also recommend trying a different (and newer) CPU microarchitecture. We have seen changes that improve performance on benchmarks on older machines while significantly reducing performance on newer machines (for example, the performance regression Wenyu found).

EDIT: Going by the LXR paper, sunflow is another good candidate as it allocates a lot and has a very low survival rate.

@qinsoon
Member

qinsoon commented May 23, 2022

  1. trace-object-return2: Removed the trace: T parameter from Space::trace_object. Use the return value to indicate which object to enqueue.
  2. object-queue-trait: Rename TransitiveClosure to ObjectQueue; rename process_node to enqueue. ProcessEdgesWork no longer implements TransitiveClosure. Create a dedicated VectorObjectQueue (holding a Vec<ObjectReference>) that implement the ObjectQueue trait.

Solution 2 looks straightforward. It just moves the enqueue from ProcessEdgesWork to a separate struct. With proper inlining, I don't feel there should be any performance difference.

Solution 1 adds more clarity to the code. The policy's trace_object() no longer needs to know anything about enqueueing or how objects are processed in the work packet -- it just returns what it knows.

@wks
Collaborator Author

wks commented May 23, 2022

@k-sareen I re-ran the last test on bobcat.moma. It is exactly the same source code and running command as the last test on hare.moma.

Comparing build1 and build2:

Key: time.stw
|Benchmark|Group            |Trunk(ms)|                     |        |Branch(ms)|                                  |        |Diff  |                     |
|:-------:|:---------------:|:-------:|:-------------------:|:------:|:--------:|:--------------------------------:|:------:|:----:|:-------------------:|
|         |                 |mean     |mean without outliers|median  |mean      |mean without outliers             |median  |mean  |mean without outliers|
|:-------:|:---------------:|:-------:|:-------------------:|:------:|:--------:|:--------------------------------:|:------:|:----:|:-------------------:|
|lusearch |mmtk_gc-SemiSpace|None     |5390.96 ±19.55       |5375.55 |None      |5399.52 ±15.74 :warning: 1 removed|5396.16 |+0.25%|+0.16%               |
|lusearch |mmtk_gc-Immix    |None     |2341.42 ±34.35       |2318.215|None      |2330.23 ±43.70                    |2309.41 |-0.48%|-0.48%               |
|lusearch |mmtk_gc-GenCopy  |None     |3592.88 ±10.23       |3589.65 |None      |3598.64 ±9.40                     |3597.64 |+0.16%|+0.16%               |
|lusearch |mmtk_gc-GenImmix |None     |3098.02 ±6.82        |3100.5  |None      |3094.43 ±7.44                     |3090.325|-0.12%|-0.12%               |

Comparing build1 and build3:

Key: time.stw
|Benchmark|Group            |Trunk(ms)|                     |        |Branch(ms)|                                  |                  |Diff  |                     |
|:-------:|:---------------:|:-------:|:-------------------:|:------:|:--------:|:--------------------------------:|:----------------:|:----:|:-------------------:|
|         |                 |mean     |mean without outliers|median  |mean      |mean without outliers             |median            |mean  |mean without outliers|
|:-------:|:---------------:|:-------:|:-------------------:|:------:|:--------:|:--------------------------------:|:----------------:|:----:|:-------------------:|
|lusearch |mmtk_gc-SemiSpace|None     |5390.96 ±19.55       |5375.55 |None      |5428.69 ±16.09                    |5430.3150000000005|+0.70%|+0.70%               |
|lusearch |mmtk_gc-Immix    |None     |2341.42 ±34.35       |2318.215|None      |2331.02 ±34.74 :warning: 1 removed|2301.065          |+0.03%|-0.44%               |
|lusearch |mmtk_gc-GenCopy  |None     |3592.88 ±10.23       |3589.65 |None      |3603.91 ±9.25 :warning: 1 removed |3609.0299999999997|+0.46%|+0.31%               |
|lusearch |mmtk_gc-GenImmix |None     |3098.02 ±6.82        |3100.5  |None      |3094.58 ±5.20 :warning: 1 removed |3092.5            |+0.04%|-0.11%               |

Plot:
(plot image)

It looks like there is no clear winner, as the results of all builds are quite noisy. But the plot for SemiSpace concerns me, because like on hare.moma, build2 and build3 seem to be systematically slower than build1.

@k-sareen
Collaborator

Bobcat is an Alder Lake CPU which can give funky results due to its Efficiency cores. Did you make sure that the benchmarks were run only on the Performance cores? (Something like numactl or taskset can be used, I think.)

Though yes, it does seem quite weird that SS's performance is slightly worse. Not quite sure why it would be slower for build 3.

@wks
Collaborator Author

wks commented May 23, 2022

@k-sareen No. I didn't do anything special about big/small cores. But

  1. If any real-world users run OpenJDK and/or MMTk on that machine, they would enable those small cores.
  2. If the small cores are a problem, they should interfere with all three builds.

I guess it may have something to do with function inlining. I'll check.

@wks
Collaborator Author

wks commented May 24, 2022

Repeated the last experiment for SemiSpace on bobcat.moma. This time I have 9 builds:

  • build1x = build1y = build1z = build1
  • build2x = build2y = build2z = build2
  • build3x = build3y = build3z = build3

That is, I duplicated the runtime definition of build1 into build1x, build1y and build1z. They point to the same binary, but appear as three distinct runtimes in the running script. I did the same for build2 and build3. I ran lusearch on all "nine" builds, 40 iterations each. (Effectively, that's 120 iterations for each actual build, but we plot the result of iteration 0, 3, 6, 9, ... as "build1x", iteration 1, 4, 7, 10, ... as "build1y" and iteration 2, 5, 8, 11, ... as "build1z".) The results are:

image

I expect the distribution of build1x, build1y and build1z to be identical, because they are the same binary. The same should be true for build2{x,y,z} and build3{x,y,z}.

On the contrary, if build2 were really worse than build1, I would expect all of build2{x,y,z} to be consistently worse than build1{x,y,z}. The same applies to build3.

But from the plot, it is not the case. The distribution can vary greatly even for the same actual build (i.e. the same binary).

I conclude that there is no significant difference between build1, build2 and build3. If there is any, it is less significant than the noise.

FYI, in the following plot, each column contains all the combined 120 data points for each actual build (build1 = build1x + build1y + build1z, ...).

image

@wks
Copy link
Collaborator Author

wks commented Jun 6, 2022

I ran the last benchmark after the work stealing PR was merged. 40 iterations, 5 invocations each, 6x min heap size. build1=master, build2=trace-object-return2, build3=object-queue-trait

("build4" is identical to "build1" except its directory name is extremely long. I thought the directory name length may affect the live objects in the heap that holds directory names (such as jar files), causing SemiSpace to copy more data than others. But from the data, it does not seem to be the case.)

image

Running SemiSpace alone for 120 iterations (build1{x,y,z} are the same, ...):

image

Combined:

image

It looks like the work stealing PR made tracing more efficient, and amplified the impact of build3. It seems to affect the GC algorithms that do more copying. I'll look into that.

@wks
Copy link
Collaborator Author

wks commented Jun 6, 2022

I wrote a microbenchmark that creates a 22-level binary tree and then runs System.gc() 100 times, recording the time taken by each System.gc(). The plan is SemiSpace, and MMTK_THREADS is 1.
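For reference, a sketch of such a microbenchmark — this is my reconstruction from the description above, not the exact code used; the class and method names are hypothetical:

```java
// Reconstruction of the described microbenchmark: build a 22-level
// binary tree, keep it live, then time 100 System.gc() calls.
public class GcPauseBench {
    static final class Node {
        Node left, right;
        Node(Node l, Node r) { left = l; right = r; }
    }

    // A depth-d tree has 2^d - 1 nodes; depth 22 is ~4M nodes.
    static Node makeTree(int depth) {
        if (depth == 0) return null;
        return new Node(makeTree(depth - 1), makeTree(depth - 1));
    }

    public static void main(String[] args) {
        Node root = makeTree(22); // kept live across all GCs
        for (int i = 0; i < 100; i++) {
            long start = System.nanoTime();
            System.gc();
            System.out.println(System.nanoTime() - start);
        }
        if (root == null) throw new AssertionError(); // keep root reachable
    }
}
```

Because the live set is fixed, each System.gc() traces (and, under SemiSpace, copies) roughly the same amount of data, so per-GC times are directly comparable across builds.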

I ran it on build1 and build3, interleaved, 5 invocations each. That resulted in 500 data points (GC time for each System.gc(), in nanoseconds) for each build. The plot is as follows:

image

I ran that on my local machine, a Coffee Lake, without any special tuning such as disabling frequency scaling. The result is quite obvious: it is not noise.

@wks
Copy link
Collaborator Author

wks commented Jun 6, 2022

Here are the data for four different plans. From the data, it seems the problem affects SemiSpace exclusively.

image

With outliers (z-score ≥ 3) removed:

image

@wks
Copy link
Collaborator Author

wks commented Jun 6, 2022

Found the problem.

CopySpace::trace_object has the annotation #[inline]. After I changed the signature of CopySpace::trace_object, the Rust compiler somehow decided not to inline it. This makes SemiSpace much slower, because it mainly uses CopySpace.

The evidence is the result of perf.

Output of perf record and perf report for the master branch (build1):

  38.93%  MMTk Collector   libmmtk_openjdk.so    [.] mmtk::util::object_forwarding::forward_object
  27.07%  MMTk Collector   libmmtk_openjdk.so    [.] mmtk::scheduler::gc_work::<impl mmtk::scheduler::work::GCWork<<E as mmtk::scheduler::gc_work::ProcessEdgesWork>::VM> for E>::do_wo
  12.61%  MMTk Collector   libmmtk_openjdk.so    [.] <mmtk_openjdk::abi::InstanceKlass as mmtk_openjdk::object_scanning::OopIterate>::oop_iterate
   5.94%  MMTk Collector   libmmtk_openjdk.so    [.] <mmtk::scheduler::gc_work::PlanScanObjects<E,P> as mmtk::scheduler::work::GCWork<<E as mmtk::scheduler::gc_work::ProcessEdgesWork>
   2.80%  MMTk Collector   libc.so.6             [.] 0x000000000015de0a
   2.80%  MMTk Collector   libc.so.6             [.] 0x000000000015d18d
   1.98%  MMTk Collector   libc.so.6             [.] 0x000000000015d1a1
   1.43%  MMTk Collector   libmmtk_openjdk.so    [.] <std::error::<impl core::convert::From<alloc::string::String> for alloc::boxed::Box<dyn std::error::Error+core::marker::Sync+core:
...

The output of the object-queue-trait branch (build3):

  33.98%  MMTk Collector   libmmtk_openjdk.so    [.] mmtk::util::object_forwarding::forward_object
  22.08%  MMTk Collector   libmmtk_openjdk.so    [.] mmtk::policy::copyspace::CopySpace<VM>::trace_object
  11.38%  MMTk Collector   libmmtk_openjdk.so    [.] <mmtk_openjdk::abi::InstanceKlass as mmtk_openjdk::object_scanning::OopIterate>::oop_iterate
  10.76%  MMTk Collector   libmmtk_openjdk.so    [.] mmtk::scheduler::gc_work::<impl mmtk::scheduler::work::GCWork<<E as mmtk::scheduler::gc_work::ProcessEdgesWork>::VM> for E>::do_wo
   7.23%  MMTk Collector   libmmtk_openjdk.so    [.] <mmtk::scheduler::gc_work::PlanScanObjects<E,P> as mmtk::scheduler::work::GCWork<<E as mmtk::scheduler::gc_work::ProcessEdgesWork>
   2.82%  MMTk Collector   libc.so.6             [.] 0x000000000015de0a
   2.58%  MMTk Collector   libc.so.6             [.] 0x000000000015d18d
   1.87%  MMTk Collector   libc.so.6             [.] 0x000000000015d1a1
   1.28%  MMTk Collector   libmmtk_openjdk.so    [.] <std::error::<impl core::convert::From<alloc::string::String> for alloc::boxed::Box<dyn std::error::Error+core::marker::Sync+core:

I changed the annotation to #[inline(always)], and perf no longer records trace_object as a separate symbol:

  38.27%  MMTk Collector   libmmtk_openjdk.so    [.] mmtk::util::object_forwarding::forward_object
  27.82%  MMTk Collector   libmmtk_openjdk.so    [.] mmtk::scheduler::gc_work::<impl mmtk::scheduler::work::GCWork<<E as mmtk::scheduler::gc_work::ProcessEdgesWork>::VM> for E>::do_wo
  12.60%  MMTk Collector   libmmtk_openjdk.so    [.] <mmtk_openjdk::abi::InstanceKlass as mmtk_openjdk::object_scanning::OopIterate>::oop_iterate
   5.70%  MMTk Collector   libmmtk_openjdk.so    [.] <mmtk::scheduler::gc_work::PlanScanObjects<E,P> as mmtk::scheduler::work::GCWork<<E as mmtk::scheduler::gc_work::ProcessEdgesWork>
   2.95%  MMTk Collector   libc.so.6             [.] 0x000000000015de0a
   2.78%  MMTk Collector   libc.so.6             [.] 0x000000000015d18d
   2.02%  MMTk Collector   libc.so.6             [.] 0x000000000015d1a1
   1.44%  MMTk Collector   libmmtk_openjdk.so    [.] <std::error::<impl core::convert::From<alloc::string::String> for alloc::boxed::Box<dyn std::error::Error+core::marker::Sync+core:
   0.73%  MMTk Collector   libc.so.6             [.] 0x000000000015d184

The plot after changing to #[inline(always)] in the object-queue-trait branch (I left the master branch unchanged).

image

I'll further investigate other spaces to make sure their trace_object methods are always inlined.

@wks
Copy link
Collaborator Author

wks commented Jun 6, 2022

Fortunately, ImmixSpace::trace_object already has #[inline(always)], so it is not affected by my change.

I also tried adding #[inline(always)] to many functions called on the hot path of trace_object, but that slowed down the program.

So I just added #[inline(always)] to CopySpace::trace_object.
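To illustrate the fix, here is a minimal, self-contained sketch — `CopySpace` and `trace_object` below are simplified stand-ins, not the actual mmtk-core code:

```rust
// Minimal sketch: force inlining of a hot-path method.
// `#[inline]` is only a hint the compiler may ignore (as happened after
// the signature change); `#[inline(always)]` forces the call to be
// inlined into the tracing loop.
struct CopySpace {
    copied: usize,
}

impl CopySpace {
    #[inline(always)]
    fn trace_object(&mut self, object: usize) -> usize {
        self.copied += 1;
        object + 1 // pretend the object was forwarded
    }
}

fn main() {
    let mut space = CopySpace { copied: 0 };
    let mut obj = 0usize;
    for _ in 0..3 {
        obj = space.trace_object(obj);
    }
    println!("{} {}", space.copied, obj); // prints "3 3"
}
```

Note that `#[inline(always)]` trades code size for speed, which is why blindly applying it to every function on the hot path (as tried above) can make things slower.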

@wks
Copy link
Collaborator Author

wks commented Jun 7, 2022

This is the result of lusearch on bobcat.moma after fixing the inlining issue in build3 (other builds are not changed). 5 invocations each, 20 iterations, 6x heap size.

Comparing build1 and build3 (I decided to discard build2 in favour of build3):

Key: time.stw
|Benchmark|Group|Trunk mean (ms)|Trunk mean w/o outliers (ms)|Trunk median (ms)|Branch mean (ms)|Branch mean w/o outliers (ms)|Branch median (ms)|Diff (mean)|Diff (mean w/o outliers)|
|:-------:|:---------------:|:------------:|:-------------------------------:|:----------------:|:------------:|:-------------------:|:----------------:|:----:|:-------------------:|
|lusearch |mmtk_gc-SemiSpace|5350.06 ±14.45|5350.06 ±14.45                   |5354.165          |5357.86 ±13.47|5357.86 ±13.47       |5373.375          |+0.15%|+0.15%               |
|lusearch |mmtk_gc-Immix    |2272.16 ±51.02|2272.16 ±51.02                   |2212.0600000000004|2257.78 ±44.12|2257.78 ±44.12       |2227.2799999999997|-0.63%|-0.63%               |
|lusearch |mmtk_gc-GenCopy  |3510.02 ±6.08 |3510.02 ±6.08                    |3508.785          |3509.63 ±6.65 |3509.63 ±6.65        |3507.6800000000003|-0.01%|-0.01%               |
|lusearch |mmtk_gc-GenImmix |3017.57 ±10.86|3014.01 ±8.36 :warning: 1 removed|3007.9399999999996|3014.95 ±6.46 |3014.95 ±6.46        |3017.525          |-0.09%|+0.03%               |

Plot:

image

The difference between the mean times of build1 and build3 is very small. Their distributions of the data points are similar, too.
