Skip to content

Conversation

@merykitty
Copy link
Member

@merykitty merykitty commented Dec 11, 2025

Hi,

The current escape analysis mechanism is all-or-nothing: either the object does not escape, or it does. If the object escapes, we lose the ability to analyse the values of its fields completely, even if the object only escapes at return.

This PR tries to find the escape status of an object at a load, and if it is decided that the object has not escaped there, we can try folding the load aggressively, ignoring calls and memory barriers to find a corresponding store that the load observes.

For the runtime cost, this phase runs very fast, around 5 - 7% the runtime of EA, and about 0.5% the total runtime of C2.

Please take a look and leave your thoughts, thanks a lot.


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8373495: C2: Aggressively fold loads from objects that have not escaped (Enhancement - P4)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/28764/head:pull/28764
$ git checkout pull/28764

Update a local copy of the PR:
$ git checkout pull/28764
$ git pull https://git.openjdk.org/jdk.git pull/28764/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 28764

View PR using the GUI difftool:
$ git pr show -t 28764

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/28764.diff

Using Webrev

Link to Webrev Comment

@bridgekeeper
Copy link

bridgekeeper bot commented Dec 11, 2025

👋 Welcome back qamai! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk
Copy link

openjdk bot commented Dec 11, 2025

❗ This change is not yet ready to be integrated.
See the Progress checklist in the description for automated requirements.

@openjdk openjdk bot added the hotspot-compiler hotspot-compiler-dev@openjdk.org label Dec 11, 2025
@openjdk
Copy link

openjdk bot commented Dec 11, 2025

@merykitty The following label will be automatically applied to this pull request:

  • hotspot-compiler

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the rfr Pull request is ready for review label Dec 11, 2025
@mlbridge
Copy link

mlbridge bot commented Dec 11, 2025

Webrevs

@robcasloz
Copy link
Contributor

Interesting improvement, thanks for working in this area, Quan Anh! Please allow us some time to think thoroughly about it and how it relates to other plans to improve escape analysis and scalar replacement in C2.

@merykitty
Copy link
Member Author

@robcasloz Thanks for taking a look. I also wonder how this relates to other potential improvements to EA. I think that this can work as an independent step or as a first step toward those goals. I am also pretty excited to realize that we don't need to schedule the graph to know if a load can be folded in such a manner, hope this can also be useful.

Comment on lines 132 to 133
// In this case, even if the load x = o.value is declared after the store of o to p that allows o
// to escape, it is valid for the load to actually happen before the store. As a result, we can
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it is correct. If p is external other thread can modify its fields concurrently.
Or are you saying that if p is external we will always have memory barrier?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the Java memory model allows this reordering and places the responsibility on the programmer to use a synchronization mechanism if the reordering is undesirable, no?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think. May be we should add comment about that.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added comments to further stress the importance of memory barriers if the developer needs the accesses serialized.

// Phi
// We can see that the object can be considered non-escape at NarrowMemProj, CallJava(null), and
// Proj2, while it is considered escape at CallJava(o), Proj1, Phi. The loads x and z will be
// from NarrowMemProj and Proj2, respectively, which means they can be considered loads from an
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this optimization is based on JDK-8327963 changes which introduced NarrowMemProj. But I don't see you can for it in code.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is only for demonstration based on the current shape of the graph. Implementation-wise, we walk the graph until we meet an InitializeNode, at that point we call InitializeNode::find_captured_store, so you can say it is not important what kind of Proj an InitializeNode has.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay

@iwanowww
Copy link
Contributor

Interesting idea, Quan!

Why can't the same be done as part of MemNode::can_see_stored_value()? Based on your reasoning in the comments, if the base escapes, the walk over memory graph happening there should encounter it as well. (But you need to ensure it climbs up to the Allocation node to be sure.)

@merykitty
Copy link
Member Author

merykitty commented Dec 12, 2025

@iwanowww No, the walk over the memory graph only visits the memory nodes in the alias class of the load, the escape can happen in a different alias class and be made visible by a MergeMem. For example:

Integer o = new Integer(v);
*p = o;
VarHandle.fullFence();
int x = o.value;

Then the load is in the alias class Integer.value, while the escape is in the alias class p and is made visible by the full fence.

@iwanowww
Copy link
Contributor

iwanowww commented Dec 12, 2025

Yes, I got it, but my understanding of the core idea of the optimization is that you can skip over membars when base object is not escaped yet. So, if MemNode::can_see_stored_value() encounters a node with a wide memory effect (a membar or a call) while traversing the memory graph upwards, it can step over it if it can prove that the freshly allocated instance hasn't escaped yet. And the traversal of memory graph from there up to corresponding Initialize node should reveal whether the instance escaped or not. Do I get it right?

@merykitty
Copy link
Member Author

@iwanowww In principle, I think you are right. However, I don't know how you can prove that a freshly allocated object has not escaped. It seems to me you would need to traverse the whole memory graph to obtain that information. Furthermore, AllocateNode and InitializeNode do not write to the whole memory, so walking some memory alias classes will make you pass them without knowing until you encounter the start memory. And there is also the issue of other nodes may alias with the base, too (e.g. Phi).

@iwanowww
Copy link
Contributor

Ok, what does it take to determine a freshly allocated object doesn't escape in a region bounded by the allocation and some call/membar node (dominated by it)? I believe it should be part of the problem your patch solves.

@merykitty
Copy link
Member Author

The sufficient condition to decide that a freshly allocated object does not escape in a region bounded by the allocation and a call is that there is no action in that region that makes the object escape. This means that there is no node that escapes the object which has the call as a transitive use.

As a result, my solution here is to find all nodes that escape the object, then mark all of its transitive uses as escape. I believe you want to do it in the opposite way, that is, to try to find the nodes that escape the freshly allocated object from a call. But that means that we need to traverse all the transitive inputs of the call, which seems unrealistic for something running in IterGVN. Am I understanding it correctly?

@vnkozlov
Copy link
Contributor

@merykitty thank you for updating comment.

Do you have any performance numbers for some well known benchmarks?

@iwanowww
Copy link
Contributor

Some more thoughts/ideas:

So, an object can escape either through a store to memory or as an argument to a call. (Any other scenarios?)

If we leave memory graph considerations aside, then traversing control graph from a barrier (call/membar) up to allocation should enumerate all calls and stores in that range. (All stores have control.)

(Theoretically, a store control can end up higher in the control graph, but I don't think it happens in practice.)

If a call/store has a data dependency on the allocation, then it's an escaping point.

One case left is the following: if a store has a control in the region, it can be scheduled after the region unless the store dominates the barrier in the memory graph. But, conservatively, it can also be treated as an escape point interfering with the access being optimized.

So, either doing CFG-only or CFG+memory traversal (plus, data inputs traversal on arguments) should detect whether there's an interfering escape point present or not.

Do you see any flaws in my reasoning?

Speaking of the associated costs, it doesn't look prohibitively expensive. The search is localized and doesn't involve traversal of the whole graph.

Alternatively, results of previous analysis requests can be cached. The property changes monotonically: a previously non-escaping case can't turn into escaping one later. If a cache is not invalidated, than the worst case is an optimization opportunity is missed.

@iwanowww
Copy link
Contributor

Speaking of the general approach, if analysis part turns out to be way too
expensive for IGVN, I'd still prefer to have the analysis and transformation to be
separated and IGVN used to conduct the actual IR changes.

There's already some duplication and divergence between IGVN & PhaseLoadFolding
implementation. Without proper care, it can easily get worse in the future.

Another thing to consider: it's beneficial to perform such transformation
earlier, as IGVN case illustrates. (For example, by the time EA kicks in,
inlining is over.) Shared implementation is easier to maintain and reuse.

@merykitty
Copy link
Member Author

@iwanowww Thanks for your analysis, I think it is possible to do the transformation during IGVN and have created another PR which follows that approach, could you take a look?

auto extract_store_value = [&](StoreNode* store) {
assert(store->Opcode() == candidate->store_Opcode(), "must match %s - %s", store->Name(), candidate->Name());
Node* res = store->in(MemNode::ValueIn);
if (candidate->Opcode() == Op_LoadUB) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is such adaptation needed? MemNode::can_see_stored_value() solves a similar task, but it doesn't perform any adaptation.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it only looks for a matching store, the one doing the normalization is Load[B|US|S|US]Node::Ideal.

@merykitty
Copy link
Member Author

Close in favour of #28812

@merykitty merykitty closed this Dec 14, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

hotspot-compiler hotspot-compiler-dev@openjdk.org rfr Pull request is ready for review

Development

Successfully merging this pull request may close these issues.

4 participants