8373495: C2: Aggressively fold loads from objects that have not escaped #28764
Conversation
👋 Welcome back qamai! A progress list of the required criteria for merging this PR into the target branch will be added to the body of your pull request.

❗ This change is not yet ready to be integrated.
@merykitty The following label will be automatically applied to this pull request:
When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.
Webrevs
Interesting improvement, thanks for working in this area, Quan Anh! Please allow us some time to think thoroughly about it and how it relates to other plans to improve escape analysis and scalar replacement in C2.
@robcasloz Thanks for taking a look. I also wonder how this relates to other potential improvements to EA. I think that this can work as an independent step or as a first step toward those goals. I am also pretty excited to realize that we don't need to schedule the graph to know if a load can be folded in such a manner; I hope this can also be useful.
// In this case, even if the load x = o.value is declared after the store of o to p that allows o
// to escape, it is valid for the load to actually happen before the store. As a result, we can
I don't think this is correct. If p is external, another thread can modify its fields concurrently.
Or are you saying that if p is external we will always have a memory barrier?
I think the Java memory model allows this reordering and places the responsibility on the programmer to use a synchronization mechanism if the reordering is undesirable, no?
Yes, I think so. Maybe we should add a comment about that.
I have added comments to further stress the importance of memory barriers if the developer needs the accesses serialized.
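The reordering under discussion can be illustrated with a minimal Java sketch (the class and field names here are hypothetical, not taken from the PR): without synchronization, the memory model permits the final load to observe the value captured by the initializing store, so folding it is legal; a programmer who needs the accesses serialized must use volatile or another synchronization mechanism.

```java
// Minimal sketch of the racy publication pattern discussed above.
// All names are hypothetical illustrations.
class EscapeDemo {
    static class Holder {
        int value;
    }

    // Plain (non-volatile) field: publication is unordered.
    static Holder shared;

    static int publishThenRead() {
        Holder o = new Holder();
        o.value = 42;
        shared = o;     // o escapes here
        return o.value; // the JMM permits this load to be ordered before the
                        // escaping store, so C2 may fold it to 42 even though
                        // another thread could concurrently write shared.value
    }
}
```

Single-threaded, the result is 42 either way; the point is that folding remains valid even if another thread races on `shared.value`, because the reordering is allowed by the JMM.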
// Phi
// We can see that the object can be considered non-escape at NarrowMemProj, CallJava(null), and
// Proj2, while it is considered escape at CallJava(o), Proj1, Phi. The loads x and z will be
// from NarrowMemProj and Proj2, respectively, which means they can be considered loads from an
So this optimization is based on the JDK-8327963 changes, which introduced NarrowMemProj. But I don't see you check for it in the code.
This is only for demonstration based on the current shape of the graph. Implementation-wise, we walk the graph until we meet an InitializeNode, at that point we call InitializeNode::find_captured_store, so you can say it is not important what kind of Proj an InitializeNode has.
Okay
Interesting idea, Quan! Why can't the same be done as part of …
@iwanowww No, the walk over the memory graph only visits the memory nodes in the alias class of the load; the escape can happen in a different alias class and be made visible by a … Then the load is in the alias class …
Yes, I got it, but my understanding of the core idea of the optimization is that you can skip over membars when the base object has not escaped yet. So, if …
@iwanowww In principle, I think you are right. However, I don't know how you can prove that a freshly allocated object has not escaped. It seems to me you would need to traverse the whole memory graph to obtain that information. Furthermore, …
Ok, what does it take to determine that a freshly allocated object doesn't escape in a region bounded by the allocation and some call/membar node (dominated by it)? I believe it should be part of the problem your patch solves.
The sufficient condition to decide that a freshly allocated object does not escape in a region bounded by the allocation and a call is that there is no action in that region that makes the object escape. This means that no node that escapes the object has the call as a transitive use. As a result, my solution here is to find all nodes that escape the object, then mark all of their transitive uses as escape. I believe you want to do it in the opposite way, that is, to try to find the nodes that escape the freshly allocated object starting from a call. But that means we need to traverse all the transitive inputs of the call, which seems unrealistic for something running in …
@merykitty thank you for updating the comment. Do you have any performance numbers for some well-known benchmarks?
Some more thoughts/ideas:

So, an object can escape either through a store to memory or as an argument to a call. (Any other scenarios?)

If we leave memory graph considerations aside, then traversing the control graph from a barrier (call/membar) up to the allocation should enumerate all calls and stores in that range. (All stores have control.) (Theoretically, a store's control can end up higher in the control graph, but I don't think it happens in practice.) If a call/store has a data dependency on the allocation, then it's an escape point.

One case is left: if a store has its control in the region, it can be scheduled after the region unless the store dominates the barrier in the memory graph. But, conservatively, it can also be treated as an escape point interfering with the access being optimized.

So, either a CFG-only or a CFG+memory traversal (plus a data-input traversal on arguments) should detect whether an interfering escape point is present or not. Do you see any flaws in my reasoning?

Speaking of the associated costs, it doesn't look prohibitively expensive. The search is localized and doesn't involve a traversal of the whole graph. Alternatively, the results of previous analysis requests can be cached. The property changes monotonically: a previously non-escaping case can't turn into an escaping one later. If the cache is not invalidated, then the worst case is that an optimization opportunity is missed.
Speaking of the general approach, if the analysis part turns out to be way too … There's already some duplication and divergence between IGVN & … Another thing to consider: it's beneficial to perform such a transformation …
@iwanowww Thanks for your analysis. I think it is possible to do the transformation during IGVN and have created another PR which follows that approach; could you take a look?
auto extract_store_value = [&](StoreNode* store) {
  assert(store->Opcode() == candidate->store_Opcode(), "must match %s - %s", store->Name(), candidate->Name());
  Node* res = store->in(MemNode::ValueIn);
  if (candidate->Opcode() == Op_LoadUB) {
Is such adaptation needed? MemNode::can_see_stored_value() solves a similar task, but it doesn't perform any adaptation.
Yes, it only looks for a matching store; the one doing the normalization is Load[B|UB|S|US]Node::Ideal.
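The adaptation matters because the captured store value may carry more bits than the narrow load is allowed to observe, and the signed and unsigned byte loads must normalize it differently. A small Java sketch of the two normalizations (the class and method names are illustrative):

```java
// Illustrates why a folded LoadB/LoadUB must normalize the stored value:
// the store may capture more significant bits than the load can observe.
// All names here are illustrative.
class NarrowLoadDemo {
    static byte b;

    // LoadB semantics: the stored low 8 bits, sign-extended.
    static int loadSigned(int wide) {
        b = (byte) wide;  // the store captures only the low 8 bits
        return b;
    }

    // LoadUB semantics: the stored low 8 bits, zero-extended.
    static int loadUnsigned(int wide) {
        b = (byte) wide;
        return b & 0xFF;
    }
}
```

For a stored value of 0x1FF (511), the signed load observes -1 while the unsigned load observes 255, which is why a folded load cannot simply reuse the store's input value as-is.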
Closed in favour of #28812.
Hi,
The current escape analysis mechanism is all-or-nothing: either the object does not escape, or it does. If the object escapes, we completely lose the ability to analyse the values of its fields, even if the object only escapes at the return.
This PR tries to determine the escape status of an object at a load; if it is decided that the object has not escaped there, we can try to fold the load aggressively, ignoring calls and memory barriers when looking for a corresponding store that the load observes.
Regarding the runtime cost, this phase runs very fast: around 5-7% of the runtime of EA, and about 0.5% of the total runtime of C2.
Please take a look and leave your thoughts, thanks a lot.
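A hypothetical example of the code shape the description targets: the object escapes only by being returned, so a load placed after an intervening call can still be folded to the captured store value. The names below are illustrative, not taken from the PR's tests.

```java
// Hypothetical example of the pattern this PR targets: the object escapes
// only at the return, so the load before it can be folded despite the call.
class FoldDemo {
    static class Point {
        int x;
    }

    static int sum;

    // Stands in for an arbitrary call the compiler cannot see through.
    static int opaqueCall(Object o) {
        return 0;
    }

    static Point make() {
        Point p = new Point();
        p.x = 7;
        opaqueCall(null); // a call, but p has not escaped yet
        sum += p.x;       // this load is foldable to 7 despite the call
        return p;         // p escapes only here, at the return
    }
}
```

Under the pre-existing all-or-nothing analysis, the return marks `p` as escaping and the load `p.x` cannot be folded across the call; with per-load escape status, the load sits in a region where `p` has not yet escaped.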
Progress
Issue
Reviewing
Using git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/28764/head:pull/28764
$ git checkout pull/28764

Update a local copy of the PR:
$ git checkout pull/28764
$ git pull https://git.openjdk.org/jdk.git pull/28764/head

Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 28764

View PR using the GUI difftool:
$ git pr show -t 28764

Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/28764.diff
Using Webrev
Link to Webrev Comment