Document describing upcoming object stack allocation work. #20251
Conversation
@dotnet/jit-contrib @jkotas @davidwrighton
> done in the jit to generate better code for stack-allocated objects. The details are in comments of
> [coreclr #1784](https://github.com/dotnet/coreclr/issues/1784).
>
> We did some analysis of Roslyn csc self-build to see where this optimization may be beneficial. One hot place was found in [GreenNode.WriteTo](https://github.com/dotnet/roslyn/blob/fab7134296816fc80019c60b0f5bef7400cf23ea/src/Compilers/Core/Portable/Syntax/GreenNode.cs#L647).
How hard is it to fix this up to save this allocation? It looks pretty straightforward.
Yes, implementing a simple struct version of a stack and using it there will remove these allocations.
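For illustration, a minimal sketch of what such a struct-based stack could look like (the type name and API here are hypothetical, not the actual Roslyn change):

```csharp
using System;

// Illustrative only: a minimal value-type stack. Using a struct removes the
// allocation of the stack object itself; the backing array is still
// heap-allocated (it could instead be rented or stackalloc'd by the caller).
struct ValueStack<T>
{
    private T[] _items;
    private int _count;

    public ValueStack(int capacity)
    {
        _items = new T[capacity];
        _count = 0;
    }

    public int Count => _count;

    public void Push(T item)
    {
        if (_count == _items.Length)
            Array.Resize(ref _items, _items.Length * 2);
        _items[_count++] = item;
    }

    public T Pop() => _items[--_count];
}
```

Declaring a `ValueStack<T>` local allocates nothing on the GC heap for the stack object itself, which is what removes the allocations discussed here.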
It would be interesting to look at which cases out of the ones you have listed are hard or impossible to fix by a simple local change. I think they would be the ones to focus on. Maybe the delegates?
Delegates are one case. Another common case is a fixed-length array passed to, e.g., Console.WriteLine.
> Another common case is fixed-length array passed to, e.g., Console.WriteLine.
There's also a proposal to avoid that array allocation from the language side in dotnet/csharplang#1757 by allowing `params Span<T>` parameters.
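To make the hidden allocation concrete, here is a hypothetical sketch (the method name is illustrative):

```csharp
using System;

class ParamsExample
{
    public static void Demo(int a, int b, int c, int d)
    {
        // With four formatting arguments there is no non-params overload of
        // Console.WriteLine, so the C# compiler lowers this call to roughly:
        //   Console.WriteLine("{0} {1} {2} {3}",
        //       new object[] { a, b, c, d });
        // allocating a short-lived object[] (plus four boxes) on every call.
        // If escape analysis could prove the array never escapes the callee,
        // the array itself could live on the stack.
        Console.WriteLine("{0} {1} {2} {3}", a, b, c, d);
    }
}
```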
Just so I understand: in the WriteTo case, the proposed optimization would remove the allocation of the Stack object itself, but the underlying array it uses internally would still be heap-allocated, right? There's no concept of the "entire stack" structure somehow being able to fit on the real execution stack for as long as possible, right?
Yes, the proposed optimization would remove the allocation of the Stack object itself. "Inlining" object fields into enclosing objects is beyond the scope of this.
> is the most precise and most expensive (it is based on connection graphs) and was used in the context of a static Java compiler,
> [[3]](https://pdfs.semanticscholar.org/1b33/dff471644f309392049c2791bca9a7f3b19c.pdf)
> is the least precise and cheapest (it doesn't track references through assignments of fields) and was used in MSR's Marmot implementation
> [[2]](https://www.usenix.org/legacy/events/vee05/full_papers/p111-kotzmann.pdf)
Nit: missing a "." after the [2].
Fixed.
> Effectiveness of object stack allocation depends in large part on whether escape analysis is done inter-procedurally.
> With intra-procedural analysis only, the compiler has to assume that arguments escape at all non-inlined call sites,
> which blocks many stack allocations. In particular, assuming that 'this' argument always escapes hurts the optimization.
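A hypothetical illustration of the 'this' problem described in the quoted text (names are illustrative; `NoInlining` stands in for any call site the jit doesn't inline):

```csharp
using System;
using System.Runtime.CompilerServices;

class Counter
{
    public int Value;

    // An intra-procedural analysis can't see this body at the call site,
    // so it must assume 'this' (the Counter instance) escapes through it.
    [MethodImpl(MethodImplOptions.NoInlining)]
    public void Increment() => Value++;
}

class Caller
{
    public static int Sum(int n)
    {
        // 'c' never leaves this method, so it is a stack-allocation
        // candidate. But each c.Increment() passes 'c' as 'this' to an
        // unanalyzed callee, so without inlining or interprocedural
        // information the jit must conservatively treat 'c' as escaping.
        var c = new Counter();
        for (int i = 0; i < n; i++)
            c.Increment();
        return c.Value;
    }
}
```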
You might also mention that some approaches are able to handle objects that only escape on some paths by promoting them to the heap "just in time" as control reaches those paths -- for instance Partial Escape Analysis and Scalar Replacement for Java.
So long as those escaping paths are indeed rare this can pay off. For instance, in the local delegate case there is an exception path that makes it look like the delegate can escape. In my prototype I managed to modify the importer to prove this code wasn't reachable, but in general we may see things like this.
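A sketch of the kind of rare-path escape that partial escape analysis targets (names are illustrative; the escape is shown through a static field for simplicity rather than a delegate):

```csharp
using System;

class PartialEscape
{
    static int[] _leaked; // rare escape target (illustrative)

    public static int Compute(int x)
    {
        // 'box' escapes only on the rare error path below. Partial escape
        // analysis keeps the object unallocated (or stack-allocated) along
        // the hot path and materializes a heap copy "just in time" only
        // when control actually reaches the escaping branch.
        int[] box = new int[1];
        box[0] = x * 2;
        if (x < 0)
        {
            _leaked = box; // the only escape point
            throw new ArgumentOutOfRangeException(nameof(x));
        }
        return box[0];
    }
}
```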
Updated the doc to mention this approach and added the paper to References.
> newobj for the object that was determined to be non-escaping. Note that assemblies may lose verifiability with this approach.
> An alternative is to annotate parameters with escape information so that the annotations can be verified by the jit with
> local analysis.
As I've mentioned elsewhere in passing, the jit cannot generally rely on any AOT derived interprocedural information as ground truth. While that information may be true of the IL scanned by AOT, at runtime, because of profilers and the like, the jit may see different IL initially or new IL may arrive after jitting.
Without the ability to revoke an arbitrary running method, the only current safe way to incorporate interprocedural information is via inlining. Inlines are tracked by the runtime and when a method body is updated, all the existing methods that are impacted are set for rejitting. Existing instances that are active continue to run and consistently use the old versions; new instances invoked after the IL update see only the new versions.
So any AOT derived interprocedural information can at best be used as a strong hint to the jit and those facts must be re-verified by actually inlining. Unless we know that IL updates are not possible OR we implement a general revocation scheme (deopt/osr). Given that method body updates are dynamically rare this hinting might be sufficient to expose the perf opportunities, but it means coupling this information into the inliner.
> Unless we know that IL updates are not possible OR we implement a general revocation scheme (deopt/osr).
Can the AOT derived analysis include list of methods that were used to produce the result? Then we can make this list logically inlined into the main method and the rest will work the same way as if the methods were physically inlined.
Imagine we have a long-running A that calls B every so often. We allow information about B to influence A's codegen. Then when B is modified we either need to fix up that running instance of A or prevent the old A from calling the new B. It is not enough to force any new call to A to be rejitted.
Say for instance A passes B a struct implicitly by-ref and the AOT version of B doesn't modify the struct. So we take advantage of this and the initial version of A doesn't copy the struct each time A calls B. Then if the new B modifies the struct, the old A can't safely call the new B.
So we either need to immediately revoke the old A or take pains to make sure the old A will still invoke the old B.
We don't have these problems when B is inlined, as old As always "invoke" old Bs, and when B is updated and both A and B get rejitted, new As always invoke new Bs. And any place we don't inline, we also don't bake in any dependency on the callee. So even if A both inlines B at some sites and calls B at others we're ok.
I haven't thought much about how to realistically support a system where code is versioned and we somehow keep straight which versions can safely call which other versions (note as above that this behavior is call site dependent, eg an old A may be able to safely call the new B at some sites but must call the old B at others...). Maybe it is viable?
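The struct-by-ref hazard described above could be sketched like this (hypothetical names; the implicit ABI by-ref is written as an explicit `ref` for clarity):

```csharp
using System;

struct Payload
{
    public int Value;
}

class Versioning
{
    // Hypothetical callee B: the AOT analysis observed that B never writes
    // through 'p'. If a profiler later supplies a new body for B that does
    // write through 'p', an already-optimized A becomes unsafe to run.
    static void B(ref Payload p)
    {
        Console.WriteLine(p.Value); // reads only
    }

    public static void A()
    {
        var local = new Payload { Value = 42 };

        // Conservative codegen: defensively copy before the call in case B
        // mutates its argument. The "optimized" A would pass 'local' with
        // no copy, which is only valid while B is known not to mutate it.
        Payload copy = local;
        B(ref copy);
    }
}
```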
> take pains to make sure the old A will still invoke the old B
I have missed this part. It would certainly be non-trivial to get this right.
Yes, if profiling is always on in the first version we'll have to inline all methods the stack allocated object can be passed to. Unfortunately, that will complicate analysis of the perf implications when both inlining changes and stack allocation will be performed.
Noted this in the document.
Looks good overall -- left a few notes for you to think about.
> ## Other restrictions on stack allocations
>
> * Objects with finalizers can't be stack-allocated since they always escape to the finalizer queue.
What about objects that have a finalizer, but whose finalization is always suppressed?
Though I don't know the likelihood of determining whether finalization is indeed suppressed, since it might require inlining of `Dispose` (which is the method that commonly calls `SuppressFinalize`).
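For concreteness, a sketch of the pattern under discussion (illustrative type; the `Suppressed` flag exists only to make the sketch observable):

```csharp
using System;

class Resource : IDisposable
{
    public bool Suppressed; // only here to make the sketch observable

    // The mere presence of a finalizer means the object is registered for
    // finalization at allocation time, which counts as escaping.
    ~Resource() { }

    public void Dispose()
    {
        // This is the call an analysis would have to prove runs on every
        // path before the finalizer could be ignored, and that proof
        // typically requires inlining Dispose.
        GC.SuppressFinalize(this);
        Suppressed = true;
    }
}
```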
Or could the finalizer just be called synchronously, similar to `using (obj)`?
I think at least for the initial version it's ok to be conservative here and not stack allocate objects with finalizers without worrying about whether they are suppressed.
Hey, wondering why the approach of explicit management via
@erozenfeld - I'm curious about the projected gains for this. Is there any analysis of what % of typical code would be able to leverage this? Is there an estimated increase in performance?
> ## Motivation
>
> In .NET instances of object types are allocated on the garbage-collected heap.
Nitpick: in .NET all instances are instances of object types (i.e., ints derive from object, etc.), but clearly not all of these go on the heap :)
It might be more appropriate to say "instances of non-value types", or something to that effect.
Fixed.
> **Cons:**
> * The jit analyzes methods top-down, i.e., callers before callees (when inlining), which doesn't fit well with the stack allocation optimization.
> * Full interprocedural analysis is too expensive for the jit, even at high tiering levels.
Has there been any consideration of 'less than full' interprocedural analysis? For example, profile-guided interprocedural analysis. I know devs who have worked on jits where runtime execution metrics were used to drive localized interprocedural analysis with very good results: instead of doing "full" analysis, keep track of which procedures are very 'hot', as well as call relations between the hot procedures (i.e., function A calls B calls C millions of times), then go and analyze those small cliques.
I think it's too strong a statement to say that "Full interprocedural analysis is too expensive for the jit, even at high tiering levels". If one assumes that with a higher-tiered JIT we would have the ability to memoize method properties with (in)validation, background on-demand/full interprocedural analysis would be feasible. That said, it's not likely to be practical any time soon.
I re-worded the statement.
> If the lifetime of an object is bounded by the lifetime of the allocating method, the allocation
> may be moved to the stack. The benefits of this optimization:
>
> * The pressure on the garbage collector is reduced because the GC heap becomes smaller.
This seems to miss a key point that the GC doesn't have to be involved in allocating or deallocating these objects. You mention that above, but you repeat the zero-initialize benefit below, so it might be good to clarify this.
Clarified this in the doc.
> * Objects with finalizers can't be stack-allocated since they always escape to the finalizer queue.
> * Objects allocated in a loop can be stack allocated only if the allocation doesn't escape the iteration of the loop in which it is
> allocated. Such analysis is complicated and is beyond the scope of at least the initial implementation.
I'm not sure how complex this would actually be. To me the question is whether this occurs frequently - if so, it would be very useful to do this analysis even if it doesn't escape, since the object could be reused across iterations.
Just to clarify, the cases we would have to detect and disqualify are things like this:

```csharp
using System;

class A
{
    public int i;
}

class Test
{
    public static void Main()
    {
        Foo(2);
    }

    static void Foo(int i)
    {
        A a2 = null;
        while (i > 0) {
            A a1 = new A();
            if (i >= 2) {
                a2 = a1;
            }
            a1.i = i;
            Console.WriteLine(a2.i);
            i--;
        }
    }
}
```
This has to output `2` twice, so we can't reuse the same stack slot for the allocation at each iteration.
I believe the general analysis for this is not easy, but we can detect some simple cases if they occur frequently.
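For contrast, a hypothetical loop where the allocation provably dies each iteration, so a single stack slot could be reused:

```csharp
using System;

class A
{
    public int i;
}

class LoopSafe
{
    public static int Sum(int n)
    {
        int total = 0;
        while (n > 0)
        {
            // 'a' is created and fully consumed within one iteration; no
            // reference survives into the next iteration, so a single
            // stack slot could safely be reused for every allocation.
            A a = new A();
            a.i = n;
            total += a.i;
            n--;
        }
        return total;
    }
}
```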
> * We can adjust inlining heuristics to give more weight to candidates whose parameters have references to potentially
> stack-allocated objects. Inlining such methods may result in additional benefits if the jit can promote fields of the
> stack-allocated objects.
> * For higher-tier jit the order of method processing may be closer to bottom-up, i.e., callees before callers. That may
How difficult would this approach be? Compiling methods bottom-up would help more optimizations, for example, if the callee does not use certain volatile registers, the callers would not need to save these registers on the stack (i.e., SIMD code on Unix/Linux).
The JIT is fundamentally top-down in its compilation model, as it compiles on-demand and doesn't know the downstream call chain until it is either executed or inlined.
The idea here is that callees may be promoted to higher tier before callers. For example, if both A calls C and B calls C then it's possible that C will reach the count needed for promotion to the next tier before both A and B.
> The idea here is that callees may be promoted to higher tier before callers.
This situation looks really tricky and may need more scenarios and data. AFAIK, JVM implementations usually get help from bytecode-level analysis. Perhaps we can do something similar, e.g., collecting callee info during assembly loading or bytecode verification.
@Korporal There is an analysis in #19663, which manually changes
@jaredpar can comment more on this but I believe the problem with this approach is that the transient annotation will become viral and everything will have to be annotated with it. That said, if the Roslyn team decides to support that approach, the work in the jit will still be used to do the actual stack allocation. Escape analysis will still be useful for code that doesn't have transient annotations.
@Korporal It's hard to define typical code. If the allocation that's moved to the stack is hot, it may result in a measurable increase in performance. The percentage of allocations that will be moved to the stack will depend on how sophisticated our analysis is. We'll start with a simple analysis and improve it incrementally. I did some experiments with the Roslyn self-build scenario where I measured the percentage of newobj and newarr (but not box) allocations that didn't escape at runtime. The result was that 16.1% of allocated objects didn't escape.
> ## GitHub issues
>
> [roslyn #2104](https://github.com/dotnet/roslyn/issues/2104) Complier should optimize "alloc temporary small object" to "alloc on stack"
s/complier/compiler
Fixed in the doc, although the typo is still in the title of roslyn #2104
Commits updated from 5ed2b35 to 1aca073.
Thanks for the updates. LGTM.
Hi! I know this PR is rather old. I only stumbled across it now and I think I may have something useful to contribute to the discussion. My apologies for the necromancy. It's great to see that you're working on stack allocation for classes. I see that one of the options being proposed in this design document is (AOT) IL-to-IL optimization. Coincidentally, I'm actually working on a tool that does just that.
Your link is broken, BTW. It should be https://www.github.com/jonathanvdc/flame |
Thanks! I didn't realize the
We are starting work on object stack allocation. This document provides some background and describes our plan.