
Document describing upcoming object stack allocation work. #20251

Merged: 1 commit into dotnet:master on Oct 8, 2018

Conversation

erozenfeld
Member

We are starting work on object stack allocation. This document provides some background and describes our plan.

@erozenfeld
Member Author

@dotnet/jit-contrib @jkotas @davidwrighton

done in the jit to generate better code for stack-allocated objects. The details are in comments of
[coreclr #1784](https://github.com/dotnet/coreclr/issues/1784).

We did some analysis of the Roslyn csc self-build to see where this optimization may be beneficial. One hot place was found in [GreenNode.WriteTo](https://github.com/dotnet/roslyn/blob/fab7134296816fc80019c60b0f5bef7400cf23ea/src/Compilers/Core/Portable/Syntax/GreenNode.cs#L647).
Member

How hard is it to fix this up to save this allocation? It looks pretty straightforward.

Member Author

Yes, implementing a simple struct version of a stack and using it there will remove these allocations.
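For illustration, here is a minimal sketch of what such a struct-based stack might look like (a hypothetical `ValueStack<T>`; the actual Roslyn change may differ):

```csharp
// Hypothetical sketch: a value-type stack whose "header" lives in the
// caller's frame, so no Stack<T> object is heap-allocated. Only the
// backing array still comes from the GC heap.
internal struct ValueStack<T>
{
    private T[] _items;
    private int _count;

    public ValueStack(int capacity)
    {
        _items = new T[capacity];
        _count = 0;
    }

    public int Count => _count;

    public void Push(T item)
    {
        if (_count == _items.Length)
            System.Array.Resize(ref _items, _items.Length * 2);
        _items[_count++] = item;
    }

    public T Pop() => _items[--_count];
}
```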

Member

It would be interesting to look at which cases out of the ones you have listed are hard or impossible to fix by a simple local change. I think they would be the ones to focus on. Maybe the delegates?

Member Author

Delegates are one case. Another common case is a fixed-length array passed to, e.g., Console.WriteLine.


> Another common case is a fixed-length array passed to, e.g., Console.WriteLine.

There's also a proposal to avoid that array allocation from the language side in dotnet/csharplang#1757 by allowing params Span<T> parameters.
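To make the array case concrete, here is an illustrative example; with four format arguments the call binds to the `WriteLine(string format, params object[] arg)` overload:

```csharp
using System;

class ParamsAllocation
{
    static void Main()
    {
        int a = 1, b = 2, c = 3, d = 4;
        // The compiler emits roughly:
        //   Console.WriteLine("{0} {1} {2} {3}", new object[] { a, b, c, d });
        // The object[] (plus the boxes for the ints) is a fixed-length heap
        // allocation on every call; proving it never escapes requires
        // knowing that WriteLine doesn't store its argument array.
        Console.WriteLine("{0} {1} {2} {3}", a, b, c, d);
    }
}
```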

Member

Just so I understand, in the WriteTo case, the proposed optimization would remove the allocation of the Stack object itself, but the underlying array it uses internally would still be heap-allocated, right? There's no concept of the "entire stack" structure somehow being able to fit on the real execution stack as long as possible, right?

Member Author

Yes, the proposed optimization would remove the allocation of the Stack object itself. "Inlining" object fields into enclosing objects is beyond the scope of this.

is the most precise and most expensive (it is based on connection graphs) and was used in the context of a static Java compiler,
[[3]](https://pdfs.semanticscholar.org/1b33/dff471644f309392049c2791bca9a7f3b19c.pdf)
is the least precise and cheapest (it doesn't track references through assignments of fields) and was used in MSR's Marmot implementation
[[2]](https://www.usenix.org/legacy/events/vee05/full_papers/p111-kotzmann.pdf)
Member

Nit: missing a "." after the [2].

Member Author

Fixed.

Effectiveness of object stack allocation depends in large part on whether escape analysis is done inter-procedurally.
With intra-procedural analysis only, the compiler has to assume that arguments escape at all non-inlined call sites,
which blocks many stack allocations. In particular, assuming that the 'this' argument always escapes hurts the optimization.
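An illustrative example of the 'this' problem (not taken from the document):

```csharp
class Point
{
    public int X, Y;
    public int Sum() => X + Y; // 'this' is an implicit argument
}

class EscapeExample
{
    static int Use()
    {
        var p = new Point { X = 1, Y = 2 };
        // If Sum() is not inlined, a purely intra-procedural analysis must
        // assume 'p' escapes through the 'this' argument of the call, which
        // blocks stack allocation. If Sum() is inlined, the jit can see
        // that 'p' never escapes.
        return p.Sum();
    }
}
```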

Member

You might also mention that some approaches are able to handle objects that only escape on some paths by promoting them to the heap "just in time" as control reaches those paths -- for instance, Partial Escape Analysis and Scalar Replacement for Java.

So long as those escaping paths are indeed rare, this can pay off. For instance, in the local delegate case there is an exception path that makes it look like the delegate can escape. In my prototype I managed to modify the importer to prove this code wasn't reachable, but in general we may see things like this...
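A sketch of the pattern being described, where the object escapes only on a rarely-taken path (illustrative; not the actual prototype code):

```csharp
using System;

class PartialEscape
{
    static int[] s_saved; // a store here makes the object escape

    static int Compute(int i)
    {
        var data = new int[4]; // candidate for stack allocation
        data[0] = i;
        if (i < 0)
        {
            // Rarely-taken path on which the object really escapes.
            // Partial escape analysis would stack-allocate 'data' and
            // materialize it on the heap only if control reaches here.
            s_saved = data;
            throw new ArgumentOutOfRangeException(nameof(i));
        }
        return data[0];
    }
}
```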

Member Author

Updated the doc to mention this approach and added the paper to References.

newobj for the object that was determined to be non-escaping. Note that assemblies may lose verifiability with this approach.
An alternative is to annotate parameters with escape information so that the annotations can be verified by the jit with
local analysis.
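Such an annotation scheme might look something like the following (entirely hypothetical; no such attribute exists in the BCL):

```csharp
using System;

// Hypothetical attribute: promises that the callee never stores the
// parameter anywhere that outlives the call. The jit could verify the
// promise with local analysis of the callee body and then trust it
// when analyzing callers.
[AttributeUsage(AttributeTargets.Parameter)]
sealed class DoesNotEscapeAttribute : Attribute { }

static class Logger
{
    public static void Log([DoesNotEscape] object message)
    {
        Console.WriteLine(message); // 'message' is read, never stored
    }
}
```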

Member

@AndyAyersMS Oct 4, 2018

As I've mentioned elsewhere in passing, the jit cannot generally rely on any AOT-derived interprocedural information as ground truth. While that information may be true of the IL scanned by AOT, at runtime, because of profilers and the like, the jit may see different IL initially, or new IL may arrive after jitting.

Without the ability to revoke an arbitrary running method, the only current safe way to incorporate interprocedural information is via inlining. Inlines are tracked by the runtime and when a method body is updated, all the existing methods that are impacted are set for rejitting. Existing instances that are active continue to run and consistently use the old versions; new instances invoked after the IL update see only the new versions.

So any AOT-derived interprocedural information can at best be used as a strong hint to the jit, and those facts must be re-verified by actually inlining. Unless we know that IL updates are not possible OR we implement a general revocation scheme (deopt/osr). Given that method body updates are dynamically rare, this hinting might be sufficient to expose the perf opportunities, but it means coupling this information into the inliner.

Member

> Unless we know that IL updates are not possible OR we implement a general revocation scheme (deopt/osr).

Can the AOT-derived analysis include a list of methods that were used to produce the result? Then we can make this list logically inlined into the main method, and the rest will work the same way as if the methods were physically inlined.

Member

Imagine we have a long-running A that calls B every so often. We allow information about B to influence A's codegen. Then when B is modified we either need to fix up that running instance of A or prevent the old A from calling the new B. It is not enough to force any new call to A to be rejitted.

Say for instance A passes B a struct implicitly by-ref and the AOT version of B doesn't modify the struct. So we take advantage of this and the initial version of A doesn't copy the struct each time A calls B. Then if the new B modifies the struct, the old A can't safely call the new B.

So we either need to immediately revoke the old A or take pains to make sure the old A will still invoke the old B.

We don't have these problems when B is inlined, as old As always "invoke" old Bs, and when B is updated and both A and B get rejitted, new As always invoke new Bs. And any place we don't inline, we also don't bake in any dependency on the callee. So even if A both inlines B at some sites and calls B at others we're ok.

I haven't thought much about how to realistically support a system where code is versioned and we somehow keep straight which versions can safely call which other versions (note, as above, that this behavior is call-site dependent; e.g., an old A may be able to safely call the new B at some sites but must call the old B at others...). Maybe it is viable?
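A sketch of the A/B hazard described above (hypothetical methods; assumes the struct is passed via a caller-made, by-ref copy under the ABI):

```csharp
struct Big
{
    public long F0, F1, F2, F3; // large struct, passed by implicit by-ref copy
}

class VersioningHazard
{
    // AOT-analyzed version of B: provably never writes to 'b'.
    static void B(Big b) { }

    static void A()
    {
        var big = new Big();
        for (long n = 0; n < 1_000_000_000; n++) // long-running loop
        {
            // If A's codegen trusted "B never modifies its argument" and
            // skipped the defensive copy (passing the address of 'big'
            // directly), then an IL update that replaces B with a version
            // that writes to its parameter would let the new B corrupt
            // 'big' inside this still-running old A.
            B(big);
        }
    }
}
```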

Member

> take pains to make sure the old A will still invoke the old B

I have missed this part. It would certainly be non-trivial to get this right.

Member Author

Yes, if profiling is always on in the first version, we'll have to inline all methods the stack-allocated object can be passed to. Unfortunately, that will complicate analysis of the perf implications when both inlining changes and stack allocation are performed.

Member Author

Noted this in the document.

Member

@AndyAyersMS left a comment

Looks good overall -- left a few notes for you to think about.


## Other restrictions on stack allocations

* Objects with finalizers can't be stack-allocated since they always escape to the finalizer queue.

What about objects that have a finalizer, but whose finalization is always suppressed?

Though I don't know how likely it is that we could determine whether finalization is indeed suppressed, since that might require inlining of Dispose (which is the method that commonly calls SuppressFinalize).
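For reference, the pattern in question, the standard Dispose pattern:

```csharp
using System;

class Resource : IDisposable
{
    public void Dispose()
    {
        // ... release resources deterministically ...
        // After this call the finalizer will not run, but proving that at
        // jit time would typically require inlining Dispose (or otherwise
        // observing the SuppressFinalize call).
        GC.SuppressFinalize(this);
    }

    ~Resource()
    {
        // Fallback cleanup if Dispose was never called.
    }
}
```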


Or could the finalizer just be called synchronously, similar to `using (obj)`?

Member Author

I think at least for the initial version it's ok to be conservative here and not stack allocate objects with finalizers without worrying about whether they are suppressed.

@xoofx
Member

xoofx commented Oct 4, 2018

Hey,
Super glad there is someone starting to look at this.

Wondering why the approach of explicit management via transient (as I explained in my blog post) could not be an option?

@Korporal

Korporal commented Oct 4, 2018

@erozenfeld - I'm curious about the projected gains for this. Is there any analysis of what % of typical code would be able to leverage this? Is there an estimated increase in performance?


## Motivation

In .NET instances of object types are allocated on the garbage-collected heap.
Member

Nitpick: in .NET all instances are instances of object types (i.e., ints derive from object, etc.), but clearly not all of these go on the heap :)

Member

It might be more appropriate to say "instances of non-value types", or something to that effect.

Member Author

Fixed.


**Cons:**
* The jit analyzes methods top-down, i.e., callers before callees (when inlining), which doesn't fit well with the stack allocation optimization.
* Full interprocedural analysis is too expensive for the jit, even at high tiering levels.
Member

Has there been any consideration of "less than full" interprocedural analysis, for example profile-guided interprocedural analysis? I know devs who have worked on jits where runtime execution metrics were used to drive localized interprocedural analysis with very good results: instead of doing "full" analysis, keep track of which procedures are very hot, as well as the call relations between those hot procedures (e.g., function A calls B, which calls C, millions of times), and then go analyze those small cliques.


I think it's too strong a statement to say that "Full interprocedural analysis is too expensive for the jit, even at high tiering levels". If one assumes that with a higher-tiered JIT we would have the ability to memoize method properties with (in)validation, background on-demand/full interprocedural analysis would be feasible. That said, it's not likely to be practical any time soon.

Member Author

I re-worded the statement.

If the lifetime of an object is bounded by the lifetime of the allocating method, the allocation
may be moved to the stack. The benefits of this optimization:

* The pressure on the garbage collector is reduced because the GC heap becomes smaller.


This seems to miss a key point that the GC doesn't have to be involved in allocating or deallocating these objects. You mention that above, but you repeat the zero-initialize benefit below, so it might be good to clarify this.

Member Author

Clarified this in the doc.




* Objects with finalizers can't be stack-allocated since they always escape to the finalizer queue.
* Objects allocated in a loop can be stack allocated only if the allocation doesn't escape the iteration of the loop in which it is
allocated. Such analysis is complicated and is beyond the scope of at least the initial implementation.


I'm not sure how complex this would actually be. To me the question is whether this occurs frequently - if so, it would be very useful to do this analysis even if it doesn't escape, since the object could be reused across iterations.

Member Author

Just to clarify, the cases we would have to detect and disqualify are things like this:

```csharp
using System;

class A
{
    public int i;
}

class Test
{
    public static void Main()
    {
        Foo(2);
    }

    static void Foo(int i)
    {
        A a2 = null;
        while (i > 0)
        {
            A a1 = new A();
            if (i >= 2)
            {
                a2 = a1;
            }
            a1.i = i;
            Console.WriteLine(a2.i);
            i--;
        }
    }
}
```

This has to output:

```
2
2
```

so we can't reuse the same stack slot for the allocation at each iteration.
I believe the general analysis for this is not easy but we can detect some simple cases if they occur frequently.
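For contrast, a sketch of the simple safe case (reusing class A from the snippet above), where the allocation never leaves its iteration and a single stack slot could back it:

```csharp
static void Bar(int i)
{
    while (i > 0)
    {
        A a1 = new A(); // never stored outside this iteration
        a1.i = i;
        Console.WriteLine(a1.i);
        i--;
        // 'a1' is dead at the end of each iteration, so the same stack
        // slot could be reused for the allocation every time around.
    }
}
```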

* We can adjust inlining heuristics to give more weight to candidates whose parameters have references to potentially
stack-allocated objects. Inlining such methods may result in additional benefits if the jit can promote fields of the
stack-allocated objects.
* For higher-tier jit the order of method processing may be closer to bottom-up, i.e., callees before callers. That may

How difficult would this approach be? Compiling methods bottom-up would help more optimizations; for example, if the callee does not use certain volatile registers, the callers would not need to save those registers across calls (e.g., in SIMD code on Unix/Linux).


The JIT is fundamentally top-down in its compilation model, as it compiles on demand and doesn't know the downstream call chain until it is either executed or inlined.

Member Author

The idea here is that callees may be promoted to a higher tier before callers. For example, if both A and B call C, then it's possible that C will reach the count needed for promotion to the next tier before either A or B does.


> The idea here is that callees may be promoted to a higher tier before callers.

This situation looks really tricky and may need more scenarios and data. AFAIK, JVM implementations usually get help from bytecode-level analysis. Perhaps we can do something similar, e.g., collecting callee info during assembly loading or bytecode verification.

@fiigii

fiigii commented Oct 4, 2018

> I'm curious about the projected gains for this. Is there any analysis of what % of typical code would be able to leverage this? Is there an estimated increase in performance?

@Korporal There is an analysis in #19663, which manually changes class to struct to get an effect similar to escape analysis. The data shows, for the PacketTracer benchmark, a 31% execution-time improvement, a ~16% code-size reduction, and a significant reduction in GC overhead (~33% -> ~11%).

@erozenfeld
Member Author

@xoofx

Wondering why the approach of explicit management via transient (as I explained in my blog post) could not be an option?

@jaredpar can comment more on this but I believe the problem with this approach is that the transient annotation will become viral and everything will have to be annotated with it. That said, if the Roslyn team decides to support that approach, the work in the jit will still be used to do the actual stack allocation. Escape analysis will still be useful for code that doesn't have transient annotations.

@erozenfeld
Member Author

> I'm curious about the projected gains for this. Is there any analysis of what % of typical code would be able to leverage this? Is there an estimated increase in performance?

@Korporal It's hard to define typical code. If an allocation that's moved to the stack is hot, moving it may result in a measurable increase in performance. The percentage of allocations that can be moved to the stack will depend on how sophisticated our analysis is. We'll start with a simple analysis and improve it incrementally.

I did some experiments with the Roslyn self-build scenario, where I measured the percentage of newobj and newarr (but not box) allocations that didn't escape at runtime. The result was that 16.1% of allocated objects didn't escape.


## GitHub issues

[roslyn #2104](https://github.com/dotnet/roslyn/issues/2104) Complier should optimize "alloc temporary small object" to "alloc on stack"
Member

s/complier/compiler

Member Author

Fixed in the doc, although the typo is still in the title of roslyn #2104.

@erozenfeld force-pushed the ObjectStackAllocationWriteUp branch from 5ed2b35 to 1aca073 on October 8, 2018, 22:53
Member

@AndyAyersMS left a comment

Thanks for the updates. LGTM.

@erozenfeld erozenfeld merged commit 913428d into dotnet:master Oct 8, 2018
@jonathanvdc

jonathanvdc commented Apr 17, 2019

Hi! I know this PR is rather old. I only stumbled across it now and I think I may have something useful to contribute to the discussion. My apologies for the necromancy.

It's great to see that you're working on stack allocation for classes. I see that one of the options being proposed in this design document is (AOT) IL-to-IL optimization. Coincidentally, I'm actually working on a tool that does just that: ilopt. ilopt parses CIL, generates an SSA-based IR designed specifically for optimizing managed code, sends that IR through a sequence of transformations, generates CIL from the optimized IR and saves it back to disk.

Well, ilopt is actually just a command-line driver program. All the heavy lifting is done by Flame, its parent project. Flame seems to fit the bill with regard to the AOT IL optimization infrastructure you're proposing.

So,

  1. Do you think Flame is the kind of thing you're looking for to do AOT IL optimizations?
  2. What are the odds of having an opcode that allocates a class on the stack in the future? I'm only asking because I implemented a stack allocation pass for classes in a previous version of Flame (and plan on reimplementing it in the future), but it was very complex because there's no stackalloc-like opcode for classes. It also would've been able to optimize more diverse fragments of code if there had been such an opcode.

@PathogenDavid

> All the heavy lifting is done by Flame

Your link is broken, BTW. It should be https://www.github.com/jonathanvdc/flame

@jonathanvdc

Thanks! I didn't realize the https:// prefix was mandatory. The link should be fixed now.
