
Document describing upcoming object stack allocation work. #20251

Merged: 1 commit into dotnet:master on Oct 8, 2018

Conversation

erozenfeld
Member

We are starting work on object stack allocation. This document provides some background and describes our plan.

@erozenfeld
Member Author

@dotnet/jit-contrib @jkotas @davidwrighton

done in the jit to generate better code for stack-allocated objects. The details are in comments of
[coreclr #1784](https://github.com/dotnet/coreclr/issues/1784).

We did some analysis of the Roslyn csc self-build to see where this optimization may be beneficial. One hot place was found in [GreenNode.WriteTo](https://github.com/dotnet/roslyn/blob/fab7134296816fc80019c60b0f5bef7400cf23ea/src/Compilers/Core/Portable/Syntax/GreenNode.cs#L647).
Member

How hard is it to fix this up to save this allocation? It looks pretty straightforward.

Member Author

Yes, implementing a simple struct version of a stack and using it there will remove these allocations.
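For illustration, here is a minimal sketch of what such a struct-based stack might look like (a hypothetical `ValueStack<T>`; the actual Roslyn change may differ):

```csharp
// Hypothetical sketch: a value-type stack whose "header" lives in the
// caller's frame, so no Stack<T> object is heap-allocated. Only the
// backing array still comes from the GC heap.
internal struct ValueStack<T>
{
    private T[] _items;
    private int _count;

    public ValueStack(int capacity)
    {
        _items = new T[capacity];
        _count = 0;
    }

    public int Count => _count;

    public void Push(T item)
    {
        if (_count == _items.Length)
            System.Array.Resize(ref _items, _items.Length * 2);
        _items[_count++] = item;
    }

    public T Pop() => _items[--_count];
}
```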

Member

It would be interesting to look at which cases out of the ones you have listed are hard or impossible to fix by a simple local change. I think they would be the ones to focus on. Maybe the delegates?

Member Author

Delegates are one case. Another common case is a fixed-length array passed to, e.g., Console.WriteLine.


> Another common case is a fixed-length array passed to, e.g., Console.WriteLine.

There's also a proposal to avoid that array allocation from the language side in dotnet/csharplang#1757 by allowing params Span<T> parameters.
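To make the array case concrete, here is an illustrative example; with four format arguments the call binds to the `WriteLine(string format, params object[] arg)` overload:

```csharp
using System;

class ParamsAllocation
{
    static void Main()
    {
        int a = 1, b = 2, c = 3, d = 4;
        // The compiler emits roughly:
        //   Console.WriteLine("{0} {1} {2} {3}", new object[] { a, b, c, d });
        // The object[] (plus the boxes for the ints) is a fixed-length heap
        // allocation on every call; proving it never escapes requires
        // knowing that WriteLine doesn't store its argument array.
        Console.WriteLine("{0} {1} {2} {3}", a, b, c, d);
    }
}
```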

Member

Just so I understand, in the WriteTo case, the proposed optimization would remove the allocation of the Stack object itself, but the underlying array it uses internally would still be heap-allocated, right? There's no concept of the "entire stack" structure somehow being able to fit on the real execution stack as long as possible, right?

Member Author

Yes, the proposed optimization would remove the allocation of the Stack object itself. "Inlining" object fields into enclosing objects is beyond the scope of this.

is the most precise and most expensive (it is based on connection graphs) and was used in the context of a static Java compiler,
[[3]](https://pdfs.semanticscholar.org/1b33/dff471644f309392049c2791bca9a7f3b19c.pdf)
is the least precise and cheapest (it doesn't track references through assignments of fields) and was used in MSR's Marmot implementation
[[2]](https://www.usenix.org/legacy/events/vee05/full_papers/p111-kotzmann.pdf)
Member

Nit: missing a "." after the [2].

Member Author

Fixed.

Effectiveness of object stack allocation depends in large part on whether escape analysis is done inter-procedurally.
With intra-procedural analysis only, the compiler has to assume that arguments escape at all non-inlined call sites,
which blocks many stack allocations. In particular, assuming that the 'this' argument always escapes hurts the optimization.
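An illustrative example of the 'this' problem (not taken from the document):

```csharp
class Point
{
    public int X, Y;
    public int Sum() => X + Y; // 'this' is an implicit argument
}

class EscapeExample
{
    static int Use()
    {
        var p = new Point { X = 1, Y = 2 };
        // If Sum() is not inlined, a purely intra-procedural analysis must
        // assume 'p' escapes through the 'this' argument of the call, which
        // blocks stack allocation. If Sum() is inlined, the jit can see
        // that 'p' never escapes.
        return p.Sum();
    }
}
```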

Member

You might also mention that some approaches are able to handle objects that only escape on some paths by promoting them to the heap "just in time" as control reaches those paths -- for instance, Partial Escape Analysis and Scalar Replacement for Java.

So long as those escaping paths are indeed rare, this can pay off. For instance, in the local delegate case there is an exception path that makes it look like the delegate can escape. In my prototype I managed to modify the importer to prove this code wasn't reachable, but in general we may see things like this...
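A sketch of the pattern being described, where the object escapes only on a rarely-taken path (illustrative; not the actual prototype code):

```csharp
using System;

class PartialEscape
{
    static int[] s_saved; // a store here makes the object escape

    static int Compute(int i)
    {
        var data = new int[4]; // candidate for stack allocation
        data[0] = i;
        if (i < 0)
        {
            // Rarely-taken path on which the object really escapes.
            // Partial escape analysis would stack-allocate 'data' and
            // materialize it on the heap only if control reaches here.
            s_saved = data;
            throw new ArgumentOutOfRangeException(nameof(i));
        }
        return data[0];
    }
}
```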

Member Author

Updated the doc to mention this approach and added the paper to References.

newobj for the object that was determined to be non-escaping. Note that assemblies may lose verifiability with this approach.
An alternative is to annotate parameters with escape information so that the annotations can be verified by the jit with
local analysis.
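Such an annotation scheme might look something like the following (entirely hypothetical; no such attribute exists in the BCL):

```csharp
using System;

// Hypothetical attribute: promises that the callee never stores the
// parameter anywhere that outlives the call. The jit could verify the
// promise with local analysis of the callee body and then trust it
// when analyzing callers.
[AttributeUsage(AttributeTargets.Parameter)]
sealed class DoesNotEscapeAttribute : Attribute { }

static class Logger
{
    public static void Log([DoesNotEscape] object message)
    {
        Console.WriteLine(message); // 'message' is read, never stored
    }
}
```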

Member

@AndyAyersMS Oct 4, 2018

As I've mentioned elsewhere in passing, the jit cannot generally rely on any AOT-derived interprocedural information as ground truth. While that information may be true of the IL scanned by AOT, at runtime, because of profilers and the like, the jit may see different IL initially, or new IL may arrive after jitting.

Without the ability to revoke an arbitrary running method, the only current safe way to incorporate interprocedural information is via inlining. Inlines are tracked by the runtime and when a method body is updated, all the existing methods that are impacted are set for rejitting. Existing instances that are active continue to run and consistently use the old versions; new instances invoked after the IL update see only the new versions.

So any AOT-derived interprocedural information can at best be used as a strong hint to the jit, and those facts must be re-verified by actually inlining. Unless we know that IL updates are not possible OR we implement a general revocation scheme (deopt/osr). Given that method body updates are dynamically rare, this hinting might be sufficient to expose the perf opportunities, but it means coupling this information into the inliner.

Member

> Unless we know that IL updates are not possible OR we implement a general revocation scheme (deopt/osr).

Can the AOT-derived analysis include a list of methods that were used to produce the result? Then we can make this list logically inlined into the main method, and the rest will work the same way as if the methods were physically inlined.

Member

Imagine we have a long-running A that calls B every so often. We allow information about B to influence A's codegen. Then when B is modified we either need to fix up that running instance of A or prevent the old A from calling the new B. It is not enough to force any new call to A to be rejitted.

Say for instance A passes B a struct implicitly by-ref and the AOT version of B doesn't modify the struct. So we take advantage of this and the initial version of A doesn't copy the struct each time A calls B. Then if the new B modifies the struct, the old A can't safely call the new B.

So we either need to immediately revoke the old A or take pains to make sure the old A will still invoke the old B.

We don't have these problems when B is inlined, as old As always "invoke" old Bs, and when B is updated and both A and B get rejitted, new As always invoke new Bs. And any place we don't inline, we also don't bake in any dependency on the callee. So even if A both inlines B at some sites and calls B at others we're ok.

I haven't thought much about how to realistically support a system where code is versioned and we somehow keep straight which versions can safely call which other versions (note, as above, that this behavior is call-site dependent; e.g., an old A may be able to safely call the new B at some sites but must call the old B at others...). Maybe it is viable?
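A sketch of the A/B hazard described above (hypothetical methods; assumes the struct is passed via a caller-made, by-ref copy under the ABI):

```csharp
struct Big
{
    public long F0, F1, F2, F3; // large struct, passed by implicit by-ref copy
}

class VersioningHazard
{
    // AOT-analyzed version of B: provably never writes to 'b'.
    static void B(Big b) { }

    static void A()
    {
        var big = new Big();
        for (long n = 0; n < 1_000_000_000; n++) // long-running loop
        {
            // If A's codegen trusted "B never modifies its argument" and
            // skipped the defensive copy (passing the address of 'big'
            // directly), then an IL update that replaces B with a version
            // that writes to its parameter would let the new B corrupt
            // 'big' inside this still-running old A.
            B(big);
        }
    }
}
```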

Member

> take pains to make sure the old A will still invoke the old B

I have missed this part. It would certainly be non-trivial to get this right.

Member Author

Yes, if profiling is always on in the first version, we'll have to inline all methods the stack-allocated object can be passed to. Unfortunately, that will complicate analysis of the perf implications when both inlining changes and stack allocation are performed.

Member Author

Noted this in the document.

Member

@AndyAyersMS left a comment

Looks good overall -- left a few notes for you to think about.


## Other restrictions on stack allocations

* Objects with finalizers can't be stack-allocated since they always escape to the finalizer queue.

What about objects that have a finalizer, but whose finalization is always suppressed?

Though I don't know how likely it is that we could determine whether finalization is indeed suppressed, since that might require inlining of Dispose (which is the method that commonly calls SuppressFinalize).
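For reference, the pattern in question, the standard Dispose pattern:

```csharp
using System;

class Resource : IDisposable
{
    public void Dispose()
    {
        // ... release resources deterministically ...
        // After this call the finalizer will not run, but proving that at
        // jit time would typically require inlining Dispose (or otherwise
        // observing the SuppressFinalize call).
        GC.SuppressFinalize(this);
    }

    ~Resource()
    {
        // Fallback cleanup if Dispose was never called.
    }
}
```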


Or could the finalizer just be called synchronously, similar to `using (obj)`?

Member Author

I think at least for the initial version it's ok to be conservative here and not stack allocate objects with finalizers without worrying about whether they are suppressed.

@xoofx
Member

xoofx commented Oct 4, 2018

Hey,
Super glad there is someone starting to look at this.

Wondering why the approach of explicit management via transient (as I explained in my blog post) could not be an option?

@Korporal

Korporal commented Oct 4, 2018

@erozenfeld - I'm curious about the projected gains for this. Is there any analysis of what % of typical code would be able to leverage this? Is there an estimated increase in performance?


## Motivation

In .NET instances of object types are allocated on the garbage-collected heap.
Member

Nitpick: in .NET all instances are instances of object types (i.e., ints derive from object, etc.), but clearly not all of these go on the heap :)

Member

It might be more appropriate to say "instances of non-value types", or something to that effect.

Member Author

Fixed.


**Cons:**
* The jit analyzes methods top-down, i.e., callers before callees (when inlining), which doesn't fit well with the stack allocation optimization.
* Full interprocedural analysis is too expensive for the jit, even at high tiering levels.
Member

Has there been any consideration of "less than full" interprocedural analysis, for example profile-guided interprocedural analysis? I know devs who have worked on jits where runtime execution metrics were used to drive localized interprocedural analysis with very good results: instead of doing "full" analysis, keep track of which procedures are very hot, as well as the call relations between those hot procedures (e.g., function A calls B, which calls C, millions of times), and then go analyze those small cliques.


I think it's too strong a statement to say that "Full interprocedural analysis is too expensive for the jit, even at high tiering levels". If one assumes that with a higher-tiered JIT we would have the ability to memoize method properties with (in)validation, background on-demand/full interprocedural analysis would be feasible. That said, it's not likely to be practical any time soon.

Member Author

I re-worded the statement.

If the lifetime of an object is bounded by the lifetime of the allocating method, the allocation
may be moved to the stack. The benefits of this optimization:

* The pressure on the garbage collector is reduced because the GC heap becomes smaller.


This seems to miss a key point that the GC doesn't have to be involved in allocating or deallocating these objects. You mention that above, but you repeat the zero-initialize benefit below, so it might be good to clarify this.

Member Author

Clarified this in the doc.




* Objects with finalizers can't be stack-allocated since they always escape to the finalizer queue.
* Objects allocated in a loop can be stack allocated only if the allocation doesn't escape the iteration of the loop in which it is
allocated. Such analysis is complicated and is beyond the scope of at least the initial implementation.


I'm not sure how complex this would actually be. To me the question is whether this occurs frequently - if so, it would be very useful to do this analysis even if it doesn't escape, since the object could be reused across iterations.

Member Author

Just to clarify, the cases we would have to detect and disqualify are things like this:

```csharp
using System;

class A
{
    public int i;
}

class Test
{
    public static void Main()
    {
        Foo(2);
    }

    static void Foo(int i)
    {
        A a2 = null;
        while (i > 0)
        {
            A a1 = new A();
            if (i >= 2)
            {
                a2 = a1;
            }
            a1.i = i;
            Console.WriteLine(a2.i);
            i--;
        }
    }
}
```

This has to output:

```
2
2
```

so we can't reuse the same stack slot for the allocation at each iteration.
I believe the general analysis for this is not easy but we can detect some simple cases if they occur frequently.
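For contrast, a sketch of the simple safe case (reusing class A from the snippet above), where the allocation never leaves its iteration and a single stack slot could back it:

```csharp
static void Bar(int i)
{
    while (i > 0)
    {
        A a1 = new A(); // never stored outside this iteration
        a1.i = i;
        Console.WriteLine(a1.i);
        i--;
        // 'a1' is dead at the end of each iteration, so the same stack
        // slot could be reused for the allocation every time around.
    }
}
```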

* We can adjust inlining heuristics to give more weight to candidates whose parameters have references to potentially
stack-allocated objects. Inlining such methods may result in additional benefits if the jit can promote fields of the
stack-allocated objects.
* For higher-tier jit the order of method processing may be closer to bottom-up, i.e., callees before callers. That may

How difficult would this approach be? Compiling methods bottom-up would help more optimizations; for example, if the callee does not use certain volatile registers, the callers would not need to save those registers across calls (e.g., in SIMD code on Unix/Linux).


The JIT is fundamentally top-down in its compilation model, as it compiles on demand and doesn't know the downstream call chain until it is either executed or inlined.

Member Author

The idea here is that callees may be promoted to a higher tier before callers. For example, if both A and B call C, then it's possible that C will reach the count needed for promotion to the next tier before either A or B does.


> The idea here is that callees may be promoted to a higher tier before callers.

This situation looks really tricky and may need more scenarios and data. AFAIK, JVM implementations usually get help from bytecode-level analysis. Perhaps we can do something similar, e.g., collecting callee info during assembly loading or bytecode verification.

@fiigii

fiigii commented Oct 4, 2018

> I'm curious about the projected gains for this. Is there any analysis of what % of typical code would be able to leverage this? Is there an estimated increase in performance?

@Korporal There is an analysis in #19663, which manually changes class to struct to get an effect similar to escape analysis. The data shows, for the PacketTracer benchmark, a 31% execution-time improvement, a ~16% code-size reduction, and a significant reduction in GC overhead (~33% -> ~11%).

@erozenfeld
Member Author

@xoofx

Wondering why the approach of explicit management via transient (as I explained in my blog post) could not be an option?

@jaredpar can comment more on this but I believe the problem with this approach is that the transient annotation will become viral and everything will have to be annotated with it. That said, if the Roslyn team decides to support that approach, the work in the jit will still be used to do the actual stack allocation. Escape analysis will still be useful for code that doesn't have transient annotations.

@erozenfeld
Member Author

> I'm curious about the projected gains for this. Is there any analysis of what % of typical code would be able to leverage this? Is there an estimated increase in performance?

@Korporal It's hard to define typical code. If an allocation that's moved to the stack is hot, moving it may result in a measurable increase in performance. The percentage of allocations that can be moved to the stack will depend on how sophisticated our analysis is. We'll start with a simple analysis and improve it incrementally.

I did some experiments with the Roslyn self-build scenario, where I measured the percentage of newobj and newarr (but not box) allocations that didn't escape at runtime. The result was that 16.1% of allocated objects didn't escape.


## GitHub issues

[roslyn #2104](https://github.com/dotnet/roslyn/issues/2104) Complier should optimize "alloc temporary small object" to "alloc on stack"
Member

s/complier/compiler

Member Author

Fixed in the doc, although the typo is still in the title of roslyn #2104.

@erozenfeld force-pushed the ObjectStackAllocationWriteUp branch from 5ed2b35 to 1aca073 on October 8, 2018, 22:53
Member

@AndyAyersMS left a comment

Thanks for the updates. LGTM.

@erozenfeld erozenfeld merged commit 913428d into dotnet:master Oct 8, 2018
@jonathanvdc

jonathanvdc commented Apr 17, 2019

Hi! I know this PR is rather old. I only stumbled across it now and I think I may have something useful to contribute to the discussion. My apologies for the necromancy.

It's great to see that you're working on stack allocation for classes. I see that one of the options being proposed in this design document is (AOT) IL-to-IL optimization. Coincidentally, I'm actually working on a tool that does just that: ilopt. ilopt parses CIL, generates an SSA-based IR designed specifically for optimizing managed code, sends that IR through a sequence of transformations, generates CIL from the optimized IR and saves it back to disk.

Well, ilopt is actually just a command-line driver program. All the heavy lifting is done by Flame, its parent project. Flame seems to fit the bill with regard to the AOT IL optimization infrastructure you're proposing.

So,

  1. Do you think Flame is the kind of thing you're looking for to do AOT IL optimizations?
  2. What are the odds of having an opcode that allocates a class on the stack in the future? I'm only asking because I implemented a stack allocation pass for classes in a previous version of Flame (and plan on reimplementing it in the future), but it was very complex because there's no stackalloc-like opcode for classes. It also would've been able to optimize more diverse fragments of code if there had been such an opcode.

@PathogenDavid

> All the heavy lifting is done by Flame

Your link is broken, BTW. It should be https://www.github.com/jonathanvdc/flame

@jonathanvdc

Thanks! I didn't realize the https:// prefix was mandatory. The link should be fixed now.
