-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CLR/JIT should optimize "alloc temporary small object" to "alloc on stack" automatically #4584
Comments
Thanks, this is something the team is aware of and discussed. A specific example in a real-world workload that you can share would help us prioritize this request. We already have synthetic examples. |
You may be also interested with the following proposal and prototype implementation project: Stackalloc for reference types with Roslyn and CoreCLR [1] [2] It introduce a generalized stackalloc with a transient variable modififer to solve the problem.
[1] http://xoofx.com/blog/2015/10/08/stackalloc-for-class-with-roslyn-and-coreclr/ |
This might be a chicken and egg situation. If the optimization doesn't exist people will try to twist the code to avoid allocations (see for example params support for IEnumerable) and as a result there will be fewer optimization opportunities. On the other hand if the optimization exists then perhaps people will concentrate more on actually writing code instead of figuring out how to avoid allocations. |
You will need the same constraints that ValueTypes have so you may as well use a struct instead of a class. For example, you can not have class finalizers. The only difference I see is that the compiler is calling a ctor for you instead of having to call MyStruct.Initialize(); yourself. |
@mikedn I don't disagree, but we need some way to help assess priority as we have other opportunities were we can also improve code-quality and this would help to prioritize. We have prototyped this kind of transformation in other experimental systems. Because of the challenges in the compiler automatically finding legal opportunities, which is even more challenging in JIT environments, my experience has been that developers that really want to avoid heap allocations end up having to do manual source changes anyway, such as using ValueTypes. @OtherCrashOverride yes, that is correct. And there are other correctness rules the compiler must observe, for example this transformation might lead to stack-overflow if done on a recursion path. |
Did the team look at the work the Hotspot JVM team did with Escape Analysis? They have production experience in a very similar setting and should be able to answer many open questions. You should be able to run benchmarks on the JVM with Escape Analysis turned on and off. They have a command line switch. The performance differential should carry over to the CLR. Prime use cases would be enumerators (LINQ queries) and string processing I think. These can generate tons of garbage. Also things like |
The situation is a bit more complicated than that. Java doesn't have value types and as such it needs allocation optimizations more than .NET needs. Also, it's interesting that the "Escape Analysis for Java" paper shows that for some benchmarks the percentage of objects allocated on the stack can exceed 70% but the execution time improvement is no more than 23%. And as far as I can tell that also includes improvements due to the elimination of locks for objects that don't escape the thread.
Of all possible use cases LINQ is the least likely to benefit from such an optimization. It's interface reliance means that it is difficult to inline methods or at least perform some interprocedural analysis. Without inlining the created enumerators are returned and it's difficult to allocate such objects on the stack. They cannot be allocated on the callee frame and anyway the callee has no way of knowing that the object doesn't escape the stack, only the caller knows that. I suppose one could imagine some sort of "return allocation context" that's passed from the caller via a hidden function parameter and allows the caller to customize the way returned objects are allocated but that's a rather crazy thing to do. Personally I'd be happy enough if, for example, small arrays could be allocated on the stack so contortions like that described in "params support for IEnumerable" wouldn't be needed. But unfortunately the cost to achieve this might be too great. |
The Java remarks are entirely true. Regarding LINQ, here's an idea: If the callee is known to allocate a small, bounded amount of memory (this is generally the case for enumerator allocating methods) and if no such pointer escapes according to some simple data flow tracing (also the case here) then change the base pointer for heap allocations to the stack temporarily. Like this:
This rather simple scheme should match a lot of functions. No need to inline. Call targets must be known but that also does not require inlining. It requires the JIT to determine the return type (which is pretty much always known in the case of iterator methods) and apply that knowledge to devirtualize all calls. I do not need to inline OrderBy in order to determine that it returns exactly an OrderedEnumerable. On the other hand this will not work for Select because it is too sophisticated. No need to inline but it requires analyzing quite a few functions (all that touch the memory allocated). A second simple scheme would be to allocate all ref types on the stack if their address is never passed to any other method, returned or stored to non-locals. And then have a (reliable) way to blast all aggregates allocated on the stack to registers. This scheme would catch a lot simple helper classes like vectors, points, some strings, Utf8String's, .... If we can get a basic implementation of both of these optimizations that should catch a lot. And btw, I think that 23% execution time improvement is a fantastic number. Depends on the workload obviously. |
It can't allocate everything on the stack, some allocations may escape and the callee won't always know which ones. You really need to distinct allocation contexts - "GC allocation context" and "return allocation context". The return context is normally the same as the GC context but the caller may replace it when possible. The callee always uses the return context to allocate the returned object. The return context cannot be used to allocate any other objects, including objects that are referenced by the returned object since the caller may escape those objects.
That's the "normal" scheme. Though it's a bit too restrictive, calling any method of an object implies passing the object address to the method (unless the method is inlined) so you wouldn't be able to do much with such objects. The compiler should analyze the callees a bit for best results, a call depth of less than 5 might suffice in many cases. For example, it should be possible to stack allocate the void foo(BasicBlock bb) {
var stk = new Stack<BasicBlock>();
stk.Push(bb);
while (bb.Count > 0) {
bb = stk.Pop();
// other code that calls stk.Push
}
} |
OK, all true. I really should not specify any particular (naive) escape analysis scheme but I think it's possible to devise some rather simple scheme that kills a lot of allocations and enables developers to be more productive because they can use more helper objects. I think a key point is that inlining does not have to occur in order to infer information about the behavior of methods. Often, escape behavior can be summarized for a method in a meaningful way ("does not store to the heap/statics at all" or "does not create a new reference from the heap/statics for argument 3"). We now don't have a code size problem, only need to deal with compile time. |
Yes, in general inlining doesn't need to occur. It's just the "return new object" case that benefits from inlining. |
@cmckinsey Here's a practical example where escape analysis could be really helpful. Let's have a look at this snippet in F# code: // ReferenceEscapeTest.fs
module EscapeTest
// a:int -> b:int -> int
let diff a b =
let (min, max) = if a > b then (b, a) else (a, b)
max - min From functional programming standpoint this is a very primitive construction which utilizes a tuple for dealing with two local variables in one expression. For an F# developer those are just cheap local variables, presumably allocated on stack.
But when we look at the disassembled IL: using Microsoft.FSharp.Core;
using System;
[CompilationMapping(SourceConstructFlags.Module)]
public static class EscapeTest
{
[CompilationArgumentCounts(new int[] {1, 1})]
public static int diff(int a, int b)
{
Tuple<int, int> tuple = a <= b ? new Tuple<int, int>(a, b) : new Tuple<int, int>(b, a);
return tuple.Item2 - tuple.Item1;
}
} We can see that, in fact, we're unintentionally abusing heap allocation for a basic arithmetic operation. P. S. This might (or might not) be addressed by F# compiler in the first place, I just not aware of design decisions what led to this code generation in this particular case, but my point is that this optimization on the JIT side could definitely make the difference. |
Example code that can benefit from this is riddled all over FxCore; E.g: HastSet.IntersectWithEnumerable and CheckUniqueAndUnfoundElements both allocate a temporary BitHelper object. List.InsertRange allocates a temporary array. ConcurrentBag new's up a SpinWait in the CanSteal method. Also, almost all code that has a "using" statement has an object that goes out of scope can be cleaned up immediately. No need to wait for the garbage collector. just call dispose and free it. like... And... And... Finally... To name just a few. |
This is the reason the GC has to run to finalize the object. It has to trace to determine if an object reference is still held anywhere. This is also the point where 'escape analysis' enters the discussion.
|
@OtherCrashOverride sure, it's possible that the object is assigned to something inside the using, and in this case I would expect the compiler to see this, be conservative, and not free it. However, I claim the 80% case is this does not happen. Now, perhaps you are talking about the case where there are nested using statements, which is pretty common. In this case, I hope the compiler would be smart enough to unwind the stack and see that after Dispose all the objects can be freed immediately. |
It seems to me that escape analysis and call depth can get pretty complicated. I recommend that this feature be added in a super-simple way that lays the groundwork for a more sophisticated approach later. With that in mind it seems to me a very simple case study is String.Split. It returns an allocated array. It's used in a lot of code. The array does not contain anything that requires depth analysis. It represents a very simple example of an object that often is not used outside the scope of the function. If the array returned by Split can be freed early, the same logic will apply to a lot of other objects. |
@SunnyWar Might be similar analysis already done for Ref returns and Locals? dotnet/roslyn#8030 |
@SunnyWar How could an unknown-sized array returned from a method be stack allocated? Unless Or are you talking about some "early free from heap" approach and not "stack allocate"? |
@svick I'm talking about "early free from heap" though if "stack allocate" accomplishes the same thing then I'm all for it! Every time I see "new" on something simple that has a scoped lifetime I wonder to myself, "Why does this need to be garbage collected at all? Isn't it obvious to even a brain-dead compiler that it is scoped, has no references, and was passed to no fuctions (in othe words: is 100% safe to free)? Why not clean it up immediately? I honestly don't get why this is so difficult and wasn't done from the beginning. That's my motivation. |
And how do you clean it up immediately? The GC heap doesn't have any mechanism that allows you free a single object. And adding such a mechanism isn't exactly trivial. |
If the GC has no mechanism to free up a single object, then how does the GC clean up a single object?
|
@SunnyWar this capability could be added. If the object to be freed is the last one in the current segment the allocation pointer can just be decremented. There have been dynamic escape tracking schemes implemented like this. Escape analysis and tracking can use a mix of static and dynamic analysis. It also can work with dynamically sized objects and variable amounts of objects. I think a pragmatic way to go would be to stack allocate in obvious cases. Sometimes it really is trivial to prove that no pointer escapes and that the size of all objects affected is bounded. |
GC does not work like that, it doesn't free single objects. It basically moves around all the objects that are in use to eliminate the free space that may have accumulated between them. What you suggests requires maintaining free lists and doing that in a GC allocator is a bit of a problem. Not because it is impossible or difficult but because not having to maintain such lists is one of the few advantages a GC allocator has over a classic allocator (though the CLR GC does maintain free lists in certain cases). |
@mikedn Thanks for the explanation. It's enlightening. I'm less interested in how difficult it is than when it will be done. Though now I see why the discussion is about stack allocation. |
I'm happy that so many people are interested in my idea. Please have a look at another idea:https://github.com/dotnet/coreclr/issues/555, I think it's more interesting. |
Can this improve perfomance of using System;
using System.Collections.Generic;
class Program
{
static void Main()
{
const int N = 10000;
var list = new List<int> { 1, 2, 3, 4, 5 };
Console.WriteLine("total memory:");
Console.WriteLine(GC.GetTotalMemory(false));
for (int i = 0; i < N; i++)
{
// no boxing. struct List<int>.Enumerator is used for iteration.
var s2 = 0;
foreach (var x in list)
s2 += x;
}
Console.WriteLine(GC.GetTotalMemory(false));
for (int i = 0; i < N; i++)
{
// list is boxed to IEnumerable
var s3 = ViaInterface(list);
}
Console.WriteLine(GC.GetTotalMemory(false));
for (int i = 0; i < N; i++)
{
// list is not boxed, but List<int>.Enumerator is boxed to IEnumeartor<int>
var s4 = ViaGenerics(list);
}
Console.WriteLine(GC.GetTotalMemory(false));
}
static int ViaInterface(IEnumerable<int> items)
{
var s = 0;
foreach (var x in items)
s += x;
return s;
}
static int ViaGenerics<T>(T items)
where T : IEnumerable<int>
{
var s = 0;
foreach (var x in items)
s += x;
return s;
}
} |
Not really, this was mentioned in some posts above that discussed LINQ. Not only that it is difficult to stack-allocate returned objects but you won't even reach that far because you can't perform escape analysis on It may work if |
To the question about use cases: There are certainly widely spread use cases where avoiding these small allocations can be very useful. Summary: I would really like to see this happening. |
There are a lot of 'params' methods in the hot path. String.Trim(), for example, is often used. It does not allocate unless it actually changes the string, yet must heap allocate an array of chars.
APIs involving mutable structs (C# doesn't have immutable ones) are incredibly error prone. There's no clarity into what copy of a struct is being mutated by what method. structs are therefore eschewed in favor of classes in public APIs. I think it's an oversimplification to imply .NET doesn't need allocation optimizations as much as Java does. |
Pardon my jumping in, but what's the status of this? |
its 2022,with more 3 years,thats will be 10 years,.net team will be do something for .net's ZGC? We all know that the C++ committee delays and slows down C++ development, but the .net foundation is relatively young and shouldn't be. And GC is just an implementation, not a standard, this shouldn't be delayed too long. |
I think it's more along the lines of determining whether this is the right solution to the problem. The JVM's optimizations around allocations have a significant impact, but they are limited in ways that a developer won't necessarily have ready insight into. I'm not an expert on GC by any means, but my limited understanding leads me to believe some form of deterministic deallocation mechanism or a more aggressive object pool for temporary variables would be more feasible. Not least because it would allow you to continue passing "temporary" instances to methods that take them as arguments, including methods on the instance itself. After all, an optimization is no good if all the code you normally write falls outside the prerequisites. |
@sgf I think you might be underestimating how complex this issue is. The GC isn't really the main issue here, it's the escape analysis required to ensure the optimization is safe to preform. Either way, @AndyAyersMS has indicated in #11192 (comment) that this is potentially on deck to receive more attention after .NET 7. |
If .net teams do not have the technical ability to solve the problem, then learn from the go team and hire experts from outside to solve the problem. Times have changed, and .net's current gc can only handle toy-level applications. I don't know if this is a typical problem for a super company like Microsoft, but .net is moving too slowly and has missed too many opportunities. Although things are getting better with the current .net it seems to be too late. The slow motion of .net even affected the development of Unity3d.This is only one of the few areas where .net is still popular. My words don't sound good,maybe some people don't like it and are not happy.But that's the truth, isn't it? |
In order to maintain the illusion of .net prosperity, as long as I see .net projects, I mark almost all of them as star . Although this is trivial, .net open source projects are also part of the .net ecosystem, if only to incentivize them to maintain open source projects. |
|
There are many trillions of lines of code written on the .NET platform, in some cases within codebases that are already more than a decade old. The fact that the .NET ecosystem evolves as fast as it does is a testament to the "technical ability" of all the contributors to the .NET ecosystem. I've worked on applications that started out as Windows services running on premise and ended up running in a Kubernetes on a cloud platform. It speaks well of everyone involved that fundamental changes to core pieces, like GC, are done after very careful consideration. Also, when you get a chance, I kindly suggest reading the Code of Conduct. |
Hope this can be implemented in .Net 8. @AndyAyersMS |
I understand that allocating objects on the stack may be vastly more complex than it appears on the surface. Perhaps a more doable idea is to enable a new type of heap allocation, using a new heap which is not managed by the garbage collector and requires manual allocation and freeing of memory. We can be smarter than the GC at allocating/freeing memory, this new functionality would enable us to put this knowledge into action. I'd love to hear people's thoughts on this, any obvious caveats that I've missed? |
@Zintom why do you think a heap allocator would be cheaper? How would the GC detect roots from your heap? |
Arena based allocators like "allocate temp heap, do whatever you want within it, remove heap entirely" have the same problem as object stack allocation - there should be an escape analysis done proving that no object refs "escape" to actual heaps of GC. So I guess you would need
|
My idea is that any object referenced by this "unmanaged heap" would be marked as "possibly live, linked to an object in the unmanaged heap" and thus the GC would ignore them.
Considering most of the runtime is built around the idea of "the less allocations the better" in order to avoid GC pressure, I believe this could have massive potential in terms of the BCL and the wider ecosystem in general. I think most intermediate developers and above know how to manage their alloc's/free's. Maybe it wouldn't be used everywhere, but in critical sections of code or in libraries we could see major improvements, the Average Joe wouldn't have to know that an unmanaged heap even existed, but would benefit as his allocations wouldn't have such an impact on his program performance. |
If at all possible, I would prefer GC optimizations that didn't require explicit action by the developer. Ideally they should be transparent to the developer, like strategies where the JIT or AOT engine detects non-escaping allocations and implements optimizations, such as scalar replacement, without the developer needing to explicitly mark a variable as non-escaping or allocating it on a special heap. There's a lot of existing code out there, including in the BCL, where such optimizations would be beneficial. It's pragmatically a non-starter to require that code to be updated or rewritten to explicitly make use of those optimizations. My question for the experts would be: would it even be an optimization if hypothetically the runtime could simply deterministically deallocate some non-escaping heap allocations when they go out of scope? I mean, putting scalar replacement aside, if the JIT could deterministically deallocate a temp variable when it goes out of scope. That's not free from a performance perspective, yeah? I imagine it would negatively effect the throughput of the method. I think the question suggests we should step back and analyze what it is we're actually optimizing. GC pressure is "bad" for a specific type of performance. For example, for real-time graphics applications or animated UI frameworks, a GC pause can cause visual stuttering. For high throughput applications like microservices and other backends, high GC pressure might cause throughput issues if the GC can't scale with the demands being placed on it. How can we solve these specific problems? Is deterministic deallocation for temporary objects really the best solution? I'd love to hear some insights from the experts. Worth noting that both the real-time graphics/UI applications and high-throughput microservices tend to do relatively short bursts of work on a particular thread or thread pool and then give up control of the thread back to the thread pool or else the thread pauses until it's signaled again. Just a random thought: perhaps a more efficient way to deal with temporary allocations is some kind of thread-local "temp variable GC generation". That is to say, rather than allocation all temporary variables onto gen 0, when the JIT can prove a variable is non-escaping, it can safely allocate that variable onto a separate "gen T" (for Temporary). This pool ONLY contains objects that are non-escaping for a specific thread. Then, whenever a thread gives up control back to a thread pool or goes to sleep, the GC can free up everything in that thread's gen-t pool without a full scan. |
Allocating small objects on the stack can not only reducing the GC pressure, but also can make small objects be able to participate in the struct promotion optimization, so that small objects can just be kept in the register (enregister) without need of being allocated at all, which is a significant performance improvement for .NET. A lot of legacy/benchmark code is not using structs today, which is also a great motivation to implement this optimization so that they can benefit without need of changing the source. Just see how this can affect the binary tree benchmark: public class BinaryTrees
{
struct TreeNodeS
{
class Next { public TreeNodeS left, right; }
readonly Next next;
TreeNodeS(TreeNodeS left, TreeNodeS right) =>
next = new Next { left = left, right = right };
internal static TreeNodeS Create(int d)
{
return d == 1 ? new TreeNodeS(new TreeNodeS(), new TreeNodeS())
: new TreeNodeS(Create(d - 1), Create(d - 1));
}
internal int Check()
{
int c = 1;
var current = next;
while (current != null)
{
c += current.right.Check() + 1;
current = current.left.next;
}
return c;
}
}
class TreeNode
{
class Next { public TreeNode left, right; }
readonly Next next;
TreeNode(TreeNode left, TreeNode right) =>
next = new Next { left = left, right = right };
public TreeNode() { }
internal static TreeNode Create(int d)
{
return d == 1 ? new TreeNode(new TreeNode(), new TreeNode())
: new TreeNode(Create(d - 1), Create(d - 1));
}
internal int Check()
{
int c = 1;
var current = next;
while (current != null)
{
c += current.right.Check() + 1;
current = current.left.next;
}
return c;
}
internal void Hold() { }
}
internal class Program
{
static void Main(string[] args)
{
BenchmarkRunner.Run<Benchmarks>();
}
}
[MemoryDiagnoser]
public class Benchmarks
{
const int MinDepth = 4;
[Arguments(18), Benchmark]
public int Class(int maxDepth)
{
var longLivedTree = TreeNode.Create(maxDepth);
var nResults = (maxDepth - MinDepth) / 2 + 1;
for (int i = 0; i < nResults; i++)
{
var depth = i * 2 + MinDepth;
var n = (1 << maxDepth - depth + MinDepth);
var check = 0;
for (int j = 0; j < n; j++)
{
check += TreeNode.Create(depth).Check();
}
}
return longLivedTree.Check();
}
[Arguments(18), Benchmark]
public int Struct(int maxDepth)
{
var longLivedTree = TreeNodeS.Create(maxDepth);
var nResults = (maxDepth - MinDepth) / 2 + 1;
for (int i = 0; i < nResults; i++)
{
var depth = i * 2 + MinDepth;
var n = (1 << maxDepth - depth + MinDepth);
var check = 0;
for (int j = 0; j < n; j++)
{
check += TreeNodeS.Create(depth).Check();
}
}
return longLivedTree.Check();
}
}
}
Simply changing class to struct yields a 500% performance improvement and cut down 60% of the allocation for free. |
struct is very good,but still less description ,some limit like : cant define custom struct as fixed buffer , |
Any news on this topic here so far? |
Any progress? |
Yes, it's enabled now. |
It is enabled by default in .NET 9, correct? |
Yes. |
If a small object's reference is never stored in heap or static variables, (only stored in local variables), it is temporary, the CLR or JIT should alloc it on stack and free it in time. Thus, GC pressure would be lower. This feature is just like escape analysis technique in JVM.
I've suggested this in Roslyn forum(dotnet/roslyn#2104), but they said I should suggest here.
category:cq
theme:stack-allocation
skill-level:expert
cost:large
The text was updated successfully, but these errors were encountered: