This repository has been archived by the owner on Jan 23, 2023. It is now read-only.

Improvements for object stack allocation. #21950

Merged
merged 1 commit into from
Jan 15, 2019

erozenfeld
Member

This change enables object stack allocation for more cases:

  1. Objects with gc fields can now be stack-allocated.
  2. Object stack allocation is enabled for x86.

ObjectAllocator updates the types of trees containing references
to possibly-stack-allocated objects to TYP_BYREF or TYP_I_IMPL as appropriate.
That allows us to remove the hacks in gcencode.cpp and refine reporting of pointers:
the pointer is not reported when we can prove that it always points to a stack-allocated object or is null (typed as TYP_I_IMPL);
the pointer is reported as an interior pointer when it may point to either a stack-allocated object or a heap-allocated object (typed as TYP_BYREF);
the pointer is reported as a normal pointer when it points to a heap-allocated object (typed as TYP_REF).
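
The reporting rule above can be sketched as a small mapping from what the analysis knows about a pointer to its type and GC report kind. This is an illustrative model only: `VarType`, `GcReport`, `PointsTo`, `RetypeFor`, and `GcReportFor` are hypothetical names, not the JIT's actual declarations.

```cpp
#include <cassert>

// Hypothetical stand-ins for the JIT's TYP_REF / TYP_BYREF / TYP_I_IMPL.
enum class VarType { Ref, Byref, NativeInt };

// How a pointer slot is reported to the GC (illustrative names).
enum class GcReport { Normal, Interior, NotReported };

// What the analysis knows about where a pointer may point.
struct PointsTo
{
    bool mayPointToStack;
    bool mayPointToHeap;
};

// Choose the tree/local type following the rule in the description.
inline VarType RetypeFor(PointsTo p)
{
    if (p.mayPointToStack && !p.mayPointToHeap)
    {
        return VarType::NativeInt; // always a stack-allocated object or null
    }
    if (p.mayPointToStack)
    {
        return VarType::Byref; // either stack-allocated or heap-allocated
    }
    return VarType::Ref; // heap-allocated only
}

// Map the chosen type to the way the GC sees the pointer.
inline GcReport GcReportFor(VarType t)
{
    switch (t)
    {
        case VarType::Ref:       return GcReport::Normal;
        case VarType::Byref:     return GcReport::Interior;
        case VarType::NativeInt: return GcReport::NotReported;
    }
    assert(false && "unexpected type");
    return GcReport::NotReported;
}
```

For example, `GcReportFor(RetypeFor({true, true}))` yields `GcReport::Interior` in this model, matching the TYP_BYREF case.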

ObjectAllocator also adds flags to indirections:
GTF_IND_TGTANYWHERE when the indirection target may be on the heap or the stack
(that results in checked write barriers being used for writes),
or the new GTF_IND_TGT_NOT_HEAP when the indirection target is null or stack memory
(that results in no barrier being used for writes).
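
A minimal sketch of how such flags could drive write barrier selection for a reference store. The flag constants and the `BarrierForRefStore` function are hypothetical; the real GTF_* values and the JIT's barrier selection code differ. The "unchecked barrier when neither flag is set" case is an assumption about the default and is not stated in the description above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical flag bits; they only model the two flags named above.
constexpr std::uint32_t kTgtAnywhere = 1u << 0; // models GTF_IND_TGTANYWHERE
constexpr std::uint32_t kTgtNotHeap  = 1u << 1; // models GTF_IND_TGT_NOT_HEAP

enum class WriteBarrier { None, Unchecked, Checked };

// Pick the barrier for a store of a GC reference through an indirection.
inline WriteBarrier BarrierForRefStore(std::uint32_t indFlags)
{
    // In this model the two flags are mutually exclusive.
    assert((indFlags & kTgtAnywhere) == 0 || (indFlags & kTgtNotHeap) == 0);

    if ((indFlags & kTgtNotHeap) != 0)
    {
        return WriteBarrier::None; // target is null or stack memory
    }
    if ((indFlags & kTgtAnywhere) != 0)
    {
        return WriteBarrier::Checked; // target may be heap or stack
    }
    return WriteBarrier::Unchecked; // assumed default: target known to be on the heap
}
```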

@erozenfeld
Member Author

x64 PMI diffs with #21944 applied to both base and diff and object stack allocation enabled in both base and diff:

PMI Diffs for System.Private.CoreLib.dll, framework assemblies for  default jit
Summary:
(Lower is better)
Total bytes of diff: -2225 (-0.01% of base)
    diff is an improvement.
Top file regressions by size (bytes):
         154 : Microsoft.DotNet.ProjectModel.dasm (0.07% of base)
          88 : xunit.execution.dotnet.dasm (0.04% of base)
          67 : System.Net.Http.dasm (0.01% of base)
          50 : Microsoft.Diagnostics.Tracing.TraceEvent.dasm (0.00% of base)
          34 : System.Private.Xml.dasm (0.00% of base)
Top file improvements by size (bytes):
       -1870 : System.Threading.Tasks.Dataflow.dasm (-0.30% of base)
        -421 : System.Private.DataContractSerialization.dasm (-0.05% of base)
        -277 : Microsoft.CodeAnalysis.VisualBasic.dasm (0.00% of base)
         -80 : NuGet.Configuration.dasm (-0.15% of base)
         -14 : System.Linq.Parallel.dasm (0.00% of base)
15 total files with size differences (7 improved, 8 regressed), 114 unchanged.
Top method regressions by size (bytes):
         154 ( 4.28% of base) : Microsoft.DotNet.ProjectModel.dasm - ProjectReader:ReadProject(ref,ref,ref,ref):ref:this
          51 ( 2.44% of base) : System.Private.Xml.dasm - XslAstRewriter:Refactor(ref,int):this
          50 ( 8.00% of base) : xunit.execution.dotnet.dasm - <>c__DisplayClass26_0:<Find>b__0():this
          46 ( 1.21% of base) : Microsoft.Diagnostics.Tracing.TraceEvent.dasm - TraceEventSession:EnableProvider(struct,int,long,ref):bool:this
          38 (11.73% of base) : xunit.execution.dotnet.dasm - <>c__DisplayClass28_0:<Find>b__0():this
Top method improvements by size (bytes):
        -189 (-14.12% of base) : System.Threading.Tasks.Dataflow.dasm - TransformBlock`2:get_DebuggerDisplayContent():ref:this (5 methods)
        -189 (-14.12% of base) : System.Threading.Tasks.Dataflow.dasm - TransformManyBlock`2:get_DebuggerDisplayContent():ref:this (5 methods)
        -148 (-12.52% of base) : System.Threading.Tasks.Dataflow.dasm - BroadcastBlock`1:get_DebuggerDisplayContent():ref:this (5 methods)
         -93 (-19.25% of base) : Microsoft.CodeAnalysis.VisualBasic.dasm - MethodCompiler:GetEntryPoint(ref,ref,ref,struct):ref
         -92 (-8.03% of base) : Microsoft.CodeAnalysis.VisualBasic.dasm - LocalRewriter:VisitForEachStatement(ref):ref:this
Top method regressions by size (percentage):
          38 (11.73% of base) : xunit.execution.dotnet.dasm - <>c__DisplayClass28_0:<Find>b__0():this
          25 ( 9.69% of base) : System.Private.DataContractSerialization.dasm - DataContractJsonSerializerImpl:.ctor(ref,ref):this (2 methods)
          50 ( 8.00% of base) : xunit.execution.dotnet.dasm - <>c__DisplayClass26_0:<Find>b__0():this
          13 ( 5.88% of base) : System.Security.Cryptography.X509Certificates.dasm - ECDsaCertificateExtensions:HasECDsaKeyUsage(ref):bool
          12 ( 5.56% of base) : System.Private.DataContractSerialization.dasm - DataContractSerializer:.ctor(ref,ref):this (2 methods)
Top method improvements by size (percentage):
         -18 (-32.14% of base) : System.Threading.Tasks.Dataflow.dasm - BroadcastBlock`1:get_ValueForDebugger():int:this
         -18 (-31.58% of base) : System.Threading.Tasks.Dataflow.dasm - BroadcastBlock`1:get_ValueForDebugger():long:this
         -18 (-29.51% of base) : System.Threading.Tasks.Dataflow.dasm - BroadcastBlock`1:get_ValueForDebugger():double:this
         -72 (-28.02% of base) : System.Threading.Tasks.Dataflow.dasm - BroadcastBlock`1:get_HasValueForDebugger():bool:this (5 methods)
         -90 (-25.35% of base) : System.Threading.Tasks.Dataflow.dasm - TransformBlock`2:get_OutputCountForDebugger():int:this (5 methods)
72 total methods with size differences (53 improved, 19 regressed), 193309 unchanged.

@erozenfeld
Member Author

No diffs with object stack allocation disabled.

@erozenfeld
Member Author

Some of the regressions are due to stack offset changes after moving allocations from the heap to the stack. For example, on x64
mov qword ptr [rbp-80H], rax is 4 bytes while mov qword ptr [rbp-B0H], rax is 7 bytes.
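
The size difference comes from the displacement encoding: an offset that fits in a signed byte uses a 1-byte displacement, otherwise a 4-byte displacement is needed. A small sketch of that arithmetic for this one instruction form (the function name `MovRbpDispRaxLength` is made up for illustration):

```cpp
#include <cassert>
#include <cstdint>

// Encoded length in bytes of `mov qword ptr [rbp+disp], rax` on x64.
// The fixed part is REX.W prefix + opcode 0x89 + ModRM; an rbp base
// always carries a displacement, which is 1 byte if it fits in int8
// and 4 bytes otherwise.
inline int MovRbpDispRaxLength(std::int32_t disp)
{
    const int  fixedBytes  = 3;
    const bool fitsInDisp8 = disp >= INT8_MIN && disp <= INT8_MAX;
    return fixedBytes + (fitsInDisp8 ? 1 : 4);
}
```

With this model, an offset of -0x80 (-128) still fits in a signed byte and gives 4 bytes, while -0xB0 (-176) does not and gives 7 bytes.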

@@ -349,7 +349,7 @@ CONFIG_STRING(JitInlineReplayFile, W("JitInlineReplayFile"))
#endif // defined(DEBUG) || defined(INLINE_DATA)

CONFIG_INTEGER(JitInlinePolicyModel, W("JitInlinePolicyModel"), 0)
-CONFIG_INTEGER(JitObjectStackAllocation, W("JitObjectStackAllocation"), 0)
+CONFIG_INTEGER(JitObjectStackAllocation, W("JitObjectStackAllocation"), 1)

Member Author

I will not be merging this change. I'd like to run some CI testing with object stack allocation enabled. I'll revert this change before merging.

@erozenfeld
Member Author

@AndyAyersMS @echesakovMSFT @dotnet/jit-contrib PTAL

@erozenfeld
Member Author

erozenfeld commented Jan 11, 2019

Example of a good diff (BroadcastBlock`1:get_ValueForDebugger():int:this from System.Threading.Tasks.Dataflow.dll):

G_M19268_IG01:
-     push     rdi
-     push     rsi
-     sub      rsp, 40
+     sub      rsp, 24
+     xor      rax, rax
+     mov      qword ptr [rsp+10H], rax

G_M19268_IG02:
-     mov      rsi, gword ptr [rcx+8]
+     mov      rax, gword ptr [rcx+8]
-     mov      ecx, dword ptr [rsi]
-     mov      rcx, 0xD1FFAB1E
-     call     CORINFO_HELP_NEWSFAST
-     mov      rdi, rax
-     lea      rcx, bword ptr [rdi+8]
-     mov      rdx, rsi
-     call     CORINFO_HELP_ASSIGN_REF
-     mov      rax, gword ptr [rdi+8]
+     mov      rdx, rax
+     mov      edx, dword ptr [rdx]
+     xor      rdx, rdx
+     lea      rcx, bword ptr [rsp+08H]
+     mov      qword ptr [rcx], rdx
      mov      eax, dword ptr [rax+96]

G_M19268_IG03:
-     add      rsp, 40
+     add      rsp, 24
-     pop      rsi
-     pop      rdi

-; Total bytes of code 56, prolog size 6 for method BroadcastBlock`1:get_ValueForDebugger():int:this
+; Total bytes of code 38, prolog size 11 for method BroadcastBlock`1:get_ValueForDebugger():int:this

Member

@AndyAyersMS AndyAyersMS left a comment

Overall this looks good -- nice to see it rounding into shape.

Left you two small notes.

{
tree->ChangeType(newType);
}
lclVarDsc->lvType = newType;
Member

Suggest you sink this update out of both then/else and add a JITDUMP message describing the change.

@@ -321,6 +392,8 @@ bool ObjectAllocator::MorphAllocObjNodes()

const unsigned int stackLclNum = MorphAllocObjNodeIntoStackAlloc(asAllocObj, block, stmt);
m_HeapLocalToStackLocalMap.AddOrUpdate(lclNum, stackLclNum);
MarkLclVarAsDefinitelyStackPointing(lclNum);
MarkLclVarAsPossiblyStackPointing(lclNum);
Member

Can you add a note here and/or elsewhere that the possibly stack pointing set is kept as a superset of the definitely stack pointing set (and that if a local is in both, it is definitely stack pointing)?
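
One way to picture the invariant being requested is a pair of sets where marking a local as definitely stack pointing also marks it as possibly stack pointing. This is a sketch under that assumption: `StackPointingSets` and its members are hypothetical and do not mirror ObjectAllocator's actual data structures, which mark the two sets with separate calls.

```cpp
#include <cassert>
#include <set>

// Illustrative model of the two sets of local variable numbers.
class StackPointingSets
{
public:
    void MarkPossibly(unsigned lclNum)
    {
        m_possibly.insert(lclNum);
    }

    void MarkDefinitely(unsigned lclNum)
    {
        // Keep "possibly" a superset of "definitely".
        m_possibly.insert(lclNum);
        m_definitely.insert(lclNum);
    }

    bool MayPointToStack(unsigned lclNum) const
    {
        return m_possibly.count(lclNum) != 0;
    }

    bool DefinitelyPointsToStack(unsigned lclNum) const
    {
        const bool result = m_definitely.count(lclNum) != 0;
        // A local in the definite set must also be in the possible set.
        assert(!result || MayPointToStack(lclNum));
        return result;
    }

private:
    std::set<unsigned> m_possibly;
    std::set<unsigned> m_definitely;
};
```

Under this model a local that is only in the possible set may point to either the stack or the heap, while a local in both sets always points to the stack (or is null).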

MarkLclVarAsPossiblyStackPointing(lclNum);

// Check if this pointer always points to the stack.
if (lclVarDsc->lvSingleDef == 1)
Member

We should really get around to writing a checker for the early meaning of lvSingleDef someday. I think the upstream phases maintain it, but ...

@erozenfeld erozenfeld force-pushed the RetypeStackPointingRefs branch from 45c257d to a377055 Compare January 11, 2019 22:22
@erozenfeld
Member Author

@AndyAyersMS I addressed your feedback and added a couple of small fixes for issues found in x86 testing. PTAL.

@erozenfeld
Member Author

@dotnet-bot test Tizen armel Cross Checked Innerloop Build and Test

@erozenfeld erozenfeld force-pushed the RetypeStackPointingRefs branch from a377055 to 01731f7 Compare January 13, 2019 03:04
@AndyAyersMS
Member

The Tizen leg is broken (and now, removed) so you might as well ignore it.

{
object o = (f1 == 0) ? (object)new SimpleClassB(f1, f2) : (object)new SimpleClassA(f1, f2);
-return (o is SimpleClassB) || !(o is SimpleClassA) ? 0 : 1;
+GC.Collect();
+return !(o is SimpleClassA) ? 0 : 1;

I bet there should be a reason why you preferred this over (o is SimpleClassA) ? 1 : 0

Member Author

It's the result of refactoring the previous version of the test. I didn't bother simplifying this.

@ygc369

ygc369 commented Mar 27, 2020

@erozenfeld
What's the progress of the stack allocation project? Is it dead? Or will it be merged into the next version of the CLR?

@erozenfeld
Member Author

We have support for object stack allocation merged but off by default. It can be turned on by setting COMPlus_JitObjectStackAllocation=1. It currently can allocate objects on the stack only in simple cases because arguments to any calls are assumed to be escaping. We need interprocedural escape analysis to make object stack allocation possible in more cases. We are not currently working on it but it's on the table in our longer term planning.
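
To illustrate why treating every call argument as escaping limits the optimization, here is a minimal sketch of a conservative, intraprocedural escape check. The `Use` model and the `EscapingLocals` function are hypothetical and far simpler than the JIT's actual analysis: a local escapes if it is passed to a call, returned, stored to the heap, or copied into a local that escapes.

```cpp
#include <cassert>
#include <set>
#include <vector>

// Kinds of uses a local can have in this toy model.
enum class UseKind { Copy, CallArg, Return, StoreToHeap };

struct Use
{
    UseKind  kind;
    unsigned src; // local being used
    unsigned dst; // destination local (meaningful for Copy only)
};

// Compute the set of escaping locals by iterating to a fixed point.
inline std::set<unsigned> EscapingLocals(const std::vector<Use>& uses)
{
    std::set<unsigned> escaping;
    bool changed = true;
    while (changed)
    {
        changed = false;
        for (const Use& u : uses)
        {
            bool escapes = false;
            switch (u.kind)
            {
                case UseKind::CallArg:     // conservative: any call argument escapes
                case UseKind::Return:
                case UseKind::StoreToHeap:
                    escapes = true;
                    break;
                case UseKind::Copy:        // escapes if the destination escapes
                    escapes = escaping.count(u.dst) != 0;
                    break;
            }
            if (escapes && escaping.insert(u.src).second)
            {
                changed = true;
            }
        }
    }
    return escaping;
}
```

In this model, an allocation held in a local that is never passed to a call (directly or through a copy) stays out of the escaping set and remains a stack allocation candidate; a single `CallArg` use is enough to rule it out.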

@erozenfeld
Member Author

dotnet/runtime#11192 is a tracking issue for object stack allocation.

@NinoFloris

@erozenfeld any background to off being the default?

Even the simplest version will definitely help F# code which creates a lot more superfluous tuples, optionals and such.

@AndyAyersMS
Member

@NinoFloris if you can point us at specific examples that would be helpful.

More broadly, we are looking for some F# performance tests -- for example, to help us better evaluate some of the tradeoffs in dotnet/runtime#341.

@erozenfeld
Member Author

@NinoFloris The escape analysis is not free in terms of jit speed and we can't justify having it on if it almost never results in actual object stack allocation. And yes, if you have great F# examples where this analysis is sufficient and stack allocation candidates are not passed as arguments, please send them our way.

@NinoFloris

NinoFloris commented Apr 17, 2020

I have linked to some cases. Generally, the F# compiler can automatically inline local functions or functions explicitly marked inline. However, the optimizer is not advanced enough to erase the allocations of Options/Tuples/RefCells once they're all in the same body; this is where object stack allocation would be helpful.

I'm also hoping to see if I can (and am OK'ed to) make some changes in that area in the compiler instead.

@erozenfeld
Member Author

Thank you @NinoFloris ! I'll take a look at these issues sometime next week.

@ygc369

ygc369 commented Apr 18, 2020

@erozenfeld
I know that doing escape analysis during JIT is very hard: JIT time must be very short, so the JIT compiler can't do much analysis, while escape analysis is complex and can't be done in a very short time. That makes the tradeoff difficult. So why not introduce escape analysis in the AOT compiler? AOT does not care about compile time, so it can do as much analysis as it needs.
I think that, until there is a good way to do escape analysis during JIT, a full escape analysis should at least be added to the .NET AOT (R2R) compiler first, and should be on by default.

@erozenfeld
Member Author

.NET AOT compilers (crossgen and crossgen2) use the jit to compile individual methods. Neither one currently does whole-program optimizations. We have some long-term plans to look into adding whole-program optimizations to crossgen2 and did some prototyping. Escape analysis is one of the optimizations we will consider. The way I envision this is crossgen2 will do compilation in bottom-up call graph order (callees before callers) and will record escape info for method parameters. That info can be used when compiling callers to determine which args can escape.
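
The bottom-up idea described above can be sketched as follows. This is only an illustration of the general technique, not crossgen2 code: `Method`, `ParamInfo`, `ArgPass`, and `SummarizeBottomUp` are hypothetical names, the input is assumed to be already sorted callees-first, and any callee without a recorded summary (including recursive calls) is conservatively treated as letting its arguments escape.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// A parameter being forwarded as an argument to some callee parameter.
struct ArgPass
{
    std::string callee;
    std::size_t calleeParam;
};

struct ParamInfo
{
    bool                 directlyEscapes; // returned, stored to the heap, etc.
    std::vector<ArgPass> passedTo;        // calls this parameter is forwarded to
};

struct Method
{
    std::string            name;
    std::vector<ParamInfo> params;
};

// Per method: one "escapes" bit per parameter.
using EscapeSummaries = std::map<std::string, std::vector<bool>>;

// Methods must be supplied callees-first (bottom-up call graph order).
inline EscapeSummaries SummarizeBottomUp(const std::vector<Method>& bottomUpOrder)
{
    EscapeSummaries summaries;
    for (const Method& m : bottomUpOrder)
    {
        std::vector<bool> escapes;
        for (const ParamInfo& p : m.params)
        {
            bool e = p.directlyEscapes;
            for (const ArgPass& a : p.passedTo)
            {
                auto it = summaries.find(a.callee);
                const bool known =
                    it != summaries.end() && a.calleeParam < it->second.size();
                // Unknown callee: assume the argument escapes.
                e = e || !known || it->second[a.calleeParam];
            }
            escapes.push_back(e);
        }
        summaries[m.name] = escapes;
    }
    return summaries;
}
```

When compiling a caller, a stack allocation candidate passed as an argument would then only be considered escaping if the callee's recorded bit for that parameter is set.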
