Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Structs: Do not mark "Hidden buffer pointer" as address-exposed #64130

Merged
merged 21 commits into from
Mar 17, 2022

Conversation

kunalspathak
Copy link
Member

@kunalspathak kunalspathak commented Jan 22, 2022

When a method returns a struct, the caller often passes the location of its stack frame as one of the argument, where callee can store the struct it would return. This argument is known as "hidden buffer" argument. The callee would store the struct it should return into this hidden buffer location. When the callee returns, the caller copies the struct value from the hidden buffer into another stack frame location which subsequently is used to access the struct. Currently, we mark such argument as "address-exposed", which prevent us from doing further optimizations on the hidden buffer. However, there is no IL / user code that would modify or pollute the hidden buffer and is strictly used by .NET ABI. Hence, it is safe to assume that the address of the hidden buffer is not exposed even if it is passed to the callee. This opens up opportunities of optimizing the hidden buffer location and eliminating the extra copy that the caller has to perform after the callee returns.

Here is an example taken from #12219
@benaadams

static long InlinedAssignment()
{
    var s = CreateLargeStruct(1);
    s = GetLargeStruct(s);
    return s.l3;
}

[MethodImpl(MethodImplOptions.AggressiveInlining)]
static LargeStruct GetLargeStruct(LargeStruct l) => l;
 G_M37993_IG01:
-       sub      rsp, 104
-       vzeroupper
-                                               ;; bbWeight=1    PerfScore 1.25
+       sub      rsp, 72
+                                               ;; bbWeight=1    PerfScore 0.25
 G_M37993_IG02:
-       lea      rcx, [rsp+48H]
+       lea      rcx, bword ptr [rsp+28H]
        mov      edx, 1
        call     _12219:CreateLargeStruct(int):LargeStruct
-       vmovdqu  ymm0, ymmword ptr[rsp+48H]
-       vmovdqu  ymmword ptr[rsp+28H], ymm0
        mov      rax, qword ptr [rsp+40H]
-                                               ;; bbWeight=1    PerfScore 7.75
+                                               ;; bbWeight=1    PerfScore 2.75
 G_M37993_IG03:
-       vzeroupper
-       add      rsp, 104
+       add      rsp, 72
        ret
-                                               ;; bbWeight=1    PerfScore 2.25
+                                               ;; bbWeight=1    PerfScore 1.25

Here is the summary of no. of methods improved/regressed in all SPMI collections.

SPMI collection Method Improved Methods Regressed
coreclr_tests Linux arm64 1058 49
libraries.crossgen2 Linux arm64 58 3
libraries.pmi Linux arm64 740 46
libraries_tests Linux arm64 943 65
benchmarks.run OSX arm64 33 3
coreclr_tests OSX arm64 1047 50
libraries.crossgen2 OSX arm64 260 17
libraries.pmi OSX arm64 737 46
libraries_tests OSX arm64 925 65
benchmarks.run Linux x64 36 3
coreclr_tests Linux x64 160 0
libraries.crossgen2 Linux x64 301 3
libraries.pmi Linux x64 886 65
libraries_tests Linux x64 952 439
benchmarks.run windows arm64 33 3
coreclr_tests windows arm64 1050 50
libraries.crossgen2 windows arm64 254 17
libraries.pmi windows arm64 738 46
libraries_tests windows arm64 945 65
aspnet.run windows x64 160 44
benchmarks.run windows x64 52 25
coreclr_tests windows x64 2691 759
libraries.crossgen2 windows x64 571 113
libraries.pmi windows x64 846 179
libraries_tests windows x64 1263 395
coreclr_tests Linux arm 234 79
libraries.crossgen2 Linux arm 993 82
libraries.pmi Linux arm 1065 216
libraries_tests Linux arm 1629 911
benchmarks.run windows x86 73 35
coreclr_tests windows x86 1692 19
libraries.crossgen2 windows x86 674 137
libraries.pmi windows x86 922 377
libraries_tests windows x86 1910 848

I studied some regressions and they are one of the two categories:

  • The stack space needed for a method is lesser with this PR and there is change amount of memory we need to initialize in prolog. Sometimes they are not 16 bytes aligned and hence we emit more instructions to initialize them.
  • I have seen lot of cases where we start doing tail calls as seen below.

image

Changes:

  • During LocalAddressVisitor, before we mark anything with address-exposed, we would check if it is "hidden buffer" and if yes, set lvHiddenBufferStructArg = true. Since this is a unique node that is defined (written by the callee) and used (passed as argument by the caller), it is marked as partial definition using gtFlags |= (GTF_VAR_DEF | GTF_VAR_USEASG).
  • During global morph, if the hidden buffer was marked as needed zero-init, we might constant propagate that value over the call that writes into the hidden buffer. For such cases, we will now check for such argument, and if one found, will remove/invalidate the assertions made for it before that point.
  • During SSA numbering, LclSsaVarDsc is written considering that GT_ASG can be the only node that introduces a definition. However, with this PR, we will also introduce a definition for GT_CALL nodes. Added RenameCall() method that handles call nodes and assigns SSA to the hidden buffer argument.
  • During value numbering, a check is added to look for hidden buffer nodes and assign them unique VN.
  • Added COMPlus_JitHiddenBufferForStruct that is enabled by default but can be turned off for easy debugging.

References:

@ghost ghost assigned kunalspathak Jan 22, 2022
@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jan 22, 2022
@ghost
Copy link

ghost commented Jan 22, 2022

Tagging subscribers to this area: @JulieLeeMSFT
See info in area-owners.md if you want to be subscribed.

Issue Details

null

Author: kunalspathak
Assignees: kunalspathak
Labels:

area-CodeGen-coreclr

Milestone: -

@kunalspathak kunalspathak changed the title Structs: No not mark "Hidden buffer pointer" as address-exposed Structs: Do not mark "Hidden buffer pointer" as address-exposed Jan 22, 2022
@AndyAyersMS
Copy link
Member

@dotnet/jit-contrib @SingleAccretion curious to get your thoughts here.

@kunalspathak you may need to write up an overview of what you're up to.

@kunalspathak
Copy link
Member Author

/azp run superpmi-replay

@azure-pipelines
Copy link

No pipelines are associated with this pull request.

@kunalspathak
Copy link
Member Author

CoreClr Pri0 test failures are related to #63325

@kunalspathak
Copy link
Member Author

/azp list

@azure-pipelines
Copy link

@kunalspathak
Copy link
Member Author

/azp run runtime-coreclr jitstress

@kunalspathak
Copy link
Member Author

/azp run runtime-coreclr libraries-jitstress

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

1 similar comment
@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@kunalspathak
Copy link
Member Author

/azp run runtime-coreclr outerloop

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@kunalspathak
Copy link
Member Author

outerloop tests failures are #60036 and #61825

@kunalspathak
Copy link
Member Author

/azp run runtime-coreclr jitstress
/azp run runtime-coreclr libraries-jitstress
/azp run runtime-coreclr outerloop

@azure-pipelines
Copy link

No pipelines are associated with this pull request.

@kunalspathak
Copy link
Member Author

/azp run runtime-coreclr jitstress

1 similar comment
@kunalspathak
Copy link
Member Author

/azp run runtime-coreclr jitstress

@kunalspathak
Copy link
Member Author

/azp run runtime-coreclr outerloop

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

2 similar comments
@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

src/coreclr/jit/copyprop.cpp Show resolved Hide resolved
src/coreclr/jit/gentree.h Outdated Show resolved Hide resolved
GenTree* GetLclRetBufArgNode() const
{
if (gtRetBufArg == nullptr)
{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we assert on both sides of this? (that the flag and the class member agree)

Suggested change
{
{
assert(!HasRetBufArg());

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Turns out that when we get RetBuf as argument in a register (as I understand), and if that RetBuf is further used in a call in current method, that argument is propagated as-is without having to put it on current frame:

               [000000] S-C-G-------              *  CALL      void   hackishModuleName.hackishMethodName
               [000001] ------------ arg0         \--*  LCL_VAR   byref  V01 RetBuf  

In such cases, we are not passing the address and hence we don't perform the optimization of this PR. For such scenarios, gtRetBufArg == nullptr while HasRetBufArg() is true. We could extract the first argument and then validate for such cases if its varDsc->lvIsRegArg == true, but not sure if it is worth it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have a simple example where this happens?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the current design gtRetBufArg is only set if the buffer's address is local.

(Simple case where it is not: *globalPtr = CallThatReturnsStruct();).

This is mainly for throughput reasons -- we'd like to be able to early out of GetLclRetBufArgNode.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

src/coreclr/jit/ssabuilder.cpp Show resolved Hide resolved
@kunalspathak
Copy link
Member Author

kunalspathak commented Mar 17, 2022

asmdiffs summary after all PR review updates:

Collection Name Methods Improved Methods Regressed
coreclr_tests Linux arm64 1051 49
libraries crossgen2 Linux arm64 217 13
libraries pmi Linux arm64 739 45
libraries_tests Linux arm64 939 52
benchmarks run OSX arm64 33 3
coreclr_tests OSX arm64 1051 49
libraries crossgen2 OSX arm64 223 14
libraries pmi OSX arm64 736 45
libraries_tests OSX arm64 557 38
benchmarks run Linux x64 34 3
coreclr_tests Linux x64 911 0
libraries crossgen2 Linux x64 10 1
libraries pmi Linux x64 884 65
libraries_tests Linux x64 946 442
benchmarks run windows arm64 33 3
coreclr_tests windows arm64 1051 49
libraries crossgen2 windows arm64 218 13
libraries pmi windows arm64 737 45
libraries_tests windows arm64 939 51
aspnet run windows x64 154 43
benchmarks run windows x64 52 28
coreclr_tests windows x64 2692 759
libraries crossgen2 windows x64 547 119
libraries pmi windows x64 847 181
libraries_tests windows x64 1261 402
coreclr_tests Linux arm 241 81
libraries crossgen2 Linux arm 671 19
libraries pmi Linux arm 1063 270
libraries_tests Linux arm 1618 961
benchmarks run windows x86 73 34
coreclr_tests windows x86 1696 22
libraries crossgen2 windows x86 691 141
libraries pmi windows x86 900 380
libraries_tests windows x86 1916 858

@kunalspathak kunalspathak merged commit c032e0d into dotnet:main Mar 17, 2022
@kunalspathak
Copy link
Member Author

Improvements in dotnet/perf-autofiling-issues#4224

radekdoulik pushed a commit to radekdoulik/runtime that referenced this pull request Mar 30, 2022
…et#64130)

* Preliminary changes

* Add the missing push of ssadef

* Fix an issue for call tree cloning

* formatting

* minor review comments incorporation

* Make GT_CALL as full-def

* Merge RenameCall into RenameDef

* Fix DefinesLocalAddr and lateuse arg

* Minor comments

* fix the recently added code to account that GT_CALL can also have defintion:

* clone the return buffer arg

* Another round of review comments

* Handle return buffer calls in optIsVarAssgCB

* Update locals defined by calls in NodeInfo

* Fix a case where arg can be putarg

* Restructure the cases in GetLclRetBufArgNode()

* move ivaMaskCall outside the condition

* Add back to call for DoNotEnregister which was missed for Release builds

* Retrieve the appropriate node in case of copyReload

* Added an assert

* remove assert and add comment
@ghost ghost locked as resolved and limited conversation to collaborators Apr 21, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants