Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't clone a large tree in impRuntimeLookupToTree #81472

Closed
wants to merge 4 commits into from

Conversation

EgorBo
Copy link
Member

@EgorBo EgorBo commented Feb 1, 2023

Don't clone the whole tree here (it ends up being CSE'd anyway) in importer, e.g.:

image

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Feb 1, 2023
@ghost ghost assigned EgorBo Feb 1, 2023
@ghost
Copy link

ghost commented Feb 1, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch, @kunalspathak
See info in area-owners.md if you want to be subscribed.

Issue Details

Don't clone the whole tree here (it ends up being CSE'd anyway) in importer, e.g.:

image

Author: EgorBo
Assignees: EgorBo
Labels:

area-CodeGen-coreclr

Milestone: -

@EgorBo
Copy link
Member Author

EgorBo commented Feb 1, 2023

no TP diffs

@EgorBo EgorBo closed this Feb 1, 2023
@EgorBo EgorBo reopened this Feb 5, 2023
@EgorBo EgorBo marked this pull request as ready for review February 6, 2023 11:07
@EgorBo
Copy link
Member Author

EgorBo commented Feb 6, 2023

Thanks to @jakobbotsch's fix for SPMI replay, I do see diffs now. I don't know when I'll land #81635 so I suggest we merge this PR. It seems to show some nice diffs and TP improvements: https://dev.azure.com/dnceng-public/public/_build/results?buildId=160683&view=ms.vss-build-web.run-extensions-tab e.g. -22kb for aspnet collection

PTAL @dotnet/jit-contrib

@jakobbotsch
Copy link
Member

Why does CSE not get these cases? Is the 'use' here in a cold block so CSE does not want to increase register pressure for the benefit of the cold block? (related: #75253)

@EgorBo
Copy link
Member Author

EgorBo commented Feb 6, 2023

Why does CSE not get these cases? Is the 'use' here in a cold block so CSE does not want to increase register pressure for the benefit of the cold block? (related: #75253)

use's block is always taken if def's block is not cold. It's:

if ((tmp = IND(X)) !=0)
    return tmp;
else
    return HELPER(); // expected to always be cold (has artificial weight of 0.2)

@EgorBo
Copy link
Member Author

EgorBo commented Feb 6, 2023

Another potential optimization here is to never expand runtime lookups in cold blocks at all - I was thinking about it but hopefully I'll do that in #81635

@jakobbotsch
Copy link
Member

Then I'd be interested to see why CSE doesn't get this case for us, it would probably avoid some of the regressions I see in the diffs, e.g.:

@@ -8,10 +8,11 @@
 ; Final local variable assignments
 ;
 ;  V00 this         [V00,T00] (  3,  3   )   byref  ->  rcx         this single-def
-;* V01 TypeCtx      [V01    ] (  0,  0   )    long  ->  zero-ref    single-def
+;  V01 TypeCtx      [V01,T01] (  3,  3   )    long  ->  rdx         single-def
 ;# V02 OutArgs      [V02    ] (  1,  1   )  lclBlk ( 0) [rsp+00H]   "OutgoingArgSpace"
-;* V03 tmp1         [V03    ] (  0,  0   )   byref  ->  zero-ref    "bubbling QMark1"
-;* V04 tmp2         [V04    ] (  0,  0   )    long  ->  zero-ref    "spilling Runtime Lookup tree"
+;  V03 tmp1         [V03,T02] (  2,  4   )   byref  ->  rcx         single-def "bubbling QMark1"
+;* V04 tmp2         [V04    ] (  0,  0   )    long  ->  zero-ref    "fgMakeTemp is creating a new local variable"
+;* V05 tmp3         [V05    ] (  0,  0   )    long  ->  zero-ref    "spilling Runtime Lookup tree"
 ;
 ; Lcl frame size = 0
 
@@ -19,14 +20,15 @@ G_M20711_IG01:        ; bbWeight=1, gcrefRegs=00000000 {}, byrefRegs=00000000 {}
 						;; size=0 bbWeight=1 PerfScore 0.00
 G_M20711_IG02:        ; bbWeight=1, gcrefRegs=00000000 {}, byrefRegs=00000002 {rcx}, byref
        ; byrRegs +[rcx]
-       mov      rax, gword ptr [rcx+10H]
+       add      rcx, 8
+       mov      rax, gword ptr [rcx+08H]
        ; gcrRegs +[rax]
-						;; size=4 bbWeight=1 PerfScore 2.00
+						;; size=8 bbWeight=1 PerfScore 2.25
 G_M20711_IG03:        ; bbWeight=1, epilog, nogc, extend
        ret      
 						;; size=1 bbWeight=1 PerfScore 1.00
 

@EgorBo
Copy link
Member Author

EgorBo commented Feb 6, 2023

I'll check. I didn't expect any size diffs at all - I was mostly interested on impact on JIT TP since I removed a tree.

Btw, we do a similar thing when we xpand CASTCLASS/ISINST - we save a type handle to a local right in the importer.

Comment on lines +1897 to +1898
GenTree* handleForResult =
opts.OptimizationEnabled() ? fgInsertCommaFormTemp(&handleForNullCheck) : gtCloneExpr(handleForNullCheck);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need a comma here? Can we create a new statement given that we are in the importer?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question, will check!

@EgorBo
Copy link
Member Author

EgorBo commented Feb 6, 2023

@jakobbotsch so a separate statement seems to produce smaller diffs and leads to overall PerfScore regression.

The improvements/regressions seem all just (un)fortunate CSE/loop hoisting decisions, e.g. here
image

it was decided not to hoist it from the loop. A lot of improvements are in cold blocks like you mentioned, e.g.

image
(presumably it decided not to do CSE for a block with weght 0.25) -- Although, that weight is a bit off, the fast path is expected to have the same weight as the nullcheck. But that's because QMARKs don't have ability to specify weights (I'll fix it in my PR)

Overall, the PR is a size/perfscore improvement + TP wins

@jakobbotsch
Copy link
Member

@jakobbotsch so a separate statement seems to produce smaller diffs and leads to overall PerfScore regression.

That's a bit surprising, seems like deficiencies in downstream phases?

it was decided not to hoist it from the loop.

Decided or it hit a limitation due to the new IR?

Overall the change makes sense to me given that the value is reused in the hot code path, so it seems fine to me. Would just like to understand why CSE does not get these on its own and with all of the extra heuristics it uses to try to make a good decision.

@EgorBo
Copy link
Member Author

EgorBo commented Feb 6, 2023

@jakobbotsch so a separate statement seems to produce smaller diffs and leads to overall PerfScore regression.

That's a bit surprising, seems like deficiencies in downstream phases?

it was decided not to hoist it from the loop.

Decided or it hit a limitation due to the new IR?

Overall the change makes sense to me given that the value is reused in the hot code path, so it seems fine to me. Would just like to understand why CSE does not get these on its own and with all of the extra heuristics it uses to try to make a good decision.

Can't find that diff, analyzed around 20 of them already, I'm seeing regressions like this:

image
Not sure what the source, probably forward sub didn't handle it well, closer to lower we have a tree like this:

image

Probably worth handling in EarlyProp (or introduce a new Forward sub late phase)

@AndyAyersMS
Copy link
Member

Hoisting currently can't handle assignments so splitting things across statements (see eg #35735) will block opportunties.

@EgorBo
Copy link
Member Author

EgorBo commented Feb 6, 2023

The change in this PR will be overwritten with #81635 at some point, I'm just wondering if this results in perf improvements (but mainly I did it for TP wins)

@EgorBo
Copy link
Member Author

EgorBo commented Feb 10, 2023

Closing to do it in #81472

@EgorBo EgorBo closed this Feb 10, 2023
@EgorBo EgorBo deleted the jit-tp-clone-nc branch February 10, 2023 10:15
@ghost ghost locked as resolved and limited conversation to collaborators Mar 12, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants