Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent hoisting nodes with order side effects #100160

Merged

Conversation

BruceForstall
Copy link
Member

If a node has an order side effect, we can't hoist it at all: we don't know what the order dependence actually is. For example, assertion prop might have determined a node can't throw an exception, and eliminated the GTF_EXCEPT flag, replacing it with GTF_ORDER_SIDEEFF. We can't hoist because we might then hoist above the expression that led assertion prop to make that decision. This can happen in JitOptRepeat, where hoisting can follow assertion prop.

If a node has an order side effect, we can't hoist it at all:
we don't know what the order dependence actually is. For example,
assertion prop might have determined a node can't throw an exception,
and eliminated the `GTF_EXCEPT` flag, replacing it with `GTF_ORDER_SIDEEFF`.
We can't hoist because we might then hoist above the expression that
led assertion prop to make that decision. This can happen in JitOptRepeat,
where hoisting can follow assertion prop.
@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Mar 22, 2024
Copy link
Contributor

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

@BruceForstall
Copy link
Member Author

There were quite a few diffs locally, with, as expected, less hoisting out of loops.

@BruceForstall
Copy link
Member Author

@AndyAyersMS PTAL
cc @dotnet/jit-contrib

BruceForstall added a commit to BruceForstall/runtime that referenced this pull request Mar 22, 2024
@BruceForstall BruceForstall mentioned this pull request Mar 22, 2024
@jakobbotsch
Copy link
Member

Can we generalize the logic that sets/checks m_beforeSideEffect instead?

@jakobbotsch
Copy link
Member

I suppose not since there really is no node interfering with it.

We had a similar problem around if-conversion that we solved by setting GTF_ORDER_SIDEEFF on both nodes with the dependencies (#78698). But I suppose that's not really possible here, since the dependency here is on some dominating compare that exists between the GTF_ORDER_SIDEEFF node and the preheader, and happens through flow...

@BruceForstall
Copy link
Member Author

A discussion of the issue is here: #94250 (comment)

Copy link
Member

@AndyAyersMS AndyAyersMS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is the right fix, but suspect we need to do some compensating fix to avoid being so pessimistic here.

I would suggest digging into a few of the diffs to see (in the default, non-opt-repeat case) how and when we are setting the GTF_ORDER_SIDEEFF and whether we need to think about deferring those transformations since they may end up blocking a more valuable optimization.

Another angle would be to try and explicitly represent the dependence so we can enable some delimited reordering, but that seems hard; I don't know how we might even attempt this.

@BruceForstall
Copy link
Member Author

Diffs

It's a modest code size improvement, and very small PerfScore regression. The diffs only affect a seemingly small percentage of method contexts: about 0.1% of contexts have diffs.

@AndyAyersMS
Copy link
Member

Diffs

It's a modest code size improvement, and very small PerfScore regression. The diffs only affect a seemingly small percentage of method contexts: about 0.1% of contexts have diffs.

Maybe I'm over-reacting, but I am curious to understand why we have any diffs.

@BruceForstall
Copy link
Member Author

Here's one example of a diff that was, presumably, introduced because of #78698. Function: LUDecomp:build_problem(double[][],int,double[]) (MethodHash=2e136e40)

When we import an array indexing expression, it has a bounds check. The bounds check and the array address nodes get marked with GTF_ORDER_SIDEEFF so we don't hoist the array access above the bounds check. When we clone the loop, we eliminate the bounds check in the fast loop. Before, we could then hoist an array index expression (especially for invariant parts of multi-dimensional jagged array expressions). Now, we don't.

We don't remove the GTF_ORDER_SIDEEFF bit on the remaining IND node when removing the bounds check. Presumably it would be "hard" to find the exact bits that should be removed, and if removed, how would we prevent hoisting the array index above the control flow (loop cloning conditions) checking the bounds conditions? (Say, if we have a 3-nested loop, clone an inner loop and then hoist the array index out of the two inner loops.)

Here's the IR after fgMorphIndexAddr (I altered gtDispFlags to always dump both GTF_DEBUG_NODE_MORPHED and GTF_ORDER_SIDEEFF):

[000345] ---XG+O-----                        |  \--*  COMMA     ref
[000337] ---X-+O-----                        |     +--*  BOUNDS_CHECK_Rng void
[000103] -----+------                        |     |  +--*  LCL_VAR   int    V05 loc2
[000336] ---X-+------                        |     |  \--*  ARR_LENGTH int
[000102] -----+------                        |     |     \--*  LCL_VAR   ref    V00 arg0
[000346] n---G+O-----                        |     \--*  IND       ref
[000344] -----+O-----                        |        \--*  ARR_ADDR  byref ref[]
[000343] -----+------                        |           \--*  ADD       byref
[000334] -----+------                        |              +--*  LCL_VAR   ref    V00 arg0
[000342] -----+------                        |              \--*  ADD       long
[000340] -----+------                        |                 +--*  LSH       long
[000338] -----+----U-                        |                 |  +--*  CAST      long <- uint
[000335] -----+------                        |                 |  |  \--*  LCL_VAR   int    V05 loc2
[000339] -----+--N---                        |                 |  \--*  CNS_INT   long   3
[000341] -----+------                        |                 \--*  CNS_INT   long   16

@AndyAyersMS
Copy link
Member

Here's one example of a diff

Thanks for digging in.

I wonder if we should stop trying to optimize the cloned loop and just let the regular optimizations kick in instead. Probably something of a TP hit but perhaps avoids prematurely doing one thing and inhibiting another better thing. During cloning, we can perhaps mark the nodes as "expected to be optimized later" and then check that we indeed follow through.

I think it's ok to take this now and sort that out later. @dotnet/jit-contrib any thoughts on this?

@BruceForstall
Copy link
Member Author

BruceForstall commented Mar 25, 2024

A little more investigation in the above example:

Several hoistings now get blocked. The one above actually gets optimized away (in the baseline) so doesn't cause a diff. One that doesn't get hoisted or optimized away in the baseline is:

N015 ( 14, 18) [000738] ---XG-O-----                        *  COMMA     ref    <l:$601, c:$600>
N004 (  8, 11) [000739] ---X--O-----                        +--*  BOUNDS_CHECK_Rng void   $53f
N001 (  1,  1) [000740] ------------                        |  +--*  LCL_VAR   int    V05 loc2         u:2 $2cb
N003 (  3,  3) [000741] ---X--------                        |  \--*  ARR_LENGTH int    $247
N002 (  1,  1) [000742] ------------                        |     \--*  LCL_VAR   ref    V00 arg0         u:1 $100
N014 (  6,  7) [000743] n---G-O-----                        \--*  IND       ref    <l:$532, c:$121>
N013 (  3,  5) [000744] ------O-----                           \--*  ARR_ADDR  byref ref[] $8f
N012 (  3,  5) [000745] --------N---                              \--*  ADD       byref  $3a0
N005 (  1,  1) [000746] ------------                                 +--*  LCL_VAR   ref    V00 arg0         u:1 $100
N011 (  4,  5) [000747] --------N---                                 \--*  ADD       long   $316
N009 (  3,  4) [000748] --------N---                                    +--*  LSH       long   $315
N007 (  2,  3) [000749] ----------U-                                    |  +--*  CAST      long <- uint $314
N006 (  1,  1) [000750] ------------                                    |  |  \--*  LCL_VAR   int    V05 loc2         u:2 $2cb
N008 (  1,  1) [000751] --------N---                                    |  \--*  CNS_INT   long   3 $340
N010 (  1,  1) [000752] ------------                                    \--*  CNS_INT   long   16 $342

This is an interesting case because ideally we could hoist this since we're not reordering, with respect to each other, the two nodes that were marked GTF_ORDER_SIDEEFF. But the GTF_ORDER_SIDEEFF bit is too crude to allow knowing that.

This case is also interesting because it doesn't matter: it's in the slow path loop -- which we do bother to optimize, but which should never be executed.

@BruceForstall
Copy link
Member Author

BruceForstall commented Mar 26, 2024

Another interesting case: System.Random+Net5CompatSeedImpl:NextSingle():float:this (MethodHash=e718840f)

N006 (  5,  5) [000026] ---X--O-N--- this in rcx                  |     \--*  COMMA     byref  $181
N002 (  2,  2) [000022] ---X--O-----                              |        +--*  NULLCHECK byte   $101
N001 (  1,  1) [000021] ------------                              |        |  \--*  LCL_VAR   ref    V00 this         u:1 $80
N005 (  3,  3) [000025] ------O-----                              |        \--*  ADD       byref  $180
N003 (  1,  1) [000023] ------------                              |           +--*  LCL_VAR   ref    V00 this         u:1 $80
N004 (  1,  1) [000024] ------------                              |           \--*  CNS_INT   long   8 Fseq[<unknown field>] $140

Here, I think the ORDER bit is set by #78698 on the null check and ADD.

Before and after, we hoist the NULLCHECK:

Hoisting a copy of [000022] $101 from BB02 into PreHeader BB01 for loop L00 (head: BB02):
N002 (  2,  2) [000022] ---X--O-----                        *  NULLCHECK byte   $101
N001 (  1,  1) [000021] ------------                        \--*  LCL_VAR   ref    V00 this         u:1 $80

This hoisted copy placed in PreHeader (BB01):
               [000030] ---X--O-----                        *  COMMA     void  
     (  2,  2) [000027] ---X--O-H---                        +--*  NULLCHECK byte   $101
     (  1,  1) [000028] ------------                        |  \--*  LCL_VAR   ref    V00 this         u:1 $80
               [000029] ------------                        \--*  NOP       void  

Before, we also hoist the ADD:

Hoisting a copy of [000025] $180 from BB02 into PreHeader BB01 for loop L00 (head: BB02):
N005 (  3,  3) [000025] ------O-----                        *  ADD       byref  $180
N003 (  1,  1) [000023] ------------                        +--*  LCL_VAR   ref    V00 this         u:1 $80
N004 (  1,  1) [000024] ------------                        \--*  CNS_INT   long   8 Fseq[<unknown field>] $140

This hoisted copy placed in PreHeader (BB01):
               [000035] ------O-----                        *  COMMA     void  
     (  3,  3) [000031] ------O-H---                        +--*  ADD       byref  $180
     (  1,  1) [000032] ------------                        |  +--*  LCL_VAR   ref    V00 this         u:1 $80
     (  1,  1) [000033] ------------                        |  \--*  CNS_INT   long   8 Fseq[<unknown field>] $140
               [000034] ------------                        \--*  NOP       void  

this seems dangerous and only "works out" because the NULLCHECK was (luckily?) hoisted first, and earlier in the IR order. Note that we're hoisting the ADD past the non-hoisted (original) NULLCHECK since we haven't actually removed the original; we've only added a new copy. This seems contrary to the idea of the ORDER bit in the first place. (Note that NULLCHECK nodes are explicitly allowed to be hoisted.)

Once again, it seems like it would be ok to do the hoisting that was originally done if we actually removed the original nodes being hoisted, which would remove the ORDER bits that should be blocking. Well, that is if you are allowed to hoist past any non-ORDER, non-EXCEPT nodes. But the original problem is hoisting past a compare/branch of the variable with the ORDER bit.

The result of this PR is the ADD remains in the loop.

@BruceForstall
Copy link
Member Author

So, what are the semantics of GTF_ORDER_SIDEEFF supposed to be?

Some code, like fgCanMoveFirstStatementIntoPred and fgForwardSubStatement don't allow a tree with GTF_ORDER_SIDEEFF to be interchanged with another with GTF_ORDER_SIDEEFF | GTF_GLOB_REF.

Some places seem to think GTF_ORDER_SIDEEFF means it's a GT_CATCH_ARG. Some think it's a volatile indirection.

TryMakeIndirsAdjacent does allow some specific reordering with GTF_ORDER_SIDEEFF nodes.

SideEffectSet::InterferesWith with strict = true indicates interference of a node with GTF_ORDER_SIDEEFF and a node with GTF_GLOB_REF, but doesn't seem to indicate interference of two GTF_ORDER_SIDEEFF nodes (a bug?).

@jakobbotsch
Copy link
Member

I'm ok with taking this fix as well.

So, what are the semantics of GTF_ORDER_SIDEEFF supposed to be?

Some code, like fgCanMoveFirstStatementIntoPred and fgForwardSubStatement don't allow a tree with GTF_ORDER_SIDEEFF to be interchanged with another with GTF_ORDER_SIDEEFF | GTF_GLOB_REF.

The correct and conservative semantics is that GTF_ORDER_SIDEEFF is not allowed to be reordered with any node that has side effects. I think both examples here implement those semantics (and are more recent in the code base than the other examples). We ought to factor these checks in a way that they could be reused...
As this issue here shows, it's also illegal to do code motion across blocks with these nodes.

Some places seem to think GTF_ORDER_SIDEEFF means it's a GT_CATCH_ARG. Some think it's a volatile indirection.

TryMakeIndirsAdjacent does allow some specific reordering with GTF_ORDER_SIDEEFF nodes.

Yeah, this one refines the meaning of GTF_ORDER_SIDEEFF to be more precise. If we want we can extract the refinement into a separate helper that returns a more precise notion of order side effect. For example, it could allow us to return something more precise for GT_ADD nodes with GTF_ORDER_SIDEEFF (these are ok to move forwards, e.g. to allow containment), and for the indir case that TryMakeIndirsAdjacent makes use of.
I'm not sure it would help for this case, though. The hoisting we are doing for the GT_ADD is very hard to prove to be legal (as you say that depends on there being some dominating null check even after the hoisting).

SideEffectSet::InterferesWith with strict = true indicates interference of a node with GTF_ORDER_SIDEEFF and a node with GTF_GLOB_REF, but doesn't seem to indicate interference of two GTF_ORDER_SIDEEFF nodes (a bug?).

Yeah, seems like a bug.

@AndyAyersMS
Copy link
Member

One other concern is that CSE may not honor GTF_MAKE_CSE, so we could see a situation where hoisting creates copies of two different trees in proper (implicit) dependence order, but CSE decides not to do the first one, so ends up effectively reordering.

We may get lucky here because we tend to hoist the full comma tree for an array bounds + access, so perhaps the CSE ends up being "all or nothing", though I am a bit surprised that the random CSE mode hasn't tripped us up here.

Perhaps one more small upvote for peeling instead of hoisting? At least peeling the head block of a loop...if not the full body.

@BruceForstall BruceForstall merged commit 05f5a50 into dotnet:main Mar 26, 2024
108 of 110 checks passed
@BruceForstall BruceForstall deleted the DoNotHoistOrderSideEffectNodes branch March 26, 2024 16:25
@github-actions github-actions bot locked and limited conversation to collaborators Apr 26, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants