Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor subrange assertion generation and propagation #55186

Merged

Conversation

SingleAccretion
Copy link
Contributor

@SingleAccretion SingleAccretion commented Jul 5, 2021

This PR adds a new abstraction for dealing with casts, IntegralRange, a tuple of two symbolic values that represent the lower and upper bounds for an integral value. The symbolic representation has been chosen so that the range can be efficiently used on 32 hosts, since the alternative (using 64 bit integers) would have been, I feel, unacceptably inefficient. Notably, the range always interprets values produces by IR nodes as signed (since we're already using the signed semantic in a number of places, and it is easier to reason about for the purposes of casts as using unsigned representation would have meant a discontinuous range).

Because assertion propagation was using integral ranges already, that's where IntegralRange is used in this PR. This enables reasoning about long checked casts on 32 bit hosts (rare scenario) and ensures the correctness of how the assertions are interpreted, since they're created for both the cast's node output (in local assertion propagation) and input (in global assertion propagation). It also allows for, in the future, shrinking the size of the AssertionDsc structure. I note that since we're allocating a pretty sizeable array of them upfront in global assertion propagation (3K+ bytes), any savings in the size here translate into large percentages in memory allocation reduction (e. g., removing 8 bytes from the AssertionDsc, I've observed -2.2% reduction, and I think we can reduce the size by at least 20 bytes), though this is somewhat misleading as the memory is only actually used for active assertions. Note that this PR does not alter the size of AssertionDsc, the added if !defined(HOST_64BIT) is for clarity only.

Naturally, code propagating the assertions for casts has also been refactored, mostly to avoid retyping a long local to int if it happened to have a range of int. Such retyping is redundant with what optNarrowTree does already, and is in fact incorrect in global assertion propagation as it did not update the value number properly (leaving a node that has TYP_INT with a VN of TYP_LONG). Also, the retyping has been restricted to the known case where it is needed (and a comment added on why it is needed).

An interesting property of the old code was how it treated TYP_BOOL. TYP_BOOL in the compiler is pretty much a TYP_UBYTE, but with a hint that it comes from CORINFO_TYPE_BOOL. Assertion generation treated assignments with a TYP_BOOL source as those restricting the destination to a [0..1] range, however, that is incorrect as per the IL semantics of booleans being treated as equivalent to uint8s. This did not have ill effects except for perhaps dropping a checked cast once in a while, but it was an internal inconsistency in the IR. This PR just makes TYP_BOOLs equivalent to TYP_UBYTEs for the purposes of assertion generation.

(Note: the text below is an explanation for the quirk)

Another interesting property of the old code was that it did not propagate the assertion for casts to TYP_BOOL (perhaps hinting at the reason for why it was "safe" to give TYP_BOOL [0..1] ranges in the first place...). The new code does, by virtue of not special casing it, and that is where the bulk of the diffs (win-x64, win-x86) come from.

Unfortunately, there are, though a small number of, regressions with this handling of TYP_BOOL. There are two sources of them.

First, there is the VN problem, it accounts for the vast majority of the regressions (missing CSEs + missed redundant branch elimination). Say we have the following situation:

int Function(short a)
{
    a = (sbyte)a;
    
    if (a is 1)
        return 1;

    return a;
}

Today, local assertion propagation removes the normalizing cast for a is 1. Then in VN we value number the local without the cast and the normalizing cast for the same local differently:

***** BB01, STMT00002(after)
N004 (  6,  6) [000009] ------------              *  JTRUE     void  
N003 (  4,  4) [000008] N------N-U--              \--*  NE        int    $102
N001 (  2,  2) [000006] ------------                 +--*  LCL_VAR   short  V00 arg0         u:2 $101
N002 (  1,  1) [000007] ------------                 \--*  CNS_INT   int    1 $44


***** BB03, STMT00003(after)
N003 (  4,  5) [000011] ------------              *  RETURN    int    $140
N002 (  3,  4) [000016] ------------              \--*  CAST      int <- short <- int $103
N001 (  2,  2) [000010] ------------                 \--*  LCL_VAR   int    V00 arg0         u:2 (last use) $101

The core issue here is that the VNs for "normalize-on-load" locals are fundamentally "wrong". In the example above, both [000008] and [000010] have the same VN, but say V00's home was on the stack. The first tree would have produced 2 bytes of the value it was assigned, sign-extended, while the second would produce a full int, with the upper 2 bytes set to whatever happened to be next to V00 on the stack (this is how it actually works - the width of the local node signifies to the codegen how wide the load should be). Because of this representation problem, fixing this appears to be not so straightforward.

One possible fix is to recognize, in VN, when a cast a redundant. That runs into regressions with "false CSEs", since we leave the cast tree intact, while in actuality it is redundant and CSEing such a tree has negative profitability (but CSE cannot know that because we leave the costs intact as well).

Another possible fix is to turn these casts into no-ops while renaming the uses (ideally we'd be doing that in early prop, but that's much more expensive TP-wise). That runs into regressions because, sometimes, removing a cast inhibits redundant branch optimization. Consider the following:

// Say p0 is a `short` parameter.
if (p0 != 0) // here, the normalizing cast cannot be removed.
{
    p1 = p0;
   
    if (p1 != 0) // here, the normalizing cast can be removed.
    {
    }
}

// In this example, the two conditions will have different VNs if we remove the cast above p1.

I have spent a considerable amount of time trying out different approaches to fixing the above issues, but they all ran into various regressions, so I tabled fixing it in this PR.

The second source of regressions is the fact that in lowering we can recognize EQ(CAST(bool/ubyte <- int), SMALL_CONST) and retype the tree to UBYTE, leading to, effectively, containment for the cast. Expanding this to consider small LCL_VARs is not so trivial as well, because it can regress when we have those locals assigned RSI, RDI, RBP or RSP, which take up more space to encode in their byte form on Amd64, not to mention we can end up with byte-able register constraints on x86.

Due to the above issues and the fact that I was not able to mitigate them, I will be quirking the handling of TYP_BOOL in this PR to get to a near zero substantial regressions state.

Fixes #54842, but properly.
Fixes #12936

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jul 5, 2021
@SingleAccretion SingleAccretion marked this pull request as draft July 5, 2021 19:53
@SingleAccretion SingleAccretion marked this pull request as ready for review July 5, 2021 20:19
@SingleAccretion SingleAccretion force-pushed the Improve-Subrange-Assertion-Generation branch 4 times, most recently from d466b18 to 8e90830 Compare July 8, 2021 21:27
@kunalspathak kunalspathak added this to the Future milestone Jul 13, 2021
@kunalspathak
Copy link
Member

@SingleAccretion - Is it possible to narrow this PR to just the fix for #54842 vs. improvement? We will take the fix for .NET6 and improvement can be merged in future milestone.

@SingleAccretion
Copy link
Contributor Author

I will see what can be done.

@SingleAccretion SingleAccretion force-pushed the Improve-Subrange-Assertion-Generation branch 2 times, most recently from 6d5746d to eb68aa8 Compare July 19, 2021 15:15
@terrajobst terrajobst added the community-contribution Indicates that the PR has been added by a community member label Jul 19, 2021
@SingleAccretion SingleAccretion force-pushed the Improve-Subrange-Assertion-Generation branch from eb68aa8 to 77a6951 Compare July 22, 2021 19:42
@SingleAccretion SingleAccretion marked this pull request as draft July 25, 2021 21:54
@SingleAccretion SingleAccretion changed the title Improve subrange assertion generation and fix a bug Refactor subrange assertion generation and propagation Jul 25, 2021
@SingleAccretion SingleAccretion force-pushed the Improve-Subrange-Assertion-Generation branch 7 times, most recently from 24a6b7c to e766cad Compare July 31, 2021 14:07
@SingleAccretion SingleAccretion force-pushed the Improve-Subrange-Assertion-Generation branch 4 times, most recently from 68fa9bd to 8f63b0a Compare August 4, 2021 18:13
@SingleAccretion
Copy link
Contributor Author

SingleAccretion commented Aug 8, 2021

With the quirk in place, this PR is ready for review.

Updated diffs: win-x64, win-x86, win-arm64, linux-arm.

Now almost all are improvements, with the TYP_BOOL handling accounting for the vast majority of them, and the fact that we do not generate useless subrange assertions for just a couple overly large methods in coreclr_tests.

The only regressions (including the rather large NormalizeProtoMember one on x86...) are register allocation + frame zeroing differences, so, while the "actual" diff may be this:

-       movzx    rdx, sil
-       mov      r8, gword ptr [rbp-18H]
+       mov      r8, gword ptr [rbp-20H]

LSRA now decides to use a callee-saved register, or the prolog zeroing code gets a bit larger while the frame is smaller, etc. This can also happen due to new copy propagations.

@SingleAccretion SingleAccretion marked this pull request as ready for review August 8, 2021 21:36
@SingleAccretion SingleAccretion force-pushed the Improve-Subrange-Assertion-Generation branch from 73234b1 to 471d02a Compare August 26, 2021 21:54
@EgorBo
Copy link
Member

EgorBo commented Sep 17, 2021

cc @dotnet/jit-contrib

Copy link
Member

@AndyAyersMS AndyAyersMS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this refactoring, but I'm confused on a few points.

{
switch (toType)
{
// CAST_OVF(uint <- uint) - [INT_MIN..INT_MAX]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain this case a bit more?

{
return nullptr;
}

// Skip over a GT_COMMA node(s), if necessary to get to the lcl.
GenTree* lcl = op1;
while (lcl->gtOper == GT_COMMA)
while (lcl->OperIs(GT_COMMA))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might as well use gtEffectiveVal here, since you're changing this bit.

@@ -1064,6 +1064,90 @@ class LclVarDsc

}; // class LclVarDsc

enum class SymbolicIntegerValue : int32_t
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like we might want ULongMax in here too, for completeness?

upperBound = SymbolicIntegerValue::IntMax;
break;

// CAST_OVF(ulong <- uint) - [INT_MIN..INT_MAX]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto here, why is this not [0..UINT_MAX] ?

@SingleAccretion
Copy link
Contributor Author

Can you explain this case a bit more?

Seems like we might want ULongMax in here too, for completeness?

Ditto here, why is this not [0..UINT_MAX] ?

A very intentional part of the design here is that the IntegralRange is always interpreted as signed, it makes it easier to reason about consistently (in the time I've spent tweaking the cast handling, I've come to believe this is the best approach).

This makes ULongMax an impossible value, since we do not have Int128 integers.

This makes a cast like CAST_OVF(ulong <- uint) - [INT_MIN..INT_MAX] have the range it has:

  1. For which values does the cast overflow?
  2. No such values exist -> the full range of an integer is the "input" range.
  3. The range is signed -> it is [INT_MIN..INT_MAX].

@AndyAyersMS
Copy link
Member

A very intentional part of the design here is that the IntegralRange is always interpreted as signed

Ok, good. It is probably worth mentioning this in comments somewhere -- maybe in the header comment for IntegralRange.

Let me test my understanding. In ForCastInput we have just one case that uses a UINT symbolic bound

// CAST_OVF(uint <- ulong/long) - [0..UINT_MAX]

So we'd say -- yes this can overflow. Valid input range is indeed [0...UINT_MAX]. Interpreting this as signed still leaves [0...UINT_MAX].

@AndyAyersMS
Copy link
Member

#58776 changed how address exposure is handled and is causing your builds to break.

@SingleAccretion
Copy link
Contributor Author

changed how address exposure is handled and is causing your builds to break.

Yes, will be rebasing...

Ok, good. It is probably worth mentioning this in comments somewhere -- maybe in the header comment for IntegralRange.

Will write something more substantial than what is in the header of ForCastInput.

Move it into its own function.

No diffs.
They are not useful: they do not tell us anything about the value.
They can be harmful: code operating on the assertions can wrongly
remove a checked cast because of an assertion for a non-checked one.
This change completely refactors the way subrange assertions are
generated and consumed. It rationalizes and generalizes the model
for subrange assertions which are generated and propagated for casts,
makes AssertionDsc::Range into an actual abstraction, eliminates bugs
related to wrong assertions generated for TYP_BOOL (without CQ loss),
enables propagation for checked casts to non-small types both on 64
and 32 bit, all the while making the AssertionDsc structure smaller.

There are quite a few moving parts, but at the heart of this change
are following items:

1. Make ranges symbolic, i. e. made up of symbolic upper and lower
   bounds, represented by the new SymbolicIntegerValue enum class.
   This is necessary to not regress memory consumption and throughput
   of operations on 32 bit while handling long subranges.
2. Split the concept of "input" and "output" for a cast node.
   To properly handle checked cast and make the code "obviously correct",
   it is necessary to split the two values, one the cast node produces
   and one it consumes, apart. The old code only handles the "easy" case
   of the "input" and "output" ranges matching (such is the case for small
   types), the new code must be, and is, robust in reasoning about, e. g.
   CAST_OVF(uint <- ulong) vs CAST_OVF(int <- long). The ranges that
   represent these values are consistently in the "signed" format, making
   handling them easy.
3. Tweak the propagation code to not remove representation-changing casts,
   make it explicit and obvious in what cases (and why) it retypes the
   local under the cast.

As a side-effect of the above changes, TYP_BOOL is now treated as TYP_UBYTE
in propagation, as it should be. Actually, the old code didn't remove
redundant casts to TYP_BOOL, the new code does, which is also where the
bulk of the positive diffs come from.

The old code also generated "useless" ranges like [INT_MIN..INT_MAX], the new
code only generates ranges that could be used in subsequent propagation.

Also, some miscellaneous cleanup, such as adding function headers, was performed.
In global assertion propagation, tentative assertions are
created for casts, with the expectation that proving them
via implication will allow for removing the cast.

The old code did this through the common path with local
assertions, however, this cannot be done the same way anymore
because the two types of assertions are fundamnetally different.

In local assertion propagation, subranges are derived from ASGs,
where the cast is on the RHS and the local - LHS, so the "output"
range for the cast must be used - local is assigned value the cast
**produces**.

In contrast, global subrange assertions say that the value the
cast **consumes** is within the correct bounds for the cast to
become a no-op or non-overflowing.

Because of this fundamnetal distinction, I chose to make global
assertions for casts in a new method instead of plumbing it
through "optCreateAssertion". Note that it would be possible to
do it via "optCreateAssertion" as well - global propagation
always passes OAK_SUBRANGE to it, while local - OAK_EQUAL, but
I decided against that option for the reason that it seems
unintuitive to me that it would work the way it would. These
tentative assertions are a very special case, so I think it is
better to treat them that way. This leads to some minor code
duplication, but I think that's ok.

No diffs for this commit, though that's expected as "proving"
the generated assertion is quite rare.
Make IntegralRange the owner of the logic for various
methods relating to it and decouple it from AssertionDsc.

IntegralRange will be useful in the future for other code
that deals with cast removal, e. g. in fgMorphCast.

Also some miscellaneous formatting fixups.
Using a static array results in smaller and more
efficient code on all compilers.
This change is for completeness, since "fgMorphCast" decomposes
casts from FP types to small types today and as such assertions
are generated for them anyway.

No diffs for this commit as expected.
To avoid regressions.
They did not handle "long" cases and as such were too conservative.
@SingleAccretion SingleAccretion force-pushed the Improve-Subrange-Assertion-Generation branch from c7d8459 to c513b9a Compare September 17, 2021 20:51
Copy link
Member

@AndyAyersMS AndyAyersMS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.

Thanks again for your high-quality contributions.

@AndyAyersMS AndyAyersMS merged commit 422924a into dotnet:main Sep 20, 2021
@SingleAccretion SingleAccretion deleted the Improve-Subrange-Assertion-Generation branch September 21, 2021 23:16
@ghost ghost locked as resolved and limited conversation to collaborators Nov 3, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI community-contribution Indicates that the PR has been added by a community member
Projects
None yet
5 participants