Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JIT: Compute BB weights with higher precision, use tolerance in if-conversion #88385

Merged
merged 3 commits into from
Jul 5, 2023

Conversation

jakobbotsch
Copy link
Member

In #88376 (comment) I noticed a spurious diff in the System.Tests.Perf_Int32.ToStringHex benchmark that looked like this:

 G_M62987_IG13:
        and      edi, esi
        mov      ebx, dword ptr [rbp+88H]
-       mov      ecx, 1
        test     ebx, ebx
-       cmovle   ebx, ecx
+       jg       SHORT G_M62987_IG15
+						;; size=12 bbWeight=1.00 PerfScore 2.50
+G_M62987_IG14:
+       mov      ebx, 1
+						;; size=5 bbWeight=0.98 PerfScore 0.24
+G_M62987_IG15:

Investigating, it turns out to be caused by a check in if-conversion for detecting loops, that is using a very sharp boundary. It turns out that the integral block weights are exactly the same (in both the base and diff), but the floating point calculation ended up with exactly 1 and very close to 1 in the base/diff cases respectively.

So two changes to address it:

  • Switch getBBWeight to multiply by BB_UNITY_WEIGHT after dividing
  • Switch if-conversion to have a 5% tolerance so that losing just a single count in the wrong place will not cause a decision change. Note that the check here is used as a cheap "is inside loop" check.

…nversion

In dotnet#88376 (comment)
I noticed a spurious diff in the System.Tests.Perf_Int32.ToStringHex
benchmark that looked like this:

```diff
 G_M62987_IG13:
        and      edi, esi
        mov      ebx, dword ptr [rbp+88H]
-       mov      ecx, 1
        test     ebx, ebx
-       cmovle   ebx, ecx
+       jg       SHORT G_M62987_IG15
+						;; size=12 bbWeight=1.00 PerfScore 2.50
+G_M62987_IG14:
+       mov      ebx, 1
+						;; size=5 bbWeight=0.98 PerfScore 0.24
+G_M62987_IG15:
```

Investigating, it turns out to be caused by a check in if-conversion for
detecting loops, that is using a very sharp boundary. It turns out that
the integral block weights are exactly the same (in both the base and
diff), but the floating point calculation ended up with exactly 1 and
very close to 1 in the base/diff cases respectively.

So two changes to address it:
* Switch getBBWeight to multiply by BB_UNITY_WEIGHT after dividing
* Switch if-conversion to have a 5% tolerance so that losing just a
  single count in the wrong place will not cause a decision change. Note
  that the check here is used as a cheap "is inside loop" check.
@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jul 4, 2023
@ghost ghost assigned jakobbotsch Jul 4, 2023
@ghost
Copy link

ghost commented Jul 4, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

In #88376 (comment) I noticed a spurious diff in the System.Tests.Perf_Int32.ToStringHex benchmark that looked like this:

 G_M62987_IG13:
        and      edi, esi
        mov      ebx, dword ptr [rbp+88H]
-       mov      ecx, 1
        test     ebx, ebx
-       cmovle   ebx, ecx
+       jg       SHORT G_M62987_IG15
+						;; size=12 bbWeight=1.00 PerfScore 2.50
+G_M62987_IG14:
+       mov      ebx, 1
+						;; size=5 bbWeight=0.98 PerfScore 0.24
+G_M62987_IG15:

Investigating, it turns out to be caused by a check in if-conversion for detecting loops, that is using a very sharp boundary. It turns out that the integral block weights are exactly the same (in both the base and diff), but the floating point calculation ended up with exactly 1 and very close to 1 in the base/diff cases respectively.

So two changes to address it:

  • Switch getBBWeight to multiply by BB_UNITY_WEIGHT after dividing
  • Switch if-conversion to have a 5% tolerance so that losing just a single count in the wrong place will not cause a decision change. Note that the check here is used as a cheap "is inside loop" check.
Author: jakobbotsch
Assignees: jakobbotsch
Labels:

area-CodeGen-coreclr

Milestone: -

src/coreclr/jit/block.cpp Outdated Show resolved Hide resolved
// Detect via the block weight as that will be high when inside a loop.
if (m_startBlock->getBBWeight(m_comp) > BB_UNITY_WEIGHT)

if (m_startBlock->getBBWeight(m_comp) > BB_UNITY_WEIGHT * 1.05)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be nice to stylize these comparisons, so maybe add a new fgProfileWeight... helper here?

In particular these sorts of checks are often parts of profitability heuristics which we might want to vary under stress and/or expose to some machine learning mechanism.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think playing around with this threshold makes much sense given its use as an "is inside loop" check. The precise check happens below when this check is false.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not really checking if the block is likely to be inside a loop, it is checking if the block is likely to be inside a frequently executed loop.

If we assume profile data is representative, then perhaps this is fine; if we think it might not be representative, then perhaps we should be checking bbNatLoopNum and if set, either bypass the conversion, or go on to compare the weight of this block to the weight of the loop entry (so either "in a loop" or "frequently executed wrt the loop"). I wonder if this would subsume the need for an optReachable check.

Do we know if the cases in #82106 did not get handled by our loop recognition? I don't recall why we didn't check this initially -- do we no longer trust the loop table? Seems like it might still be good enough.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The regression in #82106 was because we if-converted inside an unnatural loop that was not recognized by our loop recognition. The loop is this code:

do
{
// Make sure we don't go out of bounds
Debug.Assert(offset + ch1ch2Distance + Vector256<ushort>.Count <= searchSpaceLength);
Vector256<ushort> cmpCh2 = Vector256.Equals(ch2, Vector256.LoadUnsafe(ref ushortSearchSpace, (nuint)(offset + ch1ch2Distance)));
Vector256<ushort> cmpCh1 = Vector256.Equals(ch1, Vector256.LoadUnsafe(ref ushortSearchSpace, (nuint)offset));
Vector256<byte> cmpAnd = (cmpCh1 & cmpCh2).AsByte();
// Early out: cmpAnd is all zeros
if (cmpAnd != Vector256<byte>.Zero)
{
goto CANDIDATE_FOUND;
}
LOOP_FOOTER:
offset += Vector256<ushort>.Count;
if (offset == searchSpaceMinusValueTailLength)
return -1;
// Overlap with the current chunk for trailing elements
if (offset > searchSpaceMinusValueTailLengthAndVector)
offset = searchSpaceMinusValueTailLengthAndVector;
continue;
CANDIDATE_FOUND:
uint mask = cmpAnd.ExtractMostSignificantBits();
do
{
int bitPos = BitOperations.TrailingZeroCount(mask);
// div by 2 (shr) because we work with 2-byte chars
nint charPos = (nint)((uint)bitPos / 2);
if (valueLength == 2 || // we already matched two chars
SequenceEqual(
ref Unsafe.As<char, byte>(ref Unsafe.Add(ref searchSpace, offset + charPos)),
ref Unsafe.As<char, byte>(ref value), (nuint)(uint)valueLength * 2))
{
return (int)(offset + charPos);
}
// Clear two the lowest set bits
if (Bmi1.IsSupported)
mask = Bmi1.ResetLowestSetBit(Bmi1.ResetLowestSetBit(mask));
else
mask &= ~(uint)(0b11 << bitPos);
} while (mask != 0);
goto LOOP_FOOTER;
} while (true);

That's why we ended up with optReachable here.

There is an alternative to simply get rid of this block weight check.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok... let's leave this as is for now.

@BruceForstall do we have a work item for loops we ought to recognize but don't? I see #43713 so perhaps that's good enough for now.

Co-authored-by: Andy Ayers <andya@microsoft.com>
// Detect via the block weight as that will be high when inside a loop.
if (m_startBlock->getBBWeight(m_comp) > BB_UNITY_WEIGHT)

if (m_startBlock->getBBWeight(m_comp) > BB_UNITY_WEIGHT * 1.05)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok... let's leave this as is for now.

@BruceForstall do we have a work item for loops we ought to recognize but don't? I see #43713 so perhaps that's good enough for now.

@BruceForstall
Copy link
Member

@BruceForstall do we have a work item for loops we ought to recognize but don't? I see #43713 so perhaps that's good enough for now.

That's the issue I'd reference

@jakobbotsch jakobbotsch merged commit 3f19af8 into dotnet:main Jul 5, 2023
@jakobbotsch jakobbotsch deleted the if-conversion-weight-tolerance branch July 5, 2023 18:37
@ghost ghost locked as resolved and limited conversation to collaborators Aug 4, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants