JIT: Compute BB weights with higher precision, use tolerance in if-conversion #88385
Conversation
In #88376 (comment) I noticed a spurious diff in the System.Tests.Perf_Int32.ToStringHex benchmark that looked like this:

```diff
G_M62987_IG13:
       and      edi, esi
       mov      ebx, dword ptr [rbp+88H]
-      mov      ecx, 1
       test     ebx, ebx
-      cmovle   ebx, ecx
+      jg       SHORT G_M62987_IG15
+      ;; size=12 bbWeight=1.00 PerfScore 2.50
+G_M62987_IG14:
+      mov      ebx, 1
+      ;; size=5 bbWeight=0.98 PerfScore 0.24
+G_M62987_IG15:
```

Investigating, it turns out to be caused by a check in if-conversion that detects loops using a very sharp boundary. The integral block weights are exactly the same in both the base and the diff, but the floating point calculation ended up at exactly 1 in the base and very close to 1 in the diff. So two changes to address it:

* Switch getBBWeight to multiply by BB_UNITY_WEIGHT after dividing.
* Switch if-conversion to use a 5% tolerance, so that losing just a single count in the wrong place will not cause a decision change. Note that the check here is used as a cheap "is inside loop" check.
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
```diff
 // Detect via the block weight as that will be high when inside a loop.
-if (m_startBlock->getBBWeight(m_comp) > BB_UNITY_WEIGHT)
+if (m_startBlock->getBBWeight(m_comp) > BB_UNITY_WEIGHT * 1.05)
```
Would be nice to stylize these comparisons, so maybe add a new fgProfileWeight... helper here?
In particular these sorts of checks are often parts of profitability heuristics which we might want to vary under stress and/or expose to some machine learning mechanism.
I don't think playing around with this threshold makes much sense given its use as an "is inside loop" check. The precise check happens below when this check is false.
This is not really checking if the block is likely to be inside a loop, it is checking if the block is likely to be inside a frequently executed loop.
If we assume profile data is representative, then perhaps this is fine; if we think it might not be representative, then perhaps we should be checking bbNatLoopNum and, if set, either bypass the conversion, or go on to compare the weight of this block to the weight of the loop entry (so either "in a loop" or "frequently executed w.r.t. the loop"). I wonder if this would subsume the need for an optReachable check.
Do we know if the cases in #82106 did not get handled by our loop recognition? I don't recall why we didn't check this initially -- do we no longer trust the loop table? Seems like it might still be good enough.
The regression in #82106 was because we if-converted inside an unnatural loop that was not recognized by our loop recognition. The loop is this code:
runtime/src/libraries/System.Private.CoreLib/src/System/SpanHelpers.Char.cs, lines 87 to 137 at 0202b24:
```csharp
do
{
    // Make sure we don't go out of bounds
    Debug.Assert(offset + ch1ch2Distance + Vector256<ushort>.Count <= searchSpaceLength);

    Vector256<ushort> cmpCh2 = Vector256.Equals(ch2, Vector256.LoadUnsafe(ref ushortSearchSpace, (nuint)(offset + ch1ch2Distance)));
    Vector256<ushort> cmpCh1 = Vector256.Equals(ch1, Vector256.LoadUnsafe(ref ushortSearchSpace, (nuint)offset));
    Vector256<byte> cmpAnd = (cmpCh1 & cmpCh2).AsByte();

    // Early out: cmpAnd is all zeros
    if (cmpAnd != Vector256<byte>.Zero)
    {
        goto CANDIDATE_FOUND;
    }

LOOP_FOOTER:
    offset += Vector256<ushort>.Count;
    if (offset == searchSpaceMinusValueTailLength)
        return -1;

    // Overlap with the current chunk for trailing elements
    if (offset > searchSpaceMinusValueTailLengthAndVector)
        offset = searchSpaceMinusValueTailLengthAndVector;
    continue;

CANDIDATE_FOUND:
    uint mask = cmpAnd.ExtractMostSignificantBits();
    do
    {
        int bitPos = BitOperations.TrailingZeroCount(mask);
        // div by 2 (shr) because we work with 2-byte chars
        nint charPos = (nint)((uint)bitPos / 2);
        if (valueLength == 2 || // we already matched two chars
            SequenceEqual(
                ref Unsafe.As<char, byte>(ref Unsafe.Add(ref searchSpace, offset + charPos)),
                ref Unsafe.As<char, byte>(ref value), (nuint)(uint)valueLength * 2))
        {
            return (int)(offset + charPos);
        }

        // Clear two the lowest set bits
        if (Bmi1.IsSupported)
            mask = Bmi1.ResetLowestSetBit(Bmi1.ResetLowestSetBit(mask));
        else
            mask &= ~(uint)(0b11 << bitPos);
    } while (mask != 0);
    goto LOOP_FOOTER;
} while (true);
```
That's why we ended up with the optReachable check here. An alternative would be to simply get rid of this block weight check.
Ok... let's leave this as is for now.
@BruceForstall do we have a work item for loops we ought to recognize but don't? I see #43713 so perhaps that's good enough for now.
Co-authored-by: Andy Ayers <andya@microsoft.com>
> Ok... let's leave this as is for now.
>
> @BruceForstall do we have a work item for loops we ought to recognize but don't? I see #43713 so perhaps that's good enough for now.

That's the issue I'd reference.