-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Rewrite math jit helpers to managed code #98858
Conversation
They may have a separate helper type, like |
It would be better to leave these no longer used helpers alone, and delete them next time we rev min R2R version. |
@MichalPetryka |
I've cleaned up the changes now and added some comments, I've also verified that .NET 8 X86 R2R binaries seem to still be working so I think this is only waiting for review now? |
Ah I might know what this is, it seems to not like the |
We are missing this in win-x86 NativeAOT: --- a/src/coreclr/nativeaot/Runtime/MathHelpers.cpp
+++ b/src/coreclr/nativeaot/Runtime/MathHelpers.cpp
@@ -271,6 +271,14 @@ FCIMPL1_F(float, tanhf, float x)
return std::tanhf(x);
FCIMPLEND
+FCIMPL2_DD(double, fmod, double x, double y)
+ return std::fmod(x, y);
+FCIMPLEND
+
+FCIMPL2_FF(float, fmodf, float x, float y)
+ return std::fmodf(x, y);
+FCIMPLEND
+
FCIMPL3_DDD(double, fma, double x, double y, double z)
return std::fma(x, y, z);
FCIMPLEND Also, the throw helpers are not properly rooted in the scanning phase:
|
src/coreclr/tools/aot/ILCompiler.Compiler/IL/ILImporter.Scanner.cs
Outdated
Show resolved
Hide resolved
src/coreclr/tools/aot/ILCompiler.Compiler/Compiler/JitHelper.cs
Outdated
Show resolved
Hide resolved
src/coreclr/tools/aot/ILCompiler.Compiler/Compiler/JitHelper.cs
Outdated
Show resolved
Hide resolved
src/coreclr/tools/aot/ILCompiler.Compiler/Compiler/JitHelper.cs
Outdated
Show resolved
Hide resolved
src/coreclr/tools/aot/ILCompiler.Compiler/Compiler/JitHelper.cs
Outdated
Show resolved
Hide resolved
JITHELPER(CORINFO_HELP_DBL2INT_OVF, JIT_Dbl2IntOvf, CORINFO_HELP_SIG_8_STACK) | ||
DYNAMICJITHELPER(CORINFO_HELP_LNG2DBL, NULL, CORINFO_HELP_SIG_8_STACK) | ||
DYNAMICJITHELPER(CORINFO_HELP_ULNG2DBL, NULL, CORINFO_HELP_SIG_8_STACK) | ||
JITHELPER(CORINFO_HELP_DBL2INT, JIT_Dbl2Int, CORINFO_HELP_SIG_8_STACK) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What platform is this needed for?
Arm, Arm64, x86, and x64 all provide a direct instruction for double
to int32
as part of the baseline ISAs we require.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's unused but can't be removed without an R2R bump. Also, Intel PRs might need it again.
src/coreclr/tools/aot/ILCompiler.Compiler/IL/ILImporter.Scanner.cs
Outdated
Show resolved
Hide resolved
…r.cs Co-authored-by: Jan Kotas <jkotas@microsoft.com>
// | ||
|
||
[StackTraceHidden] | ||
private static ulong MultiplyChecked(ulong left, ulong right) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like a very expensive way to implement MultiplyChecked
.
For UInt128
we simply do BigMul
and check if upper
is 0: https://source.dot.net/#System.Private.CoreLib/src/libraries/System.Private.CoreLib/src/System/UInt128.cs,1379
But more practically speaking, we can determine overflow just by doing if ((128 - LeadingZeroCount(left) - LeadingZeroCount(right)) > 64)
. Since x-bit * y-bit
will produce an (x + y)-bit
product; anything else must fit into the given 64-bits.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer to change logic here in a separate PR. Keep in mind that this is for 32bit platforms so things like BigMul are slow there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer to change logic here in a separate PR
That's fine.
Keep in mind that this is for 32bit platforms so things like BigMul are slow there.
That's why I suggested the extremely cheap check of if ((128 - LeadingZeroCount(left) - LeadingZeroCount(right)) > 64)
. That, on a 32-bit platform, ends up being an almost free check to determine overflow without doing any expensive multiplication (it simply uses the lzcnt
equivalent that every platform we target, including WASM, natively supports).
Then, you only have to do a regular 64x64-bit multiplication since you know it can't overflow. You don't have to do the extra work (and many branches) to check overflow at each step in the algorithm.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That, on a 32-bit platform, ends up being an almost free check to determine overflow without doing any expensive multiplication
It looks like an approximate check to determine overflow. If it fails, you still need to take the more expensive multiplication to determine whether it is actually going to overflow - correct?
The current implementation has a variant of this idea where it checks whether the high 32-bits of both numbers are zero. If they are, it knows that there is not going to be an overflow and so it multiplies the low 32-bit portions and returns. If they are not, it does the more expensive multiplication.
I think the difference between what's there currently and your suggestion is in the speed of the fast path and the curoff for the fast path. The current implementation has lower cutoff and faster fast path. Your suggestion has higher cutoff and slower fast path.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like an approximate check to determine overflow.
Yes.
you still need to take the more expensive multiplication to determine whether it is actually going to overflow
Not quite.
In general, for an x-bit * y-bit
value (excepting where one input is 0
), you need a minimum of max(x, y)
bits to represent the product and a maximum of x + y
bits.
The minimum bit count comes about because of the special case 1
, which returns the original value. Multiplication then adds one additional bit to the output for each power of 2 (multiplying by 2 is equivalent to shifting left by 1, multiplying by 4 is equivalent to shifting left by 2, thus multiplying by 3 must be between these two values and takes less than 2 additional bits).
Thus the number of bits needed is actually ceil(log2(x) + log2(y))
bits.
Finding the input with the smaller length and computing its effective log2
are both incredibly cheap operations that all our targets have direct acceleration to support. So it should always be significantly cheaper than what we're doing here already.
The approach I detailed of giving if ((128 - LeadingZeroCount(left) - LeadingZeroCount(right)) > 64)
is an even cheaper approximation and there is some subset of valid inputs that would trigger it and for which we'd then need to compute the actual bit count.
So it basically just comes down to handling the 99% of inputs that can't possibly overflow as cheaply as possible. This is always fine because lzcnt
is a native instruction for Arm, Arm64, and WASM. It is part of our default R2R target for x86/x64 and if you happen to be on x86/x64 that is pre-lzcnt (more than 11 years old at this point), you end up getting (sizeof(T) * 8 - 1) ^ BSR
instead. For Arm32 and x86, you at worst get an extra branch and addition in here to account for the upper vs lower halves of the 64-bit input.
For the subset of cases that hit the "it could actually be under 64-bits", you then need to do log2
which just uses the already computed lzcnt
as input. So we remain with an incredibly efficient fast path that minimizes the total amount of work and doesn't need to do the extra multiplications required to check if bits 64-127
are non-zero.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The existing code (both the C/C++ version in regular CoreCLR and the managed native AOT version) does not use the BigMul helper, so it does not have this perf bug.
Putting AggressiveInlining
on BigMul
like I did in #100027 makes the regression be about 2.9% on CoreCLR on my machine, I'd argue that's good enough in exchange for sharing the implementation between runtimes. It will also likely improve with future optimizations to the JIT so I'd assume it's gonna get better than native in the future (today it does stuff like or edx, 0
and add eax, 0
).
It may be a good idea to split this PR into multiple smaller one so that each of the changes gets the right level of scrutiny.
While I'd agree with this before, I feel like splitting the PR at this point after most review was done would only complicate things here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like splitting the PR at this point after most review was done would only complicate things here.
It would allow me or other maintainers to get these changes merged with higher confidence and less work. In general, multiple smaller PRs are less total work for maintainers to get through than one big PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have peeled off first chunk into #100222
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#100222 is merged with number of adjustments. I think the next good-sized chunk to peel off into a separate PR is everything to do with the long multiplication. @MichalPetryka Are you interested in doing that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll look into splitting this up tomorrow.
/azp run runtime-extra-platforms, runtime-nativeaot outerloop |
Azure Pipelines successfully started running 1 pipeline(s). |
/azp run runtime-nativeaot-outerloop |
Azure Pipelines successfully started running 1 pipeline(s). |
The |
I expected that it would fail in some other combinations too. We have too many possible combinations in the test matrix and we are not able to test all of them. |
I have extracted the remaining good changes from this PR to #102129 |
Rewrites math helpers that has multiple implementations (C++ for CoreCLR, Windows X86 asm for CoreCLR and C#/C++ for NativeAOT) into a single, shared C# implementation.
I've put the new helpers in
Math
andMathF
but some other place likeRuntimeHelpers
could perhaps be better.On 64bit platforms this should be mostly the same, some direct helper calls might be replaced with indirect ones when the helpers didn't get to T1 yet.
Might result in slight perf regressions on X86 and ARM32 due to removal of asm paths and the managed helpers needing to do another call in some cases.
Regressions could be made up by converting casts to calls earlier and making them inlineable, that's not handled here to keep the changes simple.
This removes most of the implementation of 4 unused helpers, I think I forwarded them to temporary managed variants for legacy R2R images correctly but verification would be appreciated.