-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow inlining in some over-bugdet cases #78874
Conversation
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch Issue DetailsThe current inline is quite conservative in terms of time budget it uses for a call graph, here is how it looks like. [MethodImpl(MethodImplOptions.AggressiveInlining)]
static bool LessThan<TKey>(TKey left, TKey right)
{
if (typeof(TKey) == typeof(byte)) return (byte)(object)left < (byte)(object)right;
if (typeof(TKey) == typeof(sbyte)) return (sbyte)(object)left < (sbyte)(object)right;
if (typeof(TKey) == typeof(ushort)) return (ushort)(object)left < (ushort)(object)right;
if (typeof(TKey) == typeof(short)) return (short)(object)left < (short)(object)right;
if (typeof(TKey) == typeof(uint)) return (uint)(object)left < (uint)(object)right;
if (typeof(TKey) == typeof(int)) return (int)(object)left < (int)(object)right;
if (typeof(TKey) == typeof(ulong)) return (ulong)(object)left < (ulong)(object)right;
if (typeof(TKey) == typeof(long)) return (long)(object)left < (long)(object)right;
return false;
}
[MethodImpl(MethodImplOptions.AggressiveInlining)]
static bool LessThanWrapper<TKey>(TKey left, TKey right) =>
LessThan(left, right);
public static void Main()
{
LessThanWrapper(1, 2);
} Codegen for Main: ; Method Program:Main()
G_M000_IG01:
4883EC28 sub rsp, 40
G_M000_IG02:
B901000000 mov ecx, 1
BA02000000 mov edx, 2
FF15BCF82C00 call [Program:LessThan[int](int,int):bool]
90 nop
G_M000_IG03:
4883C428 add rsp, 40
C3 ret
; Total bytes of code: 26 As you can see, it's not inlined due to "exceeds the time budget" reason. While in reality the method doesn't have a lot of work for JIT to do - 95% of it is stripped during import. So it seems to me that we should allow overbudget for forceinline callees with foldable branches inside. jit-diff from this change:
Some examples:
|
@dotnet/jit-contrib @jkotas PTAL - do you agree with the logic behind this change? I also checked a few apps I have locally and it didn't lead to "catastrophic" inline decisions, mostly because of the "friction" part. jit-diff is negative despite the fact we inline more now 🙂 (this PR doesn't prevent inlining for something that was inlined previously) |
src/coreclr/jit/inlinepolicy.cpp
Outdated
return true; | ||
} | ||
|
||
if (m_CallsiteDepth == 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any way we can better track m_CallsiteDepth
for trivial cases?
That is, there are many scenarios where you might have a code pattern like the following:
public void PublicApi(...)
{
// ArgumentValidation
InternalApi(...);
}
void InternalApi(...)
{
if (Vector256.IsHardwareAccelerated)
{
InternalApiV256();
}
else if (Vector128.IsHardwareAccelerated)
{
InternalApiV128();
}
else
{
InternalApiScalar();
}
}
Such scenarios involve a couple small almost "stub" methods or methods that just forward to another (another example is int.LeadingZeroCount
which just forwards to BitOperations.LeadingZeroCount
).
In such scenarios, we will almost always by at m_CallsiteDepth >= 2
and not have gotten to the actual interesting "core" yet.
If we had a way to factor in such methods that are effectively just "forward" or "validate + forward" then we could more meaningfully do inlining for such cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've changed the current impl, it now tries to guess final imported IL size. so for your case InternalApi won't affect the time budget since the majority of code will be folded anyway. So e.g. InternalApiV256 will have plenty of time budget left to be inlined (especially if inliner can detect foldable branches in it).
@dotnet/jit-contrib anyone brave enough to approve? 🙂 |
Given that this fixes the known examples and you've done the general analysis, I think I would be brave enough to approve. However, I don't feel very comfortable jumping into inliner strategy, in particular evaluating how it fits with the other heuristics. Things like - should other uses of m_CodeSize use this improved size estimate? Is it reasonable to just change this one use and possibly look at others later, or does the new use of two different sizes make things more difficult to maintain/update in the future? |
So technically we have three completely different types of checks/heuristics to inline a method:
So this PR only improves the 3rd one slightly. JIT used to incorrectly estimate "time it will take to compile this callee" for functions with foldable branches - it didn't take into account that a lot of dead code will be eliminated during import (won't be materialized into GenTree objects) so it will take much less time for JIT to compile than inliner estimated initially. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@EgorBo, I just noticed that if we extract the LHS ( |
Thanks for the examples, they're not this PR related since they uncover some inefficiency in JIT for __Canon, e.g. minimal repro: https://godbolt.org/z/qTaYc87Pf but must be something we can fix! I'll file an issue |
Thanks for the explanation; it's helpful. However, should this improvement also be used in other places that use |
this happens before we analyze IL so we don't have any info about foldable branches etc. And we can't move those checks to a later stage when we have that info because of throughput: |
Quite a few improvements on win-x64: |
This PR reduces number of not inlined methods with
[AggressiveInlining]
attribute in cases where it's profitable.Closes #78648
Closes #43761
Closes #51587
Closes #41692
(I've verified that this PR fixes all codegen issues in the snippets in those issues ^)
The current inliner is quite conservative around the time budget it uses for a call graph, here is how it looks like.
The following sample demonstrates why it's not perfect:
Codegen for Main:
As you can see, it's not inlined due to "exceeds the time budget" reason despite being marked as AggressiveInlining. While in reality the method doesn't have a lot of work for JIT to do - 95% of it is stripped during import.
So it seems to me that we should allow overbudget for forceinline callees with foldable branches inside.
jit-diff (
-f -pmi
) from this change:Some examples:
Vector<T>(Vector, T)
used to be extremely slow because of that, see System.Numerics.Vector: Recognize division by a constant and inline #43761Vector<T>.IsSupported
is not a jit intrinsic - see here.Size-regressions (look to be perf improvements):
Might also help @stephentoub in #78580