Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[NativeAOT] Make casting logic closer to CoreCLR #89548

Closed
wants to merge 12 commits into from
Closed

Conversation

VSadov
Copy link
Member

@VSadov VSadov commented Jul 27, 2023

Fixes: #84464

The actual changes are not as big here as might seem. For the most part this is a refactoring of existing code to have a shape closer to CoreCLR cast helpers, so that similar patterns could be used - in a few places where that has not been done already in earlier changes.
For example cases like CheckCastAny - could start with a cache lookup, since uncached code path can be complex and thus relatively slow. (also addresses some old TODOs in this area)

@ghost
Copy link

ghost commented Jul 27, 2023

Tagging subscribers to this area: @agocke, @MichalStrehovsky, @jkotas
See info in area-owners.md if you want to be subscribed.

Issue Details

Fixes: #84464

In progress.

For the most part this is a refactoring of existing code to have a shape closer to CoreCLR. In particular to do cache lookups earlier.
Cases like CheckCastAny - we should basically start with a cache lookup, since uncached path can be relatively slow.

Author: VSadov
Assignees: -
Labels:

area-NativeAOT-coreclr

Milestone: -

@VSadov
Copy link
Member Author

VSadov commented Jul 27, 2023

/azp run runtime-extra-platforms

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@VSadov
Copy link
Member Author

VSadov commented Aug 1, 2023

I think this is ready for a review.

@VSadov VSadov requested a review from jkotas August 1, 2023 18:33
@jkotas
Copy link
Member

jkotas commented Aug 1, 2023

Could you please collect numbers for casting microbenchmarks before/after this change?

@@ -694,32 +694,29 @@ private ISymbolNode GetHelperFtnUncached(CorInfoHelpFunc ftnNum)
break;

case CorInfoHelpFunc.CORINFO_HELP_CHKCASTANY:
case CorInfoHelpFunc.CORINFO_HELP_CHKCASTARRAY:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can delete CORINFO_HELP_CHKCASTARRAY and CORINFO_HELP_ISINSTANCEOFARRAY from JIT/EE interface. They are unnecessary now.

@VSadov
Copy link
Member Author

VSadov commented Aug 1, 2023

One possible question here would be - Why not just port/share the CoreCLR casting helpers in their entirety and implement the internal calls in managed code - to have nearly the same implementation?

I think one of the factors for the design of CoreCLR managed casting helpers was to avoid complicated API with the native type system. As a result that API is basically 2 internal calls methods - IsInstanceOfAny_NoCacheLookup and ChkCastAny_NoCacheLookup.

In NativeAOT the type system APIs are easily accessible, so we do not need to minimize the use of those APIs. On the other hand there are some differences, like the way we fetch the base type for arrays, that may stand in the way of code sharing.
Thus I did not consider it is as a goal to share the code or making it maximally similar when possible, but I think it can be done.

@VSadov
Copy link
Member Author

VSadov commented Aug 1, 2023

Also in NativeAOT these casting helpers have more users with additional needs, while in CoreClr it is really just type system facade for object casting.

@VSadov
Copy link
Member Author

VSadov commented Aug 1, 2023

Could you please collect numbers for casting microbenchmarks before/after this change?

I was thinking of what we could measure here.
The most common cases of casting like regular object/interfaces casts did not change, so unlikely to see any differences. We may see differences in more complex cases - like casting to variant interfaces or arrays.

Perhaps just running the regular casting benchmarks that perf lab uses would be informative enough about what changed perf-wise.

@VSadov
Copy link
Member Author

VSadov commented Aug 1, 2023

I have run the same benchmark as in #84430 (comment)

 internal class Program
    {
        const int iters = 1000000;

        static void Main(string[] args)
        {
            for(; ; )
            {
                Time(TestLStringToIROCstring);
            }
        }
        
        static void Time(Action a)
        {
            var sw = Stopwatch.StartNew();
            for (int i = 0; i < 100; i++)
            {
                a();
            }

            sw.Stop();
            System.Console.WriteLine(sw.ElapsedMilliseconds);
        }

        static object o = new List<string>();

        static void TestLStringToIROCstring()
        {
            for (int i = 0; i < iters; i++)
            {
                if (o as IReadOnlyCollection<object> == null)
                    throw null;

                if (o as IReadOnlyCollection<string> == null)
                    throw null;

                if (o as IEnumerable<object> == null)
                    throw null;

                if (o as IEnumerable<string> == null)
                    throw null;
            }
        }
    }

=== before the change

1451
1460
1456
1458
1470
1463

=== after the change:

799
801
798
799
800
799

The reason for the difference is that original code makes a number of calls. Profiler shows:
TypeCast__IsInstanceOf
MethodTable_get_IsArray
TypeCast__IsInstanceOfVariantType - this one does the cache lookup

making calls and additional checks adds up.

In the new implementation there is only one helper call in the profile:
TypeCast__IsInstanceOfAny - this one does the cache lookup

@VSadov
Copy link
Member Author

VSadov commented Aug 1, 2023

For comparison the CoreCLR is a bit faster.

The same benchmark as above produces (smaller is better):

558
561
562
557
561

As I see in the debugger the native code that we run after this change is nearly the same between CoreCLR and NativeAOT.
There are minor differences like loading of type pointers/handles.
In JIT code they look like:

mov         rcx,7FFCA19E79E8h

In NativeAOT loading a type looks like:

lea         rcx, [rip + 0x73d65]

I do not see any other significant differences. Maybe it is just these little diffs and some indirect impact on code size or alignment that makes the difference.

@VSadov
Copy link
Member Author

VSadov commented Aug 1, 2023

Another thing to notice is that compared to #84430 (comment) and prior to this change it looks like the benchmark in the comment has regressed.

That was possibly caused by #86029 . I suspect it made common/simple cases faster, which is good, but regressed complex cases that rely on caching as more checks like MethodTable_get_IsArray could run before eventually hitting the cache. Just a guess though.

Anyways, it looks like after this PR the complex/cached case is faster than in #84430 (comment)

@VSadov
Copy link
Member Author

VSadov commented Aug 2, 2023

There are some superpmi failures. I can't tell if that simply tells that there are codegen diffs or something is crashing.

@VSadov VSadov force-pushed the casts branch 2 times, most recently from c3672cb to ac36d4e Compare August 6, 2023 21:51
@jkotas
Copy link
Member

jkotas commented Aug 6, 2023

/azp run runtime-extra-platforms

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@VSadov
Copy link
Member Author

VSadov commented Aug 8, 2023

I'd be ok with merging this now. I think all concerns have been resolved. Let me know if there is something that may be missing.

@@ -6058,8 +6061,7 @@ CorInfoHelpFunc CEEInfo::getCastingHelperStatic(TypeHandle clsHnd, bool fThrowin
*pfClassMustBeRestored = true;
}

// If it is an array, use the fast array helper
helper = CORINFO_HELP_ISINSTANCEOFARRAY;
_ASSERTE(helper == CORINFO_HELP_ISINSTANCEOFANY);
}
else
if (!clsHnd.IsTypeDesc() && !Nullable::IsNullableType(clsHnd))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The native AOT implementation does not checks for Nullable. Is this check redundant here or is the check for Nullable missing in native AOT?

Copy link
Member Author

@VSadov VSadov Aug 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The difference is that JIT tries to transform simple cases like o is int? into o is int, but can't do that when needs a type lookup. As I understand that is due to limitations of IR.

// ECMA-335 III.4.3: If typeTok is a nullable type, Nullable<T>, it is interpreted as "boxed" T
// We can convert constant-ish tokens of nullable to its underlying type.
// However, when the type is shared generic parameter like Nullable<Struct<__Canon>>, the actual type will require
// runtime lookup. It's too complex to add another level of indirection in op2, fallback to the cast helper instead.
if (isClassExact && !(info.compCompHnd->getClassAttribs(pResolvedToken->hClass) & CORINFO_FLG_SHAREDINST))

NativeAOT does not seem to have a problem with expressing such lookup, so we always cast with nullable stripped

case ReadyToRunHelperId.TypeHandleForCasting:
{
var type = (TypeDesc)targetOfLookup;
if (type.IsNullable)
targetOfLookup = type.Instantiation[0];
return NecessaryTypeSymbolIfPossible((TypeDesc)targetOfLookup);

Copy link
Member Author

@VSadov VSadov Aug 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In other words in NativeAOT the type check will always be to an unboxed underlying type, thus CLASS variant is correct, and also optimizable, since T? is isClassExact.

In CoreCLR the type check in complex cases could be to the actual Nullable<SomeStruct<string>>, thus it should use ANY variant. It also means that we may be introducing an optimization bug, since such IsInst cannot be lowered to a type/handle compare.

I will check if that is a case, or if there are some mitigating reasons why it still works correctly.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sadly that is the case. The following works incorrectly

===== Prints:

True
False


using System.Runtime.CompilerServices;

namespace ConsoleApp34
{
    struct S1<T>
    {
        public T value;
    }

    internal class Program
    {
        static object o;
        static void Main(string[] args)
        {
            Test<int>();
            Test<string>();
        }

        [MethodImpl(MethodImplOptions.AggressiveOptimization)]
        private static void Test<T>()
        {
            o = new S1<T>();

            Console.WriteLine(o is S1<T>?);
        }
    }
}

This PR indirectly enabled casting optimizations for cases like o is int?, but we need to suppress it, since in more general cases it does not work correctly.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless there are better ideas, I am thinking of constraining isinst optimization for CORINFO_HELP_ISINSTANCEOFANY only if converting to an array type. So we do not keep finding more broken cases.

That may be too conservative, but we should probably stay closer to the preexisting behavior for now and consider if more cases can work in 9.0

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am thinking that this PR as a whole is too risky for 8.0. Are there parts that we think are critical to get into .NET 8?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the actual casting helper change was low risk as that is mostly refactoring to get more cases to hit the cache earlier. Touching the JIT appears to be a lot more fragile.

There is nothing really "critical" for 8.0, as in - we are not fixing some complete showstoppers here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have created a "reduced" version of this. I think that is what we can consider for 8.0 - #90234

I have removed all the JIT changes, but kept the added codegen test.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good.

@@ -13612,7 +13612,7 @@ methodPointerInfo* Compiler::impAllocateMethodPointerInfo(const CORINFO_RESOLVED
bool Compiler::impIsClassExact(CORINFO_CLASS_HANDLE classHnd)
{
DWORD flags = info.compCompHnd->getClassAttribs(classHnd);
DWORD flagsMask = CORINFO_FLG_FINAL | CORINFO_FLG_VARIANCE | CORINFO_FLG_ARRAY;
DWORD flagsMask = CORINFO_FLG_FINAL | CORINFO_FLG_VARIANCE | CORINFO_FLG_TYPE_EQUIVALENCE | CORINFO_FLG_ARRAY;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both CORINFO_FLG_VARIANCE and CORINFO_FLG_TYPE_EQUIVALENCE are only computed to make the impIsClassExact work. Computing these flags is a waste in all other cases. It is the kind of pattern that calls for introduction of dedicated JIT/EE interface API that replaces the flags.

@VSadov
Copy link
Member Author

VSadov commented Aug 23, 2023

A reduced version of this affecting only run time behavior has been merged.
For the further improvements for the JIT API in this area a tracking bug has been added - #91016

@VSadov VSadov closed this Aug 23, 2023
@jkotas jkotas mentioned this pull request Aug 23, 2023
@ghost ghost locked as resolved and limited conversation to collaborators Sep 23, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[NativeAOT] Possible perf improvements in casting
4 participants