Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance regressions in Quaternion.Conjugate and Quaternion.Negate #41738

Closed
adamsitnik opened this issue Sep 2, 2020 · 20 comments · Fixed by #41829
Closed

Performance regressions in Quaternion.Conjugate and Quaternion.Negate #41738

adamsitnik opened this issue Sep 2, 2020 · 20 comments · Fixed by #41829
Assignees
Labels
Milestone

Comments

@adamsitnik
Copy link
Member

It looks like Conjugate and Negate methods of the Quaternion type have regressed compared to 3.1.

@tannergooding could you please take a look and triage it? I assume that this might be acceptable similarly to #39035 but I don't have enough knowledge to make any decisions here.

Repro

git clone https://github.com/dotnet/performance.git
py .\performance\scripts\benchmarks_ci.py -f netcoreapp3.1 netcoreapp5.0 --filter 'System.Numerics.Tests.Perf_Quaternion.ConjugateBenchmark' 'System.Numerics.Tests.Perf_Quaternion.Negat*'

To see all the numbers please click "details" below:

System.Numerics.Tests.Perf_Quaternion.ConjugateBenchmark

Result Base Diff Ratio Alloc Delta Modality Operating System Bit Processor Name Base V Diff V
Slower 1.30 7.91 0.16 +0 Windows 10.0.18363.959 X64 Intel Xeon CPU E5-1650 v4 3.60GHz 3.1.6 5.0.20.41714
Slower 6.43 18.76 0.34 +0 manjaro X64 Intel Core i7-4771 CPU 3.50GHz (Haswell) 3.1.6 5.0.20.41714
Slower 12.90 21.39 0.60 +0 ubuntu 16.04 Arm64 Unknown processor 3.1.6 5.0.20.41714
Slower 12.90 27.30 0.47 +0 ubuntu 16.04 Arm64 Unknown processor 3.1.7 5.0.20.41714
Slower 14.44 24.09 0.60 +0 ubuntu 16.04 Arm64 Unknown processor 3.1.6 5.0.20.41714
Slower 1.00 7.67 0.13 +0 Windows 10.0.18363.959 X86 Intel Xeon CPU E5-1650 v4 3.60GHz 3.1.6 5.0.20.41714
Faster 9.44 5.20 1.81 +0 Windows 10.0.19041.450 Arm Microsoft SQ1 3.0 GHz 3.1.6 5.0.20.41714
Slower 8.01 23.80 0.34 +0 macOS Mojave 10.14.5 X64 Intel Core i7-5557U CPU 3.10GHz (Broadwell) 3.1.6 5.0.20.41714

System.Numerics.Tests.Perf_Quaternion.NegateBenchmark

Result Base Diff Ratio Alloc Delta Modality Operating System Bit Processor Name Base V Diff V
Slower 1.54 7.44 0.21 +0 Windows 10.0.18363.959 X64 Intel Xeon CPU E5-1650 v4 3.60GHz 3.1.6 5.0.20.41714
Slower 6.49 17.96 0.36 +0 manjaro X64 Intel Core i7-4771 CPU 3.50GHz (Haswell) 3.1.6 5.0.20.41714
Slower 13.67 22.12 0.62 +0 ubuntu 16.04 Arm64 Unknown processor 3.1.6 5.0.20.41714
Slower 12.52 20.97 0.60 +0 ubuntu 16.04 Arm64 Unknown processor 3.1.7 5.0.20.41714
Slower 14.06 18.74 0.75 +0 ubuntu 16.04 Arm64 Unknown processor 3.1.6 5.0.20.41714
Slower 1.26 7.64 0.16 +0 bimodal Windows 10.0.18363.959 X86 Intel Xeon CPU E5-1650 v4 3.60GHz 3.1.6 5.0.20.41714
Faster 9.52 5.05 1.89 +0 Windows 10.0.19041.450 Arm Microsoft SQ1 3.0 GHz 3.1.6 5.0.20.41714
Slower 7.99 23.96 0.33 +0 macOS Mojave 10.14.5 X64 Intel Core i7-5557U CPU 3.10GHz (Broadwell) 3.1.6 5.0.20.41714

System.Numerics.Tests.Perf_Quaternion.NegationOperatorBenchmark

Result Base Diff Ratio Alloc Delta Modality Operating System Bit Processor Name Base V Diff V
Slower 1.56 7.47 0.21 +0 Windows 10.0.18363.959 X64 Intel Xeon CPU E5-1650 v4 3.60GHz 3.1.6 5.0.20.41714
Slower 6.48 18.24 0.36 +0 manjaro X64 Intel Core i7-4771 CPU 3.50GHz (Haswell) 3.1.6 5.0.20.41714
Slower 12.90 22.14 0.58 +0 ubuntu 16.04 Arm64 Unknown processor 3.1.6 5.0.20.41714
Slower 12.90 22.12 0.58 +0 ubuntu 16.04 Arm64 Unknown processor 3.1.7 5.0.20.41714
Slower 13.67 24.44 0.56 +0 ubuntu 16.04 Arm64 Unknown processor 3.1.6 5.0.20.41714
Slower 1.26 7.16 0.18 +0 Windows 10.0.18363.959 X86 Intel Xeon CPU E5-1650 v4 3.60GHz 3.1.6 5.0.20.41714
Faster 9.60 5.15 1.86 +0 Windows 10.0.19041.450 Arm Microsoft SQ1 3.0 GHz 3.1.6 5.0.20.41714
Slower 7.98 21.66 0.37 +0 macOS Mojave 10.14.5 X64 Intel Core i7-5557U CPU 3.10GHz (Broadwell) 3.1.6 5.0.20.41714

@DrewScoggins this regression did not get detected by the bot as it was added very recently 👍

@adamsitnik adamsitnik added this to the 5.0.0 milestone Sep 2, 2020
@ghost
Copy link

ghost commented Sep 2, 2020

Tagging subscribers to this area: @tannergooding, @pgovind
See info in area-owners.md if you want to be subscribed.

@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added the untriaged New issue has not been triaged by the area owner label Sep 2, 2020
@tannergooding
Copy link
Member

This one is also a bit strange as there were no codechanges to these Quaternion methods between 3.1 and 5.0.

I'm guessing there is some codegen change by the JIT instead.

@jeffschwMSFT jeffschwMSFT removed the untriaged New issue has not been triaged by the area owner label Sep 2, 2020
@jeffhandley
Copy link
Member

@jeffschwMSFT Is this being reviewed on your team or is there routing still needed here? Thanks!

@jeffschwMSFT jeffschwMSFT added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI and removed area-System.Numerics labels Sep 3, 2020
@jeffschwMSFT
Copy link
Member

@JulieLeeMSFT

@tannergooding
Copy link
Member

Looks like the Conjugate and Negate methods aren't being inlined in .NET 5, which is leading to the perf slowdown.

However, as indicated the managed implementation hasn't changed and is actually rather simple: https://github.com/dotnet/corefx/blob/release/3.1/src/System.Numerics.Vectors/src/System/Numerics/Quaternion.cs#L124, so this is likely to do with one of the JIT changes.
Here is the netcoreapp3.1 disassembly: System.Numerics.Tests.Perf_Quaternion-asm.md.txt
Here is the net5.0 disassembly: System.Numerics.Tests.Perf_Quaternion-asm.md.txt

@JulieLeeMSFT
Copy link
Member

CC @dotnet/jit-contrib

@AndyAyersMS
Copy link
Member

I'll take a look.

@AndyAyersMS AndyAyersMS self-assigned this Sep 3, 2020
@AndyAyersMS
Copy link
Member

Per ETL (for ConjugateBenchmark) : System.Numerics.Quaternion.Conjugate is an unprofitable inline.

So will need to look at the jit dumps.

@AndyAyersMS
Copy link
Member

In 3.1 we got a SIMD boost during inlining:

INLINER impTokenLookupContextHandle for System.Numerics.Quaternion:Conjugate(struct):struct is 0x00007FFE90CA3399.
SIMD Candidate Type System.Numerics.Quaternion
  Unknown SIMD Type
*************** In fgFindBasicBlocks() for System.Numerics.Quaternion:Conjugate(struct):struct
...
multiplier in methods of promotable struct increased to 3.
Inline candidate has SIMD type args, locals or return value.  Multiplier increased to 6.
Inline candidate callsite is boring.  Multiplier increased to 7.3.
calleeNativeSizeEstimate=697
callsiteNativeSizeEstimate=105
benefit multiplier=7.3
threshold=766
Native estimate for function size is within threshold for inlining 69.7 <= 76.6 (multiplier = 7.3)

and we don't get this in 5.0

INLINER impTokenLookupContextHandle for System.Numerics.Quaternion:Conjugate(System.Numerics.Quaternion):System.Numerics.Quaternion is 0x00007FFE9F529851.
Notify VM instruction set (AVX) must be supported.
*************** In fgFindBasicBlocks() for System.Numerics.Quaternion:Conjugate(System.Numerics.Quaternion):System.Numerics.Quaternion
...
multiplier in methods of promotable struct increased to 3.
Inline candidate callsite is boring.  Multiplier increased to 4.3.
calleeNativeSizeEstimate=697
callsiteNativeSizeEstimate=105
benefit multiplier=4.3
threshold=451
Native estimate for function size exceeds threshold for inlining 69.7 > 45.1 (multiplier = 4.3)

Inline expansion aborted, inline not profitable

So need to dig in and figure out why we're no longer seeing this boost.

@tannergooding
Copy link
Member

It might be one of the changes I made with how the older SIMD intrinsics were handled.

However, I find it a bit odd that it would have been recognized as SIMD anyways as Quaternion isn't marked Intrinsic and fails the getBaseTypeAndSizeOfSIMDType check...

@CarolEidt
Copy link
Contributor

However, I find it a bit odd that it would have been recognized as SIMD anyways

The heuristics (used to?) take into account when SIMD arguments, locals or return values are present.

@tannergooding
Copy link
Member

Ah, it might actually be this bit: https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/importer.cpp#L19301-L19303

// If this is a SIMD class (i.e. in the SIMD assembly), then we will consider that we've
// found a SIMD type, even if this may not be a type we recognize (the assumption is that
// it is likely to use a SIMD type, and therefore we want to increase the inlining multiplier).

In 3.1 we simply had isSIMDClass do the following: https://github.com/dotnet/coreclr/blob/release/3.1/src/jit/compiler.h#L7732-L7735

bool isSIMDClass(CORINFO_CLASS_HANDLE clsHnd)
{
    return info.compCompHnd->isInSIMDModule(clsHnd);
}

In .NET 5, now that they are in S.P.Corelib, we are instead doing the following: https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/compiler.h#L8057-L8066

bool isSIMDClass(CORINFO_CLASS_HANDLE clsHnd)
{
    if (isIntrinsicType(clsHnd))
    {
        const char* namespaceName = nullptr;
        (void)getClassNameFromMetadata(clsHnd, &namespaceName);
        return strcmp(namespaceName, "System.Numerics") == 0;
    }
    return false;
}

So while the new code is technically more correct in the checks, it causes various Quaternion, Plane, Matrix4x4, and Matrix3x2 to no longer get the "simd boost" they were getting previously.

@tannergooding
Copy link
Member

The simplest fix is likely to just mark Quaternion, Plane, Matrix4x4, and Matrix3x2 as [Intrinsic], which should cause them to be recognized as potential SIMD types.

In the case of something like Quaternion, it is essentially a Vector4 already and should ideally be optimized similarly internally (but is not today).

Are there any concerns with this approach or would an alternative be better?

@AndyAyersMS
Copy link
Member

So we were sort of boosting "by accident" in 3.1. Clearly this was beneficial for perf; not sure about code size, but typically we reason that anyone using these types cares about perf and so would like the jit to be a bit more aggressive.

If we want to make this deliberate then we can intrisinfy the remaining types, I think that's a reasonable fix.

I also wonder if we're checking consistently. The args case currently does

if (lvaTable[lclVarTree->AsLclVarCommon()->GetLclNum()].lvSIMDType)

while the locals and return value are checked via isSIMDorHWSIMDClass.

@tannergooding
Copy link
Member

So we were sort of boosting "by accident" in 3.1

It looks more like it was intentional than by accident, given the associated comment.

If we want to make this deliberate then we can intrisinfy the remaining types, I think that's a reasonable fix.

👍, most of these types should eventually be intrinsified anyways (although whether that happens via proper intrinsic recognition or via HWIntrinsics in managed code is probably somewhat up in the air still).

I also wonder if we're checking consistently. The args case currently does

It looks like they are different, yeah.

From what I can tell, impInlineRecordArgInfo and impInlineInitVars mark pInlineInfo->hasSIMDTypeArgLocalOrReturn. That value then impacts whether InlineObservation::CALLEE_HAS_SIMD is flagged which impacts whether the JitInlineSIMDMultipler is applied.

It looks like impInlineRecordArgInfo does an exact check around lvSIMDType, which is generally set based on varTypeIsSIMD, and which requires getBaseTypeAndSizeOFSIMDType to not return TYP_UNKNOWN while InitVars does the looser check only validating IsSIMDorHWSIMDClass

@AndyAyersMS
Copy link
Member

@tannergooding can you work up a fix? I'm OOF the next few days.

Seems like this might clear the RC2 bar.

I would just add the [Intrinsic] attribute for now.

@tannergooding
Copy link
Member

I've put up #41829 and validated that it was working:

[2020/09/03 14:22:22][INFO] // * Summary *
[2020/09/03 14:22:22][INFO]
[2020/09/03 14:22:22][INFO] BenchmarkDotNet=v0.12.1.1405-nightly, OS=Windows 10.0.19041.388 (2004/May2020Update/20H1)
[2020/09/03 14:22:22][INFO] AMD Ryzen 9 3900X, 1 CPU, 24 logical and 12 physical cores
[2020/09/03 14:22:22][INFO] .NET Core SDK=6.0.100-alpha.1.20453.5
[2020/09/03 14:22:22][INFO]   [Host]     : .NET Core 5.0.0 (CoreCLR 5.0.20.41714, CoreFX 5.0.20.41714), X64 RyuJIT
[2020/09/03 14:22:22][INFO]   Job-EMLNHA : .NET Core 5.0 (CoreCLR 42.42.42.42424, CoreFX 42.42.42.42424), X64 RyuJIT
[2020/09/03 14:22:22][INFO]
[2020/09/03 14:22:22][INFO] PowerPlanMode=00000000-0000-0000-0000-000000000000  Arguments=/p:DebugType=portable  Toolchain=CoreRun
[2020/09/03 14:22:22][INFO] IterationTime=250.0000 ms  MaxIterationCount=20  MinIterationCount=15
[2020/09/03 14:22:22][INFO] WarmupCount=1
[2020/09/03 14:22:22][INFO]
[2020/09/03 14:22:22][INFO] |             Method |     Mean |     Error |    StdDev |   Median |       Min |      Max | Code Size | Gen 0 | Gen 1 | Gen 2 | Allocated |
[2020/09/03 14:22:22][INFO] |------------------- |---------:|----------:|----------:|---------:|----------:|---------:|----------:|------:|------:|------:|----------:|
[2020/09/03 14:22:22][INFO] | ConjugateBenchmark | 1.030 ns | 0.0309 ns | 0.0289 ns | 1.021 ns | 0.9983 ns | 1.104 ns |     130 B |     - |     - |     - |         - |
[2020/09/03 14:22:22][INFO] |    NegateBenchmark | 1.215 ns | 0.0165 ns | 0.0147 ns | 1.208 ns | 1.1971 ns | 1.239 ns |     142 B |     - |     - |     - |         - |
[2020/09/03 14:22:22][INFO]

@AndyAyersMS
Copy link
Member

Thanks!

@JulieLeeMSFT
Copy link
Member

Relabeling to system.numeric since the fix is in that code.

@ghost
Copy link

ghost commented Sep 4, 2020

Tagging subscribers to this area: @tannergooding, @pgovind
See info in area-owners.md if you want to be subscribed.

@JulieLeeMSFT JulieLeeMSFT removed the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Sep 4, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Dec 7, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants