Performance regressions in Quaternion.Conjugate and Quaternion.Negate #41738

adamsitnik · 2020-09-02T14:45:37Z

It looks like Conjugate and Negate methods of the Quaternion type have regressed compared to 3.1.

@tannergooding could you please take a look and triage it? I assume that this might be acceptable similarly to #39035 but I don't have enough knowledge to make any decisions here.

Repro

git clone https://github.com/dotnet/performance.git
py .\performance\scripts\benchmarks_ci.py -f netcoreapp3.1 netcoreapp5.0 --filter 'System.Numerics.Tests.Perf_Quaternion.ConjugateBenchmark' 'System.Numerics.Tests.Perf_Quaternion.Negat*'

To see all the numbers please click "details" below:

System.Numerics.Tests.Perf_Quaternion.ConjugateBenchmark

Result	Base	Diff	Ratio	Alloc Delta	Operating System	Bit	Processor Name	Base V	Diff V
Slower	1.30	7.91	0.16	+0	Windows 10.0.18363.959	X64	Intel Xeon CPU E5-1650 v4 3.60GHz	3.1.6	5.0.20.41714
Slower	6.43	18.76	0.34	+0	manjaro	X64	Intel Core i7-4771 CPU 3.50GHz (Haswell)	3.1.6	5.0.20.41714
Slower	12.90	21.39	0.60	+0	ubuntu 16.04	Arm64	Unknown processor	3.1.6	5.0.20.41714
Slower	12.90	27.30	0.47	+0	ubuntu 16.04	Arm64	Unknown processor	3.1.7	5.0.20.41714
Slower	14.44	24.09	0.60	+0	ubuntu 16.04	Arm64	Unknown processor	3.1.6	5.0.20.41714
Slower	1.00	7.67	0.13	+0	Windows 10.0.18363.959	X86	Intel Xeon CPU E5-1650 v4 3.60GHz	3.1.6	5.0.20.41714
Faster	9.44	5.20	1.81	+0	Windows 10.0.19041.450	Arm	Microsoft SQ1 3.0 GHz	3.1.6	5.0.20.41714
Slower	8.01	23.80	0.34	+0	macOS Mojave 10.14.5	X64	Intel Core i7-5557U CPU 3.10GHz (Broadwell)	3.1.6	5.0.20.41714

System.Numerics.Tests.Perf_Quaternion.NegateBenchmark

Result	Base	Diff	Ratio	Alloc Delta	Modality	Operating System	Bit	Processor Name	Base V	Diff V
Slower	1.54	7.44	0.21	+0		Windows 10.0.18363.959	X64	Intel Xeon CPU E5-1650 v4 3.60GHz	3.1.6	5.0.20.41714
Slower	6.49	17.96	0.36	+0		manjaro	X64	Intel Core i7-4771 CPU 3.50GHz (Haswell)	3.1.6	5.0.20.41714
Slower	13.67	22.12	0.62	+0		ubuntu 16.04	Arm64	Unknown processor	3.1.6	5.0.20.41714
Slower	12.52	20.97	0.60	+0		ubuntu 16.04	Arm64	Unknown processor	3.1.7	5.0.20.41714
Slower	14.06	18.74	0.75	+0		ubuntu 16.04	Arm64	Unknown processor	3.1.6	5.0.20.41714
Slower	1.26	7.64	0.16	+0	bimodal	Windows 10.0.18363.959	X86	Intel Xeon CPU E5-1650 v4 3.60GHz	3.1.6	5.0.20.41714
Faster	9.52	5.05	1.89	+0		Windows 10.0.19041.450	Arm	Microsoft SQ1 3.0 GHz	3.1.6	5.0.20.41714
Slower	7.99	23.96	0.33	+0		macOS Mojave 10.14.5	X64	Intel Core i7-5557U CPU 3.10GHz (Broadwell)	3.1.6	5.0.20.41714

System.Numerics.Tests.Perf_Quaternion.NegationOperatorBenchmark

Result	Base	Diff	Ratio	Alloc Delta	Operating System	Bit	Processor Name	Base V	Diff V
Slower	1.56	7.47	0.21	+0	Windows 10.0.18363.959	X64	Intel Xeon CPU E5-1650 v4 3.60GHz	3.1.6	5.0.20.41714
Slower	6.48	18.24	0.36	+0	manjaro	X64	Intel Core i7-4771 CPU 3.50GHz (Haswell)	3.1.6	5.0.20.41714
Slower	12.90	22.14	0.58	+0	ubuntu 16.04	Arm64	Unknown processor	3.1.6	5.0.20.41714
Slower	12.90	22.12	0.58	+0	ubuntu 16.04	Arm64	Unknown processor	3.1.7	5.0.20.41714
Slower	13.67	24.44	0.56	+0	ubuntu 16.04	Arm64	Unknown processor	3.1.6	5.0.20.41714
Slower	1.26	7.16	0.18	+0	Windows 10.0.18363.959	X86	Intel Xeon CPU E5-1650 v4 3.60GHz	3.1.6	5.0.20.41714
Faster	9.60	5.15	1.86	+0	Windows 10.0.19041.450	Arm	Microsoft SQ1 3.0 GHz	3.1.6	5.0.20.41714
Slower	7.98	21.66	0.37	+0	macOS Mojave 10.14.5	X64	Intel Core i7-5557U CPU 3.10GHz (Broadwell)	3.1.6	5.0.20.41714

@DrewScoggins this regression did not get detected by the bot as it was added very recently 👍

The text was updated successfully, but these errors were encountered:

ghost · 2020-09-02T14:45:39Z

Tagging subscribers to this area: @tannergooding, @pgovind
See info in area-owners.md if you want to be subscribed.

tannergooding · 2020-09-02T15:09:08Z

This one is also a bit strange as there were no codechanges to these Quaternion methods between 3.1 and 5.0.

I'm guessing there is some codegen change by the JIT instead.

jeffhandley · 2020-09-03T06:18:51Z

@jeffschwMSFT Is this being reviewed on your team or is there routing still needed here? Thanks!

jeffschwMSFT · 2020-09-03T14:45:32Z

@JulieLeeMSFT

tannergooding · 2020-09-03T15:20:47Z

Looks like the Conjugate and Negate methods aren't being inlined in .NET 5, which is leading to the perf slowdown.

However, as indicated the managed implementation hasn't changed and is actually rather simple: https://github.com/dotnet/corefx/blob/release/3.1/src/System.Numerics.Vectors/src/System/Numerics/Quaternion.cs#L124, so this is likely to do with one of the JIT changes.
Here is the netcoreapp3.1 disassembly: System.Numerics.Tests.Perf_Quaternion-asm.md.txt
Here is the net5.0 disassembly: System.Numerics.Tests.Perf_Quaternion-asm.md.txt

JulieLeeMSFT · 2020-09-03T15:56:09Z

CC @dotnet/jit-contrib

AndyAyersMS · 2020-09-03T16:15:10Z

I'll take a look.

AndyAyersMS · 2020-09-03T16:34:20Z

Per ETL (for ConjugateBenchmark) : System.Numerics.Quaternion.Conjugate is an unprofitable inline.

So will need to look at the jit dumps.

AndyAyersMS · 2020-09-03T18:41:26Z

In 3.1 we got a SIMD boost during inlining:

INLINER impTokenLookupContextHandle for System.Numerics.Quaternion:Conjugate(struct):struct is 0x00007FFE90CA3399.
SIMD Candidate Type System.Numerics.Quaternion
  Unknown SIMD Type
*************** In fgFindBasicBlocks() for System.Numerics.Quaternion:Conjugate(struct):struct
...
multiplier in methods of promotable struct increased to 3.
Inline candidate has SIMD type args, locals or return value.  Multiplier increased to 6.
Inline candidate callsite is boring.  Multiplier increased to 7.3.
calleeNativeSizeEstimate=697
callsiteNativeSizeEstimate=105
benefit multiplier=7.3
threshold=766
Native estimate for function size is within threshold for inlining 69.7 <= 76.6 (multiplier = 7.3)

and we don't get this in 5.0

INLINER impTokenLookupContextHandle for System.Numerics.Quaternion:Conjugate(System.Numerics.Quaternion):System.Numerics.Quaternion is 0x00007FFE9F529851.
Notify VM instruction set (AVX) must be supported.
*************** In fgFindBasicBlocks() for System.Numerics.Quaternion:Conjugate(System.Numerics.Quaternion):System.Numerics.Quaternion
...
multiplier in methods of promotable struct increased to 3.
Inline candidate callsite is boring.  Multiplier increased to 4.3.
calleeNativeSizeEstimate=697
callsiteNativeSizeEstimate=105
benefit multiplier=4.3
threshold=451
Native estimate for function size exceeds threshold for inlining 69.7 > 45.1 (multiplier = 4.3)

Inline expansion aborted, inline not profitable

So need to dig in and figure out why we're no longer seeing this boost.

tannergooding · 2020-09-03T19:00:44Z

It might be one of the changes I made with how the older SIMD intrinsics were handled.

However, I find it a bit odd that it would have been recognized as SIMD anyways as Quaternion isn't marked Intrinsic and fails the getBaseTypeAndSizeOfSIMDType check...

CarolEidt · 2020-09-03T19:03:44Z

However, I find it a bit odd that it would have been recognized as SIMD anyways

The heuristics (used to?) take into account when SIMD arguments, locals or return values are present.

tannergooding · 2020-09-03T19:07:30Z

Ah, it might actually be this bit: https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/importer.cpp#L19301-L19303

// If this is a SIMD class (i.e. in the SIMD assembly), then we will consider that we've
// found a SIMD type, even if this may not be a type we recognize (the assumption is that
// it is likely to use a SIMD type, and therefore we want to increase the inlining multiplier).

In 3.1 we simply had isSIMDClass do the following: https://github.com/dotnet/coreclr/blob/release/3.1/src/jit/compiler.h#L7732-L7735

bool isSIMDClass(CORINFO_CLASS_HANDLE clsHnd)
{
    return info.compCompHnd->isInSIMDModule(clsHnd);
}

In .NET 5, now that they are in S.P.Corelib, we are instead doing the following: https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/compiler.h#L8057-L8066

bool isSIMDClass(CORINFO_CLASS_HANDLE clsHnd)
{
    if (isIntrinsicType(clsHnd))
    {
        const char* namespaceName = nullptr;
        (void)getClassNameFromMetadata(clsHnd, &namespaceName);
        return strcmp(namespaceName, "System.Numerics") == 0;
    }
    return false;
}

So while the new code is technically more correct in the checks, it causes various Quaternion, Plane, Matrix4x4, and Matrix3x2 to no longer get the "simd boost" they were getting previously.

tannergooding · 2020-09-03T19:14:25Z

The simplest fix is likely to just mark Quaternion, Plane, Matrix4x4, and Matrix3x2 as [Intrinsic], which should cause them to be recognized as potential SIMD types.

In the case of something like Quaternion, it is essentially a Vector4 already and should ideally be optimized similarly internally (but is not today).

Are there any concerns with this approach or would an alternative be better?

AndyAyersMS · 2020-09-03T19:52:58Z

So we were sort of boosting "by accident" in 3.1. Clearly this was beneficial for perf; not sure about code size, but typically we reason that anyone using these types cares about perf and so would like the jit to be a bit more aggressive.

If we want to make this deliberate then we can intrisinfy the remaining types, I think that's a reasonable fix.

I also wonder if we're checking consistently. The args case currently does

if (lvaTable[lclVarTree->AsLclVarCommon()->GetLclNum()].lvSIMDType)

while the locals and return value are checked via isSIMDorHWSIMDClass.

tannergooding · 2020-09-03T20:09:23Z

So we were sort of boosting "by accident" in 3.1

It looks more like it was intentional than by accident, given the associated comment.

If we want to make this deliberate then we can intrisinfy the remaining types, I think that's a reasonable fix.

👍, most of these types should eventually be intrinsified anyways (although whether that happens via proper intrinsic recognition or via HWIntrinsics in managed code is probably somewhat up in the air still).

I also wonder if we're checking consistently. The args case currently does

It looks like they are different, yeah.

From what I can tell, impInlineRecordArgInfo and impInlineInitVars mark pInlineInfo->hasSIMDTypeArgLocalOrReturn. That value then impacts whether InlineObservation::CALLEE_HAS_SIMD is flagged which impacts whether the JitInlineSIMDMultipler is applied.

It looks like impInlineRecordArgInfo does an exact check around lvSIMDType, which is generally set based on varTypeIsSIMD, and which requires getBaseTypeAndSizeOFSIMDType to not return TYP_UNKNOWN while InitVars does the looser check only validating IsSIMDorHWSIMDClass

AndyAyersMS · 2020-09-03T20:14:02Z

@tannergooding can you work up a fix? I'm OOF the next few days.

Seems like this might clear the RC2 bar.

I would just add the [Intrinsic] attribute for now.

tannergooding · 2020-09-03T21:31:23Z

I've put up #41829 and validated that it was working:

[2020/09/03 14:22:22][INFO] // * Summary *
[2020/09/03 14:22:22][INFO]
[2020/09/03 14:22:22][INFO] BenchmarkDotNet=v0.12.1.1405-nightly, OS=Windows 10.0.19041.388 (2004/May2020Update/20H1)
[2020/09/03 14:22:22][INFO] AMD Ryzen 9 3900X, 1 CPU, 24 logical and 12 physical cores
[2020/09/03 14:22:22][INFO] .NET Core SDK=6.0.100-alpha.1.20453.5
[2020/09/03 14:22:22][INFO]   [Host]     : .NET Core 5.0.0 (CoreCLR 5.0.20.41714, CoreFX 5.0.20.41714), X64 RyuJIT
[2020/09/03 14:22:22][INFO]   Job-EMLNHA : .NET Core 5.0 (CoreCLR 42.42.42.42424, CoreFX 42.42.42.42424), X64 RyuJIT
[2020/09/03 14:22:22][INFO]
[2020/09/03 14:22:22][INFO] PowerPlanMode=00000000-0000-0000-0000-000000000000  Arguments=/p:DebugType=portable  Toolchain=CoreRun
[2020/09/03 14:22:22][INFO] IterationTime=250.0000 ms  MaxIterationCount=20  MinIterationCount=15
[2020/09/03 14:22:22][INFO] WarmupCount=1
[2020/09/03 14:22:22][INFO]
[2020/09/03 14:22:22][INFO] |             Method |     Mean |     Error |    StdDev |   Median |       Min |      Max | Code Size | Gen 0 | Gen 1 | Gen 2 | Allocated |
[2020/09/03 14:22:22][INFO] |------------------- |---------:|----------:|----------:|---------:|----------:|---------:|----------:|------:|------:|------:|----------:|
[2020/09/03 14:22:22][INFO] | ConjugateBenchmark | 1.030 ns | 0.0309 ns | 0.0289 ns | 1.021 ns | 0.9983 ns | 1.104 ns |     130 B |     - |     - |     - |         - |
[2020/09/03 14:22:22][INFO] |    NegateBenchmark | 1.215 ns | 0.0165 ns | 0.0147 ns | 1.208 ns | 1.1971 ns | 1.239 ns |     142 B |     - |     - |     - |         - |
[2020/09/03 14:22:22][INFO]

AndyAyersMS · 2020-09-03T22:40:24Z

Thanks!

JulieLeeMSFT · 2020-09-04T18:22:54Z

Relabeling to system.numeric since the fix is in that code.

ghost · 2020-09-04T18:23:18Z

Tagging subscribers to this area: @tannergooding, @pgovind
See info in area-owners.md if you want to be subscribed.

adamsitnik added area-System.Numerics tenet-performance Performance related issue labels Sep 2, 2020

adamsitnik added this to the 5.0.0 milestone Sep 2, 2020

Dotnet-GitSync-Bot added the untriaged New issue has not been triaged by the area owner label Sep 2, 2020

jeffschwMSFT removed the untriaged New issue has not been triaged by the area owner label Sep 2, 2020

jeffschwMSFT added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI and removed area-System.Numerics labels Sep 3, 2020

AndyAyersMS self-assigned this Sep 3, 2020

tannergooding mentioned this issue Sep 3, 2020

Marking Matrix3x2, Matrix4x4, Plane, and Quaternion as Intrinsic #41829

Merged

adamsitnik mentioned this issue Sep 4, 2020

.NET 5.0 Microbenchmarks Performance Study Report #41871

Closed

21 tasks

JulieLeeMSFT added the area-System.Numerics label Sep 4, 2020

JulieLeeMSFT removed the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Sep 4, 2020

tannergooding closed this as completed in #41829 Sep 4, 2020

github-actions bot mentioned this issue Sep 4, 2020

[release/5.0-rc2] Marking Matrix3x2, Matrix4x4, Plane, and Quaternion as Intrinsic #41885

Merged

JulieLeeMSFT assigned tannergooding and unassigned AndyAyersMS Sep 4, 2020

ghost locked as resolved and limited conversation to collaborators Dec 7, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Performance regressions in Quaternion.Conjugate and Quaternion.Negate #41738

Performance regressions in Quaternion.Conjugate and Quaternion.Negate #41738

adamsitnik commented Sep 2, 2020

System.Numerics.Tests.Perf_Quaternion.ConjugateBenchmark

System.Numerics.Tests.Perf_Quaternion.NegateBenchmark

System.Numerics.Tests.Perf_Quaternion.NegationOperatorBenchmark

ghost commented Sep 2, 2020

tannergooding commented Sep 2, 2020

jeffhandley commented Sep 3, 2020

jeffschwMSFT commented Sep 3, 2020

tannergooding commented Sep 3, 2020

JulieLeeMSFT commented Sep 3, 2020

AndyAyersMS commented Sep 3, 2020

AndyAyersMS commented Sep 3, 2020

AndyAyersMS commented Sep 3, 2020

tannergooding commented Sep 3, 2020

CarolEidt commented Sep 3, 2020

tannergooding commented Sep 3, 2020

tannergooding commented Sep 3, 2020

AndyAyersMS commented Sep 3, 2020

tannergooding commented Sep 3, 2020

AndyAyersMS commented Sep 3, 2020

tannergooding commented Sep 3, 2020

AndyAyersMS commented Sep 3, 2020

JulieLeeMSFT commented Sep 4, 2020

ghost commented Sep 4, 2020

Performance regressions in Quaternion.Conjugate and Quaternion.Negate #41738

Performance regressions in Quaternion.Conjugate and Quaternion.Negate #41738

Comments

adamsitnik commented Sep 2, 2020

Repro

System.Numerics.Tests.Perf_Quaternion.ConjugateBenchmark

System.Numerics.Tests.Perf_Quaternion.NegateBenchmark

System.Numerics.Tests.Perf_Quaternion.NegationOperatorBenchmark

ghost commented Sep 2, 2020

tannergooding commented Sep 2, 2020

jeffhandley commented Sep 3, 2020

jeffschwMSFT commented Sep 3, 2020

tannergooding commented Sep 3, 2020

JulieLeeMSFT commented Sep 3, 2020

AndyAyersMS commented Sep 3, 2020

AndyAyersMS commented Sep 3, 2020

AndyAyersMS commented Sep 3, 2020

tannergooding commented Sep 3, 2020

CarolEidt commented Sep 3, 2020

tannergooding commented Sep 3, 2020

tannergooding commented Sep 3, 2020

AndyAyersMS commented Sep 3, 2020

tannergooding commented Sep 3, 2020

AndyAyersMS commented Sep 3, 2020

tannergooding commented Sep 3, 2020

AndyAyersMS commented Sep 3, 2020

JulieLeeMSFT commented Sep 4, 2020

ghost commented Sep 4, 2020