Add AggressiveInlining to Double.CompareTo() and Single.CompareTo() #56501
Conversation
As per discussion at dotnet#56493
I couldn't figure out the best area label to add to this PR. If you have write-permissions please help me learn by adding exactly one area label.
Tagging subscribers to this area: @dotnet/area-system-runtime
Issue details: As per discussion in #56493
I agree with Tanner that special-casing this in the JIT to emit specialized instructions would result in the best codegen. I don't know how difficult that would be in practice, as you don't want to end up in a situation where we're really effective at setting eax to the appropriate value, but where we still incur the series of jumps after the fact. Ideally you'd want the JIT to be able to generate jumps directly to the applicable targets.
Sounds like the general consensus is this should be merged as-is but there should be an issue for .NET 7 tracking making CompareTo an intrinsic?
Do you want me to make the changes you suggested above with splitting `CompareTo`?
I'll defer to Tanner/Levi's judgement on that.
If the method is split, as you suggested, I believe performance would remain about the same for datasets with large amounts (>99%, although I'm guessing/estimating here) of non-NaN values.
I agree that even when you have `NaN`s the difference is likely small. Likewise, I think that overall the best thing is to get the JIT to treat this intrinsically. Today we have:

```cs
if (m_value < value) return -1;
if (m_value > value) return 1;
if (m_value == value) return 0;

// At least one of the values is NaN.
if (double.IsNaN(m_value))
    return double.IsNaN(value) ? 0 : -1;
else
    return 1;
```

which generates assembly that is effectively:

```asm
L0000: vzeroupper
L0003: vmovsd xmm0, [rcx]
L0007: vucomisd xmm1, xmm0
L000b: ja short L0029
L000d: vucomisd xmm0, xmm1
L0011: ja short L0032
L0013: vucomisd xmm0, xmm1
L0017: jp short L001b
L0019: je short L002f
L001b: vucomisd xmm0, xmm0
L001f: jp short L0023
L0021: je short L0032
L0023: vucomisd xmm1, xmm1
L0027: jp short L002f
L0029: mov eax, 0xffffffff
L002e: ret
L002f: xor eax, eax
L0031: ret
L0032: mov eax, 1
L0037: ret
```

While in practice, we need no more than three comparisons:

```asm
entry:
  xor eax, eax          ; Clear result to zero
  vmovsd xmm0, [rcx]    ; Load "this" into xmm0
  vucomisd xmm0, xmm1   ; Compare "this" to "value"
  jp nan                ; Parity flag is set, one input is NaN (unpredicted)
  jc less_than          ; Carry flag is set, "this" is less than "value"
  jnz greater_than      ; Zero flag is not set, "this" is greater than "value"
  ret                   ; Return value is already zero (inputs are equal)
greater_than:
  inc eax               ; Return value should be one
  ret
less_than:
  dec eax               ; Return value should be negative one
  ret
nan:
  vucomisd xmm0, xmm0   ; Compare "this" to "this"
  jnp greater_than      ; Parity flag is not set, "this" is not NaN; "value" is a NaN
  vucomisd xmm1, xmm1   ; Compare "value" to "value"
  jnp less_than         ; Parity flag is not set, "this" is NaN; "value" is not a NaN
  ret                   ; Return value is already zero (inputs are both NaN)
```

This would result in approx. 36 bytes of assembly, down from 56 bytes. It has significantly fewer comparisons, which saves us a number of cycles (~3-7 cycles per comparison). Intrinsically recognizing `CompareTo` would allow the JIT to generate something close to this.

Likewise, even without intrinsifying, we should get the JIT to understand cases like:

```asm
vucomisd xmm0, xmm0
jp short label1
je short label2
label1:
vucomisd xmm1, xmm1
; more code
label2:
; more code
```

and have it generate:

```asm
vucomisd xmm0, xmm0
je short label2
vucomisd xmm1, xmm1
; more code
label2:
; more code
```
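For reference, the ordering that both the C# source and the proposed assembly implement can be sketched in Python; this is a hypothetical mirror of `Double.CompareTo`'s total ordering for illustration, not the runtime's actual code:

```python
import math

def compare_to(m_value: float, value: float) -> int:
    """Mirrors double.CompareTo: NaN sorts before every number,
    and two NaNs compare as equal."""
    if m_value < value:
        return -1
    if m_value > value:
        return 1
    if m_value == value:
        return 0
    # At least one of the values is NaN.
    if math.isnan(m_value):
        return 0 if math.isnan(value) else -1
    return 1

nan = float("nan")
print(compare_to(1.0, 2.0))  # -1
print(compare_to(nan, nan))  # 0
print(compare_to(nan, 1.0))  # -1
print(compare_to(1.0, nan))  # 1
```

Note how this differs from the IEEE-754 `<`/`>` operators, for which every comparison against NaN is false; `CompareTo` must still produce a total order.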
@rickbrew it sounds like the next action here is to do some quick perf tests including lots of NaNs. @tannergooding do we not have any perf tests for comparing numbers? Maybe I'm missing them.
@rickbrew do you expect to be able to gather this data in time to possibly get this into 6.0?
@danmoseley Cutoff for .NET 6 is 8/17? No, that won't be possible for me. I also wasn't sure if that was up to me or @tannergooding, since #56493 was assigned to him. Chalk it up to being a first-timer in the dotnet/runtime repo. I can still take a look in a bit; I've been very busy finishing, preparing, and stabilizing Paint.NET 4.3 (finally ported to .NET 5).
Congratulations on the port! Do you notice any differences compared to running it on .NET Framework?
Oh yes, lots. It's faster, maybe by about 15% across the board before doing any further optimizations that weren't possible with Framework. Except for startup performance, which has about a 30% penalty due to using a lot of C++/CLI code, which can't be crossgen'd. I interop heavily with the likes of D2D, WIC, et al. That will slowly evaporate/improve as I port all of that over to C# (@tannergooding's TerraFX will be put to good use).

SCD is fantastic for simplifying and alleviating a lot of various install issues. Installation and updating is much faster because I crossgen on my build machine instead of NGEN on the user's system. I've also been able to sink a lot of time into SIMD optimizations, both to add newly optimized code paths and to port many others from native C code that I was p/invoking. ARM64 support is great; really glad that landed in 5.0.9 for WinForms and WPF. Being able to load plugins into isolated `AssemblyLoadContext`s is great too.

Plugin compatibility remains a struggle, although most work fine as-is w/o recompilation (including super old .NET 2.0 or even 1.1 DLLs). Some framework DLLs have been removed.

Download size is also a struggle. SCD grows my installer from ~12MB to >50MB, so hopefully my hosting provider doesn't get stressed out. I'm also actively pursuing ways to trim the size through various creative methods (e.g. #56699).

Having access to all the new stuff in the runtime and framework, more libraries on nuget, and the massively improved JIT/codegen is really great. It finally feels like I'm working with the runtime now, instead of working around it.
That's a great result so far, thank you for sharing. cc @richlander, although I'm guessing he's aware of your progress.
@tannergooding I will be out next week. Do you think it would be reasonable to take this change Monday with the evidence we have? It seems we have fairly good confidence it will on balance be a benefit. Your call.
@danmoseley I haven't tried trimming just yet, partly because I couldn't get ILLink's command-line parameters to work. But also, I suspect PDN's trimmability is low because of the plugin situation. They need access to both PDN and runtime/framework DLLs. I may look into it more later, after 4.3's release, maybe after .NET 6's release. I do see value in being able to trim individual libraries.

Also, the biggest bang-for-buck action I can see right now is trimming out R2R data from large framework DLLs that are not used, or not used enough, at least not on the startup path (e.g. PresentationFramework.dll is 17MB, but only 6MB w/o R2R data; see also #56699).
I decided to start running benchmarks and gathering the data anyway, as I could use a break from the PDN stuff.
I've created some benchmark code in this repository: https://github.com/rickbrew/CompareInliningBenchmarks (built with VS 2022 17.0.0 Preview 3.0).

The benchmark consists of creating an array with several million elements, sized so as to fit just within the 32MB L3 cache available to each chiplet on the CPU. The array is filled with values 0 through N-1 (ints cast to float/double). The implementations of `CompareTo` match the fully-inlined and split-inlined variants discussed above. Here are the results for doubles:
And for floats with 8M elements in the array. Same amount of memory and bandwidth, but higher compute cost:
I'm not sure that having 50% NaNs (or even 10%!) is a realistic scenario, but it does illustrate some divergence in results between fully- and split-inlined. I didn't go above 50% NaNs because by the time you reach 100% NaNs the data is fully sorted, at which point what's the point of sorting; it didn't seem like a realistic scenario either.
So this illustrates that for sorting an array of elements, inlining `CompareTo` in either form is a clear win.
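The dataset shape described above (values 0..N-1 with a given share replaced by NaNs, then sorted with the comparator) can be sketched like this. The sizes and helper names here are illustrative only, not taken from the linked benchmark repo:

```python
import functools
import math
import random

def make_dataset(n: int, nan_fraction: float, seed: int = 42) -> list[float]:
    """Array of 0..n-1 as floats, with a fraction replaced by NaN,
    shuffled so the NaNs are spread throughout the data."""
    rng = random.Random(seed)
    data = [float(i) for i in range(n)]
    for i in range(int(n * nan_fraction)):
        data[i] = math.nan
    rng.shuffle(data)
    return data

def compare_to(a: float, b: float) -> int:
    # Same total order as double.CompareTo: NaN < numbers, NaN == NaN.
    if a < b:
        return -1
    if a > b:
        return 1
    if a == b:
        return 0
    if math.isnan(a):
        return 0 if math.isnan(b) else -1
    return 1

data = make_dataset(1000, 0.1)
data.sort(key=functools.cmp_to_key(compare_to))
# With this ordering, all NaNs sort to the front of the array.
assert all(math.isnan(x) for x in data[:100])
```

This also shows why a 100% NaN dataset is already "sorted" under this comparator: every pairwise comparison returns 0.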
Given the above, and that split inlining is still a lot faster than no inlining, I think it's reasonable as a stopgap. @danmoseley, do you have any concerns with us merging this for RC1? If the official benchmarks show any regression not covered here or on older hardware, we'd likely need to pull it back out.
|
private int CompareToWithAtLeastOneNaN(double value)
Does the manual split inlining actually make the code significantly smaller in real world situations vs. just aggressive inlining the whole thing? The call introduced by the split inlining is going to have secondary effect like extra register spilling, so it is not obvious to me that it is actually profitable.
The entire inlined assembly is fairly large: #56501 (comment). `NaN` checks in particular aren't done very efficiently today, and we should try to improve that in .NET 7. So with the partial inlining, the call will remove 2-4 extra branches from the inlined code path (in favor of a call which may spill, on a probably unlikely path).
In the link, I gave an example of what we could generate for this entire comparison if we did something "more optimally".
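For illustration, the "split" shape under discussion — a small hot path with the NaN handling pushed out into a separate helper, named here after the PR's `CompareToWithAtLeastOneNaN` — can be sketched in Python. This shows only the structure; the actual change concerns JIT inlining attributes in C#, which Python cannot express:

```python
import math

def _compare_to_with_at_least_one_nan(a: float, b: float) -> int:
    # Cold path: only reached when at least one input is NaN.
    # In the C# PR this is the separate, non-inlined helper.
    if math.isnan(a):
        return 0 if math.isnan(b) else -1
    return 1

def compare_to(a: float, b: float) -> int:
    # Hot path: three ordered comparisons. If all three are False,
    # at least one input must be NaN, so defer to the helper.
    if a < b:
        return -1
    if a > b:
        return 1
    if a == b:
        return 0
    return _compare_to_with_at_least_one_nan(a, b)
```

The intent of the split is that only the three hot-path comparisons get inlined at each call site, keeping the common case small while the NaN branches live in one shared out-of-line routine.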
@jkotas In sharplab, split inlining results in smaller IL but larger machine code https://sharplab.io/#v2:EYLgxg9gTgpgtADwGwBYA0AXEBLANgHwAEAmARgFgAoQgBgAJDSA6AJQFcA7DbAWxiYDCEHgAc8MKAGUJAN2xgYAZwDcVKoQDMDYnQF0A3lTrGGWwijoBZABQBKA0ZMBfKi8rqtijFDZgMVgE8AEQg2YFwYKkNKEzoRKGwZAEMMGDpYJIATCA5cALoQsIi6HgB9ZNw2GFUYk0djAG1LGAwACwhMgElRXGtmto7ukVwAeRFuHMUmADkITtzsDkWAc1sAXXrTOkX/IVEk2AAVCGtC8LSKqttN6NjY7AAzOmsyy7SAHjo3+0IAdjo4KQancTI9nq8kpU0gA+L6Qq4Mf5Azb3J4vcrwtIAXixcKhP3+NGBILoAHpSXQAIL+CJJLx0HJpCBPNoXTGKbYc6ZJaZMFGgtHZIr8TqKbnTdHfa61EkmP50IXnJii8XWb50AD8dHoIAByJldxguEUkQNsvl+tiblimyaLXaXR6fXtgx6YwmHCmlOWy1gikUiRg81wixW602mm2XF0whEBxgxwAYmxcHlg4sYJlTqFznirjd+cYwZLMXRPur5YDiSDixCoXRYRWkdW7rWMfWcXmYATtS3YuSqTSYHT/IyGSzWmyoRzsFyeXyzQLnoqIsqxTyS/jpbLYvKVyL1xL1VqdXq+yYjSbC3cLS3rXUDXaBo7hs7n0NRuNsJMmN7fUoAxkIMFiWDhVg2A1Ix2GN9iOCBJGGbAMHTDhM2sfcu23ExbhJNs3jLTDETPa88NLRtMR7S1cLROsqjoTsm17a95T2OM4IAdSQ1pqQAGWHLwRlQ1VvjvNQDXiRIUjSaDWPjY5OLaXj+IwQSYFVDDvgLRdjAHak6Fpekx2ZOhWS7Gc515EjBRzVcVQ3WjuywnciP3NdhIozVtToXUq2vS9TWc4xb02NwnCAA==
edit: looks like it's using .NET 5 though
I don't have a particular preference on which we go with or if we hold off and wait for the JIT to make this more intrinsic or even simply improve the general handling for these kinds of checks and branches.
This shows the behavior of the two options in a simple real-world case. Notice that `IsSorted_CompareToFullyInlined` (code size 0x59) produces smaller code than `IsSorted_CompareToSplitInlined` (code size 0x6f). The obvious problem is that the slow path is a good inlining candidate and the JIT decides to inline it. Once you try to fix that by marking the slow path with `NoInlining`, it gets a bit better, but the code size of the non-split inlined version is still smaller.
> I don't have a particular preference on which we go with or if we hold off and wait for the JIT to make this more intrinsic

I agree that it would be nice to teach the JIT to deal with this efficiently. If we want to tweak the performance by manual inlining, simple AggressiveInlining should be better than the manual split inlining, as my examples demonstrated.
> If we want to tweak the performance by manual inlining, simple AggressiveInlining should be better than the manual split inlining, as my examples demonstrated.

I'm fine with this. I would, however, disagree that the examples adequately demonstrate the difference, as smaller code isn't necessarily better. There are many factors that can impact perf here, including the likelihood that a given path will be hit (NaNs are likely rare) and considerations such as the number of branches in a "small window" (16-32 aligned bytes of assembly). This was why I was initially hesitant about the change, as it could regress certain "hot loops" due to the additional branches (7) that would now be present directly in the loop. Of course, the partial inlining could also impact this in interesting ways with its overall larger codegen, but it does reduce the branch count by 3 compared to full inlining.
I agree that you would likely be able to come up with cases where the split inlining is better for micro-architectural reasons, if you tried hard enough. The data we have so far in this thread is that simple AggressiveInlining produces faster and smaller code.
To satisfy my own curiosity, I tried it in double.cs in corelib. Here, main is what's currently checked in, pr1 is CompareTo aggressively inlined, and pr2 is the split version.
```cs
private double[] _doubles = Enumerable.Range(0, 1_000_000).Select(i => i * 1.23).ToArray();
private double[] _scratch = new double[1_000_000];

[Benchmark]
public bool IsSorted()
{
    ReadOnlySpan<double> a = _doubles;
    for (int i = 0; i < a.Length - 1; i++)
        if (a[i].CompareTo(a[i + 1]) > 0)
            return false;
    return true;
}

[Benchmark]
public void CopyAndSort()
{
    _doubles.CopyTo(_scratch, 0);
    Array.Sort(_scratch);
}

[Benchmark]
public int Search() => Array.BinarySearch(_doubles, 2_000_000);

[Benchmark]
public int CompareSequence() => _doubles.AsSpan().SequenceCompareTo(_doubles);
```
Method | Toolchain | Mean | Ratio | Code Size |
---|---|---|---|---|
IsSorted | main\corerun.exe | 1,921,366.71 ns | 1.00 | 155 B |
IsSorted | pr1\corerun.exe | 739,726.37 ns | 0.39 | 97 B |
IsSorted | pr2\corerun.exe | 970,601.44 ns | 0.50 | 117 B |
CopyAndSort | main\corerun.exe | 11,761,377.29 ns | 1.00 | 1,060 B |
CopyAndSort | pr1\corerun.exe | 11,810,071.67 ns | 1.00 | 1,060 B |
CopyAndSort | pr2\corerun.exe | 11,837,290.42 ns | 1.01 | 1,060 B |
Search | main\corerun.exe | 58.53 ns | 1.00 | 207 B |
Search | pr1\corerun.exe | 35.51 ns | 0.61 | 207 B |
Search | pr2\corerun.exe | 38.65 ns | 0.66 | 207 B |
CompareSequence | main\corerun.exe | 1,829,633.22 ns | 1.00 | 330 B |
CompareSequence | pr1\corerun.exe | 1,452,953.83 ns | 0.79 | 306 B |
CompareSequence | pr2\corerun.exe | 1,523,349.48 ns | 0.83 | 316 B |
I asked whether it would make sense to split it. I'm fine with the answer being "no" 😄
Alright, I have reverted the commit that split the inlining, so we're back to just regular aggressive inlining.
Nope, it'd be very low risk if we have to pull it out, if you've concluded it's a worthwhile change in the real world, which it sounds like you have.
Hello @tannergooding! Because this pull request has the auto-merge label, I will be glad to help with merging it. p.s. you can customize the way I help with merging this pull request, such as holding this pull request until a specific person approves. Simply @mention me.
@tannergooding, is the plan to backport this?
@stephentoub Yep. @tannergooding and I chatted about it earlier. Before we merge into
As @rickbrew mentioned earlier, we may see a minor size-on-disk regression, as would accompany most "this method is now inlined where previously it wasn't" changes. I assume we're still ok with that?
Assuming it's truly minor, yeah.