Add AggressiveInlining to Double.CompareTo() and Single.CompareTo() #56501
Conversation
As per discussion at dotnet#56493
I couldn't figure out the best area label to add to this PR. If you have write-permissions please help me learn by adding exactly one area label.
Tagging subscribers to this area: @dotnet/area-system-runtime
Issue details: As per discussion in #56493
I agree with Tanner that special-casing this in the JIT to emit specialized instructions would result in the best codegen. I don't know how difficult that would be in practice, as you don't want to end up in a situation where we're really effective at setting eax to the appropriate value, but where we still incur the series of jumps after the fact. Ideally you'd want the JIT to be able to generate jumps directly to the applicable targets.
Sounds like the general consensus is this should be merged as-is but there should be an issue for .NET 7 tracking making CompareTo an intrinsic?
Do you want me to make the changes you suggested above with splitting `CompareTo`?
I'll defer to Tanner/Levi's judgement on that.
If the method is split, as you suggested, I believe performance would remain about the same for datasets with large amounts (>99%, although I'm guessing/estimating here) of non-NaN values.
I agree that even when you have `NaN`s the difference is likely small. Likewise, I think that overall the best thing is to get the JIT to treat this intrinsically. Today we have:

```cs
if (m_value < value) return -1;
if (m_value > value) return 1;
if (m_value == value) return 0;

// At least one of the values is NaN.
if (double.IsNaN(m_value))
    return double.IsNaN(value) ? 0 : -1;
else
    return 1;
```

which generates assembly that is effectively:

```asm
L0000: vzeroupper
L0003: vmovsd xmm0, [rcx]
L0007: vucomisd xmm1, xmm0
L000b: ja short L0029
L000d: vucomisd xmm0, xmm1
L0011: ja short L0032
L0013: vucomisd xmm0, xmm1
L0017: jp short L001b
L0019: je short L002f
L001b: vucomisd xmm0, xmm0
L001f: jp short L0023
L0021: je short L0032
L0023: vucomisd xmm1, xmm1
L0027: jp short L002f
L0029: mov eax, 0xffffffff
L002e: ret
L002f: xor eax, eax
L0031: ret
L0032: mov eax, 1
L0037: ret
```

While in practice, we need no more than three comparisons:

```asm
entry:
  xor eax, eax          ; Clear result to zero
  vmovsd xmm0, [rcx]    ; Load "this" into xmm0
  vucomisd xmm0, xmm1   ; Compare "this" to "value"
  jp nan                ; Parity flag is set, one input is NaN (unpredicted)
  jc less_than          ; Carry flag is set, "this" is less than "value"
  jnz greater_than      ; Zero flag is not set, "this" is greater than "value"
  ret                   ; Return value is already zero (inputs are equal)
greater_than:
  inc eax               ; Return value should be one
  ret
less_than:
  dec eax               ; Return value should be negative one
  ret
nan:
  vucomisd xmm0, xmm0   ; Compare "this" to "this"
  jnp greater_than      ; Parity flag is not set, "this" is not NaN; "value" is a NaN
  vucomisd xmm1, xmm1   ; Compare "value" to "value"
  jnp less_than         ; Parity flag is not set, "this" is NaN; "value" is not a NaN
  ret                   ; Return value is already zero (inputs are both NaN)
```

This would result in approx. 36 bytes of assembly, down from 56 bytes. It has significantly fewer comparisons, which saves us a number of cycles (~3-7 cycles per comparison). Intrinsically recognizing `CompareTo` would allow the JIT to generate something close to this.

Likewise, even without intrinsifying, we should get the JIT to understand cases like:

```asm
vucomisd xmm0, xmm0
jp short label1
je short label2
label1:
vucomisd xmm1, xmm1
; more code
label2:
; more code
```

and have it generate:

```asm
vucomisd xmm0, xmm0
je short label2
vucomisd xmm1, xmm1
; more code
label2:
; more code
```
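For reference, the ordering that both the C# source and the proposed assembly implement can be sketched in Python; this is a hypothetical mirror of `Double.CompareTo`'s total ordering for illustration, not the runtime's actual code:

```python
import math

def compare_to(m_value: float, value: float) -> int:
    """Mirrors double.CompareTo: NaN sorts before every number,
    and two NaNs compare as equal."""
    if m_value < value:
        return -1
    if m_value > value:
        return 1
    if m_value == value:
        return 0
    # At least one of the values is NaN.
    if math.isnan(m_value):
        return 0 if math.isnan(value) else -1
    return 1

nan = float("nan")
print(compare_to(1.0, 2.0))  # -1
print(compare_to(nan, nan))  # 0
print(compare_to(nan, 1.0))  # -1
print(compare_to(1.0, nan))  # 1
```

Note how this differs from the IEEE-754 `<`/`>` operators, for which every comparison against NaN is false; `CompareTo` must still produce a total order.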
@rickbrew it sounds like the next action here is to do some quick perf tests including lots of NaNs. @tannergooding do we not have any perf tests for comparing numbers? Maybe I'm missing them.
@rickbrew do you expect to be able to gather this data in time to possibly get this into 6.0?
@danmoseley Cutoff for .NET 6 is 8/17? No, that won't be possible for me. I also wasn't sure if that was up to me or @tannergooding, since #56493 was assigned to him. Chalk it up to being a first-timer in the dotnet/runtime repo. I can still take a look in a bit; I've been very busy finishing, preparing, and stabilizing Paint.NET 4.3 (finally ported to .NET 5).
Congratulations on the port! Do you notice any differences compared to running it on .NET Framework?
Oh yes, lots. It's faster, maybe by about 15% across the board before doing any further optimizations that weren't possible with Framework. Except for startup performance, which has about a 30% penalty due to using a lot of C++/CLI code, which can't be crossgen'd. I interop heavily with the likes of D2D, WIC, et al. That will slowly evaporate/improve as I port all of that over to C# (@tannergooding's TerraFX will be put to good use).

SCD is fantastic for simplifying and alleviating a lot of various install issues. Installation and updating is much faster because I crossgen on my build machine instead of NGEN on the user's system. I've also been able to sink a lot of time into SIMD optimizations, both to add newly optimized code paths and to port many others from native C code that I was p/invoking. ARM64 support is great; really glad that landed in 5.0.9 for WinForms and WPF. Being able to load plugins into isolated `AssemblyLoadContext`s is great too.

Plugin compatibility remains a struggle, although most work fine as-is w/o recompilation (including super old .NET 2.0 or even 1.1 DLLs). Some framework DLLs have been removed.

Download size is also a struggle. SCD grows my installer from ~12MB to >50MB, so hopefully my hosting provider doesn't get stressed out. I'm also actively pursuing ways to trim the size through various creative methods (e.g. #56699).

Having access to all the new stuff in the runtime and framework, more libraries on nuget, and the massively improved JIT/codegen is really great. It finally feels like I'm working with the runtime now, instead of working around it.
That's a great result so far, thank you for sharing. cc @richlander, although I'm guessing he's aware of your progress.
@tannergooding I will be out next week. Do you think it would be reasonable to take this change Monday with the evidence we have? It seems we have fairly good confidence it will on balance be a benefit. Your call.
@danmoseley I haven't tried trimming just yet, partly because I couldn't get ILLink's command-line parameters to work. But also, I suspect PDN's trimmability is low because of the plugin situation. They need access to both PDN and runtime/framework DLLs. I may look into it more later, after 4.3's release, maybe after .NET 6's release. I do see value in being able to trim individual libraries.

Also, the biggest bang-for-buck action I can see right now is trimming out R2R data from large framework DLLs that are not used, or not used enough, at least not on the startup path (e.g. PresentationFramework.dll is 17MB, but only 6MB w/o R2R data; see also #56699).
I decided to start running benchmarks and gathering the data anyway, as I could use a break from the PDN stuff.
I've created some benchmark code in this repository: https://github.com/rickbrew/CompareInliningBenchmarks (built with VS 2022 17.0.0 Preview 3.0).

The benchmark consists of creating an array with several million elements, sized so as to fit just within the 32MB L3 cache available to each chiplet on the CPU. The array is filled with values 0 through N-1 (ints cast to float/double). The implementations of `CompareTo` match the fully-inlined and split-inlined variants discussed above. Here are the results for doubles:
And for floats with 8M elements in the array. Same amount of memory and bandwidth, but higher compute cost:
I'm not sure that having 50% NaNs (or even 10%!) is a realistic scenario, but it does illustrate some divergence in results between fully- and split-inlined. I didn't go above 50% NaNs because by the time you reach 100% NaNs the data is fully sorted, at which point what's the point of sorting; it didn't seem like a realistic scenario either.
So this illustrates that for sorting an array of elements, inlining `CompareTo` in either form is a clear win.
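The dataset shape described above (values 0..N-1 with a given share replaced by NaNs, then sorted with the comparator) can be sketched like this. The sizes and helper names here are illustrative only, not taken from the linked benchmark repo:

```python
import functools
import math
import random

def make_dataset(n: int, nan_fraction: float, seed: int = 42) -> list[float]:
    """Array of 0..n-1 as floats, with a fraction replaced by NaN,
    shuffled so the NaNs are spread throughout the data."""
    rng = random.Random(seed)
    data = [float(i) for i in range(n)]
    for i in range(int(n * nan_fraction)):
        data[i] = math.nan
    rng.shuffle(data)
    return data

def compare_to(a: float, b: float) -> int:
    # Same total order as double.CompareTo: NaN < numbers, NaN == NaN.
    if a < b:
        return -1
    if a > b:
        return 1
    if a == b:
        return 0
    if math.isnan(a):
        return 0 if math.isnan(b) else -1
    return 1

data = make_dataset(1000, 0.1)
data.sort(key=functools.cmp_to_key(compare_to))
# With this ordering, all NaNs sort to the front of the array.
assert all(math.isnan(x) for x in data[:100])
```

This also shows why a 100% NaN dataset is already "sorted" under this comparator: every pairwise comparison returns 0.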
Given the above, and that split inlining is still a lot faster than no inlining, I think it's reasonable as a stopgap. @danmoseley, do you have any concerns with us merging this for RC1? If the official benchmarks show any regression not covered here or on older hardware, we'd likely need to pull it back out.
|
private int CompareToWithAtLeastOneNaN(double value)
Does the manual split inlining actually make the code significantly smaller in real world situations vs. just aggressive inlining the whole thing? The call introduced by the split inlining is going to have secondary effect like extra register spilling, so it is not obvious to me that it is actually profitable.
The entire inlined assembly is fairly large: #56501 (comment). `NaN` checks in particular aren't done very efficiently today, and we should try to improve that in .NET 7. So with the partial inlining, the call will remove 2-4 extra branches from the inlined code path (in favor of a call which may spill, on a probably unlikely path).
In the link, I gave an example of what we could generate for this entire comparison if we did something "more optimally".
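For illustration, the "split" shape under discussion — a small hot path with the NaN handling pushed out into a separate helper, named here after the PR's `CompareToWithAtLeastOneNaN` — can be sketched in Python. This shows only the structure; the actual change concerns JIT inlining attributes in C#, which Python cannot express:

```python
import math

def _compare_to_with_at_least_one_nan(a: float, b: float) -> int:
    # Cold path: only reached when at least one input is NaN.
    # In the C# PR this is the separate, non-inlined helper.
    if math.isnan(a):
        return 0 if math.isnan(b) else -1
    return 1

def compare_to(a: float, b: float) -> int:
    # Hot path: three ordered comparisons. If all three are False,
    # at least one input must be NaN, so defer to the helper.
    if a < b:
        return -1
    if a > b:
        return 1
    if a == b:
        return 0
    return _compare_to_with_at_least_one_nan(a, b)
```

The intent of the split is that only the three hot-path comparisons get inlined at each call site, keeping the common case small while the NaN branches live in one shared out-of-line routine.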
@jkotas In sharplab, split inlining results in smaller IL but larger machine code https://sharplab.io/#v2:EYLgxg9gTgpgtADwGwBYA0AXEBLANgHwAEAmARgFgAoQgBgAJDSA6AJQFcA7DbAWxiYDCEHgAc8MKAGUJAN2xgYAZwDcVKoQDMDYnQF0A3lTrGGWwijoBZABQBKA0ZMBfKi8rqtijFDZgMVgE8AEQg2YFwYKkNKEzoRKGwZAEMMGDpYJIATCA5cALoQsIi6HgB9ZNw2GFUYk0djAG1LGAwACwhMgElRXGtmto7ukVwAeRFuHMUmADkITtzsDkWAc1sAXXrTOkX/IVEk2AAVCGtC8LSKqttN6NjY7AAzOmsyy7SAHjo3+0IAdjo4KQancTI9nq8kpU0gA+L6Qq4Mf5Azb3J4vcrwtIAXixcKhP3+NGBILoAHpSXQAIL+CJJLx0HJpCBPNoXTGKbYc6ZJaZMFGgtHZIr8TqKbnTdHfa61EkmP50IXnJii8XWb50AD8dHoIAByJldxguEUkQNsvl+tiblimyaLXaXR6fXtgx6YwmHCmlOWy1gikUiRg81wixW602mm2XF0whEBxgxwAYmxcHlg4sYJlTqFznirjd+cYwZLMXRPur5YDiSDixCoXRYRWkdW7rWMfWcXmYATtS3YuSqTSYHT/IyGSzWmyoRzsFyeXyzQLnoqIsqxTyS/jpbLYvKVyL1xL1VqdXq+yYjSbC3cLS3rXUDXaBo7hs7n0NRuNsJMmN7fUoAxkIMFiWDhVg2A1Ix2GN9iOCBJGGbAMHTDhM2sfcu23ExbhJNs3jLTDETPa88NLRtMR7S1cLROsqjoTsm17a95T2OM4IAdSQ1pqQAGWHLwRlQ1VvjvNQDXiRIUjSaDWPjY5OLaXj+IwQSYFVDDvgLRdjAHak6Fpekx2ZOhWS7Gc515EjBRzVcVQ3WjuywnciP3NdhIozVtToXUq2vS9TWc4xb02NwnCAA==
edit: looks like it's using .NET 5 though
I don't have a particular preference on which we go with or if we hold off and wait for the JIT to make this more intrinsic or even simply improve the general handling for these kinds of checks and branches.
This shows the behavior of the two options in a simple real-world case. Notice that `IsSorted_CompareToFullyInlined` (code size 0x59) produces smaller code than `IsSorted_CompareToSplitInlined` (code size 0x6f). The obvious problem is that the slow path is a good inlining candidate and the JIT decides to inline it. Once you try to fix that by marking the slow path with `NoInlining`, it gets a bit better, but the code size of the non-split inlined version is still smaller.
> I don't have a particular preference on which we go with or if we hold off and wait for the JIT to make this more intrinsic

I agree that it would be nice to teach the JIT to deal with this efficiently. If we want to tweak the performance by manual inlining, simple AggressiveInlining should be better than the manual split inlining, as my examples demonstrated.
> If we want to tweak the performance by manual inlining, simple AggressiveInlining should be better than the manual split inlining, as my examples demonstrated.

I'm fine with this. I would, however, disagree that the examples adequately demonstrate the difference, as smaller code isn't necessarily better. There are many factors that can impact perf here, including the likelihood that a given path will be hit (NaNs are likely rare) and considerations such as the number of branches in a "small window" (16-32 aligned bytes of assembly). This was why I was initially hesitant about the change, as it could regress certain "hot loops" due to the additional branches (7) that would now be present directly in the loop. Of course, the partial inlining could also impact this in interesting ways with its overall larger codegen, but it does reduce the branch count by 3 compared to full inlining.
I agree that you would likely be able to come up with cases where the split inlining is better for micro-architectural reasons, if you tried hard enough. The data we have so far in this thread is that simple AggressiveInlining produces faster and smaller code.
To satisfy my own curiosity, I tried it in double.cs in corelib. Here, main is what's currently checked in, pr1 is CompareTo aggressively inlined, and pr2 is the split version.
```cs
private double[] _doubles = Enumerable.Range(0, 1_000_000).Select(i => i * 1.23).ToArray();
private double[] _scratch = new double[1_000_000];

[Benchmark]
public bool IsSorted()
{
    ReadOnlySpan<double> a = _doubles;
    for (int i = 0; i < a.Length - 1; i++)
        if (a[i].CompareTo(a[i + 1]) > 0)
            return false;
    return true;
}

[Benchmark]
public void CopyAndSort()
{
    _doubles.CopyTo(_scratch, 0);
    Array.Sort(_scratch);
}

[Benchmark]
public int Search() => Array.BinarySearch(_doubles, 2_000_000);

[Benchmark]
public int CompareSequence() => _doubles.AsSpan().SequenceCompareTo(_doubles);
```
Method | Toolchain | Mean | Ratio | Code Size |
---|---|---|---|---|
IsSorted | main\corerun.exe | 1,921,366.71 ns | 1.00 | 155 B |
IsSorted | pr1\corerun.exe | 739,726.37 ns | 0.39 | 97 B |
IsSorted | pr2\corerun.exe | 970,601.44 ns | 0.50 | 117 B |
CopyAndSort | main\corerun.exe | 11,761,377.29 ns | 1.00 | 1,060 B |
CopyAndSort | pr1\corerun.exe | 11,810,071.67 ns | 1.00 | 1,060 B |
CopyAndSort | pr2\corerun.exe | 11,837,290.42 ns | 1.01 | 1,060 B |
Search | main\corerun.exe | 58.53 ns | 1.00 | 207 B |
Search | pr1\corerun.exe | 35.51 ns | 0.61 | 207 B |
Search | pr2\corerun.exe | 38.65 ns | 0.66 | 207 B |
CompareSequence | main\corerun.exe | 1,829,633.22 ns | 1.00 | 330 B |
CompareSequence | pr1\corerun.exe | 1,452,953.83 ns | 0.79 | 306 B |
CompareSequence | pr2\corerun.exe | 1,523,349.48 ns | 0.83 | 316 B |
I asked whether it would make sense to split it. I'm fine with the answer being "no" 😄
Alright, I have reverted the commit that split the inlining, so we're back to just regular aggressive inlining.
Nope, it'd be very low risk if we have to pull it out, if you've concluded it's a worthwhile change in the real world, which it sounds like you have.
Hello @tannergooding! Because this pull request has the auto-merge label, I will be glad to help with merging it. p.s. you can customize the way I help with merging this pull request, such as holding this pull request until a specific person approves. Simply @mention me.
@tannergooding, is the plan to backport this?
@stephentoub Yep. @tannergooding and I chatted about it earlier. Before we merge into
As @rickbrew mentioned earlier, we may see a minor size-on-disk regression, as would accompany most "this method is now inlined where previously it wasn't" changes. I assume we're still ok with that?
Assuming it's truly minor, yeah.