Use a faster sort algorithm pdqsort for Array.Sort #68436
Conversation
Tagging subscribers to this area: @dotnet/area-system-runtime
internal static ref T UnguardedAccess<T>(Span<T> span, int index)
{
    // return ref Unsafe.Add(ref MemoryMarshal.GetReference(span), index);
    // TODO: investigate where we can use the above line with confidence instead of this.
    return ref span[index];
}

[MethodImpl(MethodImplOptions.AggressiveInlining)]
internal static Span<T> UnguardedSlice<T>(Span<T> span, int begin, int end)
{
    return MemoryMarshal.CreateSpan(ref UnguardedAccess(span, begin), end - begin);
}
Using unsafe code throughout a complex algorithm like sort is a major liability. The existing sort intentionally uses safe, bounds-checked access, except in a few carefully audited places.
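To make the tradeoff concrete, here is a minimal sketch (hypothetical helper names, not code from this PR) contrasting bounds-checked span access with the unguarded pattern used by the helpers above:

using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

class BoundsCheckDemo
{
    // Safe access: the JIT keeps a bounds check; a bad index throws
    // IndexOutOfRangeException instead of corrupting memory.
    static int ReadChecked(Span<int> span, int index) => span[index];

    // Unguarded access: no bounds check, so a buggy index silently reads
    // adjacent memory. This is the liability being weighed in the review.
    static int ReadUnchecked(Span<int> span, int index) =>
        Unsafe.Add(ref MemoryMarshal.GetReference(span), index);

    static void Main()
    {
        Span<int> data = stackalloc int[] { 3, 1, 2 };
        Console.WriteLine(ReadChecked(data, 1));   // 1
        Console.WriteLine(ReadUnchecked(data, 1)); // 1, but with no safety net
    }
}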
Changing the unstable Array.Sort algorithm is a semi-breaking change. It generates a ton of follow-up work in projects out there in updating their test baselines. It would be best to switch to a stable sort algorithm that we can keep tweaking for performance in a non-breaking way. What's the best stable sort algorithm available, and how does it compare with this and with what we have today? I understand that we would be leaving some perf on the table with a stable sort algorithm, but we would be gaining predictability that has a lot of value.
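To make the baseline concern concrete, here is a small sketch (illustrative data, not from the PR). The ordering difference is only observable when two elements compare equal yet remain distinguishable; Array.Sort is documented as unstable, so that relative order is an implementation detail that test baselines may have captured, whereas a stable sort such as LINQ's OrderBy pins it.

using System;
using System.Linq;

class StabilityDemo
{
    static void Main()
    {
        // "a" and "c" compare equal by Key but remain distinguishable by Id.
        var items = new[] { (Key: 1, Id: "a"), (Key: 0, Id: "b"), (Key: 1, Id: "c") };

        // Array.Sort is documented as unstable: "a" and "c" may come out in either
        // order, and that order can change whenever the algorithm changes.
        var unstable = ((int Key, string Id)[])items.Clone();
        Array.Sort(unstable, (x, y) => x.Key.CompareTo(y.Key));

        // Enumerable.OrderBy is documented as stable: "a" always precedes "c",
        // no matter how the underlying sort is implemented.
        var stable = items.OrderBy(t => t.Key).ToArray();
        Console.WriteLine(string.Join(" ", stable.Select(t => t.Id))); // b a c
    }
}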
It seems fluxsort allocates, which may be undesirable in certain cases.
Subsequently changing to an even better stable algorithm could break baselines again, right?
Just curious, are there algorithms that take advantage of vectorization? I don't see it mentioned there.
@jkotas Aren't we already using an unstable algorithm? We used introsort before, which is already unstable.
@jkotas Since we already use an unstable sort algorithm, I would like to land this PR (pdqsort). The previous failure was caused by a mistake where I forgot to handle the null case while comparing.
It's unstable but deterministic. @jkotas' concern is that a lot of code, especially tests, has taken a dependency on the ordering, and we don't want to churn it only to churn it again.
No. All stable sort algorithms produce the same results. If we switch to a stable algorithm, switching to an even better stable algorithm would not change observable behavior.

Yes, you can write specialized sort algorithms that take advantage of vectorization and parallelization. They are good candidates for a standalone NuGet package, not for the good-enough sort algorithm in the core library.
I would like to see the CI results first. If there are no test failures related to this change, I think it's okay to ship it.
I see CircleInConvex is failing because the cmp function is malformed:

static int cmp(Point a, Point b)
{
    if (a.X < b.X || a.X == b.X && a.Y < b.Y)
        return 1;
    else
        return 0;
}

The above cmp function results in indeterminate behavior, which is strongly coupled to introsort's behavior (code really shouldn't assume Array.Sort has an introsort implementation). Note that if I change to any sort algorithm other than introsort, such as heap sort or insertion sort, CircleInConvex also fails. I think instead of keeping the original sorting algorithm to make the test pass, we should change the cmp function of the test.
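For reference, a sketch of the kind of fix being suggested. The Point type below is a stand-in (the actual test's type is not shown here): a well-formed Comparison<T> returns negative, zero, or positive and agrees with itself when the arguments are swapped, so the result no longer depends on which sort algorithm runs.

using System;

struct Point
{
    public long X, Y; // assumed field types; the real test's Point is not shown
    public Point(long x, long y) { X = x; Y = y; }
}

class ComparerFix
{
    // Well-formed comparison: negative/zero/positive, antisymmetric and transitive,
    // so every correct Array.Sort implementation produces the same order.
    static int cmp(Point a, Point b)
    {
        int byX = a.X.CompareTo(b.X);
        return byX != 0 ? byX : a.Y.CompareTo(b.Y);
    }

    static void Main()
    {
        var pts = new[] { new Point(2, 1), new Point(1, 3), new Point(1, 2) };
        Array.Sort(pts, cmp);
        foreach (var p in pts) Console.WriteLine($"({p.X}, {p.Y})"); // (1, 2) (1, 3) (2, 1)
    }
}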
Ah yes, I was thinking that "stable" means it always orders elements that compare identically in some predictable order. But it actually means it always keeps elements that compare identically in the same order they were provided.
Test failures are unrelated.
CircleInConvex is a good test fix. It may be good to submit it as a separate PR.
I will leave it for @dotnet/area-system-runtime owners to figure out how to drive this behavior change through. For example, based on past experience, this sort algorithm change would generate a bunch of work for Bing.

I think we can get people behind something like this: "We are sorry for changing the sort algorithm again. This is the last time. The sort algorithm is going to be stable from now on, and improvements are not going to have observable impact going forward (as long as your compare function is correct)." On the other hand, "We are sorry for changing the sort algorithm again, and we are likely going to do it at least one more time soon, so prepare to do the work on your test baselines twice" will be harder to get people behind.

There are likely going to be asks for keeping the current sort algorithm under a switch, but that is not particularly great for the small binary sizes that are important for other form factors. The whole thing gets complicated.
Most methods of pdqsort can be shared with introsort (insertion sort and heap sort), so I think we can provide a runtime switch to fall back to the old introsort behavior without introducing an observable code-size regression.
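A sketch of how such a switch might look. The switch name and the sort method names are invented for illustration; no such switch exists in the runtime today:

using System;

static class ArraySortCompat
{
    // Hypothetical switch name, for illustration only.
    private const string LegacySortSwitch = "System.Array.UseLegacyIntrosort";

    internal static bool UseLegacyIntrosort =>
        AppContext.TryGetSwitch(LegacySortSwitch, out bool enabled) && enabled;
}

// Hypothetical call site inside the shared sort entry point:
//   if (ArraySortCompat.UseLegacyIntrosort) IntroSort(keys, comparer);
//   else                                    PdqSort(keys, comparer);
// An app would opt in via AppContext.SetSwitch("System.Array.UseLegacyIntrosort", true)
// or the equivalent runtimeconfig.json entry.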
Does stability matter for native types anyway, as you can't tell the difference between two ints with the same value? Could pdqsort be used for a known list of types?
I do not have much else to say beyond what I have said in #68436 (comment). If you agree that we want to end up with a stable sorting algorithm, the intermediate steps create extra work.
Timsort is a good stable sort for references, so you could have Timsort for general sorting and pdqsort for primitives where stability doesn't matter. The only downside is that it can allocate, but you can likely use a degree of stack allocation for small cases and only switch to dynamic allocation if needed.
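On the "degree of stack allocation" point, a minimal sketch (the threshold and names are illustrative, not from the runtime) of the usual pattern: take a stackalloc scratch buffer for small inputs and rent from ArrayPool only when needed. The merge sort body is a placeholder.

using System;
using System.Buffers;

static class StableSortBuffer
{
    public static void SortStable(Span<int> keys)
    {
        const int StackLimit = 256; // illustrative cutoff

        int[]? rented = null;
        // Small inputs: scratch space on the stack. Larger inputs: rent from the pool.
        Span<int> scratch = keys.Length <= StackLimit
            ? stackalloc int[StackLimit]
            : (rented = ArrayPool<int>.Shared.Rent(keys.Length));

        try
        {
            MergeSortInto(keys, scratch[..keys.Length]);
        }
        finally
        {
            if (rented is not null)
                ArrayPool<int>.Shared.Return(rented);
        }
    }

    // Placeholder: a real stable merge sort would merge runs through 'scratch'.
    private static void MergeSortInto(Span<int> keys, Span<int> scratch) => keys.Sort();
}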
When the algorithm becomes stable, people will take a dependency on that behavior (justified or not). This means that it can never again be made unstable. It's a one-way street. This feels like .NET 1.0 locking in a specific ... It's also like the .NET 1.0 ...
I'm just trying to get clarity on what an unstable sort means for primitive structs, seeing as you can't tell the difference between two integers with the same value. I can certainly see the difference if you're using an extracted key to sort different values, but using an unstable sort for Array.Sort<int>(array) seems safe from the stability perspective, as you can't tell. Whether it's worth having a faster unstable sort for this kind of case is another question!
I'm agreeing with @GSPP. Once we switch to a stable sort algorithm and document the behavior, we will never be able to go back to the unstable world again because people will start to rely on the stable behavior.
That is an intentional and desirable option. Developers are already reliant on the "unstable" sort having not changed and therefore being viewed as "stable", when in actuality it's simply deterministic.
Then we can consider adding a ...
As per the above, if any break is taken we are more interested in taking a one-time break to make the default sorting algorithm stable. This allows us to improve perf over time without continuously hurting developers who may be dependent on a particular behavior. A faster but unstable sorting algorithm could then be discussed independently, but I imagine it's better for that to simply be provided via a 3rd party package maintained by the community.
As for the other question that popped up, there is indeed no difference between "stable" and "unstable" for primitive types, as there is no way to disambiguate one value from another equal value.

That being said, having a significantly different algorithm for primitive types needs to be worth the cost. That is, it needs to be shown that the additional code complexity and long-term maintenance of the additional algorithm is worthwhile. This will likely require showing not only perf numbers but also showing that such sequences occur "frequently enough" that this shouldn't just be handled by some 3rd party package maintained by the community.
Unless the app provides an IComparer that treats some values as equivalent.
Sure, but that's not comparing primitives; that's using a custom comparer, which is effectively treated as a "custom type" at that point.
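A small sketch (illustrative comparer, not from the discussion) of where that line falls: with a raw int[] the question is unobservable, but as soon as an IComparer<int> maps distinct values to the same rank, the unstable order leaks out.

using System;
using System.Collections.Generic;

class InstabilityVisibility
{
    // Treats values as equivalent when they have the same parity, so distinct
    // ints can compare equal and their relative order becomes observable.
    sealed class ByParity : IComparer<int>
    {
        public int Compare(int x, int y) => (x & 1).CompareTo(y & 1);
    }

    static void Main()
    {
        // Plain int[]: two equal ints are indistinguishable, so no caller can
        // observe whether the sort was stable or not.
        int[] plain = { 3, 1, 3, 2 };
        Array.Sort(plain);
        Console.WriteLine(string.Join(" ", plain)); // 1 2 3 3

        // With a comparer that collapses distinct values, the order of 4 vs 2
        // (both "even") is an implementation detail of the unstable sort.
        int[] keyed = { 4, 3, 2, 1 };
        Array.Sort(keyed, new ByParity());
        Console.WriteLine(string.Join(" ", keyed)); // e.g. "4 2 3 1" or "2 4 1 3"
    }
}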
Based on the paper https://drops.dagstuhl.de/opus/volltexte/2016/6389/pdf/LIPIcs-ESA-2016-38.pdf and the C++ implementation https://github.com/orlp/pdqsort.
I haven't implemented the branchless partition properly yet; that can be improved in the future.
Benchmark: https://gist.github.com/hez2010/6b52929ee1755788c34818972c46aefb
Fixes #68325