Vectorize SpanHelpers<T>.IndexOf #60974

Merged Nov 22, 2021 (10 commits)


alexcovington
Contributor

Add a vectorized path in SpanHelpers.T.cs for value types that performs the same logic as SpanHelpers.IndexOf<T> but will vectorize the operations where possible.
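
For readers unfamiliar with the technique, here is a minimal sketch of what a vectorized `IndexOf` for `int` can look like. It is not the code merged in this PR: the class name `VectorizedIndexOfSketch` is hypothetical, and it uses the public `System.Numerics.Vector<T>` API rather than the internal `SpanHelpers` machinery.

```csharp
using System;
using System.Numerics;

// Hypothetical helper for illustration only; not the implementation from this PR.
public static class VectorizedIndexOfSketch
{
    public static int IndexOf(ReadOnlySpan<int> span, int value)
    {
        int i = 0;

        // Only take the vector path when SIMD is available and at least one full vector fits.
        if (Vector.IsHardwareAccelerated && span.Length >= Vector<int>.Count)
        {
            // Broadcast the search value into every lane.
            Vector<int> target = new Vector<int>(value);
            int lastBlockStart = span.Length - Vector<int>.Count;

            for (; i <= lastBlockStart; i += Vector<int>.Count)
            {
                // Load Vector<int>.Count elements starting at index i.
                Vector<int> block = new Vector<int>(span.Slice(i));

                // If any lane matches, stop and let the scalar loop locate the exact index.
                if (Vector.EqualsAny(block, target))
                {
                    break;
                }
            }
        }

        // Scalar loop: finds the match inside the flagged block, or scans the tail.
        for (; i < span.Length; i++)
        {
            if (span[i] == value)
            {
                return i;
            }
        }

        return -1;
    }
}
```

The scalar loop doubles as the fallback for inputs shorter than one vector, so small buffers pay only for the up-front length check.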

Baseline is commit 31c38ef.

Benchmark results:

PS C:\Users\acovingt\source\repos\performance> py .\scripts\benchmarks_ci.py -c Release -f net6.0 --filter System.Collections.Contains*Int32* System.Memory.Span*Int32*IndexOf* System.Tests.Perf_Array.IndexOf* --corerun C:\Users\acovingt\source\repos\runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\corerun.exe --bdn-artifacts C:\Users\acovingt\Documents\micro-compare-final\runtime\ --bdn-arguments "--launchCount 5"
PS C:\Users\acovingt\source\repos\performance> py .\scripts\benchmarks_ci.py -c Release -f net6.0 --filter System.Collections.Contains*Int32* System.Memory.Span*Int32*IndexOf* System.Tests.Perf_Array.IndexOf* --corerun C:\Users\acovingt\source\repos\runtime-master\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\corerun.exe --bdn-artifacts C:\Users\acovingt\Documents\micro-compare-final\runtime-master\ --bdn-arguments "--launchCount 5"
PS C:\Users\acovingt\source\repos\performance\src\tools\ResultsComparer> dotnet run -- --base C:\Users\acovingt\Documents\micro-compare-final\runtime-master\results\ --diff C:\Users\acovingt\Documents\micro-compare-final\runtime\results\ --threshold 3% --noise 5ns
summary:
better: 11, geomean: 1.979
total diff: 11

No Slower results for the provided threshold = 3% and noise filter = 5ns.

| Faster                                                            | base/diff | Base Median (ns) | Diff Median (ns) | Modality|
| ----------------------------------------------------------------- | ---------:| ----------------:| ----------------:| --------:|
| System.Collections.ContainsFalse<Int32>.ICollection(Size: 512)    |      2.97 |         49319.68 |         16603.82 |         |
| System.Collections.ContainsFalse<Int32>.ImmutableArray(Size: 512) |      2.77 |         50947.90 |         18381.32 |         |
| System.Collections.ContainsFalse<Int32>.Queue(Size: 512)          |      2.23 |         46246.47 |         20723.50 |         |
| System.Collections.ContainsFalse<Int32>.Array(Size: 512)          |      2.06 |         49333.88 |         23935.60 |         |
| System.Collections.ContainsTrue<Int32>.Array(Size: 512)           |      1.95 |         34437.90 |         17660.55 |         |
| System.Collections.ContainsFalse<Int32>.List(Size: 512)           |      1.89 |         44184.18 |         23359.91 |         |
| System.Collections.ContainsTrue<Int32>.ImmutableArray(Size: 512)  |      1.87 |         35886.60 |         19165.18 |         |
| System.Collections.ContainsTrue<Int32>.List(Size: 512)            |      1.78 |         33536.09 |         18847.10 |         |
| System.Collections.ContainsTrue<Int32>.ICollection(Size: 512)     |      1.76 |         34361.86 |         19510.32 |         |
| System.Collections.ContainsTrue<Int32>.Queue(Size: 512)           |      1.74 |         34743.45 |         19957.12 |         |
| System.Memory.Span<Int32>.LastIndexOfAnyValues(Size: 512)         |      1.28 |           103.35 |            80.87 |         |

ghost added the community-contribution label on Oct 28, 2021

ghost commented Oct 28, 2021

Tagging subscribers to this area: @GrabYourPitchforks, @dotnet/area-system-memory
See info in area-owners.md if you want to be subscribed.

Author: alexcovington
Assignees: -
Labels: area-System.Memory, community-contribution
Milestone: -

internal static bool CanVectorizeIndexOfForType<T>()
{
    return (typeof(T) == typeof(byte)) ||

Member

I wonder whether there's a "built-in" way to achieve this, or at least some central place for all the helpers that use vectorized paths.

Contributor Author

Just realized there is Vector<T>.IsTypeSupported. This should work as a replacement for checks within the new IndexOfValueType<T> method. I'll make the changes and remove this helper method since that will be cleaner.

alexcovington and others added 2 commits October 28, 2021 12:22
Co-authored-by: Günther Foidl <gue@korporal.at>

{
    // bool and char will already have been checked before, just do checks for types
    // that are equal to sizeof(int) or sizeof(long)
    if (Unsafe.SizeOf<T>() == sizeof(int))

Member

Won't this break for float and double? This is why we have IsBitwiseEquatable (see https://source.dot.net/System.Private.CoreLib/R/e4188e6833cbc739.html) as a helper API.
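
To illustrate the concern, here is a small sketch assuming the size-based path would compare elements by bit pattern. The helper `BitwiseEquals` is hypothetical and exists only to show where bitwise comparison and `float.Equals` disagree.

```csharp
using System;

// Hypothetical helper for illustration; it is not part of the runtime.
public static class BitwiseEqualitySketch
{
    // Compares two floats by their raw 32-bit patterns.
    public static bool BitwiseEquals(float a, float b) =>
        BitConverter.SingleToInt32Bits(a) == BitConverter.SingleToInt32Bits(b);

    public static void Main()
    {
        float positiveZero = 0.0f;
        float negativeZero = -0.0f;

        // float.Equals treats +0.0 and -0.0 as equal, but their bit patterns differ
        // (the sign bit is set for -0.0), so a bitwise search for 0.0f would miss -0.0f.
        Console.WriteLine(positiveZero.Equals(negativeZero));        // True
        Console.WriteLine(BitwiseEquals(positiveZero, negativeZero)); // False
    }
}
```

NaN is the other problem case: `float.NaN.Equals(float.NaN)` returns true, but NaN values can carry different bit patterns, so bitwise comparison does not reliably reproduce that result either.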

Contributor Author

Good point, I'll add a check.

Member

Do we need tests for float and double somewhere? Did anything fail before you fixed this?

Contributor Author

No, nothing failed for me when testing locally using:

.\build.cmd clr+libs+libs.tests -c Checked -test
.\build.cmd clr+libs+libs.tests -c Release -test

But some new tests for float and double would probably be appropriate just to be sure. Do you know the best place to add those?

@@ -225,6 +226,22 @@ public static void Fill<T>(ref T refData, nuint numElements, T value)
{
    Debug.Assert(length >= 0);

    if (typeof(T).IsValueType && RuntimeHelpers.IsBitwiseEquatable<T>())
    {
        // bool and char will already have been checked before, just do checks for types

Member

danmoseley, Oct 29, 2021

Indeed, I see byte/bool and char are checked here:
https://github.com/danmoseley/runtime/blob/f3ca6f91ba9c758bb246be8ba26bd356d3f9dda6/src/libraries/System.Private.CoreLib/src/System/MemoryExtensions.cs#L298

Why are 1- and 2-byte sizes treated specially there, and 4- and 8-byte sizes treated specially here? Why not all in the same place?

Contributor Author

Oh, I think I was referencing a different check for those types, but putting them all near the section you linked would be cleaner and more intuitive. I'll move the checks there instead.

@@ -193,5 +195,114 @@ public static void ContainsNull_String(string[] spanInput, bool expected)
    Span<string> theStrings = spanInput;
    Assert.Equal(expected, theStrings.Contains(null));
}

Contributor Author

@danmoseley I've added some tests here since there were not many tests for Span<T>.Contains. I believe the existing tests for Array.IndexOf (link) already have enough coverage. Let me know if I can add/modify anything.

Member

adamsitnik left a comment

LGTM, big thanks for your contribution @alexcovington!

The benchmark numbers look great. Is there any chance you could run them for smaller collections as well? Currently we test only 512 elements, which of course favors the vectorized code path. You could do that by modifying the following Params:

https://github.com/dotnet/performance/blob/29d9a3b7fcd7c8fc36e73170328175f182d04195/src/benchmarks/micro/libraries/System.Collections/Contains/ContainsFalse.cs#L34-L35

https://github.com/dotnet/performance/blob/29d9a3b7fcd7c8fc36e73170328175f182d04195/src/benchmarks/micro/libraries/System.Collections/Contains/ContainsTrue.cs#L35-L36

https://github.com/dotnet/performance/blob/d7dac8a7ca12a28d099192f8a901cf8e30361384/src/benchmarks/micro/libraries/System.Memory/Span.cs#L21-L22

In the following way:

- [Params(Utils.DefaultCollectionSize)]
+ [Params(Utils.DefaultCollectionSize, 7)]

alexcovington
Contributor Author

Sorry for the delay, @adamsitnik.

> The benchmark numbers look great. Is there any chance you could run them for smaller collections as well?

Absolutely, I've added the following size parameters to the tests you suggested:

[Params(7, 25, Utils.DefaultCollectionSize)]

Here's the comparison:

summary:
better: 26, geomean: 1.814
worse: 6, geomean: 1.129
total diff: 32

| Slower                                                        | diff/base | Base Median (ns) | Diff Median (ns) | Modality|
| ------------------------------------------------------------- | ---------:| ----------------:| ----------------:| -------- |
| System.Collections.ContainsTrue<Int32>.Span(Size: 7)          |      1.31 |            22.65 |            29.67 | bimodal |
| System.Collections.ContainsTrue<Int32>.List(Size: 7)          |      1.16 |            39.03 |            45.28 | several?|
| System.Collections.ContainsFalse<Int32>.Array(Size: 7)        |      1.12 |            50.77 |            56.82 |         |
| System.Collections.ContainsFalse<Int32>.Queue(Size: 7)        |      1.09 |            81.53 |            88.55 |         |
| System.Collections.ContainsTrue<Int32>.ImmutableList(Size: 7) |      1.06 |           110.52 |           117.53 |         |
| System.Collections.ContainsFalse<Int32>.HashSet(Size: 25)     |      1.06 |           106.13 |           112.16 | several?|

| Faster                                                               | base/diff | Base Median (ns) | Diff Median (ns) | Modality|
| -------------------------------------------------------------------- | ---------:| ----------------:| ----------------:| -------- |
| System.Collections.ContainsFalse<Int32>.Span(Size: 512)              |      3.65 |         50605.50 |         13847.76 |         |
| System.Collections.ContainsFalse<Int32>.List(Size: 512)              |      3.39 |         48622.78 |         14347.67 | several?|
| System.Collections.ContainsFalse<Int32>.Array(Size: 512)             |      3.30 |         49393.34 |         14984.60 | several?|
| System.Collections.ContainsFalse<Int32>.ImmutableArray(Size: 512)    |      3.10 |         51001.78 |         16456.27 |         |
| System.Memory.Span<Int32>.IndexOfValue(Size: 512)                    |      3.05 |            46.63 |            15.31 |         |
| System.Collections.ContainsFalse<Int32>.Queue(Size: 512)             |      3.01 |         50827.41 |         16865.51 | several?|
| System.Collections.ContainsFalse<Int32>.ICollection(Size: 512)       |      3.00 |         44683.99 |         14892.82 |         |
| System.Collections.ContainsTrue<Int32>.Span(Size: 512)               |      2.22 |         32125.17 |         14463.48 |         |
| System.Collections.ContainsTrue<Int32>.List(Size: 512)               |      2.22 |         33579.71 |         15130.16 |         |
| System.Collections.ContainsTrue<Int32>.Queue(Size: 512)              |      2.18 |         34733.56 |         15936.95 |         |
| System.Collections.ContainsTrue<Int32>.ICollection(Size: 512)        |      2.17 |         34269.35 |         15763.35 |         |
| System.Collections.ContainsTrue<Int32>.Array(Size: 512)              |      2.17 |         34494.07 |         15868.69 |         |
| System.Collections.ContainsTrue<Int32>.ImmutableArray(Size: 512)     |      2.08 |         35624.17 |         17141.36 |         |
| System.Collections.ContainsFalse<Int32>.Span(Size: 25)               |      1.62 |           150.01 |            92.59 | several?|
| System.Collections.ContainsTrue<Int32>.List(Size: 25)                |      1.29 |           198.87 |           154.60 |         |
| System.Collections.ContainsFalse<Int32>.List(Size: 25)               |      1.28 |           188.51 |           146.71 |         |
| System.Collections.ContainsTrue<Int32>.Queue(Size: 25)               |      1.26 |           257.97 |           205.55 |         |
| System.Collections.ContainsTrue<Int32>.Array(Size: 25)               |      1.25 |           246.52 |           196.65 |         |
| System.Collections.ContainsFalse<Int32>.ImmutableArray(Size: 25)     |      1.24 |           308.03 |           248.96 |         |
| System.Collections.ContainsTrue<Int32>.Span(Size: 25)                |      1.23 |           135.52 |           110.20 |         |
| System.Collections.ContainsFalse<Int32>.Array(Size: 25)              |      1.22 |           228.54 |           187.44 |         |
| System.Collections.ContainsFalse<Int32>.ICollection(Size: 25)        |      1.17 |           211.66 |           180.44 | several?|
| System.Collections.ContainsTrue<Int32>.ICollection(Size: 25)         |      1.17 |           234.60 |           200.38 |         |
| System.Collections.ContainsTrue<Int32>.ImmutableArray(Size: 25)      |      1.17 |           304.97 |           261.20 |         |
| System.Collections.ContainsTrue<Int32>.ImmutableSortedSet(Size: 512) |      1.07 |         15123.96 |         14125.02 | bimodal |
| System.Collections.ContainsFalse<Int32>.Queue(Size: 25)              |      1.07 |           341.48 |           319.70 |         |

So there is a slight regression for smaller buffers that can't be vectorized. It looks like the extra few nanoseconds are due to the extra conditional check that needs to be performed. PerfView shows more samples being collected around one of the checks to see if the type of the buffer is a value type:

[PerfView screenshots: vectorize-array-indexof-baseline-perfview (baseline) and vectorize-array-indexof-diff-perfview (diff)]

Once the buffer is large enough to be vectorized, the extra cost of the conditional is made up for by the perf increase of the vectorized path.

Please let me know if I can add any additional information or run any other scenarios.

adamsitnik
Member

@alexcovington thank you for providing the benchmark numbers. LGTM, :shipit:

adamsitnik merged commit 3745e38 into dotnet:main on Nov 22, 2021

danmoseley
Member

Nice improvements, thanks @alexcovington!

Do you plan to do more vectorization type work?

AndyAyersMS
Member

AndyAyersMS commented Nov 23, 2021

alexcovington
Contributor Author

Sorry for the delayed response, @danmoseley.

> Do you plan to do more vectorization type work?

Yes, we're always interested in potential vector/SIMD optimization opportunities. Let me know if we can assist with similar work!
