-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Vectorize {Last}IndexOfAny{Except} for ASCII needles #76740
Conversation
Tagging subscribers to this area: @dotnet/area-system-memory Issue DetailsFixes #68328 (and contributes the workhorse implementation for an eventual dedicated API) This PR adds a vectorized path for Similarly to the For inputs with long runs of no matches, the vectorized path has 10-30x the throughput. X86 numbersThese numbers were collected on a Windows Azure VM running on an Intel Xeon 8370C processor.
On X86 we have to do a bit more work if the needle contains a zero:
Approximate ARM64 numbers:
|
Excellent, thanks for working on this.
Have we investigated doing something different for short haystacks? e.g. something super simple like: if (haystack.Length < Vector128<short>.Count)
{
for (int i = 0; i < haystack.Length; i++)
if (needle.Contains(haystack[i]))
return i;
return -1;
}
... ? |
That is what we do for the haystack.Length >= 8 ? Vectorized() : SimpleLoop() whereas haystack.Length >= 8 ? Vectorized() : ProbabilisticMap() In places where we were already using the It would likely be beneficial to tweak the exact cutoff and add the int IndexOfAny()
{
if (haystack.Length < Max(8, needle.Length / 2))
return SimpleLoop();
if (IsAscii(needle))
return Vectorized();
return ProbabilisticMap();
}
int IndexOfAnyExcept()
{
if (haystack.Length >= Max(8, needle.Length / 2) && IsAscii(needle))
return Vectorized();
return SimpleLoop();
} I can look into what sort of numbers we'd see with something like that, though I would hope that in general, a dedicated API to hide the init cost completely would be the preferred approach. |
Right, this is the main thing I was asking about. The probabilistic map path today needs to loop through each char individually, and do additional work for each. My gut would be that, other than for obscenely long needles, you could have a reasonably-sized haystack and still win with the simple loop doing a vectorized contains on the needles. |
Certainly for cases where you're going to be invoking something repeatedly and have the foresight to create and cache the preprocessed vector information. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍🏻
Left some notes.
I expect these APIs should work for byte-inputs too, so with another static abstract interface (probably my favorite C# 11 feature) this can be done without duplication. I'm re-writing the base64 code in my repo using Char- and Byte-Operations for this* -- i.e. reading two char/short-vectors and combining them into one byte vector for further processing. Something similar could be done here too.
* it's still WIP, so pack signed vs. unsigned isn't done there
src/libraries/System.Private.CoreLib/src/System/IndexOfAnyAsciiSearcher.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/IndexOfAnyAsciiSearcher.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/IndexOfAnyAsciiSearcher.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/IndexOfAnyAsciiSearcher.cs
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/IndexOfAnyAsciiSearcher.cs
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/IndexOfAnyAsciiSearcher.cs
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/IndexOfAnyAsciiSearcher.cs
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/IndexOfAnyAsciiSearcher.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/IndexOfAnyAsciiSearcher.cs
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/IndexOfAnyAsciiSearcher.cs
Outdated
Show resolved
Hide resolved
I added a fast path for short haystacks to the probabilistic code path, leading to nice improvements there (I updated the numbers in the top post). E.g.
My benchmark for
|
The cutoff for the simple loop is now: searchSpaceLength < Vector128<short>.Count || (searchSpaceLength < 20 && searchSpaceLength < (valuesLength >> 1)) Updated the benchmarks above, this is now a win pretty much across the board. |
/azp run runtime |
Azure Pipelines successfully started running 1 pipeline(s). |
src/libraries/System.Private.CoreLib/src/System/IndexOfAnyAsciiSearcher.cs
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/IndexOfAnyAsciiSearcher.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/IndexOfAnyAsciiSearcher.cs
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/MemoryExtensions.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/IndexOfAnyAsciiSearcher.cs
Outdated
Show resolved
Hide resolved
Curious to see what this does to regex redux https://benchmarksgame-team.pages.debian.net/benchmarksgame/program/regexredux-csharpcore-5.html |
It won't do anything. We don't use this API from RegexCompiler / source generator (or, rather, we rely on the API immediately delegating to the 4/5-char overloads), nor does regex redux have any sets that would trigger it. We will update RegexCompiler / source generator to use whatever new API we create that plugs into the same implementation this is adding but that let's us precompute the vector/bit set rather than having to do it on each call. |
/azp run runtime |
Azure Pipelines successfully started running 1 pipeline(s). |
72ae58c
to
2ffdaf6
Compare
src/libraries/System.Private.CoreLib/src/System/ProbabilisticMap.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/ProbabilisticMap.cs
Outdated
Show resolved
Hide resolved
|
||
Span<char> needleSpace = stackalloc char[8]; | ||
Span<char> haystackSpace = stackalloc char[40]; | ||
var rng = new Random(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To make this reproducable, we should include a seed here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I remember correctly, I removed it because the RNG part represents a significant portion of the test execution time, with the non-explicit-seed ctor using the faster impl AFAIK.
For reproducibility, I made sure to emit the exact inputs that failed as part of the error message. Does that address your concerns here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does that address your concerns here?
Only partially. There are two concerns:
- That a test could fail and you not know why or be able to trigger it again.
- That two runs of the same test suite might be non-deterministic.
Outputting the exact inputs addresses (1) but not (2). And (2) in general contributes to test flakiness.
If this test is about randomly stressing the implementation, then I don't believe it belongs in the functional test suite. If it's about using pseudo-randomness to try out thousands of inputs without having to manually code them all and be creative enough to ensure a reasonable spread of inputs, then it should be done in a way that runs the same tests every time, which means using a seed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed it to waste less time in Random and use a constant seed.
It was extremely useful when working on the initial implementation to get rid of all the edge-case bugs. Some were so odd that I don't believe we would have come up with test cases for otherwise.
I increased the number of iterations so it now takes about a second of CPU time to run through them, so I moved these to outerloop.
Outerloop already takes ~4 minutes on my machine so I'm gonna assume that's acceptable.
src/libraries/System.Private.CoreLib/src/System/IndexOfAnyAsciiSearcher.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/IndexOfAnyAsciiSearcher.cs
Show resolved
Hide resolved
@@ -1791,8 +1779,9 @@ private static unsafe int IndexOfAnyProbabilistic(ref char searchSpace, int sear | |||
/// </summary> | |||
/// <param name="span">The span to search.</param> | |||
/// <param name="values">The set of values to search for.</param> | |||
public static int LastIndexOfAny<T>(this Span<T> span, ReadOnlySpan<T> values) where T : IEquatable<T>? | |||
=> LastIndexOfAny((ReadOnlySpan<T>)span, values); | |||
[MethodImpl(MethodImplOptions.AggressiveInlining)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was necessary? Or just added for consistency with something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was removed by mistake in #75754 (comment)
I added it back to at least make it consistent with the rest of {Last}IndexOfAny
overloads.
Meaning with this PR, for long or short inputs, vectorized or not, ASCII or not, everything you've tried is as good or better than it was before? If so, yay! |
Yes*. The main scenario I can think of that would regress is if the needle contains non-ASCII, but starts with ASCII (e.g. For short inputs that still fall back to a simple for loop (short input + large needle), there is a slightly higher per-call overhead to do the extra calls/checks. E.g.
There can also be slight differences due to a change of "naive for loop" |
473bfff
to
e9ded14
Compare
/azp run runtime-libraries-coreclr outerloop |
Azure Pipelines successfully started running 1 pipeline(s). |
Contributes to #68328 (and contributes the workhorse implementation for an eventual dedicated API)
cc: @gfoidl
This PR adds a vectorized path for
IndexOfAny
-like methods if the input is at least 8 characters long and the needle is only ASCII.It is used if
Ssse3
orAdvSimd.Arm64
are supported.I also unified all
(Last)IndexOfAny(Except)
methods to use the same approach ofleading to perf improvements for both short and long inputs.
For inputs with long runs of no matches, the vectorized path has 10 to 140+ times the throughput on realistic inputs.
As the
-Except
methods were previouslyO(n * m)
, you can of course see arbitrarily large improvements.X86 numbers
These numbers were collected on a Windows Azure VM running on an Intel Xeon 8370C processor.
On X86 we have to do a bit more work if the needle contains a zero:
Approximate ARM64 numbers: