-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Vectorized common String.Split() paths #38001
Conversation
src/libraries/System.Private.CoreLib/src/System/String.Manipulation.cs
Outdated
Show resolved
Hide resolved
@tannergooding Do you have any thoughts on the use of |
@bbartels Can you provide the actual benchmark tests? I appreciate you posting the numbers (and calling out regressions!), but we should also confirm that the tests represent realistic inputs. |
While the tests don't represent realistic inputs, I've specifically tested performance of increasing seperators frequency. The SepFreq in the tests mean how often a separator appears in the input sequence, so SepFreq=1 means every character represents a separator, SepFreq=2: every second char etc. Here is the Benchmark file: https://gist.github.com/bbartels/0e45fd3977067ce013bd0b28d6548b4e |
src/libraries/System.Private.CoreLib/src/System/String.Manipulation.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/String.Manipulation.cs
Outdated
Show resolved
Hide resolved
Vector256<ushort> charVector = Unsafe.As<char, Vector256<ushort>>(ref ci); | ||
Vector256<ushort> cmp = Avx2.CompareEqual(charVector, v1); | ||
|
||
if (v2 is Vector256<ushort> vecSep2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thought on eliminating these branches (didn't validate it, don't know if it's worth it).
Instead of checking v2 is Vector
in each iteration, one could set v2
to a dummy-value, so that iif c2
is null the result of Avx2.Or(Avx2.CompareEqual(charVector, vecSep2), cmp)
won't be changed.
A such dummy-value is Vector256<byte>.Zero
.
With this value it won't work if the input is all zero though, but otherwise the two branches (in the loop) are eliminated resulting in smaller loop-bodies.
For v3
analogous.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, good idea! Instead of setting it to Vector256<byte>.Zero
I could try with the inverse of the other separator char, this way it would even work when c == '\0'
. I'll give it a benchmark and see if it makes a difference!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, upon benchmarking this seems to be significantly slower (-20%) in cases, where there is only a single separator char (which I feel like is probably the most common scenario). I'm leaving this open for now if someone has some other ideas about eliminating the branches.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
significantly slower (-20%) in cases, where there is only a single separator char
TBH I can't believe this. Can you show the code?
The branch-predictor does a good job, but such an decrease with branchless operations that can be executed in parallel (instruction level parallelism) is not what I expected.
the most common scenario
I believe so too. Maybe it is worth it to special case this (when the branchless variant proves to be definitely slower 😉).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unless I am misunderstanding something, were you suggesting the following:
if (v2 is Vector256<ushort> vecSep2)
{
cmp = Avx2.Or(Avx2.CompareEqual(vi, vecSep2), cmp);
}
if (v3 is Vector256<ushort> vecSep3)
{
cmp = Avx2.Or(Avx2.CompareEqual(vi, vecSep3), cmp);
}
to
cmp = Avx2.Or(Avx2.CompareEqual(vi, vecSep2), cmp);
cmp = Avx2.Or(Avx2.CompareEqual(vi, vecSep3), cmp);
where vecSep2/3 are defaulted to something that wouldn't interfere with the existing cmp value if only one separator was defined?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks quite good (expecte the stack-spills I'm seeing / investigating).
If \0
isn't a concern, so the default values of the args c{2,3}
could be set to \0
to avoid the branches at creating the vectors. This could a have a positive effect on smaller inputs.
With your idea to setting it to v1
you're on the safe side though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of having nullable arguments, just set the dummy-values in
runtime/src/libraries/System.Private.CoreLib/src/System/String.Manipulation.cs
Lines 1577 to 1578 in 140c2ce
case 1: | |
sep0 = separators[0]; |
as there it's known. You know what I mean?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As in calling it like MakeSeparatorListVectorized(ref sepListBuilder, sep0, sep0, sep0)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep.
// ...
MakeSeparatorListVectorized(ref sepListBuilder, sep0, sep0, sep0);
// ...
MakeSeparatorListVectorized(ref sepListBuilder, sep0, sep1, sep0);
// ...
MakeSeparatorListVectorized(ref sepListBuilder, sep0, sep1, sep2);
Then the checks in MakeSeparatorListVectorized
for null
(default arg) can be eliminated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done 2b0b8f8
src/libraries/System.Private.CoreLib/src/System/String.Manipulation.cs
Outdated
Show resolved
Hide resolved
…ation.cs Co-authored-by: Günther Foidl <gue@korporal.at>
…ation.cs Co-authored-by: Günther Foidl <gue@korporal.at>
I've also done some thought experiments how this could possibly be applied to #934. Because of the way the SpanSplitEnumerator is being implemented it restricts the ability to improve performance by finding multiple indices at once. This could significantly help in perf for input with a high separator density. The only solutions I could think of is having a stackallocated buffer that is being passed to the Enumerator. This way it could use something similar to this PR to fill the buffer with indices and MoveNext() would use the buffer as a lookup first, before finding more indices. |
Given the severe performance difference on AMD my preference would be to not use it for the time being. Performing an |
@bbartels if you didn't already you might review coverage in https://github.com/dotnet/performance and see whether it has reasonable cases. These are the tests we use to sign off on libraries perf. |
@danmosemsft Ah good point, will do! I was a little stumped when trying to come up with a saturated set of common/realistic scenarios. |
Had a friend run some numbers on Zen and it is an order of magnitude slower than on Intel. So it is definitely not an option to use |
src/libraries/System.Private.CoreLib/src/System/String.Manipulation.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/String.Manipulation.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/String.Manipulation.cs
Outdated
Show resolved
Hide resolved
There isn't a mechanism to do this today and it likely isn't worthwhile since the perf difference between pext and the fallback is likely not that great on Intel. |
src/libraries/System.Private.CoreLib/src/System/String.Manipulation.cs
Outdated
Show resolved
Hide resolved
@tannergooding What about the use of AVX2 vs. normal SSE instructions in these code paths? In the UTF-8 transcoding code paths we avoided AVX2 because it didn't offer enough of a performance improvement over SSE when given realistic inputs. (I recorded only a ~5% max difference.) Avoiding AVX2 also allowed the code to work across a wider range of processors. |
I don't think we should provide AVX2 without providing SSE. Whether we provide AVX2 as an additional code path should largely come down to perf numbers vs complexity. |
BTW I wrote a benchmark which should be more realistic then the earlier frequency-based benchmark. It's a naive CSV parser, which is a very common use case for public class StringSplitRunner
{
private string[] _strings;
public IEnumerable<string> CorpusList()
{
yield return "https://www.nefsc.noaa.gov/drifter/drift_180351431.csv";
yield return "http://www.transparency.ri.gov/awards/awardsummary.csv";
yield return "https://www.census.gov/econ/bfs/csv/date_table.csv";
yield return "https://www.sba.gov/sites/default/files/aboutsbaarticle/FY16_SBA_RAW_DATA.csv";
yield return "https://www.epa.gov/sites/production/files/2014-05/tri_2012_nd.csv";
yield return "https://wfmi.nifc.gov/fire_reporting/annual_dataset_archive/1972-2010/_WFMI_Big_Files/BOR_1972-2010_Gis.csv";
}
[ParamsSource("CorpusList")]
public string CorpusUri { get; set; }
[GlobalSetup]
public void Setup()
{
_strings = GetStringsFromCorpus().GetAwaiter().GetResult();
}
private async Task<string[]> GetStringsFromCorpus()
{
var response = await new HttpClient().GetAsync(CorpusUri);
response.EnsureSuccessStatusCode();
var body = await response.Content.ReadAsStringAsync();
List<string> lines = new List<string>();
StringReader reader = new StringReader(body);
string line;
while ((line = reader.ReadLine()) != null)
{
lines.Add(line);
}
return lines.ToArray();
}
[Benchmark]
public string[] SplitCsv()
{
string[] split = null;
string[] lines = _strings;
for (int i = 0; i < lines.Length; i++)
{
split = lines[i].Split(',');
}
return split;
}
} BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19041.329 (2004/?/20H1)
Intel Core i7-6700 CPU 3.40GHz (Skylake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=5.0.100-preview.7.20313.3
Delta for these real-world benchmarks is +20% improvement through -7% regression. |
…ation.cs Co-authored-by: Günther Foidl <gue@korporal.at>
@tannergooding do you need any more updates or is this ready for you to review? |
{ | ||
// Redundant test so we won't prejit remainder of this method | ||
// on platforms without SSE. | ||
if (!Sse.IsSupported) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be Sse41
to match the highest use ISA in it.
Changes LGTM minus the one IsSupported check. |
Done, missed it needing to be more constrained that it was 😅 Hope everything is okay now! |
To rerun CI, just close and reopen! |
Looks like CI is happy 😄 |
@tannergooding @danmoseley Would this be ready to merge? |
Thanks for the reminder. @tannergooding is this ready to sign off? |
yay! thank you @bbartels for the contribution, and @tannergooding for the comprehensive review. I hope the long time this PR took will not put you off making another contribution. It is quite rare it takes this long, and we're now more actively tracking PR's internally, eg., we have a mailer for PR's that have not moved for a week or two. |
* upstream/main: (568 commits) [wasm] Set __DistroRid on Windows to browser-wasm (dotnet#50842) [wasm] Fix order of include paths, to have the obj dir first (dotnet#50303) [wasm] Fix debug build of AOT cross compiler (dotnet#50418) Fix outdated comment (dotnet#50834) [wasm][tests] Add properties to allow passing args to xharness (dotnet#50678) Vectorized common String.Split() paths (dotnet#38001) Fix binplacing symbol files. (dotnet#50819) Move type check to after the null ref branch in out marshalling of blittable classes. (dotnet#50735) Remove extraneous CMake version requirement. (dotnet#50805) [wasm] Remove unncessary condition for EMSDK (dotnet#50810) Add loop alignment stats to JitLogCsv (dotnet#50624) Resolve ILLink warnings in System.Diagnostics.DiagnosticSource (dotnet#50265) Avoid unnecessary closures/delegates in Process (dotnet#50496) Fix for field layout verification across version bubble boundary (dotnet#50364) JIT: Enable CSE for VectorX.Create (dotnet#50644) [main] Update dependencies from mono/linker (dotnet#50779) [mono] More domain cleanup (dotnet#50771) Race condition in Mock reference tracker runtime with GC. (dotnet#50804) Remove IAssemblyName (and various fusion remnants) (dotnet#50755) Disable failing test for GCStress. (dotnet#50828) ...
Thank you @danmoseley, I always appreciate your kind words :) Also many thanks to @gfoidl for the awesome support during this PR! |
Great just ping one of us at that time if you would like help finding a project! |
This change introduced some regressions - #51259 |
This reverts commit 295bcdc.
This PR vectorizes the MakeSeparatorList codepath for 3 or less separators.
This implementation relies on the
pext
instruction. During my research I've found that this instruction seems to perform fairly poorly on AMD's Zen architecture, though I don't have such a processor myself, so I cannot test what kind of numbers this implementation would be getting.As this is my first attempt at vectorizing an algorithm I'm not sure if there is a better way to shift the indices without
pext
.Regressions were only noted when every character within the input string is a separator, in such cases this implementation incurs a 0-17% penalty depending on the situation (see benchmarks below). As this scenario is likely extremely rare I feel like the benefit outweigh these regressions.
Detailed Benchmarks can be found here: https://gist.github.com/bbartels/dc85e4946810dfe1c5c25169ddc5e00c
Notable Improvements:
Notable Regressions: