Vectorized common String.Split() paths #38001

Merged
merged 28 commits into from
Apr 7, 2021

Contributor

@bbartels bbartels commented Jun 17, 2020

This PR vectorizes the MakeSeparatorList codepath for three or fewer separators.

This implementation relies on the pext instruction. During my research I've found that this instruction seems to perform fairly poorly on AMD's Zen architecture, though I don't have such a processor myself, so I cannot test what kind of numbers this implementation would be getting.
As this is my first attempt at vectorizing an algorithm I'm not sure if there is a better way to shift the indices without pext.
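For readers following along, here is a minimal sketch of the general idea of finding several separator indices per iteration without pext. It is not the code in this PR: the class and method names are hypothetical, it handles a single separator only, and it uses 128-bit SSE2 with a scalar fallback. Because a byte-wise MoveMask of a 16-bit compare yields two bits per char, the sketch halves the bit position instead of compacting the mask.

```csharp
using System;
using System.Collections.Generic;
using System.Numerics;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class SeparatorScanSketch
{
    // Hypothetical helper: returns the index of every occurrence of `sep` in `input`.
    public static List<int> FindSeparators(ReadOnlySpan<char> input, char sep)
    {
        var indices = new List<int>();
        int i = 0;

        if (Sse2.IsSupported)
        {
            Vector128<ushort> sepVec = Vector128.Create((ushort)sep);

            // View the input as blocks of 8 chars; leftover chars are handled by the scalar loop.
            ReadOnlySpan<Vector128<ushort>> blocks =
                MemoryMarshal.Cast<char, Vector128<ushort>>(input);

            foreach (Vector128<ushort> block in blocks)
            {
                // Matching lanes become 0xFFFF, so each match sets two adjacent mask bits.
                Vector128<ushort> cmp = Sse2.CompareEqual(block, sepVec);
                uint mask = (uint)Sse2.MoveMask(cmp.AsByte());

                while (mask != 0)
                {
                    int bit = BitOperations.TrailingZeroCount(mask);
                    indices.Add(i + (bit >> 1));   // two mask bits per char
                    mask &= ~(0b11u << bit);       // clear both bits of this match
                }

                i += Vector128<ushort>.Count;
            }
        }

        // Scalar tail (and the whole input when SSE2 is unavailable).
        for (; i < input.Length; i++)
        {
            if (input[i] == sep)
            {
                indices.Add(i);
            }
        }

        return indices;
    }
}
```

Calling `SeparatorScanSketch.FindSeparators("ab,cd,ef,gh,ij,kl,mn", ',')` should yield the indices 2, 5, 8, 11, 14 and 17.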

Regressions were only noted when every character in the input string is a separator; in such cases this implementation incurs a 0-17% penalty depending on the situation (see benchmarks below). As this scenario is likely extremely rare, I feel the benefits outweigh these regressions.

Detailed Benchmarks can be found here: https://gist.github.com/bbartels/dc85e4946810dfe1c5c25169ddc5e00c

Notable Improvements:

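| Method | Ver | Size | SepFreq | SepCount | Mean | Ratio |
|--------|-----|------:|--------:|---------:|------------:|------:|
| Split | Old | 200 | 100 | 1 | 173.84 ns | 1.00 |
| Split | New | 200 | 100 | 1 | 98.80 ns | 0.57 |
| Split | Old | 200 | 100 | 3 | 217.55 ns | 1.00 |
| Split | New | 200 | 100 | 3 | 105.02 ns | 0.48 |
| Split | Old | 1000 | 100 | 3 | 994.25 ns | 1.00 |
| Split | New | 1000 | 100 | 3 | 377.05 ns | 0.38 |
| Split | Old | 10000 | 100 | 1 | 7,461.19 ns | 1.00 |
| Split | New | 10000 | 100 | 1 | 3,777.11 ns | 0.51 |
| Split | Old | 10000 | 2000 | 3 | 8,196.41 ns | 1.00 |
| Split | New | 10000 | 2000 | 3 | 2,320.83 ns | 0.28 |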

Notable Regressions:

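| Method | Ver | Size | SepFreq | SepCount | Mean | Ratio |
|--------|-----|------:|--------:|---------:|-------------:|------:|
| Split | Old | 16 | 1 | 1 | 114.78 ns | 1.00 |
| Split | New | 16 | 1 | 1 | 124.41 ns | 1.08 |
| Split | Old | 1000 | 1 | 1 | 5,771.31 ns | 1.00 |
| Split | New | 1000 | 1 | 1 | 6,728.70 ns | 1.17 |
| Split | Old | 10000 | 1 | 3 | 56,995.57 ns | 1.00 |
| Split | New | 10000 | 1 | 3 | 63,264.99 ns | 1.11 |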

@GrabYourPitchforks
Member

@tannergooding Do you have any thoughts on the use of pext in the runtime? We used it in the UTF-8 code paths in the 3.0 release but eventually backed it out due to the substantial regression on AMD. See #2251 for context.

@GrabYourPitchforks
Member

@bbartels Can you provide the actual benchmark tests? I appreciate you posting the numbers (and calling out regressions!), but we should also confirm that the tests represent realistic inputs.

@bbartels
Contributor Author

bbartels commented Jun 17, 2020

> @bbartels Can you provide the actual benchmark tests? I appreciate you posting the numbers (and calling out regressions!), but we should also confirm that the tests represent realistic inputs.

While the tests don't represent realistic inputs, I've specifically tested performance with increasing separator frequency. SepFreq in the tests is the interval at which a separator appears in the input sequence: SepFreq=1 means every character is a separator, SepFreq=2 means every second character, and so on.
Starting with SepFreq=2, performance is on par with the old implementation.
The only thing I could see violating this is input with the same average frequency but a different distribution.
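To make the SepFreq parameter concrete, here is a small sketch of how such an input could be generated. The linked benchmark gist may build its inputs differently; `SplitInputSketch.Build`, the filler character `'a'`, and the round-robin use of multiple separators are assumptions made for illustration.

```csharp
using System;

public static class SplitInputSketch
{
    // Hypothetical generator: every `sepFreq`-th character is a separator,
    // all other characters are the filler 'a'.
    public static string Build(int size, int sepFreq, string separators)
    {
        if (sepFreq < 1)
        {
            throw new ArgumentOutOfRangeException(nameof(sepFreq));
        }
        if (string.IsNullOrEmpty(separators))
        {
            throw new ArgumentException("At least one separator is required.", nameof(separators));
        }

        var chars = new char[size];
        int separatorsWritten = 0;

        for (int i = 0; i < size; i++)
        {
            bool isSeparator = (i + 1) % sepFreq == 0;

            // Cycle through the available separators (assumed behaviour for SepCount > 1).
            chars[i] = isSeparator
                ? separators[separatorsWritten++ % separators.Length]
                : 'a';
        }

        return new string(chars);
    }
}
```

With this sketch, `Build(6, 2, ",")` gives `"a,a,a,"` and `Build(4, 1, ",;")` gives `",;,;"`, the worst case discussed above where every character is a separator.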

Here is the Benchmark file: https://gist.github.com/bbartels/0e45fd3977067ce013bd0b28d6548b4e

Vector256<ushort> charVector = Unsafe.As<char, Vector256<ushort>>(ref ci);
Vector256<ushort> cmp = Avx2.CompareEqual(charVector, v1);

if (v2 is Vector256<ushort> vecSep2)
Member

Thought on eliminating these branches (didn't validate it, don't know if it's worth it).

Instead of checking v2 is Vector in each iteration, one could set v2 to a dummy value, so that if c2 is null the result of Avx2.Or(Avx2.CompareEqual(charVector, vecSep2), cmp) is unchanged.

One such dummy value is Vector256<byte>.Zero.
It won't work if the input contains '\0' characters, since those would then be reported as separators, but otherwise the two branches in the loop are eliminated, resulting in smaller loop bodies.

The same applies to v3.

Contributor Author

Hmm, good idea! Instead of setting it to Vector256<byte>.Zero I could try with the inverse of the other separator char, this way it would even work when c == '\0'. I'll give it a benchmark and see if it makes a difference!

Contributor Author

Okay, upon benchmarking, this seems to be significantly slower (-20%) in cases where there is only a single separator char (which I feel is probably the most common scenario). I'm leaving this open for now in case someone has other ideas for eliminating the branches.

Member

> significantly slower (-20%) in cases, where there is only a single separator char

TBH I can't believe this. Can you show the code?
The branch predictor does a good job, but such a decrease with branchless operations that can be executed in parallel (instruction-level parallelism) is not what I expected.

> the most common scenario

I believe so too. Maybe it is worth special-casing this (if the branchless variant proves to be definitely slower 😉).

Contributor Author

Unless I am misunderstanding something, were you suggesting the following:

if (v2 is Vector256<ushort> vecSep2)
{
    cmp = Avx2.Or(Avx2.CompareEqual(vi, vecSep2), cmp);
}

if (v3 is Vector256<ushort> vecSep3)
{
    cmp = Avx2.Or(Avx2.CompareEqual(vi, vecSep3), cmp);
}

to

cmp = Avx2.Or(Avx2.CompareEqual(vi, vecSep2), cmp);
cmp = Avx2.Or(Avx2.CompareEqual(vi, vecSep3), cmp);

where vecSep2/3 are defaulted to something that wouldn't interfere with the existing cmp value if only one separator was defined?

Member

Looks quite good (except for the stack spills I'm seeing and investigating).

If \0 isn't a concern, the default values of the args c{2,3} could be set to \0 to avoid the branches when creating the vectors. This could have a positive effect on smaller inputs.

With your idea of setting it to v1 you're on the safe side, though.

Member

Instead of having nullable arguments, just set the dummy-values in


as there it's known. You know what I mean?

Contributor Author

@bbartels bbartels Jun 18, 2020

As in calling it like MakeSeparatorListVectorized(ref sepListBuilder, sep0, sep0, sep0) ?

Member

Yep.

// ...
MakeSeparatorListVectorized(ref sepListBuilder, sep0, sep0, sep0);
// ...
MakeSeparatorListVectorized(ref sepListBuilder, sep0, sep1, sep0);
// ...
MakeSeparatorListVectorized(ref sepListBuilder, sep0, sep1, sep2);

Then the checks in MakeSeparatorListVectorized for null (default arg) can be eliminated.
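A minimal sketch of why passing sep0 for the unused slots is safe: OR-ing a compare result with itself changes nothing, so the loop body can always run all three compares without per-iteration checks. This is an illustration rather than the PR's MakeSeparatorListVectorized; the `CountSeparators` helper is hypothetical, uses 128-bit SSE2 instead of AVX2, and only counts matches.

```csharp
using System;
using System.Numerics;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class BranchlessCompareSketch
{
    // Hypothetical helper: counts chars equal to c0, c1 or c2.
    // Callers with fewer separators repeat c0, e.g. CountSeparators(s, ',', ',', ',').
    public static int CountSeparators(ReadOnlySpan<char> input, char c0, char c1, char c2)
    {
        int count = 0;
        int i = 0;

        if (Sse2.IsSupported)
        {
            Vector128<ushort> v0 = Vector128.Create((ushort)c0);
            Vector128<ushort> v1 = Vector128.Create((ushort)c1);
            Vector128<ushort> v2 = Vector128.Create((ushort)c2);

            ReadOnlySpan<Vector128<ushort>> blocks =
                MemoryMarshal.Cast<char, Vector128<ushort>>(input);

            foreach (Vector128<ushort> block in blocks)
            {
                // No null checks: duplicated separators simply repeat the same compare.
                Vector128<ushort> cmp = Sse2.CompareEqual(block, v0);
                cmp = Sse2.Or(cmp, Sse2.CompareEqual(block, v1));
                cmp = Sse2.Or(cmp, Sse2.CompareEqual(block, v2));

                // Two mask bits are set per matching char.
                count += BitOperations.PopCount((uint)Sse2.MoveMask(cmp.AsByte())) / 2;
            }

            i = blocks.Length * Vector128<ushort>.Count;
        }

        // Scalar tail (and full fallback without SSE2).
        for (; i < input.Length; i++)
        {
            char c = input[i];
            if (c == c0 || c == c1 || c == c2)
            {
                count++;
            }
        }

        return count;
    }
}
```

For the input `"a,b;c,d;e,f;g,h;i,j"`, the call with `(',', ',', ',')` counts the 5 commas, while `(',', ';', ',')` counts all 9 separators.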

Contributor Author

Done 2b0b8f8

bbartels and others added 2 commits June 17, 2020 11:17
…ation.cs

Co-authored-by: Günther Foidl <gue@korporal.at>
…ation.cs

Co-authored-by: Günther Foidl <gue@korporal.at>
@bbartels
Contributor Author

bbartels commented Jun 17, 2020

I've also done some thought experiments on how this could be applied to #934. The way the SpanSplitEnumerator is implemented restricts the ability to improve performance by finding multiple indices at once.

Finding several indices at once could significantly help perf for input with a high separator density. The only solution I could think of is a stack-allocated buffer that is passed to the enumerator. This way it could use something similar to this PR to fill the buffer with indices, and MoveNext() would use the buffer as a lookup first, before finding more indices.
It would be nice if the SpanSplitEnumerator could handle such buffer allocation itself, so I did some research, but it seems like ref structs can't allocate a dynamic or even fixed-size in-place buffer (dotnet/roslyn#35658). Maybe this could be a consideration for discussion when #934 is re-reviewed?
/cc @GrabYourPitchforks
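The buffered-enumerator idea can be sketched roughly as follows. This is not the SpanSplitEnumerator proposed in #934: the type name `BufferedSplitEnumerator`, its constructor shape, and the scalar refill loop are all assumptions for illustration. A real version would presumably refill the buffer with a vectorized scan.

```csharp
using System;

// Hypothetical enumerator: the caller supplies scratch space for separator indices.
public ref struct BufferedSplitEnumerator
{
    private readonly ReadOnlySpan<char> _input;
    private readonly char _separator;
    private readonly Span<int> _indices; // caller-provided buffer, e.g. stackalloc'ed
    private int _count;        // number of valid entries in _indices
    private int _next;         // next buffered entry to consume
    private int _scanPos;      // where the next refill resumes scanning
    private int _segmentStart; // start of the segment to return next
    private bool _finished;

    public BufferedSplitEnumerator(ReadOnlySpan<char> input, char separator, Span<int> indexBuffer)
    {
        if (indexBuffer.IsEmpty)
        {
            throw new ArgumentException("Index buffer must not be empty.", nameof(indexBuffer));
        }

        _input = input;
        _separator = separator;
        _indices = indexBuffer;
        _count = 0;
        _next = 0;
        _scanPos = 0;
        _segmentStart = 0;
        _finished = false;
        Current = default;
    }

    public ReadOnlySpan<char> Current { get; private set; }

    public BufferedSplitEnumerator GetEnumerator() => this;

    public bool MoveNext()
    {
        if (_finished)
        {
            return false;
        }

        // Consult the buffer first; only scan for more indices once it is drained.
        if (_next == _count)
        {
            Refill();
        }

        if (_next < _count)
        {
            int separatorIndex = _indices[_next++];
            Current = _input.Slice(_segmentStart, separatorIndex - _segmentStart);
            _segmentStart = separatorIndex + 1;
            return true;
        }

        // No separators left: return the final segment once.
        Current = _input.Slice(_segmentStart);
        _finished = true;
        return true;
    }

    // Scalar stand-in for a vectorized scan that finds many indices per call.
    private void Refill()
    {
        _count = 0;
        _next = 0;
        while (_scanPos < _input.Length && _count < _indices.Length)
        {
            if (_input[_scanPos] == _separator)
            {
                _indices[_count++] = _scanPos;
            }
            _scanPos++;
        }
    }
}
```

A caller would allocate the scratch space itself, for example `Span<int> buffer = stackalloc int[16];`, and then iterate with `foreach (ReadOnlySpan<char> part in new BufferedSplitEnumerator(text, ',', buffer))`. Requiring the caller to own the buffer is exactly the ergonomic cost the comment above is concerned about.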

@tannergooding
Member

> Do you have any thoughts on the use of pext in the runtime?

Given the severe performance difference on AMD, my preference would be to not use it for the time being. Performing a shift + and seems like a sufficient alternative, especially when the mask is known in advance.
Doing this will also make it easier to port to ARM64 where there may not be a clear equivalent to PDEP and PEXT.
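The thread does not spell out which shift + and sequence was meant, so the following is only one possible reading. When the pext mask is the fixed constant 0x55555555 (keep every second bit, as one would to reduce a byte-wise movemask of a 16-bit compare to one bit per char), the extraction can be written with plain shifts, ORs and ANDs. The `CompactEvenBits` name is hypothetical.

```csharp
using System;

public static class PextFallbackSketch
{
    // Software equivalent of pext(x, 0x55555555):
    // gathers bits 0, 2, 4, ... of x into the low 16 bits of the result.
    public static uint CompactEvenBits(uint x)
    {
        x &= 0x55555555u;                  // keep only the even-positioned bits
        x = (x | (x >> 1)) & 0x33333333u;  // pack pairs of kept bits together
        x = (x | (x >> 2)) & 0x0F0F0F0Fu;  // pack into nibbles
        x = (x | (x >> 4)) & 0x00FF00FFu;  // pack into bytes
        x = (x | (x >> 8)) & 0x0000FFFFu;  // pack into the low 16 bits
        return x;
    }
}
```

For example, `CompactEvenBits(0xCF)` returns `0xB`: the even bits of `1100_1111` are 1, 1, 0, 1 from least to most significant. Because it uses only ordinary integer operations, the same code would also run unchanged on ARM64.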

@danmoseley
Member

@bbartels if you didn't already you might review coverage in https://github.com/dotnet/performance and see whether it has reasonable cases. These are the tests we use to sign off on libraries perf.

@bbartels
Contributor Author

@danmosemsft Ah good point, will do! I was a little stumped when trying to come up with a saturated set of common/realistic scenarios.

@bbartels
Contributor Author

> > Do you have any thoughts on the use of pext in the runtime?
>
> Given the severe performance difference on AMD, my preference would be to not use it for the time being. Performing a shift + and seems like a sufficient alternative, especially when the mask is known in advance.
> Doing this will also make it easier to port to ARM64 where there may not be a clear equivalent to PDEP and PEXT.

Had a friend run some numbers on Zen and it is an order of magnitude slower than on Intel. So it is definitely not an option to use pext on AMD. Is there potentially a way to run different code based on whether it is an Intel or AMD processor?

@tannergooding
Member

> Is there potentially a way to run different code based on whether it is an Intel or Amd processor?

There isn't a mechanism to do this today and it likely isn't worthwhile since the perf difference between pext and the fallback is likely not that great on Intel.

@GrabYourPitchforks
Member

@tannergooding What about the use of AVX2 vs. normal SSE instructions in these code paths? In the UTF-8 transcoding code paths we avoided AVX2 because it didn't offer enough of a performance improvement over SSE when given realistic inputs. (I recorded only a ~5% max difference.) Avoiding AVX2 also allowed the code to work across a wider range of processors.

@tannergooding
Member

I don't think we should provide AVX2 without providing SSE.

Whether we provide AVX2 as an additional code path should largely come down to perf numbers vs complexity.

@GrabYourPitchforks
Member

BTW I wrote a benchmark which should be more realistic than the earlier frequency-based benchmark. It's a naive CSV parser, which is a very common use case for string.Split.

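    using System.Collections.Generic;
    using System.IO;
    using System.Net.Http;
    using System.Threading.Tasks;
    using BenchmarkDotNet.Attributes;

    public class StringSplitRunner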
    {
        private string[] _strings;

        public IEnumerable<string> CorpusList()
        {
            yield return "https://www.nefsc.noaa.gov/drifter/drift_180351431.csv";
            yield return "http://www.transparency.ri.gov/awards/awardsummary.csv";
            yield return "https://www.census.gov/econ/bfs/csv/date_table.csv";
            yield return "https://www.sba.gov/sites/default/files/aboutsbaarticle/FY16_SBA_RAW_DATA.csv";
            yield return "https://www.epa.gov/sites/production/files/2014-05/tri_2012_nd.csv";
            yield return "https://wfmi.nifc.gov/fire_reporting/annual_dataset_archive/1972-2010/_WFMI_Big_Files/BOR_1972-2010_Gis.csv";
        }

        [ParamsSource("CorpusList")]
        public string CorpusUri { get; set; }

        [GlobalSetup]
        public void Setup()
        {
            _strings = GetStringsFromCorpus().GetAwaiter().GetResult();
        }

        private async Task<string[]> GetStringsFromCorpus()
        {
            var response = await new HttpClient().GetAsync(CorpusUri);
            response.EnsureSuccessStatusCode();

            var body = await response.Content.ReadAsStringAsync();

            List<string> lines = new List<string>();

            StringReader reader = new StringReader(body);
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                lines.Add(line);
            }

            return lines.ToArray();
        }

        [Benchmark]
        public string[] SplitCsv()
        {
            string[] split = null;
            string[] lines = _strings;
            for (int i = 0; i < lines.Length; i++)
            {
                split = lines[i].Split(',');
            }
            return split;
        }
    }

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19041.329 (2004/?/20H1)
Intel Core i7-6700 CPU 3.40GHz (Skylake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=5.0.100-preview.7.20313.3

| Method | Job | Toolchain | CorpusUri | Mean | Error | StdDev | Ratio | RatioSD |
|--------|-----|-----------|-----------|-----:|------:|-------:|------:|--------:|
| SplitCsv | Job-GSIJRW | master | http:(...)y.csv [54] | 728.25 μs | 9.283 μs | 7.752 μs | 1.00 | 0.00 |
| SplitCsv | Job-DIUBJG | pr38001 | http:(...)y.csv [54] | 584.46 μs | 11.674 μs | 12.491 μs | 0.80 | 0.02 |
| SplitCsv | Job-GSIJRW | master | http(...).csv [107] | 22.46 μs | 0.387 μs | 0.580 μs | 1.00 | 0.00 |
| SplitCsv | Job-DIUBJG | pr38001 | http(...).csv [107] | 19.10 μs | 0.378 μs | 0.632 μs | 0.85 | 0.03 |
| SplitCsv | Job-GSIJRW | master | https(...)e.csv [50] | 97.29 μs | 1.702 μs | 2.441 μs | 1.00 | 0.00 |
| SplitCsv | Job-DIUBJG | pr38001 | https(...)e.csv [50] | 104.31 μs | 1.441 μs | 1.348 μs | 1.07 | 0.04 |
| SplitCsv | Job-GSIJRW | master | https(...)d.csv [66] | NA | NA | NA | ? | ? |
| SplitCsv | Job-DIUBJG | pr38001 | https(...)d.csv [66] | NA | NA | NA | ? | ? |
| SplitCsv | Job-GSIJRW | master | https(...)1.csv [54] | 125.83 μs | 4.661 μs | 13.523 μs | 1.00 | 0.00 |
| SplitCsv | Job-DIUBJG | pr38001 | https(...)1.csv [54] | 111.81 μs | 1.861 μs | 2.069 μs | 0.81 | 0.07 |
| SplitCsv | Job-GSIJRW | master | https(...)A.csv [77] | 582.79 μs | 11.351 μs | 10.617 μs | 1.00 | 0.00 |
| SplitCsv | Job-DIUBJG | pr38001 | https(...)A.csv [77] | 516.06 μs | 10.292 μs | 16.910 μs | 0.90 | 0.04 |

The delta for these real-world benchmarks ranges from a 20% improvement to a 7% regression.

Base automatically changed from master to main March 1, 2021 09:06
@danmoseley
Member

@tannergooding do you need any more updates or is this ready for you to review?

{
// Redundant test so we won't prejit remainder of this method
// on platforms without SSE.
if (!Sse.IsSupported)
Member

This should be Sse41 to match the highest ISA used in it.
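To illustrate the point about guarding on the highest ISA in use, here is a small sketch (a hypothetical helper, not the method under review). The body uses both an SSE2 and an SSE4.1 instruction, so the guard checks Sse41.IsSupported; checking only Sse.IsSupported or Sse2.IsSupported could let the SSE4.1 call run on hardware that lacks it.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class IsaGuardSketch
{
    // Hypothetical helper: reports whether `sep` occurs anywhere in `input`.
    public static bool ContainsSeparator(ReadOnlySpan<char> input, char sep)
    {
        int i = 0;

        // Guard on SSE4.1 because Sse41.TestZ below is the highest ISA used;
        // SSE4.1 support implies SSE2 support.
        if (Sse41.IsSupported)
        {
            Vector128<ushort> sepVec = Vector128.Create((ushort)sep);
            ReadOnlySpan<Vector128<ushort>> blocks =
                MemoryMarshal.Cast<char, Vector128<ushort>>(input);

            foreach (Vector128<ushort> block in blocks)
            {
                Vector128<ushort> cmp = Sse2.CompareEqual(block, sepVec); // SSE2
                if (!Sse41.TestZ(cmp, cmp))                               // SSE4.1
                {
                    return true; // at least one lane matched
                }
            }

            i = blocks.Length * Vector128<ushort>.Count;
        }

        // Scalar tail (and full fallback on other hardware).
        for (; i < input.Length; i++)
        {
            if (input[i] == sep)
            {
                return true;
            }
        }

        return false;
    }
}
```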

@tannergooding
Member

Changes LGTM minus the one IsSupported check.

@bbartels
Contributor Author

> Changes LGTM minus the one IsSupported check.

Done, missed that it needed to be more constrained than it was 😅 Hope everything is okay now!

@danmoseley
Member

To rerun CI, just close and reopen!

@danmoseley danmoseley closed this Mar 27, 2021
@danmoseley danmoseley reopened this Mar 27, 2021
@bbartels
Contributor Author

Looks like CI is happy 😄

@bbartels
Contributor Author

bbartels commented Apr 6, 2021

@tannergooding @danmoseley Would this be ready to merge?

@danmoseley
Member

Thanks for the reminder. @tannergooding is this ready to sign off?

@tannergooding tannergooding merged commit 295bcdc into dotnet:main Apr 7, 2021
@danmoseley
Member

yay! thank you @bbartels for the contribution, and @tannergooding for the comprehensive review.

I hope the long time this PR took will not put you off making another contribution. It is quite rare for it to take this long, and we're now more actively tracking PRs internally, e.g., we have a mailer for PRs that have not moved for a week or two.

thaystg added a commit to thaystg/runtime that referenced this pull request Apr 7, 2021
* upstream/main: (568 commits)
  [wasm] Set __DistroRid on Windows to browser-wasm (dotnet#50842)
  [wasm] Fix order of include paths, to have the obj dir first (dotnet#50303)
  [wasm] Fix debug build of AOT cross compiler (dotnet#50418)
  Fix outdated comment (dotnet#50834)
  [wasm][tests] Add properties to allow passing args to xharness (dotnet#50678)
  Vectorized common String.Split() paths (dotnet#38001)
  Fix binplacing symbol files. (dotnet#50819)
  Move type check to after the null ref branch in out marshalling of blittable classes. (dotnet#50735)
  Remove extraneous CMake version requirement. (dotnet#50805)
  [wasm] Remove unncessary condition for EMSDK (dotnet#50810)
  Add loop alignment stats to JitLogCsv (dotnet#50624)
  Resolve ILLink warnings in System.Diagnostics.DiagnosticSource (dotnet#50265)
  Avoid unnecessary closures/delegates in Process (dotnet#50496)
  Fix for field layout verification across version bubble boundary (dotnet#50364)
  JIT: Enable CSE for VectorX.Create (dotnet#50644)
  [main] Update dependencies from mono/linker (dotnet#50779)
  [mono] More domain cleanup (dotnet#50771)
  Race condition in Mock reference tracker runtime with GC. (dotnet#50804)
  Remove IAssemblyName (and various fusion remnants) (dotnet#50755)
  Disable failing test for GCStress. (dotnet#50828)
  ...
@bbartels
Contributor Author

> yay! thank you @bbartels for the contribution, and @tannergooding for the comprehensive review.
>
> I hope the long time this PR took will not put you off making another contribution. It is quite rare it takes this long, and we're now more actively tracking PR's internally, eg., we have a mailer for PR's that have not moved for a week or two.

Thank you @danmoseley, I always appreciate your kind words :)
Haha, don't worry about it, I absolutely loved working on this PR as it is both fairly impactful (at least compared to my previous PRs 😅) and taught me a lot about low-level compute.
University definitely gives me very little time at the moment to make further contributions, but I am hoping I'll have time again in the summer!

Also many thanks to @gfoidl for the awesome support during this PR!

@danmoseley
Member

Great, just ping one of us at that time if you would like help finding a project!

@kunalspathak
Member

This change introduced some regressions - #51259

danmoseley added a commit to danmoseley/runtime that referenced this pull request Apr 15, 2021
@ghost ghost locked as resolved and limited conversation to collaborators May 15, 2021
@karelz karelz added this to the 6.0.0 milestone May 20, 2021
8 participants