Improve vectorization of String.Split #64899

Merged
merged 9 commits into from
Mar 24, 2022

Conversation

yesmey
Contributor

@yesmey yesmey commented Feb 7, 2022

This pull request aims to simplify and improve upon the current vectorized fast path of string.Split.

Changes include:

  • Replace specialized SSE4.1 instructions with the new cross-platform intrinsic API
  • Add 256-bit instructions for longer strings
  • Improve the common path of Append in ValueListBuilder
    • I haven't written an explicit benchmark for this, but you can compare the assembly output here: before after

For benchmarking I used both the CSV parsing from #38001 and the benchmark referenced in #51259.
The benchmarks include both 256-bit and 128-bit versions (SSE/AVX). Unfortunately, I have not been able to benchmark any platform other than x86_64.

Benchmarks
BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19042.1466 (20H2/October2020Update)
AMD Ryzen 7 3700X, 1 CPU, 16 logical and 8 physical cores
.NET SDK=7.0.100-alpha.1.21568.2
  [Host]     : .NET 7.0.0 (7.0.21.56701), X64 RyuJIT
  Job-OHGYOD : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT

Toolchain=CoreRun  
| Method | CorpusUri | Mean | Error | StdDev |
|---|---|---|---|---|
| SplitCsv main | http(...).csv [107] | 11.71 μs | 0.224 μs | 0.210 μs |
| SplitCsv Vector128 | http(...).csv [107] | 9.915 μs | 0.0225 μs | 0.0175 μs |
| SplitCsv Vector256 | http(...).csv [107] | 9.768 μs | 0.1745 μs | 0.2612 μs |
| SplitCsv main | https(...)e.csv [50] | 69.22 μs | 0.182 μs | 0.170 μs |
| SplitCsv Vector128 | https(...)e.csv [50] | 65.784 μs | 0.0870 μs | 0.0772 μs |
| SplitCsv Vector256 | https(...)e.csv [50] | 58.383 μs | 0.3115 μs | 0.2914 μs |
| SplitCsv main | https(...)A.csv [77] | 354.65 μs | 2.050 μs | 1.712 μs |
| SplitCsv Vector128 | https(...)A.csv [77] | 311.968 μs | 0.8166 μs | 0.6376 μs |
| SplitCsv Vector256 | https(...)A.csv [77] | 319.919 μs | 2.0044 μs | 1.8749 μs |

| Method | s | chr | Mean | Error | StdDev |
|---|---|---|---|---|---|
| SplitArray main | A B C(...)X Y Z [51] | ' ' | 291.20 ns | 5.879 ns | 12.655 ns |
| SplitArray Vector128 | A B C(...)X Y Z [51] | ' ' | 270.11 ns | 2.206 ns | 2.063 ns |
| SplitArray Vector256 | A B C(...)X Y Z [51] | ' ' | 271.53 ns | 3.810 ns | 3.377 ns |
| SplitArray main | ABCDE(...)VWXYZ [26] | ' ' | 19.57 ns | 0.180 ns | 0.151 ns |
| SplitArray Vector128 | ABCDE(...)VWXYZ [26] | ' ' | 18.82 ns | 0.082 ns | 0.077 ns |
| SplitArray Vector256 | ABCDE(...)VWXYZ [26] | ' ' | 19.35 ns | 0.059 ns | 0.052 ns |

Benchmark code
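using System.Collections.Generic;
using System.IO;
using System.Net.Http;
using System.Threading.Tasks;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

[DisassemblyDiagnoser]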
public class CsvBenchmarks
{
    private string[] _strings;

    public IEnumerable<string> CorpusList()
    {
        yield return "https://www.census.gov/econ/bfs/csv/date_table.csv";
        yield return "https://www.sba.gov/sites/default/files/aboutsbaarticle/FY16_SBA_RAW_DATA.csv";
        yield return "https://wfmi.nifc.gov/fire_reporting/annual_dataset_archive/1972-2010/_WFMI_Big_Files/BOR_1972-2010_Gis.csv";
    }

    [ParamsSource("CorpusList")]
    public string CorpusUri { get; set; }

    [GlobalSetup]
    public void Setup()
    {
        _strings = GetStringsFromCorpus().GetAwaiter().GetResult();
    }

    private async Task<string[]> GetStringsFromCorpus()
    {
        using var client = new HttpClient();
        using var response = await client.GetAsync(CorpusUri);
        response.EnsureSuccessStatusCode();

        var body = await response.Content.ReadAsStringAsync();

        List<string> lines = new();

        StringReader reader = new StringReader(body);
        string? line;
        while ((line = reader.ReadLine()) != null)
        {
            lines.Add(line);
        }

        return lines.ToArray();
    }

    [Benchmark]
    public string[]? SplitCsv()
    {
        string[]? split = null;
        string[] lines = _strings;
        for (int i = 0; i < lines.Length; i++)
        {
            split = lines[i].Split(',');
        }
        return split;
    }   
}

[DisassemblyDiagnoser]
public class RegressionBenchmark
{
    [Benchmark]
    [Arguments("A B C D E F G H I J K L M N O P Q R S T U V W X Y Z", ' ')]
    [Arguments("ABCDEFGHIJKLMNOPQRSTUVWXYZ", ' ')]
    public string[] SplitArray(string s, char chr)
        => s.Split(chr);
}

public class Program
{
    public static void Main(string[] args)
    {
        BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args);
    }
}

Related to #51259

- Implement Vector256 for longer strings
- Simplify the Vector code and use new cross-platform intrinsic API
- Use ref _firstChar instead of ref MemoryMarshal.GetReference(this.AsSpan());
- Use unsigned check for separators.Length so that two redundant range checks are optimized away
@ghost added the community-contribution label (indicates that the PR has been added by a community member) on Feb 7, 2022
@dotnet-issue-labeler

I couldn't figure out the best area label to add to this PR. If you have write-permissions please help me learn by adding exactly one area label.

@@ -1637,27 +1637,13 @@ private void MakeSeparatorList(ReadOnlySpan<char> separators, ref ValueListBuild
}

// Special-case the common cases of 1, 2, and 3 separators, with manual comparisons against each separator.
-else if (separators.Length <= 3)
+else if (separators.Length <= 3u)
Member

does it affect codegen?

Contributor Author

Yes, it got rid of redundant range checks on separators. Writing (uint)separators.Length <= (uint)3 saves one more movsxd, but I personally thought this was cleaner. However, I can see it being too obscure in its intent.
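
For readers unfamiliar with the idiom discussed here, the following is a minimal sketch of the general unsigned-comparison trick, not the code from this PR. The type and method names are hypothetical. Casting both sides to uint folds a lower-bound and an upper-bound test into one comparison, because a negative int becomes a very large uint. Note also that in C# comparing an int with a uint literal such as 3u promotes both operands to long, which may be the source of the extra movsxd mentioned above; that link is an assumption.

```csharp
using System;

// Hypothetical helper type, for illustration only.
public static class RangeCheckSketch
{
    // Two separate comparisons: lower bound and upper bound.
    public static bool InRangeTwoChecks(int index, int length)
        => index >= 0 && index < length;

    // One comparison: a negative index wraps to a huge uint and fails the test.
    // Assumes length is non-negative, as a span or string length always is.
    public static bool InRangeUnsigned(int index, int length)
        => (uint)index < (uint)length;
}
```

Both methods return the same result for any non-negative length; whether the JIT actually removes a range check in a given method has to be confirmed by inspecting the generated assembly.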

Vector256<ushort> vector = Vector256.LoadUnsafe(ref source, (uint)i);
Vector256<ushort> cmp = Vector256.Equals(vector, v1) | Vector256.Equals(vector, v2) | Vector256.Equals(vector, v3);

uint mask = cmp.AsByte().ExtractMostSignificantBits() & 0b0101010101010101;
Member

It might be a good idea to also use TestZ for a faster early out, e.g.

if (cmp == Vector256<ushort>.Zero)
    continue;

it's faster than movmsk
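
To make the suggestion concrete, here is a small self-contained sketch of a separator scan that skips a block as soon as the comparison result is all zero. It uses Vector128 and a hypothetical FindSeparators helper that returns a List<int>; the real code in this PR appends to a ValueListBuilder instead. Whether the zero test compiles down to a ptest-style instruction depends on the JIT and the hardware, so treat the performance claim as the reviewer's.

```csharp
using System;
using System.Collections.Generic;
using System.Numerics;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

// Hypothetical helper type, for illustration only (requires .NET 7 or later).
public static class EarlyOutSketch
{
    public static List<int> FindSeparators(ReadOnlySpan<char> text, char separator)
    {
        var indices = new List<int>();
        ReadOnlySpan<ushort> data = MemoryMarshal.Cast<char, ushort>(text);
        Vector128<ushort> target = Vector128.Create((ushort)separator);

        int i = 0;
        for (; i <= data.Length - Vector128<ushort>.Count; i += Vector128<ushort>.Count)
        {
            Vector128<ushort> block = Vector128.Create(data.Slice(i, Vector128<ushort>.Count));
            Vector128<ushort> cmp = Vector128.Equals(block, target);

            // Early out: no lane matched, so skip the mask extraction entirely.
            if (cmp == Vector128<ushort>.Zero)
                continue;

            // Each matching 16-bit lane sets two adjacent byte bits; keep the even ones.
            uint mask = cmp.AsByte().ExtractMostSignificantBits() & 0x5555u;
            while (mask != 0)
            {
                indices.Add(i + BitOperations.TrailingZeroCount(mask) / 2);
                mask &= mask - 1; // clear the lowest set bit
            }
        }

        // Scalar loop for the remaining characters.
        for (; i < data.Length; i++)
        {
            if (data[i] == separator)
                indices.Add(i);
        }

        return indices;
    }
}
```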

{
sepListBuilder.Append(idx);
sepListBuilder.Append(i + BitOperations.TrailingZeroCount(mask) / 2);
Member

{
if ((lowBits & 0xF) != 0)
Vector256<ushort> vector = Vector256.LoadUnsafe(ref source, (uint)i);
Vector256<ushort> cmp = Vector256.Equals(vector, v1) | Vector256.Equals(vector, v2) | Vector256.Equals(vector, v3);
Member

Consider splitting this into temporaries for better pipelining, so that all the compare instructions are next to each other, and so are the ORs.


for (int idx = i; lowBits != 0; idx++)
int vector256ShortCount = Vector256<ushort>.Count;
for (; (i + vector256ShortCount) <= Length; i += vector256ShortCount)
Member

Consider processing the trailing elements with an overlapping vector instead of a scalar fallback.

Contributor Author

There's a risk, though, that the code will start getting complicated. I wanted to keep it easy to follow, since it's only used for a specific scenario. If you still think it's worth it, I can definitely look into it.

Member

@EgorBo EgorBo Feb 7, 2022

Handling trailing elements in the same loop (or via a spilled iteration) shows nice improvements for small and medium-sized inputs. In theory it only adds one additional check inside the loop. Feel free to keep it as is; we can follow up later.
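
As an illustration of the overlapping technique, here is a minimal sketch on a simpler problem than Split: a hypothetical ContainsChar helper. The last vector is loaded so that it ends exactly at the end of the input, re-reading a few elements the loop has already seen. That is harmless here because the result is a yes/no answer. For Split, where indices are collected, the lanes that were already processed would have to be masked out to avoid duplicates, which is the added complexity mentioned above.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

// Hypothetical helper type, for illustration only (requires .NET 7 or later).
public static class OverlapSketch
{
    public static bool ContainsChar(ReadOnlySpan<char> text, char value)
    {
        ReadOnlySpan<ushort> data = MemoryMarshal.Cast<char, ushort>(text);
        int count = Vector128<ushort>.Count;

        // Inputs shorter than one vector still need a scalar path.
        if (data.Length < count)
        {
            foreach (ushort c in data)
            {
                if (c == value) return true;
            }
            return false;
        }

        Vector128<ushort> target = Vector128.Create((ushort)value);
        int lastStart = data.Length - count;

        // Full blocks that end before the final block.
        for (int i = 0; i < lastStart; i += count)
        {
            if (Vector128.EqualsAny(Vector128.Create(data.Slice(i, count)), target))
                return true;
        }

        // Final block, aligned to the end; it may overlap the previous block.
        return Vector128.EqualsAny(Vector128.Create(data.Slice(lastStart)), target);
    }
}
```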


for (int idx = i; lowBits != 0; idx++)
int vector256ShortCount = Vector256<ushort>.Count;
for (; (i + vector256ShortCount) <= Length; i += vector256ShortCount)
Member

(i + vector256ShortCount) <= Length might overflow, it should be
i <= Length - vector256ShortCount

Member

Besides that, the i <= len - count version can keep len - count in a register, whilst i + count needs a repeated addition.

Also, the local vector256ShortCount isn't needed, as the JIT will treat Vector256<ushort>.Count as a constant.
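
The overflow concern can be shown with plain integers. The sketch below uses hypothetical helper names and deliberately extreme values; with real string lengths the wrap-around is very unlikely, so this only illustrates why the subtraction form is the safer habit.

```csharp
using System;

// Hypothetical helper type, for illustration only.
public static class LoopBoundSketch
{
    // i + count can wrap around to a negative number when i is close to int.MaxValue.
    public static bool NaiveHasFullBlock(int i, int length, int count)
        => unchecked(i + count) <= length;

    // length - count cannot overflow for non-negative length and count,
    // and it is loop-invariant, so it can be computed once.
    public static bool SafeHasFullBlock(int i, int length, int count)
        => i <= length - count;
}
```

With i = int.MaxValue - 3, length = int.MaxValue and count = 8, the naive form wraps to a negative sum and wrongly reports that a full block fits, while the safe form reports that it does not.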

Vector128<ushort> v3 = Vector128.Create((ushort)c3);

ref char c0 = ref MemoryMarshal.GetReference(this.AsSpan());
int cond = Length & -Vector128<ushort>.Count;
int i = 0;
Member

int -> nint, it will help to avoid redundant sign extensions

Contributor Author

The same variable is used as the index in the scalar (non-vectorized) version at the bottom. I'll see if I can find a middle ground.

Member

you can always cast it to signed just once before the scalar version


Comment on lines 1706 to 1707
int vector128ShortCount = Vector128<ushort>.Count;
for (; (i + vector128ShortCount) <= Length; i += vector128ShortCount)
Member

Suggested change
-int vector128ShortCount = Vector128<ushort>.Count;
-for (; (i + vector128ShortCount) <= Length; i += vector128ShortCount)
+for (; i <= Length - Vector128<ushort>.Count; i += Vector128<ushort>.Count)

When i is of type nint, check that the comparison doesn't introduce any sign extensions. Please double-check to be on the safe side.

@yesmey
Contributor Author

yesmey commented Feb 7, 2022

@EgorBo @gfoidl Thanks for the good tips and feedback. I updated the pull request accordingly.
Unfortunately the 256-bit code had a bug: I was masking the movmskb result to keep every other bit, but had accidentally copied the mask from the 128-bit code, where the result is only 16 bits wide. There wasn't any string in the test suite that covered it.

The benchmark numbers for the 256-bit path are much more realistic now and look much closer to the 128-bit version. Please let me know your opinion, and sorry for the mistake.
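
The bug described above can be reproduced in a few lines. This is a sketch with hypothetical names, not the PR's code: a Vector256<ushort> holds 16 characters, so its byte-wise most-significant-bit mask is 32 bits wide, and a 16-bit "every other bit" pattern silently ignores the upper 8 characters.

```csharp
using System;
using System.Numerics;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

// Hypothetical helper type, for illustration only (requires .NET 7 or later).
public static class MaskWidthSketch
{
    public const uint Mask16 = 0b0101_0101_0101_0101; // correct for Vector128<ushort>
    public const uint Mask32 = 0x5555_5555;           // needed for Vector256<ushort>

    // Returns the index of the first separator in the first 16 characters, or -1.
    // Assumes text has at least Vector256<ushort>.Count characters.
    public static int FirstSeparatorIndex(ReadOnlySpan<char> text, char separator, uint laneMask)
    {
        ReadOnlySpan<ushort> data = MemoryMarshal.Cast<char, ushort>(text);
        Vector256<ushort> vector = Vector256.Create(data.Slice(0, Vector256<ushort>.Count));
        Vector256<ushort> cmp = Vector256.Equals(vector, Vector256.Create((ushort)separator));

        // Two bits per matching character; laneMask keeps one of each pair.
        uint mask = cmp.AsByte().ExtractMostSignificantBits() & laneMask;
        return mask == 0 ? -1 : BitOperations.TrailingZeroCount(mask) / 2;
    }
}
```

For the input "aaaaaaaaaaaa,aaa" the separator sits at index 12. With Mask16 the call reports no match, because bits 24 and 25 are masked away; with Mask32 it reports index 12.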

@gfoidl
Member

gfoidl commented Feb 8, 2022

benchmark numbers for 256 bit is much more realistic now. It looks to be much closer to the 128 bit version now

Are those the current numbers in the PR description?
For Vector256 there's only a small gain, so is it worth having a dedicated code path for it? ARM won't support it anyway.

@yesmey
Contributor Author

yesmey commented Feb 8, 2022

@gfoidl Yes, those are the latest numbers. I can remove the 256-bit path if you want.

@stephentoub
Member

Yes those are the latest numbers. I can remove the 256bit path it if you want

Are any of these tests for really long inputs containing very few separators?

Vector256<byte> cmp = (vector1 | vector2 | vector3).AsByte();
Vector256<ushort> v1 = Vector256.Create((ushort)c);
Vector256<ushort> v2 = Vector256.Create((ushort)c2);
Vector256<ushort> v3 = Vector256.Create((ushort)c3);
Member

Unrelated to this PR, I'm just curious whether our guidelines allow using var here; the type of the vector should be pretty obvious from the expression on the right.

Member

The guidelines say var should only be used for a ctor or explicit cast. While it's arguable that Create is equivalent to a ctor, there's nothing that requires it to return the same type it's declared on, and in fact there are cases where Create methods don't, e.g. File.Create.

@yesmey
Contributor Author

yesmey commented Feb 9, 2022

Sorry for the delay. I decided to rewrite the 256-bit spilling and made some improvements to the scalar loop.

Here's a gist with a much bigger benchmark suite: https://gist.github.com/yesmey/2e7a7868bb10043553b78d77cbc3f2b8
(note: the bold text is the baseline)

benchmark code for gist
public class Benchmarks
{
    private static string _testStr;
    private static System.Text.StringBuilder st;
    private static char[][] _testChar = new char[3][];

    static Benchmarks()
    {
        st = new System.Text.StringBuilder(5_000_000);
        _testChar[0] = new char[1] { ' ' };
        _testChar[1] = new char[2] { ' ', 't' };
        _testChar[2] = new char[3] { ' ', 't', 'f' };
    } 
    
    private static string BuildStr(char c, int stringLength, int sepFreq, char sep)
    {
        for (int i = 0; i < stringLength; i++)
        {
            if (i % sepFreq == 0)
            {
                st.Append(sep);
            }
            else { st.Append(c); }
        }
        string t =  st.ToString();
        st.Clear();
        return t;
    }

    [GlobalSetup]
    public void Init()
    {
        _testStr = BuildStr('a', Size, SepFreq, _testChar[2][SplitCount - 1]);
    }

    [Params(16, 64, 200, 1000, 10000)]
    public int Size { get; set; }
    
    [Params(1, 2, 5, 200)]
    public int SepFreq { get; set; }
    
    [Params(1, 2, 3)]
    public int SplitCount { get; set; }
    
    [Benchmark]
    public string[] Split()
    {
        return _testStr.Split(_testChar[SplitCount - 1]);
    }
}

Updated numbers from previous benchmarks:

csv + dotnet/performance
BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19042.1526 (20H2/October2020Update)
AMD Ryzen 7 3700X, 1 CPU, 16 logical and 8 physical cores
.NET SDK=7.0.100-preview.2.22108.4
  [Host]     : .NET 7.0.0 (7.0.22.10302), X64 RyuJIT
  Job-AJDBJE : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT
  Job-XTZHCY : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT

| Method | Job | Toolchain | CorpusUri | Mean | Error | StdDev | Ratio | RatioSD |
|---|---|---|---|---|---|---|---|---|
| SplitCsv | Job-AJDBJE | \runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe | http(...).csv [107] | 11.493 μs | 0.1911 μs | 0.1787 μs | 1.19 | 0.02 |
| SplitCsv | Job-XTZHCY | \yesmey_runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe | http(...).csv [107] | 9.631 μs | 0.1624 μs | 0.1519 μs | 1.00 | 0.00 |
| SplitCsv | Job-AJDBJE | \runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe | https(...)e.csv [50] | 60.477 μs | 0.2351 μs | 0.2200 μs | 0.96 | 0.01 |
| SplitCsv | Job-XTZHCY | \yesmey_runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe | https(...)e.csv [50] | 62.927 μs | 0.9685 μs | 1.5078 μs | 1.00 | 0.00 |
| SplitCsv | Job-AJDBJE | \runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe | https(...)A.csv [77] | 343.339 μs | 3.5451 μs | 3.3161 μs | 1.19 | 0.01 |
| SplitCsv | Job-XTZHCY | \yesmey_runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe | https(...)A.csv [77] | 287.859 μs | 2.7981 μs | 2.6174 μs | 1.00 | 0.00 |

BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19042.1526 (20H2/October2020Update)
AMD Ryzen 7 3700X, 1 CPU, 16 logical and 8 physical cores
.NET SDK=7.0.100-preview.2.22108.4
  [Host]     : .NET 7.0.0 (7.0.22.10302), X64 RyuJIT
  Job-AJDBJE : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT
  Job-XTZHCY : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT

| Method | Job | Toolchain | s | chr | arr | options | Mean | Error | StdDev | Ratio | RatioSD |
|---|---|---|---|---|---|---|---|---|---|---|---|
| SplitChar | Job-AJDBJE | \runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe | A B C(...)X Y Z [51] | ' ' | ? | ? | 312.40 ns | 3.786 ns | 3.541 ns | 1.08 | 0.11 |
| SplitChar | Job-XTZHCY | \yesmey_runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe | A B C(...)X Y Z [51] | ' ' | ? | ? | 284.75 ns | 8.865 ns | 26.139 ns | 1.00 | 0.00 |
| Split | Job-AJDBJE | \runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe | A B C(...)X Y Z [51] | ? | Char[1] | None | 292.72 ns | 2.842 ns | 2.519 ns | 1.17 | 0.02 |
| Split | Job-XTZHCY | \yesmey_runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe | A B C(...)X Y Z [51] | ? | Char[1] | None | 251.01 ns | 4.483 ns | 3.974 ns | 1.00 | 0.00 |
| Split | Job-AJDBJE | \runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe | A B C(...)X Y Z [51] | ? | Char[1] | RemoveEmptyEntries | 368.19 ns | 5.179 ns | 4.591 ns | 1.09 | 0.02 |
| Split | Job-XTZHCY | \yesmey_runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe | A B C(...)X Y Z [51] | ? | Char[1] | RemoveEmptyEntries | 337.88 ns | 0.900 ns | 0.841 ns | 1.00 | 0.00 |
| SplitChar | Job-AJDBJE | \runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe | ABCDE(...)VWXYZ [26] | ' ' | ? | ? | 19.20 ns | 0.035 ns | 0.032 ns | 0.76 | 0.00 |
| SplitChar | Job-XTZHCY | \yesmey_runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe | ABCDE(...)VWXYZ [26] | ' ' | ? | ? | 25.16 ns | 0.044 ns | 0.041 ns | 1.00 | 0.00 |
| Split | Job-AJDBJE | \runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe | ABCDE(...)VWXYZ [26] | ? | Char[1] | None | 18.71 ns | 0.028 ns | 0.025 ns | 0.63 | 0.00 |
| Split | Job-XTZHCY | \yesmey_runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe | ABCDE(...)VWXYZ [26] | ? | Char[1] | None | 29.67 ns | 0.052 ns | 0.046 ns | 1.00 | 0.00 |
| Split | Job-AJDBJE | \runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe | ABCDE(...)VWXYZ [26] | ? | Char[1] | RemoveEmptyEntries | 17.71 ns | 0.026 ns | 0.024 ns | 0.55 | 0.00 |
| Split | Job-XTZHCY | \yesmey_runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe | ABCDE(...)VWXYZ [26] | ? | Char[1] | RemoveEmptyEntries | 32.04 ns | 0.050 ns | 0.045 ns | 1.00 | 0.00 |

There seem to be regressions on the strings with no separator characters in them.

@ghost

ghost commented Feb 12, 2022

Tagging subscribers to this area: @dotnet/area-system-runtime
See info in area-owners.md if you want to be subscribed.

Issue Details

Author: yesmey
Assignees: -
Labels:

area-System.Runtime, community-contribution

Milestone: -

@yesmey
Contributor Author

yesmey commented Feb 13, 2022

Status update: I can't get the 256-bit version to perform well on low-to-mid-length inputs because of the register save/restore overhead caused by the nested calls inside ValueListBuilder.Append. The 256-bit assembly currently looks like this: https://gist.github.com/yesmey/7786c102927cf8e9abf966cf44a35484, and as you can tell there are a lot of initial vmovaps instructions just for the potential call to Grow in AddWithResize. To prove my point, I commented out Grow inside AddWithResize for comparison here.

Since I'm not getting any further there, I'm thinking of giving up on the 256-bit path and keeping the 128-bit version, which is on par in performance, just to have an implementation for ARM.
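
For context, the Append shape under discussion can be sketched as follows. This is a simplified, hypothetical class backed by a growable int[]; it is not the runtime's ValueListBuilder, which is a ref struct over a span. The point is the split between an inlined fast path and a non-inlined resize path: the resize is rare, but the possibility of that call is still present in every loop iteration that appends.

```csharp
using System;
using System.Runtime.CompilerServices;

// Hypothetical simplified builder, for illustration only.
public sealed class IndexListSketch
{
    private int[] _items = new int[4];
    private int _count;

    public int Count => _count;

    public int this[int index]
        => (uint)index < (uint)_count ? _items[index] : throw new ArgumentOutOfRangeException(nameof(index));

    // Fast path: one bounds comparison and one store, small enough to inline.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public void Append(int item)
    {
        int[] items = _items;
        int count = _count;
        if ((uint)count < (uint)items.Length)
        {
            items[count] = item;
            _count = count + 1;
        }
        else
        {
            AddWithResize(item);
        }
    }

    // Slow path: kept out of line so the growth code does not bloat callers.
    [MethodImpl(MethodImplOptions.NoInlining)]
    private void AddWithResize(int item)
    {
        Array.Resize(ref _items, _items.Length * 2);
        _items[_count++] = item;
    }
}
```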

@EgorBo
Member

EgorBo commented Feb 13, 2022

So since I'm not getting any further there, I'm thinking maybe giving up on the 256 bit and keep the 128 bit version

that's ok, we try to use AVX only where it's definitely profitable.

@yesmey
Contributor Author

yesmey commented Feb 13, 2022

benchmarks for commit dcadf05

CSV parsing benchmarks
BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19042.1526 (20H2/October2020Update)
AMD Ryzen 7 3700X, 1 CPU, 16 logical and 8 physical cores
.NET SDK=7.0.100-preview.2.22108.4
  [Host]     : .NET 7.0.0 (7.0.22.10302), X64 RyuJIT
  Job-EPAGWH : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT
  Job-DBDUQW : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT

| Method | Job | Toolchain | CorpusUri | Mean | Error | StdDev | Ratio | RatioSD |
|---|---|---|---|---|---|---|---|---|
| SplitCsv | Job-EPAGWH | \runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe | http(...).csv [107] | 11.25 μs | 0.104 μs | 0.098 μs | 1.00 | 0.00 |
| SplitCsv | Job-DBDUQW | \yesmey_runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe | http(...).csv [107] | 10.04 μs | 0.011 μs | 0.009 μs | 0.89 | 0.01 |
| SplitCsv | Job-EPAGWH | \runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe | https(...)e.csv [50] | 57.39 μs | 1.093 μs | 1.074 μs | 1.00 | 0.00 |
| SplitCsv | Job-DBDUQW | \yesmey_runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe | https(...)e.csv [50] | 53.80 μs | 0.305 μs | 0.285 μs | 0.94 | 0.02 |
| SplitCsv | Job-EPAGWH | \runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe | https(...)A.csv [77] | 326.79 μs | 1.039 μs | 0.921 μs | 1.00 | 0.00 |
| SplitCsv | Job-DBDUQW | \yesmey_runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe | https(...)A.csv [77] | 297.05 μs | 5.757 μs | 5.912 μs | 0.91 | 0.02 |

dotnet/performance benchmarks
BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19042.1526 (20H2/October2020Update)
AMD Ryzen 7 3700X, 1 CPU, 16 logical and 8 physical cores
.NET SDK=7.0.100-preview.2.22108.4
  [Host]     : .NET 7.0.0 (7.0.22.10302), X64 RyuJIT
  Job-EPAGWH : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT
  Job-DBDUQW : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT

| Method | Job | Toolchain | s | chr | arr | options | Mean | Error | StdDev | Ratio | RatioSD |
|---|---|---|---|---|---|---|---|---|---|---|---|
| SplitChar | Job-EPAGWH | \runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe | A B C(...)X Y Z [51] | ' ' | ? | ? | 297.36 ns | 5.652 ns | 6.282 ns | 1.00 | 0.00 |
| SplitChar | Job-DBDUQW | \yesmey_runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe | A B C(...)X Y Z [51] | ' ' | ? | ? | 291.99 ns | 5.821 ns | 12.022 ns | 0.99 | 0.04 |
| Split | Job-EPAGWH | \runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe | A B C(...)X Y Z [51] | ? | Char[1] | None | 314.12 ns | 9.964 ns | 29.379 ns | 1.00 | 0.00 |
| Split | Job-DBDUQW | \yesmey_runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe | A B C(...)X Y Z [51] | ? | Char[1] | None | 272.73 ns | 3.769 ns | 5.160 ns | 0.88 | 0.08 |
| Split | Job-EPAGWH | \runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe | A B C(...)X Y Z [51] | ? | Char[1] | RemoveEmptyEntries | 374.66 ns | 4.502 ns | 3.759 ns | 1.00 | 0.00 |
| Split | Job-DBDUQW | \yesmey_runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe | A B C(...)X Y Z [51] | ? | Char[1] | RemoveEmptyEntries | 366.11 ns | 0.924 ns | 0.819 ns | 0.98 | 0.01 |
| SplitChar | Job-EPAGWH | \runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe | ABCDE(...)VWXYZ [26] | ' ' | ? | ? | 25.87 ns | 0.030 ns | 0.027 ns | 1.00 | 0.00 |
| SplitChar | Job-DBDUQW | \yesmey_runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe | ABCDE(...)VWXYZ [26] | ' ' | ? | ? | 18.23 ns | 0.016 ns | 0.013 ns | 0.70 | 0.00 |
| Split | Job-EPAGWH | \runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe | ABCDE(...)VWXYZ [26] | ? | Char[1] | None | 18.19 ns | 0.146 ns | 0.122 ns | 1.00 | 0.00 |
| Split | Job-DBDUQW | \yesmey_runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe | ABCDE(...)VWXYZ [26] | ? | Char[1] | None | 18.39 ns | 0.034 ns | 0.030 ns | 1.01 | 0.01 |
| Split | Job-EPAGWH | \runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe | ABCDE(...)VWXYZ [26] | ? | Char[1] | RemoveEmptyEntries | 17.82 ns | 0.017 ns | 0.013 ns | 1.00 | 0.00 |
| Split | Job-DBDUQW | \yesmey_runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe | ABCDE(...)VWXYZ [26] | ? | Char[1] | RemoveEmptyEntries | 18.37 ns | 0.066 ns | 0.059 ns | 1.03 | 0.00 |

partial 38001 issue benchmark suite
BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19042.1526 (20H2/October2020Update)
AMD Ryzen 7 3700X, 1 CPU, 16 logical and 8 physical cores
.NET SDK=7.0.100-preview.2.22108.4
  [Host]     : .NET 7.0.0 (7.0.22.10302), X64 RyuJIT
  Job-EPAGWH : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT
  Job-DBDUQW : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT

| Method | Job | Toolchain | Size | SepFreq | SplitCount | Mean | Error | StdDev | Median | Ratio | RatioSD |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Split | Job-EPAGWH | \runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe | 16 | 1 | 1 | 90.66 ns | 0.146 ns | 0.129 ns | 90.66 ns | 1.00 | 0.00 |
| Split | Job-DBDUQW | \yesmey_runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe | 16 | 1 | 1 | 98.80 ns | 0.351 ns | 0.329 ns | 98.83 ns | 1.09 | 0.00 |
| Split | Job-EPAGWH | \runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe | 16 | 1 | 2 | 101.48 ns | 1.119 ns | 0.992 ns | 101.72 ns | 1.00 | 0.00 |
| Split | Job-DBDUQW | \yesmey_runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe | 16 | 1 | 2 | 110.29 ns | 2.051 ns | 1.919 ns | 110.52 ns | 1.08 | 0.02 |
| Split | Job-EPAGWH | \runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe | 16 | 5 | 1 | 62.59 ns | 1.013 ns | 0.846 ns | 62.98 ns | 1.00 | 0.00 |
| Split | Job-DBDUQW | \yesmey_runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe | 16 | 5 | 1 | 54.82 ns | 0.617 ns | 0.516 ns | 54.83 ns | 0.88 | 0.01 |
| Split | Job-EPAGWH | \runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe | 16 | 5 | 2 | 66.75 ns | 1.380 ns | 1.842 ns | 65.78 ns | 1.00 | 0.00 |
| Split | Job-DBDUQW | \yesmey_runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe | 16 | 5 | 2 | 54.66 ns | 1.106 ns | 0.924 ns | 54.41 ns | 0.82 | 0.03 |
| Split | Job-EPAGWH | \runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe | 16 | 200 | 1 | 33.08 ns | 0.107 ns | 0.095 ns | 33.10 ns | 1.00 | 0.00 |
| Split | Job-DBDUQW | \yesmey_runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe | 16 | 200 | 1 | 30.76 ns | 0.220 ns | 0.205 ns | 30.80 ns | 0.93 | 0.01 |
| Split | Job-EPAGWH | \runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe | 16 | 200 | 2 | 35.91 ns | 0.745 ns | 1.362 ns | 35.77 ns | 1.00 | 0.00 |
| Split | Job-DBDUQW | \yesmey_runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe | 16 | 200 | 2 | 31.82 ns | 0.462 ns | 0.432 ns | 31.69 ns | 0.89 | 0.06 |
| Split | Job-EPAGWH | \runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe | 200 | 1 | 1 | 1,043.18 ns | 20.611 ns | 36.099 ns | 1,045.06 ns | 1.00 | 0.00 |
| Split | Job-DBDUQW | \yesmey_runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe | 200 | 1 | 1 | 1,060.80 ns | 20.757 ns | 32.922 ns | 1,052.50 ns | 1.02 | 0.05 |
| Split | Job-EPAGWH | \runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe | 200 | 1 | 2 | 928.95 ns | 18.445 ns | 35.538 ns | 937.08 ns | 1.00 | 0.00 |
| Split | Job-DBDUQW | \yesmey_runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe | 200 | 1 | 2 | 1,040.02 ns | 12.802 ns | 9.995 ns | 1,044.64 ns | 1.09 | 0.03 |
| Split | Job-EPAGWH | \runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe | 200 | 5 | 1 | 546.27 ns | 10.817 ns | 18.073 ns | 544.85 ns | 1.00 | 0.00 |
| Split | Job-DBDUQW | \yesmey_runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe | 200 | 5 | 1 | 455.87 ns | 9.008 ns | 10.373 ns | 453.83 ns | 0.84 | 0.04 |
| Split | Job-EPAGWH | \runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe | 200 | 5 | 2 | 514.16 ns | 5.528 ns | 4.900 ns | 513.00 ns | 1.00 | 0.00 |
| Split | Job-DBDUQW | \yesmey_runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe | 200 | 5 | 2 | 481.95 ns | 9.573 ns | 18.444 ns | 485.77 ns | 0.92 | 0.03 |
| Split | Job-EPAGWH | \runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe | 200 | 200 | 1 | 65.77 ns | 0.118 ns | 0.098 ns | 65.77 ns | 1.00 | 0.00 |
| Split | Job-DBDUQW | \yesmey_runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe | 200 | 200 | 1 | 66.08 ns | 0.448 ns | 0.420 ns | 66.05 ns | 1.00 | 0.01 |
| Split | Job-EPAGWH | \runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe | 200 | 200 | 2 | 65.91 ns | 0.365 ns | 0.342 ns | 65.75 ns | 1.00 | 0.00 |
| Split | Job-DBDUQW | \yesmey_runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe | 200 | 200 | 2 | 67.15 ns | 0.508 ns | 0.424 ns | 67.10 ns | 1.02 | 0.01 |
Split Job-EPAGWH \runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe 1000 1 1 4,139.27 ns 15.050 ns 11.750 ns 4,139.32 ns 1.00 0.00
Split Job-DBDUQW \yesmey_runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe 1000 1 1 4,358.61 ns 9.762 ns 8.654 ns 4,354.70 ns 1.05 0.00
Split Job-EPAGWH \runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe 1000 1 2 4,162.92 ns 52.087 ns 77.961 ns 4,133.73 ns 1.00 0.00
Split Job-DBDUQW \yesmey_runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe 1000 1 2 4,639.01 ns 92.551 ns 197.234 ns 4,646.57 ns 1.11 0.05
Split Job-EPAGWH \runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe 1000 5 1 2,348.05 ns 47.024 ns 112.666 ns 2,355.87 ns 1.00 0.00
Split Job-DBDUQW \yesmey_runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe 1000 5 1 2,201.39 ns 53.919 ns 158.983 ns 2,145.12 ns 0.94 0.08
Split Job-EPAGWH \runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe 1000 5 2 2,408.45 ns 49.325 ns 145.436 ns 2,465.65 ns 1.00 0.00
Split Job-DBDUQW \yesmey_runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe 1000 5 2 2,160.78 ns 42.667 ns 85.210 ns 2,151.04 ns 0.89 0.07
Split Job-EPAGWH \runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe 1000 200 1 248.99 ns 0.974 ns 0.911 ns 249.01 ns 1.00 0.00
Split Job-DBDUQW \yesmey_runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe 1000 200 1 246.70 ns 1.574 ns 1.314 ns 247.05 ns 0.99 0.01
Split Job-EPAGWH \runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe 1000 200 2 245.95 ns 1.117 ns 1.045 ns 245.52 ns 1.00 0.00
Split Job-DBDUQW \yesmey_runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe 1000 200 2 244.58 ns 1.302 ns 1.218 ns 244.91 ns 0.99 0.00
Split Job-EPAGWH \runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe 10000 1 1 40,257.63 ns 395.206 ns 330.015 ns 40,114.94 ns 1.00 0.00
Split Job-DBDUQW \yesmey_runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe 10000 1 1 43,103.42 ns 797.633 ns 622.740 ns 43,088.24 ns 1.07 0.01
Split Job-EPAGWH \runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe 10000 1 2 40,372.55 ns 764.782 ns 715.378 ns 39,915.27 ns 1.00 0.00
Split Job-DBDUQW \yesmey_runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe 10000 1 2 43,351.75 ns 857.048 ns 1,916.910 ns 42,166.56 ns 1.10 0.05
Split Job-EPAGWH \runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe 10000 5 1 23,065.97 ns 419.810 ns 372.151 ns 22,920.42 ns 1.00 0.00
Split Job-DBDUQW \yesmey_runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe 10000 5 1 21,142.54 ns 471.224 ns 1,389.414 ns 20,593.73 ns 0.94 0.08
Split Job-EPAGWH \runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe 10000 5 2 25,646.14 ns 506.991 ns 1,204.917 ns 25,906.24 ns 1.00 0.00
Split Job-DBDUQW \yesmey_runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe 10000 5 2 23,400.37 ns 466.163 ns 1,344.987 ns 23,348.94 ns 0.92 0.08
Split Job-EPAGWH \runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe 10000 200 1 2,161.35 ns 3.897 ns 3.254 ns 2,159.56 ns 1.00 0.00
Split Job-DBDUQW \yesmey_runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe 10000 200 1 2,182.22 ns 16.648 ns 15.573 ns 2,181.21 ns 1.01 0.01
Split Job-EPAGWH \runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe 10000 200 2 2,174.07 ns 3.713 ns 3.291 ns 2,173.86 ns 1.00 0.00
Split Job-DBDUQW \yesmey_runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe 10000 200 2 2,130.82 ns 8.968 ns 7.949 ns 2,131.29 ns 0.98 0.00
benchmark source
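using System;
using System.Collections.Generic;
using System.IO;
using System.Net.Http;
using System.Threading.Tasks;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;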

public class CsvBenchmarks
{
    private string[] _strings;

    public IEnumerable<string> CorpusList()
    {
        // only these three urls still return any result
        yield return "https://www.census.gov/econ/bfs/csv/date_table.csv";
        yield return "https://www.sba.gov/sites/default/files/aboutsbaarticle/FY16_SBA_RAW_DATA.csv";
        yield return "https://wfmi.nifc.gov/fire_reporting/annual_dataset_archive/1972-2010/_WFMI_Big_Files/BOR_1972-2010_Gis.csv";
    }

    [ParamsSource(nameof(CorpusList))]
    public string CorpusUri { get; set; }

    [GlobalSetup]
    public void Setup()
    {
        _strings = GetStringsFromCorpus().GetAwaiter().GetResult();
    }

    private async Task<string[]> GetStringsFromCorpus()
    {
        using var client = new HttpClient();
        using var response = await client.GetAsync(CorpusUri);
        response.EnsureSuccessStatusCode();

        var body = await response.Content.ReadAsStringAsync();

        List<string> lines = new();

        StringReader reader = new StringReader(body);
        string? line;
        while ((line = reader.ReadLine()) != null)
        {
            lines.Add(line);
        }

        return lines.ToArray();
    }

    [Benchmark]
    public string[]? SplitCsv()
    {
        string[]? split = null;
        string[] lines = _strings;
        for (int i = 0; i < lines.Length; i++)
        {
            split = lines[i].Split(',');
        }
        return split;
    }
}

public class RegressionBenchmark
{
    [Benchmark]
    [Arguments("A B C D E F G H I J K L M N O P Q R S T U V W X Y Z", ' ')]
    [Arguments("ABCDEFGHIJKLMNOPQRSTUVWXYZ", ' ')]
    public string[] SplitChar(string s, char chr)
        => s.Split(chr);

    [Benchmark]
    [Arguments("A B C D E F G H I J K L M N O P Q R S T U V W X Y Z", new char[] { ' ' }, StringSplitOptions.None)]
    [Arguments("A B C D E F G H I J K L M N O P Q R S T U V W X Y Z", new char[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)]
    [Arguments("ABCDEFGHIJKLMNOPQRSTUVWXYZ", new char[]{' '}, StringSplitOptions.None)]
    [Arguments("ABCDEFGHIJKLMNOPQRSTUVWXYZ", new char[]{' '}, StringSplitOptions.RemoveEmptyEntries)]
    public string[] Split(string s, char[] arr, StringSplitOptions options)
        => s.Split(arr, options);
}

public class Benchmarks
{
    private static string _testStr;
    private static System.Text.StringBuilder st;
    private static char[][] _testChar = new char[3][];

    static Benchmarks()
    {
        st = new System.Text.StringBuilder(5_000_000);
        _testChar[0] = new char[1] { ' ' };
        _testChar[1] = new char[3] { ' ', 't', 'f' };
    }

    private static string BuildStr(char c, int stringLength, int sepFreq, char sep)
    {
        for (int i = 0; i < stringLength; i++)
        {
            if (i % sepFreq == 0)
            {
                st.Append(sep);
            }
            else { st.Append(c); }
        }
        string t =  st.ToString();
        st.Clear();
        return t;
    }

    [GlobalSetup]
    public void Init()
    {
        _testStr = BuildStr('a', Size, SepFreq, _testChar[1][SplitCount - 1]);
    }

    [Params(16, 200, 1000, 10000)]
    public int Size { get; set; }
    
    [Params(1, 5, 200)]
    public int SepFreq { get; set; }

    [Params(1, 2)]
    public int SplitCount { get; set; }
    
    [Benchmark]
    public string[] Split()
    {
        return _testStr.Split(_testChar[SplitCount - 1]);
    }
}

public class Program
{
    public static void Main(string[] args)
    {
        BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args);
    }
}

@danmoseley
Member

@EgorBo @stephentoub @gfoidl is your feedback addressed?

@danmoseley danmoseley closed this Mar 23, 2022
@danmoseley danmoseley reopened this Mar 23, 2022
Member

@gfoidl gfoidl left a comment

@danmoseley I had another look; once these points are addressed, I'm happy with the PR 😄.

@@ -1609,14 +1609,13 @@ private void MakeSeparatorList(ReadOnlySpan<char> separators, ref ValueListBuild
}

// Special-case the common cases of 1, 2, and 3 separators, with manual comparisons against each separator.
- else if (separators.Length <= 3)
+ else if ((uint)separators.Length <= (uint)3)
Member

Is this cast still needed?
AFAIK JIT recognizes this now.

Member

@gfoidl do you mean we no longer need the pattern if ((uint)index > (uint)array.Length) that we have everywhere in the tree? If so we should have an issue to remove it.

Member

#62864 is the PR for that change (got merged 26 days ago).

If so we should have an issue to remove it.

Filed #67044 for it.

Contributor Author

That's great! Glad it got fixed, I'll get rid of it
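
For readers unfamiliar with the pattern discussed in this thread: casting both operands to `uint` folds a two-sided range check into one unsigned comparison, because a negative `int` reinterpreted as `uint` becomes a very large value. The sketch below is only an illustration; the class and method names are hypothetical and do not appear in the PR.

```csharp
using System;

static class RangeCheckSketch
{
    // Two comparisons: explicit lower and upper bound.
    public static bool InRangeSigned(int value, int limit)
        => value >= 0 && value <= limit;

    // One comparison: a negative value wraps to a large uint and fails the test.
    // Equivalent to InRangeSigned only when limit is non-negative.
    public static bool InRangeUnsigned(int value, int limit)
        => (uint)value <= (uint)limit;

    public static void Main()
    {
        foreach (int v in new[] { -1, 0, 3, 4 })
        {
            // Both columns should agree for every input.
            Console.WriteLine($"{v}: {InRangeSigned(v, 3)} {InRangeUnsigned(v, 3)}");
        }
    }
}
```

A span's `Length` is never negative, so in the diff above both forms accept the same inputs; per the comments in this thread, the cast was there only as a hint to the JIT, which #62864 is said to make unnecessary.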

// Redundant test so we won't prejit remainder of this method
// on platforms without SSE.
- if (!Sse41.IsSupported)
+ if (!Vector128.IsHardwareAccelerated)
Member

Please use the comment from the previous version (left side of comparison) to make it clear that this check is needed to avoid prejit.
Otherwise a Debug.Assert(Vector128.IsHardwareAccelerated) could do it too.

Contributor Author

@yesmey yesmey Mar 23, 2022

I'll reintroduce the comment with a small text change since it's not limited to only SSE anymore

int i = 0;

- for (; i < cond; i += Vector128<ushort>.Count)
+ while (offset <= lengthToExamine - (nuint)Vector128<ushort>.Count)
Member

@gfoidl gfoidl Mar 23, 2022

Above, in L1618, we guard by Vector128<ushort>.Count * 2, so when reaching this point we know for sure that enough elements are available. Thus this check isn't needed at this point, and you could change the loop to a do-while loop. That way the first iteration runs without any (further) pre-condition, and the check for more available elements is done after each iteration.

Contributor Author

Thanks. I'll add a Debug.Assert on entry to make it a little more obvious to the reader that it's a precondition
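
The loop shape suggested here can be sketched roughly as follows. This is not the code from the PR: `CountSeparators` is a hypothetical helper that only counts matches instead of recording their positions, and it assumes .NET 7 or later for the cross-platform `Vector128` API. The outer guard guarantees at least one full vector, so the `do`-`while` body runs once unconditionally and the bounds check happens only after each iteration.

```csharp
using System;
using System.Diagnostics;
using System.Numerics;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

static class DoWhileSketch
{
    // Hypothetical helper: counts occurrences of `separator` in `text`.
    public static int CountSeparators(ReadOnlySpan<char> text, char separator)
    {
        int count = 0;
        nuint offset = 0;
        nuint length = (nuint)text.Length;

        if (Vector128.IsHardwareAccelerated && text.Length >= Vector128<ushort>.Count * 2)
        {
            ref ushort source = ref Unsafe.As<char, ushort>(ref MemoryMarshal.GetReference(text));
            Vector128<ushort> sep = Vector128.Create((ushort)separator);

            // Precondition established by the guard above: one full vector fits.
            Debug.Assert(length >= (nuint)Vector128<ushort>.Count);

            do
            {
                // Compare eight UTF-16 code units at once; one mask bit per element.
                Vector128<ushort> chunk = Vector128.LoadUnsafe(ref source, offset);
                uint mask = Vector128.Equals(chunk, sep).ExtractMostSignificantBits();
                count += BitOperations.PopCount(mask);
                offset += (nuint)Vector128<ushort>.Count;
            } while (offset <= length - (nuint)Vector128<ushort>.Count);
        }

        // Scalar tail (or the whole input when the vector path is skipped).
        for (int i = (int)offset; i < text.Length; i++)
        {
            if (text[i] == separator)
            {
                count++;
            }
        }

        return count;
    }

    public static void Main()
    {
        Console.WriteLine(CountSeparators("a,b,c,d,e,f,g,h,i,j", ','));
    }
}
```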

{
- char curr = Unsafe.Add(ref c0, (IntPtr)(uint)i);
+ char curr = (char)Unsafe.Add(ref source, (nint)offset);
Member

Suggested change
- char curr = (char)Unsafe.Add(ref source, (nint)offset);
+ char curr = (char)Unsafe.Add(ref source, offset);

Not needed, there's an overload for nuint.

Contributor Author

I must've missed that it got added, thanks
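
A small illustration of the overload mentioned above, assuming .NET 6 or later, where `Unsafe.Add<T>(ref T, nuint)` is available. `CharAt` is a hypothetical helper, not code from the PR.

```csharp
using System;
using System.Runtime.CompilerServices;

static class UnsafeAddSketch
{
    // Reads the UTF-16 code unit at `offset` without bounds checking.
    // The caller must guarantee that `offset` is within the buffer.
    public static char CharAt(ref ushort source, nuint offset)
    {
        // No (nint) cast needed: the nuint overload takes the offset directly.
        return (char)Unsafe.Add(ref source, offset);
    }

    public static void Main()
    {
        ushort[] data = { 'a', 'b', 'c' };
        Console.WriteLine(CharAt(ref data[0], 2));
    }
}
```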

Member

@gfoidl gfoidl left a comment

Just one question -- otherwise LGTM.

@@ -1615,8 +1615,7 @@ private void MakeSeparatorList(ReadOnlySpan<char> separators, ref ValueListBuild
sep0 = separators[0];
sep1 = separators.Length > 1 ? separators[1] : sep0;
sep2 = separators.Length > 2 ? separators[2] : sep1;

- if (Length >= 16 && Sse41.IsSupported)
+ if (Vector128.IsHardwareAccelerated && Length >= Vector128<ushort>.Count * 2)
Member

Just to double-check: the * 2 is intentional as perf-numbers showed that?

Contributor Author

Yes, exactly. Smaller strings don't perform as well

@danmoseley
Member

methodtable assert is #64544
FSW crash is #67071
JSON assert is #60962

@danmoseley danmoseley merged commit b4e258a into dotnet:main Mar 24, 2022
radekdoulik pushed a commit to radekdoulik/runtime that referenced this pull request Mar 30, 2022
@EgorBo
Member

EgorBo commented Apr 7, 2022

Improvement on win-x64 dotnet/perf-autofiling-issues#4291

@danmoseley
Member

Nice drop in that graph, @yesmey. Do you plan to do more of this kind of work?

@ghost ghost locked as resolved and limited conversation to collaborators May 7, 2022
Labels
area-System.Runtime community-contribution Indicates that the PR has been added by a community member