Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RyuJIT] Unroll StartsWith/SequenceEqual for ([ReadOnly]Span<char>, const-string) #46392

Closed
wants to merge 15 commits into from

Conversation

EgorBo
Copy link
Member

@EgorBo EgorBo commented Dec 24, 2020

This PR optimizes the following APIs:

bool MemoryExtensions.StartsWith(ReadOnlySpan<T>, ReadOnlySpan<T>)
bool MemoryExtensions.StartsWith(ReadOnlySpan<T>, ReadOnlySpan<T>, StringComparison)
bool MemoryExtensions.SequenceEqual(ReadOnlySpan<T>, ReadOnlySpan<T>)
bool MemoryExtensions.Equals(ReadOnlySpan<T>, ReadOnlySpan<T>, StringComparison)

when the second arg is a constant string. For now it only handles strings with length of 1, 2 or 4 chars, but it can be extended to handle longer strings as well via SIMD ([8..32] range), here are some benchmarks.

It addresses #45613 but only for spans since we promote them for high-performance tasks. In theory, it can be then extended to support plain string objects if needed.

1. Small strings (e.g. length is 4)

[Benchmark]
[Arguments("https://google.com")]
public bool SpanStartsWith(string str) => str.AsSpan().StartsWith("http");
       |       Mean |
       |-----------:|
master |  4.0456 ns |
    PR |  0.3910 ns |   10x faster

Codegen diff example: https://www.diffchecker.com/BeNpMF78

2. Small strings (e.g. length is 4, ignore case)

[Benchmark]
[Arguments("https://google.com")]
public bool SpanStartsWith(string str) => str.AsSpan().StartsWith("http", StringComparison.OrdinalIgnoreCase);
         |       Mean |
         |-----------:|
  master | 19.7386 ns |
      PR |  0.5509 ns |   35x faster

3. Bigger strings (e.g. length is 23)

[Benchmark]
public bool SpanStartsWith(string str) => str.AsSpan().StartsWith("ProxyAuthenticateHeader");
         |       Mean |
         |-----------:|
  master |  3.1628 ns |
proposed |  0.5293 ns |   5x faster (two AVX2 vectors)

2. Bigger strings (e.g. length is 23, ignore case)

[Benchmark]
public bool SpanStartsWith(string str) => 
    str.AsSpan().StartsWith("ProxyAuthenticateHeader", StringComparison.OrdinalIgnoreCase);
         |       Mean |
         |-----------:|
  master | 40.4105 ns |
proposed |  0.6116 ns |   66x faster (two AVX2 vectors)

Standalone benchmark: https://gist.github.com/EgorBo/8a4e4cda14eac0e605dd7bac68c56314

Inspired by LLVM: https://godbolt.org/z/8fcqfb

/cc @jkotas @dotnet/jit-contrib

@Dotnet-GitSync-Bot
Copy link
Collaborator

I couldn't figure out the best area label to add to this PR. If you have write-permissions please help me learn by adding exactly one area label.

@EgorBo EgorBo added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Dec 24, 2020
@EgorBo EgorBo changed the title [RyuJIT] Optimize MemoryExtensions.StartsWith(Span,Span) for const strings [RyuJIT] Unroll StartsWith/SequenceEqual for ([ReadOnly]Span<char>, const-string) Dec 25, 2020
@EgorBo
Copy link
Member Author

EgorBo commented Dec 25, 2020

For SIMD we only need to emit something like this in JIT:


           *  COMMA  
           +--*  ASG       simd32
           |  +--*  LCL_VAR   simd32<Vector256`1[UInt16]> V07 tmp3
           |  \--*  HWINTRINSIC simd32 ushort Or
           |     +--*  HWINTRINSIC simd32 ushort Xor
           |     |  +--*  HWINTRINSIC simd32 ushort Or
           |     |  |  +--*  LCL_VAR   simd32<Vector256`1[UInt16]> V02 loc0
           |     |  |  \--*  HWINTRINSIC simd32 ushort Create (.........)
           |     |  \--*  HWINTRINSIC simd32 ushort Create
           |     \--*  HWINTRINSIC simd32 ushort Xor
           |        +--*  HWINTRINSIC simd32 ushort Or
           |        |  +--*  LCL_VAR   simd32<Vector256`1[UInt16]> V03 loc1
            \       |  \--*  HWINTRINSIC simd32 ushort Create (.........)
             \      \--*  HWINTRINSIC simd32 ushort Create
              \--*  HWINTRINSIC bool   ushort TestZ
                 +--*  LCL_VAR   simd32<Vector256`1[UInt16]> V07 tmp3
                 \--*  LCL_VAR   simd32<Vector256`1[UInt16]> V07 tmp3

it depends on const string size, it can be a single vector (128 or 256bit).

@EgorBo EgorBo force-pushed the jit-intrin-memoryext-startswith branch from 57a2833 to 1f25caf Compare December 25, 2020 20:11
@EgorBo EgorBo force-pushed the jit-intrin-memoryext-startswith branch from 0646428 to 27232d8 Compare December 26, 2020 12:45
@AndyAyersMS
Copy link
Member

cc @BruceForstall

@JulieLeeMSFT JulieLeeMSFT added this to the 6.0.0 milestone Feb 8, 2021
Base automatically changed from master to main March 1, 2021 09:07
@EgorBo
Copy link
Member Author

EgorBo commented Mar 8, 2021

Any interest on the above or I should close?

@BruceForstall
Copy link
Member

@EgorBo My thought is that this is a lot of code for a very specific set of APIs, and quite restricted compared to the generality of the specified APIs (e.g., only certain types of strings). Is it sufficiently motivated? Namely, is there no other way for users to achieve better performance with a source-level implementation that would be "good enough", without the need to support all-platforms with JIT changes? Could the JIT do better (even if not optimal) with the general case without converting these to intrinsics? Could the source implementation use already existing hardware intrinsics? Could a new, specific, set of source-level APIs be defined to handle these special cases, not requiring JIT changes to achieve the perf results?

@EgorBo
Copy link
Member Author

EgorBo commented Mar 8, 2021

@EgorBo My thought is that this is a lot of code for a very specific set of APIs, and quite restricted compared to the generality of the specified APIs (e.g., only certain types of strings). Is it sufficiently motivated? Namely, is there no other way for users to achieve better performance with a source-level implementation that would be "good enough", without the need to support all-platforms with JIT changes? Could the JIT do better (even if not optimal) with the general case without converting these to intrinsics? Could the source implementation use already existing hardware intrinsics? Could a new, specific, set of source-level APIs be defined to handle these special cases, not requiring JIT changes to achieve the perf results?

Ok, makes sense going to close it.

PS: Still, this benchmark:

[Benchmark]
public bool SpanStartsWith(string str) => 
    str.AsSpan().StartsWith("ProxyAuthenticateHeader", StringComparison.OrdinalIgnoreCase);
         |       Mean |
         |-----------:|
  master | 40.4105 ns |
proposed |  0.6116 ns |   66x faster (two AVX2 vectors)

is a sign we can do better there.

@EgorBo EgorBo closed this Mar 8, 2021
@ghost ghost locked as resolved and limited conversation to collaborators Apr 7, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

6 participants