[RyuJIT] Unroll StartsWith/SequenceEqual for ([ReadOnly]Span<char>, const-string) #46392

EgorBo · 2020-12-24T21:50:54Z

This PR optimizes the following APIs:

bool MemoryExtensions.StartsWith(ReadOnlySpan<T>, ReadOnlySpan<T>)
bool MemoryExtensions.StartsWith(ReadOnlySpan<T>, ReadOnlySpan<T>, StringComparison)
bool MemoryExtensions.SequenceEqual(ReadOnlySpan<T>, ReadOnlySpan<T>)
bool MemoryExtensions.Equals(ReadOnlySpan<T>, ReadOnlySpan<T>, StringComparison)

when the second argument is a constant string. For now, it only handles strings with a length of 1, 2, or 4 chars, but it can be extended to handle longer strings as well via SIMD (the [8..32] range). Here are some benchmarks.

It addresses #45613, but only for spans, since we promote them for high-performance tasks. In theory, it can then be extended to support plain string objects if needed.
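As a rough illustration of the length-4 case, the sketch below is a hand-written C# equivalent of the shape hinted at in the commit message (`span.Length >= 4` plus a single 8-byte compare). The class and method names are hypothetical, the packed constant assumes a little-endian target, and this is not the code the JIT change emits, only an approximation of the idea.

```csharp
using System;
using System.Runtime.InteropServices;

static class UnrolledStartsWith
{
    // Hypothetical helper: a manual equivalent of span.StartsWith("http").
    public static bool StartsWithHttp(ReadOnlySpan<char> span)
    {
        // "http" as four UTF-16 code units packed into one 64-bit value.
        // Assumption: little-endian layout (first char in the lowest 16 bits).
        const ulong Http = (ulong)'h'
                         | ((ulong)'t' << 16)
                         | ((ulong)'t' << 32)
                         | ((ulong)'p' << 48);

        // The length check comes first so the 8-byte read never goes out of bounds.
        return span.Length >= 4
            && MemoryMarshal.Read<ulong>(MemoryMarshal.AsBytes(span)) == Http;
    }
}
```

Calling `UnrolledStartsWith.StartsWithHttp("https://google.com")` performs one length check and one 64-bit comparison instead of a call into the general comparison routine.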

1. Small strings (e.g. length is 4)

[Benchmark]
[Arguments("https://google.com")]
public bool SpanStartsWith(string str) => str.AsSpan().StartsWith("http");

       |       Mean |
       |-----------:|
master |  4.0456 ns |
    PR |  0.3910 ns |   10x faster

Codegen diff example: https://www.diffchecker.com/BeNpMF78

2. Small strings (e.g. length is 4, ignore case)

[Benchmark]
[Arguments("https://google.com")]
public bool SpanStartsWith(string str) => str.AsSpan().StartsWith("http", StringComparison.OrdinalIgnoreCase);

         |       Mean |
         |-----------:|
  master | 19.7386 ns |
      PR |  0.5509 ns |   35x faster
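The ignore-case variant can be approximated in the same way by OR-ing the loaded value with 0x20 per char before the compare, which folds ASCII upper-case letters onto lower-case ones. This is a minimal sketch under stated assumptions: the names are hypothetical, the target is little-endian, every char of the constant is an ASCII letter, and the sketch does not claim exact parity with `StringComparison.OrdinalIgnoreCase` for non-ASCII input.

```csharp
using System;
using System.Runtime.InteropServices;

static class UnrolledStartsWithIgnoreCase
{
    // Hypothetical helper: approximates
    // span.StartsWith("http", StringComparison.OrdinalIgnoreCase) for ASCII input.
    public static bool StartsWithHttpIgnoreCase(ReadOnlySpan<char> span)
    {
        // Lower-case "http" packed into 64 bits (little-endian assumption).
        const ulong Http = (ulong)'h'
                         | ((ulong)'t' << 16)
                         | ((ulong)'t' << 32)
                         | ((ulong)'p' << 48);

        // 0x20 in every 16-bit lane: setting this bit maps 'A'..'Z' to 'a'..'z'.
        // This is only valid because every char of the constant is an ASCII letter.
        const ulong CaseMask = 0x0020_0020_0020_0020UL;

        return span.Length >= 4
            && (MemoryMarshal.Read<ulong>(MemoryMarshal.AsBytes(span)) | CaseMask) == Http;
    }
}
```

For a constant that mixes letters with digits or punctuation, the mask would need 0x20 only in the lanes that hold letters; otherwise unrelated chars could compare equal.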

3. Bigger strings (e.g. length is 23)

[Benchmark]
public bool SpanStartsWith(string str) => str.AsSpan().StartsWith("ProxyAuthenticateHeader");

         |       Mean |
         |-----------:|
  master |  3.1628 ns |
proposed |  0.5293 ns |   5x faster (two AVX2 vectors)

4. Bigger strings (e.g. length is 23, ignore case)

[Benchmark]
public bool SpanStartsWith(string str) => 
    str.AsSpan().StartsWith("ProxyAuthenticateHeader", StringComparison.OrdinalIgnoreCase);

         |       Mean |
         |-----------:|
  master | 40.4105 ns |
proposed |  0.6116 ns |   66x faster (two AVX2 vectors)

Standalone benchmark: https://gist.github.com/EgorBo/8a4e4cda14eac0e605dd7bac68c56314

Inspired by LLVM: https://godbolt.org/z/8fcqfb

/cc @jkotas @dotnet/jit-contrib

Dotnet-GitSync-Bot · 2020-12-24T21:50:58Z

I couldn't figure out the best area label to add to this PR. If you have write-permissions please help me learn by adding exactly one area label.

src/coreclr/jit/importer.cpp

EgorBo · 2020-12-25T12:35:42Z

For SIMD we only need to emit something like this in JIT:


           *  COMMA  
           +--*  ASG       simd32
           |  +--*  LCL_VAR   simd32<Vector256`1[UInt16]> V07 tmp3
           |  \--*  HWINTRINSIC simd32 ushort Or
           |     +--*  HWINTRINSIC simd32 ushort Xor
           |     |  +--*  HWINTRINSIC simd32 ushort Or
           |     |  |  +--*  LCL_VAR   simd32<Vector256`1[UInt16]> V02 loc0
           |     |  |  \--*  HWINTRINSIC simd32 ushort Create (.........)
           |     |  \--*  HWINTRINSIC simd32 ushort Create
           |     \--*  HWINTRINSIC simd32 ushort Xor
           |        +--*  HWINTRINSIC simd32 ushort Or
           |        |  +--*  LCL_VAR   simd32<Vector256`1[UInt16]> V03 loc1
           |        |  \--*  HWINTRINSIC simd32 ushort Create (.........)
           |        \--*  HWINTRINSIC simd32 ushort Create
           \--*  HWINTRINSIC bool   ushort TestZ
              +--*  LCL_VAR   simd32<Vector256`1[UInt16]> V07 tmp3
              \--*  LCL_VAR   simd32<Vector256`1[UInt16]> V07 tmp3

Depending on the constant string's size, it can also be a single vector (128-bit or 256-bit).
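Read bottom-up, the tree above loads two 256-bit vectors, ORs each with a mask, XORs each with a constant, ORs the two results together, and tests the combined value for zero. A source-level C# sketch of that shape for the 23-char ignore-case benchmark might look like the following. The class, method, and helper names are hypothetical; the second load overlapping the first at offset 7 (23 - 16) is an assumption about how two vectors cover 23 chars; and the AVX2 path folds case for ASCII letters only, so it is not claimed to match `OrdinalIgnoreCase` for non-ASCII input.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class SimdStartsWith
{
    // Lower-cased constant, 23 chars, all ASCII letters.
    private const string Lower = "proxyauthenticateheader";

    // Reads 16 chars (32 bytes) starting at 'start' as one 256-bit vector.
    private static Vector256<ushort> Load(ReadOnlySpan<char> s, int start) =>
        MemoryMarshal.Read<Vector256<ushort>>(MemoryMarshal.AsBytes(s.Slice(start, 16)));

    // Hypothetical helper mirroring the Or / Xor / Or / TestZ tree.
    public static bool StartsWithProxyAuthIgnoreCase(ReadOnlySpan<char> span)
    {
        if (span.Length < Lower.Length)
            return false;

        // Fall back to the regular API when AVX2 is unavailable.
        if (!Avx2.IsSupported)
            return span.StartsWith(Lower, StringComparison.OrdinalIgnoreCase);

        // 0x20 in every lane folds ASCII upper-case letters to lower-case.
        Vector256<ushort> caseMask = Vector256.Create((ushort)0x20);

        // Chars [0..16) and [7..23): the two loads overlap to cover all 23 chars.
        Vector256<ushort> diff0 = Avx2.Xor(Avx2.Or(Load(span, 0), caseMask), Load(Lower, 0));
        Vector256<ushort> diff1 = Avx2.Xor(Avx2.Or(Load(span, 7), caseMask), Load(Lower, 7));

        // Any non-zero bit in either difference means a mismatch.
        Vector256<ushort> diff = Avx2.Or(diff0, diff1);
        return Avx.TestZ(diff, diff);
    }
}
```

In this sketch the constant vectors are rebuilt on every call for brevity; in a JIT-emitted version they would be compile-time constants, as the `Create` nodes in the tree suggest.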

AndyAyersMS · 2021-01-23T03:16:19Z

cc @BruceForstall

EgorBo · 2021-03-08T15:54:03Z

Any interest in the above, or should I close?

BruceForstall · 2021-03-08T22:29:14Z

@EgorBo My thought is that this is a lot of code for a very specific set of APIs, and quite restricted compared to the generality of the specified APIs (e.g., only certain types of strings). Is it sufficiently motivated? Namely, is there no other way for users to achieve better performance with a source-level implementation that would be "good enough", without the need to support all-platforms with JIT changes? Could the JIT do better (even if not optimal) with the general case without converting these to intrinsics? Could the source implementation use already existing hardware intrinsics? Could a new, specific, set of source-level APIs be defined to handle these special cases, not requiring JIT changes to achieve the perf results?

EgorBo · 2021-03-08T22:39:42Z

OK, that makes sense. I'm going to close it.

PS: Still, this benchmark:

[Benchmark]
public bool SpanStartsWith(string str) => 
    str.AsSpan().StartsWith("ProxyAuthenticateHeader", StringComparison.OrdinalIgnoreCase);

         |       Mean |
         |-----------:|
  master | 40.4105 ns |
proposed |  0.6116 ns |   66x faster (two AVX2 vectors)

is a sign we can do better there.

EgorBo added 2 commits December 24, 2020 20:32

Optimize span.StartsWith("cstr") to `span.Length >= 4 && *(span._po…

389a1a8

…inter) == ToHex("cstr")`

Optimize span.StartsWith("cstr") to `span.Length >= 4 && *(span._po…

22e2cc2

…inter) == ToHex("cstr")`

EgorBo added the area-CodeGen-coreclr label (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI) Dec 24, 2020

EgorBo added 2 commits December 25, 2020 11:23

Extract into a separate function

7e24372

Add SequenceEqual support

a6ecf24

EgorBo changed the title ~~[RyuJIT] Optimize MemoryExtensions.StartsWith(Span,Span) for const strings~~ [RyuJIT] Unroll StartsWith/SequenceEqual for ([ReadOnly]Span<char>, const-string) Dec 25, 2020

gfoidl reviewed Dec 25, 2020

src/coreclr/jit/importer.cpp (outdated, resolved)

EgorBo added 4 commits December 25, 2020 13:45

Add "ignoreCase" support, fix a bug, formatting

3b1df76

Fix "ignoreCase"

24c33e9

Clean up

d465e60

Formatting

d1e5931

EgorBo added 2 commits December 25, 2020 17:42

Add tests

0561cac

Cover more methods, add ignore case support.

1f25caf

EgorBo force-pushed the jit-intrin-memoryext-startswith branch from 57a2833 to 1f25caf Compare December 25, 2020 20:11

EgorBo added 4 commits December 25, 2020 23:30

Remove redundant intrinsics

8cdda9b

Handle "..".AsSpan()

790df25

More tests, formatting

3d2ded2

Fix failing tests

27232d8

EgorBo force-pushed the jit-intrin-memoryext-startswith branch from 0646428 to 27232d8 Compare December 26, 2020 12:45

Clean up

e108914

JulieLeeMSFT requested a review from BruceForstall February 8, 2021 21:51

JulieLeeMSFT assigned EgorBo Feb 8, 2021

JulieLeeMSFT added this to the 6.0.0 milestone Feb 8, 2021

Base automatically changed from master to main March 1, 2021 09:07

EgorBo closed this Mar 8, 2021

ghost locked as resolved and limited conversation to collaborators Apr 7, 2021
