Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JIT: Unroll Equals and StartsWith for constant strings and spans. [0..32] chars #65288

Merged
merged 38 commits into from
Mar 2, 2022

Conversation

EgorBo
Copy link
Member

@EgorBo EgorBo commented Feb 14, 2022

This PR unrolls Equals/StartsWith for "half" constant strings and spans in [0..32] chars length.
It does pretty much the same job as #64821 but in JIT this time to avoid problems listed there (e.g. low inliner's budget and throughput penalty). I decided to create a new PR to keep that as a reference ("what can be done in pure C#").

This PR handles the following APIs:

String.Equals(string s1, string s2) // which means it also handles `str == "str"`
String.Equals(string s1, string s2, StringComparison.Ordinal)

obj.Equals(string s1)
obj.Equals(string s1, StringComparison.Ordinal)
obj.StartsWith(string s1, StringComparison.Ordinal)

MemoryExtensions.SequenceEqual<char>(ROS<char> s1, ROS<char> s2)
MemoryExtensions.Equals(ROS<char> s1, ROS<char> s2, StringComparison.Ordinal)
MemoryExtensions.StartsWith(ROS<char> s1, ROS<char> s2, StringComparison.Ordinal)

It unrolls & vectorizes them when either s1 or s2 are constants for [0..32] utf16 chars range using SWAR, SSE, AVX or AdvSimd (arm64).

The existing code can be re-used to handle:

  • Span.CopyTo
  • OrdinalIgnoreCase for any case this PR already handles for just Ordinal
  • UTF8 string literals or RVA data (e.g. either, hopefully, upcoming Utf8String literals or ufcpp/StringLiteralGenerator).
  • [unlikely] CurrentCulture case if we add "IsAscii or fallback" check
  • EndsWith
  • Cover longer sequences for really hot blocks (PGO), e.g. via 4-8 128bit vectors on arm64

Motivation:

I've seen quite a few SequenceEqual against constant data in high-perf scenarios e.g. "is input one of the well-known headers?" or e.g. TechEmpower: https://github.com/TechEmpower/FrameworkBenchmarks/blob/master/frameworks/CSharp/aspnetcore/PlatformBenchmarks/BenchmarkApplication.cs#L98 - mostly on UTF8 inputs (I plan to support those eventually), but UTF16 are also not rare.

Also, #65222 relies on this PR (it introduced a couple of regressions in order to make code cleaner and BE-friendly and expects this PR to fix them)

Example:

bool Test1(string s) => s.StartsWith("https://", StringComparison.Ordinal);

bool Test2(string s) => s == "Access-Control-Allow-Credentials";

New codegen:

; Method Tests:Test1(System.String):bool
       vzeroupper 
       cmp      dword ptr [rcx+8], 8  ;;  <-- Length >= 8
       jne      SHORT G_M46665_IG04
       vmovdqu  xmm0, xmmword ptr [rcx+12]
       vpxor    xmm0, xmm0, xmmword ptr [reloc @RWD00]
       vptest   xmm0, xmm0
       sete     al
       movzx    rax, al
       jmp      SHORT G_M46665_IG05 ;; we end up doing 'movzx rax, al' twice :(
G_M46665_IG04:
       xor      eax, eax
G_M46665_IG05:
       movzx    rax, al
       ret      
RWD00  	dq	0070007400740068h, 002F002F003A0073h
; Total bytes of code: 41


; Method Tests:Test2(System.String):bool
       vzeroupper 
       test     rcx, rcx  ;;  <-- is null?
       je       SHORT G_M48330_IG06
       cmp      dword ptr [rcx+8], 32  ;;  <-- Length == 32
       jne      SHORT G_M48330_IG05
       vmovdqu  ymm0, ymmword ptr[rcx+12]
       vpxor    ymm0, ymm0, ymmword ptr[reloc @RWD00]
       vmovdqu  ymm1, ymmword ptr[rcx+44]  ;; <- btw, the "pipeline" issue
       vpxor    ymm1, ymm1, ymmword ptr[reloc @RWD32]
       vpor     ymm0, ymm0, ymm1
       vptest   ymm0, ymm0
       sete     al
       movzx    rax, al
       jmp      SHORT G_M48330_IG07
G_M48330_IG05:
       xor      eax, eax
       jmp      SHORT G_M48330_IG07
G_M48330_IG06:
       xor      eax, eax
G_M48330_IG07:
       movzx    rax, al
       vzeroupper 
       ret      
RWD00  	dq	0065006300630041h, 0043002D00730073h, 00720074006E006Fh, 0041002D006C006Fh
RWD32  	dq	0077006F006C006Ch, 006500720043002Dh, 0074006E00650064h, 0073006C00610069h
; Total bytes of code: 70

There are minor block layout issues but they're unrelated. The algorithm has a pretty simple heuristic (budget) to avoid doing too many unrolls in a single method.

Simple benchmark

[Benchmark]
[Arguments("https://google.com")]
public bool StartsWith(string s) => s.StartsWith("https://", StringComparison.Ordinal);
Method Toolchain Mean Error StdDev Ratio
StartsWith \Core_Root\corerun.exe 3.8516 ns 0.0050 ns 0.0045 ns 1.00
StartsWith \Core_Root_PR\corerun.exe 0.4802 ns 0.0019 ns 0.0015 ns 0.12

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Feb 14, 2022
@ghost ghost assigned EgorBo Feb 14, 2022
@EgorBo EgorBo marked this pull request as draft February 14, 2022 02:10
@ghost
Copy link

ghost commented Feb 14, 2022

Tagging subscribers to this area: @JulieLeeMSFT
See info in area-owners.md if you want to be subscribed.

Issue Details

This PR unrolls Equals/StartsWith for "half" constant strings and spans in [0..32] chars length.
It does pretty much the same job as #64821 but in JIT this time to avoid problems listed there (e.g. low inliner's budget and throughput penalty). I decided to create a new PR to keep that as a reference ("what can be done in pure C#").

This PR handles the following APIs:

String.Equals(string s1, string s2)
String.Equals(string s1, string s2, StringComparison.Ordinal)

strObj.Equals(string s1)
strObj.Equals(string s1, StringComparison.Ordinal)
strObj.StartsWith(string s1, StringComparison.Ordinal)

MemoryExtensions.SequenceEqual<char>(ROS<char> s1, ROS<char> s2)
MemoryExtensions.Equals(ROS<char> s1, ROS<char> s2, StringComparison.Ordinal)
MemoryExtensions.StartsWith(ROS<char> s1, ROS<char> s2, StringComparison.Ordinal)

It unrolls & vectorizes them when either s1 or s2 are constants for [0..32] utf16 chars range using SWAR, SSE, AVX or AdvSimd (arm64).

The existing code can be re-used to handle:

  • Span.CopyTo
  • OrdinalIgnoreCase for any case this PR already handles for just Ordinal
  • UTF8 string literals or RVA data (e.g. either, hopefully, upcoming Utf8String literals or ufcpp/StringLiteralGenerator).
  • [unlikely] CurrentCulture case if we add "IsAscii or fallback" check

Motivation:

I've seen quite a few SequenceEqual against constant data in high-perf scenarios e.g. "is input one of the well-known headers?" or e.g. TechEmpower: https://github.com/TechEmpower/FrameworkBenchmarks/blob/master/frameworks/CSharp/aspnetcore/PlatformBenchmarks/BenchmarkApplication.cs#L98 - mostly on UTF8 inputs (I plan to support those eventually), but UTF16 are also not rare.

Also, #65222 relies on this PR.

Example:

bool Test1(string s) => s.StartsWith("https://", StringComparison.Ordinal);

bool Test2(string s) => s == "Access-Control-Allow-Credentials";

New codegen:

; Method Tests:Test1(System.String):bool
       vzeroupper 
       cmp      dword ptr [rcx+8], 8  ;;  <-- Length >= 8
       jne      SHORT G_M46665_IG04
       vmovdqu  xmm0, xmmword ptr [rcx+12]
       vpxor    xmm0, xmm0, xmmword ptr [reloc @RWD00]
       vptest   xmm0, xmm0
       sete     al
       movzx    rax, al
       jmp      SHORT G_M46665_IG05 ;; we end up doing 'movzx rax, al' twice :(
G_M46665_IG04:
       xor      eax, eax
G_M46665_IG05:
       movzx    rax, al
       ret      
RWD00  	dq	0070007400740068h, 002F002F003A0073h
; Total bytes of code: 41


; Method Tests:Test2(System.String):bool
       vzeroupper 
       test     rcx, rcx  ;;  <-- is null?
       je       SHORT G_M48330_IG06
       cmp      dword ptr [rcx+8], 32  ;;  <-- Length == 32
       jne      SHORT G_M48330_IG05
       vmovdqu  ymm0, ymmword ptr[rcx+12]
       vpxor    ymm0, ymm0, ymmword ptr[reloc @RWD00]
       vmovdqu  ymm1, ymmword ptr[rcx+44]  ;; <- btw, the "pipeline" issue
       vpxor    ymm1, ymm1, ymmword ptr[reloc @RWD32]
       vpor     ymm0, ymm0, ymm1
       vptest   ymm0, ymm0
       sete     al
       movzx    rax, al
       jmp      SHORT G_M48330_IG07
G_M48330_IG05:
       xor      eax, eax
       jmp      SHORT G_M48330_IG07
G_M48330_IG06:
       xor      eax, eax
G_M48330_IG07:
       movzx    rax, al
       vzeroupper 
       ret      
RWD00  	dq	0065006300630041h, 0043002D00730073h, 00720074006E006Fh, 0041002D006C006Fh
RWD32  	dq	0077006F006C006Ch, 006500720043002Dh, 0074006E00650064h, 0073006C00610069h
; Total bytes of code: 70

There are minor block layout issues but they're unrelated. The algorithm has a pretty simple heuristic (budget) to avoid doing too many unrolls in a single method.

Simple benchmark

[Benchmark]
[Arguments("https://google.com")]
public bool StartsWith(string s) => s.StartsWith("https://", StringComparison.Ordinal);
Method Toolchain Mean Error StdDev Ratio
StartsWith \Core_Root\corerun.exe 3.8516 ns 0.0050 ns 0.0045 ns 1.00
StartsWith \Core_Root_PR\corerun.exe 0.4802 ns 0.0019 ns 0.0015 ns 0.12

NOTE: OrdinalIgnoreCase performance is terrible and if we implement it here we'll get something like this:
image
(from #46392)

PS: Spans aren't enabled in the current commit - there is a small issue with them I am trying to resolve.
TODO: Add more tests

Author: EgorBo
Assignees: -
Labels:

area-CodeGen-coreclr

Milestone: -

public static int Main()
{
int testCount = 0;
foreach (var method in typeof(Tests).GetMethods())
Copy link
Member

@am11 am11 Feb 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another way (that avoids reflection and repeating Equals_# methods) could be by marking ValidateEquals method public and dup the string instances:

         int testCount = 0;
         foreach (string testStr in Tests.s_TestData)
         {
             testCount++;
             string newStr = new string(testStr.ToCharArray());
             Tests.ValidateEquals(testStr == newStr, testStr, newStr);
         }

Copy link
Member Author

@EgorBo EgorBo Feb 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean to use SG/T4 generation? Because I do need explicit literals in methods to test

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

explicit literals

Ah, ok. 👍
(SG/T4 would pull in more infra code here than necessary. I was thinking if intention is to test other instance of same string, then we can clone it via char[] overload and reduce LOC)

@stephentoub
Copy link
Member

The algorithm has a pretty simple heuristic (budget) to avoid doing too many unrolls in a single method.

I assume this is the heuristic?
https://github.com/dotnet/runtime/pull/65288/files#diff-af8a9db9bb28767adad71c804c3a62d35b4e34ffbc64362b7138ab703ef964ccR3944
What's that going to mean in practice for regex? The generated regex method gets big quick with any reasonably-sized expression.

@GrabYourPitchforks
Copy link
Member

GrabYourPitchforks commented Feb 14, 2022

PS: Spans aren't enabled in the current commit - there is a small issue with them I am trying to resolve.

Remember that the "take advantage of the fact that strings are null-terminated" trick won't work with spans. Spans aren't guaranteed null-terminated.

Edit: there are a few other patterns that can be optimized as well. For example, string.StartsWith(char) (when char is anything other than null) can literally be a comparison against the _firstChar field without a length check.

@EgorBo
Copy link
Member Author

EgorBo commented Feb 14, 2022

What's that going to mean in practice for regex? The generated regex method gets big quick with any reasonably-sized expression.

There are methods like this in which Roslyn ends up emitting hundreds of str == "...:" and it's not clear what to do with them. Anyway, maybe the limit is not worth it at all, I'll check the diffs.
Also, if you emit really large methods with a lot of "cheap" or "AggressiveInlining" calls - those can easily stop being inlined very quickly due to inliner's budget.

Remember that the "take advantage of the fact that strings are null-terminated" trick won't work with spans. Spans aren't guaranteed null-terminated.

Yeah I don't rely on it at all here - it's left up-for-grabs 🙂 Even Length = 3 case is handled via two Int32 loads.
It's not Span and StartsWith friendly, also it feels a bit dangerous to rely on extra 2 bytes padding to be zeroed always. E.g. you had that AllocateUnitialized idea to improve string allocations that could brake this assumption.

Edit: there are a few other patterns that can be optimized as well. For example, string.StartsWith(char) (when char is anything other than null) can literally be a comparison against the _firstChar field without a length check.

That one is already optimized in Main via IsKnownConstant ;-)

@GrabYourPitchforks
Copy link
Member

That one is already optimized in Main via IsKnownConstant ;-)

Neat! That'll teach me to look at main sources rather than 6.0 sources. 😁

@EgorBo EgorBo marked this pull request as ready for review February 16, 2022 15:23
@EgorBo
Copy link
Member Author

EgorBo commented Feb 16, 2022

I think I've added everything I wanted to add, it now handles Span too, full list of APIs:

//   1) String.Equals(string, string) // which means `str == "..."` too
//   2) String.Equals(string, string, StringComparison.Ordinal)
//   3) str.Equals(string)
//   4) str.Equals(String, StringComparison.Ordinal)
//   5) str.StartsWith(string, StringComparison.Ordinal)
//   6) MemoryExtensions.SequenceEqual<char>(ROS<char>, ROS<char>)
//   7) MemoryExtensions.Equals(ROS<char>, ROS<char>, StringComparison.Ordinal)
//   8) MemoryExtensions.StartsWith<char>(ROS<char>, ROS<char>)
//   9) MemoryExtensions.StartsWith(ROS<char>, ROS<char>, StringComparison.Ordinal)

all of them work when one of the arguments (left or right) is a string literal or string.op_Implicit("str to span") or "str".AsSpan(). for [0..16] inputs on ARM64 and SSE2 and additionally [16..32] for AVX.

I do plan to add CopyTo and OrdinalIgnoreCase in my spare time as they should be relatively easy to follow up.
And, ideally, RVA<byte> support.

This PR needs more tests for all kinds of cases and an additional test for cross-page boundary (just in case) and I need to investigate the diffs (SPMI ones are empty because of lack of needed context - getStringLiteral VM API) but the changes are more or less final to me and ready for review.

src/coreclr/jit/importer_vectorization.cpp Outdated Show resolved Hide resolved
src/coreclr/jit/importer_vectorization.cpp Outdated Show resolved Hide resolved
src/coreclr/jit/importer_vectorization.cpp Outdated Show resolved Hide resolved
src/coreclr/jit/importer_vectorization.cpp Outdated Show resolved Hide resolved
src/coreclr/jit/importer_vectorization.cpp Outdated Show resolved Hide resolved
EgorBo and others added 4 commits February 26, 2022 17:29
Co-authored-by: SingleAccretion <62474226+SingleAccretion@users.noreply.github.com>
Co-authored-by: SingleAccretion <62474226+SingleAccretion@users.noreply.github.com>
@EgorBo
Copy link
Member Author

EgorBo commented Mar 2, 2022

/azp run runtime-coreclr outerloop, runtime-libraries-coreclr outerloop

@azure-pipelines
Copy link

Azure Pipelines successfully started running 2 pipeline(s).

@EgorBo
Copy link
Member Author

EgorBo commented Mar 2, 2022

All outerloop failures so far are #65991 (and a few #58616)

Mono llvmfullaot Pri0 is likely stuck because of #66083

@EgorBo
Copy link
Member Author

EgorBo commented Mar 17, 2022

Improvement on alpine-x64 dotnet/perf-autofiling-issues#3978

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants