Vectorize {Last}IndexOf{Any} and {Last}IndexOfAnyExcept without code duplication #73768

adamsitnik · 2022-08-11T12:40:20Z

@stephentoub @jkotas the only difference between LastIndexOf and LastIndexOfAnyExcept is an additional negation. I wanted to avoid code duplication without losing perf, so I've introduced a new interface and two structs that implement it (the first does ==, the second !=). By having the right generic constraints, I was able to get exactly the same perf (all the calls got inlined).

Codegen diff between my first and second commit: https://www.diffchecker.com/qeBXi1Rj

It works great for CLR RyuJIT x64, but I wonder what downsides it has (AOT support, generic code bloat?). Please let me know if using such a pattern is acceptable. If it is, I could vectorize more similar methods without duplicating the code.
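
For readers who have not seen the pattern, here is a minimal sketch of the idea under stated assumptions. The names `INegator`, `DontNegate`, `Negate`, `SearchSketch` and `LastIndexOfCore` are hypothetical and not taken from the PR, and a scalar loop stands in for the vectorized comparison. The point is only that one shared method body serves both operations, with the struct type argument selecting == or !=.

```csharp
using System;

// Hypothetical names; the PR's actual interface and structs may differ.
public interface INegator
{
    bool NegateIfNeeded(bool equals);
}

// Selected by LastIndexOf: keeps the result of the equality comparison.
public readonly struct DontNegate : INegator
{
    public bool NegateIfNeeded(bool equals) => equals;
}

// Selected by LastIndexOfAnyExcept: flips the comparison result.
public readonly struct Negate : INegator
{
    public bool NegateIfNeeded(bool equals) => !equals;
}

public static class SearchSketch
{
    // One shared loop. Because TNegator is constrained to a struct, the JIT
    // can compile a separate specialization per negator and inline the call.
    public static int LastIndexOfCore<T, TNegator>(ReadOnlySpan<T> span, T value)
        where T : struct, IEquatable<T>
        where TNegator : struct, INegator
    {
        for (int i = span.Length - 1; i >= 0; i--)
        {
            if (default(TNegator).NegateIfNeeded(span[i].Equals(value)))
            {
                return i;
            }
        }
        return -1;
    }

    public static int LastIndexOf<T>(ReadOnlySpan<T> span, T value)
        where T : struct, IEquatable<T>
        => LastIndexOfCore<T, DontNegate>(span, value);

    public static int LastIndexOfAnyExcept<T>(ReadOnlySpan<T> span, T value)
        where T : struct, IEquatable<T>
        => LastIndexOfCore<T, Negate>(span, value);
}
```

Whether the calls actually get inlined depends on the runtime and compiler in use, which is exactly the question raised above for AOT.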

ghost · 2022-08-11T12:40:36Z

Tagging subscribers to this area: @dotnet/area-system-memory
See info in area-owners.md if you want to be subscribed.

Author:	adamsitnik
Assignees:	-
Labels:	`NO-MERGE`, `area-System.Memory`
Milestone:	-

stephentoub · 2022-08-11T13:08:35Z

Haven't reviewed yet (won't be able to until later today), but in concept I'm happy with it. Can we do the same for IndexOf{Any} and the other overloads?

(@GrabYourPitchforks had actually suggested this approach initially and I'd looked at doing so when adding the methods initially, but the direct use of intrinsics made it challenging.)

jkotas · 2022-08-11T15:34:37Z

I wonder what downsides it has (AOT support, generic code bloat?)

The auxiliary types created in patterns like this have a larger static footprint. It is not a big deal if the set of instantiations is small and finite.

You can reduce this downside by using the existing types instead of introducing new ones. For example:

// You can also just define your own `class Negate { }` and `class DoNotNegate { }` instead of piggybacking on Int32/UInt32.
using Negate = System.Int32;
using DoNotNegate = System.UInt32;

...

SpanHelpers.LastIndexOfValueType<byte, Negate>(....)

...

if (typeof(N) == typeof(Negate))
{
    equals = ~equals;
}
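
The fragments above can be assembled into a small self-contained sketch. This is an illustration under assumptions, not the code from the PR: `MarkerSketch` is a hypothetical class, the loop is scalar, and `!equals` on a `bool` stands in for `~equals` on a vector mask. The marker type argument `N` is never instantiated; it only selects a branch.

```csharp
using System;
using Negate = System.Int32;       // marker only, never used as a number
using DoNotNegate = System.UInt32; // marker only

public static class MarkerSketch
{
    // N is a marker type argument. For value-type instantiations RyuJIT
    // typically treats typeof(N) == typeof(Negate) as a constant, so the
    // branch that does not apply is expected to be removed.
    public static int LastIndexOfValueType<T, N>(ReadOnlySpan<T> span, T value)
        where T : struct, IEquatable<T>
        where N : struct
    {
        for (int i = span.Length - 1; i >= 0; i--)
        {
            bool equals = span[i].Equals(value);
            if (typeof(N) == typeof(Negate))
            {
                equals = !equals;
            }
            if (equals)
            {
                return i;
            }
        }
        return -1;
    }
}
```

Compared with the interface-and-structs approach, this adds no new types to the assembly, at the cost of a less self-describing call site.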

jkotas · 2022-08-11T15:42:52Z

src/libraries/System.Private.CoreLib/src/System/MemoryExtensions.cs

-
-            return SpanHelpers.LastIndexOf<T>(ref MemoryMarshal.GetReference(span), value, span.Length);
-        }
+            => LastIndexOf((ReadOnlySpan<T>)span, value);


This is not inlineable when T is a generic type variable, due to current generic inlining limitations. It means that this change to just forward Span to ReadOnlySpan will come with some perf regression in some situations.

(Just pointing it out. I will leave it up to you whether to take this regression for simplicity. Either way is fine with me.)

stephentoub · 2022-08-11T16:26:15Z

It is not a big deal if the set of instantiations is small and finite.

I think that's the case here. The way this is set up it seems like there should be at most 8: 4 primitive types * 2 helpers.

adamsitnik · 2022-08-11T18:52:33Z

I think I've hit a bug in JIT: #73804

adamsitnik · 2022-08-11T18:55:34Z

I've ported IndexOf too, but some of the tests are failing due to unaligned reads for strlen:

runtime/src/libraries/System.Private.CoreLib/src/System/String.cs

Lines 599 to 601 in 3e0a5ad

// IndexOf processes memory in aligned chunks, and thus it won't crash even if it accesses memory beyond the null terminator.
// This IndexOf behavior is an implementation detail of the runtime and callers outside System.Private.CoreLib must not depend on it.
int length = SpanHelpers.IndexOf(ref *ptr, '\0', int.MaxValue);

I am going to continue working on this tomorrow

JulieLeeMSFT · 2022-08-11T19:14:17Z

I think I've hit a bug in JIT: #73804

@adamsitnik, is this PR for .NET 8 or .NET 7? Trying to decide if we need to fix #73804 in .NET 7 or 8.

stephentoub · 2022-08-11T19:20:49Z

@adamsitnik, is this PR for .NET 8 or .NET 7?

It's intended for 7.

jkotas · 2022-08-12T14:10:59Z

address review from Jan, don't cast Span to ROS.

This does not address the problem with inlining limitations that I have mentioned. It actually makes it worse, since both the ReadOnlySpan and Span overloads now take a performance hit. It was just the Span overload before the last commit.

Repro:

using System.Diagnostics;
using System.Runtime.CompilerServices;

ReadOnlySpan<MyStruct<string>> span = new MyStruct<string>[1];

var sw = new Stopwatch();
for (;;)
{
    sw.Restart();
    for (int i = 0; i < 100000000; i++) ContainsDefault(span);
    Console.WriteLine(sw.ElapsedMilliseconds);  
}

[MethodImpl(MethodImplOptions.NoInlining)]
static bool ContainsDefault<T>(ReadOnlySpan<T> span) where T: IEquatable<T>
   => span.Contains(default);

public struct MyStruct<T> : IEquatable<MyStruct<T>>
{
    int _value;

    bool IEquatable<MyStruct<T>>.Equals(MyStruct<T> other) => _value == other._value;
}

Baseline

729ms per iteration

Stacktrace to SpanHelpers.Contains:

System_Private_CoreLib!System.SpanHelpers.Contains<MyStruct<string>>+0x8
System_Private_CoreLib!System.MemoryExtensions.Contains<MyStruct<string>>+0x48
repro!Program.<<Main>$>g__ContainsDefault|0_0<MyStruct<string>>+0x3d [C:\repro\Program.cs @ 16] 
repro!Program.<Main>$+0xda [C:\repro\Program.cs @ 10]

Current PR:

820ms per iteration

Stacktrace to SpanHelpers.Contains. Notice an extra MemoryExtensions.Contains frame.

System_Private_CoreLib!System.SpanHelpers.Contains<MyStruct<string>>+0x31d [C:\runtime\src\libraries\System.Private.CoreLib\src\System\SpanHelpers.T.cs @ 270] 
System_Private_CoreLib!System.MemoryExtensions.Contains<MyStruct<string>>+0x48 [C:\runtime\src\libraries\System.Private.CoreLib\src\System\MemoryExtensions.cs @ 303] 
System_Private_CoreLib!System.MemoryExtensions.Contains<MyStruct<string>>+0x48 [C:\runtime\src\libraries\System.Private.CoreLib\src\System\MemoryExtensions.cs @ 278] 
repro!Program.<<Main>$>g__ContainsDefault|0_0<MyStruct<string>>+0x3d [C:\repro\Program.cs @ 16] 
repro!Program.<Main>$+0xda [C:\repro\Program.cs @ 10]

Current PR without the latest commit:

Same as baseline.

There is no good way to work around the inlining limitations without code duplication.

SamMonoRT · 2022-09-15T19:56:55Z

@adamsitnik - this has caused significant regressions in AOT-WASM and Interpreter WASM scenarios, as indicated in the linked issues above. The #74395 (comment) made last month indicated this could have been a reason for the regressions, but somehow we didn't quite follow up on that. I want to discuss how we can avoid this in the future, possibly with manual runs on Mono scenarios with the changes? @DrewScoggins - is it possible to call out improvements as well as regressions for the changes across various runs?

Also @adamsitnik - I believe it is quite late, but are there any remote chances we can revert those changes from 7.0/release?

cc @jeffhandley

adamsitnik · 2022-09-16T09:25:57Z

First of all, please excuse me for missing #74395 (comment)

I want to discuss how we can avoid this in the future, possibly with manual runs on Mono scenarios with the changes?

From my perspective all I need is documentation that describes how to benchmark and preferably how to profile WASM. We have an issue for that: dotnet/BenchmarkDotNet#1818 but it did not receive a lot of traction. We should extend https://github.com/dotnet/performance/blob/main/docs/benchmarking-workflow-dotnet-runtime.md and https://github.com/dotnet/performance/blob/main/docs/profiling-workflow-dotnet-runtime.md with WASM instructions so folks like me can just run the benchmarks themselves and avoid introducing regressions.

I believe it is quite late, but are there any remote chances we can revert those changes from 7.0/release?

I would prefer to not revert it, as it has brought a lot of perf improvements for arm64. In my opinion the best quick fix would be to re-introduce Vector<T> code paths for the methods that were previously vectorized and now are not (because WASM does not support Vector128). I am going to give it a try. The question is whether such a backport would be accepted? @jeffhandley @danmoseley ?

danmoseley · 2022-09-16T10:59:53Z

I think we'd be interested in a change that preserves the win for non WASM. It really depends on risk/confidence.

vargaz · 2022-09-16T14:14:02Z

We need a solution for net7 which can be implemented quickly and is low risk.

stephentoub · 2022-09-16T14:17:48Z

We need a solution for net7 which can be implemented quickly and is low risk.

I agree. I think what folks are pointing out is that reverting the previous changes is not low risk.

lewing · 2022-09-16T14:34:00Z

We need a solution for net7 which can be implemented quickly and is low risk.

I agree. I think what folks are pointing out is that reverting the previous changes is not low risk.

Which means a high risk change was committed post rc1 and leaves the wasm runtime in a fairly tight spot with no time to react.

stephentoub · 2022-09-16T14:37:05Z

Which means a high risk change was committed post rc1

What do you mean post-RC1? This change is in RC1.

lewing · 2022-09-16T14:37:47Z

I meant post branch for rc1, sorry for the imprecision

stephentoub · 2022-09-16T14:44:18Z

I meant post branch for rc1

Yes, it was merged into the rc1 branch a month ago, the day after the rc1 branch was snapped from main. I'm surprised that makes a material difference for wasm having time to react.

Regardless, we all agree we want to fix the wasm regressions; let's jointly find a solution rather than placing blame. Adam suggested adding in some Vector<T> paths. Tanner suggested some strategically placed ifdefs (though I don't know exactly what he had in mind). Are you pushing back against those? Are there other options on the table?

jkotas · 2022-09-16T14:51:47Z

The simplest, lowest-risk solution for .NET 7 is to just put whatever was there before under an #if MONO ifdef. No need to be creative this close to GA.
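
As a rough illustration of the shape of that suggestion (a sketch only: `IfdefSketch`, `ContainsOld` and `ContainsNew` are hypothetical stand-ins, not the actual SpanHelpers code, and it assumes the `MONO` symbol is defined when building for Mono):

```csharp
using System;

public static class IfdefSketch
{
    public static bool Contains(ReadOnlySpan<int> span, int value)
    {
#if MONO
        // Mono builds: take the implementation that existed before the PR.
        return ContainsOld(span, value);
#else
        // Other runtimes: keep the new implementation.
        return ContainsNew(span, value);
#endif
    }

    // Hypothetical stand-in for the pre-PR, non-generic code path.
    private static bool ContainsOld(ReadOnlySpan<int> span, int value)
    {
        for (int i = 0; i < span.Length; i++)
        {
            if (span[i] == value)
            {
                return true;
            }
        }
        return false;
    }

    // Hypothetical stand-in for the new shared generic code path.
    private static bool ContainsNew(ReadOnlySpan<int> span, int value)
        => span.IndexOf(value) >= 0;
}
```

Both branches must return the same results; only the code that runs differs, which is why this kind of change can be merged without first profiling the affected runtime.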

stephentoub · 2022-09-16T14:53:12Z

That sounds fine to me. I'm assuming that's what Tanner had in mind (but don't know for sure).

adamsitnik · 2022-09-16T14:53:29Z

My current idea:

[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static bool ExecuteVectorizedCodePath<T>(int length) where T : struct
#if TARGET_WASM
    => Vector.IsHardwareAccelerated && length >= Vector<T>.Count;
#else
    => Vector128.IsHardwareAccelerated && length >= Vector128<T>.Count;
#endif

[MethodImpl(MethodImplOptions.AggressiveOptimization)]
internal static bool ContainsValueType<T>(ref T searchSpace, T value, int length) where T : struct, INumber<T>
{
    if (!ExecuteVectorizedCodePath<T>(length))
    {
        // current non-vectorized code path
    }
#if TARGET_WASM
    else
    {
        // restored Vector<T> path
    }
#else
    else if (Vector256.IsHardwareAccelerated && length >= Vector256<T>.Count)
    {
        // current Vector256 code path
    }
    else
    {
        // current Vector128 code path
    }
#endif

    return false;
}

I am already working on a fix, but I don't know how to benchmark WASM AOT yet
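
To make the "restored Vector<T> path" placeholder concrete, here is a minimal sketch of what a `Vector<T>`-based Contains can look like. It is an illustration, not the code that was removed by the PR: `VectorTSketch` is a hypothetical class, the element type is fixed to `int`, and safe span slicing is used instead of the ref arithmetic found in CoreLib.

```csharp
using System;
using System.Numerics;

public static class VectorTSketch
{
    public static bool Contains(ReadOnlySpan<int> span, int value)
    {
        int i = 0;

        if (Vector.IsHardwareAccelerated && span.Length >= Vector<int>.Count)
        {
            // Broadcast the searched value to every lane.
            Vector<int> target = new Vector<int>(value);
            int lastBlockStart = span.Length - Vector<int>.Count;

            for (; i <= lastBlockStart; i += Vector<int>.Count)
            {
                Vector<int> block = new Vector<int>(span.Slice(i, Vector<int>.Count));

                // True if at least one lane of the block equals the target.
                if (Vector.EqualsAny(block, target))
                {
                    return true;
                }
            }
        }

        // Scalar loop for the remaining elements, or for the whole span
        // when vectorization is unavailable or the input is too short.
        for (; i < span.Length; i++)
        {
            if (span[i] == value)
            {
                return true;
            }
        }

        return false;
    }
}
```

`Vector<T>` has a platform-dependent width, so the same source adapts to whatever the runtime accelerates, whereas `Vector128<T>` and `Vector256<T>` paths have to be selected explicitly.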

jkotas · 2022-09-16T14:56:59Z

@adamsitnik #74395 (comment) says that the problem is caused by generics and depending on JIT/AOT doing complex optimizations to streamline the code.

I do not think adding Vector<T> paths back is going to fix this.

adamsitnik · 2022-09-16T15:01:09Z

@jkotas The report mentions a regression in Contains, which does not use the generic "hack"; that is why I am going to benchmark the Vector<T> approach first.

radekdoulik · 2022-09-16T15:45:59Z

There might be multiple issues. The one I saw when working on SIMD improvements is related to generics. We now end up with shared generic code in methods that were specialized (non-shared) before. In my case that leads to the resulting code not using SIMD intrinsics. More importantly, the shared generic code is also slower for the default (non-SIMD) cases, which is visible in the browser-bench measurements (screenshot in #75709). The most affected graph "flavors" are those with SIMD; in the others it is visible too, just on a smaller scale.

jkotas · 2022-09-16T15:47:03Z

The report mentions a regression in Contains

It is ImmutableArray.Contains, Queue.Contains, etc. All of these are implemented using IndexOf, which I believe was switched to the generic impl:

runtime/src/libraries/System.Collections.Immutable/src/System/Collections/Immutable/ImmutableArray_1.cs

Lines 278 to 281 in 57bfe47

public bool Contains(T item)
{
    return this.IndexOf(item) >= 0;
}

runtime/src/libraries/System.Private.CoreLib/src/System/Collections/Generic/Queue.cs

Lines 290 to 298 in 57bfe47

if (_head < _tail)
{
    return Array.IndexOf(_array, item, _head, _size) >= 0;
}
// We've wrapped around. Check both partitions, the least recently enqueued first.
return
    Array.IndexOf(_array, item, _head, _array.Length - _head) >= 0 ||
    Array.IndexOf(_array, item, 0, _tail) >= 0;

danmoseley · 2022-09-16T16:10:10Z

I don't know how to benchmark WASM AOT yet

The advantage of putting the old code back in #if MONO is that you shouldn't need to profile WASM, or at least it need not hold up merging the change. My preference FWIW is to get Mono back to the old codepath promptly, and look for confirmation from the perf lab in a couple days presumably.

adamsitnik · 2022-09-16T16:46:24Z

OK, I am going to bring back the old code for Mono only.

jeffhandley · 2022-10-06T05:52:45Z

Adding a reference link to where we closed the loop with the performance results that illustrate the mono regressions were indeed fixed:
dotnet/perf-autofiling-issues#7981 (comment)

adamsitnik added 2 commits August 11, 2022 13:27

Vectorize LastIndexOf

9d01665

use structs to get performant code without code duplication

14224a3

adamsitnik added the NO-MERGE (the PR is not ready for merge yet; see discussion for detailed reasons) and area-System.Memory labels Aug 11, 2022

ghost assigned adamsitnik Aug 11, 2022

adamsitnik added 6 commits August 11, 2022 16:01

use it in all possible places

d05cf56

vectorize LastIndexOfAny(value0, value1)

22f71ff

vectorize LastIndexOfAnyExcept(value0, value1)

bbc64e0

simplify it to make it easier to add 3 and 4 values overloads

f265d28

vectorize LastIndexOfAny and LastIndexOfAnyExcept for 3 values

b0f259c

rename (I am not convinced it's the best name yet)

7dc03c3

jkotas reviewed Aug 11, 2022

adamsitnik mentioned this pull request Aug 11, 2022

[arm64] JIT assertion failures for valid C# code #73804

Closed

This was referenced Aug 11, 2022

Infra improvements for Helix #68176

Closed

system.collections.concurrent.tests failed in CI #73038

Closed

Long Running Test: Interop/MonoAPI/MonoMono/PInvokeDetach/PInvokeDetach.sh #73040

Closed

adamsitnik added 5 commits August 12, 2022 10:22

vectorize Contains

628d429

hide the implementation details

114ca88

address review from Jan, don't cast Span to ROS. Introduce new helper and avoid duplication by calling it from both Span and ROS

95a1b62

vectorize IndexOf(value)

ab8df3d

rename IndexOf used only by strlen to IndexOfNullByte and optimize it to searching for zeros

bb13957

This was referenced Sep 15, 2022

[Perf] Linux/x64: 160 Regressions on 8/17/2022 6:09:21 PM dotnet/perf-autofiling-issues#7981

Closed

[Perf] Linux/x64: 27 Regressions on 8/17/2022 6:09:21 PM #74395

Closed

radekdoulik mentioned this pull request Sep 15, 2022

[wasm] Perf regression of Span.IndexOf #75709

Closed

Rob-Hague mentioned this pull request Sep 16, 2022

Audit MemoryExtensions.IndexOf variants #75754

Merged

jeffhandley mentioned this pull request Sep 20, 2022

[Mono] Restore old code to solve the recent SpanHelpers regressions #75917

Merged

2 tasks

adamsitnik mentioned this pull request Sep 28, 2022

Potential perf improvements for Mono AOT #76318

Closed

EgorBo mentioned this pull request Sep 28, 2022

.NET 7.0 RC1 Microbenchmarks Performance Study Report #76320

Closed

18 tasks

lewing mentioned this pull request Sep 28, 2022

[mono][interpreter] Fix performance of new span helpers #76326

Closed

This was referenced Sep 29, 2022

[Perf] ubuntu 20.04/arm64 : Improvement on 8/13/2022 7:28:42 AM dotnet/perf-autofiling-issues#7375

Closed

[Perf] Linux/arm64: 25 Improvements on 8/17/2022 3:53:31 PM dotnet/perf-autofiling-issues#7377

Closed

ghost locked as resolved and limited conversation to collaborators Nov 5, 2022
