Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[arm64] Unexpected performance degradation for Vector128<short> #73921

Closed
adamsitnik opened this issue Aug 14, 2022 · 3 comments
Closed

[arm64] Unexpected performance degradation for Vector128<short> #73921

adamsitnik opened this issue Aug 14, 2022 · 3 comments
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI tenet-performance Performance related issue
Milestone

Comments

@adamsitnik
Copy link
Member

While working on #73768 I've re-implemented Span.IndexOf and Span.LastIndexOf by using Vector128<T> with generic implementation that can be used for byte, short, int and long.

IndexOfValueType and LastIndexOfValueType are very similar, but the first is performant for both byte and short, while the latter only for byte. It's very likely related to #73804

Repro:

public class Repro
{
    private byte[] _bytes;
    private short[] _shorts;

    public Repro()
    {
        _bytes = new byte[512];
        _bytes[_bytes.Length / 2] = 1;
        _shorts = new short[512];
        _shorts[_shorts.Length / 2] = 1;
    }

    [Benchmark]
    public int Byte_FirstIndex_BuiltIn() => new ReadOnlySpan<byte>(_bytes).IndexOf((byte)1);

    [Benchmark]
    public int Byte_Lastndex_BuiltIn() => new ReadOnlySpan<byte>(_bytes).LastIndexOf((byte)1);

    [Benchmark]
    public int Short_FirstIndex_BuiltIn() => new ReadOnlySpan<short>(_shorts).IndexOf((short)1);

    [Benchmark]
    public int Short_Lastndex_BuiltIn() => new ReadOnlySpan<short>(_shorts).LastIndexOf((short)1);

    [Benchmark]
    public int Byte_FirstIndex() => IndexOfValueType<byte>(ref _bytes[0], 1, _bytes.Length);

    [Benchmark]
    public int Byte_Lastndex() => LastIndexOfValueType<byte>(ref _bytes[0], 1, _bytes.Length);

    [Benchmark]
    public int Short_FirstIndex() => IndexOfValueType<short>(ref _shorts[0], 1, _shorts.Length);

    [Benchmark]
    public int Short_Lastndex() => LastIndexOfValueType<short>(ref _shorts[0], 1, _shorts.Length);

    // this method is performant for bytes and shorts
    [MethodImpl(MethodImplOptions.AggressiveOptimization)]
    private static int IndexOfValueType<T>(ref T searchSpace, T value, int length) where T : struct, INumber<T>
    {
        if (!Vector128.IsHardwareAccelerated || length < Vector128<T>.Count)
        {
            // removed for brevity
        }
        else
        {
            Vector128<T> equals, values = Vector128.Create(value);
            ref T currentSearchSpace = ref searchSpace;
            ref T oneVectorAwayFromEnd = ref Unsafe.Add(ref searchSpace, length - Vector128<T>.Count);

            // Loop until either we've finished all elements or there's less than a vector's-worth remaining.
            do
            {
                equals = Vector128.Equals(values, Vector128.LoadUnsafe(ref currentSearchSpace));
                if (equals == Vector128<T>.Zero)
                {
                    currentSearchSpace = ref Unsafe.Add(ref currentSearchSpace, Vector128<T>.Count);
                    continue;
                }

                return ComputeFirstIndex(ref searchSpace, ref currentSearchSpace, equals);
            }
            while (!Unsafe.IsAddressGreaterThan(ref currentSearchSpace, ref oneVectorAwayFromEnd));

            // If any elements remain, process the first vector in the search space.
            if ((uint)length % Vector128<T>.Count != 0)
            {
                equals = Vector128.Equals(values, Vector128.LoadUnsafe(ref oneVectorAwayFromEnd));
                if (equals != Vector128<T>.Zero)
                {
                    return ComputeFirstIndex(ref searchSpace, ref oneVectorAwayFromEnd, equals);
                }
            }
        }

        return -1;
    }

    // this method is performant for bytes, but NOT for shorts
    [MethodImpl(MethodImplOptions.AggressiveOptimization)]
    private static int LastIndexOfValueType<T>(ref T searchSpace, T value, int length)
        where T : struct, INumber<T>
    {
        if (!Vector128.IsHardwareAccelerated || length < Vector128<T>.Count)
        {
            // removed for brevity
        }
        else
        {
            Vector128<T> equals, values = Vector128.Create(value);
            ref T currentSearchSpace = ref Unsafe.Add(ref searchSpace, length - Vector128<T>.Count);

            // Loop until either we've finished all elements or there's less than a vector's-worth remaining.
            do
            {
                equals = Vector128.Equals(values, Vector128.LoadUnsafe(ref currentSearchSpace));
                if (equals == Vector128<T>.Zero)
                {
                    currentSearchSpace = ref Unsafe.Subtract(ref currentSearchSpace, Vector128<T>.Count);
                    continue;
                }

                return ComputeLastIndex(ref searchSpace, ref currentSearchSpace, equals);
            }
            while (!Unsafe.IsAddressLessThan(ref currentSearchSpace, ref searchSpace));

            // If any elements remain, process the first vector in the search space.
            if ((uint)length % Vector128<T>.Count != 0)
            {
                equals = Vector128.Equals(values, Vector128.LoadUnsafe(ref searchSpace));
                if (equals != Vector128<T>.Zero)
                {
                    return ComputeLastIndex(ref searchSpace, ref searchSpace, equals);
                }
            }
        }

        return -1;
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private static int ComputeFirstIndex<T>(ref T searchSpace, ref T current, Vector128<T> equals) where T : struct
    {
        uint notEqualsElements = equals.ExtractMostSignificantBits();
        int index = BitOperations.TrailingZeroCount(notEqualsElements);
        return index + (int)((long)Unsafe.ByteOffset(ref searchSpace, ref current) / Unsafe.SizeOf<T>());
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private static int ComputeLastIndex<T>(ref T searchSpace, ref T current, Vector128<T> equals) where T : struct
    {
        uint notEqualsElements = equals.ExtractMostSignificantBits();
        int index = 31 - BitOperations.LeadingZeroCount(notEqualsElements); // 31 = 32 (bits in Int32) - 1 (indexing from zero)
        return (int)((long)Unsafe.ByteOffset(ref searchSpace, ref current) / Unsafe.SizeOf<T>()) + index;
    }
}
BenchmarkDotNet=v0.13.1.1845-nightly, OS=ubuntu 20.04
Unknown processor
.NET SDK=7.0.100-rc.1.22413.1
  [Host]     : .NET 7.0.0 (7.0.22.41112), Arm64 RyuJIT AdvSIMD
  Job-UQDAMS : .NET 7.0.0 (7.0.22.41112), Arm64 RyuJIT AdvSIMD
Method Mean
Byte_FirstIndex_BuiltIn 33.11 ns
Byte_FirstIndex 17.52 ns
Byte_Lastndex_BuiltIn 25.44 ns
Byte_Lastndex 17.29 ns
Short_FirstIndex_BuiltIn 44.17 ns
Short_FirstIndex 34.28 ns
Short_Lastndex_BuiltIn 42.36 ns
Short_Lastndex 330.83 ns

cc @EgorBo @kunalspathak @AndyAyersMS

@adamsitnik adamsitnik added tenet-performance Performance related issue area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI labels Aug 14, 2022
@adamsitnik adamsitnik added this to the 7.0.0 milestone Aug 14, 2022
@ghost
Copy link

ghost commented Aug 14, 2022

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

While working on #73768 I've re-implemented Span.IndexOf and Span.LastIndexOf by using Vector128<T> with generic implementation that can be used for byte, short, int and long.

IndexOfValueType and LastIndexOfValueType are very similar, but the first is performant for both byte and short, while the latter only for byte. It's very likely related to #73804

Repro:

public class Repro
{
    private byte[] _bytes;
    private short[] _shorts;

    public Repro()
    {
        _bytes = new byte[512];
        _bytes[_bytes.Length / 2] = 1;
        _shorts = new short[512];
        _shorts[_shorts.Length / 2] = 1;
    }

    [Benchmark]
    public int Byte_FirstIndex_BuiltIn() => new ReadOnlySpan<byte>(_bytes).IndexOf((byte)1);

    [Benchmark]
    public int Byte_Lastndex_BuiltIn() => new ReadOnlySpan<byte>(_bytes).LastIndexOf((byte)1);

    [Benchmark]
    public int Short_FirstIndex_BuiltIn() => new ReadOnlySpan<short>(_shorts).IndexOf((short)1);

    [Benchmark]
    public int Short_Lastndex_BuiltIn() => new ReadOnlySpan<short>(_shorts).LastIndexOf((short)1);

    [Benchmark]
    public int Byte_FirstIndex() => IndexOfValueType<byte>(ref _bytes[0], 1, _bytes.Length);

    [Benchmark]
    public int Byte_Lastndex() => LastIndexOfValueType<byte>(ref _bytes[0], 1, _bytes.Length);

    [Benchmark]
    public int Short_FirstIndex() => IndexOfValueType<short>(ref _shorts[0], 1, _shorts.Length);

    [Benchmark]
    public int Short_Lastndex() => LastIndexOfValueType<short>(ref _shorts[0], 1, _shorts.Length);

    // this method is performant for bytes and shorts
    [MethodImpl(MethodImplOptions.AggressiveOptimization)]
    private static int IndexOfValueType<T>(ref T searchSpace, T value, int length) where T : struct, INumber<T>
    {
        if (!Vector128.IsHardwareAccelerated || length < Vector128<T>.Count)
        {
            // removed for brevity
        }
        else
        {
            Vector128<T> equals, values = Vector128.Create(value);
            ref T currentSearchSpace = ref searchSpace;
            ref T oneVectorAwayFromEnd = ref Unsafe.Add(ref searchSpace, length - Vector128<T>.Count);

            // Loop until either we've finished all elements or there's less than a vector's-worth remaining.
            do
            {
                equals = Vector128.Equals(values, Vector128.LoadUnsafe(ref currentSearchSpace));
                if (equals == Vector128<T>.Zero)
                {
                    currentSearchSpace = ref Unsafe.Add(ref currentSearchSpace, Vector128<T>.Count);
                    continue;
                }

                return ComputeFirstIndex(ref searchSpace, ref currentSearchSpace, equals);
            }
            while (!Unsafe.IsAddressGreaterThan(ref currentSearchSpace, ref oneVectorAwayFromEnd));

            // If any elements remain, process the first vector in the search space.
            if ((uint)length % Vector128<T>.Count != 0)
            {
                equals = Vector128.Equals(values, Vector128.LoadUnsafe(ref oneVectorAwayFromEnd));
                if (equals != Vector128<T>.Zero)
                {
                    return ComputeFirstIndex(ref searchSpace, ref oneVectorAwayFromEnd, equals);
                }
            }
        }

        return -1;
    }

    // this method is performant for bytes, but NOT for shorts
    [MethodImpl(MethodImplOptions.AggressiveOptimization)]
    private static int LastIndexOfValueType<T>(ref T searchSpace, T value, int length)
        where T : struct, INumber<T>
    {
        if (!Vector128.IsHardwareAccelerated || length < Vector128<T>.Count)
        {
            // removed for brevity
        }
        else
        {
            Vector128<T> equals, values = Vector128.Create(value);
            ref T currentSearchSpace = ref Unsafe.Add(ref searchSpace, length - Vector128<T>.Count);

            // Loop until either we've finished all elements or there's less than a vector's-worth remaining.
            do
            {
                equals = Vector128.Equals(values, Vector128.LoadUnsafe(ref currentSearchSpace));
                if (equals == Vector128<T>.Zero)
                {
                    currentSearchSpace = ref Unsafe.Subtract(ref currentSearchSpace, Vector128<T>.Count);
                    continue;
                }

                return ComputeLastIndex(ref searchSpace, ref currentSearchSpace, equals);
            }
            while (!Unsafe.IsAddressLessThan(ref currentSearchSpace, ref searchSpace));

            // If any elements remain, process the first vector in the search space.
            if ((uint)length % Vector128<T>.Count != 0)
            {
                equals = Vector128.Equals(values, Vector128.LoadUnsafe(ref searchSpace));
                if (equals != Vector128<T>.Zero)
                {
                    return ComputeLastIndex(ref searchSpace, ref searchSpace, equals);
                }
            }
        }

        return -1;
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private static int ComputeFirstIndex<T>(ref T searchSpace, ref T current, Vector128<T> equals) where T : struct
    {
        uint notEqualsElements = equals.ExtractMostSignificantBits();
        int index = BitOperations.TrailingZeroCount(notEqualsElements);
        return index + (int)((long)Unsafe.ByteOffset(ref searchSpace, ref current) / Unsafe.SizeOf<T>());
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private static int ComputeLastIndex<T>(ref T searchSpace, ref T current, Vector128<T> equals) where T : struct
    {
        uint notEqualsElements = equals.ExtractMostSignificantBits();
        int index = 31 - BitOperations.LeadingZeroCount(notEqualsElements); // 31 = 32 (bits in Int32) - 1 (indexing from zero)
        return (int)((long)Unsafe.ByteOffset(ref searchSpace, ref current) / Unsafe.SizeOf<T>()) + index;
    }
}
BenchmarkDotNet=v0.13.1.1845-nightly, OS=ubuntu 20.04
Unknown processor
.NET SDK=7.0.100-rc.1.22413.1
  [Host]     : .NET 7.0.0 (7.0.22.41112), Arm64 RyuJIT AdvSIMD
  Job-UQDAMS : .NET 7.0.0 (7.0.22.41112), Arm64 RyuJIT AdvSIMD
Method Mean
Byte_FirstIndex_BuiltIn 33.11 ns
Byte_FirstIndex 17.52 ns
Byte_Lastndex_BuiltIn 25.44 ns
Byte_Lastndex 17.29 ns
Short_FirstIndex_BuiltIn 44.17 ns
Short_FirstIndex 34.28 ns
Short_Lastndex_BuiltIn 42.36 ns
Short_Lastndex 330.83 ns

cc @EgorBo @kunalspathak @AndyAyersMS

Author: adamsitnik
Assignees: -
Labels:

tenet-performance, area-CodeGen-coreclr

Milestone: 7.0.0

@EgorBo
Copy link
Member

EgorBo commented Aug 15, 2022

@adamsitnik I can't repro it on Main and Preview7 I get

BenchmarkDotNet=v0.13.1, OS=macOS Monterey 12.4 (21F79) [Darwin 21.5.0]
Intel Core i9-9980HK CPU 2.40GHz, 1 CPU, 16 logical and 8 physical cores

|         Method |     Mean |    Error |   StdDev |
|--------------- |---------:|---------:|---------:|
|  Byte_Lastndex | 14.91 ns | 0.032 ns | 0.030 ns |
| Short_Lastndex | 30.54 ns | 0.171 ns | 0.151 ns |

Short_Lastndex is twice slower because _shorts is twice larger (in bytes)

@adamsitnik
Copy link
Member Author

@EgorBo thanks for the sanity check!

I've synced my fork again, performed full build (previously was building only a subset of CoreLib) and I also can't repro it now.

@ghost ghost locked as resolved and limited conversation to collaborators Sep 14, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI tenet-performance Performance related issue
Projects
None yet
Development

No branches or pull requests

2 participants