Vectorize TrimTransparentPixels in GifEncoderCore #2500

gfoidl · 2023-07-27T19:29:26Z

Prerequisites

- I have written a descriptive pull-request title
- I have verified that there are no overlapping pull-requests open
- I have verified that I am following the existing coding patterns and practice as demonstrated in the repository. These follow strict Stylecop rules 👮.
- I have provided test coverage for my change (where applicable)

Description

A simple benchmark -- just for the inner loop -- yields:

|     Method |      Mean |    Error |   StdDev | Ratio |
|----------- |----------:|---------:|---------:|------:|
|    Default | 102.33 ns | 2.073 ns | 2.973 ns |  1.00 |
| Vectorized |  16.53 ns | 0.065 ns | 0.055 ns |  0.16 |

This is measured with .NET 7, but the codegen for .NET 6 is very similar.

Benchmark code:

using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using BenchmarkDotNet.Attributes;

Bench bench = new();
bench.Setup();
Console.WriteLine(bench.Default());
Console.WriteLine(bench.Vectorized());

#if !DEBUG
BenchmarkDotNet.Running.BenchmarkRunner.Run<Bench>();
#endif

public class Bench
{
    private byte[] _rowSpan = null!;
    private byte _trimmableIndex;

    [GlobalSetup]
    public void Setup()
    {
        _rowSpan = new byte[100];
        _rowSpan.AsSpan().Fill(42);
        _rowSpan.AsSpan(25, 9).Clear();

        _trimmableIndex = 42;
    }

    [Benchmark(Baseline = true)]
    public (int left, int right, bool isTransparentRow) Default()
    {
        Span<byte> rowSpan = _rowSpan;
        byte trimmableIndex = _trimmableIndex;

        int left = int.MaxValue;
        int right = int.MinValue;
        bool isTransparentRow = true;

        for (int x = 0; x < rowSpan.Length; ++x)
        {
            if (rowSpan[x] != trimmableIndex)
            {
                isTransparentRow = false;
                left = Math.Min(left, x);
                right = Math.Max(right, x);
            }
        }

        if (left == int.MaxValue)
        {
            left = 0;
        }

        if (right == int.MinValue)
        {
            right = rowSpan.Length;
        }

        return (left, right, isTransparentRow);
    }

    [Benchmark]
    public (int left, int right, bool isTransparentRow) Vectorized()
    {
        Span<byte> rowSpan = _rowSpan;
        byte trimmableIndex = _trimmableIndex;

        int left = int.MaxValue;
        int right = int.MinValue;
        bool isTransparentRow = true;

        ref byte rowPtr = ref MemoryMarshal.GetReference(rowSpan);
        nint rowLength = (nint)(uint)rowSpan.Length;
        nint x = 0;

        if (Vector128.IsHardwareAccelerated && rowLength >= Vector128<byte>.Count)
        {
            Vector256<byte> trimmableVec256 = Vector256.Create(trimmableIndex);

            if (Vector256.IsHardwareAccelerated && rowLength >= Vector256<byte>.Count)
            {
                do
                {
                    Vector256<byte> vec = Vector256.LoadUnsafe(ref rowPtr, (nuint)x);
                    Vector256<byte> notEquals = ~Vector256.Equals(vec, trimmableVec256);

                    if (notEquals != Vector256<byte>.Zero)
                    {
                        isTransparentRow = false;
                        uint mask = notEquals.ExtractMostSignificantBits();
                        nint start = x + (nint)uint.TrailingZeroCount(mask);

                        // LeadingZeroCount gives the distance of the last set lane from the end of
                        // the vector, but we need its index from the beginning of the row.
                        nint end = (nint)uint.LeadingZeroCount(mask);
                        end = x + Vector256<byte>.Count - 1 - end;

                        left = Math.Min(left, (int)start);
                        right = Math.Max(right, (int)end);
                    }

                    x += Vector256<byte>.Count;
                }
                while (x <= rowLength - Vector256<byte>.Count);
            }

            Vector128<byte> trimmableVec = Vector256.IsHardwareAccelerated
                ? trimmableVec256.GetLower()
                : Vector128.Create(trimmableIndex);

            while (x <= rowLength - Vector128<byte>.Count)
            {
                Vector128<byte> vec = Vector128.LoadUnsafe(ref rowPtr, (nuint)x);
                Vector128<byte> notEquals = ~Vector128.Equals(vec, trimmableVec);

                if (notEquals != Vector128<byte>.Zero)
                {
                    isTransparentRow = false;
                    uint mask = notEquals.ExtractMostSignificantBits();
                    nint start = x + (nint)uint.TrailingZeroCount(mask);

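                    // The mask has only 16 significant bits inside a 32-bit uint, so subtract the
                    // 16 unused upper bits (32 - 16, which happens to equal Vector128<byte>.Count).
                    nint end = (nint)uint.LeadingZeroCount(mask) - Vector128<byte>.Count;
                    // end is from the end, but we need the index from the beginning
                    end = x + Vector128<byte>.Count - 1 - end;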

                    left = Math.Min(left, (int)start);
                    right = Math.Max(right, (int)end);
                }

                x += Vector128<byte>.Count;
            }
        }

        for (; x < rowLength; ++x)
        {
            if (Unsafe.Add(ref rowPtr, x) != trimmableIndex)
            {
                isTransparentRow = false;
                left = Math.Min(left, (int)x);
                right = Math.Max(right, (int)x);
            }
        }

        if (left == int.MaxValue)
        {
            left = 0;
        }

        if (right == int.MinValue)
        {
            right = (int)rowLength;
        }

        return (left, right, isTransparentRow);
    }
}

gfoidl

Some notes for review.

gfoidl · 2023-07-27T19:33:17Z

src/ImageSharp/Formats/Gif/GifEncoderCore.cs

+                        Vector256<byte> vec = Vector256.LoadUnsafe(ref rowPtr, (nuint)x);
+                        Vector256<byte> notEquals = ~Vector256.Equals(vec, trimmableVec256);
+
+                        if (notEquals != Vector256<byte>.Zero)


At the moment I have no idea how to make this branchless.
isTransparentRow could be tracked in a vector, but left and right cannot, because the vector element types don't match: the pixels are byte, while the indices are int.

A fairly complicated approach would be to track left and right in a VectorXYZ<byte> as well and, just before they can overflow, merge them back into the scalar left and right and start over. But I guess the bookkeeping is more work, so I'm not sure this would actually be faster. The code would certainly get painful.
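As a rough illustration of the first half of that idea only, here is a minimal sketch that tracks the "is the whole row trimmable?" flag in a vector accumulator instead of branching per block. It is not code from the PR: the helper name `IsTransparentRow` is hypothetical, it assumes .NET 7 or later for the cross-platform `Vector128` APIs, and it deliberately leaves out left and right, which still need the per-block handling discussed above.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

// Hypothetical helper (not from the PR): answers only whether every byte
// in the row equals trimmableIndex. left/right are not computed here.
static bool IsTransparentRow(ReadOnlySpan<byte> row, byte trimmableIndex)
{
    ref byte rowPtr = ref MemoryMarshal.GetReference(row);
    int count = Vector128<byte>.Count;
    int x = 0;
    bool anyDifferent = false;

    if (Vector128.IsHardwareAccelerated && row.Length >= count)
    {
        Vector128<byte> trimmableVec = Vector128.Create(trimmableIndex);
        Vector128<byte> acc = Vector128<byte>.Zero;

        for (; x <= row.Length - count; x += count)
        {
            Vector128<byte> vec = Vector128.LoadUnsafe(ref rowPtr, (nuint)x);

            // OR the per-lane "not equal" flags into the accumulator.
            // There is no branch inside the loop body.
            acc |= ~Vector128.Equals(vec, trimmableVec);
        }

        // A single check after the loop replaces the per-block check.
        anyDifferent = acc != Vector128<byte>.Zero;
    }

    // Scalar tail for the bytes that do not fill a whole vector.
    for (; x < row.Length; ++x)
    {
        anyDifferent |= row[x] != trimmableIndex;
    }

    return !anyDifferent;
}

byte[] row = new byte[40];
Array.Fill(row, (byte)42);
Console.WriteLine(IsTransparentRow(row, 42)); // True

row[37] = 0; // lands in the scalar tail (indices 32..39)
Console.WriteLine(IsTransparentRow(row, 42)); // False
```

Whether this would beat the branchy version in practice is not something the sketch can show; it would need its own benchmark.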

gfoidl · 2023-07-27T19:35:45Z

src/ImageSharp/Formats/Gif/GifEncoderCore.cs

+                }
+            }
+#endif
+            for (; x < rowLength; ++x)


The remainder could be handled vectorized too, by shifting the most-significant-bits mask by the number of elements left in the final vector.
I tried this somewhere else and the cost of that bookkeeping isn't negligible, so I didn't do it here (maybe I'll try it later).

Hm, I remembered that movemask isn't the fastest and that ptest (TestZ in .NET terms) is faster, but current benchmarks didn't confirm this, and Intel's instruction tables don't show any benefit in latency or throughput either. So I simplified that check.
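One possible reading of the vectorized-remainder idea is sketched below. It is an assumption about what was meant, not code from the PR. It loads the last 16 bytes of the row (overlapping lanes that earlier iterations already covered) and shifts those lanes out of the mask. The helper name `ScanTail` and its return shape are hypothetical. It assumes .NET 7 or later, a row of at least 16 bytes, and fewer than 16 bytes remaining.

```csharp
using System;
using System.Numerics;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

// Hypothetical helper (not from the PR): scans row[x..] with one overlapping
// 128-bit load instead of a scalar loop.
// Assumes row.Length >= 16 and 0 < row.Length - x < 16.
static (int left, int right, bool found) ScanTail(ReadOnlySpan<byte> row, int x, byte trimmableIndex)
{
    int count = Vector128<byte>.Count;
    int lastStart = row.Length - count; // the final vector ends exactly at the row end
    int overlap = x - lastStart;        // lanes that earlier iterations already covered

    ref byte rowPtr = ref MemoryMarshal.GetReference(row);
    Vector128<byte> vec = Vector128.LoadUnsafe(ref rowPtr, (nuint)lastStart);
    Vector128<byte> notEquals = ~Vector128.Equals(vec, Vector128.Create(trimmableIndex));

    // Shift the already-covered lanes out of the mask, then shift back so that
    // bit i still corresponds to lane i of the final vector.
    uint mask = notEquals.ExtractMostSignificantBits();
    mask = (mask >> overlap) << overlap;

    if (mask == 0)
    {
        return (0, 0, false);
    }

    int left = lastStart + BitOperations.TrailingZeroCount(mask);

    // Only the low 16 bits of the 32-bit mask are lanes, hence the (32 - count).
    int fromEnd = BitOperations.LeadingZeroCount(mask) - (32 - count);
    int right = lastStart + count - 1 - fromEnd;

    return (left, right, true);
}

byte[] row = new byte[20];
Array.Fill(row, (byte)42);
row[10] = 0; // inside the overlap (indices 4..15), so it is masked out here
row[17] = 0;
row[18] = 0;

// The main loop would have handled indices 0..15, so the tail starts at x = 16.
Console.WriteLine(ScanTail(row, 16, 42)); // (17, 18, True)
```

The extra index arithmetic visible here is the kind of bookkeeping the comment refers to, so this is not necessarily faster than the scalar loop.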

JimBobSquarePants · 2023-07-29T08:59:28Z

Oof! Look at those numbers!! Thanks so much for looking at this. I'm going to have a good dig through it to wrap my head round what you have done.

gfoidl · 2023-07-29T09:04:47Z

😃
I think the easiest way to understand and check it is to put the benchmark code (see the top comment) into a simple console app and step through it with the debugger (maybe change the size of _rowSpan to 20 or the like).
Calculation of the correct index for end is the strangest part IMO.

PS: I'm back on Tuesday, so maybe slow to respond in the meantime.
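For readers who would rather check the `end` arithmetic without a debugger, the sketch below works through it with plain integers. It is not code from the PR. The helper name `FirstAndLast` is hypothetical, and the masks are hand-built stand-ins for what `ExtractMostSignificantBits` would return for the benchmark row (100 bytes, indices 25..33 non-trimmable).

```csharp
using System;
using System.Numerics;

// Hypothetical helper (not from the PR): given the lane mask of one block that
// starts at row index x, return the first and last set lane as row indices.
// laneCount is 16 for Vector128<byte> and 32 for Vector256<byte>.
static (int start, int end) FirstAndLast(uint mask, int x, int laneCount)
{
    // Bit i of the mask corresponds to lane i, so the lowest set bit is the first hit.
    int start = x + BitOperations.TrailingZeroCount(mask);

    // LeadingZeroCount looks at all 32 bits. Subtracting the (32 - laneCount) bits
    // that belong to no lane gives the distance of the last hit from the last lane.
    // For 16 lanes this is 16, the same value as Vector128<byte>.Count.
    int fromEnd = BitOperations.LeadingZeroCount(mask) - (32 - laneCount);

    // Turn "distance from the last lane" into an index from the start of the row.
    int end = x + laneCount - 1 - fromEnd;

    return (start, end);
}

// 128-bit block covering row indices 16..31: lanes 9..15 are non-trimmable.
Console.WriteLine(FirstAndLast(0b1111_1110_0000_0000, x: 16, laneCount: 16)); // (25, 31)

// 256-bit block covering row indices 32..63: lanes 0 and 1 are non-trimmable.
Console.WriteLine(FirstAndLast(0b11, x: 32, laneCount: 32)); // (32, 33)
```

Merging the per-block results with `Math.Min` and `Math.Max`, as the benchmark code does, gives left = 25 and right = 33 for that row.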

JimBobSquarePants · 2023-08-09T11:30:27Z

This is fantastic stuff. After reading some of the source for Span.IndexOf, I figured that, theoretically, masking with bit counting would be the vectorized solution; I just had no idea how I'd actually implement it.

Tip of the cap to you sir.

gfoidl · 2023-08-09T12:15:01Z

Thanks for the kind words ❤️

Vectorize TrimTransparentPixels in GifEncoderCore

5416edb

gfoidl mentioned this pull request Jul 27, 2023

Preserve Gif color palettes and deduplicate frame pixels. #2455

Merged

JimBobSquarePants merged commit 949e6ad into SixLabors:js/gif-fixes Aug 9, 2023

gfoidl deleted the git-transparency-simd branch August 9, 2023 12:14
