[Shared] Optimized sequence equal #257

Merged

takahiro0327
Collaborator

Optimized SequenceEqual. Please merge if you see no problems.

The loading time for scenes containing large images with the same content is slightly reduced.
This change should also slightly reduce the loading time of apng/gifs.


Loading time for one huge scene

Before: average 33.78[s]

Scene loading completed: 34.1[s]
Scene loading completed: 33.6[s]
Scene loading completed: 33.5[s]
Scene loading completed: 33.3[s]
Scene loading completed: 34.4[s]

After: average 32.04[s] (-1.74)

Scene loading completed: 32.2[s]
Scene loading completed: 31.8[s]
Scene loading completed: 32.2[s]
Scene loading completed: 31.8[s]
Scene loading completed: 32.2[s]

Linq's SequenceEqual was taking 40ms to compare 4MB of data.

https://github.com/mono/mono/blob/c6cdaadb54a1173484f1ada524306ddbf8c2e7d5/mcs/class/referencesource/System.Core/System/Linq/Enumerable.cs#L944
I think this is the SequenceEqual implementation being used.
I think the per-element overhead of the enumerator's MoveNext() and the comparer is the cause of the slowdown.
The large GCAlloc may also come from boxing when Current is accessed.
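
To make the suspected overhead concrete, here is a minimal sketch contrasting an enumerator-based comparison with a direct array loop. `EnumeratorEqual` only approximates the shape of the linked implementation, and `CompareSketch`, `EnumeratorEqual`, and `ArrayEqual` are hypothetical names used for illustration; this is not the code from this PR.

```csharp
using System;
using System.Collections.Generic;

public static class CompareSketch
{
    // Enumerator-based shape: per element, two interface calls on each
    // side (MoveNext, Current) plus one comparer call.
    public static bool EnumeratorEqual(IEnumerable<byte> first, IEnumerable<byte> second)
    {
        EqualityComparer<byte> comparer = EqualityComparer<byte>.Default;
        using (IEnumerator<byte> e1 = first.GetEnumerator())
        using (IEnumerator<byte> e2 = second.GetEnumerator())
        {
            while (e1.MoveNext())
            {
                if (!(e2.MoveNext() && comparer.Equals(e1.Current, e2.Current)))
                    return false;
            }
            // Equal only if the second sequence is exhausted as well.
            return !e2.MoveNext();
        }
    }

    // Direct array loop: the length check happens once up front,
    // then each element costs one indexed load per side.
    public static bool ArrayEqual(byte[] a, byte[] b)
    {
        if (ReferenceEquals(a, b)) return true;
        if (a == null || b == null) return false;
        if (a.Length != b.Length) return false;

        for (int i = 0; i < a.Length; ++i)
            if (a[i] != b[i]) return false;

        return true;
    }
}
```

Both methods return the same result for non-null arrays; the difference is only in how much work is done per element.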


I hadn't looked at the profiler in a while, and there was something at the top. So I looked into it.

image
Before Profile:
MonoProfilerOutput_2024-06-01_13-41-13.csv
After Profile:
MonoProfilerOutput_2024-06-01_13-48-02.csv

(cherry picked from commit c3b33dcbafb201706c9bf7ff201cc38e05b2bfca)
Contributor

@ManlyMarco ManlyMarco left a comment

Can you add comments explaining what the code does, or name the variables more clearly? It's hard to understand some of it.

@takahiro0327
Collaborator Author

How about this?

I had ChatGPT write most of it, except for the Note at the end. Really, it's amazing. I gave it the code and asked it to add comments, and it did.

@takahiro0327
Collaborator Author

Changed the variable names as well.

@takahiro0327
Collaborator Author

fixed

@ManlyMarco
Contributor

That's better, thanks. Someone suggested that this might be even faster or at least simpler.

@takahiro0327
Collaborator Author

Do you mean the approach that uses SIMD?
System.SpanHelpers.SequenceEqual and System.Runtime.Intrinsics aren't in the KK/KKS Mono, are they?
I mean, I don't think they are implemented in Mono.

I suppose it would be possible if I wrote native functions in C and called them, but ...
Currently it takes less than 1ms for a few MB of data.
If the data size could be in the GB range, it might be worth implementing...

@takahiro0327
Collaborator Author

@ManlyMarco
Contributor

I meant specifically the code in lines 931 to 950. It seems to not use anything special outside the Unsafe class and looks similar to your solution.

Contributor

@ManlyMarco ManlyMarco left a comment

Looks good. It might make sense to move SequenceEqual into a utility class and use it in other places in the code. I believe it's used a bunch inside IllusionFixes too. Maybe rename it to SequenceEqualFast?

@takahiro0327
Collaborator Author

Oh, I see, that approach.
Using do-while, and using an offset instead of the indexer, I guess.
I'll give it a try.

@takahiro0327
Collaborator Author

Time for a 4MB comparison:

        static TextureContainerManager()
        {
            byte[] a = new byte[4 << 20];
            byte[] b = new byte[4 << 20];

            Stopwatch watch = new Stopwatch();

            long minTicks = long.MaxValue;

            for( int i = 0; i < 1000; ++i )
            {
                watch.Reset();
                watch.Start();
                TextureKey.SequenceEqual(a, b);
                watch.Stop();

                minTicks = Math.Min(minTicks, watch.ElapsedTicks);
                
            }

            
            System.Console.WriteLine($"@@@@ {minTicks * 1000000 / (double)Stopwatch.Frequency}[us]");
        }

536.2[us]

if (((int)pA & 7) == 0 && ((int)pB & 7) == 0)
{
    ulong* ulongA = (ulong*)pA;
    ulong* ulongB = (ulong*)pB;
    offset = bytes & ~7;       // Round down to the nearest multiple of 8.
    int count = offset >> 3;    // Divide by 8 to get the number of 64-bit blocks.

    for (int i = 0; i < count; ++i)
    {
        if (ulongA[i] != ulongB[i])
            goto NotEquals;
    }
}

425.5[us]

if (((int)pA & 7) == 0 && ((int)pB & 7) == 0)
{
    offset = bytes & ~7;       // Round down to the nearest multiple of 8.

    for (int i = 0; i < offset; i += 8)
    {
        if (*(ulong*)(pA + i) != *(ulong*)(pB + i))
            goto NotEquals;
    }
}

409[us]

if (((int)pA & 7) == 0 && ((int)pB & 7) == 0 && bytes >= 8) // do-while always runs once, so require at least one full 8-byte block
{
    offset = bytes & ~7;       // Round down to the nearest multiple of 8.

    int i = 0;
    
    do
    {
        if (*(ulong*)(pA + i) != *(ulong*)(pB + i))
            goto NotEquals;

        i += 8;
    }
    while (i < offset);
}

314.9[us]

if (((int)pA & 7) == 0 && ((int)pB & 7) == 0 && bytes >= 8) // with offset == 0 the pointer would step past pALast and never match it
{
    offset = bytes & ~7;       // Round down to the nearest multiple of 8.

    byte* pA_ = pA;
    byte* pB_ = pB;
    byte* pALast = pA + offset;

    do
    {
        if (*(ulong*)pA_ != *(ulong*)pB_)
            goto NotEquals;

        pA_ += 8;
        pB_ += 8;
    }
    while (pA_ != pALast);
}

278.4[us]

if (((int)pA & 7) == 0 && ((int)pB & 7) == 0 && bytes >= 32)
{
    offset = bytes & ~31;       // Round down to the nearest multiple of 32.

    byte* pA_ = pA;
    byte* pB_ = pB;
    byte* pALast = pA + offset;

    do
    {
        if (*(ulong*)pA_ != *(ulong*)pB_)
            goto NotEquals;

        pA_ += 8;
        pB_ += 8;

        if (*(ulong*)pA_ != *(ulong*)pB_)
            goto NotEquals;

        pA_ += 8;
        pB_ += 8;

        if (*(ulong*)pA_ != *(ulong*)pB_)
            goto NotEquals;

        pA_ += 8;
        pB_ += 8;

        if (*(ulong*)pA_ != *(ulong*)pB_)
            goto NotEquals;

        pA_ += 8;
        pB_ += 8;
    }
    while (pA_ != pALast);
}
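
The fragments above show only the aligned fast path. Below is a minimal sketch of how the 4x-unrolled variant could sit inside a complete method, assuming a byte-by-byte tail loop for the bytes past `offset`. The surrounding code is not shown in this thread, so the null and length checks, the tail loop, and the names `ByteCompareSketch` and `BytesEqual` are assumptions, not the PR's actual code. It needs unsafe code enabled in the build.

```csharp
using System;

public static class ByteCompareSketch
{
    // Hypothetical helper; not the PR's actual SequenceEqualFast.
    public static unsafe bool BytesEqual(byte[] a, byte[] b)
    {
        if (ReferenceEquals(a, b)) return true;
        if (a == null || b == null) return false;

        int bytes = a.Length;
        if (bytes != b.Length) return false;
        if (bytes == 0) return true;

        fixed (byte* pA = a, pB = b)
        {
            int offset = 0;

            // Fast path: both pointers 8-byte aligned and at least
            // one full 32-byte block to compare.
            if (((long)pA & 7) == 0 && ((long)pB & 7) == 0 && bytes >= 32)
            {
                offset = bytes & ~31; // Round down to a multiple of 32.

                byte* curA = pA;
                byte* curB = pB;
                byte* endA = pA + offset;

                do
                {
                    // Four 64-bit compares per iteration.
                    if (*(ulong*)curA != *(ulong*)curB) return false;
                    curA += 8; curB += 8;

                    if (*(ulong*)curA != *(ulong*)curB) return false;
                    curA += 8; curB += 8;

                    if (*(ulong*)curA != *(ulong*)curB) return false;
                    curA += 8; curB += 8;

                    if (*(ulong*)curA != *(ulong*)curB) return false;
                    curA += 8; curB += 8;
                }
                while (curA != endA);
            }

            // Assumed tail: compare whatever the fast path did not cover
            // (everything, if the fast path was skipped).
            for (int i = offset; i < bytes; ++i)
                if (pA[i] != pB[i]) return false;
        }

        return true;
    }
}
```

The `bytes >= 32` guard matters because the do-while body always runs once; without it, a short array would be read past its end.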

@takahiro0327
Collaborator Author

The code now uses a new function called SequenceEqualFast.
Moved the definition to Shared.


349.6[us] NG (no good: slower than the 278.4[us] variant)

if (((int)pA & 7) == 0 && ((int)pB & 7) == 0 && bytes >= 32)
{
    offset = bytes & ~31;       // Round down to the nearest multiple of 32.

    byte* pA_ = pA;
    byte* pB_ = pB;
    byte* pALast = pA + offset;

    do
    {
        if (*(ulong*)(pA_ + 0) != *(ulong*)(pB_ + 0))
            goto NotEquals;

        if (*(ulong*)(pA_ + 8) != *(ulong*)(pB_ + 8))
            goto NotEquals;

        if (*(ulong*)(pA_ +16) != *(ulong*)(pB_ +16))
            goto NotEquals;

        if (*(ulong*)(pA_ +24) != *(ulong*)(pB_ +24))
            goto NotEquals;

        pA_ += 32;
        pB_ += 32;
    }
    while (pA_ != pALast);
}

Contributor

@ManlyMarco ManlyMarco left a comment

Looks good, good work!

The method could be added to KKAPI so that other plugins can use it easily instead of copying it. OverlayMod would probably benefit from it.

@takahiro0327
Collaborator Author

I experimented to see whether I could apply it to the following code in Core.UncensorSelector.Controller.cs.
With dedicated code it was up to three times faster. However, the original is already fast enough, so I will leave it as it is.
The load is probably low to begin with because the number of elements is small.

            private IEnumerator HandleUVCorrupionsCo(SkinnedMeshRenderer dst, Vector2[] uvCopy)
            {
                // Wait for next frame to let the graphics logic run. Issue seems to happen between frames.
                yield return null;

#if KK
                if (Scene.Instance.NowSceneNames.Contains("ClassRoomSelect"))
                {
                    yield return null;
                    yield return null;
                    yield return null;
                    yield return null;
                }
#endif
                DoHandleUVCorrupions = true;

                
                stopwatch.Start();

                // Check if UVs got corrupted after moving the mesh, most common fail point
                if (! Utility.SequenceEqualFast(dst.sharedMesh.uv,uvCopy))
                {
                    Logger.LogWarning($"UVs got corrupted when changing uncensor mesh {dst.sharedMesh.name}, attempting to fix");
                    dst.sharedMesh.uv = uvCopy;
                    yield return null;

                    if (!Utility.SequenceEqualFast(dst.sharedMesh.uv,uvCopy))
                        Logger.LogError("Failed to fix UVs, body textures might be displayed corrupted. Consider updating your GPU drivers.");
                }

                stopwatch.Stop();
                System.Console.WriteLine($"@@@@@ {stopwatch.ElapsedMilliseconds}[ms]");
            }

            static System.Diagnostics.Stopwatch stopwatch = new System.Diagnostics.Stopwatch();

System.Linq.SequenceEqual: 12[ms]
Ver1: 14[ms]
Ver2: 8[ms]
Ver3: 4[ms]

Ver1:

        static public bool SequenceEqualFast<T>(IList<T> a, IList<T> b) where T : struct
        {
            if (System.Object.ReferenceEquals(a, b))
                return true;

            if (a == null || b == null)
                return false;

            int count = a.Count;

            if (count != b.Count)
                return false;

            if (count <= 0)
                return true;

            for (int i = 0; i < count; ++i)
                if (!a[i].Equals(b[i]))
                    goto NotEquals;

            return true;

NotEquals:
            return false;
        }

Ver2:

        static public bool SequenceEqualFast(IList<UnityEngine.Vector2> a, IList<UnityEngine.Vector2> b)
        {
            if (System.Object.ReferenceEquals(a, b))
                return true;

            if (a == null || b == null)
                return false;

            int count = a.Count;

            if (count != b.Count)
                return false;

            if (count <= 0)
                return true;

            for (int i = 0; i < count; ++i)
                if (a[i] != b[i])
                    goto NotEquals;

            return true;

NotEquals:
            return false;
        }

Ver3:

        static public bool SequenceEqualFast(UnityEngine.Vector2[] a, UnityEngine.Vector2[] b)
        {
            if (System.Object.ReferenceEquals(a, b))
                return true;

            if (a == null || b == null)
                return false;

            int count = a.Length;

            if (count != b.Length)
                return false;

            if (count <= 0)
                return true;

            for (int i = 0; i < count; ++i)
                if (a[i] != b[i])
                    goto NotEquals;

            return true;

NotEquals:
            return false;
        }

@takahiro0327 takahiro0327 deleted the sequence-equal-optimization branch June 2, 2024 12:09
@ManlyMarco
Contributor

ManlyMarco commented Jun 2, 2024

I think ver3 is still worth adding, especially since it's so straightforward. In theory there may be some uncensor with a very large mesh, where it would have a bigger effect on time.

I think ++i in the for loop should be i++ instead.

@takahiro0327
Collaborator Author

takahiro0327 commented Jun 2, 2024

Ah, the high-poly model.
Roger that. I'll add it in a few days.

> I think ++i in the for loop should be i++ instead.

Why?
Well, it's C#, so it doesn't matter either way.
In C++, ++i is preferable, right? The prefix form is less likely to create an extra temporary instance.
In C#, both are probably fine, but I learned the habit in C++ and use the prefix form.

@ManlyMarco
Contributor

You're right that it doesn't matter in C#, but to people used to C and C++ seeing ++i in this case will seem wrong.

@takahiro0327
Collaborator Author

takahiro0327 commented Jun 2, 2024

I'm comfortable with C++, and I know of cases where i++ can cause performance problems, so to me i++ looks more unnatural.

The penalty doesn't occur with int types, and most modern compilers can avoid it.
I just make it a habit to write ++i unless I have a reason not to.
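
As a small illustration of the point both sides agree on: in a C# `for` loop the value of the increment expression is discarded, so the two forms behave the same. The only difference is the value the expression itself yields. `IncrementSketch` and its members are hypothetical names for this sketch.

```csharp
using System;

public static class IncrementSketch
{
    // Loop written with the prefix form.
    public static int SumPrefix(int[] values)
    {
        int sum = 0;
        for (int i = 0; i < values.Length; ++i)
            sum += values[i];
        return sum;
    }

    // Same loop with the postfix form; the iteration order and
    // the result are identical.
    public static int SumPostfix(int[] values)
    {
        int sum = 0;
        for (int i = 0; i < values.Length; i++)
            sum += values[i];
        return sum;
    }

    // The forms differ only when the expression's value is used.
    public static int ValueOfPrefix(int start)
    {
        int i = start;
        return ++i; // yields the incremented value
    }

    public static int ValueOfPostfix(int start)
    {
        int i = start;
        return i++; // yields the value before the increment
    }
}
```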

@takahiro0327
Collaborator Author

Should ver3 also be implemented in KKAPI and SequenceEqualFast in KKPlugin be removed?

@ManlyMarco
Contributor

It would be a good idea to include it in KKAPI, yes. Other common variations too maybe.
For now it can also stay in KKPlugins, since some plugins here don't reference KKAPI.
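
One possible shape for such a variation, as a sketch only: a generic array overload constrained to `IEquatable<T>`, so that `Equals` binds to the strongly typed `Equals(T)` rather than `Object.Equals(object)` and does not box the argument. This overload is hypothetical and was not measured in this thread; whether a given type such as `UnityEngine.Vector2` implements `IEquatable<T>` in the targeted Unity version is an assumption to check before relying on it.

```csharp
using System;

public static class GenericCompareSketch
{
    // Hypothetical generic overload; not part of this PR.
    public static bool SequenceEqualFast<T>(T[] a, T[] b) where T : struct, IEquatable<T>
    {
        if (ReferenceEquals(a, b)) return true;
        if (a == null || b == null) return false;

        int count = a.Length;
        if (count != b.Length) return false;

        for (int i = 0; i < count; ++i)
        {
            // Resolves to IEquatable<T>.Equals(T), so b[i] is not boxed.
            if (!a[i].Equals(b[i])) return false;
        }

        return true;
    }
}
```

For example, `GenericCompareSketch.SequenceEqualFast(new[] { 1, 2, 3 }, new[] { 1, 2, 3 })` returns `true`, since `int` implements `IEquatable<int>`.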
