Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GetNonRandomizedHashCode produces too many collisions for small strings #92556

Open
EgorBo opened this issue Sep 24, 2023 · 3 comments
Open

GetNonRandomizedHashCode produces too many collisions for small strings #92556

EgorBo opened this issue Sep 24, 2023 · 3 comments
Labels
area-System.Runtime needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration tenet-performance Performance related issue
Milestone

Comments

@EgorBo
Copy link
Member

EgorBo commented Sep 24, 2023

While I was investigating perf traces for some of our 1P customers, I noticed samples inside Dictionary.FindValue on collision++ I then realized it hits quite of few of them (but not enough to rehash to Marvin since the dictrionary overall is not too big)

I made a small repro - here I create 5-char strings AA000 and then change the last 3 chars (only to [0..z] symbols) to find total number of collisions for this string:

using System.Diagnostics;
using System.Numerics;

var sw = Stopwatch.StartNew();
const int From = '0';
const int To = 'z';
var hashes = new HashSet<int>();
char[] str = "AA000".ToCharArray();
int totalCollisions = 0;

unsafe
{
    fixed (char* pStr = str)
    {
        for (int ch2 = From; ch2 < To; ch2++)
        for (int ch3 = From; ch3 < To; ch3++)
        for (int ch4 = From; ch4 < To; ch4++)
        {
            pStr[2] = (char)ch2;
            pStr[3] = (char)ch3;
            pStr[4] = (char)ch4;

            int hash = GetNonRandomizedHashCode(pStr, str.Length);
            if (!hashes.Add(hash)) 
                totalCollisions++;
        }
    }
}
sw.Stop();
Console.WriteLine($"Collisions found: {totalCollisions} in {sw.Elapsed.TotalSeconds}");


static unsafe int GetNonRandomizedHashCode(char* src, int length)
{
    uint hash1 = (5381 << 16) + 5381;
    uint hash2 = hash1;
    uint* ptr = (uint*)src;
    while (length > 2)
    {
        length -= 4;
        hash1 = (BitOperations.RotateLeft(hash1, 5) + hash1) ^ ptr[0];
        hash2 = (BitOperations.RotateLeft(hash2, 5) + hash2) ^ ptr[1];
        ptr += 2;
    }
    if (length > 0)
        hash2 = (BitOperations.RotateLeft(hash2, 5) + hash2) ^ ptr[0];
    return (int)(hash1 + (hash2 * 1566083941));
}

it found 213120 (!!) unique collisions! And I was only changing last 3 chars to ['0'..'z'] symbols. Examples:

Collision found: AAD4T and AAC4w
Collision found: AAB4P and AAO45
Collision found: AAH4T and AAK43
Collision found: AAd4P and AAc43
Collision found: AAr1p and AAp16
Collision found: AAV4I and AAW4h
Collision found: AAD4T and AAE45
Collision found: AAr1p and AAq1W
Collision found: AAd4P and AAe4q
Collision found: AAF4U and AAC48
...

(for comparison, Marvin found 15)

@dotnet-issue-labeler dotnet-issue-labeler bot added the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label Sep 24, 2023
@ghost ghost added the untriaged New issue has not been triaged by the area owner label Sep 24, 2023
@EgorBo
Copy link
Member Author

EgorBo commented Sep 24, 2023

So I am wondering if there any low-hanging improvements can be fixed here. E.g. the most popular types of strings are all ASCII which means they're half zeros. Also, the current algorithm uses a hack where it loads \0 for odd-sized strings

cc @GrabYourPitchforks

@EgorBo
Copy link
Member Author

EgorBo commented Sep 24, 2023

So probably a good option is to replace it with xxHash3 which should also be significantly faster for large inputs due to vectorization.

@EgorBo EgorBo added area-System.Runtime and removed needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners labels Sep 24, 2023
@ghost
Copy link

ghost commented Sep 24, 2023

Tagging subscribers to this area: @dotnet/area-system-runtime
See info in area-owners.md if you want to be subscribed.

Issue Details

While I was investigating perf traces for some of our 1P customers, I noticed samples inside Dictionary.FindValue on collision++ I then realized it hits quite of few of them (but not enough to rehash to Marvin since the dictrionary overall is not too big)

I made a small repro - here I create 5-char strings AA000 and then change the last 3 chars (only to [0..z] symbols) to find total number of collisions for this string:

using System.Numerics;
using System.Runtime.CompilerServices;

class Program
{
    const int From = '0';
    const int To = 'z';

    static void Main()
    {
        Parallel.For(From, To, Run);
        Console.WriteLine(totalCollisions);
    }

    static int totalCollisions = 0;

    static unsafe void Run(int i1)
    {
        char[] str1 = "AA000".ToCharArray();
        char[] str2 = "AA000".ToCharArray();
        fixed (char* p1 = str1)
        {
            fixed (char* p2 = str2)
            {
                for (int j1 = From; j1 < To; j1++)
                {
                    for (int k1 = From; k1 < To; k1++)
                    {
                        p1[2] = (char)i1;
                        p1[3] = (char)j1;
                        p1[4] = (char)k1;

                        for (int i2 = From; i2 < To; i2++)
                        {
                            for (int j2 = From; j2 < To; j2++)
                            {
                                for (int k2 = From; k2 < To; k2++)
                                {
                                    if (i1 == i2 && j1 == j2 && k1 == k2)
                                        continue;

                                    p2[2] = (char)i2;
                                    p2[3] = (char)j2;
                                    p2[4] = (char)k2;

                                    if (Hashes.GetNonRandomizedHashCode(p1, str1.Length) == Hashes.GetNonRandomizedHashCode(p2, str2.Length))
                                    {
                                        Interlocked.Increment(ref totalCollisions);
                                        //Console.WriteLine($"Collision found: {new string(str1)} and {new string(str2)}");
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
        Console.WriteLine("Done");
    }
}

public class Hashes
{
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static unsafe int GetNonRandomizedHashCode(char* src, int length)
    {
        uint hash1 = (5381 << 16) + 5381;
        uint hash2 = hash1;
        uint* ptr = (uint*)src;
        while (length > 2)
        {
            length -= 4;
            hash1 = (BitOperations.RotateLeft(hash1, 5) + hash1) ^ ptr[0];
            hash2 = (BitOperations.RotateLeft(hash2, 5) + hash2) ^ ptr[1];
            ptr += 2;
        }
        if (length > 0)
            hash2 = (BitOperations.RotateLeft(hash2, 5) + hash2) ^ ptr[0];
        return (int)(hash1 + (hash2 * 1566083941));
    }
}

it found 522736 (!!) unique collisions! And I was only changing last 3 chars to ['0'..'z'] symbols. Examples:

Collision found: AAD4T and AAC4w
Collision found: AAB4P and AAO45
Collision found: AAH4T and AAK43
Collision found: AAd4P and AAc43
Collision found: AAr1p and AAp16
Collision found: AAV4I and AAW4h
Collision found: AAD4T and AAE45
Collision found: AAr1p and AAq1W
Collision found: AAd4P and AAe4q
Collision found: AAF4U and AAC48
...

(for comparison, Marvin found 30 and 32 with a different seed, xxHash3 found 34)

Author: EgorBo
Assignees: -
Labels:

area-System.Runtime, untriaged

Milestone: -

@jkotas jkotas added the tenet-performance Performance related issue label Sep 25, 2023
@tannergooding tannergooding added needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration and removed untriaged New issue has not been triaged by the area owner labels Jun 24, 2024
@stephentoub stephentoub added this to the Future milestone Jul 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-System.Runtime needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration tenet-performance Performance related issue
Projects
None yet
Development

No branches or pull requests

4 participants