-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GetNonRandomizedHashCode produces too many collisions for small strings #92556
Comments
So I am wondering if there any low-hanging improvements can be fixed here. E.g. the most popular types of strings are all ASCII which means they're half zeros. Also, the current algorithm uses a hack where it loads \0 for odd-sized strings |
So probably a good option is to replace it with xxHash3 which should also be significantly faster for large inputs due to vectorization. |
Tagging subscribers to this area: @dotnet/area-system-runtime Issue DetailsWhile I was investigating perf traces for some of our 1P customers, I noticed samples inside I made a small repro - here I create 5-char strings using System.Numerics;
using System.Runtime.CompilerServices;
class Program
{
const int From = '0';
const int To = 'z';
static void Main()
{
Parallel.For(From, To, Run);
Console.WriteLine(totalCollisions);
}
static int totalCollisions = 0;
static unsafe void Run(int i1)
{
char[] str1 = "AA000".ToCharArray();
char[] str2 = "AA000".ToCharArray();
fixed (char* p1 = str1)
{
fixed (char* p2 = str2)
{
for (int j1 = From; j1 < To; j1++)
{
for (int k1 = From; k1 < To; k1++)
{
p1[2] = (char)i1;
p1[3] = (char)j1;
p1[4] = (char)k1;
for (int i2 = From; i2 < To; i2++)
{
for (int j2 = From; j2 < To; j2++)
{
for (int k2 = From; k2 < To; k2++)
{
if (i1 == i2 && j1 == j2 && k1 == k2)
continue;
p2[2] = (char)i2;
p2[3] = (char)j2;
p2[4] = (char)k2;
if (Hashes.GetNonRandomizedHashCode(p1, str1.Length) == Hashes.GetNonRandomizedHashCode(p2, str2.Length))
{
Interlocked.Increment(ref totalCollisions);
//Console.WriteLine($"Collision found: {new string(str1)} and {new string(str2)}");
}
}
}
}
}
}
}
}
Console.WriteLine("Done");
}
}
public class Hashes
{
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static unsafe int GetNonRandomizedHashCode(char* src, int length)
{
uint hash1 = (5381 << 16) + 5381;
uint hash2 = hash1;
uint* ptr = (uint*)src;
while (length > 2)
{
length -= 4;
hash1 = (BitOperations.RotateLeft(hash1, 5) + hash1) ^ ptr[0];
hash2 = (BitOperations.RotateLeft(hash2, 5) + hash2) ^ ptr[1];
ptr += 2;
}
if (length > 0)
hash2 = (BitOperations.RotateLeft(hash2, 5) + hash2) ^ ptr[0];
return (int)(hash1 + (hash2 * 1566083941));
}
} it found
(for comparison, Marvin found 30 and 32 with a different seed, xxHash3 found 34)
|
While I was investigating perf traces for some of our 1P customers, I noticed samples inside
Dictionary.FindValue
oncollision++
I then realized it hits quite of few of them (but not enough to rehash to Marvin since the dictrionary overall is not too big)I made a small repro - here I create 5-char strings
AA000
and then change the last 3 chars (only to[0..z]
symbols) to find total number of collisions for this string:it found
213120
(!!) unique collisions! And I was only changing last 3 chars to['0'..'z']
symbols. Examples:(for comparison, Marvin found 15)
The text was updated successfully, but these errors were encountered: