Skip to content

String Hashing

Kirill Osenkov edited this page Jul 28, 2022 · 12 revisions

String hash functions used by MSBuild

For MSBuild BinaryLogger we decided to go with this one, which is very good and super simple: https://github.com/dotnet/msbuild/blob/e1de0c476fac967bd1f1d4c17e5c8a70513cfdee/src/Build/Logging/BinaryLogger/BuildEventArgsWriter.cs#L1218-L1246

Microsoft.NET.StringTools.dll (part of MSBuild) chose djb2: https://github.com/dotnet/msbuild/blob/e1de0c476fac967bd1f1d4c17e5c8a70513cfdee/src/StringTools/InternableString.cs#L310-L397

Evaluating non-cryptographic zero-allocation string hash functions

Looking at deduplicating strings in an MSBuild binlog. Can't hold on to strings to catch collisions so need to rely on hash function not having collisions. Datasets include binlogs with up to 2.7 million strings. Different binlogs have different distributions on string sizes, so hash function behavior will be very different on similar-sized binlogs.

I ran the tests on .NET Framework 64-bit. MSBuild is mostly 32-bit but I couldn't fit the large binlogs in a 32-bit process so had to run the test in 64-bit. The micro-benchmarks ran in .NET Framework 32-bit. You can run on .NET 5 and I'm sure you'll get different numbers but I needed this for MSBuild.

The binlog dataset is not publicly available, only the micro-benchmark is public.

Spoiler alert: I'm going with FNV-1a 64-bit hash size and an "optimization": https://github.com/KirillOsenkov/Benchmarks/blob/9cf36b1bd72a6768b5cb4d3a7838571e88bd1e4b/Benchmarks/Tests/StringHash.Fnv.cs#L54-L72

Links:

Collisions

I've evaluated 32-bit and 64-bit hash sizes. All functions tested are roughly equivalent in terms of collisions and provide a good even distribution. The Birthday Paradox roughly estimates the number of collisions per dataset size. I've found that 32-bit hash size is not enough for the datasets. All hash functions start having collisions between 30,000-100,000 strings. For largest binlogs at 2.7 million strings all 32-bit functions have ~800 collisions which is too much. Having a collision would mean the user would be shown a completely unrelated random string in a random location and this is very confusing.

Good news is that 64-bit hash size had 0 collisions for all binlogs I tested. I don't know how much larger the datasets have to get to start hitting 64-bit collisions. But good news is that collisions are unlikely (but possible!) for the current size.

Going with a 64-bit hash size.

Perf

The results vary heavily based on the binlog. xxHash32 and Marvin compete for the first place on large binlogs. FNV and others behaved well. I'd say xxHash is the overall winner in terms of perf.

Results for the largest binlog:

Strings: 2,744,490
  Fnv1a64         Collisions: 0               Elapsed: 00:00:10.7489566
  Fnv1a64Fast     Collisions: 0               Elapsed: 00:00:11.0249801
  xxHash64        Collisions: 0               Elapsed: 00:00:11.5102876
  Murmur3-128     Collisions: 0               Elapsed: 00:00:12.8930213
  Fnv1a32Fast     Collisions: 836             Elapsed: 00:00:08.2846016
  xxHash32        Collisions: 913             Elapsed: 00:00:09.5995559
  Murmur3-32      Collisions: 905             Elapsed: 00:00:09.6880563
  djb2            Collisions: 943             Elapsed: 00:00:10.3829440
  Fnv1a32         Collisions: 920             Elapsed: 00:00:11.8845191
  GetHashCode     Collisions: 880             Elapsed: 00:00:12.1517814
  Marvin          Collisions: 874             Elapsed: 00:00:12.7311613

Faster FNV-1a

When hashing a string the canonical FNV-1a dictates that each octet needs to be mixed in separately. So for a .NET char you need to do it twice:

    char ch = text[i];
    byte b = (byte)ch;
    hash ^= b;
    hash *= Prime;

    b = (byte)(ch >> 8);
    hash ^= b;
    hash *= Prime;

However I decided to see if I can get away with mixing in the entire char in one fell swoop:

    char ch = text[i];
    hash = (hash ^ ch) * Prime;

and the 64-bit hash still had no collisions and performance improved roughly 2x, so going to go with this optimization. This optimization for 32-bit did result in more collisions than the canonical version in some situations:

  Fnv1a32Fast  Collisions: 15  Elapsed: 00:00:00.9699976
  Fnv1a32      Collisions: 8   Elapsed: 00:00:01.7843502

but not in others:

  Fnv1a32Fast  Collisions: 9   Elapsed: 00:00:00.2443658
  Fnv1a32      Collisions: 17  Elapsed: 00:00:00.4131953

roughly the collisions stayed on par but performance improved 2x.

Full results

Strings: 5,755
  xxHash64        Collisions: 0               Elapsed: 00:00:00.0026576
  GetHashCode     Collisions: 0               Elapsed: 00:00:00.0032591
  Murmur3-128     Collisions: 0               Elapsed: 00:00:00.0036303
  Murmur3-32      Collisions: 0               Elapsed: 00:00:00.0045689
  djb2            Collisions: 0               Elapsed: 00:00:00.0057463
  Fnv1a64Fast     Collisions: 0               Elapsed: 00:00:00.0061315
  Fnv1a64         Collisions: 0               Elapsed: 00:00:00.0106836
  Fnv1a32         Collisions: 0               Elapsed: 00:00:00.0108679
  Marvin          Collisions: 0               Elapsed: 00:00:00.0121878
  xxHash32        Collisions: 0               Elapsed: 00:00:00.0131451
  Fnv1a32Fast     Collisions: 0               Elapsed: 00:00:00.0214134


Strings: 8,978
  Murmur3-128     Collisions: 0               Elapsed: 00:00:00.0014915
  GetHashCode     Collisions: 0               Elapsed: 00:00:00.0016946
  xxHash32        Collisions: 0               Elapsed: 00:00:00.0017028
  xxHash64        Collisions: 0               Elapsed: 00:00:00.0018599
  djb2            Collisions: 0               Elapsed: 00:00:00.0022298
  Murmur3-32      Collisions: 0               Elapsed: 00:00:00.0022457
  Fnv1a32Fast     Collisions: 0               Elapsed: 00:00:00.0025141
  Fnv1a32         Collisions: 0               Elapsed: 00:00:00.0042188
  Fnv1a64         Collisions: 0               Elapsed: 00:00:00.0050270
  Fnv1a64Fast     Collisions: 0               Elapsed: 00:00:00.0051942
  Marvin          Collisions: 0               Elapsed: 00:00:00.0064653


Strings: 6,170
  xxHash64        Collisions: 0               Elapsed: 00:00:00.0010432
  Murmur3-128     Collisions: 0               Elapsed: 00:00:00.0011453
  GetHashCode     Collisions: 0               Elapsed: 00:00:00.0015364
  Murmur3-32      Collisions: 0               Elapsed: 00:00:00.0015985
  Marvin          Collisions: 0               Elapsed: 00:00:00.0018468
  Fnv1a64Fast     Collisions: 0               Elapsed: 00:00:00.0023387
  djb2            Collisions: 0               Elapsed: 00:00:00.0023605
  Fnv1a32Fast     Collisions: 0               Elapsed: 00:00:00.0023911
  xxHash32        Collisions: 0               Elapsed: 00:00:00.0025984
  Fnv1a32         Collisions: 0               Elapsed: 00:00:00.0041617
  Fnv1a64         Collisions: 0               Elapsed: 00:00:00.0041768


Strings: 5,767
  xxHash64        Collisions: 0               Elapsed: 00:00:00.0035053
  xxHash32        Collisions: 0               Elapsed: 00:00:00.0046727
  Murmur3-128     Collisions: 0               Elapsed: 00:00:00.0052897
  GetHashCode     Collisions: 0               Elapsed: 00:00:00.0074348
  Marvin          Collisions: 0               Elapsed: 00:00:00.0132039
  djb2            Collisions: 0               Elapsed: 00:00:00.0140372
  Fnv1a32Fast     Collisions: 0               Elapsed: 00:00:00.0140615
  Fnv1a64Fast     Collisions: 0               Elapsed: 00:00:00.0141157
  Fnv1a32         Collisions: 0               Elapsed: 00:00:00.0270089
  Fnv1a64         Collisions: 0               Elapsed: 00:00:00.0271505
  Murmur3-32      Collisions: 1               Elapsed: 00:00:00.0089661


Strings: 10,577
  xxHash64        Collisions: 0               Elapsed: 00:00:00.0043155
  xxHash32        Collisions: 0               Elapsed: 00:00:00.0055293
  Murmur3-128     Collisions: 0               Elapsed: 00:00:00.0061946
  GetHashCode     Collisions: 0               Elapsed: 00:00:00.0073197
  Marvin          Collisions: 0               Elapsed: 00:00:00.0106193
  Fnv1a32Fast     Collisions: 0               Elapsed: 00:00:00.0132216
  Fnv1a64Fast     Collisions: 0               Elapsed: 00:00:00.0135098
  djb2            Collisions: 0               Elapsed: 00:00:00.0136523
  Murmur3-32      Collisions: 0               Elapsed: 00:00:00.0143501
  Fnv1a32         Collisions: 0               Elapsed: 00:00:00.0250287
  Fnv1a64         Collisions: 0               Elapsed: 00:00:00.0288256


Strings: 13,144
  xxHash32        Collisions: 0               Elapsed: 00:00:00.0068207
  Murmur3-128     Collisions: 0               Elapsed: 00:00:00.0073917
  GetHashCode     Collisions: 0               Elapsed: 00:00:00.0091019
  xxHash64        Collisions: 0               Elapsed: 00:00:00.0091676
  Murmur3-32      Collisions: 0               Elapsed: 00:00:00.0110823
  Fnv1a64Fast     Collisions: 0               Elapsed: 00:00:00.0167608
  Fnv1a32Fast     Collisions: 0               Elapsed: 00:00:00.0168389
  djb2            Collisions: 0               Elapsed: 00:00:00.0171546
  Fnv1a64         Collisions: 0               Elapsed: 00:00:00.0321957
  Fnv1a32         Collisions: 0               Elapsed: 00:00:00.0352745
  Marvin          Collisions: 1               Elapsed: 00:00:00.0129518


Strings: 22,963
  xxHash64        Collisions: 0               Elapsed: 00:00:00.0052700
  Murmur3-128     Collisions: 0               Elapsed: 00:00:00.0060257
  GetHashCode     Collisions: 0               Elapsed: 00:00:00.0061548
  Murmur3-32      Collisions: 0               Elapsed: 00:00:00.0077195
  Fnv1a64Fast     Collisions: 0               Elapsed: 00:00:00.0099857
  djb2            Collisions: 0               Elapsed: 00:00:00.0106575
  xxHash32        Collisions: 0               Elapsed: 00:00:00.0122694
  Fnv1a32Fast     Collisions: 0               Elapsed: 00:00:00.0124988
  Marvin          Collisions: 0               Elapsed: 00:00:00.0132141
  Fnv1a64         Collisions: 0               Elapsed: 00:00:00.0175389
  Fnv1a32         Collisions: 0               Elapsed: 00:00:00.0216942


Strings: 33,245
  Murmur3-128     Collisions: 0               Elapsed: 00:00:00.0055987
  Fnv1a32Fast     Collisions: 0               Elapsed: 00:00:00.0083299
  xxHash64        Collisions: 0               Elapsed: 00:00:00.0112533
  Fnv1a64Fast     Collisions: 0               Elapsed: 00:00:00.0129992
  Fnv1a32         Collisions: 0               Elapsed: 00:00:00.0139026
  GetHashCode     Collisions: 0               Elapsed: 00:00:00.0142404
  Murmur3-32      Collisions: 0               Elapsed: 00:00:00.0143126
  xxHash32        Collisions: 0               Elapsed: 00:00:00.0151984
  Fnv1a64         Collisions: 0               Elapsed: 00:00:00.0180864
  Marvin          Collisions: 0               Elapsed: 00:00:00.0205346
  djb2            Collisions: 1               Elapsed: 00:00:00.0092903


Strings: 40,004
  xxHash64        Collisions: 0               Elapsed: 00:00:00.0611167
  xxHash32        Collisions: 0               Elapsed: 00:00:00.0808315
  Murmur3-128     Collisions: 0               Elapsed: 00:00:00.1196355
  GetHashCode     Collisions: 0               Elapsed: 00:00:00.1333144
  Murmur3-32      Collisions: 0               Elapsed: 00:00:00.1646332
  Marvin          Collisions: 0               Elapsed: 00:00:00.1926077
  Fnv1a64Fast     Collisions: 0               Elapsed: 00:00:00.2671009
  Fnv1a32Fast     Collisions: 0               Elapsed: 00:00:00.2682030
  djb2            Collisions: 0               Elapsed: 00:00:00.2741916
  Fnv1a32         Collisions: 0               Elapsed: 00:00:00.5120110
  Fnv1a64         Collisions: 0               Elapsed: 00:00:00.5264215


Strings: 95,200
  Murmur3-128     Collisions: 0               Elapsed: 00:00:00.0522342
  xxHash64        Collisions: 0               Elapsed: 00:00:00.0596649
  Murmur3-32      Collisions: 0               Elapsed: 00:00:00.0954755
  Fnv1a64Fast     Collisions: 0               Elapsed: 00:00:00.1185492
  Fnv1a64         Collisions: 0               Elapsed: 00:00:00.1993123
  xxHash32        Collisions: 1               Elapsed: 00:00:00.0498529
  GetHashCode     Collisions: 1               Elapsed: 00:00:00.0617051
  Marvin          Collisions: 3               Elapsed: 00:00:00.0875014
  Fnv1a32Fast     Collisions: 2               Elapsed: 00:00:00.1141006
  djb2            Collisions: 2               Elapsed: 00:00:00.1203841
  Fnv1a32         Collisions: 2               Elapsed: 00:00:00.2019409


Strings: 288,134
  Murmur3-128     Collisions: 0               Elapsed: 00:00:00.1702985
  xxHash64        Collisions: 0               Elapsed: 00:00:00.1863770
  Fnv1a64Fast     Collisions: 0               Elapsed: 00:00:00.2373387
  Fnv1a64         Collisions: 0               Elapsed: 00:00:00.3931002
  xxHash32        Collisions: 9               Elapsed: 00:00:00.1286781
  GetHashCode     Collisions: 9               Elapsed: 00:00:00.1491944
  Murmur3-32      Collisions: 9               Elapsed: 00:00:00.1840540
  Marvin          Collisions: 12              Elapsed: 00:00:00.2290242
  djb2            Collisions: 17              Elapsed: 00:00:00.2416959
  Fnv1a32Fast     Collisions: 9               Elapsed: 00:00:00.2426100
  Fnv1a32         Collisions: 17              Elapsed: 00:00:00.3948087


Strings: 220,177
  xxHash64        Collisions: 0               Elapsed: 00:00:00.1724258
  Murmur3-128     Collisions: 0               Elapsed: 00:00:00.2341446
  Fnv1a64Fast     Collisions: 0               Elapsed: 00:00:00.5152870
  Fnv1a64         Collisions: 0               Elapsed: 00:00:00.9254929
  xxHash32        Collisions: 1               Elapsed: 00:00:00.2250230
  GetHashCode     Collisions: 8               Elapsed: 00:00:00.2954302
  Murmur3-32      Collisions: 6               Elapsed: 00:00:00.3391128
  Marvin          Collisions: 6               Elapsed: 00:00:00.4150414
  djb2            Collisions: 12              Elapsed: 00:00:00.5085072
  Fnv1a32Fast     Collisions: 4               Elapsed: 00:00:00.5168778
  Fnv1a32         Collisions: 4               Elapsed: 00:00:00.9458197


Strings: 302,685
  xxHash64        Collisions: 0               Elapsed: 00:00:00.2679292
  Murmur3-128     Collisions: 0               Elapsed: 00:00:00.4355118
  Fnv1a64Fast     Collisions: 0               Elapsed: 00:00:00.9426571
  Fnv1a64         Collisions: 0               Elapsed: 00:00:01.7275618
  xxHash32        Collisions: 11              Elapsed: 00:00:00.3675883
  GetHashCode     Collisions: 16              Elapsed: 00:00:00.5649634
  Murmur3-32      Collisions: 11              Elapsed: 00:00:00.5997929
  Marvin          Collisions: 15              Elapsed: 00:00:00.6960864
  djb2            Collisions: 9               Elapsed: 00:00:00.9451107
  Fnv1a32Fast     Collisions: 15              Elapsed: 00:00:00.9599598
  Fnv1a32         Collisions: 8               Elapsed: 00:00:01.7636629


Strings: 681,346
  xxHash64        Collisions: 0               Elapsed: 00:00:04.3879496
  Murmur3-128     Collisions: 0               Elapsed: 00:00:06.9035179
  Fnv1a64         Collisions: 0               Elapsed: 00:00:08.8463328
  Fnv1a64Fast     Collisions: 0               Elapsed: 00:00:15.8771657
  Marvin          Collisions: 71              Elapsed: 00:00:03.3497778
  Murmur3-32      Collisions: 52              Elapsed: 00:00:06.7033923
  djb2            Collisions: 57              Elapsed: 00:00:06.9822773
  xxHash32        Collisions: 54              Elapsed: 00:00:07.6723920
  Fnv1a32         Collisions: 67              Elapsed: 00:00:12.8489714
  GetHashCode     Collisions: 54              Elapsed: 00:00:18.8672386
  Fnv1a32Fast     Collisions: 56              Elapsed: 00:00:18.9755531


Strings: 2,744,490
  Fnv1a64         Collisions: 0               Elapsed: 00:00:10.7489566
  Fnv1a64Fast     Collisions: 0               Elapsed: 00:00:11.0249801
  xxHash64        Collisions: 0               Elapsed: 00:00:11.5102876
  Murmur3-128     Collisions: 0               Elapsed: 00:00:12.8930213
  Fnv1a32Fast     Collisions: 836             Elapsed: 00:00:08.2846016
  xxHash32        Collisions: 913             Elapsed: 00:00:09.5995559
  Murmur3-32      Collisions: 905             Elapsed: 00:00:09.6880563
  djb2            Collisions: 943             Elapsed: 00:00:10.3829440
  Fnv1a32         Collisions: 920             Elapsed: 00:00:11.8845191
  GetHashCode     Collisions: 880             Elapsed: 00:00:12.1517814
  Marvin          Collisions: 874             Elapsed: 00:00:12.7311613


Strings: 2,711,714
  Murmur3-128     Collisions: 0               Elapsed: 00:00:10.1355745
  xxHash64        Collisions: 0               Elapsed: 00:00:11.5539888
  Fnv1a64Fast     Collisions: 0               Elapsed: 00:00:12.1999304
  Fnv1a64         Collisions: 0               Elapsed: 00:00:13.8865074
  Marvin          Collisions: 861             Elapsed: 00:00:10.0699179
  Fnv1a32Fast     Collisions: 888             Elapsed: 00:00:11.0205952
  GetHashCode     Collisions: 817             Elapsed: 00:00:11.0887121
  xxHash32        Collisions: 845             Elapsed: 00:00:11.8177209
  djb2            Collisions: 842             Elapsed: 00:00:12.3049021
  Murmur3-32      Collisions: 855             Elapsed: 00:00:12.3816100
  Fnv1a32         Collisions: 859             Elapsed: 00:00:14.2399122


Strings: 746,697
  xxHash64        Collisions: 0               Elapsed: 00:00:00.7954374
  Murmur3-128     Collisions: 0               Elapsed: 00:00:01.1594128
  Fnv1a64Fast     Collisions: 0               Elapsed: 00:00:02.4248640
  Fnv1a64         Collisions: 0               Elapsed: 00:00:04.6697070
  xxHash32        Collisions: 61              Elapsed: 00:00:01.2658189
  Murmur3-32      Collisions: 74              Elapsed: 00:00:01.6817723
  Marvin          Collisions: 69              Elapsed: 00:00:01.8497144
  GetHashCode     Collisions: 67              Elapsed: 00:00:02.3598280
  Fnv1a32Fast     Collisions: 60              Elapsed: 00:00:02.4199466
  djb2            Collisions: 67              Elapsed: 00:00:02.5917555
  Fnv1a32         Collisions: 71              Elapsed: 00:00:04.5650691


Strings: 1,633,734
  xxHash64        Collisions: 0               Elapsed: 00:00:01.7773656
  Murmur3-128     Collisions: 0               Elapsed: 00:00:01.9570083
  Fnv1a64Fast     Collisions: 0               Elapsed: 00:00:04.7777651
  Fnv1a64         Collisions: 0               Elapsed: 00:00:05.8021424
  xxHash32        Collisions: 308             Elapsed: 00:00:01.8537316
  Murmur3-32      Collisions: 303             Elapsed: 00:00:02.4810003
  Marvin          Collisions: 315             Elapsed: 00:00:02.7430470
  djb2            Collisions: 289             Elapsed: 00:00:03.3194793
  Fnv1a32         Collisions: 330             Elapsed: 00:00:06.1967731
  Fnv1a32Fast     Collisions: 312             Elapsed: 00:00:12.0185989
  GetHashCode     Collisions: 324             Elapsed: 00:00:16.0363182