Works by maintaining the xxHash32 state variables (v1 -> v4, length) as well as a queue of values that fall outside of the block size (16 bytes/4 ints). Seed is assumed to be zero as there is no seed ctor in the proposal - however, it is possible to add a seed without changing the structure. Tests against known xxHash32 vectors are provided ("abcd1234efgh5678...").
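As a rough illustration of the mechanism described above, here is a simplified standalone sketch (not the PR's actual code; the type and member names are invented, and the final mixing of queued values is elided):

```csharp
using System;

// Simplified sketch of an xxHash32-style streaming hash: four accumulators
// (v1..v4), a 3-slot queue for values that do not yet complete a 16-byte
// block, and a running count. Constants are the standard xxHash32 primes.
struct HashSketch
{
    private const uint Prime1 = 2654435761u, Prime2 = 2246822519u, Prime5 = 374761393u;

    private uint _v1, _v2, _v3, _v4;
    private uint _q1, _q2, _q3;  // queued inputs (fewer than a full block)
    private uint _length;        // number of uints added so far

    private static uint Rol(uint x, int r) => (x << r) | (x >> (32 - r));
    private static uint Round(uint acc, uint input)
        => unchecked(Rol(acc + input * Prime2, 13) * Prime1);

    public void Add(uint value)
    {
        uint position = _length++ & 0x3;  // slot within the current block
        if (position == 0) _q1 = value;
        else if (position == 1) _q2 = value;
        else if (position == 2) _q3 = value;
        else
        {
            if (_length == 4)  // first complete block: initialize state
            {
                unchecked { _v1 = Prime1 + Prime2; _v2 = Prime2; _v3 = 0; _v4 = 0 - Prime1; }
            }
            _v1 = Round(_v1, _q1); _v2 = Round(_v2, _q2);
            _v3 = Round(_v3, _q3); _v4 = Round(_v4, value);
        }
    }

    // Partial finalization; real xxHash32 would also mix in the queued
    // values and apply the final avalanche.
    public uint State()
        => _length >= 4
            ? unchecked(Rol(_v1, 1) + Rol(_v2, 7) + Rol(_v3, 12) + Rol(_v4, 18) + _length * 4)
            : unchecked(Prime5 + _length * 4);
}
```

Determinism follows directly: adding the same sequence always yields the same state, which is what the test vectors exercise.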
Thanks for picking up on the PR for the HashCode type!
{
    // Initialize

    state[0] = Prime1 + Prime2;
To prevent HashCode consumers from assuming the hash code is suitable for persistence, the initialization should make use of a random, per-AppDomain seed. For XX32, the seed is added to the first three state vars (state[0] += seed, same for indices 1 and 2) and subtracted from for the last (state[3] = seed - state[3]).
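A sketch of what that seeded initialization could look like (the names SeededInit, s_seed, and InitializeState are illustrative only; the seed placement follows the comment above, matching xxHash32's standard seeded initialization where v4 = seed - Prime1):

```csharp
using System;

// Sketch: per-AppDomain random seed folded into the xxHash32 initial state.
static class SeededInit
{
    private const uint Prime1 = 2654435761u, Prime2 = 2246822519u;

    // Randomized once per AppDomain, so hash codes cannot be mistaken
    // for stable, persistable values.
    private static readonly uint s_seed =
        unchecked((uint)new Random().Next(int.MinValue, int.MaxValue));

    public static (uint V1, uint V2, uint V3, uint V4) InitializeState()
    {
        unchecked
        {
            return (s_seed + Prime1 + Prime2,  // seed added to the first three
                    s_seed + Prime2,
                    s_seed,
                    s_seed - Prime1);          // and subtracted for the last
        }
    }
}
```

Because every state variable shifts by the same seed, the per-variable differences stay fixed (V1 - V2 == Prime1, V3 - V4 == Prime1), which makes an easy sanity check.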
No problem. I still have the seed code in my polyfill.
This would break the test vector test (which the build server is having problems with anyway). I can't think of any other way to test the output of this function. Suggestions?
At my prototype project Haschisch.Kastriert I implemented a version where the seed could be specified explicitly. The tests would use a known seed, and the overloads for the default seed would use random numbers (from System.Security.Cryptography.RandomNumberGenerator).
[MethodImpl(MethodImplOptions.AggressiveInlining)]
[TargetedPatchingOptOut("Performance critical for inlining across NGen images.")]
private uint Round(uint seed, uint input)
Can be static.
#pragma warning disable 0809
[Obsolete("Use ToHashCode to retrieve the computed hash code.", error: true)]
[EditorBrowsable(EditorBrowsableState.Never)]
public override int GetHashCode() => ToHashCode();
My (and also @tannergooding's) interpretation of the proposed API was to throw new NotImplementedException(); here. Usually, and also for this implementation, ToHashCode() destroys state, i.e. calling .ToHashCode() twice will return different results. For .GetHashCode() this would be more than just unexpected, it would lead to wrong results for all the usual consumers (Dictionary etc.). Failing with InvalidOperationException might be even better.
Happy to do this, just need a final say on which Exception and what message.
public static int Combine<T1>(T1 value1)
{
    var hc = new HashCode();
    hc.Add(value1);
I once tried combining hash-codes with generic code, but due to performance (factor 2 to 8) ended up manually unrolling all .Combine() overloads. I see that you've elegantly improved handling of the state and queue, so this might improve the situation; but do you have some ballpark estimate of how much this costs? (For comparison: this was a larger difference than choosing SipHash-1-3, which would be hash-DoS resistant.)
I'll benchmark and post the results in a comment below.
        Assert.NotEqual(hcs[i], hcs[j]);
    }
}
In the issue's proposal, the idea was that building the hash-code using .Add() would result in the same hash-code as using a single .Combine() call; it would be nice to have a test-case for that.

cc: @morganbr
{
    var val = (uint)value;

    fixed (uint* state = _state)
As a convention we called such fixed pointers pXxx (pState in this example). It makes it easier to see in following code where there is a danger of overruns.
fixed (uint* state = _state)
{
    var queue = state + Queue;
    var position = state[Length] & 0x3;
So this is not really Length but rather LastIndex.
Actually, it seems like this should be LengthIdx, as the array is used as a packed struct (which would be nicer).
How messy? Could you simply use 4 ulongs? It might actually be faster as the code would not have loops.
@KrzysztofCwalina much less messy compared to the first iteration before switching to unsafe. Unsafe is still cleaner, but not enough to offset this perf improvement:
@gimpf unrolling the loop makes an absurd difference:
I'll have these changes done by tonight. Aside: the test failures are because of
@jcdickinson Beware constant-folding. When doing benchmarks with constants and unrolled implementations, sometimes the JIT works better than you'd expect. I had to switch to randomly initialized values to make that problem go away (and not static readonly, but initialized per benchmark run).
@gimpf thanks for the feedback. I updated the benchmark with:

private volatile int _counter;
private int GetCounter() => _counter++;

[Benchmark(Baseline = false)]
public void Unrolled()
{
    Safe.HashCode.Combine(
        GetCounter(),
        GetCounter(),
        GetCounter(),
        GetCounter(),
        GetCounter(),
        GetCounter(),
        GetCounter(),
        GetCounter());
}

It's still twice as fast. It looks like it's still worth it. I was hoping to avoid having to handcraft all those unrolled loops 😛
}

[MethodImpl(MethodImplOptions.AggressiveInlining)]
[TargetedPatchingOptOut("Performance critical for inlining across NGen images.")]
@jkotas is "targetedpatchingoptout" still needed in corefx?
if so when do we apply it...
@jkotas is "targetedpatchingoptout" still needed in corefx?
No. These attributes can be deleted.
Order of members now matches the proposal, if that matters. Combine methods are unrolled for perf. Removed superfluous attributes. Test vector tests are now optional (behind SYSTEM_HASHCODE_TESTVECTORS def) as the random seed will break these tests. Defining SYSTEM_HASHCODE_TESTVECTORS will disable the per-AppDomain random seed. Combine is now tested against the behavior of Add.
❌ NETFX x86 Release Build
What do I need to do to get the new type going in NETFX?
@@ -5,6 +5,7 @@
<ProjectGuid>{56B9D0A9-44D3-488E-8B42-C14A6E30CAB2}</ProjectGuid>
<AssemblyName>System.Runtime</AssemblyName>
<IsPartialFacadeAssembly>true</IsPartialFacadeAssembly>
<AllowUnsafeBlocks>true</AllowUnsafeBlocks>
I have realized that this is no longer needed, I'll put it into the next commit that solves the build problem and any additional feedback.
Removed.
@@ -3675,6 +3675,56 @@ public sealed partial class WeakReference<T> : System.Runtime.Serialization.ISer
public void SetTarget(T target) { }
public bool TryGetTarget(out T target) { throw null; }
}

[System.Runtime.InteropServices.StructLayoutAttribute(System.Runtime.InteropServices.LayoutKind.Auto)]
Why is StructLayout specified here in the ref but not the implementation? Is it even necessary?
I have no idea. I had a look at the rest of the file and followed suit. Looking again it seems as though the vast majority include it.
Removed.
{
    private uint _v1, _v2, _v3, _v4;
    private uint _queue1, _queue2, _queue3;
    private uint _length;
Private fields aren't included in the refs for other structs (e.g. DateTime), why are they needed here?
As per the rest of the file, it seems as though either the fields are present or [StructLayout(Size)] is set. Maybe the length of these types is important?
Removed.
{
    var hc1 = (uint)(value1?.GetHashCode() ?? 0);

    var hash = MixEmptyState();
According to the corefx coding style:

- We only use var when it's obvious what the variable type is (i.e. var stream = new FileStream(...) not var stream = OpenStandardInput()).

In this case it looks like uint. Applies throughout.
@jcdickinson, don't worry about adding it to NetFX yet, we'll handle that after it gets into CoreFX. Just turn the test off for it by checking TargetGroup != netfx in the test csproj.

@morganbr thanks for the info. Must I commit that change?

@jcdickinson, you'll need it sometime before we merge, but I'm working on more feedback right now so no need for a separate commit.

Can someone confirm that structs in
#if SYSTEM_HASHCODE_TESTVECTORS
private static readonly uint _seed = 0;
#else
private static readonly uint _seed = unchecked((uint)new Random().Next(int.MinValue, int.MaxValue));
Changing to s_seed.
Structs in this file do not need to be the correct length. Delete the private fields and the StructLayoutAttribute.
This is great progress, @jcdickinson and I'm excited to see this coming together.
I've visually verified that the implementation matches the original xxHash and hopefully most of my other comments should be straightforward to address.
#if SYSTEM_HASHCODE_TESTVECTORS
private static readonly uint _seed = 0;
#else
private static readonly uint _seed = unchecked((uint)new Random().Next(int.MinValue, int.MaxValue));
Please use RandomNumberGenerator.Create().GetBytes() here for high-quality entropy. It's not critical for xxHash, but would be important with other algorithms.
System.Security is not available in this assembly :).
@bartonjs, can you please help with the incantations for getting proper entropy here?
Initial thought: If seeded Marvin is enabled by default, string.Empty.GetHashCode() is entropic. But there's probably a way of getting at randomness from within System.Runtime if I let myself think for more than 20 seconds.
new Random().Next(int.MinValue, int.MaxValue) should be a good enough source of entropy in .NET Core.

RandomNumberGenerator.Create().GetBytes()

At this level of the stack, you can use SystemNative_GetNonCryptographicallySecureRandomBytes and the Windows equivalent if you have a good reason to - look for Interop.GetRandomBytes in CoreLib. The Random global seed is initialized by this function among other things.
The number of reads doesn't matter that much (or at all, I haven't checked the generated code).
And as always, it's best to measure before attempting optimizations. Try it with a fixed (const) seed and compare the result with the current version. If the difference is significant then we can consider alternatives. One possible solution is to avoid static constructors and instead use the good old fashioned if (s_seed == 0) s_seed = ...;
Intrinsic fields might still be an option but to use that we need to ensure that this code is never crossgened, that sounds kind of spooky.
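A thread-safe version of that lazy initialization might look like the following sketch (invented names; a plain "if (s_seed == 0) s_seed = ..." would let racing threads observe different seeds, whereas Interlocked.CompareExchange publishes exactly one winner):

```csharp
using System;
using System.Threading;

// Sketch: lazy seed initialization without a static constructor.
// CompareExchange makes the initialization race benign: all threads end up
// agreeing on a single non-zero seed.
static class LazySeed
{
    private static int s_seed;  // 0 means "not initialized yet"

    public static uint Value
    {
        get
        {
            int seed = Volatile.Read(ref s_seed);
            if (seed == 0)
            {
                int candidate = new Random().Next(1, int.MaxValue);  // never 0
                Interlocked.CompareExchange(ref s_seed, candidate, 0);
                seed = Volatile.Read(ref s_seed);
            }
            return (uint)seed;
        }
    }
}
```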
@jkotas , I think it's somewhat important to use a cryptographic random number generator here for a couple reasons:
- We may change the algorithm in the future to account for DoS resistance, which would make a strong RNG critical
- Non-cryptographic RNG can have seeding issues such that multiple AppDomains/processes/machines end up with the same seed.
Having a dependency from the core runtime parts on the full cryptographic stack would create ugly dependencies. I do think it is the right tradeoff between complexity and security. We had this discussion multiple times before, e.g. #16652.
We may change the algorithm in the future to account for DoS resistance
If we ever do this, we can re-evaluate what makes sense.
Non-cryptographic RNG can have seeding issues

If there are seeding issues in SystemNative_GetNonCryptographicallySecureRandomBytes, we should fix them.
There is no real difference between any approach in terms of performance:
Method | Mean | Error | StdDev | Scaled
---|---|---|---|---
Seed Read 1 | 12.12 ns | 0.1618 ns | 0.1514 ns | 0.22
Seed Read 4 | 12.01 ns | 0.1978 ns | 0.1850 ns | 0.21
Seed Lazy | 11.91 ns | 0.0165 ns | 0.0154 ns | 0.21
Seed Thread Lazy | 13.85 ns | 0.0334 ns | 0.0261 ns | 0.25
Is there another place where I can see how the secure RNG is used so that I can chamber a change containing it?
After further investigation with @bartonjs, CoreLib's Interop.GetRandomBytes is acceptable on most machines, so I can live with it here. @jcdickinson, I think the best way to get that would be to actually move this change to System.Private.CoreLib in the https://github.com/dotnet/coreclr/ repo after we sign off here (we'll just merge as soon as CI goes green assuming nothing meaningful changes in the move). Make sure to put it in the src/mscorlib/shared directory so that it automatically mirrors to CoreRT. That'll give you access to the API and also let CoreLib use HashCode.
@jkotas, for reference, the issues in GetNonCryptographicallySecureRandomBytes are:
- It can use urandom before it's seeded, meaning it may be predictable or even constant between containers/VMs/IoT devices where entropy trickles in slowly and a .NET service may run on boot.
- It silently fails over from random/urandom to srand(time(null)) if a machine doesn't support them or they're too slow. If this happens, the seed is both very predictable and could easily be the same across processes or machines. It also means cases like leap seconds and VM rollbacks will cause repeat behavior.
private uint _queue1, _queue2, _queue3;
private uint _length;

[MethodImpl(MethodImplOptions.AggressiveInlining)]
Here and everywhere else you've used AggressiveInlining, do you have data showing it's better? @jkotas, do we have guidance on when to use it?
Right, it should be supported by data that shows it makes difference.
Last time I checked (with my implementations in Haschisch) it made a difference. It could easily be that (at best) some are redundant (as the JIT would inline anyways), but any single function call that remains will cause performance to drop. Also, the typical .GetHashCode() caller usually will not be inlined anyways (to help against exploding code-size).

I have to admit that I didn't check in detail how much, if any, of the performance improvement remains in more complex code; but again, given that the caller usually will not be inlined, I'd be surprised if this annotation would be detrimental (although I have been surprised many times already, so what gives...)
supported by data

Just remembered that there's an app for that.

Most of the methods run up against "Native estimate for function size exceeds threshold" (with the exception of MixEmptyState). Writing to arguments (all the mix methods) outright prevented inlining even with AggressiveInlining. Switch statements are a no-no. Curiously, Rol doesn't get inlined; I thought it was supposed to be a JIT intrinsic?
The tool doesn't support generic methods at all. Everything else is inlined after the latest batch of changes:
{
    unchecked
    {
        var hc1 = (uint)(value1?.GetHashCode() ?? 0);
I'm not sure we actually defined what null should do in the proposal. I have slightly mixed feelings. Treating it as zero may be more likely to cause collisions since lots of GetHashCode implementations may also return zero (for example, that means Combine((Nullable<int>)0) is the same as Combine((Nullable<int>)null)). We could:

- Keep using zero.
- Throw ArgumentNullException.
- Pick a largish non-zero number that might be less likely to collide -- perhaps 0x6E756C6C ("null" in ASCII).
- Use typeof(T1).GetHashCode() so that we're at least including some information in the hash code and so that Combine((string)null, 1) is different than Combine((object)null, 1). I'm not sure how this would perform though.

The downside to throwing ArgumentNullException is that it means Roslyn can't just generate GetHashCode overrides that look like return Combine(a, b, c, d);

I might lean toward 3 or 4. @jcdickinson, @terrajobst or other folks, what do you think?
Oddly I had this discussion with my brother today. He basically argued what you said. However,

- If your application differentiates between null and zero/unit, using zero in this case will result in a collision (which will swiftly fail an equals check). This results in marginally decreased performance.
- If your application does not differentiate (e.g. empty list = null), this will result in the surprising behavior of the "same" value being bucketed separately. This results in surprising behavior.

I don't like surprising other developers, so I'm quite strongly in the 0 camp.
EqualityComparer.Default agrees with you. Zero it is.
switch (position)
{
    case 0:
The queueing could probably use a comment since xxHash's initialization is confusing
}

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public int ToHashCode()
I believe this is supposed to check if it's already been finalized and throw since otherwise calling ToHashCode twice in a row would return different answers.
It might be possible to track that it's been finalized in some combination of the existing fields if you're concerned about cache lines. Something like _length=UInt.Max and _queue1=1 would be otherwise invalid (as long as we clear _queue1 in Add when we get a fourth input).
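The guard itself is simple. Here is a sketch (a separate bool is used for brevity, and the mixing is a stand-in, not xxHash32; the suggestion above would instead pack an otherwise-invalid sentinel such as _length == uint.MaxValue into existing fields to keep the struct small):

```csharp
using System;

// Sketch of guarding ToHashCode against repeated calls.
struct GuardedHashCode
{
    private int _hash;
    private bool _finalized;

    public void Add(int value)
        => _hash = unchecked(_hash * 31 + value);  // stand-in mixing, not xxHash32

    public int ToHashCode()
    {
        if (_finalized)
            throw new InvalidOperationException("ToHashCode may only be called once.");
        _finalized = true;
        return _hash;
    }
}
```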
ToHashCode doesn't actually change any state - so it is safe to call multiple times. This is not due to any genius on my part, just the ghost in the machine.
I'd suggest guarding it anyway since the "ghost in the machine" could easily change (and would if we switch algorithms on some platform).
hc.Add(1);
hc.Add(ConstComparer.ConstantValue);

Assert.NotEqual(expected.ToHashCode(), hc.ToHashCode());
I'm not clear on why these wouldn't be equal. Int32.GetHashCode just returns that int, so we should get 1234 as the second input for both cases. I think you might have a bug above where you're adding to hc in both cases instead of adding to expected.
#if SYSTEM_HASHCODE_TESTVECTORS
[Theory]
[InlineData(0x02cc5d05U)]
Where did you find these test vectors? How can we verify them? Are there others we should include?
Check the PR blurb for more info. I used this site with strings copied (increasing lengths of 4) from "abcd0123efgh4567ijkl8901mnop2345qrst6789uvwx0123yzab". My original polyfill test actually self-documents it, but Encoding is not available at this callsite - so I turned them all into uint literals.
Ah, I see now. Please add a comment to the code so that we can reproduce this in the future.
{
    var hc = new HashCode();
    hc.Add(1);
    hc.Add(100.2);
I'd rather avoid using floats. If this test starts failing on some platform due to floating point precision differences, it'll be awful to debug.
This should be OK in this case, as the CLR uses IEEE floats, the float is loaded from a literal, the spec says that conversion from R8 to internal float and back results in the same value, and GetHashCode() for floats should only use the bits available from the 32/64 bit of IEEE float, not any platform specific extended precision bits.
But, on the other hand, for testing, it would likely be more interesting to use a non-primitive value than a floating point number type.
@@ -0,0 +1,406 @@
// Licensed to the .NET Foundation under one or more agreements.
Please follow the rules from the "Copying Files from Other Projects" section of https://github.com/dotnet/coreclr/blob/master/Documentation/project-docs/contributing.md. In particular, I think that means including the xxHash license here and adding it to the 3rd party notices file. @richlander, please confirm we don't need to do anything else to make this file follow the rules.
namespace System
{
    public struct HashCode
It might be good to comment here that we're using xxHash32 and link back to it for future reference.
    break;
}

_length++;
I missed the earlier comment about this overflowing. Could you just add an overflow check here and throw rather than silently resetting the state?
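A minimal sketch of such a check (illustrative names; `checked` arithmetic throwing OverflowException would be an alternative to the explicit comparison):

```csharp
using System;

// Sketch: fail loudly when the 32-bit element count would wrap, rather than
// silently resetting state.
struct CountedState
{
    private uint _length;

    public void IncrementLength()
    {
        if (_length == uint.MaxValue)
            throw new InvalidOperationException("Too many values added to HashCode.");
        _length++;
    }

    public uint Length => _length;
}
```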
A cursory look from someone on the CoreFX team would be helpful - e.g. Numerics experts/owners @eerhardt @ViktorHofer
Looks good to me.
public static class HashCodeTests
{
#if SYSTEM_HASHCODE_TESTVECTORS
Why are these tests compiled out?
Also - (nit) the normal code formatting is to have #if FOO with no spaces between the # and if. This will just cause the next person editing this code trouble since VS will attempt to format the code (with the default format settings).
@eerhardt they are compiled out because the GetRandomBytes() randomization in HashCode removes the determinism. HashCode needs to be adjusted before they can execute correctly.

They are provided as a sort of semi-functional documentation. Any future patches against this type can use the tests to validate against actual vectors.
That sounds fine to me. Any chance you can stick that in a comment in the code?
Assert.Equal(expected, (uint)hc.ToHashCode());
}
#endif
(nit) Same comment as above - #endif
Thanks @jcdickinson
Nothing in this change seems wrong, but a few more tests to shore things up would be good.
Assert.Equal(expected, (uint)hc.ToHashCode());
}
#endif
It would be great to see a #else which verifies that the seed is non-zero. E.g. that the empty hash is not 0x02cc5d05.
If the seed is allowed to be randomly zero then it will have the potential for failure; but a 32-bit seed means a 1 in 4 billion chance of failure.
hc.Add(8);
Assert.Equal(hc.ToHashCode(), HashCode.Combine(1, 2, 3, 4, 5, 6, 7, 8));
}
If I recall correctly, adding any non-null object o and adding o.GetHashCode() (or the deferred-to-an-assistant equivalent) should be the same. If true, tests for that seem good. (If false... tests for that seem not unreasonable.)
I'd personally be happier with tests that codify the behaviors of GetHashCode() and Equals(object) (either should be targetable if you box the HashCode object).

But code inspection says they just throw, and this covers the functional aspects of the API, so it's fine.

Other bonus tests: type-specific nulls (e.g. (string)null), showing that things are null-safe.
@dotnet-bot test OSX x64 Debug Build please
Alpine.3.6 x64 Debug Build

@dotnet-bot test Alpine.3.6 x64 Debug Build please
Thank you, @jcdickinson! This is a big improvement for hashing and you did a great job working through complex issues! Now I can start filing issues to replace all of the other hashes 😄
* HashCode based on xxHash32 Works by maintaining the xxHash32 state variables (v1 -> v4, length) as well as a queue of values that fall outside of the block size (16 bytes/4 ints). The seed is initialized to random bytes. Further details, unit tests and history: #25013 Signed-off-by: dotnet-bot-corefx-mirror <dotnet-bot@microsoft.com>
Question: if I need to hash more than 8 values, which is the preferred pattern:

var hash = new HashCode();
hash.Add(value1); ... ; hash.Add(valueN);
return hash.ToHashCode();

or

return Hash.Combine(value1, ..., value7, Hash.Combine(value8, ..., value15, Hash.Combine(...

Thanks!
The first one is the right pattern to use.
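For example, a helper over an arbitrary-length sequence using that pattern (a sketch; SequenceHash is an illustrative name, and the System.HashCode API is the one this PR adds):

```csharp
using System;
using System.Collections.Generic;

// Sketch: hash an arbitrary number of values with the Add pattern instead of
// nesting Combine calls.
static class SequenceHash
{
    public static int Of<T>(IEnumerable<T> values)
    {
        var hash = new HashCode();
        foreach (T value in values)
            hash.Add(value);
        return hash.ToHashCode();
    }
}
```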
Great. That's what I'm using in Roslyn's new codegen. Thanks!
* HashCode based on xxHash32 (dotnet/corefx#14354) Tests against known xxHash32 vectors are provided ("abcd1234efgh5678..."). Test vector tests are now optional (behind SYSTEM_HASHCODE_TESTVECTORS def) as the random seed will break these tests. Defining SYSTEM_HASHCODE_TESTVECTORS will disable the per-AppDomain random seed. Combine is now tested against the behavior of Add. Commit migrated from dotnet/corefx@bd88e51
Fixes #14354

Description

Works by maintaining the xxHash32 state variables (v1 -> v4, length) as well as a queue of values that fall outside of the block size (16 bytes/4 ints). Seed is assumed to be zero as there is no ctor in the proposal - however, it is possible to add a seed without changing the structure.

It's possible to do this without unsafe, but the code is much more messy. This can be changed if adding AllowUnsafeBlocks to System.Runtime is a problem.

Tests against known xxHash32 vectors are provided. The hashes were calculated by providing chunks from "abcd0123efgh4567ijkl8901mnop2345qrst6789uvwx0123yzab" (increasing in length by 4) to this site. The integers are hard-coded in the tests because System.Text.Encoding.ASCII is not available.

Performance

Updates (staged)

The test vectors are now normally disabled due to the per-AppDomain seed randomization; define SYSTEM_HASHCODE_TESTVECTORS to enable them (which will disable the seed randomization, setting it to 0).

NB: this has been excluded from netfx builds in the tests csproj.
Deviations from xxHash32

- The length is kept as a uint rather than a ulong (a ulong would make the struct larger than a cache line). 4,294,967,300 fields ought to be enough for anyone; you have bigger problems if you are creating hash codes based on that much data (15GB).
- The hash is returned as an int.
.I'll be online in a few hours for feedback and to sign the CLA.
3rd Party Code
Submission containing materials of a third party:
cc: @stephentoub, @KrzysztofCwalina