-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Proposal: Add System.HashCode to make it easier to generate good hash codes. #19621
Comments
Proposal: add hash randomization support public static HashCode Randomized<T> { get; } // or CreateRandomized<T>
or
public static HashCode Randomized(Type type); // or CreateRandomized(Type type)
|
Proposal: add support for collections public HashCode Combine<T>(T[] values);
public HashCode Combine<T>(T[] values, IEqualityComparer<T> comparer);
public HashCode Combine<T>(Span<T> values);
public HashCode Combine<T>(Span<T> values, IEqualityComparer<T> comparer);
public HashCode Combine<T>(IEnumerable<T> values);
public HashCode Combine<T>(IEnumerable<T> IEqualityComparer<T> comparer); |
I think there is no need in overloads |
Yes, that was part of my eventual plan for this proposal. I think it's important to focus on how we want the API to look like before we go about adding those methods, though. |
What about having Hash32 and Hash64 types that would internally store 4 or 8 bytes worth of data? Document the pros/cons of each. Hash64 being good for X, but being potentially slower. Hash32 being faster, but potentially not as distributed (or whatever the tradeoff actually is).
This seems like useful behavior. But i could see people wanting to control this. So perhaps there should be two ways to create the Hash, one that takes no seed (and uses a random seed) and one that allows the seed to be provided. |
Note: Roslyn would love if this could be provided in the Fx. We're adding a feature to spit out a GetHashCode for the user. Currently, it generates code like: public override int GetHashCode()
{
var hashCode = -1923861349;
hashCode = hashCode * -1521134295 + this.b.GetHashCode();
hashCode = hashCode * -1521134295 + this.i.GetHashCode();
hashCode = hashCode * -1521134295 + EqualityComparer<string>.Default.GetHashCode(this.s);
return hashCode;
} This is not a great experience, and it exposes many ugly concepts. We would be thrilled to have a Hash.Whatever API that we could call through instead. Thanks! |
What about MurmurHash? It is reasonably fast and has very good hashing properties. There is also two different implementations, one that spits out 32-bit hashes and another that spits out 128-bit hashes. |
There is also vectorized implementations for both the 32-bit.and 128-bit formats. |
@tannergooding MurmurHash is fast, but not secure, from the sounds of this blog post. |
@jkotas, has there been any work in the JIT around generating better code for >4-byte structs on 32-bit since our discussions last year? Also, what do you think of @CyrusNajmabadi's proposal:
I still think this type would be very valuable to offer to developers and it would be great to have it in 2.0. |
@jamesqo, I don't think this implementation needs to be cryptographically secure (that is the purpose of the explicit cryptographically hashing functions). Also, that article applies to Murmur2. The issue has been resolved in the Murmur3 algorithm. |
I am not aware of any.
The framework types should be simple choices that work well for 95%+ of cases. They may not be the fastest ones, but that's fine. Having you to choose between Hash32 and Hash64 is not a simple choice. |
That's fine with me. But can we at least have a good-enough solution for those 95% cases? Right now there's nothing... :-/ |
@CyrusNajmabadi Why are you calling EqualityComparer here, and not just this.s.GetHashCode()? |
For non-structs: so that we don't need to check for null. This is close to what we generate for anonymous types behind the scenes as well. I optimize the case of known non-null values to generate code that would be more pleasing to users. But it would be nice to just have a built in API for this. |
The call to EqualityComparer.Default.GetHashCode is like 10x+ more expensive than check for null... . |
Sounds like a problem. if only there were good hash code API we could call in the Fx that i could defer to :) (also, we have that problem then in our anonymous types as that's what we generate there as well). Not sure what we do for tuples, but i'm guessing it's similar. |
|
Oh no. Looks like tuple can just use "HashHelpers". Could that be exposed so that users can get the same benefit? |
Great. I'm happy to do something similar. I started from our anonymous types because i figured they were reasonable best practices. If not, that's fine. :) But that's not why i'm here. I'm here to get some system that actually combines the hashes effectively. If/when that can be provided we'll gladly move to calling into that instead of hardcoding in random numbers and combining hash values ourselves. |
What would be the API shape that you think would work best for the compiler generated code? |
Literally any of the 32bit solutions that were presented earlier would be fine with me. Heck, 64bit solutions are fine with me. Just some sort of API that you can get that says "i can combine hashes in some sort of reasonable fashion and produce a reasonably distributed result". |
I can't reconcile these statements:
And
Can we not come up with a simple 32bit accumulating hash that works well enough for 95% of cases? What are the cases that aren't handled well here, and why do we think they're in the 95% case? |
@jkotas, is performance really that critical for this type? I think on average things like hashtable lookups and this would take up way more time than a few struct copies. If it does turn out to be a bottleneck, would it be reasonable to ask the JIT team to optimize 32-bit struct copies after the API is released so they have some incentive, rather than blocking this API on that when nobody is working on optimizing copies? |
We have been burnt really badly by default 32bit accumulating hash for strings, and that's why Marvin hash for strings in .NET Core - https://github.com/dotnet/corert/blob/87e58839d6629b5f90777f886a2f52d7a99c076f/src/System.Private.CoreLib/src/System/Marvin.cs#L25. I do not think we want to repeat same mistake here.
I do not think the performance is critical. Since it looks like that this API is going to be used by auto-generated compiler code, I think we should be preferring smaller generated code over how it looks. The non-fluent pattern is smaller code. |
That doesn't seem like the 95% case. We're talking about normal developers just wanting a "good enough" hash for all those types where they manually do things today.
This is not for use by the Roslyn compiler. This is for use by the Roslyn IDE when we help users generate GetHashCodes for their types. THis is code that the user will see and have to maintain, and having something sensible like: return Hash.Combine(this.A?.GetHashCode() ?? 0,
this.B?.GetHashCode() ?? 0,
this.C?.GetHashCode() ?? 0); is a lot nicer than a user seeing and having to maintain: var hashCode = -1923861349;
hashCode = hashCode * -1521134295 + this.b.GetHashCode();
hashCode = hashCode * -1521134295 + this.i.GetHashCode();
hashCode = hashCode * -1521134295 + EqualityComparer<string>.Default.GetHashCode(this.s);
return hashCode; |
I mean, we already have this code in the Fx: We think it's good enough for tuples. It's unclear to me why it would be such a problem to make it available for users who want it for their own types. Note: we've even considered doing this in roslyn: return (this.A, this.B, this.C).GetHashCode(); But now you're forcing people to generate a (potentially large) struct just to get some sort of reasonable default hashing behavior. |
The original string hash was a "good enough" hash that worked well for normal developers. But then it was discovered that ASP.NET webservers were vulnerable to DoS attacks because they tend to store received stuff in hashtables. So the "good enough" hash basically turned into a bad security issue.
No necessarily. We made a back stop measure for tuples to make the hashcode randomized that gives us option to modify the algorithm later. |
This looks reasonable to me. |
How awful would it be to add the HashCode source to the project like Roslyn does (with IL) with (the much simpler) compiler attribute class definitions when they aren't available through any referenced assembly? |
I'm just surprised there are no good ways to get overflow math to work in VB at all :( |
So, at a minimum, even if we were hashing two values together, it seems like we would have to create: var hc1 = (uint)(value1?.GetHashCode() ?? 0); // can overflow
var hc2 = (uint)(value2?.GetHashCode() ?? 0); // can overflow
uint hash = MixEmptyState();
hash += 8; // can overflow
hash = QueueRound(hash, hc1);
hash = QueueRound(hash, hc2);
hash = MixFinal(hash);
return (int)hash; // can overflow Note that this code already has 4 lines that can overflow. It also has two helper functions you need to call (i'm ignoring MixEmptyState as that seems more like a constant). MixFinal can definitely overflow: private static uint MixFinal(uint hash)
{
hash ^= hash >> 15;
hash *= Prime2;
hash ^= hash >> 13;
hash *= Prime3;
hash ^= hash >> 16;
return hash;
} as can QueueRound: private static uint QueueRound(uint hash, uint queuedValue)
{
hash += queuedValue * Prime3;
return Rol(hash, 17) * Prime4;
} So i don't honestly see how this would work :( |
How do you envision this working? What would customers write, and what would the compilers then do in response? |
Also, something that would address all of this is if .Net already has public helpers exposed on the surface API that convert from uint to int32 (and vice versa) without overflow. Do those exist? If so, i can easily write the VB versions, just using these for the situations where we need to go between the types without overflowing. |
I would think so. I mean, think about this from a customer perspective. They just want a decent GetHashCode method that is nicely self contained and gives reasonable results. Having that feature go and bloat up their code with auxiliary crap is going to be pretty unpleasant. It's also pretty bad given that the C# experience will be just fine. |
You might be able to get roughly the right overflow behavior by casting to and from some combination of signed and unsigned 64-bit types. Something like this (untested and I don't know VB casting syntax):
|
How do you knwo the following doesn't overflow? (Int32)((Unt64)hashCode * -1521134295) Or the final (int32) cast for that matter? |
I didn't realize it would use overflow-checked conv operations. I guess you could mask it down to 32 bits before casting:
|
presumably 31 bits, as a value of uint32.Max would also overflow on conversion to Int32 :) That's def possible. Ugly... but possible :) There's gunna be a lot of casts in this code. |
Ok. I think i have a workable solution. The core of the algorithm we generate today is: hashCode = hashCode * -1521134295 + j.GetHashCode(); Let's say that we're doing 64bit math, but "hashCode" has been capped to 32 bits. Then |
Thanks! |
@MaStr11 @morganbr @sharwell and everyone here. I've updated my code to generate the following for VB: Dim hashCode As Long = 2118541809
hashCode = (hashCode * -1521134295 + a.GetHashCode()) And Integer.MaxValue
hashCode = (hashCode * -1521134295 + b.GetHashCode()) And Integer.MaxValue
Return CType(hashCode And Integer.MaxValue, Integer) Can someone sanity check me to make sure that this makes sense and should not overflow even with checked mode on? |
@CyrusNajmabadi , that won't overflow (because Int64.Max = Int32.Max*Int32.Max and your constants are much smaller than that) but you're masking the high bit to zero, so it's only a 31-bit hash. Is leaving the high bit on considered an overflow? |
@CyrusNajmabadi But no, it can't actually overflow. |
Btw- I'd rather have Roslyn add a NuGet package than add a suboptimal hash. |
That's a good point. I think i was thinking about another algorithm that was using uints. So in order to safely convert from the long to a uint, i needed to not include the sign bit. However, as this is all signed math, i think it would be fine to just mask against 0xffffffff ensuring we only keep the bottom 32bit after adding each entry. |
Users can already do that if they want. This is about what to do when users do not, or can not, add those dependencies. This is also about providing a reasonably 'good enough' hash for users. i.e. something better than the common "x + y + z" approach that people often take. It's not intended to be 'optimal' because there's no good definition of what 'optimal' is when it comes to hashing for all users. Note that the approach we're taking here is the one already emitted by the compiler for anonymous types. It exhibits reasonably good behavior while not adding a ton of complexity to the user's code. As time, as more and more users are able to move forward, such can can slowly disappear and be replaced with HashCode.Combine for most people. |
So i worked at it a bit and came up with the following that i think addresses all concerns: Dim hashCode As Long = 2118541809
hashCode = (hashCode * -1521134295 + a.GetHashCode()).GetHashCode()
hashCode = (hashCode * -1521134295 + b.GetHashCode()).GetHashCode()
Return CType(hashCode, Integer) The piece that's interesting is specifically calling |
@CyrusNajmabadi Actually offering to install the package is what I was asking about. Saves me from having to do it. |
If you type HashCode, then if System.HashCode is provided in an MS nuget package, then Roslyn will offer it. |
I want it to generate the nonexistent GetHashCode overload and install the package in the same operation. |
I don't think that's an appropriate choice for most users. Adding dependencies is a very heavyweight operation that users should not be forced into. Users can decide the right time to make those choices, and the IDE will respect it. That's been the approach we've taken with all our features up to now, and it's been a healthy one that people seem to like. |
Note: what nuget package is this api even being included in for us to add a reference to? |
The implementation is in System.Private.CoreLib.dll, so it would come as part of the runtime package. The contract is System.Runtime.dll. |
Ok. If that's the case, then it sounds like a user would get this if/when they move to a more recent Target Framework. That sort of thing is not at all a step i would have the "generate equals+hashcode" do to a user's project. |
Update 6/16/17: Looking for volunteers
The API shape has been finalized. However, we're still deciding on the best hash algorithm out of a list of candidates to use for the implementation, and we need someone to help us measure the throughput/distribution of each algorithm. If you'd like to take that role up, please leave a comment below and @karelz will assign this issue to you.
Update 6/13/17: Proposal accepted!
Here's the API that was approved by @terrajobst at https://github.com/dotnet/corefx/issues/14354#issuecomment-308190321:
The original text of this proposal follows.
Rationale
Generating a good hash code should not require use of ugly magic constants and bit twiddling on our code. It should be less tempting to write a bad-but-concise
GetHashCode
implementation such asProposal
We should add a
HashCode
type to enscapulate hash code creation and avoid forcing devs to get mixed up in the messy details. Here is my proposal, which is based off of https://github.com/dotnet/corefx/issues/14354#issuecomment-305019329, with a few minor revisions.Remarks
See @terrajobst's comment at https://github.com/dotnet/corefx/issues/14354#issuecomment-305019329 for the goals of this API; all of his remarks are valid. I would like to point out these ones in particular, however:
The text was updated successfully, but these errors were encountered: