Make digestof and hash() return USize instead of U64 and add hash64() #2615
At first sight, this makes total sense. As far as I know, this change means that hashing and identity equality behave differently on 64-bit and 32-bit machines. Does this introduce a problem for having these two kinds of systems speak to each other in a future distributed Pony? Or even for exchanging serialized data?
Serialisation already produces different results on different data layouts because the size of pointers and numeric types is different. The communication of 32-bit and 64-bit systems in distributed Pony is a larger problem that will have to be resolved eventually (not necessarily in the initial implementation of distributed Pony). Of course, the simplest way to resolve it would be to state that only systems with the same data layout can communicate natively, and that different data layouts require the programmer to handle the communication manually.
I think we discussed this idea on a previous sync call, and decided not to do it. I'm going to try to find some record of that decision and bring it back to this thread.
Here's the record of that decision. Using the date on which it occurred, you can go back into the archives and listen to the call if you want to hear more. Basically, the rationale is that 32-bit hashing is a lot more likely to have collisions. Here's an illustration of the problem, taken from this blog post:
In data structures like hash tables, a collision just means a performance loss. But in situations where you're using hashes as unique IDs, it can be much more problematic. I personally am working on a CRDT-based application where hashed values are used as replica IDs, and coordination cannot be done to verify uniqueness across the set of replicas because it must be coordination-free by design. In situations like this, hash collisions will compromise the correctness of the algorithm, and I don't feel comfortable using 32-bit hashes for this.
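To put rough numbers on the collision argument, here is a minimal Python sketch based on the standard birthday-bound approximation. It assumes hash values are uniformly distributed, which real hash functions only approximate, and the function names `collision_probability` and `items_for_probability` are made up for this illustration; they are not part of Pony or of this PR.

```python
import math


def collision_probability(num_items: int, hash_bits: int) -> float:
    """Approximate chance that at least two of num_items hashes collide.

    Assumes each hash is drawn uniformly from 2**hash_bits values
    (birthday bound): p ~= 1 - exp(-n * (n - 1) / (2 * N)).
    """
    space = 2.0 ** hash_bits
    # expm1 keeps precision when the probability is very small.
    return -math.expm1(-num_items * (num_items - 1) / (2.0 * space))


def items_for_probability(p: float, hash_bits: int) -> float:
    """Approximate item count at which the collision chance reaches p."""
    space = 2.0 ** hash_bits
    return math.sqrt(2.0 * space * math.log(1.0 / (1.0 - p)))


if __name__ == "__main__":
    for bits in (32, 64):
        n_half = items_for_probability(0.5, bits)
        print(f"{bits}-bit: ~{n_half:,.0f} items for a 50% collision chance")
        print(f"{bits}-bit: p(collision) with 100,000 items = "
              f"{collision_probability(100_000, bits):.3g}")
```

Under this approximation, a 32-bit hash reaches an even chance of at least one collision at roughly 77,000 items, while a 64-bit hash needs roughly 5 billion, which is why the width matters far more for unique IDs than for hash tables.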
Discussed on the sync call. We discussed the possibility of having two hash functions: @Praetonus then followed up with a question about whether the low-collision hash should be 128-bit instead of 64-bit. I'll have to think a bit more about this, but it sounds like it might be a good solution as well.
Discussed again during sync. We agreed on the second hash function being 64-bit. I'll update the PR.
@jemc Besides
@Praetonus - yes, I believe so.
I've updated the PR with the discussed changes. I've also added manual changelog entries, so I've removed the changelog label from the PR.
Default hash values will now match the platform machine word width for performance. `hash64()` can be used if a low collision rate is needed.
Having hash values be 64 bits wide on both 32-bit and 64-bit platforms was a bit odd. With USize, the width will match the machine word width on every platform.