Extend the Hasher trait with fn delimit to support one-shot hashing #1666
Thanks for the RFC @pczarn! Could you also expand the RFC with some details about how implementations of Additionally, would it be possible to prototype this change and take a look at some performance benchmarks as well? Those would probably be the most convincing for ensuring that a change like this is merged quickly. |
@alexcrichton This RFC as-is doesn't concern current usage. Only the definition and some implementations of |
I came up with this arthurprs/hash-rs@1a5bfd2 in order to test the outcome of this change. Results: http://i.imgur.com/N5AVi12.png Yes, it's that good. Now that we have sip13 in core we won't even need any other algorithm for hashing the bytes in the specialized form, the good old write + finish is probably enough. |
The motivation talks a lot about one-shot hashing, but then doesn't seem to follow up on how the addition of |
I made an update. I hope the motivation is more clear. @arthurprs You implemented an alternative listed in the RFC. The main goal of this RFC is basically refactoring -- there's nothing to benchmark. Is Sip13 fast enough for all purposes? Good question. @sfackler What do you mean by "consumers of these APIs", specifically? Did you see the update? |
@pczarn right, but it gives a realistic preview of the speedup unlocked by this proposal. |
Thanks for reviving this. We need to discuss some of the drawbacks. You can do this to make it possible to use the FarmHasher for multi-part things:
    self.result ^= farmhash::hash64(msg); // xor is the simplest possible mixer, could be something else
Either with that or without it, a drawback remains: the Hash trait is formulated so that it is used to "stream" bytes to a hasher in slices of bytes. The implicit protocol is that it doesn't matter how you slice your bytes: an equivalent stream of bytes should hash the same way. There's no way for the usual farmhash algorithm to support that, so whichever way you use it as a hasher together with the Hash trait, you break the rules of the Hash trait a bit. |
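A minimal sketch of the drawback described in this comment, under stated assumptions: `one_shot` is a stand-in for a one-shot function such as `farmhash::hash64` (FNV-1a is used only to avoid an external dependency), and `XorMixed` and `hash_chunks` are hypothetical names. It shows that a one-shot function combined with an xor mixer gives a result that depends on how the byte stream is sliced.

```rust
use std::hash::Hasher;

// Stand-in for a one-shot hash function (assumption: FNV-1a replaces
// farmhash::hash64 here so the example needs no external crate).
fn one_shot(msg: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf29ce484222325;
    for &b in msg {
        h ^= u64::from(b);
        h = h.wrapping_mul(0x100000001b3);
    }
    h
}

// Hypothetical hasher that hashes each written chunk in one shot and
// xors the per-chunk results together, as in the comment above.
#[derive(Default)]
struct XorMixed {
    result: u64,
}

impl Hasher for XorMixed {
    fn write(&mut self, msg: &[u8]) {
        self.result ^= one_shot(msg);
    }

    fn finish(&self) -> u64 {
        self.result
    }
}

// Feeds each chunk to the hasher with a separate `write` call.
fn hash_chunks(chunks: &[&str]) -> u64 {
    let mut h = XorMixed::default();
    for c in chunks {
        h.write(c.as_bytes());
    }
    h.finish()
}

fn main() {
    // Same byte stream "abc", sliced two different ways.
    let whole = hash_chunks(&["abc"]);
    let split = hash_chunks(&["ab", "c"]);
    println!("whole = {whole:016x}, split = {split:016x}");
    println!("slicing changes the result: {}", whole != split);
}
```

With xor as the mixer, the order of the chunks is also lost, which is one reason the comment notes the mixer "could be something else".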
@pczarn the libs team talked about this a while ago, but our conclusion was that while the motivation seems sound, we were a little confused as to how this would affect the standard library. Right now this seems to be adding a bit of complexity but then not taking advantage of it? Could you expand a bit on how the standard library (e.g. |
@alexcrichton I agree it's a bit confusing in the current form, that's why I was trying to show another face of the improvement @ #1666 (comment) EDIT: here's a better graph (linear Y axis) http://imgur.com/a/0ruST In that commit/test I modified write_usize to skip over the first delimiter, the same could be done for the write_delimiter and give big improvements on small keys. Of course this applies for all hashers. Oneshot hashers benefit from this even more (as finish involves little to no work), but would still have to mix possible n-part hashes as @bluss said above. |
Yeah we were basically just thinking it'd be good for the RFC itself to encode the performance wins and/or changes to |
Is there a way to attack this problem with specialization? So that we can reach one-shot hashing for precisely the types that support it. |
Exactly. There are no substantial changes in this RFC on its own.
A trait method with a default definition will reach one-shot hashing for the hashers that need it. Using specialization is possible, but does not give any advantage. |
What about specializing a trait for types that are compatible with one-shot hashing. For example |
Yes, the plan is to make such a marker trait for specialization. The code for that trait is in contain-rs/hashmap2#5: https://github.com/pczarn/hashmap2/blob/e8fb6cacde8dcb166b009057afce7588b8d27a77/src/adaptive_map.rs#L194 |
@pczarn unfortunately though the libs team was hesitant about considering for a merge as-is. It wasn't clear why we would do this as it doesn't detail the follow-up work and how it relates to |
Do you plan to pursue this PR further? I think you are onto something here. Hashing a usize for every slice is ridiculous performance-wise. Preventing the write_u8(0xFF) for strings would provide a significant gain too. |
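To make the overhead mentioned here visible, the following sketch records the bytes that the standard `Hash` impls pass to a hasher. `Recorder` and `recorded` are hypothetical helper names. The exact sequence of calls is an implementation detail of the standard library rather than a documented guarantee, so the comments describe behavior as observed at the time of writing.

```rust
use std::hash::{Hash, Hasher};

// Hypothetical hasher that stores every byte it is given instead of mixing it.
#[derive(Default)]
struct Recorder {
    bytes: Vec<u8>,
}

impl Hasher for Recorder {
    fn write(&mut self, bytes: &[u8]) {
        self.bytes.extend_from_slice(bytes);
    }

    // The hash value is irrelevant here; only the recorded input matters.
    fn finish(&self) -> u64 {
        0
    }
}

// Returns all bytes that `value`'s Hash impl feeds to the hasher.
fn recorded<T: Hash + ?Sized>(value: &T) -> Vec<u8> {
    let mut r = Recorder::default();
    value.hash(&mut r);
    r.bytes
}

fn main() {
    // A str is written as its bytes plus a trailing delimiter byte
    // (0xFF at the time of writing).
    println!("str:   {:?}", recorded("ab"));
    // A byte slice is written with a usize length prefix before its
    // contents (at the time of writing).
    println!("slice: {:?}", recorded(&b"ab"[..]));
}
```

For a two-byte key, the delimiter or length prefix is a large share of the hashed input, which is why skipping it matters most for short keys.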
I do like the idea of hash functions offering a special delimiter input, but.. Afaik there are no common hash functions that offer such a delimiter input, and one cannot provide it using a hash function's existing API that hides the state, as doing so allows delimiters to be faked. We discussed hashing recently in #1768 but I can rehash ;) the direction taken there : There was a suggestion that hashing should be built on a writer trait that cannot fail, perhaps a general purpose writer trait with an associated error type set to There is a wider array of hash outputs than simply a In this vein, one might handle delimiters with a trait that provided a |
I'm not sure what you mean here. The proposed function is implemented by default using the already existing writer methods. The other proposals/motivations seem orthogonal even if dealing with the same trait. The objective here is to avoid hashing additional stuff when not needed. |
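A rough sketch of the shape described in this comment: a delimiter method whose default body uses the existing writer methods, so current hashers behave as before while a hasher can opt out. The trait name `Delimit`, the `usize` parameter, and the helpers below are assumptions for illustration, not the signature proposed in the RFC.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

// Hypothetical extension trait; the name and signature are assumptions.
trait Delimit: Hasher {
    // Default: a delimiter is hashed like any other data, through the
    // writer methods that Hasher already has.
    fn delimit(&mut self, len: usize) {
        self.write_usize(len);
    }
}

// Existing hashers pick up the default and keep their current behavior.
impl Delimit for DefaultHasher {}

// Hypothetical hasher that records its input and opts out of delimiters.
#[derive(Default)]
struct PayloadOnly(Vec<u8>);

impl Hasher for PayloadOnly {
    fn write(&mut self, bytes: &[u8]) {
        self.0.extend_from_slice(bytes);
    }

    fn finish(&self) -> u64 {
        0
    }
}

impl Delimit for PayloadOnly {
    // Override: ignore the delimiter entirely.
    fn delimit(&mut self, _len: usize) {}
}

// How a slice-like type might hash itself through the hypothetical method.
fn hash_bytes_delimited<H: Delimit>(h: &mut H, bytes: &[u8]) {
    h.delimit(bytes.len());
    h.write(bytes);
}

fn with_default_delimit(bytes: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    hash_bytes_delimited(&mut h, bytes);
    h.finish()
}

fn with_manual_prefix(bytes: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    h.write_usize(bytes.len());
    h.write(bytes);
    h.finish()
}

fn payload_only(bytes: &[u8]) -> Vec<u8> {
    let mut h = PayloadOnly::default();
    hash_bytes_delimited(&mut h, bytes);
    h.0
}

fn main() {
    // The default method produces the same hash as writing the prefix by hand.
    println!("{}", with_default_delimit(b"abc") == with_manual_prefix(b"abc"));
    // The overriding hasher sees only the payload bytes.
    println!("{:?}", payload_only(b"abc"));
}
```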
If your delimiter is created using a hash function's existing writer method, then an input can be crafted to fake a delimiter, possibly invalidating security assumptions. I'm saying : It's a nice idea to support delimiters that cannot be faked, but since afaik no current hash functions provide them, then one should do it with another separate trait. I'll say that stronger : It's a nice idea to support a delimiter that provably cannot be faked, relative to cryptographic assumptions, even if the hash function itself is not cryptographically secure for speed reasons. You could hash on You can do exactly this with an HMAC construction using two-ish calls to SipHash, so the key protection properties of SipHash give the desired security. Yet, modern cryptographic hash functions like SHA3 have much more resistance to extension attacks, etc. without being used in an HMAC, so maybe it's worth asking say the SipHash authors (DJB, et al.) if a delimiter could be more efficient than an HMAC. I donno hash functions well but conceivably adding a cryptographic delimiter to SipHash could almost double the speed over HMAC constructions. Anyways, if you want a secure delimiter right now, then you need to compose two or three invocations of SipHash into an HMAC, and make another |
We need a hash API to be capable of producing specific results, like say if an existing protocol requires hashing a series of |
@burdges why does this |
Yes, it would be incorrect to use the current |
There need not be any security problem with using the same traits to feed data to both cryptographic and non-cryptographic hash functions. And Go does roughly that if I recall. It's true I'd favor addressing the endianness issue correctly within the existing family of hashing traits, so that they can be used cleanly across the board. Remember, there are still hash tables one writes to disk, which benefit from fixing both the endianness issues, and anything that benefits in memory hash tables. |
Just noticed the current About endianness, I think the easy solution is wrapping the
These |
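The comment above is cut off in the source, so what exactly was to be wrapped is not recoverable. As one possible reading of the endianness concern only, here is a sketch of a wrapper hasher that forces a fixed byte order for integer writes. `LittleEndian`, `Recorder`, and `le_bytes_of` are hypothetical names, and widening `usize` to 64 bits is a further assumption made for cross-platform stability.

```rust
use std::hash::{Hash, Hasher};

// Hypothetical wrapper that feeds integers to the inner hasher in
// little-endian order, regardless of the platform's native order.
struct LittleEndian<H>(H);

impl<H: Hasher> Hasher for LittleEndian<H> {
    fn write(&mut self, bytes: &[u8]) {
        self.0.write(bytes);
    }

    fn finish(&self) -> u64 {
        self.0.finish()
    }

    fn write_u16(&mut self, i: u16) {
        self.0.write(&i.to_le_bytes());
    }

    fn write_u32(&mut self, i: u32) {
        self.0.write(&i.to_le_bytes());
    }

    fn write_u64(&mut self, i: u64) {
        self.0.write(&i.to_le_bytes());
    }

    // Assumption: usize is widened to 64 bits so 32-bit and 64-bit
    // targets feed the same bytes.
    fn write_usize(&mut self, i: usize) {
        self.0.write(&(i as u64).to_le_bytes());
    }
    // Signed and 128-bit integer writes are left at their defaults in
    // this sketch; a complete wrapper would override them as well.
}

// Hypothetical inner hasher that records the bytes it receives.
#[derive(Default)]
struct Recorder {
    bytes: Vec<u8>,
}

impl Hasher for Recorder {
    fn write(&mut self, bytes: &[u8]) {
        self.bytes.extend_from_slice(bytes);
    }

    fn finish(&self) -> u64 {
        0
    }
}

// Returns the bytes the wrapped hasher receives for `value`.
fn le_bytes_of<T: Hash>(value: T) -> Vec<u8> {
    let mut h = LittleEndian(Recorder::default());
    value.hash(&mut h);
    h.0.bytes
}

fn main() {
    // Least significant byte first, on any platform.
    println!("{:?}", le_bytes_of(0x0102_0304u32));
}
```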
I suppose the way one implements the secure delimiters I suggested up thread might be :
I'm rather happy with all this now. I do still wonder if we want some way to pass richer information to |
@pczarn Any update on this? Adding the implications to the existing uses of |
Looking at the various This is an issue, and I think the exact thing this RFC is trying to solve is solved simply by specifying that Furthermore, there is a performance advantage: It would mean that there is no need for "left over" chunks, which aren't complete enough to be committed to the state value yet. These can hurt performance a lot, to the extent that the streaming version of the hash function is up to 2x slower than the static version. |
Very good point. I suppose, if you need more fine grained control over the block filling, then you need a specific hash function too, so you're non-generic enough to build a simple wrapper to handle left overs anyways. |
After rereading the RFC and associated discussion, it is still not clear to me how the addition of a
I'm personally inclined to FCP to close - cc @rust-lang/libs. |
Less rigorous hashers like FnvHasher (or similar) can choose to skip hashing the delimiters at all. That's a marginal but noticeable improvement. |
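A small sketch of the trade-off behind "less rigorous", assuming a byte-at-a-time FNV-1a hasher written out locally (`Fnv1a` and `hash_parts` are hypothetical names, not the `fnv` crate's API). Without delimiters, a streaming hasher sees only the concatenated bytes, so multi-part keys that concatenate to the same stream collide. With a delimiter after each part, they do not.

```rust
use std::hash::Hasher;

// Minimal local FNV-1a, processing one byte at a time.
struct Fnv1a(u64);

impl Hasher for Fnv1a {
    fn write(&mut self, bytes: &[u8]) {
        for &b in bytes {
            self.0 ^= u64::from(b);
            self.0 = self.0.wrapping_mul(0x100000001b3);
        }
    }

    fn finish(&self) -> u64 {
        self.0
    }
}

// Hashes a multi-part key, optionally writing a 0xFF delimiter after each part.
fn hash_parts(parts: &[&str], delimit: bool) -> u64 {
    let mut h = Fnv1a(0xcbf29ce484222325);
    for p in parts {
        h.write(p.as_bytes());
        if delimit {
            h.write_u8(0xff);
        }
    }
    h.finish()
}

fn main() {
    // Without delimiters both keys are the byte stream "abc" and collide.
    let collide = hash_parts(&["ab", "c"], false) == hash_parts(&["a", "bc"], false);
    // With delimiters the two keys produce different streams and hashes.
    let distinct = hash_parts(&["ab", "c"], true) != hash_parts(&["a", "bc"], true);
    println!("collide without delimiters: {collide}");
    println!("distinct with delimiters: {distinct}");
}
```

For single-part keys such as a lone string, nothing is lost by skipping the delimiter, which matches the short-string case discussed in the thread.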
@rfcbot fcp close |
Team member @sfackler has proposed to close this. The next step is review by the rest of the tagged teams: No concerns currently listed. Once these reviewers reach consensus, this will enter its final comment period. If you spot a major issue that hasn't been raised at any point in this process, please speak up! See this document for info about what commands tagged team members can give me. |
I posted benchmarks upthread that show a 5-10% speedup in siphash-backed hashmap benchmarks using short strings (by skipping the first delimiter, so the hasher stays "correct" in the case of multi-part keys). So it's applicable to any hasher while remaining correct. |
The RFC text's motivation is focused entirely on one shot hashing - is that not actually the motivation for this change? The change made in that benchmark seems like it will break third party types that implement |
I agree the rfc motivations (as it's written now) may not sound appealing. The part I'm defending is just a single line under Alternatives in the RFC. I don't think it breaks as the calls to the hasher remain the same, like |
@sfackler I may change the RFC's motivation to focus on speedup, but I'm not convinced by @arthurprs's benchmark. Do you think the speedup will matter in real use cases? Is it important to optimize for siphash when adaptive hashing might work without siphash most of the time? If not, I'll close the RFC. |
Just to be extra clear, I really want to see the proposal of this RFC merged. I was just defending it from a different POV: avoiding hashing stuff we don't have to (thus performance gains). There's of course the POV of allowing non-multi-part-ish types to be really hashed in a single shot. But I'd argue again that the end motivation is performance, as the current trait can be implemented with a non-streaming hash (ex: farmhash) that mixes the intermediate hash results. It's just not as good (hash quality) nor as fast as it could be. |
Ping @sfackler, can you work with @arthurprs and @pczarn to reach a resolution? |
It can be implemented correctly; mixing the intermediate hashes is still correct even if quality-wise it isn't optimal. Fnv, Fx, farmhash, ..., and others all do this. Having
All hashers can probably take advantage of the added |
If we're going to merge this, I think the RFC text needs to be updated to reflect the current thoughts on motivation - the I'm still a bit worried about downstream breakage if we change str's |
The other day I was talking with @michaelwoerister about the "specification" of The requirements we had in mind are:
|
That's how most non-streaming hashers implement it, FNV being the biggest example in crates.io |
ping @BurntSushi, @brson (checkboxes) |
🔔 This is now entering its final comment period, as per the review above. 🔔 |
The final comment period is now complete. |
I'm going to close this RFC for the time being, as per the commentary above. We're definitely still interested in motion in this area! Please ping @sfackler if you'd like to take up the mantle. |
Related work
rust-lang/rust#28044 "WIP: Hash and Hasher update for faster common case hashing"
rust-lang/rust#29139 "Make short string hashing 30% faster by splitting Hash::hash_end from Hash::hash"
contain-rs/hashmap2#5 "Implement adaptive hashing using specialization"