WIP: Hash and Hasher update for faster common case hashing #28044
Conversation
(rust_highfive has picked a reviewer for you, use r? to override)
- str now emits a delimiter of its own length
- str and `[u8]` hash the same
- `Hasher::delimiter` customizes how a delimiter is handled

Add method `fn delimit(&mut self, len: usize)` to Hasher. This method makes the hasher emit a delimiter for a chunk of length `len`. For example, str and slices both emit a delimiter for their length during hashing.

The Hasher impl decides how to implement the delimiter. By default it emits the whole `usize` as data to the hashing stream.

SipHash will ignore the first delimiter and hash the others as data. Since it hashes in the total length, hashing all but one delimiter is equivalent to hashing all lengths.

For the next example, take something like farmhash, which is not designed for streaming hashing. It could be implemented like this:

- Every call to `Hasher::write` runs the whole hashing algorithm. The previous hash is xored together with the new result.
- Delimiters are ignored, since the length of each chunk to write is already hashed in.

What follows is a sketch of how siphash and farmhash could work with this change.

When hashing `a: &[u8]`:

- SipHash: `write(a); finish();`
- Farmhash: `hash = write(a); hash`

Both SipHash and Farmhash will hash just the bytes of a string in a single `Hasher::write` and a single `Hasher::finish`.

When hashing `(a: &[u8], b: &[u8])`:

- SipHash: `write(a); write(b.len()); write(b); finish();`
- Farmhash: `hash = write(a); hash ^= write(b); hash`
The delimiter is emitted so that tuples and other composites don't offer easy ways to get hash collisions. It makes sense that the hashing algorithm should be able to decide how to handle these, especially since the faster hashers may not care about preventing hash flooding. SipHasher will of course continue to care.
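To make the "delimiters are ignored" branch concrete, here is a minimal sketch of a non-streaming-style hasher that overrides the proposed `delimit` to do nothing. The `FnvHasher` type and its constants are a hypothetical stand-in (FNV-1a), not farmhash; `delimit` is written as an inherent method since the real `Hasher` trait has no such method.

```rust
use std::hash::Hasher;

// Hypothetical hasher for illustration: FNV-1a over the byte stream.
struct FnvHasher(u64);

impl FnvHasher {
    fn new() -> Self {
        FnvHasher(0xcbf29ce484222325) // FNV-1a 64-bit offset basis
    }
    // The proposed extension point: the default would emit the whole
    // `usize` as data; this impl ignores delimiters entirely.
    fn delimit(&mut self, _len: usize) {}
}

impl Hasher for FnvHasher {
    fn write(&mut self, bytes: &[u8]) {
        for &b in bytes {
            self.0 ^= b as u64;
            self.0 = self.0.wrapping_mul(0x100000001b3); // FNV prime
        }
    }
    fn finish(&self) -> u64 {
        self.0
    }
}

fn main() {
    // Under the proposal, hashing "abcd" as a str is one `delimit`
    // (ignored here) plus one `write` of the raw bytes, so it hashes
    // identically to the byte slice b"abcd".
    let mut h = FnvHasher::new();
    h.delimit(4);
    h.write(b"abcd");

    let mut h2 = FnvHasher::new();
    h2.write(b"abcd");

    assert_eq!(h.finish(), h2.finish());
}
```

The point of the sketch is only the shape of the API: a hasher that folds chunk lengths in by other means can make `delimit` a no-op.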
```rust
fn delimit(&mut self, len: usize) {
    // skip the first delimiter
    if self.length > 0 {
        self.write_usize(len);
    }
}
```
I think this works fine with siphash and still defends against hash flooding by tuples or structs. The first string or slice doesn't get its length emitted (to quicken the common case, where only one field is hashed). `("aa", "a")` and `("a", "aa")` don't collide in this model, but maybe there's something I'm not thinking of?

There's also the option to do something that is outside the data stream, so to speak: we could insert an extra sip compress round here for the `self.length > 0` case.
What about `(&[1usize, 2usize], None::<usize>)` and `(&[1usize], Some(1usize))`?
I would advise against inserting extra sip compress rounds in any circumstance. Cryptographers today tend to frown on level-crossing things like that; SipHash is designed to achieve its security goals when fed a byte stream, and nobody knows whether it still will if fed an arbitrary mixture of "bytes" and "bonus rounds".
Even if we had a hash function designed for use on a 257-element alphabet, there would be a collision between `vec![vec![], vec![]]` and `vec![vec![vec![]]]` of type `Vec<Vec<Vec<i8>>>`.
Good points.
Yes, I think `(&[x, 1], None)` and `(&[x], Some(0))` would both hash to the same sequence of x, 1, 0 (in, for example, 64-bit integers). That's not good, and it allows arbitrarily many values to produce the same hash, so SipHasher cannot permit this.
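The collision above can be checked by simulating the byte streams directly. The sketch below assumes delimiters and `Option` discriminants are both written as little-endian 64-bit words, with `None` as discriminant 0 and `Some` as discriminant 1 followed by its payload; the helper names are hypothetical.

```rust
// Append the skip-first-delimiter length prefix, if the stream
// already has data.
fn delim(stream: &mut Vec<u8>, len: usize) {
    if !stream.is_empty() {
        stream.extend_from_slice(&(len as u64).to_le_bytes());
    }
}

// Append one word of hashed data.
fn write_usize(stream: &mut Vec<u8>, v: usize) {
    stream.extend_from_slice(&(v as u64).to_le_bytes());
}

fn main() {
    let x = 7usize;

    // Stream for (&[x, 1], None::<usize>)
    let mut a = Vec::new();
    delim(&mut a, 2); // skipped: stream still empty
    write_usize(&mut a, x);
    write_usize(&mut a, 1);
    write_usize(&mut a, 0); // None: discriminant 0, no payload

    // Stream for (&[x], Some(0usize))
    let mut b = Vec::new();
    delim(&mut b, 1); // skipped
    write_usize(&mut b, x);
    write_usize(&mut b, 1); // Some: discriminant 1
    write_usize(&mut b, 0); // payload 0

    // Both streams are x, 1, 0: a collision.
    assert_eq!(a, b);
}
```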
Or wait, is that a problem? It's just two elements with the same hash, not arbitrarily many with one hash. So the example is not enough for a hash flooding attack.
I can make more than two, although it's limited to no more than N + 1 interpretations for an N-byte string. This example achieves N / usize::BYTES:

```rust
#[derive(Hash)]
enum Foo { One, Two(Box<Foo>) }
```

`(vec![2usize; n], One)`

(My experiments with `--pretty expanded` tell me that discriminators are one-based, not zero-based.)
Aha! The actual discriminant of `None` is 0, but maybe deriving doesn't use that; I didn't know.
Actually I made a critical reading comprehension error earlier. The discriminants are 0-based, both for user-defined types and the builtin `Option`; see http://is.gd/YjOcxM (playpen).
About the One / Two example: since the type Foo is actually variable length (a linked list), must it emit a delimiter of its own?
I think it's a good example; it seems to be a flooding attack. The list is properly delimited by One on its own.
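Given the zero-based discriminants established above (One = 0, Two = 1, hashed here as 64-bit words), the flooding pattern can be verified by simulating streams. This sketch uses payload 1 instead of the 2 from the original example, to match Two's discriminant under zero-based numbering; the helper names are hypothetical.

```rust
// Append one word of hashed data, little-endian.
fn write_u64(stream: &mut Vec<u8>, v: u64) {
    stream.extend_from_slice(&v.to_le_bytes());
}

// Stream for a (Vec<u64>, Foo-chain) pair: the vec's delimiter is the
// first and is skipped, then its contents, then the enum's words.
fn stream_for(vec_contents: &[u64], enum_words: &[u64]) -> Vec<u8> {
    let mut s = Vec::new();
    for &v in vec_contents {
        write_u64(&mut s, v);
    }
    for &v in enum_words {
        write_u64(&mut s, v);
    }
    s
}

fn main() {
    let n = 10;

    // (vec![1u64; n], One): n ones, then One's discriminant 0.
    let a = stream_for(&vec![1u64; n], &[0]);

    // (vec![1u64; n - 1], Two(Box::new(One))): n - 1 ones, then
    // Two's discriminant 1, then One's discriminant 0.
    let b = stream_for(&vec![1u64; n - 1], &[1, 0]);

    // Both streams are n ones followed by a zero: a collision, and
    // the pattern scales with n, which is what makes it flooding.
    assert_eq!(a, b);
}
```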
Remove `assert_eq!` from the benchmarks. Their role is not to check the value (and returning the value is enough to `black_box` it).
Siphash, hashing the string "abcd":

In the latter case, the write call [...]
Simple benchmark for hashing 4- and 8-byte strings and slices.
CC @shepmaster (maintainer of XX). Would also be interested in FNV numbers.

@bluss would you expect any differences for streaming hashers, like XXHash?
Farmhash benchmark. Implemented this way, farmhasher's result depends on the way the byte stream is chunked. This is not ideal, and not something I propose for libstd.

@shepmaster How str emits bytes (Hash trait) would change: it would be identical to the byte slice. If xxhasher does not change the default behavior of the delimiter, not much changes. It's an optimization opportunity for short-input hashing, but I understand xxhash is mostly for long input(?), so it shouldn't make much difference.
@bluss Your comment still mentions [...]. Your approach is pretty smart. However, FarmHash with small writes and seeding has some slowdown, because it uses 16-byte hashing to combine the hash result and the seed. Here's the combine step:

```cpp
uint64_t a = (hash_result ^ seed) * kMul;
a ^= (a >> 47);
uint64_t b = (seed ^ a) * kMul;
b ^= (b >> 47);
b *= kMul;
return b;
```

Source: https://github.com/google/farmhash/blob/master/src/farmhash.h#L129-L138

The hasher can hold a buffer of hash results. FarmHash could be used to hash this buffer's contents once the buffer is full, and in [...]
@pczarn The main target is a speedup where hashing needs only a single write; the rest is a bonus. I consider this change to be only the next best thing after some kind of explicit one-shot hashing, but it seems hard to implement that in a backwards compatible way. I don't think hash with seed is that bad; it's using a function I didn't really profile either, maybe it can be improved. We can skip the seed on the first invocation of write.
I like the idea of catering to algorithms like Farmhash, but this method seems a little odd to add for cases like that. Do you know if other hashing frameworks typically have methods like this? One thing I'd be worried about is what happens if you call [...]
@alexcrichton That ought to be a valid operation. Hashing [...]

This PR is just experimental. I'd want to do something like gankro's hash_one_shot idea, but I can't find a way to do it that doesn't also change Hash / Hasher backwards incompatibly. The main thing I'm worried about is that someone will point out how chunking-dependent hashing might totally break hashing of some types (I can imagine a rope behaving that way). Should that stop this proposal, even though the proposal doesn't introduce that itself, and only caters to a farmhash impl that could do so?
I was mostly worried about a custom hash implementation of something like [...]. Some opinions I have in this area are: [...]
I personally prefer to do design like this either externally or on issues, but I know others feel differently :)
That would be an invalid implementation of hash. Delimit should be emitted when you have n units of something to write, like str and slice do.
Yeah, it certainly feels invalid, but it's possible and may end up happening from time to time, which is just something to consider.
☔ The latest upstream changes (presumably #28532) made this pull request unmergeable. Please resolve the merge conflicts.
Time to close; sorry for keeping this around. Anyway, I'd love to do this kind of change, I just don't want to keep the PR open indefinitely.
Add method `fn delimit(&mut self, len: usize)` to Hasher.

- str now emits a delimiter of its own length. `Wtf8` does too.
- `Hasher::delimiter` customizes how a delimiter is handled

The Hasher impl decides how to implement the delimiter. By default it emits the whole `usize` as data to the hashing stream.

SipHash will ignore the first delimiter and hash the others as `usize`.

For the next example, take something like farmhash that is not designed for streaming hashing. It could be implemented like this:

- Every call to `Hasher::write` runs the whole hashing algorithm. The previous hash is a seed to the next hash invocation.
- Delimiters are ignored, since the length of each chunk to write is already hashed in.

What follows is a sketch of how siphash and farmhash could work with this change.

When hashing `a: &[u8]`:

- SipHash: `write(a); finish();`
- Farmhash: `hash = write(a); hash`

Both SipHash and Farmhash will hash just the bytes of the slice in a single `Hasher::write` and a single `Hasher::finish`.

When hashing `(a: &[u8], b: &[u8])`:

- SipHash: `write(a); write(b.len()); write(b); finish();`
- Farmhash: `hash = write(a); hash ^= write(b); hash`