
WIP: Hash and Hasher update for faster common case hashing #28044

Closed
wants to merge 2 commits

Conversation

bluss
Member

@bluss bluss commented Aug 27, 2015

Add method fn delimit(&mut self, len: usize) to Hasher.

  • str now emits a delimiter of its own length
  • str and [u8] hash the same, and Wtf8 does too.
  • Hasher::delimiter customizes how a delimiter is handled

The Hasher impl decides how to implement the delimiter. By default it
emits the whole usize as data to the hashing stream.

SipHash will ignore the first delimiter and hash the others as usize.

For the next example, take something like farmhash that is not designed
for streaming hashing. It could be implemented like this:

  • Every call to Hasher::write runs the whole hashing algorithm.
    Previous hash is a seed to the next hash invocation.
  • Delimiters are ignored, since the length of each chunk to write is
    already hashed in.

Below is a sketch of how SipHash and farmhash could work with this change:

When hashing a: &[u8]

  • SipHash: write(a); finish();
  • Farmhash: hash = write(a); hash

Both SipHash and Farmhash will hash just the bytes of the slice in
a single Hasher::write and a single Hasher::finish.

When hashing (a: &[u8], b: &[u8]):

  • SipHash: write(a); write(b.len()); write(b); finish();
  • Farmhash: hash = write(a); hash ^= write(b); hash
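The proposed method can be sketched in code. This is a minimal, hypothetical version: std's Hasher trait cannot actually be extended here, so a local stand-in trait is used, DefaultHasher stands in for SipHash, and the elision of the first delimiter is left out for simplicity. The `delimit` default body is the one the description proposes (emit the whole usize as data).

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

// Hypothetical local stand-in for the proposed addition to std's Hasher.
// Default behavior per the PR: emit the whole usize as data.
trait DelimitHasher: Hasher {
    fn delimit(&mut self, len: usize) {
        self.write_usize(len);
    }
}

impl DelimitHasher for DefaultHasher {}

// Hash a byte slice as the sketch describes: a delimiter carrying the
// chunk's length, then the bytes themselves.
fn hash_delimited(data: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    h.delimit(data.len());
    h.write(data);
    h.finish()
}

fn main() {
    println!("{:x}", hash_delimited(b"abcd"));
}
```

With the delimiter in the stream, ("aa", "a") and ("a", "aa") style inputs produce different byte streams, which is the collision-resistance property discussed below.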

@rust-highfive
Collaborator

r? @alexcrichton

(rust_highfive has picked a reviewer for you, use r? to override)

- str now emits a delimiter of its own length
- str and [u8] hash the same
- Hasher::delimiter customizes how a delimiter is handled

Add method `fn delimit(&mut self, len: usize)` to Hasher.

This method makes the hasher emit a delimiter for a chunk of length
`len`. For example str and slices both emit a delimiter for their length
during hashing.

The Hasher impl decides how to implement the delimiter. By default it
emits the whole `usize` as data to the hashing stream.

SipHash will ignore the first delimiter and hash the others as data.
Since it hashes in the total length, hashing all but one delimiters is
equivalent to hashing all lengths.

For the next example, take something like farmhash that is not designed
for streaming hashing. It could be implemented like this:

- Every call to Hasher::write runs the whole hashing algorithm.
  Previous hash is xored together with the new result.
- Delimiters are ignored, since the length of each chunk to write is
  already hashed in.

Below is a sketch of how SipHash and farmhash could work with this
change:

When hashing a: &[u8]

- SipHash: `write(a); finish();`
- Farmhash: `hash = write(a); hash`

Both SipHash and Farmhash will hash just the bytes of a string in
a single Hasher::write and a single Hasher::finish.

When hashing (a: &[u8], b: &[u8]):

- SipHash: `write(a); write(b.len()); write(b); finish();`
- Farmhash: `hash = write(a); hash ^= write(b); hash`
@bluss
Member Author

bluss commented Aug 27, 2015

The delimiter is emitted so that tuples and other composites don't offer easy ways to get hash collisions. It makes sense that the hashing algorithm should be able to decide how to handle these, especially since the faster hashers may not care about preventing hash flooding. SipHasher will of course continue to care.

  fn delimit(&mut self, len: usize) {
      // skip the first delimiter
      if self.length > 0 {
          self.write_usize(len);
      }
  }
Member Author
I think this works fine with siphash and still defends against hash flooding by tuples or structs. The first string or slice doesn't get its length emitted (to quicken the common case -- only one field is hashed).

("aa", "a") and ("a", "aa") don't collide in this model, but maybe there's something I'm not thinking of?

There's also the option to do something that is outside the data stream so to speak -- we could insert an extra sip compress round here for the self.length > 0 case.

Contributor
What about ( &[1usize,2usize], None::<usize> ) and ( &[1usize], Some(1usize) ) ?

Contributor
I would advise against inserting extra sip compress rounds in any circumstance. Cryptographers today tend to frown on level-crossing things like that; SipHash is designed to achieve its security goals when fed a byte stream, and nobody knows whether it still will if fed an arbitrary mixture of "bytes" and "bonus rounds".

Even if we had a hash function designed for use on a 257-element alphabet, there would be a collision between vec![vec![],vec![]] and vec![vec![vec![]]] : Vec<Vec<Vec<i8>>>.

Member Author
Good points.

Yes, I think (&[x, 1], None) and (&[x], Some(0)) would both hash to the same sequence x, 1, 0 (as, for example, 64-bit integers). That's not good, and it allows arbitrarily many values to produce the same hash, so SipHasher cannot permit this.
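For reference, the candidate collision can be checked against today's std behavior, where every slice length and every enum discriminant is fed to the hasher; the streams differ (roughly 2, 1, 2, 0 versus 1, 1, 1, 1), so the hashes are expected to differ. It is the PR's delimiter elision that would make them coincide.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hash any value with the standard (pre-PR) scheme.
fn hash_of<T: Hash>(value: &T) -> u64 {
    let mut h = DefaultHasher::new();
    value.hash(&mut h);
    h.finish()
}

fn main() {
    // The candidate collision from the discussion, under current std
    // hashing: lengths and discriminants keep the streams distinct.
    let a = (&[1usize, 2usize][..], None::<usize>);
    let b = (&[1usize][..], Some(1usize));
    assert_ne!(hash_of(&a), hash_of(&b));
    println!("distinct");
}
```

(The inequality is probabilistic in principle, but with distinct input streams a collision would be astronomically unlikely.)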

Member Author
Or wait, is that a problem? It's just two elements with the same hash, not arbitrarily many sharing one hash. So the example is not enough for a hash flooding attack.

Contributor
I can make more than two, although it's limited to no more than N + 1 interpretations for an N-byte string. This example achieves N / usize::BYTES:

  #[derive(Hash)]
  enum Foo { One, Two(Box<Foo>) }

  (vec![2usize; n], One)

(My experiments with --pretty expanded tell me that discriminants are one-based, not zero-based).

Member Author
Aha! The actual discriminant of None is 0, but maybe deriving doesn't use that, I didn't know.

Contributor
Actually I made a critical reading comprehension error earlier. The discriminants are 0, both for user-defined types and the builtin Option; see http://is.gd/YjOcxM (playpen).

Member Author
About the One / Two example: since the type Foo is actually variable length (a linked list), must it emit a delimiter of its own?

Member Author
I think it's a good example; it seems to be a flooding attack, even though the value is properly delimited by One on its own.

Remove assert_eq! from the benchmarks. Their role is not to check the
value (and returning the value is enough to black_box it).
@bluss
Member Author

bluss commented Aug 28, 2015

Siphash, hashing the string "abcd":

  • Pre PR: write(b"abcd"); write(b"\xff"); finish(), 54 ns/iter
  • Post PR: write(b"abcd"); finish(), 47 ns/iter. Edit: this result was 40 ns/iter when I managed to get both write and finish inlined into one function.

In the latter case, the write call could be (ugh..) inlined. So that's a demonstration of why this could be a powerful change: just omitting a single byte from the common case of hashing a string. For slices we also omit a second write call, but there it's 8 extra bytes we're not hashing.
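The two call sequences being compared can be written out. This is a sketch only: DefaultHasher stands in for the benchmarked SipHash, and the timings above are the author's, not reproduced here. Historically, str's Hash impl wrote the bytes followed by a 0xff terminator byte.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

// Pre-PR call sequence for a str: the bytes plus a 0xff terminator.
fn hash_str_pre_pr(s: &str) -> u64 {
    let mut h = DefaultHasher::new();
    h.write(s.as_bytes());
    h.write(&[0xff]);
    h.finish()
}

// Post-PR call sequence (sketch): just the bytes; one write, one finish.
fn hash_str_post_pr(s: &str) -> u64 {
    let mut h = DefaultHasher::new();
    h.write(s.as_bytes());
    h.finish()
}

fn main() {
    println!("{:x} vs {:x}", hash_str_pre_pr("abcd"), hash_str_post_pr("abcd"));
}
```

The post-PR path touches one fewer byte and makes one fewer write call, which is where the inlining win comes from.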

@bluss
Member Author

bluss commented Aug 28, 2015

Simple benchmark for hashing 4- and 8-byte strings and slices.

@Gankra
Contributor

Gankra commented Aug 28, 2015

CC @shepmaster (maintainer of XX)

Would also be interested in FNV numbers.

@shepmaster
Member

@bluss would you expect any differences for streaming hashers, like XXHash?

@bluss
Member Author

bluss commented Aug 28, 2015

Farmhash benchmark. Implemented this way, the farmhash hasher's result depends on how the byte stream is chunked. This is not ideal, and not something I propose for libstd.

@shepmaster How str emits bytes (the Hash impl) would change: it would become identical to the byte slice. If XXHash does not change the default delimiter behavior, not much changes. It's an optimization opportunity for short-input hashing, but I understand xxhash is mostly for long inputs(?), so it shouldn't make much difference.

@pczarn
Contributor

pczarn commented Aug 30, 2015

@bluss Your comment still mentions ^=.

Your approach is pretty smart. However, FarmHash with small writes and seeding has some slowdown, because it uses 16 byte hashing to combine the hash result and the seed. Here's Hash128to64, paraphrased:

  uint64_t a = (hash_result ^ seed) * kMul;
  a ^= (a >> 47);
  uint64_t b = (seed ^ a) * kMul;
  b ^= (b >> 47);
  b *= kMul;
  return b;

Source: https://github.com/google/farmhash/blob/master/src/farmhash.h#L129-L138

The hasher can hold a buffer of hash results. FarmHash could be used to hash this buffer's contents once the buffer is full, and again in fn finish.
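That buffering idea could be sketched as follows. The inner mixer is a toy FNV-style stand-in for FarmHash, and the flush-when-full logic is elided; the structure is the point: each write's chunk is hashed on its own, the per-chunk results are collected, and one final hash combines them, avoiding the 128-to-64 seed combination on every write.

```rust
use std::hash::Hasher;

// Buffer per-write hash results; combine them once in `finish`.
struct BufferingHasher {
    buf: Vec<u8>,
}

// Toy one-shot stand-in for FarmHash.
fn toy_oneshot(bytes: &[u8]) -> u64 {
    let mut h = 0xcbf2_9ce4_8422_2325u64;
    for &b in bytes {
        h ^= b as u64;
        h = h.wrapping_mul(0x0100_0000_01b3);
    }
    h
}

impl Hasher for BufferingHasher {
    fn write(&mut self, bytes: &[u8]) {
        // Hash this chunk alone and append the result to the buffer.
        // A real implementation would also flush when the buffer fills.
        self.buf.extend_from_slice(&toy_oneshot(bytes).to_le_bytes());
    }
    fn finish(&self) -> u64 {
        // One final one-shot hash combines all buffered results.
        toy_oneshot(&self.buf)
    }
}

fn main() {
    let mut h = BufferingHasher { buf: Vec::new() };
    h.write(b"abcd");
    h.write(b"efgh");
    println!("{:x}", h.finish());
}
```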

@bluss
Member Author

bluss commented Aug 30, 2015

@pczarn The main target is a speedup where hashing needs only a single write, the rest is a bonus. I consider this change to only be the next best thing after some kind of explicit one-shot hashing, but it seems hard to implement that in a backwards compatible way.

I don't think hashing with a seed is that bad; it's using a function I didn't really profile either, so maybe it can be improved. We can skip the seed on the first invocation of write.

@alexcrichton
Member

I like the idea of catering to algorithms like Farmhash, but this method seems a little odd to add for cases like that. Do you know if other hashing frameworks typically have methods like this?

One thing I'd be worried about is what happens if you call delimit but then don't write any data? I think, however, it'd be fine to just clearly document that it's an erroneous case that's not guaranteed to work.

@bluss
Member Author

bluss commented Sep 16, 2015

@alexcrichton That ought to be a valid operation. Hashing ("", "x", "") will end up doing that. As you can see, without the elision of delimiters, the slice hashing is just the same as before. We could even choose to not elide any delimiters for Siphash at all.

This PR is just experimental, I'd want to do something like gankro's hash_one_shot idea, but I can't find a way to do it that doesn't also change Hash / Hasher backwards incompatibly.

The main thing I'm worried about is that someone will point out how chunking-dependent hashing might totally break hashing of some types (I can imagine a rope behaving that way). Should that stop this proposal, even though the proposal doesn't introduce that itself, but only caters to a farmhash impl that could do so?

@alexcrichton
Member

I was mostly worried about a custom hash implementation of something like struct Foo(usize) where the hash algorithm was hasher.delimit(self.0); in that case all instances would hash to the same thing, right? (At least with SipHash.)

Some opinions I have in this area are:

  • I'm not sure it's necessarily very desirable to have [u8] and str hash the same way; it would perhaps be easy to accidentally rely on this.
  • Hashing in one shot is possible today with a custom hasher, it's just not generally applicable to all types and cases (even using the same API we have right now).
  • The Hasher trait doesn't currently have many strong guarantees about it, for example if you have a block of data and feed it via two writes, there's no documented guarantee that no matter where you split the block you'll get the same hash result. This may be a nice guarantee to have, however.

I personally prefer to do design like this either externally or on issues, but I know others feel differently :)
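The split-independence property mentioned in the last bullet can be observed with today's DefaultHasher, which is defined over the raw byte stream, so splitting a block across writes does not change the result. As the comment notes, this is observed behavior, not a documented guarantee; the sketch below simply demonstrates it.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

// Feed a block of data to the hasher as a sequence of chunks.
fn hash_chunks(chunks: &[&[u8]]) -> u64 {
    let mut h = DefaultHasher::new();
    for chunk in chunks {
        h.write(chunk);
    }
    h.finish()
}

fn main() {
    // Same bytes, different chunking: today's SipHash-based
    // DefaultHasher gives the same result (observed, not guaranteed).
    assert_eq!(
        hash_chunks(&[&b"hello world"[..]]),
        hash_chunks(&[&b"hello "[..], &b"world"[..]])
    );
    println!("split-invariant");
}
```

A one-shot hasher like the farmhash sketch earlier in the thread would fail this check, which is exactly the guarantee question being raised.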

@bluss
Member Author

bluss commented Sep 18, 2015

That would be an invalid implementation of hash. Delimit should be emitted when you have n units of something to write, like str and slice do.

@alexcrichton
Member

Yeah it certainly feels invalid, but it's possible and may end up happening from time to time, which is just something to consider.

@bluss
Member Author

bluss commented Sep 18, 2015

  • Having slices and str hash the same is a request that seemed to be pretty popular (#27108: str and [u8] should hash the same). (With delimiter "elision" we make it into a performance win, not a loss, too.)
  • One of the main targets is of course to be able to use a one shot hasher like farmhash with Rust's hashmap. For this reason I'm looking for this way in.
  • It's a WIP PR so that I could check if the code compiled. Due to Gankro's experiences when changing the hashers (breakage in rustc and all over), I needed a full build.

@bors
Contributor

bors commented Sep 20, 2015

☔ The latest upstream changes (presumably #28532) made this pull request unmergeable. Please resolve the merge conflicts.

@bluss
Member Author

bluss commented Nov 11, 2015

Time to close, sorry for keeping this around.

Anyway, I'd love to do this kind of change, just don't want to keep the PR open indefinitely.

8 participants