
WIP: Hash and Hasher update for faster common case hashing #28044

Closed
wants to merge 2 commits

Conversation

bluss
Member

@bluss bluss commented Aug 27, 2015

Add method fn delimit(&mut self, len: usize) to Hasher.

  • str now emits a delimiter of its own length
  • str and [u8] hash the same, and Wtf8 does too.
  • Hasher::delimiter customizes how a delimiter is handled

The Hasher impl decides how to implement the delimiter. By default it
emits the whole usize as data to the hashing stream.

SipHash will ignore the first delimiter and hash the others as usize.

For the next example, take something like farmhash that is not designed
for streaming hashing. It could be implemented like this:

  • Every call to Hasher::write runs the whole hashing algorithm.
    Previous hash is a seed to the next hash invocation.
  • Delimiters are ignored, since the length of each chunk to write is
    already hashed in.

Below is a sketch of how SipHash and farmhash could work with this change:

When hashing a: &[u8]

  • SipHash: write(a); finish();
  • Farmhash: hash = write(a); hash

Both SipHash and Farmhash will hash just the bytes of the slice in
a single Hasher::write and a single Hasher::finish.

When hashing (a: &[u8], b: &[u8]):

  • SipHash: write(a); write(b.len()); write(b); finish();
  • Farmhash: hash = write(a); hash ^= write(b); hash
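The proposed method can be sketched in code. This is a minimal, hypothetical version: std's Hasher trait cannot actually be extended here, so a local stand-in trait is used, DefaultHasher stands in for SipHash, and the elision of the first delimiter is left out for simplicity. The `delimit` default body is the one the description proposes (emit the whole usize as data).

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

// Hypothetical local stand-in for the proposed addition to std's Hasher.
// Default behavior per the PR: emit the whole usize as data.
trait DelimitHasher: Hasher {
    fn delimit(&mut self, len: usize) {
        self.write_usize(len);
    }
}

impl DelimitHasher for DefaultHasher {}

// Hash a byte slice as the sketch describes: a delimiter carrying the
// chunk's length, then the bytes themselves.
fn hash_delimited(data: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    h.delimit(data.len());
    h.write(data);
    h.finish()
}

fn main() {
    println!("{:x}", hash_delimited(b"abcd"));
}
```

With the delimiter in the stream, ("aa", "a") and ("a", "aa") style inputs produce different byte streams, which is the collision-resistance property discussed below.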

@rust-highfive
Collaborator

r? @alexcrichton

(rust_highfive has picked a reviewer for you, use r? to override)

- str now emits a delimiter of its own length
- str and [u8] hash the same
- Hasher::delimiter customizes how a delimiter is handled

Add method `fn delimit(&mut self, len: usize)` to Hasher.

This method makes the hasher emit a delimiter for a chunk of length
`len`. For example str and slices both emit a delimiter for their length
during hashing.

The Hasher impl decides how to implement the delimiter. By default it
emits the whole `usize` as data to the hashing stream.

SipHash will ignore the first delimiter and hash the others as data.
Since it hashes in the total length, hashing all but one delimiters is
equivalent to hashing all lengths.

For the next example, take something like farmhash that is not designed
for streaming hashing. It could be implemented like this:

- Every call to Hasher::write runs the whole hashing algorithm.
  Previous hash is xored together with the new result.
- Delimiters are ignored, since the length of each chunk to write is
  already hashed in.

Below is a sketch of how SipHash and farmhash could work with this
change:

When hashing a: &[u8]

- SipHash: `write(a); finish();`
- Farmhash: `hash = write(a); hash`

Both SipHash and Farmhash will hash just the bytes of a string in
a single Hasher::write and a single Hasher::finish.

When hashing (a: &[u8], b: &[u8]):

- SipHash: `write(a); write(b.len()); write(b); finish();`
- Farmhash: `hash = write(a); hash ^= write(b); hash`
@bluss
Member Author

bluss commented Aug 27, 2015

The delimiter is emitted so that tuples and other composites don't offer easy ways to get hash collisions. It makes sense that the hashing algorithm should be able to decide how to handle these, especially since the faster hashers may not care about preventing hash flooding. SipHasher will of course continue to care.

  fn delimit(&mut self, len: usize) {
      // skip the first delimiter
      if self.length > 0 {
          self.write_usize(len);
      }
  }
Member Author
I think this works fine with siphash and still defends against hash flooding by tuples or structs. The first string or slice doesn't get its length emitted (to quicken the common case -- only one field is hashed).

("aa", "a") and ("a", "aa") don't collide in this model, but maybe there's something I'm not thinking of?

There's also the option to do something that is outside the data stream so to speak -- we could insert an extra sip compress round here for the self.length > 0 case.

Contributor
What about ( &[1usize,2usize], None::<usize> ) and ( &[1usize], Some(1usize) ) ?

Contributor
I would advise against inserting extra sip compress rounds in any circumstance. Cryptographers today tend to frown on level-crossing things like that; SipHash is designed to achieve its security goals when fed a byte stream, and nobody knows whether it still will if fed an arbitrary mixture of "bytes" and "bonus rounds".

Even if we had a hash function designed for use on a 257-element alphabet, there would be a collision between vec![vec![],vec![]] and vec![vec![vec![]]] : Vec<Vec<Vec<i8>>>.

Member Author
Good points.

Yes, I think (&[x, 1], None) and (&[x], Some(0)) would both hash to the same sequence x, 1, 0 (as, for example, 64-bit integers). That's not good, and it allows arbitrarily many values to produce the same hash, so SipHasher cannot permit this.
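For reference, the candidate collision can be checked against today's std behavior, where every slice length and every enum discriminant is fed to the hasher; the streams differ (roughly 2, 1, 2, 0 versus 1, 1, 1, 1), so the hashes are expected to differ. It is the PR's delimiter elision that would make them coincide.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hash any value with the standard (pre-PR) scheme.
fn hash_of<T: Hash>(value: &T) -> u64 {
    let mut h = DefaultHasher::new();
    value.hash(&mut h);
    h.finish()
}

fn main() {
    // The candidate collision from the discussion, under current std
    // hashing: lengths and discriminants keep the streams distinct.
    let a = (&[1usize, 2usize][..], None::<usize>);
    let b = (&[1usize][..], Some(1usize));
    assert_ne!(hash_of(&a), hash_of(&b));
    println!("distinct");
}
```

(The inequality is probabilistic in principle, but with distinct input streams a collision would be astronomically unlikely.)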

Member Author
Or wait, is that a problem? It's just two elements with the same hash, not arbitrarily many sharing one hash. So the example is not enough for a hash flooding attack.

Contributor
I can make more than two, although it's limited to no more than N + 1 interpretations for an N-byte string. This example achieves N / usize::BYTES:

  #[derive(Hash)]
  enum Foo { One, Two(Box<Foo>) }

  (vec![2usize; n], One)

(My experiments with --pretty expanded tell me that discriminants are one-based, not zero-based).

Member Author
Aha! The actual discriminant of None is 0, but maybe deriving doesn't use that, I didn't know.

Contributor
Actually I made a critical reading comprehension error earlier. The discriminants are 0, both for user-defined types and the builtin Option; see http://is.gd/YjOcxM (playpen).

Member Author
About the One / Two example: since the type Foo is actually variable length (a linked list), must it emit a delimiter of its own?

Member Author
I think it's a good example; it seems to be a flooding attack, even though the value is properly delimited by One on its own.

Remove assert_eq! from the benchmarks. Their role is not to check the
value (and returning the value is enough to black_box it).
@bluss
Member Author

bluss commented Aug 28, 2015

Siphash, hashing the string "abcd":

  • Pre PR: write(b"abcd"); write(b"\xff"); finish(), 54 ns/iter
  • Post PR: write(b"abcd"); finish(), 47 ns/iter. Edit: this result was 40 ns/iter when I managed to get both write and finish inlined into one function.

In the latter case, the write call could be (ugh..) inlined. So that's a demonstration of why this could be a powerful change: just omitting a single byte from the common case of hashing a string. For slices we also omit a second write call, but there it's 8 extra bytes we're not hashing.
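The two call sequences being compared can be written out. This is a sketch only: DefaultHasher stands in for the benchmarked SipHash, and the timings above are the author's, not reproduced here. Historically, str's Hash impl wrote the bytes followed by a 0xff terminator byte.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

// Pre-PR call sequence for a str: the bytes plus a 0xff terminator.
fn hash_str_pre_pr(s: &str) -> u64 {
    let mut h = DefaultHasher::new();
    h.write(s.as_bytes());
    h.write(&[0xff]);
    h.finish()
}

// Post-PR call sequence (sketch): just the bytes; one write, one finish.
fn hash_str_post_pr(s: &str) -> u64 {
    let mut h = DefaultHasher::new();
    h.write(s.as_bytes());
    h.finish()
}

fn main() {
    println!("{:x} vs {:x}", hash_str_pre_pr("abcd"), hash_str_post_pr("abcd"));
}
```

The post-PR path touches one fewer byte and makes one fewer write call, which is where the inlining win comes from.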

@bluss
Member Author

bluss commented Aug 28, 2015

Simple benchmark for hashing 4- and 8-byte strings and slices.

@Gankra
Contributor

Gankra commented Aug 28, 2015

CC @shepmaster (maintainer of XX)

Would also be interested in FNV numbers.

@shepmaster
Member

@bluss would you expect any differences for streaming hashers, like XXHash?

@bluss
Member Author

bluss commented Aug 28, 2015

Farmhash benchmark. Implemented this way, the farmhash hasher's result depends on how the byte stream is chunked. This is not ideal, and not something I propose for libstd.

@shepmaster How str emits bytes (the Hash impl) would change: it would become identical to the byte slice. If XXHash does not change the default delimiter behavior, not much changes. It's an optimization opportunity for short-input hashing, but I understand xxhash is mostly for long inputs(?), so it shouldn't make much difference.

@pczarn
Contributor

pczarn commented Aug 30, 2015

@bluss Your comment still mentions ^=.

Your approach is pretty smart. However, FarmHash with small writes and seeding has some slowdown, because it uses 16 byte hashing to combine the hash result and the seed. Here's Hash128to64, paraphrased:

  uint64_t a = (hash_result ^ seed) * kMul;
  a ^= (a >> 47);
  uint64_t b = (seed ^ a) * kMul;
  b ^= (b >> 47);
  b *= kMul;
  return b;

Source: https://github.com/google/farmhash/blob/master/src/farmhash.h#L129-L138

The hasher can hold a buffer of hash results. FarmHash could be used to hash this buffer's contents once the buffer is full, and again in fn finish.
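That buffering idea could be sketched as follows. The inner mixer is a toy FNV-style stand-in for FarmHash, and the flush-when-full logic is elided; the structure is the point: each write's chunk is hashed on its own, the per-chunk results are collected, and one final hash combines them, avoiding the 128-to-64 seed combination on every write.

```rust
use std::hash::Hasher;

// Buffer per-write hash results; combine them once in `finish`.
struct BufferingHasher {
    buf: Vec<u8>,
}

// Toy one-shot stand-in for FarmHash.
fn toy_oneshot(bytes: &[u8]) -> u64 {
    let mut h = 0xcbf2_9ce4_8422_2325u64;
    for &b in bytes {
        h ^= b as u64;
        h = h.wrapping_mul(0x0100_0000_01b3);
    }
    h
}

impl Hasher for BufferingHasher {
    fn write(&mut self, bytes: &[u8]) {
        // Hash this chunk alone and append the result to the buffer.
        // A real implementation would also flush when the buffer fills.
        self.buf.extend_from_slice(&toy_oneshot(bytes).to_le_bytes());
    }
    fn finish(&self) -> u64 {
        // One final one-shot hash combines all buffered results.
        toy_oneshot(&self.buf)
    }
}

fn main() {
    let mut h = BufferingHasher { buf: Vec::new() };
    h.write(b"abcd");
    h.write(b"efgh");
    println!("{:x}", h.finish());
}
```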

@bluss
Member Author

bluss commented Aug 30, 2015

@pczarn The main target is a speedup where hashing needs only a single write, the rest is a bonus. I consider this change to only be the next best thing after some kind of explicit one-shot hashing, but it seems hard to implement that in a backwards compatible way.

I don't think hashing with a seed is that bad; it's using a function I didn't really profile either, so maybe it can be improved. We can skip the seed on the first invocation of write.

@alexcrichton
Member

I like the idea of catering to algorithms like Farmhash, but this method seems a little odd to add for cases like that. Do you know if other hashing frameworks typically have methods like this?

One thing I'd be worried about is what happens if you call delimit but then don't write any data? I think, however, it'd be fine to just clearly document that it's an erroneous case that's not guaranteed to work.

@bluss
Member Author

bluss commented Sep 16, 2015

@alexcrichton That ought to be a valid operation. Hashing ("", "x", "") will end up doing that. As you can see, without the elision of delimiters, the slice hashing is just the same as before. We could even choose to not elide any delimiters for Siphash at all.

This PR is just experimental, I'd want to do something like gankro's hash_one_shot idea, but I can't find a way to do it that doesn't also change Hash / Hasher backwards incompatibly.

The main thing I'm worried about is that someone will point out how chunking-dependent hashing might totally break hashing of some types (I can imagine a rope behaving that way). Should that stop this proposal, even though the proposal doesn't introduce that itself, but only caters to a farmhash impl that could do so?

@alexcrichton
Member

I was mostly worried about a custom hash implementation of something like struct Foo(usize) where the hash algorithm was hasher.delimit(self.0); in that case all instances would hash to the same thing, right? (At least with SipHash.)

Some opinions I have in this area are:

  • I'm not sure it's necessarily very desirable to have [u8] and str hash the same way; it would perhaps be easy to accidentally rely on this.
  • Hashing in one shot is possible today with a custom hasher, it's just not generally applicable to all types and cases (even using the same API we have right now).
  • The Hasher trait doesn't currently have many strong guarantees about it, for example if you have a block of data and feed it via two writes, there's no documented guarantee that no matter where you split the block you'll get the same hash result. This may be a nice guarantee to have, however.

I personally prefer to do design like this either externally or on issues, but I know others feel differently :)
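The split-independence property mentioned in the last bullet can be observed with today's DefaultHasher, which is defined over the raw byte stream, so splitting a block across writes does not change the result. As the comment notes, this is observed behavior, not a documented guarantee; the sketch below simply demonstrates it.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

// Feed a block of data to the hasher as a sequence of chunks.
fn hash_chunks(chunks: &[&[u8]]) -> u64 {
    let mut h = DefaultHasher::new();
    for chunk in chunks {
        h.write(chunk);
    }
    h.finish()
}

fn main() {
    // Same bytes, different chunking: today's SipHash-based
    // DefaultHasher gives the same result (observed, not guaranteed).
    assert_eq!(
        hash_chunks(&[&b"hello world"[..]]),
        hash_chunks(&[&b"hello "[..], &b"world"[..]])
    );
    println!("split-invariant");
}
```

A one-shot hasher like the farmhash sketch earlier in the thread would fail this check, which is exactly the guarantee question being raised.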

@bluss
Member Author

bluss commented Sep 18, 2015

That would be an invalid implementation of hash. Delimit should be emitted when you have n units of something to write, like str and slice do.

@alexcrichton
Member

Yeah it certainly feels invalid, but it's possible and may end up happening from time to time, which is just something to consider.

@bluss
Member Author

bluss commented Sep 18, 2015

  • Having slices and str hash the same is a request that seemed to be pretty popular (#27108: str and [u8] should hash the same). (With delimiter "elision" we make it into a performance win, not a loss, too.)
  • One of the main targets is of course to be able to use a one shot hasher like farmhash with Rust's hashmap. For this reason I'm looking for this way in.
  • It's a WIP PR so that I could check if the code compiled. Due to Gankro's experiences when changing the hashers (breakage in rustc and all over), I needed a full build.

@bors
Contributor

bors commented Sep 20, 2015

☔ The latest upstream changes (presumably #28532) made this pull request unmergeable. Please resolve the merge conflicts.

@bluss
Member Author

bluss commented Nov 11, 2015

Time to close, sorry for keeping this around.

Anyway, I'd love to do this kind of change, just don't want to keep the PR open indefinitely.

8 participants