Speed up `SipHasher128`. #68914

nnethercote · 2020-02-07T02:24:36Z

The current code in SipHasher128::short_write is inefficient. It uses
u8to64_le (which is complex and slow) to extract just the right number of
bytes of the input into a u64 and pad the result with zeroes. It then
left-shifts that value in order to bitwise-OR it with self.tail.

For example, imagine we have a u32 input 0xIIHH_GGFF and only need three bytes
to fill up self.tail. The current code uses u8to64_le to construct
0x0000_0000_00HH_GGFF, which is just 0xIIHH_GGFF with the 0xII removed and
zero-extended to a u64. The code then left-shifts that value by five bytes --
discarding the 0x00 byte that replaced the 0xII byte! -- to give
0xHHGG_FF00_0000_0000. It then then ORs that value with self.tail.

There's a much simpler way to do it: zero-extend to u64 first, then left shift.
E.g. 0xIIHH_GGFF is zero-extended to 0x0000_0000_IIHH_GGFF, and then
left-shifted to 0xHHGG_FF00_0000_0000. We don't have to take time to exclude
the unneeded 0xII byte, because it just gets shifted out anyway! It also avoids
multiple occurrences of unsafe.

There's a similar story with the setting of self.tail at the method's end.
The current code uses u8to64_le to extract the remaining part of the input,
but the same effect can be achieved more quickly with a right shift on the
zero-extended input.

This commit changes SipHasher128 to use the simpler shift-based approach. The
code is also smaller, which means that short_write is now inlined where
previously it wasn't, which makes things faster again. This gives big
speed-ups for all incremental builds, especially "baseline" incremental
builds.

r? @michaelwoerister

nnethercote · 2020-02-07T02:25:20Z

BTW, I'm planning to make the equivalent change to SipHasher in core, but I will do that as a separate PR.

@bors try @rust-timer queue

rust-timer · 2020-02-07T02:25:22Z

Awaiting bors try build completion

bors · 2020-02-07T02:25:33Z

⌛ Trying commit e606fe7 with merge 05cb825...

@michaelwoerister

Speed up `SipHasher128`. The current code in `SipHasher128::short_write` is inefficient. It uses `u8to64_le` (which is complex and slow) to extract just the right number of bytes of the input into a u64 and pad the result with zeroes. It then left-shifts that value in order to bitwise-OR it with `self.tail`. For example, imagine we have a u32 input 0xIIHH_GGFF and only need three bytes to fill up `self.tail`. The current code uses `u8to64_le` to construct 0x0000_0000_00HH_GGFF, which is just 0xIIHH_GGFF with the 0xII removed and zero-extended to a u64. The code then left-shifts that value by five bytes -- discarding the 0x00 byte that replaced the 0xII byte! -- to give 0xHHGG_FF00_0000_0000. It then then ORs that value with self.tail. There's a much simpler way to do it: zero-extend to u64 first, then left shift. E.g. 0xIIHH_GGFF is zero-extended to 0x0000_0000_IIHH_GGFF, and then left-shifted to 0xHHGG_FF00_0000_0000. We don't have to take time to exclude the unneeded 0xII byte, because it just gets shifted out anyway! It also avoids multiple occurrences of `unsafe`. There's a similar story with the setting of `self.tail` at the method's end. The current code uses `u8to64_le` to extract the remaining part of the input, but the same effect can be achieved more quickly with a right shift on the zero-extended input. This commit changes `SipHasher128` to use the simpler shift-based approach. The code is also smaller, which means that `short_write` is now inlined where previously it wasn't, which makes things faster again. This gives big speed-ups for all incremental builds, especially "baseline" incremental builds. r? @michaelwoerister

nnethercote · 2020-02-07T02:27:56Z

Local results (check builds only) are excellent:

ctfe-stress-4-check
        avg: -4.7%?     min: -13.1%?    max: -0.1%?
clap-rs-check
        avg: -3.0%      min: -9.7%      max: -0.4%
coercions-check
        avg: -3.4%?     min: -5.7%?     max: -0.6%?
tuple-stress-check
        avg: -3.4%      min: -4.9%      max: -1.0%
ucd-check
        avg: -2.6%      min: -4.4%      max: -0.7%
html5ever-check
        avg: -2.0%      min: -3.9%      max: -0.6%
serde-check
        avg: -1.8%      min: -3.7%      max: -0.4%
unicode_normalization-check
        avg: -1.8%      min: -3.5%      max: -0.4%
keccak-check
        avg: -1.5%      min: -3.5%      max: -0.2%
issue-46449-check
        avg: -1.1%      min: -3.3%      max: -0.4%
piston-image-check
        avg: -2.3%      min: -3.1%      max: -0.7%
serde-serde_derive-check
        avg: -1.6%      min: -3.0%      max: -0.6%
await-call-tree-check
        avg: -1.6%      min: -3.0%      max: -0.6%
cranelift-codegen-check
        avg: -2.1%      min: -3.0%      max: -0.7%
deep-vector-check
        avg: -2.4%      min: -3.0%      max: -1.1%
script-servo-check
        avg: -2.3%      min: -2.9%      max: -0.8%
regex-check
        avg: -2.5%      min: -2.9%      max: -0.7%
ripgrep-check
        avg: -2.0%      min: -2.9%      max: -0.6%
webrender-check
        avg: -2.2%      min: -2.9%      max: -0.7%
encoding-check
        avg: -2.2%      min: -2.9%      max: -0.8%
syn-check
        avg: -2.2%      min: -2.8%      max: -0.8%
inflate-check
        avg: -1.2%      min: -2.8%      max: -0.2%
webrender-wrench-check
        avg: -1.6%      min: -2.8%      max: -0.5%
cargo-check
        avg: -1.7%      min: -2.7%      max: -0.5%
unused-warnings-check
        avg: -2.2%      min: -2.6%      max: -1.4%
futures-check
        avg: -1.8%      min: -2.6%      max: -0.4%
helloworld-check
        avg: -1.3%      min: -2.5%      max: -0.6%
style-servo-check
        avg: -2.0%      min: -2.5%      max: -0.7%
tokio-webpush-simple-check
        avg: -1.4%      min: -2.4%      max: -0.4%
regression-31157-check
        avg: -1.5%      min: -2.4%      max: -0.4%
hyper-2-check
        avg: -1.6%      min: -2.2%      max: -0.5%
deeply-nested-check
        avg: -1.1%      min: -2.0%      max: -0.2%
unify-linearly-check
        avg: -1.1%      min: -2.0%      max: -0.3%
packed-simd-check
        avg: -1.2%      min: -1.9%      max: -0.5%
wg-grammar-check
        avg: -1.0%      min: -1.9%      max: -0.1%
wf-projection-stress-65510-che...
        avg: -0.6%      min: -1.6%      max: -0.0%
token-stream-stress-check
        avg: -0.2%      min: -0.3%      max: -0.1%

It's notable that every benchmark except for token-stream-stress-check got at least a 1.6% speedup for one of the runs. It's rare for any speed improvement to have such a wide effect.

src/librustc_data_structures/sip128.rs

bors · 2020-02-07T05:08:46Z

☀️ Try build successful - checks-azure
Build commit: 05cb825 (05cb82590d75ad280719e8d36c7a39653538be09)

rust-timer · 2020-02-07T05:08:47Z

Queued 05cb825 with parent 442ae7f, future comparison URL.

michaelwoerister · 2020-02-07T09:28:19Z

Thanks for the PR, @nnethercote. Looks like a great find! We are doing lots of hashing :)

Did you think about which implications these changes might have on big endian systems? Hashing needs to be stable across platforms for cross-compilation. The changes are probably fine with respect to this but it's something to look out for.

I'll review in detail soon.

nnethercote · 2020-02-07T11:57:05Z

I did think about big-endian. AFAIK, the new code will work fine there. Things are simpler because the code operates mostly on integers, with fewer conversions. But I'm happy to hear a second opinion.

Is there a way to know for sure if I'm right about this? Do you know if the tests cover BE, or if we have access to any BE machines for testing?

michaelwoerister · 2020-02-07T15:07:13Z

We do run tests on some big endian platforms, I think (right, @rust-lang/infra?)

In the past we had problems with symbol name mismatches when compiling some things on little-endian and the rest on big-endian, because the symbol hashes didn't match up. But we now know which kinds of bug reports to look out for after a change like this and testing should be better too now.

michaelwoerister · 2020-02-07T15:07:53Z

https://cfarm.tetaneutral.net/ provides access to some big endian systems.

kennytm · 2020-02-07T16:25:24Z

@michaelwoerister the CI run tests on ARM, x86 and WASM only, so no big-endian platforms.

nnethercote · 2020-02-10T03:28:55Z

@michaelwoerister: After some thought I see that my original code was not correct for big-endian. Fortunately there is a cheap and easy fix.

Consider this 9 byte stream:

write_u32(0xDDCCBBAA)
write_u8(0xEE)
write_u32(0xIIHHGGFF)

On little-endian it is equivalent to write([AA,BB,CC,DD, EE, FF,GG,HH,II]). SipHash parses the input stream as 8-byte little-endian words, so it must process the first 8 bytes of this stream as 0xHHGGFF_EE_DDCCBBAA, and the second 8 bytes would be 0x??????????????_II.

On big-endian it is equivalent to write([DD,CC,BB,AA, EE, II,HH,GG,FF]). SipHash parses the input stream as 8-byte little-endian words, so it must process the first 8 bytes of this stream as 0xGGHHII_EE_AABBCCDD, and the second 8 bytes would be 0x??????????????_FF.

The new short_write works correctly for little-endian, i.e. given the above write_u32/write_u8/write_u32 sequence, the first 8 byte value produced from the stream is 0xHHGGFF_EE_DDCCBBAA, with 0xII leftover.

To make it work for big-endian, we just need to call to_le to do a byte-swap on the integer inputs in in write_* before doing anything else, and then things work out. E.g. it's easy to see that the 8 byte value 0xHHGGFF_EE_DDCCBBAA becomes 0xGGHHII_EE_AABBCCDD (with 0xFF left over)
when the individual integers are byte-swapped, and that's the value we want on big-endian.

I have updated the PR to do this. I haven't tested it on a big-endian machine but I'm fairly confident it's correct. But this stuff is tricky to think about so, again, I'm happy to hear second opinions.

The current code in `SipHasher128::short_write` is inefficient. It uses `u8to64_le` (which is complex and slow) to extract just the right number of bytes of the input into a u64 and pad the result with zeroes. It then left-shifts that value in order to bitwise-OR it with `self.tail`. For example, imagine we have a u32 input 0xIIHH_GGFF and only need three bytes to fill up `self.tail`. The current code uses `u8to64_le` to construct 0x0000_0000_00HH_GGFF, which is just 0xIIHH_GGFF with the 0xII removed and zero-extended to a u64. The code then left-shifts that value by five bytes -- discarding the 0x00 byte that replaced the 0xII byte! -- to give 0xHHGG_FF00_0000_0000. It then then ORs that value with self.tail. There's a much simpler way to do it: zero-extend to u64 first, then left shift. E.g. 0xIIHH_GGFF is zero-extended to 0x0000_0000_IIHH_GGFF, and then left-shifted to 0xHHGG_FF00_0000_0000. We don't have to take time to exclude the unneeded 0xII byte, because it just gets shifted out anyway! It also avoids multiple occurrences of `unsafe`. There's a similar story with the setting of `self.tail` at the method's end. The current code uses `u8to64_le` to extract the remaining part of the input, but the same effect can be achieved more quickly with a right shift on the zero-extended input. All that works on little-endian. It doesn't work for big-endian, but we can just do a `to_le` before calling `short_write` and then it works. This commit changes `SipHasher128` to use the simpler shift-based approach. The code is also smaller, which means that `short_write` is now inlined where previously it wasn't, which makes things faster again. This gives big speed-ups for all incremental builds, especially "baseline" incremental builds.

src/librustc_data_structures/sip128.rs

michaelwoerister · 2020-02-10T14:02:16Z

Note that we already call to_le() on layer above in StableHasher. That's probably sufficient then? It's annoying that we don't have test running on big endian platforms :/

michaelwoerister · 2020-02-10T16:35:47Z

OK, I wrote the following test program that compares hash values before and after this PR:

https://github.com/michaelwoerister/sip-endian/blob/master/main.rs

On a little endian machine everything works as expected. However, when I tried it on a big endian machine (gcc110 from cfarm.tetaneutral.net), I got different values until I removed the to_le() calls from the PR's implementation. Once I did that the values matched those on the little endian machine (and those of the current implementation).

The requirement here is that the same sequence of write_xyz() calls with the same numeric values must produce the same final hash value, independent of endianess. For the current implementation this is achieved by treating everything as byte slices and making sure that all such slices are brought into a platform independent order (by calling to_le() in StableHasher).

However, the implementation in this PR does not operate on byte slices anymore, so there is no need to do the whole byte-swapping dance. The new short_write uses only uses bit operations and those are endian independent.

So the correct fix, in my opinion, is to remove the to_le() calls from both the short_write() invocations in SipHasher128 and from the write_xyz() calls in StableHasher. (Funnily enough current version of this PR probably works too because it swaps the bytes once in StableHasher and then swaps them back again in SipHasher).

michaelwoerister · 2020-02-10T19:21:47Z

Also, we should be able to replace the u8to64_le() with something more straightforward that does just a single memcpy (and let's us get rid of this weird load_int_le macro):

    /// Loads up to 7 bytes from a byte-slice into a u64.
    #[inline]
    fn u8to64_le(buf: &[u8], start: usize, len: usize) -> u64 {
        assert!(len <= 8 && start + len <= buf.len());
        let mut out = 0u64;

        unsafe {
            let out_ptr = &mut out as *mut _ as *mut u8;
            ptr::copy_nonoverlapping(buf.as_ptr().offset(start as isize), out_ptr, len);
        }

        #[cfg(target_endian = "big")]
        {
            // If this is a big endian system we swap bytes, so that the first
            // byte ends up in the lowest order byte, like SipHash expects.
            out = out.swap_bytes();
        }

        out
    }

nnethercote · 2020-02-11T00:19:54Z

I requested access to the GCC farm on Saturday, but I am still waiting for a response.

The requirement here is that the same sequence of write_xyz() calls with the same numeric values must produce the same final hash value, independent of endianess.

Hmm. I was using this code as the basis for my reasoning:

rust/src/libcore/hash/mod.rs

Lines 297 to 326 in e6ec0d1

    
               /// Writes a single `u16` into this hasher. 
        
               #[inline] 
        
               #[stable(feature = "hasher_write", since = "1.3.0")] 
        
               fn write_u16(&mut self, i: u16) { 
        
                   self.write(&i.to_ne_bytes()) 
        
               } 
        
               /// Writes a single `u32` into this hasher. 
        
               #[inline] 
        
               #[stable(feature = "hasher_write", since = "1.3.0")] 
        
               fn write_u32(&mut self, i: u32) { 
        
                   self.write(&i.to_ne_bytes()) 
        
               } 
        
               /// Writes a single `u64` into this hasher. 
        
               #[inline] 
        
               #[stable(feature = "hasher_write", since = "1.3.0")] 
        
               fn write_u64(&mut self, i: u64) { 
        
                   self.write(&i.to_ne_bytes()) 
        
               } 
        
               /// Writes a single `u128` into this hasher. 
        
               #[inline] 
        
               #[stable(feature = "i128", since = "1.26.0")] 
        
               fn write_u128(&mut self, i: u128) { 
        
                   self.write(&i.to_ne_bytes()) 
        
               } 
        
               /// Writes a single `usize` into this hasher. 
        
               #[inline] 
        
               #[stable(feature = "hasher_write", since = "1.3.0")] 
        
               fn write_usize(&mut self, i: usize) { 
        
                   self.write(&i.to_ne_bytes()) 
        
               }

The use of to_ne_bytes shows that, by default, for a given sequence of write_xyz calls, any hasher will give different results on little-endian vs. big-endian. Going back to my example:

write_u32(0xDDCCBBAA)
write_u8(0xEE)
write_u32(0xIIHHGGFF)

On little-endian it is equivalent to write([AA,BB,CC,DD, EE, FF,GG,HH,II]) On big-endian it is equivalent to write([DD,CC,BB,AA, EE, II,HH,GG,FF]). Clearly the results will be different.

I was taking this equivalence to be axiomatic (i.e. required). But it makes sense that StableHasher requires the same results on little-endian and big-endian, therefore it must violate this equivalence. I guess that's ok, so long as it's consistent?

But should SipHasher128 violate this equivalence? Likewise, what about SipHasher in core? I'm not sure. My instinct is that SipHasher128/SipHasher should not violate the equivalence, in which case the endian-independence should be provided by StableHasher -- and it currently does this by using to_le in its write_xyz methods.

On a little endian machine everything works as expected. However, when I tried it on a big endian machine (gcc110 from cfarm.tetaneutral.net), I got different values until I removed the to_le() calls from the PR's implementation. Once I did that the values matched those on the little endian machine (and those of the current implementation).

Thank you for doing this checking. Here's what I was expecting from SipHasher128:

le-old-code == le-new-code
be-old-code == be-new-code
le-old-code != be-old-code
le-new-code != be-new-code

Can you write down the values you got for the four combinations?

For StableHasher, I would expect:

le-old-code == le-new-code == be-old-code == be-new-code

because of the extra to_le in StableHasher::write_xyz.

nnethercote · 2020-02-11T00:23:23Z

In case it helps, here is what I think should happen in the four SipHasher28 cases for the above example.

-----------------------------------------------------------------------------
little-endian
-----------------------------------------------------------------------------
SipHasher128, old code
- write_u32(0xDDCCBBAA)
  - short_write([AA, BB, CC, DD])
  - needed = 8, fill = 4
  - self.tail |= u8to64_le(msg, 0, 4) << 0 --> 0xDDCCBBAA
- write_u8(0xEE)
  - short_write([EE])
  - needed = 4, fill = 1
  - self.tail |= u8to64_le(msg, 0, 1) << 4*8 --> 0xEE_CCDDBBAA
- write_u32(0xIIHHGGFF)
  - short_write([FF, GG, HH, II])
  - needed = 3, fill = 3
  - self.tail |= u8to64_le(msg, 0, 3) << 5*8 --> 0xHHGGFF_EE_CCDDBBAA
  - process
  - self.tail = u8to64_le(msg, 3, 1) --> 0xII

SipHasher128, new code
- write_u32(0xDDCCBBAA)
  - short_write(0x00000000_DDCCBBAA)
  - needed = 8, fill = 4
  - self.tail |= x << 0 --> 0xDDCCBBAA
- write_u8(0xEE)
  - short_write(0x00000000_000000EE)
  - needed = 4, fill = 1
  - self.tail |= x << 4*8 --> 0xEE_CCDDBBAA
- write_u32(0xIIHHGGFF)
  - short_write(0x00000000_IIHHGGFF)
  - needed = 3, fill = 3
  - self.tail |= x << 5*8 --> 0xHHGGFF_EE_CCDDBBAA
  - process
  - self.tail = x >> 3*8 --> 0xII

-----------------------------------------------------------------------------
big-endian
-----------------------------------------------------------------------------
SipHasher128, old code
- write_u32(0xDDCCBBAA)
  - short_write([DD, CC, BB, AA])
  - needed = 8, fill = 4
  - self.tail |= u8to64_le(msg, 0, 4) << 0 --> 0xAABBCCDD
- write_u8(0xEE)
  - short_write([EE])
  - needed = 4, fill = 1
  - self.tail |= u8to64_le(msg, 0, 1) << 4*8 --> 0xEE_AABBCCDD
- write_u32(0xIIHHGGFF)
  - short_write([II, HH, GG, FF])
  - needed = 3, fill = 3
  - self.tail |= u8to64_le(msg, 0, 3) << 5*8 --> 0xGGHHII_EE_AABBCCDD
  - process
  - self.tail = u8to64_le(msg, 3, 1) --> 0xFF

SipHasher128, new code
- write_u32(0xDDCCBBAA)
  - short_write(0x00000000_AABBCCDD)    // was byte-swapped, then zero-extended
  - needed = 8, fill = 4
  - self.tail |= x << 0 --> 0xAABBCCDD
- write_u8(0xEE)
  - short_write(0x00000000_000000EE)    // was zero-extended
  - needed = 4, fill = 1
  - self.tail |= x << 4*8 --> 0xEE_AABBCCDD
- write_u32(0xIIHHGGFF)
  - short_write(0x00000000_FFGGHHII)    // was byte-swapped, then zero-extended
  - needed = 3, fill = 3
  - self.tail |= x << 5*8 --> 0xGGHHII_EE_AABBCCDD
  - process
  - self.tail = x >> 3*8 --> 0xFF

I have confirmed that the two little-endian cases are correct, I haven't been able to confirm the big-endian cases.

nnethercote · 2020-02-11T00:44:01Z

>         #[cfg(target_endian = "big")]
>         {
>             // If this is a big endian system we swap bytes, so that the first
>             // byte ends up in the lowest order byte, like SipHash expects.
>             out = out.swap_bytes();
>         }
>
>         out

This whole snippet can be simplified to out.to_le().

nnethercote · 2020-02-11T01:13:45Z

@michaelwoerister: I have added some debugging eprintln statements to your code:
https://github.com/nnethercote/sip-endian/tree/add-some-printlns

This is the output I get on little-endian:

old early  : 0xddccbbaa, 4
old early  : 0xeeddccbbaa, 5
old process: 0x345678eeddccbbaa, 5
old spill  : 0x12, 1
old: 20f554e44fa4ca9 d68f01a898684a41
new early  : 0xddccbbaa, 4
new early  : 0xeeddccbbaa, 5
new process: 0x345678eeddccbbaa, 5
new spill  : 0x12, 1
new: 20f554e44fa4ca9 d68f01a898684a41

This is the output I expect on big-endian:

old early  : 0xaabbccdd, 4
old early  : 0xeeaabbccdd, 5
old process: 0x563412eeaabbccdd, 5
old spill  : 0x78, 1
old: 20f554e44fa4ca9 d68f01a898684a41
new early  : 0xaabbccdd, 4
new early  : 0xeeaabbccdd, 5
new process: 0x563412eeaabbccdd, 5
new spill  : 0x78, 1
new: <something> <something>

Can you check the big-endian results?

michaelwoerister · 2020-02-11T12:43:45Z

Here is what I get on the big endian machine:

old early  : 0xddccbbaa, 4
old early  : 0xeeddccbbaa, 5
old process: 0x345678eeddccbbaa, 5
old spill  : 0x12, 1
old: 20f554e44fa4ca9 d68f01a898684a41
new early  : 0xddccbbaa, 4
new early  : 0xeeddccbbaa, 5
new process: 0x345678eeddccbbaa, 5
new spill  : 0x12, 1
new: 20f554e44fa4ca9 d68f01a898684a41

michaelwoerister · 2020-02-11T13:45:45Z

So it looks the same as on little-endian. This is what I expected because the code in question operates on integers, not on byte sequences, i.e. the number 0xaabbccdd might have a different memory layout on big-endian, but it still has the same numeric value and print!("{:x}", n) prints the same on all architectures.

The use of to_ne_bytes shows that, by default, for a given sequence of write_xyz calls, any hasher will give different results on little-endian vs. big-endian. Going back to my example:

write_u32(0xDDCCBBAA)
write_u8(0xEE)
write_u32(0xIIHHGGFF)

On little-endian it is equivalent to write([AA,BB,CC,DD, EE, FF,GG,HH,II]) On big-endian it is equivalent to write([DD,CC,BB,AA, EE, II,HH,GG,FF]). Clearly the results will be different.

I see, that is interesting. I didn't know that the libstd implementation worked this way. It's clear that that must give different results depending on endianess. At the same time StableHasher must give the same result on all platforms for that sequence of calls. I think it's fine for SipHasher128 to handle this differently than libstd, as long as we document it. I don't think there's an actual requirement that write_u32() corresponds to hashing any specific sequence of bytes. The only real requirement I see for the generic Hasher is that any sequence of calls deterministically results in the same hash value on the same platform (i.e. the minimum requirements for making it usable with a hashtable). I think the main reason the standard library hashes things in native byte order is performance, not because it's a strict requirement.

So I think our options for SipHasher128 are:

Don't do any endianess conversions on short_write arguments and rely on short_write to be implemented in an endian independent way (which it is as long as it only does bitwise and arithmetic operations).
Make short_write take a byte slice again and then make sure that StableHasher makes things endian independent by always converting to little endian. (~= the current implementation)
Try to make SipHasher128 behave exactly the same way as std::hash::Hasher (i.e. giving different results depending on endianess) while still using integer arguments for short_write and then let StableHasher pre-process the integers in a way that leads to endian independent hash values. (~= the current version of this PR?)

I prefer option (1) as it is just simpler.

This whole snippet can be simplified to out.to_le().

Yeah, I know. I just find to_le() confusing in most contexts. E.g. why does x.to_le().to_le() give me big-endian encoding on a big endian system? I personally prefer to call swap_bytes() which is just more explicit. What I usually really want is to_le_bytes(), that makes a lot more sense to me. Anyway, if you strongly prefer to_le() to my more verbose version, I won't fight you on it. I mostly want to get rid of the weird sequence of if statements in u8to64_le.

nnethercote · 2020-02-11T23:27:14Z

So it looks the same as on little-endian. This is what I expected because the code in question operates on integers, not on byte sequences, i.e. the number 0xaabbccdd might have a different memory layout on big-endian, but it still has the same numeric value and print!("{:x}", n) prints the same on all architectures.

The new code operates on integers, so I agree that the new short_write is endian-independent; that's why I added the to_le calls in the write_xyz methods, so that the results on big-endian would be different to little-endian, as is done for libstd.

But the old code involves byte sequences in short_write_gen/short_write, and so should not be endian-independent. And the fact that StableHasher has the to_le calls makes sense with this theory -- because SipHasher128 is not endian-independent, StableHasher::write_xyz has to byte-swap on big-endian to be endian-independent.

I just looked more closely at your sip-endian code and I now understand why it gives the same results for all four cases: le-old-code, le-new-code, be-old-code, be-new-code. You didn't copy the SipHasher128 implementations exactly -- you added to_le calls to the old write_xyz functions, and and removed them from the new write_xyz functions! So you effectively emulated StableHasher, which is supposed to get the same result on big-endian and little-endian. This shows that the PR as written is correct, yay! (If you undo those changes and re-run, you should get the big-endian outputs I predicted above, starting with old early : 0xaabbccdd, 4.)

I see, that is interesting. I didn't know that the libstd implementation worked this way. It's clear that that must give different results depending on endianess. At the same time StableHasher must give the same result on all platforms for that sequence of calls.

Yes.

I think it's fine for SipHasher128 to handle this differently than libstd, as long as we document it.

I'd prefer it to handle it the same way...

So I think our options for SipHasher128 are:

Don't do any endianess conversions on short_write arguments and rely on short_write to be implemented in an endian independent way (which it is as long as it only does bitwise and arithmetic operations).

Make short_write take a byte slice again and then make sure that StableHasher makes things endian independent by always converting to little endian. (~= the current implementation)

Try to make SipHasher128 behave exactly the same way as std::hash::Hasher (i.e. giving different results depending on endianess) while still using integer arguments for short_write and then let StableHasher pre-process the integers in a way that leads to endian independent hash values. (~= the current version of this PR?)

I prefer option (1) as it is just simpler.

I prefer option (3). My desires are:

I don't want to change SipHasher's current behaviour, which is the libstd behaviour (i.e. different results on big-endian and little-endian), because it's exposed to every Rust program and so changing it seems like a very bad idea.
I want SipHasher and SipHasher128 to be as similar as possible.
- Because the latter was clearly derived from the former and I want that derivation to be obvious.
- Because subtle (i.e. big-endian-only) differences between the two could be confusing.
- Because I want SipHasher to get the same speed-ups that SipHasher128 is getting.

The only way to satisfy all of these is via (3), which the current PR code implements. The downside is extra to_le calls in both StableHasher::write_xyz and SipHasher128::write_xyz, but I think that's reasonable to satisfy the desires above.

Does that sound reasonable?

I just find to_le() confusing in most contexts. E.g. why does x.to_le().to_le() give me big-endian encoding on a big endian system? I personally prefer to call swap_bytes() which is just more explicit.

That's interesting. I find lots of things about little-endian/big-endian confusing, but I don't have trouble with to_le. It's just a no-op on little-endian and a byte-swap on big-endian. So I do prefer the to_le form. I definitely agree that either version is a clear improvement to u8to64_le and that eliminating load_int_le! is a good thing.

nnethercote · 2020-02-11T23:43:15Z

One more tidbit: the unit tests have a test of short_write that check that it has the libstd behaviour, i.e. different results on big-endian vs little-endian. Line 412 is the endian-dependent one:

rust/src/librustc_data_structures/sip128/tests.rs

Lines 401 to 418 in 3f32e30

    
           #[test] 
        
           fn test_write_short_works() { 
        
               let test_usize = 0xd0c0b0a0usize; 
        
               let mut h1 = SipHasher128::new_with_keys(0, 0); 
        
               h1.write_usize(test_usize); 
        
               h1.write(b"bytes"); 
        
               h1.write(b"string"); 
        
               h1.write_u8(0xFFu8); 
        
               h1.write_u8(0x01u8); 
        
               let mut h2 = SipHasher128::new_with_keys(0, 0); 
        
               h2.write(unsafe { 
        
                   slice::from_raw_parts(&test_usize as *const _ as *const u8, mem::size_of::<usize>()) 
        
               }); 
        
               h2.write(b"bytes"); 
        
               h2.write(b"string"); 
        
               h2.write(&[0xFFu8, 0x01u8]); 
        
               assert_eq!(h1.finish128(), h2.finish128()); 
        
           }

So the test confirms that the libstd behaviour is intended.

This makes it faster and also changes it to a safe function. (Thanks to Michael Woerister for the suggestion.) `load_int_le!` is also no longer necessary.

nnethercote · 2020-02-12T00:46:08Z

@michaelwoerister: I have left the previous changes in place, because I think option (3) is the best. I have also added another commit that makes the u8to64_le changes you suggest. I think this is in a good enough state to land, though if you are able to do a full test run on a big-endian machine that would be welcome.

michaelwoerister · 2020-02-12T11:13:48Z

The only way to satisfy all of these is via (3), which the current PR code implements. The downside is extra to_le calls in both StableHasher::write_xyz and SipHasher128::write_xyz, but I think that's reasonable to satisfy the desires above.

Does that sound reasonable?

Yes, I'm OK with that.

That's interesting. I find lots of things about little-endian/big-endian confusing, but I don't have trouble with to_le. It's just a no-op on little-endian and a byte-swap on big-endian.

I think what I find even more confusing about to_le is when it is used as a substitute for converting from little endian to native, i.e:

// We have some bytes that encode the number 1 as a 32 bit integer in LE format, 
let le_bytes = [1, 0, 0, 0];

// Load the bytes into the `u32` verbatim
let x: u32 = *(&le_bytes as &u32);

// On a big endian machine x is now the number 16777216, so we convert 
// to big endian by, obviously, calling `to_le()`
x.to_le()

It does the right thing but it spells the opposite of what it does. What I usually really want, and I'm glad that Rust has it since recently, is from_le_bytes():

// We have some bytes that encode the number 1 as a 32 bit integer in LE format, 
let le_bytes = [1, 0, 0, 0];

// So much nicer!
let x = u32::from_le_bytes(le_bytes);

I would actually prefer implementing u8to64_le in terms of from_le_bytes if that optimizes as well as the current version. But I don't want to block this PR on it. Thanks for looking into things so thoroughly!

@bors r+

bors · 2020-02-12T11:13:49Z

📌 Commit 9aea154 has been approved by michaelwoerister

bors · 2020-02-12T11:14:18Z

💡 This pull request was already approved, no need to approve it again.

There's another pull request that is currently being tested, blocking this pull request: Rollup of 11 pull requests #69088

bors · 2020-02-12T11:14:20Z

📌 Commit 9aea154 has been approved by michaelwoerister

@michaelwoerister

…r=michaelwoerister Speed up `SipHasher128`. The current code in `SipHasher128::short_write` is inefficient. It uses `u8to64_le` (which is complex and slow) to extract just the right number of bytes of the input into a u64 and pad the result with zeroes. It then left-shifts that value in order to bitwise-OR it with `self.tail`. For example, imagine we have a u32 input `0xIIHH_GGFF` and only need three bytes to fill up `self.tail`. The current code uses `u8to64_le` to construct `0x0000_0000_00HH_GGFF`, which is just `0xIIHH_GGFF` with the `0xII` removed and zero-extended to a u64. The code then left-shifts that value by five bytes -- discarding the `0x00` byte that replaced the `0xII` byte! -- to give `0xHHGG_FF00_0000_0000`. It then then ORs that value with `self.tail`. There's a much simpler way to do it: zero-extend to u64 first, then left shift. E.g. `0xIIHH_GGFF` is zero-extended to `0x0000_0000_IIHH_GGFF`, and then left-shifted to `0xHHGG_FF00_0000_0000`. We don't have to take time to exclude the unneeded `0xII` byte, because it just gets shifted out anyway! It also avoids multiple occurrences of `unsafe`. There's a similar story with the setting of `self.tail` at the method's end. The current code uses `u8to64_le` to extract the remaining part of the input, but the same effect can be achieved more quickly with a right shift on the zero-extended input. This commit changes `SipHasher128` to use the simpler shift-based approach. The code is also smaller, which means that `short_write` is now inlined where previously it wasn't, which makes things faster again. This gives big speed-ups for all incremental builds, especially "baseline" incremental builds. r? @michaelwoerister

@ghost

Rollup of 8 pull requests Successful merges: - #67585 (Improve `char::is_ascii_*` codegen) - #68914 (Speed up `SipHasher128`.) - #68994 (rustbuild: include channel in sanitizers installed name) - #69032 (ICE in nightly-2020-02-08: handle TerminatorKind::Yield in librustc_mir::transform::promote_consts::Validator method) - #69034 (parser: Remove `Parser::prev_token_kind`) - #69042 (Remove backtrace header text) - #69059 (Remove a few unused objects) - #69089 (Properly use the darwin archive format on Apple targets) Failed merges: r? @ghost

nnethercote · 2020-02-12T21:44:45Z

I would actually prefer implementing u8to64_le in terms of from_le_bytes

from_le_bytes takes a [u8; 8] argument, so I'm having trouble seeing how you would write u8to64_le with it. I might be overlooking something.

`SipHasher128`'s `u8to64_le` function was simplified in rust-lang#68914. Unfortunately, the new version is slower, because it introduces `memcpy` calls with non-statically-known lengths. This commit reverts the change, and adds an explanatory comment (which is also added to `libcore/hash/sip.rs`). This barely affects `SipHasher128`'s speed because it doesn't use `u8to64_le` much, but it does result in `SipHasher128` once again being consistent with `libcore/hash/sip.rs`.

@michaelwoerister

…lwoerister Revert `u8to64_le` changes from #68914. `SipHasher128`'s `u8to64_le` function was simplified in #68914. Unfortunately, the new version is slower, because it introduces `memcpy` calls with non-statically-known lengths. This commit reverts the change, and adds an explanatory comment (which is also added to `libcore/hash/sip.rs`). This barely affects `SipHasher128`'s speed because it doesn't use `u8to64_le` much, but it does result in `SipHasher128` once again being consistent with `libcore/hash/sip.rs`. r? @michaelwoerister

This commit changes `sip::Hasher` to use the faster `short_write` approach that was used for `SipHasher128` in rust-lang#68914. This has no effect because `sip::Hasher::short_write` is currently unused. See the next commit for more details, and a fix. (One difference with rust-lang#68914 is that this commit doesn't apply the `u8to64_le` change from that PR, because I found it is slower, because it introduces `memcpy` calls with non-statically-known lengths. Therefore, this commit also undoes the `u8to64_le` change in `SipHasher128` for this reason. This doesn't affect `SipHasher128` much because it doesn't use `u8to64_le` much, but I am making the change to keep the two implementations consistent.)

rust-highfive assigned michaelwoerister Feb 7, 2020

rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Feb 7, 2020

Mark-Simulacrum reviewed Feb 7, 2020

View reviewed changes

src/librustc_data_structures/sip128.rs Show resolved Hide resolved

nnethercote force-pushed the speed-up-SipHasher128 branch from e606fe7 to a2da54b Compare February 10, 2020 03:25

nnethercote force-pushed the speed-up-SipHasher128 branch from a2da54b to 7edcdc8 Compare February 10, 2020 03:31

nnethercote force-pushed the speed-up-SipHasher128 branch from 7edcdc8 to f8a0286 Compare February 10, 2020 04:54

tesuji reviewed Feb 10, 2020

View reviewed changes

src/librustc_data_structures/sip128.rs Show resolved Hide resolved

Improve u8to64_le.

9aea154

This makes it faster and also changes it to a safe function. (Thanks to Michael Woerister for the suggestion.) `load_int_le!` is also no longer necessary.

bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Feb 12, 2020

Dylan-DPC-zz mentioned this pull request Feb 12, 2020

Rollup of 8 pull requests #69094

Merged

bors merged commit 9aea154 into rust-lang:master Feb 12, 2020

nnethercote deleted the speed-up-SipHasher128 branch February 12, 2020 21:43

nnethercote mentioned this pull request Feb 14, 2020

Speed up DefaultHasher, SipHasher, and SipHasher13. #69152

Closed

nnethercote mentioned this pull request Feb 20, 2020

Revert u8to64_le changes from #68914. #69332

Merged

tgnottingham mentioned this pull request Sep 30, 2020

Stable hashing: add comments and tests concerning platform-independence #77319

Merged

nnethercote mentioned this pull request Dec 6, 2021

SipHasher takes up lots of time in incremental builds #51054

Closed

Speed up SipHasher128. #68914

Speed up SipHasher128. #68914

Uh oh!

Conversation

nnethercote commented Feb 7, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

nnethercote commented Feb 7, 2020

Uh oh!

rust-timer commented Feb 7, 2020

Uh oh!

bors commented Feb 7, 2020

Uh oh!

nnethercote commented Feb 7, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

bors commented Feb 7, 2020

Uh oh!

rust-timer commented Feb 7, 2020

Uh oh!

michaelwoerister commented Feb 7, 2020

Uh oh!

nnethercote commented Feb 7, 2020

Uh oh!

michaelwoerister commented Feb 7, 2020

Uh oh!

michaelwoerister commented Feb 7, 2020

Uh oh!

kennytm commented Feb 7, 2020

Uh oh!

nnethercote commented Feb 10, 2020

Uh oh!

Uh oh!

michaelwoerister commented Feb 10, 2020

Uh oh!

michaelwoerister commented Feb 10, 2020

Uh oh!

michaelwoerister commented Feb 10, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

nnethercote commented Feb 11, 2020

Uh oh!

nnethercote commented Feb 11, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

nnethercote commented Feb 11, 2020

Uh oh!

nnethercote commented Feb 11, 2020

Uh oh!

michaelwoerister commented Feb 11, 2020

Uh oh!

michaelwoerister commented Feb 11, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

nnethercote commented Feb 11, 2020

Uh oh!

nnethercote commented Feb 11, 2020

Uh oh!

nnethercote commented Feb 12, 2020

Uh oh!

michaelwoerister commented Feb 12, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

bors commented Feb 12, 2020

Uh oh!

bors commented Feb 12, 2020

Uh oh!

bors commented Feb 12, 2020

Uh oh!

nnethercote commented Feb 12, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

8 participants

Speed up `SipHasher128`. #68914

Speed up `SipHasher128`. #68914

nnethercote commented Feb 7, 2020 •

edited

Loading

nnethercote commented Feb 7, 2020 •

edited

Loading

michaelwoerister commented Feb 10, 2020 •

edited

Loading

nnethercote commented Feb 11, 2020 •

edited

Loading

michaelwoerister commented Feb 11, 2020 •

edited

Loading

michaelwoerister commented Feb 12, 2020 •

edited

Loading