
fix Hash implementation #66

Closed

Conversation

@pascalkuthe pascalkuthe commented Oct 27, 2022

The current Hash implementation of Ropey violates the
std API contract and therefore breaks with most non-default Hashers.

Ropey's current Hash implementation simply hashes all chunks in a loop.
However, this means that the chunk boundaries are encoded into the hash.
From the documentation of Hasher:

This trait provides no guarantees about how the various write_* methods are
defined and implementations of [Hash] should not assume that they work one
way or another. You cannot assume, for example, that a [write_u32] call is
equivalent to four calls of [write_u8]. Nor can you assume that adjacent
write calls are merged, so it's possible, for example, that

# fn foo(hasher: &mut impl std::hash::Hasher) {
hasher.write(&[1, 2]);
hasher.write(&[3, 4, 5, 6]);
# }

and

# fn foo(hasher: &mut impl std::hash::Hasher) {
hasher.write(&[1, 2, 3, 4]);
hasher.write(&[5, 6]);
# }

end up producing different hashes.

Thus to produce the same hash value, [Hash] implementations must ensure
for equivalent items that exactly the same sequence of calls is made -- the
same methods with the same parameters in the same order.

This means that no matter where the chunk boundaries fall, the same sequence of write calls must be made.
The tests did not catch this because the default Hasher does not exploit this freedom.
However, most other Hashers with weaker DoS-resistance guarantees (ahash, fxhash, fnv)
do exploit it, and the default Hasher may also start to exploit it at any point in the future (std would not consider that a breaking change).
When using such a Hasher with Ropey, the required property rope1 == rope2 => hash(rope1) == hash(rope2)
no longer holds. This leads to HashMaps with the same key present multiple times (or failed lookups when the key does exist).
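To make the failure mode concrete, here is a minimal, self-contained sketch (a toy hasher invented for illustration, not Ropey's or any real hasher's code). Like fxhash-style hashers, it processes each individual write call in fixed-size blocks, so the same bytes split across different write calls can produce different hashes:

```rust
use std::hash::Hasher;

/// Toy hasher that pads every `write` call to a multiple of 8 bytes
/// before mixing. This makes the result depend on *where* the write
/// calls are split, which the Hasher contract explicitly permits.
struct BlockHasher {
    state: u64,
}

impl Hasher for BlockHasher {
    fn write(&mut self, bytes: &[u8]) {
        // Process the input of this single call in 8-byte blocks,
        // zero-padding the final partial block.
        for block in bytes.chunks(8) {
            let mut buf = [0u8; 8];
            buf[..block.len()].copy_from_slice(block);
            self.state = (self.state ^ u64::from_le_bytes(buf)).wrapping_mul(0x100000001b3);
        }
    }

    fn finish(&self) -> u64 {
        self.state
    }
}

/// Hash the same bytes, split into chunks at the given boundaries,
/// the way the broken impl did: one `write` call per chunk.
fn hash_chunked(chunks: &[&[u8]]) -> u64 {
    let mut h = BlockHasher { state: 0 };
    for chunk in chunks {
        h.write(chunk);
    }
    h.finish()
}

fn main() {
    // Same logical byte sequence, different chunk boundaries:
    let a = hash_chunked(&[[1u8, 2].as_slice(), [3, 4, 5, 6].as_slice()]);
    let b = hash_chunked(&[[1u8, 2, 3, 4].as_slice(), [5, 6].as_slice()]);
    // With this hasher (and real block-based ones) the results differ.
    assert_ne!(a, b);
}
```

With such a hasher, two equal ropes that happen to have different internal chunk layouts hash differently, which is exactly the HashMap breakage described above.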

The simplest fix is to call write_u8 for each byte of each chunk.
However, this is extremely slow, because Hashers usually heavily optimize batch operations.
For ahash, for example, it is at least 16 times slower, because ahash always operates on 16 bytes at once.

Instead, I have created a Hash implementation that uses a temporary buffer to divide
the bytes of the rope into slices of length MIN_BYTES (except the last slice, which may be shorter) and passes those to the Hasher.
This causes a bit of overhead, as some bytes need to be copied at the chunk boundaries.
However, the overhead should be much smaller than calling write_u8 repeatedly and should only degrade performance slightly.
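The buffering idea can be sketched as follows (an illustrative sketch, not the PR's actual code; BLOCK_SIZE stands in for the MIN_BYTES-sized buffer, and hash_chunks/hash_of are hypothetical helper names). Bytes from arbitrarily-sized chunks are regrouped into fixed-size blocks before being fed to the Hasher, so the write-call sequence no longer depends on chunk boundaries:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

const BLOCK_SIZE: usize = 64; // stand-in for MIN_BYTES; any fixed size works

/// Feed `chunks` to `state` as a boundary-independent sequence of
/// fixed-size `write` calls, buffering across chunk boundaries.
fn hash_chunks<H: Hasher>(state: &mut H, chunks: &[&[u8]]) {
    let mut buf = [0u8; BLOCK_SIZE];
    let mut filled = 0;
    for chunk in chunks {
        let mut rest = *chunk;
        while !rest.is_empty() {
            let n = (BLOCK_SIZE - filled).min(rest.len());
            buf[filled..filled + n].copy_from_slice(&rest[..n]);
            filled += n;
            rest = &rest[n..];
            if filled == BLOCK_SIZE {
                state.write(&buf); // always a full block
                filled = 0;
            }
        }
    }
    if filled > 0 {
        state.write(&buf[..filled]); // final partial block
    }
}

fn hash_of(chunks: &[&[u8]]) -> u64 {
    let mut h = DefaultHasher::new();
    hash_chunks(&mut h, chunks);
    h.finish()
}

fn main() {
    // The same text split at different chunk boundaries hashes identically,
    // because the hasher sees an identical sequence of write calls.
    let a = hash_of(&[b"hello ".as_slice(), b"world".as_slice()]);
    let b = hash_of(&[b"hel".as_slice(), b"lo wor".as_slice(), b"ld".as_slice()]);
    assert_eq!(a, b);
}
```

Only the bytes that straddle a block boundary are copied; everything else is a plain memcpy into the buffer followed by one batched write, which is what keeps the overhead small compared to per-byte writes.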

src/slice.rs Outdated
src/slice.rs Outdated
tests/hash.rs Outdated
@pascalkuthe pascalkuthe force-pushed the fix_hash branch 3 times, most recently from f8000f2 to 74aade1 Compare October 27, 2022 15:59
@cessen cessen left a comment


Thanks so much for catching this! I could have sworn I had read the Hasher documentation carefully, but apparently not.

I would definitely like to get this merged, but I'd prefer a different implementation that doesn't make any assumptions about other parts of Ropey. I've requested changes in-line.

src/rope.rs Outdated
Comment on lines 2055 to 2059
// we can not simply call state.write(chunk) for each chunk
// because Hasher.write only produces the same input if the
// calls match exactly. See documentation of hash and hast_02 test below
// so we have to perform some tricks here to ensure that the same
// calls to write are produce no matter the chunk layout
Owner

Style nit: comments in Ropey should be complete sentences with proper capitalization and punctuation.

(To be fair, I'm pretty sure I've written comments in Ropey that don't follow that, because Ropey has been around for a while and my views on comment style have changed during that time. But new comments should follow that.)

Collaborator Author

I have updated the comments, hope it's OK now.

Comment on lines +28 to +32
[[bin]]
name = "hash"
path = "fuzz_targets/hash.rs"
test = false
doc = false
Owner

These should be property tests, not fuzz tests. In general, I'm trying to keep Ropey's fuzz testing aimed more at things like memory safety and unintended panics. (I also have some general tree-validity checks in there, but that's just because they were convenient to throw in.)

Randomized testing for logical correctness goes in the property tests. See tests/proptest_tests.rs.

Collaborator Author

@pascalkuthe pascalkuthe Nov 4, 2022

I see that this test makes more sense as a proptest. I implemented it as a fuzz test because property tests often don't run long enough/can't use coverage feedback to exhaust all possible code paths (and apart from nicer reduction logic there is really no practical difference between them). I didn't change this yet because you said you would reconsider below. Just let me know if I should turn this into a proptest and I will do so.

src/rope.rs Outdated
Comment on lines 2071 to 2073
// this bounds check always succedes because all chunks are larger then MIN_BYTES
// the only exception is the root leaf in a small file but that would be `RSEnum::Light`
// invariant for all cunks: MIN_BYTES <= chunk.len() <= MAX_BYTES < 3* MIN_BYTES
Owner

I would strongly prefer that the Hash implementation not make any assumptions about chunk size. For things handling the node tree, of course. But it feels odd here, and my gut tells me it could be a foot gun in the future.

Collaborator Author

The comment is actually wrong here. The bounds check always succeeds here because we perform this exact bounds check in the if above. I can turn the manually unrolled loop below into an actual loop to avoid the dependency on MAX_BYTES.

Collaborator Author

I have updated the implementation and the comments; it no longer relies on any internal invariants of the rope.

src/rope.rs Outdated
// calls match exactly. See documentation of hash and hast_02 test below
// so we have to perform some tricks here to ensure that the same
// calls to write are produce no matter the chunk layout
let mut accumulator = [0u8; MIN_BYTES];
Owner

I think it would make more sense for the accumulator buffer to just have its own fixed size, rather than being based on MIN_BYTES. I'm not sure what size would be best, but we can play around with it to see what performs best.

Collaborator Author

I will set up some benchmarks. The problem is that the performance highly depends on the exact Hasher implementation. I think we generally want a multiple of 16 here (which MIN_BYTES is) that is smaller than most chunks, so that the chunk.len() < MIN_BYTES - accumulator_pos branch can get good branch prediction. So MIN_BYTES might just be a sweet spot, but I will look into benchmarking. Another reason MIN_BYTES is attractive here is that it is tied to the chunk size, so if that ever changes we won't have to re-determine the ideal buffer size.

Collaborator Author

@pascalkuthe pascalkuthe Nov 4, 2022

I have added a separate constant for BUFFER_SIZE. From your comment below it sounds like you will change the implementation anyway, so I didn't invest the time to benchmark it and just set it to MIN_BYTES for now.

@pascalkuthe

pascalkuthe commented Nov 1, 2022

Thanks for the review. I won't have time to get into this too much today, but I will try to address your comments and perform some benchmarks later this week.

@cessen

cessen commented Nov 1, 2022

Yeah, no rush! I've actually come down with Covid (I tried to be pretty careful at the conference, masking up and everything, but nothing's 100%). So my brain isn't at its best at the moment anyway, and I'll likely be getting to things a bit slowly for the next week-ish as I recover.

@cessen

cessen commented Nov 1, 2022

Thinking about it more, I actually have a somewhat specific implementation in mind for this, because I've implemented code like this many, many times before (ironically, internally to hashers, to avoid exactly this issue). But it's not reasonable to expect you to hit a narrow target like that via progressive comments in a code review.

So I think what probably makes the most sense is to accept this PR, and then I can do my own implementation afterwards if I have the energy.

However, I would still like the fuzz tests to be moved to property tests, since the issue this addresses is a lot closer to a logic error than a safety issue. (I'll revisit this when my head is on straight again.)

@pascalkuthe pascalkuthe force-pushed the fix_hash branch 2 times, most recently from 87eb745 to 1dbb7c6 Compare November 4, 2022 18:42
@cessen

cessen commented Nov 6, 2022

So, I think what I'm going to do (if it's okay with you) is actually just redo this PR myself, but nabbing some of your unit tests. I was initially hesitant to do that because I didn't want to steal credit away from your work (which is very appreciated!). But I'll just make sure to credit you in the commit message(s). Moreover, I definitely plan to merge your other PRs once they pass review, of which #70 in particular is quite substantial.

Anyway, my rationale behind just re-doing it is:

  • I'll basically be reimplementing the Hash impl anyway.
  • I don't think we actually need any fuzzing/property-esque tests for this particular bug. Just normal unit tests are fine, to ensure we don't regress in the future.
  • Those tests can be improved by adding some private/doc-hidden APIs to Ropey to allow building a Rope with specific chunks, which I think will be useful for testing other things as well. That way we can directly construct the test cases we're looking for, rather than trying to indirectly coax Ropey into hopefully constructing things with the chunk boundaries we want, which isn't reliable into the future anyway.

So I'll just be taking care of that all at once. That okay with you @pascalkuthe?

@pascalkuthe

> So, I think what I'm going to do (if it's okay with you) is actually just redo this PR myself, but nabbing some of your unit tests. […] So I'll just be taking care of that all at once. That okay with you @pascalkuthe?

Sure, go ahead. It would be nice if you put some kind of credit into a commit, but I mostly just care that the bug gets fixed in a performant manner. Please also test with either ahash or fxhash, as the performance differences are much more visible there than with SipHash.

@archseer

archseer commented Nov 6, 2022

> But I'll just make sure to credit you in the commit message(s).

If you put a Co-authored-by: Name Here <email@org> line in the commit message, GitHub will correctly attribute the commit to both users :)

@cessen

cessen commented Nov 7, 2022

Done in fef5be9.

I also added benchmarks with the default hasher, fnv, and fxhash at various text sizes. Unsurprisingly (to me, at least), fnv is the slowest. People think of fnv as a "fast" hash, but it works per-byte which actually makes it pretty slow unless you're using extremely small input sizes.

To test overhead, I also implemented a bogus version that just passes the first chunk to the hasher and then stops. For the benches with very small inputs, this ends up being (incidentally) correct, but with as low overhead as possible. Compared to that, the impl I've ended up with has about 15% performance overhead on tiny inputs, for all the tested hashes, which doesn't seem too bad to me. That overhead should be notably less on larger inputs due to amortization, though I haven't tested yet to confirm.

In any case, it seems at least reasonably efficient.

@cessen cessen closed this Nov 7, 2022
@pascalkuthe

> Done in fef5be9. I also added benchmarks with the default hasher, fnv, and fxhash at various text sizes. […]

Awesome, thanks for implementing a fix so quickly! Yeah, fnv is not a great hasher. Fxhash is generally faster than SipHash, though (ahash is more robust against hash collisions at the cost of being ever so slightly slower than fxhash when not compiling with the AES CPU feature, so I generally use that). The performance numbers sound good. How did you end up with a 256-byte buffer size? I would have expected larger buffers to be faster, or did that not make a difference when put to the test?

@cessen

cessen commented Nov 7, 2022

> how did you end up with 256 buffer size?

Just trial and error, testing power-of-two sizes. I actually want to revisit that when I can bench on my main machine. I'm currently isolating in my bedroom, so I just have my tiny, slow laptop to use. That means there's also a lot of CPU frequency scaling going on, which makes benchmarking a bit tricky (it's also why I've only tested powers of two so far; more fine-grained changes were too noisy on this machine). So once I can benchmark on my desktop machine, I might tweak things more based on those results.

> I would have expected larger buffers to be fast or did that not make a difference when put to the test?

Of the sizes I tested, 256 was the fastest. 512 was about on par, but slightly slower. 1024 and larger sharply fell off in performance, and 128 and lower fell off somewhat less sharply.

The approach I implemented can skip data copying (just passing slices directly from the chunks) under certain conditions, one of which is that the chunk has to be larger-than-or-equal-to the block size. So making the block size smaller actually makes the zero-copy code path more likely. My suspicion is that it's that vs amortizing the loop overhead that led to this result. 256 is large enough to amortize the loop-and-test overhead fairly well, but small enough to frequently take the zero-copy code path. That's my guess, at least. More benchmarking/testing is still needed to narrow in on the best performance, I think.

That does mean that there's likely some relationship between the block size and average chunk size with respect to performance. But it's not clear if it's a linear relationship, so leaving it independent still seems best to me.
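The zero-copy fast path described above might look roughly like this (an illustrative sketch, not the actual code from fef5be9; BLOCK_SIZE, hash_chunks, and hash_of are hypothetical names). Whenever the internal buffer is empty and the remaining part of a chunk holds at least one full block, slices are passed to the hasher directly instead of being copied:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

const BLOCK_SIZE: usize = 256;

/// Feed chunks to the hasher in BLOCK_SIZE-sized writes, copying only
/// at chunk boundaries. Chunks at least as large as BLOCK_SIZE take
/// the zero-copy fast path.
fn hash_chunks<H: Hasher>(state: &mut H, chunks: &[&[u8]]) {
    let mut buf = [0u8; BLOCK_SIZE];
    let mut filled = 0;
    for chunk in chunks {
        let mut rest = *chunk;
        // Top up a partially filled buffer first.
        if filled > 0 {
            let n = (BLOCK_SIZE - filled).min(rest.len());
            buf[filled..filled + n].copy_from_slice(&rest[..n]);
            filled += n;
            rest = &rest[n..];
            if filled == BLOCK_SIZE {
                state.write(&buf);
                filled = 0;
            }
        }
        // Fast path: buffer is empty, so full blocks can be passed
        // straight from the chunk without copying.
        while filled == 0 && rest.len() >= BLOCK_SIZE {
            state.write(&rest[..BLOCK_SIZE]);
            rest = &rest[BLOCK_SIZE..];
        }
        // Stash the remainder for the next chunk.
        buf[filled..filled + rest.len()].copy_from_slice(rest);
        filled += rest.len();
    }
    if filled > 0 {
        state.write(&buf[..filled]);
    }
}

fn hash_of(chunks: &[&[u8]]) -> u64 {
    let mut h = DefaultHasher::new();
    hash_chunks(&mut h, chunks);
    h.finish()
}

fn main() {
    // One big chunk vs. the same bytes split at uneven boundaries
    // produce identical write sequences, hence identical hashes.
    let big: Vec<u8> = (0..1000).map(|i| i as u8).collect();
    let split: Vec<&[u8]> = big.chunks(300).collect();
    assert_eq!(hash_of(&[big.as_slice()]), hash_of(&split));
}
```

This also illustrates the trade-off mentioned above: a smaller BLOCK_SIZE makes the zero-copy while loop fire more often, while a larger one amortizes the per-write loop overhead better.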
