hash collisions with tuples #5257
Comments
That's pretty funny. Probably there are other combinations of types with this problem as well. For strings and vectors we could hash the length along with the bytes. enums could have similar problems if they don't hash the discriminant. |
@graydon what do you think? |
Yeah, maybe redo the trait as just hash-specific and include lengths |
I think there should be a rule: When you implement the trait for a type T, you must emit enough information to discover where your byte stream ends without any outside help except knowledge of your static type. This means that for example strings and vectors must include their length. Tuples do not need to, because the length is implicit in the type. Enums must include their variant id. If everyone follows this rule, everything is fine. Deriving iter bytes will help here, of course, because it will follow this rule implicitly. |
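The rule above can be sketched in present-day Rust. This is only an illustration: `EmitBytes`, `bytes_of`, and `Shape` are hypothetical names standing in for the 2013-era IterBytes machinery, and the 8-byte little-endian length prefix and one-byte variant id are assumptions, not what the compiler actually emitted.

```rust
// Hypothetical trait standing in for IterBytes: append a self-delimiting
// encoding of `self` to `out`.
trait EmitBytes {
    fn emit(&self, out: &mut Vec<u8>);
}

// Strings have a dynamic length, so the length goes into the stream.
impl<'a> EmitBytes for &'a str {
    fn emit(&self, out: &mut Vec<u8>) {
        out.extend_from_slice(&(self.len() as u64).to_le_bytes());
        out.extend_from_slice(self.as_bytes());
    }
}

// Tuples need no length: the arity is part of the static type.
impl<A: EmitBytes, B: EmitBytes> EmitBytes for (A, B) {
    fn emit(&self, out: &mut Vec<u8>) {
        self.0.emit(out);
        self.1.emit(out);
    }
}

// Hypothetical enum: each variant emits its id before its payload.
enum Shape {
    Circle(u32),
    Square(u32),
}

impl EmitBytes for Shape {
    fn emit(&self, out: &mut Vec<u8>) {
        match self {
            Shape::Circle(r) => {
                out.push(0); // variant id
                out.extend_from_slice(&r.to_le_bytes());
            }
            Shape::Square(s) => {
                out.push(1); // variant id
                out.extend_from_slice(&s.to_le_bytes());
            }
        }
    }
}

fn bytes_of<T: EmitBytes>(value: &T) -> Vec<u8> {
    let mut out = Vec::new();
    value.emit(&mut out);
    out
}

fn main() {
    // Same concatenated characters, different element boundaries.
    assert_ne!(bytes_of(&("aaa", "")), bytes_of(&("a", "aa")));
    // Same payload, different variant.
    assert_ne!(bytes_of(&Shape::Circle(1)), bytes_of(&Shape::Square(1)));
}
```

Because every dynamically sized piece announces where it ends, a reader who knows only the static type could split the stream back into its elements, which is what rules out the boundary-shifting collisions.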
Nominating for Maturity 5, Production Ready. |
Er, this is already milestoned. My mistake. :P |
Visiting for bug triage, email 2013-08-05. It seems like we could put in the change suggested by @brson and @graydon (perhaps just for strings and vectors, or perhaps include enum's discriminants as well), just for Hash. That would be the, mm, most direct way to address this ticket, I think. But @nikomatsakis has posted a more general principle, I think it was meant for Anyway, does anyone have feedback on niko's suggestion? |
@pnkfelix I tend to agree that iterbytes and serialization are deeply connected. I originally wanted to remove iterbytes, but in the discussion on #8038, I think we sort of settled on the idea that iter-bytes is basically a specialized serialization for the purposes of hashing, which is usually the same thing but not always. @erickt pointed out that the serialization API includes some higher-level methods for things like maps and so forth that are not particularly well-suited to hashing -- basically that in general-purpose serialization, we might allow more license than we would want for hashing. *shrug* I guess there is no reason to shoehorn everything into one trait, so long as we have deriving modes.
That said I think iterbytes should nonetheless always ensure that the bytes iterated over are sufficient to reconstruct the value up to the point of Eq comparisons (that is, if two distinct values would nonetheless be considered Eq, then of course they can hash together). (This is, incidentally, a reason to distinguish serialization and hashing: one might want to define a newtyped tuple that is symmetric with respect to equality or whatever) |
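To make the "symmetric newtyped tuple" case concrete, here is a minimal sketch in present-day Rust using `std::hash`. `Unordered`, `normalized`, and `hash_of` are hypothetical names introduced for illustration. The point is that two values that compare equal must feed the same bytes to the hasher, even though a general-purpose serializer would want to keep them distinct.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hypothetical newtype: an unordered pair, so Unordered(a, b) == Unordered(b, a).
struct Unordered(u32, u32);

impl Unordered {
    // Canonical form: smaller element first.
    fn normalized(&self) -> (u32, u32) {
        (self.0.min(self.1), self.0.max(self.1))
    }
}

impl PartialEq for Unordered {
    fn eq(&self, other: &Self) -> bool {
        self.normalized() == other.normalized()
    }
}

impl Eq for Unordered {}

impl Hash for Unordered {
    // Hash the canonical form so that equal values hash equally.
    fn hash<H: Hasher>(&self, state: &mut H) {
        self.normalized().hash(state);
    }
}

fn hash_of<T: Hash>(value: &T) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}

fn main() {
    assert!(Unordered(1, 2) == Unordered(2, 1));
    assert_eq!(hash_of(&Unordered(1, 2)), hash_of(&Unordered(2, 1)));
}
```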
For short strings, it might have an impact. That's 8 bytes more to hash. One way to do it might be to use a terminator instead, like |
here's a patch that hashes the length for vectors, and uses a terminating byte for str. |
That looks about right to me. |
Address issue #5257, for example these values all had the same hash value:

("aaa", "bbb", "ccc")
("aaab", "bb", "ccc")
("aaabbb", "", "ccc")

IterBytes for &[A] now includes the length, before calling iter_bytes on each element.

IterBytes for &str is now terminated by a byte that does not appear in UTF-8. This way only one more byte is processed when hashing strings.
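The two changes described in this commit message can be modelled roughly as follows. This is a sketch in present-day Rust, not the patch itself: the function names are hypothetical, the choice of 0xFF as the terminator is an assumption (it is one of the byte values that never occurs in valid UTF-8), and so is the 8-byte little-endian length prefix.

```rust
// Assumed terminator: 0xFF cannot appear anywhere in valid UTF-8.
const STR_TERMINATOR: u8 = 0xFF;

// Strings: raw bytes followed by a single terminating byte.
fn str_bytes(s: &str, out: &mut Vec<u8>) {
    out.extend_from_slice(s.as_bytes());
    out.push(STR_TERMINATOR);
}

// Slices: the length first, then each element.
fn slice_bytes(v: &[u32], out: &mut Vec<u8>) {
    out.extend_from_slice(&(v.len() as u64).to_le_bytes());
    for x in v {
        out.extend_from_slice(&x.to_le_bytes());
    }
}

// Byte stream for a tuple of three strings.
fn tuple3_bytes(t: (&str, &str, &str)) -> Vec<u8> {
    let mut out = Vec::new();
    str_bytes(t.0, &mut out);
    str_bytes(t.1, &mut out);
    str_bytes(t.2, &mut out);
    out
}

// Byte stream for a pair of slices.
fn slice_pair_bytes(a: &[u32], b: &[u32]) -> Vec<u8> {
    let mut out = Vec::new();
    slice_bytes(a, &mut out);
    slice_bytes(b, &mut out);
    out
}

fn main() {
    let x = tuple3_bytes(("aaa", "bbb", "ccc"));
    let y = tuple3_bytes(("aaab", "bb", "ccc"));
    let z = tuple3_bytes(("aaabbb", "", "ccc"));
    // The three tuples from the commit message now differ byte for byte.
    assert!(x != y && y != z && x != z);
    // Moving an element across the slice boundary also changes the stream.
    assert_ne!(slice_pair_bytes(&[1, 2], &[]), slice_pair_bytes(&[1], &[2]));
}
```

The terminator costs one extra byte per string, against eight for a 64-bit length, which matches the concern raised earlier in the thread about short strings.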
This was closed by #8545 |
(I think that #8545 only addressed the vector and str case, not the more general problem as outlined by Niko. In particular, enums should probably also include a representation of their discriminant.) |
Oh, nevermind then! |
Enums using deriving already hash the discriminant, and all custom implementations of IterBytes I can find in the tree do as well.
On Mon, Aug 19, 2013 at 01:44:55AM -0700, Felix S Klock II wrote:
Most of them do? I presume that #[deriving(IterBytes)] does the right |
On Mon, Aug 19, 2013 at 02:06:03AM -0700, blake2-ppc wrote:
Oh, I should have read the full thread before replying. Sounds good to me! |
for Ascii in std/str/ascii.rs I don't know. It used to have a test verifying that I don't think there is anything wrong with the impl for |
@blake2-ppc doesn't work inside libstd properly. (I think it can be hacked around to be made possible currently, by adding things to the secret |
Fixed by #8545 |
For example `(UniCase::new("prefix"), UniCase::new("suffix"))` would always collide with `(Unicase::new("pre"), Unicase::new("fixsuffix"))`. See also rust-lang/rust#5257.
At the moment tuples have an `IterBytes` implementation that combines the `IterBytes` implementations of the contained elements. This means that `("aaa", "", "").to_bytes()` is equal to `("a", "a", "a").to_bytes()`, so their hash collides.

You can take advantage of this to easily find any number of hash collisions quickly:

I don't really know how to do this properly (implementing `Hash` on containers).
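The collision described above can be reproduced with a small model in present-day Rust. `naive_bytes` is a hypothetical stand-in for the old tuple `to_bytes()` behaviour, under the assumption that each string contributed only its raw bytes with no length or terminator.

```rust
// Hypothetical model of the old behaviour: concatenate the raw bytes of
// each element, with nothing marking where one element ends.
fn naive_bytes(t: (&str, &str, &str)) -> Vec<u8> {
    let mut out = Vec::new();
    out.extend_from_slice(t.0.as_bytes());
    out.extend_from_slice(t.1.as_bytes());
    out.extend_from_slice(t.2.as_bytes());
    out
}

fn main() {
    let a = naive_bytes(("aaa", "", ""));
    let b = naive_bytes(("a", "a", "a"));
    // Identical byte streams give identical input to any hash function,
    // so these two tuples collide regardless of the hash used.
    assert_eq!(a, b);
}
```

Any other way of splitting the same characters across the three positions yields the same stream, which is why collisions are so easy to generate.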