Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

hash collisions with tuples #5257

Closed
thestinger opened this issue Mar 6, 2013 · 22 comments
Closed

hash collisions with tuples #5257

thestinger opened this issue Mar 6, 2013 · 22 comments

Comments

@thestinger
Copy link
Contributor

At the moment tuples have an IterBytes implementation that combines the IterBytes implementations of the contained elements. This means that ("aaa", "", "").to_bytes() is equal to ("a", "a", "a").to_bytes(), so their hash collides.

You can take advantage of this to easily find any number of hash collisions quickly:

value: ("aaa", "bbb", "ccc"), hash: 10071128569180783593
value: ("aaab", "bb", "ccc"), hash: 10071128569180783593
value: ("aaabb", "b", "ccc"), hash: 10071128569180783593
value: ("aaabbb", "", "ccc"), hash: 10071128569180783593
value: ("aaabbbc", "", "cc"), hash: 10071128569180783593
value: ("aaabbbcc", "c", ""), hash: 10071128569180783593

I don't really know how to do this properly (implementing Hash on containers).

@brson
Copy link
Contributor

brson commented Mar 7, 2013

That's pretty funny. Probably there are other combinations of types with this problem as well. For strings and vectors we could hash the length along with the bytes.

enums could have similar problems if they don't hash the discriminant.

@brson
Copy link
Contributor

brson commented Mar 7, 2013

@graydon what do you think?

@graydon
Copy link
Contributor

graydon commented Mar 7, 2013

Yeah, maybe redo the trait as just hash-specific and include lengths

@nikomatsakis
Copy link
Contributor

I think there should be a rule: When you implement the trait for a type T, you must emit enough information to discover where your byte stream ends without any outside help except knowledge of your static type. This means that for example strings and vectors must include their length. Tuples do not need to, because the length is implicit in the type. Enums must include their variant id. If everyone follows this rule, everything is fine. Deriving iter bytes will help here, of course, because it will follow this rule implicitly.

@bstrie
Copy link
Contributor

bstrie commented Jun 5, 2013

Nominating for Maturity 5, Production Ready.

@bstrie
Copy link
Contributor

bstrie commented Jun 5, 2013

Er, this is already milestoned. My mistake. :P

@pnkfelix
Copy link
Member

Visiting for bug triage, email 2013-08-05.

It seems like we could put in the change suggested by @brson and @graydon (perhaps just for strings and vectors, or perhaps include enum's discriminants as well), just for Hash. That would be the, mm, most direct way to address this ticket, I think.

But @nikomatsakis has posted a more general principle, I think it was meant for IterBytes as a whole. It sounded rather like what you would need to implement a type-based value serialization mechanism. (Don't we already have one or more serialization traits?) The IterBytes trait is solely documented as being in place to support Hash, which to me means that there is a lot of freedom in how one implements it. Maybe I am mistaken in drawing a connection between serialization and IterBytes.

Anyway, does anyone have feedback on niko's suggestion?

@nikomatsakis
Copy link
Contributor

@pnkfelix I tend to agree that iterbytes and serialization are deeply connected. I originally wanted to remove iterbytes, but in the discussion on #8038, I think we sort of settled on the idea that iter-bytes is basically a specialized serialization for the purposes of hashing, which is usually the same thing but not always. @erickt pointed out that the serialization API includes some higher-level methods for things like maps and so forth that are not particularly well-suited to hashing -- basically that in general-purpose serialization, we might allow more license than we would want for hashing. shrug I guess there is no reason to shoehorn everything into one trait, so long as have deriving modes.

@nikomatsakis
Copy link
Contributor

That said I think iterbytes should nonetheless always ensure that the bytes iterated over are sufficient to reconstruct the value up to the point of Eq comparisons (that is, if two distinct values would nonetheless be considered Eq, then of course they can hash together). (This is, incidentally, a reason to distinguish serialization and hashing: one might want to define a newtyped tuple that is symmetric with respect to equality or whatever)

@bluss
Copy link
Member

bluss commented Aug 15, 2013

IterBytes for &[A] needs to hash in the length, anything else? I can't find any explicit IterBytes impl that would not be fixed by that (str uses vec).

For short strings, it might have an impact. That's 8 bytes more to hash. One way to do it might be to use a terminator instead, like f(self.as_bytes()) && f([0xFF]). There are a number of bytes that can not appear in UTF-8 and not in str.

@bluss
Copy link
Member

bluss commented Aug 15, 2013

here's a patch that hashes the length for vectors, and uses a terminating byte for str.

https://gist.github.com/anonymous/34dedfa9f4b8d32134fd

@nikomatsakis
Copy link
Contributor

That looks about right to me.

bors added a commit that referenced this issue Aug 18, 2013
Address issue #5257, for example these values all had the same hash value:

	("aaa", "bbb", "ccc")
	("aaab", "bb", "ccc")
	("aaabbb", "", "ccc")

IterBytes for &[A] now includes the length, before calling iter_bytes on
each element.

IterBytes for &str is now terminated by a byte that does not appear in
UTF-8. This way only one more byte is processed when hashing strings.
@alexcrichton
Copy link
Member

This was closed by #8545

@pnkfelix
Copy link
Member

(I think that #8545 only addressed the vector and str case, not the more general problem as outlined by Niko. In particular, enums should probably also include a representation of their discriminant.)

@alexcrichton
Copy link
Member

Oh, nevermind then!

@alexcrichton alexcrichton reopened this Aug 19, 2013
@bluss
Copy link
Member

bluss commented Aug 19, 2013

Enums using deriving already hash the discriminant, and all custom implementations of IterBytes I can find in the treedo as well.

@bluss
Copy link
Member

bluss commented Aug 19, 2013

#[deriving(IterBytes)]
enum Test<T> { Zero, One(T), Two(T), }
#[doc = "Automatically derived."]
pub impl <T: ::std::to_bytes::IterBytes> ::std::to_bytes::IterBytes for
         Test<T> {
    fn iter_bytes(&self, __arg_0: ::bool, __arg_1: ::std::to_bytes::Cb) ->
     ::bool {
        match *self {
            Zero => 0u.iter_bytes(__arg_0, |_buf| { __arg_1(_buf) }),
            One(ref __self_0) =>
            1u.iter_bytes(__arg_0, |_buf| { __arg_1(_buf) }) &&
                __self_0.iter_bytes(__arg_0, |_buf| { __arg_1(_buf) }),
            Two(ref __self_0) =>
            2u.iter_bytes(__arg_0, |_buf| { __arg_1(_buf) }) &&
                __self_0.iter_bytes(__arg_0, |_buf| { __arg_1(_buf) })
        }
    }
}

@nikomatsakis
Copy link
Contributor

On Mon, Aug 19, 2013 at 01:44:55AM -0700, Felix S Klock II wrote:

(I think that #8545 only addressed the vector and str case, not the
more general problem as outlined by Niko. In particular, enums
should probably also include a representation of their
discriminant.)

Most of them do? I presume that #[deriving(IterBytes)] does the right
thing in this case, though I haven't checked. I guess we should
inspect the manual implementations to (a) see whether they can be made
automatic and, if not, (b) whether they account for the discriminant.

@nikomatsakis
Copy link
Contributor

On Mon, Aug 19, 2013 at 02:06:03AM -0700, blake2-ppc wrote:

Enums using deriving already hash the discriminant, and all custom implementations of IterBytes I can find in the treedo as well.

Oh, I should have read the full thread before replying. Sounds good to me!

@bluss
Copy link
Member

bluss commented Aug 19, 2013

Abi and AbiSet in libsyntax can have their impl derived, same with enum ObsoleteSyntax

for Ascii in std/str/ascii.rs I don't know. It used to have a test verifying that .to_bytes() produced the same ascii string again, which arguably seems to be a misunderstanding of the ToBytes trait.

I don't think there is anything wrong with the impl for Option<T>, but it looks like it can be derived as well (maybe that doesn't work inside libstd?)

@huonw
Copy link
Member

huonw commented Aug 19, 2013

@blake2-ppc doesn't work inside libstd properly. (I think it can be hacked around to be made possible currently, by adding things to the secret mod std { .. } inner module, but also, eventually, done in a more structured way: #8242.)

@catamorphism
Copy link
Contributor

Fixed by #8545

bors added a commit to rust-lang-ci/rust that referenced this issue May 2, 2020
Resolve false positives of unnecessary_cast for non-decimal integers

This PR resolves false positives of `unnecessary_cast` for hexadecimal integers to floats and adds a corresponding test case.

Fixes: rust-lang#5220

changelog: none
niklasf added a commit to niklasf/unicase that referenced this issue Jun 11, 2022
For example `(UniCase::new("prefix"), UniCase::new("suffix"))` would always
collide with (Unicase::new("pre"), Unicase::new("fixsuffix")).

See also rust-lang/rust#5257.
niklasf added a commit to niklasf/unicase that referenced this issue Jun 11, 2022
For example `(UniCase::new("prefix"), UniCase::new("suffix"))` would always
collide with `(Unicase::new("pre"), Unicase::new("fixsuffix"))`.

See also rust-lang/rust#5257.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

10 participants