JRFC 20 - Merkle DAG #20
No! No! No! Your example code is vulnerable to the same trivial hash collision attack as (EDIT: BitTorrent BEP 30 and) the first proposal for Gnutella THEX. You need a mandatory prefix (as in THEX) or suffix (as in Sakura) to distinguish leaves from interior nodes. Sakura allows you to have at most one variable-length byte array and an arbitrary number of node hashes go into each node hash. If you really need more than one variable-length byte array going into a node, then you'll want to prepend prefix varints to each thing you're hashing. Prepend 0 to each node hash and prefix_varint( length + 1 ) to each byte[]. (The +1 is so a zero-length byte[] doesn't collide with a node hash.) Though, I really suggest using Sakura, since it's by the same people who designed AES and SHA-3, and it comes with a proof that it's no weaker than the underlying hash function you use. The only disadvantage is that you can hash at most one variable-length byte[] into each node.
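A rough sketch of the prefix-varint framing described above (my own illustration, not kmag's code; `hash_node` and the specific varint encoding are assumptions, any unambiguous varint works):

```python
from hashlib import sha256

def prefix_varint(n):
    # Minimal LEB128-style varint (assumed encoding).
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def hash_node(child_hashes, byte_arrays):
    # child_hashes: digests of child nodes; byte_arrays: raw data carried by this node.
    h = sha256()
    for child in child_hashes:
        h.update(b'\x00' + child)                      # 0x00 marks "this is a node hash"
    for data in byte_arrays:
        h.update(prefix_varint(len(data) + 1) + data)  # +1 keeps even b'' distinct from a node hash
    return h.digest()

leaf = hash_node([], [b'hello world'])
root = hash_node([leaf], [])
```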
@kmag what are you talking about? pointers? in a merkledag -- the point is not to distinguish between leaf and non-leaf nodes. everything is a node. think about it like git. git does not distinguish leaf and non-leaf nodes. git is used for maintaining integrity. if the collision you mention were likely, people would've already pulled it off in git, which is using (the now weak) sha1, and is used for mission critical operations all over. Also note sha2 so far has no known collisions.
http://lmgtfy.com/?q=sakura+hash+construction

git implements its tree differently than you have. You've implemented a tree where it's trivial to replace any interior binary node with a 64-byte leaf node and get the same hash. This is independent of the underlying hash function. Really, use Sakura, which is probably as strong as the underlying hash function. Joan Daemen designed AES, SHA-3, and Sakura. He knows what he's doing orders of magnitude better than you or I do. Basically, you need an unambiguous header or unambiguous footer. THEX and git use unambiguous headers, Sakura uses an unambiguous footer. Don't be like DIME and try to implement something that needs to be parsed from both ends and ends up accidentally being ambiguous. Once again, just use Sakura.
I'll read http://keccak.noekeon.org/Sakura.pdf tomorrow after some sleep. regardless, thanks for pointing this out.
what's the difference you're thinking of? other than lacking the formatting, a git tree is essentially a sequence of (mode, name, hash) entries, and a blob is just the data. Which is what the example above attempts to replicate (maybe i screwed up and introduced a bug?).

I'm not seeing the construction of two different byte sequences that collide on the hash function... without that being the same as finding a hash collision.
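For reference, a minimal sketch (mine, not from the thread) of what a git tree hash boils down to: a body made of `<mode> <name>\0<20-byte sha1>` entries, wrapped in a `tree <size>\0` header and SHA-1'd:

```python
from hashlib import sha1

def git_tree_hash(entries):
    # entries: (mode, name, child_sha1_20_bytes) tuples, e.g. ('100644', 'readme.md', ...)
    body = b''.join(b'%s %s\x00%s' % (mode.encode(), name.encode(), child)
                    for mode, name, child in entries)
    return sha1(b'tree %d\x00%s' % (len(body), body)).hexdigest()

blob = b'hello\n'
blob_sha = sha1(b'blob %d\x00%s' % (len(blob), blob)).digest()  # blobs get a header too
print(git_tree_hash([('100644', 'readme.md', blob_sha)]))
```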
oh @kmag, the above example should be:

```go
func (d *Directory) Add(name string, h Hashable) {
	d.Entries[name] = h.Hash()
}
```

where ...
PoC exploit code to help you understand the problem: (edited comment to be less inflammatory, sorry)

```python
#!/usr/bin/env python3
# Demonstration of trivial collisions in naive Merkle tree implementations.
from hashlib import sha256
from base64 import urlsafe_b64encode
from base64 import urlsafe_b64decode as b64decode


def b64encode(b):
    return urlsafe_b64encode(b).decode('utf8')


file0 = b'A'
file1 = b'B'
file2 = b'C'
file3 = b'D'
# file4: a 64-byte leaf whose bytes are sha256(file2) || sha256(file3),
# i.e. the raw contents of the interior node [file2, file3].
file4 = b64decode(b'ayPA1fNdGxH5toPwsKYXNV3rESd9ka4JHTmcZVuHlA0_OdXDSOW3nQboQsEU5sxXFYO79E5LDr_aGgHsBXRdQw==')

entity0 = [file0, file1, [file2, file3]]
entity1 = [file0, file1, file4]


def wrong_way_to_do_it(entity):
    '''BitTorrent BEP 30 / original THEX proposal wrong way to implement a Merkle tree'''
    result = sha256()
    if type(entity) == bytes:
        # Hashing a file
        result.update(entity)
    elif type(entity) == list:
        # Hashing a directory
        for e in entity:
            result.update(wrong_way_to_do_it(e))
    else:
        raise TypeError('Unsupported type')
    return result.digest()


def less_wrong(entity):
    '''Almost Joan Daemen's Sakura encoding, but really use Sakura instead.'''
    result = sha256()
    if type(entity) == bytes:
        # Hashing a file
        result.update(entity + b'\x00')
    elif type(entity) == list:
        # Hashing a directory
        for e in entity:
            result.update(less_wrong(e) + b'\x01')
    else:
        raise TypeError('Unsupported type')
    return result.digest()


if __name__ == '__main__':
    wrong0 = wrong_way_to_do_it(entity0)
    wrong1 = wrong_way_to_do_it(entity1)
    print('%s == hash(%s)' % (b64encode(wrong0), str(entity0)))
    print('%s == hash(%s)' % (b64encode(wrong1), str(entity1)))
    if wrong0 != wrong1:
        print('\nCollision avoided.')
    else:
        print('\nDrat! Hash collision ate all my data!')
    if less_wrong(entity0) != less_wrong(entity1):
        print("Use Daemen's Sakura construction instead!")
```
interesting attack, good thing ipfs uses ...
Are you sure it doesn't? Git objects have types not just in the parent tree but also in the object's header itself, so even if a file (blob) has identical contents to a directory (tree) it will still have a different hash. (Fossil doesn't, meanwhile.)
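A quick illustration of that point (my own snippet, not from the thread): git hashes `"<type> <size>\0" + content`, so the same bytes hashed as a blob and as a tree always produce different digests.

```python
from hashlib import sha1

def git_object_hash(obj_type, content):
    # git hashes "<type> <size>\0" followed by the raw content
    return sha1(b'%s %d\x00%s' % (obj_type, len(content), content)).hexdigest()

payload = b'100644 evil\x00' + bytes(20)        # bytes that could pass as a tree body
print(git_object_hash(b'blob', payload))        # hashed as a file
print(git_object_hash(b'tree', payload))        # hashed as a directory: different digest
```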
```python
elif type(entity) == list:
    # Hashing a directory
    for e in entity:
        result.update(wrong_way_to_do_it(e))
```

this is not what I did above. there was a misleading bug in the original example (unedited above) that i corrected in a comment later. To put it in terms of your example, "my" way of doing it is this (note dict, not list):

```python
elif type(entity) == dict:
    # Hashing a directory
    for name, e in entity.items():
        result.update(name)
        result.update(wrong_way_to_do_it(e))
```

This includes the name of the entries. Which is essentially how git does trees, and how I've always done them too.
I.e. the hash function chunks the inputs to build its own merkle trees (NOT any kind of merkle dag, specifically hash trees without extra data in the internal nodes) and then it re-hashes the list of (only!) hashes. Then, we could have something like this:
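(A toy sketch of my own, assuming a hash that chunks big inputs into a bare hash tree, to make the scenario concrete:)

```python
from hashlib import sha256

CHUNK = 1024  # assumed chunk size for the toy hash

def toy_chunked_hash(data):
    # A hash that internally builds a bare hash tree over big inputs,
    # with no leaf/interior distinction.
    if len(data) <= CHUNK:
        return sha256(data).digest()
    chunks = [data[i:i + CHUNK] for i in range(0, len(data), CHUNK)]
    return sha256(b''.join(sha256(c).digest() for c in chunks)).digest()

big_file = b'x' * CHUNK + b'y' * CHUNK
# A small file whose bytes are exactly the list of the big file's chunk hashes:
small_file = sha256(b'x' * CHUNK).digest() + sha256(b'y' * CHUNK).digest()
assert toy_chunked_hash(big_file) == toy_chunked_hash(small_file)  # the "second" pre-image
```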
I have "second" in quotes because this is also sort-of "the same" pre-image, but as far our hash function definition goes, it does count as a second pre-image. Bad times. Note that this IS NOT what's happening in my example above. The example above hashes any file (no matter how big) with the same hash function. (And trees along with their names). (( Of course, depending on the hash function used (i.e. if it does chunking + hashtree for hashing large byte sequences), the problem can still emerge on large files, but this is now hash-function specific. We're discussing generalized constructions. I do agree that merkle-dag implementors should take this feature of their hash functions into account, but you mentioned that the problem did not depend on the hash function...))
I meant, it does not distinguish them in the way the final Tiger spec does: fundamentally different nodes, with different hash functions. The git node distinctions are not ... And, you're totally right that:
👍 :) as mine too.
I'm sorry about getting a bit hot-headed. It's just rather annoying to see lots of proposals (such as BitTorrent's BEP 30) that repeat mistakes of the past. It's scary how often people copy-paste example code as the starting point for production code. Your IPFS looks very interesting. Best of luck, and please take the time to get the small details right. The devil is in the details.

As far as the semantics of your use case not causing problems (your point 5), it's much better not to have the security of your data integrity layer depend on the semantics of the system that's being built on top of it. There are two distinct graphs that have the same hash. This is by definition a collision of the graph hash construction. Maybe the higher level semantics mitigate the impact of such a collision, but it's a bad idea to introduce interdependency between components like that, especially when secure constructions such as Sakura exist.

On a side note, prepending the file name to the hash doesn't fix things by itself.

```python
elif type(entity) == dict:
    # Hashing a directory
    for name, e in entity.items():
        result.update(name)
        result.update(wrong_way_to_do_it(e))
```

still allows trivial collisions. There needs to be a bit more framing here, and ideally the data integrity layer would perform its own secure framing, so whoever was designing the application layer framing wouldn't have to worry about subtleties like this.
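To illustrate that point, here is a sketch of my own (in the style of the PoC above, not kmag's code) of one such collision: a file whose bytes spell out `name || child_hash` collides with the directory containing that entry.

```python
from hashlib import sha256

def dict_way(entity):
    # Names are prepended, but there is still no framing between names,
    # child hashes, and raw file bytes.
    result = sha256()
    if type(entity) == bytes:
        result.update(entity)
    elif type(entity) == dict:
        for name, e in entity.items():
            result.update(name)
            result.update(dict_way(e))
    else:
        raise TypeError('Unsupported type')
    return result.digest()

inner = b'hello'
subdir = {b'b': inner}                        # a real directory: entry "b" -> file
fake_file = b'b' + sha256(inner).digest()     # a file whose bytes spell out that entry
assert dict_way({b'a': subdir}) == dict_way({b'a': fake_file})  # trivial collision
```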
Edited the comment in my example code to point out that BitTorrent BEP 30 and the original THEX proposal both made this mistake. In any case, crypto is very hard. Merkle trees get screwed up by people all the time. I think in fact we should start referring to Sakura-using Merkle trees as Sakura trees to reduce the number of people who keep making naive Merkle tree implementations.
When I hear "Merkle Tree", I think this:
But only this. A more generalized version of a merkle tree, where non-leaf nodes also contain data, is significantly different. Sometimes "Like Git" or "Like a blockchain" or "Like ZFS" works, but sometimes not. (@davidad probably feels the same way too.) And since most of the time I mean a DAG, not a strict Tree, we might as well redefine. In the interest of convenient future referencing, let's state the definition that has emerged through the use of "Merkle Trees" in various systems here. If someone has a better one (or can point me to the formal name of this thing that I didn't know about...) please post it below!
Merkle DAG
A directed acyclic graph whose objects are linked to each other (usually just by their hash), where the `hash(object)` includes all `hash(linked_object)`. This gives the Merkle DAG useful properties:

Example
Sample git `tree` object:

Sample git `blob` named `5716ca5987cbf97d6bb54920bea6adde242d87e6`:
Examples in practice
Example in Code
(Disclaimer: I haven't run this. For illustration only)