Hash conversion for import/export and long term archival #1953
Comments
@btrask probably worth moving to notes repo -- https://github.com/ipfs/notes/ as it's more general stuff than go-ipfs specific |
(will respond there) |
-- also, i think you misunderstand IPFS. the whole point is creating this graph. you could have the hash of just the data as a secondary attribute, but it will never be the address of the objects. |
See also the growing ipld spec, which is a standalone format: https://github.com/ipfs/specs/blob/ipld-spec/merkledag/ipld.md also, the word "proprietary" does not make sense here at all.
i think you mean application-specific. |
I just wrote up another use case for being able to generate hashes in advance of adding them to IPFS (#1957). "Application-specific" implies that other applications/software don't have a legitimate need to be able to compute IPFS-compatible hashes, which I disagree with. Since IPFS hashes include the DAG data, that means the DAG must be fully specified, because applications need to be able to generate DAG data that is bit-for-bit identical. That includes how blocks are broken up and even how IPFS formats its whitespace. Do you agree that having multiple hashes for a single (otherwise identical) file is a problem? Because IPFS has this problem by itself, with the trickle-DAG. |
No, i don't think this is a problem at all. the hashes are not hashes of a file. they are hashes of a graph. The graphs represent the files AND access patterns AND deduplication AND compression AND .... Repeating:
|
Interesting that you mention compression, since Git went through exactly the same confusion. When Git was first written by Linus, it used SHA-1s of the compressed data (using zlib). But later on they realized that was limiting, because it made them dependent on the exact compression used. So they switched and now they generate the hash of the uncompressed data, then compress it for storage. (I think I read a really good article about this a few years ago, but the best citation I could find is this Stack Overflow question). Similarly in IPFS, this would grant you more flexibility to improve the DAG in the future. It makes sense to me that the "file hash" should be a secondary attribute from the perspective of IPFS. However, it should be the primary hash from the perspective of the user. The user cares about the file, not IPFS' internal representation of it. What do you think? |
To be clear, this might fit the plumbing/porcelain split. Plumbing = hash of top node in the merkle DAG. Porcelain = hash of the original file content. |
You (and some subset of users) care more about the file hash. but many others care more about the dag. the dag is really important. a dag optimized for streaming video and a dag that gives a compressed delta representation from previous version are critically different to applications. let's put this another way. An image can be encoded as jpg or a png, but their hashes are different! horrible! let's not hash the image files, let's come up with some canonical representations of the image -- maybe take image fingerprints based on the full bitmap or vectorized form -- and hash those. after all, the user cares about the image, not the file system's internal representation of it.
I am well aware, and i disagree. This compression (what i talk about) is completely separate from compression on repo (disk) and on the wire as optimization around storing + transferring the dag (what git talks about), which yes, it's best to take advantage of better representations as they come. The compression i'm talking about is for finding the most compact way to represent the data and let that be the canonical form. Attempt to approach low kolmogorov complexity states, use (hash) those, and compute it out. sure, can find better, but no, i don't want to content address Pi, i want to content address a program that computes Pi.
it would be nice if you stopped assuming that people are wrong when you haven't yet understood them. |
everything is a relative projection, and there are no absolutes, only the illusion of one from your relative vantage point. there is truly no canonical representation of any piece of information. even representations with the smallest kolmogorov complexity may have isomorphisms of exactly the same size. every version you pick, file or dag, is just one projection of many possible. yes, there will be many hashes for "yielding the same information". but different ways of yielding it. |
I care about both. IPFS has a very good interface for letting users/applications manipulate the DAG to store all sorts of data and do cool things. However, when a user uses
This is precisely the mistake that Git made.
At the boundary of a file system (like IPFS), the file is the Platonic ideal. While it's true that there are semantic equivalences above that, they are unknowable without AI or lots of manual intervention. On the other hand I don't see how obfuscating which files started out as equivalent is anything but a step backwards. Traditional file systems use many different types of internal representation and caching. Regardless of how a file is fragmented, whether it is sparse, or what extended attributes it has, the file name doesn't change. |
I agree with @jbenet: ipfs has the ambition to be a universal archive of dags, and to succeed it needs flexibility in the hash function, chunking algorithms and serialization of the wrapping that a simple file requires. That effectively requires a new hash, incompatible with the widely used ones. But the result is that it is tricky to convert the existing archival infrastructure of plain files (e.g. Linux distribution repositories) if ipfs doesn't make it easy to rehash them using its own hashing algorithm. For example: on my machine running What I would like to have:
|
Have you tried --only-hash ?
|
(Am fine making tools but they will need the importers (what decides how
|
@jbenet It sounds like the importers are what needs to be standardized anyway? A stand-alone package for |
@jbenet Yes of course, I used I tried the three possibilities and these are my totally nonscientific results (~1 GB video file, Sandy Bridge i3):
|
agreed on both!
it does not make a difference? huh! we may have broken it. cc @whyrusleeping o/ @robcat this is definitely a bug. we should reach sha256 perf. (unless you're using rabin) |
|
@jbenet: this modularity would be very cool, but just publishing the ipfs low level specifications could go a long way... This is a use case I was thinking about: Of course, people would not trust a fat opaque binary to go through all their files (possibility of data corruption). But using the ipfs specification, such a script could be written in a few lines just using bash and the standard unix utilities (e.g. sha256sum and split), making it much more trustable. |
not likely, not with more sophisticated chunking using things like rabin fingerprinting with specific parameters. in the end reading our modular code will be easier. |
Hey @jbenet, I talked to @mekarpeles and he suggested we talk. If you're interested and have some time it might be useful. I prefer Skype but can also use Hangouts. |
hey @btrask yeah, let's. i think it will help align a lot of our perspectives. we have our community hangouts on mondays, maybe after those. else tue is possible. let's maybe schedule by mail or irc |
Thanks for chatting @jbenet! I opened a couple issues to try to document some of our conversation. |
👍 |
I recently found this thread, and I'm very disappointed that IPFS, which was advertised as a content-addressed network, cannot be used for addressing files by their content. It's a very basic and obvious use case: find a file by its SHA-1 or SHA-256 hash. There are many existing centralised systems that use their own Merkle DAG (with objects addressed by their standard hashes) that could benefit from IPFS as storage, making these systems distributed and decentralised with minimal effort. For example, imagine distributed Git. It would be great to be able to retrieve a commit by its SHA-1 hash, and then recursively retrieve all objects this commit contains by their respective SHA-1 hashes. Or, for example, a repository of RPM packages used by Unfortunately, I've found from this thread (and another thread linked from this one) that IPFS can't do this with its current design. It looks very strange to me that this basic and obvious use case was not included in the initial design, and for now it's just a remote goal that can be achieved only by writing a separate service for mapping from a file hash to the internal IPFS hash. But it's somewhat of a relief to find that IPFS is still very useful for posting cat photos: they can be compressed using different lossy algorithms, so they don't need to have stable hashes. |
Closing, please move further discussion into the ipfs/notes repo |
I've talked to Juan about this before but I figured it would be good to have an issue for it. I'm not trying to beat a dead horse though. This is separate from the URI format debate (#1678).
There is a lot of existing software that might want to either move on top of IPFS, or exchange data with IPFS, that is currently using hashes of files (e.g. images) or other data. This software is often using MD5, SHA-1 or SHA-256.
There are three aspects of conversion that pose a problem for compatibility outside the IPFS ecosystem:
The multihash format
This can be solved with better tooling. The simplest would be a CLI tool (probably written in Go) that accepts hashes in various standard formats and encodings (e.g. SHA-256 hex) and outputs a multihash. When you give it a multihash it might give you the same hash in multiple standard encodings (e.g. hex and base-64). It might also produce multihashes from files directly. (Perhaps a tool like this exists, but even the JS multihash lib doesn't produce fully base-58 encoded multihashes for you.)
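As a rough illustration of what such a tool might do, here is a minimal Python sketch. It assumes the multihash layout of one function-code byte, one length byte, then the raw digest (0x11 for SHA-1, 0x12 for SHA-256), and the Bitcoin-style base-58 alphabet. The function names are hypothetical and are not part of any existing multihash library. Note that the result is the multihash of the raw file content, which is not the address IPFS would assign to the file.

```python
import hashlib

# Bitcoin-style base-58 alphabet (no 0, O, I, l).
B58_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"

# Multihash function code and digest length in bytes for two common algorithms.
MULTIHASH_CODES = {"sha1": (0x11, 20), "sha2-256": (0x12, 32)}


def b58encode(data: bytes) -> str:
    """Encode bytes as base-58 (hypothetical helper, written out for clarity)."""
    n = int.from_bytes(data, "big")
    out = ""
    while n > 0:
        n, rem = divmod(n, 58)
        out = B58_ALPHABET[rem] + out
    # Each leading zero byte is represented by the first alphabet character.
    pad = len(data) - len(data.lstrip(b"\x00"))
    return "1" * pad + out


def hex_to_multihash(hex_digest: str, algo: str = "sha2-256") -> bytes:
    """Wrap an existing hex digest in a multihash prefix without rehashing."""
    code, length = MULTIHASH_CODES[algo]
    digest = bytes.fromhex(hex_digest)
    if len(digest) != length:
        raise ValueError(f"{algo} digest must be {length} bytes, got {len(digest)}")
    return bytes([code, length]) + digest


if __name__ == "__main__":
    hex_digest = hashlib.sha256(b"hello\n").hexdigest()
    print(b58encode(hex_to_multihash(hex_digest)))
```

Because the SHA-256 prefix bytes are always 0x12 0x20, the base-58 form of such a multihash always begins with "Qm", which is why it looks like an IPFS address even though it was computed over the raw content only.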
Documenting and reproducing the IPFS DAG format, and generating it on the fly while hashing
This is a big problem for any software that wants to interface with IPFS. IPFS addresses are hashes of the DAG data, but the DAG format is... whatever IPFS decides to stick in its protobufs (e.g. #1925). It isn't formally documented AFAIK, and there's no other software that can create IPFS DAGs besides IPFS itself. Even worse, it seems difficult to do this efficiently in a low-overhead streaming hasher (without retaining extra intermediate data).
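To make the streaming concern concrete, here is a deliberately simplified Python sketch. It is not the IPFS DAG format: the "root hash" below is just SHA-256 over the concatenated SHA-256 digests of fixed-size chunks, and the chunk sizes and function names are assumptions chosen for illustration. It shows two things: a flat hash needs only constant state, while a chunked scheme has to retain one digest per chunk until the end, and the resulting root depends on how the data was split.

```python
import hashlib
import io
from typing import BinaryIO


def flat_sha256(stream: BinaryIO, bufsize: int = 65536) -> str:
    """Plain streaming hash: constant memory, result independent of bufsize."""
    h = hashlib.sha256()
    while True:
        buf = stream.read(bufsize)
        if not buf:
            break
        h.update(buf)
    return h.hexdigest()


def toy_root_hash(stream: BinaryIO, chunk_size: int) -> str:
    """Toy two-level hash (NOT the IPFS format).

    Assumes a buffered stream whose read(n) returns n bytes until EOF.
    """
    chunk_digests = []  # intermediate data: grows by 32 bytes per chunk
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        chunk_digests.append(hashlib.sha256(chunk).digest())
    return hashlib.sha256(b"".join(chunk_digests)).hexdigest()


if __name__ == "__main__":
    data = bytes(range(256)) * 4096  # 1 MiB of sample data
    print(flat_sha256(io.BytesIO(data)))
    print(toy_root_hash(io.BytesIO(data), 256 * 1024))
    print(toy_root_hash(io.BytesIO(data), 128 * 1024))  # differs from the line above
```

Any third-party implementation would need the real node serialization and chunking parameters spelled out to reproduce a hash bit for bit, which is the documentation gap described above.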
Flexibility in the IPFS DAG format, for example the trickle-DAG mode
IPFS itself can't agree on the hash for any given file. Add a file with
ipfs add myfile
and
ipfs add -t myfile
and you get two different hashes. The trickle-DAG is a cool and useful feature that allows efficient streaming and seeking (#713) over the IPFS network. However it shouldn't result in different file addresses. (Even worse, I think other DAG structures are possible and could be used by alternate front-ends, all resulting in incompatible hashes.)
Converting hashes without recomputing them from the underlying file is impossible
Because IPFS hashes include the IPFS DAG structure, they are effectively a different hash algorithm altogether. You can convert between SHA-1 hex and base-64 without re-encoding, and even to multihash, but you can't get a full IPFS hash because the DAG data must be "mixed in" during hashing. But this is no worse than defining a new set of hash algorithms if the other problems can be addressed.
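For contrast, the conversions that are possible without touching the file are pure re-encodings of the same digest bytes. The Python sketch below uses hypothetical helper names and assumes the multihash prefix for SHA-1 is the two bytes 0x11 (function code) and 0x14 (length 20).

```python
import base64
import hashlib


def hex_to_base64(hex_digest: str) -> str:
    """Re-encode a hex digest as base-64; the digest bytes are unchanged."""
    return base64.b64encode(bytes.fromhex(hex_digest)).decode("ascii")


def base64_to_hex(b64_digest: str) -> str:
    """Inverse of hex_to_base64."""
    return base64.b64decode(b64_digest).hex()


def sha1_hex_to_multihash(hex_digest: str) -> bytes:
    """Prefix a 20-byte SHA-1 digest with its multihash code and length."""
    digest = bytes.fromhex(hex_digest)
    if len(digest) != 20:
        raise ValueError("SHA-1 digest must be 20 bytes")
    return bytes([0x11, 20]) + digest


if __name__ == "__main__":
    hex_digest = hashlib.sha1(b"abc").hexdigest()
    b64 = hex_to_base64(hex_digest)
    print(hex_digest, b64, sha1_hex_to_multihash(hex_digest).hex())
```

None of these steps needs the original file. Producing the IPFS address does, because the bytes being hashed are those of the DAG nodes and not the file content alone.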
Right now, IPFS hashes are effectively proprietary, and I can't trust them for long term use/storage. Sure the IPFS code is open but work needs to be done to make it practically possible for other software to generate and validate IPFS hashes. The bare minimum is documenting the DAG format and writing a simple, portable algorithm for generating IPFS hashes (synthesizing whatever necessary DAG meta-data). IPFS itself should also standardize on one representation (which doesn't mean getting rid of the trickle-DAG mode). Ideally, file hashes would be the raw hash of file content, not including anything related to how it's stored in the DAG (but still using the multihash format).
Thank you!