Content Encryption #270
Comments
We should consider how it would work with the rest of the toolkit, for example GC of the repo. One of the solutions might be having a two-part encryption key, one for links and one for the content, but this gets problematic with IPLD.
Yeah, that gets really problematic with IPLD. I also didn't consider GC when thinking about this... that makes things rather painful...
One solution is to only store the plaintext, annotated with the encryption nonce and key, and encrypt the data on the fly once it leaves the host. I probably wouldn't go down the multiple-keys path, as that seems like it solves only one very specific problem. Is there supposed to be a database of known keys, or should the user pass a key and the subsequent keys sit in the encrypted DAG element? Because if we have a database, I think it's simplest (and probably fast enough) to decrypt on the fly.
My 2 cents: If you're only adding a single file, then there is no way to hide the size without padding the file pre-encryption. Assuming that you want to store actual sub-trees with files and folders, you can hide the size of individual files and also the directory contents, filenames etc. if you want. To do this properly I expect you will reinvent what we do in Peergos.

A file tree for us is stored in a merkle-btree, giving you a single root you can pin for GC purposes. Each directory and file gets its own encryption key. Each directory consists of encrypted links to the children (and the encrypted directory name). The actual structure is called cryptree (the paper is in our repo) and was invented by Wuala in 2007. Using different keys for every file allows you to selectively grant read access, but for your purposes you might be okay with a single key for the entire sub-tree. We also chunk files into 5 MiB sections which each get independently encrypted and added to the btree under a random label. This means the network can see the total size of all files, and roughly the total number of directories (large directories are chunked as well), but not individual metadata like names, modification date, path, or indeed the topology of the tree. Then to read the tree you just need the encryption key and the btree label of the root dir.

Another thing I'd advise is to not use asymmetric encryption, the reason being that all standard asymmetric encryption is trivially broken by a sufficiently large quantum computer (which we expect in 5-15 years). The cryptree structure is nice because it allows you to create the whole fine-grained structure without any asymmetric crypto.

One nice bonus feature would be to do what we do to obtain the root key, which is to generate it from a memory-hard hashing function like scrypt. Then you can optionally password-protect files, which may result in better UX than a full 32-byte key. (Obviously the encryption is only as secure as the password though.)
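For illustration, here is a minimal sketch of the password-derived root key idea, using scrypt from Go's x/crypto packages. The cost parameters and key size are assumptions for the example, not Peergos's actual settings:

```go
package main

import (
	"crypto/rand"
	"fmt"

	"golang.org/x/crypto/scrypt"
)

// deriveRootKey stretches a user password into a 32-byte symmetric key with
// scrypt, a memory-hard KDF. The salt is not secret and would be stored next
// to the encrypted root so the key can be re-derived on another device.
func deriveRootKey(password, salt []byte) ([]byte, error) {
	// N=2^17, r=8, p=1 are illustrative cost parameters; tune them to the
	// memory/latency budget of the weakest device that must derive the key.
	return scrypt.Key(password, salt, 1<<17, 8, 1, 32)
}

func main() {
	salt := make([]byte, 16)
	if _, err := rand.Read(salt); err != nil {
		panic(err)
	}
	key, err := deriveRootKey([]byte("correct horse battery staple"), salt)
	if err != nil {
		panic(err)
	}
	fmt.Printf("root key: %x\n", key)
}
```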
I like the Encrypt First approach. Most use cases are about protecting the data, not about hiding minor metadata. I agree that allowing the content to be replicated through the network (without having the key) is a very good feature. My proposal is simple.
Please read and comment on my notes, "Using PKCS #7 (Cryptographic Message Syntax) and the DAG". The above assumes that a peer has a secure keystore to manage the encrypting and signing keys.
@ianopolous my understanding is that CMS allows password-based encryption. I'm interested in your warning about asymmetric encryption. Could you provide some links to the cryptree structure?
Personally, I'm not convinced that the encrypt second approach is really worth anything if peers won't be able to replicate the content. We might as well just encrypt the data locally and use authentication to control who gets access. That's a lot more flexible. If we want to be able to replicate and completely encrypt, we wouldn't be able to lazily fetch files (data access patterns will reveal the file structure). Instead, we'd have to store the entire filesystem in some form of append-only log (possibly with checkpointing). Unfortunately, this would be very inefficient. There are probably ways to do this with differential privacy but I don't know any off the top of my head.
Interesting discussion. We've been working on this a bit in our Dweb Library, and use it in our Versioning demo. The approach taken is that content is encrypted symmetrically, then the key is stored encrypted with the public key of any authorised viewer on an AccessControlList. Content is encrypted/decrypted as it goes through the transport layer. When encrypted content is received, the transport retrieves the ACL and looks for matching keys between there and any of the user's keys. Similar to @richardschneider's suggestion, we wrap the encrypted content in the necessary fields and then store it in the DAG. That's glossing over the details, but there is also:
It's entirely browser/JS based; the transporting peers/content servers have no knowledge that the content is encrypted and no access to it, and private keys aren't stored unless themselves encrypted with some higher-level key. An earlier version also works in Python, but obviously not over IPFS. It's transport agnostic and runs over both IPFS and our internet HTTPS contenthash servers; the default is to use both, but because of the IPFS crashing bugs in Pubsub it will crash every few minutes on IPFS, so until that is fixed you can use the HTTPS-only version by adding transport=HTTP to the URL, e.g. keys; and versions. Add "&verbose=true" to each of these as well if you want to watch the console to see what is happening. (It doesn't tackle the file-length concerns of @Stebalien, as we don't think that level of encryption/hiding is needed for 90% of the applications out there, and the other 10% probably need something much more complex.)
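As a rough illustration of the pattern described above (encrypt the content once with a symmetric key, then store that key encrypted to each authorised viewer's public key), here is a hedged sketch using NaCl secretbox/box; the structure and names are hypothetical and not the Dweb Library's actual API:

```go
package main

import (
	"crypto/rand"
	"fmt"

	"golang.org/x/crypto/nacl/box"
	"golang.org/x/crypto/nacl/secretbox"
)

// aclEntry stores the content key wrapped for one authorised viewer.
type aclEntry struct {
	ViewerPub  [32]byte // viewer's Curve25519 public key
	Nonce      [24]byte
	WrappedKey []byte // content key encrypted to ViewerPub
}

// encryptForACL encrypts content once with a random symmetric key, then
// wraps that key for every public key on the access-control list. The
// ciphertext, nonce, sender public key and ACL are all stored in the clear;
// only holders of a listed private key can unwrap the content key.
func encryptForACL(content []byte, viewers [][32]byte) (ciphertext []byte, nonce [24]byte, senderPub *[32]byte, acl []aclEntry, err error) {
	var contentKey [32]byte
	if _, err = rand.Read(contentKey[:]); err != nil {
		return
	}
	if _, err = rand.Read(nonce[:]); err != nil {
		return
	}
	ciphertext = secretbox.Seal(nil, content, &nonce, &contentKey)

	// Ephemeral sender key pair; the public half is published with the ACL
	// so viewers can run the box key exchange.
	senderPub, senderPriv, err := box.GenerateKey(rand.Reader)
	if err != nil {
		return
	}
	for _, v := range viewers {
		e := aclEntry{ViewerPub: v}
		if _, err = rand.Read(e.Nonce[:]); err != nil {
			return
		}
		e.WrappedKey = box.Seal(nil, contentKey[:], &e.Nonce, &v, senderPriv)
		acl = append(acl, e)
	}
	return
}

func main() {
	viewerPub, _, _ := box.GenerateKey(rand.Reader)
	ct, _, pub, acl, err := encryptForACL([]byte("secret document"), [][32]byte{*viewerPub})
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d ciphertext bytes, sender pub %x..., %d ACL entries\n", len(ct), pub[:4], len(acl))
}
```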
@mitra42 Do you encrypt the DAG nodes themselves? Does your approach just store files, or do you also store folders?
@richardschneider Here is the cryptree paper: https://github.com/Peergos/Peergos/blob/master/papers/wuala-cryptree.pdf In my opinion granting access is outside the scope of this for ipfs (do one thing and do it well). You have access if you have the appropriate symmetric key/passphrase. Individual applications can then use their own key-sharing mechanism on top of this. @Stebalien In our approach to "encrypt second", everything including all metadata, directory structure and topology is encrypted and can be safely shared and duplicated around the network. In terms of hiding the access pattern, that is also achievable (if you want it) but is a separate concern from defending the data at rest. We have this implemented already in Java and Javascript for Peergos; see the following:
Our transports are already encrypted and authenticated, that's why I suggested authentication. Unlike simple encryption, authentication is revocable (and generally more flexible). A simple solution would be to have a separate "capabilities" service. If peer A makes a request to peer B that requires capability C, peer A would ask its capabilities service if peer B has capability C. The two peers' capability services would then run an authentication protocol to actually determine if B has capability C. This allows us to keep authentication/capabilities entirely separate from bitswap, pubsub, etc.
Data at rest is really the concern of the OS (e.g., full disk encryption).
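A hypothetical sketch of what such a separate capabilities service could look like as an interface; all names here are made up for illustration and are not an existing libp2p/IPFS API. The real service would authenticate the remote peer rather than keep a local table, but the shape of the calls is the point:

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// PeerID and Capability are placeholder identifiers; a real service would
// presumably use libp2p peer IDs and a richer capability description.
type PeerID string
type Capability string

// CapabilityService sketches the separate service described above: bitswap,
// pubsub etc. never authenticate peers themselves, they only ask this
// service whether a peer currently holds a capability. Grants stay
// revocable, unlike a decryption key that has already been handed out.
type CapabilityService interface {
	Has(ctx context.Context, peer PeerID, c Capability) (bool, error)
	Grant(ctx context.Context, peer PeerID, c Capability) error
	Revoke(ctx context.Context, peer PeerID, c Capability) error
}

// memCaps is a toy in-memory implementation; a real one would run an
// authentication protocol against the other peer's capability service.
type memCaps struct {
	mu     sync.Mutex
	grants map[PeerID]map[Capability]bool
}

func newMemCaps() *memCaps { return &memCaps{grants: map[PeerID]map[Capability]bool{}} }

func (m *memCaps) Has(_ context.Context, p PeerID, c Capability) (bool, error) {
	m.mu.Lock()
	defer m.mu.Unlock()
	return m.grants[p][c], nil
}

func (m *memCaps) Grant(_ context.Context, p PeerID, c Capability) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	if m.grants[p] == nil {
		m.grants[p] = map[Capability]bool{}
	}
	m.grants[p][c] = true
	return nil
}

func (m *memCaps) Revoke(_ context.Context, p PeerID, c Capability) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	delete(m.grants[p], c)
	return nil
}

func main() {
	caps := newMemCaps()
	ctx := context.Background()
	caps.Grant(ctx, "QmPeerB", "read:QmSomeRoot")
	ok, _ := caps.Has(ctx, "QmPeerB", "read:QmSomeRoot")
	fmt.Println("peer B holds the capability:", ok) // a block-serving path would gate on this
	caps.Revoke(ctx, "QmPeerB", "read:QmSomeRoot")
}
```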
@Stebalien By data at rest I mean the data structure as visible to the entire network (as ipfs is essentially a global file system), so nothing to do with the hard disk you happen to be running ipfs on. I would strongly advise against authentication (in the form of asymmetric crypto) for the reason I discussed above. Symmetric encryption + hashing are all you need.
Good discussion. Sorry if I confused things: when I refer to "encryption in the Transport layer", I'm talking about it from the app's perspective, and mean from the point where something is stored in IPFS to the point where it's retrieved by another node. It's great that IPFS is encrypted on the wire between two peers, but in a distributed file system that doesn't protect data, since any peer can retrieve it, so another layer is required, not just on-the-wire (bitswap) encryption. I think in that sense I'm referring to the same thing as @ianopolous.

@RangerMauve - you ask "Do you encrypt the DAG nodes themselves" and I'm not quite clear what you are asking. We encrypt JSON data structures, wrap them in another structure (so you know what you've got) and then store with dag.put. We are currently storing data structures; I'd use the same approach for files, but haven't done so yet.

@ianopolous - I agree you "can" build a key-sharing mechanism; the challenge is to actually do so in a distributed system. (From a historical perspective it's one reason why PGP didn't achieve wide adoption for email - key sharing was so hard that in practice it didn't get done.) In our model, the encrypted data carries a pointer to the Access Control List, so that the transport knows how to decrypt it. I'd suggest that that mechanism is made self-describing as well, so for example someone retrieving a DAG node could see that it was encrypted with Peergos, or with some other mechanism, so that it knew whether it could decrypt it. Peergos looks interesting - is the API defined somewhere? And I see a comment about a cross-compile to Javascript but can't find it in the repository.

@Stebalien - IPFS is "data at rest"; as Ian says, IPFS is a global shared file system, so data in it should be encrypted one way or another. Also, you can't presume a negotiation between Peer A and Peer B during the decryption process. You have to presume that Peer B posted the data (and authentication info) to IPFS and then disconnected, and is not available in any way during decryption. Peer A has to interact entirely with the data stored in the global file system.
@mitra42 What I'm really saying is that the key-sharing mechanism should be independent of this layer, which should be pure capability-based decryption. The simplest way to handle the key is then to include it in links to the file; as @whyrusleeping mentioned in UX, it can be in the fragment of a URL, for example. (An example in peergos is https://demo.peergos.net/#pQd8rmrEhBN1RbDLK1ioBnFF4YLgvPAmte3ypNDiwshMJJip9Dbbgw4t/FGXm4KWWePPNfdN91MNCeHgC16Wxemt4C4iDoS6qz1ea/5Pf7SvpL6BKtVUnPGmU3CqpZJ1hypK17GZbF27Ui8hKa2CXZWZZ ) Then your file(s) are as secure as the mechanism you use to share those links (modulo browsers logging them in history etc.). Peergos is meant to be able to be self-hosted within ipfs, so our API is a subset of IPFS's API (watch the console on our demo server to see us logging ipfs API calls). As authenticated pub-sub is still in the works, we have our own equivalent, which we hope to swap out for ipfs's version once it is mature. We cross-compile to JS using GWT (the UI code is in a different repo - https://github.com/Peergos/web-ui). This allows us to use a small amount of JS in vuejs for the UI, and reuse all our Java code directly.
@mitra42 Hopefully my proposal on using CMS will address some of your needs. CMS provides data at rest. The content is encrypted with a randomly generated symmetric key; let's call this key the CEK (content encrypting key). It is not reliant on transport encryption and can be used with unsecured transports. CMS includes the CEK encrypted with a KEK (key encrypting key). Thus, anyone holding the KEK can get the CEK and then read the data. An example CMS, decoded by
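To illustrate the CEK/KEK relationship (without the actual CMS/ASN.1 encoding), here is a small sketch using AES-GCM for both the content encryption and the key wrap; the choice of AES-GCM and all names are assumptions for the example:

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"fmt"
)

// sealGCM encrypts plaintext under key with AES-GCM, prefixing the nonce.
func sealGCM(key, plaintext []byte) ([]byte, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	return gcm.Seal(nonce, nonce, plaintext, nil), nil
}

func main() {
	content := []byte("file bytes go here")

	// CEK: fresh random key that encrypts the content itself.
	cek := make([]byte, 32)
	rand.Read(cek) // error handling elided in this sketch

	// KEK: a long-lived key (e.g. held in the peer's keystore, or derived
	// from a password). Anyone holding the KEK can recover the CEK and
	// therefore read the content.
	kek := make([]byte, 32)
	rand.Read(kek)

	encryptedContent, err := sealGCM(cek, content)
	if err != nil {
		panic(err)
	}
	wrappedCEK, err := sealGCM(kek, cek)
	if err != nil {
		panic(err)
	}
	fmt.Printf("encrypted content: %d bytes, wrapped CEK: %d bytes\n",
		len(encryptedContent), len(wrappedCEK))
}
```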
@richardschneider Can CMS handle entire file trees as opposed to a single file?
@ianopolous CMS is about messages (the M in CMS). It is basically about encrypting/decrypting the message. So, in my use case the message is the content of a single file. Can 'entire file trees' be described in one message? If so, CMS will work with it.
@richardschneider OK, but then we are talking about different things. I'm describing something with which an entire file tree can be encrypted in ipfs, in a way that you get a single root which you can pin to store the entire subtree. Even with a single file you have to consider large files which need to be split, and the associated structures linking them in the dag.
@ianopolous I'm proposing that CMS is used to encrypt the PBNode.data.
@richardschneider Ah, that clarifies it. What other information is outside the PBNode.data? Filenames? At least the directory topology is still in the clear with that, as well as individual file sizes if you're adding a directory, but it's much easier to implement than a new unixfs format.
@ianopolous Yes, it's a very low-level change, so higher-level thingies (pin, unixfs, ...) just work. Things that an attacker can determine:
Note that the data size is the size of the CMS message, not the plain content size. The Encrypt Second approach would prevent the attacker from getting the data name and size.
@ianopolous - Adding it to the URL makes sense, but unfortunately IPFS doesn't use URLs, it uses a single key. I'm assuming that in many cases an unencrypted resource will hold a pointer to an encrypted resource, so whatever makes that link (whether it's embedded in HTML, or inside IPLD for example) has to be able to embed one of these links (IPLD can't) and to do so safely (I'm not sure if your URL includes the key, which would make it unsafe). I think this requires two things. Your link to https://github.com/Peergos/web-ui seems to be missing instructions on how to include it in a browser page.

@richardschneider - yes, I saw your proposal for the CMS - similar semantics to what we do EXCEPT you are using the peerId for the authentication, when I think you want to do it by the PublicKey, i.e. it's me - the holder of the key - that is authenticated, not the particular machine I happen to be using.
@mitra42 What gave you the idea that I'm using peerid? I'm using the key ID, which just looks like a peer ID.
@mitra42 IPFS does use URLs, e.g. https://ipfs.io/ipfs/QmTkzDwWqPbnAh5YiV5VwcTLnGdwSNsNTn2aDxdXBFca7D An unencrypted resource holding a pointer to an encrypted resource doesn't make sense in a capability-based system. The web-ui repo is just an example of using the GWT-compiled lib. It's mostly just a matter of including the library.
Sorry @richardschneider, I was confusing your term "CMS" and read "Capability Management Service", and there was a post by @Stebalien today proposing that for authentication (using Peer ID). @ianopolous - I believe IPLD uses either just the hash or /ipfs/Qm...; either way, I'm not sure how you'd pass your key as in the https://demo.peergos.net/#pQ.../FGX example above. Also... pointers from unencrypted to encrypted make LOTS of sense. For example, I can pass you a pointer to a google doc here, but unless you are on the list of people with access then you can't access it. That access control is handled independently from giving you a reference to the resource.
@mitra42 You're referring to merkle-links between ipld objects, which are a cid. In that case I wouldn't be including the key, because it would be visible to the network. In peergos everything is ipld. Within a given merkle-btree we have links between objects, but these links are encrypted, and thus not visible to the network (or the pinner). This is how we hide the directory tree topology, and indeed the size of individual files within a btree (whose chunks link to each other). In my user's btree I can store a link to a file in your btree, but that link isn't visible to the network. That is because we go even further than all this and are careful to hide the social graph as a primary goal. As soon as I had an ipld merkle link between different users' btrees (which must be network visible to be any use), the world would know that I'm friends with you.

Re links from public to encrypted: note I said in a capability-based system. Your example is not a capability-based system. It relies on a particular server and hence is not decentralized.
@whyrusleeping re: UX I do not see the need for adding a "key id" to the IPFS hash. Either the
I also think it's a misuse of the URL fragment identifier. Typically, the fragment id points to some content within the resource, not how to transform the resource.
@mitra42 What I was alluding to is that the DAG itself should be encryptable, so that people can't traverse it unless they are able to decrypt the individual DAG node data.
@ianopolous - just for clarification, our solution is NOT single-server based in any way; it is (like yours, I believe) based solely on the global file system of IPFS (or other transports) and the decentralized append-only logs of YJS (or could be OrbitDB etc).

@RangerMauve - agree, the DAG should be encryptable.

@ianopolous - sorry, but what I'm asking, and don't seem to be able to communicate clearly (because you and I may be using different terms), is that you specify that you want to use a URL with an extra field for a key identifier, and I'm suggesting that the pointers within IPLD don't support that (I think they should support any kind of URL, e.g. to IPNS or even to outside IPFS).
I would love to see native encryption for IPFS. There are a lot of use cases where I want to share content with a small set of people and I don't want anyone on the network to be able to see what I have. I really like the idea of having the key easy to pass around with the CID such as QmFooBarEncrypted#key=1bf72d9ef as I think this naturally acts like a capability model. Anyone who has the link has the key. But it also downgrades nicely. I can ask someone to pin QmFooBarEncrypted without giving them access to the content. I see some talk about using access control instead of encryption. I don't think this effectively solves the problem for a number of reasons:
I think the better approach is to encrypt each block separately. Each link is then a link to the encrypted block and an encryption key. A cool property of this is that if you use the content of the block to derive the encryption key you can continue to get deduplication (possibly derived using a key to prevent people from identifying who has known content). Or you can use random keys for maximum security (but no deduplication). Encrypting per-block also means that you can continue to do partial access and get CIDs for subtrees like today. The downsides I see are:
But overall I think it is the best approach. The core of IPFS doesn't really change at all; it just adds a convention for how to encrypt and decrypt on top of the current block and bitswap structure that exists today. In fact, this is a client-only change. You could continue to pin this content on existing pinning services, and any encryption-aware client can download and decrypt it. If we keep the links unencrypted, I think encryption with keys derived from block contents could even be enabled by default in the future. The only real user-facing difference is that the CIDs include an encryption key. (Although that would be a breaking change for a lot of software, so it can't be taken lightly.) The fact that not just anyone on the internet can download and inspect everything my node has cached is quite a nice benefit. (It isn't a panacea, because if you type a popular CID into a search engine you may well find a key, but it is far better than everything I add being immediately available to everyone.)
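A minimal sketch of the content-derived key idea from the comment above: the block key is an HMAC of the block under a shared deduplication secret, so identical plaintext blocks encrypt identically (keeping dedup) while outsiders without the secret can't confirm guesses. Names and parameters are illustrative only, not an existing IPFS API:

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/hmac"
	"crypto/sha256"
	"fmt"
)

// encryptBlock derives the block key from the block's own contents (keyed
// with a deduplication secret), then encrypts deterministically. Identical
// blocks under the same secret produce identical ciphertext, so
// deduplication still works; a random key would give maximum secrecy but
// no dedup.
func encryptBlock(dedupSecret, block []byte) (key, ciphertext []byte, err error) {
	mac := hmac.New(sha256.New, dedupSecret)
	mac.Write(block)
	key = mac.Sum(nil) // 32-byte content-derived key

	c, err := aes.NewCipher(key)
	if err != nil {
		return nil, nil, err
	}
	gcm, err := cipher.NewGCM(c)
	if err != nil {
		return nil, nil, err
	}
	// A fixed nonce is acceptable here only because the key is unique per
	// distinct plaintext; with random keys you would use a random nonce.
	nonce := make([]byte, gcm.NonceSize())
	ciphertext = gcm.Seal(nil, nonce, block, nil)
	return key, ciphertext, nil
}

func main() {
	secret := []byte("per-user or per-group dedup secret")
	key, ct, err := encryptBlock(secret, []byte("raw block bytes"))
	if err != nil {
		panic(err)
	}
	// The ciphertext becomes the stored block (its CID is public); the key
	// travels in the link, e.g. as the #key=... fragment discussed above.
	fmt.Printf("key=%x\nblock=%x\n", key, ct)
}
```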
While being very conscious of the technical difficulties presented by the problem of encryption (and probably even more so on IPFS), I'm always surprised at how unimportant this detail seems to be for most cloud file system providers. Of course, large corporations have the resources to ensure that their data is either meaningless or pre-encrypted before getting stored on cloud services. But it is surprising how difficult this, to me most basic, need becomes for the average user. Keeping your data on your machine means you have to rely on hardware that is sooner or later doomed to fail. If your data is really important then you have to regularly check your hardware and keep multiple local backups. Who really does that? For the cloud there are painful solutions such as encfs (but definitely not for the average Joe), Cryptomator (open source but very fragile and not that intuitive to use) or Boxcryptor, which can be costly. Again, very conscious of the technical hurdles and also the ethical implications (bad guys), but as said, as an average Joe I'm not really very eager to expose my banking statements out there to public access. In conclusion, it seems to me that anybody putting out a cloud storage solution should at the very least offer the possibility to keep your data private. So I have to confess my disappointment in seeing this issue dying a slow death.
Here is a proposal for how to encrypt the DAG while preserving links (so that an untrusted party can pin the content). I would appreciate feedback and if people think it is the right approach we can propose it as an IPFS spec. https://docs.google.com/document/d/1RwVeEL6HETPgJGM0bPyiXIvZeZYqRZ3D5cmqG7a6ox8/edit
I've been thinking about this specifically in the context of encrypted DAGs that can still be pinned by pinning services with a single root. I was thinking that this could be done with two steps:
I think this approach can leave maximal flexibility for IPLD datasets while preserving privacy and ensuring data can be pinned and shared around. Some stuff that is outside the scope is figuring out how to have sub-dags with different encryption (maybe they have separate roots that are linked to from the main root?)
The encrypted DAGs in Peergos can be (and are) pinned by a single root cid. It is basically a big CHAMP (HAMT) where the mappings have no visible relation to each other, nor is the root dir entry visible. It is not visible how many files or directories there are; even the directory topology is hidden. In our Cryptree+ every file or dir has a unique symmetric decryption key. Nowadays we also require a secret mirror key to be able to retrieve the encrypted blocks on another node as well (for details see https://peergos.org/posts/bats). For your last point, @RangerMauve, you'd need the decryption key, but also the CHAMP label of the root, to then be able to read the whole file tree.
The reason I didn't go for this approach is that I wanted it to be easy to share and pin subtrees. If your root is special, you need to generate a new root to share subtrees, and that root won't be pinned by as many people as it would otherwise. With the proposed solution, each subtree can be shared just by sharing the URL, without providing access to any other files or trees of the "entire" share.
Currently, ipfs does not support any sort of encryption in-line. Users can always encrypt content before adding it into ipfs, which is a solid solution (and a good separation of concerns) but it leaves a bit to be desired for people who just want to use ipfs directly.
There are two main ways that we can have encrypted content in ipfs; both encrypt the actual content, but the first is essentially 'encrypt the data, then add it to ipfs'. The second is 'add the data to ipfs, then encrypt each dag node individually'.
Encrypt First
When encrypting the content first, the dag structure is still plaintext. This has several implications, primarily:
Being able to have other peers replicate your content without the encryption keys is a very nice feature to have.
Encrypt Second (or Encrypt the Dag)
To add a bit more secrecy to your data, you can encrypt the raw dag nodes. This definitely prevents anyone without the right keys from gleaning much information about the object in question (does it have links? what is the object type? how big is the whole structure?)
This sort of encryption would be required to encrypt things other than files, like directories, or arbitrary ipld nodes.
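As a rough sketch of what "encrypt each dag node individually" could mean in practice: the whole serialized node (links included) becomes the plaintext, and the stored block is an opaque ciphertext. The JSON node format below is purely illustrative; real nodes would be dag-pb or dag-cbor, and the key management is left out:

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"encoding/json"
	"fmt"
)

// node is an illustrative stand-in for a dag node. The point is that the
// links live *inside* the plaintext, so once the node is encrypted the
// network sees only an opaque blob with no visible structure.
type node struct {
	Data  []byte   `json:"data"`
	Links []string `json:"links"` // child CIDs (plus their keys, in a real scheme)
}

// seal encrypts the serialized node with AES-GCM, prepending the nonce.
func seal(key, plaintext []byte) ([]byte, error) {
	c, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	gcm, err := cipher.NewGCM(c)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	return gcm.Seal(nonce, nonce, plaintext, nil), nil
}

func main() {
	n := node{
		Data:  []byte("directory entry names, sizes, ..."),
		Links: []string{"QmChildOne", "QmChildTwo"},
	}
	plaintext, _ := json.Marshal(n)

	key := make([]byte, 32)
	rand.Read(key) // error handling elided in this sketch
	block, err := seal(key, plaintext)
	if err != nil {
		panic(err)
	}
	// 'block' is what would be added to ipfs; without 'key' nobody can tell
	// it even has links, which is why plain pinning/GC can no longer walk it.
	fmt.Printf("opaque block: %d bytes\n", len(block))
}
```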
UX
So the previous part needs a fair amount of thought, but once we get an idea of how we want to do that, we need a way of actually interacting with encrypted objects. A proposal I have is to adopt something like:

/ipfs/QmFooBarEncrypted#key=1bf72d9ef
Where 1bf72d9ef is the key to decrypt the data referenced by QmFooBarEncrypted.
This is nice because with a browser extension, you could achieve the same effect in a browser without leaking the keys to anyone else (since the hash fragment is never sent to the server).
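A small sketch of the client-side handling this implies: split the key fragment off before anything is fetched, so only the CID path ever goes over the network. The #key= format follows the examples above; the function name is an assumption:

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// splitKeyFragment separates the decryption key from a link of the form
// .../ipfs/<cid>#key=<hex>. Only the part before '#' is ever sent to a
// gateway or resolved over bitswap; the fragment stays with the client.
func splitKeyFragment(link string) (path, key string, err error) {
	u, err := url.Parse(link)
	if err != nil {
		return "", "", err
	}
	key = strings.TrimPrefix(u.Fragment, "key=")
	u.Fragment = ""
	return u.String(), key, nil
}

func main() {
	path, key, err := splitKeyFragment("https://ipfs.io/ipfs/QmFooBarEncrypted#key=1bf72d9ef")
	if err != nil {
		panic(err)
	}
	fmt.Println("fetch:", path)        // goes over the network
	fmt.Println("decrypt with:", key)  // never leaves the client
}
```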
This is definitely a WIP proposal; I'll be updating it as I think through things more and discuss.