Deduplication #11

Given the intended security properties, how do you feel about data deduplication?
Would the ORAM implementation help with the information leakage inherent to convergent encryption?

Comments
I think that data de-duplication between many users who are unaware of each other probably leaks too much, just on principle. But I also don't think it's generally possible to combine ORAM with storage-provider deduplication in the way you're thinking, primarily because ORAM relies on being able to read a block of memory and then re-write the same thing, such that the provider can't tell that the same content was just written back. This is done by re-encrypting with a new random nonce. So I think the actual solution to de-duplication (within a single UtahFS archive) wouldn't be cryptographic at all: you would add a content-based addressing layer "on top" of UtahFS. Which would be cool, but it doesn't increase or decrease the value proposition of ORAM.
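A minimal sketch of that point, assuming a standard AES-GCM setup (illustrative only, not UtahFS's actual code): re-encrypting the same plaintext block under a fresh random nonce yields an unrelated ciphertext every time, so identical content never matches on the wire and provider-side dedup has nothing to work with.

```go
package main

import (
	"bytes"
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"fmt"
)

// sealBlock re-encrypts a block under a fresh random nonce, the way an
// ORAM re-write would. The same plaintext produces an unrelated
// ciphertext every time, so the provider cannot match blocks for dedup.
func sealBlock(key, plaintext []byte) ([]byte, error) {
	c, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	gcm, err := cipher.NewGCM(c)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	// Prepend the nonce so the block can be decrypted later.
	return gcm.Seal(nonce, nonce, plaintext, nil), nil
}

func main() {
	key := make([]byte, 32) // AES-256 key
	rand.Read(key)
	data := []byte("the same block, written straight back")

	c1, _ := sealBlock(key, data)
	c2, _ := sealBlock(key, data)
	fmt.Println(bytes.Equal(c1, c2)) // false: re-written content is unlinkable
}
```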
I think it was meant to do deduplication within UtahFS. Regardless, I think this feature would be useful, especially for those not using ORAM, to save money on the S3 backend.
If by storage-provider deduplication you mean between many users who are unaware of each other, that was not my intention. That's my mistake: I led the conversation there by mentioning convergent encryption, which on second thought doesn't make any sense here. Sorry for adding noise. I meant a single user that stores heavily duplicated data in their own volume, which, as you've said, could maybe be implemented "on top of" UtahFS? It seems to me that if it is possible to marry deduplication in that scenario with the current properties of UtahFS, ORAM will help further hide the redundancies in the data. No? Again, sorry for wasting bandwidth if this is just a stupid question. Feel free to close!
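A sketch of what such a single-user dedup layer "on top" of UtahFS might look like (purely hypothetical; `BlockStore`, `DedupFS`, and all names here are illustrative, not the UtahFS API): the layer keeps a content-hash index as ordinary data inside the encrypted volume and writes each unique block through the underlying store only once.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// BlockStore stands in for the encrypted (optionally ORAM-backed)
// storage underneath; it knows nothing about deduplication.
type BlockStore interface {
	Put(id string, data []byte)
	Get(id string) []byte
}

// memStore is a toy in-memory backend for the example.
type memStore map[string][]byte

func (m memStore) Put(id string, data []byte) { m[id] = data }
func (m memStore) Get(id string) []byte       { return m[id] }

// DedupFS deduplicates above the encryption layer. The index maps
// content hashes to block IDs and would itself live inside the
// encrypted volume, so the provider never sees it.
type DedupFS struct {
	store BlockStore
	index map[[32]byte]string
	next  int
}

func NewDedupFS(s BlockStore) *DedupFS {
	return &DedupFS{store: s, index: make(map[[32]byte]string)}
}

// Write stores each unique content once and returns its block ID.
func (d *DedupFS) Write(data []byte) string {
	h := sha256.Sum256(data)
	if id, ok := d.index[h]; ok {
		return id // duplicate content: reuse the existing block
	}
	id := fmt.Sprintf("block-%d", d.next)
	d.next++
	d.store.Put(id, data) // the real store would encrypt here
	d.index[h] = id
	return id
}

func main() {
	fs := NewDedupFS(memStore{})
	a := fs.Write([]byte("same payload"))
	b := fs.Write([]byte("same payload"))
	fmt.Println(a == b) // true: stored only once
}
```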
ORAM will hide "what you store", even though I think that if UtahFS gets higher adoption it will be quite obvious that you use it, as it might be the only thing that uses ORAM... But at least for me that would be acceptable for storing backups. I won't be using ORAM anyway, as that would get really expensive really fast.
The point is not necessarily to hide that you're using UtahFS (is that even a goal for the project?), it's to hide what kind of data you're storing in your volume. Simply encrypting content and/or metadata is often not enough. You can encrypt those and still leak whether you're backing up photos or movies, vs. PDFs, logs, git repos, VM snapshots, etc. Access patterns will be very different for each of those, depending on the size of files and whether they are write-once-read-many, write-once-read-never, write-many, etc. ORAM is supposed to help here. My question is whether ORAM might also help hide access patterns related to deduplication of heavily duplicated data.
This is exactly what I was trying to say. It will have the same consequences as any deterministic encryption scheme (with ORAM disabled). With ORAM enabled it is slightly better, as it rules out most known-plaintext attacks, with one notable exception: the total size of the bucket will not change if the attacker-provided plaintext is already within the bucket. Therefore, if an attacker is able to provide the plaintext and monitor the total bucket size, it will inherently leak the fact that that file exists (assuming an only-growing bucket). This could in turn be mitigated if delete operations also happen (e.g. archive mode is not used), because then there will be unused blocks that UtahFS can reuse, so the bucket does not always grow when a new file is added.
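A toy demonstration of that size side channel (deliberately simplified, with hypothetical names): under deduplication, storing a plaintext the bucket already contains leaves the total size unchanged, and that unchanged size is exactly the signal available to an attacker who can inject plaintext and watch the bucket.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// bucket models the provider's view: it cannot read blocks, but it
// can always observe the total number of bytes stored.
type bucket struct {
	blocks map[[32]byte]int // content hash -> stored size
	total  int
}

// store deduplicates by content hash, as a dedup layer would.
func (b *bucket) store(plaintext []byte) {
	h := sha256.Sum256(plaintext)
	if _, ok := b.blocks[h]; ok {
		return // already stored: total size does not change
	}
	b.blocks[h] = len(plaintext)
	b.total += len(plaintext)
}

func main() {
	b := &bucket{blocks: make(map[[32]byte]int)}
	b.store([]byte("file the victim already backs up"))

	before := b.total
	// The attacker gets this plaintext written (e.g. via a web app
	// that uses the volume as its storage backend)...
	b.store([]byte("file the victim already backs up"))
	// ...and compares the observable bucket size afterwards.
	if b.total == before {
		fmt.Println("bucket did not grow: the file was already there")
	}
}
```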
Either I'm missing something, or we're digressing. If it's a single-user volume, and deduplication is only applied to that single user's data within that encrypted volume, how would an attacker provide plaintext that is already within the bucket? I know I'm entirely to blame for the confusion by mentioning convergent encryption; that makes zero sense here. But I'm definitely not interested in helping the storage provider do deduplication between mine and others' data (and leaking information that way). I'm interested in saving myself money when I store heavily duplicated data, while still hiding the fact that that is (or isn't) the kind of data I'm handling.
I think we're clearly talking past each other. I'll try one last time.
Just because it is a single bucket does not mean that an attacker could not potentially provide plaintext, e.g. if you use it as the storage backend for your web application, or similar. If you use it solely as a "dropbox" replacement, that threat is out of scope for you.
As long as you do not have attacker-provided data and also do not use archive mode, I do not see a big problem.
I see. Finally got it, thanks for taking the time to explain. I wouldn't use this as a storage backend for user-provided data, but that's a good point. In that case, I don't think it's common that a single user is (supposed to be) able to observe the total size of the bucket. So, as you've said, ORAM helps. One important caveat is when the storage provider is (or colludes with) the active attacker.
You won't, but others might.
Open Grafana or Prometheus instances are a thing (sometimes intentionally, like at https://freifunk.fail/ ), so your storage provider does not necessarily need to contribute...