WERVD Storage

Technical documentation

FIXMES

This encryption needs to be optional; not everyone weighs security against usability / portability / reliability the same way. (bnvk)
If optional, making the unencrypted data-store compatible with Maildir or some other popular mailbox format would be optimal. (bre)

Rationale

Mailpile stores e-mail and a large amount of other sensitive data. One of the design goals of the project is to encrypt all the data we store to disk, in order to mitigate privacy leaks in the event that hardware is physically stolen or lost.

Due to our goal of providing robust security to non-technical users, we do not consider it sufficient to delegate this security to the operating system's full-disk encryption, as that encryption is not guaranteed to be active or even available in many cases.

E-mail is largely a "write once, read many" medium, although there is also a need to be able to delete messages.

Encrypting all the data, although good for privacy, has the downside of making the data more brittle. A single flipped bit can render an entire message useless and unparsable. However, storage is relatively cheap and e-mail relatively small by modern standards, making it feasible to improve robustness simply by duplicating data. Simple duplication is more flexible than error correction codes, in that it allows duplicates to reside on physically separate media when possible.

Finally, there is a need to be able to store data outside the normal filesystem (e.g. in an IMAP server or other database), or to replicate it from one system to another. Storing individual messages as uniquely named plain-text ASCII files, which are considered immutable until they are deleted, greatly simplifies both of these use cases.

WERVD Capabilities

The Mailpile encrypted data store provides the following capabilities:

W - write once
E - encrypt data
R - read many
V - verify integrity (detect corruption & recover)
D - delete

Capabilities NOT provided include append and update.

The verification strategy allows for multiple copies of data to be written to disk and automatic detection and recovery from corruption.

WERVD Implementation

Encryption and File Format

The encryption used is AES-256-CBC, with a per-file nonce. The file format embeds the encryption scheme, so this can be changed at a later date. The file content is stored as base64 data (base64 encoded after encryption), with a simple header and footer delimiting the beginning and end of the data, as follows:

-----BEGIN MAILPILE ENCRYPTED DATA-----
cipher: aes-256-cbc
nonce: NONCE-STRING

...BASE64-ENCODED-DATA...
-----END MAILPILE ENCRYPTED DATA-----

The order of the cipher and nonce headers is not considered important, and other headers following the same RFC822-inspired format may be added later.

Newlines may be either CRLF or LF sequences, and the trailing newline is considered part of both the beginning and ending delimiters.
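To make the container format concrete, here is a minimal sketch that wraps already-encrypted bytes in the header, blank line, base64 body, and footer shown above, and parses them back. It deliberately leaves out the AES step itself. The function names `wrap` and `unwrap` are hypothetical, and the 76-column base64 line width is an assumption, since the format description does not specify one.

```python
import base64

BEGIN = "-----BEGIN MAILPILE ENCRYPTED DATA-----"
END = "-----END MAILPILE ENCRYPTED DATA-----"


def wrap(ciphertext: bytes, nonce: str, cipher: str = "aes-256-cbc") -> str:
    """Serialize already-encrypted bytes into the delimited text format."""
    # encodebytes() emits 76-column lines, each ending in LF (assumed width).
    body = base64.encodebytes(ciphertext).decode("ascii")
    return f"{BEGIN}\ncipher: {cipher}\nnonce: {nonce}\n\n{body}{END}\n"


def unwrap(text: str):
    """Return (headers, ciphertext); accepts CRLF or LF newlines."""
    lines = text.replace("\r\n", "\n").split("\n")
    start = lines.index(BEGIN)
    end = lines.index(END, start)
    headers = {}
    i = start + 1
    # Headers run until the first blank line; their order does not matter.
    while i < end and lines[i].strip():
        name, _, value = lines[i].partition(":")
        headers[name.strip().lower()] = value.strip()
        i += 1
    ciphertext = base64.b64decode("".join(lines[i:end]))
    return headers, ciphertext
```

Reading the headers into a dictionary keeps the parser indifferent to header order and tolerant of headers added later, which matches the intent stated above.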

Verification IDs

Verification IDs serve the dual purpose of uniquely identifying a file within the WERVD store and verifying that its contents are not corrupt.

Verification IDs are generated by calculating an MD5 message digest over the entire encrypted and encoded message, including both delimiters and their trailing newlines. During digest calculation, all newlines should be encoded as CRLF pairs.

This MD5 sum is then encoded using a web-safe variant of BASE64 (all whitespace stripped and any + characters replaced with -), and the end-result is used as the canonical identifier for this data in the Mailpile WERVD data-store.

These verification identifiers may be truncated depending on the uniqueness and robustness requirements of the calling code.
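The following sketch shows one way the ID computation described above could look. The function names and the optional `length` parameter are hypothetical. Only the transformations named in the text are applied (whitespace stripped, `+` replaced with `-`); how `/` and the trailing `=` padding should be treated is not stated here, so the sketch leaves them untouched.

```python
import base64
import hashlib


def web_safe_b64(raw: bytes) -> str:
    """Base64 with whitespace stripped and '+' replaced by '-'."""
    encoded = base64.b64encode(raw).decode("ascii")
    return "".join(encoded.split()).replace("+", "-")


def verification_id(block: str, length=None) -> str:
    """Compute the ID of a complete encrypted block, delimiters included."""
    # Normalize every newline to CRLF so LF and CRLF files get the same ID.
    canonical = block.replace("\r\n", "\n").replace("\n", "\r\n")
    digest = hashlib.md5(canonical.encode("ascii")).digest()
    vid = web_safe_b64(digest)
    # Optional truncation, as permitted for callers with looser requirements.
    return vid[:length] if length else vid
```

Normalizing newlines before hashing means a file that was converted between LF and CRLF in transit still verifies, while any change to the headers or the encoded body changes the ID.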

File-system Mapping & Duplication

Mapping the data's Verification ID to N duplicate file paths in the filesystem is done as follows:

1. Calculate an MD5 digest of the concatenation of the verification ID
   and the decimal string representation of the replica number.
2. Generate a web-safe base64 encoding of the MD5 digest.
3. Truncate the web-safe base64 encoding to the desired length,
   generally 10 characters.
4. Use the first two characters of the truncated base64 as a
   directory name and the remaining characters as a file-name within
   that folder.
5. The replica number may be used to select alternate root paths for
   the storage tree, facilitating duplication to alternate physical
   media.
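The steps above can be sketched as follows. This is an illustration under stated assumptions, not Mailpile's actual code: `replica_path` and its parameters are hypothetical names, replica numbers are assumed to start at 0, roots are chosen by taking the replica number modulo the number of roots, and `/` is additionally replaced with `_` so the encoded digest is safe to use as a file name.

```python
import base64
import hashlib
import os


def replica_path(vid: str, replica: int, roots, length: int = 10) -> str:
    """Map a verification ID and replica number to a file path."""
    # Step 1: MD5 over the ID concatenated with the decimal replica number.
    digest = hashlib.md5((vid + str(replica)).encode("ascii")).digest()
    # Step 2: web-safe base64 ('/' -> '_' is an assumption for path safety).
    name = base64.b64encode(digest).decode("ascii")
    name = name.replace("+", "-").replace("/", "_")
    # Step 3: truncate to the desired length.
    name = name[:length]
    # Step 5: pick a storage root from the replica number (assumed policy).
    root = roots[replica % len(roots)]
    # Step 4: first two characters form the directory, the rest the file name.
    return os.path.join(root, name[:2], name[2:])
```

For example, `[replica_path(vid, n, ["/mnt/disk1", "/mnt/disk2"]) for n in range(2)]` would place the two copies under different roots, so they can live on separate physical media. The root paths here are placeholders.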

Implementation notes

In practice, Verification IDs and file system paths may be generated together so the truncation of the Verification ID can be adjusted in order to avoid name collisions in the file-system.
The operating system's native hard link mechanism must NOT be used when generating duplicates, otherwise no fault tolerance will be gained.
Mailpile will use the Verification ID as the "PTR" in the metadata index.
