It looks like the file formats have a lot of redundancy.
For example, every Record, Tombstone, and Index entry has an individual crc32, a version byte, and a 4-byte record size.
Let's take IndexFileEntry as an example:
```java
/**
 * checksum - 4 bytes.
 * version - 1 byte.
 * key size - 1 byte.
 * record size - 4 bytes.
 * record offset - 4 bytes.
 * sequence number - 8 bytes.
 */
```
That's 22 bytes of fixed overhead per entry, before the key itself.
A few things come to mind.
The file could have a header with the version number, since it is identical for all entries. Since the file is only read sequentially, and truncated at the first corrupted item found, the header could contain the first sequenceNumber as well, and subsequent values could be stored as deltas relative to it using a variable-length encoding. RecordOffset is similar: the values are monotonically increasing and could be delta encoded with a variable-length integer.
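The delta-plus-varint idea might look something like the sketch below (class and method names are hypothetical, not HaloDB code). It uses a LEB128-style unsigned varint, 7 payload bits per byte with the high bit as a continuation flag, so small deltas between consecutive offsets usually fit in 1-2 bytes instead of a fixed 4:

```java
import java.io.ByteArrayOutputStream;

// Sketch: delta + varint encoding for monotonically increasing values
// such as recordOffset or sequenceNumber. Not HaloDB's actual format.
public class VarintDelta {

    // LEB128-style unsigned varint: low 7 bits per byte, high bit set
    // on every byte except the last.
    static void writeVarint(ByteArrayOutputStream out, long v) {
        while ((v & ~0x7FL) != 0) {
            out.write((int) ((v & 0x7F) | 0x80));
            v >>>= 7;
        }
        out.write((int) v);
    }

    static byte[] encode(long[] values) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeVarint(out, values[0]);  // first value absolute (could live in the file header)
        for (int i = 1; i < values.length; i++) {
            writeVarint(out, values[i] - values[i - 1]);  // monotonic, so deltas are non-negative
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        long[] offsets = {0, 1024, 2100, 4096, 65536};
        // 10 bytes here, versus 5 * 4 = 20 bytes fixed-width.
        System.out.println(encode(offsets).length);
    }
}
```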
As for the checksum, it could be written for small 'blocks' rather than for each record. This would also accelerate recovery from a crash, as each block could be something like: (2-byte size, 8-byte xxHash checksum, size bytes of index entries). Validating the file would then only need to go a block at a time until it fails. As long as a block had at least 3 entries, it would save space. I suspect something like flushing a block every ~32 entries or 2 KB (whichever comes first) would work well -- roughly 9% as many bytes spent on checksums, but chunks small enough that they shouldn't significantly increase the chance of data failing to reach disk before a crash.
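A minimal sketch of that block framing and the truncate-at-first-bad-block recovery scan. Since xxHash would need a third-party library, `java.util.zip.CRC32` stands in as the checksum here; switching to an 8-byte xxHash only changes the field width. The class and method names are hypothetical:

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

// Sketch: per-block checksums instead of per-entry checksums.
// Block frame: 2-byte payload size | 4-byte checksum | payload bytes.
public class BlockChecksum {

    static byte[] frameBlock(byte[] payload) {
        CRC32 crc = new CRC32();  // stand-in for an 8-byte xxHash
        crc.update(payload, 0, payload.length);
        ByteBuffer buf = ByteBuffer.allocate(2 + 4 + payload.length);
        buf.putShort((short) payload.length);
        buf.putInt((int) crc.getValue());
        buf.put(payload);
        return buf.array();
    }

    // Recovery scan: walk block by block, stop at the first checksum
    // mismatch or short read, and report how many bytes are valid.
    static int validBytes(byte[] file) {
        ByteBuffer buf = ByteBuffer.wrap(file);
        int good = 0;
        while (buf.remaining() >= 6) {
            int size = buf.getShort() & 0xFFFF;
            int stored = buf.getInt();
            if (buf.remaining() < size) break;       // truncated block
            byte[] payload = new byte[size];
            buf.get(payload);
            CRC32 crc = new CRC32();
            crc.update(payload, 0, payload.length);
            if ((int) crc.getValue() != stored) break;  // corrupted block
            good = buf.position();                   // this block is valid
        }
        return good;                                 // truncate the file here
    }

    public static void main(String[] args) {
        byte[] b1 = frameBlock(new byte[]{1, 2, 3, 4});
        byte[] b2 = frameBlock(new byte[]{5, 6, 7, 8});
        byte[] file = new byte[b1.length + b2.length];
        System.arraycopy(b1, 0, file, 0, b1.length);
        System.arraycopy(b2, 0, file, b1.length, b2.length);
        file[file.length - 1] ^= 0xFF;               // corrupt the second block
        // Only the first block survives validation.
        System.out.println(validBytes(file) == b1.length);
    }
}
```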
Also, unless hardware accelerated, crc32 is much slower than xxHash, and as a 32-bit checksum it is more prone to collisions than a 64-bit xxHash.