A fast key-value store with ACID guarantees written in pure Rust. Heavily inspired by Badger.
This is a work in progress. Use it at your own risk.
The database does not yet support durability. That means your data will be lost after a restart. You also currently have to manually delete the data from the directory. There is also no shapshot creation mechanism yet, which means outdated data is never deleted. Finally there is currently no way to delete a certain key from the database.
Everything else is working correctly and can in theory already be used.
The implementation closely follows the way Badger got implemented, as the Dgraph team explained in their blog post. Given so Horst's transactions offer the some ACID guarantees as Badger does.
We maintain the list of pending commits in a sorted Vec
containing (commit_ts, finished)
elements. If a commit finishes successfully finished
gets set to true
, if it fails the entry gets removed. Once all transactions up to a certain commit_ts
have been marked as finished, they all get removed from the Vec
and the oracle's read_ts
gets increased to that commit_ts
.
The value log is implemented in such a way that each VLog
file contains solely the transactions corresponding to a certain interval of commit timestamps which is disjunct with the intervals of all other VLog
files. This ensures we can easily detect after the creation of a snapshot which VLog
files can be deleted and which must be preserved.
The LSM tree uses a similiar system. Each Slice
(or 'rune' as it is called on Wikipedia) contains solely the key-value pairs corresponding to a distinct interval of commit timestamps. The Slice
s are in that sense sorted, with older Slice
s from higher levels containing solely older values and newer Slice
s containing solely newer values. This ensures that once we find the first key-value pair with a commit_ts
lower then the transaction's read_ts
, we can be sure the older Slice
s will not contain a newer value and end the search early.
Contrary to Badger which uses byte slices of arbitrary length as keys, we use fixed sized keys of type u128
. You can still use others keys, but you would have to convert them to u128
before adding them to the database. You should also make sure there are no key collisions. We will likely add some interface and utility functionalites to assist with that.
- Oracle
- Implement a seperate task which manages the oracle.
- Let the oracle keep track of commit timestamps for all keys.
- We use a sorted
Vec
for quick search capabilites and to keep the memory usage low. - A
BTreeMap
gets used for quickly adding new key/commit_ts pairs. - If the
BTreeMap
gets too large, we merge it with the mainVec
store. - If a certain commit updates too many keys, it gets directly merged into the main
Vec
store. - Upon conflict detection, both stores are checked for
commit_ts
higher then the transactionsread_ts
.
- We use a sorted
- Let the oracle keep track of pending transaction commits.
- Successfully commited transactions get saved in a
Vec
with both thecommit_ts
and afinished
flag defaulting to false. - If a transaction successfully completes a commit, it informs the oracle which will set the
finished
flag tofalse
. - If a transaction fails, the oracle gets also informed and the corresponding entry gets deleted from the store.
- After each successfull commit the oracle removes all entries starting from the beginning which consecutively(!) have the
finished
flag set to true. The oracle'sread_ts
gets updated to the highestcommit_ts
where there are no pending commits with a lowercommit_ts
.
- Successfully commited transactions get saved in a
- Provide the interface for letting other tasks and transactions communicate with the oracle.
- Allow asking for the latest
read_ts
. - Allow asking for commiting a transaction.
- Send over all keys read or set by the transaction.
- Perform conflict detection by ensuring none of the keys read or set got modified.
- Allow asking for the latest
- Value log
- Implement a seperate task which manages the value log.
- Catch incomming transactions and write them in batches for better write performance.
- Implement a buffer for incomming transactions.
- Empty the buffer at a configuable interval.
- Empty the buffer if a configurable size gets exceeded.
- Make sure the transactions are given to the correct
VLog
s (s.b.).
- Add a cache for recently read values for better performance.
- Implement the management for opening mulitple
VLog
files. - Allow creating
VLog
files in multiple directories, e.g. on different harddrives. - Open multiple
VLog
files.- Implement a seperate taks which manages each Individual
VLog
file. - Allow for concurrent reads from different
VLog
files.
- Implement a seperate taks which manages each Individual
- Make each
VLog
file contain the transactions corresponding to a certain interval ofcommit_ts
which is disjunct with all otherVLog
s.- Let the header of each
VLog
file maintain the upper and lower bound (inclusive) of the commit timestamps contained therein. - The first
VLog
's lower bound gets initialized to zero. - Upon writing transactions to a
VLog
, maintain an index of the highestcommit_ts
seen so far on thisVLog
. - Upon closure of the
VLog
, update the headers upper bound to the highest seencommit_ts
. Do also save the upper bounds fixed flag (s.b.). - If a
VLog
exceeds a certain size limit, create a newVLog
.- The last
VLog
's upper bound gets marked as fixed. - The new
VLog
's lower bound gets set as the lastVLog
's upper bound +1.
- The last
- Make sure that all new transactions get written to the correct
VLog
.
- Let the header of each
- Log-Structure-Merge (LSM) tree
- Implement a seperate task which manages the LSM tree.
- Implement a seperate task for each
Slice
and the level 0 storage. - Implement level 0 (RAM storage).
- Store new key-value-pairs in a
BTreeMap
. - Regularely merge the new keys into a sorted
Vec
for better memory usage. - Do also trigger a merge whenever a configurable size gets exceeded.
- Implement a configurable maximum total number of entries in level 0.
- All entries get merged into the sorted
Vec
before further processing. - Level 0 asks the oracle for the latest
read_ts
and splits its currently saved entries into those with acommit_ts
above thisread_ts
, and those whosecommit_ts
is below or equal to it. - Those with a
commit_ts
above theread_ts
stay inside level 0, the rest get added to level 1.
- All entries get merged into the sorted
- Store new key-value-pairs in a
- Implement higher levels (Disk storage).
- Implement
Slice
s which are immutable. - Ensure we can efficiently read
Slice
s from disk.- Make use of mmap. (memmap)
- Implement a pure rust alternative.
- Implement a caching file reader.
- Solve bugs at certain corner edge cases.
- Make sure we can efficiently search for the latest value (with respect to a certain
read_ts
) of a certain key.- Make sure each
Slice
contains a distinct interval of commit timestamps. - Implement a properly working binary search.
- Different
Slice
s can be searched concurrently.
- Make sure each
- Implement
- Durability
- Add a proper clean shutdown mechanism.
- Add a flag to all file headers indicating if the file got properly closed.
- Add a header entry to
VLog
files taking note of the last known fully commitedcommit_ts
.
- Reopen all files on restart.
- Reopen and add all
VLog
files on restart. - Add some validation to the
VLog
files header parsing to detect possible contract violations or similiar inconsistancies. - Reopen and add all
Slice
files on restart. - Add some validation to the
Slice
files header parsing to detect possible contract violations or similiar inconsistancies. - Detect not fully written transactions and truncate the corresponding
VLog
files. - Detect not fully commited transactions and update the
Slice
s as needed.
- Reopen and add all
- Add a proper clean shutdown mechanism.
- Database functions
- Implement a snapshot creation function, so outdated data gets removed.
- Add a function for manually triggering the creation of a snapshot.
- Define and implement other criteria (time interval, database size) for automatic snapshot creation.
- Implement a seperate task which manages shapshot creations.
- Make sure it returns an appropriate error if a snapshot creation is requested even trough one is currently already being created.
- Let it ask the LSM tree for handles to all
Slice
s saved at level 1 and above plus the current snapshotSlice
, if any.- This will also give it access to the highest
commit_ts
currently saved on disk in the LSM tree. - Note that by construction that this
commit_ts
is lower or equal to the oracle'sread_ts
. In particular we can be sure there are no pending or future transactions which would belong into this snapshot.
- This will also give it access to the highest
- Perform a merge of all
Slice
s.- This will exclude all outdated values in the newly created snapshot
Slice
. - This
Slice
is only temporary, since it points to the values of the not yet mergedVLog
s.
- This will exclude all outdated values in the newly created snapshot
- Give a handle of the new temporary snapshot
Slice
to the value log and ask it to create a snapshotVLog
containing all values referenced by theSlice
.- Give that
VLog
asnapshot
andfinished
flag since it will not contain a regular transaction and can not know its total size in advance. - Create a new snapshot
Slice
file. - Iterate over the keys in the temporary snapshot
Slice
, read the values from the oldVLog
files, add them to the new snapshotVLog
and add a corresponding reference to the new snapshotSlice
file. - Once done, mark the new
VLog
file asfinished
,delete the temporary snapshotSlice
and give the new snapshotSlice
to the snapshot task.
- Give that
- Give the new snapshot
Slice
over to the LSM tree together with a list of theSlice
s which got merged into it.- Let the LSM tree delete all new outdated
Slice
s (including a possible old snapshot one) and add the new snapshotSlice
. - Once done inform the snapshot task of it.
- Let the LSM tree delete all new outdated
- Tell the value log to switch to the new snaphost
VLog
and delete allVLog
files (including a possible old snapshot one) which contain solely values withcommit_ts
lower or equal to the snapshots maxcommit_ts
.
- Implement a delete key function by adding a corresponding flag.
- Implement an async key range iteration function. (Usefull for prefix scanning)
- Implement key conversion functionalites.
- Add an interface for using abritrary keys (probably trough a trait).
- Implement a utility structure which detects and prevents key collisions.
- Implement a snapshot creation function, so outdated data gets removed.