Our goal for this blog post is to describe a key-value store with certain properties:
- Persistent on-disk storage of key-value pairs.
- Thread safety, should be `Send + Sync`.
- Lockless reads across threads.
- No single bottleneck lock while writing.
In order to implement all of this in about 400 lines, we give up any mutability of already-written keys. That is, once a key-value pair has been written, it will never change.
We also have to make sure `K` and `V` are `Copy`, i.e. Plain Old Data.
The data structure of choice in this implementation is a hash array mapped trie (or HAMT for short). It can be thought of as a nested hash table.
The nodes of the trie consist of a series of entries. The node size is defined as the number of `Entry` structs that fit into a 4 KiB page (a common memory page size).
We will take a closer look at `Entry` here.
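The exact layout lives in the repository; as a rough sketch (field names, widths, and the two separate checksums here are assumptions for illustration, not the real definition), an `Entry` might look something like this:

```rust
// A minimal sketch, not the repository's exact definition.
#[repr(C)]
#[derive(Clone, Copy)]
struct Entry<K: Copy, V: Copy> {
    key: K,
    val: V,
    /// Page number of the child node, or 0 if none exists yet.
    next: u64,
    /// Checksum over `key` and `val`; an all-zero or half-written entry
    /// fails this check and is treated as vacant.
    kv_checksum: u64,
    /// Checksum over `next`, for the same reason.
    next_checksum: u64,
}

/// Node size: as many entries as fit in one 4 KiB page.
const PAGE_SIZE: usize = 4096;
// entries_per_node would then be PAGE_SIZE / size_of::<Entry<K, V>>()
```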
The first two fields should be self-explanatory, but what about `next` and the checksums?
It turns out that, since we are using memory-mapped files, the initial state of any region of a page is all zeroes. If we did not have the checksums, an entry with `key` and `val` set to zero would be indistinguishable from a vacant entry in the node.
The checksums also do what checksums usually do: they make sure any botched half-write to the underlying medium is likewise treated as vacant.
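To make that concrete, here is a tiny self-contained illustration with `u64` keys and values; the FNV-style fold is an assumption standing in for whatever checksum the real code uses:

```rust
// The checksum of an all-zero entry is never 0, which is what lets us
// tell a stored (0, 0) pair apart from untouched page memory.
fn checksum(words: &[u64]) -> u64 {
    words.iter().fold(0xcbf2_9ce4_8422_2325, |acc, &w| {
        (acc ^ w).wrapping_mul(0x0000_0100_0000_01b3)
    })
}

struct RawEntry {
    key: u64,
    val: u64,
    kv_checksum: u64,
}

impl RawEntry {
    fn is_occupied(&self) -> bool {
        // A freshly mapped all-zero entry fails this test, and so does a
        // torn half-write whose checksum no longer matches its contents.
        self.kv_checksum == checksum(&[self.key, self.val])
    }
}

fn main() {
    // A zeroed region of a page reads as vacant...
    let vacant = RawEntry { key: 0, val: 0, kv_checksum: 0 };
    assert!(!vacant.is_occupied());

    // ...while a genuinely stored (0, 0) pair does not.
    let written = RawEntry { key: 0, val: 0, kv_checksum: checksum(&[0, 0]) };
    assert!(written.is_occupied());
}
```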
In order to insert into the table quickly and without lock contention, the initial search is carried out entirely without locking, using the following procedure.
The root node on page 0 is consulted, the key is hashed, and its slot is calculated. There are three possible outcomes for the corresponding entry (sketched in code after the list):
1: The entry is invalid - just write a new `Entry` in place and end.
2: The `next` attribute is nonzero - recurse into the page it points to.
3: The `next` attribute is zero - create a new page, and point the `next` value in the `Entry` at it.
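Putting the three outcomes together, the walk might look roughly like this; `Store`, its methods, and `hash_key` are hypothetical stand-ins for the repository's actual names, and this only mirrors the outline above:

```rust
const ENTRIES_PER_NODE: u64 = 256; // illustrative; really 4096 / size_of::<Entry>()

struct EntryView { valid: bool, next: u64 }

struct Store;
impl Store {
    fn entry(&self, _page: u64, _slot: u64) -> EntryView { unimplemented!() }
    fn write_entry(&self, _page: u64, _slot: u64, _key: u64, _val: u64) { unimplemented!() } // takes its shard lock internally
    fn set_next(&self, _page: u64, _slot: u64, _next: u64) { unimplemented!() } // likewise locked
    fn allocate_page(&self) -> u64 { unimplemented!() } // the only globally locked path
}
fn hash_key(_key: u64) -> u64 { unimplemented!() }

fn insert(store: &Store, key: u64, val: u64) {
    let mut page = 0; // the root node lives on page 0
    let mut hash = hash_key(key);
    loop {
        let slot = hash % ENTRIES_PER_NODE;
        let entry = store.entry(page, slot);
        if !entry.valid {
            // 1: vacant (or torn) entry - write the new Entry here and stop
            store.write_entry(page, slot, key, val);
            return;
        }
        if entry.next != 0 {
            // 2: `next` is nonzero - descend into the page it points to
            page = entry.next;
        } else {
            // 3: `next` is zero - create a new page and point `next` at it
            let new_page = store.allocate_page();
            store.set_next(page, slot, new_page);
            page = new_page;
        }
        // use different hash bits at the next level so the key lands in a
        // different slot (how the real code does this is glossed over here)
        hash = hash.rotate_right(8);
    }
}
```

Note that the search itself never locks; only the calls that actually modify an entry or allocate a page do.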
When do we need to take locks on writing?
In a relatively small number of cases a new page has to be created. For this, a lock must be taken to update the page counter and, in some cases, to create a new memory-mapped file. The majority of writes never touch this path.
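A sketch of that slow path, with a hypothetical `PageAllocator`: only the page counter (and, when a map fills up, the creation of the next memory-mapped file) sits behind the lock.

```rust
use std::sync::Mutex;

struct PageAllocator {
    counter: Mutex<u64>, // next free page number
}

impl PageAllocator {
    fn allocate_page(&self) -> u64 {
        let mut counter = self.counter.lock().unwrap();
        let page = *counter;
        *counter += 1;
        // If `page` lies beyond the currently mapped files, a new
        // memory-mapped file would be created here, still under the lock.
        page
    }
}
```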
When an entry is modified, either by assigning it a key-value pair or by updating its `next` pointer, a corresponding lock must be taken.
But which lock? Having one single lock for all writes would be a bottleneck, but keeping individual locks for each page is unwieldy.
To solve this problem we use a static array of mutexes like this:
`static SHARDS: [Mutex<()>; 1024] = Default::default()`
(if only we had const generics, that is)
We then calculate a value from the page and slot offset of the entry, take that modulo 1024, and lock that mutex.
This way there is only a small chance of two simultaneous writes landing in buckets with the same shard index, so the bottleneck is avoided while safety is maintained.
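Spelled out, that might look like the sketch below. On a current toolchain the shard array can be seeded from a const item, and the particular mixing of page and slot is an assumption:

```rust
use std::sync::{Mutex, MutexGuard};

// A const item can seed a static array even for a non-Copy type like Mutex.
const SHARD: Mutex<()> = Mutex::new(());
static SHARDS: [Mutex<()>; 1024] = [SHARD; 1024];

/// Hypothetical shard selection: mix the page and slot offsets, reduce
/// modulo the shard count, and lock that mutex for the duration of the write.
fn lock_shard(page: usize, slot: usize) -> MutexGuard<'static, ()> {
    let index = page.wrapping_mul(31).wrapping_add(slot) % SHARDS.len();
    SHARDS[index].lock().unwrap()
}

fn main() {
    // Two writes only contend if they map to the same shard index.
    let _guard = lock_shard(7, 3);
    // ... modify the entry at page 7, slot 3 while the guard is held ...
}
```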
For the on-disk storage we're using memory-mapped files. Let's have a look at their layout.
If the first memory map fits 32 pages, the second fits 64, and so on, doubling each time. This way we can make do with a finite number of mmaps and still not run out of space anytime soon.
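Under that doubling scheme, a global page number can be turned into a (map index, offset-within-map) pair with a small loop; a sketch, assuming the sizes quoted above:

```rust
// Map 0 holds 32 pages, map 1 holds 64, map 2 holds 128, and so on.
const FIRST_MAP_PAGES: u64 = 32;

fn locate(page: u64) -> (usize, u64) {
    let mut map_index = 0;
    let mut map_pages = FIRST_MAP_PAGES;
    let mut start = 0; // first global page number covered by this map
    while page >= start + map_pages {
        start += map_pages;
        map_pages *= 2;
        map_index += 1;
    }
    (map_index, page - start)
}

fn main() {
    assert_eq!(locate(0), (0, 0));   // first page of the first map
    assert_eq!(locate(32), (1, 0));  // first page of the second map
    assert_eq!(locate(100), (2, 4)); // 32 + 64 = 96 pages precede map 2
}
```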
For stress testing, one of the tests spawns 8 writer and 8 reader threads and simultaneously writes and reads values in random order. I might consider more advanced test methods, but for now it's a proof of concept.
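In the same spirit, such a test might be sketched like this, with `Store::insert` and `Store::get` as hypothetical stand-ins for the real API:

```rust
use std::sync::Arc;
use std::thread;

// Hypothetical stand-in for the real store API.
struct Store;
impl Store {
    fn insert(&self, _key: u64, _val: u64) { /* ... */ }
    fn get(&self, _key: u64) -> Option<u64> { None }
}

// 8 writer and 8 reader threads hammering the store at the same time.
fn stress(store: Arc<Store>) {
    let mut handles = Vec::new();
    for t in 0..8u64 {
        let writer_store = Arc::clone(&store);
        handles.push(thread::spawn(move || {
            for i in 0..10_000u64 {
                writer_store.insert(t * 1_000_000 + i, i); // keys are written once, never changed
            }
        }));
        let reader_store = Arc::clone(&store);
        handles.push(thread::spawn(move || {
            for i in 0..10_000u64 {
                // reads never take a lock; a miss just means the writer
                // has not reached that key yet
                let _ = reader_store.get(t * 1_000_000 + i);
            }
        }));
    }
    for handle in handles {
        handle.join().unwrap();
    }
}

fn main() {
    stress(Arc::new(Store));
}
```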
Check out the docs
Check out the code in this repository for the concrete implementation.
For comments: Reddit post
Kristoffer Ström kristoffer@rymdkoloni.se