
HeftyDB is a persistent, sorted, key-value store library for the JVM. This document provides an overview of its internals.

Highlights

Log Structured Merge Trees

Writes are first done to an in-memory skiplist, and are periodically written to disk in batches when the skiplist fills up. Reads check each of these tables to find the requested key, which can become expensive when there are a large number of tables. To reduce this cost, a background compaction task periodically combines these serialized tables in sorted order into larger tables according to a predefined strategy.

Other databases based on a similar design include Cassandra, LevelDB, and WiredTiger. See this paper for more on log structured merge trees.

B+Tree Tables

Instead of the simpler SSTable design used by LevelDB, HeftyDB writes full B+Tree tables with the index in a separate file. As the index and data blocks are written incrementally, very large tables can be merged together without using large amounts of memory.

Snapshotted Multi-version Concurrency Control

Reads and range scans never block writes and always see a consistent snapshot of the database as of the beginning of the read / scan request.

Multi-threaded Compactions

Batched table writes and compaction tasks can run on multiple CPU cores in order to use as much of the system's I/O capacity as possible.

Database Files

The following types of files make up a HeftyDB database. In general, the file formats described below are designed for code simplicity over absolute storage efficiency.

Commit Log File

Each record is written to a commit log file that corresponds to the current memory table. Commit log file names are of the form table_id.log. The commit log file is essentially a singly linked list of records that consist of:

  1. Key Data
  2. Value Data
  3. A deterministically pseudo-random integer that is used as an efficient check on file integrity.

The first record is preceded by a long integer that is the seed used by the pseudo-random number generator.
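
As a concrete illustration, here is a minimal sketch of appending a record in this layout. The `CommitLogWriterSketch` class, the int length prefixes, the `java.util.Random`-based integrity sequence, and the method names are all assumptions made for the example; only the overall record layout (key data, value data, integrity integer, with the seed written before the first record) comes from the description above.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Random;

// Illustrative sketch only: length prefixes and the PRNG choice are assumptions.
final class CommitLogWriterSketch implements AutoCloseable {

    private final FileChannel channel;
    private final Random prng;   // deterministic sequence derived from the seed

    CommitLogWriterSketch(Path logFile, long seed) throws IOException {
        this.channel = FileChannel.open(logFile,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.APPEND);
        this.prng = new Random(seed);

        // The first record is preceded by the PRNG seed, so recovery can
        // regenerate the same integrity sequence while replaying the log.
        ByteBuffer seedBuffer = ByteBuffer.allocate(Long.BYTES);
        seedBuffer.putLong(seed);
        seedBuffer.flip();
        channel.write(seedBuffer);
    }

    void append(byte[] key, byte[] value, boolean fsync) throws IOException {
        int integrityCheck = prng.nextInt();   // next value in the deterministic sequence

        ByteBuffer record = ByteBuffer.allocate(
                Integer.BYTES + key.length + Integer.BYTES + value.length + Integer.BYTES);
        record.putInt(key.length).put(key);      // key data
        record.putInt(value.length).put(value);  // value data
        record.putInt(integrityCheck);           // integrity check value
        record.flip();
        channel.write(record);

        if (fsync) {
            channel.force(false);                // optional durability barrier
        }
    }

    @Override
    public void close() throws IOException {
        channel.close();
    }
}
```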

Table File

A HeftyDB table file consists of a doubly linked list of record blocks followed by a fixed-size table trailer that contains metadata about the table itself. Conceptually, these record blocks are the leaf nodes of each table's B+Tree. Table file names are of the form table_id.table.

Each record block follows a consistent format that is designed to allow for zero-allocation binary search over the keys in the block. The format is as follows:

  1. An integer containing the number of keys in the block
  2. A series of integers containing the offset of each key within the block
  3. The key / value pairs themselves. Each key consists of a series of bytes followed by a long integer that contains the snapshot id at which the key was written. Each key and value is preceded by an integer that contains its length in bytes.

Record block sizes are configurable, and are aligned to the system virtual memory page size to avoid extraneous memory copies.
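
The following sketch shows the kind of zero-allocation binary search this layout makes possible: the offset table is consulted with absolute `ByteBuffer` reads and key bytes are compared in place. The class name, the assumption that the key length prefix covers only the key bytes (not the trailing snapshot id), and the "first key greater than or equal to the target" semantics are illustrative choices, not a description of the actual HeftyDB code.

```java
import java.nio.ByteBuffer;

// Sketch of zero-allocation binary search over a record block laid out as:
// [key count][key offsets...][length-prefixed keys and values...]
final class RecordBlockSearchSketch {

    /** Returns the index of the first key >= targetKey, or keyCount if none. */
    static int search(ByteBuffer block, byte[] targetKey) {
        int keyCount = block.getInt(0);
        int low = 0;
        int high = keyCount - 1;
        int result = keyCount;

        while (low <= high) {
            int mid = (low + high) >>> 1;
            // The offset table starts immediately after the key count.
            int keyOffset = block.getInt(Integer.BYTES + mid * Integer.BYTES);
            int cmp = compareKeyAt(block, keyOffset, targetKey);

            if (cmp < 0) {
                low = mid + 1;
            } else {
                result = mid;
                high = mid - 1;
            }
        }
        return result;
    }

    // Compares the key stored at blockOffset with targetKey without copying it
    // out of the buffer or allocating intermediate objects. Assumes the length
    // prefix covers only the key bytes.
    private static int compareKeyAt(ByteBuffer block, int blockOffset, byte[] targetKey) {
        int keySize = block.getInt(blockOffset);
        int keyStart = blockOffset + Integer.BYTES;
        int common = Math.min(keySize, targetKey.length);

        for (int i = 0; i < common; i++) {
            int a = block.get(keyStart + i) & 0xFF;   // unsigned byte comparison
            int b = targetKey[i] & 0xFF;
            if (a != b) {
                return a - b;
            }
        }
        return keySize - targetKey.length;
    }
}
```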

Index File

A HeftyDB index file contains B+Tree index blocks that are of a fixed size. The final block in the file is the root index block. Index file names are of the form table_id.index.

Every index block follows a consistent format that is designed to allow for zero-allocation binary search over the keys in the block. The format is as follows:

  1. An integer containing the number of keys in the block
  2. A series of integers containing the offset of each key within the block
  3. A series of records, one per child block, each containing the child block's starting key and snapshot id, a long integer that contains the file offset at which the child block begins, an integer that contains the size of the child block, and a single byte that indicates whether the child block is a leaf node in a table file.

After the root index block at the end of the file, there is an integer containing the size of the root index block, followed by a long integer containing the starting offset of the root index block.

Index block sizes are configurable, and are aligned to the system virtual memory page size to avoid extraneous memory copies.
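
A small sketch of how the root index block can be located from that fixed-size footer (an int block size followed by a long block offset at the very end of the file). The class and method names are hypothetical; a real reader would also loop until the buffer is filled and then descend through child blocks.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Sketch of locating and reading the root index block via the index file footer.
final class IndexRootLocatorSketch {

    static ByteBuffer readRootBlock(Path indexFile) throws IOException {
        try (FileChannel channel = FileChannel.open(indexFile, StandardOpenOption.READ)) {
            long fileSize = channel.size();

            // Footer: an int (root block size) followed by a long (root block offset).
            ByteBuffer footer = ByteBuffer.allocate(Integer.BYTES + Long.BYTES);
            channel.read(footer, fileSize - footer.capacity());
            footer.flip();
            int rootBlockSize = footer.getInt();
            long rootBlockOffset = footer.getLong();

            // Read the root index block itself; searching it yields the next child
            // block (an inner index block or a record block) to descend into.
            // A full implementation would loop until the buffer is completely filled.
            ByteBuffer rootBlock = ByteBuffer.allocateDirect(rootBlockSize);
            channel.read(rootBlock, rootBlockOffset);
            rootBlock.flip();
            return rootBlock;
        }
    }
}
```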

Bloom Filter File

To improve read efficiency, a bloom filter file is written alongside each table and index file. Bloom filter file names are of the form table_id.filter. The bloom filter file consists of the following (a sketch of the corresponding membership check follows the list):

  1. An integer that indicates the number of hash functions used for each key in the filter
  2. The raw bytes that represent the underlying filter bit set
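
Here is a hedged sketch of a membership check against a filter in this format. The double-hashing scheme and the hash function itself are stand-ins; the file format only fixes the hash-function count and the raw bit set.

```java
import java.nio.ByteBuffer;
import java.util.BitSet;

// Sketch of a bloom filter membership check over the serialized format above.
// The hashing scheme is an assumption, not HeftyDB's actual hash functions.
final class BloomFilterSketch {

    private final int hashFunctionCount;
    private final BitSet bits;

    BloomFilterSketch(ByteBuffer filterFileContents) {
        this.hashFunctionCount = filterFileContents.getInt();   // leading int
        byte[] raw = new byte[filterFileContents.remaining()];  // raw bit set bytes
        filterFileContents.get(raw);
        this.bits = BitSet.valueOf(raw);
    }

    boolean mightContain(byte[] key) {
        int h1 = hash(key, 0x9E3779B9);
        int h2 = hash(key, 0x85EBCA6B);

        // Double hashing: derive hashFunctionCount bit positions from two base hashes.
        for (int i = 0; i < hashFunctionCount; i++) {
            int bitIndex = Math.floorMod(h1 + i * h2, Math.max(1, bits.size()));
            if (!bits.get(bitIndex)) {
                return false;   // definitely absent
            }
        }
        return true;            // possibly present
    }

    private static int hash(byte[] key, int seed) {
        int h = seed;
        for (byte b : key) {
            h = h * 31 + (b & 0xFF);
        }
        return h;
    }
}
```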

Write Path

Writes in HeftyDB are serialized so that only one thread can perform a write operation at any given time, but do not block read or scan operations on other threads.

Key Write

When a write request comes in, the writing thread first acquires the database-wide write lock. The current write snapshot id is incremented at the beginning of the request, but is not yet visible to concurrent reads or range scans. The data is then written to the commit log, where it is optionally fsynced for durability in case of a crash. Once the data has been written to the commit log, the record is added to an in-memory skiplist called the memory table, where it is inserted in sorted order relative to the other records in the table. If the skiplist is not full, then the write is considered complete and its snapshot id is made visible to new readers and range scans. Finally, the write lock is released.
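
The following condensed sketch mirrors that sequence of steps. The `CommitLog` interface, the `VersionedKey` record, and the lock and counter fields are illustrative stand-ins rather than HeftyDB's actual types; only the ordering of the steps follows the description above.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.ReentrantLock;

// Condensed sketch of the write path: lock, bump snapshot, log, skiplist insert,
// publish the snapshot, unlock.
final class WritePathSketch {

    interface CommitLog {
        void append(byte[] key, byte[] value, long snapshotId, boolean fsync) throws IOException;
    }

    record VersionedKey(byte[] key, long snapshotId) {}

    private final ReentrantLock writeLock = new ReentrantLock();
    private final AtomicLong nextSnapshotId = new AtomicLong();    // id handed to new writes
    private final AtomicLong visibleSnapshotId = new AtomicLong(); // newest id readers may see
    private final ConcurrentSkipListMap<VersionedKey, byte[]> memoryTable =
            new ConcurrentSkipListMap<>(WritePathSketch::compare);
    private final CommitLog commitLog;

    WritePathSketch(CommitLog commitLog) {
        this.commitLog = commitLog;
    }

    void put(byte[] key, byte[] value, boolean fsync) throws IOException {
        writeLock.lock();                                  // writes are serialized
        try {
            long snapshotId = nextSnapshotId.incrementAndGet();
            commitLog.append(key, value, snapshotId, fsync);            // durability first
            memoryTable.put(new VersionedKey(key, snapshotId), value);  // then the skiplist

            // Only now does the snapshot become visible to new reads and scans.
            // A full memory table would be swapped out and flushed at this point.
            visibleSnapshotId.set(snapshotId);
        } finally {
            writeLock.unlock();
        }
    }

    // Memory table order: key bytes first, then snapshot id.
    private static int compare(VersionedKey a, VersionedKey b) {
        int cmp = Arrays.compareUnsigned(a.key(), b.key());
        return cmp != 0 ? cmp : Long.compare(a.snapshotId(), b.snapshotId());
    }
}
```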

Table Batch Write

If the memory table is full, it is replaced by an empty skiplist and a fresh commit log file (both named using the next available incrementing table id), and the previous memory table is written out sequentially to disk by a background thread. If there are multiple versions of a key in the memory table, only the latest version is actually written into the table file, which reduces the amount of garbage in the resulting file.

Both the table and index files for a particular memory table are written incrementally one block at a time. Each key is also hashed into an in-memory bloom filter for the table which is written out once the table is complete.

Once the table has been committed to disk, the corresponding memory table is freed and the newly written table is able to start handling read and scan requests in its place.
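
A sketch of the version de-duplication applied during the flush: because the memory table is sorted by key and then snapshot id, the last entry seen for each key is its newest version and is the only one emitted. The `TableWriter` interface and `VersionedKey` record are hypothetical, mirroring the earlier write-path sketch.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.NavigableMap;

// Sketch of flushing a memory table while keeping only the newest version of each key.
final class TableFlushSketch {

    interface TableWriter {
        void write(byte[] key, long snapshotId, byte[] value);
    }

    record VersionedKey(byte[] key, long snapshotId) {}

    static void flush(NavigableMap<VersionedKey, byte[]> memoryTable, TableWriter writer) {
        byte[] pendingKey = null;
        long pendingSnapshot = 0;
        byte[] pendingValue = null;

        // Entries arrive sorted by key, then snapshot id, so the last entry seen
        // for a given key is its newest version.
        for (Map.Entry<VersionedKey, byte[]> entry : memoryTable.entrySet()) {
            VersionedKey versioned = entry.getKey();
            if (pendingKey != null && !Arrays.equals(pendingKey, versioned.key())) {
                writer.write(pendingKey, pendingSnapshot, pendingValue);
            }
            pendingKey = versioned.key();
            pendingSnapshot = versioned.snapshotId();
            pendingValue = entry.getValue();
        }

        if (pendingKey != null) {
            writer.write(pendingKey, pendingSnapshot, pendingValue);
        }
    }
}
```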

Read Path

Reads to a HeftyDB database can happen on any number of concurrent threads, and do not block write operations or range scans.

When a read request for a particular key comes in, the reading thread first acquires a read lock on the set of tables in the database to ensure that no background table batch write or compaction operation changes the set of tables during the read request. Next, the bloom filter for each table is checked to see whether it might contain the requested key.

If this check returns true, then the table is searched using the table index to retrieve a version of the requested key that is less than or equal to the newest visible snapshot at the beginning of the read. If the key is found in the table and is newer than the versions found in previously searched tables, it replaces the previously found key as the version closest to the requested snapshot.

Once every table has either been skipped due to its bloom filter or searched, the record with the key version closest to the requested snapshot is returned. Finally, the read lock on the set of tables in the database is released.
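
A compact sketch of this per-key read flow is shown below. The `Table` and `Record` interfaces and the `ReadWriteLock` are assumptions made for the example; the control flow (pin the table set, bloom-filter check, search, keep the version closest to the read snapshot) follows the description above.

```java
import java.util.List;
import java.util.concurrent.locks.ReadWriteLock;

// Sketch of a single-key read across the set of tables in the database.
final class ReadPathSketch {

    interface Record {
        long snapshotId();
        byte[] value();
    }

    interface Table {
        boolean mightContain(byte[] key);             // bloom filter check
        Record get(byte[] key, long maxSnapshotId);   // newest version <= maxSnapshotId, or null
    }

    static Record read(List<Table> tables, ReadWriteLock tableSetLock,
                       byte[] key, long readSnapshotId) {
        tableSetLock.readLock().lock();               // pin the current set of tables
        try {
            Record newest = null;
            for (Table table : tables) {
                if (!table.mightContain(key)) {
                    continue;                          // bloom filter says definitely absent
                }
                Record candidate = table.get(key, readSnapshotId);
                if (candidate != null
                        && (newest == null || candidate.snapshotId() > newest.snapshotId())) {
                    newest = candidate;                // closest version to the read snapshot so far
                }
            }
            return newest;
        } finally {
            tableSetLock.readLock().unlock();
        }
    }
}
```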

Block Cache

HeftyDB can be configured with a database-wide cache for record blocks and index blocks, which can be sized independently for each block type. When a block is read from a table or index file, it is read directly into an off-heap memory buffer from disk or from the operating system page cache. A key can then be found within the block of off-heap memory without allocating any intermediate Java objects, which allows for very efficient searches.

Since the block cache only contains pointers to these segments of off-heap memory, the cache does not put much pressure on the garbage collector even at large sizes. Initial benchmarks have shown that it is worth configuring HeftyDB with a separate block cache when the cache hit rate is relatively high or the whole dataset will fit in cache.

Scan Path

Scan operations over a HeftyDB database can happen on any number of concurrent threads, and do not block write operations or read operations.

First, a read lock is acquired on all of the tables in the database. Next, the scan operation searches for the requested key in each table and starts a cursor at the closest key in that table that is lexicographically greater than or equal to the requested key at the requested snapshot. The read lock on the set of tables in the database is then released.

Each of these cursors is expressed as a standard Java iterator, and the table iterators are combined using a special merging iterator that takes a sorted stream of records from each table iterator and merges them into a single sorted stream of records. The merging iterator is wrapped in another iterator that ensures that only the latest version of each unique key less than or equal to the requested snapshot is included in the aggregate stream of records.
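
A generic k-way merging iterator in this spirit can be sketched with a priority queue over the per-table iterators; the version-filtering wrapper described above would be layered on top. The class shape and names are illustrative, not HeftyDB's actual iterator.

```java
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.PriorityQueue;

// Sketch of a k-way merge: each source iterator is already sorted, and the
// priority queue always yields the smallest current head element.
final class MergingIteratorSketch<T> implements Iterator<T> {

    private record Head<E>(E element, Iterator<E> source) {}

    private final PriorityQueue<Head<T>> heads;

    MergingIteratorSketch(List<Iterator<T>> sources, Comparator<T> order) {
        this.heads = new PriorityQueue<>(
                Comparator.comparing((Head<T> head) -> head.element(), order));
        for (Iterator<T> source : sources) {
            if (source.hasNext()) {
                heads.add(new Head<>(source.next(), source));
            }
        }
    }

    @Override
    public boolean hasNext() {
        return !heads.isEmpty();
    }

    @Override
    public T next() {
        Head<T> smallest = heads.poll();
        if (smallest == null) {
            throw new NoSuchElementException();
        }
        // Refill from the source that produced the element we are about to return.
        if (smallest.source().hasNext()) {
            heads.add(new Head<>(smallest.source().next(), smallest.source()));
        }
        return smallest.element();
    }
}
```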

The merging iterator is aware of concurrent operations that change the set of tables in the database, and recreates itself whenever that set changes during a range scan operation. Even though the underlying data may change due to compactions and additional writes, the stream of records returned by a particular range scan will only include records that were visible as of the snapshot at which the range scan began, or as of a particular snapshot specified when the range scan was started.

Scan operations in HeftyDB are quite efficient since disk seeks only have to be done when a new block needs to be read in. Since blocks are generally the size of several virtual memory pages, several records can be read for each disk seek. Note that the block cache is not used for range scan operations, as a single long running range scan could end up evicting frequently used blocks.

Snapshots

HeftyDB uses snapshots for concurrency control. HeftyDB snapshots are long integer values that are incremented whenever a write operation occurs. To prevent a background compaction thread from removing old versions of keys at a particular snapshot, the calling application code can retain that snapshot id. Once the application code is finished using a snapshot, it can release it so that a compaction thread is free to discard old versions of keys at that snapshot.

Note that snapshot retention is not persistent across process restarts or crashes.

Compactions

Since HeftyDB has an append only design, garbage can build up in the table files over time as new versions of records are written. Additionally, to maintain read throughput, HeftyDB tries to keep the set of tables that it needs to search to a minimum.

The solution to both of these problems is to execute a merging compaction operation periodically on a background thread. A compaction operation combines multiple tables using the merging iterator described in the range scan section, and sequentially writes a new table file, index file, and bloom filter file to disk containing the records from each of the source tables in sorted order.

A particular database compaction operation is described by an object called a compaction plan. A compaction plan contains one or more compaction tasks, each of which describes which table ids are going to be merged into a single new table. These compaction tasks are then executed on one of two thread pools: one dedicated to short-running compaction tasks and one dedicated to longer-running compaction tasks, as specified by the compaction planner. This prevents long-running compaction tasks from blocking short-running ones.

The compaction planner is called whenever the set of tables in the database changes and then determines whether or not it needs to generate and execute a new compaction plan.

Currently there is only one compaction planner in HeftyDB, which does a basic size-tiered compaction. Essentially, it combines groups of 5 tables of roughly the same size into a single new table, which can in turn be combined with other tables of its new size, and so on. While simplistic, this has turned out to be remarkably effective in practice.
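
A sketch of what such a size-tiered planner might look like: tables are bucketed by size tier, and any tier with at least five tables yields a task that merges a group of five into one larger table. The tiering function (order of magnitude of the table size) and the `TableInfo` / `CompactionTask` shapes are assumptions for the example, not HeftyDB's actual planner API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of a size-tiered compaction planner.
final class SizeTieredPlannerSketch {

    record TableInfo(long id, long sizeBytes) {}
    record CompactionTask(List<Long> sourceTableIds) {}

    static final int TABLES_PER_TASK = 5;

    static List<CompactionTask> plan(List<TableInfo> tables) {
        Map<Integer, List<TableInfo>> tiers = new HashMap<>();
        for (TableInfo table : tables) {
            // Bucket by order of magnitude so tables of roughly the same size group together.
            int tier = 63 - Long.numberOfLeadingZeros(Math.max(1, table.sizeBytes()));
            tiers.computeIfAbsent(tier, unused -> new ArrayList<>()).add(table);
        }

        List<CompactionTask> tasks = new ArrayList<>();
        for (List<TableInfo> tier : tiers.values()) {
            // Every full group of five similarly sized tables becomes one merge task.
            for (int start = 0; start + TABLES_PER_TASK <= tier.size(); start += TABLES_PER_TASK) {
                List<Long> sourceIds = tier.subList(start, start + TABLES_PER_TASK)
                        .stream().map(TableInfo::id).toList();
                tasks.add(new CompactionTask(sourceIds));
            }
        }
        return tasks;
    }
}
```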

An application can also provide a custom compaction planner in order to tailor compaction operations to a particular workload.

Crash Recovery

Crash recovery is efficient and straightforward, and involves the following steps (a sketch of the sequence follows the list):

  1. Any partially written table files which were started as part of a compaction or table batch write (which all end with a .temp extension) are deleted since they cannot be trusted to be complete.

  2. Any commit log files are read in, sorted, and written out as table files. Once these tables are committed, the log files are deleted.

  3. Each of these tables is opened; the maximum snapshot id found across all of them becomes the next available snapshot id, and the maximum table id becomes the next available table id.
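
A sketch of those three steps in order, with the log replay and table inspection left behind a hypothetical `Recovery` interface; only the sequencing and the file extensions come from the description above.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch of the three recovery steps: drop temp files, replay logs, derive next ids.
final class CrashRecoverySketch {

    interface Recovery {
        void replayLogIntoTable(Path logFile) throws IOException;   // step 2 worker
        long maxSnapshotId(Path tableFile) throws IOException;      // step 3 helpers
        long tableId(Path tableFile);
    }

    static void recover(Path dataDirectory, Recovery recovery) throws IOException {
        long maxSnapshotId = 0;
        long maxTableId = 0;

        // Step 1: discard partially written tables from interrupted flushes or compactions.
        try (DirectoryStream<Path> temps = Files.newDirectoryStream(dataDirectory, "*.temp")) {
            for (Path temp : temps) {
                Files.delete(temp);
            }
        }

        // Step 2: replay surviving commit logs into table files, then drop the logs.
        try (DirectoryStream<Path> logs = Files.newDirectoryStream(dataDirectory, "*.log")) {
            for (Path log : logs) {
                recovery.replayLogIntoTable(log);
                Files.delete(log);
            }
        }

        // Step 3: the next snapshot id and table id continue from the maxima found on disk.
        try (DirectoryStream<Path> tables = Files.newDirectoryStream(dataDirectory, "*.table")) {
            for (Path table : tables) {
                maxSnapshotId = Math.max(maxSnapshotId, recovery.maxSnapshotId(table));
                maxTableId = Math.max(maxTableId, recovery.tableId(table));
            }
        }

        long nextSnapshotId = maxSnapshotId + 1;
        long nextTableId = maxTableId + 1;
        // ... hand nextSnapshotId / nextTableId to the database on startup ...
    }
}
```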