Motivation
At the moment we duplicate a lot of data when we store BeaconStates in the database. For example, the active_index_roots vector is 2**16 elements * 32 bytes = 2MiB in size, is updated only once per epoch, and is stored in every single state! Over the course of the time it takes to change every element in that vector (2**16 epochs), this field alone accounts for 2**16 epochs * 64 slots per epoch * 2MiB per state = 8 TiB!!! All that for only two vectors worth of unique values, i.e. 4MiB of data!
Basic Idea
Rather than duplicating data in every BeaconState, we should store the contents of each vector in the database separately, and reconstruct BeaconStates on-demand. This works for vectors which are updated one element at a time in a predictable way (once-per-slot, once-per-epoch, etc), but doesn't help when it comes to fields like the validator registry (further work).
Whichever key-value database we use (cf. #483), I think it's safe to say that it will satisfy these two properties:
- Reading some number of bytes using a small number of queries (keys) will be faster than reading the same number of bytes using a large number of queries. In other words: storing a vector of length N in chunks will be more read-efficient than storing the elements individually and performing N reads every time we want to construct the vector.
- Looking up multiple keys in a contiguous range is faster than looking up disparate keys from different parts of the DB.
I did a few rough benchmarks on LevelDB using our wrappers, and both of these assumptions seem to hold true (see db_bench). The crux of this RFC is to exploit both of these properties when storing vectors.
Let `v: FixedVector<T, N>` be a vector, and let `k` be a "chunk size". To store `v` in the database, we use a table (a sequential range of keys), where each table index can hold multiple chunks (to account for forks).
```rust
struct ChunkedEntry<T> {
    /// Value used to identify the fork we're on (more on this later)
    id: Hash256,
    /// Dynamic vector of length at least 1, and at most k
    values: Vec<T>,
}
```

The table is conceptually a map from `u64 => Vec<ChunkedEntry>`, implemented by the KV store. The `u64` index `i` is mapped to a 32-byte key under a prefix determined by the field name, like `hash("active_index_roots")[..24] + i.to_bytes()`. The values stored in the DB are serialised forms of `ChunkedEntry`, which are `32 + 1 + k * s(T)` bytes in the worst case, where `s(T)` is the size of `T` in bytes, and 1 byte is used to represent the length (which is fine for `k < 256`).
More on those indexes: the way the table is indexed depends on the update pattern of the field. For a once-per-epoch field like active_index_roots, the index to use is i = epoch / k, with j = epoch % k acting as the index into ChunkedEntry::values.
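To make the indexing concrete, here's a minimal sketch of the key derivation and chunk-coordinate calculation described above. It's an illustration only: `example_hash` is a stand-in for whatever 32-byte hash the implementation already has available, and the exact byte layout of the key is an assumption, not a committed design.

```rust
/// Hypothetical sketch: the first 24 bytes of the key identify the field
/// (e.g. "active_index_roots"), and the last 8 bytes are the big-endian
/// chunk index, giving a 32-byte DB key.
fn chunk_key(field_name: &str, chunk_index: u64) -> [u8; 32] {
    let field_hash = example_hash(field_name.as_bytes());

    let mut key = [0u8; 32];
    key[..24].copy_from_slice(&field_hash[..24]);
    key[24..].copy_from_slice(&chunk_index.to_be_bytes());
    key
}

/// For a once-per-epoch field, map an epoch to (chunk index, index within chunk).
fn chunk_coords(epoch: u64, k: u64) -> (u64, u64) {
    (epoch / k, epoch % k)
}

/// Toy stand-in hash (NOT cryptographic); a real implementation would use
/// whatever hash function the client already depends on.
fn example_hash(input: &[u8]) -> [u8; 32] {
    let mut out = [0u8; 32];
    for (i, byte) in input.iter().enumerate() {
        out[i % 32] ^= byte.wrapping_mul((i as u8).wrapping_add(1));
    }
    out
}

fn main() {
    let k = 8;
    let epoch = 12_345;
    let (i, j) = chunk_coords(epoch, k);
    let key = chunk_key("active_index_roots", i);
    println!("epoch {} -> chunk {} offset {} key {:02x?}", epoch, i, j, key);
}
```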
Handling Forks
The ChunkedEntry::id field is intended to allow us to distinguish between chunks stored for different forks. If we have a chain beginning from a state A and forking at C, like A <- B <- C <- {D or E}, then we store two chunked entries at the same index, one with the values for each fork.
For now assume we have a vector that is updated every slot, let k = 4, and let v(S) denote the value of the new vector element introduced at state S. The table looks like:
index: u64 | chunked entries: Vec<ChunkedEntry>
---------------------------------------------------------------------
0 | [[id1: v(A) v(B) v(C) v(D)], [id2: v(A) v(B) v(C) v(E)]]
Now let's consider what happens when we've stored "light" versions of the states A..E on disk, and want to reconstruct the vector for some states. If we want to look up D or E, it'd be good if we could match some property of D or E against the IDs of the entries. What about the state root? To be clear:
Let the ID field of a ChunkedEntry be equal to the state root of the state from which the last value in the chunk was taken.
In the example above we'd have id1 = state_root(D) and id2 = state_root(E). When constructing the states D or E we then simply pick the entry with the matching state root (presumably we know the state roots of any states we're attempting to construct -- this seems like a reasonable assumption). Before the fork, the entry would be [state_root(C): v(A) v(B) v(C)], and we'd mutate/update it when storing the values for D/E.
But what if we've got the values for D and E stored already and we want to reconstruct state C? The state root for C doesn't match either of the IDs. In this case, we should look up the successor state roots of C (namely D and E) and use those to match against the IDs in the table. It should be sufficient to choose one of the possible successors, and we should be able to know we're in this case based on the state's slot modulo k. In the worst case, maybe we'll need to grab a bunch of state roots that succeed C and look for them in the bucket.
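As a sketch of that matching rule (all type and function names here are illustrative, not a final API), selecting the right entry for a state might look like:

```rust
type Hash256 = [u8; 32];

struct ChunkedEntry<T> {
    /// State root of the state that contributed the *last* value in this chunk.
    id: Hash256,
    values: Vec<T>,
}

/// Pick the entry whose ID matches either the state root we're reconstructing
/// (it wrote the last value of the chunk) or one of its known successor state
/// roots (the chunk has since been extended by later states on the same fork).
fn select_entry<'a, T>(
    entries: &'a [ChunkedEntry<T>],
    state_root: &Hash256,
    successor_roots: &[Hash256],
) -> Option<&'a ChunkedEntry<T>> {
    entries
        .iter()
        .find(|e| &e.id == state_root)
        .or_else(|| entries.iter().find(|e| successor_roots.contains(&e.id)))
}

fn main() {
    // Fork example from above: one entry ending at D, one ending at E.
    let entries = vec![
        ChunkedEntry { id: [0xD; 32], values: vec!["v(A)", "v(B)", "v(C)", "v(D)"] },
        ChunkedEntry { id: [0xE; 32], values: vec!["v(A)", "v(B)", "v(C)", "v(E)"] },
    ];
    // Reconstructing C: its own root matches neither ID, but the root of its
    // successor D does.
    let entry = select_entry(&entries, &[0xC; 32], &[[0xD; 32]]);
    assert_eq!(entry.unwrap().id, [0xD; 32]);
}
```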
How do we know the successor states? And what IDs do we use for the vector of state roots themselves? Read on!
State Roots & Block Roots
In this scheme, state roots and block roots are handled differently to other vectors (recall that the BeaconState contains state_roots and block_roots fixed-len vectors).
For `state_roots`, let the ID be equal to the state root of the state previous to the state whose state root is stored in the first entry of the chunk.
This creates a thread of continuity between indices in the table, allowing us to traverse backwards from any (state_root, slot) pair. For example, the linear chain A -> B -> .. -> H would have its state roots (abbreviated r(S)) stored as follows:
index: u64 | chunked entries: Vec<ChunkedEntry>
---------------------------------------------------------------------
0 | [[0x00: r(A) r(B) r(C) r(D)]]
1 | [[r(D): r(E) r(F) r(G) r(H)]]
For block roots we do something similar (use the previous block root as the ID).
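Here's a rough sketch of how that backwards traversal could work, with an in-memory map standing in for the on-disk table; every name here is illustrative rather than a committed design:

```rust
use std::collections::HashMap;

type Hash256 = [u8; 32];

struct ChunkedEntry {
    /// State root of the state *preceding* the first root stored in `values`.
    id: Hash256,
    values: Vec<Hash256>,
}

/// Walk backwards from a known (state root, slot), returning ancestor state
/// roots newest-first. At each step we pick the entry (fork) whose values
/// contain the root we're currently anchored to, record everything up to it,
/// then follow the entry's `id` into the previous chunk.
fn ancestor_state_roots(
    table: &HashMap<u64, Vec<ChunkedEntry>>,
    mut anchor_root: Hash256,
    slot: u64,
    k: u64,
) -> Vec<Hash256> {
    let mut roots = Vec::new();
    let mut chunk_index = slot / k;

    loop {
        let entry = match table
            .get(&chunk_index)
            .and_then(|entries| entries.iter().find(|e| e.values.contains(&anchor_root)))
        {
            Some(entry) => entry,
            None => break,
        };

        // Record every root in this chunk, from the anchor backwards.
        let pos = entry.values.iter().position(|r| *r == anchor_root).unwrap();
        roots.extend(entry.values[..=pos].iter().rev().copied());

        // The chunk's ID is a root belonging to the previous chunk: follow it.
        anchor_root = entry.id;
        if chunk_index == 0 {
            break;
        }
        chunk_index -= 1;
    }

    roots
}

fn main() {
    // Table from the example above (k = 4): chunk 0 holds r(A)..r(D) with ID 0x00,
    // chunk 1 holds r(E)..r(H) with ID r(D).
    let r = |x: u8| [x; 32];
    let mut table = HashMap::new();
    table.insert(0, vec![ChunkedEntry { id: [0; 32], values: vec![r(1), r(2), r(3), r(4)] }]);
    table.insert(1, vec![ChunkedEntry { id: r(4), values: vec![r(5), r(6), r(7), r(8)] }]);

    // Starting from r(G) at slot 6, we recover r(G), r(F), ..., r(A).
    let ancestors = ancestor_state_roots(&table, r(7), 6, 4);
    assert_eq!(ancestors.len(), 7);
}
```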
Then, constructing a full state from a state root looks something like:
- Look up the light state in the database; this yields the slot if it is not already known, but none of the vector fields which are stored separately.
- Construct the `state_roots` field with a range query over the on-disk state roots table (based on the slot). Choose from multiple chunks by matching the entries of the `ChunkedEntry` against the state root of interest. Backtrack using the IDs of the chunked entries.
- Construct the `block_roots` field similarly.
- Construct all the other vector fields with reference to the already constructed `state_roots`. Use a range query on the field's table, then choose between forks based on the relevant entries of `state_roots` (some kind of `zip`; see the sketch below).
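A rough sketch of that last step, using an in-memory map in place of the DB range query; again, every name here is illustrative:

```rust
use std::collections::BTreeMap;

type Hash256 = [u8; 32];

struct ChunkedEntry<T> {
    /// State root of the state that contributed the last value in this chunk.
    id: Hash256,
    values: Vec<T>,
}

/// Given the candidate entries for one chunk of some field, and the slice of
/// already-reconstructed `state_roots` covering the same span, pick the entry
/// belonging to our fork: its ID must be one of those roots.
fn pick_for_our_fork<'a, T>(
    candidates: &'a [ChunkedEntry<T>],
    our_state_roots: &[Hash256],
) -> Option<&'a [T]> {
    candidates
        .iter()
        .find(|entry| our_state_roots.contains(&entry.id))
        .map(|entry| entry.values.as_slice())
}

/// Reconstruct a whole field by walking its chunk indices in order and
/// concatenating the per-chunk values chosen above (the BTreeMap stands in
/// for a range query over the field's table).
fn reconstruct_field<T: Clone>(
    chunks: &BTreeMap<u64, Vec<ChunkedEntry<T>>>,
    our_state_roots: &[Hash256],
) -> Vec<T> {
    chunks
        .values()
        .filter_map(|candidates| pick_for_our_fork(candidates, our_state_roots))
        .flat_map(|values| values.iter().cloned())
        .collect()
}
```

Note that for the chunk currently being written at the tip, the entry's ID may be the root of a successor state rather than one already present in `state_roots`, so that chunk would fall back to the successor-root matching described earlier.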
Relevant BeaconState Fields
- `block_roots`: once-per-slot, 8192 entries, 32 bytes each
- `state_roots`: once-per-slot, 8192 entries, 32 bytes each
- `randao_mixes`: once-per-slot, 65536 entries, 32 bytes each
- `active_index_roots`: once-per-epoch, 65536 entries, 32 bytes each
- `compact_committees_roots`: once-per-epoch, 65536 entries, 32 bytes each
The potential savings are, frankly, enormous. Don't pass up on this fantastic deal!! [humour]
Nice Properties
- Linear space usage. Storing a `Vector<T, N>` requires `N/k` buckets of size `33 + k s(T)` bytes, i.e. `33N/k + N s(T)` bytes, which is a tad more than the theoretical minimum `N s(T)`. For our active roots example from earlier, this is an overhead of only `2 * 33 * 65536 / 8 = 528KiB`, which is a lot better than ~8TiB.
- Easy garbage collection. Once blocks are finalised, one can scan the database and hose out any chunked entries for forks that didn't eventuate to anything.
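For what it's worth, a throwaway check of that arithmetic (the `k = 8` here is the assumption behind the `/ 8` above):

```rust
/// Overhead of the chunked layout over the theoretical minimum `N * s(T)` bytes:
/// one 33-byte header (32-byte ID + 1 length byte) per chunk, `N / k` chunks per
/// stored copy of the vector's worth of data.
fn chunk_overhead_bytes(n: u64, k: u64, copies: u64) -> u64 {
    copies * 33 * n / k
}

fn main() {
    // active_index_roots: N = 65536, k = 8, two vectors' worth of unique values.
    let overhead = chunk_overhead_bytes(65_536, 8, 2);
    println!("{} bytes = {} KiB", overhead, overhead / 1024); // 540672 bytes = 528 KiB
}
```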
Next Steps
- How bad is it to be increasing the size of values in the NoSQL database so frequently? Should we pre-allocate every chunk to contain a full `k` elements initialised to 0?
- What is the optimal value for `k`? High values mean more duplication in the case of forks, but also more efficient use of space when there is no forking. Higher values also mean bigger writes when updating a single bucket. Lots of factors to trade off against each other.
- Implement a prototype?
Future Work
These are all out of scope for this issue:
- Optimising storage of fields with more complex update patterns, like `validators`, `balances`, `slashings`, etc. (hard).
- Optimising storage of fields with similar-ish update patterns, like `current_crosslinks`, which is updated as-a-whole, once per epoch (easy-ish).
- Optimising all the variable-length lists that are append-only, like `current_epoch_attestations`. We could probably store just new entries in the states themselves, and reconstruct the full vectors from the concatenation of entries in previous states (easy-ish).