Motivation
At the moment we duplicate a lot of data when we store BeaconStates in the database. For example, the active_index_roots vector is 2**16 elements * 32 bytes = 2MiB in size, is updated only once per epoch, and is stored in every single state! Over the course of the time it takes to change every element in that vector (2**16 epochs), this field alone accounts for 2**16 epochs * 64 slots per epoch * 2MiB per state = 8 TiB!!! All that for only two vectors worth of unique values, i.e. 4MiB of data!
Basic Idea
Rather than duplicating data in every BeaconState, we should store the contents of each vector in the database separately, and reconstruct BeaconStates on-demand. This works for vectors which are updated one element at a time in a predictable way (once-per-slot, once-per-epoch, etc), but doesn't help when it comes to fields like the validator registry (further work).
Whichever key-value database we use (cf. #483), I think it's safe to say that it will satisfy these two properties:
- Reading some number of bytes using a small number of queries (keys) will be faster than reading the same number of bytes using a large number of queries. In other words: storing a vector of length N in chunks will be more read-efficient than storing the elements individually and performing N reads every time we want to construct the vector.
- Looking up multiple keys in a contiguous range is faster than looking up disparate keys from different parts of the DB.
I did a few rough benchmarks on LevelDB using our wrappers, and both of these assumptions seem to hold true (see db_bench). The crux of this RFC is to exploit both of these properties when storing vectors.
Let `v: FixedVector<T, N>` be a vector, and let `k` be a "chunk size". To store `v` in the database, we use a table (a sequential range of keys), where each table index can hold multiple chunks (to account for forks).
```rust
struct ChunkedEntry<T> {
    /// Value used to identify the fork we're on (more on this later)
    id: Hash256,
    /// Dynamic vector of length at least 1, and at most k
    values: Vec<T>,
}
```

The table is conceptually a map from `u64 => Vec<ChunkedEntry>`, implemented by the KV store. The `u64` index `i` is mapped to a 32-byte key under a prefix determined by the field name, like `hash("active_index_roots")[..24] + i.to_bytes()`. The values stored in the DB are serialised forms of `ChunkedEntry`, which are `32 + 1 + k * s(T)` bytes in the worst case, where `s(T)` is the size of `T` in bytes, and 1 byte is used to represent the length (which is fine for `k < 256`).
More on those indexes: the way the table is indexed depends on the update pattern of the field. For a once-per-epoch field like active_index_roots, the index to use is i = epoch / k, with j = epoch % k acting as the index into ChunkedEntry::values.
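To make the indexing concrete, here's a minimal sketch of the key derivation and chunk-coordinate calculation described above. It's an illustration only: `example_hash` is a stand-in for whatever 32-byte hash the implementation already has available, and the exact byte layout of the key is an assumption, not a committed design.

```rust
/// Hypothetical sketch: the first 24 bytes of the key identify the field
/// (e.g. "active_index_roots"), and the last 8 bytes are the big-endian
/// chunk index, giving a 32-byte DB key.
fn chunk_key(field_name: &str, chunk_index: u64) -> [u8; 32] {
    let field_hash = example_hash(field_name.as_bytes());

    let mut key = [0u8; 32];
    key[..24].copy_from_slice(&field_hash[..24]);
    key[24..].copy_from_slice(&chunk_index.to_be_bytes());
    key
}

/// For a once-per-epoch field, map an epoch to (chunk index, index within chunk).
fn chunk_coords(epoch: u64, k: u64) -> (u64, u64) {
    (epoch / k, epoch % k)
}

/// Toy stand-in hash (NOT cryptographic); a real implementation would use
/// whatever hash function the client already depends on.
fn example_hash(input: &[u8]) -> [u8; 32] {
    let mut out = [0u8; 32];
    for (i, byte) in input.iter().enumerate() {
        out[i % 32] ^= byte.wrapping_mul((i as u8).wrapping_add(1));
    }
    out
}

fn main() {
    let k = 8;
    let epoch = 12_345;
    let (i, j) = chunk_coords(epoch, k);
    let key = chunk_key("active_index_roots", i);
    println!("epoch {} -> chunk {} offset {} key {:02x?}", epoch, i, j, key);
}
```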
Handling Forks
The ChunkedEntry::id field is intended to allow us to distinguish between chunks stored for different forks. If we have a chain beginning from a state A and forking at C, like A <- B <- C <- {D or E}, then we store two chunked entries at the same index, one with the values for each fork.
For now assume we have a vector that is updated every slot, let k = 4, and let v(S) denote the value of the new vector element introduced at state S. The table looks like:
index: u64 | chunked entries: Vec<ChunkedEntry>
---------------------------------------------------------------------
0 | [[id1: v(A) v(B) v(C) v(D)], [id2: v(A) v(B) v(C) v(E)]]
Now let's consider what happens when we've stored "light" versions of the states A..E on disk, and want to reconstruct the vector for some states. If we want to look up D or E, it'd be good if we could match some property of D or E against the IDs of the entries. What about the state root? To be clear:
Let the ID field of a ChunkedEntry be equal to the state root of the state from which the last value in the chunk was taken.
In the example above we'd have id1 = state_root(D) and id2 = state_root(E). When constructing the states D or E we then simply pick the entry with the matching state root (presumably we know the state roots of any states we're attempting to construct -- this seems like a reasonable assumption). Before the fork, the entry would be [state_root(C): v(A) v(B) v(C)], and we'd mutate/update it when storing the values for D/E.
But what if we've got the values for D and E stored already and we want to reconstruct state C? The state root for C doesn't match either of the IDs. In this case, we should look up the successor state roots of C (namely D and E) and use those to match against the IDs in the table. It should be sufficient to choose one of the possible successors, and we should be able to know we're in this case based on the state's slot modulo k. In the worst case, maybe we'll need to grab a bunch of state roots that succeed C and look for them in the bucket.
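As a sketch of that matching rule (all type and function names here are illustrative, not a final API), selecting the right entry for a state might look like:

```rust
type Hash256 = [u8; 32];

struct ChunkedEntry<T> {
    /// State root of the state that contributed the *last* value in this chunk.
    id: Hash256,
    values: Vec<T>,
}

/// Pick the entry whose ID matches either the state root we're reconstructing
/// (it wrote the last value of the chunk) or one of its known successor state
/// roots (the chunk has since been extended by later states on the same fork).
fn select_entry<'a, T>(
    entries: &'a [ChunkedEntry<T>],
    state_root: &Hash256,
    successor_roots: &[Hash256],
) -> Option<&'a ChunkedEntry<T>> {
    entries
        .iter()
        .find(|e| &e.id == state_root)
        .or_else(|| entries.iter().find(|e| successor_roots.contains(&e.id)))
}

fn main() {
    // Fork example from above: one entry ending at D, one ending at E.
    let entries = vec![
        ChunkedEntry { id: [0xD; 32], values: vec!["v(A)", "v(B)", "v(C)", "v(D)"] },
        ChunkedEntry { id: [0xE; 32], values: vec!["v(A)", "v(B)", "v(C)", "v(E)"] },
    ];
    // Reconstructing C: its own root matches neither ID, but the root of its
    // successor D does.
    let entry = select_entry(&entries, &[0xC; 32], &[[0xD; 32]]);
    assert_eq!(entry.unwrap().id, [0xD; 32]);
}
```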
How do we know the successor states? And what IDs do we use for the vector of state roots themselves? Read on!
State Roots & Block Roots
In this scheme, state roots and block roots are handled differently to other vectors (recall that the BeaconState contains state_roots and block_roots fixed-len vectors).
For `state_roots`, let the ID be equal to the state root of the state previous to the state whose state root is stored in the first entry of the chunk.
This creates a thread of continuity between indices in the table, allowing us to traverse backwards from any (state_root, slot) pair. For example, the linear chain A -> B -> .. -> H would have its state roots (abbreviated r(S)) stored as follows:
index: u64 | chunked entries: Vec<ChunkedEntry>
---------------------------------------------------------------------
0 | [[0x00: r(A) r(B) r(C) r(D)]]
1 | [[r(D): r(E) r(F) r(G) r(H)]]
For block roots we do something similar (use the previous block root as the ID).
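Here's a rough sketch of how that backwards traversal could work, with an in-memory map standing in for the on-disk table; every name here is illustrative rather than a committed design:

```rust
use std::collections::HashMap;

type Hash256 = [u8; 32];

struct ChunkedEntry {
    /// State root of the state *preceding* the first root stored in `values`.
    id: Hash256,
    values: Vec<Hash256>,
}

/// Walk backwards from a known (state root, slot), returning ancestor state
/// roots newest-first. At each step we pick the entry (fork) whose values
/// contain the root we're currently anchored to, record everything up to it,
/// then follow the entry's `id` into the previous chunk.
fn ancestor_state_roots(
    table: &HashMap<u64, Vec<ChunkedEntry>>,
    mut anchor_root: Hash256,
    slot: u64,
    k: u64,
) -> Vec<Hash256> {
    let mut roots = Vec::new();
    let mut chunk_index = slot / k;

    loop {
        let entry = match table
            .get(&chunk_index)
            .and_then(|entries| entries.iter().find(|e| e.values.contains(&anchor_root)))
        {
            Some(entry) => entry,
            None => break,
        };

        // Record every root in this chunk, from the anchor backwards.
        let pos = entry.values.iter().position(|r| *r == anchor_root).unwrap();
        roots.extend(entry.values[..=pos].iter().rev().copied());

        // The chunk's ID is a root belonging to the previous chunk: follow it.
        anchor_root = entry.id;
        if chunk_index == 0 {
            break;
        }
        chunk_index -= 1;
    }

    roots
}

fn main() {
    // Table from the example above (k = 4): chunk 0 holds r(A)..r(D) with ID 0x00,
    // chunk 1 holds r(E)..r(H) with ID r(D).
    let r = |x: u8| [x; 32];
    let mut table = HashMap::new();
    table.insert(0, vec![ChunkedEntry { id: [0; 32], values: vec![r(1), r(2), r(3), r(4)] }]);
    table.insert(1, vec![ChunkedEntry { id: r(4), values: vec![r(5), r(6), r(7), r(8)] }]);

    // Starting from r(G) at slot 6, we recover r(G), r(F), ..., r(A).
    let ancestors = ancestor_state_roots(&table, r(7), 6, 4);
    assert_eq!(ancestors.len(), 7);
}
```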
Then, constructing a full state from a state root looks something like:
- Look up the light state in the database; this yields the slot if it is not already known, but none of the vector fields which are stored separately.
- Construct the `state_roots` field with a range query over the on-disk state roots table (based on the slot). Choose from multiple chunks by matching the entries of the `ChunkedEntry` against the state root of interest. Backtrack using the IDs of the chunked entries.
- Construct the `block_roots` field similarly.
- Construct all the other vector fields with reference to the already constructed `state_roots`. Use a range query on the field's table, then choose between forks based on the relevant entries of `state_roots` (some kind of `zip`; see the sketch below).
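A rough sketch of that last step, using an in-memory map in place of the DB range query; again, every name here is illustrative:

```rust
use std::collections::BTreeMap;

type Hash256 = [u8; 32];

struct ChunkedEntry<T> {
    /// State root of the state that contributed the last value in this chunk.
    id: Hash256,
    values: Vec<T>,
}

/// Given the candidate entries for one chunk of some field, and the slice of
/// already-reconstructed `state_roots` covering the same span, pick the entry
/// belonging to our fork: its ID must be one of those roots.
fn pick_for_our_fork<'a, T>(
    candidates: &'a [ChunkedEntry<T>],
    our_state_roots: &[Hash256],
) -> Option<&'a [T]> {
    candidates
        .iter()
        .find(|entry| our_state_roots.contains(&entry.id))
        .map(|entry| entry.values.as_slice())
}

/// Reconstruct a whole field by walking its chunk indices in order and
/// concatenating the per-chunk values chosen above (the BTreeMap stands in
/// for a range query over the field's table).
fn reconstruct_field<T: Clone>(
    chunks: &BTreeMap<u64, Vec<ChunkedEntry<T>>>,
    our_state_roots: &[Hash256],
) -> Vec<T> {
    chunks
        .values()
        .filter_map(|candidates| pick_for_our_fork(candidates, our_state_roots))
        .flat_map(|values| values.iter().cloned())
        .collect()
}
```

Note that for the chunk currently being written at the tip, the entry's ID may be the root of a successor state rather than one already present in `state_roots`, so that chunk would fall back to the successor-root matching described earlier.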
Relevant BeaconState Fields
- `block_roots`: once-per-slot, 8192 entries, 32 bytes each
- `state_roots`: once-per-slot, 8192 entries, 32 bytes each
- `randao_mixes`: once-per-slot, 65536 entries, 32 bytes each
- `active_index_roots`: once-per-epoch, 65536 entries, 32 bytes each
- `compact_committees_roots`: once-per-epoch, 65536 entries, 32 bytes each
The potential savings are, frankly, enormous. Don't pass up on this fantastic deal!! [humour]
Nice Properties
- Linear space usage. Storing a `Vector<T, N>` requires `N/k` buckets of size `33 + k s(T)` bytes, i.e. `33N/k + N s(T)` bytes, which is a tad more than the theoretical minimum `N s(T)`. For our active roots example from earlier, this is an overhead of only `2 * 33 * 65536 / 8 = 528KiB`, which is a lot better than ~8TiB.
- Easy garbage collection. Once blocks are finalised, one can scan the database and hose out any chunked entries for forks that didn't eventuate to anything.
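For what it's worth, a throwaway check of that arithmetic (the `k = 8` here is the assumption behind the `/ 8` above):

```rust
/// Overhead of the chunked layout over the theoretical minimum `N * s(T)` bytes:
/// one 33-byte header (32-byte ID + 1 length byte) per chunk, `N / k` chunks per
/// stored copy of the vector's worth of data.
fn chunk_overhead_bytes(n: u64, k: u64, copies: u64) -> u64 {
    copies * 33 * n / k
}

fn main() {
    // active_index_roots: N = 65536, k = 8, two vectors' worth of unique values.
    let overhead = chunk_overhead_bytes(65_536, 8, 2);
    println!("{} bytes = {} KiB", overhead, overhead / 1024); // 540672 bytes = 528 KiB
}
```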
Next Steps
- How bad is it to be increasing the size of values in the NoSQL database so frequently? Should we pre-allocate every chunk to contain a full `k` elements initialised to 0?
- What is the optimal value for `k`? High values mean more duplication in the case of forks, but also more efficient use of space when there is no forking. Higher values also mean bigger writes when updating a single bucket. Lots of factors to trade off against each other.
- Implement a prototype?
Future Work
These are all out of scope for this issue:
- Optimising storage of fields with more complex update patterns, like `validators`, `balances`, `slashings`, etc. (hard).
- Optimising storage of fields with similar-ish update patterns, like `current_crosslinks`, which is updated as-a-whole, once per epoch (easy-ish).
- Optimising all the variable-length lists that are append-only, like `current_epoch_attestations`. We could probably store just new entries in the states themselves, and reconstruct the full vectors from the concatenation of entries in previous states (easy-ish).