
[Execution state] - Checkpointer and Flattener speed and memory improvements #1750

Closed
ramtinms opened this issue Dec 9, 2021 · 8 comments · Fixed by #1944

Comments

@ramtinms
Contributor

ramtinms commented Dec 9, 2021

Currently, the checkpointer and trie flattener reconstruct the nodes as the storable version, which duplicates memory usage when checkpointing. By reusing the same structure, the speed and memory usage of the checkpointer can be improved significantly.

@Ullaakut

@ramtinms Currently, checkpoints contain clear-text ledger payloads, which seems quite inefficient to me, mainly in terms of encoding/decoding speed and, to a lesser extent, storage space usage.

Using a zstandard compression dictionary, for example, could both speed up checkpoint processing (by having less data to read/write) and reduce checkpoint size.

This could be done without breaking changes by still supporting the uncompressed format when reading old checkpoints, of course. WDYT?
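
Roughly what I have in mind, as a sketch rather than a concrete proposal: compress each encoded payload with zstd (here via the klauspost/compress package) and prefix a hypothetical one-byte format tag, so a reader can still fall back to the current clear-text encoding for old checkpoints (the flag could equally live in the checkpoint header). None of these names exist in flow-go.

    package sketch // hypothetical, not flow-go code

    import (
        "fmt"

        "github.com/klauspost/compress/zstd"
    )

    const (
        formatUncompressed byte = 0 // legacy clear-text payload encoding
        formatZstd         byte = 1 // zstd-compressed payload encoding
    )

    // encodePayload compresses a payload and prefixes the format tag.
    func encodePayload(enc *zstd.Encoder, payload []byte) []byte {
        out := make([]byte, 1, 1+len(payload))
        out[0] = formatZstd
        // EncodeAll appends the compressed bytes after the tag byte.
        return enc.EncodeAll(payload, out)
    }

    // decodePayload accepts both formats, so old checkpoints remain readable.
    func decodePayload(dec *zstd.Decoder, data []byte) ([]byte, error) {
        if len(data) == 0 {
            return nil, fmt.Errorf("empty payload data")
        }
        switch data[0] {
        case formatUncompressed:
            return data[1:], nil
        case formatZstd:
            return dec.DecodeAll(data[1:], nil)
        default:
            return nil, fmt.Errorf("unknown payload format %d", data[0])
        }
    }

The encoder/decoder would be created once and reused; a dictionary trained on existing payloads could be passed in via zstd.WithEncoderDict / zstd.WithDecoderDicts.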

@bluesign
Contributor

Payload compression could be further extended with a custom zstd dictionary per Cadence type (I think the limit there is 2^32 dictionaries), decompressing a payload only on first access (or better, keeping decompressed payloads in an LRU cache).
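
Rough sketch of the lazy-decompression part, assuming hashicorp/golang-lru for the cache and a hypothetical per-type decoder registry; none of these names exist in flow-go:

    package sketch // hypothetical, not flow-go code

    import (
        "fmt"

        lru "github.com/hashicorp/golang-lru"
        "github.com/klauspost/compress/zstd"
    )

    // payloadCache decompresses a payload on first access and keeps the result
    // in an LRU cache, so hot payloads are only decompressed once.
    type payloadCache struct {
        decoders map[uint32]*zstd.Decoder // one decoder per type-specific dictionary (hypothetical)
        cache    *lru.Cache               // payload key -> decompressed bytes
    }

    func newPayloadCache(decoders map[uint32]*zstd.Decoder, size int) (*payloadCache, error) {
        c, err := lru.New(size)
        if err != nil {
            return nil, err
        }
        return &payloadCache{decoders: decoders, cache: c}, nil
    }

    // get returns the decompressed payload, decompressing lazily on a cache miss.
    func (p *payloadCache) get(key string, typeID uint32, compressed []byte) ([]byte, error) {
        if v, ok := p.cache.Get(key); ok {
            return v.([]byte), nil
        }
        dec, ok := p.decoders[typeID]
        if !ok {
            return nil, fmt.Errorf("no dictionary registered for type %d", typeID)
        }
        raw, err := dec.DecodeAll(compressed, nil)
        if err != nil {
            return nil, err
        }
        p.cache.Add(key, raw)
        return raw, nil
    }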

@fxamacker
Member

Ideas for Improving Speed and Memory

@ramtinms These are just some ideas that came up while getting familiar with mtrie code in flow-go. Thoughts?

Use stream encoding

The current implementation builds all the storable nodes in memory in FlattenForest() before writing them to the checkpoint file. Because NodeIterator returns nodes in "DESCENDANTS-FIRST-RELATIONSHIP" order, I think we can stream-encode each node returned by the iterator, filtered for uniqueness. The conversion from Node to StorableNode can also be eliminated if we stream-encode each node directly.

The checkpoint file format would need to be modified to enable stream mode. For example, instead of encoding the number of nodes in the header, we can write a marker after all nodes are serialized. The same idea applies to stream-encoding StorableTries.
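
A rough sketch of the loop I have in mind (EncodeNode and endOfNodesMarker are hypothetical and don't exist in flow-go; the iterator usage follows the current flattener API as I understand it):

    package sketch // hypothetical, not flow-go code

    import (
        "fmt"
        "io"

        "github.com/onflow/flow-go/ledger/complete/mtrie/flattener"
        "github.com/onflow/flow-go/ledger/complete/mtrie/node"
        "github.com/onflow/flow-go/ledger/complete/mtrie/trie"
    )

    // endOfNodesMarker is a hypothetical sentinel written after the last node,
    // replacing the node count that currently lives in the header.
    var endOfNodesMarker = make([]byte, 8)

    // EncodeNode is a hypothetical stream encoder that serializes one node
    // directly (no StorableNode), using visited to resolve child node indices.
    func EncodeNode(n *node.Node, visited map[*node.Node]uint64, scratch []byte) []byte {
        // ... encode height, child indices, path, payload, and hash into scratch ...
        return scratch
    }

    // storeNodes stream-encodes every unique node, descendants first, writing
    // each one out immediately instead of building all StorableNodes in memory.
    func storeNodes(w io.Writer, tries []*trie.MTrie) error {
        scratch := make([]byte, 0, 4096)       // reusable encoding buffer
        visited := make(map[*node.Node]uint64) // node -> index of already-written nodes
        var index uint64

        for _, t := range tries {
            for itr := flattener.NewNodeIterator(t); itr.Next(); {
                n := itr.Value()
                if _, ok := visited[n]; ok {
                    continue // already written while traversing an earlier trie
                }
                index++
                visited[n] = index

                if _, err := w.Write(EncodeNode(n, visited, scratch)); err != nil {
                    return fmt.Errorf("cannot write node %d: %w", index, err)
                }
            }
        }

        // No node count in the header; the reader stops at the marker instead.
        if _, err := w.Write(endOfNodesMarker); err != nil {
            return fmt.Errorf("cannot write end-of-nodes marker: %w", err)
        }
        return nil
    }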

Optimize NodeIterator

Only return nodes that weren't visited while traversing other tries.
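
For example, something like this hypothetical descendants-first traversal (not the current NodeIterator; it reuses the node package from the sketch above). Since subtries are shared and immutable between trie versions, a visited root means its whole subtree was already emitted, so traversal can stop there:

    // visitUnique emits each node exactly once across all tries, skipping any
    // subtrie whose root was already visited while traversing an earlier trie.
    func visitUnique(n *node.Node, visited map[*node.Node]struct{}, emit func(*node.Node)) {
        if n == nil {
            return
        }
        if _, ok := visited[n]; ok {
            return // shared subtrie: already traversed via another trie
        }
        visitUnique(n.LeftChild(), visited, emit)
        visitUnique(n.RightChild(), visited, emit)
        visited[n] = struct{}{}
        emit(n) // children come before their parent (descendants first)
    }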

Reduce buffer allocation

Interim nodes are encoded to checkpoint with the same size, so we can reuse a scratch buffer. Same for StorableTrie.

E.g. wal.StoreCheckpoint() can pass a reusable scratch buffer to flattener.EncodeStorableNode() and flattener.EncodeStorableTrie().

@fxamacker
Member

Ideas for Reducing Checkpoint File Size

@ramtinms Maybe we can save over 500MB by not repeating the version number in the checkpoint. Thoughts?

  • EncodeStorableNode() and EncodeStorableTrie() encode encodingDecodingVersion (2 bytes) for every node and trie. Maybe we can move encodingDecodingVersion to a higher level, or eliminate it since there is already a version number in the checkpoint file header? This would save 2 bytes for every serialized node and trie. With ~150 million nodes in the mainnet checkpoint, that would result in about 300MB of savings.

  • Every leaf node payload is encoded with version (2 bytes) + type (1 byte). Maybe we can eliminate these 3 bytes for storable nodes, similar to the TrieUpdate and TrieProof encoding? This would save 3 bytes for every leaf node. With ~70 million leaf nodes in the mainnet checkpoint, that would result in about 210MB of savings.

@Ullaakut

Ullaakut commented Jan 5, 2022

Reduce buffer allocation

Interim nodes are encoded to checkpoint with the same size, so we can reuse a scratch buffer. Same for StorableTrie.

WDYM by that? AFAIK since currently the payloads are encoded in flat nodes, their size is not always the same, which means we cannot directly guess the position of a node in the checkpoint's encoded data unless we have already read the whole checkpoint. But maybe I misunderstood what you meant.

Also, I'm not sure if that's on the table or not, but using extension nodes in some cases could also reduce the checkpoint size, since it would reduce the number of interim nodes that are needed. Last time we talked about it, though, @ramtinms argued that it wouldn't save much, since entries in the trie have a random path distribution, so there is very little overlap between paths that extensions could compact. When we try it on our end we still get some improvements, even though they are indeed not as big as initially expected (the initial test was skewed because it used the path-generation method from the unit tests, which uses right-side padding and results in a lot of savings by getting rid of most of the branches).

@fxamacker
Member

Reduce buffer allocation

Interim nodes are encoded to checkpoint with the same size, so we can reuse a scratch buffer. Same for StorableTrie.

WDYM by that? AFAIK since currently the payloads are encoded in flat nodes, their size is not always the same, which means we cannot directly guess the position of a node in the checkpoint's encoded data unless we have already read the whole checkpoint. But maybe I misunderstood what you meant.

Hi Brendan. Unless I'm mistaken, the interim nodes don't vary in size.

What I meant is: instead of EncodeStorableNode() creating a new buffer for every node, we can create a scratch buffer in the caller and have EncodeStorableNode() reuse that buffer (if it fits). Given that every interim node is encoded to the same size (52 bytes), we can guarantee that all interim node encoding reuses the scratch buffer. We can of course allocate a bigger scratch buffer to account for some leaf nodes as well.

For example:

    scratch := make([]byte, 0, 52) // reusable scratch buffer to reduce the number of allocs

    for i := 1; i < len(storableNodes); i++ {
        bytes := flattener.EncodeStorableNode(storableNodes[i], scratch) // use scratch buffer
        _, err = crc32Writer.Write(bytes)
        if err != nil {
            return fmt.Errorf("error while writing node data: %w", err)
        }
    }
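
On the callee side, the encoder could append into the caller's buffer when it is large enough and only allocate for oversized nodes. A hypothetical sketch using encoding/binary's append helpers (simplified field layout, not the real EncodeStorableNode):

    // encodeNode appends a simplified node record into the caller's scratch
    // buffer, falling back to a fresh allocation only when it doesn't fit
    // (e.g. a leaf node with a large payload).
    func encodeNode(lIndex, rIndex uint64, hash []byte, scratch []byte) []byte {
        need := 8 + 8 + len(hash)

        buf := scratch
        if cap(buf) < need {
            buf = make([]byte, 0, need) // fallback allocation for oversized nodes
        }
        buf = buf[:0]

        buf = binary.BigEndian.AppendUint64(buf, lIndex)
        buf = binary.BigEndian.AppendUint64(buf, rIndex)
        return append(buf, hash...)
    }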

Also I'm not sure if that's on the table or not but using extension nodes

Thank you for the update on extension nodes. I haven't had a chance to take a closer look at it yet (I got sick in early Dec and am starting to feel better now). I'll take a look at your code this week.

@fxamacker
Member

More Ideas for Reducing Checkpoint File Size

If the checkpointer encodes leaf nodes and interim nodes differently, as suggested by Ramtin in "separation of node types" #1745, further reduction is possible.

  • Leaf nodes can skip the MaxDepth (2 bytes), RegCount (8 bytes), LIndex (8 bytes), and RIndex (8 bytes) fields at the expense of adding a type field (1 byte). The total saving for each leaf node is 25 bytes. For ~70 million leaves, that is ~1.7GB of savings for leaf nodes.

  • Interim nodes can skip the MaxDepth (2 bytes), RegCount (8 bytes), path length (2 bytes), and payload length (4 bytes) fields at the expense of adding a type field (1 byte). The total saving for each interim node is 15 bytes. For ~80 million interim nodes, that is ~1.2GB of savings for interim nodes.

Together with the 510MB size reduction from the previous comment, the combined savings are around 3.4GB.
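
A rough sketch of what the two type-tagged record layouts could look like (the field layouts below are illustrative only, not the format proposed in #1745):

    package sketch // hypothetical, not flow-go code

    import "encoding/binary"

    const (
        nodeTypeLeaf    byte = 0
        nodeTypeInterim byte = 1
    )

    // Leaf record: type | height | path length | path | payload length | payload | hash.
    // MaxDepth, RegCount, LIndex and RIndex are dropped entirely.
    func encodeLeaf(height uint16, path, payload, hash, buf []byte) []byte {
        buf = append(buf[:0], nodeTypeLeaf)
        buf = binary.BigEndian.AppendUint16(buf, height)
        buf = binary.BigEndian.AppendUint16(buf, uint16(len(path)))
        buf = append(buf, path...)
        buf = binary.BigEndian.AppendUint32(buf, uint32(len(payload)))
        buf = append(buf, payload...)
        return append(buf, hash...)
    }

    // Interim record: type | height | left child index | right child index | hash.
    // MaxDepth, RegCount, path length and payload length are dropped.
    func encodeInterim(height uint16, lIndex, rIndex uint64, hash, buf []byte) []byte {
        buf = append(buf[:0], nodeTypeInterim)
        buf = binary.BigEndian.AppendUint16(buf, height)
        buf = binary.BigEndian.AppendUint64(buf, lIndex)
        buf = binary.BigEndian.AppendUint64(buf, rIndex)
        return append(buf, hash...)
    }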

@fxamacker
Member

Hi @Ullaakut,

Also, I'm not sure if that's on the table or not, but using extension nodes in some cases could also reduce the checkpoint size, since it would reduce the number of interim nodes that are needed. Last time we talked about it, though, @ramtinms argued that it wouldn't save much, since entries in the trie have a random path distribution, so there is very little overlap between paths that extensions could compact. When we try it on our end we still get some improvements, even though they are indeed not as big as initially expected (the initial test was skewed because it used the path-generation method from the unit tests, which uses right-side padding and results in a lot of savings by getting rid of most of the branches).

I read some of the DPS trie code and I can see how extension nodes are beneficial in some cases.

Some concerns about adding extension nodes to Flow's mtrie include:

  • Adding extension nodes to Flow's mtrie would increase the complexity of updates, which are already involved: Flow's mtrie is immutable and supports batch updates, so extension nodes would add more complexity to that logic.

  • Ramtin mentioned that in Flow's mtrie, keys are distributed almost uniformly (SHA3), so it's very unlikely we'd benefit that much from compaction in the middle of a branch.

Given these concerns, I don't know if there would be enough benefits from adding extension nodes to Flow's mtrie to offset the added complexity.
