
[Execution state] - Checkpointer and Flattener speed and memory improvements #1750

Closed
ramtinms opened this issue Dec 9, 2021 · 8 comments · Fixed by #1944

Comments

@ramtinms
Contributor

ramtinms commented Dec 9, 2021

Currently, the checkpointer and trie flattener reconstruct the nodes as the storable version, which duplicates memory usage when checkpointing. By reusing the same structure, the speed and memory usage of the checkpointer can be improved significantly.

@Ullaakut

@ramtinms Currently, checkpoints contain clear-text ledger payloads, which seems quite inefficient to me, mainly in terms of encoding/decoding speed and, to a lesser extent, storage space usage.

Using a zstandard compression dictionary, for example, could both speed up checkpoint processing (by having less data to read/write) and reduce checkpoint size.

This could be done without breaking changes by still supporting the uncompressed format when reading old checkpoints, of course. WDYT?
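
Roughly what I have in mind, as a sketch rather than a concrete proposal: compress each encoded payload with zstd (here via the klauspost/compress package) and prefix a hypothetical one-byte format tag, so a reader can still fall back to the current clear-text encoding for old checkpoints (the flag could equally live in the checkpoint header). None of these names exist in flow-go.

    package sketch // hypothetical, not flow-go code

    import (
        "fmt"

        "github.com/klauspost/compress/zstd"
    )

    const (
        formatUncompressed byte = 0 // legacy clear-text payload encoding
        formatZstd         byte = 1 // zstd-compressed payload encoding
    )

    // encodePayload compresses a payload and prefixes the format tag.
    func encodePayload(enc *zstd.Encoder, payload []byte) []byte {
        out := make([]byte, 1, 1+len(payload))
        out[0] = formatZstd
        // EncodeAll appends the compressed bytes after the tag byte.
        return enc.EncodeAll(payload, out)
    }

    // decodePayload accepts both formats, so old checkpoints remain readable.
    func decodePayload(dec *zstd.Decoder, data []byte) ([]byte, error) {
        if len(data) == 0 {
            return nil, fmt.Errorf("empty payload data")
        }
        switch data[0] {
        case formatUncompressed:
            return data[1:], nil
        case formatZstd:
            return dec.DecodeAll(data[1:], nil)
        default:
            return nil, fmt.Errorf("unknown payload format %d", data[0])
        }
    }

The encoder/decoder would be created once and reused; a dictionary trained on existing payloads could be passed in via zstd.WithEncoderDict / zstd.WithDecoderDicts.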

@bluesign
Contributor

Payload compression could be further extended with a custom zstd dictionary per Cadence type (I think the limit there is 2^32 dictionaries), decompressing a payload only on first access (or better, keeping decompressed payloads in an LRU cache).
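
Rough sketch of the lazy-decompression part, assuming hashicorp/golang-lru for the cache and a hypothetical per-type decoder registry; none of these names exist in flow-go:

    package sketch // hypothetical, not flow-go code

    import (
        "fmt"

        lru "github.com/hashicorp/golang-lru"
        "github.com/klauspost/compress/zstd"
    )

    // payloadCache decompresses a payload on first access and keeps the result
    // in an LRU cache, so hot payloads are only decompressed once.
    type payloadCache struct {
        decoders map[uint32]*zstd.Decoder // one decoder per type-specific dictionary (hypothetical)
        cache    *lru.Cache               // payload key -> decompressed bytes
    }

    func newPayloadCache(decoders map[uint32]*zstd.Decoder, size int) (*payloadCache, error) {
        c, err := lru.New(size)
        if err != nil {
            return nil, err
        }
        return &payloadCache{decoders: decoders, cache: c}, nil
    }

    // get returns the decompressed payload, decompressing lazily on a cache miss.
    func (p *payloadCache) get(key string, typeID uint32, compressed []byte) ([]byte, error) {
        if v, ok := p.cache.Get(key); ok {
            return v.([]byte), nil
        }
        dec, ok := p.decoders[typeID]
        if !ok {
            return nil, fmt.Errorf("no dictionary registered for type %d", typeID)
        }
        raw, err := dec.DecodeAll(compressed, nil)
        if err != nil {
            return nil, err
        }
        p.cache.Add(key, raw)
        return raw, nil
    }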

@fxamacker
Member

Ideas for Improving Speed and Memory

@ramtinms These are just some ideas that came up while getting familiar with mtrie code in flow-go. Thoughts?

Use stream encoding

The current implementation builds all the storable nodes in memory in FlattenForest() before writing them to the checkpoint file. Because NodeIterator returns nodes in "DESCENDANTS-FIRST-RELATIONSHIP" order, I think we can stream-encode each node returned by the iterator, filtered for uniqueness. The conversion from Node to StorableNode can also be eliminated if we stream-encode each node directly.

The checkpoint file format would need to be modified to enable stream mode. For example, instead of encoding the number of nodes in the header, we can write a marker after all nodes are serialized. The same idea applies to stream-encoding StorableTries.
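
A rough sketch of the loop I have in mind (EncodeNode and endOfNodesMarker are hypothetical and don't exist in flow-go; the iterator usage follows the current flattener API as I understand it):

    package sketch // hypothetical, not flow-go code

    import (
        "fmt"
        "io"

        "github.com/onflow/flow-go/ledger/complete/mtrie/flattener"
        "github.com/onflow/flow-go/ledger/complete/mtrie/node"
        "github.com/onflow/flow-go/ledger/complete/mtrie/trie"
    )

    // endOfNodesMarker is a hypothetical sentinel written after the last node,
    // replacing the node count that currently lives in the header.
    var endOfNodesMarker = make([]byte, 8)

    // EncodeNode is a hypothetical stream encoder that serializes one node
    // directly (no StorableNode), using visited to resolve child node indices.
    func EncodeNode(n *node.Node, visited map[*node.Node]uint64, scratch []byte) []byte {
        // ... encode height, child indices, path, payload, and hash into scratch ...
        return scratch
    }

    // storeNodes stream-encodes every unique node, descendants first, writing
    // each one out immediately instead of building all StorableNodes in memory.
    func storeNodes(w io.Writer, tries []*trie.MTrie) error {
        scratch := make([]byte, 0, 4096)       // reusable encoding buffer
        visited := make(map[*node.Node]uint64) // node -> index of already-written nodes
        var index uint64

        for _, t := range tries {
            for itr := flattener.NewNodeIterator(t); itr.Next(); {
                n := itr.Value()
                if _, ok := visited[n]; ok {
                    continue // already written while traversing an earlier trie
                }
                index++
                visited[n] = index

                if _, err := w.Write(EncodeNode(n, visited, scratch)); err != nil {
                    return fmt.Errorf("cannot write node %d: %w", index, err)
                }
            }
        }

        // No node count in the header; the reader stops at the marker instead.
        if _, err := w.Write(endOfNodesMarker); err != nil {
            return fmt.Errorf("cannot write end-of-nodes marker: %w", err)
        }
        return nil
    }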

Optimize NodeIterator

Only return nodes that weren't visited while traversing other tries.
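
For example, something like this hypothetical descendants-first traversal (not the current NodeIterator; it reuses the node package from the sketch above). Since subtries are shared and immutable between trie versions, a visited root means its whole subtree was already emitted, so traversal can stop there:

    // visitUnique emits each node exactly once across all tries, skipping any
    // subtrie whose root was already visited while traversing an earlier trie.
    func visitUnique(n *node.Node, visited map[*node.Node]struct{}, emit func(*node.Node)) {
        if n == nil {
            return
        }
        if _, ok := visited[n]; ok {
            return // shared subtrie: already traversed via another trie
        }
        visitUnique(n.LeftChild(), visited, emit)
        visitUnique(n.RightChild(), visited, emit)
        visited[n] = struct{}{}
        emit(n) // children come before their parent (descendants first)
    }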

Reduce buffer allocation

Interim nodes are encoded to checkpoint with the same size, so we can reuse a scratch buffer. Same for StorableTrie.

E.g. wal.StoreCheckpoint() can pass a reusable scratch buffer to flattener.EncodeStorableNode() and flattener.EncodeStorableTrie().

@fxamacker
Member

Ideas for Reducing Checkpoint File Size

@ramtinms Maybe we can save over 500MB by not repeating the version number in the checkpoint. Thoughts?

  • EncodeStorableNode() and EncodeStorableTrie() encode encodingDecodingVersion (2 bytes) for every node and trie. Maybe we can move encodingDecodingVersion to a higher level, or eliminate it since there is already a version number in the checkpoint file header? This would save 2 bytes for every serialized node and trie. With ~150 million nodes in the mainnet checkpoint, that would result in about 300MB of savings.

  • Every leaf node payload is encoded with version (2 bytes) + type (1 byte). Maybe we can eliminate these 3 bytes for storable nodes, similar to the TrieUpdate and TrieProof encoding? This would save 3 bytes for every leaf node. With ~70 million leaf nodes in the mainnet checkpoint, that would result in about 210MB of savings.

@Ullaakut

Ullaakut commented Jan 5, 2022

Reduce buffer allocation

Interim nodes are encoded to checkpoint with the same size, so we can reuse a scratch buffer. Same for StorableTrie.

WDYM by that? AFAIK since currently the payloads are encoded in flat nodes, their size is not always the same, which means we cannot directly guess the position of a node in the checkpoint's encoded data unless we have already read the whole checkpoint. But maybe I misunderstood what you meant.

Also, I'm not sure if that's on the table or not, but using extension nodes in some cases could also reduce the checkpoint size, since it would reduce the number of interim nodes that are needed. Last time we talked about it, though, @ramtinms argued that it wouldn't save much, since entries in the trie have a random path distribution, so there is very little overlap between paths that extensions could compact. When we try it on our end we still get some improvements, even though they are indeed not as big as initially expected (the initial test was skewed because it used the path-generation method from the unit tests, which uses right-side padding and results in a lot of savings by getting rid of most of the branches).

@fxamacker
Member

Reduce buffer allocation

Interim nodes are encoded to checkpoint with the same size, so we can reuse a scratch buffer. Same for StorableTrie.

WDYM by that? AFAIK since currently the payloads are encoded in flat nodes, their size is not always the same, which means we cannot directly guess the position of a node in the checkpoint's encoded data unless we have already read the whole checkpoint. But maybe I misunderstood what you meant.

Hi Brendan. Unless I'm mistaken, the interim nodes don't vary in size.

What I meant is: instead of EncodeStorableNode() creating a new buffer for every node, we can create a scratch buffer in the caller and have EncodeStorableNode() reuse that buffer (if it fits). Given that every interim node is encoded to the same size (52 bytes), we can guarantee that all interim node encoding reuses the scratch buffer. We can of course allocate a bigger scratch buffer to account for some leaf nodes as well.

For example:

    scratch := make([]byte, 0, 52) // reusable scratch buffer to reduce the number of allocs

    for i := 1; i < len(storableNodes); i++ {
        bytes := flattener.EncodeStorableNode(storableNodes[i], scratch) // use scratch buffer
        _, err = crc32Writer.Write(bytes)
        if err != nil {
            return fmt.Errorf("error while writing node data: %w", err)
        }
    }
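
On the callee side, the encoder could append into the caller's buffer when it is large enough and only allocate for oversized nodes. A hypothetical sketch using encoding/binary's append helpers (simplified field layout, not the real EncodeStorableNode):

    // encodeNode appends a simplified node record into the caller's scratch
    // buffer, falling back to a fresh allocation only when it doesn't fit
    // (e.g. a leaf node with a large payload).
    func encodeNode(lIndex, rIndex uint64, hash []byte, scratch []byte) []byte {
        need := 8 + 8 + len(hash)

        buf := scratch
        if cap(buf) < need {
            buf = make([]byte, 0, need) // fallback allocation for oversized nodes
        }
        buf = buf[:0]

        buf = binary.BigEndian.AppendUint64(buf, lIndex)
        buf = binary.BigEndian.AppendUint64(buf, rIndex)
        return append(buf, hash...)
    }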

Also I'm not sure if that's on the table or not but using extension nodes

Thank you for the update on extension nodes. I haven't had a chance to take a closer look at it yet (I got sick in early Dec and am starting to feel better now). I'll take a look at your code this week.

@fxamacker
Member

More Ideas for Reducing Checkpoint File Size

If the checkpointer encodes leaf nodes and interim nodes differently, as suggested by Ramtin in "separation of node types" #1745, further reduction is possible.

  • Leaf nodes can skip the MaxDepth (2 bytes), RegCount (8 bytes), LIndex (8 bytes), and RIndex (8 bytes) fields at the expense of adding a type field (1 byte). The total saving for each leaf node is 25 bytes. For ~70 million leaves, that is ~1.7GB of savings for leaf nodes.

  • Interim nodes can skip the MaxDepth (2 bytes), RegCount (8 bytes), path length (2 bytes), and payload length (4 bytes) fields at the expense of adding a type field (1 byte). The total saving for each interim node is 15 bytes. For ~80 million interim nodes, that is ~1.2GB of savings for interim nodes.

Together with the 510MB size reduction from the previous comment, the combined savings are around 3.4GB.
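
A rough sketch of what the two type-tagged record layouts could look like (the field layouts below are illustrative only, not the format proposed in #1745):

    package sketch // hypothetical, not flow-go code

    import "encoding/binary"

    const (
        nodeTypeLeaf    byte = 0
        nodeTypeInterim byte = 1
    )

    // Leaf record: type | height | path length | path | payload length | payload | hash.
    // MaxDepth, RegCount, LIndex and RIndex are dropped entirely.
    func encodeLeaf(height uint16, path, payload, hash, buf []byte) []byte {
        buf = append(buf[:0], nodeTypeLeaf)
        buf = binary.BigEndian.AppendUint16(buf, height)
        buf = binary.BigEndian.AppendUint16(buf, uint16(len(path)))
        buf = append(buf, path...)
        buf = binary.BigEndian.AppendUint32(buf, uint32(len(payload)))
        buf = append(buf, payload...)
        return append(buf, hash...)
    }

    // Interim record: type | height | left child index | right child index | hash.
    // MaxDepth, RegCount, path length and payload length are dropped.
    func encodeInterim(height uint16, lIndex, rIndex uint64, hash, buf []byte) []byte {
        buf = append(buf[:0], nodeTypeInterim)
        buf = binary.BigEndian.AppendUint16(buf, height)
        buf = binary.BigEndian.AppendUint64(buf, lIndex)
        buf = binary.BigEndian.AppendUint64(buf, rIndex)
        return append(buf, hash...)
    }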

@fxamacker
Member

Hi @Ullaakut,

Also, I'm not sure if that's on the table or not, but using extension nodes in some cases could also reduce the checkpoint size, since it would reduce the number of interim nodes that are needed. Last time we talked about it, though, @ramtinms argued that it wouldn't save much, since entries in the trie have a random path distribution, so there is very little overlap between paths that extensions could compact. When we try it on our end we still get some improvements, even though they are indeed not as big as initially expected (the initial test was skewed because it used the path-generation method from the unit tests, which uses right-side padding and results in a lot of savings by getting rid of most of the branches).

I read some of the DPS trie code and I can see how extension nodes are beneficial in some cases.

Some concerns about adding extension nodes to Flow's mtrie include:

  • Adding extension nodes to Flow's mtrie would increase the complexity of updates, which are already involved: Flow's mtrie is immutable and supports batch updates, so extension nodes would add more complexity to that logic.

  • Ramtin mentioned that in Flow's mtrie, keys are distributed almost uniformly (SHA3), so it's very unlikely we'd benefit that much from compaction in the middle of a branch.

Given these concerns, I don't know if there would be enough benefits from adding extension nodes to Flow's mtrie to offset the added complexity.
