Optimize MTrie checkpoint: 47x speedup (11.7 hours -> 15 mins), -431 GB alloc/op, -7.6 billion allocs/op, -6.9 GB file size #1944
Conversation
NewUniqueNodeIterator() can be used to optimize node iteration for a forest. It skips shared sub-tries that were already visited and only iterates unique nodes.
Use NewUniqueNodeIterator() in FlattenForest() to skip traversing visited shared sub-tries while flattening the forest.
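The idea behind the unique-node iteration can be sketched as a traversal keyed on node pointer identity, so sub-tries shared between tries (common after copy-on-write register updates) are visited only once. This is a simplified sketch, not flow-go's actual `NodeIterator`; the `Node` type and `uniqueNodes` helper are hypothetical stand-ins:

```go
package main

import "fmt"

// Node is a simplified stand-in for the trie node type
// (the real node.Node in flow-go also carries hash, path, payload, etc.).
type Node struct {
	Left, Right *Node
	Label       string
}

// uniqueNodes walks a set of root nodes and visits each distinct node
// exactly once, keyed by pointer identity. Shared sub-tries are
// therefore skipped when reached again from later roots.
func uniqueNodes(roots []*Node, visit func(*Node)) {
	visited := make(map[*Node]struct{})
	var walk func(n *Node)
	walk = func(n *Node) {
		if n == nil {
			return
		}
		if _, ok := visited[n]; ok {
			return // shared sub-trie already visited: skip
		}
		visited[n] = struct{}{}
		walk(n.Left)
		walk(n.Right)
		visit(n) // post-order: children before parent
	}
	for _, r := range roots {
		walk(r)
	}
}

func main() {
	shared := &Node{Label: "shared"}
	t1 := &Node{Left: shared, Label: "root1"}
	t2 := &Node{Left: shared, Label: "root2"} // t2 shares a sub-trie with t1
	count := 0
	uniqueNodes([]*Node{t1, t2}, func(*Node) { count++ })
	fmt.Println(count) // shared node counted once: 3 distinct nodes
}
```

Post-order visitation matters for checkpointing: children are serialized (and assigned indices) before the parent that references them.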
- Merge FlattenForest() with StoreCheckpoint() to iterate and serialize nodes without creating intermediate StorableNode/StorableTrie objects.
- Stream encode nodes to avoid creating a 400+ million element slice holding all nodes.
- Change checkpoint file format (v4) to store node count and trie count at the footer (instead of the header), as required for stream encoding.
- Support previous checkpoint formats (v1, v3).
- Merge RebuildTries() with LoadCheckpoint() to deserialize data to nodes without creating intermediate StorableNode/StorableTrie objects.
- Avoid creating a 400+ million element slice holding all StorableNodes read from the checkpoint file.
- DiskWal.Replay*() APIs are changed: checkpointFn receives []*trie.MTrie instead of FlattenedForest.
- Remove files containing StorableNode/StorableTrie/FlattenedForest, etc.:
  * mtrie/flattener/forest.go
  * mtrie/flattener/forest_test.go
  * mtrie/flattener/storables.go
  * mtrie/flattener/trie.go
  * mtrie/flattener/trie_test.go
Further reduction of 4.4+ GB is planned, for a total reduction of 10.2+ GB (see TODOs at bottom).

Leaf node is encoded as:
- node type (1 byte) (new in v4)
- height (2 bytes)
- max depth (2 bytes)
- reg count (8 bytes)
- hash (2 bytes + 32 bytes)
- path (2 bytes + 32 bytes)
- payload (4 bytes + n bytes)

Encoded payload size is also reduced by removing the prefix (version 2 bytes + type 1 byte).

Interim node is encoded as:
- node type (1 byte) (new in v4)
- height (2 bytes)
- max depth (2 bytes)
- reg count (8 bytes)
- lchild index (8 bytes)
- rchild index (8 bytes)
- hash (2 bytes + 32 bytes)

Trie is encoded as:
- root node index (8 bytes)
- hash (2 bytes + 32 bytes)

Removed v3 leaf node fields:
- version (2 bytes)
- left child index (8 bytes)
- right child index (8 bytes)
- payload version and type (3 bytes)

Removed v3 interim node fields:
- version (2 bytes)
- path (2 bytes)
- payload (4 bytes)

Removed v3 trie field:
- version (2 bytes)

Leaf node data is reduced by 20 bytes (2+8+8+3-1). Interim node data is reduced by 7 bytes (2+2+4-1). Trie is reduced by 2 bytes.

TODO: remove max depth and reg count fields from both leaf node and interim node types.
TODO: reduce hash length from 2 bytes to 1 byte for both leaf node and interim node types.
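The leaf-node layout above can be sketched as an append-style encoder. The field order follows the description; the type-tag value, big-endian byte order, and function name are assumptions for illustration, not the PR's actual constants:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// leafNodeType is a placeholder tag value, not the real v4 constant.
const leafNodeType byte = 0

// encodeLeafNode appends a v4-style leaf node to buf, following the
// field order described above:
// type(1) | height(2) | max depth(2) | reg count(8) |
// hash len(2)+hash(32) | path len(2)+path(32) | payload len(4)+payload(n).
func encodeLeafNode(buf []byte, height, maxDepth uint16, regCount uint64, hash, path [32]byte, payload []byte) []byte {
	buf = append(buf, leafNodeType)
	buf = binary.BigEndian.AppendUint16(buf, height)
	buf = binary.BigEndian.AppendUint16(buf, maxDepth)
	buf = binary.BigEndian.AppendUint64(buf, regCount)
	buf = binary.BigEndian.AppendUint16(buf, 32) // hash length prefix
	buf = append(buf, hash[:]...)
	buf = binary.BigEndian.AppendUint16(buf, 32) // path length prefix
	buf = append(buf, path[:]...)
	buf = binary.BigEndian.AppendUint32(buf, uint32(len(payload)))
	buf = append(buf, payload...)
	return buf
}

func main() {
	var hash, path [32]byte
	enc := encodeLeafNode(nil, 255, 3, 1, hash, path, []byte("value"))
	// fixed part: 1+2+2+8 + (2+32) + (2+32) + 4 = 85 bytes, plus payload
	fmt.Println(len(enc)) // 90
}
```

Because each field has a fixed size or a length prefix, nodes can be stream encoded one at a time; only the node/trie counts (unknown until the end) must go in the footer.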
Hey @fxamacker, first of all, great work. Do these changes allow using the checkpoint file without loading it entirely into memory? If not, is it possible to add an index at the end? It would be great to have the ability to query the checkpoint file with a CLI tool for lookups. But with the recent changes using atree on storage, I have to import the whole checkpoint file into Badger (or a similar DB) or load it fully into memory.
Codecov Report
@@ Coverage Diff @@
## master #1944 +/- ##
==========================================
+ Coverage 57.18% 57.29% +0.11%
==========================================
Files 634 633 -1
Lines 36679 36833 +154
==========================================
+ Hits 20976 21105 +129
- Misses 13068 13077 +9
- Partials 2635 2651 +16
Use a 4096-byte scratch buffer to eliminate another 400+ million allocs during checkpoint reading.
Use a 4096-byte scratch buffer to eliminate another 400+ million allocs during checkpoint writing.
Hello @bluesign,
The entire checkpoint file will be stream decoded in order to create in-memory tries. Currently, the in-memory tries will have all the data from checkpoint file. Issue #1746 will be handled by a separate PR after this one is merged.
I understand.
Maybe (depends on details/tradeoffs/etc.), but adding an index is outside the scope of this PR. There's a chance this PR might store the payload in a separate file, which would be a tiny intermediate step towards #1746. Not sure, but maybe this step could be useful to projects like https://github.com/bluesign/checkpointState. Please consider opening an issue to add an index to the checkpoint file, so discussions can continue there if needed. This PR is focused on optimizing checkpoint creation and loading for speed, memory, and file size.
Decode payload from the input buffer and copy only the shared data, to avoid extra allocs of the payload's Key.KeyParts ([]KeyPart). Loading checkpoint v4 and replaying WALs used 3,076,382,277 fewer allocs/op compared to v3. Prior to this change: when decoding a payload during checkpoint loading, the payload object was created with shared data from the read buffer and then deep copied. Since the payload contains key parts as []KeyPart, a new slice was created during deep copying.
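The fix can be sketched with simplified stand-ins for flow-go's ledger types: instead of building a payload that aliases the read buffer and then deep-copying it (allocating the KeyParts slice twice), the decoder builds the payload in one pass and copies each byte slice out of the shared buffer as it goes. Types and the function name here are illustrative, not the PR's actual code:

```go
package main

import "fmt"

// Simplified stand-ins for ledger.KeyPart / ledger.Payload.
type KeyPart struct {
	Type  uint16
	Value []byte
}
type Payload struct {
	KeyParts []KeyPart
	Value    []byte
}

// decodePayloadCopying builds the payload in one pass, copying each
// byte slice out of the shared read buffer. The KeyParts slice is
// allocated exactly once, unlike decode-then-deep-copy which
// allocated it a second time.
func decodePayloadCopying(parts [][]byte, value []byte) *Payload {
	p := &Payload{KeyParts: make([]KeyPart, len(parts))}
	for i, raw := range parts {
		v := make([]byte, len(raw))
		copy(v, raw) // copy shared data so the payload outlives the buffer
		p.KeyParts[i] = KeyPart{Type: uint16(i), Value: v}
	}
	p.Value = append([]byte(nil), value...)
	return p
}

func main() {
	buf := []byte("owner-key-value") // stands in for the shared read buffer
	p := decodePayloadCopying([][]byte{buf[:5], buf[6:9]}, buf[10:])
	buf[0] = 'X' // mutating the buffer must not affect the payload
	fmt.Println(string(p.KeyParts[0].Value), string(p.Value))
}
```

Copying is still required (the scratch/read buffer is reused), but it happens once per byte slice rather than via an extra intermediate object.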
EncodeAndAppendPayloadWithoutPrefix() appends the encoded payload to the input buffer. If the payload is nil, the unmodified buffer is returned. This edge case is uncommon and wasn't triggered when creating checkpoint.3485 from checkpoint.3443 with 41 WAL segments (both v3->v4 and v4->v4). Add tests.
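The append-style contract and its nil edge case can be sketched like this (a minimal illustration with a hypothetical `Payload` type and length-prefix layout, not the actual flow-go encoder):

```go
package main

import (
	"encoding/binary"
	"fmt"
)

type Payload struct {
	Value []byte
}

// encodeAndAppendPayload appends an encoded payload to buf and returns
// the extended buffer. A nil payload is the edge case noted above:
// the buffer is returned unmodified, with nothing appended.
func encodeAndAppendPayload(buf []byte, p *Payload) []byte {
	if p == nil {
		return buf // edge case: nil payload, buffer unchanged
	}
	buf = binary.BigEndian.AppendUint32(buf, uint32(len(p.Value)))
	return append(buf, p.Value...)
}

func main() {
	buf := []byte{0xAA}
	fmt.Println(len(encodeAndAppendPayload(buf, nil)))                           // 1: unmodified
	fmt.Println(len(encodeAndAppendPayload(buf, &Payload{Value: []byte("hi")}))) // 1+4+2 = 7
}
```

Append-style encoders compose well with the scratch-buffer approach: the caller owns the buffer, and the encoder never allocates unless `append` must grow it.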
NewForest() doesn't have a logger. The error is returned by the onTreeEvicted() callback function passed to NewForest() by NewLedger(). NewLedger() has a Logger and defines onTreeEvicted(), so we can probably log the error from inside onTreeEvicted() instead. However, that change is outside the scope of PR #1944.
Add benchmarks for checkpoint creation and checkpoint loading.

Checkpoint creation time can vary. For example:
- 926 seconds (first run after OS boot)
- 863 seconds (2nd run of benchmark without OS reboot in between)

Also fix an off-by-one error in the checkpoint filename created by benchmarks:
- was checkpoint.00003485 (number should've been 3484)
- now checkpoint.00003484 (same file size and hash, only the name changed)

For BenchmarkNewCheckpoint and BenchmarkLoadCheckpointAndWALs: if the current folder doesn't contain the checkpoint and WAL files, then the -dir option should be used. These two benchmarks should be used to benchmark real data. For example, use checkpoint.00003443 and 41 WAL files (00003444 - 00003484) to create checkpoint.00003484.

BenchmarkNewCheckpointRandom* generates random WAL segments and doesn't require the -dir option or any files. These can be used for regression tests.
I just took a high-level look at the PR. I mainly focused my review on changes to the core logic of the trie. I didn't go deep into the checkpointing logic itself, as I am not familiar with that part of the code base.
// NodeIterator only uses visitedNodes for read operation.
// No special handling is needed if visitedNodes is nil.
// WARNING: visitedNodes is not safe for concurrent use.
visitedNodes map[*node.Node]uint64
nice
// and updated MTrie after register writes).
// NodeIterator only uses visitedNodes for read operation.
// No special handling is needed if visitedNodes is nil.
// WARNING: visitedNodes is not safe for concurrent use.
could you add this line also to the existing NewNodeIterator constructor please:

func NewNodeIterator(mTrie *trie.MTrie) *NodeIterator {
I'm not sure we should add the same concurrency warning here because visitedNodes will always be nil, which appears to make it safe for concurrent use (unless I'm mistaken).

I added this comment to NewNodeIterator:

// NodeIterator created by NewNodeIterator is safe for concurrent use
// because visitedNodes is always nil in this case.

Please let me know if I need to update the comment. 🙏
- Remove return value (error) from onTreeEvicted()
- Log error in onTreeEvicted() in NewLedger()
- Replace all empty onTreeEvicted callbacks with nil
NodeIterator created by NewNodeIterator is safe for concurrent use because visitedNodes is always nil in this case.
Description
Optimize checkpoint creation (includes loading):
Most of the optimizations were proposed in comments to issue #1750. I'm moving all remaining optimizations (concurrency, compression, etc.) to separate PRs, so this is ready for review as-is.
Increased interim + leaf node counts are causing checkpoint creation to take hours. This PR sacrifices some readability and simplicity as tradeoffs and gains speed, memory efficiency, and storage efficiency.
I limited the scope of this PR to optimizations that don't require performance tradeoffs or overhead (like adding processes or IPC).
Big thanks to @ramtinms for opening #1750 to point out that trie flattening can have big optimizations. 👍
Closes #1750
Closes #1884
Updates #1744, #1746, https://github.com/dapperlabs/flow-go/issues/6114
Impact on Execution Nodes
EN Startup Time
This PR speeds up EN startup time in several ways:
See issue #1884 for more info about extra WAL segments causing EN startup delays.
EN Memory Use and System Requirements
This PR can reduce long-held data in RAM (15+ hours on EN3) by up to 116 GB. Additionally, eliminating 431 GB alloc/op and 7.6 billion allocs/op will reduce load on the Go garbage collector.
Benchmark Comparisons
Unoptimized Checkpoint Creation
After the first 30 minutes and for the next 11+ hours (15-17+ hours on EN3):
Optimized Checkpoint Creation
Finishes in 15+ minutes and peaks in the last minute at:
Preliminary Results (WIP) Without Adding Concurrency Yet
MTrie Checkpoint Load+Create v3 (old) vs v4 (WIP)
UPDATE: on March 1, optimized checkpoint creation speed (v4 -> v4 with 41 WALs) varied by 63 seconds between the first 2 runs (all 3 used same input files to create same output):
Load Checkpoint File + replay 41 WALs v3 (old) vs v4 (WIP)
Changes include:
TODO
Additional TODOs that will probably be wrapped up in a separate PR
DecodeKey(), DecodeKeyPart(), and DecodePayload(). Not high priority because these functions appear to be unused.

Maybe split checkpoint file into 3 files (metadata, nodes, and payload file): dropped. I synced with Ramtin and his preference is to keep the checkpoint as one file for this PR.