
[EN Performance] Optimize checkpoint serialization for -37GB operational RAM, -2.7 minutes duration, -19.6 million allocs (50% fewer allocs) #3050

Merged
merged 7 commits into master from fxamacker/reduce-checkpoint-serialization-memory on Aug 23, 2022

Conversation

fxamacker (Member) commented Aug 22, 2022

The primary goal is to reduce operational RAM used by checkpoint v5. Secondary goals include speeding up checkpointing and redesigning the code to simplify adding concurrency in the next PR.

UPDATE: 🚀 Full checkpointing v5 finishes in 12-13 minutes on EN4 and reduced peak memory use more than expected. This PR was merged on Aug 23 and deployed to EN4.mainnet19 on Oct 7, 2022.

This PR replaces the largest data structure used for checkpoint serialization. During serialization, subtries are processed instead of an entire trie at once. The changes also focus on preallocation to increase memory savings.

Serializing data in parallel is made easier (because this PR splits the mtrie into multiple subtries), but adding parallelism is outside the scope of this PR. Issue #3075 should be used to determine whether parallelism is worthwhile (at this time) before implementing it, because parallelism has tradeoffs such as consuming more RAM.
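
To make the subtrie idea concrete, here is a minimal sketch of the general approach (hypothetical `Node` type and helpers, not the actual flow-go implementation): each level-4 subtrie is serialized with its own small, preallocated node→index map, instead of one huge map holding every unique node of every trie.

```go
// Hypothetical sketch of serializing a trie in subtrie-sized chunks;
// types and helpers are illustrative, not the flow-go API.
package sketch

type Node struct {
	Left, Right *Node
	// payload fields omitted
}

// collectSubtrieRoots returns the nodes at the given level (root is at level 0).
func collectSubtrieRoots(root *Node, level int) []*Node {
	if root == nil {
		return nil
	}
	if level == 0 {
		return []*Node{root}
	}
	return append(
		collectSubtrieRoots(root.Left, level-1),
		collectSubtrieRoots(root.Right, level-1)...)
}

// serializeBySubtrie walks each level-4 subtrie with its own small,
// preallocated node→index map instead of one map holding every node.
func serializeBySubtrie(root *Node, estimatedSubtrieSize int, write func(*Node, uint64)) {
	var nextIndex uint64 = 1 // 0 is reserved for "no child"
	subtrieRootIndex := make(map[*Node]uint64, 16)

	for _, subtrieRoot := range collectSubtrieRoots(root, 4) {
		if _, done := subtrieRootIndex[subtrieRoot]; done {
			continue // shared subtrie already serialized
		}
		// Small per-subtrie map; preallocating avoids repeated map growth.
		indexOf := make(map[*Node]uint64, estimatedSubtrieSize)
		var visit func(n *Node)
		visit = func(n *Node) {
			if n == nil {
				return
			}
			if _, ok := indexOf[n]; ok {
				return
			}
			visit(n.Left) // descendants are written before their parent
			visit(n.Right)
			indexOf[n] = nextIndex
			write(n, nextIndex)
			nextIndex++
		}
		visit(subtrieRoot)
		subtrieRootIndex[subtrieRoot] = indexOf[subtrieRoot]
	}
	// The nodes above level 4 (the top of the trie) would be serialized
	// last, referencing subtrie roots via subtrieRootIndex; omitted here.
}
```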

Closes #2964
Updates #1744
Updates #3075

Preliminary Results Using Level 4 (16 Subtries)

Using August 12 mainnet checkpoint file:

  • -37GB peak RAM (top command), -23GB RAM (go bench B/op)
  • -19.6 million (-50%) allocs/op in serialization phase
  • -2.7 minutes duration
Before:    625746 ms    88320868048 B/op    39291999 allocs/op
After:     461937 ms    64978613264 B/op    19671410 allocs/op

Root is at Level 0.
Benchmark used Go 1.18.5 on benchnet-dev-004.
No benchstat comparisons yet (n=5+) due to duration and memory required.

Tests

This PR passed unit tests and round-trip tests before it was merged to master on August 23, 2022:

  • On Sunday, August 21, 2022, I confirmed it passed round-trip tests using a 150GB checkpoint file (August 12 checkpoint file from mainnet). The final 150GB output exactly matched expected results (b2sum of 150GB files matched).
  • On Wednesday, August 31, 2022, another person mentioned in the standup meeting that a different test (comparing file size) also produced expected results.

NOTE: As of Sept 13, 2022 this PR has not been merged to mainnet.

EDIT: Added more details after reading PR review questions.

  • Clarified that the root is at level 0 and that we're using level 4 (16 subtries).
  • Mentioned tests, including the round-trip tests on Aug 21 that passed before this PR was merged to master on Aug 23.
  • Mentioned issue #3075 to replace "issue will be opened" about adding parallelism, which is made easier by this PR.
  • Made it clearer that this PR is not yet deployed to mainnet.

NodeIterator is modified to receive *node.Node instead of *trie.MTrie.
Replace the very large Go map holding all unique nodes with a smaller
map for each subtrie to:
- reduce operational memory by 37GB
- reduce allocs by 19.6 million (50% of serialization allocs)
- reduce duration by 2.7 minutes
@fxamacker fxamacker added Performance Execution Cadence Execution Team labels Aug 22, 2022
@fxamacker fxamacker self-assigned this Aug 22, 2022
@fxamacker fxamacker requested a review from zhangchiqing August 22, 2022 17:27
@@ -115,7 +114,7 @@ func (i *NodeIterator) Next() bool {
// initial call to Next() for a non-empty trie
i.dig(i.unprocessedRoot)
i.unprocessedRoot = nil
return true
return len(i.stack) > 0
Contributor:

Why did we change this? It seems like an important change, but I don't see why we would

Member Author:

This is a bug fix for a problem that hadn't surfaced yet because of the way the node iterator was used.

The bug: the unique node iterator's Next() returns true when i.unprocessedRoot has already been visited and i.stack is empty.

This bug doesn't happen when we iterate the nodes of an entire trie (root nodes are always unique). In this PR, we iterate the nodes of subtries, and subtries can be shared and already visited. So instead of always returning true, assuming there's at least one unique node when digging i.unprocessedRoot, we only return true when there are unique nodes in the internal stack after calling dig(i.unprocessedRoot).
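
To make the failure mode concrete, here is a self-contained sketch (hypothetical types and condensed logic, not the actual flow-go NodeIterator) showing why the unconditional `return true` was wrong once a subtrie root can already be in visitedNodes:

```go
// Minimal illustration of the Next() fix; not the flow-go NodeIterator.
package sketch

type node struct{ left, right *node }

type iterator struct {
	unprocessedRoot *node
	stack           []*node
	visited         map[*node]struct{} // shared across subtrie iterations
}

// dig pushes n and its unvisited descendants onto the stack.
// If n (e.g. a shared subtrie root) was already visited, nothing is pushed.
func (i *iterator) dig(n *node) {
	if n == nil {
		return
	}
	if _, ok := i.visited[n]; ok {
		return
	}
	i.visited[n] = struct{}{}
	i.dig(n.left)
	i.dig(n.right)
	i.stack = append(i.stack, n)
}

func (i *iterator) Next() bool {
	if i.unprocessedRoot != nil {
		i.dig(i.unprocessedRoot)
		i.unprocessedRoot = nil
		// Old behavior: return true. If the root was already visited,
		// dig() pushed nothing, so there is no node to return.
		return len(i.stack) > 0
	}
	// popping the next node from i.stack is elided in this sketch
	return len(i.stack) > 0
}
```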

Contributor:

Thanks for the explanation! It does make sense.
Would it be possible to maybe add a test catching this particular bug, and showing how the fix helps?

Member Author:

Would it be possible to maybe add a test catching this particular bug, and showing how the fix helps?

A test for this bug is already in iterator_test.go#L269-L396.

The test iterates 3 left subtries and 3 right subtries (some subtries are shared). The test verifies that:

  • the order of iterated nodes is descendants first
  • shared subtries/nodes are not iterated twice
  • a non-nil node is returned (meaning as long as Next() returns true, Value() returns a non-nil node)
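
For illustration only, the dedup and non-nil checks could look roughly like this (hypothetical types and a stubbed iterate helper; the real assertions live in iterator_test.go):

```go
// Hypothetical sketch of the dedup property checked by the real test;
// the iterate helper is a stub, not the flow-go API.
package sketch

import "testing"

type testNode struct{ left, right *testNode }

// iterate stands in for draining a NodeIterator over several subtries
// that share nodes; it returns every node the iterator yielded.
func iterate(roots []*testNode) []*testNode {
	// ... would wrap NewNodeIterator(root) for each subtrie root ...
	return nil
}

func TestNoNodeIteratedTwice(t *testing.T) {
	shared := &testNode{}
	left := &testNode{left: shared}
	right := &testNode{right: shared}

	seen := make(map[*testNode]struct{})
	for _, n := range iterate([]*testNode{left, right}) {
		if n == nil {
			t.Fatal("iterator yielded a nil node")
		}
		if _, dup := seen[n]; dup {
			t.Fatalf("node %p iterated twice", n)
		}
		seen[n] = struct{}{}
	}
}
```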

@@ -30,7 +30,7 @@ func TestPopulatedTrie(t *testing.T) {
emptyTrie := trie.NewEmptyMTrie()

// key: 0000...
p1 := utils.PathByUint8(1)
p1 := utils.PathByUint8(0)
Contributor:

It's the only place we change it in a test - is this value irrelevant, or has the internal working changed somehow?

Member Author:

is this value irrelevant, or has the internal working changed somehow?

Yes, this value is irrelevant, and the internal working hasn't changed.

The intent is to use p1 as a left leaf node and p2 as a right leaf node of the same parent.

Given that p2's path is 0100 0000, created using utils.PathByUint8(64), p1's path can be either of these two paths:

  • 0000 0000 created using utils.PathByUint8(0)
  • 0000 0001 created using utils.PathByUint8(1)

I changed the p1 path to utils.PathByUint8(0) to be consistent with its comment // key: 0000..., which doesn't change the intention of the test.
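
To make the bit patterns concrete, here is a tiny standalone snippet (assuming utils.PathByUint8(n) simply sets the first byte of the path to n, as the comments above suggest):

```go
// Standalone illustration of the relevant first-byte bit patterns
// (assumes PathByUint8(n) sets the first byte of the path to n).
package main

import "fmt"

func main() {
	for _, b := range []uint8{0, 1, 64} {
		fmt.Printf("PathByUint8(%d) first byte: %08b\n", b, b)
	}
	// Prints:
	//   PathByUint8(0) first byte: 00000000
	//   PathByUint8(1) first byte: 00000001
	//   PathByUint8(64) first byte: 01000000
	// The second bit separates p2 (01...) from p1 (00...), so p1 ends up
	// as the left leaf and p2 as the right leaf under the same parent.
}
```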

m4ksio (Contributor) commented Aug 22, 2022

Looks good overall and a really smart idea!
But one thing I don't get, or maybe my understanding of these changes isn't right - where does such a large amount of RAM savings come from?
This significantly reduces the size of allNodes, but all nodes are still serialized. Since allNodes is just a map from pointer to uint64, the memory reduction should be rather small in this case.

@@ -76,13 +75,13 @@ type NodeIterator struct {
// as for each node, the children have been previously encountered.
// NodeIterator created by NewNodeIterator is safe for concurrent use
// because visitedNodes is always nil in this case.
func NewNodeIterator(mTrie *trie.MTrie) *NodeIterator {
func NewNodeIterator(n *node.Node) *NodeIterator {
Contributor:

thanks for fixing this node iterator to be a proper node iterator.

fxamacker (Member Author) commented Aug 22, 2022

Looks good overall and a really smart idea! But one thing I don't get, or maybe my understanding of these changes isn't right - where does such a large amount of RAM savings come from? This significantly reduces the size of allNodes, but all nodes are still serialized. Since allNodes is just a map from pointer to uint64, the memory reduction should be rather small in this case.

@m4ksio yeah, that was my thought initially too, but other aspects, like the memory savings from preallocation, are huge (for very large maps).

Preallocation saves a lot of memory even with the same map size of 1 million elements:

BenchmarkMap1000000-4               	      91114871 ns/op	 7299274 B/op	    3198 allocs/op
BenchmarkPreallocatedMap1000000-4   	      81377072 ns/op	 2874023 B/op	       1 allocs/op
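
The benchmark code itself isn't shown in this thread; a minimal sketch of that kind of comparison (hypothetical benchmark, not the exact code behind the numbers above) would be:

```go
// Hypothetical benchmark comparing map growth with and without
// preallocation; not the exact code behind the numbers above.
package sketch

import "testing"

const size = 1_000_000

func BenchmarkMap1000000(b *testing.B) {
	for i := 0; i < b.N; i++ {
		m := make(map[int]uint64) // grows (and reallocates buckets) many times
		for k := 0; k < size; k++ {
			m[k] = uint64(k)
		}
	}
}

func BenchmarkPreallocatedMap1000000(b *testing.B) {
	for i := 0; i < b.N; i++ {
		m := make(map[int]uint64, size) // buckets allocated up front
		for k := 0; k < size; k++ {
			m[k] = uint64(k)
		}
	}
}
```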

ramtinms (Contributor) left a comment:

Looks good to me.

codecov-commenter commented Aug 23, 2022

Codecov Report

Merging #3050 (20f3022) into master (395422b) will increase coverage by 0.04%.
The diff coverage is 87.27%.

@@            Coverage Diff             @@
##           master    #3050      +/-   ##
==========================================
+ Coverage   54.43%   54.47%   +0.04%     
==========================================
  Files         722      722              
  Lines       66839    66910      +71     
==========================================
+ Hits        36383    36449      +66     
- Misses      27401    27405       +4     
- Partials     3055     3056       +1     
| Flag | Coverage Δ |
| --- | --- |
| unittests | 54.47% <87.27%> (+0.04%) ⬆️ |

Flags with carried forward coverage won't be shown.

| Impacted Files | Coverage Δ |
| --- | --- |
| ledger/complete/wal/checkpointer.go | 63.20% <86.66%> (+3.33%) ⬆️ |
| ledger/complete/mtrie/flattener/iterator.go | 100.00% <100.00%> (ø) |
| insecure/corruptible/network.go | 57.04% <0.00%> (-0.71%) ⬇️ |
| admin/command_runner.go | 79.88% <0.00%> (ø) |
| fvm/handler/contract.go | 88.59% <0.00%> (ø) |
| engine/collection/synchronization/engine.go | 68.97% <0.00%> (ø) |
| module/mempool/epochs/transactions.go | 100.00% <0.00%> (+9.67%) ⬆️ |


@fxamacker fxamacker merged commit 78a3caf into master Aug 23, 2022
@fxamacker fxamacker deleted the fxamacker/reduce-checkpoint-serialization-memory branch August 23, 2022 20:39
fxamacker (Member Author):

Updated text to mention testing because another test for this PR was conducted and mentioned today (August 31, 2022).
