
Optimize MTrie Checkpoint (regCount & regSize): -9GB alloc/op, -110 million allocs/op, -4GB file size #2126

Merged: 21 commits, Mar 23, 2022

Conversation


@fxamacker fxamacker commented Mar 9, 2022

Description

  • Remove Node.regCount and Node.maxDepth to reduce file size and memory
    • -10 bytes per node file size reduction
    • -16 bytes per node memory reduction -- this memory reduction extends beyond checkpoint. 🎉
  • Optimize MTrie update() to reduce heap allocations
  • Fix Payload.Equals() crash bug when p is nil
  • Fix Payload.Equals() bug when comparing nil payload with empty payload

Closes #1747
Closes #1748
Closes #2125
Updates #1744
Updates #1884 https://github.com/dapperlabs/flow-go/issues/6114

Big thanks to @ramtinms for opening issues #1747, #1748, and for clarifying things. 👍

Changes include:

  • remove Node.regCount (8 bytes) and Node.maxDepth (2 bytes)

  • remove regCount (8 bytes) and maxDepth (2 bytes) from checkpoint node serialization

  • add MTrie.regCount (8 bytes) and MTrie.regSize (8 bytes)

  • add regCount (8 bytes) and regSize (8 bytes) to checkpoint trie serialization

  • modify trie update() to return regCountDelta, regSizeDelta, and lowestHeightTouched so that NewTrieWithUpdatedRegisters() can compute regCount, regSize, and maxDepthTouched for the updated trie

  • fix Payload.Equals() crash bug when p is nil

  • fix Payload.Equals() bug when comparing nil payload with empty payload

  • optimize MTrie update() to reduce heap allocations

Note

lowestHeightTouched returned by update() is the lowest height reached during the recursive update. Unlike maxDepth, lowestHeightTouched isn't affected by the prune flag. It is usually the height of the new or updated node, but it can also be the height of a new node created from a compact leaf at a higher height.

Preliminary Results

Unoptimized Checkpoint Creation

After the first 30 minutes and for the next 11+ hours on benchnet (15-17+ hours on EN3):

Optimized Checkpoint Creation (PR #1944)

Finishes in about 15 minutes and peaks in the last minute at:

Optimized Checkpoint Creation (PR #1944 + PR #2126)

Finishes in 14 - 15 minutes and peaks in the last minute at:

New Checkpoint (Load Old + Replay 41 WALs + Create New)

PR #1944 vs PR #2126

> benchstat v4_to_v4.txt v41_to_v41.txt
name              old time/op    new time/op    delta
NewCheckpoint-48      886s ± 0%      868s ± 2%  -2.10%

name              old alloc/op   new alloc/op   delta
NewCheckpoint-48     159GB ± 0%     150GB ± 0%  -5.61%

name              old allocs/op  new allocs/op  delta
NewCheckpoint-48     2.19G ± 0%     2.08G ± 0%  -4.91%

Mainnet vs (PR #1944 + PR #2126)

name              old time/op    new time/op    delta
NewCheckpoint-48    42052s ± 0%      868s ± 2%  -97.94%

name              old alloc/op   new alloc/op   delta
NewCheckpoint-48     590GB ± 0%     150GB ± 0%  -74.55%

name              old allocs/op  new allocs/op  delta
NewCheckpoint-48     9.80G ± 0%     2.08G ± 0%  -78.76%

Go 1.16 on benchnet (the big one)
NOTE: File system cache can make the duration fluctuate. The first run after OS reboot can take 20-60 secs longer due to cache and other reasons (e.g. not waiting long enough after OS boots).

Reduce RAM use, file size, and improve speed.
E.g. reduce checkpoint file size by over 4GB and
RAM usage by a similar amount.  Speed will also
benefit from these changes (benchmarks coming).
@codecov-commenter commented Mar 9, 2022

Codecov Report

Merging #2126 (4bcc1c8) into master (e52604d) will increase coverage by 0.02%.
The diff coverage is 84.14%.

@@            Coverage Diff             @@
##           master    #2126      +/-   ##
==========================================
+ Coverage   57.50%   57.52%   +0.02%     
==========================================
  Files         637      641       +4     
  Lines       37261    37884     +623     
==========================================
+ Hits        21426    21794     +368     
- Misses      13151    13328     +177     
- Partials     2684     2762      +78     
Flag Coverage Δ
unittests 57.52% <84.14%> (+0.02%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
ledger/trie.go 9.28% <28.57%> (+5.55%) ⬆️
ledger/complete/mtrie/flattener/encoding_v3.go 44.68% <37.03%> (+1.20%) ⬆️
ledger/complete/wal/checkpointer.go 63.08% <84.21%> (+1.85%) ⬆️
ledger/complete/mtrie/trie/trie.go 61.83% <98.63%> (+6.65%) ⬆️
ledger/complete/ledger.go 59.18% <100.00%> (-0.21%) ⬇️
ledger/complete/mtrie/flattener/encoding.go 78.66% <100.00%> (-2.18%) ⬇️
ledger/complete/mtrie/forest.go 61.63% <100.00%> (+0.48%) ⬆️
ledger/complete/mtrie/node/node.go 68.18% <100.00%> (-2.98%) ⬇️
ledger/ledger.go 16.88% <100.00%> (+16.88%) ⬆️
...sus/approvals/assignment_collector_statemachine.go 42.30% <0.00%> (-9.62%) ⬇️
... and 31 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@fxamacker fxamacker marked this pull request as draft March 10, 2022 04:26
Output variables shared outside goroutine are allocated on the heap.
Use channel to transfer output variables to avoid heap allocation.
@fxamacker fxamacker changed the title WIP: Add MTrie.regCount and MTrie.regSize to reduce memory use, file size, and fix Payload.Equals() bugs Optimize MTrie Checkpoint (regCount & regSize): -9GB alloc/op, -110 million allocs/op, -4GB file size Mar 11, 2022
@fxamacker fxamacker marked this pull request as ready for review March 11, 2022 02:02
@AlexHentschel (Member) left a comment
Thanks for the cleanup. The only aspect I am concerned about is the different convention introduced in this PR.

Existing conventions

  1. In our abstract storage model, we have registers
    • Per convention, we say that a register is "allocated" if and only if it holds a non-empty Value. Equivalently, we say that a register is "unallocated", if and only if its value is empty.
    • A register is uniquely identified by a path (32 byte address space)
      • A register's clear-text key is auxiliary information that does not impact whether a register is allocated or not.
      • Technically, we get a register's path by hashing the key. However, the hashing happens outside of the MTrie's scope and the trie itself is oblivious to the values in key.
  2. The Trie extends the convention of allocated registers:
  • The Trie must store allocated registers. An allocated register has a hash that is necessarily different from the default hash, because an allocated register has a non-empty value, which is part of the hash. Furthermore, the hash of any sub-trie containing allocated registers is also non-default.
  • The trie may store unallocated registers. We specifically designed the hash function such that it returns the default hash for an unallocated register.

Based on this convention, the Trie distinguishes registers only based on whether or not they have an empty Value.

Now, let's look at regCount (I'll get to regSize later). There are 3 possible conventions that are consistent with the conventions above:

  • Option 1: Theoretically, we could count all registers in the storage model (no matter whether allocated or not). That would be a constant number of 2^256. Not very interesting.
  • Option 2: We only count the allocated registers. That would tell us how many registers the application layer is actively using, but it would not provide any information about the number of registers that are explicitly represented in the trie despite being unallocated.
  • Option 3: We count all registers that are explicitly represented in the trie (no matter whether they are allocated or unallocated).

For practical purposes, I would expect that Option 2 and Option 3 are close to identical. All tries that we keep in memory are pruned, so there should be no unallocated registers represented in the tree. Not sure what your intended usage is for regCount (?). Given that there is little practical difference, my gut feeling is that option 3 would be easiest to implement consistently.

Comparison to convention introduced in this PR

  • In this PR you start a different convention by working with the Payload.Size(). This is inconsistent with the existing convention because:
    • In the existing convention, the trie algorithms only distinguish between allocated and unallocated registers. The Payload.Keys is auxiliary data which the higher-level business logic can use arbitrarily; the trie does not make any assumptions about how Payload.Keys is used.
    • In your implementation, the treatment of unallocated registers suddenly changes depending on whether or not Payload.Keys is set. This creates dependencies / assumptions about usage patterns of the application logic (more inter-dependencies -> more brittle code). Furthermore, it significantly increases the complexity of the logic, because we suddenly move from a two-state register attribute of {allocated; unallocated} to a 4-state attribute <Payload.Keys present? Yes or No> x <Payload.Value present? Yes or No>.

Suggestion:

I would suggest implementing Option 3:

  • If we add a new leaf, regCount increases by one, no matter whether or not the register is allocated. The register exists in the tree, so it takes up storage and hence is worthwhile to monitor.
  • This convention also makes regSize easy to monitor, because you don't have to distinguish between allocated and unallocated registers either. As soon as we represent a register in the tree, we count payloads[0].Size() as its storage footprint.

ledger/complete/mtrie/trie/trie.go Show resolved Hide resolved
ledger/complete/mtrie/trie/trie.go Outdated Show resolved Hide resolved
ledger/complete/mtrie/trie/trie.go Outdated Show resolved Hide resolved
ledger/complete/mtrie/trie/trie.go Outdated Show resolved Hide resolved
ledger/complete/mtrie/trie/trie.go Outdated Show resolved Hide resolved
ledger/complete/mtrie/trie/trie.go Outdated Show resolved Hide resolved
ledger/complete/mtrie/trie/trie.go Outdated Show resolved Hide resolved
Comment on lines 430 to 450
// runtime optimization: process the left child in a separate thread
wg := sync.WaitGroup{}
wg.Add(1)
go func() {
defer wg.Done()
lChild = update(nodeHeight-1, lchildParent, lpaths, lpayloads, lcompactLeaf, prune)
}()
rChild = update(nodeHeight-1, rchildParent, rpaths, rpayloads, rcompactLeaf, prune)
wg.Wait()

// Since we're receiving 4 items from goroutine, use a
// channel to prevent them going on the heap.
type updateResult struct {
child *node.Node
regCountDelta int64
regSizeDelta int64
lowestHeightTouched int
}
results := make(chan updateResult, 1)
go func(retChan chan<- updateResult) {
child, regCountDelta, regSizeDelta, lowestHeightTouched := update(nodeHeight-1, lchildParent, lpaths, lpayloads, lcompactLeaf, prune)
retChan <- updateResult{child, regCountDelta, regSizeDelta, lowestHeightTouched}
}(results)

rChild, rRegCountDelta, rRegSizeDelta, rLowestHeightTouched = update(nodeHeight-1, rchildParent, rpaths, rpayloads, rcompactLeaf, prune)

// Wait for results from goroutine.
ret := <-results
lChild, lRegCountDelta, lRegSizeDelta, lLowestHeightTouched = ret.child, ret.regCountDelta, ret.regSizeDelta, ret.lowestHeightTouched
Member
Have you benchmarked the construction with the channels here? This is very performance-sensitive code. I would guess that a waitgroup is much more performant than a channel (there are multiple synchronization primitives involved in a channel; see the implementation).

Suggestion:

I think this synchronization can be implemented as easily using WaitGroup:

  1. Create an instance of updateResult and a wg:= sync.WaitGroup{}
  2. spawn the auxiliary go routine
    • The memory model guarantees that step 1 happens before any statement within the routine
    • write the return values to updateResult
  3. in the main thread wg.Wait() for the auxiliary thread. This guarantees that the operations in the auxiliary thread happened before the Wait() returns.

In summary, I think the entire else-branch can be written as follows

Suggested change
// runtime optimization: process the left child in a separate thread
wg := sync.WaitGroup{}
wg.Add(1)
go func() {
defer wg.Done()
lChild = update(nodeHeight-1, lchildParent, lpaths, lpayloads, lcompactLeaf, prune)
}()
rChild = update(nodeHeight-1, rchildParent, rpaths, rpayloads, rcompactLeaf, prune)
wg.Wait()
// Since we're receiving 4 items from goroutine, use a
// channel to prevent them going on the heap.
type updateResult struct {
child *node.Node
regCountDelta int64
regSizeDelta int64
lowestHeightTouched int
}
results := make(chan updateResult, 1)
go func(retChan chan<- updateResult) {
child, regCountDelta, regSizeDelta, lowestHeightTouched := update(nodeHeight-1, lchildParent, lpaths, lpayloads, lcompactLeaf, prune)
retChan <- updateResult{child, regCountDelta, regSizeDelta, lowestHeightTouched}
}(results)
rChild, rRegCountDelta, rRegSizeDelta, rLowestHeightTouched = update(nodeHeight-1, rchildParent, rpaths, rpayloads, rcompactLeaf, prune)
// Wait for results from goroutine.
ret := <-results
lChild, lRegCountDelta, lRegSizeDelta, lLowestHeightTouched = ret.child, ret.regCountDelta, ret.regSizeDelta, ret.lowestHeightTouched
// runtime optimization: process the left child in a separate thread
wg := sync.WaitGroup{}
wg.Add(1)
go func() {
lChild, lRegCountDelta, lRegSizeDelta, lLowestHeightTouched = update(nodeHeight-1, lchildParent, lpaths, lpayloads, lcompactLeaf, prune)
}()
rChild, rRegCountDelta, rRegSizeDelta, rLowestHeightTouched = update(nodeHeight-1, rchildParent, rpaths, rpayloads, rcompactLeaf, prune)
wg.Wait() // WaitGroup induces happens-before relationship: https://groups.google.com/g/golang-nuts/c/5oHzhzXCcmM

Member Author

Have you benchmarked the construction with the channels here? This is very performance-sensitive code.

Yes. Last week, I ran non-micro benchmarks: a checkpoint creation benchmark using mainnet data (replaying 264881 updates from WALs). Using a channel instead of a WaitGroup was faster and reduced allocs by hundreds of millions.

I also ran micro benchmarks yesterday and they confirm the suggested WaitGroup approach can be slower than this PR by about 12%-21% while also causing extra 1-4 allocs/op depending on details (see below).

I would guess that a waitgroup is much more performant than a channel

Although the WaitGroup approach can be faster (especially with 2+ goroutines), we only use 1 goroutine and need to communicate results from it, so using a channel is faster and incurs fewer allocs/op in this case.

Suggestion:

I think this synchronization can be implemented as easily using WaitGroup:

  1. Create an instance of updateResult and a wg:= sync.WaitGroup{}
  2. spawn the auxiliary go routine
    ...
    In summary, I think the entire else-branch can be written as follows

[screenshot of the suggested code]

The suggested approach in the code snippet (screenshot) can be slower by around 21% and also cause an extra 4 allocs/op.

The extra allocs/op (lChild, lRegCountDelta, lRegSizeDelta, lLowestHeightTouched) are incurred for all recursive calls, even those that take the "if-else" branch avoiding the goroutine.

  1. Create an instance of updateResult and a wg:= sync.WaitGroup{}

If the 4 values are put into updateResult struct, then using the suggested WaitGroup approach still causes 1 extra alloc/op and it's still slower than using channel in this PR by around 12%.

More Details

For micro benchmarks today, I used Go 1.17.8 on linux_amd64 (Haswell Xeon) with this code:
https://gist.github.com/fxamacker/4cec13885dcf2c19d055b13f2b51289b

There are two groups of benchmarks:

WaitGroup vs channel

With one goroutine, it is faster to use a channel.

name         old time/op    new time/op    delta
WaitGroup-8     471ns ± 0%     448ns ± 0%    -4.79%  (p=0.000 n=20+20)

name         old alloc/op   new alloc/op   delta
WaitGroup-8     32.0B ± 0%    112.0B ± 0%  +250.00%  (p=0.000 n=20+20)

name         old allocs/op  new allocs/op  delta
WaitGroup-8      2.00 ± 0%      2.00 ± 0%      ~     (all equal)

Communicating values from goroutine

Communicating 4 ints via referenced variables with WaitGroup causes the 4 integers to be allocated on the heap, and is also slower.

> go-benchrun WaitGroupReferenceFourInts ChannelResult -benchmem -count=20

name                        old time/op    new time/op    delta
WaitGroupReferenceFourInts-8     580ns ± 0%     457ns ± 0%  -21.15%  (p=0.000 n=19+20)

name                        old alloc/op   new alloc/op   delta
WaitGroupReferenceFourInts-8     96.0B ± 0%    112.0B ± 0%  +16.67%  (p=0.000 n=20+20)

name                        old allocs/op  new allocs/op  delta
WaitGroupReferenceFourInts-8      6.00 ± 0%      2.00 ± 0%  -66.67%  (p=0.000 n=20+20)

Communicating one referenced wrapper struct with WaitGroup requires allocating the struct on the heap, and is slower.


> go-benchrun WaitGroupReferenceOneStruct ChannelResult -benchmem -count=20

name                           old time/op    new time/op    delta
WaitGroupReferenceOneStruct-8     520ns ± 0%     458ns ± 0%  -11.91%  (p=0.000 n=19+17)

name                           old alloc/op   new alloc/op   delta
WaitGroupReferenceOneStruct-8     72.0B ± 0%    112.0B ± 0%  +55.56%  (p=0.000 n=20+20)

name                           old allocs/op  new allocs/op  delta
WaitGroupReferenceOneStruct-8      3.00 ± 0%      2.00 ± 0%  -33.33%  (p=0.000 n=20+20)

Although the WaitGroup approach can be faster (especially with 2+ goroutines), we only use 1 goroutine and need to communicate results from it, so using a channel is faster and incurs fewer allocs/op in this case.

@AlexHentschel (Member) commented Mar 22, 2022
Thanks for the detailed benchmarking result. Makes sense. Could you add a brief TLDR to the code, so that we document this consideration, please.

Member Author
Added in 051ba8e

Contributor
Interesting results! I didn't expect channel to be faster than waitGroup in this case.

ledger/complete/mtrie/trie/trie_test.go Outdated Show resolved Hide resolved
ledger/ledger_test.go Show resolved Hide resolved
When a compactLeaf from a higher height needs to be created at a lower
height during update, its payload doesn't need to be deep copied because
we are reusing the trie's internal reference and the payloads of leaves
are immutable.
A register is allocated if and only if it has a non-empty value.
A register is unallocated if and only if its value is empty.

In update(), use empty payload value to distinguish allocated vs
unallocated registers to adjust allocatedRegCountDelta and
allocatedRegSizeDelta.
@fxamacker (Member Author)

@AlexHentschel Thanks for the detailed explanation! 😄 It cleared up a misunderstanding. I was using payload size instead of payload value size to test whether a register is allocated.

I would suggest to implement Option 3:

If we add a new leaf, regCount increases by one; no matter whether or not the register is allocated. The register exists in the tree, so it takes up storage and hence is worth-while to monitor.
This convention also makes regSize easy to monitor, because you don't have to distinguish between allocated and unallocated registers either. As soon as we represent a register in the tree, we count payloads[0].Size() as its storage footprint.

I considered Options 2 & 3 before opening this PR, and created this PR for Option 2 (only track allocated registers).

I didn't pick Option 3 (track both allocated and unallocated registers in the trie) because it seemed keeping track of all registers in the trie requires keeping track of unallocated registers which might be removed by pruning. So option 3 shifts complexity from detecting "allocated vs unallocated" to detecting "whether a pruned child node was just unallocated".

Please let me know if I missed anything that should be considered when choosing between Option 2 & 3.

@fxamacker fxamacker requested a review from AlexHentschel March 17, 2022 00:04
@ramtinms (Contributor) left a comment
Nice work, thanks for doing it, looks good to me.

In v0, length of encoded payload value is encoded in 8 bytes.
In v1, length of encoded payload value is encoded in 4 bytes.

Payload decoding handles both v0 and v1.

Payload is encoded according to version received as parameter.
This allows gradual migration of other encoding formats using
payload, such as TrieUpdate and TrieProof.
Bump version of checkpoint file to v5.
Support reading v4 for compatibility.
Add tests and test data to read v4.
}

// update traverses the subtree and updates the stored registers
// update traverses the subtree, updates the stored registers, and returns:
// * new or orignial node (n)
Member
Suggested change
// * new or orignial node (n)
// * new or original node (n)

Member Author
Fixed in 953571a.

Comment on lines 369 to 372
if payloads[i].IsEmpty() {
// Old payload is not empty while new payload is empty.
// Allocated register will be unallocated.
allocatedRegCountDelta = -1
@AlexHentschel (Member) commented Mar 22, 2022
I am not sure this logic here is correct in all cases. Depending on whether the prune parameter is set or not, we might add registers with an empty Value.

Consider the following sequence of events:

  1. a register with empty Value is added with a NewTrieWithUpdatedRegisters call with prune = false
    if payloads[0].IsEmpty() {
    // Unallocated register doesn't affect allocatedRegCountDelta and allocatedRegSizeDelta.
    return n, 0, 0, nodeHeight
    }

    this will not increase the regCount
  2. In a second call, with NewTrieWithUpdatedRegisters call with prune = true we remove the leaf. This will decrease regCount.

Due to the existence of the prune flag, you can't make any assumption whether an existing leaf represents an allocated or an unallocated register, because with prune = false, our trie will store unallocated registers. Hence, whenever you change the value (meaning the register was already explicitly represented in the Trie before the update), you have to consider all 4 possibilities:
{Before the update, the leaf represented an allocated or unallocated register?} x {After the update, the leaf represents an allocated or unallocated register?}

You could extract the logic into a sub-function, which would also increase readability of the code:

func computeDeltas(oldNode, newNode *node.Node) (allocatedRegCountDelta int64, allocatedRegSizeDelta int64) {
	var oldNodeIsAllocated, newNodeIsAllocated int64
	var oldNodeSize, newNodeSize int
	if oldNode != nil && !oldNode.Payload().IsEmpty() {
		oldNodeIsAllocated = 1
		oldNodeSize = oldNode.Payload().Size()
	}
	if newNode != nil && !newNode.Payload().IsEmpty() {
		newNodeIsAllocated = 1
		newNodeSize = newNode.Payload().Size()
	}

	allocatedRegCountDelta = newNodeIsAllocated - oldNodeIsAllocated
	allocatedRegSizeDelta = int64(newNodeSize - oldNodeSize)
	return
}

@fxamacker (Member Author) commented Mar 23, 2022
I am not sure this logic here is correct in all cases. Depending on whether the prune parameter is set or not, we might add registers with empty Value

Consider the following sequence of events:

  1. a register with empty Value is added with a NewTrieWithUpdatedRegisters call with prune = false
    ...
    this will not increase the regCount
  2. In a second call, with NewTrieWithUpdatedRegisters call with prune = true we remove the leaf. This will decrease regCount.

I think the logic correctly handles the sequence of events mentioned, because in step 2 we don't decrease the reg count when prune = true. Pruning is not a consideration, since the reg count is decreased only when an allocated payload is unallocated (not pruned).

To confirm, I added TestTrieAllocatedRegCountRegSizeWithMixedPruneFlag just now in d9573d0 and cbb8805.

TestTrieAllocatedRegCountRegSizeWithMixedPruneFlag tests allocated
register count and size for updated trie with mixed pruning flag.
It tests the following updates:

* step 1 : update empty trie with new paths and payloads
           (255 allocated registers)

* step 2 : remove a payload without pruning
           (254 allocated registers)

* step 3a: remove previously removed payload with pruning
           (254 allocated registers)

* step 3b: update trie from step 2 with a new payload (sibling of
           removed payload) with pruning (255 allocated registers)

@fxamacker (Member Author) commented Mar 23, 2022
You could extract the logic into a sub-function, which would also increase readability of the code:

func computeDeltas(...

I agree and extracted the logic into two functions in 4bcc1c8

// computeAllocatedRegDeltasFromHigherHeight returns the deltas needed
// to compute the allocated reg count and reg size when a payload
// needs to be updated or unallocated at a lower height.
func computeAllocatedRegDeltasFromHigherHeight(oldPayload *ledger.Payload)
    (allocatedRegCountDelta, allocatedRegSizeDelta int64)

// computeAllocatedRegDeltas returns the allocated reg count and reg size deltas
// computed from old payload and new payload.
// CAUTION: !oldPayload.Equals(newPayload)
func computeAllocatedRegDeltas(oldPayload, newPayload *ledger.Payload)
    (allocatedRegCountDelta, allocatedRegSizeDelta int64) 

UPDATE: added "compute" prefix to the function names


ledger/complete/mtrie/trie/trie.go Outdated Show resolved Hide resolved
ledger/complete/mtrie/trie/trie.go Outdated Show resolved Hide resolved
@AlexHentschel (Member)

Thanks for your explanation why you went with Option 2.

option 3 shifts complexity from detecting "allocated vs unallocated" to detecting "whether a pruned child node was just unallocated".

I think you are right. I gave it some thought and in fact it seems Option 2 would be even more complex, as it interacts with the pruning logic. Great insight. Thanks.

TestTrieAllocatedRegCountRegSizeWithMixedPruneFlag tests allocated
register count and size for updated trie with mixed pruning flag.
It tests the following updates with prune flag set to true:

* step 1 : update empty trie with new paths and payloads
           (255 allocated registers)

* step 2 : remove a payload without pruning
           (254 allocated registers)

* step 3a: remove previously removed payload with pruning
           (254 allocated registers)

* step 3b: update trie from step 2 with a new payload (sibling of
           removed payload) with pruning (255 allocated registers)
- computeAllocatedRegDeltasFromHigherHeight
- computeAllocatedRegDeltas
Update checkpoint file format version from v4 to v5
…ad-value-length-in-checkpoint

Reduce checkpoint file size by using fewer bytes to encode length of encoded payload value
@fxamacker fxamacker merged commit 5af3bb2 into master Mar 23, 2022
@fxamacker fxamacker deleted the fxamacker/move-regcount-regsize-to-trie branch March 23, 2022 18:13
ledger/complete/mtrie/node/node.go Show resolved Hide resolved
ledger/trie.go Show resolved Hide resolved
ledger/complete/mtrie/trie/trie.go Show resolved Hide resolved
}
oldPayloadSize := oldPayload.Size()
allocatedRegSizeDelta -= int64(oldPayloadSize)
return
Contributor
Is there a reason for preferring subtractions over direct assignment? (I guess the compiler may generate the same code in this case)

allocatedRegCountDelta = -1

and

allocatedRegSizeDelta = -int64(oldPayloadSize)

Member Author
@tarakby Good question! No preference, it was just extracted code from a larger function I didn't bother modifying. 😄

Go 1.17 compiler produces the same binary. See code snippet comparison at godbolt.org.

Unfortunately, Compiler Explorer forgets the scrolled position of the 2 compiler output views. It's useful to see side-by-side while hovering over the source code window.


@tarakby (Contributor) left a comment
It's good to see the register count and max depth removed from the node struct. Nice work!
