
Optimize MTrie Checkpoint (regCount & regSize): -9GB alloc/op, -110 million allocs/op, -4GB file size #2126

Merged: 21 commits, Mar 23, 2022

Conversation


@fxamacker fxamacker commented Mar 9, 2022

Description

  • Remove Node.regCount and Node.maxDepth to reduce file size and memory
    • -10 bytes per node file size reduction
    • -16 bytes per node memory reduction -- this memory reduction extends beyond checkpoint. 🎉
  • Optimize MTrie update() to reduce heap allocations
  • Fix Payload.Equals() crash bug when p is nil
  • Fix Payload.Equals() bug when comparing nil payload with empty payload

Closes #1747
Closes #1748
Closes #2125
Updates #1744
Updates #1884 https://github.com/dapperlabs/flow-go/issues/6114

Big thanks to @ramtinms for opening issues #1747, #1748, and for clarifying things. 👍

Changes include:

  • remove Node.regCount (8 bytes) and Node.maxDepth (2 bytes)

  • remove regCount (8 bytes) and maxDepth (2 bytes) from checkpoint node serialization

  • add MTrie.regCount (8 bytes) and MTrie.regSize (8 bytes)

  • add regCount (8 bytes) and regSize (8 bytes) to checkpoint trie serialization

  • modify trie update() to return regCountDelta, regSizeDelta, and lowestHeightTouched so that NewTrieWithUpdatedRegisters() can compute regCount, regSize, and maxDepthTouched for the updated trie

  • fix Payload.Equals() crash bug when p is nil

  • fix Payload.Equals() bug when comparing nil payload with empty payload

  • optimize MTrie update() to reduce heap allocations

Note

lowestHeightTouched returned by update() is the lowest height reached during the recursive update. Unlike maxDepth, lowestHeightTouched isn't affected by the prune flag. It is usually the height of the new or updated node, but it can also be the height of a new node created from a compact leaf at a higher height.

Preliminary Results

Unoptimized Checkpoint Creation

After the first 30 minutes and for the next 11+ hours on benchnet (15-17+ hours on EN3):

Optimized Checkpoint Creation (PR #1944)

Finishes in about 15 minutes and peaks in the last minute at:

Optimized Checkpoint Creation (PR #1944 + PR #2126)

Finishes in 14 - 15 minutes and peaks in the last minute at:

New Checkpoint (Load Old + Replay 41 WALs + Create New)

PR #1944 vs PR #2126

> benchstat v4_to_v4.txt v41_to_v41.txt
name              old time/op    new time/op    delta
NewCheckpoint-48      886s ± 0%      868s ± 2%  -2.10%

name              old alloc/op   new alloc/op   delta
NewCheckpoint-48     159GB ± 0%     150GB ± 0%  -5.61%

name              old allocs/op  new allocs/op  delta
NewCheckpoint-48     2.19G ± 0%     2.08G ± 0%  -4.91%

Mainnet vs (PR #1944 + PR #2126)

name              old time/op    new time/op    delta
NewCheckpoint-48    42052s ± 0%      868s ± 2%  -97.94%

name              old alloc/op   new alloc/op   delta
NewCheckpoint-48     590GB ± 0%     150GB ± 0%  -74.55%

name              old allocs/op  new allocs/op  delta
NewCheckpoint-48     9.80G ± 0%     2.08G ± 0%  -78.76%

Go 1.16 on benchnet (the big one)
NOTE: File system cache can make the duration fluctuate. The first run after OS reboot can take 20-60 secs longer due to cache and other reasons (e.g. not waiting long enough after OS boots).

Reduce RAM use, file size, and improve speed.
E.g. reduce checkpoint file size by over 4GB and
RAM usage by a similar amount.  Speed will also
benefit from these changes (benchmarks coming).
@codecov-commenter commented Mar 9, 2022

Codecov Report

Merging #2126 (4bcc1c8) into master (e52604d) will increase coverage by 0.02%.
The diff coverage is 84.14%.

@@            Coverage Diff             @@
##           master    #2126      +/-   ##
==========================================
+ Coverage   57.50%   57.52%   +0.02%     
==========================================
  Files         637      641       +4     
  Lines       37261    37884     +623     
==========================================
+ Hits        21426    21794     +368     
- Misses      13151    13328     +177     
- Partials     2684     2762      +78     
Flag Coverage Δ
unittests 57.52% <84.14%> (+0.02%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
ledger/trie.go 9.28% <28.57%> (+5.55%) ⬆️
ledger/complete/mtrie/flattener/encoding_v3.go 44.68% <37.03%> (+1.20%) ⬆️
ledger/complete/wal/checkpointer.go 63.08% <84.21%> (+1.85%) ⬆️
ledger/complete/mtrie/trie/trie.go 61.83% <98.63%> (+6.65%) ⬆️
ledger/complete/ledger.go 59.18% <100.00%> (-0.21%) ⬇️
ledger/complete/mtrie/flattener/encoding.go 78.66% <100.00%> (-2.18%) ⬇️
ledger/complete/mtrie/forest.go 61.63% <100.00%> (+0.48%) ⬆️
ledger/complete/mtrie/node/node.go 68.18% <100.00%> (-2.98%) ⬇️
ledger/ledger.go 16.88% <100.00%> (+16.88%) ⬆️
...sus/approvals/assignment_collector_statemachine.go 42.30% <0.00%> (-9.62%) ⬇️
... and 31 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@fxamacker fxamacker marked this pull request as draft March 10, 2022 04:26
Output variables shared outside goroutine are allocated on the heap.
Use channel to transfer output variables to avoid heap allocation.
@fxamacker fxamacker changed the title WIP: Add MTrie.regCount and MTrie.regSize to reduce memory use, file size, and fix Payload.Equals() bugs Optimize MTrie Checkpoint (regCount & regSize): -9GB alloc/op, -110 million allocs/op, -4GB file size Mar 11, 2022
@fxamacker fxamacker marked this pull request as ready for review March 11, 2022 02:02
@AlexHentschel (Member) left a comment
Thanks for the cleanup. The only aspect I am concerned about is the different convention introduced in this PR.

Existing conventions

  1. In our abstract storage model, we have registers
    • Per convention, we say that a register is "allocated" if and only if it holds a non-empty Value. Equivalently, we say that a register is "unallocated", if and only if its value is empty.
    • A register is uniquely identified by a path (32 byte address space)
      • A register's clear-text key is auxiliary information that does not impact whether a register is allocated or not.
      • Technically, we get a register's path by hashing the key. However, the hashing happens outside of the MTrie's scope and the trie itself is oblivious to the values in key.
  2. The Trie extends the convention of allocated registers:
  • The Trie must store allocated registers. An allocated register has a hash that is necessarily different from the default hash, because an allocated register has a non-empty value, which is part of the hash. Furthermore, the hash of any sub-trie containing allocated registers is also non-default.
  • The trie may store unallocated registers. We specifically designed the hash function such that it returns the default hash for an unallocated register.

Based on this convention, the Trie distinguishes registers only based on whether or not they have an empty Value.

Now, let's look at regCount (I'll get to regSize later). There are 3 possible conventions that are consistent with the conventions above:

  • Option 1: Theoretically, we could count all registers in the storage model (no matter whether allocated or not). That would be a constant number of 2^256. Not very interesting.
  • Option 2: We only count the allocated registers. That would tell us how many registers the application layer is actively using, but it would not provide any information about the number of registers that are explicitly represented in the trie despite being unallocated.
  • Option 3: We count all registers that are explicitly represented in the trie (no matter whether they are allocated or unallocated).

For practical purposes, I would expect that Option 2 and Option 3 are close to identical. All tries that we keep in memory are pruned, so there should be no unallocated registers represented in the tree. Not sure what your intended usage is for regCount (?). Given that there is little practical difference, my gut feeling is that option 3 would be easiest to implement consistently.

Comparison to convention introduced in this PR

  • In this PR you start a different convention by working with the Payload.Size(). This is inconsistent with the existing convention because:
    • In the existing convention, the trie algorithms only distinguish between allocated and unallocated registers. The Payload.Keys is auxiliary data which the higher-level business logic can use arbitrarily; the trie does not make any assumptions about how Payload.Keys is used.
    • In your implementation, the treatment of unallocated registers suddenly changes depending on whether or not Payload.Keys is set. This creates dependencies / assumptions about usage patterns of the application logic (more inter-dependencies -> more brittle code). Furthermore, it significantly increases the complexity of the logic, because we suddenly move from a two-state register attribute of {allocated; unallocated} to a 4-state attribute <Payload.Keys present? Yes or No> x <Payload.Value present? Yes or No>.

Suggestion:

I would suggest implementing Option 3:

  • If we add a new leaf, regCount increases by one, no matter whether or not the register is allocated. The register exists in the tree, so it takes up storage and hence is worthwhile to monitor.
  • This convention also makes regSize easy to monitor, because you don't have to distinguish between allocated and unallocated registers either. As soon as we represent a register in the tree, we count payloads[0].Size() as its storage footprint.

ledger/complete/mtrie/trie/trie.go Show resolved Hide resolved
ledger/complete/mtrie/trie/trie.go Outdated Show resolved Hide resolved
ledger/complete/mtrie/trie/trie.go Outdated Show resolved Hide resolved
ledger/complete/mtrie/trie/trie.go Outdated Show resolved Hide resolved
ledger/complete/mtrie/trie/trie.go Outdated Show resolved Hide resolved
ledger/complete/mtrie/trie/trie.go Outdated Show resolved Hide resolved
ledger/complete/mtrie/trie/trie.go Outdated Show resolved Hide resolved
Comment on lines 430 to 450
// runtime optimization: process the left child in a separate thread
wg := sync.WaitGroup{}
wg.Add(1)
go func() {
defer wg.Done()
lChild = update(nodeHeight-1, lchildParent, lpaths, lpayloads, lcompactLeaf, prune)
}()
rChild = update(nodeHeight-1, rchildParent, rpaths, rpayloads, rcompactLeaf, prune)
wg.Wait()

// Since we're receiving 4 items from goroutine, use a
// channel to prevent them going on the heap.
type updateResult struct {
child *node.Node
regCountDelta int64
regSizeDelta int64
lowestHeightTouched int
}
results := make(chan updateResult, 1)
go func(retChan chan<- updateResult) {
child, regCountDelta, regSizeDelta, lowestHeightTouched := update(nodeHeight-1, lchildParent, lpaths, lpayloads, lcompactLeaf, prune)
retChan <- updateResult{child, regCountDelta, regSizeDelta, lowestHeightTouched}
}(results)

rChild, rRegCountDelta, rRegSizeDelta, rLowestHeightTouched = update(nodeHeight-1, rchildParent, rpaths, rpayloads, rcompactLeaf, prune)

// Wait for results from goroutine.
ret := <-results
lChild, lRegCountDelta, lRegSizeDelta, lLowestHeightTouched = ret.child, ret.regCountDelta, ret.regSizeDelta, ret.lowestHeightTouched
Member
Have you benchmarked the construction with the channels here? This is very performance-sensitive code. I would guess that a waitgroup is much more performant than a channel (there are multiple synchronization primitives involved in a channel; see the implementation).

Suggestion:

I think this synchronization can be implemented as easily using WaitGroup:

  1. Create an instance of updateResult and a wg:= sync.WaitGroup{}
  2. spawn the auxiliary go routine
    • The memory model guarantees that step 1 happens before any statement within the routine
    • write the return values to updateResult
  3. in the main thread wg.Wait() for the auxiliary thread. This guarantees that the operations in the auxiliary thread happened before the Wait() returns.

In summary, I think the entire else-branch can be written as follows

Suggested change
// runtime optimization: process the left child in a separate thread
wg := sync.WaitGroup{}
wg.Add(1)
go func() {
defer wg.Done()
lChild = update(nodeHeight-1, lchildParent, lpaths, lpayloads, lcompactLeaf, prune)
}()
rChild = update(nodeHeight-1, rchildParent, rpaths, rpayloads, rcompactLeaf, prune)
wg.Wait()
// Since we're receiving 4 items from goroutine, use a
// channel to prevent them going on the heap.
type updateResult struct {
child *node.Node
regCountDelta int64
regSizeDelta int64
lowestHeightTouched int
}
results := make(chan updateResult, 1)
go func(retChan chan<- updateResult) {
child, regCountDelta, regSizeDelta, lowestHeightTouched := update(nodeHeight-1, lchildParent, lpaths, lpayloads, lcompactLeaf, prune)
retChan <- updateResult{child, regCountDelta, regSizeDelta, lowestHeightTouched}
}(results)
rChild, rRegCountDelta, rRegSizeDelta, rLowestHeightTouched = update(nodeHeight-1, rchildParent, rpaths, rpayloads, rcompactLeaf, prune)
// Wait for results from goroutine.
ret := <-results
lChild, lRegCountDelta, lRegSizeDelta, lLowestHeightTouched = ret.child, ret.regCountDelta, ret.regSizeDelta, ret.lowestHeightTouched
// runtime optimization: process the left child in a separate thread
wg := sync.WaitGroup{}
wg.Add(1)
go func() {
lChild, lRegCountDelta, lRegSizeDelta, lLowestHeightTouched = update(nodeHeight-1, lchildParent, lpaths, lpayloads, lcompactLeaf, prune)
}()
rChild, rRegCountDelta, rRegSizeDelta, rLowestHeightTouched = update(nodeHeight-1, rchildParent, rpaths, rpayloads, rcompactLeaf, prune)
wg.Wait() // WaitGroup induces happens-before relationship: https://groups.google.com/g/golang-nuts/c/5oHzhzXCcmM

Member Author

Have you benchmarked the construction with the channels here? This is very performance-sensitive code.

Yes. Last week, I ran non-micro benchmarks: a checkpoint creation benchmark using mainnet data (replaying 264881 updates from WALs). Using a channel instead of a WaitGroup was faster and reduced allocs by hundreds of millions.

I also ran micro benchmarks yesterday and they confirm the suggested WaitGroup approach can be slower than this PR by about 12%-21% while also causing extra 1-4 allocs/op depending on details (see below).

I would guess that a waitgroup is much more performant than a channel

Although the WaitGroup approach can be faster (especially with 2+ goroutines), we only use 1 goroutine and need to communicate results from it, so using a channel is faster and incurs fewer allocs/op in this case.

Suggestion:

I think this synchronization can be implemented as easily using WaitGroup:

  1. Create an instance of updateResult and a wg:= sync.WaitGroup{}
  2. spawn the auxiliary go routine
    ...
    In summary, I think the entire else-branch can be written as follows

[screenshot of the suggested code]

The suggested approach in the code snippet (screenshot) can be slower by around 21% and also cause an extra 4 allocs/op.

The extra allocs/op (lChild, lRegCountDelta, lRegSizeDelta, lLowestHeightTouched) are incurred for all recursive calls, even those that take the "if-else" branch avoiding the goroutine.

  1. Create an instance of updateResult and a wg:= sync.WaitGroup{}

If the 4 values are put into updateResult struct, then using the suggested WaitGroup approach still causes 1 extra alloc/op and it's still slower than using channel in this PR by around 12%.

More Details

For micro benchmarks today, I used Go 1.17.8 on linux_amd64 (Haswell Xeon) with this code:
https://gist.github.com/fxamacker/4cec13885dcf2c19d055b13f2b51289b

There are two groups of benchmarks:

WaitGroup vs channel

With one goroutine, it is faster to use a channel.

name         old time/op    new time/op    delta
WaitGroup-8     471ns ± 0%     448ns ± 0%    -4.79%  (p=0.000 n=20+20)

name         old alloc/op   new alloc/op   delta
WaitGroup-8     32.0B ± 0%    112.0B ± 0%  +250.00%  (p=0.000 n=20+20)

name         old allocs/op  new allocs/op  delta
WaitGroup-8      2.00 ± 0%      2.00 ± 0%      ~     (all equal)

Communicating values from goroutine

Communicating 4 ints via referenced variables with WaitGroup causes the 4 integers to be allocated on the heap, and is also slower.

> go-benchrun WaitGroupReferenceFourInts ChannelResult -benchmem -count=20

name                        old time/op    new time/op    delta
WaitGroupReferenceFourInts-8     580ns ± 0%     457ns ± 0%  -21.15%  (p=0.000 n=19+20)

name                        old alloc/op   new alloc/op   delta
WaitGroupReferenceFourInts-8     96.0B ± 0%    112.0B ± 0%  +16.67%  (p=0.000 n=20+20)

name                        old allocs/op  new allocs/op  delta
WaitGroupReferenceFourInts-8      6.00 ± 0%      2.00 ± 0%  -66.67%  (p=0.000 n=20+20)

Communicating one referenced wrapper struct with WaitGroup requires allocating the struct on the heap, and is slower.


> go-benchrun WaitGroupReferenceOneStruct ChannelResult -benchmem -count=20

name                           old time/op    new time/op    delta
WaitGroupReferenceOneStruct-8     520ns ± 0%     458ns ± 0%  -11.91%  (p=0.000 n=19+17)

name                           old alloc/op   new alloc/op   delta
WaitGroupReferenceOneStruct-8     72.0B ± 0%    112.0B ± 0%  +55.56%  (p=0.000 n=20+20)

name                           old allocs/op  new allocs/op  delta
WaitGroupReferenceOneStruct-8      3.00 ± 0%      2.00 ± 0%  -33.33%  (p=0.000 n=20+20)

Although the WaitGroup approach can be faster (especially with 2+ goroutines), we only use 1 goroutine and need to communicate results from it, so using a channel is faster and incurs fewer allocs/op in this case.

@AlexHentschel (Member) commented Mar 22, 2022
Thanks for the detailed benchmarking result. Makes sense. Could you add a brief TLDR to the code, so that we document this consideration, please.

Member Author
Added in 051ba8e

Contributor
Interesting results! I didn't expect channel to be faster than waitGroup in this case.

ledger/complete/mtrie/trie/trie_test.go Outdated Show resolved Hide resolved
ledger/ledger_test.go Show resolved Hide resolved
When a compactLeaf from a higher height needs to be created at a lower
height during update, its payload doesn't need to be deep copied because
we are reusing the trie's internal reference and the payloads of leaves
are immutable.
A register is allocated if and only if it has a non-empty value.
A register is unallocated if and only if its value is empty.

In update(), use empty payload value to distinguish allocated vs
unallocated registers to adjust allocatedRegCountDelta and
allocatedRegSizeDelta.
@fxamacker (Member Author)

@AlexHentschel Thanks for the detailed explanation! 😄 It cleared up a misunderstanding. I was using payload size instead of payload value size to test whether a register is allocated.

I would suggest to implement Option 3:

If we add a new leaf, regCount increases by one; no matter whether or not the register is allocated. The register exists in the tree, so it takes up storage and hence is worth-while to monitor.
This convention also makes regSize easy to monitor, because you don't have to distinguish between allocated and unallocated registers either. As soon as we represent a register in the tree, we count payloads[0].Size() as its storage footprint.

I considered Options 2 & 3 before opening this PR, and created this PR for Option 2 (only track allocated registers).

I didn't pick Option 3 (track both allocated and unallocated registers in the trie) because it seemed keeping track of all registers in the trie requires keeping track of unallocated registers which might be removed by pruning. So option 3 shifts complexity from detecting "allocated vs unallocated" to detecting "whether a pruned child node was just unallocated".

Please let me know if I missed anything that should be considered when choosing between Option 2 & 3.

@fxamacker fxamacker requested a review from AlexHentschel March 17, 2022 00:04
@ramtinms (Contributor) left a comment
Nice work, thanks for doing it, looks good to me.

In v0, length of encoded payload value is encoded in 8 bytes.
In v1, length of encoded payload value is encoded in 4 bytes.

Payload decoding handles both v0 and v1.

Payload is encoded according to version received as parameter.
This allows gradual migration of other encoding formats using
payload, such as TrieUpdate and TrieProof.
Bump version of checkpoint file to v5.
Support reading v4 for compatibility.
Add tests and test data to read v4.
}

// update traverses the subtree and updates the stored registers
// update traverses the subtree, updates the stored registers, and returns:
// * new or orignial node (n)
Member
Suggested change
// * new or orignial node (n)
// * new or original node (n)

Member Author
Fixed in 953571a.

Comment on lines 369 to 372
if payloads[i].IsEmpty() {
// Old payload is not empty while new payload is empty.
// Allocated register will be unallocated.
allocatedRegCountDelta = -1
@AlexHentschel (Member) commented Mar 22, 2022
I am not sure this logic here is correct in all cases. Depending on whether the prune parameter is set or not, we might add registers with an empty Value.

Consider the following sequence of events:

  1. a register with empty Value is added with a NewTrieWithUpdatedRegisters call with prune = false
    if payloads[0].IsEmpty() {
    // Unallocated register doesn't affect allocatedRegCountDelta and allocatedRegSizeDelta.
    return n, 0, 0, nodeHeight
    }

    this will not increase the regCount
  2. In a second call, with NewTrieWithUpdatedRegisters call with prune = true we remove the leaf. This will decrease regCount.

Due to the existence of the prune flag, you can't make any assumption whether an existing leaf represents an allocated or an unallocated register, because with prune = false, our trie will store unallocated registers. Hence, whenever you change the value (meaning the register was already explicitly represented in the Trie before the update), you have to consider all 4 possibilities:
{Before the update, the leaf represented an allocated or unallocated register?} x {After the update, the leaf represents an allocated or unallocated register?}

You could extract the logic into a sub-function, which would also increase readability of the code:

func computeDeltas(oldNode, newNode *node.Node) (allocatedRegCountDelta int64, allocatedRegSizeDelta int64) {
	var oldNodeIsAllocated, newNodeIsAllocated int64
	var oldNodeSize, newNodeSize int
	if oldNode != nil && !oldNode.Payload().IsEmpty() {
		oldNodeIsAllocated = 1
		oldNodeSize = oldNode.Payload().Size()
	}
	if newNode != nil && !newNode.Payload().IsEmpty() {
		newNodeIsAllocated = 1
		newNodeSize = newNode.Payload().Size()
	}

	allocatedRegCountDelta = newNodeIsAllocated - oldNodeIsAllocated
	allocatedRegSizeDelta = int64(newNodeSize - oldNodeSize)
	return
}

@fxamacker (Member Author) commented Mar 23, 2022
I am not sure this logic here is correct in all cases. Depending on whether the prune parameter is set or not, we might add registers with empty Value

Consider the following sequence of events:

  1. a register with empty Value is added with a NewTrieWithUpdatedRegisters call with prune = false
    ...
    this will not increase the regCount
  2. In a second call, with NewTrieWithUpdatedRegisters call with prune = true we remove the leaf. This will decrease regCount.

I think the logic correctly handles the sequence of events mentioned, because in step 2 we don't decrease the reg count when prune = true. Pruning is not a consideration, since the reg count is decreased only when an allocated payload is unallocated (not pruned).

To confirm, I added TestTrieAllocatedRegCountRegSizeWithMixedPruneFlag just now in d9573d0 and cbb8805.

TestTrieAllocatedRegCountRegSizeWithMixedPruneFlag tests allocated
register count and size for updated trie with mixed pruning flag.
It tests the following updates:

* step 1 : update empty trie with new paths and payloads
           (255 allocated registers)

* step 2 : remove a payload without pruning
           (254 allocated registers)

* step 3a: remove previously removed payload with pruning
           (254 allocated registers)

* step 3b: update trie from step 2 with a new payload (sibling of
           removed payload) with pruning (255 allocated registers)

@fxamacker (Member Author) commented Mar 23, 2022
You could extract the logic into a sub-function, which would also increase readability of the code:

func computeDeltas(...

I agree and extracted the logic into two functions in 4bcc1c8

// computeAllocatedRegDeltasFromHigherHeight returns the deltas needed
// to compute the allocated reg count and reg size when a payload
// needs to be updated or unallocated at a lower height.
func computeAllocatedRegDeltasFromHigherHeight(oldPayload *ledger.Payload)
    (allocatedRegCountDelta, allocatedRegSizeDelta int64)

// computeAllocatedRegDeltas returns the allocated reg count and reg size deltas
// computed from old payload and new payload.
// CAUTION: !oldPayload.Equals(newPayload)
func computeAllocatedRegDeltas(oldPayload, newPayload *ledger.Payload)
    (allocatedRegCountDelta, allocatedRegSizeDelta int64) 

UPDATE: added "compute" prefix to the function names


ledger/complete/mtrie/trie/trie.go Outdated Show resolved Hide resolved
ledger/complete/mtrie/trie/trie.go Outdated Show resolved Hide resolved
@AlexHentschel (Member)

Thanks for your explanation why you went with Option 2.

option 3 shifts complexity from detecting "allocated vs unallocated" to detecting "whether a pruned child node was just unallocated".

I think you are right. I gave it some thought and in fact it seems Option 2 would be even more complex, as it interacts with the pruning logic. Great insight. Thanks.

TestTrieAllocatedRegCountRegSizeWithMixedPruneFlag tests allocated
register count and size for updated trie with mixed pruning flag.
It tests the following updates with prune flag set to true:

* step 1 : update empty trie with new paths and payloads
           (255 allocated registers)

* step 2 : remove a payload without pruning
           (254 allocated registers)

* step 3a: remove previously removed payload with pruning
           (254 allocated registers)

* step 3b: update trie from step 2 with a new payload (sibling of
           removed payload) with pruning (255 allocated registers)
- computeAllocatedRegDeltasFromHigherHeight
- computeAllocatedRegDeltas
Update checkpoint file format version from v4 to v5
…ad-value-length-in-checkpoint

Reduce checkpoint file size by using fewer bytes to encode length of encoded payload value
@fxamacker fxamacker merged commit 5af3bb2 into master Mar 23, 2022
@fxamacker fxamacker deleted the fxamacker/move-regcount-regsize-to-trie branch March 23, 2022 18:13
ledger/complete/mtrie/node/node.go Show resolved Hide resolved
ledger/trie.go Show resolved Hide resolved
ledger/complete/mtrie/trie/trie.go Show resolved Hide resolved
}
oldPayloadSize := oldPayload.Size()
allocatedRegSizeDelta -= int64(oldPayloadSize)
return
Contributor
Is there a reason for preferring subtractions over direct assignment? (I guess the compiler may generate the same code in this case)

allocatedRegCountDelta = -1

and

allocatedRegSizeDelta = -int64(oldPayloadSize)

Member Author
@tarakby Good question! No preference, it was just extracted code from a larger function I didn't bother modifying. 😄

Go 1.17 compiler produces the same binary. See code snippet comparison at godbolt.org.

Unfortunately, Compiler Explorer forgets the scrolled position of the 2 compiler output views. It's useful to see side-by-side while hovering over the source code window.


@tarakby (Contributor) left a comment
It's good to see the register count and max depth removed from the node struct. Nice work!
