Increase stateCache size to fix headState missing in checkpoint sync start #4534

Closed
wants to merge 1 commit into from

g11tech
Contributor

@g11tech g11tech commented Sep 10, 2022

On checkpoint sync start, the stateCache limit of 3 * 32 states seems to be insufficient, since we generally start from the finalized checkpoint, which is almost 3 epochs behind head.
So before the bn can sync blocks and generate more states nearer to head, the getHeadState functions seem to run out of states.

This PR is a quick fix that expands the stateCache to 6 epochs (leaving a margin of 3 epochs) so that bns can stay 3+ epochs behind while still trying to find and sync from peers.
There could be a better solution, but this is a hotfix for the issue users might currently run into.
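
The sizing argument above can be sketched as simple arithmetic. This is only an illustration under a simplifying assumption (roughly one cached state per slot between the checkpoint and head); `SLOTS_PER_EPOCH`, `slotsBehind` and `hasMargin` are hypothetical names, not Lodestar APIs.

```typescript
// Illustrative arithmetic only. Assumes about one cached state per slot,
// which is a simplification and not a statement about Lodestar internals.
const SLOTS_PER_EPOCH = 32; // mainnet preset value

const OLD_MAX_STATES = 3 * SLOTS_PER_EPOCH; // 96, the limit before this PR
const NEW_MAX_STATES = 6 * SLOTS_PER_EPOCH; // 192, the limit proposed here

/** Number of slots between the checkpoint and head when `epochsBehind` epochs behind. */
function slotsBehind(epochsBehind: number): number {
  return epochsBehind * SLOTS_PER_EPOCH;
}

/** True if the cache limit leaves spare room beyond the slots still to be covered. */
function hasMargin(maxStates: number, epochsBehind: number): boolean {
  return maxStates > slotsBehind(epochsBehind);
}

console.log(hasMargin(OLD_MAX_STATES, 3)); // 96 vs 96 slots: no spare room
console.log(hasMargin(NEW_MAX_STATES, 3)); // 192 vs 96 slots: 3 epochs of margin
```

Under this model the old limit is exactly consumed when starting 3 epochs behind, which matches the 96+ skipped slots mentioned in the notifier output.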

(see the notifier here comfortably crossing 96+ skipped slots since the last finalized checkpoint sync without error)
image

The user's issue also seems to be resolved.

Closes #4523

@g11tech g11tech requested a review from a team as a code owner September 10, 2022 23:05
@github-actions
Contributor

Performance Report

✔️ no performance regression detected

Full benchmark results
Benchmark suite Current: e3cdd4b Previous: 65b38ee Ratio
getPubkeys - index2pubkey - req 1000 vs - 250000 vc 2.0096 ms/op 2.0655 ms/op 0.97
getPubkeys - validatorsArr - req 1000 vs - 250000 vc 85.763 us/op 71.801 us/op 1.19
BLS verify - blst-native 2.2217 ms/op 1.8555 ms/op 1.20
BLS verifyMultipleSignatures 3 - blst-native 4.5250 ms/op 3.8096 ms/op 1.19
BLS verifyMultipleSignatures 8 - blst-native 9.7068 ms/op 8.1954 ms/op 1.18
BLS verifyMultipleSignatures 32 - blst-native 35.284 ms/op 29.693 ms/op 1.19
BLS aggregatePubkeys 32 - blst-native 46.607 us/op 39.098 us/op 1.19
BLS aggregatePubkeys 128 - blst-native 183.54 us/op 152.89 us/op 1.20
getAttestationsForBlock 195.22 ms/op 170.27 ms/op 1.15
isKnown best case - 1 super set check 513.00 ns/op 430.00 ns/op 1.19
isKnown normal case - 2 super set checks 496.00 ns/op 416.00 ns/op 1.19
isKnown worse case - 16 super set checks 503.00 ns/op 414.00 ns/op 1.21
CheckpointStateCache - add get delete 9.9340 us/op 8.9150 us/op 1.11
validate gossip signedAggregateAndProof - struct 5.1367 ms/op 4.2718 ms/op 1.20
validate gossip attestation - struct 2.4365 ms/op 2.0335 ms/op 1.20
pickEth1Vote - no votes 2.5496 ms/op 2.1438 ms/op 1.19
pickEth1Vote - max votes 21.166 ms/op 19.461 ms/op 1.09
pickEth1Vote - Eth1Data hashTreeRoot value x2048 12.768 ms/op 11.509 ms/op 1.11
pickEth1Vote - Eth1Data hashTreeRoot tree x2048 23.239 ms/op 21.778 ms/op 1.07
pickEth1Vote - Eth1Data fastSerialize value x2048 1.8729 ms/op 1.6318 ms/op 1.15
pickEth1Vote - Eth1Data fastSerialize tree x2048 14.087 ms/op 13.216 ms/op 1.07
bytes32 toHexString 1.2110 us/op 1.0430 us/op 1.16
bytes32 Buffer.toString(hex) 862.00 ns/op 681.00 ns/op 1.27
bytes32 Buffer.toString(hex) from Uint8Array 1.1010 us/op 938.00 ns/op 1.17
bytes32 Buffer.toString(hex) + 0x 823.00 ns/op 688.00 ns/op 1.20
Object access 1 prop 0.40400 ns/op 0.37200 ns/op 1.09
Map access 1 prop 0.34300 ns/op 0.29400 ns/op 1.17
Object get x1000 20.987 ns/op 18.304 ns/op 1.15
Map get x1000 1.1620 ns/op 0.96900 ns/op 1.20
Object set x1000 130.18 ns/op 124.66 ns/op 1.04
Map set x1000 79.535 ns/op 76.893 ns/op 1.03
Return object 10000 times 0.44600 ns/op 0.37610 ns/op 1.19
Throw Error 10000 times 7.0110 us/op 6.0132 us/op 1.17
enrSubnets - fastDeserialize 64 bits 2.9180 us/op 2.5210 us/op 1.16
enrSubnets - ssz BitVector 64 bits 832.00 ns/op 755.00 ns/op 1.10
enrSubnets - fastDeserialize 4 bits 418.00 ns/op 369.00 ns/op 1.13
enrSubnets - ssz BitVector 4 bits 837.00 ns/op 734.00 ns/op 1.14
prioritizePeers score -10:0 att 32-0.1 sync 2-0 105.23 us/op 102.61 us/op 1.03
prioritizePeers score 0:0 att 32-0.25 sync 2-0.25 156.41 us/op 122.17 us/op 1.28
prioritizePeers score 0:0 att 32-0.5 sync 2-0.5 248.43 us/op 221.06 us/op 1.12
prioritizePeers score 0:0 att 64-0.75 sync 4-0.75 457.69 us/op 465.52 us/op 0.98
prioritizePeers score 0:0 att 64-1 sync 4-1 544.12 us/op 458.69 us/op 1.19
RateTracker 1000000 limit, 1 obj count per request 244.67 ns/op 188.12 ns/op 1.30
RateTracker 1000000 limit, 2 obj count per request 180.65 ns/op 141.30 ns/op 1.28
RateTracker 1000000 limit, 4 obj count per request 151.63 ns/op 119.87 ns/op 1.26
RateTracker 1000000 limit, 8 obj count per request 141.60 ns/op 107.59 ns/op 1.32
RateTracker with prune 4.8510 us/op 4.2310 us/op 1.15
array of 16000 items push then shift 3.7580 us/op 3.1753 us/op 1.18
LinkedList of 16000 items push then shift 20.240 ns/op 17.525 ns/op 1.15
array of 16000 items push then pop 268.13 ns/op 226.65 ns/op 1.18
LinkedList of 16000 items push then pop 19.399 ns/op 16.272 ns/op 1.19
array of 24000 items push then shift 5.4681 us/op 4.5630 us/op 1.20
LinkedList of 24000 items push then shift 23.158 ns/op 20.273 ns/op 1.14
array of 24000 items push then pop 232.98 ns/op 207.71 ns/op 1.12
LinkedList of 24000 items push then pop 20.513 ns/op 17.661 ns/op 1.16
intersect bitArray bitLen 8 14.035 ns/op 11.701 ns/op 1.20
intersect array and set length 8 187.44 ns/op 167.61 ns/op 1.12
intersect bitArray bitLen 128 74.220 ns/op 72.214 ns/op 1.03
intersect array and set length 128 2.6375 us/op 2.3105 us/op 1.14
Buffer.concat 32 items 2.2690 ns/op 1.9000 ns/op 1.19
pass gossip attestations to forkchoice per slot 3.7153 ms/op 3.2191 ms/op 1.15
computeDeltas 4.0358 ms/op 3.4919 ms/op 1.16
computeProposerBoostScoreFromBalances 1.0751 ms/op 907.70 us/op 1.18
altair processAttestation - 250000 vs - 7PWei normalcase 4.1456 ms/op 3.7169 ms/op 1.12
altair processAttestation - 250000 vs - 7PWei worstcase 7.2853 ms/op 6.0288 ms/op 1.21
altair processAttestation - setStatus - 1/6 committees join 242.70 us/op 210.30 us/op 1.15
altair processAttestation - setStatus - 1/3 committees join 463.65 us/op 401.52 us/op 1.15
altair processAttestation - setStatus - 1/2 committees join 651.00 us/op 564.88 us/op 1.15
altair processAttestation - setStatus - 2/3 committees join 855.45 us/op 724.12 us/op 1.18
altair processAttestation - setStatus - 4/5 committees join 1.1757 ms/op 1.0049 ms/op 1.17
altair processAttestation - setStatus - 100% committees join 1.3896 ms/op 1.1863 ms/op 1.17
altair processBlock - 250000 vs - 7PWei normalcase 30.382 ms/op 27.418 ms/op 1.11
altair processBlock - 250000 vs - 7PWei normalcase hashState 45.241 ms/op 41.681 ms/op 1.09
altair processBlock - 250000 vs - 7PWei worstcase 95.477 ms/op 83.294 ms/op 1.15
altair processBlock - 250000 vs - 7PWei worstcase hashState 114.42 ms/op 102.47 ms/op 1.12
phase0 processBlock - 250000 vs - 7PWei normalcase 4.1941 ms/op 4.2389 ms/op 0.99
phase0 processBlock - 250000 vs - 7PWei worstcase 56.045 ms/op 49.719 ms/op 1.13
altair processEth1Data - 250000 vs - 7PWei normalcase 1.1405 ms/op 785.28 us/op 1.45
Tree 40 250000 create 920.63 ms/op 832.01 ms/op 1.11
Tree 40 250000 get(125000) 341.93 ns/op 293.04 ns/op 1.17
Tree 40 250000 set(125000) 2.6924 us/op 2.7545 us/op 0.98
Tree 40 250000 toArray() 37.245 ms/op 31.709 ms/op 1.17
Tree 40 250000 iterate all - toArray() + loop 37.380 ms/op 32.270 ms/op 1.16
Tree 40 250000 iterate all - get(i) 131.64 ms/op 111.25 ms/op 1.18
MutableVector 250000 create 18.029 ms/op 15.306 ms/op 1.18
MutableVector 250000 get(125000) 17.653 ns/op 13.087 ns/op 1.35
MutableVector 250000 set(125000) 691.32 ns/op 677.61 ns/op 1.02
MutableVector 250000 toArray() 8.5583 ms/op 12.219 ms/op 0.70
MutableVector 250000 iterate all - toArray() + loop 9.2799 ms/op 7.3261 ms/op 1.27
MutableVector 250000 iterate all - get(i) 4.1402 ms/op 3.2897 ms/op 1.26
Array 250000 create 7.8971 ms/op 6.6140 ms/op 1.19
Array 250000 clone - spread 4.2787 ms/op 2.9964 ms/op 1.43
Array 250000 get(125000) 1.8000 ns/op 1.2740 ns/op 1.41
Array 250000 set(125000) 1.7850 ns/op 1.2580 ns/op 1.42
Array 250000 iterate all - loop 203.28 us/op 167.90 us/op 1.21
effectiveBalanceIncrements clone Uint8Array 300000 94.075 us/op 83.439 us/op 1.13
effectiveBalanceIncrements clone MutableVector 300000 1.3160 us/op 867.00 ns/op 1.52
effectiveBalanceIncrements rw all Uint8Array 300000 303.23 us/op 252.70 us/op 1.20
effectiveBalanceIncrements rw all MutableVector 300000 231.03 ms/op 187.73 ms/op 1.23
phase0 afterProcessEpoch - 250000 vs - 7PWei 219.83 ms/op 181.18 ms/op 1.21
phase0 beforeProcessEpoch - 250000 vs - 7PWei 79.000 ms/op 73.975 ms/op 1.07
altair processEpoch - mainnet_e81889 679.29 ms/op 579.52 ms/op 1.17
mainnet_e81889 - altair beforeProcessEpoch 170.87 ms/op 146.95 ms/op 1.16
mainnet_e81889 - altair processJustificationAndFinalization 30.179 us/op 22.221 us/op 1.36
mainnet_e81889 - altair processInactivityUpdates 13.169 ms/op 10.915 ms/op 1.21
mainnet_e81889 - altair processRewardsAndPenalties 107.82 ms/op 91.927 ms/op 1.17
mainnet_e81889 - altair processRegistryUpdates 7.0480 us/op 3.5930 us/op 1.96
mainnet_e81889 - altair processSlashings 2.1330 us/op 836.00 ns/op 2.55
mainnet_e81889 - altair processEth1DataReset 2.0230 us/op 811.00 ns/op 2.49
mainnet_e81889 - altair processEffectiveBalanceUpdates 2.9471 ms/op 2.4175 ms/op 1.22
mainnet_e81889 - altair processSlashingsReset 10.233 us/op 6.4370 us/op 1.59
mainnet_e81889 - altair processRandaoMixesReset 12.069 us/op 5.2830 us/op 2.28
mainnet_e81889 - altair processHistoricalRootsUpdate 2.1120 us/op 813.00 ns/op 2.60
mainnet_e81889 - altair processParticipationFlagUpdates 6.6430 us/op 2.8110 us/op 2.36
mainnet_e81889 - altair processSyncCommitteeUpdates 1.6430 us/op 667.00 ns/op 2.46
mainnet_e81889 - altair afterProcessEpoch 228.96 ms/op 192.57 ms/op 1.19
phase0 processEpoch - mainnet_e58758 609.16 ms/op 525.91 ms/op 1.16
mainnet_e58758 - phase0 beforeProcessEpoch 266.44 ms/op 232.53 ms/op 1.15
mainnet_e58758 - phase0 processJustificationAndFinalization 35.025 us/op 20.413 us/op 1.72
mainnet_e58758 - phase0 processRewardsAndPenalties 92.738 ms/op 142.73 ms/op 0.65
mainnet_e58758 - phase0 processRegistryUpdates 16.008 us/op 9.8530 us/op 1.62
mainnet_e58758 - phase0 processSlashings 1.8230 us/op 742.00 ns/op 2.46
mainnet_e58758 - phase0 processEth1DataReset 2.0310 us/op 769.00 ns/op 2.64
mainnet_e58758 - phase0 processEffectiveBalanceUpdates 2.2758 ms/op 1.8919 ms/op 1.20
mainnet_e58758 - phase0 processSlashingsReset 9.4860 us/op 4.0550 us/op 2.34
mainnet_e58758 - phase0 processRandaoMixesReset 12.668 us/op 5.5470 us/op 2.28
mainnet_e58758 - phase0 processHistoricalRootsUpdate 2.2560 us/op 785.00 ns/op 2.87
mainnet_e58758 - phase0 processParticipationRecordUpdates 11.141 us/op 4.6580 us/op 2.39
mainnet_e58758 - phase0 afterProcessEpoch 202.03 ms/op 157.70 ms/op 1.28
phase0 processEffectiveBalanceUpdates - 250000 normalcase 3.0277 ms/op 2.6287 ms/op 1.15
phase0 processEffectiveBalanceUpdates - 250000 worstcase 0.5 4.1496 ms/op 3.4319 ms/op 1.21
altair processInactivityUpdates - 250000 normalcase 44.551 ms/op 39.288 ms/op 1.13
altair processInactivityUpdates - 250000 worstcase 55.663 ms/op 51.049 ms/op 1.09
phase0 processRegistryUpdates - 250000 normalcase 13.510 us/op 8.2250 us/op 1.64
phase0 processRegistryUpdates - 250000 badcase_full_deposits 486.97 us/op 407.43 us/op 1.20
phase0 processRegistryUpdates - 250000 worstcase 0.5 245.98 ms/op 210.32 ms/op 1.17
altair processRewardsAndPenalties - 250000 normalcase 159.97 ms/op 125.82 ms/op 1.27
altair processRewardsAndPenalties - 250000 worstcase 97.360 ms/op 87.301 ms/op 1.12
phase0 getAttestationDeltas - 250000 normalcase 15.090 ms/op 13.533 ms/op 1.12
phase0 getAttestationDeltas - 250000 worstcase 14.817 ms/op 13.671 ms/op 1.08
phase0 processSlashings - 250000 worstcase 6.2809 ms/op 5.3569 ms/op 1.17
altair processSyncCommitteeUpdates - 250000 325.70 ms/op 285.64 ms/op 1.14
BeaconState.hashTreeRoot - No change 602.00 ns/op 499.00 ns/op 1.21
BeaconState.hashTreeRoot - 1 full validator 75.507 us/op 55.522 us/op 1.36
BeaconState.hashTreeRoot - 32 full validator 740.09 us/op 551.17 us/op 1.34
BeaconState.hashTreeRoot - 512 full validator 9.7975 ms/op 6.1534 ms/op 1.59
BeaconState.hashTreeRoot - 1 validator.effectiveBalance 90.856 us/op 81.865 us/op 1.11
BeaconState.hashTreeRoot - 32 validator.effectiveBalance 1.3531 ms/op 1.2068 ms/op 1.12
BeaconState.hashTreeRoot - 512 validator.effectiveBalance 17.533 ms/op 18.212 ms/op 0.96
BeaconState.hashTreeRoot - 1 balances 70.796 us/op 61.527 us/op 1.15
BeaconState.hashTreeRoot - 32 balances 668.84 us/op 563.24 us/op 1.19
BeaconState.hashTreeRoot - 512 balances 6.6778 ms/op 5.6544 ms/op 1.18
BeaconState.hashTreeRoot - 250000 balances 106.22 ms/op 94.456 ms/op 1.12
aggregationBits - 2048 els - zipIndexesInBitList 39.537 us/op 35.729 us/op 1.11
regular array get 100000 times 80.535 us/op 67.408 us/op 1.19
wrappedArray get 100000 times 80.739 us/op 67.406 us/op 1.20
arrayWithProxy get 100000 times 34.248 ms/op 28.908 ms/op 1.18
ssz.Root.equals 556.00 ns/op 504.00 ns/op 1.10
byteArrayEquals 556.00 ns/op 516.00 ns/op 1.08
shuffle list - 16384 els 13.297 ms/op 11.084 ms/op 1.20
shuffle list - 250000 els 193.59 ms/op 163.43 ms/op 1.18
processSlot - 1 slots 13.663 us/op 12.744 us/op 1.07
processSlot - 32 slots 2.0052 ms/op 1.8340 ms/op 1.09
getEffectiveBalanceIncrementsZeroInactive - 250000 vs - 7PWei 436.91 us/op 351.95 us/op 1.24
getCommitteeAssignments - req 1 vs - 250000 vc 6.3230 ms/op 5.2780 ms/op 1.20
getCommitteeAssignments - req 100 vs - 250000 vc 8.7666 ms/op 7.3379 ms/op 1.19
getCommitteeAssignments - req 1000 vs - 250000 vc 9.3555 ms/op 7.7631 ms/op 1.21
RootCache.getBlockRootAtSlot - 250000 vs - 7PWei 11.930 ns/op 9.5900 ns/op 1.24
state getBlockRootAtSlot - 250000 vs - 7PWei 1.2211 us/op 1.2234 us/op 1.00
computeProposers - vc 250000 19.479 ms/op 16.999 ms/op 1.15
computeEpochShuffling - vc 250000 198.38 ms/op 165.61 ms/op 1.20
getNextSyncCommittee - vc 250000 321.08 ms/op 281.01 ms/op 1.14

by benchmarkbot/action

@@ -6,7 +6,7 @@ import {IMetrics} from "../../metrics/index.js";
import {MapTracker} from "./mapMetrics.js";
import {stateInternalCachePopulated} from "./stateContextCheckpointsCache.js";

-const MAX_STATES = 3 * 32;
+const MAX_STATES = 6 * 32;
Contributor

We need to investigate the impact of this on heap memory (which drives the gc percentage metrics). In the past, when the chain was not finalizing for a while, I saw heap memory spike at the same time.

Contributor

@dapplion dapplion left a comment

We must not merge this increase in cache size. The problem here is that the head state should never be dropped; period. This change will dramatically increase the chance of OOM in Lodestar.
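
One way to read the point above is that the cache limit can stay small as long as pruning never evicts the head state. The following is a minimal sketch of that idea only; `PinnedStateCache` and its methods are hypothetical names, and this is neither Lodestar's stateCache implementation nor the fix that was eventually merged.

```typescript
// Hypothetical bounded cache that never evicts one "pinned" entry (e.g. the head state).
class PinnedStateCache<V> {
  private readonly map = new Map<string, V>();
  private readonly maxEntries: number;
  private pinnedKey: string | null = null;

  constructor(maxEntries: number) {
    this.maxEntries = maxEntries;
  }

  /** Mark a key as protected from pruning. */
  setPinned(key: string): void {
    this.pinnedKey = key;
  }

  add(key: string, value: V): void {
    this.map.delete(key); // re-insert so the key counts as most recently added
    this.map.set(key, value);
    this.prune();
  }

  get(key: string): V | undefined {
    return this.map.get(key);
  }

  get size(): number {
    return this.map.size;
  }

  /** Evict oldest entries first, skipping the pinned key, until within the limit. */
  private prune(): void {
    for (const key of Array.from(this.map.keys())) {
      if (this.map.size <= this.maxEntries) break;
      if (key === this.pinnedKey) continue;
      this.map.delete(key);
    }
  }
}

const cache = new PinnedStateCache<string>(2);
cache.add("head", "head-state");
cache.setPinned("head");
cache.add("a", "state-a");
cache.add("b", "state-b");
cache.add("c", "state-c");
console.log(cache.get("head") !== undefined); // the pinned entry survives pruning
```

The trade-off in this sketch is that memory stays bounded by `maxEntries` while the pinned entry is always retrievable, instead of raising the limit for every entry.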

@g11tech
Contributor Author

g11tech commented Sep 12, 2022

We must not merge this increase in cache size. The problem here is that the head state should never be dropped; period. This change will dramatically increase the chance of OOM in Lodestar.

👍 let me see how we can handle this particular case in a better way

@g11tech
Contributor Author

g11tech commented Sep 25, 2022

fixed in #4562

@g11tech g11tech closed this Sep 25, 2022
@twoeths twoeths deleted the g11tech/fix-headstatemissing branch September 26, 2022 01:43
Development

Successfully merging this pull request may close these issues.

Node notifier error headState does not exist
3 participants