
feat: track syncing status and fetch duties on resynced #6995

Merged
merged 23 commits into unstable from nflaig/duties-on-synced on Aug 2, 2024

Conversation

@nflaig (Member) commented Aug 1, 2024

Motivation

The connected beacon node might be syncing or offline on validator client startup or at the start of a new epoch; in both cases, we wait until the next epoch to fetch validator duties again. This is not ideal, as we can miss attestation and sync committee duties for a whole epoch (this does not apply to block duties, as we fetch those each slot).

Since all duty APIs return a 503 if the node is syncing, we can keep track of the syncing status, and when it changes from syncing to synced we can fetch validator duties again, as the request is very likely to succeed if the previous syncing status check had no issues.

Tracking the syncing status of the connected beacon node via GET /eth/v1/node/syncing is practically free and can be done once per slot, similar to tracking beacon health (see #4939).
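
A minimal sketch of the idea in TypeScript (hypothetical names, not the actual Lodestar implementation): poll the syncing status once per slot and re-fetch duties when the node transitions from syncing (or an errored check) to synced.

type SyncingStatus = {headSlot: string; syncDistance: string; isSyncing: boolean; elOffline?: boolean};

class SyncingStatusTracker {
  private prevSyncingStatus: SyncingStatus | Error | null = null;

  constructor(
    // api.getSyncingStatus and fetchDuties stand in for the real validator client internals
    private readonly api: {getSyncingStatus(): Promise<SyncingStatus>},
    private readonly fetchDuties: (slot: number) => Promise<void>
  ) {}

  // Intended to be called once per slot, e.g. from the clock service
  async onSlot(slot: number): Promise<void> {
    try {
      const status = await this.api.getSyncingStatus();
      const prev = this.prevSyncingStatus;
      // The first check after startup does not count as a "resynced" transition
      const wasSyncingOrErrored = prev !== null && (prev instanceof Error || prev.isSyncing);

      if (!status.isSyncing && wasSyncingOrErrored) {
        // Node just became synced: duty requests are now very likely to succeed,
        // so fetch duties right away instead of waiting for the next epoch
        await this.fetchDuties(slot);
      }

      this.prevSyncingStatus = status;
    } catch (e) {
      // Network error, e.g. node offline; treat as not synced for the next comparison
      this.prevSyncingStatus = e as Error;
    }
  }
}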

Description

  • Track syncing status of connected beacon node(s) and fetch duties on resynced
  • Add SSZ support to the getSyncingStatus API
  • Move beacon health metric to status tracker (from "Track beacon health from vc" #4939)
  • Add beacon health panel to validator client dashboard

The same reasoning from #4939 applies:

Adds the metric vc_beacon_health to track the health of the "current" connected beacon node. Note that the retry logic of the multi-endpoint REST client applies, but the metric should represent the current URL that the validator client is attempting to pull duties from.
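
As a rough illustration (a sketch using prom-client; the enum values are assumptions and not necessarily the exact Lodestar registration), the gauge could be set from the same per-slot status check:

import {Gauge} from "prom-client";

// Assumed health levels, loosely following the #4939 discussion
enum BeaconHealth {
  READY = 0,
  SYNCING = 1,
  ERROR = 2,
}

const beaconHealth = new Gauge({
  name: "vc_beacon_health",
  help: "Health of the beacon node the validator client is currently connected to",
});

// Hypothetical hook called once per slot by the syncing status tracker
function updateBeaconHealth(status: {isSyncing: boolean} | Error): void {
  if (status instanceof Error) {
    beaconHealth.set(BeaconHealth.ERROR);
  } else {
    beaconHealth.set(status.isSyncing ? BeaconHealth.SYNCING : BeaconHealth.READY);
  }
}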

In a multi-node setup, we will be tracking the syncing status of the first node that returns a successful response, which in the case of getSyncingStatus should be most of the time, as the only error condition is a network error, e.g. the node being offline. In my opinion, this behavior is good enough considering how our fallback logic works; in a multi-node setup it is much less likely that all connected nodes are syncing, and obtaining duties is much more reliable.

Closes #6962

@nflaig requested a review from a team as a code owner on August 1, 2024 12:10
@nflaig marked this pull request as draft on August 1, 2024 12:10

codecov bot commented Aug 1, 2024

Codecov Report

Attention: Patch coverage is 89.81481% with 11 lines in your changes missing coverage. Please review.

Project coverage is 49.23%. Comparing base (be03ef1) to head (5e21e6a).
Report is 3 commits behind head on unstable.

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #6995      +/-   ##
============================================
+ Coverage     49.10%   49.23%   +0.12%     
============================================
  Files           577      578       +1     
  Lines         37336    37426      +90     
  Branches       2139     2162      +23     
============================================
+ Hits          18334    18425      +91     
+ Misses        18963    18961       -2     
- Partials         39       40       +1     

github-actions bot (Contributor) commented Aug 1, 2024

Performance Report

✔️ no performance regression detected

Full benchmark results
Benchmark suite Current: 4965e21 Previous: cf00c3f Ratio
getPubkeys - index2pubkey - req 1000 vs - 250000 vc 1.8987 ms/op 1.7436 ms/op 1.09
getPubkeys - validatorsArr - req 1000 vs - 250000 vc 42.177 us/op 39.748 us/op 1.06
BLS verify - blst 921.70 us/op 860.98 us/op 1.07
BLS verifyMultipleSignatures 3 - blst 1.3454 ms/op 1.2737 ms/op 1.06
BLS verifyMultipleSignatures 8 - blst 2.0715 ms/op 1.9901 ms/op 1.04
BLS verifyMultipleSignatures 32 - blst 4.7314 ms/op 4.4934 ms/op 1.05
BLS verifyMultipleSignatures 64 - blst 8.4482 ms/op 8.2804 ms/op 1.02
BLS verifyMultipleSignatures 128 - blst 16.003 ms/op 15.730 ms/op 1.02
BLS deserializing 10000 signatures 624.34 ms/op 572.28 ms/op 1.09
BLS deserializing 100000 signatures 6.2780 s/op 5.8852 s/op 1.07
BLS verifyMultipleSignatures - same message - 3 - blst 936.08 us/op 938.54 us/op 1.00
BLS verifyMultipleSignatures - same message - 8 - blst 1.1277 ms/op 1.0909 ms/op 1.03
BLS verifyMultipleSignatures - same message - 32 - blst 1.6537 ms/op 1.6668 ms/op 0.99
BLS verifyMultipleSignatures - same message - 64 - blst 2.4305 ms/op 2.4708 ms/op 0.98
BLS verifyMultipleSignatures - same message - 128 - blst 3.9819 ms/op 3.9666 ms/op 1.00
BLS aggregatePubkeys 32 - blst 17.182 us/op 17.346 us/op 0.99
BLS aggregatePubkeys 128 - blst 60.640 us/op 60.146 us/op 1.01
notSeenSlots=1 numMissedVotes=1 numBadVotes=10 69.664 ms/op 59.293 ms/op 1.17
notSeenSlots=1 numMissedVotes=0 numBadVotes=4 41.385 ms/op 60.301 ms/op 0.69
notSeenSlots=2 numMissedVotes=1 numBadVotes=10 30.668 ms/op 28.238 ms/op 1.09
getSlashingsAndExits - default max 75.384 us/op 62.650 us/op 1.20
getSlashingsAndExits - 2k 342.02 us/op 264.77 us/op 1.29
proposeBlockBody type=full, size=empty 4.9514 ms/op 5.2195 ms/op 0.95
isKnown best case - 1 super set check 827.00 ns/op 412.00 ns/op 2.01
isKnown normal case - 2 super set checks 790.00 ns/op 429.00 ns/op 1.84
isKnown worse case - 16 super set checks 783.00 ns/op 428.00 ns/op 1.83
InMemoryCheckpointStateCache - add get delete 6.0600 us/op 5.0300 us/op 1.20
validate api signedAggregateAndProof - struct 1.9941 ms/op 1.9233 ms/op 1.04
validate gossip signedAggregateAndProof - struct 1.5325 ms/op 1.7305 ms/op 0.89
validate gossip attestation - vc 640000 967.99 us/op 984.75 us/op 0.98
batch validate gossip attestation - vc 640000 - chunk 32 133.95 us/op 124.18 us/op 1.08
batch validate gossip attestation - vc 640000 - chunk 64 119.97 us/op 104.12 us/op 1.15
batch validate gossip attestation - vc 640000 - chunk 128 107.31 us/op 96.597 us/op 1.11
batch validate gossip attestation - vc 640000 - chunk 256 109.17 us/op 98.474 us/op 1.11
pickEth1Vote - no votes 981.01 us/op 785.36 us/op 1.25
pickEth1Vote - max votes 6.4770 ms/op 7.3717 ms/op 0.88
pickEth1Vote - Eth1Data hashTreeRoot value x2048 14.720 ms/op 12.029 ms/op 1.22
pickEth1Vote - Eth1Data hashTreeRoot tree x2048 24.909 ms/op 19.990 ms/op 1.25
pickEth1Vote - Eth1Data fastSerialize value x2048 390.44 us/op 391.45 us/op 1.00
pickEth1Vote - Eth1Data fastSerialize tree x2048 6.7435 ms/op 3.3766 ms/op 2.00
bytes32 toHexString 881.00 ns/op 654.00 ns/op 1.35
bytes32 Buffer.toString(hex) 473.00 ns/op 482.00 ns/op 0.98
bytes32 Buffer.toString(hex) from Uint8Array 653.00 ns/op 582.00 ns/op 1.12
bytes32 Buffer.toString(hex) + 0x 477.00 ns/op 465.00 ns/op 1.03
Object access 1 prop 0.38200 ns/op 0.32600 ns/op 1.17
Map access 1 prop 0.32200 ns/op 0.32700 ns/op 0.98
Object get x1000 4.9970 ns/op 4.9560 ns/op 1.01
Map get x1000 5.7160 ns/op 5.5980 ns/op 1.02
Object set x1000 40.554 ns/op 26.957 ns/op 1.50
Map set x1000 31.398 ns/op 19.279 ns/op 1.63
Return object 10000 times 0.30320 ns/op 0.28470 ns/op 1.06
Throw Error 10000 times 2.7763 us/op 2.6586 us/op 1.04
fastMsgIdFn sha256 / 200 bytes 2.0930 us/op 2.0630 us/op 1.01
fastMsgIdFn h32 xxhash / 200 bytes 500.00 ns/op 428.00 ns/op 1.17
fastMsgIdFn h64 xxhash / 200 bytes 479.00 ns/op 440.00 ns/op 1.09
fastMsgIdFn sha256 / 1000 bytes 6.2610 us/op 5.8250 us/op 1.07
fastMsgIdFn h32 xxhash / 1000 bytes 634.00 ns/op 515.00 ns/op 1.23
fastMsgIdFn h64 xxhash / 1000 bytes 536.00 ns/op 500.00 ns/op 1.07
fastMsgIdFn sha256 / 10000 bytes 52.210 us/op 50.146 us/op 1.04
fastMsgIdFn h32 xxhash / 10000 bytes 2.0400 us/op 1.8970 us/op 1.08
fastMsgIdFn h64 xxhash / 10000 bytes 1.3770 us/op 1.3160 us/op 1.05
send data - 1000 256B messages 14.060 ms/op 9.9564 ms/op 1.41
send data - 1000 512B messages 17.519 ms/op 12.961 ms/op 1.35
send data - 1000 1024B messages 28.265 ms/op 20.716 ms/op 1.36
send data - 1000 1200B messages 28.698 ms/op 23.749 ms/op 1.21
send data - 1000 2048B messages 32.877 ms/op 28.915 ms/op 1.14
send data - 1000 4096B messages 35.993 ms/op 26.574 ms/op 1.35
send data - 1000 16384B messages 71.163 ms/op 66.929 ms/op 1.06
send data - 1000 65536B messages 307.67 ms/op 255.81 ms/op 1.20
enrSubnets - fastDeserialize 64 bits 1.8230 us/op 1.1670 us/op 1.56
enrSubnets - ssz BitVector 64 bits 676.00 ns/op 522.00 ns/op 1.30
enrSubnets - fastDeserialize 4 bits 423.00 ns/op 347.00 ns/op 1.22
enrSubnets - ssz BitVector 4 bits 687.00 ns/op 560.00 ns/op 1.23
prioritizePeers score -10:0 att 32-0.1 sync 2-0 225.78 us/op 110.08 us/op 2.05
prioritizePeers score 0:0 att 32-0.25 sync 2-0.25 146.09 us/op 153.41 us/op 0.95
prioritizePeers score 0:0 att 32-0.5 sync 2-0.5 216.85 us/op 202.26 us/op 1.07
prioritizePeers score 0:0 att 64-0.75 sync 4-0.75 471.24 us/op 336.79 us/op 1.40
prioritizePeers score 0:0 att 64-1 sync 4-1 929.68 us/op 428.10 us/op 2.17
array of 16000 items push then shift 1.3071 us/op 1.2217 us/op 1.07
LinkedList of 16000 items push then shift 12.387 ns/op 7.1930 ns/op 1.72
array of 16000 items push then pop 139.27 ns/op 94.571 ns/op 1.47
LinkedList of 16000 items push then pop 10.514 ns/op 6.2960 ns/op 1.67
array of 24000 items push then shift 2.2073 us/op 1.7983 us/op 1.23
LinkedList of 24000 items push then shift 7.2630 ns/op 6.6530 ns/op 1.09
array of 24000 items push then pop 116.66 ns/op 105.72 ns/op 1.10
LinkedList of 24000 items push then pop 6.2970 ns/op 6.6030 ns/op 0.95
intersect bitArray bitLen 8 5.6000 ns/op 5.5850 ns/op 1.00
intersect array and set length 8 41.207 ns/op 38.531 ns/op 1.07
intersect bitArray bitLen 128 26.876 ns/op 27.653 ns/op 0.97
intersect array and set length 128 600.28 ns/op 573.92 ns/op 1.05
bitArray.getTrueBitIndexes() bitLen 128 2.2240 us/op 1.2670 us/op 1.76
bitArray.getTrueBitIndexes() bitLen 248 3.0230 us/op 2.0280 us/op 1.49
bitArray.getTrueBitIndexes() bitLen 512 5.9680 us/op 3.7720 us/op 1.58
Buffer.concat 32 items 998.00 ns/op 978.00 ns/op 1.02
Uint8Array.set 32 items 1.2980 us/op 1.3480 us/op 0.96
Buffer.copy 1.5310 us/op 1.3870 us/op 1.10
Uint8Array.set - with subarray 1.9210 us/op 1.8820 us/op 1.02
Uint8Array.set - without subarray 1.4330 us/op 1.3200 us/op 1.09
getUint32 - dataview 424.00 ns/op 386.00 ns/op 1.10
getUint32 - manual 362.00 ns/op 316.00 ns/op 1.15
Set add up to 64 items then delete first 1.8025 us/op 1.7584 us/op 1.03
OrderedSet add up to 64 items then delete first 2.7994 us/op 2.7326 us/op 1.02
Set add up to 64 items then delete last 2.0326 us/op 2.0197 us/op 1.01
OrderedSet add up to 64 items then delete last 3.0486 us/op 3.1391 us/op 0.97
Set add up to 64 items then delete middle 2.0388 us/op 2.0463 us/op 1.00
OrderedSet add up to 64 items then delete middle 4.4405 us/op 4.5294 us/op 0.98
Set add up to 128 items then delete first 4.0220 us/op 3.9323 us/op 1.02
OrderedSet add up to 128 items then delete first 6.2126 us/op 5.8991 us/op 1.05
Set add up to 128 items then delete last 3.9165 us/op 3.9520 us/op 0.99
OrderedSet add up to 128 items then delete last 6.1842 us/op 6.0619 us/op 1.02
Set add up to 128 items then delete middle 3.8920 us/op 3.9273 us/op 0.99
OrderedSet add up to 128 items then delete middle 11.999 us/op 12.015 us/op 1.00
Set add up to 256 items then delete first 7.8773 us/op 7.6805 us/op 1.03
OrderedSet add up to 256 items then delete first 12.552 us/op 11.570 us/op 1.08
Set add up to 256 items then delete last 7.6197 us/op 7.6656 us/op 0.99
OrderedSet add up to 256 items then delete last 11.767 us/op 11.963 us/op 0.98
Set add up to 256 items then delete middle 7.6636 us/op 7.5982 us/op 1.01
OrderedSet add up to 256 items then delete middle 34.760 us/op 34.888 us/op 1.00
transfer serialized Status (84 B) 1.3650 us/op 1.4700 us/op 0.93
copy serialized Status (84 B) 1.1490 us/op 1.2800 us/op 0.90
transfer serialized SignedVoluntaryExit (112 B) 1.4330 us/op 1.4740 us/op 0.97
copy serialized SignedVoluntaryExit (112 B) 1.2150 us/op 1.2110 us/op 1.00
transfer serialized ProposerSlashing (416 B) 1.5450 us/op 2.0830 us/op 0.74
copy serialized ProposerSlashing (416 B) 1.4390 us/op 1.5450 us/op 0.93
transfer serialized Attestation (485 B) 1.5950 us/op 1.5900 us/op 1.00
copy serialized Attestation (485 B) 1.7820 us/op 2.0070 us/op 0.89
transfer serialized AttesterSlashing (33232 B) 1.9460 us/op 2.3020 us/op 0.85
copy serialized AttesterSlashing (33232 B) 4.4150 us/op 4.2880 us/op 1.03
transfer serialized Small SignedBeaconBlock (128000 B) 2.4990 us/op 2.7030 us/op 0.92
copy serialized Small SignedBeaconBlock (128000 B) 10.539 us/op 8.9260 us/op 1.18
transfer serialized Avg SignedBeaconBlock (200000 B) 3.3580 us/op 2.8990 us/op 1.16
copy serialized Avg SignedBeaconBlock (200000 B) 14.467 us/op 12.883 us/op 1.12
transfer serialized BlobsSidecar (524380 B) 3.5510 us/op 3.1640 us/op 1.12
copy serialized BlobsSidecar (524380 B) 76.830 us/op 111.86 us/op 0.69
transfer serialized Big SignedBeaconBlock (1000000 B) 3.2530 us/op 3.4050 us/op 0.96
copy serialized Big SignedBeaconBlock (1000000 B) 373.39 us/op 213.92 us/op 1.75
pass gossip attestations to forkchoice per slot 2.8215 ms/op 2.7562 ms/op 1.02
forkChoice updateHead vc 100000 bc 64 eq 0 415.27 us/op 373.41 us/op 1.11
forkChoice updateHead vc 600000 bc 64 eq 0 2.5460 ms/op 2.4783 ms/op 1.03
forkChoice updateHead vc 1000000 bc 64 eq 0 4.1336 ms/op 3.9976 ms/op 1.03
forkChoice updateHead vc 600000 bc 320 eq 0 2.4774 ms/op 2.6184 ms/op 0.95
forkChoice updateHead vc 600000 bc 1200 eq 0 2.5376 ms/op 2.6909 ms/op 0.94
forkChoice updateHead vc 600000 bc 7200 eq 0 3.0441 ms/op 2.7858 ms/op 1.09
forkChoice updateHead vc 600000 bc 64 eq 1000 9.8628 ms/op 9.5983 ms/op 1.03
forkChoice updateHead vc 600000 bc 64 eq 10000 9.7729 ms/op 9.5791 ms/op 1.02
forkChoice updateHead vc 600000 bc 64 eq 300000 11.621 ms/op 11.650 ms/op 1.00
computeDeltas 500000 validators 300 proto nodes 2.9023 ms/op 2.8879 ms/op 1.00
computeDeltas 500000 validators 1200 proto nodes 2.9413 ms/op 2.8895 ms/op 1.02
computeDeltas 500000 validators 7200 proto nodes 3.0262 ms/op 2.9670 ms/op 1.02
computeDeltas 750000 validators 300 proto nodes 4.3401 ms/op 4.4186 ms/op 0.98
computeDeltas 750000 validators 1200 proto nodes 4.2620 ms/op 4.4231 ms/op 0.96
computeDeltas 750000 validators 7200 proto nodes 4.4587 ms/op 4.3298 ms/op 1.03
computeDeltas 1400000 validators 300 proto nodes 8.0967 ms/op 8.1621 ms/op 0.99
computeDeltas 1400000 validators 1200 proto nodes 7.9743 ms/op 8.2419 ms/op 0.97
computeDeltas 1400000 validators 7200 proto nodes 7.7580 ms/op 8.2172 ms/op 0.94
computeDeltas 2100000 validators 300 proto nodes 12.239 ms/op 11.800 ms/op 1.04
computeDeltas 2100000 validators 1200 proto nodes 12.083 ms/op 12.169 ms/op 0.99
computeDeltas 2100000 validators 7200 proto nodes 11.782 ms/op 12.107 ms/op 0.97
altair processAttestation - 250000 vs - 7PWei normalcase 1.3263 ms/op 1.3827 ms/op 0.96
altair processAttestation - 250000 vs - 7PWei worstcase 2.0260 ms/op 2.1125 ms/op 0.96
altair processAttestation - setStatus - 1/6 committees join 63.388 us/op 62.779 us/op 1.01
altair processAttestation - setStatus - 1/3 committees join 120.66 us/op 125.61 us/op 0.96
altair processAttestation - setStatus - 1/2 committees join 184.80 us/op 185.85 us/op 0.99
altair processAttestation - setStatus - 2/3 committees join 244.10 us/op 240.21 us/op 1.02
altair processAttestation - setStatus - 4/5 committees join 357.10 us/op 362.56 us/op 0.98
altair processAttestation - setStatus - 100% committees join 423.79 us/op 462.03 us/op 0.92
altair processBlock - 250000 vs - 7PWei normalcase 3.2298 ms/op 4.1738 ms/op 0.77
altair processBlock - 250000 vs - 7PWei normalcase hashState 29.443 ms/op 26.617 ms/op 1.11
altair processBlock - 250000 vs - 7PWei worstcase 37.369 ms/op 29.701 ms/op 1.26
altair processBlock - 250000 vs - 7PWei worstcase hashState 66.746 ms/op 56.113 ms/op 1.19
phase0 processBlock - 250000 vs - 7PWei normalcase 1.4522 ms/op 1.3598 ms/op 1.07
phase0 processBlock - 250000 vs - 7PWei worstcase 22.223 ms/op 21.983 ms/op 1.01
altair processEth1Data - 250000 vs - 7PWei normalcase 245.63 us/op 250.28 us/op 0.98
getExpectedWithdrawals 250000 eb:1,eth1:1,we:0,wn:0,smpl:15 4.7270 us/op 5.1360 us/op 0.92
getExpectedWithdrawals 250000 eb:0.95,eth1:0.1,we:0.05,wn:0,smpl:219 18.389 us/op 18.976 us/op 0.97
getExpectedWithdrawals 250000 eb:0.95,eth1:0.3,we:0.05,wn:0,smpl:42 7.3840 us/op 6.8390 us/op 1.08
getExpectedWithdrawals 250000 eb:0.95,eth1:0.7,we:0.05,wn:0,smpl:18 2.9670 us/op 6.1510 us/op 0.48
getExpectedWithdrawals 250000 eb:0.1,eth1:0.1,we:0,wn:0,smpl:1020 81.155 us/op 82.634 us/op 0.98
getExpectedWithdrawals 250000 eb:0.03,eth1:0.03,we:0,wn:0,smpl:11777 502.60 us/op 755.61 us/op 0.67
getExpectedWithdrawals 250000 eb:0.01,eth1:0.01,we:0,wn:0,smpl:16384 1.2072 ms/op 1.1475 ms/op 1.05
getExpectedWithdrawals 250000 eb:0,eth1:0,we:0,wn:0,smpl:16384 691.48 us/op 690.80 us/op 1.00
getExpectedWithdrawals 250000 eb:0,eth1:0,we:0,wn:0,nocache,smpl:16384 2.0551 ms/op 2.0423 ms/op 1.01
getExpectedWithdrawals 250000 eb:0,eth1:1,we:0,wn:0,smpl:16384 1.1655 ms/op 1.2929 ms/op 0.90
getExpectedWithdrawals 250000 eb:0,eth1:1,we:0,wn:0,nocache,smpl:16384 3.0172 ms/op 2.9412 ms/op 1.03
Tree 40 250000 create 182.68 ms/op 181.58 ms/op 1.01
Tree 40 250000 get(125000) 115.03 ns/op 117.83 ns/op 0.98
Tree 40 250000 set(125000) 544.35 ns/op 566.44 ns/op 0.96
Tree 40 250000 toArray() 10.344 ms/op 10.764 ms/op 0.96
Tree 40 250000 iterate all - toArray() + loop 10.142 ms/op 9.9010 ms/op 1.02
Tree 40 250000 iterate all - get(i) 43.490 ms/op 40.617 ms/op 1.07
MutableVector 250000 create 9.2372 ms/op 8.3181 ms/op 1.11
MutableVector 250000 get(125000) 5.6910 ns/op 5.7150 ns/op 1.00
MutableVector 250000 set(125000) 175.95 ns/op 150.62 ns/op 1.17
MutableVector 250000 toArray() 4.5161 ms/op 3.1379 ms/op 1.44
MutableVector 250000 iterate all - toArray() + loop 3.5159 ms/op 3.2909 ms/op 1.07
MutableVector 250000 iterate all - get(i) 1.5984 ms/op 1.3173 ms/op 1.21
Array 250000 create 3.3313 ms/op 2.4534 ms/op 1.36
Array 250000 clone - spread 1.4846 ms/op 1.2787 ms/op 1.16
Array 250000 get(125000) 0.59200 ns/op 0.57700 ns/op 1.03
Array 250000 set(125000) 0.60400 ns/op 0.58600 ns/op 1.03
Array 250000 iterate all - loop 78.997 us/op 76.539 us/op 1.03
effectiveBalanceIncrements clone Uint8Array 300000 17.040 us/op 14.037 us/op 1.21
effectiveBalanceIncrements clone MutableVector 300000 322.00 ns/op 309.00 ns/op 1.04
effectiveBalanceIncrements rw all Uint8Array 300000 170.90 us/op 166.02 us/op 1.03
effectiveBalanceIncrements rw all MutableVector 300000 60.340 ms/op 57.505 ms/op 1.05
phase0 afterProcessEpoch - 250000 vs - 7PWei 78.498 ms/op 75.612 ms/op 1.04
Array.fill - length 1000000 2.7812 ms/op 2.6632 ms/op 1.04
Array push - length 1000000 16.099 ms/op 14.748 ms/op 1.09
Array.get 0.26450 ns/op 0.26246 ns/op 1.01
Uint8Array.get 0.35147 ns/op 0.34034 ns/op 1.03
phase0 beforeProcessEpoch - 250000 vs - 7PWei 15.527 ms/op 14.922 ms/op 1.04
altair processEpoch - mainnet_e81889 341.22 ms/op 283.33 ms/op 1.20
mainnet_e81889 - altair beforeProcessEpoch 21.786 ms/op 18.691 ms/op 1.17
mainnet_e81889 - altair processJustificationAndFinalization 10.779 us/op 10.851 us/op 0.99
mainnet_e81889 - altair processInactivityUpdates 5.9546 ms/op 4.2819 ms/op 1.39
mainnet_e81889 - altair processRewardsAndPenalties 50.187 ms/op 55.831 ms/op 0.90
mainnet_e81889 - altair processRegistryUpdates 2.0740 us/op 2.1690 us/op 0.96
mainnet_e81889 - altair processSlashings 898.00 ns/op 712.00 ns/op 1.26
mainnet_e81889 - altair processEth1DataReset 769.00 ns/op 767.00 ns/op 1.00
mainnet_e81889 - altair processEffectiveBalanceUpdates 803.38 us/op 1.4627 ms/op 0.55
mainnet_e81889 - altair processSlashingsReset 1.3490 us/op 2.5690 us/op 0.53
mainnet_e81889 - altair processRandaoMixesReset 2.4240 us/op 3.3360 us/op 0.73
mainnet_e81889 - altair processHistoricalRootsUpdate 1.7670 us/op 874.00 ns/op 2.02
mainnet_e81889 - altair processParticipationFlagUpdates 2.6020 us/op 1.7670 us/op 1.47
mainnet_e81889 - altair processSyncCommitteeUpdates 770.00 ns/op 562.00 ns/op 1.37
mainnet_e81889 - altair afterProcessEpoch 78.355 ms/op 75.859 ms/op 1.03
capella processEpoch - mainnet_e217614 946.49 ms/op 1.2686 s/op 0.75
mainnet_e217614 - capella beforeProcessEpoch 72.943 ms/op 67.173 ms/op 1.09
mainnet_e217614 - capella processJustificationAndFinalization 20.829 us/op 12.228 us/op 1.70
mainnet_e217614 - capella processInactivityUpdates 14.422 ms/op 14.721 ms/op 0.98
mainnet_e217614 - capella processRewardsAndPenalties 243.07 ms/op 230.64 ms/op 1.05
mainnet_e217614 - capella processRegistryUpdates 14.294 us/op 11.349 us/op 1.26
mainnet_e217614 - capella processSlashings 909.00 ns/op 753.00 ns/op 1.21
mainnet_e217614 - capella processEth1DataReset 766.00 ns/op 735.00 ns/op 1.04
mainnet_e217614 - capella processEffectiveBalanceUpdates 16.087 ms/op 4.9544 ms/op 3.25
mainnet_e217614 - capella processSlashingsReset 4.8200 us/op 1.8150 us/op 2.66
mainnet_e217614 - capella processRandaoMixesReset 6.0550 us/op 3.1580 us/op 1.92
mainnet_e217614 - capella processHistoricalRootsUpdate 835.00 ns/op 798.00 ns/op 1.05
mainnet_e217614 - capella processParticipationFlagUpdates 1.6090 us/op 4.6230 us/op 0.35
mainnet_e217614 - capella afterProcessEpoch 249.89 ms/op 222.96 ms/op 1.12
phase0 processEpoch - mainnet_e58758 379.02 ms/op 381.52 ms/op 0.99
mainnet_e58758 - phase0 beforeProcessEpoch 98.788 ms/op 86.314 ms/op 1.14
mainnet_e58758 - phase0 processJustificationAndFinalization 25.507 us/op 12.350 us/op 2.07
mainnet_e58758 - phase0 processRewardsAndPenalties 21.713 ms/op 30.672 ms/op 0.71
mainnet_e58758 - phase0 processRegistryUpdates 8.8590 us/op 6.7100 us/op 1.32
mainnet_e58758 - phase0 processSlashings 1.0570 us/op 738.00 ns/op 1.43
mainnet_e58758 - phase0 processEth1DataReset 758.00 ns/op 675.00 ns/op 1.12
mainnet_e58758 - phase0 processEffectiveBalanceUpdates 733.81 us/op 1.1030 ms/op 0.67
mainnet_e58758 - phase0 processSlashingsReset 2.9540 us/op 2.2970 us/op 1.29
mainnet_e58758 - phase0 processRandaoMixesReset 6.0260 us/op 3.4210 us/op 1.76
mainnet_e58758 - phase0 processHistoricalRootsUpdate 885.00 ns/op 663.00 ns/op 1.33
mainnet_e58758 - phase0 processParticipationRecordUpdates 4.2090 us/op 2.5140 us/op 1.67
mainnet_e58758 - phase0 afterProcessEpoch 69.625 ms/op 64.703 ms/op 1.08
phase0 processEffectiveBalanceUpdates - 250000 normalcase 791.31 us/op 765.39 us/op 1.03
phase0 processEffectiveBalanceUpdates - 250000 worstcase 0.5 1.3919 ms/op 1.1902 ms/op 1.17
altair processInactivityUpdates - 250000 normalcase 15.623 ms/op 16.962 ms/op 0.92
altair processInactivityUpdates - 250000 worstcase 15.989 ms/op 17.982 ms/op 0.89
phase0 processRegistryUpdates - 250000 normalcase 6.8720 us/op 5.1210 us/op 1.34
phase0 processRegistryUpdates - 250000 badcase_full_deposits 301.55 us/op 297.65 us/op 1.01
phase0 processRegistryUpdates - 250000 worstcase 0.5 109.39 ms/op 112.09 ms/op 0.98
altair processRewardsAndPenalties - 250000 normalcase 40.380 ms/op 31.379 ms/op 1.29
altair processRewardsAndPenalties - 250000 worstcase 41.583 ms/op 32.889 ms/op 1.26
phase0 getAttestationDeltas - 250000 normalcase 7.8181 ms/op 5.9886 ms/op 1.31
phase0 getAttestationDeltas - 250000 worstcase 8.4994 ms/op 5.7657 ms/op 1.47
phase0 processSlashings - 250000 worstcase 92.173 us/op 92.960 us/op 0.99
altair processSyncCommitteeUpdates - 250000 95.987 ms/op 104.59 ms/op 0.92
BeaconState.hashTreeRoot - No change 503.00 ns/op 506.00 ns/op 0.99
BeaconState.hashTreeRoot - 1 full validator 82.992 us/op 81.577 us/op 1.02
BeaconState.hashTreeRoot - 32 full validator 750.35 us/op 1.0764 ms/op 0.70
BeaconState.hashTreeRoot - 512 full validator 8.5602 ms/op 11.748 ms/op 0.73
BeaconState.hashTreeRoot - 1 validator.effectiveBalance 102.52 us/op 151.23 us/op 0.68
BeaconState.hashTreeRoot - 32 validator.effectiveBalance 1.6919 ms/op 2.0283 ms/op 0.83
BeaconState.hashTreeRoot - 512 validator.effectiveBalance 15.450 ms/op 24.255 ms/op 0.64
BeaconState.hashTreeRoot - 1 balances 72.886 us/op 115.50 us/op 0.63
BeaconState.hashTreeRoot - 32 balances 608.23 us/op 788.79 us/op 0.77
BeaconState.hashTreeRoot - 512 balances 5.8359 ms/op 6.4934 ms/op 0.90
BeaconState.hashTreeRoot - 250000 balances 119.70 ms/op 161.31 ms/op 0.74
aggregationBits - 2048 els - zipIndexesInBitList 23.874 us/op 22.017 us/op 1.08
byteArrayEquals 32 45.954 ns/op 46.234 ns/op 0.99
Buffer.compare 32 15.093 ns/op 15.582 ns/op 0.97
byteArrayEquals 1024 1.2462 us/op 1.2611 us/op 0.99
Buffer.compare 1024 23.215 ns/op 24.430 ns/op 0.95
byteArrayEquals 16384 19.630 us/op 19.971 us/op 0.98
Buffer.compare 16384 179.93 ns/op 202.45 ns/op 0.89
byteArrayEquals 123687377 150.75 ms/op 144.63 ms/op 1.04
Buffer.compare 123687377 6.8525 ms/op 5.3296 ms/op 1.29
byteArrayEquals 32 - diff last byte 48.941 ns/op 45.049 ns/op 1.09
Buffer.compare 32 - diff last byte 17.451 ns/op 14.810 ns/op 1.18
byteArrayEquals 1024 - diff last byte 1.3381 us/op 1.1928 us/op 1.12
Buffer.compare 1024 - diff last byte 22.902 ns/op 22.644 ns/op 1.01
byteArrayEquals 16384 - diff last byte 20.597 us/op 18.962 us/op 1.09
Buffer.compare 16384 - diff last byte 212.01 ns/op 158.34 ns/op 1.34
byteArrayEquals 123687377 - diff last byte 153.86 ms/op 150.15 ms/op 1.02
Buffer.compare 123687377 - diff last byte 7.0141 ms/op 3.8713 ms/op 1.81
byteArrayEquals 32 - random bytes 5.1030 ns/op 4.8720 ns/op 1.05
Buffer.compare 32 - random bytes 15.709 ns/op 16.258 ns/op 0.97
byteArrayEquals 1024 - random bytes 5.7370 ns/op 5.1000 ns/op 1.12
Buffer.compare 1024 - random bytes 17.493 ns/op 16.210 ns/op 1.08
byteArrayEquals 16384 - random bytes 5.1920 ns/op 4.8800 ns/op 1.06
Buffer.compare 16384 - random bytes 15.788 ns/op 16.665 ns/op 0.95
byteArrayEquals 123687377 - random bytes 8.0700 ns/op 7.7200 ns/op 1.05
Buffer.compare 123687377 - random bytes 20.710 ns/op 20.010 ns/op 1.03
regular array get 100000 times 32.489 us/op 30.694 us/op 1.06
wrappedArray get 100000 times 31.246 us/op 30.678 us/op 1.02
arrayWithProxy get 100000 times 9.5213 ms/op 9.7377 ms/op 0.98
ssz.Root.equals 43.352 ns/op 39.194 ns/op 1.11
byteArrayEquals 41.494 ns/op 42.095 ns/op 0.99
Buffer.compare 9.6730 ns/op 9.6560 ns/op 1.00
shuffle list - 16384 els 5.8532 ms/op 5.4517 ms/op 1.07
shuffle list - 250000 els 88.684 ms/op 78.043 ms/op 1.14
processSlot - 1 slots 19.383 us/op 11.670 us/op 1.66
processSlot - 32 slots 2.4418 ms/op 2.7631 ms/op 0.88
getEffectiveBalanceIncrementsZeroInactive - 250000 vs - 7PWei 39.352 ms/op 35.956 ms/op 1.09
getCommitteeAssignments - req 1 vs - 250000 vc 1.8453 ms/op 1.7160 ms/op 1.08
getCommitteeAssignments - req 100 vs - 250000 vc 3.5834 ms/op 3.3566 ms/op 1.07
getCommitteeAssignments - req 1000 vs - 250000 vc 3.8188 ms/op 3.6364 ms/op 1.05
findModifiedValidators - 10000 modified validators 244.84 ms/op 200.41 ms/op 1.22
findModifiedValidators - 1000 modified validators 188.34 ms/op 137.79 ms/op 1.37
findModifiedValidators - 100 modified validators 186.27 ms/op 140.48 ms/op 1.33
findModifiedValidators - 10 modified validators 160.42 ms/op 140.23 ms/op 1.14
findModifiedValidators - 1 modified validators 165.42 ms/op 126.39 ms/op 1.31
findModifiedValidators - no difference 178.35 ms/op 131.85 ms/op 1.35
compare ViewDUs 3.7375 s/op 2.9247 s/op 1.28
compare each validator Uint8Array 1.4058 s/op 1.8023 s/op 0.78
compare ViewDU to Uint8Array 844.49 ms/op 668.35 ms/op 1.26
migrate state 1000000 validators, 24 modified, 0 new 572.20 ms/op 508.84 ms/op 1.12
migrate state 1000000 validators, 1700 modified, 1000 new 791.01 ms/op 783.50 ms/op 1.01
migrate state 1000000 validators, 3400 modified, 2000 new 1.0368 s/op 931.89 ms/op 1.11
migrate state 1500000 validators, 24 modified, 0 new 604.33 ms/op 598.57 ms/op 1.01
migrate state 1500000 validators, 1700 modified, 1000 new 907.02 ms/op 803.58 ms/op 1.13
migrate state 1500000 validators, 3400 modified, 2000 new 1.0288 s/op 962.03 ms/op 1.07
RootCache.getBlockRootAtSlot - 250000 vs - 7PWei 6.8000 ns/op 6.1200 ns/op 1.11
state getBlockRootAtSlot - 250000 vs - 7PWei 925.63 ns/op 807.16 ns/op 1.15
computeProposers - vc 250000 7.9787 ms/op 6.1965 ms/op 1.29
computeEpochShuffling - vc 250000 90.126 ms/op 83.202 ms/op 1.08
getNextSyncCommittee - vc 250000 102.70 ms/op 102.60 ms/op 1.00
computeSigningRoot for AttestationData 22.026 us/op 20.535 us/op 1.07
hash AttestationData serialized data then Buffer.toString(base64) 1.2168 us/op 1.1673 us/op 1.04
toHexString serialized data 986.85 ns/op 744.36 ns/op 1.33
Buffer.toString(base64) 163.21 ns/op 136.32 ns/op 1.20

by benchmarkbot/action

@wemeetagain (Member) left a comment

The approach here looks good.
I wonder if it can or should be tied to the beacon health call in some way.

packages/validator/src/services/syncingStatusTracker.ts (two review threads on outdated code, resolved)
this.logger.debug("Connected beacon node is synced", {slot, ...syncingStatus});
}
} catch (e) {
this.logger.error("Failed to check syncing status", {}, e as Error);
Member commented:

What's the UX of this (and can it be improved)? Most of the time we'll get two logs here, right? Both the health log and this one?

nflaig (Member Author) commented:

We shouldn't have two logs, that seems wrong; I need to review.

I think a good UX would be

  • verbose: once per slot, print sync status if isSyncing=false
  • warn: once per slot if isSyncing=true
  • error: once per slot if checking the status failed (e.g. due to the node being offline)

but in any case, just one of those, never multiple logs per slot

nflaig (Member Author) commented:

Both the health log and this one?

Ah, you mean a log related to polling the health status API? There is no such log as far as I know.

nflaig (Member Author) commented Aug 1, 2024:

The remaining UX question for me right now is how to deal with the existing "Node is syncing" logs which are printed due to other API failures, like polling duties, and are already throttled to once per slot.

I aligned the logs to be somewhat similar (screenshot omitted), but it might look a bit redundant because getProposerDuties is called every slot right now. I am still not sure if removing those altogether is a good idea, because they still provide information about polling failures and include the error message from the beacon node.

nflaig (Member Author) commented Aug 2, 2024:

I think the UX is clean now (roughly as sketched below):

  • info: log to inform the user about the sync status just once, on startup or on resynced
  • warn: every slot if not synced, with details about head slot / sync distance
  • verbose: dump the node syncing status every slot, including all data retrieved from the API

There might be additional "Node is syncing" warnings due to duty polling attempts running into 503 errors, but those are slightly different in nature, as (1) they include error details (hence important) and (2) they depend on the lifecycle of imported keys and might not even fire. I think it's good to inform the user about the node's sync status in any case (think of a pending validator).
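
A hedged sketch of that logging scheme (not the exact PR code; the logger interface and field names are assumptions):

type SyncStatus = {headSlot: string; syncDistance: string; isSyncing: boolean};
type Logger = {
  info: (msg: string, ctx?: object) => void;
  warn: (msg: string, ctx?: object) => void;
  verbose: (msg: string, ctx?: object) => void;
};

function logSyncingStatus(logger: Logger, slot: number, status: SyncStatus, prevWasSynced: boolean): void {
  if (!status.isSyncing) {
    // info only once, on startup or when the node transitions back to synced
    if (!prevWasSynced) {
      logger.info("Connected beacon node is synced", {slot, headSlot: status.headSlot});
    }
  } else {
    // warn every slot while not synced, with head slot / sync distance details
    logger.warn("Connected beacon node is syncing", {slot, headSlot: status.headSlot, syncDistance: status.syncDistance});
  }
  // verbose dump of the full syncing status every slot
  logger.verbose("Node syncing status", {slot, ...status});
}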

nflaig (Member Author) commented:

We could further consider the case where the beacon node is synced but the connected EL client is syncing or offline, and emit a warning every slot. That might be out of scope, though, since the status tracker service is primarily designed to trigger duties polling.

@nflaig (Member Author) commented Aug 1, 2024

I wonder if it can or should be tied to the beacon health call in some way.

The syncing status call is more useful than the health status call, as it provides more detailed info instead of just a status code. Honestly, we could consider removing the health status calls completely: the metric is unused, and to observe the health status of all nodes the approach of using the URL score seems better to me; we already display this information on the validator client dashboard (#6415).

The health status API is mostly meant to be used by tooling like liveness/readiness probes in k8s or other health monitoring systems.

@nflaig changed the title from "feat: track syncing status and fetch duties on synced" to "feat: track syncing status and fetch duties on resynced" on Aug 1, 2024
@wemeetagain (Member) commented:

Honestly we could consider removing the health status calls completely

Would be in favor of that. It would also resolve the UX concern mentioned above.

@nflaig (Member Author) commented Aug 1, 2024

Honestly we could consider removing the health status calls completely

Would be in favor of that. It would also resolve the UX concern mentioned above.

Do we care about removing the metric, is anybody using it? I can't tell, but likely not. I can migrate the logic that sets the metric to the syncing status tracker; it has all the information needed to set it as before.

@wemeetagain (Member) commented:

Do we care about removing the metric, is anybody using it?

Afaik it's only used on the beacon side, not on the validator side (because of complications wrt maintaining the state of backup beacon nodes). Would be in favor of removal in that case.

@nflaig (Member Author) commented Aug 1, 2024

Do we care about removing the metric, is anybody using it?

Afaik it's only used on the beacon side, not on the validator side (because of complications wrt maintaining the state of backup beacon nodes). Would be in favor of removal in that case.

I wonder if the infra team uses it for alerts, @Faithtosin? It's mentioned in https://github.com/ChainSafe/lodestar-ansible-development/issues/57, and Lion tagged #4637 as a high-priority issue, but I fail to see how it was ever used, other than manually by exploring metrics via Grafana or /metrics.

@Faithtosin (Contributor) commented:

Do we care about removing the metric, is anybody using it?

Afaik it's only used on the beacon side, not on the validator side (because of complications wrt maintaining the state of backup beacon nodes). Would be in favor of removal in that case.

I wonder if the infra team uses it for alerts, @Faithtosin? It's mentioned in ChainSafe/lodestar-ansible-development#57, and Lion tagged #4637 as a high-priority issue, but I fail to see how it was ever used, other than manually by exploring metrics via Grafana or /metrics.

@nflaig the vc_beacon_health metric is not used in any of our internal alerting systems.

@wemeetagain (Member) left a comment

Looks good. Will you remove the health check + metric?

@nflaig (Member Author) commented Aug 2, 2024

Looks good. Will you remove the health check + metric?

I have some stuff locally that I need to push (mostly tests). I was thinking about moving the metric to the syncing status tracker, as we already have the data there, and we could add this to the validator client dashboard; if we remove the metric, there is no way to tell from metrics whether a node is syncing.

@nflaig marked this pull request as ready for review on August 2, 2024 21:31
@nflaig (Member Author) commented Aug 2, 2024

Looks good. Will you remove the health check + metric?

Added this panel to the validator client dashboard now. I guess this was the original idea behind #4939, and it could be used for alerting if the infra team is interested (cc @Faithtosin). It might be useful to detect if a validator client is only connected to a beacon node that is not ready to fulfill duties. But it is only helpful if there are no active keys yet on the validator client; otherwise, failed duties should probably be preferred as an alert trigger.

@wemeetagain merged commit 3d2c2b3 into unstable on Aug 2, 2024
20 checks passed
@wemeetagain deleted the nflaig/duties-on-synced branch on August 2, 2024 22:02
@wemeetagain (Member) commented:

🎉 This PR is included in v1.21.0 🎉

philknows pushed a commit that referenced this pull request Sep 3, 2024
* feat: track syncing status and fetch duties on synced

* Rename scheduling function to runOnResynced

* Consider prev offline and syncing to trigger resynced event handlers

* Add comment to error handler

* Add note about EL offline and syncing not considered

* Align syncing status logs with existing node is syncing logs

* Cleanup

* Add ssz support to syncing status api

* Align beacon node code to return proper types

* Keep track of error in prev syncing status

* Print slot in error log

* Skip on first slot of epoch since tasks are already scheduled

* Update api test data

* Fix endpoint tests

* await scheduled tasks, mostly relevant for testing

* Add unit tests

* Move beacon health metric to syncing status tracker

* Add beacon health panel to validator client dashboard

* Formatting

* Improve info called once assertion

* Reset mocks after each test
Successfully merging this pull request may close these issues.

Validator process checks beacon sync state too infrequently