Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Lodestar CL < > Nimbus VC incompatibility #6634

Closed
barnabasbusa opened this issue Apr 5, 2024 · 14 comments · Fixed by #6749
Closed

Lodestar CL < > Nimbus VC incompatibility #6634

barnabasbusa opened this issue Apr 5, 2024 · 14 comments · Fixed by #6749
Labels
meta-bug Issues that identify a bug and require a fix. meta-investigate Issues found that require further investigation and may not have a specific resolution/fix scope-interop Issues that fix interop issues between Lodestar and CL, EL or tooling.

Comments

@barnabasbusa
Copy link
Contributor

barnabasbusa commented Apr 5, 2024

Describe the bug

We are in the process to test cross beacon <> validator client compatibility, and found a bug when testing lodestar CL with Nimbus VC.

Nimbus reports:

INF 2024-04-05 14:24:30.001+00:00 Slot start                                 slot=26 epoch=0 attestationIn=16s blockIn= validators=64 node_status=synced delay=1ms448us381ns
ERR 2024-04-05 14:24:30.072+00:00 Beacon node provides unexpected response   reason="Serialization error;200;produceBlockV3(best);unexpected-data" node=http://172.16.0.31:8562[Lodestar/v1.17.0/898cd90] node_index=0 node_roles=AGBSDT
WRN 2024-04-05 14:24:30.072+00:00 Unable to retrieve block data              reason="Serialization error;200;produceBlockV3(best);unexpected-data" wall_slot=26 validator_index=85 service=block_service slot=26 validator=b0b263c2@85
NTC 2024-04-05 14:24:30.072+00:00 Block production using V3 failed, trying V2 wall_slot=26 validator_index=85 service=block_service slot=26 validator=b0b263c2@85
NTC 2024-04-05 14:24:30.863+00:00 Beacon node is online                      agent_version=Lodestar/v1.17.0/898cd90 node=http://172.16.0.31:8562[Lodestar/v1.17.0/898cd90] node_index=0 node_roles=AGBSDT
NTC 2024-04-05 14:24:30.866+00:00 Beacon node is compatible                  node=http://172.16.0.31:8562[Lodestar/v1.17.0/898cd90] node_index=0 node_roles=AGBSDT
NTC 2024-04-05 14:24:30.868+00:00 Beacon node is in sync                     head_slot=23 sync_distance=0 is_optimistic=false node=http://172.16.0.31:8562[Lodestar/v1.17.0/898cd90] node_index=0 node_roles=AGBSDT
ERR 2024-04-05 14:24:30.882+00:00 Beacon node provides unexpected response   reason="Serialization error;200;produceBlockV2(best);unexpected-data" node=http://172.16.0.31:8562[Lodestar/v1.17.0/898cd90] node_index=0 node_roles=AGBSDT
WRN 2024-04-05 14:24:30.882+00:00 Unable to retrieve block data              reason="Serialization error;200;produceBlockV2(best);unexpected-data" wall_slot=26 service=block_service slot=26 validator=b0b263c2@85
NTC 2024-04-05 14:24:32.873+00:00 Beacon node is online                      agent_version=Lodestar/v1.17.0/898cd90 node=http://172.16.0.31:8562[Lodestar/v1.17.0/898cd90] node_index=0 node_roles=AGBSDT
NTC 2024-04-05 14:24:32.878+00:00 Beacon node is compatible                  node=http://172.16.0.31:8562[Lodestar/v1.17.0/898cd90] node_index=0 node_roles=AGBSDT
NTC 2024-04-05 14:24:32.881+00:00 Beacon node is in sync                     head_slot=23 sync_distance=0 is_optimistic=false node=http://172.16.0.31:8562[Lodestar/v1.17.0/898cd90] node_index=0 node_roles=AGBSDT
NTC 2024-04-05 14:24:34.231+00:00 Sync committee message published           validator_index=98 delay=231ms206us633ns duration=217ms69us375ns message="(slot: 26, beacon_block_root: \"581b9a2f\", validator_index: 98, signature: \"814ebb83\")" service=sync_committee_service slot=26 validator=a9fdcd17@98 block_root=581b9a2f

Lodestar reports:

Apr-05 14:20:00.472[rest]            error: Req req-r produceBlockV2 error - REGEN_ERROR_SLOT_BEFORE_BLOCK_SLOT
Error: REGEN_ERROR_SLOT_BEFORE_BLOCK_SLOT
    at StateRegenerator.getBlockSlotState (file:///usr/app/packages/beacon-node/src/chain/regen/regen.ts:114:13)
    at JobItemQueue.QueuedStateRegenerator.jobQueueProcessor [as itemProcessor] (file:///usr/app/packages/beacon-node/src/chain/regen/queued.ts:288:35)
    at Timeout.JobItemQueue.runJob [as _onTimeout] (file:///usr/app/packages/beacon-node/src/util/queue/itemQueue.ts:101:33)
    at listOnTimeout (node:internal/timers:573:17)
    at processTimers (node:internal/timers:514:7)
Apr-05 14:20:12.000[]                 info: Synced - slot: 4 - head: 0x9cc3…383b - exec-block: valid(2 0xf3b5…) - finalized: 0x0000…0000:0 - peers: 5
Apr-05 14:20:18.027[api]              info: Selected engine block: no builder block produced fork=deneb, builderSelection=maxprofit, slot=5, isBuilderEnabled=false, isEngineEnabled=true, strictFeeRecipientCheck=undefined, builderBoostFactor=100, durationMs=11, executionPayloadValue=0.00000 ETH, consensusBlockValue=0.00087 ETH, blockTotalValue=0.00087 ETH
Apr-05 14:20:24.004[]                 info: Synced - slot: 5 - head: (slot -1) 0x9cc3…383b - exec-block: valid(2 0xf3b5…) - finalized: 0x0000…0000:0 - peers: 5
Apr-05 14:20:26.014[rest]            error: Req req-5p getAggregatedAttestation error - No attestation for slot=5 dataRoot=0xd947df35713911cac72a977556a795d217eefc7a66bfded54347e18469c675dd
Error: No attestation for slot=5 dataRoot=0xd947df35713911cac72a977556a795d217eefc7a66bfded54347e18469c675dd
    at AttestationPool.getAggregate (file:///usr/app/packages/beacon-node/src/chain/opPools/attestationPool.ts:136:13)
    at Object.getAggregatedAttestation (file:///usr/app/packages/beacon-node/src/api/impl/validator/index.ts:1070:47)
    at processTicksAndRejections (node:internal/process/task_queues:95:5)
    at Object.handler (file:///usr/app/packages/api/src/utils/server/genericJsonServer.ts:45:23)

Lodestar seems to be able to submit sync committee signatures, but not block proposals

Snooper between cl <> vc reports:

Nothing out of the ordinary

Expected behavior

I would expect all client combinations would work.

Steps to reproduce

config.yaml:

participants:
  - cl_type: lodestar
    cl_image: chainsafe/lodestar:next
    vc_type: nimbus
additional_services:
  - dora
snooper_enabled: true
global_log_level: debug

kurtosis run github.com/kurtosis-tech/ethereum-package --args-file config.yaml

Additional context

Current BN <> VC Compatibility list tracker

Screenshot 2024-04-05 at 14 08 53

Operating system

Linux

Lodestar version or commit hash

Version: v1.17.0/898cd90

cc: @pk910

@barnabasbusa barnabasbusa added the meta-bug Issues that identify a bug and require a fix. label Apr 5, 2024
@nflaig nflaig added scope-interop Issues that fix interop issues between Lodestar and CL, EL or tooling. meta-investigate Issues found that require further investigation and may not have a specific resolution/fix labels Apr 5, 2024
@barnabasbusa
Copy link
Contributor Author

Nimbus reports doppelganger detection:

NTC 2024-04-05 14:40:50.011+00:00 Doppelganger detection active - skipping validator duties while observing the network topics="val_pool" validator=867e8956 slot=107 doppelCheck=none() activationEpoch=0
INF 2024-04-05 14:40:54.000+00:00 Slot start                                 slot=108 epoch=3 attestationIn=4s blockIn=1m validators=64 node_status=synced delay=171us45ns

@tersec
Copy link

tersec commented Apr 5, 2024

Is this just the doppelganger? That's intentional. Does --doppelganger-detection=off work?

@tersec
Copy link

tersec commented Apr 5, 2024

Nimbus doesn't use doppelganger protection on sync committees because they're not slashable.

@nflaig
Copy link
Member

nflaig commented Apr 5, 2024

@tersec regarding the getAggregate error, also reported in #6631, is Nimbus by any chance requesting an aggregate attestation even if no validator is aggregator? We had the same issue with Lighthouse previously and it was fixed on their end sigp/lighthouse#4712

See #6634 (comment)

@barnabasbusa
Copy link
Contributor Author

even with --doppelganger-detection=off I'm getting:

ERR 2024-04-05 18:03:31.097+00:00 Beacon node provides unexpected response   reason="Serialization error;200;produceBlockV3(best);unexpected-data" node=http://172.16.0.31:8562[Lodestar/v1.17.0/f2ec0d4] node_index=0 node_roles=AGBSDT
WRN 2024-04-05 18:03:31.097+00:00 Unable to retrieve block data              reason="Serialization error;200;produceBlockV3(best);unexpected-data" wall_slot=28 validator_index=78 service=block_service slot=28 validator=8890da28@78

@nflaig
Copy link
Member

nflaig commented Apr 6, 2024

even with --doppelganger-detection=off I'm getting:

ERR 2024-04-05 18:03:31.097+00:00 Beacon node provides unexpected response   reason="Serialization error;200;produceBlockV3(best);unexpected-data" node=http://172.16.0.31:8562[Lodestar/v1.17.0/f2ec0d4] node_index=0 node_roles=AGBSDT
WRN 2024-04-05 18:03:31.097+00:00 Unable to retrieve block data              reason="Serialization error;200;produceBlockV3(best);unexpected-data" wall_slot=28 validator_index=78 service=block_service slot=28 validator=8890da28@78

The "unexpected data" might be because we are returning execution_payload_source metadata field in the response as we allow to produce a blinded local block but this is field is not officially part of the block v3 api and was only proposed in ethereum/beacon-APIs#387 but not included.

See #6634 (comment)

@nflaig
Copy link
Member

nflaig commented Apr 8, 2024

The "unexpected data" might be because we are returning execution_payload_source metadata field in the response

After reviewing how Nimbus handles this case it does not seem to be the issue as it only uses metadata provided in headers which Lodestar is setting and only parses the response.data

beacon_chain/validator_client/api.nim#L2145

res = decodeBytes(ProduceBlockResponseV3, response.data, response.contentType, version, blinded, executionValue, consensusValue)

And even there they seem to allow unknown fields

beacon_chain/spec/eth2_apis/eth2_rest_serialization.nim#L3950-L3952

ok(RestJson.decode(value, T, requireAllFields = true, allowUnknownFields = true))

Based on just reviewing Lodestar and Nimbus code it's hard to tell what's the actual issue, would need to inspect the data we are sending the further investigate this

@nflaig
Copy link
Member

nflaig commented Apr 8, 2024

@tersec regarding the getAggregate error, also reported in #6631, is Nimbus by any chance requesting an aggregate attestation even if no validator is aggregator? We had the same issue with Lighthouse previously and it was fixed on their end sigp/lighthouse#4712

This is not the case, Nimbus will only request an aggregated attestation if at least one validator is aggregator

beacon_chain/validator_client/attestation_service.nim#L280

if len(aggregateItems) > 0:

But I think I know what the issue here is, Nimbus uses different strategies per API, and if I interpreted them correctly this is why we are seeing the error:

The submitPoolAttestations is only sent to the first / primary beacon node (unless an error is encountered)

beacon_chain/validator_client/attestation_service.nim#L66

await vc.submitPoolAttestations(@[attestation], ApiStrategyKind.First)

While the getAggregatedAttestation is sent to all connected beacon nodes and the "best" response is picked

beacon_chain/validator_client/attestation_service.nim#L283-L284

await vc.getAggregatedAttestation(slot, attestationRoot, ApiStrategyKind.Best)

But Lodestar might not have an aggregated attestation in it's cache if the validator client did not previously submit an attestation for attestation_data_root. Although this error might be not happen consistently as the beacon node might prepare the aggregate due to receiving the attestation via gossip (gossipHandlers.ts#L490-L492).

We have just improved the error handling if there is no aggregated attestation available #6648 to follow spec and produce a less nosiy error as this is somewhat expected to happen in a setup with multiple nodes.

If my analysis is correct, this is also not an issue as Nimbus would still produce the aggregate attestation if it received the data from the primary node.

@barnabasbusa this should not happen in a 1:1 setup without fallback node, I am assuming in your tests you had multiple bns connected to a single vc

@barnabasbusa
Copy link
Contributor Author

@nflaig We have 1:1 setups only. We haven't considered any testing for 1:x testing.

@nflaig
Copy link
Member

nflaig commented Apr 8, 2024

@nflaig We have 1:1 setups only. We haven't considered any testing for 1:x testing.

Then I have no idea why this happens but something to note is that there is also a very strange error which should never happen

Apr-05 14:20:00.472[rest]            error: Req req-r produceBlockV2 error - REGEN_ERROR_SLOT_BEFORE_BLOCK_SLOT
Error: REGEN_ERROR_SLOT_BEFORE_BLOCK_SLOT

So maybe there was just a clock issue / time skew between the validator client and beacon node?

@barnabasbusa
Copy link
Contributor Author

All tests are ran on the same (docker backend) machine, and everyone should be able to replicate it with the config I have posted above.
Doubt that there would be any sort of clock issues between two containers hosted on the same physical node.

@nflaig
Copy link
Member

nflaig commented Apr 8, 2024

and everyone should be able to replicate it with the config I have posted above.

Will have to do this next, thanks for the detailed infos

@nflaig
Copy link
Member

nflaig commented Apr 16, 2024

Apr-05 14:20:26.014[rest] error: Req req-5p getAggregatedAttestation error - No attestation for slot=5 dataRoot=0xd947df35713911cac72a977556a795d217eefc7a66bfded54347e18469c675dd
Error: No attestation for slot=5 dataRoot=0xd947df35713911cac72a977556a795d217eefc7a66bfded54347e18469c675dd

The aggregated attestation issue has been fixed by #6668. There still seems to be strange behavior by Nimbus VC that it requests an aggregated attestation for the first 1-5 slots even though it does not submit subnet subscriptions with is_aggregator=true for those slots beforehand. But other than that looks good, attestations, aggregates, and sync committee works as expected.

The only remaining issue is block production, my best guess right now is that Nimbus does not like that Lodestar returns a JSON payload from produceBlockV3 API.

@nflaig
Copy link
Member

nflaig commented May 29, 2024

The only remaining issue is block production, my best guess right now is that Nimbus does not like that Lodestar returns a JSON payload from produceBlockV3 API.

Confirmed this issue is related us return a JSON payload in response

ERR 2024-05-29 13:37:55.640+00:00 Beacon node provides unexpected response   reason="Serialization error;200;produceBlockV3(best);unexpected-data" node=http://172.16.0.15:4000[Lodestar/v1.18.1/8b6ecc4] node_index=0 node_roles=AGBSDT
WRN 2024-05-29 13:37:55.645+00:00 Unable to retrieve block data              reason="Serialization error;200;produceBlockV3(best);unexpected-data" wall_slot=12 validator_index=53 service=block_service slot=12 validator=b09cb155@53

Block production with Nimbus VC will be fixed once we merge #6749 but there is another issue publishing blocks as SSZ (while JSON seems work fine there). Does not seem to be a problem with other clients and there is an issue on the Nimbus side to track this status-im/nimbus-eth2#6205 <-- this is an issue on Nimbus BN

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
meta-bug Issues that identify a bug and require a fix. meta-investigate Issues found that require further investigation and may not have a specific resolution/fix scope-interop Issues that fix interop issues between Lodestar and CL, EL or tooling.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants