Support for driving multiple EL nodes from a single Nimbus BN #4465

Merged: 9 commits merged into unstable from multiple-el-endpoints on Mar 5, 2023

Conversation

@zah (Contributor) commented Jan 6, 2023

Full list of changes:

  • Eth1Monitor has been renamed to ELManager to match its current responsibilities better.

  • The ELManager is no longer optional in the code (it won't have a nil value under any circumstances).

  • The support for subscribing for headers was removed as it only worked with WebSockets and contributed significant complexity while bringing only a very minor advantage.

  • Transition configuration is now exchanged 2 weeks prior to the Bellatrix fork (at this point, relevant only for future testnets).

  • The --web3-url parameter has been deprecated in favor of a new --el parameter. The new parameter supports specifying a different JWT per connection. Each connection can also be configured with a different set of responsibilities (e.g. download deposits, validate blocks and/or produce blocks). On the command-line, these properties can be configured through URL properties stored in the #anchor part of the URL. In TOML files, they come with a very natural syntax (although the URL scheme is also supported); see the configuration sketch after this list.

  • The previously scattered EL-related state and logic is now moved to eth1_monitor.nim (this module will be renamed to el_manager.nim in a follow-up commit). State is assigned properly either to the ELManager or to the individual ELConnection objects, where appropriate.

    The ELManager executes all Engine API requests against all attached EL nodes, in parallel. It compares their results and if there is a disagreement regarding the validity of a certain payload, this is detected and the beacon node is protected from publishing a block with a potential execution layer consensus bug in it.

    The BN provides per-EL-node metrics for the number of successful or failed requests for each type of Engine API request. If an EL node goes offline and connectivity is restored later, we report the problem and the remedy in an edge-triggered fashion.

  • The client can now produce Capella execution payloads.

  • The local testnet simulation script now runs Geth and Nimbus by default. The minimal network simulation uses a Nimbus remote signing node for 50% of the validators. The mainnet simulation uses the official web3signer in a 2 out of 3 threshold signing config.

  • The local testnet simulation can now use a payload builder. This is currently not activated in CI due to lack of automated procedures for installing third-party relays or builders.

    You are advised to use mergemock for now, but for the most realistic results, we can create a simple builder based on the nimbus-eth1 codebase that will be able to propose transactions from the regular network mempool.

  • Updated the run-geth-el.sh script to work with the latest Capella and EIP-4844 builds of Geth and added the run-nimbus-eth2-in-withdrawal-testnet.sh script that can be used to launch Nimbus instances in the available withdrawal devnets.
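For illustration, here is a rough sketch of the two configuration styles described in the --el bullet above. The exact property names (jwt-secret-file, roles and the role values) are assumptions made for the purpose of the example, not the authoritative Nimbus option names; consult the Nimbus documentation for the precise syntax.

  # Command line: per-connection properties in the #anchor part of the URL
  ./nimbus_beacon_node \
    --el="http://127.0.0.1:8551/#jwt-secret-file=/secrets/jwt.hex" \
    --el="ws://other-box:8551/#roles=sync-deposits"

  # TOML: the same two connections expressed as an array of tables
  [[el]]
  url = "http://127.0.0.1:8551"
  jwt-secret-file = "/secrets/jwt.hex"

  [[el]]
  url = "ws://other-box:8551"
  roles = ["sync-deposits"]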

@zah zah changed the title from "Supporting driving multiple EL nodes from a Nimbus BN" to "Support for driving multiple EL nodes from a single Nimbus BN" on Jan 6, 2023
@zah zah force-pushed the multiple-el-endpoints branch from 2583cc0 to dcc4458 on January 6, 2023 02:47
github-actions bot commented Jan 6, 2023

Unit Test Results

9 files ±0 · 1 065 suites ±0 · 32m 14s ⏱️ (+1m 22s)
3 554 tests ±0 · 3 317 ✔️ ±0 · 237 💤 ±0 · 0 ±0
15 419 runs ±0 · 15 154 ✔️ ±0 · 265 💤 ±0 · 0 ±0

Results for commit 6a3dc07 (comparison against base commit 3b41e6a).

♻️ This comment has been updated with latest results.

@arnetheduck (Member)

The --web3-url parameter has been deprecated in favor of a new --el parameter.

all documentation needs to be updated accordingly

On the command-line, these properties can be configured through URL properties stored in the #anchor part of the URL. In TOML files, they come with a very natural syntax (although the URL scheme is also supported).

ditto, needs an update to the docs

The minimal network simulation uses a Nimbus remote signing node for 50% of the validators. The mainnet simulation uses the official web3signer in a 2 out of 3 threshold signing config.

does anything test the plain and simple validators on mainnet?

run-nimbus-eth2-in-withdrawal-testnet.sh

with config file support available, we should be moving away from these - ie they're crutches at best and make the manual hard to write (because they're not consistently used for all commands such as trusted node sync)

@arnetheduck (Member)

new --el parameter.

since it's a new parameter, it's a good opportunity to default it to localhost:8551 so that a default "localhost" setup just works with as few command line parameters as possible (to be further improved with ethereum/execution-apis#302) - it should also be possible to remove the default so that no el connection is attempted

@@ -266,27 +356,18 @@ when hasGenesisDetection:
withdrawal_credentials: deposit.withdrawal_credentials)
m.genesisValidatorKeyToIndex[pubkey] = idx

proc processGenesisDeposit*(m: Eth1Monitor, newDeposit: DepositData) =
m.depositsChain.db.genesisDeposits.add newDeposit
proc processGenesisDeposit*(m: ELManager, newDeposit: DepositData) =
Member

With the el manager managing multiple EL:s, it makes even more sense to move the deposit processing logic to a separate module that can be unit tested in isolation - ie there should be a module whose role is to take as input eth1 blocks and produce a deposit structure / state, similar to the chaindag - all of this is synchronous logic that is orthogonal to the multiplexing of several connections - it's a very natural line along which this code could be simplified and rendered robust.

Member

ie "processing deposits" and "managing the live payload interaction between el and CL" are to completely stand-alone tasks - if this was designed like the VC, these would be separate services living on top of a manager whose role is solely to multiplex - a design like this would significantly untangle what right now is a monolith that's hard to work with

Contributor Author

These activities are already independent enough. They run as different "asynchronous threads" of execution which both share access to the Eth1Chain structure which is indeed a bit like the chaindag.

Otherwise, I've wanted to reduce the required inputs of this particular function further in order to make it more reusable and easier to test, but the main problem is that it has multiple different responsibilities:

  1. It must download all deposits, while forming a sparse chain of all blocks with deposits
  2. It must download all recent blocks, even if they don't have deposits
  3. It must obtain the timestamps for some of the blocks (currently, it's not doing this optimally and I plan to submit a follow-up PR fixing this).

As a result, it has a somewhat awkward monolithic internal design in order to allow these goals to be achieved with the minimal number of requests. I'll add some comments on the top to explain this. Removing the ELManager input is a worthy goal which is relatively within reach.
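As a rough, self-contained sketch of how those three goals interact (the types and the fetch procs are hypothetical stand-ins for the real eth1_monitor code and RPC calls, not the actual Nimbus implementation):

  import std/options

  type
    DepositStub = object
      pubkey: string
      amount: uint64

    Eth1BlockStub = object
      number: uint64
      timestamp: Option[uint64]   # goal 3: timestamps are filled in lazily
      deposits: seq[DepositStub]

    SparseDepositChain = object
      blocks: seq[Eth1BlockStub]  # sparse: only the blocks we actually need

  # Hypothetical stand-ins for the batched eth1 RPC requests
  proc fetchDepositLogs(fromBlock, toBlock: uint64): seq[(uint64, DepositStub)] =
    @[]   # in reality: eth_getLogs against the deposit contract

  proc fetchRecentBlocks(fromBlock, toBlock: uint64): seq[Eth1BlockStub] =
    @[]   # in reality: eth_getBlockByNumber over the recent window

  proc syncRange(chain: var SparseDepositChain, fromBlock, toBlock, head: uint64) =
    # Goal 1: a sparse chain of every block that carries deposits
    var withDeposits: seq[Eth1BlockStub]
    for (blockNum, dep) in fetchDepositLogs(fromBlock, toBlock):
      if withDeposits.len == 0 or withDeposits[^1].number != blockNum:
        withDeposits.add Eth1BlockStub(number: blockNum, timestamp: none(uint64))
      withDeposits[^1].deposits.add dep
    chain.blocks.add withDeposits

    # Goal 2: recent blocks are tracked even when they carry no deposits
    let recentStart = if head > 64'u64: head - 64 else: 0'u64
    chain.blocks.add fetchRecentBlocks(recentStart, head)

  when isMainModule:
    var chain: SparseDepositChain
    chain.syncRange(0, 1_000, 1_000)
    echo chain.blocks.len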

Member

These activities are already independent enough.

this is a comment about the code complexity - the monolithic design has regularly caused bugs and confusion when trying to approach the code for changes.

Contributor Author

The previous code had quite a bit more complex control flow because it relied on things like async events which could be fired from either the websocket header subscriptions or HTTP polling. The current version is quite a bit simpler, and introducing a notion of "services" would only bring back some of the complexity.

Member

not really - there's a portion of the logic that is not asynchronous and not dependent on EL communication - it checks for consistency of a new block with respect to the others and has an independent API with error handling and logic - it would be exactly the same if it was driven by an internal eth1 engine without async comms, or indeed a feed of blocks read from disk - this is the part that doesn't belong here, but rather in a separate module which receives tests etc - the code in elmanager at that point is a dumb data router, as it should be.

Contributor Author

Well, sure - the Eth1Chain object can be moved to a separate module. It's already used in the non-asynchronous way that you describe as part of the block_sim simulation which injects a random number of deposits and uses them during block production.

elif bestPayloadIdx.isNone:
  bestPayloadIdx = some idx
else:
  if cmpGetPayloadResponses(req.read, requests[bestPayloadIdx.get].read) > 0:
Member

why is the payload comparison done in elmanager? it seems to me this should be the responsibility of the same module that compares blocks from all sources, ie blinded blocks included thus concentrating the block selection logic in one place

@zah (Contributor Author) commented Jan 6, 2023

This is a simple utility function. I agree that it can be moved to another module once it's applied in more places.

Since you really want the best block, it doesn't matter if comparisons are done in multiple places. It would be silly to return multiple block candidates from the EL manager, just so there could be a single place where all blocks will be compared at the end.

Member

well, the point is to concentrate the selection strategy to a single place - at that point it becomes feasible to create other strategies for block selection (ie ensuring that a block contains a particular tx etc) - the elmanager already has too many responsibilities, many of which are not tested - this is one more which could be moved to a place where it could become a regular part of the test suite - since this is a recurring design problem in the elmanager, it's a good opportunity to get it done sooner rather than later.

@zah (Contributor Author) commented Jan 6, 2023

Sorry, but I still find the requested design silly. If you want to have a single block selection algorithm, one easy way to implement this would be to inject a BlockSelector dependency into the EL manager that will be called in place of the cmpGetPayloadResponses function. Since any block selection algorithm is likely to be a simple function with very little configurable state, it's trivial to test in unit tests without depending on any part of the ELManager.
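For what it's worth, a minimal sketch of that kind of injection (the names - BlockSelector, ELManagerStub, PayloadStub - are hypothetical and only illustrate the shape of the dependency, not actual Nimbus types):

  type
    PayloadStub = object
      blockValue: uint64          # stand-in for the payload's reward

    # the injected comparator: > 0 means `a` is preferable to `b`
    BlockSelector = proc(a, b: PayloadStub): int

    ELManagerStub = object
      selector: BlockSelector

  proc defaultSelector(a, b: PayloadStub): int =
    cmp(a.blockValue, b.blockValue)

  proc pickBest(m: ELManagerStub, responses: seq[PayloadStub]): PayloadStub =
    result = responses[0]
    for i in 1 ..< responses.len:
      if m.selector(responses[i], result) > 0:
        result = responses[i]

  when isMainModule:
    let m = ELManagerStub(selector: defaultSelector)
    echo m.pickBest(@[PayloadStub(blockValue: 1), PayloadStub(blockValue: 3)])

The selection policy itself then lives outside the manager and can be unit tested with plain values.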

Member

this would be one way to do it, but ideally it would be able to also take into consideration blinded blocks (for example, a popular strategy is to use a blinded block iff the reward is greater by a percentage compared to an unblinded block, so as to give slight preference to censorship resistance, but not at the expense of large profits).

It is the mixing of all these responsibilities that has led to the poor track record of elmanager over time - the tests don't get written because it is too complex and does too many things and the way to solve that is to split its responsibilities into parts that relate to logic and parts that relate to connection management and handling.
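To make the strategy example above concrete, here is a sketch of it in isolation (the names and the threshold are illustrative assumptions, not Nimbus code):

  # Prefer the locally built (unblinded) payload unless the builder's blinded
  # payload pays at least `builderBoostPercent` more.
  const builderBoostPercent = 10'u64

  proc useBuilderBlock(localValue, builderValue: uint64): bool =
    builderValue * 100 > localValue * (100 + builderBoostPercent)

  when isMainModule:
    doAssert not useBuilderBlock(localValue = 100, builderValue = 105)  # within 10%
    doAssert useBuilderBlock(localValue = 100, builderValue = 120)      # clearly better

A policy like this only needs plain values as input, which is what makes it easy to cover in the regular test suite once it is out of the elmanager.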

beacon_chain/conf.nim (outdated review thread, resolved)
@arnetheduck (Member)

ERR 2023-01-06 11:03:00.750+01:00 Error while syncing deposits               topics="eth1" url=ws://geth:8551 err="Transport is not initialised (missing a call to connect?)"
ERR 2023-01-06 11:03:00.750+01:00 Error while syncing deposits               topics="eth1" url=ws://geth:8551 err="Transport is not initialised (missing a call to connect?)"
ERR 2023-01-06 11:03:00.750+01:00 Error while syncing deposits               topics="eth1" url=ws://geth:8551 err="Transport is not initialised (missing a call to connect?)"
ERR 2023-01-06 11:03:00.750+01:00 Error while syncing deposits               topics="eth1" url=ws://geth:8551 err="Transport is not initialised (missing a call to connect?)"
ERR 2023-01-06 11:03:00.750+01:00 Error while syncing deposits               topics="eth1" url=ws://geth:8551 err="Transport is not initialised (missing a call to connect?)"
ERR 2023-01-06 11:03:00.750+01:00 Error while syncing deposits               topics="eth1" url=ws://geth:8551 err="Transport is not initialised (missing a call to connect?)"
ERR 2023-01-06 11:03:00.750+01:00 Error while syncing deposits               topics="eth1" url=ws://geth:8551 err="Transport is not initialised (missing a call to connect?)"
ERR 2023-01-06 11:03:00.750+01:00 Error while syncing deposits               topics="eth1" url=ws://geth:8551 err="Transport is not initialised (missing a call to connect?)"

turning off the EL puts the beacon node in a busy loop (see timing), with this PR
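As a point of reference, a minimal sketch of the shape of fix this calls for - backing off between retries instead of spinning. The delays and the syncDeposits stand-in are assumptions, not the actual Nimbus code:

  import chronos

  proc syncDeposits(): Future[bool] {.async.} =
    return false                      # simulate an unreachable EL

  proc depositSyncLoop() {.async.} =
    var backoffSeconds = 1
    while true:
      if await syncDeposits():
        backoffSeconds = 1            # healthy again: reset the backoff
        await sleepAsync(12.seconds)  # poll roughly once per slot when healthy
      else:
        await sleepAsync(backoffSeconds.seconds)   # no busy loop on failure
        backoffSeconds = min(backoffSeconds * 2, 60)

  when isMainModule:
    discard waitFor depositSyncLoop().withTimeout(5.seconds)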

@tersec (Contributor) commented Feb 13, 2023

Transition configuration is now exchanged 2 weeks prior to the Bellatrix fork (at this point, relevant only for future testnets).

Future testnets will, by all appearances, start merged and in Bellatrix or newer.

.gitmodules (review thread, resolved)
beacon_chain/conf.nim (outdated review thread, resolved)
zah added 9 commits March 5, 2023 00:59
Full list of changes:

* Eth1Monitor has been renamed to ELManager to match its current
  responsibilities better.

* The ELManager is no longer optional in the code (it won't have
  a nil value under any circumstances).

* The support for subscribing for headers was removed as it only
  worked with WebSockets and contributed significant complexity
  while bringing only a very minor advantage.

* The `--web3-url` parameter has been deprecated in favor of a
  new `--el` parameter. The new parameter has a reasonable default
  value and supports specifying a different JWT for each connection.
  Each connection can also be configured with a different set of
  responsibilities (e.g. download deposits, validate blocks and/or
  produce blocks). On the command-line, these properties can be
  configured through URL properties stored in the #anchor part of
  the URL. In TOML files, they come with a very natural syntax
  (although the URL scheme is also supported).

* The previously scattered EL-related state and logic is now moved
  to `eth1_monitor.nim` (this module will be renamed to `el_manager.nim`
  in a follow-up commit). State is assigned properly either to the
  `ELManager` or to the individual `ELConnection` objects where
  appropriate.

  The ELManager executes all Engine API requests against all attached
  EL nodes, in parallel. It compares their results and if there is a
  disagreement regarding the validity of a certain payload, this is
  detected and the beacon node is protected from publishing a block
  with a potential execution layer consensus bug in it.

  The BN provides metrics per EL node for the number of successful or
  failed requests for each type of Engine API request. If an EL node
  goes offline and connectivity is restored later, we report the
  problem and the remedy in edge-triggered fashion.

* More progress towards implementing Deneb block production in the VC
  and comparing the value of blocks produced by the EL and the builder
  API.

* Adds a Makefile target for the zhejiang testnet
@zah zah force-pushed the multiple-el-endpoints branch from 6a3dc07 to 98586e1 on March 4, 2023 23:00
@zah zah enabled auto-merge (squash) March 4, 2023 23:00
@zah zah merged commit 8771e91 into unstable Mar 5, 2023
@zah zah deleted the multiple-el-endpoints branch March 5, 2023 01:40
@etan-status (Contributor) left a comment

Missed a block with "Failed to obtain execution payload from EL", despite Geth logging "Stopping work on payload" only 0.679s after "Starting work on payload".

But, Geth also logged "Chain head was updated" 0.858s before finishing the payload, which is probably the await rpcClient.forkchoiceUpdated within getPayloadFromSingleEL.

The 1s timeout is probably too aggressive with the current implementation, as it spans the entire block building process and not just the actual getPayload call.

@zah

raise newException(CatchableError, "Head block is not a valid payload")

# Give the EL some time to assemble the block
await sleepAsync(chronos.milliseconds 500)
Contributor

engine_api.getPayload further down here may take another 1000 ms, so if we hit this path, the outer await requestsCompleted or deadline may time out on spec-compliant ELs.

Comment on lines +875 to +876
# TODO We should follow the spec and track the timeouts of
# the individual engine API calls inside `getPayloadFromSingleEL`.
Contributor

This should be done even in the pre-Deneb case. Currently, the 1s timeout includes all of getPayloadFromSingleEL and doesn't just wrap the getPayload message.

Furthermore, we should probably start the deadline only after all requests have been sent, so that all ELs have the full 1s available for getPayload.
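A minimal sketch of the first point - giving only the engine_getPayload exchange its own deadline - using chronos with hypothetical stand-ins for the actual Engine API calls (only the placement of the 1s budget is the point here, not the real request code):

  import chronos

  proc forkchoiceUpdated(): Future[void] {.async.} =
    await sleepAsync(200.milliseconds)     # simulated fcU round-trip

  proc getPayload(): Future[int] {.async.} =
    await sleepAsync(300.milliseconds)     # simulated payload retrieval
    return 7

  proc getPayloadFromSingleEL(): Future[int] {.async.} =
    await forkchoiceUpdated()              # not counted against the deadline
    await sleepAsync(500.milliseconds)     # EL block-building pause, also not counted

    # only the actual engine_getPayload exchange gets the spec's 1s budget,
    # and the deadline starts after the request has been sent
    let payloadFut = getPayload()
    if not await payloadFut.withTimeout(1.seconds):
      raise newException(CatchableError, "getPayload timed out")
    return payloadFut.read

  when isMainModule:
    echo waitFor getPayloadFromSingleEL()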

@etan-status (Contributor)

Also, @tersec, when using an external VC, we don't seem to eagerly trigger the runProposalForkchoiceUpdated even though nextProposalSlot is known by ActionTracker. checkNextProposer seems to only take into account the local DAG attached validators. Is this intended, or an oversight?

@tersec (Contributor) commented Jul 17, 2023

Also, @tersec, when using an external VC, we don't seem to eagerly trigger the runProposalForkchoiceUpdated even though nextProposalSlot is known by ActionTracker. checkNextProposer seems to only take into account the local DAG attached validators. Is this intended, or an oversight?

Well, it's not intended. But it seems like it at least should be set up to work, so would have to look at why it might not.

func getNextProposalSlot*(tracker: ActionTracker, slot: Slot): Slot =
  getNextValidatorAction(
    tracker.proposingSlots,
    tracker.lastCalculatedEpoch, slot)

if actionTracker.getNextProposalSlot(wallSlot) != nextWallSlot and
    dynamicFeeRecipientsStore[].getDynamicFeeRecipient(
      proposer, nextWallSlot.epoch).isNone:
  return Opt.none((ValidatorIndex, ValidatorPubKey))
let proposerKey = dag.validatorKey(proposer).get().toPubKey
Opt.some((proposer, proposerKey))

proc checkNextProposer*(self: ref ConsensusManager, wallSlot: Slot):
    Opt[(ValidatorIndex, ValidatorPubKey)] =
  self.dag.checkNextProposer(
    self.actionTracker, self.dynamicFeeRecipientsStore, wallSlot)

proc runProposalForkchoiceUpdated*(
    self: ref ConsensusManager, wallSlot: Slot) {.async.} =
  let
    nextWallSlot = wallSlot + 1
    (validatorIndex, nextProposer) = self.checkNextProposer(wallSlot).valueOr:
      return
  debug "runProposalForkchoiceUpdated: expected to be proposing next slot",
    nextWallSlot, validatorIndex, nextProposer

if self.consensusManager.checkNextProposer(wallSlot).isNone:
  # No attached validator is next proposer, so use non-proposal fcU

So this should only be failing to handle VC validators if getNextValidatorAction isn't picking them up either, and it's the same core function used by other parts of Nimbus for this purpose. runProposalForkchoiceUpdated isn't particularly unique in this regard.

@etan-status (Contributor)

That's correct, so probably something else going on :-/ It's indeed calling the same source as is used in the log

@etan-status (Contributor)

@tersec Found the culprit:

  # In Capella and later, computing correct withdrawals would mean creating a
  # proposal state. Instead, only do that at proposal time.
  if nextWallSlot.is_epoch:
    debug "runProposalForkchoiceUpdated: not running early fcU for epoch-aligned proposal slot",
      nextWallSlot, validatorIndex, nextProposer
    return

Actually it has to be this one, as I don't see any fcU sent to Geth in the slot prior to my missed block.

Which is consistent with the behaviour when self.consensusManager.checkNextProposer(wallSlot).isSome, and then runProposalForkchoiceUpdated returns. In that situation, no fcU is sent at all, until we try to actually build the next block.

Should we simply fall back to a regular fcu in that case, so that at least the attestations are correct?

@tersec (Contributor) commented Jul 17, 2023

@tersec Found the culprit:

  # In Capella and later, computing correct withdrawals would mean creating a
  # proposal state. Instead, only do that at proposal time.
  if nextWallSlot.is_epoch:
    debug "runProposalForkchoiceUpdated: not running early fcU for epoch-aligned proposal slot",
      nextWallSlot, validatorIndex, nextProposer
    return

Actually it has to be this one, as I don't see any fcU sent to Geth in the slot prior to my missed block.

Which is consistent with the behaviour when self.consensusManager.checkNextProposer(wallSlot).isSome, and then runProposalForkchoiceUpdated returns. In that situation, no fcU is sent at all, until we try to actually build the next block.

Should we simply fall back to a regular fcu in that case, so that at least the attestations are correct?

#5195

@etan-status (Contributor)

Another problem was that Slot end is triggered immediately after Slot start if no in-BN validators have duties. This leads to possibly heavy processing being done early into the slot, even when VCs are processing duties.

I added an additional delay to the slot processing, to wait until shortly before end of slot before calling onSlotEnd if there is a potential for pending duties: #5196

etan-status added a commit that referenced this pull request Aug 17, 2023
In #4465, a regression was introduced by deleting the periodic call to
`engine_exchangeTransitionConfiguration` in the Nimbus light client.
This led to the light client never treating the EL as online and,
subsequently, not sending `engine_newPayload` requests for new blocks.
Strangely, `engine_forkchoiceUpdated` requests still make it through :)

Geth still requires both `engine_newPayload` and `fcU` to be called.
By restoring the `exchangeTransitionConfiguration` loop, `newPayload`
requests are once more issued to the EL.
etan-status added a commit that referenced this pull request Aug 18, 2023
In #4465, a regression was introduced by deleting the periodic call to
`engine_exchangeTransitionConfiguration` in the Nimbus light client.
This led to the light client never treating the EL as online and,
subsequently, not sending `engine_newPayload` requests for new blocks.
Strangely, `engine_forkchoiceUpdated` requests still make it through :)

Geth still requires both `engine_newPayload` and `fcU` to be called.
By restoring the `exchangeTransitionConfiguration` loop, `newPayload`
requests are once more issued to the EL.
etan-status added a commit that referenced this pull request Feb 6, 2024
Since #4465, compilation with `-d:has_deposit_root_checks` fails. #4707
further built on top of it but the additions also don't compile. Fix it.
etan-status added a commit that referenced this pull request Feb 6, 2024
Since #4465, compilation with `-d:has_deposit_root_checks` fails. #4707
further built on top of it but the additions also don't compile. Fix it.