Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(l1): fix hangs on the tx spammer #1413

Merged
merged 13 commits into from
Dec 10, 2024

Conversation

rodrigo-o
Copy link
Collaborator

@rodrigo-o rodrigo-o commented Dec 5, 2024

Motivation

The spammer hanged in one of 2 forms, either it wait for tx that weren't added in the second phase of each spam and timeout after 5 minutes resuming the execution or it hangs ad infinitum if the issue happens in the first phase. This indicate that we were losing transactions.

Description

This is a mid-term solution, the explanation about what happened is done on the issue, in this particular comment, but TLDR: we were receiving engine_getPayload request with old payload_ids and we filled AGAIN the block but with new transactions. the block was discarded as orphan by consensus and we lost those txs from mempool on the immediate subsequent request. This was caused because we were missing the update of the payload when it was modified in the second step of our build process.

The current solution stores the whole payload, i.e. the block, block value and the blobs bunde. Given our current implementation (a naive 2-step build process) we either filled the transaction and ended the build process or not, a simple closed flag was added to the payload store to signal this and avoid refilling transactions, this is clearer than just check if there are any but could be changed if preferred.

Spammer logs Dora explorer
image image

Next Steps
Enhance the build process to make it async and interactive instead of the naive 2-step all-or-nothing implementation we had right now.

Resolves #1120

@rodrigo-o rodrigo-o marked this pull request as ready for review December 9, 2024 11:53
@rodrigo-o rodrigo-o requested a review from a team as a code owner December 9, 2024 11:53
@rodrigo-o
Copy link
Collaborator Author

rodrigo-o commented Dec 9, 2024

Now the behaviour we've seen between lighthouse and geth is replicated with our node. Here are some logs showing it:

CL Logs

Dec 09 01:38:16.998 WARN Error signalling fork choice waiter     slot: 1237, error: ForkChoiceSignalOutOfOrder { current: Slot(1238), latest: Slot(1237) }, service: beacon
Dec 09 01:38:17.003 ERRO Producing block at incorrect slot       message: check clock sync, this block may be orphaned, current_slot: 1238, block_slot: 1237, service: beacon
Dec 09 01:38:17.003 WARN Producing block that conflicts with head, slot: 1237, message: this block is more likely to be orphaned, service: beacon
Dec 09 01:38:17.004 ERRO No valid eth1_data votes, `votes_to_consider` empty, outcome: casting `state.eth1_data` as eth1 vote, genesis_time: 1733693441, earliest_block_timestamp: 1733693421, lowest_block_number: 0, service: deposit_contract_rpc
Dec 09 01:38:17.006 WARN Duplicate payload cached, this might indicate redundant proposal attempts., service: exec
Dec 09 01:38:17.011 ERRO No valid eth1_data votes, `votes_to_consider` empty, outcome: casting `state.eth1_data` as eth1 vote, genesis_time: 1733693441, earliest_block_timestamp: 1733693421, lowest_block_number: 0, service: deposit_contract_rpc
Dec 09 01:38:17.102 INFO Signed block published to network via HTTP API, publish_delay_ms: 1, slot: 1238
Dec 09 01:38:17.140 INFO Valid block from HTTP API               slot: 1238, proposer_index: 16, root: 0xbfd8…a1ab, block_delay: 101.260668ms
Dec 09 01:38:22.998 INFO Synced                                  slot: 1238, block: 0xbfd8…a1ab, epoch: 38, finalized_epoch: 36, finalized_root: 0x1b45…15a9, exec_hash: 0xef5a…eca4 (verified), peers: 2, service: slot_notifier


El Logs

2024-12-09T01:38:17.005811Z  INFO ethrex_rpc::engine::payload: Requested payload with id: 0x030d7e0143865573
2024-12-09T01:38:17.005850Z  INFO ethrex_rpc::engine::payload: Current Fork: Cancun
2024-12-09T01:38:17.005900Z  INFO ethrex_rpc: RPC req method: "engine_getPayloadV3", req params: Some([String("0x030d7e0143865573")])
2024-12-09T01:38:17.005904Z  INFO ethrex_rpc: RPC response parent hash: String("0xd7e73b12be05a0524d10dcf508d52631af958a2b0d2dd86a6aff14993a437170"), blockhash: String("0x244d7ff371958f6943735adfa0617009c823768f411acc3e42ad31226f60bedb")
2024-12-09T01:38:17.012249Z  INFO ethrex_rpc::engine::payload: Requested payload with id: 0x039150aa7bf22245
2024-12-09T01:38:17.012297Z  INFO ethrex_rpc::engine::payload: Current Fork: Cancun
2024-12-09T01:38:17.012621Z  INFO ethrex_blockchain::payload: Fetching transactions from mempool
2024-12-09T01:38:17.012638Z  INFO ethrex_blockchain::payload: Fetching plain transactions from mempool
2024-12-09T01:38:17.012643Z  INFO ethrex_storage: Mempool tx count: 10
2024-12-09T01:38:17.012660Z  INFO ethrex_storage: Filtered tx count: 10
2024-12-09T01:38:17.012669Z  INFO ethrex_blockchain::payload: Fetching blob transactions from mempool
2024-12-09T01:38:17.012671Z  INFO ethrex_storage: Mempool tx count: 10
2024-12-09T01:38:17.012672Z  INFO ethrex_storage: Filtered tx count: 0
2024-12-09T01:38:17.051459Z  INFO ethrex_rpc: RPC req method: "engine_getPayloadV3", req params: Some([String("0x039150aa7bf22245")])
2024-12-09T01:38:17.051478Z  INFO ethrex_rpc: RPC response parent hash: String("0x244d7ff371958f6943735adfa0617009c823768f411acc3e42ad31226f60bedb"), blockhash: String("0xef5ae0e7c8534482302e5df9a4dddcb7d4407fbc3edb90ddba61679e3275eca4")

As we've seen with geth, 2 payload request were received, one with the old payload which was discarded by the consensus client due to being the same as a previous one, and the next one correct, not missing the block and publishing the correct one. There is also possible to just receive the old payload (which we've also seen in geth) and miss the block entirely.

Comment on lines +61 to +62
ENCLAVE ?= lambdanet

Copy link
Collaborator Author

@rodrigo-o rodrigo-o Dec 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change was made to been able to compare two different executions at the same time using make localnet with different enclaves.

@rodrigo-o rodrigo-o added this pull request to the merge queue Dec 10, 2024
Merged via the queue into main with commit 3c05f02 Dec 10, 2024
12 checks passed
@rodrigo-o rodrigo-o deleted the check-transactions-hanged-from-the-spammer branch December 10, 2024 13:46
@rodrigo-o rodrigo-o added bug Something isn't working and removed bug Something isn't working labels Dec 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

L1: Tx Spammer sporadically timeouts
3 participants