
L1: Tx Spammer sporadically timeouts #1120

Closed
rodrigo-o opened this issue Nov 8, 2024 · 5 comments · Fixed by #1413

rodrigo-o (Collaborator) commented Nov 8, 2024

Sometimes we get errors in the tx spammer due to transactions not being picked up before a timeout. We need to investigate this further to check whether it's related to load or to specific transactions. In one particular example a block was missed:

[Screenshot 2024-11-08 at 12:12:04 PM showing the missed block]

The spammer hangs in one of two ways: either it waits for txs that weren't added in the second phase of each spam round and times out after 5 minutes, resuming execution, or it hangs ad infinitum if the issue happens in the first phase.

rodrigo-o (Collaborator, Author) commented Dec 4, 2024

This was a bit of a rabbit hole, so first I want to share the initial information I was able to get.

Spammer execution
As part of the spammer configuration we have two important parameters, --accounts (a) and --txcount (b): accounts is the number of accounts affected by transactions, and txcount the number of transactions sent per account. The spammer works in a 2-step process: first it sends 1 tx for every account, which ends up as a first block with a transactions. If those succeed, it sends txcount transactions for each account, i.e. a × b transactions in a subsequent block.
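To make the expected load concrete, here is a small illustrative sketch (plain Rust, not taken from the spammer or ethrex; the function name is made up):

```rust
// Expected number of txs per block for one spam round, given the two
// parameters described above.
fn expected_block_tx_counts(accounts: u64, txcount: u64) -> (u64, u64) {
    // Step 1: one tx per account -> `accounts` txs in the first block.
    // Step 2: `txcount` txs per account -> `accounts * txcount` txs in a
    // subsequent block.
    (accounts, accounts * txcount)
}

fn main() {
    // e.g. --accounts=100 --txcount=2 -> blocks of 100 and 200 txs.
    let (step1, step2) = expected_block_tx_counts(100, 2);
    println!("step 1 block: {step1} txs, step 2 block: {step2} txs");
}
```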

Right now we are not sharing txs through p2p, so we are the only ones receiving them. On the other hand, if we are not the first client in the Kurtosis YAML, that client receives them and, given we are not processing p2p transactions, only the other clients get access to those. This shows the first difference between our current state and, for example, geth.

Diff 1: Amount of Transactions
Given the previous explanation, if we configure the spammer with --txcount=2 --accounts=100, we'll see an initial block with 100 txs, a follow-up one with 200 txs, and a repeat ad infinitum. Geth displays this behaviour, but in our case we have an initial block of 100 and a second one of between 100 and 140. I'll create a new issue to track this: we are rejecting or ignoring some of those transactions. What I was able to validate is that they are not filtered out, at least as far as I could check, so they were rejected prior to being added to the mempool.

Diff 2: Transaction Timeout
This was harder to debug. I'm still investigating it, trying both to replicate a scenario where it always happens and to replicate one where it doesn't happen at all. My current hypothesis, and a specific case that I was able to see at least 3 times, is the following:

For some reason the consensus layer asks us for an old payload_id instead of the current one; instead of returning the old block, we generate a new one and fill it with the mempool transactions. Because that block conflicts with the one already proposed for that slot, the consensus client discards it and immediately asks us for the correct payload_id, but by then the mempool is empty and we proceed to send a 0-tx block, effectively skipping a batch of txs which the spammer keeps checking for inclusion until it eventually times out. In the next comment I'll add some logs and more explanation of why/how this appears to be happening.
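To make the hypothesis concrete, here is a simplified sketch (hypothetical Rust types, not the actual ethrex code) of how answering a stale payload_id by refilling from the mempool leaves nothing for the follow-up request with the correct id:

```rust
// Simplified model of the hypothesized bug: every engine_getPayload
// (re)fills the payload from the mempool, emptying it in the process.

// Payload ids taken from the logs in the next comment.
const OLD_PAYLOAD_ID: u64 = 0x0300_3c72_0272_eae9;
const NEW_PAYLOAD_ID: u64 = 0x034f_708c_4dc4_4262;

struct Node {
    mempool: Vec<String>, // pending txs
}

impl Node {
    // Hypothetical handler: fills the block with whatever is currently in
    // the mempool, draining it.
    fn get_payload(&mut self, payload_id: u64) -> Vec<String> {
        println!("building payload {payload_id:#x}");
        std::mem::take(&mut self.mempool)
    }
}

fn main() {
    let mut node = Node {
        mempool: (0..100).map(|i| format!("tx{i}")).collect(),
    };
    // Just before the slot boundary the CL asks again for the OLD id:
    // we build a 100-tx block, but the CL discards it as orphaned.
    let orphaned = node.get_payload(OLD_PAYLOAD_ID);
    // Immediately afterwards the CL asks for the CORRECT id, but the
    // mempool is already empty, so the canonical block carries 0 txs.
    let canonical = node.get_payload(NEW_PAYLOAD_ID);
    println!("orphaned: {} txs, canonical: {} txs", orphaned.len(), canonical.len());
}
```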

rodrigo-o (Collaborator, Author)

A more in-depth explanation of the Transaction Timeout

The main issue was timing: for some reason Lighthouse was requesting payloads a couple of milliseconds before the start of a new slot (note the 22:15:41.994 log just before the start of the new slot at 22:15:42.000):

Dec 04 22:15:41.994 WARN Error signalling fork choice waiter     slot: 83, error: ForkChoiceSignalOutOfOrder { current: Slot(84), latest: Slot(83) }, service: beacon
Dec 04 22:15:41.999 WARN Producing block that conflicts with head, slot: 83, message: this block is more likely to be orphaned, service: beacon
Dec 04 22:15:42.000 ERRO No valid eth1_data votes, `votes_to_consider` empty, outcome: casting `state.eth1_data` as eth1 vote, genesis_time: 1733349534, earliest_block_timestamp: 1733349514, lowest_block_number: 0, service: deposit_contract_rpc
Dec 04 22:15:42.258 INFO Prepared beacon proposer                parent_root: 0x5c3646ce7e6f48f56efa064a8818db32a8fef939703c1952bdb1868af57fd7ab, validator: 0, prepare_slot: 85, service: beacon
Dec 04 22:15:42.260 ERRO No valid eth1_data votes, `votes_to_consider` empty, outcome: casting `state.eth1_data` as eth1 vote, genesis_time: 1733349534, earliest_block_timestamp: 1733349514, lowest_block_number: 0, service: deposit_contract_rpc
Dec 04 22:15:42.329 INFO Signed block published to network via HTTP API, publish_delay_ms: 1, slot: 84
Dec 04 22:15:42.342 INFO Valid block from HTTP API               slot: 84, proposer_index: 24, root: 0x0219…61d2, block_delay: 328.829875ms
Dec 04 22:15:42.351 INFO Prepared beacon proposer                parent_root: 0x0219aa83bd140218c09acd1212770c4f5653d0fc31daea59ba74431f957361d2, validator: 0, prepare_slot: 85, service: beacon

Unfortunately the Lighthouse FAQ explains that the ForkChoiceSignalOutOfOrder error is due to resource shortcomings, which wasn't our case, and I assume it refers to the other side of the out-of-time request (being late because of lack of resources, instead of being early).

Locally I tried both setting up a small script to eagerly sync from the time.apple.com SNTP server and, as some suggested, using chrony; although the errors greatly diminished (from 1 in ~5m to 1 in ~20m), the issue still persisted.

On our side, the logs clearly showed that we were receiving two subsequent getPayload requests with different ids:

2024-12-04T22:15:41.997223Z  INFO ethrex_rpc: RPC response: Object {"payloadStatus": Object {"status": String("VALID"), "latestValidHash": String("0xcbf001319c847adab0652378eed1856c45b0b8cc169e481ce8fd244746b872c7"), "validationError": Null}, "payloadId": String("0x03003c720272eae9")}
2024-12-04T22:15:42.000722Z  INFO ethrex_rpc::engine::payload: Requested payload with id: 0x034f708c4dc44262
2024-12-04T22:15:42.001347Z  INFO ethrex_blockchain::payload: Fetching transactions from mempool
2024-12-04T22:15:42.001360Z  INFO ethrex_blockchain::payload: Fetching plain transactions from mempool
2024-12-04T22:15:42.001362Z  INFO ethrex_storage: Mempool tx count: 100
2024-12-04T22:15:42.001438Z  INFO ethrex_storage: Filtered tx count: 100
2024-12-04T22:15:42.001522Z  INFO ethrex_blockchain::payload: Fetching blob transactions from mempool
2024-12-04T22:15:42.001531Z  INFO ethrex_storage: Mempool tx count: 100
2024-12-04T22:15:42.001532Z  INFO ethrex_storage: Filtered tx count: 0
2024-12-04T22:15:42.249700Z  INFO ethrex_rpc: RPC req method: "engine_getPayloadV3", req params: Some([String("0x034f708c4dc44262")])
2024-12-04T22:15:42.249719Z  INFO ethrex_rpc: RPC response parent hash: String("0xc437eb30a008036c702b36076700bdbff637afbe5d37ea06156c2217776bd27b"), blockhash: String("0x49b8d2e15072dec47fd15989bcaf31e3c40864028df0de48b7b388567bcaa40b")
2024-12-04T22:15:42.260817Z  INFO ethrex_rpc::engine::payload: Requested payload with id: 0x03003c720272eae9
2024-12-04T22:15:42.261082Z  INFO ethrex_blockchain::payload: Fetching transactions from mempool
2024-12-04T22:15:42.261092Z  INFO ethrex_blockchain::payload: Fetching plain transactions from mempool
2024-12-04T22:15:42.261093Z  INFO ethrex_storage: Mempool tx count: 0
2024-12-04T22:15:42.261094Z  INFO ethrex_storage: Filtered tx count: 0
2024-12-04T22:15:42.261096Z  INFO ethrex_blockchain::payload: Fetching blob transactions from mempool
2024-12-04T22:15:42.261096Z  INFO ethrex_storage: Mempool tx count: 0
2024-12-04T22:15:42.261097Z  INFO ethrex_storage: Filtered tx count: 0

And with some additional testing logs we could see how we generated a block with a number of txs (100 in this case) and immediately generated an empty one. This is due to the initial block being discarded as orphaned by the consensus client, which makes sense given that we are using an old payload to recreate the block. Here is what happens:

  • At slot 82, we received an engine_forkchoiceUpdated with payloadAttributes for 83 (so we stored the initial payload there) and returned payload id 0x03003c720272eae9
  • At the start of slot 83 we received the engine_getPayload with payload_id 0x03003c720272eae9, filled the transactions and returned the block without modifying our stored payloads. After that we received the new engine_forkchoiceUpdated with payloadAttributes for 84 and returned payload id 0x034f708c4dc44262
  • Just a couple of ms before the start of slot 84, the consensus client requested engine_getPayload, but given that it believed it was still in slot 83, it used the previous payload id 0x03003c720272eae9 again
  • We found the old payload and filled in new transactions; that block was then discarded as orphaned. Immediately afterwards, now in slot 84, we received a new engine_getPayload with the correct payload_id 0x034f708c4dc44262, but we produced an empty block.
  • This hung the Tx Spammer

rodrigo-o (Collaborator, Author)

The issue with this is that it's not clear why we received an old payload_id; even though I saw ForkChoiceSignalOutOfOrder issues around the other pairs (lighthouse-geth), I still need to reproduce the behaviour where other clients receive old payloads to make sure we are not causing this issue through our own behaviour.

That said, we still have a specific issue regarding payloads: they are not stored as finalized when we finish building them. For now the proposed solution is to update the stored state and check how that works, but it might bring back the previous question about the old payload ids and why we are receiving them if other clients don't.

rodrigo-o (Collaborator, Author)

> The issue with this is that it's not clear why we received an old payload_id; even though I saw ForkChoiceSignalOutOfOrder issues around the other pairs (lighthouse-geth), I still need to reproduce the behaviour where other clients receive old payloads to make sure we are not causing this issue through our own behaviour.

In a setup with lighthouse-geth as the initial node and holding most of the validators, I was able to reproduce the issue:

CL Logs

Dec 05 18:41:44.999 WARN Error signalling fork choice waiter     slot: 41, error: ForkChoiceSignalOutOfOrder { current: Slot(42), latest: Slot(41) }, service: beacon
Dec 05 18:41:45.000 INFO Prepared beacon proposer                parent_root: 0xa564e60428215b89cd2ff0529da114d734efbd4deaccac8aa6326b7306261079, validator: 0, prepare_slot: 43, service: beacon
Dec 05 18:41:45.001 ERRO Producing block at incorrect slot       message: check clock sync, this block may be orphaned, current_slot: 42, block_slot: 41, service: beacon
Dec 05 18:41:45.001 WARN Producing block that conflicts with head, slot: 41, message: this block is more likely to be orphaned, service: beacon
Dec 05 18:41:45.003 ERRO No valid eth1_data votes, `votes_to_consider` empty, outcome: casting `state.eth1_data` as eth1 vote, genesis_time: 1733423601, earliest_block_timestamp: 1733423581, lowest_block_number: 0, service: deposit_contract_rpc
Dec 05 18:41:45.004 WARN Duplicate payload cached, this might indicate redundant proposal attempts., service: exec
Dec 05 18:41:50.998 WARN Low peer count                          peer_count: 1, service: slot_notifier
Dec 05 18:41:50.998 INFO Synced                                  slot: 42, block:    …  empty, epoch: 1, finalized_epoch: 0, finalized_root: 0xa989…d1fb, exec_hash: 0x5e24…373a (verified), peers: 1, service: slot_notifier
Dec 05 18:41:50.998 INFO Ready for Electra                       info: ensure the execution endpoint is updated to the latest Electra/Prague release, service: slot_notifier

EL Logs

DEBUG[12-05|18:41:33.008] Engine API request received              method=GetPayload id=0x03246c7c79377d69
DEBUG[12-05|18:41:33.009] Served engine_getPayloadV3               conn=172.16.0.13:58266 reqid=1 duration="269.541µs"
INFO [12-05|18:41:33.009] Stopping work on payload                 id=0x03246c7c79377d69 reason=delivery
....
DEBUG[12-05|18:41:45.003] Engine API request received              method=GetPayload id=0x03246c7c79377d69
DEBUG[12-05|18:41:45.003] Served engine_getPayloadV3               conn=172.16.0.13:58266 reqid=1 duration="167.541µs"
DEBUG[12-05|18:41:45.064] Adding p2p peer                          peercount=2 id=76c9bb19381009f8   conn=dyndial addr=185.99.89.203:30404               name=Geth/v1.13.15-stable...

As noted here, the payload_id 0x03246c7c79377d69 was received twice. As to why this shouldn't be an issue on testnets/mainnet: the problem only arises with 2 subsequent block proposals, as far as I was able to test.

rodrigo-o (Collaborator, Author) commented Dec 6, 2024

> the problem only arises with 2 subsequent block proposals, as far as I was able to test.

I had a long run that might confirm this assumption: every time the incorrect slot was detected in the CL but the previous block was proposed by another node, geth received the payload_id it was expecting (instead of an old one, probably because it wasn't the proposer of the previous slot) and everything worked as expected:

CL Logs

Dec 06 12:41:33.999 WARN Error signalling fork choice waiter     slot: 4044, error: ForkChoiceSignalOutOfOrder { current: Slot(4045), latest: Slot(4044) }, service: beacon
Dec 06 12:41:34.005 ERRO Producing block at incorrect slot       message: check clock sync, this block may be orphaned, current_slot: 4045, block_slot: 4044, service: beacon
Dec 06 12:41:34.005 WARN Producing block that conflicts with head, slot: 4044, message: this block is more likely to be orphaned, service: beacon
Dec 06 12:41:34.007 ERRO No valid eth1_data votes, `votes_to_consider` empty, outcome: casting `state.eth1_data` as eth1 vote, genesis_time: 1733440354, earliest_block_timestamp: 1733440334, lowest_block_number: 0, service: deposit_contract_rpc
Dec 06 12:41:34.007 WARN Duplicate payload cached, this might indicate redundant proposal attempts., service: exec
Dec 06 12:41:34.084 INFO New block received                      root: 0x8577f5c67716a5b7d09767535d2bb8d54552046b4c25937bc9e771561885de0d, slot: 4045
Dec 06 12:41:40.000 INFO Synced                                  slot: 4045, block: 0x8577…de0d, epoch: 126, finalized_epoch: 124, finalized_root: 0x0971…1cb4, exec_hash:

EL Logs

DEBUG[12-06|12:33:33.504] Served engine_forkchoiceUpdatedV3        conn=172.16.0.15:57782 reqid=1 duration=2.181167ms
DEBUG[12-06|12:33:34.009] Engine API request received              method=GetPayload id=0x03966bcb26c03f20
DEBUG[12-06|12:33:34.009] Served engine_getPayloadV3               conn=172.16.0.15:57782 reqid=1 duration="191.292µs"
INFO [12-06|12:33:34.009] Stopping work on payload                 id=0x03966bcb26c03f20 reason=delivery

github-merge-queue bot pushed a commit that referenced this issue Dec 10, 2024
**Motivation**

The spammer hung in one of two ways: either it waited for txs that
weren't added in the second phase of each spam round and timed out after
5 minutes, resuming execution, or it hung ad infinitum if the issue
happened in the first phase. This indicated that we were losing transactions.

**Description**

This is a mid-term solution; the explanation of what happened is in the
issue, [in this particular
comment](#1120 (comment)),
but TL;DR: we were receiving `engine_getPayload` requests with old
`payload_ids` and filled the block *AGAIN*, but with new transactions.
The block was discarded as an orphan by consensus and we lost those txs
from the mempool on the immediately subsequent request. This happened
because we were missing the update of the payload when it was modified
in the second step of our build process.

The current solution stores the whole payload, i.e. the block, block
value and the blobs bundle. Given our current implementation (a naive
2-step build process), we either filled the transactions and ended the
build process or we didn't, so a simple closed flag was added to the
payload store to signal this and avoid refilling transactions. This is
clearer than just checking whether there are any transactions, but could
be changed if preferred.
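A minimal sketch of that idea, with hypothetical names (the real ethrex payload store and engine API plumbing differ):

```rust
use std::collections::HashMap;

#[derive(Clone, Default)]
struct StoredPayload {
    block_txs: Vec<String>, // stand-in for the built block
    block_value: u64,       // stand-in for the block value
    blobs_bundle: Vec<u8>,  // stand-in for the blobs bundle
    completed: bool,        // true once the fill step has run (the "closed" flag)
}

#[derive(Default)]
struct PayloadStore {
    payloads: HashMap<u64, StoredPayload>,
}

impl PayloadStore {
    // Fill a payload at most once: a repeated engine_getPayload for the
    // same id returns the already-built payload instead of draining the
    // mempool again.
    fn get_or_fill(&mut self, id: u64, fill_from_mempool: impl FnOnce() -> Vec<String>) -> StoredPayload {
        let entry = self.payloads.entry(id).or_default();
        if !entry.completed {
            entry.block_txs = fill_from_mempool();
            entry.completed = true;
        }
        entry.clone()
    }
}

fn main() {
    let mut store = PayloadStore::default();
    let first = store.get_or_fill(0x1, || vec!["tx0".to_string(), "tx1".to_string()]);
    // A second request for the same id does NOT refill from the mempool.
    let second = store.get_or_fill(0x1, || Vec::new());
    assert_eq!(first.block_txs, second.block_txs);
}
```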

| Spammer logs | Dora explorer |
|:---:|:---:|
| ![image](https://github.com/user-attachments/assets/f768879b-4bba-41f6-991b-e81abe0531f4) | ![image](https://github.com/user-attachments/assets/bd642c92-4a99-42fa-b99d-bc9036b14fff) |

**Next Steps**
Enhance the build process to make it async and interactive instead of
the naive 2-step, all-or-nothing implementation we have right now.



Resolves #1120
@rodrigo-o rodrigo-o added the bug Something isn't working label Dec 10, 2024
@rodrigo-o rodrigo-o self-assigned this Dec 10, 2024