**L1: Tx Spammer sporadically times out #1120**
This was a bit of a rabbit hole, so first I want to share the initial information I was able to get.

**Spammer execution**

Right now we are not sharing txs through p2p, so we are the only ones receiving them. On the other hand, if we are not the first client in the kurtosis YAML, that client receives them instead, and given that we are not processing them, only the other clients get access to those txs. This shows the first difference between our current state and, for example, geth.

**Diff 1: Amount of Transactions**

**Diff 2: Transaction Timeout**

For some reason the consensus layer asks us for an old payload_id instead of the current one. Instead of returning the old block, we generate a new one and fill it with mempool transactions. Because this block conflicts with the one already proposed for that slot, the consensus client discards it and immediately asks us for the correct payload_id, but by then the mempool is empty and we proceed to send a block with 0 txs, effectively skipping a batch of txs. The spammer keeps checking those txs for inclusion and eventually times out.

In the next comment I'll add some logs and more explanation of why/how this appears to be happening.
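To make that sequence concrete, here is a minimal sketch of the failure mode under the assumptions above; `Engine`, `Mempool`, and `get_payload` are hypothetical stand-ins, not ethrex's actual types:

```rust
use std::collections::HashMap;

struct Tx;

struct Block {
    tx_count: usize,
}

struct Mempool {
    txs: Vec<Tx>,
}

impl Mempool {
    // Draining the pool here is the root of the problem: the txs are
    // consumed even if the resulting block is later discarded as orphaned.
    fn drain_all(&mut self) -> Vec<Tx> {
        std::mem::take(&mut self.txs)
    }
}

struct Engine {
    mempool: Mempool,
    payloads: HashMap<u64, Block>,
}

impl Engine {
    fn get_payload(&mut self, payload_id: u64) -> &Block {
        if !self.payloads.contains_key(&payload_id) {
            // Buggy behavior: an old payload_id triggers a *rebuild* that
            // fills a fresh block from the mempool, instead of returning
            // the block that was already built for that id.
            let txs = self.mempool.drain_all();
            self.payloads.insert(payload_id, Block { tx_count: txs.len() });
        }
        &self.payloads[&payload_id]
    }
}

fn main() {
    let mut engine = Engine {
        mempool: Mempool { txs: (0..100).map(|_| Tx).collect() },
        payloads: HashMap::new(),
    };
    // 1. The CL asks for a stale payload_id: we rebuild and drain 100 txs.
    let orphaned = engine.get_payload(41);
    assert_eq!(orphaned.tx_count, 100); // this block gets orphaned by the CL
    // 2. The CL immediately asks for the correct id: the mempool is empty.
    let current = engine.get_payload(42);
    assert_eq!(current.tx_count, 0); // 0-tx block; the spammed txs are lost
}
```

The bug is that the stale-id request drains the mempool into a block that is doomed to be orphaned, so the immediately following request for the correct id finds nothing left to include.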
**A more in-depth explanation of the Transaction Timeout**

The main issue was time: for some reason lighthouse was requesting payloads before the start of a new slot, by just a couple of milliseconds (note the […]). Unfortunately lighthouse#faq explains that the […]. Locally I tried both to set a small script to update eagerly from […]. On our side, the logs clearly showed that we were receiving two subsequent getPayload requests with different ids:
With some additional testing logs we could see how we generated a block with a number of txs (100 in this case) and immediately generated an empty one. This is due to the initial block being discarded as orphaned by the consensus client, which makes sense given that we are using an old payload to recreate the block. Here is what happens:
The issue with this is that it's not clear why we received an old payload_id, and even when I saw […]. That said, we still have a specific issue regarding payloads: they are not stored as finalized when we finish building them. For now the proposed solution is to update the state and check how that works, but it might resurface the previous comment's question about the old payload_ids and why we are receiving them when other clients don't.
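As a rough illustration of that "update the state" direction, the finished payload could be written back to the store under its id, so a repeated getPayload returns the stored block instead of triggering a rebuild. A minimal sketch, assuming a simple map-based store (all names hypothetical):

```rust
use std::collections::HashMap;

// Hypothetical stand-ins for the real types.
type PayloadId = u64;
type Block = Vec<u8>;

// Write the finished block back under its payload id once building completes.
fn finish_build(store: &mut HashMap<PayloadId, Block>, id: PayloadId, block: Block) {
    store.insert(id, block);
}

// On engine_getPayload, prefer the stored block; only fall back to a fresh
// build (which is what touches the mempool) when nothing is stored for this id.
fn handle_get_payload(
    store: &mut HashMap<PayloadId, Block>,
    id: PayloadId,
    build: impl FnOnce() -> Block,
) -> Block {
    store.entry(id).or_insert_with(build).clone()
}

fn main() {
    let mut store: HashMap<PayloadId, Block> = HashMap::new();
    finish_build(&mut store, 42, vec![0xde, 0xad]);
    // The stored block comes back as-is; the rebuild closure is never called.
    let block = handle_get_payload(&mut store, 42, || panic!("rebuild would lose txs"));
    assert_eq!(block, vec![0xde, 0xad]);
}
```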
In a setup with lighthouse-geth as the initial node and most of the validators, I was able to reproduce the issue:

CL Logs
EL Logs
As noted here, the […]
I had a long run that might confirm this assumption: every time the incorrect slot was detected in the CL but the previous block was proposed by another node, geth received the same payload_id it was expecting (instead of an old one, probably because it wasn't the proposer of that previous slot) and everything worked as expected:

CL Logs
EL Logs
**Motivation**

The spammer hung in one of two ways: either it waited for txs that weren't added in the second phase of each spam and timed out after 5 minutes, resuming execution, or it hung ad infinitum if the issue happened in the first phase. This indicated that we were losing transactions.

**Description**

This is a mid-term solution; the explanation of what happened is in the issue, [in this particular comment](#1120 (comment)), but TL;DR: we were receiving `engine_getPayload` requests with old `payload_ids` and we filled the block *AGAIN*, but with new transactions. The block was discarded as orphaned by consensus and we lost those txs from the mempool on the immediately subsequent request. This was caused by a missing update of the payload when it was modified in the second step of our build process.

The current solution stores the whole payload, i.e. the block, the block value and the blobs bundle. Given our current implementation (a naive 2-step build process), we either filled the transactions and ended the build process or we didn't, so a simple closed flag was added to the payload store to signal this and avoid refilling transactions. This is clearer than just checking whether there are any transactions, but it could be changed if preferred.

| Spammer logs | Dora explorer |
|:-------------------------:|:-------------------------:|
| ![image](https://github.com/user-attachments/assets/f768879b-4bba-41f6-991b-e81abe0531f4) | ![image](https://github.com/user-attachments/assets/bd642c92-4a99-42fa-b99d-bc9036b14fff) |

**Next Steps**

Enhance the build process to make it async and interactive instead of the naive 2-step, all-or-nothing implementation we have right now.

Resolves #1120
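As a rough sketch of the store shape this describes, under the assumption of a simple in-memory map (the names `PayloadStore`, `open`, `close`, and `is_closed` are illustrative, not the actual ethrex API):

```rust
use std::collections::HashMap;

// Illustrative stand-ins for the stored pieces named above.
struct Payload {
    block: Vec<u8>,        // the built block
    block_value: u64,      // the block value
    blobs_bundle: Vec<u8>, // the blobs bundle
    completed: bool,       // set once the tx-filling build step has run
}

#[derive(Default)]
struct PayloadStore {
    payloads: HashMap<u64, Payload>,
}

impl PayloadStore {
    // Register an empty payload when a build is started for a new id.
    fn open(&mut self, id: u64) {
        self.payloads.insert(
            id,
            Payload { block: vec![], block_value: 0, blobs_bundle: vec![], completed: false },
        );
    }

    // Persist the finished payload and close it, so later requests for the
    // same id return this exact block instead of draining the mempool again.
    fn close(&mut self, id: u64, block: Vec<u8>, value: u64, blobs: Vec<u8>) {
        self.payloads.insert(
            id,
            Payload { block, block_value: value, blobs_bundle: blobs, completed: true },
        );
    }

    // True if the build for this id already filled transactions.
    fn is_closed(&self, id: u64) -> bool {
        self.payloads.get(&id).map_or(false, |p| p.completed)
    }
}
```

Here the `completed` field plays the role of the closed flag mentioned above: checking it is more explicit than inspecting whether the stored block already contains transactions.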
Sometimes we get errors in the tx spammer due to transactions not being picked up before a timeout. We need to investigate this further to check if it's related to load or to specific transactions. In one particular example a block was missed: