-
Notifications
You must be signed in to change notification settings - Fork 390
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add engine api timeouts #207
Conversation
It might be the case that EL clients lock blockchain instance while processing a payload and are unable to process two payloads simultaneously -- this worth checking with EL client implementers! IF this is true then we can't prevent chain splits where a DoS block exploits the difference between nodes' hardware or EL client implementation (one may also be slower than another) by introducing timeouts, because timed out request on CL side doesn't mean that EL has aborted payload processing and is ready for the next one. I think that we should rather set the minimal amount of time that CL should wait before considering a call timed out and this amount should exceed seconds per slot for the reasons described by @djrtwo. The other related part of request management is actually a resource management part, a CL might want to cap a number of requests kept in memory by introducing a kind of LRU cache which will prune old ones. Also, a timeout may be different while EL is IMO, the change should rather be:
Where SHOULD instead of MUST leaves a room for resource management on CL side. We could also add explicitly:
On the reasonable amount of time to wait
In an attack scenario where every block is a DoS block it would make sense to wait for seconds per epoch during block processing. In this case there is a chance that chain will be able to justify previous epoch if at least one block is processed during current epoch and it carries enough attestations. Though, 120s looks pretty reasonable as well. Re-orgs with FCU
This may depend on the client implementation. EL may start doing a re-org which requires multiple blocks to execute, and return |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have made suggestions as per our discussion on devconnect. I think 8s
timeout for block processing make sense in general case because if a block is timely (i.e. received before attestation boundary) it is unlikely that any other block is received until the end of the slot, thus, a node may keep waiting at least till the end of a slot. 2s
for getPayload
is probably too much, I am open to reduce this down to 1s
close in favor of #216 |
Proposal
Why a timeout
There are a couple of reasons a block may take longer than expected to process when inserting via engine API into EL
In the event of a DoS block, it is dangerous to naively dismiss these as purely "invalid" at certain time thresholds. If a DoS block can be crafted such that some hardware/clint takes just less than a specified timeout and other hardware/client takes just longer, then an attacker can split the network on such a block.
It is particularly dangerous to have such timeouts be on the order of a slot's length of time (12s) because in this range of time, decisions on what to propose on top of and to attest to happen. So if a DoS block is in the range of sub-slot to 2x slot time (and different depending on hardware!), then an attacker (assuming they found a DoS in block construction) can likely perform such split-network view attacks with timeouts in this range.
Once timeouts are firmly beyond a slot-length, then we have fewer issues with split views. This is because targeting network split attacks that rely on many-slot-length timeouts (e.g. 120s) are likely to just get passed up by subsequnt normal, non-DoS blocks. That is -- if a DoS block at slot N relies on ~120s of execution time to create a network split, it is likely that in that time, a few honest blocks will come in and the canonical chain will just pass up this DoS block.
But what a timeout at all? At a certain point, a block is not valuable to execute and include. That is, if a block is taking more than a slot for any machine to execute, it is almost certain that it will just be orphaned and at some point the client needs to be able to move on and abort the work. More importantly, a block might have a deeper bug -- e.g. infinite loop -- and might never succeed. In such an event, the client needs to be able to call it quits and might have no indication to do so other than the lapse of time.
The case for 120s for block execution endpoints
Assuming that in the normal case, an honest block follows in the slot after a DoS block, a 120s timeout allows for a 10x difference in fastest hardware/client vs slowest hardware/client for that given block. There are thus two scenarios -- when the fastest executor is < 12s and when the fastest executor is >= 12s. Assuming the 10x min/max executor bound:
Note, I wouldn't suggest going less than my 10x assumption. I think it is conservative and safe. But if we think this is not conservative enough, let me know with some justification and we can change it higher.
What about re-orgs with FCU
The FCU endpoint might take a long time not just because of one block execution but might be many blocks. Additioanlly it might just be expensive to re-org for other reasons.
I'd like to hear more input here but we might consider either
The case for 12s for non-execution endpoints
For endpoints without block execution I suggest 12s timeout (slot-length).
If EL cannot respond within a slot for non-execution endpoints, either these are no longer valuable (e.g. getPayloadV1 and th proposal slot passed) or there is just clearly a bug (e.g. exchangeTransitionConfigurationV1)