
add engine api timeouts #207

Closed
wants to merge 1 commit into from

Conversation

@djrtwo (Contributor) commented Apr 5, 2022

Proposal

  • 120s timeout for engine endpoints with block execution semantics
  • 12s timeout for engine endpoints without execution semantics
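
As a concrete illustration, here is a minimal sketch of how a CL client might encode these two timeout classes. The grouping of methods, and the inclusion of newPayloadV1/forkchoiceUpdatedV1 as "block execution" endpoints, is my assumption about the current engine API surface, not something this PR specifies.

```python
# Hypothetical per-endpoint timeout table (seconds). Which methods carry
# block execution semantics is an assumption for illustration purposes.
BLOCK_EXECUTION_TIMEOUT = 120  # endpoints that execute blocks
NON_EXECUTION_TIMEOUT = 12     # everything else (one slot)

ENGINE_API_TIMEOUTS = {
    "engine_newPayloadV1": BLOCK_EXECUTION_TIMEOUT,
    "engine_forkchoiceUpdatedV1": BLOCK_EXECUTION_TIMEOUT,
    "engine_getPayloadV1": NON_EXECUTION_TIMEOUT,
    "engine_exchangeTransitionConfigurationV1": NON_EXECUTION_TIMEOUT,
}
```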

Why a timeout

There are a couple of reasons a block may take longer than expected to process when inserted into the EL via the engine API:

  • A DoS block -- that is, a block that will finish processing but takes an unexpectedly and disproportionately long time
  • A bug -- some sort of infinite loop, crash, or hang may cause the EL to never respond (or to take a very, very long time to)

In the event of a DoS block, it is dangerous to naively dismiss these blocks as purely "invalid" at a certain time threshold. If a DoS block can be crafted such that some hardware/client combinations take just less than a specified timeout and others take just longer, then an attacker can split the network on such a block.

It is particularly dangerous for such timeouts to be on the order of a slot's length (12s), because in this range of time, decisions about what to propose on top of and what to attest to are made. So if a DoS block takes between sub-slot and 2x slot time to process (and different amounts depending on hardware!), then an attacker who has found such a DoS vector in block construction can likely perform split-network-view attacks with timeouts in this range.

Once timeouts are firmly beyond a slot length, we have fewer issues with split views. This is because network split attacks that rely on many-slot-length timeouts (e.g. 120s) are likely to just be passed up by subsequent normal, non-DoS blocks. That is, if a DoS block at slot N relies on ~120s of execution time to create a network split, it is likely that in that time a few honest blocks will come in and the canonical chain will simply pass up this DoS block.

But why have a timeout at all? At a certain point, a block is not valuable to execute and include. That is, if a block is taking more than a slot for any machine to execute, it is almost certain to just be orphaned, and at some point the client needs to be able to move on and abort the work. More importantly, a block might trigger a deeper bug -- e.g. an infinite loop -- and might never succeed. In such an event, the client needs to be able to call it quits, and it might have no indication to do so other than the lapse of time.

The case for 120s for block execution endpoints

Assuming that, in the normal case, an honest block follows in the slot after a DoS block, a 120s timeout allows for a 10x difference between the fastest hardware/client and the slowest hardware/client for that given block. There are thus two scenarios -- when the fastest executor takes < 12s and when the fastest executor takes >= 12s. Assuming the 10x min/max executor bound:

  • When the fastest executor takes < 12s, an honest proposer of the next slot might build on the DoS block (if they were the fastest executor). In that case, the slowest executor would take < 120s (less than the timeout), so it would eventually be able to follow what might be a canonical chain including the block.
  • When the fastest executor takes >= 12s, an honest proposer of the next slot would not build on top of this DoS block, because even the fastest executor on the network would not have been able to incorporate the block locally in time. Thus the canonical chain would move on without the DoS block. Some nodes would incorporate the block locally (< 120s execution), but the slowest on the network (> 120s) would just throw out the block at the timeout, with no risk of a chain split because the DoS block never makes it into the canonical chain.

Note, I wouldn't suggest going below my 10x assumption; I think it is conservative and safe. But if we think it is not conservative enough, let me know with some justification and we can raise it (see the worked numbers below).
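
To make the 10x arithmetic concrete, here are the two scenarios with illustrative numbers (assuming SECONDS_PER_SLOT = 12; the exact figures are examples, not measurements):

```python
SECONDS_PER_SLOT = 12
HW_SPREAD = 10  # assumed max ratio of slowest to fastest executor for a given block

# Scenario 1: fastest executor finishes just under one slot.
fastest_1 = 11                     # seconds; the next proposer may build on the DoS block
slowest_1 = fastest_1 * HW_SPREAD  # 110s, under the 120s timeout, so even the slowest
                                   # node can eventually follow the canonical chain

# Scenario 2: fastest executor needs a full slot or more.
fastest_2 = SECONDS_PER_SLOT       # 12s; the next proposer cannot build on the DoS block
slowest_2 = fastest_2 * HW_SPREAD  # 120s, at/over the timeout, but the block never
                                   # becomes canonical anyway, so no split occurs
```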

What about re-orgs with FCU

The FCU endpoint might take a long time not just because of one block execution; it might involve many blocks. Additionally, a re-org might just be expensive for other reasons.

I'd like to hear more input here, but we might consider either:

  • EL responds with SYNCING if doing a deep re-org that will take time
  • FCU timeout is dynamic w.r.t. the depth of the re-org -- 120s in the normal case, but longer if the CL knows a deep re-org is happening (a sketch of this option follows below)
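
A minimal sketch of the second option, assuming the CL can estimate the re-org depth from its own fork choice store; the per-block allowance and the helper name are hypothetical, not part of this proposal:

```python
FCU_BASE_TIMEOUT = 120     # seconds, normal case
PER_BLOCK_ALLOWANCE = 12   # hypothetical extra budget per additional re-orged block

def forkchoice_updated_timeout(reorg_depth: int) -> int:
    """Scale the forkchoiceUpdated timeout with the depth of a re-org the CL
    already knows about; fall back to 120s when no deep re-org is expected."""
    return FCU_BASE_TIMEOUT + PER_BLOCK_ALLOWANCE * max(reorg_depth - 1, 0)
```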

The case for 12s for non-execution endpoints

For endpoints without block execution, I suggest a 12s timeout (slot length).

If the EL cannot respond within a slot for non-execution endpoints, either the response is no longer valuable (e.g. getPayloadV1 after the proposal slot has passed) or there is just clearly a bug (e.g. exchangeTransitionConfigurationV1).
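
For illustration, a rough sketch of how a CL might enforce the slot-length timeout and simply abandon a late response (the `rpc.call` helper is hypothetical; only `asyncio.wait_for` is a real library call):

```python
import asyncio

NON_EXECUTION_TIMEOUT = 12  # seconds (slot length)

async def get_payload(rpc, payload_id: str):
    """Abort engine_getPayloadV1 once the slot has passed; a payload that
    arrives later is no longer useful for the proposal."""
    try:
        return await asyncio.wait_for(
            rpc.call("engine_getPayloadV1", [payload_id]),  # hypothetical JSON-RPC helper
            timeout=NON_EXECUTION_TIMEOUT,
        )
    except asyncio.TimeoutError:
        return None  # proposal slot passed; give up on this payload
```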

@mkalinin (Collaborator) commented Apr 8, 2022

It might be the case that EL clients lock the blockchain instance while processing a payload and are unable to process two payloads simultaneously -- this is worth checking with EL client implementers! If this is true, then we can't prevent chain splits where a DoS block exploits differences between nodes' hardware or EL client implementations (one may also be slower than another) by introducing timeouts, because a timed-out request on the CL side doesn't mean that the EL has aborted payload processing and is ready for the next one.

I think that we should instead set the minimal amount of time the CL should wait before considering a call timed out, and this amount should exceed seconds per slot for the reasons described by @djrtwo. The other related part of request management is really resource management: a CL might want to cap the number of requests kept in memory by introducing a kind of LRU cache which prunes old ones. Also, the timeout may differ while the EL is SYNCING, as there are many more requests per second and it's not that important for the CL to wait for a response to each request it sends; though, since a SYNCING EL is supposed to respond quickly, we may not need to treat SYNCING as a special case here.
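
On the resource-management point, a rough sketch of capping in-flight requests with an LRU-style bound (the cap size and data layout are purely illustrative):

```python
from collections import OrderedDict

MAX_PENDING_REQUESTS = 64  # illustrative cap

pending = OrderedDict()  # request_id -> metadata for calls still awaiting the EL

def track_request(request_id, meta):
    """Remember an in-flight engine API call, pruning the oldest entry once
    the cap is reached rather than growing memory without bound."""
    if len(pending) >= MAX_PENDING_REQUESTS:
        pending.popitem(last=False)  # drop the oldest tracked request
    pending[request_id] = meta
```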

IMO, the change should rather be:

Consensus Layer client software SHOULD wait for at least the amount of time specified by the timeout before aborting the call due to the timeout being exceeded.

Where SHOULD instead of MUST leaves room for resource management on the CL side.

We could also add explicitly:

A call that is aborted because the timeout is exceeded MUST NOT be interpreted as if an INVALID status had been returned.
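
A sketch of what keeping a timed-out call distinct from an INVALID verdict could look like on the CL side; the status set shown is abbreviated, and the enum and handling are assumptions rather than spec text:

```python
from enum import Enum, auto

class PayloadOutcome(Enum):
    VALID = auto()
    INVALID = auto()
    SYNCING = auto()
    TIMED_OUT = auto()  # local outcome only; not a verdict returned by the EL

def interpret_new_payload(result: dict, timed_out: bool) -> PayloadOutcome:
    """A call aborted on timeout MUST NOT be folded into INVALID; keep it as
    a separate local state so fork choice does not reject the block."""
    if timed_out:
        return PayloadOutcome.TIMED_OUT
    return PayloadOutcome[result["status"]]  # status set abbreviated for the sketch
```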

On the reasonable amount of time to wait

a block is not valuable to execute and include

In an attack scenario where every block is a DoS block, it would make sense to wait for seconds per epoch during block processing. In this case there is a chance that the chain will be able to justify the previous epoch if at least one block is processed during the current epoch and it carries enough attestations. Though, 120s looks pretty reasonable as well.

Re-orgs with FCU

EL responds with SYNCING if doing a deep re-org that will take time

This may depend on the client implementation. The EL may start doing a re-org that requires multiple blocks to execute, and either return SYNCING or execute them all and respond synchronously, depending on the size of the fork. Or it can be more advanced: start executing blocks and, if the execution isn't done within a sub-second interval, return SYNCING.

@mkalinin (Collaborator) left a comment

I have made suggestions as per our discussion at Devconnect. I think an 8s timeout for block processing makes sense in the general case because, if a block is timely (i.e. received before the attestation boundary), it is unlikely that any other block will be received until the end of the slot; thus, a node may keep waiting at least till the end of the slot. 2s for getPayload is probably too much; I am open to reducing this down to 1s.
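
A sketch of the "keep waiting at least till the end of the slot" idea; the function and its inputs are assumptions about how a CL might compute the budget, not suggested spec text:

```python
SECONDS_PER_SLOT = 12
BLOCK_PROCESSING_TIMEOUT = 8  # baseline suggested in this review

def remaining_processing_budget(slot_start: float, now: float) -> float:
    """Allow at least the 8s baseline; if the block arrived early in the slot,
    keep waiting until the slot ends, since no competing block is expected
    before then."""
    until_slot_end = (slot_start + SECONDS_PER_SLOT) - now
    return max(BLOCK_PROCESSING_TIMEOUT, until_slot_end)
```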

@djrtwo mentioned this pull request May 4, 2022

@djrtwo (Contributor, Author) commented May 4, 2022

close in favor of #216
