Optimize backing for latency and liveness #4386
Comments
We know some of this, like 1, helps even once we have contextual execution. Honestly, I don't know either way about 2. I think 3 sounds harmless, except that parachain liveness requires that they share the block within the parachain. We do risk temporary liveness hiccups if we shrink the backing group size, though. It's fairly harmless to back with fewer than 1/2 of the backers, I think; it somewhat increases block producers' power for MEV and censorship, but we envision other solutions there.
Other things we could do, by @crystalin:
For the longer term we have:
We always wanted QUIC, but I think initial efforts lacked tuning knowledge. We should either hold connections open or else use the 0-RTT option in TLS 1.3. I doubt validators run modified code now, so incentivization changes nothing now. We've done initial design work for incentivization that helps maintain this longer term.
I disagree with you about incentivization. It needs to be improved. It is true that probably no one is running customized binaries, but the point of incentivization is to make sure validators are doing their best to get blocks included. I suspect a lot of them are running very poor hardware and connections, and currently there is no reason for them to improve that situation, as it doesn't impact their rewards on Kusama. However, it impacts parachains a lot. A slow validator will never back a parachain block (we observe that on Moonriver, when blocks are not included during the full validator rotation length of 2 mins). Having strong incentivization (for the validator backing the parachain, but also for the validator producing the block that includes the parachain block) would motivate operators to improve their setup (hardware and configuration). We see that on Moonriver, where the chances of getting a block in (and getting the rewards with it) are highly correlated with setup and hardware. We only have ~50 collators, but ALL of them are running high-end hardware/setups, optimized for block production.
It doesn't cost much to get the additional security required to reduce the backing votes to 1. I don't know how much the next step - distribution for availability - depends on how many people had the block already. This might end up timing out more often if the one guy who has the PoV block has bad networking.
Thanks @AlistairStewart! Good point with regards to availability-distribution, but I think that should be fine:
We do not, afaik, count backing checkers towards approval, @AlistairStewart, so no security changes are required there. It's interesting whether doing only 1 backer helps or hurts overall, so maybe worth testing. A priori, it sounds like contextual execution makes the 1 backer trick no longer helpful, but maybe with future pipelining ideas? I don't know. As I said above, @crystalin, we're only discussing optimizations here, because right now incentivization cannot impact observable performance, assuming nobody runs modified code yet. I'll do a guide PR for our incentivization design eventually, but not this week.
This PR:
- Reduces MAX_UNSHARED_UPLOAD_TIME to 150ms
- Increases timeout on collation fetching to 1200ms
- Reduces limit on needed backing votes in the runtime

This PR does not yet reduce the number of needed backing votes on the node as this can only be meaningfully enacted once the changed limit in the runtime is live.
* First step in implementing #4386

  This PR:
  - Reduces MAX_UNSHARED_UPLOAD_TIME to 150ms
  - Increases timeout on collation fetching to 1200ms
  - Reduces limit on needed backing votes in the runtime

  This PR does not yet reduce the number of needed backing votes on the node as this can only be meaningfully enacted once the changed limit in the runtime is live.
* Fix tests.
* Guide updates.
* Review remarks.
* Bump minimum required backing votes to 2 in runtime.
* Make sure node side code won't make runtime vomit.
* cargo +nightly fmt
Without contextual execution, getting a big parachain block backed is a tight squeeze as we only have two seconds before the block producer must have seen all required statements. This can be witnessed by issues like this one.
Those two seconds are spent in the backing process:
Delivering a collation to a validator currently times out after 1 second, which is kind of sensible, considering that we only have 2 seconds in total. It might still be worthwhile to increase that limit a bit, as it seems to be a very significant part of the whole process.
Optimizing for Latency
The collator protocol is currently optimized for bandwidth, when in reality the problem will most likely be latency in a globally distributed network. Ping times can reach hundreds of milliseconds. If we consider TCP handshakes and TCP slow start, transferring megabytes of data can easily break the timeout, despite nodes having loads of bandwidth. Things we should do to mitigate this:
Reduce MAX_UNSHARED_UPLOAD_TIME, even to 0, to fully optimize for latency, if we limit the number of parallel uploads.

With 3), PoV distribution should hardly be necessary anymore, which I think is good: PoV distribution only starts after a candidate has been validated, which again adds latency that the reduced bandwidth demands on the collator will hardly make up for. Also, parachains control their collators, so if they are not happy with performance, they can beef them up.
Optimizing for Liveness
In addition to, or instead of, point 4 of the previous section, we should also think about reducing the number of required backing votes. As already discussed a few times, the security of Polkadot comes from approval checking; in backing we really should be more concerned about liveness than security.
Consider a backing group of size 5: right now we would require 3 votes for the backing to succeed. If this backing group is widely distributed around the world, then some of those validators will have a very short round trip time to the current collator, while others might have a very long one. By reducing the number of required votes, we can use this to our advantage. For example, if we demanded only one vote, backing would succeed as long as a single validator in the backing group is in the same region as the collator, even when pushing limits.
Or in other words, we could make it so that a single good enough validator will suffice for the parachain to make progress, while now we require three out of five to be good and nearby.
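To make the group-of-five example concrete, here is a minimal Rust sketch. It assumes each validator's vote arrives after a fixed delay; the function names and the latency figures in the note below are made up for illustration and are not taken from the Polkadot code base.

```rust
/// Time (in ms) until a candidate counts as backed: the moment the
/// `required_votes`-th fastest validator has delivered its vote.
/// Returns `None` if the group cannot supply that many votes.
/// (Hypothetical helper, not part of any Polkadot API.)
fn time_to_backing(vote_latencies_ms: &[u64], required_votes: usize) -> Option<u64> {
    if required_votes == 0 || required_votes > vote_latencies_ms.len() {
        return None;
    }
    let mut sorted = vote_latencies_ms.to_vec();
    sorted.sort_unstable();
    // After sorting, index k-1 holds the k-th fastest vote.
    Some(sorted[required_votes - 1])
}

/// Whether backing completes within the given time budget.
/// (Hypothetical helper, not part of any Polkadot API.)
fn backed_in_time(vote_latencies_ms: &[u64], required_votes: usize, deadline_ms: u64) -> bool {
    time_to_backing(vote_latencies_ms, required_votes).map_or(false, |t| t <= deadline_ms)
}
```

For a hypothetical group with vote latencies of 300, 2400, 900, 2600 and 3100 ms, three required votes give a backing time of 2400 ms, which misses a 2000 ms budget, whereas one required vote gives 300 ms.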
Considerations of reduced number of required backing votes
If we optimize for liveness and only require a single backing vote, the stake risked by an attacker is only a third of what would be at risk if we required three votes. I would not consider this a big problem though, as we can easily make up for it by increasing the number of required approval checkers, which increases the risk of getting caught.
On a network with disputes, but without slashing, a single malicious validator can reduce parachain performance way more than a single slow backing group would.
Conclusion
Contextual execution is the absolute top priority right now, but it will need some time to be implemented properly. In the meantime we should make sure parachains can work as smoothly as possible. Parachain teams expect load on Polkadot to be higher than on Kusama, so any problems we are seeing on Kusama might become worse on Polkadot.
Further improvements
With high latency, TCP slow start is a big problem. Requiring several round trips to transfer data will ruin the effective bandwidth, even if nominal bandwidth is plentiful. Until we finally have QUIC support, we might integrate some OS setting detection into the Polkadot binary, which would inform the operator about non-optimal TCP settings and how to fix them.
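A back-of-the-envelope Rust sketch of why round trips dominate. It assumes an idealized slow start (window doubles every round trip, no loss, no receive-window limit), a 1460-byte MSS and an initial window of 10 segments as in RFC 6928; real connections will differ, and the function names are illustrative only.

```rust
/// Round trips needed to deliver `payload_bytes` under idealized TCP slow
/// start: the congestion window starts at `initial_cwnd_segments` and
/// doubles every round trip.
fn slow_start_round_trips(payload_bytes: u64, mss_bytes: u64, initial_cwnd_segments: u64) -> u32 {
    assert!(mss_bytes > 0 && initial_cwnd_segments > 0);
    // Number of segments, rounded up.
    let total_segments = (payload_bytes + mss_bytes - 1) / mss_bytes;
    let mut sent = 0u64;
    let mut cwnd = initial_cwnd_segments;
    let mut rounds = 0u32;
    while sent < total_segments {
        sent = sent.saturating_add(cwnd);
        cwnd = cwnd.saturating_mul(2);
        rounds += 1;
    }
    rounds
}

/// Rough transfer time: one round trip for the TCP handshake plus the
/// slow-start rounds. Bandwidth is assumed to be unlimited.
fn estimated_transfer_ms(payload_bytes: u64, rtt_ms: u64) -> u64 {
    const MSS_BYTES: u64 = 1460; // assumption: typical Ethernet MSS
    const INITIAL_CWND: u64 = 10; // assumption: initial window per RFC 6928
    let rounds = slow_start_round_trips(payload_bytes, MSS_BYTES, INITIAL_CWND) as u64;
    (1 + rounds) * rtt_ms
}
```

Under these assumptions a 1 MiB payload needs 7 slow-start rounds, so at a 200 ms round trip time the estimate is 1600 ms, well past a 1 second timeout regardless of nominal bandwidth. At 20 ms the same transfer is estimated at 160 ms.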
Forks make matters worse, so tackling this might also help, depending on the outcome of this.
Implementation
Apart from collator pre-connect, which is an issue on its own, going forward with this issue, I would suggest the following:
Reduce MAX_UNSHARED_UPLOAD_TIME from 400ms to max 200ms, maybe even just 150ms.

I would expect these measures to already have a big effect. Because requests are answered on a first come, first served basis, low-latency connections have the advantage of being answered first, and those connections will likely also be able to upload the candidate in a very short amount of time ... after 200ms it might already be finished. If only one vote is required, this first fast upload would already suffice for the parachain to make progress; if not, the reduced MAX_UNSHARED_UPLOAD_TIME should definitely help with getting other backing validators to succeed at least. Given that it seems we should really worry about latency the most, I would vouch for a MAX_UNSHARED_UPLOAD_TIME of 150ms.
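The effect of the shorter exclusive upload slot can be sketched in Rust as follows. This is a simplified model, not the collator implementation: it assumes requests are served in arrival order, that the next upload starts as soon as the previous one finishes or its exclusive slot expires, and that parallel uploads do not slow each other down. The function name and the durations in the note below are hypothetical.

```rust
/// Finish time (in ms) of each upload, in request order.
/// `upload_durations_ms[i]` is how long upload `i` takes once started;
/// `max_unshared_upload_ms` is the exclusive slot each upload gets before
/// the next one is started in parallel.
/// (Hypothetical helper, not part of any Polkadot API.)
fn upload_finish_times(upload_durations_ms: &[u64], max_unshared_upload_ms: u64) -> Vec<u64> {
    let mut finish_times = Vec::with_capacity(upload_durations_ms.len());
    let mut start = 0u64;
    for &duration in upload_durations_ms {
        finish_times.push(start + duration);
        // The next upload starts once this one is done or its exclusive
        // slot has expired, whichever comes first.
        start += duration.min(max_unshared_upload_ms);
    }
    finish_times
}
```

With hypothetical upload durations of 200, 700 and 900 ms, a 400 ms slot yields finish times of 200, 900 and 1500 ms, while a 150 ms slot yields 200, 850 and 1200 ms. The fast first upload is unaffected, and the slower validators get their data earlier.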