Unbonding queue #97
Importantly, soundness slashes happen quickly because no-shows cause full escalation within minutes. It follows that these parameter choices need only address the classical long-range attack scenarios, not the Polkadot-specific soundness slashes. This wasn't mentioned here, but after this it becomes another major benefit of Polkadot over "optimistic rollups" in terms of UX. We'd likely change these parameters if we ever adopted an 80% honest assumption for multiple relay chains, but this should be addressed in the future; there is nowhere near enough utilization yet. |
@Ank4n @gpestana @kianenigma PTAL |
text/0092-unbonding_queue.md
Outdated
|
||
Locking tokens for staking ensures that Polkadot is able to slash tokens backing misbehaving validators. With changing the locking period, we still need to make sure that Polkadot can slash enough tokens to deter misbehaviour. This means that not all tokens can be unbonded immediately, however we can still allow some tokens to be unbonded quickly. | ||
|
||
The new mechanism leads to a significantly reduced unbonding time on average, by queuing up new unbonding requests and scaling their unbonding duration relative to the size of the queue. New requests are executed with a minimum of 2 days, when the queue is comparatively empty, up to the conventional 28 days, if the sum of requests (in terms of stake) exceeds some threshold. In scenarios between these two bounds, the unbonding duration scales proportionately. The new mechanism will never be worse than the current fixed 28 days. |
if the sum of requests (in terms of stake) exceeds some threshold.
dq: Which part of your formulas below corresponds to this? I fail to detect it.
The parameter is called `max_unstake`. Further below it reads:
We also store a variable, `max_unstake`, that tracks how much stake we allow to unbond potentially earlier than 28 eras (28 days on Polkadot and 7 days on Kusama).
The queue scales proportionally between 2 and 28 days (with respect to `max_unstake`). In case we exceed that value, the unbonding time is capped at 28 days.
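To make the role of `max_unstake` concrete, here is a minimal sketch of a queue that scales linearly between the two bounds. It is an illustration under assumptions, not the RFC's reference implementation: the class name, the `back_of_queue` field, and the exact update rule are hypothetical, and the authoritative formulas are the ones in the RFC's "Mechanism" chapter. The block counts are the `MIN_DURATION` and `MAX_DURATION` values quoted elsewhere in this thread.

```python
from dataclasses import dataclass

MIN_DURATION = 28_800    # blocks, the lower bound (about 2 days)
MAX_DURATION = 403_200   # blocks, the upper bound (about 28 days)


@dataclass
class UnbondingQueue:
    """Hypothetical model of the queue; names are illustrative only."""
    max_unstake: int        # stake allowed to unbond faster than MAX_DURATION
    back_of_queue: int = 0  # block at which the queue is expected to drain

    def request_unbond(self, stake: int, now: int) -> int:
        """Return the block at which `stake` becomes withdrawable."""
        # Assumption: each request extends the queue in proportion to
        # its share of max_unstake.
        delta = stake * MAX_DURATION // self.max_unstake
        self.back_of_queue = max(now, self.back_of_queue) + delta
        wait = self.back_of_queue - now
        # Clamp the wait between the lower and upper bound.
        return now + min(MAX_DURATION, max(MIN_DURATION, wait))


queue = UnbondingQueue(max_unstake=1_000_000)
small = queue.request_unbond(1_000, now=0)      # nearly empty queue
medium = queue.request_unbond(500_000, now=0)   # between the bounds
large = queue.request_unbond(2_000_000, now=0)  # exceeds max_unstake
```

In this sketch the small request waits the minimum, the medium one waits in proportion to the stake already queued, and the request that pushes the queue past `max_unstake` is capped at the maximum.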
text/0092-unbonding_queue.md
Outdated
|
||
If we detect an LRA of no more than 28 days with the current unbonding period, then we should be able to detect misbehaviour from over 1/3 of validators whose nominators are still bonded. The stake backing these validators is a considerable fraction of the total stake (empirically it is 0.287 or so). If we allowed more than this stake to unbond, without checking who it was backing, then the LRA attack might be free of cost for an attacker. The proposed mechanism allows up to half this stake to unbond within 28 days. This halves the amount of tokens that can be slashed, but this is still very high in absolute terms. For example, at the time of writing (19.06.2024) this would translate to around 120 million DOT. | ||
|
||
Attacks other than an LRA, such as backing incorrect parachain blocks, should be detected and slashed within 2 days. This is why the mechanism has a minimum unbonding period. |
What are the mechanisms/parameters that ensure that these types of attacks/misbehaviours are always below the 2 day threshold? Could changing some of these params change the threshold to other number of days?
see this comment
Co-authored-by: Gonçalo Pestana <g6pestana@gmail.com>
text/0092-unbonding_queue.md
Outdated
|
||
Owing to the way exposures, which nominators back validators with how many tokens, are stored, it is hard to search for whether a nominator has deferred slashes that need to be applied to them on chain. So we cannot simply check when a nominator attempts to withdraw their bond. | ||
|
||
One option would be to allow any account to point out that an unbonding account had a deferred slash and then the chain would set the `unbonding_block_number` to after the time when the slash would be applied, which will be no more than 28 days from the time the staker unbonded. It is not obvious how to incentivise this, especially in the case that the slash is never applied. Then we would be assuming that in the minimum 2 days unbonding period, not only would any slashable event be caught, but also that someone would post such a transaction cancelling or delaying the unbond until after the slash is applied. |
It is either this, which could be fine without incentivization as well. There's also a liveness risk in such cases, so the transaction must be `Operational`, ensuring an attacker cannot prevent it by filling the chain with remarks for 28 days; doing that is possibly not super expensive.
Alternatively, the nominators can be migrated to store their pending slash via `on_idle`/`task`. In the few blocks where this `on_idle` is still in progress, we disallow any fast unbonds. In honest cases, this is a negligible degradation. In attack cases, at worst, everyone is forced to unbond with 28 days.
An even simpler solution would be to clear the unbonding queue and disable fast unbonding temporarily if there are any pending slashes in the system, and skip calculating the nominators affected by the slash `on_idle` altogether.
In honest cases, this is a negligible degradation. In attack cases, at worst, everyone is forced to unbond with 28 days.
This remains true.
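The fallback suggested above can be sketched as a single guard. This is a hypothetical illustration: the function name and the `pending_slashes` counter are assumptions, and it only models the "disable fast unbonding" part, not how the chain would clear the queue or track slashes.

```python
MIN_DURATION = 28_800    # blocks, about 2 days
MAX_DURATION = 403_200   # blocks, about 28 days


def unbond_wait(queue_wait: int, pending_slashes: int) -> int:
    """Blocks a new unbonding request must wait (illustrative only)."""
    if pending_slashes > 0:
        # Assumption: while any slash is pending, fast unbonding is
        # disabled and every request waits the full period.
        return MAX_DURATION
    # Otherwise the queue-derived wait applies, clamped to the bounds.
    return min(MAX_DURATION, max(MIN_DURATION, queue_wait))
```

With no pending slashes the wait is whatever the queue yields within the bounds; with at least one pending slash every request falls back to the full 28 days, which matches the worst case described above.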
I would like to point out effects of this RFC on the security of trust-less bridges using Random Sampling (e.g. Snowbridge between Polkadot<>Ethereum). TLDR: Reasoning |
Assuming we can relay a valid block from Polkadot to Ethereum within two days is reasonable. Without this block, the bridge would also not be live, causing significant disruptions to the user experience. Furthermore, the stated goal of decentralizing the relayer set greatly reduces the probability of the bridge being non-live for two days. To ensure reliability, having a backup relayer is crucial. This relayer would not necessarily handle costly transactions between Polkadot and Ethereum but would monitor the main relayer's liveness, stepping in if the main relayer goes offline. Establishing and maintaining this safety net should be a priority. For instance, through initiatives like the Infrastructure Builders Program (Bounty), we have the tools to set up and support this security measure effectively. |
2 days is more than enough for detecting and reporting misbehaviour by validators; for that, 1 hour is enough. The question is whether it is enough for things that might require manually running a relayer or bot. For the bridge, it's ok if all relaying stops; the problem is about not reporting things to Polkadot, and only if we have only malicious relayers running. We should see an attack happening on chain, so this just means that someone should be monitoring the bridged chain, e.g. Ethereum, and have a bot that reports misbehaviour, or one that can be run with 2 days' notice. The situation with deferred slashes is similar. We would need someone to be running a bot to detect and report people with deferred slashes unbonding. And these should be in a state where anyone can spin them up if we have to do so in 2 days, even on a weekend. This is certainly possible. |
It sounds like running, maintaining and incentivizing these bots falls perfectly into the scope of the Infrastructure Builders Program (@tugytur) |
This PR has id 97. Therefore this should be RFC-97 (NOT 92) |
The RFC numbers are not taken from PR ids, because not every PR is an RFC PR, so it would leave unnecessary gaps. |
They are, as per README:
|
That might have been my thinking at the time. I changed the files. When does the site get rebuilt? It hasn't updated for a while; just checking that there is no issue with caching. |
Right, my mistake - I remembered it wrong. It's just that this part of the process is not automatically enforced at any point. @jonasW3F The site gets rebuilt nightly. |
a recent paper on optimizing unbonding queues for PoS blockchains is very relevant work for this RFC. It kind of formally argues why the FCFS mechanism is optimal when the value of withdrawal for agents is homogeneous, backing the ideas presented in this RFC. |
Going to re-propose later. |
Attacks other than an LRA, such as backing incorrect parachain blocks, should be detected and slashed within 2 days. This is why the mechanism has a minimum unbonding period. | ||
|
We allow slashing for disputes up to `dispute_period` past eras, which is configured at 6 (days) for Polkadot at the moment. I agree that such attacks should be detected within 2 days, but we might need to adjust these params for Polkadot prior to enabling this feature and somehow ensure this value stays aligned with `LOWER_BOUND` going forward.
We can observe that historical unbonds only trigger an unbonding time larger than `LOWER_BOUND` in situations with extensive and/or clustered unbonding amounts. The average unbonding time across the whole timeseries is ~2.67 days. We can, however, see it taking effect pushing unbonding times up during large unbonding events. In the largest events, we hit a maximum of 28 days. This gives us reassurance that it is sufficiently sensitive and it makes sense to match the `UPPER_BOUND` with the historically largest unbonds. | ||
|
||
The main parameter affecting the situation is the `max_unstake`. The relationship is obvious: decreasing the `max_unstake` makes the queue more sensitive, i.e., having it spike more quickly and higher with unbonding events. Given that these events historically were mostly associated with parachain auctions, we can assume that, in the absence of major systemic events, users will experience drastically reduced unbonding times. | ||
The analysis can be reproduced or changed to other parameters using [this repository](https://github.com/jonasW3F/unbonding_queue_analysis). |
public now
text/0097-unbonding_queue.md
Outdated
|
||
### The Unresolved Question: Deferred slashing | ||
|
||
Currently we defer applying many slashes until 28 days have passed. This was implemented so we can conveniently cancel slashes via governance in the case that the slashing was due to a bug. While rare on Polkadot, such bugs cause a significant fraction of slashes. This includes slashing for attacks other than LRAs for which we've assumed that 2 days is enough to slash. But 2 days is not enough to cancel slashes via OpenGov. |
Currently we defer applying many slashes until 28 days have passed. This was implemented so we can conveniently cancel slashes via governance in the case that the slashing was due to a bug. While rare on Polkadot, such bugs cause a significant fraction of slashes. This includes slashing for attacks other than LRAs for which we've assumed that 2 days is enough to slash. But 2 days is not enough to cancel slashes via OpenGov. | |
Currently we defer applying many slashes until 27 days have passed. This was implemented so we can conveniently cancel slashes via governance in the case that the slashing was due to a bug. While rare on Polkadot, such bugs cause a significant fraction of slashes. This includes slashing for attacks other than LRAs for which we've assumed that 2 days is enough to slash. But 2 days is not enough to cancel slashes via OpenGov. |
Right now there is a slash that happened in era 1498 which, polkadot.js apps says, will be applied in era 1526 (28 eras later). Not sure where the discrepancy comes from, but since we are talking about days here anyway, I'll just say that it takes around 28 days, which should be fine enough.
Co-authored-by: ordian <write@reusable.software>
@xlc any more comments? Going to propose soon. |
/rfc propose |
Hey @ggwpez, here is a link you can use to create the referendum aiming to approve this RFC number 0097. Instructions
It is based on commit hash 4a4cd694350e6b74af93e34f12cdd21ad7a13981. The proposed remark text is: |
Voting for this referendum is ongoing. Vote for it here |
|
||
If we detect an LRA of no more than 28 days with the current unbonding period, then we should be able to detect misbehaviour from over 1/3 of validators whose nominators are still bonded. The stake backing these validators is a considerable fraction of the total stake (empirically it is 0.287 or so). If we allowed more than this stake to unbond, without checking who it was backing, then the LRA attack might be free of cost for an attacker. The proposed mechanism allows up to half this stake to unbond within 28 days. This halves the amount of tokens that can be slashed, but this is still very high in absolute terms. For example, at the time of writing (19.06.2024) this would translate to around 120 million DOT. | ||
|
||
Attacks other than an LRA, such as backing incorrect parachain blocks, should be detected and slashed within 2 days. This is why the mechanism has a minimum unbonding period. |
Ok, I am not too familiar with staking internals, but I will assume that if you unbond, your stake will always remain at least for the duration of the current era. Hypothetically, on a network with eras longer than 2 days, it could happen that your stake is securing the network until the end of the era and gets unbonded the very next block. Thus, causing an offense at the end of an era would be risk-free?
If that is true, we should bind the unbonding period to multiples of an era, e.g. 2, instead of a fixed 2 days. Also it would be good to measure that period in number of blocks, because then a potential DoS attack would be pointless as it would also increase the unbonding time.
The 2 days here is used to make it easier for the reader by putting durations into a more familiar context. The queue itself, however, is specified in blocks in the chapter "Mechanism". Here, `MAX_DURATION` is equal to 403200 blocks and `MIN_DURATION` is equal to 28800. If the queue is implemented on other networks with different era durations, the parameters can easily be adjusted.
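As a quick check of how these block counts map to the 2 and 28 days used in the prose, the conversion can be sketched as follows. It assumes a 6-second target block time for the Polkadot relay chain; the helper name `blocks_to_days` is made up for illustration.

```python
SECONDS_PER_BLOCK = 6            # assumption: relay chain target block time
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400 seconds

MIN_DURATION = 28_800    # blocks, as quoted above
MAX_DURATION = 403_200   # blocks, as quoted above


def blocks_to_days(blocks: int) -> float:
    """Convert a block count to days at the assumed block time."""
    return blocks * SECONDS_PER_BLOCK / SECONDS_PER_DAY


print(blocks_to_days(MIN_DURATION))  # 2.0
print(blocks_to_days(MAX_DURATION))  # 28.0
```

A network with a different block time or era length would need both constants rescaled accordingly.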
Perfect on using blocks! For the relationship to eras: Would be good if that relationship to era duration would be at least documented (not necessarily in this RFC as it is already up for vote), but for sure in the implementation.
Makes sense, I'll make sure to add it to the documentation for the implementation.
PR can be merged. Write the following command to trigger the bot
|
/rfc process 0xafc7e5a97a974592fd714094c7eb1b23c0f25d43c27b66e639bffebeffb283f9 |
The on-chain referendum has approved the RFC. |
We are proposing an unbonding queue for Relay Chain tokens to significantly reduce the expected unbonding time. The queue would still maintain sufficient stake accountability to mitigate the profitability of LRAs. --------- Co-authored-by: Gonçalo Pestana <g6pestana@gmail.com> Co-authored-by: ordian <write@reusable.software>
We currently track the unlocking of funds for both pool members (SubPools) and staking accounts (UnlockChunks) based on the predetermined era when these funds are set to be unlocked. To align with the RFC, we will probably shift to tracking SubPools and UnlockChunks by the era or block at which the unbonding request is made. At a high level, once the entire UnlockChunk for a specific era is unlocked, the corresponding funds from the SubPool of that era can also be withdrawn. Thoughts? |