Unbonding queue #97

Merged on Aug 12, 2024 (15 commits)

jonasW3F
Contributor

We are proposing an unbonding queue for Relay Chain tokens to significantly reduce the expected unbonding time. The queue would still maintain sufficient stake accountability to mitigate the profitability of LRAs.

@burdges

burdges commented Jun 19, 2024

Importantly, soundness slashes happen quickly because no-shows cause full escalation within minutes. It follows that these parameter choices need only address the classical long-range attack scenarios, not the Polkadot-specific soundness slashes. This wasn't mentioned here, but after this change it becomes another major benefit of Polkadot over "optimistic rollups" in terms of UX.

We'd likely change these parameters if we ever adopted an 80% honest assumption for multiple relay chains, but this should be addressed in the future; there is nowhere near enough utilization yet.

@ggwpez
Member

ggwpez commented Jul 8, 2024

@Ank4n @gpestana @kianenigma PTAL


Locking tokens for staking ensures that Polkadot is able to slash tokens backing misbehaving validators. When changing the locking period, we still need to make sure that Polkadot can slash enough tokens to deter misbehaviour. This means that not all tokens can be unbonded immediately; however, we can still allow some tokens to be unbonded quickly.

The new mechanism leads to a significantly reduced unbonding time on average, by queuing up new unbonding requests and scaling their unbonding duration relative to the size of the queue. The unbonding duration of a new request ranges from a minimum of 2 days, when the queue is comparatively empty, to the conventional 28 days, if the sum of requests (in terms of stake) exceeds some threshold. In scenarios between these two bounds, the unbonding duration scales proportionately. The new mechanism will never be worse than the current fixed 28 days.
Contributor

> if the sum of requests (in terms of stake) exceed some threshold.

dq: Which part of your formulas below corresponds to this? I fail to detect it.

Contributor Author

The parameter is called `max_unstake`. Further below it reads:

> We also store a variable, `max_unstake` that tracks how much stake we allow to unbond potentially earlier than 28 eras (28 days on Polkadot and 7 days on Kusama).

The queue scales proportionally between 2 and 28 days (with respect to max_unstake). In case we exceed that value, the unbonding time is capped at 28 days.
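
The scaling described in this reply can be sketched as follows. This is a minimal illustration under stated assumptions, not the RFC's specification: it assumes the duration is the queued stake's share of `max_unstake` multiplied by the 28-day upper bound, clamped to the 2-day minimum. The function name, the `queued_stake` argument, and the use of days instead of blocks are hypothetical choices made for readability.

```python
LOWER_BOUND_DAYS = 2.0   # minimum unbonding time (queue comparatively empty)
UPPER_BOUND_DAYS = 28.0  # conventional unbonding time, used as the cap


def unbonding_days(queued_stake: float, max_unstake: float) -> float:
    """Hypothetical helper: unbonding time for a request at the back of the queue.

    queued_stake is the total stake queued up to and including this request.
    """
    if max_unstake <= 0:
        raise ValueError("max_unstake must be positive")
    # Share of the allowed early-unbonding budget that is already used up.
    fill = max(queued_stake, 0.0) / max_unstake
    # Proportional scaling, clamped between the two bounds.
    return min(UPPER_BOUND_DAYS, max(LOWER_BOUND_DAYS, fill * UPPER_BOUND_DAYS))
```

Under these assumptions, a queue holding half of `max_unstake` yields 14 days, and anything at or above `max_unstake` yields the full 28 days.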

If we detect an LRA of no more than 28 days with the current unbonding period, then we should be able to detect misbehaviour from over 1/3 of validators whose nominators are still bonded. The stake backing these validators is a considerable fraction of the total stake (empirically it is 0.287 or so). If we allowed more than this stake to unbond, without checking who it was backing, then the LRA might be free of cost for an attacker. The proposed mechanism allows up to half this stake to unbond within 28 days. This halves the amount of tokens that can be slashed, but this is still very high in absolute terms. For example, at the time of writing (19.06.2024) this would translate to around 120 million DOT.
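
The arithmetic behind the 120 million figure can be sketched as below. The constant and function names are hypothetical, and the total stake printed at the end is back-calculated from the two numbers quoted in the paragraph (0.287 and 120 million); it is not a sourced figure.

```python
MIN_LRA_STAKE_FRACTION = 0.287  # empirical fraction quoted in the text
UNBONDABLE_SHARE = 0.5          # "up to half this stake"


def max_unstake(total_stake: float) -> float:
    """Stake allowed to unbond within 28 days, under the assumptions above."""
    return total_stake * MIN_LRA_STAKE_FRACTION * UNBONDABLE_SHARE


def implied_total_stake(max_unstake_amount: float) -> float:
    """Invert max_unstake(): total stake consistent with a given cap."""
    return max_unstake_amount / (MIN_LRA_STAKE_FRACTION * UNBONDABLE_SHARE)


# Total stake (in millions of DOT) implied by a cap of 120 million DOT.
print(round(implied_total_stake(120e6) / 1e6))
```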

Attacks other than an LRA, such as backing incorrect parachain blocks, should be detected and slashed within 2 days. This is why the mechanism has a minimum unbonding period.
Contributor

What are the mechanisms/parameters that ensure that these types of attacks/misbehaviours always stay below the 2-day threshold? Could changing some of these params change the threshold to a different number of days?

jonasW3F and others added 4 commits July 15, 2024 09:54
Co-authored-by: Gonçalo Pestana <g6pestana@gmail.com>

Owing to the way exposures (which nominators back which validators, and with how many tokens) are stored, it is hard to check on chain whether a nominator has deferred slashes that still need to be applied to them. So we cannot simply check when a nominator attempts to withdraw their bond.

One option would be to allow any account to point out that an unbonding account had a deferred slash and then the chain would set the `unbonding_block_number` to after the time when the slash would be applied, which will be no more than 28 days from the time the staker unbonded. It is not obvious how to incentivise this, especially in the case that the slash is never applied. Then we would be assuming that in the minimum 2 days unbonding period, not only would any slashable event be caught, but also that someone would post such a transaction cancelling or delaying the unbond until after the slash is applied.
Contributor

It is either this, which could be fine even without incentivization. There is also a liveness risk in such cases, so the transaction must be Operational, ensuring an attacker cannot prevent it by filling the chain with remarks for 28 days; doing that is possibly not very expensive.

Alternatively, the nominators can be migrated (via on_idle or a task) to store their pending slash. In the few blocks where this on_idle is still in progress we disallow any fast unbonds. In honest cases, this is a negligible degradation. In attack cases, at worst, everyone is forced to unbond with 28 days.

Contributor

An even simpler solution would be to clear the unbonding queue and disable the fast unbonding temporarily if there are any pending slashes in the system, and skip calculating the nominators affected by the slash on-idle altogether.

> In honest cases, this is a negligible degradation. In attack cases, at worst, everyone is forced to unbond with 28 days.

This remains true.
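
The simpler option suggested in this comment can be sketched as a guard around the fast path. This is an illustration only: the class, its methods, and the slash counter are hypothetical and do not correspond to any existing pallet API, and it models only the temporary disabling of fast unbonding, not the clearing of requests already in the queue.

```python
from dataclasses import dataclass

UPPER_BOUND_DAYS = 28.0  # conventional unbonding time


@dataclass
class FastUnbondingGuard:
    """Hypothetical guard: fast unbonding is off while any slash is pending."""

    pending_slashes: int = 0

    def slash_reported(self) -> None:
        self.pending_slashes += 1

    def slash_resolved(self) -> None:
        # A slash is resolved once it has been applied or cancelled.
        self.pending_slashes = max(self.pending_slashes - 1, 0)

    def unbonding_days(self, fast_days: float) -> float:
        # With any pending slash in the system, everyone waits the full period.
        if self.pending_slashes > 0:
            return UPPER_BOUND_DAYS
        return min(fast_days, UPPER_BOUND_DAYS)
```

The design avoids working out which nominators a slash affects: honest users see a short-lived fallback to 28 days, which is the worst case already stated in the quoted sentence.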

@bhargavbh

I would like to point out effects of this RFC on the security of trust-less bridges using Random Sampling (e.g. Snowbridge between Polkadot<>Ethereum).

TLDR:
With the possibility of faster unbonding, we need to make a stronger assumption on the acceptable downtime of honest relayers for the bridge's safety properties. Earlier, it was sufficient if 1 honest relayer relayed a valid block from Polkadot -> Ethereum within 28 days, which would now reduce to 2 days. We feel this is still an acceptable period to spin up a new relayer if all the existing relayers are down.

Reasoning
The bridge's security relies on a crypto-economic argument where any validator who BEEFY-signs an un-finalised block gets slashed. Hence, one of the crucial security parameters from a bridge security perspective is "how quickly can the stake of the least-backed validator be un-bonded via the queue". Currently, this security parameter is 28 days, but the RFC changes it to 2 days. If all the relayers are down, then a malicious validator can immediately start unbonding, and once his stake has cleared through the queue (2 days), he can attempt relaying un-finalised blocks without any repercussions.

@jonasW3F
Contributor Author

> I would like to point out effects of this RFC on the security of trust-less bridges using Random Sampling (e.g. Snowbridge between Polkadot<>Ethereum).
>
> TLDR: With the possibility of faster unbonding, we need to make a stronger assumption on the acceptable downtime of honest relayers for the bridge's safety properties. Earlier, it was sufficient if 1 honest relayer relayed a valid block from Polkadot -> Ethereum within 28 days, which would now reduce to 2 days. We feel this is still an acceptable period to spin up a new relayer if all the existing relayers are down.
>
> Reasoning: The bridge's security relies on a crypto-economic argument where any validator who BEEFY-signs an un-finalised block gets slashed. Hence, one of the crucial security parameters from a bridge security perspective is "how quickly can the stake of the least-backed validator be un-bonded via the queue". Currently, this security parameter is 28 days, but the RFC changes it to 2 days. If all the relayers are down, then a malicious validator can immediately start unbonding, and once his stake has cleared through the queue (2 days), he can attempt relaying un-finalised blocks without any repercussions.

Assuming we can relay a valid block from Polkadot to Ethereum within two days is reasonable. Without this block, the bridge would also not be live, causing significant disruptions to the user experience. Furthermore, the stated goal of decentralizing the relayer set greatly reduces the probability of the bridge being non-live for two days.

To ensure reliability, having a backup relayer is crucial. This relayer would not necessarily handle costly transactions between Polkadot and Ethereum but would monitor the main relayer's liveness, stepping in if the main relayer goes offline. Establishing and maintaining this safety net should be a priority. For instance, through initiatives like the Infrastructure Builders Program (Bounty), we have the tools to set up and support this security measure effectively.

@AlistairStewart

2 days is more than enough for detecting and reporting misbehaviour by validators; for that, 1 hour is enough. The question is whether it is enough for things that might require manually running a relayer or bot.

For the bridge, it's ok if all relaying stops; the problem is about not reporting things to Polkadot, and only if we have only malicious relayers running. We should see an attack happening on chain, so this just means that someone should be monitoring the bridged chain, e.g. Ethereum, and have a bot that reports misbehaviour, or one that can be run with 2 days notice.

The situation with deferred slashes is similar. We would need someone to be running a bot to detect and report people with deferred slashes unbonding.

And these should be in a state where anyone can spin them up if we have to do so in 2 days, even on a weekend. This is certainly possible.

@jonasW3F
Contributor Author

> 2 days is more than enough for detecting and reporting misbehaviour by validators; for that, 1 hour is enough. The question is whether it is enough for things that might require manually running a relayer or bot.
>
> For the bridge, it's ok if all relaying stops; the problem is about not reporting things to Polkadot, and only if we have only malicious relayers running. We should see an attack happening on chain, so this just means that someone should be monitoring the bridged chain, e.g. Ethereum, and have a bot that reports misbehaviour, or one that can be run with 2 days notice.
>
> The situation with deferred slashes is similar. We would need someone to be running a bot to detect and report people with deferred slashes unbonding.
>
> And these should be in a state where anyone can spin them up if we have to do so in 2 days, even on a weekend. This is certainly possible.

It sounds like running, maintaining and incentivizing these bots falls squarely within the scope of the Infrastructure Builders Program (@tugytur).

@Tomen

Tomen commented Jul 19, 2024

This PR has id 97. Therefore this should be RFC-97 (NOT 92)

@rzadp
Contributor

rzadp commented Jul 19, 2024

> This PR has id 97. Therefore this should be RFC-97 (NOT 92)

The RFC numbers are not taken from PR ids, because not every PR is an RFC PR, so it would leave unnecessary gaps.

@ggwpez
Member

ggwpez commented Jul 19, 2024

> The RFC numbers are not taken from PR ids, because not every PR is an RFC PR, so it would leave unnecessary gaps.

They are, as per README:

> Rename the file to correspond to the GitHub pull request number and update the "RFC PR" field in the file.

@jonasW3F
Contributor Author

> This PR has id 97. Therefore this should be RFC-97 (NOT 92)
>
> The RFC numbers are not taken from PR ids, because not every PR is an RFC PR, so it would leave unnecessary gaps.

That might have been my thinking at the time. I changed the files. When does the site get rebuilt? It hasn't updated for a while; just checking that there is no issue with caching.

@rzadp
Contributor

rzadp commented Jul 19, 2024

Right, my mistake - I remembered it wrong. It's just that this part of the process is not automatically enforced at any point.
Would a little CI check be helpful, to remind authors if the number is wrong?

@jonasW3F The site gets rebuilt nightly.

@bhargavbh

a recent paper on optimizing unbonding queues for PoS blockchains is very relevant work for this RFC. It kind of formally argues why the FCFS mechanism is optimal when the value of withdrawal for agents is homogeneous, backing the ideas presented in this RFC.
@jonasW3F

@ggwpez
Member

ggwpez commented Jul 29, 2024

/rfc propose

Going to re-propose later.

Comment on lines +46 to +47
Attacks other than an LRA, such as backing incorrect parachain blocks, should be detected and slashed within 2 days. This is why the mechanism has a minimum unbonding period.

Contributor

@ordian ordian Jul 29, 2024

We allow slashing for disputes up to `dispute_period` past eras, which is currently configured at 6 (days) for Polkadot. I agree that such attacks should be detected within 2 days, but we might need to adjust these params for Polkadot before enabling this feature, and somehow ensure this value stays aligned with `LOWER_BOUND` going forward.

We can observe that historical unbonds only trigger an unbonding time larger than `LOWER_BOUND` in situations with extensive and/or clustered unbonding amounts. The average unbonding time across the whole timeseries is ~2.67 days. We can, however, see it taking effect pushing unbonding times up during large unbonding events. In the largest events, we hit a maximum of 28 days. This gives us reassurance that it is sufficiently sensitive and it makes sense to match the `UPPER_BOUND` with the historically largest unbonds.

The main parameter affecting the situation is the `max_unstake`. The relationship is obvious: decreasing the `max_unstake` makes the queue more sensitive, i.e., having it spike more quickly and higher with unbonding events. Given that these events historically were mostly associated with parachain auctions, we can assume that, in the absence of major systemic events, users will experience drastically reduced unbonding times.
The analysis can be reproduced or changed to other parameters using [this repository](https://github.com/jonasW3F/unbonding_queue_analysis).

This comment was marked as resolved.

Contributor Author

public now


### The Unresolved Question: Deferred slashing

Currently we defer applying many slashes until 28 days have passed. This was implemented so we can conveniently cancel slashes via governance in the case that the slashing was due to a bug. While rare on Polkadot, such bugs cause a significant fraction of slashes. This includes slashing for attacks other than LRAs for which we've assumed that 2 days is enough to slash. But 2 days is not enough to cancel slashes via OpenGov.
Contributor

https://github.com/polkadot-fellows/runtimes/blob/7c69345c75910006d56e60e2f9a93c9d0f44f280/relay/polkadot/src/lib.rs#L654

Suggested change
- Currently we defer applying many slashes until 28 days have passed. This was implemented so we can conveniently cancel slashes via governance in the case that the slashing was due to a bug. While rare on Polkadot, such bugs cause a significant fraction of slashes. This includes slashing for attacks other than LRAs for which we've assumed that 2 days is enough to slash. But 2 days is not enough to cancel slashes via OpenGov.
+ Currently we defer applying many slashes until 27 days have passed. This was implemented so we can conveniently cancel slashes via governance in the case that the slashing was due to a bug. While rare on Polkadot, such bugs cause a significant fraction of slashes. This includes slashing for attacks other than LRAs for which we've assumed that 2 days is enough to slash. But 2 days is not enough to cancel slashes via OpenGov.

Contributor Author

@jonasW3F jonasW3F Jul 30, 2024

Right now there is a slash that happened in era 1498 which, polkadot.js apps says, will be applied in era 1526 (28 eras later). Not sure where the discrepancy comes from, but since we are talking about days here anyway, I'll just say that it takes around 28 days, which should be fine enough.

Co-authored-by: ordian <write@reusable.software>
@polkadot-fellows polkadot-fellows deleted a comment from github-actions bot Jul 30, 2024
@polkadot-fellows polkadot-fellows deleted a comment from paritytech-rfc-bot bot Jul 30, 2024
@ggwpez ggwpez requested review from ordian and xlc July 31, 2024 12:59
@ggwpez
Member

ggwpez commented Aug 4, 2024

@xlc any more comments? Going to propose soon.

@ggwpez
Member

ggwpez commented Aug 6, 2024

/rfc propose

@paritytech-rfc-bot
Contributor

Hey @ggwpez, here is a link you can use to create the referendum aiming to approve this RFC number 0097.

Instructions
  1. Open the link.
  2. Switch to the Submission tab.
  3. Adjust the transaction if needed (for example, the proposal Origin).
  4. Submit the Transaction.

It is based on commit hash 4a4cd694350e6b74af93e34f12cdd21ad7a13981.

The proposed remark text is: RFC_APPROVE(0097,ebf77632bee876d6028cca326a069172836a3a2eee43c5a829a06b3f13fb2d8a).

github-actions bot commented Aug 6, 2024

Voting for this referendum is ongoing.

Vote for it here


If we detect an LRA of no more than 28 days with the current unbonding period, then we should be able to detect misbehaviour from over 1/3 of validators whose nominators are still bonded. The stake backing these validators is a considerable fraction of the total stake (empirically it is 0.287 or so). If we allowed more than this stake to unbond, without checking who it was backing, then the LRA might be free of cost for an attacker. The proposed mechanism allows up to half this stake to unbond within 28 days. This halves the amount of tokens that can be slashed, but this is still very high in absolute terms. For example, at the time of writing (19.06.2024) this would translate to around 120 million DOT.

Attacks other than an LRA, such as backing incorrect parachain blocks, should be detected and slashed within 2 days. This is why the mechanism has a minimum unbonding period.

Ok, I am not too familiar with staking internals, but I will assume that if you unbond, your stake will always remain at least for the duration of the current era. Hypothetically, on a network with eras longer than 2 days, it could happen that your stake is securing the network until the end of the era and gets unbonded the very next block. Would causing an offense at the end of an era then be risk free?

If that is true, we should bind the unbonding period to multiples of an era, e.g. 2, instead of a fixed 2 days. Also it would be good to measure that period in number of blocks, because then a potential DoS attack would be pointless as it would also increase the unbonding time.

Contributor Author

The 2 days here is used to make it easier for the reader, putting durations into a more familiar context. The queue itself, however, is specified in blocks in the chapter "Mechanism". There, `MAX_DURATION` is equal to 403200 blocks and `MIN_DURATION` is equal to 28800 blocks. If the queue is implemented on other networks with different era durations, the parameters can easily be adjusted.
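
As a quick check of how these block counts map to the day figures used in the discussion, here is a small conversion sketch. It assumes a 6-second block time; the block-time constant and the helper name are assumptions for illustration and are not taken from the RFC text quoted here.

```python
BLOCK_TIME_SECONDS = 6            # assumption: target block time in seconds
SECONDS_PER_DAY = 24 * 60 * 60

MIN_DURATION = 28_800             # blocks, as stated in the reply
MAX_DURATION = 403_200            # blocks, as stated in the reply


def blocks_to_days(blocks: int) -> float:
    """Convert a block count to days under the assumed block time."""
    return blocks * BLOCK_TIME_SECONDS / SECONDS_PER_DAY


print(blocks_to_days(MIN_DURATION), blocks_to_days(MAX_DURATION))
```

A network with a different block time or era length would need different block counts to reach the same durations, which is the adjustment the reply refers to.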

Perfect on using blocks! For the relationship to eras: Would be good if that relationship to era duration would be at least documented (not necessarily in this RFC as it is already up for vote), but for sure in the implementation.

Contributor Author

Makes sense, I'll make sure to add it to the documentation for the implementation.

PR can be merged.

Write the following command to trigger the bot

/rfc process 0xafc7e5a97a974592fd714094c7eb1b23c0f25d43c27b66e639bffebeffb283f9

@ggwpez
Member

ggwpez commented Aug 12, 2024

/rfc process 0xafc7e5a97a974592fd714094c7eb1b23c0f25d43c27b66e639bffebeffb283f9

@paritytech-rfc-bot paritytech-rfc-bot bot merged commit f6ec2f8 into polkadot-fellows:main Aug 12, 2024
@paritytech-rfc-bot
Contributor

The on-chain referendum has approved the RFC.

liuchengxu pushed a commit to liuchengxu/RFCs that referenced this pull request Aug 14, 2024
@kianenigma
Contributor

Something that I realized a bit late here is how we can support this for pools. Any existing thoughts?

cc @Ank4n @gpestana.

@Ank4n

Ank4n commented Aug 28, 2024

> Something that I realized a bit late here is how we can support this for pools. Any existing thoughts?
>
> cc @Ank4n @gpestana.

We currently track the unlocking of funds for both pool members (SubPools) and staking accounts (UnlockChunks) based on the predetermined era when these funds are set to be unlocked. To align with the RFC, we will probably shift to tracking SubPools and UnlockChunks by the era or block at which the unbonding request is made. At a high level, once the entire UnlockChunk for a specific era is unlocked, the corresponding funds from the SubPool of that era can also be withdrawn. Thoughts?

@anaelleltd anaelleltd added the Approved Has passed on-chain voting. label Sep 10, 2024
@anaelleltd anaelleltd added Implementing Is actively being worked on. and removed Approved Has passed on-chain voting. labels Sep 23, 2024