Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

epoching: copy/paste necessary code from staking/evidence/slashing and make modifications #28

Merged
merged 17 commits into from
Jun 29, 2022

Conversation

SebastianElvis
Copy link
Member

Fixes BM-35.

This PR introduces code from Cosmos SDK's staking/evidence/slashing modules necessary for BBL, and makes our own modifications proposed in https://babylon-chain.atlassian.net/wiki/spaces/BABYLON/pages/13107673/Delay+Mechanism+for+Epochs+and+Unbonding and https://babylon-chain.atlassian.net/wiki/spaces/BABYLON/pages/14745602/Four+proposals+on+handling+slashed+validators.

Introduced code and our modifications:

@SebastianElvis
Copy link
Member Author

There are two blockers for implementing the proposed modifications at the moment. Not sure if we should fix them in this PR, which already introduces many logics.

  1. In Cosmos SDK's staking module, DequeueAllMatureUBDQueue and DequeueAllMatureRedelegationQueue always uses the last block's time ctx.BlockHeader().Time rather than the currTime parameter. Consequently, whenever we invoke them, they unbond all validators/delegations in the queue, rather than those in a certain epoch.
  2. In Cosmos SDK's slasing module, HandleValidatorSignature invokes a private function clearValidatorMissedBlockBitArray(ctx, consAddr). We cannot invoke this private function outside this Cosmos SDK module. At the moment I commented this line. Not sure if it affects security or not.

@aakoshh
Copy link
Contributor

aakoshh commented Jun 24, 2022

Consequently, whenever we invoke them, they unbond all validators/delegations in the queue, rather than those in a certain epoch.

This is why I suggested using withBlockHeader on the Context to temporarily override the block to be the one which got confirmed, and then restore it to where we're at.

Copy link
Contributor

@aakoshh aakoshh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, easy to see where things came from 👍

Do we really need to copy paste slashing and evidence handling? If we want those to be processed immediately, why not just let nature take its course?

app/app.go Outdated
Comment on lines 385 to 384
crisistypes.ModuleName, govtypes.ModuleName, stakingtypes.ModuleName,
crisistypes.ModuleName, govtypes.ModuleName,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect this will not work, because SetOrderEndBlockers will panic if you forget to include a module that is otherwise registered in the app. The way I see it is that you can call this first to get the benefit of this check, then remove the staking module from OrderEndBlockers, which is thankfully public.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! Didn't know they have an assert there... will fix

app/app.go Outdated
upgradetypes.ModuleName, capabilitytypes.ModuleName, minttypes.ModuleName, distrtypes.ModuleName, slashingtypes.ModuleName,
evidencetypes.ModuleName, stakingtypes.ModuleName,
upgradetypes.ModuleName, capabilitytypes.ModuleName, minttypes.ModuleName, distrtypes.ModuleName, stakingtypes.ModuleName,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both evidence and slashing are removed, but staking is kept? Can you explain what's the idea?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Staking's BeginBlock processing only includes TrackHistoricalInfo (see https://github.com/cosmos/cosmos-sdk/blob/v0.45.5/x/staking/abci.go), which basically does some cleanup job. Keeping it seems to be not harmful.

defer telemetry.ModuleMeasureSince(stakingtypes.ModuleName, time.Now(), telemetry.MetricKeyEndBlocker)

// TODO: unimplemented:
// - if an epoch is newly checkpointed, make unbonding validators/delegations in this epoch unbonded
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I [put this in the diagram to happen when the BTC checkpoint is confirmed, mostly because this is the easiest in terms of book-keeping: I can clearly see when a checkpoint is going confirmed the first time. To do it at the end would be more difficult, need to capture before and after.

I also thought it's not actually the epoch's job to know about checkpointing, and that release is related to checkpointing, so it should be in the checkpointing module.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was meaning that the epoching module gets the epoch's chechpoint status from the checkpointing module, rather than judging the status by itself. Is my understanding correct and consistent with yours?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but I mean I'm not sure that the epoching module even needs a notion of status. To put it differently, it doesn't have to know that it's being checkpointed. It knows that it has ended, and it can tell other modules what the last block was, but those dependant modules can figure out on their own what they need to do. It just so happens that the checkpointing module needs the last commit hash for its own purposes, and it will use the checkpointing status as a requirement to release unbonding tokens. But the epoching can be oblivious to that.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It just so happens that the checkpointing module needs the last commit hash for its own purposes, and it will use the checkpointing status as a requirement to release unbonding tokens. But the epoching can be oblivious to that.

Checkpoint-assisted unbonding is a part of the epoching module rather than checkpointing, right?

One of the main reasons that checkpoint-assisted unbonding fits the epoching module better is that, the checkpoint-assisted unbonding depends on the staking module (especially the "mature" queues), maintained by the epoching module.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we agreed with @fishermanymc that the release of unbonding tokens doesn't need to be tied to the epoch.

You can imagine using just epochs, with 21 days unbonding, it's fine. But because we are doing checkpointing, we can release tokens earlier, as soon as the checkpoints are confirmed. Therefore it's associated with checkpoints, not epochs. It happens in the middle of the next epoch, depends on BTC, not the next epoch boundary.

Epochs manage the delayed staking indeed, but they only manage the inputs. The maturing queues are managed by the staking module itself, it doesn't need further input from the epochs. It just happens to be that we can use the epochs as a bucket for the confirmation height and the time step function - instead of maturing in 21 days, we take manual control of setting the height which we consider mature. But again, this doesn't depend on any active participation from the epoching module.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about we do the following?

  • The epoching module maintains the modified version of staking (e.g., the ApplyMatureUnbonding thing)
  • Whenever an epoch is checkpointed, the checkpointing module invokes epochingKeeper.ApplyMatureUnbonding in order to unbond mature validators/delegations?

This way each module does its own thing, and we only have a dependency from checkpointing to epoching (instead of being the other way where epoching hooks to the EpochConfirmed hook in the checkpointing module).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @SebastianElvis and @aakoshh . The BTC/Babylon-specific bond release condition is that all the epochs up to the epoch where the unbonding request is submitted are confirmed. The bond release does not have to happen at the epoch boundary unless there are implementation level limitations preventing us from doing this.

// TODO: unimplemented:
// - if an epoch is newly checkpointed, make unbonding validators/delegations in this epoch unbonded
// - if reaching an epoch boundary, execute validator-related msgs (bonded -> unbonding)
return nil
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't matter too much, but you could return an empty array like the rest of the modules. I say doesn't matter because if we never call the staking module at all the app won't work anyway. Might as well panic, to make sure this doesn't go unnoticed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need to copy paste slashing and evidence handling? If we want those to be processed immediately, why not just let nature take its course?

Default operations in slashing/evidence handling include both slashing and jailing. We want the slashing part but not the jailing part, right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I don't know much about jailing. What would happen if we allow it to go through?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When a validator is jailed, then:

  1. It loses all voting power (https://github.com/cosmos/cosmos-sdk/blob/v0.45.5/x/staking/keeper/val_state_change.go#L259-L268).
  2. The validator's status Jailed becomes true. To be unjailed, the validator needs to submit a message MsgUnjail and redo self-delegation (https://github.com/cosmos/cosmos-sdk/blob/v0.45.5/x/slashing/keeper/unjail.go#L9-L63).

If we allow jailing, then 1 will change the validator set within an epoch, and MsgUnjail in 2 needs to be delayed as well. The proposal 4 at https://babylon-chain.atlassian.net/wiki/spaces/BABYLON/pages/14745602/Four+proposals+on+handling+slashed+validators that we agreed upon gets rid of the jailing part due to such complexity.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, so jailing does remove the power and does change the validator set, but only temporarily, without slashing.

I don't know if this has a big impact on our epochs. I'm a bit lost, I thought we can let slashing happen immediately, which changes some of the power, and it sounds like jailing has a similar effect, so as long as it's done in moderation, the system should keep working.

Because you can't replicate the whole behaviour with copy-pasting, I would not include that part in this PR. What we are doing is not functionally equivalent to what the SDK is doing, therefore it is unacceptable IMO. We can't ignore certain bits on the basis that we may or may not be able to fix it later, when we don't even understand what we are breaking.

I would first restrict ourselves to dealing with the delayed staking, and have another go at consulting with Frojdi about whether we can just let jailing happen in the middle of the epoch.

@fishermanymc wdyt?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if this has a big impact on our epochs. I'm a bit lost, I thought we can let slashing happen immediately, which changes some of the power, and it sounds like jailing has a similar effect, so as long as it's done in moderation, the system should keep working.

A potential issue is as follows. Assuming the system slashes only a part of its stake (say, 50%) of a validator, then:

  • if we jail it, then it is removed from the validator set from a node's view, but remains in the validator set from a checkpoint's view (as it's included in the bitmap).
  • if we don't jail it, then the validator set is consistent from a node or a checkpoint's view.

Because you can't replicate the whole behaviour with copy-pasting, I would not include that part in this PR.

Agree with this one. At the moment we have limited understanding on this part and the modification might break the system. At least the vanilla evidence/slashing modules will still work when only few validators are slashed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes you are right, if we don't slash/jail then the validator set won't change from what it was at the beginning of the epoch and everyone in the bitmap is still okay to present a BLS signature.

To our advantage in this is that

  • we only need +1/3 of the validators to send BLS signatures
  • what needs to be signed comes from the Babylon chain, produced by the remaining validators, so the BLS signatures of the slashed validators should still be okay

Is that true? If we slash +1/3 then these dishonest validators can produce a BLS signature on their own, but in order to convince anyone's Tendermint node to go to an adversarial fork, you would need +2/3 to be dishonest.

What the slashed validators can do is to submit what looks like a "bogus" checkpoint, with their valid signatures, and send the application down this path. Looks like we have to be careful in that situation not to over react and kill Tendermint.

My understanding so far has been that:

  1. We start Epoch_n, and take a snapshot of who the ValidatorSet_n is, and their BLS signatures.
  2. We start Epoch_n+1; the commit hash over Epoch_n becomes known, produced by +2/3 of ValidatorSet_n.
  3. We collect BLS signatures from +1/3 members of ValidatorSet_n; we produce the raw checkpoint.

If we took the snapshot for ValidatorSet_n at the end of Epoch_n, we would have only the people who weren't slashed. Is it important to use the validators from the beginning? If we already know that only slashing can influence it during an epoch?

I suppose we can always discard BLS signatures from slashed validators during collection, and thus reject checkpoints signed by them as well, so we would know that an honest Tendermint validator set would not produce such a checkpoint.

Ultimately, in each epoch we assume that +2/3rd are honest, and if we have to slash +1/3rd we have just proven that assumption to be wrong. Not sure if at that point if the +1/3rd dishonest people need to produce a real alternative fork, or just pretend to have signed a hidden fork, for us to conclude that Babylon failed.

@fishermanymc wdyt?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a complicated topic.

Option-1: If slashing is 100%, then BLS signatures from the slashed validators are useless and should be ignored during signature aggregation. This is because even if they create an attacking QC and get caught later, they have nothing more to be slashed. These are the lunatic malicious validators here: https://babylon-chain.atlassian.net/wiki/spaces/BABYLON/pages/15597569/How+Many+Validators+Can+Babylon+Slash+During+an+Epoch+WIP#3.-Tendermint%E2%80%99s-Native-Slashing-Capacity-in-One-Block

Option-2: If slashing is less than 100%, then the slashed nodes still have some stake in the system and thus are motivated to sign genuine BLS to get unbonded.

There seems also theoretical tradeoff between how many to slash and how many BLS signatures should be collected, which affects the security - liveness tradeoff. David and Nusret are working on this.

As our first pass, let's 1) not delay any slashing, and 2) ask the checkpointing module to only accept BLS signatures from unslashed validators from the original validator set at the beginning of the epoch. The aggregated power should be min(1/3, 1-X), where X is the slashed power. @gitferry , you are also a stakeholder.

Copy link
Member Author

@SebastianElvis SebastianElvis Jun 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the insightful discussions Akosh and Fisher! I also have a feeling that how many to slash and whether to jail seem to have connection with some fundamental trade-offs. Given that we are at the MVP phase and David and Nusret are working on related issues, at the moment let's follow the minimalist 2/3-secure approach proposed above, at least in this PR:

  • Follow Cosmos SDK's original evidence/slashing implementations with both jailing and slashing
  • Reject all BLS signatures from unslashed validators, even if they were in the validator set at the epoch beginning

Comment on lines 109 to 110
// TODO: commented the line below as it's a private function. Find a way to work around this.
// slk.clearValidatorMissedBlockBitArray(ctx, consAddr)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vitsalis this seems to contradict the concept of being able to copy paste.

It's also easy to miss this comment (maybe instead of a TODO it can be a FIXME), thanks for mentioning it in the PR description. I would not try to check code in with this kind of commented logic, which we don't know the consequences of. It looks like it might lead to slashing more times for missed blocks, or who knows.

I didn't expect we have to copy paste slashing, though, only staking.

@SebastianElvis
Copy link
Member Author

Consequently, whenever we invoke them, they unbond all validators/delegations in the queue, rather than those in a certain epoch.

This is why I suggested using withBlockHeader on the Context to temporarily override the block to be the one which got confirmed, and then restore it to where we're at.

Thankfully we have an workaround for this...

@SebastianElvis
Copy link
Member Author

Thanks for the comments and pointing to the workarounds of the blockers Akosh! I have resolved the comments and please feel free to review again. Remaining questions/issues:

  • The WithBlockHeader workaround faces the race condition issue when multiple queries are accessing ctx concurrently.
  • The module responsible for checkpoint-assisted unbonding: epoching vs checkpointing?
  • Jailing
  • Usage of private functions in staking/evidence/slashing

app/app.go Outdated
Comment on lines 373 to 383
app.mm.SetOrderBeginBlockers(
upgradetypes.ModuleName, capabilitytypes.ModuleName, minttypes.ModuleName, distrtypes.ModuleName, slashingtypes.ModuleName,
evidencetypes.ModuleName, stakingtypes.ModuleName,
// NOTE: BBL avoids using app.mm.SetOrderBeginBlockers and app.mm.SetOrderEndBlockers as they has an assertation that avoids incomplete lists of modules, while BBL does not want:
// - BeginBlock processing in slashing/evidence
// - EndBLock processing in staking
app.mm.OrderBeginBlockers = []string{
upgradetypes.ModuleName, capabilitytypes.ModuleName, minttypes.ModuleName, distrtypes.ModuleName, stakingtypes.ModuleName,
authtypes.ModuleName, banktypes.ModuleName, govtypes.ModuleName, crisistypes.ModuleName, genutiltypes.ModuleName,
authz.ModuleName, feegrant.ModuleName,
paramstypes.ModuleName, vestingtypes.ModuleName,
epochingtypes.ModuleName,
btclightclienttypes.ModuleName,
btccheckpointtypes.ModuleName,
checkpointingtypes.ModuleName,
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason I suggested calling SetOrderBeginBlockers first and then remove unwanted entries from OrderBeginBlockers is so that the application can still check that we did not accidentally forget to add a module. There must be some value in that. But then a filter could make sure we don't call what we don't need.

app.mm.SetOrderBeginBlockers(
		upgradetypes.ModuleName, capabilitytypes.ModuleName, minttypes.ModuleName, ...
)
remove(*app.mm.OrderEndBlockers, stakingtype.ModuleName)

@SebastianElvis
Copy link
Member Author

SebastianElvis commented Jun 28, 2022

Thanks for the reviews Akosh and Fisher! I have revised this PR according to our discussions, limiting the scope and modifications in this PR (and also MVP). Feel free to have a look again.

Summary of the current PR:

  • The epoching module now depends on the staking module
  • The epoching keeper wraps and modified certain validator-related functions of the staking module
  • Evidence/slashing modules are unchanged and the epoching module does not touch the slashing/evidence modules
  • Staking's EndBlock processing is excluded

epochingtypes.ModuleName,
btclightclienttypes.ModuleName,
btccheckpointtypes.ModuleName,
checkpointingtypes.ModuleName,
)
// BBL does not want EndBlock processing in staking
app.mm.OrderEndBlockers = append(app.mm.OrderEndBlockers[:2], app.mm.OrderEndBlockers[2+1:]...) // remove stakingtypes.ModuleName
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't think about that method! 😃

Copy link
Contributor

@aakoshh aakoshh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great! This proved such a fertile ground for discussion, thank you for your patience following through with the changes!

@SebastianElvis SebastianElvis merged commit 442df70 into main Jun 29, 2022
@SebastianElvis SebastianElvis deleted the epoching-staking-code-migration branch June 29, 2022 03:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants