epoching: copy/paste necessary code from staking/evidence/slashing and make modifications #28

SebastianElvis · 2022-06-24T06:09:49Z

This PR introduces code from Cosmos SDK's staking/evidence/slashing modules necessary for BBL, and makes our own modifications proposed in https://babylon-chain.atlassian.net/wiki/spaces/BABYLON/pages/13107673/Delay+Mechanism+for+Epochs+and+Unbonding and https://babylon-chain.atlassian.net/wiki/spaces/BABYLON/pages/14745602/Four+proposals+on+handling+slashed+validators.

Introduced code and our modifications:

staking: we import functions from https://github.com/cosmos/cosmos-sdk/blob/v0.45.5/x/staking/keeper/val_state_change.go, and break the parts of updating validator sets and unbonding mature validators/delegations into two different functions.
evidence and slashing: we import functions from https://github.com/cosmos/cosmos-sdk/blob/v0.45.5/x/evidence/keeper/infraction.go and https://github.com/cosmos/cosmos-sdk/blob/v0.45.5/x/slashing/keeper/infractions.go, and remove the jailing part.

SebastianElvis · 2022-06-24T06:16:04Z

There are two blockers for implementing the proposed modifications at the moment. Not sure if we should fix them in this PR, which already introduces many logics.

In Cosmos SDK's staking module, DequeueAllMatureUBDQueue and DequeueAllMatureRedelegationQueue always uses the last block's time ctx.BlockHeader().Time rather than the currTime parameter. Consequently, whenever we invoke them, they unbond all validators/delegations in the queue, rather than those in a certain epoch.
In Cosmos SDK's slasing module, HandleValidatorSignature invokes a private function clearValidatorMissedBlockBitArray(ctx, consAddr). We cannot invoke this private function outside this Cosmos SDK module. At the moment I commented this line. Not sure if it affects security or not.

aakoshh · 2022-06-24T09:00:31Z

Consequently, whenever we invoke them, they unbond all validators/delegations in the queue, rather than those in a certain epoch.

This is why I suggested using withBlockHeader on the Context to temporarily override the block to be the one which got confirmed, and then restore it to where we're at.

aakoshh

Great, easy to see where things came from 👍

Do we really need to copy paste slashing and evidence handling? If we want those to be processed immediately, why not just let nature take its course?

aakoshh · 2022-06-24T09:14:49Z

app/app.go

- crisistypes.ModuleName, govtypes.ModuleName, stakingtypes.ModuleName,
+ crisistypes.ModuleName, govtypes.ModuleName,


I suspect this will not work, because SetOrderEndBlockers will panic if you forget to include a module that is otherwise registered in the app. The way I see it is that you can call this first to get the benefit of this check, then remove the staking module from OrderEndBlockers, which is thankfully public.

Good point! Didn't know they have an assert there... will fix

aakoshh · 2022-06-24T09:15:28Z

app/app.go

- upgradetypes.ModuleName, capabilitytypes.ModuleName, minttypes.ModuleName, distrtypes.ModuleName, slashingtypes.ModuleName,
- evidencetypes.ModuleName, stakingtypes.ModuleName,
+ upgradetypes.ModuleName, capabilitytypes.ModuleName, minttypes.ModuleName, distrtypes.ModuleName, stakingtypes.ModuleName,


Both evidence and slashing are removed, but staking is kept? Can you explain what's the idea?

Staking's BeginBlock processing only includes TrackHistoricalInfo (see https://github.com/cosmos/cosmos-sdk/blob/v0.45.5/x/staking/abci.go), which basically does some cleanup job. Keeping it seems to be not harmful.

aakoshh · 2022-06-24T09:27:57Z

x/epoching/abci.go

+ defer telemetry.ModuleMeasureSince(stakingtypes.ModuleName, time.Now(), telemetry.MetricKeyEndBlocker)
+
+ // TODO: unimplemented:
+ // - if an epoch is newly checkpointed, make unbonding validators/delegations in this epoch unbonded


I [put this in the diagram to happen when the BTC checkpoint is confirmed, mostly because this is the easiest in terms of book-keeping: I can clearly see when a checkpoint is going confirmed the first time. To do it at the end would be more difficult, need to capture before and after.

I also thought it's not actually the epoch's job to know about checkpointing, and that release is related to checkpointing, so it should be in the checkpointing module.

I was meaning that the epoching module gets the epoch's chechpoint status from the checkpointing module, rather than judging the status by itself. Is my understanding correct and consistent with yours?

Yes, but I mean I'm not sure that the epoching module even needs a notion of status. To put it differently, it doesn't have to know that it's being checkpointed. It knows that it has ended, and it can tell other modules what the last block was, but those dependant modules can figure out on their own what they need to do. It just so happens that the checkpointing module needs the last commit hash for its own purposes, and it will use the checkpointing status as a requirement to release unbonding tokens. But the epoching can be oblivious to that.

It just so happens that the checkpointing module needs the last commit hash for its own purposes, and it will use the checkpointing status as a requirement to release unbonding tokens. But the epoching can be oblivious to that.

Checkpoint-assisted unbonding is a part of the epoching module rather than checkpointing, right?

One of the main reasons that checkpoint-assisted unbonding fits the epoching module better is that, the checkpoint-assisted unbonding depends on the staking module (especially the "mature" queues), maintained by the epoching module.

I think we agreed with @fishermanymc that the release of unbonding tokens doesn't need to be tied to the epoch.

You can imagine using just epochs, with 21 days unbonding, it's fine. But because we are doing checkpointing, we can release tokens earlier, as soon as the checkpoints are confirmed. Therefore it's associated with checkpoints, not epochs. It happens in the middle of the next epoch, depends on BTC, not the next epoch boundary.

Epochs manage the delayed staking indeed, but they only manage the inputs. The maturing queues are managed by the staking module itself, it doesn't need further input from the epochs. It just happens to be that we can use the epochs as a bucket for the confirmation height and the time step function - instead of maturing in 21 days, we take manual control of setting the height which we consider mature. But again, this doesn't depend on any active participation from the epoching module.

How about we do the following?

The epoching module maintains the modified version of staking (e.g., the ApplyMatureUnbonding thing)

Whenever an epoch is checkpointed, the checkpointing module invokes epochingKeeper.ApplyMatureUnbonding in order to unbond mature validators/delegations?

This way each module does its own thing, and we only have a dependency from checkpointing to epoching (instead of being the other way where epoching hooks to the EpochConfirmed hook in the checkpointing module).

Thanks @SebastianElvis and @aakoshh . The BTC/Babylon-specific bond release condition is that all the epochs up to the epoch where the unbonding request is submitted are confirmed. The bond release does not have to happen at the epoch boundary unless there are implementation level limitations preventing us from doing this.

aakoshh · 2022-06-24T09:31:43Z

x/epoching/abci.go

+ // TODO: unimplemented:
+ // - if an epoch is newly checkpointed, make unbonding validators/delegations in this epoch unbonded
+ // - if reaching an epoch boundary, execute validator-related msgs (bonded -> unbonding)
+ return nil


Doesn't matter too much, but you could return an empty array like the rest of the modules. I say doesn't matter because if we never call the staking module at all the app won't work anyway. Might as well panic, to make sure this doesn't go unnoticed.

Do we really need to copy paste slashing and evidence handling? If we want those to be processed immediately, why not just let nature take its course?

Default operations in slashing/evidence handling include both slashing and jailing. We want the slashing part but not the jailing part, right?

Oh, I don't know much about jailing. What would happen if we allow it to go through?

When a validator is jailed, then:

It loses all voting power (https://github.com/cosmos/cosmos-sdk/blob/v0.45.5/x/staking/keeper/val_state_change.go#L259-L268).

The validator's status Jailed becomes true. To be unjailed, the validator needs to submit a message MsgUnjail and redo self-delegation (https://github.com/cosmos/cosmos-sdk/blob/v0.45.5/x/slashing/keeper/unjail.go#L9-L63).

If we allow jailing, then 1 will change the validator set within an epoch, and MsgUnjail in 2 needs to be delayed as well. The proposal 4 at https://babylon-chain.atlassian.net/wiki/spaces/BABYLON/pages/14745602/Four+proposals+on+handling+slashed+validators that we agreed upon gets rid of the jailing part due to such complexity.

Okay, so jailing does remove the power and does change the validator set, but only temporarily, without slashing.

I don't know if this has a big impact on our epochs. I'm a bit lost, I thought we can let slashing happen immediately, which changes some of the power, and it sounds like jailing has a similar effect, so as long as it's done in moderation, the system should keep working.

Because you can't replicate the whole behaviour with copy-pasting, I would not include that part in this PR. What we are doing is not functionally equivalent to what the SDK is doing, therefore it is unacceptable IMO. We can't ignore certain bits on the basis that we may or may not be able to fix it later, when we don't even understand what we are breaking.

I would first restrict ourselves to dealing with the delayed staking, and have another go at consulting with Frojdi about whether we can just let jailing happen in the middle of the epoch.

@fishermanymc wdyt?

I don't know if this has a big impact on our epochs. I'm a bit lost, I thought we can let slashing happen immediately, which changes some of the power, and it sounds like jailing has a similar effect, so as long as it's done in moderation, the system should keep working.

A potential issue is as follows. Assuming the system slashes only a part of its stake (say, 50%) of a validator, then:

if we jail it, then it is removed from the validator set from a node's view, but remains in the validator set from a checkpoint's view (as it's included in the bitmap).

if we don't jail it, then the validator set is consistent from a node or a checkpoint's view.

Because you can't replicate the whole behaviour with copy-pasting, I would not include that part in this PR.

Agree with this one. At the moment we have limited understanding on this part and the modification might break the system. At least the vanilla evidence/slashing modules will still work when only few validators are slashed.

Yes you are right, if we don't slash/jail then the validator set won't change from what it was at the beginning of the epoch and everyone in the bitmap is still okay to present a BLS signature.

To our advantage in this is that

we only need +1/3 of the validators to send BLS signatures

what needs to be signed comes from the Babylon chain, produced by the remaining validators, so the BLS signatures of the slashed validators should still be okay

Is that true? If we slash +1/3 then these dishonest validators can produce a BLS signature on their own, but in order to convince anyone's Tendermint node to go to an adversarial fork, you would need +2/3 to be dishonest.

What the slashed validators can do is to submit what looks like a "bogus" checkpoint, with their valid signatures, and send the application down this path. Looks like we have to be careful in that situation not to over react and kill Tendermint.

My understanding so far has been that:

We start Epoch_n, and take a snapshot of who the ValidatorSet_n is, and their BLS signatures.

We start Epoch_n+1; the commit hash over Epoch_n becomes known, produced by +2/3 of ValidatorSet_n.

We collect BLS signatures from +1/3 members of ValidatorSet_n; we produce the raw checkpoint.

If we took the snapshot for ValidatorSet_n at the end of Epoch_n, we would have only the people who weren't slashed. Is it important to use the validators from the beginning? If we already know that only slashing can influence it during an epoch?

I suppose we can always discard BLS signatures from slashed validators during collection, and thus reject checkpoints signed by them as well, so we would know that an honest Tendermint validator set would not produce such a checkpoint.

Ultimately, in each epoch we assume that +2/3rd are honest, and if we have to slash +1/3rd we have just proven that assumption to be wrong. Not sure if at that point if the +1/3rd dishonest people need to produce a real alternative fork, or just pretend to have signed a hidden fork, for us to conclude that Babylon failed.

@fishermanymc wdyt?

This is a complicated topic.

Option-1: If slashing is 100%, then BLS signatures from the slashed validators are useless and should be ignored during signature aggregation. This is because even if they create an attacking QC and get caught later, they have nothing more to be slashed. These are the lunatic malicious validators here: https://babylon-chain.atlassian.net/wiki/spaces/BABYLON/pages/15597569/How+Many+Validators+Can+Babylon+Slash+During+an+Epoch+WIP#3.-Tendermint%E2%80%99s-Native-Slashing-Capacity-in-One-Block

Option-2: If slashing is less than 100%, then the slashed nodes still have some stake in the system and thus are motivated to sign genuine BLS to get unbonded.

There seems also theoretical tradeoff between how many to slash and how many BLS signatures should be collected, which affects the security - liveness tradeoff. David and Nusret are working on this.

As our first pass, let's 1) not delay any slashing, and 2) ask the checkpointing module to only accept BLS signatures from unslashed validators from the original validator set at the beginning of the epoch. The aggregated power should be min(1/3, 1-X), where X is the slashed power. @gitferry , you are also a stakeholder.

Thanks for the insightful discussions Akosh and Fisher! I also have a feeling that how many to slash and whether to jail seem to have connection with some fundamental trade-offs. Given that we are at the MVP phase and David and Nusret are working on related issues, at the moment let's follow the minimalist 2/3-secure approach proposed above, at least in this PR:

Follow Cosmos SDK's original evidence/slashing implementations with both jailing and slashing

Reject all BLS signatures from unslashed validators, even if they were in the validator set at the epoch beginning

aakoshh · 2022-06-24T09:38:17Z

x/epoching/keeper/modified_slashing.go

+ // TODO: commented the line below as it's a private function. Find a way to work around this.
+ // slk.clearValidatorMissedBlockBitArray(ctx, consAddr)


@vitsalis this seems to contradict the concept of being able to copy paste.

It's also easy to miss this comment (maybe instead of a TODO it can be a FIXME), thanks for mentioning it in the PR description. I would not try to check code in with this kind of commented logic, which we don't know the consequences of. It looks like it might lead to slashing more times for missed blocks, or who knows.

I didn't expect we have to copy paste slashing, though, only staking.

SebastianElvis · 2022-06-24T10:32:50Z

Consequently, whenever we invoke them, they unbond all validators/delegations in the queue, rather than those in a certain epoch.

This is why I suggested using withBlockHeader on the Context to temporarily override the block to be the one which got confirmed, and then restore it to where we're at.

Thankfully we have an workaround for this...

SebastianElvis · 2022-06-27T05:19:47Z

Thanks for the comments and pointing to the workarounds of the blockers Akosh! I have resolved the comments and please feel free to review again. Remaining questions/issues:

The WithBlockHeader workaround faces the race condition issue when multiple queries are accessing ctx concurrently.
The module responsible for checkpoint-assisted unbonding: epoching vs checkpointing?
Jailing
Usage of private functions in staking/evidence/slashing

aakoshh · 2022-06-27T09:56:36Z

app/app.go

- app.mm.SetOrderBeginBlockers(
- upgradetypes.ModuleName, capabilitytypes.ModuleName, minttypes.ModuleName, distrtypes.ModuleName, slashingtypes.ModuleName,
- evidencetypes.ModuleName, stakingtypes.ModuleName,
+ // NOTE: BBL avoids using app.mm.SetOrderBeginBlockers and app.mm.SetOrderEndBlockers as they has an assertation that avoids incomplete lists of modules, while BBL does not want:
+ // - BeginBlock processing in slashing/evidence
+ // - EndBLock processing in staking
+ app.mm.OrderBeginBlockers = []string{
+ upgradetypes.ModuleName, capabilitytypes.ModuleName, minttypes.ModuleName, distrtypes.ModuleName, stakingtypes.ModuleName,
 authtypes.ModuleName, banktypes.ModuleName, govtypes.ModuleName, crisistypes.ModuleName, genutiltypes.ModuleName,
 authz.ModuleName, feegrant.ModuleName,
 paramstypes.ModuleName, vestingtypes.ModuleName,
 epochingtypes.ModuleName,
 btclightclienttypes.ModuleName,
 btccheckpointtypes.ModuleName,
 checkpointingtypes.ModuleName,
- )


The reason I suggested calling SetOrderBeginBlockers first and then remove unwanted entries from OrderBeginBlockers is so that the application can still check that we did not accidentally forget to add a module. There must be some value in that. But then a filter could make sure we don't call what we don't need.

app.mm.SetOrderBeginBlockers( upgradetypes.ModuleName, capabilitytypes.ModuleName, minttypes.ModuleName, ... ) remove(*app.mm.OrderEndBlockers, stakingtype.ModuleName)

SebastianElvis · 2022-06-28T06:12:22Z

Thanks for the reviews Akosh and Fisher! I have revised this PR according to our discussions, limiting the scope and modifications in this PR (and also MVP). Feel free to have a look again.

Summary of the current PR:

The epoching module now depends on the staking module
The epoching keeper wraps and modified certain validator-related functions of the staking module
Evidence/slashing modules are unchanged and the epoching module does not touch the slashing/evidence modules
Staking's EndBlock processing is excluded

aakoshh · 2022-06-28T14:37:02Z

app/app.go

 epochingtypes.ModuleName,
 btclightclienttypes.ModuleName,
 btccheckpointtypes.ModuleName,
 checkpointingtypes.ModuleName,
 )
+ // BBL does not want EndBlock processing in staking
+ app.mm.OrderEndBlockers = append(app.mm.OrderEndBlockers[:2], app.mm.OrderEndBlockers[2+1:]...) // remove stakingtypes.ModuleName


I didn't think about that method! 😃

aakoshh

Great! This proved such a fertile ground for discussion, thank you for your patience following through with the changes!

SebastianElvis added 6 commits June 24, 2022 15:05

init

951dcf3

abci interface

af5ff8d

module stuff

e12dd3f

slash without jail

d6333c1

minor

6cd242a

minor

d997725

SebastianElvis requested a review from aakoshh June 24, 2022 06:09

aakoshh reviewed Jun 24, 2022

View reviewed changes

SebastianElvis added 2 commits June 27, 2022 14:29

apply mature unbonding fix

82eb108

avoid assertions

d6d5996

aakoshh reviewed Jun 27, 2022

View reviewed changes

SebastianElvis added 6 commits June 27, 2022 20:29

fix order of beginblockers and endblockers

6bc1fb3

Merge branch 'main' into epoching-staking-code-migration

729da0b

don't touch evidence/slashing at the moment

3d5f5a1

fix and make required staking functions a part of epoching

a6aa599

minor

34caaa5

add dependency to evidence and slashing

97c36cd

SebastianElvis requested a review from aakoshh June 28, 2022 06:12

SebastianElvis added 2 commits June 28, 2022 16:56

minor

bf09eb8

minor

39b7d61

aakoshh reviewed Jun 28, 2022

View reviewed changes

aakoshh approved these changes Jun 28, 2022

View reviewed changes

Merge branch 'main' into epoching-staking-code-migration

c56e597

SebastianElvis merged commit 442df70 into main Jun 29, 2022

SebastianElvis deleted the epoching-staking-code-migration branch June 29, 2022 03:57

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

epoching: copy/paste necessary code from staking/evidence/slashing and make modifications #28

epoching: copy/paste necessary code from staking/evidence/slashing and make modifications #28

SebastianElvis commented Jun 24, 2022

SebastianElvis commented Jun 24, 2022

aakoshh commented Jun 24, 2022

aakoshh left a comment

aakoshh Jun 24, 2022

SebastianElvis Jun 24, 2022

aakoshh Jun 24, 2022

SebastianElvis Jun 24, 2022

aakoshh Jun 24, 2022

SebastianElvis Jun 24, 2022

aakoshh Jun 25, 2022

SebastianElvis Jun 27, 2022

aakoshh Jun 27, 2022

SebastianElvis Jun 27, 2022

fishermanymc Jun 27, 2022

aakoshh Jun 24, 2022

SebastianElvis Jun 24, 2022

aakoshh Jun 25, 2022

SebastianElvis Jun 27, 2022

aakoshh Jun 27, 2022

SebastianElvis Jun 27, 2022

aakoshh Jun 27, 2022

fishermanymc Jun 27, 2022

SebastianElvis Jun 28, 2022 •

edited

Loading

aakoshh Jun 24, 2022

SebastianElvis commented Jun 24, 2022

SebastianElvis commented Jun 27, 2022

aakoshh Jun 27, 2022

SebastianElvis commented Jun 28, 2022 •

edited

Loading

aakoshh Jun 28, 2022

aakoshh left a comment

		crisistypes.ModuleName, govtypes.ModuleName, stakingtypes.ModuleName,
		crisistypes.ModuleName, govtypes.ModuleName,

		// TODO: commented the line below as it's a private function. Find a way to work around this.
		// slk.clearValidatorMissedBlockBitArray(ctx, consAddr)

epoching: copy/paste necessary code from staking/evidence/slashing and make modifications #28

epoching: copy/paste necessary code from staking/evidence/slashing and make modifications #28

Conversation

SebastianElvis commented Jun 24, 2022

SebastianElvis commented Jun 24, 2022

aakoshh commented Jun 24, 2022

aakoshh left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

SebastianElvis Jun 28, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

SebastianElvis commented Jun 24, 2022

SebastianElvis commented Jun 27, 2022

Choose a reason for hiding this comment

SebastianElvis commented Jun 28, 2022 • edited Loading

Choose a reason for hiding this comment

aakoshh left a comment

Choose a reason for hiding this comment

SebastianElvis Jun 28, 2022 •

edited

Loading

SebastianElvis commented Jun 28, 2022 •

edited

Loading