Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Radically simplify staking logic #2312

Closed
cwgoes opened this issue Sep 12, 2018 · 12 comments
Closed

Radically simplify staking logic #2312

cwgoes opened this issue Sep 12, 2018 · 12 comments

Comments

@cwgoes
Copy link
Contributor

cwgoes commented Sep 12, 2018

Many of the state machine bugs we've found so far are in the staking module, particularly the UpdateValidators iteration cases and the cliff validator optimization logic. The stake keeper (x/stake/keeper) is generally difficult to understand and reason through - as demonstrated by our inability to catch these issues when writing the code or in review. In part this may be due to writing the code in stages and changing the underlying model several times.

Let's first discuss the clearest layout - maintaining the current performance (which is fine) but splitting up the logic much more clearly between self-contained functions with well-defined pre-conditions and post-conditions (most of which ideally can be defensively asserted). In parallel with this we should make sure to compare our code to the staking/slashing spec, which may have fallen a bit behind the implementation in a few places.

cc @rigelrozanski @alexanderbez @ValarDragon

Ref #2298 (comment)
Ref general discussion in #2173

@ValarDragon
Copy link
Contributor

ValarDragon commented Sep 12, 2018

Some ideas for main ideas to classify by:
Cliff validator exists or not?
Is the current validator new or not?

In the case of a new validator, we have a separate function for "pushing the cliff validator down", and etc. I think alot of the bugs come through because the functions aren't small things that are easy to verify independently.

@rigelrozanski
Copy link
Contributor

All for it - I think the conceptualization of staking has expanded dramatically several times - many times we've tended to tack on new thinking without refactoring..
looks like it's time to clean out the freezer so to speak - take out all the frozen food, reorganize and repackage it, then put it back in.

Misc thoughts:

  • I don't think the staking logic itself can be "radically simplified" - the logic by nature is quite complex. However the logic can be made radically more comprehensible I reckon - and organization of the logic radically simplified.
  • Many of the existing functions have poorly specified pre-conditions in the code comments (def an easy place to being improvement).
  • Honestly, I think the biggest bugs we've been encountering is with regards to the cliff validator logic - and how it reacts on a variety of edge cases - this should probably be of primary focus. Thinking outside of the box this may be the time to completely reevaluate how we conceptualize what a cliff validator is - and potentially respecify what and how cliff logic should work. I think there is probably an alternative model to the current cliff model which changes the staking dynamics, but drastically simplifies the code. For example if we didn't care so much about kicking folks out immediately, we get rid of the cliff validator object entirely - and passively iterate through the tail of the bonded validator set and kick validators out in a lagged fashion (which may mean than once in a while the validator set is slightly greater than 100 but will be corrected in the next few blocks)

^ this is not a fully conceptualized idea, however I think something a little bit more lazy and along these lines may help A LOT! - what y'all think??

We should setup a call probs


Side thought:
This may be an opportunity to think about slow-staking module may be a good way to iron out bugs.

@ValarDragon
Copy link
Contributor

ValarDragon commented Sep 12, 2018

I agree, perhaps we can change alot of the "cliff validator" paradigm. Its unclear to me why the following categorization doesn't suffice:

"bonded", "unbonded", "jailed".

I don't see a need for the cliff validator abstraction, as long as we keep "bonded" and "unbonded" sorted by power, which shouldn't be expensive. Inserting into a sorted list is logarithmic worst case, if we want to do some fancy predictive stuff #postlaunch, we may even be able to get it to log(log(n)). (Basically rob the idea from https://en.wikipedia.org/wiki/Interpolation_search, and come up with a decent predictor for location within the sorted list)

@rigelrozanski
Copy link
Contributor

(side point we also need to have "unbonding" in that list - important for slashing right)

Yeah there is definitely a way to not have the cliff validator term at all without changing much, I think it just means that we need to have a bunch more iterating - which may be okay, maybe a much of the inefficiently that may come from having to iterate over a bunch of records (only retrieve one of them) can be solved on the store side of things. For instance, As far as I can tell there is not easy way to retrieve the "100th" record from the store without iterating over it, but I bet that's a piece of functionality which wouldn't be to difficult to build into the store.

  • but yeah maybe we leave this as a store optimization and kill cliff validator's ASAP - seems like it's just causing no end of headaches

@alexanderbez
Copy link
Contributor

During the past few weeks of debugging this code, I can personally say that iteration and the notion of cliff validators are the biggest pain points for debugging. I found the following troublesome when debugging

  • When/how bondHeight/bondIntraTxCounter are set
  • (Reverse) Iteration of validators by power and who to unbond, who to bond, who to remove as cliff, and who to set as cliff, etc...

@rigelrozanski
Copy link
Contributor

rigelrozanski commented Sep 13, 2018

Ideas from the meeting:

  • The cases within the switch statement are hard to navigate
  • It could be good to seperate switch into a Classification and Execution code set
  • (spec Split up UpdateValidator per each state-transitions, & Tendermint updates in Endblock #2351) Cliff, move cliff validator to the end of the block (we need a proposal)
  • Obviously more distinct functions
  • (spec Split up UpdateValidator per each state-transitions, & Tendermint updates in Endblock #2351) Maybe combine Validator.Status, Validator.Jailed - (need to introduce things like Validator.IsUnbonding()) - then we can write everything as a state transition in update validator (really cool!!) - much better organization - Maybe this would allow us to eradicate update validator all together (only if this change is combined with use of end-block bonded-validator updates)
  • (spec/implementation @ValarDragon) Replace the IntraTx counter - GlobalMessageCounter which comes from the context (or maybe IntraMessageCounter from context -TBD) if we had a global counter then we could remove block-height from the Validator object (bonus)
  • No action necessary slashing/staking split is good
  • Instead of having a separate storage for cliff validator, instead just make it access the nth element of the store
  • #postlaunch consider using a cache BST instead of a cache KV store (which is a map)

@ValarDragon
Copy link
Contributor

ValarDragon commented Sep 14, 2018

Its also interesting to note, that doing the lazy evaluation at end block will probably speed up execution under periods of high load. This is because if there are multiple validator set updates, you only have to do one iteration through it. Whereas before the number of iterations may not have been well bounded (e.g. 1 per message in the block), but in the simplest case there wouldn't be any iterations.

However, we can easily add a boolean "was there a stake related update this block", and only run through that iteration accordingly. (This is an easy #if-have-time-prelaunch optimization though)

@rigelrozanski
Copy link
Contributor

@ValarDragon - great idea to add an optimization bool for if there was an update or not

@ebuchman
Copy link
Member

Can we consider removing redelegation as part of this? It seems like pretty clearly non-MVP requirement and just adds complexity.

@sunnya97
Copy link
Member

sunnya97 commented Sep 18, 2018

I think instant redelegation is a requirement for having a proper decentralized Proof of Stake. Without it, delegation becomes so sticky and the initial validator set gets entrenched.

We've also publicly stated on multiple occasions that this is one of the defining unique features of our Proof of Stake implementation.

@rigelrozanski
Copy link
Contributor

rigelrozanski commented Sep 18, 2018

I'm going to second @sunnya97 for this, not to mention it's actually a fairly straightforward implementation, which doesn't need to change is already complete - there are very few edge cases. I think by removing and adding it back in later, we are going to generate more work for ourselves in pre1.0 phase - and I'm not convinced it will save us much time prelaunch (not to mention weaken our security model as sunny pointed out)

@cwgoes
Copy link
Contributor Author

cwgoes commented Oct 9, 2018

Most has been done, closing in favor of the remaining #1402.

@cwgoes cwgoes closed this as completed Oct 9, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants