Skip to content
This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

Rerun Staking OCW periodically. #7976

Closed
wants to merge 21 commits into from
Closed

Conversation

kianenigma
Copy link
Contributor

@kianenigma kianenigma commented Jan 25, 2021

Fix issues such as https://github.com/paritytech/stakingops/issues/130

Our OCW runner in staking is too conservative now. It only runs once at the beginning of the election window, and only attempts to submit once. If for any reason (e.g. secondary/primary slot overlap, fork and rollback) all validators temporarily think that a solution is submitted, they will all drop their computed solution and never try again. In essence, they will attempt to submit it, which will fail because there is already a good solution on-chain, and the process ends. In this scenario, if the submitted solution gets reverted, we are left with nothing, and thus a heavy on-chain election.

This is a minimal patch to fix the situation. OCW will compute, save and submit the solution initially, and until the end of the election window periodically try to re-submitted the saved solution, if none is stored yet.

Note that I tried to make this patch minimal, as #7909 will have to reintroduce this again.

This needs a burn-in, but not a typical one. I will prepare a patch to the current polkadot release that contains this runtime change but does not bump the spec version. This is okay because there is no runtime change, and the new runtime should work with the old one. Then we can run this first on a full node, and the once we are confident, a validator node. -l=staking=trace is important in all stages.


Polkadot dry-run branch: https://github.com/paritytech/polkadot/tree/kiz-staking-OCW-rerun-dryrun-polkadot
Created using: https://hackmd.io/MMGn9TnKSx2oganhfcQrpw

@kianenigma kianenigma added A0-please_review Pull request needs code review. B0-silent Changes should not be mentioned in any release notes C3-medium PR touches the given topic and has a medium impact on builders. A1-needs_burnin Pull request needs to be tested on a live validator node before merge. DevOps is notified via matrix labels Jan 25, 2021
@kianenigma kianenigma changed the title Kiz staking ocw rerun Rerun Staking OCW periodically. Jan 25, 2021
Copy link
Contributor

@tomusdrw tomusdrw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if the submitted solution gets reverted, we are left with nothing, and thus a heavy on-chain election

Under what conditions can the solution be reverted?

frame/staking/src/offchain_election.rs Outdated Show resolved Hide resolved
frame/staking/src/lib.rs Outdated Show resolved Hide resolved
Copy link
Contributor

@tomusdrw tomusdrw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! So the assumption right now is that:

  1. If you run OCW at the election window opening block, you always attempt to compute a solution.
  2. When the window stays open AND there is no submission already, you attempt to re-submit your solution.

In theory we could attempt to submit if we've found a better solution, but we can leave that as a future improvement, cause afaiu the main goal right now is to ensure that there is ANY solution to never fall back to on-chain process.

frame/staking/src/offchain_election.rs Outdated Show resolved Hide resolved
frame/staking/src/offchain_election.rs Show resolved Hide resolved
Copy link
Member

@bkchr bkchr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How complicated would it to write a test for this? :D

frame/staking/src/offchain_election.rs Outdated Show resolved Hide resolved
frame/staking/src/offchain_election.rs Show resolved Hide resolved
frame/staking/src/offchain_election.rs Outdated Show resolved Hide resolved
frame/staking/src/offchain_election.rs Outdated Show resolved Hide resolved
frame/staking/src/lib.rs Outdated Show resolved Hide resolved
frame/staking/src/lib.rs Show resolved Hide resolved
@kianenigma
Copy link
Contributor Author

How complicated would it to write a test for this? :D

Probably in the order that I'd say given the dry-run, not needed 😄 but I will give it a try before merge.

kianenigma and others added 2 commits February 8, 2021 14:24
Co-authored-by: Bastian Köcher <bkchr@users.noreply.github.com>
Co-authored-by: Bastian Köcher <bkchr@users.noreply.github.com>
kianenigma and others added 4 commits February 8, 2021 14:26
Co-authored-by: Bastian Köcher <bkchr@users.noreply.github.com>
@kianenigma
Copy link
Contributor Author

Burnin looks well, will leave it running for another day with full logs to check everything, else we're good to go.

@kianenigma kianenigma added D9-needsaudit 👮 PR contains changes to fund-managing logic that should be properly reviewed and externally audited and removed A0-please_review Pull request needs code review. A1-needs_burnin Pull request needs to be tested on a live validator node before merge. DevOps is notified via matrix A7-needspolkadotpr labels Feb 8, 2021
@kianenigma
Copy link
Contributor Author

I see it very likely that we will close this and just re-introduce it in the multi-phase election pallet, given that the audit has taken a while.

@kianenigma
Copy link
Contributor Author

Closing this since will be re-introduced by @coriolinus soon.

@kianenigma kianenigma closed this Mar 5, 2021
coriolinus added a commit that referenced this pull request Mar 8, 2021
This is a much smaller diff than that PR contained, but I think
it contains all the essentials.
ghost pushed a commit that referenced this pull request May 3, 2021
* not climate

* explain the intent of the bool in the unsigned phase

* remove glob imports from unsigned.rs

* add OffchainRepeat parameter to ElectionProviderMultiPhase

* migrate core logic from #7976

This is a much smaller diff than that PR contained, but I think
it contains all the essentials.

* improve formatting

* fix test build failures

* cause test to pass

* Apply suggestions from code review

Co-authored-by: Kian Paimani <5588131+kianenigma@users.noreply.github.com>

* collapse imports

* threshold acquired directly within try_acquire_offchain_lock

* add test of resubmission after interval

* add test that ocw can regenerate a failed cache when resubmitting

* ensure that OCW solutions are of the correct round

This should help prevent stale cached solutions from persisting
past the election for which they are intended.

* add test of pre-dispatch round check

* use `RawSolution.round` instead of redundantly externally

* unpack imports

Co-authored-by: Kian Paimani <5588131+kianenigma@users.noreply.github.com>

* rename `OFFCHAIN_HEAD_DB` -> `OFFCHAIN_LOCK`

* rename `mine_call` -> `mine_checked_call`

* eliminate extraneous comma

* check cached call is current before submitting

* remove unused consts introduced by bad merge.

Co-authored-by: Guillaume Thiolliere <gui.thiolliere@gmail.com>

* resubmit when our solution beats queued solution

* clear call cache if solution fails to submit

* use local storage; clear on ElectionFinalized

* Revert "use local storage; clear on ElectionFinalized"

This reverts commit 4b46a93.

* BROKEN: try to filter local events in OCW

* use local storage; simplify score fetching

* fix event filter

* mutate storage instead of setting it

* StorageValueRef::local isn't actually implemented yet

* add logging for some events of interest in OCW miner

* rename kill_solution -> kill_ocw_solution to avoid ambiguity

* defensive err instead of unreachable given unreachable code

* doc punctuation

Co-authored-by: Kian Paimani <5588131+kianenigma@users.noreply.github.com>

* distinguish miner errors between "out of date" and "call invalid"

* downgrade info logs -> debug

* ensure encoded call decodes as a call

* fix semantics of validation of pre-dispatch failure for wrong round

* move score check within `and_then`

* add test that offchain workers clear their cache after election

* ensure that bad ocw submissions are not retained for resubmission

* simplify fn ocw_solution_exists

* add feasibility check when restoring cached solution

should address https://github.com/paritytech/substrate/pull/8290/files#r617533358

restructures how the checks are sequenced, which simplifies legibility.

* simplify checks again

Co-authored-by: Kian Paimani <5588131+kianenigma@users.noreply.github.com>
Co-authored-by: Guillaume Thiolliere <gui.thiolliere@gmail.com>
nazar-pc pushed a commit to autonomys/substrate that referenced this pull request Aug 8, 2021
* not climate

* explain the intent of the bool in the unsigned phase

* remove glob imports from unsigned.rs

* add OffchainRepeat parameter to ElectionProviderMultiPhase

* migrate core logic from paritytech#7976

This is a much smaller diff than that PR contained, but I think
it contains all the essentials.

* improve formatting

* fix test build failures

* cause test to pass

* Apply suggestions from code review

Co-authored-by: Kian Paimani <5588131+kianenigma@users.noreply.github.com>

* collapse imports

* threshold acquired directly within try_acquire_offchain_lock

* add test of resubmission after interval

* add test that ocw can regenerate a failed cache when resubmitting

* ensure that OCW solutions are of the correct round

This should help prevent stale cached solutions from persisting
past the election for which they are intended.

* add test of pre-dispatch round check

* use `RawSolution.round` instead of redundantly externally

* unpack imports

Co-authored-by: Kian Paimani <5588131+kianenigma@users.noreply.github.com>

* rename `OFFCHAIN_HEAD_DB` -> `OFFCHAIN_LOCK`

* rename `mine_call` -> `mine_checked_call`

* eliminate extraneous comma

* check cached call is current before submitting

* remove unused consts introduced by bad merge.

Co-authored-by: Guillaume Thiolliere <gui.thiolliere@gmail.com>

* resubmit when our solution beats queued solution

* clear call cache if solution fails to submit

* use local storage; clear on ElectionFinalized

* Revert "use local storage; clear on ElectionFinalized"

This reverts commit 4b46a93.

* BROKEN: try to filter local events in OCW

* use local storage; simplify score fetching

* fix event filter

* mutate storage instead of setting it

* StorageValueRef::local isn't actually implemented yet

* add logging for some events of interest in OCW miner

* rename kill_solution -> kill_ocw_solution to avoid ambiguity

* defensive err instead of unreachable given unreachable code

* doc punctuation

Co-authored-by: Kian Paimani <5588131+kianenigma@users.noreply.github.com>

* distinguish miner errors between "out of date" and "call invalid"

* downgrade info logs -> debug

* ensure encoded call decodes as a call

* fix semantics of validation of pre-dispatch failure for wrong round

* move score check within `and_then`

* add test that offchain workers clear their cache after election

* ensure that bad ocw submissions are not retained for resubmission

* simplify fn ocw_solution_exists

* add feasibility check when restoring cached solution

should address https://github.com/paritytech/substrate/pull/8290/files#r617533358

restructures how the checks are sequenced, which simplifies legibility.

* simplify checks again

Co-authored-by: Kian Paimani <5588131+kianenigma@users.noreply.github.com>
Co-authored-by: Guillaume Thiolliere <gui.thiolliere@gmail.com>
@shawntabrizi shawntabrizi removed the D9-needsaudit 👮 PR contains changes to fund-managing logic that should be properly reviewed and externally audited label Feb 15, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
B0-silent Changes should not be mentioned in any release notes C3-medium PR touches the given topic and has a medium impact on builders.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants