
Domains: Freeze, Unfreeze, and Prune Execution receipt of Domain through Sudo #2896

Merged: vedhavyas merged 1 commit into main from domain_freeze_unfreeze on Jul 16, 2024

Conversation

@vedhavyas (Contributor) commented on Jul 9, 2024

This PR adds new dispatches for Sudo on Consensus to:

  • Freeze a given domain
    • Once frozen, the domain does not accept any new bundles but still processes fraud proofs
  • Unfreeze a given domain
  • Prune a bad execution receipt of a frozen domain
    • This dispatch assumes that Sudo has already validated the invalidity of the bad ER.

These dispatches are generally not used, but exist in case a fraud proof does not fit into the consensus block due to incorrect weight/size limits. In that case, through social consensus, Sudo/Governance would freeze the domain for the time being, prune any bad ERs, and unfreeze the domain to restore normal execution.
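For illustration only, here is a minimal, framework-free sketch of the state machine these dispatches imply. The call names mirror the dispatches described above (`prune_domain_execution_receipt` is the name used later in this thread), but the types and storage model are simplified assumptions rather than the actual pallet code:

```rust
use std::collections::{HashMap, HashSet};

type DomainId = u32;
type ReceiptHash = [u8; 32];

/// Simplified stand-in for the consensus-chain state these calls touch.
#[derive(Default)]
struct ConsensusState {
    frozen_domains: HashSet<DomainId>,
    /// Unconfirmed execution receipts per domain (simplified storage model).
    receipts: HashMap<DomainId, HashSet<ReceiptHash>>,
}

impl ConsensusState {
    /// Sudo: freeze a domain. A frozen domain rejects new bundles, but fraud
    /// proofs against its receipts are still processed.
    fn freeze_domain(&mut self, domain_id: DomainId) {
        self.frozen_domains.insert(domain_id);
    }

    /// Sudo: unfreeze a domain, re-enabling bundle production.
    fn unfreeze_domain(&mut self, domain_id: DomainId) {
        self.frozen_domains.remove(&domain_id);
    }

    /// Sudo: prune a bad ER from a frozen domain. Assumes Sudo has already
    /// validated offchain that the receipt is invalid.
    fn prune_domain_execution_receipt(
        &mut self,
        domain_id: DomainId,
        bad_receipt: ReceiptHash,
    ) -> Result<(), &'static str> {
        if !self.frozen_domains.contains(&domain_id) {
            return Err("domain must be frozen first");
        }
        let removed = self
            .receipts
            .get_mut(&domain_id)
            .map_or(false, |set| set.remove(&bad_receipt));
        if removed { Ok(()) } else { Err("unknown receipt") }
    }

    /// Bundle admission check used on block import (simplified).
    fn accepts_bundles(&self, domain_id: DomainId) -> bool {
        !self.frozen_domains.contains(&domain_id)
    }
}
```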

Will update the spec with these changes.
Spec updates: subspace/protocol-specs#44

cc: @dariolina

Code contributor checklist:

@NingLin-P (Member)

> Prune a bad execution receipt of a frozen domain
> This dispatch assumes that Sudo has already validated the invalidity of the bad ER.

I'm not quite sure about this. If the FP can't be submitted due to:

  • weight/size limit
  • prover can't generate the FP
  • verifier rejects the FP
  • etc.

then it is a bug in the FP system that we need to fix before unfreezing the domain, while pruning the bad ER is just a temporary workaround; the issue can still happen again. Also, this puts the burden of manually validating the bad ER on Sudo while the FP system is broken. cc @jfrank-summit

@vedhavyas (Contributor, Author)

> I'm not quite sure about this. If the FP can't be submitted due to:
> • weight/size limit
> • prover can't generate the FP
> • verifier rejects the FP
> • etc.

That is the actual reason why we have this dispatch. If the fraud proof cannot be submitted for any reason, then there will be some sort of offchain process to verify the actual fraud proof validity, and then root will prune the bad ER.
This is what I assumed per the audit discussion here: https://subspacenetwork.slack.com/archives/C04RUR1P37U/p1719441632508529

Maybe @dariolina can confirm here

@dariolina (Member)

I think you're both right. The flow I imagine is:

  1. Something is wrong, sudo freezes the domain
  2. We investigate offchain what went wrong and, if needed, fix the relevant FP bugs so that the same situation does not happen again
  3. Roll out upgrades with the fix
  4. Prune the bad ER manually
  5. Unfreeze the domain for the upgrades to apply (the whole sequence is sketched below)

WDYT?
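To make the ordering concrete, here is a hypothetical driver for these five steps, reusing the toy `ConsensusState` sketch from the PR description above; the function name and placeholder arguments are assumptions, not PR code:

```rust
/// Hypothetical driver for the five steps above, built on the toy
/// `ConsensusState`, `DomainId`, and `ReceiptHash` from the earlier sketch.
fn recovery_flow(state: &mut ConsensusState, domain_id: DomainId, bad_er: ReceiptHash) {
    // 1. Something is wrong: sudo freezes the domain.
    state.freeze_domain(domain_id);
    assert!(!state.accepts_bundles(domain_id));

    // 2. and 3. happen offchain / via runtime upgrades: investigate, fix the
    //    relevant FP bugs, and roll out the upgraded runtime(s).

    // 4. Prune the bad ER manually (only allowed while the domain is frozen).
    state
        .prune_domain_execution_receipt(domain_id, bad_er)
        .expect("sudo validated this receipt offchain");

    // 5. Unfreeze so the upgrades apply from the next domain block.
    state.unfreeze_domain(domain_id);
    assert!(state.accepts_bundles(domain_id));
}
```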

@NingLin-P (Member)

I think we both agree that freezing and unfreezing domains is necessary, but the controversy is how to prune the bad ER. What I imagine is that after the fix is applied (either to the prover or the verifier), the FP flow should work again, and an FP will be generated and submitted successfully to prune the bad ER; this should happen automatically, without manually validating the FP/bad ER offchain.

@dariolina (Member)

But assume we need to do a domain runtime upgrade for the fix; then the ER in question will go out of the challenge period by the time the new code kicks in. Right?

@NingLin-P (Member)

> But assume we need to do a domain runtime upgrade for the fix

The domain runtime code that derives the bad ER will be used during FP verification, not the latest runtime code. Even if a hack is necessary for the fix, it should be part of the fix and present in the code rather than offchain, IMHO.
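The point about which runtime code the verifier picks can be made concrete with a small sketch. This lookup is an assumption about the mechanism for illustration, not the actual consensus-chain code:

```rust
/// Hypothetical sketch of the lookup being described: the verifier uses the
/// domain runtime code that was active at the block the receipt was derived
/// from, not the latest code. `upgrades` is a list of
/// (enacted_at_block, runtime_code) pairs sorted by block number.
fn runtime_code_for_receipt(receipt_block: u32, upgrades: &[(u32, Vec<u8>)]) -> &[u8] {
    upgrades
        .iter()
        .rev()
        .find(|(enacted_at, _)| *enacted_at <= receipt_block)
        .map(|(_, code)| code.as_slice())
        .expect("the genesis runtime code is always present")
}
```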

@vedhavyas (Contributor, Author)

> The domain runtime code that derives the bad ER will be used during FP verification, not the latest runtime code. Even if a hack is necessary for the fix, it should be part of the fix and present in the code rather than offchain, IMHO.

Okay, let's assume an edge-case scenario where the FP did not make it into the consensus block for multiple reasons. Malicious operators can continue to submit bad ERs while honest operators continue to submit FPs. As long as the bad ER does not go out of the challenge period, this is fine. But it is possible for the bad ER to be confirmed due to the above FP not making it into the consensus block.
If the team realises that, they would freeze the domain. Then what's next?
Do we make a temporary hack to ensure the FP lands in the consensus block? What if it was a bug in the domain runtime itself? In that case, even if we unfreeze the domain to produce bundles, there is a possibility of the bad ER being confirmed and continuing to produce new bad ERs.

At this point, governance can make a call to prune the first bad ER, upgrade the domain runtime, wait for the runtime to be enacted on the consensus chain, then proceed to unfreeze the domain. The next domain block will contain the new runtime and honest operators will proceed with domain block imports.

There could be a plethora of such scenarios, and the list might not even be exhaustive. This dispatch is essentially a way for us to ensure such guards are in place.

@NingLin-P (Member)

> If the team realises that, they would freeze the domain. Then what's next?
> Do we make a temporary hack to ensure the FP lands in the consensus block? What if it was a bug in the domain runtime itself?

What the team will do is deploy a complete fix (which contains a hack if necessary) that allows the prover to generate the FP and the verifier to accept the FP, no matter whether it is a domain runtime bug or not, to prune the existing and any new bad ERs.

> At this point, governance can make a call to prune the first bad ER, upgrade the domain runtime, wait for the runtime to be enacted on the consensus chain, then proceed to unfreeze the domain. The next domain block will contain the new runtime and honest operators will proceed with domain block imports.

I really don't see how "governance can make a call to prune the first bad ER" helps in this case:

  • If the verifier, i.e. the consensus chain, needs to use a new domain runtime to verify the FP of the bad ER (which derives from an old runtime), then the fix should be a consensus runtime upgrade that contains a hack to use the latest domain runtime code to verify the FP (perhaps for a specific range of ERs) instead of the runtime code that derived the bad ER
  • If the prover, i.e. the domain client, needs to use the new domain runtime code to generate the FP, then it just needs to wait until the runtime is upgraded, since a domain runtime upgrade is applied forcibly even if there are no bundles

After the fix is deployed, the honest operator will produce/submit an FP to prune the bad ER, and then we can unfreeze the domain. I don't see why we need to manually prune the bad ER beforehand.

> There could be a plethora of such scenarios, and the list might not even be exhaustive. This dispatch is essentially a way for us to ensure such guards are in place.

The only safeguard, from my understanding, is the ability to freeze the domain and deploy the fix; what the fix will be depends on the issue.

@NingLin-P (Member) left a comment

Still, I think we should not use prune_domain_execution_receipt during the recovery process, but as the audit team points out there is no harm in having this capability, so let's move on.

Comment on lines +2572 to +2575
```rust
pub fn max_prune_domain_execution_receipt() -> Weight {
    T::WeightInfo::handle_bad_receipt(MAX_BUNLDE_PER_BLOCK)
        .saturating_add(T::DbWeight::get().reads_writes(3, 1))
}
```
Review comment (Member):

The extrinsic weight is not just the database writes and reads; it also includes a base weight. Thus I would prefer to have proper benchmarking like submit_fraud_proof, though this is a non-blocker for this PR.
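As a rough illustration of the reviewer's point, the toy calculation below (with made-up constants, not real benchmark results) shows the base-weight term that the handwritten formula above leaves out:

```rust
/// Toy illustration: a full extrinsic weight includes a fixed base overhead
/// on top of the per-item and database costs, which is what dedicated
/// benchmarking (as done for submit_fraud_proof) would capture.
fn max_prune_weight(max_bundles_per_block: u64) -> u64 {
    const BASE_EXTRINSIC_WEIGHT: u64 = 85_000; // fixed dispatch overhead (assumed)
    const HANDLE_BAD_RECEIPT_PER_BUNDLE: u64 = 200_000; // per-bundle cost (assumed)
    const DB_READ: u64 = 25_000; // per storage read (assumed)
    const DB_WRITE: u64 = 100_000; // per storage write (assumed)
    BASE_EXTRINSIC_WEIGHT
        + HANDLE_BAD_RECEIPT_PER_BUNDLE * max_bundles_per_block
        + 3 * DB_READ // the reads_writes(3, 1) term from the snippet above
        + DB_WRITE
}
```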

@vedhavyas added this pull request to the merge queue on Jul 16, 2024
Merged via the queue into main with commit 8a1e2ed on Jul 16, 2024
9 checks passed
@vedhavyas deleted the domain_freeze_unfreeze branch on July 16, 2024 at 09:30
@vanhauser-thc added the audited label and removed the need to audit label on Aug 22, 2024