
Support Skipping Empty Epoch #948

Closed
skunert opened this issue Jul 24, 2023 · 6 comments · Fixed by #991
Comments

@skunert
Contributor

skunert commented Jul 24, 2023

In one of our warp-sync zombienet tests, a Cumulus collator that embeds a smoldot node is failing:

2023-07-19 17:09:08.787 DEBUG ⋮sync-service-rococo_local_testnet: [four] Sync => HeaderVerifyError(hash=0x6c83…3803, error=VerificationFailed(BabeVerification(BadVrfProof)))
2023-07-19 17:09:08.787  WARN ⋮sync-service-rococo_local_testnet: [four] Error while verifying header 0x6c83…3803: BadVrfProof

Test scenario:

  1. Download databases from remote which contain a few hundred blocks
  2. Start two relay chain and three parachain nodes using these databases
  3. Test parachain warp sync by starting multiple warp-sync collators with different configurations

The main issue is that the relay chain nodes use downloaded databases: the blocks they contain were produced during a past run, when the database was created. There is therefore a slot gap between the blocks in the database and newly produced blocks.
When the test runs now, newly produced blocks carry a slot number that does not fit the epoch smoldot expects.

In Substrate we have logic that checks whether epochs have been skipped and adjusts the epoch_index used in VRF transcript creation.
The same is done during slot claiming and block import. This is why verification succeeds in Substrate nodes but fails in smoldot.
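The core of that adjustment can be sketched roughly as follows. This is a minimal illustration, not Substrate's actual API: it assumes slots are counted from a known genesis slot and that epochs have a fixed duration, and the function name is hypothetical.

```rust
/// Hypothetical helper: the epoch index a slot falls into, counting from
/// the genesis slot. This mirrors the recomputation performed when epochs
/// have been skipped; names and layout are illustrative only.
fn epoch_index_for_slot(slot: u64, genesis_slot: u64, epoch_duration: u64) -> u64 {
    (slot - genesis_slot) / epoch_duration
}

fn main() {
    let genesis_slot = 1000;
    let epoch_duration = 600; // slots per epoch

    // The database ends within epoch 1 (slots 1600..2200), but after the
    // gap the next block is authored in slot 3400.
    let announced_epoch_index = 1;
    let new_block_slot = 3400;

    // Verification must build the VRF transcript with the epoch index
    // that matches the slot, not the stale announced one.
    let actual_epoch_index =
        epoch_index_for_slot(new_block_slot, genesis_slot, epoch_duration);
    assert_eq!(actual_epoch_index, 4);
    assert!(actual_epoch_index > announced_epoch_index); // epochs were skipped
}
```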

Can we have support for skipped epochs in smoldot too?

@tomaka
Contributor

tomaka commented Jul 24, 2023

For what it's worth, I'm a little bit concerned that paritytech/substrate#13135 wasn't reviewed by André or by W3F researchers. Not that I can find any way to attack this, but it seems a bit sketchy to me nonetheless.

@tomaka
Contributor

tomaka commented Jul 25, 2023

Just confirming my understanding of the situation with w3f/polkadot-spec#681 (comment)

Then I think that the change would be to modify epoch_transition_target to also optionally return the "current epoch" (i.e. the epoch that the block that has been verified belongs to), in case its epoch index is modified.

@skunert
Contributor Author

skunert commented Jul 25, 2023

This sounds correct. I think something like this would be needed:

  • For verification of the current header, adjust the epoch_index to match the current slot
  • When a new epoch is announced, take its randomness and authorities but recalculate the epoch index and start slot, taking skipped epochs into account. Emit this as epoch_transition_target
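The second point could look roughly like this. A sketch under the assumption of fixed-duration epochs counted from a genesis slot; the struct and function names are hypothetical, not smoldot's or Substrate's real types.

```rust
/// Hypothetical epoch descriptor; only the fields relevant here.
/// In practice, randomness and authorities would be carried over unchanged.
#[derive(Debug, Clone, PartialEq)]
struct Epoch {
    index: u64,
    start_slot: u64,
}

/// Recompute the index and start slot of a newly announced epoch so that
/// it covers the slot the announcement arrived in, skipping empty epochs.
fn fast_forward(announcement_slot: u64, genesis_slot: u64, epoch_duration: u64) -> Epoch {
    let index = (announcement_slot - genesis_slot) / epoch_duration;
    Epoch {
        index,
        start_slot: genesis_slot + index * epoch_duration,
    }
}

fn main() {
    // An epoch-change digest arrives in slot 3450 after a long gap
    // (genesis slot 1000, 600 slots per epoch): the new epoch is
    // fast-forwarded to index 4, starting at slot 3400.
    let e = fast_forward(3450, 1000, 600);
    assert_eq!(e.index, 4);
    assert_eq!(e.start_slot, 3400); // 1000 + 4 * 600
}
```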

@tomaka
Contributor

tomaka commented Jul 26, 2023

I'm actually instead considering removing the epoch_index and start_slot fields from everywhere, and instead put a start_slot in the finalized_block_epoch_information field.

I don't like the kind of hack that Substrate does where you have data flying around that corresponds to nothing.

@tomaka
Contributor

tomaka commented Aug 8, 2023

> I'm actually instead considering removing the epoch_index and start_slot fields from everywhere, and instead put a start_slot in the finalized_block_epoch_information field.

There's a problem with that plan: if we don't track the starting slot of an epoch, we can't detect a block performing an epoch change when it isn't normally authorized to do so.

In the context of a full node, this isn't a problem, as executing the block performs this check anyway.
In the context of a light client, however, this would make it easier for a malicious authority to "hijack" the chain. Security would technically not decrease, because this check is only a mitigation, but it's still not great.

This problem could be solved by tracking more things, but the point of this small refactoring was to simplify the logic. If we have to track more things, the refactoring no longer simplifies the code and is thus pointless.

@tomaka
Contributor

tomaka commented Aug 8, 2023

> Then I think that the change would be to modify epoch_transition_target to also optionally return the "current epoch" (i.e. the epoch that the block that has been verified belongs to), in case its epoch index is modified.

There's also no need for that, as we can easily recalculate the number of skipped epochs at each block.
A check is already in place ensuring that slot numbers strictly increase, so there shouldn't be any way for this to go wrong.
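Recomputing the skip on the fly could look like this. A sketch only: it assumes the verifier knows the stored epoch's start slot and the chain's epoch duration, and the function name is hypothetical.

```rust
/// Hypothetical helper: how many whole epoch-lengths lie between the
/// epoch a block's slot actually falls in and the epoch the verifier
/// has stored. Relies on the existing check that slot numbers strictly
/// increase, so `block_slot >= known_epoch_start_slot` always holds.
fn skipped_epochs(block_slot: u64, known_epoch_start_slot: u64, epoch_duration: u64) -> u64 {
    (block_slot - known_epoch_start_slot) / epoch_duration
}

fn main() {
    // Known epoch starts at slot 1600, 600 slots per epoch; a block
    // shows up in slot 3400, three epoch-lengths later.
    assert_eq!(skipped_epochs(3400, 1600, 600), 3);
    // A block within the known epoch skips nothing.
    assert_eq!(skipped_epochs(1700, 1600, 600), 0);
}
```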
