BABE skipped epochs #681

andresilva · 2023-07-24T16:08:05Z

Whenever there is a period of inactivity (i.e. no blocks being produced) for more than one epoch BABE is broken, since we announce epochs in advance and by the time the inactivity is resolved we will be in a future epoch which was unnanounced (i.e. no authorities and randomness available). The security model of BABE is also broken in this scenario, but for a long time in the Substrate implementation at least, this would completely brick the chain and make it impossible to author new blocks. As of paritytech/substrate#11727 this was changed, such that empty epochs can be skipped over, i.e. after a period of inactivity the runtime will appropriately update the epoch index based on the given slot. We are still left with the problem of what epoch data to use and the natural thing to do is to re-use the data for the latest announced epoch. As an example, assuming the last block was authored on a slot at epoch 3, the data for epoch 4 had already been announced. Given a period of inactivity spanning 3 epochs, a block is then authored with a slot belonging to epoch 6. In order to author this block, the client should re-use the data that was previously published for epoch 4.

andresilva · 2023-07-24T16:36:13Z

This problem is also described by the Ouroboros team in this paper in chapter 2. They describe a manual mitigation which is the same as we have employed in early 2020 on Kusama (we never had to do it on Polkadot mainnet). They do not provide any description of an automatic mitigation method though.

Noc2 · 2023-07-24T17:40:51Z

Thanks a lot for creating the issue! We will look into it as soon as possible.

hndnklnc · 2023-07-25T06:48:19Z

Do we need a solution for only one empty epoch or more than one sequential epochs?

tomaka · 2023-07-25T08:09:37Z

Does the first header after a period of inactivity contain an epoch change digest? My understanding is yes.

For example, let's say that we're in the middle of epoch 3, and all of a sudden all the validators die. Two weeks later, they come back online.
Is it correct that the first authored block is in epoch 9 (or whatever the number is) and announces epoch 10 (or whatever the number is)? And epoch 9 uses the authorities and randomness of epoch 4?

andresilva · 2023-07-25T09:32:31Z

@tomaka Everything you said is correct.

@hndnklnc Yes, we need to be able to recover from any kind of disaster scenario. The most likely scenario for this to happen is due to a client bug that prevents block production, and e.g. with epochs on Polkadot being 6 hours, it's not unlikely that a fix would take longer than this time to be developed and rolled out.

Not sure if it's obvious but this issue is about updating the spec to reflect the solution that was already implemented and deployed. AFAIK the current SASSAFRAS implementation is also employing the same solution, so if there is a better solution to handle this problem I suggest we use it there.

hndnklnc · 2023-07-25T10:04:09Z

I think the current solution makes sense as far as I understand. I just want to verify this: Assume that epochs from epoch e+1 to e+n are skipped. The validators in epoch e + n +1 use the randomness from epoch e-1, right?

andresilva · 2023-07-27T14:50:16Z

I think we might be using slightly different definitions (i.e. randomness collection vs announcement), let me expand on your example. If epochs E+1 to E+n are skipped then the last block was authored at epoch E, at which point the randomness to be used in E+1 had already been announced. When coming back online, the validators at epoch E+n+1 will use the randomness that was meant to be used in epoch E+1 (which was collected during epoch E-1).

bhargavbh · 2023-08-08T10:33:33Z

Hi @andresilva thanks for pointing this out. I created a short PR addressing it here #685, can you please take a look.

tomaka · 2023-08-08T10:55:41Z

For what it's worth, the approach I'm going to take in smoldot is to completely remove the concept of epoch_index from the client side. The transcript instead contains (slot_of_block - first_slot_ever) / slots_per_epoch.

It's not a bad thing that on the runtime side the epoch index is still tracked in case it is needed in the future, but on the client side there really isn't any need for this concept anymore.

IMO the modification to the spec should follow a similar approach, because adding a little addendum somewhere creates confusion more than anything else.

tomaka · 2023-08-08T12:10:39Z

Thinking about it more, the interface between the runtime and the client is part of the spec as well and returns an epoch index. It feels weird to not mention that the epoch index returned by the runtime isn't at least partially tied to the actual epoch index calculated in the transcript.

So basically I'm opinion-less.

andresilva · 2023-08-08T14:19:10Z

@tomaka I agree that as it stands the epoch index is superflous since it can be trivially calculated. I think this is mainly due to the fact that right now we are relying on unix time to calculate the current slot rather than the "median algorithm". Also for the reason you mentioned I think we should keep the epoch index mention in the spec (regardless of whether client implementations actually need to keep track of it or just calculate it on-the-fly).

andresilva · 2023-08-09T10:13:35Z

Closing after #685.

tomaka · 2023-08-09T10:28:36Z

For what it's worth, the approach I'm going to take in smoldot is to completely remove the concept of epoch_index from the client side. The transcript instead contains (slot_of_block - first_slot_ever) / slots_per_epoch.

For future reference, tracking only the "first slot ever" has a drawback, which is that it's not possible to detect when a block contains an epoch change when it's not supposed to.

This can be solved by tracking both the first slot ever and also the starting slot of the current epoch.
But in the end I've kept the concept of epoch_index in smoldot, as it didn't require a lot of code changes.

andresilva mentioned this issue Jul 24, 2023

Babe: bad epoch index with skipped epochs and warp sync paritytech/substrate#13135

Merged

Noc2 assigned bhargavbh Jul 24, 2023

tomaka mentioned this issue Jul 25, 2023

Support Skipping Empty Epoch smol-dot/smoldot#948

Closed

andresilva closed this as completed Aug 9, 2023

EclesioMeloJunior mentioned this issue Aug 9, 2023

BABE: added reusing last epoch data in case of skipped epochs ChainSafe/gossamer#3430

Closed

5 tasks

kamilsa mentioned this issue Aug 10, 2023

Implement skipped epochs qdrvm/kagome#1729

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BABE skipped epochs #681

BABE skipped epochs #681

andresilva commented Jul 24, 2023

andresilva commented Jul 24, 2023 •

edited

Loading

Noc2 commented Jul 24, 2023

hndnklnc commented Jul 25, 2023

tomaka commented Jul 25, 2023 •

edited

Loading

andresilva commented Jul 25, 2023

hndnklnc commented Jul 25, 2023

andresilva commented Jul 27, 2023 •

edited

Loading

bhargavbh commented Aug 8, 2023

tomaka commented Aug 8, 2023 •

edited

Loading

tomaka commented Aug 8, 2023

andresilva commented Aug 8, 2023

andresilva commented Aug 9, 2023

tomaka commented Aug 9, 2023 •

edited

Loading

BABE skipped epochs #681

BABE skipped epochs #681

Comments

andresilva commented Jul 24, 2023

andresilva commented Jul 24, 2023 • edited Loading

Noc2 commented Jul 24, 2023

hndnklnc commented Jul 25, 2023

tomaka commented Jul 25, 2023 • edited Loading

andresilva commented Jul 25, 2023

hndnklnc commented Jul 25, 2023

andresilva commented Jul 27, 2023 • edited Loading

bhargavbh commented Aug 8, 2023

tomaka commented Aug 8, 2023 • edited Loading

tomaka commented Aug 8, 2023

andresilva commented Aug 8, 2023

andresilva commented Aug 9, 2023

tomaka commented Aug 9, 2023 • edited Loading

andresilva commented Jul 24, 2023 •

edited

Loading

tomaka commented Jul 25, 2023 •

edited

Loading

andresilva commented Jul 27, 2023 •

edited

Loading

tomaka commented Aug 8, 2023 •

edited

Loading

tomaka commented Aug 9, 2023 •

edited

Loading