There is a suspicion that different validators are proposing to each other widely different parent-chain heights for finalization. Currently this is possible because of the following:
When looking at how to restrict the sync height, I thought we start from the last finalized height and add 100 as the limit, but I'm not sure any more: the code looks at last_recorded_height, which in turn calls last_recorded_block, which ultimately calls provider.latest_height; that is not the tip of the parent chain, but the highest cached height.
Then it adds 100 on top of that, continuously expanding the cache size.
```rust
let cache = from_parent.blocks[10000..10100];
let last_recorded_height = cache.last().height; // 10100
let parent_chain_height = 15000; // say it's way ahead
let max_end_height = 15000 - 20; // say we have a finalization delay of 20
let starting_height = last_recorded_height + 1; // 10101
let ending_height = min(max_end_height, starting_height + 100); // 10201
// now it goes and adds another 100 blocks to the cache
```
Instead, what we want is to expand our pre-fetch cache only by 100 blocks at a time, finalize that, then move on, limiting the amount of memory and IO needed to finalize a new block. However, we have the complication of null blocks with Lotus.
We can do so by:
- going ahead by a maximum look-ahead from the last finalized height;
- if the block there is null, searching backwards for the last non-null block;
- if all of them are null, then, and only then, going further ahead by another maximum look-ahead.
The effect should be that even if we extend the cache by e.g. 200 heights, there will only be 100 non-null blocks in it.
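The steps above can be sketched as follows; `next_fetch_end`, `is_null`, and the slice of null heights are all illustrative names, not the actual fendermint API, and this assumes we can query whether a given parent height is a null block:

```rust
// Sketch of the null-aware look-ahead. All names here are illustrative.
fn is_null(height: u64, null_heights: &[u64]) -> bool {
    null_heights.contains(&height)
}

/// Pick the end of the next fetch range starting from the last finalized
/// height: go ahead by `max_look_ahead`; if that height is null, walk back
/// to the last non-null block in the window; only if *every* height in the
/// window is null do we extend by another `max_look_ahead`.
fn next_fetch_end(
    last_finalized: u64,
    parent_tip: u64,
    max_look_ahead: u64,
    null_heights: &[u64],
) -> u64 {
    let mut end = (last_finalized + max_look_ahead).min(parent_tip);
    loop {
        // search backwards for the last non-null block in the window
        let mut h = end;
        while h > last_finalized && is_null(h, null_heights) {
            h -= 1;
        }
        if h > last_finalized {
            return h; // found a non-null block to stop at
        }
        if end == parent_tip {
            return end; // nothing further to extend into
        }
        // the whole window was null: extend by another look-ahead
        end = (end + max_look_ahead).min(parent_tip);
    }
}

fn main() {
    // heights 101..=200 are all null, so the window extends once
    let nulls: Vec<u64> = (101..=200).collect();
    assert_eq!(next_fetch_end(100, 1000, 100, &nulls), 300);
    // no nulls: stop exactly one look-ahead past the finalized height
    assert_eq!(next_fetch_end(100, 1000, 100, &[]), 200);
    println!("ok");
}
```

The backwards search is what keeps the cache growth bounded: the window only ever widens when it contains nothing proposable at all.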
Then, the next_proposal function would have to change as well. Currently, if the last recorded height is null, it doesn't propose anything but waits for the cache to be extended by another fetch (up to 100 blocks) and tries again then. Instead, it would have to look back in the cache and find the last non-null block it can propose, if there is any.
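A minimal sketch of that lookback, over a toy cache of `(height, Option<BlockHash>)` pairs where `None` stands for a null block; the real cache type in fendermint is different, this only shows the backwards search:

```rust
type BlockHash = [u8; 32];

/// Instead of returning None when the newest cached height is null, walk
/// back through the cache and propose the last non-null block, if any.
fn next_proposal(cache: &[(u64, Option<BlockHash>)]) -> Option<(u64, BlockHash)> {
    cache
        .iter()
        .rev()
        .find_map(|&(h, block)| block.map(|b| (h, b)))
}

fn main() {
    let cache = vec![
        (100, Some([1u8; 32])),
        (101, None), // null block
        (102, None), // null block
    ];
    // previously: no proposal until the cache grows; now: propose height 100
    assert_eq!(next_proposal(&cache), Some((100, [1u8; 32])));
    // a cache of only null blocks still yields no proposal
    assert_eq!(next_proposal(&[(1, None)]), None);
    println!("ok");
}
```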
This way the target height for finality would hover around the same ballpark for every node in the network, since they all share the last_finalized_height, and we'd never go ahead more than 100 blocks from that, unless they are all null blocks.
Alternative
One alternative we discussed was the re-introduction of a fixed top_down_checkpoint_period. However, this would be limiting once we are already in sync and only have to finalize 1 block every 30 seconds; at that point it doesn't make sense to hold blocks back and finalize only every 10th one instead of finalizing them ASAP.
Repro steps
I didn't look at the logs, but the spirit of introducing MAX_PARENT_VIEW_BLOCK_GAP was exactly to limit the size of the cache and the proposed finality; somehow it was defeated by the next fetch going further ahead anyway.
What we see is nodes struggling to agree on what to finalize, and this is one possible cause, because if we have 3 validators whose caches span [0-100], [0-200] and [0-300], respectively, then:
- validator 3 proposes block 300, which nobody else votes on;
- validator 2 proposes block 200, which validator 3 votes on but validator 1 doesn't, so the proposal still fails (quorum needs >2/3 power);
- validator 1 proposes block 100, which gets votes.
Meanwhile they all put more and more burden on their memory, and when they finally do finalize a block, they have to apply a huge amount of change.
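The failure mode above can be checked with a bit of arithmetic; this is a toy model with three equal-power validators, not the actual vote-tally code:

```rust
// Toy quorum check: a proposal passes only if the voting power behind it
// is strictly greater than 2/3 of the total power.
fn has_quorum(votes_power: u64, total_power: u64) -> bool {
    3 * votes_power > 2 * total_power
}

fn main() {
    let total = 3; // three validators with power 1 each, caches [0-100], [0-200], [0-300]
    // validator 3 proposes height 300: only its own vote
    assert!(!has_quorum(1, total));
    // validator 2 proposes height 200: validators 2 and 3 vote, exactly 2/3, not enough
    assert!(!has_quorum(2, total));
    // validator 1 proposes height 100: everyone has it cached, all three vote
    assert!(has_quorum(3, total));
    println!("ok");
}
```

Note that with equal power, 2 of 3 votes is exactly 2/3, which is why validator 2's proposal fails even with a second vote.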
Relevant log output
No response
Issue type
Bug
Have you reproduced the bug with the latest dev version?
Yes
Version
v0.1.0
Custom code
No
OS platform and distribution
Linux
Describe the issue
See https://filecoinproject.slack.com/archives/C04JR5R1UL8/p1700111601341039