
Limit the amount of look-ahead in sync #157

Closed
aakoshh opened this issue Nov 16, 2023 · 0 comments
Labels
bug Something isn't working s:fendermint

Comments

@aakoshh
Contributor

aakoshh commented Nov 16, 2023

Issue type

Bug

Have you reproduced the bug with the latest dev version?

Yes

Version

v0.1.0

Custom code

No

OS platform and distribution

Linux

Describe the issue

See https://filecoinproject.slack.com/archives/C04JR5R1UL8/p1700111601341039

There is a suspicion that different validators are trying to propose largely different heights of the parent chain to each other for finalization. Currently this is possible because of the following:

When looking at how the sync height is restricted, I thought we start from the last finalized height and add 100 as the limit, but I'm not sure any more: it looks at last_recorded_height, which in turn calls last_recorded_block, which ultimately calls provider.lastest_height, and that is not the tip of the parent chain, but the highest cached height.

Then it adds 100 on top of that, continuously expanding the cache size.

let cache = from_parent.blocks[10000 ..= 10100];
let last_recorded_height = cache.last().height; // 10100
let parent_chain_height = 15000; // say it's way ahead
let max_ending_height = parent_chain_height - 20; // say we have a finalization delay of 20
let starting_height = last_recorded_height + 1; // 10101
let ending_height = min(max_ending_height, starting_height + 100); // 10201

// now it goes and adds another 100 blocks to the cache

Instead, what we want is to expand our pre-fetch cache by only 100 blocks at a time, finalize that, then move on, limiting the amount of memory and IO needed to finalize a new block. However, we have the complication of null blocks with Lotus.

We can do so by:

  1. Go ahead by at most the maximum look-ahead from the last finalized height.
  2. If the block at that height is null, search backwards for the last non-null block.
  3. If all of them are null, then, and only then, go further ahead by another maximum look-ahead.

The effect should be that even if we extend the cache by e.g. 200 heights, there will be at most 100 non-null blocks in it.
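
A minimal sketch of what this could look like (the names extend_cache, fetch_block, MAX_LOOK_AHEAD and the Vec<Option<B>> cache representation are all hypothetical, not the actual fendermint API):

const MAX_LOOK_AHEAD: u64 = 100;

/// Extend the cache starting right after the last finalized height, going at most
/// MAX_LOOK_AHEAD heights ahead of it; only if every height in that window turned
/// out to be null do we go further ahead by another MAX_LOOK_AHEAD.
/// `max_height` is the parent chain head minus the finalization delay.
fn extend_cache<B, F>(last_finalized: u64, max_height: u64, fetch_block: F) -> Vec<Option<B>>
where
    F: Fn(u64) -> Option<B>, // None means the parent height was a null round
{
    let mut cache: Vec<Option<B>> = Vec::new();
    let mut end = last_finalized + MAX_LOOK_AHEAD;

    loop {
        let window_end = end.min(max_height);
        let start = last_finalized + 1 + cache.len() as u64;

        for height in start..=window_end {
            cache.push(fetch_block(height));
        }

        // Stop as soon as the window contains a non-null block, or we hit the cap.
        if cache.iter().any(Option::is_some) || window_end == max_height {
            return cache;
        }

        // All of them were null: then, and only then, look further ahead.
        end += MAX_LOOK_AHEAD;
    }
}

A single extension then adds at most MAX_LOOK_AHEAD non-null blocks, however many null heights it has to skip over.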

Then, the next_proposal function would be changed as well. Currently, if the last recorded height is null, it doesn't propose anything but waits for the cache to be extended by another fetch (up to 100 blocks) and tries again then. Instead, it would have to look back in the cache and find the last non-null block it can propose, if there is any.
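
On the proposal side, a sketch of that look-back (again with hypothetical names; the cache is treated as the Vec<Option<B>> from the sketch above, where index 0 corresponds to last_finalized + 1):

/// Instead of giving up when the newest cached height is a null round, walk the
/// cache backwards and propose the newest non-null block, if there is one.
fn next_proposal<B>(cache: &[Option<B>], first_height: u64) -> Option<(u64, &B)> {
    cache
        .iter()
        .enumerate()
        .rev()
        .find_map(|(i, block)| block.as_ref().map(|b| (first_height + i as u64, b)))
}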

This way the target height for finality would hover around the same ballpark for every node in the network, since they share the last_finalized_height, and we would never go ahead more than 100 from that, unless all the blocks in between are null.

Alternative

One alternative we discussed was the re-introduction of a fixed top_down_checkpoint_period. However, this would be limiting when we are already in sync and only have to finalize 1 block every 30 seconds; in that case it doesn't make sense to hold off finalizing blocks ASAP just to do every 10th block or so.

Repro steps

I didn't look at the logs, but the spirit of introducing MAX_PARENT_VIEW_BLOCK_GAP was exactly to limit the size of the cache and the proposed finality; somehow that was defeated by going further ahead on the next fetch anyway.

What we see is nodes struggling to agree on what to finalize, and this is one possible cause: if we have 3 validators whose caches go from [0-100], [0-200], [0-300], respectively, then

  • validator 3 proposes block 300, which nobody else votes on,
  • validator 2 proposes block 200, which validator 3 votes on but validator 1 doesn't, so the proposal still fails (quorum needs >2/3 power)
  • validator 1 proposes block 100, which gets votes

Meanwhile they all put more and more pressure on their memory, and when they finally finalize a block, they have to apply a huge amount of change.
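
For reference, the quorum arithmetic behind that example (assuming three equal-weight validators and the usual >2/3 voting power threshold):

/// A proposal passes only if the voting power behind it strictly exceeds 2/3 of the total.
fn has_quorum(votes_power: u64, total_power: u64) -> bool {
    3 * votes_power > 2 * total_power
}

// With 3 equal-weight validators:
// has_quorum(1, 3) == false  // block 300: only validator 3 votes
// has_quorum(2, 3) == false  // block 200: validators 2 and 3, exactly 2/3, still fails
// has_quorum(3, 3) == true   // block 100: everyone can vote for it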

Relevant log output

No response

@aakoshh aakoshh added the bug Something isn't working label Nov 16, 2023
@jsoares jsoares transferred this issue from consensus-shipyard/fendermint Dec 19, 2023
@jsoares jsoares closed this as not planned Mar 13, 2024