There is a suspicion that different validators are proposing to each other widely different parent-chain heights for finalization. Currently this is possible because of the following:
When looking at how to restrict the sync height, I thought we start from the last finalized height and add 100 as the limit, but I'm not sure any more: the code looks at last_recorded_height, which in turn calls last_recorded_block, which ultimately calls provider.latest_height; that is not the tip of the parent chain, but the highest cached height.
Then it adds 100 on top of that, continuously expanding the cache size.
```rust
let cache = from_parent.blocks[10000..10100];
let last_recorded_height = cache.last().height; // 10100
let parent_chain_height = 15000; // say it's way ahead
let max_end_height = 15000 - 20; // say we have a finalization delay of 20
let starting_height = last_recorded_height + 1; // 10101
let ending_height = min(max_end_height, starting_height + 100); // 10201
// now it goes and adds another 100 blocks to the cache
```
Instead, what we want is to expand our pre-fetch cache only by 100 blocks at a time, finalize that, then move on, limiting the amount of memory and IO needed to finalize a new block. However, we have the complication of null blocks with Lotus.
We can do so by:
- going ahead by a maximum look-ahead from the last finalized height;
- if the block there is null, searching backwards for the last non-null block;
- if all of them are null, then, and only then, going further ahead by another maximum look-ahead.
The effect should be that even if we extend the cache by e.g. 200 heights, there will only be 100 non-null blocks in it.
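The steps above can be sketched as follows; `next_fetch_end`, `is_null`, and the slice of null heights are all illustrative names, not the actual fendermint API, and this assumes we can query whether a given parent height is a null block:

```rust
// Sketch of the null-aware look-ahead. All names here are illustrative.
fn is_null(height: u64, null_heights: &[u64]) -> bool {
    null_heights.contains(&height)
}

/// Pick the end of the next fetch range starting from the last finalized
/// height: go ahead by `max_look_ahead`; if that height is null, walk back
/// to the last non-null block in the window; only if *every* height in the
/// window is null do we extend by another `max_look_ahead`.
fn next_fetch_end(
    last_finalized: u64,
    parent_tip: u64,
    max_look_ahead: u64,
    null_heights: &[u64],
) -> u64 {
    let mut end = (last_finalized + max_look_ahead).min(parent_tip);
    loop {
        // search backwards for the last non-null block in the window
        let mut h = end;
        while h > last_finalized && is_null(h, null_heights) {
            h -= 1;
        }
        if h > last_finalized {
            return h; // found a non-null block to stop at
        }
        if end == parent_tip {
            return end; // nothing further to extend into
        }
        // the whole window was null: extend by another look-ahead
        end = (end + max_look_ahead).min(parent_tip);
    }
}

fn main() {
    // heights 101..=200 are all null, so the window extends once
    let nulls: Vec<u64> = (101..=200).collect();
    assert_eq!(next_fetch_end(100, 1000, 100, &nulls), 300);
    // no nulls: stop exactly one look-ahead past the finalized height
    assert_eq!(next_fetch_end(100, 1000, 100, &[]), 200);
    println!("ok");
}
```

The backwards search is what keeps the cache growth bounded: the window only ever widens when it contains nothing proposable at all.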
Then, the next_proposal function would have to change as well. Currently, if the last recorded height is null, it doesn't propose anything but waits for the cache to be extended by another fetch (up to 100 blocks) and tries again then. Instead, it would have to look back in the cache and find the last non-null block it can propose, if there is any.
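A minimal sketch of that lookback, over a toy cache of `(height, Option<BlockHash>)` pairs where `None` stands for a null block; the real cache type in fendermint is different, this only shows the backwards search:

```rust
type BlockHash = [u8; 32];

/// Instead of returning None when the newest cached height is null, walk
/// back through the cache and propose the last non-null block, if any.
fn next_proposal(cache: &[(u64, Option<BlockHash>)]) -> Option<(u64, BlockHash)> {
    cache
        .iter()
        .rev()
        .find_map(|&(h, block)| block.map(|b| (h, b)))
}

fn main() {
    let cache = vec![
        (100, Some([1u8; 32])),
        (101, None), // null block
        (102, None), // null block
    ];
    // previously: no proposal until the cache grows; now: propose height 100
    assert_eq!(next_proposal(&cache), Some((100, [1u8; 32])));
    // a cache of only null blocks still yields no proposal
    assert_eq!(next_proposal(&[(1, None)]), None);
    println!("ok");
}
```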
This way the target height for finality would hover around the same ballpark for every node in the network, since they all share the last_finalized_height, and we'd never go ahead more than 100 blocks from that, unless they are all null blocks.
Alternative
One alternative we discussed was the re-introduction of a fixed top_down_checkpoint_period. However, this would be limiting once we are already in sync and only have to finalize 1 block every 30 seconds; at that point it doesn't make sense to hold blocks back and finalize only every 10th one instead of finalizing them ASAP.
Repro steps
I didn't look at the logs, but the spirit of introducing MAX_PARENT_VIEW_BLOCK_GAP was exactly to limit the size of the cache and the proposed finality; somehow it was defeated by the next fetch going further ahead anyway.
What we see is nodes struggling to agree on what to finalize, and this is one possible cause, because if we have 3 validators whose caches span [0-100], [0-200] and [0-300], respectively, then:
- validator 3 proposes block 300, which nobody else votes on;
- validator 2 proposes block 200, which validator 3 votes on but validator 1 doesn't, so the proposal still fails (quorum needs >2/3 power);
- validator 1 proposes block 100, which gets votes.
Meanwhile they all put more and more burden on their memory, and when they finally do finalize a block, they have to apply a huge amount of change.
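The failure mode above can be checked with a bit of arithmetic; this is a toy model with three equal-power validators, not the actual vote-tally code:

```rust
// Toy quorum check: a proposal passes only if the voting power behind it
// is strictly greater than 2/3 of the total power.
fn has_quorum(votes_power: u64, total_power: u64) -> bool {
    3 * votes_power > 2 * total_power
}

fn main() {
    let total = 3; // three validators with power 1 each, caches [0-100], [0-200], [0-300]
    // validator 3 proposes height 300: only its own vote
    assert!(!has_quorum(1, total));
    // validator 2 proposes height 200: validators 2 and 3 vote, exactly 2/3, not enough
    assert!(!has_quorum(2, total));
    // validator 1 proposes height 100: everyone has it cached, all three vote
    assert!(has_quorum(3, total));
    println!("ok");
}
```

Note that with equal power, 2 of 3 votes is exactly 2/3, which is why validator 2's proposal fails even with a second vote.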
Relevant log output
No response
Issue type
Bug
Have you reproduced the bug with the latest dev version?
Yes
Version
v0.1.0
Custom code
No
OS platform and distribution
Linux
Describe the issue
See https://filecoinproject.slack.com/archives/C04JR5R1UL8/p1700111601341039