Skip to content
This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

Block ActiveLeavesUpdate and BlockFinalized events in overseer until major sync is complete #6689

Closed
wants to merge 19 commits into from

Conversation

tdimitrov
Copy link
Contributor

@tdimitrov tdimitrov commented Feb 8, 2023

Adds major sync oracle to overseer. The oracle is used to detect when the node has completed an initial full sync.

The PR also changes the behavior of ActiveLeaves and BlockFinalized events generation. Now they are not sent until the full initial sync is complete.

Partially addresses paritytech/polkadot-sdk#793. Fixes #6694.

@tdimitrov tdimitrov changed the title [DNM] Block ActiveLeavesUpdate and BlockFinalized events in overseer until major sunc is complete [DNM] Block ActiveLeavesUpdate and BlockFinalized events in overseer until major sync is complete Feb 9, 2023
@tdimitrov tdimitrov force-pushed the tsv-block-overseer-events-during-sync branch from 8b59b56 to 447698d Compare February 9, 2023 08:51
node/overseer/src/lib.rs Outdated Show resolved Hide resolved
@tdimitrov tdimitrov force-pushed the tsv-block-overseer-events-during-sync branch from 447698d to 2244cff Compare February 9, 2023 14:57
Overseer won't generate any events to the subsystems until initial full
sync is complete.
@tdimitrov tdimitrov force-pushed the tsv-block-overseer-events-during-sync branch from 2244cff to 70d031d Compare February 10, 2023 11:47
@tdimitrov tdimitrov changed the title [DNM] Block ActiveLeavesUpdate and BlockFinalized events in overseer until major sync is complete Block ActiveLeavesUpdate and BlockFinalized events in overseer until major sync is complete Feb 10, 2023
@tdimitrov tdimitrov added I4-annoyance Code behaves within expectations, however this “expected behaviour” itself is at issue. C1-low PR touches the given topic and has a low impact on builders. T4-parachains_engineering This PR/Issue is related to Parachains performance, stability, maintenance. A0-pleasereview labels Feb 10, 2023
node/overseer/src/lib.rs Outdated Show resolved Hide resolved
node/overseer/src/lib.rs Outdated Show resolved Hide resolved
self.handle_external_request(request);
}
Event::MsgToSubsystem { msg, origin } => self.handle_msg_to_subsystem(msg, origin).await?,
Event::Stop => return self.handle_stop().await,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A concern: I think with current impl Ctrl-C before major sync is complete should not stall/deadlock subsystems, but would be nice to double check that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How should this stall the subsystems?

node/overseer/src/lib.rs Outdated Show resolved Hide resolved
node/overseer/src/lib.rs Outdated Show resolved Hide resolved
@tdimitrov
Copy link
Contributor Author

I've added some tests to cover the cases we care for + some corner ones. The final step is to refactor the loops so that code duplication is minimal 😟

self.handle_external_request(request);
}
Event::MsgToSubsystem { msg, origin } => self.handle_msg_to_subsystem(msg, origin).await?,
Event::Stop => return self.handle_stop().await,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How should this stall the subsystems?

node/overseer/src/lib.rs Outdated Show resolved Hide resolved
Comment on lines 812 to 834
// In theory we can receive `BlockImported` during initial major sync. In this case the
// update will be empty.
let span = match self.span_per_active_leaf.get(&block.hash) {
Some(span) => span.clone(),
None => {
// This should never happen.
gum::warn!(
target: LOG_TARGET,
?block.hash,
?block.number,
"Span for active leaf not found. This is not expected"
);
let span = Arc::new(jaeger::Span::new(block.hash, "leaf-activated"));
span
}
};
let update = ActiveLeavesUpdate::start_work(ActivatedLeaf {
hash: block.hash,
number: block.number,
status: LeafStatus::Fresh,
span,
});
self.broadcast_signal(OverseerSignal::ActiveLeaves(update)).await?;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't get why we need this branch? When we are receving a block import notification while doing major sync, we should not even land in the match block here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the finality has stalled and no BlockImported events are generated the subsystems in polkadot won't start because they initiate work on ActiveLeavesUpdate.

Right now this is solved by issuing an ActiveLeavesUpdate on startup with the leaves from DB. This is causing problems though because subsystems start work for old leaves, the state from them is usually pruned and some subsystems generate a bunch of logs on stratup.

Then we decided to send this initial ActiveLeavesUpdate after the major sync is done. But here I see two cases:

  1. Finality is stalled or we don't get BlockImported events for some reason. -> In this case we won't issue ActiveLeavesUpdate and the subsystems will be 'inactive'.
  2. Everything is fine and we get BlockImported at some point after the initial major sync. We don't need any special handling in this case.

So the branch you are referring to tries to solve1. We just got BlockFinalized, the initial major sync is complete and I'm trying to guess if an ActiveLeavesUpdate will be generated.

What I do:

  • Generate an artificial BlockImported event by calling self.block_imported(..). It is supposed to generate non-empty ActiveLeaves in 99% of the cases but what if it doesn't?
  • The fat else handles this 'what it doesn't' case with all the stuff that can go wrong.

I wrote tests to cover some of the cases and today I'm thinking about how to simplify this. I don't like it too :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the finality has stalled and no BlockImported events are generated the subsystems in polkadot won't start because they initiate work on ActiveLeavesUpdate.

I don't get the connection? When finality stalls, we will still import new blocks?

  1. Finality is stalled or we don't get BlockImported events for some reason. -> In this case we won't issue ActiveLeavesUpdate and the subsystems will be 'inactive'.

This shouldn't happen.

I mean in general you could track when major sync is finished and then send the ActiveLeavesUpdate for all current leaves. However, I'm not really sure that we need this as there will always be some block import (yes with stalled finality it will be slower, but there will be blocks imported.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't get the connection? When finality stalls, we will still import new blocks?

For me 'finality has stalled' means 'no new blocks are generated'. I'm misunderstanding something here - can you explain why this is not true?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are never not producing any new blocks. This is not going to happen, we may slow down block production but not more. There is block production and then there is finality. The block production is for building new blocks and put them on top of the chain. Finality is the process of saying that a certain chain up to a certain point can not be modified anymore, because between the last finalized block and the tip of the chain you can have forks, re-orgs and also go back to older blocks to produce on them. However, the moment we have finalized a block, we can not go before this block.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are never not producing any new blocks.

I see... so this means after the initial sync we'll always have BlockImported events and all the precautions in my PR make no sense?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe 👀

I mean it can happen that we don't produce any blocks, but then we have much bigger problems as Parachain subsystems not working properly 😬

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, what is even the point of acting on a leaf read from DB? Worst thing that can happen is that a single node does not work immediately on a leaf after a quick restart, which should be no problem at all. Hence, I think this can be simplified by simply not triggering ActiveLeavesUpdate on startup on a leaf read from db, but only for regular block import events.

Subsystems which do care about older blocks, do traverse the ancestry already.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's far more elegant this way!

node/overseer/src/lib.rs Outdated Show resolved Hide resolved
@tdimitrov tdimitrov force-pushed the tsv-block-overseer-events-during-sync branch from e680a77 to 6bdb390 Compare February 14, 2023 17:41
@tdimitrov tdimitrov force-pushed the tsv-block-overseer-events-during-sync branch from 6ead261 to f8467d8 Compare February 14, 2023 18:48
@paritytech-cicd-pr
Copy link

The CI pipeline was cancelled due to failure one of the required jobs.
Job name: test-linux-stable
Logs: https://gitlab.parity.io/parity/mirrors/polkadot/-/jobs/2392699

@tdimitrov tdimitrov force-pushed the tsv-block-overseer-events-during-sync branch from a813802 to 2e14f46 Compare February 15, 2023 14:41
@tdimitrov tdimitrov marked this pull request as ready for review February 15, 2023 14:41
@tdimitrov tdimitrov requested review from bkchr and eskimor February 15, 2023 14:42
@tdimitrov tdimitrov added the B0-silent Changes should not be mentioned in any release notes label Feb 15, 2023
Copy link
Member

@eskimor eskimor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am confused. Haven't we just agreed that all that is needed is removal of the initial artificial ActiveLeavesUpdate?

@@ -746,12 +785,28 @@ where
Event::Stop => {
self.stop().await;
return Ok(());
}
},
Event::BlockImported(block) if self.sync_oracle.is_syncing() => {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we still need the oracle? If I understand correctly, we don't get any BlockImported events anyway during sync.

@tdimitrov
Copy link
Contributor Author

I'm closing this one in favor of #6727

Let's try the simpler solution first.

@tdimitrov tdimitrov closed this Feb 16, 2023
@tdimitrov tdimitrov deleted the tsv-block-overseer-events-during-sync branch February 16, 2023 08:11
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
B0-silent Changes should not be mentioned in any release notes C1-low PR touches the given topic and has a low impact on builders. I4-annoyance Code behaves within expectations, however this “expected behaviour” itself is at issue. T4-parachains_engineering This PR/Issue is related to Parachains performance, stability, maintenance.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

overseer: signal completion of major sync
5 participants