-
Notifications
You must be signed in to change notification settings - Fork 1.6k
Block ActiveLeavesUpdate
and BlockFinalized
events in overseer until major sync is complete
#6689
Conversation
ActiveLeavesUpdate
and BlockFinalized
events in overseer until major sunc is completeActiveLeavesUpdate
and BlockFinalized
events in overseer until major sync is complete
8b59b56
to
447698d
Compare
447698d
to
2244cff
Compare
Overseer won't generate any events to the subsystems until initial full sync is complete.
2244cff
to
70d031d
Compare
ActiveLeavesUpdate
and BlockFinalized
events in overseer until major sync is completeActiveLeavesUpdate
and BlockFinalized
events in overseer until major sync is complete
node/overseer/src/lib.rs
Outdated
self.handle_external_request(request); | ||
} | ||
Event::MsgToSubsystem { msg, origin } => self.handle_msg_to_subsystem(msg, origin).await?, | ||
Event::Stop => return self.handle_stop().await, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A concern: I think with current impl Ctrl-C before major sync is complete should not stall/deadlock subsystems, but would be nice to double check that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How should this stall the subsystems?
I've added some tests to cover the cases we care for + some corner ones. The final step is to refactor the loops so that code duplication is minimal 😟 |
node/overseer/src/lib.rs
Outdated
self.handle_external_request(request); | ||
} | ||
Event::MsgToSubsystem { msg, origin } => self.handle_msg_to_subsystem(msg, origin).await?, | ||
Event::Stop => return self.handle_stop().await, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How should this stall the subsystems?
node/overseer/src/lib.rs
Outdated
// In theory we can receive `BlockImported` during initial major sync. In this case the | ||
// update will be empty. | ||
let span = match self.span_per_active_leaf.get(&block.hash) { | ||
Some(span) => span.clone(), | ||
None => { | ||
// This should never happen. | ||
gum::warn!( | ||
target: LOG_TARGET, | ||
?block.hash, | ||
?block.number, | ||
"Span for active leaf not found. This is not expected" | ||
); | ||
let span = Arc::new(jaeger::Span::new(block.hash, "leaf-activated")); | ||
span | ||
} | ||
}; | ||
let update = ActiveLeavesUpdate::start_work(ActivatedLeaf { | ||
hash: block.hash, | ||
number: block.number, | ||
status: LeafStatus::Fresh, | ||
span, | ||
}); | ||
self.broadcast_signal(OverseerSignal::ActiveLeaves(update)).await?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't get why we need this branch? When we are receving a block import notification while doing major sync, we should not even land in the match block here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the finality has stalled and no BlockImported
events are generated the subsystems in polkadot won't start because they initiate work on ActiveLeavesUpdate
.
Right now this is solved by issuing an ActiveLeavesUpdate
on startup with the leaves from DB. This is causing problems though because subsystems start work for old leaves, the state from them is usually pruned and some subsystems generate a bunch of logs on stratup.
Then we decided to send this initial ActiveLeavesUpdate
after the major sync is done. But here I see two cases:
- Finality is stalled or we don't get
BlockImported
events for some reason. -> In this case we won't issueActiveLeavesUpdate
and the subsystems will be 'inactive'. - Everything is fine and we get
BlockImported
at some point after the initial major sync. We don't need any special handling in this case.
So the branch you are referring to tries to solve1. We just got BlockFinalized
, the initial major sync is complete and I'm trying to guess if an ActiveLeavesUpdate
will be generated.
What I do:
- Generate an artificial
BlockImported
event by callingself.block_imported(..)
. It is supposed to generate non-emptyActiveLeaves
in 99% of the cases but what if it doesn't? - The fat else handles this 'what it doesn't' case with all the stuff that can go wrong.
I wrote tests to cover some of the cases and today I'm thinking about how to simplify this. I don't like it too :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the finality has stalled and no
BlockImported
events are generated the subsystems in polkadot won't start because they initiate work onActiveLeavesUpdate
.
I don't get the connection? When finality stalls, we will still import new blocks?
- Finality is stalled or we don't get
BlockImported
events for some reason. -> In this case we won't issueActiveLeavesUpdate
and the subsystems will be 'inactive'.
This shouldn't happen.
I mean in general you could track when major sync is finished and then send the ActiveLeavesUpdate
for all current leaves. However, I'm not really sure that we need this as there will always be some block import (yes with stalled finality it will be slower, but there will be blocks imported.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't get the connection? When finality stalls, we will still import new blocks?
For me 'finality has stalled' means 'no new blocks are generated'. I'm misunderstanding something here - can you explain why this is not true?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are never not producing any new blocks. This is not going to happen, we may slow down block production but not more. There is block production and then there is finality. The block production is for building new blocks and put them on top of the chain. Finality is the process of saying that a certain chain up to a certain point can not be modified anymore, because between the last finalized block and the tip of the chain you can have forks, re-orgs and also go back to older blocks to produce on them. However, the moment we have finalized a block, we can not go before this block.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are never not producing any new blocks.
I see... so this means after the initial sync we'll always have BlockImported
events and all the precautions in my PR make no sense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe 👀
I mean it can happen that we don't produce any blocks, but then we have much bigger problems as Parachain subsystems not working properly 😬
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, what is even the point of acting on a leaf read from DB? Worst thing that can happen is that a single node does not work immediately on a leaf after a quick restart, which should be no problem at all. Hence, I think this can be simplified by simply not triggering ActiveLeavesUpdate
on startup on a leaf read from db, but only for regular block import events.
Subsystems which do care about older blocks, do traverse the ancestry already.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's far more elegant this way!
Co-authored-by: Bastian Köcher <git@kchr.de>
e680a77
to
6bdb390
Compare
6ead261
to
f8467d8
Compare
The CI pipeline was cancelled due to failure one of the required jobs. |
a813802
to
2e14f46
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am confused. Haven't we just agreed that all that is needed is removal of the initial artificial ActiveLeavesUpdate
?
@@ -746,12 +785,28 @@ where | |||
Event::Stop => { | |||
self.stop().await; | |||
return Ok(()); | |||
} | |||
}, | |||
Event::BlockImported(block) if self.sync_oracle.is_syncing() => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we still need the oracle? If I understand correctly, we don't get any BlockImported
events anyway during sync.
I'm closing this one in favor of #6727 Let's try the simpler solution first. |
Adds major sync oracle to overseer. The oracle is used to detect when the node has completed an initial full sync.
The PR also changes the behavior of
ActiveLeaves
andBlockFinalized
events generation. Now they are not sent until the full initial sync is complete.Partially addresses paritytech/polkadot-sdk#793. Fixes #6694.