-
Notifications
You must be signed in to change notification settings - Fork 378
Add retry mechanism for pov-recovery, fix full-node pov-recovery #2164
Add retry mechanism for pov-recovery, fix full-node pov-recovery #2164
Conversation
Co-authored-by: Bastian Köcher <git@kchr.de>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 (code-wise; don't know enough to reason about business logic)
Co-authored-by: Bastian Köcher <git@kchr.de>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good 🟢
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some last things.
/// Clear `waiting_for_parent` from the given `hash` and do this recursively for all child | ||
/// blocks. | ||
fn clear_waiting_for_parent(&mut self, hash: Block::Hash) { | ||
/// Clear `waiting_for_parent` and `waiting_recovery` for the candidate with `hash`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// Clear `waiting_for_parent` and `waiting_recovery` for the candidate with `hash`. | |
/// Clear `waiting_for_parent` and `waiting_recovery` for the candidate with `hash`. | |
/// |
client/pov-recovery/src/lib.rs
Outdated
let parent_scheduled_for_recovery = | ||
self.candidates.get(&parent).map_or(false, |parent| parent.waiting_recovery); | ||
if self.active_candidate_recovery.is_recovering(&parent) || | ||
parent_scheduled_for_recovery |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let parent_scheduled_for_recovery = | |
self.candidates.get(&parent).map_or(false, |parent| parent.waiting_recovery); | |
if self.active_candidate_recovery.is_recovering(&parent) || | |
parent_scheduled_for_recovery | |
let parent_being_recovered = | |
self.candidates.get(&parent).map_or(false, |parent| parent.waiting_recovery); | |
if parent_being_recovered |
You just updated the comment above waiting_recovery
saying that this is only true
when you either waiting for recovery or it is being recovered.
client/pov-recovery/src/lib.rs
Outdated
@@ -425,22 +533,7 @@ where | |||
|
|||
if do_recover { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This do_recover
can be removed by replacing the break false
above with return
. The logic above also has an error, we set waiting_recovery
to true
directly, but later in the loop we may encounter an error and never reset this flag.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch!
client/pov-recovery/src/lib.rs
Outdated
// This is not necessary in the happy case, as the flag will be cleared on import | ||
// If import fails for any reason however, we still want to have the flag cleared. | ||
self.clear_waiting_recovery(&block_hash); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should assume that import works, if that fails we are trapped here anyway. But we should remove this code, as otherwise we may start downloading the block again while it is imported.
This reverts commit 3809eed.
* Remove `account` * Update and companions * Companion of paritytech/cumulus#2164 * Companion of paritytech/substrate#13159 * Companion of paritytech/cumulus#1909 * Fmt * Companion of paritytech/polkadot#6744 * Companion of paritytech/cumulus#2245 * Companion of paritytech/substrate#12828 * Companion of paritytech/cumulus#2287 * Companion of paritytech/substrate#13592 * Companion of paritytech/cumulus#2308 * Companion of paritytech/substrate#13410 * Companion of paritytech/substrate#13305 * Companion of polkadot-evm/frontier#1050 * TODO weight * TODO weight * Companion of polkadot-evm/frontier#1040 Signed-off-by: Xavier Lau <xavier@inv.cafe> * Remove unused dep * Try fix dev node paritytech/substrate#12828 * Fix the frontier part (#1154) * Fix dev node * Fmt * Bump moonbeam * Bump moonbeam * Remove unnecessary clone * Fix tests --------- Signed-off-by: Xavier Lau <xavier@inv.cafe> Co-authored-by: Guantong <i@gt.email> Co-authored-by: bear <boundless.forest@outlook.com>
This PR introduces some changes to the PoV recovery mechanism.
Retry in case of unavailability
PoV recovery is listening to a stream of pending candidates that is derived from the relay chain import notification stream. Since the candidates are pending, availability is not yet guaranteed. In that case we now just retry the operation.
Currently we retry only once.
Ordered recovery queue
Until now for each recovery operation we added a randomized delay after which the recovery would start. This was not a problem for collators. For full nodes however, we have a configured recovery delay of 150 - 300 seconds. Since the recovery can start anywhere in that range, it could happen that we attempted to recover a child before its parent was available. There is already a mechanism to wait for parents, but it was only waiting for parent currently in recovery, which is not true for queued ones.
This is not changed into an ordered queue that will add randomized recovery slots on insertion, but start recoveries in order.
Finalization
When full-nodes recover a block via PoV (which should be very rare), they are behind by several minutes. Since we are following relay chain final and best block, the node had no way of knowing the status of the recovered block. We are now maintaining a cache of seen finalized blocks. If a block is imported and its hash is in the cache, we immediately set it as finalized.
Closes #2142