Allow subsystems to finish work before shutdown: availability recovery and approvals #985
I think currently the Conclude signal will not be sent when the node is shutting down. Looking at the code, it requires manually calling the code that sends the signal, which is currently not done. To fix this issue properly, it probably requires some changes to Substrate, or at least better integration into the shutdown process of the node. The node is shut down when some essential task has shut down or when it receives a signal to shut down; this is provided directly by Substrate. When shutting down, Substrate tells the event loop to shut down, meaning the overseer would need to detect when it is dropped.
This speeds up finality by something like 24 seconds occasionally, but it's not a priority for testnets or even the MVP. Actually, the testnet might do better seeing the bad behavior. And for the MVP, it's whatever you guys think is best.
In the case of PVF execution, we've also encountered the
Ok, the problem I am seeing here is two-fold:

In any case, the biggest blocker is actually the handling in Substrate. I tried sending the

For the time being, regarding the issue of ambiguous worker death: I am leaning towards the retry approach, as it already makes disputes due to restarts extremely unlikely, especially if we don't persist the knowledge that we already tried once, because then a restart would mean another two tries (if we even still care about the candidate). In addition, we can handle SIGTERM and similar signals in candidate-validation directly and ignore any ambiguous worker death if we received such a signal beforehand, maybe even with a little buffering: if we receive an ambiguous worker death, we only report it if we don't also receive SIGTERM within 100ms or so. This is all very hacky though; the simple retry (with delay) approach should probably be good enough for now, until we get to implement the proper shutdown sequence on top.
Is it actually the case that operating systems typically shut down processes within a few seconds of SIGTERM? I'm no expert here, but in my experience, processes hanging due to stuck SIGTERM handlers are quite common. We could, and probably should, also install a signal handler in the child processes to communicate back that SIGTERM killed the process, so that the candidate-validation subsystem can handle this accordingly. Yes, the retry approach is probably necessary in any case, but it seems like we should have both.
The approval voting protocol treats validators who broadcast their assignment but not their corresponding approval as "no-shows". In all likelihood, nodes that shut down instantaneously will leave some dangling assignments and thus will accidentally slow down finality. We should alter the handling of the `Conclude` message in `AvailabilityRecovery` and `ApprovalVoting` to finish all in-progress recoveries and approval checks before shutting down the subsystem.

I think this should come with a refactoring of our entire shutdown process for the overseer. The flow should go something like this instead:

1. Send `BeginConclude(response)` to each subsystem.
2. Wait up to some time `T` (~seconds) to receive responses from each subsystem.
3. Send `Conclude` to each subsystem and then wait for them to shut down.

Subsystems should handle `Conclude` as an immediate shutdown signal. `BeginConclude` should instruct the subsystem to prepare to shut down: `ApprovalVoting` would not broadcast any new assignments after receiving this signal, backing should not kick off any seconding or validation work after receiving this signal, etc.