Allow subsystems to finish work before shutdown: availability recovery and approvals #985
I think currently the Conclude signal will not be sent when the node is shutting down. Looking at the code, it requires manually calling the code that sends the signal, which is currently not done. To fix this issue properly, it probably requires some changes to Substrate, or at least better integration into the shutdown process of the node. The node is shut down when some essential task has shut down or when it receives a signal to shut down; this is provided directly by Substrate. When shutting down, Substrate tells the event loop to shut down, meaning the overseer would need to detect when it is dropped.
This speeds up finality by something like 24 seconds occasionally, but it's not a priority for testnets or even the MVP. Actually, the testnet might do better seeing the bad behavior. And for the MVP, it's whatever you guys think is best.
In the case of PVF execution, we've also encountered the
Ok, the problem I am seeing here is two-fold:

In any case, the biggest blocker is actually the handling in Substrate. I tried sending the

For the time being, regarding the issue of ambiguous worker death: I am leaning towards the retry approach, as it already makes disputes due to restarts extremely unlikely, especially if we don't persist the knowledge that we already tried once, because then a restart would mean another two tries (if we even still care about the candidate). In addition, we can handle SIGTERM and similar signals in candidate-validation directly and ignore any ambiguous worker death if we received such a signal beforehand, maybe even with a little buffering: if we receive an ambiguous worker death, we only report it if we don't also receive SIGTERM within 100ms or so. This is all very hacky though; the simple retry (with delay) approach should probably be good enough for now, until we get to implement the proper shutdown sequence on top.
Is it actually the case that operating systems typically shut down processes within a few seconds of SIGTERM? I'm no expert here, but in my experience, processes hanging due to stuck SIGTERM handlers are quite common. We could, and probably should, also install a signal handler in the child processes to communicate back that SIGTERM killed the process, so that the candidate-validation subsystem can handle this accordingly. Yes, the retry approach is probably necessary in any case, but it seems like we should have both.
The approval voting protocol treats validators who broadcast their assignment but not their corresponding approval as "no-shows". In all likelihood, nodes that shut down instantaneously will leave some dangling assignments and thus will accidentally slow down finality. We should alter the handling of the `Conclude` message in `AvailabilityRecovery` and `ApprovalVoting` to finish all in-progress recoveries and approval checks before shutting down the subsystem.

I think this should come with a refactoring of our entire shutdown process for the overseer. The flow should go something like this instead:

1. Send `BeginConclude(response)` to each subsystem.
2. Wait up to some time `T` (~seconds) to receive responses from each subsystem.
3. Send `Conclude` to each subsystem and then wait for them to shut down.

Subsystems should handle `Conclude` as an immediate shutdown signal. `BeginConclude` should instruct the subsystem to prepare to shut down: `ApprovalVoting` would not broadcast any new assignments after receiving this signal, backing should not kick off any seconding or validation work after receiving this signal, etc.