Check spawned worker version vs node version before PVF preparation #6861
Conversation
This is for pre-review right now. If the approach is okay, I'll also implement the same logic for execution workers.
Goes in the right direction!
node/core/pvf/src/prepare/worker.rs (Outdated)
    %worker_pid,
    "node and worker version mismatch",
);
std::process::exit(1);
I think you can send a signal to the overseer to let it shut down?
It's the PVF host; it isn't a subsystem by itself, and it's not aware of the overseer's existence 🙄
To handle it the right way, the error should be propagated all the way up to the candidate validation and PVF pre-check subsystems and handled by them separately. Do you think it's worth the effort?
You shouldn't just kill the process here, you should bubble up this error and let the host kill the worker properly. Look at how the host handles e.g. TimedOut for an example. Also, I'm pretty sure this code is running on the host, so exit would kill the node, not the worker, which I think is not what we want? (Haven't followed the disputes issue that closely over the weekend.)
Also, I'm pretty sure this code is running on the host so exit would kill the node, not the worker
Yes, it's exactly what we want. At this point, we've found out that the node owner screwed up the node upgrade process: we're still running the old node software, but the binary on disk is already new, and we're trying to execute a new worker from the old node. In that case, we want to tear the whole node down and let the node owner handle its restart.
Oh, got it. Is any clean-up of the workers necessary, or do they get killed automatically as child processes?
Good point. Unfortunately this means potentially waiting the full lenient timeout duration for a worker to exit. I think it would make sense to just kill them forcefully so we don't wait.
Yes for sure we should kill the workers, they don't do anything important that could get lost when we kill them.
It makes sense to implement @koute's suggestion as well and include Polkadot's version in the artifact path. This way an old version can't overwrite an artifact for the new node.
This should be a separate PR.
We need to wait for the workers to shut down.
Good point! But I've reviewed the code, and it looks like @pepyakin thought this through before us. start_work() on the host side provides the worker with a temporary path where it should store the artifact. If the worker reports a positive result, then in handle_result(), again on the host side, the artifact gets renamed to its real artifact_id path. If the host dies between start_work() and handle_result(), the only leftover is a stale tmp file which will be pruned on the next node restart.
It's still a good idea to kill preparation workers on version mismatch.
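For illustration, here's a minimal sketch of the write-to-a-temporary-path-then-rename pattern described above; the function and parameter names are made up for the example and are not the actual host code.

```rust
use std::{fs, io::Write, path::Path};

/// Persist a compiled artifact atomically: write it to a temporary path first,
/// then rename it into place. Renaming (on the same filesystem) is atomic, so a
/// crash between the two steps leaves only a stale tmp file and never a
/// half-written artifact under the real path.
fn persist_artifact(tmp_path: &Path, artifact_path: &Path, bytes: &[u8]) -> std::io::Result<()> {
    let mut file = fs::File::create(tmp_path)?;
    file.write_all(bytes)?;
    file.sync_all()?; // make sure the data hits the disk before the rename
    fs::rename(tmp_path, artifact_path) // the "commit" step, done by the host in handle_result()
}
```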
Hmm.... thinking about this some more, since we delete the cache anyway at each startup, couldn't we alternatively just hash a random number into the filename? I don't remember if we already do this, but assuming we write the cache atomically to disk (that is, save the cache to a file with e.g. a .tmp extension or into another directory, and then rename it to where we want it; writing is not atomic, but renaming always is), we'd be essentially guaranteed that we won't load a stale file. This would also allow us to simply forcibly kill the worker on exit without having to wait for it to shut down.
@koute you mean some random number we generate on node startup and use throughout the node instance lifetime? I think it's a good approach too, but your first idea about including the node version number in the artifact name is somewhat more... deterministic, I'd say? :)
you mean some random number we generate on node startup and use throughout the node instance lifetime?
Yeah, just generate a random number and then hash that in. Assuming the number is actually random and the hash is cryptographic (e.g. BLAKE2), we'll have a guarantee that no two separate runs will try to load the same files.
I think it's a good approach too, but your first idea about including the node version number in the artifact name is somewhat more... deterministic, I'd say? :)
Well, sure, but considering that we don't want to reuse those cache files after the node restarts what's the point of them having stable filenames across restarts? Isn't that just unnecessary complexity? (:
(Unless we do want to reuse them after a restart, in which case carry on.)
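A minimal sketch of that idea, assuming the `rand`, `blake2`, and `hex` crates; the `wasmtime_` prefix and all names here are illustrative, not the actual artifact-naming code.

```rust
use blake2::{Blake2b512, Digest};

/// A nonce generated once at node startup and reused for the lifetime of this
/// node instance (illustrative; assumes the `rand` crate).
fn session_nonce() -> u64 {
    rand::random()
}

/// Derive a per-instance artifact file name by hashing the PVF code hash
/// together with the startup nonce. Since the nonce is random and BLAKE2 is
/// cryptographic, two separate node runs will never look for the same file.
fn artifact_file_name(code_hash: &[u8], nonce: u64) -> String {
    let mut hasher = Blake2b512::new();
    hasher.update(code_hash);
    hasher.update(nonce.to_le_bytes());
    format!("wasmtime_{}", hex::encode(&hasher.finalize()[..16]))
}
```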
Interesting solution. Just needs a fix to kill the worker properly. I still have to catch up on the disputes issue and will give this PR a closer look then.
Reworked it quite a bit. As @bkchr rightly observed, workers should be killed before the node shutdown. It's crucial to kill preparation workers, as execution workers do not leave anything behind. But execution is a much more frequent event than preparation, so an execution worker is much more likely to recognize a version mismatch first. Because of that, the only option to handle it properly is to propagate the error from any worker to the PVF host and let it command both pipelines to shut down. Then, after both of them have reported back that they killed all the workers, stopped all activity, and are ready to be shut down, the PVF host tears the node down.

In principle, it's very close now to being handled the totally right way, that is, propagating the error one step higher to the candidate validation subsystem to signal the shutdown event to the overseer. It's not complete yet, I definitely need some tests and logical checks.
Really good work on short notice. 👍 But... I'm not really comfortable with the significant increase in the complexity of the pipelines to handle quite a rare corner case. It makes it harder to reason about, review, and make future changes. 😬
For such an exceptional case I think we should enact an exceptional measure: just send a signal to all the child processes to force-kill them. Right at the spot where we first receive a VersionMismatch signal, list all forked processes and send SIGKILL or whatever, and don't bubble further up. Would that be possible? Then we don't need to do all this coordination between the prepare and execute pipelines. It should be safe to kill the processes because they store artifacts in temporary locations.
I believe there's no portable way to list all the child processes (beyond calling …). Let me know if you have better ideas, I'd also like to simplify this.
@s0me0ne-unkn0wn I believe we can just send a kill signal: https://www.man7.org/linux/man-pages/man1/kill.1.html. This looks to be POSIX and cross-platform. And I confirmed with Activity Monitor on my Mac that process groups are a thing here. I'm also now on the FBI watch list after searching "man kill".
Yeah, that's interesting, thank you! Also, a convenient low-level interface is …, and we already have … 😅 😅 😅
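For reference, a rough POSIX-only sketch of that process-group idea (not what was eventually merged), using the `libc` crate, which the node already depends on; the function names are illustrative.

```rust
/// Put the current (worker) process into its own process group right after it
/// is spawned, so the host can later signal the whole group at once.
fn move_into_own_process_group() {
    // setpgid(0, 0): use this process's own PID as the new process group ID.
    unsafe { libc::setpgid(0, 0) };
}

/// Force-kill every process in the given process group.
fn kill_process_group(pgid: libc::pid_t) {
    unsafe { libc::killpg(pgid, libc::SIGKILL) };
}
```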
Good thinking, though any consensus-critical piece should already take this into account. I would be surprised if the DB wasn't properly ACID. The software can die at any time, whether from receiving a kill signal, the plug being pulled, etc.
Revert "Propagate errors and shutdown preparation and execution pipelines properly". This reverts commit b96cc31.
We should clearly not start sending any kill signals around :P If we detect a version mismatch, the most crucial thing is that the worker does not do anything. It should just reject any kind of work and directly shut down. I only briefly checked your latest changes, and they really look like they touch quite a lot of things. I agree with @mrcnski that the current changes are a lot for the small change we want to achieve. I didn't check the code, but if that cannot be done more easily, we may need to do some kind of refactoring? We should clearly not be required to sprinkle the code with …
@bkchr okay, let's limit the scope of the task then. The idea was to try to shut down the whole node if a version mismatch was observed, killing the workers before that, so they could not produce an artifact after the node restart. There are two ways to achieve that: 1) to just … If we don't want to go hard, then maybe we shouldn't try to shut down? Reporting the situation at …
@bkchr In general we shouldn't be liberal with kill signals, but this seems to me like an exceptional scenario. It might actually be more likely for someone's node to die (for a number of reasons), with the same result as a kill, than for them to do an in-place upgrade. I would just kill the workers and not worry about it -- worst-case scenario we get a half-written temporary artifact. Or what other possibilities are you worried about? I definitely agree that we shouldn't complicate this -- increasing the maintenance burden of this code is likely to lead to more bugs and be a net negative outcome. Edit: we can also use …
I also said this, or tried to say this. Killing the workers is fine, but they will kill themselves when they find the version mismatch? The parent process should only force-kill them when it shuts down, but not try to find all processes that we may have started, even in earlier instances, e.g. …
Should probably work for a start, and then later we can improve this!
If we have artifact ids with a random number, as proposed by @koute, it should be fine!
I modeled the scenarios that can take place, and now I believe that even including a random number or a version number is redundant.

The preparation worker produces an artifact in a temporary file. It gets renamed to a real usable artifact on the node side. Only the node instance that started the worker can do such a rename (because the worker signals to do the rename through a socket, and sockets are set up by the node instance on worker spawn).

Now, we have an old node that started an old worker. The worker has not concluded yet, and the node gets upgraded in place. Any new worker started by the old node would report a version mismatch and exit immediately. It doesn't matter at this point if we kill the node or just spam errors into its logs. What matters is that at that point the node has to be restarted, either by the node owner or by our code. Now, if the old worker concludes before the node restart, an artifact is stored in the artifact cache. But the node cannot use it, as it cannot spawn workers anymore because of the version mismatch. When the node restart finally happens, the artifact cache is pruned, so the artifact cannot outlive the node upgrade.

The second scenario: the old worker does not conclude before the node restart. The new node is already started, and after that, the old worker produces a temporary file with the artifact. But that temporary file will never be renamed to a real usable artifact, because that should be done by the node instance that is not there anymore. The only leftover is a useless temporary file, which will be pruned either by the cache pruner or on the next node restart.

Let me know if I'm missing something, but it seems that just ceasing to process new requests when the version mismatch condition is recognized is enough.
So, I pushed a new minimal solution: check for a version mismatch on worker startup, and if a mismatch is detected, send … Tests show that it works pretty well: the node is torn down, and when that happens, all its Unix sockets get closed, which results in other workers getting an I/O error and exiting. Before merging, I'd like to recruit a volunteer to test it under macOS.
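To make the shape of that minimal solution concrete, here is a rough, hypothetical sketch of a worker-side startup check; the function name, log text, and the choice of SIGTERM to the parent process are assumptions for illustration, not the actual implementation.

```rust
/// Compare the node version passed to the worker on its command line against
/// the version this worker binary was compiled with; on mismatch, ask the main
/// node process (our parent) to shut down and exit the worker (POSIX only).
fn check_version_or_tear_down(node_version: &str) {
    let worker_version = env!("CARGO_PKG_VERSION");
    if node_version != worker_version {
        eprintln!(
            "node and worker version mismatch (node: {node_version}, worker: {worker_version}); \
             please restart the node"
        );
        // SAFETY: signalling the parent (the main node process) is the intended effect here.
        unsafe { libc::kill(libc::getppid(), libc::SIGTERM) };
        std::process::exit(1);
    }
}
```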
I can do it. Have any specific testing instructions? Also, is it possible to add some tests to the integration test suite in this module? I don't see why not, e.g. spawn a worker and send a version mismatch, spawn multiple valid workers and then spawn another one and send a version mismatch, etc. NOTE: Right now …
Thank you! I do it like the following: …
Would you also fix dependencies to make …?
Yes, I was thinking about that too; I'm just unfamiliar with the integration tests infrastructure and need to do some research.
I'm seeing unexpected behavior after completing step 8: …
If I remove the node before replacing it, then the version mismatch detection … If I try to overwrite it with … Also, I once randomly got this fun error: …

My recommendations: …
This seems like the best we can do. Then we should be good to go. These changes …

Sample log: see this truncated sample log at https://pastebin.com/raw/LBT0RUP9. It shows that: …
@mrcnski, thanks a lot for testing and feedback! I'll have to leave (1) to you, as on Linux, I believe … I'll try to add some integration tests later (I'll be OOO for several days).
wasmio, right? Enjoy Barcelona!
I pushed a commit addressing (3) (adding a test). (1) actually seems not trivial -- we can't … So, this PR seems good to go from my perspective.
bot merge
Check spawned worker version vs node version before PVF preparation (#6861)

* Check spawned worker version vs node version before PVF preparation
* Address discussions
* Propagate errors and shutdown preparation and execution pipelines properly
* Add logs; Fix execution worker checks
* Revert "Propagate errors and shutdown preparation and execution pipelines properly". This reverts commit b96cc31.
* Don't try to shut down; report the condition and exit worker
* Get rid of `VersionMismatch` preparation error
* Merge master
* Add docs; Fix tests
* Update Cargo.lock
* Kill again, but only the main node process
* Move unsafe code to a common safe function
* Fix libc dependency error on MacOS
* pvf spawning: Add some logging, add a small integration test
* Minor fixes
* Restart CI

Co-authored-by: Marcin S <marcin@realemail.net>
This pull request has been mentioned on Polkadot Forum. There might be relevant details there: https://forum.polkadot.network/t/polkadot-dispute-storm-the-postmortem/2550/1
This pull request has been mentioned on Polkadot Forum. There might be relevant details there: https://forum.polkadot.network/t/ux-of-distributing-multiple-binaries-take-2/2854/3
Closes #6860
Compare worker and node versions and force the node to shut down in case of a mismatch.