Check spawned worker version vs node version before PVF preparation #6861
Conversation
This is for pre-review right now. If the approach is okay, I'll also implement the same logic for execution workers.
Goes in the right direction!
node/core/pvf/src/prepare/worker.rs (Outdated)
    %worker_pid,
    "node and worker version mismatch",
);
std::process::exit(1);
I think you can send a signal to the overseer to let it shut down?
It's the PVF host; it isn't a subsystem by itself, and it's not aware of the overseer's existence 🙄
To handle it the right way, the error should be propagated all the way up to the candidate validation and PVF pre-check subsystems and handled by them separately. Do you think it's worth the effort?
You shouldn't just kill the process here, you should bubble up this error and let the host kill the worker properly. Look at how the host handles e.g. TimedOut for an example. Also, I'm pretty sure this code is running on the host, so exit would kill the node, not the worker, which I think is not what we want? (Haven't followed the disputes issue that closely over the weekend.)
Also, I'm pretty sure this code is running on the host so exit would kill the node, not the worker
Yes, it's exactly what we want. At this point, we've found out that the node owner screwed up the node upgrade process: we're still running the old node software, but the binary on disk is already new, and we're trying to execute a new worker from the old node. In that case, we want to tear the whole node down and let the node owner handle its restart.
Oh, got it. Is any clean-up of the workers necessary, or do they get killed automatically as child processes?
Good point. Unfortunately this means potentially waiting the full lenient timeout duration for a worker to exit. I think it would make sense to just kill them forcefully so we don't wait.
Yes for sure we should kill the workers, they don't do anything important that could get lost when we kill them.
It makes sense to implement @koute's suggestion as well and include Polkadot's version in the artifact path. This way an old version can't overwrite an artifact for the new node.
This should be a separate PR.
We need to wait for the workers to shut down.
Good point! But I've reviewed the code, and it looks like @pepyakin thought this through before us. start_work() on the host side provides the worker with a temporary path where it should store the artifact. If the worker reports a positive result, then in handle_result(), again on the host side, the artifact gets renamed to its real artifact_id path. If the host dies between start_work() and handle_result(), the only leftover is a stale tmp file which will be pruned on the next node restart.
It's still a good idea to kill preparation workers on version mismatch.
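For illustration, here's a minimal sketch of the write-to-a-temporary-path-then-rename pattern described above; the function and parameter names are made up for the example and are not the actual host code.

```rust
use std::{fs, io::Write, path::Path};

/// Persist a compiled artifact atomically: write it to a temporary path first,
/// then rename it into place. Renaming (on the same filesystem) is atomic, so a
/// crash between the two steps leaves only a stale tmp file and never a
/// half-written artifact under the real path.
fn persist_artifact(tmp_path: &Path, artifact_path: &Path, bytes: &[u8]) -> std::io::Result<()> {
    let mut file = fs::File::create(tmp_path)?;
    file.write_all(bytes)?;
    file.sync_all()?; // make sure the data hits the disk before the rename
    fs::rename(tmp_path, artifact_path) // the "commit" step, done by the host in handle_result()
}
```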
Hmm.... thinking about this some more, since we delete the cache anyway at each startup, couldn't we alternatively just hash a random number into the filename? I don't remember if we already do this, but assuming we write the cache atomically to disk (that is, save the cache to a file with e.g. a .tmp extension or into another directory, and then rename it to where we want it; writing is not atomic, but renaming always is), we'd be essentially guaranteed that we won't load a stale file. This would also allow us to simply forcibly kill the worker on exit without having to wait for it to shut down.
@koute you mean some random number we generate on node startup and use throughout the node instance lifetime? I think it's a good approach too, but your first idea about including the node version number in the artifact name is somewhat more... deterministic, I'd say? :)
you mean some random number we generate on node startup and use throughout the node instance lifetime?
Yeah, just generate a random number and then hash that in. Assuming the number is actually random and the hash is cryptographic (e.g. BLAKE2), we'll have a guarantee that no two separate runs will try to load the same files.
I think it's a good approach too, but your first idea about including the node version number in the artifact name is somewhat more... deterministic, I'd say? :)
Well, sure, but considering that we don't want to reuse those cache files after the node restarts what's the point of them having stable filenames across restarts? Isn't that just unnecessary complexity? (:
(Unless we do want to reuse them after a restart, in which case carry on.)
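A minimal sketch of that idea, assuming the `rand`, `blake2`, and `hex` crates; the `wasmtime_` prefix and all names here are illustrative, not the actual artifact-naming code.

```rust
use blake2::{Blake2b512, Digest};

/// A nonce generated once at node startup and reused for the lifetime of this
/// node instance (illustrative; assumes the `rand` crate).
fn session_nonce() -> u64 {
    rand::random()
}

/// Derive a per-instance artifact file name by hashing the PVF code hash
/// together with the startup nonce. Since the nonce is random and BLAKE2 is
/// cryptographic, two separate node runs will never look for the same file.
fn artifact_file_name(code_hash: &[u8], nonce: u64) -> String {
    let mut hasher = Blake2b512::new();
    hasher.update(code_hash);
    hasher.update(nonce.to_le_bytes());
    format!("wasmtime_{}", hex::encode(&hasher.finalize()[..16]))
}
```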
Interesting solution. Just needs a fix to kill the worker properly. I still have to catch up on the disputes issue and will give this PR a closer look then.
Reworked it quite a bit. As @bkchr rightly observed, workers should be killed before the node shutdown. It's crucial to kill preparation workers, as execution workers do not leave anything behind. But execution is a much more frequent event than preparation, so an execution worker is much more likely to recognize a version mismatch first. Because of that, the only option to handle it properly is to propagate the error from any worker to the PVF host and let it command both pipelines to shut down. Then, after both of them have reported back that they killed all the workers, stopped all activity, and are ready to be shut down, the PVF host tears the node down.

In principle, it's very close now to being handled the totally right way, that is, propagating the error one step higher to the candidate validation subsystem to signal the shutdown event to the overseer. It's not complete yet, I definitely need some tests and logical checks.
Really good work on short notice. 👍 But... I'm not really comfortable with the significant increase in the complexity of the pipelines to handle quite a rare corner case. It makes it harder to reason about, review, and make future changes. 😬
For such an exceptional case I think we should enact an exceptional measure: just send a signal to all the child processes to force-kill them. Right at the spot where we first receive a VersionMismatch signal, list all forked processes and send SIGKILL or whatever, and don't bubble further up. Would that be possible? Then we don't need to do all this coordination between the prepare and execute pipelines. It should be safe to kill the processes because they store artifacts in temporary locations.
I believe there's no portable way to list all the child processes (beyond calling …). Let me know if you have better ideas, I'd also like to simplify this.
@s0me0ne-unkn0wn I believe we can just send a kill signal: https://www.man7.org/linux/man-pages/man1/kill.1.html. This looks to be POSIX and cross-platform. And I confirmed with Activity Monitor on my Mac that process groups are a thing here. I'm also now on the FBI watch list after searching "man kill".
Yeah, that's interesting, thank you! Also, a convenient low-level interface is …, and we already have … 😅 😅 😅
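For reference, a rough POSIX-only sketch of that process-group idea (not what was eventually merged), using the `libc` crate, which the node already depends on; the function names are illustrative.

```rust
/// Put the current (worker) process into its own process group right after it
/// is spawned, so the host can later signal the whole group at once.
fn move_into_own_process_group() {
    // setpgid(0, 0): use this process's own PID as the new process group ID.
    unsafe { libc::setpgid(0, 0) };
}

/// Force-kill every process in the given process group.
fn kill_process_group(pgid: libc::pid_t) {
    unsafe { libc::killpg(pgid, libc::SIGKILL) };
}
```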
Good thinking, though any consensus-critical piece should already take this into account. I would be surprised if the DB wasn't properly ACID. The software can die at any time, whether from receiving a kill signal, the plug being pulled, etc.
Revert "Propagate errors and shutdown preparation and execution pipelines properly". This reverts commit b96cc31.
We should clearly not start sending any kill signals around :P If we detect a version mismatch, the most crucial thing is that the worker does not do anything. It should just reject any kind of work and directly shut down. I only briefly checked your latest changes, and they really look like they touch quite a lot of things. I agree with @mrcnski that the current changes are a lot for the small change we want to achieve. I didn't check the code, but if that cannot be done more easily, we may need to do some kind of refactoring? We should clearly not be required to sprinkle the code with …
@bkchr okay, let's limit the scope of the task then. The idea was to try to shut down the whole node if a version mismatch was observed, killing the workers before that, so they could not produce an artifact after the node restart. There are two ways to achieve that: 1) to just … If we don't want to go hard, then maybe we shouldn't try to shut down? Reporting the situation at …
@bkchr In general we shouldn't be liberal with kill signals, but this seems to me like an exceptional scenario. It might actually be more likely for someone's node to die (for a number of reasons), with the same result as a kill, than for them to do an in-place upgrade. I would just kill the workers and not worry about it -- worst-case scenario we get a half-written temporary artifact. Or what other possibilities are you worried about? I definitely agree that we shouldn't complicate this -- increasing the maintenance burden of this code is likely to lead to more bugs and be a net negative outcome. Edit: we can also use …
I also said this, or tried to say this. Killing the workers is fine, but they will kill themselves when they find the version mismatch? The parent process should only force-kill them when it shuts down, but not try to find all processes that we may have started, even in earlier instances, e.g. …
Should probably work for a start, and then later we can improve this!
If we have artifact ids with a random number, as proposed by @koute, it should be fine!
I modeled the scenarios that can take place, and now I believe that even including a random number or a version number is redundant.

The preparation worker produces an artifact in a temporary file. It gets renamed to a real usable artifact on the node side. Only the node instance that started the worker can do such a rename (because the worker signals to do the rename through a socket, and sockets are set up by the node instance on worker spawn).

Now, we have an old node that started an old worker. The worker has not concluded yet, and the node gets upgraded in place. Any new worker started by the old node would report a version mismatch and exit immediately. It doesn't matter at this point if we kill the node or just spam errors into its logs. What matters is that at that point the node has to be restarted, either by the node owner or by our code. Now, if the old worker concludes before the node restart, an artifact is stored in the artifact cache. But the node cannot use it, as it cannot spawn workers anymore because of the version mismatch. When the node restart finally happens, the artifact cache is pruned, so the artifact cannot outlive the node upgrade.

The second scenario: the old worker does not conclude before the node restart. The new node is already started, and after that, the old worker produces a temporary file with the artifact. But that temporary file will never be renamed to a real usable artifact, because that should be done by the node instance that is not there anymore. The only leftover is a useless temporary file, which will be pruned either by the cache pruner or on the next node restart.

Let me know if I'm missing something, but it seems that just ceasing to process new requests when the version mismatch condition is recognized is enough.
So, I pushed a new minimal solution: check for a version mismatch on worker startup, and if a mismatch is detected, send … Tests show that it works pretty well: the node is torn down, and when that happens, all its Unix sockets get closed, which results in other workers getting an I/O error and exiting. Before merging, I'd like to recruit a volunteer to test it under macOS.
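To make the shape of that minimal solution concrete, here is a rough, hypothetical sketch of a worker-side startup check; the function name, log text, and the choice of SIGTERM to the parent process are assumptions for illustration, not the actual implementation.

```rust
/// Compare the node version passed to the worker on its command line against
/// the version this worker binary was compiled with; on mismatch, ask the main
/// node process (our parent) to shut down and exit the worker (POSIX only).
fn check_version_or_tear_down(node_version: &str) {
    let worker_version = env!("CARGO_PKG_VERSION");
    if node_version != worker_version {
        eprintln!(
            "node and worker version mismatch (node: {node_version}, worker: {worker_version}); \
             please restart the node"
        );
        // SAFETY: signalling the parent (the main node process) is the intended effect here.
        unsafe { libc::kill(libc::getppid(), libc::SIGTERM) };
        std::process::exit(1);
    }
}
```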
I can do it. Have any specific testing instructions? Also, is it possible to add some tests to the integration test suite in this module? I don't see why not, e.g. spawn a worker and send a version mismatch, spawn multiple valid workers and then spawn another one and send a version mismatch, etc. NOTE: Right now …
Thank you! I do it like the following: …
Would you also fix dependencies to make …?
Yes, I was thinking about that too; I'm just unfamiliar with the integration tests infrastructure and need to do some research.
I'm seeing unexpected behavior after completing step 8: …
If I remove the node before replacing it, then the version mismatch detection … If I try to overwrite it with … Also, I once randomly got this fun error: …

My recommendations: …
This seems like the best we can do. Then we should be good to go. These changes …

Sample log: see this truncated sample log at https://pastebin.com/raw/LBT0RUP9. It shows that: …
@mrcnski, thanks a lot for testing and feedback! I'll have to leave (1) to you, as on Linux, I believe … I'll try to add some integration tests later (I'll be OOO for several days).
wasmio, right? Enjoy Barcelona!
I pushed a commit addressing (3) (adding a test). (1) actually seems not trivial -- we can't … So, this PR seems good to go from my perspective.
bot merge
Check spawned worker version vs node version before PVF preparation (#6861)

* Check spawned worker version vs node version before PVF preparation
* Address discussions
* Propagate errors and shutdown preparation and execution pipelines properly
* Add logs; Fix execution worker checks
* Revert "Propagate errors and shutdown preparation and execution pipelines properly". This reverts commit b96cc31.
* Don't try to shut down; report the condition and exit worker
* Get rid of `VersionMismatch` preparation error
* Merge master
* Add docs; Fix tests
* Update Cargo.lock
* Kill again, but only the main node process
* Move unsafe code to a common safe function
* Fix libc dependency error on MacOS
* pvf spawning: Add some logging, add a small integration test
* Minor fixes
* Restart CI

Co-authored-by: Marcin S <marcin@realemail.net>
This pull request has been mentioned on Polkadot Forum. There might be relevant details there: https://forum.polkadot.network/t/polkadot-dispute-storm-the-postmortem/2550/1
This pull request has been mentioned on Polkadot Forum. There might be relevant details there: https://forum.polkadot.network/t/ux-of-distributing-multiple-binaries-take-2/2854/3
Closes #6860
Compare worker and node versions and force the node to shut down in case of a mismatch.