Skip to content
This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

Check spawned worker version vs node version before PVF preparation #6861

Merged
merged 20 commits into from
Mar 29, 2023

Conversation

s0me0ne-unkn0wn
Copy link
Contributor

Closes #6860

Compare worker and node versions and force the node to shut down in case of a mismatch.

@s0me0ne-unkn0wn
Copy link
Contributor Author

This is for pre-review right now. It the approach is okay, I'll also implement the same logic for execution workers.

Copy link
Member

@bkchr bkchr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Goes into the right direction!

cli/src/cli.rs Outdated Show resolved Hide resolved
node/core/pvf/src/prepare/worker.rs Outdated Show resolved Hide resolved
node/core/pvf/src/prepare/worker.rs Outdated Show resolved Hide resolved
%worker_pid,
"node and worker version mismatch",
);
std::process::exit(1);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can send a signal to the overseer to let it shutdown?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's PVF host, it isn't a subsystem by itself, and it's not aware of overseer's existence 🙄
To handle it the right way, the error should be propagated all the way up to the candidate validation and PVF pre-check subsystems and handled by them separately. Do you think it's worth the effort?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You shouldn't just kill the process here, you should bubble up this error and let the host kill the worker properly. Look at how the host handles e.g. TimedOut for an example. Also, I'm pretty sure this code is running on the host so exit would kill the node, not the worker, which I think is not what we want? (Haven't followed the disputes issue that closely over the weekend.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I'm pretty sure this code is running on the host so exit would kill the node, not the worker

Yes, it's exactly what we want. At this point, we found out that the node owner screwed up the node upgrade process and we're still running an old node software, but the binary is already new, and we're trying to execute a new worker from the old node. In that case, we want to tear the whole node down and let the node owner handle its restart.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, got it. Is any clean-up of the workers necessary, or do they get killed automatically as child processes?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Unfortunately this means potentially waiting the full lenient timeout duration for a worker to exit. I think it would make sense to just kill them forcefully so we don't wait.

Yes for sure we should kill the workers, they don't do anything important that could get lost when we kill them.

It makes sense to implement @koute suggestion as well and include polkadot's version into the artifact path. This way old version can't override an artifact for the new node.

This should be a separate pr.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to wait for the workers to shutdown.

Good point! But I've reviewed the code and looks like @pepyakin thought this through before us.

start_work() on the host side provides the worker with a temporary path where it should store the artifact. If the worker reports a positive result, then in handle_result(), again on the host side, the artifact gets renamed to its real artifact_id path. If the host dies between start_work() and handle_result(), the only leftover is a stale tmp file which will be pruned on the next node restart.

It's still a good idea to kill preparation workers on version mismatch.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm.... thinking about this some more, since we delete the cache anyway at each startup couldn't we alternatively just hash a random number into the filename? I don't remember if we already do this, but assuming we write the cache atomically to disk (that is, save the cache to a file with e.g. .tmp extension or into another directory, and then rename it to where we want it; writing is not atomic, but renaming always is) we'd be essentially guaranteed that we won't load a stale file. This would also allow us to just simply forcibly kill the worker on exit without having to wait for it to shut down.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@koute you mean some random number we generate on node startup and use throughout the node instance lifetime? I think it's a good approach also, but your first idea about including the node version number to the artifact name is somewhat more... Deterministic, I'd say? :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you mean some random number we generate on node startup and use throughout the node instance lifetime?

Yeah, just generate a random number and then just hash that in. Assuming the number is actually random and the hash is cryptographic (e.g. BLAKE2) we'll have a guarantee that no two separate runs will try to load the same files.

I think it's a good approach also, but your first idea about including the node version number to the artifact name is somewhat more... Deterministic, I'd say? :)

Well, sure, but considering that we don't want to reuse those cache files after the node restarts what's the point of them having stable filenames across restarts? Isn't that just unnecessary complexity? (:

(Unless we do want to reuse them after a restart, in which case carry on.)

@s0me0ne-unkn0wn s0me0ne-unkn0wn added A0-please_review Pull request needs code review. B0-silent Changes should not be mentioned in any release notes C1-low PR touches the given topic and has a low impact on builders. T4-parachains_engineering This PR/Issue is related to Parachains performance, stability, maintenance. labels Mar 12, 2023
Copy link
Contributor

@mrcnski mrcnski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting solution. Just needs a fix to kill the worker properly. I still have to catch up on the disputes issue and will give this PR a closer look then.

@s0me0ne-unkn0wn
Copy link
Contributor Author

Reworked it quite a bit. As @bkchr rightly observed, workers should be killed before the node shutdown. It's crucial to kill preparation workers, as execution workers do not leave anything behind. But the execution is a much more frequent event than the preparation, so an execution worker is much more likely to recognize a version mismatch first. Because of that, the only option to handle it properly is to propagate the error from any worker to the PVF host and let it command both pipelines to shut down. Then, after both of them reported back they killed all the workers, stopped all the activity, and are ready to be shut down, the PVF host tears the node down.

In principle, it's very close now to being handled the totally right way, that is, propagating the error one step higher to the candidate validation subsystem to signal the shutdown event to the overseer.

It's not complete yet, I definitely need some tests and logical checks.

Copy link
Contributor

@mrcnski mrcnski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really good work on short notice. 👍 But... I'm not really comfortable with the significant increase in the complexity of the pipelines, to handle a quite rare corner case. It makes it harder to reason about and review and make future changes. 😬

For such an exceptional case I think we should enact an exceptional measure: just send a signal to all the child processes to force-kill them. Right at the spot where we first receive a VersionMismatch signal, list all forked processes and send SIGKILL or whatever, and don't bubble further up. Would that possible? Then we don't need to do all this coordination between the prepare and execute pipelines. It should be safe to kill the processes because they store artifacts in temporary locations.

node/core/pvf/src/execute/queue.rs Outdated Show resolved Hide resolved
node/core/pvf/src/execute/queue.rs Outdated Show resolved Hide resolved
node/core/pvf/src/execute/queue.rs Outdated Show resolved Hide resolved
node/core/pvf/src/execute/queue.rs Outdated Show resolved Hide resolved
node/core/pvf/src/execute/worker.rs Outdated Show resolved Hide resolved
node/core/pvf/src/prepare/worker.rs Show resolved Hide resolved
node/core/pvf/src/execute/worker.rs Outdated Show resolved Hide resolved
@s0me0ne-unkn0wn
Copy link
Contributor Author

Right at the spot where we first receive a VersionMismatch signal, list all forked processes and send SIGKILL or whatever

I believe there's no portable way to list all the child processes (beyond calling popen("ps ax --ppid") and then parsing the output but that looks way too hacky even for corner-case handling). So the usual practice is to save their pids on fork() but in our case fork()s are abstracted by tokio::process handles and bookkeeping of those handles is done by preparation and execution queues separately. So I don't see a good option beyond bubbling errors up, it's annoying but very idiomatic approach :)

Let me know if you have better ideas, I'd also like to simplify this.

@mrcnski
Copy link
Contributor

mrcnski commented Mar 14, 2023

@s0me0ne-unkn0wn I believe we can just send kill with a pid of -parent_pid to kill both the node and all child processes. This works by killing all processes in the "process group" indexed by parent_id.

https://www.man7.org/linux/man-pages/man1/kill.1.html

This looks to be POSIX and cross-platform:

https://developer.apple.com/library/archive/documentation/System/Conceptual/ManPages_iPhoneOS/man2/kill.2.html

And I confirmed with Activity Monitor on my Mac that process groups are a thing here:

Screenshot 2023-03-14 at 19 32 31

I'm also now on the FBI watch list after searching "man kill".

@s0me0ne-unkn0wn
Copy link
Contributor Author

I believe we can just send kill with a pid of -parent_pid to kill both the node and all child processes.

Yeah, that's interesting, thank you! Also a convenient low-level interface is the kill syscall:

If pid equals 0, then sig is sent to every process in the process group of the calling process.

and we already have libc as a PVF host dependency. I'll give it a try. My only concern is that the node will receive SIGKILL too, and what if some thread is writing to the database at the moment? Not sure how critical it is. @bkchr any clues?

I'm also now on the FBI watch list after searching "man kill".

😅 😅 😅
Yeah, all of us are there. I'm writing so much text on killing workers. The FBI guys are like "what they're up to in that Parity" 😆

@mrcnski
Copy link
Contributor

mrcnski commented Mar 14, 2023

My only concern is that the node will receive SIGKILL too, and what if some thread is writing to the database at the moment?

Good thinking, though any consensus-critical piece should already take this into account. I would be surprised if the DB wasn't properly ACID. The software can die at any time, whether for receiving a kill signal, plug being pulled, etc.

@bkchr
Copy link
Member

bkchr commented Mar 14, 2023

We should clearly not start sending any kill signals around :P If we detect a version mismatch, the most crucial thing to do is that the worker is not doing anything. It should just reject any kind of work and directly shut down. I only briefly checked your latest changes and they are really looking like they are touch quite a lot of things. I agree with @mrcnski that the current changes are a lot for the small change we want to achieve. I didn't checked the code, but if that can not be done easier, we may need to do some kind of refactoring? We should clearly not be required to sprinkle the code with if am_i_still_running {}.

@s0me0ne-unkn0wn
Copy link
Contributor Author

@bkchr okay, let's limit the scope of the task then. The idea was to try to shut down the whole node if a version mismatch was observed, killing the workers before that, so they could not produce an artifact after the node restart. There are two ways to achieve that: 1) to just kill() everything around and 2) to do a graceful shutdown propagating errors through all the pipelines up and down.

If we don't want to go hard, then maybe we shouldn't try to shut down? Reporting the situation at error loglevel and ceasing to accept any more worker jobs would do the trick. I've kind of convinced myself that stale workers cannot cause any harm, so probably it's the easiest solution?

@mrcnski
Copy link
Contributor

mrcnski commented Mar 14, 2023

We should clearly not start sending any kill signals around :P If we detect a version mismatch, the most crucial thing to do is that the worker is not doing anything.

@bkchr In general we shouldn't be liberal with kill signals, but this seems to me like an exceptional scenario. It might actually be more likely for someone's node to die (for a number of reasons), with the same result as a kill, than for them to do an in-place upgrade. I would just kill the workers and not worry about it -- worst-case scenario we get a half-written temporary artifact. Or what other possibilities are you worried about? I definitely agree that we shouldn't complicate this -- increasing the maintenance burden of this code is likely to lead to more bugs and be a net negative outcome.

Edit: we can also use kill to send some other signal to tell the worker to finish its work and start shutting down. Not sure it's worth the added code/testing for it, but might be better for our consciences.

@bkchr
Copy link
Member

bkchr commented Mar 14, 2023

I would just kill the workers and not worry about it -- worst-case scenario we get a half-written temporary artifact.

I also said this or tried to say this. Killing the workers is fine, but they will kill themselves when they find the version mismatch? The parent process should only force kill them when it shuts down, but not try to find all processes that we may have started, even in earlier instances e.g ps -a | grep polkadot.

Reporting the situation at error loglevel and ceasing to accept any more worker jobs would do the trick.

Should probably work for the beginning and then later we can improve this!

I've kind of convinced myself that stale workers cannot cause any harm, so probably it's the easiest solution?

If we have artifact ids with a random number as proposed by @koute it should be fine!

@s0me0ne-unkn0wn
Copy link
Contributor Author

If we have artifact ids with a random number as proposed by @koute it should be fine!

I modeled the scenarios that can take place and now I believe that even including a random number or a version number is redundant.

The preparation worker produces an artifact in a temporary file. It gets renamed to a real usable artifact on the node side. Only the node instance that started the worker can do such rename (because the worker signals to do the rename through a socket, and sockets are set up by the node instance on worker spawn).

Now, we have an old node that started an old worker. The worker has not concluded yet, and the node gets upgraded in place.

Any new worker started by the old node would report a version mismatch and exit immediately. It doesn't matter at this point if we kill the node or just spam errors into its logs. What matters is that at that point node has to be restarted, either by the node owner or by our code.

Now, if the old worker concludes before the node restart, an artifact is stored in the artifact cache. But the node cannot use it as it cannot spawn workers anymore because of version mismatch. When the node restart finally happens, the artifact cache is pruned, so the artifact cannot outlive the node upgrade.

The second scenario: the old worker does not conclude before the node restart. The new node is already started, and after that, the old worker produces a temporary file with the artifact. But that temporary file will never be renamed to a real usable artifact because that should be done by the node instance that is not there anymore. The only leftover is a useless temporary file which will be pruned either by the cache pruner or on the next node restart.

Let me know if I'm missing something but it seems that just ceasing processing new requests if the version mismatch condition is recognized is enough.

@s0me0ne-unkn0wn
Copy link
Contributor Author

So, pushed a new minimal solution: check for version mismatch on the worker startup, and if the mismatch is detected, send SIGKILL to the parent process only.

Tests show that it works pretty fine: node is torn down, and when that happens, all its unix sockets get closed, which results in other workers getting an i/o error and exiting.

Before merging, I'd like to recruit a volunteer to test it under MacOS.

@mrcnski
Copy link
Contributor

mrcnski commented Mar 19, 2023

Before merging, I'd like to recruit a volunteer to test it under MacOS.

I can do it. Have any specific testing instructions?

Also, is it possible to add some tests to the integration test suite in this module? I don't see why not. e.g. spawn a worker and send a version mismatch, spawn multiple valid workers and then spawn another one and send a version mismatch, etc.

NOTE: Right now libc is a Linux-only dependency, that needs to be fixed.

@s0me0ne-unkn0wn
Copy link
Contributor Author

I can do it. Have any specific testing instructions?

Thank you! I do it like the following:

  1. Build with cargo build, which produces debug binary
  2. Change the version in the root Cargo.toml to something different
  3. Build with cargo build -r, which produces release binary of different version
  4. Set up zombienet using the small network template config (two nodes and one parachain) with default_command pointing to the debug binary
  5. Start the network with native provider: ./zombienet-linux-x64 --provider=native spawn 0001-small-network.toml
  6. Observe the alice node with PolkadotJS and wait until the parachain starts producing blocks
  7. Delete the debug polkadot binary and replace it with the release polkadot binary in place (you may have to delete it with rm and then copy a new one, just cp -f might not work)
  8. Kill the execution workers to force nodes to restart them: ps auxww |grep 'execute-worker' |awk '{print $2}' |xargs kill
  9. After a couple of seconds, make sure all the polkadot binaries exited, including preparation workers and nodes themselves: ps auxww |grep polkadot
  10. Node logs may also be checked, they should contain the message "Node and worker version mismatch, node needs restarting, forcing shutdown" coming from the execution worker before it kills everything.

Would you also fix dependencies to make libc build on MacOS? I can do that, but I cannot check that.

Also, is it possible to add some tests to the integration test suite in this module?

Yes, I was thinking about that, too, I'm just unfamiliar with the integration tests infrastructure and need to do some research.

@mrcnski
Copy link
Contributor

mrcnski commented Mar 21, 2023

I'm seeing unexpected behavior after completing step 8: failed to spawn an execute worker: AcceptTimeout. The workers are never spawned and the mismatch case is never triggered. Looking into it.

@ordian ordian dismissed their stale review March 21, 2023 09:40

stale

@mrcnski
Copy link
Contributor

mrcnski commented Mar 21, 2023

I'm seeing unexpected behavior after completing step 8: failed to spawn an execute worker: AcceptTimeout. The workers are never spawned and the mismatch case is never triggered. Looking into it.

If I remove the node before replacing it, then the version mismatch detection
works correctly.

If I try to overwrite it with cp -f (which your instructions say not to do,
but operators may do it), then I get AcceptTimeout. My guess is that the new
process immediately exits, though we don't check for this in
spawn_with_program_path. This immediate exit seems to be known behavior on
MacOS, based on
https://stackoverflow.com/questions/67378106/mac-m1-cping-binary-over-another-results-in-crash.
It results in SpawnErr.

Also, I once randomly got this fun error: Malformed Mach-o file, which
also resulted in a SpawnErr. This SpawnErr is harmless -- it just causes us to
retry spawning the worker every 3s in an infinite loop.

My recommendations

  1. Catch process exiting (handle becoming invalid?) in spawn_with_program_path (can be a follow-up issue). Not doing, see comment below.
  2. Double check that SpawnErr does not result in disputes. (checked above)
  3. Add a test for immediately-dying workers if possible (tests/it/worker_common.rs). DONE
  4. I think documenting this in-code is not needed as this is a super edge case.

This seems like the best we can do. Then we should be good to go. These changes
should be quick to implement and not hold up this PR much longer.

Sample Log

See this truncated sample log: https://pastebin.com/raw/LBT0RUP9. This shows that:

  1. "failed to spawn" just results in us trying to re-start the process repeatedly.
  2. No disputes are raised.

@s0me0ne-unkn0wn
Copy link
Contributor Author

@mrcnski, thanks a lot for testing and feedback! I'll have to leave (1) to you, as on Linux, cp -f results in an error, refusing to replace the file in use. Nevertheless, unlinking it and then copying a new one in its place works.

I believe SpawnErr is not a concern as it does not result in a dispute anyway. And upgrading a validator in place under MacOS is such a super-super-edge case... It would be a shame to introduce any new code just to handle that.

I'll try to add some integration tests later (I'll be ooo for several days).

@mrcnski
Copy link
Contributor

mrcnski commented Mar 22, 2023

I'll try to add some integration tests later (I'll be ooo for several days).

wasmio right? Enjoy Barcelona!

@mrcnski
Copy link
Contributor

mrcnski commented Mar 22, 2023

I pushed a commit addressing (3) (adding a test).

(1) actually seems not trivial -- we can't select on the handle future, because we also return handle from a different branch of the select. I would just skip (1), the spawn will just timeout and we now have a test for it.

So, this PR seems good to go from my perspective.

@s0me0ne-unkn0wn
Copy link
Contributor Author

bot merge

@paritytech-processbot paritytech-processbot bot merged commit 1fe1702 into master Mar 29, 2023
@paritytech-processbot paritytech-processbot bot deleted the s0me0ne/worker-version branch March 29, 2023 22:48
coderobe pushed a commit that referenced this pull request Apr 4, 2023
…6861)

* Check spawned worker version vs node version before PVF preparation

* Address discussions

* Propagate errors and shutdown preparation and execution pipelines properly

* Add logs; Fix execution worker checks

* Revert "Propagate errors and shutdown preparation and execution pipelines properly"

This reverts commit b96cc31.

* Don't try to shut down; report the condition and exit worker

* Get rid of `VersionMismatch` preparation error

* Merge master

* Add docs; Fix tests

* Update Cargo.lock

* Kill again, but only the main node process

* Move unsafe code to a common safe function

* Fix libc dependency error on MacOS

* pvf spawning: Add some logging, add a small integration test

* Minor fixes

* Restart CI

---------

Co-authored-by: Marcin S <marcin@realemail.net>
@Polkadot-Forum
Copy link

This pull request has been mentioned on Polkadot Forum. There might be relevant details there:

https://forum.polkadot.network/t/polkadot-dispute-storm-the-postmortem/2550/1

@Polkadot-Forum
Copy link

This pull request has been mentioned on Polkadot Forum. There might be relevant details there:

https://forum.polkadot.network/t/ux-of-distributing-multiple-binaries-take-2/2854/3

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
A0-please_review Pull request needs code review. B0-silent Changes should not be mentioned in any release notes C1-low PR touches the given topic and has a low impact on builders. T4-parachains_engineering This PR/Issue is related to Parachains performance, stability, maintenance.
Projects
No open projects
Status: Done
Development

Successfully merging this pull request may close these issues.

Check worker version against node version before running execution/preparation jobs
7 participants