
PVF: consider spawning a new process per job #584

Closed
mrcnski opened this issue Aug 7, 2023 · 18 comments · Fixed by #1373
Labels
I1-security The node fails to follow expected, security-sensitive, behaviour.

Comments

@mrcnski
Contributor

mrcnski commented Aug 7, 2023

Currently we spawn worker processes for PVF jobs (prepare/execute) and each incoming job gets its own thread within that process.

This works fine, but there are potential security issues, as described in paritytech/polkadot#7580 (comment). Namely, we can't fully sandbox the process, because we have to keep an allow-exception for the entire PVF artifact cache directory.

We should investigate the overhead of spawning a whole separate process for each job, as opposed to a thread. The overhead may be lower now that we spawn processes from smaller worker binaries instead of the full polkadot binary. If the cost is low enough, we can switch to one-process-per-job, which would let us sandbox each process more tightly.

This would probably also come with changes to the execution queue logic (most likely simplifications).
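For illustration, here is a minimal sketch of the one-process-per-job model, assuming a hypothetical worker binary that takes a job on stdin and returns the result on stdout (the binary name and protocol are made up for the sketch, not the actual PVF host interface):

```rust
use std::io::Write;
use std::process::{Command, Stdio};
use std::time::Instant;

// Spawn a fresh worker process for a single job. Because the process exits
// when the job is done, no state leaks between jobs, and the sandbox can be
// scoped to exactly one artifact.
fn run_job(job_payload: &[u8]) -> std::io::Result<Vec<u8>> {
    let start = Instant::now();
    let mut child = Command::new("pvf-worker") // hypothetical binary name
        .stdin(Stdio::piped())
        .stdout(Stdio::piped())
        .spawn()?;
    child
        .stdin
        .take()
        .expect("stdin was piped above")
        .write_all(job_payload)?;
    let output = child.wait_with_output()?;
    eprintln!("job took {:?}, including the process spawn", start.elapsed());
    Ok(output.stdout)
}
```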

@mrcnski
Contributor Author

mrcnski commented Aug 8, 2023

I'm going to first try sending the artifact bytes over the socket, avoiding the need for any sandbox allowances at all. I was under the impression that it would take too long but it's worth measuring.

@burdges

burdges commented Aug 11, 2023

In principle SPREEs could be used by many parachains, so we'd benefit from the final artifact, presumably an ELF .so file, being dlopened. Not a concern right now.

@rphmeier
Contributor

rphmeier commented Aug 14, 2023

Numbers on PVF overhead (per job, per byte of data in PoV) would be really helpful to track as they'd inform higher-level protocol things like bundling.

I.e., the lower we can get these overheads, the smaller we can make candidates, and the higher the utilization we can probably get out of approval cores.

@mrcnski
Contributor Author

mrcnski commented Aug 15, 2023

> I'm going to first try sending the artifact bytes over the socket, avoiding the need for any sandbox allowances at all. I was under the impression that it would take too long but it's worth measuring.

Just measured on local zombienet. The result is what I expected, but worth a try I guess.

Artifact of size 233768 bytes took 7364ms to transfer
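For reference, the measurement harness could look roughly like this sketch; it uses an in-process socket pair purely for illustration, while the real test crossed a process boundary (which adds scheduling and copy overhead):

```rust
use std::io::{Read, Write};
use std::os::unix::net::UnixStream;
use std::time::Instant;

fn main() -> std::io::Result<()> {
    let artifact = vec![0u8; 233_768]; // same size as the measured artifact
    let (mut tx, mut rx) = UnixStream::pair()?;
    let start = Instant::now();
    // Write from a second thread so the reader can drain the socket buffer
    // concurrently; otherwise write_all would block once the buffer fills.
    let writer = std::thread::spawn(move || tx.write_all(&artifact));
    let mut received = Vec::new();
    rx.read_to_end(&mut received)?; // returns once the write end is dropped
    writer.join().unwrap()?;
    println!("transferred {} bytes in {:?}", received.len(), start.elapsed());
    Ok(())
}
```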

Next I'll benchmark a new process per job.

@rphmeier
Contributor

> Artifact of size 233768 bytes took 7364ms to transfer

Am I reading this right? Is this meant to be microseconds or milliseconds?

@mrcnski
Contributor Author

mrcnski commented Aug 16, 2023

> Artifact of size 233768 bytes took 7364ms to transfer

> Am I reading this right? Is this meant to be microseconds or milliseconds?

It was Duration::as_millis, so milliseconds; it's indeed pretty bad. I had also seen bad numbers when researching online, but wanted to quickly confirm them myself.

Shared memory is supposed to be much faster, but it would require a bigger rewrite to test, and I fear it wouldn't fit well with Rust's ownership model and would require unsafe. Also, the executor already has optimizations for working with file data efficiently, which we would lose if we read the whole file in the host and wrote it to shared memory.

Maybe it's worth doing a quick PoC or extending the benchmarks we have, but first I'll try one-process-per-job, as it's much simpler to test.
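If we do try the PoC: a sketch of the shared-mapping direction, assuming the memmap2 crate (just one option; raw mmap or a dedicated shm crate would work too). A worker that maps the same file sees the same pages without a socket copy:

```rust
use memmap2::MmapOptions;
use std::fs::OpenOptions;

fn main() -> std::io::Result<()> {
    // File-backed shared mapping: the host writes the artifact through the
    // map, and a worker mapping the same file reads the pages without any
    // copy over a socket. The path is illustrative.
    let file = OpenOptions::new()
        .read(true)
        .write(true)
        .create(true)
        .open("/tmp/pvf-artifact-demo")?;
    file.set_len(233_768)?;
    // SAFETY: we assume no other process truncates the file while it is
    // mapped; the real host would have to uphold that invariant.
    let mut map = unsafe { MmapOptions::new().map_mut(&file)? };
    map[..4].copy_from_slice(b"\0asm"); // stand-in for the artifact bytes
    map.flush()
}
```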

@mrcnski
Contributor Author

mrcnski commented Aug 16, 2023

Locally on an M1 Mac, spawning takes 4-12 ms. That's already as long as execution by itself takes. When I tried on a beefy Linux machine it was worse, at 20 ms. In the grand scheme of things, maybe it's acceptable? It still comfortably fits within the backing timeout. cc @eskimor @s0me0ne-unkn0wn

I tried the socket measurements again, this time on Linux, since my numbers on Mac were even worse than Stack Overflow's socket / shared-memory numbers. On Linux I got 645 µs (Duration::as_micros) for 233768 bytes. Extrapolating to a 20 MB artifact (which is on the small side), that would be about 60 ms, so still not great.
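(Spelled out, that extrapolation is: 20 MB / 233768 B ≈ 86 transfers of the measured size, and 86 × 645 µs ≈ 55 ms, hence the ~60 ms ballpark.)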

@s0me0ne-unkn0wn
Contributor

> When I tried on a beefy Linux machine it was worse, at 20 ms.

Have you tried spawning from memory yet? It could somewhat reduce the overhead.

> On Linux I got 645 µs (Duration::as_micros) for 233768 bytes. Extrapolating to a 20 MB artifact (which is on the small side), that would be about 60 ms, so still not great.

The question is, how bad is that compared to reading the artifact from the fs, for the case where we read it only once (or never read it at all, preparing and keeping everything in memory)?

@rphmeier
Contributor

rphmeier commented Aug 16, 2023

> The question is, how bad is that compared to reading the artifact from the fs, for the case where we read it only once (or never read it at all, preparing and keeping everything in memory)?

We probably can't keep all artifacts in memory while we scale to tens of thousands of registered parachains. We can give heads-up notifications to pre-load specific artifacts when we are assigned to a parachain in approval-checking, but that would probably only give a few hundred ms of lead time (probably enough).

These are the numbers we're interested in:

  • Overhead of spawning a process
  • Overhead of starting the execution job (including getting the artifact)
  • Overhead of passing in the PoV

Together these make up our total overhead per validation job. Ideally we can get this as lean as possible on Linux / first-class platforms for validators. (relates to #607)

@eskimor
Member

eskimor commented Aug 16, 2023

20 MB is smallish...? @s0me0ne-unkn0wn, is this something we can easily bring down in the future?

> I tried the socket measurements again, this time on Linux, since my numbers on Mac were even worse than Stack Overflow's socket / shared-memory numbers. On Linux I got 645 µs (Duration::as_micros) for 233768 bytes. Extrapolating to a 20 MB artifact (which is on the small side), that would be about 60 ms, so still not great.

Hmm, shouldn't it be 6ms?

@s0me0ne-unkn0wn
Contributor

> 20 MB is smallish...? @s0me0ne-unkn0wn, is this something we can easily bring down in the future?

Checked with the Moonbeam runtime (7.7 MB of uncompressed Wasm bytecode). Wasmtime produced a 28 MB artifact. It's an unstripped ELF; the .text segment is ~13.5 MB, plus a lot of other things used by the Wasmtime runtime that are not obvious to me (e.g., a ~10 MB section called wasmtime.addrmap).

If I strip it, it doesn't make much difference: ~300 KB are gone, but that's it.

My compiler produces a 22.7 MB artifact from the same bytecode, so the figure is fair enough.

Just for comparison, the adder runtime, 126 KB of Wasm bytecode, produces a 149 KB unstripped (143 KB stripped) artifact with Wasmtime. My compiler produces a 102 KB artifact.

@s0me0ne-unkn0wn
Contributor

s0me0ne-unkn0wn commented Aug 16, 2023

As I mentioned before, we can compress them in-memory. The 28 MB Moonbeam artifact compresses with zstd (which we use in Substrate to compress Wasm blobs) down to 12.3 MB.
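A sketch of what that in-memory round trip could look like, assuming the zstd crate (level 3 is an arbitrary choice for the sketch):

```rust
// Compress a prepared artifact in memory and decompress it on demand.
// zstd::encode_all / decode_all accept any `Read`, so byte slices work
// directly.
fn roundtrip(artifact: &[u8]) -> std::io::Result<Vec<u8>> {
    let compressed = zstd::encode_all(artifact, 3)?;
    println!("compressed {} -> {} bytes", artifact.len(), compressed.len());
    zstd::decode_all(compressed.as_slice())
}
```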

@eskimor
Member

eskimor commented Aug 17, 2023

Let's see, I know a very good article where someone reduced an ELF binary quite significantly*). Very entertaining. 😎

*) Removed the spoiler.

@mrcnski
Contributor Author

mrcnski commented Aug 17, 2023

Thinking back to the original issue: we want to prevent attackers from discovering other artifacts on disk. I found out that landlock has this access right for directories:

> LANDLOCK_ACCESS_FS_READ_DIR: Open a directory or list its content.

I believe we can apply this and then prepend artifact filenames with a random string, so the only way to access an artifact is when the host sends us a job for it. Edit: I wrote a test and the access right works as expected.
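For reference, the rule could look roughly like this with the Rust landlock crate (builder names vary between crate versions, so treat this as an outline of the test rather than the exact code):

```rust
use landlock::{
    Access, AccessFs, PathBeneath, PathFd, Ruleset, RulesetAttr, RulesetCreatedAttr, ABI,
};

// Allow the worker to read files beneath the cache path, but deliberately
// leave ReadDir (LANDLOCK_ACCESS_FS_READ_DIR) out of the granted set, so it
// cannot list the directory and discover other artifacts' random filenames.
fn restrict_worker(cache_path: &str) -> Result<(), Box<dyn std::error::Error>> {
    let abi = ABI::V1;
    Ruleset::default()
        .handle_access(AccessFs::from_all(abi))?
        .create()?
        .add_rule(PathBeneath::new(PathFd::new(cache_path)?, AccessFs::ReadFile))?
        .restrict_self()?;
    Ok(())
}
```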

> As I mentioned before, we can compress them in-memory. The 28 MB Moonbeam artifact compresses with zstd (which we use in Substrate to compress Wasm blobs) down to 12.3 MB.

Isn't this also a performance hit, since we would have to decompress them each time? Still, it could be worth evaluating.

@mrcnski
Contributor Author

mrcnski commented Aug 20, 2023

> I believe we can apply this and then prepend artifact filenames with a random string, so the only way to access an artifact is when the host sends us a job for it.

Interesting. This prevents access to other artifacts, but the artifact filename itself is now a source of randomness. We could rename the artifact to remove the random string before passing it to a job, and restore the name afterwards. But then we have to account for races, as multiple workers may need the artifact concurrently.

Perhaps simpler is for each worker to have an expected file that it always reads, i.e. <cache_path>/worker_1, which can be symlinked or hardlinked to the actual artifact before each job (see the sketch after the footnote). But the <cache_path> itself can be a source of randomness. Agh. IMO we should agree not to pursue the goal of removing all randomness for now.¹

Footnotes

  1. But maybe we can do some chroot trickery like this. It didn't work when I tried it in a Rust test, but it could work in a separate child process...
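For concreteness, the per-worker link idea from above could look like this (the worker_N naming is the illustrative one from this comment, not real code):

```rust
use std::fs;
use std::path::{Path, PathBuf};

// Before dispatching a job, hard-link the real (randomly named) artifact to
// the fixed per-worker name that the worker is allowed to read.
fn expose_artifact(cache: &Path, artifact: &Path, worker_id: u32) -> std::io::Result<PathBuf> {
    let link = cache.join(format!("worker_{worker_id}"));
    // Remove any stale link left over from the previous job.
    let _ = fs::remove_file(&link);
    fs::hard_link(artifact, &link)?;
    Ok(link)
}
```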

@mrcnski
Contributor Author

mrcnski commented Aug 24, 2023

For a new process per job, here's a proposal for a lightweight way of doing it:
#574

For hiding artifacts from jobs that don't need them, I've tested this and it seems to work:
#584

@Sophia-Gold Sophia-Gold transferred this issue from paritytech/polkadot Aug 24, 2023
@the-right-joyce the-right-joyce added the I1-security (The node fails to follow expected, security-sensitive, behaviour.) label and removed the I2-security label Aug 25, 2023
@mrcnski
Contributor Author

mrcnski commented Aug 29, 2023

I've almost finished implementing the pivot_root solution with linking, as described here. However, it has turned out to be a big and complex change, adding some complexity to the FS structure, and I'm thinking it might be better to do shared memory. I was on the fence until this issue. If we don't have a socket, then point (2) there is solved, and if we block (1) (a 3-line change; birdcage has already done it), then we get network sandboxing for free, on top of paritytech/polkadot#7334. And of course, we can keep the pivot_root sandboxing that I already wrote.

One nice thing about the existing file-based IPC is that the child writes a temporary prepared artifact, and the host atomically renames it to the final destination if the child succeeded; I don't see why we can't do the same on the host side with shared memory (see the sketch below). Also, I think the linking solution is going to be the best performance-wise, but shared memory should only incur a couple of extra ms, max. I can do a quick PoC to double-check.
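The tmp-file-plus-atomic-rename pattern in question, roughly (paths illustrative; rename(2) is atomic as long as source and destination are on the same filesystem):

```rust
use std::fs;
use std::io::Write;
use std::path::Path;

// Write the prepared artifact to a temp file, then rename it into place on
// success; readers can never observe a half-written artifact.
fn commit_artifact(cache: &Path, bytes: &[u8], final_name: &str) -> std::io::Result<()> {
    let tmp = cache.join(format!("{final_name}.tmp"));
    fs::File::create(&tmp)?.write_all(bytes)?;
    fs::rename(&tmp, cache.join(final_name))
}
```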

One other reason maybe not to do it is that the network sandboxing would only be temporary, until the full seccomp solution arrives. However, without a socket in play there's less of a chance of things going wrong here: if there's no socket at all, we could entirely block the socket read/write syscalls (sketched below).
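Blocking those syscalls could look something like this sketch, assuming the seccompiler crate (the syscall list is illustrative, not exhaustive):

```rust
use seccompiler::{apply_filter, BpfProgram, SeccompAction, SeccompFilter, TargetArch};
use std::collections::BTreeMap;

// Return EACCES for socket-related syscalls; everything else is allowed.
// With no socket in play, the worker should never legitimately hit these.
fn block_sockets() -> Result<(), Box<dyn std::error::Error>> {
    let mut rules: BTreeMap<i64, Vec<seccompiler::SeccompRule>> = BTreeMap::new();
    for syscall in [libc::SYS_socket, libc::SYS_socketpair, libc::SYS_connect] {
        rules.insert(syscall, vec![]); // empty rule list = match unconditionally
    }
    let filter = SeccompFilter::new(
        rules,
        SeccompAction::Allow,                      // default action
        SeccompAction::Errno(libc::EACCES as u32), // action on match
        TargetArch::x86_64,
    )?;
    let bpf: BpfProgram = filter.try_into()?;
    apply_filter(&bpf)?;
    Ok(())
}
```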

It would be a big change so looking for feedback before I do it.

@mrcnski
Contributor Author

mrcnski commented Aug 29, 2023

Looks like shared memory was considered in the past:
https://hackmd.io/c9GhdTjpT-y0W-w8v8CknA#Problems-of-the-Current-Design

I haven't dug into it yet. But I don't like the linking solution I have, because the child and host become very tightly coupled around the expected FS structure. Of course they were already tightly coupled, but now the coupling is hidden, easy to break, and must be documented everywhere.
