
PVF: consider spawning a new process per job #584

Closed
mrcnski opened this issue Aug 7, 2023 · 18 comments · Fixed by #1373
Labels
I1-security The node fails to follow expected, security-sensitive, behaviour.

Comments

@mrcnski
Contributor

mrcnski commented Aug 7, 2023

Currently we spawn worker processes for PVF jobs (prepare/execute) and each incoming job gets its own thread within that process.

This works fine, but there are potential security issues, as described in paritytech/polkadot#7580 (comment). Namely, we can't fully sandbox the process, because we have to keep an allow-exception for the entire PVF artifact cache directory.

We should investigate the overhead of spawning a whole separate process for each job, as opposed to a thread. The overhead may be lower now that we spawn processes from smaller worker binaries instead of the full polkadot binary. If the cost is low enough, we can switch to one-process-per-job, which would let us sandbox each process more tightly.

This would probably also come with changes to the execution queue logic (most likely simplifications).
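For illustration, here is a minimal sketch of the one-process-per-job model, assuming a hypothetical worker binary that takes a job on stdin and returns the result on stdout (the binary name and protocol are made up for the sketch, not the actual PVF host interface):

```rust
use std::io::Write;
use std::process::{Command, Stdio};
use std::time::Instant;

// Spawn a fresh worker process for a single job. Because the process exits
// when the job is done, no state leaks between jobs, and the sandbox can be
// scoped to exactly one artifact.
fn run_job(job_payload: &[u8]) -> std::io::Result<Vec<u8>> {
    let start = Instant::now();
    let mut child = Command::new("pvf-worker") // hypothetical binary name
        .stdin(Stdio::piped())
        .stdout(Stdio::piped())
        .spawn()?;
    child
        .stdin
        .take()
        .expect("stdin was piped above")
        .write_all(job_payload)?;
    let output = child.wait_with_output()?;
    eprintln!("job took {:?}, including the process spawn", start.elapsed());
    Ok(output.stdout)
}
```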

@mrcnski
Contributor Author

mrcnski commented Aug 8, 2023

I'm going to first try sending the artifact bytes over the socket, avoiding the need for any sandbox allowances at all. I was under the impression that it would take too long but it's worth measuring.

@burdges

burdges commented Aug 11, 2023

In principle SPREEs could be used by many parachains, so we'd benefit from the final artifact, presumably an ELF .so file, being dlopened. Not a concern right now.

@rphmeier
Contributor

rphmeier commented Aug 14, 2023

Numbers on PVF overhead (per job, per byte of data in PoV) would be really helpful to track as they'd inform higher-level protocol things like bundling.

I.e., the lower we can get these overheads, the smaller we can make candidates, and the higher the utilization we can probably get out of approval cores.

@mrcnski
Contributor Author

mrcnski commented Aug 15, 2023

> I'm going to first try sending the artifact bytes over the socket, avoiding the need for any sandbox allowances at all. I was under the impression that it would take too long but it's worth measuring.

Just measured on local zombienet. The result is what I expected, but worth a try I guess.

Artifact of size 233768 bytes took 7364ms to transfer
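For reference, the measurement harness could look roughly like this sketch; it uses an in-process socket pair purely for illustration, while the real test crossed a process boundary (which adds scheduling and copy overhead):

```rust
use std::io::{Read, Write};
use std::os::unix::net::UnixStream;
use std::time::Instant;

fn main() -> std::io::Result<()> {
    let artifact = vec![0u8; 233_768]; // same size as the measured artifact
    let (mut tx, mut rx) = UnixStream::pair()?;
    let start = Instant::now();
    // Write from a second thread so the reader can drain the socket buffer
    // concurrently; otherwise write_all would block once the buffer fills.
    let writer = std::thread::spawn(move || tx.write_all(&artifact));
    let mut received = Vec::new();
    rx.read_to_end(&mut received)?; // returns once the write end is dropped
    writer.join().unwrap()?;
    println!("transferred {} bytes in {:?}", received.len(), start.elapsed());
    Ok(())
}
```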

Next I'll benchmark a new process per job.

@rphmeier
Contributor

> Artifact of size 233768 bytes took 7364ms to transfer

Am I reading this right? Is this meant to be microseconds or milliseconds?

@mrcnski
Contributor Author

mrcnski commented Aug 16, 2023

> Artifact of size 233768 bytes took 7364ms to transfer

> Am I reading this right? Is this meant to be microseconds or milliseconds?

It was Duration::as_millis, so milliseconds; it's indeed pretty bad. I had also seen bad numbers when researching online, but wanted to quickly confirm them myself.

Shared memory is supposed to be much faster, but it would require a bigger rewrite to test, and I fear it wouldn't fit well with Rust's ownership model and would require unsafe. Also, the executor already has optimizations for working with file data efficiently, which we would lose if we read the whole file in the host and wrote it to shared memory.

Maybe it's worth doing a quick PoC or extending the benchmarks we have, but first I'll try one-process-per-job, as it's much simpler to test.
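If we do try the PoC: a sketch of the shared-mapping direction, assuming the memmap2 crate (just one option; raw mmap or a dedicated shm crate would work too). A worker that maps the same file sees the same pages without a socket copy:

```rust
use memmap2::MmapOptions;
use std::fs::OpenOptions;

fn main() -> std::io::Result<()> {
    // File-backed shared mapping: the host writes the artifact through the
    // map, and a worker mapping the same file reads the pages without any
    // copy over a socket. The path is illustrative.
    let file = OpenOptions::new()
        .read(true)
        .write(true)
        .create(true)
        .open("/tmp/pvf-artifact-demo")?;
    file.set_len(233_768)?;
    // SAFETY: we assume no other process truncates the file while it is
    // mapped; the real host would have to uphold that invariant.
    let mut map = unsafe { MmapOptions::new().map_mut(&file)? };
    map[..4].copy_from_slice(b"\0asm"); // stand-in for the artifact bytes
    map.flush()
}
```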

@mrcnski
Contributor Author

mrcnski commented Aug 16, 2023

Locally on an M1 Mac, spawning takes 4-12 ms. That's already as long as execution by itself takes. When I tried on a beefy Linux machine it was worse, at 20 ms. In the grand scheme of things, maybe it's acceptable? It still comfortably fits within the backing timeout. cc @eskimor @s0me0ne-unkn0wn

I tried the socket measurements again, this time on Linux, since my numbers on Mac were even worse than Stack Overflow's socket / shared-memory numbers. On Linux I got 645 µs (Duration::as_micros) for 233768 bytes. Extrapolating to a 20 MB artifact (which is on the small side), that would be about 60 ms, so still not great.
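(Spelled out, that extrapolation is: 20 MB / 233768 B ≈ 86 transfers of the measured size, and 86 × 645 µs ≈ 55 ms, hence the ~60 ms ballpark.)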

@s0me0ne-unkn0wn
Contributor

> When I tried on a beefy Linux machine it was worse, at 20 ms.

Have you tried spawning from memory yet? It could somewhat reduce the overhead.

> On Linux I got 645 µs (Duration::as_micros) for 233768 bytes. Extrapolating to a 20 MB artifact (which is on the small side), that would be about 60 ms, so still not great.

The question is, how bad is that compared to reading the artifact from the fs, for the case where we read it only once (or never read it at all, preparing and keeping everything in memory)?

@rphmeier
Contributor

rphmeier commented Aug 16, 2023

> The question is, how bad is that compared to reading the artifact from the fs, for the case where we read it only once (or never read it at all, preparing and keeping everything in memory)?

We probably can't keep all artifacts in memory while we scale to tens of thousands of registered parachains. We can give heads-up notifications to pre-load specific artifacts when we are assigned to a parachain in approval-checking, but that would probably only give a few hundred ms of lead time (probably enough).

These are the numbers we're interested in:

  • Overhead of spawning a process
  • Overhead of starting the execution job (including getting the artifact)
  • Overhead of passing in the PoV

Together these make up our total overhead per validation job. Ideally we can get this as lean as possible on Linux / first-class platforms for validators. (relates to #607)

@eskimor
Member

eskimor commented Aug 16, 2023

20 MB is smallish...? @s0me0ne-unkn0wn, is this something we can easily bring down in the future?

> I tried the socket measurements again, this time on Linux, since my numbers on Mac were even worse than Stack Overflow's socket / shared-memory numbers. On Linux I got 645 µs (Duration::as_micros) for 233768 bytes. Extrapolating to a 20 MB artifact (which is on the small side), that would be about 60 ms, so still not great.

Hmm, shouldn't it be 6ms?

@s0me0ne-unkn0wn
Contributor

> 20 MB is smallish...? @s0me0ne-unkn0wn, is this something we can easily bring down in the future?

Checked with the Moonbeam runtime (7.7 MB of uncompressed Wasm bytecode). Wasmtime produced a 28 MB artifact. It's an unstripped ELF; the .text segment is ~13.5 MB, plus a lot of other things used by the Wasmtime runtime that are not obvious to me (e.g., a ~10 MB section called wasmtime.addrmap).

If I strip it, it doesn't make much difference: ~300 KB are gone, but that's it.

My compiler produces a 22.7 MB artifact from the same bytecode, so the figure is fair enough.

Just for comparison, the adder runtime, 126 KB of Wasm bytecode, produces a 149 KB unstripped (143 KB stripped) artifact with Wasmtime. My compiler produces a 102 KB artifact.

@s0me0ne-unkn0wn
Contributor

s0me0ne-unkn0wn commented Aug 16, 2023

As I mentioned before, we can compress them in-memory. The 28 MB Moonbeam artifact compresses with zstd (which we use in Substrate to compress Wasm blobs) down to 12.3 MB.
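A sketch of what that in-memory round trip could look like, assuming the zstd crate (level 3 is an arbitrary choice for the sketch):

```rust
// Compress a prepared artifact in memory and decompress it on demand.
// zstd::encode_all / decode_all accept any `Read`, so byte slices work
// directly.
fn roundtrip(artifact: &[u8]) -> std::io::Result<Vec<u8>> {
    let compressed = zstd::encode_all(artifact, 3)?;
    println!("compressed {} -> {} bytes", artifact.len(), compressed.len());
    zstd::decode_all(compressed.as_slice())
}
```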

@eskimor
Member

eskimor commented Aug 17, 2023

Let's see, I know a very good article where someone reduced an ELF binary quite significantly*). Very entertaining. 😎

*) Removed the spoiler.

@mrcnski
Contributor Author

mrcnski commented Aug 17, 2023

Thinking back to the original issue: we want to prevent attackers from discovering other artifacts on disk. I found out that landlock has this access right for directories:

> LANDLOCK_ACCESS_FS_READ_DIR: Open a directory or list its content.

I believe we can apply this and then prepend artifact filenames with a random string, so the only way to access an artifact is when the host sends us a job for it. Edit: I wrote a test and the access right works as expected.
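For reference, the rule could look roughly like this with the Rust landlock crate (builder names vary between crate versions, so treat this as an outline of the test rather than the exact code):

```rust
use landlock::{
    Access, AccessFs, PathBeneath, PathFd, Ruleset, RulesetAttr, RulesetCreatedAttr, ABI,
};

// Allow the worker to read files beneath the cache path, but deliberately
// leave ReadDir (LANDLOCK_ACCESS_FS_READ_DIR) out of the granted set, so it
// cannot list the directory and discover other artifacts' random filenames.
fn restrict_worker(cache_path: &str) -> Result<(), Box<dyn std::error::Error>> {
    let abi = ABI::V1;
    Ruleset::default()
        .handle_access(AccessFs::from_all(abi))?
        .create()?
        .add_rule(PathBeneath::new(PathFd::new(cache_path)?, AccessFs::ReadFile))?
        .restrict_self()?;
    Ok(())
}
```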

> As I mentioned before, we can compress them in-memory. The 28 MB Moonbeam artifact compresses with zstd (which we use in Substrate to compress Wasm blobs) down to 12.3 MB.

Isn't this also a performance hit, since we would have to decompress them each time? Still, it could be worth evaluating.

@mrcnski
Contributor Author

mrcnski commented Aug 20, 2023

> I believe we can apply this and then prepend artifact filenames with a random string, so the only way to access an artifact is when the host sends us a job for it.

Interesting. This prevents access to other artifacts, but the artifact filename itself is now a source of randomness. We could rename the artifact to remove the random string before passing it to a job, and restore the name afterwards. But then we have to account for races, as multiple workers may need the artifact concurrently.

Perhaps simpler is for each worker to have an expected file that it always reads, i.e. <cache_path>/worker_1, which can be symlinked or hardlinked to the actual artifact before each job (see the sketch after the footnote). But the <cache_path> itself can be a source of randomness. Agh. IMO we should agree not to pursue the goal of removing all randomness for now.¹

Footnotes

  1. But maybe we can do some chroot trickery like this. It didn't work when I tried it in a Rust test, but it could work in a separate child process...
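For concreteness, the per-worker link idea from above could look like this (the worker_N naming is the illustrative one from this comment, not real code):

```rust
use std::fs;
use std::path::{Path, PathBuf};

// Before dispatching a job, hard-link the real (randomly named) artifact to
// the fixed per-worker name that the worker is allowed to read.
fn expose_artifact(cache: &Path, artifact: &Path, worker_id: u32) -> std::io::Result<PathBuf> {
    let link = cache.join(format!("worker_{worker_id}"));
    // Remove any stale link left over from the previous job.
    let _ = fs::remove_file(&link);
    fs::hard_link(artifact, &link)?;
    Ok(link)
}
```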

@mrcnski
Contributor Author

mrcnski commented Aug 24, 2023

For a new process per job, here's a proposal for a lightweight way of doing it:
#574

For hiding artifacts from jobs that don't need them, I've tested this and it seems to work:
#584

@Sophia-Gold Sophia-Gold transferred this issue from paritytech/polkadot Aug 24, 2023
@the-right-joyce the-right-joyce added the I1-security (The node fails to follow expected, security-sensitive, behaviour.) label and removed the I2-security label Aug 25, 2023
@mrcnski
Contributor Author

mrcnski commented Aug 29, 2023

I've almost finished implementing the pivot_root solution with linking, as described here. However, it has turned out to be a big and complex change, adding some complexity to the FS structure, and I'm thinking it might be better to do shared memory. I was on the fence until this issue. If we don't have a socket, then point (2) there is solved, and if we block (1) (a 3-line change; birdcage has already done it), then we get network sandboxing for free, on top of paritytech/polkadot#7334. And of course, we can keep the pivot_root sandboxing that I already wrote.

One nice thing about the existing file-based IPC is that the child writes a temporary prepared artifact, and the host atomically renames it to the final destination if the child succeeded; I don't see why we can't do the same on the host side with shared memory (see the sketch below). Also, I think the linking solution is going to be the best performance-wise, but shared memory should only incur a couple of extra ms, max. I can do a quick PoC to double-check.
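The tmp-file-plus-atomic-rename pattern in question, roughly (paths illustrative; rename(2) is atomic as long as source and destination are on the same filesystem):

```rust
use std::fs;
use std::io::Write;
use std::path::Path;

// Write the prepared artifact to a temp file, then rename it into place on
// success; readers can never observe a half-written artifact.
fn commit_artifact(cache: &Path, bytes: &[u8], final_name: &str) -> std::io::Result<()> {
    let tmp = cache.join(format!("{final_name}.tmp"));
    fs::File::create(&tmp)?.write_all(bytes)?;
    fs::rename(&tmp, cache.join(final_name))
}
```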

One other reason maybe not to do it is that the network sandboxing would only be temporary, until the full seccomp solution arrives. However, without a socket in play there's less of a chance of things going wrong here: if there's no socket at all, we could entirely block the socket read/write syscalls (sketched below).
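Blocking those syscalls could look something like this sketch, assuming the seccompiler crate (the syscall list is illustrative, not exhaustive):

```rust
use seccompiler::{apply_filter, BpfProgram, SeccompAction, SeccompFilter, TargetArch};
use std::collections::BTreeMap;

// Return EACCES for socket-related syscalls; everything else is allowed.
// With no socket in play, the worker should never legitimately hit these.
fn block_sockets() -> Result<(), Box<dyn std::error::Error>> {
    let mut rules: BTreeMap<i64, Vec<seccompiler::SeccompRule>> = BTreeMap::new();
    for syscall in [libc::SYS_socket, libc::SYS_socketpair, libc::SYS_connect] {
        rules.insert(syscall, vec![]); // empty rule list = match unconditionally
    }
    let filter = SeccompFilter::new(
        rules,
        SeccompAction::Allow,                      // default action
        SeccompAction::Errno(libc::EACCES as u32), // action on match
        TargetArch::x86_64,
    )?;
    let bpf: BpfProgram = filter.try_into()?;
    apply_filter(&bpf)?;
    Ok(())
}
```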

It would be a big change so looking for feedback before I do it.

@mrcnski
Contributor Author

mrcnski commented Aug 29, 2023

Looks like shared memory was considered in the past:
https://hackmd.io/c9GhdTjpT-y0W-w8v8CknA#Problems-of-the-Current-Design

I haven't dug into it yet. But I don't like the linking solution I have, because the child and host become very tightly coupled around the expected FS structure. Of course they were already tightly coupled, but now the coupling is hidden, easy to break, and must be documented everywhere.
