New PVF validation host #2710

pepyakin · 2021-03-25T15:12:49Z

This PR proposes a fresh implementation of the PVF validation host. The new version tries to avoid the shortcomings of the previous approach and, despite its size, only lays a foundation for further evolution.

This basically implements the vision outlined here: https://hackmd.io/c9GhdTjpT-y0W-w8v8CknA

This PR builds on top of paritytech/substrate#8394 (still unmerged at the moment, hence the in-progress label). Note, however, that this is not a companion PR, since the mentioned change can be merged independently of this one.

This PR incorporates the following coupled changes:

The new PVF validation host, implemented in node/core/pvf. The root of that crate is a good entry point for reviewing the code; it has an overview of how it works.
Integration of the new host into candidate-validation and the CLI, replacing the legacy validation host.
Removal of the legacy validation host.

I recommend approaching this PR in the following order:

For a start, you can look at "Opt-out from fast instance reuse and foundation for other refactorings" (substrate#8394) to see exactly how the Substrate executor was changed to provide compiled artifacts and to accept them instead of the raw code.
Then, you can check out how the node/core/pvf crate is implemented. A high-level description of the inner workings can be found in node/core/pvf/src/lib.rs.
Knowing the public interface of the node/core/pvf crate, you can see how the integration into the node is implemented. Specifically, how it is integrated into candidate-validation in place of the legacy wasm_executor from polkadot-parachain.

Suggestions for what to look for:

Race conditions. E.g. races between pruning and either preparation or execution.
Problems with async code, including non-idiomatic usage (I don't feel confident writing async code)
Portability and safety concerns (across POSIX platforms)

Rationale and unsolicited FAQ

This implementation is POSIX exclusive, why is that?

PVF validation is only performed by validators. Realistically, validators target only POSIX-based systems. The same goes for development. Initially I thought I might implement a fallback for other platforms, but it turns out that we don't even have builders for them and that, on Windows, WSL2 works just fine. Therefore I didn't implement one, and the project will no longer compile on Windows.

Should we ever want to port it to Windows, it can be done relatively easily.

The project should still build for wasm and Android as it did before.

I see many allocations and code that could be better optimized, can you fix those?

I think I can, but I don't see much sense in it. The code is very cold, so I took the liberty of optimizing for readability rather than performance.

Why does caching always go through files instead of attempting to cache in memory?

First of all, we cannot rely exclusively on in-memory caches. The artifacts can swell in size, and keeping 100 of them in memory could be quite a big memory requirement. So there should be a fallback that persists the artifacts on disk. Another thing is that we also want to be able to restart the node without recompiling all the artifacts.

So why don't we use a combination of a memory cache and a file cache? The answer is simplicity and no need.
We can get away with only a file cache because recently accessed files will reside in the so-called page cache, and thus the disk hit will be avoided provided there is enough memory.

One downside is that, before execution, the file contents have to be copied from the page cache into the userland address space of the worker process. This could be addressed later by mmapping these files, which would allow deserializing the artifacts directly from the page cache.
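A minimal sketch of the file-only caching idea, not the code from this PR. The helper names (`artifact_path`, `store_artifact`, `load_artifact`) and the flat directory layout are assumptions made for illustration. The point is that the process keeps no in-memory map of artifacts: every load is a plain file read, which the OS can serve from the page cache when the file was accessed recently.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical helper: maps an artifact identifier to a file in the cache directory.
fn artifact_path(cache_dir: &Path, artifact_id: &str) -> PathBuf {
    cache_dir.join(artifact_id)
}

/// Persist a compiled artifact on disk. Nothing is retained in process memory,
/// so the artifact also survives a node restart.
fn store_artifact(cache_dir: &Path, artifact_id: &str, bytes: &[u8]) -> io::Result<()> {
    fs::create_dir_all(cache_dir)?;
    fs::write(artifact_path(cache_dir, artifact_id), bytes)
}

/// Load an artifact right before execution.
/// Returns `Ok(None)` if the artifact has not been prepared yet.
/// The read copies the bytes into this process's address space, which is the
/// downside mentioned above.
fn load_artifact(cache_dir: &Path, artifact_id: &str) -> io::Result<Option<Vec<u8>>> {
    match fs::read(artifact_path(cache_dir, artifact_id)) {
        Ok(bytes) => Ok(Some(bytes)),
        Err(err) if err.kind() == io::ErrorKind::NotFound => Ok(None),
        Err(err) => Err(err),
    }
}
```

Note that `fs::write` is not atomic, so a real cache would also have to deal with partially written files, for example by writing to a temporary file and renaming it.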

Why does the worker now return stringified errors?

Well, actually, it used to do so as well when running in the typical external-process isolation mode. Structured errors were only in use when the code was running in the same-process mode.

I considered implementing structured errors but didn't see enough benefit in that.

What is the reason there are two types of workers, rather than one?

I used two different kinds of workers (one for preparation and one for execution) because I anticipate we will introduce resource limits on worker processes and compilation and execution have two different resource consumption profiles.
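To illustrate why separate worker kinds make per-kind limits easy to express, here is a rough sketch. Everything in it is hypothetical: the `WorkerKind` and `ResourceLimits` types do not come from this PR, and the numbers are placeholders chosen only so that the two profiles differ.

```rust
use std::time::Duration;

/// Hypothetical: the two kinds of worker processes described above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum WorkerKind {
    /// Compiles PVF code into an artifact.
    Prepare,
    /// Executes an already prepared artifact.
    Execute,
}

/// Hypothetical: limits that could be applied to a worker process.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct ResourceLimits {
    max_memory_bytes: u64,
    timeout: Duration,
}

impl WorkerKind {
    /// Placeholder values, not taken from the PR. The only point is that
    /// compilation and execution can be given independent budgets.
    fn limits(self) -> ResourceLimits {
        match self {
            WorkerKind::Prepare => ResourceLimits {
                max_memory_bytes: 4 * 1024 * 1024 * 1024,
                timeout: Duration::from_secs(60),
            },
            WorkerKind::Execute => ResourceLimits {
                max_memory_bytes: 1024 * 1024 * 1024,
                timeout: Duration::from_secs(3),
            },
        }
    }
}
```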

Only UTF-8 paths supported

As far as I understand, we only support those. Besides, most sanely configured environments we target use UTF-8 paths.
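For reference, this is roughly what such a restriction looks like in Rust. `Path::to_str` returns `None` for paths that are not valid UTF-8, so the check is a one-liner. The `utf8_path` helper is a made-up name for this sketch, not a function from the PR.

```rust
use std::path::Path;

/// Hypothetical helper: returns the path as `&str`, or an error message
/// if the path is not valid UTF-8.
fn utf8_path(path: &Path) -> Result<&str, String> {
    path.to_str()
        .ok_or_else(|| format!("non-UTF-8 path: {}", path.display()))
}
```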

Why are artifacts prefixed with a manually bumped number rather than just relying on the version of wasmtime?

I took this path because it is more manual, which is a good thing in this case IMO. When the version of wasmtime is bumped, a test fails, which draws the attention of a developer and hopefully gives them a chance to react to something unexpected. That doesn't seem too burdensome.

Should it get annoying we can consider switching to automatic version sourcing.
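A sketch of the scheme, under loud assumptions: the prefix string, the version string, and the function names below are placeholders for illustration and are not copied from the PR. The idea is that the prefix is a hand-maintained constant, and a separate check compares the engine version that is actually in use against the version the prefix was last reviewed for.

```rust
/// Hypothetical value: bumped by hand whenever the artifact format may have changed.
const ARTIFACT_PREFIX: &str = "wasmtime_1_";

/// Placeholder value: the engine version the prefix above was last reviewed against.
const REVIEWED_ENGINE_VERSION: &str = "0.0.1";

/// Builds the on-disk file name of an artifact from the hash of the PVF code.
fn artifact_file_name(code_hash: &[u8]) -> String {
    let hex: String = code_hash.iter().map(|b| format!("{:02x}", b)).collect();
    format!("{}0x{}", ARTIFACT_PREFIX, hex)
}

/// A test would call this with the version of the engine dependency that is
/// actually compiled in. A `false` result means somebody bumped the engine and
/// has to decide whether `ARTIFACT_PREFIX` needs bumping too.
fn prefix_is_up_to_date(actual_engine_version: &str) -> bool {
    actual_engine_version == REVIEWED_ENGINE_VERSION
}
```

Because the prefix is part of the file name, artifacts written under an older prefix are simply never looked up again after a bump.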

Why not `fork`?

To avoid this scheme of creating a Unix socket and spawning a process, we could theoretically use fork. However, it turns out that fork is really too sharp a tool. The main concern is that the forked process starts with only one thread running. This makes the environment similar to that of a signal handler. Such an environment is very limited, because basic runtime services such as memory allocation could lead to a deadlock. I am not sure if there is a way to resume all threads, but resuming all threads in the worker processes doesn't seem like a good idea either.

This sounds like a can of worms so I didn't go there.
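For comparison, the socket-and-spawn scheme can be sketched with the standard library alone. This is an illustration under assumptions, not the PR's implementation: the function names, the convention of passing the socket path as the only argument, and the toy ping/pong exchange are all invented here.

```rust
use std::io::{self, Read, Write};
use std::os::unix::net::{UnixListener, UnixStream};
use std::path::Path;
use std::process::{Child, Command};

/// Host side (hypothetical): create the socket, start a fresh process from
/// `program`, and wait for that process to connect back.
/// Unlike `fork`, the child starts from a clean `main` with a normal runtime.
/// A real host would also apply a timeout, since `accept` blocks forever
/// if the child never connects.
fn spawn_worker(program: &Path, socket_path: &Path) -> io::Result<(Child, UnixStream)> {
    let listener = UnixListener::bind(socket_path)?;
    let child = Command::new(program).arg(socket_path).spawn()?;
    let (stream, _addr) = listener.accept()?;
    Ok((child, stream))
}

/// Worker side (hypothetical): called from the spawned process's `main`
/// with the socket path it received as an argument.
fn worker_entrypoint(socket_path: &Path) -> io::Result<()> {
    let mut stream = UnixStream::connect(socket_path)?;
    // Toy protocol: answer a single 4-byte "ping" with "pong".
    let mut request = [0u8; 4];
    stream.read_exact(&mut request)?;
    if &request == b"ping" {
        stream.write_all(b"pong")?;
    }
    Ok(())
}
```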

pepyakin · 2021-03-25T20:01:35Z

I have just checked, and macOS seems to work just fine.
The test failure indicates that I broke the test service.

bkchr

Have not reviewed all of it yet.

node/core/candidate-validation/Cargo.toml

node/core/pvf/Cargo.toml

node/core/pvf/src/artifacts.rs

node/core/pvf/bin/puppet_worker.rs

node/core/pvf/src/execute/queue.rs

node/core/pvf/src/execute/worker.rs

node/core/pvf/src/worker_common.rs

Co-authored-by: Bastian Köcher <bkchr@users.noreply.github.com>

rphmeier · 2021-04-03T21:23:38Z

node/service/src/lib.rs

 	telemetry_worker_handle: Option<TelemetryWorkerHandle>,
+	program_path: Option<std::path::PathBuf>,


Is this the path of the binary itself? Or is it a db path or something like that?

The former one. Do you think the name is confusing?

Maybe exe_path is less ambiguous and in line with std::env::current_exe?

exe and program sound synonymous to me. I suspect the problem may be that it is not immediately obvious what program/executable we are talking about here.
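To make the meaning of the parameter concrete, here is one way an optional program path could be resolved. This is a guess at the intent, not code from the PR: the helper name is made up, and falling back to `std::env::current_exe` when no path is given is an assumption.

```rust
use std::io;
use std::path::PathBuf;

/// Hypothetical helper: decide which executable to spawn worker processes from.
/// Uses the explicitly supplied path if there is one, otherwise the path of
/// the currently running binary.
fn resolve_program_path(program_path: Option<PathBuf>) -> io::Result<PathBuf> {
    match program_path {
        Some(path) => Ok(path),
        None => std::env::current_exe(),
    }
}
```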

node/core/pvf/src/artifacts.rs

rphmeier · 2021-04-03T21:36:00Z

node/core/pvf/Cargo.toml

+
+[dependencies]
+always-assert = "0.1"
+async-std = { version = "1.8.0", features = ["attributes"] }


Is it OK to mix async-std and tokio?

uh-oh!!

I assumed the effective executor (i.e. the stuff that was spawned with ctx.spawn_blocking) is async-std, but I never actually checked. So you are saying it is tokio?

I don't know for sure 🙈

Ok, this shouldn't be a problem. Apparently async_std should work on all executors.

node/core/pvf/src/error.rs

parachain/test-parachains/adder/collator/Cargo.toml

parachain/test-parachains/adder/collator/bin/puppet_worker.rs

parachain/test-parachains/adder/collator/Cargo.toml

ordian · 2021-04-08T20:27:55Z

(sorry for introducing more merge conflicts)

Specifically the leftovers after removing real-overseer

pepyakin

_

Co-authored-by: Andronik Ordian <write@reusable.software>

ordian · 2021-04-08T22:09:39Z

bot merge

ghost · 2021-04-08T22:09:44Z

Error: Missing process info; check that the PR belongs to a project column.

Merge cannot succeed as it is. Check out the criteria for merge.

* Implement PVF validation host
* WIP: Diener
* Increase the alloted compilation time
* Add more comments
* Minor clean up
* Apply suggestions from code review
  Co-authored-by: Bastian Köcher <bkchr@users.noreply.github.com>
* Fix pruning artifact removal
* Fix formatting and newlines
* Fix the thread pool
* Update node/core/pvf/src/executor_intf.rs
  Co-authored-by: Bastian Köcher <bkchr@users.noreply.github.com>
* Remove redundant test declaration
* Don't convert the path into an intermediate string
* Try to workaround the test failure
* Use the puppet_worker trick again
* Fix a blip
* Move `ensure_wasmtime_version` under the tests mod
* Add a macro for puppet_workers
* fix build for not real-overseer
* Rename the puppet worker for adder collator
* play it safe with the name of adder puppet worker
* Typo: triggered
* Add more comments
* Do not kill exec worker on every error
* Plumb Duration for timeouts
* typo: critical
* Add proofs
* Clean unused imports
* Revert "WIP: Diener"
  This reverts commit ff2d3ff.
* Sync version of wasmtime
* Update cargo.lock
* Update Substrate
* Merge fixes still
* Update wasmtime version in test
* bastifmt
  Co-authored-by: Bastian Köcher <bkchr@users.noreply.github.com>
* Squash spaces
* Trailing new line for testing.rs
* Remove controversial code
* comment about biasing
* Fix suggestion
* Add comments
* make it more clear why unwrap_err
* tmpfile retry
* proper proofs for claim_idle
* Remove mutex from ValidationHost
* Add some more logging
* Extract exec timeout into a constant
* Add some clarifying logging
* Use blake2_256
* Clean up the merge
  Specifically the leftovers after removing real-overseer
* Update parachain/test-parachains/adder/collator/Cargo.toml
  Co-authored-by: Andronik Ordian <write@reusable.software>

Co-authored-by: Bastian Köcher <bkchr@users.noreply.github.com>
Co-authored-by: Andronik Ordian <write@reusable.software>


pepyakin added 3 commits March 25, 2021 14:25

Implement PVF validation host

3da7e43

WIP: Diener

ff2d3ff

Increase the alloted compilation time

d0603dc

pepyakin added the following labels Mar 25, 2021: A3-in_progress (Pull request is in progress. No review needed at this stage.), B0-silent (Changes should not be mentioned in any release notes), C1-low (PR touches the given topic and has a low impact on builders.)

rphmeier added the A9-FUCKYEAH! label Mar 25, 2021

pepyakin added 2 commits March 29, 2021 13:14

Add more comments

3f1087a

Minor clean up

9413fac

bkchr reviewed Mar 29, 2021


pepyakin and others added 15 commits March 29, 2021 16:53

Apply suggestions from code review

d73db14

Co-authored-by: Bastian Köcher <bkchr@users.noreply.github.com>

Fix pruning artifact removal

3fec908

Fix formatting and newlines

b9d0b6b

Fix the thread pool

aa3ef8e

Update node/core/pvf/src/executor_intf.rs

f4c3a5e

Co-authored-by: Bastian Köcher <bkchr@users.noreply.github.com>

Remove redundant test declaration

83082dd

Don't convert the path into an intermediate string

f4f4383

Try to workaround the test failure

54f6ac0

Use the puppet_worker trick again

393a1d7

Fix a blip

1482878

Move ensure_wasmtime_version under the tests mod

01e572c

Add a macro for puppet_workers

fd16b2b

fix build for not real-overseer

0df8f39

Rename the puppet worker for adder collator

bb7b05a

play it safe with the name of adder puppet worker

4270358

rphmeier reviewed Apr 3, 2021

node/core/pvf/src/artifacts.rs (resolved)

rphmeier reviewed Apr 3, 2021

node/core/pvf/src/error.rs (outdated, resolved)

pepyakin added 10 commits April 8, 2021 11:03

Add comments

26ed193

make it more clear why unwrap_err

8db420e

tmpfile retry

7842469

proper proofs for claim_idle

c06a2fd

Remove mutex from ValidationHost

ef58171

Add some more logging

1feca0c

Extract exec timeout into a constant

ffb3caa

Add some clarifying logging

d151556

Use blake2_256

406fd71

Merge remote-tracking branch 'origin/master' into ser-wasm-cache

0fb7311

ordian reviewed Apr 8, 2021

parachain/test-parachains/adder/collator/Cargo.toml (outdated, resolved)

ordian reviewed Apr 8, 2021

parachain/test-parachains/adder/collator/bin/puppet_worker.rs (outdated, resolved)

parachain/test-parachains/adder/collator/bin/puppet_worker.rs (outdated, resolved)

parachain/test-parachains/adder/collator/Cargo.toml (outdated, resolved)

Clean up the merge

d5ab3a1

Specifically the leftovers after removing real-overseer

pepyakin commented Apr 8, 2021


pepyakin and others added 2 commits April 8, 2021 21:18

Merge remote-tracking branch 'origin/master' into ser-wasm-cache

3f23214

Update parachain/test-parachains/adder/collator/Cargo.toml

250a5ae

Co-authored-by: Andronik Ordian <write@reusable.software>

ordian merged commit 0eb7905 into master Apr 8, 2021

ordian deleted the ser-wasm-cache branch April 8, 2021 22:09

rphmeier added a commit that referenced this pull request Apr 8, 2021

Revert "New PVF validation host (#2710)"

f623e52

This reverts commit cb6e65f.

rphmeier added a commit that referenced this pull request Apr 8, 2021

Revert "Revert "New PVF validation host (#2710)""

311bd3b

This reverts commit f623e52.

joao-paulo-parity mentioned this pull request Apr 9, 2021

Refine review approval detection paritytech/parity-processbot#277

Closed

pepyakin mentioned this pull request Sep 26, 2022

Member Request polkadot-fellows/seeding#6

Merged

mrcnski mentioned this pull request May 30, 2023

PVF: Instantiate wasm in pre-checking #7246

Merged

mrcnski mentioned this pull request Aug 29, 2023

PVF worker: consider fork instead of thread for better isolation paritytech/polkadot-sdk#574

Closed

s0me0ne-unkn0wn mentioned this pull request Apr 17, 2024

Use higher priority for PVF preparation in dispute/approval context paritytech/polkadot-sdk#4172

Merged
