Skip to content
This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

New PVF validation host #2710

Merged
merged 53 commits into from
Apr 8, 2021
Merged

New PVF validation host #2710

merged 53 commits into from
Apr 8, 2021

Conversation

pepyakin
Copy link
Contributor

@pepyakin pepyakin commented Mar 25, 2021

This PR proposes a fresh implementation of PVF validation host. This new version tries to avoid shortcomings of the previous approach and, despite its size, only lays a foundation for the further evolution.

This basically implements the vision outlined here https://hackmd.io/c9GhdTjpT-y0W-w8v8CknA

This PR builds on top of paritytech/substrate#8394 (atm still unmerged, hence inprogress). But note that this is not a companion though, since the mentioned change can be merged independently from this PR.

This PR incorporates the following coupled changes:

  1. The new PVF validation host implemented by node/core/pvf. The root of that crate is a good entrypoint for reviewing the code, it has overview of how it works.
  2. Integration of it and replacing the legacy validation host, in candidate-validation and CLI.
  3. Removal of the legacy validation host.

I recommend approaching this PR in the following order:

  1. For a start, you can look at Opt-out from fast instance reuse and foundation for other refactorings substrate#8394 to see how exactly Substrate Executor was changed to provide compiled artifacts and accept them instead of the raw code.
  2. Then, you can check out how is the node/core/pvf crate implemented. A high level description of the inner workings can be found in node/core/pvf/src/lib.rs.
  3. Knowing what is the public interface of the node/core/pvf crate you can see how is the integration into the node is implemented. Specifically, how it is integrated into candidate-validation instead of the legacy polkadot-parachain's wasm_executor.

Suggestions for what to look for:

  1. Race conditions. E.g. races between pruning and either preparation or execution.
  2. Problems with and non idiomatic async code (I don't feel confident writing async code)
  3. Portability and safety concerns (across POSIX platforms)

Rationale and unsolicited FAQ

This implementation is POSIX exclusive, why is that?

PVF validation only performed by the validators. Realistically, validators target only POSIX based systems. The same happens with development. Initially I thought that I may implement a fallback for other platforms, it turns out that we don't even have builders for them and also that on windows WSL2 works just fine. Therefore I didn't implement it and the project will no longer compile on Windows.

Should we ever want to port it on windows, it can be done relatively easily.

The project should still build for wasm and android as it did before.

I see many allocations and a could be better optimized code, can you fix those?

I think I can, but I don't see a lot of sense in it. The code is very cold so I took liberty and optimized for code readability rather than performance.

Why is caching always goes through files and doesn't attempt to cache in memory?

First of all, we cannot rely exclusively on in memory caches. The artifacts can swell in size and should it be 100 of them in memory that can be quite a big memory requirement. So there should be a fallback that persists the files on disk. Another thing is that we also want to be able to restart the node without recompiling all the artifacts.

So why don't we use a combination of memory and a file cache? The answer is simplicity and no need.
We can get away with only a file cache because the recently accessed files will reside in so called page cache, and thus, the disk hit will be avoided should be there enough memory.

One downside is that before execution the file contents should be copied from the page cache to the userland address space of the worker process. This could be addressed later by mmaping these files which would allow deserializing the artifacts directly from the page cache.

Why worker now returns stringified errors?

Well, actually, it used to as well when running in typical external process isolation mode. These structured errors were only in use when the code was running in the same process mode.

I considered implementing structured errors but didn't see enough benefit in that.

What is the reason there are two types of workers, rather than one?

I used two different kinds of workers (one for preparation and one for execution) because I anticipate we will introduce resource limits on worker processes and compilation and execution have two different resource consumption profiles.

Only UTF-8 paths supported

As far as I understand, we only support those. Besides, most of sane configured environments we target are configured to use UTF-8 paths.

Why Artifacts are prefixed with a manually bumped number rather than to just rely on the version of wasmtime?

I took this path that is there because it seems that it is more manual, which is a good thing in this case IMO. When the version of wasmtime is bumped then it would trigger a test error which would draw attention of a developer, which hopefully would give a chance to react to something unexpected and doesn't seem to be too burdensome.

Should it get annoying we can consider switching to automatic version sourcing.

Why not fork?

In order to avoid this scheme with creating an unix socket and spawn a process, we theoretically could use fork. However, turns out that fork is really too sharp of a tool. The main concern is that the forked process starts with only one thread running. This in turn makes such an environment similar to a signal handler. Such an environment is very limited, due to basic runtime services such as memory allocations could lead to a dead lock. I am not sure if there is a way to resume all threads, but resuming all threads also doesn't seem to be a good idea in the worker processes.

This sounds like a can of worms so I didn't go there.

@pepyakin pepyakin added A3-in_progress Pull request is in progress. No review needed at this stage. B0-silent Changes should not be mentioned in any release notes C1-low PR touches the given topic and has a low impact on builders. labels Mar 25, 2021
@pepyakin
Copy link
Contributor Author

  • I just have checked and macOS seem to work just fine.
  • The test failure indicates that I broke the test service.

Copy link
Member

@bkchr bkchr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did not yet reviewed all of it

node/core/candidate-validation/Cargo.toml Outdated Show resolved Hide resolved
node/core/pvf/Cargo.toml Outdated Show resolved Hide resolved
node/core/pvf/Cargo.toml Outdated Show resolved Hide resolved
node/core/pvf/src/artifacts.rs Outdated Show resolved Hide resolved
node/core/pvf/src/artifacts.rs Outdated Show resolved Hide resolved
node/core/pvf/bin/puppet_worker.rs Outdated Show resolved Hide resolved
node/core/pvf/src/execute/queue.rs Outdated Show resolved Hide resolved
node/core/pvf/src/execute/worker.rs Outdated Show resolved Hide resolved
node/core/pvf/src/worker_common.rs Outdated Show resolved Hide resolved
node/core/pvf/src/worker_common.rs Outdated Show resolved Hide resolved
telemetry_worker_handle: Option<TelemetryWorkerHandle>,
program_path: Option<std::path::PathBuf>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the path of the binary itself? Or is it a db-path or somtehing like that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The former one. Do you think the name is confusing?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe exe_path is less ambiguous and in line with std::evn::current_exe?

Copy link
Contributor Author

@pepyakin pepyakin Apr 8, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

exe and program sound synonymous for me. I suspect the problem may be that it is not immediately obvious what program/executable we are talking about here.


[dependencies]
always-assert = "0.1"
async-std = { version = "1.8.0", features = ["attributes"] }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it OK to mix async-std and tokio?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

uh-oh!!

I assumed the effective executor (i.e. the stuff that was spawned with ctx.spawn_blocking) is async-std, but I never actually checked. So you are saying it is tokio?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know for sure 🙈

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, this shouldn't be a problem. Apparently async_std should work on all executors.

node/core/pvf/src/error.rs Outdated Show resolved Hide resolved
@ordian
Copy link
Member

ordian commented Apr 8, 2021

(sorry for introducing more merge conflicts)

Specifically the leftovers after removing real-overseer
Copy link
Contributor Author

@pepyakin pepyakin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_

@ordian
Copy link
Member

ordian commented Apr 8, 2021

bot merge

@ghost
Copy link

ghost commented Apr 8, 2021

Error: Missing process info; check that the PR belongs to a project column.

Merge cannot succeed as it is. Check out the criteria for merge.

@ordian ordian merged commit 0eb7905 into master Apr 8, 2021
@ordian ordian deleted the ser-wasm-cache branch April 8, 2021 22:09
rphmeier pushed a commit that referenced this pull request Apr 8, 2021
* Implement PVF validation host

* WIP: Diener

* Increase the alloted compilation time

* Add more comments

* Minor clean up

* Apply suggestions from code review

Co-authored-by: Bastian Köcher <bkchr@users.noreply.github.com>

* Fix pruning artifact removal

* Fix formatting and newlines

* Fix the thread pool

* Update node/core/pvf/src/executor_intf.rs

Co-authored-by: Bastian Köcher <bkchr@users.noreply.github.com>

* Remove redundant test declaration

* Don't convert the path into an intermediate string

* Try to workaround the test failure

* Use the puppet_worker trick again

* Fix a blip

* Move `ensure_wasmtime_version` under the tests mod

* Add a macro for puppet_workers

* fix build for not real-overseer

* Rename the puppet worker for adder collator

* play it safe with the name of adder puppet worker

* Typo: triggered

* Add more comments

* Do not kill exec worker on every error

* Plumb Duration for timeouts

* typo: critical

* Add proofs

* Clean unused imports

* Revert "WIP: Diener"

This reverts commit ff2d3ff.

* Sync version of wasmtime

* Update cargo.lock

* Update Substrate

* Merge fixes still

* Update wasmtime version in test

* bastifmt

Co-authored-by: Bastian Köcher <bkchr@users.noreply.github.com>

* Squash spaces

* Trailing new line for testing.rs

* Remove controversial code

* comment about biasing

* Fix suggestion

* Add comments

* make it more clear why unwrap_err

* tmpfile retry

* proper proofs for claim_idle

* Remove mutex from ValidationHost

* Add some more logging

* Extract exec timeout into a constant

* Add some clarifying logging

* Use blake2_256

* Clean up the merge

Specifically the leftovers after removing real-overseer

* Update parachain/test-parachains/adder/collator/Cargo.toml

Co-authored-by: Andronik Ordian <write@reusable.software>

Co-authored-by: Bastian Köcher <bkchr@users.noreply.github.com>
Co-authored-by: Andronik Ordian <write@reusable.software>
rphmeier added a commit that referenced this pull request Apr 8, 2021
rphmeier added a commit that referenced this pull request Apr 8, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
A0-please_review Pull request needs code review. B0-silent Changes should not be mentioned in any release notes C1-low PR touches the given topic and has a low impact on builders.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants