-
Notifications
You must be signed in to change notification settings - Fork 1.6k
Conversation
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did not yet reviewed all of it
Co-authored-by: Bastian Köcher <bkchr@users.noreply.github.com>
Co-authored-by: Bastian Köcher <bkchr@users.noreply.github.com>
telemetry_worker_handle: Option<TelemetryWorkerHandle>, | ||
program_path: Option<std::path::PathBuf>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the path of the binary itself? Or is it a db-path or somtehing like that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The former one. Do you think the name is confusing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe exe_path
is less ambiguous and in line with std::evn::current_exe
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
exe
and program
sound synonymous for me. I suspect the problem may be that it is not immediately obvious what program/executable we are talking about here.
|
||
[dependencies] | ||
always-assert = "0.1" | ||
async-std = { version = "1.8.0", features = ["attributes"] } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it OK to mix async-std
and tokio?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
uh-oh!!
I assumed the effective executor (i.e. the stuff that was spawned with ctx.spawn_blocking) is async-std, but I never actually checked. So you are saying it is tokio?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know for sure 🙈
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, this shouldn't be a problem. Apparently async_std
should work on all executors.
(sorry for introducing more merge conflicts) |
Specifically the leftovers after removing real-overseer
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_
Co-authored-by: Andronik Ordian <write@reusable.software>
bot merge |
Error: Missing process info; check that the PR belongs to a project column. Merge cannot succeed as it is. Check out the criteria for merge. |
* Implement PVF validation host * WIP: Diener * Increase the alloted compilation time * Add more comments * Minor clean up * Apply suggestions from code review Co-authored-by: Bastian Köcher <bkchr@users.noreply.github.com> * Fix pruning artifact removal * Fix formatting and newlines * Fix the thread pool * Update node/core/pvf/src/executor_intf.rs Co-authored-by: Bastian Köcher <bkchr@users.noreply.github.com> * Remove redundant test declaration * Don't convert the path into an intermediate string * Try to workaround the test failure * Use the puppet_worker trick again * Fix a blip * Move `ensure_wasmtime_version` under the tests mod * Add a macro for puppet_workers * fix build for not real-overseer * Rename the puppet worker for adder collator * play it safe with the name of adder puppet worker * Typo: triggered * Add more comments * Do not kill exec worker on every error * Plumb Duration for timeouts * typo: critical * Add proofs * Clean unused imports * Revert "WIP: Diener" This reverts commit ff2d3ff. * Sync version of wasmtime * Update cargo.lock * Update Substrate * Merge fixes still * Update wasmtime version in test * bastifmt Co-authored-by: Bastian Köcher <bkchr@users.noreply.github.com> * Squash spaces * Trailing new line for testing.rs * Remove controversial code * comment about biasing * Fix suggestion * Add comments * make it more clear why unwrap_err * tmpfile retry * proper proofs for claim_idle * Remove mutex from ValidationHost * Add some more logging * Extract exec timeout into a constant * Add some clarifying logging * Use blake2_256 * Clean up the merge Specifically the leftovers after removing real-overseer * Update parachain/test-parachains/adder/collator/Cargo.toml Co-authored-by: Andronik Ordian <write@reusable.software> Co-authored-by: Bastian Köcher <bkchr@users.noreply.github.com> Co-authored-by: Andronik Ordian <write@reusable.software>
This PR proposes a fresh implementation of PVF validation host. This new version tries to avoid shortcomings of the previous approach and, despite its size, only lays a foundation for the further evolution.
This basically implements the vision outlined here https://hackmd.io/c9GhdTjpT-y0W-w8v8CknA
This PR builds on top of paritytech/substrate#8394 (atm still unmerged, hence inprogress). But note that this is not a companion though, since the mentioned change can be merged independently from this PR.
This PR incorporates the following coupled changes:
I recommend approaching this PR in the following order:
candidate-validation
instead of the legacypolkadot-parachain
'swasm_executor
.Suggestions for what to look for:
Rationale and unsolicited FAQ
This implementation is POSIX exclusive, why is that?
PVF validation only performed by the validators. Realistically, validators target only POSIX based systems. The same happens with development. Initially I thought that I may implement a fallback for other platforms, it turns out that we don't even have builders for them and also that on windows WSL2 works just fine. Therefore I didn't implement it and the project will no longer compile on Windows.
Should we ever want to port it on windows, it can be done relatively easily.
The project should still build for wasm and android as it did before.
I see many allocations and a could be better optimized code, can you fix those?
I think I can, but I don't see a lot of sense in it. The code is very cold so I took liberty and optimized for code readability rather than performance.
Why is caching always goes through files and doesn't attempt to cache in memory?
First of all, we cannot rely exclusively on in memory caches. The artifacts can swell in size and should it be 100 of them in memory that can be quite a big memory requirement. So there should be a fallback that persists the files on disk. Another thing is that we also want to be able to restart the node without recompiling all the artifacts.
So why don't we use a combination of memory and a file cache? The answer is simplicity and no need.
We can get away with only a file cache because the recently accessed files will reside in so called page cache, and thus, the disk hit will be avoided should be there enough memory.
One downside is that before execution the file contents should be copied from the page cache to the userland address space of the worker process. This could be addressed later by mmaping these files which would allow deserializing the artifacts directly from the page cache.
Why worker now returns stringified errors?
Well, actually, it used to as well when running in typical external process isolation mode. These structured errors were only in use when the code was running in the same process mode.
I considered implementing structured errors but didn't see enough benefit in that.
What is the reason there are two types of workers, rather than one?
I used two different kinds of workers (one for preparation and one for execution) because I anticipate we will introduce resource limits on worker processes and compilation and execution have two different resource consumption profiles.
Only UTF-8 paths supported
As far as I understand, we only support those. Besides, most of sane configured environments we target are configured to use UTF-8 paths.
Why Artifacts are prefixed with a manually bumped number rather than to just rely on the version of wasmtime?
I took this path that is there because it seems that it is more manual, which is a good thing in this case IMO. When the version of wasmtime is bumped then it would trigger a test error which would draw attention of a developer, which hopefully would give a chance to react to something unexpected and doesn't seem to be too burdensome.
Should it get annoying we can consider switching to automatic version sourcing.
Why not
fork
?In order to avoid this scheme with creating an unix socket and spawn a process, we theoretically could use
fork
. However, turns out that fork is really too sharp of a tool. The main concern is that the forked process starts with only one thread running. This in turn makes such an environment similar to a signal handler. Such an environment is very limited, due to basic runtime services such as memory allocations could lead to a dead lock. I am not sure if there is a way to resume all threads, but resuming all threads also doesn't seem to be a good idea in the worker processes.This sounds like a can of worms so I didn't go there.