Separate preparation timeouts for PVF prechecking and execution #6139

mrcnski · 2022-10-11T17:39:43Z

PULL REQUEST

Overview

Per the linked issue, we make the required changes so that preparation for
execution is more lenient (by a factor of 3) than preparation for prechecking.

We add a compilation_timeout parameter to the PVF preparation job
and split the COMPILATION_TIMEOUT constant into two new consts.
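
A minimal sketch of the split described above, not the code from the diff. Only `EXECUTE_COMPILATION_TIMEOUT` and its 180 s value appear in the review below; the name `PRECHECK_COMPILATION_TIMEOUT`, its 60 s value (derived from the stated factor of 3), the `PrepareTrigger` enum and the helper function are assumptions made for illustration.

```rust
use std::time::Duration;

/// Timeout for preparation triggered by prechecking.
/// Assumed name and value: 180 s divided by the stated factor of 3.
pub const PRECHECK_COMPILATION_TIMEOUT: Duration = Duration::from_secs(60);

/// Timeout for preparation triggered by execution (value taken from the diff).
pub const EXECUTE_COMPILATION_TIMEOUT: Duration = Duration::from_secs(180);

/// Hypothetical enum naming what triggered the preparation job.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum PrepareTrigger {
    Precheck,
    Execute,
}

/// Picks the value that would be passed along as `compilation_timeout`.
pub fn compilation_timeout(trigger: PrepareTrigger) -> Duration {
    match trigger {
        PrepareTrigger::Precheck => PRECHECK_COMPILATION_TIMEOUT,
        PrepareTrigger::Execute => EXECUTE_COMPILATION_TIMEOUT,
    }
}
```

With these values, preparation for execution gets three times as long as preparation for prechecking before the worker is considered unresponsive.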

Todo

~~What kind of tests should be added?~~

Issues Closed

Closes #4132

mrcnski · 2022-10-11T17:54:03Z

Not sure how to label this for release, i.e. what is worthy of release notes?

rphmeier · 2022-10-12T05:52:49Z

@m-cat - the default should be 'silent' but major features, API changes, DB locations, and other things that impact the process of running a validator should be present in release notes.

eskimor

Excellent! Really nice work! One thing, with regards to leniency, for execution we have a factor of 6, having the same factor here seems to make sense and would make it a bit "safer".

eskimor · 2022-10-12T12:24:58Z

With regards to tests, for this particular change there is hardly something we could test.

With regards to leniency, for execution we have a factor of 6, having the same factor here seems to make sense and would make it a bit "safer".

Ok, actually the situation is a bit different. For backing/approval timeouts we only have like 2-3 validators with the short timeout, here we have a super majority - so a smaller lenience can be justified - still node load can vary over time, also the validator set may change over time ... better safe than sorry.

pepyakin

Good job. Don't get discouraged by the number of comments. They are all nits, which (at least for my reviews) means they are not blocking the merge and are not necessary to address.

Organizational note: I personally prefer landing code as soon as it's ready. If you see a part that is not ready and you want to address it before the merge, I am up for splitting it into another PR. That will give us a platform for discussing the changes with more focus. It's also possible to land it as-is and send the fixes in a follow-up PR (as long as everything is linked).

pepyakin · 2022-10-12T12:19:23Z

node/core/pvf/src/prepare/queue.rs

@@ -76,6 +79,8 @@ struct JobData {
 	/// The priority of this job. Can be bumped.
 	priority: Priority,
 	pvf: Pvf,
+	/// The timeout for the preparation job.
+	compilation_timeout: Duration,


Nit: I think here (and elsewhere) it's more correct to say preparation. The worker will perform preparation, which is the combination of prevalidation and compilation (production of the compiled artifact).

pepyakin · 2022-10-12T12:23:01Z

node/core/pvf/src/host.rs

@@ -38,6 +38,16 @@ use std::{
 	time::{Duration, SystemTime},
 };

+/// The time period after which the precheck preparation worker is considered unresponsive and will
+/// be killed.


Nit: those docs distinguish between a pre-check preparation worker and an execute preparation worker. I am not sure if that's the right way of thinking about it. After all, it's the very same worker doing the same job. It does not even have any parametrization.

pepyakin · 2022-10-12T12:28:22Z

node/core/pvf/src/host.rs

+/// The time period after which the execute preparation worker is considered unresponsive and will
+/// be killed.
+// NOTE: If you change this make sure to fix the buckets of `pvf_preparation_time` metric.
+pub const EXECUTE_COMPILATION_TIMEOUT: Duration = Duration::from_secs(180);


Nit: It would be great if there were a doc line explaining the relationship between the two. Perhaps, moving them into a module (inline or separate file) and in the module doc explaining the stuff we discussed in DMs?

pepyakin · 2022-10-12T12:31:06Z

node/core/pvf/src/metrics.rs

@@ -166,6 +167,7 @@ impl metrics::Metrics for Metrics {
 						20.0,
 						30.0,
 						60.0,
+						180.0,


Nit: Do you think that's a good resolution for this metric? IOW, ask yourself: looking at a metrics dashboard, is it possible that you would think "I wish there were more buckets available that are higher/lower than 180"? If you are inclined to say yes, just plop in more bands. It's very cold code.

What do you mean by cold code?

Generally, cold code is code that is not frequently called. Here, I meant that there are no performance reasons to save on the bands. More bands means less performance, but I did not even think about this too much: how many preparations per second can we reasonably do in the worst case? 100? So the node-performance argument is irrelevant, and by extension the memory one as well. I don't see other arguments against it.
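
To make the resolution question concrete, here is a small sketch of how a duration maps to a cumulative histogram bucket. It does not use the node's metrics code: `bucket_for` is a hypothetical helper, only the four bounds visible in the diff are reproduced, and the extra 90.0 and 120.0 bands are made up for illustration.

```rust
/// Returns the upper bound of the first bucket that can hold `seconds`,
/// or `None` if the value exceeds every explicit bound (it would then only
/// be counted in the implicit `+Inf` bucket).
pub fn bucket_for(seconds: f64, upper_bounds: &[f64]) -> Option<f64> {
    upper_bounds.iter().copied().find(|&bound| seconds <= bound)
}

/// The tail of the bucket list as it appears in the diff.
pub const TAIL_FROM_DIFF: [f64; 4] = [20.0, 30.0, 60.0, 180.0];

/// The same tail with two extra bands; 90.0 and 120.0 are illustrative only.
pub const TAIL_WITH_MORE_BANDS: [f64; 6] = [20.0, 30.0, 60.0, 90.0, 120.0, 180.0];
```

With the bounds from the diff, a 70 s preparation and a 170 s preparation land in the same 180 bucket; with the extra bands they are told apart.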

pepyakin · 2022-10-12T12:32:41Z

roadmap/implementers-guide/src/pvf-prechecking.md

+- **Prevalidation:** Right now this just tries to deserialize the binary with
+  parity-wasm. It is a part of *preparation*.
+- **Compilation:** This is the process of compiling a PVF from wasm code to
+  machine code. It is a part of *preparation*.


Nit: This book already has a glossary. Do you think it would be better to move those there? Here, we can leave a note saying that this document is heavy on terminology and refer readers to the glossary.

Alternatively (or better, additionally) we could embed those into the text as explanations. I think as a bonus this would allow us to structure the explanation hierarchically, which IMO is better. Something like the following abstract:

In order to make the PVF usable for candidate validation, it has to be registered on-chain.

As part of the registration process, it has to go through pre-checking.

Pre-checking is a game of attempting preparation and reporting the results back on-chain.

We define preparation as a process that validates the consistency of the wasm binary (aka prevalidation) and compiles the wasm module into machine code (referred to as the artifact).

Besides pre-checking, preparation can also be triggered by execution, since the compiled artifact is needed for execution.
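
The relationship between the terms can be sketched as code. This is an illustration only: every type and function here is hypothetical, the prevalidation stand-in merely checks the wasm magic bytes instead of deserializing the binary, and the compilation stand-in copies its input rather than producing machine code.

```rust
/// Hypothetical error type recording which stage of preparation failed.
#[derive(Debug, PartialEq, Eq)]
pub enum PrepareError {
    Prevalidation(String),
    Compilation(String),
}

/// Hypothetical stand-in for the compiled artifact.
#[derive(Debug, PartialEq, Eq)]
pub struct Artifact(pub Vec<u8>);

/// Stand-in for prevalidation: only checks the wasm magic bytes `\0asm`.
fn prevalidate(code: &[u8]) -> Result<(), PrepareError> {
    if code.starts_with(b"\0asm") {
        Ok(())
    } else {
        Err(PrepareError::Prevalidation("missing wasm magic bytes".into()))
    }
}

/// Stand-in for compilation: copies the input instead of emitting machine code.
fn compile(code: &[u8]) -> Result<Artifact, PrepareError> {
    Ok(Artifact(code.to_vec()))
}

/// Preparation is prevalidation followed by compilation.
pub fn prepare(code: &[u8]) -> Result<Artifact, PrepareError> {
    prevalidate(code)?;
    compile(code)
}
```

Both pre-checking and execution would go through the same `prepare` step; in this sketch they would differ only in how long they are willing to wait for it.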

pepyakin · 2022-10-12T12:50:41Z

node/core/pvf/src/host.rs

+/// The time period after which the execute preparation worker is considered unresponsive and will
+/// be killed.
+// NOTE: If you change this make sure to fix the buckets of `pvf_preparation_time` metric.
+pub const EXECUTE_COMPILATION_TIMEOUT: Duration = Duration::from_secs(180);


Nit: I wonder if this would be better named as lazy or lenient. After all we use it for the heads up signal which also requires a more permissive timeout.

pepyakin · 2022-10-12T12:51:02Z

node/core/pvf/src/host.rs

@@ -418,6 +429,9 @@ async fn handle_to_host(
 	Ok(())
 }

+/// Handles PVF prechecking.


Nit: ... prechecking requests

pepyakin · 2022-10-12T12:52:05Z

node/core/pvf/src/host.rs

@@ -485,9 +511,17 @@ async fn handle_execute_pvf(
 		}
 	} else {
 		// Artifact is unknown: register it and enqueue a job with the corresponding priority and
-		//
+		// PVF.


lol, finally this is fixed 🎉

mrcnski · 2022-10-12T23:28:24Z

Thanks for the reviews! 👍 Just stuck on a couple of CI checks:

Check reviews: Looks like I need a couple reviews from the ci or release-engineering teams.
zombienet: I'm trying to restart this one on GitLab, but I'm getting "This job could not start because it could not retrieve the needed artifacts: publish-polkadot-debug-image."

mrcnski · 2022-10-12T23:31:18Z

> still node load can vary over time, also the validator set may change over time ... better safe than sorry.

@eskimor Sounds good, I'll address it in the followup PR.

> Good job. Don't get discouraged by the number of comments. They are all nits, which (at least for my reviews) means they are not blocking the merge and are not necessary to address.

All good. I was expecting more comments than that for my first PR!

I agree about addressing the nits in a followup PR.

ordian · 2022-10-13T09:44:38Z

jFYI: continuous-integration/gitlab-zombienet-tests-parachains-disputes is expected to fail for now on master, it's marked as {"build_allow_failure":true}
should be fixed in #6142

slumber

Agreed with @pepyakin's comments; other than that, well done.

Note: once a PVF is queued for preparation with some timeout, any subsequent request discards its compilation_timeout parameter and simply enqueues its response_receiver (the way it's implemented now).

This is OK because we can't receive an execute request for unprepared code until it's enacted once the prechecking process concludes, and code that was already prechecked should pass this process with a greater timeout.

However, this is an external guarantee so wanted to make sure you keep it in mind.

UPD: nvm didn't make it in time 😪
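
The queueing behaviour described in the note above can be sketched as follows. This is not the host's actual code: `Queue`, `Job`, `ArtifactId` and the string-based `ResponseReceiver` are hypothetical stand-ins, chosen only to show that the timeout is fixed by the first request for a given PVF.

```rust
use std::collections::HashMap;
use std::time::Duration;

/// Hypothetical stand-ins for the artifact id and the response channel.
pub type ArtifactId = u32;
pub type ResponseReceiver = &'static str;

/// A queued preparation job: the timeout is fixed when the job is created.
#[derive(Debug)]
pub struct Job {
    pub compilation_timeout: Duration,
    pub waiting: Vec<ResponseReceiver>,
}

#[derive(Debug, Default)]
pub struct Queue {
    jobs: HashMap<ArtifactId, Job>,
}

impl Queue {
    /// Enqueues a request. If a job for `id` already exists, only the receiver
    /// is added and the `compilation_timeout` argument is discarded.
    pub fn enqueue(
        &mut self,
        id: ArtifactId,
        compilation_timeout: Duration,
        receiver: ResponseReceiver,
    ) {
        self.jobs
            .entry(id)
            .or_insert_with(|| Job { compilation_timeout, waiting: Vec::new() })
            .waiting
            .push(receiver);
    }

    /// Looks up the job for `id`, if one is queued.
    pub fn job(&self, id: ArtifactId) -> Option<&Job> {
        self.jobs.get(&id)
    }
}
```

If a precheck request with a 60 s timeout arrives first and an execute request with 180 s follows, the job in this sketch keeps the 60 s timeout and both receivers wait on it. Correctness then rests on the external guarantee mentioned in the note.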

mrcnski added 5 commits October 11, 2022 11:25

Add some documentation

10d1460

Add compilation_timeout parameter for PVF preparation job

ee0abe2

Update buckets in prometheus metrics

ea82029

Update prepare/queue tests

218d640

Update pvf-prechecking overview in implementer docs

8514193

mrcnski added the A0-please_review (Pull request needs code review) label Oct 11, 2022

mrcnski requested a review from pepyakin October 11, 2022 17:39

Fix some CI checks

b07e37d

mrcnski requested review from a team and chevdor as code owners October 11, 2022 17:52

paritytech-ci requested a review from a team October 11, 2022 17:53

mrcnski requested a review from s0me0ne-unkn0wn October 11, 2022 18:49

Merge branch 'master' into m-cat/pvf-timeouts

d9a1fac

eskimor added labels Oct 12, 2022: C1-low (PR touches the given topic and has a low impact on builders), D3-trivial 🧸 (PR contains trivial changes in a runtime directory that do not require an audit), B0-silent (Changes should not be mentioned in any release notes)

eskimor approved these changes Oct 12, 2022

pepyakin approved these changes Oct 12, 2022

ordian mentioned this pull request Oct 13, 2022

lingua.dic is not managed by CI team #6148

Merged

eskimor enabled auto-merge (squash) October 13, 2022 09:59

eskimor requested review from pepyakin and slumber October 13, 2022 10:00

sergejparity approved these changes Oct 13, 2022

paritytech-ci requested a review from a team October 13, 2022 10:34

alvicsam approved these changes Oct 13, 2022

eskimor merged commit 851a108 into master Oct 13, 2022

eskimor deleted the m-cat/pvf-timeouts branch October 13, 2022 11:00

slumber approved these changes Oct 13, 2022

mrcnski mentioned this pull request Oct 13, 2022

PVF timeouts follow-up #6151

Merged

mrcnski mentioned this pull request Nov 15, 2022

Relax PVF compilation deadline during preparation #4132

Closed
