Separate preparation timeouts for PVF prechecking and execution #6139
Not sure how to label this for release, i.e. what is worthy of release notes? |
@m-cat - the default should be 'silent' but major features, API changes, DB locations, and other things that impact the process of running a validator should be present in release notes. |
Excellent! Really nice work! One thing regarding leniency: for execution we have a factor of 6, so having the same factor here seems to make sense and would make it a bit "safer".
With regards to tests, for this particular change there is hardly something we could test.
Ok, actually the situation is a bit different. For backing/approval timeouts we only have like 2-3 validators with the short timeout, whereas here we have a supermajority, so a smaller leniency can be justified. Still, node load can vary over time, and the validator set may change over time too ... better safe than sorry.
Good job. Don't get discouraged by the number of comments. They are all nits, which (at least for my reviews) means they are not blocking the merge and are not necessary to address.
Organizational note: I personally prefer landing code as soon as it's ready. If you see a part that is not ready and you want to address it before the merge, I am up for splitting it into another PR. That will give us a platform for discussing the changes with more focus. It's also possible to land this as-is and send the fixes in a follow-up PR (as long as everything is linked).
@@ -76,6 +79,8 @@ struct JobData {
/// The priority of this job. Can be bumped.
priority: Priority,
pvf: Pvf,
/// The timeout for the preparation job.
compilation_timeout: Duration,
Nit: I think here (and elsewhere) it's more correct to say `preparation`. The worker will perform preparation, which is the combination of prevalidation and compilation (production of the compiled artifact).
@@ -38,6 +38,16 @@ use std::{
time::{Duration, SystemTime},
};

/// The time period after which the precheck preparation worker is considered unresponsive and will
/// be killed.
Nit: those docs distinguish between a pre-check preparation worker and an execute preparation worker. I am not sure that's the right way of thinking about it. After all, it's the very same worker doing the very same job. It does not even have any parametrization.
/// The time period after which the execute preparation worker is considered unresponsive and will
/// be killed.
// NOTE: If you change this make sure to fix the buckets of `pvf_preparation_time` metric.
pub const EXECUTE_COMPILATION_TIMEOUT: Duration = Duration::from_secs(180);
Nit: It would be great if there were a doc line explaining the relationship between the two. Perhaps, moving them into a module (inline or separate file) and in the module doc explaining the stuff we discussed in DMs?
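For illustration, the suggested module could look roughly like the sketch below. Only `EXECUTE_COMPILATION_TIMEOUT` and its 180-second value appear in the diff; the module name, the `PRECHECK_COMPILATION_TIMEOUT` name, the `EXECUTE_LENIENCY_FACTOR` helper, and the 60-second value (derived from the "factor of 3" in the PR description) are assumptions made for this example.

```rust
/// Timeouts for PVF preparation (prevalidation plus compilation).
///
/// The same worker does the same job in both cases; only the deadline differs.
/// Preparation triggered by execution gets a more lenient deadline than
/// preparation triggered by prechecking, so that code which passed prechecking
/// is unlikely to time out later when it is prepared for execution.
pub mod timeouts {
    use std::time::Duration;

    /// Hypothetical helper: how much more lenient the execute deadline is.
    /// The value 3 is taken from the PR description.
    pub const EXECUTE_LENIENCY_FACTOR: u64 = 3;

    /// Assumed base deadline in seconds (180 / 3, not shown in the diff).
    const PRECHECK_SECS: u64 = 60;

    /// Deadline for preparation requested by prechecking (name assumed).
    pub const PRECHECK_COMPILATION_TIMEOUT: Duration = Duration::from_secs(PRECHECK_SECS);

    /// Deadline for preparation requested by execution.
    /// Evaluates to 180 seconds, matching the constant in the diff.
    pub const EXECUTE_COMPILATION_TIMEOUT: Duration =
        Duration::from_secs(PRECHECK_SECS * EXECUTE_LENIENCY_FACTOR);
}
```

Deriving one constant from the other keeps the relationship visible in code as well as in the docs, at the cost of hiding the literal `180` that the metric buckets depend on.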
@@ -166,6 +167,7 @@ impl metrics::Metrics for Metrics {
20.0,
30.0,
60.0,
180.0,
Nit: Do you think that's a good resolution for this metric? IOW, ask the question, looking at a metrics dashboard do you think it's possible that you would think for yourself "I wish there were more buckets available that are higher/lesser than 180"? If you are inclined to say yes just plop more bands. It's very cold code.
What do you mean by cold code?
Generally, cold code is code that is not called frequently. Here, I meant that there are no performance reasons to save on the bands, assuming that more bands means less performance. I did not even think about this too much: how many preparations per second can we reasonably do in the worst case? 100? So the node performance argument is irrelevant, and by extension so is memory. I don't see other arguments against it.
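As a rough sketch of the bucket discussion, the check below verifies that a list of histogram bands is strictly increasing and reaches the longest timeout. The bucket list and the `buckets_cover` helper are hypothetical; only the values 20.0, 30.0, 60.0 and 180.0 come from the diff, and the 120.0 and 240.0 bands are added purely as an example of "plopping more bands".

```rust
use std::time::Duration;

/// Illustrative bucket boundaries in seconds. 20, 30, 60 and 180 appear in
/// the diff; 120 and 240 are extra bands assumed for this example.
pub const ILLUSTRATIVE_BUCKETS: &[f64] = &[20.0, 30.0, 60.0, 120.0, 180.0, 240.0];

/// Returns true if the buckets are strictly increasing and at least one
/// boundary is greater than or equal to the given timeout.
pub fn buckets_cover(buckets: &[f64], timeout: Duration) -> bool {
    // Each neighbouring pair must be in strictly ascending order.
    let sorted = buckets.windows(2).all(|pair| pair[0] < pair[1]);
    // Some boundary must reach the timeout, otherwise slow preparations
    // all land in the implicit overflow bucket.
    let covered = buckets.iter().any(|b| *b >= timeout.as_secs_f64());
    sorted && covered
}
```

A check like this could live in a unit test next to the timeout constants, which would turn the `NOTE:` comment about fixing the buckets into something enforced rather than remembered.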
- **Prevalidation:** Right now this just tries to deserialize the binary with
parity-wasm. It is a part of *preparation*.
- **Compilation:** This is the process of compiling a PVF from wasm code to
machine code. It is a part of *preparation*.
Nit: This book already has a glossary. Do you think it would be better to move those there? Here, we can leave a note saying that this document is heavy on terminology and refer readers to the glossary.
Alternatively (or better, additionally) we could embed those into the text as explanations. I think as a bonus this would allow us to structure the explanation hierarchically, which is IMO better. Something like the following abstract:
- In order to make the PVF usable for candidate validation it has to be registered on-chain
- As part of the registration process, it has to go through pre-checking.
- Pre-checking is a game of attempting preparation and reporting the results back on-chain.
- We define preparation as a process that validates the consistency of the wasm binary (aka prevalidation) and compiles the wasm module into machine code (referred to as the artifact).
- Besides pre-checking, preparation can also be triggered by execution, since a compiled artifact is needed for execution.
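The definition of preparation in the outline above can be sketched as two stages chained together. Everything here is a hypothetical stand-in: the real prevalidation deserializes the binary with parity-wasm, whereas this example only checks the four-byte wasm magic header, and `compile` merely copies the input instead of producing machine code.

```rust
/// Hypothetical error type for the sketch.
#[derive(Debug, PartialEq)]
pub enum PrepareError {
    Prevalidation(String),
}

/// Hypothetical stand-in for the compiled artifact.
#[derive(Debug, PartialEq)]
pub struct Artifact(pub Vec<u8>);

/// Every wasm binary starts with the bytes `\0asm`.
const WASM_MAGIC: [u8; 4] = [0x00, 0x61, 0x73, 0x6d];

/// Stage 1: prevalidation. Simplified to a header check for this example.
pub fn prevalidate(code: &[u8]) -> Result<(), PrepareError> {
    if code.starts_with(&WASM_MAGIC) {
        Ok(())
    } else {
        Err(PrepareError::Prevalidation("missing wasm magic header".into()))
    }
}

/// Stage 2: compilation. Placeholder that copies the bytes unchanged.
pub fn compile(code: &[u8]) -> Artifact {
    Artifact(code.to_vec())
}

/// Preparation is prevalidation followed by compilation, regardless of
/// whether prechecking or execution triggered it.
pub fn prepare(code: &[u8]) -> Result<Artifact, PrepareError> {
    prevalidate(code)?;
    Ok(compile(code))
}
```

In this sketch the caller (prechecking or execution) would differ only in the deadline it applies around `prepare`, which matches the point that the worker itself has no parametrization.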
/// The time period after which the execute preparation worker is considered unresponsive and will
/// be killed.
// NOTE: If you change this make sure to fix the buckets of `pvf_preparation_time` metric.
pub const EXECUTE_COMPILATION_TIMEOUT: Duration = Duration::from_secs(180);
Nit: I wonder if this would be better named as `lazy` or `lenient`. After all we use it for the heads up signal which also requires a more permissive timeout.
@@ -418,6 +429,9 @@ async fn handle_to_host(
Ok(())
}

/// Handles PVF prechecking.
Nit: ... prechecking requests
@@ -485,9 +511,17 @@ async fn handle_execute_pvf(
}
} else {
// Artifact is unknown: register it and enqueue a job with the corresponding priority and
//
// PVF.
lol, finally this is fixed 🎉
Thanks for the reviews! 👍 Just stuck on a couple of CI checks:
|
@eskimor Sounds good, I'll address it in the followup PR.
All good. I was expecting more comments than that for my first PR! I agree about addressing the nits in a followup PR. |
jFYI: |
Agreed with @pepyakin's comments; other than that, well done.
Note: once a PVF is queued for preparation with some timeout, any subsequent request would discard its `compilation_timeout` parameter and simply enqueue its `response_receiver` (the way it's implemented now).
This is OK because we can't receive an execute request for unprepared code until it's enacted once the prechecking process concludes, and code that was already prechecked should pass this process with a greater timeout.
However, this is an external guarantee, so I wanted to make sure you keep it in mind.
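A minimal sketch of the behaviour described in this note, under the assumption that pending jobs are tracked per artifact: the first request fixes the timeout, and later requests for the same artifact only add a waiter. The `PrepareQueue`, `PendingJob`, `ArtifactId` and `Waiter` names are hypothetical and do not mirror the actual host implementation.

```rust
use std::collections::HashMap;
use std::time::Duration;

/// Hypothetical stand-in for the artifact identifier.
pub type ArtifactId = u64;
/// Hypothetical stand-in for a `response_receiver`.
pub type Waiter = &'static str;

pub struct PendingJob {
    pub timeout: Duration,
    pub waiters: Vec<Waiter>,
}

#[derive(Default)]
pub struct PrepareQueue {
    pending: HashMap<ArtifactId, PendingJob>,
}

impl PrepareQueue {
    /// Returns true if a new job was started, false if the request joined an
    /// existing job. In the latter case `timeout` is silently discarded.
    pub fn enqueue(&mut self, id: ArtifactId, timeout: Duration, waiter: Waiter) -> bool {
        match self.pending.get_mut(&id) {
            Some(job) => {
                // Already queued: keep the original timeout, add the waiter.
                job.waiters.push(waiter);
                false
            }
            None => {
                self.pending.insert(id, PendingJob { timeout, waiters: vec![waiter] });
                true
            }
        }
    }

    /// Timeout of the pending job for `id`, if any.
    pub fn timeout_of(&self, id: ArtifactId) -> Option<Duration> {
        self.pending.get(&id).map(|job| job.timeout)
    }

    /// Number of requests waiting on the pending job for `id`.
    pub fn waiter_count(&self, id: ArtifactId) -> usize {
        self.pending.get(&id).map_or(0, |job| job.waiters.len())
    }
}
```

If a precheck request with a 60-second timeout arrives first and an execute request with 180 seconds follows, the job in this sketch keeps the 60-second deadline, which is exactly the case the external guarantee is meant to rule out.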
UPD: nvm didn't make it in time 😪
PULL REQUEST
Overview
Per the linked issue, we make the required changes so that preparation for
execution is more lenient (by a factor of 3) than preparation for prechecking.
We add a `compilation_timeout` parameter for PVF preparation job and also split the `COMPILATION_TIMEOUT` constant into two new consts.
Todo
What kind of tests should be added?
Issues Closed
Closes #4132