Candidate Validation/PVF: more fidelity of error metrics #6479

mrcnski · 2022-12-23T19:30:11Z

PULL REQUEST

Overview

Adds some more metrics buckets for the results of candidate validation (i.e. PVF preparation/execution). This was just supposed to be a fun one for me.

I had to add some more error variants along the way. The error being recorded in the metrics did not have enough fidelity: it had just a single variant for the result of PVF prep/exec, which contained the original error, but stringified.

I also split up PrepareError. We were handling it differently based on whether it is deterministic. Deterministic variants were already not allowed in certain places, and same for non-deterministic variants, so we might as well enforce this on the type level. This revealed an inconsistency with the way that timeout errors are treated (see TODO).
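A minimal sketch of what enforcing this on the type level can look like. The outer `Deterministic`/`NonDeterministic` split mirrors the diff in this PR; the inner variants (`InvalidCode`, `TimedOut`, `WorkerDied`) and the functions `should_retry` and `retry_decision` are hypothetical names used only for illustration.

```rust
/// Errors that should trigger reliably. The variant is illustrative only.
#[derive(Debug, Clone, PartialEq)]
pub enum DeterministicError {
    InvalidCode(String),
}

/// Errors that may happen spuriously. The variants are illustrative only.
#[derive(Debug, Clone, PartialEq)]
pub enum NonDeterministicError {
    TimedOut,
    WorkerDied,
}

/// The split error type: each side carries its own enum.
#[derive(Debug, Clone, PartialEq)]
pub enum PrepareError {
    Deterministic(DeterministicError),
    NonDeterministic(NonDeterministicError),
}

/// Hypothetical helper that only makes sense for spurious errors.
/// Passing a deterministic error here is a compile error, not a runtime check.
pub fn should_retry(err: &NonDeterministicError) -> bool {
    match err {
        NonDeterministicError::TimedOut | NonDeterministicError::WorkerDied => true,
    }
}

/// Callers holding a full `PrepareError` must first decide which side they are on.
pub fn retry_decision(err: &PrepareError) -> bool {
    match err {
        PrepareError::Deterministic(_) => false,
        PrepareError::NonDeterministic(e) => should_retry(e),
    }
}
```

With this shape, a place that previously received a `PrepareError` and assumed it was non-deterministic can take a `NonDeterministicError` instead, so the assumption is checked by the compiler.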

Related Issues

Closes #3755

mrcnski · 2022-12-23T19:31:10Z

node/core/candidate-validation/Cargo.toml

+# TODO: Do we need this?
 [target.'cfg(not(any(target_os = "android", target_os = "unknown")))'.dependencies]


Why is this here? The polkadot-node-core-pvf dependency isn't gated like this anywhere else.

Can probably be removed.

mrcnski · 2022-12-23T19:37:24Z

node/primitives/src/lib.rs

+	// TODO: Currently unused. Preparation timeouts are treated as non-deterministic, so this can
+	// never be instantiated, whereas execution timeouts are deterministic. Should this
+	// inconsistency be addressed?


I noticed that preparation timeouts are treated as non-deterministic (potentially spurious, reported as ValidationFailed) whereas execution timeouts are considered deterministic (reported as InvalidCandidate). This might make sense if pre-checking is supposed to rule out bad PVFs. The preparation timeout during pre-checking is lower, so if we're timing out while preparing for execution, it's probably not an issue with the PVF.
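To make the asymmetry concrete, here is a simplified sketch. `Stage`, `Outcome`, and `classify_timeout` are hypothetical names; only the mapping itself (preparation timeout reported as `ValidationFailed`, execution timeout reported as `InvalidCandidate`) is taken from the comment above.

```rust
/// Which phase of candidate validation timed out (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Stage {
    Prepare,
    Execute,
}

/// Simplified stand-in for how the timeout is reported upwards.
#[derive(Debug, PartialEq)]
pub enum Outcome {
    /// Potentially spurious; the candidate is not blamed.
    ValidationFailed(String),
    /// Treated as deterministic; the candidate is reported as invalid.
    InvalidCandidate(String),
}

/// Mirrors the behaviour described in the comment, not the actual code.
pub fn classify_timeout(stage: Stage) -> Outcome {
    match stage {
        Stage::Prepare => Outcome::ValidationFailed("preparation timeout".to_string()),
        Stage::Execute => Outcome::InvalidCandidate("execution timeout".to_string()),
    }
}
```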

mrcnski · 2022-12-25T01:34:59Z

node/core/pvf/src/error.rs

+	/// This error is raised due to inability to serve the request during execution.
+	InternalExecuteError(String),
+	/// This error is raised due to inability to serve the request for some other reason.
+	InternalOtherError(String),


This ranks among the worst names I've ever come up with. 😬

bkchr

I mean I see what you want to do. However I'm not sure that this pr improves the current situation 🙈

bkchr · 2022-12-25T08:16:31Z

node/core/candidate-validation/Cargo.toml

+# TODO: Do we need this?
 [target.'cfg(not(any(target_os = "android", target_os = "unknown")))'.dependencies]


Can probably be removed.

bkchr · 2022-12-25T08:18:09Z

node/core/pvf/src/error.rs

-	/// Non-deterministic errors can happen spuriously. Typically, they occur due to resource
-	/// starvation, e.g. under heavy load or memory pressure. Those errors are typically transient
-	/// but may persist e.g. if the node is run by overwhelmingly underpowered machine.
-	pub fn is_deterministic(&self) -> bool {


Why did you remove this?

I figured that now we can just match on the Deterministic variant directly, so this function seemed extraneous.

Yeah for sure, but calling this method is less code 🤣 But yeah, no strong argument here.
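For reference, the two approaches are not mutually exclusive. A sketch, assuming the split enum from this PR with the payloads simplified to `String` for brevity, of how the helper could be kept as a one-liner on top of the new variants:

```rust
/// Simplified stand-in for the split `PrepareError`; the `String` payloads
/// are an assumption made to keep the example self-contained.
#[derive(Debug)]
pub enum PrepareError {
    Deterministic(String),
    NonDeterministic(String),
}

impl PrepareError {
    /// Convenience wrapper so call sites that only need a yes/no answer
    /// do not have to write out a full `match`.
    pub fn is_deterministic(&self) -> bool {
        matches!(self, PrepareError::Deterministic(_))
    }
}
```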

bkchr · 2022-12-25T08:22:46Z

node/core/pvf/src/error.rs

-	/// This error is raised due to inability to serve the request.
-	InternalError(String),
+	/// This error is raised due to inability to serve the request during preparation.
+	InternalPrepareError(NonDeterministicError),


Why not forward the `PrepareError` directly?

The variants also don't need the postfix "Error".

And yeah, looking into the code that handles this error, we should name this variant Prepare and forward the PrepareError directly.

These internal variants are only for non-deterministic errors, hence the Internal prefix and why I use NonDeterministicError here.

Ahh, I hadn't read the `From` implementation carefully enough! Sorry!
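A rough sketch of the kind of `From` implementation being discussed, where only the non-deterministic half of a `PrepareError` ends up in the internal variant. The inner variants (`Prevalidation`, `TimedOut`) and the shape of `InvalidCandidate` are assumptions for illustration, not the code in the PR.

```rust
#[derive(Debug, Clone, PartialEq)]
pub enum DeterministicError {
    Prevalidation(String), // illustrative variant
}

#[derive(Debug, Clone, PartialEq)]
pub enum NonDeterministicError {
    TimedOut, // illustrative variant
}

#[derive(Debug, Clone, PartialEq)]
pub enum PrepareError {
    Deterministic(DeterministicError),
    NonDeterministic(NonDeterministicError),
}

/// Assumed shape: the candidate is blamed, with the cause stringified.
#[derive(Debug, PartialEq)]
pub enum InvalidCandidate {
    PrepareError(String),
}

#[derive(Debug, PartialEq)]
pub enum ValidationError {
    /// The candidate itself is at fault.
    InvalidCandidate(InvalidCandidate),
    /// The request could not be served during preparation.
    InternalPrepareError(NonDeterministicError),
}

impl From<PrepareError> for ValidationError {
    fn from(err: PrepareError) -> Self {
        match err {
            // Deterministic errors blame the candidate.
            PrepareError::Deterministic(e) => ValidationError::InvalidCandidate(
                InvalidCandidate::PrepareError(format!("{:?}", e)),
            ),
            // Spurious errors are forwarded with their type intact.
            PrepareError::NonDeterministic(e) => ValidationError::InternalPrepareError(e),
        }
    }
}
```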

bkchr · 2022-12-25T08:27:12Z

node/core/candidate-validation/src/lib.rs

-			Ok(ValidationResult::Invalid(InvalidCandidate::ExecutionError(e))),
+		// Internal errors.
+		Err(ValidationError::InternalPrepareError(e)) =>
+			Err(ValidationFailed::Prepare(e.to_string())),


And why convert all errors to strings?

bkchr · 2022-12-25T08:36:03Z

node/core/pvf/src/error.rs

+/// starvation, e.g. under heavy load or memory pressure. Those errors are typically transient but
+/// may persist e.g. if the node is run by overwhelmingly underpowered machine.
+#[derive(Debug, Clone, Encode, Decode)]
+pub enum NonDeterministicError {


I get the naming of deterministic and non-deterministic. However, I think this could be improved. I don't have better terminology at hand right now, but I think it could be better.

Yeah, it's awkward. Maybe InternalPrepareError and InvalidPrepareError? IDK.

mrcnski · 2022-12-25T12:52:01Z

> I mean I see what you want to do. However I'm not sure that this pr improves the current situation 🙈

Why do you say that? I think we have a bit more fidelity and soundness of errors now.

bkchr · 2022-12-25T13:01:30Z

> > I mean I see what you want to do. However I'm not sure that this pr improves the current situation 🙈
>
> Why do you say that? I think we have a bit more fidelity and soundness of errors now.

Didn't want to sound like it's bad or something! Maybe you could change ValidationError to only have two variants? InvalidCandidate and InternalFailure? Or something like that?

My "general complaint" is the huge number of different errors that you all handle differently in the upper layers. Probably the only thing that matters is whether the candidate is invalid or whether we think some machine error led the validation to fail?

mrcnski · 2022-12-25T14:37:10Z

node/core/pvf/src/executor_intf.rs

-const DEFAULT_HEAP_PAGES_ESTIMATE: u64 = 32;
-const EXTRA_HEAP_PAGES: u64 = 2048;
+const DEFAULT_HEAP_PAGES_ESTIMATE: u32 = 32;
+const EXTRA_HEAP_PAGES: u32 = 2048;


This random change was due to a clippy::pedantic warning about casting a u64 to a usize (down below).
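A small sketch of the casting concern, presumably the kind of truncation warning that `clippy::pedantic` raises for `u64` to `usize` casts. The constants are taken from the diff above; `total_heap_pages` is a hypothetical use site, since the real one is not shown in this thread, and `usize::try_from` is used here only as one explicit way to make the conversion checked.

```rust
const DEFAULT_HEAP_PAGES_ESTIMATE: u32 = 32;
const EXTRA_HEAP_PAGES: u32 = 2048;

/// Hypothetical use site: converts the page count to `usize` with a checked
/// conversion instead of an `as` cast, so any truncation would be explicit.
pub fn total_heap_pages() -> usize {
    let pages: u32 = DEFAULT_HEAP_PAGES_ESTIMATE + EXTRA_HEAP_PAGES;
    usize::try_from(pages).expect("heap page count fits in usize")
}
```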

mrcnski · 2022-12-25T14:54:28Z

> Didn't want to sound like it's bad or something! Maybe you could change ValidationError to only have two variants? InvalidCandidate and InternalFailure? Or something like that?

It's all good! I'm open to feedback, I just don't see how this can be improved yet.

ValidationError had only two variants to start with, but I split it up so that I could forward a more precise error into the metrics buckets:

Err(ValidationError::InternalPrepare(e)) => Err(ValidationFailed::Prepare(e)),
Err(ValidationError::InternalExecute(e)) => Err(ValidationFailed::Execute(e)),
Err(ValidationError::InternalOther(e)) => Err(ValidationFailed::Other(e)),

Err(ValidationFailed::Prepare(_)) => &["internal failure (preparation)"],
Err(ValidationFailed::Execute(_)) => &["internal failure (execution)"],
Err(ValidationFailed::Other(_)) => &["internal failure (misc)"],
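Put together as a self-contained sketch: the three label strings are the ones quoted above, while the `String` payloads, the `"valid"` label, and the function name `failure_label` are assumptions made for illustration.

```rust
/// Simplified stand-in for the proposed `ValidationFailed`; payloads are
/// reduced to `String` to keep the example self-contained.
#[derive(Debug)]
pub enum ValidationFailed {
    Prepare(String),
    Execute(String),
    Other(String),
}

/// Maps a validation result to the metrics bucket it would be counted in.
pub fn failure_label(result: &Result<(), ValidationFailed>) -> &'static str {
    match result {
        Ok(()) => "valid", // assumed label for the success case
        Err(ValidationFailed::Prepare(_)) => "internal failure (preparation)",
        Err(ValidationFailed::Execute(_)) => "internal failure (execution)",
        Err(ValidationFailed::Other(_)) => "internal failure (misc)",
    }
}
```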

> My "general complaint" is the huge number of different errors that you all handle differently in the upper layers. Probably the only thing that matters is whether the candidate is invalid or whether we think some machine error led the validation to fail?

Yeah, the error handling here does feel pretty messy, even before my changes. I wanted to get the better metrics fidelity without introducing even more error types. But I feel like we should have internal vs. invalid in the metrics, as well as prepare vs. execute, at a minimum.

bkchr · 2022-12-25T16:16:13Z

Do you really need that much detail on the metrics level? Isn't valid/invalid/internal error enough? This should be enough to know when something is going on, and for more detail you then need to look into the logs anyway? Or do you see any advantage in having that much detail on the error variants in the metrics?

mrcnski · 2022-12-25T17:45:32Z

> Do you really need that much detail on the metrics level? Isn't valid/invalid/internal error enough? This should be enough to know when something is going on, and for more detail you then need to look into the logs anyway? Or do you see any advantage in having that much detail on the error variants in the metrics?

Yeah, good questions! I don't really know. I just assumed that having more buckets could be useful. Maybe someone else can weigh in on that.

bkchr · 2022-12-25T18:59:23Z

I can see that people, for example, would add some kind of alert when internal validation errors go over a certain threshold, because that may mean that your node is not working properly.

paritytech-cicd-pr · 2022-12-26T10:59:17Z

The CI pipeline was cancelled due to the failure of one of the required jobs.
Job name: test-linux-stable
Logs: https://gitlab.parity.io/parity/mirrors/polkadot/-/jobs/2197558

slumber

I do understand it was intended to improve observability, but overall the changes don't look like an improvement to me.

slumber · 2023-01-06T12:17:00Z

node/primitives/src/lib.rs

+	/// The worker has died during validation of a candidate. See
+	/// [`InvalidCandidate::AmbiguousWorkerDeath`].
+	AmbiguousWorkerDeath,
 }


Please refer to #3655 (comment)
(also related to the rest of added variants)

Hmm, interesting. Before I jump back into this though, do you think this PR is worth pursuing further? Are the metrics buckets useful to us?

[TBH I thought this would be an easy change, and wanted to do minimal refactoring just to get more metrics buckets (I guess I got a bit carried away with PrepareError). My point being that I don't really want to do more reworking of the errors. 😛]

I think these are too many metrics. What should an operator do with all these metrics? They want to know if the node is running correctly and if not, they should be notified. So, Valid, Invalid, Invalid_Your_Validator_Has_Issues should be enough IMO, while the last one would be the thing that operators use to add some kind of alert. When this metric starts rising quickly, something is probably wrong with their node. However, for a detailed analysis logs will be required.

I agree with Basti

Fair enough! I can close this PR, should I close the related issue as well?

+1, metrics should provide a bird's-eye view for edge cases or things not working correctly in general. Sometimes having more details (more metrics) makes sense, but I've only seen that be useful for performance tracking.
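The coarse three-bucket scheme suggested in this thread could be sketched roughly as follows. Everything here is hypothetical (the `Bucket` and `ValidationMetrics` types, the in-memory counters, and the ratio-based `should_alert` check); a real node would export counters to its metrics backend and leave the thresholding to the operator's alerting rules.

```rust
use std::collections::HashMap;

/// The three buckets an operator would care about.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum Bucket {
    Valid,
    Invalid,
    /// "Your validator has issues": internal, potentially spurious failures.
    InternalFailure,
}

/// Minimal in-memory counter set standing in for a real metrics registry.
#[derive(Default)]
pub struct ValidationMetrics {
    counts: HashMap<Bucket, u64>,
}

impl ValidationMetrics {
    /// Increments the counter for one validation outcome.
    pub fn record(&mut self, bucket: Bucket) {
        *self.counts.entry(bucket).or_insert(0) += 1;
    }

    /// Returns how many outcomes were recorded for the bucket.
    pub fn count(&self, bucket: Bucket) -> u64 {
        self.counts.get(&bucket).copied().unwrap_or(0)
    }

    /// True when internal failures exceed the given share of all validations.
    pub fn should_alert(&self, max_internal_ratio: f64) -> bool {
        let total: u64 = self.counts.values().sum();
        if total == 0 {
            return false;
        }
        (self.count(Bucket::InternalFailure) as f64) / (total as f64) > max_internal_ratio
    }
}
```

Only the `InternalFailure` bucket feeds the alert; `Valid` and `Invalid` are normal outcomes of validation, and the detailed cause of an internal failure would still come from the logs.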

slumber · 2023-01-06T12:18:43Z

node/subsystem-types/src/messages.rs

+pub enum ValidationFailed {
+	/// Validation failed due to an internal prepare error.
+	Prepare(NonDeterministicError),
+	/// Validation failed due to an internal execute error.
+	Execute(String),
+	/// Validation failed due to some other internal error.
+	Other(String),
+}


I agree with Basti in a sense that there's a problem with ValidationFailed as it's just a string-wrapper (see #3655 (comment)), but this PR doesn't solve this issue.

slumber · 2023-01-06T12:20:20Z

node/core/pvf/src/error.rs

+	/// An error that should trigger reliably. See [`DeterministicError`].
+	Deterministic(DeterministicError),
+	/// An error that may happen spuriously. See [`NonDeterministicError`].
+	NonDeterministic(NonDeterministicError),
+}


IMO prepare error being deterministic or not is a property that's not necessarily needed everywhere this enum is used.

pub fn is_deterministic(&self) -> bool {

looked much better to me.

These variants do end up being useful actually: there are several places where only "deterministic" (or not) errors are allowed, and having PrepareError in such places didn't quite seem correct.

If we pursue this further, I would probably rename to PrepareFailedError and PrepareInvalidError or something (not totally sure, haven't looked at this in a while).

slumber · 2023-01-06T12:22:10Z

node/core/pvf/src/execute/queue.rs

@@ -149,7 +149,7 @@ impl Queue {
 						break;
 					}
 				}
-				ev = self.mux.select_next_some() => handle_mux(&mut self, ev).await,
+				ev = self.mux.select_next_some() => handle_mux(&mut self, ev),


Please avoid adding too many unrelated changes, as it worsens the review experience as well as the commit history after squash.

mrcnski · 2023-01-09T11:30:38Z

Closing as we do not want the new metrics buckets; see #6479 (comment). Also, the error refactor was half-baked, and deemed contentious and not useful.

mrcnski added 4 commits December 22, 2022 14:59

Add a bit more fidelity to internal validation errors

079364c

Add some more validation metrics; clarify errors a bit

b70c53a

Fix some clippy::pedantic lints

9578ee9

mrcnski requested review from pepyakin, slumber, s0me0ne-unkn0wn and a user December 23, 2022 19:30

github-actions bot added the A0-pleasereview label Dec 23, 2022

mrcnski commented Dec 23, 2022

Fix compile error

fe79750

mrcnski added the B0-silent (changes should not be mentioned in any release notes), C1-low (PR touches the given topic and has a low impact on builders), and D3-trivial 🧸 (PR contains trivial changes in a runtime directory that do not require an audit) labels Dec 23, 2022

mrcnski commented Dec 25, 2022

bkchr reviewed Dec 25, 2022

Address review comments

1593e16

mrcnski commented Dec 25, 2022

Fix compile errors

77360f8

Fix compile errors

8ff6d1b

slumber reviewed Jan 6, 2023

mrcnski closed this Jan 9, 2023

mrcnski mentioned this pull request Jan 9, 2023

Candidate Validation/PVF validation host: more fidelity of error metrics #3755

Closed
