Skip to content
This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

Candidate Validation/PVF: more fidelity of error metrics #6479

Closed
wants to merge 8 commits into from

Conversation

mrcnski
Copy link
Contributor

@mrcnski mrcnski commented Dec 23, 2022

PULL REQUEST

Overview

Adds some more metrics buckets for the results of candidate validation (i.e. PVF preparation/execution). This was just supposed to be a fun one for me.

I had to add some more error variants along the way. The error being metric'd did not have enough fidelity (having just a single variant for the result of PVF prep/exec, that contained the original error but stringified).

I also split up PrepareError. We were handling it differently based on whether it is deterministic. Deterministic variants were already not allowed in certain places, and same for non-deterministic variants, so we might as well enforce this on the type level. This revealed an inconsistency with the way that timeout errors are treated (see TODO).

Related Issues

Closes #3755

We were handling `PrepareError` differently based on whether it is
deterministic. Deterministic variants were already not allowed in certain
places, and same for non-deterministic variants, so we might as well enforce
this on the type level.

This refactor also revealed an inconsistency with the way that timeout errors
are treated (see TODO in commit).
Comment on lines 22 to 23
# TODO: Do we need this?
[target.'cfg(not(any(target_os = "android", target_os = "unknown")))'.dependencies]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this here? The polkadot-node-core-pvf dependency isn't gated like this anywhere else.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can probably be removed.

Comment on lines +240 to +242
// TODO: Currently unused. Preparation timeouts are treated as non-deterministic, so this can
// never be instantiated, whereas execution timeouts are deterministic. Should this
// inconsistency be addressed?
Copy link
Contributor Author

@mrcnski mrcnski Dec 23, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed that preparation timeouts are treated as non-deterministic (potentially spurious, reported as ValidationFailed) whereas execution timeouts are considered deterministic (reported as InvalidCandidate). This might make sense if pre-checking is supposed to rule out bad PVFs. The preparation timeout during pre-checking is lower, so if we're timing out while preparing for execution, it's probably not an issue with the PVF.

@mrcnski mrcnski added B0-silent Changes should not be mentioned in any release notes C1-low PR touches the given topic and has a low impact on builders. D3-trivial 🧸 PR contains trivial changes in a runtime directory that do not require an audit. labels Dec 23, 2022
/// This error is raised due to inability to serve the request during execution.
InternalExecuteError(String),
/// This error is raised due to inability to serve the request for some other reason.
InternalOtherError(String),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This ranks among the worst names I've ever come up with. 😬

Copy link
Member

@bkchr bkchr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean I see what you want to do. However I'm not sure that this pr improves the current situation 🙈

Comment on lines 22 to 23
# TODO: Do we need this?
[target.'cfg(not(any(target_os = "android", target_os = "unknown")))'.dependencies]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can probably be removed.

/// Non-deterministic errors can happen spuriously. Typically, they occur due to resource
/// starvation, e.g. under heavy load or memory pressure. Those errors are typically transient
/// but may persist e.g. if the node is run by overwhelmingly underpowered machine.
pub fn is_deterministic(&self) -> bool {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you remove this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I figured that now we can just match on the Deterministic variant directly, so this function seemed extraneous.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah for sure, but calling this method is less code 🤣 But yeah, no strong argument here.

/// This error is raised due to inability to serve the request.
InternalError(String),
/// This error is raised due to inability to serve the request during preparation.
InternalPrepareError(NonDeterministicError),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not forward the prepareerror directly?

The variants also don't need the postfix "Error".

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And yeah, looking into the code that handles this error, we should name this variant Prepare and forward the PrepareError directly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These internal variants are only for non-deterministic errors, hence the Internal prefix and why I use NonDeterministicError here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh, I had read the From implementation not good enough! Sorry!

Ok(ValidationResult::Invalid(InvalidCandidate::ExecutionError(e))),
// Internal errors.
Err(ValidationError::InternalPrepareError(e)) =>
Err(ValidationFailed::Prepare(e.to_string())),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And why convert all errors to strings?

/// starvation, e.g. under heavy load or memory pressure. Those errors are typically transient but
/// may persist e.g. if the node is run by overwhelmingly underpowered machine.
#[derive(Debug, Clone, Encode, Decode)]
pub enum NonDeterministicError {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get the naming of deterministic and non deterministic. However, I think this could be improved. I don't have a better terminology at hand right now, but I think it could be better.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it's awkward. Maybe InternalPrepareError and InvalidPrepareError? IDK.

@mrcnski
Copy link
Contributor Author

mrcnski commented Dec 25, 2022

I mean I see what you want to do. However I'm not sure that this pr improves the current situation 🙈

Why do you say that? I think we have a bit more fidelity and soundness of errors now.

@bkchr
Copy link
Member

bkchr commented Dec 25, 2022

I mean I see what you want to do. However I'm not sure that this pr improves the current situation 🙈

Why do you say that? I think we have a bit more fidelity and soundness of errors now.

Don't wanted to sound like it's bad or something! Maybe you could change ValidationError to only have two variants? InvalidCandidate and InternalFailure? Or Something like that?

My "general complain" is the huge number of different errors that you all handle differently in the upper layers. While it is probably only important if the candidate is invalid or we think that some machine error lead to the validation to fail?

Comment on lines -43 to +44
const DEFAULT_HEAP_PAGES_ESTIMATE: u64 = 32;
const EXTRA_HEAP_PAGES: u64 = 2048;
const DEFAULT_HEAP_PAGES_ESTIMATE: u32 = 32;
const EXTRA_HEAP_PAGES: u32 = 2048;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This random change was due to clippy::pedantic warning about casting a u64 to a usize (down below).

@mrcnski
Copy link
Contributor Author

mrcnski commented Dec 25, 2022

Don't wanted to sound like it's bad or something! Maybe you could change ValidationError to only have two variants? InvalidCandidate and InternalFailure? Or Something like that?

It's all good! I'm open to feedback, I just don't see how this can be improved yet.

ValidationError had only two variants to start with, but I split it up so that I could forward a more precise error into the metrics buckets:

Err(ValidationError::InternalPrepare(e)) => Err(ValidationFailed::Prepare(e)),
Err(ValidationError::InternalExecute(e)) => Err(ValidationFailed::Execute(e)),
Err(ValidationError::InternalOther(e)) => Err(ValidationFailed::Other(e)),
Err(ValidationFailed::Prepare(_)) => &["internal failure (preparation)"],
Err(ValidationFailed::Execute(_)) => &["internal failure (execution)"],
Err(ValidationFailed::Other(_)) => &["internal failure (misc)"],

My "general complain" is the huge number of different errors that you all handle differently in the upper layers. While it is probably only important if the candidate is invalid or we think that some machine error lead to the validation to fail?

Yeah, the error handling here does feel pretty messy, even before my changes. I wanted to get the better metrics fidelity without introducing even more error types. But I feel like we should have internal vs. invalid in the metrics, as well as prepare vs. execute, at a minimum.

@bkchr
Copy link
Member

bkchr commented Dec 25, 2022

Do you really need that much detail on the metrics level? Isn't there valid/invalid/internal error enough? This should be enough to know when something is going on and for more detail you then need to look into the logs any way? Or do you see any advantage in having such a detail on the error variants in the metrics?

@mrcnski
Copy link
Contributor Author

mrcnski commented Dec 25, 2022

Do you really need that much detail on the metrics level? Isn't there valid/invalid/internal error enough? This should be enough to know when something is going on and for more detail you then need to look into the logs any way? Or do you see any advantage in having such a detail on the error variants in the metrics?

Yeah, good questions! I don't really know. I just assumed that having more buckets could be useful. Maybe someone else can weigh in on that.

@bkchr
Copy link
Member

bkchr commented Dec 25, 2022

I can see that people for example would add some kind of alert when internal validation errors are going over a certain threshold, because that may means that your node is not working properly.

@paritytech-cicd-pr
Copy link

The CI pipeline was cancelled due to failure one of the required jobs.
Job name: test-linux-stable
Logs: https://gitlab.parity.io/parity/mirrors/polkadot/-/jobs/2197558

Copy link
Contributor

@slumber slumber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do understand it was intended to improve observability, but overall changes don't look as an improvement to me.

Comment on lines +269 to 272
/// The worker has died during validation of a candidate. See
/// [`InvalidCandidate::AmbiguousWorkerDeath`].
AmbiguousWorkerDeath,
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please refer to #3655 (comment)
(also related to the rest of added variants)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, interesting. Before I jump back into this though, do you think this PR is worth pursuing further? Are the metrics buckets useful to us?

[TBH I thought this would be an easy change, and wanted to do minimal refactoring just to get more metrics buckets (I guess I got a bit carried away with PrepareError). My point being that I don't really want to do more reworking of the errors. 😛]

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these are too many metrics. What should an operator do with all these metrics? They want to know if the node is running correctly and if not, they should be notified. So, Valid, Invalid, Invalid_Your_Validator_Has_Issues should be enough IMO, while the latest one would be the thing that operators use to add some kind of alert. When this metric starts rising quickly, something is probably wrong with their node. However, for a detailed analysis logs will be required.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with Basti

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough! I can close this PR, should I close the related issue as well?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 metrics should provide a birds eye view for edge cases or things not working correctly in general. Sometimes having more details (more metrics) makes sense but I've only seen that useful for performance tracking.

Comment on lines +95 to +102
pub enum ValidationFailed {
/// Validation failed due to an internal prepare error.
Prepare(NonDeterministicError),
/// Validation failed due to an internal execute error.
Execute(String),
/// Validation failed due to some other internal error.
Other(String),
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with Basti in a sense that there's a problem with ValidationFailed as it's just a string-wrapper (see #3655 (comment)), but this PR doesn't solve this issue.

Comment on lines +27 to +31
/// An error that should trigger reliably. See [`DeterministicError`].
Deterministic(DeterministicError),
/// An error that may happen spuriously. See [`NonDeterministicError`].
NonDeterministic(NonDeterministicError),
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO prepare error being deterministic or not is a property that's not necessarily needed everywhere this enum is used.

pub fn is_deterministic(&self) -> bool {

looked much better to me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These variants do end up being useful actually, there's several places where only "deterministic" (or not) errors are allowed, and having PrepareError in such places didn't quite seem correct.

If we pursue this further, I would probably rename to PrepareFailedError and PrepareInvalidError or something (not totally sure, haven't looked at this in a while).

@@ -149,7 +149,7 @@ impl Queue {
break;
}
}
ev = self.mux.select_next_some() => handle_mux(&mut self, ev).await,
ev = self.mux.select_next_some() => handle_mux(&mut self, ev),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please avoid adding too many unrelated changes as it worsens review experiense as well as commit history after squash.

@mrcnski
Copy link
Contributor Author

mrcnski commented Jan 9, 2023

Closing as we do not want the new metrics buckets; see #6479 (comment). Also, the error refactor was half-baked, and deemed contentious and not useful.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
B0-silent Changes should not be mentioned in any release notes C1-low PR touches the given topic and has a low impact on builders. D3-trivial 🧸 PR contains trivial changes in a runtime directory that do not require an audit.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Candidate Validation/PVF validation host: more fidelity of error metrics
5 participants