-
Notifications
You must be signed in to change notification settings - Fork 1.6k
Candidate Validation/PVF: more fidelity of error metrics #6479
Conversation
We were handling `PrepareError` differently based on whether it is deterministic. Deterministic variants were already not allowed in certain places, and same for non-deterministic variants, so we might as well enforce this on the type level. This refactor also revealed an inconsistency with the way that timeout errors are treated (see TODO in commit).
# TODO: Do we need this? | ||
[target.'cfg(not(any(target_os = "android", target_os = "unknown")))'.dependencies] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this here? The polkadot-node-core-pvf
dependency isn't gated like this anywhere else.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can probably be removed.
// TODO: Currently unused. Preparation timeouts are treated as non-deterministic, so this can | ||
// never be instantiated, whereas execution timeouts are deterministic. Should this | ||
// inconsistency be addressed? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I noticed that preparation timeouts are treated as non-deterministic (potentially spurious, reported as ValidationFailed
) whereas execution timeouts are considered deterministic (reported as InvalidCandidate
). This might make sense if pre-checking is supposed to rule out bad PVFs. The preparation timeout during pre-checking is lower, so if we're timing out while preparing for execution, it's probably not an issue with the PVF.
node/core/pvf/src/error.rs
Outdated
/// This error is raised due to inability to serve the request during execution. | ||
InternalExecuteError(String), | ||
/// This error is raised due to inability to serve the request for some other reason. | ||
InternalOtherError(String), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This ranks among the worst names I've ever come up with. 😬
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean I see what you want to do. However I'm not sure that this pr improves the current situation 🙈
# TODO: Do we need this? | ||
[target.'cfg(not(any(target_os = "android", target_os = "unknown")))'.dependencies] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can probably be removed.
/// Non-deterministic errors can happen spuriously. Typically, they occur due to resource | ||
/// starvation, e.g. under heavy load or memory pressure. Those errors are typically transient | ||
/// but may persist e.g. if the node is run by overwhelmingly underpowered machine. | ||
pub fn is_deterministic(&self) -> bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why did you remove this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I figured that now we can just match on the Deterministic
variant directly, so this function seemed extraneous.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah for sure, but calling this method is less code 🤣 But yeah, no strong argument here.
node/core/pvf/src/error.rs
Outdated
/// This error is raised due to inability to serve the request. | ||
InternalError(String), | ||
/// This error is raised due to inability to serve the request during preparation. | ||
InternalPrepareError(NonDeterministicError), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not forward the prepareerror directly?
The variants also don't need the postfix "Error".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And yeah, looking into the code that handles this error, we should name this variant Prepare
and forward the PrepareError
directly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These internal variants are only for non-deterministic errors, hence the Internal
prefix and why I use NonDeterministicError
here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh, I had read the From implementation not good enough! Sorry!
Ok(ValidationResult::Invalid(InvalidCandidate::ExecutionError(e))), | ||
// Internal errors. | ||
Err(ValidationError::InternalPrepareError(e)) => | ||
Err(ValidationFailed::Prepare(e.to_string())), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And why convert all errors to strings?
/// starvation, e.g. under heavy load or memory pressure. Those errors are typically transient but | ||
/// may persist e.g. if the node is run by overwhelmingly underpowered machine. | ||
#[derive(Debug, Clone, Encode, Decode)] | ||
pub enum NonDeterministicError { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I get the naming of deterministic and non deterministic. However, I think this could be improved. I don't have a better terminology at hand right now, but I think it could be better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, it's awkward. Maybe InternalPrepareError
and InvalidPrepareError
? IDK.
Why do you say that? I think we have a bit more fidelity and soundness of errors now. |
Don't wanted to sound like it's bad or something! Maybe you could change ValidationError to only have two variants? InvalidCandidate and InternalFailure? Or Something like that? My "general complain" is the huge number of different errors that you all handle differently in the upper layers. While it is probably only important if the candidate is invalid or we think that some machine error lead to the validation to fail? |
const DEFAULT_HEAP_PAGES_ESTIMATE: u64 = 32; | ||
const EXTRA_HEAP_PAGES: u64 = 2048; | ||
const DEFAULT_HEAP_PAGES_ESTIMATE: u32 = 32; | ||
const EXTRA_HEAP_PAGES: u32 = 2048; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This random change was due to clippy::pedantic
warning about casting a u64
to a usize
(down below).
It's all good! I'm open to feedback, I just don't see how this can be improved yet.
Err(ValidationError::InternalPrepare(e)) => Err(ValidationFailed::Prepare(e)),
Err(ValidationError::InternalExecute(e)) => Err(ValidationFailed::Execute(e)),
Err(ValidationError::InternalOther(e)) => Err(ValidationFailed::Other(e)), Err(ValidationFailed::Prepare(_)) => &["internal failure (preparation)"],
Err(ValidationFailed::Execute(_)) => &["internal failure (execution)"],
Err(ValidationFailed::Other(_)) => &["internal failure (misc)"],
Yeah, the error handling here does feel pretty messy, even before my changes. I wanted to get the better metrics fidelity without introducing even more error types. But I feel like we should have internal vs. invalid in the metrics, as well as prepare vs. execute, at a minimum. |
Do you really need that much detail on the metrics level? Isn't there valid/invalid/internal error enough? This should be enough to know when something is going on and for more detail you then need to look into the logs any way? Or do you see any advantage in having such a detail on the error variants in the metrics? |
Yeah, good questions! I don't really know. I just assumed that having more buckets could be useful. Maybe someone else can weigh in on that. |
I can see that people for example would add some kind of alert when internal validation errors are going over a certain threshold, because that may means that your node is not working properly. |
The CI pipeline was cancelled due to failure one of the required jobs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do understand it was intended to improve observability, but overall changes don't look as an improvement to me.
/// The worker has died during validation of a candidate. See | ||
/// [`InvalidCandidate::AmbiguousWorkerDeath`]. | ||
AmbiguousWorkerDeath, | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please refer to #3655 (comment)
(also related to the rest of added variants)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, interesting. Before I jump back into this though, do you think this PR is worth pursuing further? Are the metrics buckets useful to us?
[TBH I thought this would be an easy change, and wanted to do minimal refactoring just to get more metrics buckets (I guess I got a bit carried away with PrepareError
). My point being that I don't really want to do more reworking of the errors. 😛]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think these are too many metrics. What should an operator do with all these metrics? They want to know if the node is running correctly and if not, they should be notified. So, Valid
, Invalid
, Invalid_Your_Validator_Has_Issues
should be enough IMO, while the latest one would be the thing that operators use to add some kind of alert. When this metric starts rising quickly, something is probably wrong with their node. However, for a detailed analysis logs will be required.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with Basti
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair enough! I can close this PR, should I close the related issue as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 metrics should provide a birds eye view for edge cases or things not working correctly in general. Sometimes having more details (more metrics) makes sense but I've only seen that useful for performance tracking.
pub enum ValidationFailed { | ||
/// Validation failed due to an internal prepare error. | ||
Prepare(NonDeterministicError), | ||
/// Validation failed due to an internal execute error. | ||
Execute(String), | ||
/// Validation failed due to some other internal error. | ||
Other(String), | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with Basti in a sense that there's a problem with ValidationFailed
as it's just a string-wrapper (see #3655 (comment)), but this PR doesn't solve this issue.
/// An error that should trigger reliably. See [`DeterministicError`]. | ||
Deterministic(DeterministicError), | ||
/// An error that may happen spuriously. See [`NonDeterministicError`]. | ||
NonDeterministic(NonDeterministicError), | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO prepare error being deterministic or not is a property that's not necessarily needed everywhere this enum is used.
pub fn is_deterministic(&self) -> bool {
looked much better to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These variants do end up being useful actually, there's several places where only "deterministic" (or not) errors are allowed, and having PrepareError
in such places didn't quite seem correct.
If we pursue this further, I would probably rename to PrepareFailedError
and PrepareInvalidError
or something (not totally sure, haven't looked at this in a while).
@@ -149,7 +149,7 @@ impl Queue { | |||
break; | |||
} | |||
} | |||
ev = self.mux.select_next_some() => handle_mux(&mut self, ev).await, | |||
ev = self.mux.select_next_some() => handle_mux(&mut self, ev), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please avoid adding too many unrelated changes as it worsens review experiense as well as commit history after squash.
Closing as we do not want the new metrics buckets; see #6479 (comment). Also, the error refactor was half-baked, and deemed contentious and not useful. |
PULL REQUEST
Overview
Adds some more metrics buckets for the results of candidate validation (i.e. PVF preparation/execution). This was just supposed to be a fun one for me.
I had to add some more error variants along the way. The error being metric'd did not have enough fidelity (having just a single variant for the result of PVF prep/exec, that contained the original error but stringified).
I also split up
PrepareError
. We were handling it differently based on whether it is deterministic. Deterministic variants were already not allowed in certain places, and same for non-deterministic variants, so we might as well enforce this on the type level. This revealed an inconsistency with the way that timeout errors are treated (see TODO).Related Issues
Closes #3755