Fix ruler query failure reporting #4335
… errors, and adds integration test for it. Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>
This allows us to distinguish between those errors and errors returned by PromQL engine, and react appropriately. Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>
Very good job! This stuff is tricky, but makes sense to me. I left a question.
// We only care about errors returned by underlying Queryable. Errors returned by PromQL engine are "user-errors",
// and not interesting here.
Is this true? For example, if it fails decoding a chunk while iterating a series set, then it's an underlying error, but the error didn't come directly from the `Queryable`. What's the downside if we remove the check on `QueryableError` altogether?
Discussed offline. Decoding a chunk is an edge case that is not covered by the current implementation. To fix that, we will wrap the returned Series and ChunkSeries (from SeriesSet and ChunkSeriesSet), and translate errors returned by their iterators as well.
It would be possible to detect errors that occur during iteration of chunks or samples and treat them the same way as other errors from Queryable. However, benchmarks show that the performance penalty for wrapping Series/ChunkSeries and their iterators is too big (~15% more CPU time), so we have decided not to cover this case.
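For readers unfamiliar with the approach that was considered and rejected, here is a minimal sketch of what wrapping an iterator to translate its errors could look like. The `sampleIterator` interface, `storageError` type, and all helper names are hypothetical simplifications for illustration; they are not the Prometheus storage interfaces or the code from this PR.

```go
package main

import (
	"errors"
	"fmt"
)

// sampleIterator is a simplified, hypothetical stand-in for a series iterator.
type sampleIterator interface {
	Next() bool
	Err() error
}

// storageError is a hypothetical marker for errors raised while reading data.
type storageError struct{ err error }

func (e storageError) Error() string { return "storage: " + e.err.Error() }
func (e storageError) Unwrap() error { return e.err }

// markStorageError wraps non-nil errors in storageError; nil stays nil.
func markStorageError(err error) error {
	if err == nil {
		return nil
	}
	return storageError{err: err}
}

// isStorageError reports whether err carries the storageError marker.
func isStorageError(err error) bool {
	var se storageError
	return errors.As(err, &se)
}

// errTranslatingIterator forwards Next to the wrapped iterator and
// translates whatever Err returns. Every call goes through this extra layer.
type errTranslatingIterator struct {
	sampleIterator
	translate func(error) error
}

func (it errTranslatingIterator) Err() error {
	return it.translate(it.sampleIterator.Err())
}

// failingIterator yields a fixed number of samples and then reports err.
type failingIterator struct {
	remaining int
	err       error
}

func (it *failingIterator) Next() bool {
	if it.remaining == 0 {
		return false
	}
	it.remaining--
	return true
}

func (it *failingIterator) Err() error {
	if it.remaining == 0 {
		return it.err
	}
	return nil
}

var errChunkDecode = errors.New("chunk decode failed")

// newWrappedFailingIterator builds a wrapped iterator that fails after n samples.
func newWrappedFailingIterator(n int) sampleIterator {
	return errTranslatingIterator{
		sampleIterator: &failingIterator{remaining: n, err: errChunkDecode},
		translate:      markStorageError,
	}
}

// drain consumes the iterator and returns the sample count and final error.
func drain(it sampleIterator) (int, error) {
	n := 0
	for it.Next() {
		n++
	}
	return n, it.Err()
}

func main() {
	n, err := drain(newWrappedFailingIterator(3))
	fmt.Println(n, err, isStorageError(err))
}
```

In this sketch each series would need its own wrapper, and each sample read passes through one more interface call, which is the kind of per-sample cost the benchmarks mentioned above would be measuring.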
☝️ Agree. Thanks for the investigation!
* This patch tries to fix problem with user-errors reported as internal errors, and adds integration test for it.
* Allow passing custom error-wrapping function to ErrorTranslateQueryable.
* Wrap errors returned by Queryable to custom wrapper. This allows us to distinguish between those errors and errors returned by PromQL engine, and react appropriately.
* Improve ruler test to check for more scenarios.
* CHANGELOG.md

Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>
Signed-off-by: Alvin Lin <alvinlin@amazon.com>
What this PR does: This PR fixes the problem described in issue #4333. It does so by wrapping errors returned by the supplied `Queryable` into a special wrapper, which is then recognized and unwrapped by `MetricsQueryFunc`. This allows the "failures" metric to be updated only if the error was actually caused by an internal Cortex error, and not by a PromQL engine failure during evaluation (this fixes #4333) or by hitting limits.
This PR also adds an integration test to check for these scenarios.
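The wrap-then-recognize idea described above can be sketched as follows. This is only an illustration under stated assumptions: `queryableError`, `wrapQueryableError`, and `shouldCountAsFailure` are hypothetical names, not the identifiers used in the PR, and the special handling of limit errors is not modeled here.

```go
package main

import (
	"errors"
	"fmt"
)

// queryableError is a hypothetical marker type for errors that
// originated in the Queryable (storage) layer.
type queryableError struct{ err error }

func (e queryableError) Error() string { return e.err.Error() }
func (e queryableError) Unwrap() error { return e.err }

// wrapQueryableError marks err as coming from the Queryable; nil stays nil.
func wrapQueryableError(err error) error {
	if err == nil {
		return nil
	}
	return queryableError{err: err}
}

// shouldCountAsFailure reports whether an evaluation error carries the
// marker, i.e. whether it should increment a "failures" counter.
// Errors without the marker are treated as user errors.
func shouldCountAsFailure(err error) bool {
	var qe queryableError
	return errors.As(err, &qe)
}

// Example errors, made up for the demonstration.
var (
	errStoreUnavailable = errors.New("store unavailable")
	errBadQuery         = errors.New("many-to-many matching not allowed")
)

func main() {
	// An error from storage, wrapped once more on its way up the stack.
	storageErr := fmt.Errorf("rule evaluation: %w", wrapQueryableError(errStoreUnavailable))

	fmt.Println(shouldCountAsFailure(storageErr))  // true
	fmt.Println(shouldCountAsFailure(errBadQuery)) // false
}
```

Using `errors.As` rather than a direct type assertion means the marker is still found when intermediate layers add their own context with `%w`.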
Which issue(s) this PR fixes:
Fixes #4333
Checklist
`CHANGELOG.md` updated - the order of entries should be `[CHANGE]`, `[FEATURE]`, `[ENHANCEMENT]`, `[BUGFIX]`