Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix ruler query failure reporting #4335

Merged
merged 5 commits into from
Jul 20, 2021

Conversation

pstibrany
Copy link
Contributor

@pstibrany pstibrany commented Jul 2, 2021

What this PR does: This PR fixes problem described in issue #4333. It does so by wrapping errors returned by supplied Queryable into special wrapper, that is then recognized and unwrapped by MetricsQueryFunc. This allows "failures" metric to be updated only if error was actually caused by internal Cortex error, and not by PromQL engine failure during evaluation (this fixes #4333), or by hitting limits.

This PR also adds integration test to check for these scenarios.

Which issue(s) this PR fixes:
Fixes #4333

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@pstibrany pstibrany changed the title Fix ruler query reporting Fix ruler query failure reporting Jul 2, 2021
@pstibrany pstibrany force-pushed the fix-ruler-query-reporting branch from d715bf5 to 21e455f Compare July 2, 2021 11:20
… errors, and adds integration test for it.

Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>
Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>
This allows us to distinguish between those errors and errors returned by PromQL engine, and react appropriately.

Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>
Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>
Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>
@pstibrany pstibrany force-pushed the fix-ruler-query-reporting branch from 21e455f to a736bd5 Compare July 20, 2021 09:37
Copy link
Contributor

@pracucci pracucci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very good job! This stuff is tricky, but makes sense to me. I left a question.

Comment on lines +152 to +154
// We only care about errors returned by underlying Queryable. Errors returned by PromQL engine are "user-errors",
// and not interesting here.
Copy link
Contributor

@pracucci pracucci Jul 20, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this true? For example, if it fails decoding a chunk while iterating seriesset, then it's an underlying error but the error didn't directly come from the Queryable.

What's the downside if we remove the check on QueryableError at all?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed offline. Decoding a chunk is an edge case that is not covered by current implementation. To fix that, we will wrap returned Series and ChunkSeries (from SeriesSet and ChunkSeriesSet), and translate errors returned by their iterators as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed offline. Decoding a chunk is an edge case that is not covered by current implementation. To fix that, we will wrap returned Series and ChunkSeries (from SeriesSet and ChunkSeriesSet), and translate errors returned by their iterators as well.

It would be possible to detect errors that occur during iteration of chunks or samples and treat them in the same way as other errors from Queryable, however benchmarks show that performance penalty for wrapping Series/ChunkSeries and their iterators is too big (~15% more cpu time). Therefore we have decided to not cover this case.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

☝️ Agree. Thanks for the investigation!

@pstibrany pstibrany merged commit e3153e5 into cortexproject:master Jul 20, 2021
@pstibrany pstibrany deleted the fix-ruler-query-reporting branch July 20, 2021 13:40
simonswine pushed a commit to grafana/e2e that referenced this pull request Jan 13, 2022
* This patch tries to fix problem with user-errors reported as internal errors, and adds integration test for it.

Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>

* Allow passing custom error-wrapping function to ErrorTranslateQueryable.

Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>

* Wrap errors returned by Queryable to custom wrapper.

This allows us to distinguish between those errors and errors returned by PromQL engine, and react appropriately.

Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>

* Improve ruler test to check for more scenarios.

Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>

* CHANGELOG.md

Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>
alvinlin123 pushed a commit to ac1214/cortex that referenced this pull request Jan 14, 2022
* This patch tries to fix problem with user-errors reported as internal errors, and adds integration test for it.

Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>

* Allow passing custom error-wrapping function to ErrorTranslateQueryable.

Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>

* Wrap errors returned by Queryable to custom wrapper.

This allows us to distinguish between those errors and errors returned by PromQL engine, and react appropriately.

Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>

* Improve ruler test to check for more scenarios.

Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>

* CHANGELOG.md

Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>
Signed-off-by: Alvin Lin <alvinlin@amazon.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Ruler doesn't track internal errors vs user errors correctly
2 participants