Cache non-transient error responses from the query-frontend #9028

Merged: 3 commits merged into main on Sep 25, 2024

Conversation

@56quarters (Contributor) commented on Aug 16, 2024:

What this PR does

Create a new query-frontend middleware that caches errors returned by queries when they are non-transient and would fail the same way if run again. This lets us avoid wasted work when a query hits, e.g., a limit error: running the query again will not help.
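To make the idea concrete, here is a minimal, self-contained sketch of what such a middleware does. The request, handler, and cache types below are simplified stand-ins invented for illustration; they are not the actual types or code added by this PR.

```go
package sketch

import (
	"context"
	"errors"
	"fmt"
)

// Simplified stand-ins for the query-frontend's request, response, and
// handler types (the real middleware is built on the querymiddleware package).
type request struct {
	tenant, query string
	start, end    int64
}

type response struct{ data string }

type handler interface {
	Do(ctx context.Context, req request) (*response, error)
}

// errNonTransient marks errors this sketch treats as cacheable; the real
// classification inspects the API error type returned by the query path.
var errNonTransient = errors.New("non-transient query error")

func isCacheable(err error) bool {
	return errors.Is(err, errNonTransient)
}

// errorCachingMiddleware short-circuits queries whose non-transient errors
// have already been seen, and records new non-transient errors.
type errorCachingMiddleware struct {
	next  handler
	cache map[string]error // toy in-memory cache; the real code uses the shared cache backend with a TTL
}

func (m *errorCachingMiddleware) Do(ctx context.Context, req request) (*response, error) {
	key := fmt.Sprintf("%s:%s:%d:%d", req.tenant, req.query, req.start, req.end)

	// If this exact query already failed with a non-transient error, return
	// the cached error instead of executing the query again.
	if err, ok := m.cache[key]; ok {
		return nil, err
	}

	res, err := m.next.Do(ctx, req)
	if err != nil && isCacheable(err) {
		// Only errors guaranteed to recur (bad queries, limit violations)
		// are stored; transient failures are never cached.
		m.cache[key] = err
	}
	return res, err
}
```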

Which issue(s) this PR fixes or relates to

Fixes #2676

Fixes #7340

Checklist

  • Tests updated.
  • Documentation added.
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX].
  • about-versioning.md updated with experimental features.

@56quarters (Contributor, Author) commented:

Some examples of errors that would be cached, pulled from the logs of an internal cluster over the last seven days. Each error falls into one of two categories: bad queries or limits. These are the kinds of errors this feature is intended to cache (a rough sketch of the distinction follows the examples).

Examples of errorType = "execution" that result in HTTP 422:

  • many-to-many matching not allowed: matching labels must be unique on one side. Bad query
  • multiple matches for labels: many-to-one matching must be explicit (group_left/group_right). Bad query
  • multiple matches for labels: grouping labels must ensure unique matches. Bad query
  • expanding series: invalid destination label name in label_join(): . Bad query
  • expanding series: invalid source label name in label_join(): (.*). Bad query
  • Can't query aggregated metric X without aggregation because.... Bad query
  • query processing would load too many samples into memory in query execution. Limit
  • expanding series: the query exceeded the maximum number of chunks (limit: N chunks) (err-mimir-max-chunks-per-query). Limit
  • expanding series: the query time range exceeds the limit (query length: X, limit: Y) (err-mimir-max-query-length). Limit

Examples of errorType = "bad_data" that result in HTTP 400:

  • invalid parameter "query": 1:135: parse error: .... Bad query
  • invalid parameter "query": invalid expression type "range vector" for range query, must be Scalar or instant Vector. Bad query
  • the total query time range exceeds the limit (query length: X, limit: Y) (err-mimir-max-total-query-length). Limit
  • exceeded maximum resolution of 11,000 points per timeseries. Try decreasing the query resolution. Limit
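Purely as an illustration of the two buckets above, and not the decision logic actually implemented in this PR, a classification keyed on the Prometheus errorType might look roughly like this:

```go
// isCacheableErrorType sketches the distinction drawn in the examples above:
// "execution" (HTTP 422) and "bad_data" (HTTP 400) errors are deterministic,
// so retrying the same query cannot help, while timeouts, cancellations, and
// internal errors are transient and must never be cached.
func isCacheableErrorType(errorType string) bool {
	switch errorType {
	case "execution", "bad_data":
		return true
	default:
		// Anything else (timeouts, cancellations, internal errors, unknown
		// types) is treated as potentially transient and not cached.
		return false
	}
}
```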

@56quarters force-pushed the 56quarters/error-caching branch from b260726 to 3ff24d7 on August 16, 2024 17:24
pkg/frontend/querymiddleware/model.proto (review thread, outdated and resolved)
pkg/frontend/querymiddleware/roundtrip.go (review thread):
@@ -339,6 +340,15 @@ func newQueryMiddlewares(
		newStepAlignMiddleware(limits, log, registerer),
	)

	if cfg.CacheResults && cfg.CacheErrors {
		// TODO: Use a real TTL
		queryRangeMiddleware = append(
Contributor:

An additional middleware increases the cache calls: now we look up one more key from the cache sequentially. I'm mostly thinking about increasing cache lookup RPCs and increasing latency, because we do this cache lookup before we proceed with the request. Did you consider including this logic in the current split&cache middleware?

@56quarters (Contributor, Author):

I considered it. I wanted to avoid changes to the split&cache middleware because it's quite complicated and very focused on caching results; adding error caching to it would add more complexity.

I also wanted to keep the error caching before queries are split, so that we have the best chance of avoiding work when any part of a query is going to fail. We would have to do a lot more work if each individual split or sharded query could fail.

I'm not particularly worried about the latency added here for each request, because it's only a single cache lookup, which takes a few milliseconds on hit or miss.

Contributor:

OK, I guess if we measure these latencies it will be fine, and we'd know if it's too much. Maybe we can stack them in the cache lookup panel in the reads dashboard?

[Screenshot: cache lookup panel from the reads dashboard, taken 2024-08-20]

@56quarters (Contributor, Author):

Testing in a dev cell, this middleware ended up taking about 500µs on a cache hit or miss. I'll follow this PR up with some changes to the reads/queries dashboards if this turns out to be useful.

pkg/frontend/querymiddleware/error_caching.go (two review threads, outdated and resolved)
@56quarters force-pushed the 56quarters/error-caching branch 4 times, most recently from a544161 to 6d493f0 on September 18, 2024 19:52
@56quarters marked this pull request as ready for review on September 18, 2024 20:19
@56quarters requested review from tacole02 and a team as code owners on September 18, 2024 20:19
@tacole02 (Contributor) left a comment:

Docs look good!

@dimitarvdimitrov (Contributor) left a comment:

Thanks for your patience 🙏 I'm really looking forward to this running in prod.

LGTM. I left a few small comments, but I probably don't need to take a look at this again.


func (e *errorCachingHandler) loadErrorFromCache(ctx context.Context, key, hashedKey string, spanLog *spanlogger.SpanLogger) *apierror.APIError {
	res := e.cache.GetMulti(ctx, []string{hashedKey})
	if cached, ok := res[hashedKey]; ok {

Contributor:

nit: invert this to if !ok { return nil }
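Applied to the snippet above, the suggested inversion would read roughly like this. This is only a sketch: decodeCachedError is a hypothetical stand-in for whatever decoding and validation the handler actually performs.

```go
func (e *errorCachingHandler) loadErrorFromCache(ctx context.Context, key, hashedKey string, spanLog *spanlogger.SpanLogger) *apierror.APIError {
	res := e.cache.GetMulti(ctx, []string{hashedKey})
	cached, ok := res[hashedKey]
	if !ok {
		// Early return keeps the cache-hit path unindented.
		return nil
	}
	// Hypothetical helper standing in for the real decode-and-validate logic.
	return decodeCachedError(cached, spanLog)
}
```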


innerRes := newEmptyPrometheusResponse()
inner := &mockHandler{}
inner.On("Do", mock.Anything, mock.Anything).Return(innerRes, nil)

Contributor:

inner should never be called, right? So we can also remove this setup call. Otherwise, wrong behaviour of the cache can pass the test.
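One way to do that, assuming mockHandler is a testify-style mock (a sketch, not the final test code): register no expectations on the inner handler and assert afterwards that it was never invoked, so a cache lookup that incorrectly falls through to the wrapped handler fails the test.

```go
inner := &mockHandler{}
// No inner.On(...) setup: any unexpected call to Do fails the test, so the
// cached error must be served without touching the inner handler.

// ... run the request through the error-caching handler under test ...

// Belt and braces: explicitly assert the inner handler was never reached.
inner.AssertNotCalled(t, "Do")
```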

headerValues := getHeaderValuesWithName(r, cacheControlHeader)
for _, v := range headerValues {
	if v == noStoreValue {
		level.Debug(logger).Log("msg", fmt.Sprintf("%s header in response is equal to %s, not caching the response", cacheControlHeader, noStoreValue))

Contributor:

Why the change to remove the logger?

@56quarters (Contributor, Author):

It just didn't seem like a useful log line on its own, since we never run at debug level.

Contributor:

It always gets added to the span, though.

Create a new query-frontend middleware that caches errors returned by
queries if they are non-transient and will fail again if executed again.
This allows us to save work when running a query that hits, e.g., a limit
error: running the query again will not help and is a waste of work.

See #2676

See #7340

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
@56quarters force-pushed the 56quarters/error-caching branch from eb2d0ae to c7c54d4 on September 25, 2024 18:05
@56quarters merged commit 992efbb into main on Sep 25, 2024
29 checks passed
@56quarters deleted the 56quarters/error-caching branch on September 25, 2024 18:21