Cache non-transient error responses from the query-frontend #9028

Merged: 3 commits merged into main on Sep 25, 2024

Conversation

@56quarters (Contributor) commented on Aug 16, 2024:

What this PR does

Create a new query-frontend middleware that caches errors returned by queries when they are non-transient and would fail the same way if run again. This lets us avoid wasted work when a query hits, e.g., a limit error: running the query again will not help.
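To make the idea concrete, here is a minimal, self-contained sketch of what such a middleware does. The request, handler, and cache types below are simplified stand-ins invented for illustration; they are not the actual types or code added by this PR.

```go
package sketch

import (
	"context"
	"errors"
	"fmt"
)

// Simplified stand-ins for the query-frontend's request, response, and
// handler types (the real middleware is built on the querymiddleware package).
type request struct {
	tenant, query string
	start, end    int64
}

type response struct{ data string }

type handler interface {
	Do(ctx context.Context, req request) (*response, error)
}

// errNonTransient marks errors this sketch treats as cacheable; the real
// classification inspects the API error type returned by the query path.
var errNonTransient = errors.New("non-transient query error")

func isCacheable(err error) bool {
	return errors.Is(err, errNonTransient)
}

// errorCachingMiddleware short-circuits queries whose non-transient errors
// have already been seen, and records new non-transient errors.
type errorCachingMiddleware struct {
	next  handler
	cache map[string]error // toy in-memory cache; the real code uses the shared cache backend with a TTL
}

func (m *errorCachingMiddleware) Do(ctx context.Context, req request) (*response, error) {
	key := fmt.Sprintf("%s:%s:%d:%d", req.tenant, req.query, req.start, req.end)

	// If this exact query already failed with a non-transient error, return
	// the cached error instead of executing the query again.
	if err, ok := m.cache[key]; ok {
		return nil, err
	}

	res, err := m.next.Do(ctx, req)
	if err != nil && isCacheable(err) {
		// Only errors guaranteed to recur (bad queries, limit violations)
		// are stored; transient failures are never cached.
		m.cache[key] = err
	}
	return res, err
}
```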

Which issue(s) this PR fixes or relates to

Fixes #2676

Fixes #7340

Checklist

  • Tests updated.
  • Documentation added.
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX].
  • about-versioning.md updated with experimental features.

@56quarters (Contributor, Author) commented:

Some examples of errors that would be cached, pulled from the logs of an internal cluster over the last seven days. Each error falls into one of two categories: bad queries or limits. These are the kinds of errors this feature is intended to cache (a rough sketch of the distinction follows the examples).

Examples of errorType = "execution" that result in HTTP 422:

  • many-to-many matching not allowed: matching labels must be unique on one side. Bad query
  • multiple matches for labels: many-to-one matching must be explicit (group_left/group_right). Bad query
  • multiple matches for labels: grouping labels must ensure unique matches. Bad query
  • expanding series: invalid destination label name in label_join(): . Bad query
  • expanding series: invalid source label name in label_join(): (.*). Bad query
  • Can't query aggregated metric X without aggregation because.... Bad query
  • query processing would load too many samples into memory in query execution. Limit
  • expanding series: the query exceeded the maximum number of chunks (limit: N chunks) (err-mimir-max-chunks-per-query). Limit
  • expanding series: the query time range exceeds the limit (query length: X, limit: Y) (err-mimir-max-query-length). Limit

Examples of errorType = "bad_data" that result in HTTP 400:

  • invalid parameter "query": 1:135: parse error: .... Bad query
  • invalid parameter "query": invalid expression type "range vector" for range query, must be Scalar or instant Vector. Bad query
  • the total query time range exceeds the limit (query length: X, limit: Y) (err-mimir-max-total-query-length). Limit
  • exceeded maximum resolution of 11,000 points per timeseries. Try decreasing the query resolution. Limit
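Purely as an illustration of the two buckets above, and not the decision logic actually implemented in this PR, a classification keyed on the Prometheus errorType might look roughly like this:

```go
// isCacheableErrorType sketches the distinction drawn in the examples above:
// "execution" (HTTP 422) and "bad_data" (HTTP 400) errors are deterministic,
// so retrying the same query cannot help, while timeouts, cancellations, and
// internal errors are transient and must never be cached.
func isCacheableErrorType(errorType string) bool {
	switch errorType {
	case "execution", "bad_data":
		return true
	default:
		// Anything else (timeouts, cancellations, internal errors, unknown
		// types) is treated as potentially transient and not cached.
		return false
	}
}
```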

@56quarters force-pushed the 56quarters/error-caching branch from b260726 to 3ff24d7 on August 16, 2024 17:24
pkg/frontend/querymiddleware/model.proto (review thread, outdated and resolved)
pkg/frontend/querymiddleware/roundtrip.go (review thread):
@@ -339,6 +340,15 @@ func newQueryMiddlewares(
		newStepAlignMiddleware(limits, log, registerer),
	)

	if cfg.CacheResults && cfg.CacheErrors {
		// TODO: Use a real TTL
		queryRangeMiddleware = append(
Contributor:

An additional middleware increases the cache calls: now we look up one more key from the cache sequentially. I'm mostly thinking about increasing cache lookup RPCs and increasing latency, because we do this cache lookup before we proceed with the request. Did you consider including this logic in the current split&cache middleware?

@56quarters (Contributor, Author):

I considered it. I wanted to avoid changes to the split&cache middleware because it's quite complicated and very focused on caching results; adding error caching to it would add more complexity.

I also wanted to keep the error caching before queries are split, so that we have the best chance of avoiding work when any part of a query is going to fail. We would have to do a lot more work if each individual split or sharded query could fail.

I'm not particularly worried about the latency added here for each request, because it's only a single cache lookup, which takes a few milliseconds on hit or miss.

Contributor:

OK, I guess if we measure these latencies it will be fine, and we'd know if it's too much. Maybe we can stack them in the cache lookup panel in the reads dashboard?

[Screenshot: cache lookup panel from the reads dashboard, taken 2024-08-20]

@56quarters (Contributor, Author):

Testing in a dev cell, this middleware ended up taking about 500µs on a cache hit or miss. I'll follow this PR up with some changes to the reads/queries dashboards if this turns out to be useful.

pkg/frontend/querymiddleware/error_caching.go (two review threads, outdated and resolved)
@56quarters force-pushed the 56quarters/error-caching branch 4 times, most recently from a544161 to 6d493f0 on September 18, 2024 19:52
@56quarters marked this pull request as ready for review on September 18, 2024 20:19
@56quarters requested review from tacole02 and a team as code owners on September 18, 2024 20:19
@tacole02 (Contributor) left a comment:

Docs look good!

@dimitarvdimitrov (Contributor) left a comment:

Thanks for your patience 🙏 I'm really looking forward to this running in prod.

LGTM. I left a few small comments, but I probably don't need to take a look at this again.


func (e *errorCachingHandler) loadErrorFromCache(ctx context.Context, key, hashedKey string, spanLog *spanlogger.SpanLogger) *apierror.APIError {
	res := e.cache.GetMulti(ctx, []string{hashedKey})
	if cached, ok := res[hashedKey]; ok {

Contributor:

nit: invert this to if !ok { return nil }
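Applied to the snippet above, the suggested inversion would read roughly like this. This is only a sketch: decodeCachedError is a hypothetical stand-in for whatever decoding and validation the handler actually performs.

```go
func (e *errorCachingHandler) loadErrorFromCache(ctx context.Context, key, hashedKey string, spanLog *spanlogger.SpanLogger) *apierror.APIError {
	res := e.cache.GetMulti(ctx, []string{hashedKey})
	cached, ok := res[hashedKey]
	if !ok {
		// Early return keeps the cache-hit path unindented.
		return nil
	}
	// Hypothetical helper standing in for the real decode-and-validate logic.
	return decodeCachedError(cached, spanLog)
}
```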


innerRes := newEmptyPrometheusResponse()
inner := &mockHandler{}
inner.On("Do", mock.Anything, mock.Anything).Return(innerRes, nil)

Contributor:

inner should never be called, right? So we can also remove this setup call. Otherwise, wrong behaviour of the cache can pass the test.
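One way to do that, assuming mockHandler is a testify-style mock (a sketch, not the final test code): register no expectations on the inner handler and assert afterwards that it was never invoked, so a cache lookup that incorrectly falls through to the wrapped handler fails the test.

```go
inner := &mockHandler{}
// No inner.On(...) setup: any unexpected call to Do fails the test, so the
// cached error must be served without touching the inner handler.

// ... run the request through the error-caching handler under test ...

// Belt and braces: explicitly assert the inner handler was never reached.
inner.AssertNotCalled(t, "Do")
```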

headerValues := getHeaderValuesWithName(r, cacheControlHeader)
for _, v := range headerValues {
	if v == noStoreValue {
		level.Debug(logger).Log("msg", fmt.Sprintf("%s header in response is equal to %s, not caching the response", cacheControlHeader, noStoreValue))

Contributor:

Why the change to remove the logger?

@56quarters (Contributor, Author):

It just didn't seem like a useful log line on its own, since we never run at debug level.

Contributor:

It always gets added to the span, though.

Create a new query-frontend middleware that caches errors returned by
queries if they are non-transient and will fail again if executed again.
This allows us to save work when running a query that hits, e.g., a limit
error: running the query again will not help and is a waste of work.

See #2676

See #7340

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
@56quarters force-pushed the 56quarters/error-caching branch from eb2d0ae to c7c54d4 on September 25, 2024 18:05
@56quarters merged commit 992efbb into main on Sep 25, 2024
29 checks passed
@56quarters deleted the 56quarters/error-caching branch on September 25, 2024 18:21