Cache non-transient error responses from the query-frontend #9028
Conversation
Some examples of errors that would be cached, pulled from the logs of an internal cluster over the last seven days. Each of the errors falls into one of two categories: bad queries or limits. These are the types of errors intended to be cached by this feature.
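To make the two categories concrete, here is a hedged sketch of how errors might be classified as cacheable. The `ErrorType` values are illustrative stand-ins, not the actual apierror codes used by Mimir:

```go
package main

// ErrorType is an illustrative classification of query errors; only errors
// caused by the query itself (bad queries, limit violations) would fail
// identically on retry and are therefore safe to cache.
type ErrorType string

const (
	TypeBadData  ErrorType = "bad_data"  // malformed or invalid query
	TypeExec     ErrorType = "execution" // e.g. a limit exceeded while running
	TypeInternal ErrorType = "internal"  // transient server-side failure
	TypeTimeout  ErrorType = "timeout"   // transient; may succeed on retry
)

// isCacheableError reports whether an error of this type would fail again
// if the query were retried, making it safe to cache.
func isCacheableError(t ErrorType) bool {
	switch t {
	case TypeBadData, TypeExec:
		return true
	default:
		return false
	}
}
```

Transient failures (internal errors, timeouts) must never be cached, since retrying them can legitimately succeed.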
@@ -339,6 +340,15 @@ func newQueryMiddlewares(
		newStepAlignMiddleware(limits, log, registerer),
	)

	if cfg.CacheResults && cfg.CacheErrors {
		// TODO: Use a real TTL
		queryRangeMiddleware = append(
An additional middleware increases cache calls: now we look up one more key from the cache sequentially. I'm mostly thinking about the extra cache lookup RPCs and added latency, because we do this cache lookup before we proceed with the request. Did you consider including this logic in the current split&cache middleware?
I considered it. I wanted to avoid changes to the split&cache middleware because it's quite complicated and very focused on caching results. Adding caching of errors to it would add more complexity.
I also wanted to keep the error caching before queries are split so that we have the best chance of avoiding work when any part of a query is going to fail. We would have to do a lot more work if each individual split or sharded query can fail.
I'm not particularly worried about the latency added here for each request because it's only a single cache lookup which is a few milliseconds on hit or miss.
Testing in a dev cell, this middleware ended up taking about 500us on cache hit or miss. I'll follow up this PR with some changes to the reads/queries dashboards if this turns out to be useful.
Docs look good!
Thanks for your patience 🙏 I'm really looking forward to this running in prod
LGTM, I left a few small comments, but I probably don't need to take a look at this again
func (e *errorCachingHandler) loadErrorFromCache(ctx context.Context, key, hashedKey string, spanLog *spanlogger.SpanLogger) *apierror.APIError {
	res := e.cache.GetMulti(ctx, []string{hashedKey})
	if cached, ok := res[hashedKey]; ok {
nit to invert this: if !ok { return nil }
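The inverted early-return shape the nit suggests can be sketched as follows, with a plain map and byte slices standing in for the real cache client result and apierror decoding:

```go
package main

// loadErrorFromCacheSketch illustrates the suggested guard-clause style:
// handle the miss first and return early, so the hit path is unindented.
func loadErrorFromCacheSketch(res map[string][]byte, hashedKey string) []byte {
	cached, ok := res[hashedKey]
	if !ok {
		return nil // cache miss: nothing to do
	}
	return cached // cache hit: caller decodes the stored error
}
```

This keeps the happy path at the top level instead of nesting it inside the `if cached, ok := ...; ok` branch.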
innerRes := newEmptyPrometheusResponse()
inner := &mockHandler{}
inner.On("Do", mock.Anything, mock.Anything).Return(innerRes, nil)
inner should never be called, right? So we can also remove this setup call. Otherwise, wrong behaviour of the cache could pass the test.
headerValues := getHeaderValuesWithName(r, cacheControlHeader)
for _, v := range headerValues {
	if v == noStoreValue {
		level.Debug(logger).Log("msg", fmt.Sprintf("%s header in response is equal to %s, not caching the response", cacheControlHeader, noStoreValue))
why the change to remove logger?
Just didn't seem like a useful log line on its own since we never run at debug level
it always gets added to the span though
Create a new query-frontend middleware that caches errors returned by queries when they are non-transient and would fail again if retried. This allows us to save work when running a query that hits, e.g., a limit error: retrying the query will not help and is wasted work.

See #2676
See #7340

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
What this PR does
Create a new query-frontend middleware that caches errors returned by queries when they are non-transient and would fail again if retried. This allows us to save work when running a query that hits, e.g., a limit error: retrying the query will not help and is wasted work.
Which issue(s) this PR fixes or relates to
Fixes #2676
Fixes #7340
Checklist

- CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX].
- about-versioning.md updated with experimental features.