
findCache performance improvements #1262

Closed
woodsaj opened this issue Apr 1, 2019 · 5 comments · Fixed by #1263

Comments

@woodsaj
Member

woodsaj commented Apr 1, 2019

Currently, when new series are ingested, they are added to the index, then we create a background goroutine that calls

    defer func() {
        go m.findCache.InvalidateFor(def.OrgId, path)
    }()

By default, 100 of these goroutines are allowed to run at once, and each of them could call find(tree, pattern) for every pattern in the cache (1000 by default). This can put a lot of CPU strain on the MT instance, leading to other components being impacted, and it also causes a lot of memory allocations.

We don't really need to invalidate the cache for new series immediately; buffering the series names and processing them asynchronously would be fine. The trade-off is that new series won't be returned for cached find requests until they have been processed. How long a delay is acceptable will vary between users: some will want new series returned as soon as possible, others might be happy with a delay of a few minutes.

So, to reduce CPU usage and allocations, and to keep a cap on how much CPU is used, I propose the following.

Instead of immediately processing series names in InvalidateFor(), we simply push them into a queue and process them in batches from a single goroutine. If the queue reaches X items, or we have not processed the queue in Y duration, then we process the buffer. This is the same buffer approach we use for flushing chunks to the backend store in batches.
https://github.com/grafana/metrictank/blob/master/store/bigtable/bigtable.go#L302-L323

With this approach we can add all of the series names in the batch to a new Tree and call find(tree, pattern) once per pattern, instead of once per pattern for every new series.
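
Roughly, the queue could look like the following sketch (a minimal illustration only; the invalidateQueue type and names like maxBatchSize, flushInterval and processBatch are hypothetical, not actual metrictank identifiers):

    package memory

    import "time"

    type invalidateReq struct {
        orgId int
        path  string
    }

    // invalidateQueue buffers the paths of new series and invalidates the
    // findCache for them in batches, from a single goroutine.
    type invalidateQueue struct {
        in            chan invalidateReq
        maxBatchSize  int           // flush when the buffer reaches this many items ("X")
        flushInterval time.Duration // flush at least this often ("Y")
    }

    func newInvalidateQueue(size int, interval time.Duration) *invalidateQueue {
        q := &invalidateQueue{
            in:            make(chan invalidateReq, 1000), // hypothetical channel buffer size
            maxBatchSize:  size,
            flushInterval: interval,
        }
        go q.run()
        return q
    }

    func (q *invalidateQueue) run() {
        buf := make([]invalidateReq, 0, q.maxBatchSize)
        ticker := time.NewTicker(q.flushInterval)
        defer ticker.Stop()
        for {
            select {
            case req := <-q.in:
                buf = append(buf, req)
                // flush when the buffer reaches X items
                if len(buf) >= q.maxBatchSize {
                    q.processBatch(buf)
                    buf = buf[:0]
                }
            case <-ticker.C:
                // flush whatever accumulated after Y duration
                if len(buf) > 0 {
                    q.processBatch(buf)
                    buf = buf[:0]
                }
            }
        }
    }

    // processBatch would group the buffered paths by org, add them to a new
    // Tree, and call find(tree, pattern) once per cached pattern.
    func (q *invalidateQueue) processBatch(reqs []invalidateReq) {
        // ...
    }

InvalidateFor() would then only send into q.in (blocking or non-blocking, depending on how much back-pressure we want) instead of walking the cache itself.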

@robert-milan
Contributor

Do you think we should keep queues / buffers by orgId or just one queue for all of them? I was thinking one per orgId.

@robert-milan
Contributor

After discussion with @Dieterbe we think it might be wise to write a separate find() method just for the findCache that simply returns true or false so that it can bail out early. If even one item is a hit, the key must be removed from the cache. This would possibly also save on some memory allocations.
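
A rough sketch of that idea, with a deliberately simplified tree; node, findAny and matches here are illustrative stand-ins, not the real index types:

    package memory

    // node is a simplified stand-in for a tree node in the index.
    type node struct {
        leaf     bool
        children map[string]*node
    }

    // findAny reports whether any node under n matches the dot-separated
    // pattern, bailing out on the first hit instead of collecting all results.
    func findAny(n *node, pattern []string) bool {
        if len(pattern) == 0 {
            return n.leaf
        }
        for name, child := range n.children {
            if matches(pattern[0], name) && findAny(child, pattern[1:]) {
                return true
            }
        }
        return false
    }

    // matches is a placeholder for the real matcher, which also handles
    // glob syntax like {a,b} and [ab].
    func matches(pattern, name string) bool {
        return pattern == "*" || pattern == name
    }

If even one buffered path returns true for a cached pattern, that pattern's entry can be dropped from the cache without computing the remaining matches.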

@Dieterbe
Contributor

Dieterbe commented Apr 2, 2019

This can put a lot of CPU strain on the MT instance, leading to other components being impacted, and it also causes a lot of memory allocations.

For the record, I think we've seen an incident in which we consumed many cores trying to invalidate, without the cache going into backoff mode. We should never use more than 1 core for invalidations.
A metric saying whether the cache is in backoff mode or not may be useful.

We should also track the number of invalidations, as they come with a cost.
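
For illustration, the two signals could be as simple as this sketch (using the standard expvar package only as a stand-in for metrictank's own stats package; the metric names are made up):

    package memory

    import "expvar"

    var (
        // 1 while the findCache is in backoff mode, 0 otherwise.
        findCacheBackoff = expvar.NewInt("idx.memory.find-cache.backoff")

        // running count of invalidations, since each one has a cost.
        findCacheInvalidations = expvar.NewInt("idx.memory.find-cache.invalidations")
    )

    func enterBackoff()   { findCacheBackoff.Set(1) }
    func leaveBackoff()   { findCacheBackoff.Set(0) }
    func onInvalidation() { findCacheInvalidations.Add(1) }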

@replay
Contributor

replay commented Apr 2, 2019

a metric saying whether the cache is in backoff mode or not may be useful.

#1233 (comment)

@Dieterbe
Contributor

Dieterbe commented Apr 2, 2019

might be wise to write a separate find() method just for the findCache that simply returns true or false so that it can bail out early

Been looking into this. The current find() is a BFS. We would need a DFS, which would be a pretty major change not just to find() but also to getMatcher(). I think it's not warranted to proceed with that idea.
Another idea, though: if the CPU profile shows significant time spent in getMatcher, then we may want to add another cache for string pattern -> regex objects, or pattern -> matcher funcs.
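
Something along these lines, for illustration (matcherCache and its names are hypothetical, and the real getMatcher also supports glob syntax beyond plain regexes):

    package memory

    import (
        "regexp"
        "sync"
    )

    // matcherCache memoizes pattern -> matcher func, so the compile work done
    // by getMatcher only happens once per distinct pattern string.
    type matcherCache struct {
        sync.RWMutex
        matchers map[string]func(string) bool
    }

    func newMatcherCache() *matcherCache {
        return &matcherCache{matchers: make(map[string]func(string) bool)}
    }

    func (c *matcherCache) get(pattern string) (func(string) bool, error) {
        c.RLock()
        m, ok := c.matchers[pattern]
        c.RUnlock()
        if ok {
            return m, nil
        }
        re, err := regexp.Compile(pattern)
        if err != nil {
            return nil, err
        }
        m = re.MatchString
        c.Lock()
        c.matchers[pattern] = m
        c.Unlock()
        return m, nil
    }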

a metric saying whether the cache is in backoff mode or not may be useful.

#1233 (comment)

I think the metric would be useful for quickly seeing what's going on in the dashboard. Having to correlate metrics with logs is not very practical.
