
findCache performance improvements #1262

Closed
woodsaj opened this issue Apr 1, 2019 · 5 comments · Fixed by #1263

Comments

@woodsaj
Member

woodsaj commented Apr 1, 2019

Currently, when new series are ingested, they are added to the index, then we create a background goroutine that calls

    defer func() {
        go m.findCache.InvalidateFor(def.OrgId, path)
    }()

By default, 100 of these goroutines are allowed to run at once, and each of them could call find(tree, pattern) for every pattern in the cache (1000 by default). This can put a lot of CPU strain on the MT instance, leading to other components being impacted, and it also causes a lot of memory allocations.

We don't really need to invalidate the cache for new series immediately; buffering the series names and processing them asynchronously would be fine. The trade-off is that new series won't be returned for cached find requests until they have been processed. How long a delay is acceptable will vary between users: some will want new series returned as soon as possible, others might be happy with a delay of a few minutes.

So, to reduce CPU usage and allocations, and to keep a cap on how much CPU is used, I propose the following.

Instead of immediately processing series names in InvalidateFor(), we simply push them into a queue and process them in batches from a single goroutine. If the queue reaches X items, or we have not processed the queue in Y duration, then we process the buffer. This is the same buffer approach we use for flushing chunks to the backend store in batches.
https://github.com/grafana/metrictank/blob/master/store/bigtable/bigtable.go#L302-L323

With this approach we can add all of the series names in the batch to a new Tree and call find(tree, pattern) once per pattern, instead of once per pattern for every new series.
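
Roughly, the queue could look like the following sketch (a minimal illustration only; the invalidateQueue type and names like maxBatchSize, flushInterval and processBatch are hypothetical, not actual metrictank identifiers):

    package memory

    import "time"

    type invalidateReq struct {
        orgId int
        path  string
    }

    // invalidateQueue buffers the paths of new series and invalidates the
    // findCache for them in batches, from a single goroutine.
    type invalidateQueue struct {
        in            chan invalidateReq
        maxBatchSize  int           // flush when the buffer reaches this many items ("X")
        flushInterval time.Duration // flush at least this often ("Y")
    }

    func newInvalidateQueue(size int, interval time.Duration) *invalidateQueue {
        q := &invalidateQueue{
            in:            make(chan invalidateReq, 1000), // hypothetical channel buffer size
            maxBatchSize:  size,
            flushInterval: interval,
        }
        go q.run()
        return q
    }

    func (q *invalidateQueue) run() {
        buf := make([]invalidateReq, 0, q.maxBatchSize)
        ticker := time.NewTicker(q.flushInterval)
        defer ticker.Stop()
        for {
            select {
            case req := <-q.in:
                buf = append(buf, req)
                // flush when the buffer reaches X items
                if len(buf) >= q.maxBatchSize {
                    q.processBatch(buf)
                    buf = buf[:0]
                }
            case <-ticker.C:
                // flush whatever accumulated after Y duration
                if len(buf) > 0 {
                    q.processBatch(buf)
                    buf = buf[:0]
                }
            }
        }
    }

    // processBatch would group the buffered paths by org, add them to a new
    // Tree, and call find(tree, pattern) once per cached pattern.
    func (q *invalidateQueue) processBatch(reqs []invalidateReq) {
        // ...
    }

InvalidateFor() would then only send into q.in (blocking or non-blocking, depending on how much back-pressure we want) instead of walking the cache itself.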

@robert-milan
Contributor

Do you think we should keep queues / buffers by orgId or just one queue for all of them? I was thinking one per orgId.

@robert-milan
Contributor

After discussion with @Dieterbe we think it might be wise to write a separate find() method just for the findCache that simply returns true or false so that it can bail out early. If even one item is a hit, the key must be removed from the cache. This would possibly also save on some memory allocations.
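
A rough sketch of that idea, with a deliberately simplified tree; node, findAny and matches here are illustrative stand-ins, not the real index types:

    package memory

    // node is a simplified stand-in for a tree node in the index.
    type node struct {
        leaf     bool
        children map[string]*node
    }

    // findAny reports whether any node under n matches the dot-separated
    // pattern, bailing out on the first hit instead of collecting all results.
    func findAny(n *node, pattern []string) bool {
        if len(pattern) == 0 {
            return n.leaf
        }
        for name, child := range n.children {
            if matches(pattern[0], name) && findAny(child, pattern[1:]) {
                return true
            }
        }
        return false
    }

    // matches is a placeholder for the real matcher, which also handles
    // glob syntax like {a,b} and [ab].
    func matches(pattern, name string) bool {
        return pattern == "*" || pattern == name
    }

If even one buffered path returns true for a cached pattern, that pattern's entry can be dropped from the cache without computing the remaining matches.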

@Dieterbe
Contributor

Dieterbe commented Apr 2, 2019

This can put a lot of CPU strain on the MT instance, leading to other components being impacted, and it also causes a lot of memory allocations.

For the record, I think we've seen an incident in which we consumed many cores trying to invalidate, without the cache going into backoff mode. We should never use more than 1 core for invalidations.
A metric saying whether the cache is in backoff mode or not may be useful.

We should also track the number of invalidations, as they come with a cost.
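
For illustration, the two signals could be as simple as this sketch (using the standard expvar package only as a stand-in for metrictank's own stats package; the metric names are made up):

    package memory

    import "expvar"

    var (
        // 1 while the findCache is in backoff mode, 0 otherwise.
        findCacheBackoff = expvar.NewInt("idx.memory.find-cache.backoff")

        // running count of invalidations, since each one has a cost.
        findCacheInvalidations = expvar.NewInt("idx.memory.find-cache.invalidations")
    )

    func enterBackoff()   { findCacheBackoff.Set(1) }
    func leaveBackoff()   { findCacheBackoff.Set(0) }
    func onInvalidation() { findCacheInvalidations.Add(1) }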

@replay
Contributor

replay commented Apr 2, 2019

a metric saying whether the cache is in backoff mode or not may be useful.

#1233 (comment)

@Dieterbe
Contributor

Dieterbe commented Apr 2, 2019

might be wise to write a separate find() method just for the findCache that simply returns true or false so that it can bail out early

Been looking into this. The current find() is a BFS. We would need a DFS, which would be a pretty major change not just to find() but also to getMatcher(). I think it's not warranted to proceed with that idea.
Another idea, though: if the CPU profile shows significant time spent in getMatcher, then we may want to add another cache for string pattern -> regex objects, or pattern -> matcher funcs.
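
Something along these lines, for illustration (matcherCache and its names are hypothetical, and the real getMatcher also supports glob syntax beyond plain regexes):

    package memory

    import (
        "regexp"
        "sync"
    )

    // matcherCache memoizes pattern -> matcher func, so the compile work done
    // by getMatcher only happens once per distinct pattern string.
    type matcherCache struct {
        sync.RWMutex
        matchers map[string]func(string) bool
    }

    func newMatcherCache() *matcherCache {
        return &matcherCache{matchers: make(map[string]func(string) bool)}
    }

    func (c *matcherCache) get(pattern string) (func(string) bool, error) {
        c.RLock()
        m, ok := c.matchers[pattern]
        c.RUnlock()
        if ok {
            return m, nil
        }
        re, err := regexp.Compile(pattern)
        if err != nil {
            return nil, err
        }
        m = re.MatchString
        c.Lock()
        c.matchers[pattern] = m
        c.Unlock()
        return m, nil
    }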

a metric saying whether the cache is in backoff mode or not may be useful.

#1233 (comment)

I think the metric would be useful for quickly seeing what's going on in the dashboard. Having to correlate metrics with logs is not very practical.
