
refactor findCache to use fewer resources #1263

Merged

merged 29 commits into master from findcache-optimizations-treefrompath on Apr 2, 2019

Conversation

robert-milan (Contributor) commented Apr 2, 2019

fix #1262

@Dieterbe force-pushed the findcache-optimizations-treefrompath branch 3 times, most recently from 930a1db to 83b49de on April 2, 2019 10:24
@Dieterbe changed the title from "WIP: Add methods for queue processing" to "refactor findCache to use fewer resources" Apr 2, 2019
@Dieterbe changed the title from "refactor findCache to use fewer resources" to "WIP: refactor findCache to use fewer resources" Apr 2, 2019
@Dieterbe force-pushed the findcache-optimizations-treefrompath branch from 49bbfa6 to a41534c on April 2, 2019 11:19
@Dieterbe force-pushed the findcache-optimizations-treefrompath branch from a41534c to 460eba5 on April 2, 2019 11:23
@Dieterbe force-pushed the findcache-optimizations-treefrompath branch from 460eba5 to ea96728 on April 2, 2019 12:15
robert-milan and others added 3 commits April 2, 2019 15:51
having it per-tenant means we need tenant-aware invalidation processing,
which would just be too complicated (e.g. different batches for
different tenants, how do you prioritize smartly without starving, etc.)

We can re-assess later.
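
To illustrate the trade-off: a single shared queue lets one loop batch invalidations for all orgs, with no tenant-aware batching or prioritization logic at all. A minimal Go sketch of that shape (the names, types, and batching policy here are illustrative assumptions, not this PR's code):

package memory

import "time"

// invalidateRequest asks the find cache to drop entries matching a path.
// All orgs share one queue, so the processor needs no per-tenant logic.
type invalidateRequest struct {
	orgId uint32
	path  string
}

// processInvalidations drains the queue in batches: whichever requests
// arrive first get processed first, so no tenant can starve another.
func processInvalidations(queue <-chan invalidateRequest, maxBatch int, flushEvery time.Duration) {
	ticker := time.NewTicker(flushEvery)
	defer ticker.Stop()
	batch := make([]invalidateRequest, 0, maxBatch)
	flush := func() {
		if len(batch) == 0 {
			return
		}
		// the real code would invalidate the cached patterns this batch touches
		batch = batch[:0]
	}
	for {
		select {
		case req, ok := <-queue:
			if !ok {
				flush() // queue closed: process what's left and stop
				return
			}
			batch = append(batch, req)
			if len(batch) == maxBatch {
				flush()
			}
		case <-ticker.C:
			flush() // time-based flush so a quiet queue still drains
		}
	}
}
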
@Dieterbe changed the title from "WIP: refactor findCache to use fewer resources" to "refactor findCache to use fewer resources" Apr 2, 2019
@Dieterbe requested review from woodsaj and replay April 2, 2019 15:53
Dieterbe added 2 commits April 2, 2019 18:45
For more info see smartystreets/goconvey#497
TLDR: without this, I was getting unit test failures like so:

0 total assertions

--- FAIL: TestFindCache (0.00s)
panic: Top-level calls to Convey(...) need a reference to the *testing.T.
		Hint: Convey("description here", t, func() { /* notice that the second argument was the *testing.T (t)! */ })  [recovered]
	panic: Top-level calls to Convey(...) need a reference to the *testing.T.
		Hint: Convey("description here", t, func() { /* notice that the second argument was the *testing.T (t)! */ })  [recovered]
	panic: Top-level calls to Convey(...) need a reference to the *testing.T.
		Hint: Convey("description here", t, func() { /* notice that the second argument was the *testing.T (t)! */ })

goroutine 11 [running]:
testing.tRunner.func1(0xc0000de400)
	/usr/lib/go/src/testing/testing.go:830 +0x392
panic(0xa28be0, 0xc000129290)
	/usr/lib/go/src/runtime/panic.go:522 +0x1b5
github.com/grafana/metrictank/vendor/github.com/smartystreets/goconvey/convey.(*context).conveyInner.func2(0xc00007cd80)
	/home/dieter/go/src/github.com/grafana/metrictank/vendor/github.com/smartystreets/goconvey/convey/context.go:232 +0x1c6
panic(0xa28be0, 0xc000129290)
	/usr/lib/go/src/runtime/panic.go:522 +0x1b5
github.com/grafana/metrictank/vendor/github.com/smartystreets/goconvey/convey.conveyPanic(...)
	/home/dieter/go/src/github.com/grafana/metrictank/vendor/github.com/smartystreets/goconvey/convey/context.go:20
github.com/grafana/metrictank/vendor/github.com/smartystreets/goconvey/convey.rootConvey(0xc000067cb8, 0x2, 0x2)
	/home/dieter/go/src/github.com/grafana/metrictank/vendor/github.com/smartystreets/goconvey/convey/context.go:91 +0x35c
github.com/grafana/metrictank/vendor/github.com/smartystreets/goconvey/convey.Convey(0xc000067cb8, 0x2, 0x2)
	/home/dieter/go/src/github.com/grafana/metrictank/vendor/github.com/smartystreets/goconvey/convey/doc.go:75 +0xbf
github.com/grafana/metrictank/idx/memory.TestFindCache.func1()
	/home/dieter/go/src/github.com/grafana/metrictank/idx/memory/find_cache_test.go:88 +0xe7
github.com/grafana/metrictank/vendor/github.com/smartystreets/goconvey/convey.parseAction.func1(0xbce120, 0xc00007cd80)
	/home/dieter/go/src/github.com/grafana/metrictank/vendor/github.com/smartystreets/goconvey/convey/discovery.go:80 +0x24
github.com/grafana/metrictank/vendor/github.com/smartystreets/goconvey/convey.(*context).conveyInner(0xc00007cd80, 0xae977f, 0x17, 0xc00012cb20)
	/home/dieter/go/src/github.com/grafana/metrictank/vendor/github.com/smartystreets/goconvey/convey/context.go:261 +0x154
github.com/grafana/metrictank/vendor/github.com/smartystreets/goconvey/convey.rootConvey.func1()
	/home/dieter/go/src/github.com/grafana/metrictank/vendor/github.com/smartystreets/goconvey/convey/context.go:110 +0xf1
github.com/grafana/metrictank/vendor/github.com/jtolds/gls._m(0x0, 0xc00000ff20)
	/home/dieter/go/src/github.com/grafana/metrictank/vendor/github.com/jtolds/gls/stack_tags.go:39 +0x31
github.com/grafana/metrictank/vendor/github.com/jtolds/gls.markS(...)
	/home/dieter/go/src/github.com/grafana/metrictank/vendor/github.com/jtolds/gls/stack_tags.go:16
github.com/grafana/metrictank/vendor/github.com/jtolds/gls.addStackTag(...)
	/home/dieter/go/src/github.com/grafana/metrictank/vendor/github.com/jtolds/gls/stack_tags.go:13
github.com/grafana/metrictank/vendor/github.com/jtolds/gls.(*ContextManager).SetValues(0xc00000fd80, 0xc0001291d0, 0xc00000ff20)
	/home/dieter/go/src/github.com/grafana/metrictank/vendor/github.com/jtolds/gls/context.go:92 +0x3fd
github.com/grafana/metrictank/vendor/github.com/smartystreets/goconvey/convey.rootConvey(0xc000067f68, 0x3, 0x3)
	/home/dieter/go/src/github.com/grafana/metrictank/vendor/github.com/smartystreets/goconvey/convey/context.go:105 +0x22b
github.com/grafana/metrictank/vendor/github.com/smartystreets/goconvey/convey.Convey(0xc000067f68, 0x3, 0x3)
	/home/dieter/go/src/github.com/grafana/metrictank/vendor/github.com/smartystreets/goconvey/convey/doc.go:75 +0xbf
github.com/grafana/metrictank/idx/memory.TestFindCache(0xc0000de400)
	/home/dieter/go/src/github.com/grafana/metrictank/idx/memory/find_cache_test.go:86 +0x99
testing.tRunner(0xc0000de400, 0xb05e78)
	/usr/lib/go/src/testing/testing.go:865 +0xc0
created by testing.(*T).Run
	/usr/lib/go/src/testing/testing.go:916 +0x35a
exit status 2
FAIL	github.com/grafana/metrictank/idx/memory	0.008s
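
(The hint in the panic shows the fix. A minimal illustrative test, not the actual find_cache_test.go: any Convey call that goconvey cannot trace back to an enclosing Convey, e.g. one reached after the goroutine-local context is lost, counts as top-level and must be passed the *testing.T.)

package memory

import (
	"testing"

	. "github.com/smartystreets/goconvey/convey"
)

// TestConveySketch is illustrative only: passing t as the second argument
// marks this Convey as a root scope, which is what the panic asks for.
func TestConveySketch(t *testing.T) {
	Convey("a top-level scope needs the *testing.T", t, func() {
		So(1+1, ShouldEqual, 2)
	})
}
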
Dieterbe added 2 commits April 2, 2019 18:45
should delete all orgs + make callable from test
@Dieterbe force-pushed the findcache-optimizations-treefrompath branch from 3d343ab to d59a773 on April 2, 2019 16:45
@Dieterbe force-pushed the findcache-optimizations-treefrompath branch from 235324f to 569d24e on April 2, 2019 18:33
@@ -69,22 +96,26 @@ func (c *FindCache) Get(orgId uint32, pattern string) ([]*Node, bool) {
func (c *FindCache) Add(orgId uint32, pattern string, nodes []*Node) {
	c.RLock()
	cache, ok := c.cache[orgId]
	t := c.backoff[orgId]
	backoff := findCacheBackoff.Peek()
Member

I like this approach, but findCacheBackoff can't be a global; it needs to be part of the FindCache struct.

Member

actually, the findCacheBackoff metric doesn't work at all with the partitioned index, as there are multiple findCaches that could all have different states.

Contributor

why? to make it clear that no other code in the package should mess with it? if that's the reason, we could solve it with a better comment

Member

I would just add a "backoff" field to the FindCache struct. Then in triggerBackoff() I would:

time.AfterFunc(c.backoffTime, func() {
	findCacheBackoff.SetFalse()
	c.Lock()
	c.backoff = false
	c.Unlock()
})

Member

> why, to make it clear that no other code in the package should mess with it

It is just a bad idea to mix per-instance struct state with global variables. You can have multiple instances of a findCache at any one time, so controlling how the instance behaves through a global variable is just a recipe for disaster.

Contributor

> actually, the findCacheBackoff metric doesn't work at all with the partitioned index, as there are multiple findCaches that could all have different states.

Hmm, I didn't consider this. Good catch. Will follow your suggestion.

Contributor

actually to accommodate PartitionedIndex, we can't use a backoff bool either. maybe it should be a gauge that is incremented and decremented.
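
A sketch of where this thread landed: per-instance backoff state plus a shared counter-style metric that each FindCache increments when entering backoff and decrements when leaving, so multiple instances under the partitioned index don't clobber each other. The field names, the plain int64 stand-in for the real stats gauge, and the dedup check are assumptions for illustration, not the merged code:

package memory

import (
	"sync"
	"sync/atomic"
	"time"
)

// stand-in for the real stats gauge: counts how many FindCache instances
// are currently in backoff mode (0 means none are backing off)
var findCacheBackoff int64

type FindCache struct {
	sync.RWMutex
	backoff     bool          // is this instance currently backing off?
	backoffTime time.Duration // how long to stay in backoff mode
	// ... per-org caches etc.
}

// triggerBackoff puts this instance into backoff mode, bumps the shared
// gauge, and schedules the automatic recovery.
func (c *FindCache) triggerBackoff() {
	c.Lock()
	if c.backoff {
		c.Unlock()
		return // already backing off; don't stack a second timer
	}
	c.backoff = true
	c.Unlock()
	atomic.AddInt64(&findCacheBackoff, 1)
	time.AfterFunc(c.backoffTime, func() {
		atomic.AddInt64(&findCacheBackoff, -1)
		c.Lock()
		c.backoff = false
		c.Unlock()
	})
}
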

@Dieterbe force-pushed the findcache-optimizations-treefrompath branch from 6d19e6e to 7aa469a on April 2, 2019 19:30
@woodsaj (Member) left a comment

LGTM

@Dieterbe merged commit 1d55764 into master Apr 2, 2019
@Dieterbe (Contributor) commented Apr 3, 2019

I did an experiment with the docker stack to confirm the improvements (also with #1265 applied).
I ran some scripts that generate an ingest workload, then start a background HTTP request load, making sure to populate the findCache with patterns that will later need to be revisited (and invalidated): we then increase the ingestion rate, adding more metrics whose names match what we already cached. Of particular interest is what happens right when the new metrics come in and the cache needs to invalidate. I collected snapshots and CPU profiles at those times.

Did it like so:

git diff # to make sure we're not affected by chunk saving intervals etc, as long as we don't pass 6h intervals at any point during the tests
diff --git a/docker/docker-dev-custom-cfg-kafka/storage-schemas.conf b/docker/docker-dev-custom-cfg-kafka/storage-schemas.conf
index 2d3083b5..54f0f4b2 100644
--- a/docker/docker-dev-custom-cfg-kafka/storage-schemas.conf
+++ b/docker/docker-dev-custom-cfg-kafka/storage-schemas.conf
@@ -50,11 +50,7 @@
 # This example has 3 retention definitions, the first and last override some default options (to use 10minutely and 2hourly chunks and only keep one of them in memory
 # and the last rollup is marked as not ready yet for querying.
 
-[raw]
-pattern = fakemetrics.raw
-retentions = 1s:6h:2min:2
-
 [default]
 pattern = .*
-retentions = 1s:6h:2min:2,1min:35d:6h:1
+retentions = 1s:6h:6h:2
 # reorderBuffer = 20

run make and then start these scripts at the same time:

~/g/s/g/g/metrictank ❯❯❯ cat test-index-add-cpu-launch.sh
#!/bin/bash

source util.sh
source scripts/version-tag.sh

wait_time 120
docker-compose -f docker/docker-dev-custom-cfg-kafka/docker-compose.yml up --force-recreate -V
~/g/s/g/g/metrictank ❯❯❯ cat test-index-add-cpu-tests.sh
#!/bin/bash

source util.sh
source scripts/version-tag.sh

# does queries such that the find-cache will be full with patterns that will later become invalidated and pruned once we introduce more metrics
renderWorkload () {
     num=1000
     for i in $(seq 1 $num); do
     	echo "GET http://localhost:6060/render?target=some.id.of.a.metric.*$i&from=-5min"
     	echo "X-Org-Id: 1"
     	echo
     done | vegeta attack -duration 180s | vegeta report
}

echo "killing any fakemetrics currently running"
pkill -f fakemetrics

wait_time 120
sleep 120
log "feed 10kHz"
fakemetrics feed --kafka-mdm-addr localhost:9092 --period 1s --mpo 10000 &
pid=$!

sleep 60

# needs at least 20s to populate index cache
renderWorkload &

sleep 60

kill $pid
pgrep -a fakemetrics
log "feed 10kHz -> 15kHz"
fakemetrics feed --kafka-mdm-addr localhost:9092 --period 1s --mpo 15000 &
log "getting profile"
curl 'http://localhost:6060/debug/pprof/profile?seconds=5' --output cpu-$version.prof

~/g/s/g/g/metrictank ❯❯❯ cat util.sh
#!/bin/bash

log () {
  echo "$(date +'%F %H:%M:%S') $1"
}

# converts a number of ns to a specification for `sleep`
ns_to_sleep() {
	local ns=$1
	printf "%010d\n" $ns | sed 's/\(.\{9\}\)$/.\1/'
}

# sleep until the next round timestamp (snap), plus an optional offset
# e.g. wait_time 60 7 will sleep until 7 seconds after the next whole minute
# obviously this isn't accurate to the ns, due to overhead of executing commands, etc
# on Dieter's laptop, there is typically 5.7 to 5.8 millis of overhead, hence the compensation
wait_time() {
	local snap=$(($1 * 1000000000))
	local offset=$((${2:-0} * 1000000000))
	now=$(date +%s%N)
	sleep_ns=$(($snap - (($now + 5700000 - $offset) % $snap)))
	sleep $(ns_to_sleep $sleep_ns)
}

v0.11.0-185-ga7c8203d

https://snapshot.raintank.io/dashboard/snapshot/esH0WpwdEXf3RUd2konLB21B4NOgRVVe?orgId=2

killing any fakemetrics currently running
2019-04-03 10:54:00 feed 10kHz
2019/04/03 10:54:00 [I] starting listener on :6764
params: orgs=1, mpo=10000, period=1, flush=1000, offset=0, speedup=1, stopAtNow=false
per org:         each 1s, flushing 10000 metrics so rate of 10000 Hz. (10000 total unique series)
times    1 orgs: each 1s, flushing 10000 metrics so rate of 10000 Hz. (10000 total unique series)
2019-04-03 10:54:01.016186533 +0200 CEST m=+1.011586194 10000 metrics could not be sent as v2 MetricPoint
2019-04-03 10:56:00 feed 10kHz -> 15kHz
2019-04-03 10:56:00 getting profile
2019/04/03 10:56:00 [I] starting listener on :6764
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0params: orgs=1, mpo=15000, period=1, flush=1000, offset=0, speedup=1, stopAtNow=false
per org:         each 1s, flushing 15000 metrics so rate of 15000 Hz. (15000 total unique series)
times    1 orgs: each 1s, flushing 15000 metrics so rate of 15000 Hz. (15000 total unique series)
2019-04-03 10:56:01.029251223 +0200 CEST m=+1.010394697 15000 metrics could not be sent as v2 MetricPoint
100 22377    0 22377    0     0   4371      0 --:--:--  0:00:05 --:--:--  5708
~/g/s/g/g/metrictank ❯❯❯ Requests      [total, rate]            9000, 50.01
Duration      [total, attack, wait]    2m59.982985565s, 2m59.980043196s, 2.942369ms
Latencies     [mean, 50, 95, 99, max]  9.225303ms, 11.215829ms, 17.353998ms, 20.158164ms, 279.760029ms
Bytes In      [total, mean]            18000, 2.00
Bytes Out     [total, mean]            0, 0.00
Success       [ratio]                  100.00%
Status Codes  [code:count]             200:9000  
Error Set:
~/g/s/g/g/metrictank ❯❯❯ go tool pprof cpu-v0.11.0-185-ga7c8203.prof
File: metrictank
Type: cpu
Time: Apr 3, 2019 at 10:56am (CEST)
Duration: 5.12s, Total samples = 4.36s (85.24%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top20
Showing nodes accounting for 3.22s, 73.85% of 4.36s total
Dropped 219 nodes (cum <= 0.02s)
Showing top 20 nodes out of 172
      flat  flat%   sum%        cum   cum%
     1.15s 26.38% 26.38%      2.12s 48.62%  regexp.(*Regexp).tryBacktrack
     0.53s 12.16% 38.53%      0.53s 12.16%  regexp.(*bitState).shouldVisit
     0.33s  7.57% 46.10%      0.33s  7.57%  regexp.(*inputString).step
     0.19s  4.36% 50.46%      0.19s  4.36%  syscall.Syscall
     0.14s  3.21% 53.67%      0.26s  5.96%  runtime.scanobject
     0.13s  2.98% 56.65%      0.13s  2.98%  runtime.memclrNoHeapPointers
     0.12s  2.75% 59.40%      0.54s 12.39%  runtime.mallocgc
     0.10s  2.29% 61.70%      0.13s  2.98%  runtime.heapBitsSetType
     0.08s  1.83% 63.53%      0.11s  2.52%  regexp.(*bitState).push
     0.06s  1.38% 64.91%      0.09s  2.06%  runtime.findObject
     0.06s  1.38% 66.28%      0.14s  3.21%  sync.(*Pool).Put
     0.05s  1.15% 67.43%      2.43s 55.73%  regexp.(*Regexp).backtrack
     0.05s  1.15% 68.58%      0.05s  1.15%  runtime.procPin
     0.04s  0.92% 69.50%      0.04s  0.92%  runtime.nextFreeFast
     0.04s  0.92% 70.41%      0.04s  0.92%  strings.IndexAny
     0.03s  0.69% 71.10%      0.04s  0.92%  github.com/grafana/metrictank/vendor/github.com/klauspost/compress/flate.(*compressor).deflateSSE
     0.03s  0.69% 71.79%      2.46s 56.42%  regexp.(*Regexp).doExecute
     0.03s  0.69% 72.48%      0.03s  0.69%  regexp.(*inputString).context
     0.03s  0.69% 73.17%      0.03s  0.69%  runtime.acquirem
     0.03s  0.69% 73.85%      0.12s  2.75%  runtime.findrunnable
(pprof) 

v0.11.0-219-g88830e3

https://snapshot.raintank.io/dashboard/snapshot/aeX0vbOIQswQN6tycV2Xl7S9oKx9l8m8?orgId=2

~/g/s/g/g/metrictank ❯❯❯ ./test-index-add-cpu-tests.sh
make sure no fakemetrics currently running
30685 fakemetrics feed --kafka-mdm-addr localhost:9092 --period 1s --mpo 15000
2019-04-03 10:20:00 feed 10kHz
2019/04/03 10:20:00 [I] starting listener on :6764
params: orgs=1, mpo=10000, period=1, flush=1000, offset=0, speedup=1, stopAtNow=false
per org:         each 1s, flushing 10000 metrics so rate of 10000 Hz. (10000 total unique series)
times    1 orgs: each 1s, flushing 10000 metrics so rate of 10000 Hz. (10000 total unique series)
2019-04-03 10:20:01.014758663 +0200 CEST m=+1.011742224 10000 metrics could not be sent as v2 MetricPoint
2019-04-03 10:22:00 feed 10kHz -> 15kHz
2019-04-03 10:22:00 getting profile
2019/04/03 10:22:00 [I] starting listener on :6764
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0params: orgs=1, mpo=15000, period=1, flush=1000, offset=0, speedup=1, stopAtNow=false
per org:         each 1s, flushing 15000 metrics so rate of 15000 Hz. (15000 total unique series)
times    1 orgs: each 1s, flushing 15000 metrics so rate of 15000 Hz. (15000 total unique series)
2019-04-03 10:22:01.033584424 +0200 CEST m=+1.013603378 15000 metrics could not be sent as v2 MetricPoint
100 16490    0 16490    0     0   3210      0 --:--:--  0:00:05 --:--:--  4192
~/g/s/g/g/metrictank ❯❯❯ Requests      [total, rate]            9000, 50.01
Duration      [total, attack, wait]    2m59.982225827s, 2m59.979974721s, 2.251106ms
Latencies     [mean, 50, 95, 99, max]  9.018946ms, 11.244317ms, 17.437262ms, 20.039491ms, 41.928506ms
Bytes In      [total, mean]            18000, 2.00
Bytes Out     [total, mean]            0, 0.00
Success       [ratio]                  100.00%
Status Codes  [code:count]             200:9000  
Error Set:
~/g/s/g/g/metrictank ❯❯❯ go tool pprof cpu-v0.11.0-219-g88830e3.prof
File: metrictank
Type: cpu
Time: Apr 3, 2019 at 10:22am (CEST)
Duration: 5.13s, Total samples = 3.58s (69.75%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top20
Showing nodes accounting for 2.94s, 82.12% of 3.58s total
Dropped 134 nodes (cum <= 0.02s)
Showing top 20 nodes out of 181
      flat  flat%   sum%        cum   cum%
     1.10s 30.73% 30.73%      2.34s 65.36%  regexp.(*Regexp).tryBacktrack
     0.70s 19.55% 50.28%      0.70s 19.55%  regexp.(*bitState).shouldVisit
     0.39s 10.89% 61.17%      0.39s 10.89%  regexp.(*inputString).step
     0.14s  3.91% 65.08%      0.15s  4.19%  syscall.Syscall
     0.11s  3.07% 68.16%      0.11s  3.07%  regexp.(*bitState).push
     0.06s  1.68% 69.83%      0.06s  1.68%  runtime.futex
     0.05s  1.40% 71.23%      0.05s  1.40%  runtime.memclrNoHeapPointers
     0.04s  1.12% 72.35%      0.08s  2.23%  github.com/grafana/metrictank/mdata.(*AggMetric).add
     0.04s  1.12% 73.46%      0.04s  1.12%  regexp.(*inputString).context
     0.04s  1.12% 74.58%      0.06s  1.68%  runtime.scanobject
     0.04s  1.12% 75.70%      0.04s  1.12%  runtime.usleep
     0.03s  0.84% 76.54%      0.04s  1.12%  regexp.(*bitState).reset
     0.03s  0.84% 77.37%      0.03s  0.84%  runtime.epollwait
     0.03s  0.84% 78.21%      0.03s  0.84%  runtime.mapaccess2
     0.03s  0.84% 79.05%      0.05s  1.40%  sync.(*Pool).Get
     0.03s  0.84% 79.89%      0.04s  1.12%  sync.(*Pool).pin
     0.02s  0.56% 80.45%      0.02s  0.56%  github.com/grafana/metrictank/vendor/github.com/klauspost/compress/flate.(*compressor).deflateSSE
     0.02s  0.56% 81.01%      2.51s 70.11%  regexp.(*Regexp).doExecute
     0.02s  0.56% 81.56%      0.02s  0.56%  runtime.(*itabTableType).find
     0.02s  0.56% 82.12%      0.02s  0.56%  runtime.findObject
go tool pprof -diff_base cpu-v0.11.0-185-ga7c8203.prof cpu-v0.11.0-219-g88830e3.prof # show the savings in alloc and GC overhead
File: metrictank
Type: cpu
Time: Apr 3, 2019 at 10:22am (CEST)
Duration: 10.25s, Total samples = 4.36s (42.54%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top30
Showing nodes accounting for -0.53s, 12.16% of 4.36s total
Dropped 12 nodes (cum <= 0.02s)
Showing top 30 nodes out of 457
      flat  flat%   sum%        cum   cum%
     0.17s  3.90%  3.90%      0.17s  3.90%  regexp.(*bitState).shouldVisit
    -0.10s  2.29%  1.61%     -0.44s 10.09%  runtime.mallocgc
    -0.10s  2.29%  0.69%     -0.20s  4.59%  runtime.scanobject
    -0.09s  2.06%  2.75%     -0.12s  2.75%  runtime.heapBitsSetType
    -0.08s  1.83%  4.59%     -0.08s  1.83%  runtime.memclrNoHeapPointers
     0.06s  1.38%  3.21%      0.06s  1.38%  regexp.(*inputString).step
    -0.05s  1.15%  4.36%      0.22s  5.05%  regexp.(*Regexp).tryBacktrack
    -0.05s  1.15%  5.50%     -0.11s  2.52%  sync.(*Pool).Put
    -0.05s  1.15%  6.65%     -0.04s  0.92%  syscall.Syscall
    -0.04s  0.92%  7.57%      0.06s  1.38%  regexp.(*Regexp).backtrack
    -0.04s  0.92%  8.49%     -0.07s  1.61%  runtime.findObject
     0.04s  0.92%  7.57%      0.04s  0.92%  runtime.futex
    -0.04s  0.92%  8.49%     -0.04s  0.92%  runtime.procPin
     0.04s  0.92%  7.57%      0.04s  0.92%  runtime.usleep
    -0.04s  0.92%  8.49%     -0.04s  0.92%  strings.IndexAny
     0.03s  0.69%  7.80%      0.06s  1.38%  github.com/grafana/metrictank/mdata.(*AggMetric).add
    -0.03s  0.69%  8.49%     -0.03s  0.69%  runtime.acquirem
    -0.03s  0.69%  9.17%     -0.03s  0.69%  runtime.findrunnable
    -0.03s  0.69%  9.86%     -0.04s  0.92%  runtime.heapBits.next
    -0.03s  0.69% 10.55%     -0.03s  0.69%  runtime.spanOf
     0.03s  0.69%  9.86%      0.03s  0.69%  sync.(*Pool).Get
    -0.02s  0.46% 10.32%     -0.13s  2.98%  regexp.freeBitState
     0.02s  0.46%  9.86%      0.02s  0.46%  runtime.(*itabTableType).find
    -0.02s  0.46% 10.32%     -0.02s  0.46%  runtime.mapaccess2_faststr
    -0.02s  0.46% 10.78%     -0.02s  0.46%  runtime.newdefer
    -0.02s  0.46% 11.24%     -0.02s  0.46%  runtime.nextFreeFast
    -0.02s  0.46% 11.70%     -0.02s  0.46%  runtime.pageIndexOf
    -0.02s  0.46% 12.16%      0.01s  0.23%  runtime.runqgrab
    -0.02s  0.46% 12.61%     -0.02s  0.46%  runtime.scanblock
     0.02s  0.46% 12.16%      0.02s  0.46%  runtime.siftupTimer

conclusion:

  • old code had a larger cpu spike when the new series came in, and did significantly more allocations and GC
  • in prod we saw something more extreme: we spent several cpu cores on invalidation, slowing everything to a halt, with backoff not kicking in. I think the workload here was probably not high enough to reproduce that, but at least we demonstrated that the new backoff system works
  • new code has slightly lower request latency in general, and lower memory idx operation timings

@Dieterbe deleted the findcache-optimizations-treefrompath branch April 3, 2019 17:19
Successfully merging this pull request may close these issues.

findCache performance improvements