
refactor findCache to use fewer resources #1263

Merged

merged 29 commits into master from findcache-optimizations-treefrompath on Apr 2, 2019

Conversation

robert-milan (Contributor) commented Apr 2, 2019

fix #1262

@Dieterbe force-pushed the findcache-optimizations-treefrompath branch 3 times, most recently from 930a1db to 83b49de on April 2, 2019 10:24
@Dieterbe changed the title from "WIP: Add methods for queue processing" to "refactor findCache to use fewer resources" Apr 2, 2019
@Dieterbe changed the title from "refactor findCache to use fewer resources" to "WIP: refactor findCache to use fewer resources" Apr 2, 2019
@Dieterbe force-pushed the findcache-optimizations-treefrompath branch from 49bbfa6 to a41534c on April 2, 2019 11:19
@Dieterbe force-pushed the findcache-optimizations-treefrompath branch from a41534c to 460eba5 on April 2, 2019 11:23
@Dieterbe force-pushed the findcache-optimizations-treefrompath branch from 460eba5 to ea96728 on April 2, 2019 12:15
robert-milan and others added 3 commits April 2, 2019 15:51
having it per-tenant means we need tenant-aware invalidation processing,
which would just be too complicated (e.g. different batches for
different tenants, how do you prioritize smartly without starving, etc.)

We can re-assess later.
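
To illustrate the trade-off: a single shared queue lets one loop batch invalidations for all orgs, with no tenant-aware batching or prioritization logic at all. A minimal Go sketch of that shape (the names, types, and batching policy here are illustrative assumptions, not this PR's code):

package memory

import "time"

// invalidateRequest asks the find cache to drop entries matching a path.
// All orgs share one queue, so the processor needs no per-tenant logic.
type invalidateRequest struct {
	orgId uint32
	path  string
}

// processInvalidations drains the queue in batches: whichever requests
// arrive first get processed first, so no tenant can starve another.
func processInvalidations(queue <-chan invalidateRequest, maxBatch int, flushEvery time.Duration) {
	ticker := time.NewTicker(flushEvery)
	defer ticker.Stop()
	batch := make([]invalidateRequest, 0, maxBatch)
	flush := func() {
		if len(batch) == 0 {
			return
		}
		// the real code would invalidate the cached patterns this batch touches
		batch = batch[:0]
	}
	for {
		select {
		case req, ok := <-queue:
			if !ok {
				flush() // queue closed: process what's left and stop
				return
			}
			batch = append(batch, req)
			if len(batch) == maxBatch {
				flush()
			}
		case <-ticker.C:
			flush() // time-based flush so a quiet queue still drains
		}
	}
}
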
@Dieterbe changed the title from "WIP: refactor findCache to use fewer resources" to "refactor findCache to use fewer resources" Apr 2, 2019
@Dieterbe requested review from woodsaj and replay April 2, 2019 15:53
Dieterbe added 2 commits April 2, 2019 18:45
For more info see smartystreets/goconvey#497
TLDR: without this, I was getting unit test failures like so:

0 total assertions

--- FAIL: TestFindCache (0.00s)
panic: Top-level calls to Convey(...) need a reference to the *testing.T.
		Hint: Convey("description here", t, func() { /* notice that the second argument was the *testing.T (t)! */ })  [recovered]
	panic: Top-level calls to Convey(...) need a reference to the *testing.T.
		Hint: Convey("description here", t, func() { /* notice that the second argument was the *testing.T (t)! */ })  [recovered]
	panic: Top-level calls to Convey(...) need a reference to the *testing.T.
		Hint: Convey("description here", t, func() { /* notice that the second argument was the *testing.T (t)! */ })

goroutine 11 [running]:
testing.tRunner.func1(0xc0000de400)
	/usr/lib/go/src/testing/testing.go:830 +0x392
panic(0xa28be0, 0xc000129290)
	/usr/lib/go/src/runtime/panic.go:522 +0x1b5
github.com/grafana/metrictank/vendor/github.com/smartystreets/goconvey/convey.(*context).conveyInner.func2(0xc00007cd80)
	/home/dieter/go/src/github.com/grafana/metrictank/vendor/github.com/smartystreets/goconvey/convey/context.go:232 +0x1c6
panic(0xa28be0, 0xc000129290)
	/usr/lib/go/src/runtime/panic.go:522 +0x1b5
github.com/grafana/metrictank/vendor/github.com/smartystreets/goconvey/convey.conveyPanic(...)
	/home/dieter/go/src/github.com/grafana/metrictank/vendor/github.com/smartystreets/goconvey/convey/context.go:20
github.com/grafana/metrictank/vendor/github.com/smartystreets/goconvey/convey.rootConvey(0xc000067cb8, 0x2, 0x2)
	/home/dieter/go/src/github.com/grafana/metrictank/vendor/github.com/smartystreets/goconvey/convey/context.go:91 +0x35c
github.com/grafana/metrictank/vendor/github.com/smartystreets/goconvey/convey.Convey(0xc000067cb8, 0x2, 0x2)
	/home/dieter/go/src/github.com/grafana/metrictank/vendor/github.com/smartystreets/goconvey/convey/doc.go:75 +0xbf
github.com/grafana/metrictank/idx/memory.TestFindCache.func1()
	/home/dieter/go/src/github.com/grafana/metrictank/idx/memory/find_cache_test.go:88 +0xe7
github.com/grafana/metrictank/vendor/github.com/smartystreets/goconvey/convey.parseAction.func1(0xbce120, 0xc00007cd80)
	/home/dieter/go/src/github.com/grafana/metrictank/vendor/github.com/smartystreets/goconvey/convey/discovery.go:80 +0x24
github.com/grafana/metrictank/vendor/github.com/smartystreets/goconvey/convey.(*context).conveyInner(0xc00007cd80, 0xae977f, 0x17, 0xc00012cb20)
	/home/dieter/go/src/github.com/grafana/metrictank/vendor/github.com/smartystreets/goconvey/convey/context.go:261 +0x154
github.com/grafana/metrictank/vendor/github.com/smartystreets/goconvey/convey.rootConvey.func1()
	/home/dieter/go/src/github.com/grafana/metrictank/vendor/github.com/smartystreets/goconvey/convey/context.go:110 +0xf1
github.com/grafana/metrictank/vendor/github.com/jtolds/gls._m(0x0, 0xc00000ff20)
	/home/dieter/go/src/github.com/grafana/metrictank/vendor/github.com/jtolds/gls/stack_tags.go:39 +0x31
github.com/grafana/metrictank/vendor/github.com/jtolds/gls.markS(...)
	/home/dieter/go/src/github.com/grafana/metrictank/vendor/github.com/jtolds/gls/stack_tags.go:16
github.com/grafana/metrictank/vendor/github.com/jtolds/gls.addStackTag(...)
	/home/dieter/go/src/github.com/grafana/metrictank/vendor/github.com/jtolds/gls/stack_tags.go:13
github.com/grafana/metrictank/vendor/github.com/jtolds/gls.(*ContextManager).SetValues(0xc00000fd80, 0xc0001291d0, 0xc00000ff20)
	/home/dieter/go/src/github.com/grafana/metrictank/vendor/github.com/jtolds/gls/context.go:92 +0x3fd
github.com/grafana/metrictank/vendor/github.com/smartystreets/goconvey/convey.rootConvey(0xc000067f68, 0x3, 0x3)
	/home/dieter/go/src/github.com/grafana/metrictank/vendor/github.com/smartystreets/goconvey/convey/context.go:105 +0x22b
github.com/grafana/metrictank/vendor/github.com/smartystreets/goconvey/convey.Convey(0xc000067f68, 0x3, 0x3)
	/home/dieter/go/src/github.com/grafana/metrictank/vendor/github.com/smartystreets/goconvey/convey/doc.go:75 +0xbf
github.com/grafana/metrictank/idx/memory.TestFindCache(0xc0000de400)
	/home/dieter/go/src/github.com/grafana/metrictank/idx/memory/find_cache_test.go:86 +0x99
testing.tRunner(0xc0000de400, 0xb05e78)
	/usr/lib/go/src/testing/testing.go:865 +0xc0
created by testing.(*T).Run
	/usr/lib/go/src/testing/testing.go:916 +0x35a
exit status 2
FAIL	github.com/grafana/metrictank/idx/memory	0.008s
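
(The hint in the panic shows the fix. A minimal illustrative test, not the actual find_cache_test.go: any Convey call that goconvey cannot trace back to an enclosing Convey, e.g. one reached after the goroutine-local context is lost, counts as top-level and must be passed the *testing.T.)

package memory

import (
	"testing"

	. "github.com/smartystreets/goconvey/convey"
)

// TestConveySketch is illustrative only: passing t as the second argument
// marks this Convey as a root scope, which is what the panic asks for.
func TestConveySketch(t *testing.T) {
	Convey("a top-level scope needs the *testing.T", t, func() {
		So(1+1, ShouldEqual, 2)
	})
}
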
Dieterbe added 2 commits April 2, 2019 18:45
should delete all orgs + make callable from test
@Dieterbe force-pushed the findcache-optimizations-treefrompath branch from 3d343ab to d59a773 on April 2, 2019 16:45
@Dieterbe force-pushed the findcache-optimizations-treefrompath branch from 235324f to 569d24e on April 2, 2019 18:33
@@ -69,22 +96,26 @@ func (c *FindCache) Get(orgId uint32, pattern string) ([]*Node, bool) {
func (c *FindCache) Add(orgId uint32, pattern string, nodes []*Node) {
	c.RLock()
	cache, ok := c.cache[orgId]
	t := c.backoff[orgId]
	backoff := findCacheBackoff.Peek()
Member

I like this approach, but findCacheBackoff can't be a global; it needs to be part of the FindCache struct.

Member

actually, the findCacheBackoff metric doesn't work at all with the partitioned index, as there are multiple findCaches that could all have different states.

Contributor

why? to make it clear that no other code in the package should mess with it? if that's the reason, we could solve it with a better comment

Member

I would just add a "backoff" field to the FindCache struct. Then in triggerBackoff() I would:

time.AfterFunc(c.backoffTime, func() {
	findCacheBackoff.SetFalse()
	c.Lock()
	c.backoff = false
	c.Unlock()
})

Member

> why, to make it clear that no other code in the package should mess with it

It is just a bad idea to mix per-instance struct state with global variables. You can have multiple instances of a findCache at any one time, so controlling how the instance behaves through a global variable is just a recipe for disaster.

Contributor

> actually, the findCacheBackoff metric doesn't work at all with the partitioned index, as there are multiple findCaches that could all have different states.

Hmm, I didn't consider this. Good catch. Will follow your suggestion.

Contributor

actually to accommodate PartitionedIndex, we can't use a backoff bool either. maybe it should be a gauge that is incremented and decremented.
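
A sketch of where this thread landed: per-instance backoff state plus a shared counter-style metric that each FindCache increments when entering backoff and decrements when leaving, so multiple instances under the partitioned index don't clobber each other. The field names, the plain int64 stand-in for the real stats gauge, and the dedup check are assumptions for illustration, not the merged code:

package memory

import (
	"sync"
	"sync/atomic"
	"time"
)

// stand-in for the real stats gauge: counts how many FindCache instances
// are currently in backoff mode (0 means none are backing off)
var findCacheBackoff int64

type FindCache struct {
	sync.RWMutex
	backoff     bool          // is this instance currently backing off?
	backoffTime time.Duration // how long to stay in backoff mode
	// ... per-org caches etc.
}

// triggerBackoff puts this instance into backoff mode, bumps the shared
// gauge, and schedules the automatic recovery.
func (c *FindCache) triggerBackoff() {
	c.Lock()
	if c.backoff {
		c.Unlock()
		return // already backing off; don't stack a second timer
	}
	c.backoff = true
	c.Unlock()
	atomic.AddInt64(&findCacheBackoff, 1)
	time.AfterFunc(c.backoffTime, func() {
		atomic.AddInt64(&findCacheBackoff, -1)
		c.Lock()
		c.backoff = false
		c.Unlock()
	})
}
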

@Dieterbe force-pushed the findcache-optimizations-treefrompath branch from 6d19e6e to 7aa469a on April 2, 2019 19:30
@woodsaj (Member) left a comment

LGTM

@Dieterbe merged commit 1d55764 into master Apr 2, 2019
@Dieterbe (Contributor) commented Apr 3, 2019

I did an experiment with the docker stack to confirm the improvements (also with #1265 applied).
I ran some scripts that generate an ingest workload, then start a background HTTP request load, making sure to populate the findCache with patterns that will later need to be revisited (and invalidated): we then increase the ingestion rate, adding more metrics whose names match what we already cached. Of particular interest is what happens right when the new metrics come in and the cache needs to invalidate. I collected snapshots and CPU profiles at those times.

Did it like so:

git diff # to make sure we're not affected by chunk saving intervals etc, as long as we don't pass 6h intervals at any point during the tests
diff --git a/docker/docker-dev-custom-cfg-kafka/storage-schemas.conf b/docker/docker-dev-custom-cfg-kafka/storage-schemas.conf
index 2d3083b5..54f0f4b2 100644
--- a/docker/docker-dev-custom-cfg-kafka/storage-schemas.conf
+++ b/docker/docker-dev-custom-cfg-kafka/storage-schemas.conf
@@ -50,11 +50,7 @@
 # This example has 3 retention definitions, the first and last override some default options (to use 10minutely and 2hourly chunks and only keep one of them in memory
 # and the last rollup is marked as not ready yet for querying.
 
-[raw]
-pattern = fakemetrics.raw
-retentions = 1s:6h:2min:2
-
 [default]
 pattern = .*
-retentions = 1s:6h:2min:2,1min:35d:6h:1
+retentions = 1s:6h:6h:2
 # reorderBuffer = 20

run make and then start these scripts at the same time:

~/g/s/g/g/metrictank ❯❯❯ cat test-index-add-cpu-launch.sh
#!/bin/bash

source util.sh
source scripts/version-tag.sh

wait_time 120
docker-compose -f docker/docker-dev-custom-cfg-kafka/docker-compose.yml up --force-recreate -V
~/g/s/g/g/metrictank ❯❯❯ cat test-index-add-cpu-tests.sh
#!/bin/bash

source util.sh
source scripts/version-tag.sh

# does queries such that the find-cache will be full with patterns that will later become invalidated and pruned once we introduce more metrics
renderWorkload () {
     num=1000
     for i in $(seq 1 $num); do
     	echo "GET http://localhost:6060/render?target=some.id.of.a.metric.*$i&from=-5min"
     	echo "X-Org-Id: 1"
     	echo
     done | vegeta attack -duration 180s | vegeta report
}

echo "killing any fakemetrics currently running"
pkill -f fakemetrics

wait_time 120
sleep 120
log "feed 10kHz"
fakemetrics feed --kafka-mdm-addr localhost:9092 --period 1s --mpo 10000 &
pid=$!

sleep 60

# needs at least 20s to populate index cache
renderWorkload &

sleep 60

kill $pid
pgrep -a fakemetrics
log "feed 10kHz -> 15kHz"
fakemetrics feed --kafka-mdm-addr localhost:9092 --period 1s --mpo 15000 &
log "getting profile"
curl 'http://localhost:6060/debug/pprof/profile?seconds=5' --output cpu-$version.prof

~/g/s/g/g/metrictank ❯❯❯ cat util.sh
#!/bin/bash

log () {
  echo "$(date +'%F %H:%M:%S') $1"
}

# converts a number of ns to a specification for `sleep`
ns_to_sleep() {
	local ns=$1
	printf "%010d\n" $ns | sed 's/\(.\{9\}\)$/.\1/'
}

# sleep until the next round timestamp (snap), plus an optional offset
# e.g. wait_time 60 7 will sleep until 7 seconds after the next whole minute
# obviously this isn't accurate to the ns, due to overhead of executing commands, etc
# on Dieter's laptop, there is typically 5.7 to 5.8 millis of overhead, hence the compensation
wait_time() {
	local snap=$(($1 * 1000000000))
	local offset=$((${2:-0} * 1000000000))
	now=$(date +%s%N)
	sleep_ns=$(($snap - (($now + 5700000 - $offset) % $snap)))
	sleep $(ns_to_sleep $sleep_ns)
}

v0.11.0-185-ga7c8203d

https://snapshot.raintank.io/dashboard/snapshot/esH0WpwdEXf3RUd2konLB21B4NOgRVVe?orgId=2

killing any fakemetrics currently running
2019-04-03 10:54:00 feed 10kHz
2019/04/03 10:54:00 [I] starting listener on :6764
params: orgs=1, mpo=10000, period=1, flush=1000, offset=0, speedup=1, stopAtNow=false
per org:         each 1s, flushing 10000 metrics so rate of 10000 Hz. (10000 total unique series)
times    1 orgs: each 1s, flushing 10000 metrics so rate of 10000 Hz. (10000 total unique series)
2019-04-03 10:54:01.016186533 +0200 CEST m=+1.011586194 10000 metrics could not be sent as v2 MetricPoint
2019-04-03 10:56:00 feed 10kHz -> 15kHz
2019-04-03 10:56:00 getting profile
2019/04/03 10:56:00 [I] starting listener on :6764
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0params: orgs=1, mpo=15000, period=1, flush=1000, offset=0, speedup=1, stopAtNow=false
per org:         each 1s, flushing 15000 metrics so rate of 15000 Hz. (15000 total unique series)
times    1 orgs: each 1s, flushing 15000 metrics so rate of 15000 Hz. (15000 total unique series)
2019-04-03 10:56:01.029251223 +0200 CEST m=+1.010394697 15000 metrics could not be sent as v2 MetricPoint
100 22377    0 22377    0     0   4371      0 --:--:--  0:00:05 --:--:--  5708
~/g/s/g/g/metrictank ❯❯❯ Requests      [total, rate]            9000, 50.01
Duration      [total, attack, wait]    2m59.982985565s, 2m59.980043196s, 2.942369ms
Latencies     [mean, 50, 95, 99, max]  9.225303ms, 11.215829ms, 17.353998ms, 20.158164ms, 279.760029ms
Bytes In      [total, mean]            18000, 2.00
Bytes Out     [total, mean]            0, 0.00
Success       [ratio]                  100.00%
Status Codes  [code:count]             200:9000  
Error Set:
~/g/s/g/g/metrictank ❯❯❯ go tool pprof cpu-v0.11.0-185-ga7c8203.prof
File: metrictank
Type: cpu
Time: Apr 3, 2019 at 10:56am (CEST)
Duration: 5.12s, Total samples = 4.36s (85.24%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top20
Showing nodes accounting for 3.22s, 73.85% of 4.36s total
Dropped 219 nodes (cum <= 0.02s)
Showing top 20 nodes out of 172
      flat  flat%   sum%        cum   cum%
     1.15s 26.38% 26.38%      2.12s 48.62%  regexp.(*Regexp).tryBacktrack
     0.53s 12.16% 38.53%      0.53s 12.16%  regexp.(*bitState).shouldVisit
     0.33s  7.57% 46.10%      0.33s  7.57%  regexp.(*inputString).step
     0.19s  4.36% 50.46%      0.19s  4.36%  syscall.Syscall
     0.14s  3.21% 53.67%      0.26s  5.96%  runtime.scanobject
     0.13s  2.98% 56.65%      0.13s  2.98%  runtime.memclrNoHeapPointers
     0.12s  2.75% 59.40%      0.54s 12.39%  runtime.mallocgc
     0.10s  2.29% 61.70%      0.13s  2.98%  runtime.heapBitsSetType
     0.08s  1.83% 63.53%      0.11s  2.52%  regexp.(*bitState).push
     0.06s  1.38% 64.91%      0.09s  2.06%  runtime.findObject
     0.06s  1.38% 66.28%      0.14s  3.21%  sync.(*Pool).Put
     0.05s  1.15% 67.43%      2.43s 55.73%  regexp.(*Regexp).backtrack
     0.05s  1.15% 68.58%      0.05s  1.15%  runtime.procPin
     0.04s  0.92% 69.50%      0.04s  0.92%  runtime.nextFreeFast
     0.04s  0.92% 70.41%      0.04s  0.92%  strings.IndexAny
     0.03s  0.69% 71.10%      0.04s  0.92%  github.com/grafana/metrictank/vendor/github.com/klauspost/compress/flate.(*compressor).deflateSSE
     0.03s  0.69% 71.79%      2.46s 56.42%  regexp.(*Regexp).doExecute
     0.03s  0.69% 72.48%      0.03s  0.69%  regexp.(*inputString).context
     0.03s  0.69% 73.17%      0.03s  0.69%  runtime.acquirem
     0.03s  0.69% 73.85%      0.12s  2.75%  runtime.findrunnable
(pprof) 

v0.11.0-219-g88830e3

https://snapshot.raintank.io/dashboard/snapshot/aeX0vbOIQswQN6tycV2Xl7S9oKx9l8m8?orgId=2

~/g/s/g/g/metrictank ❯❯❯ ./test-index-add-cpu-tests.sh
make sure no fakemetrics currently running
30685 fakemetrics feed --kafka-mdm-addr localhost:9092 --period 1s --mpo 15000
2019-04-03 10:20:00 feed 10kHz
2019/04/03 10:20:00 [I] starting listener on :6764
params: orgs=1, mpo=10000, period=1, flush=1000, offset=0, speedup=1, stopAtNow=false
per org:         each 1s, flushing 10000 metrics so rate of 10000 Hz. (10000 total unique series)
times    1 orgs: each 1s, flushing 10000 metrics so rate of 10000 Hz. (10000 total unique series)
2019-04-03 10:20:01.014758663 +0200 CEST m=+1.011742224 10000 metrics could not be sent as v2 MetricPoint
2019-04-03 10:22:00 feed 10kHz -> 15kHz
2019-04-03 10:22:00 getting profile
2019/04/03 10:22:00 [I] starting listener on :6764
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0params: orgs=1, mpo=15000, period=1, flush=1000, offset=0, speedup=1, stopAtNow=false
per org:         each 1s, flushing 15000 metrics so rate of 15000 Hz. (15000 total unique series)
times    1 orgs: each 1s, flushing 15000 metrics so rate of 15000 Hz. (15000 total unique series)
2019-04-03 10:22:01.033584424 +0200 CEST m=+1.013603378 15000 metrics could not be sent as v2 MetricPoint
100 16490    0 16490    0     0   3210      0 --:--:--  0:00:05 --:--:--  4192
~/g/s/g/g/metrictank ❯❯❯ Requests      [total, rate]            9000, 50.01
Duration      [total, attack, wait]    2m59.982225827s, 2m59.979974721s, 2.251106ms
Latencies     [mean, 50, 95, 99, max]  9.018946ms, 11.244317ms, 17.437262ms, 20.039491ms, 41.928506ms
Bytes In      [total, mean]            18000, 2.00
Bytes Out     [total, mean]            0, 0.00
Success       [ratio]                  100.00%
Status Codes  [code:count]             200:9000  
Error Set:
~/g/s/g/g/metrictank ❯❯❯ go tool pprof cpu-v0.11.0-219-g88830e3.prof
File: metrictank
Type: cpu
Time: Apr 3, 2019 at 10:22am (CEST)
Duration: 5.13s, Total samples = 3.58s (69.75%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top20
Showing nodes accounting for 2.94s, 82.12% of 3.58s total
Dropped 134 nodes (cum <= 0.02s)
Showing top 20 nodes out of 181
      flat  flat%   sum%        cum   cum%
     1.10s 30.73% 30.73%      2.34s 65.36%  regexp.(*Regexp).tryBacktrack
     0.70s 19.55% 50.28%      0.70s 19.55%  regexp.(*bitState).shouldVisit
     0.39s 10.89% 61.17%      0.39s 10.89%  regexp.(*inputString).step
     0.14s  3.91% 65.08%      0.15s  4.19%  syscall.Syscall
     0.11s  3.07% 68.16%      0.11s  3.07%  regexp.(*bitState).push
     0.06s  1.68% 69.83%      0.06s  1.68%  runtime.futex
     0.05s  1.40% 71.23%      0.05s  1.40%  runtime.memclrNoHeapPointers
     0.04s  1.12% 72.35%      0.08s  2.23%  github.com/grafana/metrictank/mdata.(*AggMetric).add
     0.04s  1.12% 73.46%      0.04s  1.12%  regexp.(*inputString).context
     0.04s  1.12% 74.58%      0.06s  1.68%  runtime.scanobject
     0.04s  1.12% 75.70%      0.04s  1.12%  runtime.usleep
     0.03s  0.84% 76.54%      0.04s  1.12%  regexp.(*bitState).reset
     0.03s  0.84% 77.37%      0.03s  0.84%  runtime.epollwait
     0.03s  0.84% 78.21%      0.03s  0.84%  runtime.mapaccess2
     0.03s  0.84% 79.05%      0.05s  1.40%  sync.(*Pool).Get
     0.03s  0.84% 79.89%      0.04s  1.12%  sync.(*Pool).pin
     0.02s  0.56% 80.45%      0.02s  0.56%  github.com/grafana/metrictank/vendor/github.com/klauspost/compress/flate.(*compressor).deflateSSE
     0.02s  0.56% 81.01%      2.51s 70.11%  regexp.(*Regexp).doExecute
     0.02s  0.56% 81.56%      0.02s  0.56%  runtime.(*itabTableType).find
     0.02s  0.56% 82.12%      0.02s  0.56%  runtime.findObject
go tool pprof -diff_base cpu-v0.11.0-185-ga7c8203.prof cpu-v0.11.0-219-g88830e3.prof # show the savings in alloc and GC overhead
File: metrictank
Type: cpu
Time: Apr 3, 2019 at 10:22am (CEST)
Duration: 10.25s, Total samples = 4.36s (42.54%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top30
Showing nodes accounting for -0.53s, 12.16% of 4.36s total
Dropped 12 nodes (cum <= 0.02s)
Showing top 30 nodes out of 457
      flat  flat%   sum%        cum   cum%
     0.17s  3.90%  3.90%      0.17s  3.90%  regexp.(*bitState).shouldVisit
    -0.10s  2.29%  1.61%     -0.44s 10.09%  runtime.mallocgc
    -0.10s  2.29%  0.69%     -0.20s  4.59%  runtime.scanobject
    -0.09s  2.06%  2.75%     -0.12s  2.75%  runtime.heapBitsSetType
    -0.08s  1.83%  4.59%     -0.08s  1.83%  runtime.memclrNoHeapPointers
     0.06s  1.38%  3.21%      0.06s  1.38%  regexp.(*inputString).step
    -0.05s  1.15%  4.36%      0.22s  5.05%  regexp.(*Regexp).tryBacktrack
    -0.05s  1.15%  5.50%     -0.11s  2.52%  sync.(*Pool).Put
    -0.05s  1.15%  6.65%     -0.04s  0.92%  syscall.Syscall
    -0.04s  0.92%  7.57%      0.06s  1.38%  regexp.(*Regexp).backtrack
    -0.04s  0.92%  8.49%     -0.07s  1.61%  runtime.findObject
     0.04s  0.92%  7.57%      0.04s  0.92%  runtime.futex
    -0.04s  0.92%  8.49%     -0.04s  0.92%  runtime.procPin
     0.04s  0.92%  7.57%      0.04s  0.92%  runtime.usleep
    -0.04s  0.92%  8.49%     -0.04s  0.92%  strings.IndexAny
     0.03s  0.69%  7.80%      0.06s  1.38%  github.com/grafana/metrictank/mdata.(*AggMetric).add
    -0.03s  0.69%  8.49%     -0.03s  0.69%  runtime.acquirem
    -0.03s  0.69%  9.17%     -0.03s  0.69%  runtime.findrunnable
    -0.03s  0.69%  9.86%     -0.04s  0.92%  runtime.heapBits.next
    -0.03s  0.69% 10.55%     -0.03s  0.69%  runtime.spanOf
     0.03s  0.69%  9.86%      0.03s  0.69%  sync.(*Pool).Get
    -0.02s  0.46% 10.32%     -0.13s  2.98%  regexp.freeBitState
     0.02s  0.46%  9.86%      0.02s  0.46%  runtime.(*itabTableType).find
    -0.02s  0.46% 10.32%     -0.02s  0.46%  runtime.mapaccess2_faststr
    -0.02s  0.46% 10.78%     -0.02s  0.46%  runtime.newdefer
    -0.02s  0.46% 11.24%     -0.02s  0.46%  runtime.nextFreeFast
    -0.02s  0.46% 11.70%     -0.02s  0.46%  runtime.pageIndexOf
    -0.02s  0.46% 12.16%      0.01s  0.23%  runtime.runqgrab
    -0.02s  0.46% 12.61%     -0.02s  0.46%  runtime.scanblock
     0.02s  0.46% 12.16%      0.02s  0.46%  runtime.siftupTimer

conclusion:

  • old code had a larger cpu spike when the new series came in, and did significantly more allocations and GC
  • in prod we saw something more extreme: we spent several cpu cores on invalidation, slowing everything to a halt, with backoff not kicking in. I think the workload here was probably not high enough to reproduce that, but at least we demonstrated that the new backoff system works
  • new code has slightly lower request latency in general, and lower memory idx operation timings

@Dieterbe deleted the findcache-optimizations-treefrompath branch April 3, 2019 17:19
Successfully merging this pull request may close these issues.

findCache performance improvements