Fatal error: concurrent map iteration and map write #8989

Closed
anatolijd opened this issue Oct 20, 2017 · 3 comments · Fixed by #9009

@anatolijd

I'm seeing the same problem described in #8633.
The symptoms are very similar:

  • a massive number of POST /write?db=telegraf updates from newly scaled-out instances (and new tag values),
  • a number of SHOW TAG VALUES ON telegraf WITH KEY queries (from a Grafana dashboard),
  • a fatal error.

This is very worrying, because it means the crash will happen even more often as more people in our company start using Grafana dashboards on our InfluxDB data.

Is this a bug? Is there a way to prevent it from happening?
Please help us solve it.

Our setup:
InfluxDB cluster with 2 data nodes (8 cores, 64 GB each).

Version: 1.3.6-c1.3.6
cache-max-memory-size = 4194304000
cache-snapshot-memory-size = 26214400
index-version = "inmem"
max-concurrent-compactions = 3

telegraf database stats:

Size on disk: 40G
Measurements: 23
Series: 225647447

short log:

[httpd] 172.20.99.29 - - [19/Oct/2017:13:23:04 +0000] "POST /write?db=telegraf HTTP/1.1" 204 0 "-" "-" a3528157-b4d0-11e7-bee5-000000000000 4813
[httpd] 172.20.99.29 - - [19/Oct/2017:13:23:04 +0000] "POST /write?db=telegraf HTTP/1.1" 204 0 "-" "-" a352b259-b4d0-11e7-bee6-000000000000 3576
[httpd] 172.20.99.29 - - [19/Oct/2017:13:23:04 +0000] "POST /write?db=telegraf HTTP/1.1" 204 0 "-" "-" a351bc5b-b4d0-11e7-bee4-000000000000 9859
[httpd] 172.20.99.29 - - [19/Oct/2017:13:23:04 +0000] "POST /write?db=telegraf HTTP/1.1" 204 0 "-" "-" a352f3b9-b4d0-11e7-bee7-000000000000 4282
[httpd] 172.20.99.29 - - [19/Oct/2017:13:23:04 +0000] "POST /write?db=telegraf HTTP/1.1" 204 0 "-" "-" a353e2b1-b4d0-11e7-bee8-000000000000 2898
[httpd] 172.20.99.29 - - [19/Oct/2017:13:23:04 +0000] "POST /write?db=telegraf HTTP/1.1" 204 0 "-" "-" a353f4e2-b4d0-11e7-bee9-000000000000 4884
[httpd] 172.20.99.29 - - [19/Oct/2017:13:23:04 +0000] "POST /write?db=telegraf HTTP/1.1" 204 0 "-" "-" a353ff0d-b4d0-11e7-beea-000000000000 4655
[I] 2017-10-19T13:23:04Z SHOW TAG VALUES ON telegraf WITH KEY = interface WHERE (_name = 'net') AND ((host =~ /^fe-i-07a1b5f8e3e666075.dc-eu.ragnarok.net$/) AND (_tagKey = 'interface')) service=query
[I] 2017-10-19T13:23:04Z SHOW TAG VALUES ON telegraf WITH KEY = path WHERE (_name = 'disk') AND ((host =~ /^fe-i-07a1b5f8e3e666075.dc-eu.ragnarok.net$/) AND (_tagKey = 'path')) service=query
[I] 2017-10-19T13:23:04Z SHOW TAG VALUES ON telegraf WITH KEY = process_name WHERE (_name = 'procstat') AND ((host =~ /^fe-i-07a1b5f8e3e666075.dc-eu.ragnarok.net$/) AND (_tagKey = 'process_name')) LIMIT 30 service=query
[I] 2017-10-19T13:23:04Z SELECT mean(used) FROM telegraf."default".mem WHERE host =~ /^fe-i-07a1b5f8e3e666075.dc-eu.ragnarok.net$/ AND time > now() - 6h GROUP BY time(10m) service=query
[I] 2017-10-19T13:23:04Z SHOW TAG VALUES ON telegraf WITH KEY = "name" WHERE (_name = 'diskio') AND ((host =~ /^fe-i-07a1b5f8e3e666075.dc-eu.ragnarok.net$/) AND (_tagKey = 'name')) service=query
[I] 2017-10-19T13:23:04Z SELECT mean(usage_user) FROM telegraf."default".cpu WHERE host =~ /^fe-i-07a1b5f8e3e666075.dc-eu.ragnarok.net$/ AND cpu = 'cpu-total' AND time > now() - 6h GROUP BY time(10m) service=query
[I] 2017-10-19T13:23:04Z SELECT mean(cached) FROM telegraf."default".mem WHERE host =~ /^fe-i-07a1b5f8e3e666075.dc-eu.ragnarok.net$/ AND time > now() - 6h GROUP BY time(10m) service=query
[I] 2017-10-19T13:23:04Z SELECT mean(buffered) FROM telegraf."default".mem WHERE host =~ /^fe-i-07a1b5f8e3e666075.dc-eu.ragnarok.net$/ AND time > now() - 6h GROUP BY time(10m) service=query
[httpd] 172.20.99.29 - - [19/Oct/2017:13:23:04 +0000] "POST /write?db=telegraf HTTP/1.1" 204 0 "-" "-" a3559cdd-b4d0-11e7-beed-000000000000 9285
[httpd] 172.20.99.29 - - [19/Oct/2017:13:23:04 +0000] "POST /write?db=telegraf HTTP/1.1" 204 0 "-" "-" a354352f-b4d0-11e7-beeb-000000000000 18601
[I] 2017-10-19T13:23:04Z SELECT mean(free) FROM telegraf."default".mem WHERE host =~ /^fe-i-07a1b5f8e3e666075.dc-eu.ragnarok.net$/ AND time > now() - 6h GROUP BY time(10m) service=query
[httpd] 172.21.1.6, 127.0.0.1,172.20.99.29 - - [19/Oct/2017:13:23:04 +0000] "GET /query?db=telegraf&epoch=ms&q=SELECT+mean%28%22used%22%29+FROM+%22default%22.%22mem%22+WHERE+%22host%22+%3D~+%2F%5Efe-i-07a1b5f8e3e666075%5C.dc-eu.ragnarok%5C.net%24%2F+AND+time+%3E+now%28%29+-+6h+GROUP+BY+time%2810m%29+fill%28null%29%3BSELECT+mean%28%22cached%22%29+FROM+%22default%22.%22mem%22+WHERE+%22host%22+%3D~+%2F%5Efe-i-07a1b5f8e3e666075%5C.dc-eu.ragnarok%5C.net%24%2F+AND+time+%3E+now%28%29+-+6h+GROUP+BY+time%2810m%29+fill%28null%29%3BSELECT+mean%28%22buffered%22%29+FROM+%22default%22.%22mem%22+WHERE+%22host%22+%3D~+%2F%5Efe-i-07a1b5f8e3e666075%5C.dc-eu.ragnarok%5C.net%24%2F+AND+time+%3E+now%28%29+-+6h+GROUP+BY+time%2810m%29+fill%28null%29%3BSELECT+mean%28%22free%22%29+FROM+%22default%22.%22mem%22+WHERE+%22host%22+%3D~+%2F%5Efe-i-07a1b5f8e3e666075%5C.dc-eu.ragnarok%5C.net%24%2F+AND+time+%3E+now%28%29+-+6h+GROUP+BY+time%2810m%29+fill%28null%29 HTTP/1.1" 200 1353 "http://grafana.dc-eu.ragnarok.com/dashboard/db/instance?orgId=1&refresh=1m&from=now-6h&to=now&var-host=fe-i-07a1b5f8e3e666075.dc-eu.ragnarok.net&var-by_time=10m&var-r=default&var-iface=All&var-process=All&var-disk=All&var-path=All" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36" a3563ac1-b4d0-11e7-bef0-000000000000 8487

fatal error: concurrent map iteration and map write

goroutine 41340066 [running]:
runtime.throw(0xcdfd72, 0x26)
/usr/local/go/src/runtime/panic.go:596 +0x95 fp=0xc6a24611f0 sp=0xc6a24611d0
runtime.mapiternext(0xc6a2461560)
/usr/local/go/src/runtime/hashmap.go:737 +0x7ee fp=0xc6a24612a0 sp=0xc6a24611f0
github.com/influxdata/influxdb/tsdb/index/inmem.(*Measurement).idsForExpr(0xc458ef6480, 0xc47cf573b0, 0x0, 0xc47c939230, 0x8, 0x18, 0xc58f9b7540, 0x0, 0xc42037e800)

Log with the full stack trace is attached:

influxd.log.fatal.gz
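
For context on what this runtime error means: Go's runtime aborts the whole process when it detects one goroutine iterating a plain map while another goroutine writes to it. A minimal, self-contained sketch of the same class of failure (illustrative only, not InfluxDB code) would look like this:

```go
package main

import "time"

func main() {
	// A shared map with no synchronization, loosely analogous to the
	// inmem index maps that idsForExpr walks during SHOW TAG VALUES.
	tagValues := map[string]int{"eth0": 1, "eth1": 2}

	// Writer goroutine: keeps adding new keys, like a burst of
	// POST /write requests introducing new tag values.
	go func() {
		for i := 0; ; i++ {
			tagValues[time.Now().String()] = i
		}
	}()

	// Reader: ranges over the map, like a query walking the index.
	// Sooner or later the runtime throws
	// "fatal error: concurrent map iteration and map write" and exits.
	for {
		for range tagValues {
		}
	}
}
```

Unlike a data race caught by the race detector, this check is always enabled in the runtime, so a single write/query collision in the index is enough to take the whole influxd process down.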

@e-dard e-dard added this to the 1.4.0 milestone Oct 20, 2017
@e-dard e-dard self-assigned this Oct 20, 2017
@choffee

choffee commented Oct 24, 2017

I think I have a similar error occurring in a crash today. Again it's in the inmem code, but with SeriesPointIterator.
If this is not related then I can log another bug.

Thanks

Oct 24 10:36:50 graf0 influxd[162905]: [httpd] 144.32.126.91 - - [24/Oct/2017:10:36:50 +0100] "POST /write?db=telegraf HTTP/1.
Oct 24 10:36:50 graf0 influxd[162905]: [httpd] 144.32.126.176 - - [24/Oct/2017:10:36:50 +0100] "POST /write?consistency=&db=te
Oct 24 10:36:50 graf0 influxd[162905]: [httpd] 144.32.126.170 - - [24/Oct/2017:10:36:50 +0100] "POST /write?consistency=&db=te
Oct 24 10:36:50 graf0 influxd[162905]: [httpd] 127.0.0.1 - root [24/Oct/2017:10:36:39 +0100] "GET /query?db=graphite&epoch=s&q
Oct 24 10:36:50 graf0 influxd[162905]: [I] 2017-10-24T09:36:50Z SELECT "key" FROM graphite."2w"._series LIMIT 100000 OFFSET 10
Oct 24 10:36:51 graf0 influxd[162905]: fatal error: concurrent map iteration and map write
Oct 24 10:36:51 graf0 influxd[162905]: goroutine 362851530 [running]:
Oct 24 10:36:51 graf0 influxd[162905]: runtime.throw(0xba2cf4, 0x26)
Oct 24 10:36:51 graf0 influxd[162905]:         /usr/local/go/src/runtime/panic.go:596 +0x95 fp=0xc9eae4e160 sp=0xc9eae4e140
Oct 24 10:36:51 graf0 influxd[162905]: runtime.mapiternext(0xc9eae4e2b0)
Oct 24 10:36:51 graf0 influxd[162905]:         /usr/local/go/src/runtime/hashmap.go:737 +0x7ee fp=0xc9eae4e210 sp=0xc9eae4e160
Oct 24 10:36:51 graf0 influxd[162905]: github.com/influxdata/influxdb/tsdb/index/inmem.(*Index).SeriesPointIterator(0xc42026f3
Oct 24 10:36:51 graf0 influxd[162905]:         /go/src/github.com/influxdata/influxdb/tsdb/index/inmem/inmem.go:727 +0x155 fp=
Oct 24 10:36:51 graf0 influxd[162905]: github.com/influxdata/influxdb/tsdb/engine/tsm1.(*Engine).SeriesPointIterator(0xc4203a9
Oct 24 10:36:51 graf0 influxd[162905]:         /go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/engine.go:2040 +0x75 fp
Oct 24 10:36:51 graf0 influxd[162905]: github.com/influxdata/influxdb/tsdb.(*Shard).createSeriesIterator(0xc47c351040, 0x0, 0x
Oct 24 10:36:51 graf0 influxd[162905]:         /go/src/github.com/influxdata/influxdb/tsdb/shard.go:849 +0x1d4 fp=0xc9eae4e6b0
Oct 24 10:36:51 graf0 influxd[162905]: github.com/influxdata/influxdb/tsdb.(*Shard).createSystemIterator(0xc47c351040, 0xec146
Oct 24 10:36:51 graf0 influxd[162905]:         /go/src/github.com/influxdata/influxdb/tsdb/shard.go:817 +0x2cc fp=0xc9eae4e7f8
Oct 24 10:36:51 graf0 influxd[162905]: github.com/influxdata/influxdb/tsdb.(*Shard).CreateIterator(0xc47c351040, 0xb8c38d, 0x7
Oct 24 10:36:51 graf0 influxd[162905]:         /go/src/github.com/influxdata/influxdb/tsdb/shard.go:802 +0x165 fp=0xc9eae4e968
Oct 24 10:36:51 graf0 influxd[162905]: github.com/influxdata/influxdb/tsdb.Shards.CreateIterator(0xc837819b00, 0xf, 0xf, 0xb8c
Oct 24 10:36:51 graf0 influxd[162905]:         /go/src/github.com/influxdata/influxdb/tsdb/shard.go:1219 +0xf8 fp=0xc9eae4eb38
Oct 24 10:36:51 graf0 influxd[162905]: github.com/influxdata/influxdb/tsdb.(*Shards).CreateIterator(0xc837817e20, 0xb8c38d, 0x
Oct 24 10:36:51 graf0 influxd[162905]:         <autogenerated>:78 +0xb4 fp=0xc9eae4ec90 sp=0xc9eae4eb38

@e-dard
Contributor

e-dard commented Oct 24, 2017

@choffee I think this might be a related issue. Were you using 1.3.6, and has it recurred?

@e-dard
Contributor

e-dard commented Oct 24, 2017

@anatolijd @choffee I have managed to reproduce both of these races. Working on a fix now.

@ghost ghost added review and removed proposed labels Oct 25, 2017
e-dard added a commit that referenced this issue Oct 25, 2017
Fixes #8989 and #8633.

Previously, when issuing commands involving a regex check, walking
through the tag keys/values on a measurement using the measurement's
index was racy.

This commit adds a new `TagKeyValue` type that abstracts away the
multi-layer map we were using as an inverted index from tag keys and
values to series ids. With this abstraction we can also make concurrent
access to this inverted index goroutine safe.

Finally, this commit fixes a very old bug in the index which will affect
any query using a regex. Previously we would always check _every_ tag
against a regex for a measurement, even when we had found a match.
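
For readers who want a feel for the shape of the fix: the commit replaces the raw nested maps with a `TagKeyValue` type that guards access internally. The real implementation lives in tsdb/index/inmem; the sketch below only illustrates the general idea (names, fields, and methods are assumptions, not InfluxDB's actual API):

```go
package inmemsketch

import "sync"

// TagKeyValue is an illustrative stand-in for the abstraction described
// above: the tag value -> series IDs inverted index hidden behind a
// mutex so that queries iterating it and writes adding new series no
// longer race. It is not the real InfluxDB type.
type TagKeyValue struct {
	mu       sync.RWMutex
	valueIDs map[string]map[uint64]struct{} // tag value -> set of series IDs
}

func NewTagKeyValue() *TagKeyValue {
	return &TagKeyValue{valueIDs: make(map[string]map[uint64]struct{})}
}

// InsertSeriesID records that the series with the given ID carries the
// given tag value. Writers take the exclusive lock.
func (t *TagKeyValue) InsertSeriesID(value string, id uint64) {
	t.mu.Lock()
	defer t.mu.Unlock()
	ids := t.valueIDs[value]
	if ids == nil {
		ids = make(map[uint64]struct{})
		t.valueIDs[value] = ids
	}
	ids[id] = struct{}{}
}

// Range calls fn for each (value, series ID) pair while holding a read
// lock, so concurrent writers block instead of corrupting the iteration.
// Returning false from fn stops the walk early, mirroring the regex fix
// above, which stops checking tags once a match is found.
func (t *TagKeyValue) Range(fn func(value string, id uint64) bool) {
	t.mu.RLock()
	defer t.mu.RUnlock()
	for value, ids := range t.valueIDs {
		for id := range ids {
			if !fn(value, id) {
				return
			}
		}
	}
}
```

An RWMutex keeps the common case cheap: many queries can iterate the index concurrently, and only writes that add new series need the exclusive lock.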
@ghost ghost removed the review label Oct 25, 2017