Fatal error: concurrent map iteration and map write #8989

Closed
anatolijd opened this issue Oct 20, 2017 · 3 comments · Fixed by #9009

@anatolijd

I'm seeing the same problem described in #8633.
The symptoms are very similar:

  • a massive number of POST /write?db=telegraf updates from newly scaled-out instances (and new tag values),
  • a number of SHOW TAG VALUES ON telegraf WITH KEY queries (from a Grafana dashboard),
  • a fatal error.

This is very worrying, because it means the crash will happen even more often as more people in our company start using Grafana dashboards on our InfluxDB data.

Is this a bug? Is there a way to prevent it from happening?
Please help us solve it.

Our setup:
InfluxDB cluster with 2 data nodes (8 cores, 64 GB each).

Version: 1.3.6-c1.3.6
cache-max-memory-size = 4194304000
cache-snapshot-memory-size = 26214400
index-version = "inmem"
max-concurrent-compactions = 3

telegraf database stats:

Size on disk: 40G
Measurements: 23
Series: 225647447

short log:

[httpd] 172.20.99.29 - - [19/Oct/2017:13:23:04 +0000] "POST /write?db=telegraf HTTP/1.1" 204 0 "-" "-" a3528157-b4d0-11e7-bee5-000000000000 4813
[httpd] 172.20.99.29 - - [19/Oct/2017:13:23:04 +0000] "POST /write?db=telegraf HTTP/1.1" 204 0 "-" "-" a352b259-b4d0-11e7-bee6-000000000000 3576
[httpd] 172.20.99.29 - - [19/Oct/2017:13:23:04 +0000] "POST /write?db=telegraf HTTP/1.1" 204 0 "-" "-" a351bc5b-b4d0-11e7-bee4-000000000000 9859
[httpd] 172.20.99.29 - - [19/Oct/2017:13:23:04 +0000] "POST /write?db=telegraf HTTP/1.1" 204 0 "-" "-" a352f3b9-b4d0-11e7-bee7-000000000000 4282
[httpd] 172.20.99.29 - - [19/Oct/2017:13:23:04 +0000] "POST /write?db=telegraf HTTP/1.1" 204 0 "-" "-" a353e2b1-b4d0-11e7-bee8-000000000000 2898
[httpd] 172.20.99.29 - - [19/Oct/2017:13:23:04 +0000] "POST /write?db=telegraf HTTP/1.1" 204 0 "-" "-" a353f4e2-b4d0-11e7-bee9-000000000000 4884
[httpd] 172.20.99.29 - - [19/Oct/2017:13:23:04 +0000] "POST /write?db=telegraf HTTP/1.1" 204 0 "-" "-" a353ff0d-b4d0-11e7-beea-000000000000 4655
[I] 2017-10-19T13:23:04Z SHOW TAG VALUES ON telegraf WITH KEY = interface WHERE (_name = 'net') AND ((host =~ /^fe-i-07a1b5f8e3e666075.dc-eu.ragnarok.net$/) AND (_tagKey = 'interface')) service=query
[I] 2017-10-19T13:23:04Z SHOW TAG VALUES ON telegraf WITH KEY = path WHERE (_name = 'disk') AND ((host =~ /^fe-i-07a1b5f8e3e666075.dc-eu.ragnarok.net$/) AND (_tagKey = 'path')) service=query
[I] 2017-10-19T13:23:04Z SHOW TAG VALUES ON telegraf WITH KEY = process_name WHERE (_name = 'procstat') AND ((host =~ /^fe-i-07a1b5f8e3e666075.dc-eu.ragnarok.net$/) AND (_tagKey = 'process_name')) LIMIT 30 service=query
[I] 2017-10-19T13:23:04Z SELECT mean(used) FROM telegraf."default".mem WHERE host =~ /^fe-i-07a1b5f8e3e666075.dc-eu.ragnarok.net$/ AND time > now() - 6h GROUP BY time(10m) service=query
[I] 2017-10-19T13:23:04Z SHOW TAG VALUES ON telegraf WITH KEY = "name" WHERE (_name = 'diskio') AND ((host =~ /^fe-i-07a1b5f8e3e666075.dc-eu.ragnarok.net$/) AND (_tagKey = 'name')) service=query
[I] 2017-10-19T13:23:04Z SELECT mean(usage_user) FROM telegraf."default".cpu WHERE host =~ /^fe-i-07a1b5f8e3e666075.dc-eu.ragnarok.net$/ AND cpu = 'cpu-total' AND time > now() - 6h GROUP BY time(10m) service=query
[I] 2017-10-19T13:23:04Z SELECT mean(cached) FROM telegraf."default".mem WHERE host =~ /^fe-i-07a1b5f8e3e666075.dc-eu.ragnarok.net$/ AND time > now() - 6h GROUP BY time(10m) service=query
[I] 2017-10-19T13:23:04Z SELECT mean(buffered) FROM telegraf."default".mem WHERE host =~ /^fe-i-07a1b5f8e3e666075.dc-eu.ragnarok.net$/ AND time > now() - 6h GROUP BY time(10m) service=query
[httpd] 172.20.99.29 - - [19/Oct/2017:13:23:04 +0000] "POST /write?db=telegraf HTTP/1.1" 204 0 "-" "-" a3559cdd-b4d0-11e7-beed-000000000000 9285
[httpd] 172.20.99.29 - - [19/Oct/2017:13:23:04 +0000] "POST /write?db=telegraf HTTP/1.1" 204 0 "-" "-" a354352f-b4d0-11e7-beeb-000000000000 18601
[I] 2017-10-19T13:23:04Z SELECT mean(free) FROM telegraf."default".mem WHERE host =~ /^fe-i-07a1b5f8e3e666075.dc-eu.ragnarok.net$/ AND time > now() - 6h GROUP BY time(10m) service=query
[httpd] 172.21.1.6, 127.0.0.1,172.20.99.29 - - [19/Oct/2017:13:23:04 +0000] "GET /query?db=telegraf&epoch=ms&q=SELECT+mean%28%22used%22%29+FROM+%22default%22.%22mem%22+WHERE+%22host%22+%3D~+%2F%5Efe-i-07a1b5f8e3e666075%5C.dc-eu.ragnarok%5C.net%24%2F+AND+time+%3E+now%28%29+-+6h+GROUP+BY+time%2810m%29+fill%28null%29%3BSELECT+mean%28%22cached%22%29+FROM+%22default%22.%22mem%22+WHERE+%22host%22+%3D~+%2F%5Efe-i-07a1b5f8e3e666075%5C.dc-eu.ragnarok%5C.net%24%2F+AND+time+%3E+now%28%29+-+6h+GROUP+BY+time%2810m%29+fill%28null%29%3BSELECT+mean%28%22buffered%22%29+FROM+%22default%22.%22mem%22+WHERE+%22host%22+%3D~+%2F%5Efe-i-07a1b5f8e3e666075%5C.dc-eu.ragnarok%5C.net%24%2F+AND+time+%3E+now%28%29+-+6h+GROUP+BY+time%2810m%29+fill%28null%29%3BSELECT+mean%28%22free%22%29+FROM+%22default%22.%22mem%22+WHERE+%22host%22+%3D~+%2F%5Efe-i-07a1b5f8e3e666075%5C.dc-eu.ragnarok%5C.net%24%2F+AND+time+%3E+now%28%29+-+6h+GROUP+BY+time%2810m%29+fill%28null%29 HTTP/1.1" 200 1353 "http://grafana.dc-eu.ragnarok.com/dashboard/db/instance?orgId=1&refresh=1m&from=now-6h&to=now&var-host=fe-i-07a1b5f8e3e666075.dc-eu.ragnarok.net&var-by_time=10m&var-r=default&var-iface=All&var-process=All&var-disk=All&var-path=All" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36" a3563ac1-b4d0-11e7-bef0-000000000000 8487

fatal error: concurrent map iteration and map write

goroutine 41340066 [running]:
runtime.throw(0xcdfd72, 0x26)
/usr/local/go/src/runtime/panic.go:596 +0x95 fp=0xc6a24611f0 sp=0xc6a24611d0
runtime.mapiternext(0xc6a2461560)
/usr/local/go/src/runtime/hashmap.go:737 +0x7ee fp=0xc6a24612a0 sp=0xc6a24611f0
github.com/influxdata/influxdb/tsdb/index/inmem.(*Measurement).idsForExpr(0xc458ef6480, 0xc47cf573b0, 0x0, 0xc47c939230, 0x8, 0x18, 0xc58f9b7540, 0x0, 0xc42037e800)

Log with the full stack trace is attached:

influxd.log.fatal.gz
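
For context on what this runtime error means: Go's runtime aborts the whole process when it detects one goroutine iterating a plain map while another goroutine writes to it. A minimal, self-contained sketch of the same class of failure (illustrative only, not InfluxDB code) would look like this:

```go
package main

import "time"

func main() {
	// A shared map with no synchronization, loosely analogous to the
	// inmem index maps that idsForExpr walks during SHOW TAG VALUES.
	tagValues := map[string]int{"eth0": 1, "eth1": 2}

	// Writer goroutine: keeps adding new keys, like a burst of
	// POST /write requests introducing new tag values.
	go func() {
		for i := 0; ; i++ {
			tagValues[time.Now().String()] = i
		}
	}()

	// Reader: ranges over the map, like a query walking the index.
	// Sooner or later the runtime throws
	// "fatal error: concurrent map iteration and map write" and exits.
	for {
		for range tagValues {
		}
	}
}
```

Unlike a data race caught by the race detector, this check is always enabled in the runtime, so a single write/query collision in the index is enough to take the whole influxd process down.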

@e-dard e-dard added this to the 1.4.0 milestone Oct 20, 2017
@e-dard e-dard self-assigned this Oct 20, 2017
@choffee

choffee commented Oct 24, 2017

I think I have a similar error occurring in a crash today. Again it's in the inmem code, but with SeriesPointIterator.
If this is not related then I can log another bug.

Thanks

Oct 24 10:36:50 graf0 influxd[162905]: [httpd] 144.32.126.91 - - [24/Oct/2017:10:36:50 +0100] "POST /write?db=telegraf HTTP/1.
Oct 24 10:36:50 graf0 influxd[162905]: [httpd] 144.32.126.176 - - [24/Oct/2017:10:36:50 +0100] "POST /write?consistency=&db=te
Oct 24 10:36:50 graf0 influxd[162905]: [httpd] 144.32.126.170 - - [24/Oct/2017:10:36:50 +0100] "POST /write?consistency=&db=te
Oct 24 10:36:50 graf0 influxd[162905]: [httpd] 127.0.0.1 - root [24/Oct/2017:10:36:39 +0100] "GET /query?db=graphite&epoch=s&q
Oct 24 10:36:50 graf0 influxd[162905]: [I] 2017-10-24T09:36:50Z SELECT "key" FROM graphite."2w"._series LIMIT 100000 OFFSET 10
Oct 24 10:36:51 graf0 influxd[162905]: fatal error: concurrent map iteration and map write
Oct 24 10:36:51 graf0 influxd[162905]: goroutine 362851530 [running]:
Oct 24 10:36:51 graf0 influxd[162905]: runtime.throw(0xba2cf4, 0x26)
Oct 24 10:36:51 graf0 influxd[162905]:         /usr/local/go/src/runtime/panic.go:596 +0x95 fp=0xc9eae4e160 sp=0xc9eae4e140
Oct 24 10:36:51 graf0 influxd[162905]: runtime.mapiternext(0xc9eae4e2b0)
Oct 24 10:36:51 graf0 influxd[162905]:         /usr/local/go/src/runtime/hashmap.go:737 +0x7ee fp=0xc9eae4e210 sp=0xc9eae4e160
Oct 24 10:36:51 graf0 influxd[162905]: github.com/influxdata/influxdb/tsdb/index/inmem.(*Index).SeriesPointIterator(0xc42026f3
Oct 24 10:36:51 graf0 influxd[162905]:         /go/src/github.com/influxdata/influxdb/tsdb/index/inmem/inmem.go:727 +0x155 fp=
Oct 24 10:36:51 graf0 influxd[162905]: github.com/influxdata/influxdb/tsdb/engine/tsm1.(*Engine).SeriesPointIterator(0xc4203a9
Oct 24 10:36:51 graf0 influxd[162905]:         /go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/engine.go:2040 +0x75 fp
Oct 24 10:36:51 graf0 influxd[162905]: github.com/influxdata/influxdb/tsdb.(*Shard).createSeriesIterator(0xc47c351040, 0x0, 0x
Oct 24 10:36:51 graf0 influxd[162905]:         /go/src/github.com/influxdata/influxdb/tsdb/shard.go:849 +0x1d4 fp=0xc9eae4e6b0
Oct 24 10:36:51 graf0 influxd[162905]: github.com/influxdata/influxdb/tsdb.(*Shard).createSystemIterator(0xc47c351040, 0xec146
Oct 24 10:36:51 graf0 influxd[162905]:         /go/src/github.com/influxdata/influxdb/tsdb/shard.go:817 +0x2cc fp=0xc9eae4e7f8
Oct 24 10:36:51 graf0 influxd[162905]: github.com/influxdata/influxdb/tsdb.(*Shard).CreateIterator(0xc47c351040, 0xb8c38d, 0x7
Oct 24 10:36:51 graf0 influxd[162905]:         /go/src/github.com/influxdata/influxdb/tsdb/shard.go:802 +0x165 fp=0xc9eae4e968
Oct 24 10:36:51 graf0 influxd[162905]: github.com/influxdata/influxdb/tsdb.Shards.CreateIterator(0xc837819b00, 0xf, 0xf, 0xb8c
Oct 24 10:36:51 graf0 influxd[162905]:         /go/src/github.com/influxdata/influxdb/tsdb/shard.go:1219 +0xf8 fp=0xc9eae4eb38
Oct 24 10:36:51 graf0 influxd[162905]: github.com/influxdata/influxdb/tsdb.(*Shards).CreateIterator(0xc837817e20, 0xb8c38d, 0x
Oct 24 10:36:51 graf0 influxd[162905]:         <autogenerated>:78 +0xb4 fp=0xc9eae4ec90 sp=0xc9eae4eb38

@e-dard
Contributor

e-dard commented Oct 24, 2017

@choffee I think this might be a related issue. Were you using 1.3.6, and has it recurred?

@e-dard
Contributor

e-dard commented Oct 24, 2017

@anatolijd @choffee I have managed to reproduce both of these races. Working on a fix now.

@ghost ghost added review and removed proposed labels Oct 25, 2017
e-dard added a commit that referenced this issue Oct 25, 2017
Fixes #8989 and #8633.

Previously, when issuing commands involving a regex check, walking
through the tag keys/values on a measurement using the measurement's
index was racy.

This commit adds a new `TagKeyValue` type that abstracts away the
multi-layer map we were using as an inverted index from tag keys and
values to series ids. With this abstraction we can also make concurrent
access to this inverted index goroutine safe.

Finally, this commit fixes a very old bug in the index which will affect
any query using a regex. Previously we would always check _every_ tag
against a regex for a measurement, even when we had found a match.
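
For readers who want a feel for the shape of the fix: the commit replaces the raw nested maps with a `TagKeyValue` type that guards access internally. The real implementation lives in tsdb/index/inmem; the sketch below only illustrates the general idea (names, fields, and methods are assumptions, not InfluxDB's actual API):

```go
package inmemsketch

import "sync"

// TagKeyValue is an illustrative stand-in for the abstraction described
// above: the tag value -> series IDs inverted index hidden behind a
// mutex so that queries iterating it and writes adding new series no
// longer race. It is not the real InfluxDB type.
type TagKeyValue struct {
	mu       sync.RWMutex
	valueIDs map[string]map[uint64]struct{} // tag value -> set of series IDs
}

func NewTagKeyValue() *TagKeyValue {
	return &TagKeyValue{valueIDs: make(map[string]map[uint64]struct{})}
}

// InsertSeriesID records that the series with the given ID carries the
// given tag value. Writers take the exclusive lock.
func (t *TagKeyValue) InsertSeriesID(value string, id uint64) {
	t.mu.Lock()
	defer t.mu.Unlock()
	ids := t.valueIDs[value]
	if ids == nil {
		ids = make(map[uint64]struct{})
		t.valueIDs[value] = ids
	}
	ids[id] = struct{}{}
}

// Range calls fn for each (value, series ID) pair while holding a read
// lock, so concurrent writers block instead of corrupting the iteration.
// Returning false from fn stops the walk early, mirroring the regex fix
// above, which stops checking tags once a match is found.
func (t *TagKeyValue) Range(fn func(value string, id uint64) bool) {
	t.mu.RLock()
	defer t.mu.RUnlock()
	for value, ids := range t.valueIDs {
		for id := range ids {
			if !fn(value, id) {
				return
			}
		}
	}
}
```

An RWMutex keeps the common case cheap: many queries can iterate the index concurrently, and only writes that add new series need the exclusive lock.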
@ghost ghost removed the review label Oct 25, 2017