Shard deletion triggers infinite memory usage due to cache lock contention #7226
Comments
Just put in a PR which should help with this: #7228
Another thing that could help significantly is to introduce the concept of handlers. Doing so would control the amount of contention on the lock: something like 10 write handlers, 1 snapshot handler and 50 read handlers. Each type would be processed by sending requests to a buffered channel which the relevant handlers read from and process. If the buffer is full the action could be rejected, preventing excessive backlog build-up, which in turn would ensure that memory usage stays bounded (see the sketch below). In addition to this, sharding the cache and eliminating the impact of GC should also be done, as the current cache is likely to cause significant GC pauses due to its use of pointers; see http://allegro.tech/2016/03/writing-fast-cache-service-in-go.html for details.
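To make the handler idea concrete, here is a minimal sketch of per-request-type worker pools fed by bounded channels; this is not existing InfluxDB code, and names such as Pool, Submit and the pool sizes are purely illustrative:

```go
package handlers

import "errors"

// ErrBusy is returned when a handler queue is full and the request is rejected.
var ErrBusy = errors.New("handler queue full, request rejected")

// Request is any unit of work (a write, a read, a snapshot).
type Request func()

// Pool is a fixed set of workers draining a bounded queue.
type Pool struct {
	queue chan Request
}

// NewPool starts `workers` goroutines draining a buffer of `depth` requests.
func NewPool(workers, depth int) *Pool {
	p := &Pool{queue: make(chan Request, depth)}
	for i := 0; i < workers; i++ {
		go func() {
			for req := range p.queue {
				req()
			}
		}()
	}
	return p
}

// Submit enqueues a request, or rejects it immediately if the buffer is full,
// which keeps the backlog (and therefore memory) bounded.
func (p *Pool) Submit(r Request) error {
	select {
	case p.queue <- r:
		return nil
	default:
		return ErrBusy
	}
}

// Example wiring using the numbers suggested above.
var (
	writeHandlers    = NewPool(10, 1024)
	snapshotHandlers = NewPool(1, 1)
	readHandlers     = NewPool(50, 1024)
)
```

Submitting a write would then look like writeHandlers.Submit(func() { /* apply the write */ }), with the caller turning ErrBusy into an immediate rejection instead of queueing without bound.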
We're currently testing 1.0, yes, but we haven't let a compaction happen yet. Given the description of the problem in #7258 I don't believe that's the cause of this: it describes an irrecoverable situation, whereas stopping the incoming data allowed us to recover from our issue, so it seems unlikely to be the cause.
Also just read through #7165 and it doesn't seem to be related; neither of these seems to address the underlying issue of the high lock contention, or the data race which will cause an invalid cache size restriction.
@stevenh What size of writes are you sending? Are you batching writes or just sending many small ones? Can you provide some sample writes in line protocol format?

Also, compactions are not the same thing as retention policy expiring shards. Compactions happen all the time while writes are occurring on a shard. Snapshot compactions are the only ones that lock the cache, but they do not hold onto the locks while the compactions are run.

The large heap allocations from queries look like they may be querying against a large number of series, which may indicate a cardinality issue with your tags or a query that is not selective enough. There are also a number of goroutines gzipping responses, which is odd. Perhaps a slow query reader? You may want to try enabling some of these limits to help mitigate some of the issues as well as pinpoint any problem queries:
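As a rough sketch of the sort of limits being referred to, the query controls in 1.x live under the [coordinator] section of influxdb.conf; the option names and values below are illustrative and should be checked against the sample config shipped with your release:

```toml
[coordinator]
  # Reject new queries once this many are already running (0 = unlimited).
  max-concurrent-queries = 20
  # Kill queries that run longer than this ("0s" = no timeout).
  query-timeout = "60s"
  # Log any query slower than this, to help pinpoint problem queries.
  log-queries-after = "10s"
  # Abort SELECTs that would scan more than this many points (0 = unlimited).
  max-select-point = 0
  # Abort SELECTs that would touch more than this many series (0 = unlimited).
  max-select-series = 1000000
  # Abort SELECTs that would create more than this many GROUP BY time() buckets.
  max-select-buckets = 0
```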
@jwilder I'll add the settings to our setup later on today :) In the meantime, I grabbed a backup of production data, put it against 1.0 in our dev environment and set the retention down to 7 days. After ~30 minutes the retention kicked in and deleted ~271,000 series (kind of expected). From our graphing, the retention took 8 minutes and 30 seconds to run, but during this time it completely stopped writing data to both the …

I'm going to retest this shortly and grab heap, goroutine, blocks and vars while the node isn't graphing.
I managed to grab these while reducing the retention policy on the same backup we used during the test yesterday:
block.txt
@liv3d I think I see part of the issue in one of those traces now. When the shard deletion kicks in, the shard is … Not sure why other writes are getting blocked yet though.
With regard to other writes getting blocked, the initial traces indicated that they weren't totally blocked; they were just progressing VERY slowly due to the cache lock contention.
@liv3d I think that panic is due to a recent change in master unrelated to this. I'll rebase the PR off of the 1.0 branch while we track down the cause of that panic.
The panic is a regression due to #7296.
@jwilder I'm not convinced that panic was unrelated. I tried SHOW STATS again and it still panics:

[danof@influxdb-1.labs ~]$ influxd version
InfluxDB v1.0.0 (git: jw-delete-shard 0063ca4842a83eb0290c06c3e48b9783c63d1869)
[danof@influxdb-1.labs ~]$ influx -execute "SHOW STATS"
ERR: SHOW STATS [panic:assignment to entry in nil map]
Warning: It is possible this error is due to not setting a database.
Please set a database with the command "use <database>".
SHOW STATS [panic:assignment to entry in nil map]
[danof@influxdb-1.labs ~]$

My box on 1.0 GA does:

[danof@influxdb-2.labs ~]$ influxd version
InfluxDB v1.0.0 (git: master 37992377a55fbc138b2c01edd4deffed64b53989)
[danof@influxdb-2.labs ~]$ influx -execute "SHOW STATS"
name: cq
--------
queryFail queryOk
0 13972
...
When deleting a shard, the shard is locked and then removed from the index. Removal from the index can be slow if there are a lot of series. During this time, the shard is still expected to exist by the meta store and tsdb store so stats collections, queries and writes could all be run on this shard while it's locked. This can cause everything to lock up until the unindexing completes and the shard can be unlocked. Fixes #7226
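For illustration, here is a simplified sketch of the ordering described above; this is not the actual patch, and Store, Shard and the method names are invented. The point is that the shard is detached from the routing map first, so writes, queries and stats collection see "shard not found" instead of piling up behind its lock while the slow unindexing runs:

```go
package tsdbsketch

import "sync"

// Shard stands in for a TSM shard with an in-memory series index.
type Shard struct {
	mu sync.Mutex
	// ... cache, WAL, series index ...
}

// UnloadIndex drops every series from the in-memory index. With hundreds of
// thousands of series this is slow, and it holds the shard lock throughout.
func (s *Shard) UnloadIndex() {
	s.mu.Lock()
	defer s.mu.Unlock()
	// walk the index and remove each series ...
}

// Store routes writes, queries and stats collection to shards by ID.
type Store struct {
	mu     sync.RWMutex
	shards map[uint64]*Shard
}

// DeleteShard removes the shard from the routing map before the expensive
// teardown, so nothing else can reach the shard while it is being unindexed.
func (st *Store) DeleteShard(id uint64) {
	st.mu.Lock()
	sh, ok := st.shards[id]
	delete(st.shards, id)
	st.mu.Unlock()
	if !ok {
		return
	}
	sh.UnloadIndex()
}
```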
The good news @jwilder is I can indeed run SHOW STATS.
@jwilder I've done some more testing and I'm pretty happy with the change. Hopefully this can make the next release.
I'm a little bit late to the party on this one, but I've also been having problems dropping series / measurements on a roughly 75 GB database with just under 2M series. I tried to drop a measurement earlier today, which took a while, but eventually the command returned to the prompt. However, in the following minutes, influxd started to eat a ton of memory (32 GB + 4 GB swap) and eventually crashed. I've been very hesitant about deleting stuff because this behaviour certainly is not new. Here's hoping that this patch improves things.
@dswarbrick #7165 may also help you out. If you are able to test it out, that would be very helpful.
That PR needs to be rebased w/ master though. |
Bug report
InfluxDB: 1.0.0-beta3
OS: Linux 3.10.0-327.el7.x86_64
Steps to reproduce:
Expected behaviour:
Compaction should have little impact on memory
Actual behaviour:
Memory increases until it triggers an out-of-memory condition
Additional info:
Looking into the details from pprof, it's clear that the issue is serious contention on the cache lock, enough so that all queries (read/write) to influx stall pretty much indefinitely while compaction takes place.
When incoming requests get behind, the problem spirals out of control with more and more memory used for query goroutines.
When we captured it we had 100k goroutines waiting on the cache lock via tsm1.newIntegerAscendingCursor, accounting for 129GB of active memory; this is very clear to see in heap.pdf below.
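As a deliberately simplified illustration of that failure mode (not InfluxDB's actual cache; the types and sizes are made up), each request allocates its buffers before parking on the lock, so the live heap grows roughly in proportion to the backlog:

```go
package main

import (
	"sync"
	"time"
)

type cache struct {
	mu   sync.RWMutex
	data map[string][]byte
}

func (c *cache) write(key string, val []byte) {
	c.mu.Lock() // every writer serialises on this single lock
	defer c.mu.Unlock()
	c.data[key] = val
}

func (c *cache) snapshot() {
	c.mu.Lock() // a slow snapshot/compaction-style step holding the same lock
	defer c.mu.Unlock()
	time.Sleep(30 * time.Second)
}

func main() {
	c := &cache{data: make(map[string][]byte)}
	go c.snapshot()

	// Writes keep arriving while the lock is held: each goroutine has already
	// allocated its payload, so memory grows with the length of the queue.
	for {
		go func() {
			buf := make([]byte, 64<<10)
			c.write("cpu,host=a", buf)
		}()
		time.Sleep(time.Millisecond)
	}
}
```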
In our case we have a master/slave setup for reads, with this node being the slave, so it's only taking writes from influx-relay unless the primary is down, which it wasn't; so all of this, we believe, is write rather than read traffic.
Once we realised this we disabled writes by taking this node out of influx-relay, and it recovered in ~30 minutes even from 312GB. At the same point we saw a query complete which had taken 13,555 minutes.
Machine Spec
Logs:
block.txt
goroutine2.txt
heap2.txt
vars.txt
iostat.txt
heap.pdf