[0.13] drop measurement/series taking a very long time #6669

Closed

cheribral opened this issue May 19, 2016 · 10 comments

@cheribral

Bug report

CentOS 7, AWS r3.2xlarge, provisioned IOPS, etc.

Steps to reproduce:

  1. Generate measurements with millions of series, and some data points (one way to do this is sketched below)
  2. Try to drop one of the measurements or drop the series
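
For anyone trying to reproduce step 1, here is a minimal sketch of a high-cardinality write workload in Go. The write URL, the database name (`loadtest`), and the series counts are assumptions for illustration, not values from the original report.

```go
// repro.go: write high-cardinality line protocol to an InfluxDB HTTP endpoint.
// The address, database name, and series counts below are assumed for this
// sketch and should be adjusted to the environment being tested.
package main

import (
	"bytes"
	"fmt"
	"net/http"
)

func main() {
	const (
		writeURL  = "http://localhost:8086/write?db=loadtest"
		batchSize = 5000
		batches   = 200 // 200 * 5000 = 1,000,000 series; raise to taste
	)
	for b := 0; b < batches; b++ {
		var buf bytes.Buffer
		for i := 0; i < batchSize; i++ {
			// A unique tag value per point creates a new series per point.
			fmt.Fprintf(&buf, "m,session=%d-%d value=1\n", b, i)
		}
		resp, err := http.Post(writeURL, "text/plain", &buf)
		if err != nil {
			panic(err)
		}
		resp.Body.Close()
	}
}
```

The slow delete can then be attempted from the CLI with `DROP MEASUREMENT m` or `DROP SERIES FROM m`.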

Expected behavior:
Unsure, but I would hope that a delete could finish in roughly the time it would take to read and rewrite the entire measurement.

Actual behavior:
It takes hours to finish without doing any significant work on the server.

I'm not sure if this is related to #6250, but it didn't seem like it should be folded into that issue. I've attached the files @jwilder referenced in #6250. Although I know that deletes are expensive, this doesn't seem quite right. It also makes it very hard to recover from a mistake like the one we have here, where someone didn't realize their measurements' cardinality would be so high.

block.txt
goroutine.txt

@jwilder
Contributor

jwilder commented May 19, 2016

It looks like your process is deadlocked. There was a fix for this in #6627.

@cheribral
Author

@jwilder I tried this again using 0.14.0~n201605240800, and I'm not sure it is doing any better. It's been running for quite a while, and I can't query the _internal stats or use the CLI client. I see the HTTP writes for other databases going past in the logs, but I'm also seeing a lot of "failed to write point batch to database... timeout" errors for UDP writes.
block.txt
goroutine.txt

I was able to make the calls for the pprof data, which I've included above in case it helps.
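
For reference, profiles like the ones attached here can be pulled with something like the following sketch, assuming the default HTTP bind address `localhost:8086` and that the `/debug/pprof` endpoints are enabled on it.

```go
// fetch_profiles.go: pull goroutine and block profiles like the
// goroutine.txt and block.txt attached in this thread. The bind address and
// the availability of the /debug/pprof endpoints are assumptions.
package main

import (
	"io"
	"net/http"
	"os"
)

// dump fetches a profile URL and writes the response body to path.
func dump(url, path string) error {
	resp, err := http.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()

	_, err = io.Copy(f, resp.Body)
	return err
}

func main() {
	base := "http://localhost:8086/debug/pprof/"
	if err := dump(base+"goroutine?debug=2", "goroutine.txt"); err != nil {
		panic(err)
	}
	if err := dump(base+"block?debug=2", "block.txt"); err != nil {
		panic(err)
	}
}
```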

@cheribral
Author

Another interesting bit: trying to drop the entire database which holds the bad metrics causes influxd to lock up completely, use up all the memory on the box, and then get killed by the OOM killer. Shouldn't that just be a matter of removing some metadata and deleting the database directory?
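
For what it's worth, the "cheap" path this question has in mind would look roughly like the sketch below. This only illustrates the expectation, with made-up paths and function names; it is not a claim about how influxd actually implements `DROP DATABASE`.

```go
// Sketch of the expected cheap path: drop the metadata entry elsewhere, then
// remove the database's data and WAL directories. Paths and names here are
// illustrative, not InfluxDB's real layout or API.
package main

import (
	"os"
	"path/filepath"
)

func dropDatabaseFiles(dataDir, walDir, db string) error {
	if err := os.RemoveAll(filepath.Join(dataDir, db)); err != nil {
		return err
	}
	return os.RemoveAll(filepath.Join(walDir, db))
}

func main() {
	_ = dropDatabaseFiles("/var/lib/influxdb/data", "/var/lib/influxdb/wal", "mydb")
}
```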

@jwilder added this to the 1.0.0 milestone May 25, 2016
@jwilder
Contributor

jwilder commented May 25, 2016

@cheribral I see another problem. The tsdb.Store.DeleteMeasurement call takes a write lock and then deletes the measurement from each shard serially. Deleting from each shard takes a while, so the write lock is held for a long time, locking up the DB.
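
For readers following along, the shape of the problem described above can be illustrated with a stripped-down sketch. The types and method names here are simplified stand-ins, not the actual tsdb.Store code; the point is that a store-wide write lock held across serial per-shard deletes blocks every other reader and writer until the last shard finishes.

```go
// Illustrative only: simplified stand-ins, not InfluxDB's tsdb.Store.
package main

import "sync"

type shard struct{ /* index, series data, ... */ }

func (s *shard) deleteMeasurement(name string) {
	// Potentially slow: walks the shard's index/series for the measurement.
}

type store struct {
	mu     sync.RWMutex
	shards []*shard
}

// The problematic shape: the store-wide write lock is held while each shard
// is processed serially, so reads and writes on every database block until
// the last shard finishes.
func (st *store) deleteMeasurement(name string) {
	st.mu.Lock()
	defer st.mu.Unlock()
	for _, sh := range st.shards {
		sh.deleteMeasurement(name)
	}
}

func main() {}
```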

@cheribral
Author

Apologies for the stream of comments, but the server came back up thinking the database was gone, even though the data files were still there. I was also seeing statistics for the database in _internal. I shut down influx, moved the data files for that database out of the data directory, removed the WAL files, and everything seems to be coherent again.

I was then a bit suspicious, so I tried to drop a large measurement in another database, and that one worked fine. At this point I can't tell whether the multiple failed attempts at removing data corrupted that original database in a way the server couldn't handle, so I'm not sure whether this issue still needs to stay open. I'll close it on the assumption that you don't need any more noise than necessary :)

@lvheyang
Contributor

lvheyang commented Jun 14, 2016

@jwilder I have hit the same problem. I also noticed that many drop actions, including drops of shards, databases, measurements, and retention policies, block all other read/write requests to InfluxDB.

Is there any plan to narrow the scope of the write lock? It is common to want to query one database (or even one retention policy) while dropping another.
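
To make the request concrete, here is a sketch of the kind of narrower scope being asked for: a per-database lock taken during the drop, so queries and writes against other databases keep flowing. The names are illustrative and are not InfluxDB's actual internals.

```go
// Sketch of per-database locking; names are illustrative, not InfluxDB APIs.
package main

import "sync"

type database struct {
	mu sync.RWMutex
	// shards, index, etc.
}

type store struct {
	mu  sync.RWMutex // guards the map itself; held only briefly
	dbs map[string]*database
}

// dropMeasurement locks only the target database, so reads and writes
// against other databases are not blocked while the slow per-shard work runs.
func (st *store) dropMeasurement(db, name string) {
	st.mu.RLock()
	d, ok := st.dbs[db]
	st.mu.RUnlock()
	if !ok {
		return
	}
	d.mu.Lock()
	defer d.mu.Unlock()
	// ... delete the measurement from each of d's shards ...
}

func main() {}
```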

@jwilder
Contributor

jwilder commented Jun 14, 2016

@lvheyang Yes. We're working on it.

@CAFxX

CAFxX commented Dec 15, 2016

@jwilder any updates? We're also seeing this on influx 1.0.

@ccassar

ccassar commented Jan 9, 2017

@jwilder I'm guessing this will be no surprise, but we're seeing this in v1.1 too. Assuming the root cause is understood, is there some workaround while we wait for the fix? The reason I ask is that it interferes with our workflow when testing at scale and needing to reclaim space between iterations.

@jwilder
Contributor

jwilder commented Jan 9, 2017

This issue is closed because it was caused by a deadlock and some serial deletion code, both of which have since been fixed. If you are having issues with deletes, please log a new issue with details.
