[0.13] drop measurement/series taking a very long time #6669

Closed

cheribral opened this issue May 19, 2016 · 10 comments

@cheribral

Bug report

CentOS 7, AWS r3.2xlarge, provisioned IOPS, etc.

Steps to reproduce:

  1. Generate measurements with millions of series, and some data points (one way to do this is sketched below)
  2. Try to drop one of the measurements or drop the series
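
For anyone trying to reproduce step 1, here is a minimal sketch of a high-cardinality write workload in Go. The write URL, the database name (`loadtest`), and the series counts are assumptions for illustration, not values from the original report.

```go
// repro.go: write high-cardinality line protocol to an InfluxDB HTTP endpoint.
// The address, database name, and series counts below are assumed for this
// sketch and should be adjusted to the environment being tested.
package main

import (
	"bytes"
	"fmt"
	"net/http"
)

func main() {
	const (
		writeURL  = "http://localhost:8086/write?db=loadtest"
		batchSize = 5000
		batches   = 200 // 200 * 5000 = 1,000,000 series; raise to taste
	)
	for b := 0; b < batches; b++ {
		var buf bytes.Buffer
		for i := 0; i < batchSize; i++ {
			// A unique tag value per point creates a new series per point.
			fmt.Fprintf(&buf, "m,session=%d-%d value=1\n", b, i)
		}
		resp, err := http.Post(writeURL, "text/plain", &buf)
		if err != nil {
			panic(err)
		}
		resp.Body.Close()
	}
}
```

The slow delete can then be attempted from the CLI with `DROP MEASUREMENT m` or `DROP SERIES FROM m`.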

Expected behavior:
Unsure, but I would hope that a delete could finish in roughly the time it would take to read and rewrite the entire measurement.

Actual behavior:
It takes hours to finish without doing any significant work on the server.

I'm not sure if this is related to #6250, but it didn't seem like it should be folded into that issue. I've attached the files @jwilder referenced in #6250. Although I know that deletes are expensive, this doesn't seem quite right. It also makes it very hard to recover from a mistake like the one we have here, where someone didn't realize their measurements' cardinality would be so high.

block.txt
goroutine.txt

@jwilder
Contributor

jwilder commented May 19, 2016

It looks like your process is deadlocked. There was a fix for this in #6627.

@cheribral
Author

@jwilder I tried this again using 0.14.0~n201605240800, and I'm not sure it is doing any better. It's been running for quite a while, and I can't query the _internal stats or use the CLI client. I see the HTTP writes for other databases going past in the logs, but I'm also seeing a lot of "failed to write point batch to database... timeout" errors for UDP writes.
block.txt
goroutine.txt

I was able to make the calls for the pprof data, which I've included above in case it helps.
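
For reference, profiles like the ones attached here can be pulled with something like the following sketch, assuming the default HTTP bind address `localhost:8086` and that the `/debug/pprof` endpoints are enabled on it.

```go
// fetch_profiles.go: pull goroutine and block profiles like the
// goroutine.txt and block.txt attached in this thread. The bind address and
// the availability of the /debug/pprof endpoints are assumptions.
package main

import (
	"io"
	"net/http"
	"os"
)

// dump fetches a profile URL and writes the response body to path.
func dump(url, path string) error {
	resp, err := http.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()

	_, err = io.Copy(f, resp.Body)
	return err
}

func main() {
	base := "http://localhost:8086/debug/pprof/"
	if err := dump(base+"goroutine?debug=2", "goroutine.txt"); err != nil {
		panic(err)
	}
	if err := dump(base+"block?debug=2", "block.txt"); err != nil {
		panic(err)
	}
}
```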

@cheribral
Author

Another interesting bit: trying to drop the entire database which holds the bad metrics causes influxd to lock up completely, use up all the memory on the box, and then get killed by the OOM killer. Shouldn't that just be a matter of removing some metadata and deleting the database directory?
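
For what it's worth, the "cheap" path this question has in mind would look roughly like the sketch below. This only illustrates the expectation, with made-up paths and function names; it is not a claim about how influxd actually implements `DROP DATABASE`.

```go
// Sketch of the expected cheap path: drop the metadata entry elsewhere, then
// remove the database's data and WAL directories. Paths and names here are
// illustrative, not InfluxDB's real layout or API.
package main

import (
	"os"
	"path/filepath"
)

func dropDatabaseFiles(dataDir, walDir, db string) error {
	if err := os.RemoveAll(filepath.Join(dataDir, db)); err != nil {
		return err
	}
	return os.RemoveAll(filepath.Join(walDir, db))
}

func main() {
	_ = dropDatabaseFiles("/var/lib/influxdb/data", "/var/lib/influxdb/wal", "mydb")
}
```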

@jwilder added this to the 1.0.0 milestone May 25, 2016
@jwilder
Contributor

jwilder commented May 25, 2016

@cheribral I see another problem. The tsdb.Store.DeleteMeasurement call takes a write lock and then deletes the measurement from each shard serially. Deleting from each shard takes a while, so the write lock is held for a long time, locking up the DB.
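
For readers following along, the shape of the problem described above can be illustrated with a stripped-down sketch. The types and method names here are simplified stand-ins, not the actual tsdb.Store code; the point is that a store-wide write lock held across serial per-shard deletes blocks every other reader and writer until the last shard finishes.

```go
// Illustrative only: simplified stand-ins, not InfluxDB's tsdb.Store.
package main

import "sync"

type shard struct{ /* index, series data, ... */ }

func (s *shard) deleteMeasurement(name string) {
	// Potentially slow: walks the shard's index/series for the measurement.
}

type store struct {
	mu     sync.RWMutex
	shards []*shard
}

// The problematic shape: the store-wide write lock is held while each shard
// is processed serially, so reads and writes on every database block until
// the last shard finishes.
func (st *store) deleteMeasurement(name string) {
	st.mu.Lock()
	defer st.mu.Unlock()
	for _, sh := range st.shards {
		sh.deleteMeasurement(name)
	}
}

func main() {}
```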

@cheribral
Author

Apologies for the stream of comments, but the server came back up thinking the database was gone, even though the data files were still there. I was also seeing statistics for the database in _internal. I shut down influx, moved the data files for that database out of the data directory, removed the WAL files, and everything seems to be coherent again.

I was then a bit suspicious, so I tried to drop a large measurement in another database, and that one worked fine. At this point I can't tell whether the multiple failed attempts at removing data corrupted that original database in a way the server couldn't handle, so I'm not sure whether this issue still needs to stay open. I'll close it on the assumption that you don't need any more noise than necessary :)

@lvheyang
Contributor

lvheyang commented Jun 14, 2016

@jwilder I have hit the same problem. I also noticed that many drop actions, including drops of shards, databases, measurements, and retention policies, block all other read/write requests to InfluxDB.

Is there any plan to narrow the scope of the write lock? It is common to want to query one database (or even one retention policy) while dropping another.
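
To make the request concrete, here is a sketch of the kind of narrower scope being asked for: a per-database lock taken during the drop, so queries and writes against other databases keep flowing. The names are illustrative and are not InfluxDB's actual internals.

```go
// Sketch of per-database locking; names are illustrative, not InfluxDB APIs.
package main

import "sync"

type database struct {
	mu sync.RWMutex
	// shards, index, etc.
}

type store struct {
	mu  sync.RWMutex // guards the map itself; held only briefly
	dbs map[string]*database
}

// dropMeasurement locks only the target database, so reads and writes
// against other databases are not blocked while the slow per-shard work runs.
func (st *store) dropMeasurement(db, name string) {
	st.mu.RLock()
	d, ok := st.dbs[db]
	st.mu.RUnlock()
	if !ok {
		return
	}
	d.mu.Lock()
	defer d.mu.Unlock()
	// ... delete the measurement from each of d's shards ...
}

func main() {}
```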

@jwilder
Contributor

jwilder commented Jun 14, 2016

@lvheyang Yes. We're working on it.

@CAFxX

CAFxX commented Dec 15, 2016

@jwilder any updates? We're also seeing this on influx 1.0.

@ccassar

ccassar commented Jan 9, 2017

@jwilder I'm guessing this will be no surprise, but we're seeing this in v1.1 too. Assuming the root cause is understood, is there some workaround while we wait for the fix? The reason I ask is that it interferes with our workflow when testing at scale and needing to reclaim space between iterations.

@jwilder
Contributor

jwilder commented Jan 9, 2017

This issue is closed because it was caused by a deadlock and some serial deletion code, both of which have since been fixed. If you are having issues with deletes, please log a new issue with details.
