Compaction exhausting disk resources in InfluxDB 1.2.2-1 #8368
Comments
It looks like it is trying to recompact all of your shards. From the goroutine.txt it looks like you have 12 compactions running, but I guess you have even more, or some of your shards are larger than the others. You might want to try the latest nightly build, as you can now limit the number of compactions running concurrently (#8276). If you set this to a low enough number, you can also bound the disk space InfluxDB needs for tmp files (number of concurrent compactions * max shard size >= tmp disk usage).
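For reference, a rough sketch of what that could look like; the setting name and its placement under [data] are my reading of #8276 rather than something confirmed against the nightly, and the shard size in the estimate is a placeholder:

```sh
# Print the effective config from the nightly build and look for the compaction limit
# (the setting name is an assumption based on #8276 -- verify against your build)
influxd config | grep -i -A 2 compact

# In /etc/influxdb/influxdb.conf this would live under [data], e.g.:
#   max-concurrent-compactions = 2
#
# Worst-case tmp usage then roughly follows the formula above:
#   2 concurrent compactions x ~50 GB max shard size ~= 100 GB of *.tsm.tmp at a time
```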
Even if it were compacting all shards, would it be expected that it could nearly double the on-disk usage? I wouldn't expect that to be the case, given that the logs indicate some compactions do complete.
@dzr0001 Yes, compactions write out new TSM files, which can increase disk usage while they are running. TSM files are immutable, so the existing ones are left unchanged while compactions run. @hpbieker's suggestion might be your best option currently. Can you attach the logs?
@jwilder I'll sanitize them and attach momentarily. As best I can tell there are currently 29 compactions running, but shouldn't I see fewer of these tmp files?
How many folders have .tsm.tmp files? And what is the size of these folders (excluding the .tsm.tmp files)?
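A hedged way to answer both questions, assuming the default data directory of /var/lib/influxdb/data and GNU find/du:

```sh
# Count the shard directories that currently contain in-progress compaction output
find /var/lib/influxdb/data -name '*.tsm.tmp' -printf '%h\n' | sort -u | wc -l

# Size of each of those directories, excluding the temporary files themselves
for d in $(find /var/lib/influxdb/data -name '*.tsm.tmp' -printf '%h\n' | sort -u); do
  du -sh --exclude='*.tsm.tmp' "$d"
done
```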
@dzr0001 The
This log includes all entries after the latest restart.
@dzr0001 Do you have any tombstone files?
@jwilder Yes, 767 of them across several shards.
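For anyone hitting the same thing, a quick way to get that count (again assuming the default data directory):

```sh
# Total tombstone files across all shards
find /var/lib/influxdb/data -name '*.tombstone' | wc -l

# Which shard directories they live in
find /var/lib/influxdb/data -name '*.tombstone' -printf '%h\n' | sort -u
```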
Ok, that might be why all the shards are triggering a compaction at once. There must have been a delete or drop statement run that required data to be removed from many shards. From the logs, the shards look like they are fully compacted/optimized, so the deletes are triggering the new compactions.
I mentioned deleting some data in the original report. It was 30d of a smaller measurement that was deleted. Triggering compactions across all shards like that makes it difficult, if not impossible, to drop any data. I suspect this would exhibit the same behavior if the data were deleted as a result of a retention policy expiring old data.
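For context, a hypothetical example of the kind of statement described above; the database, measurement name, and time bounds are invented and are not the actual query from this report:

```sh
# A bounded delete like this tombstones matching data inside existing TSM files
# instead of dropping whole shards
influx -database 'telegraf' -execute \
  "DELETE FROM \"some_measurement\" WHERE time >= '2017-03-01T00:00:00Z' AND time < '2017-03-31T00:00:00Z'"
```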
Sorry, missed that in the issue description. Retention policy deletion removes the whole shard and is a different code path from deleting an individual measurement or series, so you would not see this when dropping a shard manually. If you deleted data that should only exist in a single shard and it tombstoned data in other shards, that would be a bug, since it would cause those shards to be recompacted unnecessarily.
I would expect my deletions to fall within 4 shards, as the shard group duration is 7 days. There was little data in this before 9/1/2016, and my specific query was
Shard info:
This instance was originally installed with 0.13, but 1.0 was installed on 9/12, in case that has any bearing.
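A small sketch of how to confirm the shard group duration and shard time ranges mentioned above, using standard InfluxQL through the CLI; the database name is a guess:

```sh
# Shard group duration for each retention policy
influx -execute 'SHOW RETENTION POLICIES ON "telegraf"'

# Shard IDs with their start and end times, to see which shards the delete should touch
influx -execute 'SHOW SHARDS'
```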
@dzr0001 I was able to repro what you are seeing. Tombstones are getting written for TSM files that do not contain the deleted data, which triggers many shard compactions. #8372 should fix it.
@jwilder Excellent, this will be very helpful. For the existing tombstones, is there any way to clean those up, other than using the nightly build to limit compactions until they complete?
While your server is down, you can remove the tombstone files from shards that are outside of the time range you deleted. Removing them will cause any deleted data contained in those tombstone files to reappear after a restart, though.
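A hedged sketch of that cleanup, assuming a systemd-managed service and the default data directory; the <db>/<rp>/<shard_id> path is a placeholder for a shard known to be outside the deleted time range:

```sh
# Stop the daemon before touching shard files
systemctl stop influxdb

# Remove tombstones ONLY from shards known to be outside the deleted time range;
# any deleted points those tombstones covered will reappear after restart
find /var/lib/influxdb/data/<db>/<rp>/<shard_id> -name '*.tombstone' -delete

systemctl start influxdb
```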
Ok, great. Thanks for identifying the culprit so quickly. Would these long compaction cycles also contribute to higher-than-anticipated memory utilization? I was running into some OOM issues, but I am not able to reproduce those right now while no clients are using this database.
Bug report
System info:
CentOS 7.3
InfluxDB 1.2.2
KVM virtual machine with 20 CPUs and 128 GB RAM
Steps to reproduce:
Expected behavior:
Compactions run without exhausting available disk space.
Actual behavior:
This system currently has 3T of SSD storage available to InfluxDB. Since I've been unable to keep this online, I've moved all but 6 telegraf agents to a new system.
After restarting InfluxDB, about 1.3T of space is free, but after just 14 hours of running with only the 6 telegraf agents writing, all 3T of disk space has been used. It drops back down to 1.3T free again after a restart. This all seems to be a result of compaction, and the number of "*.tsm.tmp" files never seems to drop.
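A hedged way to watch that growth as it happens (default install paths assumed):

```sh
# Free space on the data volume plus the current count of in-progress compaction
# files, sampled once a minute
watch -n 60 \
  'df -h /var/lib/influxdb; find /var/lib/influxdb/data -name "*.tsm.tmp" | wc -l'
```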
Additional info:
I'm unsure if compaction logs are useful here. The only recent changes are upgrading from 1.1.1 and deleting 30d of a single measurement.
This debug output was taken immediately after a restart, as I previously had pprof disabled. The attached files are listed below; a sketch of how such output can be gathered follows the list.
block
goroutine
heap
vars
iostat
shards
stats
diagnostics
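For reference, a hedged sketch of how debug output like the above is typically gathered; it assumes the default HTTP bind address and that pprof is enabled, and the file names simply mirror the list:

```sh
# Runtime profiles and expvar counters from the HTTP API
curl -s "http://localhost:8086/debug/pprof/block?debug=1"     > block.txt
curl -s "http://localhost:8086/debug/pprof/goroutine?debug=1" > goroutine.txt
curl -s "http://localhost:8086/debug/pprof/heap?debug=1"      > heap.txt
curl -s "http://localhost:8086/debug/vars"                    > vars.txt

# Disk activity plus database-level views
iostat -xm 5 3 > iostat.txt
influx -execute 'SHOW SHARDS'      > shards.txt
influx -execute 'SHOW STATS'       > stats.txt
influx -execute 'SHOW DIAGNOSTICS' > diagnostics.txt
```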
InfluxDB has now been running for 20 minutes with no agents writing. Disk utilization has grown by about 100G and there are now 60 tmp files.
Thanks in advance.