InfluxDB starts consuming all available disk space / compaction errors #8417
Comments
Can you show how you started the container? Is the data dir writeable? That error indicates that we could not create a tmp file in the data dir to write a snapshot to disk. The root error looks like it's masked, unfortunately. Your disk may be filling up due to wal segments not being able to be compacted. What do the contents of your wal dir look like?
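(For context, both checks can be done roughly like this; the container name `influxdb` and the default `/var/lib/influxdb` layout are assumptions, so adjust to your setup:)

```bash
# Verify the data dir is writeable from inside the container
docker exec influxdb touch /var/lib/influxdb/data/.write-test
docker exec influxdb rm /var/lib/influxdb/data/.write-test

# Recursively list the wal segment files per database/retention policy/shard
docker exec influxdb ls -laR /var/lib/influxdb/wal
```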
The errors in the OP were logged about 30 minutes after starting InfluxDB with clean data/wal directories. Here is the data dir usage:
And wal:
The data directory for telegraf-dev, under the autogen/13 directory, is full of tsm and tsm.tmp files:
Here is the count of files:
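(The counts can be gathered roughly like this; the path assumes the default data dir and the shard directory named above:)

```bash
# Count finished .tsm files vs. leftover .tsm.tmp files in the affected shard
find /var/lib/influxdb/data/telegraf-dev/autogen/13 -name '*.tsm' | wc -l
find /var/lib/influxdb/data/telegraf-dev/autogen/13 -name '*.tsm.tmp' | wc -l
```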
The log was full of these errors before running out of disk space:
This is the compose file used to start the container:
Here is the df output from inside the container:
And the ps output as well:
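(Both can be collected like this, assuming the container is named `influxdb`; a minimal image may not ship a ps binary:)

```bash
# Disk usage and process list as seen from inside the container
docker exec influxdb df -h
docker exec influxdb ps aux
```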
#8416 fixes some issues where tmp files could linger, which should help here. There is something else going on that is preventing snapshots from getting written, though. I'll need to fix the error that is returned so that the root issue is not masked.
#8420 will fix the error getting masked so we can see why the snapshot is failing to be written. There are some other fixes in there for problems that arise when this error occurs.
@garceri From your comment, can you attach one of the recent tsm files?
Here is the TSM file. I'm restarting the container, this time w/o the rancher-ebs plugin volumes.
@garceri I found the issue. Do you know how the … measurement is being written?
I figured it out. It can occur via the HTTP API as well. As a workaround until this is fixed, I would disable whatever is writing that measurement.
Okay, yeah, that measurement is taken from Telegraf using the http_json input. I'm going to check the output directly to see if there is an error with Telegraf or OrientDB itself.
@garceri You won't see an error, as the write is accepted by the DB, but it shouldn't be.
Okay, but I should be seeing the irregularly formatted data in the HTTP output from OrientDB, right?
Yeah, you should be able to see the writes to the DB. The field name starts with …
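(One way to spot the malformed key without digging through the raw HTTP output is to list the field keys InfluxDB has stored. This is only a sketch: it assumes the database is telegraf-dev, as in the directory listing above, and that the influx CLI can reach the server:)

```bash
# List every stored field key per measurement in the telegraf-dev database;
# the irregularly formatted key should stand out among the normal ones.
influx -database 'telegraf-dev' -execute 'SHOW FIELD KEYS'
```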
@jwilder should some kind of checking be implemented in InfluxDB to prevent these errors? Maybe discarding improperly formatted metrics or something?
Okay, compiled and built a Docker image for your branch. I'm gonna leave it running overnight and check on it occasionally. I should be seeing compaction notification messages in the log, right?
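(For anyone following along, building from the fix branch looks roughly like this. This is a sketch only: the exact branch name isn't given in the thread, so it fetches the head of PR #8420 instead; 2017-era InfluxDB expects the repo under $GOPATH, and you may also need to restore its vendored dependencies per the project's CONTRIBUTING.md:)

```bash
# Fetch the PR #8420 branch and build the influxd binary from it
mkdir -p "$GOPATH/src/github.com/influxdata"
cd "$GOPATH/src/github.com/influxdata"
git clone https://github.com/influxdata/influxdb.git && cd influxdb
git fetch origin pull/8420/head:pr-8420 && git checkout pr-8420
go build ./cmd/influxd    # copy the resulting binary into your Docker image
```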
Compiled your branch, still no luck. I keep getting:
After 6 hours, not a single message indicating that compaction has taken place, and it has already consumed 90 GB of space.
You probably need to remove the wal segments for that problem shard. They still have the bad writes and won't be dropped on their own at this point. They are reloaded at startup, and snapshotting them will still be attempted. If you see that error for tsm files at startup, those files would need to be removed as well.
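(A sketch of that cleanup, assuming the wal volume is reachable at the default /var/lib/influxdb path, the container is named `influxdb`, and the problem shard is telegraf-dev/autogen/13 from earlier in the thread; adjust paths to wherever your volumes are actually mounted:)

```bash
# Stop InfluxDB, remove the wal segments holding the bad writes, then restart.
# Note: this discards any data in those segments that was never snapshotted.
docker stop influxdb
rm /var/lib/influxdb/wal/telegraf-dev/autogen/13/*.wal
docker start influxdb
```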
I cleaned up the whole data/meta/wal directories before starting InfluxDB again with your fixes.
Hmm. Can you attach another one of the 00001.tsm files? Also, what commit did you build off of?
Ah. I think I know what I missed. I need to update the PR.
@garceri Were you able to verify the build commit you are testing?
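(One quick way to check, assuming the container is named `influxdb`:)

```bash
# Prints the version plus the branch/commit the binary was built from
# (release builds embed these via build flags; a plain go build may show "unknown")
docker exec influxdb influxd version
```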
@garceri Great! Thanks for your help!
I'm experiencing this bug with InfluxDB 1.3. I get these errors for a while, and eventually I get:
This appears to happen only when my volume is backed by a Samba mount. I'll create a new issue for this (#9065).
Bug report
System info:
Nightly build from 2017-05-18
Running in a Docker container under Rancher
EBS gp2 storage backend (separate devices for WAL and TSM)
Using Telegraf to inject stats from different Rancher environments
Steps to reproduce:
Expected behavior:
These errors should not appear
Disk space usage seems exaggerated in one of the databases (telegraf-dev, the one with the highest usage)
Actual behavior:
Errors start appearing in the logs
Disk space utilization skyrockets
Additional info:
vars.txt