InfluxDB starts consuming all available disk space / compaction errors #8417
Comments
Can you show how you started the container? Is the data dir writeable? That error indicates that we could not create a tmp file in the data dir to write a snapshot to disk. The root error looks like it's masked, unfortunately. Your disk may be filling up due to wal segments not being able to be compacted. What do the contents of your wal dir look like?
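(For context, both checks can be done roughly like this; the container name `influxdb` and the default `/var/lib/influxdb` layout are assumptions, so adjust to your setup:)

```bash
# Verify the data dir is writeable from inside the container
docker exec influxdb touch /var/lib/influxdb/data/.write-test
docker exec influxdb rm /var/lib/influxdb/data/.write-test

# Recursively list the wal segment files per database/retention policy/shard
docker exec influxdb ls -laR /var/lib/influxdb/wal
```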
The errors in the OP were logged about 30 minutes after starting InfluxDB with clean data/wal directories. Here is the data dir usage:
And wal:
The data directory for telegraf-dev, under the autogen/13 directory, is full of tsm and tsm.tmp files:
Here is the count of files:
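(The counts can be gathered roughly like this; the path assumes the default data dir and the shard directory named above:)

```bash
# Count finished .tsm files vs. leftover .tsm.tmp files in the affected shard
find /var/lib/influxdb/data/telegraf-dev/autogen/13 -name '*.tsm' | wc -l
find /var/lib/influxdb/data/telegraf-dev/autogen/13 -name '*.tsm.tmp' | wc -l
```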
The log was full of these errors before running out of disk space:
This is the compose file used to start the container:
Here is the df output from inside the container:
And the ps output as well:
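(Both can be collected like this, assuming the container is named `influxdb`; a minimal image may not ship a ps binary:)

```bash
# Disk usage and process list as seen from inside the container
docker exec influxdb df -h
docker exec influxdb ps aux
```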
#8416 fixes some issues where tmp files could linger, which should help here. There is something else going on that is preventing snapshots from getting written, though. I'll need to fix the error that is returned so that the root issue is not masked.
#8420 will fix the error getting masked so we can see why the snapshot is failing to be written. There are some other fixes in there for problems that arise when this error occurs.
@garceri From your comment, can you attach one of the recent tsm files?
Here is the TSM file. I'm restarting the container, this time w/o the rancher-ebs plugin volumes.
@garceri I found the issue. Do you know how the … measurement is being written?
I figured it out. It can occur via the HTTP API as well. As a workaround until this is fixed, I would disable whatever is writing that measurement.
Okay, yeah, that measurement is taken from Telegraf using the http_json input. I'm going to check the output directly to see if there is an error with Telegraf or OrientDB itself.
@garceri You won't see an error, as the write is accepted by the DB, but it shouldn't be.
Okay, but I should be seeing the irregularly formatted data in the HTTP output from OrientDB, right?
Yeah, you should be able to see the writes to the DB. The field name starts with …
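(One way to spot the malformed key without digging through the raw HTTP output is to list the field keys InfluxDB has stored. This is only a sketch: it assumes the database is telegraf-dev, as in the directory listing above, and that the influx CLI can reach the server:)

```bash
# List every stored field key per measurement in the telegraf-dev database;
# the irregularly formatted key should stand out among the normal ones.
influx -database 'telegraf-dev' -execute 'SHOW FIELD KEYS'
```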
@jwilder should some kind of checking be implemented in InfluxDB to prevent these errors? Maybe discarding improperly formatted metrics or something?
Okay, compiled and built a Docker image for your branch. I'm gonna leave it running overnight and check on it occasionally. I should be seeing compaction notification messages in the log, right?
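(For anyone following along, building from the fix branch looks roughly like this. This is a sketch only: the exact branch name isn't given in the thread, so it fetches the head of PR #8420 instead; 2017-era InfluxDB expects the repo under $GOPATH, and you may also need to restore its vendored dependencies per the project's CONTRIBUTING.md:)

```bash
# Fetch the PR #8420 branch and build the influxd binary from it
mkdir -p "$GOPATH/src/github.com/influxdata"
cd "$GOPATH/src/github.com/influxdata"
git clone https://github.com/influxdata/influxdb.git && cd influxdb
git fetch origin pull/8420/head:pr-8420 && git checkout pr-8420
go build ./cmd/influxd    # copy the resulting binary into your Docker image
```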
Compiled your branch, still no luck. I keep getting:
After 6 hours, not a single message indicating that compaction has taken place, and it has already consumed 90 GB of space.
You probably need to remove the wal segments for that problem shard. They still have the bad writes and won't be dropped on their own at this point. They are reloaded at startup, and snapshotting them will still be attempted. If you see that error for tsm files at startup, those files would need to be removed as well.
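(A sketch of that cleanup, assuming the wal volume is reachable at the default /var/lib/influxdb path, the container is named `influxdb`, and the problem shard is telegraf-dev/autogen/13 from earlier in the thread; adjust paths to wherever your volumes are actually mounted:)

```bash
# Stop InfluxDB, remove the wal segments holding the bad writes, then restart.
# Note: this discards any data in those segments that was never snapshotted.
docker stop influxdb
rm /var/lib/influxdb/wal/telegraf-dev/autogen/13/*.wal
docker start influxdb
```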
I cleaned up the whole data/meta/wal directories before starting InfluxDB again with your fixes.
Hmm. Can you attach another one of the 00001.tsm files? Also, what commit did you build off of?
Ah. I think I know what I missed. I need to update the PR.
@garceri Were you able to verify the build commit you are testing?
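(One quick way to check, assuming the container is named `influxdb`:)

```bash
# Prints the version plus the branch/commit the binary was built from
# (release builds embed these via build flags; a plain go build may show "unknown")
docker exec influxdb influxd version
```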
@garceri Great! Thanks for your help!
I'm experiencing this bug with InfluxDB 1.3. I get these errors for a while, and eventually I get:
This appears to happen only when my volume is backed by a Samba mount. I'll create a new issue for this (#9065).
Bug report
System info:
Nightly build from 2017-05-18
Running in a Docker container under Rancher
EBS gp2 storage backend (separate devices for WAL and TSM)
Using Telegraf to inject stats from different Rancher environments
Steps to reproduce:
Expected behavior:
These errors should not appear
Disk space usage seems exaggerated in one of the databases (telegraf-dev, the one with the highest usage)
Actual behavior:
Errors start appearing in the logs
Disk space utilization skyrockets
Additional info:
vars.txt