
influxdb out of memory #13318

Open
crjddd opened this issue Apr 11, 2019 · 35 comments

@crjddd

crjddd commented Apr 11, 2019

### 1. Description:
When running InfluxDB, the VIRT memory consumption increases rapidly, and the process was eventually killed by the OOM killer.

### 2. Environment:

[Docker version]: Docker version 1.13.1, build 07f3374/1.13.1
[Docker run command]: docker run -it -d --network host -v /var/lib/influxdb/:/opt/host --memory 10g dfaa91697202 /bin/bash -c "sleep 10000000"
[influxdb version]: InfluxDB shell version: 1.7.0~n201808230800

### [conf]:
influxdb_conf.TXT

[influx logs]:

influx_log_1.zip
Note: the log timestamps are offset by 8 hours (time zone difference).

[disk info]:

I monitored the size of the data, as well as the memory changes, as shown below:

docker_memory.zip

---------------------------top begin 54004-----------------
top - 11:46:09 up 16 days, 3:09, 14 users, load average: 11.29, 10.57, 10.36
Tasks: 1 total, 0 running, 1 sleeping, 0 stopped, 0 zombie
%Cpu(s): 16.1 us, 1.1 sy, 0.0 ni, 82.7 id, 0.0 wa, 0.0 hi, 0.1 si, 0.0 st
KiB Mem : 26345912+total, 1652376 free, 54648416 used, 20715833+buff/cache
KiB Swap: 4194300 total, 3644 free, 4190656 used. 20497331+avail Mem

  PID USER PR NI  VIRT   RES   SHR    S %CPU %MEM   TIME+  COMMAND
54004 root 20  0  57.4g 10.0g 154332 S 93.8  4.0  1251:44 influxd
---------------------------top end 54004-----------------

[messages log]:

Apr 11 11:48:01 psinsight-112 kernel: influxd invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0
Apr 11 11:48:01 psinsight-112 kernel: influxd cpuset=docker-f1207e65211b540e841d5d13a7b66c393ddde902d0c80a3bfa215a4c16754418.scope mems_allowed=0-1
Apr 11 11:48:01 psinsight-112 kernel: CPU: 21 PID: 54013 Comm: influxd Kdump: loaded Tainted: G ------------ T 3.10.0-957.1.3.el7.x86_64 #1
Apr 11 11:48:01 psinsight-112 kernel: Hardware name: Huawei 2288H V5/BC11SPSCB0, BIOS 0.68 05/03/2018
Apr 11 11:48:01 psinsight-112 kernel: Call Trace:
Apr 11 11:48:01 psinsight-112 kernel: [] dump_stack+0x19/0x1b
Apr 11 11:48:01 psinsight-112 kernel: [] dump_header+0x90/0x229
Apr 11 11:48:01 psinsight-112 kernel: [] ? find_lock_task_mm+0x56/0xc0
Apr 11 11:48:01 psinsight-112 kernel: [] ? try_get_mem_cgroup_from_mm+0x28/0x60
Apr 11 11:48:01 psinsight-112 kernel: [] oom_kill_process+0x254/0x3d0
Apr 11 11:48:01 psinsight-112 kernel: [] ? selinux_capable+0x1c/0x40
Apr 11 11:48:01 psinsight-112 kernel: [] mem_cgroup_oom_synchronize+0x546/0x570
Apr 11 11:48:01 psinsight-112 kernel: [] ? mem_cgroup_charge_common+0xc0/0xc0
Apr 11 11:48:01 psinsight-112 kernel: [] pagefault_out_of_memory+0x14/0x90
Apr 11 11:48:01 psinsight-112 kernel: [] mm_fault_error+0x6a/0x157
Apr 11 11:48:01 psinsight-112 kernel: [] __do_page_fault+0x3c8/0x500
Apr 11 11:48:01 psinsight-112 kernel: [] do_page_fault+0x35/0x90
Apr 11 11:48:01 psinsight-112 kernel: [] page_fault+0x28/0x30
Apr 11 11:48:01 psinsight-112 kernel: Task in /system.slice/docker-f1207e65211b540e841d5d13a7b66c393ddde902d0c80a3bfa215a4c16754418.scope killed as a result of limit of /system.slice/docker-f1207e65211b540e841d5d13a7b66c393ddde902d0c80a3bfa215a4c16754418.scope
Apr 11 11:48:01 psinsight-112 kernel: memory: usage 10485760kB, limit 10485760kB, failcnt 159873656
Apr 11 11:48:01 psinsight-112 kernel: memory+swap: usage 10504544kB, limit 20971520kB, failcnt 0
Apr 11 11:48:01 psinsight-112 kernel: kmem: usage 0kB, limit 9007199254740988kB, failcnt 0
Apr 11 11:48:01 psinsight-112 kernel: Memory cgroup stats for /system.slice/docker-f1207e65211b540e841d5d13a7b66c393ddde902d0c80a3bfa215a4c16754418.scope: cache:168KB rss:10485592KB rss_huge:0KB mapped_file:4KB swap:18784KB inactive_anon:1623952KB active_anon:8861888KB inactive_file:112KB active_file:0KB unevictable:0KB
Apr 11 11:48:01 psinsight-112 kernel: [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
Apr 11 11:48:01 psinsight-112 kernel: [44993] 0 44993 1078 11 8 7 0 sleep
Apr 11 11:48:01 psinsight-112 kernel: [50116] 0 50116 2943 56 11 40 0 bash
Apr 11 11:48:01 psinsight-112 kernel: [54004] 0 54004 15101796 2622256 18846 4497 0 influxd
Apr 11 11:48:01 psinsight-112 kernel: [54423] 0 54423 685493 1284 121 234 0 influx
Apr 11 11:48:01 psinsight-112 kernel: Memory cgroup out of memory: Kill process 223248 (influxd) score 720 or sacrifice child
Apr 11 11:48:01 psinsight-112 kernel: Killed process 54004 (influxd) total-vm:60407184kB, anon-rss:10479648kB, file-rss:9376kB, shmem-rss:0kB

@crjddd
Author

crjddd commented Apr 11, 2019

@dgnorton
Contributor

@crjddd have you tested to see if this happens on 1.7.5?

@crjddd
Author

crjddd commented Apr 12, 2019

@dgnorton I haven't tried version 1.7.5 yet; I'll try it today.

@crjddd
Author

crjddd commented Apr 16, 2019

@dgnorton I've tried it with version 1.7.5, but the result is the same.
influx_0412.log

@joydashy

joydashy commented Jul 5, 2019

I am also seeing this issue running InfluxDB in a Proxmox LXC. The container has plenty of memory assigned (4 GB), but the kernel still kills InfluxDB for lack of memory.

@Apollon77

Maybe also see #13605.

My experience is that 1.7.6 is very bad; 1.7.4 is better, but also not perfect.
On 1.7.4 everything is fine as long as disk I/O is "fast". As soon as there is a short period of "slow" I/O, 1.7.4 also starts eating memory and does not release it; after a restart it is fine again.

I run InfluxDB in a VM whose images are stored on a GlusterFS filesystem. As long as everything runs normally, everything is fine. When I reboot one of my fs-cluster machines and it starts resyncing, I/O usually gets slower and InfluxDB starts needing more memory.

@ghost

ghost commented Jul 5, 2019

I have the same problem: if I try to query more data than the server has memory, InfluxDB gets killed by the system.
Not only that, all queried data is held in memory, so it is easy to run out of memory and get killed by the system.

@LastStupid

Same thing on 1.7.7.
No special config, but a lot of data. I purged all my data and still have the same problem: the server works for 15 minutes and then loses its network connection, but it does not reboot (I don't have direct access to it).

Just before the server crash:
[screenshot: 2019-08-02 07_02_57-Greenshot image editor]

@mqu

mqu commented Aug 26, 2019

Same issue with 1.7.7; I downgraded to 1.7.3, which seems to be OK with the same data and configuration.

@LastStupid

OK, it looks like I had misunderstood shards and retention policies. I made several changes to the shard settings and it lowered the memory usage. For now Influx is running. Try lowering the shard duration. Thanks.
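
For reference, a hedged sketch of how the shard group duration can be lowered for an existing retention policy; the database and policy names here are illustrative, and the change only applies to newly created shard groups:

> ALTER RETENTION POLICY "autogen" ON "mydb" SHARD DURATION 1d
> SHOW RETENTION POLICIES ON "mydb"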

@lobocobra

lobocobra commented Nov 11, 2019

Hi,
Which version actually works? I just want my home automation to stop crashing every 6 hours because it runs out of memory. Which version do you recommend, 1.6.4?

It worked well for a year, then I had the bad idea to upgrade InfluxDB beyond 1.7, and since then things have gone out of control. :-(
I do not need any fancy features, just a stable version.

Thanks for any feedback. (PS: yes, I tried to move from TSM to TSI, but after 4 hours I aborted; it did not even show a progress bar.)
And does anyone know a good howto for downgrading?
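
For reference, a hedged sketch of the usual inmem-to-TSI conversion path on a 1.x install (paths are the Debian/RPM defaults and may differ on your system; influxd should be stopped first and the generated index files must be owned by the influxdb user):

# stop influxd, then rebuild the TSI index from the existing TSM/WAL data
sudo systemctl stop influxdb
sudo -u influxdb influx_inspect buildtsi -datadir /var/lib/influxdb/data -waldir /var/lib/influxdb/wal
# set index-version = "tsi1" under [data] in influxdb.conf before starting again
sudo systemctl start influxdb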

@lobocobra

OK, it looks like I had misunderstood shards and retention policies. I made several changes to the shard settings and it lowered the memory usage. For now Influx is running. Try lowering the shard duration. Thanks.

Hey... and how do I lower the shard, please?

@LastStupid

LastStupid commented Nov 12, 2019 via email

@lobocobra

Many thanks for the hint, I will try to find out the rest.

@linuxmail

linuxmail commented Nov 22, 2019

Hi,

I also have an InfluxDB instance on the latest version:

Environment info:

  • ii influxdb 1.7.9-1 amd64 Distributed time-series database.

  • Debian Stretch VM

  • 32GB RAM

  • 8 Core CPU

  • Hypervisor (Proxmox) System on a ZPOOL

  • Memory

 free -h
              total        used        free      shared  buff/cache   available
Mem:            31G         17G        358M         10M         13G         13G
Swap:          2.0G        397M        1.6G

Actual behavior:

 dmesg -T | grep Out
[Wed Nov 20 19:34:54 2019] Out of memory: Kill process 29666 (influxd) score 972 or sacrifice child
[Thu Nov 21 02:38:29 2019] Out of memory: Kill process 7752 (influxd) score 973 or sacrifice child
[Thu Nov 21 07:36:10 2019] Out of memory: Kill process 18827 (influxd) score 975 or sacrifice child
[Thu Nov 21 08:38:09 2019] Out of memory: Kill process 28339 (influxd) score 972 or sacrifice child
[Thu Nov 21 13:37:15 2019] Out of memory: Kill process 17285 (influxd) score 975 or sacrifice child
[Thu Nov 21 14:38:02 2019] Out of memory: Kill process 27934 (influxd) score 973 or sacrifice child
[Thu Nov 21 17:36:19 2019] Out of memory: Kill process 28771 (influxd) score 972 or sacrifice child
[Thu Nov 21 18:37:07 2019] Out of memory: Kill process 6392 (influxd) score 973 or sacrifice child
[Thu Nov 21 20:34:48 2019] Out of memory: Kill process 27214 (influxd) score 972 or sacrifice child
[Thu Nov 21 23:38:52 2019] Out of memory: Kill process 25614 (influxd) score 972 or sacrifice child
[Fri Nov 22 01:36:32 2019] Out of memory: Kill process 16500 (influxd) score 972 or sacrifice child
[Fri Nov 22 02:35:09 2019] Out of memory: Kill process 27449 (influxd) score 973 or sacrifice child
[Fri Nov 22 03:39:17 2019] Out of memory: Kill process 6413 (influxd) score 972 or sacrifice child
[Fri Nov 22 05:38:32 2019] Out of memory: Kill process 15884 (influxd) score 970 or sacrifice child
...

Config:

# This config file is managed by Puppet
#

reporting-disabled = true

[meta]
  enabled = true
  dir = "/var/lib/influxdb/meta"
  bind-address = "graph-01.example.com:8088"
  http-bind-address = "graph-01.example.com:8091"
  retention-autocreate = true
  election-timeout = "1s"
  heartbeat-timeout = "1s"
  leader-lease-timeout = "500ms"
  commit-timeout = "50ms"
  cluster-tracing = false

[data]
  enabled = true
  dir = "/var/lib/influxdb/data"
  wal-dir = "/var/lib/influxdb/wal"
  wal-logging-enabled = true
  trace-logging-enabled = false
  query-log-enabled = false
  max-series-per-database = 1000000
  wal-fsync-delay = "50ms"
  index-version = "tsi1"

[hinted-handoff]
  enabled = true
  dir = "/var/lib/influxdb/hh"
  max-size = 1073741824
  max-age = "168h"
  retry-rate-limit = 0
  retry-interval = "1s"
  retry-max-interval = "1m"
  purge-interval = "1h"

[coordinator]
  write-timeout = "10s"
  query-timeout = "0"
  log-queries-after = "0"
  max-select-point = 0
  max-select-series = 0
  max-select-buckets = 0

[retention]
  enabled = true
  check-interval = "30m"

[shard-precreation]
  enabled = true
  check-interval = "10m"
  advance-period = "30m"

[monitor]
  store-enabled = true
  store-database = "_internal"
  store-interval = "10s"

[admin]
  enabled = false
  bind-address = "127.0.0.0:8088"
  https-enabled = true
  https-certificate = "/etc/ssl/private/chain.crt"

[http]
  enabled = true
  bind-address = "graph-01.example.com:8086"
  auth-enabled = true
  log-enabled = false
  write-tracing = false
  pprof-enabled = false
  https-enabled = true
  https-certificate = "/etc/ssl/private/chain.crt"
  https-private-key = "/etc/ssl/private/key"
  max-row-limit = 10000
  realm = "InfluxDB"

[subscriber]
  enabled = true
  http-timeout = "30s"

[[graphite]]
  enabled = false

[[collectd]]
  enabled = false

[[opentsdb]]
  enabled = false

[[udp]]
  enabled = false

[continuous_queries]
  enabled = true
  log-enabled = true

Logs:
Logs from shortly before the kill until the kill:

Nov 22 05:37:53 graph-01 influxd[15884]: ts=2019-11-22T04:37:53.619661Z lvl=info msg="TSM compaction (end)" log_id=0JGIjxO0000 engine=tsm1 tsm1_level=1 tsm1_strategy=level trace_id=0JGMHEnW000 op_name=tsm1_compact_group op_event=end op_elapsed=2184.880ms
Nov 22 05:37:54 graph-01 influxd[15884]: ts=2019-11-22T04:37:54.438445Z lvl=info msg="TSM compaction (start)" log_id=0JGIjxO0000 engine=tsm1 tsm1_level=2 tsm1_strategy=level trace_id=0JGMHQWW000 op_name=tsm1_compact_group op_event=start
Nov 22 05:37:54 graph-01 influxd[15884]: ts=2019-11-22T04:37:54.438492Z lvl=info msg="Beginning compaction" log_id=0JGIjxO0000 engine=tsm1 tsm1_level=2 tsm1_strategy=level trace_id=0JGMHQWW000 op_name=tsm1_compact_group tsm1_files_n=4
Nov 22 05:37:54 graph-01 influxd[15884]: ts=2019-11-22T04:37:54.438505Z lvl=info msg="Compacting file" log_id=0JGIjxO0000 engine=tsm1 tsm1_level=2 tsm1_strategy=level trace_id=0JGMHQWW000 op_name=tsm1_compact_group tsm1_index=0 tsm1_file=/var/lib/influxdb/data/telegraf/autogen/1557/000020278-000000002.tsm
Nov 22 05:37:54 graph-01 influxd[15884]: ts=2019-11-22T04:37:54.438516Z lvl=info msg="Compacting file" log_id=0JGIjxO0000 engine=tsm1 tsm1_level=2 tsm1_strategy=level trace_id=0JGMHQWW000 op_name=tsm1_compact_group tsm1_index=1 tsm1_file=/var/lib/influxdb/data/telegraf/autogen/1557/000020286-000000002.tsm
Nov 22 05:37:54 graph-01 influxd[15884]: ts=2019-11-22T04:37:54.438526Z lvl=info msg="Compacting file" log_id=0JGIjxO0000 engine=tsm1 tsm1_level=2 tsm1_strategy=level trace_id=0JGMHQWW000 op_name=tsm1_compact_group tsm1_index=2 tsm1_file=/var/lib/influxdb/data/telegraf/autogen/1557/000020294-000000002.tsm
Nov 22 05:37:54 graph-01 influxd[15884]: ts=2019-11-22T04:37:54.438535Z lvl=info msg="Compacting file" log_id=0JGIjxO0000 engine=tsm1 tsm1_level=2 tsm1_strategy=level trace_id=0JGMHQWW000 op_name=tsm1_compact_group tsm1_index=3 tsm1_file=/var/lib/influxdb/data/telegraf/autogen/1557/000020302-000000002.tsm
Nov 22 05:37:56 graph-01 influxd[15884]: ts=2019-11-22T04:37:56.681582Z lvl=info msg="Compacted file" log_id=0JGIjxO0000 engine=tsm1 tsm1_level=2 tsm1_strategy=level trace_id=0JGMHQWW000 op_name=tsm1_compact_group tsm1_index=0 tsm1_file=/var/lib/influxdb/data/telegraf/autogen/1557/000020302-000000003.tsm.tmp
Nov 22 05:37:56 graph-01 influxd[15884]: ts=2019-11-22T04:37:56.681644Z lvl=info msg="Finished compacting files" log_id=0JGIjxO0000 engine=tsm1 tsm1_level=2 tsm1_strategy=level trace_id=0JGMHQWW000 op_name=tsm1_compact_group tsm1_files_n=1
Nov 22 05:37:56 graph-01 influxd[15884]: ts=2019-11-22T04:37:56.684925Z lvl=info msg="TSM compaction (end)" log_id=0JGIjxO0000 engine=tsm1 tsm1_level=2 tsm1_strategy=level trace_id=0JGMHQWW000 op_name=tsm1_compact_group op_event=end op_elapsed=2249.557ms
Nov 22 05:37:57 graph-01 influxd[15884]: ts=2019-11-22T04:37:57.434054Z lvl=info msg="TSM compaction (start)" log_id=0JGIjxO0000 engine=tsm1 tsm1_level=3 tsm1_strategy=level trace_id=0JGMHbEG000 op_name=tsm1_compact_group op_event=start
Nov 22 05:37:57 graph-01 influxd[15884]: ts=2019-11-22T04:37:57.434097Z lvl=info msg="Beginning compaction" log_id=0JGIjxO0000 engine=tsm1 tsm1_level=3 tsm1_strategy=level trace_id=0JGMHbEG000 op_name=tsm1_compact_group tsm1_files_n=4
Nov 22 05:37:57 graph-01 influxd[15884]: ts=2019-11-22T04:37:57.434109Z lvl=info msg="Compacting file" log_id=0JGIjxO0000 engine=tsm1 tsm1_level=3 tsm1_strategy=level trace_id=0JGMHbEG000 op_name=tsm1_compact_group tsm1_index=0 tsm1_file=/var/lib/influxdb/data/telegraf/autogen/1557/000020206-000000003.tsm
Nov 22 05:37:57 graph-01 influxd[15884]: ts=2019-11-22T04:37:57.434120Z lvl=info msg="Compacting file" log_id=0JGIjxO0000 engine=tsm1 tsm1_level=3 tsm1_strategy=level trace_id=0JGMHbEG000 op_name=tsm1_compact_group tsm1_index=1 tsm1_file=/var/lib/influxdb/data/telegraf/autogen/1557/000020238-000000003.tsm
Nov 22 05:37:57 graph-01 influxd[15884]: ts=2019-11-22T04:37:57.434130Z lvl=info msg="Compacting file" log_id=0JGIjxO0000 engine=tsm1 tsm1_level=3 tsm1_strategy=level trace_id=0JGMHbEG000 op_name=tsm1_compact_group tsm1_index=2 tsm1_file=/var/lib/influxdb/data/telegraf/autogen/1557/000020270-000000003.tsm
Nov 22 05:37:57 graph-01 influxd[15884]: ts=2019-11-22T04:37:57.434139Z lvl=info msg="Compacting file" log_id=0JGIjxO0000 engine=tsm1 tsm1_level=3 tsm1_strategy=level trace_id=0JGMHbEG000 op_name=tsm1_compact_group tsm1_index=3 tsm1_file=/var/lib/influxdb/data/telegraf/autogen/1557/000020302-000000003.tsm
Nov 22 05:37:59 graph-01 influxd[15884]: ts=2019-11-22T04:37:59.745106Z lvl=info msg="Compacted file" log_id=0JGIjxO0000 engine=tsm1 tsm1_level=3 tsm1_strategy=level trace_id=0JGMHbEG000 op_name=tsm1_compact_group tsm1_index=0 tsm1_file=/var/lib/influxdb/data/telegraf/autogen/1557/000020302-000000004.tsm.tmp
Nov 22 05:37:59 graph-01 influxd[15884]: ts=2019-11-22T04:37:59.745313Z lvl=info msg="Finished compacting files" log_id=0JGIjxO0000 engine=tsm1 tsm1_level=3 tsm1_strategy=level trace_id=0JGMHbEG000 op_name=tsm1_compact_group tsm1_files_n=1
Nov 22 05:37:59 graph-01 influxd[15884]: ts=2019-11-22T04:37:59.745336Z lvl=info msg="TSM compaction (end)" log_id=0JGIjxO0000 engine=tsm1 tsm1_level=3 tsm1_strategy=level trace_id=0JGMHbEG000 op_name=tsm1_compact_group op_event=end op_elapsed=2311.555ms
Nov 22 05:38:00 graph-01 influxd[15884]: ts=2019-11-22T04:38:00.435480Z lvl=info msg="TSM compaction (start)" log_id=0JGIjxO0000 engine=tsm1 tsm1_strategy=full tsm1_optimize=false trace_id=0JGMHmxW000 op_name=tsm1_compact_group op_event=start
Nov 22 05:38:00 graph-01 influxd[15884]: ts=2019-11-22T04:38:00.435537Z lvl=info msg="Beginning compaction" log_id=0JGIjxO0000 engine=tsm1 tsm1_strategy=full tsm1_optimize=false trace_id=0JGMHmxW000 op_name=tsm1_compact_group tsm1_files_n=4
Nov 22 05:38:00 graph-01 influxd[15884]: ts=2019-11-22T04:38:00.435561Z lvl=info msg="Compacting file" log_id=0JGIjxO0000 engine=tsm1 tsm1_strategy=full tsm1_optimize=false trace_id=0JGMHmxW000 op_name=tsm1_compact_group tsm1_index=0 tsm1_file=/var/lib/influxdb/data/telegraf/autogen/1557/000019917-000000005.tsm
Nov 22 05:38:00 graph-01 influxd[15884]: ts=2019-11-22T04:38:00.435577Z lvl=info msg="Compacting file" log_id=0JGIjxO0000 engine=tsm1 tsm1_strategy=full tsm1_optimize=false trace_id=0JGMHmxW000 op_name=tsm1_compact_group tsm1_index=1 tsm1_file=/var/lib/influxdb/data/telegraf/autogen/1557/000020045-000000004.tsm
Nov 22 05:38:00 graph-01 influxd[15884]: ts=2019-11-22T04:38:00.435592Z lvl=info msg="Compacting file" log_id=0JGIjxO0000 engine=tsm1 tsm1_strategy=full tsm1_optimize=false trace_id=0JGMHmxW000 op_name=tsm1_compact_group tsm1_index=2 tsm1_file=/var/lib/influxdb/data/telegraf/autogen/1557/000020174-000000004.tsm
Nov 22 05:38:00 graph-01 influxd[15884]: ts=2019-11-22T04:38:00.435607Z lvl=info msg="Compacting file" log_id=0JGIjxO0000 engine=tsm1 tsm1_strategy=full tsm1_optimize=false trace_id=0JGMHmxW000 op_name=tsm1_compact_group tsm1_index=3 tsm1_file=/var/lib/influxdb/data/telegraf/autogen/1557/000020302-000000004.tsm
Nov 22 05:38:10 graph-01 influxd[15884]: ts=2019-11-22T04:38:10.433330Z lvl=info msg="Cache snapshot (start)" log_id=0JGIjxO0000 engine=tsm1 trace_id=0JGMIP0G000 op_name=tsm1_cache_snapshot op_event=start
Nov 22 05:38:12 graph-01 influxd[15884]: ts=2019-11-22T04:38:12.516071Z lvl=info msg="Snapshot for path written" log_id=0JGIjxO0000 engine=tsm1 trace_id=0JGMIP0G000 op_name=tsm1_cache_snapshot path=/var/lib/influxdb/data/telegraf/autogen/1557 duration=2082.733ms
Nov 22 05:38:12 graph-01 influxd[15884]: ts=2019-11-22T04:38:12.520955Z lvl=info msg="Cache snapshot (end)" log_id=0JGIjxO0000 engine=tsm1 trace_id=0JGMIP0G000 op_name=tsm1_cache_snapshot op_event=end op_elapsed=2087.801ms
Nov 22 05:38:21 graph-01 influxd[15884]: ts=2019-11-22T04:38:21.433654Z lvl=info msg="Cache snapshot (start)" log_id=0JGIjxO0000 engine=tsm1 trace_id=0JGMJ3zG000 op_name=tsm1_cache_snapshot op_event=start
Nov 22 05:38:22 graph-01 influxd[15884]: ts=2019-11-22T04:38:22.212438Z lvl=info msg="Snapshot for path written" log_id=0JGIjxO0000 engine=tsm1 trace_id=0JGMJ3zG000 op_name=tsm1_cache_snapshot path=/var/lib/influxdb/data/telegraf/autogen/1557 duration=779.265ms
Nov 22 05:38:22 graph-01 influxd[15884]: ts=2019-11-22T04:38:22.214336Z lvl=info msg="Cache snapshot (end)" log_id=0JGIjxO0000 engine=tsm1 trace_id=0JGMJ3zG000 op_name=tsm1_cache_snapshot op_event=end op_elapsed=780.884ms
Nov 22 05:38:45 graph-01 grafana-server[650]: 2019/11/22 05:38:45 http: proxy error: read tcp 192.168.43.15:35130->192.168.43.15:8086: read: connection reset by peer
Nov 22 05:38:45 graph-01 grafana-server[650]: 2019/11/22 05:38:45 http: proxy error: read tcp 192.168.43.15:35138->192.168.43.15:8086: read: connection reset by peer
Nov 22 05:38:45 graph-01 grafana-server[650]: 2019/11/22 05:38:45 http: proxy error: read tcp 192.168.43.15:35136->192.168.43.15:8086: read: connection reset by peer
Nov 22 05:38:45 graph-01 grafana-server[650]: 2019/11/22 05:38:45 http: proxy error: read tcp 192.168.43.15:35134->192.168.43.15:8086: read: connection reset by peer
Nov 22 05:38:45 graph-01 grafana-server[650]: 2019/11/22 05:38:45 http: proxy error: read tcp 192.168.43.15:35132->192.168.43.15:8086: read: connection reset by peer
Nov 22 05:38:45 graph-01 grafana-server[650]: 2019/11/22 05:38:45 http: proxy error: dial tcp 192.168.43.15:8086: connect: connection refused

Performance:

I switched the index from inmem to tsi1, as described in the 1.7 documentation, but it doesn't help.

  • Retention policy
Using database telegraf
> SHOW RETENTION POLICIES
name       duration   shardGroupDuration replicaN default
----       --------   ------------------ -------- -------
autogen    696h0m0s   168h0m0s           1        true
rp_5_years 43680h0m0s 168h0m0s           1        false
rp_1_years 8760h0m0s  168h0m0s           1        false
  • Attached screenshot from Internal Grafana dashboard
    Bildschirmfoto von 2019-11-22 08-13-02

  • iostats
    iostat.txt

I searched the logs for older OOMs:

root@graph-01:[~]: grep Out  /var/log/kern.log
Nov 18 13:42:37 graph-01 kernel: [937687.188639] Out of memory: Kill process 20656 (influxd) score 973 or sacrifice child
Nov 18 23:39:50 graph-01 kernel: [973520.589614] Out of memory: Kill process 30087 (influxd) score 973 or sacrifice child
Nov 19 01:39:06 graph-01 kernel: [980676.353044] Out of memory: Kill process 20324 (influxd) score 972 or sacrifice child
Nov 19 08:42:15 graph-01 kernel: [1006064.706524] Out of memory: Kill process 19595 (influxd) score 974 or sacrifice child
Nov 19 17:35:59 graph-01 kernel: [1038088.389678] Out of memory: Kill process 19860 (influxd) score 972 or sacrifice child
Nov 20 03:32:33 graph-01 kernel: [1073881.957498] Out of memory: Kill process 30133 (influxd) score 973 or sacrifice child
Nov 20 04:34:16 graph-01 kernel: [1077585.372630] Out of memory: Kill process 28632 (influxd) score 973 or sacrifice child
Nov 20 06:35:28 graph-01 kernel: [1084857.403243] Out of memory: Kill process 18822 (influxd) score 972 or sacrifice child
Nov 20 08:39:48 graph-01 kernel: [1092316.604924] Out of memory: Kill process 8290 (influxd) score 973 or sacrifice child
Nov 20 16:39:19 graph-01 kernel: [1121087.465609] Out of memory: Kill process 27783 (influxd) score 971 or sacrifice child
Nov 20 19:35:05 graph-01 kernel: [1131633.910176] Out of memory: Kill process 29666 (influxd) score 972 or sacrifice child
Nov 21 02:38:40 graph-01 kernel: [1157048.594937] Out of memory: Kill process 7752 (influxd) score 973 or sacrifice child
Nov 21 07:36:21 graph-01 kernel: [1174909.678552] Out of memory: Kill process 18827 (influxd) score 975 or sacrifice child
Nov 21 08:38:21 graph-01 kernel: [1178628.907028] Out of memory: Kill process 28339 (influxd) score 972 or sacrifice child
Nov 21 13:37:26 graph-01 kernel: [1196574.396947] Out of memory: Kill process 17285 (influxd) score 975 or sacrifice child
Nov 21 14:38:14 graph-01 kernel: [1200221.951094] Out of memory: Kill process 27934 (influxd) score 973 or sacrifice child
Nov 21 17:36:30 graph-01 kernel: [1210918.243696] Out of memory: Kill process 28771 (influxd) score 972 or sacrifice child
Nov 21 18:37:18 graph-01 kernel: [1214566.231625] Out of memory: Kill process 6392 (influxd) score 973 or sacrifice child
Nov 21 20:35:00 graph-01 kernel: [1221627.670372] Out of memory: Kill process 27214 (influxd) score 972 or sacrifice child
Nov 21 23:39:03 graph-01 kernel: [1232671.168953] Out of memory: Kill process 25614 (influxd) score 972 or sacrifice child
Nov 22 01:36:43 graph-01 kernel: [1239731.099141] Out of memory: Kill process 16500 (influxd) score 972 or sacrifice child
Nov 22 02:35:21 graph-01 kernel: [1243248.971360] Out of memory: Kill process 27449 (influxd) score 973 or sacrifice child
Nov 22 03:39:29 graph-01 kernel: [1247096.781047] Out of memory: Kill process 6413 (influxd) score 972 or sacrifice child
Nov 22 05:38:44 graph-01 kernel: [1254251.216362] Out of memory: Kill process 15884 (influxd) score 970 or sacrifice child

root@graph-01:[~]: grep Out  /var/log/kern.log.1 
Nov 15 14:41:16 graph-01 kernel: [682008.560850] Out of memory: Kill process 26779 (influxd) score 976 or sacrifice child
root@graph-01:[~]: zgrep Out  /var/log/kern.log.2.gz 
Nov  4 18:19:11 graph-01 kernel: [2170194.747450] Out of memory: Kill process 857 (influxd) score 967 or sacrifice child
Nov  5 03:18:33 graph-01 kernel: [2202556.746874] Out of memory: Kill process 576 (influxd) score 969 or sacrifice child
Nov  6 02:18:40 graph-01 kernel: [2285362.889164] Out of memory: Kill process 9981 (influxd) score 967 or sacrifice child
Nov  6 18:19:08 graph-01 kernel: [2342989.964236] Out of memory: Kill process 26333 (influxd) score 969 or sacrifice child
Nov  6 21:18:32 graph-01 kernel: [2353753.831078] Out of memory: Kill process 25027 (influxd) score 970 or sacrifice child
Nov  7 00:20:09 graph-01 kernel: [2364650.942413] Out of memory: Kill process 23059 (influxd) score 968 or sacrifice child
Nov  7 14:20:10 graph-01 kernel: [2415051.221631] Out of memory: Kill process 4479 (influxd) score 966 or sacrifice child

root@graph-01:[~]: zgrep Out  /var/log/kern.log.3.gz 

So it first happened on Nov 4. On that day I added the Ceph metrics plugin to Telegraf, which collects Ceph info from 11 nodes every minute.

Any suggestions?

@digidax

digidax commented Jan 21, 2020

Exactly the same problem:
Jan 21 07:16:20 pve1 kernel: [4534740.858405] influxd invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), order=0, oom_score_adj=0 Jan 21 07:16:20 pve1 kernel: [4534740.858408] CPU: 0 PID: 28913 Comm: influxd Tainted: P IO 5.0.21-5-pve #1 Jan 21 07:16:20 pve1 kernel: [4534740.858409] Hardware name: Supermicro X8STi/X8STi, BIOS 1.0c 03/10/2010 Jan 21 07:16:20 pve1 kernel: [4534740.858410] Call Trace: Jan 21 07:16:20 pve1 kernel: [4534740.858416] dump_stack+0x63/0x8a Jan 21 07:16:20 pve1 kernel: [4534740.858419] dump_header+0x54/0x2fb Jan 21 07:16:20 pve1 kernel: [4534740.858421] ? sched_clock_cpu+0x11/0xc0 Jan 21 07:16:20 pve1 kernel: [4534740.858422] oom_kill_process.cold.30+0xb/0x1d6 Jan 21 07:16:20 pve1 kernel: [4534740.858424] out_of_memory+0x1c3/0x4a0 Jan 21 07:16:20 pve1 kernel: [4534740.858428] mem_cgroup_out_of_memory+0xc4/0xd0 Jan 21 07:16:20 pve1 kernel: [4534740.858430] try_charge+0x6c6/0x750 Jan 21 07:16:20 pve1 kernel: [4534740.858432] ? __alloc_pages_nodemask+0x13f/0x2e0 Jan 21 07:16:20 pve1 kernel: [4534740.858434] mem_cgroup_try_charge+0x8b/0x190 Jan 21 07:16:20 pve1 kernel: [4534740.858435] mem_cgroup_try_charge_delay+0x22/0x50 Jan 21 07:16:20 pve1 kernel: [4534740.858438] __handle_mm_fault+0x9de/0x12d0 Jan 21 07:16:20 pve1 kernel: [4534740.858441] ? __switch_to_asm+0x41/0x70 Jan 21 07:16:20 pve1 kernel: [4534740.858442] handle_mm_fault+0xdd/0x210 Jan 21 07:16:20 pve1 kernel: [4534740.858445] __do_page_fault+0x23a/0x4c0 Jan 21 07:16:20 pve1 kernel: [4534740.858446] do_page_fault+0x2e/0xe0 Jan 21 07:16:20 pve1 kernel: [4534740.858447] ? page_fault+0x8/0x30 Jan 21 07:16:20 pve1 kernel: [4534740.858448] page_fault+0x1e/0x30 Jan 21 07:16:20 pve1 kernel: [4534740.858450] RIP: 0033:0x99c9c6 Jan 21 07:16:20 pve1 kernel: [4534740.858452] Code: 48 83 fa 08 7d 11 49 8b 1f 48 89 1f 48 29 d1 48 01 d7 48 01 d2 eb e9 48 89 f8 48 01 cf 48 83 f9 00 0f 8e e1 fd ff ff 49 8b 1f <48> 89 18 49 83 c7 08 48 83 c0 08 48 83 e9 08 eb e2 41 8a 1f 88 1f Jan 21 07:16:20 pve1 kernel: [4534740.858453] RSP: 002b:000000c002f32948 EFLAGS: 00010206 Jan 21 07:16:20 pve1 kernel: [4534740.858454] RAX: 000000c010746ffc RBX: 692c6d6f632e7364 RCX: 000000000000000a Jan 21 07:16:20 pve1 kernel: [4534740.858455] RDX: 0000000000000225 RSI: 000000c0106ad8e3 RDI: 000000c010747006 Jan 21 07:16:20 pve1 kernel: [4534740.858456] RBP: 000000c002f32978 R08: 000000c0106b6000 R09: 00000000000ccaed Jan 21 07:16:20 pve1 kernel: [4534740.858456] R10: 000000c010782aed R11: 000000c01069e003 R12: 0000000000015d7c Jan 21 07:16:20 pve1 kernel: [4534740.858457] R13: 000000c0106b3d7f R14: 000000000003bb0f R15: 000000c010746dd7 Jan 21 07:16:20 pve1 kernel: [4534740.858459] memory: usage 1048576kB, limit 1048576kB, failcnt 334499715 Jan 21 07:16:20 pve1 kernel: [4534740.858460] memory+swap: usage 1048576kB, limit 2097152kB, failcnt 0 Jan 21 07:16:20 pve1 kernel: [4534740.858460] kmem: usage 21816kB, limit 9007199254740988kB, failcnt 0 Jan 21 07:16:20 pve1 kernel: [4534740.858461] Memory cgroup stats for /lxc/172: cache:0KB rss:0KB rss_huge:0KB shmem:0KB mapped_file:0KB dirty:0KB writeback:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB Jan 21 07:16:20 pve1 kernel: [4534740.858468] Memory cgroup stats for /lxc/172/ns: cache:598348KB rss:428216KB rss_huge:0KB shmem:598064KB mapped_file:25740KB dirty:264KB writeback:0KB swap:0KB inactive_anon:185596KB active_anon:840976KB inactive_file:0KB active_file:0KB unevictable:0KB Jan 21 07:16:20 pve1 kernel: [4534740.858473] Tasks state (memory values in 
pages): Jan 21 07:16:20 pve1 kernel: [4534740.858474] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name Jan 21 07:16:20 pve1 kernel: [4534740.858528] [ 23427] 0 23427 10823 280 139264 0 0 systemd Jan 21 07:16:20 pve1 kernel: [4534740.858530] [ 23537] 0 23537 6597 71 98304 0 0 systemd-logind Jan 21 07:16:20 pve1 kernel: [4534740.858532] [ 23540] 81 23540 14534 121 159744 0 -900 dbus-daemon Jan 21 07:16:20 pve1 kernel: [4534740.858534] [ 23596] 0 23596 5676 155 86016 0 0 crond Jan 21 07:16:20 pve1 kernel: [4534740.858535] [ 23597] 0 23597 1631 31 57344 0 0 agetty Jan 21 07:16:20 pve1 kernel: [4534740.858537] [ 23598] 0 23598 1631 32 57344 0 0 agetty Jan 21 07:16:20 pve1 kernel: [4534740.858538] [ 23599] 0 23599 22742 160 225280 0 0 login Jan 21 07:16:20 pve1 kernel: [4534740.858540] [ 23906] 0 23906 28232 258 266240 0 -1000 sshd Jan 21 07:16:20 pve1 kernel: [4534740.858542] [ 23931] 997 23931 189399 1664 258048 0 0 chronograf Jan 21 07:16:20 pve1 kernel: [4534740.858544] [ 17281] 0 17281 2958 96 69632 0 0 bash Jan 21 07:16:20 pve1 kernel: [4534740.858550] [ 30499] 998 30499 339253 4262 348160 0 0 grafana-server Jan 21 07:16:20 pve1 kernel: [4534740.858552] [ 25153] 0 25153 153915 5932 536576 0 0 rsyslogd Jan 21 07:16:20 pve1 kernel: [4534740.858555] [ 10689] 0 10689 13869 5124 155648 0 0 systemd-journal Jan 21 07:16:20 pve1 kernel: [4534740.858558] [ 28694] 0 28694 244876 27884 2322432 0 0 node-red Jan 21 07:16:20 pve1 kernel: [4534740.858559] [ 28904] 999 28904 387458 71266 1228800 0 0 influxd Jan 21 07:16:20 pve1 kernel: [4534740.858565] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=ns,mems_allowed=0,oom_memcg=/lxc/172,task_memcg=/lxc/172/ns,task=influxd,pid=28904,uid=999 Jan 21 07:16:20 pve1 kernel: [4534740.871746] oom_reaper: reaped process 28904 (influxd), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

Proxmox 6.0.15 HA system
influxdb.x86_64 1.7.9-1 in CentOS 7.7.1908
No difference whether the container is stored on the GlusterFS farm or a ZPOOL.

@superbool

1.7.10 same problem

@lobocobra

I could fix the problem by returning to version 1.6.4. This is the last version without this bug.
=> A program that eats up all memory until the server crashes definitely has problems.

With version 1.6.4 I still have rare server crashes, but we are talking about one crash every two months.
I might try an earlier version, because the problems started when I upgraded Influx.

@M0rdecay

This is still an urgent problem; because of it we do not dare to upgrade beyond 1.7.3 to 1.7.10/1.8, which have the Flux language functionality we need.

@julienmathevet

julienmathevet commented May 29, 2020

I don't know if this corresponds to this ticket, but we have many InfluxDB databases that show a memory peak every day at 04:10 UTC. On some of them this ends in an OOM kill. It seems to be due to shard compaction, because we use daily shards and the times match.
We tried setting max-concurrent-compactions to 1, but without conclusive results. We also run these instances in Docker and on non-SSD disk drives.
How can we avoid being OOM-killed during compaction, or limit memory usage during compaction?
Seen on 1.7.9 and 1.8.0.
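
For reference, the 1.x compaction knobs discussed here live under [data] in influxdb.conf; a hedged sketch with illustrative values, not recommendations:

[data]
  # allow only one TSM compaction at a time (default 0 = half the CPU cores)
  max-concurrent-compactions = 1
  # throttle how fast compactions may write to disk (sustained / burst)
  compact-throughput = "8m"
  compact-throughput-burst = "16m"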

@neuyilan

neuyilan commented Jun 3, 2020

I think it is necessary to have a mechanism that controls memory usage to avoid OOM, even if it may cause performance degradation; after all, usability is more important than performance.

@bijwaard

Dear all,

I agree that it should be possible to limit the (virtual) memory use, especially on embedded targets; they should keep running in a sane state no matter what.

I see similar OOM kills on embedded devices with <=1 GB of RAM running Debian stretch and buster.

After an OOM kill the system is not always in a sane state; sometimes another process is killed instead and Influx is still remotely accessible. I've encountered a number of cases where no new process could be started after the kill, so SSH login no longer works and the existing shell performs poorly (top reported 0 available memory).

I've now configured the OOM killer to restart the machine instead of leaving the device running in a crippled state.

I noticed the InfluxDB setting compact-full-write-cold-duration = "4h", which apparently forces a compaction after 4 hours without new writes, e.g. when not recording measurements at night. So this may also be what triggers the compaction...

Kind regards,
Dennis

@lobocobra

Many thanks for this hint.

I have now enabled the feature with:

echo "vm.panic_on_oom=1" >> /etc/sysctl.conf
echo "kernel.panic=20" >> /etc/sysctl.conf

=> vm.panic_on_oom=1 makes the kernel panic on OOM, and kernel.panic=20 reboots the machine 20 seconds after a panic.
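
To load the new values without rebooting (standard sysctl usage):

sysctl -p
sysctl vm.panic_on_oom kernel.panic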

But honestly I am disappointed that the kernel has to work around a bug like this. I opened two bug reports here which were simply closed without any solution. It seems the developers simply do not care about this bug. And yes, if a piece of software takes down the OS, it is a bug; even if the software is wrongly configured, this must never happen.
=> I will use this workaround until I find a solution to replace InfluxDB, which is the main cause of the problem.

@bijwaard

Hi lobocobra,

I agree this OOM configuration is mainly a work-around and a safety net for the device. In the meantime I hope that someone at Influx takes a closer look at these memory issues; I am confident we don't need to look for another database just yet.

As a further precaution, I also adjusted a number of parameters in the InfluxDB configuration to reduce the cache size, shard size, compact-throughput (and burst), and series-id-set-cache-size. Especially the last one appears to bring down memory usage considerably (paying some price on queries, of course). Lowering the compaction throughput and burst may reduce the sudden spikes in memory use.

Kind regards,
Dennis

@lobocobra

Ha... hope dies last... I posted my first bug reports a year ago...
=> they do not care.

If you open a new bug report, they just close it within two days.
If you downgrade to version 1.6.4 it happens about once a month, and with this OOM setting the machine restarts. As soon as I have time I will migrate to another backend for my Homematic.

@bijwaard

bijwaard commented Jun 17, 2020

My 512MB device with Debian buster (Armbian) survived the night without OOM using tsi1 indexing on InfluxDB v1.8.0 ;-)

It now steadily uses 12.2% of CPU instead of the 38% it used earlier (resident memory went down from 180MB to 60MB), and I don't notice any difference in queries with series-id-set-cache-size=0.

The default configuration for compaction throughput should be much more conservative, especially on smaller devices. I've down-scaled compact-throughput from "48m" to "1m" and compact-throughput-burst from "48m" to "10m". I think this reduces the steep increase in memory usage during compaction, which was triggered 6 times last night.
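
Put together, the changes described above amount to something like the following in influxdb.conf (a sketch using the values mentioned; tune for your own hardware):

[data]
  index-version = "tsi1"
  # trade some query speed for memory
  series-id-set-cache-size = 0
  # throttle compaction writes to smooth out memory spikes
  compact-throughput = "1m"
  compact-throughput-burst = "10m"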

@bapBardas

bapBardas commented Jun 17, 2020

My 512MB device with Debian buster (Armbian) survived the night without OOM using tsi1 indexing on InfluxDB v1.8.0 ;-)

It now steadily uses 12.2% of CPU instead of the 38% it used earlier (resident memory went down from 180MB to 60MB), and I don't notice any difference in queries with series-id-set-cache-size=0.

The default configuration for compaction throughput should be much more conservative, especially on smaller devices. I've down-scaled compact-throughput from "48m" to "1m" and compact-throughput-burst from "48m" to "10m". I think this reduces the steep increase in memory usage during compaction, which was triggered 6 times last night.

Thank you for this. After a migration from inmem to tsi1 indexing, my InfluxDB instance would simply consume the 32 GB of RAM available in only one minute and be killed due to out of memory.

Simply changing the compact-throughput values as you indicated allowed me to bring my InfluxDB instance back to life; RAM consumption was still high but remained within acceptable limits.

@bijwaard

Hi bapBardas,

Glad I could help. For my small device I also decreased cache-max-memory-size from 1g to 64m and cache-snapshot-memory-size from 25m to 1m. This may not matter on your monster machine.

Kind regards,
Dennis
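
In configuration terms, those two changes look like this under [data] (a sketch with the values mentioned above):

[data]
  cache-max-memory-size = "64m"
  cache-snapshot-memory-size = "1m"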

@linuxmail

Hi,

From my side I can say: the CQs were (and are) the problem. For the Telegraf DB I had mean(*), which is supported, but since 1.7.x it kills the node. Good to know: https://medium.com/opsops/avoiding-combinatorial-explosion-in-continuous-queries-in-influx-ce9db70634e5

I dropped the CQs for Telegraf and now the node remains stable. We use many plugins, and creating a CQ for every value... is too much work.
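
A hedged illustration of the difference: instead of one wildcard aggregation over every field, an explicit per-field continuous query keeps the field/series explosion in check (database, measurement, and field names below are made up):

CREATE CONTINUOUS QUERY "cq_cpu_5m" ON "telegraf" BEGIN
  SELECT mean("usage_idle") AS "usage_idle"
  INTO "telegraf"."rp_1_years"."cpu_5m"
  FROM "cpu"
  GROUP BY time(5m), *
END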

@BerndCzech

BerndCzech commented Dec 5, 2021

Hi,

I have pretty much the same problem on Influx 2.1, which I am running in Docker.

RAM consumption increases until a snapshot creation fails:

# Influx log
ts=2021-12-02T15:18:11.679482Z lvl=info msg="Snapshot for path written" log_id=0YA6GEnG000 service=storage-engine engine=tsm1 op_name=tsm1_cache_snapshot path=/var/lib/influxdb2/engine/data/7e09dfed7d3f725e/autogen/930 duration=649.846ms
ts=2021-12-02T15:18:11.679654Z lvl=info msg="Cache snapshot (start)" log_id=0YA6GEnG000 service=storage-engine engine=tsm1 op_name=tsm1_cache_snapshot op_event=start
ts=2021-12-02T15:18:11.681772Z lvl=info msg="Snapshot for path written" log_id=0YA6GEnG000 service=storage-engine engine=tsm1 op_name=tsm1_cache_snapshot path=/var/lib/influxdb2/engine/data/7e09dfed7d3f725e/autogen/624 duration=696.963ms
ts=2021-12-02T15:18:11.681980Z lvl=info msg="Cache snapshot (start)" log_id=0YA6GEnG000 service=storage-engine engine=tsm1 op_name=tsm1_cache_snapshot op_event=start
ts=2021-12-02T15:18:11.683706Z lvl=info msg="Snapshot for path written" log_id=0YA6GEnG000 service=storage-engine engine=tsm1 op_name=tsm1_cache_snapshot path=/var/lib/influxdb2/engine/data/7e09dfed7d3f725e/autogen/660 duration=627.738ms
ts=2021-12-02T15:18:11.683908Z lvl=info msg="Cache snapshot (start)" log_id=0YA6GEnG000 service=storage-engine engine=tsm1 op_name=tsm1_cache_snapshot op_event=start
ts=2021-12-02T15:23:10.519254Z lvl=info msg="Welcome to InfluxDB" log_id=0YAwieCW000 version=2.1.1 commit=657e1839de build_date=2021-11-09T03:03:48Z
# Syslog
Dec  2 15:23:02 ubuntu-2gb-nbg1-1 kernel: [1198133.940234] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=e2e95ffa3d2c00d9f6341a0b8e45932b56a51a38526f864682c9ebdb74adfe98,mems_allowed=0,global_oom,task_memcg=/docker/8422c1a663e351fc9d13366538d047a3cb6bab359cc114f94b5eb44225526d10,task=influxd,pid=168073,uid=1000
Dec  2 15:23:02 ubuntu-2gb-nbg1-1 kernel: [1198133.941170] Out of memory: Killed process 168073 (influxd) total-vm:11163164kB, anon-rss:3326160kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:18164kB oom_score_adj:0

I tried several workarounds:

  • Separating into smaller chunks:
    • Creating snapshots between data is pushed (every~1m): INFLUXD_STORAGE_CACHE_SNAPSHOT_WRITE_COLD_DURATION=10m0s10s
  • setting several memory-related env vars to small values (see the Docker sketch after this list):
    • INFLUXD_STORAGE_COMPACT_THROUGHPUT_BURST=503316
    • INFLUXD_STORAGE_MAX_INDEX_LOG_FILE_SIZE=10485
    • INFLUXD_STORAGE_CACHE_MAX_MEMORY_SIZE=107374188
  • FYI: My cardinality is only ~4800
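
For reference, a hedged sketch of how such limits and env vars can be combined when running the official image under Docker (image tag, memory limit, and volume name are illustrative):

docker run -d --name influxdb2 --memory 2g \
  -e INFLUXD_STORAGE_CACHE_MAX_MEMORY_SIZE=107374188 \
  -e INFLUXD_STORAGE_COMPACT_THROUGHPUT_BURST=503316 \
  -e INFLUXD_STORAGE_MAX_INDEX_LOG_FILE_SIZE=10485 \
  -v influxdb2-data:/var/lib/influxdb2 \
  influxdb:2.1.1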

The overall memory consumption went down (see the flat top on 11/29), but spikes still occur which lead to OOM kills. I don't know what else to try.

[attached image: memory consumption graph]

I have read many issues about these OOM kills in InfluxDB and would like to see a fix. But for now this is too fragile and unpredictable to use in production.

Details:

What is actually bugging me, and why I am not directly moving to TimescaleDB etc., is how Influx can be used by so many people while it appears this fragile to me; so I guess I am doing something wrong.

@digidax

digidax commented Dec 5, 2021 via email

@bijwaard

bijwaard commented Dec 5, 2021

Hi digidax,

I run InfluxDB 1.8 inside two LXC containers with a trimmed-down memory/snapshot configuration; each container has 1 GB of memory. It has run without issues for months. Before trimming the config, the host's OOM killer killed the containerized influxd processes; fortunately the host survived most of the time. Other virtualization techniques may treat (virtual) memory shortages differently within the container.

Kind regards,
Dennis

@matheushent

This issue has been open for quite a long time, but I just hit it today.

I was running InfluxDB 1.8 on k8s (EKS); it ran for weeks and suddenly today it started triggering the OOM killer. I switched to 1.6.4 and it has been working fine for the last couple of hours.

@UFOXD

UFOXD commented Sep 11, 2023

same issue on influxdb 1.8.9

@geronimo-iia

same with 1.8.10
