influxdb out of memory #13318
@crjddd have you tested to see if this happens on 1.7.5? |
@dgnorton, I haven't tried version 1.7.5 yet; I'll try it today. |
@dgnorton, I've tried it with version 1.7.5, but the result is the same. |
I am also seeing this issue running InfluxDB in a Proxmox LXC. The container has plenty of memory (4 GB) assigned, but the kernel still kills InfluxDB for lack of memory. |
Maybe also see #13605. My experience is that 1.7.6 is very bad; 1.7.4 is better but also not perfect! I run influxdb as a VM and the VM images are stored on a glusterfs filesystem. As long as everything runs normally, everything is fine. When I reboot one of my fs-cluster machines and it starts resyncing, I/O gets slower and then influxdb starts to need more memory. |
I hit the same problem: if I query more data than fits in server memory, influxdb gets killed by the system. |
Same issue with 1.7.7; I downgraded to 1.7.3 and it seems to be OK with the same data and configuration. |
Ok, it looks like I had misunderstood shards and retention policies. I made several changes to the shard settings and it lowered the memory usage. For now influx is running. Try lowering the shard duration. Thanks |
Hi, it worked well for a year, then I had the bad idea to upgrade influxdb above 1.7, and since then things have been out of control. :-( Thanks for any feedback. (PS: yes, I tried to move the index from in-memory to TSI, but after 4h I aborted; it didn't even show a progress bar.) |
Hey.... and how do I lower the shard, please? |
Hi, you can only set the shard duration at creation. So in the console, create your retention policy with the shard duration parameter; I don't remember the exact command, but it's documented on the influx web site. Hope it will help you.
|
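For later readers, a minimal InfluxQL sketch of setting the shard group duration on a retention policy (database and policy names are placeholders):

```sql
-- Create a retention policy with a short shard group duration (placeholder names).
CREATE RETENTION POLICY "one_year" ON "mydb" DURATION 52w REPLICATION 1 SHARD DURATION 1d DEFAULT

-- An existing policy can also be altered; the new duration only applies to shards created afterwards.
ALTER RETENTION POLICY "autogen" ON "mydb" SHARD DURATION 1d
```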
Many thanks for the hint, I will try to find out the rest. |
Hi, I also have an InfluxDB with the latest version. Environment info:
Actual behavior:
Config:
Logs:
Performance: I switched the index from inmem to tsi1, as described in the 1.7 documentation, but it doesn't help.
I searched the logs for older OOMs:
So it first happened on Nov 4. On that day I added the Ceph metric plugin via Telegraf, which collects Ceph info from 11 nodes every minute. Any suggestions? |
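For reference, a sketch of the usual 1.x inmem-to-tsi1 conversion (paths are the stock package defaults and may differ on your system): set the index version in the config, stop influxd, rebuild the on-disk index, then restart.

```toml
# /etc/influxdb/influxdb.conf
[data]
  index-version = "tsi1"
```

```sh
# Rebuild the TSI index from existing TSM data while influxd is stopped (default paths assumed).
sudo systemctl stop influxdb
sudo -u influxdb influx_inspect buildtsi -datadir /var/lib/influxdb/data -waldir /var/lib/influxdb/wal
sudo systemctl start influxdb
```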
Exactly the same problem: Proxmox 6.0.15 HA system |
1.7.10 same problem |
I could fix the problem by returning to version 1.6.4; this is the last version without this bug. With version 1.6.4 I still have rare server crashes, but we are talking about 1 crash every 2 months. |
This is still an urgent problem; because of it we do not risk updating beyond 1.7.3 to 1.7.10/1.8, which contain the Flux language functionality we need. |
I don't know if it corresponds to this ticket, but we have many influxdb databases that show a memory peak every day at 04:10 UTC. On some of them this ends in an OOM kill. It seems to be due to shard compaction, because we use one-day shards and the time matches. |
I think it is necessary to have a mechanism to control memory usage and avoid OOM, even if it causes performance degradation; after all, usability is more important than performance. |
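For what it's worth, 1.x already ships some per-query guardrails in the [coordinator] section of influxdb.conf that can stop a single runaway query from consuming all memory. A sketch with illustrative values (the defaults are 0, i.e. unlimited):

```toml
[coordinator]
  max-concurrent-queries = 10    # limit queries running at the same time
  query-timeout = "60s"          # abort queries running longer than this
  log-queries-after = "10s"      # log slow queries for later diagnosis
  max-select-point = 50000000    # abort a SELECT that would read more points than this
  max-select-series = 100000     # abort a SELECT that would touch more series than this
  max-select-buckets = 100000    # abort a SELECT that would create more GROUP BY time() buckets
```

These only bound query-driven memory; cache and compaction behavior are tuned separately in the [data] section (see the settings discussed further down in this thread).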
Dear all, I agree that it should be possible to limit the (virtual) memory use, especially on embedded targets; they should keep running in a sane state no matter what. I see similar OOM kills on embedded devices with <=1G of RAM running debian stretch and buster. After an OOM kill the system is not always in a sane state: sometimes another process is killed and influx is still remotely accessible, and a number of times no new process could be started after the kill, so ssh login no longer works and an existing shell performs poorly (top reported 0 available memory). I've now configured the OOM handling to restart the machine instead of leaving the device running in a dismembered state. I noticed the influxdb variable compact-full-write-cold-duration = "4h", which apparently forces compaction after 4 hours without measurements, e.g. when not recording measurements at night, so this may also be what triggers the compaction... Kind regards, |
Many thanks for this hint. I have now enabled the feature by:
=> This enables a reboot if the kernel is not responding for 20 sec. But honestly I am disappointed that the kernel has to work around a bug like this. I opened 2 bug reports here which were simply closed, without any solution. It seems that the developers simply do not care about the bug. And yes, if a piece of software kills the OS, it is a bug. Even if the software is wrongly configured, this must never happen. |
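The exact setting isn't quoted above; one common way to get this behavior on Linux (an assumption, not necessarily what was used here) is to have the kernel panic on OOM and reboot shortly afterwards:

```sh
# Assumed sysctl-based approach: panic on OOM, then auto-reboot after 20 seconds.
cat <<'EOF' | sudo tee /etc/sysctl.d/90-oom-reboot.conf
vm.panic_on_oom = 1
kernel.panic = 20
EOF
sudo sysctl --system   # apply without rebooting
```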
Hi lobocobra, I agree this OOM configuration is mainly a work-around and safety-net for the device. In the meantime I hope that someone at influx takes a closer look at these memory issues, I am confident we don't need to look for another database just yet. As a further precaution, I also adjusted a number of parameters in the influxdb configuration to reduce the cache, shard-size, compact-throughput(burst), series-id-set-cache-size. Especially the last one appears to bring down memory usage considerably (paying some price with queries of course). Lowering the compact throughput and burst may reduce the sudden spikes in memory use. Kind regards, |
Ha.... hope dies last... I posted my first bug reports 1 year ago.... If you open a new bug report, they just close it within 2 days. |
My 512MB device with debian buster (Armbian) survived the night without OOM using tsi1 indexing on influxdb v1.8.0;-) It now steadily uses 12.2% of CPU instead of the 38% it used earlier (resident memory went down from 180MB to 60MB), and I don't feel the difference using queries with series-id-set-cache-size=0. The default configuration for compaction throughput should be much more conservative especially on smaller devices, I've down-scaled compact-throughput="48m" to "1m" and -burst="48m" to "10m". I think this reduces the steep increase in memory usage during compaction, which was triggered 6 times last night. |
Thank you for this. After a migration from inmem to tsi1 indexing, my influxdb instance would simply consume the 32GB RAM available in only one minute, and be killed due to Out of memory. Simply changing the compact-throughput values as you indicated allowed me to bring my InfluxDB instance back to life, RAM consumption was still high but remaining within acceptable values. |
Hi bapBardas, Glad I could help. For my small device I also decreased the cache-max-memory-size from 1g to 64m and cache-snapshot-memory-size from 25m to 1m. This may not matter on your monster-machine. Kind regards, |
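Pulling together the numbers mentioned in the last few comments, a trimmed-down [data] section for a small (≤1 GB RAM) box might look like the sketch below; these are the commenters' values for their own devices, not general recommendations:

```toml
[data]
  index-version = "tsi1"              # on-disk index instead of inmem
  cache-max-memory-size = "64m"       # default 1g
  cache-snapshot-memory-size = "1m"   # default 25m; snapshot the write cache early and often
  compact-throughput = "1m"           # default 48m; slower compaction flattens memory spikes
  compact-throughput-burst = "10m"    # default 48m
  series-id-set-cache-size = 0        # default 100; trades some query speed for lower resident memory
```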
Hi, from my side I can say: the CQs were / are the problem. For the Telegraf DB I had mean(*), which is supported, but since 1.7.x it kills the node. Good to know: https://medium.com/opsops/avoiding-combinatorial-explosion-in-continuous-queries-in-influx-ce9db70634e5 I dropped the CQs for Telegraf and now the nodes remain stable. We use many plugins, and creating a CQ for every value ... is too much work. |
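For context, the kind of CQ being described is a wildcard downsampling query roughly like the sketch below (database, target retention policy, and interval are placeholders). With mean(*) combined with GROUP BY *, every field of every measurement and every tag combination produces its own output series, which is the combinatorial explosion the linked article describes:

```sql
-- Placeholder names: "telegraf" database, "rp_1h" downsampled retention policy.
CREATE CONTINUOUS QUERY "cq_downsample_1h" ON "telegraf"
BEGIN
  SELECT mean(*) INTO "telegraf"."rp_1h".:MEASUREMENT
  FROM /.*/
  GROUP BY time(1h), *
END
```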
Hi, I have pretty much the same problem on Influx 2.1, which I am running on Docker. RAM consumption keeps increasing until a snapshot creation fails:
I tried several workarounds:
The overall memory consumption was reduced (see the flat top on 11/29), but sometimes spikes occur which lead to OOM kills. I don't know what else to try. I have read many issues about these OOM kills in influxdb and would like to see a fix, but for now this is too fragile and unpredictable to use in production. Details:
So what is actually bugging me, and why I am not directly moving to timescaledb etc., is how influx can be used by so many people while it appears this fragile to me - so I guess I am doing something wrong. |
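For the 2.x case, similar knobs exist as influxd configuration options (also settable as --flags or INFLUXD_* environment variables). A sketch with illustrative values; exact option availability depends on the 2.x minor version:

```toml
# influxd config.toml (illustrative values only)
query-concurrency = 10                       # Flux queries executing at once
query-queue-size = 50                        # queries beyond that wait in the queue
query-memory-bytes = 104857600               # memory budget for a single query (100 MiB)
query-max-memory-bytes = 1073741824          # total memory budget across all queries (1 GiB)
storage-cache-max-memory-size = 536870912    # write cache cap (512 MiB; default 1 GiB)
storage-compact-throughput-burst = 10485760  # slow compaction to avoid memory spikes
storage-series-id-set-cache-size = 0         # same trade-off as series-id-set-cache-size in 1.x
```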
I think the problem is the handling of the virtualisation layer. In my case the problem is present inside an LXC container; if I use a VM instead of LXC, Influx works well.
On 5 December 2021 at 13:15 CET, Ingo Fischer ***@***.***> wrote:
… Ok, but what should the adapter do now? I think this is an issue with either influxdb itself or your setup, and maybe the db size vs the RAM granted to influxdb ... I do not see anything the adapter could do here. I would start by giving it more RAM. 4 GB seems too little in your case ... try more and you'll see how much RAM influxdb needs.
|
Hi digidax, I run influxdb 1.8 within 2 LXC containers using a trimmed-down memory/snapshot configuration; container memory is 1 GB. It has run without issues for months. Before trimming the config, the host's OOM killer killed the containerized influxdb processes; fortunately the host survived most of the time. Other virtualization techniques may treat (virtual) memory shortages differently within the container. Kind regards, |
This issue has been open for quite a long time, but I just faced it today. I was using InfluxDB 1.8 on k8s (EKS); it ran for weeks and suddenly today it started triggering the OOM killer. I switched to 1.6.4 and it has been working fine for the last couple of hours. |
same issue on influxdb 1.8.9 |
same with 1.8.10 |
### 1. Description:
When running Influx, the VIRT memory consumption rapidly increases, and the process is eventually killed by the OOM killer.
### 2. Environment:
[Docker version]: Docker version 1.13.1, build 07f3374/1.13.1
[Docker run command]: docker run -it -d --network host -v /var/lib/influxdb/:/opt/host --memory 10g dfaa91697202 /bin/bash -c "sleep 10000000"
[influxdb version]: InfluxDB shell version: 1.7.0~n201808230800
### [conf]:
influxdb_conf.TXT
[influx logs]:
influx_log_1.zip
Note: the log timestamps have an 8-hour time zone offset.
[disk info]:
I monitored the size of the data as well as the memory changes, as follows:
docker_memory.zip
```
--------------------------- top begin 54004 -----------------
top - 11:46:09 up 16 days, 3:09, 14 users, load average: 11.29, 10.57, 10.36
Tasks: 1 total, 0 running, 1 sleeping, 0 stopped, 0 zombie
%Cpu(s): 16.1 us, 1.1 sy, 0.0 ni, 82.7 id, 0.0 wa, 0.0 hi, 0.1 si, 0.0 st
KiB Mem : 26345912+total, 1652376 free, 54648416 used, 20715833+buff/cache
KiB Swap: 4194300 total, 3644 free, 4190656 used. 20497331+avail Mem

  PID USER  PR NI  VIRT   RES   SHR    S %CPU %MEM TIME+   COMMAND
54004 root  20  0  57.4g  10.0g 154332 S 93.8  4.0 1251:44 influxd
--------------------------- top end 54004 -----------------
```
[messages log]:
```
Apr 11 11:48:01 psinsight-112 kernel: influxd invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0
Apr 11 11:48:01 psinsight-112 kernel: influxd cpuset=docker-f1207e65211b540e841d5d13a7b66c393ddde902d0c80a3bfa215a4c16754418.scope mems_allowed=0-1
Apr 11 11:48:01 psinsight-112 kernel: CPU: 21 PID: 54013 Comm: influxd Kdump: loaded Tainted: G ------------ T 3.10.0-957.1.3.el7.x86_64 #1
Apr 11 11:48:01 psinsight-112 kernel: Hardware name: Huawei 2288H V5/BC11SPSCB0, BIOS 0.68 05/03/2018
Apr 11 11:48:01 psinsight-112 kernel: Call Trace:
Apr 11 11:48:01 psinsight-112 kernel: [] dump_stack+0x19/0x1b
Apr 11 11:48:01 psinsight-112 kernel: [] dump_header+0x90/0x229
Apr 11 11:48:01 psinsight-112 kernel: [] ? find_lock_task_mm+0x56/0xc0
Apr 11 11:48:01 psinsight-112 kernel: [] ? try_get_mem_cgroup_from_mm+0x28/0x60
Apr 11 11:48:01 psinsight-112 kernel: [] oom_kill_process+0x254/0x3d0
Apr 11 11:48:01 psinsight-112 kernel: [] ? selinux_capable+0x1c/0x40
Apr 11 11:48:01 psinsight-112 kernel: [] mem_cgroup_oom_synchronize+0x546/0x570
Apr 11 11:48:01 psinsight-112 kernel: [] ? mem_cgroup_charge_common+0xc0/0xc0
Apr 11 11:48:01 psinsight-112 kernel: [] pagefault_out_of_memory+0x14/0x90
Apr 11 11:48:01 psinsight-112 kernel: [] mm_fault_error+0x6a/0x157
Apr 11 11:48:01 psinsight-112 kernel: [] __do_page_fault+0x3c8/0x500
Apr 11 11:48:01 psinsight-112 kernel: [] do_page_fault+0x35/0x90
Apr 11 11:48:01 psinsight-112 kernel: [] page_fault+0x28/0x30
Apr 11 11:48:01 psinsight-112 kernel: Task in /system.slice/docker-f1207e65211b540e841d5d13a7b66c393ddde902d0c80a3bfa215a4c16754418.scope killed as a result of limit of /system.slice/docker-f1207e65211b540e841d5d13a7b66c393ddde902d0c80a3bfa215a4c16754418.scope
Apr 11 11:48:01 psinsight-112 kernel: memory: usage 10485760kB, limit 10485760kB, failcnt 159873656
Apr 11 11:48:01 psinsight-112 kernel: memory+swap: usage 10504544kB, limit 20971520kB, failcnt 0
Apr 11 11:48:01 psinsight-112 kernel: kmem: usage 0kB, limit 9007199254740988kB, failcnt 0
Apr 11 11:48:01 psinsight-112 kernel: Memory cgroup stats for /system.slice/docker-f1207e65211b540e841d5d13a7b66c393ddde902d0c80a3bfa215a4c16754418.scope: cache:168KB rss:10485592KB rss_huge:0KB mapped_file:4KB swap:18784KB inactive_anon:1623952KB active_anon:8861888KB inactive_file:112KB active_file:0KB unevictable:0KB
Apr 11 11:48:01 psinsight-112 kernel: [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
Apr 11 11:48:01 psinsight-112 kernel: [44993] 0 44993 1078 11 8 7 0 sleep
Apr 11 11:48:01 psinsight-112 kernel: [50116] 0 50116 2943 56 11 40 0 bash
Apr 11 11:48:01 psinsight-112 kernel: [54004] 0 54004 15101796 2622256 18846 4497 0 influxd
Apr 11 11:48:01 psinsight-112 kernel: [54423] 0 54423 685493 1284 121 234 0 influx
Apr 11 11:48:01 psinsight-112 kernel: Memory cgroup out of memory: Kill process 223248 (influxd) score 720 or sacrifice child
Apr 11 11:48:01 psinsight-112 kernel: Killed process 54004 (influxd) total-vm:60407184kB, anon-rss:10479648kB, file-rss:9376kB, shmem-rss:0kB
```