ipfs daemon memory usage grows over time: killed by OOM after 10~12 days of running #3532

Closed
hsanjuan opened this issue Dec 21, 2016 · 41 comments
Labels
kind/bug A bug in existing code (including security flaws)

Comments

@hsanjuan
Contributor

Version information:

go-ipfs version: 0.4.5-dev-
Repo version: 4
System version: arm/linux
Golang version: go1.7

Type: Problem

Priority: P4

Description:

I have some Raspberry Pi 3s running the go-ipfs daemon. Right now they don't do anything: they don't handle any IPFS requests, they just sit there running the daemons. After about 10 days, ipfs gets killed on all of them because it is taking too much memory.

The daemons are killed at around RSS=783192.
My longest-running daemon (11 days) has RSS=605868.
A newly started daemon has RSS=92020.
A daemon that has been running for one day has RSS=542408.

Questions:

  • What causes memory usage to steadily grow even if the daemons get no usage other than being running?
  • Is there a way to limit it?
  • Do we need to gather more information on this? If so, what's the best way and how can I help?

Related: #3318 and the question about running IPFS on platforms with limited resources.
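
For tracking figures like the RSS values above over time, here is a minimal sketch. It assumes Linux, where /proc/<pid>/status reports the resident set size as a VmRSS line in kB. The helper names (parseVmRSS, readRSS) are illustrative and are not part of go-ipfs.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseVmRSS extracts the VmRSS value (in kB) from the text of a
// /proc/<pid>/status file. It returns an error if the field is missing.
func parseVmRSS(status string) (int64, error) {
	sc := bufio.NewScanner(strings.NewReader(status))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		// Expected shape of the line: "VmRSS:   605868 kB"
		if len(fields) >= 2 && fields[0] == "VmRSS:" {
			return strconv.ParseInt(fields[1], 10, 64)
		}
	}
	return 0, fmt.Errorf("VmRSS not found")
}

// readRSS reads the resident set size (kB) of a process on Linux.
// pid may be a numeric PID or "self".
func readRSS(pid string) (int64, error) {
	data, err := os.ReadFile("/proc/" + pid + "/status")
	if err != nil {
		return 0, err
	}
	return parseVmRSS(string(data))
}

func main() {
	// Usage: pass the daemon's PID as the first argument.
	// Without arguments this process inspects itself.
	pid := "self"
	if len(os.Args) > 1 {
		pid = os.Args[1]
	}
	kb, err := readRSS(pid)
	if err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
		return
	}
	fmt.Printf("pid=%s RSS=%d kB\n", pid, kb)
}
```

Running this from cron or a loop and appending the output to a file gives a simple growth record for each daemon.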

@jonnycrunch
Member

jonnycrunch commented Dec 29, 2016

Same here:

ipfs version 0.4.3
Ubuntu 16.0.4 ( 4.4.0-47-generic )
go-lang 1.7

After about 10 days, memory grows to about 15G despite only a few hundred files being pinned. The issue reproduces across 10 servers. Restarting the daemon fixes it temporarily, but memory keeps growing and the daemon needs to be restarted again.

UPDATE: Ah, ha! I found the enable garbage collection flag in the documentation, so trying:

ipfs daemon --enable-gc

@whyrusleeping
Member

@jonnycrunch the --enable-gc flag refers to disk gc, not memory gc.

The memory leakage is coming from somewhere else... Next time the memory gets out of hand can you get me the debug info described here: https://github.com/ipfs/go-ipfs/blob/master/docs/debug-guide.md#beginning

Particularly the heap profile, goroutine dump and ipfs binary
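
A minimal sketch of collecting the heap profile and goroutine dump follows. It assumes the daemon's API is listening on 127.0.0.1:5001 (the usual default, but check your config) and that it serves the standard Go pprof endpoints there. The output file names and helper functions are illustrative choices, not something the debug guide mandates.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
)

// profileURLs maps output file names to pprof endpoints under the
// given API base URL.
func profileURLs(base string) map[string]string {
	return map[string]string{
		"ipfs.stacks": base + "/debug/pprof/goroutine?debug=2", // goroutine dump
		"ipfs.heap":   base + "/debug/pprof/heap",              // heap profile
	}
}

// fetchToFile downloads url and writes the response body to path.
func fetchToFile(url, path string) error {
	resp, err := http.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("%s: unexpected status %s", url, resp.Status)
	}
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()
	_, err = io.Copy(f, resp.Body)
	return err
}

func main() {
	base := "http://127.0.0.1:5001" // assumption: default API address
	for name, url := range profileURLs(base) {
		if err := fetchToFile(url, name); err != nil {
			fmt.Fprintln(os.Stderr, "skipping", name+":", err)
			continue
		}
		fmt.Println("wrote", name)
	}
}
```

The ipfs binary itself still has to be copied separately, since the profiles can only be symbolized against the exact binary that produced them.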

@koriaf

koriaf commented Jan 7, 2017

Hi! We are using ipfs 0.4.4 on Linux 4.4.35-33.55.amzn1.x86_64 #1 SMP Tue Dec 6 20:30:04 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

Currently it uses 65-76% of memory on a 2GB instance. The OOM killer sometimes kills it, it starts again, and usage grows over several hours back to that level. That level seems to be low enough for the daemon not to be killed; maybe it has some smart way to determine how not to be killed :-) While experimenting with memory limits, I saw that usage grows to fill all available memory but no further (no swap is used for IPFS, but other applications may run short of available memory).

ipfs/QmaB2FJr1Z6yGRy9G37aXsBirR43Lc9ya3Q29R4gMYDVDv - dumps. Should I recreate my node after sharing these files? Is there a chance they contain private keys or other data? The node is disposable and contains no private files yet, but it may in the future.

Also, I noticed that after running disk gc (ipfs repo gc), memory decreased from 70% to 65%, but after adding this debug directory it is back at 75% of total host memory.

I have no idea how Go works, so if you need more debug info or this is unhelpful, please feel free to ask for more details.

Also, I have an ipfs node running on a 512MB DigitalOcean instance, managed by supervisord. The OOM killer kills it there pretty quickly (within several hours), supervisord restarts it, and it dies again and again, but it generally works okay.

@come-maiz
Contributor

Carla Sella, from the Ubuntu community, reports that with ipfs v0.4.4, her VirtualBox VM starts to get slow after it connects to more than 70 peers. Here are her debugging files:
ipfs.tar.gz

@jonnycrunch
Member

Maybe it is time for Garbage Collection to be enabled by default? @whyrusleeping @RichardLitt @diasdavid

@Kubuxu
Member

Kubuxu commented Jan 16, 2017

@jonnycrunch as @whyrusleeping said, the --enable-gc flag is datastore garbage collection, not the program garbage collection.

The core problem is what we call "connection closing": IPFS currently connects to almost everyone, which, combined with the muxer implementation we are currently using, takes a lot of memory. We are working on reducing it, but it might take a while. Connection closing is a much harder problem than we initially expected.

The --enable-gc flag shouldn't matter; it might reduce memory usage a bit, but it isn't the core problem as far as I know.

@hsanjuan
Contributor Author

hsanjuan commented Feb 1, 2017

This is the debugging information I have collected from 1 node that was still running (2 have died):

https://ipfs.io/ipfs/QmXnYzZT1EAq9pzi6snd6KHD8kNrBSDuyJqLPe7QHzUE23

It was also using 150% CPU when I checked it and >80% MEM. They are still on 0.4.5-pre1 though.

@bdimych

bdimych commented May 14, 2017

Stack dump from #61.
This is a VPS running 64-bit CentOS 7 with 1GB of memory.
The ipfs daemon crashed 5 days after start:
ipfs-crash-May-07-grep-ipfs-var-log-messages.zip

ipfs package go-ipfs_v0.4.8_linux-amd64.tar.gz

@whyrusleeping
Member

Hey everyone, ipfs 0.4.11 should have some significant improvements here. The issue is not entirely resolved, but the leak should be mitigated.

@whyrusleeping whyrusleeping added P0 Critical: Tackled by core team ASAP status/ready Ready to be worked labels Oct 17, 2017
@maznu

maznu commented Dec 25, 2017

Still leaking memory in 0.4.13 — killed after ~12 hours.

@Stebalien
Member

At the moment, the largest issue is the peerstore. We had a rather nasty bug that will be fixed in the next release (we, uh, kind of didn't forget any address of any peer to which we had ever connected and, worse, advertised these (sometimes ephemeral) addresses to the network..).

@victorb
Member

victorb commented Jan 28, 2018

@Stebalien

that will be fixed in the next release

Does that mean that the fix is already in master or is work in progress?

@Stebalien
Member

Stebalien commented Jan 28, 2018 via email

@paralin

paralin commented Mar 29, 2018

I profiled it and it seems like a lot of the CPU waste is surprisingly in AddAddrs in the AddrManager. Reading that code, it seems very hasty and not performance minded. I'll PR something to go-libp2p-peerstore to optimize that with concurrent maps, which should help.

@Stebalien
Member

I'll PR something to go-libp2p-peerstore to optimize that with concurrent maps, which should help.

Unfortunately, the issue is libp2p/go-libp2p-peerstore#26 and the fact that the number of multiaddrs assigned to a peer can grow unchecked*. The peerstore actually works fine with a sane number of addresses.

*The previous version of go-ipfs failed to forget observed multiaddrs for peers and, worse, would gossip these observed multiaddrs. That, combined with NATs and ephemeral ports, led to a buildup of addresses for some peers.

The solution to this is really to sign peer address records (should be doing this anyways), enforce a maximum number of addresses, and require that there only be one valid peer address record per peer.
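
To illustrate just the "enforce a maximum number of addresses" part of that proposal, here is a rough sketch. The addrBook type is a hypothetical, simplified stand-in and is not the go-libp2p-peerstore API. Signed address records and the one-record-per-peer rule are not modelled, and the eviction policy (drop the least recently seen address) is an assumption made for the example.

```go
package main

import "fmt"

// addrBook is a hypothetical, simplified address book.
// It is NOT the go-libp2p-peerstore API.
type addrBook struct {
	maxAddrs int                 // upper bound on addresses kept per peer
	addrs    map[string][]string // peer ID -> addresses, oldest first
}

func newAddrBook(maxAddrs int) *addrBook {
	return &addrBook{maxAddrs: maxAddrs, addrs: make(map[string][]string)}
}

// add records addr for peer. An already-known address moves to the
// most-recent position; when the cap is exceeded, the oldest entries
// are dropped.
func (b *addrBook) add(peer, addr string) {
	list := b.addrs[peer]
	for i, a := range list {
		if a == addr {
			list = append(list[:i], list[i+1:]...)
			break
		}
	}
	list = append(list, addr)
	if over := len(list) - b.maxAddrs; over > 0 {
		// Copy so dropped entries are not kept alive by the old backing array.
		list = append([]string(nil), list[over:]...)
	}
	b.addrs[peer] = list
}

func main() {
	book := newAddrBook(3)
	// Simulate a peer behind a NAT being observed on many ephemeral ports.
	for port := 40000; port < 40010; port++ {
		book.add("QmPeer", fmt.Sprintf("/ip4/203.0.113.7/tcp/%d", port))
	}
	fmt.Println(len(book.addrs["QmPeer"]), book.addrs["QmPeer"])
}
```

With a cap in place, the per-peer address list stays bounded no matter how many ephemeral addresses are observed or gossiped.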

@paralin

paralin commented Mar 29, 2018

Yeah, but that code is still unoptimized and in general really rough, even for a small number of addresses. Agreed that there is a bigger reason though as you describe.

@Stebalien Stebalien added status/deferred Conscious decision to pause or backlog and removed status/ready Ready to be worked labels Dec 18, 2018
@maznu

maznu commented Feb 16, 2019

Still leaking memory in 0.4.18, between 0-100kB/sec (averaging at a rate of somewhere around 10kB/sec).

@whyrusleeping
Member

@maznu are you sure it's leaking memory? Go is a garbage-collected language, which means memory usage will appear to increase until a GC event. After a GC event, memory doesn't necessarily get released back to the OS, but internally the previously allocated memory will get reused.

How are you measuring this?
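
The distinction between live heap and memory the runtime keeps holding can be seen in a small standalone experiment. The sketch below uses runtime.ReadMemStats on its own process (not on the ipfs daemon). The heapSnapshot type and spike helper are names made up for this example, and the exact numbers vary by Go version and platform.

```go
package main

import (
	"fmt"
	"runtime"
)

// heapSnapshot captures two views of the heap as seen by the Go runtime.
type heapSnapshot struct {
	InUse    uint64 // bytes of allocated heap objects (HeapAlloc)
	Retained uint64 // idle heap bytes not yet returned to the OS
}

func snapshot() heapSnapshot {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return heapSnapshot{
		InUse:    m.HeapAlloc,
		Retained: m.HeapIdle - m.HeapReleased,
	}
}

// spike allocates n bytes, touches them, and returns snapshots taken
// while the buffer is live and after it has been garbage collected.
func spike(n int) (during, after heapSnapshot) {
	buf := make([]byte, n)
	for i := 0; i < len(buf); i += 4096 {
		buf[i] = 1 // touch each page so the memory is really used
	}
	during = snapshot()
	runtime.KeepAlive(buf)
	// buf is not referenced past this point, so the GC may reclaim it.
	runtime.GC()
	after = snapshot()
	return during, after
}

func main() {
	during, after := spike(64 << 20)
	fmt.Printf("during spike: in use %d MiB, retained %d MiB\n",
		during.InUse>>20, during.Retained>>20)
	fmt.Printf("after GC:     in use %d MiB, retained %d MiB\n",
		after.InUse>>20, after.Retained>>20)
}
```

If "in use" drops after the GC while the process RSS stays high, the runtime is holding freed memory rather than leaking live objects. A heap profile that keeps growing points at a real leak instead.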

@EugeneChung

Still leaking memory in 0.4.18, between 0-100kB/sec (averaging at a rate of somewhere around 10kB/sec).

https://golangcode.com/print-the-current-memory-usage/

Using this periodically, you can gather memory usage over several days. With a graphing tool like Microsoft Excel, you can check the trend in memory usage.
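
As an alternative to eyeballing a chart, the trend can be estimated numerically. The sketch below fits a least-squares line through (time, bytes) samples and reports the slope in bytes per second. The sample type, the growthRate helper, and the data in main are all made up for illustration and are not measurements from this thread.

```go
package main

import "fmt"

// sample is one memory measurement: seconds since the first
// measurement, and bytes in use at that time.
type sample struct {
	t     float64
	bytes float64
}

// growthRate returns the least-squares slope in bytes per second.
// It returns 0 when there are fewer than two distinct time points.
func growthRate(s []sample) float64 {
	n := float64(len(s))
	if n < 2 {
		return 0
	}
	var sumT, sumB float64
	for _, p := range s {
		sumT += p.t
		sumB += p.bytes
	}
	meanT, meanB := sumT/n, sumB/n
	var num, den float64
	for _, p := range s {
		num += (p.t - meanT) * (p.bytes - meanB)
		den += (p.t - meanT) * (p.t - meanT)
	}
	if den == 0 {
		return 0
	}
	return num / den
}

func main() {
	// Hypothetical hourly samples, for illustration only.
	samples := []sample{
		{0, 100e6},
		{3600, 136e6},
		{7200, 172e6},
		{10800, 208e6},
	}
	fmt.Printf("growth: %.1f bytes/sec\n", growthRate(samples))
}
```

A slope that stays clearly positive over many GC cycles suggests real growth, while a slope near zero with large swings suggests normal allocate-and-collect behaviour.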

@maznu

maznu commented Feb 23, 2019

Several days? It's eating up all the RAM on a 1GB VPS (and then being killed by the kernel OOM killer) within eight hours.

[Screenshot: memory usage graph, 2019-02-23 07:51:35]

You can see there that there is garbage collection and freeing back to the OS — plenty of green spikes within that orange lump of usage — but fundamentally it just continues to grow.

@paralin

paralin commented Feb 23, 2019

Can someone with bad memory usage please grab a memory trace?

@alexkursell

Can someone with bad memory usage please grab a memory trace?

I am experiencing this issue using go-ipfs 0.4.19:
https://ipfs.io/ipfs/QmSkYDJV1BJeLm2uEBqnshcmBRb1LMPPPxdBsUrGDNGv8J

For me it takes ~2 days for the daemon to exhaust 1GB of memory and get OOM killed.

@Stebalien
Member

@alexkursell I'm only seeing ~30MiB of memory usage on the heap. Unfortunately, I can't seem to download the goroutine stack traces.

When you grabbed that memory dump, how much memory was go-ipfs using (at that point in time)?

@whyrusleeping
Member

The biggest problem I'm seeing with memory usage lately isn't that ipfs always uses a lot of memory; it's that it randomly spikes to a lot of memory, and Go will pretty much never release that memory.

To debug this further, I would put a memory limit on the ipfs process (say, 1GB) so that it panics when the memory spikes, and we can then figure out what the problem is.
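
One possible way to apply such a limit is a small launcher that lowers its own address-space limit and then execs the daemon, so that allocations beyond the cap fail and the Go runtime aborts with a stack dump. This is a sketch under assumptions: it targets Linux, uses RLIMIT_AS (which caps virtual address space, a stricter measure than RSS, so the daemon may abort earlier than expected), and the limitBytes helper and 1024 MiB value are illustrative choices.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"syscall"
)

// limitBytes converts a limit given in MiB to bytes.
func limitBytes(mib uint64) uint64 { return mib << 20 }

func main() {
	// Usage: limit <command> [args...], for example: limit ipfs daemon
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: limit <command> [args...]")
		return
	}
	path, err := exec.LookPath(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
		os.Exit(1)
	}
	// Cap the virtual address space at 1 GiB (assumed value).
	lim := syscall.Rlimit{Cur: limitBytes(1024), Max: limitBytes(1024)}
	if err := syscall.Setrlimit(syscall.RLIMIT_AS, &lim); err != nil {
		fmt.Fprintln(os.Stderr, "setrlimit:", err)
		os.Exit(1)
	}
	// Replace this process with the command; the limit carries over.
	if err := syscall.Exec(path, os.Args[1:], os.Environ()); err != nil {
		fmt.Fprintln(os.Stderr, "exec:", err)
		os.Exit(1)
	}
}
```

Redirecting the daemon's stderr to a file keeps the stack traces printed at the moment the limit is hit.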

@alexkursell

@Stebalien I've grabbed a new set of diagnostics, along with the output of top: https://ipfs.io/ipfs/QmVB4s9Eu1XYxbikuzQix6SGUoDtqS46oyPJFanWLRMwV5 At the time this was taken, it looks like the daemon was using around 750MB.

@marrub--

I was able to run an ipfs node just fine for a while, but it has started taxing my server so much that it's impossible to keep using it. It would be fine even if it used a gigabyte, but it continues eating more and more memory until the server simply crashes.

@Stebalien
Member

@alexkursell

Go is "only" using about 300MiB of heap memory so it looks like memory usage spiked at some point and go never returned the memory.

The largest actual memory users appear to be:

@kaysond

kaysond commented Apr 24, 2019

+1. I just set up a node on an Ubuntu 19.04 vps, and it died after about a day. I'll try the latest master and see if that fixes it.

@whyrusleeping
Member

@kaysond (and others) when your nodes die due to running out of memory, can you please send us the stack traces? It will help us track down what's causing the memory spikes.

@kaysond

kaysond commented Apr 24, 2019

I built from the latest source, and it seems to have grown steadily then leveled off at around 600MB overnight.

@kaysond

kaysond commented Apr 30, 2019

@whyrusleeping after a few days it looks like it settled out at a solid 1GB RAM. I've attached all the dumps per the debug guide
memdebug.tar.gz

@Stebalien
Member

@kaysond

It looks like that memory is:

  1. The peerstore (fix in "migrate to datastore-backed peerstore", #6080).
  2. Bandwidth metric tracking. Unfortunately, we never forget old peers. You can disable bandwidth tracking by setting ipfs config --json "Swarm.DisableBandwidthMetrics true".

@kaysond

kaysond commented Apr 30, 2019

@Stebalien thanks. I'll add that to my config and see how much it helps. Is there a plan to implement said "forgetting"?

@Stebalien
Member

@kaysond not yet but it looks like we'll have to do that at some point. I've never seen that show up in a heap trace. You must have connected to ~0.5M (estimated) unique peers over the course of a few days.

I've filed an issue (https://github.com/libp2p/go-libp2p-metrics/issues/17) but it's unlikely to be a priority given that most systems connecting to that many peers have quite a bit of memory (unless that was entirely DHT traffic...).

That brings up a good point. If you're memory constrained, try running the daemon with --routing=dhtclient.

@kaysond

kaysond commented Apr 30, 2019

I set up a node mainly to serve a single website from ipfs, so the less memory it uses the cheaper my VPS can be.

I'm skeptical that the site draws that much traffic... so I guess it's just the nature of being connected to the swarm? The node isn't exactly a public gateway, so I'm not sure what caused all of the connections.

I'll try it with that option and see what happens.

@Stebalien
Member

The node isn't exactly a public gateway, so I'm not sure what caused all of the connections.

Probably the DHT.

@mkg20001

mkg20001 commented Jun 2, 2019

Any updates on this?

@mkg20001

mkg20001 commented Jun 2, 2019

Btw, the command to disable bandwidth metrics no longer works; the new one is ipfs config --bool Swarm.DisableBandwidthMetrics true

Is it even needed anymore?

@kaysond

kaysond commented Jun 2, 2019

With the command /usr/local/bin/ipfs daemon --enable-gc --routing=dhtclient, my node has settled at around 500MB RAM after several weeks.

@mkg20001

mkg20001 commented Jun 2, 2019

@kaysond Used that command. This + Swarm.DisableBandwidthMetrics works, thx

@Stebalien Stebalien added kind/bug A bug in existing code (including security flaws) and removed P0 Critical: Tackled by core team ASAP status/deferred Conscious decision to pause or backlog labels Apr 22, 2021
@Stebalien
Member

The remaining issue is #2848. Closing this one as it's quite old.
