
ConnectionManager peer HighWater configuration not honored (# peers spike up => OOM on VPS) #4718

Closed
AndreaCensi opened this issue Feb 19, 2018 · 8 comments
Labels
kind/bug (A bug in existing code, including security flaws) · topic/connection-manager (Issues related to Swarm.ConnMgr, the connection manager)

Comments

@AndreaCensi

Version information:

go-ipfs version: 0.4.13-
Repo version: 6
System version: amd64/linux
Golang version: go1.9.2

Type:

Bug

Description:

Context: on my system, ipfs's memory usage grows until the daemon is OOM-killed. (I run ipfs on a VPS (Linode) with 1 vCPU and 1GB of RAM.)

I have tried all the suggestions I found by searching for similar issues, and stumbled upon the suggestion of limiting the number of peers.

However, it seems that the configuration switches for the connection manager are not honored.

I use this configuration:

"Swarm": {
    "AddrFilters": null,
    "ConnMgr": {
      "GracePeriod": "20s",
      "HighWater": 20,
      "LowWater": 10,
      "Type": "basic"
    },
    "DisableBandwidthMetrics": false,
    "DisableNatPortMap": false,
    "DisableRelay": false,
    "EnableRelayHop": false
  }

I would expect the number of peers to be bounded by 20 (or 20 plus a small margin), but the peer count averages ~60 after a couple of minutes and spikes up to ~200 some time later. I suspect it gets even higher, but at that point the instance becomes unresponsive and I can no longer access it. When I log in later, I find that ipfs was OOM-killed.
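
For reference, the same limits can be set from the CLI and the live peer count watched while the daemon runs. This is only a quick sketch; it assumes the config keys above and that the daemon is restarted after the changes so they take effect:

    ipfs config --json Swarm.ConnMgr.LowWater 10
    ipfs config --json Swarm.ConnMgr.HighWater 20
    ipfs config Swarm.ConnMgr.GracePeriod 20s
    ipfs config Swarm.ConnMgr.Type basic
    # restart the daemon, then watch the connected peer count every 10 seconds
    watch -n 10 'ipfs swarm peers | wc -l'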

@Stebalien added the kind/bug and topic/connection-manager labels on Feb 19, 2018
@Stebalien
Member

Your expectations are pretty much correct. We close open connections at most once every 10 seconds, and only close connections that have been open for longer than the grace period (20s), but you shouldn't be spiking up to ~200 peers in a couple of minutes (unless they're all connecting within 20 seconds...).

Regardless, this is definitely a bug.

@AndreaCensi
Author

A couple more observations:

I did manage to see the "endgame": after 8 hours it was at 1.2GB memory usage (RAM and swap), but there were only 97 peers. So it seems the memory increases, but not in proportion to the number of peers (a leak?).

I also tried running the server with --routing=dhtclient and the issue remains.
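
A rough way to check how (or whether) memory tracks the peer count over time, assuming GNU ps/date on the Linux VPS and a daemon process named ipfs:

    # log a UTC timestamp, the connected peer count, and the daemon's RSS once a minute
    while true; do
      printf '%s peers=%s rss_kb=%s\n' "$(date -u +%FT%TZ)" \
        "$(ipfs swarm peers | wc -l)" \
        "$(ps -o rss= -C ipfs | head -n 1)" >> ipfs-mem.log
      sleep 60
    done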

@Stebalien
Member

So, it seems that the memory increases, but it is not proportional to the number of peers

The current release has an issue where it:

  1. Never forgets information about any peer.
  2. Remembers and gossips tons of ephemeral addresses (addresses with ephemeral ports).

This should be fixed in the next release (it has been fixed in master) but we're trying to iron out a few bugs first.
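
For anyone digging into where the per-peer memory goes, the daemon exposes Go pprof endpoints on the API address. A sketch, assuming the default API at 127.0.0.1:5001 and that the matching ipfs binary is on your PATH (adjust the path if it isn't):

    # capture a heap profile and a goroutine dump from the running daemon
    curl -s -o ipfs.heap 'http://127.0.0.1:5001/debug/pprof/heap'
    curl -s -o ipfs.goroutines 'http://127.0.0.1:5001/debug/pprof/goroutine?debug=2'
    # summarize the largest in-use allocations
    go tool pprof -top "$(which ipfs)" ipfs.heap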

@AndreaCensi
Author

@Stebalien

I did a source install from master (0.4.14-dev).

The memory usage decreased, though it is unclear at this point whether it still grows indefinitely.

However, I still see too many peers connect (oscillating between 100 and 115 at the moment).
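
For anyone else who wants to test against master before the release, this is roughly how I built from source, assuming a working Go toolchain and that the repo's make targets are unchanged:

    git clone https://github.com/ipfs/go-ipfs
    cd go-ipfs
    make install    # builds ipfs and installs it into $GOPATH/bin
    ipfs version    # should now report a 0.4.14-dev build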

@AndreaCensi
Author

After a couple of days running 0.4.14-dev, I can report the following:

  • the number of peers stabilizes around 35 (good!)
  • the memory usage still grows over time; it is currently at 1GB. It grows slowly, so I expect it to last another day or so until it gets OOM-killed (see the check below).
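
When it does get killed, the kernel log should confirm it was the OOM killer. A quick check after logging back in (assuming dmesg or journald access):

    # look for OOM-killer activity in the kernel log
    dmesg | grep -iE 'out of memory|killed process'
    # or, on systemd machines
    journalctl -k | grep -iE 'out of memory|killed process'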

@MichaelMure
Contributor

This issue might be solved now. I regularly saw memory usage growing unbounded over a few days until OOM, but not anymore.
[screenshot: memory usage over time]

@Stebalien
Member

Late to the party...

@MichaelMure

The issue here is that we don't have any "MaxConns" hard limit. You were probably noticing a different memory leak.

@Stebalien
Member

Actually, reading through this issue, it appears that it is really about other per-peer memory leaks.

(sorry for the noise)
