
AcceleratedDHTClient needs more resources than the ResourceManager has by default #8945

Closed
aschmahmann opened this issue May 5, 2022 · 8 comments · Fixed by #8980
@aschmahmann
Contributor

Installation method

ipfs-update or dist.ipfs.io

Version

go-ipfs version: 0.13.0-rc1
Repo version: 12
System version: amd64/windows
Golang version: go1.18.1

Config

Abridged:
{
  "Experimental": {
    "AcceleratedDHTClient": true,
  },
  "Internal": {
    "Bitswap": {
      "EngineBlockstoreWorkerCount": null,
      "EngineTaskWorkerCount": null,
      "MaxOutstandingBytesPerPeer": null,
      "TaskWorkerCount": null
    }
  },
  "Routing": {
    "Type": "dhtclient"
  },
  "Swarm": {
    "AddrFilters": null,
    "ConnMgr": {
      "GracePeriod": "20s",
      "HighWater": 100,
      "LowWater": 50,
      "Type": "basic"
    },
    "DisableBandwidthMetrics": false,
    "DisableNatPortMap": false,
    "RelayClient": {},
    "RelayService": {},
    "Transports": {
      "Multiplexers": {},
      "Network": {},
      "Security": {}
    }
  }
}

Description

Running with the AcceleratedDHTClient enabled results in bursty connections (e.g. on startup). The routing table ends up very small compared to running without the resource manager enabled, and as a result is not particularly functional.

Enabling debug logging on the resource manager shows that we're out of available outbound connections.
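
For reference, one way to turn on that logging. The subsystem name here is taken from the resourcemanager log lines quoted later in this thread, so treat it as an assumption rather than a documented knob:

GOLOG_LOG_LEVEL="resourcemanager=debug" ipfs daemon
# or, on a running daemon:
ipfs log level resourcemanager debug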

Some analysis here is still needed; however, AFAICT the reasons for this are:

  • We now have enforceable connection limits
  • The accelerated DHT client makes lots of connections in a short period of time. While it doesn't need to keep them open for very long, it can't really close them itself in case they're in use (e.g. for Bitswap requests), so it has to rely on connection garbage collection on the host, which also means waiting for the grace period.

Some options here include:

  1. Increase the resource limits, particularly for outbound connections and some overall limits like total connections and file descriptors (a config sketch follows after this list). Given that outbound connections are largely controlled by user behavior (e.g. turning on the AcceleratedDHTClient, making lots of API or gateway requests, etc.), this might not be so bad.
  2. Make the AcceleratedDHTClient consume fewer resources. Some examples include:
    • Reduce the (default) amount of concurrent work the AcceleratedDHTClient does during routing table initialization or maintenance
    • More aggressively return resources to the system. This could mean having the AcceleratedDHTClient call Trim(), or manually closing unprotected connections it recently used.
    • Have go-libp2p give the application layer more control over how dialing is done (e.g. for expensive but not latency-critical operations, have the client only dial WAN peers and try different transports serially rather than in parallel)
  3. Emit louder errors to users of the AcceleratedDHTClient when insufficient resources are available to maintain the routing tables. At the extreme, we could have hard-coded errors in go-ipfs telling users to bump their limits, or errors emitted from the AcceleratedDHTClient based on checking the error types received.
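
For option 1, a minimal sketch of what raised limits might look like in the config. The Swarm.ResourceMgr.Limits key structure is an assumption based on docs/config.md#swarmresourcemgr, and the numbers are illustrative rather than the values that ultimately landed in #8980:

{
  "Swarm": {
    "ResourceMgr": {
      "Enabled": true,
      "Limits": {
        "System": {
          "Conns": 4096,
          "ConnsOutbound": 2048,
          "FD": 4096
        }
      }
    }
  }
}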

Looking for feedback here on other ideas or preferences for how to handle this. My initial reaction is to go with increasing the limits and having the AcceleratedDHTClient log errors when its routing table won't update properly due to resource starvation. Without the logs, getting people to report issues properly will likely be difficult, and increasing the limits vs. reducing the resource consumption is something we can then more easily tune over time (i.e. lower the limits as we reduce resource consumption).

@aschmahmann aschmahmann added kind/bug A bug in existing code (including security flaws) need/triage Needs initial labeling and prioritization labels May 5, 2022
@guseggert
Contributor

Your plan sounds reasonable to me, as long as the new limits are reasonable. Do you know what the new limits should be for a clean startup?

For 2), adding a rate limit on peer dials with a knob should be sufficient for folks to work around whatever resource limitations they have.
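
A rough sketch (in Go, using golang.org/x/time/rate) of what such a dial rate-limit knob could look like. This is not an existing go-libp2p or go-libp2p-kad-dht option; dialLimiter, newDialLimiter, and the numbers are made up for illustration:

package main

import (
	"context"
	"fmt"

	"golang.org/x/time/rate"
)

// dialLimiter spaces out outbound dials so bursts of DHT crawling don't
// exhaust the resource manager's outbound-connection budget all at once.
type dialLimiter struct {
	limiter *rate.Limiter
}

// newDialLimiter allows roughly dialsPerSec dials per second with the given burst.
func newDialLimiter(dialsPerSec float64, burst int) *dialLimiter {
	return &dialLimiter{limiter: rate.NewLimiter(rate.Limit(dialsPerSec), burst)}
}

// dial blocks until a token is available, then performs the (placeholder) dial.
func (d *dialLimiter) dial(ctx context.Context, peer string) error {
	if err := d.limiter.Wait(ctx); err != nil {
		return err
	}
	fmt.Println("dialing", peer) // placeholder for the real dial/connect call
	return nil
}

func main() {
	dl := newDialLimiter(10, 20) // the "knob": 10 dials/sec, burst of 20
	for i := 0; i < 50; i++ {
		_ = dl.dial(context.Background(), fmt.Sprintf("peer-%d", i))
	}
}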

Also libp2p/go-libp2p-kad-dht#254 might help, though not a short-term mitigation.

@BigLep BigLep mentioned this issue May 5, 2022
@BigLep BigLep added this to the go-ipfs 0.13 milestone May 5, 2022
@BigLep BigLep moved this to 🥞 Todo in IPFS Shipyard Team May 5, 2022
@BigLep
Contributor

BigLep commented May 5, 2022

2022-05-05 conversation:

  1. We need errors to know when this situation occurs
  2. Let's understand how the limits were set.
  • For example, do we need a limit on outbound connections?
  • Capture how this was chosen

Current plan:

We're not tackling bigger items like:

  1. Being able to hand resources back to libp2p
  2. Optimizing resource consumption in common code paths

@BigLep
Contributor

BigLep commented May 17, 2022

2022-05-17 conversation:

Part 1: Conditions for when to print out an "accelerated DHT couldn't bootstrap" message

  1. Time-based, OR
  2. Number of X failed dials due to the resource manager
    Going with X=1.

Part 2: messaging
"You may have a gimped accelerated DHT. Consider raising the limits. Read here for more info."

In the PR or here, let's set our outbound connections very low and then paste in what the logs look like.


We also need to confirm whether we can reproduce this quickly locally. Let's paste the repro steps we took here.
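
A sketch of repro steps along those lines; the Limits key path is an assumption based on docs/config.md#swarmresourcemgr, and the value is just "something very low":

ipfs config --json Experimental.AcceleratedDHTClient true
ipfs config --json Swarm.ResourceMgr.Enabled true
ipfs config --json Swarm.ResourceMgr.Limits.System.ConnsOutbound 16
ipfs daemon
# then watch the daemon output for "resourcemanager" ERROR lines while the
# accelerated client tries to build its routing table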

@BigLep BigLep moved this from 🏃‍♀️ In Progress to 🔎 In Review in IPFS Shipyard Team May 19, 2022
Repository owner moved this from 🔎 In Review to 🎉 Done in IPFS Shipyard Team May 19, 2022
@BigLep BigLep moved this from 🎉 Done to 🔎 In Review in IPFS Shipyard Team May 19, 2022
@BigLep BigLep reopened this May 24, 2022
@BigLep
Contributor

BigLep commented May 24, 2022

Exit criteria:

ipfs stats dht | wc -l

Should be roughly equal with and without the resource manager enabled.
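
A sketch of how that check might be run, assuming Swarm.ResourceMgr.Enabled is the switch for the resource manager:

# baseline: resource manager off
ipfs config --json Swarm.ResourceMgr.Enabled false
# start the daemon, let the accelerated client finish bootstrapping, then:
ipfs stats dht | wc -l

# repeat with the resource manager on and compare the two counts
ipfs config --json Swarm.ResourceMgr.Enabled true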

@BigLep
Contributor

BigLep commented Jun 7, 2022

Resolving since we did adjust the limits.

There is a larger issue to address concerning the experimental DHT client's startup here: libp2p/go-libp2p-kad-dht#770

@BigLep BigLep closed this as completed Jun 7, 2022
Repository owner moved this from 🔎 In Review to 🎉 Done in IPFS Shipyard Team Jun 7, 2022
@BigLep BigLep moved this from 🎉 Done to ☑️ Done (Archive) in IPFS Shipyard Team Jun 14, 2022
@ThomasFreedman

I just upgraded a very large repo from v7 to v12, which succeeded according to the info reported. However, I see error messages like the following continually occurring:

2022-11-26T10:08:04.655-0600    ERROR   resourcemanager libp2p/rcmgr_logging.go:57      Consider inspecting logs and raising the resource manager limits. Documentation: https://github.com/ipfs/kubo/blob/master/docs/config.md#swarmresourcemgr
2022-11-26T10:08:14.656-0600    ERROR   resourcemanager libp2p/rcmgr_logging.go:53      Resource limits were exceeded 332 times with error "system: cannot reserve inbound connection: resource limit exceeded".

I can find no information on the process by which the logs are analyzed and changes are made to the config file to remove these repetitive error messages. The link provided in the error message is of little help in how to resolve the errors.
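
For reference, one way to act on that particular "cannot reserve inbound connection" error would be to raise the System inbound-connection limit. The key path and value here are illustrative, based on the linked config docs, and not a confirmed fix for this repo:

ipfs config --json Swarm.ResourceMgr.Limits.System.ConnsInbound 2048
# restart the daemon for the new limit to take effect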

@BigLep BigLep reopened this Nov 27, 2022
@BigLep BigLep closed this as completed Nov 27, 2022
Repository owner moved this from ☑️ Done (Archive) to 🎉 Done in IPFS Shipyard Team Nov 27, 2022
@BigLep
Contributor

BigLep commented Nov 27, 2022

@ThomasFreedman : you mentioned v12. Did you mean v0.17 (latest Kubo release)?

@ThomasFreedman

ThomasFreedman commented Nov 28, 2022

Yes, that's probably correct since I was upgrading to that. I just copied what was printed in the output.
My apologies if I hijacked this topic. My issue is directly related.

I see there is already a bug filed for this error condition, and one person posted his solution which I will apply. Hopefully that will resolve the problem.
