libp2p Network Resource Manager (Swarm.ResourceMgr)

Purpose

The purpose of this document is to provide more information about the libp2p Network Resource Manager and how it's integrated into Kubo so that Kubo users can understand and configure it appropriately.

🙋 Help! The resource manager is protecting my node but I want to understand more

The resource manager is a feature to bound libp2p's resource usage, whether the excess comes from bugs, unintentionally misbehaving peers, or intentional Denial of Service attacks.

Good places to start are:

  1. Understand how the resource manager is configured.
  2. Understand how to read the log messages.
  3. Understand how to inspect and change limits.


Levels of Configuration

See also the Swarm.ResourceMgr config docs.

Approach

libp2p's resource manager provides tremendous flexibility but also adds complexity. Kubo supports the following levels of limit configuration for resource management protection:

  1. "The user who does nothing" - In this case Kubo attempts to give some sane defaults discussed below based on the amount of memory and file descriptors their system has. This should protect the node from many attacks.

  2. "Slightly more advanced user" - They can tweak the default limits discussed below.
    Where the defaults aren't good enough, a good set of higher-level "knobs" are exposed to satisfy most use cases without requiring users to wade into all the intricacies of libp2p's resource manager. The "knobs"/inputs are Swarm.ResourceMgr.MaxMemory and Swarm.ResourceMgr.MaxFileDescriptors as described below.

  3. "Power user" - They specify overrides to computed default limits via ipfs swarm limit and Swarm.ResourceMgr.Limits;

Computed Default Limits

With the Swarm.ResourceMgr.MaxMemory and Swarm.ResourceMgr.MaxFileDescriptors inputs defined, resource manager limits are created at the system, transient, and peer scopes. Other scopes are ignored (by being set to "~infinity").

The reason these scopes are chosen is because:

  • system - This gives us the coarse-grained control we want so we can reason about the system as a whole. It is the backstop, and allows us to reason about resource consumption more easily since we don't have to think about the interaction of many other scopes.
  • transient - Limiting connections that are in the process of being established provides backpressure so that not too much work queues up.
  • peer - The peer scope doesn't protect us against intentional DoS attacks. It's just as easy for an attacker to send 100 requests/second from 1 peer ID as to send 10 requests/second from each of 10 peers. We rely on the system scope for protection in the malicious case. The reason for having a peer scope is to protect against unintentional DoS attacks (e.g., a bug in a peer which causes it to "misbehave"). In the unintentional case, we want to make sure a "misbehaving" node doesn't consume more resources than necessary.

Within these scopes, limits are only set on memory, file descriptors (FD), inbound connections, and inbound streams. Limits are derived from the Swarm.ResourceMgr.MaxMemory and Swarm.ResourceMgr.MaxFileDescriptors inputs above. We trust the local node to behave properly and thus don't set outbound connection/stream limits. We also apply any limits that libp2p has for its protocols/services, since we assume libp2p knows best here.

Source: core/node/libp2p/rcmgr_defaults.go
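
As an illustration, the computed limits for the system scope can be inspected at runtime. The output below is only a sketch of its shape (field names follow go-libp2p's limit JSON); the actual numbers depend on your machine and on the MaxMemory / MaxFileDescriptors inputs, and 1000000000 is the "effectively infinite" magic value discussed later:

$ ipfs swarm limit system
{
  "Conns": 1000000000,
  "ConnsInbound": 123,
  "ConnsOutbound": 1000000000,
  "FD": 4096,
  "Memory": 4294967296,
  "Streams": 1000000000,
  "StreamsInbound": 1968,
  "StreamsOutbound": 1000000000
}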

User Supplied Override Limits

Once Kubo has the Computed Default Limits, it then applies any user-supplied Swarm.ResourceMgr.Limits on top. These become the active limits.

While Swarm.ResourceMgr.Limits can be edited directly, it is also possible to use the ipfs swarm limit command to inspect and tweak specific limits at runtime.
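
For example, to override the inbound connection limit at the system scope directly in the config (a minimal sketch: 256 is an arbitrary illustrative value, fields you don't set keep their computed defaults, and this assumes your Kubo version accepts dotted config paths here; config edits take effect after a daemon restart, whereas the ipfs swarm limit flow below applies at runtime):

$ ipfs config --json Swarm.ResourceMgr.Limits.System.ConnsInbound 256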

To see all resources that are close to hitting their respective limit:

$ ipfs swarm stats --min-used-limit-perc=90 all

To modify limits for a specific scope (e.g., system):

$ ipfs swarm limit system > change.json
$ vi change.json
$ ipfs swarm limit system change.json

Learn more: ipfs swarm limit --help

Infinite limits

There isn't a way via config to specify infinite limits (see go-libp2p#1935). For example, "-1" is not infinity. To work around this, Kubo uses a magic number of "1000000000" to denote infinity since it's effectively infinite.
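
So to effectively lift a limit, set it to that magic value rather than -1; for example (illustrative, with the same dotted-path assumption as the earlier config example):

$ ipfs config --json Swarm.ResourceMgr.Limits.System.ConnsInbound 1000000000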

FAQ

What do these "Protected from exceeding resource limits" log messages mean?

"Protected from exceeding resource limits" log messages denote that the resource manager is working and that it prevented additional resources being used beyond the set limits. Per libp2p code, these messages take the form of "$scope: cannot reserve $limitKey".

As an example:

Protected from exceeding resource limits 2 times: "system: cannot reserve inbound connection: resource limit exceeded"

This means that there were 2 recent occurrences where the libp2p resource manager prevented an inbound connection at the "system" scope.
Specifically, the Swarm.ResourceMgr.Limits.System.ConnsInbound active limit was hit.

This can be analyzed by viewing the limit with ipfs swarm limit system and comparing the usage with ipfs swarm stats system. ConnsInbound is likely close to or at the limit value.

The simplest way to identify all resources across all scopes that are close to exceeding their limit is with a command like ipfs swarm stats --min-used-limit-perc=90 all.
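
For the example above, a minimal check pairs the two commands and compares their ConnsInbound values:

$ ipfs swarm limit system   # the active limit for the system scope
$ ipfs swarm stats system   # current usage in the system scope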


What are the "Application error ... cannot reserve ..." messages?

These are messages from a remote go-libp2p peer (likely another Kubo node) that has the resource manager enabled, explaining why it failed to establish a connection.

This can be confusing, but these Application error ... cannot reserve ... messages can occur even if your local node has the resource manager disabled.

You can tell that a resource manager message originates from your local node if it comes from the resourcemanager / libp2p/rcmgr_logging.go logger, or if it contains the string that is unique to Kubo (and not in go-libp2p): "Protected from exceeding resource limits".

There is a go-libp2p issue (#1928) to make it clearer that this is an error message originating from a remote peer.

How does the resource manager (ResourceMgr) relate to the connection manager (ConnMgr)?

As discussed here, these are separate systems in go-libp2p. Kubo also configures the ConnMgr separately from the ResourceMgr. There is no checking to make sure the limits between the two systems are congruent.

Ideally, Swarm.ConnMgr.HighWater is less than Swarm.ResourceMgr.Limits.System.ConnsInbound. This allows the ConnMgr to kick in and clean up connections based on connection priorities before the hard limits of the ResourceMgr are applied. If Swarm.ConnMgr.HighWater is greater than Swarm.ResourceMgr.Limits.System.ConnsInbound, existing low-priority idle connections can prevent new high-priority connections from being established. The ResourceMgr doesn't know that the new connection is high priority and simply blocks it because of the limit it's enforcing.
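
A quick way to sanity-check this on your own node (a sketch; HighWater only appears in ipfs config show if Swarm.ConnMgr has been explicitly set):

$ ipfs config show | grep -A 5 '"ConnMgr"'   # note HighWater, if set
$ ipfs swarm limit system                    # note ConnsInbound; it should be higher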

How does one see the Active Limits?

A dump of the limits actually being used by the resource manager (Computed Default Limits + User Supplied Override Limits) can be obtained with ipfs swarm limit all.

How does one see the Computed Default Limits?

This can be observed by leaving Swarm.ResourceMgr.Limits empty and then viewing the active limits; a sketch of this flow is shown below.
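
One way to do that is sketched below; note that this clears any overrides you have set (save them first if needed) and requires a daemon restart to take effect:

$ ipfs config --json Swarm.ResourceMgr.Limits '{}'
$ ipfs swarm limit all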

How does one monitor libp2p resource usage?

For monitoring libp2p resource usage, various rcmgr_* metrics are exposed at the Prometheus endpoint {Addresses.API}/debug/metrics/prometheus (default: http://127.0.0.1:5001/debug/metrics/prometheus).
There are also pre-built Grafana dashboards that can be added to a Grafana instance.
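
For example, to quickly list the resource manager metrics from a locally running daemon (assuming the default API address):

$ curl -s http://127.0.0.1:5001/debug/metrics/prometheus | grep rcmgr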

A textual view of current resource usage and a list of services, protocols, and peers can be obtained via the ipfs swarm stats command (see ipfs swarm stats --help).

History

Kubo first exposed this functionality in Kubo 0.13, but it was disabled by default. It was then enabled by default in Kubo 0.17. Until that point, Kubo was vulnerable to unbounded resource usage, which could bring down nodes. Introducing limits like this by default after the fact is tricky, which is why there have been changes and improvements since then.