Lotus sync issue: libp2p v0.34.1 #2858

rjan90 · 2024-07-02T09:30:24Z

Louts updated to libp2p v0.34.1 in its latests release Lotus v1.27.1, and we are getting some reports from users encountering syncing issues which seems to be related to the resource manager:

{"level":"debug","ts":"2024-07-02T07:14:40.235Z","logger":"rcmgr","caller":"resource-manager/scope.go:480","msg":"blocked stream from constraining edge","scope":"stream-16657929","edge":"transient","direction":"Inbound","current":4233,"attempted":1,"limit":4233,"stat":

{"NumStreamsInbound":0,"NumStreamsOutbound":4233,"NumConnsInbound":0,"NumConnsOutbound":0,"NumFD":0,"Memory":0},"error":"transient: cannot reserve stream: resource limit exceeded"}

Another report indicated that they were unable to get peers after updating, but after a couple of restarts of their node, they were able to get back in sync. Unfortunately they were not able to get a goroutine dump, but will do it next time they enounter the same issue.

Do you have any additional tips, for what information to gather when encopuntering these rcmgr-issues?

The text was updated successfully, but these errors were encountered:

MarcoPolo · 2024-07-05T18:11:27Z

Setting up Grafana dashboards would really really help debug stuff like this.

The log points to an issue reserving a space for a new inbound stream in the "transient" scope. Streams are considered "transient" before we know what protocol they will be used for. Streams should only be transient until multistream finishes or the 10s timeout is hit, whichever comes first. I would be surprised if we were leaking "transient" streams, but it would be obvious in the dashboard if we are. Does this error log persist? Does the number in NumStreamsOutbound ever go down?

MarcoPolo · 2024-07-09T19:40:58Z

We've tried to make it as easy as possible to get started with the dashboards, so there's a docker compose file that spins up everything you need. Refer to go-libp2p/dashboards/README.md for more detailed instructions. But, in case it helps, here are some step by step tips to get it working with lotus.

go to go-libp2p/dashboards
Make this change to prometheus.yml. This tells it about Lotus's default metrics endpoint

diff --git a/dashboards/prometheus.yml b/dashboards/prometheus.yml
index f0917188..bfa09fc5 100644
--- a/dashboards/prometheus.yml
+++ b/dashboards/prometheus.yml
@@ -23,8 +23,8 @@ scrape_configs:
   honor_timestamps: true
   scrape_interval: 15s
   scrape_timeout: 10s
-  metrics_path: /debug/metrics/prometheus
+  metrics_path: /debug/metrics
   scheme: http
   static_configs:
   - targets:
-    - host.docker.internal:5001
+    - host.docker.internal:1234

run docker compose -f docker-compose.base.yml up on macOS or docker compose -f docker-compose.base.yml -f docker-compose-linux.yml up on Linux.
View metrics on Grafana at http://localhost:3000.

rvagg · 2024-07-11T10:59:15Z

Some relevant logs from a lotus user just now in a stuck node: filecoin-project/lotus#12205 (comment)

BigLep · 2024-07-17T06:29:13Z

@rvagg : what are the next steps for confirming that the new go-libp2p release addresses the issue found with Lotus? Are you checking on your machine, or should we push this out with 1.27.2 and ask more people to upgrade? (Context: I want to have high confidence this is addressed before the 1.28.0 release.)

rvagg · 2024-07-17T06:53:42Z

Yeah, 1.27.2 is not a bad idea, I can queue that up. So far, running 1.27.1 with an upgraded libp2p to 1.35.3 for a little over 24 hours, the various libp2p dashboards for my mainnet node are fairly stable and it's hard to tell on any of them when I did the upgrade.

On the other hand, for my calibnet node which is running master and therefore upgrading from the same libp2p that's in 1.28.0, there's a lot of difference but I don't know what's good or bad here so @sukunrt || @MarcoPolo would be best to interpret these. Transient streams seems the most promising measure, with no evidence of leaky growth. Here's where I see differences (the upgrade is around 07/16 10:00 on these graphs, there was another, longer restart earlier in the graphs that is less relevant):

sukunrt · 2024-07-17T08:35:12Z

It does look like the bug fixed by #2869 was the culprit. On #2764 I see that there was a regression on moving from v0.31 to v0.33. This bug was introduced in the v0.32 release, by this change(#2555).

Nothing immediately stands out for me from the graphs @rvagg shared. If you think any particular graph is suspicious, happy to help debug it.

sukunrt · 2024-07-17T08:54:18Z

Note: #2875 while related, is not causing this issue. That does add to the transient streams, but those are garbage collected once the stream completes.

We'll make a patch release with #2875 once it's merged.

hsanjuan · 2024-07-19T20:04:15Z

Some users are complaining that IPFS Cluster displays lots of connection problems after upgrading from v0.33.2 to v0.35.2 and they are resolved by downgrading go-libp2p. If this issue was introduced v0.32 then there must be some other issue at play. @jonasrmichel can perhaps provide more details...

jonasrmichel · 2024-07-19T20:22:29Z

Unfortunately, I can't provide too much more concrete data.

However, here's what I observed in a network with a couple thousand peers.

Running go-libp2p v0.35.1, a majority of the peers in the network have trouble negotiating connections. Both sides log connection resets and messages like this...

failed to negotiate security protocol: privnet: could not read full nonce

Also, running v0.35.1 the network consistently exhibits many routing errors with logs indicating routing not found for established peers.

Downgrading go-libp2p to v0.33.2 resolved both of the above issues.

sukunrt · 2024-07-20T10:11:37Z

@jonasrmichel @hsanjuan, Do you have many nodes on the same IPv4 address or the same /64 IPv6 subnet? If so the added limits for concurrent connections to the same IP might be the problem(https://github.com/libp2p/go-libp2p/blob/master/p2p/host/resource-manager/conn_limiter.go#L32)

Is it possible to setup metrics like this comment explains?

jonasrmichel · 2024-07-22T15:04:45Z

Ah, fascinating.

Per-subnet connection limiting does map to the symptoms I've observed.

Our network's bootstrap peers are behind a load balancer. While each bootstrap peer is reachable on a dedicated port, these peers do share three public IP addresses assigned to the load balancer.

Is there a way to disable the connection limits per subnet?

MarcoPolo · 2024-07-22T15:23:20Z

Yes you can disable it. You can also set no bounds for a specific cidr. This comment may help #2872 (comment). Along with the go doc: https://pkg.go.dev/github.com/libp2p/go-libp2p@v0.35.4/p2p/host/resource-manager#WithNetworkPrefixLimit

rjan90 · 2024-07-29T07:34:04Z

From @rvagg in a thread in the Filecoin-slack:

graphs looking pretty healthy on both my nodes, nothing suggesting leaky behaviour, hopefully this is all resolved now
thanks for the help on this! 🤝

I will close this issue now and re-open it if more users report this error after upgrading to go-libp2p v0.35.4.

rjan90 mentioned this issue Jul 3, 2024

[WIP] Lotus Release v1.28.0 filecoin-project/lotus#12109

Closed

55 tasks

sukunrt mentioned this issue Jul 11, 2024

basichost: reset new stream if rcmgr blocks #2869

Merged

rjan90 mentioned this issue Jul 16, 2024

chore: libp2p: update to v0.35.3 filecoin-project/lotus#12240

Merged

rvagg mentioned this issue Jul 17, 2024

Release: v1.27.2 filecoin-project/lotus#12249

Merged

sukunrt mentioned this issue Jul 19, 2024

basichost: reset stream if SetProtocol fails #2875

Merged

sukunrt mentioned this issue Jul 22, 2024

v0.36 #2778

Closed

19 tasks

rjan90 closed this as completed Jul 29, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Lotus sync issue: libp2p v0.34.1 #2858

Lotus sync issue: libp2p v0.34.1 #2858

rjan90 commented Jul 2, 2024 •

edited

Loading

MarcoPolo commented Jul 5, 2024

MarcoPolo commented Jul 9, 2024

rvagg commented Jul 11, 2024

BigLep commented Jul 17, 2024

rvagg commented Jul 17, 2024

sukunrt commented Jul 17, 2024 •

edited

Loading

sukunrt commented Jul 17, 2024

hsanjuan commented Jul 19, 2024

jonasrmichel commented Jul 19, 2024

sukunrt commented Jul 20, 2024

jonasrmichel commented Jul 22, 2024

MarcoPolo commented Jul 22, 2024

rjan90 commented Jul 29, 2024

Lotus sync issue: libp2p v0.34.1 #2858

Lotus sync issue: libp2p v0.34.1 #2858

Comments

rjan90 commented Jul 2, 2024 • edited Loading

MarcoPolo commented Jul 5, 2024

MarcoPolo commented Jul 9, 2024

rvagg commented Jul 11, 2024

BigLep commented Jul 17, 2024

rvagg commented Jul 17, 2024

sukunrt commented Jul 17, 2024 • edited Loading

sukunrt commented Jul 17, 2024

hsanjuan commented Jul 19, 2024

jonasrmichel commented Jul 19, 2024

sukunrt commented Jul 20, 2024

jonasrmichel commented Jul 22, 2024

MarcoPolo commented Jul 22, 2024

rjan90 commented Jul 29, 2024

rjan90 commented Jul 2, 2024 •

edited

Loading

sukunrt commented Jul 17, 2024 •

edited

Loading