Sudden Peering Drops Since Updating to v0.26.2 #2311
Comments
Thanks for raising this issue @nisdas, we'll start looking into this this week. This seems very similar to #2202, which also reported peer loss after upgrading to 0.26.2 but was closed after investigation by @MarcoPolo and @Giulio2002.
Update: looking into this issue now. To aid further, here are some steps Prysm users could take that would be helpful:
You can enable just the resource manager logs. Here is an example dashboard you might want users to enable to get some more visibility into what's going on: #2232
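One way to raise only that subsystem's log level (a minimal sketch; it assumes the resource manager registers its go-log logger under the "rcmgr" subsystem name, which may vary by go-libp2p version):

```go
package main

import (
	"log"

	logging "github.com/ipfs/go-log/v2"
)

func main() {
	// Raise only the resource manager subsystem to debug while keeping the
	// rest of the libp2p logs quiet. The "rcmgr" name is an assumption about
	// how the subsystem is registered in this go-libp2p version.
	if err := logging.SetLogLevel("rcmgr", "debug"); err != nil {
		log.Fatal(err)
	}
	// The same can typically be done without a code change via the
	// GOLOG_LOG_LEVEL environment variable, e.g. GOLOG_LOG_LEVEL="rcmgr=debug".
}
```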
I thought it could be a change in the default rcmgr limits between v0.24 and v0.26.2, but I did a quick check and they are the same. I've preemptively created a PR to ensure that we don't accidentally change these, but this isn't the issue here. I'm wondering if there's a memory leak somewhere? The fact that this only shows up after nodes have been running for a while sounds leak-like, and I wonder if mplex is the culprit. Am I correct in my understanding that Prysm only uses mplex (no yamux)? It would be really helpful if you enable the resource manager dashboards to see if there's a leak in the memory reservations.
Thank you, I can relay this to users in the event they get stuck again.
Is there a specific set of metrics that would be helpful to track for this? I can mention this to users in the event they run into it. We can also monitor this on our own internal nodes (although we have had no luck reproducing it there).
Prysm uses both. Not all clients on the network support yamux, so for backwards compatibility we still support both yamux and mplex connections.
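For context, a minimal sketch of how a host can offer both muxers while preferring yamux (import paths and option names reflect go-libp2p around v0.26 and are an assumption here, not Prysm's actual wiring):

```go
package main

import (
	"log"

	"github.com/libp2p/go-libp2p"
	"github.com/libp2p/go-libp2p/p2p/muxer/mplex"
	"github.com/libp2p/go-libp2p/p2p/muxer/yamux"
)

func main() {
	// Muxers are offered in the order they are listed, so yamux is preferred
	// while mplex remains available for peers that only speak mplex.
	h, err := libp2p.New(
		libp2p.Muxer("/yamux/1.0.0", yamux.DefaultTransport),
		libp2p.Muxer("/mplex/6.7.0", mplex.DefaultTransport),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer h.Close()
	log.Println("created host", h.ID())
}
```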
I just saw this in the example:

```go
// Register the resource manager's metrics with the default Prometheus registry.
rcmgrObs.MustRegisterWith(prometheus.DefaultRegisterer)

// Create a trace reporter that feeds resource manager stats into those metrics.
str, err := rcmgrObs.NewStatsTraceReporter()
if err != nil {
	log.Fatal(err)
}

// Build a resource manager with the scaled default limits and the reporter attached.
rmgr, err := rcmgr.NewResourceManager(rcmgr.NewFixedLimiter(rcmgr.DefaultLimits.AutoScale()), rcmgr.WithTraceReporter(str))
if err != nil {
	log.Fatal(err)
}

// Hand the resource manager to the host.
server, err := libp2p.New(libp2p.ResourceManager(rmgr))
if err != nil {
	log.Fatal(err)
}
```

Is there any reason this isn't enabled by default in libp2p?
Sorry for the noise, I misread your code. You have yamux set up.
The "System Memory Reservation" could confirm the memory leak theory. And the "Number of blocked resource requests".
To avoid the small cost of collecting the metrics if you aren't going to use them. |
Ok, I can try running a custom branch on our infra with all those metrics enabled. If there are any issues, they should show up in the metrics within a few hours (even if they don't lead to peering drops).
Hey, one of our users managed to get this from one of their stuck nodes; I filtered the logs down to only those referencing the resource manager:
It took a bit longer to get our internal infra set up. It doesn't show anything interesting so far, but I will paste anything we find later.
Hey @MarcoPolo, did you get a chance to look at the above?
Thanks! This is useful. This hints at a possible leak in the transient connection counter. I'll try to reproduce this by manually setting the transient limit to something small (e.g. 2) and running a Kubo node for a while. Prysm just uses the TCP transport with noise and yamux/mplex, is that correct?
Yeap, that is correct @MarcoPolo |
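A rough sketch of what clamping the transient limit for such a repro could look like (assuming the v0.25+ rcmgr API with ScalingLimitConfig and BaseLimit; field names are from memory and may need adjusting):

```go
package main

import (
	"log"

	"github.com/libp2p/go-libp2p"
	rcmgr "github.com/libp2p/go-libp2p/p2p/host/resource-manager"
)

func main() {
	// Start from the library defaults, then clamp the transient scope so a
	// leak in transient accounting hits the limit in minutes instead of days.
	limits := rcmgr.DefaultLimits
	limits.TransientBaseLimit.Conns = 2
	limits.TransientBaseLimit.ConnsInbound = 2
	limits.TransientBaseLimit.ConnsOutbound = 2
	// Don't let the transient limit grow with available memory.
	limits.TransientLimitIncrease = rcmgr.BaseLimitIncrease{}

	rm, err := rcmgr.NewResourceManager(rcmgr.NewFixedLimiter(limits.AutoScale()))
	if err != nil {
		log.Fatal(err)
	}

	h, err := libp2p.New(libp2p.ResourceManager(rm))
	if err != nil {
		log.Fatal(err)
	}
	defer h.Close()
}
```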
@MarcoPolo The infrastructure we have been running on hasn't been the most stable, so our nodes have been restarted once or twice a day. However, there definitely appears to be a pattern in the metric: the transient stream count seems to be steadily increasing, and it is probably worse for nodes with a larger number of peers. The above chart covers a period of 8 days.
Is it possibly an issue with correctly accounting for mplex streams?
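As a cross-check against the dashboard, a small sketch (a hypothetical helper, not Prysm code) that reads the live transient scope counters directly from the node's resource manager:

```go
package diagnostics

import (
	"log"

	"github.com/libp2p/go-libp2p/core/network"
)

// dumpTransientStats logs the transient scope's live counters so a slow leak
// would show up even without the Grafana dashboards. rm is the node's
// resource manager.
func dumpTransientStats(rm network.ResourceManager) {
	err := rm.ViewTransient(func(scope network.ResourceScope) error {
		st := scope.Stat()
		log.Printf("transient: streams in=%d out=%d, conns in=%d out=%d, memory=%d",
			st.NumStreamsInbound, st.NumStreamsOutbound,
			st.NumConnsInbound, st.NumConnsOutbound, st.Memory)
		return nil
	})
	if err != nil {
		log.Printf("failed to read transient scope: %v", err)
	}
}
```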
Just to add on, we have opened this PR to always favour yamux: This might have exacerbated the situation.
I wouldn't think yamux or mplex makes a difference. This code is in the basic host, so both muxers use the same code path. Any links I could follow to set up my own node to see if I can repro?
Actually, I have a hunch. I'll try something out tomorrow (It's already quite late) and report back. Thanks for the help so far! |
@MarcoPolo @marten-seemann Confirmed. This line fixes the problem: libp2p.ResourceManager(nil). There is some problem with the default ResourceManager, and it needs to be solved urgently. I hit the bug with this code (server and client):
This should be fixed in the latest release. Please let me know if you still see this issue after updating. |
@MarcoPolo I think the problem remains. After a while I can't connect to my server, even over the local network. |
Could you please provide a repro? Ideally something that uses docker containers and compose. That way it's super easy to reproduce. |
@MarcoPolo I found the problem. If I remove the line: libp2p.DisableMetrics(), everything works. |
That doesn't make sense. I don't think that's the problem. |
In Prysm we updated our go-libp2p version from v0.24.0 to v0.26.2 for our v4 release. However, ever since then we have had multiple reports of sudden peering drops, where nodes would disconnect from all their peers after a short period of time and would be unable to reconnect with any new peers. This is usually triggered after a period of extended use (4-10 days), which makes it very difficult to reproduce. This issue requires node operators to restart their nodes in order to be operational again.

A few user reports on this:
prysmaticlabs/prysm#12255

After investigating this, we narrowed it down to the resource manager update in v0.26.2 and have provided a flag for users to disable the resource manager in the event they encounter this issue (a rough sketch of such wiring is shown below). We do not have libp2p logs that we can provide, as enabling them would make node logs extremely noisy, so most users do not enable them. As of the current moment, all affected users that have disabled the resource manager have their nodes running normally, as they were before. Currently Prysm is using the default resource manager config that libp2p provides under its defaults.

Version Information
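For illustration only, a hedged sketch of how such a disable flag might be wired up. This is not Prysm's actual code: the flag name is hypothetical, and it assumes this go-libp2p version exposes network.NullResourceManager as a no-op struct, which has changed form across releases.

```go
package main

import (
	"flag"
	"log"

	"github.com/libp2p/go-libp2p"
	"github.com/libp2p/go-libp2p/core/network"
	rcmgr "github.com/libp2p/go-libp2p/p2p/host/resource-manager"
)

func main() {
	// Hypothetical flag name; the real flag in Prysm may differ.
	disableRcmgr := flag.Bool("disable-resource-manager", false, "turn off libp2p's resource manager")
	flag.Parse()

	var opt libp2p.Option
	if *disableRcmgr {
		// No-op resource manager: nothing is limited or accounted (assumed API form).
		opt = libp2p.ResourceManager(&network.NullResourceManager{})
	} else {
		// Roughly what go-libp2p builds by default: a fixed limiter over the
		// autoscaled default limits.
		rm, err := rcmgr.NewResourceManager(rcmgr.NewFixedLimiter(rcmgr.DefaultLimits.AutoScale()))
		if err != nil {
			log.Fatal(err)
		}
		opt = libp2p.ResourceManager(rm)
	}

	h, err := libp2p.New(opt)
	if err != nil {
		log.Fatal(err)
	}
	defer h.Close()
	log.Println("host up:", h.ID())
}
```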