Peer link is being dropped #3121
Comments
Hey, just to chime in: based on my testing, on Linux peers running version 0.34.0 or later, the connection drops after about 5 minutes and does not recover. In my case this is the setup: Management Server : All servers run on static IPs and all three peers would be running the same version of the client. I have already tried adding a common rule that allows all UDP ports, but to no avail. So essentially even if we assume that management and peer servers running the same I can get the logs; however, as this is a live environment it will cause downtime, so I will have to wait until the maintenance window, which will be some time in early January. In the meantime I will try to get at least logs for the
Interesting timing. I've been seeing more link instability in the last few days, maybe since 0.35, requiring a restart of some peers to reconnect. Sometimes they think they are connected but are not passing traffic. I did have some stability issues back around pre-0.20 or so that required restarting clients. Then things were quite stable for many months. I know this is very vague and doesn't provide useful information in and of itself, but I just wanted to add my anecdotal experience that the current instability hadn't shown up in my environment for quite some time.
Logs from a peer at the time it dropped off:
Hey @hadleyrich, I agree that's pretty much what I've been battling with for the past couple of weeks. I have a load balancer which uses the VPN to connect to various other VPS peers so that we can have a simple
That's pretty much my experience. I joined at around version
I wouldn't really call it "anecdotal". I have a monthly maintenance window during which I upgrade all of the packages on the OS, so when I do eventually upgrade, I may jump many minor and patch releases. Because things were more or less stable, I had no issues upgrading to the latest. Right now, all of my servers are sitting on a downgraded version of
I just noticed on a peer that had lost communication with another peer that "Last WireGuard handshake" was hours old while "Last connection update" was only minutes old, so it certainly points to something at the WG level getting out of sync. I think you're probably right; I probably saw stability issues reappearing around 0.34. I had become quite (probably overly) comfortable with the level of stability over the past months and had been happily tracking the latest releases. I don't yet run netbird in a production setting. It's more of a long-term stability test on my homelab "production" services before deploying to real customer-facing workloads.
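The handshake-age observation above can be turned into a rough check. This is only a sketch under assumptions: it expects text in the shape printed by `wg show <interface> latest-handshakes` (one public key and one Unix timestamp per line), the interface name `wt0` in the usage note and the 300-second threshold are guesses rather than values from this thread, and it may not apply when the client runs WireGuard in userspace.

```python
import time
from typing import Dict, Optional


def stale_peers(latest_handshakes: str,
                now: Optional[float] = None,
                max_age: float = 300.0) -> Dict[str, Optional[float]]:
    """Map each stale peer's public key to its handshake age in seconds.

    A peer is stale if its last handshake is older than max_age.
    A timestamp of 0 means no handshake has happened yet; those peers
    are reported with an age of None.
    """
    now = time.time() if now is None else now
    stale: Dict[str, Optional[float]] = {}
    for line in latest_handshakes.splitlines():
        parts = line.split()
        if len(parts) != 2 or not parts[1].isdigit():
            continue  # skip blank or unexpected lines
        key, timestamp = parts[0], int(parts[1])
        if timestamp == 0:
            stale[key] = None
        elif now - timestamp > max_age:
            stale[key] = now - timestamp
    return stale
```

On a host where the `wg` tool can see the interface, the input could come from something like `subprocess.run(["wg", "show", "wt0", "latest-handshakes"], capture_output=True, text=True).stdout`. A peer that shows up here while the client still reports it as connected would match the symptom described above.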
Hmm, it's funny: I have one machine that drops, and it's a Windows Server 2022. I don't see other clients stop. A simple restart fixes it, but I have to do it every day. I have Linux clients and an older Windows SBS server that all seem to be OK. I also have Windows 11 clients, which again seem fine. Even my Arch desktop is fine. Very odd.
Another data point: a long-running ping in screen, which keeps traffic going over the link, appears to keep the peer connected.
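For anyone who wants to reproduce this without a screen session, here is a minimal sketch of the same idea. It assumes a Linux `ping` that accepts `-c` and `-W`; the 25-second interval and the address in the usage note are arbitrary placeholders, not values from this thread. It is a workaround to keep traffic flowing, not a fix for the underlying drop.

```python
import subprocess
import time


def keepalive(host, interval=25.0, rounds=None,
              run=subprocess.run, sleep=time.sleep):
    """Send one ICMP echo to host every `interval` seconds.

    Runs forever when rounds is None, otherwise stops after `rounds`
    probes. Returns the number of probes that failed.
    """
    failures = 0
    sent = 0
    while rounds is None or sent < rounds:
        # One echo request, waiting at most 2 seconds for the reply.
        result = run(["ping", "-c", "1", "-W", "2", host],
                     stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        if result.returncode != 0:
            failures += 1
        sent += 1
        if rounds is None or sent < rounds:
            sleep(interval)
    return failures
```

Calling `keepalive("100.64.0.2")` with the overlay address of the affected peer (the address here is hypothetical) would run until interrupted. The `run` and `sleep` parameters exist only so the loop can be exercised without sending real traffic.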
It seems like the issue is with the WireGuard handshake. For instance, my Windows 11 PC seemingly struggles to connect to other Linux server peers despite everything running the latest NetBird version, in this case NetBird 0.35.2. One of my load balancer servers running Ubuntu 22.04 just refuses to keep the connection to other Linux servers for longer than 5 minutes before dying and needing to be restarted. I don't know what I'm doing wrong, but I always update the management server first and only then move on to the client nodes, first on the Linux servers and then on devices such as PCs/laptops/phones.
Hi everyone, happy New Year! Hey @mlsmaycon, sorry to ping you directly. Would you like me to run the same steps as listed last time? I will email the logs so you have a better picture. I am approaching a maintenance window for all our org servers and will be able to run a full debugging trace like last time. Also, I need to know whether the logging persists across client updates or whether I need to run it first on the old version and then again after the upgrade.
I think (in my case at least) this appears to be something triggered by, or related to, relaying. Previously I was not running the relay in my setup, only coturn. The peer I was having the most trouble with was connecting over relay. Adding the new relay service appears to have made that peer more stable for the last 12 hours or so.
Hmm, interesting. In my case, I am already running the new relay service. Strangely, some client versions seem to overuse the relay, and some underuse it; since Okay, you know what? It's late at night here in the UK, so let me try upgrading and getting at least some logs.
Ok, without looking at the trace logs generated by the client, my anecdotal research log shows this:
@mlsmaycon I've collected a full trace from UK1 and UK7 using the method listed in #3112 (comment) and am now parsing it to see if there is anything obvious. I will send it to the support email once I've reviewed everything.
Describe the problem
I have netbird installed as an overlay network. I have an ingress server and another server in a different location. Neither is behind NAT. As far as I can tell everything is set up properly, and it generally works. This time the overlay peer connection was dropped and the ingress server stopped talking to the other server. I noticed quickly because of my host-down alerts. I had to restart netbird on the ingress server to get the peer connection back up.
I've been running netbird for more than a year.
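The manual recovery described above (notice the peer is down, restart the client) could be automated as a stopgap. The following is a minimal sketch, not anything NetBird provides: `probe` and `restart` are hypothetical callables supplied by the operator, and the threshold of three consecutive failures is an arbitrary assumption.

```python
import time
from typing import Callable


def watch(probe: Callable[[], bool],
          restart: Callable[[], None],
          threshold: int = 3,
          checks: int = 10,
          interval: float = 30.0,
          sleep: Callable[[float], None] = time.sleep) -> int:
    """Restart the client after `threshold` consecutive failed probes.

    Runs `checks` probes, pausing `interval` seconds between them,
    and returns how many times restart() was called.
    """
    consecutive = 0
    restarts = 0
    for i in range(checks):
        if probe():
            consecutive = 0
        else:
            consecutive += 1
            if consecutive >= threshold:
                restart()
                restarts += 1
                consecutive = 0  # give the client a fresh start
        if i < checks - 1:
            sleep(interval)
    return restarts
```

Here `probe` might be a single ping to the other server's overlay address, and `restart` might wrap whatever command restarts the client on that host. Requiring several consecutive failures avoids restarting on a single lost packet, but it also delays recovery by roughly `threshold * interval` seconds.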
To Reproduce
Steps to reproduce the behavior:
1: Run netbird for a long amount of time
2: Monitor connections for loss
3: Run upgrades and configuration changes
I suspect that this happens around an upgrade or configuration change somewhere in the overall system. I am not certain as this only happens rarely. I suspect that there are bugs, probably race conditions, in the teardown and setup procedures that create this condition.
Expected behavior
Does not drop peers.
Are you using NetBird Cloud?
Self-hosted
NetBird version
The ingress server was on 0.29.2 this time (the problem has been observed with numerous versions), and the other server was on 0.35.0.
Additional context
I don't have time to track this down and be more specific as it's not consistent. This last incident was a production outage that I never had with nebula, so I'm switching back. The system looks nice and I've seen a lot of improvements. But I need reliability above all else and I haven't found it here. Good luck.