Peer link is being dropped #3121

Open
bmansfie opened this issue Dec 28, 2024 · 12 comments

@bmansfie

Describe the problem

I have netbird installed as an overlay network. I have an ingress server and another server in another location. Neither is behind NAT. As far as I can tell everything is set up properly, and generally it works. The overlay peer connection got dropped and the ingress server stopped talking to the other server. I noticed it quickly because of my host down alerts. I had to restart netbird on the ingress server to get the peer connection back up.

I've been running netbird for more than a year.

To Reproduce

Steps to reproduce the behavior:
1: Run netbird for a long amount of time
2: Monitor connections for loss
3: Run upgrades and configuration changes

I suspect that this happens around an upgrade or configuration change somewhere in the overall system. I am not certain as this only happens rarely. I suspect that there are bugs, probably race conditions, in the teardown and setup procedures that create this condition.

Expected behavior

Does not drop peers.

Are you using NetBird Cloud?

Self-hosted

NetBird version

Ingress was 0.29.2 on this latest time (has been observed with numerous versions), and the other server was on 0.35.0.

Additional context

I don't have time to track this down and be more specific as it's not consistent. This last incident was a production outage that I never had with nebula, so I'm switching back. The system looks nice and I've seen a lot of improvements. But I need reliability above all else and I haven't found it here. Good luck.

@rihards-simanovics

Hey, just to chime in: based on my testing, on Linux peers with versions equal to or above 0.34.0, the connection drops after about 5 minutes without recovering. In my case this is the setup:

Management Server : netbird-mgmt version 0.35.1 (docker),
Peer Server 1: 0.34.0 - 0.35.1
Peer Server 2: 0.34.0 - 0.35.1
Peer Server 3: 0.34.0 - 0.35.1

All servers run on static IPs and all three peers would be running the same version of the client. Peer Server 1 would drop the connection to other server peers roughly after 5 min from starting Netbird.

I have already tried adding a common "allow all UDP ports" rule, but it made no difference. So essentially, even if the management and peer servers all run the same 0.35.1, one server will always fail after some time, specifically Peer Server 1. Just to clarify, I've been running Netbird since version 0.27.0, and everything was working fine up until recently.

I can get the logs; however, as this is a live environment, doing so will cause downtime, so I will have to wait until a maintenance window, which will be some time in early January. In the meantime I will try to get at least the logs for Peer Server 1 so there is at least some data.

@hadleyrich

Interesting timing. I've been seeing some more link instability in the last few days. Since 0.35 maybe. Requiring a restart of some peers to reconnect. Sometimes they think they are connected but are not passing traffic.

I did have some stability issues back around pre-0.20 or so and required restarting clients. Then things have been quite stable for the last many months.

I know this is very vague and doesn't provide useful information in and of itself, but I just wanted to add my anecdotal experience that the current instability hasn't shown up in my environment for quite some time.

@hadleyrich

Logs from a peer at the time it dropped off:

2024-12-29T11:34:33+13:00 WARN client/internal/peer/guard/ice_monitor.go:55: Failed to check ICE changes: wait for gathering timed out
2024-12-29T11:39:33+13:00 WARN client/internal/peer/guard/ice_monitor.go:55: Failed to check ICE changes: wait for gathering timed out
2024-12-29T11:44:33+13:00 WARN client/internal/peer/guard/ice_monitor.go:55: Failed to check ICE changes: wait for gathering timed out
2024-12-29T11:44:43+13:00 INFO [peer: Hd69vyaKlgZUMwxWQoF5yHtIDF4krAVApegjkoQc52I=] client/internal/peer/worker_relay.go:61: Relay is not supported by remote peer
2024-12-29T11:44:44+13:00 INFO [peer: Hd69vyaKlgZUMwxWQoF5yHtIDF4krAVApegjkoQc52I=] client/internal/peer/worker_relay.go:61: Relay is not supported by remote peer
2024-12-29T11:44:47+13:00 INFO [peer: Hd69vyaKlgZUMwxWQoF5yHtIDF4krAVApegjkoQc52I=] client/internal/peer/worker_relay.go:61: Relay is not supported by remote peer
2024-12-29T11:44:49+13:00 INFO [peer: Hd69vyaKlgZUMwxWQoF5yHtIDF4krAVApegjkoQc52I=] client/internal/peer/worker_relay.go:61: Relay is not supported by remote peer
2024-12-29T11:44:54+13:00 INFO [peer: Hd69vyaKlgZUMwxWQoF5yHtIDF4krAVApegjkoQc52I=] client/internal/peer/worker_relay.go:61: Relay is not supported by remote peer
2024-12-29T11:45:06+13:00 INFO [peer: Hd69vyaKlgZUMwxWQoF5yHtIDF4krAVApegjkoQc52I=] client/internal/peer/worker_relay.go:61: Relay is not supported by remote peer
2024-12-29T11:49:33+13:00 WARN client/internal/peer/guard/ice_monitor.go:55: Failed to check ICE changes: wait for gathering timed out
2024-12-29T11:54:33+13:00 WARN client/internal/peer/guard/ice_monitor.go:55: Failed to check ICE changes: wait for gathering timed out
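
The ICE monitor warnings above appear to recur on a fixed cadence. A minimal sketch for checking that from a client log follows. It is not NetBird tooling: the function name `warning_intervals` and the regular expression are made up for illustration, and the line layout (ISO timestamp, level, message) is assumed from the excerpt above.

```python
import re
from datetime import datetime

# Assumed layout, taken from the excerpt above: "<ISO timestamp> <LEVEL> <message>"
LOG_LINE = re.compile(r"^(\S+)\s+(WARN|INFO|ERROR|DEBUG)\s+(.*)$")


def warning_intervals(lines, needle="wait for gathering timed out"):
    """Return the gaps, in seconds, between consecutive lines whose message contains needle."""
    stamps = []
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match and needle in match.group(3):
            # fromisoformat understands the "+13:00" offset used in these logs
            stamps.append(datetime.fromisoformat(match.group(1)))
    return [(later - earlier).total_seconds() for earlier, later in zip(stamps, stamps[1:])]


sample = [
    "2024-12-29T11:34:33+13:00 WARN client/internal/peer/guard/ice_monitor.go:55: Failed to check ICE changes: wait for gathering timed out",
    "2024-12-29T11:39:33+13:00 WARN client/internal/peer/guard/ice_monitor.go:55: Failed to check ICE changes: wait for gathering timed out",
    "2024-12-29T11:44:33+13:00 WARN client/internal/peer/guard/ice_monitor.go:55: Failed to check ICE changes: wait for gathering timed out",
    "2024-12-29T11:44:43+13:00 INFO [peer: Hd69vyaKlgZUMwxWQoF5yHtIDF4krAVApegjkoQc52I=] client/internal/peer/worker_relay.go:61: Relay is not supported by remote peer",
]

print(warning_intervals(sample))  # [300.0, 300.0]
```

On the lines above, the warning recurs exactly every 300 seconds, which lines up with the roughly 5 minute figure mentioned earlier in the thread. Whether the two are actually related is not established here.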

@rihards-simanovics

rihards-simanovics commented Dec 29, 2024

> Interesting timing. I've been seeing some more link instability in the last few days. Since 0.35 maybe. Requiring a restart of some peers to reconnect. Sometimes they think they are connected but are not passing traffic.

Hey @hadleyrich, I agree that's pretty much what I've been battling with for the past couple of weeks. I have a load balancer which uses the VPN to connect to various other VPS peers so that we can have a simple HTTP reverse proxy on port :80. As of 0.34.0, the load balancer drops the connection to the other VPS peers without retrying to connect, needing a manual restart of the Netbird client.

> I did have some stability issues back around pre-0.20 or so and required restarting clients. Then things have been quite stable for the last many months.

That's pretty much my experience. I joined at around version 0.27.0, I think. I fully converted from a traditional VPN by around 0.28.0, and things were relatively stable, so I stayed. That said, I think they need to have separate nightly and stable releases at this point; I agree with @bmansfie that, running this in production, I need stability before anything else. Yesterday I had a 2-hour downtime because the 0.29.4 client did something when I was applying the access policy and took down all external ports, which absolutely wrecked my DNS server and all DNS records for a good 4 hours; thankfully, nowadays, it only takes around 2 hours to re-propagate. That said, I'd like for that not to happen again...

> I know this is very vague and doesn't provide helpful information in itself, but I just wanted to add in my anecdotal experience that the current instability hasn't shown up in my environment for quite some time.

I wouldn't really call it "anecdotal". I have a monthly maintenance window during which I upgrade all of the packages on the OS, so when I do eventually upgrade, I may jump many minor and patch releases. Because things were more or less stable, I had no issues upgrading to the latest. Right now, all of my servers are sitting on a downgraded version of 0.33.0 as it seems to be the last stable release, at least for the previous 24, before it was 0.29.4. That said, after yesterday, I am fearful of all versions 😅.

@hadleyrich

I just noticed on a peer that had lost communication with another peer that "Last WireGuard handshake" was hours old and "Last connection update" was minutes so it certainly points to something at the WG level becoming out of sync.
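
As a rough illustration of that check, here is a sketch that flags a peer whose WireGuard handshake is much older than its last connection update. The function name `looks_out_of_sync` and the 5 minute threshold are assumptions chosen for illustration, not NetBird behavior, and the two timestamps are expected to be read by hand or by your own tooling from the status fields quoted above.

```python
from datetime import datetime, timedelta, timezone


def looks_out_of_sync(last_handshake, last_update, now,
                      max_handshake_age=timedelta(minutes=5)):
    """Return True when the handshake is stale but the connection state changed more recently.

    max_handshake_age is an arbitrary illustrative threshold; an idle link can
    legitimately show an old handshake, so treat a True result as a hint only.
    """
    handshake_age = now - last_handshake
    update_age = now - last_update
    return handshake_age > max_handshake_age and update_age < handshake_age


now = datetime(2024, 12, 30, 12, 0, tzinfo=timezone.utc)
stale = looks_out_of_sync(now - timedelta(hours=3), now - timedelta(minutes=5), now)
fresh = looks_out_of_sync(now - timedelta(minutes=1), now - timedelta(minutes=5), now)
print(stale, fresh)  # True False
```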

I think you're probably right; I probably saw stability issues reappearing around 0.34. I had become quite (probably overly) comfortable with the level of stability over the past months and have been happily tracking the latest releases. I don't yet run netbird in a production setting. It's more of a long-term stability test on my homelab "production" services before deploying to real customer-facing workloads.

@freebs65

Hmm.. it's funny, I have one machine that drops and it's a Windows Server 2022. I don't see other clients stop. A simple restart fixes it, but I have to do it every day. I have Linux clients and an older Windows SBS server, and all seem to be OK. I also have Windows 11 clients, which again seem fine. Even my Arch desktop is fine. Very odd.

@hadleyrich

Another data point. A long running ping in screen to keep traffic going over the link appears to keep the peer connected.
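
For anyone wanting to reproduce that workaround without leaving a screen session open, here is a minimal sketch of the same idea. It shells out to the system `ping` and assumes the Linux iputils flags `-c` and `-W`. The names `ping_once` and `keepalive`, the 10 second interval, and the address in the usage comment are all hypothetical. This only keeps traffic flowing; it does not address whatever causes the drop.

```python
import subprocess
import time


def ping_once(host, timeout_s=2):
    """Send a single ICMP echo using the system ping (Linux iputils flags assumed)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def keepalive(host, interval_s=10.0, rounds=None, probe=ping_once, sleep=time.sleep):
    """Probe host every interval_s seconds and return the number of failed probes.

    rounds=None loops forever; probe and sleep are injectable so the loop
    can be exercised without touching the network.
    """
    failures = 0
    done = 0
    while rounds is None or done < rounds:
        if not probe(host):
            failures += 1
        done += 1
        if rounds is None or done < rounds:
            sleep(interval_s)
    return failures


# Example (hypothetical overlay address of the remote peer):
# keepalive("100.64.0.2", interval_s=10)
```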

@rihards-simanovics

It seems like the issue is with the WireGuard handshake. For instance, my Windows 11 PC seemingly struggles to connect to other Linux Server Peers despite everything running the latest Netbird version, in this case, Netbird 0.35.2. One of my Load Balancer servers running Ubuntu 22.04 just refuses to keep the connection to other Linux servers for longer than 5 minutes before dying and needing to be restarted. I don't know what I'm doing wrong, but I always update the management server first and only then move on to the client nodes, first on the Linux servers and then on devices such as PCs/Laptops/Phones.

@rihards-simanovics

Hi Everyone, happy New Year!

Hey @mlsmaycon, sorry to ping you directly. Would you like me to run the same steps as listed last time? I will email the logs so you have a better picture. I am approaching a maintenance window for all our org servers and will be able to run a full debugging trace like last time. Also, I need to know if the logging persists across client updates or whether I need to run it first on the old version and then after the upgrade.

@hadleyrich

I think (in my case at least) this appears to be something triggered by, or related to, relaying.

Previously I was not running the relay in my set up and only running coturn. The peer I was having most trouble with was connecting over relay.

Adding in the new relay service appears to have made that peer more stable for the last 12 hours or so.

@rihards-simanovics

> Previously I was not running the relay in my set up and only running coturn. The peer I was having most trouble with was connecting over relay.

Hmm, interesting. In my case, I am already running a new relay service. Strangely, some client versions seem to overuse the relay, and some underuse it; since 0.35.0, the client seems to bypass it altogether and go straight for P2P.

Okay, you know what? It's late at night here in the UK, so let me try upgrading and getting at least some logs.

@rihards-simanovics

rihards-simanovics commented Jan 2, 2025

Ok, without looking at the trace logs generated by the client, my anecdotal research log shows this:

4:54 UK7 client upgraded from 0.33.0 to 0.35.2
4:58 UK7 sites show as status 503 down on UK1 - which is still using 0.33.0 client
5:00 UK7 client downgraded back to 0.33.0 - and I am waiting for all sites to recover, which takes around 5 seconds
5:04 UK1 client upgraded to 0.35.2 from 0.33.0
5:08 UK3 websites are shown as down despite only the UK1 client being up to date.
.. some time here I downgraded the UK1 client to 0.33.0
5:19 UK1 client again upgraded to 0.35.2 from 0.33.0
5:23 UK2,3,5 sites went down - they use the 0.33.0 client; UK7, however, is still up.
5:25 UK1 client restarted. Sites are going back up
5:37 UK1 client downgraded back to 0.33.0 things are back to normal.
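
Reading the timeline above as upgrade/outage pairs, the delay between each upgrade and the first reported outage can be computed with a small sketch. The helper name `minutes_between` is made up for illustration, and the pairing of entries is an interpretation of the log, not something the trace confirms.

```python
from datetime import datetime


def minutes_between(start, end, fmt="%H:%M"):
    """Whole minutes from start to end, both given as same-day H:MM strings."""
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# (upgrade time, first outage noted), taken from the timeline above
pairs = [("4:54", "4:58"), ("5:04", "5:08"), ("5:19", "5:23")]
delays = [minutes_between(upgrade, outage) for upgrade, outage in pairs]
print(delays)  # [4, 4, 4]
```

All three upgrades were followed by an outage about 4 minutes later, which is in the same range as the "roughly 5 min" drop described earlier in the thread.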

@mlsmaycon I've collected a full trace from UK1 and UK7 using the method listed in #3112 (comment) and am now parsing it to see if there is anything obvious. I will send it to the support email once I've reviewed everything.
