Regression: Frequent disconnects with version 0.30.2 #2765
Comments
I also encountered the same situation.
Hello @christian-schlichtherle, thank you for the issue and for the commands you provided; we will investigate this.
Hi again! We've found a bug in our reconnection logic and are working on improvements for it. We currently have a pull request open; would you be willing to test it before we release it?
I could install an update on our development cluster for some limited time over the weekend. Since we are using Ansible, how would I go about installing a pre-release?
I'm also experiencing problems: whenever a workstation (mostly Windows) goes into sleep mode (during lunch, for example), it never reconnects; the only fix is a reboot of the entire system.
@DutchCloud4Work you can try
Doesn't work. Even after restarting the service in Windows and restarting the GUI, I still have no connectivity (the GUI says it's connected, but no traffic passes).
I have nearly the same problem with disconnects from some peers.
hi,
Will give it a try on our development cluster over the weekend - thank you so much!
Actually, I installed it just now. Here are my findings: Remote installation with
So, apparently the service could not be restarted when upgrading from 0.28.4 to 0.30.3. I'm glad this is only a single node in the dev cluster, which I can power cycle manually. For the prod cluster, this incident would be a disaster, as I would have to call the customers and ask everyone to reboot the edge nodes manually. Some more diagnostic output:
From journalctl:
Obviously it complains about an invalid argument, but I haven't configured anything differently on this node than on any other. Maybe this error message is a false positive? Finally, I rebooted the node and the problem disappeared. Now I will start monitoring the stability.
PS: I noticed that the troubled edge node has changed its DNS name to
Describe the problem
We are running an IoT project where some Linux-based K3s nodes on the edge are located at customer premises and communicate with other Linux-based K3s nodes in the cloud. This project has been running for almost three years now. Previously, we connected and managed all nodes via OpenVPN (so we could SSH into every node, even when it is connected at customer premises) and then installed K3s on each node. A few months ago we replaced OpenVPN with NetBird because of its many advantages, such as performance and a peer-to-peer topology with central management.
Ever since, we have been applying updates as soon as possible. We started at 0.27.10 and now we are (or were) at 0.30.2. Unfortunately, somewhere between version 0.28.4 and 0.30.2 we started to observe frequent network partitions (disconnects). They happen randomly after some hours, mostly several times a day and at least once per day, following no particular pattern. I checked many potential causes, including the IP address changes that happen to the CPE every night (according to the Internet provider's plan), but none of these was the root cause.
Recently I decided to downgrade the network from version 0.30.2 to version 0.28.4, and since then we haven't had a single network partition / disconnect.
To Reproduce
Set up a bunch of nodes and run them 24/7. If you set up the nodes using Ansible, you can discover network partitions like this:
If there is no network partition, then each node produces an empty output, otherwise it lists the nodes it cannot connect to.
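For readers who want to try a similar check, here is a minimal Python sketch of the same idea. It is not the Ansible command from this report; it only assumes that each node can run a script that probes every other peer once and prints the ones that do not answer. The function names (`ping_once`, `unreachable_peers`) and the use of the Linux `ping` flags `-c` and `-W` are choices made for this illustration.

```python
import subprocess
import sys
from typing import Callable, Iterable, List


def ping_once(host: str, timeout_s: int = 2) -> bool:
    """Send one ICMP echo request to `host` using the Linux `ping` tool.

    Assumption: iputils-style flags (-c count, -W timeout in seconds).
    """
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def unreachable_peers(
    peers: Iterable[str],
    probe: Callable[[str], bool] = ping_once,
) -> List[str]:
    """Return the peers for which `probe` fails, preserving input order.

    An empty list means this node sees no partition.
    """
    return [peer for peer in peers if not probe(peer)]


if __name__ == "__main__":
    # Peer names are passed on the command line; prints nothing if all answer.
    for peer in unreachable_peers(sys.argv[1:]):
        print(peer)
```

Run on every node with the list of all other peers as arguments (for example through an Ansible ad-hoc command), this prints nothing on a healthy node and one line per unreachable peer otherwise.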
Expected behavior
These Linux nodes should stay connected 24/7, real Internet outages aside.
Are you using NetBird Cloud?
Yes
NetBird version
see above
NetBird status -dA output:
n/a
Do you face any (non-mobile) client issues?
n/a
Screenshots
n/a
Additional context
n/a