Regression: Frequent disconnects with version 0.30.2 #2765

Open
christian-schlichtherle opened this issue Oct 21, 2024 · 12 comments

Comments

@christian-schlichtherle

Describe the problem

We are running an IoT project where some Linux-based K3s nodes on the edge are located at customer premises and communicate with other Linux-based K3s nodes in the cloud. This project has been running for almost three years now. Previously, we connected and managed all nodes via OpenVPN (so we can SSH into every node, even when it is at customer premises) and then installed K3s on each node. A few months ago we replaced OpenVPN with NetBird because of its many advantages, such as performance and a peer-to-peer topology with central management.

Ever since, we have been applying updates as soon as possible. We started at 0.27.10 and now we are (or were) at 0.30.2. Unfortunately, somewhere between version 0.28.4 and 0.30.2 we started to observe frequent network partitions (disconnects). They happen randomly after some hours, usually several times a day and at least once per day, following no particular pattern. I checked many potential causes, including the IP address changes that happen to the CPE every night (according to the Internet provider's plan), but none of these was the root cause.

Recently I decided to downgrade the network from version 0.30.2 to version 0.28.4, and since then we haven't had a single network partition / disconnect.

To Reproduce

Set up a bunch of nodes and run them 24/7. If you set up the nodes using Ansible, you can discover network partitions like this:

ansible netbird_client -m shell -a 'netbird status --filter-by-status disconnected | grep netbird.cloud | grep -v FQDN || true'

If there is no network partition, then each node produces an empty output, otherwise it lists the nodes it cannot connect to.
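To find out when partitions start and end, the same filter can be run periodically on each node and its output timestamped. The following is only a sketch under assumptions: the function names and the log path in the usage comment are hypothetical, and the filter is the grep pipeline from the command above, reading `netbird status` output from stdin.

```shell
# Same filter as the Ansible one-liner above: keep lines naming a
# netbird.cloud peer, drop the node's own FQDN line, never fail on no match.
filter_disconnected() {
    grep 'netbird.cloud' | grep -v 'FQDN' || true
}

# Prefix every unreachable peer with a UTC timestamp.
# Prints nothing when the filter finds no disconnected peers.
log_partitions() {
    filter_disconnected | while IFS= read -r line; do
        printf '%s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$line"
    done
}

# Usage on a node, e.g. from cron (the log path is a hypothetical example):
#   netbird status --filter-by-status disconnected | log_partitions >> /var/log/netbird-partitions.log
```

Comparing these timestamps across nodes might help correlate disconnects with events such as the nightly IP address change.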

Expected behavior

These Linux nodes should stay connected 24/7, real Internet outages aside.

Are you using NetBird Cloud?

Yes

NetBird version

see above

NetBird status -dA output:

n/a

Do you face any (non-mobile) client issues?

n/a

Screenshots

n/a

Additional context

n/a

@wiiun

wiiun commented Oct 22, 2024

I also encountered the same situation

@mgarces

mgarces commented Oct 22, 2024

Hello @christian-schlichtherle, thank you for opening this issue and for the commands you provided; we will investigate.
It would be helpful to have debug logs from those clients. Is it possible to turn them on and let the clients run for a long period of time? You can achieve this by following these docs.
Thanks

@mgarces

mgarces commented Oct 22, 2024

Hi again! We've found a bug in our reconnection logic, and we are working on improvements for it. We currently have a pull request in progress; would you be willing to test it out before we release it?

@christian-schlichtherle
Author

I could install an update on our development cluster for some limited time over the weekend. Since we are using Ansible, how would I go about installing a pre-release?

@DutchCloud4Work

DutchCloud4Work commented Oct 23, 2024

I'm also experiencing problems: whenever a workstation (mostly Windows) goes into sleep mode (because of lunch or something), it never reconnects. The only fix is a reboot of the entire system.
This started after the upgrade from 0.29 to 0.30.

@christian-schlichtherle
Author

@DutchCloud4Work you can try netbird service restart to fix this issue. A reboot should not be required. I'm on macOS however, so your situation may be different.

@DutchCloud4Work

DutchCloud4Work commented Oct 24, 2024

@christian-schlichtherle

That doesn't work. Even after restarting the service in Windows and restarting the GUI, I still have no connectivity (the GUI says it's connected, but no traffic passes).

@ngtrthanh

I have nearly the same problem with disconnects from some peers.
Host: Ubuntu 24.02.
Netbird version: 0.30.2
The netbird status report disagrees with the Dashboard:
the Dashboard shows 8 of 13 accessible peers, but the CLI shows 5/13.
A down/up cycle solves the problem, but it came back after 12h.
I would like to test the new version or roll back to the previous one.

@mgarces

mgarces commented Oct 24, 2024

hi, v0.30.3 is now live! While we have conducted thorough testing, if you encounter any unusual behaviour, it might be related to this change. Your feedback will be invaluable in ensuring the stability and performance of this update. Please report back if your connectivity issues are resolved or if you encounter any other hurdle.

@christian-schlichtherle
Author

Will give it a try on our development cluster over the weekend - thank you so much!

@christian-schlichtherle
Author

Actually, I installed it just now. Here are my findings:

Remote installation with apt install netbird=0.30.3 on the cloud nodes in the dev cluster went well.
Installation on the single edge node in the dev cluster using the same command hung. When I SSH into the node using the LAN port and run netbird status I get:

$ netbird status
Error: failed to connect to daemon error: context deadline exceeded
If the daemon is not running please run: 
netbird service install 
netbird service start

So, apparently the service could not get restarted on upgrading from 0.28.4 to 0.30.3. I'm glad this is only a single node in the dev cluster which I can power cycle manually. For the prod cluster, this incident would be a disaster as I would have to call the customers and ask everyone to reboot the edge nodes manually.

Some more diagnostic output:

# systemctl status netbird
● netbird.service - A WireGuard-based mesh network that connects your devices into a single private network.
     Loaded: loaded (/etc/systemd/system/netbird.service; enabled; preset: enabled)
     Active: activating (auto-restart) (Result: exit-code) since Thu 2024-10-24 19:35:30 UTC; 28s ago
    Process: 266290 ExecStart=/usr/bin/netbird service run --config /etc/netbird/config.json --log-level info --daemon-addr unix:///var/run/netbird.sock --log-file /var/log/netbird/client.log (code=exited, status=2)
   Main PID: 266290 (code=exited, status=2)
        CPU: 281ms

From journalctl:

Oct 24 19:35:25 de-nw-45134-cs-d0 systemd[1]: Started netbird.service - A WireGuard-based mesh network that connects your devices into a single private network..
░░ Subject: A start job for unit netbird.service has finished successfully
░░ Defined-By: systemd
░░ Support: http://www.ubuntu.com/support
░░ 
░░ A start job for unit netbird.service has finished successfully.
░░ 
░░ The job identifier is 28342.
Oct 24 19:35:30 de-nw-45134-cs-d0 systemd[1]: netbird.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
░░ Subject: Unit process exited
░░ Defined-By: systemd
░░ Support: http://www.ubuntu.com/support
░░ 
░░ An ExecStart= process belonging to unit netbird.service has exited.
░░ 
░░ The process' exit code is 'exited' and its exit status is 2.
Oct 24 19:35:30 de-nw-45134-cs-d0 systemd[1]: netbird.service: Failed with result 'exit-code'.
░░ Subject: Unit failed
░░ Defined-By: systemd
░░ Support: http://www.ubuntu.com/support
░░ 
░░ The unit netbird.service has entered the 'failed' state with result 'exit-code'.

Obviously it complains about an invalid argument, but I haven't configured anything different on this node than any other. Maybe this error message is a false positive?
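A note on that label: systemd attaches the name INVALIDARGUMENT to any exit status 2 (following the LSB convention), so by itself it does not show that netbird rejected an argument. Go programs also exit with status 2 after an unrecovered panic, which could be what happened here, though nothing in the output above confirms it. The client log named in the ExecStart line is the more likely place to find the real cause. A minimal sketch for pulling its tail, where the function name is hypothetical and the default path is taken from the unit shown above:

```shell
# Print the last lines of the NetBird client log.
# $1 = log file (default: the --log-file path from the ExecStart line above)
# $2 = number of lines (default: 50)
show_client_log_tail() {
    nb_log_file="${1:-/var/log/netbird/client.log}"
    nb_lines="${2:-50}"
    if [ -r "$nb_log_file" ]; then
        tail -n "$nb_lines" "$nb_log_file"
    else
        echo "cannot read $nb_log_file" >&2
        return 1
    fi
}

# Usage on the affected node:
#   show_client_log_tail
#   journalctl -u netbird.service -n 50 --no-pager
```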

Finally, I was doing a reboot of the node and the problem disappeared.

Now I will start to monitor the stability.
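For nodes that are reachable only through NetBird, the failed restart described above suggests guarding the upgrade with a health check and a fallback. The sketch below rests on several assumptions: the function names are hypothetical, `netbird status` is assumed to exit non-zero when the daemon is unreachable, and it is unknown whether a downgrade would have recovered this particular node (a reboot did).

```shell
# Retry a health-check command until it succeeds or attempts run out.
# $1 = number of attempts, $2 = seconds between attempts, rest = command.
wait_until_ok() {
    attempts="$1"
    delay="$2"
    shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}

# Install the new version; if the daemon stays unreachable, reinstall the
# old one and check again. $1 = new version, $2 = version to fall back to.
guarded_upgrade() {
    new_version="$1"
    old_version="$2"
    apt-get install -y "netbird=$new_version"
    if ! wait_until_ok 6 10 netbird status; then
        apt-get install -y --allow-downgrades "netbird=$old_version"
        wait_until_ok 6 10 netbird status
    fi
}

# Usage (versions from this thread): guarded_upgrade 0.30.3 0.28.4
```

Because the SSH session itself may travel over NetBird, the script would need to run detached on the node (for example with nohup) so that it keeps going if the connection drops during the upgrade.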

@christian-schlichtherle
Author

PS: I noticed that the troubled edge node has changed its DNS name to my-name-1.netbird.cloud (note the additional -1). I'm not sure when that happened or how it is related, if at all.
