tunnel broker fails intermittently preventing nodes from connecting to the internet #8
@paidforby thanks for reporting this. This is consistent with what I have seen during experimental mesh setups using n600/n750 over the last few weeks.
The issue was actually caused by the exit node being in a 'weird' state? Can we document the solution and add it to the operator's manual?
Yes! It appears that the issue was caused by some sort of VPN issue on the exit node. On Monday night, as I was debugging a home node mesh setup, I was able to mesh with local sudoroom nodes, but wasn't able to find a route to the internet. However, after @Juul applied some magic (a VPN service restart on the exit node?), my test nodes were able to ping public IP addresses. Also, an apparently related issue was resolved after this assumed reset (see https://sudoroom.org/pipermail/mesh/2017-December/002719.html). Ideally we'd reproduce and then fix the issue, and document it in the operator's manual like you suggested.
I may have produced the issue as I was building https://github.com/sudomesh/tunneldigger-lab . Steps to reproduce:
Expected -
Actual -
@Juul - can you confirm that the broker on the exit node is no longer allowing tunnels to be dug?
I am seeing this same behavior from tunneldigger-lab. Also, I have some newly flashed home nodes that are not getting to the internet. Has the exit node been restarted since @jhpoelen reproduced the original issue?
@paidforby thanks for reproducing ... it seems like we may be on to something. Perhaps the next step would be to set up a test exit node to reproduce the issue. Not sure about the tunneldigger broker restart on the exit node. Perhaps @Juul knows, he's operating the broker. This does strike me as something we should address sooner rather than later.
After @Juul restarted the node, I was able to dig tunnels again using https://github.com/sudomesh/tunneldigger-lab#digging-a-tunnel . This suggests that the tunnel broker hangs at some point after a number of tunnel open/close cycles.
Another outage of the exit node today. Discovered while following the service guide and attempting to ping a locally hosted service, as in this use case.
To me, tunneldigger appears broken (at least the way we're using it) and ill-fit for "virtual meshing" to begin with (it's a star topology, right? not very meshy). I am interested in exploring alternatives. Any suggestions?
Re: point 1: I can't answer with as much authority as @Juul (who'll probably weigh in later today), but it was my understanding we were switching to Hurricane Electric (HE) for the exit node and setting up another one at sudo room for backup purposes. Last I remember, an update on the server at HE was in December; @jtremback and @Juul were coordinating setting it up. I believe there were 2 gigabit switches that we were going to set up on our rack at sudo room and in the HE cabinet. There's also this guide to setting up an exit node that we should probably update: https://sudoroom.org/wiki/Mesh/Exit_setup

Re: point 2: I believe @papazoga was working on a tunneldigger alternative here: https://github.com/sudomesh/foutun but I'm not aware of its status.

Side note: Juul has been paying for the exit node and the dev droplet at ~$150/month out of pocket for about a year and a half now. Might be time to switch that cost over to sudomesh.
Hey there. I think I am seeing the same bug on a home node that I recently flashed. I've connected the home node to my home router and am able to access the internet via the private SSID. I see two public SSIDs. When I restart tunneldigger with
Some days ago, tunneldigger issues became more pronounced, with all (?) nodes getting disconnected. @Juul shared the following server config -
Something changed that broke things and I have no idea what. Possibly the kernel. I tried switching the server to the new version of the tunneldigger server (tunneldigger broker) and, after fixing a small bug, it worked, but only with the new version of the tunneldigger client. With the new server and old client, the server can send to the client over the tunnel but the client cannot send to the server. With the old server, even getting a tunnel fails most, but not all, of the time. Even when it succeeds, no traffic can pass through it in either direction. This could have been simpler to troubleshoot if the server (written in python) was just creating these tunnels with calls to the
I reverted the kernel back to
Am making slow progress on reproducing the broker outage (with help of Benny!). I've gotten to a point where I can provision a 1GB/1vCPU digital ocean droplet resulting in a running babeld and tunneldigger daemon on ubuntu 16.04 (see https://github.com/jhpoelen/exitnode ). When I attempted doing this on the latest debian (9.3 x64) offered through digital ocean, tunneldigger crashed with a segmentation fault. Am now blocked by my inability to configure a home node to connect through my hotspot to test a newly minted exitnode on digital ocean, despite the somewhat detailed instructions at https://gist.github.com/957855bb5841100109eaeb90e8c6b01b . Hoping to work with others with functioning home nodes unless I can figure out a way to set up my node.
Using the "new" tunneldigger setup described in https://github.com/sudomesh/tunneldigger-lab , I was able to start a udp tunnel (at least as far as I understand). On the client, I used https://github.com/sudomesh/tunneldigger-lab#digging-a-tunnel to start a tunnel using the droplet ip. On the server, after starting the client, an l2tp* interface is created
and udp packets are traveling back and forth, as monitored on the client
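(For anyone following along, a minimal sketch of the kind of checks meant here; the interface name, WAN interface, and UDP port below are assumptions, not values taken from this setup:)

```sh
# List the L2TP-style interfaces the tunneldigger client/broker created
ip -d link show | grep -A 2 'l2tp'

# Watch the encapsulating UDP packets on the physical interface
# (eth0 and port 8942 are assumptions; adjust to the real WAN side / broker port)
sudo tcpdump -n -i eth0 'udp and port 8942'
```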
@jhpoelen In what you pasted, it looks like that l2tp interface is down though, no?
Note, however, that when connecting the same client to the (misbehaving) exit node at exit.sudomesh.org, bidirectional UDP traffic is also seen using tcpdump on the client.
@bennlich yep! Perhaps related to babeld issues; trying to figure out whether tunneldigger actually writes its logs as configured.
With some patches in tunneldigger and exitnode babeld config, I was able to get the interface to become active with a straight run of create_exitnode in https://github.com/jhpoelen/exitnode :
I ran  I guess the next step is to figure out how the routing is supposed to work on the exit node. Also, babeld seems to be doing something. Note that there's no babeld running on the client.
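A few quick sanity checks one could run on such a freshly created exit node (process names and the broker match pattern are guesses, not taken from create_exitnode itself):

```sh
# Is babeld actually running, and with which arguments/config?
pgrep -a babeld

# Is the tunneldigger broker process up? (matching on "broker" is an assumption)
pgrep -a -f broker

# Which routes exist before and after a client digs a tunnel?
ip route show
```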
@jtremback @Juul I'm also trying to debug the broker (slowly but surely?). I'm trying to wrap my head around which broker services are responsible for routing packets to the internet. I think my naive assumption is that packets arriving on a tunnel interface with a non-local destination IP should get sent through the default gateway because of the default system routing rule:
This doesn't seem to be happening though. When I
and I do not see them end up on the eth0 interface. Off the top of your head, do you know what services are responsible for this part of the routing? From tunnel interface -> eth0 via default gateway (and back)?
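For reference, a rough sketch of where one might look for that tunnel-to-eth0 path on the exit node; the table and interface names are assumptions:

```sh
# Policy routing: is there a rule steering tunnel traffic into a non-main table?
ip rule show

# Routes installed per table
ip route show table main
# ip route show table public   # only if a "public" table is defined in rt_tables

# Is anything NATing/masquerading or filtering tunnel traffic out of eth0?
sudo iptables -t nat -L POSTROUTING -n -v
sudo iptables -L FORWARD -n -v
```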
I second @bennlich's question. Am stuck on reverse-engineering how the exitnode routes requests from tunnels to the WAN.
Was able to ping 8.8.8.8 through a tunneled and babeld-routed client using a static route via the exitnode running on a droplet. See https://github.com/jhpoelen/exitnode/blob/master/README.md#testing-routing-with-babeld-through-tunnel-digger . @yardenac @Juul I noticed that babeld is sensitive to the default route having the static protocol. Had to change
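(Roughly, the kind of static test route meant above; the interface name and gateway address below are hypothetical placeholders, not the real values:)

```sh
# On the client: send a single test destination through the tunnel,
# using the exit node's address on its end of the tunnel as gateway
sudo ip route add 8.8.8.8/32 via 100.64.0.42 dev l2tp0

# If routing and forwarding on the exit node are sane, replies come back
ping -c 3 8.8.8.8

# Remove the test route afterwards
sudo ip route del 8.8.8.8/32
```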
ARP packets are layer 2. They are not routed. The fact that @bennlich is seeing ARP packets on the tunnelbroker's end of the tunnel asking for the MAC address for 8.8.8.8 tells me that the tunneldigger client does not have a sane default route set. The tunneldigger client should have assigned an IP to its end of the tunnel and set that IP as its default route. The tunneldigger broker should have set that same IP on its end of the tunnel and should of course also have a default route leading to the internet and IPv4 forwarding enabled.
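A minimal sketch of how one might verify that state on both ends (the interface names and the 100.64.0.1 gateway are placeholders, not taken from the actual nodes):

```sh
# On the broker (exit node): IPv4 forwarding on, default route out the WAN
sysctl net.ipv4.ip_forward          # should show net.ipv4.ip_forward = 1
ip route show default               # should point at eth0 / the upstream gateway

# On the client: an address on the tunnel and a default route through it
ip addr show dev l2tp0
ip route show default
# e.g. (hypothetical): sudo ip route replace default via 100.64.0.1 dev l2tp0
```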
@Juul this is consistent with my earlier message #8 (comment), where I manually installed a route to 8.8.8.8 via the tunnel on my laptop. The exitnode setup created with https://github.com/jhpoelen/exitnode looks promising and I am hoping that someone can help test this with an actual home node. @Juul @yardenac can you please share the default route configuration for the current exit node?
@yardenac thanks for sharing - I am hoping to do some more testing with various default routes on the exit node. Meanwhile, at the BYOI office hours today, we did get a node running and connecting through the "big" internet using a digital ocean droplet configured using the automated "create_exitnode.sh" script in https://github.com/jhpoelen/exitnode . A home node (aka "goat") was configured using instructions at https://peoplesopen.net/walkthrough and https://github.com/jhpoelen/exitnode/#configure-home-node-to-use-exit-node and plugged into sudoroom ethernet. For some reason, I had to manually configure the default route to the exitnode on the home node (aka "goat") in the "public" routing table using  Further investigation is needed into why babeld doesn't install the default routes in the home node's "public" routing table by itself.
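For context, manually adding a default route to a named policy-routing table looks roughly like this; the "public" table name comes from the comment above, the gateway address is a placeholder, and the table must already be declared in /etc/iproute2/rt_tables:

```sh
# On the home node: install a default route in the "public" table by hand
# (100.64.0.42 is a hypothetical exit-node address)
ip route add default via 100.64.0.42 table public

# Inspect what babeld did (or did not) install there
ip route show table public
ip rule show
```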
I just rebooted the main exit node (again) after the number of gateways dropped to 13. I've attached screenshots of the pre- and post-reboot status of https://peoplesopen.net/monitor . I noticed no unusual error messages in the recently re-enabled logs, so am still unclear about the root cause of this transient behavior.
@jhpoelen did you happen to notice what id # the exit node tunnel interfaces were up to? or whether your home node was able to reconnect? If you notice this again, feel free to ping me and maybe we'll notice something about the state of the exit node by poking around together. Or we'll notice some info that is missing from the logs that would be good to add.
tunnel ids were ~481, so interface ids were something like l2tp4811.
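(A quick way to check those numbers on the exit node next time; just a sketch:)

```sh
# How many l2tp* tunnel interfaces exist, and what are the highest ids?
ip -o link show | grep -c 'l2tp'
ip -o link show | grep -o 'l2tp[0-9]*' | sort -V | tail -n 5
```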
Have you people upgraded to the latest Tunneldigger or are you still debugging an old version here?
@mitar We people are still debugging an old version :-)
I was thinking to upgrade with a reverse patch to set session ids to 1. This would keep the old clients working OK while still getting any other stability improvements.
@mitar I have not yet looked at bugs reported in the parent tunneldigger repo. Does "clients dropping off over time and being unable to reconnect until the broker is rebooted" ring any bells?
@mitar what commit would be good to work off of?
I think the latest master is probably the best. I would suggest you run two Tunneldigger instances in parallel, an old version and a new version. I would advise against patching with a reverse patch.
Not really. We had both new and old versions running for months.
Upgrading would be nice, especially if we people knew that it would solve the root cause. Right now, we don't know what the root cause is.
So this is why you can run it in parallel and see if it happens with the new version. But I agree, understanding and learning is important as well. But so is running a working network. There is always a trade-off. :-(
Right now, if we ran the new version, it would only be able to accept a single connection due to the session id issue, and this is after we patch all the clients to include a new list of exit nodes. And yes, running a network is nice: that is why upgrading with a network full of old clients is not really feasible. This is why I suggested the reverse patch for the time being.
Just trying to figure out how to resolve this with the limited resources and access we have.
Ehm. Not sure if I agree with this. Maybe I am misunderstanding something. You create a new VM with a new kernel, you install a new Tunneldigger there, and then you start adding this IP to whoever installs a new node or upgrades an old one. There is no need for a "patch all the clients" moment. You just go slowly as things happen organically. But having two (or more) Tunneldiggers is nice anyway, because if one hangs like this one now, routes would go through the others. So, exactly because the old clients do not know about the new VPN server, they will not try to connect to it and have an issue with one session ID. And the new clients can then connect to the new VPN server and use it. Or are you trying to say that the updated clients will not be able to connect to both the new and old server at the same time, because they are incompatible? I am not sure about this, but maybe you can run two versions of the client code on the node at the same time.
I mean, if people are OK running with the old kernel version, then I would guess we could even make this a command-line switch in the main codebase: do you want unique session IDs or only one session ID?
@mitar having a gradual transition makes sense. I guess I am just hung up on rescuing the existing nodes from the transient connection issues we have now. I agree that running two (or more) brokers is good practice anyway. As far as running the old kernel version - I don't mind giving this one-session parameter as a command-line switch a go, because it would allow us to test whether the "old" tunneldigger is in fact responsible for the root cause. Thanks for being patient.
I've rebuilt the firmware with sudomesh/nodewatcher-firmware-packages@d4e3b9f , flashed a home node with it, set up a recent version of tunneldigger on a digital ocean droplet and . . . am now using the connection to write this comment.
I think the bug that started this issue is resolved as of (#8 (comment)), and we can safely close this issue and open a new issue for the new bug that we've been monitoring the last few weeks. The new bug is: routes from connected nodes eventually disappear from the exit node routing table (see all comments from #8 (comment) onward in this issue). I did some poking around the exit node tonight, and I think the new bug we've been tracking is somewhere in babeld. My home node was successfully able to dig tunnels to the exit node, and a tcpdump on its tunnel interface showed healthy babel-ing from my home node to the exit node:
Unfortunately, the exit node never babelled back. I.e. my node was babeling into the existential abyss :-( :-( :-( I was able to fix the bug by restarting first babeld ( I noticed that there is exactly one node that seems to tear down and recreate its tunnel every 5 minutes. My current guess is: after enough
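For the record, a sketch of how that one-way babeling can be observed (babel speaks UDP on port 6696); the interface and service names below are assumptions about this particular exit node:

```sh
# On the exit node: healthy peering means seeing babel packets (UDP 6696)
# in BOTH directions on the tunnel interface
# (l2tp4811 is an example interface name taken from an earlier comment)
sudo tcpdump -n -i l2tp4811 'udp port 6696'

# The workaround above started with restarting babeld; the exact service
# name and the rest of the restart sequence are elided in the comment
sudo systemctl restart babeld
```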
@bennlich thanks for your careful debugging and sharing of your observations. This is consistent with what we've been seeing. I feel you are honing in on the original issue (before the system upgrade that prevented more than one tunnel from being created due to the fixing of an l2tp bug). However, opening a new issue might be nice to start with a fresh and focused thread.
@jhpoelen ah okay, I thought the system upgrade was the original issue. There was an issue before that? In any case, I think starting a new issue could be good, as this one has a bunch of info / observations that are not related to the babelling problem (e.g. the title is about tunneldigger). I'll open a new issue. Am no longer certain this one is resolved -- will let someone else determine that.
I agree that the title is misleading. The system upgrade issue appeared after this issue was originally created. Please take action as you see fit.
To summarize: As far as I know, two issues related to exit node connectivity were discussed in this issue:
With this, I am closing this issue. Thanks to all who chimed in! If I missed anything, please feel free to comment / re-open.
@jhpoelen Thanks for all the work on this bug and great job summarizing our solution (I had gotten a little lost back around the 16th comment)
Mesh home nodes (mynet N600 or mynet N750) do not seem to be meshing with one another over their ad-hoc interface regardless of sudowrt firmware build (april build, new build) or version of makenode used (old commit, newest commit). Last I recall being able to mesh was in mid-September when working on battery-powered sneaker nodes.
To reproduce:
or
Expected result:
an internet connection
Actual result:
no internet connection
or
where 100.65.7.129 is the IP of the home node
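(Roughly, the check that produces the actual result above, run from a connected computer; the addresses are those mentioned in this report, and the exact output is omitted:)

```sh
# From a laptop associated with the node's public SSID
ping -c 3 8.8.8.8           # expected: replies; actual: no replies / unreachable
ping -c 3 100.65.7.129      # the home node itself (IP from the report above)
```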
Other observations:
The pplsopen.net-node2node SSID is visible from the computer's WiFi list.