babeld segmentation fault on exit node causes lost routes #31
That's cool that the babeld process gets restarted successfully. This means that your change to the systemd service file worked! My first guess is: after babeld restarts on the exit node, it is unaware of any pre-existing tunnels, so it won't respond to any babeling on any existing tunnel interfaces. We can test this hypothesis by restarting …
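One hedged way to check that hypothesis is to ask the running babeld which interfaces it is actually tracking, assuming it was started with a local status socket (the `-g` option); the port number below is only an example, not the exit node's real configuration.

```
nc ::1 33123
dump
# Older babeld versions dump their state as soon as you connect; newer ones
# wait for the `dump` command. Either way, look for `add interface ...` lines:
# if the tunneldigger l2tp interfaces are missing, the restarted babeld was
# never told about the pre-existing tunnels.
```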
Tested your suggestion and it does not appear to work. Additionally, I have been flashing new nodes, creating new tunnels, and turning nodes off and on all day, and none of them have been able to babel with the exit node. One question this brings to mind: why are we still using our fork of babeld (https://github.com/sudomesh/babeld)? Is it just for the convenience of …? Just wondering if our babeling difficulties were perhaps fixed somewhere in the 186 commits and 4 releases since we forked ours. Also, if our changes don't conflict with the normal operation of babeld, why don't we make a pull request to the main repo?
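Not an answer to the fork question, but one quick way to see how far the fork has drifted from upstream might be something like this (the upstream remote is assumed to be jech/babeld):

```
git clone https://github.com/sudomesh/babeld && cd babeld
git remote add upstream https://github.com/jech/babeld
git fetch upstream
git log --oneline HEAD..upstream/master | wc -l   # how many commits behind we are
git diff --stat HEAD...upstream/master            # which files those commits touch
```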
K, dang.
Yes, we are still using our fork. The important piece of our fork is the … I think these changes were pretty hefty, but I haven't looked at them closely; here's the diff: sudomesh/babeld@sudomesh:acc7bde48f504609a4b67ed205798f3c4c07b057...master.
@paidforby Are the home nodes you've been testing with using the old tunneldigger or the new tunneldigger? (Only the new one will connect reliably to HE.)
The restart script related to #21 automatically adds tunneldigger interfaces to a newly restarted babeld process, and that seems to be working quite well as far as I can tell. However, when the babeld process crashes and is restarted by systemd, this re-adding of tunneldigger interfaces does not occur: https://github.com/sudomesh/exitnode/blob/dad8b658b07f42c2146eb8893fdd571b4149f2d0/src/opt/babeld-monitor/babeld-monitor.sh#L43. Do you have a specific home node on which this issue presents itself?
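For illustration only (this is not the actual babeld-monitor.sh logic, just a sketch of the idea; the interface naming and babeld flags are assumptions), the re-adding could look roughly like this:

```
# Collect the tunneldigger l2tp interfaces that already exist and hand them
# to a freshly started babeld, which otherwise knows nothing about tunnels
# dug before it started.
TUN_IFACES=$(ip -o link show | awk -F': ' '/l2tp/ {print $2}' | cut -d@ -f1)
babeld -D -c /etc/babeld.conf $TUN_IFACES
```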
Along the lines of @bennlich's question: which version of the firmware are you using? Are you using an upgraded version of babeld? I suggest using v0.2.3 rather than a newer (untested) version to avoid unnecessary variability.
A short glimpse at the tunneldigger logs on the "new" exit node:
Note the …
@jhpoelen I'm using two different versions on different hardware and neither is working,
I've seen v0.3.0 working numerous times, when the new exitnode is in a known working state. The tunneldigger upgrade was made in sudomesh/nodewatcher-firmware-packages@d4e3b9f, so any new build of the firmware should pull in the new version of tunneldigger. Could you provide instructions on checking the tunneldigger client version on home nodes? It should also be noted that https://peoplesopen.herokuapp.com/ is reporting far fewer nodes than it was when it was working, so I imagine others using v0.2.3 and up are seeing this problem as well. The only change I did make is that I had to turn off my "bridging" nodes that helped restore order to the mesh after we resolved #21 (one was v0.2.2 pointing at the old exit node and one was v0.3.0 pointing at the new exitnode).
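For what it's worth, checking the tunneldigger client on an OpenWrt-based home node might look like this (the package and process names are assumptions about the sudowrt build):

```
opkg list-installed | grep -i tunneldigger    # installed package and version
ps w | grep '[t]unneldigger'                  # is the client actually running?
logread | grep -i tunneldigger | tail -n 20   # recent broker (re)connection attempts
```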
Ok, so weird. I reconfigured the bridge node setup in my room just now and it appears to have corrected the issues with the new exit node? I have no explanation for why this works, but the number of nodes jumped from 19 before I set it up to 34 afterwards. I will also note that the number of gateways remained at 14? Aha, that means that 15 nodes out in the world are relying on the single gateway in my room. Interesting. Here's the setup that restores order:
Any ideas why we might be seeing this? Does it give any clues as to the root cause? Why must we have a bridge setup? Shouldn't the new nodes work all on their own?
I'd need more information (routes, logs, etc.) to understand the root cause of this. Last time I checked, mole in sudoroom was acting as a bridge between the "new" and the "old" exit nodes. However, today, mole (100.65.93.65) was unable to ping 100.64.0.42 (the old exit node). This indicates that mole is no longer bridging between the new and old exit nodes. Perhaps this is because all the nodes at sudoroom are now running the new firmware. Note that the monitor is coded/configured to only report via the old exit node at this point. There's no reason why we can't extend this to include more exit nodes; this would require a little JavaScript coding.
With the information I have now, I think we are just looking at a feature, rather than a bug. Extending the monitor software would help us monitor the network from multiple exit nodes. Also, we can create a static bridge directly between the exit nodes by configuring a client on one, talking to a broker on the other. I'd favor extending the monitoring, especially because we don't really have any mesh services at this point.
I realize now that the monitor only reports through the old exit node, and I don't necessarily take issue with that. Perhaps my comment was misunderstood, but I'd strongly disagree that this is a feature; it seems more like a major flaw in the new exit node, one that we hadn't noticed until recently because mole and an old node in sudoroom were bridging the exit nodes. After turning off the old nodes in sudoroom on Tuesday, 4/17, we immediately observed the first outage the next day in #21; while partially patched, I still think we have yet to discover the root cause. Similarly, after shutting down my bridging nodes on Saturday, 4/21, there was a second outage, first observed on 4/24. The following is a detailed description of the behavior I have observed in the case of both outages:
What this means to me is that the new exit node is failing to operate independently of the old exit node. The "nuclear" option to check this theory would be to simply turn off the old exit node (or at least disable babeld and tunneldigger on it). However, for a more delicate way of reproducing, I suggest:
Rather than running a tunneldigger client on a broker, which seems messy to me, a temporary solution would be to create a self-contained bridge node that is always left on, so that when babeld is restarted, nodes are able to rediscover their routes to other nodes and to the internet. However, this avoids the real problem that is hiding somewhere in https://github.com/sudomesh/exitnode or the babeld-monitor.
Here are some notable portions of … On the new exit node, I see both
…
and
…
which seems to line up with the recent outages. I don't see anything similar in the old exit node's logs.
Thanks for sharing your response. As far as I can tell, I see three independent threads appearing:
I don't quite understand your comment: … Thread 1 seems like a design discussion caused by our attempts to move towards a network without a single point of failure, thread 2 seems like a bug in babeld, and thread 3 is just a result of an initial minimal implementation of the monitor. Curious to hear whether @bennlich, @paidforby, or others have some thoughts on this. Also, I'd be curious to hear which thread is the focus of this issue.
@paidforby I noticed the outage on 4/21, but I didn't realize what was happening until 4/22. Even then, I misattributed it to the network outage at Omni, and didn't have time to continue troubleshooting beyond getting Omni back online. (My point being that things broke immediately.)
@jhpoelen I previously opened #6 for this reason. Right now, we have one exit node that can report on both as long as the exits are meshing with each other. To make monitoring robust to partitions, we can just update the app to accept updates from multiple exit nodes and in turn report on each instead of a single one. Should be easy enough.
@eenblam having support for partitioned mesh monitoring would be really nice, I think: seeing the mesh in action seems not only useful for checking on health, but would also be nice for educational purposes. Thanks for the reminder about sudomesh/monitor#6.
@jhpoelen thanks for disentangling the threads--I found that helpful, and @paidforby thanks for the detailed log of observations. Results/confirmations from recent experimenting:
Interestingly, a …
Namely, traffic from two different IPv6 addresses, one of which corresponds to an interface on the home node and the other to an interface on the exit node. In the past, I've always associated this kind of babeling with a working mesh, so it's interesting that we're seeing this without seeing route tables successfully distributing. If anyone has access to the … My next step is to reflash my node with older firmwares until I find one that works.
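For reference, this kind of babel chatter can be watched with tcpdump (the interface name below is just an example):

```
# Babel speaks UDP on port 6696 between link-local IPv6 addresses.
tcpdump -i l2tp1001 -n udp port 6696
# Hellos/updates from both the home node's and the exit node's link-local
# addresses mean the two babeld processes can hear each other, even if no
# routes end up being installed.
```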
@bennlich 100.65.93.65 is mole; sent the password by Signal.
Today I was able to reproduce the issue after a segmentation fault on the HE exit node. Sequence of events:
Expected behavior: …
Actual: …
Workaround: …
Logs:
Note that re-adding tunnel interfaces via the exit node patches described in sudomesh/exitnode@7d3c2a6 and sudomesh/exitnode@9eb8c7f does not seem to fix the issue.
This won't save everything, but we might be able to reduce the impact of these events by dumping existing routes to something like a file before babeld restarts and re-installing them afterwards; a rough sketch is below. This can still lose recently added routes.
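An untested sketch of that idea (the dump path is arbitrary):

```
# Snapshot the kernel routing table before babeld goes down and replay it
# after the restart, so traffic keeps flowing while babeld re-learns routes.
ip route save > /run/babeld-routes.dump
systemctl restart babeld
ip route restore < /run/babeld-routes.dump   # may complain about routes that already exist
```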
@eenblam note that the patch was applied to exit nodes, not home nodes. It seems that babeld has some features to save routes already, given the warning "cannot restore old configuration ...". I'm curious what we'll come up with to gracefully recover from outages.
…re/pull/132#commitcomment-28830678 and sudomesh/bugs#31 also adjusted pw_reset cron
Mainline babeld has support for adding interfaces at runtime since version 1.7.0.
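Roughly, that runtime interaction looks like the session sketched here (the port number and interface name are examples, assuming babeld was started with a read-write config socket such as `-G 33123`):

```
# Connect to babeld's read-write config socket and add an interface at runtime.
nc ::1 33123
# ok
interface l2tp1001
# ok
```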
(The first "ok" is the reply to the initial connection, the second "ok" is the reply to the interface statement. The format is designed to be easy to parse automatically. See https://www.irif.fr/~jch/software/babel/babel-lexer.c) If you can reproduce the crash with mainline babeld, then I'll be grateful for a backtrace. |
I upgraded babeld on the HE exit node on June 6, and I see only one segfault log since then:
In April, we were seeing several segfaults a day, so this is definitely an improvement. I'm going to close this issue for now. Currently the setsockopt memory leak is way more important to figure out (this happens ~once per day).
I'd appreciate a backtrace, if you can provide one.
@jech I'm afraid I don't know how to provide one :-/
Compile with "make CDEBUGFLAGS='-g -Wall'".
Before running, do "ulimit -c".
When you get a file called "core", do
gdb ./babeld core
bt
|
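Putting those steps together (the babeld options below are placeholders; the core dump is assumed to land in the working directory):

```
make CDEBUGFLAGS='-g -Wall'
ulimit -c unlimited             # allow a core file to be written in this shell
./babeld -c /etc/babeld.conf    # run with your usual options until it segfaults
gdb ./babeld core               # load the resulting core dump
# at the (gdb) prompt, type:
#   bt
```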
Awesome! Thank you!
Related to #21, the "new" exit node continues to have difficulties maintaining its babeld process.
On home nodes pointed toward the new exit node, I'm seeing this in the logs:
On the "new" exit node, I'm noticed this in
/var/log/messages
However, the babeld process appears to be alive and well, and restarting it does nothing.
Any ideas on the root cause? Perhaps it's related to #24, which haunts the old exitnode? I'm tempted to just restart it and see if it comes back to life, but I don't think that will help us solve the problem. Some other node whisperers should jump into the exit node and see what they can figure out.
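Some first checks on the exit node could look like this (the service name and log paths assume a systemd-based exitnode build):

```
systemctl status babeld                              # is it running, and when did it last restart?
grep -i 'babeld\|segfault' /var/log/messages | tail -n 50
journalctl -u babeld --since today                   # restarts and errors from systemd's view
ip route | head                                      # are any mesh routes actually installed?
```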