babeld segmentation fault on exit node causes lost routes #31

Closed
paidforby opened this issue Apr 25, 2018 · 30 comments

Comments

@paidforby

Related to #21, the "new" exit node continues to have difficulties maintaining its babeld process.

On home nodes pointed toward the new exit node, I'm seeing this in the logs:

Tue Apr 24 19:07:43 2018 user.notice root: no mesh routes available yet via [l2tp0] on try [26]: checking again in [5]s...

On the "new" exit node, I noticed this in /var/log/messages:

Apr 24 12:17:55 exit0 kernel: [433698.664554] babeld[830]: segfault at 18 ip 00005567267473d0 sp 00007ffc6a239240 error 4 in babeld[55672673d000+16000]
Apr 24 17:27:26 exit0 kernel: [452270.052907] babeld[3596]: segfault at fffffff337dc3df9 ip 000055e3a510f112 sp 00007fffd89c3400 error 7 in babeld[55e3a5104000+16000]
Apr 24 18:10:33 exit0 kernel: [454857.088438] traps: babeld[3767] general protection ip:561947add3d0 sp:7fffc111a920 error:0
Apr 24 18:10:33 exit0 kernel: [454857.088445]  in babeld[561947ad3000+16000]

However, the babeld process appears to be alive and well, and restarting it does nothing:

● babeld.service - babeld
   Loaded: loaded (/etc/systemd/system/babeld.service; disabled; vendor preset: 
   Active: active (running) since Tue 2018-04-24 18:10:33 PDT; 59min ago
 Main PID: 5288 (babeld)
    Tasks: 1 (limit: 4915)
   CGroup: /system.slice/babeld.service
           └─5288 /usr/local/bin/babeld -F -S /var/lib/babeld/state -c /etc/babe

Apr 24 18:22:23 exit0 babeld[5288]: removing: l2tp354-354
Apr 24 18:22:32 exit0 babeld[5288]: removing: l2tp355-355
Apr 24 18:24:24 exit0 babeld[5288]: removing: l2tp356-356
Apr 24 18:26:13 exit0 babeld[5288]: removing: l2tp357-357
Apr 24 18:26:15 exit0 babeld[5288]: send: Cannot assign requested address
Apr 24 18:28:16 exit0 babeld[5288]: removing: l2tp358-358
Apr 24 19:05:13 exit0 babeld[5288]: Warning: cannot restore old configuration fo
Apr 24 19:05:13 exit0 babeld[5288]: removing: l2tp350-350
Apr 24 19:07:05 exit0 babeld[5288]: removing: l2tp360-360
Apr 24 19:08:55 exit0 babeld[5288]: removing: l2tp361-361

Any ideas about the root cause? Perhaps it's related to #24, which haunts the old exitnode? I'm tempted to just restart it and see if it comes back to life, but I don't think that will help us solve the problem. Some other node whisperers should jump into the exit node and see what they can figure out.
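For anyone reading the kernel lines quoted above: the trailing `error N` field is the page-fault error code. The sketch below decodes its low three bits, assuming the exit node is an x86-64 machine (an assumption based on the address format, not something stated in this thread). The function name is made up for illustration.

```python
def decode_page_fault_error(code: int) -> dict:
    """Decode the low three bits of an x86 page-fault error code.

    Assumes x86/x86-64; other architectures report faults differently.
    """
    return {
        # bit 0: 0 = page not present, 1 = protection violation
        "cause": "protection violation" if code & 0x1 else "page not present",
        # bit 1: 0 = read access, 1 = write access
        "access": "write" if code & 0x2 else "read",
        # bit 2: 0 = kernel mode, 1 = user mode
        "mode": "user" if code & 0x4 else "kernel",
    }


# The two codes that appear in the segfault lines above.
for code in (4, 7):
    print(code, decode_page_fault_error(code))
```

Under that assumption, `segfault at 18 ... error 4` reads as a user-mode read from the unmapped address 0x18, which would be consistent with (but does not prove) a field access through a NULL pointer.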

@bennlich
Collaborator

That's cool that the babeld process gets restarted successfully. This means that your change to the systemd service file worked!

My first guess is: after babeld restarts on the exit node, it is unaware of any pre-existing tunnels, so it won't respond to any babeling on any existing tunnel interfaces.

We can test this hypothesis by restarting meshrouting on a homenode that is showing the "no mesh routes available yet" error. Restarting meshrouting will restart tunneldigger and cause a new l2tp interface to come up on the exitnode. This executes the hook that alerts babeld of the new interface. If the hypothesis is correct, the homenode should succeed in getting a route.

@paidforby
Author

Tested your suggestion and it does not appear to work. Additionally, I have been flashing new nodes, creating new tunnels, and turning nodes off and on all day, and none of them have been able to babel with the exit node.

One question this brings to mind is, why are we still using our fork of babeld "https://github.com/sudomesh/babeld"? Is it just for the convenience of babeld -i? Is there something I am missing, I noticed we also added a "fungible mode"? Why not use a stable release of https://github.com/jech/babeld?

Just wondering if our babelling difficulties were perhaps fixed somewhere in the 186 commits and 4 releases since we forked ours.

Also if our changes don't conflict with the normal operation of babeld, why don't we make a pull request to the main repo?

@bennlich
Collaborator

Tested your suggestion and it does not appear to work. Additionally, I have been flashing new nodes, creating new tunnels, and turning nodes off and on all day, and none of them have been able to babel with the exit node.

K, dang.

One question this brings to mind is, why are we still using our fork of babeld "https://github.com/sudomesh/babeld"? Is it just for the convenience of babeld -i? Is there something I am missing, I noticed we also added a "fungible mode"? Why not use a stable release of https://github.com/jech/babeld?

Yes, we are still using our fork. The important piece of our fork is the babeld -a option, which lets you add a new interface to an already running babeld. This is called from the exitnode tunneldigger uphook script. AFAIK the main babeld branch requires you to pass in all the interfaces you want to babel on when babeld first starts.

I think these changes were pretty hefty, but I haven't looked at them closely--here's the diff: sudomesh/babeld@sudomesh:acc7bde48f504609a4b67ed205798f3c4c07b057...master.

@bennlich
Collaborator

@paidforby Are the home nodes you've been testing with using the old tunneldigger or the new tunneldigger? (Only the new one will connect reliably to HE.)

@jhpoelen
Contributor

The restart script related to #21 automatically adds tunneldigger interfaces to a newly restarted babeld process. That seems to be working quite well as far as I can tell. However, when the babeld process crashes and restarts via systemd, this re-adding of tunnel digger interfaces does not occur. https://github.com/sudomesh/exitnode/blob/dad8b658b07f42c2146eb8893fdd571b4149f2d0/src/opt/babeld-monitor/babeld-monitor.sh#L43 .

Do you have a specific home node on which this issue presents itself?

@jhpoelen
Contributor

Along the lines of @bennlich's question: which version of the firmware are you using? Are you using an upgraded version of babeld? I suggest using v0.2.3 rather than a newer (untested) version to avoid unnecessary variability.

@jhpoelen
Contributor

A short glimpse at the tunneldigger logs on the "new" exit node:

Apr 25 10:52:04 exit0 python[510]: [WARNING/tunneldigger.broker] Session identifier 1 already exists.
Apr 25 10:52:04 exit0 python[510]: [WARNING/tunneldigger.protocol] Failed to create tunnel (ba096bf9-9aa5-4fcc-92af-306e7009ee3c) while proc
Apr 25 10:52:07 exit0 python[510]: [INFO/tunneldigger.broker] Creating tunnel (xxx-306e7009ee3c) with id 465.
Apr 25 10:52:07 exit0 python[510]: [WARNING/tunneldigger.broker] Session identifier 1 already exists.

Note the Session identifier 1 already exists. This is a sign that old tunneldigger clients are used to establish tunnels. The clients need to be upgraded before being able to connect to the "new" exit node.

@paidforby
Author

@jhpoelen I'm using two different versions on different hardware and neither is working,

  • MyNet N750 with v0.2.3 and makenode v0.0.1
  • MyNet N600 with v0.3.0

I've seen v0.3.0 working numerous times, when the new exitnode is in a known working state. The tunneldigger upgrade was made in sudomesh/nodewatcher-firmware-packages@d4e3b9f so any new build of the firmware should pull in the new version of tunneldigger. Could you provide instructions on checking the tunneldigger client version on home nodes?

It should also be noted that https://peoplesopen.herokuapp.com/ is reporting far fewer nodes than it was when it was working, so I imagine others using v0.2.3 and up are seeing this problem as well.

The only change I did make is I had to turn off my "bridging" nodes that helped restore order to the mesh after we resolved #21 (one was v0.2.2 pointing at the old exit node and one was v0.3.0 pointing at the new exitnode)

@paidforby changed the title from "exit node not maintaining babeld process" to "exit node does work unless bridge node is configured" on Apr 25, 2018
@paidforby
Author

paidforby commented Apr 26, 2018

The only change I did make is I had to turn off my "bridging" nodes that helped restore order to the mesh after we resolved #21

Ok, so weird. I reconfigured the bridge node setup in my room just now and it appears to have corrected the issues with the new exit node? I have no explanation for why this works, but the number of nodes jumped from 19 before I set it up to 34 afterwards. I will also note that the number of gateways remained at 14? Aha, that means that 15 nodes out in the world are relying on the single gateway in my room. Interesting. Here's the setup that restores order:

  1. MyNet N600 flashed with sudowrt v0.2.2 and configured with makenode v0.1.0, provide internet via WAN port, comment out the "new" exit node in /etc/config/tunneldigger so it will only look for the old

  2. MyNet N600 flashed with sudowrt v0.2.3 and configured with makenode v0.1.0 (or just sudowrt v0.3.0 and no makenode 😁), provide internet via WAN port, comment out the "old" exit node in /etc/config/tunneldigger so it will only look for the new.

  3. Allow a little time for them to find their respective exit nodes, then find each other over their ad-hoc interfaces, and then share their routing tables and... voilà! The mesh converges in less than 5 mins.

Any ideas why we might be seeing this? Does this give any clues as to the root cause? Why must we have a bridge setup? Shouldn't the new nodes work all on their own?

@jhpoelen
Contributor

jhpoelen commented Apr 26, 2018

I'd need more information (routes, logs, etc.) to understand the root cause of this. Last time I checked, mole in sudoroom was acting as a bridge between the "new" and the "old" exit nodes. However, today, mole (100.65.93.65) was unable to ping 100.64.0.42 (old exit node). This indicates that mole is no longer bridging between the new and old exit nodes. Perhaps this is because all the nodes at sudoroom are now running the new firmware.

Note that the monitor is coded/configured to only report via the old exit node at this point. There's no reason why we can't extend this to include more exit nodes. This would require a little javascript coding . . .

@jhpoelen
Contributor

With the information I have now, I think we are just looking at a feature, rather than a bug. Extending the monitor software would help to monitor the network from multiple exit nodes. Also, we can create a static bridge directly between the exit nodes by configuring a client on the one, talking to a broker on the other. I'd favor extending the monitoring, especially because we don't really have any mesh services at this point.

@paidforby
Author

I realize now that the monitor only reports through the old exit node, and I don't necessarily take issue with that. Perhaps my comment was misunderstood, but I'd strongly disagree that this is a feature. It seems more like a major flaw in the new exit node that we haven't noticed until recently because mole and an old node in sudoroom were bridging the exitnodes. After turning off the old nodes in sudoroom on Tuesday, 4/17, we immediately observed the first outage the next day in #21. While that was partially patched, I still think we have yet to discover the root cause. Similarly, after shutting down my bridging nodes on Saturday 4/21, there was a second outage, first observed on 4/24. The following is a detailed description of the behavior I have observed in the case of both outages:

  1. The bridging nodes are setup, everything is working normally, i.e. everyone on 100.64.0.0/10 subnet can ping everyone else and can reach the internet.
  2. The bridging nodes are turned off, everything continues working normally.
  3. Some time later, an error occurs in babeld on the new exit node (a segmentation fault? maybe related to #21, "node establish tunnels, but does not get route to mesh", or #24, "setsockopt out of memory causes babeld failure") and babeld enters a bad state. Routes to the old exit node disappear (explaining the decrease in the monitor) and, more importantly, nodes maintain (or set up new) tunnels to the new exit node but are unable to reach the internet or babel with any other nodes on the mesh. I've observed this on countless nodes newly flashed (with 0.2.3 or later) and on ones that were flashed (with 0.2.3) and patched (for #23, "home nodes depend on static exit node ips", and #27, "Domain names not being resolved over extender node bridged connection") weeks ago.
  4. After either a) restarting the new exit node, as in #21, or b) restarting the babeld process, though it was still active and running, nodes tunnelling to the new exit node are still not able to reach the internet or babel with other nodes.
  5. After reconfiguring the bridging nodes as I described in my earlier comment on this issue (#31), all nodes flashed with 0.2.3 or later and configured with makenode 0.0.1 or patched up to #27 are suddenly able to reach the internet through the new exit node and babel again with each other.

What this means to me is that the new exit node is failing to operate independently of the old exit node. The "nuclear" option to check this theory would be to simply turn off the old exit node (or at least disable babeld and tunneldigger). However, for a more delicate way of reproducing, I suggest:

  1. Turning off the bridging nodes that I am currently maintaining (I see this being a huge flaw/feature also, because what if I'm not near my bridging nodes, or what if someone else is unwittingly maintaining bridging nodes)
  2. Check if everything is still working normally for an hour or two (allowing time for routes to disappear maybe?)
  3. kill -9 the babeld process on the new exit node to trigger systemd to restart it and maybe put it into that bad state, otherwise wait a few hours/days for the error to occur
  4. Check if everything is working normally (again allowing time for routes to disappear?)

Rather than running a tunneldigger client on a broker, which seems messy to me, a temporary solution would be to create a self-contained bridge node that is always left on, so when babeld is restarted, nodes are able to rediscover their route to other nodes and to the internet. However, this avoids the real problem that is hiding somewhere in https://github.com/sudomesh/exitnode or the babeld-monitor.

@paidforby
Author

Here are some notable portions of /var/log/messages from a node that was on during outage, home_node_messages_dump_1.txt
home_node_messages_dump_2.txt

and also on the new exit node, I see both

Apr 17 15:19:38 exit0 kernel: [2681419.863612] HTB: quantum of class 10001 is big. Consider r2q change.
Apr 17 18:42:28 exit0 kernel: [2693589.502208] traps: babeld[17629] general protection ip:556562bea3d0 sp:7ffec3777290 error:0
Apr 17 18:42:28 exit0 kernel: [2693589.502215]  in babeld[556562be0000+16000]

and

Apr 19 21:44:53 exit0 kernel: [35712.741489] HTB: quantum of class 10001 is big. Consider r2q change.
Apr 20 20:43:24 exit0 kernel: [118424.812594] perf: interrupt took too long (2502 > 2500), lowering kernel.perf_event_max_sample_rate to 79750
Apr 21 22:17:13 exit0 kernel: [210454.370263] perf: interrupt took too long (3131 > 3127), lowering kernel.perf_event_max_sample_rate to 63750
Apr 24 12:17:55 exit0 kernel: [433698.664554] babeld[830]: segfault at 18 ip 00005567267473d0 sp 00007ffc6a239240 error 4 in babeld[55672673d000+16000]
Apr 24 17:27:26 exit0 kernel: [452270.052907] babeld[3596]: segfault at fffffff337dc3df9 ip 000055e3a510f112 sp 00007fffd89c3400 error 7 in babeld[55e3a5104000+16000]
Apr 24 18:10:33 exit0 kernel: [454857.088438] traps: babeld[3767] general protection ip:561947add3d0 sp:7fffc111a920 error:0
Apr 24 18:10:33 exit0 kernel: [454857.088445]  in babeld[561947ad3000+16000]
Apr 25 10:50:02 exit0 kernel: [514826.886135] HTB: quantum of class 10001 is big. Consider r2q change.
Apr 25 10:52:19 exit0 kernel: [514963.182555] HTB: quantum of class 10001 is big. Consider r2q change.
Apr 25 20:47:44 exit0 kernel: [550689.056115] babeld[5288]: segfault at fffffffb56cb2ff9 ip 0000557496a84112 sp 00007fff5886f6a0 error 7 in babeld[557496a79000+16000]
Apr 25 20:53:16 exit0 kernel: [551021.085994] babeld[22067]: segfault at 39 ip 0000555856d4b3d0 sp 00007ffc050306e0 error 4 in babeld[555856d41000+16000]

which seems to line up with the recent outages. I don't see anything similar in the old exit node's logs
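Because each babeld process is mapped at a different base address, the raw `ip` values in these lines are hard to compare. A minimal sketch, assuming the `in babeld[base+size]` field gives the start of the mapping that contains the faulting instruction, subtracts the base from `ip` to get an offset that is comparable across crashes. The function and regex names are illustrative only.

```python
import re

# Handles both kernel formats quoted in this thread:
#   "... segfault at 18 ip 00005567267473d0 ... in babeld[55672673d000+16000]"
#   "traps: ... general protection ip:561947add3d0 ..." with the
#   " in babeld[561947ad3000+16000]" part on the following line.
IP_RE = re.compile(r"\bip[: ]([0-9a-f]+)")
MAP_RE = re.compile(r"in babeld\[([0-9a-f]+)\+[0-9a-f]+\]")


def fault_offsets(lines):
    """Return ip minus mapping base for each crash found in `lines`."""
    offsets = []
    pending_ip = None
    for line in lines:
        ip = IP_RE.search(line)
        if ip:
            pending_ip = int(ip.group(1), 16)
        base = MAP_RE.search(line)
        if base and pending_ip is not None:
            offsets.append(pending_ip - int(base.group(1), 16))
            pending_ip = None
    return offsets


sample = [
    "Apr 24 12:17:55 exit0 kernel: [433698.664554] babeld[830]: segfault at 18 ip 00005567267473d0 sp 00007ffc6a239240 error 4 in babeld[55672673d000+16000]",
    "Apr 24 17:27:26 exit0 kernel: [452270.052907] babeld[3596]: segfault at fffffff337dc3df9 ip 000055e3a510f112 sp 00007fffd89c3400 error 7 in babeld[55e3a5104000+16000]",
    "Apr 24 18:10:33 exit0 kernel: [454857.088438] traps: babeld[3767] general protection ip:561947add3d0 sp:7fffc111a920 error:0",
    "Apr 24 18:10:33 exit0 kernel: [454857.088445]  in babeld[561947ad3000+16000]",
]
print([hex(o) for o in fault_offsets(sample)])
```

For the three April 24 crashes this gives 0xa3d0, 0xb112 and 0xa3d0, which hints at two distinct crash sites rather than random corruption. If the installed binary has symbols, a tool such as addr2line run against that same binary may be able to map these offsets to source lines.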

@jhpoelen
Contributor

jhpoelen commented Apr 26, 2018

Thanks for sharing your response. As far as I can tell, I see three independent threads appearing:

  1. confusion around how exit nodes mesh with each other.

  2. a re-occurring segmentation fault in babeld

  3. the monitor at https://peoplesopen.net/monitor reporting only the "old" exit node routes

I don't quite understand your comment that the "new exit node is failing to operate independently of the old exit node". My home node has been running against the "new" exit node independently for quite some time. Also, the "old" exit node and the "new" exit node are configured similarly, apart from versions of tunneldigger. In addition, various home nodes have operated with distinct exit nodes that do not mesh with the "new" or "old" exit nodes. Perhaps I don't understand the context.

Thread 1. seems like a design discussion caused by our attempts to move towards a network without a single point of failure. Thread 2. seems like a bug in babeld and thread 3. is just a result of an initial minimal implementation of the monitor.

Curious to hear whether @bennlich @paidforby or others have some thoughts on this. Also, I'd be curious to hear which thread is the focus of this issue.

@eenblam
Contributor

eenblam commented Apr 27, 2018

Similarly, after shutting down my bridging nodes on Saturday 4/21, there was a second outage, first observed on 4/24.

@paidforby I noticed the outage on 4/21, but I didn't realize what was happening until 4/22. Even then, I misattributed it to the network outage at Omni, and didn't have time to continue troubleshooting beyond getting Omni back online. (My point being that things broke immediately.)

@eenblam
Contributor

eenblam commented Apr 27, 2018

With the information I have now, I think we are just looking at a feature, rather than a bug. Extending the monitor software would help to monitor the network from multiple exit nodes. Also, we can create a static bridge directly between the exit nodes by configuring a client on the one, talking to a broker on the other. I'd favor extending the monitoring, especially because we don't really have any mesh services at this point.

@jhpoelen I previously opened #6 for this reason.

Right now, we have one exit node that can report on both as long as the exits are meshing with each other. To make monitoring robust to partitions, we can just update the app to accept updates from multiple exit nodes and in turn report on each instead of a single one. Should be easy enough.

@jhpoelen
Contributor

@eenblam having support for partitioned mesh monitoring would be really nice I think: seeing the mesh in action seems not only useful for checking on health, but would be nice for educational purposes. Thanks for reminding me of sudomesh/monitor#6.

@bennlich
Collaborator

@jhpoelen thanks for disentangling the threads--I found that helpful, and @paidforby thanks for the detailed log of observations.

Results/confirmations from recent experimenting:

  • I have a node flashed with 0.3.0 (zeroconf) that digs tunnels but never receives mesh routes from either exit node.
  • After restarting babeld and tunneldigger broker on the HE exit node, my 0.3.0 node still does not receive mesh routes. However, a few other mesh routes do reappear on the exit node, meaning some home nodes are able to connect to HE successfully. The vast majority of them are via 100.65.93.65.

Interestingly, a tcpdump on the l2tp0 interface shows what looks like healthy babeling:

root@sudomesh-node:~# tcpdump -i l2tp0
tcpdump: WARNING: l2tp0: no IPv4 address assigned
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on l2tp0, link-type EN10MB (Ethernet), capture size 65535 bytes
10:26:54.911982 IP6 fe80::5c72:20ff:fe86:efe8.6696 > ff02::1:6.6696: babel 2 (884) hello nh router-id update router-id update update update router-id update router-id update router-id update router-id update router-id update router-id update router-id update router-id update router-id update router-id update router-id update router-id update router-id update router-id update router-id update router-id update router-id update router-id update router-id update router-id update router-id update router-id update router-id update router-id update router-id update router-id update router-id update router-id update
10:26:57.549808 IP6 fe80::18b7:27ff:fe98:b024.6696 > ff02::1:6.6696: babel 2 (30) hello ihu
10:26:57.644687 IP6 fe80::5c72:20ff:fe86:efe8.6696 > ff02::1:6.6696: babel 2 (8) hello
10:27:01.400803 IP6 fe80::5c72:20ff:fe86:efe8.6696 > ff02::1:6.6696: babel 2 (24) hello ihu
10:27:02.048806 IP6 fe80::18b7:27ff:fe98:b024.6696 > ff02::1:6.6696: babel 2 (14) hello

Namely, traffic from two different IP6 addresses, one of which corresponds to an interface on the home node, and the other to an interface on the exit node. In the past, I've always associated this kind of babeling with a working mesh, so it's interesting that we're seeing this without seeing route tables successfully distributing.
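One way to see what each side is actually sending is to group the message names tcpdump prints by source address. The sketch below assumes tcpdump output in the one-line format shown above, on the Babel port 6696. The helper name is hypothetical, and the first sample line is abridged from the long capture line above.

```python
import re
from collections import defaultdict

# Captures the source address (without the .6696 port) and the list of
# message names that tcpdump prints after "babel 2 (<length>)".
LINE_RE = re.compile(r"IP6 (\S+)\.6696 > \S+: babel 2 \(\d+\) (.*)$")


def messages_by_source(lines):
    """Map each source address to the set of Babel message names it sent."""
    seen = defaultdict(set)
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            continue
        source, body = match.groups()
        seen[source].update(body.split())
    return dict(seen)


sample = [
    # abridged: the real line repeats "router-id update" many times
    "10:26:54.911982 IP6 fe80::5c72:20ff:fe86:efe8.6696 > ff02::1:6.6696: babel 2 (884) hello nh router-id update",
    "10:26:57.549808 IP6 fe80::18b7:27ff:fe98:b024.6696 > ff02::1:6.6696: babel 2 (30) hello ihu",
    "10:26:57.644687 IP6 fe80::5c72:20ff:fe86:efe8.6696 > ff02::1:6.6696: babel 2 (8) hello",
]
for source, names in messages_by_source(sample).items():
    print(source, sorted(names))
```

In the capture above, only one of the two addresses appears to send `update` messages, while the other sends just `hello` and `ihu`. That asymmetry might be worth checking against which address belongs to the exit node.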

--

If anyone has access to the 100.65.93.65 home node and is willing to share, I'd be interested to poke around and see if anything seems obviously different than my node.

My next step is to reflash my node with older firmwares until I find one that works.

@bennlich changed the title from "exit node does work unless bridge node is configured" to "exit node does (not?) work unless bridge node is configured" on Apr 27, 2018
@jhpoelen
Contributor

@bennlich 100.65.93.65 is mole , sent pw by signal.

@jhpoelen changed the title from "exit node does (not?) work unless bridge node is configured" to "babeld segmentation fault on exit node causes lost routes" on Apr 28, 2018
@jhpoelen
Contributor

jhpoelen commented Apr 28, 2018

Today, I was able to re-produce the issue after a segmentation fault on HE exit node.

Sequence of events:

  1. babeld segmentation fault occurs
  2. systemd restarts babeld

expected behavior:
on babeld crash on exit node, tunnel interfaces are re-added to babeld on restart.

actual:
after babeld crash on exit node, babeld is restarted, but existing tunnel interfaces are not re-added to the babeld process.

workaround:
After a babeld crash/restart on the exit node, stop babeld and tunneldigger on the home node, then restart both to force the tunnel interface on the exit node to be re-added to the broker-side babeld process.

logs:

Apr 28 11:02:23 exit0 babeld[2189]: Warning: cannot save old configuration for l2tp976-976.
Apr 28 11:04:08 exit0 babeld[2189]: removing: l2tp976-976
Apr 28 11:04:19 exit0 systemd[1]: babeld.service: Main process exited, code=killed, status=11/SEGV
Apr 28 11:04:19 exit0 systemd[1]: babeld.service: Unit entered failed state.
Apr 28 11:04:19 exit0 systemd[1]: babeld.service: Failed with result 'signal'.
Apr 28 11:04:19 exit0 systemd[1]: babeld.service: Service hold-off time over, scheduling restart.
Apr 28 11:04:19 exit0 systemd[1]: Stopped babeld.
Apr 28 11:04:19 exit0 systemd[1]: Started babeld.
Apr 28 11:07:48 exit0 babeld[11930]: Warning: cannot restore old configuration for l2tp978-978.
Apr 28 11:07:48 exit0 babeld[11930]: removing: l2tp978-978
Apr 28 11:09:09 exit0 babeld[11930]: kernel_route(ADD): File exists
Apr 28 11:09:39 exit0 babeld[11930]: Warning: cannot restore old configuration for l2tp979-979.
Apr 28 11:09:39 exit0 babeld[11930]: removing: l2tp979-979
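A recovery hook first needs to know that a crash-restart happened. The sketch below picks the SEGV exits out of syslog-style lines in the format shown above. The regex and function name are illustrative, and what the hook does afterwards (for example re-adding tunnel interfaces) is left open.

```python
import re

# Matches systemd's report that a unit's main process died from SIGSEGV,
# in the syslog line format quoted above.
SEGV_RE = re.compile(
    r"^(\w{3} +\d+ [\d:]+) \S+ systemd\[1\]: (\S+)\.service: "
    r"Main process exited, code=killed, status=11/SEGV"
)


def segv_exits(lines):
    """Return (timestamp, unit) for every SEGV exit reported by systemd."""
    hits = []
    for line in lines:
        match = SEGV_RE.match(line)
        if match:
            hits.append((match.group(1), match.group(2)))
    return hits


sample = [
    "Apr 28 11:04:08 exit0 babeld[2189]: removing: l2tp976-976",
    "Apr 28 11:04:19 exit0 systemd[1]: babeld.service: Main process exited, code=killed, status=11/SEGV",
    "Apr 28 11:04:19 exit0 systemd[1]: Started babeld.",
]
print(segv_exits(sample))
```

Matching on the systemd line rather than the kernel segfault line ties the event to the unit name, so the same check would cover other services if needed.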

@jhpoelen
Contributor

Note that re-adding tunnel interfaces via the exit node patches described in sudomesh/exitnode@7d3c2a6 and sudomesh/exitnode@9eb8c7f does not seem to fix the issue.

@eenblam
Contributor

eenblam commented Apr 28, 2018

This won't save everything, but we might be able to reduce the impact of these events by dumping existing routes to something like /var/cache/routes every so often, then writing a hook to add those routes back when babeld restarts.

This can still lose recently added routes.

@jhpoelen
Contributor

@eenblam note that the patch was applied to exit nodes, not home nodes. It seems that babeld has some features to save routes already . . . given the warning "cannot restore old configuration ...". Am curious what we'll come up with to gracefully recover from outages.

@jech

jech commented May 11, 2018

Mainline babeld has support for adding interfaces at runtime since version 1.7.0.

$ sudo ./babeld -G 33123 eth0 &
[1] 5287
$ (echo 'interface wlan0') | nc -q1 localhost 33123
BABEL 1.0
version babeld-1.8.0-49-gcb978f2 
host trurl
my-id 86:8f:69:ff:fe:f0:33:8e
ok
ok
$

(The first "ok" is the reply to the initial connection, the second "ok" is the reply to the interface statement. The format is designed to be easy to parse automatically. See https://www.irif.fr/~jch/software/babel/babel-lexer.c)

If you can reproduce the crash with mainline babeld, then I'll be grateful for a backtrace.
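The same exchange could be driven from a script, for example from a tunneldigger hook. This is a minimal sketch that assumes babeld was started with `-G 33123` as in the transcript above, and that the greeting ends with its own `ok` line as shown there. The function name is made up, and only the `ok` reply seen above is relied on.

```python
import socket


def babeld_command(command, host="localhost", port=33123, timeout=5.0):
    """Send one line to babeld's local configuration socket.

    Returns (greeting_lines, reply_line). The port is the one passed to
    babeld via -G; 33123 is simply the value used in the transcript above.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        reader = sock.makefile("rb")
        greeting = []
        # Read the greeting ("BABEL 1.0", version, host, my-id) up to its "ok".
        while True:
            raw = reader.readline()
            if not raw:
                raise ConnectionError("connection closed during greeting")
            line = raw.decode("ascii", "replace").strip()
            if line == "ok":
                break
            greeting.append(line)
        sock.sendall(command.encode("ascii") + b"\n")
        reply = reader.readline().decode("ascii", "replace").strip()
        return greeting, reply


# Example (needs a running babeld with -G 33123):
#   greeting, reply = babeld_command("interface wlan0")
#   if reply != "ok":
#       raise RuntimeError("babeld did not accept the interface: " + reply)
```

Reading with a binary file object and writing with `sendall` keeps the read buffer and the write path separate, which avoids surprises from mixing buffered reads and writes on one text stream.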

@jhpoelen
Contributor

@jech Thanks for sharing! I like the idea of switching back to main if it has the features we need. Have created #33 to discuss with others.

@bennlich
Collaborator

bennlich commented Sep 26, 2018

I upgraded babeld on the HE exit node on June 6, and I see only one segfault log since then:

Aug 15 14:32:03 exit0 kernel: [10205039.878521] babeld[20265]: segfault at 7f635be65c90 ip 0000561b27848273 sp 00007ffc10a26e20 error 6 in babeld[561b2783c000+1b000]

In April, we were seeing several segfaults a day, so this is definitely an improvement. I'm going to close this issue for now. Currently the setsockopt memory leak is way more important to figure out (this happens ~once per day).

@jech

jech commented Sep 26, 2018 via email

@bennlich
Collaborator

@jech I'm afraid I don't know how to provide :-/ Working on a simple stress test to see if I can repro the setsockopt error for you though. Some observations are documented here: #24

@jech

jech commented Sep 26, 2018 via email

@bennlich
Collaborator

awesome! thank you!

5 participants