Babeld-only network does not work on wireless mesh links #666

Closed
ilario opened this issue Dec 12, 2019 · 6 comments

@ilario
Member

ilario commented Dec 12, 2019

I just tested a network with no Batman-adv but only Babeld, on top of OpenWrt 18.06 (initially spotted when testing on OpenWrt 19.07-rc2, where the same problem is present): wireless mesh links do not work.

When all the protocols except list protocols batadv:%N1 are enabled in /etc/config/lime-node, routing over the wireless links stops working (while the cabled ones are still ok). Needless to say, when both Batman-adv and Babeld are present everything works.
But it works just because the broken routing is "patched" as the packets flow through bat0.

My setup has 4 nodes:
Commercial AP with internet - - - - LibreMesh wireless client (10.13.40.169) ------ LibreMesh AP + mesh (10.13.104.0) - - - - LibreMesh AP + mesh (10.13.123.8)

The ping to the internet from the third router works (i.e. the cabled link works), but from the fourth it does not (i.e. the wireless link does not work).

The Babeld dump (echo dump | nc ::1 30003) looks good both in the third and in the fourth router, and does not change when adding or removing Batman-adv.

Third router routes:

# ip route show
default via 10.13.40.169 dev eth0-1_17 proto babel onlink 
10.13.0.0/16 dev br-lan proto kernel scope link src 10.13.104.0 
10.13.0.0/16 dev anygw proto kernel scope link src 10.13.0.1 

ping to the internet works (via eth0-1_17) and ping to the second router (10.13.40.169) works (as it is reachable from the LAN network included in br-lan).

Fourth router routes:

# ip route show
default via 10.13.104.0 dev wlan1-mesh_17 proto babel onlink 
10.13.0.0/16 dev br-lan proto kernel scope link src 10.13.123.8 
10.13.0.0/16 dev anygw proto kernel scope link src 10.13.0.1

Ping to the internet is sent through wlan1-mesh_17 but the reply does not arrive as the third router sends it through br-lan, which does not include the mesh interface.
Ping to the third router (10.13.104.0) is sent through br-lan and does not even get to the third router as the mesh interface is not included in br-lan.
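
To make the lookup behaviour above easier to follow, here is a minimal Python sketch of a longest-prefix match over the fourth router's table. It is only an illustration: the lookup function is a hypothetical helper, 8.8.8.8 stands in for "the internet", the duplicate anygw entry is left out, and the real kernel lookup also takes metrics and policy rules into account.

```python
import ipaddress

# Fourth router's table as listed above (the duplicate anygw /16 entry is omitted).
ROUTES = [
    ("0.0.0.0/0", "wlan1-mesh_17", "10.13.104.0"),  # default route installed by Babeld
    ("10.13.0.0/16", "br-lan", None),               # connected route, no gateway
]

def lookup(dst, routes=ROUTES):
    """Return (device, gateway) of the most specific route containing dst."""
    addr = ipaddress.ip_address(dst)
    candidates = [
        (ipaddress.ip_network(prefix), dev, gw)
        for prefix, dev, gw in routes
        if addr in ipaddress.ip_network(prefix)
    ]
    # The longest prefix wins, as in a normal routing table lookup.
    net, dev, gw = max(candidates, key=lambda c: c[0].prefixlen)
    return dev, gw

print(lookup("10.13.104.0"))  # third router
print(lookup("8.8.8.8"))      # assumed internet host
```

Under these assumptions the third router's address matches the /16 on br-lan rather than the mesh interface, while the internet host falls through to the Babeld default route on wlan1-mesh_17.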

When Batman-adv is present, the connection magically works even though the routes are identical to the ones shown here, but I think it works in a stupid way:
the reply of the ping from the fourth router to the internet finds its way back as bat0 is included in br-lan; the ping to the third router works even if it's sent through br-lan as it includes bat0 which includes wlan1-mesh_29.

Using tcpdump I checked that this happens: pinging the internet, the outwards ping goes through wlan1-mesh_17 (Babeld interface) and the reply arrives from wlan1-mesh_29 (included in bat0); pinging the third router wlan1-mesh_17 does not have any traffic and everything flows through wlan1-mesh_29.

In my opinion this is a bug (it limits the MTU and maybe the performance) caused by the fact that Babeld interfaces have only /32 IPv4 addresses, which does not allow packets to be routed through them. I don't know the solution, but I think that Babeld should be in charge of announcing more meaningful routes.

The option babeld_over_batman proposed in #631 could be used for enabling or disabling this kind of things, but by default Babeld routed packets should not pass through bat0.

@ilario
Member Author

ilario commented Dec 12, 2019

A partial fix is to allow the redistribution of the local routes, which we are denying here:

-- Avoid redistributing extra local addesses
uci:set("babeld", "localdeny", "filter")
uci:set("babeld", "localdeny", "type", "redistribute")
uci:set("babeld", "localdeny", "local", "true")
uci:set("babeld", "localdeny", "action", "deny")

With this, the fourth router can ping the internet and the other routers. Still, a client connected to the fourth router can ping neither the internet nor the routers.

@G10h4ck G10h4ck added invalid and removed bug labels Dec 12, 2019
@G10h4ck
Member

G10h4ck commented Dec 12, 2019

This is not a bug: all the routers are configured on the 10.13.0.0/16 subnet, and this simply can't work in an L3-routing-only setup. If you disable L2 routing, each router needs to have its own non-overlapping subnet, otherwise they simply can't route. When you use cabled links it works because the hardware switches are doing the L2 routing for you under the hood.
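
As a rough illustration of the non-overlapping requirement, the following Python sketch checks a set of per-router subnets for overlaps. The router names, the overlapping_pairs helper and the per-router /24 split are assumptions made up for the example, not something LibreMesh configures.

```python
import ipaddress
from itertools import combinations

def overlapping_pairs(subnets):
    """Return the pairs of router names whose subnets overlap."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in subnets.items()}
    return [
        (a, b)
        for a, b in combinations(sorted(nets), 2)
        if nets[a].overlaps(nets[b])
    ]

# Setup from this issue: every router sits on the same /16.
current = {"router3": "10.13.0.0/16", "router4": "10.13.0.0/16"}

# Hypothetical per-router split, not taken from this issue.
split = {"router3": "10.13.104.0/24", "router4": "10.13.123.0/24"}

print(overlapping_pairs(current))  # [('router3', 'router4')]
print(overlapping_pairs(split))    # []
```

With the shared /16 every pair of routers overlaps, so each one treats the others' addresses as on-link; with distinct subnets the list is empty and plain L3 routing has something to work with.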

@G10h4ck G10h4ck closed this as completed Dec 12, 2019
@ilario
Member Author

ilario commented Dec 12, 2019

Don't stop at the title.
I found this issue testing a network without Batman-adv, but the issue affects the current Babeld + Batman-adv ecosystem.

Currently we have a schizophrenic situation in which Babeld decides the next hop and Batman-adv decides how to get there. Sometimes. Other times Babeld also gets to send the packets through the route it chose.

For example, when pinging the internet from the fourth router, the next hop for the outgoing packet is decided by Babeld, but the path of the reply is decided by Batman-adv. The outgoing and the incoming packets go through different interfaces, possibly even with different MTUs.

@G10h4ck
Member

G10h4ck commented Dec 13, 2019

That is normal: the kernel always chooses the path that seems most convenient, but that is not necessarily the best path from an "omniscient" point of view. The kernel has to take the routing decision with the limited information it has access to, and with as few CPU cycles as possible. So when you ping a host outside your subnet, the packets that come out of your device carry the destination IP and the anygw MAC as L2 destination, so they bump up to L3 on the first anygw node they reach, and from that point L3 routing is in charge. When a packet comes back from the internet, the first router which has the same subnet as you will think "the destination is on my same broadcast domain so I have to push it via L2", and from that point L2 will be in charge of routing that packet, even if that may be sub-optimal in some cases. This is how the internet works. You can try to avoid that by using LibreMesh in L3-only mode, but you will need to give each router a unique non-overlapping subnet, and you will lose roaming capability, as that "stupid routing", as you call it above, is what makes roaming possible. Perfect routing doesn't exist; you need to choose the trade-offs to live with.

@ilario
Member Author

ilario commented Dec 15, 2019

I suggest we stop releasing LibreMesh with both Batman-adv and Babeld.

Instead, I would release LibreMesh-node with just Batman-adv (and batman-adv-auto-gw-mode package) to be used for small-medium networks and LibreMesh-border with just Babeld for the border nodes between the LibreMesh-node networks.

@ilario
Member Author

ilario commented Dec 16, 2019

The discussion about whether to release an L2+L3 system or something else is in #468.
