ipq40xx: issue with WiFi Mesh since v2022.1.x #2692

Closed
bschelm opened this issue Nov 3, 2022 · 15 comments

Comments

@bschelm

bschelm commented Nov 3, 2022

Bug report

Fritz Box 4040 loses mesh connection to Unifi AC Mesh

  • Since the upgrade to version vH29, meshing between the two devices is not working properly
  • The mesh connection only lasts for a few hours (after start) and then breaks down (RX or TX = 0%)
    • UPDATE: The TQ is good on both ends. See below. (The "0% TQ" shown at first was only due to missing data on the Freifunk map.)
  • The problem started with the update to vH29
  • I noticed a broken connection between the two devices. Recovery is possible with the command "wifi" on the FB 4040
  • The connection only breaks when the FB 4040 runs vH29 or newer; the software version on the AC Mesh is irrelevant. All devices were installed from scratch (configuration not kept).

What is the expected behaviour?

  • Connections should be stable.
  • It works with vH26 and lower.

vH26:

https://github.com/freifunkh/site/releases/tag/vH26

FURTHER INFORMATION FOLLOWS

Gluon Version: gluon-v2022.1.1

Site Configuration:

https://github.com/freifunkh/site/releases/tag/vH30

Custom patches:

https://github.com/freifunkh/site/tree/vH30/patches

@AiyionPrime AiyionPrime changed the title from "Issue with FritzBox 4040 ans Wifi Mesh" to "Issue with FritzBox 4040 and Wifi Mesh" Nov 5, 2022
@AiyionPrime
Member

@lemoer
Member

lemoer commented Jan 12, 2023

We traced this issue further. It seems to boil down to the fact that OpenWrt introduced ath10k-ct for the ipq40xx target. We tested whether the issue also appears with ath10k; so far (after a few days of testing) it does not seem to.

@lemoer
Member

lemoer commented Jan 12, 2023

Ok, it seems that we did this in gluon: 15ef885

@lemoer
Member

lemoer commented Jan 12, 2023

I'll explain in a bit more detail what we found out. Our test setup consists of two routers, NDS-The-TARDIS and NDS-The-Sonic-Screwdriver.

The issues occur every few hours or so.

When the issue occurs, the TQ seems to be excellent for the 5 GHz interfaces from both sides:

Originator Table on NDS-The-TARDIS:

root@NDS-The-TARDIS:~# batctl o | grep "^ . be:eb:ad:e1:5f:7b" # MAC of NDS-The-Sonic-Screwdriver on primary0
   be:eb:ad:e1:5f:7b    0.230s   (112) be:eb:ad:e1:5f:7d [     mesh0] -> 2.4 GHz
 * be:eb:ad:e1:5f:7b    0.230s   (239) be:eb:ad:e1:5f:79 [     mesh1] -> 5 GHz
root@NDS-The-TARDIS:~# batctl o | grep "^ . be:eb:ad:e1:5f:7d" # MAC of NDS-The-Sonic-Screwdriver on mesh1
 * be:eb:ad:e1:5f:7d    3.940s   (122) be:eb:ad:e1:5f:7d [     mesh0] -> 2.4 GHz
root@NDS-The-TARDIS:~# batctl o | grep "^ . be:eb:ad:e1:5f:79" # MAC of NDS-The-Sonic-Screwdriver on mesh0
 * be:eb:ad:e1:5f:79    3.340s   (243) be:eb:ad:e1:5f:79 [     mesh1] -> 5 GHz

Originator Table on NDS-The-Sonic-Screwdriver:

root@NDS-The-Sonic-Screwdriver:~# batctl o | grep '^ . d6:0b:01:61:70:5b' # MAC of NDS-The-TARDIS on primary0
   d6:0b:01:61:70:5b    3.300s   (139) c6:c7:76:10:25:81 [     mesh0] -> 5 GHz 
   d6:0b:01:61:70:5b    3.300s   (165) d6:0b:01:61:70:59 [     mesh1] -> 2.4 GHz
 * d6:0b:01:61:70:5b    3.300s   (255) d6:0b:01:61:70:5d [     mesh0]  -> 5 GHz
root@NDS-The-Sonic-Screwdriver:~# batctl o | grep '^ . d6:0b:01:61:70:59' # MAC of NDS-The-TARDIS on mesh0
 * d6:0b:01:61:70:59    0.880s   (180) d6:0b:01:61:70:59 [     mesh1] -> 2.4 GHz
root@NDS-The-Sonic-Screwdriver:~# batctl o | grep '^ . d6:0b:01:61:70:5d' # MAC of NDS-The-TARDIS on mesh1
 * d6:0b:01:61:70:5d    4.390s   (255) d6:0b:01:61:70:5d [     mesh0] -> 5 GHz
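
The numbers in parentheses in the tables above are the TQ values, which batman-adv (B.A.T.M.A.N. IV) reports on a scale from 0 to 255. To compare many such tables over time, the lines could be parsed offline on a workstation. The following is only a rough sketch, assuming the line layout shown in the transcripts above; the names `Route`, `parse_originator_line` and `tq_percent` are made up for illustration and are not part of batctl.

```python
import re
from typing import NamedTuple, Optional

MAC = r"[0-9a-f]{2}(?::[0-9a-f]{2}){5}"

# Matches one originator line as quoted in the transcripts above, e.g.
#  " * be:eb:ad:e1:5f:7b    0.230s   (239) be:eb:ad:e1:5f:79 [     mesh1]"
# Anything after the closing bracket (such as "-> 5 GHz") is ignored.
LINE_RE = re.compile(
    r"^\s*(?P<best>\*)?\s*"
    rf"(?P<orig>{MAC})\s+"
    r"(?P<last_seen>\d+\.\d+)s\s+"
    r"\(\s*(?P<tq>\d+)\)\s+"
    rf"(?P<nexthop>{MAC})\s+"
    r"\[\s*(?P<iface>[^\]\s]+)\]"
)


class Route(NamedTuple):
    best: bool          # True if the line is marked with "*"
    originator: str
    last_seen_s: float
    tq: int             # raw TQ, 0..255
    nexthop: str
    iface: str


def parse_originator_line(line: str) -> Optional[Route]:
    """Parse one originator line; return None for prompts, headers, etc."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    return Route(
        best=m["best"] is not None,
        originator=m["orig"],
        last_seen_s=float(m["last_seen"]),
        tq=int(m["tq"]),
        nexthop=m["nexthop"],
        iface=m["iface"],
    )


def tq_percent(tq: int) -> float:
    """Convert a raw TQ value (0..255) into a percentage."""
    return round(tq * 100 / 255, 1)


if __name__ == "__main__":
    sample = " * be:eb:ad:e1:5f:7b    0.230s   (239) be:eb:ad:e1:5f:79 [     mesh1] -> 5 GHz"
    route = parse_originator_line(sample)
    if route is not None:
        print(route.iface, route.tq, tq_percent(route.tq))
```

Parsing the annotated text instead of querying the router keeps the sketch independent of the batctl version, but it will silently skip lines whose layout differs from the one above.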

However, a ping from NDS-The-TARDIS to NDS-The-Sonic-Screwdriver via the batman interface doesn't work:

root@NDS-The-TARDIS:~# ping -c 5 fe80::76ac:b9ff:fe60:4324%br-client
PING fe80::76ac:b9ff:fe60:4324%br-client (fe80::76ac:b9ff:fe60:4324%11): 56 data bytes

--- fe80::76ac:b9ff:fe60:4324%br-client ping statistics ---
5 packets transmitted, 0 packets received, 100% packet loss

This seems to be odd behaviour. Similar behaviour can be observed if we ping over the 5 GHz interface directly:

root@NDS-The-TARDIS:~# ping fe80::bceb:adff:fee1:5f79%mesh1 # This is 5 GHz
PING fe80::bceb:adff:fee1:5f79%mesh1 (fe80::bceb:adff:fee1:5f79%16): 56 data bytes
64 bytes from fe80::bceb:adff:fee1:5f79: seq=2 ttl=64 time=21.381 ms
64 bytes from fe80::bceb:adff:fee1:5f79: seq=4 ttl=64 time=17.423 ms
64 bytes from fe80::bceb:adff:fee1:5f79: seq=7 ttl=64 time=15.341 ms
64 bytes from fe80::bceb:adff:fee1:5f79: seq=14 ttl=64 time=21.839 ms
^C
--- fe80::bceb:adff:fee1:5f79%mesh1 ping statistics ---
16 packets transmitted, 4 packets received, 75% packet loss
round-trip min/avg/max = 15.341/18.996/21.839 ms

-> Packet loss of about 75%.
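
The 75% follows directly from the summary line: (16 - 4) / 16 = 0.75. When collecting many such runs, the loss could be recomputed from saved ping output. This is a small sketch assuming the summary format shown in the transcripts above; `packet_loss_percent` is a name chosen here for illustration.

```python
import re

# Matches the summary line printed at the end of the ping runs quoted above,
# e.g. "16 packets transmitted, 4 packets received, 75% packet loss".
SUMMARY_RE = re.compile(r"(\d+) packets transmitted, (\d+) (?:packets )?received")


def packet_loss_percent(ping_output: str) -> float:
    """Recompute the packet loss in percent from a ping summary line."""
    m = SUMMARY_RE.search(ping_output)
    if m is None:
        raise ValueError("no ping summary line found")
    sent, received = int(m.group(1)), int(m.group(2))
    if sent == 0:
        raise ValueError("no packets were transmitted")
    return 100.0 * (sent - received) / sent


if __name__ == "__main__":
    print(packet_loss_percent("16 packets transmitted, 4 packets received, 75% packet loss"))
```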

However, if we use a multicast ping, we get answers from fe80::bceb:adff:fee1:5f79 (NDS-The-Sonic-Screwdriver):

root@NDS-The-TARDIS:~# ping ff02::1%mesh1
PING ff02::1%mesh1 (ff02::1%16): 56 data bytes
64 bytes from fe80::d40b:1ff:fe61:705d: seq=0 ttl=64 time=0.745 ms
64 bytes from fe80::bceb:adff:fee1:5f79: seq=0 ttl=64 time=14.570 ms (DUP!)
64 bytes from fe80::d40b:1ff:fe61:705d: seq=1 ttl=64 time=1.543 ms
64 bytes from fe80::bceb:adff:fee1:5f79: seq=1 ttl=64 time=4.745 ms (DUP!)
64 bytes from fe80::d40b:1ff:fe61:705d: seq=2 ttl=64 time=1.971 ms
64 bytes from fe80::bceb:adff:fee1:5f79: seq=2 ttl=64 time=11.180 ms (DUP!)
64 bytes from fe80::d40b:1ff:fe61:705d: seq=3 ttl=64 time=0.992 ms
64 bytes from fe80::bceb:adff:fee1:5f79: seq=3 ttl=64 time=28.687 ms (DUP!)
64 bytes from fe80::d40b:1ff:fe61:705d: seq=4 ttl=64 time=1.024 ms
64 bytes from fe80::bceb:adff:fee1:5f79: seq=4 ttl=64 time=2.542 ms (DUP!)
64 bytes from fe80::d40b:1ff:fe61:705d: seq=5 ttl=64 time=1.059 ms
64 bytes from fe80::bceb:adff:fee1:5f79: seq=5 ttl=64 time=2.338 ms (DUP!)
64 bytes from fe80::d40b:1ff:fe61:705d: seq=6 ttl=64 time=0.982 ms
64 bytes from fe80::bceb:adff:fee1:5f79: seq=6 ttl=64 time=11.422 ms (DUP!)
64 bytes from fe80::d40b:1ff:fe61:705d: seq=7 ttl=64 time=1.003 ms
64 bytes from fe80::bceb:adff:fee1:5f79: seq=7 ttl=64 time=3.020 ms (DUP!)
64 bytes from fe80::d40b:1ff:fe61:705d: seq=8 ttl=64 time=0.982 ms
64 bytes from fe80::bceb:adff:fee1:5f79: seq=8 ttl=64 time=17.070 ms (DUP!)
^C
--- ff02::1%mesh1 ping statistics ---
9 packets transmitted, 9 packets received, 9 duplicates, 0% packet loss
round-trip min/avg/max = 0.745/5.881/28.687 ms

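This means that broadcast/multicast packets are delivered correctly, while unicast packets are somehow lost!

This also explains why the TQ is almost 100% from both sides while actual packets on br-client are lost: batman-adv derives the TQ from its originator messages (OGMs), which are sent as broadcasts.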
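
The symptom described above (good TQ, multicast replies arriving, unicast mostly lost) could be turned into a simple check, for example to log when the stall happens. This is only a sketch: the function name and both thresholds are assumptions picked for illustration, not values from Gluon or batman-adv.

```python
def looks_like_unicast_stall(
    tq: int,
    unicast_loss_pct: float,
    multicast_replies: int,
    min_tq: int = 200,            # assumed threshold for a "good" TQ (scale 0..255)
    max_loss_pct: float = 50.0,   # assumed threshold for "unicast is broken"
) -> bool:
    """Return True if the link looks healthy to batman-adv but unicast fails.

    tq                -- raw TQ of the neighbour as shown by `batctl o`
    unicast_loss_pct  -- packet loss of a unicast ping to the neighbour
    multicast_replies -- number of replies the neighbour sent to a multicast ping
    """
    link_looks_healthy = tq >= min_tq and multicast_replies > 0
    unicast_broken = unicast_loss_pct >= max_loss_pct
    return link_looks_healthy and unicast_broken


if __name__ == "__main__":
    # Values taken from the transcripts above: TQ 239, 75% unicast loss,
    # 9 multicast replies from the neighbour.
    print(looks_like_unicast_stall(239, 75.0, 9))
```

According to the original report, running "wifi" on the FB 4040 restores the connection, so a check like this could decide when that recovery is needed. Doing so would only hide the problem, not fix it.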

@AiyionPrime
Member

Note: vH31 uses the same Gluon commit as vH30; the only difference is that the latter uses fastd and the former WireGuard.

@lemoer
Member

lemoer commented Jan 12, 2023

I opened a bug report at greearb/ath10k-ct#210.

@blocktrron blocktrron changed the title from "Issue with FritzBox 4040 and Wifi Mesh" to "ipq40xx: issue with WiFi Mesh since v2022.1.x" Jan 20, 2023
@blocktrron
Member

@bschelm does the issue also appear when using the non-ct firmware? From your description, I would assume the problem is somewhere in the firmware, not the driver.

@lemoer
Member

lemoer commented Feb 1, 2023

@blocktrron Only ath10k-ct is affected (see here greearb/ath10k-ct#210 (comment)). Ath10k is ok.

I have been bisecting the firmware for 3 weeks now (greearb/ath10k-ct#210 (comment)). Unfortunately, the issue is very unpredictable, and I have to wait a day or two to see whether the problem really occurs. I also have the impression that there might be more than one error, because I observe different behaviours while bisecting.
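
Since every test run takes a day or two, the number of bisection steps matters. The sketch below shows a plain binary search over an ordered list of builds. It assumes a single transition from good to bad and that the last build is bad, which may not hold here if there really is more than one error. The build names `fw-01` to `fw-08` and the `is_bad` check are hypothetical and only stand in for "flash this firmware and wait".

```python
from typing import Callable, List, Sequence


def first_bad(builds: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the first bad build in an ordered list.

    Assumes all builds before the culprit are good, all builds from the
    culprit onwards are bad, and the last build is known to be bad.
    """
    if not builds:
        raise ValueError("need at least one build")
    lo, hi = 0, len(builds) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(builds[mid]):
            hi = mid          # culprit is builds[mid] or earlier
        else:
            lo = mid + 1      # culprit is after builds[mid]
    return builds[lo]


# Hypothetical example: 8 builds, the problem first appears in fw-06.
builds = [f"fw-{i:02d}" for i in range(1, 9)]
tested: List[str] = []


def is_bad(build: str) -> bool:
    tested.append(build)      # in reality: flash the build and wait a day or two
    return build >= "fw-06"


if __name__ == "__main__":
    print(first_bad(builds, is_bad), "found after", len(tested), "tests")
```

With 8 builds this needs 3 tests, so roughly 3 to 6 days at one or two days per test. An intermittent bug can make a bad build look good if the waiting time is too short, and that would send the search in the wrong direction.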

@blocktrron
Member

Are you referring to the firmware or the kernel-driver? From the issue description, it sounds to me like the firmware is the culprit, so mainline firmware with -ct driver should work?

@lemoer
Member

lemoer commented Feb 1, 2023

I am referring to the firmware and not to the driver. I got a bunch of firmware builds from greearb and am currently bisecting the issue.

And (as far as I can tell) the ath10k-ct driver seems to work with ath10k mainline firmware.

@rotanid rotanid added this to the 2022.1.3 milestone Feb 15, 2023
@rotanid
Member

rotanid commented Feb 20, 2023

at the Gluon meeting last week we decided to switch back the firmware to ath10k (without -ct) until there is a bugfix for the ath10k-firmware. there might be performance issues without -ct but performance is less important than stability.

@AiyionPrime
Member

And reopened, as @rotanid predicted earlier ;)

AiyionPrime added a commit to AiyionPrime/gluon that referenced this issue Feb 26, 2023
This is a temporary measure that fixes freifunk-gluon#2692.

This reverts commit 15ef885.

(cherry picked from commit 22c47df)
@lemoer lemoer removed this from the 2022.1.3 milestone Feb 26, 2023
@rotanid
Member

rotanid commented Feb 26, 2023

we removed the issue from the milestones as a workaround has been implemented.

@rotanid
Member

rotanid commented Mar 4, 2023

I could also reproduce the issue with gluon-v2022.1.2-4-gb5d9071 (commit b5d9071 in branch v2022.1.x), while a neighboring node running the fixed gluon-v2022.1.2-6-g4f8f6e7 does not show the issue.

JayBraker pushed a commit to JayBraker/gluon that referenced this issue Apr 12, 2023
This is a temporary measure that fixes freifunk-gluon#2692.

This reverts commit 15ef885.
@lemoer
Member

lemoer commented Apr 23, 2023

I have been struggling to reproduce this for a while. (See upstream issue.)

If I understood it correctly, we decided that we no longer want to use ath10k-ct in gluon and stay with ath10k mainline. (I am not sure in which meeting we discussed that, but certainly we did. @blocktrron was proposing this.)

Therefore, I will now close this issue.

@lemoer lemoer closed this as completed Apr 23, 2023