bpi-r3 branch 6.3-rc missing SFP support / transmit queue timeout #103

Open
gustavobsch opened this issue Apr 8, 2023 · 38 comments

@gustavobsch

Hi Frank,

Testing the 6.3-rc branch, I noticed the SFPs stopped working. Is there any fix for this?

Thanks

@frank-w
Owner

frank-w commented Apr 8, 2023

which SFP do you use? which issue do you have with this? does sfp appear in dmesg? what does ethtool say?

there are several patches for sfp-support (mostly quirks for 2.5G rj45 sfp)

https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/log/drivers/net/phy/sfp.c

network-stack is currently broken...but there are already fixes for it

#throughput for r3-gmac (wan sfp)
https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/drivers/net/ethernet/mediatek/mtk_eth_soc.c?id=193250ace270fecd586dd2d0dfbd9cbd2ade977f

fixes for "net: ethernet: mtk_eth_soc: implement multi-queue support for per-port queues" which i have reverted in 6.3-rc, so these should not be needed
https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/drivers/net/ethernet/mediatek/mtk_eth_soc.c?id=07b3af42d8d528374d4f42d688bae86eeb30831a

#mt7623 switch-throughput
https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/drivers/net/ethernet/mediatek/mtk_eth_soc.c?id=e669ce46740a9815953bb4452a6bc5a7fdc21a50

@gustavobsch
Author

I pulled the 6.3-rc branch, compiled and booted from it, and noticed the rj45 sfp stopped working. As I didn't have time to play with it and 6.2-rc has been working fine, I just went back to 6.2.

@frank-w
Owner

frank-w commented Apr 12, 2023

which SFP do you use exactly?

c686900 2023-03-14 net: sfp: support 2.5g copper SFP v3

should make the 2g5 sfp from aliexpress work...if you have the tplink then daniel prepared a quirk-patch for it

@gustavobsch
Author

It's the tplink from aliexpress model TL-SM410U

I will give the 6.3-rc another try

@frank-w
Owner

frank-w commented Apr 12, 2023

Ok, the tp-link needs a patch from daniel which i do not have in my tree yet...

@gustavobsch
Author

Oh ok, I got confused. Do you have plans to add Daniel's patch to your 6.3 tree?

@frank-w
Owner

frank-w commented Apr 12, 2023

have added the tp-link quirk, but daniel reports that it is not working cleanly and they are still searching for the cause

https://www.spinics.net/lists/netdev/msg892902.html

btw. there is 6.2-main...better than rc to use

but i wonder why 6.2-rc works for you....

maybe you can try to revert the sfp-patch above...possibly it breaks your sfps

@gustavobsch
Author

Thanks for the tp-link quirk patch, the SFP is working now in 6.3-rc.

I will test 6.2-main now. I also noticed the fiber SFP is not detecting link.

@gustavobsch
Author

> have added the tp-link quirk, but daniel reports that it is not working cleanly and they are still searching for the cause
>
> https://www.spinics.net/lists/netdev/msg892902.html
>
> btw. there is 6.2-main...better than rc to use
>
> but i wonder why 6.2-rc works for you....
>
> maybe you can try to revert the sfp-patch above...possibly it breaks your sfps

I couldn't find the 6.2-main branch

@frank-w
Owner

frank-w commented Apr 13, 2023

which fibre sfp? mine (H!Fibre in left cage) is working in 6.3-rc

sorry about 6.2-main, had not pushed it...done now

@gustavobsch
Author

Thanks, I'll pull 6.3-main.

I tested two. I can see them, but link is not detected:

Brocade 57-1000013-01
Cisco GLC-SX-MM=

@frank-w
Owner

frank-w commented Apr 13, 2023

6.2-main...6.3 is not final yet (sorry, typo from me...rc branches exist until the kernel is released, i.e. while only release candidates are available)

which slot do you use?
what do ethtool and dmesg tell about them?

ethtool -m eth1/lan4

dmesg |grep -i 'eth\|sfp'

what happens when you do the autoneg workaround?

ethtool -s eth1 autoneg off
ethtool -s lan4 autoneg off
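
The checks requested above can be gathered in one go. The following is only a sketch: the function name `sfp_report` is made up for illustration, and `eth1`/`lan4` are simply the interface names used in this thread for the two SFP cages. It only reads state; the `autoneg off` workaround is left as a manual step.

```shell
#!/bin/sh
# Sketch only: collect the SFP diagnostics asked for above.
# sfp_report is a hypothetical helper name, not part of any tool.
sfp_report() {
    for ifc in "$@"; do
        echo "=== $ifc ==="
        if command -v ethtool >/dev/null 2>&1; then
            # link state, speed and autoneg setting
            ethtool "$ifc" 2>&1 || echo "ethtool failed for $ifc"
            # module EEPROM (vendor, part number, supported modes)
            ethtool -m "$ifc" 2>&1 || echo "ethtool -m failed for $ifc"
        else
            echo "ethtool not installed"
        fi
    done
    echo "=== dmesg ==="
    # same filter as in the comment above; dmesg may need root
    dmesg 2>/dev/null | grep -i 'eth\|sfp' || echo "no matching kernel messages"
}

# Usage on the board: sfp_report eth1 lan4
```

Pasting the full output of `sfp_report eth1 lan4` into a reply should answer the slot, ethtool and dmesg questions at once.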

@gustavobsch
Author

Ok, my bad. Last time I tried to make the SFP work I played too much with it and left auto-negotiation disabled on the other side. After enabling it I can detect link and see traffic in tcpdump. Looks fine so far on 6.3-rc. vlan is not working; if I remember correctly, the left side SFP does not support vlan.

@frank-w
Owner

frank-w commented Apr 13, 2023

Yes, there was a patch series from felix fietkau which cannot be applied...i thought the patch from vladimir fixed it but maybe it was another issue...felix' patch disabled hw offloading on gmac1

@frank-w
Owner

frank-w commented Apr 14, 2023

This should be the patch fixing the vlan-issue on gmac

1382688

@frank-w
Owner

frank-w commented Apr 16, 2023

i have pushed the vlan-fix to the 6.3-pwm branch; you can cherry-pick it to any other branch you want...it works so far on my bananapi-r3 and i sent it as RFC/RFT to the mailing list

https://patchwork.kernel.org/project/linux-mediatek/patch/20230416091038.54479-1-linux@fw-web.de/

@gustavobsch
Author

gustavobsch commented Apr 16, 2023

I just built 6.3-pwm, booted from it, and I can see the vlan tags in the packets. Amazing, thanks!
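
One way to confirm that tags arrive is to capture with `tcpdump -e`, which prints the link-level header including the 802.1Q tag. A rough sketch, assuming the tagged traffic is on `eth1` and that tcpdump prints the tag as `vlan <id>,` in its text output; `vlan_ids` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: summarise VLAN IDs seen in `tcpdump -e` text output read from stdin.
# Assumes tagged frames are printed as "... vlan <id>, p <prio>, ...".
vlan_ids() {
    sed -n 's/.* vlan \([0-9][0-9]*\),.*/\1/p' |
        sort -n | uniq -c |
        awk '{ print "vlan " $2 ": " $1 " frames" }'
}

# Usage on the board (capture 100 tagged frames on eth1):
#   tcpdump -e -n -i eth1 -c 100 vlan | vlan_ids
```

If the capture with the `vlan` filter stays empty while untagged traffic is visible, the tags are probably being stripped before they reach the capture point.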

@frank-w
Owner

frank-w commented Apr 16, 2023

I hope there will be no side effects...i guess the patch disables rx vlan offload completely, but that is better than broken vlan support

@gustavobsch
Author

gustavobsch commented Apr 16, 2023

I've run into this kernel panic twice now. I need to reboot to clear it. You think it's related to the a

[20795.193278] ------------[ cut here ]------------
[20795.197920] NETDEV WATCHDOG: eth0 (mtk_soc_eth): transmit queue 3 timed out
[20795.204916] WARNING: CPU: 1 PID: 0 at net/sched/sch_generic.c:525 dev_watchdog+0x26c/0x274
[20795.213182] Modules linked in: pps_ldisc tun nft_masq nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nf_tables libcrc32c nfnetlink pps_gpio mt7915e mt76_connac_lib mt76 mac80211 libarc4 cfg80211 fuse ip_tables x_tables
[20795.234045] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 6.3.0-rc6-bpi-r3-pwm #3
[20795.241167] Hardware name: Bananapi BPI-R3 (DT)
[20795.245686] pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[20795.252633] pc : dev_watchdog+0x26c/0x274
[20795.256638] lr : dev_watchdog+0x26c/0x274
[20795.260637] sp : ffffffc00800be00
[20795.263938] x29: ffffffc00800be00 x28: ffffffc009426000 x27: ffffffc00800bee8
[20795.271061] x26: ffffffc009106008 x25: 0000000000000000 x24: ffffffc0094288c8
[20795.278183] x23: 0000000000000100 x22: ffffffc009426000 x21: 0000000000000003
[20795.285306] x20: ffffff8001e31000 x19: ffffff8001e31488 x18: ffffffffffffffff
[20795.292427] x17: ffffffc076a73000 x16: ffffffc008008000 x15: 076e076107720774
[20795.299550] x14: 0720073a07290768 x13: 0000000000000255 x12: 00000000ffffffea
[20795.306673] x11: 00000000ffffefff x10: 00000000ffffefff x9 : ffffffc00949a438
[20795.313796] x8 : 0000000000017fe8 x7 : c0000000ffffefff x6 : 0000000000000001
[20795.320921] x5 : 0000000000000002 x4 : 0000000000000000 x3 : 0000000000000002
[20795.328046] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffffff8000134380
[20795.335168] Call trace:
[20795.337604]  dev_watchdog+0x26c/0x274
[20795.341262]  call_timer_fn+0x34/0x150
[20795.344917]  __run_timers.part.0+0x2f4/0x358
[20795.349177]  run_timer_softirq+0x3c/0x74
[20795.353089]  _stext+0x118/0x344
[20795.356220]  ____do_softirq+0x10/0x1c
[20795.359872]  call_on_irq_stack+0x24/0x4c
[20795.363782]  do_softirq_own_stack+0x1c/0x2c
[20795.367953]  __irq_exit_rcu+0xe0/0xfc
[20795.371606]  irq_exit_rcu+0x10/0x1c
[20795.375083]  el1_interrupt+0x38/0x54
[20795.378648]  el1h_64_irq_handler+0x18/0x24
[20795.382733]  el1h_64_irq+0x68/0x6c
[20795.386123]  default_idle_call+0x58/0x108
[20795.390121]  do_idle+0xa8/0x100
[20795.393253]  cpu_startup_entry+0x24/0x2c
[20795.397166]  secondary_start_kernel+0x128/0x14c
[20795.401686]  __secondary_switched+0xb8/0xbc
[20795.405859] ---[ end trace 0000000000000000 ]---
[ 7674.040708] ------------[ cut here ]------------
[ 7674.045340] NETDEV WATCHDOG: eth0 (mtk_soc_eth): transmit queue 8 timed out
[ 7674.052351] WARNING: CPU: 3 PID: 0 at net/sched/sch_generic.c:525 dev_watchdog+0x26c/0x274
[ 7674.060614] Modules linked in: pps_ldisc tun nft_masq nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nf_tables libcrc32c nfnetlink pps_gpio mt7915e mt76_connac_lib mt76 mac80211 libarc4 cfg80211 fuse ip_tables x_tables
[ 7674.081476] CPU: 3 PID: 0 Comm: swapper/3 Not tainted 6.3.0-rc6-bpi-r3-pwm #3
[ 7674.088595] Hardware name: Bananapi BPI-R3 (DT)
[ 7674.093112] pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 7674.100058] pc : dev_watchdog+0x26c/0x274
[ 7674.104061] lr : dev_watchdog+0x26c/0x274
[ 7674.108060] sp : ffffffc0096f3e00
[ 7674.111360] x29: ffffffc0096f3e00 x28: ffffffc009426000 x27: ffffffc0096f3ee8
[ 7674.118483] x26: ffffffc009106008 x25: 0000000000000000 x24: ffffffc0094288c8
[ 7674.125605] x23: 0000000000000100 x22: ffffffc009426000 x21: 0000000000000008
[ 7674.132726] x20: ffffff8000324000 x19: ffffff8000324488 x18: ffffffffffffffff
[ 7674.139847] x17: ffffffc076aab000 x16: ffffffc0096f0000 x15: ffffffc0896f3a47
[ 7674.146968] x14: 0000000000000000 x13: 00000000000001e8 x12: 00000000ffffffea
[ 7674.154089] x11: 00000000ffffefff x10: 00000000ffffefff x9 : ffffffc00949a438
[ 7674.161209] x8 : 0000000000017fe8 x7 : c0000000ffffefff x6 : 0000000000000001
[ 7674.168330] x5 : 0000000000000000 x4 : ffffff807fbb2ac8 x3 : ffffff807fbbe7b0
[ 7674.175450] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffffff8000135e80
[ 7674.182572] Call trace:
[ 7674.185007]  dev_watchdog+0x26c/0x274
[ 7674.188660]  call_timer_fn+0x34/0x150
[ 7674.192315]  __run_timers.part.0+0x2f4/0x358
[ 7674.196574]  run_timer_softirq+0x3c/0x74
[ 7674.200485]  _stext+0x118/0x344
[ 7674.203616]  ____do_softirq+0x10/0x1c
[ 7674.207266]  call_on_irq_stack+0x24/0x4c
[ 7674.211178]  do_softirq_own_stack+0x1c/0x2c
[ 7674.215348]  __irq_exit_rcu+0xe0/0xfc
[ 7674.219001]  irq_exit_rcu+0x10/0x1c
[ 7674.222478]  el1_interrupt+0x38/0x54
[ 7674.226043]  el1h_64_irq_handler+0x18/0x24
[ 7674.230127]  el1h_64_irq+0x68/0x6c
[ 7674.233517]  default_idle_call+0x58/0x108
[ 7674.237514]  do_idle+0xa8/0x100
[ 7674.240647]  cpu_startup_entry+0x24/0x2c
[ 7674.244558]  secondary_start_kernel+0x128/0x14c
[ 7674.249078]  __secondary_switched+0xb8/0xbc
[ 7674.253252] ---[ end trace 0000000000000000 ]---
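
Both traces above are `NETDEV WATCHDOG` warnings for different queues (3 and 8) of `eth0`. To see which queues time out over a longer run, the log can be summarised per queue. A minimal sketch, assuming the message format shown above (other kernel versions may word the line differently); `watchdog_summary` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: count "transmit queue N timed out" warnings per device and queue.
# Reads kernel log text on stdin and prints "<dev> queue <n>: <count>".
# Assumes the format "NETDEV WATCHDOG: <dev> (<driver>): transmit queue <n> timed out".
watchdog_summary() {
    sed -n 's/.*NETDEV WATCHDOG: \([^ ]*\) .*transmit queue \([0-9]*\) timed out.*/\1 queue \2/p' |
        sort | uniq -c |
        awk '{ print $2, $3, $4 ": " $1 }'
}

# Usage on the board: dmesg | watchdog_summary
```

If the counts spread over many queues rather than one, that would fit the multi-queue suspicion discussed in this thread, though it would not prove it.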

@frank-w
Owner

frank-w commented Apr 17, 2023

I think this is related to felix' patch adding the transmit queues...but i have no idea how to fix it...there is an issue on the openwrt github repo

@frank-w
Owner

frank-w commented Apr 17, 2023

Do you know a way to trigger this bug?

This is the issue on openwrt repo where i sent an update: openwrt/openwrt#12143 (comment)

If we know how to trigger it we have a chance to debug it to root cause

@frank-w frank-w reopened this Apr 17, 2023
@frank-w
Owner

frank-w commented Apr 17, 2023

Leaving the issue open for tracing the transmit error

@frank-w frank-w changed the title from "bpi-r3 branch 6.3-rc missing SFP support" to "bpi-r3 branch 6.3-rc missing SFP support / transmit queue timeout" Apr 17, 2023
@gustavobsch
Author

I don't, but it hasn't crashed today. Yesterday, when it crashed, I was copying big files through SFP1/SFP2.

I'll run an iperf3 test and see if it crashes again

@frank-w
Owner

frank-w commented Apr 17, 2023

I have not seen it with iperf3 yet...that's my standard test in both directions, but i guess the problem occurs when there are multiple streams which are spread over different tx queues over time
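
The multi-stream guess can be tried with iperf3's `-P` option, which opens several parallel streams in one run. This is a sketch only: `tx_stress` and the `IPERF_DRYRUN` variable are made up for illustration, the server address is a placeholder, and nothing in this thread confirms that this load triggers the timeout.

```shell
#!/bin/sh
# Sketch only: run iperf3 with several parallel streams in both directions.
# Usage: tx_stress <server> [streams] [seconds]
# Set IPERF_DRYRUN=1 to print the commands instead of running them.
tx_stress() {
    server=$1
    streams=${2:-8}
    secs=${3:-60}
    run=${IPERF_DRYRUN:+echo}
    # client sends: traffic leaves through the local tx queues
    $run iperf3 -c "$server" -P "$streams" -t "$secs"
    # -R reverses the direction: the server sends, the client receives
    $run iperf3 -c "$server" -P "$streams" -t "$secs" -R
}

# Usage (an iperf3 server must be running on the peer: iperf3 -s):
#   tx_stress 192.0.2.10 8 300
```

Watching `dmesg -w` in a second terminal during the run shows whether a `NETDEV WATCHDOG` warning appears.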

@frank-w
Owner

frank-w commented Apr 19, 2023

Have you seen this with 6.2 too? If not, you can try the vlan-patch there and check whether it is the cause, but imho it cannot be, as it is not in the tx path

@gustavobsch
Author

gustavobsch commented Apr 19, 2023

I haven't tested with 6.2. Can I just check out the latest 6.2-main and build it?

I ran into the issue again yesterday and I think it has something to do with the right SFP, where I have the copper tplink SFP.

I ran into it three times while traffic was flowing as shown below.

              ------------------>
       eth1                           lan4
------------------------------------------
|   fiber/vlan   |   copper/untagged     |
------------------------------------------

@frank-w
Owner

frank-w commented Apr 19, 2023

Is this reproducible for you (e.g. can you trigger it this way)?

Is the other end also 2.5g or do you run at lower speed there?

Could you try with another sfp (maybe only 1g) in the lan-sfp cage or towards one of the rj45 ports?

Basically you can use every branch to build :) the vlan fix is only applied to the last one (6.3-pwm), but for testing the sfp itself other branches can work too, though they may need patches for supporting the tplink-sfp

@gustavobsch
Author

I can trigger it, but I'm unsure exactly how; it just happens. I noticed it doesn't happen when I stop using the right sfp2/lan4.

Both ends on both SFPs are 2.5g.

I tried a copper 1g SFP in the right sfp2/lan4 cage and it won't take it. It seems like it only allows 2.5g modules:


root@bpi-r3:~# dmesg
[ 2265.225821] sfp sfp-2: module AVAGO            ABCU-5700RZ      rev      sn AN1001A2CX       dc 100107  
[ 2286.381969] mt7530 mdio-bus:1f lan4: configuring for inband/2500base-x link mode
[ 2286.454562] mt7530 mdio-bus:1f lan4: validation with support 00,00000000,00000000,00000000 failed: -EINVAL
[ 2286.464391] sfp sfp-2: sfp_add_phy failed: -EINVAL
[ 2315.659816] mt7530 mdio-bus:1f lan4: configuring for inband/2500base-x link mode
[ 2315.798868] mt7530 mdio-bus:1f lan4: validation with support 00,00000000,00000000,00000000 failed: -EINVAL
[ 2315.808653] sfp sfp-2: sfp_add_phy failed: -EINVAL
root@bpi-r3:~# ethtool lan4
Settings for lan4:
	Supported ports: [ MII ]
	Supported link modes:   2500baseX/Full
	Supported pause frame use: Symmetric Receive-only
	Supports auto-negotiation: No
	Supported FEC modes: Not reported
	Advertised link modes:  2500baseX/Full
	Advertised pause frame use: Symmetric Receive-only
	Advertised auto-negotiation: No
	Advertised FEC modes: Not reported
	Speed: Unknown!
	Duplex: Unknown! (255)
	Auto-negotiation: off
	Port: MII
	PHYAD: 0
	Transceiver: internal
	Supports Wake-on: d
	Wake-on: d
	Link detected: no

@frank-w
Owner

frank-w commented Apr 20, 2023

My fiber sfp works in both cages and it is 1g only...maybe your 1g module needs a quirk too...daniel reported some sfps having checksum errors or invalid data (e.g. supported modes) in eeprom

@frank-w
Owner

frank-w commented Apr 22, 2023

Can you trigger the issue when only working over lan4/sfp2 (writing/reading to/from r3)?

@gustavobsch
Author

> Can you trigger the issue when only working over lan4/sfp2 (writing/reading to/from r3)?

I took a while to reply because I was testing. The good news is that I haven't run into this issue again. I have both SFPs populated with the same modules and the same traffic going through them.

The only thing that changed is back then I was doing a lot of NTP/GPS/PPS testing, maybe the kernel panic had something to do with it? I can't think of anything else.

@gustavobsch
Author

I loaded 6.3.0-bpi-r3-main 24 hours ago and there have been no crashes. When I initially ran into this issue it could crash after a couple of hours.

@frank-w
Owner

frank-w commented May 3, 2023

Do you have the same connections as before (sfps, rj45 ports, vlan)? I have not seen any fix for these problems

@gustavobsch
Author

Something definitely changed, but I can't tell what.
The SFPs and the cages where they are inserted haven't changed.
It hasn't crashed since I reported the initial crash, so I'm happy.

@gustavobsch
Author

Hi Frank,

Can you add the tp-link quirk patch to 6.6.25-bpi-r3-main?

Thanks
Gustavo

@frank-w
Owner

frank-w commented Apr 24, 2024

Can you make a pull request or point to the patch?

@gustavobsch
Author

gustavobsch commented Apr 24, 2024

I'm actually not sure that patch has the fix. The issue this time is the SFP not detecting link.


[246390.041031] sfp sfp-2: module TP-LINK          TL-SM410U        rev 1.0  sn 121B0Y9000456    dc 211108  
[246390.050640] mt7530-mdio mdio-bus:1f lan4: unsupported SFP module: no common interface modes
root@bpi-r3:~# ethtool lan4
Settings for lan4:
	Supported ports: [ MII ]
	Supported link modes:   2500baseX/Full
	Supported pause frame use: Symmetric Receive-only
	Supports auto-negotiation: Yes
	Supported FEC modes: Not reported
	Advertised link modes:  2500baseX/Full
	Advertised pause frame use: Symmetric Receive-only
	Advertised auto-negotiation: Yes
	Advertised FEC modes: Not reported
	Speed: Unknown!
	Duplex: Unknown! (255)
	Auto-negotiation: on
	Port: MII
	PHYAD: 0
	Transceiver: internal
	Supports Wake-on: d
	Wake-on: d
	Link detected: no
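
The two failures reported in this thread show up as distinct kernel messages (`sfp_add_phy failed: -EINVAL` for the AVAGO module, `no common interface modes` for the TL-SM410U). A small filter can pair each failure with the module line that precedes it. This is a sketch under the assumption that the log lines look like the ones quoted here; `sfp_failures` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: pair SFP failure messages with the last detected module.
# Reads kernel log text on stdin; prints "[<module id>] <failure message>".
sfp_failures() {
    awk '
        # remember the most recent "sfp <cage>: module ..." line
        / sfp [^ ]+: module / {
            module = $0
            sub(/.*: module /, "", module)
            gsub(/ +/, " ", module)     # squeeze the EEPROM field padding
            sub(/ $/, "", module)
            next
        }
        # failure messages seen in this thread
        /sfp_add_phy failed|unsupported SFP module|validation with support .* failed/ {
            msg = $0
            sub(/^\[[^]]*\] /, "", msg) # drop the timestamp
            print "[" module "] " msg
        }'
}

# Usage on the board: dmesg | sfp_failures
```

The output is compact enough to paste into a report together with the kernel version.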


@frank-w
Owner

frank-w commented Apr 24, 2024

This looks like the main problem for all 2g5 sfps...look at what i have done in 6.9-net-next...i guess you need at least eric's phylink patch and the tplink quirk to let phylink know about 2g5...the phy-mapping i had done for the oem imho does not work for tplink
