Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mt7921u pending tx #410

Open
Simon566 opened this issue Mar 22, 2024 · 101 comments
Open

mt7921u pending tx #410

Simon566 opened this issue Mar 22, 2024 · 101 comments

Comments

@Simon566
Copy link

  • kernel 6.6.22
  • x86-64 , intel
  • Odroid H3
  • ALFA USB stick

[Do Mär 21 22:48:23 2024] mt7921u 1-1:1.3: Message 00020003 (seq 6) timeout
[Do Mär 21 22:48:24 2024] mt7921u 1-1:1.3: timed out waiting for pending tx
[Do Mär 21 22:48:24 2024] mt7921u 1-1:1.3: HW/SW Version: 0x8a108a10, Build Time: 20240219110958a

[Do Mär 21 22:48:24 2024] mt7921u 1-1:1.3: WM Firmware Version: ____010000, Build Time: 20240219111038
[Fr Mär 22 06:17:31 2024] mt7921u 1-1:1.3: vendor request req:63 off:d02c failed:-110
[Fr Mär 22 06:17:34 2024] mt7921u 1-1:1.3: vendor request req:63 off:d054 failed:-110
[Fr Mär 22 06:17:37 2024] mt7921u 1-1:1.3: vendor request req:63 off:d058 failed:-110
[Fr Mär 22 06:17:41 2024] mt7921u 1-1:1.3: vendor request req:63 off:53b8 failed:-110

.....alot more of the -110

[Fr Mär 22 06:42:20 2024] task:kworker/u8:3 state:D stack:0 pid:1179348 ppid:2 flags:0x00004000
[Fr Mär 22 06:42:20 2024] Workqueue: phy0 mt792x_mac_work [mt792x_lib]
[Fr Mär 22 06:42:20 2024] Call Trace:
[Fr Mär 22 06:42:20 2024]
[Fr Mär 22 06:42:20 2024] __schedule+0x3c4/0xb50
[Fr Mär 22 06:42:20 2024] schedule+0x61/0xe0
[Fr Mär 22 06:42:20 2024] schedule_preempt_disabled+0x18/0x30
[Fr Mär 22 06:42:20 2024] __mutex_lock.constprop.0+0x3b4/0x700
[Fr Mär 22 06:42:20 2024] mt792x_mac_work+0x28/0xb0 [mt792x_lib]
[Fr Mär 22 06:42:20 2024] process_one_work+0x171/0x340
[Fr Mär 22 06:42:20 2024] worker_thread+0x27b/0x3a0
[Fr Mär 22 06:42:20 2024] ? __pfx_worker_thread+0x10/0x10
[Fr Mär 22 06:42:20 2024] kthread+0xf4/0x130
[Fr Mär 22 06:42:20 2024] ? __pfx_kthread+0x10/0x10
[Fr Mär 22 06:42:20 2024] ret_from_fork+0x31/0x50
[Fr Mär 22 06:42:20 2024] ? __pfx_kthread+0x10/0x10
[Fr Mär 22 06:42:20 2024] ret_from_fork_asm+0x1b/0x30
[Fr Mär 22 06:42:20 2024]
[Fr Mär 22 06:42:20 2024] Future hung task reports are suppressed, see sysctl kernel.hung_task_warnings
[Fr Mär 22 06:42:22 2024] mt7921u 1-1:1.3: vendor request req:63 off:4230 failed:-110
[Fr Mär 22 06:42:25 2024] mt7921u 1-1:1.3: vendor request req:63 off:4230 failed:-110

@Simon566
Copy link
Author

here is the next:
[Sa Mär 23 16:32:33 2024] mt7921u 1-1:1.3: Message 00020003 (seq 7) timeout
[Sa Mär 23 16:32:34 2024] mt7921u 1-1:1.3: timed out waiting for pending tx
[Sa Mär 23 16:32:34 2024] ------------[ cut here ]------------
[Sa Mär 23 16:32:34 2024] WARNING: CPU: 3 PID: 544288 at kernel/kthread.c:659 kthread_park+0x85/0xa0
[Sa Mär 23 16:32:34 2024] Modules linked in: nf_conntrack_netlink cls_matchall sch_ingress sch_cake act_mirred ifb nft_masq nft_nat nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 wireguard veth libchacha20poly1305 chacha_x86_64 poly1305_x86_64 curve25519_x86_64 libcurve25519_generic libchacha ip6_udp_tunnel udp_tunnel ccm nf_tables pppoe pppox crc32c_generic nfnetlink ppp_generic slhc nvme_fabrics msr bridge binfmt_misc snd_sof_pci_intel_icl snd_sof_intel_hda_common soundwire_intel soundwire_generic_allocation snd_sof_intel_hda_mlink soundwire_cadence snd_sof_intel_hda snd_sof_pci snd_sof_xtensa_dsp mt7921u snd_sof snd_hda_codec_hdmi mt792x_usb snd_sof_utils snd_soc_hdac_hda mt7921_common snd_hda_codec_realtek snd_hda_codec_generic snd_hda_ext_core ledtrig_audio snd_soc_acpi_intel_match mt792x_lib x86_pkg_temp_thermal snd_soc_acpi intel_powerclamp snd_soc_core mt76_connac_lib kvm_intel snd_compress soundwire_bus mt76_usb i915 snd_hda_intel btusb mt76 kvm nls_ascii btrtl nls_cp437 btintel snd_intel_dspcfg
[Sa Mär 23 16:32:34 2024] vfat btbcm mac80211 snd_intel_sdw_acpi btmtk irqbypass bluetooth fat snd_hda_codec ghash_clmulni_intel 88XXau(OE) sha256_ssse3 libarc4 sha1_ssse3 snd_hda_core drm_buddy drm_display_helper snd_hwdep intel_rapl_msr mei_pxp cfg80211 processor_thermal_device_pci_legacy mei_hdcp aesni_intel sha3_generic jitterentropy_rng cec processor_thermal_device mei_me crypto_simd snd_pcm sha512_ssse3 sha512_generic processor_thermal_rfim cryptd ctr drbg iTCO_wdt snd_timer ansi_cprng processor_thermal_mbox ftdi_sio rc_core ecdh_generic mei intel_pmc_bxt processor_thermal_rapl ecc ttm iTCO_vendor_support usbserial snd nfsd intel_rapl_common rfkill watchdog pcspkr drm_kms_helper wmi_bmof crc16 intel_cstate ee1004 soundcore int340x_thermal_zone i2c_algo_bit intel_soc_dts_iosf intel_pmc_core evdev acpi_pad joydev sg button acpi_tad auth_rpcgss nfs_acl lockd it87(OE) hwmon_vid grace emc2103 coretemp sunrpc 8021q drm garp stp mrp llc dm_mod fuse configfs loop efi_pstore efivarfs ip_tables x_tables autofs4 btrfs xor raid6_pq
[Sa Mär 23 16:32:34 2024] libcrc32c hid_logitech_hidpp hid_logitech_dj hid_generic usbhid hid sd_mod nvme ahci libahci nvme_core libata xhci_pci xhci_hcd t10_pi r8169 crc64_rocksoft crc64 crc_t10dif realtek i2c_i801 scsi_mod crct10dif_generic mdio_devres usbcore i2c_smbus libphy crct10dif_pclmul crct10dif_common crc32_pclmul crc32c_intel intel_lpss_pci scsi_common intel_lpss usb_common idma64 video wmi
[Sa Mär 23 16:32:34 2024] CPU: 3 PID: 544288 Comm: kworker/u8:11 Tainted: G U OE 6.6.22 #5
[Sa Mär 23 16:32:34 2024] Hardware name: HARDKERNEL ODROID-H3/ODROID-H3, BIOS 5.19 07/19/2023
[Sa Mär 23 16:32:34 2024] Workqueue: mt76 mt7921_mac_reset_work [mt7921_common]
[Sa Mär 23 16:32:34 2024] RIP: 0010:kthread_park+0x85/0xa0
[Sa Mär 23 16:32:34 2024] Code: 00 48 85 c0 74 2d 31 c0 5b 5d c3 cc cc cc cc 0f 0b 48 8b ab 40 0a 00 00 a8 04 74 ac 0f 0b b8 da ff ff ff 5b 5d c3 cc cc cc cc <0f> 0b b8 f0 ff ff ff eb d5 0f 0b eb cf 66 66 2e 0f 1f 84 00 00 00
[Sa Mär 23 16:32:34 2024] RSP: 0018:ffffb24346c5bd58 EFLAGS: 00010202
[Sa Mär 23 16:32:34 2024] RAX: 0000000000000004 RBX: ffff9c39cc2cb100 RCX: 0000000000000100
[Sa Mär 23 16:32:34 2024] RDX: 0000000080000000 RSI: 0000000000000283 RDI: ffff9c39cc2cb100
[Sa Mär 23 16:32:34 2024] RBP: ffff9c3a16415200 R08: ffffffffc051e368 R09: 0000000000000000
[Sa Mär 23 16:32:34 2024] R10: 0000000000000001 R11: 0000000000000000 R12: ffff9c39cd8c20a8
[Sa Mär 23 16:32:34 2024] R13: ffff9c3a168743a8 R14: ffff9c39cd8c20a8 R15: 0000000000000100
[Sa Mär 23 16:32:34 2024] FS: 0000000000000000(0000) GS:ffff9c3d2ff80000(0000) knlGS:0000000000000000
[Sa Mär 23 16:32:34 2024] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Sa Mär 23 16:32:34 2024] CR2: 00007f8e9c554000 CR3: 0000000442dae000 CR4: 0000000000350ee0
[Sa Mär 23 16:32:34 2024] Call Trace:
[Sa Mär 23 16:32:34 2024]
[Sa Mär 23 16:32:34 2024] ? kthread_park+0x85/0xa0
[Sa Mär 23 16:32:34 2024] ? __warn+0x81/0x130
[Sa Mär 23 16:32:34 2024] ? kthread_park+0x85/0xa0
[Sa Mär 23 16:32:34 2024] ? report_bug+0x191/0x1c0
[Sa Mär 23 16:32:34 2024] ? handle_bug+0x41/0x70
[Sa Mär 23 16:32:34 2024] ? exc_invalid_op+0x17/0x70
[Sa Mär 23 16:32:34 2024] ? asm_exc_invalid_op+0x1a/0x20
[Sa Mär 23 16:32:34 2024] ? kthread_park+0x85/0xa0
[Sa Mär 23 16:32:34 2024] mt76u_stop_tx+0x216/0x2f0 [mt76_usb]
[Sa Mär 23 16:32:34 2024] ? __pfx_autoremove_wake_function+0x10/0x10
[Sa Mär 23 16:32:34 2024] mt7921u_mac_reset+0x72/0x1b0 [mt7921u]
[Sa Mär 23 16:32:34 2024] mt7921_mac_reset_work+0x97/0x180 [mt7921_common]
[Sa Mär 23 16:32:34 2024] ? __schedule+0x3cc/0xb50
[Sa Mär 23 16:32:34 2024] process_one_work+0x171/0x340
[Sa Mär 23 16:32:34 2024] worker_thread+0x27b/0x3a0
[Sa Mär 23 16:32:34 2024] ? __pfx_worker_thread+0x10/0x10
[Sa Mär 23 16:32:34 2024] kthread+0xf4/0x130
[Sa Mär 23 16:32:34 2024] ? __pfx_kthread+0x10/0x10
[Sa Mär 23 16:32:34 2024] ret_from_fork+0x31/0x50
[Sa Mär 23 16:32:34 2024] ? __pfx_kthread+0x10/0x10
[Sa Mär 23 16:32:34 2024] ret_from_fork_asm+0x1b/0x30
[Sa Mär 23 16:32:34 2024]
[Sa Mär 23 16:32:34 2024] ---[ end trace 0000000000000000 ]---

@fhteagle
Copy link

fhteagle commented Apr 1, 2024

Also seeing that in journalctl as well:

Mar 28 22:22:52 hostname here kernel: mt7921u 1-1.4:1.3: timed out waiting for pending tx

Seems to be associated with the random failure to forward packets issue I have seen over the past few weeks on mt7921u and mt7921k. Fortunately, this is mitigated by restarting hostapd most of the time, but very rarely I will need to full reboot the machine the mt7921 is attached to, too. I will try to come up with concrete steps to replicate the issue.

@morrownr
Copy link
Owner

morrownr commented Apr 1, 2024

@Simon566 @fhteagle

After seeing the first message I have been keeping an eye open. It is clear that @fhteagle is in AP mode but it is not clear what mode @Simon566 is in. That would be good info to know.

I have seen some errors like this when inadequate or failing power supplies were involved.

@fhteagle

I have a mt7921au based adapter in AP Mode on a Pi4B. It is running RasPiOS with kernel 6.6. I'm not seeing this. Can you go into more detail about:

Seems to be associated with the random failure to forward packets issue I have seen over the past few weeks ...

@Simon566
Copy link
Author

Simon566 commented Apr 3, 2024

Hi ,

its the same here. Im running AP mode on X86-64 hardware, running Debian Bookworm with latest kernel 6.6.23 and hostapd from git mainline. Having a single client doesnt seem to trigger the fault, even when using things like openspeedtest or similar sites. But when i put the AP into my real network with 10 clients , it fails within a day mostly.

Using a rtl8812au AP runs for weeks at least before having some issues, also on USB hardware.

regards,
Simon

@fhteagle
Copy link

fhteagle commented Apr 3, 2024

I cannot give a cause of the issue, nor how to reproduce reliably. Best pattern I can describe now is three general classes of failure, that may or may not be related:

  1. hostapd itself soft bricks, which will lead to even root privilege user cannot get any information from hostapd_cli tool
  2. hostapd process still live, STAs correctly listed in hostapd_cli list_sta, pings get through to STAs, but no other packet type or port makes it to that STA from any other device on LAN, other STAs on same AP may or may not also be affected
  3. hostapd process still live, STAs correctly listed in hostapd_cli list_sta, pings and other port traffic gets through to STAs from IPs on LAN, but no pings or other packet whatsoever are passed between any two STAs connected to the same AP

All 3 failure modes can be cleared with ~95%+ success rate by a simple systemd restart of the hostapd service for that adapter. ~5% of the time I have to reboot to clear the failure mode.

The pending tx journal message is most strongly associated with failure modes 2 and 3, but not perfectly correlated. I can have a mode 2 or mode 3 failure with or without the journal message. With the journal message, there will always be either mode 2 or mode 3 failure.

Again, I'm running bleeding edge hostapd-git , so entirely possible this is the root cause. Behavior seen on arch x86_64 and aarch64 (raspberry pi 3b and 4b) hosts, using MT7921K (pci) and MT7921AU (usb) cards. Adapters are bridged, either via systemd/networkd config, or the bridge = option in hostapd.conf, with one or more ethernet NIC and other adapter device on each host.

@whitslack
Copy link

whitslack commented Aug 26, 2024

I've spent the past several hours trying to get to the bottom of this issue. The timed out waiting for pending tx message comes from the chip reset code path, which only gets invoked following a command timeout. Even the preceding Message 00020003 (seq __) timeout is only a symptom of the underlying fault condition.

I added some debug log messages to trace the flow of command messages to/from the mt7921u chip. Setting up the chip takes several dozen command messages, as one might expect:

<6>usb 2-7: new high-speed USB device number 4 using ehci-pci
<6>usb 2-7: reset high-speed USB device number 4 using ehci-pci
<6>mt7921u 2-7:1.3: sent cmd=00000010 seq=1
<6>mt7921u 2-7:1.3: rcvd seq=1
<6>mt7921u 2-7:1.3: HW/SW Version: 0x8a108a10, Build Time: 20240716150944a
<6>
<6>mt7921u 2-7:1.3: sent cmd=00000005 seq=2
<6>mt7921u 2-7:1.3: rcvd seq=2
<6>mt7921u 2-7:1.3: sent cmd=000000ee seq=3
⋮
<6>mt7921u 2-7:1.3: sent cmd=000000ee seq=10
<6>mt7921u 2-7:1.3: sent cmd=00000007 seq=11
<6>mt7921u 2-7:1.3: rcvd seq=11
<6>mt7921u 2-7:1.3: sent cmd=00000010 seq=12
<6>mt7921u 2-7:1.3: rcvd seq=12
<6>mt7921u 2-7:1.3: WM Firmware Version: ____010000, Build Time: 20240716151027
<6>mt7921u 2-7:1.3: sent cmd=00000001 seq=13
<6>mt7921u 2-7:1.3: rcvd seq=13
<6>mt7921u 2-7:1.3: sent cmd=000000ee seq=14
⋮
<6>mt7921u 2-7:1.3: sent cmd=000000ee seq=12
<6>mt7921u 2-7:1.3: sent cmd=00000001 seq=13
<6>mt7921u 2-7:1.3: rcvd seq=13
<6>mt7921u 2-7:1.3: sent cmd=000000ee seq=14
⋮
<6>mt7921u 2-7:1.3: sent cmd=000000ee seq=5
<6>mt7921u 2-7:1.3: sent cmd=00000001 seq=6
<6>mt7921u 2-7:1.3: rcvd seq=6
<6>mt7921u 2-7:1.3: sent cmd=000000ee seq=7
<6>mt7921u 2-7:1.3: sent cmd=000000ee seq=8
<6>mt7921u 2-7:1.3: sent cmd=000000ee seq=9
<6>mt7921u 2-7:1.3: sent cmd=000000ee seq=10
<6>mt7921u 2-7:1.3: sent cmd=00000001 seq=11
<6>mt7921u 2-7:1.3: rcvd seq=11
<6>mt7921u 2-7:1.3: sent cmd=000000ee seq=12
⋮
<6>mt7921u 2-7:1.3: sent cmd=000000ee seq=9
<6>mt7921u 2-7:1.3: sent cmd=00000002 seq=10
<6>mt7921u 2-7:1.3: rcvd seq=10
<6>mt7921u 2-7:1.3: sent cmd=0004008a seq=11
<6>mt7921u 2-7:1.3: rcvd seq=11
<6>mt7921u 2-7:1.3: sent cmd=000400c5 seq=12
<6>mt7921u 2-7:1.3: sent cmd=000021ed seq=13
<6>mt7921u 2-7:1.3: rcvd seq=13
<6>mt7921u 2-7:1.3: sent cmd=00003eed seq=14
<6>mt7921u 2-7:1.3: rcvd seq=14
<6>mt7921u 2-7:1.3: sent cmd=0004000f seq=15
<6>mt7921u 2-7:1.3: sent cmd=0004005d seq=1
⋮
<6>mt7921u 2-7:1.3: sent cmd=0004005d seq=7
<6>mt7921u 2-7:1.3: sent cmd=000400ca seq=8
<6>mt7921u 2-7:1.3: sent cmd=000046ed seq=9
<6>mt7921u 2-7:1.3: rcvd seq=9
<6>mt7921u 2-7:1.3: sent cmd=0004000f seq=10
<6>mt7921u 2-7:1.3: sent cmd=00004eed seq=11
<6>mt7921u 2-7:1.3: rcvd seq=11
<6>mt7921u 2-7:1.3: sent cmd=0004005d seq=12
⋮
<6>mt7921u 2-7:1.3: sent cmd=0004005d seq=3
<6>mt7921u 2-7:1.3: sent cmd=00020001 seq=4
<6>mt7921u 2-7:1.3: rcvd seq=4
<6>mt7921u 2-7:1.3: sent cmd=00020002 seq=5
<6>mt7921u 2-7:1.3: rcvd seq=5
<6>mt7921u 2-7:1.3: sent cmd=0004000a seq=6
<6>mt7921u 2-7:1.3: sent cmd=0004001d seq=7
<6>mt7921u 2-7:1.3: sent cmd=0004000a seq=8
<6>br0: port 3(wifi6) entered blocking state
<6>br0: port 3(wifi6) entered disabled state
<6>mt7921u 2-7:1.3 wifi6: entered allmulticast mode
<6>mt7921u 2-7:1.3: sent cmd=0004000a seq=9
<6>mt7921u 2-7:1.3 wifi6: entered promiscuous mode
<6>br0: port 3(wifi6) entered blocking state
<6>br0: port 3(wifi6) entered forwarding state
<6>mt7921u 2-7:1.3: sent cmd=0004000a seq=10
<6>mt7921u 2-7:1.3: sent cmd=00040003 seq=11
<6>mt7921u 2-7:1.3: sent cmd=00003eed seq=12
<6>mt7921u 2-7:1.3: rcvd seq=12
<6>mt7921u 2-7:1.3: sent cmd=00020002 seq=13
<6>br0: port 3(wifi6) entered disabled state
<6>mt7921u 2-7:1.3: rcvd seq=13
<6>mt7921u 2-7:1.3: sent cmd=00020002 seq=14
<6>mt7921u 2-7:1.3: rcvd seq=14
<6>mt7921u 2-7:1.3: sent cmd=00020002 seq=15
<6>mt7921u 2-7:1.3: rcvd seq=15
<6>mt7921u 2-7:1.3: sent cmd=00020002 seq=1
<6>mt7921u 2-7:1.3: rcvd seq=1
<6>mt7921u 2-7:1.3: sent cmd=00040017 seq=2
<6>mt7921u 2-7:1.3: sent cmd=00040016 seq=3
<6>mt7921u 2-7:1.3: sent cmd=00020003 seq=4
<6>mt7921u 2-7:1.3: rcvd seq=3
<4>mt7921u 2-7:1.3: skipping mismatched event: seq=4, rxd->seq=3
<6>mt7921u 2-7:1.3: rcvd seq=4
<6>mt7921u 2-7:1.3: sent cmd=00020002 seq=5
<6>mt7921u 2-7:1.3: rcvd seq=5
<6>mt7921u 2-7:1.3: sent cmd=0004001d seq=6
<6>mt7921u 2-7:1.3: sent cmd=000400b0 seq=7
<6>br0: port 3(wifi6) entered blocking state
<6>br0: port 3(wifi6) entered forwarding state
<6>mt7921u 2-7:1.3: sent cmd=0004000a seq=8
<6>mt7921u 2-7:1.3: sent cmd=00020003 seq=9
<6>mt7921u 2-7:1.3: rcvd seq=9
<6>mt7921u 2-7:1.3: sent cmd=00020003 seq=10
<6>mt7921u 2-7:1.3: rcvd seq=10
<6>mt7921u 2-7:1.3: sent cmd=0004001d seq=11
<6>mt7921u 2-7:1.3: sent cmd=000400b0 seq=12
<6>mt7921u 2-7:1.3: sent cmd=0004001d seq=13
<6>mt7921u 2-7:1.3: sent cmd=000400b0 seq=14
<6>mt7921u 2-7:1.3: sent cmd=0004001d seq=15
<6>mt7921u 2-7:1.3: sent cmd=000400b0 seq=1
<6>mt7921u 2-7:1.3: sent cmd=0004001d seq=2
<6>mt7921u 2-7:1.3: sent cmd=000400b0 seq=3
<6>mt7921u 2-7:1.3: sent cmd=00020002 seq=4
<6>mt7921u 2-7:1.3: rcvd seq=4
<6>mt7921u 2-7:1.3: sent cmd=0004000a seq=5

(For brevity, I abridged the long runs of cmd=000000ee and cmd=0004005d, which I'm guessing are related to firmware downloads.)

In the above log, you can see that some commands elicit replies from the MCU (the microcontroller in the mt7921 that is running the firmware), while other commands do not. This is very likely the expected operation of the protocol between the driver and the firmware. One thing does stand out to me as wrong, indicated by the line that says skipping mismatched event. That appears to occur because cmd=00040016 elicits a reply from the MCU, but the driver does not actually wait for a reply to that command. I have not investigated whether the driver truly doesn't care about the result of cmd=00040016 or if it's actually dropping the ball, but either way, this does not seem to cause a problem, as the driver transparently skips over the reply to cmd=00040016 while it is waiting to receive the reply to the next command.

When I now connect my Android phone to the AP, the following command messages are logged:

<6>mt7921u 2-7:1.3: sent cmd=00020003 seq=7
<6>mt7921u 2-7:1.3: rcvd seq=7
<6>mt7921u 2-7:1.3: sent cmd=00020002 seq=8
<6>mt7921u 2-7:1.3: rcvd seq=8
<6>mt7921u 2-7:1.3: sent cmd=00020003 seq=9
<6>mt7921u 2-7:1.3: rcvd seq=9
<6>mt7921u 2-7:1.3: sent cmd=00020002 seq=10
<6>mt7921u 2-7:1.3: rcvd seq=10
<6>mt7921u 2-7:1.3: sent cmd=00020003 seq=11
<6>mt7921u 2-7:1.3: rcvd seq=11
<6>mt7921u 2-7:1.3: sent cmd=00020003 seq=12
<6>mt7921u 2-7:1.3: rcvd seq=12
<6>mt7921u 2-7:1.3: sent cmd=00020003 seq=13
<6>mt7921u 2-7:1.3: rcvd seq=13
<6>mt7921u 2-7:1.3: sent cmd=00020003 seq=14
<6>mt7921u 2-7:1.3: rcvd seq=14
<6>mt7921u 2-7:1.3: sent cmd=00020003 seq=15
<6>mt7921u 2-7:1.3: rcvd seq=15
<6>mt7921u 2-7:1.3: sent cmd=00020003 seq=1
<6>mt7921u 2-7:1.3: rcvd seq=1
<6>mt7921u 2-7:1.3: sent cmd=00020003 seq=2
<6>mt7921u 2-7:1.3: rcvd seq=2
<6>mt7921u 2-7:1.3: sent cmd=00020003 seq=3
<6>mt7921u 2-7:1.3: rcvd seq=3

All appears well. The MCU has replied to the last cmd=00020003.

At this point, I run Ookla Speedtest on my Android phone, a usage scenario that reliably kills the AP every time. Crucially, no additional command messages are logged during the speedtest or the hang that results from it. This suggests that the actual problem is triggered by a data message rather than by a command message.

It is only when I attempt to bring down the interface, subsequent to the hang, that the driver notices a problem. There appear to be multiple failure modes at this point, depending on which particular command message the driver sends next when I run ip link set wifi6 down after the AP has hung.

Failure mode 1: driver next sends cmd=0004000a

<6>mt7921u 2-7:1.3: sent cmd=0004000a seq=8
<3>mt7921u 2-7:1.3: mcu_skb_send_msg cmd=00020003: error -110
<3>mt7921u 2-7:1.3: mcu_skb_send_msg cmd=00020003: error -110
<3>mt7921u 2-7:1.3: mcu_skb_send_msg cmd=00020003: error -110
<3>mt7921u 2-7:1.3: mcu_skb_send_msg cmd=00020003: error -110
<3>mt7921u 2-7:1.3: mcu_skb_send_msg cmd=00020003: error -110
<3>mt7921u 2-7:1.3: mcu_skb_send_msg cmd=00020002: error -110
<3>mt7921u 2-7:1.3: mcu_skb_send_msg cmd=0004001d: error -110
<3>mt7921u 2-7:1.3: mcu_skb_send_msg cmd=00040017: error -110
<3>mt7921u 2-7:1.3: mcu_skb_send_msg cmd=00020002: error -110
<3>mt7921u 2-7:1.3: timed out waiting for pending tx
<3>mt7921u 2-7:1.3: mcu_skb_send_msg cmd=000046ed: error -110
<6>br0: port 3(wifi6) entered disabled state

(Error -110 is -ETIMEDOUT.)

In the case that, immediately following the hang, the driver next sends cmd=0004000a, a command that is not expected to elicit a reply from the MCU, then that command gets sent successfully, but the next command, a cmd=00020003, fails to be enqueued to the USB endpoint, and all of the subsequent commands also fail to be enqueued until eventually the driver attempts to reset the chip and logs the timed out waiting for pending tx message while waiting for the tx queue to drain, which never happens.

Failure mode 2: driver next sends cmd=00020003

<6>mt7921u 2-7:1.3: sent cmd=00020003 seq=9
<3>mt7921u 2-7:1.3: Message 00020003 (seq 9) timeout
<3>mt7921u 2-7:1.3: timed out waiting for pending tx

In the case that, immediately following the hang, the driver next sends cmd=00020003, a command that is expected to elicit a reply from the MCU, then that command gets sent successfully, but the driver times out while waiting to receive the reply to the command. This causes it immediately to attempt to reset the chip, which ends up in the timed out waiting for pending tx code path as above.


For whatever it is worth, it does appear that the chip can be resurrected from a crash without rebooting the system by unbinding the mt7921u driver from the USB device and rebinding it. This action causes a reset of the USB interface, which does seem to allow the chip to recover.

# echo 2-7:1.3 >/sys/bus/usb/drivers/mt7921u/unbind
# echo 2-7:1.3 >/sys/bus/usb/drivers/mt7921u/bind

(Adjust the 2-7:1.3 to the actual USB bus address of your mt7921u device.)

<6>usb 2-7: reset high-speed USB device number 4 using ehci-pci
<6>mt7921u 2-7:1.3: sent cmd=00000010 seq=1
<6>mt7921u 2-7:1.3: rcvd seq=1
<6>mt7921u 2-7:1.3: HW/SW Version: 0x8a108a10, Build Time: 20240716150944a

I have to put this (and myself) to bed for now, but I may eventually circle back around to attempt to trace the (much more complex) data path to try to find what conditions cause the chip to go out to lunch and never return.

I should add one more observation and conjecture. It seems very telling to me that, in "Failure mode 1" above, the command that is not expected to elicit a reply is sent successfully but then the very next command cannot even be sent. This suggests a problem at the USB protocol level since the only way (as far as I know) that the driver could fail to send a command to the chip would be if the USB endpoint is busy. Presumably, the first command gets sent okay but puts the endpoint into a busy state while the USB host controller awaits an acknowledgment from the device. That acknowledgment never comes, as the device has crashed, and so the USB endpoint remains unavailable for transmission of any subsequent commands to the device.

@gdluca
Copy link

gdluca commented Aug 27, 2024

Hi @whitslack

great troubleshooting. I learned a few things thanks to your comment but don't have a fix myself. I recently bought EDUP EP-AX1672 that use the same driver and I am getting the same pending tx errors which make the adapter unusable till hostapd restart. I wanted to add that until yesterday, I used it in AP mode on my desktop x86_64 using the latest archlinux kernel 6.10.6 without any issues. Then I tried my new raspberrypi 5 and that's when pending errors started. I suspect that it is something to do with the usb 3.0 root hub that can't handle both ports at high utilization( the other port is used for a usb<-> 1gbps ethernet). So this morning, I moved the ethernet adapter to one of the usb 2.0 ports and so far it seems fine ( speedtests, heavy transfer files). The rpi5 is also running archlinuxarm with the linux-rpi-16k kernel (currently 6.6.47-2).

@whitslack
Copy link

@gdluca: The problem is definitely not caused by USB 3. I can readily reproduce the hang with an mt7921u adapter plugged into a system that does not have any USB 3 controllers.

@gdluca
Copy link

gdluca commented Aug 27, 2024

@whitslack I see. Then mine will probably get errors again at some point.. Will try to do more tests in the weekend.

@gdluca
Copy link

gdluca commented Sep 19, 2024

Since I cannot swap and test the adapter from the desktop (which works fine - full story #410 )back to the rpi5, I just purchased a second EDUP EP-AX1672. Who knows, maybe with a different hw it works fine too eheh.

CC: @morrownr

@morrownr
Copy link
Owner

@gdluca

The search continues.

@gdluca
Copy link

gdluca commented Sep 25, 2024

And here it is my second EDUP. Currently I am testing like this:
・SSID:labcrash - NEW EDUP on rpi5 with usb power hub

  • ax/40Mhz/channel 36

・SSD:nXXXXX - OLD EDUP on desktop

  • ax/40Mhz/channel 44

I was able to reproduce it again with the new EDUP afater ~322seconds of iperf3. Good at least now there is no impact on my remote job and I can test at the same time hehe.

@morrownr

@whitslack
Copy link

@gdluca: When your AP hangs, see if you can dump the queue states from the mt76 driver:

# cat /sys/kernel/debug/ieee80211/*/mt76/*-queues

(This assumes you have debugfs mounted at /sys/kernel/debug.)

I spent close to 12 hours this past weekend going through the mt76 USB code with a fine-toothed comb. I did find some places where the code makes some assumptions that don't strike me as obviously valid, and I also found an area where I believe there may be a race condition caused by the omission of needed memory barriers. It would be helpful to see a dump of those debug files taken while the driver is hung, as it could confirm my suspicion that the driver may be failing to wake up its kernel worker task despite a queue not being empty. I will also be dumping those files the next time my AP hangs, but you might beat me to it if I get lucky.

@gdluca
Copy link

gdluca commented Sep 25, 2024

@whitslack

For below queue results, I redid an iperf3, waited and the AP hung. There are no dmesg messages yet and I have not restarted hostapd. It is still in this situation so if you want more logs, I should be able to get them.

[root@alarm ~]# cat /sys/kernel/debug/ieee80211/*/mt76/*-queues
     queue | hw-queued |      head |      tail |
         0 |       128 |        44 |        44 |
         1 |       128 |         0 |         0 |
     queue | hw-queued |      head |      tail |
         0 |         0 |       132 |       132 |
         1 |         0 |         0 |         0 |
         2 |       224 |       150 |       182 |
         3 |         0 |         0 |         0 |
         4 |         0 |       132 |       132 |

@gdluca
Copy link

gdluca commented Sep 25, 2024

@whitslack !!

I waited I think 20 min (I am away from the pc but more or less) and the AP is back automatically!! Now I have dmesg errors

[Wed Sep 25 03:17:56 2024] mt7921u 2-1.2:1.0: Message 00020003 (seq 2) timeout
[Wed Sep 25 03:17:57 2024] mt7921u 2-1.2:1.0: timed out waiting for pending tx
[Wed Sep 25 03:17:57 2024] ------------[ cut here ]------------
[Wed Sep 25 03:17:57 2024] WARNING: CPU: 0 PID: 45567 at kernel/kthread.c:659 kthread_park+0xc0/0xe0
[Wed Sep 25 03:17:57 2024] Modules linked in: aes_ce_ccm mt7921u mt7921_common mt792x_lib mt76_connac_lib mt792x_usb mt76_usb mt76 mac80211 libarc4 xt_connmark xt_mark xt_comment xt_addrtype nft_compat wireguard libchacha20poly1305 chacha_neon poly1305_neon ip6_udp_tunnel udp_tunnel libcurve25519_generic libchacha veth vhost_vsock vmw_vsock_virtio_transport_common vhost vhost_iotlb vsock binfmt_misc sch_cake ifb sch_htb cls_u32 cls_flow cls_fw act_mirred sch_ingress 8021q garp bridge stp llc nft_nat nft_masq nft_chain_nat nf_nat nft_ct nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nf_tables rpivid_hevc(C) brcmfmac_wcc pisp_be brcmfmac v4l2_mem2mem hci_uart videobuf2_dma_contig btbcm brcmutil videobuf2_memops videobuf2_v4l2 bluetooth aes_ce_blk cfg80211 aes_ce_cipher videodev ghash_ce gf128mul sha2_ce videobuf2_common ecdh_generic sha256_arm64 ecc rfkill sha1_ce libaes ax88179_178a raspberrypi_hwmon rp1_adc raspberrypi_gpiomem i2c_brcmstb mc nvmem_rmem pwm_fan uio_pdrv_genirq uio sch_fq_codel tcp_bbr fuse dm_mod nfnetlink ip_tables
[Wed Sep 25 03:17:57 2024]  x_tables ipv6 vc4 snd_soc_hdmi_codec snd_soc_core snd_compress snd_pcm_dmaengine snd_pcm snd_timer snd drm_display_helper drm_dma_helper drm_kms_helper v3d drm_shmem_helper cec gpu_sched drm drm_panel_orientation_quirks backlight spidev spi_bcm2835 gpio_keys
[Wed Sep 25 03:17:57 2024] CPU: 0 PID: 45567 Comm: kworker/u8:2 Tainted: G         C         6.6.51-3-rpi-16k #1
[Wed Sep 25 03:17:57 2024] Hardware name: Raspberry Pi 5 Model B Rev 1.0 (DT)
[Wed Sep 25 03:17:57 2024] Workqueue: mt76 mt7921_mac_reset_work [mt7921_common]
[Wed Sep 25 03:17:57 2024] pstate: 60400009 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[Wed Sep 25 03:17:57 2024] pc : kthread_park+0xc0/0xe0
[Wed Sep 25 03:17:57 2024] lr : mt76u_stop_tx+0x25c/0x360 [mt76_usb]
[Wed Sep 25 03:17:57 2024] sp : ffffc000829a3c40
[Wed Sep 25 03:17:57 2024] x29: ffffc000829a3c40 x28: 0000000000000000 x27: ffff8001160e9d30
[Wed Sep 25 03:17:57 2024] x26: ffff8001160e2078 x25: ffff8001160e4820 x24: ffff8001160e2078
[Wed Sep 25 03:17:57 2024] x23: ffff8001160e2058 x22: ffffc000829a3c88 x21: ffff8001160e7420
[Wed Sep 25 03:17:57 2024] x20: ffff8001015cde00 x19: ffff800108c85b80 x18: 0000000000000000
[Wed Sep 25 03:17:57 2024] x17: 0000000000000000 x16: ffffd000840be418 x15: 0000000000000000
[Wed Sep 25 03:17:57 2024] x14: 0000000000000000 x13: 000000000000035d x12: 0000000000000001
[Wed Sep 25 03:17:57 2024] x11: ffffd000858308c8 x10: 0000000000001a20 x9 : ffffd00014a2e96c
[Wed Sep 25 03:17:57 2024] x8 : ffff800100372d00 x7 : 0000000000000000 x6 : ffffd000859b2650
[Wed Sep 25 03:17:57 2024] x5 : ffffd000859b2650 x4 : ffffd000859b2650 x3 : 0000000000002800
[Wed Sep 25 03:17:57 2024] x2 : 0000000000000000 x1 : 0000000000000000 x0 : 0000000000000004
[Wed Sep 25 03:17:57 2024] Call trace:
[Wed Sep 25 03:17:57 2024]  kthread_park+0xc0/0xe0
[Wed Sep 25 03:17:57 2024]  mt76u_stop_tx+0x25c/0x360 [mt76_usb]
[Wed Sep 25 03:17:57 2024]  mt7921u_mac_reset+0x8c/0x290 [mt7921u]
[Wed Sep 25 03:17:57 2024]  mt7921_mac_reset_work+0xc0/0x220 [mt7921_common]
[Wed Sep 25 03:17:57 2024]  process_one_work+0x144/0x3c0
[Wed Sep 25 03:17:57 2024]  worker_thread+0x328/0x460
[Wed Sep 25 03:17:57 2024]  kthread+0x11c/0x128
[Wed Sep 25 03:17:57 2024]  ret_from_fork+0x10/0x20
[Wed Sep 25 03:17:57 2024] ---[ end trace 0000000000000000 ]---
[Wed Sep 25 03:17:57 2024] mt7921u 2-1.2:1.0: HW/SW Version: 0x8a108a10, Build Time: 20240826150948a

[Wed Sep 25 03:17:57 2024] mt7921u 2-1.2:1.0: WM Firmware Version: ____010000, Build Time: 20240826151030

Current queue status

[root@alarm ~]# cat /sys/kernel/debug/ieee80211/*/mt76/*-queues
     queue | hw-queued |      head |      tail |
         0 |       128 |        55 |        55 |
         1 |       128 |         0 |         0 |
     queue | hw-queued |      head |      tail |
         0 |         0 |       174 |       174 |
         1 |         0 |         0 |         0 |
         2 |         0 |       164 |       164 |
         3 |         0 |         0 |         0 |
         4 |         0 |       174 |       174 |

@whitslack
Copy link

@gdluca: Thanks. I'll have to dig back into the driver code to remember what those numbers mean. No promises on when I'll get around to that.

Can you also check your kernel log to see if the mt76 driver has complained about any URB mismatches? (It would appear before the message about pending tx.)

@gdluca
Copy link

gdluca commented Sep 25, 2024

@whitslack

before the pending tx, I have

[Wed Sep 25 02:23:02 2024] br0: port 5(wlu1u2) entered blocking state
[Wed Sep 25 02:23:02 2024] br0: port 5(wlu1u2) entered forwarding state
[Wed Sep 25 03:17:56 2024] mt7921u 2-1.2:1.0: Message 00020003 (seq 2) timeout
[Wed Sep 25 03:17:57 2024] mt7921u 2-1.2:1.0: timed out waiting for pending tx
[Wed Sep 25 03:17:57 2024] ------------[ cut here ]------------

which is when I plugged in the adapter.

@whitslack
Copy link

Alright. There won't necessarily be a mismatched URB. I suspect that occurs due to an entirely separate problem in the code: namely, that the driver assumes that URB completion callbacks will execute serially despite that there is no guarantee (to my knowledge) that the kernel executes URB completion callbacks in a single tasklet thread. The mt76 driver takes care to synchronize access to its queue structures with a spinlock, but there is no guarantee about the order in which racing tasklet threads would enter the spinlock, so there is no guarantee that the driver will observe its submitted URBs to have completed in the same order as it submitted them. That the mt76 driver would implement its own URB queue from scratch strikes me as odd since the kernel already offers a standard struct usb_anchor and associated utility functions for keeping track of a collection of URBs.

Independent of the URB mismatch issue, I suspect that the mt76 driver code that manages the worker tasks that process the queues of completed URBs may have inadequate memory synchronization to ensure that a queue will always be serviced. It looks like there may be the possibility, with very low probability (but in multithreaded coding, anything that can happen will happen eventually), that a completed URB may be enqueued after the worker has already decided to go to sleep but before it has actually done so (and so the URB completion callback's attempt to awaken the worker does nothing since the worker is not yet asleep). The mt76 code does try to guard against this, but I believe it may be making invalid assumptions about CPU cache coherency. Again, it strikes me as odd that the mt76 driver attempts to roll its own queue worker implementation from scratch since the kernel already provides standard mechanisms for work queues. It kind of smells like whoever wrote the mt76 driver is a fairly decent programmer but one who isn't too experienced with coding within the Linux kernel and one who might not fully appreciate the need for memory barriers when writing thread synchronization code (which is why the kernel provides standard, battle-hardened implementations, because getting it right is tricky).

@whitslack
Copy link

whitslack commented Sep 25, 2024

I've been perusing the mt76 code some more…

     queue | hw-queued |      head |      tail |
         0 |       128 |        44 |        44 |
         1 |       128 |         0 |         0 |

These^ are the USB receive queues. They each show 128 URBs enqueued, which means that the driver has handed off 128 buffers to the USB subsystem to be filled with packets coming from the USB device, and none of those URBs have been completed, or in other words, the driver has received no packets from the hardware that the driver has not yet processed. This is good.

     queue | hw-queued |      head |      tail |
         0 |         0 |       132 |       132 |
         1 |         0 |         0 |         0 |
         2 |       224 |       150 |       182 |
         3 |         0 |         0 |         0 |
         4 |         0 |       132 |       132 |

These^ are the USB transmit queues. The driver doesn't actually appear to use queue 3 (EDIT: Oops, I was looking at the queue mapping for the MT7663.), but it uses queue 0 for transmitting BK (background) packets, queue 1 for transmitting BE (best effort) packets, queue 2 for transmitting VI (video) packets, and queue 4 3 for transmitting VO (voice) packets. Your dump indicates that the driver has enqueued 224 VI packets for transmission but has not yet called back to the networking subsystem to confirm transmission of those packets. Unfortunately, the debugfs dump does not show the first field of those queues, which is used on transmit queues to indicate which packet in the queue is the first that still needs to be sent out over USB, so we can't discern whether the breakdown occurred because the driver failed to kick packets out to the hardware or because the driver failed to process the completions of those packets after the hardware confirmed transmitting them.

While examining the transmit code paths, I found something that looks very bad and like it could potentially be causing nasty issues. In drivers/net/wireless/mediatek/mt76/usb.c, there are three functions that manipulate the transmit queues: mt76u_status_worker, mt76u_tx_queue_skb, and mt76u_tx_kick. The latter two are called from multiple sites in drivers/net/wireless/mediatek/mt76/tx.c, always while holding the queue's spinlock, so no problem there. However, the first, mt76u_status_worker, is called from __mt76_worker_fn in drivers/net/wireless/mediatek/mt76/util.c, which is a generic threaded worker implementation that does not know anything about the packet queues or their spinlocks. And the bombshell is that mt76u_status_worker does not hold the queues' spinlocks while manipulating the queues!!! This is catastrophic since the status worker runs in its own thread, likely concurrent with user threads that would be enqueuing packets into the transmit queues. Multiple threads performing concurrent, non-atomic, read-modify-write cycles on the structural pointers of the transmit queues. That's a recipe for disaster. I'm going to see if locking the spinlocks while accessing the queues in mt76u_status_worker fixes the hangs.

@gdluca
Copy link

gdluca commented Sep 25, 2024

I can only appreciate your knowledge and dedication, @whitslack . I cannot read that code so thanks!

@whitslack
Copy link

I'm going to see if locking the spinlocks while accessing the queues in mt76u_status_worker fixes the hangs.

Nope. I think it's a good fix, and I will submit it to the linux-wireless listserv, but it's not sufficient to cure our problem. The search continues.

@whitslack
Copy link

whitslack commented Sep 26, 2024

Another interesting observation: the MT7921AUN hardware advertises two USB inbound (device-to-host) bulk endpoints, ostensibly with the first being for normal network traffic and the second being for command responses from the microcontroller (MCU), yet the firmware actually only ever sends command responses on the first inbound bulk endpoint, mixed in with all the network traffic. The mt7921u driver is prepared to receive command responses on the second endpoint, but it never actually receives any, as all command responses from the MCU arrive on the first endpoint. I conjecture that command responses could be getting dropped if the first endpoint is overly busy with network traffic. This would manifest in exactly the scenario we are observing, where the driver times out while waiting to receive a response to a command that it has sent to the MCU. I posit that the MCU received the command, processed it, and tried to respond, but the response could not be enqueued to the first endpoint due to all the other traffic clogging it up.

It may be telling that the mt7915 driver did recently receive a patch to retry sending commands to the MCU whenever a wait to receive a response times out. I am going to try porting that new functionality over to the mt7921u driver to see if it can work around the hang we're experiencing. Note, it will only be a workaround; the proper fix would be for the firmware to respond to commands on the second endpoint so its responses can't be drowned out by network traffic.

@whitslack
Copy link

It may be telling that the mt7915 driver did recently receive a patch to retry sending commands to the MCU whenever a wait to receive a response times out. I am going to try porting that new functionality over to the mt7921u driver to see if it can work around the hang we're experiencing.

Nope. The retry logic does activate, but by the time that happens, the chip is already out to lunch. In fact, an Ookla Speedtest on my phone can hang the AP without there being any command messages pending (and so none to timeout and hit the retry logic). After a while, HostAPd will time out the station association and try to remove the station from the AP, and that action does cause a command message to be sent to the MCU. Curiously, the USB subsystem confirms completion of that outbound URB, but then the mt7921u driver times out while awaiting a response from the MCU, at which point the new retry logic kicks in and attempts to retransmit the command, but then the USB subsystem times out the retry URB. That's a different kind of timeout: in the former case, the command URB completes, and the mt7921u driver subsequently times out while awaiting a command response message from the MCU, but in the latter case, the command URB does not complete, and so the mt7921u driver does not progress to the point of awaiting a command response. It is as though the the MCU accepts the first command for processing (and acknowledges receipt of that command) but then hangs while processing the command and so is unavailable to acknowledge any subsequent commands. Note, it is not the command that is causing the problem, as the chip is already buggered before the command is even sent to it. The MCU's failure to respond to the command is a symptom of the hung state that the chip is already in.

Unfortunately, the debugfs dump does not show the first field of those queues

I augmented the xmit-queues file in debugfs to show the first field of the transmit queues. During a hang, there are URBs in the queue (hw-queued > 0) and first == head, which indicates that the mt7921u driver did submit those URBs to the USB subsystem. However, the done flag is not set on any of the URBs in the queue, meaning that the USB subsystem has not yet indicated completion of those URBs. Everything points to a problem in the firmware, as it appears that the Linux kernel driver is doing what it should.

Thinking that the high bandwidth of the speed test might be overrunning a queue or buffer in the chip and crashing the firmware, I tried reducing the number of transmit URBs that can be outstanding concurrently (from 224 to 96), but that had no effect on the hangs. I guess maybe I'll try sending out all traffic on the first outbound endpoint rather than distributing it among the endpoints according to WMM service class. Now I am really just stabbing at the air.

@gdluca
Copy link

gdluca commented Sep 26, 2024

@whitslack

thanks for trying!

Thinking that the high bandwidth of the speed test might be overrunning a queue or buffer in the chip and crashing the firmware

Indeed and I still have to make it crash with an iperf test limited to 100Mbps... however! Once (and only once so who knows) I made it hang with just opening a video file from a sshfs mounted (basically streaming via ssh) and the video was not that high quality to go over 100Mbps, not event 50Mbps.

@whitslack
Copy link

whitslack commented Sep 26, 2024

I think I have found a workaround. The firmware doesn't seem to be able to cope when multiple outbound endpoints are heavily loaded, but sending out all the BK (background) traffic on the same endpoint as the BE (best effort) traffic seems not to trigger the hangs. I can now consistently run Speedtest on my phone without hanging the AP, whereas previously I could never run it without hanging.

Screenshot_20240926-053115

The only change you really have to make is:

--- a/drivers/net/wireless/mediatek/mt76/usb.c
+++ b/drivers/net/wireless/mediatek/mt76/usb.c
@@ -923,7 +923,7 @@ static u8 mt76u_ac_to_hwq(struct mt76_dev *dev, u8 ac)
 		return lmac_queue_map[ac];
 	}
 
-	return mt76_ac_to_hwq(ac);
+	return mt76_chip(dev) == 0x7961 && ac == IEEE80211_AC_BK ? 0 : mt76_ac_to_hwq(ac);
 }
 
 static int mt76u_alloc_tx(struct mt76_dev *dev)

@gdluca
Copy link

gdluca commented Sep 26, 2024

@whitslack

fantastic news! I will try it in the weekend maybe later today.

@gdluca
Copy link

gdluca commented Sep 26, 2024

It didn't work for me. Hope it stays stable for you so I know it is something on my side.

Linux alarm 6.6.51-3-rpi-16k-mt76usbfix #1 SMP PREEMPT Thu Sep 26 12:18:55 UTC 2024 aarch64 GNU/Linux

[root@alarm ~]# cat /sys/kernel/debug/ieee80211/*/mt76/*-queues
     queue | hw-queued |      head |      tail |
         0 |       128 |       113 |       113 |
         1 |       128 |         0 |         0 |
     queue | hw-queued |      head |      tail |
         0 |         0 |        39 |        39 |
         1 |         0 |         0 |         0 |
         2 |       224 |       104 |       136 |
         3 |         0 |         0 |         0 |
         4 |         0 |        39 |        39 |

edit: after 18 minutes, AP is up again automatically (with pending tx errors in dmesg).

@whitslack
Copy link

It didn't work for me.

You might also need the status worker spinlock fix, and you could try putting all outgoing traffic on the same endpoint.

--- a/drivers/net/wireless/mediatek/mt76/usb.c
+++ b/drivers/net/wireless/mediatek/mt76/usb.c
@@ -772,6 +772,7 @@
 		if (!q)
 			continue;
 
+		spin_lock(&q->lock);
 		while (q->queued > 0) {
 			if (!q->entry[q->tail].done)
 				break;
@@ -779,11 +780,13 @@
 			entry = q->entry[q->tail];
 			q->entry[q->tail].done = false;
 
+			spin_unlock(&q->lock);
 			mt76_queue_tx_complete(dev, q, &entry);
+			spin_lock(&q->lock);
 		}
-
 		if (!q->queued)
 			wake_up(&dev->tx_wait);
+		spin_unlock(&q->lock);
 
 		mt76_worker_schedule(&dev->tx_worker);
 	}
@@ -923,7 +926,7 @@
 		return lmac_queue_map[ac];
 	}
 
-	return mt76_ac_to_hwq(ac);
+	return mt76_chip(dev) == 0x7961 ? 0 : mt76_ac_to_hwq(ac);
 }
 
 static int mt76u_alloc_tx(struct mt76_dev *dev)

Hope it stays stable for you so I know it is something on my side.

I suspect I'm not entirely avoiding the problem in the firmware, merely reducing the probability of hitting it. My wild guess is that they have a reëntrancy problem in their code, such that processing of a packet arriving on one endpoint gets interrupted by the arrival of a packet on a second endpoint, and the processing of the second packet corrupts the internal state of the processing of the first packet. They would need to disable interrupts on all endpoints while processing a packet received on any endpoint. (This is an absolutely wild guess made with no knowledge of their architecture, so I could be way off base.)

You might also like to enhance your debugfs output with additional details to help elucidate this issue:

--- a/drivers/net/wireless/mediatek/mt76/debugfs.c
+++ b/drivers/net/wireless/mediatek/mt76/debugfs.c
@@ -56,15 +56,20 @@
 	struct mt76_dev *dev = dev_get_drvdata(s->private);
 	int i;
 
-	seq_puts(s, "     queue | hw-queued |      head |      tail |\n");
+	seq_puts(s, "     queue | hw-idx | hw-queued |     first |      head |      tail |\n");
 	for (i = 0; i < ARRAY_SIZE(dev->phy.q_tx); i++) {
 		struct mt76_queue *q = dev->phy.q_tx[i];
 
 		if (!q)
 			continue;
 
-		seq_printf(s, " %9d | %9d | %9d | %9d |\n",
-			   i, q->queued, q->head, q->tail);
+		seq_printf(s, " %9d | %6d | %9d | %9d | %9d | %9d | ",
+			   i, q->hw_idx, q->queued, q->first, q->head, q->tail);
+		if (q->queued)
+			for (int j = q->tail; j != q->head; j = (j + 1) % q->ndesc)
+				seq_printf(s, "%c", q->entry[j].done ? 'd' : '.');
+		seq_printf(s, "%s%s\n",
+			   q->stopped ? " stopped" : "", q->blocked ? " blocked" : "");
 	}
 
 	return 0;

@morrownr
Copy link
Owner

@Simon566

i think i can answer the last issue with bluetooth crash myself here is a patch...

That bluetooth problem has been a problem since late last year and was reported in Jan. of this year. Glad to see the recognition that it is a bluetooth problem and not a wifi problem. Hopefully this patch fixes the problem and we can put this behind us. It seems to have taken the involvement of several Mediatek devs to get to the bottom it.

@morrownr
Copy link
Owner

@Krobix

What are you trying to run? WiFi 6 on 5 GHz?

Can you post your hostapd,conf

@Krobix
Copy link

Krobix commented Oct 19, 2024

Here is my new conf: https://pastebin.com/qM7f2k4s

@morrownr
Copy link
Owner

vht_oper_chwidth=0
vht_oper_centr_freq_seg0_idx=42

0 and 42 do not go together. 42 is for 80 MHz channel width. If you really want a 40 MHz channel width, you need to look in the docs to see what to set.

Look below and see what you are missing for he:

# IEEE 802.11ax (WiFi 6) configuration
ieee80211ax=1
# Channel width (0 = 40 MHz. 1 = 80 Mhz)
# For WiFi 6 on band 1 (2.4 GHz) comment out the below line
he_oper_chwidth=1
# HE center channel (chan + 6)
# For WiFi 6 on band 1 (2.4 GHz) comment out the below line
he_oper_centr_freq_seg0_idx=42

There are some other issues as well. Recommend you look again at the working example that I pointed you toward.

@Krobix
Copy link

Krobix commented Oct 20, 2024

Thank you, after looking at the documentation again, I fixed this issue.

@Simon566
Copy link
Author

Simon566 commented Oct 22, 2024 via email

@morrownr
Copy link
Owner

Possibly. I am watching a few things right now regarding bluetooth and usb because we have seen a lot of problems in the usb wifi adapter world since kernel 6.6.

@Simon566
Copy link
Author

Hi,

here is the crash when plugging in the ALFA mt7921u device with a vanilla 6.11.5 kernel.

i will try the bluetooth patch mentioned before soon:

[Fr Okt 25 17:35:50 2024] usbcore: deregistering interface driver mt7921u
[Fr Okt 25 17:37:02 2024] usbcore: registered new interface driver mt7921u
[Fr Okt 25 17:42:51 2024] usb 2-2: new SuperSpeed USB device number 7 using xhci_hcd
[Fr Okt 25 17:42:51 2024] usb 2-2: New USB device found, idVendor=0e8d, idProduct=7961, bcdDevice= 1.00
[Fr Okt 25 17:42:51 2024] usb 2-2: New USB device strings: Mfr=6, Product=7, SerialNumber=8
[Fr Okt 25 17:42:51 2024] usb 2-2: Product: Wireless_Device
[Fr Okt 25 17:42:51 2024] usb 2-2: Manufacturer: MediaTek Inc.
[Fr Okt 25 17:42:51 2024] usb 2-2: SerialNumber: 000000000
[Fr Okt 25 17:42:56 2024] Bluetooth: hci0: Failed to write uhw reg(-110)
[Fr Okt 25 17:42:56 2024] Bluetooth: hci0: HW/SW Version: 0x008a008a, Build Time: 20240826151221
[Fr Okt 25 17:43:06 2024] Bluetooth: hci0: Execution of wmt command timed out
[Fr Okt 25 17:43:06 2024] Bluetooth: hci0: Failed to send wmt patch dwnld (-110)
[Fr Okt 25 17:43:06 2024] Bluetooth: hci0: Failed to set up firmware (-110)
[Fr Okt 25 17:43:06 2024] Bluetooth: hci0: HCI Enhanced Setup Synchronous Connection command is advertised, but not supported.
[Fr Okt 25 17:43:06 2024] usb 2-2: Disable of device-initiated U1 failed.
[Fr Okt 25 17:43:12 2024] usb 2-2: Disable of device-initiated U2 failed.
[Fr Okt 25 17:43:12 2024] BUG: kernel NULL pointer dereference, address: 00000000000000c8
[Fr Okt 25 17:43:12 2024] #PF: supervisor write access in kernel mode
[Fr Okt 25 17:43:12 2024] #PF: error_code(0x0002) - not-present page
[Fr Okt 25 17:43:12 2024] PGD 0 P4D 0
[Fr Okt 25 17:43:12 2024] Oops: Oops: 0002 [#1] PREEMPT SMP NOPTI
[Fr Okt 25 17:43:12 2024] CPU: 1 UID: 0 PID: 30979 Comm: kworker/1:2 Tainted: G U W OE 6.11.5 #3
[Fr Okt 25 17:43:12 2024] Tainted: [U]=USER, [W]=WARN, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
[Fr Okt 25 17:43:12 2024] Hardware name: HARDKERNEL ODROID-H3/ODROID-H3, BIOS 5.19 07/19/2023
[Fr Okt 25 17:43:12 2024] Workqueue: usb_hub_wq hub_event [usbcore]
[Fr Okt 25 17:43:12 2024] RIP: 0010:btusb_disconnect+0x2c/0x170 [btusb]
[Fr Okt 25 17:43:12 2024] Code: 1e fa 0f 1f 44 00 00 41 54 55 48 89 fd 53 48 8b 9f c8 00 00 00 0f 1f 44 00 00 48 85 db 0f 84 ac 00 00 00 48 8b 43 10 4c 8b 23 <48> c7 80 c8 00 00 00 00 00 00 00 48 8b 43 18 48 85 c0 74 0b 48 c7
[Fr Okt 25 17:43:12 2024] RSP: 0018:ffffa9a5c3eaf7a0 EFLAGS: 00010286
[Fr Okt 25 17:43:12 2024] RAX: 0000000000000000 RBX: ffff8e6c04021828 RCX: 0000000000000000
[Fr Okt 25 17:43:12 2024] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8e6c200dec00
[Fr Okt 25 17:43:12 2024] RBP: ffff8e6c200dec00 R08: 0000000000000000 R09: 0000000000000000
[Fr Okt 25 17:43:12 2024] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[Fr Okt 25 17:43:12 2024] R13: ffffffffc230f278 R14: 0000000000000080 R15: ffff8e6c200dec50
[Fr Okt 25 17:43:12 2024] FS: 0000000000000000(0000) GS:ffff8e6f6fe80000(0000) knlGS:0000000000000000
[Fr Okt 25 17:43:12 2024] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Fr Okt 25 17:43:12 2024] CR2: 00000000000000c8 CR3: 000000026ec22000 CR4: 0000000000350ef0
[Fr Okt 25 17:43:12 2024] Call Trace:
[Fr Okt 25 17:43:12 2024]
[Fr Okt 25 17:43:12 2024] ? __die+0x23/0x80
[Fr Okt 25 17:43:12 2024] ? page_fault_oops+0x173/0x5d0
[Fr Okt 25 17:43:12 2024] ? exc_page_fault+0x76/0x1b0
[Fr Okt 25 17:43:12 2024] ? asm_exc_page_fault+0x26/0x30
[Fr Okt 25 17:43:12 2024] ? btusb_disconnect+0x2c/0x170 [btusb]
[Fr Okt 25 17:43:12 2024] usb_unbind_interface+0x95/0x2d0 [usbcore]
[Fr Okt 25 17:43:12 2024] ? preempt_count_add+0x53/0xc0
[Fr Okt 25 17:43:12 2024] device_release_driver_internal+0x1a7/0x210
[Fr Okt 25 17:43:12 2024] usb_forced_unbind_intf+0x5d/0xc0 [usbcore]
[Fr Okt 25 17:43:12 2024] usb_reset_device+0xec/0x2a0 [usbcore]
[Fr Okt 25 17:43:12 2024] mt7921u_probe+0xab/0x240 [mt7921u]
[Fr Okt 25 17:43:12 2024] usb_probe_interface+0xe5/0x310 [usbcore]
[Fr Okt 25 17:43:12 2024] really_probe+0xd3/0x390
[Fr Okt 25 17:43:12 2024] ? __pfx___device_attach_driver+0x10/0x10
[Fr Okt 25 17:43:12 2024] __driver_probe_device+0x78/0x160
[Fr Okt 25 17:43:12 2024] driver_probe_device+0x1f/0xa0
[Fr Okt 25 17:43:12 2024] __device_attach_driver+0x99/0x130
[Fr Okt 25 17:43:12 2024] bus_for_each_drv+0x92/0xf0
[Fr Okt 25 17:43:12 2024] __device_attach+0xb2/0x1c0
[Fr Okt 25 17:43:12 2024] bus_probe_device+0x8d/0xb0
[Fr Okt 25 17:43:12 2024] device_add+0x698/0x8a0
[Fr Okt 25 17:43:12 2024] usb_set_configuration+0x632/0x9c0 [usbcore]
[Fr Okt 25 17:43:12 2024] usb_generic_driver_probe+0x58/0x80 [usbcore]
[Fr Okt 25 17:43:12 2024] usb_probe_device+0x46/0x130 [usbcore]
[Fr Okt 25 17:43:12 2024] really_probe+0xd3/0x390
[Fr Okt 25 17:43:12 2024] ? __pfx___device_attach_driver+0x10/0x10
[Fr Okt 25 17:43:12 2024] __driver_probe_device+0x78/0x160
[Fr Okt 25 17:43:12 2024] driver_probe_device+0x1f/0xa0
[Fr Okt 25 17:43:12 2024] __device_attach_driver+0x99/0x130
[Fr Okt 25 17:43:12 2024] bus_for_each_drv+0x92/0xf0
[Fr Okt 25 17:43:12 2024] __device_attach+0xb2/0x1c0
[Fr Okt 25 17:43:12 2024] bus_probe_device+0x8d/0xb0
[Fr Okt 25 17:43:12 2024] device_add+0x698/0x8a0
[Fr Okt 25 17:43:12 2024] usb_new_device+0x292/0x510 [usbcore]
[Fr Okt 25 17:43:12 2024] hub_event+0x1292/0x1c70 [usbcore]
[Fr Okt 25 17:43:12 2024] ? __pm_runtime_suspend+0x4a/0xd0
[Fr Okt 25 17:43:12 2024] process_one_work+0x179/0x390
[Fr Okt 25 17:43:12 2024] worker_thread+0x262/0x370
[Fr Okt 25 17:43:12 2024] ? _raw_spin_lock_irqsave+0x27/0x80
[Fr Okt 25 17:43:12 2024] ? __pfx_worker_thread+0x10/0x10
[Fr Okt 25 17:43:12 2024] kthread+0xde/0x110
[Fr Okt 25 17:43:12 2024] ? __pfx_kthread+0x10/0x10
[Fr Okt 25 17:43:12 2024] ret_from_fork+0x31/0x50
[Fr Okt 25 17:43:12 2024] ? __pfx_kthread+0x10/0x10
[Fr Okt 25 17:43:12 2024] ret_from_fork_asm+0x1a/0x30
[Fr Okt 25 17:43:12 2024]
[Fr Okt 25 17:43:12 2024] Modules linked in: mt7921u mt792x_usb mt7921_common mt792x_lib mt76_connac_lib mt76_usb mt76 mac80211 btusb btrtl btintel btbcm libarc4 btmtk 88XXau(OE) it87(OE) cls_matchall sch_ingress sch_cake act_mirred ifb nft_masq nft_nat nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 veth wireguard libchacha20poly1305 chacha_x86_64 poly1305_x86_64 curve25519_x86_64 libcurve25519_generic libchacha ip6_udp_tunnel udp_tunnel pppoe pppox nf_tables crc32c_generic nfnetlink ppp_generic slhc nvme_fabrics bluetooth crc16 msr bridge cfg80211 rfkill binfmt_misc snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic snd_hda_scodec_component i915 x86_pkg_temp_thermal intel_powerclamp kvm_intel snd_sof_pci_intel_icl snd_sof_pci_intel_cnl snd_sof_intel_hda_generic soundwire_intel kvm soundwire_generic_allocation soundwire_cadence snd_sof_intel_hda_common snd_sof_intel_hda_mlink snd_sof_intel_hda snd_sof_pci snd_sof_xtensa_dsp nls_ascii snd_sof snd_sof_utils snd_soc_hdac_hda snd_hda_ext_core
[Fr Okt 25 17:43:12 2024] nls_cp437 snd_soc_acpi_intel_match snd_soc_acpi snd_soc_core snd_compress crct10dif_pclmul soundwire_bus vfat snd_hda_intel snd_intel_dspcfg fat snd_intel_sdw_acpi ghash_clmulni_intel snd_hda_codec sha512_ssse3 cec sha256_ssse3 mei_hdcp sha1_ssse3 intel_rapl_msr mei_pxp rc_core snd_hda_core processor_thermal_device_pci_legacy processor_thermal_device drm_buddy snd_hwdep processor_thermal_wt_hint drm_display_helper processor_thermal_rfim snd_pcm aesni_intel snd_timer processor_thermal_rapl gf128mul crypto_simd ttm intel_rapl_common cryptd mei_me snd iTCO_wdt processor_thermal_wt_req intel_pmc_bxt drm_kms_helper processor_thermal_power_floor iTCO_vendor_support ftdi_sio mei processor_thermal_mbox int340x_thermal_zone watchdog intel_cstate soundcore pcspkr ee1004 wmi_bmof usbserial i2c_algo_bit intel_soc_dts_iosf intel_pmc_core evdev intel_vsec pmt_telemetry joydev button pmt_class acpi_pad acpi_tad nfsd sg auth_rpcgss nfs_acl hwmon_vid lockd emc2103 drm grace sunrpc dm_mod coretemp 8021q garp stp loop fuse
[Fr Okt 25 17:43:12 2024] efi_pstore mrp llc configfs efivarfs ip_tables x_tables autofs4 hid_logitech_hidpp hid_logitech_dj hid_generic usbhid hid btrfs xor raid6_pq libcrc32c sd_mod xhci_pci xhci_hcd r8169 ahci libahci realtek libata mdio_devres nvme usbcore i2c_i801 libphy i2c_smbus scsi_mod crc32_pclmul nvme_core intel_lpss_pci intel_lpss idma64 crc32c_intel scsi_common usb_common video wmi [last unloaded: mt7921u]
[Fr Okt 25 17:43:12 2024] CR2: 00000000000000c8
[Fr Okt 25 17:43:12 2024] ---[ end trace 0000000000000000 ]---
[Fr Okt 25 17:43:12 2024] RIP: 0010:btusb_disconnect+0x2c/0x170 [btusb]
[Fr Okt 25 17:43:12 2024] Code: 1e fa 0f 1f 44 00 00 41 54 55 48 89 fd 53 48 8b 9f c8 00 00 00 0f 1f 44 00 00 48 85 db 0f 84 ac 00 00 00 48 8b 43 10 4c 8b 23 <48> c7 80 c8 00 00 00 00 00 00 00 48 8b 43 18 48 85 c0 74 0b 48 c7
[Fr Okt 25 17:43:12 2024] RSP: 0018:ffffa9a5c3eaf7a0 EFLAGS: 00010286
[Fr Okt 25 17:43:12 2024] RAX: 0000000000000000 RBX: ffff8e6c04021828 RCX: 0000000000000000
[Fr Okt 25 17:43:12 2024] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8e6c200dec00
[Fr Okt 25 17:43:12 2024] RBP: ffff8e6c200dec00 R08: 0000000000000000 R09: 0000000000000000
[Fr Okt 25 17:43:12 2024] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[Fr Okt 25 17:43:12 2024] R13: ffffffffc230f278 R14: 0000000000000080 R15: ffff8e6c200dec50
[Fr Okt 25 17:43:12 2024] FS: 0000000000000000(0000) GS:ffff8e6f6fe80000(0000) knlGS:0000000000000000
[Fr Okt 25 17:43:12 2024] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Fr Okt 25 17:43:12 2024] CR2: 00000000000000c8 CR3: 000000010c4a2000 CR4: 0000000000350ef0
[Fr Okt 25 17:43:12 2024] note: kworker/1:2[30979] exited with irqs disabled

regards,
Simon

@morrownr
Copy link
Owner

@Simon566

This looks like the bluetooth problem some adapters have been experiencing. It is a bluetooth issue and not a wifi issue. Yes, there are patches in progress. You are welcomed to try to the patch but to confirm it is the same bluetooth problem we have been seeing, simply delete the mt7921u bluetooth firmware file. mt7921u uses two wifi firmware files and there's a bluetooth firmware file. Try deleting it and see what happens.

@Simon566
Copy link
Author

Simon566 commented Oct 25, 2024 via email

@Simon566
Copy link
Author

Simon566 commented Oct 26, 2024 via email

@whitslack
Copy link

@Simon566: Please open a new issue report for each distinct issue you're experiencing. Nothing you've been reporting is related to the timed out waiting for pending tx error that this issue report concerns. Sorting discussion onto relevant GitHub issues will cut down on notification noise. Thanks.

@Krobix
Copy link

Krobix commented Dec 4, 2024

Hi,

I realize that I had forgotten to check back on this issue, simply because the error had disappeared completely on my end after I fixed the hostapd configuration. As of the significance of this, I do not know, but I figured that i should at least update.

Thanks

@callegar
Copy link

Unclear if it is the same issue, but I also see a similar device hang with pending TX on my Bus 004 Device 005: ID 0e8d:7961 MediaTek Inc. Wireless_Device. What is interesting is that everybody here seems to see the issue when using the device as an AP, while I see it using the device in infrastructure mode (connecting to an AP). This is when I generate a lot of traffic (typically doing backups).

@whitslack
Copy link

What is interesting is that everybody here seems to see the issue when using the device as an AP, while I see it using the device in infrastructure mode (connecting to an AP).

@callegar: There's nothing about the bug that's inherent to AP mode. Any time you're using multiple hardware transmit queues concurrently you can trigger it. Does my patch work around the problem for you too?

@callegar
Copy link

I still need to test it. Building custom kernels for my distro is unfortunately not the well documented task it should.

@ttrei
Copy link

ttrei commented Jan 5, 2025

What is interesting is that everybody here seems to see the issue when using the device as an AP, while I see it using the device in infrastructure mode (connecting to an AP).

@callegar: There's nothing about the bug that's inherent to AP mode. Any time you're using multiple hardware transmit queues concurrently you can trigger it. Does my patch work around the problem for you too?

I'm using the receiver in infrastructure mode and the patch seems to work.
Fenvi FU-AX1800
NixOS x86-64
kernel 6.6.68

Previously I could somewhat reliably trigger the failure (see below).
It would then re-connect but the connection was clearly degraded - maybe due to some interaction with wireguard vpn, didn't investigate further.
After applying the patch I haven't been able to trigger the failure for two days now.

[   76.611332] mt7921u 3-2:1.0: Message 00002ced (seq 4) timeout
[   76.835321] mt7921u 3-2:1.0: timed out waiting for pending tx
[   76.864133] ------------[ cut here ]------------
[   76.864136] WARNING: CPU: 7 PID: 241 at kernel/kthread.c:661 kthread_park+0x8d/0xa0
[   76.864141] Modules linked in: xt_connmark xt_mark xt_comment xt_addrtype wireguard curve25519_x86_64 libchacha20poly1305 chacha_x86_64 poly1305_x86_64 libcurve25519_generic libchacha ip6_udp_tunnel udp_tunnel rfcomm bluetooth ecdh_generic ecc ccm af_packet nls_iso8859_1 nls_cp437 vfat fat amdgpu snd_hda_codec_realtek snd_hda_codec_generic drm_exec amdxcp drm_buddy ledtrig_audio gpu_sched drm_suballoc_helper drm_ttm_helper edac_mce_amd intel_rapl_msr edac_core snd_hda_codec_hdmi ttm snd_usb_audio intel_rapl_common drm_display_helper snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi crc32_pclmul snd_usbmidi_lib cec polyval_clmulni snd_ump snd_hda_codec polyval_generic i2c_algo_bit gf128mul snd_rawmidi ghash_clmulni_intel sha512_ssse3 snd_seq_device sha256_ssse3 snd_hda_core wmi_bmof mc gigabyte_wmi snd_hwdep r8169 sha1_ssse3 sp5100_tco snd_pcm aesni_intel crypto_simd realtek snd_timer cryptd mdio_devres watchdog snd libphy rapl soundcore i2c_piix4 k10temp video thermal input_leds joydev mousedev wmi backlight
[   76.864204]  gpio_amdpt mac_hid evdev gpio_generic tiny_power_button button xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip6t_rpfilter ipt_rpfilter xt_pkttype xt_LOG nf_log_syslog xt_tcpudp nft_compat nf_tables libcrc32c sch_fq_codel uinput atkbd libps2 serio vivaldi_fmap loop tun tap macvlan bridge stp llc ccp rng_core kvm irqbypass fuse efi_pstore configfs nfnetlink efivarfs dmi_sysfs ip_tables x_tables autofs4 ext4 crc32c_generic crc16 mbcache jbd2 hid_logitech_hidpp hid_logitech_dj hid_generic sd_mod usbhid t10_pi hid crc64_rocksoft crc64 crc_t10dif crct10dif_generic ahci libahci libata xhci_pci xhci_pci_renesas xhci_hcd scsi_mod crct10dif_pclmul crct10dif_common crc32c_intel scsi_common rtc_cmos dm_mod dax mt7921u mt7921_common mt792x_lib mt76_connac_lib mt792x_usb mt76_usb mt76 led_class mac80211 libarc4 cfg80211 rfkill
[   76.864267] CPU: 7 PID: 241 Comm: kworker/u64:6 Not tainted 6.6.67 #1-NixOS
[   76.864269] Hardware name: Gigabyte Technology Co., Ltd. B550M S2H/B550M S2H, BIOS F14 01/04/2022
[   76.864270] Workqueue: mt76 mt7921_mac_reset_work [mt7921_common]
[   76.864278] RIP: 0010:kthread_park+0x8d/0xa0
[   76.864280] Code: 5b 5d 31 f6 31 ff e9 ad fc e0 00 0f 0b 48 8b ab c8 06 00 00 a8 04 74 a8 0f 0b b8 da ff ff ff 5b 5d 31 f6 31 ff e9 8e fc e0 00 <0f> 0b b8 f0 ff ff ff eb cd 0f 0b eb c7 66 0f 1f 44 00 00 90 90 90
[   76.864282] RSP: 0018:ffffc90000c87d60 EFLAGS: 00010202
[   76.864283] RAX: 0000000000000004 RBX: ffff88810903c800 RCX: 0000000000000000
[   76.864285] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88810903c800
[   76.864286] RBP: ffff888100e8c400 R08: 0000000000000000 R09: 0000000000000000
[   76.864287] R10: 0000000000000000 R11: 0000000000000000 R12: ffff888108cb20b8
[   76.864288] R13: ffff888108cb20d8 R14: ffff888108cb20d8 R15: ffff888108c782a8
[   76.864289] FS:  0000000000000000(0000) GS:ffff88842dd80000(0000) knlGS:0000000000000000
[   76.864290] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   76.864299] CR2: 00007f6175ec522c CR3: 0000000165dae000 CR4: 0000000000f50ee0
[   76.864301] PKRU: 55555554
[   76.864302] Call Trace:
[   76.864303]  <TASK>
[   76.864305]  ? kthread_park+0x8d/0xa0
[   76.864307]  ? __warn+0x81/0x130
[   76.864312]  ? kthread_park+0x8d/0xa0
[   76.864315]  ? report_bug+0x182/0x1b0
[   76.864319]  ? handle_bug+0x60/0xa0
[   76.864322]  ? exc_invalid_op+0x17/0x80
[   76.864324]  ? asm_exc_invalid_op+0x1a/0x20
[   76.864330]  ? kthread_park+0x8d/0xa0
[   76.864333]  mt76u_stop_tx+0x206/0x2e0 [mt76_usb]
[   76.864338]  ? __pfx_autoremove_wake_function+0x10/0x10
[   76.864343]  mt7921u_mac_reset+0x72/0x1b0 [mt7921u]
[   76.864347]  mt7921_mac_reset_work+0xa2/0x190 [mt7921_common]
[   76.864353]  process_one_work+0x17d/0x350
[   76.864358]  worker_thread+0x28c/0x3b0
[   76.864360]  ? __pfx_worker_thread+0x10/0x10
[   76.864362]  kthread+0xe8/0x120
[   76.864364]  ? __pfx_kthread+0x10/0x10
[   76.864366]  ret_from_fork+0x34/0x50
[   76.864370]  ? __pfx_kthread+0x10/0x10
[   76.864372]  ret_from_fork_asm+0x1b/0x30
[   76.864377]  </TASK>
[   76.864378] ---[ end trace 0000000000000000 ]---
[   77.028545] mt7921u 3-2:1.0: HW/SW Version: 0x8a108a10, Build Time: 20241106151007a

[   77.043058] mt7921u 3-2:1.0: WM Firmware Version: ____010000, Build Time: 20241106151045
[   79.156517] wlp4s0f3u2: Driver requested disconnection from AP 28:ee:52:d6:35:b4
[   83.843321] mt7921u 3-2:1.0: Message 00020003 (seq 12) timeout
[   86.851427] mt7921u 3-2:1.0: Message 00020002 (seq 13) timeout
[   92.195318] mt7921u 3-2:1.0: timed out waiting for pending tx
[   92.384990] mt7921u 3-2:1.0: HW/SW Version: 0x8a108a10, Build Time: 20241106151007a

[   92.398139] mt7921u 3-2:1.0: WM Firmware Version: ____010000, Build Time: 20241106151045
[   96.797518] wlp4s0f3u2: authenticate with 28:ee:52:d6:35:b4
[   96.947131] wlp4s0f3u2: send auth to 28:ee:52:d6:35:b4 (try 1/3)
[   96.949988] wlp4s0f3u2: authenticated
[   96.955330] wlp4s0f3u2: associate with 28:ee:52:d6:35:b4 (try 1/3)
[   96.960106] wlp4s0f3u2: RX AssocResp from 28:ee:52:d6:35:b4 (capab=0x431 status=0 aid=5)
[   96.974529] wlp4s0f3u2: associated

@ttrei
Copy link

ttrei commented Jan 10, 2025

After a week without problems it triggered again :/
@whitslack, i can gather some debug logs if you are interested.

@whitslack
Copy link

@ttrei: Sorry to hear that. I am not able to get any further into this. Have you tried any kernel versions that include upstream's reworking of the queues in the mt7921u driver?

@ttrei
Copy link

ttrei commented Jan 10, 2025

Haven't tried newer versions yet, maybe later this weekend. But this is not a pressing issue for me, and i don't have a reliable way to reproduce the failure.

@ttrei
Copy link

ttrei commented Jan 20, 2025

Update: i've been running kernel 6.12 for a while now with similar network load patterns as before, and the driver hasn't crashed yet.

@morrownr
Copy link
Owner

morrownr commented Jan 21, 2025

@ttrei

What firmware version are you using?

$ ethtool -i interface name

@ttrei
Copy link

ttrei commented Jan 21, 2025

$ ethtool -i wlp1s0f0u2
driver: mt7921u
version: 6.12.7
firmware-version: ____010000-20241106151045
expansion-rom-version:
bus-info: 1-2:1.0
supports-statistics: yes
supports-test: no
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: no

@Simon566
Copy link
Author

i will try and give it a shot . im on 6.12.10 , but im experiencing the bluetooth oops with the ALFA adapter .....

the firmware is the same on my side

@Simon566
Copy link
Author

$ ethtool -i wlp1s0f0u2
driver: mt7921u
version: 6.12.7
firmware-version: ____010000-20241106151045
expansion-rom-version:
bus-info: 1-2:1.0
supports-statistics: yes
supports-test: no
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: no

can you please post the reworked patch for 6.12.7 ? the original patch doesnt apply anymore :-(

@ttrei
Copy link

ttrei commented Jan 21, 2025

@Simon566
i didn't clarify - i'm using Linux 6.12.7 kernel without any patches

@Simon566
Copy link
Author

@ttrei but you are running in client mode correct , or also AP mode ?

@ttrei
Copy link

ttrei commented Jan 21, 2025

only as a client, haven't used the adapter as access point

@morrownr
Copy link
Owner

@Simon566

but im experiencing the bluetooth oops with the ALFA adapter .....

I decided to start a FAQ and the first thing I added is the current workaround for the Bluetooth Opps problem:

https://github.com/morrownr/USB-WiFi/blob/main/home/FAQ.md

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

8 participants