Properly fix wifi firmware crashes #70

jonas2515 · 2020-11-12T12:15:57Z

No description provided.

jonas2515 · 2020-11-12T13:16:53Z

Added some reverts for the commits which should no longer be needed now.

This needs broader testing on more devices though, especially now that we use host-sleep while suspended again.

kitakar5525 · 2020-11-12T13:46:39Z

About the ps_mode:
ps_mode on a 5 GHz AP is too unstable and almost unusable on my SB1 and S3. This is not only a fw crash issue but also a connection stability issue. So, testing by many people on various devices with both 2.4 GHz and 5 GHz APs is desirable.

About the Host Sleep:
Is S0ix still working on your SP5? Or do you mean to use it with the bridge reset?

jonas2515 · 2020-11-12T14:04:50Z

Is S0ix still working on your SP5? Or do you mean to use it with the bridge reset?

Yeah, I'm always applying the bridge reset, sorry. I'll open another PR for the bridge reset quirk; if we want to go back to host-sleep, there's no way around it.

jonas2515 · 2020-11-12T14:25:00Z

PR for the bridge-reset quirk: #72

drivers/net/wireless/marvell/mwifiex/sta_ioctl.c

kitakar5525 · 2020-11-12T17:12:52Z

The Host Sleep suspend method itself won't cause fw crashes as far as I know. The problems are AP scanning after suspend and S0ix (which depends on #72 to fix). (EDIT: what caused fw crashes before is bridge_d3.)
So, what about moving this change to a new PR, or just changing the title of this PR to something like "mwifiex improvements for stability"?

Also, we need to test whether the method really no longer causes AP scanning failures after suspend. As I said on the mailing list, at a glance, it won't happen anymore.

Christoph Paasch reported following crash:

dst_release underflow
WARNING: CPU: 0 PID: 1319 at net/core/dst.c:175 dst_release+0xc1/0xd0 net/core/dst.c:175
CPU: 0 PID: 1319 Comm: syz-executor217 Not tainted 5.11.0-rc6af8e85128b4d0d24083c5cac646e891227052e0c #70
Call Trace:
 rt_cache_route+0x12e/0x140 net/ipv4/route.c:1503
 rt_set_nexthop.constprop.0+0x1fc/0x590 net/ipv4/route.c:1612
 __mkroute_output net/ipv4/route.c:2484 [inline]
...

The worker leaves msk->subflow alone even when it happened to close the subflow ssk associated with it.

Fixes: 866f26f ("mptcp: always graft subflow socket to parent")
Closes: multipath-tcp/mptcp_net-next#157
Reported-by: Christoph Paasch <cpaasch@apple.com>
Suggested-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

On the 88W8897 card it's very important the TX ring write pointer is updated correctly to its new value before setting the TX ready interrupt, otherwise the firmware appears to crash (probably because it's trying to DMA-read from the wrong place).

Since PCI uses "posted writes" when writing to a register, it's not guaranteed that a write will happen immediately. That means the pointer might be outdated when setting the TX ready interrupt, leading to firmware crashes especially when ASPM L1 and L1 substates are enabled (because of the higher link latency, the write will probably take longer).

So fix those firmware crashes by always forcing non-posted writes. We do that by simply reading back the register after writing it, just as a lot of other drivers do.

There are two reproducers that are fixed with this patch:

1) During rx/tx traffic and with ASPM L1 substates enabled (the enabled substates are platform dependent), the firmware crashes and eventually a command timeout appears in the logs. That crash is fixed by using a non-posted write in mwifiex_pcie_send_data().

2) When sending lots of commands to the card, waking it up from sleep in very quick intervals, the firmware eventually crashes. That crash appears to be fixed by some other non-posted write included here.
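The read-back idea in this commit message can be illustrated with a small user-space model. This is only a sketch under stated assumptions: `mock_dev`, `mock_write`, `mock_read`, `mock_write_flush`, and `mock_demo` are hypothetical names invented for illustration, not mwifiex or kernel APIs, and the model simply assumes that the bus completes every queued write before it answers a read, which is the property the read-back relies on.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MOCK_NUM_REGS  4
#define MOCK_QUEUE_LEN 8

/* Hypothetical model of a device behind a bus that buffers ("posts") writes. */
struct mock_dev {
    uint32_t regs[MOCK_NUM_REGS];         /* values as the device sees them */
    size_t   pending_reg[MOCK_QUEUE_LEN]; /* posted writes still in flight */
    uint32_t pending_val[MOCK_QUEUE_LEN];
    size_t   n_pending;
};

/* Deliver all queued writes to the device, oldest first. */
static void mock_drain(struct mock_dev *d)
{
    for (size_t i = 0; i < d->n_pending; i++)
        d->regs[d->pending_reg[i]] = d->pending_val[i];
    d->n_pending = 0;
}

/* Posted write: queued, not yet visible to the device.
 * The caller is expected to pass reg < MOCK_NUM_REGS. */
static void mock_write(struct mock_dev *d, size_t reg, uint32_t val)
{
    if (d->n_pending == MOCK_QUEUE_LEN)
        mock_drain(d); /* queue full: the model delivers what it holds */
    d->pending_reg[d->n_pending] = reg;
    d->pending_val[d->n_pending] = val;
    d->n_pending++;
}

/* Read: in this model, earlier writes must complete before the read does. */
static uint32_t mock_read(struct mock_dev *d, size_t reg)
{
    mock_drain(d);
    return d->regs[reg];
}

/* Write, then read the same register back so the write has reached
 * the device by the time this function returns. */
static void mock_write_flush(struct mock_dev *d, size_t reg, uint32_t val)
{
    mock_write(d, reg, val);
    (void)mock_read(d, reg);
}

/* Demonstration of the difference between the two write helpers. */
static void mock_demo(void)
{
    struct mock_dev d = {0};

    mock_write(&d, 1, 0x10);
    assert(d.regs[1] == 0);        /* plain write is still in flight */

    mock_write_flush(&d, 1, 0x20);
    assert(d.regs[1] == 0x20);     /* read-back forced delivery */
    assert(d.n_pending == 0);
}
```

In this model, a "ready" notification sent right after `mock_write()` could reach a device that still holds the old pointer value, while one sent after `mock_write_flush()` could not. Real hardware timing is more involved than this queue, so the sketch only shows the ordering argument, not the driver change itself.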

jonas2515 · 2021-03-27T13:45:20Z

Okay, updated this one to only include the most important fix; let's do the re-enabling of host-sleep in a separate MR.

The firmware shouldn't crash anymore now that we have a proper fix (or at least not that often), so hopefully we don't need this anymore. This reverts commit 52750be.

jonas2515 · 2021-03-27T14:15:13Z

Okay, opened #89 for reenabling the host sleep, and opened #90 (which we maybe shouldn't merge yet) for reenabling power saving.

qzed

Thanks!

[ Upstream commit 17aee05 ]

Christoph Paasch reported following crash:

dst_release underflow
WARNING: CPU: 0 PID: 1319 at net/core/dst.c:175 dst_release+0xc1/0xd0 net/core/dst.c:175
CPU: 0 PID: 1319 Comm: syz-executor217 Not tainted 5.11.0-rc6af8e85128b4d0d24083c5cac646e891227052e0c #70
Call Trace:
 rt_cache_route+0x12e/0x140 net/ipv4/route.c:1503
 rt_set_nexthop.constprop.0+0x1fc/0x590 net/ipv4/route.c:1612
 __mkroute_output net/ipv4/route.c:2484 [inline]
...

The worker leaves msk->subflow alone even when it happened to close the subflow ssk associated with it.

Fixes: 866f26f ("mptcp: always graft subflow socket to parent")
Closes: multipath-tcp/mptcp_net-next#157
Reported-by: Christoph Paasch <cpaasch@apple.com>
Suggested-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>

Build fails with the following message:

fatal error: tspdrv.h: No such file or directory

Fix this kind of error by replacing <> with "".

Reference:
- Please Correct these lines ! starters are facing problem because of these errors! · Issue #3 · MiCode/Xiaomi_Kernel_OpenSource
  MiCode/Xiaomi_Kernel_OpenSource#3
- pn548.h missing · Issue linux-surface#70 · MiCode/Xiaomi_Kernel_OpenSource
  MiCode/Xiaomi_Kernel_OpenSource#70
- wrong path for tracer_pkt_private.h · Issue linux-surface#71 · MiCode/Xiaomi_Kernel_OpenSource
  MiCode/Xiaomi_Kernel_OpenSource#71

Why does this ever happen? From the comment in MiCode/Xiaomi_Kernel_OpenSource#70 (comment),

> This kernel is made to be built wih ABS (Android Build System) and not
> isolated.
> They pushed a kernel which works with ABS.
> ABS has a specific kernel header managment and that's why your facing
> this error (and linux-surface#71 too) so they don't have to be responsable of it

Signed-off-by: Tsuchiya Yuto <kitakar@gmail.com>

The srv_mutex is used during writeback so cifs should ensure that allocations done when that mutex is held are done with GFP_NOFS, to avoid having direct reclaim ending up waiting for the same mutex and causing a deadlock. This is detected by lockdep with the splat below: ====================================================== WARNING: possible circular locking dependency detected 5.18.0 #70 Not tainted ------------------------------------------------------ kswapd0/49 is trying to acquire lock: ffff8880195782e0 (&tcp_ses->srv_mutex){+.+.}-{3:3}, at: compound_send_recv but task is already holding lock: ffffffffa98e66c0 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (fs_reclaim){+.+.}-{0:0}: fs_reclaim_acquire kmem_cache_alloc_trace __request_module crypto_alg_mod_lookup crypto_alloc_tfm_node crypto_alloc_shash cifs_alloc_hash smb311_crypto_shash_allocate smb311_update_preauth_hash compound_send_recv cifs_send_recv SMB2_negotiate smb2_negotiate cifs_negotiate_protocol cifs_get_smb_ses cifs_mount cifs_smb3_do_mount smb3_get_tree vfs_get_tree path_mount __x64_sys_mount do_syscall_64 entry_SYSCALL_64_after_hwframe -> #0 (&tcp_ses->srv_mutex){+.+.}-{3:3}: __lock_acquire lock_acquire __mutex_lock mutex_lock_nested compound_send_recv cifs_send_recv SMB2_write smb2_sync_write cifs_write cifs_writepage_locked cifs_writepage shrink_page_list shrink_lruvec shrink_node balance_pgdat kswapd kthread ret_from_fork other info that might help us debug this: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(fs_reclaim); lock(&tcp_ses->srv_mutex); lock(fs_reclaim); lock(&tcp_ses->srv_mutex); *** DEADLOCK *** 1 lock held by kswapd0/49: #0: ffffffffa98e66c0 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat stack backtrace: CPU: 2 PID: 49 Comm: kswapd0 Not tainted 5.18.0 #70 Call Trace: <TASK> dump_stack_lvl dump_stack print_circular_bug.cold check_noncircular __lock_acquire lock_acquire 
__mutex_lock mutex_lock_nested compound_send_recv cifs_send_recv SMB2_write smb2_sync_write cifs_write cifs_writepage_locked cifs_writepage shrink_page_list shrink_lruvec shrink_node balance_pgdat kswapd kthread ret_from_fork </TASK> Fix this by using the memalloc_nofs_save/restore APIs around the places where the srv_mutex is held. Do this in a wrapper function for the lock/unlock of the srv_mutex, and rename the srv_mutex to avoid missing call sites in the conversion. Note that there is another lockdep warning involving internal crypto locks, which was masked by this problem and is visible after this fix, see the discussion in this thread: https://lore.kernel.org/all/20220523123755.GA13668@axis.com/ Link: https://lore.kernel.org/r/CANT5p=rqcYfYMVHirqvdnnca4Mo+JQSw5Qu12v=kPfpk5yhhmg@mail.gmail.com/ Reported-by: Shyam Prasad N <nspmangalore@gmail.com> Suggested-by: Lars Persson <larper@axis.com> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com> Reviewed-by: Enzo Matsumiya <ematsumiya@suse.de> Signed-off-by: Vincent Whitchurch <vincent.whitchurch@axis.com> Signed-off-by: Steve French <stfrench@microsoft.com>

[ Upstream commit cc391b6 ] The srv_mutex is used during writeback so cifs should ensure that allocations done when that mutex is held are done with GFP_NOFS, to avoid having direct reclaim ending up waiting for the same mutex and causing a deadlock. This is detected by lockdep with the splat below: ====================================================== WARNING: possible circular locking dependency detected 5.18.0 #70 Not tainted ------------------------------------------------------ kswapd0/49 is trying to acquire lock: ffff8880195782e0 (&tcp_ses->srv_mutex){+.+.}-{3:3}, at: compound_send_recv but task is already holding lock: ffffffffa98e66c0 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (fs_reclaim){+.+.}-{0:0}: fs_reclaim_acquire kmem_cache_alloc_trace __request_module crypto_alg_mod_lookup crypto_alloc_tfm_node crypto_alloc_shash cifs_alloc_hash smb311_crypto_shash_allocate smb311_update_preauth_hash compound_send_recv cifs_send_recv SMB2_negotiate smb2_negotiate cifs_negotiate_protocol cifs_get_smb_ses cifs_mount cifs_smb3_do_mount smb3_get_tree vfs_get_tree path_mount __x64_sys_mount do_syscall_64 entry_SYSCALL_64_after_hwframe -> #0 (&tcp_ses->srv_mutex){+.+.}-{3:3}: __lock_acquire lock_acquire __mutex_lock mutex_lock_nested compound_send_recv cifs_send_recv SMB2_write smb2_sync_write cifs_write cifs_writepage_locked cifs_writepage shrink_page_list shrink_lruvec shrink_node balance_pgdat kswapd kthread ret_from_fork other info that might help us debug this: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(fs_reclaim); lock(&tcp_ses->srv_mutex); lock(fs_reclaim); lock(&tcp_ses->srv_mutex); *** DEADLOCK *** 1 lock held by kswapd0/49: #0: ffffffffa98e66c0 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat stack backtrace: CPU: 2 PID: 49 Comm: kswapd0 Not tainted 5.18.0 #70 Call Trace: <TASK> dump_stack_lvl dump_stack print_circular_bug.cold check_noncircular 
__lock_acquire lock_acquire __mutex_lock mutex_lock_nested compound_send_recv cifs_send_recv SMB2_write smb2_sync_write cifs_write cifs_writepage_locked cifs_writepage shrink_page_list shrink_lruvec shrink_node balance_pgdat kswapd kthread ret_from_fork </TASK> Fix this by using the memalloc_nofs_save/restore APIs around the places where the srv_mutex is held. Do this in a wrapper function for the lock/unlock of the srv_mutex, and rename the srv_mutex to avoid missing call sites in the conversion. Note that there is another lockdep warning involving internal crypto locks, which was masked by this problem and is visible after this fix, see the discussion in this thread: https://lore.kernel.org/all/20220523123755.GA13668@axis.com/ Link: https://lore.kernel.org/r/CANT5p=rqcYfYMVHirqvdnnca4Mo+JQSw5Qu12v=kPfpk5yhhmg@mail.gmail.com/ Reported-by: Shyam Prasad N <nspmangalore@gmail.com> Suggested-by: Lars Persson <larper@axis.com> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com> Reviewed-by: Enzo Matsumiya <ematsumiya@suse.de> Signed-off-by: Vincent Whitchurch <vincent.whitchurch@axis.com> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Sasha Levin <sashal@kernel.org>

… when the system is sleeping [ Upstream commit 2838a89 ] Some channels may be masked. When the system is suspended, if these masked channels are not filtered out, this will lead to null pointer operations and system crash: Unable to handle kernel NULL pointer dereference at virtual address Mem abort info: ESR = 0x0000000096000004 EC = 0x25: DABT (current EL), IL = 32 bits SET = 0, FnV = 0 EA = 0, S1PTW = 0 FSC = 0x04: level 0 translation fault Data abort info: ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000 CM = 0, WnR = 0, TnD = 0, TagAccess = 0 GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0 user pgtable: 4k pages, 48-bit VAs, pgdp=0000000894300000 [00000000000002a0] pgd=0000000000000000, p4d=0000000000000000 Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP Modules linked in: CPU: 1 PID: 989 Comm: sh Tainted: G B 6.6.0-16203-g557fb7a3ec4c-dirty #70 Hardware name: Freescale i.MX8QM MEK (DT) pstate: 400000c5 (nZcv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc: fsl_edma_disable_request+0x3c/0x78 lr: fsl_edma_disable_request+0x3c/0x78 sp:ffff800089ae7690 x29: ffff800089ae7690 x28: ffff000807ab5440 x27: ffff000807ab5830 x26: 0000000000000008 x25: 0000000000000278 x24: 0000000000000001 23: ffff000807ab4328 x22: 0000000000000000 x21: 0000000000000009 x20: ffff800082616940 x19: 0000000000000000 x18: 0000000000000000 x17: 3d3d3d3d3d3d3d3d x16: 3d3d3d3d3d3d3d3d x15: 3d3d3d3d3d3d3d3d x14: 3d3d3d3d3d3d3d3d x13: 3d3d3d3d3d3d3d3d x12: 1ffff00010d45724 x11: ffff700010d45724 x10: dfff800000000000 x9: dfff800000000000 x8: 00008fffef2ba8dc x7: 0000000000000001 x6: ffff800086a2b927 x5: ffff800086a2b920 x4: ffff700010d45725 x3: ffff8000800d5bbc x2 : 0000000000000000 x1 : ffff000800c1d880 x0 : 0000000000000001 Call trace: fsl_edma_disable_request+0x3c/0x78 fsl_edma_suspend_late+0x128/0x12c dpm_run_callback+0xd4/0x304 __device_suspend_late+0xd0/0x240 dpm_suspend_late+0x174/0x59c suspend_devices_and_enter+0x194/0xd00 pm_suspend+0x3c4/0x910 Fixes: 72f5801 ("dmaengine: fsl-edma: integrate 
v3 support") Signed-off-by: Xiaolei Wang <xiaolei.wang@windriver.com> Link: https://lore.kernel.org/r/20231113225713.1892643-2-xiaolei.wang@windriver.com Signed-off-by: Vinod Koul <vkoul@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>

commit 3f489c2 upstream. The mmap read lock is used during the shrinker's callback, which means that using alloc->vma pointer isn't safe as it can race with munmap(). As of commit dd2283f ("mm: mmap: zap pages with read mmap_sem in munmap") the mmap lock is downgraded after the vma has been isolated. I was able to reproduce this issue by manually adding some delays and triggering page reclaiming through the shrinker's debug sysfs. The following KASAN report confirms the UAF: ================================================================== BUG: KASAN: slab-use-after-free in zap_page_range_single+0x470/0x4b8 Read of size 8 at addr ffff356ed50e50f0 by task bash/478 CPU: 1 PID: 478 Comm: bash Not tainted 6.6.0-rc5-00055-g1c8b86a3799f-dirty #70 Hardware name: linux,dummy-virt (DT) Call trace: zap_page_range_single+0x470/0x4b8 binder_alloc_free_page+0x608/0xadc __list_lru_walk_one+0x130/0x3b0 list_lru_walk_node+0xc4/0x22c binder_shrink_scan+0x108/0x1dc shrinker_debugfs_scan_write+0x2b4/0x500 full_proxy_write+0xd4/0x140 vfs_write+0x1ac/0x758 ksys_write+0xf0/0x1dc __arm64_sys_write+0x6c/0x9c Allocated by task 492: kmem_cache_alloc+0x130/0x368 vm_area_alloc+0x2c/0x190 mmap_region+0x258/0x18bc do_mmap+0x694/0xa60 vm_mmap_pgoff+0x170/0x29c ksys_mmap_pgoff+0x290/0x3a0 __arm64_sys_mmap+0xcc/0x144 Freed by task 491: kmem_cache_free+0x17c/0x3c8 vm_area_free_rcu_cb+0x74/0x98 rcu_core+0xa38/0x26d4 rcu_core_si+0x10/0x1c __do_softirq+0x2fc/0xd24 Last potentially related work creation: __call_rcu_common.constprop.0+0x6c/0xba0 call_rcu+0x10/0x1c vm_area_free+0x18/0x24 remove_vma+0xe4/0x118 do_vmi_align_munmap.isra.0+0x718/0xb5c do_vmi_munmap+0xdc/0x1fc __vm_munmap+0x10c/0x278 __arm64_sys_munmap+0x58/0x7c Fix this issue by performing instead a vma_lookup() which will fail to find the vma that was isolated before the mmap lock downgrade. Note that this option has better performance than upgrading to a mmap write lock which would increase contention. 
Plus, mmap_write_trylock() has been recently removed anyway. Fixes: dd2283f ("mm: mmap: zap pages with read mmap_sem in munmap") Cc: stable@vger.kernel.org Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Minchan Kim <minchan@kernel.org> Reviewed-by: Alice Ryhl <aliceryhl@google.com> Signed-off-by: Carlos Llamas <cmllamas@google.com> Link: https://lore.kernel.org/r/20231201172212.1813387-3-cmllamas@google.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

qzed requested changes Nov 12, 2020

jonas2515 force-pushed the v5.9-surface-devel-fix-crashes branch from efe6c78 to 1a6897c Compare March 27, 2021 13:44

jonas2515 changed the base branch from v5.9-surface-devel to v5.11-surface-devel March 27, 2021 13:44

jonas2515 changed the title ~~[WIP] Properly fix wifi firmware crashes~~ Properly fix wifi firmware crashes Mar 27, 2021

jonas2515 mentioned this pull request Mar 27, 2021

Reenable host sleep #89

Merged

Revert "mwifiex: pcie: add enable_device_dump module parameter"

6891faf

The firmware shouldn't crash anymore now that we have a proper fix (or at least not that often), so hopefully we don't need this anymore. This reverts commit 52750be.

jonas2515 mentioned this pull request Mar 27, 2021

Enable wifi powersaving by default again #90

Merged

qzed approved these changes Mar 27, 2021

qzed merged commit 4b49a23 into linux-surface:v5.11-surface-devel Mar 28, 2021

qzed mentioned this pull request Mar 29, 2021

Announcements and updates linux-surface/linux-surface#96

Open

ecaron mentioned this pull request Apr 24, 2021

Network issue after waking from sleep state sebanc/brunch#953

Open

nexplorer-3e mentioned this pull request May 18, 2024

Mwifiex: tx timeout randomly occurs in AP mode #155

Open
