upstream nt36xxx driver #6
Comments
Incredible. If you need testing, I will have a lot of free time by the end of next week, and I have a working Redmi Note 9.
Syzkaller reported a lockdep splat: ============================================ WARNING: possible recursive locking detected 6.11.0-rc6-syzkaller-00019-g67784a74e258 #0 Not tainted -------------------------------------------- syz-executor364/5113 is trying to acquire lock: ffff8880449f1958 (k-slock-AF_INET){+.-.}-{2:2}, at: spin_lock include/linux/spinlock.h:351 [inline] ffff8880449f1958 (k-slock-AF_INET){+.-.}-{2:2}, at: sk_clone_lock+0x2cd/0xf40 net/core/sock.c:2328 but task is already holding lock: ffff88803fe3cb58 (k-slock-AF_INET){+.-.}-{2:2}, at: spin_lock include/linux/spinlock.h:351 [inline] ffff88803fe3cb58 (k-slock-AF_INET){+.-.}-{2:2}, at: sk_clone_lock+0x2cd/0xf40 net/core/sock.c:2328 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(k-slock-AF_INET); lock(k-slock-AF_INET); *** DEADLOCK *** May be due to missing lock nesting notation 7 locks held by syz-executor364/5113: #0: ffff8880449f0e18 (sk_lock-AF_INET){+.+.}-{0:0}, at: lock_sock include/net/sock.h:1607 [inline] #0: ffff8880449f0e18 (sk_lock-AF_INET){+.+.}-{0:0}, at: mptcp_sendmsg+0x153/0x1b10 net/mptcp/protocol.c:1806 #1: ffff88803fe39ad8 (k-sk_lock-AF_INET){+.+.}-{0:0}, at: lock_sock include/net/sock.h:1607 [inline] #1: ffff88803fe39ad8 (k-sk_lock-AF_INET){+.+.}-{0:0}, at: mptcp_sendmsg_fastopen+0x11f/0x530 net/mptcp/protocol.c:1727 #2: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: rcu_lock_acquire include/linux/rcupdate.h:326 [inline] #2: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: rcu_read_lock include/linux/rcupdate.h:838 [inline] #2: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: __ip_queue_xmit+0x5f/0x1b80 net/ipv4/ip_output.c:470 #3: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: rcu_lock_acquire include/linux/rcupdate.h:326 [inline] #3: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: rcu_read_lock include/linux/rcupdate.h:838 [inline] #3: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: ip_finish_output2+0x45f/0x1390 
net/ipv4/ip_output.c:228 #4: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: local_lock_acquire include/linux/local_lock_internal.h:29 [inline] #4: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: process_backlog+0x33b/0x15b0 net/core/dev.c:6104 #5: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: rcu_lock_acquire include/linux/rcupdate.h:326 [inline] #5: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: rcu_read_lock include/linux/rcupdate.h:838 [inline] #5: ffffffff8e938320 (rcu_read_lock){....}-{1:2}, at: ip_local_deliver_finish+0x230/0x5f0 net/ipv4/ip_input.c:232 #6: ffff88803fe3cb58 (k-slock-AF_INET){+.-.}-{2:2}, at: spin_lock include/linux/spinlock.h:351 [inline] #6: ffff88803fe3cb58 (k-slock-AF_INET){+.-.}-{2:2}, at: sk_clone_lock+0x2cd/0xf40 net/core/sock.c:2328 stack backtrace: CPU: 0 UID: 0 PID: 5113 Comm: syz-executor364 Not tainted 6.11.0-rc6-syzkaller-00019-g67784a74e258 #0 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014 Call Trace: <IRQ> __dump_stack lib/dump_stack.c:93 [inline] dump_stack_lvl+0x241/0x360 lib/dump_stack.c:119 check_deadlock kernel/locking/lockdep.c:3061 [inline] validate_chain+0x15d3/0x5900 kernel/locking/lockdep.c:3855 __lock_acquire+0x137a/0x2040 kernel/locking/lockdep.c:5142 lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5759 __raw_spin_lock include/linux/spinlock_api_smp.h:133 [inline] _raw_spin_lock+0x2e/0x40 kernel/locking/spinlock.c:154 spin_lock include/linux/spinlock.h:351 [inline] sk_clone_lock+0x2cd/0xf40 net/core/sock.c:2328 mptcp_sk_clone_init+0x32/0x13c0 net/mptcp/protocol.c:3279 subflow_syn_recv_sock+0x931/0x1920 net/mptcp/subflow.c:874 tcp_check_req+0xfe4/0x1a20 net/ipv4/tcp_minisocks.c:853 tcp_v4_rcv+0x1c3e/0x37f0 net/ipv4/tcp_ipv4.c:2267 ip_protocol_deliver_rcu+0x22e/0x440 net/ipv4/ip_input.c:205 ip_local_deliver_finish+0x341/0x5f0 net/ipv4/ip_input.c:233 NF_HOOK+0x3a4/0x450 include/linux/netfilter.h:314 NF_HOOK+0x3a4/0x450 include/linux/netfilter.h:314 
__netif_receive_skb_one_core net/core/dev.c:5661 [inline] __netif_receive_skb+0x2bf/0x650 net/core/dev.c:5775 process_backlog+0x662/0x15b0 net/core/dev.c:6108 __napi_poll+0xcb/0x490 net/core/dev.c:6772 napi_poll net/core/dev.c:6841 [inline] net_rx_action+0x89b/0x1240 net/core/dev.c:6963 handle_softirqs+0x2c4/0x970 kernel/softirq.c:554 do_softirq+0x11b/0x1e0 kernel/softirq.c:455 </IRQ> <TASK> __local_bh_enable_ip+0x1bb/0x200 kernel/softirq.c:382 local_bh_enable include/linux/bottom_half.h:33 [inline] rcu_read_unlock_bh include/linux/rcupdate.h:908 [inline] __dev_queue_xmit+0x1763/0x3e90 net/core/dev.c:4450 dev_queue_xmit include/linux/netdevice.h:3105 [inline] neigh_hh_output include/net/neighbour.h:526 [inline] neigh_output include/net/neighbour.h:540 [inline] ip_finish_output2+0xd41/0x1390 net/ipv4/ip_output.c:235 ip_local_out net/ipv4/ip_output.c:129 [inline] __ip_queue_xmit+0x118c/0x1b80 net/ipv4/ip_output.c:535 __tcp_transmit_skb+0x2544/0x3b30 net/ipv4/tcp_output.c:1466 tcp_rcv_synsent_state_process net/ipv4/tcp_input.c:6542 [inline] tcp_rcv_state_process+0x2c32/0x4570 net/ipv4/tcp_input.c:6729 tcp_v4_do_rcv+0x77d/0xc70 net/ipv4/tcp_ipv4.c:1934 sk_backlog_rcv include/net/sock.h:1111 [inline] __release_sock+0x214/0x350 net/core/sock.c:3004 release_sock+0x61/0x1f0 net/core/sock.c:3558 mptcp_sendmsg_fastopen+0x1ad/0x530 net/mptcp/protocol.c:1733 mptcp_sendmsg+0x1884/0x1b10 net/mptcp/protocol.c:1812 sock_sendmsg_nosec net/socket.c:730 [inline] __sock_sendmsg+0x1a6/0x270 net/socket.c:745 ____sys_sendmsg+0x525/0x7d0 net/socket.c:2597 ___sys_sendmsg net/socket.c:2651 [inline] __sys_sendmmsg+0x3b2/0x740 net/socket.c:2737 __do_sys_sendmmsg net/socket.c:2766 [inline] __se_sys_sendmmsg net/socket.c:2763 [inline] __x64_sys_sendmmsg+0xa0/0xb0 net/socket.c:2763 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7f04fb13a6b9 Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 01 1a 
00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007ffd651f42d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000133 RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f04fb13a6b9 RDX: 0000000000000001 RSI: 0000000020000d00 RDI: 0000000000000004 RBP: 00007ffd651f4310 R08: 0000000000000001 R09: 0000000000000001 R10: 0000000020000080 R11: 0000000000000246 R12: 00000000000f4240 R13: 00007f04fb187449 R14: 00007ffd651f42f4 R15: 00007ffd651f4300 </TASK> As noted by Cong Wang, the splat is false positive, but the code path leading to the report is an unexpected one: a client is attempting an MPC handshake towards the in-kernel listener created by the in-kernel PM for a port based signal endpoint. Such connection will be never accepted; many of them can make the listener queue full and preventing the creation of MPJ subflow via such listener - its intended role. Explicitly detect this scenario at initial-syn time and drop the incoming MPC request. Fixes: 1729cf1 ("mptcp: create the listening socket for new port") Cc: stable@vger.kernel.org Reported-by: syzbot+f4aacdfef2c6a6529c3e@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=f4aacdfef2c6a6529c3e Cc: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20241014-net-mptcp-mpc-port-endp-v2-1-7faea8e6b6ae@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Thanks, that's appreciated.
I am afraid your miatoll uses a touch panel other than NVT. You should be able to boot into pmOS, but without touch. If you are interested, please feel free to get hands-on.
Will do! I have an OTG dock, so I should be able to test plenty even without touch input.
I built it and installed it using the following commands:
Here is what I got:
Except for the fact that I removed the Tux logos, this feels like exactly the same kernel. Did I make a mistake? Was 99degree/linux:next identical to pmOS's kernel?
I switched to your branch of pmaports and added the packages you added in your CI. But now I get an error saying that super failed to mount. I tried reflashing the stock super, to no avail.
Yes, the build under my account is not proven to work for pmOS on UFS. My suggestion is to stick with a working pmOS branch of your own. I never compile the kernel this way, so sorry, I can't tell. Please use the official pmOS kernel build procedure. If in doubt, please refer to the GitHub Actions file.
You can also use the official pmbootstrap and then merge in the miatoll firmware apk source from pmaports, in addition to adbd-linux. So it is better to start with an official build (by running pmbootstrap init on your machine) and add the additional apk packages such as firmware-xiaomi-miatoll and adbd-linux. That way you get the advantages of both sides. Or you can install the firmware files into your own rootfs under /lib/firmware.
Note that a UFS build might have a special root disk image structure that is not as easy to modify this way as an SD card build. My build purposely disables the UFS hardware in the device tree and lacks support for the dynamic partition apk. If you want to enable this, you need to compile your own kernel and add support for the dynamic partition apk, as well as settings I don't know about. If you want adbd, you will have to dig in.
If you have anything further to discuss, please create a new issue ticket. Happy hacking!
If ufshcd_rtc_work calls ufshcd_rpm_put_sync() and the pm's usage_count is 0, we will enter the runtime suspend callback. However, the runtime suspend callback will wait to flush ufshcd_rtc_work, causing a deadlock. Replace ufshcd_rpm_put_sync() with ufshcd_rpm_put() to avoid the deadlock. Fixes: 6bf999e ("scsi: ufs: core: Add UFS RTC support") Cc: stable@vger.kernel.org #6.11.x Signed-off-by: Peter Wang <peter.wang@mediatek.com> Link: https://lore.kernel.org/r/20241024015453.21684-1-peter.wang@mediatek.com Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
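To make the deadlock in the commit message above easier to see, here is a minimal toy model in Python. It is only a sketch: `ToyRuntimePM`, `rtc_work`, and every field name are hypothetical stand-ins, not the kernel's data structures, and the "deadlock" is reported as an exception instead of actually hanging.

```python
class ToyRuntimePM:
    """Toy model only; all names are hypothetical, not kernel APIs."""

    def __init__(self):
        self.usage_count = 1
        self.in_rtc_work = False      # True while the RTC work item runs
        self.pending_suspend = False  # set by the asynchronous put
        self.suspended = False

    def _runtime_suspend(self):
        # The suspend callback flushes the RTC work. Flushing a work item
        # from inside that same work item could never complete.
        if self.in_rtc_work:
            raise RuntimeError("deadlock: suspend waits for the work that called it")
        self.suspended = True

    def put_sync(self):
        # Synchronous put: suspend runs in the caller's context.
        self.usage_count -= 1
        if self.usage_count == 0:
            self._runtime_suspend()

    def put(self):
        # Asynchronous put: suspend is deferred to a later context.
        self.usage_count -= 1
        if self.usage_count == 0:
            self.pending_suspend = True

    def run_pending(self):
        # Stand-in for the deferred suspend running after the work finished.
        if self.pending_suspend and not self.in_rtc_work:
            self.pending_suspend = False
            self._runtime_suspend()


def rtc_work(pm, sync):
    """Model of the work item dropping its runtime PM reference."""
    pm.in_rtc_work = True
    try:
        if sync:
            pm.put_sync()
        else:
            pm.put()
    finally:
        pm.in_rtc_work = False
    pm.run_pending()
```

In this model, `rtc_work(pm, sync=True)` raises because the suspend path waits on the work that invoked it, while `rtc_work(pm, sync=False)` lets the work finish first and suspends afterwards.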
Its used from trace__run(), for the 'perf trace' live mode, i.e. its strace-like, non-perf.data file processing mode, the most common one. The trace__run() function will set trace->host using machine__new_host() that is supposed to give a machine instance representing the running machine, and since we'll use perf_env__arch_strerrno() to get the right errno -> string table, we need to use machine->env, so initialize it in machine__new_host(). Before the patch: (gdb) run trace --errno-summary -a sleep 1 <SNIP> Summary of events: gvfs-afc-volume (3187), 2 events, 0.0% syscall calls errors total min avg max stddev (msec) (msec) (msec) (msec) (%) --------------- -------- ------ -------- --------- --------- --------- ------ pselect6 1 0 0.000 0.000 0.000 0.000 0.00% GUsbEventThread (3519), 2 events, 0.0% syscall calls errors total min avg max stddev (msec) (msec) (msec) (msec) (%) --------------- -------- ------ -------- --------- --------- --------- ------ poll 1 0 0.000 0.000 0.000 0.000 0.00% <SNIP> Program received signal SIGSEGV, Segmentation fault. 
0x00000000005caba0 in perf_env__arch_strerrno (env=0x0, err=110) at util/env.c:478 478 if (env->arch_strerrno == NULL) (gdb) bt #0 0x00000000005caba0 in perf_env__arch_strerrno (env=0x0, err=110) at util/env.c:478 #1 0x00000000004b75d2 in thread__dump_stats (ttrace=0x14f58f0, trace=0x7fffffffa5b0, fp=0x7ffff6ff74e0 <_IO_2_1_stderr_>) at builtin-trace.c:4673 #2 0x00000000004b78bf in trace__fprintf_thread (fp=0x7ffff6ff74e0 <_IO_2_1_stderr_>, thread=0x10fa0b0, trace=0x7fffffffa5b0) at builtin-trace.c:4708 #3 0x00000000004b7ad9 in trace__fprintf_thread_summary (trace=0x7fffffffa5b0, fp=0x7ffff6ff74e0 <_IO_2_1_stderr_>) at builtin-trace.c:4747 #4 0x00000000004b656e in trace__run (trace=0x7fffffffa5b0, argc=2, argv=0x7fffffffde60) at builtin-trace.c:4456 #5 0x00000000004ba43e in cmd_trace (argc=2, argv=0x7fffffffde60) at builtin-trace.c:5487 #6 0x00000000004c0414 in run_builtin (p=0xec3068 <commands+648>, argc=5, argv=0x7fffffffde60) at perf.c:351 #7 0x00000000004c06bb in handle_internal_command (argc=5, argv=0x7fffffffde60) at perf.c:404 #8 0x00000000004c0814 in run_argv (argcp=0x7fffffffdc4c, argv=0x7fffffffdc40) at perf.c:448 #9 0x00000000004c0b5d in main (argc=5, argv=0x7fffffffde60) at perf.c:560 (gdb) After: root@number:~# perf trace -a --errno-summary sleep 1 <SNIP> pw-data-loop (2685), 1410 events, 16.0% syscall calls errors total min avg max stddev (msec) (msec) (msec) (msec) (%) --------------- -------- ------ -------- --------- --------- --------- ------ epoll_wait 188 0 983.428 0.000 5.231 15.595 8.68% ioctl 94 0 0.811 0.004 0.009 0.016 2.82% read 188 0 0.322 0.001 0.002 0.006 5.15% write 141 0 0.280 0.001 0.002 0.018 8.39% timerfd_settime 94 0 0.138 0.001 0.001 0.007 6.47% gnome-control-c (179406), 1848 events, 20.9% syscall calls errors total min avg max stddev (msec) (msec) (msec) (msec) (%) --------------- -------- ------ -------- --------- --------- --------- ------ poll 222 0 959.577 0.000 4.322 21.414 11.40% recvmsg 150 0 0.539
0.001 0.004 0.013 5.12% write 300 0 0.442 0.001 0.001 0.007 3.29% read 150 0 0.183 0.001 0.001 0.009 5.53% getpid 102 0 0.101 0.000 0.001 0.008 7.82% root@number:~# Fixes: 54373b5 ("perf env: Introduce perf_env__arch_strerrno()") Reported-by: Veronika Molnarova <vmolnaro@redhat.com> Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Acked-by: Veronika Molnarova <vmolnaro@redhat.com> Acked-by: Michael Petlan <mpetlan@redhat.com> Tested-by: Michael Petlan <mpetlan@redhat.com> Link: https://lore.kernel.org/r/Z0XffUgNSv_9OjOi@x1 Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Kernel will hang on destroy admin_q while we create ctrl failed, such as following calltrace: PID: 23644 TASK: ff2d52b40f439fc0 CPU: 2 COMMAND: "nvme" #0 [ff61d23de260fb78] __schedule at ffffffff8323bc15 #1 [ff61d23de260fc08] schedule at ffffffff8323c014 #2 [ff61d23de260fc28] blk_mq_freeze_queue_wait at ffffffff82a3dba1 #3 [ff61d23de260fc78] blk_freeze_queue at ffffffff82a4113a #4 [ff61d23de260fc90] blk_cleanup_queue at ffffffff82a33006 #5 [ff61d23de260fcb0] nvme_rdma_destroy_admin_queue at ffffffffc12686ce #6 [ff61d23de260fcc8] nvme_rdma_setup_ctrl at ffffffffc1268ced #7 [ff61d23de260fd28] nvme_rdma_create_ctrl at ffffffffc126919b #8 [ff61d23de260fd68] nvmf_dev_write at ffffffffc024f362 #9 [ff61d23de260fe38] vfs_write at ffffffff827d5f25 RIP: 00007fda7891d574 RSP: 00007ffe2ef06958 RFLAGS: 00000202 RAX: ffffffffffffffda RBX: 000055e8122a4d90 RCX: 00007fda7891d574 RDX: 000000000000012b RSI: 000055e8122a4d90 RDI: 0000000000000004 RBP: 00007ffe2ef079c0 R8: 000000000000012b R9: 000055e8122a4d90 R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000004 R13: 000055e8122923c0 R14: 000000000000012b R15: 00007fda78a54500 ORIG_RAX: 0000000000000001 CS: 0033 SS: 002b This due to we have quiesced admi_q before cancel requests, but forgot to unquiesce before destroy it, as a result we fail to drain the pending requests, and hang on blk_mq_freeze_queue_wait() forever. Here try to reuse nvme_rdma_teardown_admin_queue() to fix this issue and simplify the code. Fixes: 958dc1d ("nvme-rdma: add clean action for failed reconnection") Reported-by: Yingfu.zhou <yingfu.zhou@shopee.com> Signed-off-by: Chunguang.xu <chunguang.xu@shopee.com> Signed-off-by: Yue.zhao <yue.zhao@shopee.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
When the power mode change is successful but the power mode hasn't actually changed, the post notification was missed. Similar to the approach with hibernate/clock scale/hce enable, having pre/post notifications in the same function will make it easier to maintain. Additionally, supplement the description of power parameters for the pwr_change_notify callback. Fixes: 7eb584d ("ufs: refactor configuring power mode") Cc: stable@vger.kernel.org #6.11.x Signed-off-by: Peter Wang <peter.wang@mediatek.com> Link: https://lore.kernel.org/r/20241122024943.30589-1-peter.wang@mediatek.com Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Hou Tao says:
====================
This patch set fixes several issues for LPM trie. These issues were found during adding new test cases or were reported by syzbot.
The patch set is structured as follows:
Patch #1~#2 are clean-ups for lpm_trie_update_elem().
Patch #3 handles BPF_EXIST and BPF_NOEXIST correctly for LPM trie.
Patch #4 fixes the accounting of n_entries when doing in-place update.
Patch #5 fixes the exact match condition in trie_get_next_key() and it may skip keys when the passed key is not found in the map.
Patch #6~#7 switch from kmalloc() to bpf memory allocator for LPM trie to fix several lock order warnings reported by syzbot. It also enables raw_spinlock_t for LPM trie again. After these changes, the LPM trie will be closer to being usable in any context (though the reentrance check of trie->lock is still missing, but it is on my todo list).
Patch #8: move test_lpm_map to map_tests to make it run regularly.
Patch #9: add test cases for the issues fixed by patch #3~#5.
Please see individual patches for more details. Comments are always welcome.
Change Log:
v3:
* patch #2: remove the unnecessary NULL-init for im_node
* patch #6: alloc the leaf node before disabling IRQ to low the possibility of -ENOMEM when leaf_size is large; Free these nodes outside the trie lock (Suggested by Alexei)
* collect review and ack tags (Thanks for Toke & Daniel)
v2: https://lore.kernel.org/bpf/20241127004641.1118269-1-houtao@huaweicloud.com/
* collect review tags (Thanks for Toke)
* drop "Add bpf_mem_cache_is_mergeable() helper" patch
* patch #3~#4: add fix tag
* patch #4: rename the helper to trie_check_add_elem() and increase n_entries in it.
* patch #6: use one bpf mem allocator and update commit message to clarify that using bpf mem allocator is more appropriate.
* patch #7: update commit message to add the possible max running time for update operation.
* patch #9: update commit message to specify the purpose of these test cases.
v1: https://lore.kernel.org/bpf/20241118010808.2243555-1-houtao@huaweicloud.com/
====================
Link: https://lore.kernel.org/all/20241206110622.1161752-1-houtao@huaweicloud.com/ Signed-off-by: Alexei Starovoitov <ast@kernel.org>
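The flag handling and `n_entries` accounting mentioned in patches #3 and #4 can be illustrated with a small Python sketch. This is not the kernel code: `ToyMap` is a hypothetical stand-in that uses a plain dict instead of a trie and models only the semantics of the `BPF_ANY`, `BPF_NOEXIST`, and `BPF_EXIST` update flags, plus the rule that an in-place update must not grow the entry count. The `-ENOSPC` return for a full map is an assumption of the sketch.

```python
import errno

# Update flags as defined in the BPF UAPI.
BPF_ANY, BPF_NOEXIST, BPF_EXIST = 0, 1, 2


class ToyMap:
    """Hypothetical dict-backed model of the update semantics, not a trie."""

    def __init__(self, max_entries):
        self.max_entries = max_entries
        self.entries = {}
        self.n_entries = 0

    def update(self, key, value, flags=BPF_ANY):
        exists = key in self.entries
        if flags == BPF_NOEXIST and exists:
            return -errno.EEXIST   # create-only, but the key is present
        if flags == BPF_EXIST and not exists:
            return -errno.ENOENT   # update-only, but the key is absent
        if not exists:
            if self.n_entries == self.max_entries:
                return -errno.ENOSPC
            self.n_entries += 1    # only a genuinely new key is counted
        self.entries[key] = value  # in-place update leaves n_entries alone
        return 0
```

With `max_entries=1`, inserting a key and then overwriting it any number of times keeps `n_entries` at 1, so later updates of that key are not rejected for lack of space.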
…le_direct_reclaim() The task sometimes continues looping in throttle_direct_reclaim() because allow_direct_reclaim(pgdat) keeps returning false. #0 [ffff80002cb6f8d0] __switch_to at ffff8000080095ac #1 [ffff80002cb6f900] __schedule at ffff800008abbd1c #2 [ffff80002cb6f990] schedule at ffff800008abc50c #3 [ffff80002cb6f9b0] throttle_direct_reclaim at ffff800008273550 #4 [ffff80002cb6fa20] try_to_free_pages at ffff800008277b68 #5 [ffff80002cb6fae0] __alloc_pages_nodemask at ffff8000082c4660 #6 [ffff80002cb6fc50] alloc_pages_vma at ffff8000082e4a98 #7 [ffff80002cb6fca0] do_anonymous_page at ffff80000829f5a8 #8 [ffff80002cb6fce0] __handle_mm_fault at ffff8000082a5974 #9 [ffff80002cb6fd90] handle_mm_fault at ffff8000082a5bd4 At this point, the pgdat contains the following two zones: NODE: 4 ZONE: 0 ADDR: ffff00817fffe540 NAME: "DMA32" SIZE: 20480 MIN/LOW/HIGH: 11/28/45 VM_STAT: NR_FREE_PAGES: 359 NR_ZONE_INACTIVE_ANON: 18813 NR_ZONE_ACTIVE_ANON: 0 NR_ZONE_INACTIVE_FILE: 50 NR_ZONE_ACTIVE_FILE: 0 NR_ZONE_UNEVICTABLE: 0 NR_ZONE_WRITE_PENDING: 0 NR_MLOCK: 0 NR_BOUNCE: 0 NR_ZSPAGES: 0 NR_FREE_CMA_PAGES: 0 NODE: 4 ZONE: 1 ADDR: ffff00817fffec00 NAME: "Normal" SIZE: 8454144 PRESENT: 98304 MIN/LOW/HIGH: 68/166/264 VM_STAT: NR_FREE_PAGES: 146 NR_ZONE_INACTIVE_ANON: 94668 NR_ZONE_ACTIVE_ANON: 3 NR_ZONE_INACTIVE_FILE: 735 NR_ZONE_ACTIVE_FILE: 78 NR_ZONE_UNEVICTABLE: 0 NR_ZONE_WRITE_PENDING: 0 NR_MLOCK: 0 NR_BOUNCE: 0 NR_ZSPAGES: 0 NR_FREE_CMA_PAGES: 0 In allow_direct_reclaim(), while processing ZONE_DMA32, the sum of inactive/active file-backed pages calculated in zone_reclaimable_pages() based on the result of zone_page_state_snapshot() is zero. Additionally, since this system lacks swap, the calculation of inactive/ active anonymous pages is skipped.
crash> p nr_swap_pages nr_swap_pages = $1937 = { counter = 0 } As a result, ZONE_DMA32 is deemed unreclaimable and skipped, moving on to the processing of the next zone, ZONE_NORMAL, despite ZONE_DMA32 having free pages significantly exceeding the high watermark. The problem is that the pgdat->kswapd_failures hasn't been incremented. crash> px ((struct pglist_data *) 0xffff00817fffe540)->kswapd_failures $1935 = 0x0 This is because the node deemed balanced. The node balancing logic in balance_pgdat() evaluates all zones collectively. If one or more zones (e.g., ZONE_DMA32) have enough free pages to meet their watermarks, the entire node is deemed balanced. This causes balance_pgdat() to exit early before incrementing the kswapd_failures, as it considers the overall memory state acceptable, even though some zones (like ZONE_NORMAL) remain under significant pressure. The patch ensures that zone_reclaimable_pages() includes free pages (NR_FREE_PAGES) in its calculation when no other reclaimable pages are available (e.g., file-backed or anonymous pages). This change prevents zones like ZONE_DMA32, which have sufficient free pages, from being mistakenly deemed unreclaimable. By doing so, the patch ensures proper node balancing, avoids masking pressure on other zones like ZONE_NORMAL, and prevents infinite loops in throttle_direct_reclaim() caused by allow_direct_reclaim(pgdat) repeatedly returning false. The kernel hangs due to a task stuck in throttle_direct_reclaim(), caused by a node being incorrectly deemed balanced despite pressure in certain zones, such as ZONE_NORMAL. This issue arises from zone_reclaimable_pages() returning 0 for zones without reclaimable file- backed or anonymous pages, causing zones like ZONE_DMA32 with sufficient free pages to be skipped. The lack of swap or reclaimable pages results in ZONE_DMA32 being ignored during reclaim, masking pressure in other zones. 
Consequently, pgdat->kswapd_failures remains 0 in balance_pgdat(), preventing fallback mechanisms in allow_direct_reclaim() from being triggered, leading to an infinite loop in throttle_direct_reclaim(). This patch modifies zone_reclaimable_pages() to account for free pages (NR_FREE_PAGES) when no other reclaimable pages exist. This ensures zones with sufficient free pages are not skipped, enabling proper balancing and reclaim behavior. [akpm@linux-foundation.org: coding-style cleanups] Link: https://lkml.kernel.org/r/20241130164346.436469-1-snishika@redhat.com Link: https://lkml.kernel.org/r/20241130161236.433747-2-snishika@redhat.com Fixes: 5a1c84b ("mm: remove reclaim and compaction retry approximations") Signed-off-by: Seiji Nishikawa <snishika@redhat.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
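The change to zone_reclaimable_pages() described above can be sketched in a few lines of Python. This is a simplified model, not the kernel function: the dict keys, the `can_reclaim_anon` parameter, and the `count_free_fallback` switch are hypothetical names introduced for illustration. The file-backed counts are set to 0 to follow the commit message's statement that the snapshot sum was zero for ZONE_DMA32.

```python
def zone_reclaimable_pages(zone, can_reclaim_anon, count_free_fallback):
    """Toy model of the reclaimable-page estimate for one zone."""
    nr = zone["inactive_file"] + zone["active_file"]
    if can_reclaim_anon:
        # Anonymous pages only count when they can be swapped out.
        nr += zone["inactive_anon"] + zone["active_anon"]
    if count_free_fallback and nr == 0:
        # Behaviour described by the patch: fall back to free pages so a
        # zone with plenty of free memory is not treated as unreclaimable.
        nr = zone["free"]
    return nr


# ZONE_DMA32 figures from the commit message; file counts assumed to be 0
# as reported by the snapshot, and the system has no swap.
dma32 = {
    "free": 359,
    "inactive_file": 0,
    "active_file": 0,
    "inactive_anon": 18813,
    "active_anon": 0,
}

before = zone_reclaimable_pages(dma32, can_reclaim_anon=False, count_free_fallback=False)
after = zone_reclaimable_pages(dma32, can_reclaim_anon=False, count_free_fallback=True)
```

Without the fallback the zone reports no reclaimable pages and is skipped; with it the zone reports its free pages, which is the case the commit message says allow_direct_reclaim() needs in order to stop returning false forever.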
Nvidia's Tegra MGBE controllers require the IOMMU "Stream ID" (SID) to be written to the MGBE_WRAP_AXI_ASID0_CTRL register. The current driver is hard coded to use MGBE0's SID for all controllers. This causes softirq time outs and kernel panics when using controllers other than MGBE0. Example dmesg errors when an ethernet cable is connected to MGBE1: [ 116.133290] tegra-mgbe 6910000.ethernet eth1: Link is Up - 1Gbps/Full - flow control rx/tx [ 121.851283] tegra-mgbe 6910000.ethernet eth1: NETDEV WATCHDOG: CPU: 5: transmit queue 0 timed out 5690 ms [ 121.851782] tegra-mgbe 6910000.ethernet eth1: Reset adapter. [ 121.892464] tegra-mgbe 6910000.ethernet eth1: Register MEM_TYPE_PAGE_POOL RxQ-0 [ 121.905920] tegra-mgbe 6910000.ethernet eth1: PHY [stmmac-1:00] driver [Aquantia AQR113] (irq=171) [ 121.907356] tegra-mgbe 6910000.ethernet eth1: Enabling Safety Features [ 121.907578] tegra-mgbe 6910000.ethernet eth1: IEEE 1588-2008 Advanced Timestamp supported [ 121.908399] tegra-mgbe 6910000.ethernet eth1: registered PTP clock [ 121.908582] tegra-mgbe 6910000.ethernet eth1: configuring for phy/10gbase-r link mode [ 125.961292] tegra-mgbe 6910000.ethernet eth1: Link is Up - 1Gbps/Full - flow control rx/tx [ 181.921198] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks: [ 181.921404] rcu: 7-....: (1 GPs behind) idle=540c/1/0x4000000000000002 softirq=1748/1749 fqs=2337 [ 181.921684] rcu: (detected by 4, t=6002 jiffies, g=1357, q=1254 ncpus=8) [ 181.921878] Sending NMI from CPU 4 to CPUs 7: [ 181.921886] NMI backtrace for cpu 7 [ 181.922131] CPU: 7 UID: 0 PID: 0 Comm: swapper/7 Kdump: loaded Not tainted 6.13.0-rc3+ #6 [ 181.922390] Hardware name: NVIDIA CTI Forge + Orin AGX/Jetson, BIOS 202402.1-Unknown 10/28/2024 [ 181.922658] pstate: 40400009 (nZcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 181.922847] pc : handle_softirqs+0x98/0x368 [ 181.922978] lr : __do_softirq+0x18/0x20 [ 181.923095] sp : ffff80008003bf50 [ 181.923189] x29: ffff80008003bf50 x28: 0000000000000008 x27: 
0000000000000000 [ 181.923379] x26: ffffce78ea277000 x25: 0000000000000000 x24: 0000001c61befda0 [ 181.924486] x23: 0000000060400009 x22: ffffce78e99918bc x21: ffff80008018bd70 [ 181.925568] x20: ffffce78e8bb00d8 x19: ffff80008018bc20 x18: 0000000000000000 [ 181.926655] x17: ffff318ebe7d3000 x16: ffff800080038000 x15: 0000000000000000 [ 181.931455] x14: ffff000080816680 x13: ffff318ebe7d3000 x12: 000000003464d91d [ 181.938628] x11: 0000000000000040 x10: ffff000080165a70 x9 : ffffce78e8bb0160 [ 181.945804] x8 : ffff8000827b3160 x7 : f9157b241586f343 x6 : eeb6502a01c81c74 [ 181.953068] x5 : a4acfcdd2e8096bb x4 : ffffce78ea277340 x3 : 00000000ffffd1e1 [ 181.960329] x2 : 0000000000000101 x1 : ffffce78ea277340 x0 : ffff318ebe7d3000 [ 181.967591] Call trace: [ 181.970043] handle_softirqs+0x98/0x368 (P) [ 181.974240] __do_softirq+0x18/0x20 [ 181.977743] ____do_softirq+0x14/0x28 [ 181.981415] call_on_irq_stack+0x24/0x30 [ 181.985180] do_softirq_own_stack+0x20/0x30 [ 181.989379] __irq_exit_rcu+0x114/0x140 [ 181.993142] irq_exit_rcu+0x14/0x28 [ 181.996816] el1_interrupt+0x44/0xb8 [ 182.000316] el1h_64_irq_handler+0x14/0x20 [ 182.004343] el1h_64_irq+0x80/0x88 [ 182.007755] cpuidle_enter_state+0xc4/0x4a8 (P) [ 182.012305] cpuidle_enter+0x3c/0x58 [ 182.015980] cpuidle_idle_call+0x128/0x1c0 [ 182.020005] do_idle+0xe0/0xf0 [ 182.023155] cpu_startup_entry+0x3c/0x48 [ 182.026917] secondary_start_kernel+0xdc/0x120 [ 182.031379] __secondary_switched+0x74/0x78 [ 212.971162] rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: { 7-.... } 6103 jiffies s: 417 root: 0x80/. 
[ 212.985935] rcu: blocking rcu_node structures (internal RCU debug): [ 212.992758] Sending NMI from CPU 0 to CPUs 7: [ 212.998539] NMI backtrace for cpu 7 [ 213.004304] CPU: 7 UID: 0 PID: 0 Comm: swapper/7 Kdump: loaded Not tainted 6.13.0-rc3+ #6 [ 213.016116] Hardware name: NVIDIA CTI Forge + Orin AGX/Jetson, BIOS 202402.1-Unknown 10/28/2024 [ 213.030817] pstate: 40400009 (nZcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 213.040528] pc : handle_softirqs+0x98/0x368 [ 213.046563] lr : __do_softirq+0x18/0x20 [ 213.051293] sp : ffff80008003bf50 [ 213.055839] x29: ffff80008003bf50 x28: 0000000000000008 x27: 0000000000000000 [ 213.067304] x26: ffffce78ea277000 x25: 0000000000000000 x24: 0000001c61befda0 [ 213.077014] x23: 0000000060400009 x22: ffffce78e99918bc x21: ffff80008018bd70 [ 213.087339] x20: ffffce78e8bb00d8 x19: ffff80008018bc20 x18: 0000000000000000 [ 213.097313] x17: ffff318ebe7d3000 x16: ffff800080038000 x15: 0000000000000000 [ 213.107201] x14: ffff000080816680 x13: ffff318ebe7d3000 x12: 000000003464d91d [ 213.116651] x11: 0000000000000040 x10: ffff000080165a70 x9 : ffffce78e8bb0160 [ 213.127500] x8 : ffff8000827b3160 x7 : 0a37b344852820af x6 : 3f049caedd1ff608 [ 213.138002] x5 : cff7cfdbfaf31291 x4 : ffffce78ea277340 x3 : 00000000ffffde04 [ 213.150428] x2 : 0000000000000101 x1 : ffffce78ea277340 x0 : ffff318ebe7d3000 [ 213.162063] Call trace: [ 213.165494] handle_softirqs+0x98/0x368 (P) [ 213.171256] __do_softirq+0x18/0x20 [ 213.177291] ____do_softirq+0x14/0x28 [ 213.182017] call_on_irq_stack+0x24/0x30 [ 213.186565] do_softirq_own_stack+0x20/0x30 [ 213.191815] __irq_exit_rcu+0x114/0x140 [ 213.196891] irq_exit_rcu+0x14/0x28 [ 213.202401] el1_interrupt+0x44/0xb8 [ 213.207741] el1h_64_irq_handler+0x14/0x20 [ 213.213519] el1h_64_irq+0x80/0x88 [ 213.217541] cpuidle_enter_state+0xc4/0x4a8 (P) [ 213.224364] cpuidle_enter+0x3c/0x58 [ 213.228653] cpuidle_idle_call+0x128/0x1c0 [ 213.233993] do_idle+0xe0/0xf0 [ 213.237928] cpu_startup_entry+0x3c/0x48 [ 213.243791] 
secondary_start_kernel+0xdc/0x120 [ 213.249830] __secondary_switched+0x74/0x78 This bug has existed since the dwmac-tegra driver was added in Dec 2022 (See Fixes tag below for commit hash). The Tegra234 SOC has 4 MGBE controllers, however Nvidia's Developer Kit only uses MGBE0 which is why the bug was not found previously. Connect Tech has many products that use 2 (or more) MGBE controllers. The solution is to read the controller's SID from the existing "iommus" device tree property. The 2nd field of the "iommus" device tree property is the controller's SID. Device tree snippet from tegra234.dtsi showing MGBE1's "iommus" property: smmu_niso0: iommu@12000000 { compatible = "nvidia,tegra234-smmu", "nvidia,smmu-500"; ... } /* MGBE1 */ ethernet@6900000 { compatible = "nvidia,tegra234-mgbe"; ... iommus = <&smmu_niso0 TEGRA234_SID_MGBE_VF1>; ... } Nvidia's arm-smmu driver reads the "iommus" property and stores the SID in the MGBE device's "fwspec" struct. The dwmac-tegra driver can access the SID using the tegra_dev_iommu_get_stream_id() helper function found in linux/iommu.h. Calling tegra_dev_iommu_get_stream_id() should not fail unless the "iommus" property is removed from the device tree or the IOMMU is disabled. While the Tegra234 SOC technically supports bypassing the IOMMU, it is not supported by the current firmware, has not been tested and not recommended. More detailed discussion with Thierry Reding from Nvidia linked below. Fixes: d8ca113 ("net: stmmac: tegra: Add MGBE support") Link: https://lore.kernel.org/netdev/cover.1731685185.git.pnewman@connecttech.com Signed-off-by: Parker Newman <pnewman@connecttech.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Acked-by: Thierry Reding <treding@nvidia.com> Link: https://patch.msgid.link/6fb97f32cf4accb4f7cf92846f6b60064ba0a3bd.1736284360.git.pnewman@connecttech.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
libtraceevent parses and returns an array of argument fields, sometimes larger than RAW_SYSCALL_ARGS_NUM (6) because it includes "__syscall_nr", idx will traverse to index 6 (7th element) whereas sc->fmt->arg holds 6 elements max, creating an out-of-bounds access. This runtime error is found by UBsan. The error message: $ sudo UBSAN_OPTIONS=print_stacktrace=1 ./perf trace -a --max-events=1 builtin-trace.c:1966:35: runtime error: index 6 out of bounds for type 'syscall_arg_fmt [6]' #0 0x5c04956be5fe in syscall__alloc_arg_fmts /home/howard/hw/linux-perf/tools/perf/builtin-trace.c:1966 #1 0x5c04956c0510 in trace__read_syscall_info /home/howard/hw/linux-perf/tools/perf/builtin-trace.c:2110 #2 0x5c04956c372b in trace__syscall_info /home/howard/hw/linux-perf/tools/perf/builtin-trace.c:2436 #3 0x5c04956d2f39 in trace__init_syscalls_bpf_prog_array_maps /home/howard/hw/linux-perf/tools/perf/builtin-trace.c:3897 #4 0x5c04956d6d25 in trace__run /home/howard/hw/linux-perf/tools/perf/builtin-trace.c:4335 #5 0x5c04956e112e in cmd_trace /home/howard/hw/linux-perf/tools/perf/builtin-trace.c:5502 #6 0x5c04956eda7d in run_builtin /home/howard/hw/linux-perf/tools/perf/perf.c:351 #7 0x5c04956ee0a8 in handle_internal_command /home/howard/hw/linux-perf/tools/perf/perf.c:404 #8 0x5c04956ee37f in run_argv /home/howard/hw/linux-perf/tools/perf/perf.c:448 #9 0x5c04956ee8e9 in main /home/howard/hw/linux-perf/tools/perf/perf.c:556 #10 0x79eb3622a3b7 in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58 #11 0x79eb3622a47a in __libc_start_main_impl ../csu/libc-start.c:360 #12 0x5c04955422d4 in _start (/home/howard/hw/linux-perf/tools/perf/perf+0x4e02d4) (BuildId: 5b6cab2d59e96a4341741765ad6914a4d784dbc6) 0.000 ( 0.014 ms): Chrome_ChildIO/117244 write(fd: 238, buf: !, count: 1) = 1 Fixes: 5e58fcf ("perf trace: Allow allocating sc->arg_fmt even without the syscall tracepoint") Signed-off-by: Howard Chu <howardchu95@gmail.com> Link:
https://lore.kernel.org/r/20250122025519.361873-1-howardchu95@gmail.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
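The failure mode described above is a walk over 7 fields into a table with room for 6. The Python model below is a sketch of one possible guard, not the actual patch: the function name `assign_arg_fmts`, the placeholder argument names, and the decision to skip "__syscall_nr" are all assumptions made for illustration.

```python
RAW_SYSCALL_ARGS_NUM = 6  # capacity named in the commit message


def assign_arg_fmts(field_names, capacity=RAW_SYSCALL_ARGS_NUM):
    """Map parsed fields to format slots without indexing past `capacity`."""
    slots = [None] * capacity
    idx = 0
    for name in field_names:
        if name == "__syscall_nr":
            # Assumption for this sketch: the syscall number is not an
            # argument and therefore needs no argument-format slot.
            continue
        if idx >= capacity:
            # Bounds guard: never touch slot 6 of a 6-element table.
            break
        slots[idx] = name
        idx += 1
    return slots


# Hypothetical field list: "__syscall_nr" plus six placeholder arguments,
# i.e. 7 entries for a table that holds 6.
FIELDS = ["__syscall_nr", "a0", "a1", "a2", "a3", "a4", "a5"]

print(assign_arg_fmts(FIELDS))
```

Without either the skip or the bounds check, the seventh field would be written to index 6, which is the access UBSan flagged.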
This fixes the following hard lockup in isolate_lru_folios() during memory reclaim. If the LRU mostly contains ineligible folios, this may trigger the watchdog.

    watchdog: Watchdog detected hard LOCKUP on cpu 173
    RIP: 0010:native_queued_spin_lock_slowpath+0x255/0x2a0
    Call Trace:
    _raw_spin_lock_irqsave+0x31/0x40
    folio_lruvec_lock_irqsave+0x5f/0x90
    folio_batch_move_lru+0x91/0x150
    lru_add_drain_per_cpu+0x1c/0x40
    process_one_work+0x17d/0x350
    worker_thread+0x27b/0x3a0
    kthread+0xe8/0x120
    ret_from_fork+0x34/0x50
    ret_from_fork_asm+0x1b/0x30

lruvec->lru_lock owner:

    PID: 2865 TASK: ffff888139214d40 CPU: 40 COMMAND: "kswapd0"
    #0 [fffffe0000945e60] crash_nmi_callback at ffffffffa567a555
    #1 [fffffe0000945e68] nmi_handle at ffffffffa563b171
    #2 [fffffe0000945eb0] default_do_nmi at ffffffffa6575920
    #3 [fffffe0000945ed0] exc_nmi at ffffffffa6575af4
    #4 [fffffe0000945ef0] end_repeat_nmi at ffffffffa6601dde
    [exception RIP: isolate_lru_folios+403]
    RIP: ffffffffa597df53 RSP: ffffc90006fb7c28 RFLAGS: 00000002
    RAX: 0000000000000001 RBX: ffffc90006fb7c60 RCX: ffffea04a2196f88
    RDX: ffffc90006fb7c60 RSI: ffffc90006fb7c60 RDI: ffffea04a2197048
    RBP: ffff88812cbd3010 R8: ffffea04a2197008 R9: 0000000000000001
    R10: 0000000000000000 R11: 0000000000000001 R12: ffffea04a2197008
    R13: ffffea04a2197048 R14: ffffc90006fb7de8 R15: 0000000003e3e937
    ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
    <NMI exception stack>
    #5 [ffffc90006fb7c28] isolate_lru_folios at ffffffffa597df53
    #6 [ffffc90006fb7cf8] shrink_active_list at ffffffffa597f788
    #7 [ffffc90006fb7da8] balance_pgdat at ffffffffa5986db0
    #8 [ffffc90006fb7ec0] kswapd at ffffffffa5987354
    #9 [ffffc90006fb7ef8] kthread at ffffffffa5748238
    crash>

Scenario:

User processes are requesting a large amount of memory and keep pages active. Then a module continuously requests memory from the ZONE_DMA32 area. Memory reclaim is triggered once the ZONE_DMA32 watermark is reached. However, pages in the LRU (active_anon) list are mostly from the ZONE_NORMAL area.

Reproduce:

Terminal 1: continuously increase Active(anon) pages.

    mkdir /tmp/memory
    mount -t tmpfs -o size=1024000M tmpfs /tmp/memory
    dd if=/dev/zero of=/tmp/memory/block bs=4M
    tail /tmp/memory/block

Terminal 2: run vmstat -a 1; active will increase.

    procs ---memory--- ---swap-- ---io---- -system-- ---cpu--- ...
     r  b swpd       free    inact    active  si  so  bi  bo
     1  0    0 1445623076 45898836  83646008   0   0   0
     1  0    0 1445623076 43450228  86094616   0   0   0
     1  0    0 1445623076 41003480  88541364   0   0   0
     1  0    0 1445623076 38557088  90987756   0   0   0
     1  0    0 1445623076 36109688  93435156   0   0   0
     1  0    0 1445619552 33663256  95881632   0   0   0
     1  0    0 1445619804 31217140  98327792   0   0   0
     1  0    0 1445619804 28769988 100774944   0   0   0
     1  0    0 1445619804 26322348 103222584   0   0   0
     1  0    0 1445619804 23875592 105669340   0   0   0

Run cat /proc/meminfo | head; Active(anon) increases.

    MemTotal:       1579941036 kB
    MemFree:        1445618500 kB
    MemAvailable:   1453013224 kB
    Buffers:              6516 kB
    Cached:          128653956 kB
    SwapCached:              0 kB
    Active:          118110812 kB
    Inactive:         11436620 kB
    Active(anon):    115345744 kB
    Inactive(anon):     945292 kB

When Active(anon) is 115345744 kB, loading the module (insmod) triggers the ZONE_DMA32 watermark.

    perf record -e vmscan:mm_vmscan_lru_isolate -aR
    perf script
    isolate_mode=0 classzone=1 order=1 nr_requested=32 nr_scanned=2 nr_skipped=2 nr_taken=0 lru=active_anon
    isolate_mode=0 classzone=1 order=1 nr_requested=32 nr_scanned=0 nr_skipped=0 nr_taken=0 lru=active_anon
    isolate_mode=0 classzone=1 order=0 nr_requested=32 nr_scanned=28835844 nr_skipped=28835844 nr_taken=0 lru=active_anon
    isolate_mode=0 classzone=1 order=1 nr_requested=32 nr_scanned=28835844 nr_skipped=28835844 nr_taken=0 lru=active_anon
    isolate_mode=0 classzone=1 order=0 nr_requested=32 nr_scanned=29 nr_skipped=29 nr_taken=0 lru=active_anon
    isolate_mode=0 classzone=1 order=0 nr_requested=32 nr_scanned=0 nr_skipped=0 nr_taken=0 lru=active_anon

See nr_scanned=28835844. 28835844 * 4k = 115343376 KB, which is approximately equal to 115345744 kB.

If Active(anon) is increased to 1000G and the module is then loaded to trigger the ZONE_DMA32 watermark, a hard lockup will occur.

On my device nr_scanned = 0000000003e3e937 at the time of the hard lockup. Converted to memory size: 0x0000000003e3e937 * 4KB = 261072092 KB.

    [ffffc90006fb7c28] isolate_lru_folios at ffffffffa597df53
    ffffc90006fb7c30: 0000000000000020 0000000000000000
    ffffc90006fb7c40: ffffc90006fb7d40 ffff88812cbd3000
    ffffc90006fb7c50: ffffc90006fb7d30 0000000106fb7de8
    ffffc90006fb7c60: ffffea04a2197008 ffffea0006ed4a48
    ffffc90006fb7c70: 0000000000000000 0000000000000000
    ffffc90006fb7c80: 0000000000000000 0000000000000000
    ffffc90006fb7c90: 0000000000000000 0000000000000000
    ffffc90006fb7ca0: 0000000000000000 0000000003e3e937
    ffffc90006fb7cb0: 0000000000000000 0000000000000000
    ffffc90006fb7cc0: 8d7c0b56b7874b00 ffff88812cbd3000

About the Fixes: Why did it take eight years to be discovered?

The problem requires the following conditions to occur:

1. The device memory should be large enough.
2. Pages in the LRU (active_anon) list are mostly from the ZONE_NORMAL area.
3. The memory in ZONE_DMA32 needs to reach the watermark.

If the memory is not large enough, or if the usage design of ZONE_DMA32 area memory is reasonable, this problem is difficult to detect.

Note: The problem is most likely to occur in ZONE_DMA32 and ZONE_NORMAL, but other suitable scenarios may also trigger the problem.

Link: https://lkml.kernel.org/r/20241119060842.274072-1-liuye@kylinos.cn
Fixes: b2e1875 ("mm, vmscan: begin reclaiming pages on a per-node basis")
Signed-off-by: liuye <liuye@kylinos.cn>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Yang Shi <yang@os.amperecomputing.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
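The two page-count conversions quoted in the report can be checked with a few lines of Python. This is only a restatement of the report's own arithmetic; the 4 KiB page size is taken from its "* 4k" and "* 4KB" factors, and the function name `scanned_kib` is made up for the example.

```python
PAGE_KIB = 4  # assumption: 4 KiB base pages, matching the "* 4k" in the report


def scanned_kib(nr_scanned: int) -> int:
    """Convert a count of scanned pages into KiB."""
    return nr_scanned * PAGE_KIB


# nr_scanned from the perf script trace of mm_vmscan_lru_isolate.
traced_kib = scanned_kib(28835844)

# nr_scanned read from the stack (and R15) at the time of the hard lockup.
lockup_kib = scanned_kib(0x0000000003E3E937)

# Active(anon) from /proc/meminfo when the module was loaded.
active_anon_kib = 115345744

print(traced_kib, lockup_kib, active_anon_kib - traced_kib)
```

The small gap between the traced amount and Active(anon) is what the report means by "approximately equal": nearly the whole active anon list was walked and skipped in a single call.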
When COWing a relocation tree path, at relocation.c:replace_path(), we can trigger a lockdep splat while we are in the btrfs_search_slot() call against the relocation root. This happens in that callchain at ctree.c:read_block_for_search() when we happen to find a child extent buffer already loaded through the fs tree with a lockdep class set to the fs tree. So when we attempt to lock that extent buffer through a relocation tree we have to reset the lockdep class to the class for a relocation tree, since a relocation tree has extent buffers that used to belong to a fs tree and may currently be already loaded (we swap extent buffers between the two trees at the end of replace_path()). However we are missing calls to btrfs_maybe_reset_lockdep_class() to reset the lockdep class at ctree.c:read_block_for_search() before we read lock an extent buffer, just like we did for btrfs_search_slot() in commit b40130b ("btrfs: fix lockdep splat with reloc root extent buffers"). So add the missing btrfs_maybe_reset_lockdep_class() calls before the attempts to read lock an extent buffer at ctree.c:read_block_for_search(). The lockdep splat was reported by syzbot and it looks like this: ====================================================== WARNING: possible circular locking dependency detected 6.13.0-rc5-syzkaller-00163-gab75170520d4 #0 Not tainted ------------------------------------------------------ syz.0.0/5335 is trying to acquire lock: ffff8880545dbc38 (btrfs-tree-01){++++}-{4:4}, at: btrfs_tree_read_lock_nested+0x2f/0x250 fs/btrfs/locking.c:146 but task is already holding lock: ffff8880545dba58 (btrfs-treloc-02/1){+.+.}-{4:4}, at: btrfs_tree_lock_nested+0x2f/0x250 fs/btrfs/locking.c:189 which lock already depends on the new lock. 
the existing dependency chain (in reverse order) is: -> #2 (btrfs-treloc-02/1){+.+.}-{4:4}: reacquire_held_locks+0x3eb/0x690 kernel/locking/lockdep.c:5374 __lock_release kernel/locking/lockdep.c:5563 [inline] lock_release+0x396/0xa30 kernel/locking/lockdep.c:5870 up_write+0x79/0x590 kernel/locking/rwsem.c:1629 btrfs_force_cow_block+0x14b3/0x1fd0 fs/btrfs/ctree.c:660 btrfs_cow_block+0x371/0x830 fs/btrfs/ctree.c:755 btrfs_search_slot+0xc01/0x3180 fs/btrfs/ctree.c:2153 replace_path+0x1243/0x2740 fs/btrfs/relocation.c:1224 merge_reloc_root+0xc46/0x1ad0 fs/btrfs/relocation.c:1692 merge_reloc_roots+0x3b3/0x980 fs/btrfs/relocation.c:1942 relocate_block_group+0xb0a/0xd40 fs/btrfs/relocation.c:3754 btrfs_relocate_block_group+0x77d/0xd90 fs/btrfs/relocation.c:4087 btrfs_relocate_chunk+0x12c/0x3b0 fs/btrfs/volumes.c:3494 __btrfs_balance+0x1b0f/0x26b0 fs/btrfs/volumes.c:4278 btrfs_balance+0xbdc/0x10c0 fs/btrfs/volumes.c:4655 btrfs_ioctl_balance+0x493/0x7c0 fs/btrfs/ioctl.c:3670 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:906 [inline] __se_sys_ioctl+0xf5/0x170 fs/ioctl.c:892 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f -> #1 (btrfs-tree-01/1){+.+.}-{4:4}: lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5849 down_write_nested+0xa2/0x220 kernel/locking/rwsem.c:1693 btrfs_tree_lock_nested+0x2f/0x250 fs/btrfs/locking.c:189 btrfs_init_new_buffer fs/btrfs/extent-tree.c:5052 [inline] btrfs_alloc_tree_block+0x41c/0x1440 fs/btrfs/extent-tree.c:5132 btrfs_force_cow_block+0x526/0x1fd0 fs/btrfs/ctree.c:573 btrfs_cow_block+0x371/0x830 fs/btrfs/ctree.c:755 btrfs_search_slot+0xc01/0x3180 fs/btrfs/ctree.c:2153 btrfs_insert_empty_items+0x9c/0x1a0 fs/btrfs/ctree.c:4351 btrfs_insert_empty_item fs/btrfs/ctree.h:688 [inline] btrfs_insert_inode_ref+0x2bb/0xf80 fs/btrfs/inode-item.c:330 btrfs_rename_exchange fs/btrfs/inode.c:7990 [inline] btrfs_rename2+0xcb7/0x2b90 fs/btrfs/inode.c:8374 
vfs_rename+0xbdb/0xf00 fs/namei.c:5067 do_renameat2+0xd94/0x13f0 fs/namei.c:5224 __do_sys_renameat2 fs/namei.c:5258 [inline] __se_sys_renameat2 fs/namei.c:5255 [inline] __x64_sys_renameat2+0xce/0xe0 fs/namei.c:5255 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f -> #0 (btrfs-tree-01){++++}-{4:4}: check_prev_add kernel/locking/lockdep.c:3161 [inline] check_prevs_add kernel/locking/lockdep.c:3280 [inline] validate_chain+0x18ef/0x5920 kernel/locking/lockdep.c:3904 __lock_acquire+0x1397/0x2100 kernel/locking/lockdep.c:5226 lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5849 down_read_nested+0xb5/0xa50 kernel/locking/rwsem.c:1649 btrfs_tree_read_lock_nested+0x2f/0x250 fs/btrfs/locking.c:146 btrfs_tree_read_lock fs/btrfs/locking.h:188 [inline] read_block_for_search+0x718/0xbb0 fs/btrfs/ctree.c:1610 btrfs_search_slot+0x1274/0x3180 fs/btrfs/ctree.c:2237 replace_path+0x1243/0x2740 fs/btrfs/relocation.c:1224 merge_reloc_root+0xc46/0x1ad0 fs/btrfs/relocation.c:1692 merge_reloc_roots+0x3b3/0x980 fs/btrfs/relocation.c:1942 relocate_block_group+0xb0a/0xd40 fs/btrfs/relocation.c:3754 btrfs_relocate_block_group+0x77d/0xd90 fs/btrfs/relocation.c:4087 btrfs_relocate_chunk+0x12c/0x3b0 fs/btrfs/volumes.c:3494 __btrfs_balance+0x1b0f/0x26b0 fs/btrfs/volumes.c:4278 btrfs_balance+0xbdc/0x10c0 fs/btrfs/volumes.c:4655 btrfs_ioctl_balance+0x493/0x7c0 fs/btrfs/ioctl.c:3670 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:906 [inline] __se_sys_ioctl+0xf5/0x170 fs/ioctl.c:892 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f other info that might help us debug this: Chain exists of: btrfs-tree-01 --> btrfs-tree-01/1 --> btrfs-treloc-02/1 Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(btrfs-treloc-02/1); lock(btrfs-tree-01/1); lock(btrfs-treloc-02/1); rlock(btrfs-tree-01); *** 
DEADLOCK *** 8 locks held by syz.0.0/5335: #0: ffff88801e3ae420 (sb_writers#13){.+.+}-{0:0}, at: mnt_want_write_file+0x5e/0x200 fs/namespace.c:559 #1: ffff888052c760d0 (&fs_info->reclaim_bgs_lock){+.+.}-{4:4}, at: __btrfs_balance+0x4c2/0x26b0 fs/btrfs/volumes.c:4183 #2: ffff888052c74850 (&fs_info->cleaner_mutex){+.+.}-{4:4}, at: btrfs_relocate_block_group+0x775/0xd90 fs/btrfs/relocation.c:4086 #3: ffff88801e3ae610 (sb_internal#2){.+.+}-{0:0}, at: merge_reloc_root+0xf11/0x1ad0 fs/btrfs/relocation.c:1659 #4: ffff888052c76470 (btrfs_trans_num_writers){++++}-{0:0}, at: join_transaction+0x405/0xda0 fs/btrfs/transaction.c:288 #5: ffff888052c76498 (btrfs_trans_num_extwriters){++++}-{0:0}, at: join_transaction+0x405/0xda0 fs/btrfs/transaction.c:288 #6: ffff8880545db878 (btrfs-tree-01/1){+.+.}-{4:4}, at: btrfs_tree_lock_nested+0x2f/0x250 fs/btrfs/locking.c:189 #7: ffff8880545dba58 (btrfs-treloc-02/1){+.+.}-{4:4}, at: btrfs_tree_lock_nested+0x2f/0x250 fs/btrfs/locking.c:189 stack backtrace: CPU: 0 UID: 0 PID: 5335 Comm: syz.0.0 Not tainted 6.13.0-rc5-syzkaller-00163-gab75170520d4 #0 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014 Call Trace: <TASK> __dump_stack lib/dump_stack.c:94 [inline] dump_stack_lvl+0x241/0x360 lib/dump_stack.c:120 print_circular_bug+0x13a/0x1b0 kernel/locking/lockdep.c:2074 check_noncircular+0x36a/0x4a0 kernel/locking/lockdep.c:2206 check_prev_add kernel/locking/lockdep.c:3161 [inline] check_prevs_add kernel/locking/lockdep.c:3280 [inline] validate_chain+0x18ef/0x5920 kernel/locking/lockdep.c:3904 __lock_acquire+0x1397/0x2100 kernel/locking/lockdep.c:5226 lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5849 down_read_nested+0xb5/0xa50 kernel/locking/rwsem.c:1649 btrfs_tree_read_lock_nested+0x2f/0x250 fs/btrfs/locking.c:146 btrfs_tree_read_lock fs/btrfs/locking.h:188 [inline] read_block_for_search+0x718/0xbb0 fs/btrfs/ctree.c:1610 btrfs_search_slot+0x1274/0x3180 fs/btrfs/ctree.c:2237 
replace_path+0x1243/0x2740 fs/btrfs/relocation.c:1224 merge_reloc_root+0xc46/0x1ad0 fs/btrfs/relocation.c:1692 merge_reloc_roots+0x3b3/0x980 fs/btrfs/relocation.c:1942 relocate_block_group+0xb0a/0xd40 fs/btrfs/relocation.c:3754 btrfs_relocate_block_group+0x77d/0xd90 fs/btrfs/relocation.c:4087 btrfs_relocate_chunk+0x12c/0x3b0 fs/btrfs/volumes.c:3494 __btrfs_balance+0x1b0f/0x26b0 fs/btrfs/volumes.c:4278 btrfs_balance+0xbdc/0x10c0 fs/btrfs/volumes.c:4655 btrfs_ioctl_balance+0x493/0x7c0 fs/btrfs/ioctl.c:3670 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:906 [inline] __se_sys_ioctl+0xf5/0x170 fs/ioctl.c:892 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7f1ac6985d29 Code: ff ff c3 (...) RSP: 002b:00007f1ac63fe038 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 00007f1ac6b76160 RCX: 00007f1ac6985d29 RDX: 0000000020000180 RSI: 00000000c4009420 RDI: 0000000000000007 RBP: 00007f1ac6a01b08 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 R13: 0000000000000001 R14: 00007f1ac6b76160 R15: 00007fffda145a88 </TASK> Reported-by: syzbot+63913e558c084f7f8fdc@syzkaller.appspotmail.com Link: https://lore.kernel.org/linux-btrfs/677b3014.050a0220.3b53b0.0064.GAE@google.com/ Fixes: 9978599 ("btrfs: reduce lock contention when eb cache miss for btree search") Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
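The circular dependency reported in the splat above can be pictured as a cycle check on a small graph of lock classes. The Python sketch below is a toy model, not lockdep's algorithm: the edges are copied from the "Chain exists of" line, the function name `closes_cycle` is invented for the example, and the post-reset class name "btrfs-treloc-01" is an assumption about what the extent buffer's class would be after btrfs_maybe_reset_lockdep_class().

```python
def closes_cycle(edges, held, wanted):
    """Return True if taking `wanted` while holding `held` closes a cycle.

    `edges` maps a lock class to the classes recorded after it in the
    existing dependency chain. Taking `wanted` under `held` adds the edge
    held -> wanted, so a cycle appears if `held` is reachable from `wanted`.
    """
    stack, seen = [wanted], set()
    while stack:
        node = stack.pop()
        if node == held:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(edges.get(node, ()))
    return False


# Existing chain from the splat:
# btrfs-tree-01 --> btrfs-tree-01/1 --> btrfs-treloc-02/1
CHAIN = {
    "btrfs-tree-01": ["btrfs-tree-01/1"],
    "btrfs-tree-01/1": ["btrfs-treloc-02/1"],
}

# The task holds btrfs-treloc-02/1 and read locks a buffer that still
# carries the fs tree class.
print(closes_cycle(CHAIN, "btrfs-treloc-02/1", "btrfs-tree-01"))

# Assumed class name once the buffer's class is reset to a relocation class.
print(closes_cycle(CHAIN, "btrfs-treloc-02/1", "btrfs-treloc-01"))
```

In this model the stale fs tree class is what links the new acquisition back into the existing chain; with a relocation tree class on the buffer, the same acquisition adds an edge that leads nowhere.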
We have several places across the kernel where we want to access another task's syscall arguments, such as ptrace(2), seccomp(2), etc., by making a call to syscall_get_arguments().

This works for register arguments right away by accessing the task's `regs' member of `struct pt_regs', however for stack arguments seen with 32-bit/o32 kernels things are more complicated. Technically they ought to be obtained from the user stack with calls to access_remote_vm(), but we have an easier way available already.

So as to be able to access syscall stack arguments as regular function arguments following the MIPS calling convention we copy them over from the user stack to the kernel stack in arch/mips/kernel/scall32-o32.S, in handle_sys(), to the current stack frame's outgoing argument space at the top of the stack, which is where the handler called expects to see its incoming arguments. This area is also pointed at by the `pt_regs' pointer obtained by task_pt_regs().

Make the o32 stack argument space a proper member of `struct pt_regs' then, by renaming the existing member from `pad0' to `args' and using generated offsets to access the space. No functional change though.

With the change in place the o32 kernel stack frame layout at the entry to a syscall handler invoked by handle_sys() is therefore as follows:

    $sp + 68 -> |         ...         | <- pt_regs.regs[9]
                +---------------------+
    $sp + 64 -> |         $t0         | <- pt_regs.regs[8]
                +---------------------+
    $sp + 60 -> |   $a3/argument #4   | <- pt_regs.regs[7]
                +---------------------+
    $sp + 56 -> |   $a2/argument #3   | <- pt_regs.regs[6]
                +---------------------+
    $sp + 52 -> |   $a1/argument #2   | <- pt_regs.regs[5]
                +---------------------+
    $sp + 48 -> |   $a0/argument #1   | <- pt_regs.regs[4]
                +---------------------+
    $sp + 44 -> |         $v1         | <- pt_regs.regs[3]
                +---------------------+
    $sp + 40 -> |         $v0         | <- pt_regs.regs[2]
                +---------------------+
    $sp + 36 -> |         $at         | <- pt_regs.regs[1]
                +---------------------+
    $sp + 32 -> |        $zero        | <- pt_regs.regs[0]
                +---------------------+
    $sp + 28 -> |  stack argument #8  | <- pt_regs.args[7]
                +---------------------+
    $sp + 24 -> |  stack argument #7  | <- pt_regs.args[6]
                +---------------------+
    $sp + 20 -> |  stack argument #6  | <- pt_regs.args[5]
                +---------------------+
    $sp + 16 -> |  stack argument #5  | <- pt_regs.args[4]
                +---------------------+
    $sp + 12 -> | psABI space for $a3 | <- pt_regs.args[3]
                +---------------------+
    $sp +  8 -> | psABI space for $a2 | <- pt_regs.args[2]
                +---------------------+
    $sp +  4 -> | psABI space for $a1 | <- pt_regs.args[1]
                +---------------------+
    $sp +  0 -> | psABI space for $a0 | <- pt_regs.args[0]
                +---------------------+

holding user data received and with the first 4 frame slots reserved by the psABI for the compiler to spill the incoming arguments from $a0-$a3 registers (which it sometimes does according to its needs) and the next 4 frame slots designated by the psABI for any stack function arguments that follow. This data is also available for other tasks to peek/poke at as required and where permitted.

Signed-off-by: Maciej W. Rozycki <macro@orcam.me.uk>
Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
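The frame layout described above can be cross-checked by modeling the start of the o32 `struct pt_regs' with Python's ctypes. This is a truncated model under stated assumptions: only the `args' and `regs' members named in the commit message are included (the real structure continues after `regs'), every slot is taken to be 32 bits wide, and the class and helper names are invented for the example.

```python
import ctypes


class PtRegsO32(ctypes.Structure):
    """Truncated model of the o32 pt_regs head: args[8] followed by regs[32]."""

    _fields_ = [
        ("args", ctypes.c_uint32 * 8),   # psABI spill space plus stack arguments
        ("regs", ctypes.c_uint32 * 32),  # saved general-purpose registers
    ]


def sp_offset(member: str, index: int) -> int:
    """Byte offset from $sp of one slot, e.g. sp_offset("regs", 4) for $a0."""
    field = getattr(PtRegsO32, member)
    return field.offset + index * ctypes.sizeof(ctypes.c_uint32)


print(sp_offset("args", 4))  # stack argument #5
print(sp_offset("regs", 4))  # $a0 / argument #1
print(sp_offset("regs", 8))  # $t0
```

Each result can be compared against the "$sp + N" column of the layout, which is the same information the generated offsets give the assembly code.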
This makes the ptrace/get_syscall_info selftest pass on mips o32 and mips64 o32 by fixing the following two test assertions:

1. get_syscall_info test assertion on mips o32:

        # get_syscall_info.c:218:get_syscall_info:Expected exp_args[5] (3134521044) == info.entry.args[4] (4911432)
        # get_syscall_info.c:219:get_syscall_info:wait #1: entry stop mismatch

2. get_syscall_info test assertion on mips64 o32:

        # get_syscall_info.c:209:get_syscall_info:Expected exp_args[2] (3134324433) == info.entry.args[1] (18446744072548908753)
        # get_syscall_info.c:210:get_syscall_info:wait #1: entry stop mismatch

The first assertion happens due to mips_get_syscall_arg() trying to access another task's context but failing to do it properly, because the get_user() it calls just peeks at the current task's context. It usually does not crash because the default user stack always gets assigned the same VMA, but it is pure luck which mips_get_syscall_arg() wouldn't have if e.g. the stack was switched (via setcontext(3) or however) or a non-default process's thread peeked at, and in any case irrelevant data is obtained just as observed with the test case.

mips_get_syscall_arg() ought to be using access_remote_vm() instead to retrieve the other task's stack contents, but given that the data has been already obtained and saved in `struct pt_regs' it would be overkill.

The first assertion is fixed for mips o32 by using struct pt_regs.args instead of get_user() to obtain syscall arguments. This approach works due to this piece in arch/mips/kernel/scall32-o32.S:

        /*
         * Ok, copy the args from the luser stack to the kernel stack.
         */

        .set    push
        .set    noreorder
        .set    nomacro

    load_a4: user_lw(t5, 16(t0))    # argument #5 from usp
    load_a5: user_lw(t6, 20(t0))    # argument #6 from usp
    load_a6: user_lw(t7, 24(t0))    # argument #7 from usp
    load_a7: user_lw(t8, 28(t0))    # argument #8 from usp
    loads_done:

        sw      t5, PT_ARG4(sp)     # argument #5 to ksp
        sw      t6, PT_ARG5(sp)     # argument #6 to ksp
        sw      t7, PT_ARG6(sp)     # argument #7 to ksp
        sw      t8, PT_ARG7(sp)     # argument #8 to ksp
        .set    pop

        .section __ex_table,"a"
        PTR_WD  load_a4, bad_stack_a4
        PTR_WD  load_a5, bad_stack_a5
        PTR_WD  load_a6, bad_stack_a6
        PTR_WD  load_a7, bad_stack_a7
        .previous

arch/mips/kernel/scall64-o32.S has analogous code for mips64 o32 that allows fixing the issue by obtaining syscall arguments from struct pt_regs.regs[4..11] instead of the erroneous use of get_user().

The second assertion is fixed by truncating 64-bit values to 32-bit syscall arguments.

Fixes: c0ff3c5 ("MIPS: Enable HAVE_ARCH_TRACEHOOK.")
Signed-off-by: Dmitry V. Levin <ldv@strace.io>
Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
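The second assertion quoted above is a sign-extension mismatch, and the two numbers in the test output can be related with plain integer arithmetic. The Python sketch below models a 64-bit register holding a sign-extended 32-bit value and the truncation that recovers the 32-bit argument; the function names are made up for the example and this is not the kernel code.

```python
U32_MASK = 0xFFFFFFFF
U64_HIGH = 0xFFFFFFFF00000000


def sign_extend_32(value: int) -> int:
    """Model a 64-bit register holding a sign-extended 32-bit value."""
    value &= U32_MASK
    if value & 0x80000000:
        value |= U64_HIGH
    return value


def truncate_to_u32(value: int) -> int:
    """Keep only the low 32 bits, as an o32 syscall argument expects."""
    return value & U32_MASK


expected = 3134324433            # exp_args[2] from the selftest output
reported = 18446744072548908753  # info.entry.args[1] from the selftest output

print(sign_extend_32(expected) == reported)
print(truncate_to_u32(reported) == expected)
```

Both comparisons hold for the values in the log, which is consistent with the commit's statement that truncating the 64-bit value to 32 bits fixes the mips64 o32 case.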
The first attempt at upstreaming is at:
https://lore.kernel.org/lkml/c10b0b99-5bf2-47d3-b879-872f80e6883d@kernel.org/