-
Notifications
You must be signed in to change notification settings - Fork 3.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
finish README #1
Comments
thewisenerd
pushed a commit
to thewisenerd/android_kernel_xiaomi_armani
that referenced
this issue
Jan 24, 2015
This moves ARM over to the asm-generic/unaligned.h header. This has the benefit of better code generated especially for ARMv7 on gcc 4.7+ compilers. As Arnd Bergmann, points out: The asm-generic version uses the "struct" version for native-endian unaligned access and the "byteshift" version for the opposite endianess. The current ARM version however uses the "byteshift" implementation for both. Thanks to Nicolas Pitre for the excellent analysis: Test case: int foo (int *x) { return get_unaligned(x); } long long bar (long long *x) { return get_unaligned(x); } With the current ARM version: foo: ldrb r3, [r0, MiCode#2] @ zero_extendqisi2 @ MEM[(const u8 *)x_1(D) + 2B], MEM[(const u8 *)x_1(D) + 2B] ldrb r1, [r0, MiCode#1] @ zero_extendqisi2 @ MEM[(const u8 *)x_1(D) + 1B], MEM[(const u8 *)x_1(D) + 1B] ldrb r2, [r0, #0] @ zero_extendqisi2 @ MEM[(const u8 *)x_1(D)], MEM[(const u8 *)x_1(D)] mov r3, r3, asl MiCode#16 @ tmp154, MEM[(const u8 *)x_1(D) + 2B], ldrb r0, [r0, MiCode#3] @ zero_extendqisi2 @ MEM[(const u8 *)x_1(D) + 3B], MEM[(const u8 *)x_1(D) + 3B] orr r3, r3, r1, asl MiCode#8 @, tmp155, tmp154, MEM[(const u8 *)x_1(D) + 1B], orr r3, r3, r2 @ tmp157, tmp155, MEM[(const u8 *)x_1(D)] orr r0, r3, r0, asl MiCode#24 @,, tmp157, MEM[(const u8 *)x_1(D) + 3B], bx lr @ bar: stmfd sp!, {r4, r5, r6, r7} @, mov r2, #0 @ tmp184, ldrb r5, [r0, MiCode#6] @ zero_extendqisi2 @ MEM[(const u8 *)x_1(D) + 6B], MEM[(const u8 *)x_1(D) + 6B] ldrb r4, [r0, MiCode#5] @ zero_extendqisi2 @ MEM[(const u8 *)x_1(D) + 5B], MEM[(const u8 *)x_1(D) + 5B] ldrb ip, [r0, MiCode#2] @ zero_extendqisi2 @ MEM[(const u8 *)x_1(D) + 2B], MEM[(const u8 *)x_1(D) + 2B] ldrb r1, [r0, MiCode#4] @ zero_extendqisi2 @ MEM[(const u8 *)x_1(D) + 4B], MEM[(const u8 *)x_1(D) + 4B] mov r5, r5, asl MiCode#16 @ tmp175, MEM[(const u8 *)x_1(D) + 6B], ldrb r7, [r0, MiCode#1] @ zero_extendqisi2 @ MEM[(const u8 *)x_1(D) + 1B], MEM[(const u8 *)x_1(D) + 1B] orr r5, r5, r4, asl MiCode#8 @, tmp176, tmp175, MEM[(const u8 *)x_1(D) + 5B], ldrb r6, [r0, MiCode#7] @ zero_extendqisi2 @ MEM[(const u8 *)x_1(D) + 7B], MEM[(const u8 *)x_1(D) + 7B] orr r5, r5, r1 @ tmp178, tmp176, MEM[(const u8 *)x_1(D) + 4B] ldrb r4, [r0, #0] @ zero_extendqisi2 @ MEM[(const u8 *)x_1(D)], MEM[(const u8 *)x_1(D)] mov ip, ip, asl MiCode#16 @ tmp188, MEM[(const u8 *)x_1(D) + 2B], ldrb r1, [r0, MiCode#3] @ zero_extendqisi2 @ MEM[(const u8 *)x_1(D) + 3B], MEM[(const u8 *)x_1(D) + 3B] orr ip, ip, r7, asl MiCode#8 @, tmp189, tmp188, MEM[(const u8 *)x_1(D) + 1B], orr r3, r5, r6, asl MiCode#24 @,, tmp178, MEM[(const u8 *)x_1(D) + 7B], orr ip, ip, r4 @ tmp191, tmp189, MEM[(const u8 *)x_1(D)] orr ip, ip, r1, asl MiCode#24 @, tmp194, tmp191, MEM[(const u8 *)x_1(D) + 3B], mov r1, r3 @, orr r0, r2, ip @ tmp171, tmp184, tmp194 ldmfd sp!, {r4, r5, r6, r7} bx lr In both cases the code is slightly suboptimal. One may wonder why wasting r2 with the constant 0 in the second case for example. And all the mov's could be folded in subsequent orr's, etc. Now with the asm-generic version: foo: ldr r0, [r0, #0] @ unaligned @,* x bx lr @ bar: mov r3, r0 @ x, x ldr r0, [r0, #0] @ unaligned @,* x ldr r1, [r3, MiCode#4] @ unaligned @, bx lr @ This is way better of course, but only because this was compiled for ARMv7. In this case the compiler knows that the hardware can do unaligned word access. This isn't that obvious for foo(), but if we remove the get_unaligned() from bar as follows: long long bar (long long *x) {return *x; } then the resulting code is: bar: ldmia r0, {r0, r1} @ x,, bx lr @ So this proves that the presumed aligned vs unaligned cases does have influence on the instructions the compiler may use and that the above unaligned code results are not just an accident. Still... this isn't fully conclusive without at least looking at the resulting assembly fron a pre ARMv6 compilation. Let's see with an ARMv5 target: foo: ldrb r3, [r0, #0] @ zero_extendqisi2 @ tmp139,* x ldrb r1, [r0, MiCode#1] @ zero_extendqisi2 @ tmp140, ldrb r2, [r0, MiCode#2] @ zero_extendqisi2 @ tmp143, ldrb r0, [r0, MiCode#3] @ zero_extendqisi2 @ tmp146, orr r3, r3, r1, asl MiCode#8 @, tmp142, tmp139, tmp140, orr r3, r3, r2, asl MiCode#16 @, tmp145, tmp142, tmp143, orr r0, r3, r0, asl MiCode#24 @,, tmp145, tmp146, bx lr @ bar: stmfd sp!, {r4, r5, r6, r7} @, ldrb r2, [r0, #0] @ zero_extendqisi2 @ tmp139,* x ldrb r7, [r0, MiCode#1] @ zero_extendqisi2 @ tmp140, ldrb r3, [r0, MiCode#4] @ zero_extendqisi2 @ tmp149, ldrb r6, [r0, MiCode#5] @ zero_extendqisi2 @ tmp150, ldrb r5, [r0, MiCode#2] @ zero_extendqisi2 @ tmp143, ldrb r4, [r0, MiCode#6] @ zero_extendqisi2 @ tmp153, ldrb r1, [r0, MiCode#7] @ zero_extendqisi2 @ tmp156, ldrb ip, [r0, MiCode#3] @ zero_extendqisi2 @ tmp146, orr r2, r2, r7, asl MiCode#8 @, tmp142, tmp139, tmp140, orr r3, r3, r6, asl MiCode#8 @, tmp152, tmp149, tmp150, orr r2, r2, r5, asl MiCode#16 @, tmp145, tmp142, tmp143, orr r3, r3, r4, asl MiCode#16 @, tmp155, tmp152, tmp153, orr r0, r2, ip, asl MiCode#24 @,, tmp145, tmp146, orr r1, r3, r1, asl MiCode#24 @,, tmp155, tmp156, ldmfd sp!, {r4, r5, r6, r7} bx lr Compared to the initial results, this is really nicely optimized and I couldn't do much better if I were to hand code it myself. Signed-off-by: Rob Herring <rob.herring@calxeda.com> Reviewed-by: Nicolas Pitre <nico@linaro.org> Tested-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk> modified for Mako from kernel.org reference Signed-off-by: faux123 <reioux@gmail.com> Conflicts: arch/arm/include/asm/unaligned.h Conflicts: arch/arm/include/asm/unaligned.h
thewisenerd
pushed a commit
to thewisenerd/android_kernel_xiaomi_armani
that referenced
this issue
Jan 24, 2015
Date Wed, 10 Apr 2013 14:07:47 -0700 When switching to a new cpu_base in switch_hrtimer_base(), we briefly enable preemption by unlocking the cpu_base lock in two places. During this interval it's possible for the running thread to be swapped to a different CPU. Consider the following example: CPU #0 CPU MiCode#1 ---- ---- hrtimer_start() ... lock_hrtimer_base() switch_hrtimer_base() this_cpu = 0; target_cpu_base = 0; raw_spin_unlock(&cpu_base->lock) <migrate to CPU 1> ... this_cpu == 0 cpu == this_cpu timer->base = CPU #0 timer->base != LOCAL_CPU Since the cached this_cpu is no longer accurate, we'll skip the hrtimer_check_target() check. Once we eventually go to program the hardware, we'll decide not to do so since it knows the real CPU that we're running on is not the same as the chosen base. As a consequence, we may end up missing the hrtimer's deadline. Fix this by updating the local CPU number each time we retake a cpu_base lock in switch_hrtimer_base(). Another possibility is to disable preemption across the whole of switch_hrtimer_base. This looks suboptimal since preemption would be disabled while waiting for lock(s). Signed-off-by: Michael Bohan <mbohan@codeaurora.org>
thewisenerd
pushed a commit
to thewisenerd/android_kernel_xiaomi_armani
that referenced
this issue
Jan 24, 2015
Date Wed, 10 Apr 2013 14:07:48 -0700 When switching the hrtimer cpu_base, we briefly allow for preemption to become enabled by unlocking the cpu_base lock. During this time, the CPU corresponding to the new cpu_base that was selected may in fact go offline. In this scenario, the hrtimer is enqueued to a CPU that's not online, and therefore it never fires. As an example, consider this example: CPU #0 CPU MiCode#1 ---- ---- ... hrtimer_start() lock_hrtimer_base() switch_hrtimer_base() cpu = hrtimer_get_target() -> 1 spin_unlock(&cpu_base->lock) <migrate thread to CPU #0> <offline> spin_lock(&new_base->lock) this_cpu = 0 cpu != this_cpu enqueue_hrtimer(cpu_base MiCode#1) To prevent this scenario, verify that the CPU corresponding to the new cpu_base is indeed online before selecting it in hrtimer_switch_base(). If it's not online, fallback to using the base of the current CPU. Signed-off-by: Michael Bohan <mbohan@codeaurora.org>
liuguo09
pushed a commit
that referenced
this issue
Mar 31, 2015
Setting an empty security context (length=0) on a file will lead to incorrectly dereferencing the type and other fields of the security context structure, yielding a kernel BUG. As a zero-length security context is never valid, just reject all such security contexts whether coming from userspace via setxattr or coming from the filesystem upon a getxattr request by SELinux. Setting a security context value (empty or otherwise) unknown to SELinux in the first place is only possible for a root process (CAP_MAC_ADMIN), and, if running SELinux in enforcing mode, only if the corresponding SELinux mac_admin permission is also granted to the domain by policy. In Fedora policies, this is only allowed for specific domains such as livecd for setting down security contexts that are not defined in the build host policy. [On Android, this can only be set by root/CAP_MAC_ADMIN processes, and if running SELinux in enforcing mode, only if mac_admin permission is granted in policy. In Android 4.4, this would only be allowed for root/CAP_MAC_ADMIN processes that are also in unconfined domains. In current AOSP master, mac_admin is not allowed for any domains except the recovery console which has a legitimate need for it. The other potential vector is mounting a maliciously crafted filesystem for which SELinux fetches xattrs (e.g. an ext4 filesystem on a SDcard). However, the end result is only a local denial-of-service (DOS) due to kernel BUG. This fix is queued for 3.14.] Reproducer: su setenforce 0 touch foo setfattr -n security.selinux foo Caveat: Relabeling or removing foo after doing the above may not be possible without booting with SELinux disabled. Any subsequent access to foo after doing the above will also trigger the BUG. BUG output from Matthew Thode: [ 473.893141] ------------[ cut here ]------------ [ 473.962110] kernel BUG at security/selinux/ss/services.c:654! [ 473.995314] invalid opcode: 0000 [#6] SMP [ 474.027196] Modules linked in: [ 474.058118] CPU: 0 PID: 8138 Comm: ls Tainted: G D I 3.13.0-grsec #1 [ 474.116637] Hardware name: Supermicro X8ST3/X8ST3, BIOS 2.0 07/29/10 [ 474.149768] task: ffff8805f50cd010 ti: ffff8805f50cd488 task.ti: ffff8805f50cd488 [ 474.183707] RIP: 0010:[<ffffffff814681c7>] [<ffffffff814681c7>] context_struct_compute_av+0xce/0x308 [ 474.219954] RSP: 0018:ffff8805c0ac3c38 EFLAGS: 00010246 [ 474.252253] RAX: 0000000000000000 RBX: ffff8805c0ac3d94 RCX: 0000000000000100 [ 474.287018] RDX: ffff8805e8aac000 RSI: 00000000ffffffff RDI: ffff8805e8aaa000 [ 474.321199] RBP: ffff8805c0ac3cb8 R08: 0000000000000010 R09: 0000000000000006 [ 474.357446] R10: 0000000000000000 R11: ffff8805c567a000 R12: 0000000000000006 [ 474.419191] R13: ffff8805c2b74e88 R14: 00000000000001da R15: 0000000000000000 [ 474.453816] FS: 00007f2e75220800(0000) GS:ffff88061fc00000(0000) knlGS:0000000000000000 [ 474.489254] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 474.522215] CR2: 00007f2e74716090 CR3: 00000005c085e000 CR4: 00000000000207f0 [ 474.556058] Stack: [ 474.584325] ffff8805c0ac3c98 ffffffff811b549b ffff8805c0ac3c98 ffff8805f1190a40 [ 474.618913] ffff8805a6202f08 ffff8805c2b74e88 00068800d0464990 ffff8805e8aac860 [ 474.653955] ffff8805c0ac3cb8 000700068113833a ffff880606c75060 ffff8805c0ac3d94 [ 474.690461] Call Trace: [ 474.723779] [<ffffffff811b549b>] ? lookup_fast+0x1cd/0x22a [ 474.778049] [<ffffffff81468824>] security_compute_av+0xf4/0x20b [ 474.811398] [<ffffffff8196f419>] avc_compute_av+0x2a/0x179 [ 474.843813] [<ffffffff8145727b>] avc_has_perm+0x45/0xf4 [ 474.875694] [<ffffffff81457d0e>] inode_has_perm+0x2a/0x31 [ 474.907370] [<ffffffff81457e76>] selinux_inode_getattr+0x3c/0x3e [ 474.938726] [<ffffffff81455cf6>] security_inode_getattr+0x1b/0x22 [ 474.970036] [<ffffffff811b057d>] vfs_getattr+0x19/0x2d [ 475.000618] [<ffffffff811b05e5>] vfs_fstatat+0x54/0x91 [ 475.030402] [<ffffffff811b063b>] vfs_lstat+0x19/0x1b [ 475.061097] [<ffffffff811b077e>] SyS_newlstat+0x15/0x30 [ 475.094595] [<ffffffff8113c5c1>] ? __audit_syscall_entry+0xa1/0xc3 [ 475.148405] [<ffffffff8197791e>] system_call_fastpath+0x16/0x1b [ 475.179201] Code: 00 48 85 c0 48 89 45 b8 75 02 0f 0b 48 8b 45 a0 48 8b 3d 45 d0 b6 00 8b 40 08 89 c6 ff ce e8 d1 b0 06 00 48 85 c0 49 89 c7 75 02 <0f> 0b 48 8b 45 b8 4c 8b 28 eb 1e 49 8d 7d 08 be 80 01 00 00 e8 [ 475.255884] RIP [<ffffffff814681c7>] context_struct_compute_av+0xce/0x308 [ 475.296120] RSP <ffff8805c0ac3c38> [ 475.328734] ---[ end trace f076482e9d754adc ]--- [sds: commit message edited to note Android implications and to generate a unique Change-Id for gerrit] Change-Id: I4d5389f0cfa72b5f59dada45081fa47e03805413 Reported-by: Matthew Thode <mthode@mthode.org> Signed-off-by: Stephen Smalley <sds@tycho.nsa.gov> Cc: stable@vger.kernel.org Signed-off-by: Paul Moore <pmoore@redhat.com> Signed-off-by: Sivasri Kumar Vanka <sivasri@codeaurora.org>
liuguo09
pushed a commit
that referenced
this issue
Mar 31, 2015
Setting an empty security context (length=0) on a file will lead to incorrectly dereferencing the type and other fields of the security context structure, yielding a kernel BUG. As a zero-length security context is never valid, just reject all such security contexts whether coming from userspace via setxattr or coming from the filesystem upon a getxattr request by SELinux. Setting a security context value (empty or otherwise) unknown to SELinux in the first place is only possible for a root process (CAP_MAC_ADMIN), and, if running SELinux in enforcing mode, only if the corresponding SELinux mac_admin permission is also granted to the domain by policy. In Fedora policies, this is only allowed for specific domains such as livecd for setting down security contexts that are not defined in the build host policy. [On Android, this can only be set by root/CAP_MAC_ADMIN processes, and if running SELinux in enforcing mode, only if mac_admin permission is granted in policy. In Android 4.4, this would only be allowed for root/CAP_MAC_ADMIN processes that are also in unconfined domains. In current AOSP master, mac_admin is not allowed for any domains except the recovery console which has a legitimate need for it. The other potential vector is mounting a maliciously crafted filesystem for which SELinux fetches xattrs (e.g. an ext4 filesystem on a SDcard). However, the end result is only a local denial-of-service (DOS) due to kernel BUG. This fix is queued for 3.14.] Reproducer: su setenforce 0 touch foo setfattr -n security.selinux foo Caveat: Relabeling or removing foo after doing the above may not be possible without booting with SELinux disabled. Any subsequent access to foo after doing the above will also trigger the BUG. BUG output from Matthew Thode: [ 473.893141] ------------[ cut here ]------------ [ 473.962110] kernel BUG at security/selinux/ss/services.c:654! [ 473.995314] invalid opcode: 0000 [#6] SMP [ 474.027196] Modules linked in: [ 474.058118] CPU: 0 PID: 8138 Comm: ls Tainted: G D I 3.13.0-grsec #1 [ 474.116637] Hardware name: Supermicro X8ST3/X8ST3, BIOS 2.0 07/29/10 [ 474.149768] task: ffff8805f50cd010 ti: ffff8805f50cd488 task.ti: ffff8805f50cd488 [ 474.183707] RIP: 0010:[<ffffffff814681c7>] [<ffffffff814681c7>] context_struct_compute_av+0xce/0x308 [ 474.219954] RSP: 0018:ffff8805c0ac3c38 EFLAGS: 00010246 [ 474.252253] RAX: 0000000000000000 RBX: ffff8805c0ac3d94 RCX: 0000000000000100 [ 474.287018] RDX: ffff8805e8aac000 RSI: 00000000ffffffff RDI: ffff8805e8aaa000 [ 474.321199] RBP: ffff8805c0ac3cb8 R08: 0000000000000010 R09: 0000000000000006 [ 474.357446] R10: 0000000000000000 R11: ffff8805c567a000 R12: 0000000000000006 [ 474.419191] R13: ffff8805c2b74e88 R14: 00000000000001da R15: 0000000000000000 [ 474.453816] FS: 00007f2e75220800(0000) GS:ffff88061fc00000(0000) knlGS:0000000000000000 [ 474.489254] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 474.522215] CR2: 00007f2e74716090 CR3: 00000005c085e000 CR4: 00000000000207f0 [ 474.556058] Stack: [ 474.584325] ffff8805c0ac3c98 ffffffff811b549b ffff8805c0ac3c98 ffff8805f1190a40 [ 474.618913] ffff8805a6202f08 ffff8805c2b74e88 00068800d0464990 ffff8805e8aac860 [ 474.653955] ffff8805c0ac3cb8 000700068113833a ffff880606c75060 ffff8805c0ac3d94 [ 474.690461] Call Trace: [ 474.723779] [<ffffffff811b549b>] ? lookup_fast+0x1cd/0x22a [ 474.778049] [<ffffffff81468824>] security_compute_av+0xf4/0x20b [ 474.811398] [<ffffffff8196f419>] avc_compute_av+0x2a/0x179 [ 474.843813] [<ffffffff8145727b>] avc_has_perm+0x45/0xf4 [ 474.875694] [<ffffffff81457d0e>] inode_has_perm+0x2a/0x31 [ 474.907370] [<ffffffff81457e76>] selinux_inode_getattr+0x3c/0x3e [ 474.938726] [<ffffffff81455cf6>] security_inode_getattr+0x1b/0x22 [ 474.970036] [<ffffffff811b057d>] vfs_getattr+0x19/0x2d [ 475.000618] [<ffffffff811b05e5>] vfs_fstatat+0x54/0x91 [ 475.030402] [<ffffffff811b063b>] vfs_lstat+0x19/0x1b [ 475.061097] [<ffffffff811b077e>] SyS_newlstat+0x15/0x30 [ 475.094595] [<ffffffff8113c5c1>] ? __audit_syscall_entry+0xa1/0xc3 [ 475.148405] [<ffffffff8197791e>] system_call_fastpath+0x16/0x1b [ 475.179201] Code: 00 48 85 c0 48 89 45 b8 75 02 0f 0b 48 8b 45 a0 48 8b 3d 45 d0 b6 00 8b 40 08 89 c6 ff ce e8 d1 b0 06 00 48 85 c0 49 89 c7 75 02 <0f> 0b 48 8b 45 b8 4c 8b 28 eb 1e 49 8d 7d 08 be 80 01 00 00 e8 [ 475.255884] RIP [<ffffffff814681c7>] context_struct_compute_av+0xce/0x308 [ 475.296120] RSP <ffff8805c0ac3c38> [ 475.328734] ---[ end trace f076482e9d754adc ]--- [sds: commit message edited to note Android implications and to generate a unique Change-Id for gerrit] Change-Id: I4d5389f0cfa72b5f59dada45081fa47e03805413 Reported-by: Matthew Thode <mthode@mthode.org> Signed-off-by: Stephen Smalley <sds@tycho.nsa.gov> Cc: stable@vger.kernel.org Signed-off-by: Paul Moore <pmoore@redhat.com> Signed-off-by: Sivasri Kumar Vanka <sivasri@codeaurora.org>
rooque
pushed a commit
to rooque/android_kernel_xiaomi_cancro
that referenced
this issue
May 2, 2015
commit 6f2e9f0 upstream. Now when we set the group inode free count, we don't have a proper group lock so that multiple threads may decrease the inode free count at the same time. And e2fsck will complain something like: Free inodes count wrong for group MiCode#1 (1, counted=0). Fix? no Free inodes count wrong for group MiCode#2 (3, counted=0). Fix? no Directories count wrong for group MiCode#2 (780, counted=779). Fix? no Free inodes count wrong for group MiCode#3 (2272, counted=2273). Fix? no So this patch try to protect it with the ext4_lock_group. btw, it is found by xfstests test case 269 and the volume is mkfsed with the parameter "-O ^resize_inode,^uninit_bg,extent,meta_bg,flex_bg,ext_attr" and I have run it 100 times and the error in e2fsck doesn't show up again. Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
rooque
pushed a commit
to rooque/android_kernel_xiaomi_cancro
that referenced
this issue
May 2, 2015
commit 00d4e73 upstream. In ext4_nonda_switch(), if the file system is getting full we used to call writeback_inodes_sb_if_idle(). The problem is that we can be holding i_mutex already, and this causes a potential deadlock when writeback_inodes_sb_if_idle() when it tries to take s_umount. (See lockdep output below). As it turns out we don't need need to hold s_umount; the fact that we are in the middle of the write(2) system call will keep the superblock pinned. Unfortunately writeback_inodes_sb() checks to make sure s_umount is taken, and the VFS uses a different mechanism for making sure the file system doesn't get unmounted out from under us. The simplest way of dealing with this is to just simply grab s_umount using a trylock, and skip kicking the writeback flusher thread in the very unlikely case that we can't take a read lock on s_umount without blocking. Also, we now check the cirteria for kicking the writeback thread before we decide to whether to fall back to non-delayed writeback, so if there are any outstanding delayed allocation writes, we try to get them resolved as soon as possible. [ INFO: possible circular locking dependency detected ] 3.6.0-rc1-00042-gce894ca MiCode#367 Not tainted ------------------------------------------------------- dd/8298 is trying to acquire lock: (&type->s_umount_key#18){++++..}, at: [<c02277d4>] writeback_inodes_sb_if_idle+0x28/0x46 but task is already holding lock: (&sb->s_type->i_mutex_key#8){+.+...}, at: [<c01ddcce>] generic_file_aio_write+0x5f/0xd3 which lock already depends on the new lock. 2 locks held by dd/8298: #0: (sb_writers#2){.+.+.+}, at: [<c01ddcc5>] generic_file_aio_write+0x56/0xd3 MiCode#1: (&sb->s_type->i_mutex_key#8){+.+...}, at: [<c01ddcce>] generic_file_aio_write+0x5f/0xd3 stack backtrace: Pid: 8298, comm: dd Not tainted 3.6.0-rc1-00042-gce894ca MiCode#367 Call Trace: [<c015b79c>] ? console_unlock+0x345/0x372 [<c06d62a1>] print_circular_bug+0x190/0x19d [<c019906c>] __lock_acquire+0x86d/0xb6c [<c01999db>] ? mark_held_locks+0x5c/0x7b [<c0199724>] lock_acquire+0x66/0xb9 [<c02277d4>] ? writeback_inodes_sb_if_idle+0x28/0x46 [<c06db935>] down_read+0x28/0x58 [<c02277d4>] ? writeback_inodes_sb_if_idle+0x28/0x46 [<c02277d4>] writeback_inodes_sb_if_idle+0x28/0x46 [<c026f3b2>] ext4_nonda_switch+0xe1/0xf4 [<c0271ece>] ext4_da_write_begin+0x27/0x193 [<c01dcdb0>] generic_file_buffered_write+0xc8/0x1bb [<c01ddc47>] __generic_file_aio_write+0x1dd/0x205 [<c01ddce7>] generic_file_aio_write+0x78/0xd3 [<c026d336>] ext4_file_write+0x480/0x4a6 [<c0198c1d>] ? __lock_acquire+0x41e/0xb6c [<c0180944>] ? sched_clock_cpu+0x11a/0x13e [<c01967e9>] ? trace_hardirqs_off+0xb/0xd [<c018099f>] ? local_clock+0x37/0x4e [<c0209f2c>] do_sync_write+0x67/0x9d [<c0209ec5>] ? wait_on_retry_sync_kiocb+0x44/0x44 [<c020a7b9>] vfs_write+0x7b/0xe6 [<c020a9a6>] sys_write+0x3b/0x64 [<c06dd4bd>] syscall_call+0x7/0xb Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
rooque
pushed a commit
to rooque/android_kernel_xiaomi_cancro
that referenced
this issue
May 2, 2015
commit 42d64e1 upstream. The SELinux/NetLabel glue code has a locking bug that affects systems with NetLabel enabled, see the kernel error message below. This patch corrects this problem by converting the bottom half socket lock to a more conventional, and correct for this call-path, lock_sock() call. =============================== [ INFO: suspicious RCU usage. ] 3.11.0-rc3+ MiCode#19 Not tainted ------------------------------- net/ipv4/cipso_ipv4.c:1928 suspicious rcu_dereference_protected() usage! other info that might help us debug this: rcu_scheduler_active = 1, debug_locks = 0 2 locks held by ping/731: #0: (slock-AF_INET/1){+.-...}, at: [...] selinux_netlbl_socket_connect MiCode#1: (rcu_read_lock){.+.+..}, at: [<...>] netlbl_conn_setattr stack backtrace: CPU: 1 PID: 731 Comm: ping Not tainted 3.11.0-rc3+ MiCode#19 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 0000000000000001 ffff88006f659d28 ffffffff81726b6a ffff88003732c500 ffff88006f659d58 ffffffff810e4457 ffff88006b845a00 0000000000000000 000000000000000c ffff880075aa2f50 ffff88006f659d90 ffffffff8169bec7 Call Trace: [<ffffffff81726b6a>] dump_stack+0x54/0x74 [<ffffffff810e4457>] lockdep_rcu_suspicious+0xe7/0x120 [<ffffffff8169bec7>] cipso_v4_sock_setattr+0x187/0x1a0 [<ffffffff8170f317>] netlbl_conn_setattr+0x187/0x190 [<ffffffff8170f195>] ? netlbl_conn_setattr+0x5/0x190 [<ffffffff8131ac9e>] selinux_netlbl_socket_connect+0xae/0xc0 [<ffffffff81303025>] selinux_socket_connect+0x135/0x170 [<ffffffff8119d127>] ? might_fault+0x57/0xb0 [<ffffffff812fb146>] security_socket_connect+0x16/0x20 [<ffffffff815d3ad3>] SYSC_connect+0x73/0x130 [<ffffffff81739a85>] ? sysret_check+0x22/0x5d [<ffffffff810e5e2d>] ? trace_hardirqs_on_caller+0xfd/0x1c0 [<ffffffff81373d4e>] ? trace_hardirqs_on_thunk+0x3a/0x3f [<ffffffff815d52be>] SyS_connect+0xe/0x10 [<ffffffff81739a59>] system_call_fastpath+0x16/0x1b Signed-off-by: Paul Moore <pmoore@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
rooque
pushed a commit
to rooque/android_kernel_xiaomi_cancro
that referenced
this issue
May 7, 2015
commit 6f2e9f0 upstream. Now when we set the group inode free count, we don't have a proper group lock so that multiple threads may decrease the inode free count at the same time. And e2fsck will complain something like: Free inodes count wrong for group MiCode#1 (1, counted=0). Fix? no Free inodes count wrong for group MiCode#2 (3, counted=0). Fix? no Directories count wrong for group MiCode#2 (780, counted=779). Fix? no Free inodes count wrong for group MiCode#3 (2272, counted=2273). Fix? no So this patch try to protect it with the ext4_lock_group. btw, it is found by xfstests test case 269 and the volume is mkfsed with the parameter "-O ^resize_inode,^uninit_bg,extent,meta_bg,flex_bg,ext_attr" and I have run it 100 times and the error in e2fsck doesn't show up again. Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
rooque
pushed a commit
to rooque/android_kernel_xiaomi_cancro
that referenced
this issue
May 7, 2015
commit 00d4e73 upstream. In ext4_nonda_switch(), if the file system is getting full we used to call writeback_inodes_sb_if_idle(). The problem is that we can be holding i_mutex already, and this causes a potential deadlock when writeback_inodes_sb_if_idle() when it tries to take s_umount. (See lockdep output below). As it turns out we don't need need to hold s_umount; the fact that we are in the middle of the write(2) system call will keep the superblock pinned. Unfortunately writeback_inodes_sb() checks to make sure s_umount is taken, and the VFS uses a different mechanism for making sure the file system doesn't get unmounted out from under us. The simplest way of dealing with this is to just simply grab s_umount using a trylock, and skip kicking the writeback flusher thread in the very unlikely case that we can't take a read lock on s_umount without blocking. Also, we now check the cirteria for kicking the writeback thread before we decide to whether to fall back to non-delayed writeback, so if there are any outstanding delayed allocation writes, we try to get them resolved as soon as possible. [ INFO: possible circular locking dependency detected ] 3.6.0-rc1-00042-gce894ca MiCode#367 Not tainted ------------------------------------------------------- dd/8298 is trying to acquire lock: (&type->s_umount_key#18){++++..}, at: [<c02277d4>] writeback_inodes_sb_if_idle+0x28/0x46 but task is already holding lock: (&sb->s_type->i_mutex_key#8){+.+...}, at: [<c01ddcce>] generic_file_aio_write+0x5f/0xd3 which lock already depends on the new lock. 2 locks held by dd/8298: #0: (sb_writers#2){.+.+.+}, at: [<c01ddcc5>] generic_file_aio_write+0x56/0xd3 MiCode#1: (&sb->s_type->i_mutex_key#8){+.+...}, at: [<c01ddcce>] generic_file_aio_write+0x5f/0xd3 stack backtrace: Pid: 8298, comm: dd Not tainted 3.6.0-rc1-00042-gce894ca MiCode#367 Call Trace: [<c015b79c>] ? console_unlock+0x345/0x372 [<c06d62a1>] print_circular_bug+0x190/0x19d [<c019906c>] __lock_acquire+0x86d/0xb6c [<c01999db>] ? mark_held_locks+0x5c/0x7b [<c0199724>] lock_acquire+0x66/0xb9 [<c02277d4>] ? writeback_inodes_sb_if_idle+0x28/0x46 [<c06db935>] down_read+0x28/0x58 [<c02277d4>] ? writeback_inodes_sb_if_idle+0x28/0x46 [<c02277d4>] writeback_inodes_sb_if_idle+0x28/0x46 [<c026f3b2>] ext4_nonda_switch+0xe1/0xf4 [<c0271ece>] ext4_da_write_begin+0x27/0x193 [<c01dcdb0>] generic_file_buffered_write+0xc8/0x1bb [<c01ddc47>] __generic_file_aio_write+0x1dd/0x205 [<c01ddce7>] generic_file_aio_write+0x78/0xd3 [<c026d336>] ext4_file_write+0x480/0x4a6 [<c0198c1d>] ? __lock_acquire+0x41e/0xb6c [<c0180944>] ? sched_clock_cpu+0x11a/0x13e [<c01967e9>] ? trace_hardirqs_off+0xb/0xd [<c018099f>] ? local_clock+0x37/0x4e [<c0209f2c>] do_sync_write+0x67/0x9d [<c0209ec5>] ? wait_on_retry_sync_kiocb+0x44/0x44 [<c020a7b9>] vfs_write+0x7b/0xe6 [<c020a9a6>] sys_write+0x3b/0x64 [<c06dd4bd>] syscall_call+0x7/0xb Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
ivan19871002
referenced
this issue
in mi3-dev/android_kernel_xiaomi_msm8x74pro
May 12, 2015
An issue was observed when a userspace task exits. The page which hits error here is the zero page. In binder mmap, the whole of vma is not mapped. On a task crash, when debuggerd reads the binder regions, the unmapped areas fall to do_anonymous_page in handle_pte_fault, due to the absence of a vm_fault handler. This results in zero page being mapped. Later in zap_pte_range, vm_normal_page returns zero page in the case of VM_MIXEDMAP and it results in the error. BUG: Bad page map in process mediaserver pte:9dff379f pmd:9bfbd831 page:c0ed8e60 count:1 mapcount:-1 mapping: (null) index:0x0 page flags: 0x404(referenced|reserved) addr:40c3f000 vm_flags:10220051 anon_vma: (null) mapping:d9fe0764 index:fd vma->vm_ops->fault: (null) vma->vm_file->f_op->mmap: binder_mmap+0x0/0x274 CPU: 0 PID: 1463 Comm: mediaserver Tainted: G W 3.10.17+ #1 [<c001549c>] (unwind_backtrace+0x0/0x11c) from [<c001200c>] (show_stack+0x10/0x14) [<c001200c>] (show_stack+0x10/0x14) from [<c0103d78>] (print_bad_pte+0x158/0x190) [<c0103d78>] (print_bad_pte+0x158/0x190) from [<c01055f0>] (unmap_single_vma+0x2e4/0x598) [<c01055f0>] (unmap_single_vma+0x2e4/0x598) from [<c010618c>] (unmap_vmas+0x34/0x50) [<c010618c>] (unmap_vmas+0x34/0x50) from [<c010a9e4>] (exit_mmap+0xc8/0x1e8) [<c010a9e4>] (exit_mmap+0xc8/0x1e8) from [<c00520f0>] (mmput+0x54/0xd0) [<c00520f0>] (mmput+0x54/0xd0) from [<c005972c>] (do_exit+0x360/0x990) [<c005972c>] (do_exit+0x360/0x990) from [<c0059ef0>] (do_group_exit+0x84/0xc0) [<c0059ef0>] (do_group_exit+0x84/0xc0) from [<c0066de0>] (get_signal_to_deliver+0x4d4/0x548) [<c0066de0>] (get_signal_to_deliver+0x4d4/0x548) from [<c0011500>] (do_signal+0xa8/0x3b8) Add a vm_fault handler which returns VM_FAULT_SIGBUS, and prevents the wrong fallback to do_anonymous_page. CRs-Fixed: 673147 Change-Id: I43730a51b6c819538b46c5e4dc5c96c8a384098d Signed-off-by: Vinayak Menon <vinayakm.list@gmail.com> Patch-mainline: linux-arm-kernel @ 06/02/14, 18:17 Signed-off-by: Vignesh Radhakrishnan <vigneshr@codeaurora.org> Signed-off-by: Subbaraman Narayanamurthy <subbaram@codeaurora.org>
ivan19871002
referenced
this issue
in mi3-dev/android_kernel_xiaomi_msm8x74pro
May 12, 2015
commit 086ba77 upstream. ARM has some private syscalls (for example, set_tls(2)) which lie outside the range of NR_syscalls. If any of these are called while syscall tracing is being performed, out-of-bounds array access will occur in the ftrace and perf sys_{enter,exit} handlers. # trace-cmd record -e raw_syscalls:* true && trace-cmd report ... true-653 [000] 384.675777: sys_enter: NR 192 (0, 1000, 3, 4000022, ffffffff, 0) true-653 [000] 384.675812: sys_exit: NR 192 = 1995915264 true-653 [000] 384.675971: sys_enter: NR 983045 (76f74480, 76f74000, 76f74b28, 76f74480, 76f76f74, 1) true-653 [000] 384.675988: sys_exit: NR 983045 = 0 ... # trace-cmd record -e syscalls:* true [ 17.289329] Unable to handle kernel paging request at virtual address aaaaaace [ 17.289590] pgd = 9e71c000 [ 17.289696] [aaaaaace] *pgd=00000000 [ 17.289985] Internal error: Oops: 5 [#1] PREEMPT SMP ARM [ 17.290169] Modules linked in: [ 17.290391] CPU: 0 PID: 704 Comm: true Not tainted 3.18.0-rc2+ MiCode#21 [ 17.290585] task: 9f4dab00 ti: 9e710000 task.ti: 9e710000 [ 17.290747] PC is at ftrace_syscall_enter+0x48/0x1f8 [ 17.290866] LR is at syscall_trace_enter+0x124/0x184 Fix this by ignoring out-of-NR_syscalls-bounds syscall numbers. Commit cd0980f "tracing: Check invalid syscall nr while tracing syscalls" added the check for less than zero, but it should have also checked for greater than NR_syscalls. Link: http://lkml.kernel.org/p/1414620418-29472-1-git-send-email-rabin@rab.in Fixes: cd0980f "tracing: Check invalid syscall nr while tracing syscalls" Signed-off-by: Rabin Vincent <rabin@rab.in> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> [lizf: Backported to 3.4: adjust context] Signed-off-by: Zefan Li <lizefan@huawei.com>
ivan19871002
referenced
this issue
in mi3-dev/android_kernel_xiaomi_msm8x74pro
May 12, 2015
commit 42d64e1 upstream. The SELinux/NetLabel glue code has a locking bug that affects systems with NetLabel enabled, see the kernel error message below. This patch corrects this problem by converting the bottom half socket lock to a more conventional, and correct for this call-path, lock_sock() call. =============================== [ INFO: suspicious RCU usage. ] 3.11.0-rc3+ MiCode#19 Not tainted ------------------------------- net/ipv4/cipso_ipv4.c:1928 suspicious rcu_dereference_protected() usage! other info that might help us debug this: rcu_scheduler_active = 1, debug_locks = 0 2 locks held by ping/731: #0: (slock-AF_INET/1){+.-...}, at: [...] selinux_netlbl_socket_connect #1: (rcu_read_lock){.+.+..}, at: [<...>] netlbl_conn_setattr stack backtrace: CPU: 1 PID: 731 Comm: ping Not tainted 3.11.0-rc3+ MiCode#19 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 0000000000000001 ffff88006f659d28 ffffffff81726b6a ffff88003732c500 ffff88006f659d58 ffffffff810e4457 ffff88006b845a00 0000000000000000 000000000000000c ffff880075aa2f50 ffff88006f659d90 ffffffff8169bec7 Call Trace: [<ffffffff81726b6a>] dump_stack+0x54/0x74 [<ffffffff810e4457>] lockdep_rcu_suspicious+0xe7/0x120 [<ffffffff8169bec7>] cipso_v4_sock_setattr+0x187/0x1a0 [<ffffffff8170f317>] netlbl_conn_setattr+0x187/0x190 [<ffffffff8170f195>] ? netlbl_conn_setattr+0x5/0x190 [<ffffffff8131ac9e>] selinux_netlbl_socket_connect+0xae/0xc0 [<ffffffff81303025>] selinux_socket_connect+0x135/0x170 [<ffffffff8119d127>] ? might_fault+0x57/0xb0 [<ffffffff812fb146>] security_socket_connect+0x16/0x20 [<ffffffff815d3ad3>] SYSC_connect+0x73/0x130 [<ffffffff81739a85>] ? sysret_check+0x22/0x5d [<ffffffff810e5e2d>] ? trace_hardirqs_on_caller+0xfd/0x1c0 [<ffffffff81373d4e>] ? trace_hardirqs_on_thunk+0x3a/0x3f [<ffffffff815d52be>] SyS_connect+0xe/0x10 [<ffffffff81739a59>] system_call_fastpath+0x16/0x1b Signed-off-by: Paul Moore <pmoore@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
ivan19871002
referenced
this issue
in mi3-dev/android_kernel_xiaomi_msm8x74pro
May 12, 2015
commit 00d4e73 upstream. In ext4_nonda_switch(), if the file system is getting full we used to call writeback_inodes_sb_if_idle(). The problem is that we can be holding i_mutex already, and this causes a potential deadlock when writeback_inodes_sb_if_idle() when it tries to take s_umount. (See lockdep output below). As it turns out we don't need need to hold s_umount; the fact that we are in the middle of the write(2) system call will keep the superblock pinned. Unfortunately writeback_inodes_sb() checks to make sure s_umount is taken, and the VFS uses a different mechanism for making sure the file system doesn't get unmounted out from under us. The simplest way of dealing with this is to just simply grab s_umount using a trylock, and skip kicking the writeback flusher thread in the very unlikely case that we can't take a read lock on s_umount without blocking. Also, we now check the cirteria for kicking the writeback thread before we decide to whether to fall back to non-delayed writeback, so if there are any outstanding delayed allocation writes, we try to get them resolved as soon as possible. [ INFO: possible circular locking dependency detected ] 3.6.0-rc1-00042-gce894ca MiCode#367 Not tainted ------------------------------------------------------- dd/8298 is trying to acquire lock: (&type->s_umount_key#18){++++..}, at: [<c02277d4>] writeback_inodes_sb_if_idle+0x28/0x46 but task is already holding lock: (&sb->s_type->i_mutex_key#8){+.+...}, at: [<c01ddcce>] generic_file_aio_write+0x5f/0xd3 which lock already depends on the new lock. 2 locks held by dd/8298: #0: (sb_writers#2){.+.+.+}, at: [<c01ddcc5>] generic_file_aio_write+0x56/0xd3 #1: (&sb->s_type->i_mutex_key#8){+.+...}, at: [<c01ddcce>] generic_file_aio_write+0x5f/0xd3 stack backtrace: Pid: 8298, comm: dd Not tainted 3.6.0-rc1-00042-gce894ca MiCode#367 Call Trace: [<c015b79c>] ? console_unlock+0x345/0x372 [<c06d62a1>] print_circular_bug+0x190/0x19d [<c019906c>] __lock_acquire+0x86d/0xb6c [<c01999db>] ? mark_held_locks+0x5c/0x7b [<c0199724>] lock_acquire+0x66/0xb9 [<c02277d4>] ? writeback_inodes_sb_if_idle+0x28/0x46 [<c06db935>] down_read+0x28/0x58 [<c02277d4>] ? writeback_inodes_sb_if_idle+0x28/0x46 [<c02277d4>] writeback_inodes_sb_if_idle+0x28/0x46 [<c026f3b2>] ext4_nonda_switch+0xe1/0xf4 [<c0271ece>] ext4_da_write_begin+0x27/0x193 [<c01dcdb0>] generic_file_buffered_write+0xc8/0x1bb [<c01ddc47>] __generic_file_aio_write+0x1dd/0x205 [<c01ddce7>] generic_file_aio_write+0x78/0xd3 [<c026d336>] ext4_file_write+0x480/0x4a6 [<c0198c1d>] ? __lock_acquire+0x41e/0xb6c [<c0180944>] ? sched_clock_cpu+0x11a/0x13e [<c01967e9>] ? trace_hardirqs_off+0xb/0xd [<c018099f>] ? local_clock+0x37/0x4e [<c0209f2c>] do_sync_write+0x67/0x9d [<c0209ec5>] ? wait_on_retry_sync_kiocb+0x44/0x44 [<c020a7b9>] vfs_write+0x7b/0xe6 [<c020a9a6>] sys_write+0x3b/0x64 [<c06dd4bd>] syscall_call+0x7/0xb Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
ivan19871002
referenced
this issue
in mi3-dev/android_kernel_xiaomi_msm8x74pro
May 12, 2015
commit 6f2e9f0 upstream. Now when we set the group inode free count, we don't have a proper group lock so that multiple threads may decrease the inode free count at the same time. And e2fsck will complain something like: Free inodes count wrong for group #1 (1, counted=0). Fix? no Free inodes count wrong for group MiCode#2 (3, counted=0). Fix? no Directories count wrong for group MiCode#2 (780, counted=779). Fix? no Free inodes count wrong for group MiCode#3 (2272, counted=2273). Fix? no So this patch try to protect it with the ext4_lock_group. btw, it is found by xfstests test case 269 and the volume is mkfsed with the parameter "-O ^resize_inode,^uninit_bg,extent,meta_bg,flex_bg,ext_attr" and I have run it 100 times and the error in e2fsck doesn't show up again. Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
rooque
pushed a commit
to rooque/android_kernel_xiaomi_cancro
that referenced
this issue
May 12, 2015
…ernor For the sync_freq feature currently we check pcpu->policy->cur frequency for each online cpu. But for a CPU that isn't using interactive governor or for an offline CPU, pcpu->policy can be null or an invalid value. This patch tries to avoid that scenario by using pcpu->target_freq instead of policy->cur to get the frequency of an online CPU. Kernel crash without this patch: [ 20.132373] Unable to handle kernel NULL pointer dereference at virtual address 00000028 [ 20.132375] pgd = c34f34c0 [ 20.132377] pgd = ef6f2440 [ 20.132383] [00000028] *pgd=00000000 [ 20.132385] [ 20.132388] [00000028] *pgd=2e98f003, *pmd=00000000 [ 20.132390] Internal error: Oops: 205 [MiCode#1] PREEMPT SMP ARM [ 20.132394] Modules linked in: [ 20.132398] CPU: 0 PID: 1560 Comm: chown Tainted: G W 3.10.0-perf-gb12057b-00001-ga2c6c16-dirty MiCode#7 [ 20.132401] task: ef9af300 ti: ee49c000 task.ti: ee49c000 [ 20.132411] PC is at cpufreq_interactive_timer+0x10c/0x650 [ 20.132415] LR is at cpufreq_interactive_timer+0x128/0x650 <snip> [ 20.133002] [<c07eb204>] (cpufreq_interactive_timer+0x10c/0x650) from [<c02804d8>] (call_timer_fn+0x80/0x198) [ 20.133012] [<c02804d8>] (call_timer_fn+0x80/0x198) from [<c0280acc>] (run_timer_softirq+0x1f8/0x270) [ 20.133019] [<c0280acc>] (run_timer_softirq+0x1f8/0x270) from [<c0279e20>] (__do_softirq+0x12c/0x2d4) [ 20.133025] [<c0279e20>] (__do_softirq+0x12c/0x2d4) from [<c027a2d4>] (irq_exit+0x74/0xc8) [ 20.133034] [<c027a2d4>] (irq_exit+0x74/0xc8) from [<c0206a00>] (handle_IRQ+0x68/0x8c) [ 20.133041] [<c0206a00>] (handle_IRQ+0x68/0x8c) from [<c02004b8>] (gic_handle_irq+0x3c/0x60) [ 20.133051] [<c02004b8>] (gic_handle_irq+0x3c/0x60) from [<c0ac6900>] (__irq_svc+0x40/0x70) <snip> Change-Id: I4adf0b35cd94004e06ea649c672988db45c54710 Signed-off-by: Vijay Ganti <viganti@codeaurora.org>
rooque
pushed a commit
to rooque/android_kernel_xiaomi_cancro
that referenced
this issue
May 29, 2015
workqueue: change BUG_ON() to WARN_ON() This BUG_ON() can be triggered if you call schedule_work() before calling INIT_WORK(). It is a bug definitely, but it's nicer to just print a stack trace and return. Reported-by: Matt Renzelmann <mjr@cs.wisc.edu> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Tejun Heo <tj@kernel.org> workqueue: Catch more locking problems with flush_work() If a workqueue is flushed with flush_work() lockdep checking can be circumvented. For example: static DEFINE_MUTEX(mutex); static void my_work(struct work_struct *w) { mutex_lock(&mutex); mutex_unlock(&mutex); } static DECLARE_WORK(work, my_work); static int __init start_test_module(void) { schedule_work(&work); return 0; } module_init(start_test_module); static void __exit stop_test_module(void) { mutex_lock(&mutex); flush_work(&work); mutex_unlock(&mutex); } module_exit(stop_test_module); would not always print a warning when flush_work() was called. In this trivial example nothing could go wrong since we are guaranteed module_init() and module_exit() don't run concurrently, but if the work item is schedule asynchronously we could have a scenario where the work item is running just at the time flush_work() is called resulting in a classic ABBA locking problem. Add a lockdep hint by acquiring and releasing the work item lockdep_map in flush_work() so that we always catch this potential deadlock scenario. Signed-off-by: Stephen Boyd <sboyd@codeaurora.org> Reviewed-by: Yong Zhang <yong.zhang0@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org> lockdep: fix oops in processing workqueue Under memory load, on x86_64, with lockdep enabled, the workqueue's process_one_work() has been seen to oops in __lock_acquire(), barfing on a 0xffffffff00000000 pointer in the lockdep_map's class_cache[]. Because it's permissible to free a work_struct from its callout function, the map used is an onstack copy of the map given in the work_struct: and that copy is made without any locking. Surprisingly, gcc (4.5.1 in Hugh's case) uses "rep movsl" rather than "rep movsq" for that structure copy: which might race with a workqueue user's wait_on_work() doing lock_map_acquire() on the source of the copy, putting a pointer into the class_cache[], but only in time for the top half of that pointer to be copied to the destination map. Boom when process_one_work() subsequently does lock_map_acquire() on its onstack copy of the lockdep_map. Fix this, and a similar instance in call_timer_fn(), with a lockdep_copy_map() function which additionally NULLs the class_cache[]. Note: this oops was actually seen on 3.4-next, where flush_work() newly does the racing lock_map_acquire(); but Tejun points out that 3.4 and earlier are already vulnerable to the same through wait_on_work(). * Patch orginally from Peter. Hugh modified it a bit and wrote the description. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Reported-by: Hugh Dickins <hughd@google.com> LKML-Reference: <alpine.LSU.2.00.1205070951170.1544@eggly.anvils> Signed-off-by: Tejun Heo <tj@kernel.org> workqueue: perform cpu down operations from low priority cpu_notifier() Currently, all workqueue cpu hotplug operations run off CPU_PRI_WORKQUEUE which is higher than normal notifiers. This is to ensure that workqueue is up and running while bringing up a CPU before other notifiers try to use workqueue on the CPU. Per-cpu workqueues are supposed to remain working and bound to the CPU for normal CPU_DOWN_PREPARE notifiers. This holds mostly true even with workqueue offlining running with higher priority because workqueue CPU_DOWN_PREPARE only creates a bound trustee thread which runs the per-cpu workqueue without concurrency management without explicitly detaching the existing workers. However, if the trustee needs to create new workers, it creates unbound workers which may wander off to other CPUs while CPU_DOWN_PREPARE notifiers are in progress. Furthermore, if the CPU down is cancelled, the per-CPU workqueue may end up with workers which aren't bound to the CPU. While reliably reproducible with a convoluted artificial test-case involving scheduling and flushing CPU burning work items from CPU down notifiers, this isn't very likely to happen in the wild, and, even when it happens, the effects are likely to be hidden by the following successful CPU down. Fix it by using different priorities for up and down notifiers - high priority for up operations and low priority for down operations. Workqueue cpu hotplug operations will soon go through further cleanup. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: stable@vger.kernel.org Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl> workqueue: drop CPU_DYING notifier operation Workqueue used CPU_DYING notification to mark GCWQ_DISASSOCIATED. This was necessary because workqueue's CPU_DOWN_PREPARE happened before other DOWN_PREPARE notifiers and workqueue needed to stay associated across the rest of DOWN_PREPARE. After the previous patch, workqueue's DOWN_PREPARE happens after others and can set GCWQ_DISASSOCIATED directly. Drop CPU_DYING and let the trustee set GCWQ_DISASSOCIATED after disabling concurrency management. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl> workqueue: ROGUE workers are UNBOUND workers Currently, WORKER_UNBOUND is used to mark workers for the unbound global_cwq and WORKER_ROGUE is used to mark workers for disassociated per-cpu global_cwqs. Both are used to make the marked worker skip concurrency management and the only place they make any difference is in worker_enter_idle() where WORKER_ROGUE is used to skip scheduling idle timer, which can easily be replaced with trustee state testing. This patch replaces WORKER_ROGUE with WORKER_UNBOUND and drops WORKER_ROGUE. This is to prepare for removing trustee and handling disassociated global_cwqs as unbound. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl> workqueue: use mutex for global_cwq manager exclusion POOL_MANAGING_WORKERS is used to ensure that at most one worker takes the manager role at any given time on a given global_cwq. Trustee later hitched on it to assume manager adding blocking wait for the bit. As trustee already needed a custom wait mechanism, waiting for MANAGING_WORKERS was rolled into the same mechanism. Trustee is scheduled to be removed. This patch separates out MANAGING_WORKERS wait into per-pool mutex. Workers use mutex_trylock() to test for manager role and trustee uses mutex_lock() to claim manager roles. gcwq_claim/release_management() helpers are added to grab and release manager roles of all pools on a global_cwq. gcwq_claim_management() always grabs pool manager mutexes in ascending pool index order and uses pool index as lockdep subclass. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl> workqueue: drop @bind from create_worker() Currently, create_worker()'s callers are responsible for deciding whether the newly created worker should be bound to the associated CPU and create_worker() sets WORKER_UNBOUND only for the workers for the unbound global_cwq. Creation during normal operation is always via maybe_create_worker() and @bind is true. For workers created during hotplug, @bind is false. Normal operation path is planned to be used even while the CPU is going through hotplug operations or offline and this static decision won't work. Drop @bind from create_worker() and decide whether to bind by looking at GCWQ_DISASSOCIATED. create_worker() will also set WORKER_UNBOUND autmatically if disassociated. To avoid flipping GCWQ_DISASSOCIATED while create_worker() is in progress, the flag is now allowed to be changed only while holding all manager_mutexes on the global_cwq. This requires that GCWQ_DISASSOCIATED is not cleared behind trustee's back. CPU_ONLINE no longer clears DISASSOCIATED before flushing trustee, which clears DISASSOCIATED before rebinding remaining workers if asked to release. For cases where trustee isn't around, CPU_ONLINE clears DISASSOCIATED after flushing trustee. Also, now, first_idle has UNBOUND set on creation which is explicitly cleared by CPU_ONLINE while binding it. These convolutions will soon be removed by further simplification of CPU hotplug path. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl> workqueue: reimplement CPU online rebinding to handle idle workers Currently, if there are left workers when a CPU is being brough back online, the trustee kills all idle workers and scheduled rebind_work so that they re-bind to the CPU after the currently executing work is finished. This works for busy workers because concurrency management doesn't try to wake up them from scheduler callbacks, which require the target task to be on the local run queue. The busy worker bumps concurrency counter appropriately as it clears WORKER_UNBOUND from the rebind work item and it's bound to the CPU before returning to the idle state. To reduce CPU on/offlining overhead (as many embedded systems use it for powersaving) and simplify the code path, workqueue is planned to be modified to retain idle workers across CPU on/offlining. This patch reimplements CPU online rebinding such that it can also handle idle workers. As noted earlier, due to the local wakeup requirement, rebinding idle workers is tricky. All idle workers must be re-bound before scheduler callbacks are enabled. This is achieved by interlocking idle re-binding. Idle workers are requested to re-bind and then hold until all idle re-binding is complete so that no bound worker starts executing work item. Only after all idle workers are re-bound and parked, CPU_ONLINE proceeds to release them and queue rebind work item to busy workers thus guaranteeing scheduler callbacks aren't invoked until all idle workers are ready. worker_rebind_fn() is renamed to busy_worker_rebind_fn() and idle_worker_rebind() for idle workers is added. Rebinding logic is moved to rebind_workers() and now called from CPU_ONLINE after flushing trustee. While at it, add CPU sanity check in worker_thread(). Note that now a worker may become idle or the manager between trustee release and rebinding during CPU_ONLINE. As the previous patch updated create_worker() so that it can be used by regular manager while unbound and this patch implements idle re-binding, this is safe. This prepares for removal of trustee and keeping idle workers across CPU hotplugs. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl> workqueue: don't butcher idle workers on an offline CPU Currently, during CPU offlining, after all pending work items are drained, the trustee butchers all workers. Also, on CPU onlining failure, workqueue_cpu_callback() ensures that the first idle worker is destroyed. Combined, these guarantee that an offline CPU doesn't have any worker for it once all the lingering work items are finished. This guarantee isn't really necessary and makes CPU on/offlining more expensive than needs to be, especially for platforms which use CPU hotplug for powersaving. This patch lets offline CPUs removes idle worker butchering from the trustee and let a CPU which failed onlining keep the created first worker. The first worker is created if the CPU doesn't have any during CPU_DOWN_PREPARE and started right away. If onlining succeeds, the rebind_workers() call in CPU_ONLINE will rebind it like any other workers. If onlining fails, the worker is left alone till the next try. This makes CPU hotplugs cheaper by allowing global_cwqs to keep workers across them and simplifies code. Note that trustee doesn't re-arm idle timer when it's done and thus the disassociated global_cwq will keep all workers until it comes back online. This will be improved by further patches. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl> workqueue: remove CPU offline trustee With the previous changes, a disassociated global_cwq now can run as an unbound one on its own - it can create workers as necessary to drain remaining works after the CPU has been brought down and manage the number of workers using the usual idle timer mechanism making trustee completely redundant except for the actual unbinding operation. This patch removes the trustee and let a disassociated global_cwq manage itself. Unbinding is moved to a work item (for CPU affinity) which is scheduled and flushed from CPU_DONW_PREPARE. This patch moves nr_running clearing outside gcwq and manager locks to simplify the code. As nr_running is unused at the point, this is safe. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl> workqueue: simplify CPU hotplug code With trustee gone, CPU hotplug code can be simplified. * gcwq_claim/release_management() now grab and release gcwq lock too respectively and gained _and_lock and _and_unlock postfixes. * All CPU hotplug logic was implemented in workqueue_cpu_callback() which was called by workqueue_cpu_up/down_callback() for the correct priority. This was because up and down paths shared a lot of logic, which is no longer true. Remove workqueue_cpu_callback() and move all hotplug logic into the two actual callbacks. This patch doesn't make any functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl> workqueue: fix spurious CPU locality WARN from process_one_work() 25511a4776 "workqueue: reimplement CPU online rebinding to handle idle workers" added CPU locality sanity check in process_one_work(). It triggers if a worker is executing on a different CPU without UNBOUND or REBIND set. This works for all normal workers but rescuers can trigger this spuriously when they're serving the unbound or a disassociated global_cwq - rescuers don't have either flag set and thus its gcwq->cpu can be a different value including %WORK_CPU_UNBOUND. Fix it by additionally testing %GCWQ_DISASSOCIATED. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> LKML-Refence: <20120721213656.GA7783@linux.vnet.ibm.com> workqueue: reorder queueing functions so that _on() variants are on top Currently, queue/schedule[_delayed]_work_on() are located below the counterpart without the _on postifx even though the latter is usually implemented using the former. Swap them. This is cleanup and doesn't cause any functional difference. Signed-off-by: Tejun Heo <tj@kernel.org> workqueue: make queueing functions return bool All queueing functions return 1 on success, 0 if the work item was already pending. Update them to return bool instead. This signifies better that they don't return 0 / -errno. This is cleanup and doesn't cause any functional difference. While at it, fix comment opening for schedule_work_on(). Signed-off-by: Tejun Heo <tj@kernel.org> workqueue: add missing smp_wmb() in process_one_work() WORK_STRUCT_PENDING is used to claim ownership of a work item and process_one_work() releases it before starting execution. When someone else grabs PENDING, all pre-release updates to the work item should be visible and all updates made by the new owner should happen afterwards. Grabbing PENDING uses test_and_set_bit() and thus has a full barrier; however, clearing doesn't have a matching wmb. Given the preceding spin_unlock and use of clear_bit, I don't believe this can be a problem on an actual machine and there hasn't been any related report but it still is theretically possible for clear_pending to permeate upwards and happen before work->entry update. Add an explicit smp_wmb() before work_clear_pending(). Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: stable@vger.kernel.org workqueue: disable irq while manipulating PENDING Queueing operations use WORK_STRUCT_PENDING_BIT to synchronize access to the target work item. They first try to claim the bit and proceed with queueing only after that succeeds and there's a window between PENDING being set and the actual queueing where the task can be interrupted or preempted. There's also a similar window in process_one_work() when clearing PENDING. A work item is dequeued, gcwq->lock is released and then PENDING is cleared and the worker might get interrupted or preempted between releasing gcwq->lock and clearing PENDING. cancel[_delayed]_work_sync() tries to claim or steal PENDING. The function assumes that a work item with PENDING is either queued or in the process of being [de]queued. In the latter case, it busy-loops until either the work item loses PENDING or is queued. If canceling coincides with the above described interrupts or preemptions, the canceling task will busy-loop while the queueing or executing task is preempted. This patch keeps irq disabled across claiming PENDING and actual queueing and moves PENDING clearing in process_one_work() inside gcwq->lock so that busy looping from PENDING && !queued doesn't wait for interrupted/preempted tasks. Note that, in process_one_work(), setting last CPU and clearing PENDING got merged into single operation. This removes possible long busy-loops and will allow using try_to_grab_pending() from bh and irq contexts. v2: __queue_work() was testing preempt_count() to ensure that the caller has disabled preemption. This triggers spuriously if !CONFIG_PREEMPT_COUNT. Use preemptible() instead. Reported by Fengguang Wu. v3: Disable irq instead of preemption. IRQ will be disabled while grabbing gcwq->lock later anyway and this allows using try_to_grab_pending() from bh and irq contexts. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Fengguang Wu <fengguang.wu@intel.com> workqueue: set delayed_work->timer function on initialization delayed_work->timer.function is currently initialized during queue_delayed_work_on(). Export delayed_work_timer_fn() and set delayed_work timer function during delayed_work initialization together with other fields. This ensures the timer function is always valid on an initialized delayed_work. This is to help mod_delayed_work() implementation. To detect delayed_work users which diddle with the internal timer, trigger WARN if timer function doesn't match on queue. Signed-off-by: Tejun Heo <tj@kernel.org> workqueue: unify local CPU queueing handling Queueing functions have been using different methods to determine the local CPU. * queue_work() superflously uses get/put_cpu() to acquire and hold the local CPU across queue_work_on(). * delayed_work_timer_fn() uses smp_processor_id(). * queue_delayed_work() calls queue_delayed_work_on() with -1 @cpu which is interpreted as the local CPU. * flush_delayed_work[_sync]() were using raw_smp_processor_id(). * __queue_work() interprets %WORK_CPU_UNBOUND as local CPU if the target workqueue is bound one but nobody uses this. This patch converts all functions to uniformly use %WORK_CPU_UNBOUND to indicate local CPU and use the local binding feature of __queue_work(). unlikely() is dropped from %WORK_CPU_UNBOUND handling in __queue_work(). Signed-off-by: Tejun Heo <tj@kernel.org> workqueue: fix zero @delay handling of queue_delayed_work_on() If @delay is zero and the dealyed_work is idle, queue_delayed_work() queues it for immediate execution; however, queue_delayed_work_on() lacks this logic and always goes through timer regardless of @delay. This patch moves 0 @delay handling logic from queue_delayed_work() to queue_delayed_work_on() so that both functions behave the same. Signed-off-by: Tejun Heo <tj@kernel.org> workqueue: move try_to_grab_pending() upwards try_to_grab_pending() will be used by to-be-implemented mod_delayed_work[_on](). Move try_to_grab_pending() and related functions above queueing functions. This patch only moves functions around. Signed-off-by: Tejun Heo <tj@kernel.org> workqueue: introduce WORK_OFFQ_FLAG_* Low WORK_STRUCT_FLAG_BITS bits of work_struct->data contain WORK_STRUCT_FLAG_* and flush color. If the work item is queued, the rest point to the cpu_workqueue with WORK_STRUCT_CWQ set; otherwise, WORK_STRUCT_CWQ is clear and the bits contain the last CPU number - either a real CPU number or one of WORK_CPU_*. Scheduled addition of mod_delayed_work[_on]() requires an additional flag, which is used only while a work item is off queue. There are more than enough bits to represent off-queue CPU number on both 32 and 64bits. This patch introduces WORK_OFFQ_FLAG_* which occupy the lower part of the @work->data high bits while off queue. This patch doesn't define any actual OFFQ flag yet. Off-queue CPU number is now shifted by WORK_OFFQ_CPU_SHIFT, which adds the number of bits used by OFFQ flags to WORK_STRUCT_FLAG_SHIFT, to make room for OFFQ flags. To avoid shift width warning with large WORK_OFFQ_FLAG_BITS, ulong cast is added to WORK_STRUCT_NO_CPU and, just in case, BUILD_BUG_ON() to check that there are enough bits to accomodate off-queue CPU number is added. This patch doesn't make any functional difference. Signed-off-by: Tejun Heo <tj@kernel.org> workqueue: factor out __queue_delayed_work() from queue_delayed_work_on() This is to prepare for mod_delayed_work[_on]() and doesn't cause any functional difference. Signed-off-by: Tejun Heo <tj@kernel.org> workqueue: reorganize try_to_grab_pending() and __cancel_timer_work() * Use bool @is_dwork instead of @timer and let try_to_grab_pending() use to_delayed_work() to determine the delayed_work address. * Move timer handling from __cancel_work_timer() to try_to_grab_pending(). * Make try_to_grab_pending() use -EAGAIN instead of -1 for busy-looping and drop the ret local variable. * Add proper function comment to try_to_grab_pending(). This makes the code a bit easier to understand and will ease further changes. This patch doesn't make any functional change. v2: Use @is_dwork instead of @timer. Signed-off-by: Tejun Heo <tj@kernel.org> workqueue: mark a work item being canceled as such There can be two reasons try_to_grab_pending() can fail with -EAGAIN. One is when someone else is queueing or deqeueing the work item. With the previous patches, it is guaranteed that PENDING and queued state will soon agree making it safe to busy-retry in this case. The other is if multiple __cancel_work_timer() invocations are racing one another. __cancel_work_timer() grabs PENDING and then waits for running instances of the target work item on all CPUs while holding PENDING and !queued. try_to_grab_pending() invoked from another task will keep returning -EAGAIN while the current owner is waiting. Not distinguishing the two cases is okay because __cancel_work_timer() is the only user of try_to_grab_pending() and it invokes wait_on_work() whenever grabbing fails. For the first case, busy looping should be fine but wait_on_work() doesn't cause any critical problem. For the latter case, the new contender usually waits for the same condition as the current owner, so no unnecessarily extended busy-looping happens. Combined, these make __cancel_work_timer() technically correct even without irq protection while grabbing PENDING or distinguishing the two different cases. While the current code is technically correct, not distinguishing the two cases makes it difficult to use try_to_grab_pending() for other purposes than canceling because it's impossible to tell whether it's safe to busy-retry grabbing. This patch adds a mechanism to mark a work item being canceled. try_to_grab_pending() now disables irq on success and returns -EAGAIN to indicate that grabbing failed but PENDING and queued states are gonna agree soon and it's safe to busy-loop. It returns -ENOENT if the work item is being canceled and it may stay PENDING && !queued for arbitrary amount of time. __cancel_work_timer() is modified to mark the work canceling with WORK_OFFQ_CANCELING after grabbing PENDING, thus making try_to_grab_pending() fail with -ENOENT instead of -EAGAIN. Also, it invokes wait_on_work() iff grabbing failed with -ENOENT. This isn't necessary for correctness but makes it consistent with other future users of try_to_grab_pending(). v2: try_to_grab_pending() was testing preempt_count() to ensure that the caller has disabled preemption. This triggers spuriously if !CONFIG_PREEMPT_COUNT. Use preemptible() instead. Reported by Fengguang Wu. v3: Updated so that try_to_grab_pending() disables irq on success rather than requiring preemption disabled by the caller. This makes busy-looping easier and will allow try_to_grap_pending() to be used from bh/irq contexts. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Fengguang Wu <fengguang.wu@intel.com> workqueue: implement mod_delayed_work[_on]() Workqueue was lacking a mechanism to modify the timeout of an already pending delayed_work. delayed_work users have been working around this using several methods - using an explicit timer + work item, messing directly with delayed_work->timer, and canceling before re-queueing, all of which are error-prone and/or ugly. This patch implements mod_delayed_work[_on]() which behaves similarly to mod_timer() - if the delayed_work is idle, it's queued with the given delay; otherwise, its timeout is modified to the new value. Zero @delay guarantees immediate execution. v2: Updated to reflect try_to_grab_pending() changes. Now safe to be called from bh context. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> workqueue: fix CPU binding of flush_delayed_work[_sync]() delayed_work encodes the workqueue to use and the last CPU in delayed_work->work.data while it's on timer. The target CPU is implicitly recorded as the CPU the timer is queued on and delayed_work_timer_fn() queues delayed_work->work to the CPU it is running on. Unfortunately, this leaves flush_delayed_work[_sync]() no way to find out which CPU the delayed_work was queued for when they try to re-queue after killing the timer. Currently, it chooses the local CPU flush is running on. This can unexpectedly move a delayed_work queued on a specific CPU to another CPU and lead to subtle errors. There isn't much point in trying to save several bytes in struct delayed_work, which is already close to a hundred bytes on 64bit with all debug options turned off. This patch adds delayed_work->cpu to remember the CPU it's queued for. Note that if the timer is migrated during CPU down, the work item could be queued to the downed global_cwq after this change. As a detached global_cwq behaves like an unbound one, this doesn't change much for the delayed_work. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> workqueue: add missing wmb() in clear_work_data() Any operation which clears PENDING should be preceded by a wmb to guarantee that the next PENDING owner sees all the changes made before PENDING release. There are only two places where PENDING is cleared - set_work_cpu_and_clear_pending() and clear_work_data(). The caller of the former already does smp_wmb() but the latter doesn't have any. Move the wmb above set_work_cpu_and_clear_pending() into it and add one to clear_work_data(). There hasn't been any report related to this issue, and, given how clear_work_data() is used, it is extremely unlikely to have caused any actual problems on any architecture. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> workqueue: use enum value to set array size of pools in gcwq Commit 3270476a6c0ce322354df8679652f060d66526dc ('workqueue: reimplement WQ_HIGHPRI using a separate worker_pool') introduce separate worker_pool for HIGHPRI. Although there is NR_WORKER_POOLS enum value which represent size of pools, definition of worker_pool in gcwq doesn't use it. Using it makes code robust and prevent future mistakes. So change code to use this enum value. Signed-off-by: Joonsoo Kim <js1304@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org> workqueue: correct req_cpu in trace_workqueue_queue_work() When we do tracing workqueue_queue_work(), it records requested cpu. But, if !(@wq->flag & WQ_UNBOUND) and @cpu is WORK_CPU_UNBOUND, requested cpu is changed as local cpu. In case of @wq->flag & WQ_UNBOUND, above change is not occured, therefore it is reasonable to correct it. Use temporary local variable for storing requested cpu. Signed-off-by: Joonsoo Kim <js1304@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org> workqueue: change value of lcpu in __queue_delayed_work_on() We assign cpu id into work struct's data field in __queue_delayed_work_on(). In current implementation, when work is come in first time, current running cpu id is assigned. If we do __queue_delayed_work_on() with CPU A on CPU B, __queue_work() invoked in delayed_work_timer_fn() go into the following sub-optimal path in case of WQ_NON_REENTRANT. gcwq = get_gcwq(cpu); if (wq->flags & WQ_NON_REENTRANT && (last_gcwq = get_work_gcwq(work)) && last_gcwq != gcwq) { Change lcpu to @cpu and rechange lcpu to local cpu if lcpu is WORK_CPU_UNBOUND. It is sufficient to prevent to go into sub-optimal path. tj: Slightly rephrased the comment. Signed-off-by: Joonsoo Kim <js1304@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org> workqueue: introduce system_highpri_wq Commit 3270476a6c0ce322354df8679652f060d66526dc ('workqueue: reimplement WQ_HIGHPRI using a separate worker_pool') introduce separate worker pool for HIGHPRI. When we handle busyworkers for gcwq, it can be normal worker or highpri worker. But, we don't consider this difference in rebind_workers(), we use just system_wq for highpri worker. It makes mismatch between cwq->pool and worker->pool. It doesn't make error in current implementation, but possible in the future. Now, we introduce system_highpri_wq to use proper cwq for highpri workers in rebind_workers(). Following patch fix this issue properly. tj: Even apart from rebinding, having system_highpri_wq generally makes sense. Signed-off-by: Joonsoo Kim <js1304@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org> workqueue: use system_highpri_wq for highpri workers in rebind_workers() In rebind_workers(), we do inserting a work to rebind to cpu for busy workers. Currently, in this case, we use only system_wq. This makes a possible error situation as there is mismatch between cwq->pool and worker->pool. To prevent this, we should use system_highpri_wq for highpri worker to match theses. This implements it. tj: Rephrased comment a bit. Signed-off-by: Joonsoo Kim <js1304@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org> workqueue: use system_highpri_wq for unbind_work To speed cpu down processing up, use system_highpri_wq. As scheduling priority of workers on it is higher than system_wq and it is not contended by other normal works on this cpu, work on it is processed faster than system_wq. tj: CPU up/downs care quite a bit about latency these days. This shouldn't hurt anything and makes sense. Signed-off-by: Joonsoo Kim <js1304@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org> workqueue: fix checkpatch issues Fixed some checkpatch warnings. tj: adapted to wq/for-3.7 and massaged pr_xxx() format strings a bit. Signed-off-by: Valentin Ilie <valentin.ilie@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org> LKML-Reference: <1345326762-21747-1-git-send-email-valentin.ilie@gmail.com> workqueue: make all workqueues non-reentrant By default, each per-cpu part of a bound workqueue operates separately and a work item may be executing concurrently on different CPUs. The behavior avoids some cross-cpu traffic but leads to subtle weirdities and not-so-subtle contortions in the API. * There's no sane usefulness in allowing a single work item to be executed concurrently on multiple CPUs. People just get the behavior unintentionally and get surprised after learning about it. Most either explicitly synchronize or use non-reentrant/ordered workqueue but this is error-prone. * flush_work() can't wait for multiple instances of the same work item on different CPUs. If a work item is executing on cpu0 and then queued on cpu1, flush_work() can only wait for the one on cpu1. Unfortunately, work items can easily cross CPU boundaries unintentionally when the queueing thread gets migrated. This means that if multiple queuers compete, flush_work() can't even guarantee that the instance queued right before it is finished before returning. * flush_work_sync() was added to work around some of the deficiencies of flush_work(). In addition to the usual flushing, it ensures that all currently executing instances are finished before returning. This operation is expensive as it has to walk all CPUs and at the same time fails to address competing queuer case. Incorrectly using flush_work() when flush_work_sync() is necessary is an easy error to make and can lead to bugs which are difficult to reproduce. * Similar problems exist for flush_delayed_work[_sync](). Other than the cross-cpu access concern, there's no benefit in allowing parallel execution and it's plain silly to have this level of contortion for workqueue which is widely used from core code to extremely obscure drivers. This patch makes all workqueues non-reentrant. If a work item is executing on a different CPU when queueing is requested, it is always queued to that CPU. This guarantees that any given work item can be executing on one CPU at maximum and if a work item is queued and executing, both are on the same CPU. The only behavior change which may affect workqueue users negatively is that non-reentrancy overrides the affinity specified by queue_work_on(). On a reentrant workqueue, the affinity specified by queue_work_on() is always followed. Now, if the work item is executing on one of the CPUs, the work item will be queued there regardless of the requested affinity. I've reviewed all workqueue users which request explicit affinity, and, fortunately, none seems to be crazy enough to exploit parallel execution of the same work item. This adds an additional busy_hash lookup if the work item was previously queued on a different CPU. This shouldn't be noticeable under any sane workload. Work item queueing isn't a very high-frequency operation and they don't jump across CPUs all the time. In a micro benchmark to exaggerate this difference - measuring the time it takes for two work items to repeatedly jump between two CPUs a number (10M) of times with busy_hash table densely populated, the difference was around 3%. While the overhead is measureable, it is only visible in pathological cases and the difference isn't huge. This change brings much needed sanity to workqueue and makes its behavior consistent with timer. I think this is the right tradeoff to make. This enables significant simplification of workqueue API. Simplification patches will follow. Signed-off-by: Tejun Heo <tj@kernel.org> workqueue: gut flush[_delayed]_work_sync() Now that all workqueues are non-reentrant, flush[_delayed]_work_sync() are equivalent to flush[_delayed]_work(). Drop the separate implementation and make them thin wrappers around flush[_delayed]_work(). * start_flush_work() no longer takes @wait_executing as the only left user - flush_work() - always sets it to %true. * __cancel_work_timer() uses flush_work() instead of wait_on_work(). Signed-off-by: Tejun Heo <tj@kernel.org> workqueue: gut system_nrt[_freezable]_wq() Now that all workqueues are non-reentrant, system[_freezable]_wq() are equivalent to system_nrt[_freezable]_wq(). Replace the latter with wrappers around system[_freezable]_wq(). The wrapping goes through inline functions so that __deprecated can be added easily. Signed-off-by: Tejun Heo <tj@kernel.org> workqueue: cosmetic whitespace updates for macro definitions Consistently use the last tab position for '\' line continuation in complex macro definitions. This is to help the following patches. This patch is cosmetic. Signed-off-by: Tejun Heo <tj@kernel.org> workqueue: use hotcpu_notifier() for workqueue_cpu_down_callback() workqueue_cpu_down_callback() is used only if HOTPLUG_CPU=y, so hotcpu_notifier() fits better than cpu_notifier(). When HOTPLUG_CPU=y, hotcpu_notifier() and cpu_notifier() are the same. When HOTPLUG_CPU=n, if we use cpu_notifier(), workqueue_cpu_down_callback() will be called during boot to do nothing, and the memory of workqueue_cpu_down_callback() and gcwq_unbind_fn() will be discarded after boot. If we use hotcpu_notifier(), we can avoid the no-op call of workqueue_cpu_down_callback() and the memory of workqueue_cpu_down_callback() and gcwq_unbind_fn() will be discard at build time: $ ls -l kernel/workqueue.o.cpu_notifier kernel/workqueue.o.hotcpu_notifier -rw-rw-r-- 1 laijs laijs 484080 Sep 15 11:31 kernel/workqueue.o.cpu_notifier -rw-rw-r-- 1 laijs laijs 478240 Sep 15 11:31 kernel/workqueue.o.hotcpu_notifier $ size kernel/workqueue.o.cpu_notifier kernel/workqueue.o.hotcpu_notifier text data bss dec hex filename 18513 2387 1221 22121 5669 kernel/workqueue.o.cpu_notifier 18082 2355 1221 21658 549a kernel/workqueue.o.hotcpu_notifier tj: Updated description. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org> workqueue: reimplement cancel_delayed_work() using try_to_grab_pending() cancel_delayed_work() can't be called from IRQ handlers due to its use of del_timer_sync() and can't cancel work items which are already transferred from timer to worklist. Also, unlike other flush and cancel functions, a canceled delayed_work would still point to the last associated cpu_workqueue. If the workqueue is destroyed afterwards and the work item is re-used on a different workqueue, the queueing code can oops trying to dereference already freed cpu_workqueue. This patch reimplements cancel_delayed_work() using try_to_grab_pending() and set_work_cpu_and_clear_pending(). This allows the function to be called from IRQ handlers and makes its behavior consistent with other flush / cancel functions. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> workqueue: UNBOUND -> REBIND morphing in rebind_workers() should be atomic The compiler may compile the following code into TWO write/modify instructions. worker->flags &= ~WORKER_UNBOUND; worker->flags |= WORKER_REBIND; so the other CPU may temporarily see worker->flags which doesn't have either WORKER_UNBOUND or WORKER_REBIND set and perform local wakeup prematurely. Fix it by using single explicit assignment via ACCESS_ONCE(). Because idle workers have another WORKER_NOT_RUNNING flag, this bug doesn't exist for them; however, update it to use the same pattern for consistency. tj: Applied the change to idle workers too and updated comments and patch description a bit. Change-Id: I9b95f51d146c40c31ba028668d6f412bd74c6026 Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: stable@vger.kernel.org workqueue: move WORKER_REBIND clearing in rebind_workers() to the end of the function This doesn't make any functional difference and is purely to help the next patch to be simpler. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> workqueue: fix possible deadlock in idle worker rebinding Currently, rebind_workers() and idle_worker_rebind() are two-way interlocked. rebind_workers() waits for idle workers to finish rebinding and rebound idle workers wait for rebind_workers() to finish rebinding busy workers before proceeding. Unfortunately, this isn't enough. The second wait from idle workers is implemented as follows. wait_event(gcwq->rebind_hold, !(worker->flags & WORKER_REBIND)); rebind_workers() clears WORKER_REBIND, wakes up the idle workers and then returns. If CPU hotplug cycle happens again before one of the idle workers finishes the above wait_event(), rebind_workers() will repeat the first part of the handshake - set WORKER_REBIND again and wait for the idle worker to finish rebinding - and this leads to deadlock because the idle worker would be waiting for WORKER_REBIND to clear. This is fixed by adding another interlocking step at the end - rebind_workers() now waits for all the idle workers to finish the above WORKER_REBIND wait before returning. This ensures that all rebinding steps are complete on all idle workers before the next hotplug cycle can happen. This problem was diagnosed by Lai Jiangshan who also posted a patch to fix the issue, upon which this patch is based. This is the minimal fix and further patches are scheduled for the next merge window to simplify the CPU hotplug path. Signed-off-by: Tejun Heo <tj@kernel.org> Original-patch-by: Lai Jiangshan <laijs@cn.fujitsu.com> LKML-Reference: <1346516916-1991-3-git-send-email-laijs@cn.fujitsu.com> workqueue: restore POOL_MANAGING_WORKERS This patch restores POOL_MANAGING_WORKERS which was replaced by pool->manager_mutex by 6037315269 "workqueue: use mutex for global_cwq manager exclusion". There's a subtle idle worker depletion bug across CPU hotplug events and we need to distinguish an actual manager and CPU hotplug preventing management. POOL_MANAGING_WORKERS will be used for the former and manager_mutex the later. This patch just lays POOL_MANAGING_WORKERS on top of the existing manager_mutex and doesn't introduce any synchronization changes. The next patch will update it. Note that this patch fixes a non-critical anomaly where too_many_workers() may return %true spuriously while CPU hotplug is in progress. While the issue could schedule idle timer spuriously, it didn't trigger any actual misbehavior. tj: Rewrote patch description. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org> workqueue: fix possible idle worker depletion across CPU hotplug To simplify both normal and CPU hotplug paths, worker management is prevented while CPU hoplug is in progress. This is achieved by CPU hotplug holding the same exclusion mechanism used by workers to ensure there's only one manager per pool. If someone else seems to be performing the manager role, workers proceed to execute work items. CPU hotplug using the same mechanism can lead to idle worker depletion because all workers could proceed to execute work items while CPU hotplug is in progress and CPU hotplug itself wouldn't actually perform the worker management duty - it doesn't guarantee that there's an idle worker left when it releases management. This idle worker depletion, under extreme circumstances, can break forward-progress guarantee and thus lead to deadlock. This patch fixes the bug by using separate mechanisms for manager exclusion among workers and hotplug exclusion. For manager exclusion, POOL_MANAGING_WORKERS which was restored by the previous patch is used. pool->manager_mutex is now only used for exclusion between the elected manager and CPU hotplug. The elected manager won't proceed without holding pool->manager_mutex. This ensures that the worker which won the manager position can't skip managing while CPU hotplug is in progress. It will block on manager_mutex and perform management after CPU hotplug is complete. Note that hotplug may happen while waiting for manager_mutex. A manager isn't either on idle or busy list and thus the hoplug code can't unbind/rebind it. Make the manager handle its own un/rebinding. tj: Updated comment and description. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org> workqueue: always clear WORKER_REBIND in busy_worker_rebind_fn() busy_worker_rebind_fn() didn't clear WORKER_REBIND if rebinding failed (CPU is down again). This used to be okay because the flag wasn't used for anything else. However, after 25511a477 "workqueue: reimplement CPU online rebinding to handle idle workers", WORKER_REBIND is also used to command idle workers to rebind. If not cleared, the worker may confuse the next CPU_UP cycle by having REBIND spuriously set or oops / get stuck by prematurely calling idle_worker_rebind(). WARNING: at /work/os/wq/kernel/workqueue.c:1323 worker_thread+0x4cd/0x5 00() Hardware name: Bochs Modules linked in: test_wq(O-) Pid: 33, comm: kworker/1:1 Tainted: G O 3.6.0-rc1-work+ #3 Call Trace: [<ffffffff8109039f>] warn_slowpath_common+0x7f/0xc0 [<ffffffff810903fa>] warn_slowpath_null+0x1a/0x20 [<ffffffff810b3f1d>] worker_thread+0x4cd/0x500 [<ffffffff810bc16e>] kthread+0xbe/0xd0 [<ffffffff81bd2664>] kernel_thread_helper+0x4/0x10 ---[ end trace e977cf20f4661968 ]--- BUG: unable to handle kernel NULL pointer dereference at (null) IP: [<ffffffff810b3db0>] worker_thread+0x360/0x500 PGD 0 Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC Modules linked in: test_wq(O-) CPU 0 Pid: 33, comm: kworker/1:1 Tainted: G W O 3.6.0-rc1-work+ #3 Bochs Bochs RIP: 0010:[<ffffffff810b3db0>] [<ffffffff810b3db0>] worker_thread+0x360/0x500 RSP: 0018:ffff88001e1c9de0 EFLAGS: 00010086 RAX: 0000000000000000 RBX: ffff88001e633e00 RCX: 0000000000004140 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000009 RBP: ffff88001e1c9ea0 R08: 0000000000000000 R09: 0000000000000001 R10: 0000000000000002 R11: 0000000000000000 R12: ffff88001fc8d580 R13: ffff88001fc8d590 R14: ffff88001e633e20 R15: ffff88001e1c6900 FS: 0000000000000000(0000) GS:ffff88001fc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000000000000000 CR3: 00000000130e8000 CR4: 00000000000006f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process kworker/1:1 (pid: 33, threadinfo ffff88001e1c8000, task ffff88001e1c6900) Stack: ffff880000000000 ffff88001e1c9e40 0000000000000001 ffff88001e1c8010 ffff88001e519c78 ffff88001e1c9e58 ffff88001e1c6900 ffff88001e1c6900 ffff88001e1c6900 ffff88001e1c6900 ffff88001fc8d340 ffff88001fc8d340 Call Trace: [<ffffffff810bc16e>] kthread+0xbe/0xd0 [<ffffffff81bd2664>] kernel_thread_helper+0x4/0x10 Code: b1 00 f6 43 48 02 0f 85 91 01 00 00 48 8b 43 38 48 89 df 48 8b 00 48 89 45 90 e8 ac f0 ff ff 3c 01 0f 85 60 01 00 00 48 8b 53 50 <8b> 02 83 e8 01 85 c0 89 02 0f 84 3b 01 00 00 48 8b 43 38 48 8b RIP [<ffffffff810b3db0>] worker_thread+0x360/0x500 RSP <ffff88001e1c9de0> CR2: 0000000000000000 There was no reason to keep WORKER_REBIND on failure in the first place - WORKER_UNBOUND is guaranteed to be set in such cases preventing incorrectly activating concurrency management. Always clear WORKER_REBIND. tj: Updated comment and description. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org> workqueue: reimplement idle worker rebinding Currently rebind_workers() uses rebinds idle workers synchronously before proceeding to requesting busy workers to rebind. This is necessary because all workers on @worker_pool->idle_list must be bound before concurrency management local wake-ups from the busy workers take place. Unfortunately, the synchronous idle rebinding is quite complicated. This patch reimplements idle rebinding to simplify the code path. Rather than trying to make all idle workers bound before rebinding busy workers, we simply remove all to-be-bound idle workers from the idle list and let them add themselves back after completing rebinding (successful or not). As only workers which finished rebinding can on on the idle worker list, the idle worker list is guaranteed to have only bound workers unless CPU went down again and local wake-ups are safe. After the change, @worker_pool->nr_idle may deviate than the actual number of idle workers on @worker_pool->idle_list. More specifically, nr_idle may be non-zero while ->idle_list is empty. All users of ->nr_idle and ->idle_list are audited. The only affected one is too_many_workers() which is updated to check %false if ->idle_list is empty regardless of ->nr_idle. After this patch, rebind_workers() no longer performs the nasty idle-rebind retries which require temporary release of gcwq->lock, and both unbinding and rebinding are atomic w.r.t. global_cwq->lock. worker->idle_rebind and global_cwq->rebind_hold are now unnecessary and removed along with the definition of struct idle_rebind. Changed from V1: 1) remove unlikely from too_many_workers(), ->idle_list can be empty anytime, even before this patch, no reason to use unlikely. 2) fix a small rebasing mistake. (which is from rebasing the orignal fixing patch to for-next) 3) add a lot of comments. 4) clear WORKER_REBIND unconditionaly in idle_worker_rebind() tj: Updated comments and description. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org> workqueue: WORKER_REBIND is no longer necessary for busy rebinding Because the old unbind/rebinding implementation wasn't atomic w.r.t. GCWQ_DISASSOCIATED manipulation which is protected by global_cwq->lock, we had to use two flags, WORKER_UNBOUND and WORKER_REBIND, to avoid incorrectly losing all NOT_RUNNING bits with back-to-back CPU hotplug operations; otherwise, completion of rebinding while another unbinding is in progress could clear UNBIND prematurely. Now that both unbind/rebinding are atomic w.r.t. GCWQ_DISASSOCIATED, there's no need to use two flags. Just one is enough. Don't use WORKER_REBIND for busy rebinding. tj: Updated description. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org> workqueue: WORKER_REBIND is no longer necessary for idle rebinding Now both worker destruction and idle rebinding remove the worker from idle list while it's still idle, so list_empty(&worker->entry) can be used to test whether either is pending and WORKER_DIE to distinguish between the two instead making WORKER_REBIND unnecessary. Use list_empty(&worker->entry) to determine whether destruction or rebinding is pending. This simplifies worker state transitions. WORKER_REBIND is not needed anymore. Remove it. tj: Updated comments and description. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org> workqueue: rename manager_mutex to assoc_mutex Now that manager_mutex's role has changed from synchronizing manager role to excluding hotplug against manager, the name is misleading. As it is protecting the CPU-association of the gcwq now, rename it to assoc_mutex. This patch is pure rename and doesn't introduce any functional change. tj: Updated comments and description. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org> workqueue: use __cpuinit instead of __devinit for cpu callbacks For workqueue hotplug callbacks, it makes less sense to use __devinit which discards the memory after boot if !HOTPLUG. __cpuinit, which discards the memory after boot if !HOTPLUG_CPU fits better. tj: Updated description. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org> workqueue: fix possible stall on try_to_grab_pending() of a delayed work item Currently, when try_to_grab_pending() grabs a delayed work item, it leaves its linked work items alone on the delayed_works. The linked work items are always NO_COLOR and will cause future cwq_activate_first_delayed() increase cwq->nr_active incorrectly, and may cause the whole cwq to stall. For example, state: cwq->max_active = 1, cwq->nr_active = 1 one work in cwq->pool, many in cwq->delayed_works. step1: try_to_grab_pending() removes a work item from delayed_works but leaves its NO_COLOR linked work items on it. step2: Later on, cwq_activate_first_delayed() activates the linked work item increasing ->nr_active. step3: cwq->nr_active = 1, but all activated work items of the cwq are NO_COLOR. When they finish, cwq->nr_active will not be decreased due to NO_COLOR, and no further work items will be activated from cwq->delayed_works. the cwq stalls. Fix it by ensuring the target work item is activated before stealing PENDING in try_to_grab_pending(). This ensures that all the linked work items are activated without incorrectly bumping cwq->nr_active. tj: Updated comment and description. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: stable@kernel.org workqueue: reimplement work_on_cpu() using system_wq The existing work_on_cpu() implementation is hugely inefficient. It creates a new kthread, execute that single function and then let the kthread die on each invocation. Now that system_wq can handle concurrent executions, there's no advantage of doing this. Reimplement work_on_cpu() using system_wq which makes it simpler and way more efficient. stable: While this isn't a fix in itself, it's needed to fix a workqueue related bug in cpufreq/powernow-k8. AFAICS, this shouldn't break other existing users. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Jiri Kosina <jkosina@suse.cz> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Len Brown <lenb@kernel.org> Cc: Rafael J. Wysocki <rjw@sisk.pl> Cc: stable@vger.kernel.org workqueue: introduce cwq_set_max_active() helper for thaw_workqueues() Using a helper instead of open code makes thaw_workqueues() clearer. The helper will also be used by the next patch. tj: Slight update to comment and description. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org> workqueue: use cwq_set_max_active() helper for workqueue_set_max_active() workqueue_set_max_active() may increase ->max_active without activating delayed works and may make the activation order differ from the queueing order. Both aren't strictly bugs but the resulting behavior could be a bit odd. To make things more consistent, use cwq_set_max_active() helper which immediately makes use of the newly increased max_mactive if there are delayed work items and also keeps the activation order. tj: Slight update to description. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org> workqueue: remove spurious WARN_ON_ONCE(in_irq()) from try_to_grab_pending() e0aecdd874 ("workqueue: use irqsafe timer for delayed_work") made try_to_grab_pending() safe to use from irq context but forgot to remove WARN_ON_ONCE(in_irq()). Remove it. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Fengguang Wu <fengguang.wu@intel.com> workqueue: cancel_delayed_work() should return %false if work item is idle 57b30ae77b ("workqueue: reimplement cancel_delayed_work() using try_to_grab_pending()") made cancel_delayed_work() always return %true unless someone else is also trying to cancel the work item, which is broken - if the target work item is idle, the return value should be %false. try_to_grab_pending() indicates that the target work item was idle by zero return value. Use it for return. Note that this brings cancel_delayed_work() in line with __cancel_work_timer() in return value handling. Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com> Signed-off-by: Tejun Heo <tj@kernel.org> LKML-Reference: <444a6439-b1a4-4740-9e7e-bc37267cfe73@default> workqueue: exit rescuer_thread() as TASK_RUNNING A rescue thread exiting TASK_INTERRUPTIBLE can lead to a task scheduling off, never to be seen again. In the case where this occurred, an exiting thread hit reiserfs homebrew conditional resched while holding a mutex, bringing the box to its knees. PID: 18105 TASK: ffff8807fd412180 CPU: 5 COMMAND: "kdmflush" #0 [ffff8808157e7670] schedule at ffffffff8143f489 #1 [ffff8808157e77b8] reiserfs_get_block at ffffffffa038ab2d [reiserfs] #2 [ffff8808157e79a8] __block_write_begin at ffffffff8117fb14 #3 [ffff8808157e7a98] reiserfs_write_begin at ffffffffa0388695 [reiserfs] #4 [ffff8808157e7ad8] generic_perform_write at ffffffff810ee9e2 #5 [ffff8808157e7b58] generic_file_buffered_write at ffffffff810eeb41 #6 [ffff8808157e7ba8] __generic_file_aio_write at ffffffff810f1a3a #7 [ffff8808157e7c58] generic_file_aio_write at ffffffff810f1c88 #8 [ffff8808157e7cc8] do_sync_write at ffffffff8114f850 #9 [ffff8808157e7dd8] do_acct_process at ffffffff810a268f [exception RIP: kernel_thread_helper] RIP: ffffffff8144a5c0 RSP: ffff8808157e7f58 RFLAGS: 00000202 RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: ffffffff8107af60 RDI: ffff8803ee491d18 RBP: 0000000000000000 R8: 0000000000000000 R9: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 Signed-off-by: Mike Galbraith <mgalbraith@suse.de> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: stable@vger.kernel.org workqueue: mod_delayed_work_on() shouldn't queue timer on 0 delay 8376fe22c7 ("workqueue: implement mod_delayed_work[_on]()") implemented mod_delayed_work[_on]() using the improved try_to_grab_pending(). The function is later used, among others, to replace [__]candel_delayed_work() + queue_delayed_work() combinations. Unfortunately, a delayed_work item w/ zero @delay is handled slightly differently by mod_delayed_work_on() compared to queue_delayed_work_on(). The latter skips timer altogether and directly queues it using queue_work_on() while the former schedules timer which will expire on the closest tick. This means, when @delay is zero, that [__]cancel_delayed_work() + queue_delayed_work_on() makes the target item immediately executable while mod_delayed_work_on() may induce delay of upto a full tick. This somewhat subtle difference breaks some of the converted users. e.g. block queue plugging uses delayed_work for deferred processing and uses mod_delayed_work_on() when the queue needs to be immediately unplugged. The above problem manifested as noticeably higher number of context switches under certain circumstances. The difference in behavior was caused by missing special case handling for 0 delay in mod_delayed_work_on() compared to queue_delayed_work_on(). Joonsoo Kim posted a patch to add it - ("workqueue: optimize mod_delayed_work_on() when @delay == 0")[1]. The patch was queued for 3.8 but it was described as optimization and I missed that it was a correctness issue. As both queue_delayed_work_on() and mod_delayed_work_on() use __queue_delayed_work() for queueing, it seems that the better approach is to move the 0 delay special handling to the function instead of duplicating it in mod_delayed_work_on(). Fix the problem by moving 0 delay special case handling from queue_delayed_work_on() to __queue_delayed_work(). This replaces Joonsoo's patch. [1] http://thread.gmane.org/gmane.linux.kernel/1379011/focus=1379012 Signed-off-by: Tejun Heo <tj@kernel.org> Reported-and-tested-by: Anders Kaseorg <andersk@MIT.EDU> Reported-and-tested-by: Zlatko Calusic <zlatko.calusic@iskon.hr> LKML-Reference: <alpine.DEB.2.00.1211280953350.26602@dr-wily.mit.edu> LKML-Reference: <50A78AA9.5040904@iskon.hr> Cc: Joonsoo Kim <js1304@gmail.com> workqueue: trivial fix for return statement in work_busy() Return type of work_busy() is unsigned int. There is return statement returning boolean value, 'false' in work_busy(). It is not problem, because 'false' may be treated '0'. However, fixing it would make code robust. Signed-off-by: Joonsoo Kim <js1304@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org> workqueue: add WARN_ON_ONCE() on CPU number to wq_worker_waking_up() Recently, workqueue code has gone through some changes and we found some bugs related to concurrency management operations happening on the wrong CPU. When a worker is concurrency managed (!WORKER_NOT_RUNNIG), it should be bound to its associated cpu and woken up to that cpu. Add WARN_ON_ONCE() to verify this. Signed-off-by: Joonsoo Kim <js1304@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org> workqueue: convert BUG_ON()s in __queue_delayed_work() to WARN_ON_ONCE()s 8852aac25e ("workqueue: mod_delayed_work_on() shouldn't queue timer on 0 delay") unexpectedly uncovered a very nasty abuse of delayed_work in megaraid - it allocated work_struct, casted it to delayed_work and then pass that into queue_delayed_work(). Previously, this was okay because 0 @delay short-circuited to queue_work() before doing anything with delayed_work. 8852aac25e moved 0 @delay test into __queue_delayed_work() after sanity check on delayed_work making megaraid trigger BUG_ON(). Although megaraid is already fixed by c1d390d8e6 ("megaraid: fix BUG_ON() from incorrect use of delayed work"), this patch converts BUG_ON()s in __queue_delayed_work() to WARN_ON_ONCE()s so that such abusers, if there are more, trigger warning but don't crash the machine. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Xiaotian Feng <xtfeng@gmail.com> wq Change-Id: Ia3c507777a995f32bf6b40dc8318203e53134229 Signed-off-by: franciscofranco <franciscofranco.1990@gmail.com>
rooque
pushed a commit
to rooque/android_kernel_xiaomi_cancro
that referenced
this issue
Jul 6, 2015
commit a1cbcaa upstream. The sched_clock_remote() implementation has the following inatomicity problem on 32bit systems when accessing the remote scd->clock, which is a 64bit value. CPU0 CPU1 sched_clock_local() sched_clock_remote(CPU0) ... remote_clock = scd[CPU0]->clock read_low32bit(scd[CPU0]->clock) cmpxchg64(scd->clock,...) read_high32bit(scd[CPU0]->clock) While the update of scd->clock is using an atomic64 mechanism, the readout on the remote cpu is not, which can cause completely bogus readouts. It is a quite rare problem, because it requires the update to hit the narrow race window between the low/high readout and the update must go across the 32bit boundary. The resulting misbehaviour is, that CPU1 will see the sched_clock on CPU1 ~4 seconds ahead of it's own and update CPU1s sched_clock value to this bogus timestamp. This stays that way due to the clamping implementation for about 4 seconds until the synchronization with CLOCK_MONOTONIC undoes the problem. The issue is hard to observe, because it might only result in a less accurate SCHED_OTHER timeslicing behaviour. To create observable damage on realtime scheduling classes, it is necessary that the bogus update of CPU1 sched_clock happens in the context of an realtime thread, which then gets charged 4 seconds of RT runtime, which results in the RT throttler mechanism to trigger and prevent scheduling of RT tasks for a little less than 4 seconds. So this is quite unlikely as well. The issue was quite hard to decode as the reproduction time is between 2 days and 3 weeks and intrusive tracing makes it less likely, but the following trace recorded with trace_clock=global, which uses sched_clock_local(), gave the final hint: <idle>-0 0d..30 400269.477150: hrtimer_cancel: hrtimer=0xf7061e80 <idle>-0 0d..30 400269.477151: hrtimer_start: hrtimer=0xf7061e80 ... irq/20-S-587 1d..32 400273.772118: sched_wakeup: comm= ... target_cpu=0 <idle>-0 0dN.30 400273.772118: hrtimer_cancel: hrtimer=0xf7061e80 What happens is that CPU0 goes idle and invokes sched_clock_idle_sleep_event() which invokes sched_clock_local() and CPU1 runs a remote wakeup for CPU0 at the same time, which invokes sched_remote_clock(). The time jump gets propagated to CPU0 via sched_remote_clock() and stays stale on both cores for ~4 seconds. There are only two other possibilities, which could cause a stale sched clock: 1) ktime_get() which reads out CLOCK_MONOTONIC returns a sporadic wrong value. 2) sched_clock() which reads the TSC returns a sporadic wrong value. MiCode#1 can be excluded because sched_clock would continue to increase for one jiffy and then go stale. MiCode#2 can be excluded because it would not make the clock jump forward. It would just result in a stale sched_clock for one jiffy. After quite some brain twisting and finding the same pattern on other traces, sched_clock_remote() remained the only place which could cause such a problem and as explained above it's indeed racy on 32bit systems. So while on 64bit systems the readout is atomic, we need to verify the remote readout on 32bit machines. We need to protect the local->clock readout in sched_clock_remote() on 32bit as well because an NMI could hit between the low and the high readout, call sched_clock_local() and modify local->clock. Thanks to Siegfried Wulsch for bearing with my debug requests and going through the tedious tasks of running a bunch of reproducer systems to generate the debug information which let me decode the issue. Reported-by: Siegfried Wulsch <Siegfried.Wulsch@rovema.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1304051544160.21884@ionos Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
fefifofum
referenced
this issue
in armani-dev/android_kernel_xiaomi_armani_kk
Oct 10, 2015
commit 6f2e9f0 upstream. Now when we set the group inode free count, we don't have a proper group lock so that multiple threads may decrease the inode free count at the same time. And e2fsck will complain something like: Free inodes count wrong for group #1 (1, counted=0). Fix? no Free inodes count wrong for group #2 (3, counted=0). Fix? no Directories count wrong for group #2 (780, counted=779). Fix? no Free inodes count wrong for group #3 (2272, counted=2273). Fix? no So this patch try to protect it with the ext4_lock_group. btw, it is found by xfstests test case 269 and the volume is mkfsed with the parameter "-O ^resize_inode,^uninit_bg,extent,meta_bg,flex_bg,ext_attr" and I have run it 100 times and the error in e2fsck doesn't show up again. Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
fefifofum
referenced
this issue
in armani-dev/android_kernel_xiaomi_armani_kk
Oct 10, 2015
commit 00d4e73 upstream. In ext4_nonda_switch(), if the file system is getting full we used to call writeback_inodes_sb_if_idle(). The problem is that we can be holding i_mutex already, and this causes a potential deadlock when writeback_inodes_sb_if_idle() when it tries to take s_umount. (See lockdep output below). As it turns out we don't need need to hold s_umount; the fact that we are in the middle of the write(2) system call will keep the superblock pinned. Unfortunately writeback_inodes_sb() checks to make sure s_umount is taken, and the VFS uses a different mechanism for making sure the file system doesn't get unmounted out from under us. The simplest way of dealing with this is to just simply grab s_umount using a trylock, and skip kicking the writeback flusher thread in the very unlikely case that we can't take a read lock on s_umount without blocking. Also, we now check the cirteria for kicking the writeback thread before we decide to whether to fall back to non-delayed writeback, so if there are any outstanding delayed allocation writes, we try to get them resolved as soon as possible. [ INFO: possible circular locking dependency detected ] 3.6.0-rc1-00042-gce894ca MiCode#367 Not tainted ------------------------------------------------------- dd/8298 is trying to acquire lock: (&type->s_umount_key#18){++++..}, at: [<c02277d4>] writeback_inodes_sb_if_idle+0x28/0x46 but task is already holding lock: (&sb->s_type->i_mutex_key#8){+.+...}, at: [<c01ddcce>] generic_file_aio_write+0x5f/0xd3 which lock already depends on the new lock. 2 locks held by dd/8298: #0: (sb_writers#2){.+.+.+}, at: [<c01ddcc5>] generic_file_aio_write+0x56/0xd3 #1: (&sb->s_type->i_mutex_key#8){+.+...}, at: [<c01ddcce>] generic_file_aio_write+0x5f/0xd3 stack backtrace: Pid: 8298, comm: dd Not tainted 3.6.0-rc1-00042-gce894ca MiCode#367 Call Trace: [<c015b79c>] ? console_unlock+0x345/0x372 [<c06d62a1>] print_circular_bug+0x190/0x19d [<c019906c>] __lock_acquire+0x86d/0xb6c [<c01999db>] ? mark_held_locks+0x5c/0x7b [<c0199724>] lock_acquire+0x66/0xb9 [<c02277d4>] ? writeback_inodes_sb_if_idle+0x28/0x46 [<c06db935>] down_read+0x28/0x58 [<c02277d4>] ? writeback_inodes_sb_if_idle+0x28/0x46 [<c02277d4>] writeback_inodes_sb_if_idle+0x28/0x46 [<c026f3b2>] ext4_nonda_switch+0xe1/0xf4 [<c0271ece>] ext4_da_write_begin+0x27/0x193 [<c01dcdb0>] generic_file_buffered_write+0xc8/0x1bb [<c01ddc47>] __generic_file_aio_write+0x1dd/0x205 [<c01ddce7>] generic_file_aio_write+0x78/0xd3 [<c026d336>] ext4_file_write+0x480/0x4a6 [<c0198c1d>] ? __lock_acquire+0x41e/0xb6c [<c0180944>] ? sched_clock_cpu+0x11a/0x13e [<c01967e9>] ? trace_hardirqs_off+0xb/0xd [<c018099f>] ? local_clock+0x37/0x4e [<c0209f2c>] do_sync_write+0x67/0x9d [<c0209ec5>] ? wait_on_retry_sync_kiocb+0x44/0x44 [<c020a7b9>] vfs_write+0x7b/0xe6 [<c020a9a6>] sys_write+0x3b/0x64 [<c06dd4bd>] syscall_call+0x7/0xb Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
fefifofum
referenced
this issue
in armani-dev/android_kernel_xiaomi_armani_kk
Oct 10, 2015
commit 42d64e1 upstream. The SELinux/NetLabel glue code has a locking bug that affects systems with NetLabel enabled, see the kernel error message below. This patch corrects this problem by converting the bottom half socket lock to a more conventional, and correct for this call-path, lock_sock() call. =============================== [ INFO: suspicious RCU usage. ] 3.11.0-rc3+ MiCode#19 Not tainted ------------------------------- net/ipv4/cipso_ipv4.c:1928 suspicious rcu_dereference_protected() usage! other info that might help us debug this: rcu_scheduler_active = 1, debug_locks = 0 2 locks held by ping/731: #0: (slock-AF_INET/1){+.-...}, at: [...] selinux_netlbl_socket_connect #1: (rcu_read_lock){.+.+..}, at: [<...>] netlbl_conn_setattr stack backtrace: CPU: 1 PID: 731 Comm: ping Not tainted 3.11.0-rc3+ MiCode#19 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 0000000000000001 ffff88006f659d28 ffffffff81726b6a ffff88003732c500 ffff88006f659d58 ffffffff810e4457 ffff88006b845a00 0000000000000000 000000000000000c ffff880075aa2f50 ffff88006f659d90 ffffffff8169bec7 Call Trace: [<ffffffff81726b6a>] dump_stack+0x54/0x74 [<ffffffff810e4457>] lockdep_rcu_suspicious+0xe7/0x120 [<ffffffff8169bec7>] cipso_v4_sock_setattr+0x187/0x1a0 [<ffffffff8170f317>] netlbl_conn_setattr+0x187/0x190 [<ffffffff8170f195>] ? netlbl_conn_setattr+0x5/0x190 [<ffffffff8131ac9e>] selinux_netlbl_socket_connect+0xae/0xc0 [<ffffffff81303025>] selinux_socket_connect+0x135/0x170 [<ffffffff8119d127>] ? might_fault+0x57/0xb0 [<ffffffff812fb146>] security_socket_connect+0x16/0x20 [<ffffffff815d3ad3>] SYSC_connect+0x73/0x130 [<ffffffff81739a85>] ? sysret_check+0x22/0x5d [<ffffffff810e5e2d>] ? trace_hardirqs_on_caller+0xfd/0x1c0 [<ffffffff81373d4e>] ? trace_hardirqs_on_thunk+0x3a/0x3f [<ffffffff815d52be>] SyS_connect+0xe/0x10 [<ffffffff81739a59>] system_call_fastpath+0x16/0x1b Signed-off-by: Paul Moore <pmoore@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
fefifofum
referenced
this issue
in armani-dev/android_kernel_xiaomi_armani_kk
Oct 10, 2015
commit 6f2e9f0 upstream. Now when we set the group inode free count, we don't have a proper group lock so that multiple threads may decrease the inode free count at the same time. And e2fsck will complain something like: Free inodes count wrong for group #1 (1, counted=0). Fix? no Free inodes count wrong for group #2 (3, counted=0). Fix? no Directories count wrong for group #2 (780, counted=779). Fix? no Free inodes count wrong for group #3 (2272, counted=2273). Fix? no So this patch try to protect it with the ext4_lock_group. btw, it is found by xfstests test case 269 and the volume is mkfsed with the parameter "-O ^resize_inode,^uninit_bg,extent,meta_bg,flex_bg,ext_attr" and I have run it 100 times and the error in e2fsck doesn't show up again. Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
fefifofum
referenced
this issue
in armani-dev/android_kernel_xiaomi_armani_kk
Oct 10, 2015
commit 00d4e73 upstream. In ext4_nonda_switch(), if the file system is getting full we used to call writeback_inodes_sb_if_idle(). The problem is that we can be holding i_mutex already, and this causes a potential deadlock when writeback_inodes_sb_if_idle() when it tries to take s_umount. (See lockdep output below). As it turns out we don't need need to hold s_umount; the fact that we are in the middle of the write(2) system call will keep the superblock pinned. Unfortunately writeback_inodes_sb() checks to make sure s_umount is taken, and the VFS uses a different mechanism for making sure the file system doesn't get unmounted out from under us. The simplest way of dealing with this is to just simply grab s_umount using a trylock, and skip kicking the writeback flusher thread in the very unlikely case that we can't take a read lock on s_umount without blocking. Also, we now check the cirteria for kicking the writeback thread before we decide to whether to fall back to non-delayed writeback, so if there are any outstanding delayed allocation writes, we try to get them resolved as soon as possible. [ INFO: possible circular locking dependency detected ] 3.6.0-rc1-00042-gce894ca MiCode#367 Not tainted ------------------------------------------------------- dd/8298 is trying to acquire lock: (&type->s_umount_key#18){++++..}, at: [<c02277d4>] writeback_inodes_sb_if_idle+0x28/0x46 but task is already holding lock: (&sb->s_type->i_mutex_key#8){+.+...}, at: [<c01ddcce>] generic_file_aio_write+0x5f/0xd3 which lock already depends on the new lock. 2 locks held by dd/8298: #0: (sb_writers#2){.+.+.+}, at: [<c01ddcc5>] generic_file_aio_write+0x56/0xd3 #1: (&sb->s_type->i_mutex_key#8){+.+...}, at: [<c01ddcce>] generic_file_aio_write+0x5f/0xd3 stack backtrace: Pid: 8298, comm: dd Not tainted 3.6.0-rc1-00042-gce894ca MiCode#367 Call Trace: [<c015b79c>] ? console_unlock+0x345/0x372 [<c06d62a1>] print_circular_bug+0x190/0x19d [<c019906c>] __lock_acquire+0x86d/0xb6c [<c01999db>] ? mark_held_locks+0x5c/0x7b [<c0199724>] lock_acquire+0x66/0xb9 [<c02277d4>] ? writeback_inodes_sb_if_idle+0x28/0x46 [<c06db935>] down_read+0x28/0x58 [<c02277d4>] ? writeback_inodes_sb_if_idle+0x28/0x46 [<c02277d4>] writeback_inodes_sb_if_idle+0x28/0x46 [<c026f3b2>] ext4_nonda_switch+0xe1/0xf4 [<c0271ece>] ext4_da_write_begin+0x27/0x193 [<c01dcdb0>] generic_file_buffered_write+0xc8/0x1bb [<c01ddc47>] __generic_file_aio_write+0x1dd/0x205 [<c01ddce7>] generic_file_aio_write+0x78/0xd3 [<c026d336>] ext4_file_write+0x480/0x4a6 [<c0198c1d>] ? __lock_acquire+0x41e/0xb6c [<c0180944>] ? sched_clock_cpu+0x11a/0x13e [<c01967e9>] ? trace_hardirqs_off+0xb/0xd [<c018099f>] ? local_clock+0x37/0x4e [<c0209f2c>] do_sync_write+0x67/0x9d [<c0209ec5>] ? wait_on_retry_sync_kiocb+0x44/0x44 [<c020a7b9>] vfs_write+0x7b/0xe6 [<c020a9a6>] sys_write+0x3b/0x64 [<c06dd4bd>] syscall_call+0x7/0xb Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
fefifofum
referenced
this issue
in armani-dev/android_kernel_xiaomi_armani_kk
Oct 10, 2015
commit 42d64e1 upstream. The SELinux/NetLabel glue code has a locking bug that affects systems with NetLabel enabled, see the kernel error message below. This patch corrects this problem by converting the bottom half socket lock to a more conventional, and correct for this call-path, lock_sock() call. =============================== [ INFO: suspicious RCU usage. ] 3.11.0-rc3+ MiCode#19 Not tainted ------------------------------- net/ipv4/cipso_ipv4.c:1928 suspicious rcu_dereference_protected() usage! other info that might help us debug this: rcu_scheduler_active = 1, debug_locks = 0 2 locks held by ping/731: #0: (slock-AF_INET/1){+.-...}, at: [...] selinux_netlbl_socket_connect #1: (rcu_read_lock){.+.+..}, at: [<...>] netlbl_conn_setattr stack backtrace: CPU: 1 PID: 731 Comm: ping Not tainted 3.11.0-rc3+ MiCode#19 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 0000000000000001 ffff88006f659d28 ffffffff81726b6a ffff88003732c500 ffff88006f659d58 ffffffff810e4457 ffff88006b845a00 0000000000000000 000000000000000c ffff880075aa2f50 ffff88006f659d90 ffffffff8169bec7 Call Trace: [<ffffffff81726b6a>] dump_stack+0x54/0x74 [<ffffffff810e4457>] lockdep_rcu_suspicious+0xe7/0x120 [<ffffffff8169bec7>] cipso_v4_sock_setattr+0x187/0x1a0 [<ffffffff8170f317>] netlbl_conn_setattr+0x187/0x190 [<ffffffff8170f195>] ? netlbl_conn_setattr+0x5/0x190 [<ffffffff8131ac9e>] selinux_netlbl_socket_connect+0xae/0xc0 [<ffffffff81303025>] selinux_socket_connect+0x135/0x170 [<ffffffff8119d127>] ? might_fault+0x57/0xb0 [<ffffffff812fb146>] security_socket_connect+0x16/0x20 [<ffffffff815d3ad3>] SYSC_connect+0x73/0x130 [<ffffffff81739a85>] ? sysret_check+0x22/0x5d [<ffffffff810e5e2d>] ? trace_hardirqs_on_caller+0xfd/0x1c0 [<ffffffff81373d4e>] ? trace_hardirqs_on_thunk+0x3a/0x3f [<ffffffff815d52be>] SyS_connect+0xe/0x10 [<ffffffff81739a59>] system_call_fastpath+0x16/0x1b Signed-off-by: Paul Moore <pmoore@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
fefifofum
referenced
this issue
in armani-dev/android_kernel_xiaomi_armani_kk
Oct 10, 2015
commit 6f2e9f0 upstream. Now when we set the group inode free count, we don't have a proper group lock so that multiple threads may decrease the inode free count at the same time. And e2fsck will complain something like: Free inodes count wrong for group #1 (1, counted=0). Fix? no Free inodes count wrong for group #2 (3, counted=0). Fix? no Directories count wrong for group #2 (780, counted=779). Fix? no Free inodes count wrong for group #3 (2272, counted=2273). Fix? no So this patch try to protect it with the ext4_lock_group. btw, it is found by xfstests test case 269 and the volume is mkfsed with the parameter "-O ^resize_inode,^uninit_bg,extent,meta_bg,flex_bg,ext_attr" and I have run it 100 times and the error in e2fsck doesn't show up again. Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
fefifofum
referenced
this issue
in armani-dev/android_kernel_xiaomi_armani_kk
Oct 10, 2015
commit 00d4e73 upstream. In ext4_nonda_switch(), if the file system is getting full we used to call writeback_inodes_sb_if_idle(). The problem is that we can be holding i_mutex already, and this causes a potential deadlock when writeback_inodes_sb_if_idle() when it tries to take s_umount. (See lockdep output below). As it turns out we don't need need to hold s_umount; the fact that we are in the middle of the write(2) system call will keep the superblock pinned. Unfortunately writeback_inodes_sb() checks to make sure s_umount is taken, and the VFS uses a different mechanism for making sure the file system doesn't get unmounted out from under us. The simplest way of dealing with this is to just simply grab s_umount using a trylock, and skip kicking the writeback flusher thread in the very unlikely case that we can't take a read lock on s_umount without blocking. Also, we now check the cirteria for kicking the writeback thread before we decide to whether to fall back to non-delayed writeback, so if there are any outstanding delayed allocation writes, we try to get them resolved as soon as possible. [ INFO: possible circular locking dependency detected ] 3.6.0-rc1-00042-gce894ca MiCode#367 Not tainted ------------------------------------------------------- dd/8298 is trying to acquire lock: (&type->s_umount_key#18){++++..}, at: [<c02277d4>] writeback_inodes_sb_if_idle+0x28/0x46 but task is already holding lock: (&sb->s_type->i_mutex_key#8){+.+...}, at: [<c01ddcce>] generic_file_aio_write+0x5f/0xd3 which lock already depends on the new lock. 2 locks held by dd/8298: #0: (sb_writers#2){.+.+.+}, at: [<c01ddcc5>] generic_file_aio_write+0x56/0xd3 #1: (&sb->s_type->i_mutex_key#8){+.+...}, at: [<c01ddcce>] generic_file_aio_write+0x5f/0xd3 stack backtrace: Pid: 8298, comm: dd Not tainted 3.6.0-rc1-00042-gce894ca MiCode#367 Call Trace: [<c015b79c>] ? console_unlock+0x345/0x372 [<c06d62a1>] print_circular_bug+0x190/0x19d [<c019906c>] __lock_acquire+0x86d/0xb6c [<c01999db>] ? mark_held_locks+0x5c/0x7b [<c0199724>] lock_acquire+0x66/0xb9 [<c02277d4>] ? writeback_inodes_sb_if_idle+0x28/0x46 [<c06db935>] down_read+0x28/0x58 [<c02277d4>] ? writeback_inodes_sb_if_idle+0x28/0x46 [<c02277d4>] writeback_inodes_sb_if_idle+0x28/0x46 [<c026f3b2>] ext4_nonda_switch+0xe1/0xf4 [<c0271ece>] ext4_da_write_begin+0x27/0x193 [<c01dcdb0>] generic_file_buffered_write+0xc8/0x1bb [<c01ddc47>] __generic_file_aio_write+0x1dd/0x205 [<c01ddce7>] generic_file_aio_write+0x78/0xd3 [<c026d336>] ext4_file_write+0x480/0x4a6 [<c0198c1d>] ? __lock_acquire+0x41e/0xb6c [<c0180944>] ? sched_clock_cpu+0x11a/0x13e [<c01967e9>] ? trace_hardirqs_off+0xb/0xd [<c018099f>] ? local_clock+0x37/0x4e [<c0209f2c>] do_sync_write+0x67/0x9d [<c0209ec5>] ? wait_on_retry_sync_kiocb+0x44/0x44 [<c020a7b9>] vfs_write+0x7b/0xe6 [<c020a9a6>] sys_write+0x3b/0x64 [<c06dd4bd>] syscall_call+0x7/0xb Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
fefifofum
referenced
this issue
in armani-dev/android_kernel_xiaomi_armani_kk
Oct 10, 2015
commit 42d64e1 upstream. The SELinux/NetLabel glue code has a locking bug that affects systems with NetLabel enabled, see the kernel error message below. This patch corrects this problem by converting the bottom half socket lock to a more conventional, and correct for this call-path, lock_sock() call. =============================== [ INFO: suspicious RCU usage. ] 3.11.0-rc3+ MiCode#19 Not tainted ------------------------------- net/ipv4/cipso_ipv4.c:1928 suspicious rcu_dereference_protected() usage! other info that might help us debug this: rcu_scheduler_active = 1, debug_locks = 0 2 locks held by ping/731: #0: (slock-AF_INET/1){+.-...}, at: [...] selinux_netlbl_socket_connect #1: (rcu_read_lock){.+.+..}, at: [<...>] netlbl_conn_setattr stack backtrace: CPU: 1 PID: 731 Comm: ping Not tainted 3.11.0-rc3+ MiCode#19 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 0000000000000001 ffff88006f659d28 ffffffff81726b6a ffff88003732c500 ffff88006f659d58 ffffffff810e4457 ffff88006b845a00 0000000000000000 000000000000000c ffff880075aa2f50 ffff88006f659d90 ffffffff8169bec7 Call Trace: [<ffffffff81726b6a>] dump_stack+0x54/0x74 [<ffffffff810e4457>] lockdep_rcu_suspicious+0xe7/0x120 [<ffffffff8169bec7>] cipso_v4_sock_setattr+0x187/0x1a0 [<ffffffff8170f317>] netlbl_conn_setattr+0x187/0x190 [<ffffffff8170f195>] ? netlbl_conn_setattr+0x5/0x190 [<ffffffff8131ac9e>] selinux_netlbl_socket_connect+0xae/0xc0 [<ffffffff81303025>] selinux_socket_connect+0x135/0x170 [<ffffffff8119d127>] ? might_fault+0x57/0xb0 [<ffffffff812fb146>] security_socket_connect+0x16/0x20 [<ffffffff815d3ad3>] SYSC_connect+0x73/0x130 [<ffffffff81739a85>] ? sysret_check+0x22/0x5d [<ffffffff810e5e2d>] ? trace_hardirqs_on_caller+0xfd/0x1c0 [<ffffffff81373d4e>] ? trace_hardirqs_on_thunk+0x3a/0x3f [<ffffffff815d52be>] SyS_connect+0xe/0x10 [<ffffffff81739a59>] system_call_fastpath+0x16/0x1b Signed-off-by: Paul Moore <pmoore@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
fefifofum
referenced
this issue
in armani-dev/android_kernel_xiaomi_armani_kk
Oct 10, 2015
commit 6f2e9f0 upstream. Now when we set the group inode free count, we don't have a proper group lock so that multiple threads may decrease the inode free count at the same time. And e2fsck will complain something like: Free inodes count wrong for group #1 (1, counted=0). Fix? no Free inodes count wrong for group #2 (3, counted=0). Fix? no Directories count wrong for group #2 (780, counted=779). Fix? no Free inodes count wrong for group #3 (2272, counted=2273). Fix? no So this patch try to protect it with the ext4_lock_group. btw, it is found by xfstests test case 269 and the volume is mkfsed with the parameter "-O ^resize_inode,^uninit_bg,extent,meta_bg,flex_bg,ext_attr" and I have run it 100 times and the error in e2fsck doesn't show up again. Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
fefifofum
referenced
this issue
in armani-dev/android_kernel_xiaomi_armani_kk
Oct 10, 2015
commit 00d4e73 upstream. In ext4_nonda_switch(), if the file system is getting full we used to call writeback_inodes_sb_if_idle(). The problem is that we can be holding i_mutex already, and this causes a potential deadlock when writeback_inodes_sb_if_idle() when it tries to take s_umount. (See lockdep output below). As it turns out we don't need need to hold s_umount; the fact that we are in the middle of the write(2) system call will keep the superblock pinned. Unfortunately writeback_inodes_sb() checks to make sure s_umount is taken, and the VFS uses a different mechanism for making sure the file system doesn't get unmounted out from under us. The simplest way of dealing with this is to just simply grab s_umount using a trylock, and skip kicking the writeback flusher thread in the very unlikely case that we can't take a read lock on s_umount without blocking. Also, we now check the cirteria for kicking the writeback thread before we decide to whether to fall back to non-delayed writeback, so if there are any outstanding delayed allocation writes, we try to get them resolved as soon as possible. [ INFO: possible circular locking dependency detected ] 3.6.0-rc1-00042-gce894ca MiCode#367 Not tainted ------------------------------------------------------- dd/8298 is trying to acquire lock: (&type->s_umount_key#18){++++..}, at: [<c02277d4>] writeback_inodes_sb_if_idle+0x28/0x46 but task is already holding lock: (&sb->s_type->i_mutex_key#8){+.+...}, at: [<c01ddcce>] generic_file_aio_write+0x5f/0xd3 which lock already depends on the new lock. 2 locks held by dd/8298: #0: (sb_writers#2){.+.+.+}, at: [<c01ddcc5>] generic_file_aio_write+0x56/0xd3 #1: (&sb->s_type->i_mutex_key#8){+.+...}, at: [<c01ddcce>] generic_file_aio_write+0x5f/0xd3 stack backtrace: Pid: 8298, comm: dd Not tainted 3.6.0-rc1-00042-gce894ca MiCode#367 Call Trace: [<c015b79c>] ? console_unlock+0x345/0x372 [<c06d62a1>] print_circular_bug+0x190/0x19d [<c019906c>] __lock_acquire+0x86d/0xb6c [<c01999db>] ? mark_held_locks+0x5c/0x7b [<c0199724>] lock_acquire+0x66/0xb9 [<c02277d4>] ? writeback_inodes_sb_if_idle+0x28/0x46 [<c06db935>] down_read+0x28/0x58 [<c02277d4>] ? writeback_inodes_sb_if_idle+0x28/0x46 [<c02277d4>] writeback_inodes_sb_if_idle+0x28/0x46 [<c026f3b2>] ext4_nonda_switch+0xe1/0xf4 [<c0271ece>] ext4_da_write_begin+0x27/0x193 [<c01dcdb0>] generic_file_buffered_write+0xc8/0x1bb [<c01ddc47>] __generic_file_aio_write+0x1dd/0x205 [<c01ddce7>] generic_file_aio_write+0x78/0xd3 [<c026d336>] ext4_file_write+0x480/0x4a6 [<c0198c1d>] ? __lock_acquire+0x41e/0xb6c [<c0180944>] ? sched_clock_cpu+0x11a/0x13e [<c01967e9>] ? trace_hardirqs_off+0xb/0xd [<c018099f>] ? local_clock+0x37/0x4e [<c0209f2c>] do_sync_write+0x67/0x9d [<c0209ec5>] ? wait_on_retry_sync_kiocb+0x44/0x44 [<c020a7b9>] vfs_write+0x7b/0xe6 [<c020a9a6>] sys_write+0x3b/0x64 [<c06dd4bd>] syscall_call+0x7/0xb Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
fefifofum
referenced
this issue
in armani-dev/android_kernel_xiaomi_armani_kk
Oct 10, 2015
commit 42d64e1 upstream. The SELinux/NetLabel glue code has a locking bug that affects systems with NetLabel enabled, see the kernel error message below. This patch corrects this problem by converting the bottom half socket lock to a more conventional, and correct for this call-path, lock_sock() call. =============================== [ INFO: suspicious RCU usage. ] 3.11.0-rc3+ MiCode#19 Not tainted ------------------------------- net/ipv4/cipso_ipv4.c:1928 suspicious rcu_dereference_protected() usage! other info that might help us debug this: rcu_scheduler_active = 1, debug_locks = 0 2 locks held by ping/731: #0: (slock-AF_INET/1){+.-...}, at: [...] selinux_netlbl_socket_connect #1: (rcu_read_lock){.+.+..}, at: [<...>] netlbl_conn_setattr stack backtrace: CPU: 1 PID: 731 Comm: ping Not tainted 3.11.0-rc3+ MiCode#19 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 0000000000000001 ffff88006f659d28 ffffffff81726b6a ffff88003732c500 ffff88006f659d58 ffffffff810e4457 ffff88006b845a00 0000000000000000 000000000000000c ffff880075aa2f50 ffff88006f659d90 ffffffff8169bec7 Call Trace: [<ffffffff81726b6a>] dump_stack+0x54/0x74 [<ffffffff810e4457>] lockdep_rcu_suspicious+0xe7/0x120 [<ffffffff8169bec7>] cipso_v4_sock_setattr+0x187/0x1a0 [<ffffffff8170f317>] netlbl_conn_setattr+0x187/0x190 [<ffffffff8170f195>] ? netlbl_conn_setattr+0x5/0x190 [<ffffffff8131ac9e>] selinux_netlbl_socket_connect+0xae/0xc0 [<ffffffff81303025>] selinux_socket_connect+0x135/0x170 [<ffffffff8119d127>] ? might_fault+0x57/0xb0 [<ffffffff812fb146>] security_socket_connect+0x16/0x20 [<ffffffff815d3ad3>] SYSC_connect+0x73/0x130 [<ffffffff81739a85>] ? sysret_check+0x22/0x5d [<ffffffff810e5e2d>] ? trace_hardirqs_on_caller+0xfd/0x1c0 [<ffffffff81373d4e>] ? trace_hardirqs_on_thunk+0x3a/0x3f [<ffffffff815d52be>] SyS_connect+0xe/0x10 [<ffffffff81739a59>] system_call_fastpath+0x16/0x1b Signed-off-by: Paul Moore <pmoore@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
mi-code
pushed a commit
that referenced
this issue
Apr 8, 2024
[ Upstream commit 97f88a3 ] I found a null pointer reference in arch_prepare_kprobe(): # echo 'p cmdline_proc_show' > kprobe_events # echo 'p cmdline_proc_show+16' >> kprobe_events Kernel attempted to read user page (0) - exploit attempt? (uid: 0) BUG: Kernel NULL pointer dereference on read at 0x00000000 Faulting instruction address: 0xc000000000050bfc Oops: Kernel access of bad area, sig: 11 [#1] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA PowerNV Modules linked in: CPU: 0 PID: 122 Comm: sh Not tainted 6.0.0-rc3-00007-gdcf8e5633e2e #10 NIP: c000000000050bfc LR: c000000000050bec CTR: 0000000000005bdc REGS: c0000000348475b0 TRAP: 0300 Not tainted (6.0.0-rc3-00007-gdcf8e5633e2e) MSR: 9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 88002444 XER: 20040006 CFAR: c00000000022d100 DAR: 0000000000000000 DSISR: 40000000 IRQMASK: 0 ... NIP arch_prepare_kprobe+0x10c/0x2d0 LR arch_prepare_kprobe+0xfc/0x2d0 Call Trace: 0xc0000000012f77a0 (unreliable) register_kprobe+0x3c0/0x7a0 __register_trace_kprobe+0x140/0x1a0 __trace_kprobe_create+0x794/0x1040 trace_probe_create+0xc4/0xe0 create_or_delete_trace_kprobe+0x2c/0x80 trace_parse_run_command+0xf0/0x210 probes_write+0x20/0x40 vfs_write+0xfc/0x450 ksys_write+0x84/0x140 system_call_exception+0x17c/0x3a0 system_call_vectored_common+0xe8/0x278 --- interrupt: 3000 at 0x7fffa5682de0 NIP: 00007fffa5682de0 LR: 0000000000000000 CTR: 0000000000000000 REGS: c000000034847e80 TRAP: 3000 Not tainted (6.0.0-rc3-00007-gdcf8e5633e2e) MSR: 900000000280f033 <SF,HV,VEC,VSX,EE,PR,FP,ME,IR,DR,RI,LE> CR: 44002408 XER: 00000000 The address being probed has some special: cmdline_proc_show: Probe based on ftrace cmdline_proc_show+16: Probe for the next instruction at the ftrace location The ftrace-based kprobe does not generate kprobe::ainsn::insn, it gets set to NULL. In arch_prepare_kprobe() it will check for: ... prev = get_kprobe(p->addr - 1); preempt_enable_no_resched(); if (prev && ppc_inst_prefixed(ppc_inst_read(prev->ainsn.insn))) { ... If prev is based on ftrace, 'ppc_inst_read(prev->ainsn.insn)' will occur with a null pointer reference. At this point prev->addr will not be a prefixed instruction, so the check can be skipped. Check if prev is ftrace-based kprobe before reading 'prev->ainsn.insn' to fix this problem. Fixes: b4657f7 ("powerpc/kprobes: Don't allow breakpoints on suffixes") Signed-off-by: Li Huafei <lihuafei1@huawei.com> [mpe: Trim oops] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20220923093253.177298-1-lihuafei1@huawei.com Signed-off-by: Sasha Levin <sashal@kernel.org>
mi-code
pushed a commit
that referenced
this issue
Apr 8, 2024
[ Upstream commit 99ee931 ] There is a recursive lock on the cpu_hotplug_lock. In kernel/trace/trace_osnoise.c:<start/stop>_per_cpu_kthreads: - start_per_cpu_kthreads calls cpus_read_lock() and if start_kthreads returns a error it will call stop_per_cpu_kthreads. - stop_per_cpu_kthreads then calls cpus_read_lock() again causing deadlock. Fix this by calling cpus_read_unlock() before calling stop_per_cpu_kthreads. This behavior can also be seen in commit f46b165 ("trace/hwlat: Implement the per-cpu mode"). This error was noticed during the LTP ftrace-stress-test: WARNING: possible recursive locking detected -------------------------------------------- sh/275006 is trying to acquire lock: ffffffffb02f5400 (cpu_hotplug_lock){++++}-{0:0}, at: stop_per_cpu_kthreads but task is already holding lock: ffffffffb02f5400 (cpu_hotplug_lock){++++}-{0:0}, at: start_per_cpu_kthreads other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(cpu_hotplug_lock); lock(cpu_hotplug_lock); *** DEADLOCK *** May be due to missing lock nesting notation 3 locks held by sh/275006: #0: ffff8881023f0470 (sb_writers#24){.+.+}-{0:0}, at: ksys_write #1: ffffffffb084f430 (trace_types_lock){+.+.}-{3:3}, at: rb_simple_write #2: ffffffffb02f5400 (cpu_hotplug_lock){++++}-{0:0}, at: start_per_cpu_kthreads Link: https://lkml.kernel.org/r/20220919144932.3064014-1-npache@redhat.com Fixes: c8895e2 ("trace/osnoise: Support hotplug operations") Signed-off-by: Nico Pache <npache@redhat.com> Acked-by: Daniel Bristot de Oliveira <bristot@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
mi-code
pushed a commit
that referenced
this issue
Apr 8, 2024
[ Upstream commit aa666b6 ] The variable i is changed when setting random MAC address and causes invalid address access when printing the value of pi->reqs[i]->reqid. We replace reqs index with ri to fix the issue. [ 136.726473] Unable to handle kernel access to user memory outside uaccess routines at virtual address 0000000000000000 [ 136.737365] Mem abort info: [ 136.740172] ESR = 0x96000004 [ 136.743359] Exception class = DABT (current EL), IL = 32 bits [ 136.749294] SET = 0, FnV = 0 [ 136.752481] EA = 0, S1PTW = 0 [ 136.755635] Data abort info: [ 136.758514] ISV = 0, ISS = 0x00000004 [ 136.762487] CM = 0, WnR = 0 [ 136.765522] user pgtable: 4k pages, 48-bit VAs, pgdp = 000000005c4e2577 [ 136.772265] [0000000000000000] pgd=0000000000000000 [ 136.777160] Internal error: Oops: 96000004 [#1] PREEMPT SMP [ 136.782732] Modules linked in: brcmfmac(O) brcmutil(O) cfg80211(O) compat(O) [ 136.789788] Process wificond (pid: 3175, stack limit = 0x00000000053048fb) [ 136.796664] CPU: 3 PID: 3175 Comm: wificond Tainted: G O 4.19.42-00001-g531a5f5 #1 [ 136.805532] Hardware name: Freescale i.MX8MQ EVK (DT) [ 136.810584] pstate: 60400005 (nZCv daif +PAN -UAO) [ 136.815429] pc : brcmf_pno_config_sched_scans+0x6cc/0xa80 [brcmfmac] [ 136.821811] lr : brcmf_pno_config_sched_scans+0x67c/0xa80 [brcmfmac] [ 136.828162] sp : ffff00000e9a3880 [ 136.831475] x29: ffff00000e9a3890 x28: ffff800020543400 [ 136.836786] x27: ffff8000b1008880 x26: ffff0000012bf6a0 [ 136.842098] x25: ffff80002054345c x24: ffff800088d22400 [ 136.847409] x23: ffff0000012bf638 x22: ffff0000012bf6d8 [ 136.852721] x21: ffff8000aced8fc0 x20: ffff8000ac164400 [ 136.858032] x19: ffff00000e9a3946 x18: 0000000000000000 [ 136.863343] x17: 0000000000000000 x16: 0000000000000000 [ 136.868655] x15: ffff0000093f3b37 x14: 0000000000000050 [ 136.873966] x13: 0000000000003135 x12: 0000000000000000 [ 136.879277] x11: 0000000000000000 x10: ffff000009a61888 [ 136.884589] x9 : 000000000000000f x8 : 0000000000000008 [ 136.889900] x7 : 303a32303d726464 x6 : ffff00000a1f957d [ 136.895211] x5 : 0000000000000000 x4 : ffff00000e9a3942 [ 136.900523] x3 : 0000000000000000 x2 : ffff0000012cead8 [ 136.905834] x1 : ffff0000012bf6d8 x0 : 0000000000000000 [ 136.911146] Call trace: [ 136.913623] brcmf_pno_config_sched_scans+0x6cc/0xa80 [brcmfmac] [ 136.919658] brcmf_pno_start_sched_scan+0xa4/0x118 [brcmfmac] [ 136.925430] brcmf_cfg80211_sched_scan_start+0x80/0xe0 [brcmfmac] [ 136.931636] nl80211_start_sched_scan+0x140/0x308 [cfg80211] [ 136.937298] genl_rcv_msg+0x358/0x3f4 [ 136.940960] netlink_rcv_skb+0xb4/0x118 [ 136.944795] genl_rcv+0x34/0x48 [ 136.947935] netlink_unicast+0x264/0x300 [ 136.951856] netlink_sendmsg+0x2e4/0x33c [ 136.955781] __sys_sendto+0x120/0x19c Signed-off-by: Wright Feng <wright.feng@cypress.com> Signed-off-by: Chi-hsien Lin <chi-hsien.lin@cypress.com> Signed-off-by: Ahmad Fatoum <a.fatoum@pengutronix.de> Signed-off-by: Alvin Šipraga <alsi@bang-olufsen.dk> Signed-off-by: Kalle Valo <kvalo@kernel.org> Link: https://lore.kernel.org/r/20220722115632.620681-4-alvin@pqrs.dk Signed-off-by: Sasha Levin <sashal@kernel.org>
mi-code
pushed a commit
that referenced
this issue
Apr 8, 2024
[ Upstream commit 3f42faf ] > ret = brcmf_proto_tx_queue_data(drvr, ifp->ifidx, skb); may be schedule, and then complete before the line > ndev->stats.tx_bytes += skb->len; [ 46.912801] ================================================================== [ 46.920552] BUG: KASAN: use-after-free in brcmf_netdev_start_xmit+0x718/0x8c8 [brcmfmac] [ 46.928673] Read of size 4 at addr ffffff803f5882e8 by task systemd-resolve/328 [ 46.935991] [ 46.937514] CPU: 1 PID: 328 Comm: systemd-resolve Tainted: G O 5.4.199-[REDACTED] #1 [ 46.947255] Hardware name: [REDACTED] [ 46.954568] Call trace: [ 46.957037] dump_backtrace+0x0/0x2b8 [ 46.960719] show_stack+0x24/0x30 [ 46.964052] dump_stack+0x128/0x194 [ 46.967557] print_address_description.isra.0+0x64/0x380 [ 46.972877] __kasan_report+0x1d4/0x240 [ 46.976723] kasan_report+0xc/0x18 [ 46.980138] __asan_report_load4_noabort+0x18/0x20 [ 46.985027] brcmf_netdev_start_xmit+0x718/0x8c8 [brcmfmac] [ 46.990613] dev_hard_start_xmit+0x1bc/0xda0 [ 46.994894] sch_direct_xmit+0x198/0xd08 [ 46.998827] __qdisc_run+0x37c/0x1dc0 [ 47.002500] __dev_queue_xmit+0x1528/0x21f8 [ 47.006692] dev_queue_xmit+0x24/0x30 [ 47.010366] neigh_resolve_output+0x37c/0x678 [ 47.014734] ip_finish_output2+0x598/0x2458 [ 47.018927] __ip_finish_output+0x300/0x730 [ 47.023118] ip_output+0x2e0/0x430 [ 47.026530] ip_local_out+0x90/0x140 [ 47.030117] igmpv3_sendpack+0x14c/0x228 [ 47.034049] igmpv3_send_cr+0x384/0x6b8 [ 47.037895] igmp_ifc_timer_expire+0x4c/0x118 [ 47.042262] call_timer_fn+0x1cc/0xbe8 [ 47.046021] __run_timers+0x4d8/0xb28 [ 47.049693] run_timer_softirq+0x24/0x40 [ 47.053626] __do_softirq+0x2c0/0x117c [ 47.057387] irq_exit+0x2dc/0x388 [ 47.060715] __handle_domain_irq+0xb4/0x158 [ 47.064908] gic_handle_irq+0x58/0xb0 [ 47.068581] el0_irq_naked+0x50/0x5c [ 47.072162] [ 47.073665] Allocated by task 328: [ 47.077083] save_stack+0x24/0xb0 [ 47.080410] __kasan_kmalloc.isra.0+0xc0/0xe0 [ 47.084776] kasan_slab_alloc+0x14/0x20 [ 47.088622] kmem_cache_alloc+0x15c/0x468 [ 47.092643] __alloc_skb+0xa4/0x498 [ 47.096142] igmpv3_newpack+0x158/0xd78 [ 47.099987] add_grhead+0x210/0x288 [ 47.103485] add_grec+0x6b0/0xb70 [ 47.106811] igmpv3_send_cr+0x2e0/0x6b8 [ 47.110657] igmp_ifc_timer_expire+0x4c/0x118 [ 47.115027] call_timer_fn+0x1cc/0xbe8 [ 47.118785] __run_timers+0x4d8/0xb28 [ 47.122457] run_timer_softirq+0x24/0x40 [ 47.126389] __do_softirq+0x2c0/0x117c [ 47.130142] [ 47.131643] Freed by task 180: [ 47.134712] save_stack+0x24/0xb0 [ 47.138041] __kasan_slab_free+0x108/0x180 [ 47.142146] kasan_slab_free+0x10/0x18 [ 47.145904] slab_free_freelist_hook+0xa4/0x1b0 [ 47.150444] kmem_cache_free+0x8c/0x528 [ 47.154292] kfree_skbmem+0x94/0x108 [ 47.157880] consume_skb+0x10c/0x5a8 [ 47.161466] __dev_kfree_skb_any+0x88/0xa0 [ 47.165598] brcmu_pkt_buf_free_skb+0x44/0x68 [brcmutil] [ 47.171023] brcmf_txfinalize+0xec/0x190 [brcmfmac] [ 47.176016] brcmf_proto_bcdc_txcomplete+0x1c0/0x210 [brcmfmac] [ 47.182056] brcmf_sdio_sendfromq+0x8dc/0x1e80 [brcmfmac] [ 47.187568] brcmf_sdio_dpc+0xb48/0x2108 [brcmfmac] [ 47.192529] brcmf_sdio_dataworker+0xc8/0x238 [brcmfmac] [ 47.197859] process_one_work+0x7fc/0x1a80 [ 47.201965] worker_thread+0x31c/0xc40 [ 47.205726] kthread+0x2d8/0x370 [ 47.208967] ret_from_fork+0x10/0x18 [ 47.212546] [ 47.214051] The buggy address belongs to the object at ffffff803f588280 [ 47.214051] which belongs to the cache skbuff_head_cache of size 208 [ 47.227086] The buggy address is located 104 bytes inside of [ 47.227086] 208-byte region [ffffff803f588280, ffffff803f588350) [ 47.238814] The buggy address belongs to the page: [ 47.243618] page:ffffffff00dd6200 refcount:1 mapcount:0 mapping:ffffff804b6bf800 index:0xffffff803f589900 compound_mapcount: 0 [ 47.255007] flags: 0x10200(slab|head) [ 47.258689] raw: 0000000000010200 ffffffff00dfa980 0000000200000002 ffffff804b6bf800 [ 47.266439] raw: ffffff803f589900 0000000080190018 00000001ffffffff 0000000000000000 [ 47.274180] page dumped because: kasan: bad access detected [ 47.279752] [ 47.281251] Memory state around the buggy address: [ 47.286051] ffffff803f588180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 47.293277] ffffff803f588200: fb fb fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 47.300502] >ffffff803f588280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 47.307723] ^ [ 47.314343] ffffff803f588300: fb fb fb fb fb fb fb fb fb fb fc fc fc fc fc fc [ 47.321569] ffffff803f588380: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb [ 47.328789] ================================================================== Signed-off-by: Alexander Coffin <alex.coffin@matician.com> Signed-off-by: Kalle Valo <kvalo@kernel.org> Link: https://lore.kernel.org/r/20220808174925.3922558-1-alex.coffin@matician.com Signed-off-by: Sasha Levin <sashal@kernel.org>
mi-code
pushed a commit
that referenced
this issue
Apr 8, 2024
[ Upstream commit 448a496 ] device_add shall not be called multiple times as stated in its documentation: 'Do not call this routine or device_register() more than once for any device structure' Syzkaller reports a bug as follows [1]: ------------[ cut here ]------------ kernel BUG at lib/list_debug.c:33! invalid opcode: 0000 [#1] PREEMPT SMP KASAN [...] Call Trace: <TASK> __list_add include/linux/list.h:69 [inline] list_add_tail include/linux/list.h:102 [inline] kobj_kset_join lib/kobject.c:164 [inline] kobject_add_internal+0x18f/0x8f0 lib/kobject.c:214 kobject_add_varg lib/kobject.c:358 [inline] kobject_add+0x150/0x1c0 lib/kobject.c:410 device_add+0x368/0x1e90 drivers/base/core.c:3452 hci_conn_add_sysfs+0x9b/0x1b0 net/bluetooth/hci_sysfs.c:53 hci_le_cis_estabilished_evt+0x57c/0xae0 net/bluetooth/hci_event.c:6799 hci_le_meta_evt+0x2b8/0x510 net/bluetooth/hci_event.c:7110 hci_event_func net/bluetooth/hci_event.c:7440 [inline] hci_event_packet+0x63d/0xfd0 net/bluetooth/hci_event.c:7495 hci_rx_work+0xae7/0x1230 net/bluetooth/hci_core.c:4007 process_one_work+0x991/0x1610 kernel/workqueue.c:2289 worker_thread+0x665/0x1080 kernel/workqueue.c:2436 kthread+0x2e4/0x3a0 kernel/kthread.c:376 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:306 </TASK> Link: https://syzkaller.appspot.com/bug?id=da3246e2d33afdb92d66bc166a0934c5b146404a Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Tested-by: Hawkins Jiawei <yin31149@gmail.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
mi-code
pushed a commit
that referenced
this issue
Apr 8, 2024
[ Upstream commit f6ee304 ] There are some struct drm_driver fields that are required by drivers since drm_copy_field() attempts to copy them to user-space via DRM_IOCTL_VERSION. But it can be possible that a driver has a bug and did not set some of the fields, which leads to drm_copy_field() attempting to copy a NULL pointer: [ +10.395966] Unable to handle kernel access to user memory outside uaccess routines at virtual address 0000000000000000 [ +0.010955] Mem abort info: [ +0.002835] ESR = 0x0000000096000004 [ +0.003872] EC = 0x25: DABT (current EL), IL = 32 bits [ +0.005395] SET = 0, FnV = 0 [ +0.003113] EA = 0, S1PTW = 0 [ +0.003182] FSC = 0x04: level 0 translation fault [ +0.004964] Data abort info: [ +0.002919] ISV = 0, ISS = 0x00000004 [ +0.003886] CM = 0, WnR = 0 [ +0.003040] user pgtable: 4k pages, 48-bit VAs, pgdp=0000000115dad000 [ +0.006536] [0000000000000000] pgd=0000000000000000, p4d=0000000000000000 [ +0.006925] Internal error: Oops: 96000004 [#1] SMP ... [ +0.011113] pstate: 80400005 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ +0.007061] pc : __pi_strlen+0x14/0x150 [ +0.003895] lr : drm_copy_field+0x30/0x1a4 [ +0.004156] sp : ffff8000094b3a50 [ +0.003355] x29: ffff8000094b3a50 x28: ffff8000094b3b70 x27: 0000000000000040 [ +0.007242] x26: ffff443743c2ba00 x25: 0000000000000000 x24: 0000000000000040 [ +0.007243] x23: ffff443743c2ba00 x22: ffff8000094b3b70 x21: 0000000000000000 [ +0.007241] x20: 0000000000000000 x19: ffff8000094b3b90 x18: 0000000000000000 [ +0.007241] x17: 0000000000000000 x16: 0000000000000000 x15: 0000aaab14b9af40 [ +0.007241] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000 [ +0.007239] x11: 0000000000000000 x10: 0000000000000000 x9 : ffffa524ad67d4d8 [ +0.007242] x8 : 0101010101010101 x7 : 7f7f7f7f7f7f7f7f x6 : 6c6e6263606e7141 [ +0.007239] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000000000000000 [ +0.007241] x2 : 0000000000000000 x1 : ffff8000094b3b90 x0 : 0000000000000000 [ +0.007240] Call trace: [ +0.002475] __pi_strlen+0x14/0x150 [ +0.003537] drm_version+0x84/0xac [ +0.003448] drm_ioctl_kernel+0xa8/0x16c [ +0.003975] drm_ioctl+0x270/0x580 [ +0.003448] __arm64_sys_ioctl+0xb8/0xfc [ +0.003978] invoke_syscall+0x78/0x100 [ +0.003799] el0_svc_common.constprop.0+0x4c/0xf4 [ +0.004767] do_el0_svc+0x38/0x4c [ +0.003357] el0_svc+0x34/0x100 [ +0.003185] el0t_64_sync_handler+0x11c/0x150 [ +0.004418] el0t_64_sync+0x190/0x194 [ +0.003716] Code: 92402c04 b200c3e8 f13fc09f 5400088c (a9400c02) [ +0.006180] ---[ end trace 0000000000000000 ]--- Reported-by: Peter Robinson <pbrobinson@gmail.com> Signed-off-by: Javier Martinez Canillas <javierm@redhat.com> Acked-by: Thomas Zimmermann <tzimmermann@suse.de> Link: https://patchwork.freedesktop.org/patch/msgid/20220705100215.572498-3-javierm@redhat.com Signed-off-by: Sasha Levin <sashal@kernel.org>
mi-code
pushed a commit
that referenced
this issue
Apr 8, 2024
[ Upstream commit d9c04a1 ] When userspace tries to map the dmabuf and if for some reason (e.g. OOM) the creation of the sg table fails, ubuf->sg needs to be set to NULL. Otherwise, when the userspace subsequently closes the dmabuf fd, we'd try to erroneously free the invalid sg table from release_udmabuf resulting in the following crash reported by syzbot: general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] PREEMPT SMP KASAN KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007] CPU: 0 PID: 3609 Comm: syz-executor487 Not tainted 5.19.0-syzkaller-13930-g7ebfc85e2cd7 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 07/22/2022 RIP: 0010:dma_unmap_sgtable include/linux/dma-mapping.h:378 [inline] RIP: 0010:put_sg_table drivers/dma-buf/udmabuf.c:89 [inline] RIP: 0010:release_udmabuf+0xcb/0x4f0 drivers/dma-buf/udmabuf.c:114 Code: 48 89 fa 48 c1 ea 03 80 3c 02 00 0f 85 2b 04 00 00 48 8d 7d 0c 4c 8b 63 30 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <0f> b6 14 02 48 89 f8 83 e0 07 83 c0 03 38 d0 7c 08 84 d2 0f 85 e2 RSP: 0018:ffffc900037efd30 EFLAGS: 00010246 RAX: dffffc0000000000 RBX: ffffffff8cb67800 RCX: 0000000000000000 RDX: 0000000000000000 RSI: ffffffff84ad27e0 RDI: 0000000000000000 RBP: fffffffffffffff4 R08: 0000000000000005 R09: 0000000000000000 R10: 0000000000000000 R11: 000000000008c07c R12: ffff88801fa05000 R13: ffff888073db07e8 R14: ffff888025c25440 R15: 0000000000000000 FS: 0000555555fc4300(0000) GS:ffff8880b9a00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fc1c0ce06e4 CR3: 00000000715e6000 CR4: 00000000003506f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> dma_buf_release+0x157/0x2d0 drivers/dma-buf/dma-buf.c:78 __dentry_kill+0x42b/0x640 fs/dcache.c:612 dentry_kill fs/dcache.c:733 [inline] dput+0x806/0xdb0 fs/dcache.c:913 __fput+0x39c/0x9d0 fs/file_table.c:333 task_work_run+0xdd/0x1a0 kernel/task_work.c:177 ptrace_notify+0x114/0x140 kernel/signal.c:2353 ptrace_report_syscall include/linux/ptrace.h:420 [inline] ptrace_report_syscall_exit include/linux/ptrace.h:482 [inline] syscall_exit_work kernel/entry/common.c:249 [inline] syscall_exit_to_user_mode_prepare+0x129/0x280 kernel/entry/common.c:276 __syscall_exit_to_user_mode_work kernel/entry/common.c:281 [inline] syscall_exit_to_user_mode+0x9/0x50 kernel/entry/common.c:294 do_syscall_64+0x42/0xb0 arch/x86/entry/common.c:86 entry_SYSCALL_64_after_hwframe+0x63/0xcd RIP: 0033:0x7fc1c0c35b6b Code: 0f 05 48 3d 00 f0 ff ff 77 45 c3 0f 1f 40 00 48 83 ec 18 89 7c 24 0c e8 63 fc ff ff 8b 7c 24 0c 41 89 c0 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 35 44 89 c7 89 44 24 0c e8 a1 fc ff ff 8b 44 RSP: 002b:00007ffd78a06090 EFLAGS: 00000293 ORIG_RAX: 0000000000000003 RAX: 0000000000000000 RBX: 0000000000000007 RCX: 00007fc1c0c35b6b RDX: 0000000020000280 RSI: 0000000040086200 RDI: 0000000000000006 RBP: 0000000000000007 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000293 R12: 000000000000000c R13: 0000000000000003 R14: 00007fc1c0cfe4a0 R15: 00007ffd78a06140 </TASK> Modules linked in: ---[ end trace 0000000000000000 ]--- RIP: 0010:dma_unmap_sgtable include/linux/dma-mapping.h:378 [inline] RIP: 0010:put_sg_table drivers/dma-buf/udmabuf.c:89 [inline] RIP: 0010:release_udmabuf+0xcb/0x4f0 drivers/dma-buf/udmabuf.c:114 Reported-by: syzbot+c80e9ef5d8bb45894db0@syzkaller.appspotmail.com Cc: Gerd Hoffmann <kraxel@redhat.com> Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com> Link: http://patchwork.freedesktop.org/patch/msgid/20220825063522.801264-1-vivek.kasireddy@intel.com Signed-off-by: Gerd Hoffmann <kraxel@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
mi-code
pushed a commit
that referenced
this issue
Apr 8, 2024
[ Upstream commit 31c5199 ] Unloading the driver triggers the following KASAN warning: [ +0.006275] ============================================================= [ +0.000029] BUG: KASAN: use-after-free in __list_del_entry_valid+0xe0/0x1a0 [ +0.000026] Read of size 8 at addr ffff000020c395e0 by task rmmod/2695 [ +0.000019] CPU: 5 PID: 2695 Comm: rmmod Tainted: G C O 5.19.0-rc6-lrmbkasan+ #1 [ +0.000013] Hardware name: Hardkernel ODROID-N2Plus (DT) [ +0.000008] Call trace: [ +0.000007] dump_backtrace+0x1ec/0x280 [ +0.000013] show_stack+0x24/0x80 [ +0.000008] dump_stack_lvl+0x98/0xd4 [ +0.000011] print_address_description.constprop.0+0x80/0x520 [ +0.000011] print_report+0x128/0x260 [ +0.000007] kasan_report+0xb8/0xfc [ +0.000008] __asan_report_load8_noabort+0x3c/0x50 [ +0.000010] __list_del_entry_valid+0xe0/0x1a0 [ +0.000009] drm_atomic_private_obj_fini+0x30/0x200 [drm] [ +0.000172] drm_bridge_detach+0x94/0x260 [drm] [ +0.000145] drm_encoder_cleanup+0xa4/0x290 [drm] [ +0.000144] drm_mode_config_cleanup+0x118/0x740 [drm] [ +0.000143] drm_mode_config_init_release+0x1c/0x2c [drm] [ +0.000144] drm_managed_release+0x170/0x414 [drm] [ +0.000142] drm_dev_put.part.0+0xc0/0x124 [drm] [ +0.000143] drm_dev_put+0x20/0x30 [drm] [ +0.000142] meson_drv_unbind+0x1d8/0x2ac [meson_drm] [ +0.000028] take_down_aggregate_device+0xb0/0x160 [ +0.000016] component_del+0x18c/0x360 [ +0.000009] meson_dw_hdmi_remove+0x28/0x40 [meson_dw_hdmi] [ +0.000015] platform_remove+0x64/0xb0 [ +0.000009] device_remove+0xb8/0x154 [ +0.000009] device_release_driver_internal+0x398/0x5b0 [ +0.000009] driver_detach+0xac/0x1b0 [ +0.000009] bus_remove_driver+0x158/0x29c [ +0.000009] driver_unregister+0x70/0xb0 [ +0.000008] platform_driver_unregister+0x20/0x2c [ +0.000008] meson_dw_hdmi_platform_driver_exit+0x1c/0x30 [meson_dw_hdmi] [ +0.000012] __do_sys_delete_module+0x288/0x400 [ +0.000011] __arm64_sys_delete_module+0x5c/0x80 [ +0.000009] invoke_syscall+0x74/0x260 [ +0.000009] el0_svc_common.constprop.0+0xcc/0x260 [ +0.000009] do_el0_svc+0x50/0x70 [ +0.000007] el0_svc+0x68/0x1a0 [ +0.000012] el0t_64_sync_handler+0x11c/0x150 [ +0.000008] el0t_64_sync+0x18c/0x190 [ +0.000018] Allocated by task 0: [ +0.000007] (stack is not available) [ +0.000011] Freed by task 2695: [ +0.000008] kasan_save_stack+0x2c/0x5c [ +0.000011] kasan_set_track+0x2c/0x40 [ +0.000008] kasan_set_free_info+0x28/0x50 [ +0.000009] ____kasan_slab_free+0x128/0x1d4 [ +0.000008] __kasan_slab_free+0x18/0x24 [ +0.000007] slab_free_freelist_hook+0x108/0x230 [ +0.000011] kfree+0x110/0x35c [ +0.000008] release_nodes+0xf0/0x16c [ +0.000009] devres_release_group+0x180/0x270 [ +0.000008] component_unbind+0x128/0x1e0 [ +0.000010] component_unbind_all+0x1b8/0x264 [ +0.000009] meson_drv_unbind+0x1a0/0x2ac [meson_drm] [ +0.000025] take_down_aggregate_device+0xb0/0x160 [ +0.000009] component_del+0x18c/0x360 [ +0.000009] meson_dw_hdmi_remove+0x28/0x40 [meson_dw_hdmi] [ +0.000012] platform_remove+0x64/0xb0 [ +0.000008] device_remove+0xb8/0x154 [ +0.000009] device_release_driver_internal+0x398/0x5b0 [ +0.000009] driver_detach+0xac/0x1b0 [ +0.000009] bus_remove_driver+0x158/0x29c [ +0.000008] driver_unregister+0x70/0xb0 [ +0.000008] platform_driver_unregister+0x20/0x2c [ +0.000008] meson_dw_hdmi_platform_driver_exit+0x1c/0x30 [meson_dw_hdmi] [ +0.000011] __do_sys_delete_module+0x288/0x400 [ +0.000010] __arm64_sys_delete_module+0x5c/0x80 [ +0.000008] invoke_syscall+0x74/0x260 [ +0.000008] el0_svc_common.constprop.0+0xcc/0x260 [ +0.000008] do_el0_svc+0x50/0x70 [ +0.000007] el0_svc+0x68/0x1a0 [ +0.000009] el0t_64_sync_handler+0x11c/0x150 [ +0.000009] el0t_64_sync+0x18c/0x190 [ +0.000014] The buggy address belongs to the object at ffff000020c39000 which belongs to the cache kmalloc-4k of size 4096 [ +0.000008] The buggy address is located 1504 bytes inside of 4096-byte region [ffff000020c39000, ffff000020c3a000) [ +0.000016] The buggy address belongs to the physical page: [ +0.000009] page:fffffc0000830e00 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x20c38 [ +0.000013] head:fffffc0000830e00 order:3 compound_mapcount:0 compound_pincount:0 [ +0.000008] flags: 0xffff00000010200(slab|head|node=0|zone=0|lastcpupid=0xffff) [ +0.000019] raw: 0ffff00000010200 fffffc0000fd4808 fffffc0000126208 ffff000000002e80 [ +0.000009] raw: 0000000000000000 0000000000020002 00000001ffffffff 0000000000000000 [ +0.000008] page dumped because: kasan: bad access detected [ +0.000011] Memory state around the buggy address: [ +0.000008] ffff000020c39480: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ +0.000007] ffff000020c39500: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ +0.000007] >ffff000020c39580: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ +0.000007] ^ [ +0.000007] ffff000020c39600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ +0.000007] ffff000020c39680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ +0.000006] ================================================================== The reason this is happening is unloading meson-dw-hdmi will cause the component API to take down the aggregate device, which in turn will cause all devres-managed memory to be freed, including the struct dw_hdmi allocated in dw_hdmi_probe. This struct embeds a struct drm_bridge that is added at the end of the function, and which is later on picked up in meson_encoder_hdmi_init. However, when attaching the bridge to the encoder created in meson_encoder_hdmi_init, it's linked to the encoder's bridge chain, from where it never leaves, even after devres_release_group is called when the driver's components are unbound and the embedding structure freed. Then, when calling drm_dev_put in the aggregate driver's unbind function, drm_bridge_detach is called for every single bridge linked to the encoder, including the one whose memory had already been deallocated. Fix by calling component_unbind_all after drm_dev_put. Signed-off-by: Adrián Larumbe <adrian.larumbe@collabora.com> Reviewed-by: Neil Armstrong <neil.armstrong@linaro.org> Signed-off-by: Neil Armstrong <neil.armstrong@linaro.org> Link: https://patchwork.freedesktop.org/patch/msgid/20220919010940.419893-2-adrian.larumbe@collabora.com Signed-off-by: Sasha Levin <sashal@kernel.org>
mi-code
pushed a commit
that referenced
this issue
Apr 8, 2024
[ Upstream commit 8616f2a ] Because component_master_del wasn't being called when unloading the meson_drm module, the aggregate device would linger forever in the global aggregate_devices list. That means when unloading and reloading the meson_dw_hdmi module, component_add would call into try_to_bring_up_aggregate_device and find the unbound meson_drm aggregate device. This would in turn dereference some of the aggregate_device's struct entries which point to memory automatically freed by the devres API when unbinding the aggregate device from meson_drv_unbind, and trigger an use-after-free bug: [ +0.000014] ============================================================= [ +0.000007] BUG: KASAN: use-after-free in find_components+0x468/0x500 [ +0.000017] Read of size 8 at addr ffff000006731688 by task modprobe/2536 [ +0.000018] CPU: 4 PID: 2536 Comm: modprobe Tainted: G C O 5.19.0-rc6-lrmbkasan+ #1 [ +0.000010] Hardware name: Hardkernel ODROID-N2Plus (DT) [ +0.000008] Call trace: [ +0.000005] dump_backtrace+0x1ec/0x280 [ +0.000011] show_stack+0x24/0x80 [ +0.000007] dump_stack_lvl+0x98/0xd4 [ +0.000010] print_address_description.constprop.0+0x80/0x520 [ +0.000011] print_report+0x128/0x260 [ +0.000007] kasan_report+0xb8/0xfc [ +0.000007] __asan_report_load8_noabort+0x3c/0x50 [ +0.000009] find_components+0x468/0x500 [ +0.000008] try_to_bring_up_aggregate_device+0x64/0x390 [ +0.000009] __component_add+0x1dc/0x49c [ +0.000009] component_add+0x20/0x30 [ +0.000008] meson_dw_hdmi_probe+0x28/0x34 [meson_dw_hdmi] [ +0.000013] platform_probe+0xd0/0x220 [ +0.000008] really_probe+0x3ac/0xa80 [ +0.000008] __driver_probe_device+0x1f8/0x400 [ +0.000008] driver_probe_device+0x68/0x1b0 [ +0.000008] __driver_attach+0x20c/0x480 [ +0.000009] bus_for_each_dev+0x114/0x1b0 [ +0.000007] driver_attach+0x48/0x64 [ +0.000009] bus_add_driver+0x390/0x564 [ +0.000007] driver_register+0x1a8/0x3e4 [ +0.000009] __platform_driver_register+0x6c/0x94 [ +0.000007] meson_dw_hdmi_platform_driver_init+0x30/0x1000 [meson_dw_hdmi] [ +0.000014] do_one_initcall+0xc4/0x2b0 [ +0.000008] do_init_module+0x154/0x570 [ +0.000010] load_module+0x1a78/0x1ea4 [ +0.000008] __do_sys_init_module+0x184/0x1cc [ +0.000008] __arm64_sys_init_module+0x78/0xb0 [ +0.000008] invoke_syscall+0x74/0x260 [ +0.000008] el0_svc_common.constprop.0+0xcc/0x260 [ +0.000009] do_el0_svc+0x50/0x70 [ +0.000008] el0_svc+0x68/0x1a0 [ +0.000009] el0t_64_sync_handler+0x11c/0x150 [ +0.000009] el0t_64_sync+0x18c/0x190 [ +0.000014] Allocated by task 902: [ +0.000007] kasan_save_stack+0x2c/0x5c [ +0.000009] __kasan_kmalloc+0x90/0xd0 [ +0.000007] __kmalloc_node+0x240/0x580 [ +0.000010] memcg_alloc_slab_cgroups+0xa4/0x1ac [ +0.000010] memcg_slab_post_alloc_hook+0xbc/0x4c0 [ +0.000008] kmem_cache_alloc_node+0x1d0/0x490 [ +0.000009] __alloc_skb+0x1d4/0x310 [ +0.000010] alloc_skb_with_frags+0x8c/0x620 [ +0.000008] sock_alloc_send_pskb+0x5ac/0x6d0 [ +0.000010] unix_dgram_sendmsg+0x2e0/0x12f0 [ +0.000010] sock_sendmsg+0xcc/0x110 [ +0.000007] sock_write_iter+0x1d0/0x304 [ +0.000008] new_sync_write+0x364/0x460 [ +0.000007] vfs_write+0x420/0x5ac [ +0.000008] ksys_write+0x19c/0x1f0 [ +0.000008] __arm64_sys_write+0x78/0xb0 [ +0.000007] invoke_syscall+0x74/0x260 [ +0.000008] el0_svc_common.constprop.0+0x1a8/0x260 [ +0.000009] do_el0_svc+0x50/0x70 [ +0.000007] el0_svc+0x68/0x1a0 [ +0.000008] el0t_64_sync_handler+0x11c/0x150 [ +0.000008] el0t_64_sync+0x18c/0x190 [ +0.000013] Freed by task 2509: [ +0.000008] kasan_save_stack+0x2c/0x5c [ +0.000007] kasan_set_track+0x2c/0x40 [ +0.000008] kasan_set_free_info+0x28/0x50 [ +0.000008] ____kasan_slab_free+0x128/0x1d4 [ +0.000008] __kasan_slab_free+0x18/0x24 [ +0.000007] slab_free_freelist_hook+0x108/0x230 [ +0.000010] kfree+0x110/0x35c [ +0.000008] release_nodes+0xf0/0x16c [ +0.000008] devres_release_all+0xfc/0x180 [ +0.000008] device_unbind_cleanup+0x24/0x164 [ +0.000008] device_release_driver_internal+0x3e8/0x5b0 [ +0.000010] driver_detach+0xac/0x1b0 [ +0.000008] bus_remove_driver+0x158/0x29c [ +0.000008] driver_unregister+0x70/0xb0 [ +0.000009] platform_driver_unregister+0x20/0x2c [ +0.000007] 0xffff800003722d98 [ +0.000012] __do_sys_delete_module+0x288/0x400 [ +0.000009] __arm64_sys_delete_module+0x5c/0x80 [ +0.000008] invoke_syscall+0x74/0x260 [ +0.000008] el0_svc_common.constprop.0+0xcc/0x260 [ +0.000008] do_el0_svc+0x50/0x70 [ +0.000007] el0_svc+0x68/0x1a0 [ +0.000008] el0t_64_sync_handler+0x11c/0x150 [ +0.000009] el0t_64_sync+0x18c/0x190 [ +0.000013] Last potentially related work creation: [ +0.000007] kasan_save_stack+0x2c/0x5c [ +0.000007] __kasan_record_aux_stack+0xb8/0xf0 [ +0.000009] kasan_record_aux_stack_noalloc+0x14/0x20 [ +0.000008] insert_work+0x54/0x290 [ +0.000009] __queue_work+0x48c/0xd24 [ +0.000008] queue_work_on+0x90/0x11c [ +0.000008] call_usermodehelper_exec+0x188/0x404 [ +0.000010] kobject_uevent_env+0x5a8/0x794 [ +0.000010] kobject_uevent+0x14/0x20 [ +0.000008] driver_register+0x230/0x3e4 [ +0.000009] __platform_driver_register+0x6c/0x94 [ +0.000007] gxbb_driver_init+0x28/0x34 [ +0.000010] do_one_initcall+0xc4/0x2b0 [ +0.000008] do_initcalls+0x20c/0x24c [ +0.000010] kernel_init_freeable+0x22c/0x278 [ +0.000009] kernel_init+0x3c/0x170 [ +0.000008] ret_from_fork+0x10/0x20 [ +0.000013] The buggy address belongs to the object at ffff000006731600 which belongs to the cache kmalloc-256 of size 256 [ +0.000009] The buggy address is located 136 bytes inside of 256-byte region [ffff000006731600, ffff000006731700) [ +0.000015] The buggy address belongs to the physical page: [ +0.000008] page:fffffc000019cc00 refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff000006730a00 pfn:0x6730 [ +0.000011] head:fffffc000019cc00 order:2 compound_mapcount:0 compound_pincount:0 [ +0.000008] flags: 0xffff00000010200(slab|head|node=0|zone=0|lastcpupid=0xffff) [ +0.000016] raw: 0ffff00000010200 fffffc00000c3d08 fffffc0000ef2b08 ffff000000002680 [ +0.000009] raw: ffff000006730a00 0000000000150014 00000001ffffffff 0000000000000000 [ +0.000006] page dumped because: kasan: bad access detected [ +0.000011] Memory state around the buggy address: [ +0.000007] ffff000006731580: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ +0.000007] ffff000006731600: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ +0.000007] >ffff000006731680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ +0.000007] ^ [ +0.000006] ffff000006731700: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ +0.000007] ffff000006731780: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ +0.000006] ================================================================== Fix by adding 'remove' driver callback for meson-drm, and explicitly deleting the aggregate device. Signed-off-by: Adrián Larumbe <adrian.larumbe@collabora.com> Reviewed-by: Neil Armstrong <neil.armstrong@linaro.org> Signed-off-by: Neil Armstrong <neil.armstrong@linaro.org> Link: https://patchwork.freedesktop.org/patch/msgid/20220919010940.419893-3-adrian.larumbe@collabora.com Signed-off-by: Sasha Levin <sashal@kernel.org>
mi-code
pushed a commit
that referenced
this issue
Apr 8, 2024
[ Upstream commit 823f606 ] In case CONFIG_KASAN_VMALLOC=y kasan_populate_vmalloc() allocates the shadow pages dynamically. But even worse is that kasan_release_vmalloc() releases them, which is not compatible with create_mapping() of MODULES_VADDR..MODULES_END range: BUG: Bad page state in process kworker/9:1 pfn:2068b page:e5e06160 refcount:0 mapcount:0 mapping:00000000 index:0x0 flags: 0x1000(reserved) raw: 00001000 e5e06164 e5e06164 00000000 00000000 00000000 ffffffff 00000000 page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set bad because of flags: 0x1000(reserved) Modules linked in: ip_tables CPU: 9 PID: 154 Comm: kworker/9:1 Not tainted 5.4.188-... #1 Hardware name: LSI Axxia AXM55XX Workqueue: events do_free_init unwind_backtrace show_stack dump_stack bad_page free_pcp_prepare free_unref_page kasan_depopulate_vmalloc_pte __apply_to_page_range apply_to_existing_page_range kasan_release_vmalloc __purge_vmap_area_lazy _vm_unmap_aliases.part.0 __vunmap do_free_init process_one_work worker_thread kthread Reviewed-by: Linus Walleij <linus.walleij@linaro.org> Signed-off-by: Alexander Sverdlin <alexander.sverdlin@nokia.com> Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: Sasha Levin <sashal@kernel.org>
mi-code
pushed a commit
that referenced
this issue
Apr 8, 2024
[ Upstream commit 2b064d9 ] When the driver calls cx88_risc_buffer() to prepare the buffer, the function call may fail, resulting in a empty buffer and null-ptr-deref later in buffer_queue(). The following log can reveal it: [ 41.822762] general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] PREEMPT SMP KASAN PTI [ 41.824488] KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007] [ 41.828027] RIP: 0010:buffer_queue+0xc2/0x500 [ 41.836311] Call Trace: [ 41.836945] __enqueue_in_driver+0x141/0x360 [ 41.837262] vb2_start_streaming+0x62/0x4a0 [ 41.838216] vb2_core_streamon+0x1da/0x2c0 [ 41.838516] __vb2_init_fileio+0x981/0xbc0 [ 41.839141] __vb2_perform_fileio+0xbf9/0x1120 [ 41.840072] vb2_fop_read+0x20e/0x400 [ 41.840346] v4l2_read+0x215/0x290 [ 41.840603] vfs_read+0x162/0x4c0 Fix this by checking the return value of cx88_risc_buffer() [hverkuil: fix coding style issues] Signed-off-by: Zheyu Ma <zheyuma97@gmail.com> Signed-off-by: Hans Verkuil <hverkuil-cisco@xs4all.nl> Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
mi-code
pushed a commit
that referenced
this issue
Apr 8, 2024
…ext() commit abe3c63 upstream. The following warning was triggered on a hardware environment: SELinux: Converting 162 SID table entries... BUG: sleeping function called from invalid context at __might_sleep+0x60/0x74 0x0 in_atomic(): 1, irqs_disabled(): 128, non_block: 0, pid: 5943, name: tar CPU: 7 PID: 5943 Comm: tar Tainted: P O 5.10.0 #1 Call trace: dump_backtrace+0x0/0x1c8 show_stack+0x18/0x28 dump_stack+0xe8/0x15c ___might_sleep+0x168/0x17c __might_sleep+0x60/0x74 __kmalloc_track_caller+0xa0/0x7dc kstrdup+0x54/0xac convert_context+0x48/0x2e4 sidtab_context_to_sid+0x1c4/0x36c security_context_to_sid_core+0x168/0x238 security_context_to_sid_default+0x14/0x24 inode_doinit_use_xattr+0x164/0x1e4 inode_doinit_with_dentry+0x1c0/0x488 selinux_d_instantiate+0x20/0x34 security_d_instantiate+0x70/0xbc d_splice_alias+0x4c/0x3c0 ext4_lookup+0x1d8/0x200 [ext4] __lookup_slow+0x12c/0x1e4 walk_component+0x100/0x200 path_lookupat+0x88/0x118 filename_lookup+0x98/0x130 user_path_at_empty+0x48/0x60 vfs_statx+0x84/0x140 vfs_fstatat+0x20/0x30 __se_sys_newfstatat+0x30/0x74 __arm64_sys_newfstatat+0x1c/0x2c el0_svc_common.constprop.0+0x100/0x184 do_el0_svc+0x1c/0x2c el0_svc+0x20/0x34 el0_sync_handler+0x80/0x17c el0_sync+0x13c/0x140 SELinux: Context system_u:object_r:pssp_rsyslog_log_t:s0:c0 is not valid (left unmapped). It was found that within a critical section of spin_lock_irqsave in sidtab_context_to_sid(), convert_context() (hooked by sidtab_convert_params.func) might cause the process to sleep via allocating memory with GFP_KERNEL, which is problematic. As Ondrej pointed out [1], convert_context()/sidtab_convert_params.func has another caller sidtab_convert_tree(), which is okay with GFP_KERNEL. Therefore, fix this problem by adding a gfp_t argument for convert_context()/sidtab_convert_params.func and pass GFP_KERNEL/_ATOMIC properly in individual callers. Cc: stable@vger.kernel.org Link: https://lore.kernel.org/all/20221018120111.1474581-1-gongruiqi1@huawei.com/ [1] Reported-by: Tan Ninghao <tanninghao1@huawei.com> Fixes: ee1a84f ("selinux: overhaul sidtab to fix bug and improve performance") Signed-off-by: GONG, Ruiqi <gongruiqi1@huawei.com> Reviewed-by: Ondrej Mosnacek <omosnace@redhat.com> [PM: wrap long BUG() output lines, tweak subject line] Signed-off-by: Paul Moore <paul@paul-moore.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
mi-code
pushed a commit
that referenced
this issue
Apr 8, 2024
commit 01039fb upstream. This commit fixes a kernel oops because of a write in some read-only memory: [ 9.068287] Unable to handle kernel write to read-only memory at virtual address ffff800009240ad8 ..snip.. [ 9.138790] Internal error: Oops: 9600004f [#1] PREEMPT SMP ..snip.. [ 9.269161] Call trace: [ 9.276271] __memcpy+0x5c/0x230 [ 9.278531] snprintf+0x58/0x80 [ 9.282002] qcom_cpufreq_msm8939_name_version+0xb4/0x190 [ 9.284869] qcom_cpufreq_probe+0xc8/0x39c ..snip.. The following line defines a pointer that point to a char buffer stored in read-only memory: char *pvs_name = "speedXX-pvsXX-vXX"; This pointer is meant to hold a template "speedXX-pvsXX-vXX" where the XX values get overridden by the qcom_cpufreq_krait_name_version function. Since the template is actually stored in read-only memory, when the function executes the following call we get an oops: snprintf(*pvs_name, sizeof("speedXX-pvsXX-vXX"), "speed%d-pvs%d-v%d", speed, pvs, pvs_ver); To fix this issue, we instead store the template name onto the stack by using the following syntax: char pvs_name_buffer[] = "speedXX-pvsXX-vXX"; Because the `pvs_name` needs to be able to be assigned to NULL, the template buffer is stored in the pvs_name_buffer and not under the pvs_name variable. Cc: v5.7+ <stable@vger.kernel.org> # v5.7+ Fixes: a8811ec ("cpufreq: qcom: Add support for krait based socs") Signed-off-by: Fabien Parent <fabien.parent@linaro.org> Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
mi-code
pushed a commit
that referenced
this issue
Apr 8, 2024
[ Upstream commit d8b5713 ] syzbot got a crash [1] in skb_clone(), caused by a bug in hsr_get_untagged_frame(). When/if create_stripped_skb_hsr() returns NULL, we must not attempt to call skb_clone(). While we are at it, replace a WARN_ONCE() by netdev_warn_once(). [1] general protection fault, probably for non-canonical address 0xdffffc000000000f: 0000 [#1] PREEMPT SMP KASAN KASAN: null-ptr-deref in range [0x0000000000000078-0x000000000000007f] CPU: 1 PID: 754 Comm: syz-executor.0 Not tainted 6.0.0-syzkaller-02734-g0326074ff465 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/22/2022 RIP: 0010:skb_clone+0x108/0x3c0 net/core/skbuff.c:1641 Code: 93 02 00 00 49 83 7c 24 28 00 0f 85 e9 00 00 00 e8 5d 4a 29 fa 4c 8d 75 7e 48 b8 00 00 00 00 00 fc ff df 4c 89 f2 48 c1 ea 03 <0f> b6 04 02 4c 89 f2 83 e2 07 38 d0 7f 08 84 c0 0f 85 9e 01 00 00 RSP: 0018:ffffc90003ccf4e0 EFLAGS: 00010207 RAX: dffffc0000000000 RBX: ffffc90003ccf5f8 RCX: ffffc9000c24b000 RDX: 000000000000000f RSI: ffffffff8751cb13 RDI: 0000000000000000 RBP: 0000000000000000 R08: 00000000000000f0 R09: 0000000000000140 R10: fffffbfff181d972 R11: 0000000000000000 R12: ffff888161fc3640 R13: 0000000000000a20 R14: 000000000000007e R15: ffffffff8dc5f620 FS: 00007feb621e4700(0000) GS:ffff8880b9b00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007feb621e3ff8 CR3: 00000001643a9000 CR4: 00000000003506e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> hsr_get_untagged_frame+0x4e/0x610 net/hsr/hsr_forward.c:164 hsr_forward_do net/hsr/hsr_forward.c:461 [inline] hsr_forward_skb+0xcca/0x1d50 net/hsr/hsr_forward.c:623 hsr_handle_frame+0x588/0x7c0 net/hsr/hsr_slave.c:69 __netif_receive_skb_core+0x9fe/0x38f0 net/core/dev.c:5379 __netif_receive_skb_one_core+0xae/0x180 net/core/dev.c:5483 __netif_receive_skb+0x1f/0x1c0 net/core/dev.c:5599 netif_receive_skb_internal net/core/dev.c:5685 [inline] netif_receive_skb+0x12f/0x8d0 net/core/dev.c:5744 tun_rx_batched+0x4ab/0x7a0 drivers/net/tun.c:1544 tun_get_user+0x2686/0x3a00 drivers/net/tun.c:1995 tun_chr_write_iter+0xdb/0x200 drivers/net/tun.c:2025 call_write_iter include/linux/fs.h:2187 [inline] new_sync_write fs/read_write.c:491 [inline] vfs_write+0x9e9/0xdd0 fs/read_write.c:584 ksys_write+0x127/0x250 fs/read_write.c:637 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd Fixes: f266a68 ("net/hsr: Better frame dispatch") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20221017165928.2150130-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
mi-code
pushed a commit
that referenced
this issue
Apr 8, 2024
[ Upstream commit 51f9a89 ] When the default qdisc is cake, if the qdisc of dev_queue fails to be inited during mqprio_init(), cake_reset() is invoked to clear resources. In this case, the tins is NULL, and it will cause gpf issue. The process is as follows: qdisc_create_dflt() cake_init() q->tins = kvcalloc(...) --->failed, q->tins is NULL ... qdisc_put() ... cake_reset() ... cake_dequeue_one() b = &q->tins[...] --->q->tins is NULL The following is the Call Trace information: general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] PREEMPT SMP KASAN KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007] RIP: 0010:cake_dequeue_one+0xc9/0x3c0 Call Trace: <TASK> cake_reset+0xb1/0x140 qdisc_reset+0xed/0x6f0 qdisc_destroy+0x82/0x4c0 qdisc_put+0x9e/0xb0 qdisc_create_dflt+0x2c3/0x4a0 mqprio_init+0xa71/0x1760 qdisc_create+0x3eb/0x1000 tc_modify_qdisc+0x408/0x1720 rtnetlink_rcv_msg+0x38e/0xac0 netlink_rcv_skb+0x12d/0x3a0 netlink_unicast+0x4a2/0x740 netlink_sendmsg+0x826/0xcc0 sock_sendmsg+0xc5/0x100 ____sys_sendmsg+0x583/0x690 ___sys_sendmsg+0xe8/0x160 __sys_sendmsg+0xbf/0x160 do_syscall_64+0x35/0x80 entry_SYSCALL_64_after_hwframe+0x46/0xb0 RIP: 0033:0x7f89e5122d04 </TASK> Fixes: 046f6fd ("sched: Add Common Applications Kept Enhanced (cake) qdisc") Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com> Acked-by: Toke Høiland-Jørgensen <toke@toke.dk> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
mi-code
pushed a commit
that referenced
this issue
Apr 8, 2024
[ Upstream commit 2a3fc78 ] When the default qdisc is sfb, if the qdisc of dev_queue fails to be inited during mqprio_init(), sfb_reset() is invoked to clear resources. In this case, the q->qdisc is NULL, and it will cause gpf issue. The process is as follows: qdisc_create_dflt() sfb_init() tcf_block_get() --->failed, q->qdisc is NULL ... qdisc_put() ... sfb_reset() qdisc_reset(q->qdisc) --->q->qdisc is NULL ops = qdisc->ops The following is the Call Trace information: general protection fault, probably for non-canonical address 0xdffffc0000000003: 0000 [#1] PREEMPT SMP KASAN KASAN: null-ptr-deref in range [0x0000000000000018-0x000000000000001f] RIP: 0010:qdisc_reset+0x2b/0x6f0 Call Trace: <TASK> sfb_reset+0x37/0xd0 qdisc_reset+0xed/0x6f0 qdisc_destroy+0x82/0x4c0 qdisc_put+0x9e/0xb0 qdisc_create_dflt+0x2c3/0x4a0 mqprio_init+0xa71/0x1760 qdisc_create+0x3eb/0x1000 tc_modify_qdisc+0x408/0x1720 rtnetlink_rcv_msg+0x38e/0xac0 netlink_rcv_skb+0x12d/0x3a0 netlink_unicast+0x4a2/0x740 netlink_sendmsg+0x826/0xcc0 sock_sendmsg+0xc5/0x100 ____sys_sendmsg+0x583/0x690 ___sys_sendmsg+0xe8/0x160 __sys_sendmsg+0xbf/0x160 do_syscall_64+0x35/0x80 entry_SYSCALL_64_after_hwframe+0x46/0xb0 RIP: 0033:0x7f2164122d04 </TASK> Fixes: e13e02a ("net_sched: SFB flow scheduler") Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
mi-code
pushed a commit
that referenced
this issue
Apr 8, 2024
[ Upstream commit 7175e13 ] I experience issues when putting a lkbsb on the stack and have sb_lvbptr field to a dangled pointer while not using DLM_LKF_VALBLK. It will crash with the following kernel message, the dangled pointer is here 0xdeadbeef as example: [ 102.749317] BUG: unable to handle page fault for address: 00000000deadbeef [ 102.749320] #PF: supervisor read access in kernel mode [ 102.749323] #PF: error_code(0x0000) - not-present page [ 102.749325] PGD 0 P4D 0 [ 102.749332] Oops: 0000 [#1] PREEMPT SMP PTI [ 102.749336] CPU: 0 PID: 1567 Comm: lock_torture_wr Tainted: G W 5.19.0-rc3+ #1565 [ 102.749343] Hardware name: Red Hat KVM/RHEL-AV, BIOS 1.16.0-2.module+el8.7.0+15506+033991b0 04/01/2014 [ 102.749344] RIP: 0010:memcpy_erms+0x6/0x10 [ 102.749353] Code: cc cc cc cc eb 1e 0f 1f 00 48 89 f8 48 89 d1 48 c1 e9 03 83 e2 07 f3 48 a5 89 d1 f3 a4 c3 66 0f 1f 44 00 00 48 89 f8 48 89 d1 <f3> a4 c3 0f 1f 80 00 00 00 00 48 89 f8 48 83 fa 20 72 7e 40 38 fe [ 102.749355] RSP: 0018:ffff97a58145fd08 EFLAGS: 00010202 [ 102.749358] RAX: ffff901778b77070 RBX: 0000000000000000 RCX: 0000000000000040 [ 102.749360] RDX: 0000000000000040 RSI: 00000000deadbeef RDI: ffff901778b77070 [ 102.749362] RBP: ffff97a58145fd10 R08: ffff901760b67a70 R09: 0000000000000001 [ 102.749364] R10: ffff9017008e2cb8 R11: 0000000000000001 R12: ffff901760b67a70 [ 102.749366] R13: ffff901760b78f00 R14: 0000000000000003 R15: 0000000000000001 [ 102.749368] FS: 0000000000000000(0000) GS:ffff901876e00000(0000) knlGS:0000000000000000 [ 102.749372] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 102.749374] CR2: 00000000deadbeef CR3: 000000017c49a004 CR4: 0000000000770ef0 [ 102.749376] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 102.749378] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 102.749379] PKRU: 55555554 [ 102.749381] Call Trace: [ 102.749382] <TASK> [ 102.749383] ? send_args+0xb2/0xd0 [ 102.749389] send_common+0xb7/0xd0 [ 102.749395] _unlock_lock+0x2c/0x90 [ 102.749400] unlock_lock.isra.56+0x62/0xa0 [ 102.749405] dlm_unlock+0x21e/0x330 [ 102.749411] ? lock_torture_stats+0x80/0x80 [dlm_locktorture] [ 102.749416] torture_unlock+0x5a/0x90 [dlm_locktorture] [ 102.749419] ? preempt_count_sub+0xba/0x100 [ 102.749427] lock_torture_writer+0xbd/0x150 [dlm_locktorture] [ 102.786186] kthread+0x10a/0x130 [ 102.786581] ? kthread_complete_and_exit+0x20/0x20 [ 102.787156] ret_from_fork+0x22/0x30 [ 102.787588] </TASK> [ 102.787855] Modules linked in: dlm_locktorture torture rpcsec_gss_krb5 intel_rapl_msr intel_rapl_common kvm_intel iTCO_wdt iTCO_vendor_support kvm vmw_vsock_virtio_transport qxl irqbypass vmw_vsock_virtio_transport_common drm_ttm_helper crc32_pclmul joydev crc32c_intel ttm vsock virtio_scsi virtio_balloon snd_pcm drm_kms_helper virtio_console snd_timer snd drm soundcore syscopyarea i2c_i801 sysfillrect sysimgblt i2c_smbus pcspkr fb_sys_fops lpc_ich serio_raw [ 102.792536] CR2: 00000000deadbeef [ 102.792930] ---[ end trace 0000000000000000 ]--- This patch fixes the issue by checking also on DLM_LKF_VALBLK on exflags is set when copying the lvbptr array instead of if it's just null which fixes for me the issue. I think this patch can fix other dlm users as well, depending how they handle the init, freeing memory handling of sb_lvbptr and don't set DLM_LKF_VALBLK for some dlm_lock() calls. It might a there could be a hidden issue all the time. However with checking on DLM_LKF_VALBLK the user always need to provide a sb_lvbptr non-null value. There might be more intelligent handling between per ls lvblen, DLM_LKF_VALBLK and non-null to report the user the way how DLM API is used is wrong but can be added for later, this will only fix the current behaviour. Cc: stable@vger.kernel.org Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
mi-code
pushed a commit
that referenced
this issue
Apr 8, 2024
commit c3ed222 upstream. Send along the already-allocated fattr along with nfs4_fs_locations, and drop the memcpy of fattr. We end up growing two more allocations, but this fixes up a crash as: PID: 790 TASK: ffff88811b43c000 CPU: 0 COMMAND: "ls" #0 [ffffc90000857920] panic at ffffffff81b9bfde #1 [ffffc900008579c0] do_trap at ffffffff81023a9b #2 [ffffc90000857a10] do_error_trap at ffffffff81023b78 #3 [ffffc90000857a58] exc_stack_segment at ffffffff81be1f45 #4 [ffffc90000857a80] asm_exc_stack_segment at ffffffff81c009de #5 [ffffc90000857b08] nfs_lookup at ffffffffa0302322 [nfs] #6 [ffffc90000857b70] __lookup_slow at ffffffff813a4a5f #7 [ffffc90000857c60] walk_component at ffffffff813a86c4 #8 [ffffc90000857cb8] path_lookupat at ffffffff813a9553 #9 [ffffc90000857cf0] filename_lookup at ffffffff813ab86b Suggested-by: Trond Myklebust <trondmy@hammerspace.com> Fixes: 9558a00 ("NFS: Remove the label from the nfs4_lookup_res struct") Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
mi-code
pushed a commit
that referenced
this issue
Apr 8, 2024
commit 4f40a5b upstream. This was missed in c3ed222 ("NFSv4: Fix free of uninitialized nfs4_label on referral lookup.") and causes a panic when mounting with '-o trunkdiscovery': PID: 1604 TASK: ffff93dac3520000 CPU: 3 COMMAND: "mount.nfs" #0 [ffffb79140f738f8] machine_kexec at ffffffffaec64bee #1 [ffffb79140f73950] __crash_kexec at ffffffffaeda67fd #2 [ffffb79140f73a18] crash_kexec at ffffffffaeda76ed #3 [ffffb79140f73a30] oops_end at ffffffffaec2658d #4 [ffffb79140f73a50] general_protection at ffffffffaf60111e [exception RIP: nfs_fattr_init+0x5] RIP: ffffffffc0c18265 RSP: ffffb79140f73b08 RFLAGS: 00010246 RAX: 0000000000000000 RBX: ffff93dac304a800 RCX: 0000000000000000 RDX: ffffb79140f73bb0 RSI: ffff93dadc8cbb40 RDI: d03ee11cfaf6bd50 RBP: ffffb79140f73be8 R8: ffffffffc0691560 R9: 0000000000000006 R10: ffff93db3ffd3df8 R11: 0000000000000000 R12: ffff93dac4040000 R13: ffff93dac2848e00 R14: ffffb79140f73b60 R15: ffffb79140f73b30 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 #5 [ffffb79140f73b08] _nfs41_proc_get_locations at ffffffffc0c73d53 [nfsv4] #6 [ffffb79140f73bf0] nfs4_proc_get_locations at ffffffffc0c83e90 [nfsv4] #7 [ffffb79140f73c60] nfs4_discover_trunking at ffffffffc0c83fb7 [nfsv4] #8 [ffffb79140f73cd8] nfs_probe_fsinfo at ffffffffc0c0f95f [nfs] #9 [ffffb79140f73da0] nfs_probe_server at ffffffffc0c1026a [nfs] RIP: 00007f6254fce26e RSP: 00007ffc69496ac8 RFLAGS: 00000246 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f6254fce26e RDX: 00005600220a82a0 RSI: 00005600220a64d0 RDI: 00005600220a6520 RBP: 00007ffc69496c50 R8: 00005600220a8710 R9: 003035322e323231 R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffc69496c50 R13: 00005600220a8440 R14: 0000000000000010 R15: 0000560020650ef9 ORIG_RAX: 00000000000000a5 CS: 0033 SS: 002b Fixes: c3ed222 ("NFSv4: Fix free of uninitialized nfs4_label on referral lookup.") Signed-off-by: Scott Mayhew <smayhew@redhat.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
mi-code
pushed a commit
that referenced
this issue
Apr 8, 2024
[ Upstream commit d89d7ff ] Another syzbot report [1] with no reproducer hints at a bug in ip6_gre tunnel (dev:ip6gretap0) Since ipv6 mcast code makes sure to read dev->mtu once and applies a sanity check on it (see commit b9b312a "ipv6: mcast: better catch silly mtu values"), a remaining possibility is that a layer is able to set dev->mtu to an underflowed value (high order bit set). This could happen indeed in ip6gre_tnl_link_config_route(), ip6_tnl_link_config() and ipip6_tunnel_bind_dev() Make sure to sanitize mtu value in a local variable before it is written once on dev->mtu, as lockless readers could catch wrong temporary value. [1] skbuff: skb_over_panic: text:ffff80000b7a2f38 len:40 put:40 head:ffff000149dcf200 data:ffff000149dcf2b0 tail:0xd8 end:0xc0 dev:ip6gretap0 ------------[ cut here ]------------ kernel BUG at net/core/skbuff.c:120 Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP Modules linked in: CPU: 1 PID: 10241 Comm: kworker/1:1 Not tainted 6.0.0-rc7-syzkaller-18095-gbbed346d5a96 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/30/2022 Workqueue: mld mld_ifc_work pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : skb_panic+0x4c/0x50 net/core/skbuff.c:116 lr : skb_panic+0x4c/0x50 net/core/skbuff.c:116 sp : ffff800020dd3b60 x29: ffff800020dd3b70 x28: 0000000000000000 x27: ffff00010df2a800 x26: 00000000000000c0 x25: 00000000000000b0 x24: ffff000149dcf200 x23: 00000000000000c0 x22: 00000000000000d8 x21: ffff80000b7a2f38 x20: ffff00014c2f7800 x19: 0000000000000028 x18: 00000000000001a9 x17: 0000000000000000 x16: ffff80000db49158 x15: ffff000113bf1a80 x14: 0000000000000000 x13: 00000000ffffffff x12: ffff000113bf1a80 x11: ff808000081c0d5c x10: 0000000000000000 x9 : 73f125dc5c63ba00 x8 : 73f125dc5c63ba00 x7 : ffff800008161d1c x6 : 0000000000000000 x5 : 0000000000000080 x4 : 0000000000000001 x3 : 0000000000000000 x2 : ffff0001fefddcd0 x1 : 0000000100000000 x0 : 0000000000000089 Call trace: skb_panic+0x4c/0x50 net/core/skbuff.c:116 skb_over_panic net/core/skbuff.c:125 [inline] skb_put+0xd4/0xdc net/core/skbuff.c:2049 ip6_mc_hdr net/ipv6/mcast.c:1714 [inline] mld_newpack+0x14c/0x270 net/ipv6/mcast.c:1765 add_grhead net/ipv6/mcast.c:1851 [inline] add_grec+0xa20/0xae0 net/ipv6/mcast.c:1989 mld_send_cr+0x438/0x5a8 net/ipv6/mcast.c:2115 mld_ifc_work+0x38/0x290 net/ipv6/mcast.c:2653 process_one_work+0x2d8/0x504 kernel/workqueue.c:2289 worker_thread+0x340/0x610 kernel/workqueue.c:2436 kthread+0x12c/0x158 kernel/kthread.c:376 ret_from_fork+0x10/0x20 arch/arm64/kernel/entry.S:860 Code: 91011400 aa0803e1 a90027ea 94373093 (d4210000) Fixes: c12b395 ("gre: Support GRE over IPv6") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20221024020124.3756833-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
mi-code
pushed a commit
that referenced
this issue
Apr 8, 2024
[ Upstream commit bacd22d ] mlx5_cmd_cleanup_async_ctx should return only after all its callback handlers were completed. Before this patch, the below race between mlx5_cmd_cleanup_async_ctx and mlx5_cmd_exec_cb_handler was possible and lead to a use-after-free: 1. mlx5_cmd_cleanup_async_ctx is called while num_inflight is 2 (i.e. elevated by 1, a single inflight callback). 2. mlx5_cmd_cleanup_async_ctx decreases num_inflight to 1. 3. mlx5_cmd_exec_cb_handler is called, decreases num_inflight to 0 and is about to call wake_up(). 4. mlx5_cmd_cleanup_async_ctx calls wait_event, which returns immediately as the condition (num_inflight == 0) holds. 5. mlx5_cmd_cleanup_async_ctx returns. 6. The caller of mlx5_cmd_cleanup_async_ctx frees the mlx5_async_ctx object. 7. mlx5_cmd_exec_cb_handler goes on and calls wake_up() on the freed object. Fix it by syncing using a completion object. Mark it completed when num_inflight reaches 0. Trace: BUG: KASAN: use-after-free in do_raw_spin_lock+0x23d/0x270 Read of size 4 at addr ffff888139cd12f4 by task swapper/5/0 CPU: 5 PID: 0 Comm: swapper/5 Not tainted 6.0.0-rc3_for_upstream_debug_2022_08_30_13_10 #1 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 Call Trace: <IRQ> dump_stack_lvl+0x57/0x7d print_report.cold+0x2d5/0x684 ? do_raw_spin_lock+0x23d/0x270 kasan_report+0xb1/0x1a0 ? do_raw_spin_lock+0x23d/0x270 do_raw_spin_lock+0x23d/0x270 ? rwlock_bug.part.0+0x90/0x90 ? __delete_object+0xb8/0x100 ? lock_downgrade+0x6e0/0x6e0 _raw_spin_lock_irqsave+0x43/0x60 ? __wake_up_common_lock+0xb9/0x140 __wake_up_common_lock+0xb9/0x140 ? __wake_up_common+0x650/0x650 ? destroy_tis_callback+0x53/0x70 [mlx5_core] ? kasan_set_track+0x21/0x30 ? destroy_tis_callback+0x53/0x70 [mlx5_core] ? kfree+0x1ba/0x520 ? do_raw_spin_unlock+0x54/0x220 mlx5_cmd_exec_cb_handler+0x136/0x1a0 [mlx5_core] ? mlx5_cmd_cleanup_async_ctx+0x220/0x220 [mlx5_core] ? mlx5_cmd_cleanup_async_ctx+0x220/0x220 [mlx5_core] mlx5_cmd_comp_handler+0x65a/0x12b0 [mlx5_core] ? dump_command+0xcc0/0xcc0 [mlx5_core] ? lockdep_hardirqs_on_prepare+0x400/0x400 ? cmd_comp_notifier+0x7e/0xb0 [mlx5_core] cmd_comp_notifier+0x7e/0xb0 [mlx5_core] atomic_notifier_call_chain+0xd7/0x1d0 mlx5_eq_async_int+0x3ce/0xa20 [mlx5_core] atomic_notifier_call_chain+0xd7/0x1d0 ? irq_release+0x140/0x140 [mlx5_core] irq_int_handler+0x19/0x30 [mlx5_core] __handle_irq_event_percpu+0x1f2/0x620 handle_irq_event+0xb2/0x1d0 handle_edge_irq+0x21e/0xb00 __common_interrupt+0x79/0x1a0 common_interrupt+0x78/0xa0 </IRQ> <TASK> asm_common_interrupt+0x22/0x40 RIP: 0010:default_idle+0x42/0x60 Code: c1 83 e0 07 48 c1 e9 03 83 c0 03 0f b6 14 11 38 d0 7c 04 84 d2 75 14 8b 05 eb 47 22 02 85 c0 7e 07 0f 00 2d e0 9f 48 00 fb f4 <c3> 48 c7 c7 80 08 7f 85 e8 d1 d3 3e fe eb de 66 66 2e 0f 1f 84 00 RSP: 0018:ffff888100dbfdf0 EFLAGS: 00000242 RAX: 0000000000000001 RBX: ffffffff84ecbd48 RCX: 1ffffffff0afe110 RDX: 0000000000000004 RSI: 0000000000000000 RDI: ffffffff835cc9bc RBP: 0000000000000005 R08: 0000000000000001 R09: ffff88881dec4ac3 R10: ffffed1103bd8958 R11: 0000017d0ca571c9 R12: 0000000000000005 R13: ffffffff84f024e0 R14: 0000000000000000 R15: dffffc0000000000 ? default_idle_call+0xcc/0x450 default_idle_call+0xec/0x450 do_idle+0x394/0x450 ? arch_cpu_idle_exit+0x40/0x40 ? do_idle+0x17/0x450 cpu_startup_entry+0x19/0x20 start_secondary+0x221/0x2b0 ? set_cpu_sibling_map+0x2070/0x2070 secondary_startup_64_no_verify+0xcd/0xdb </TASK> Allocated by task 49502: kasan_save_stack+0x1e/0x40 __kasan_kmalloc+0x81/0xa0 kvmalloc_node+0x48/0xe0 mlx5e_bulk_async_init+0x35/0x110 [mlx5_core] mlx5e_tls_priv_tx_list_cleanup+0x84/0x3e0 [mlx5_core] mlx5e_ktls_cleanup_tx+0x38f/0x760 [mlx5_core] mlx5e_cleanup_nic_tx+0xa7/0x100 [mlx5_core] mlx5e_detach_netdev+0x1ca/0x2b0 [mlx5_core] mlx5e_suspend+0xdb/0x140 [mlx5_core] mlx5e_remove+0x89/0x190 [mlx5_core] auxiliary_bus_remove+0x52/0x70 device_release_driver_internal+0x40f/0x650 driver_detach+0xc1/0x180 bus_remove_driver+0x125/0x2f0 auxiliary_driver_unregister+0x16/0x50 mlx5e_cleanup+0x26/0x30 [mlx5_core] cleanup+0xc/0x4e [mlx5_core] __x64_sys_delete_module+0x2b5/0x450 do_syscall_64+0x3d/0x90 entry_SYSCALL_64_after_hwframe+0x46/0xb0 Freed by task 49502: kasan_save_stack+0x1e/0x40 kasan_set_track+0x21/0x30 kasan_set_free_info+0x20/0x30 ____kasan_slab_free+0x11d/0x1b0 kfree+0x1ba/0x520 mlx5e_tls_priv_tx_list_cleanup+0x2e7/0x3e0 [mlx5_core] mlx5e_ktls_cleanup_tx+0x38f/0x760 [mlx5_core] mlx5e_cleanup_nic_tx+0xa7/0x100 [mlx5_core] mlx5e_detach_netdev+0x1ca/0x2b0 [mlx5_core] mlx5e_suspend+0xdb/0x140 [mlx5_core] mlx5e_remove+0x89/0x190 [mlx5_core] auxiliary_bus_remove+0x52/0x70 device_release_driver_internal+0x40f/0x650 driver_detach+0xc1/0x180 bus_remove_driver+0x125/0x2f0 auxiliary_driver_unregister+0x16/0x50 mlx5e_cleanup+0x26/0x30 [mlx5_core] cleanup+0xc/0x4e [mlx5_core] __x64_sys_delete_module+0x2b5/0x450 do_syscall_64+0x3d/0x90 entry_SYSCALL_64_after_hwframe+0x46/0xb0 Fixes: e355477 ("net/mlx5: Make mlx5_cmd_exec_cb() a safe API") Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Moshe Shemesh <moshe@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Link: https://lore.kernel.org/r/20221026135153.154807-8-saeed@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
mi-code
pushed a commit
that referenced
this issue
Apr 8, 2024
[ Upstream commit 569bea7 ] In piix4_probe(), the piix4 adapter will be registered in: piix4_probe() piix4_add_adapters_sb800() / piix4_add_adapter() i2c_add_adapter() Based on the probed device type, piix4_add_adapters_sb800() or single piix4_add_adapter() will be called. For the former case, piix4_adapter_count is set as the number of adapters, while for antoher case it is not set and kept default *zero*. When piix4 is removed, piix4_remove() removes the adapters added in piix4_probe(), basing on the piix4_adapter_count value. Because the count is zero for the single adapter case, the adapter won't be removed and makes the sources allocated for adapter leaked, such as the i2c client and device. These sources can still be accessed by i2c or bus and cause problems. An easily reproduced case is that if a new adapter is registered, i2c will get the leaked adapter and try to call smbus_algorithm, which was already freed: Triggered by: rmmod i2c_piix4 && modprobe max31730 BUG: unable to handle page fault for address: ffffffffc053d860 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page Oops: 0000 [#1] PREEMPT SMP KASAN CPU: 0 PID: 3752 Comm: modprobe Tainted: G Hardware name: QEMU Standard PC (i440FX + PIIX, 1996) RIP: 0010:i2c_default_probe (drivers/i2c/i2c-core-base.c:2259) i2c_core RSP: 0018:ffff888107477710 EFLAGS: 00000246 ... <TASK> i2c_detect (drivers/i2c/i2c-core-base.c:2302) i2c_core __process_new_driver (drivers/i2c/i2c-core-base.c:1336) i2c_core bus_for_each_dev (drivers/base/bus.c:301) i2c_for_each_dev (drivers/i2c/i2c-core-base.c:1823) i2c_core i2c_register_driver (drivers/i2c/i2c-core-base.c:1861) i2c_core do_one_initcall (init/main.c:1296) do_init_module (kernel/module/main.c:2455) ... </TASK> ---[ end trace 0000000000000000 ]--- Fix this problem by correctly set piix4_adapter_count as 1 for the single adapter so it can be normally removed. Fixes: 528d53a ("i2c: piix4: Fix probing of reserved ports on AMD Family 16h Model 30h") Signed-off-by: Chen Zhongjin <chenzhongjin@huawei.com> Reviewed-by: Jean Delvare <jdelvare@suse.de> Signed-off-by: Wolfram Sang <wsa@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
mi-code
pushed a commit
that referenced
this issue
Apr 8, 2024
commit 968b715 upstream. We have been seeing the following panic in production kernel BUG at fs/btrfs/tree-mod-log.c:677! invalid opcode: 0000 [#1] SMP RIP: 0010:tree_mod_log_rewind+0x1b4/0x200 RSP: 0000:ffffc9002c02f890 EFLAGS: 00010293 RAX: 0000000000000003 RBX: ffff8882b448c700 RCX: 0000000000000000 RDX: 0000000000008000 RSI: 00000000000000a7 RDI: ffff88877d831c00 RBP: 0000000000000002 R08: 000000000000009f R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000100c40 R12: 0000000000000001 R13: ffff8886c26d6a00 R14: ffff88829f5424f8 R15: ffff88877d831a00 FS: 00007fee1d80c780(0000) GS:ffff8890400c0000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fee1963a020 CR3: 0000000434f33002 CR4: 00000000007706e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: btrfs_get_old_root+0x12b/0x420 btrfs_search_old_slot+0x64/0x2f0 ? tree_mod_log_oldest_root+0x3d/0xf0 resolve_indirect_ref+0xfd/0x660 ? ulist_alloc+0x31/0x60 ? kmem_cache_alloc_trace+0x114/0x2c0 find_parent_nodes+0x97a/0x17e0 ? ulist_alloc+0x30/0x60 btrfs_find_all_roots_safe+0x97/0x150 iterate_extent_inodes+0x154/0x370 ? btrfs_search_path_in_tree+0x240/0x240 iterate_inodes_from_logical+0x98/0xd0 ? btrfs_search_path_in_tree+0x240/0x240 btrfs_ioctl_logical_to_ino+0xd9/0x180 btrfs_ioctl+0xe2/0x2ec0 ? __mod_memcg_lruvec_state+0x3d/0x280 ? do_sys_openat2+0x6d/0x140 ? kretprobe_dispatcher+0x47/0x70 ? kretprobe_rethook_handler+0x38/0x50 ? rethook_trampoline_handler+0x82/0x140 ? arch_rethook_trampoline_callback+0x3b/0x50 ? kmem_cache_free+0xfb/0x270 ? do_sys_openat2+0xd5/0x140 __x64_sys_ioctl+0x71/0xb0 do_syscall_64+0x2d/0x40 Which is this code in tree_mod_log_rewind() switch (tm->op) { case BTRFS_MOD_LOG_KEY_REMOVE_WHILE_FREEING: BUG_ON(tm->slot < n); This occurs because we replay the nodes in order that they happened, and when we do a REPLACE we will log a REMOVE_WHILE_FREEING for every slot, starting at 0. 'n' here is the number of items in this block, which in this case was 1, but we had 2 REMOVE_WHILE_FREEING operations. The actual root cause of this was that we were replaying operations for a block that shouldn't have been replayed. Consider the following sequence of events 1. We have an already modified root, and we do a btrfs_get_tree_mod_seq(). 2. We begin removing items from this root, triggering KEY_REPLACE for it's child slots. 3. We remove one of the 2 children this root node points to, thus triggering the root node promotion of the remaining child, and freeing this node. 4. We modify a new root, and re-allocate the above node to the root node of this other root. The tree mod log looks something like this logical 0 op KEY_REPLACE (slot 1) seq 2 logical 0 op KEY_REMOVE (slot 1) seq 3 logical 0 op KEY_REMOVE_WHILE_FREEING (slot 0) seq 4 logical 4096 op LOG_ROOT_REPLACE (old logical 0) seq 5 logical 8192 op KEY_REMOVE_WHILE_FREEING (slot 1) seq 6 logical 8192 op KEY_REMOVE_WHILE_FREEING (slot 0) seq 7 logical 0 op LOG_ROOT_REPLACE (old logical 8192) seq 8 >From here the bug is triggered by the following steps 1. Call btrfs_get_old_root() on the new_root. 2. We call tree_mod_log_oldest_root(btrfs_root_node(new_root)), which is currently logical 0. 3. tree_mod_log_oldest_root() calls tree_mod_log_search_oldest(), which gives us the KEY_REPLACE seq 2, and since that's not a LOG_ROOT_REPLACE we incorrectly believe that we don't have an old root, because we expect that the most recent change should be a LOG_ROOT_REPLACE. 4. Back in tree_mod_log_oldest_root() we don't have a LOG_ROOT_REPLACE, so we don't set old_root, we simply use our existing extent buffer. 5. Since we're using our existing extent buffer (logical 0) we call tree_mod_log_search(0) in order to get the newest change to start the rewind from, which ends up being the LOG_ROOT_REPLACE at seq 8. 6. Again since we didn't find an old_root we simply clone logical 0 at it's current state. 7. We call tree_mod_log_rewind() with the cloned extent buffer. 8. Set n = btrfs_header_nritems(logical 0), which would be whatever the original nritems was when we COWed the original root, say for this example it's 2. 9. We start from the newest operation and work our way forward, so we see LOG_ROOT_REPLACE which we ignore. 10. Next we see KEY_REMOVE_WHILE_FREEING for slot 0, which triggers the BUG_ON(tm->slot < n), because it expects if we've done this we have a completely empty extent buffer to replay completely. The correct thing would be to find the first LOG_ROOT_REPLACE, and then get the old_root set to logical 8192. In fact making that change fixes this particular problem. However consider the much more complicated case. We have a child node in this tree and the above situation. In the above case we freed one of the child blocks at the seq 3 operation. If this block was also re-allocated and got new tree mod log operations we would have a different problem. btrfs_search_old_slot(orig root) would get down to the logical 0 root that still pointed at that node. However in btrfs_search_old_slot() we call tree_mod_log_rewind(buf) directly. This is not context aware enough to know which operations we should be replaying. If the block was re-allocated multiple times we may only want to replay a range of operations, and determining what that range is isn't possible to determine. We could maybe solve this by keeping track of which root the node belonged to at every tree mod log operation, and then passing this around to make sure we're only replaying operations that relate to the root we're trying to rewind. However there's a simpler way to solve this problem, simply disallow reallocations if we have currently running tree mod log users. We already do this for leaf's, so we're simply expanding this to nodes as well. This is a relatively uncommon occurrence, and the problem is complicated enough I'm worried that we will still have corner cases in the reallocation case. So fix this in the most straightforward way possible. Fixes: bd989ba ("Btrfs: add tree modification log functions") CC: stable@vger.kernel.org # 3.3+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
mi-code
pushed a commit
that referenced
this issue
Apr 8, 2024
…ker() commit 6788ba8 upstream. This patch fixes an intra-object buffer overflow in brcmfmac that occurs when the device provides a 'bsscfgidx' equal to or greater than the buffer size. The patch adds a check that leads to a safe failure if that is the case. This fixes CVE-2022-3628. UBSAN: array-index-out-of-bounds in drivers/net/wireless/broadcom/brcm80211/brcmfmac/fweh.c index 52 is out of range for type 'brcmf_if *[16]' CPU: 0 PID: 1898 Comm: kworker/0:2 Tainted: G O 5.14.0+ #132 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014 Workqueue: events brcmf_fweh_event_worker Call Trace: dump_stack_lvl+0x57/0x7d ubsan_epilogue+0x5/0x40 __ubsan_handle_out_of_bounds+0x69/0x80 ? memcpy+0x39/0x60 brcmf_fweh_event_worker+0xae1/0xc00 ? brcmf_fweh_call_event_handler.isra.0+0x100/0x100 ? rcu_read_lock_sched_held+0xa1/0xd0 ? rcu_read_lock_bh_held+0xb0/0xb0 ? lockdep_hardirqs_on_prepare+0x273/0x3e0 process_one_work+0x873/0x13e0 ? lock_release+0x640/0x640 ? pwq_dec_nr_in_flight+0x320/0x320 ? rwlock_bug.part.0+0x90/0x90 worker_thread+0x8b/0xd10 ? __kthread_parkme+0xd9/0x1d0 ? process_one_work+0x13e0/0x13e0 kthread+0x379/0x450 ? _raw_spin_unlock_irq+0x24/0x30 ? set_kthread_struct+0x100/0x100 ret_from_fork+0x1f/0x30 ================================================================================ general protection fault, probably for non-canonical address 0xe5601c0020023fff: 0000 [#1] SMP KASAN KASAN: maybe wild-memory-access in range [0x2b0100010011fff8-0x2b0100010011ffff] CPU: 0 PID: 1898 Comm: kworker/0:2 Tainted: G O 5.14.0+ #132 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014 Workqueue: events brcmf_fweh_event_worker RIP: 0010:brcmf_fweh_call_event_handler.isra.0+0x42/0x100 Code: 89 f5 53 48 89 fb 48 83 ec 08 e8 79 0b 38 fe 48 85 ed 74 7e e8 6f 0b 38 fe 48 89 ea 48 b8 00 00 00 00 00 fc ff df 48 c1 ea 03 <80> 3c 02 00 0f 85 8b 00 00 00 4c 8b 7d 00 44 89 e0 48 ba 00 00 00 RSP: 0018:ffffc9000259fbd8 EFLAGS: 00010207 RAX: dffffc0000000000 RBX: ffff888115d8cd50 RCX: 0000000000000000 RDX: 0560200020023fff RSI: ffffffff8304bc91 RDI: ffff888115d8cd50 RBP: 2b0100010011ffff R08: ffff888112340050 R09: ffffed1023549809 R10: ffff88811aa4c047 R11: ffffed1023549808 R12: 0000000000000045 R13: ffffc9000259fca0 R14: ffff888112340050 R15: ffff888112340000 FS: 0000000000000000(0000) GS:ffff88811aa00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000000004053ccc0 CR3: 0000000112740000 CR4: 0000000000750ef0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: brcmf_fweh_event_worker+0x117/0xc00 ? brcmf_fweh_call_event_handler.isra.0+0x100/0x100 ? rcu_read_lock_sched_held+0xa1/0xd0 ? rcu_read_lock_bh_held+0xb0/0xb0 ? lockdep_hardirqs_on_prepare+0x273/0x3e0 process_one_work+0x873/0x13e0 ? lock_release+0x640/0x640 ? pwq_dec_nr_in_flight+0x320/0x320 ? rwlock_bug.part.0+0x90/0x90 worker_thread+0x8b/0xd10 ? __kthread_parkme+0xd9/0x1d0 ? process_one_work+0x13e0/0x13e0 kthread+0x379/0x450 ? _raw_spin_unlock_irq+0x24/0x30 ? set_kthread_struct+0x100/0x100 ret_from_fork+0x1f/0x30 Modules linked in: 88XXau(O) 88x2bu(O) ---[ end trace 41d302138f3ff55a ]--- RIP: 0010:brcmf_fweh_call_event_handler.isra.0+0x42/0x100 Code: 89 f5 53 48 89 fb 48 83 ec 08 e8 79 0b 38 fe 48 85 ed 74 7e e8 6f 0b 38 fe 48 89 ea 48 b8 00 00 00 00 00 fc ff df 48 c1 ea 03 <80> 3c 02 00 0f 85 8b 00 00 00 4c 8b 7d 00 44 89 e0 48 ba 00 00 00 RSP: 0018:ffffc9000259fbd8 EFLAGS: 00010207 RAX: dffffc0000000000 RBX: ffff888115d8cd50 RCX: 0000000000000000 RDX: 0560200020023fff RSI: ffffffff8304bc91 RDI: ffff888115d8cd50 RBP: 2b0100010011ffff R08: ffff888112340050 R09: ffffed1023549809 R10: ffff88811aa4c047 R11: ffffed1023549808 R12: 0000000000000045 R13: ffffc9000259fca0 R14: ffff888112340050 R15: ffff888112340000 FS: 0000000000000000(0000) GS:ffff88811aa00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000000004053ccc0 CR3: 0000000112740000 CR4: 0000000000750ef0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 PKRU: 55555554 Kernel panic - not syncing: Fatal exception Reported-by: Dokyung Song <dokyungs@yonsei.ac.kr> Reported-by: Jisoo Jang <jisoo.jang@yonsei.ac.kr> Reported-by: Minsuk Kang <linuxlovemin@yonsei.ac.kr> Reviewed-by: Arend van Spriel <aspriel@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Dokyung Song <dokyung.song@gmail.com> Signed-off-by: Kalle Valo <kvalo@kernel.org> Link: https://lore.kernel.org/r/20221021061359.GA550858@laguna Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
mi-code
pushed a commit
that referenced
this issue
Apr 8, 2024
commit 47b0c2e upstream. make_mmu_pages_available() must be called with mmu_lock held for write. However, if the TDP MMU is used, it will be called with mmu_lock held for read. This function does nothing unless shadow pages are used, so there is no race unless nested TDP is used. Since nested TDP uses shadow pages, old shadow pages may be zapped by this function even when the TDP MMU is enabled. Since shadow pages are never allocated by kvm_tdp_mmu_map(), a race condition can be avoided by not calling make_mmu_pages_available() if the TDP MMU is currently in use. I encountered this when repeatedly starting and stopping nested VM. It can be artificially caused by allocating a large number of nested TDP SPTEs. For example, the following BUG and general protection fault are caused in the host kernel. pte_list_remove: 00000000cd54fc10 many->many ------------[ cut here ]------------ kernel BUG at arch/x86/kvm/mmu/mmu.c:963! invalid opcode: 0000 [#1] PREEMPT SMP NOPTI RIP: 0010:pte_list_remove.cold+0x16/0x48 [kvm] Call Trace: <TASK> drop_spte+0xe0/0x180 [kvm] mmu_page_zap_pte+0x4f/0x140 [kvm] __kvm_mmu_prepare_zap_page+0x62/0x3e0 [kvm] kvm_mmu_zap_oldest_mmu_pages+0x7d/0xf0 [kvm] direct_page_fault+0x3cb/0x9b0 [kvm] kvm_tdp_page_fault+0x2c/0xa0 [kvm] kvm_mmu_page_fault+0x207/0x930 [kvm] npf_interception+0x47/0xb0 [kvm_amd] svm_invoke_exit_handler+0x13c/0x1a0 [kvm_amd] svm_handle_exit+0xfc/0x2c0 [kvm_amd] kvm_arch_vcpu_ioctl_run+0xa79/0x1780 [kvm] kvm_vcpu_ioctl+0x29b/0x6f0 [kvm] __x64_sys_ioctl+0x95/0xd0 do_syscall_64+0x5c/0x90 general protection fault, probably for non-canonical address 0xdead000000000122: 0000 [#1] PREEMPT SMP NOPTI RIP: 0010:kvm_mmu_commit_zap_page.part.0+0x4b/0xe0 [kvm] Call Trace: <TASK> kvm_mmu_zap_oldest_mmu_pages+0xae/0xf0 [kvm] direct_page_fault+0x3cb/0x9b0 [kvm] kvm_tdp_page_fault+0x2c/0xa0 [kvm] kvm_mmu_page_fault+0x207/0x930 [kvm] npf_interception+0x47/0xb0 [kvm_amd] Bug: 261254636 CVE: CVE-2022-45869 Fixes: a2855af ("KVM: x86/mmu: Allow parallel page faults for the TDP MMU") Signed-off-by: Kazuki Takiguchi <takiguchi.kazuki171@gmail.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Lee Jones <joneslee@google.com> Change-Id: I8743eb0cd44ae6c87f463028caffeb269e3d3756
mi-code
pushed a commit
that referenced
this issue
Apr 8, 2024
[ Upstream commit 11c1095 ] Syz reported the following issue: kernel BUG at lib/list_debug.c:53! invalid opcode: 0000 [#1] PREEMPT SMP KASAN RIP: 0010:__list_del_entry_valid.cold+0x5c/0x72 Call Trace: <TASK> p9_fd_cancel+0xb1/0x270 p9_client_rpc+0x8ea/0xba0 p9_client_create+0x9c0/0xed0 v9fs_session_init+0x1e0/0x1620 v9fs_mount+0xba/0xb80 legacy_get_tree+0x103/0x200 vfs_get_tree+0x89/0x2d0 path_mount+0x4c0/0x1ac0 __x64_sys_mount+0x33b/0x430 do_syscall_64+0x35/0x80 entry_SYSCALL_64_after_hwframe+0x46/0xb0 </TASK> The process is as follows: Thread A: Thread B: p9_poll_workfn() p9_client_create() ... ... p9_conn_cancel() p9_fd_cancel() list_del() ... ... list_del() //list_del corruption There is no lock protection when deleting list in p9_conn_cancel(). After deleting list in Thread A, thread B will delete the same list again. It will cause issue of list_del corruption. Setting req->status to REQ_STATUS_ERROR under lock prevents other cleanup paths from trying to manipulate req_list. The other thread can safely check req->status because it still holds a reference to req at this point. Link: https://lkml.kernel.org/r/20221110122606.383352-1-shaozhengchao@huawei.com Fixes: 52f1c45 ("9p: trans_fd/p9_conn_cancel: drop client lock earlier") Reported-by: syzbot+9b69b8d10ab4a7d88056@syzkaller.appspotmail.com Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com> [Dominique: add description of the fix in commit message] Signed-off-by: Dominique Martinet <asmadeus@codewreck.org> Signed-off-by: Sasha Levin <sashal@kernel.org> Bug: 266361822 Change-Id: I3787f2f735eb10cba84110be160f14158bd49873 Signed-off-by: Tudor Ambarus <tudor.ambarus@linaro.org>
mi-code
pushed a commit
that referenced
this issue
Apr 8, 2024
This change fixes suspicious RCU usage warnings from vendor hook. ============================= WARNING: suspicious RCU usage 5.15.41-debug-gc1163f69ba3b-dirty #1 Not tainted ----------------------------- include/trace/events/lock.h:37 suspicious rcu_dereference_check() usage! other info that might help us debug this: rcu_scheduler_active = 2, debug_locks = 1 RCU used illegally from extended quiescent state! no locks held by swapper/0/0. stack backtrace: CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.15.41-debug-gc1163f69ba3b-dirty #1 Call trace: dump_backtrace+0x0/0x1d8 dump_stack+0x1c/0x4c .. .. _printk+0x58/0x84 lockdep_rcu_suspicious+0x44/0x15c trace_android_vh_printk_caller_id+0xc4/0x13c vprintk_store+0x54/0x59c vprintk_emit+0x8c/0x130 vprintk_default+0x48/0x74 vprintk+0xf8/0x13c _printk+0x58/0x84 lockdep_rcu_suspicious+0x44/0x15c trace_android_vh_cpuidle_psci_enter+0xc4/0x144 __psci_enter_domain_idle_state+0x64/0x118 psci_enter_domain_idle_state+0x1c/0x2c cpuidle_enter_state+0x14c/0x2fc cpuidle_enter+0x3c/0x58 Bug: 267847290 Fixes: 3567f51 ("ANDROID: cpuidle-psci: Add vendor hook for cpuidle psci enter and exit") Change-Id: I910a6a0595c3a79b75e581297eb56d512ce5885c Signed-off-by: Maulik Shah <quic_mkshah@quicinc.com>
mi-code
pushed a commit
that referenced
this issue
Apr 8, 2024
[ upstream commit 12ad3d2 ] There is an interesting race condition of poll_refs which could result in a NULL pointer dereference. The crash trace is like: KASAN: null-ptr-deref in range [0x0000000000000008-0x000000000000000f] CPU: 0 PID: 30781 Comm: syz-executor.2 Not tainted 6.0.0-g493ffd6605b2 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014 RIP: 0010:io_poll_remove_entry io_uring/poll.c:154 [inline] RIP: 0010:io_poll_remove_entries+0x171/0x5b4 io_uring/poll.c:190 Code: ... RSP: 0018:ffff88810dfefba0 EFLAGS: 00010202 RAX: 0000000000000001 RBX: 0000000000000000 RCX: 0000000000040000 RDX: ffffc900030c4000 RSI: 000000000003ffff RDI: 0000000000040000 RBP: 0000000000000008 R08: ffffffff9764d3dd R09: fffffbfff3836781 R10: fffffbfff3836781 R11: 0000000000000000 R12: 1ffff11003422d60 R13: ffff88801a116b04 R14: ffff88801a116ac0 R15: dffffc0000000000 FS: 00007f9c07497700(0000) GS:ffff88811a600000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007ffb5c00ea98 CR3: 0000000105680005 CR4: 0000000000770ef0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: <TASK> io_apoll_task_func+0x3f/0xa0 io_uring/poll.c:299 handle_tw_list io_uring/io_uring.c:1037 [inline] tctx_task_work+0x37e/0x4f0 io_uring/io_uring.c:1090 task_work_run+0x13a/0x1b0 kernel/task_work.c:177 get_signal+0x2402/0x25a0 kernel/signal.c:2635 arch_do_signal_or_restart+0x3b/0x660 arch/x86/kernel/signal.c:869 exit_to_user_mode_loop kernel/entry/common.c:166 [inline] exit_to_user_mode_prepare+0xc2/0x160 kernel/entry/common.c:201 __syscall_exit_to_user_mode_work kernel/entry/common.c:283 [inline] syscall_exit_to_user_mode+0x58/0x160 kernel/entry/common.c:294 entry_SYSCALL_64_after_hwframe+0x63/0xcd The root cause for this is a tiny overlooking in io_poll_check_events() when cocurrently run with poll cancel routine io_poll_cancel_req(). The interleaving to trigger use-after-free: CPU0 | CPU1 | io_apoll_task_func() | io_poll_cancel_req() io_poll_check_events() | // do while first loop | v = atomic_read(...) | // v = poll_refs = 1 | ... | io_poll_mark_cancelled() | atomic_or() | // poll_refs = IO_POLL_CANCEL_FLAG | 1 | atomic_sub_return(...) | // poll_refs = IO_POLL_CANCEL_FLAG | // loop continue | | | io_poll_execute() | io_poll_get_ownership() | // poll_refs = IO_POLL_CANCEL_FLAG | 1 | // gets the ownership v = atomic_read(...) | // poll_refs not change | | if (v & IO_POLL_CANCEL_FLAG) | return -ECANCELED; | // io_poll_check_events return | // will go into | // io_req_complete_failed() free req | | | io_apoll_task_func() | // also go into io_req_complete_failed() And the interleaving to trigger the kernel WARNING: CPU0 | CPU1 | io_apoll_task_func() | io_poll_cancel_req() io_poll_check_events() | // do while first loop | v = atomic_read(...) | // v = poll_refs = 1 | ... | io_poll_mark_cancelled() | atomic_or() | // poll_refs = IO_POLL_CANCEL_FLAG | 1 | atomic_sub_return(...) | // poll_refs = IO_POLL_CANCEL_FLAG | // loop continue | | v = atomic_read(...) | // v = IO_POLL_CANCEL_FLAG | | io_poll_execute() | io_poll_get_ownership() | // poll_refs = IO_POLL_CANCEL_FLAG | 1 | // gets the ownership | WARN_ON_ONCE(!(v & IO_POLL_REF_MASK))) | // v & IO_POLL_REF_MASK = 0 WARN | | | io_apoll_task_func() | // also go into io_req_complete_failed() By looking up the source code and communicating with Pavel, the implementation of this atomic poll refs should continue the loop of io_poll_check_events() just to avoid somewhere else to grab the ownership. Therefore, this patch simply adds another AND operation to make sure the loop will stop if it finds the poll_refs is exactly equal to IO_POLL_CANCEL_FLAG. Since io_poll_cancel_req() grabs ownership and will finally make its way to io_req_complete_failed(), the req will be reclaimed as expected. Fixes: aa43477 ("io_uring: poll rework") Change-Id: I88349fb8c0a657a7fbfc60e43e09eb9051487a1c Signed-off-by: Lin Ma <linma@zju.edu.cn> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> [axboe: tweak description and code style] Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> (cherry picked from commit df4b177b48516da64b988722a22d93d257dcda9a) Bug: 268174392 Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
mi-code
pushed a commit
that referenced
this issue
Apr 8, 2024
Currently, we have * 29 HAB devices defined in the array hab_devices * 28 pchan_factory_settings entries * 26 vdev-shmem devices defined in most of the platforms for HAB. A crash issue would happen if we add enough vdev-shmem devices in linux-lv/la.config, makes the total number of shmem >= 29, even they are not for HAB. hab:get_guest_ctrl_paddr:61 name = vm3-hab_aud1, factory paddr = 0x4d534732474d5651, irq 0, pages 257 Unable to handle kernel paging request at virtual address 00534732474d5651 Internal error: Oops: 96000004 [#1] PREEMPT SMP pc : get_guest_ctrl_paddr+0x70/0x1b8 lr : get_guest_ctrl_paddr+0x6c/0x1b8 Call trace: get_guest_ctrl_paddr+0x70/0x1b8 hab_shmem_attach+0x58 Root cause is OOB write happens when we probe and update the 29th shmem factory and irq info in the array pchan_factory_settings while the size of it is only 28. HAB driver is probing shmem devices more than expected. Since the compatible name of vdev-shmem cannot tell apart from different shmem users due to the limitation of vdev-shmem library, we decide to decrease the number of maximum successful probe attempts to 26 by adding the number of pchan_factory_setting entries into the sanity check during probe and meanwhile decreasing the number to 26 from 28. Excessive probe attempts would return -ENODEV if the number of successfully probed devices reaches to 26. The actual number of created physical channels is based on device tree settings. Considering no need for CLK1/2 and XVM1/2/3 physical channels, 26 is enough for all hab use cases. Meanwhile, an assumption is required if there is another vdev-shmem user other than HAB: there must be 26 vdev-shmem devices for hab in the linux-lv/la.config if we hope to successfully probe and only probe the exact number of vdev-shmem devices defined for HAB. Change-Id: I6fd7001faa52c7924cf5721c565f934282d39470 Signed-off-by: lixiang <quic_lixian@quicinc.com>
mi-code
pushed a commit
that referenced
this issue
Apr 8, 2024
[ 36.999268][ T212] BUG: KASAN: null-ptr-deref in mem_offline_driver_probe+0x1198/0x1738 [mem_offline] [ 37.008679][ T212] Read of size 8 at addr 0000000000000020 by task init/212 [ 37.017988][ T212] CPU: 5 PID: 212 Comm: init Tainted: GOE 5.15.74-android13-8-gf2a83840aecd-dirty #1 [ 37.028655][ T212] Hardware name: Qualcomm Technologies, Inc. KHAJE QRD nopmi overlay PD2280F_EX (DT) [ 37.037998][ T212] Call trace: [ 37.041168][ T212] dump_backtrace+0x0/0x284 [ 37.045573][ T212] show_stack+0x18/0x24 [ 37.049624][ T212] dump_stack_lvl+0x88/0xa4 [ 37.054029][ T212] kasan_report+0x174/0x200 [ 37.058436][ T212] __asan_load8+0xb4/0xb8 [ 37.062670][ T212] mem_offline_driver_probe+0x1198/0x1738 [mem_offline] [ 37.069545][ T212] platform_probe+0x100/0x160 [ 37.074133][ T212] really_probe+0x1d8/0x60c [ 37.078537][ T212] __driver_probe_device+0x120/0x194 [ 37.083720][ T212] driver_probe_device+0x78/0x24c [ 37.088651][ T212] __driver_attach+0x2a0/0x33c [ 37.093309][ T212] bus_for_each_dev+0xd0/0x134 [ 37.097982][ T212] driver_attach+0x38/0x48 [ 37.102302][ T212] bus_add_driver+0x1d8/0x378 [ 37.106887][ T212] driver_register+0x18c/0x244 [ 37.111548][ T212] __platform_driver_register+0x4c/0x60 [ 37.116997][ T212] init_module+0x24/0xff4 [mem_offline] [ 37.122482][ T212] do_one_initcall+0x104/0x4f4 [ 37.127150][ T212] do_init_module+0xf8/0x584 [ 37.131643][ T212] load_module+0x1de4/0x1efc [ 37.136130][ T212] __arm64_sys_finit_module+0x11c/0x138 [ 37.141577][ T212] invoke_syscall+0x54/0x178 [ 37.146067][ T212] el0_svc_common+0x100/0x148 [ 37.150654][ T212] do_el0_svc+0x38/0xa4 [ 37.154710][ T212] el0_svc+0x20/0x50 [ 37.158506][ T212] el0t_64_sync_handler+0x68/0xac [ 37.163441][ T212] el0t_64_sync+0x160/0x164. Change-Id: I8c0b63f51822d3cac2c99628ad222360c0ce2df5 Signed-off-by: Kassey Li <quic_yingangl@quicinc.com>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
new
The text was updated successfully, but these errors were encountered: