Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bcm_dma_wait_idle not exported? #26

Closed
simonjhall opened this issue May 18, 2012 · 3 comments
Closed

bcm_dma_wait_idle not exported? #26

simonjhall opened this issue May 18, 2012 · 3 comments

Comments

@simonjhall
Copy link

Hi there,

I'm using the BCM DMA functions from inside a kernel module.
I'm having a run-time link error as it appears that bcm_dma_wait_idle symbol is not exported. All the other interesting DMA functions are. Is this by design, or is this just an omission?

If it is by design, what is the best way of waiting for a DMA to finish? (IRQ?)

Kind regards,
Simon

@popcornmix
Copy link
Collaborator

That's just an omission as no-one is currently using that function.
Note the comment, this busy waits, so will hog the CPU.
This really need a down_interruptable that is triggered from the end of DMA imterrupt.

@simonjhall
Copy link
Author

Ok cool cheers.
I'd seen the comment and was wondering if it's worth sticking in a schedule() call into the busy loop? (although I head that pre-emptive kernel doesn't work)

popcornmix pushed a commit that referenced this issue Oct 13, 2012
Printing the "start_ip" for every secondary cpu is very noisy on a large
system - and doesn't add any value. Drop this message.

Console log before:
Booting Node   0, Processors  #1
smpboot cpu 1: start_ip = 96000
 #2
smpboot cpu 2: start_ip = 96000
 #3
smpboot cpu 3: start_ip = 96000
 #4
smpboot cpu 4: start_ip = 96000
       ...
 #31
smpboot cpu 31: start_ip = 96000
Brought up 32 CPUs

Console log after:
Booting Node   0, Processors  #1 #2 #3 #4 #5 #6 #7 Ok.
Booting Node   1, Processors  #8 #9 #10 #11 #12 #13 #14 #15 Ok.
Booting Node   0, Processors  #16 #17 #18 #19 #20 #21 #22 #23 Ok.
Booting Node   1, Processors  #24 #25 #26 #27 #28 #29 #30 #31
Brought up 32 CPUs

Acked-by: Borislav Petkov <bp@amd64.org>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: http://lkml.kernel.org/r/4f452eb42507460426@agluck-desktop.sc.intel.com
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
popcornmix pushed a commit that referenced this issue Oct 13, 2012
reg_timeout_work() calls restore_regulatory_settings() which
takes cfg80211_mutex.

reg_set_request_processed() already holds cfg80211_mutex
before calling cancel_delayed_work_sync(reg_timeout),
so it might deadlock.

Call the async cancel_delayed_work instead, in order
to avoid the potential deadlock.

This is the relevant lockdep warning:

cfg80211: Calling CRDA for country: XX

======================================================
[ INFO: possible circular locking dependency detected ]
3.4.0-rc5-wl+ #26 Not tainted
-------------------------------------------------------
kworker/0:2/1391 is trying to acquire lock:
 (cfg80211_mutex){+.+.+.}, at: [<bf28ae00>] restore_regulatory_settings+0x34/0x418 [cfg80211]

but task is already holding lock:
 ((reg_timeout).work){+.+...}, at: [<c0059e94>] process_one_work+0x1f0/0x480

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #2 ((reg_timeout).work){+.+...}:
       [<c008fd44>] validate_chain+0xb94/0x10f0
       [<c0090b68>] __lock_acquire+0x8c8/0x9b0
       [<c0090d40>] lock_acquire+0xf0/0x114
       [<c005b600>] wait_on_work+0x4c/0x154
       [<c005c000>] __cancel_work_timer+0xd4/0x11c
       [<c005c064>] cancel_delayed_work_sync+0x1c/0x20
       [<bf28b274>] reg_set_request_processed+0x50/0x78 [cfg80211]
       [<bf28bd84>] set_regdom+0x550/0x600 [cfg80211]
       [<bf294cd8>] nl80211_set_reg+0x218/0x258 [cfg80211]
       [<c03c7738>] genl_rcv_msg+0x1a8/0x1e8
       [<c03c6a00>] netlink_rcv_skb+0x5c/0xc0
       [<c03c7584>] genl_rcv+0x28/0x34
       [<c03c6720>] netlink_unicast+0x15c/0x228
       [<c03c6c7c>] netlink_sendmsg+0x218/0x298
       [<c03933c8>] sock_sendmsg+0xa4/0xc0
       [<c039406c>] __sys_sendmsg+0x1e4/0x268
       [<c0394228>] sys_sendmsg+0x4c/0x70
       [<c0013840>] ret_fast_syscall+0x0/0x3c

-> #1 (reg_mutex){+.+.+.}:
       [<c008fd44>] validate_chain+0xb94/0x10f0
       [<c0090b68>] __lock_acquire+0x8c8/0x9b0
       [<c0090d40>] lock_acquire+0xf0/0x114
       [<c04734dc>] mutex_lock_nested+0x48/0x320
       [<bf28b2cc>] reg_todo+0x30/0x538 [cfg80211]
       [<c0059f44>] process_one_work+0x2a0/0x480
       [<c005a4b4>] worker_thread+0x1bc/0x2bc
       [<c0061148>] kthread+0x98/0xa4
       [<c0014af4>] kernel_thread_exit+0x0/0x8

-> #0 (cfg80211_mutex){+.+.+.}:
       [<c008ed58>] print_circular_bug+0x68/0x2cc
       [<c008fb28>] validate_chain+0x978/0x10f0
       [<c0090b68>] __lock_acquire+0x8c8/0x9b0
       [<c0090d40>] lock_acquire+0xf0/0x114
       [<c04734dc>] mutex_lock_nested+0x48/0x320
       [<bf28ae00>] restore_regulatory_settings+0x34/0x418 [cfg80211]
       [<bf28b200>] reg_timeout_work+0x1c/0x20 [cfg80211]
       [<c0059f44>] process_one_work+0x2a0/0x480
       [<c005a4b4>] worker_thread+0x1bc/0x2bc
       [<c0061148>] kthread+0x98/0xa4
       [<c0014af4>] kernel_thread_exit+0x0/0x8

other info that might help us debug this:

Chain exists of:
  cfg80211_mutex --> reg_mutex --> (reg_timeout).work

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock((reg_timeout).work);
                               lock(reg_mutex);
                               lock((reg_timeout).work);
  lock(cfg80211_mutex);

 *** DEADLOCK ***

2 locks held by kworker/0:2/1391:
 #0:  (events){.+.+.+}, at: [<c0059e94>] process_one_work+0x1f0/0x480
 #1:  ((reg_timeout).work){+.+...}, at: [<c0059e94>] process_one_work+0x1f0/0x480

stack backtrace:
[<c001b928>] (unwind_backtrace+0x0/0x12c) from [<c0471d3c>] (dump_stack+0x20/0x24)
[<c0471d3c>] (dump_stack+0x20/0x24) from [<c008ef70>] (print_circular_bug+0x280/0x2cc)
[<c008ef70>] (print_circular_bug+0x280/0x2cc) from [<c008fb28>] (validate_chain+0x978/0x10f0)
[<c008fb28>] (validate_chain+0x978/0x10f0) from [<c0090b68>] (__lock_acquire+0x8c8/0x9b0)
[<c0090b68>] (__lock_acquire+0x8c8/0x9b0) from [<c0090d40>] (lock_acquire+0xf0/0x114)
[<c0090d40>] (lock_acquire+0xf0/0x114) from [<c04734dc>] (mutex_lock_nested+0x48/0x320)
[<c04734dc>] (mutex_lock_nested+0x48/0x320) from [<bf28ae00>] (restore_regulatory_settings+0x34/0x418 [cfg80211])
[<bf28ae00>] (restore_regulatory_settings+0x34/0x418 [cfg80211]) from [<bf28b200>] (reg_timeout_work+0x1c/0x20 [cfg80211])
[<bf28b200>] (reg_timeout_work+0x1c/0x20 [cfg80211]) from [<c0059f44>] (process_one_work+0x2a0/0x480)
[<c0059f44>] (process_one_work+0x2a0/0x480) from [<c005a4b4>] (worker_thread+0x1bc/0x2bc)
[<c005a4b4>] (worker_thread+0x1bc/0x2bc) from [<c0061148>] (kthread+0x98/0xa4)
[<c0061148>] (kthread+0x98/0xa4) from [<c0014af4>] (kernel_thread_exit+0x0/0x8)
cfg80211: Calling CRDA to update world regulatory domain
cfg80211: World regulatory domain updated:
cfg80211:   (start_freq - end_freq @ bandwidth), (max_antenna_gain, max_eirp)
cfg80211:   (2402000 KHz - 2472000 KHz @ 40000 KHz), (300 mBi, 2000 mBm)
cfg80211:   (2457000 KHz - 2482000 KHz @ 20000 KHz), (300 mBi, 2000 mBm)
cfg80211:   (2474000 KHz - 2494000 KHz @ 20000 KHz), (300 mBi, 2000 mBm)
cfg80211:   (5170000 KHz - 5250000 KHz @ 40000 KHz), (300 mBi, 2000 mBm)
cfg80211:   (5735000 KHz - 5835000 KHz @ 40000 KHz), (300 mBi, 2000 mBm)

Cc: stable@kernel.org
Signed-off-by: Eliad Peller <eliad@wizery.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
ghaskins pushed a commit to ghaskins/raspberrypi-rt that referenced this issue Feb 20, 2013
=======================================================
[ INFO: possible circular locking dependency detected ]
3.0.0-rc3+ raspberrypi#26
-------------------------------------------------------
ip/1104 is trying to acquire lock:
 (local_softirq_lock){+.+...}, at: [<ffffffff81056d12>] __local_lock+0x25/0x68

but task is already holding lock:
 (sk_lock-AF_INET){+.+...}, at: [<ffffffff81433308>] lock_sock+0x10/0x12

which lock already depends on the new lock.


the existing dependency chain (in reverse order) is:

-> raspberrypi#1 (sk_lock-AF_INET){+.+...}:
       [<ffffffff810836e5>] lock_acquire+0x103/0x12e
       [<ffffffff813e2781>] lock_sock_nested+0x82/0x92
       [<ffffffff81433308>] lock_sock+0x10/0x12
       [<ffffffff81433afa>] tcp_close+0x1b/0x355
       [<ffffffff81453c99>] inet_release+0xc3/0xcd
       [<ffffffff813dff3f>] sock_release+0x1f/0x74
       [<ffffffff813dffbb>] sock_close+0x27/0x2b
       [<ffffffff81129c63>] fput+0x11d/0x1e3
       [<ffffffff81126577>] filp_close+0x70/0x7b
       [<ffffffff8112667a>] sys_close+0xf8/0x13d
       [<ffffffff814ae882>] system_call_fastpath+0x16/0x1b

-> #0 (local_softirq_lock){+.+...}:
       [<ffffffff81082ecc>] __lock_acquire+0xacc/0xdc8
       [<ffffffff810836e5>] lock_acquire+0x103/0x12e
       [<ffffffff814a7e40>] _raw_spin_lock+0x3b/0x4a
       [<ffffffff81056d12>] __local_lock+0x25/0x68
       [<ffffffff81056d8b>] local_bh_disable+0x36/0x3b
       [<ffffffff814a7fc4>] _raw_write_lock_bh+0x16/0x4f
       [<ffffffff81433c38>] tcp_close+0x159/0x355
       [<ffffffff81453c99>] inet_release+0xc3/0xcd
       [<ffffffff813dff3f>] sock_release+0x1f/0x74
       [<ffffffff813dffbb>] sock_close+0x27/0x2b
       [<ffffffff81129c63>] fput+0x11d/0x1e3
       [<ffffffff81126577>] filp_close+0x70/0x7b
       [<ffffffff8112667a>] sys_close+0xf8/0x13d
       [<ffffffff814ae882>] system_call_fastpath+0x16/0x1b

other info that might help us debug this:

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(sk_lock-AF_INET);
                               lock(local_softirq_lock);
                               lock(sk_lock-AF_INET);
  lock(local_softirq_lock);

 *** DEADLOCK ***

1 lock held by ip/1104:
 #0:  (sk_lock-AF_INET){+.+...}, at: [<ffffffff81433308>] lock_sock+0x10/0x12

stack backtrace:
Pid: 1104, comm: ip Not tainted 3.0.0-rc3+ raspberrypi#26
Call Trace:
 [<ffffffff81081649>] print_circular_bug+0x1f8/0x209
 [<ffffffff81082ecc>] __lock_acquire+0xacc/0xdc8
 [<ffffffff81056d12>] ? __local_lock+0x25/0x68
 [<ffffffff810836e5>] lock_acquire+0x103/0x12e
 [<ffffffff81056d12>] ? __local_lock+0x25/0x68
 [<ffffffff81046c75>] ? get_parent_ip+0x11/0x41
 [<ffffffff814a7e40>] _raw_spin_lock+0x3b/0x4a
 [<ffffffff81056d12>] ? __local_lock+0x25/0x68
 [<ffffffff81046c8c>] ? get_parent_ip+0x28/0x41
 [<ffffffff81056d12>] __local_lock+0x25/0x68
 [<ffffffff81056d8b>] local_bh_disable+0x36/0x3b
 [<ffffffff81433308>] ? lock_sock+0x10/0x12
 [<ffffffff814a7fc4>] _raw_write_lock_bh+0x16/0x4f
 [<ffffffff81433c38>] tcp_close+0x159/0x355
 [<ffffffff81453c99>] inet_release+0xc3/0xcd
 [<ffffffff813dff3f>] sock_release+0x1f/0x74
 [<ffffffff813dffbb>] sock_close+0x27/0x2b
 [<ffffffff81129c63>] fput+0x11d/0x1e3
 [<ffffffff81126577>] filp_close+0x70/0x7b
 [<ffffffff8112667a>] sys_close+0xf8/0x13d
 [<ffffffff814ae882>] system_call_fastpath+0x16/0x1b


Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
@ghollingworth
Copy link

Assuming this is now fixed? It's been a while!

popcornmix pushed a commit that referenced this issue Nov 4, 2013
With DT-based boot, the GPMC OneNAND sync mode setup does not work
correctly. During the async mode setup, sync flags gets incorrectly
set in the onenand_async data and the system crashes during the async
setup. Also, the sync mode never gets set in gpmc_onenand_data->flags, so
even without the crash, the actual sync mode setup would never be called.

The patch fixes this by adjusting the gpmc_onenand_data->flags when the
data is read from the DT. Also while doing this we force the onenand_async
to be always async.

The patch enables to use the following DTS chunk (that should correspond
the arch/arm/mach-omap2/board-rm680.c board file setup) with Nokia N950,
which currently crashes with 3.12-rc1. The crash output can be also
found below.

&gpmc {
	ranges = <0 0 0x04000000 0x20000000>;

	onenand@0,0 {
		#address-cells = <1>;
		#size-cells = <1>;
		reg = <0 0 0x20000000>;

		gpmc,sync-read;
		gpmc,sync-write;
		gpmc,burst-length = <16>;
		gpmc,burst-read;
		gpmc,burst-wrap;
		gpmc,burst-write;
		gpmc,device-width = <2>;
		gpmc,mux-add-data = <2>;
		gpmc,cs-on-ns = <0>;
		gpmc,cs-rd-off-ns = <87>;
		gpmc,cs-wr-off-ns = <87>;
		gpmc,adv-on-ns = <0>;
		gpmc,adv-rd-off-ns = <10>;
		gpmc,adv-wr-off-ns = <10>;
		gpmc,oe-on-ns = <15>;
		gpmc,oe-off-ns = <87>;
		gpmc,we-on-ns = <0>;
		gpmc,we-off-ns = <87>;
		gpmc,rd-cycle-ns = <112>;
		gpmc,wr-cycle-ns = <112>;
		gpmc,access-ns = <81>;
		gpmc,page-burst-access-ns = <15>;
		gpmc,bus-turnaround-ns = <0>;
		gpmc,cycle2cycle-delay-ns = <0>;
		gpmc,wait-monitoring-ns = <0>;
		gpmc,clk-activation-ns = <5>;
		gpmc,wr-data-mux-bus-ns = <30>;
		gpmc,wr-access-ns = <81>;
		gpmc,sync-clk-ps = <15000>;
	};
};

[    1.467559] GPMC CS0: cs_on     :   0 ticks,   0 ns (was   0 ticks)   0 ns
[    1.474822] GPMC CS0: cs_rd_off :   1 ticks,   5 ns (was  24 ticks)   5 ns
[    1.482116] GPMC CS0: cs_wr_off :  14 ticks,  71 ns (was  24 ticks)  71 ns
[    1.489349] GPMC CS0: adv_on    :   0 ticks,   0 ns (was   0 ticks)   0 ns
[    1.496582] GPMC CS0: adv_rd_off:   3 ticks,  15 ns (was   3 ticks)  15 ns
[    1.503845] GPMC CS0: adv_wr_off:   3 ticks,  15 ns (was   3 ticks)  15 ns
[    1.511077] GPMC CS0: oe_on     :   3 ticks,  15 ns (was   4 ticks)  15 ns
[    1.518310] GPMC CS0: oe_off    :   1 ticks,   5 ns (was  24 ticks)   5 ns
[    1.525543] GPMC CS0: we_on     :   0 ticks,   0 ns (was   0 ticks)   0 ns
[    1.532806] GPMC CS0: we_off    :   8 ticks,  40 ns (was  24 ticks)  40 ns
[    1.540039] GPMC CS0: rd_cycle  :   4 ticks,  20 ns (was  29 ticks)  20 ns
[    1.547302] GPMC CS0: wr_cycle  :   4 ticks,  20 ns (was  29 ticks)  20 ns
[    1.554504] GPMC CS0: access    :   0 ticks,   0 ns (was  23 ticks)   0 ns
[    1.561767] GPMC CS0: page_burst_access:   0 ticks,   0 ns (was   3 ticks)   0 ns
[    1.569641] GPMC CS0: bus_turnaround:   0 ticks,   0 ns (was   0 ticks)   0 ns
[    1.577270] GPMC CS0: cycle2cycle_delay:   0 ticks,   0 ns (was   0 ticks)   0 ns
[    1.585144] GPMC CS0: wait_monitoring:   0 ticks,   0 ns (was   0 ticks)   0 ns
[    1.592834] GPMC CS0: clk_activation:   0 ticks,   0 ns (was   0 ticks)   0 ns
[    1.600463] GPMC CS0: wr_data_mux_bus:   5 ticks,  25 ns (was   8 ticks)  25 ns
[    1.608154] GPMC CS0: wr_access :   0 ticks,   0 ns (was  23 ticks)   0 ns
[    1.615386] GPMC CS0 CLK period is 5 ns (div 1)
[    1.625122] Unhandled fault: external abort on non-linefetch (0x1008) at 0xf009e442
[    1.633178] Internal error: : 1008 [#1] ARM
[    1.637573] Modules linked in:
[    1.640777] CPU: 0 PID: 1 Comm: swapper Not tainted 3.12.0-rc1-n9xx-los.git-5318619-00006-g4baa700-dirty #26
[    1.651123] task: ef04c000 ti: ef050000 task.ti: ef050000
[    1.656799] PC is at gpmc_onenand_setup+0x98/0x1e0
[    1.661865] LR is at gpmc_cs_set_timings+0x494/0x5a4
[    1.667083] pc : [<c002e040>]    lr : [<c001f384>]    psr: 60000113
[    1.667083] sp : ef051d10  ip : ef051ce0  fp : ef051d94
[    1.679138] r10: c0caaf60  r9 : ef050000  r8 : ef18b32c
[    1.684631] r7 : f0080000  r6 : c0caaf60  r5 : 00000000  r4 : f009e400
[    1.691497] r3 : f009e442  r2 : 80050000  r1 : 00000014  r0 : 00000000
[    1.698333] Flags: nZCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment kernel
[    1.706024] Control: 10c5387d  Table: af290019  DAC: 00000015
[    1.712066] Process swapper (pid: 1, stack limit = 0xef050240)
[    1.718200] Stack: (0xef051d10 to 0xef052000)
[    1.722778] 1d00:                                     00004000 00001402 00000000 00000005
[    1.731384] 1d20: 00000047 00000000 0000000f 0000000f 00000000 00000028 0000000f 00000005
[    1.739990] 1d40: 00000000 00000000 00000014 00000014 00000000 00000000 00000000 00000000
[    1.748596] 1d60: 00000000 00000019 00000000 00000000 ef18b000 ef099c50 c0c8cb30 00000000
[    1.757171] 1d80: c0488074 c048f868 ef051dcc ef051d98 c024447c c002dfb4 00000000 c048f868
[    1.765777] 1da0: 00000000 00000000 c010e4a4 c0dbbb7c c0c8cb40 00000000 c0ca2500 c0488074
[    1.774383] 1dc0: ef051ddc ef051dd0 c01fd508 c0244370 ef051dfc ef051de0 c01fc204 c01fd4f4
[    1.782989] 1de0: c0c8cb40 c0ca2500 c0c8cb74 00000000 ef051e1c ef051e00 c01fc3b0 c01fc104
[    1.791595] 1e00: ef0983bc 00000000 c0ca2500 c01fc31c ef051e44 ef051e20 c01fa794 c01fc328
[    1.800201] 1e20: ef03634c ef0983b0 ef27d534 c0ca2500 ef27d500 c0c9a2f8 ef051e54 ef051e48
[    1.808807] 1e40: c01fbcfc c01fa744 ef051e84 ef051e58 c01fb838 c01fbce4 c0411df8 c0caa040
[    1.817413] 1e60: ef051e84 c0ca2500 00000006 c0caa040 00000066 c0488074 ef051e9c ef051e88
[    1.825988] 1e80: c01fca30 c01fb768 c04975b8 00000006 ef051eac ef051ea0 c01fd728 c01fc9bc
[    1.834594] 1ea0: ef051ebc ef051eb0 c048808c c01fd6e4 ef051f4c ef051ec0 c0008888 c0488080
[    1.843200] 1ec0: 0000006f c046bae8 00000000 00000000 ef051efc ef051ee0 ef051f04 ef051ee8
[    1.851806] 1ee0: c046d400 c0181218 c046d410 c18da8d5 c036a8e4 00000066 ef051f4c ef051f08
[    1.860412] 1f00: c004b9a8 c046d41c c048f840 00000006 00000006 c046b488 00000000 c043ec08
[    1.869018] 1f20: ef051f4c c04975b8 00000006 c0caa040 00000066 c046d410 c048f85c c048f868
[    1.877593] 1f40: ef051f94 ef051f50 c046db8c c00087a0 00000006 00000006 c046d410 ffffffff
[    1.886199] 1f60: ffffffff ffffffff ffffffff 00000000 c0348fd0 00000000 00000000 00000000
[    1.894805] 1f80: 00000000 00000000 ef051fac ef051f98 c0348fe0 c046daa8 00000000 00000000
[    1.903411] 1fa0: 00000000 ef051fb0 c000e7f8 c0348fdc 00000000 00000000 00000000 00000000
[    1.912017] 1fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[    1.920623] 1fe0: 00000000 00000000 00000000 00000000 00000013 00000000 ffffffff ffffffff
[    1.929199] Backtrace:
[    1.931793] [<c002dfa8>] (gpmc_onenand_setup+0x0/0x1e0) from [<c024447c>] (omap2_onenand_probe+0x118/0x49c)
[    1.942047] [<c0244364>] (omap2_onenand_probe+0x0/0x49c) from [<c01fd508>] (platform_drv_probe+0x20/0x24)
[    1.952117]  r8:c0488074 r7:c0ca2500 r6:00000000 r5:c0c8cb40 r4:c0dbbb7c
[    1.959197] [<c01fd4e8>] (platform_drv_probe+0x0/0x24) from [<c01fc204>] (driver_probe_device+0x10c/0x224)
[    1.969360] [<c01fc0f8>] (driver_probe_device+0x0/0x224) from [<c01fc3b0>] (__driver_attach+0x94/0x98)
[    1.979125]  r7:00000000 r6:c0c8cb74 r5:c0ca2500 r4:c0c8cb40
[    1.985107] [<c01fc31c>] (__driver_attach+0x0/0x98) from [<c01fa794>] (bus_for_each_dev+0x5c/0x90)
[    1.994506]  r6:c01fc31c r5:c0ca2500 r4:00000000 r3:ef0983bc
[    2.000488] [<c01fa738>] (bus_for_each_dev+0x0/0x90) from [<c01fbcfc>] (driver_attach+0x24/0x28)
[    2.009735]  r6:c0c9a2f8 r5:ef27d500 r4:c0ca2500
[    2.014587] [<c01fbcd8>] (driver_attach+0x0/0x28) from [<c01fb838>] (bus_add_driver+0xdc/0x260)
[    2.023742] [<c01fb75c>] (bus_add_driver+0x0/0x260) from [<c01fca30>] (driver_register+0x80/0xfc)
[    2.033081]  r8:c0488074 r7:00000066 r6:c0caa040 r5:00000006 r4:c0ca2500
[    2.040161] [<c01fc9b0>] (driver_register+0x0/0xfc) from [<c01fd728>] (__platform_driver_register+0x50/0x64)
[    2.050476]  r5:00000006 r4:c04975b8
[    2.054260] [<c01fd6d8>] (__platform_driver_register+0x0/0x64) from [<c048808c>] (omap2_onenand_driver_init+0x18/0x20)
[    2.065490] [<c0488074>] (omap2_onenand_driver_init+0x0/0x20) from [<c0008888>] (do_one_initcall+0xf4/0x150)
[    2.075836] [<c0008794>] (do_one_initcall+0x0/0x150) from [<c046db8c>] (kernel_init_freeable+0xf0/0x1b4)
[    2.085815] [<c046da9c>] (kernel_init_freeable+0x0/0x1b4) from [<c0348fe0>] (kernel_init+0x10/0xec)
[    2.095336] [<c0348fd0>] (kernel_init+0x0/0xec) from [<c000e7f8>] (ret_from_fork+0x14/0x3c)
[    2.104125]  r4:00000000 r3:00000000
[    2.107879] Code: ebffc3ae e2505000 ba00002e e2843042 (e1d320b0)
[    2.114318] ---[ end trace b8ee3e3e5e002451 ]---

Signed-off-by: Aaro Koskinen <aaro.koskinen@iki.fi>
Signed-off-by: Tony Lindgren <tony@atomide.com>
popcornmix pushed a commit that referenced this issue Mar 31, 2014
…d problems with booting

Without that change booting leads to crash with more warnings like below:
[    0.284454] omap_hwmod: uart4: cannot clk_get main_clk uart4_fck
[    0.284484] omap_hwmod: uart4: cannot _init_clocks
[    0.284484] ------------[ cut here ]------------
[    0.284545] WARNING: CPU: 0 PID: 1 at arch/arm/mach-omap2/omap_hwmod.c:2543 _init+0x300/0x3e4()
[    0.284545] omap_hwmod: uart4: couldn't init clocks
[    0.284576] Modules linked in:
[    0.284606] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.13.0-next-20140124-00020-gd2aefec-dirty #26
[    0.284637] [<c00151c0>] (unwind_backtrace) from [<c0011e20>] (show_stack+0x10/0x14)
[    0.284667] [<c0011e20>] (show_stack) from [<c0568544>] (dump_stack+0x7c/0x94)
[    0.284729] [<c0568544>] (dump_stack) from [<c003ff94>] (warn_slowpath_common+0x6c/0x90)
[    0.284729] [<c003ff94>] (warn_slowpath_common) from [<c003ffe8>] (warn_slowpath_fmt+0x30/0x40)
[    0.284759] [<c003ffe8>] (warn_slowpath_fmt) from [<c07d1be8>] (_init+0x300/0x3e4)
[    0.284790] [<c07d1be8>] (_init) from [<c07d217c>] (__omap_hwmod_setup_all+0x40/0x8c)
[    0.284820] [<c07d217c>] (__omap_hwmod_setup_all) from [<c0008918>] (do_one_initcall+0xe8/0x14c)
[    0.284851] [<c0008918>] (do_one_initcall) from [<c07c5c18>] (kernel_init_freeable+0x104/0x1c8)
[    0.284881] [<c07c5c18>] (kernel_init_freeable) from [<c0563524>] (kernel_init+0x8/0x118)
[    0.284912] [<c0563524>] (kernel_init) from [<c000e368>] (ret_from_fork+0x14/0x2c)
[    0.285064] ---[ end trace 63de210ad43b627d ]---

Reference:
https://lkml.org/lkml/2013/10/8/553

Signed-off-by: Marek Belisko <marek@goldelico.com>
Signed-off-by: Tony Lindgren <tony@atomide.com>
popcornmix pushed a commit that referenced this issue Jun 8, 2014
While implementing atomic_write_len, 4d3773c ("kernfs: implement
kernfs_ops->atomic_write_len") moved data copy from userland inside
kernfs_get_active() and kernfs_open_file->mutex so that
kernfs_ops->atomic_write_len can be accessed before copying buffer
from userland; unfortunately, this could lead to locking order
inversion involving mmap_sem if copy_from_user() takes a page fault.

  ======================================================
  [ INFO: possible circular locking dependency detected ]
  3.14.0-rc4-next-20140228-sasha-00011-g4077c67-dirty #26 Tainted: G        W
  -------------------------------------------------------
  trinity-c236/10658 is trying to acquire lock:
   (&of->mutex#2){+.+.+.}, at: [<fs/kernfs/file.c:487>] kernfs_fop_mmap+0x54/0x120

  but task is already holding lock:
   (&mm->mmap_sem){++++++}, at: [<mm/util.c:397>] vm_mmap_pgoff+0x6e/0xe0

  which lock already depends on the new lock.

  the existing dependency chain (in reverse order) is:

 -> #1 (&mm->mmap_sem){++++++}:
	 [<kernel/locking/lockdep.c:1945 kernel/locking/lockdep.c:2131>] validate_chain+0x6c5/0x7b0
	 [<kernel/locking/lockdep.c:3182>] __lock_acquire+0x4cd/0x5a0
	 [<arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602>] lock_acquire+0x182/0x1d0
	 [<mm/memory.c:4188>] might_fault+0x7e/0xb0
	 [<arch/x86/include/asm/uaccess.h:713 fs/kernfs/file.c:291>] kernfs_fop_write+0xd8/0x190
	 [<fs/read_write.c:473>] vfs_write+0xe3/0x1d0
	 [<fs/read_write.c:523 fs/read_write.c:515>] SyS_write+0x5d/0xa0
	 [<arch/x86/kernel/entry_64.S:749>] tracesys+0xdd/0xe2

 -> #0 (&of->mutex#2){+.+.+.}:
	 [<kernel/locking/lockdep.c:1840>] check_prev_add+0x13f/0x560
	 [<kernel/locking/lockdep.c:1945 kernel/locking/lockdep.c:2131>] validate_chain+0x6c5/0x7b0
	 [<kernel/locking/lockdep.c:3182>] __lock_acquire+0x4cd/0x5a0
	 [<arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602>] lock_acquire+0x182/0x1d0
	 [<kernel/locking/mutex.c:470 kernel/locking/mutex.c:571>] mutex_lock_nested+0x6a/0x510
	 [<fs/kernfs/file.c:487>] kernfs_fop_mmap+0x54/0x120
	 [<mm/mmap.c:1573>] mmap_region+0x310/0x5c0
	 [<mm/mmap.c:1365>] do_mmap_pgoff+0x385/0x430
	 [<mm/util.c:399>] vm_mmap_pgoff+0x8f/0xe0
	 [<mm/mmap.c:1416 mm/mmap.c:1374>] SyS_mmap_pgoff+0x1b0/0x210
	 [<arch/x86/kernel/sys_x86_64.c:72>] SyS_mmap+0x1d/0x20
	 [<arch/x86/kernel/entry_64.S:749>] tracesys+0xdd/0xe2

  other info that might help us debug this:

   Possible unsafe locking scenario:

	 CPU0                    CPU1
	 ----                    ----
    lock(&mm->mmap_sem);
				 lock(&of->mutex#2);
				 lock(&mm->mmap_sem);
    lock(&of->mutex#2);

   *** DEADLOCK ***

  1 lock held by trinity-c236/10658:
   #0:  (&mm->mmap_sem){++++++}, at: [<mm/util.c:397>] vm_mmap_pgoff+0x6e/0xe0

  stack backtrace:
  CPU: 2 PID: 10658 Comm: trinity-c236 Tainted: G        W 3.14.0-rc4-next-20140228-sasha-00011-g4077c67-dirty #26
   0000000000000000 ffff88011911fa48 ffffffff8438e945 0000000000000000
   0000000000000000 ffff88011911fa98 ffffffff811a0109 ffff88011911fab8
   ffff88011911fab8 ffff88011911fa98 ffff880119128cc0 ffff880119128cf8
  Call Trace:
   [<lib/dump_stack.c:52>] dump_stack+0x52/0x7f
   [<kernel/locking/lockdep.c:1213>] print_circular_bug+0x129/0x160
   [<kernel/locking/lockdep.c:1840>] check_prev_add+0x13f/0x560
   [<include/linux/spinlock.h:343 mm/slub.c:1933>] ? deactivate_slab+0x511/0x550
   [<kernel/locking/lockdep.c:1945 kernel/locking/lockdep.c:2131>] validate_chain+0x6c5/0x7b0
   [<kernel/locking/lockdep.c:3182>] __lock_acquire+0x4cd/0x5a0
   [<mm/mmap.c:1552>] ? mmap_region+0x24a/0x5c0
   [<arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602>] lock_acquire+0x182/0x1d0
   [<fs/kernfs/file.c:487>] ? kernfs_fop_mmap+0x54/0x120
   [<kernel/locking/mutex.c:470 kernel/locking/mutex.c:571>] mutex_lock_nested+0x6a/0x510
   [<fs/kernfs/file.c:487>] ? kernfs_fop_mmap+0x54/0x120
   [<kernel/sched/core.c:2477>] ? get_parent_ip+0x11/0x50
   [<fs/kernfs/file.c:487>] ? kernfs_fop_mmap+0x54/0x120
   [<fs/kernfs/file.c:487>] kernfs_fop_mmap+0x54/0x120
   [<mm/mmap.c:1573>] mmap_region+0x310/0x5c0
   [<mm/mmap.c:1365>] do_mmap_pgoff+0x385/0x430
   [<mm/util.c:397>] ? vm_mmap_pgoff+0x6e/0xe0
   [<mm/util.c:399>] vm_mmap_pgoff+0x8f/0xe0
   [<kernel/rcu/update.c:97>] ? __rcu_read_unlock+0x44/0xb0
   [<fs/file.c:641>] ? dup_fd+0x3c0/0x3c0
   [<mm/mmap.c:1416 mm/mmap.c:1374>] SyS_mmap_pgoff+0x1b0/0x210
   [<arch/x86/kernel/sys_x86_64.c:72>] SyS_mmap+0x1d/0x20
   [<arch/x86/kernel/entry_64.S:749>] tracesys+0xdd/0xe2

Fix it by caching atomic_write_len in kernfs_open_file during open so
that it can be determined without accessing kernfs_ops in
kernfs_fop_write().  This restores the structure of kernfs_fop_write()
before 4d3773c with updated @len determination logic.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
References: http://lkml.kernel.org/g/53113485.2090407@oracle.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
popcornmix pushed a commit that referenced this issue Jan 10, 2015
commit 2ef84b3 upstream.

Hand down the cpu number instead, otherwise lockdep screams when doing

echo 1 > /sys/devices/system/cpu/microcode/reload.

BUG: using smp_processor_id() in preemptible [00000000] code: amd64-microcode/2470
caller is debug_smp_processor_id+0x12/0x20
CPU: 1 PID: 2470 Comm: amd64-microcode Not tainted 3.18.0-rc6+ #26
...

Signed-off-by: Borislav Petkov <bp@suse.de>
Link: http://lkml.kernel.org/r/1417428741-4501-1-git-send-email-bp@alien8.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
davet321 pushed a commit to davet321/rpi-linux that referenced this issue Aug 17, 2015
commit ecf5fc6 upstream.

Nikolay has reported a hang when a memcg reclaim got stuck with the
following backtrace:

PID: 18308  TASK: ffff883d7c9b0a30  CPU: 1   COMMAND: "rsync"
  #0 __schedule at ffffffff815ab152
  raspberrypi#1 schedule at ffffffff815ab76e
  raspberrypi#2 schedule_timeout at ffffffff815ae5e5
  raspberrypi#3 io_schedule_timeout at ffffffff815aad6a
  raspberrypi#4 bit_wait_io at ffffffff815abfc6
  raspberrypi#5 __wait_on_bit at ffffffff815abda5
  raspberrypi#6 wait_on_page_bit at ffffffff8111fd4f
  raspberrypi#7 shrink_page_list at ffffffff81135445
  raspberrypi#8 shrink_inactive_list at ffffffff81135845
  raspberrypi#9 shrink_lruvec at ffffffff81135ead
 raspberrypi#10 shrink_zone at ffffffff811360c3
 raspberrypi#11 shrink_zones at ffffffff81136eff
 raspberrypi#12 do_try_to_free_pages at ffffffff8113712f
 raspberrypi#13 try_to_free_mem_cgroup_pages at ffffffff811372be
 raspberrypi#14 try_charge at ffffffff81189423
 raspberrypi#15 mem_cgroup_try_charge at ffffffff8118c6f5
 raspberrypi#16 __add_to_page_cache_locked at ffffffff8112137d
 raspberrypi#17 add_to_page_cache_lru at ffffffff81121618
 raspberrypi#18 pagecache_get_page at ffffffff8112170b
 raspberrypi#19 grow_dev_page at ffffffff811c8297
 raspberrypi#20 __getblk_slow at ffffffff811c91d6
 raspberrypi#21 __getblk_gfp at ffffffff811c92c1
 raspberrypi#22 ext4_ext_grow_indepth at ffffffff8124565c
 raspberrypi#23 ext4_ext_create_new_leaf at ffffffff81246ca8
 raspberrypi#24 ext4_ext_insert_extent at ffffffff81246f09
 raspberrypi#25 ext4_ext_map_blocks at ffffffff8124a848
 raspberrypi#26 ext4_map_blocks at ffffffff8121a5b7
 raspberrypi#27 mpage_map_one_extent at ffffffff8121b1fa
 raspberrypi#28 mpage_map_and_submit_extent at ffffffff8121f07b
 raspberrypi#29 ext4_writepages at ffffffff8121f6d5
 raspberrypi#30 do_writepages at ffffffff8112c490
 raspberrypi#31 __filemap_fdatawrite_range at ffffffff81120199
 raspberrypi#32 filemap_flush at ffffffff8112041c
 raspberrypi#33 ext4_alloc_da_blocks at ffffffff81219da1
 raspberrypi#34 ext4_rename at ffffffff81229b91
 raspberrypi#35 ext4_rename2 at ffffffff81229e32
 raspberrypi#36 vfs_rename at ffffffff811a08a5
 raspberrypi#37 SYSC_renameat2 at ffffffff811a3ffc
 raspberrypi#38 sys_renameat2 at ffffffff811a408e
 raspberrypi#39 sys_rename at ffffffff8119e51e
 raspberrypi#40 system_call_fastpath at ffffffff815afa89

Dave Chinner has properly pointed out that this is a deadlock in the
reclaim code because ext4 doesn't submit pages which are marked by
PG_writeback right away.

The heuristic was introduced by commit e62e384 ("memcg: prevent OOM
with too many dirty pages") and it was applied only when may_enter_fs
was specified.  The code has been changed by c3b94f4 ("memcg:
further prevent OOM with too many dirty pages") which has removed the
__GFP_FS restriction with a reasoning that we do not get into the fs
code.  But this is not sufficient apparently because the fs doesn't
necessarily submit pages marked PG_writeback for IO right away.

ext4_bio_write_page calls io_submit_add_bh but that doesn't necessarily
submit the bio.  Instead it tries to map more pages into the bio and
mpage_map_one_extent might trigger memcg charge which might end up
waiting on a page which is marked PG_writeback but hasn't been submitted
yet so we would end up waiting for something that never finishes.

Fix this issue by replacing __GFP_IO by may_enter_fs check (for case 2)
before we go to wait on the writeback.  The page fault path, which is
the only path that triggers memcg oom killer since 3.12, shouldn't
require GFP_NOFS and so we shouldn't reintroduce the premature OOM
killer issue which was originally addressed by the heuristic.

As per David Chinner the xfs is doing similar thing since 2.6.15 already
so ext4 is not the only affected filesystem.  Moreover he notes:

: For example: IO completion might require unwritten extent conversion
: which executes filesystem transactions and GFP_NOFS allocations. The
: writeback flag on the pages can not be cleared until unwritten
: extent conversion completes. Hence memory reclaim cannot wait on
: page writeback to complete in GFP_NOFS context because it is not
: safe to do so, memcg reclaim or otherwise.

Cc: stable@vger.kernel.org # 3.9+
[tytso@mit.edu: corrected the control flow]
Fixes: c3b94f4 ("memcg: further prevent OOM with too many dirty pages")
Reported-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
popcornmix pushed a commit that referenced this issue Aug 20, 2015
Nikolay has reported a hang when a memcg reclaim got stuck with the
following backtrace:

PID: 18308  TASK: ffff883d7c9b0a30  CPU: 1   COMMAND: "rsync"
  #0 __schedule at ffffffff815ab152
  #1 schedule at ffffffff815ab76e
  #2 schedule_timeout at ffffffff815ae5e5
  #3 io_schedule_timeout at ffffffff815aad6a
  #4 bit_wait_io at ffffffff815abfc6
  #5 __wait_on_bit at ffffffff815abda5
  #6 wait_on_page_bit at ffffffff8111fd4f
  #7 shrink_page_list at ffffffff81135445
  #8 shrink_inactive_list at ffffffff81135845
  #9 shrink_lruvec at ffffffff81135ead
 #10 shrink_zone at ffffffff811360c3
 #11 shrink_zones at ffffffff81136eff
 #12 do_try_to_free_pages at ffffffff8113712f
 #13 try_to_free_mem_cgroup_pages at ffffffff811372be
 #14 try_charge at ffffffff81189423
 #15 mem_cgroup_try_charge at ffffffff8118c6f5
 #16 __add_to_page_cache_locked at ffffffff8112137d
 #17 add_to_page_cache_lru at ffffffff81121618
 #18 pagecache_get_page at ffffffff8112170b
 #19 grow_dev_page at ffffffff811c8297
 #20 __getblk_slow at ffffffff811c91d6
 #21 __getblk_gfp at ffffffff811c92c1
 #22 ext4_ext_grow_indepth at ffffffff8124565c
 #23 ext4_ext_create_new_leaf at ffffffff81246ca8
 #24 ext4_ext_insert_extent at ffffffff81246f09
 #25 ext4_ext_map_blocks at ffffffff8124a848
 #26 ext4_map_blocks at ffffffff8121a5b7
 #27 mpage_map_one_extent at ffffffff8121b1fa
 #28 mpage_map_and_submit_extent at ffffffff8121f07b
 #29 ext4_writepages at ffffffff8121f6d5
 #30 do_writepages at ffffffff8112c490
 #31 __filemap_fdatawrite_range at ffffffff81120199
 #32 filemap_flush at ffffffff8112041c
 #33 ext4_alloc_da_blocks at ffffffff81219da1
 #34 ext4_rename at ffffffff81229b91
 #35 ext4_rename2 at ffffffff81229e32
 #36 vfs_rename at ffffffff811a08a5
 #37 SYSC_renameat2 at ffffffff811a3ffc
 #38 sys_renameat2 at ffffffff811a408e
 #39 sys_rename at ffffffff8119e51e
 #40 system_call_fastpath at ffffffff815afa89

Dave Chinner has properly pointed out that this is a deadlock in the
reclaim code because ext4 doesn't submit pages which are marked by
PG_writeback right away.

The heuristic was introduced by commit e62e384 ("memcg: prevent OOM
with too many dirty pages") and it was applied only when may_enter_fs
was specified.  The code has been changed by c3b94f4 ("memcg:
further prevent OOM with too many dirty pages") which has removed the
__GFP_FS restriction with a reasoning that we do not get into the fs
code.  But this is not sufficient apparently because the fs doesn't
necessarily submit pages marked PG_writeback for IO right away.

ext4_bio_write_page calls io_submit_add_bh but that doesn't necessarily
submit the bio.  Instead it tries to map more pages into the bio and
mpage_map_one_extent might trigger memcg charge which might end up
waiting on a page which is marked PG_writeback but hasn't been submitted
yet so we would end up waiting for something that never finishes.

Fix this issue by replacing __GFP_IO by may_enter_fs check (for case 2)
before we go to wait on the writeback.  The page fault path, which is
the only path that triggers memcg oom killer since 3.12, shouldn't
require GFP_NOFS and so we shouldn't reintroduce the premature OOM
killer issue which was originally addressed by the heuristic.

As per David Chinner the xfs is doing similar thing since 2.6.15 already
so ext4 is not the only affected filesystem.  Moreover he notes:

: For example: IO completion might require unwritten extent conversion
: which executes filesystem transactions and GFP_NOFS allocations. The
: writeback flag on the pages can not be cleared until unwritten
: extent conversion completes. Hence memory reclaim cannot wait on
: page writeback to complete in GFP_NOFS context because it is not
: safe to do so, memcg reclaim or otherwise.

Cc: stable@vger.kernel.org # 3.9+
[tytso@mit.edu: corrected the control flow]
Fixes: c3b94f4 ("memcg: further prevent OOM with too many dirty pages")
Reported-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
giraldeau pushed a commit to giraldeau/linux that referenced this issue Apr 12, 2016
=======================================================
[ INFO: possible circular locking dependency detected ]
3.0.0-rc3+ raspberrypi#26
-------------------------------------------------------
ip/1104 is trying to acquire lock:
 (local_softirq_lock){+.+...}, at: [<ffffffff81056d12>] __local_lock+0x25/0x68

but task is already holding lock:
 (sk_lock-AF_INET){+.+...}, at: [<ffffffff81433308>] lock_sock+0x10/0x12

which lock already depends on the new lock.


the existing dependency chain (in reverse order) is:

-> raspberrypi#1 (sk_lock-AF_INET){+.+...}:
       [<ffffffff810836e5>] lock_acquire+0x103/0x12e
       [<ffffffff813e2781>] lock_sock_nested+0x82/0x92
       [<ffffffff81433308>] lock_sock+0x10/0x12
       [<ffffffff81433afa>] tcp_close+0x1b/0x355
       [<ffffffff81453c99>] inet_release+0xc3/0xcd
       [<ffffffff813dff3f>] sock_release+0x1f/0x74
       [<ffffffff813dffbb>] sock_close+0x27/0x2b
       [<ffffffff81129c63>] fput+0x11d/0x1e3
       [<ffffffff81126577>] filp_close+0x70/0x7b
       [<ffffffff8112667a>] sys_close+0xf8/0x13d
       [<ffffffff814ae882>] system_call_fastpath+0x16/0x1b

-> #0 (local_softirq_lock){+.+...}:
       [<ffffffff81082ecc>] __lock_acquire+0xacc/0xdc8
       [<ffffffff810836e5>] lock_acquire+0x103/0x12e
       [<ffffffff814a7e40>] _raw_spin_lock+0x3b/0x4a
       [<ffffffff81056d12>] __local_lock+0x25/0x68
       [<ffffffff81056d8b>] local_bh_disable+0x36/0x3b
       [<ffffffff814a7fc4>] _raw_write_lock_bh+0x16/0x4f
       [<ffffffff81433c38>] tcp_close+0x159/0x355
       [<ffffffff81453c99>] inet_release+0xc3/0xcd
       [<ffffffff813dff3f>] sock_release+0x1f/0x74
       [<ffffffff813dffbb>] sock_close+0x27/0x2b
       [<ffffffff81129c63>] fput+0x11d/0x1e3
       [<ffffffff81126577>] filp_close+0x70/0x7b
       [<ffffffff8112667a>] sys_close+0xf8/0x13d
       [<ffffffff814ae882>] system_call_fastpath+0x16/0x1b

other info that might help us debug this:

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(sk_lock-AF_INET);
                               lock(local_softirq_lock);
                               lock(sk_lock-AF_INET);
  lock(local_softirq_lock);

 *** DEADLOCK ***

1 lock held by ip/1104:
 #0:  (sk_lock-AF_INET){+.+...}, at: [<ffffffff81433308>] lock_sock+0x10/0x12

stack backtrace:
Pid: 1104, comm: ip Not tainted 3.0.0-rc3+ raspberrypi#26
Call Trace:
 [<ffffffff81081649>] print_circular_bug+0x1f8/0x209
 [<ffffffff81082ecc>] __lock_acquire+0xacc/0xdc8
 [<ffffffff81056d12>] ? __local_lock+0x25/0x68
 [<ffffffff810836e5>] lock_acquire+0x103/0x12e
 [<ffffffff81056d12>] ? __local_lock+0x25/0x68
 [<ffffffff81046c75>] ? get_parent_ip+0x11/0x41
 [<ffffffff814a7e40>] _raw_spin_lock+0x3b/0x4a
 [<ffffffff81056d12>] ? __local_lock+0x25/0x68
 [<ffffffff81046c8c>] ? get_parent_ip+0x28/0x41
 [<ffffffff81056d12>] __local_lock+0x25/0x68
 [<ffffffff81056d8b>] local_bh_disable+0x36/0x3b
 [<ffffffff81433308>] ? lock_sock+0x10/0x12
 [<ffffffff814a7fc4>] _raw_write_lock_bh+0x16/0x4f
 [<ffffffff81433c38>] tcp_close+0x159/0x355
 [<ffffffff81453c99>] inet_release+0xc3/0xcd
 [<ffffffff813dff3f>] sock_release+0x1f/0x74
 [<ffffffff813dffbb>] sock_close+0x27/0x2b
 [<ffffffff81129c63>] fput+0x11d/0x1e3
 [<ffffffff81126577>] filp_close+0x70/0x7b
 [<ffffffff8112667a>] sys_close+0xf8/0x13d
 [<ffffffff814ae882>] system_call_fastpath+0x16/0x1b


Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
mlyle pushed a commit to d-ronin/linux that referenced this issue Aug 21, 2016
=======================================================
[ INFO: possible circular locking dependency detected ]
3.0.0-rc3+ raspberrypi#26
-------------------------------------------------------
ip/1104 is trying to acquire lock:
 (local_softirq_lock){+.+...}, at: [<ffffffff81056d12>] __local_lock+0x25/0x68

but task is already holding lock:
 (sk_lock-AF_INET){+.+...}, at: [<ffffffff81433308>] lock_sock+0x10/0x12

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> raspberrypi#1 (sk_lock-AF_INET){+.+...}:
       [<ffffffff810836e5>] lock_acquire+0x103/0x12e
       [<ffffffff813e2781>] lock_sock_nested+0x82/0x92
       [<ffffffff81433308>] lock_sock+0x10/0x12
       [<ffffffff81433afa>] tcp_close+0x1b/0x355
       [<ffffffff81453c99>] inet_release+0xc3/0xcd
       [<ffffffff813dff3f>] sock_release+0x1f/0x74
       [<ffffffff813dffbb>] sock_close+0x27/0x2b
       [<ffffffff81129c63>] fput+0x11d/0x1e3
       [<ffffffff81126577>] filp_close+0x70/0x7b
       [<ffffffff8112667a>] sys_close+0xf8/0x13d
       [<ffffffff814ae882>] system_call_fastpath+0x16/0x1b

-> #0 (local_softirq_lock){+.+...}:
       [<ffffffff81082ecc>] __lock_acquire+0xacc/0xdc8
       [<ffffffff810836e5>] lock_acquire+0x103/0x12e
       [<ffffffff814a7e40>] _raw_spin_lock+0x3b/0x4a
       [<ffffffff81056d12>] __local_lock+0x25/0x68
       [<ffffffff81056d8b>] local_bh_disable+0x36/0x3b
       [<ffffffff814a7fc4>] _raw_write_lock_bh+0x16/0x4f
       [<ffffffff81433c38>] tcp_close+0x159/0x355
       [<ffffffff81453c99>] inet_release+0xc3/0xcd
       [<ffffffff813dff3f>] sock_release+0x1f/0x74
       [<ffffffff813dffbb>] sock_close+0x27/0x2b
       [<ffffffff81129c63>] fput+0x11d/0x1e3
       [<ffffffff81126577>] filp_close+0x70/0x7b
       [<ffffffff8112667a>] sys_close+0xf8/0x13d
       [<ffffffff814ae882>] system_call_fastpath+0x16/0x1b

other info that might help us debug this:

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(sk_lock-AF_INET);
                               lock(local_softirq_lock);
                               lock(sk_lock-AF_INET);
  lock(local_softirq_lock);

 *** DEADLOCK ***

1 lock held by ip/1104:
 #0:  (sk_lock-AF_INET){+.+...}, at: [<ffffffff81433308>] lock_sock+0x10/0x12

stack backtrace:
Pid: 1104, comm: ip Not tainted 3.0.0-rc3+ raspberrypi#26
Call Trace:
 [<ffffffff81081649>] print_circular_bug+0x1f8/0x209
 [<ffffffff81082ecc>] __lock_acquire+0xacc/0xdc8
 [<ffffffff81056d12>] ? __local_lock+0x25/0x68
 [<ffffffff810836e5>] lock_acquire+0x103/0x12e
 [<ffffffff81056d12>] ? __local_lock+0x25/0x68
 [<ffffffff81046c75>] ? get_parent_ip+0x11/0x41
 [<ffffffff814a7e40>] _raw_spin_lock+0x3b/0x4a
 [<ffffffff81056d12>] ? __local_lock+0x25/0x68
 [<ffffffff81046c8c>] ? get_parent_ip+0x28/0x41
 [<ffffffff81056d12>] __local_lock+0x25/0x68
 [<ffffffff81056d8b>] local_bh_disable+0x36/0x3b
 [<ffffffff81433308>] ? lock_sock+0x10/0x12
 [<ffffffff814a7fc4>] _raw_write_lock_bh+0x16/0x4f
 [<ffffffff81433c38>] tcp_close+0x159/0x355
 [<ffffffff81453c99>] inet_release+0xc3/0xcd
 [<ffffffff813dff3f>] sock_release+0x1f/0x74
 [<ffffffff813dffbb>] sock_close+0x27/0x2b
 [<ffffffff81129c63>] fput+0x11d/0x1e3
 [<ffffffff81126577>] filp_close+0x70/0x7b
 [<ffffffff8112667a>] sys_close+0xf8/0x13d
 [<ffffffff814ae882>] system_call_fastpath+0x16/0x1b

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
popcornmix pushed a commit that referenced this issue Oct 31, 2016
The pcpu_build_alloc_info() function group CPUs according to their
proximity, by call callback function @cpu_distance_fn from different
ARCHs.

For arm64 the callback of @cpu_distance_fn is
    pcpu_cpu_distance(from, to)
        -> node_distance(from, to)
The @from and @to for function node_distance() should be nid.

However, pcpu_cpu_distance() in arch/arm64/mm/numa.c just past the
cpu id for @from and @to, and didn't convert to numa node id.

For this incorrect cpu proximity get from ARCH, it may cause each CPU
in one group and make group_cnt out of bound:

	setup_per_cpu_areas()
		pcpu_embed_first_chunk()
			pcpu_build_alloc_info()
in pcpu_build_alloc_info, since cpu_distance_fn will return
REMOTE_DISTANCE if we pass cpu ids (0,1,2...), so
cpu_distance_fn(cpu, tcpu) > LOCAL_DISTANCE will wrongly be ture.

This may results in triggering the BUG_ON(unit != nr_units) later:

[    0.000000] kernel BUG at mm/percpu.c:1916!
[    0.000000] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
[    0.000000] Modules linked in:
[    0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted 4.9.0-rc1-00003-g14155ca-dirty #26
[    0.000000] Hardware name: Hisilicon Hi1616 Evaluation Board (DT)
[    0.000000] task: ffff000008d6e900 task.stack: ffff000008d60000
[    0.000000] PC is at pcpu_embed_first_chunk+0x420/0x704
[    0.000000] LR is at pcpu_embed_first_chunk+0x3bc/0x704
[    0.000000] pc : [<ffff000008c754f4>] lr : [<ffff000008c75490>] pstate: 800000c5
[    0.000000] sp : ffff000008d63eb0
[    0.000000] x29: ffff000008d63eb0 [    0.000000] x28: 0000000000000000
[    0.000000] x27: 0000000000000040 [    0.000000] x26: ffff8413fbfcef00
[    0.000000] x25: 0000000000000042 [    0.000000] x24: 0000000000000042
[    0.000000] x23: 0000000000001000 [    0.000000] x22: 0000000000000046
[    0.000000] x21: 0000000000000001 [    0.000000] x20: ffff000008cb3bc8
[    0.000000] x19: ffff8413fbfcf570 [    0.000000] x18: 0000000000000000
[    0.000000] x17: ffff000008e49ae0 [    0.000000] x16: 0000000000000003
[    0.000000] x15: 000000000000001e [    0.000000] x14: 0000000000000004
[    0.000000] x13: 0000000000000000 [    0.000000] x12: 000000000000006f
[    0.000000] x11: 00000413fbffff00 [    0.000000] x10: 0000000000000004
[    0.000000] x9 : 0000000000000000 [    0.000000] x8 : 0000000000000001
[    0.000000] x7 : ffff8413fbfcf63c [    0.000000] x6 : ffff000008d65d28
[    0.000000] x5 : ffff000008d65e50 [    0.000000] x4 : 0000000000000000
[    0.000000] x3 : ffff000008cb3cc8 [    0.000000] x2 : 0000000000000040
[    0.000000] x1 : 0000000000000040 [    0.000000] x0 : 0000000000000000
[...]
[    0.000000] Call trace:
[    0.000000] Exception stack(0xffff000008d63ce0 to 0xffff000008d63e10)
[    0.000000] 3ce0: ffff8413fbfcf570 0001000000000000 ffff000008d63eb0 ffff000008c754f4
[    0.000000] 3d00: ffff000008d63d50 ffff0000081af210 00000413fbfff010 0000000000001000
[    0.000000] 3d20: ffff000008d63d50 ffff0000081af220 00000413fbfff010 0000000000001000
[    0.000000] 3d40: 00000413fbfcef00 0000000000000004 ffff000008d63db0 ffff0000081af390
[    0.000000] 3d60: 00000413fbfcef00 0000000000001000 0000000000000000 0000000000001000
[    0.000000] 3d80: 0000000000000000 0000000000000040 0000000000000040 ffff000008cb3cc8
[    0.000000] 3da0: 0000000000000000 ffff000008d65e50 ffff000008d65d28 ffff8413fbfcf63c
[    0.000000] 3dc0: 0000000000000001 0000000000000000 0000000000000004 00000413fbffff00
[    0.000000] 3de0: 000000000000006f 0000000000000000 0000000000000004 000000000000001e
[    0.000000] 3e00: 0000000000000003 ffff000008e49ae0
[    0.000000] [<ffff000008c754f4>] pcpu_embed_first_chunk+0x420/0x704
[    0.000000] [<ffff000008c6658c>] setup_per_cpu_areas+0x38/0xc8
[    0.000000] [<ffff000008c608d8>] start_kernel+0x10c/0x390
[    0.000000] [<ffff000008c601d8>] __primary_switched+0x5c/0x64
[    0.000000] Code: b8018660 17ffffd7 6b16037f 54000080 (d4210000)
[    0.000000] ---[ end trace 0000000000000000 ]---
[    0.000000] Kernel panic - not syncing: Attempted to kill the idle task!

Fix by getting cpu's node id with early_cpu_to_node() then pass it
to node_distance() as the original intention.

Fixes: 7af3a0a ("arm64/numa: support HAVE_SETUP_PER_CPU_AREA")
Signed-off-by: Yisheng Xie <xieyisheng1@huawei.com>
Signed-off-by: Hanjun Guo <hanjun.guo@linaro.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Zhen Lei <thunder.leizhen@huawei.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
popcornmix pushed a commit that referenced this issue Jan 13, 2017
commit 1c7de2b upstream.

There is at least one Chelsio 10Gb card which uses VPD area to store some
non-standard blocks (example below).  However pci_vpd_size() returns the
length of the first block only assuming that there can be only one VPD "End
Tag".

Since 4e1a635 ("vfio/pci: Use kernel VPD access functions"), VFIO
blocks access beyond that offset, which prevents the guest "cxgb3" driver
from probing the device.  The host system does not have this problem as its
driver accesses the config space directly without pci_read_vpd().

Add a quirk to override the VPD size to a bigger value.  The maximum size
is taken from EEPROMSIZE in drivers/net/ethernet/chelsio/cxgb3/common.h.
We do not read the tag as the cxgb3 driver does as the driver supports
writing to EEPROM/VPD and when it writes, it only checks for 8192 bytes
boundary.  The quirk is registered for all devices supported by the cxgb3
driver.

This adds a quirk to the PCI layer (not to the cxgb3 driver) as the cxgb3
driver itself accesses VPD directly and the problem only exists with the
vfio-pci driver (when cxgb3 is not running on the host and may not be even
loaded) which blocks accesses beyond the first block of VPD data.  However
vfio-pci itself does not have quirks mechanism so we add it to PCI.

This is the controller:
Ethernet controller [0200]: Chelsio Communications Inc T310 10GbE Single Port Adapter [1425:0030]

This is what I parsed from its VPD:
===
b'\x82*\x0010 Gigabit Ethernet-SR PCI Express Adapter\x90J\x00EC\x07D76809 FN\x0746K'
 0000 Large item 42 bytes; name 0x2 Identifier String
	b'10 Gigabit Ethernet-SR PCI Express Adapter'
 002d Large item 74 bytes; name 0x10
	#00 [EC] len=7: b'D76809 '
	#0a [FN] len=7: b'46K7897'
	#14 [PN] len=7: b'46K7897'
	#1e [MN] len=4: b'1037'
	#25 [FC] len=4: b'5769'
	#2c [SN] len=12: b'YL102035603V'
	#3b [NA] len=12: b'00145E992ED1'
 007a Small item 1 bytes; name 0xf End Tag

 0c00 Large item 16 bytes; name 0x2 Identifier String
	b'S310E-SR-X      '
 0c13 Large item 234 bytes; name 0x10
	#00 [PN] len=16: b'TBD             '
	#13 [EC] len=16: b'110107730D2     '
	#26 [SN] len=16: b'97YL102035603V  '
	#39 [NA] len=12: b'00145E992ED1'
	#48 [V0] len=6: b'175000'
	#51 [V1] len=6: b'266666'
	#5a [V2] len=6: b'266666'
	#63 [V3] len=6: b'2000  '
	#6c [V4] len=2: b'1 '
	#71 [V5] len=6: b'c2    '
	#7a [V6] len=6: b'0     '
	#83 [V7] len=2: b'1 '
	#88 [V8] len=2: b'0 '
	#8d [V9] len=2: b'0 '
	#92 [VA] len=2: b'0 '
	#97 [RV] len=80: b's\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'...
 0d00 Large item 252 bytes; name 0x11
	#00 [VC] len=16: b'122310_1222 dp  '
	#13 [VD] len=16: b'610-0001-00 H1\x00\x00'
	#26 [VE] len=16: b'122310_1353 fp  '
	#39 [VF] len=16: b'610-0001-00 H1\x00\x00'
	#4c [RW] len=173: b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'...
 0dff Small item 0 bytes; name 0xf End Tag

10f3 Large item 13315 bytes; name 0x62
!!! unknown item name 98: b'\xd0\x03\x00@`\x0c\x08\x00\x00\x00\x00\x00\x00\x00\x00\x00'
===

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
popcornmix pushed a commit that referenced this issue Mar 21, 2017
As Eric Dumazet pointed out this also needs to be fixed in IPv6.
v2: Contains the IPv6 tcp/Ipv6 dccp patches as well.

We have seen a few incidents lately where a dst_enty has been freed
with a dangling TCP socket reference (sk->sk_dst_cache) pointing to that
dst_entry. If the conditions/timings are right a crash then ensues when the
freed dst_entry is referenced later on. A Common crashing back trace is:

 #8 [] page_fault at ffffffff8163e648
    [exception RIP: __tcp_ack_snd_check+74]
.
.
 #9 [] tcp_rcv_established at ffffffff81580b64
#10 [] tcp_v4_do_rcv at ffffffff8158b54a
#11 [] tcp_v4_rcv at ffffffff8158cd02
#12 [] ip_local_deliver_finish at ffffffff815668f4
#13 [] ip_local_deliver at ffffffff81566bd9
#14 [] ip_rcv_finish at ffffffff8156656d
#15 [] ip_rcv at ffffffff81566f06
#16 [] __netif_receive_skb_core at ffffffff8152b3a2
#17 [] __netif_receive_skb at ffffffff8152b608
#18 [] netif_receive_skb at ffffffff8152b690
#19 [] vmxnet3_rq_rx_complete at ffffffffa015eeaf [vmxnet3]
#20 [] vmxnet3_poll_rx_only at ffffffffa015f32a [vmxnet3]
#21 [] net_rx_action at ffffffff8152bac2
#22 [] __do_softirq at ffffffff81084b4f
#23 [] call_softirq at ffffffff8164845c
#24 [] do_softirq at ffffffff81016fc5
#25 [] irq_exit at ffffffff81084ee5
#26 [] do_IRQ at ffffffff81648ff8

Of course it may happen with other NIC drivers as well.

It's found the freed dst_entry here:

 224 static bool tcp_in_quickack_mode(struct sock *sk)↩
 225 {↩
 226 ▹       const struct inet_connection_sock *icsk = inet_csk(sk);↩
 227 ▹       const struct dst_entry *dst = __sk_dst_get(sk);↩
 228 ↩
 229 ▹       return (dst && dst_metric(dst, RTAX_QUICKACK)) ||↩
 230 ▹       ▹       (icsk->icsk_ack.quick && !icsk->icsk_ack.pingpong);↩
 231 }↩

But there are other backtraces attributed to the same freed dst_entry in
netfilter code as well.

All the vmcores showed 2 significant clues:

- Remote hosts behind the default gateway had always been redirected to a
different gateway. A rtable/dst_entry will be added for that host. Making
more dst_entrys with lower reference counts. Making this more probable.

- All vmcores showed a postitive LockDroppedIcmps value, e.g:

LockDroppedIcmps                  267

A closer look at the tcp_v4_err() handler revealed that do_redirect() will run
regardless of whether user space has the socket locked. This can result in a
race condition where the same dst_entry cached in sk->sk_dst_entry can be
decremented twice for the same socket via:

do_redirect()->__sk_dst_check()-> dst_release().

Which leads to the dst_entry being prematurely freed with another socket
pointing to it via sk->sk_dst_cache and a subsequent crash.

To fix this skip do_redirect() if usespace has the socket locked. Instead let
the redirect take place later when user space does not have the socket
locked.

The dccp/IPv6 code is very similar in this respect, so fixing it there too.

As Eric Garver pointed out the following commit now invalidates routes. Which
can set the dst->obsolete flag so that ipv4_dst_check() returns null and
triggers the dst_release().

Fixes: ceb3320 ("ipv4: Kill routes during PMTU/redirect updates.")
Cc: Eric Garver <egarver@redhat.com>
Cc: Hannes Sowa <hsowa@redhat.com>
Signed-off-by: Jon Maxwell <jmaxwell37@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ED6E0F17 pushed a commit to ED6E0F17/linux that referenced this issue Mar 22, 2017
[ Upstream commit 45caeaa ]

As Eric Dumazet pointed out this also needs to be fixed in IPv6.
v2: Contains the IPv6 tcp/Ipv6 dccp patches as well.

We have seen a few incidents lately where a dst_enty has been freed
with a dangling TCP socket reference (sk->sk_dst_cache) pointing to that
dst_entry. If the conditions/timings are right a crash then ensues when the
freed dst_entry is referenced later on. A Common crashing back trace is:

 raspberrypi#8 [] page_fault at ffffffff8163e648
    [exception RIP: __tcp_ack_snd_check+74]
.
.
 raspberrypi#9 [] tcp_rcv_established at ffffffff81580b64
raspberrypi#10 [] tcp_v4_do_rcv at ffffffff8158b54a
raspberrypi#11 [] tcp_v4_rcv at ffffffff8158cd02
raspberrypi#12 [] ip_local_deliver_finish at ffffffff815668f4
raspberrypi#13 [] ip_local_deliver at ffffffff81566bd9
raspberrypi#14 [] ip_rcv_finish at ffffffff8156656d
raspberrypi#15 [] ip_rcv at ffffffff81566f06
raspberrypi#16 [] __netif_receive_skb_core at ffffffff8152b3a2
raspberrypi#17 [] __netif_receive_skb at ffffffff8152b608
raspberrypi#18 [] netif_receive_skb at ffffffff8152b690
raspberrypi#19 [] vmxnet3_rq_rx_complete at ffffffffa015eeaf [vmxnet3]
raspberrypi#20 [] vmxnet3_poll_rx_only at ffffffffa015f32a [vmxnet3]
raspberrypi#21 [] net_rx_action at ffffffff8152bac2
raspberrypi#22 [] __do_softirq at ffffffff81084b4f
raspberrypi#23 [] call_softirq at ffffffff8164845c
raspberrypi#24 [] do_softirq at ffffffff81016fc5
raspberrypi#25 [] irq_exit at ffffffff81084ee5
raspberrypi#26 [] do_IRQ at ffffffff81648ff8

Of course it may happen with other NIC drivers as well.

It's found the freed dst_entry here:

 224 static bool tcp_in_quickack_mode(struct sock *sk)↩
 225 {↩
 226 ▹       const struct inet_connection_sock *icsk = inet_csk(sk);↩
 227 ▹       const struct dst_entry *dst = __sk_dst_get(sk);↩
 228 ↩
 229 ▹       return (dst && dst_metric(dst, RTAX_QUICKACK)) ||↩
 230 ▹       ▹       (icsk->icsk_ack.quick && !icsk->icsk_ack.pingpong);↩
 231 }↩

But there are other backtraces attributed to the same freed dst_entry in
netfilter code as well.

All the vmcores showed 2 significant clues:

- Remote hosts behind the default gateway had always been redirected to a
different gateway. A rtable/dst_entry will be added for that host. Making
more dst_entrys with lower reference counts. Making this more probable.

- All vmcores showed a postitive LockDroppedIcmps value, e.g:

LockDroppedIcmps                  267

A closer look at the tcp_v4_err() handler revealed that do_redirect() will run
regardless of whether user space has the socket locked. This can result in a
race condition where the same dst_entry cached in sk->sk_dst_entry can be
decremented twice for the same socket via:

do_redirect()->__sk_dst_check()-> dst_release().

Which leads to the dst_entry being prematurely freed with another socket
pointing to it via sk->sk_dst_cache and a subsequent crash.

To fix this skip do_redirect() if usespace has the socket locked. Instead let
the redirect take place later when user space does not have the socket
locked.

The dccp/IPv6 code is very similar in this respect, so fixing it there too.

As Eric Garver pointed out the following commit now invalidates routes. Which
can set the dst->obsolete flag so that ipv4_dst_check() returns null and
triggers the dst_release().

Fixes: ceb3320 ("ipv4: Kill routes during PMTU/redirect updates.")
Cc: Eric Garver <egarver@redhat.com>
Cc: Hannes Sowa <hsowa@redhat.com>
Signed-off-by: Jon Maxwell <jmaxwell37@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
popcornmix pushed a commit that referenced this issue Mar 23, 2017
[ Upstream commit 45caeaa ]

As Eric Dumazet pointed out this also needs to be fixed in IPv6.
v2: Contains the IPv6 tcp/Ipv6 dccp patches as well.

We have seen a few incidents lately where a dst_enty has been freed
with a dangling TCP socket reference (sk->sk_dst_cache) pointing to that
dst_entry. If the conditions/timings are right a crash then ensues when the
freed dst_entry is referenced later on. A Common crashing back trace is:

 #8 [] page_fault at ffffffff8163e648
    [exception RIP: __tcp_ack_snd_check+74]
.
.
 #9 [] tcp_rcv_established at ffffffff81580b64
#10 [] tcp_v4_do_rcv at ffffffff8158b54a
#11 [] tcp_v4_rcv at ffffffff8158cd02
#12 [] ip_local_deliver_finish at ffffffff815668f4
#13 [] ip_local_deliver at ffffffff81566bd9
#14 [] ip_rcv_finish at ffffffff8156656d
#15 [] ip_rcv at ffffffff81566f06
#16 [] __netif_receive_skb_core at ffffffff8152b3a2
#17 [] __netif_receive_skb at ffffffff8152b608
#18 [] netif_receive_skb at ffffffff8152b690
#19 [] vmxnet3_rq_rx_complete at ffffffffa015eeaf [vmxnet3]
#20 [] vmxnet3_poll_rx_only at ffffffffa015f32a [vmxnet3]
#21 [] net_rx_action at ffffffff8152bac2
#22 [] __do_softirq at ffffffff81084b4f
#23 [] call_softirq at ffffffff8164845c
#24 [] do_softirq at ffffffff81016fc5
#25 [] irq_exit at ffffffff81084ee5
#26 [] do_IRQ at ffffffff81648ff8

Of course it may happen with other NIC drivers as well.

It's found the freed dst_entry here:

 224 static bool tcp_in_quickack_mode(struct sock *sk)↩
 225 {↩
 226 ▹       const struct inet_connection_sock *icsk = inet_csk(sk);↩
 227 ▹       const struct dst_entry *dst = __sk_dst_get(sk);↩
 228 ↩
 229 ▹       return (dst && dst_metric(dst, RTAX_QUICKACK)) ||↩
 230 ▹       ▹       (icsk->icsk_ack.quick && !icsk->icsk_ack.pingpong);↩
 231 }↩

But there are other backtraces attributed to the same freed dst_entry in
netfilter code as well.

All the vmcores showed 2 significant clues:

- Remote hosts behind the default gateway had always been redirected to a
different gateway. A rtable/dst_entry will be added for that host. Making
more dst_entrys with lower reference counts. Making this more probable.

- All vmcores showed a postitive LockDroppedIcmps value, e.g:

LockDroppedIcmps                  267

A closer look at the tcp_v4_err() handler revealed that do_redirect() will run
regardless of whether user space has the socket locked. This can result in a
race condition where the same dst_entry cached in sk->sk_dst_entry can be
decremented twice for the same socket via:

do_redirect()->__sk_dst_check()-> dst_release().

Which leads to the dst_entry being prematurely freed with another socket
pointing to it via sk->sk_dst_cache and a subsequent crash.

To fix this skip do_redirect() if usespace has the socket locked. Instead let
the redirect take place later when user space does not have the socket
locked.

The dccp/IPv6 code is very similar in this respect, so fixing it there too.

As Eric Garver pointed out the following commit now invalidates routes. Which
can set the dst->obsolete flag so that ipv4_dst_check() returns null and
triggers the dst_release().

Fixes: ceb3320 ("ipv4: Kill routes during PMTU/redirect updates.")
Cc: Eric Garver <egarver@redhat.com>
Cc: Hannes Sowa <hsowa@redhat.com>
Signed-off-by: Jon Maxwell <jmaxwell37@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
popcornmix pushed a commit that referenced this issue Mar 23, 2017
[ Upstream commit 45caeaa ]

As Eric Dumazet pointed out this also needs to be fixed in IPv6.
v2: Contains the IPv6 tcp/Ipv6 dccp patches as well.

We have seen a few incidents lately where a dst_enty has been freed
with a dangling TCP socket reference (sk->sk_dst_cache) pointing to that
dst_entry. If the conditions/timings are right a crash then ensues when the
freed dst_entry is referenced later on. A Common crashing back trace is:

 #8 [] page_fault at ffffffff8163e648
    [exception RIP: __tcp_ack_snd_check+74]
.
.
 #9 [] tcp_rcv_established at ffffffff81580b64
#10 [] tcp_v4_do_rcv at ffffffff8158b54a
#11 [] tcp_v4_rcv at ffffffff8158cd02
#12 [] ip_local_deliver_finish at ffffffff815668f4
#13 [] ip_local_deliver at ffffffff81566bd9
#14 [] ip_rcv_finish at ffffffff8156656d
#15 [] ip_rcv at ffffffff81566f06
#16 [] __netif_receive_skb_core at ffffffff8152b3a2
#17 [] __netif_receive_skb at ffffffff8152b608
#18 [] netif_receive_skb at ffffffff8152b690
#19 [] vmxnet3_rq_rx_complete at ffffffffa015eeaf [vmxnet3]
#20 [] vmxnet3_poll_rx_only at ffffffffa015f32a [vmxnet3]
#21 [] net_rx_action at ffffffff8152bac2
#22 [] __do_softirq at ffffffff81084b4f
#23 [] call_softirq at ffffffff8164845c
#24 [] do_softirq at ffffffff81016fc5
#25 [] irq_exit at ffffffff81084ee5
#26 [] do_IRQ at ffffffff81648ff8

Of course it may happen with other NIC drivers as well.

It's found the freed dst_entry here:

 224 static bool tcp_in_quickack_mode(struct sock *sk)↩
 225 {↩
 226 ▹       const struct inet_connection_sock *icsk = inet_csk(sk);↩
 227 ▹       const struct dst_entry *dst = __sk_dst_get(sk);↩
 228 ↩
 229 ▹       return (dst && dst_metric(dst, RTAX_QUICKACK)) ||↩
 230 ▹       ▹       (icsk->icsk_ack.quick && !icsk->icsk_ack.pingpong);↩
 231 }↩

But there are other backtraces attributed to the same freed dst_entry in
netfilter code as well.

All the vmcores showed 2 significant clues:

- Remote hosts behind the default gateway had always been redirected to a
different gateway. A rtable/dst_entry will be added for that host. Making
more dst_entrys with lower reference counts. Making this more probable.

- All vmcores showed a postitive LockDroppedIcmps value, e.g:

LockDroppedIcmps                  267

A closer look at the tcp_v4_err() handler revealed that do_redirect() will run
regardless of whether user space has the socket locked. This can result in a
race condition where the same dst_entry cached in sk->sk_dst_entry can be
decremented twice for the same socket via:

do_redirect()->__sk_dst_check()-> dst_release().

Which leads to the dst_entry being prematurely freed with another socket
pointing to it via sk->sk_dst_cache and a subsequent crash.

To fix this skip do_redirect() if usespace has the socket locked. Instead let
the redirect take place later when user space does not have the socket
locked.

The dccp/IPv6 code is very similar in this respect, so fixing it there too.

As Eric Garver pointed out the following commit now invalidates routes. Which
can set the dst->obsolete flag so that ipv4_dst_check() returns null and
triggers the dst_release().

Fixes: ceb3320 ("ipv4: Kill routes during PMTU/redirect updates.")
Cc: Eric Garver <egarver@redhat.com>
Cc: Hannes Sowa <hsowa@redhat.com>
Signed-off-by: Jon Maxwell <jmaxwell37@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
dlech pushed a commit to ev3dev/rpi-kernel that referenced this issue Apr 14, 2017
[ Upstream commit 45caeaa ]

As Eric Dumazet pointed out this also needs to be fixed in IPv6.
v2: Contains the IPv6 tcp/Ipv6 dccp patches as well.

We have seen a few incidents lately where a dst_enty has been freed
with a dangling TCP socket reference (sk->sk_dst_cache) pointing to that
dst_entry. If the conditions/timings are right a crash then ensues when the
freed dst_entry is referenced later on. A Common crashing back trace is:

 raspberrypi#8 [] page_fault at ffffffff8163e648
    [exception RIP: __tcp_ack_snd_check+74]
.
.
 raspberrypi#9 [] tcp_rcv_established at ffffffff81580b64
raspberrypi#10 [] tcp_v4_do_rcv at ffffffff8158b54a
raspberrypi#11 [] tcp_v4_rcv at ffffffff8158cd02
raspberrypi#12 [] ip_local_deliver_finish at ffffffff815668f4
raspberrypi#13 [] ip_local_deliver at ffffffff81566bd9
raspberrypi#14 [] ip_rcv_finish at ffffffff8156656d
raspberrypi#15 [] ip_rcv at ffffffff81566f06
raspberrypi#16 [] __netif_receive_skb_core at ffffffff8152b3a2
raspberrypi#17 [] __netif_receive_skb at ffffffff8152b608
raspberrypi#18 [] netif_receive_skb at ffffffff8152b690
raspberrypi#19 [] vmxnet3_rq_rx_complete at ffffffffa015eeaf [vmxnet3]
raspberrypi#20 [] vmxnet3_poll_rx_only at ffffffffa015f32a [vmxnet3]
raspberrypi#21 [] net_rx_action at ffffffff8152bac2
raspberrypi#22 [] __do_softirq at ffffffff81084b4f
raspberrypi#23 [] call_softirq at ffffffff8164845c
raspberrypi#24 [] do_softirq at ffffffff81016fc5
raspberrypi#25 [] irq_exit at ffffffff81084ee5
raspberrypi#26 [] do_IRQ at ffffffff81648ff8

Of course it may happen with other NIC drivers as well.

It's found the freed dst_entry here:

 224 static bool tcp_in_quickack_mode(struct sock *sk)↩
 225 {↩
 226 ▹       const struct inet_connection_sock *icsk = inet_csk(sk);↩
 227 ▹       const struct dst_entry *dst = __sk_dst_get(sk);↩
 228 ↩
 229 ▹       return (dst && dst_metric(dst, RTAX_QUICKACK)) ||↩
 230 ▹       ▹       (icsk->icsk_ack.quick && !icsk->icsk_ack.pingpong);↩
 231 }↩

But there are other backtraces attributed to the same freed dst_entry in
netfilter code as well.

All the vmcores showed 2 significant clues:

- Remote hosts behind the default gateway had always been redirected to a
different gateway. A rtable/dst_entry will be added for that host. Making
more dst_entrys with lower reference counts. Making this more probable.

- All vmcores showed a postitive LockDroppedIcmps value, e.g:

LockDroppedIcmps                  267

A closer look at the tcp_v4_err() handler revealed that do_redirect() will run
regardless of whether user space has the socket locked. This can result in a
race condition where the same dst_entry cached in sk->sk_dst_entry can be
decremented twice for the same socket via:

do_redirect()->__sk_dst_check()-> dst_release().

Which leads to the dst_entry being prematurely freed with another socket
pointing to it via sk->sk_dst_cache and a subsequent crash.

To fix this skip do_redirect() if usespace has the socket locked. Instead let
the redirect take place later when user space does not have the socket
locked.

The dccp/IPv6 code is very similar in this respect, so fixing it there too.

As Eric Garver pointed out the following commit now invalidates routes. Which
can set the dst->obsolete flag so that ipv4_dst_check() returns null and
triggers the dst_release().

Fixes: ceb3320 ("ipv4: Kill routes during PMTU/redirect updates.")
Cc: Eric Garver <egarver@redhat.com>
Cc: Hannes Sowa <hsowa@redhat.com>
Signed-off-by: Jon Maxwell <jmaxwell37@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
ED6E0F17 pushed a commit to ED6E0F17/linux that referenced this issue May 25, 2017
…ive fence

The current code is subject to a race where we may try to acquire a
reference on a stale fence:

[13703.335118] WARNING: CPU: 1 PID: 14975 at ./include/linux/kref.h:46 i915_gem_object_wait+0x1a3/0x1c0
[13703.335184] Modules linked in:
[13703.335202] CPU: 1 PID: 14975 Comm: gem_concurrent_ Not tainted 4.9.0-rc4+ raspberrypi#26
[13703.335216] Hardware name:                  /        , BIOS PYBSWCEL.86A.0027.2015.0507.1758 05/07/2015
[13703.335233]  ffffc90002f5bcc8 ffffffff812807de 0000000000000000 0000000000000000
[13703.335257]  ffffc90002f5bd08 ffffffff81073811 0000002e80000000 ffff88026bf7c780
[13703.335279]  7fffffffffffffff 0000000000000001 ffff88027045a550 ffff88026bf7c780
[13703.335301] Call Trace:
[13703.335316]  [<ffffffff812807de>] dump_stack+0x4d/0x6f
[13703.335331]  [<ffffffff81073811>] __warn+0xc1/0xe0
[13703.335343]  [<ffffffff810738e8>] warn_slowpath_null+0x18/0x20
[13703.335355]  [<ffffffff813ac443>] i915_gem_object_wait+0x1a3/0x1c0
[13703.335367]  [<ffffffff813ae8ec>] i915_gem_set_domain_ioctl+0xcc/0x330
[13703.335386]  [<ffffffff813534ab>] drm_ioctl+0x1cb/0x410
[13703.335400]  [<ffffffff813ae820>] ? i915_gem_obj_prepare_shmem_write+0x1d0/0x1d0
[13703.335416]  [<ffffffff8135359b>] ? drm_ioctl+0x2bb/0x410
[13703.335429]  [<ffffffff8117d32f>] do_vfs_ioctl+0x8f/0x5c0
[13703.335442]  [<ffffffff8117d89c>] SyS_ioctl+0x3c/0x70
[13703.335456]  [<ffffffff815a07a4>] entry_SYSCALL_64_fastpath+0x17/0x98
[13703.335558] ---[ end trace fd24176416ba6981 ]---
[13703.382778] general protection fault: 0000 [raspberrypi#1] SMP
[13703.382802] Modules linked in:
[13703.382816] CPU: 1 PID: 14967 Comm: gem_concurrent_ Tainted: G        W       4.9.0-rc4+ raspberrypi#26
[13703.382828] Hardware name:                  /        , BIOS PYBSWCEL.86A.0027.2015.0507.1758 05/07/2015
[13703.382841] task: ffff880275458000 task.stack: ffffc90002f18000
[13703.382849] RIP: 0010:[<ffffffff813b3534>]  [<ffffffff813b3534>] i915_gem_request_retire+0x2b4/0x320
[13703.382870] RSP: 0018:ffffc90002f1bbc8  EFLAGS: 00010293
[13703.382878] RAX: dead000000000200 RBX: ffff88026bf7dce8 RCX: dead000000000100
[13703.382887] RDX: dead000000000100 RSI: ffff88026bf7c930 RDI: ffff88026bf7dd00
[13703.382897] RBP: ffffc90002f1bbf8 R08: 00000000ffffffff R09: ffff88026b89a000
[13703.382905] R10: 0000000000000001 R11: ffff88026bbe8fe0 R12: ffff88026bf7c000
[13703.382913] R13: ffff880275af8000 R14: ffff88026bf7c180 R15: dead000000000200
[13703.382922] FS:  00007f89e787d740(0000) GS:ffff88027fd00000(0000) knlGS:0000000000000000
[13703.382934] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[13703.382942] CR2: 00007f9053d2e000 CR3: 000000026d414000 CR4: 00000000001006e0
[13703.382951] Stack:
[13703.382958]  ffff880275413000 ffffc90002f1bde8 ffff880275af8000 ffff880274e8a600
[13703.382976]  ffff880276a06000 ffffc90002f1bde8 ffffc90002f1bc38 ffffffff813b48c5
[13703.382995]  ffffc90002f1bc00 ffffc90002f1bde8 ffff88026972a440 0000000000000000
[13703.383021] Call Trace:
[13703.383032]  [<ffffffff813b48c5>] i915_gem_request_alloc+0xa5/0x350
[13703.383043]  [<ffffffff813a17c3>] i915_gem_do_execbuffer.isra.41+0x7b3/0x18b0
[13703.383055]  [<ffffffff813b144c>] ? i915_gem_object_get_sg+0x25c/0x2b0
[13703.383065]  [<ffffffff813b1d4d>] ? i915_gem_object_get_page+0x1d/0x50
[13703.383076]  [<ffffffff813b28cc>] ? i915_gem_pwrite_ioctl+0x66c/0x6d0
[13703.383086]  [<ffffffff813a2c25>] i915_gem_execbuffer2+0x95/0x1e0
[13703.383096]  [<ffffffff813534ab>] drm_ioctl+0x1cb/0x410
[13703.383105]  [<ffffffff813a2b90>] ? i915_gem_execbuffer+0x2d0/0x2d0
[13703.383117]  [<ffffffff810c3df0>] ? hrtimer_start_range_ns+0x1a0/0x310
[13703.383128]  [<ffffffff8117d32f>] do_vfs_ioctl+0x8f/0x5c0
[13703.383140]  [<ffffffff810c60e8>] ? SyS_timer_settime+0x118/0x1a0
[13703.383150]  [<ffffffff8117d89c>] SyS_ioctl+0x3c/0x70
[13703.383162]  [<ffffffff815a07a4>] entry_SYSCALL_64_fastpath+0x17/0x98
[13703.383172] Code: 49 39 c6 48 8d 70 e8 48 8d 5f e8 75 16 eb 47 48 8d 43 18 48 8b 53 18 48 89 de 49 39 c6 48 8d 5a e8 74 33 48 8b 56 08 48 8b 46 10 <48> 89 42 08 48 89 10 f6 46 38 01 48 89 4e 08 4c 89 7e 10 74 cf
[13703.383557] RIP  [<ffffffff813b3534>] i915_gem_request_retire+0x2b4/0x320
[13703.383570]  RSP <ffffc90002f1bbc8>
[13703.383586] ---[ end trace fd24176416ba6982 ]---

This is fixed by using the kref_get_unless_zero() as a full memory
barrier to validate the fence is still the current exclusive fence before
returning it back to the caller. (Note the fix only requires using
fence_get_rcu() and correct handling, but we may as well use the
helper rather than inline equivalent code.)

Note: Issue can only be hit with the i915 driver.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Sumit Semwal <sumit.semwal@linaro.org
Fixes: d07f0e5 ("drm/i915: Move GEM activity tracking into a common struct reservation_object")
Reviewed-by: Christian König <christian.koenig@amd.com>.
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Link: http://patchwork.freedesktop.org/patch/msgid/20161114115540.31155-1-chris@chris-wilson.co.uk
(cherry picked from commit 6bfec6d)
Signed-off-by: Marissa Wall <marissaw@google.com>

Change-Id: I97d60ddd81b27281c326d7afa07e456f9ac7c125
popcornmix pushed a commit that referenced this issue Jun 14, 2017
'perf annotate' is dropping the cr* fields from branch instructions.

Fix it by adding support to display branch instructions having
multiple operands.

Power Arch objdump of int_sqrt:

 20.36 | c0000000004d2694:   subf   r10,r10,r3
       | c0000000004d2698: v bgt    cr6,c0000000004d26a0 <int_sqrt+0x40>
  1.82 | c0000000004d269c:   mr     r3,r10
 29.18 | c0000000004d26a0:   mr     r10,r8
       | c0000000004d26a4: v bgt    cr7,c0000000004d26ac <int_sqrt+0x4c>
       | c0000000004d26a8:   mr     r10,r7

Power Arch Before Patch:

 20.36 |       subf   r10,r10,r3
       |     v bgt    40
  1.82 |       mr     r3,r10
 29.18 | 40:   mr     r10,r8
       |     v bgt    4c
       |       mr     r10,r7

Power Arch After patch:

 20.36 |       subf   r10,r10,r3
       |     v bgt    cr6,40
  1.82 |       mr     r3,r10
 29.18 | 40:   mr     r10,r8
       |     v bgt    cr7,4c
       |       mr     r10,r7

Also support AArch64 conditional branch instructions, which can
have up to three operands:

Aarch64 Non-simplified (raw objdump) view:

       │ffff0000083cd11c: ↑ cbz    w0, ffff0000083cd100 <security_fil▒
...
  4.44 │ffff000│083cd134: ↓ tbnz   w0, #26, ffff0000083cd190 <securit▒
...
  1.37 │ffff000│083cd144: ↓ tbnz   w22, #5, ffff0000083cd1a4 <securit▒
       │ffff000│083cd148:   mov    w19, #0x20000                   //▒
  1.02 │ffff000│083cd14c: ↓ tbz    w22, #2, ffff0000083cd1ac <securit▒
...
  0.68 │ffff000└──3cd16c: ↑ cbnz   w0, ffff0000083cd120 <security_fil▒

Aarch64 Simplified, before this patch:

       │    ↑ cbz    40
...
  4.44 │   │↓ tbnz   w0, #26, ffff0000083cd190 <security_file_permiss▒
...
  1.37 │   │↓ tbnz   w22, #5, ffff0000083cd1a4 <security_file_permiss▒
       │   │  mov    w19, #0x20000                   // #131072
  1.02 │   │↓ tbz    w22, #2, ffff0000083cd1ac <security_file_permiss▒
...
  0.68 │   └──cbnz   60

the cbz operand is missing, and the tbz doesn't get simplified processing
at all because the parsing function failed to match an address.

Aarch64 Simplified, After this patch applied:

       │    ↑ cbz    w0, 40
...
  4.44 │   │↓ tbnz   w0, #26, d0
...
  1.37 │   │↓ tbnz   w22, #5, e4
       │   │  mov    w19, #0x20000                   // #131072
  1.02 │   │↓ tbz    w22, #2, ec
...
  0.68 │   └──cbnz   w0, 60

Originally-by: Ravi Bangoria <ravi.bangoria@linux.vnet.ibm.com>
Tested-by: Ravi Bangoria <ravi.bangoria@linux.vnet.ibm.com>
Reported-by: Anton Blanchard <anton@samba.org>
Reported-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Kim Phillips <kim.phillips@arm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Taeung Song <treeze.taeung@gmail.com>
Link: http://lkml.kernel.org/r/20170601092959.f60d98912e8a1b66fd1e4c0e@arm.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
popcornmix pushed a commit that referenced this issue Nov 12, 2017
Fengguang reported a KASAN warning:

  Kprobe smoke test: started
  ==================================================================
  BUG: KASAN: stack-out-of-bounds in deref_stack_reg+0xb5/0x11a
  Read of size 8 at addr ffff8800001c7cd8 by task swapper/1

  CPU: 0 PID: 1 Comm: swapper Not tainted 4.14.0-rc8 #26
  Call Trace:
   <#DB>
   ...
   save_trace+0xd9/0x1d3
   mark_lock+0x5f7/0xdc3
   __lock_acquire+0x6b4/0x38ef
   lock_acquire+0x1a1/0x2aa
   _raw_spin_lock_irqsave+0x46/0x55
   kretprobe_table_lock+0x1a/0x42
   pre_handler_kretprobe+0x3f5/0x521
   kprobe_int3_handler+0x19c/0x25f
   do_int3+0x61/0x142
   int3+0x30/0x60
  [...]

The ORC unwinder got confused by some kprobes changes, which isn't
surprising since the runtime code no longer matches vmlinux and the
stack was modified for kretprobes.

Until we have a way for generated code to register changes with the
unwinder, these types of warnings are inevitable.  So just disable KASAN
checks for stack accesses in the ORC unwinder.

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20171108021934.zbl6unh5hpugybc5@treble
Signed-off-by: Ingo Molnar <mingo@kernel.org>
popcornmix pushed a commit that referenced this issue Dec 12, 2017
Reported by syzkaller:

   *** Guest State ***
   CR0: actual=0x0000000080010031, shadow=0x0000000060000010, gh_mask=fffffffffffffff7
   CR4: actual=0x0000000000002061, shadow=0x0000000000000000, gh_mask=ffffffffffffe8f1
   CR3 = 0x000000002081e000
   RSP = 0x000000000000fffa  RIP = 0x0000000000000000
   RFLAGS=0x00023000         DR7 = 0x00000000000000
          ^^^^^^^^^^
   ------------[ cut here ]------------
   WARNING: CPU: 6 PID: 24431 at /home/kernel/linux/arch/x86/kvm//x86.c:7302 kvm_arch_vcpu_ioctl_run+0x651/0x2ea0 [kvm]
   CPU: 6 PID: 24431 Comm: reprotest Tainted: G        W  OE   4.14.0+ #26
   RIP: 0010:kvm_arch_vcpu_ioctl_run+0x651/0x2ea0 [kvm]
   RSP: 0018:ffff880291d179e0 EFLAGS: 00010202
   Call Trace:
    kvm_vcpu_ioctl+0x479/0x880 [kvm]
    do_vfs_ioctl+0x142/0x9a0
    SyS_ioctl+0x74/0x80
    entry_SYSCALL_64_fastpath+0x23/0x9a

The failed vmentry is triggered by the following beautified testcase:

    #include <unistd.h>
    #include <sys/syscall.h>
    #include <string.h>
    #include <stdint.h>
    #include <linux/kvm.h>
    #include <fcntl.h>
    #include <sys/ioctl.h>

    long r[5];
    int main()
    {
        struct kvm_debugregs dr = { 0 };

        r[2] = open("/dev/kvm", O_RDONLY);
        r[3] = ioctl(r[2], KVM_CREATE_VM, 0);
        r[4] = ioctl(r[3], KVM_CREATE_VCPU, 7);
        struct kvm_guest_debug debug = {
                .control = 0xf0403,
                .arch = {
                        .debugreg[6] = 0x2,
                        .debugreg[7] = 0x2
                }
        };
        ioctl(r[4], KVM_SET_GUEST_DEBUG, &debug);
        ioctl(r[4], KVM_RUN, 0);
    }

which testcase tries to setup the processor specific debug
registers and configure vCPU for handling guest debug events through
KVM_SET_GUEST_DEBUG.  The KVM_SET_GUEST_DEBUG ioctl will get and set
rflags in order to set TF bit if single step is needed. All regs' caches
are reset to avail and GUEST_RFLAGS vmcs field is reset to 0x2 during vCPU
reset. However, the cache of rflags is not reset during vCPU reset. The
function vmx_get_rflags() returns an unreset rflags cache value since
the cache is marked avail, it is 0 after boot. Vmentry fails if the
rflags reserved bit 1 is 0.

This patch fixes it by resetting both the GUEST_RFLAGS vmcs field and
its cache to 0x2 during vCPU reset.

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Tested-by: Dmitry Vyukov <dvyukov@google.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
popcornmix pushed a commit that referenced this issue Dec 12, 2017
Reported by syzkaller:

   ------------[ cut here ]------------
   WARNING: CPU: 5 PID: 2939 at arch/x86/kvm/vmx.c:3844 free_loaded_vmcs+0x77/0x80 [kvm_intel]
   CPU: 5 PID: 2939 Comm: repro Not tainted 4.14.0+ #26
   RIP: 0010:free_loaded_vmcs+0x77/0x80 [kvm_intel]
   Call Trace:
    vmx_free_vcpu+0xda/0x130 [kvm_intel]
    kvm_arch_destroy_vm+0x192/0x290 [kvm]
    kvm_put_kvm+0x262/0x560 [kvm]
    kvm_vm_release+0x2c/0x30 [kvm]
    __fput+0x190/0x370
    task_work_run+0xa1/0xd0
    do_exit+0x4d2/0x13e0
    do_group_exit+0x89/0x140
    get_signal+0x318/0xb80
    do_signal+0x8c/0xb40
    exit_to_usermode_loop+0xe4/0x140
    syscall_return_slowpath+0x206/0x230
    entry_SYSCALL_64_fastpath+0x98/0x9a

The syzkaller testcase will execute VMXON/VMLAUCH instructions, so the
vmx->nested stuff is populated, it will also issue KVM_SMI ioctl. However,
the testcase is just a simple c program and not be lauched by something
like seabios which implements smi_handler. Commit 05cade7 (KVM: nSVM:
fix SMI injection in guest mode) gets out of guest mode and set nested.vmxon
to false for the duration of SMM according to SDM 34.14.1 "leave VMX
operation" upon entering SMM. We can't alloc/free the vmx->nested stuff
each time when entering/exiting SMM since it will induce more overhead. So
the function vmx_pre_enter_smm() marks nested.vmxon false even if vmx->nested
stuff is still populated. What it expected is em_rsm() can mark nested.vmxon
to be true again. However, the smi_handler/rsm will not execute since there
is no something like seabios in this scenario. The function free_nested()
fails to free the vmx->nested stuff since the vmx->nested.vmxon is false
which results in the above warning.

This patch fixes it by also considering the no SMI handler case, luckily
vmx->nested.smm.vmxon is marked according to the value of vmx->nested.vmxon
in vmx_pre_enter_smm(), we can take advantage of it and free vmx->nested
stuff when L1 goes down.

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Reviewed-by: Liran Alon <liran.alon@oracle.com>
Fixes: 05cade7 (KVM: nSVM: fix SMI injection in guest mode)
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
popcornmix pushed a commit that referenced this issue Dec 12, 2017
In asn1_ber_decoder(), indefinitely-sized ASN.1 items were being passed
to the action functions before their lengths had been computed, using
the bogus length of 0x80 (ASN1_INDEFINITE_LENGTH).  This resulted in
reading data past the end of the input buffer, when given a specially
crafted message.

Fix it by rearranging the code so that the indefinite length is resolved
before the action is called.

This bug was originally found by fuzzing the X.509 parser in userspace
using libFuzzer from the LLVM project.

KASAN report (cleaned up slightly):

    BUG: KASAN: slab-out-of-bounds in memcpy ./include/linux/string.h:341 [inline]
    BUG: KASAN: slab-out-of-bounds in x509_fabricate_name.constprop.1+0x1a4/0x940 crypto/asymmetric_keys/x509_cert_parser.c:366
    Read of size 128 at addr ffff880035dd9eaf by task keyctl/195

    CPU: 1 PID: 195 Comm: keyctl Not tainted 4.14.0-09238-g1d3b78bbc6e9 #26
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-20171110_100015-anatol 04/01/2014
    Call Trace:
     __dump_stack lib/dump_stack.c:17 [inline]
     dump_stack+0xd1/0x175 lib/dump_stack.c:53
     print_address_description+0x78/0x260 mm/kasan/report.c:252
     kasan_report_error mm/kasan/report.c:351 [inline]
     kasan_report+0x23f/0x350 mm/kasan/report.c:409
     memcpy+0x1f/0x50 mm/kasan/kasan.c:302
     memcpy ./include/linux/string.h:341 [inline]
     x509_fabricate_name.constprop.1+0x1a4/0x940 crypto/asymmetric_keys/x509_cert_parser.c:366
     asn1_ber_decoder+0xb4a/0x1fd0 lib/asn1_decoder.c:447
     x509_cert_parse+0x1c7/0x620 crypto/asymmetric_keys/x509_cert_parser.c:89
     x509_key_preparse+0x61/0x750 crypto/asymmetric_keys/x509_public_key.c:174
     asymmetric_key_preparse+0xa4/0x150 crypto/asymmetric_keys/asymmetric_type.c:388
     key_create_or_update+0x4d4/0x10a0 security/keys/key.c:850
     SYSC_add_key security/keys/keyctl.c:122 [inline]
     SyS_add_key+0xe8/0x290 security/keys/keyctl.c:62
     entry_SYSCALL_64_fastpath+0x1f/0x96

    Allocated by task 195:
     __do_kmalloc_node mm/slab.c:3675 [inline]
     __kmalloc_node+0x47/0x60 mm/slab.c:3682
     kvmalloc ./include/linux/mm.h:540 [inline]
     SYSC_add_key security/keys/keyctl.c:104 [inline]
     SyS_add_key+0x19e/0x290 security/keys/keyctl.c:62
     entry_SYSCALL_64_fastpath+0x1f/0x96

Fixes: 42d5ec2 ("X.509: Add an ASN.1 decoder")
Reported-by: Alexander Potapenko <glider@google.com>
Cc: <stable@vger.kernel.org> # v3.7+
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: David Howells <dhowells@redhat.com>
popcornmix pushed a commit that referenced this issue Dec 12, 2017
Adding a specially crafted X.509 certificate whose subjectPublicKey
ASN.1 value is zero-length caused x509_extract_key_data() to set the
public key size to SIZE_MAX, as it subtracted the nonexistent BIT STRING
metadata byte.  Then, x509_cert_parse() called kmemdup() with that bogus
size, triggering the WARN_ON_ONCE() in kmalloc_slab().

This appears to be harmless, but it still must be fixed since WARNs are
never supposed to be user-triggerable.

Fix it by updating x509_cert_parse() to validate that the value has a
BIT STRING metadata byte, and that the byte is 0 which indicates that
the number of bits in the bitstring is a multiple of 8.

It would be nice to handle the metadata byte in asn1_ber_decoder()
instead.  But that would be tricky because in the general case a BIT
STRING could be implicitly tagged, and/or could legitimately have a
length that is not a whole number of bytes.

Here was the WARN (cleaned up slightly):

    WARNING: CPU: 1 PID: 202 at mm/slab_common.c:971 kmalloc_slab+0x5d/0x70 mm/slab_common.c:971
    Modules linked in:
    CPU: 1 PID: 202 Comm: keyctl Tainted: G    B            4.14.0-09238-g1d3b78bbc6e9 #26
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-20171110_100015-anatol 04/01/2014
    task: ffff880033014180 task.stack: ffff8800305c8000
    Call Trace:
     __do_kmalloc mm/slab.c:3706 [inline]
     __kmalloc_track_caller+0x22/0x2e0 mm/slab.c:3726
     kmemdup+0x17/0x40 mm/util.c:118
     kmemdup include/linux/string.h:414 [inline]
     x509_cert_parse+0x2cb/0x620 crypto/asymmetric_keys/x509_cert_parser.c:106
     x509_key_preparse+0x61/0x750 crypto/asymmetric_keys/x509_public_key.c:174
     asymmetric_key_preparse+0xa4/0x150 crypto/asymmetric_keys/asymmetric_type.c:388
     key_create_or_update+0x4d4/0x10a0 security/keys/key.c:850
     SYSC_add_key security/keys/keyctl.c:122 [inline]
     SyS_add_key+0xe8/0x290 security/keys/keyctl.c:62
     entry_SYSCALL_64_fastpath+0x1f/0x96

Fixes: 42d5ec2 ("X.509: Add an ASN.1 decoder")
Cc: <stable@vger.kernel.org> # v3.7+
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: James Morris <james.l.morris@oracle.com>
popcornmix pushed a commit that referenced this issue Dec 14, 2017
commit e0058f3 upstream.

In asn1_ber_decoder(), indefinitely-sized ASN.1 items were being passed
to the action functions before their lengths had been computed, using
the bogus length of 0x80 (ASN1_INDEFINITE_LENGTH).  This resulted in
reading data past the end of the input buffer, when given a specially
crafted message.

Fix it by rearranging the code so that the indefinite length is resolved
before the action is called.

This bug was originally found by fuzzing the X.509 parser in userspace
using libFuzzer from the LLVM project.

KASAN report (cleaned up slightly):

    BUG: KASAN: slab-out-of-bounds in memcpy ./include/linux/string.h:341 [inline]
    BUG: KASAN: slab-out-of-bounds in x509_fabricate_name.constprop.1+0x1a4/0x940 crypto/asymmetric_keys/x509_cert_parser.c:366
    Read of size 128 at addr ffff880035dd9eaf by task keyctl/195

    CPU: 1 PID: 195 Comm: keyctl Not tainted 4.14.0-09238-g1d3b78bbc6e9 #26
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-20171110_100015-anatol 04/01/2014
    Call Trace:
     __dump_stack lib/dump_stack.c:17 [inline]
     dump_stack+0xd1/0x175 lib/dump_stack.c:53
     print_address_description+0x78/0x260 mm/kasan/report.c:252
     kasan_report_error mm/kasan/report.c:351 [inline]
     kasan_report+0x23f/0x350 mm/kasan/report.c:409
     memcpy+0x1f/0x50 mm/kasan/kasan.c:302
     memcpy ./include/linux/string.h:341 [inline]
     x509_fabricate_name.constprop.1+0x1a4/0x940 crypto/asymmetric_keys/x509_cert_parser.c:366
     asn1_ber_decoder+0xb4a/0x1fd0 lib/asn1_decoder.c:447
     x509_cert_parse+0x1c7/0x620 crypto/asymmetric_keys/x509_cert_parser.c:89
     x509_key_preparse+0x61/0x750 crypto/asymmetric_keys/x509_public_key.c:174
     asymmetric_key_preparse+0xa4/0x150 crypto/asymmetric_keys/asymmetric_type.c:388
     key_create_or_update+0x4d4/0x10a0 security/keys/key.c:850
     SYSC_add_key security/keys/keyctl.c:122 [inline]
     SyS_add_key+0xe8/0x290 security/keys/keyctl.c:62
     entry_SYSCALL_64_fastpath+0x1f/0x96

    Allocated by task 195:
     __do_kmalloc_node mm/slab.c:3675 [inline]
     __kmalloc_node+0x47/0x60 mm/slab.c:3682
     kvmalloc ./include/linux/mm.h:540 [inline]
     SYSC_add_key security/keys/keyctl.c:104 [inline]
     SyS_add_key+0x19e/0x290 security/keys/keyctl.c:62
     entry_SYSCALL_64_fastpath+0x1f/0x96

Fixes: 42d5ec2 ("X.509: Add an ASN.1 decoder")
Reported-by: Alexander Potapenko <glider@google.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
popcornmix pushed a commit that referenced this issue Mar 21, 2022
[ Upstream commit 4224cfd ]

When bringing down the netdevice or system shutdown, a panic can be
triggered while accessing the sysfs path because the device is already
removed.

    [  755.549084] mlx5_core 0000:12:00.1: Shutdown was called
    [  756.404455] mlx5_core 0000:12:00.0: Shutdown was called
    ...
    [  757.937260] BUG: unable to handle kernel NULL pointer dereference at           (null)
    [  758.031397] IP: [<ffffffff8ee11acb>] dma_pool_alloc+0x1ab/0x280

    crash> bt
    ...
    PID: 12649  TASK: ffff8924108f2100  CPU: 1   COMMAND: "amsd"
    ...
     #9 [ffff89240e1a38b0] page_fault at ffffffff8f38c778
        [exception RIP: dma_pool_alloc+0x1ab]
        RIP: ffffffff8ee11acb  RSP: ffff89240e1a3968  RFLAGS: 00010046
        RAX: 0000000000000246  RBX: ffff89243d874100  RCX: 0000000000001000
        RDX: 0000000000000000  RSI: 0000000000000246  RDI: ffff89243d874090
        RBP: ffff89240e1a39c0   R8: 000000000001f080   R9: ffff8905ffc03c00
        R10: ffffffffc04680d4  R11: ffffffff8edde9fd  R12: 00000000000080d0
        R13: ffff89243d874090  R14: ffff89243d874080  R15: 0000000000000000
        ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
    #10 [ffff89240e1a39c8] mlx5_alloc_cmd_msg at ffffffffc04680f3 [mlx5_core]
    #11 [ffff89240e1a3a18] cmd_exec at ffffffffc046ad62 [mlx5_core]
    #12 [ffff89240e1a3ab8] mlx5_cmd_exec at ffffffffc046b4fb [mlx5_core]
    #13 [ffff89240e1a3ae8] mlx5_core_access_reg at ffffffffc0475434 [mlx5_core]
    #14 [ffff89240e1a3b40] mlx5e_get_fec_caps at ffffffffc04a7348 [mlx5_core]
    #15 [ffff89240e1a3bb0] get_fec_supported_advertised at ffffffffc04992bf [mlx5_core]
    #16 [ffff89240e1a3c08] mlx5e_get_link_ksettings at ffffffffc049ab36 [mlx5_core]
    #17 [ffff89240e1a3ce8] __ethtool_get_link_ksettings at ffffffff8f25db46
    #18 [ffff89240e1a3d48] speed_show at ffffffff8f277208
    #19 [ffff89240e1a3dd8] dev_attr_show at ffffffff8f0b70e3
    #20 [ffff89240e1a3df8] sysfs_kf_seq_show at ffffffff8eedbedf
    #21 [ffff89240e1a3e18] kernfs_seq_show at ffffffff8eeda596
    #22 [ffff89240e1a3e28] seq_read at ffffffff8ee76d10
    #23 [ffff89240e1a3e98] kernfs_fop_read at ffffffff8eedaef5
    #24 [ffff89240e1a3ed8] vfs_read at ffffffff8ee4e3ff
    #25 [ffff89240e1a3f08] sys_read at ffffffff8ee4f27f
    #26 [ffff89240e1a3f50] system_call_fastpath at ffffffff8f395f92

    crash> net_device.state ffff89443b0c0000
      state = 0x5  (__LINK_STATE_START| __LINK_STATE_NOCARRIER)

To prevent this scenario, we also make sure that the netdevice is present.

Signed-off-by: suresh kumar <suresh2514@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
popcornmix pushed a commit that referenced this issue Apr 8, 2022
[ Upstream commit 4224cfd ]

When bringing down the netdevice or system shutdown, a panic can be
triggered while accessing the sysfs path because the device is already
removed.

    [  755.549084] mlx5_core 0000:12:00.1: Shutdown was called
    [  756.404455] mlx5_core 0000:12:00.0: Shutdown was called
    ...
    [  757.937260] BUG: unable to handle kernel NULL pointer dereference at           (null)
    [  758.031397] IP: [<ffffffff8ee11acb>] dma_pool_alloc+0x1ab/0x280

    crash> bt
    ...
    PID: 12649  TASK: ffff8924108f2100  CPU: 1   COMMAND: "amsd"
    ...
     #9 [ffff89240e1a38b0] page_fault at ffffffff8f38c778
        [exception RIP: dma_pool_alloc+0x1ab]
        RIP: ffffffff8ee11acb  RSP: ffff89240e1a3968  RFLAGS: 00010046
        RAX: 0000000000000246  RBX: ffff89243d874100  RCX: 0000000000001000
        RDX: 0000000000000000  RSI: 0000000000000246  RDI: ffff89243d874090
        RBP: ffff89240e1a39c0   R8: 000000000001f080   R9: ffff8905ffc03c00
        R10: ffffffffc04680d4  R11: ffffffff8edde9fd  R12: 00000000000080d0
        R13: ffff89243d874090  R14: ffff89243d874080  R15: 0000000000000000
        ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
    #10 [ffff89240e1a39c8] mlx5_alloc_cmd_msg at ffffffffc04680f3 [mlx5_core]
    #11 [ffff89240e1a3a18] cmd_exec at ffffffffc046ad62 [mlx5_core]
    #12 [ffff89240e1a3ab8] mlx5_cmd_exec at ffffffffc046b4fb [mlx5_core]
    #13 [ffff89240e1a3ae8] mlx5_core_access_reg at ffffffffc0475434 [mlx5_core]
    #14 [ffff89240e1a3b40] mlx5e_get_fec_caps at ffffffffc04a7348 [mlx5_core]
    #15 [ffff89240e1a3bb0] get_fec_supported_advertised at ffffffffc04992bf [mlx5_core]
    #16 [ffff89240e1a3c08] mlx5e_get_link_ksettings at ffffffffc049ab36 [mlx5_core]
    #17 [ffff89240e1a3ce8] __ethtool_get_link_ksettings at ffffffff8f25db46
    #18 [ffff89240e1a3d48] speed_show at ffffffff8f277208
    #19 [ffff89240e1a3dd8] dev_attr_show at ffffffff8f0b70e3
    #20 [ffff89240e1a3df8] sysfs_kf_seq_show at ffffffff8eedbedf
    #21 [ffff89240e1a3e18] kernfs_seq_show at ffffffff8eeda596
    #22 [ffff89240e1a3e28] seq_read at ffffffff8ee76d10
    #23 [ffff89240e1a3e98] kernfs_fop_read at ffffffff8eedaef5
    #24 [ffff89240e1a3ed8] vfs_read at ffffffff8ee4e3ff
    #25 [ffff89240e1a3f08] sys_read at ffffffff8ee4f27f
    #26 [ffff89240e1a3f50] system_call_fastpath at ffffffff8f395f92

    crash> net_device.state ffff89443b0c0000
      state = 0x5  (__LINK_STATE_START| __LINK_STATE_NOCARRIER)

To prevent this scenario, we also make sure that the netdevice is present.

Signed-off-by: suresh kumar <suresh2514@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
popcornmix pushed a commit that referenced this issue Jul 18, 2022
Release refcount after xas_set to fix UAF which may cause panic like this:

 page:ffffea000491fa40 refcount:1 mapcount:0 mapping:0000000000000000 index:0x1 pfn:0x1247e9
 head:ffffea000491fa00 order:3 compound_mapcount:0 compound_pincount:0
 memcg:ffff888104f91091
 flags: 0x2fffff80010200(slab|head|node=0|zone=2|lastcpupid=0x1fffff)
...
page dumped because: VM_BUG_ON_PAGE(PageTail(page))
 ------------[ cut here ]------------
 kernel BUG at include/linux/page-flags.h:632!
 invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN
 CPU: 1 PID: 7642 Comm: sh Not tainted 5.15.51-dirty #26
...
 Call Trace:
  <TASK>
  __invalidate_mapping_pages+0xe7/0x540
  drop_pagecache_sb+0x159/0x320
  iterate_supers+0x120/0x240
  drop_caches_sysctl_handler+0xaa/0xe0
  proc_sys_call_handler+0x2b4/0x480
  new_sync_write+0x3d6/0x5c0
  vfs_write+0x446/0x7a0
  ksys_write+0x105/0x210
  do_syscall_64+0x35/0x80
  entry_SYSCALL_64_after_hwframe+0x44/0xae
 RIP: 0033:0x7f52b5733130
...

This problem has been fixed on mainline by patch 6b24ca4 ("mm: Use
multi-index entries in the page cache") since it deletes the related code.

Fixes: 5c211ba ("mm: add and use find_lock_entries")
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Acked-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
popcornmix pushed a commit that referenced this issue Jan 23, 2023
Driver's probe allocates memory for RX FIFO (port->rx_fifo) based on
default RX FIFO depth, e.g. 16.  Later during serial startup the
qcom_geni_serial_port_setup() updates the RX FIFO depth
(port->rx_fifo_depth) to match real device capabilities, e.g. to 32.

The RX UART handle code will read "port->rx_fifo_depth" number of words
into "port->rx_fifo" buffer, thus exceeding the bounds.  This can be
observed in certain configurations with Qualcomm Bluetooth HCI UART
device and KASAN:

  Bluetooth: hci0: QCA Product ID   :0x00000010
  Bluetooth: hci0: QCA SOC Version  :0x400a0200
  Bluetooth: hci0: QCA ROM Version  :0x00000200
  Bluetooth: hci0: QCA Patch Version:0x00000d2b
  Bluetooth: hci0: QCA controller version 0x02000200
  Bluetooth: hci0: QCA Downloading qca/htbtfw20.tlv
  bluetooth hci0: Direct firmware load for qca/htbtfw20.tlv failed with error -2
  Bluetooth: hci0: QCA Failed to request file: qca/htbtfw20.tlv (-2)
  Bluetooth: hci0: QCA Failed to download patch (-2)
  ==================================================================
  BUG: KASAN: slab-out-of-bounds in handle_rx_uart+0xa8/0x18c
  Write of size 4 at addr ffff279347d578c0 by task swapper/0/0

  CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.1.0-rt5-00350-gb2450b7e00be-dirty #26
  Hardware name: Qualcomm Technologies, Inc. Robotics RB5 (DT)
  Call trace:
   dump_backtrace.part.0+0xe0/0xf0
   show_stack+0x18/0x40
   dump_stack_lvl+0x8c/0xb8
   print_report+0x188/0x488
   kasan_report+0xb4/0x100
   __asan_store4+0x80/0xa4
   handle_rx_uart+0xa8/0x18c
   qcom_geni_serial_handle_rx+0x84/0x9c
   qcom_geni_serial_isr+0x24c/0x760
   __handle_irq_event_percpu+0x108/0x500
   handle_irq_event+0x6c/0x110
   handle_fasteoi_irq+0x138/0x2cc
   generic_handle_domain_irq+0x48/0x64

If the RX FIFO depth changes after probe, be sure to resize the buffer.

Fixes: f9d690b ("tty: serial: qcom_geni_serial: Allocate port->rx_fifo buffer in probe")
Cc: <stable@vger.kernel.org>
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Reviewed-by: Jiri Slaby <jirislaby@kernel.org>
Link: https://lore.kernel.org/r/20221221164022.1087814-1-krzysztof.kozlowski@linaro.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
popcornmix pushed a commit that referenced this issue Jan 24, 2023
commit b8caf69 upstream.

Driver's probe allocates memory for RX FIFO (port->rx_fifo) based on
default RX FIFO depth, e.g. 16.  Later during serial startup the
qcom_geni_serial_port_setup() updates the RX FIFO depth
(port->rx_fifo_depth) to match real device capabilities, e.g. to 32.

The RX UART handle code will read "port->rx_fifo_depth" number of words
into "port->rx_fifo" buffer, thus exceeding the bounds.  This can be
observed in certain configurations with Qualcomm Bluetooth HCI UART
device and KASAN:

  Bluetooth: hci0: QCA Product ID   :0x00000010
  Bluetooth: hci0: QCA SOC Version  :0x400a0200
  Bluetooth: hci0: QCA ROM Version  :0x00000200
  Bluetooth: hci0: QCA Patch Version:0x00000d2b
  Bluetooth: hci0: QCA controller version 0x02000200
  Bluetooth: hci0: QCA Downloading qca/htbtfw20.tlv
  bluetooth hci0: Direct firmware load for qca/htbtfw20.tlv failed with error -2
  Bluetooth: hci0: QCA Failed to request file: qca/htbtfw20.tlv (-2)
  Bluetooth: hci0: QCA Failed to download patch (-2)
  ==================================================================
  BUG: KASAN: slab-out-of-bounds in handle_rx_uart+0xa8/0x18c
  Write of size 4 at addr ffff279347d578c0 by task swapper/0/0

  CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.1.0-rt5-00350-gb2450b7e00be-dirty #26
  Hardware name: Qualcomm Technologies, Inc. Robotics RB5 (DT)
  Call trace:
   dump_backtrace.part.0+0xe0/0xf0
   show_stack+0x18/0x40
   dump_stack_lvl+0x8c/0xb8
   print_report+0x188/0x488
   kasan_report+0xb4/0x100
   __asan_store4+0x80/0xa4
   handle_rx_uart+0xa8/0x18c
   qcom_geni_serial_handle_rx+0x84/0x9c
   qcom_geni_serial_isr+0x24c/0x760
   __handle_irq_event_percpu+0x108/0x500
   handle_irq_event+0x6c/0x110
   handle_fasteoi_irq+0x138/0x2cc
   generic_handle_domain_irq+0x48/0x64

If the RX FIFO depth changes after probe, be sure to resize the buffer.

Fixes: f9d690b ("tty: serial: qcom_geni_serial: Allocate port->rx_fifo buffer in probe")
Cc: <stable@vger.kernel.org>
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Reviewed-by: Jiri Slaby <jirislaby@kernel.org>
Link: https://lore.kernel.org/r/20221221164022.1087814-1-krzysztof.kozlowski@linaro.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
popcornmix pushed a commit that referenced this issue Jan 30, 2023
commit b8caf69 upstream.

Driver's probe allocates memory for RX FIFO (port->rx_fifo) based on
default RX FIFO depth, e.g. 16.  Later during serial startup the
qcom_geni_serial_port_setup() updates the RX FIFO depth
(port->rx_fifo_depth) to match real device capabilities, e.g. to 32.

The RX UART handle code will read "port->rx_fifo_depth" number of words
into "port->rx_fifo" buffer, thus exceeding the bounds.  This can be
observed in certain configurations with Qualcomm Bluetooth HCI UART
device and KASAN:

  Bluetooth: hci0: QCA Product ID   :0x00000010
  Bluetooth: hci0: QCA SOC Version  :0x400a0200
  Bluetooth: hci0: QCA ROM Version  :0x00000200
  Bluetooth: hci0: QCA Patch Version:0x00000d2b
  Bluetooth: hci0: QCA controller version 0x02000200
  Bluetooth: hci0: QCA Downloading qca/htbtfw20.tlv
  bluetooth hci0: Direct firmware load for qca/htbtfw20.tlv failed with error -2
  Bluetooth: hci0: QCA Failed to request file: qca/htbtfw20.tlv (-2)
  Bluetooth: hci0: QCA Failed to download patch (-2)
  ==================================================================
  BUG: KASAN: slab-out-of-bounds in handle_rx_uart+0xa8/0x18c
  Write of size 4 at addr ffff279347d578c0 by task swapper/0/0

  CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.1.0-rt5-00350-gb2450b7e00be-dirty #26
  Hardware name: Qualcomm Technologies, Inc. Robotics RB5 (DT)
  Call trace:
   dump_backtrace.part.0+0xe0/0xf0
   show_stack+0x18/0x40
   dump_stack_lvl+0x8c/0xb8
   print_report+0x188/0x488
   kasan_report+0xb4/0x100
   __asan_store4+0x80/0xa4
   handle_rx_uart+0xa8/0x18c
   qcom_geni_serial_handle_rx+0x84/0x9c
   qcom_geni_serial_isr+0x24c/0x760
   __handle_irq_event_percpu+0x108/0x500
   handle_irq_event+0x6c/0x110
   handle_fasteoi_irq+0x138/0x2cc
   generic_handle_domain_irq+0x48/0x64

If the RX FIFO depth changes after probe, be sure to resize the buffer.

Fixes: f9d690b ("tty: serial: qcom_geni_serial: Allocate port->rx_fifo buffer in probe")
Cc: <stable@vger.kernel.org>
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Reviewed-by: Jiri Slaby <jirislaby@kernel.org>
Link: https://lore.kernel.org/r/20221221164022.1087814-1-krzysztof.kozlowski@linaro.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
margro pushed a commit to margro/linux that referenced this issue May 28, 2023
[ Upstream commit e24c644 ]

I compiled with AddressSanitizer and I had these memory leaks while I
was using the tep_parse_format function:

    Direct leak of 28 byte(s) in 4 object(s) allocated from:
        #0 0x7fb07db49ffe in __interceptor_realloc (/lib/x86_64-linux-gnu/libasan.so.5+0x10dffe)
        raspberrypi#1 0x7fb07a724228 in extend_token /home/pduplessis/repo/linux/tools/lib/traceevent/event-parse.c:985
        raspberrypi#2 0x7fb07a724c21 in __read_token /home/pduplessis/repo/linux/tools/lib/traceevent/event-parse.c:1140
        raspberrypi#3 0x7fb07a724f78 in read_token /home/pduplessis/repo/linux/tools/lib/traceevent/event-parse.c:1206
        raspberrypi#4 0x7fb07a725191 in __read_expect_type /home/pduplessis/repo/linux/tools/lib/traceevent/event-parse.c:1291
        raspberrypi#5 0x7fb07a7251df in read_expect_type /home/pduplessis/repo/linux/tools/lib/traceevent/event-parse.c:1299
        raspberrypi#6 0x7fb07a72e6c8 in process_dynamic_array_len /home/pduplessis/repo/linux/tools/lib/traceevent/event-parse.c:2849
        raspberrypi#7 0x7fb07a7304b8 in process_function /home/pduplessis/repo/linux/tools/lib/traceevent/event-parse.c:3161
        raspberrypi#8 0x7fb07a730900 in process_arg_token /home/pduplessis/repo/linux/tools/lib/traceevent/event-parse.c:3207
        raspberrypi#9 0x7fb07a727c0b in process_arg /home/pduplessis/repo/linux/tools/lib/traceevent/event-parse.c:1786
        raspberrypi#10 0x7fb07a731080 in event_read_print_args /home/pduplessis/repo/linux/tools/lib/traceevent/event-parse.c:3285
        raspberrypi#11 0x7fb07a731722 in event_read_print /home/pduplessis/repo/linux/tools/lib/traceevent/event-parse.c:3369
        raspberrypi#12 0x7fb07a740054 in __tep_parse_format /home/pduplessis/repo/linux/tools/lib/traceevent/event-parse.c:6335
        raspberrypi#13 0x7fb07a74047a in __parse_event /home/pduplessis/repo/linux/tools/lib/traceevent/event-parse.c:6389
        raspberrypi#14 0x7fb07a740536 in tep_parse_format /home/pduplessis/repo/linux/tools/lib/traceevent/event-parse.c:6431
        raspberrypi#15 0x7fb07a785acf in parse_event ../../../src/fs-src/fs.c:251
        raspberrypi#16 0x7fb07a785ccd in parse_systems ../../../src/fs-src/fs.c:284
        raspberrypi#17 0x7fb07a786fb3 in read_metadata ../../../src/fs-src/fs.c:593
        raspberrypi#18 0x7fb07a78760e in ftrace_fs_source_init ../../../src/fs-src/fs.c:727
        raspberrypi#19 0x7fb07d90c19c in add_component_with_init_method_data ../../../../src/lib/graph/graph.c:1048
        raspberrypi#20 0x7fb07d90c87b in add_source_component_with_initialize_method_data ../../../../src/lib/graph/graph.c:1127
        raspberrypi#21 0x7fb07d90c92a in bt_graph_add_source_component ../../../../src/lib/graph/graph.c:1152
        raspberrypi#22 0x55db11aa632e in cmd_run_ctx_create_components_from_config_components ../../../src/cli/babeltrace2.c:2252
        raspberrypi#23 0x55db11aa6fda in cmd_run_ctx_create_components ../../../src/cli/babeltrace2.c:2347
        raspberrypi#24 0x55db11aa780c in cmd_run ../../../src/cli/babeltrace2.c:2461
        raspberrypi#25 0x55db11aa8a7d in main ../../../src/cli/babeltrace2.c:2673
        raspberrypi#26 0x7fb07d5460b2 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x270b2)

The token variable in the process_dynamic_array_len function is
allocated in the read_expect_type function, but is not freed before
calling the read_token function.

Free the token variable before calling read_token in order to plug the
leak.

Signed-off-by: Philippe Duplessis-Guindon <pduplessis@efficios.com>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Link: https://lore.kernel.org/linux-trace-devel/20200730150236.5392-1-pduplessis@efficios.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
margro pushed a commit to margro/linux that referenced this issue May 28, 2023
[ Upstream commit 96ecdcc ]

Netpoll can try to poll napi as soon as napi_enable() is called.
It crashes trying to access a doorbell which is still NULL:

 BUG: kernel NULL pointer dereference, address: 0000000000000000
 CPU: 59 PID: 6039 Comm: ethtool Kdump: loaded Tainted: G S                5.9.0-rc1-00469-g5fd99b5d9950-dirty raspberrypi#26
 RIP: 0010:bnxt_poll+0x121/0x1c0
 Code: c4 20 44 89 e0 5b 5d 41 5c 41 5d 41 5e 41 5f c3 41 8b 86 a0 01 00 00 41 23 85 18 01 00 00 49 8b 96 a8 01 00 00 0d 00 00 00 24 <89> 02
41 f6 45 77 02 74 cb 49 8b ae d8 01 00 00 31 c0 c7 44 24 1a
  netpoll_poll_dev+0xbd/0x1a0
  __netpoll_send_skb+0x1b2/0x210
  netpoll_send_udp+0x2c9/0x406
  write_ext_msg+0x1d7/0x1f0
  console_unlock+0x23c/0x520
  vprintk_emit+0xe0/0x1d0
  printk+0x58/0x6f
  x86_vector_activate.cold+0xf/0x46
  __irq_domain_activate_irq+0x50/0x80
  __irq_domain_activate_irq+0x32/0x80
  __irq_domain_activate_irq+0x32/0x80
  irq_domain_activate_irq+0x25/0x40
  __setup_irq+0x2d2/0x700
  request_threaded_irq+0xfb/0x160
  __bnxt_open_nic+0x3b1/0x750
  bnxt_open_nic+0x19/0x30
  ethtool_set_channels+0x1ac/0x220
  dev_ethtool+0x11ba/0x2240
  dev_ioctl+0x1cf/0x390
  sock_do_ioctl+0x95/0x130

Reported-by: Rob Sherwood <rsher@fb.com>
Fixes: c0c050c ("bnxt_en: New Broadcom ethernet driver.")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
popcornmix pushed a commit that referenced this issue Jun 6, 2023
The cited commit adds a compeletion to remove dependency on rtnl
lock. But it causes a deadlock for multiple encapsulations:

 crash> bt ffff8aece8a64000
 PID: 1514557  TASK: ffff8aece8a64000  CPU: 3    COMMAND: "tc"
  #0 [ffffa6d14183f368] __schedule at ffffffffb8ba7f45
  #1 [ffffa6d14183f3f8] schedule at ffffffffb8ba8418
  #2 [ffffa6d14183f418] schedule_preempt_disabled at ffffffffb8ba8898
  #3 [ffffa6d14183f428] __mutex_lock at ffffffffb8baa7f8
  #4 [ffffa6d14183f4d0] mutex_lock_nested at ffffffffb8baabeb
  #5 [ffffa6d14183f4e0] mlx5e_attach_encap at ffffffffc0f48c17 [mlx5_core]
  #6 [ffffa6d14183f628] mlx5e_tc_add_fdb_flow at ffffffffc0f39680 [mlx5_core]
  #7 [ffffa6d14183f688] __mlx5e_add_fdb_flow at ffffffffc0f3b636 [mlx5_core]
  #8 [ffffa6d14183f6f0] mlx5e_tc_add_flow at ffffffffc0f3bcdf [mlx5_core]
  #9 [ffffa6d14183f728] mlx5e_configure_flower at ffffffffc0f3c1d1 [mlx5_core]
 #10 [ffffa6d14183f790] mlx5e_rep_setup_tc_cls_flower at ffffffffc0f3d529 [mlx5_core]
 #11 [ffffa6d14183f7a0] mlx5e_rep_setup_tc_cb at ffffffffc0f3d714 [mlx5_core]
 #12 [ffffa6d14183f7b0] tc_setup_cb_add at ffffffffb8931bb8
 #13 [ffffa6d14183f810] fl_hw_replace_filter at ffffffffc0dae901 [cls_flower]
 #14 [ffffa6d14183f8d8] fl_change at ffffffffc0db5c57 [cls_flower]
 #15 [ffffa6d14183f970] tc_new_tfilter at ffffffffb8936047
 #16 [ffffa6d14183fac8] rtnetlink_rcv_msg at ffffffffb88c7c31
 #17 [ffffa6d14183fb50] netlink_rcv_skb at ffffffffb8942853
 #18 [ffffa6d14183fbc0] rtnetlink_rcv at ffffffffb88c1835
 #19 [ffffa6d14183fbd0] netlink_unicast at ffffffffb8941f27
 #20 [ffffa6d14183fc18] netlink_sendmsg at ffffffffb8942245
 #21 [ffffa6d14183fc98] sock_sendmsg at ffffffffb887d482
 #22 [ffffa6d14183fcb8] ____sys_sendmsg at ffffffffb887d81a
 #23 [ffffa6d14183fd38] ___sys_sendmsg at ffffffffb88806e2
 #24 [ffffa6d14183fe90] __sys_sendmsg at ffffffffb88807a2
 #25 [ffffa6d14183ff28] __x64_sys_sendmsg at ffffffffb888080f
 #26 [ffffa6d14183ff38] do_syscall_64 at ffffffffb8b9b6a8
 #27 [ffffa6d14183ff50] entry_SYSCALL_64_after_hwframe at ffffffffb8c0007c
 crash> bt 0xffff8aeb07544000
 PID: 1110766  TASK: ffff8aeb07544000  CPU: 0    COMMAND: "kworker/u20:9"
  #0 [ffffa6d14e6b7bd8] __schedule at ffffffffb8ba7f45
  #1 [ffffa6d14e6b7c68] schedule at ffffffffb8ba8418
  #2 [ffffa6d14e6b7c88] schedule_timeout at ffffffffb8baef88
  #3 [ffffa6d14e6b7d10] wait_for_completion at ffffffffb8ba968b
  #4 [ffffa6d14e6b7d60] mlx5e_take_all_encap_flows at ffffffffc0f47ec4 [mlx5_core]
  #5 [ffffa6d14e6b7da0] mlx5e_rep_update_flows at ffffffffc0f3e734 [mlx5_core]
  #6 [ffffa6d14e6b7df8] mlx5e_rep_neigh_update at ffffffffc0f400bb [mlx5_core]
  #7 [ffffa6d14e6b7e50] process_one_work at ffffffffb80acc9c
  #8 [ffffa6d14e6b7ed0] worker_thread at ffffffffb80ad012
  #9 [ffffa6d14e6b7f10] kthread at ffffffffb80b615d
 #10 [ffffa6d14e6b7f50] ret_from_fork at ffffffffb8001b2f

After the first encap is attached, flow will be added to encap
entry's flows list. If neigh update is running at this time, the
following encaps of the flow can't hold the encap_tbl_lock and
sleep. If neigh update thread is waiting for that flow's init_done,
deadlock happens.

Fix it by holding lock outside of the for loop. If neigh update is
running, prevent encap flows from offloading. Since the lock is held
outside of the for loop, concurrent creation of encap entries is not
allowed. So remove unnecessary wait_for_completion call for res_ready.

Fixes: 95435ad ("net/mlx5e: Only access fully initialized flows in neigh update")
Signed-off-by: Chris Mi <cmi@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
popcornmix pushed a commit that referenced this issue Jun 12, 2023
[ Upstream commit 37c3b9f ]

The cited commit adds a compeletion to remove dependency on rtnl
lock. But it causes a deadlock for multiple encapsulations:

 crash> bt ffff8aece8a64000
 PID: 1514557  TASK: ffff8aece8a64000  CPU: 3    COMMAND: "tc"
  #0 [ffffa6d14183f368] __schedule at ffffffffb8ba7f45
  #1 [ffffa6d14183f3f8] schedule at ffffffffb8ba8418
  #2 [ffffa6d14183f418] schedule_preempt_disabled at ffffffffb8ba8898
  #3 [ffffa6d14183f428] __mutex_lock at ffffffffb8baa7f8
  #4 [ffffa6d14183f4d0] mutex_lock_nested at ffffffffb8baabeb
  #5 [ffffa6d14183f4e0] mlx5e_attach_encap at ffffffffc0f48c17 [mlx5_core]
  #6 [ffffa6d14183f628] mlx5e_tc_add_fdb_flow at ffffffffc0f39680 [mlx5_core]
  #7 [ffffa6d14183f688] __mlx5e_add_fdb_flow at ffffffffc0f3b636 [mlx5_core]
  #8 [ffffa6d14183f6f0] mlx5e_tc_add_flow at ffffffffc0f3bcdf [mlx5_core]
  #9 [ffffa6d14183f728] mlx5e_configure_flower at ffffffffc0f3c1d1 [mlx5_core]
 #10 [ffffa6d14183f790] mlx5e_rep_setup_tc_cls_flower at ffffffffc0f3d529 [mlx5_core]
 #11 [ffffa6d14183f7a0] mlx5e_rep_setup_tc_cb at ffffffffc0f3d714 [mlx5_core]
 #12 [ffffa6d14183f7b0] tc_setup_cb_add at ffffffffb8931bb8
 #13 [ffffa6d14183f810] fl_hw_replace_filter at ffffffffc0dae901 [cls_flower]
 #14 [ffffa6d14183f8d8] fl_change at ffffffffc0db5c57 [cls_flower]
 #15 [ffffa6d14183f970] tc_new_tfilter at ffffffffb8936047
 #16 [ffffa6d14183fac8] rtnetlink_rcv_msg at ffffffffb88c7c31
 #17 [ffffa6d14183fb50] netlink_rcv_skb at ffffffffb8942853
 #18 [ffffa6d14183fbc0] rtnetlink_rcv at ffffffffb88c1835
 #19 [ffffa6d14183fbd0] netlink_unicast at ffffffffb8941f27
 #20 [ffffa6d14183fc18] netlink_sendmsg at ffffffffb8942245
 #21 [ffffa6d14183fc98] sock_sendmsg at ffffffffb887d482
 #22 [ffffa6d14183fcb8] ____sys_sendmsg at ffffffffb887d81a
 #23 [ffffa6d14183fd38] ___sys_sendmsg at ffffffffb88806e2
 #24 [ffffa6d14183fe90] __sys_sendmsg at ffffffffb88807a2
 #25 [ffffa6d14183ff28] __x64_sys_sendmsg at ffffffffb888080f
 #26 [ffffa6d14183ff38] do_syscall_64 at ffffffffb8b9b6a8
 #27 [ffffa6d14183ff50] entry_SYSCALL_64_after_hwframe at ffffffffb8c0007c
 crash> bt 0xffff8aeb07544000
 PID: 1110766  TASK: ffff8aeb07544000  CPU: 0    COMMAND: "kworker/u20:9"
  #0 [ffffa6d14e6b7bd8] __schedule at ffffffffb8ba7f45
  #1 [ffffa6d14e6b7c68] schedule at ffffffffb8ba8418
  #2 [ffffa6d14e6b7c88] schedule_timeout at ffffffffb8baef88
  #3 [ffffa6d14e6b7d10] wait_for_completion at ffffffffb8ba968b
  #4 [ffffa6d14e6b7d60] mlx5e_take_all_encap_flows at ffffffffc0f47ec4 [mlx5_core]
  #5 [ffffa6d14e6b7da0] mlx5e_rep_update_flows at ffffffffc0f3e734 [mlx5_core]
  #6 [ffffa6d14e6b7df8] mlx5e_rep_neigh_update at ffffffffc0f400bb [mlx5_core]
  #7 [ffffa6d14e6b7e50] process_one_work at ffffffffb80acc9c
  #8 [ffffa6d14e6b7ed0] worker_thread at ffffffffb80ad012
  #9 [ffffa6d14e6b7f10] kthread at ffffffffb80b615d
 #10 [ffffa6d14e6b7f50] ret_from_fork at ffffffffb8001b2f

After the first encap is attached, flow will be added to encap
entry's flows list. If neigh update is running at this time, the
following encaps of the flow can't hold the encap_tbl_lock and
sleep. If neigh update thread is waiting for that flow's init_done,
deadlock happens.

Fix it by holding lock outside of the for loop. If neigh update is
running, prevent encap flows from offloading. Since the lock is held
outside of the for loop, concurrent creation of encap entries is not
allowed. So remove unnecessary wait_for_completion call for res_ready.

Fixes: 95435ad ("net/mlx5e: Only access fully initialized flows in neigh update")
Signed-off-by: Chris Mi <cmi@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
popcornmix pushed a commit that referenced this issue Jul 14, 2023
[ Upstream commit 8d21155 ]

adjust_inuse_and_calc_cost() use spin_lock_irq() and IRQ will be enabled
when unlock. DEADLOCK might happen if we have held other locks and disabled
IRQ before invoking it.

Fix it by using spin_lock_irqsave() instead, which can keep IRQ state
consistent with before when unlock.

  ================================
  WARNING: inconsistent lock state
  5.10.0-02758-g8e5f91fd772f #26 Not tainted
  --------------------------------
  inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.
  kworker/2:3/388 [HC0[0]:SC0[0]:HE0:SE1] takes:
  ffff888118c00c28 (&bfqd->lock){?.-.}-{2:2}, at: spin_lock_irq
  ffff888118c00c28 (&bfqd->lock){?.-.}-{2:2}, at: bfq_bio_merge+0x141/0x390
  {IN-HARDIRQ-W} state was registered at:
    __lock_acquire+0x3d7/0x1070
    lock_acquire+0x197/0x4a0
    __raw_spin_lock_irqsave
    _raw_spin_lock_irqsave+0x3b/0x60
    bfq_idle_slice_timer_body
    bfq_idle_slice_timer+0x53/0x1d0
    __run_hrtimer+0x477/0xa70
    __hrtimer_run_queues+0x1c6/0x2d0
    hrtimer_interrupt+0x302/0x9e0
    local_apic_timer_interrupt
    __sysvec_apic_timer_interrupt+0xfd/0x420
    run_sysvec_on_irqstack_cond
    sysvec_apic_timer_interrupt+0x46/0xa0
    asm_sysvec_apic_timer_interrupt+0x12/0x20
  irq event stamp: 837522
  hardirqs last  enabled at (837521): [<ffffffff84b9419d>] __raw_spin_unlock_irqrestore
  hardirqs last  enabled at (837521): [<ffffffff84b9419d>] _raw_spin_unlock_irqrestore+0x3d/0x40
  hardirqs last disabled at (837522): [<ffffffff84b93fa3>] __raw_spin_lock_irq
  hardirqs last disabled at (837522): [<ffffffff84b93fa3>] _raw_spin_lock_irq+0x43/0x50
  softirqs last  enabled at (835852): [<ffffffff84e00558>] __do_softirq+0x558/0x8ec
  softirqs last disabled at (835845): [<ffffffff84c010ff>] asm_call_irq_on_stack+0xf/0x20

  other info that might help us debug this:
   Possible unsafe locking scenario:

         CPU0
         ----
    lock(&bfqd->lock);
    <Interrupt>
      lock(&bfqd->lock);

   *** DEADLOCK ***

  3 locks held by kworker/2:3/388:
   #0: ffff888107af0f38 ((wq_completion)kthrotld){+.+.}-{0:0}, at: process_one_work+0x742/0x13f0
   #1: ffff8881176bfdd8 ((work_completion)(&td->dispatch_work)){+.+.}-{0:0}, at: process_one_work+0x777/0x13f0
   #2: ffff888118c00c28 (&bfqd->lock){?.-.}-{2:2}, at: spin_lock_irq
   #2: ffff888118c00c28 (&bfqd->lock){?.-.}-{2:2}, at: bfq_bio_merge+0x141/0x390

  stack backtrace:
  CPU: 2 PID: 388 Comm: kworker/2:3 Not tainted 5.10.0-02758-g8e5f91fd772f #26
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
  Workqueue: kthrotld blk_throtl_dispatch_work_fn
  Call Trace:
   __dump_stack lib/dump_stack.c:77 [inline]
   dump_stack+0x107/0x167
   print_usage_bug
   valid_state
   mark_lock_irq.cold+0x32/0x3a
   mark_lock+0x693/0xbc0
   mark_held_locks+0x9e/0xe0
   __trace_hardirqs_on_caller
   lockdep_hardirqs_on_prepare.part.0+0x151/0x360
   trace_hardirqs_on+0x5b/0x180
   __raw_spin_unlock_irq
   _raw_spin_unlock_irq+0x24/0x40
   spin_unlock_irq
   adjust_inuse_and_calc_cost+0x4fb/0x970
   ioc_rqos_merge+0x277/0x740
   __rq_qos_merge+0x62/0xb0
   rq_qos_merge
   bio_attempt_back_merge+0x12c/0x4a0
   blk_mq_sched_try_merge+0x1b6/0x4d0
   bfq_bio_merge+0x24a/0x390
   __blk_mq_sched_bio_merge+0xa6/0x460
   blk_mq_sched_bio_merge
   blk_mq_submit_bio+0x2e7/0x1ee0
   __submit_bio_noacct_mq+0x175/0x3b0
   submit_bio_noacct+0x1fb/0x270
   blk_throtl_dispatch_work_fn+0x1ef/0x2b0
   process_one_work+0x83e/0x13f0
   process_scheduled_works
   worker_thread+0x7e3/0xd80
   kthread+0x353/0x470
   ret_from_fork+0x1f/0x30

Fixes: b0853ab ("blk-iocost: revamp in-period donation snapbacks")
Signed-off-by: Li Nan <linan122@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20230527091904.3001833-1-linan666@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
popcornmix pushed a commit that referenced this issue Jul 20, 2023
[ Upstream commit 8d21155 ]

adjust_inuse_and_calc_cost() use spin_lock_irq() and IRQ will be enabled
when unlock. DEADLOCK might happen if we have held other locks and disabled
IRQ before invoking it.

Fix it by using spin_lock_irqsave() instead, which can keep IRQ state
consistent with before when unlock.

  ================================
  WARNING: inconsistent lock state
  5.10.0-02758-g8e5f91fd772f #26 Not tainted
  --------------------------------
  inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.
  kworker/2:3/388 [HC0[0]:SC0[0]:HE0:SE1] takes:
  ffff888118c00c28 (&bfqd->lock){?.-.}-{2:2}, at: spin_lock_irq
  ffff888118c00c28 (&bfqd->lock){?.-.}-{2:2}, at: bfq_bio_merge+0x141/0x390
  {IN-HARDIRQ-W} state was registered at:
    __lock_acquire+0x3d7/0x1070
    lock_acquire+0x197/0x4a0
    __raw_spin_lock_irqsave
    _raw_spin_lock_irqsave+0x3b/0x60
    bfq_idle_slice_timer_body
    bfq_idle_slice_timer+0x53/0x1d0
    __run_hrtimer+0x477/0xa70
    __hrtimer_run_queues+0x1c6/0x2d0
    hrtimer_interrupt+0x302/0x9e0
    local_apic_timer_interrupt
    __sysvec_apic_timer_interrupt+0xfd/0x420
    run_sysvec_on_irqstack_cond
    sysvec_apic_timer_interrupt+0x46/0xa0
    asm_sysvec_apic_timer_interrupt+0x12/0x20
  irq event stamp: 837522
  hardirqs last  enabled at (837521): [<ffffffff84b9419d>] __raw_spin_unlock_irqrestore
  hardirqs last  enabled at (837521): [<ffffffff84b9419d>] _raw_spin_unlock_irqrestore+0x3d/0x40
  hardirqs last disabled at (837522): [<ffffffff84b93fa3>] __raw_spin_lock_irq
  hardirqs last disabled at (837522): [<ffffffff84b93fa3>] _raw_spin_lock_irq+0x43/0x50
  softirqs last  enabled at (835852): [<ffffffff84e00558>] __do_softirq+0x558/0x8ec
  softirqs last disabled at (835845): [<ffffffff84c010ff>] asm_call_irq_on_stack+0xf/0x20

  other info that might help us debug this:
   Possible unsafe locking scenario:

         CPU0
         ----
    lock(&bfqd->lock);
    <Interrupt>
      lock(&bfqd->lock);

   *** DEADLOCK ***

  3 locks held by kworker/2:3/388:
   #0: ffff888107af0f38 ((wq_completion)kthrotld){+.+.}-{0:0}, at: process_one_work+0x742/0x13f0
   #1: ffff8881176bfdd8 ((work_completion)(&td->dispatch_work)){+.+.}-{0:0}, at: process_one_work+0x777/0x13f0
   #2: ffff888118c00c28 (&bfqd->lock){?.-.}-{2:2}, at: spin_lock_irq
   #2: ffff888118c00c28 (&bfqd->lock){?.-.}-{2:2}, at: bfq_bio_merge+0x141/0x390

  stack backtrace:
  CPU: 2 PID: 388 Comm: kworker/2:3 Not tainted 5.10.0-02758-g8e5f91fd772f #26
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
  Workqueue: kthrotld blk_throtl_dispatch_work_fn
  Call Trace:
   __dump_stack lib/dump_stack.c:77 [inline]
   dump_stack+0x107/0x167
   print_usage_bug
   valid_state
   mark_lock_irq.cold+0x32/0x3a
   mark_lock+0x693/0xbc0
   mark_held_locks+0x9e/0xe0
   __trace_hardirqs_on_caller
   lockdep_hardirqs_on_prepare.part.0+0x151/0x360
   trace_hardirqs_on+0x5b/0x180
   __raw_spin_unlock_irq
   _raw_spin_unlock_irq+0x24/0x40
   spin_unlock_irq
   adjust_inuse_and_calc_cost+0x4fb/0x970
   ioc_rqos_merge+0x277/0x740
   __rq_qos_merge+0x62/0xb0
   rq_qos_merge
   bio_attempt_back_merge+0x12c/0x4a0
   blk_mq_sched_try_merge+0x1b6/0x4d0
   bfq_bio_merge+0x24a/0x390
   __blk_mq_sched_bio_merge+0xa6/0x460
   blk_mq_sched_bio_merge
   blk_mq_submit_bio+0x2e7/0x1ee0
   __submit_bio_noacct_mq+0x175/0x3b0
   submit_bio_noacct+0x1fb/0x270
   blk_throtl_dispatch_work_fn+0x1ef/0x2b0
   process_one_work+0x83e/0x13f0
   process_scheduled_works
   worker_thread+0x7e3/0xd80
   kthread+0x353/0x470
   ret_from_fork+0x1f/0x30

Fixes: b0853ab ("blk-iocost: revamp in-period donation snapbacks")
Signed-off-by: Li Nan <linan122@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20230527091904.3001833-1-linan666@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
popcornmix pushed a commit that referenced this issue Jul 20, 2023
[ Upstream commit 8d21155 ]

adjust_inuse_and_calc_cost() use spin_lock_irq() and IRQ will be enabled
when unlock. DEADLOCK might happen if we have held other locks and disabled
IRQ before invoking it.

Fix it by using spin_lock_irqsave() instead, which can keep IRQ state
consistent with before when unlock.

  ================================
  WARNING: inconsistent lock state
  5.10.0-02758-g8e5f91fd772f #26 Not tainted
  --------------------------------
  inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.
  kworker/2:3/388 [HC0[0]:SC0[0]:HE0:SE1] takes:
  ffff888118c00c28 (&bfqd->lock){?.-.}-{2:2}, at: spin_lock_irq
  ffff888118c00c28 (&bfqd->lock){?.-.}-{2:2}, at: bfq_bio_merge+0x141/0x390
  {IN-HARDIRQ-W} state was registered at:
    __lock_acquire+0x3d7/0x1070
    lock_acquire+0x197/0x4a0
    __raw_spin_lock_irqsave
    _raw_spin_lock_irqsave+0x3b/0x60
    bfq_idle_slice_timer_body
    bfq_idle_slice_timer+0x53/0x1d0
    __run_hrtimer+0x477/0xa70
    __hrtimer_run_queues+0x1c6/0x2d0
    hrtimer_interrupt+0x302/0x9e0
    local_apic_timer_interrupt
    __sysvec_apic_timer_interrupt+0xfd/0x420
    run_sysvec_on_irqstack_cond
    sysvec_apic_timer_interrupt+0x46/0xa0
    asm_sysvec_apic_timer_interrupt+0x12/0x20
  irq event stamp: 837522
  hardirqs last  enabled at (837521): [<ffffffff84b9419d>] __raw_spin_unlock_irqrestore
  hardirqs last  enabled at (837521): [<ffffffff84b9419d>] _raw_spin_unlock_irqrestore+0x3d/0x40
  hardirqs last disabled at (837522): [<ffffffff84b93fa3>] __raw_spin_lock_irq
  hardirqs last disabled at (837522): [<ffffffff84b93fa3>] _raw_spin_lock_irq+0x43/0x50
  softirqs last  enabled at (835852): [<ffffffff84e00558>] __do_softirq+0x558/0x8ec
  softirqs last disabled at (835845): [<ffffffff84c010ff>] asm_call_irq_on_stack+0xf/0x20

  other info that might help us debug this:
   Possible unsafe locking scenario:

         CPU0
         ----
    lock(&bfqd->lock);
    <Interrupt>
      lock(&bfqd->lock);

   *** DEADLOCK ***

  3 locks held by kworker/2:3/388:
   #0: ffff888107af0f38 ((wq_completion)kthrotld){+.+.}-{0:0}, at: process_one_work+0x742/0x13f0
   #1: ffff8881176bfdd8 ((work_completion)(&td->dispatch_work)){+.+.}-{0:0}, at: process_one_work+0x777/0x13f0
   #2: ffff888118c00c28 (&bfqd->lock){?.-.}-{2:2}, at: spin_lock_irq
   #2: ffff888118c00c28 (&bfqd->lock){?.-.}-{2:2}, at: bfq_bio_merge+0x141/0x390

  stack backtrace:
  CPU: 2 PID: 388 Comm: kworker/2:3 Not tainted 5.10.0-02758-g8e5f91fd772f #26
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
  Workqueue: kthrotld blk_throtl_dispatch_work_fn
  Call Trace:
   __dump_stack lib/dump_stack.c:77 [inline]
   dump_stack+0x107/0x167
   print_usage_bug
   valid_state
   mark_lock_irq.cold+0x32/0x3a
   mark_lock+0x693/0xbc0
   mark_held_locks+0x9e/0xe0
   __trace_hardirqs_on_caller
   lockdep_hardirqs_on_prepare.part.0+0x151/0x360
   trace_hardirqs_on+0x5b/0x180
   __raw_spin_unlock_irq
   _raw_spin_unlock_irq+0x24/0x40
   spin_unlock_irq
   adjust_inuse_and_calc_cost+0x4fb/0x970
   ioc_rqos_merge+0x277/0x740
   __rq_qos_merge+0x62/0xb0
   rq_qos_merge
   bio_attempt_back_merge+0x12c/0x4a0
   blk_mq_sched_try_merge+0x1b6/0x4d0
   bfq_bio_merge+0x24a/0x390
   __blk_mq_sched_bio_merge+0xa6/0x460
   blk_mq_sched_bio_merge
   blk_mq_submit_bio+0x2e7/0x1ee0
   __submit_bio_noacct_mq+0x175/0x3b0
   submit_bio_noacct+0x1fb/0x270
   blk_throtl_dispatch_work_fn+0x1ef/0x2b0
   process_one_work+0x83e/0x13f0
   process_scheduled_works
   worker_thread+0x7e3/0xd80
   kthread+0x353/0x470
   ret_from_fork+0x1f/0x30

Fixes: b0853ab ("blk-iocost: revamp in-period donation snapbacks")
Signed-off-by: Li Nan <linan122@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20230527091904.3001833-1-linan666@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
popcornmix pushed a commit that referenced this issue Sep 13, 2023
commit 2ea3528 upstream.

Commit bf5c25d ("skbuff: in skb_segment, call zerocopy functions
once per nskb") added the call to zero copy functions in skb_segment().
The change introduced a bug in skb_segment() because skb_orphan_frags()
may possibly change the number of fragments or allocate new fragments
altogether leaving nrfrags and frag to point to the old values. This can
cause a panic with stacktrace like the one below.

[  193.894380] BUG: kernel NULL pointer dereference, address: 00000000000000bc
[  193.895273] CPU: 13 PID: 18164 Comm: vh-net-17428 Kdump: loaded Tainted: G           O      5.15.123+ #26
[  193.903919] RIP: 0010:skb_segment+0xb0e/0x12f0
[  194.021892] Call Trace:
[  194.027422]  <TASK>
[  194.072861]  tcp_gso_segment+0x107/0x540
[  194.082031]  inet_gso_segment+0x15c/0x3d0
[  194.090783]  skb_mac_gso_segment+0x9f/0x110
[  194.095016]  __skb_gso_segment+0xc1/0x190
[  194.103131]  netem_enqueue+0x290/0xb10 [sch_netem]
[  194.107071]  dev_qdisc_enqueue+0x16/0x70
[  194.110884]  __dev_queue_xmit+0x63b/0xb30
[  194.121670]  bond_start_xmit+0x159/0x380 [bonding]
[  194.128506]  dev_hard_start_xmit+0xc3/0x1e0
[  194.131787]  __dev_queue_xmit+0x8a0/0xb30
[  194.138225]  macvlan_start_xmit+0x4f/0x100 [macvlan]
[  194.141477]  dev_hard_start_xmit+0xc3/0x1e0
[  194.144622]  sch_direct_xmit+0xe3/0x280
[  194.147748]  __dev_queue_xmit+0x54a/0xb30
[  194.154131]  tap_get_user+0x2a8/0x9c0 [tap]
[  194.157358]  tap_sendmsg+0x52/0x8e0 [tap]
[  194.167049]  handle_tx_zerocopy+0x14e/0x4c0 [vhost_net]
[  194.173631]  handle_tx+0xcd/0xe0 [vhost_net]
[  194.176959]  vhost_worker+0x76/0xb0 [vhost]
[  194.183667]  kthread+0x118/0x140
[  194.190358]  ret_from_fork+0x1f/0x30
[  194.193670]  </TASK>

In this case calling skb_orphan_frags() updated nr_frags leaving nrfrags
local variable in skb_segment() stale. This resulted in the code hitting
i >= nrfrags prematurely and trying to move to next frag_skb using
list_skb pointer, which was NULL, and caused kernel panic. Move the call
to zero copy functions before using frags and nr_frags.

Fixes: bf5c25d ("skbuff: in skb_segment, call zerocopy functions once per nskb")
Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
Reported-by: Amit Goyal <agoyal@purestorage.com>
Cc: stable@vger.kernel.org
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
popcornmix pushed a commit that referenced this issue Sep 13, 2023
commit 2ea3528 upstream.

Commit bf5c25d ("skbuff: in skb_segment, call zerocopy functions
once per nskb") added the call to zero copy functions in skb_segment().
The change introduced a bug in skb_segment() because skb_orphan_frags()
may possibly change the number of fragments or allocate new fragments
altogether leaving nrfrags and frag to point to the old values. This can
cause a panic with stacktrace like the one below.

[  193.894380] BUG: kernel NULL pointer dereference, address: 00000000000000bc
[  193.895273] CPU: 13 PID: 18164 Comm: vh-net-17428 Kdump: loaded Tainted: G           O      5.15.123+ #26
[  193.903919] RIP: 0010:skb_segment+0xb0e/0x12f0
[  194.021892] Call Trace:
[  194.027422]  <TASK>
[  194.072861]  tcp_gso_segment+0x107/0x540
[  194.082031]  inet_gso_segment+0x15c/0x3d0
[  194.090783]  skb_mac_gso_segment+0x9f/0x110
[  194.095016]  __skb_gso_segment+0xc1/0x190
[  194.103131]  netem_enqueue+0x290/0xb10 [sch_netem]
[  194.107071]  dev_qdisc_enqueue+0x16/0x70
[  194.110884]  __dev_queue_xmit+0x63b/0xb30
[  194.121670]  bond_start_xmit+0x159/0x380 [bonding]
[  194.128506]  dev_hard_start_xmit+0xc3/0x1e0
[  194.131787]  __dev_queue_xmit+0x8a0/0xb30
[  194.138225]  macvlan_start_xmit+0x4f/0x100 [macvlan]
[  194.141477]  dev_hard_start_xmit+0xc3/0x1e0
[  194.144622]  sch_direct_xmit+0xe3/0x280
[  194.147748]  __dev_queue_xmit+0x54a/0xb30
[  194.154131]  tap_get_user+0x2a8/0x9c0 [tap]
[  194.157358]  tap_sendmsg+0x52/0x8e0 [tap]
[  194.167049]  handle_tx_zerocopy+0x14e/0x4c0 [vhost_net]
[  194.173631]  handle_tx+0xcd/0xe0 [vhost_net]
[  194.176959]  vhost_worker+0x76/0xb0 [vhost]
[  194.183667]  kthread+0x118/0x140
[  194.190358]  ret_from_fork+0x1f/0x30
[  194.193670]  </TASK>

In this case calling skb_orphan_frags() updated nr_frags leaving nrfrags
local variable in skb_segment() stale. This resulted in the code hitting
i >= nrfrags prematurely and trying to move to next frag_skb using
list_skb pointer, which was NULL, and caused kernel panic. Move the call
to zero copy functions before using frags and nr_frags.

Fixes: bf5c25d ("skbuff: in skb_segment, call zerocopy functions once per nskb")
Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
Reported-by: Amit Goyal <agoyal@purestorage.com>
Cc: stable@vger.kernel.org
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
popcornmix pushed a commit that referenced this issue Sep 13, 2023
commit 2ea3528 upstream.

Commit bf5c25d ("skbuff: in skb_segment, call zerocopy functions
once per nskb") added the call to zero copy functions in skb_segment().
The change introduced a bug in skb_segment() because skb_orphan_frags()
may possibly change the number of fragments or allocate new fragments
altogether leaving nrfrags and frag to point to the old values. This can
cause a panic with stacktrace like the one below.

[  193.894380] BUG: kernel NULL pointer dereference, address: 00000000000000bc
[  193.895273] CPU: 13 PID: 18164 Comm: vh-net-17428 Kdump: loaded Tainted: G           O      5.15.123+ #26
[  193.903919] RIP: 0010:skb_segment+0xb0e/0x12f0
[  194.021892] Call Trace:
[  194.027422]  <TASK>
[  194.072861]  tcp_gso_segment+0x107/0x540
[  194.082031]  inet_gso_segment+0x15c/0x3d0
[  194.090783]  skb_mac_gso_segment+0x9f/0x110
[  194.095016]  __skb_gso_segment+0xc1/0x190
[  194.103131]  netem_enqueue+0x290/0xb10 [sch_netem]
[  194.107071]  dev_qdisc_enqueue+0x16/0x70
[  194.110884]  __dev_queue_xmit+0x63b/0xb30
[  194.121670]  bond_start_xmit+0x159/0x380 [bonding]
[  194.128506]  dev_hard_start_xmit+0xc3/0x1e0
[  194.131787]  __dev_queue_xmit+0x8a0/0xb30
[  194.138225]  macvlan_start_xmit+0x4f/0x100 [macvlan]
[  194.141477]  dev_hard_start_xmit+0xc3/0x1e0
[  194.144622]  sch_direct_xmit+0xe3/0x280
[  194.147748]  __dev_queue_xmit+0x54a/0xb30
[  194.154131]  tap_get_user+0x2a8/0x9c0 [tap]
[  194.157358]  tap_sendmsg+0x52/0x8e0 [tap]
[  194.167049]  handle_tx_zerocopy+0x14e/0x4c0 [vhost_net]
[  194.173631]  handle_tx+0xcd/0xe0 [vhost_net]
[  194.176959]  vhost_worker+0x76/0xb0 [vhost]
[  194.183667]  kthread+0x118/0x140
[  194.190358]  ret_from_fork+0x1f/0x30
[  194.193670]  </TASK>

In this case calling skb_orphan_frags() updated nr_frags leaving nrfrags
local variable in skb_segment() stale. This resulted in the code hitting
i >= nrfrags prematurely and trying to move to next frag_skb using
list_skb pointer, which was NULL, and caused kernel panic. Move the call
to zero copy functions before using frags and nr_frags.

Fixes: bf5c25d ("skbuff: in skb_segment, call zerocopy functions once per nskb")
Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
Reported-by: Amit Goyal <agoyal@purestorage.com>
Cc: stable@vger.kernel.org
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
vvarma pushed a commit to vvarma/linux that referenced this issue Sep 19, 2023
commit 2ea3528 upstream.

Commit bf5c25d ("skbuff: in skb_segment, call zerocopy functions
once per nskb") added the call to zero copy functions in skb_segment().
The change introduced a bug in skb_segment() because skb_orphan_frags()
may possibly change the number of fragments or allocate new fragments
altogether leaving nrfrags and frag to point to the old values. This can
cause a panic with stacktrace like the one below.

[  193.894380] BUG: kernel NULL pointer dereference, address: 00000000000000bc
[  193.895273] CPU: 13 PID: 18164 Comm: vh-net-17428 Kdump: loaded Tainted: G           O      5.15.123+ raspberrypi#26
[  193.903919] RIP: 0010:skb_segment+0xb0e/0x12f0
[  194.021892] Call Trace:
[  194.027422]  <TASK>
[  194.072861]  tcp_gso_segment+0x107/0x540
[  194.082031]  inet_gso_segment+0x15c/0x3d0
[  194.090783]  skb_mac_gso_segment+0x9f/0x110
[  194.095016]  __skb_gso_segment+0xc1/0x190
[  194.103131]  netem_enqueue+0x290/0xb10 [sch_netem]
[  194.107071]  dev_qdisc_enqueue+0x16/0x70
[  194.110884]  __dev_queue_xmit+0x63b/0xb30
[  194.121670]  bond_start_xmit+0x159/0x380 [bonding]
[  194.128506]  dev_hard_start_xmit+0xc3/0x1e0
[  194.131787]  __dev_queue_xmit+0x8a0/0xb30
[  194.138225]  macvlan_start_xmit+0x4f/0x100 [macvlan]
[  194.141477]  dev_hard_start_xmit+0xc3/0x1e0
[  194.144622]  sch_direct_xmit+0xe3/0x280
[  194.147748]  __dev_queue_xmit+0x54a/0xb30
[  194.154131]  tap_get_user+0x2a8/0x9c0 [tap]
[  194.157358]  tap_sendmsg+0x52/0x8e0 [tap]
[  194.167049]  handle_tx_zerocopy+0x14e/0x4c0 [vhost_net]
[  194.173631]  handle_tx+0xcd/0xe0 [vhost_net]
[  194.176959]  vhost_worker+0x76/0xb0 [vhost]
[  194.183667]  kthread+0x118/0x140
[  194.190358]  ret_from_fork+0x1f/0x30
[  194.193670]  </TASK>

In this case calling skb_orphan_frags() updated nr_frags leaving nrfrags
local variable in skb_segment() stale. This resulted in the code hitting
i >= nrfrags prematurely and trying to move to next frag_skb using
list_skb pointer, which was NULL, and caused kernel panic. Move the call
to zero copy functions before using frags and nr_frags.

Fixes: bf5c25d ("skbuff: in skb_segment, call zerocopy functions once per nskb")
Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
Reported-by: Amit Goyal <agoyal@purestorage.com>
Cc: stable@vger.kernel.org
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
popcornmix pushed a commit that referenced this issue Sep 19, 2023
The following processes run into a deadlock. CPU 41 was waiting for CPU 29
to handle a CSD request while holding spinlock "crashdump_lock", but CPU 29
was hung by that spinlock with IRQs disabled.

  PID: 17360    TASK: ffff95c1090c5c40  CPU: 41  COMMAND: "mrdiagd"
  !# 0 [ffffb80edbf37b58] __read_once_size at ffffffff9b871a40 include/linux/compiler.h:185:0
  !# 1 [ffffb80edbf37b58] atomic_read at ffffffff9b871a40 arch/x86/include/asm/atomic.h:27:0
  !# 2 [ffffb80edbf37b58] dump_stack at ffffffff9b871a40 lib/dump_stack.c:54:0
   # 3 [ffffb80edbf37b78] csd_lock_wait_toolong at ffffffff9b131ad5 kernel/smp.c:364:0
   # 4 [ffffb80edbf37b78] __csd_lock_wait at ffffffff9b131ad5 kernel/smp.c:384:0
   # 5 [ffffb80edbf37bf8] csd_lock_wait at ffffffff9b13267a kernel/smp.c:394:0
   # 6 [ffffb80edbf37bf8] smp_call_function_many at ffffffff9b13267a kernel/smp.c:843:0
   # 7 [ffffb80edbf37c50] smp_call_function at ffffffff9b13279d kernel/smp.c:867:0
   # 8 [ffffb80edbf37c50] on_each_cpu at ffffffff9b13279d kernel/smp.c:976:0
   # 9 [ffffb80edbf37c78] flush_tlb_kernel_range at ffffffff9b085c4b arch/x86/mm/tlb.c:742:0
   #10 [ffffb80edbf37cb8] __purge_vmap_area_lazy at ffffffff9b23a1e0 mm/vmalloc.c:701:0
   #11 [ffffb80edbf37ce0] try_purge_vmap_area_lazy at ffffffff9b23a2cc mm/vmalloc.c:722:0
   #12 [ffffb80edbf37ce0] free_vmap_area_noflush at ffffffff9b23a2cc mm/vmalloc.c:754:0
   #13 [ffffb80edbf37cf8] free_unmap_vmap_area at ffffffff9b23bb3b mm/vmalloc.c:764:0
   #14 [ffffb80edbf37cf8] remove_vm_area at ffffffff9b23bb3b mm/vmalloc.c:1509:0
   #15 [ffffb80edbf37d18] __vunmap at ffffffff9b23bb8a mm/vmalloc.c:1537:0
   #16 [ffffb80edbf37d40] vfree at ffffffff9b23bc85 mm/vmalloc.c:1612:0
   #17 [ffffb80edbf37d58] megasas_free_host_crash_buffer [megaraid_sas] at ffffffffc020b7f2 drivers/scsi/megaraid/megaraid_sas_fusion.c:3932:0
   #18 [ffffb80edbf37d80] fw_crash_state_store [megaraid_sas] at ffffffffc01f804d drivers/scsi/megaraid/megaraid_sas_base.c:3291:0
   #19 [ffffb80edbf37dc0] dev_attr_store at ffffffff9b56dd7b drivers/base/core.c:758:0
   #20 [ffffb80edbf37dd0] sysfs_kf_write at ffffffff9b326acf fs/sysfs/file.c:144:0
   #21 [ffffb80edbf37de0] kernfs_fop_write at ffffffff9b325fd4 fs/kernfs/file.c:316:0
   #22 [ffffb80edbf37e20] __vfs_write at ffffffff9b29418a fs/read_write.c:480:0
   #23 [ffffb80edbf37ea8] vfs_write at ffffffff9b294462 fs/read_write.c:544:0
   #24 [ffffb80edbf37ee8] SYSC_write at ffffffff9b2946ec fs/read_write.c:590:0
   #25 [ffffb80edbf37ee8] SyS_write at ffffffff9b2946ec fs/read_write.c:582:0
   #26 [ffffb80edbf37f30] do_syscall_64 at ffffffff9b003ca9 arch/x86/entry/common.c:298:0
   #27 [ffffb80edbf37f58] entry_SYSCALL_64 at ffffffff9ba001b1 arch/x86/entry/entry_64.S:238:0

  PID: 17355    TASK: ffff95c1090c3d80  CPU: 29  COMMAND: "mrdiagd"
  !# 0 [ffffb80f2d3c7d30] __read_once_size at ffffffff9b0f2ab0 include/linux/compiler.h:185:0
  !# 1 [ffffb80f2d3c7d30] native_queued_spin_lock_slowpath at ffffffff9b0f2ab0 kernel/locking/qspinlock.c:368:0
   # 2 [ffffb80f2d3c7d58] pv_queued_spin_lock_slowpath at ffffffff9b0f244b arch/x86/include/asm/paravirt.h:674:0
   # 3 [ffffb80f2d3c7d58] queued_spin_lock_slowpath at ffffffff9b0f244b arch/x86/include/asm/qspinlock.h:53:0
   # 4 [ffffb80f2d3c7d68] queued_spin_lock at ffffffff9b8961a6 include/asm-generic/qspinlock.h:90:0
   # 5 [ffffb80f2d3c7d68] do_raw_spin_lock_flags at ffffffff9b8961a6 include/linux/spinlock.h:173:0
   # 6 [ffffb80f2d3c7d68] __raw_spin_lock_irqsave at ffffffff9b8961a6 include/linux/spinlock_api_smp.h:122:0
   # 7 [ffffb80f2d3c7d68] _raw_spin_lock_irqsave at ffffffff9b8961a6 kernel/locking/spinlock.c:160:0
   # 8 [ffffb80f2d3c7d88] fw_crash_buffer_store [megaraid_sas] at ffffffffc01f8129 drivers/scsi/megaraid/megaraid_sas_base.c:3205:0
   # 9 [ffffb80f2d3c7dc0] dev_attr_store at ffffffff9b56dd7b drivers/base/core.c:758:0
   #10 [ffffb80f2d3c7dd0] sysfs_kf_write at ffffffff9b326acf fs/sysfs/file.c:144:0
   #11 [ffffb80f2d3c7de0] kernfs_fop_write at ffffffff9b325fd4 fs/kernfs/file.c:316:0
   #12 [ffffb80f2d3c7e20] __vfs_write at ffffffff9b29418a fs/read_write.c:480:0
   #13 [ffffb80f2d3c7ea8] vfs_write at ffffffff9b294462 fs/read_write.c:544:0
   #14 [ffffb80f2d3c7ee8] SYSC_write at ffffffff9b2946ec fs/read_write.c:590:0
   #15 [ffffb80f2d3c7ee8] SyS_write at ffffffff9b2946ec fs/read_write.c:582:0
   #16 [ffffb80f2d3c7f30] do_syscall_64 at ffffffff9b003ca9 arch/x86/entry/common.c:298:0
   #17 [ffffb80f2d3c7f58] entry_SYSCALL_64 at ffffffff9ba001b1 arch/x86/entry/entry_64.S:238:0

The lock is used to synchronize different sysfs operations, it doesn't
protect any resource that will be touched by an interrupt. Consequently
it's not required to disable IRQs. Replace the spinlock with a mutex to fix
the deadlock.

Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Link: https://lore.kernel.org/r/20230828221018.19471-1-junxiao.bi@oracle.com
Reviewed-by: Mike Christie <michael.christie@oracle.com>
Cc: stable@vger.kernel.org
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
popcornmix pushed a commit that referenced this issue Sep 25, 2023
commit 0b0747d upstream.

The following processes run into a deadlock. CPU 41 was waiting for CPU 29
to handle a CSD request while holding spinlock "crashdump_lock", but CPU 29
was hung by that spinlock with IRQs disabled.

  PID: 17360    TASK: ffff95c1090c5c40  CPU: 41  COMMAND: "mrdiagd"
  !# 0 [ffffb80edbf37b58] __read_once_size at ffffffff9b871a40 include/linux/compiler.h:185:0
  !# 1 [ffffb80edbf37b58] atomic_read at ffffffff9b871a40 arch/x86/include/asm/atomic.h:27:0
  !# 2 [ffffb80edbf37b58] dump_stack at ffffffff9b871a40 lib/dump_stack.c:54:0
   # 3 [ffffb80edbf37b78] csd_lock_wait_toolong at ffffffff9b131ad5 kernel/smp.c:364:0
   # 4 [ffffb80edbf37b78] __csd_lock_wait at ffffffff9b131ad5 kernel/smp.c:384:0
   # 5 [ffffb80edbf37bf8] csd_lock_wait at ffffffff9b13267a kernel/smp.c:394:0
   # 6 [ffffb80edbf37bf8] smp_call_function_many at ffffffff9b13267a kernel/smp.c:843:0
   # 7 [ffffb80edbf37c50] smp_call_function at ffffffff9b13279d kernel/smp.c:867:0
   # 8 [ffffb80edbf37c50] on_each_cpu at ffffffff9b13279d kernel/smp.c:976:0
   # 9 [ffffb80edbf37c78] flush_tlb_kernel_range at ffffffff9b085c4b arch/x86/mm/tlb.c:742:0
   #10 [ffffb80edbf37cb8] __purge_vmap_area_lazy at ffffffff9b23a1e0 mm/vmalloc.c:701:0
   #11 [ffffb80edbf37ce0] try_purge_vmap_area_lazy at ffffffff9b23a2cc mm/vmalloc.c:722:0
   #12 [ffffb80edbf37ce0] free_vmap_area_noflush at ffffffff9b23a2cc mm/vmalloc.c:754:0
   #13 [ffffb80edbf37cf8] free_unmap_vmap_area at ffffffff9b23bb3b mm/vmalloc.c:764:0
   #14 [ffffb80edbf37cf8] remove_vm_area at ffffffff9b23bb3b mm/vmalloc.c:1509:0
   #15 [ffffb80edbf37d18] __vunmap at ffffffff9b23bb8a mm/vmalloc.c:1537:0
   #16 [ffffb80edbf37d40] vfree at ffffffff9b23bc85 mm/vmalloc.c:1612:0
   #17 [ffffb80edbf37d58] megasas_free_host_crash_buffer [megaraid_sas] at ffffffffc020b7f2 drivers/scsi/megaraid/megaraid_sas_fusion.c:3932:0
   #18 [ffffb80edbf37d80] fw_crash_state_store [megaraid_sas] at ffffffffc01f804d drivers/scsi/megaraid/megaraid_sas_base.c:3291:0
   #19 [ffffb80edbf37dc0] dev_attr_store at ffffffff9b56dd7b drivers/base/core.c:758:0
   #20 [ffffb80edbf37dd0] sysfs_kf_write at ffffffff9b326acf fs/sysfs/file.c:144:0
   #21 [ffffb80edbf37de0] kernfs_fop_write at ffffffff9b325fd4 fs/kernfs/file.c:316:0
   #22 [ffffb80edbf37e20] __vfs_write at ffffffff9b29418a fs/read_write.c:480:0
   #23 [ffffb80edbf37ea8] vfs_write at ffffffff9b294462 fs/read_write.c:544:0
   #24 [ffffb80edbf37ee8] SYSC_write at ffffffff9b2946ec fs/read_write.c:590:0
   #25 [ffffb80edbf37ee8] SyS_write at ffffffff9b2946ec fs/read_write.c:582:0
   #26 [ffffb80edbf37f30] do_syscall_64 at ffffffff9b003ca9 arch/x86/entry/common.c:298:0
   #27 [ffffb80edbf37f58] entry_SYSCALL_64 at ffffffff9ba001b1 arch/x86/entry/entry_64.S:238:0

  PID: 17355    TASK: ffff95c1090c3d80  CPU: 29  COMMAND: "mrdiagd"
  !# 0 [ffffb80f2d3c7d30] __read_once_size at ffffffff9b0f2ab0 include/linux/compiler.h:185:0
  !# 1 [ffffb80f2d3c7d30] native_queued_spin_lock_slowpath at ffffffff9b0f2ab0 kernel/locking/qspinlock.c:368:0
   # 2 [ffffb80f2d3c7d58] pv_queued_spin_lock_slowpath at ffffffff9b0f244b arch/x86/include/asm/paravirt.h:674:0
   # 3 [ffffb80f2d3c7d58] queued_spin_lock_slowpath at ffffffff9b0f244b arch/x86/include/asm/qspinlock.h:53:0
   # 4 [ffffb80f2d3c7d68] queued_spin_lock at ffffffff9b8961a6 include/asm-generic/qspinlock.h:90:0
   # 5 [ffffb80f2d3c7d68] do_raw_spin_lock_flags at ffffffff9b8961a6 include/linux/spinlock.h:173:0
   # 6 [ffffb80f2d3c7d68] __raw_spin_lock_irqsave at ffffffff9b8961a6 include/linux/spinlock_api_smp.h:122:0
   # 7 [ffffb80f2d3c7d68] _raw_spin_lock_irqsave at ffffffff9b8961a6 kernel/locking/spinlock.c:160:0
   # 8 [ffffb80f2d3c7d88] fw_crash_buffer_store [megaraid_sas] at ffffffffc01f8129 drivers/scsi/megaraid/megaraid_sas_base.c:3205:0
   # 9 [ffffb80f2d3c7dc0] dev_attr_store at ffffffff9b56dd7b drivers/base/core.c:758:0
   #10 [ffffb80f2d3c7dd0] sysfs_kf_write at ffffffff9b326acf fs/sysfs/file.c:144:0
   #11 [ffffb80f2d3c7de0] kernfs_fop_write at ffffffff9b325fd4 fs/kernfs/file.c:316:0
   #12 [ffffb80f2d3c7e20] __vfs_write at ffffffff9b29418a fs/read_write.c:480:0
   #13 [ffffb80f2d3c7ea8] vfs_write at ffffffff9b294462 fs/read_write.c:544:0
   #14 [ffffb80f2d3c7ee8] SYSC_write at ffffffff9b2946ec fs/read_write.c:590:0
   #15 [ffffb80f2d3c7ee8] SyS_write at ffffffff9b2946ec fs/read_write.c:582:0
   #16 [ffffb80f2d3c7f30] do_syscall_64 at ffffffff9b003ca9 arch/x86/entry/common.c:298:0
   #17 [ffffb80f2d3c7f58] entry_SYSCALL_64 at ffffffff9ba001b1 arch/x86/entry/entry_64.S:238:0

The lock is used to synchronize different sysfs operations, it doesn't
protect any resource that will be touched by an interrupt. Consequently
it's not required to disable IRQs. Replace the spinlock with a mutex to fix
the deadlock.

Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Link: https://lore.kernel.org/r/20230828221018.19471-1-junxiao.bi@oracle.com
Reviewed-by: Mike Christie <michael.christie@oracle.com>
Cc: stable@vger.kernel.org
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
popcornmix pushed a commit that referenced this issue Oct 5, 2023
commit 0b0747d upstream.

The following processes run into a deadlock. CPU 41 was waiting for CPU 29
to handle a CSD request while holding spinlock "crashdump_lock", but CPU 29
was hung by that spinlock with IRQs disabled.

  PID: 17360    TASK: ffff95c1090c5c40  CPU: 41  COMMAND: "mrdiagd"
  !# 0 [ffffb80edbf37b58] __read_once_size at ffffffff9b871a40 include/linux/compiler.h:185:0
  !# 1 [ffffb80edbf37b58] atomic_read at ffffffff9b871a40 arch/x86/include/asm/atomic.h:27:0
  !# 2 [ffffb80edbf37b58] dump_stack at ffffffff9b871a40 lib/dump_stack.c:54:0
   # 3 [ffffb80edbf37b78] csd_lock_wait_toolong at ffffffff9b131ad5 kernel/smp.c:364:0
   # 4 [ffffb80edbf37b78] __csd_lock_wait at ffffffff9b131ad5 kernel/smp.c:384:0
   # 5 [ffffb80edbf37bf8] csd_lock_wait at ffffffff9b13267a kernel/smp.c:394:0
   # 6 [ffffb80edbf37bf8] smp_call_function_many at ffffffff9b13267a kernel/smp.c:843:0
   # 7 [ffffb80edbf37c50] smp_call_function at ffffffff9b13279d kernel/smp.c:867:0
   # 8 [ffffb80edbf37c50] on_each_cpu at ffffffff9b13279d kernel/smp.c:976:0
   # 9 [ffffb80edbf37c78] flush_tlb_kernel_range at ffffffff9b085c4b arch/x86/mm/tlb.c:742:0
   #10 [ffffb80edbf37cb8] __purge_vmap_area_lazy at ffffffff9b23a1e0 mm/vmalloc.c:701:0
   #11 [ffffb80edbf37ce0] try_purge_vmap_area_lazy at ffffffff9b23a2cc mm/vmalloc.c:722:0
   #12 [ffffb80edbf37ce0] free_vmap_area_noflush at ffffffff9b23a2cc mm/vmalloc.c:754:0
   #13 [ffffb80edbf37cf8] free_unmap_vmap_area at ffffffff9b23bb3b mm/vmalloc.c:764:0
   #14 [ffffb80edbf37cf8] remove_vm_area at ffffffff9b23bb3b mm/vmalloc.c:1509:0
   #15 [ffffb80edbf37d18] __vunmap at ffffffff9b23bb8a mm/vmalloc.c:1537:0
   #16 [ffffb80edbf37d40] vfree at ffffffff9b23bc85 mm/vmalloc.c:1612:0
   #17 [ffffb80edbf37d58] megasas_free_host_crash_buffer [megaraid_sas] at ffffffffc020b7f2 drivers/scsi/megaraid/megaraid_sas_fusion.c:3932:0
   #18 [ffffb80edbf37d80] fw_crash_state_store [megaraid_sas] at ffffffffc01f804d drivers/scsi/megaraid/megaraid_sas_base.c:3291:0
   #19 [ffffb80edbf37dc0] dev_attr_store at ffffffff9b56dd7b drivers/base/core.c:758:0
   #20 [ffffb80edbf37dd0] sysfs_kf_write at ffffffff9b326acf fs/sysfs/file.c:144:0
   #21 [ffffb80edbf37de0] kernfs_fop_write at ffffffff9b325fd4 fs/kernfs/file.c:316:0
   #22 [ffffb80edbf37e20] __vfs_write at ffffffff9b29418a fs/read_write.c:480:0
   #23 [ffffb80edbf37ea8] vfs_write at ffffffff9b294462 fs/read_write.c:544:0
   #24 [ffffb80edbf37ee8] SYSC_write at ffffffff9b2946ec fs/read_write.c:590:0
   #25 [ffffb80edbf37ee8] SyS_write at ffffffff9b2946ec fs/read_write.c:582:0
   #26 [ffffb80edbf37f30] do_syscall_64 at ffffffff9b003ca9 arch/x86/entry/common.c:298:0
   #27 [ffffb80edbf37f58] entry_SYSCALL_64 at ffffffff9ba001b1 arch/x86/entry/entry_64.S:238:0

  PID: 17355    TASK: ffff95c1090c3d80  CPU: 29  COMMAND: "mrdiagd"
  !# 0 [ffffb80f2d3c7d30] __read_once_size at ffffffff9b0f2ab0 include/linux/compiler.h:185:0
  !# 1 [ffffb80f2d3c7d30] native_queued_spin_lock_slowpath at ffffffff9b0f2ab0 kernel/locking/qspinlock.c:368:0
   # 2 [ffffb80f2d3c7d58] pv_queued_spin_lock_slowpath at ffffffff9b0f244b arch/x86/include/asm/paravirt.h:674:0
   # 3 [ffffb80f2d3c7d58] queued_spin_lock_slowpath at ffffffff9b0f244b arch/x86/include/asm/qspinlock.h:53:0
   # 4 [ffffb80f2d3c7d68] queued_spin_lock at ffffffff9b8961a6 include/asm-generic/qspinlock.h:90:0
   # 5 [ffffb80f2d3c7d68] do_raw_spin_lock_flags at ffffffff9b8961a6 include/linux/spinlock.h:173:0
   # 6 [ffffb80f2d3c7d68] __raw_spin_lock_irqsave at ffffffff9b8961a6 include/linux/spinlock_api_smp.h:122:0
   # 7 [ffffb80f2d3c7d68] _raw_spin_lock_irqsave at ffffffff9b8961a6 kernel/locking/spinlock.c:160:0
   # 8 [ffffb80f2d3c7d88] fw_crash_buffer_store [megaraid_sas] at ffffffffc01f8129 drivers/scsi/megaraid/megaraid_sas_base.c:3205:0
   # 9 [ffffb80f2d3c7dc0] dev_attr_store at ffffffff9b56dd7b drivers/base/core.c:758:0
   #10 [ffffb80f2d3c7dd0] sysfs_kf_write at ffffffff9b326acf fs/sysfs/file.c:144:0
   #11 [ffffb80f2d3c7de0] kernfs_fop_write at ffffffff9b325fd4 fs/kernfs/file.c:316:0
   #12 [ffffb80f2d3c7e20] __vfs_write at ffffffff9b29418a fs/read_write.c:480:0
   #13 [ffffb80f2d3c7ea8] vfs_write at ffffffff9b294462 fs/read_write.c:544:0
   #14 [ffffb80f2d3c7ee8] SYSC_write at ffffffff9b2946ec fs/read_write.c:590:0
   #15 [ffffb80f2d3c7ee8] SyS_write at ffffffff9b2946ec fs/read_write.c:582:0
   #16 [ffffb80f2d3c7f30] do_syscall_64 at ffffffff9b003ca9 arch/x86/entry/common.c:298:0
   #17 [ffffb80f2d3c7f58] entry_SYSCALL_64 at ffffffff9ba001b1 arch/x86/entry/entry_64.S:238:0

The lock is used to synchronize different sysfs operations, it doesn't
protect any resource that will be touched by an interrupt. Consequently
it's not required to disable IRQs. Replace the spinlock with a mutex to fix
the deadlock.

Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Link: https://lore.kernel.org/r/20230828221018.19471-1-junxiao.bi@oracle.com
Reviewed-by: Mike Christie <michael.christie@oracle.com>
Cc: stable@vger.kernel.org
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
naushir pushed a commit to naushir/linux that referenced this issue Oct 10, 2023
commit 0b0747d upstream.

The following processes run into a deadlock. CPU 41 was waiting for CPU 29
to handle a CSD request while holding spinlock "crashdump_lock", but CPU 29
was hung by that spinlock with IRQs disabled.

  PID: 17360    TASK: ffff95c1090c5c40  CPU: 41  COMMAND: "mrdiagd"
  !# 0 [ffffb80edbf37b58] __read_once_size at ffffffff9b871a40 include/linux/compiler.h:185:0
  !# 1 [ffffb80edbf37b58] atomic_read at ffffffff9b871a40 arch/x86/include/asm/atomic.h:27:0
  !# 2 [ffffb80edbf37b58] dump_stack at ffffffff9b871a40 lib/dump_stack.c:54:0
   # 3 [ffffb80edbf37b78] csd_lock_wait_toolong at ffffffff9b131ad5 kernel/smp.c:364:0
   # 4 [ffffb80edbf37b78] __csd_lock_wait at ffffffff9b131ad5 kernel/smp.c:384:0
   # 5 [ffffb80edbf37bf8] csd_lock_wait at ffffffff9b13267a kernel/smp.c:394:0
   # 6 [ffffb80edbf37bf8] smp_call_function_many at ffffffff9b13267a kernel/smp.c:843:0
   # 7 [ffffb80edbf37c50] smp_call_function at ffffffff9b13279d kernel/smp.c:867:0
   # 8 [ffffb80edbf37c50] on_each_cpu at ffffffff9b13279d kernel/smp.c:976:0
   # 9 [ffffb80edbf37c78] flush_tlb_kernel_range at ffffffff9b085c4b arch/x86/mm/tlb.c:742:0
   raspberrypi#10 [ffffb80edbf37cb8] __purge_vmap_area_lazy at ffffffff9b23a1e0 mm/vmalloc.c:701:0
   raspberrypi#11 [ffffb80edbf37ce0] try_purge_vmap_area_lazy at ffffffff9b23a2cc mm/vmalloc.c:722:0
   raspberrypi#12 [ffffb80edbf37ce0] free_vmap_area_noflush at ffffffff9b23a2cc mm/vmalloc.c:754:0
   raspberrypi#13 [ffffb80edbf37cf8] free_unmap_vmap_area at ffffffff9b23bb3b mm/vmalloc.c:764:0
   raspberrypi#14 [ffffb80edbf37cf8] remove_vm_area at ffffffff9b23bb3b mm/vmalloc.c:1509:0
   raspberrypi#15 [ffffb80edbf37d18] __vunmap at ffffffff9b23bb8a mm/vmalloc.c:1537:0
   raspberrypi#16 [ffffb80edbf37d40] vfree at ffffffff9b23bc85 mm/vmalloc.c:1612:0
   raspberrypi#17 [ffffb80edbf37d58] megasas_free_host_crash_buffer [megaraid_sas] at ffffffffc020b7f2 drivers/scsi/megaraid/megaraid_sas_fusion.c:3932:0
   raspberrypi#18 [ffffb80edbf37d80] fw_crash_state_store [megaraid_sas] at ffffffffc01f804d drivers/scsi/megaraid/megaraid_sas_base.c:3291:0
   raspberrypi#19 [ffffb80edbf37dc0] dev_attr_store at ffffffff9b56dd7b drivers/base/core.c:758:0
   raspberrypi#20 [ffffb80edbf37dd0] sysfs_kf_write at ffffffff9b326acf fs/sysfs/file.c:144:0
   raspberrypi#21 [ffffb80edbf37de0] kernfs_fop_write at ffffffff9b325fd4 fs/kernfs/file.c:316:0
   raspberrypi#22 [ffffb80edbf37e20] __vfs_write at ffffffff9b29418a fs/read_write.c:480:0
   raspberrypi#23 [ffffb80edbf37ea8] vfs_write at ffffffff9b294462 fs/read_write.c:544:0
   raspberrypi#24 [ffffb80edbf37ee8] SYSC_write at ffffffff9b2946ec fs/read_write.c:590:0
   raspberrypi#25 [ffffb80edbf37ee8] SyS_write at ffffffff9b2946ec fs/read_write.c:582:0
   raspberrypi#26 [ffffb80edbf37f30] do_syscall_64 at ffffffff9b003ca9 arch/x86/entry/common.c:298:0
   raspberrypi#27 [ffffb80edbf37f58] entry_SYSCALL_64 at ffffffff9ba001b1 arch/x86/entry/entry_64.S:238:0

  PID: 17355    TASK: ffff95c1090c3d80  CPU: 29  COMMAND: "mrdiagd"
  !# 0 [ffffb80f2d3c7d30] __read_once_size at ffffffff9b0f2ab0 include/linux/compiler.h:185:0
  !# 1 [ffffb80f2d3c7d30] native_queued_spin_lock_slowpath at ffffffff9b0f2ab0 kernel/locking/qspinlock.c:368:0
   # 2 [ffffb80f2d3c7d58] pv_queued_spin_lock_slowpath at ffffffff9b0f244b arch/x86/include/asm/paravirt.h:674:0
   # 3 [ffffb80f2d3c7d58] queued_spin_lock_slowpath at ffffffff9b0f244b arch/x86/include/asm/qspinlock.h:53:0
   # 4 [ffffb80f2d3c7d68] queued_spin_lock at ffffffff9b8961a6 include/asm-generic/qspinlock.h:90:0
   # 5 [ffffb80f2d3c7d68] do_raw_spin_lock_flags at ffffffff9b8961a6 include/linux/spinlock.h:173:0
   # 6 [ffffb80f2d3c7d68] __raw_spin_lock_irqsave at ffffffff9b8961a6 include/linux/spinlock_api_smp.h:122:0
   # 7 [ffffb80f2d3c7d68] _raw_spin_lock_irqsave at ffffffff9b8961a6 kernel/locking/spinlock.c:160:0
   # 8 [ffffb80f2d3c7d88] fw_crash_buffer_store [megaraid_sas] at ffffffffc01f8129 drivers/scsi/megaraid/megaraid_sas_base.c:3205:0
   # 9 [ffffb80f2d3c7dc0] dev_attr_store at ffffffff9b56dd7b drivers/base/core.c:758:0
   raspberrypi#10 [ffffb80f2d3c7dd0] sysfs_kf_write at ffffffff9b326acf fs/sysfs/file.c:144:0
   raspberrypi#11 [ffffb80f2d3c7de0] kernfs_fop_write at ffffffff9b325fd4 fs/kernfs/file.c:316:0
   raspberrypi#12 [ffffb80f2d3c7e20] __vfs_write at ffffffff9b29418a fs/read_write.c:480:0
   raspberrypi#13 [ffffb80f2d3c7ea8] vfs_write at ffffffff9b294462 fs/read_write.c:544:0
   raspberrypi#14 [ffffb80f2d3c7ee8] SYSC_write at ffffffff9b2946ec fs/read_write.c:590:0
   raspberrypi#15 [ffffb80f2d3c7ee8] SyS_write at ffffffff9b2946ec fs/read_write.c:582:0
   raspberrypi#16 [ffffb80f2d3c7f30] do_syscall_64 at ffffffff9b003ca9 arch/x86/entry/common.c:298:0
   raspberrypi#17 [ffffb80f2d3c7f58] entry_SYSCALL_64 at ffffffff9ba001b1 arch/x86/entry/entry_64.S:238:0

The lock is used to synchronize different sysfs operations, it doesn't
protect any resource that will be touched by an interrupt. Consequently
it's not required to disable IRQs. Replace the spinlock with a mutex to fix
the deadlock.

Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Link: https://lore.kernel.org/r/20230828221018.19471-1-junxiao.bi@oracle.com
Reviewed-by: Mike Christie <michael.christie@oracle.com>
Cc: stable@vger.kernel.org
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
popcornmix pushed a commit that referenced this issue Dec 5, 2023
As of commit b92143d ("net: dsa: mv88e6xxx: add infrastructure for
phylink_pcs") probing of a Marvell 88e6350 switch causes a NULL pointer
de-reference like this example:

    ...
    mv88e6085 d0072004.mdio-mii:11: switch 0x3710 detected: Marvell 88E6350, revision 2
    8<--- cut here ---
    Unable to handle kernel NULL pointer dereference at virtual address 00000000 when read
    [00000000] *pgd=00000000
    Internal error: Oops: 5 [#1] ARM
    Modules linked in:
    CPU: 0 PID: 8 Comm: kworker/u2:0 Not tainted 6.7.0-rc2-dirty #26
    Hardware name: Marvell Armada 370/XP (Device Tree)
    Workqueue: events_unbound deferred_probe_work_func
    PC is at mv88e6xxx_port_setup+0x1c/0x44
    LR is at dsa_port_devlink_setup+0x74/0x154
    pc : [<c057ea24>]    lr : [<c0819598>]    psr: a0000013
    sp : c184fce0  ip : c542b8f4  fp : 00000000
    r10: 00000001  r9 : c542a540  r8 : c542bc00
    r7 : c542b838  r6 : c5244580  r5 : 00000005  r4 : c5244580
    r3 : 00000000  r2 : c542b840  r1 : 00000005  r0 : c1a02040
    ...

The Marvell 6350 switch has no SERDES interface and so has no
corresponding pcs_ops defined for it. But during probing a call is made
to mv88e6xxx_port_setup() which unconditionally expects pcs_ops to exist -
though the presence of the pcs_ops->pcs_init function is optional.

Modify code to check for pcs_ops first, before checking for and calling
pcs_ops->pcs_init. Modify checking and use of pcs_ops->pcs_teardown
which may potentially suffer the same problem.

Fixes: b92143d ("net: dsa: mv88e6xxx: add infrastructure for phylink_pcs")
Signed-off-by: Greg Ungerer <gerg@kernel.org>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
popcornmix pushed a commit that referenced this issue Dec 8, 2023
[ Upstream commit a524eab ]

As of commit b92143d ("net: dsa: mv88e6xxx: add infrastructure for
phylink_pcs") probing of a Marvell 88e6350 switch causes a NULL pointer
de-reference like this example:

    ...
    mv88e6085 d0072004.mdio-mii:11: switch 0x3710 detected: Marvell 88E6350, revision 2
    8<--- cut here ---
    Unable to handle kernel NULL pointer dereference at virtual address 00000000 when read
    [00000000] *pgd=00000000
    Internal error: Oops: 5 [#1] ARM
    Modules linked in:
    CPU: 0 PID: 8 Comm: kworker/u2:0 Not tainted 6.7.0-rc2-dirty #26
    Hardware name: Marvell Armada 370/XP (Device Tree)
    Workqueue: events_unbound deferred_probe_work_func
    PC is at mv88e6xxx_port_setup+0x1c/0x44
    LR is at dsa_port_devlink_setup+0x74/0x154
    pc : [<c057ea24>]    lr : [<c0819598>]    psr: a0000013
    sp : c184fce0  ip : c542b8f4  fp : 00000000
    r10: 00000001  r9 : c542a540  r8 : c542bc00
    r7 : c542b838  r6 : c5244580  r5 : 00000005  r4 : c5244580
    r3 : 00000000  r2 : c542b840  r1 : 00000005  r0 : c1a02040
    ...

The Marvell 6350 switch has no SERDES interface and so has no
corresponding pcs_ops defined for it. But during probing a call is made
to mv88e6xxx_port_setup() which unconditionally expects pcs_ops to exist -
though the presence of the pcs_ops->pcs_init function is optional.

Modify code to check for pcs_ops first, before checking for and calling
pcs_ops->pcs_init. Modify checking and use of pcs_ops->pcs_teardown
which may potentially suffer the same problem.

Fixes: b92143d ("net: dsa: mv88e6xxx: add infrastructure for phylink_pcs")
Signed-off-by: Greg Ungerer <gerg@kernel.org>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
0lxb pushed a commit to 0lxb/rpi_linux that referenced this issue Jan 30, 2024
SCX: Use "idle core" instead of "whole cpu"
popcornmix pushed a commit that referenced this issue Feb 6, 2024
commit 56925f3 upstream.

With previous patch, one of subtests in test_btf_id becomes
flaky and may fail. The following is a failing example:

  Error: #26 btf
  Error: #26/174 btf/BTF ID
    Error: #26/174 btf/BTF ID
    btf_raw_create:PASS:check 0 nsec
    btf_raw_create:PASS:check 0 nsec
    test_btf_id:PASS:check 0 nsec
    ...
    test_btf_id:PASS:check 0 nsec
    test_btf_id:FAIL:check BTF lingersdo_test_get_info:FAIL:check failed: -1

The test tries to prove a btf_id not available after the map is closed.
But btf_id is freed only after workqueue and a rcu grace period, compared
to previous case just after a rcu grade period.
Depending on system workload, workqueue could take quite some time
to execute function bpf_map_free_deferred() which may cause the test failure.
Instead of adding arbitrary delays, let us remove the logic to
check btf_id availability after map is closed.

Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20231214203820.1469402-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Samasth Norway Ananda <samasth.norway.ananda@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
popcornmix pushed a commit that referenced this issue Feb 6, 2024
commit 56925f3 upstream.

With previous patch, one of subtests in test_btf_id becomes
flaky and may fail. The following is a failing example:

  Error: #26 btf
  Error: #26/174 btf/BTF ID
    Error: #26/174 btf/BTF ID
    btf_raw_create:PASS:check 0 nsec
    btf_raw_create:PASS:check 0 nsec
    test_btf_id:PASS:check 0 nsec
    ...
    test_btf_id:PASS:check 0 nsec
    test_btf_id:FAIL:check BTF lingersdo_test_get_info:FAIL:check failed: -1

The test tries to prove a btf_id not available after the map is closed.
But btf_id is freed only after workqueue and a rcu grace period, compared
to previous case just after a rcu grade period.
Depending on system workload, workqueue could take quite some time
to execute function bpf_map_free_deferred() which may cause the test failure.
Instead of adding arbitrary delays, let us remove the logic to
check btf_id availability after map is closed.

Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20231214203820.1469402-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Samasth Norway Ananda <samasth.norway.ananda@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
popcornmix pushed a commit that referenced this issue Nov 18, 2024
Syzkaller reported this warning:
 ------------[ cut here ]------------
 WARNING: CPU: 0 PID: 16 at net/ipv4/af_inet.c:156 inet_sock_destruct+0x1c5/0x1e0
 Modules linked in:
 CPU: 0 UID: 0 PID: 16 Comm: ksoftirqd/0 Not tainted 6.12.0-rc5 #26
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
 RIP: 0010:inet_sock_destruct+0x1c5/0x1e0
 Code: 24 12 4c 89 e2 5b 48 c7 c7 98 ec bb 82 41 5c e9 d1 18 17 ff 4c 89 e6 5b 48 c7 c7 d0 ec bb 82 41 5c e9 bf 18 17 ff 0f 0b eb 83 <0f> 0b eb 97 0f 0b eb 87 0f 0b e9 68 ff ff ff 66 66 2e 0f 1f 84 00
 RSP: 0018:ffffc9000008bd90 EFLAGS: 00010206
 RAX: 0000000000000300 RBX: ffff88810b172a90 RCX: 0000000000000007
 RDX: 0000000000000002 RSI: 0000000000000300 RDI: ffff88810b172a00
 RBP: ffff88810b172a00 R08: ffff888104273c00 R09: 0000000000100007
 R10: 0000000000020000 R11: 0000000000000006 R12: ffff88810b172a00
 R13: 0000000000000004 R14: 0000000000000000 R15: ffff888237c31f78
 FS:  0000000000000000(0000) GS:ffff888237c00000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 00007ffc63fecac8 CR3: 000000000342e000 CR4: 00000000000006f0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 Call Trace:
  <TASK>
  ? __warn+0x88/0x130
  ? inet_sock_destruct+0x1c5/0x1e0
  ? report_bug+0x18e/0x1a0
  ? handle_bug+0x53/0x90
  ? exc_invalid_op+0x18/0x70
  ? asm_exc_invalid_op+0x1a/0x20
  ? inet_sock_destruct+0x1c5/0x1e0
  __sk_destruct+0x2a/0x200
  rcu_do_batch+0x1aa/0x530
  ? rcu_do_batch+0x13b/0x530
  rcu_core+0x159/0x2f0
  handle_softirqs+0xd3/0x2b0
  ? __pfx_smpboot_thread_fn+0x10/0x10
  run_ksoftirqd+0x25/0x30
  smpboot_thread_fn+0xdd/0x1d0
  kthread+0xd3/0x100
  ? __pfx_kthread+0x10/0x10
  ret_from_fork+0x34/0x50
  ? __pfx_kthread+0x10/0x10
  ret_from_fork_asm+0x1a/0x30
  </TASK>
 ---[ end trace 0000000000000000 ]---

Its possible that two threads call tcp_v6_do_rcv()/sk_forward_alloc_add()
concurrently when sk->sk_state == TCP_LISTEN with sk->sk_lock unlocked,
which triggers a data-race around sk->sk_forward_alloc:
tcp_v6_rcv
    tcp_v6_do_rcv
        skb_clone_and_charge_r
            sk_rmem_schedule
                __sk_mem_schedule
                    sk_forward_alloc_add()
            skb_set_owner_r
                sk_mem_charge
                    sk_forward_alloc_add()
        __kfree_skb
            skb_release_all
                skb_release_head_state
                    sock_rfree
                        sk_mem_uncharge
                            sk_forward_alloc_add()
                            sk_mem_reclaim
                                // set local var reclaimable
                                __sk_mem_reclaim
                                    sk_forward_alloc_add()

In this syzkaller testcase, two threads call
tcp_v6_do_rcv() with skb->truesize=768, the sk_forward_alloc changes like
this:
 (cpu 1)             | (cpu 2)             | sk_forward_alloc
 ...                 | ...                 | 0
 __sk_mem_schedule() |                     | +4096 = 4096
                     | __sk_mem_schedule() | +4096 = 8192
 sk_mem_charge()     |                     | -768  = 7424
                     | sk_mem_charge()     | -768  = 6656
 ...                 |    ...              |
 sk_mem_uncharge()   |                     | +768  = 7424
 reclaimable=7424    |                     |
                     | sk_mem_uncharge()   | +768  = 8192
                     | reclaimable=8192    |
 __sk_mem_reclaim()  |                     | -4096 = 4096
                     | __sk_mem_reclaim()  | -8192 = -4096 != 0

The skb_clone_and_charge_r() should not be called in tcp_v6_do_rcv() when
sk->sk_state is TCP_LISTEN, it happens later in tcp_v6_syn_recv_sock().
Fix the same issue in dccp_v6_do_rcv().

Suggested-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Fixes: e994b2f ("tcp: do not lock listener to process SYN packets")
Signed-off-by: Wang Liang <wangliang74@huawei.com>
Link: https://patch.msgid.link/20241107023405.889239-1-wangliang74@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
popcornmix pushed a commit that referenced this issue Nov 25, 2024
[ Upstream commit 073d898 ]

Syzkaller reported this warning:
 ------------[ cut here ]------------
 WARNING: CPU: 0 PID: 16 at net/ipv4/af_inet.c:156 inet_sock_destruct+0x1c5/0x1e0
 Modules linked in:
 CPU: 0 UID: 0 PID: 16 Comm: ksoftirqd/0 Not tainted 6.12.0-rc5 #26
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
 RIP: 0010:inet_sock_destruct+0x1c5/0x1e0
 Code: 24 12 4c 89 e2 5b 48 c7 c7 98 ec bb 82 41 5c e9 d1 18 17 ff 4c 89 e6 5b 48 c7 c7 d0 ec bb 82 41 5c e9 bf 18 17 ff 0f 0b eb 83 <0f> 0b eb 97 0f 0b eb 87 0f 0b e9 68 ff ff ff 66 66 2e 0f 1f 84 00
 RSP: 0018:ffffc9000008bd90 EFLAGS: 00010206
 RAX: 0000000000000300 RBX: ffff88810b172a90 RCX: 0000000000000007
 RDX: 0000000000000002 RSI: 0000000000000300 RDI: ffff88810b172a00
 RBP: ffff88810b172a00 R08: ffff888104273c00 R09: 0000000000100007
 R10: 0000000000020000 R11: 0000000000000006 R12: ffff88810b172a00
 R13: 0000000000000004 R14: 0000000000000000 R15: ffff888237c31f78
 FS:  0000000000000000(0000) GS:ffff888237c00000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 00007ffc63fecac8 CR3: 000000000342e000 CR4: 00000000000006f0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 Call Trace:
  <TASK>
  ? __warn+0x88/0x130
  ? inet_sock_destruct+0x1c5/0x1e0
  ? report_bug+0x18e/0x1a0
  ? handle_bug+0x53/0x90
  ? exc_invalid_op+0x18/0x70
  ? asm_exc_invalid_op+0x1a/0x20
  ? inet_sock_destruct+0x1c5/0x1e0
  __sk_destruct+0x2a/0x200
  rcu_do_batch+0x1aa/0x530
  ? rcu_do_batch+0x13b/0x530
  rcu_core+0x159/0x2f0
  handle_softirqs+0xd3/0x2b0
  ? __pfx_smpboot_thread_fn+0x10/0x10
  run_ksoftirqd+0x25/0x30
  smpboot_thread_fn+0xdd/0x1d0
  kthread+0xd3/0x100
  ? __pfx_kthread+0x10/0x10
  ret_from_fork+0x34/0x50
  ? __pfx_kthread+0x10/0x10
  ret_from_fork_asm+0x1a/0x30
  </TASK>
 ---[ end trace 0000000000000000 ]---

Its possible that two threads call tcp_v6_do_rcv()/sk_forward_alloc_add()
concurrently when sk->sk_state == TCP_LISTEN with sk->sk_lock unlocked,
which triggers a data-race around sk->sk_forward_alloc:
tcp_v6_rcv
    tcp_v6_do_rcv
        skb_clone_and_charge_r
            sk_rmem_schedule
                __sk_mem_schedule
                    sk_forward_alloc_add()
            skb_set_owner_r
                sk_mem_charge
                    sk_forward_alloc_add()
        __kfree_skb
            skb_release_all
                skb_release_head_state
                    sock_rfree
                        sk_mem_uncharge
                            sk_forward_alloc_add()
                            sk_mem_reclaim
                                // set local var reclaimable
                                __sk_mem_reclaim
                                    sk_forward_alloc_add()

In this syzkaller testcase, two threads call
tcp_v6_do_rcv() with skb->truesize=768, the sk_forward_alloc changes like
this:
 (cpu 1)             | (cpu 2)             | sk_forward_alloc
 ...                 | ...                 | 0
 __sk_mem_schedule() |                     | +4096 = 4096
                     | __sk_mem_schedule() | +4096 = 8192
 sk_mem_charge()     |                     | -768  = 7424
                     | sk_mem_charge()     | -768  = 6656
 ...                 |    ...              |
 sk_mem_uncharge()   |                     | +768  = 7424
 reclaimable=7424    |                     |
                     | sk_mem_uncharge()   | +768  = 8192
                     | reclaimable=8192    |
 __sk_mem_reclaim()  |                     | -4096 = 4096
                     | __sk_mem_reclaim()  | -8192 = -4096 != 0

The skb_clone_and_charge_r() should not be called in tcp_v6_do_rcv() when
sk->sk_state is TCP_LISTEN, it happens later in tcp_v6_syn_recv_sock().
Fix the same issue in dccp_v6_do_rcv().

Suggested-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Fixes: e994b2f ("tcp: do not lock listener to process SYN packets")
Signed-off-by: Wang Liang <wangliang74@huawei.com>
Link: https://patch.msgid.link/20241107023405.889239-1-wangliang74@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
popcornmix pushed a commit that referenced this issue Dec 16, 2024
This fixes the circular locking dependency warning below, by reworking
iso_sock_recvmsg, to ensure that the socket lock is always released
before calling a function that locks hdev.

[  561.670344] ======================================================
[  561.670346] WARNING: possible circular locking dependency detected
[  561.670349] 6.12.0-rc6+ #26 Not tainted
[  561.670351] ------------------------------------------------------
[  561.670353] iso-tester/3289 is trying to acquire lock:
[  561.670355] ffff88811f600078 (&hdev->lock){+.+.}-{3:3},
               at: iso_conn_big_sync+0x73/0x260 [bluetooth]
[  561.670405]
               but task is already holding lock:
[  561.670407] ffff88815af58258 (sk_lock-AF_BLUETOOTH){+.+.}-{0:0},
               at: iso_sock_recvmsg+0xbf/0x500 [bluetooth]
[  561.670450]
               which lock already depends on the new lock.

[  561.670452]
               the existing dependency chain (in reverse order) is:
[  561.670453]
               -> #2 (sk_lock-AF_BLUETOOTH){+.+.}-{0:0}:
[  561.670458]        lock_acquire+0x7c/0xc0
[  561.670463]        lock_sock_nested+0x3b/0xf0
[  561.670467]        bt_accept_dequeue+0x1a5/0x4d0 [bluetooth]
[  561.670510]        iso_sock_accept+0x271/0x830 [bluetooth]
[  561.670547]        do_accept+0x3dd/0x610
[  561.670550]        __sys_accept4+0xd8/0x170
[  561.670553]        __x64_sys_accept+0x74/0xc0
[  561.670556]        x64_sys_call+0x17d6/0x25f0
[  561.670559]        do_syscall_64+0x87/0x150
[  561.670563]        entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  561.670567]
               -> #1 (sk_lock-AF_BLUETOOTH-BTPROTO_ISO){+.+.}-{0:0}:
[  561.670571]        lock_acquire+0x7c/0xc0
[  561.670574]        lock_sock_nested+0x3b/0xf0
[  561.670577]        iso_sock_listen+0x2de/0xf30 [bluetooth]
[  561.670617]        __sys_listen_socket+0xef/0x130
[  561.670620]        __x64_sys_listen+0xe1/0x190
[  561.670623]        x64_sys_call+0x2517/0x25f0
[  561.670626]        do_syscall_64+0x87/0x150
[  561.670629]        entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  561.670632]
               -> #0 (&hdev->lock){+.+.}-{3:3}:
[  561.670636]        __lock_acquire+0x32ad/0x6ab0
[  561.670639]        lock_acquire.part.0+0x118/0x360
[  561.670642]        lock_acquire+0x7c/0xc0
[  561.670644]        __mutex_lock+0x18d/0x12f0
[  561.670647]        mutex_lock_nested+0x1b/0x30
[  561.670651]        iso_conn_big_sync+0x73/0x260 [bluetooth]
[  561.670687]        iso_sock_recvmsg+0x3e9/0x500 [bluetooth]
[  561.670722]        sock_recvmsg+0x1d5/0x240
[  561.670725]        sock_read_iter+0x27d/0x470
[  561.670727]        vfs_read+0x9a0/0xd30
[  561.670731]        ksys_read+0x1a8/0x250
[  561.670733]        __x64_sys_read+0x72/0xc0
[  561.670736]        x64_sys_call+0x1b12/0x25f0
[  561.670738]        do_syscall_64+0x87/0x150
[  561.670741]        entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  561.670744]
               other info that might help us debug this:

[  561.670745] Chain exists of:
&hdev->lock --> sk_lock-AF_BLUETOOTH-BTPROTO_ISO --> sk_lock-AF_BLUETOOTH

[  561.670751]  Possible unsafe locking scenario:

[  561.670753]        CPU0                    CPU1
[  561.670754]        ----                    ----
[  561.670756]   lock(sk_lock-AF_BLUETOOTH);
[  561.670758]                                lock(sk_lock
                                              AF_BLUETOOTH-BTPROTO_ISO);
[  561.670761]                                lock(sk_lock-AF_BLUETOOTH);
[  561.670764]   lock(&hdev->lock);
[  561.670767]
                *** DEADLOCK ***

Fixes: 07a9342 ("Bluetooth: ISO: Send BIG Create Sync via hci_sync")
Signed-off-by: Iulia Tanasescu <iulia.tanasescu@nxp.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants