Commit
drm/amdgpu: fix OOM panic and deadlock
In the mmu release path, we schedule work to free the amn. However, there is a race between exit_mmap and oom_reap_task_mm:

exit_mmap -> mmu_notifier_release
oom_reap_task_mm -> __oom_reap_task_mm -> mmu_notifier_invalidate_range_start_nonblock

So the amn might already have been freed. Sync RCU in destroy to wait for any ongoing range invalidation.

calltrace:
[ 4407.908455] BUG: kernel NULL pointer dereference, address: 0000000000000050
[ 4407.915591] #PF: supervisor read access in kernel mode
[ 4407.920827] #PF: error_code(0x0000) - not-present page
[ 4407.926079] PGD 0 P4D 0
[ 4407.928662] Oops: 0000 [#1] SMP PTI
[ 4407.932216] CPU: 3 PID: 55 Comm: oom_reaper Tainted: G W O 5.4.0-rc7+ #1
[ 4407.940206] Hardware name: System manufacturer System Product Name/Z170-A, BIOS 1702 01/28/2016
[ 4407.949080] RIP: 0010:mark_lock+0xc9/0x540
[ 4407.953282] Code: 31 c0 eb 12 48 8d 14 80 48 8d 04 50 48 c1 e0 04 48 05 00 dd 01 9c 41 bc 01 00 00 00 89 d9 41 bf 01 00 00 00 41 d3 e4 4d 63 e4 <4c> 85 60 50 0f 85 50 ff ff ff e8 18 bb ff ff 85 c0 0f 84 40 ff ff
[ 4407.972385] RSP: 0018:ffffab4440253ab0 EFLAGS: 00010006
[ 4407.977723] RAX: 0000000000000000 RBX: 0000000000000008 RCX: 0000000000000008
[ 4407.984977] RDX: ffff9df1d9fd8040 RSI: 0000000000000001 RDI: ffffffff9a318d0a
[ 4407.992222] RBP: ffffab4440253af0 R08: 0000000000000000 R09: 000000000003b540
[ 4407.999493] R10: 0000000000000000 R11: 0000000000000037 R12: 0000000000000100
[ 4408.006774] R13: ffff9df1d9fd8040 R14: ffff9df1d9fd8c88 R15: 0000000000000001
[ 4408.014062] FS: 0000000000000000(0000) GS:ffff9df1de180000(0000) knlGS:0000000000000000
[ 4408.022295] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 4408.028154] CR2: 0000000000000050 CR3: 0000000393410001 CR4: 00000000003606e0
[ 4408.035427] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 4408.042689] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 4408.049968] Call Trace:
[ 4408.052467] __lock_acquire+0x261/0x1600
[ 4408.056486] ? __lock_acquire+0x43a/0x1600
[ 4408.060628] ? __lock_acquire+0x43a/0x1600
[ 4408.064816] lock_acquire+0xb8/0x1c0
[ 4408.068482] ? rwsem_down_read_slowpath+0x1e8/0x5f0
[ 4408.073509] _raw_spin_lock_irq+0x3b/0x50
[ 4408.077624] ? rwsem_down_read_slowpath+0x1e8/0x5f0
[ 4408.082582] rwsem_down_read_slowpath+0x1e8/0x5f0
[ 4408.087401] ? finish_task_switch+0x63/0x230
[ 4408.091759] ? __schedule+0x2b3/0x860
[ 4408.095487] down_read_non_owner+0x86/0x160
[ 4408.099767] ? down_read_non_owner+0x86/0x160
[ 4408.104295] amdgpu_mn_read_lock+0x9f/0xb0 [amdgpu]
[ 4408.109364] amdgpu_mn_invalidate_range_start_gfx+0x3f/0x1e0 [amdgpu]
[ 4408.115941] __mmu_notifier_invalidate_range_start+0x9e/0x190
[ 4408.121816] ? __oom_reap_task_mm+0x6d/0x220
[ 4408.126166] __oom_reap_task_mm+0x1b5/0x220
[ 4408.130450] oom_reaper+0x4d0/0x650
[ 4408.133991] ? __kthread_parkme+0x2f/0x90
[ 4408.138083] ? finish_wait+0x90/0x90
[ 4408.141715] kthread+0x12c/0x150
[ 4408.145043] ? __oom_reap_task_mm+0x220/0x220
[ 4408.149461] ? kthread_park+0x90/0x90
[ 4408.153208] ret_from_fork+0x3a/0x50

There is also a deadlock.

calltrace:
[ 1635.072660] BUG: sleeping function called from invalid context at ../kernel/locking/rwsem.c:1621
[ 1635.081870] in_atomic(): 0, irqs_disabled(): 0, non_block: 1, pid: 55, name: oom_reaper
[ 1635.090106] 4 locks held by oom_reaper/55:
[ 1635.091485] init_user_pages: Failed to get user pages: -512
[ 1635.094302] #0: ffff9ca48e94b5d8 (&mm->mmap_sem#2){++++}, at: oom_reaper+0xa4/0x650
[ 1635.108116] #1: ffffffff82750fc0 (mmu_notifier_invalidate_range_start){+.+.}, at: __oom_reap_task_mm+0x6d/0x220
[ 1635.118558] #2: ffffffff827637f0 (srcu){....}, at: __mmu_notifier_invalidate_range_start+0x5/0x190
[ 1635.127879] #3: ffff9ca4f7c605f0 (&amn->read_lock){+.+.}, at: amdgpu_mn_read_lock+0x75/0xb0 [amdgpu]
[ 1635.137614] CPU: 3 PID: 55 Comm: oom_reaper Tainted: G W O 5.4.0-rc7+ #1
[ 1635.145787] Hardware name: System manufacturer System Product Name/Z170-A, BIOS 1702 01/28/2016
[ 1635.154744] Call Trace:
[ 1635.157264] dump_stack+0x98/0xd5
[ 1635.160688] ___might_sleep+0x175/0x260
[ 1635.164675] __might_sleep+0x4a/0x80
[ 1635.168340] down_read_non_owner+0x20/0x160
[ 1635.172748] amdgpu_mn_read_lock+0x9f/0xb0 [amdgpu]
[ 1635.177884] amdgpu_mn_invalidate_range_start_hsa+0x3f/0x180 [amdgpu]
[ 1635.184477] __mmu_notifier_invalidate_range_start+0x9e/0x190
[ 1635.190337] ? __oom_reap_task_mm+0x6d/0x220
[ 1635.194725] __oom_reap_task_mm+0x1b5/0x220
[ 1635.199069] oom_reaper+0x4d0/0x650
[ 1635.202611] ? __kthread_parkme+0x2f/0x90
[ 1635.206740] ? finish_wait+0x90/0x90
[ 1635.210425] kthread+0x12c/0x150
[ 1635.213714] ? __oom_reap_task_mm+0x220/0x220
[ 1635.218215] ? kthread_park+0x90/0x90
[ 1635.221976] ret_from_fork+0x3a/0x50
[ 1815.108088] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1815.116065] kworker/1:1 D 0 9357 2 0x80004000
[ 1815.121873] Workqueue: events amdgpu_mn_destroy [amdgpu]
[ 1815.127264] Call Trace:
[ 1815.129765] __schedule+0x2ab/0x860
[ 1815.133299] ? rwsem_down_write_slowpath+0x329/0x660
[ 1815.138382] schedule+0x3a/0xc0
[ 1815.141613] rwsem_down_write_slowpath+0x32e/0x660
[ 1815.146523] down_write+0x74/0x80
[ 1815.149921] ? down_write+0x40/0x80
[ 1815.153490] ? down_write+0x74/0x80
[ 1815.157285] amdgpu_mn_destroy+0x6e/0x240 [amdgpu]
[ 1815.162176] process_one_work+0x231/0x5c0
[ 1815.166313] worker_thread+0x3f/0x3b0
[ 1815.170100] ? __kthread_parkme+0x61/0x90
[ 1815.174204] kthread+0x12c/0x150
[ 1815.177573] ? process_one_work+0x5c0/0x5c0
[ 1815.181857] ? kthread_park+0x90/0x90
[ 1815.185591] ret_from_fork+0x3a/0x50

The OOM killer wants to invalidate the range in a non-blocking context, but amdgpu_mn_read_lock might sleep, which then causes the deadlock.

Reviewed-by: Flora Cui <flora.cui@amd.com>
Signed-off-by: xinhui pan <xinhui.pan@amd.com>
Signed-off-by: Yifan Zhang <Yifan1.Zhang@amd.com>