Closed
Labels
Component: Memory Management (kernel memory management); Status: Inactive (not being actively updated); Type: Defect (incorrect behavior, e.g. crash, hang)
Description
I have been intermittently hitting a deadlock and finally set up a VM to reproduce it and get real stack traces. This was done on master for both SPL and ZFS as of the weekend, so it is completely up to date. The VM has 1GB of RAM and its root drive is on ext4. I set up a mirrored zpool on two 1GB disks with a couple of random datasets in it, then rsync'd the portage tree (lots of little files) into /tank/.
The rough steps to trigger it are:
- Make sure the ARC is as big as possible (find /tank -exec cat {} + > /dev/null 2>&1 did that nicely).
- Having a scrub running helped too.
- At this point arcstat.py shows the ARC using lots of space and htop shows RAM as almost completely used up.
- Run a little C program I wrote that malloc()s a lot of RAM, calls mlockall(), then just loops on while(1) sleep().
- kswapd starts using CPU time and the ARC starts to shrink a bit. If this runs for long enough, everything locks up completely, then the kernel's hung task detection kicks in and reboots the machine. The OOM killer should kill a few of the eatram processes but never gets there.
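The eatram source is not attached to this report. As a rough stand-in for the malloc()/mlockall()/sleep() pattern described above, here is a minimal Python sketch that calls mlockall(2) through ctypes. It is not the original C program: the names eat_ram, lock_all and the MiB argument are made up for illustration, and the MCL_* constants are assumed to have the usual Linux values.

```python
import ctypes
import mmap
import os
import sys
import time

# Assumed Linux values from <sys/mman.h>; a few architectures use different ones.
MCL_CURRENT = 1
MCL_FUTURE = 2


def eat_ram(mib, chunk_mib=16):
    """Allocate `mib` MiB in chunks and write to every page so it becomes resident."""
    chunks = []
    remaining = mib
    while remaining > 0:
        step = min(chunk_mib, remaining)
        buf = bytearray(step * 1024 * 1024)
        for offset in range(0, len(buf), mmap.PAGESIZE):
            buf[offset] = 1  # writing to the page forces the kernel to back it
        chunks.append(buf)
        remaining -= step
    return chunks


def lock_all():
    """Pin current and future pages with mlockall(2); raises OSError on failure."""
    libc = ctypes.CDLL(None, use_errno=True)
    if libc.mlockall(MCL_CURRENT | MCL_FUTURE) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


def main(argv):
    try:
        mib = int(argv[1])
    except (IndexError, ValueError):
        print("usage: eatram.py <MiB>", file=sys.stderr)
        return 2
    held = eat_ram(mib)  # keep a reference so the memory is never freed
    lock_all()
    print(f"holding {sum(len(c) for c in held) >> 20} MiB, pid {os.getpid()}")
    while True:  # sit on the locked memory until killed
        time.sleep(60)


if __name__ == "__main__":
    main(sys.argv)
```

Locking this much memory normally needs root or a raised ulimit -l. Running a few copies sized to exceed what is left after the ARC has grown should produce the same kind of pressure as the steps above.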
I have been following #4106 and #4166 but am still hitting this :(.
I am running the Gentoo hardened 4.3.3-r4 kernel, but I don't think the kernel version matters since I have been hitting this for a while.
I think this is the most relevant stacktrace:
[ 3688.586051] txg_sync D ffff88003d9fc630 9784 20739 2 0x00000000
[ 3688.586051] ffffc90000413a68 0000000000000046 0000000000000000 ffff880003b09740
[ 3688.586051] ffffc90000413a70 ffff88003e10c080 ffff88003d9fc080 ffffffffc0478a80
[ 3688.586051] ffffffffa28224ea e103ae4808b3649e ffff88003d9fcaf8 7fffffffffffffff
[ 3688.586051] Call Trace:
[ 3688.586051] [<ffffffffc0478a80>] ? zio_taskq_member.isra.4.constprop.11+0x70/0x70 [zfs]
[ 3688.586051] [<ffffffffa28224ea>] ? schedule_timeout+0x25a/0x310
[ 3688.586051] [<ffffffffa281dd4d>] schedule+0x3d/0x90
[ 3688.586051] [<ffffffffa28224ea>] schedule_timeout+0x25a/0x310
[ 3688.586051] [<ffffffffa2100ea4>] ? preempt_count_add+0x54/0xa0
[ 3688.586051] [<ffffffffa2048383>] ? kvm_clock_read+0x23/0x50
[ 3688.586051] [<ffffffffa20483b9>] ? kvm_clock_get_cycles+0x9/0x20
[ 3688.586051] [<ffffffffa2140a2d>] ? ktime_get+0x4d/0xd0
[ 3688.586051] [<ffffffffa281cff1>] io_schedule_timeout+0xb1/0x130
[ 3688.586051] [<ffffffffc000c9a7>] cv_wait_common+0x117/0x2f0 [spl]
[ 3688.586051] [<ffffffffa2117e10>] ? wait_woken+0xa0/0xa0
[ 3688.586051] [<ffffffffc000cc07>] __cv_wait_io+0x27/0x40 [spl]
[ 3688.586051] [<ffffffffc047c6a1>] zio_wait+0x181/0x360 [zfs]
[ 3688.586051] [<ffffffffc03e3458>] dsl_pool_sync+0x118/0x5f0 [zfs]
[ 3688.586051] [<ffffffffc040647d>] spa_sync+0x36d/0xe10 [zfs]
[ 3688.586051] [<ffffffffa2048383>] ? kvm_clock_read+0x23/0x50
[ 3688.586051] [<ffffffffc041dad5>] txg_sync_thread+0x3c5/0x6a0 [zfs]
[ 3688.586051] [<ffffffffc041d710>] ? txg_fini+0x2e0/0x2e0 [zfs]
[ 3688.586051] [<ffffffffc00051d5>] thread_generic_wrapper+0x95/0xe0 [spl]
[ 3688.586051] [<ffffffffc0005140>] ? __thread_exit+0x20/0x20 [spl]
[ 3688.586051] [<ffffffffa20f4a0a>] kthread+0x10a/0x120
[ 3688.586051] [<ffffffffa20f4900>] ? kthread_create_on_node+0x1c0/0x1c0
[ 3688.586051] [<ffffffffa28240ce>] ret_from_fork+0x3e/0x70
[ 3688.586051] [<ffffffffa20f4900>] ? kthread_create_on_node+0x1c0/0x1c0
[ 3688.132006] kswapd0 D ffffffffa2107ef0 12192 129 2 0x00000000
[ 3688.132006] ffffc9000041b888 0000000000000046 ffffffffa2100e10 ffffc9000041b8c8
[ 3688.132006] ffffc900000135d8 ffffffffa2c04980 ffff88003da38ac0 ffffc9000041b868
[ 3688.132006] ffffffffc000ca27 72e0da80ceb50efd ffff88003da39538 ffff88003d32c290
[ 3688.132006] Call Trace:
[ 3688.132006] [<ffffffffa2100e10>] ? get_parent_ip+0x10/0x50
[ 3688.132006] [<ffffffffc000ca27>] ? cv_wait_common+0x197/0x2f0 [spl]
[ 3688.132006] [<ffffffffa281dd4d>] schedule+0x3d/0x90
[ 3688.132006] [<ffffffffc000ca27>] cv_wait_common+0x197/0x2f0 [spl]
[ 3688.132006] [<ffffffffa2117e10>] ? wait_woken+0xa0/0xa0
[ 3688.132006] [<ffffffffc000cba4>] __cv_wait+0x24/0x30 [spl]
[ 3688.132006] [<ffffffffc041d1d0>] txg_wait_open+0xf0/0x1c0 [zfs]
[ 3688.132006] [<ffffffffc03c2fe3>] dmu_tx_wait+0x503/0x510 [zfs]
[ 3688.132006] [<ffffffffc03c30be>] dmu_tx_assign+0xce/0x750 [zfs]
[ 3688.132006] [<ffffffffc0464fab>] zfs_inactive+0x19b/0x2b0 [zfs]
[ 3688.132006] [<ffffffffa2823166>] ? _raw_spin_unlock_irq+0x26/0x50
[ 3688.132006] [<ffffffffc0488c56>] zpl_evict_inode+0x46/0xa0 [zfs]
[ 3688.132006] [<ffffffffc04a4580>] ? __FUNCTION__.49746+0x4e/0x4e [zfs]
[ 3688.132006] [<ffffffffa226938f>] evict+0xbf/0x1a0
[ 3688.132006] [<ffffffffa22694a8>] dispose_list+0x38/0x70
[ 3688.132006] [<ffffffffa226a961>] prune_icache_sb+0x61/0x90
[ 3688.132006] [<ffffffffa224a398>] super_cache_scan+0x278/0x450
[ 3688.132006] [<ffffffffa21f7bf4>] ? __list_lru_count_one.isra.2+0x44/0x80
[ 3688.132006] [<ffffffffa21de4a9>] shrink_slab.part.40+0x3f9/0x610
[ 3688.132006] [<ffffffffa21e0c65>] ? shrink_lruvec+0x625/0x720
[ 3688.132006] [<ffffffffa21e0fee>] shrink_zone+0x28e/0x2c0
[ 3688.132006] [<ffffffffa21e1f5f>] kswapd+0x53f/0x9a0
[ 3688.132006] [<ffffffffa21e1a20>] ? mem_cgroup_shrink_node_zone+0x200/0x200
[ 3688.132006] [<ffffffffa20f4a0a>] kthread+0x10a/0x120
[ 3688.132006] [<ffffffffa20f4900>] ? kthread_create_on_node+0x1c0/0x1c0
[ 3688.132006] [<ffffffffa28240ce>] ret_from_fork+0x3e/0x70
[ 3688.132006] [<ffffffffa20f4900>] ? kthread_create_on_node+0x1c0/0x1c0
[ 3688.132006] spl_dynamic_tas D ffff88002aec85b0 14192 20666 2 0x00000000
[ 3688.132006] ffffc90005313b38 0000000000000046 ffffffffa210e184 0000000000000000
[ 3688.132006] ffff88003fd12c40 ffff88003e108ac0 ffff88002aec8000 0000000000000000
[ 3688.132006] ffffffffa28224ea 9d481ca78e65b67c ffff88002aec8a78 7fffffffffffffff
[ 3688.132006] Call Trace:
[ 3688.132006] [<ffffffffa210e184>] ? enqueue_entity+0x4f4/0xc40
[ 3688.132006] [<ffffffffa28224ea>] ? schedule_timeout+0x25a/0x310
[ 3688.132006] [<ffffffffa281dd4d>] schedule+0x3d/0x90
[ 3688.132006] [<ffffffffa28224ea>] schedule_timeout+0x25a/0x310
[ 3688.132006] [<ffffffffa281ef37>] ? wait_for_completion_killable+0x47/0x1f0
[ 3688.132006] [<ffffffffa281f007>] wait_for_completion_killable+0x117/0x1f0
[ 3688.132006] [<ffffffffa2822290>] ? usleep_range+0x90/0x90
[ 3688.132006] [<ffffffffa20ffb10>] ? wake_up_q+0x80/0x80
[ 3688.132006] [<ffffffffc0006910>] ? taskq_thread_should_stop+0x90/0x90 [spl]
[ 3688.132006] [<ffffffffc0013882>] ? __FUNCTION__.26687+0x39a1/0x49a7 [spl]
[ 3688.132006] [<ffffffffa20f4836>] kthread_create_on_node+0xf6/0x1c0
[ 3688.132006] [<ffffffffa2463c57>] ? string.isra.4+0x47/0xe0
[ 3688.132006] [<ffffffffc0006910>] ? taskq_thread_should_stop+0x90/0x90 [spl]
[ 3688.586051] [<ffffffffc00052bb>] spl_kthread_create+0x9b/0xf0 [spl]
[ 3688.586051] [<ffffffffc0007639>] taskq_thread_create+0x69/0x110 [spl]
[ 3688.586051] [<ffffffffc00076f5>] taskq_thread_spawn_task+0x15/0x40 [spl]
[ 3688.586051] [<ffffffffc0006bb2>] taskq_thread+0x2a2/0x5c0 [spl]
[ 3688.586051] [<ffffffffa20ffb10>] ? wake_up_q+0x80/0x80
[ 3688.586051] [<ffffffffc0006910>] ? taskq_thread_should_stop+0x90/0x90 [spl]
[ 3688.586051] [<ffffffffa20f4a0a>] kthread+0x10a/0x120
[ 3688.586051] [<ffffffffa20f4900>] ? kthread_create_on_node+0x1c0/0x1c0
[ 3688.586051] [<ffffffffa28240ce>] ret_from_fork+0x3e/0x70
[ 3688.586051] [<ffffffffa20f4900>] ? kthread_create_on_node+0x1c0/0x1c0
[ 3688.132006] khugepaged D ffff88003fd12140 11856 37 2 0x00000000
[ 3688.132006] ffffc9000013b5d8 0000000000000046 ffffffffa2100e10 0000000000000000
[ 3688.132006] 0000000000000002 ffff88003d98eb80 ffff88003e2bb5c0 0000000000000001
[ 3688.132006] ffffffffc000ca27 02d1924d7a783fd3 ffff88003e2bc038 ffff88003d32c290
[ 3688.132006] Call Trace:
[ 3688.132006] [<ffffffffa2100e10>] ? get_parent_ip+0x10/0x50
[ 3688.132006] [<ffffffffc000ca27>] ? cv_wait_common+0x197/0x2f0 [spl]
[ 3688.132006] [<ffffffffa281dd4d>] schedule+0x3d/0x90
[ 3688.132006] [<ffffffffc000ca27>] cv_wait_common+0x197/0x2f0 [spl]
[ 3688.132006] [<ffffffffa2117e10>] ? wait_woken+0xa0/0xa0
[ 3688.132006] [<ffffffffc000cba4>] __cv_wait+0x24/0x30 [spl]
[ 3688.132006] [<ffffffffc041d1d0>] txg_wait_open+0xf0/0x1c0 [zfs]
[ 3688.132006] [<ffffffffc03c2fe3>] dmu_tx_wait+0x503/0x510 [zfs]
[ 3688.132006] [<ffffffffc03c30be>] dmu_tx_assign+0xce/0x750 [zfs]
[ 3688.132006] [<ffffffffc0464fab>] zfs_inactive+0x19b/0x2b0 [zfs]
[ 3688.132006] [<ffffffffa2823166>] ? _raw_spin_unlock_irq+0x26/0x50
[ 3688.132006] [<ffffffffc0488c56>] zpl_evict_inode+0x46/0xa0 [zfs]
[ 3688.132006] [<ffffffffc04a4580>] ? __FUNCTION__.49746+0x4e/0x4e [zfs]
[ 3688.132006] [<ffffffffa226938f>] evict+0xbf/0x1a0
[ 3688.132006] [<ffffffffa22694a8>] dispose_list+0x38/0x70
[ 3688.132006] [<ffffffffa226a961>] prune_icache_sb+0x61/0x90
[ 3688.132006] [<ffffffffa224a398>] super_cache_scan+0x278/0x450
[ 3688.132006] [<ffffffffa21f7bf4>] ? __list_lru_count_one.isra.2+0x44/0x80
[ 3688.132006] [<ffffffffa21de4a9>] shrink_slab.part.40+0x3f9/0x610
[ 3688.132006] [<ffffffffa21e0c65>] ? shrink_lruvec+0x625/0x720
[ 3688.132006] [<ffffffffa21e0fee>] shrink_zone+0x28e/0x2c0
[ 3688.132006] [<ffffffffa21e118b>] do_try_to_free_pages+0x16b/0x490
[ 3688.132006] [<ffffffffa21e1591>] try_to_free_pages+0xe1/0x1c0
[ 3688.132006] [<ffffffffa21d257b>] __alloc_pages_nodemask+0x55b/0x990
[ 3688.132006] [<ffffffffa222eed5>] khugepaged+0x475/0x16a0
[ 3688.132006] [<ffffffffa2117e10>] ? wait_woken+0xa0/0xa0
[ 3688.132006] [<ffffffffa222ea60>] ? maybe_pmd_mkwrite+0x40/0x40
[ 3688.132006] [<ffffffffa222ea60>] ? maybe_pmd_mkwrite+0x40/0x40
[ 3688.132006] [<ffffffffa20f4a0a>] kthread+0x10a/0x120
[ 3688.132006] [<ffffffffa20f4900>] ? kthread_create_on_node+0x1c0/0x1c0
[ 3688.132006] [<ffffffffa28240ce>] ret_from_fork+0x3e/0x70
[ 3688.132006] [<ffffffffa20f4900>] ? kthread_create_on_node+0x1c0/0x1c0
Most of the other processes are like this (kthreadd, init, python, find, htop ...):
[ 3688.586051] python3.4 D ffff88003cd45bb0 8 3210 20765 0x00000000
[ 3688.586051] ffffc900005bb3e8 0000000000000086 ffffffffa2100e10 ffffc900005bb428
[ 3688.586051] ffffc900007e33c8 ffff88003e22cb40 ffff88003cd45600 ffffc900005bb3c8
[ 3688.586051] ffffffffc000ca27 87e75cd14186f159 ffff88003cd46078 ffff88003d32c290
[ 3688.586051] Call Trace:
[ 3688.586051] [<ffffffffa2100e10>] ? get_parent_ip+0x10/0x50
[ 3688.586051] [<ffffffffc000ca27>] ? cv_wait_common+0x197/0x2f0 [spl]
[ 3688.586051] [<ffffffffa281dd4d>] schedule+0x3d/0x90
[ 3688.586051] [<ffffffffc000ca27>] cv_wait_common+0x197/0x2f0 [spl]
[ 3688.586051] [<ffffffffa2117e10>] ? wait_woken+0xa0/0xa0
[ 3688.586051] [<ffffffffc000cba4>] __cv_wait+0x24/0x30 [spl]
[ 3688.586051] [<ffffffffc041d1d0>] txg_wait_open+0xf0/0x1c0 [zfs]
[ 3688.586051] [<ffffffffc03c2fe3>] dmu_tx_wait+0x503/0x510 [zfs]
[ 3688.586051] [<ffffffffc03c30be>] dmu_tx_assign+0xce/0x750 [zfs]
[ 3688.586051] [<ffffffffc0464fab>] zfs_inactive+0x19b/0x2b0 [zfs]
[ 3688.586051] [<ffffffffa2823166>] ? _raw_spin_unlock_irq+0x26/0x50
[ 3688.586051] [<ffffffffc0488c56>] zpl_evict_inode+0x46/0xa0 [zfs]
[ 3688.586051] [<ffffffffc04a4580>] ? __FUNCTION__.49746+0x4e/0x4e [zfs]
[ 3688.586051] [<ffffffffa226938f>] evict+0xbf/0x1a0
[ 3688.586051] [<ffffffffa22694a8>] dispose_list+0x38/0x70
[ 3688.586051] [<ffffffffa226a961>] prune_icache_sb+0x61/0x90
[ 3688.586051] [<ffffffffa224a398>] super_cache_scan+0x278/0x450
[ 3688.586051] [<ffffffffa21f7bf4>] ? __list_lru_count_one.isra.2+0x44/0x80
[ 3688.586051] [<ffffffffa21de4a9>] shrink_slab.part.40+0x3f9/0x610
[ 3688.586051] [<ffffffffa21e0c65>] ? shrink_lruvec+0x625/0x720
[ 3688.586051] [<ffffffffa251128d>] ? alloc_indirect.isra.3+0x2d/0x70
[ 3688.586051] [<ffffffffa21e0fee>] shrink_zone+0x28e/0x2c0
[ 3688.586051] [<ffffffffa21e118b>] do_try_to_free_pages+0x16b/0x490
[ 3688.586051] [<ffffffffa21dc60c>] ? pfmemalloc_watermark_ok+0xbc/0xf0
[ 3688.586051] [<ffffffffa21e1591>] try_to_free_pages+0xe1/0x1c0
[ 3688.586051] [<ffffffffa21d257b>] __alloc_pages_nodemask+0x55b/0x990
[ 3688.586051] [<ffffffffa282311e>] ? _raw_spin_unlock_irqrestore+0x2e/0x50
[ 3688.586051] [<ffffffffa21d7940>] __do_page_cache_readahead+0x130/0x290
[ 3688.586051] [<ffffffffa21c91dc>] ? find_get_entry+0x6c/0xa0
[ 3688.586051] [<ffffffffa21c93ad>] ? pagecache_get_page+0x2d/0x1c0
[ 3688.586051] [<ffffffffa21cbc53>] filemap_fault+0x3a3/0x460
[ 3688.586051] [<ffffffffa21fb41f>] __do_fault+0x7f/0x130
[ 3688.586051] [<ffffffffa21fff05>] handle_mm_fault+0xb75/0x1750
[ 3688.586051] [<ffffffffa204d930>] __do_page_fault+0x220/0x6d0
[ 3688.586051] [<ffffffffa204de99>] trace_do_page_fault+0x49/0x150
[ 3688.586051] [<ffffffffa2047dac>] do_async_page_fault+0x2c/0xa0
[ 3688.586051] [<ffffffffa2825c68>] async_page_fault+0x28/0x30
How can I help debugging?
-- Jason