Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: soft lockup - CPU#5 stuck for 67s! [kswapd0:260] #1482

Closed
lukasz99 opened this issue May 29, 2013 · 12 comments
Closed

BUG: soft lockup - CPU#5 stuck for 67s! [kswapd0:260] #1482

lukasz99 opened this issue May 29, 2013 · 12 comments

Comments

@lukasz99
Copy link

Hello,
I've noticed just two of these after running the system for a couple of days. the
box seems to be relatively happy - no other strange symptoms so far other than
system load on what otherwise should be an idle system staying at 3.0 since the
time the bugs got reported.

Guessing by the contents of the stack traces it looks like some sort of problem
with SPL/ARC...

Below's what I've recovered from dmesg - it looks like the second kswapd0 lockup
got followed by:

INFO: task arc_adapt:2143 blocked for more than 120 seconds.

In case it matters, I'm running the most recent version of Centos 6.4 modified to
run openvz (2.6.32-042stab076.8 kernel)

lukasz

[181067.234355] BUG: soft lockup - CPU#5 stuck for 67s! [kswapd0:260]
[181067.234428] Modules linked in: vzethdev pio_nfs pio_direct pfmt_raw pfmt_ploop1 ploop simfs vzrst nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 vzcpt nf_conntrack vzdquota ip6t_REJECT ip6table_mangle ip6table_filter ip6_tables xt_length xt_hl xt_tcpmss xt_TCPMSS iptable_mangle iptable_filter xt_multiport xt_limit xt_dscp ipt_REJECT ip_tables vzevent autofs4 nfs lockd fscache nfs_acl auth_rpcgss sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf vznetdev vzmon vzdev ipv6 ext3 jbd zfs(P)(U) zcommon(P)(U) znvpair(P)(U) zavl(P)(U) zunicode(P)(U) spl(U) zlib_deflate microcode sb_edac edac_core i2c_i801 i2c_core sg iTCO_wdt iTCO_vendor_support ioatdma igb dca shpchp ext4 mbcache jbd2 raid1 isci libsas scsi_transport_sas sr_mod cdrom sd_mod crc_t10dif ahci dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
[181067.234481] CPU 5
[181067.234482] Modules linked in: vzethdev pio_nfs pio_direct pfmt_raw pfmt_ploop1 ploop simfs vzrst nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 vzcpt nf_conntrack vzdquota ip6t_REJECT ip6table_mangle ip6table_filter ip6_tables xt_length xt_hl xt_tcpmss xt_TCPMSS iptable_mangle iptable_filter xt_multiport xt_limit xt_dscp ipt_REJECT ip_tables vzevent autofs4 nfs lockd fscache nfs_acl auth_rpcgss sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf vznetdev vzmon vzdev ipv6 ext3 jbd zfs(P)(U) zcommon(P)(U) znvpair(P)(U) zavl(P)(U) zunicode(P)(U) spl(U) zlib_deflate microcode sb_edac edac_core i2c_i801 i2c_core sg iTCO_wdt iTCO_vendor_support ioatdma igb dca shpchp ext4 mbcache jbd2 raid1 isci libsas scsi_transport_sas sr_mod cdrom sd_mod crc_t10dif ahci dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
[181067.234516]
[181067.234519] Pid: 260, comm: kswapd0 veid: 0 Tainted: P C --------------- 2.6.32-042stab076.8 #1 042stab076_8 Supermicro X9DR3-F/X9DR3-F
[181067.234523] RIP: 0010:[] [] remove_vm_area+0x86/0xa0
[181067.234529] RSP: 0018:ffff881076e61ac0 EFLAGS: 00000287
[181067.234531] RAX: ffff88007b8be3c0 RBX: ffff881076e61ae0 RCX: ffff8810393d2e00
[181067.234533] RDX: ffff8808d9249700 RSI: 0000000000000001 RDI: ffffffff81ac37d0
[181067.234535] RBP: ffffffff8100bc4e R08: 0000000000000000 R09: ffff88007baff800
[181067.234536] R10: ffff88007baff800 R11: ffff88007baff600 R12: ffff88107fc45000
[181067.234538] R13: ffff88107fc02500 R14: ffff881079f36e18 R15: ffff881079f36e00
[181067.234540] FS: 0000000000000000(0000) GS:ffff880069b40000(0000) knlGS:0000000000000000
[181067.234542] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
[181067.234544] CR2: 00007fe41568a000 CR3: 0000000001a85000 CR4: 00000000000406e0
[181067.234546] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[181067.234548] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[181067.234550] Process kswapd0 (pid: 260, veid: 0, threadinfo ffff881076e60000, task ffff881076e3d060)
[181067.234552] Stack:
[181067.234553] ffff88007b9dacc0 ffff88007b8be3c0 ffffc90ec6457000 ffffc913b2cf2000
[181067.234558] ffff881076e61b10 ffffffff8116b5de 0000000000000000 ffff881078070000
[181067.234562] ffffc913b2cb0000 0000000000000000 ffff881076e61b20 ffffffff8116b74f
[181067.234566] Call Trace:
[181067.234569] [] ? __vunmap+0x2e/0x120
[181067.234572] [] ? vfree+0x2f/0x40
[181067.234582] [] ? kv_free+0x65/0x70 [spl]
[181067.234587] [] ? spl_slab_reclaim+0x2d9/0x3e0 [spl]
[181067.234592] [] ? __switch_to+0xd0/0x320
[181067.234598] [] ? spl_kmem_cache_reap_now+0x144/0x230 [spl]
[181067.234603] [] ? spl_kmem_cache_reap_now+0x157/0x230 [spl]
[181067.234623] [] ? arc_kmem_reap_now+0x67/0xc0 [zfs]
[181067.234637] [] ? arc_shrinker_func+0xdf/0x1e0 [zfs]
[181067.234642] [] ? shrink_slab+0x168/0x1e0
[181067.234645] [] ? balance_pgdat+0x7fd/0xb40
[181067.234648] [] ? kswapd+0x181/0x3f0
[181067.234652] [] ? autoremove_wake_function+0x0/0x40
[181067.234655] [] ? kswapd+0x0/0x3f0
[181067.234658] [] ? kthread+0x96/0xa0
[181067.234660] [] ? child_rip+0xa/0x20
[181067.234663] [] ? kthread+0x0/0xa0
[181067.234665] [] ? child_rip+0x0/0x20
[181067.234667] Code: d0 37 ac 81 48 89 45 e8 e8 48 92 38 00 48 8b 15 a9 f6 25 01 48 c7 c1 98 ab 3c 82 48 8b 45 e8 48 39 d3 74 0c 90 48 89 d1 48 8b 12 <48> 39 d3 75 f5 48 8b 13 48 89 11 f0 81 05 b4 82 95 00 00 00 00
[181067.234686] Call Trace:
[181067.234688] [] ? __vunmap+0x2e/0x120
[181067.234691] [] ? vfree+0x2f/0x40
[181067.234696] [] ? kv_free+0x65/0x70 [spl]
[181067.234701] [] ? spl_slab_reclaim+0x2d9/0x3e0 [spl]
[181067.234705] [] ? __switch_to+0xd0/0x320
[181067.234710] [] ? spl_kmem_cache_reap_now+0x144/0x230 [spl]
[181067.234716] [] ? spl_kmem_cache_reap_now+0x157/0x230 [spl]
[181067.234728] [] ? arc_kmem_reap_now+0x67/0xc0 [zfs]
[181067.234741] [] ? arc_shrinker_func+0xdf/0x1e0 [zfs]
[181067.234745] [] ? shrink_slab+0x168/0x1e0
[181067.234748] [] ? balance_pgdat+0x7fd/0xb40
[181067.234751] [] ? kswapd+0x181/0x3f0
[181067.234754] [] ? autoremove_wake_function+0x0/0x40
[181067.234757] [] ? kswapd+0x0/0x3f0
[181067.234759] [] ? kthread+0x96/0xa0
[181067.234761] [] ? child_rip+0xa/0x20
[181067.234764] [] ? kthread+0x0/0xa0
[181067.234766] [] ? child_rip+0x0/0x20

[181151.111272] BUG: soft lockup - CPU#5 stuck for 67s! [kswapd0:260]
[181151.111343] Modules linked in: vzethdev pio_nfs pio_direct pfmt_raw pfmt_ploop1 ploop simfs vzrst nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 vzcpt nf_conntrack vzdquota ip6t_REJECT ip6table_mangle ip6table_filter ip6_tables xt_length xt_hl xt_tcpmss xt_TCPMSS iptable_mangle iptable_filter xt_multiport xt_limit xt_dscp ipt_REJECT ip_tables vzevent autofs4 nfs lockd fscache nfs_acl auth_rpcgss sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf vznetdev vzmon vzdev ipv6 ext3 jbd zfs(P)(U) zcommon(P)(U) znvpair(P)(U) zavl(P)(U) zunicode(P)(U) spl(U) zlib_deflate microcode sb_edac edac_core i2c_i801 i2c_core sg iTCO_wdt iTCO_vendor_support ioatdma igb dca shpchp ext4 mbcache jbd2 raid1 isci libsas scsi_transport_sas sr_mod cdrom sd_mod crc_t10dif ahci dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
[181151.111396] CPU 5
[181151.111397] Modules linked in: vzethdev pio_nfs pio_direct pfmt_raw pfmt_ploop1 ploop simfs vzrst nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 vzcpt nf_conntrack vzdquota ip6t_REJECT ip6table_mangle ip6table_filter ip6_tables xt_length xt_hl xt_tcpmss xt_TCPMSS iptable_mangle iptable_filter xt_multiport xt_limit xt_dscp ipt_REJECT ip_tables vzevent autofs4 nfs lockd fscache nfs_acl auth_rpcgss sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf vznetdev vzmon vzdev ipv6 ext3 jbd zfs(P)(U) zcommon(P)(U) znvpair(P)(U) zavl(P)(U) zunicode(P)(U) spl(U) zlib_deflate microcode sb_edac edac_core i2c_i801 i2c_core sg iTCO_wdt iTCO_vendor_support ioatdma igb dca shpchp ext4 mbcache jbd2 raid1 isci libsas scsi_transport_sas sr_mod cdrom sd_mod crc_t10dif ahci dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
[181151.111431]
[181151.111433] Pid: 260, comm: kswapd0 veid: 0 Tainted: P C --------------- 2.6.32-042stab076.8 #1 042stab076_8 Supermicro X9DR3-F/X9DR3-F
[181151.111437] RIP: 0010:[] [] remove_vm_area+0x86/0xa0
[181151.111444] RSP: 0018:ffff881076e61ac0 EFLAGS: 00000283
[181151.111446] RAX: ffff88007da6ce40 RBX: ffff881076e61ae0 RCX: ffff88062ccb5d00
[181151.111448] RDX: ffff8806678a2a80 RSI: 0000000000000001 RDI: ffffffff81ac37d0
[181151.111450] RBP: ffffffff8100bc4e R08: 0000000000000000 R09: 0000000000000000
[181151.111451] R10: 0000000000000000 R11: 0000000000000000 R12: ffff881076e61a90
[181151.111453] R13: ffff881079f0b640 R14: ffff881080011640 R15: ffffea0074dbff40
[181151.111455] FS: 0000000000000000(0000) GS:ffff880069b40000(0000) knlGS:0000000000000000
[181151.111458] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
[181151.111459] CR2: 00007fe41568a000 CR3: 0000000001a85000 CR4: 00000000000406e0
[181151.111461] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[181151.111463] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[181151.111465] Process kswapd0 (pid: 260, veid: 0, threadinfo ffff881076e60000, task ffff881076e3d060)
[181151.111467] Stack:
[181151.111468] ffff88007df6a8c0 ffff88007da6ce40 ffffc90ec6457000 ffffc911b2542000
[181151.111472] ffff881076e61b10 ffffffff8116b5de 0000000000000000 ffff881078070000
[181151.111476] ffffc911b2500000 0000000000000000 ffff881076e61b20 ffffffff8116b74f
[181151.111480] Call Trace:
[181151.111484] [] ? __vunmap+0x2e/0x120
[181151.111486] [] ? vfree+0x2f/0x40
[181151.111495] [] ? kv_free+0x65/0x70 [spl]
[181151.111501] [] ? spl_slab_reclaim+0x2d9/0x3e0 [spl]
[181151.111505] [] ? __switch_to+0xd0/0x320
[181151.111511] [] ? spl_kmem_cache_reap_now+0x144/0x230 [spl]
[181151.111516] [] ? spl_kmem_cache_reap_now+0x157/0x230 [spl]
[181151.111536] [] ? arc_kmem_reap_now+0x67/0xc0 [zfs]
[181151.111550] [] ? arc_shrinker_func+0xdf/0x1e0 [zfs]
[181151.111555] [] ? shrink_slab+0x168/0x1e0
[181151.111558] [] ? balance_pgdat+0x7fd/0xb40
[181151.111561] [] ? kswapd+0x181/0x3f0
[181151.111565] [] ? autoremove_wake_function+0x0/0x40
[181151.111567] [] ? kswapd+0x0/0x3f0
[181151.111570] [] ? kthread+0x96/0xa0
[181151.111573] [] ? child_rip+0xa/0x20
[181151.111575] [] ? kthread+0x0/0xa0
[181151.111577] [] ? child_rip+0x0/0x20
[181151.111579] Code: d0 37 ac 81 48 89 45 e8 e8 48 92 38 00 48 8b 15 a9 f6 25 01 48 c7 c1 98 ab 3c 82 48 8b 45 e8 48 39 d3 74 0c 90 48 89 d1 48 8b 12 <48> 39 d3 75 f5 48 8b 13 48 89 11 f0 81 05 b4 82 95 00 00 00 00
[181151.111598] Call Trace:
[181151.111600] [] ? __vunmap+0x2e/0x120
[181151.111603] [] ? vfree+0x2f/0x40
[181151.111608] [] ? kv_free+0x65/0x70 [spl]
[181151.111614] [] ? spl_slab_reclaim+0x2d9/0x3e0 [spl]
[181151.111617] [] ? __switch_to+0xd0/0x320
[181151.111622] [] ? spl_kmem_cache_reap_now+0x144/0x230 [spl]
[181151.111628] [] ? spl_kmem_cache_reap_now+0x157/0x230 [spl]
[181151.111641] [] ? arc_kmem_reap_now+0x67/0xc0 [zfs]
[181151.111653] [] ? arc_shrinker_func+0xdf/0x1e0 [zfs]
[181151.111657] [] ? shrink_slab+0x168/0x1e0
[181151.111660] [] ? balance_pgdat+0x7fd/0xb40
[181151.111663] [] ? kswapd+0x181/0x3f0
[181151.111666] [] ? autoremove_wake_function+0x0/0x40
[181151.111669] [] ? kswapd+0x0/0x3f0
[181151.111671] [] ? kthread+0x96/0xa0
[181151.111674] [] ? child_rip+0xa/0x20
[181151.111676] [] ? kthread+0x0/0xa0
[181151.111679] [] ? child_rip+0x0/0x20

[181175.794555] INFO: task arc_adapt:2143 blocked for more than 120 seconds.
[181175.794644] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[181175.794792] arc_adapt D ffff882075fbc6c0 0 2143 2 0 0x00000000
[181175.794801] ffff8820780a3d60 0000000000000046 ffff8820780a3ce0 ffffffff8108033c
[181175.794811] ffff8820780a3d60 00000000ffffffff ffff88207946c000 ffffffff8100bc4e
[181175.794819] ffff8820780a3d60 0000000000000000 ffff882075fbcc88 000000000001e9c0
[181175.794828] Call Trace:
[181175.794842] [] ? lock_timer_base+0x3c/0x70
[181175.794851] [] ? apic_timer_interrupt+0xe/0x20
[181175.794877] [] __mutex_lock_slowpath+0x13e/0x180
[181175.794887] [] mutex_lock+0x2b/0x50
[181175.794910] [] __cv_timedwait_common+0xc7/0x250 [spl]
[181175.794917] [] ? apic_timer_interrupt+0xe/0x20
[181175.794925] [] ? autoremove_wake_function+0x0/0x40
[181175.794968] [] ? arc_adapt_thread+0x0/0xd0 [zfs]
[181175.794984] [] __cv_timedwait_interruptible+0x13/0x20 [spl]
[181175.795014] [] arc_adapt_thread+0x9f/0xd0 [zfs]
[181175.795028] [] thread_generic_wrapper+0x68/0x80 [spl]
[181175.795042] [] ? thread_generic_wrapper+0x0/0x80 [spl]
[181175.795048] [] kthread+0x96/0xa0
[181175.795056] [] child_rip+0xa/0x20
[181175.795062] [] ? kthread+0x0/0xa0
[181175.795067] [] ? child_rip+0x0/0x20

@zrav
Copy link

zrav commented Jun 13, 2013

I experienced something similar.
I'm running ZoL 0.6.1 under Linux 3.2.0-45-generic #70-Ubuntu SMP Wed May 29 20:12:06 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux, Xeon X3430, 8GB RAM, pool of 6xWD RE4 2TB as RAIDZ2, using LZ4 compression on datasets. Not sure what else could be of interest.
The issue occurred while under load from about 15 parallel rsync and a few zfs send processes. Afterwards the box was so unresponsive I couldn't even ssh into it anymore.

[135748.154481] INFO: task kswapd0:38 blocked for more than 120 seconds. [135748.155525] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [135748.156791] kswapd0 D ffffffff81806240 0 38 2 0x00000000 [135748.156798] ffff88022f561a70 0000000000000046 ffff88022f561a10 ffffffffa00573d6 [135748.156805] ffff88022f561fd8 ffff88022f561fd8 ffff88022f561fd8 00000000000137c0 [135748.156812] ffff8802321b1700 ffff88022f559700 ffff8801e32f4b80 ffff88021fdecec8 [135748.156819] Call Trace: [135748.156840] [<ffffffffa00573d6>] ? trace_put_tcd+0x16/0x20 [spl] [135748.156847] [<ffffffff8165d76f>] schedule+0x3f/0x60 [135748.156853] [<ffffffff8165e577>] __mutex_lock_slowpath+0xd7/0x150 [135748.156858] [<ffffffff8165e18a>] mutex_lock+0x2a/0x50 [135748.156911] [<ffffffffa01aa52c>] zfs_zinactive+0x5c/0xd0 [zfs] [135748.156959] [<ffffffffa01a501c>] zfs_inactive+0x7c/0x230 [zfs] [135748.156967] [<ffffffff81126c6e>] ? truncate_pagecache+0x5e/0x70 [135748.157013] [<ffffffffa01b9188>] zpl_evict_inode+0x28/0x30 [zfs] [135748.157020] [<ffffffff81194658>] evict+0xb8/0x1c0 [135748.157025] [<ffffffff81194d41>] dispose_list+0x41/0x50 [135748.157031] [<ffffffff8165f65f>] ? _raw_spin_trylock+0xf/0x30 [135748.157036] [<ffffffff81195788>] prune_icache_sb+0x168/0x330 [135748.157042] [<ffffffff8117ce6e>] prune_super+0xee/0x1a0 [135748.157047] [<ffffffff8112b2b4>] shrink_slab+0x154/0x300 [135748.157053] [<ffffffff8112e5aa>] balance_pgdat+0x50a/0x6d0 [135748.157058] [<ffffffff8112e891>] kswapd+0x121/0x220 [135748.157063] [<ffffffff8112e770>] ? balance_pgdat+0x6d0/0x6d0 [135748.157069] [<ffffffff8108b48c>] kthread+0x8c/0xa0 [135748.157075] [<ffffffff81669e34>] kernel_thread_helper+0x4/0x10 [135748.157080] [<ffffffff8108b400>] ? flush_kthread_worker+0xa0/0xa0 [135748.157086] [<ffffffff81669e30>] ? gs_change+0x13/0x13 [135748.157091] INFO: task spl_kmem_cache/:431 blocked for more than 120 seconds. [135748.158225] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [135748.159582] spl_kmem_cache/ D ffffffff81806240 0 431 2 0x00000000 [135748.159589] ffff88022e1ed520 0000000000000046 0000000000000000 0000000100400009 [135748.159605] ffff88022e1edfd8 ffff88022e1edfd8 ffff88022e1edfd8 00000000000137c0 [135748.159621] ffffffff81c0d020 ffff88022e24c500 0000000000000040 ffff88021fdfd188 [135748.159636] Call Trace: [135748.159645] [<ffffffff8165d76f>] schedule+0x3f/0x60 [135748.159655] [<ffffffff8165e577>] __mutex_lock_slowpath+0xd7/0x150 [135748.159672] [<ffffffffa005bd8b>] ? kmem_free_debug+0x4b/0x150 [spl] [135748.159682] [<ffffffff8165e18a>] mutex_lock+0x2a/0x50 [135748.159735] [<ffffffffa01aa52c>] zfs_zinactive+0x5c/0xd0 [zfs] [135748.159787] [<ffffffffa01a501c>] zfs_inactive+0x7c/0x230 [zfs] [135748.159798] [<ffffffff81126c6e>] ? truncate_pagecache+0x5e/0x70 [135748.159848] [<ffffffffa01b9188>] zpl_evict_inode+0x28/0x30 [zfs] [135748.159859] [<ffffffff81194658>] evict+0xb8/0x1c0 [135748.159868] [<ffffffff81194d41>] dispose_list+0x41/0x50 [135748.159878] [<ffffffff8165f65f>] ? _raw_spin_trylock+0xf/0x30 [135748.159887] [<ffffffff81195788>] prune_icache_sb+0x168/0x330 [135748.159898] [<ffffffff8117ce6e>] prune_super+0xee/0x1a0 [135748.159907] [<ffffffff8112b2b4>] shrink_slab+0x154/0x300 [135748.159917] [<ffffffff8101b749>] ? read_tsc+0x9/0x20 [135748.159927] [<ffffffff8112db48>] do_try_to_free_pages+0x268/0x2d0 [135748.159937] [<ffffffff8112de8b>] try_to_free_pages+0xab/0x170 [135748.159948] [<ffffffff8112210e>] __alloc_pages_nodemask+0x59e/0x8f0 [135748.159961] [<ffffffff81159576>] alloc_pages_current+0xb6/0x120 [135748.159971] [<ffffffff8111d4de>] __get_free_pages+0xe/0x40 [135748.159981] [<ffffffff810454e5>] pte_alloc_one_kernel+0x15/0x20 [135748.159991] [<ffffffff8113cadb>] __pte_alloc_kernel+0x1b/0xe0 [135748.160002] [<ffffffff8114b4ce>] vmap_pte_range+0x14e/0x170 [135748.160011] [<ffffffff8114b612>] vmap_pud_range+0x122/0x1c0 [135748.160021] [<ffffffff8114b746>] vmap_page_range_noflush+0x96/0xc0 [135748.160031] [<ffffffff8114b7a2>] map_vm_area+0x32/0x50 [135748.160040] [<ffffffff8114dda8>] __vmalloc_area_node+0x118/0x1c0 [135748.160057] [<ffffffffa005c271>] ? kv_alloc.isra.8+0x51/0x60 [spl] [135748.160067] [<ffffffff8114da46>] __vmalloc_node_range+0x86/0xd0 [135748.160083] [<ffffffffa005c271>] ? kv_alloc.isra.8+0x51/0x60 [spl] [135748.160093] [<ffffffff8114dac5>] __vmalloc_node+0x35/0x40 [135748.160109] [<ffffffffa005c271>] ? kv_alloc.isra.8+0x51/0x60 [spl] [135748.160118] [<ffffffff8114dc82>] __vmalloc+0x22/0x30 [135748.160133] [<ffffffffa005c271>] kv_alloc.isra.8+0x51/0x60 [spl] [135748.160150] [<ffffffffa005c2c0>] spl_cache_grow_work+0x40/0x430 [spl] [135748.160168] [<ffffffffa0061776>] taskq_thread+0x236/0x4b0 [spl] [135748.160179] [<ffffffff81060670>] ? try_to_wake_up+0x200/0x200 [135748.160196] [<ffffffffa0061540>] ? task_done+0x160/0x160 [spl] [135748.160205] [<ffffffff8108b48c>] kthread+0x8c/0xa0 [135748.160215] [<ffffffff81669e34>] kernel_thread_helper+0x4/0x10 [135748.160226] [<ffffffff8108b400>] ? flush_kthread_worker+0xa0/0xa0 [135748.160236] [<ffffffff81669e30>] ? gs_change+0x13/0x13 [135748.160244] INFO: task arc_adapt:436 blocked for more than 120 seconds. [135748.161299] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [135748.162670] arc_adapt D ffffffff81806240 0 436 2 0x00000000 [135748.162676] ffff88022e839b30 0000000000000046 0000000000000000 0000000100400007 [135748.162683] ffff88022e839fd8 ffff88022e839fd8 ffff88022e839fd8 00000000000137c0 [135748.162689] ffff8802321e9700 ffff88022bc69700 ffff8801e32bb800 ffff880224fd1188 [135748.162696] Call Trace: [135748.162701] [<ffffffff8165d76f>] schedule+0x3f/0x60 [135748.162706] [<ffffffff8165e577>] __mutex_lock_slowpath+0xd7/0x150 [135748.162735] [<ffffffffa01194a0>] ? dbuf_rele_and_unlock+0x1b0/0x210 [zfs] [135748.162741] [<ffffffff8165e18a>] mutex_lock+0x2a/0x50 [135748.162788] [<ffffffffa01aa52c>] zfs_zinactive+0x5c/0xd0 [zfs] [135748.162835] [<ffffffffa01a501c>] zfs_inactive+0x7c/0x230 [zfs] [135748.162842] [<ffffffff81126c6e>] ? truncate_pagecache+0x5e/0x70 [135748.162887] [<ffffffffa01b9188>] zpl_evict_inode+0x28/0x30 [zfs] [135748.162893] [<ffffffff81194658>] evict+0xb8/0x1c0 [135748.162899] [<ffffffff81194d41>] dispose_list+0x41/0x50 [135748.162904] [<ffffffff8165f65f>] ? _raw_spin_trylock+0xf/0x30 [135748.162909] [<ffffffff81195788>] prune_icache_sb+0x168/0x330 [135748.162915] [<ffffffff8117ce6e>] prune_super+0xee/0x1a0 [135748.162960] [<ffffffffa01b9220>] ? zpl_inode_alloc+0x70/0x70 [zfs] [135748.163007] [<ffffffffa019b2e2>] zfs_sb_prune+0xa2/0xc0 [zfs] [135748.163053] [<ffffffffa01b9240>] zpl_prune_sb+0x20/0x30 [zfs] [135748.163059] [<ffffffff8117c265>] iterate_supers_type+0x85/0xe0 [135748.163104] [<ffffffffa01b9250>] ? zpl_prune_sb+0x30/0x30 [zfs] [135748.163150] [<ffffffffa01b928c>] zpl_prune_sbs+0x3c/0x50 [zfs] [135748.163175] [<ffffffffa0113a61>] arc_adjust_meta+0x121/0x1e0 [zfs] [135748.163199] [<ffffffffa0113b20>] ? arc_adjust_meta+0x1e0/0x1e0 [zfs] [135748.163224] [<ffffffffa0113b20>] ? arc_adjust_meta+0x1e0/0x1e0 [zfs] [135748.163249] [<ffffffffa0113b8a>] arc_adapt_thread+0x6a/0xd0 [zfs] [135748.163262] [<ffffffffa005fd08>] thread_generic_wrapper+0x78/0x90 [spl] [135748.163275] [<ffffffffa005fc90>] ? __thread_create+0x310/0x310 [spl] [135748.163280] [<ffffffff8108b48c>] kthread+0x8c/0xa0 [135748.163286] [<ffffffff81669e34>] kernel_thread_helper+0x4/0x10 [135748.163291] [<ffffffff8108b400>] ? flush_kthread_worker+0xa0/0xa0 [135748.163297] [<ffffffff81669e30>] ? gs_change+0x13/0x13

@mattaw
Copy link

mattaw commented Jun 13, 2013

Thanks for the information - I have added a link to your bug from my bug #1515 report. Good that you have a stack trace - I wasn't able to get one as it was headless and completely locked. Perhaps I need to hookup a screen and keyboard but it isn't even making it to the logs on my system.

Matthew

@zrav
Copy link

zrav commented Jun 18, 2013

I started experiencing these hangs after I updated the Ubuntu kernel to 3.2.0-45. Now I update to 3.2.0-48 and so far the issues have not reappeared. It might be unrelated, just a possible hint.

@behlendorf
Copy link
Contributor

@lukasz99 Your issue is just contention of the virtual address space lock which is a known issue. The stacks dumped are just informational and as long as they only occur once and a while not harmful. We're looking in to a proper long term fix to address this.

@zrav Your stack suggests a possible deadlock in zfs_zinactive() on the zp->z_lock during memory reclaim. Unfortunately, from the stacks you posted it's not clear what task is holding that lock. It's possible there was kernel bug here but I can't say for sure, if it occurs again please post the stacks.

@zrav
Copy link

zrav commented Sep 15, 2013

Upgrading to kernel 3.8.0-30 the hangs returned (now on ZoL 0.6.2):
[207635.591320] INFO: task kswapd0:44 blocked for more than 120 seconds. [207635.591694] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [207635.592153] kswapd0 D ffff8802307fc220 0 44 2 0x00000000 [207635.592159] ffff88023097ba38 0000000000000046 ffff8801889bd1c0 ffff88023fc53f40 [207635.592164] ffff88023097bfd8 ffff88023097bfd8 ffff88023097bfd8 0000000000013f40 [207635.592169] ffff8802301eae80 ffff880233d05d00 ffff88023097ba28 ffff880228ca1358 [207635.592174] Call Trace: [207635.592183] [<ffffffff816f2de9>] schedule+0x29/0x70 [207635.592188] [<ffffffff816f309e>] schedule_preempt_disabled+0xe/0x10 [207635.592192] [<ffffffff816f1d17>] __mutex_lock_slowpath+0xd7/0x150 [207635.592230] [<ffffffffa01345b0>] ? dbuf_rele_and_unlock+0x1b0/0x210 [zfs] [207635.592236] [<ffffffff816f192a>] mutex_lock+0x2a/0x50 [207635.592280] [<ffffffffa01c689a>] zfs_zinactive+0x5a/0xd0 [zfs] [207635.592324] [<ffffffffa01c136c>] zfs_inactive+0x7c/0x230 [zfs] [207635.592330] [<ffffffff811433de>] ? truncate_pagecache+0x5e/0x70 [207635.592371] [<ffffffffa01d5968>] zpl_evict_inode+0x28/0x30 [zfs] [207635.592378] [<ffffffff811b5f60>] evict+0xc0/0x1d0 [207635.592383] [<ffffffff811b60b1>] dispose_list+0x41/0x50 [207635.592387] [<ffffffff816f3cef>] ? _raw_spin_trylock+0xf/0x30 [207635.592391] [<ffffffff811b70c5>] prune_icache_sb+0x185/0x340 [207635.592396] [<ffffffff8119eaf8>] prune_super+0xf8/0x1b0 [207635.592400] [<ffffffff81147044>] shrink_slab+0x154/0x300 [207635.592404] [<ffffffff811918a8>] ? mem_cgroup_iter+0xe8/0x200 [207635.592409] [<ffffffff8114a274>] balance_pgdat+0x524/0x6c0 [207635.592414] [<ffffffff8114a533>] kswapd+0x123/0x240 [207635.592418] [<ffffffff8114a410>] ? balance_pgdat+0x6c0/0x6c0 [207635.592423] [<ffffffff8107f1b0>] kthread+0xc0/0xd0 [207635.592428] [<ffffffff8107f0f0>] ? flush_kthread_worker+0xb0/0xb0 [207635.592433] [<ffffffff816fcaec>] ret_from_fork+0x7c/0xb0 [207635.592437] [<ffffffff8107f0f0>] ? flush_kthread_worker+0xb0/0xb0 [207635.592443] INFO: task spl_kmem_cache/:386 blocked for more than 120 seconds. [207635.592840] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [207635.593274] spl_kmem_cache/ D ffff8802307fc220 0 386 2 0x00000000 [207635.593278] ffff8802339534d8 0000000000000046 ffff88020a2b0040 ffff88023fc13f40 [207635.593283] ffff880233953fd8 ffff880233953fd8 ffff880233953fd8 0000000000013f40 [207635.593287] ffffffff81c15440 ffff88023037c5c0 ffff8802339534c8 ffff880228ca1358 [207635.593292] Call Trace: [207635.593296] [<ffffffff816f2de9>] schedule+0x29/0x70 [207635.593300] [<ffffffff816f309e>] schedule_preempt_disabled+0xe/0x10 [207635.593304] [<ffffffff816f1d17>] __mutex_lock_slowpath+0xd7/0x150 [207635.593329] [<ffffffffa01345b0>] ? dbuf_rele_and_unlock+0x1b0/0x210 [zfs] [207635.593334] [<ffffffff816f192a>] mutex_lock+0x2a/0x50 [207635.593377] [<ffffffffa01c689a>] zfs_zinactive+0x5a/0xd0 [zfs] [207635.593420] [<ffffffffa01c136c>] zfs_inactive+0x7c/0x230 [zfs] [207635.593425] [<ffffffff811433de>] ? truncate_pagecache+0x5e/0x70 [207635.593467] [<ffffffffa01d5968>] zpl_evict_inode+0x28/0x30 [zfs] [207635.593473] [<ffffffff811b5f60>] evict+0xc0/0x1d0 [207635.593477] [<ffffffff811b60b1>] dispose_list+0x41/0x50 [207635.593481] [<ffffffff816f3cef>] ? _raw_spin_trylock+0xf/0x30 [207635.593485] [<ffffffff811b70c5>] prune_icache_sb+0x185/0x340 [207635.593489] [<ffffffff8119eaf8>] prune_super+0xf8/0x1b0 [207635.593494] [<ffffffff81147044>] shrink_slab+0x154/0x300 [207635.593498] [<ffffffff81149037>] ? shrink_zone+0x57/0xa0 [207635.593502] [<ffffffff81149778>] do_try_to_free_pages+0x258/0x2c0 [207635.593507] [<ffffffff81152589>] ? __mod_zone_page_state+0x49/0x50 [207635.593511] [<ffffffff81149b14>] try_to_free_pages+0xe4/0x1a0 [207635.593516] [<ffffffff8113e19c>] __alloc_pages_nodemask+0x60c/0x990 [207635.593522] [<ffffffff8117aa1a>] alloc_pages_current+0xba/0x170 [207635.593527] [<ffffffff811394de>] __get_free_pages+0xe/0x40 [207635.593532] [<ffffffff8104d225>] pte_alloc_one_kernel+0x15/0x20 [207635.593537] [<ffffffff8115c68b>] __pte_alloc_kernel+0x1b/0xd0 [207635.593541] [<ffffffff8116c39e>] vmap_pte_range+0x14e/0x170 [207635.593545] [<ffffffff8116c7d2>] vmap_pud_range+0x122/0x1c0 [207635.593550] [<ffffffff8116c906>] vmap_page_range_noflush+0x96/0xc0 [207635.593554] [<ffffffff8116c962>] map_vm_area+0x32/0x50 [207635.593559] [<ffffffff8116ed78>] __vmalloc_area_node+0x118/0x1c0 [207635.593571] [<ffffffffa0076d99>] ? kv_alloc.isra.8+0x49/0x50 [spl] [207635.593576] [<ffffffff8116ea16>] __vmalloc_node_range+0x86/0xd0 [207635.593586] [<ffffffffa0076d99>] ? kv_alloc.isra.8+0x49/0x50 [spl] [207635.593591] [<ffffffff8116ea95>] __vmalloc_node+0x35/0x40 [207635.593601] [<ffffffffa0076d99>] ? kv_alloc.isra.8+0x49/0x50 [spl] [207635.593605] [<ffffffff8116ec52>] __vmalloc+0x22/0x30 [207635.593615] [<ffffffffa0076d99>] kv_alloc.isra.8+0x49/0x50 [spl] [207635.593626] [<ffffffffa0076de0>] spl_cache_grow_work+0x40/0x430 [spl] [207635.593637] [<ffffffffa007c586>] taskq_thread+0x236/0x4b0 [spl] [207635.593643] [<ffffffff81092130>] ? try_to_wake_up+0x200/0x200 [207635.593654] [<ffffffffa007c350>] ? task_done+0x160/0x160 [spl] [207635.593658] [<ffffffff8107f1b0>] kthread+0xc0/0xd0 [207635.593663] [<ffffffff8107f0f0>] ? flush_kthread_worker+0xb0/0xb0 [207635.593667] [<ffffffff816fcaec>] ret_from_fork+0x7c/0xb0 [207635.593672] [<ffffffff8107f0f0>] ? flush_kthread_worker+0xb0/0xb0 [207635.593680] INFO: task z_rd_int/0:528 blocked for more than 120 seconds. [207635.594059] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [207635.594517] z_rd_int/0 D ffff8802307fc220 0 528 2 0x00000000 [207635.594521] ffff8802338edca8 0000000000000046 0000126adcb76faa ffff88023fc13f40 [207635.594526] ffff8802338edfd8 ffff8802338edfd8 ffff8802338edfd8 0000000000013f40 [207635.594530] ffffffff81c15440 ffff8802301edd00 ffff880200000001 ffff880231eb0728 [207635.594534] Call Trace: [207635.594539] [<ffffffff816f2de9>] schedule+0x29/0x70 [207635.594542] [<ffffffff816f309e>] schedule_preempt_disabled+0xe/0x10 [207635.594546] [<ffffffff816f1d17>] __mutex_lock_slowpath+0xd7/0x150 [207635.594551] [<ffffffff816f192a>] mutex_lock+0x2a/0x50 [207635.594596] [<ffffffffa0194570>] vdev_queue_io_done+0x40/0x120 [zfs] [207635.594639] [<ffffffffa01cbb68>] zio_vdev_io_done+0x88/0x190 [zfs] [207635.594682] [<ffffffffa01cc60a>] zio_execute+0xba/0x160 [zfs] [207635.594694] [<ffffffffa007c586>] taskq_thread+0x236/0x4b0 [spl] [207635.594699] [<ffffffff81092130>] ? try_to_wake_up+0x200/0x200 [207635.594710] [<ffffffffa007c350>] ? task_done+0x160/0x160 [spl] [207635.594715] [<ffffffff8107f1b0>] kthread+0xc0/0xd0 [207635.594720] [<ffffffff8107f0f0>] ? flush_kthread_worker+0xb0/0xb0 [207635.594724] [<ffffffff816fcaec>] ret_from_fork+0x7c/0xb0 [207635.594728] [<ffffffff8107f0f0>] ? flush_kthread_worker+0xb0/0xb0 [207635.594732] INFO: task z_rd_int/2:530 blocked for more than 120 seconds. [207635.595120] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [207635.595600] z_rd_int/2 D ffff8802307fc220 0 530 2 0x00000000 [207635.595605] ffff88022e003ca8 0000000000000046 000014a8f7a5f814 ffff88023fc93f40 [207635.595610] ffff88022e003fd8 ffff88022e003fd8 ffff88022e003fd8 0000000000013f40 [207635.595614] ffff88023414ae80 ffff88022e008000 ffff88022e003cd8 ffff880231eb6728 [207635.595619] Call Trace: [207635.595624] [<ffffffff816f2de9>] schedule+0x29/0x70 [207635.595628] [<ffffffff816f309e>] schedule_preempt_disabled+0xe/0x10 [207635.595631] [<ffffffff816f1d17>] __mutex_lock_slowpath+0xd7/0x150 [207635.595636] [<ffffffff816f192a>] mutex_lock+0x2a/0x50 [207635.595681] [<ffffffffa0194570>] vdev_queue_io_done+0x40/0x120 [zfs] [207635.595736] [<ffffffffa01cbb68>] zio_vdev_io_done+0x88/0x190 [zfs] [207635.595788] [<ffffffffa01cc60a>] zio_execute+0xba/0x160 [zfs] [207635.595810] [<ffffffffa007c586>] taskq_thread+0x236/0x4b0 [spl] [207635.595826] [<ffffffff81092130>] ? try_to_wake_up+0x200/0x200 [207635.595847] [<ffffffffa007c350>] ? task_done+0x160/0x160 [spl] [207635.595862] [<ffffffff8107f1b0>] kthread+0xc0/0xd0 [207635.595874] [<ffffffff8107f0f0>] ? flush_kthread_worker+0xb0/0xb0 [207635.595879] [<ffffffff816fcaec>] ret_from_fork+0x7c/0xb0 [207635.595883] [<ffffffff8107f0f0>] ? flush_kthread_worker+0xb0/0xb0 [207635.595887] INFO: task z_rd_int/3:531 blocked for more than 120 seconds. [207635.596275] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [207635.596734] z_rd_int/3 D ffff8802307fc220 0 531 2 0x00000000 [207635.596738] ffff8802338ddca8 0000000000000046 000014aed3627ef6 ffff88023fcd3f40 [207635.596743] ffff8802338ddfd8 ffff8802338ddfd8 ffff8802338ddfd8 0000000000013f40 [207635.596747] ffff88023414c5c0 ffff88022e009740 ffff8802338ddcd8 ffff880231eb3728 [207635.596751] Call Trace: [207635.596756] [<ffffffff816f2de9>] schedule+0x29/0x70 [207635.596759] [<ffffffff816f309e>] schedule_preempt_disabled+0xe/0x10 [207635.596763] [<ffffffff816f1d17>] __mutex_lock_slowpath+0xd7/0x150 [207635.596768] [<ffffffff816f192a>] mutex_lock+0x2a/0x50 [207635.596812] [<ffffffffa0194570>] vdev_queue_io_done+0x40/0x120 [zfs] [207635.596856] [<ffffffffa01cbb68>] zio_vdev_io_done+0x88/0x190 [zfs] [207635.596898] [<ffffffffa01cc60a>] zio_execute+0xba/0x160 [zfs] [207635.596910] [<ffffffffa007c586>] taskq_thread+0x236/0x4b0 [spl] [207635.596915] [<ffffffff81092130>] ? try_to_wake_up+0x200/0x200 [207635.596926] [<ffffffffa007c350>] ? task_done+0x160/0x160 [spl] [207635.596931] [<ffffffff8107f1b0>] kthread+0xc0/0xd0 [207635.596935] [<ffffffff8107f0f0>] ? flush_kthread_worker+0xb0/0xb0 [207635.596940] [<ffffffff816fcaec>] ret_from_fork+0x7c/0xb0 [207635.596944] [<ffffffff8107f0f0>] ? flush_kthread_worker+0xb0/0xb0 [207635.596947] INFO: task z_wr_iss/0:532 blocked for more than 120 seconds. [207635.597334] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [207635.597794] z_wr_iss/0 D ffff8802307fc220 0 532 2 0x00000000 [207635.597797] ffff8802338dfa38 0000000000000046 ffffffff81e514c0 ffff88023fc13f40 [207635.597802] ffff8802338dffd8 ffff8802338dffd8 ffff8802338dffd8 0000000000013f40 [207635.597806] ffff88023037c5c0 ffff88022e00ae80 ffff88023058a890 ffff880231eb6728 [207635.597811] Call Trace: [207635.597815] [<ffffffff816f2de9>] schedule+0x29/0x70 [207635.597819] [<ffffffff816f309e>] schedule_preempt_disabled+0xe/0x10 [207635.597823] [<ffffffff816f1d17>] __mutex_lock_slowpath+0xd7/0x150 [207635.597828] [<ffffffff816f192a>] mutex_lock+0x2a/0x50 [207635.597871] [<ffffffffa019447f>] vdev_queue_io+0x6f/0x120 [zfs] [207635.597913] [<ffffffffa01cece7>] zio_vdev_io_start+0x1e7/0x2f0 [zfs] [207635.597955] [<ffffffffa01cf06a>] zio_nowait+0xba/0x190 [zfs] [207635.597999] [<ffffffffa0196f35>] vdev_raidz_io_start+0x5d5/0x6b0 [zfs] [207635.598043] [<ffffffffa0194850>] ? vdev_raidz_asize+0x60/0x60 [zfs] [207635.598085] [<ffffffffa01cebab>] zio_vdev_io_start+0xab/0x2f0 [zfs] [207635.598127] [<ffffffffa01cf06a>] zio_nowait+0xba/0x190 [zfs] [207635.598170] [<ffffffffa0193660>] vdev_mirror_io_start+0x220/0x4e0 [zfs] [207635.598213] [<ffffffffa0192e70>] ? vdev_config_sync+0x180/0x180 [zfs] [207635.598255] [<ffffffffa01ced37>] zio_vdev_io_start+0x237/0x2f0 [zfs] [207635.598298] [<ffffffffa01cc60a>] zio_execute+0xba/0x160 [zfs] [207635.598310] [<ffffffffa007c586>] taskq_thread+0x236/0x4b0 [spl] [207635.598315] [<ffffffff81092130>] ? try_to_wake_up+0x200/0x200 [207635.598326] [<ffffffffa007c350>] ? task_done+0x160/0x160 [spl] [207635.598330] [<ffffffff8107f1b0>] kthread+0xc0/0xd0 [207635.598335] [<ffffffff8107f0f0>] ? flush_kthread_worker+0xb0/0xb0 [207635.598339] [<ffffffff816fcaec>] ret_from_fork+0x7c/0xb0 [207635.598344] [<ffffffff8107f0f0>] ? flush_kthread_worker+0xb0/0xb0 [207635.598347] INFO: task z_wr_iss/2:534 blocked for more than 120 seconds. [207635.598735] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [207635.599206] z_wr_iss/2 D ffff8802307fc220 0 534 2 0x00000000 [207635.599211] ffff88022e007a38 0000000000000046 ffff880226c14900 ffff88023fc93f40 [207635.599215] ffff88022e007fd8 ffff88022e007fd8 ffff88022e007fd8 0000000000013f40 [207635.599219] ffff88023414ae80 ffff88022e00dd00 ffff88022e007a38 ffff880231eb5728 [207635.599224] Call Trace: [207635.599229] [<ffffffff816f2de9>] schedule+0x29/0x70 [207635.599233] [<ffffffff816f309e>] schedule_preempt_disabled+0xe/0x10 [207635.599236] [<ffffffff816f1d17>] __mutex_lock_slowpath+0xd7/0x150 [207635.599242] [<ffffffff816f192a>] mutex_lock+0x2a/0x50 [207635.599286] [<ffffffffa019447f>] vdev_queue_io+0x6f/0x120 [zfs] [207635.599340] [<ffffffffa01cece7>] zio_vdev_io_start+0x1e7/0x2f0 [zfs] [207635.599392] [<ffffffffa01cf06a>] zio_nowait+0xba/0x190 [zfs] [207635.599445] [<ffffffffa0196f35>] vdev_raidz_io_start+0x5d5/0x6b0 [zfs] [207635.599499] [<ffffffffa0194850>] ? vdev_raidz_asize+0x60/0x60 [zfs] [207635.599550] [<ffffffffa01cebab>] zio_vdev_io_start+0xab/0x2f0 [zfs] [207635.599602] [<ffffffffa01cf06a>] zio_nowait+0xba/0x190 [zfs] [207635.599651] [<ffffffffa0193660>] vdev_mirror_io_start+0x220/0x4e0 [zfs] [207635.599694] [<ffffffffa0192e70>] ? vdev_config_sync+0x180/0x180 [zfs] [207635.599736] [<ffffffffa01ced37>] zio_vdev_io_start+0x237/0x2f0 [zfs] [207635.599779] [<ffffffffa01cc60a>] zio_execute+0xba/0x160 [zfs] [207635.599801] [<ffffffffa007c586>] taskq_thread+0x236/0x4b0 [spl] [207635.599814] [<ffffffff81092130>] ? try_to_wake_up+0x200/0x200 [207635.599826] [<ffffffffa007c350>] ? task_done+0x160/0x160 [spl] [207635.599830] [<ffffffff8107f1b0>] kthread+0xc0/0xd0 [207635.599835] [<ffffffff8107f0f0>] ? flush_kthread_worker+0xb0/0xb0 [207635.599839] [<ffffffff816fcaec>] ret_from_fork+0x7c/0xb0 [207635.599844] [<ffffffff8107f0f0>] ? flush_kthread_worker+0xb0/0xb0 [207635.599848] INFO: task z_wr_int/0:541 blocked for more than 120 seconds. [207635.610638] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [207635.632365] z_wr_int/0 D ffff8802307fc220 0 541 2 0x00000000 [207635.632370] ffff88022e02dca8 0000000000000046 ffff8801b3ee9620 ffff88023fc13f40 [207635.632374] ffff88022e02dfd8 ffff88022e02dfd8 ffff88022e02dfd8 0000000000013f40 [207635.632378] ffff8802317a2e80 ffff8802317a1740 ffff88022e02dcb8 ffff880231eb6728 [207635.632383] Call Trace: [207635.632388] [<ffffffff816f2de9>] schedule+0x29/0x70 [207635.632392] [<ffffffff816f309e>] schedule_preempt_disabled+0xe/0x10 [207635.632395] [<ffffffff816f1d17>] __mutex_lock_slowpath+0xd7/0x150 [207635.632400] [<ffffffff816f192a>] mutex_lock+0x2a/0x50 [207635.632446] [<ffffffffa0194570>] vdev_queue_io_done+0x40/0x120 [zfs] [207635.632500] [<ffffffffa01cbb68>] zio_vdev_io_done+0x88/0x190 [zfs] [207635.632547] [<ffffffffa01cc60a>] zio_execute+0xba/0x160 [zfs] [207635.632560] [<ffffffffa007c586>] taskq_thread+0x236/0x4b0 [spl] [207635.632565] [<ffffffff81092130>] ? try_to_wake_up+0x200/0x200 [207635.632576] [<ffffffffa007c350>] ? task_done+0x160/0x160 [spl] [207635.632581] [<ffffffff8107f1b0>] kthread+0xc0/0xd0 [207635.632586] [<ffffffff8107f0f0>] ? flush_kthread_worker+0xb0/0xb0 [207635.632590] [<ffffffff816fcaec>] ret_from_fork+0x7c/0xb0 [207635.632594] [<ffffffff8107f0f0>] ? flush_kthread_worker+0xb0/0xb0 [207635.632598] INFO: task z_wr_int/2:543 blocked for more than 120 seconds. [207635.643662] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [207635.665462] z_wr_int/2 D ffff8802307fc220 0 543 2 0x00000000 [207635.665466] ffff88022e031ca8 0000000000000046 000014a8f7d5e49d ffff88023fc93f40 [207635.665471] ffff88022e031fd8 ffff88022e031fd8 ffff88022e031fd8 0000000000013f40 [207635.665475] ffff88023414ae80 ffff880231782e80 ffff88022e031cb8 ffff880231eb5728 [207635.665480] Call Trace: [207635.665485] [<ffffffff816f2de9>] schedule+0x29/0x70 [207635.665488] [<ffffffff816f309e>] schedule_preempt_disabled+0xe/0x10 [207635.665492] [<ffffffff816f1d17>] __mutex_lock_slowpath+0xd7/0x150 [207635.665497] [<ffffffff816f192a>] mutex_lock+0x2a/0x50 [207635.665543] [<ffffffffa0194570>] vdev_queue_io_done+0x40/0x120 [zfs] [207635.665597] [<ffffffffa01cbb68>] zio_vdev_io_done+0x88/0x190 [zfs] [207635.665644] [<ffffffffa01cc60a>] zio_execute+0xba/0x160 [zfs] [207635.665658] [<ffffffffa007c586>] taskq_thread+0x236/0x4b0 [spl] [207635.665663] [<ffffffff81092130>] ? try_to_wake_up+0x200/0x200 [207635.665674] [<ffffffffa007c350>] ? task_done+0x160/0x160 [spl] [207635.665679] [<ffffffff8107f1b0>] kthread+0xc0/0xd0 [207635.665684] [<ffffffff8107f0f0>] ? flush_kthread_worker+0xb0/0xb0 [207635.665688] [<ffffffff816fcaec>] ret_from_fork+0x7c/0xb0 [207635.665693] [<ffffffff8107f0f0>] ? flush_kthread_worker+0xb0/0xb0 [207635.665696] INFO: task z_wr_int/3:544 blocked for more than 120 seconds. [207635.676741] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [207635.698530] z_wr_int/3 D ffff8802307fc220 0 544 2 0x00000000 [207635.698534] ffff88022e033ca8 0000000000000046 000014af0f91ede4 ffff88023fcd3f40 [207635.698539] ffff88022e033fd8 ffff88022e033fd8 ffff88022e033fd8 0000000000013f40 [207635.698543] ffff88023414c5c0 ffff88022e038000 0000000100000001 ffff880231eb3728 [207635.698548] Call Trace: [207635.698553] [<ffffffff816f2de9>] schedule+0x29/0x70 [207635.698556] [<ffffffff816f309e>] schedule_preempt_disabled+0xe/0x10 [207635.698560] [<ffffffff816f1d17>] __mutex_lock_slowpath+0xd7/0x150 [207635.698565] [<ffffffff816f192a>] mutex_lock+0x2a/0x50 [207635.698611] [<ffffffffa0194570>] vdev_queue_io_done+0x40/0x120 [zfs] [207635.698665] [<ffffffffa01cbb68>] zio_vdev_io_done+0x88/0x190 [zfs] [207635.698712] [<ffffffffa01cc60a>] zio_execute+0xba/0x160 [zfs] [207635.698726] [<ffffffffa007c586>] taskq_thread+0x236/0x4b0 [spl] [207635.698731] [<ffffffff81092130>] ? try_to_wake_up+0x200/0x200 [207635.698743] [<ffffffffa007c350>] ? task_done+0x160/0x160 [spl] [207635.698747] [<ffffffff8107f1b0>] kthread+0xc0/0xd0 [207635.698752] [<ffffffff8107f0f0>] ? flush_kthread_worker+0xb0/0xb0 [207635.698756] [<ffffffff816fcaec>] ret_from_fork+0x7c/0xb0 [207635.698761] [<ffffffff8107f0f0>] ? flush_kthread_worker+0xb0/0xb0

@raketentimo
Copy link

I think I have the same issue (first post):

BUG: soft lockup - CPU#1 stuck for 67s! [kswapd0:98]
Modules linked in: mpt2sas raid_class mptctl mptbase nfsd lockd nfs_acl auth_rpcgss exportfs autofs4 sunrpc 8021q garp stp llc cpufreq_ondemand acpi_cpufreq freq_table mperf ipv6 zfs(P)(U) zcommon(P)(U) znvpair(P)(U) zavl(P)(U) zunicode(P)(U) spl(U) zlib_deflate e1000e s
es enclosure ixgbe ptp pps_core mdio sg microcode serio_raw sb_edac edac_core i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support ioatdma dca shpchp ext3 jbd mbcache sd_mod crc_t10dif isci libsas scsi_transport_sas megaraid_sas ahci dm_mirror dm_region_hash dm_log dm_mod [las
t unloaded: scsi_wait_scan]
CPU 1 
Modules linked in: mpt2sas raid_class mptctl mptbase nfsd lockd nfs_acl auth_rpcgss exportfs autofs4 sunrpc 8021q garp stp llc cpufreq_ondemand acpi_cpufreq freq_table mperf ipv6 zfs(P)(U) zcommon(P)(U) znvpair(P)(U) zavl(P)(U) zunicode(P)(U) spl(U) zlib_deflate e1000e ses enclosure ixgbe ptp pps_core mdio sg microcode serio_raw sb_edac edac_core i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support ioatdma dca shpchp ext3 jbd mbcache sd_mod crc_t10dif isci libsas scsi_transport_sas megaraid_sas ahci dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]

Pid: 98, comm: kswapd0 Tainted: P           ---------------    2.6.32-358.2.1.el6.x86_64 #1 Supermicro X9SRL-F/X9SRL-F
RIP: 0010:[<ffffffff8112c74f>]  [<ffffffff8112c74f>] free_hot_page+0x2f/0x60
RSP: 0018:ffff881855cddaa0  EFLAGS: 00000297
RAX: 00000000000000a9 RBX: ffff881855cddab0 RCX: 0000000000000010
RDX: ffffea0011ea73d0 RSI: ffff88185bb0f650 RDI: 0000000000000297
RBP: ffffffff8100bb8e R08: ffffea0011ea73d0 R09: 0000000000000000
R10: ffff8800284094c0 R11: 0000000000000000 R12: ffff880000022d80
R13: ffffea0011ea70d0 R14: ffff881855cdda50 R15: 000000037fc40100
FS:  0000000000000000(0000) GS:ffff880028240000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000003578218280 CR3: 0000000001a85000 CR4: 00000000000407e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process kswapd0 (pid: 98, threadinfo ffff881855cdc000, task ffff88185ae02080)
Stack:
 ffffea0011ea73e0 0000000000000000 ffff881855cddae0 ffffffff8112c7e0
<d> ffff8805ab859140 0000000000000068 ffff8805821f35c0 0000000000000001
<d> ffff881855cddb10 ffffffff81150d5b 000000000000201b ffff8818537c0000
Call Trace:
 [<ffffffff8112c7e0>] ? __free_pages+0x60/0xa0
 [<ffffffff81150d5b>] ? __vunmap+0x7b/0x120
 [<ffffffff81150e7a>] ? vfree+0x2a/0x40
 [<ffffffffa02203c5>] ? kv_free+0x65/0x70 [spl]
 [<ffffffffa02206a9>] ? spl_slab_reclaim+0x2d9/0x3e0 [spl]
 [<ffffffffa02208f4>] ? spl_kmem_cache_reap_now+0x144/0x230 [spl]
 [<ffffffffa02e8f8b>] ? arc_kmem_reap_now+0x4b/0xc0 [zfs]
 [<ffffffffa02e90df>] ? arc_shrinker_func+0xdf/0x1e0 [zfs]
 [<ffffffff81131f0a>] ? shrink_slab+0x12a/0x1a0
 [<ffffffff811350fa>] ? balance_pgdat+0x59a/0x820
 [<ffffffff811354b4>] ? kswapd+0x134/0x3c0
 [<ffffffff81096ca0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff81135380>] ? kswapd+0x0/0x3c0
 [<ffffffff81096936>] ? kthread+0x96/0xa0
 [<ffffffff8100c0ca>] ? child_rip+0xa/0x20
 [<ffffffff810968a0>] ? kthread+0x0/0xa0
 [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
Code: 48 83 ec 10 48 89 1c 24 4c 89 64 24 08 0f 1f 44 00 00 44 8b 1d cb 02 aa 00 48 89 fb 45 85 db 75 15 48 89 df 31 f6 e8 21 fd ff ff <48> 8b 1c 24 4c 8b 64 24 08 c9 c3 4c 8b 25 bf 02 aa 00 4d 85 e4 
Call Trace:
 [<ffffffff8112c7e0>] ? __free_pages+0x60/0xa0
 [<ffffffff81150d5b>] ? __vunmap+0x7b/0x120
 [<ffffffff81150e7a>] ? vfree+0x2a/0x40
 [<ffffffffa02203c5>] ? kv_free+0x65/0x70 [spl]
 [<ffffffffa02206a9>] ? spl_slab_reclaim+0x2d9/0x3e0 [spl]
 [<ffffffffa02208f4>] ? spl_kmem_cache_reap_now+0x144/0x230 [spl]
 [<ffffffffa02e8f8b>] ? arc_kmem_reap_now+0x4b/0xc0 [zfs]
 [<ffffffffa02e90df>] ? arc_shrinker_func+0xdf/0x1e0 [zfs]
 [<ffffffff81131f0a>] ? shrink_slab+0x12a/0x1a0
 [<ffffffff811350fa>] ? balance_pgdat+0x59a/0x820
 [<ffffffff811354b4>] ? kswapd+0x134/0x3c0
 [<ffffffff81096ca0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff81135380>] ? kswapd+0x0/0x3c0
 [<ffffffff81096936>] ? kthread+0x96/0xa0
 [<ffffffff8100c0ca>] ? child_rip+0xa/0x20
 [<ffffffff810968a0>] ? kthread+0x0/0xa0
 [<ffffffff8100c0c0>] ? child_rip+0x0/0x20

And a dead kswapd process using 100% CPU load ( -> one core )
All the litte tricks ( forcing cache flush etc do not work )

Here is a earlier one:

brt_version:   2.0.8
cmdline:        ro root=UUID=d5d9c24a-4ae3-4601-8e65-a2869af2af69 rd_NO_LUKS rd_NO_LVM LANG=en_US.UTF-8 rd_NO_MD quiet SYSFONT=latarcyrheb-sun16 rhgb crashkernel=135M@0M  KEYBOARDTYPE=pc KEYTABLE=us rd_NO_DM
kernel:         2.6.32-358.2.1.el6.x86_64
not-reportable: A kernel problem occurred, but your kernel has been tainted (flags:P           ). Kernel maintainers are unable to diagnose tainted reports.
time:           Sat 01 Mar 2014 10:35:28 AM CET

sosreport.tar.xz: Binary file, 1523424 bytes

backtrace:
:BUG: soft lockup - CPU#1 stuck for 67s! [kswapd0:98]
:Modules linked in: mpt2sas raid_class mptctl mptbase nfsd lockd nfs_acl auth_rpcgss exportfs autofs4 sunrpc 8021q garp stp llc cpufreq_ondemand acpi_cpufreq freq_table mperf ipv6 zfs(P)(U) zcommon(P)(U) znvpair(P)(U) zavl(P)(U) zunicode(P)(U) spl(U) zlib_deflate e1000e ses enclosure ixgbe ptp pps_core mdio sg microcode serio_raw sb_edac edac_core i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support ioatdma dca shpchp ext3 jbd mbcache sd_mod crc_t10dif isci libsas scsi_transport_sas megaraid_sas ahci dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
:CPU 1 
:Modules linked in: mpt2sas raid_class mptctl mptbase nfsd lockd nfs_acl auth_rpcgss exportfs autofs4 sunrpc 8021q garp stp llc cpufreq_ondemand acpi_cpufreq freq_table mperf ipv6 zfs(P)(U) zcommon(P)(U) znvpair(P)(U) zavl(P)(U) zunicode(P)(U) spl(U) zlib_deflate e1000e ses enclosure ixgbe ptp pps_core mdio sg microcode serio_raw sb_edac edac_core i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support ioatdma dca shpchp ext3 jbd mbcache sd_mod crc_t10dif isci libsas scsi_transport_sas megaraid_sas ahci dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
:Pid: 98, comm: kswapd0 Tainted: P           ---------------    2.6.32-358.2.1.el6.x86_64 #1 Supermicro X9SRL-F/X9SRL-F
:RIP: 0010:[<ffffffff8112c74f>]  [<ffffffff8112c74f>] free_hot_page+0x2f/0x60
:RSP: 0018:ffff881855cddaa0  EFLAGS: 00000297
:RAX: 00000000000000a9 RBX: ffff881855cddab0 RCX: 0000000000000010
:RDX: ffffea0011ea73d0 RSI: ffff88185bb0f650 RDI: 0000000000000297
:RBP: ffffffff8100bb8e R08: ffffea0011ea73d0 R09: 0000000000000000
:R10: ffff8800284094c0 R11: 0000000000000000 R12: ffff880000022d80
:R13: ffffea0011ea70d0 R14: ffff881855cdda50 R15: 000000037fc40100
:FS:  0000000000000000(0000) GS:ffff880028240000(0000) knlGS:0000000000000000
:CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
:CR2: 0000003578218280 CR3: 0000000001a85000 CR4: 00000000000407e0
:DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
:DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
:Process kswapd0 (pid: 98, threadinfo ffff881855cdc000, task ffff88185ae02080)
:Stack:
:ffffea0011ea73e0 0000000000000000 ffff881855cddae0 ffffffff8112c7e0
: ffff8805ab859140 0000000000000068 ffff8805821f35c0 0000000000000001
: ffff881855cddb10 ffffffff81150d5b 000000000000201b ffff8818537c0000
:Call Trace:
:[<ffffffff8112c7e0>] ? __free_pages+0x60/0xa0
:[<ffffffff81150d5b>] ? __vunmap+0x7b/0x120
:[<ffffffff81150e7a>] ? vfree+0x2a/0x40
:[<ffffffffa02203c5>] ? kv_free+0x65/0x70 [spl]
:[<ffffffffa02206a9>] ? spl_slab_reclaim+0x2d9/0x3e0 [spl]
:[<ffffffffa02208f4>] ? spl_kmem_cache_reap_now+0x144/0x230 [spl]
:[<ffffffffa02e8f8b>] ? arc_kmem_reap_now+0x4b/0xc0 [zfs]
:[<ffffffffa02e90df>] ? arc_shrinker_func+0xdf/0x1e0 [zfs]
:[<ffffffff81131f0a>] ? shrink_slab+0x12a/0x1a0
:[<ffffffff811350fa>] ? balance_pgdat+0x59a/0x820
:[<ffffffff811354b4>] ? kswapd+0x134/0x3c0
:[<ffffffff81096ca0>] ? autoremove_wake_function+0x0/0x40
:[<ffffffff81135380>] ? kswapd+0x0/0x3c0
:[<ffffffff81096936>] ? kthread+0x96/0xa0
:[<ffffffff8100c0ca>] ? child_rip+0xa/0x20
:[<ffffffff810968a0>] ? kthread+0x0/0xa0
:[<ffffffff8100c0c0>] ? child_rip+0x0/0x20
:Code: 48 83 ec 10 48 89 1c 24 4c 89 64 24 08 0f 1f 44 00 00 44 8b 1d cb 02 aa 00 48 89 fb 45 85 db 75 15 48 89 df 31 f6 e8 21 fd ff ff <48> 8b 1c 24 4c 8b 64 24 08 c9 c3 4c 8b 25 bf 02 aa 00 4d 85 e4

and again a dead kswapd.

These issues came up ahter we where adding SSD for l2arc cache.

System information:

hyperdrive:
    version: 5000
    name: 'hyperdrive'
    state: 0
    txg: 4
    pool_guid: 18113477370619874072
    hostid: 8323328
    hostname: 'hyperspace'
    vdev_children: 3
    vdev_tree:
        type: 'root'
        id: 0
        guid: 18113477370619874072
        create_txg: 4
        children[0]:
            type: 'disk'
            id: 0
            guid: 375145638163873069
            path: '/dev/disk/by-id/wwn-0x600605b00380e2301aa3cd053ae541a0-part1'
            whole_disk: 1
            metaslab_array: 37
            metaslab_shift: 39
            ashift: 12
            asize: 96005388763136
            is_log: 0
            create_txg: 4
        children[1]:
            type: 'disk'
            id: 1
            guid: 9307061418147530512
            path: '/dev/disk/by-id/ata-Samsung_SSD_840_PRO_Series_S1ATNSADC03205L-part2'
            whole_disk: 0
            metaslab_array: 36
            metaslab_shift: 24
            ashift: 12
            asize: 2142765056
            is_log: 1
            create_txg: 4
        children[2]:
            type: 'disk'
            id: 2
            guid: 75045006762167238
            path: '/dev/disk/by-id/ata-Samsung_SSD_840_PRO_Series_S1ATNSADC03217B-part2'
            whole_disk: 0
            metaslab_array: 35
            metaslab_shift: 24
            ashift: 12
            asize: 2142765056
            is_log: 1
            create_txg: 4
    features_for_read:
hyperpool:
    version: 5000
    name: 'hyperpool'
    state: 0
    txg: 562531
    pool_guid: 7814256871853716583
    hostid: 8323328
    hostname: 'hyperspace'
    vdev_children: 3
    vdev_tree:
        type: 'root'
        id: 0
        guid: 7814256871853716583
        children[0]:
            type: 'disk'
            id: 0
            guid: 8187808693641915575
            path: '/dev/disk/by-id/scsi-3600605b0039615501a7ea97250bf6b3f-part1'
            whole_disk: 1
            metaslab_array: 36
            metaslab_shift: 37
            ashift: 12
            asize: 19998426267648
            is_log: 0
            create_txg: 4
        children[1]:
            type: 'disk'
            id: 1
            guid: 14301691803657194035
            path: '/dev/disk/by-id/ata-Samsung_SSD_840_PRO_Series_S1ATNSADC03205L-part1'
            whole_disk: 0
            metaslab_array: 69634
            metaslab_shift: 24
            ashift: 9
            asize: 2142765056
            is_log: 1
            create_txg: 562529
        children[2]:
            type: 'disk'
            id: 2
            guid: 7167947407955197219
            path: '/dev/disk/by-id/ata-Samsung_SSD_840_PRO_Series_S1ATNSADC03217B-part1'
            whole_disk: 0
            metaslab_array: 69633
            metaslab_shift: 24
            ashift: 9
            asize: 2142765056
            is_log: 1
            create_txg: 562529
top - 16:23:58 up 5 days, 16:57,  4 users,  load average: 1.30, 1.51, 1.60
Tasks: 495 total,   3 running, 492 sleeping,   0 stopped,   0 zombie
Cpu0  :  0.3%us,  1.7%sy,  0.0%ni, 97.7%id,  0.3%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  :  0.0%us, 74.8%sy,  0.0%ni, 24.8%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu2  :  0.0%us,  1.7%sy,  0.0%ni, 93.4%id,  4.6%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu3  :  0.0%us,  2.0%sy,  0.0%ni, 94.4%id,  3.3%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu4  :  0.0%us,  1.3%sy,  0.0%ni, 93.7%id,  5.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu5  :  0.0%us, 24.8%sy,  0.0%ni, 75.2%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu6  :  0.0%us,  0.3%sy,  0.0%ni, 98.7%id,  1.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu7  :  0.0%us,  1.3%sy,  0.0%ni, 91.3%id,  7.0%wa,  0.0%hi,  0.3%si,  0.0%st
Mem:  99039944k total, 72360372k used, 26679572k free,   208736k buffers
Swap:  8191992k total,        0k used,  8191992k free,   702788k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                                                                                                                                                           
   98 root      20   0     0    0    0 R 99.8  0.0   3118:46 kswapd0                                                                                                                                                                                                           
 3320 root      20   0     0    0    0 S  3.3  0.0  19:45.97 nfsd 

CentOS release 6.4 (Final), with ZFS 0.6.3

We do lots of zfs sync to our mirror. We had not these issues without the l2arc cache ssds

If you need more informa I will do my very best.
( I am a little scared that the resolve is planned for 6.6 )

@zeigerpuppy
Copy link

Running Debian Wheezy/sid hybrid here with zfsonlinux 2wheezy, (zfsutils 0.6.2-4wheezy)
also using Xen 4.3 kernel 3.2.0-4-amd64. It's on an E5 Xeon server with 32 cores and 92MB memory (generally under low load currently).
I am getting similar soft locks when running scrubs mostly. I don't think it's a Xen issue as I have followed the guidlines here http://zfsonlinux.org/faq.html#HowCanIHelp (autoballooning off)
I have not manually set arc-max but it looks sane at 2GB (with 32GB allocated to Dom0).
The server recovers after the soft lock and there is no data corruption. Running 3 zpools:

 pool: zpool1
 state: ONLINE
 scan: scrub in progress since Fri Mar 7 20:05:42 2014
 39.2G scanned out of 718G at 45.0M/s, 4h17m to go
 0 repaired, 5.47% done
config:

    NAME STATE READ WRITE CKSUM
    zpool1 ONLINE 0 0 0
     mirror-0 ONLINE 0 0 0
     scsi-35000******** ONLINE 0 0 0
     scsi-35000******** ONLINE 0 0 0
     mirror-1 ONLINE 0 0 0
     scsi-35000******** ONLINE 0 0 0
     scsi-35000******** ONLINE 0 0 0
    logs
     mirror-2 ONLINE 0 0 0
     scsi-SATA_INTEL_SSD********-part3 ONLINE 0 0 0
     scsi-SATA_INTEL_SSD********-part3 ONLINE 0 0 0
    cache
     scsi-SATA_INTEL_SSD********-part1 ONLINE 0 0 0

errors: No known data errors

 pool: zpool2
 state: ONLINE
 scan: scrub repaired 0 in 0h1m with 0 errors on Wed Feb 12 13:12:39 2014
config:

    NAME STATE READ WRITE CKSUM
    zpool2 ONLINE 0 0 0
     scsi-SATA_INTEL_SSD********-part7 ONLINE 0 0 0
     scsi-SATA_INTEL_SSD********-part7 ONLINE 0 0 0

errors: No known data errors

 pool: zpool3
 state: ONLINE
 scan: scrub repaired 0 in 0h0m with 0 errors on Wed Feb 12 13:11:47 2014
config:

    NAME STATE READ WRITE CKSUM
    zpool3 ONLINE 0 0 0
     mirror-0 ONLINE 0 0 0
     scsi-SATA_INTEL_SSD********-part6 ONLINE 0 0 0
     scsi-SATA_INTEL_SSD********-part6 ONLINE 0 0 0

I have had 4 soft locks over about a month of the server running, all affecting kswapd eg:

[3593034.714225] INFO: rcu_sched detected stall on CPU 3 (t=15000 jiffies)
[3593034.714265] sending NMI to all CPUs:
[3593055.826227] BUG: soft lockup - CPU#3 stuck for 23s! [kswapd0:48]
[3593055.826267] Modules linked in: xt_physdev xen_netback xen_blkback xt_multiport iptable_filter ip_tables x_tables xen_gntdev xen_evtchn xenfs nfsd nfs nfs_acl auth_rpcgss fscache lockd sunrpc bridge stp xen_pciback loop coretemp crc32c_intel ghash_clmulni_intel aesni_intel aes_x86_64 aes_generic cryptd snd_pcm snd_page_alloc snd_timer snd soundcore pcspkr psmouse serio_raw sb_edac zfs(P) edac_core zunicode(P) zavl(P) zcommon(P) znvpair(P) i2c_i801 i2c_core spl(O) zlib_deflate iTCO_wdt iTCO_vendor_support evdev joydev ioatdma acpi_cpufreq mperf processor thermal_sys wmi container button ext4 crc16 jbd2 mbcache dm_mod usb_storage raid1 md_mod usbhid hid sg sd_mod crc_t10dif ahci libahci libata ehci_hcd usbcore usb_common mpt2sas raid_class scsi_transport_sas scsi_mod igb dca [last unloaded: scsi_wait_scan]
[3593055.826321] CPU 3 
[3593055.826322] Modules linked in: xt_physdev xen_netback xen_blkback xt_multiport iptable_filter ip_tables x_tables xen_gntdev xen_evtchn xenfs nfsd nfs nfs_acl auth_rpcgss fscache lockd sunrpc bridge stp xen_pciback loop coretemp crc32c_intel ghash_clmulni_intel aesni_intel aes_x86_64 aes_generic cryptd snd_pcm snd_page_alloc snd_timer snd soundcore pcspkr psmouse serio_raw sb_edac zfs(P) edac_core zunicode(P) zavl(P) zcommon(P) znvpair(P) i2c_i801 i2c_core spl(O) zlib_deflate iTCO_wdt iTCO_vendor_support evdev joydev ioatdma acpi_cpufreq mperf processor thermal_sys wmi container button ext4 crc16 jbd2 mbcache dm_mod usb_storage raid1 md_mod usbhid hid sg sd_mod crc_t10dif ahci libahci libata ehci_hcd usbcore usb_common mpt2sas raid_class scsi_transport_sas scsi_mod igb dca [last unloaded: scsi_wait_scan]
[3593055.826356] 
[3593055.826359] Pid: 48, comm: kswapd0 Tainted: P O 3.2.0-4-amd64 #1 Debian 3.2.51-1 Supermicro X9DRW-3LN4F+/X9DRW-3TF+/X9DRW-3LN4F+/X9DRW-3TF+
[3593055.826364] RIP: e030:[ ] [ ] remove_vm_area+0x45/0x7f
[3593055.826371] RSP: e02b:ffff880896361b60 EFLAGS: 00000287
[3593055.826373] RAX: ffff8802252b06c0 RBX: ffff88053065f940 RCX: ffffc9031f7f8000
[3593055.826375] RDX: ffff8805183bbb00 RSI: 000000000000059f RDI: ffffffff81629110
[3593055.826377] RBP: ffff880530be5840 R08: ffff88052ec5dc00 R09: ffff88052ee92000
[3593055.826379] R10: ffff88052ee92000 R11: ffff88053065fd00 R12: 0000000000000000
[3593055.826381] R13: ffff880896361bd0 R14: ffff880896361be0 R15: ffff880896361bc8
[3593055.826387] FS: 00007f82b9b4d700(0000) GS:ffff8808ff260000(0000) knlGS:0000000000000000
[3593055.826389] CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
[3593055.826391] CR2: 00007f82b95a048f CR3: 0000000000000000 CR4: 0000000000000000
[3593055.826393] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[3593055.826395] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000000
[3593055.826398] Process kswapd0 (pid: 48, threadinfo ffff880896360000, task ffff88089635f890)
[3593055.826399] Stack:
[3593055.826422] ffff88052ec5dc00 ffff880897bea000 ffffc9031f87a000 ffffffff810dc1ed
[3593055.826426] ffff880896361bc8 ffffc9031f7f8000 ffff880897bea000 ffffc9031f87a000
[3593055.826429] ffff880897beb040 ffffffffa0265dba ffff880896361bf0 ffff880888c79040
[3593055.826432] Call Trace:
[3593055.826458] [ ] ? __vunmap+0x35/0xb8
[3593055.826466] [ ] ? spl_slab_reclaim+0x2c0/0x324 [spl]
[3593055.826471] [ ] ? spl_kmem_cache_reap_now+0x167/0x1d4 [spl]
[3593055.826476] [ ] ? wake_up_bit+0xd/0x1e
[3593055.826488] [ ] ? arc_kmem_reap_now+0x5a/0x91 [zfs]
[3593055.826497] [ ] ? arc_shrinker_func+0x72/0xd7 [zfs]
[3593055.826501] [ ] ? shrink_slab+0x18f/0x24d
[3593055.826504] [ ] ? balance_pgdat+0x283/0x4b7
[3593055.826507] [ ] ? kswapd+0x2d3/0x303
[3593055.826510] [ ] ? add_wait_queue+0x3c/0x3c
[3593055.826513] [ ] ? balance_pgdat+0x4b7/0x4b7
[3593055.826516] [ ] ? kthread+0x76/0x7e
[3593055.826519] [ ] ? kernel_thread_helper+0x4/0x10
[3593055.826522] [ ] ? int_ret_from_sys_call+0x7/0x1b
[3593055.826526] [ ] ? retint_restore_args+0x5/0x6
[3593055.826528] [ ] ? gs_change+0x13/0x13
[3593055.826530] Code: 8b 58 50 f6 43 18 20 75 38 48 c7 c7 10 91 62 81 e8 e5 31 27 00 48 8b 05 5a 00 70 00 48 c7 c2 40 c1 7d 81 eb 06 48 89 c2 48 8b 00 &lt;48&gt; 39 d8 75 f5 48 8b 03 48 89 02 f0 81 05 05 d0 54 00 00 00 10 
[3593055.826828] Call Trace:
[3593055.826830] [ ] ? __vunmap+0x35/0xb8
[3593055.826835] [ ] ? spl_slab_reclaim+0x2c0/0x324 [spl]
[3593055.826840] [ ] ? spl_kmem_cache_reap_now+0x167/0x1d4 [spl]
[3593055.826843] [ ] ? wake_up_bit+0xd/0x1e
[3593055.826851] [ ] ? arc_kmem_reap_now+0x5a/0x91 [zfs]
[3593055.826859] [ ] ? arc_shrinker_func+0x72/0xd7 [zfs]
[3593055.826862] [ ] ? shrink_slab+0x18f/0x24d
[3593055.826865] [ ] ? balance_pgdat+0x283/0x4b7
[3593055.826868] [ ] ? kswapd+0x2d3/0x303
[3593055.826871] [ ] ? add_wait_queue+0x3c/0x3c
[3593055.826873] [ ] ? balance_pgdat+0x4b7/0x4b7
[3593055.826876] [ ] ? kthread+0x76/0x7e
[3593055.826878] [ ] ? kernel_thread_helper+0x4/0x10
[3593055.826881] [ ] ? int_ret_from_sys_call+0x7/0x1b
[3593055.826883] [ ] ? retint_restore_args+0x5/0x6
[3593055.826886] [ ] ? gs_change+0x13/0x13

The other traces are very similar

@mattaw
Copy link

mattaw commented Mar 9, 2014

My problem was solved by updating to git-head. There has been a huge amount
of development, I mean huge, since 0.6.2. The trouble is that git head may
not quite be stable although I have never had a problem updating to
git-head and have done so about 6 times but I am not in production.

Matthew

On 9 March 2014 03:16, zeigerpuppy notifications@github.com wrote:

Running Debian Wheezy/sid hybrid here with zfsonlinux 2wheezy, (zfsutils
0.6.2-4
wheezy)
also using Xen 4.3 kernel 3.2.0-4-amd64. It's on an E5 Xeon server with 32
cores and 92MB memory (generally under low load currently).
I am getting similar soft locks when running scrubs mostly. I don't think
it's a Xen issue as I have followed the guidlines here
http://zfsonlinux.org/faq.html#HowCanIHelp (autoballooning off)
I have not manually set arc-max but it looks sane at 2GB (with 32GB
allocated to Dom0).
The server recovers after the soft lock and there is no data corruption.
Running 3 zpools:

pool: zpool1
state: ONLINE
scan: scrub in progress since Fri Mar 7 20:05:42 2014
39.2G scanned out of 718G at 45.0M/s, 4h17m to go
0 repaired, 5.47% done
config:

NAME STATE READ WRITE CKSUM
zpool1 ONLINE 0 0 0
 mirror-0 ONLINE 0 0 0
 scsi-35000******** ONLINE 0 0 0
 scsi-35000******** ONLINE 0 0 0
 mirror-1 ONLINE 0 0 0
 scsi-35000******** ONLINE 0 0 0
 scsi-35000******** ONLINE 0 0 0
logs
 mirror-2 ONLINE 0 0 0
 scsi-SATA_INTEL_SSD********-part3 ONLINE 0 0 0
 scsi-SATA_INTEL_SSD********-part3 ONLINE 0 0 0
cache
 scsi-SATA_INTEL_SSD********-part1 ONLINE 0 0 0

errors: No known data errors

pool: zpool2
state: ONLINE
scan: scrub repaired 0 in 0h1m with 0 errors on Wed Feb 12 13:12:39 2014
config:

NAME STATE READ WRITE CKSUM
zpool2 ONLINE 0 0 0
 scsi-SATA_INTEL_SSD********-part7 ONLINE 0 0 0
 scsi-SATA_INTEL_SSD********-part7 ONLINE 0 0 0

errors: No known data errors

pool: zpool3
state: ONLINE
scan: scrub repaired 0 in 0h0m with 0 errors on Wed Feb 12 13:11:47 2014
config:

NAME STATE READ WRITE CKSUM
zpool3 ONLINE 0 0 0
 mirror-0 ONLINE 0 0 0
 scsi-SATA_INTEL_SSD********-part6 ONLINE 0 0 0
 scsi-SATA_INTEL_SSD********-part6 ONLINE 0 0 0

I have had 4 soft locks over about a month of the server running, all
affecting kswapd eg:

[3593034.714225] INFO: rcu_sched detected stall on CPU 3 (t=15000 jiffies)
[3593034.714265] sending NMI to all CPUs:
[3593055.826227] BUG: soft lockup - CPU#3 stuck for 23s! [kswapd0:48]
[3593055.826267] Modules linked in: xt_physdev xen_netback xen_blkback xt_multiport iptable_filter ip_tables x_tables xen_gntdev xen_evtchn xenfs nfsd nfs nfs_acl auth_rpcgss fscache lockd sunrpc bridge stp xen_pciback loop coretemp crc32c_intel ghash_clmulni_intel aesni_intel aes_x86_64 aes_generic cryptd snd_pcm snd_page_alloc snd_timer snd soundcore pcspkr psmouse serio_raw sb_edac zfs(P) edac_core zunicode(P) zavl(P) zcommon(P) znvpair(P) i2c_i801 i2c_core spl(O) zlib_deflate iTCO_wdt iTCO_vendor_support evdev joydev ioatdma acpi_cpufreq mperf processor thermal_sys wmi container button ext4 crc16 jbd2 mbcache dm_mod usb_storage raid1 md_mod usbhid hid sg sd_mod crc_t10dif ahci libahci libata ehci_hcd usbcore usb_common mpt2sas raid_class scsi_transport_sas scsi_mod igb dca [last unloaded: scsi_wait_scan]
[3593055.826321] CPU 3
[3593055.826322] Modules linked in: xt_physdev xen_netback xen_blkback xt_multiport iptable_filter ip_tables x_tables xen_gntdev xen_evtchn xenfs nfsd nfs nfs_acl auth_rpcgss fscache lockd sunrpc bridge stp xen_pciback loop coretemp crc32c_intel ghash_clmulni_intel aesni_intel aes_x86_64 aes_generic cryptd snd_pcm snd_page_alloc snd_timer snd soundcore pcspkr psmouse serio_raw sb_edac zfs(P) edac_core zunicode(P) zavl(P) zcommon(P) znvpair(P) i2c_i801 i2c_core spl(O) zlib_deflate iTCO_wdt iTCO_vendor_support evdev joydev ioatdma acpi_cpufreq mperf processor thermal_sys wmi container button ext4 crc16 jbd2 mbcache dm_mod usb_storage raid1 md_mod usbhid hid sg sd_mod crc_t10dif ahci libahci libata ehci_hcd usbcore usb_common mpt2sas raid_class scsi_transport_sas scsi_mod igb dca [last unloaded: scsi_wait_scan]
[3593055.826356]
[3593055.826359] Pid: 48, comm: kswapd0 Tainted: P O 3.2.0-4-amd64 #1 Debian 3.2.51-1 Supermicro X9DRW-3LN4F+/X9DRW-3TF+/X9DRW-3LN4F+/X9DRW-3TF+
[3593055.826364] RIP: e030:[ ] [ ] remove_vm_area+0x45/0x7f
[3593055.826371] RSP: e02b:ffff880896361b60 EFLAGS: 00000287
[3593055.826373] RAX: ffff8802252b06c0 RBX: ffff88053065f940 RCX: ffffc9031f7f8000
[3593055.826375] RDX: ffff8805183bbb00 RSI: 000000000000059f RDI: ffffffff81629110
[3593055.826377] RBP: ffff880530be5840 R08: ffff88052ec5dc00 R09: ffff88052ee92000
[3593055.826379] R10: ffff88052ee92000 R11: ffff88053065fd00 R12: 0000000000000000
[3593055.826381] R13: ffff880896361bd0 R14: ffff880896361be0 R15: ffff880896361bc8
[3593055.826387] FS: 00007f82b9b4d700(0000) GS:ffff8808ff260000(0000) knlGS:0000000000000000
[3593055.826389] CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
[3593055.826391] CR2: 00007f82b95a048f CR3: 0000000000000000 CR4: 0000000000000000
[3593055.826393] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[3593055.826395] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000000
[3593055.826398] Process kswapd0 (pid: 48, threadinfo ffff880896360000, task ffff88089635f890)
[3593055.826399] Stack:
[3593055.826422] ffff88052ec5dc00 ffff880897bea000 ffffc9031f87a000 ffffffff810dc1ed
[3593055.826426] ffff880896361bc8 ffffc9031f7f8000 ffff880897bea000 ffffc9031f87a000
[3593055.826429] ffff880897beb040 ffffffffa0265dba ffff880896361bf0 ffff880888c79040
[3593055.826432] Call Trace:
[3593055.826458] [ ] ? __vunmap+0x35/0xb8
[3593055.826466] [ ] ? spl_slab_reclaim+0x2c0/0x324 [spl]
[3593055.826471] [ ] ? spl_kmem_cache_reap_now+0x167/0x1d4 [spl]
[3593055.826476] [ ] ? wake_up_bit+0xd/0x1e
[3593055.826488] [ ] ? arc_kmem_reap_now+0x5a/0x91 [zfs]
[3593055.826497] [ ] ? arc_shrinker_func+0x72/0xd7 [zfs]
[3593055.826501] [ ] ? shrink_slab+0x18f/0x24d
[3593055.826504] [ ] ? balance_pgdat+0x283/0x4b7
[3593055.826507] [ ] ? kswapd+0x2d3/0x303
[3593055.826510] [ ] ? add_wait_queue+0x3c/0x3c
[3593055.826513] [ ] ? balance_pgdat+0x4b7/0x4b7
[3593055.826516] [ ] ? kthread+0x76/0x7e
[3593055.826519] [ ] ? kernel_thread_helper+0x4/0x10
[3593055.826522] [ ] ? int_ret_from_sys_call+0x7/0x1b
[3593055.826526] [ ] ? retint_restore_args+0x5/0x6
[3593055.826528] [ ] ? gs_change+0x13/0x13
[3593055.826530] Code: 8b 58 50 f6 43 18 20 75 38 48 c7 c7 10 91 62 81 e8 e5 31 27 00 48 8b 05 5a 00 70 00 48 c7 c2 40 c1 7d 81 eb 06 48 89 c2 48 8b 00 <48> 39 d8 75 f5 48 8b 03 48 89 02 f0 81 05 05 d0 54 00 00 00 10
[3593055.826828] Call Trace:
[3593055.826830] [ ] ? __vunmap+0x35/0xb8
[3593055.826835] [ ] ? spl_slab_reclaim+0x2c0/0x324 [spl]
[3593055.826840] [ ] ? spl_kmem_cache_reap_now+0x167/0x1d4 [spl]
[3593055.826843] [ ] ? wake_up_bit+0xd/0x1e
[3593055.826851] [ ] ? arc_kmem_reap_now+0x5a/0x91 [zfs]
[3593055.826859] [ ] ? arc_shrinker_func+0x72/0xd7 [zfs]
[3593055.826862] [ ] ? shrink_slab+0x18f/0x24d
[3593055.826865] [ ] ? balance_pgdat+0x283/0x4b7
[3593055.826868] [ ] ? kswapd+0x2d3/0x303
[3593055.826871] [ ] ? add_wait_queue+0x3c/0x3c
[3593055.826873] [ ] ? balance_pgdat+0x4b7/0x4b7
[3593055.826876] [ ] ? kthread+0x76/0x7e
[3593055.826878] [ ] ? kernel_thread_helper+0x4/0x10
[3593055.826881] [ ] ? int_ret_from_sys_call+0x7/0x1b
[3593055.826883] [ ] ? retint_restore_args+0x5/0x6
[3593055.826886] [ ] ? gs_change+0x13/0x13

The other traces are very similar

Reply to this email directly or view it on GitHubhttps://github.com//issues/1482#issuecomment-37120918
.

@zeigerpuppy
Copy link

All things considered, the server is bleeding edge enough!
I think I'll wait until the next release considering this bug is not really causing any damage as far as I can see.

@FransUrbo
Copy link
Contributor

@lukasz99 @zrav @raketentimo @zeigerpuppy according to @mattaw, this is fixed in latest HEAD. Could you confirm?

@edamato
Copy link

edamato commented Jul 21, 2014

Maybe:

#1579 ->
#1627

@behlendorf
Copy link
Contributor

This issue is understood and was significantly improved in 0.6.3 and newer. Additional improvement plans which should improve things further.

@behlendorf behlendorf modified the milestone: 0.6.6 Nov 8, 2014
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

8 participants