Kernel stall during resilver #3936
Comments
@alexanderhaensch
@tuxoko I updated the post. The resilver is now completed, but I can say that the memory usage was huge during the resilver.
There's huge contention in the SLUB. @behlendorf I see the SPL kmem shrinker will shrink the Linux slab, but I don't think this is needed.
@tuxoko is it better to use SLAB instead of SLUB? I heard that SLUB is more performant than SLAB.
@alexanderhaensch
Linux slab will automatically free empty slab when number of partial slab is over min_partial, so we don't need to explicitly shrink it. In fact, calling kmem_cache_shrink from shrinker will cause heavy contention on kmem_cache_node->list_lock, to the point that it might cause __slab_free to livelock (see openzfs/zfs#3936)
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
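To illustrate the mechanism the commit message describes, here is a minimal sketch of a 3.x-era Linux shrinker, assuming a hypothetical cache and callbacks (demo_cache, demo_count_objects, demo_scan_objects); it is not the actual SPL code or the openzfs/spl#487 patch. The point is that the scan callback reclaims only its own cached objects and does not call kmem_cache_shrink() on the backing slab cache, because SLUB already frees empty slabs once a node's partial list exceeds min_partial, and shrinking from reclaim context serializes on kmem_cache_node->list_lock.

/* Illustrative sketch only -- not the actual openzfs/spl#487 change. */
#include <linux/module.h>
#include <linux/shrinker.h>
#include <linux/slab.h>

static struct kmem_cache *demo_cache;	/* hypothetical backing cache */

/* Report how many objects our private lists could give back (none here). */
static unsigned long demo_count_objects(struct shrinker *shrink,
					struct shrink_control *sc)
{
	return 0;
}

static unsigned long demo_scan_objects(struct shrinker *shrink,
				       struct shrink_control *sc)
{
	/*
	 * Free objects held in the consumer's own magazines/lists here.
	 * Deliberately NOT calling kmem_cache_shrink(demo_cache): SLUB
	 * already releases empty slabs once a node holds more than
	 * min_partial partial slabs, and shrinking from reclaim context
	 * serializes on kmem_cache_node->list_lock.
	 */
	return SHRINK_STOP;
}

static struct shrinker demo_shrinker = {
	.count_objects	= demo_count_objects,
	.scan_objects	= demo_scan_objects,
	.seeks		= DEFAULT_SEEKS,
};

static int __init demo_init(void)
{
	int ret;

	demo_cache = kmem_cache_create("demo_cache", 256, 0, 0, NULL);
	if (demo_cache == NULL)
		return -ENOMEM;

	ret = register_shrinker(&demo_shrinker);
	if (ret)
		kmem_cache_destroy(demo_cache);
	return ret;
}

static void __exit demo_exit(void)
{
	unregister_shrinker(&demo_shrinker);
	kmem_cache_destroy(demo_cache);
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");

In the SPL this corresponds to the shrinker path visible in the backtrace below (spl_kmem_cache_generic_shrinker_scan_objects -> spl_kmem_cache_reap_now -> kmem_cache_shrink).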
@alexanderhaensch
I installed the patch with 0.6.5.3 and started a new resilver. Let's wait...
This issue is solved by openzfs/spl#487.
Linux slab will automatically free empty slab when number of partial slab is over min_partial, so we don't need to explicitly shrink it. In fact, calling kmem_cache_shrink from shrinker will cause heavy contention on kmem_cache_node->list_lock, to the point that it might cause __slab_free to livelock (see openzfs/zfs#3936)
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes openzfs/zfs#3936
Closes #487
System was completely locked up.
CPUs: 2
Memory: 128GB
VM/Hypervisor: no
ECC mem: yes
Distribution: Gentoo GNU/Linux
Kernel version: Linux eos 3.14.51-hardened #1 SMP Wed Sep 16 11:13:14 CEST 2015 x86_64 Intel(R) Xeon(R) CPU E5-2630 v2 @ 2.60GHz GenuineIntel GNU/Linux
SPL/ZFS source: Gentoo Packages
SPL/ZFS version: [ 26.396886] SPL: Loaded module v0.6.5.2-r0-gentoo (DEBUG mode)
[ 26.606759] ZFS: Loaded module v0.6.5.2-r0-gentoo (DEBUG mode), ZFS pool version 5000, ZFS filesystem version 5
Short description: Removing a large number of files caused the system to hang
SLAB allocator: SLUB
INFO: rcu_sched self-detected stall on CPU { 22} (t=2100 jiffies g=47679397 c=47679396 q=171016)
sending NMI to all CPUs:
I think there is a maximum post length here, so I'm showing only the stalled core.
INFO: rcu_sched self-detected stall on CPU { 22} (t=2100 jiffies g=47679397 c=47679396 q=171016)
sending NMI to all CPUs:
--- snip ---
NMI backtrace for cpu 22
CPU: 22 PID: 1856 Comm: dbu_evict Tainted: P O 3.14.51-hardened #1
Hardware name: Supermicro X9DR3-F/X9DR3-F, BIOS 3.0b 06/30/2014
task: ffff881ffe41e6c0 ti: ffff881ffe41ec20 task.ti: ffff881ffe41ec20
RIP: 0010:[] [] __list_del_entry_debug+0x2b/0x90
RSP: 0018:ffff881fe9e73830 EFLAGS: 00000046
RAX: ffffea0048530e20 RBX: ffffea006180b020 RCX: 00000000ffffff02
RDX: ffffea006180b020 RSI: 0000000000000010 RDI: ffffea006180b020
RBP: ffff881fe9e73830 R08: ffffffff818a7fd0 R09: ffff88207fd2fd40
R10: ffffea0070e2d800 R11: ffff88103f803c00 R12: ffffea006180b000
R13: ffffea0048530e00 R14: ffffea006180b020 R15: ffff88202628a600
FS: 00007f4ff720d700(0000) GS:ffff88207fd20000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f4ff65a6310 CR3: 00000010099f4000 CR4: 00000000001607f0
Stack:
ffff881fe9e73848 ffffffff8137723d ffff881c38b60e50 ffff881fe9e738b8
ffffffff8110a958 ffff881c38b60e60 000000018107d20f 0000000000000246
ffff882027a35000 0000000100000002 ffff88202628a610 ffff881c38b60e40
Call Trace:
[] __list_del_entry+0xd/0x30
[] kmem_cache_shrink+0x138/0x250
[] spl_kmem_cache_reap_now+0x13c/0x1d0 [spl]
[] __spl_kmem_cache_generic_shrinker.isra.12+0x9d/0x120 [spl]
[] spl_kmem_cache_generic_shrinker_scan_objects+0xd/0x30 [spl]
[] shrink_slab_node+0x112/0x1b0
[] shrink_slab+0x83/0x150
[] do_try_to_free_pages+0x421/0x550
[] try_to_free_pages+0xb7/0xd0
[] __alloc_pages_nodemask+0x56c/0xa00
[] alloc_pages_current+0xa3/0x170
[] new_slab+0x275/0x300
[] __slab_alloc+0x2bf/0x4ad
[] ? arch_dup_task_struct+0xb9/0x110
[] ? unlock_page+0x1e/0x30
[] ? arch_dup_task_struct+0xb9/0x110
[] kmem_cache_alloc+0x9b/0x130
[] arch_dup_task_struct+0xb9/0x110
[] copy_process.part.44+0x168/0x1880
[] ? __do_page_fault+0x1dc/0x500
[] ? recalc_sigpending+0x16/0x50
[] do_fork+0xcb/0x310
[] ? __set_current_blocked+0x31/0x50
[] ? sigprocmask+0x4f/0x80
[] SyS_clone+0x11/0x20
[] stub_clone+0x65/0x90
[] ? system_call_fastpath+0x16/0x1b
Code: 55 b9 01 ff ff ff 48 8b 07 48 89 e5 48 8b 57 08 48 39 c8 74 22 b9 02 ff ff ff 48 39 ca 74 54 48 8b 12 48 39 d7 75 39 48 8b 50 08 <48> 39 d7 75 1d b8 01 00 00 00 5d c3 48 89 c2 48 89 fe 31 c0 48
--- snap ---