Latest git nightly freezes under load with no log messages (could be xattr & rsync) #2725

Closed
angstymeat opened this issue Sep 21, 2014 · 19 comments

@angstymeat

I don't know if this is related to the other recently-posted issues concerning xattr & rsync, but I thought I would add my problem.

On Thursday I installed the latest git development versions of SPL (0.6.3-9_gf9bde4f) and ZFS (0.6.3-77_g6d9036f) on a Dell R515 server that we use for backups, replacing the 0.6.3 release that had been on it. It's not a vital machine, so I thought it would make a good real-world test; it had been running without issue for over a year.
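
For reference, the loaded module versions can be double-checked with modinfo:

modinfo spl | grep -i version
modinfo zfs | grep -i version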

Since then, the machine consistently freezes up a few minutes into our nightly rsyncs. The freeze appears to be I/O related: if I'm logged in, I can continue to work with it until I need disk access, and then that session hangs forever.

This happens on a Dell R515 server. The OS is Fedora 20 running the 3.16.2-200 kernel (3.16.2-200.fc20.x86_64). The pool is fully upgraded with all of the new feature flags enabled; it uses compression and no dedup. The filesystems are set with xattr=on and acltype=posixacl.

I tried turning xattr off, but it didn't appear to make a difference. I don't see any errors when trying to access files, and scrub appears to run without crashing.
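
For anyone comparing setups, the properties in question can be inspected and changed with the standard zfs tooling (the dataset name below is just a placeholder):

zfs get xattr,acltype tank/backup
zfs set xattr=off tank/backup    # what I tried above
zfs set xattr=sa tank/backup     # store xattrs as system attributes instead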

I get no panic or crash dump concerning ZFS or SPL written to the logs. However, I do get the page allocation failure that I've added at the end of this post.
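
(For context: an order:5 failure means the kernel could not find 2^5 = 32 contiguous pages, i.e. 32 × 4 KiB = 128 KiB, as a single block. That matches the 131072-byte spill_cache objects shown further down, and contiguous allocations that large become unreliable once memory is fragmented.)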

Sep 20 09:56:00 lethe kernel: : [  746.414637] rsync: page allocation failure: order:5, mode:0x40d0
Sep 20 09:56:00 lethe kernel: : [  746.414647] CPU: 1 PID: 1613 Comm: rsync Tainted: P           OE 3.16.2-200.fc20.x86_64 #1
Sep 20 09:56:00 lethe kernel: : [  746.414650] Hardware name: Dell Inc. PowerEdge R515/015C68, BIOS 1.8.6 12/06/2011
Sep 20 09:56:00 lethe kernel: : [  746.414653]  0000000000000000 00000000c0bc2775 ffff8800da92f840 ffffffff81706ce1
Sep 20 09:56:00 lethe kernel: : [  746.414657]  00000000000040d0 ffff8800da92f8d0 ffffffff8118360f 0000000000000780
Sep 20 09:56:00 lethe kernel: : [  746.414660]  0000000000000010 ffffffe000000001 ffff88041ffe9e00 00000005da92f8a0
Sep 20 09:56:00 lethe kernel: : [  746.414663] Call Trace:
Sep 20 09:56:00 lethe kernel: : [  746.414674]  [<ffffffff81706ce1>] dump_stack+0x45/0x56
Sep 20 09:56:00 lethe kernel: : [  746.414679]  [<ffffffff8118360f>] warn_alloc_failed+0xff/0x170
Sep 20 09:56:00 lethe kernel: : [  746.414684]  [<ffffffff81187d70>] __alloc_pages_nodemask+0x8a0/0xaf0
Sep 20 09:56:02 lethe kernel: : [  746.414688]  [<ffffffff811cace4>] alloc_pages_current+0xa4/0x170
Sep 20 09:56:02 lethe kernel: : [  746.414691]  [<ffffffff8118379b>] alloc_kmem_pages+0x3b/0xf0
Sep 20 09:56:02 lethe kernel: : [  746.414695]  [<ffffffff8109a30a>] ? del_timer_sync+0x4a/0x60
Sep 20 09:56:02 lethe kernel: : [  746.414700]  [<ffffffff811a0e58>] kmalloc_order+0x18/0x50
Sep 20 09:56:02 lethe kernel: : [  746.414703]  [<ffffffff811a0eb6>] kmalloc_order_trace+0x26/0xa0
Sep 20 09:56:02 lethe kernel: : [  746.414706]  [<ffffffff81098db0>] ? internal_add_timer+0x70/0x70
Sep 20 09:56:02 lethe kernel: : [  746.414710]  [<ffffffff811d7191>] __kmalloc+0x221/0x240
Sep 20 09:56:02 lethe kernel: : [  746.414732]  [<ffffffffa00fa465>] ? spl_kmem_cache_alloc+0x975/0xd80 [spl]
Sep 20 09:56:02 lethe kernel: : [  746.414740]  [<ffffffffa00fa480>] spl_kmem_cache_alloc+0x990/0xd80 [spl]
Sep 20 09:56:02 lethe kernel: : [  746.414744]  [<ffffffff810d4600>] ? abort_exclusive_wait+0xb0/0xb0
Sep 20 09:56:02 lethe kernel: : [  746.414802]  [<ffffffffa01e21e7>] sa_spill_alloc+0x17/0x20 [zfs]
Sep 20 09:56:02 lethe kernel: : [  746.414839]  [<ffffffffa022b3ea>] zfs_sa_set_xattr+0x7a/0x170 [zfs]
Sep 20 09:56:02 lethe kernel: : [  746.414873]  [<ffffffffa024d38a>] zpl_xattr_set+0x23a/0x2c0 [zfs]
Sep 20 09:56:02 lethe kernel: : [  746.414905]  [<ffffffffa024d480>] zpl_xattr_user_set+0x70/0xa0 [zfs]
Sep 20 09:56:02 lethe kernel: : [  746.414910]  [<ffffffff812179e2>] generic_setxattr+0x62/0x80
Sep 20 09:56:02 lethe kernel: : [  746.414913]  [<ffffffff81218243>] __vfs_setxattr_noperm+0x63/0x1b0
Sep 20 09:56:02 lethe kernel: : [  746.414916]  [<ffffffff81218445>] vfs_setxattr+0xb5/0xc0
Sep 20 09:56:02 lethe kernel: : [  746.414918]  [<ffffffff8121857e>] setxattr+0x12e/0x1c0
Sep 20 09:56:02 lethe kernel: : [  746.414923]  [<ffffffff811fe632>] ? final_putname+0x22/0x50
Sep 20 09:56:02 lethe kernel: : [  746.414925]  [<ffffffff811fe8d9>] ? putname+0x29/0x40
Sep 20 09:56:02 lethe kernel: : [  746.414928]  [<ffffffff81204082>] ? user_path_at_empty+0x72/0xc0
Sep 20 09:56:02 lethe kernel: : [  746.414933]  [<ffffffff811f5e88>] ? __sb_start_write+0x58/0x100
Sep 20 09:56:02 lethe kernel: : [  746.414936]  [<ffffffff811f0803>] ? chmod_common+0xf3/0x140
Sep 20 09:56:02 lethe kernel: : [  746.414939]  [<ffffffff8121895f>] SyS_lsetxattr+0xaf/0xf0
Sep 20 09:56:02 lethe kernel: : [  746.414964]  [<ffffffff8170e329>] system_call_fastpath+0x16/0x1b
Sep 20 09:56:02 lethe kernel: : [  746.414966] Mem-Info:
Sep 20 09:56:02 lethe kernel: : [  746.414968] Node 0 DMA per-cpu:
Sep 20 09:56:02 lethe kernel: : [  746.414970] CPU    0: hi:    0, btch:   1 usd:   0
Sep 20 09:56:02 lethe kernel: : [  746.414972] CPU    1: hi:    0, btch:   1 usd:   0
Sep 20 09:56:02 lethe kernel: : [  746.414973] CPU    2: hi:    0, btch:   1 usd:   0
Sep 20 09:56:02 lethe kernel: : [  746.414974] CPU    3: hi:    0, btch:   1 usd:   0
Sep 20 09:56:02 lethe kernel: : [  746.414976] CPU    4: hi:    0, btch:   1 usd:   0
Sep 20 09:56:02 lethe kernel: : [  746.414977] CPU    5: hi:    0, btch:   1 usd:   0
Sep 20 09:56:02 lethe kernel: : [  746.414979] CPU    6: hi:    0, btch:   1 usd:   0
Sep 20 09:56:02 lethe kernel: : [  746.414980] CPU    7: hi:    0, btch:   1 usd:   0
Sep 20 09:56:02 lethe kernel: : [  746.414981] Node 0 DMA32 per-cpu:
Sep 20 09:56:02 lethe kernel: : [  746.414983] CPU    0: hi:  186, btch:  31 usd:   0
Sep 20 09:56:02 lethe kernel: : [  746.414985] CPU    1: hi:  186, btch:  31 usd:   0
Sep 20 09:56:02 lethe kernel: : [  746.414986] CPU    2: hi:  186, btch:  31 usd:   0
Sep 20 09:56:02 lethe kernel: : [  746.414988] CPU    3: hi:  186, btch:  31 usd:   0
Sep 20 09:56:02 lethe kernel: : [  746.414989] CPU    4: hi:  186, btch:  31 usd:   0
Sep 20 09:56:02 lethe kernel: : [  746.414991] CPU    5: hi:  186, btch:  31 usd:   0
Sep 20 09:56:02 lethe kernel: : [  746.414992] CPU    6: hi:  186, btch:  31 usd:   0
Sep 20 09:56:02 lethe kernel: : [  746.414993] CPU    7: hi:  186, btch:  31 usd:   0
Sep 20 09:56:02 lethe kernel: : [  746.414994] Node 0 Normal per-cpu:
Sep 20 09:56:02 lethe kernel: : [  746.414996] CPU    0: hi:  186, btch:  31 usd:   0
Sep 20 09:56:02 lethe kernel: : [  746.414997] CPU    1: hi:  186, btch:  31 usd:   0
Sep 20 09:56:02 lethe kernel: : [  746.414999] CPU    2: hi:  186, btch:  31 usd:   0
Sep 20 09:56:02 lethe kernel: : [  746.415000] CPU    3: hi:  186, btch:  31 usd:   0
Sep 20 09:56:02 lethe kernel: : [  746.415002] CPU    4: hi:  186, btch:  31 usd:   0
Sep 20 09:56:02 lethe kernel: : [  746.415003] CPU    5: hi:  186, btch:  31 usd:   0
Sep 20 09:56:02 lethe kernel: : [  746.415004] CPU    6: hi:  186, btch:  31 usd:   0
Sep 20 09:56:02 lethe kernel: : [  746.415006] CPU    7: hi:  186, btch:  31 usd:   0
Sep 20 09:56:02 lethe kernel: : [  746.415007] Node 1 Normal per-cpu:
Sep 20 09:56:02 lethe kernel: : [  746.415008] CPU    0: hi:  186, btch:  31 usd:   0
Sep 20 09:56:02 lethe kernel: : [  746.415010] CPU    1: hi:  186, btch:  31 usd:   0
Sep 20 09:56:02 lethe kernel: : [  746.415011] CPU    2: hi:  186, btch:  31 usd:   0
Sep 20 09:56:02 lethe kernel: : [  746.415012] CPU    3: hi:  186, btch:  31 usd:   0
Sep 20 09:56:02 lethe kernel: : [  746.415014] CPU    4: hi:  186, btch:  31 usd:   0
Sep 20 09:56:02 lethe kernel: : [  746.415015] CPU    5: hi:  186, btch:  31 usd:   0
Sep 20 09:56:02 lethe kernel: : [  746.415017] CPU    6: hi:  186, btch:  31 usd:   0
Sep 20 09:56:02 lethe kernel: : [  746.415018] CPU    7: hi:  186, btch:  31 usd:   0
Sep 20 09:56:02 lethe kernel: : [  746.415022] active_anon:19433 inactive_anon:20412 isolated_anon:0
Sep 20 09:56:02 lethe kernel: : [  746.415022]  active_file:16358 inactive_file:16205 isolated_file:0
Sep 20 09:56:02 lethe kernel: : [  746.415022]  unevictable:0 dirty:6 writeback:0 unstable:0
Sep 20 09:56:02 lethe kernel: : [  746.415022]  free:29278 slab_reclaimable:194061 slab_unreclaimable:2474874
Sep 20 09:56:02 lethe kernel: : [  746.415022]  mapped:15261 shmem:84 pagetables:1852 bounce:0
Sep 20 09:56:02 lethe kernel: : [  746.415022]  free_cma:0
Sep 20 09:56:02 lethe kernel: : [  746.415027] Node 0 DMA free:15904kB min:88kB low:108kB high:132kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15996kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Sep 20 09:56:02 lethe kernel: : [  746.415033] lowmem_reserve[]: 0 3350 7826 7826
Sep 20 09:56:02 lethe kernel: : [  746.415035] Node 0 DMA32 free:36420kB min:18984kB low:23728kB high:28476kB active_anon:36860kB inactive_anon:37736kB active_file:14596kB inactive_file:13964kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3644004kB managed:3434312kB mlocked:0kB dirty:4kB writeback:0kB mapped:12316kB shmem:96kB slab_reclaimable:171336kB slab_unreclaimable:2119616kB kernel_stack:928kB pagetables:1784kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
Sep 20 09:56:02 lethe kernel: : [  746.415041] lowmem_reserve[]: 0 0 4476 4476
Sep 20 09:56:02 lethe kernel: : [  746.415043] Node 0 Normal free:20848kB min:25364kB low:31704kB high:38044kB active_anon:25368kB inactive_anon:25484kB active_file:9908kB inactive_file:9908kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:4718592kB managed:4583424kB mlocked:0kB dirty:8kB writeback:0kB mapped:16804kB shmem:0kB slab_reclaimable:210092kB slab_unreclaimable:2699652kB kernel_stack:4880kB pagetables:2592kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
Sep 20 09:56:02 lethe kernel: : [  746.415048] lowmem_reserve[]: 0 0 0 0
Sep 20 09:56:02 lethe kernel: : [  746.415050] Node 1 Normal free:43940kB min:45672kB low:57088kB high:68508kB active_anon:15504kB inactive_anon:18428kB active_file:40928kB inactive_file:40948kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:8388608kB managed:8253204kB mlocked:0kB dirty:12kB writeback:0kB mapped:31924kB shmem:240kB slab_reclaimable:394816kB slab_unreclaimable:5080228kB kernel_stack:3072kB pagetables:3032kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
Sep 20 09:56:02 lethe kernel: : [  746.415056] lowmem_reserve[]: 0 0 0 0
Sep 20 09:56:02 lethe kernel: : [  746.415058] Node 0 DMA: 0*4kB 0*8kB 0*16kB 1*32kB (U) 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (R) 3*4096kB (M) = 15904kB
Sep 20 09:56:02 lethe kernel: : [  746.415068] Node 0 DMA32: 169*4kB (UEM) 160*8kB (UEM) 37*16kB (UEM) 25*32kB (UEM) 19*64kB (UEM) 13*128kB (UEM) 6*256kB (UEM) 8*512kB (UM) 4*1024kB (UM) 2*2048kB (UR) 4*4096kB (MR) = 36436kB
Sep 20 09:56:02 lethe kernel: : [  746.415080] Node 0 Normal: 915*4kB (UEM) 886*8kB (UEM) 337*16kB (UEM) 56*32kB (UMR) 14*64kB (M) 2*128kB (MR) 1*256kB (R) 1*512kB (R) 1*1024kB (R) 0*2048kB 0*4096kB = 20876kB
Sep 20 09:56:02 lethe kernel: : [  746.415090] Node 1 Normal: 2252*4kB (UEM) 3128*8kB (UEM) 374*16kB (UEM) 33*32kB (UEM) 5*64kB (UE) 3*128kB (UR) 1*256kB (R) 0*512kB 0*1024kB 1*2048kB (R) 0*4096kB = 44080kB
Sep 20 09:56:02 lethe kernel: : [  746.415102] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Sep 20 09:56:02 lethe kernel: : [  746.415103] Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Sep 20 09:56:02 lethe kernel: : [  746.415105] 32647 total pagecache pages
Sep 20 09:56:02 lethe kernel: : [  746.415107] 0 pages in swap cache
Sep 20 09:56:02 lethe kernel: : [  746.415108] Swap cache stats: add 3013, delete 3013, find 0/0
Sep 20 09:56:02 lethe kernel: : [  746.415110] Free swap  = 8204520kB
Sep 20 09:56:02 lethe kernel: : [  746.415111] Total swap = 8216572kB
Sep 20 09:56:02 lethe kernel: : [  746.415112] 4191800 pages RAM
Sep 20 09:56:02 lethe kernel: : [  746.415114] 0 pages HighMem/MovableOnly
Sep 20 09:56:02 lethe kernel: : [  746.415115] 33851 pages reserved
Sep 20 09:56:02 lethe kernel: : [  746.415116] 0 pages hwpoisoned
Sep 20 09:56:53 lethe kernel: : [  799.355414] rsync: page allocation failure: order:5, mode:0x40d0
Sep 20 09:56:53 lethe kernel: : [  799.355422] CPU: 1 PID: 1613 Comm: rsync Tainted: P           OE 3.16.2-200.fc20.x86_64 #1
Sep 20 09:56:53 lethe kernel: : [  799.355425] Hardware name: Dell Inc. PowerEdge R515/015C68, BIOS 1.8.6 12/06/2011
Sep 20 09:56:53 lethe kernel: : [  799.355427]  0000000000000000 00000000c0bc2775 ffff8800da92f840 ffffffff81706ce1
Sep 20 09:56:53 lethe kernel: : [  799.355432]  00000000000040d0 ffff8800da92f8d0 ffffffff8118360f 0000000000000780
Sep 20 09:56:53 lethe kernel: : [  799.355435]  0000000000000010 ffffffe0da92f870 ffff88041ffe9e00 00000005da92f8a0
Sep 20 09:56:53 lethe kernel: : [  799.355438] Call Trace:
Sep 20 09:56:53 lethe kernel: : [  799.355449]  [<ffffffff81706ce1>] dump_stack+0x45/0x56
Sep 20 09:56:53 lethe kernel: : [  799.355454]  [<ffffffff8118360f>] warn_alloc_failed+0xff/0x170
Sep 20 09:56:53 lethe kernel: : [  799.355459]  [<ffffffff81187d70>] __alloc_pages_nodemask+0x8a0/0xaf0
Sep 20 09:56:53 lethe kernel: : [  799.355464]  [<ffffffff811cace4>] alloc_pages_current+0xa4/0x170
Sep 20 09:56:53 lethe kernel: : [  799.355466]  [<ffffffff8118379b>] alloc_kmem_pages+0x3b/0xf0
Sep 20 09:56:53 lethe kernel: : [  799.355470]  [<ffffffff8109a30a>] ? del_timer_sync+0x4a/0x60
Sep 20 09:56:53 lethe kernel: : [  799.355475]  [<ffffffff811a0e58>] kmalloc_order+0x18/0x50
Sep 20 09:56:53 lethe kernel: : [  799.355478]  [<ffffffff811a0eb6>] kmalloc_order_trace+0x26/0xa0
Sep 20 09:56:53 lethe kernel: : [  799.355481]  [<ffffffff81098db0>] ? internal_add_timer+0x70/0x70
Sep 20 09:56:53 lethe kernel: : [  799.355485]  [<ffffffff811d7191>] __kmalloc+0x221/0x240
Sep 20 09:56:53 lethe kernel: : [  799.355503]  [<ffffffffa00fa465>] ? spl_kmem_cache_alloc+0x975/0xd80 [spl]
Sep 20 09:56:53 lethe kernel: : [  799.355511]  [<ffffffffa00fa480>] spl_kmem_cache_alloc+0x990/0xd80 [spl]
Sep 20 09:56:53 lethe kernel: : [  799.355515]  [<ffffffff810d4600>] ? abort_exclusive_wait+0xb0/0xb0
Sep 20 09:56:53 lethe kernel: : [  799.355568]  [<ffffffffa01e21e7>] sa_spill_alloc+0x17/0x20 [zfs]
Sep 20 09:56:53 lethe kernel: : [  799.355604]  [<ffffffffa022b3ea>] zfs_sa_set_xattr+0x7a/0x170 [zfs]
Sep 20 09:56:53 lethe kernel: : [  799.355638]  [<ffffffffa024d38a>] zpl_xattr_set+0x23a/0x2c0 [zfs]
Sep 20 09:56:53 lethe kernel: : [  799.355671]  [<ffffffffa024d480>] zpl_xattr_user_set+0x70/0xa0 [zfs]
Sep 20 09:56:53 lethe kernel: : [  799.355675]  [<ffffffff812179e2>] generic_setxattr+0x62/0x80
Sep 20 09:56:53 lethe kernel: : [  799.355678]  [<ffffffff81218243>] __vfs_setxattr_noperm+0x63/0x1b0
Sep 20 09:56:53 lethe kernel: : [  799.355681]  [<ffffffff81218445>] vfs_setxattr+0xb5/0xc0
Sep 20 09:56:53 lethe kernel: : [  799.355684]  [<ffffffff8121857e>] setxattr+0x12e/0x1c0
Sep 20 09:56:53 lethe kernel: : [  799.355688]  [<ffffffff811fe632>] ? final_putname+0x22/0x50
Sep 20 09:56:53 lethe kernel: : [  799.355691]  [<ffffffff811fe8d9>] ? putname+0x29/0x40
Sep 20 09:56:53 lethe kernel: : [  799.355694]  [<ffffffff81204082>] ? user_path_at_empty+0x72/0xc0
Sep 20 09:56:53 lethe kernel: : [  799.355699]  [<ffffffff811f5e88>] ? __sb_start_write+0x58/0x100
Sep 20 09:56:53 lethe kernel: : [  799.355702]  [<ffffffff811f0803>] ? chmod_common+0xf3/0x140
Sep 20 09:56:53 lethe kernel: : [  799.355705]  [<ffffffff8121895f>] SyS_lsetxattr+0xaf/0xf0
Sep 20 09:56:53 lethe kernel: : [  799.355709]  [<ffffffff8170e329>] system_call_fastpath+0x16/0x1b
Sep 20 09:56:53 lethe kernel: : [  799.355710] Mem-Info:
Sep 20 09:56:53 lethe kernel: : [  799.355712] Node 0 DMA per-cpu:
Sep 20 09:56:53 lethe kernel: : [  799.355715] CPU    0: hi:    0, btch:   1 usd:   0
Sep 20 09:56:53 lethe kernel: : [  799.355716] CPU    1: hi:    0, btch:   1 usd:   0
Sep 20 09:56:53 lethe kernel: : [  799.355717] CPU    2: hi:    0, btch:   1 usd:   0
Sep 20 09:56:53 lethe kernel: : [  799.355719] CPU    3: hi:    0, btch:   1 usd:   0
Sep 20 09:56:53 lethe kernel: : [  799.355720] CPU    4: hi:    0, btch:   1 usd:   0
Sep 20 09:56:53 lethe kernel: : [  799.355721] CPU    5: hi:    0, btch:   1 usd:   0
Sep 20 09:56:53 lethe kernel: : [  799.355722] CPU    6: hi:    0, btch:   1 usd:   0
Sep 20 09:56:53 lethe kernel: : [  799.355724] CPU    7: hi:    0, btch:   1 usd:   0
Sep 20 09:56:53 lethe kernel: : [  799.355725] Node 0 DMA32 per-cpu:
Sep 20 09:56:53 lethe kernel: : [  799.355727] CPU    0: hi:  186, btch:  31 usd:   0
Sep 20 09:56:53 lethe kernel: : [  799.355729] CPU    1: hi:  186, btch:  31 usd:   0
Sep 20 09:56:53 lethe kernel: : [  799.355730] CPU    2: hi:  186, btch:  31 usd:   0
Sep 20 09:56:53 lethe kernel: : [  799.355732] CPU    3: hi:  186, btch:  31 usd:   0
Sep 20 09:56:53 lethe kernel: : [  799.355733] CPU    4: hi:  186, btch:  31 usd:   0
Sep 20 09:56:53 lethe kernel: : [  799.355735] CPU    5: hi:  186, btch:  31 usd:   0
Sep 20 09:56:53 lethe kernel: : [  799.355736] CPU    6: hi:  186, btch:  31 usd:   0
Sep 20 09:56:53 lethe kernel: : [  799.355738] CPU    7: hi:  186, btch:  31 usd:   0
Sep 20 09:56:53 lethe kernel: : [  799.355739] Node 0 Normal per-cpu:
Sep 20 09:56:53 lethe kernel: : [  799.355740] CPU    0: hi:  186, btch:  31 usd:   0
Sep 20 09:56:53 lethe kernel: : [  799.355742] CPU    1: hi:  186, btch:  31 usd:   0
Sep 20 09:56:53 lethe kernel: : [  799.355743] CPU    2: hi:  186, btch:  31 usd:   0
Sep 20 09:56:53 lethe kernel: : [  799.355745] CPU    3: hi:  186, btch:  31 usd:   0
Sep 20 09:56:53 lethe kernel: : [  799.355746] CPU    4: hi:  186, btch:  31 usd:   0
Sep 20 09:56:53 lethe kernel: : [  799.355747] CPU    5: hi:  186, btch:  31 usd:   0
Sep 20 09:56:53 lethe kernel: : [  799.355749] CPU    6: hi:  186, btch:  31 usd:   0
Sep 20 09:56:53 lethe kernel: : [  799.355750] CPU    7: hi:  186, btch:  31 usd:   0
Sep 20 09:56:53 lethe kernel: : [  799.355751] Node 1 Normal per-cpu:
Sep 20 09:56:53 lethe kernel: : [  799.355753] CPU    0: hi:  186, btch:  31 usd:   0
Sep 20 09:56:53 lethe kernel: : [  799.355754] CPU    1: hi:  186, btch:  31 usd:   0
Sep 20 09:56:53 lethe kernel: : [  799.355756] CPU    2: hi:  186, btch:  31 usd:   0
Sep 20 09:56:53 lethe kernel: : [  799.355757] CPU    3: hi:  186, btch:  31 usd:   0
Sep 20 09:56:53 lethe kernel: : [  799.355759] CPU    4: hi:  186, btch:  31 usd:   0
Sep 20 09:56:53 lethe kernel: : [  799.355760] CPU    5: hi:  186, btch:  31 usd:   0
Sep 20 09:56:53 lethe kernel: : [  799.355762] CPU    6: hi:  186, btch:  31 usd:   0
Sep 20 09:56:53 lethe kernel: : [  799.355763] CPU    7: hi:  186, btch:  31 usd:  59
Sep 20 09:56:53 lethe kernel: : [  799.355767] active_anon:19333 inactive_anon:19832 isolated_anon:0
Sep 20 09:56:53 lethe kernel: : [  799.355767]  active_file:9570 inactive_file:9281 isolated_file:0
Sep 20 09:56:53 lethe kernel: : [  799.355767]  unevictable:0 dirty:6 writeback:0 unstable:0
Sep 20 09:56:53 lethe kernel: : [  799.355767]  free:33139 slab_reclaimable:183173 slab_unreclaimable:2491121
Sep 20 09:56:53 lethe kernel: : [  799.355767]  mapped:11634 shmem:68 pagetables:1863 bounce:0
Sep 20 09:56:53 lethe kernel: : [  799.355767]  free_cma:0
Sep 20 09:56:53 lethe kernel: : [  799.355771] Node 0 DMA free:15904kB min:88kB low:108kB high:132kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15996kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Sep 20 09:56:53 lethe kernel: : [  799.355777] lowmem_reserve[]: 0 3350 7826 7826
Sep 20 09:56:53 lethe kernel: : [  799.355780] Node 0 DMA32 free:32364kB min:18984kB low:23728kB high:28476kB active_anon:36356kB inactive_anon:36924kB active_file:14640kB inactive_file:13460kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3644004kB managed:3434312kB mlocked:0kB dirty:4kB writeback:0kB mapped:12500kB shmem:32kB slab_reclaimable:163392kB slab_unreclaimable:2129864kB kernel_stack:928kB pagetables:1784kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
Sep 20 09:56:53 lethe kernel: : [  799.355786] lowmem_reserve[]: 0 0 4476 4476
Sep 20 09:56:53 lethe kernel: : [  799.355788] Node 0 Normal free:26492kB min:25364kB low:31704kB high:38044kB active_anon:23104kB inactive_anon:23260kB active_file:6276kB inactive_file:6276kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:4718592kB managed:4583424kB mlocked:0kB dirty:4kB writeback:0kB mapped:12332kB shmem:0kB slab_reclaimable:201568kB slab_unreclaimable:2711976kB kernel_stack:4880kB pagetables:2592kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
Sep 20 09:56:53 lethe kernel: : [  799.355793] lowmem_reserve[]: 0 0 0 0
Sep 20 09:56:53 lethe kernel: : [  799.355795] Node 1 Normal free:57796kB min:45672kB low:57088kB high:68508kB active_anon:17872kB inactive_anon:19144kB active_file:17364kB inactive_file:17388kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:8388608kB managed:8253204kB mlocked:0kB dirty:16kB writeback:0kB mapped:21704kB shmem:240kB slab_reclaimable:367732kB slab_unreclaimable:5122644kB kernel_stack:3072kB pagetables:3076kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
Sep 20 09:56:53 lethe kernel: : [  799.355800] lowmem_reserve[]: 0 0 0 0
Sep 20 09:56:53 lethe kernel: : [  799.355803] Node 0 DMA: 0*4kB 0*8kB 0*16kB 1*32kB (U) 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (R) 3*4096kB (M) = 15904kB
Sep 20 09:56:53 lethe kernel: : [  799.355813] Node 0 DMA32: 477*4kB (UEM) 348*8kB (UEM) 151*16kB (UEM) 73*32kB (EM) 37*64kB (EM) 18*128kB (EM) 5*256kB (EM) 2*512kB (EM) 2*1024kB (M) 1*2048kB (R) 3*4096kB (MR) = 32804kB
Sep 20 09:56:53 lethe kernel: : [  799.355824] Node 0 Normal: 1566*4kB (UEM) 1353*8kB (UEM) 487*16kB (UEM) 32*32kB (UEMR) 9*64kB (UR) 4*128kB (UR) 2*256kB (UR) 0*512kB 1*1024kB (R) 0*2048kB 0*4096kB = 28528kB
Sep 20 09:56:53 lethe kernel: : [  799.355834] Node 1 Normal: 2287*4kB (UEM) 4213*8kB (UEM) 796*16kB (EM) 16*32kB (ER) 1*64kB (R) 0*128kB 1*256kB (R) 0*512kB 0*1024kB 1*2048kB (R) 0*4096kB = 58468kB
Sep 20 09:56:53 lethe kernel: : [  799.355845] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Sep 20 09:56:53 lethe kernel: : [  799.355846] Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Sep 20 09:56:53 lethe kernel: : [  799.355848] 18998 total pagecache pages
Sep 20 09:56:53 lethe kernel: : [  799.355850] 80 pages in swap cache
Sep 20 09:56:53 lethe kernel: : [  799.355851] Swap cache stats: add 4479, delete 4399, find 17/145
Sep 20 09:56:53 lethe kernel: : [  799.355853] Free swap  = 8199764kB
Sep 20 09:56:53 lethe kernel: : [  799.355854] Total swap = 8216572kB
Sep 20 09:56:53 lethe kernel: : [  799.355855] 4191800 pages RAM
Sep 20 09:56:53 lethe kernel: : [  799.355857] 0 pages HighMem/MovableOnly
Sep 20 09:56:53 lethe kernel: : [  799.355858] 33851 pages reserved
Sep 20 09:56:53 lethe kernel: : [  799.355859] 0 pages hwpoisoned

@dweeezil (Contributor)

Since you have xattr=on, this is a different case, but your stack trace is helpful nonetheless. I'll be analyzing it later today.

@dweeezil
Copy link
Contributor

Are you sure xattr=sa isn't set?

@angstymeat (Author)

Sorry, I posted this late last night after a trip. I was wrong: it's xattr=sa.

@dweeezil (Contributor)

@angstymeat Could you please tell me what your spl_kmem_cache_slab_limit and spl_kmem_cache_reclaim parameter settings are? Can you easily reproduce this problem? Does the filesystem written to by rsync appear to be corrupted (can you, for example, run an ls -lR on it without it crashing)?

@angstymeat (Author)

spl_kmem_cache_slab_limit = 16384
spl_kmem_cache_reclaim = 0
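
(Read at runtime from the module's sysfs parameters, assuming the stock paths:

cat /sys/module/spl/parameters/spl_kmem_cache_slab_limit
cat /sys/module/spl/parameters/spl_kmem_cache_reclaim)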

So far I've been able to make it crash in about 30 minutes or less any time I run my backup script, so it appears to be easily reproducible.

I haven't seen any corruption. I took a look at the directories after I read #2717, but didn't see a problem. I'm currently running ls -lR on the pool, but it's 5.2TB with a few million files so it could take a while.

@angstymeat (Author)

The ls -lR just finished on the pool, and I didn't get any errors.

@dweeezil (Contributor)

@angstymeat Could you please run grep sa_cache /proc/slabinfo and grep spill_cache /proc/spl/kmem/slab when this happens? Also, could you please try it with spl_kmem_cache_slab_limit=0? You're clearly running into some problem related to SA spilling, but it may not be directly related to the other issues I'm investigating.
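
For reference, a minimal way to collect that info and apply the override, assuming the usual sysfs and modprobe.d paths:

grep sa_cache /proc/slabinfo
grep spill_cache /proc/spl/kmem/slab

# runtime override, if the parameter is writable on this build:
echo 0 > /sys/module/spl/parameters/spl_kmem_cache_slab_limit
# or persistently, applied the next time the module loads:
echo "options spl spl_kmem_cache_slab_limit=0" > /etc/modprobe.d/spl.conf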

@angstymeat (Author)

Ok, I'm running the jobs right now. I'll post when I've got the info.

@angstymeat (Author)

It crashed again, and here are the results:

spill_cache      0x00040         0         0  4194304   131072      0     0     2      0     0    62      1     0     0

There was no output when I ran grep sa_cache /proc/slabinfo, but here's everything that had _cache in it:

zil_lwb_cache          0      0    200   40    2 : tunables    0    0    0 : slabdata      0      0      0
zio_cache             20    240   1080   30    8 : tunables    0    0    0 : slabdata      8      8      0
ddt_entry_cache        0      0    448   36    4 : tunables    0    0    0 : slabdata      0      0      0
scsi_cmd_cache       844    924    384   42    4 : tunables    0    0    0 : slabdata     22     22      0
mqueue_inode_cache     36     36    896   36    8 : tunables    0    0    0 : slabdata      1      1      0
hugetlbfs_inode_cache    108    108    600   54    8 : tunables    0    0    0 : slabdata      2      2      0
ext4_inode_cache    5826  10360   1008   32    8 : tunables    0    0    0 : slabdata    329    329      0
posix_timers_cache     66     66    248   33    2 : tunables    0    0    0 : slabdata      2      2      0
sock_inode_cache     663    663    640   51    8 : tunables    0    0    0 : slabdata     13     13      0
shmem_inode_cache   1680   1680    672   48    8 : tunables    0    0    0 : slabdata     35     35      0
proc_inode_cache    4915   5450    648   50    8 : tunables    0    0    0 : slabdata    109    109      0
bdev_cache           312    312    832   39    8 : tunables    0    0    0 : slabdata      8      8      0
kernfs_node_cache  28526  28526    120   34    1 : tunables    0    0    0 : slabdata    839    839      0
mnt_cache            306    306    320   51    4 : tunables    0    0    0 : slabdata      6      6      0
inode_cache        13384  13384    584   56    8 : tunables    0    0    0 : slabdata    239    239      0
files_cache          459    459    640   51    8 : tunables    0    0    0 : slabdata      9      9      0
signal_cache        1267   1372   1152   28    8 : tunables    0    0    0 : slabdata     49     49      0
sighand_cache        752    765   2112   15    8 : tunables    0    0    0 : slabdata     51     51      0
idr_layer_cache      525    525   2096   15    8 : tunables    0    0    0 : slabdata     35     35      0
kmem_cache_node      496    512     64   64    1 : tunables    0    0    0 : slabdata      8      8      0
kmem_cache           256    256    256   32    2 : tunables    0    0    0 : slabdata      8      8      0

I'm rebooting and I'm going to run with spl_kmem_cache_slab_limit=0, now.

@angstymeat (Author)

The backup completed without error with spl_kmem_cache_slab_limit=0 set.

@dweeezil (Contributor)

@angstymeat Thanks for the report. This could be a simple out-of-memory condition, or it could be something deeper. Could you please try it again with the slab limit set back to 16KiB, boot your kernel with the slub_nomerge parameter, and then try grepping for sa_cache in /proc/slabinfo?
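
On Fedora, one way to add the boot flag (stock grub2 paths; adjust as needed):

# append slub_nomerge to GRUB_CMDLINE_LINUX in /etc/default/grub, then:
grub2-mkconfig -o /boot/grub2/grub.cfg
reboot

slub_nomerge keeps SLUB from merging similarly-sized caches, so sa_cache shows up under its own name in /proc/slabinfo instead of being folded into another cache; that is most likely why the earlier grep returned nothing.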

@angstymeat (Author)

I think it crashed again, but I got no message in the logs. Instead, rsync is either hanging or getting I/O errors. Also, I lost snmpd, and I can't open new ssh sessions or shells because it seems like some I/O is hanging. I just rebooted it from the session I still had open.

I've got these, now:

sa_cache    3838024 3853650     88   46    1 : tunables    0    0    0 : slabdata  83775  83775      0

spill_cache    0x00040         0         0  4194304   131072      0     0     2      0     0    62      0     0     0
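
(For scale: 3,853,650 sa_cache objects at 88 bytes each works out to 3,853,650 × 88 ≈ 323 MiB sitting in that one cache.)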

@dweeezil (Contributor)

@angstymeat Those numbers are pretty much what I'd expect, and they confirm heavy use of the sa_cache. I'm building myself some kernels with less debugging enabled to do some more tests; normally I enable a lot of the mutex debugging options, which balloons the size of many objects such as the sa_cache entries.

I think I may have found something. When using the Linux slab, and after lots of sa_cache entries had been allocated, an echo 3 > /proc/sys/vm/drop_caches deadlocked the system. I'll post more when I've got more information (hopefully later today, when I have time to look into it further).
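
A rough sketch of that reproducer, with hypothetical paths (setfattr comes from the attr package):

cd /tank/test                                 # any dataset with xattr=sa
for i in $(seq 1 100000); do
    touch "f$i"
    setfattr -n user.test -v somevalue "f$i"  # build up SA xattr entries
done
echo 3 > /proc/sys/vm/drop_caches             # the step that deadlocked here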

@dweeezil (Contributor)

dweeezil commented Oct 6, 2014

@angstymeat Regarding the drop caches issue, see #2753 and the related spl issue. If you can reproduce this problem, it might be worth trying those patches given that you're running a pretty new kernel.

@angstymeat (Author)

I'll get this tested out tomorrow or Wednesday. Thanks!

@angstymeat (Author)

I applied the patch in af59f0e to the latest git from this afternoon, and I've run the backup job twice without a problem.

I'm going to let it run at its normal nightly time and keep an eye on it.
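
For anyone following along, a typical way to test a single commit on top of current git (assuming it lives in the zfs repo and cherry-picks cleanly):

git clone https://github.com/zfsonlinux/zfs.git && cd zfs
git cherry-pick af59f0e
./autogen.sh && ./configure && make -s
sudo make install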

@angstymeat (Author)

So far, so good. It looks like the patch has solved the problems.

@dweeezil (Contributor)

@angstymeat Thanks for the update.

@angstymeat (Author)

Just closing up my old issues that have been solved.
