Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Copy files with xattr kills my box #3143

Closed
tobias-k opened this issue Mar 2, 2015 · 23 comments
Closed

Copy files with xattr kills my box #3143

tobias-k opened this issue Mar 2, 2015 · 23 comments
Milestone

Comments

@tobias-k
Copy link

tobias-k commented Mar 2, 2015

Situation:

Server A has a zpool and zfs volumes with following settings

raidz1
acltype=posixacl
compression=lz4
xattr=sa
3.13.0-24-generic #47-Ubuntu SMP

Server B clean installationion with new zpool und zfs volumes. Same Settings like A

Copy Files from Server A to B including xattr causes instant crash of the B box. (rsync)
Copy Files from Server A to B WITHOUT xattr works fine. Setting xattr after copy works fine.

Once the box is crahses while copying you can access the volumes and browse. Deleting files copyed from Box A with xattr causes same crash. Only option is deleting the entire zfs and start again.

100% repruduceable in my case.

Looks like a bug in handling xattr in this specific caombination of settings.

Feb 27 16:35:36 fshh1 kernel: [ 1426.632617] ------------[ cut here ]------------
Feb 27 16:35:36 fshh1 kernel: [ 1426.632617] WARNING: CPU: 2 PID: 3260 at /build/buildd/linux-3.13.0/kernel/watchdog.c:245 watchdog_overflow_callback+0x9c/0xd0()
Feb 27 16:35:36 fshh1 kernel: [ 1426.632617] Watchdog detected hard LOCKUP on cpu 2
Feb 27 16:35:36 fshh1 kernel: [ 1426.632617] Modules linked in: gpio_ich ppdev coretemp kvm_intel kvm joydev radeon lpc_ich ttm drm_kms_helper serio_raw drm i2c_algo_bit i5000_edac edac_core i5k_amb ioatdma dca parport_pc shpchp mac_hid lp parport nfsd auth_rpcgss nfs_acl nfs lockd sunrpc fscache zfs(POF) zunicode(POF) zavl(POF) zcommon(POF) znvpair(POF) spl(OF) hid_generic e1000e usbhid hid mpt2sas pata_acpi raid_class ptp scsi_transport_sas pps_core floppy
Feb 27 16:35:36 fshh1 kernel: [ 1426.632617] CPU: 2 PID: 3260 Comm: rsync Tainted: PF          O 3.13.0-24-generic #47-Ubuntu
Feb 27 16:35:36 fshh1 kernel: [ 1426.632617] Hardware name: Supermicro X7DB8/X7DB8, BIOS 2.1 06/23/2008
Feb 27 16:35:36 fshh1 kernel: [ 1426.632617]  0000000000000009 ffff88042fc86c20 ffffffff81715ac4 ffff88042fc86c68
Feb 27 16:35:36 fshh1 kernel: [ 1426.632617]  ffff88042fc86c58 ffffffff810676bd ffff880419728000 0000000000000000
Feb 27 16:35:36 fshh1 kernel: [ 1426.632617]  ffff88042fc86d88 0000000000000000 ffff88042fc86ef8 ffff88042fc86cb8
Feb 27 16:35:36 fshh1 kernel: [ 1426.632617] Call Trace:
Feb 27 16:35:36 fshh1 kernel: [ 1426.632617]  <NMI>  [<ffffffff81715ac4>] dump_stack+0x45/0x56
Feb 27 16:35:36 fshh1 kernel: [ 1426.632617]  [<ffffffff810676bd>] warn_slowpath_common+0x7d/0xa0
Feb 27 16:35:36 fshh1 kernel: [ 1426.632617]  [<ffffffff8106772c>] warn_slowpath_fmt+0x4c/0x50
Feb 27 16:35:36 fshh1 kernel: [ 1426.632617]  [<ffffffff8110d330>] ? restart_watchdog_hrtimer+0x50/0x50
Feb 27 16:35:36 fshh1 kernel: [ 1426.632617]  [<ffffffff8110d3cc>] watchdog_overflow_callback+0x9c/0xd0
Feb 27 16:35:36 fshh1 kernel: [ 1426.632617]  [<ffffffff81144b3e>] __perf_event_overflow+0x8e/0x240
Feb 27 16:35:36 fshh1 kernel: [ 1426.632617]  [<ffffffff81028e48>] ? x86_perf_event_set_period+0xe8/0x150
Feb 27 16:35:36 fshh1 kernel: [ 1426.632617]  [<ffffffff81145654>] perf_event_overflow+0x14/0x20
Feb 27 16:35:36 fshh1 kernel: [ 1426.632617]  [<ffffffff8103028d>] intel_pmu_handle_irq+0x1cd/0x3d0
Feb 27 16:35:36 fshh1 kernel: [ 1426.632617]  [<ffffffff8171f96b>] perf_event_nmi_handler+0x2b/0x50
Feb 27 16:35:36 fshh1 kernel: [ 1426.632617]  [<ffffffff8171f188>] nmi_handle.isra.3+0x88/0x180
Feb 27 16:35:36 fshh1 kernel: [ 1426.632617]  [<ffffffff8171f3e9>] do_nmi+0x169/0x340
Feb 27 16:35:36 fshh1 kernel: [ 1426.632617]  [<ffffffff8171e5f1>] end_repeat_nmi+0x1e/0x2e
Feb 27 16:35:36 fshh1 kernel: [ 1426.632617]  [<ffffffff8171db4a>] ? _raw_spin_lock_irq+0x3a/0x60
Feb 27 16:35:36 fshh1 kernel: [ 1426.632617]  [<ffffffff8171db4a>] ? _raw_spin_lock_irq+0x3a/0x60
Feb 27 16:35:36 fshh1 kernel: [ 1426.632617]  [<ffffffff8171db4a>] ? _raw_spin_lock_irq+0x3a/0x60
Feb 27 16:35:36 fshh1 kernel: [ 1426.632617]  <<EOE>>  [<ffffffff8171cc3f>] rwsem_down_read_failed+0x3f/0x150
Feb 27 16:35:36 fshh1 kernel: [ 1426.632617]  [<ffffffff8171c302>] ? mutex_lock+0x12/0x2f
Feb 27 16:35:36 fshh1 kernel: [ 1426.632617]  [<ffffffff8136ab34>] call_rwsem_down_read_failed+0x14/0x30
Feb 27 16:35:36 fshh1 kernel: [ 1426.632617]  [<ffffffff8171c540>] ? down_read+0x20/0x30
Feb 27 16:35:36 fshh1 kernel: [ 1426.632617]  [<ffffffffa0227870>] zap_get_leaf_byblk+0xd0/0x2c0 [zfs]
Feb 27 16:35:36 fshh1 kernel: [ 1426.632617]  [<ffffffff8171c302>] ? mutex_lock+0x12/0x2f
Feb 27 16:35:36 fshh1 kernel: [ 1426.632617]  [<ffffffffa0227ac4>] zap_deref_leaf+0x64/0x80 [zfs]
Feb 27 16:35:36 fshh1 kernel: [ 1426.632617]  [<ffffffffa0229697>] fzap_cursor_retrieve+0x107/0x2a0 [zfs]
Feb 27 16:35:36 fshh1 kernel: [ 1426.632617]  [<ffffffffa022cbbc>] zap_cursor_retrieve+0x5c/0x2f0 [zfs]
Feb 27 16:35:36 fshh1 kernel: [ 1426.632617]  [<ffffffffa01ceea5>] ? dmu_prefetch+0x235/0x2c0 [zfs]
Feb 27 16:35:36 fshh1 kernel: [ 1426.632617]  [<ffffffffa024c9be>] zfs_readdir+0x14e/0x4c0 [zfs]
Feb 27 16:35:36 fshh1 kernel: [ 1426.632617]  [<ffffffff81182ee8>] ? page_add_new_anon_rmap+0xd8/0x170
Feb 27 16:35:36 fshh1 kernel: [ 1426.632617]  [<ffffffff81178a7a>] ? handle_mm_fault+0x88a/0xf10
Feb 27 16:35:36 fshh1 kernel: [ 1426.632617]  [<ffffffff811c8d7b>] ? path_openat+0xbb/0x620
Feb 27 16:35:36 fshh1 kernel: [ 1426.632617]  [<ffffffff8101b763>] ? native_sched_clock+0x13/0x80
Feb 27 16:35:36 fshh1 kernel: [ 1426.632617]  [<ffffffffa0265b1c>] zpl_iterate+0x3c/0x60 [zfs]
Feb 27 16:35:36 fshh1 kernel: [ 1426.632617]  [<ffffffff811cca05>] iterate_dir+0xa5/0xe0
Feb 27 16:35:36 fshh1 kernel: [ 1426.632617]  [<ffffffff8109dd84>] ? vtime_account_user+0x54/0x60
Feb 27 16:35:36 fshh1 kernel: [ 1426.632617]  [<ffffffff811cce62>] SyS_getdents+0x92/0x120
Feb 27 16:35:36 fshh1 kernel: [ 1426.632617]  [<ffffffff811ccb20>] ? fillonedir+0xe0/0xe0
Feb 27 16:35:36 fshh1 kernel: [ 1426.632617]  [<ffffffff8172665c>] ? tracesys+0x7e/0xe6
Feb 27 16:35:36 fshh1 kernel: [ 1426.632617]  [<ffffffff817266bf>] tracesys+0xe1/0xe6
Feb 27 16:35:36 fshh1 kernel: [ 1426.632617] ---[ end trace 11b4512ddf6cbc2a ]---
Feb 27 16:35:36 fshh1 kernel: [ 1482.124004] INFO: rcu_sched detected stalls on CPUs/tasks: { 2} (detected by 7, t=15002 jiffies, g=6236, c=6235, q=0)
Feb 27 16:35:36 fshh1 kernel: [ 1482.124207] sending NMI to all CPUs:
Feb 27 16:35:36 fshh1 kernel: [ 1426.632617] NMI backtrace for cpu 2
Feb 27 16:35:36 fshh1 kernel: [ 1426.632617] CPU: 2 PID: 3260 Comm: rsync Tainted: PF       W  O 3.13.0-24-generic #47-Ubuntu
Feb 27 16:35:36 fshh1 kernel: [ 1426.632617] Hardware name: Supermicro X7DB8/X7DB8, BIOS 2.1 06/23/2008
Feb 27 16:35:36 fshh1 kernel: [ 1426.632617] task: ffff880411c50000 ti: ffff8801f7418000 task.ti: ffff8801f7418000
Feb 27 16:35:36 fshh1 kernel: [ 1426.632617] RIP: 0010:[<ffffffff8171db4a>]  [<ffffffff8171db4a>] _raw_spin_lock_irq+0x3a/0x60
Feb 27 16:35:36 fshh1 kernel: [ 1426.632617] RSP: 0018:ffff8801f7419aa8  EFLAGS: 00000093
Feb 27 16:35:36 fshh1 kernel: [ 1426.632617] RAX: 00000000000005af RBX: ffff880411c50000 RCX: 0000000000000017
Feb 27 16:35:36 fshh1 kernel: [ 1426.632617] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8802bd486488
Feb 27 16:35:36 fshh1 kernel: [ 1426.632617] RBP: ffff8801f7419aa8 R08: 0000000000000202 R09: ffff88042fc972c0
Feb 27 16:35:36 fshh1 kernel: [ 1426.632617] R10: ffffea0008ce6000 R11: 0000000000000000 R12: ffff8802bd486480
Feb 27 16:35:36 fshh1 kernel: [ 1426.632617] R13: ffff8802bd486488 R14: 0000000000000000 R15: ffff8802bd486480
Feb 27 16:35:36 fshh1 kernel: [ 1426.632617] FS:  00007f89ee553740(0000) GS:ffff88042fc80000(0000) knlGS:0000000000000000
Feb 27 16:35:36 fshh1 kernel: [ 1426.632617] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb 27 16:35:36 fshh1 kernel: [ 1426.632617] CR2: 00000000013bbbd8 CR3: 00000002bd55c000 CR4: 00000000000007e0
Feb 27 16:35:36 fshh1 kernel: [ 1426.632617] Stack:
Feb 27 16:35:36 fshh1 kernel: [ 1426.632617]  ffff8801f7419b00 ffffffff8171cc3f ffff8800431a12e8 ffffffff8171c302
Feb 27 16:35:36 fshh1 kernel: [ 1426.632617]  ffff8800431a12e8 ffff880411c50000 ffffffff00000001 000000000000000a
Feb 27 16:35:36 fshh1 kernel: [ 1426.632617]  ffff8802bd486480 0000000000000002 ffff8801f7419d38 ffff8801f7419b60
Feb 27 16:35:36 fshh1 kernel: [ 1426.632617] Call Trace:
Feb 27 16:35:36 fshh1 kernel: [ 1426.632617]  [<ffffffff8171cc3f>] rwsem_down_read_failed+0x3f/0x150
Feb 27 16:35:36 fshh1 kernel: [ 1426.632617]  [<ffffffff8171c302>] ? mutex_lock+0x12/0x2f
Feb 27 16:35:36 fshh1 kernel: [ 1426.632617]  [<ffffffff8136ab34>] call_rwsem_down_read_failed+0x14/0x30
Feb 27 16:35:36 fshh1 kernel: [ 1426.632617]  [<ffffffff8171c540>] ? down_read+0x20/0x30
Feb 27 16:35:36 fshh1 kernel: [ 1426.632617]  [<ffffffffa0227870>] zap_get_leaf_byblk+0xd0/0x2c0 [zfs]
Feb 27 16:35:36 fshh1 kernel: [ 1426.632617]  [<ffffffff8171c302>] ? mutex_lock+0x12/0x2f
Feb 27 16:35:36 fshh1 kernel: [ 1426.632617]  [<ffffffffa0227ac4>] zap_deref_leaf+0x64/0x80 [zfs]
Feb 27 16:35:36 fshh1 kernel: [ 1426.632617]  [<ffffffffa0229697>] fzap_cursor_retrieve+0x107/0x2a0 [zfs]
Feb 27 16:35:36 fshh1 kernel: [ 1426.632617]  [<ffffffffa022cbbc>] zap_cursor_retrieve+0x5c/0x2f0 [zfs]
Feb 27 16:35:36 fshh1 kernel: [ 1426.632617]  [<ffffffffa01ceea5>] ? dmu_prefetch+0x235/0x2c0 [zfs]
Feb 27 16:35:36 fshh1 kernel: [ 1426.632617]  [<ffffffffa024c9be>] zfs_readdir+0x14e/0x4c0 [zfs]
Feb 27 16:35:36 fshh1 kernel: [ 1426.632617]  [<ffffffff81182ee8>] ? page_add_new_anon_rmap+0xd8/0x170
Feb 27 16:35:36 fshh1 kernel: [ 1426.632617]  [<ffffffff81178a7a>] ? handle_mm_fault+0x88a/0xf10
Feb 27 16:35:36 fshh1 kernel: [ 1426.632617]  [<ffffffff811c8d7b>] ? path_openat+0xbb/0x620
Feb 27 16:35:36 fshh1 kernel: [ 1426.632617]  [<ffffffff8101b763>] ? native_sched_clock+0x13/0x80
Feb 27 16:35:36 fshh1 kernel: [ 1426.632617]  [<ffffffffa0265b1c>] zpl_iterate+0x3c/0x60 [zfs]
Feb 27 16:35:36 fshh1 kernel: [ 1426.632617]  [<ffffffff811cca05>] iterate_dir+0xa5/0xe0
Feb 27 16:35:36 fshh1 kernel: [ 1426.632617]  [<ffffffff8109dd84>] ? vtime_account_user+0x54/0x60
Feb 27 16:35:36 fshh1 kernel: [ 1426.632617]  [<ffffffff811cce62>] SyS_getdents+0x92/0x120
Feb 27 16:35:36 fshh1 kernel: [ 1426.632617]  [<ffffffff811ccb20>] ? fillonedir+0xe0/0xe0
Feb 27 16:35:36 fshh1 kernel: [ 1426.632617]  [<ffffffff8172665c>] ? tracesys+0x7e/0xe6
Feb 27 16:35:36 fshh1 kernel: [ 1426.632617]  [<ffffffff817266bf>] tracesys+0xe1/0xe6
Feb 27 16:35:36 fshh1 kernel: [ 1426.632617] Code: 90 b8 00 00 02 00 f0 0f c1 07 89 c2 c1 ea 10 66 39 c2 75 02 5d c3 83 e2 fe 0f b7 f2 b8 00 80 00 00 eb 0d 66 0f 1f 44 00 00 f3 90 <83> e8 01 74 0a 0f b7 0f 66 39 ca 75 f1 5d c3 66 66 66 90 66 66
Feb 27 16:35:36 fshh1 kernel: [ 1481.824003] NMI backtrace for cpu 0
Feb 27 16:35:36 fshh1 kernel: [ 1481.824003] CPU: 0 PID: 0 Comm: swapper/0 Tainted: PF       W  O 3.13.0-24-generic #47-Ubuntu
Feb 27 16:35:36 fshh1 kernel: [ 1481.824003] Hardware name: Supermicro X7DB8/X7DB8, BIOS 2.1 06/23/2008
Feb 27 16:35:36 fshh1 kernel: [ 1481.824003] task: ffffffff81c15480 ti: ffffffff81c00000 task.ti: ffffffff81c00000
Feb 27 16:35:36 fshh1 kernel: [ 1481.824003] RIP: 0010:[<ffffffff8104f596>]  [<ffffffff8104f596>] native_safe_halt+0x6/0x10
Feb 27 16:35:36 fshh1 kernel: [ 1481.824003] RSP: 0018:ffffffff81c01eb8  EFLAGS: 00000286
Feb 27 16:35:36 fshh1 kernel: [ 1481.824003] RAX: 00000000ffffffed RBX: ffffffffffffffff RCX: 0100000000000000
Feb 27 16:35:36 fshh1 kernel: [ 1481.824003] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000046
Feb 27 16:35:36 fshh1 kernel: [ 1481.824003] RBP: ffffffff81c01eb8 R08: 0000000000000000 R09: 0000000000000000
Feb 27 16:35:36 fshh1 kernel: [ 1481.824003] R10: 0000000000000003 R11: ffffffff81eb5de8 R12: 0000000000000000
Feb 27 16:35:36 fshh1 kernel: [ 1481.824003] R13: ffffffff81d12600 R14: ffffffff81de02c0 R15: ffffffff81c01fd8
Feb 27 16:35:36 fshh1 kernel: [ 1481.824003] FS:  0000000000000000(0000) GS:ffff88042fc00000(0000) knlGS:0000000000000000
Feb 27 16:35:36 fshh1 kernel: [ 1481.824003] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Feb 27 16:35:36 fshh1 kernel: [ 1481.824003] CR2: 00000000006f8df4 CR3: 00000003fff4b000 CR4: 00000000000007f0
Feb 27 16:35:36 fshh1 kernel: [ 1481.824003] Stack:
Feb 27 16:35:36 fshh1 kernel: [ 1481.824003]  ffffffff81c01ed8 ffffffff8101c5ff ffffffffffffffff ffffffff81c01fd8
Feb 27 16:35:36 fshh1 kernel: [ 1481.824003]  ffffffff81c01ee8 ffffffff8101cec6 ffffffff81c01f38 ffffffff810beb85
Feb 27 16:35:36 fshh1 kernel: [ 1481.824003]  ffffffff81c01fd8 ffffffff81c01fd8 04e4d76b17907187 ffffffffffffffff
Feb 27 16:35:36 fshh1 kernel: [ 1481.824003] Call Trace:
Feb 27 16:35:36 fshh1 kernel: [ 1481.824003]  [<ffffffff8101c5ff>] default_idle+0x1f/0xc0
Feb 27 16:35:36 fshh1 kernel: [ 1481.824003]  [<ffffffff8101cec6>] arch_cpu_idle+0x26/0x30
Feb 27 16:35:36 fshh1 kernel: [ 1481.824003]  [<ffffffff810beb85>] cpu_startup_entry+0xc5/0x290
Feb 27 16:35:36 fshh1 kernel: [ 1481.824003]  [<ffffffff81703f97>] rest_init+0x77/0x80
Feb 27 16:35:36 fshh1 kernel: [ 1481.824003]  [<ffffffff81d34f70>] start_kernel+0x438/0x443
Feb 27 16:35:36 fshh1 kernel: [ 1481.824003]  [<ffffffff81d34941>] ? repair_env_string+0x5c/0x5c
Feb 27 16:35:36 fshh1 kernel: [ 1481.824003]  [<ffffffff81d34120>] ? early_idt_handlers+0x120/0x120
Feb 27 16:35:36 fshh1 kernel: [ 1481.824003]  [<ffffffff81d345ee>] x86_64_start_reservations+0x2a/0x2c
Feb 27 16:35:36 fshh1 kernel: [ 1481.824003]  [<ffffffff81d34733>] x86_64_start_kernel+0x143/0x152
Feb 27 16:35:36 fshh1 kernel: [ 1481.824003] Code: 00 00 00 00 00 55 48 89 e5 fa 5d c3 66 0f 1f 84 00 00 00 00 00 55 48 89 e5 fb 5d c3 66 0f 1f 84 00 00 00 00 00 55 48 89 e5 fb f4 <5d> c3 0f 1f 84 00 00 00 00 00 55 48 89 e5 f4 5d c3 66 0f 1f 84
Feb 27 16:35:36 fshh1 kernel: [ 1481.900008] NMI backtrace for cpu 3
Feb 27 16:35:36 fshh1 kernel: [ 1481.900008] CPU: 3 PID: 0 Comm: swapper/3 Tainted: PF       W  O 3.13.0-24-generic #47-Ubuntu
Feb 27 16:35:36 fshh1 kernel: [ 1481.900008] Hardware name: Supermicro X7DB8/X7DB8, BIOS 2.1 06/23/2008
Feb 27 16:35:36 fshh1 kernel: [ 1481.900008] task: ffff8804196247d0 ti: ffff88041962e000 task.ti: ffff88041962e000
Feb 27 16:35:36 fshh1 kernel: [ 1481.900008] RIP: 0010:[<ffffffff8104f596>]  [<ffffffff8104f596>] native_safe_halt+0x6/0x10
Feb 27 16:35:36 fshh1 kernel: [ 1481.900008] RSP: 0018:ffff88041962fea8  EFLAGS: 00000286
Feb 27 16:35:36 fshh1 kernel: [ 1481.900008] RAX: 00000000ffffffed RBX: ffff88041962ff38 RCX: 0100000000000000
Feb 27 16:35:36 fshh1 kernel: [ 1481.900008] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000046
Feb 27 16:35:36 fshh1 kernel: [ 1481.900008] RBP: ffff88041962fea8 R08: 0000000000000000 R09: 0000000000000000
Feb 27 16:35:36 fshh1 kernel: [ 1481.900008] R10: 0000000000000003 R11: ffff880419755028 R12: 0000000000000003
Feb 27 16:35:36 fshh1 kernel: [ 1481.900008] R13: ffffffff81d12600 R14: 0000000000000000 R15: ffff88041962ffd8
Feb 27 16:35:36 fshh1 kernel: [ 1481.900008] FS:  0000000000000000(0000) GS:ffff88042fcc0000(0000) knlGS:0000000000000000
Feb 27 16:35:36 fshh1 kernel: [ 1481.900008] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Feb 27 16:35:36 fshh1 kernel: [ 1481.900008] CR2: 00007fc03ce7a000 CR3: 00000000c6853000 CR4: 00000000000007e0
Feb 27 16:35:36 fshh1 kernel: [ 1481.900008] Stack:
Feb 27 16:35:36 fshh1 kernel: [ 1481.900008]  ffff88041962fec8 ffffffff8101c5ff ffff88041962ff38 ffff88041962ffd8
Feb 27 16:35:36 fshh1 kernel: [ 1481.900008]  ffff88041962fed8 ffffffff8101cec6 ffff88041962ff28 ffffffff810beb85
Feb 27 16:35:36 fshh1 kernel: [ 1481.900008]  ffff88041962ffd8 ffff88041962ffd8 f38eb6193b97eba0 ffff88041962ff38
Feb 27 16:35:36 fshh1 kernel: [ 1481.900008] Call Trace:
Feb 27 16:35:36 fshh1 kernel: [ 1481.900008]  [<ffffffff8101c5ff>] default_idle+0x1f/0xc0
Feb 27 16:35:36 fshh1 kernel: [ 1481.900008]  [<ffffffff8101cec6>] arch_cpu_idle+0x26/0x30
Feb 27 16:35:36 fshh1 kernel: [ 1481.900008]  [<ffffffff810beb85>] cpu_startup_entry+0xc5/0x290
Feb 27 16:35:36 fshh1 kernel: [ 1481.900008]  [<ffffffff81040fc8>] start_secondary+0x218/0x2c0
Feb 27 16:35:36 fshh1 kernel: [ 1481.900008] Code: 00 00 00 00 00 55 48 89 e5 fa 5d c3 66 0f 1f 84 00 00 00 00 00 55 48 89 e5 fb 5d c3 66 0f 1f 84 00 00 00 00 00 55 48 89 e5 fb f4 <5d> c3 0f 1f 84 00 00 00 00 00 55 48 89 e5 f4 5d c3 66 0f 1f 84
Feb 27 16:35:36 fshh1 kernel: [ 1481.872164] NMI backtrace for cpu 1
Feb 27 16:35:36 fshh1 kernel: [ 1481.872164] CPU: 1 PID: 0 Comm: swapper/1 Tainted: PF       W  O 3.13.0-24-generic #47-Ubuntu
Feb 27 16:35:36 fshh1 kernel: [ 1481.872164] Hardware name: Supermicro X7DB8/X7DB8, BIOS 2.1 06/23/2008
Feb 27 16:35:36 fshh1 kernel: [ 1481.872164] task: ffff8804196217f0 ti: ffff88041962a000 task.ti: ffff88041962a000
Feb 27 16:35:36 fshh1 kernel: [ 1481.872164] RIP: 0010:[<ffffffff8104f596>]  [<ffffffff8104f596>] native_safe_halt+0x6/0x10
Feb 27 16:35:36 fshh1 kernel: [ 1481.872164] RSP: 0018:ffff88041962bea8  EFLAGS: 00000286
Feb 27 16:35:36 fshh1 kernel: [ 1481.872164] RAX: 00000000ffffffed RBX: ffff88041962bf38 RCX: 0100000000000000
Feb 27 16:35:36 fshh1 kernel: [ 1481.872164] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000046
Feb 27 16:35:36 fshh1 kernel: [ 1481.872164] RBP: ffff88041962bea8 R08: 0000000000000000 R09: 0000000000000000
Feb 27 16:35:36 fshh1 kernel: [ 1481.872164] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000001
Feb 27 16:35:36 fshh1 kernel: [ 1481.872164] R13: ffffffff81d12600 R14: 0000000000000000 R15: ffff88041962bfd8
Feb 27 16:35:36 fshh1 kernel: [ 1481.872164] FS:  0000000000000000(0000) GS:ffff88042fc40000(0000) knlGS:0000000000000000
Feb 27 16:35:36 fshh1 kernel: [ 1481.872164] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Feb 27 16:35:36 fshh1 kernel: [ 1481.872164] CR2: 00007f89eda65a10 CR3: 00000003fedae000 CR4: 00000000000007e0
Feb 27 16:35:36 fshh1 kernel: [ 1481.872164] Stack:
Feb 27 16:35:36 fshh1 kernel: [ 1481.872164]  ffff88041962bec8 ffffffff8101c5ff ffff88041962bf38 ffff88041962bfd8
Feb 27 16:35:36 fshh1 kernel: [ 1481.872164]  ffff88041962bed8 ffffffff8101cec6 ffff88041962bf28 ffffffff810beb85
Feb 27 16:35:36 fshh1 kernel: [ 1481.872164]  ffff88041962bfd8 ffff88041962bfd8 3d43f8609d377307 ffff88041962bf38
Feb 27 16:35:36 fshh1 kernel: [ 1481.872164] Call Trace:
Feb 27 16:35:36 fshh1 kernel: [ 1481.872164]  [<ffffffff8101c5ff>] default_idle+0x1f/0xc0
Feb 27 16:35:36 fshh1 kernel: [ 1481.872164]  [<ffffffff8101cec6>] arch_cpu_idle+0x26/0x30
Feb 27 16:35:36 fshh1 kernel: [ 1481.872164]  [<ffffffff810beb85>] cpu_startup_entry+0xc5/0x290
Feb 27 16:35:36 fshh1 kernel: [ 1481.872164]  [<ffffffff81040fc8>] start_secondary+0x218/0x2c0
Feb 27 16:35:36 fshh1 kernel: [ 1481.872164] Code: 00 00 00 00 00 55 48 89 e5 fa 5d c3 66 0f 1f 84 00 00 00 00 00 55 48 89 e5 fb 5d c3 66 0f 1f 84 00 00 00 00 00 55 48 89 e5 fb f4 <5d> c3 0f 1f 84 00 00 00 00 00 55 48 89 e5 f4 5d c3 66 0f 1f 84
Feb 27 16:35:36 fshh1 kernel: [ 1482.125093] NMI backtrace for cpu 7
Feb 27 16:35:36 fshh1 kernel: [ 1482.125098] CPU: 7 PID: 0 Comm: swapper/7 Tainted: PF       W  O 3.13.0-24-generic #47-Ubuntu
Feb 27 16:35:36 fshh1 kernel: [ 1482.125100] Hardware name: Supermicro X7DB8/X7DB8, BIOS 2.1 06/23/2008
Feb 27 16:35:36 fshh1 kernel: [ 1482.125103] task: ffff88041963afe0 ti: ffff880419636000 task.ti: ffff880419636000
Feb 27 16:35:36 fshh1 kernel: [ 1482.125113] RIP: 0010:[<ffffffff813696b1>]  [<ffffffff813696b1>] __const_udelay+0x1/0x30
Feb 27 16:35:36 fshh1 kernel: [ 1482.125134] RSP: 0018:ffff88042fdc3df0  EFLAGS: 00000046
Feb 27 16:35:36 fshh1 kernel: [ 1482.125143] RAX: 0000000000000000 RBX: 0000000000002710 RCX: 0000000000000004
Feb 27 16:35:36 fshh1 kernel: [ 1482.125153] RDX: 0000000000000c00 RSI: 0000000000000100 RDI: 0000000000418958
Feb 27 16:35:36 fshh1 kernel: [ 1482.125162] RBP: ffff88042fdc3e08 R08: 0000000000000082 R09: 00000000000005cb
Feb 27 16:35:36 fshh1 kernel: [ 1482.125171] R10: 0000000000000000 R11: ffff88042fdc3b2e R12: ffffffff81c4e140
Feb 27 16:35:36 fshh1 kernel: [ 1482.125180] R13: ffffffff81d12620 R14: ffffffff81c4e140 R15: 0000000000000007
Feb 27 16:35:36 fshh1 kernel: [ 1482.125190] FS:  0000000000000000(0000) GS:ffff88042fdc0000(0000) knlGS:0000000000000000
Feb 27 16:35:36 fshh1 kernel: [ 1482.125199] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Feb 27 16:35:36 fshh1 kernel: [ 1482.125209] CR2: 00007fffcfcc4f70 CR3: 00000003fe41b000 CR4: 00000000000007e0
Feb 27 16:35:36 fshh1 kernel: [ 1482.125217] Stack:
Feb 27 16:35:36 fshh1 kernel: [ 1482.125226]  ffff88042fdc3e08 ffffffff81044c6f ffff88042fdce800 ffff88042fdc3e60
Feb 27 16:35:36 fshh1 kernel: [ 1482.125259]  ffffffff810cab21 ffffffff81c4e140 ffffffff00000002 0000000000000000
Feb 27 16:35:36 fshh1 kernel: [ 1482.125264]  0000000000000001 ffff88041963afe0 0000000000000000 0000000000000007
Feb 27 16:35:36 fshh1 kernel: [ 1482.125268] Call Trace:
Feb 27 16:35:36 fshh1 kernel: [ 1482.125270]  <IRQ>
Feb 27 16:35:36 fshh1 kernel: [ 1482.125272]  [<ffffffff81044c6f>] ? arch_trigger_all_cpu_backtrace+0x8f/0xb0
Feb 27 16:35:36 fshh1 kernel: [ 1482.125282]  [<ffffffff810cab21>] rcu_check_callbacks+0x631/0x650
Feb 27 16:35:36 fshh1 kernel: [ 1482.125287]  [<ffffffff81076217>] update_process_times+0x47/0x70
Feb 27 16:35:36 fshh1 kernel: [ 1482.125292]  [<ffffffff810d5be5>] tick_sched_handle.isra.17+0x25/0x60
Feb 27 16:35:36 fshh1 kernel: [ 1482.125295]  [<ffffffff810d5c61>] tick_sched_timer+0x41/0x60
Feb 27 16:35:36 fshh1 kernel: [ 1482.125301]  [<ffffffff8108e537>] __run_hrtimer+0x77/0x1d0
Feb 27 16:35:36 fshh1 kernel: [ 1482.125304]  [<ffffffff810d5c20>] ? tick_sched_handle.isra.17+0x60/0x60
Feb 27 16:35:36 fshh1 kernel: [ 1482.125307]  [<ffffffff8108ed3f>] hrtimer_interrupt+0xef/0x230
Feb 27 16:35:36 fshh1 kernel: [ 1482.125313]  [<ffffffff81043087>] local_apic_timer_interrupt+0x37/0x60
Feb 27 16:35:36 fshh1 kernel: [ 1482.125318]  [<ffffffff8172887f>] smp_apic_timer_interrupt+0x3f/0x60
Feb 27 16:35:36 fshh1 kernel: [ 1482.125322]  [<ffffffff8172721d>] apic_timer_interrupt+0x6d/0x80
Feb 27 16:35:36 fshh1 kernel: [ 1482.125324]  <EOI>
Feb 27 16:35:36 fshh1 kernel: [ 1482.125326]  [<ffffffff8104f596>] ? native_safe_halt+0x6/0x10
Feb 27 16:35:36 fshh1 kernel: [ 1482.125334]  [<ffffffff8101c5ff>] default_idle+0x1f/0xc0
Feb 27 16:35:36 fshh1 kernel: [ 1482.125338]  [<ffffffff8101cec6>] arch_cpu_idle+0x26/0x30
Feb 27 16:35:36 fshh1 kernel: [ 1482.125342]  [<ffffffff810beb85>] cpu_startup_entry+0xc5/0x290
Feb 27 16:35:36 fshh1 kernel: [ 1482.125346]  [<ffffffff81040fc8>] start_secondary+0x218/0x2c0
Feb 27 16:35:36 fshh1 kernel: [ 1482.125348] Code: ff c8 5d c3 66 0f 1f 44 00 00 66 66 66 66 90 55 48 89 e5 ff 15 59 41 92 00 5d c3 66 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 <48> 8d 04 bd 00 00 00 00 65 48 8b 14 25 60 3c 01 00 48 8d 0c 12
Feb 27 16:35:36 fshh1 kernel: [ 1482.125422] INFO: NMI handler (arch_trigger_all_cpu_backtrace_handler) took too long to run: 1.201 msecs
Feb 27 16:35:36 fshh1 kernel: [ 1482.040016] NMI backtrace for cpu 6
Feb 27 16:35:36 fshh1 kernel: [ 1482.040016] CPU: 6 PID: 0 Comm: swapper/6 Tainted: PF       W  O 3.13.0-24-generic #47-Ubuntu
Feb 27 16:35:36 fshh1 kernel: [ 1482.040016] Hardware name: Supermicro X7DB8/X7DB8, BIOS 2.1 06/23/2008
Feb 27 16:35:36 fshh1 kernel: [ 1482.040016] task: ffff8804196397f0 ti: ffff880419634000 task.ti: ffff880419634000
Feb 27 16:35:36 fshh1 kernel: [ 1482.040016] RIP: 0010:[<ffffffff8104f596>]  [<ffffffff8104f596>] native_safe_halt+0x6/0x10
Feb 27 16:35:36 fshh1 kernel: [ 1482.040016] RSP: 0018:ffff880419635ea8  EFLAGS: 00000286
Feb 27 16:35:36 fshh1 kernel: [ 1482.040016] RAX: 00000000ffffffed RBX: ffff880419635f38 RCX: 0100000000000000
Feb 27 16:35:36 fshh1 kernel: [ 1482.040016] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000046
Feb 27 16:35:36 fshh1 kernel: [ 1482.040016] RBP: ffff880419635ea8 R08: 0000000000000000 R09: 0000000000000000
Feb 27 16:35:36 fshh1 kernel: [ 1482.040016] R10: 0000000000000003 R11: ffff8804197ed028 R12: 0000000000000006
Feb 27 16:35:36 fshh1 kernel: [ 1482.040016] R13: ffffffff81d12600 R14: 0000000000000000 R15: ffff880419635fd8
Feb 27 16:35:36 fshh1 kernel: [ 1482.040016] FS:  0000000000000000(0000) GS:ffff88042fd80000(0000) knlGS:0000000000000000
Feb 27 16:35:36 fshh1 kernel: [ 1482.040016] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Feb 27 16:35:36 fshh1 kernel: [ 1482.040016] CR2: 00007f846e1f93da CR3: 00000000359d9000 CR4: 00000000000007e0
Feb 27 16:35:36 fshh1 kernel: [ 1482.040016] Stack:
Feb 27 16:35:36 fshh1 kernel: [ 1482.040016]  ffff880419635ec8 ffffffff8101c5ff ffff880419635f38 ffff880419635fd8
Feb 27 16:35:36 fshh1 kernel: [ 1482.040016]  ffff880419635ed8 ffffffff8101cec6 ffff880419635f28 ffffffff810beb85
Feb 27 16:35:36 fshh1 kernel: [ 1482.040016]  ffff880419635fd8 ffff880419635fd8 75b35ab2c17456c3 ffff880419635f38
Feb 27 16:35:36 fshh1 kernel: [ 1482.040016] Call Trace:
Feb 27 16:35:36 fshh1 kernel: [ 1482.040016]  [<ffffffff8101c5ff>] default_idle+0x1f/0xc0
Feb 27 16:35:36 fshh1 kernel: [ 1482.040016]  [<ffffffff8101cec6>] arch_cpu_idle+0x26/0x30
Feb 27 16:35:36 fshh1 kernel: [ 1482.040016]  [<ffffffff810beb85>] cpu_startup_entry+0xc5/0x290
Feb 27 16:35:36 fshh1 kernel: [ 1482.040016]  [<ffffffff81040fc8>] start_secondary+0x218/0x2c0
Feb 27 16:35:36 fshh1 kernel: [ 1482.040016] Code: 00 00 00 00 00 55 48 89 e5 fa 5d c3 66 0f 1f 84 00 00 00 00 00 55 48 89 e5 fb 5d c3 66 0f 1f 84 00 00 00 00 00 55 48 89 e5 fb f4 <5d> c3 0f 1f 84 00 00 00 00 00 55 48 89 e5 f4 5d c3 66 0f 1f 84
Feb 27 16:35:36 fshh1 kernel: [ 1482.063435] NMI backtrace for cpu 5
Feb 27 16:35:36 fshh1 kernel: [ 1482.040016] INFO: NMI handler (arch_trigger_all_cpu_backtrace_handler) took too long to run: 1.386 msecs
Feb 27 16:35:36 fshh1 kernel: [ 1482.063435] CPU: 5 PID: 0 Comm: swapper/5 Tainted: PF       W  O 3.13.0-24-generic #47-Ubuntu
Feb 27 16:35:36 fshh1 kernel: [ 1482.063435] Hardware name: Supermicro X7DB8/X7DB8, BIOS 2.1 06/23/2008
Feb 27 16:35:36 fshh1 kernel: [ 1482.063435] task: ffff880419638000 ti: ffff880419632000 task.ti: ffff880419632000
Feb 27 16:35:36 fshh1 kernel: [ 1482.063435] RIP: 0010:[<ffffffff8104f596>]  [<ffffffff8104f596>] native_safe_halt+0x6/0x10
Feb 27 16:35:36 fshh1 kernel: [ 1482.063435] RSP: 0018:ffff880419633ea8  EFLAGS: 00000286
Feb 27 16:35:36 fshh1 kernel: [ 1482.063435] RAX: 00000000ffffffed RBX: ffff880419633f38 RCX: 0100000000000000
Feb 27 16:35:36 fshh1 kernel: [ 1482.063435] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000046
Feb 27 16:35:36 fshh1 kernel: [ 1482.063435] RBP: ffff880419633ea8 R08: 0000000000000000 R09: 0000000000000000
Feb 27 16:35:36 fshh1 kernel: [ 1482.063435] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000005
Feb 27 16:35:36 fshh1 kernel: [ 1482.063435] R13: ffffffff81d12600 R14: 0000000000000000 R15: ffff880419633fd8
Feb 27 16:35:36 fshh1 kernel: [ 1482.063435] FS:  0000000000000000(0000) GS:ffff88042fd40000(0000) knlGS:0000000000000000
Feb 27 16:35:36 fshh1 kernel: [ 1482.063435] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Feb 27 16:35:36 fshh1 kernel: [ 1482.063435] CR2: 00007fe57f81fd88 CR3: 00000003fff4b000 CR4: 00000000000007e0
Feb 27 16:35:36 fshh1 kernel: [ 1482.063435] Stack:
Feb 27 16:35:36 fshh1 kernel: [ 1482.063435]  ffff880419633ec8 ffffffff8101c5ff ffff880419633f38 ffff880419633fd8
Feb 27 16:35:36 fshh1 kernel: [ 1482.063435]  ffff880419633ed8 ffffffff8101cec6 ffff880419633f28 ffffffff810beb85
Feb 27 16:35:36 fshh1 kernel: [ 1482.063435]  ffff880419633fd8 ffff880419633fd8 6ef73db823a0efd4 ffff880419633f38
Feb 27 16:35:36 fshh1 kernel: [ 1482.063435] Call Trace:
Feb 27 16:35:36 fshh1 kernel: [ 1482.063435]  [<ffffffff8101c5ff>] default_idle+0x1f/0xc0
Feb 27 16:35:36 fshh1 kernel: [ 1482.063435]  [<ffffffff8101cec6>] arch_cpu_idle+0x26/0x30
Feb 27 16:35:36 fshh1 kernel: [ 1482.063435]  [<ffffffff810beb85>] cpu_startup_entry+0xc5/0x290
Feb 27 16:35:36 fshh1 kernel: [ 1482.063435]  [<ffffffff81040fc8>] start_secondary+0x218/0x2c0
Feb 27 16:35:36 fshh1 kernel: [ 1482.063435] Code: 00 00 00 00 00 55 48 89 e5 fa 5d c3 66 0f 1f 84 00 00 00 00 00 55 48 89 e5 fb 5d c3 66 0f 1f 84 00 00 00 00 00 55 48 89 e5 fb f4 <5d> c3 0f 1f 84 00 00 00 00 00 55 48 89 e5 f4 5d c3 66 0f 1f 84
Feb 27 16:35:36 fshh1 kernel: [ 1482.040028] NMI backtrace for cpu 4
Feb 27 16:35:36 fshh1 kernel: [ 1482.063435] INFO: NMI handler (arch_trigger_all_cpu_backtrace_handler) took too long to run: 1.542 msecs
Feb 27 16:35:36 fshh1 kernel: [ 1482.040028] CPU: 4 PID: 0 Comm: swapper/4 Tainted: PF       W  O 3.13.0-24-generic #47-Ubuntu
Feb 27 16:35:36 fshh1 kernel: [ 1482.040028] Hardware name: Supermicro X7DB8/X7DB8, BIOS 2.1 06/23/2008
Feb 27 16:35:36 fshh1 kernel: [ 1482.040028] task: ffff880419625fc0 ti: ffff880419630000 task.ti: ffff880419630000
Feb 27 16:35:36 fshh1 kernel: [ 1482.040028] RIP: 0010:[<ffffffff8104f596>]  [<ffffffff8104f596>] native_safe_halt+0x6/0x10
Feb 27 16:35:36 fshh1 kernel: [ 1482.040028] RSP: 0018:ffff880419631ea8  EFLAGS: 00000286
Feb 27 16:35:36 fshh1 kernel: [ 1482.040028] RAX: 00000000ffffffed RBX: ffff880419631f38 RCX: 0100000000000000
Feb 27 16:35:36 fshh1 kernel: [ 1482.040028] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000046
Feb 27 16:35:36 fshh1 kernel: [ 1482.040028] RBP: ffff880419631ea8 R08: 0000000000000000 R09: 0000000000000000
Feb 27 16:35:36 fshh1 kernel: [ 1482.040028] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000004
Feb 27 16:35:36 fshh1 kernel: [ 1482.040028] R13: ffffffff81d12600 R14: 0000000000000000 R15: ffff880419631fd8
Feb 27 16:35:36 fshh1 kernel: [ 1482.040028] FS:  0000000000000000(0000) GS:ffff88042fd00000(0000) knlGS:0000000000000000
Feb 27 16:35:36 fshh1 kernel: [ 1482.040028] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Feb 27 16:35:36 fshh1 kernel: [ 1482.040028] CR2: 00007fc41f34a000 CR3: 00000000359d9000 CR4: 00000000000007e0
Feb 27 16:35:36 fshh1 kernel: [ 1482.040028] Stack:
Feb 27 16:35:36 fshh1 kernel: [ 1482.040028]  ffff880419631ec8 ffffffff8101c5ff ffff880419631f38 ffff880419631fd8
Feb 27 16:35:36 fshh1 kernel: [ 1482.040028]  ffff880419631ed8 ffffffff8101cec6 ffff880419631f28 ffffffff810beb85
Feb 27 16:35:36 fshh1 kernel: [ 1482.040028]  ffff880419631fd8 ffff880419631fd8 9ea30f4af7ac56de ffff880419631f38
Feb 27 16:35:36 fshh1 kernel: [ 1482.040028] Call Trace:
Feb 27 16:35:36 fshh1 kernel: [ 1482.040028]  [<ffffffff8101c5ff>] default_idle+0x1f/0xc0
Feb 27 16:35:36 fshh1 kernel: [ 1482.040028]  [<ffffffff8101cec6>] arch_cpu_idle+0x26/0x30
Feb 27 16:35:36 fshh1 kernel: [ 1482.040028]  [<ffffffff810beb85>] cpu_startup_entry+0xc5/0x290
Feb 27 16:35:36 fshh1 kernel: [ 1482.040028]  [<ffffffff81040fc8>] start_secondary+0x218/0x2c0
Feb 27 16:35:36 fshh1 kernel: [ 1482.040028] Code: 00 00 00 00 00 55 48 89 e5 fa 5d c3 66 0f 1f 84 00 00 00 00 00 55 48 89 e5 fb 5d c3 66 0f 1f 84 00 00 00 00 00 55 48 89 e5 fb f4 <5d> c3 0f 1f 84 00 00 00 00 00 55 48 89 e5 f4 5d c3 66 0f 1f 84
Feb 27 16:35:36 fshh1 kernel: [ 1482.040028] INFO: NMI handler (arch_trigger_all_cpu_backtrace_handler) took too long to run: 1.668 msecs


@behlendorf
Copy link
Contributor

@tobias-k which server is crashing A or B? Are you able to cause this problem when using two newly created pools? What version of ZoL are you using?

@behlendorf behlendorf added this to the 0.6.5 milestone Mar 2, 2015
@prometheanfire
Copy link
Contributor

I do copies with zpool send and that works fine (xattr=sa), perhaps the extra load of the rsync is the real cause?

@tuxoko
Copy link
Contributor

tuxoko commented Mar 3, 2015

Thought I haven't look into the code, this could be the rwsem version of mutex_unlock/free race.
@behlendorf Are you aware of any rwsem usage in ZFS code that could suffer from this race?

@tobias-k
Copy link
Author

tobias-k commented Mar 3, 2015

@behlendorf the destination server B crashed. And yes, same behavior on new created pools. v0.6.3-5~trusty from ubuntu ppa repo.
@prometheanfire confimed: zpool send works.It not might be the extra load of rsync. the source is only 1GBit. So load > 2 on 8 core mashine

@behlendorf
Copy link
Contributor

@tuxoko now that you mention it, it sure does look similar to the mutex issue. The code must be spinning on the sem->wait_lock in rwsem_down_read_failed() with interrupts disabled. That's the only way the NMI watchdog being triggered makes sense. It's definitely possible the ZAP leaf rwsews suffer from a similar issue, but it's harder to tell from a quick look since the code is more complicated. Also in this case it looks like we're stuck trying to acquire the lock and not release it.

@tuxoko
Copy link
Contributor

tuxoko commented Mar 3, 2015

@behlendorf Yeah, I also did the quick look. Both ZAP and ZAP leaf are free by dbuf evict path, and I haven't found anything suspicious. But nonetheless, there should be some sort of memory corruption on the spinlock, because every other CPU is in idle state, so there's no way they can hold a spinlock.

@behlendorf
Copy link
Contributor

One possible cause is that we're somehow accidentally damaging the leaves in memory. There have been a very small number of bugs reported (#1445, #2861) which show that somehow a trashed leaf was written to disk. If it did somehow get overwritten in memory that is one way the spinlock could be damaged. It's in part what prompted me to open #3138 to make checking for this even more rigorous. However, because the reports have been rare we've never been able to identify a root cause.

@tuxoko
Copy link
Contributor

tuxoko commented Mar 3, 2015

I think you're right. #2861 has very similar lockup. And it consistantly occurs on only a few machine. It's quite possible caused by damaged pool.

@dweeezil
Copy link
Contributor

dweeezil commented Mar 3, 2015

I decided to take a peek at recent changes to the rwsem code in the kernel. I was initially made curious by the realization that openzfs/spl@46aa7b3 and its successors pretty much always are enabled these days (in the mainly x86_64 world) because torvalds/linux@29671f2 to which it implicitly refers (first appearing in 2.6.33) only modified the generic rwsem code and didn't do anything for the machine-specific version which is always used by x86_64.

In any case, I wanted to raise the possiblity that torvalds/linux@c8de2fa might be an interesting inflection point (first landing in 3.10) because it pulled in a whole lot of patches to the rwsem code. I've not reviewed any of it in detail yet but a cursory overview makes me wonder if something happened there that may have an impact (positive or negative) on ZoL.

@tomposmiko
Copy link

@dweeezil
Do you have update on this?
I have recently faced this issue and need to decide, which way I should follow.

@nedbass
Copy link
Contributor

nedbass commented Mar 25, 2015

@tobias-k @tomposmiko If possible, please try #3223 to see if it helps with this issue. Thanks

@tomposmiko
Copy link

@nedbass
I assume it should be done on master.
Why is it targeted to 0.6.5?

@tomposmiko
Copy link

@nedbass
Anyway, I applyied it to the sourcetree from Ubuntu ppa and there is no change.
When I access the directory (ie. /bin/ls), it never returns.

@nedbass
Copy link
Contributor

nedbass commented Mar 25, 2015

@tomposmiko thanks for trying the patch. You must be running into a different issue. It would be helpful if you collect any stack traces that appear in dmesg and open a new issue. The milestones are a best guess. This issue was probably pushed out to 0.6.5 because there was no fix forthcoming and it wasn't deemed critical enough to hold up a tag. If it does get fixed sooner it will most likely make it into 0.6.4.

@tomposmiko
Copy link

@nedbass
There is already a bugreport #3217 .
Though it has been closed, because it was suspected as duplicate of this one.
How can I reopen it?

@nedbass
Copy link
Contributor

nedbass commented Mar 25, 2015

@tomposmiko I see, that does look like the same issue. I suspect the directory is corrupt on disk in your case, so my patch wouldn't help with that, but it may prevent the corruption from happening in the first place.

@tomposmiko
Copy link

tomposmiko commented Mar 25, 2015 via email

@nedbass
Copy link
Contributor

nedbass commented Mar 25, 2015

@tomposmiko Unfortunately your situation is not a useful test for the patch. The patch should prevent the ZAP object from getting damaged in memory before it is written out to disk, but in your case it seems to already be damaged on disk. I'm hoping it helps @tobias-k since the lockup occurs every time his rsync job runs, so it must be getting damaged in memory.

@tobias-k
Copy link
Author

@nedbass actually I'm not able to do some futher testing. The system is hot. Next week I can creat a lab enviroment and test the patch.

@behlendorf
Copy link
Contributor

@tobias-k @tomposmiko is this still an outstanding issue in either the 0.6.4.2 or master branch?

@behlendorf behlendorf modified the milestones: 0.7.0, 0.6.5 Jul 16, 2015
@tomposmiko
Copy link

I disabled xattr and acltype properties after a while.
Would it be safe to enable it again?

@behlendorf
Copy link
Contributor

Yes if you're running 0.6.4 or newer. All the known SA issues have been addressed and we haven't seen any new ones.

@behlendorf
Copy link
Contributor

Based on the above comments it sounds as if all the SA xattr fixes and resolved this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants