spl_panic when receiving encrypted dataset #6821

Closed
sjau opened this issue Nov 4, 2017 · 180 comments
sjau commented Nov 4, 2017

System information

Distribution Name: NixOS
Distribution Version: NixOS Unstable Small
Linux Kernel: 4.9.58
Architecture: x86_64
ZFS Version: 0.7.0-1
SPL Version: 0.7.0-1

Describe the problem you're observing

I'm trying to back up encrypted datasets from my notebook to my home server. Both run the same NixOS version. However, it doesn't work.

When I use the -wR options to send the full dataset, an spl_panic appears on the receiving end and the receive never finishes.

If I omit the -R option, the full dataset send works. However, when I then try to send an incremental set, the same thing happens again - spl_panic.

Describe how to reproduce the problem

On the notebook I have:
tank/encZFS/Nixos -> encZFS was created this way:
zfs create -o encryption=aes-256-gcm -o keyformat=passphrase -o mountpoint=none -o atime=off ${zfsPool}/encZFS

On the server I have:
serviTank/BU/subi -> None of those is encrypted

I then took a snapshot and tried to send like this:

zfs send -wR tank/encZFS/Nixos@encZFSSend_2017-11-04_12-31 | ssh root@10.0.0.3 'zfs receive serviTank/BU/subi/Nixos'

It seems all was transferred correctly:

zfs list
NAME                       USED  AVAIL  REFER  MOUNTPOINT
serviTank                  137G  3.38T    96K  /serviTank
serviTank/BU              95.8G  3.38T    96K  none
serviTank/BU/subi         95.8G  3.38T    96K  none
serviTank/BU/subi/Nixos   95.8G  3.38T  95.8G  legacy
serviTank/encZFS          40.8G  3.38T  1.39M  none
serviTank/encZFS/BU       2.78M  3.38T  1.39M  none
serviTank/encZFS/BU/subi  1.39M  3.38T  1.39M  none
serviTank/encZFS/Nixos    40.8G  3.38T  5.64G  legacy

However, the zfs send/receive command never "finishes", and on the server side dmesg shows an spl_panic:

[ 1556.014734] VERIFY3(0 == dmu_object_dirty_raw(os, object, tx)) failed (0 == 17)
[ 1556.014757] PANIC at dmu.c:937:dmu_free_long_object_impl()
[ 1556.014770] Showing stack for process 18808
[ 1556.014772] CPU: 5 PID: 18808 Comm: receive_writer Tainted: P           O    4.9.58 #1-NixOS
[ 1556.014772] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z97 Pro4, BIOS P2.50 05/27/2016
[ 1556.014773]  ffffaf4e11137b38 ffffffff942f7a12 ffffffffc02c4e03 00000000000003a9
[ 1556.014775]  ffffaf4e11137b48 ffffffffc00759f2 ffffaf4e11137cd0 ffffffffc0075ac5
[ 1556.014776]  0000000000000000 ffffaf4e00000030 ffffaf4e11137ce0 ffffaf4e11137c80
[ 1556.014777] Call Trace:
[ 1556.014781]  [<ffffffff942f7a12>] dump_stack+0x63/0x81
[ 1556.014785]  [<ffffffffc00759f2>] spl_dumpstack+0x42/0x50 [spl]
[ 1556.014787]  [<ffffffffc0075ac5>] spl_panic+0xc5/0x100 [spl]
[ 1556.014806]  [<ffffffffc016ec26>] ? dbuf_rele+0x36/0x40 [zfs]
[ 1556.014816]  [<ffffffffc0190107>] ? dnode_hold_impl+0xb57/0xc40 [zfs]
[ 1556.014825]  [<ffffffffc0190443>] ? dnode_setdirty+0x83/0x100 [zfs]
[ 1556.014826]  [<ffffffff945671e2>] ? mutex_lock+0x12/0x30
[ 1556.014839]  [<ffffffffc01bf84b>] ? multilist_sublist_unlock+0x2b/0x40 [zfs]
[ 1556.014848]  [<ffffffffc019020b>] ? dnode_hold+0x1b/0x20 [zfs]
[ 1556.014857]  [<ffffffffc017aa7a>] dmu_free_long_object_impl.part.11+0xba/0xf0 [zfs]
[ 1556.014865]  [<ffffffffc017ab24>] dmu_free_long_object_raw+0x34/0x40 [zfs]
[ 1556.014873]  [<ffffffffc0187858>] receive_freeobjects.isra.11+0x58/0x110 [zfs]
[ 1556.014881]  [<ffffffffc0187cb5>] receive_writer_thread+0x3a5/0xd50 [zfs]
[ 1556.014883]  [<ffffffff941ce021>] ? __slab_free+0xa1/0x2e0
[ 1556.014884]  [<ffffffff940a5200>] ? set_next_entity+0x70/0x890
[ 1556.014886]  [<ffffffffc006ff53>] ? spl_kmem_free+0x33/0x40 [spl]
[ 1556.014887]  [<ffffffffc00725d0>] ? __thread_exit+0x20/0x20 [spl]
[ 1556.014895]  [<ffffffffc0187910>] ? receive_freeobjects.isra.11+0x110/0x110 [zfs]
[ 1556.014896]  [<ffffffffc00725d0>] ? __thread_exit+0x20/0x20 [spl]
[ 1556.014898]  [<ffffffffc0072642>] thread_generic_wrapper+0x72/0x80 [spl]
[ 1556.014899]  [<ffffffff9408e457>] kthread+0xd7/0xf0
[ 1556.014899]  [<ffffffff9408e380>] ? kthread_park+0x60/0x60
[ 1556.014901]  [<ffffffff9456a155>] ret_from_fork+0x25/0x30
[ 1721.304223] INFO: task txg_quiesce:468 blocked for more than 120 seconds.
[ 1721.304317]       Tainted: P           O    4.9.58 #1-NixOS
[ 1721.304376] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1721.304456] txg_quiesce     D    0   468      2 0x00000000
[ 1721.304463]  ffff92d789a44400 0000000000000000 ffff92d7afbd7ec0 ffff92d783539a80
[ 1721.304469]  ffff92d78c355cc0 ffffaf4e10e1fd30 ffffffff94565082 ffffaf4e10e1fd00
[ 1721.304474]  0000000000000246 0000000180200010 ffffaf4e10e1fd50 ffff92d783539a80
[ 1721.304479] Call Trace:
[ 1721.304493]  [<ffffffff94565082>] ? __schedule+0x192/0x660
[ 1721.304500]  [<ffffffff94565586>] schedule+0x36/0x80
[ 1721.304511]  [<ffffffffc0077cb8>] cv_wait_common+0x128/0x140 [spl]
[ 1721.304518]  [<ffffffff940ad390>] ? wake_atomic_t_function+0x60/0x60
[ 1721.304525]  [<ffffffffc0077ce5>] __cv_wait+0x15/0x20 [spl]
[ 1721.304591]  [<ffffffffc01de633>] txg_quiesce_thread+0x2e3/0x3f0 [zfs]
[ 1721.304640]  [<ffffffffc01de350>] ? txg_wait_open+0x100/0x100 [zfs]
[ 1721.304647]  [<ffffffffc00725d0>] ? __thread_exit+0x20/0x20 [spl]
[ 1721.304654]  [<ffffffffc0072642>] thread_generic_wrapper+0x72/0x80 [spl]
[ 1721.304658]  [<ffffffff9408e457>] kthread+0xd7/0xf0
[ 1721.304662]  [<ffffffff9408e380>] ? kthread_park+0x60/0x60
[ 1721.304664]  [<ffffffff9408e380>] ? kthread_park+0x60/0x60
[ 1721.304669]  [<ffffffff9456a155>] ret_from_fork+0x25/0x30
[ 1721.304686] INFO: task zfs:15048 blocked for more than 120 seconds.
[ 1721.304753]       Tainted: P           O    4.9.58 #1-NixOS
[ 1721.304810] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1721.304890] zfs             D    0 15048  15040 0x00000000
[ 1721.304894]  ffff92d7617f0400 0000000000000000 ffff92d7afb57ec0 ffff92d78bf7cf80
[ 1721.304900]  ffff92d78c354240 ffffaf4e032438c8 ffffffff94565082 ffffffffc016ec26
[ 1721.304904]  ffff92d68c3e9730 0000000000000001 ffffaf4e032438d0 ffff92d78bf7cf80
[ 1721.304909] Call Trace:
[ 1721.304916]  [<ffffffff94565082>] ? __schedule+0x192/0x660
[ 1721.304953]  [<ffffffffc016ec26>] ? dbuf_rele+0x36/0x40 [zfs]
[ 1721.304959]  [<ffffffff94565586>] schedule+0x36/0x80
[ 1721.304967]  [<ffffffffc0077cb8>] cv_wait_common+0x128/0x140 [spl]
[ 1721.304972]  [<ffffffff940ad390>] ? wake_atomic_t_function+0x60/0x60
[ 1721.304979]  [<ffffffffc0077ce5>] __cv_wait+0x15/0x20 [spl]
[ 1721.305015]  [<ffffffffc0173f92>] bqueue_enqueue+0x62/0xe0 [zfs]
[ 1721.305059]  [<ffffffffc01898c1>] dmu_recv_stream+0x691/0x11c0 [zfs]
[ 1721.305066]  [<ffffffffc009062a>] ? nv_mem_zalloc.isra.12+0x2a/0x40 [znvpair]
[ 1721.305116]  [<ffffffffc02108fa>] ? zfs_set_prop_nvlist+0x2fa/0x510 [zfs]
[ 1721.305190]  [<ffffffffc0211057>] zfs_ioc_recv_impl+0x407/0x1170 [zfs]
[ 1721.305241]  [<ffffffffc02123f9>] zfs_ioc_recv_new+0x369/0x400 [zfs]
[ 1721.305254]  [<ffffffffc00702cc>] ? spl_kmem_alloc_impl+0x9c/0x180 [spl]
[ 1721.305263]  [<ffffffffc00724a9>] ? spl_vmem_alloc+0x19/0x20 [spl]
[ 1721.305270]  [<ffffffffc00958af>] ? nv_alloc_sleep_spl+0x1f/0x30 [znvpair]
[ 1721.305276]  [<ffffffffc009062a>] ? nv_mem_zalloc.isra.12+0x2a/0x40 [znvpair]
[ 1721.305283]  [<ffffffffc00906ff>] ? nvlist_xalloc.part.13+0x5f/0xc0 [znvpair]
[ 1721.305330]  [<ffffffffc020f0eb>] zfsdev_ioctl+0x20b/0x660 [zfs]
[ 1721.305340]  [<ffffffff941ff604>] do_vfs_ioctl+0x94/0x5c0
[ 1721.305347]  [<ffffffff9405dece>] ? __do_page_fault+0x25e/0x4c0
[ 1721.305352]  [<ffffffff941ffba9>] SyS_ioctl+0x79/0x90
[ 1721.305359]  [<ffffffff94569ef7>] entry_SYSCALL_64_fastpath+0x1a/0xa9
[ 1721.305366] INFO: task receive_writer:18808 blocked for more than 120 seconds.
[ 1721.305442]       Tainted: P           O    4.9.58 #1-NixOS
[ 1721.305500] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1721.305637] receive_writer  D    0 18808      2 0x00000000
[ 1721.305644]  ffff92d789a44400 0000000000000000 ffff92d7afb57ec0 ffff92d78bf7ea00
[ 1721.305654]  ffff92d78bf78000 ffffaf4e11137b30 ffffffff94565082 0000000000000000
[ 1721.305662]  ffffffffc02af5d0 00ffffffc02c51d0 0000000000000001 ffff92d78bf7ea00
[ 1721.305671] Call Trace:
[ 1721.305681]  [<ffffffff94565082>] ? __schedule+0x192/0x660
[ 1721.305691]  [<ffffffff94565586>] schedule+0x36/0x80
[ 1721.305703]  [<ffffffffc0075aeb>] spl_panic+0xeb/0x100 [spl]
[ 1721.305765]  [<ffffffffc016ec26>] ? dbuf_rele+0x36/0x40 [zfs]
[ 1721.305821]  [<ffffffffc0190107>] ? dnode_hold_impl+0xb57/0xc40 [zfs]
[ 1721.305873]  [<ffffffffc0190443>] ? dnode_setdirty+0x83/0x100 [zfs]
[ 1721.305879]  [<ffffffff945671e2>] ? mutex_lock+0x12/0x30
[ 1721.305943]  [<ffffffffc01bf84b>] ? multilist_sublist_unlock+0x2b/0x40 [zfs]
[ 1721.305997]  [<ffffffffc019020b>] ? dnode_hold+0x1b/0x20 [zfs]
[ 1721.306051]  [<ffffffffc017aa7a>] dmu_free_long_object_impl.part.11+0xba/0xf0 [zfs]
[ 1721.306102]  [<ffffffffc017ab24>] dmu_free_long_object_raw+0x34/0x40 [zfs]
[ 1721.306147]  [<ffffffffc0187858>] receive_freeobjects.isra.11+0x58/0x110 [zfs]
[ 1721.306207]  [<ffffffffc0187cb5>] receive_writer_thread+0x3a5/0xd50 [zfs]
[ 1721.306214]  [<ffffffff941ce021>] ? __slab_free+0xa1/0x2e0
[ 1721.306221]  [<ffffffff940a5200>] ? set_next_entity+0x70/0x890
[ 1721.306231]  [<ffffffffc006ff53>] ? spl_kmem_free+0x33/0x40 [spl]
[ 1721.306244]  [<ffffffffc00725d0>] ? __thread_exit+0x20/0x20 [spl]
[ 1721.306288]  [<ffffffffc0187910>] ? receive_freeobjects.isra.11+0x110/0x110 [zfs]
[ 1721.306296]  [<ffffffffc00725d0>] ? __thread_exit+0x20/0x20 [spl]
[ 1721.306303]  [<ffffffffc0072642>] thread_generic_wrapper+0x72/0x80 [spl]
[ 1721.306307]  [<ffffffff9408e457>] kthread+0xd7/0xf0
[ 1721.306316]  [<ffffffff9408e380>] ? kthread_park+0x60/0x60
[ 1721.306323]  [<ffffffff9456a155>] ret_from_fork+0x25/0x30
[ ... the same txg_quiesce, zfs and receive_writer hung-task traces repeat every ~120 seconds; further repeats trimmed ... ]

As said, if I don't use the -R option for the first dataset send to the server it works fine, but when I then try to send an incremental snapshot the same thing happens.

I also tried to send the snapshots to an encrypted child dataset on the server with the same results.

tcaputi (Contributor) commented Nov 6, 2017

@sjau Have you tried this with the latest code from master? I fixed a good number of these bugs a couple of weeks ago. If you are on the latest code, could you provide the send file that is causing the panic? Thanks a lot.

sjau (Author) commented Nov 6, 2017

Not sure what Nixos unstable uses... I'll have to find out.

What do you mean by providing the send file that is causing the panic?

If I don't have a copy on the remote server yet and use the -R flag, it causes a panic. If I just use the -w flag, the initial dataset (snapshot) send works. But when I then try to send an incremental snapshot, it panics again.

The smallest dataset is ~95 GB.

sjau (Author) commented Nov 6, 2017

OK, already found it out... I use the unstable version because of encryption.

Currently NixOS uses commit 7670f72 for ZFS unstable.

I checked it on the server and on the notebook - both use the same revision - https://github.com/NixOS/nixpkgs/blob/master/pkgs/os-specific/linux/zfs/default.nix#L164

tcaputi (Contributor) commented Nov 6, 2017

> Not sure what Nixos unstable uses... I'll have to find out.

Looking at the stack trace, it seems that you may(?) have the patch I was talking about after all. I'll try to look into this. It would be really helpful if you could provide a minimal set of commands to reproduce the problem, if possible. If not, I'll do my best to figure it out on my own.

> What do you mean by providing the send file that is causing the panic?

I meant the output of the zfs send command you are using. But if your dataset is 95GB this might not be feasible.
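
If it helps, the stream can be captured to a file instead of being piped over ssh - a sketch using the snapshot name from the first comment (output path illustrative):

# write the raw send stream to a file that can be shared for debugging
zfs send -w tank/encZFS/Nixos@encZFSSend_2017-11-04_12-31 > /tmp/nixos.zsend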

sjau (Author) commented Nov 6, 2017

I can test it with a smaller dataset just to see how it goes.

tcaputi (Contributor) commented Nov 6, 2017

Actually, it would be really helpful if you could cause it to crash one more time with the following debugging on the receiving side:

echo 1 > /sys/module/zfs/parameters/zfs_dbgmsg_enable
# perform the send, triggering the panic

Then provide the output of cat /proc/spl/kstat/zfs/dbgmsg | grep 'error 17'

sjau (Author) commented Nov 6, 2017

The commands are simple:

If the dataset does not exist, I can use:

zfs send -w "${srcDS}@${dsPrefix}_${now}" | ssh "${dstSrv}" "zfs receive ${dstDS}/${dsName}

That will create the dataset nicely on the destination server - though I haven't tried to mount it and check it yet... However, if I use the -R option, it causes a panic:

zfs send -wR "${srcDS}@${dsPrefix}_${now}" | ssh "${dstSrv}" "zfs receive ${dstDS}/${dsName}

Also for incrementals:

If I didn't use the -R option for the initial dataset creation on the server and then try to send an incremental snapshot like below, it also causes a panic:

zfs send -wi "#${incDS}" "${srcDS}@${dsPrefix}_${now}" | ssh "${dstSrv}" "zfs receive ${dstDS}/${dsName}"

sjau (Author) commented Nov 6, 2017

Shall I crash it with the initial dataset send (-R) or the incremental one?

tcaputi (Contributor) commented Nov 6, 2017

> The commands are simple:

We actually already have tests for these commands in our test suite (which is run against every pull request). This is probably a problem that is triggered by the data in your dataset.

> Shall I crash it with the initial dataset send (-R) or the incremental one?

The initial one please. Thanks a lot.

sjau (Author) commented Nov 6, 2017

Doing so now... it will take a few minutes to send 95 GB or so.

sjau (Author) commented Nov 6, 2017

FYI: The script I created to take snapshots and bookmarks and send them: https://paste.simplylinux.ch/view/raw/d7809481

sjau (Author) commented Nov 6, 2017

Nothing found:

[189601.664736] Showing stack for process 3880
[189601.664737] CPU: 7 PID: 3880 Comm: receive_writer Tainted: P           O    4.9.58 #1-NixOS
[189601.664738] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z97 Pro4, BIOS P2.50 05/27/2016
[189601.664739]  ffffbd1250b2fb38 ffffffffa88f7a12 ffffffffc02ffe03 00000000000003a9
[189601.664740]  ffffbd1250b2fb48 ffffffffc00b09f2 ffffbd1250b2fcd0 ffffffffc00b0ac5
[189601.664741]  0000000000000000 ffffbd1200000030 ffffbd1250b2fce0 ffffbd1250b2fc80
[189601.664742] Call Trace:
[189601.664747]  [<ffffffffa88f7a12>] dump_stack+0x63/0x81
[189601.664752]  [<ffffffffc00b09f2>] spl_dumpstack+0x42/0x50 [spl]
[189601.664753]  [<ffffffffc00b0ac5>] spl_panic+0xc5/0x100 [spl]
[189601.664774]  [<ffffffffc01a9c26>] ? dbuf_rele+0x36/0x40 [zfs]
[189601.664784]  [<ffffffffc01cb107>] ? dnode_hold_impl+0xb57/0xc40 [zfs]
[189601.664792]  [<ffffffffc01cb443>] ? dnode_setdirty+0x83/0x100 [zfs]
[189601.664793]  [<ffffffffa8b671e2>] ? mutex_lock+0x12/0x30
[189601.664805]  [<ffffffffc01fa84b>] ? multilist_sublist_unlock+0x2b/0x40 [zfs]
[189601.664814]  [<ffffffffc01cb20b>] ? dnode_hold+0x1b/0x20 [zfs]
[189601.664823]  [<ffffffffc01b5a7a>] dmu_free_long_object_impl.part.11+0xba/0xf0 [zfs]
[189601.664831]  [<ffffffffc01b5b24>] dmu_free_long_object_raw+0x34/0x40 [zfs]
[189601.664840]  [<ffffffffc01c2858>] receive_freeobjects.isra.11+0x58/0x110 [zfs]
[189601.664848]  [<ffffffffc01c2cb5>] receive_writer_thread+0x3a5/0xd50 [zfs]
[189601.664849]  [<ffffffffa87ce021>] ? __slab_free+0xa1/0x2e0
[189601.664851]  [<ffffffffa86a5200>] ? set_next_entity+0x70/0x890
[189601.664852]  [<ffffffffc00aaf53>] ? spl_kmem_free+0x33/0x40 [spl]
[189601.664854]  [<ffffffffc00ad5d0>] ? __thread_exit+0x20/0x20 [spl]
[189601.664862]  [<ffffffffc01c2910>] ? receive_freeobjects.isra.11+0x110/0x110 [zfs]
[189601.664863]  [<ffffffffc00ad5d0>] ? __thread_exit+0x20/0x20 [spl]
[189601.664864]  [<ffffffffc00ad642>] thread_generic_wrapper+0x72/0x80 [spl]
[189601.664865]  [<ffffffffa868e457>] kthread+0xd7/0xf0
[189601.664866]  [<ffffffffa868e380>] ? kthread_park+0x60/0x60
[189601.664867]  [<ffffffffa8b6a155>] ret_from_fork+0x25/0x30
root@servi-nixos:~# cat /proc/spl/kstat/zfs/dbgmsg | grep 'error 17'
root@servi-nixos:~# 
root@servi-nixos:~# cat /proc/spl/kstat/zfs/dbgmsg  > /tmp/dbgmsg
root@servi-nixos:~# ls -al /tmp/dbgmsg 
-rw-r--r-- 1 root root 3462117 Nov  6 20:35 /tmp/dbgmsg

The whole dbgmsg output can be viewed here https://paste.simplylinux.ch/view/raw/65e6bc7c

tcaputi (Contributor) commented Nov 6, 2017

@sjau
I'm sorry - I forgot a step that you need to run on non-debug builds. Unfortunately you will need to restart the receiving machine and run this immediately after echo 1 > /sys/module/zfs/parameters/zfs_dbgmsg_enable:

echo 4294967263 > /sys/module/zfs/parameters/zfs_flags
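
That value is 0xFFFFFFDF, i.e. every debug bit in the zfs_flags bitmask set except 0x20 (which named flag that bit maps to depends on the zfs_debug.h of the build):

# sanity-check the value: 2^32 - 1 - 0x20 = 4294967263
printf '%#x\n' 4294967263    # prints 0xffffffdf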

sjau (Author) commented Nov 6, 2017

Still no output:

root@servi-nixos:~# zfs destroy -r serviTank/encZFS/BU/subi/Nixos
root@servi-nixos:~# echo 1 > /sys/module/zfs/parameters/zfs_dbgmsg_enable
root@servi-nixos:~# echo 4294967263 > /sys/module/zfs/parameters/zfs_flags
root@servi-nixos:~# dmesg | tail -n 20
[ 1314.303517]  [<ffffffffc029a107>] ? dnode_hold_impl+0xb57/0xc40 [zfs]
[ 1314.303526]  [<ffffffffc029a443>] ? dnode_setdirty+0x83/0x100 [zfs]
[ 1314.303528]  [<ffffffffbd7671e2>] ? mutex_lock+0x12/0x30
[ 1314.303540]  [<ffffffffc02c984b>] ? multilist_sublist_unlock+0x2b/0x40 [zfs]
[ 1314.303549]  [<ffffffffc029a20b>] ? dnode_hold+0x1b/0x20 [zfs]
[ 1314.303557]  [<ffffffffc0284a7a>] dmu_free_long_object_impl.part.11+0xba/0xf0 [zfs]
[ 1314.303565]  [<ffffffffc0284b24>] dmu_free_long_object_raw+0x34/0x40 [zfs]
[ 1314.303573]  [<ffffffffc0291858>] receive_freeobjects.isra.11+0x58/0x110 [zfs]
[ 1314.303581]  [<ffffffffc0291cb5>] receive_writer_thread+0x3a5/0xd50 [zfs]
[ 1314.303583]  [<ffffffffbd3ce021>] ? __slab_free+0xa1/0x2e0
[ 1314.303584]  [<ffffffffbd2a5200>] ? set_next_entity+0x70/0x890
[ 1314.303586]  [<ffffffffc0176f53>] ? spl_kmem_free+0x33/0x40 [spl]
[ 1314.303587]  [<ffffffffc01795d0>] ? __thread_exit+0x20/0x20 [spl]
[ 1314.303595]  [<ffffffffc0291910>] ? receive_freeobjects.isra.11+0x110/0x110 [zfs]
[ 1314.303596]  [<ffffffffc01795d0>] ? __thread_exit+0x20/0x20 [spl]
[ 1314.303597]  [<ffffffffc0179642>] thread_generic_wrapper+0x72/0x80 [spl]
[ 1314.303598]  [<ffffffffbd28e457>] kthread+0xd7/0xf0
[ 1314.303599]  [<ffffffffbd28e380>] ? kthread_park+0x60/0x60
[ 1314.303599]  [<ffffffffbd28e380>] ? kthread_park+0x60/0x60
[ 1314.303600]  [<ffffffffbd76a155>] ret_from_fork+0x25/0x30
root@servi-nixos:~# cat /proc/spl/kstat/zfs/dbgmsg | grep 'error 17'
root@servi-nixos:~# dmesg | tail -n 30
[ 1314.303473] CPU: 6 PID: 22544 Comm: receive_writer Tainted: P           O    4.9.58 #1-NixOS
[ 1314.303473] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z97 Pro4, BIOS P2.50 05/27/2016
[ 1314.303474]  ffff9cc8839a3b38 ffffffffbd4f7a12 ffffffffc03cee03 00000000000003a9
[ 1314.303476]  ffff9cc8839a3b48 ffffffffc017c9f2 ffff9cc8839a3cd0 ffffffffc017cac5
[ 1314.303477]  0000000000000028 ffff9cc800000030 ffff9cc8839a3ce0 ffff9cc8839a3c80
[ 1314.303478] Call Trace:
[ 1314.303482]  [<ffffffffbd4f7a12>] dump_stack+0x63/0x81
[ 1314.303486]  [<ffffffffc017c9f2>] spl_dumpstack+0x42/0x50 [spl]
[ 1314.303487]  [<ffffffffc017cac5>] spl_panic+0xc5/0x100 [spl]
[ 1314.303507]  [<ffffffffc030d3c2>] ? __set_error+0x22/0x30 [zfs]
[ 1314.303517]  [<ffffffffc029a107>] ? dnode_hold_impl+0xb57/0xc40 [zfs]
[ 1314.303526]  [<ffffffffc029a443>] ? dnode_setdirty+0x83/0x100 [zfs]
[ 1314.303528]  [<ffffffffbd7671e2>] ? mutex_lock+0x12/0x30
[ 1314.303540]  [<ffffffffc02c984b>] ? multilist_sublist_unlock+0x2b/0x40 [zfs]
[ 1314.303549]  [<ffffffffc029a20b>] ? dnode_hold+0x1b/0x20 [zfs]
[ 1314.303557]  [<ffffffffc0284a7a>] dmu_free_long_object_impl.part.11+0xba/0xf0 [zfs]
[ 1314.303565]  [<ffffffffc0284b24>] dmu_free_long_object_raw+0x34/0x40 [zfs]
[ 1314.303573]  [<ffffffffc0291858>] receive_freeobjects.isra.11+0x58/0x110 [zfs]
[ 1314.303581]  [<ffffffffc0291cb5>] receive_writer_thread+0x3a5/0xd50 [zfs]
[ 1314.303583]  [<ffffffffbd3ce021>] ? __slab_free+0xa1/0x2e0
[ 1314.303584]  [<ffffffffbd2a5200>] ? set_next_entity+0x70/0x890
[ 1314.303586]  [<ffffffffc0176f53>] ? spl_kmem_free+0x33/0x40 [spl]
[ 1314.303587]  [<ffffffffc01795d0>] ? __thread_exit+0x20/0x20 [spl]
[ 1314.303595]  [<ffffffffc0291910>] ? receive_freeobjects.isra.11+0x110/0x110 [zfs]
[ 1314.303596]  [<ffffffffc01795d0>] ? __thread_exit+0x20/0x20 [spl]
[ 1314.303597]  [<ffffffffc0179642>] thread_generic_wrapper+0x72/0x80 [spl]
[ 1314.303598]  [<ffffffffbd28e457>] kthread+0xd7/0xf0
[ 1314.303599]  [<ffffffffbd28e380>] ? kthread_park+0x60/0x60
[ 1314.303599]  [<ffffffffbd28e380>] ? kthread_park+0x60/0x60
[ 1314.303600]  [<ffffffffbd76a155>] ret_from_fork+0x25/0x30
root@servi-nixos:~# cat /proc/spl/kstat/zfs/dbgmsg | grep 'error 17'
root@servi-nixos:~# 

tcaputi (Contributor) commented Nov 6, 2017

Hmmmm. What about grepping just for "error"? Sorry for all the trouble.

sjau (Author) commented Nov 6, 2017

There are tons of them:

https://paste.simplylinux.ch/view/raw/30b64aec

No need to be sorry :) You created the encryption feature and you're providing help... so no need to be sorry...

tcaputi (Contributor) commented Nov 6, 2017

Is the error reported in dmesg still failed (0 == 17)? (The top line of the stack trace is cut off above.)

sjau (Author) commented Nov 6, 2017

Here's more dmesg output:

[ ... unrelated boot-time dmesg lines trimmed ... ]
[ 1314.303437] VERIFY3(0 == dmu_object_dirty_raw(os, object, tx)) failed (0 == 17)
[ ... scattered single lines from the panic trace and from later hung-task warnings trimmed; the complete traces follow ... ]
[ 1720.835451] INFO: task txg_quiesce:455 blocked for more than 120 seconds.
[ 1720.835664]  ffff95ee0c354240 ffff9cc890f0fd30 ffffffffbd765082 ffffffffc030cf52
[ 1720.835752]  [<ffffffffbd765586>] schedule+0x36/0x80
[ 1720.835865]  [<ffffffffc02e8350>] ? txg_wait_open+0x100/0x100 [zfs]
[ 1720.835871]  [<ffffffffc01795d0>] ? __thread_exit+0x20/0x20 [spl]
[ 1720.835877]  [<ffffffffc0179642>] thread_generic_wrapper+0x72/0x80 [spl]
[ 1720.835881]  [<ffffffffbd28e457>] kthread+0xd7/0xf0
[ 1720.835884]  [<ffffffffbd28e380>] ? kthread_park+0x60/0x60
[ 1720.835888]  [<ffffffffbd76a155>] ret_from_fork+0x25/0x30
[ 1720.835907] INFO: task zfs:12024 blocked for more than 120 seconds.
[ 1720.835965]       Tainted: P           O    4.9.58 #1-NixOS
[ 1720.836016] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1720.836086] zfs             D    0 12024  11990 0x00000000
[ 1720.836090]  ffff95eddb621800 0000000000000000 ffff95ee2fa97ec0 ffff95eddcab6a00
[ 1720.836095]  ffff95ee0c351a80 ffff9cc883c878c8 ffffffffbd765082 0000000000000000
[ 1720.836099]  0000000000000000 000000000003328e 0000000000000400 ffff95eddcab6a00
[ 1720.836104] Call Trace:
[ 1720.836110]  [<ffffffffbd765082>] ? __schedule+0x192/0x660
[ 1720.836114]  [<ffffffffbd765586>] schedule+0x36/0x80
[ 1720.836121]  [<ffffffffc017ecb8>] cv_wait_common+0x128/0x140 [spl]
[ 1720.836125]  [<ffffffffbd2ad390>] ? wake_atomic_t_function+0x60/0x60
[ 1720.836131]  [<ffffffffc017ece5>] __cv_wait+0x15/0x20 [spl]
[ 1720.836165]  [<ffffffffc027df92>] bqueue_enqueue+0x62/0xe0 [zfs]
[ 1720.836205]  [<ffffffffc02938c1>] dmu_recv_stream+0x691/0x11c0 [zfs]
[ 1720.836211]  [<ffffffffc019762a>] ? nv_mem_zalloc.isra.12+0x2a/0x40 [znvpair]
[ 1720.836256]  [<ffffffffc031a8fa>] ? zfs_set_prop_nvlist+0x2fa/0x510 [zfs]
[ 1720.836296]  [<ffffffffc031b057>] zfs_ioc_recv_impl+0x407/0x1170 [zfs]
[ 1720.836337]  [<ffffffffc031c3f9>] zfs_ioc_recv_new+0x369/0x400 [zfs]
[ 1720.836343]  [<ffffffffc01772cc>] ? spl_kmem_alloc_impl+0x9c/0x180 [spl]
[ 1720.836373]  [<ffffffffc01794a9>] ? spl_vmem_alloc+0x19/0x20 [spl]
[ 1720.836382]  [<ffffffffc019c8af>] ? nv_alloc_sleep_spl+0x1f/0x30 [znvpair]
[ 1720.836390]  [<ffffffffc019762a>] ? nv_mem_zalloc.isra.12+0x2a/0x40 [znvpair]
[ 1720.836399]  [<ffffffffc01976ff>] ? nvlist_xalloc.part.13+0x5f/0xc0 [znvpair]
[ 1720.836451]  [<ffffffffc03190eb>] zfsdev_ioctl+0x20b/0x660 [zfs]
[ 1720.836462]  [<ffffffffbd3ff604>] do_vfs_ioctl+0x94/0x5c0
[ 1720.836468]  [<ffffffffbd25dece>] ? __do_page_fault+0x25e/0x4c0
[ 1720.836473]  [<ffffffffbd3ffba9>] SyS_ioctl+0x79/0x90
[ 1720.836477]  [<ffffffffbd769ef7>] entry_SYSCALL_64_fastpath+0x1a/0xa9
[ 1720.836482] INFO: task receive_writer:22544 blocked for more than 120 seconds.
[ 1720.836549]       Tainted: P           O    4.9.58 #1-NixOS
[ 1720.836600] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1720.836670] receive_writer  D    0 22544      2 0x00000000
[ 1720.836674]  ffff95eddb621800 0000000000000000 ffff95ee2fb97ec0 ffff95ee0bf56a00
[ 1720.836679]  ffff95ee0bfc0000 ffff9cc8839a3b30 ffffffffbd765082 0000000000000000
[ 1720.836683]  ffffffffc03b95d0 00ffffffc03cf1d0 0000000000000001 ffff95ee0bf56a00
[ 1720.836687] Call Trace:
[ 1720.836693]  [<ffffffffbd765082>] ? __schedule+0x192/0x660
[ 1720.836698]  [<ffffffffbd765586>] schedule+0x36/0x80
[ 1720.836706]  [<ffffffffc017caeb>] spl_panic+0xeb/0x100 [spl]
[ 1720.836748]  [<ffffffffc030d3c2>] ? __set_error+0x22/0x30 [zfs]
[ 1720.836790]  [<ffffffffc029a107>] ? dnode_hold_impl+0xb57/0xc40 [zfs]
[ 1720.836827]  [<ffffffffc029a443>] ? dnode_setdirty+0x83/0x100 [zfs]
[ 1720.836830]  [<ffffffffbd7671e2>] ? mutex_lock+0x12/0x30
[ 1720.836880]  [<ffffffffc02c984b>] ? multilist_sublist_unlock+0x2b/0x40 [zfs]
[ 1720.836918]  [<ffffffffc029a20b>] ? dnode_hold+0x1b/0x20 [zfs]
[ 1720.836954]  [<ffffffffc0284a7a>] dmu_free_long_object_impl.part.11+0xba/0xf0 [zfs]
[ 1720.836988]  [<ffffffffc0284b24>] dmu_free_long_object_raw+0x34/0x40 [zfs]
[ 1720.837024]  [<ffffffffc0291858>] receive_freeobjects.isra.11+0x58/0x110 [zfs]
[ 1720.837058]  [<ffffffffc0291cb5>] receive_writer_thread+0x3a5/0xd50 [zfs]
[ 1720.837062]  [<ffffffffbd3ce021>] ? __slab_free+0xa1/0x2e0
[ 1720.837068]  [<ffffffffbd2a5200>] ? set_next_entity+0x70/0x890
[ 1720.837074]  [<ffffffffc0176f53>] ? spl_kmem_free+0x33/0x40 [spl]
[ 1720.837080]  [<ffffffffc01795d0>] ? __thread_exit+0x20/0x20 [spl]
[ 1720.837113]  [<ffffffffc0291910>] ? receive_freeobjects.isra.11+0x110/0x110 [zfs]
[ 1720.837118]  [<ffffffffc01795d0>] ? __thread_exit+0x20/0x20 [spl]
[ 1720.837123]  [<ffffffffc0179642>] thread_generic_wrapper+0x72/0x80 [spl]
[ 1720.837127]  [<ffffffffbd28e457>] kthread+0xd7/0xf0
[ 1720.837131]  [<ffffffffbd28e380>] ? kthread_park+0x60/0x60
[ 1720.837135]  [<ffffffffbd28e380>] ? kthread_park+0x60/0x60
[ 1720.837139]  [<ffffffffbd76a155>] ret_from_fork+0x25/0x30

tcaputi (Contributor) commented Nov 6, 2017

Hmmm. You definitely should be seeing a message with error 17 in it... My only explanation is that the log could be cut off. Maybe you can temporarily increase the debug log to 1 GB by doing the following. Really sorry to ask you to keep crashing your machine, but I'm still not able to reproduce it locally yet:

echo 1073741824> /sys/module/zfs/parameters/zfs_dbgmsg_maxsize

sjau (Author) commented Nov 6, 2017

So which commands, in what order, should I execute? The server is still for testing... the real server still runs an mdadm RAID1, LUKS-encrypted Debian with ext4 and uses rsync for backup ;)

tcaputi (Contributor) commented Nov 6, 2017

echo 1 > /sys/module/zfs/parameters/zfs_dbgmsg_enable
echo 1073741824 > /sys/module/zfs/parameters/zfs_dbgmsg_maxsize
echo 4294967263 > /sys/module/zfs/parameters/zfs_flags
# cause the panic
cat /proc/spl/kstat/zfs/dbgmsg | grep 'error 17'

This will use up 1 GB of system RAM for debug messages, by the way, so you might want to tune zfs_dbgmsg_maxsize back down once we have everything figured out (or rebooting will clear the value).
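
For example, to put it back afterwards (assuming the stock default of 4 MB):

# restore the default debug message buffer size (believed to be 4 MB = 4194304 bytes)
echo 4194304 > /sys/module/zfs/parameters/zfs_dbgmsg_maxsize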

sjau (Author) commented Nov 6, 2017

Not issuing the echo 4294967263 > /sys/module/zfs/parameters/zfs_flags command?

tcaputi (Contributor) commented Nov 6, 2017

Yes. I'm sorry. Thanks for pointing that out. I updated the commands above.

sjau (Author) commented Nov 6, 2017

So, starting over again :) By the way, there's a missing space before the > in the dbgmsg_maxsize command above :)

sjau (Author) commented Nov 6, 2017

Now we have output:

root@servi-nixos:~# zfs destroy -r serviTank/encZFS/BU/subi/Nixos
cannot open 'serviTank/encZFS/BU/subi/Nixos': dataset does not exist
root@servi-nixos:~# zfs list
NAME                       USED  AVAIL  REFER  MOUNTPOINT
serviTank                 90.1G  3.42T    96K  /serviTank
serviTank/encZFS          55.5G  3.42T  1.39M  none
serviTank/encZFS/BU       2.78M  3.42T  1.39M  none
serviTank/encZFS/BU/subi  1.39M  3.42T  1.39M  none
serviTank/encZFS/Nixos    55.5G  3.42T  10.7G  legacy
root@servi-nixos:~# echo 1 > /sys/module/zfs/parameters/zfs_dbgmsg_enable
root@servi-nixos:~# echo 1073741824 > /sys/module/zfs/parameters/zfs_dbgmsg_maxsize
root@servi-nixos:~# echo 4294967263 > /sys/module/zfs/parameters/zfs_flags
root@servi-nixos:~# dmesg | tail -n 40
[   77.492055] audit: type=1300 audit(1510001435.468:38): arch=c000003e syscall=54 success=yes exit=0 a0=4 a1=29 a2=40 a3=15d55c0 items=0 ppid=6321 pid=6550 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="ip6tables" exe="/nix/store/hnwjdbj5w41sxw60cf4nfa93z8cr59nb-iptables-1.6.1/bin/xtables-multi" key=(null)
[   77.492057] audit: type=1327 audit(1510001435.468:38): proctitle=6970367461626C6573002D77002D7400726177002D4E006E69786F732D66772D727066696C746572
[   77.575067] EXT4-fs (sda2): mounted filesystem with ordered data mode. Opts: (null)
[   80.088561] nf_conntrack version 0.5.0 (65536 buckets, 262144 max)
[   80.917907] systemd-journald[6236]: Failed to set ACL on /var/log/journal/cc00e28c9c9241ad97b93019da77628d/user-1000.journal, ignoring: Operation not supported
[   90.931897] wireguard: WireGuard 0.0.20171017 loaded. See www.wireguard.com for information.
[   90.931898] wireguard: Copyright (C) 2015-2017 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
[ 1244.050819] VERIFY3(0 == dmu_object_dirty_raw(os, object, tx)) failed (0 == 17)
[ 1244.050842] PANIC at dmu.c:937:dmu_free_long_object_impl()
[ 1244.050854] Showing stack for process 8103
[ 1244.050856] CPU: 6 PID: 8103 Comm: receive_writer Tainted: P           O    4.9.58 #1-NixOS
[ 1244.050856] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z97 Pro4, BIOS P2.50 05/27/2016
[ 1244.050857]  ffffa13d43ac7b38 ffffffff842f7a12 ffffffffc057ce03 00000000000003a9
[ 1244.050859]  ffffa13d43ac7b48 ffffffffc032d9f2 ffffa13d43ac7cd0 ffffffffc032dac5
[ 1244.050860]  0000000000000028 ffffa13d00000030 ffffa13d43ac7ce0 ffffa13d43ac7c80
[ 1244.050861] Call Trace:
[ 1244.050865]  [<ffffffff842f7a12>] dump_stack+0x63/0x81
[ 1244.050869]  [<ffffffffc032d9f2>] spl_dumpstack+0x42/0x50 [spl]
[ 1244.050870]  [<ffffffffc032dac5>] spl_panic+0xc5/0x100 [spl]
[ 1244.050890]  [<ffffffffc04bb3c2>] ? __set_error+0x22/0x30 [zfs]
[ 1244.050900]  [<ffffffffc0448107>] ? dnode_hold_impl+0xb57/0xc40 [zfs]
[ 1244.050909]  [<ffffffffc0448443>] ? dnode_setdirty+0x83/0x100 [zfs]
[ 1244.050910]  [<ffffffff845671e2>] ? mutex_lock+0x12/0x30
[ 1244.050922]  [<ffffffffc047784b>] ? multilist_sublist_unlock+0x2b/0x40 [zfs]
[ 1244.050931]  [<ffffffffc044820b>] ? dnode_hold+0x1b/0x20 [zfs]
[ 1244.050939]  [<ffffffffc0432a7a>] dmu_free_long_object_impl.part.11+0xba/0xf0 [zfs]
[ 1244.050947]  [<ffffffffc0432b24>] dmu_free_long_object_raw+0x34/0x40 [zfs]
[ 1244.050955]  [<ffffffffc043f858>] receive_freeobjects.isra.11+0x58/0x110 [zfs]
[ 1244.050963]  [<ffffffffc043fcb5>] receive_writer_thread+0x3a5/0xd50 [zfs]
[ 1244.050965]  [<ffffffff841ce069>] ? __slab_free+0xe9/0x2e0
[ 1244.050966]  [<ffffffff840a5200>] ? set_next_entity+0x70/0x890
[ 1244.050968]  [<ffffffffc0327f53>] ? spl_kmem_free+0x33/0x40 [spl]
[ 1244.050969]  [<ffffffffc032a5d0>] ? __thread_exit+0x20/0x20 [spl]
[ 1244.050977]  [<ffffffffc043f910>] ? receive_freeobjects.isra.11+0x110/0x110 [zfs]
[ 1244.050978]  [<ffffffffc032a5d0>] ? __thread_exit+0x20/0x20 [spl]
[ 1244.050980]  [<ffffffffc032a642>] thread_generic_wrapper+0x72/0x80 [spl]
[ 1244.050981]  [<ffffffff8408e457>] kthread+0xd7/0xf0
[ 1244.050981]  [<ffffffff8408e380>] ? kthread_park+0x60/0x60
[ 1244.050982]  [<ffffffff8408e380>] ? kthread_park+0x60/0x60
[ 1244.050983]  [<ffffffff8456a155>] ret_from_fork+0x25/0x30
root@servi-nixos:~# cat /proc/spl/kstat/zfs/dbgmsg | grep 'error 17'
1510001732   dnode.c:1421:dnode_hold_impl(): error 17
1510002335   dnode.c:1421:dnode_hold_impl(): error 17
1510002602   dnode.c:1421:dnode_hold_impl(): error 17
1510002602   dnode.c:1421:dnode_hold_impl(): error 17

tcaputi (Contributor) commented Nov 6, 2017

Wonderful. I should be able to look at this tonight. Thanks for the help.

sjau (Author) commented Nov 6, 2017

Doesn't look to me like that's of any help... but well :) Thanks for your work.

tcaputi (Contributor) commented Nov 6, 2017

It's the line number. That should be enough for me to figure it out, but I'll post here if it's not.

sjau (Author) commented Nov 6, 2017

Good luck.

sjau (Author) commented Jan 23, 2018

So, I was able to compile it using this Nix expression:

https://paste.simplylinux.ch/view/bbec32ff

I had to add lines 39-47 to get that missing file. I also updated the revision for unstable to Sunday's master in lines 173-176.

I had to update Mic92's NixOS patch and add the stability patch.

It compiled nicely, but I'm not at home so I won't reboot the server for testing now.

Also, at home I have everything mirrored. I guess I'll shut down the server, remove the mirrored drives and then boot up to see how it goes. In case my server gets nuked, I can still put the mirrored drives back and reboot into the old (= current) build.

sjau (Author) commented Jan 24, 2018

So, I tried it now. I took all the mirrored drives out. NixOS couldn't boot up completely in read-only mode, so I created a new NixOS ISO with the patches and booted that one.

As you said, existing encrypted datasets can only be mounted read-only.

It's also good that I created a structure like tank/encZFS/Nixos.

So what I did was create a new encrypted dataset like tank/encNEW, then run zfs snapshot tank/encZFS/Nixos@now followed by zfs send tank/encZFS/Nixos@now | zfs receive tank/encNEW/Nixos.

Once that was completed, I ran zfs rename tank/encZFS tank/encOLD and zfs rename tank/encNEW tank/encZFS.
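
Collected into one sequence, the migration looked roughly like this (a sketch reconstructed from the steps above; the encryption options for encNEW are assumed to match the original encZFS):

# create the new encryption root and copy the data into it
zfs create -o encryption=aes-256-gcm -o keyformat=passphrase tank/encNEW
zfs snapshot tank/encZFS/Nixos@now
zfs send tank/encZFS/Nixos@now | zfs receive tank/encNEW/Nixos
# swap the trees: move the old one aside, promote the fresh copy
zfs rename tank/encZFS tank/encOLD
zfs rename tank/encNEW tank/encZFS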

I then tried to reboot, and it booted up just fine.

NAME                                                 USED  AVAIL  REFER  MOUNTPOINT
serviTank                                           1.84T  1.67T    96K  none
serviTank/encOLD                                    1.34T  1.67T  1.39M  none
[...]
serviTank/encOLD/Media                               859G  1.67T   859G  legacy
serviTank/encOLD/Nixos                               179G  1.67T  68.6G  legacy
serviTank/encZFS                                     516G  1.67T   184K  none
[...]
serviTank/encZFS/Media                               445G  1.67T   445G  legacy
serviTank/encZFS/Nixos                              70.7G  1.67T  68.8G  legacy
zfs get all serviTank/encZFS/Nixos | grep encr
serviTank/encZFS/Nixos  encryption            aes-256-gcm            -
serviTank/encZFS/Nixos  encryptionroot        serviTank/encZFS       -

So, all seems to work... Still restoring the Media DS

tcaputi pushed a commit to datto/zfs that referenced this issue Jan 24, 2018
Currently, when a raw zfs send file includes a DRR_OBJECT record
that would decrease the number of levels of an existing object,
the object is reallocated with dmu_object_reclaim() which
creates the new dnode using the old object's nlevels. For non-raw
sends this doesn't really matter, but raw sends require that
nlevels on the receive side match that of the send side so that
the checksum-of-MAC tree can be properly maintained. This patch
corrects the issue by freeing the object completely before
allocating it again in this case. This patch also corrects an
issue with dnode_hold_impl() that prevented this fix from working
correctly.

Fixes openzfs#6821

Signed-off-by: Tom Caputi <tcaputi@datto.com>
behlendorf pushed a commit to behlendorf/zfs that referenced this issue Jan 24, 2018, with the same message, additionally signed off by Brian Behlendorf <behlendorf1@llnl.gov>.
tcaputi pushed a commit to datto/zfs that referenced this issue Jan 24, 2018
Currently, when a raw zfs send file includes a DRR_OBJECT record
that would decrease the number of levels of an existing object,
the object is reallocated with dmu_object_reclaim() which
creates the new dnode using the old object's nlevels. For non-raw
sends this doesn't really matter, but raw sends require that
nlevels on the receive side match that of the send side so that
the checksum-of-MAC tree can be properly maintained. This patch
corrects the issue by freeing the object completely before
allocating it again in this case.

This patch also corrects several issues with dnode_hold_impl()
and related functions that prevented dnodes (particularly
multi-slot dnodes) from being reallocated properly due to
the fact that existing dnodes were not being fully cleaned up
when they were freed.

Fixes openzfs#6821

Signed-off-by: Tom Caputi <tcaputi@datto.com>
behlendorf/zfs and datto/zfs referenced this issue again on Jan 25 and Jan 26, 2018, with commits carrying the same message.
@sjau
Copy link
Author

sjau commented Jan 28, 2018

So, on the homeserver I applied the stability patch.

I then sent the dataset from my notebook (which does not have the stability patch):

zfs send tankSubi/encZFS/Nixos@zfs-auto-snap_hourly-2018-01-28-11h00 | ssh root@servi 'zfs receive serviTank/encZFS/BU/subi'

and then I sent an incremental stream:

zfs send -i tankSubi/encZFS/Nixos@zfs-auto-snap_hourly-2018-01-28-11h00 tankSubi/encZFS/Nixos@zfs-auto-snap_hourly-2018-01-28-12h00 | ssh root@servi 'zfs receive serviTank/encZFS/BU/subi'

On the server I then run:

root@servi:~# zfs set mountpoint=legacy serviTank/encZFS/BU/subi
root@servi:~# mkdir /tmp/xxx
root@servi:~# mount -t zfs serviTank/encZFS/BU/subi /tmp/xxx
root@servi:~# cd /tmp/xxx
root@servi:/tmp/xxx# ls
bin   dev  home         klauncherJ19542.1.slave-socket  mysql_backup  nixpkgs  root  sys       tmp  var
boot  etc  kdeinit5__0  mnt                             nix           proc     run   tankSubi  usr
root@servi:/tmp/xxx# 
root@servi:/tmp/xxx# zfs get all  serviTank/encZFS/BU/subi | grep encr
serviTank/encZFS/BU/subi  encryption            aes-256-gcm            -
serviTank/encZFS/BU/subi  encryptionroot        serviTank/encZFS       -

OK, non-raw sending of datasets to an encrypted dataset works fine with the stability patch, and incremental non-raw sends work fine as well.
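For reference, the cycle above can be folded into one small script that falls back to a full send when the destination does not exist yet. A hedged sketch, assuming the most recent snapshot already on the destination matches the second-newest one on the source:

    #!/bin/sh
    # Non-raw replication of the newest snapshot over ssh.
    SRC=tankSubi/encZFS/Nixos
    DST=serviTank/encZFS/BU/subi
    REMOTE=root@servi

    snaps=$(zfs list -H -d 1 -t snapshot -o name -s creation "$SRC")
    last=$(echo "$snaps" | tail -n 1)
    prev=$(echo "$snaps" | tail -n 2 | head -n 1)

    if [ "$prev" != "$last" ] && ssh "$REMOTE" "zfs list $DST" >/dev/null 2>&1; then
        zfs send -i "$prev" "$last" | ssh "$REMOTE" "zfs receive $DST"
    else
        zfs send "$last" | ssh "$REMOTE" "zfs receive $DST"
    fi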

Then I tried with raw sending:

root@subi:~# zfs send -w tankSubi/encZFS/Nixos@zfs-auto-snap_hourly-2018-01-28-11h00 | ssh root@servi 'zfs receive serviTank/encZFS/BU/subi'
cannot receive new filesystem stream: pool must be upgraded to receive this stream.

@tcaputi
Copy link
Contributor

tcaputi commented Jan 28, 2018

@sjau
So, just to make sure I understand correctly, you are saying that non-raw sends from unpatched to patched work correctly, but raw sends from unpatched to patched do not?

If so, this is expected behavior. The receive side is detecting that the send stream is for the old format and is rejecting it. The error message there isn't great, but the current zfs recv code makes it hard for the userspace code to determine exactly what went wrong when an error occurs (for any kind of send). Hopefully we can clean that up sometime soon, but that is outside the scope of this patch.

What should work is a raw send from patched to patched. So the raw send should work if you receive it locally on the patched server.
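One way to confirm that both ends actually run the same (patched) build before attempting a raw send is to compare the loaded module versions; the sysfs path below is standard for ZFS on Linux:

    # The raw stream format must match on both sides
    cat /sys/module/zfs/version
    ssh root@servi 'cat /sys/module/zfs/version'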

@sjau
Copy link
Author

sjau commented Jan 28, 2018

yes, you understand correctly. I'll have to patch my notebook next :)

@tcaputi
Copy link
Contributor

tcaputi commented Jan 28, 2018

Just make sure you keep the original datasets around until this gets merged, in case we have to make any last-minute changes.

@sjau
Copy link
Author

sjau commented Jan 28, 2018

keep original datasets around?

@tcaputi
Copy link
Contributor

tcaputi commented Jan 28, 2018

Yeah. Keep the filesystems you currently have on the notebook for a little bit before fixing them with zfs send.

@sjau
Copy link
Author

sjau commented Jan 28, 2018

don't have the space for that

@tcaputi
Copy link
Contributor

tcaputi commented Jan 28, 2018

Then I would hold off, just to be safe (if you care about that data a lot). We should have it merged in a few days (hopefully).

@sjau
Copy link
Author

sjau commented Feb 12, 2018

@tcaputi

I still get the "Invalid exchange" error when trying to mount an encrypted dataset sent in raw mode:

root@servi:~# ssh root@subi 'zfs send -w tankSubi/encZFS/Nixos@zfs-auto-snap_hourly-2018-02-12-17h00' | zfs receive serviTank/encZFS/Subi
root@servi:~# ssh root@subi 'zfs send -w -i tankSubi/encZFS/Nixos@zfs-auto-snap_hourly-2018-02-12-17h00 tankSubi/encZFS/Nixos@zfs-auto-snap_hourly-2018-02-12-18h00' | zfs receive serviTank/encZFS/Subi
root@servi:~# ssh root@subi 'zfs send -w -i tankSubi/encZFS/Nixos@zfs-auto-snap_hourly-2018-02-12-18h00 tankSubi/encZFS/Nixos@zfs-auto-snap_hourly-2018-02-12-19h00' | zfs receive serviTank/encZFS/Subi
root@servi:~# zfs list -t snapshot -r serviTank/encZFS/Subi
NAME                                                          USED  AVAIL  REFER  MOUNTPOINT
serviTank/encZFS/Subi@zfs-auto-snap_hourly-2018-02-12-17h00   843M      -  96.6G  -
serviTank/encZFS/Subi@zfs-auto-snap_hourly-2018-02-12-18h00   150M      -  96.6G  -
serviTank/encZFS/Subi@zfs-auto-snap_hourly-2018-02-12-19h00     0B      -  96.6G  -
root@servi:~# zfs set mountpoint=legacy serviTank/encZFS/Subi
root@servi:~# zfs load-key serviTank/encZFS/Subi
Enter passphrase for 'serviTank/encZFS/Subi':
root@servi:~# mkdir /tmp/subi
root@servi:~# mount -t zfs serviTank/encZFS/Subi /tmp/subi
filesystem 'serviTank/encZFS/Subi' can not be mounted: Invalid exchange

@tcaputi
Copy link
Contributor

tcaputi commented Feb 12, 2018

One other person has reported this as well since the stability patch. I'm looking into it. Do you have any steps to reproduce it?

@sjau
Copy link
Author

sjau commented Feb 12, 2018

Well, it's what I did above.
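Condensed into a single script, the reproduction from the comment above looks roughly like this (dataset and host names as in the thread):

    #!/bin/sh
    # Raw full send, two raw incrementals, then try to mount the result.
    SRC=tankSubi/encZFS/Nixos
    DST=serviTank/encZFS/Subi

    ssh root@subi "zfs send -w $SRC@zfs-auto-snap_hourly-2018-02-12-17h00" | zfs receive "$DST"
    ssh root@subi "zfs send -w -i $SRC@zfs-auto-snap_hourly-2018-02-12-17h00 $SRC@zfs-auto-snap_hourly-2018-02-12-18h00" | zfs receive "$DST"
    ssh root@subi "zfs send -w -i $SRC@zfs-auto-snap_hourly-2018-02-12-18h00 $SRC@zfs-auto-snap_hourly-2018-02-12-19h00" | zfs receive "$DST"

    zfs set mountpoint=legacy "$DST"
    zfs load-key "$DST"                 # prompts for the passphrase
    mkdir -p /tmp/subi
    mount -t zfs "$DST" /tmp/subi       # fails with "Invalid exchange" on affected builds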

Nasf-Fan pushed a commit to Nasf-Fan/zfs that referenced this issue Feb 13, 2018
Currently, when a raw zfs send file includes a DRR_OBJECT record
that would decrease the number of levels of an existing object,
the object is reallocated with dmu_object_reclaim() which
creates the new dnode using the old object's nlevels. For non-raw
sends this doesn't really matter, but raw sends require that
nlevels on the receive side match that of the send side so that
the checksum-of-MAC tree can be properly maintained. This patch
corrects the issue by freeing the object completely before
allocating it again in this case.

This patch also corrects several issues with dnode_hold_impl()
and related functions that prevented dnodes (particularly
multi-slot dnodes) from being reallocated properly due to
the fact that existing dnodes were not being fully cleaned up
when they were freed.

This patch adds a test to make sure that zfs recv functions
properly with incremental streams containing dnodes of different
sizes.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes openzfs#6821
Closes openzfs#6864
dweeezil added a commit to dweeezil/zfs that referenced this issue Aug 27, 2018
This is a port of 047116a - Raw sends must be able to decrease nlevels,
to the zfs-0.7-stable branch.  It includes the various fixes to the
problem of receiving incremental streams which include reallocated dnodes
in which the number of dnode slots has changed but excludes the parts
which are related to raw streams.

From 047116a:

    Currently, when a raw zfs send file includes a
    DRR_OBJECT record that would decrease the number of
    levels of an existing object, the object is reallocated
    with dmu_object_reclaim() which creates the new dnode
    using the old object's nlevels. For non-raw sends this
    doesn't really matter, but raw sends require that
    nlevels on the receive side match that of the send
    side so that the checksum-of-MAC tree can be properly
    maintained. This patch corrects the issue by freeing
    the object completely before allocating it again in
    this case.

    This patch also corrects several issues with
    dnode_hold_impl() and related functions that prevented
    dnodes (particularly multi-slot dnodes) from being
    reallocated properly due to the fact that existing
    dnodes were not being fully cleaned up when they
    were freed.

    This patch adds a test to make sure that zfs recv
    functions properly with incremental streams containing
    dnodes of different sizes.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes openzfs#6821
Closes openzfs#6864

Should close openzfs#6366
tonyhutter pushed a commit to tonyhutter/zfs that referenced this issue Sep 5, 2018
This is a port of 047116a - Raw sends must be able to decrease nlevels,
to the zfs-0.7-stable branch.  It includes the various fixes to the
problem of receiving incremental streams which include reallocated dnodes
in which the number of dnode slots has changed but excludes the parts
which are related to raw streams.

From 047116a:

    Currently, when a raw zfs send file includes a
    DRR_OBJECT record that would decrease the number of
    levels of an existing object, the object is reallocated
    with dmu_object_reclaim() which creates the new dnode
    using the old object's nlevels. For non-raw sends this
    doesn't really matter, but raw sends require that
    nlevels on the receive side match that of the send
    side so that the checksum-of-MAC tree can be properly
    maintained. This patch corrects the issue by freeing
    the object completely before allocating it again in
    this case.

    This patch also corrects several issues with
    dnode_hold_impl() and related functions that prevented
    dnodes (particularly multi-slot dnodes) from being
    reallocated properly due to the fact that existing
    dnodes were not being fully cleaned up when they
    were freed.

    This patch adds a test to make sure that zfs recv
    functions properly with incremental streams containing
    dnodes of different sizes.

This also includes a one-liner fix from loli10K to fix a test failure:
openzfs#7792 (comment)

Authored-by: Tom Caputi <tcaputi@datto.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Ported-by: Tim Chase <tim@chase2k.com>

Closes openzfs#6821
Closes openzfs#6864

NOTE: This is the first of the port of 3 related patches to the
zfs-0.7-release branch of ZoL.  The other two patches should immediately
follow this one.