
Failure to mount: "unable to fetch ZFS version for filesystem" after receiving a snapshot #7617

Closed
datalate opened this issue Jun 9, 2018 · 20 comments · Fixed by #7662

Comments

@datalate

datalate commented Jun 9, 2018

System information

Type Version/Name
Distribution Name Arch Linux
Linux Kernel 4.16.13-1-ARCH
Architecture amd64
ZFS Version 0.7.0-1412_g1a5b96b8e
SPL Version 0.7.0-1412_g1a5b96b8e

I'm running ZFS on two systems: one live system and one for backups. The issue appeared after receiving a snapshot from the live system; the live system itself is still fine. The dataset was mounting fine before this, and now refuses to mount even in read-only mode:

datalate ~ $ sudo zfs mount -o ro backup/root/debian-1
unable to fetch ZFS version for filesystem 'backup/root/debian-1'
cannot mount 'backup/root/debian-1': Resource temporarily unavailable

I sent the backup over netcat inside the LAN. Each command exited cleanly without errors. The send/recv commands were as follows:

Receiving: nc -l -p 2020 10.0.0.3 | sudo zfs receive -Fvdu backup
Sending: sudo zfs send -R -I datapool/root/debian-1@20170827 datapool/root/debian-1@20171113 | nc 10.0.0.3 2020

Basic info about backup pool:

datalate ~ $ sudo zfs list
NAME                   USED  AVAIL  REFER  MOUNTPOINT
backup                2.92T   609G    96K  none
backup/root           2.92T   609G    96K  none
backup/root/debian-1  2.92T   609G   494G  /mnt/backup
datalate ~ $ sudo zfs list -t snapshot
NAME                                         USED  AVAIL  REFER  MOUNTPOINT
backup/root/debian-1@20140728-cleaninstall   691M      -   874M  -
backup/root/debian-1@20170827               2.43T      -  2.83T  -
backup/root/debian-1@20171113                  0B      -   494G  -
datalate ~ $ sudo zpool status
  pool: backup
 state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
	still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
	the pool may no longer be accessible by software that does not support
	the features. See zpool-features(5) for details.
  scan: resilvered 0B in 0 days 00:00:00 with 0 errors on Fri Jun  8 17:02:03 2018
config:

	NAME                                        STATE     READ WRITE CKSUM
	backup                                      ONLINE       0     0     0
	  ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E6SA7DLS  ONLINE       0     0     0

errors: No known data errors
datalate ~ $ sudo zpool list
NAME     SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
backup  3.62T  2.92T   725G         -    40%    80%  1.00x  ONLINE  -

I noticed the version info is missing, and zdb doesn't output any version info for the dataset either:

datalate ~ $ sudo zdb -ddddd backup/root/debian-1 1
Dataset backup/root/debian-1 [ZPL], ID 49, cr_txg 706, 494G, 523208 objects, rootbp DVA[0]=<0:2e66ef18000:1000> DVA[1]=<0:35dface2000:1000> [L0 DMU objset] fletcher4 uncompressed unencrypted LE contiguous unique double size=800L/800P birth=1402711L/1402711P fill=523208 cksum=ae9251850:ade0fdb035a:77ee2ac8d37b2:416edf3ff686722

    Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
         1    1   128K    512      0     512    512    0.00  ZFS master node
	dnode flags: USERUSED_ACCOUNTED 
	dnode maxblkid: 0

Indirect blocks:

datalate ~ $ sudo zfs get version
NAME                                        PROPERTY  VALUE    SOURCE
backup                                      version   5        -
backup/root                                 version   5        -
backup/root/debian-1                        version   -        -
backup/root/debian-1@20140728-cleaninstall  version   5        -
backup/root/debian-1@20170827               version   5        -
backup/root/debian-1@20171113               version   -        -

I'm not sure if it's related, but I had to tune some options (sync, compression, xattr, and the ARC size) before the snapshot transfer worked correctly. Write speeds were painfully slow at first with high I/O load, and the send/recv was interrupted a few times before I got it working. I also removed some old snapshots. Here's the history:

...
2018-06-09.03:25:29 zpool import -N backup
2018-06-09.03:25:34 zfs set relatime=on backup/root/debian-1
2018-06-09.03:26:40 zfs set compression=on backup/root/debian-1
2018-06-09.03:35:59 zpool export backup
2018-06-09.03:36:07 zpool import -N backup
2018-06-09.03:40:52 zpool export backup
2018-06-09.03:50:03 zpool import -N backup
2018-06-09.03:58:34 zpool export backup
2018-06-09.04:05:58 zfs set mountpoint=none backup/root
2018-06-09.04:06:03 zfs set mountpoint=none backup
2018-06-09.04:06:21 zfs set mountpoint=/mnt/backup backup/root/debian-1
2018-06-09.04:23:13 zfs set compression=lz4 backup
2018-06-09.04:45:07 zfs set sync=disabled backup
2018-06-09.04:50:39 zpool export backup
2018-06-09.04:55:46 zpool import -N backup
2018-06-09.05:07:09 zpool export backup
2018-06-09.05:12:10 zpool import -N backup
2018-06-09.05:12:42 zfs destroy backup/root/debian-1@latest
2018-06-09.05:13:35 zfs destroy backup/root/debian-1@20170107
2018-06-09.05:13:42 zfs destroy backup/root/debian-1@20170327
2018-06-09.05:22:05 zpool export backup
2018-06-09.05:28:29 zpool import -N backup
2018-06-09.05:38:45 zfs set xattr=sa backup
2018-06-09.07:11:26 zfs receive -Fvdu backup
2018-06-09.14:10:19 zpool export backup
2018-06-09.14:11:08 zpool import -N backup
2018-06-09.14:26:46 zfs set sync=standard backup
2018-06-09.14:27:16 zpool export backup
2018-06-09.14:32:13 zpool import -N backup
...

I also tried a rescue-CD ISO with ZFS preinstalled, but the result was the same. I could simply create a new backup pool and send everything over again, but nuking the old pool feels a bit scary.

@DeHackEd
Contributor

I'm seeing this lately on new systems. The sender runs the 0.6.5 series; the receiver runs a days-old ZFS build from Git. I suspect a receiver-side regression from the last month or two. Only incremental sends are affected; full sends are clean.

Gonna try bisecting.

@DeHackEd
Contributor

047116a is the first bad commit

# zfs list -t all -o name,version
NAME                                        VERSION
zippermask                                        5
zippermask/server2-mysql                          -
zippermask/server2-mysql@2018-06-01--08:30        5
zippermask/server2-mysql@2018-06-02--08:30        -

Normal output from zdb on filesystem object 1:

# zdb -dddd zippermask/server2-mysql@2018-06-01--08:30 1
Dataset zippermask/server2-mysql@2018-06-01--08:30 [ZPL], ID 521, cr_txg 7104, 326G, 4750 objects, rootbp DVA[0]=<0:7caffb9600:200> DVA[1]=<0:1cdbd04800:200> [L0 DMU objset] fletcher4 lz4 unencrypted LE contiguous unique double size=800L/200P birth=7104L/7104P fill=4750 cksum=fb881c082:56623ea9df9:f8dde5dbfc2c:1f325d45d23ff3

    Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
         1    1   128K    512     1K     512    512  100.00  ZFS master node
	dnode flags: USED_BYTES USERUSED_ACCOUNTED 
	dnode maxblkid: 0
	microzap: 512 bytes, 7 entries

		DELETE_QUEUE = 3 
		casesensitivity = 0 
		SA_ATTRS = 2 
		normalization = 0 
		VERSION = 5 
		utf8only = 0 
		ROOT = 4 

When using this version (or newer) on the receiver, no entries are listed:

Dataset zippermask/server2-mysql@2018-06-02--08:30 [ZPL], ID 299, cr_txg 34229, 37.6G, 4753 objects, rootbp DVA[0]=<0:e85cdae600:200> DVA[1]=<0:8aa2ef6800:200> [L0 DMU objset] fletcher4 lz4 unencrypted LE contiguous unique double size=800L/200P birth=34229L/34229P fill=4753 cksum=c404ab43f:449894441f6:c96ff1691dbd:19c1f3fa28eec3

    Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
         1    1   128K    512      0     512    512    0.00  ZFS master node
	dnode flags: USERUSED_ACCOUNTED 
	dnode maxblkid: 0



Not even an indication it's supposed to be a ZAP.

The sender is running version 0.6.5.3-31_g5dfbb20. The send stream (as viewed by zstreamdump) looks like the following; note the dn_slots = 0 on every OBJECT record:

BEGIN record
        hdrtype = 1
        features = 4
        magic = 2f5bacbac
        creation_time = 5b128dc9
        type = 2
        flags = 0x0
        toguid = ac5d244c2e6ece71
        fromguid = d0dfc3ad38e75b23
        toname = server2/mysql-recv@2018-06-02--08:30

FREEOBJECTS firstobj = 0 numobjs = 1
OBJECT object = 1 type = 21 bonustype = 0 blksz = 512 bonuslen = 0 dn_slots = 0 raw_bonuslen = 0 flags = 0 maxblkid = 0 indblkshift = 0 nlevels = 0 nblkptr = 0
FREE object = 1 offset = 512 length = -1
OBJECT object = 2 type = 45 bonustype = 0 blksz = 512 bonuslen = 0 dn_slots = 0 raw_bonuslen = 0 flags = 0 maxblkid = 0 indblkshift = 0 nlevels = 0 nblkptr = 0
FREE object = 2 offset = 512 length = -1
OBJECT object = 3 type = 22 bonustype = 0 blksz = 512 bonuslen = 0 dn_slots = 0 raw_bonuslen = 0 flags = 0 maxblkid = 0 indblkshift = 0 nlevels = 0 nblkptr = 0
FREE object = 3 offset = 512 length = -1
OBJECT object = 4 type = 20 bonustype = 44 blksz = 4608 bonuslen = 176 dn_slots = 0 raw_bonuslen = 0 flags = 0 maxblkid = 0 indblkshift = 0 nlevels = 0 nblkptr = 0
FREE object = 4 offset = 4608 length = -1
OBJECT object = 5 type = 46 bonustype = 0 blksz = 1536 bonuslen = 0 dn_slots = 0 raw_bonuslen = 0 flags = 0 maxblkid = 0 indblkshift = 0 nlevels = 0 nblkptr = 0
FREE object = 5 offset = 1536 length = -1
OBJECT object = 6 type = 47 bonustype = 0 blksz = 16384 bonuslen = 0 dn_slots = 0 raw_bonuslen = 0 flags = 0 maxblkid = 0 indblkshift = 0 nlevels = 0 nblkptr = 0
FREE object = 6 offset = 32768 length = -1
OBJECT object = 7 type = 20 bonustype = 44 blksz = 512 bonuslen = 168 dn_slots = 0 raw_bonuslen = 0 flags = 0 maxblkid = 0 indblkshift = 0 nlevels = 0 nblkptr = 0
FREE object = 7 offset = 512 length = -1
OBJECT object = 8 type = 20 bonustype = 44 blksz = 1024 bonuslen = 168 dn_slots = 0 raw_bonuslen = 0 flags = 0 maxblkid = 0 indblkshift = 0 nlevels = 0 nblkptr = 0
FREE object = 8 offset = 1024 length = -1
OBJECT object = 9 type = 20 bonustype = 44 blksz = 1024 bonuslen = 168 dn_slots = 0 raw_bonuslen = 0 flags = 0 maxblkid = 0 indblkshift = 0 nlevels = 0 nblkptr = 0
FREE object = 9 offset = 1024 length = -1
OBJECT object = 10 type = 20 bonustype = 44 blksz = 2048 bonuslen = 168 dn_slots = 0 raw_bonuslen = 0 flags = 0 maxblkid = 0 indblkshift = 0 nlevels = 0 nblkptr = 0
FREE object = 10 offset = 2048 length = -1
OBJECT object = 11 type = 20 bonustype = 44 blksz = 1536 bonuslen = 168 dn_slots = 0 raw_bonuslen = 0 flags = 0 maxblkid = 0 indblkshift = 0 nlevels = 0 nblkptr = 0
FREE object = 11 offset = 1536 length = -1
OBJECT object = 12 type = 20 bonustype = 44 blksz = 91136 bonuslen = 168 dn_slots = 0 raw_bonuslen = 0 flags = 0 maxblkid = 0 indblkshift = 0 nlevels = 0 nblkptr = 0
FREE object = 12 offset = 91136 length = -1
OBJECT object = 13 type = 20 bonustype = 44 blksz = 2048 bonuslen = 168 dn_slots = 0 raw_bonuslen = 0 flags = 0 maxblkid = 0 indblkshift = 0 nlevels = 0 nblkptr = 0
FREE object = 13 offset = 2048 length = -1
OBJECT object = 14 type = 20 bonustype = 44 blksz = 16384 bonuslen = 168 dn_slots = 0 raw_bonuslen = 0 flags = 0 maxblkid = 0 indblkshift = 0 nlevels = 0 nblkptr = 0
FREE object = 14 offset = 16384 length = -1
OBJECT object = 15 type = 20 bonustype = 44 blksz = 22016 bonuslen = 168 dn_slots = 0 raw_bonuslen = 0 flags = 0 maxblkid = 0 indblkshift = 0 nlevels = 0 nblkptr = 0
FREE object = 15 offset = 22016 length = -1
OBJECT object = 16 type = 20 bonustype = 44 blksz = 512 bonuslen = 168 dn_slots = 0 raw_bonuslen = 0 flags = 0 maxblkid = 0 indblkshift = 0 nlevels = 0 nblkptr = 0
FREE object = 16 offset = 512 length = -1
OBJECT object = 17 type = 20 bonustype = 44 blksz = 4096 bonuslen = 168 dn_slots = 0 raw_bonuslen = 0 flags = 0 maxblkid = 0 indblkshift = 0 nlevels = 0 nblkptr = 0
FREE object = 17 offset = 4096 length = -1
OBJECT object = 18 type = 20 bonustype = 44 blksz = 41472 bonuslen = 168 dn_slots = 0 raw_bonuslen = 0 flags = 0 maxblkid = 0 indblkshift = 0 nlevels = 0 nblkptr = 0
FREE object = 18 offset = 41472 length = -1
OBJECT object = 19 type = 20 bonustype = 44 blksz = 1024 bonuslen = 168 dn_slots = 0 raw_bonuslen = 0 flags = 0 maxblkid = 0 indblkshift = 0 nlevels = 0 nblkptr = 0
FREE object = 19 offset = 1024 length = -1
OBJECT object = 20 type = 20 bonustype = 44 blksz = 27136 bonuslen = 168 dn_slots = 0 raw_bonuslen = 0 flags = 0 maxblkid = 0 indblkshift = 0 nlevels = 0 nblkptr = 0
FREE object = 20 offset = 27136 length = -1
OBJECT object = 21 type = 20 bonustype = 44 blksz = 51712 bonuslen = 168 dn_slots = 0 raw_bonuslen = 0 flags = 0 maxblkid = 0 indblkshift = 0 nlevels = 0 nblkptr = 0
FREE object = 21 offset = 51712 length = -1
OBJECT object = 22 type = 20 bonustype = 44 blksz = 1536 bonuslen = 168 dn_slots = 0 raw_bonuslen = 0 flags = 0 maxblkid = 0 indblkshift = 0 nlevels = 0 nblkptr = 0
FREE object = 22 offset = 1536 length = -1
OBJECT object = 23 type = 20 bonustype = 44 blksz = 512 bonuslen = 168 dn_slots = 0 raw_bonuslen = 0 flags = 0 maxblkid = 0 indblkshift = 0 nlevels = 0 nblkptr = 0
FREE object = 23 offset = 512 length = -1
[ ... ]

Mentioning @tcaputi as the author of the bisected commit.

@tcaputi
Contributor

tcaputi commented Jun 25, 2018

@DeHackEd
Could you provide a list of all of the properties on the broken dataset (on both the send and receive sides)?

Could you also turn on error printing with:

echo 0xfffffbfe > /sys/module/zfs/parameters/zfs_flags
echo 0 > /proc/spl/kstat/zfs/dbgmsg

Then cause the problem again and provide the output of /proc/spl/kstat/zfs/dbgmsg.

@DeHackEd
Contributor

Not right now, because I can't spare the machine for experimentation with different versions of ZFS while people are using it.

@tcaputi
Contributor

tcaputi commented Jun 25, 2018

If you can cause the issue on the current version, that should be fine. Otherwise, we can wait until you have time.

@DeHackEd
Contributor

Oh yes, the current version does it.

Unfortunately the (saved) send streams I've been using weigh in at 300 GB compressed for the initial snapshot and 3 GB compressed for the incremental. I don't have that kind of virtual machine lying around right now... Maybe tonight...

@tcaputi
Contributor

tcaputi commented Jun 25, 2018

The problem is only in the incremental. You shouldn't need to receive the full stream again (you can just destroy the second snapshot). We can wait if you need to, though.

@DeHackEd
Contributor

The issue is loading my known-working reproducer onto a known-broken version of ZFS, on a machine that 1) I can run with the expectation that it could hang, need a reboot, or need ZFS reinstalled, and 2) has the capacity to hold the test cases.

Sender:

# zfs get all server2/mysql-recv@2018-06-02--08:30
NAME                                     PROPERTY              VALUE                  SOURCE
server2/mysql-recv@2018-06-02--08:30  type                  snapshot               -
server2/mysql-recv@2018-06-02--08:30  creation              Sat Jun  2  8:30 2018  -
server2/mysql-recv@2018-06-02--08:30  used                  730M                   -
server2/mysql-recv@2018-06-02--08:30  referenced            422G                   -
server2/mysql-recv@2018-06-02--08:30  compressratio         1.74x                  -
server2/mysql-recv@2018-06-02--08:30  devices               on                     default
server2/mysql-recv@2018-06-02--08:30  exec                  on                     default
server2/mysql-recv@2018-06-02--08:30  setuid                on                     default
server2/mysql-recv@2018-06-02--08:30  xattr                 on                     default
server2/mysql-recv@2018-06-02--08:30  version               5                      -
server2/mysql-recv@2018-06-02--08:30  utf8only              off                    -
server2/mysql-recv@2018-06-02--08:30  normalization         none                   -
server2/mysql-recv@2018-06-02--08:30  casesensitivity       sensitive              -
server2/mysql-recv@2018-06-02--08:30  nbmand                off                    default
server2/mysql-recv@2018-06-02--08:30  primarycache          all                    default
server2/mysql-recv@2018-06-02--08:30  secondarycache        all                    default
server2/mysql-recv@2018-06-02--08:30  defer_destroy         off                    -
server2/mysql-recv@2018-06-02--08:30  userrefs              0                      -
server2/mysql-recv@2018-06-02--08:30  mlslabel              none                   default
server2/mysql-recv@2018-06-02--08:30  refcompressratio      1.74x                  -
server2/mysql-recv@2018-06-02--08:30  written               3.88G                  -
server2/mysql-recv@2018-06-02--08:30  clones                                       -
server2/mysql-recv@2018-06-02--08:30  logicalused           0                      -
server2/mysql-recv@2018-06-02--08:30  logicalreferenced     691G                   -
server2/mysql-recv@2018-06-02--08:30  acltype               off                    default
server2/mysql-recv@2018-06-02--08:30  context               none                   default
server2/mysql-recv@2018-06-02--08:30  fscontext             none                   default
server2/mysql-recv@2018-06-02--08:30  defcontext            none                   default
server2/mysql-recv@2018-06-02--08:30  rootcontext           none                   default

Receiver:

NAME                                      PROPERTY              VALUE                  SOURCE
whoopass4/server2-test@2018-06-02--08:30  type                  snapshot               -
whoopass4/server2-test@2018-06-02--08:30  creation              Sat Jun  2  8:30 2018  -
whoopass4/server2-test@2018-06-02--08:30  used                  0B                     -
whoopass4/server2-test@2018-06-02--08:30  referenced            47.7G                  -
whoopass4/server2-test@2018-06-02--08:30  compressratio         3.35x                  -
whoopass4/server2-test@2018-06-02--08:30  devices               on                     default
whoopass4/server2-test@2018-06-02--08:30  exec                  on                     default
whoopass4/server2-test@2018-06-02--08:30  setuid                on                     default
whoopass4/server2-test@2018-06-02--08:30  createtxg             532284                 -
whoopass4/server2-test@2018-06-02--08:30  xattr                 on                     default
whoopass4/server2-test@2018-06-02--08:30  nbmand                off                    default
whoopass4/server2-test@2018-06-02--08:30  guid                  12420123256972824177   -
whoopass4/server2-test@2018-06-02--08:30  primarycache          all                    default
whoopass4/server2-test@2018-06-02--08:30  secondarycache        metadata               inherited from whoopass4
whoopass4/server2-test@2018-06-02--08:30  defer_destroy         off                    -
whoopass4/server2-test@2018-06-02--08:30  userrefs              0                      -
whoopass4/server2-test@2018-06-02--08:30  mlslabel              none                   default
whoopass4/server2-test@2018-06-02--08:30  refcompressratio      3.35x                  -
whoopass4/server2-test@2018-06-02--08:30  written               3.30G                  -
whoopass4/server2-test@2018-06-02--08:30  clones                                       -
whoopass4/server2-test@2018-06-02--08:30  logicalreferenced     124G                   -
whoopass4/server2-test@2018-06-02--08:30  acltype               off                    default
whoopass4/server2-test@2018-06-02--08:30  context               none                   default
whoopass4/server2-test@2018-06-02--08:30  fscontext             none                   default
whoopass4/server2-test@2018-06-02--08:30  defcontext            none                   default
whoopass4/server2-test@2018-06-02--08:30  rootcontext           none                   default
whoopass4/server2-test@2018-06-02--08:30  encryption            off                    default

I need to expand the debug buffer size and try again with canmount=off to capture the debug log.

@DeHackEd
Contributor

Debug log: http://www.dehacked.net/zfs-noversion.zip (about 31 megabytes decompressed).

It was a slow job due to the amount of metadata the incremental needed to load as it went.

@tcaputi
Contributor

tcaputi commented Jun 26, 2018

I'll try to take a look tomorrow. One last question: do you have the large_dnode feature enabled on either side (dnodesize != legacy)?

@DeHackEd
Contributor

No. The sender is 0.6.5 and doesn't support it. I've tested a receiver with the feature explicitly disabled, and one with it enabled but never activated.

@tcaputi
Contributor

tcaputi commented Jun 26, 2018

I think I see what's going on. I'll have a PR up for you to test by the end of the day.

@tcaputi
Contributor

tcaputi commented Jun 26, 2018

Actually, I think this should be a one-liner. Try applying this diff, and if it works I'll make a full PR out of it:

diff --git a/module/zfs/dmu_send.c b/module/zfs/dmu_send.c
index d0e74a4..15905a6 100644
--- a/module/zfs/dmu_send.c
+++ b/module/zfs/dmu_send.c
@@ -2607,7 +2607,8 @@ receive_object(struct receive_writer_arg *rwa, struct drr_object *drro,
 
 		if (drro->drr_blksz != doi.doi_data_block_size ||
 		    nblkptr < doi.doi_nblkptr ||
-		    drro->drr_dn_slots != doi.doi_dnodesize >> DNODE_SHIFT ||
+		    (drro->drr_dn_slots != 0 &&
+		    drro->drr_dn_slots != doi.doi_dnodesize >> DNODE_SHIFT) ||
 		    (rwa->raw &&
 		    (indblksz != doi.doi_metadata_block_size ||
 		    drro->drr_nlevels < doi.doi_indirection))) {

@tcaputi
Contributor

tcaputi commented Jun 26, 2018

For background, the issue (if I'm correct) is that your send streams don't support large dnodes, so they send 0 for the dnode size. The new code sees that and thinks the dnode size changed, so it frees the object's data (since dnodes cannot change size without being freed).
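
To make the failure mode concrete, here is a small standalone C illustration (hypothetical values, not code from the thread; in ZFS the minimum dnode is 512 bytes, so DNODE_SHIFT is 9 and a legacy dnode occupies one slot):

#include <stdio.h>
#include <stdint.h>

#define DNODE_SHIFT 9			/* 512-byte minimum dnode, as in ZFS */

int
main(void)
{
	uint64_t doi_dnodesize = 512;	/* legacy dnode on the receiver */
	uint64_t drr_dn_slots = 0;	/* old sender: large_dnode flag absent */

	/* Unpatched check: (0 != 1) fires spuriously for old streams. */
	if (drr_dn_slots != (doi_dnodesize >> DNODE_SHIFT))
		printf("unpatched: object would be freed\n");

	/* Guarded check from the diff above: only compare when nonzero. */
	if (drr_dn_slots != 0 &&
	    drr_dn_slots != (doi_dnodesize >> DNODE_SHIFT))
		printf("patched: object would be freed\n");
	else
		printf("patched: object preserved\n");

	return (0);
}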

@DeHackEd
Copy link
Contributor

No luck with that diff.

@tcaputi
Contributor

tcaputi commented Jun 26, 2018

@DeHackEd I was able to reproduce the issue locally. I missed another place where this check was needed. Please try this patch:

diff --git a/module/zfs/dmu_send.c b/module/zfs/dmu_send.c
index d0e74a4..38ab656 100644
--- a/module/zfs/dmu_send.c
+++ b/module/zfs/dmu_send.c
@@ -2607,7 +2607,8 @@ receive_object(struct receive_writer_arg *rwa, struct drr_object *drro,
 
 		if (drro->drr_blksz != doi.doi_data_block_size ||
 		    nblkptr < doi.doi_nblkptr ||
-		    drro->drr_dn_slots != doi.doi_dnodesize >> DNODE_SHIFT ||
+		    (drro->drr_dn_slots != 0 &&
+		    drro->drr_dn_slots != doi.doi_dnodesize >> DNODE_SHIFT) ||
 		    (rwa->raw &&
 		    (indblksz != doi.doi_metadata_block_size ||
 		    drro->drr_nlevels < doi.doi_indirection))) {
@@ -2628,7 +2629,8 @@ receive_object(struct receive_writer_arg *rwa, struct drr_object *drro,
 		 * instead.
 		 */
 		if ((rwa->raw && drro->drr_nlevels < doi.doi_indirection) ||
-		    drro->drr_dn_slots != doi.doi_dnodesize >> DNODE_SHIFT) {
+		    (drro->drr_dn_slots != 0 &&
+		    drro->drr_dn_slots != doi.doi_dnodesize >> DNODE_SHIFT)) {
 			err = dmu_free_long_object(rwa->os, drro->drr_object);
 			if (err != 0)
 				return (SET_ERROR(EINVAL));

@DeHackEd
Contributor

[82187.548388] BUG: unable to handle kernel NULL pointer dereference at           (null)
[82187.548454] IP: [<ffffffff81337ab4>] memmove+0x24/0x1a0
[82187.548495] PGD 8000002fdd531067 PUD 303f70a067 PMD 0 
[82187.548537] Oops: 0002 [#1] SMP 
[82187.548565] Modules linked in: zfs(POE) binfmt_misc bridge stp llc nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack iptable_filter zunicode(POE) zlua(POE) zcommon(POE) znvpair(POE) zavl(POE) icp(POE) spl(OE) raid456 async_raid6_recov async_memcpy async_pq iTCO_wdt iTCO_vendor_support skx_edac raid6_pq edac_core intel_powerclamp coretemp libcrc32c async_xor intel_rapl iosf_mbi xor kvm_intel async_tx kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd vfat fat pcspkr ses enclosure joydev sg mei_me mei shpchp i2c_i801 lpc_ich wmi ipmi_si ipmi_devintf ipmi_msghandler nfit acpi_cpufreq libnvdimm acpi_pad acpi_power_meter ip_tables ext4 mbcache jbd2 raid1 sd_mod crc_t10dif crct10dif_generic ast i2c_algo_bit drm_kms_helper crct10dif_pclmul crct10dif_common
[82187.549209]  crc32c_intel syscopyarea sysfillrect sysimgblt fb_sys_fops ttm ixgbe mpt3sas i40e drm ahci libahci libata raid_class mdio scsi_transport_sas dca ptp pps_core i2c_core dm_mirror dm_region_hash dm_log dm_mod [last unloaded: zfs]
[82187.549394] CPU: 3 PID: 40393 Comm: receive_writer Tainted: P           OE  ------------   3.10.0-693.21.1.el7.x86_64 #1
[82187.549476] Hardware name: Supermicro Super Server/X11DDW-NT, BIOS 2.0b 03/07/2018
[82187.549531] task: ffff880d135c9fa0 ti: ffff880c61c3c000 task.ti: ffff880c61c3c000
[82187.549580] RIP: 0010:[<ffffffff81337ab4>]  [<ffffffff81337ab4>] memmove+0x24/0x1a0
[82187.549634] RSP: 0018:ffff880c61c3fbb8  EFLAGS: 00010282
[82187.549670] RAX: 0000000000000000 RBX: ffff8814415d9998 RCX: 00000000000000a8
[82187.549717] RDX: 00000000000000a8 RSI: ffff882ee82ceac0 RDI: 0000000000000000
[82187.549769] RBP: ffff880c61c3fc40 R08: ffff882ee82ceb68 R09: 0000000000000000
[82187.549819] R10: 0000000000000b6a R11: ffff880c61c3f716 R12: ffff880b68673b80
[82187.549866] R13: ffffffffffffff40 R14: 00000000000000a8 R15: 00000000ffffff40
[82187.549916] FS:  0000000000000000(0000) GS:ffff8817ddac0000(0000) knlGS:0000000000000000
[82187.549971] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[82187.550010] CR2: 0000000000000000 CR3: 0000002fda044000 CR4: 00000000003607e0
[82187.550060] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[82187.550109] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[82187.550155] Call Trace:
[82187.550220]  [<ffffffffc0cabd1f>] ? dbuf_read_impl+0x4af/0x680 [zfs]
[82187.550271]  [<ffffffffc061d34d>] ? spl_kmem_cache_alloc+0xbd/0x150 [spl]
[82187.550350]  [<ffffffffc0cacc62>] dbuf_read+0xd2/0x560 [zfs]
[82187.550428]  [<ffffffffc0cd1433>] ? dnode_rele_and_unlock+0x53/0x90 [zfs]
[82187.550508]  [<ffffffffc0cb9998>] dmu_bonus_hold_impl+0xe8/0x1e0 [zfs]
[82187.550586]  [<ffffffffc0cc5890>] receive_object+0x680/0x8f0 [zfs]
[82187.550659]  [<ffffffffc0cb7052>] ? dmu_free_long_range+0x2b2/0x460 [zfs]
[82187.550712]  [<ffffffffc061bf65>] ? spl_kmem_free+0x35/0x40 [spl]
[82187.550755]  [<ffffffff81343f1d>] ? list_del+0xd/0x30
[82187.550824]  [<ffffffffc0cc8bd3>] receive_writer_thread+0x453/0xb20 [zfs]
[82187.550873]  [<ffffffff810cb0b5>] ? sched_clock_cpu+0x85/0xc0
[82187.550944]  [<ffffffffc0cc8780>] ? receive_free.isra.13+0xd0/0xd0 [zfs]
[82187.550994]  [<ffffffffc061bf65>] ? spl_kmem_free+0x35/0x40 [spl]
[82187.551066]  [<ffffffffc0cc8780>] ? receive_free.isra.13+0xd0/0xd0 [zfs]
[82187.551116]  [<ffffffffc061dfc3>] thread_generic_wrapper+0x73/0x80 [spl]
[82187.551165]  [<ffffffffc061df50>] ? __thread_exit+0x20/0x20 [spl]
[82187.551208]  [<ffffffff810b4031>] kthread+0xd1/0xe0
[82187.551243]  [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40
[82187.551286]  [<ffffffff816c055d>] ret_from_fork+0x5d/0xb0
[82187.551324]  [<ffffffff810b3f60>] ? insert_kthread_work+0x40/0x40
[82187.551365] Code: 88 0c 17 88 0f c3 90 48 89 f8 48 83 fa 20 0f 82 03 01 00 00 48 39 fe 7d 0f 49 89 f0 49 01 d0 49 39 f8 0f 8f 9f 00 00 00 48 89 d1 <f3> a4 c3 0f 1f 84 00 00 00 00 00 0f 1f 84 00 00 00 00 00 0f 1f 
[82187.551627] RIP  [<ffffffff81337ab4>] memmove+0x24/0x1a0
[82187.551668]  RSP <ffff880c61c3fbb8>
[82187.551693] CR2: 0000000000000000
--- a/module/zfs/dmu_send.c
+++ b/module/zfs/dmu_send.c
@@ -2512,7 +2512,8 @@ receive_object(struct receive_writer_arg *rwa, struct drr_object *drro,
 
                if (drro->drr_blksz != doi.doi_data_block_size ||
                    nblkptr < doi.doi_nblkptr ||
-                   drro->drr_dn_slots != doi.doi_dnodesize >> DNODE_SHIFT ||
+                   (drro->drr_dn_slots != 0 &&
+                   drro->drr_dn_slots != doi.doi_dnodesize >> DNODE_SHIFT) ||
                    (rwa->raw &&
                    (indblksz != doi.doi_metadata_block_size ||
                    drro->drr_nlevels < doi.doi_indirection))) {
@@ -2533,7 +2534,8 @@ receive_object(struct receive_writer_arg *rwa, struct drr_object *drro,
                 * instead.
                 */
                if ((rwa->raw && drro->drr_nlevels < doi.doi_indirection) ||
-                   drro->drr_dn_slots != doi.doi_dnodesize >> DNODE_SHIFT) {
+                   (drro->drr_dn_slots != 0 &&
+                   drro->drr_dn_slots != doi.doi_dnodesize >> DNODE_SHIFT)) {
                        err = dmu_free_long_object(rwa->os, drro->drr_object);
                        if (err != 0)
                                return (SET_ERROR(EINVAL));

And still no love.

After each failed job I destroy the bad snapshot and revert to the old one before proceeding with another attempt.

@tcaputi
Contributor

tcaputi commented Jun 27, 2018

Thanks a lot for working with me on this. I don't have a good script to reproduce your new problem, but I think I see what may be causing the issue. Try this diff when you get a chance: https://pastebin.com/9NTxCirm

@DeHackEd
Contributor

This one worked. The receive was successful, the version is set, object 1 looks intact, and I can mount the filesystem now.

@tcaputi
Contributor

tcaputi commented Jun 27, 2018

Wonderful. I'll make a PR by the end of the day. Thanks for the help.

tcaputi pushed a commit to datto/zfs that referenced this issue Jun 27, 2018
Currently, there is a bug where older send streams without the
DMU_BACKUP_FEATURE_LARGE_DNODE flag are not handled correctly.
The code in receive_object() fails to handle cases where
drro->drr_dn_slots is set to 0, which is always the case when the
sending code does not support this feature flag. This patch fixes
the issue by ensuring that a value of 0 is treated as
DNODE_MIN_SLOTS.

Fixes: openzfs#7617

Signed-off-by: Tom Caputi <tcaputi@datto.com>
behlendorf pushed a commit that referenced this issue Jun 28, 2018
Currently, there is a bug where older send streams without the
DMU_BACKUP_FEATURE_LARGE_DNODE flag are not handled correctly.
The code in receive_object() fails to handle cases where
drro->drr_dn_slots is set to 0, which is always the case when the
sending code does not support this feature flag. This patch fixes
the issue by ensuring that a value of 0 is treated as
DNODE_MIN_SLOTS.

Tested-by:  DHE <git@dehacked.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7617 
Closes #7662
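
For reference, a minimal sketch of the approach the commit message describes (a paraphrase, not the verbatim upstream hunk):

#include <stdint.h>

#define DNODE_MIN_SLOTS 1	/* one 512-byte dnode slot, as in ZFS */

/*
 * Sketch of the fix: a drr_dn_slots of 0 -- what senders without
 * DMU_BACKUP_FEATURE_LARGE_DNODE emit -- is normalized to
 * DNODE_MIN_SLOTS before any comparison against the existing object's
 * dnode size, so old streams no longer look like a dnode-size change.
 */
static inline uint32_t
normalize_dn_slots(uint32_t drr_dn_slots)
{
	return (drr_dn_slots != 0 ? drr_dn_slots : DNODE_MIN_SLOTS);
}
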
dweeezil and tonyhutter later pushed commits with the same change to their forks, referencing this issue (Aug 27 – Sep 5, 2018).