
zfs randomly giving errors #2863

Closed

uwkhahn opened this issue Nov 3, 2014 · 32 comments

uwkhahn commented Nov 3, 2014

We use ZFS in a home-grown backup solution that previously worked on ZFS on Solaris. It is a series of scripts doing periodic rsyncs and ZFS snapshot creation.
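
Roughly, each cycle boils down to an rsync into a per-service dataset followed by a snapshot of that dataset (a simplified sketch; the real rsync invocation is shown further down, and the snapshot naming here is just illustrative):

    rsync -aHAX --numeric-ids root@ahostname:/ /tank/rsyncbackup/byservice/apps/apps/
    zfs snapshot "tank/rsyncbackup/byservice/apps@$(date +%Y-%m-%d)"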

Now, having moved it to ZFS on Linux on Debian (Wheezy), I'm seeing a number of errors where a directory created by rsync gives an error when I try to cd into it. For example:

somehost:/tank/rsyncbackup/byservice/web/web/snapshots/otherhost/Web/current/by-vg/webvg/vhosts/ahostname # cd etc/
-su: cd: etc/: Input/output error

When I try to take a closer look with strace ls etc, I see the following:

stat("etc", {st_mode=S_IFDIR|S_ISGID|0775, st_size=2, ...}) = 0
open("etc", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = -1 EIO (Input/output error)

The rsync we use looks like:

rsync -4 --rsh 'ssh -4 -c arcfour,aes128-cbc,aes128-ctr,3des-cbc,blowfish-cbc -i "/opt/rsync-zfs-backups/rsync-ssh-keys/ahostname/id_rsa_rsync_backup" -o BatchMode=yes' -vih --stats -rdlptgoDHSz --delete --delete-excluded --ignore-errors --numeric-ids -A -v --log-file='/tank/rsyncbackup/byservice/apps/.lastrun.out' --files-from='/tank/rsyncbackup/byservice/apps/.files' --include-from='/tank/rsyncbackup/byservice/apps/.includes' --exclude-from='/tank/rsyncbackup/byservice/apps/.excludes' root@ahostname:/ "/tank/rsyncbackup/byservice/apps/apps"/

zfs properties on the filesystem:

tank/rsyncbackup/byservice/apps  compression           lz4                               inherited from tank/rsyncbackup
tank/rsyncbackup/byservice/apps  atime                 off                               local
tank/rsyncbackup/byservice/apps  snapdir               visible                           inherited from tank/rsyncbackup
tank/rsyncbackup/byservice/apps  xattr                 sa                                local
tank/rsyncbackup/byservice/apps  secondarycache        metadata                          inherited from tank/rsyncbackup
tank/rsyncbackup/byservice/apps  dedup                 off                               inherited from tank
tank/rsyncbackup/byservice/apps  acltype               posixacl                          inherited from tank/rsyncbackup
tank/rsyncbackup/byservice/apps  relatime              off                               inherited from tank/rsyncbackup

Relevant Debian ZFS packages:

ii  debian-zfs                           7~wheezy                           amd64        Native ZFS filesystem metapackage for Debian.
ii  libzfs2                              0.6.3-1~wheezy                     amd64        Native ZFS filesystem library for Linux
ii  zfs-dkms                             0.6.3-1~wheezy                     all          Native ZFS filesystem kernel modules for Linux
ii  zfsonlinux                           2~wheezy                           all          archive.zfsonlinux.org trust package
ii  zfsutils                             0.6.3-1~wheezy                     amd64        command-line tools to manage ZFS filesystems
ii  spl                                  0.6.3-7~wheezy                     amd64        Solaris Porting Layer user-space utilities for Linux
ii  spl-dkms                             0.6.3-7~wheezy                     all          Solaris Porting Layer kernel modules for Linux

uname info:

Linux somehost 3.14-0.bpo.1-amd64 #1 SMP Debian 3.14.12-1~bpo70+1 (2014-07-13) x86_64 GNU/Linux

Not long after seeing this problem, the kernel crashes:

[34095.776611]: BUG: unable to handle kernel request
[34095.776635]: IP:[<ffffffffa2adfecd>]: zio_buf_alloc+0xd/0x20 [zfs]
[34095.776675]: PGD 180f067
[34095.776697]: Oops: 0000 [#1]

 Call Trace
  [<ffffffffa2a4244d>] ? arc_get_data_buf.isra.21+0x34d/0x4d0 [zfs]
  [<ffffffffa2a44d24>] ? arc_buf_alloc+0xf4/0x140 [zfs]
  [<ffffffffa2a455be>] ? arc_read+0x44e/0x8e0 [zfs]
  [<ffffffffa2a4b1d0>] ? dbuf_do_evict+0x60/0x60 [zfs]
  [<ffffffffa2a4bba0>] ? dbuf_read+0x290/0x870 [zfs]
  [<ffffffffa2a54e7e>] ? dmu_spill_hold_by_dnode+0x4e/0x170 [zfs]
  [<ffffffffa2a55113>] ? dmu_spill_hold_existing+0x173/0x180 [zfs]
  [<ffffffffa2a85400>] ? sa_get_spill.part.9+0x20/0x80 [zfs]
  [<ffffffffa2a865e8>] ? sa_attr_op+0x3d8/0x410 [zfs]
  [<ffffffffa2a86c70>] ? sa_lookup+0x40/0x60 [zfs]
  [<ffffffffa077c4b2>] ? __cv_init+0x52/0xf0 [spl]
  [<ffffffffa2ab7f10>] ? zfs_dirent_lock+0x3f0/0x600 [zfs]
  [<ffffffffa2ab969e>] ? zfs_get_xattrdir+0x6e/0x170 [zfs]
  [<ffffffffa1814c4b>] ? nvs_operation+0x16b/0x320 [znvpair]
  [<ffffffffa1815e7f>] ? nvlist_common+0x12f/0x1f0 [znvpair]
  [<ffffffffa2ad233d>] ? zfs_lookup+0x1ed/0x340 [zfs]
  [<ffffffffa2ae94c5>] ? __zpl_xattr_get+0x85/0x220 [zfs]
  [<ffffffffa2ae9aa6>] ? zpl_xattr_get+0x66/0x140 [zfs]
  [<ffffffff811a3499>] ? __inode_permission+0x29/0xc0
  [<ffffffffa2aeabfe>] ? zpl_get_acl+0x8e/0x1f0 [zfs]
  [<ffffffffa2aeada7>] ? zpl_xattr_acl_get+0x47/0xf0 [zfs]
  [<ffffffff811ba8b0>] ? generic_getxattr+0x50/0x80
  [<ffffffff811bae99>] ? vfs_getxattr+0xa9/0xd0
  [<ffffffff811baf7b>] ? getxattr+0xbb/0x200
  [<ffffffff811a497f>] ? filename_lookup+0x2f/0xd0
  [<ffffffff811a84f8>] ? user_path_at_empty+0x68/0xb0
  [<ffffffff811bbbbc>] ? SyS_getxattr+0x4c/0x80
  [<ffffffff814faf79>] ? system_call_fastpath+0x16/0x1b
...

Does anybody have a clue? I'm glad to provide any other details needed.

dweeezil commented Nov 3, 2014

@uwkhahn This sure feels like another corrupted SA to me. If you can find the inode number of the etc directory, could you try running zdb -ddddd <pool>/<file_system> <inode_number_of_etc> on it? Unfortunately, if the SA is corrupted, it can be difficult to get its inode number. I'd start with ls -di etc in its parent directory. If all else fails, you can use zdb -ddddd <pool>/<file_system> <inode_number_of_parent_directory_of_etc>, which will dump the directory with a line for each entry of the form "name = inode (type)", allowing you to get the inode number of "etc" (look for a line of the form "etc = <inode_number>").
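
Roughly, the sequence would look like this (a sketch; the pool/filesystem name and inode numbers are placeholders):

    cd <parent_directory_of_etc>
    ls -di etc          # prints "<inode> etc" if stat still succeeds
    zdb -ddddd <pool>/<file_system> <inode_of_etc>
    # fallback: dump the parent directory and look for "etc = <inode> (type)"
    ls -di .
    zdb -ddddd <pool>/<file_system> <inode_of_parent_directory>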


uwkhahn commented Nov 3, 2014

@dweeezil Here's what I got:

Dataset tank/rsyncbackup/byservice/web [ZPL], ID 675, cr_txg 1785867, 79.9G, 1998692 objects, rootbp DVA[0]=<5:4801057a400:600> DVA[1]=<6:4800f73c200:600> [L0 DMU objset] fletcher4 lzjb LE contiguous unique double size=800L/200P birth=23637893L/23637893P fill=1998692 cksum=18879dda00:76d0cb72a58:138747d1a301a:24c9fb0cbd6157

    Object  lvl   iblk   dblk  dsize  lsize   %full  type
   2687072    1    16K    512  4.50K    512  100.00  ZFS directory
                                        196   bonus  System attributes
    dnode flags: USED_BYTES USERUSED_ACCOUNTED SPILL_BLKPTR
    dnode maxblkid: 0
    path    /web/snapshots/somehost/Web/current/by-vg/webvg/vhosts/otherhost/etc
    uid     5257
    gid     20150
    atime   Thu Oct 30 11:45:23 2014
    mtime   Mon Sep  6 20:10:21 2010
    ctime   Thu Oct 30 11:45:23 2014
    crtime  Thu Oct 30 11:45:23 2014
    gen 1794735
    mode    42775
    size    2
    parent  2686699
    links   2
    pflags  40800000044
    microzap: 512 bytes, 0 entries

Indirect blocks:
               0 L0 4:2ce76bee600:600 200L/200P F=1 B=1794735/1794735

        segment [0000000000000000, 0000000000000200) size   512


uwkhahn commented Nov 3, 2014

One other bit of info: before the crash, I see this output from the rsync. I'm assuming it's related:

rsync: get_acl: sys_acl_get_file(snapshots/somehost/Web/current/by-vg/webvg/vhosts/otherhost/etc, ACL_TYPE_ACCESS): Input/output error (5)
rsync: get_acl: sys_acl_get_file(snapshots/somehost/Web/current/by-vg/webvg/vhosts/otherhost/etc, ACL_TYPE_ACCESS): Input/output error (5)
rsync: set_acl: sys_acl_set_file(snapshots/somehost/Web/current/by-vg/webvg/vhosts/otherhost/etc, ACL_TYPE_ACCESS): Bad address (14)


dweeezil commented Nov 3, 2014

@uwkhahn Sorry to say, but you've got the corrupted SA (actually, more correctly, a corrupted dnode) problem. I've been trying to track this down for the better part of a month or two but have not been able to reproduce it.

I've been working on an overhaul of the way these SAs are created but it's proving to be quite a bit more involved than I initially expected due to the differences in the ways that posix acls and selinux acls interface with the kernel.

Although I'm 99% sure I know exactly what form the corruption takes, you could confirm it by providing the output of zdb -dddd <pool>/<filesystem> 5 6, and by building dweeezil/zfs@1019e58 (but not installing it) and then, from the build directory, running cmd/zdb/zdb -ddddddd <pool>/<filesystem> <inode_of_etc_directory>. That will give the complete picture.
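
In case it helps, building that tree without installing it would look roughly like this (a sketch assuming the usual ZoL autotools build; point --with-spl at whatever SPL source you have locally):

    git clone https://github.com/dweeezil/zfs.git && cd zfs
    git checkout 1019e58
    sh autogen.sh
    ./configure --with-spl=<path_to_matching_spl_source>
    make
    # run the freshly built zdb from the build tree, without installing it
    cmd/zdb/zdb -ddddddd <pool>/<filesystem> <inode_of_etc_directory>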

The problem is that in some cases, the SA layout in the dnode is not updated when kicking an SA xattr to the spill block. This problem is likely caused by a race condition and/or made more probable under conditions of memory pressure. Was there a lot else going on while this "etc" directory was created?

Finally, I'd like to ask if you know whether this directory has been corrupted since its creation or whether some other operation may have been done to it after-the-fact.


uwkhahn commented Nov 4, 2014

Here's the first bit you asked for.

zdb -dddd tank/rsyncbackup/byservice/web 5 6
Dataset tank/rsyncbackup/byservice/web [ZPL], ID 675, cr_txg 1785867, 80.1G, 1999823 objects, rootbp DVA[0]=<8:47bf33ac600:600> DVA[1]=<9:28340417e00:600> [L0 DMU objset] fletcher4 lzjb LE contiguous unique double size=800L/200P birth=23673413L/23673413P fill=1999823 cksum=13f177984f:67b05bc25f1:12617fe700ca3:254acc5b7f6c65

    Object  lvl   iblk   dblk  dsize  lsize   %full  type
         5    1    16K  1.50K  2.50K  1.50K  100.00  SA attr registration
    dnode flags: USED_BYTES USERUSED_ACCOUNTED 
    dnode maxblkid: 0
    microzap: 1536 bytes, 21 entries

        ZPL_DACL_COUNT =  8000010 : [8:0:16]
        ZPL_GEN =  8000004 : [8:0:4]
        ZPL_SYMLINK =  30011 : [0:3:17]
        ZPL_LINKS =  8000008 : [8:0:8]
        ZPL_ATIME =  10000000 : [16:0:0]
        ZPL_UID =  800000c : [8:0:12]
        ZPL_RDEV =  800000a : [8:0:10]
        ZPL_DACL_ACES =  40013 : [0:4:19]
        ZPL_PAD =  2000000e : [32:0:14]
        ZPL_XATTR =  8000009 : [8:0:9]
        ZPL_DXATTR =  30014 : [0:3:20]
        ZPL_ZNODE_ACL =  5803000f : [88:3:15]
        ZPL_CRTIME =  10000003 : [16:0:3]
        ZPL_FLAGS =  800000b : [8:0:11]
        ZPL_SCANSTAMP =  20030012 : [32:3:18]
        ZPL_SIZE =  8000006 : [8:0:6]
        ZPL_MODE =  8000005 : [8:0:5]
        ZPL_PARENT =  8000007 : [8:0:7]
        ZPL_CTIME =  10000002 : [16:0:2]
        ZPL_MTIME =  10000001 : [16:0:1]
        ZPL_GID =  800000d : [8:0:13]

    Object  lvl   iblk   dblk  dsize  lsize   %full  type
         6    1    16K    16K  11.5K    32K  100.00  SA attr layouts
    dnode flags: USED_BYTES USERUSED_ACCOUNTED 
    dnode maxblkid: 1
    Fat ZAP stats:
        Pointer table:
            1024 elements
            zt_blk: 0
            zt_numblks: 0
            zt_shift: 10
            zt_blks_copied: 0
            zt_nextblk: 0
        ZAP entries: 4
        Leaf blocks: 1
        Total blocks: 2
        zap_block_type: 0x8000000000000001
        zap_magic: 0x2f52ab2ab
        zap_salt: 0x40f6f138d
        Leafs with 2^n pointers:
            9:      1 *
        Blocks with n*5 entries:
            0:      1 *
        Blocks n/10 full:
            1:      1 *
        Entries with n chunks:
            3:      1 *
            4:      3 ***
        Buckets with n entries:
            0:    508 ****************************************
            1:      4 *

        4 = [ 5  6  4  12  13  7  11  0  1  2  3  8  16  19  20 ]
        3 = [ 5  6  4  12  13  7  11  0  1  2  3  8  16  19  17 ]
        2 = [ 5  6  4  12  13  7  11  0  1  2  3  8  16  19 ]
        5 = [ 20 ]

Here's the custom zdb output:

cmd/zdb/zdb -ddddddd tank/rsyncbackup/byservice/web 2687072
Dataset tank/rsyncbackup/byservice/web [ZPL], ID 675, cr_txg 1785867, 80.1G, 1999823 objects, rootbp DVA[0]=<8:47bf33ac600:600> DVA[1]=<9:28340417e00:600> [L0 DMU objset] fletcher4 lzjb LE contiguous unique double size=800L/200P birth=23673413L/23673413P fill=1999823 cksum=13f177984f:67b05bc25f1:12617fe700ca3:254acc5b7f6c65

    Object  lvl   iblk   dblk  dsize  lsize   %full  type
   2687072    1    16K    512  4.50K    512  100.00  ZFS directory (K=inherit) (Z=inherit)
                                        196   bonus  System attributes
    dnode flags: USED_BYTES USERUSED_ACCOUNTED SPILL_BLKPTR
    dnode maxblkid: 0

    Spill blkptr:
        4:2ce76beec00:600 5:2ce732b2c00:600 200L/200P F=1 B=1794735/1794735
    Spill blkptr dump: \003\000\000\000\004\000\000\000\166\137\073\147\001\000\000\000\003\000\000\000\005\000\000\000\226\225\071\147\001\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\002\007\054\200\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\257\142\033\000\000\000\000\000\001\000\000\000\000\000\000\000\334\201\257\326\016\000\000\000\171\354\354\031\331\005\000\000\313\276\240\013\376\056\001\000\152\134\007\344\313\354\051\000
    SA hdrsize 16
    SA layout 4
    path    /web/snapshots/somehost/Web/current/by-vg/webvg/vhosts/otherhost/etc
    uid     5257
    gid     20150
    atime   Thu Oct 30 11:45:23 2014
    mtime   Mon Sep  6 20:10:21 2010
    ctime   Thu Oct 30 11:45:23 2014
    crtime  Thu Oct 30 11:45:23 2014
    gen 1794735
    mode    42775
    size    2
    parent  2686699
    links   2
    pflags  40800000044
    ndacl   3
    dump_znode_sa_xattr: sa_xattr_size=20 sa_size error=0
    SA packed dump sa_xattr_size=20: \001\001\000\000\000\000\000\000\000\000\000\001\000\000\000\000\003\000\000\000
    SA xattr dump:
    dump_znode_sa_xattr: nvlist_unpack error=14
    microzap: 512 bytes, 0 entries

Indirect blocks:
               0 L0 4:2ce76bee600:600 5:2ce732b2600:600 200L/200P F=1 B=1794735/1794735

        segment [0000000000000000, 0000000000000200) size   512

Be aware that I'm anonymizing the paths that I'm posting here, so the pathname is slightly different in length and content, but unless you tell me otherwise I'm going to assume that's not an issue.

At the times that these issues occur, it appears that zfs is using most of the system memory (there's 128GB RAM).

I'm going to move the affected directory and then try to do the rsync again to see if the path is created corrupted again. I'll let you know what I see.


uwkhahn commented Nov 4, 2014

Ah, one further addition: there is a zvol, used over iSCSI, that is active at the same time.
I note that the zvol seems to add memory usage beyond what I would have expected to run into.
Hmmm...


dweeezil commented Nov 4, 2014

@uwkhahn Thanks, the problem is exactly the same as others have been having. The SA layout is wrong. It's 4 but should be 2.

Could you please check the source file system and see whether the parent directory of "etc" has a default ACL? I'm looking for any clues which might help me reproduce this problem.

I'm going to look into how hard it would be to whip up a gross hack to detect and work around this problem after the fact.


uwkhahn commented Nov 4, 2014

Here's a "getfacl -n ." on the parent directory.

# file: .
# owner: 5257
# group: 20150
# flags: -s-
user::rwx
user:801:rwx
group::rwx
mask::rwx
other::r-x
default:user::rwx
default:user:801:rwx
default:group::rwx
default:mask::rwx
default:other::r-x

So if I destroy this filesystem and re-create it with xattr=on, will that be at risk of hitting this same bug?
(OK, maybe we'll rename it instead so that we can keep it around for reference.)

Also, with 128GB of RAM I've set zfs_arc_max to 107374182400 (100GB). I note that I see significant usage beyond 100GB (which drops when the zvol is not being actively written to), so I'm contemplating reducing it further. You said you believed this might be related to memory pressure, so I wonder whether you think that would help in our case.
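
(For reference, one typical way to set that is via a module option, e.g. in /etc/modprobe.d/zfs.conf:)

    # cap the ARC at 100GB (value in bytes); applied when the module loads
    options zfs zfs_arc_max=107374182400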

BTW Thanks for helping figure this out.


dweeezil commented Nov 4, 2014

@uwkhahn xattr=on should be perfectly safe. The normal caveats regarding dir-style xattrs do apply, but at least they don't involve filesystem corruption.

Thinking about the ARC oversubscription for a moment: I just re-read this thread and realized you're running a post-3.12 kernel and stock 0.6.3 packages, which means you don't have the shrinker fixes. You might want to try cherry-picking ed6e9cc and openzfs/spl@802a4a2 (assuming you're using DKMS, they should patch cleanly into 0.6.3) and see whether the ARC stays under control better.


uwkhahn commented Nov 4, 2014

Thanks,

I've not built a patched DKMS package before (I have done patched kernel packages directly, but not DKMS), so I'll go figure that out (I'd take pointers to any good instructions).
If we start going down that path, I think there are some other patches of interest to us as well, but we tend to avoid patched packages if we can (they're generally much more work to maintain).

I don't suppose ZOL is close to any new release coming out the door, is it? :)
Thanks again.


uwkhahn commented Nov 4, 2014

Ah... I see that there are no instructions on how to build the Debian DKMS packages, yet they exist for download. Any clue how they are currently generated?


dweeezil commented Nov 4, 2014

You could just patch the sources in /usr/src and then dkms remove spl/0.6.3 ; dkms build spl/0.6.3 ; dkms install spl/0.6.3 and then the same for zfs.
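
Something along these lines (a rough sketch; the patch file names are placeholders, and depending on your dkms version the remove step may want an --all or -k <kernel> argument):

    cd /usr/src/spl-0.6.3 && patch -p1 < /path/to/spl-shrinker-fix.patch
    cd /usr/src/zfs-0.6.3 && patch -p1 < /path/to/zfs-shrinker-fix.patch
    dkms remove spl/0.6.3 --all; dkms build spl/0.6.3; dkms install spl/0.6.3
    dkms remove zfs/0.6.3 --all; dkms build zfs/0.6.3; dkms install zfs/0.6.3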


uwkhahn commented Nov 4, 2014

hmm.. the patch for spl didn't apply cleanly...
The one for zfs did...


dweeezil commented Nov 4, 2014

Oh well, I thought they'd apply cleanly. They're pretty small. You can definitely try it with just the zfs change.


uwkhahn commented Nov 5, 2014

Last night I removed the parent directory of the one with the bad SA; rsync ran again and indeed caused the I/O errors followed by the same crash. I also tried rsyncing data out of there, so I was only reading, and I still got the immediate I/O error and then, later, the same kernel crash.

For now I've kept the same version of everything and moved the filesystems with the xattr=sa setting to an archived location.

We're recompiling a Debian DKMS package using the developer build instructions, but note that the current head doesn't apply the Debian patches cleanly. (Apparently the Debian daily build is from a month ago.)

I've started rsyncing the old data to new filesystems with xattr=on instead of sa. It has not been crashing, with the exception of when I rsync from the filesystem I've been talking with you about.

We'll be trying newer versions soon, but I wanted to get our backups back on track; the affected filesystems are just moved aside so we can look at them later.


dweeezil commented Nov 5, 2014

@uwkhahn So you have saved the corrupted filesystem? Good. I'm about halfway done with a patch that will implement a simple heuristic to detect this and work around the problem. I've resisted doing this over the past few weeks but figured that it'll get my eyes in front of the SA code yet again and maybe I'll get an idea as to what's actually causing this corruption. And it might help others recover their files more easily.


dweeezil commented Nov 7, 2014

@uwkhahn Could you please run zdb -R tank/rsyncbackup/byservice/web 4:2ce76beec00:600 and post the output? I'm starting to gather enough debugging data to formulate an idea of what's going on. For reference, here are the related issues: #2663, #2700, #2717 and openzfs/spl#390.

While working on the hack alluded to in my previous post, I realized that your corrupted dnode actually contains an empty, but corrupted packed nvlist dxattr which has pointed me in a slightly different direction. That said, I'm fairly certain the work in #2809 will eliminate this whole issue.


uwkhahn commented Nov 7, 2014

I moved the filesystem, so the command is slightly different, but here's the output:


zdb -R tank/archive/web 4:2ce76beec00:600
Found vdev type: raidz

4:2ce76beec00:600
          0 1 2 3 4 5 6 7   8 9 a b c d e f  0123456789abcdef
000000:  00c40405002f505a  0000000000000101  ZP/.............
000010:  5800000001000000  1800000060000000  .......X...`....
000020:  702e6d6574737973  6c63615f7869736f  system.posix_acl
000030:  746c75616665645f  2c0000000a000000  _default.......,
000040:  0007000100000002  00070002ffffffff  ................
000050:  0007000400000321  00070010ffffffff  !...............
000060:  00050020ffffffff  58000000ffffffff  .... ..........X
000070:  1700000058000000  702e6d6574737973  ...X....system.p
000080:  6c63615f7869736f  007373656363615f  osix_acl_access.
000090:  2c0000000a000000  0007000100000002  .......,........
0000a0:  00070002ffffffff  0007000400000321  ........!.......
0000b0:  00000010ffffffff  00000020ffffffff  ............ ...
0000c0:  00000000ffffffff  0000000000000000  ................
0000d0:  0000000000000000  0000000000000000  ................
0000e0:  0000000000000000  0000000000000000  ................
0000f0:  0000000000000000  0000000000000000  ................
000100:  0000000000000000  0000000000000000  ................
000110:  0000000000000000  0000000000000000  ................
000120:  0000000000000000  0000000000000000  ................
000130:  0000000000000000  0000000000000000  ................
000140:  0000000000000000  0000000000000000  ................
000150:  0000000000000000  0000000000000000  ................
000160:  0000000000000000  0000000000000000  ................
000170:  0000000000000000  0000000000000000  ................
000180:  0000000000000000  0000000000000000  ................
000190:  0000000000000000  0000000000000000  ................
0001a0:  0000000000000000  0000000000000000  ................
0001b0:  0000000000000000  0000000000000000  ................
0001c0:  0000000000000000  0000000000000000  ................
0001d0:  0000000000000000  0000000000000000  ................
0001e0:  0000000000000000  0000000000000000  ................
0001f0:  0000000000000000  0000000000000000  ................
000200:  8000000000000003  0000000a561da7eb  ...........V....
000210:  0000000000000000  0000000000000000  ................
000220:  0000000000000000  0000000000000000  ................
000230:  0000000000000000  0000000000000000  ................
000240:  400000000029006d  732e000000000000  m.)....@.......s
000250:  0000000000006e76  0000000000000000  vn..............
000260:  0000000000000000  0000000000000000  ................
000270:  0000000000000000  0000000000000000  ................
000280:  400000000029006e  6461000000000000  n.)....@......ad
000290:  00000000006e696d  0000000000000000  min.............
0002a0:  0000000000000000  0000000000000000  ................
0002b0:  0000000000000000  0000000000000000  ................
0002c0:  400000000029006f  6e69000000000000  o.)....@......in
0002d0:  0000736564756c63  0000000000000000  cludes..........
0002e0:  0000000000000000  0000000000000000  ................
0002f0:  0000000000000000  0000000000000000  ................
000300:  4000000000290070  696c000000000000  p.)....@......li
000310:  0000000000000062  0000000000000000  b...............
000320:  0000000000000000  0000000000000000  ................
000330:  0000000000000000  0000000000000000  ................
000340:  0000000000000000  0000000000000000  ................
000350:  0000000000000000  0000000000000000  ................
000360:  0000000000000000  0000000000000000  ................
000370:  0000000000000000  0000000000000000  ................
000380:  0000000000000000  0000000000000000  ................
000390:  0000000000000000  0000000000000000  ................
0003a0:  0000000000000000  0000000000000000  ................
0003b0:  0000000000000000  0000000000000000  ................
0003c0:  0000000000000000  0000000000000000  ................
0003d0:  0000000000000000  0000000000000000  ................
0003e0:  0000000000000000  0000000000000000  ................
0003f0:  0000000000000000  0000000000000000  ................
000400:  8000000000000003  0000000a561da7eb  ...........V....
000410:  0000000000000000  0000000000000000  ................
000420:  0000000000000000  0000000000000000  ................
000430:  0000000000000000  0000000000000000  ................
000440:  400000000029006d  732e000000000000  m.)....@.......s
000450:  0000000000006e76  0000000000000000  vn..............
000460:  0000000000000000  0000000000000000  ................
000470:  0000000000000000  0000000000000000  ................
000480:  400000000029006e  6461000000000000  n.)....@......ad
000490:  00000000006e696d  0000000000000000  min.............
0004a0:  0000000000000000  0000000000000000  ................
0004b0:  0000000000000000  0000000000000000  ................
0004c0:  400000000029006f  6e69000000000000  o.)....@......in
0004d0:  0000736564756c63  0000000000000000  cludes..........
0004e0:  0000000000000000  0000000000000000  ................
0004f0:  0000000000000000  0000000000000000  ................
000500:  4000000000290070  696c000000000000  p.)....@......li
000510:  0000000000000062  0000000000000000  b...............
000520:  0000000000000000  0000000000000000  ................
000530:  0000000000000000  0000000000000000  ................
000540:  0000000000000000  0000000000000000  ................
000550:  0000000000000000  0000000000000000  ................
000560:  0000000000000000  0000000000000000  ................
000570:  0000000000000000  0000000000000000  ................
000580:  0000000000000000  0000000000000000  ................
000590:  0000000000000000  0000000000000000  ................
0005a0:  0000000000000000  0000000000000000  ................
0005b0:  0000000000000000  0000000000000000  ................
0005c0:  0000000000000000  0000000000000000  ................
0005d0:  0000000000000000  0000000000000000  ................
0005e0:  0000000000000000  0000000000000000  ................
0005f0:  0000000000000000  0000000000000000  ................


dweeezil commented Nov 7, 2014

@uwkhahn Thanks. Except for the ZAP remnants in the slack area, it looks to be OK to me. I'll run a standalone decoder program on it later today to verify.


uwkhahn commented Nov 7, 2014

Hey, just for reference, I followed the same process for another directory that is giving the input/output error.
Its dump looks like this (I can give you the steps I used to get there, but I'm just following the recipe you gave above):

zdb -R tank/archive/web 8:48fcab22200:600
Found vdev type: raidz

8:48fcab22200:600
          0 1 2 3 4 5 6 7   8 9 a b c d e f  0123456789abcdef
000000:  00c40405002f505a  0000000000000101  ZP/.............
000010:  5800000001000000  1800000060000000  .......X...`....
000020:  702e6d6574737973  6c63615f7869736f  system.posix_acl
000030:  746c75616665645f  2c0000000a000000  _default.......,
000040:  0007000100000002  00070002ffffffff  ................
000050:  0007000400000321  00070010ffffffff  !...............
000060:  00050020ffffffff  58000000ffffffff  .... ..........X
000070:  1700000058000000  702e6d6574737973  ...X....system.p
000080:  6c63615f7869736f  007373656363615f  osix_acl_access.
000090:  2c0000000a000000  0007000100000002  .......,........
0000a0:  00070002ffffffff  0007000400000321  ........!.......
0000b0:  00000010ffffffff  00000020ffffffff  ............ ...
0000c0:  00000000ffffffff  0000000000000000  ................
0000d0:  0000000000000000  0000000000000000  ................
0000e0:  0000000000000000  0000000000000000  ................
0000f0:  0000000000000000  0000000000000000  ................
000100:  0000000000000000  0000000000000000  ................
000110:  0000000000000000  0000000000000000  ................
000120:  0000000000000000  0000000000000000  ................
000130:  0000000000000000  0000000000000000  ................
000140:  0000000000000000  0000000000000000  ................
000150:  0000000000000000  0000000000000000  ................
000160:  0000000000000000  0000000000000000  ................
000170:  0000000000000000  0000000000000000  ................
000180:  0000000000000000  0000000000000000  ................
000190:  0000000000000000  0000000000000000  ................
0001a0:  0000000000000000  0000000000000000  ................
0001b0:  0000000000000000  0000000000000000  ................
0001c0:  0000000000000000  0000000000000000  ................
0001d0:  0000000000000000  0000000000000000  ................
0001e0:  0000000000000000  0000000000000000  ................
0001f0:  0000000000000000  0000000000000000  ................
000200:  7ce8937fab5b1b37  129d66279c3ee5d5  7.[....|..>.'f..
000210:  9150d286eb2f192f  5c69aa8a7766d9a4  /./...P...fw..i\
000220:  05a895c6195f2ef3  b338788f73fba470  .._.....p..s.x8.
000230:  7780c41187c2a869  dca85f5e96075f2e  i......w._..^_..
000240:  ee72bade6a2cbcf4  1cf591acbfb9dd87  ..,j..r.........
000250:  53832ae2b8f8b3d2  729d4f83e4a9634a  .....*.SJc...O.r
000260:  54754ebb98ee10d6  d0eae8a1bbda98cd  .....NuT........
000270:  031817c20127cac2  4bb5006514b03ded  ..'......=..e..K
000280:  5c8d0870fc38d2ae  09b140db3998ca0e  ..8.p..\...9.@..
000290:  60f913f3204fb7f5  fafe1ca9dedb060c  ..O ...`........
0002a0:  0cc8ef8264f8ba94  cc38007338303db2  ...d.....=08s.8.
0002b0:  830d521d3791871b  258480e8703de5d8  ...7.R....=p...%
0002c0:  b05ebfb868c6921a  333178eb70cb7022  ...h..^."p.p.x13
0002d0:  0fdbb98f1a73ba9c  2f30d76e19d2fca6  ..s.........n.0/
0002e0:  1399a692d8c2aab4  d38b309db8d3d204  .............0..
0002f0:  f7eedfc137e1a34e  9044153549ea1517  N..7.......I5.D.
000300:  7460c05b3cbed0e0  c6ee84f93d3a6247  ...<[.`tGb:=....
000310:  c2042d662926b015  1ec29863dba51605  ..&)f-......c...
000320:  d9491f10a3c7e30d  8a920a3b12d1d448  ......I.H...;...
000330:  ee0560c2f3f4cfc7  d1c1c511f3e65fe7  .....`..._......
000340:  12f599a58a78c7ec  5066fd0b0d0d49b0  ..x......I....fP
000350:  2a129ba8efaf91b6  641e4fcd71e7aa07  .......*...q.O.d
000360:  8c8061656acc53d0  72f1fb552e590f5f  .S.jea.._.Y.U..r
000370:  d96cab7d22e8ded0  a253565296803d16  ..."}.l..=..RVS.
000380:  62e3f86051bcb36e  504167b10b280d94  n..Q`..b..(..gAP
000390:  de77a540ecb4dad3  93dbd47eb77376a9  ....@.w..vs.~...
0003a0:  a3ae0b6204d918d3  9b4f4c8d17c5490b  ....b....I...LO.
0003b0:  024684c802369f13  1ef6e7314e55bbe2  ..6...F...UN1...
0003c0:  86a35670c8da0184  4ef63bb7e9f8f033  ....pV..3....;.N
0003d0:  93f99ab4a2ecd360  2cde98d37ccf0426  `.......&..|...,
0003e0:  a2cce6f77a7f84ab  4abebe447ae2b4fa  ...z.......zD..J
0003f0:  82e1dbe93372c3bd  1b0bf889b7f93f24  ..r3....$?......
000400:  98ccb1d1303b641a  388ba75491b23725  .d;0....%7..T..8
000410:  c19bb1b078a2d595  12b8674a27c74691  ...x.....F.'Jg..
000420:  17be22959c663a35  499af4a147951475  5:f.."..u..G...I
000430:  0296690f67f8f144  6471049a96329736  D..g.i..6.2...qd
000440:  5b7e52e39cb18e97  106344a3cd2dcf9b  .....R~[..-..Dc.
000450:  8147e683618762c6  b05a02f6cc742540  .b.a..G.@%t...Z.
000460:  664c6f7d21a52e83  5bae34b54980ab16  ...!}oLf...I.4.[
000470:  65931ee3d205781b  aea48c026883ddce  .x.....e...h....
000480:  5672eb42460eca8d  f8171fa702fad4a0  ...FB.rV........
000490:  faf0d39ec74b1438  7fff72977ad03c64  8.K.....d<.z.r..
0004a0:  48ebeffb402ca576  d13e4a4109adcb26  v.,@...H&...AJ>.
0004b0:  68bc938dd379f86d  ac501b05f9dc6fd9  m.y....h.o....P.
0004c0:  5d459299ad1f2b06  46777627492b434e  .+....E]NC+I'vwF
0004d0:  a6b460434df2ebf8  062ae0c76c3bbe14  ...MC`....;l..*.
0004e0:  a37d5b857a95feb5  f1ee40c178d3f55a  ...z.[}.Z..x.@..
0004f0:  987a193142ecbe8c  35d06b67d745c203  ...B1.z...E.gk.5
000500:  3ccc3e6d1704af63  8cc316fb01646a3a  c...m>.<:jd.....
000510:  9adf720155b27791  6b9609664e120925  .w.U.r..%..Nf..k
000520:  049cfea873f25ddd  442b1eb917993db5  .].s.....=....+D
000530:  79243102d41f610e  cfa85fb424b29ad8  .a...1$y...$._..
000540:  2b9d92bf7b630adf  b0ceed1782a18f72  ..c{...+r.......
000550:  6d19456286171c33  386d7536ce5884ca  3...bE.m..X.6um8
000560:  cad6c580a622440c  d12fa558af1d1e39  .D".....9...X./.
000570:  6b867e89539adda1  a4340ed9fd9a6ba3  ...S.~.k.k....4.
000580:  73d5b6bb5d3b2845  81408384384ff71e  E(;]...s..O8..@.
000590:  63ec89250dc3a863  9687ce1aea719132  c...%..c2.q.....
0005a0:  44012a4ac7a298f8  aa2c9cb647acf88a  ....J*.D...G..,.
0005b0:  54379337be5e9e6b  ef42829bce5cb494  k.^.7.7T..\...B.
0005c0:  03d36fb54cc5be7f  35da40903edbcc64  ...L.o..d..>.@.5
0005d0:  5b1bcc53c1800777  45e6c81b6e4d4a80  w...S..[.JMn...E
0005e0:  73da46d03576b180  04cc4afc64548af1  ..v5.F.s..Td.J..
0005f0:  546fce4f2fbf7a6d  4cc3a8a2872c05f3  mz./O.oT..,....L

@dweeezil

Big breakthrough this weekend: I've got a 100% reliable reproducer and am hot on the trail of the problem. More information later. I think it will explain all currently reported instances of SA corruption.

@dweeezil

Update (now that I'm not on a mobile device): The problem is that a transition from a dnode with a spill block to a dnode without a spill block doesn't work properly if the dnode isn't synced in between. The cause seems to be that the dbuf corresponding to the spill block (db_blkid of DMU_SPILL_BLKID) is still lying around to be synced even after the spill block has been removed at higher levels.

The effect is that a perfectly valid dnode is written with the exception that it has an unnecessary spill block attached to it. This, in itself, doesn't cause much of a problem (but it is detected by my extra zdb debugging) but when subsequent changes are made to the dnode, such as re-extending an xattr into the spill area, fatal corruption can occur.

I'm hoping to have a better description and solution within the next day or so. As best as I can tell at the moment, this problem is not specific to SA xattrs, but is exacerbated by their use because without them, the spill code within ZFS doesn't get a whole lot of use.
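
Reduced to a sketch, the failing sequence looks something like this (dataset name and value sizes are illustrative; the key is a spill-sized xattr immediately replaced by a small one before the dnode is synced out):

    zfs create -o xattr=sa -o acltype=posixacl tank/satest
    touch /tank/satest/victim
    # long enough to push the SA xattr into a spill block...
    setfattr -n user.test -v "$(head -c 1000 /dev/zero | tr '\0' 'x')" /tank/satest/victim
    # ...then immediately short enough to drop the spill block again
    setfattr -n user.test -v short_value /tank/satest/victim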

@behlendorf

Very interesting. It sounds like the spill block dbuf isn't getting properly undirtied in this case and is being left around. It wouldn't shock me if all the ZFS implementations suffer from this. We make much more use of the SA code and are more likely to notice it.

@dweeezil

The acltype=posixacl and xattr=sa combination gives the SA code quite a workout. The key bit of information is that in several of the extended zdb debugging dumps provided in these related issues, there was an almost perfectly valid non-spill xattr SA present, but its size was, not so coincidentally, exactly what it would be with zero nvpairs in the packed nvlist.

I think I'm getting pretty close to the root of the problem on my test system on which I'm running the reproducers.

@dweeezil

Further tracing shows that the dbufs for both the dnode and the spill are being dirtied properly. It looks like I went down the wrong track here; the problem is more likely in the main SA code, but it is definitely dependent on the flushing of earlier data.

@dweeezil

Ugh, you were right @behlendorf (as was my initial hunch; I was confusing dirtied with undirtied). Patch to be posted shortly.

behlendorf added this to the 0.6.4 milestone Nov 17, 2014

uwkhahn commented Nov 20, 2014

Thanks for all the work. Will this show up as a daily build at some point?

I took some of bpkroth's work to try to get a Debian package with the latest code, and it hasn't gone well (zfs 7b2d78a and spl 917fef2).

I then tried turning on xattr=sa on a new filesystem and rsyncing old data over. I crashed pretty soon after the rsync started (rsync -aviPHAX), and now I can't even traverse the directory structure without it eventually crashing...

It is a very strange crash... Initially it tells me that the program traversing the filesystem (e.g. find, ls, rm) is using 100% of the CPU, then it eventually freezes the machine. zpool iostat 2 doesn't really show anything going on.

Hmm


bpkroth commented Nov 21, 2014

Hi, I've been working with Ken on this issue as well. Here are some other details we have to offer:

We were testing the git checkout build ids Ken mentioned, first with a local build, then again with the packages in the wheezy-daily repository (details below).

The testing procedure goes something like this (a rough shell sketch of one iteration follows the list):

  • create a zfs test fs (/tank/test) with acltype=posixacl xattr=sa (we have not yet done this same test on the latest builds with xattr=on)
  • rsync (-aviPHAS) from a local non-zfs directory structure (/local.hd/test) that has some posix acls to the zfs test fs (/tank/test)
  • if that succeeds, or when the machine returns from a crash, attempt to walk the test fs via find /tank/test, for instance
  • if that succeeds, or when the machine returns from a crash, zfs destroy tank/test
  • if that succeeds, or when the machine returns from a crash, repeat with different settings or a different build
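
A rough sketch of one iteration, using our dataset and source paths:

    zfs create -o acltype=posixacl -o xattr=sa tank/test
    rsync -aviPHAS /local.hd/test/ /tank/test/
    find /tank/test > /dev/null
    zfs destroy tank/test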

When doing that, we'll end up with a crash in a couple of different places, as noted above.

In the case of the zfs destroy, we just got this:

[  349.160037] VERIFY(size != 0) failed
[  349.160122] SPLError: 14585:0:(range_tree.c:172:range_tree_add()) SPL PANIC
[  349.160205] SPL: Showing stack for process 14585
[  349.160271] CPU: 8 PID: 14585 Comm: z_fr_iss_6/3 Tainted: P           O  3.16-0.bpo.3-amd64 #1 Debian 3.16.5-1~bpo70+1
[  349.160346] Hardware name: Supermicro X9DRW-3LN4F+/X9DRW-3TF+/X9DRW-3LN4F+/X9DRW-3TF+, BIOS 3.0a 02/06/2014
[  349.160420]  0000000000000000 ffffffffa0f3cfc3 ffffffff8154144f 0000000000000000
[  349.160598]  ffffffffa02d669c ffffffffa02eab4d 000000000000000a 000000000000ffff
[  349.160601]  000000000000ffff 00000000000000ac ffffffffa02e2746 0000000000000028
[  349.160605] Call Trace:
[  349.160625]  [<ffffffff8154144f>] ? dump_stack+0x41/0x51
[  349.160637]  [<ffffffffa02d669c>] ? spl_debug_bug+0x7c/0xe0 [spl]
[  349.160647]  [<ffffffffa02e2746>] ? spl_PANIC+0xb6/0x100 [spl]
[  349.160654]  [<ffffffffa02dbbeb>] ? spl_kmem_cache_alloc+0x9b/0xcf0 [spl]
[  349.160661]  [<ffffffffa02dc98f>] ? spl_kmem_cache_free+0x14f/0x3e0 [spl]
[  349.160667]  [<ffffffffa028018a>] ? avl_find+0x5a/0xa0 [zavl]
[  349.160674]  [<ffffffffa0ec46e1>] ? range_tree_add+0x291/0x2d0 [zfs]
[  349.160700]  [<ffffffffa0ec09f0>] ? metaslab_free_dva+0x1d0/0x3b0 [zfs]
[  349.160721]  [<ffffffffa0ec38f4>] ? metaslab_free+0x94/0xe0 [zfs]
[  349.160740]  [<ffffffffa0f2323c>] ? zio_dva_free+0x1c/0x30 [zfs]
[  349.160758]  [<ffffffffa0f24269>] ? zio_execute+0xa9/0x140 [zfs]
[  349.160775]  [<ffffffffa02df0c6>] ? taskq_thread+0x236/0x4c0 [spl]
[  349.160781]  [<ffffffff8109f330>] ? try_to_wake_up+0x310/0x310
[  349.160787]  [<ffffffffa02dee90>] ? task_done+0x150/0x150 [spl]
[  349.160792]  [<ffffffff8108f491>] ? kthread+0xc1/0xe0
[  349.160797]  [<ffffffff8108f3d0>] ? flush_kthread_worker+0xb0/0xb0
[  349.160801]  [<ffffffff8154787c>] ? ret_from_fork+0x7c/0xb0
[  349.160805]  [<ffffffff8108f3d0>] ? flush_kthread_worker+0xb0/0xb0

For the crash while walking the filesystem (either via find or rsync) we were basically getting "BUG: soft lockup" and/or "INFO: rcu_sched detected stalls on CPUs/tasks" and the rsync or find process would appear to be spinning at 100% in top (though I'd have expected it to be in D state or something like that and spinning in kernel land instead). Anyways, those netconsole dumps are pretty long since they repeat until the panic occurs and dump for every CPU on the box, so I'll only post a snippet here:

Nov 20 11:23:33 randal netconsole:  683.784535] INFO: rcu_sched detected stalls on CPUs/tasks:
Nov 20 11:23:33 randal netconsole: by 14, t=36762 jiffies, g=11502, c=11501, q=137251)
Nov 20 11:23:33 randal netconsole:  683.784842] NMI backtrace for cpu 8
Nov 20 11:23:33 randal netconsole:  683.784913] CPU: 8 PID: 15162 Comm: find Tainted: P        W  O 3.14-0.bpo.1-amd64 #1 Debian 3.14.12-1~bpo70+1
Nov 20 11:23:33 randal netconsole:  683.785007] Hardware name: Supermicro X9DRW-3LN4F+/X9DRW-3TF+/X9DRW-3LN4F+/X9DRW-3TF+, BIOS 3.0a 02/06/2014
Nov 20 11:23:33 randal netconsole:  683.785101] task: ffff880ffd3de9e0 ti: ffff880fc141e000 task.ti: ffff880fc141e000
Nov 20 11:23:33 randal netconsole:  683.785191] RIP: 0010:[<ffffffff814f297d>] 
Nov 20 11:23:33 randal netconsole[<ffffffff814f297d>]: _raw_spin_lock_irq+0x1d/0x30
Nov 20 11:23:33 randal netconsole:  683.785337] RSP: 0018:ffff880fc141faa8  EFLAGS: 00000006
Nov 20 11:23:33 randal netconsole:  683.785409] RAX: 000000000000000a RBX: ffff880ffd3de9e0 RCX: 0000000000000016
Nov 20 11:23:33 randal netconsole:  683.785484] RDX: 0000000000000000 RSI: 0000000000000202 RDI: ffff88203149a488
Nov 20 11:23:33 randal netconsole:  683.785559] RBP: ffff88203149a480 R08: 008293cec78f2ba7 R09: ffff88203aee9000
Nov 20 11:23:33 randal netconsole:  683.785634] R10: ffff88203149a480 R11: ffff882010d44038 R12: ffff88203149a480
Nov 20 11:23:33 randal netconsole:  683.785709] R13: 0000000000000002 R14: ffff880fc141fe58 R15: 0000000000000000
Nov 20 11:23:33 randal netconsole:  683.785786] FS:  00007fd5a2a1b700(0000) GS:ffff88207fc00000(0000) knlGS:0000000000000000
Nov 20 11:23:33 randal netconsole:  683.785876] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Nov 20 11:23:33 randal netconsole:  683.785948] CR2: 00007f14a0496610 CR3: 0000000fbeb89000 CR4: 00000000001407e0
Nov 20 11:23:33 randal netconsole:  683.786023] Stack:
Nov 20 11:23:33 randal netconsole:  683.786089]  ffffffff814f1f7c
Nov 20 11:23:33 randal netconsole:  683.786372]  ffff880ffd3de9e0
Nov 20 11:23:33 randal netconsole:  683.786655]  ffff88203149a480
Nov 20 11:23:33 randal netconsole:  683.786939] Call Trace:
Nov 20 11:23:33 randal netconsole:  683.787008]  [<ffffffff814f1f7c>] ? rwsem_down_read_failed+0x2c/0xf0
Nov 20 11:23:33 randal netconsole:  683.787101]  [<ffffffffa0d4d3bc>] ? dmu_buf_hold_noread+0xfc/0x190 [zfs]
Nov 20 11:23:33 randal netconsole:  683.787179]  [<ffffffff812a53b4>] ? call_rwsem_down_read_failed+0x14/0x30
Nov 20 11:23:33 randal netconsole:  683.787255]  [<ffffffff814f1a57>] ? down_read+0x17/0x20
Nov 20 11:23:33 randal netconsole:  683.787341]  [<ffffffffa0da5fa0>] ? zap_get_leaf_byblk+0xe0/0x300 [zfs]
Nov 20 11:23:33 randal netconsole:  683.787425]  [<ffffffffa0daa070>] ? zap_match.part.5+0x40/0x40 [zfs]
Nov 20 11:23:33 randal netconsole:  683.787508]  [<ffffffffa0da622a>] ? zap_deref_leaf+0x6a/0x90 [zfs]
Nov 20 11:23:33 randal netconsole:  683.787593]  [<ffffffffa0da7c1f>] ? fzap_cursor_retrieve+0xdf/0x290 [zfs]
Nov 20 11:23:33 randal netconsole:  683.787678]  [<ffffffffa0dab31f>] ? zap_cursor_retrieve+0x6f/0x330 [zfs]
Nov 20 11:23:33 randal netconsole:  683.787761]  [<ffffffffa0d4d1a3>] ? dmu_prefetch+0x1c3/0x2e0 [zfs]
Nov 20 11:23:33 randal netconsole:  683.787847]  [<ffffffffa0dcc75f>] ? zfs_readdir+0x11f/0x460 [zfs]
Nov 20 11:23:33 randal netconsole:  683.787925]  [<ffffffff811a7b81>] ? path_openat+0xd1/0x630
Nov 20 11:23:33 randal netconsole:  683.788001]  [<ffffffff810a6519>] ? remove_wait_queue+0x19/0x60
Nov 20 11:23:33 randal netconsole:  683.788084]  [<ffffffffa0de3e11>] ? zpl_iterate+0x41/0x70 [zfs]
Nov 20 11:23:33 randal netconsole:  683.788159]  [<ffffffff811aaa76>] ? iterate_dir+0xa6/0xe0
Nov 20 11:23:33 randal netconsole:  683.788234]  [<ffffffff811a2299>] ? generic_permission+0xf9/0x1a0
Nov 20 11:23:33 randal netconsole:  683.788309]  [<ffffffff811aabd3>] ? SyS_getdents+0x93/0x120
Nov 20 11:23:33 randal netconsole:  683.788382]  [<ffffffff811aa8c0>] ? filldir64+0x110/0x110
Nov 20 11:23:33 randal netconsole:  683.788455]  [<ffffffff814faf79>] ? system_call_fastpath+0x16/0x1b

Something else interesting to note, which is not necessarily related to the zfs xattr=sa stuff:
In our initial tests we rsynced from another zfs fs (e.g. /tank/working with xattr=on) but neglected to --exclude .zfs. To our surprise, not only did /tank/zfs/.zfs/snapshot/ contain the data from /tank/working/.zfs/snapshot/, but zfs list actually reported them as snapshots. Of course they weren't (this wasn't a cloned fs, nor did we take any snapshots of the fs), so the zfs destroy command had some trouble with those. We eventually just rm -rf /tank/test/.zfs/snapshot/* and were able to zfs destroy tank/test. We were wondering if the following commit might be related to that:
c944be5

Here are the (even more) broken package and kernel version details:

ii  debian-zfs                              7~wheezy                       amd64        Native ZFS filesystem metapackage for Debian.
ii  dkms                                    2.2.0.3-1.2+zfs6               all          Dynamic Kernel Module Support Framework
ii  libzfs2                                 0.6.3-21~7b2d78~wheezy         amd64        Native ZFS filesystem library for Linux
ii  spl                                     0.6.3-13~917fef~wheezy         amd64        Solaris Porting Layer user-space utilities for Linux
ii  spl-dkms                                0.6.3-13~917fef~wheezy         all          Solaris Porting Layer kernel modules for Linux
ii  zfs-dbg                                 0.6.3-21~7b2d78~wheezy         amd64        Debugging symbols for ZFS userland libraries and tools
ii  zfs-dkms                                0.6.3-21~7b2d78~wheezy         all          Native ZFS filesystem kernel modules for Linux
ii  zfsonlinux                              3~wheezy                       all          archive.zfsonlinux.org trust package
ii  zfsutils                                0.6.3-21~7b2d78~wheezy         amd64        command-line tools to manage ZFS filesystems


Linux randal 3.16-0.bpo.3-amd64 #1 SMP Debian 3.16.5-1~bpo70+1 (2014-11-02) x86_64 GNU/Linux

Anyway, we're going to downgrade back to the "stable" 0.6.3 release in your wheezy (non-daily) repo and then attempt to apply the patches you suggested (for the shrinker fix and the xattr=sa fix) separately, directly to the local DKMS source, to see how that responds.

Let us know if you need any more details.

Thanks,
Brian


uwkhahn commented Nov 21, 2014

I was not able to cleanly apply the patch that you provided for the dev source (4254acb) to the current stable source. Having a look at it, I came up with the following; I was curious whether it looks correct to you:

--- dbuf.c  2014-05-20 02:25:26.000000000 -0500
+++ kens_dbuf.c 2014-11-21 13:37:38.224882327 -0600
@@ -872,18 +872,19 @@
 {
    dmu_buf_impl_t *db, *db_next;
    uint64_t txg = tx->tx_txg;
+   boolean_t freespill = (start == DMU_SPILL_BLKID || end == DMU_SPILL_BLKID);
    int epbs = dn->dn_indblkshift - SPA_BLKPTRSHIFT;
    uint64_t first_l1 = start >> epbs;
    uint64_t last_l1 = end >> epbs;

-   if (end > dn->dn_maxblkid && (end != DMU_SPILL_BLKID)) {
+   if ( end > dn->dn_maxblkid && !freespill ) {
        end = dn->dn_maxblkid;
        last_l1 = end >> epbs;
    }
    dprintf_dnode(dn, "start=%llu end=%llu\n", start, end);

    mutex_enter(&dn->dn_dbufs_mtx);
-   if (start >= dn->dn_unlisted_l0_blkid * dn->dn_datablksz) {
+   if (start >= dn->dn_unlisted_l0_blkid * dn->dn_datablksz && !freespill) {
        /* There can't be any dbufs in this range; no need to search. */
        mutex_exit(&dn->dn_dbufs_mtx);
        return;
@@ -919,7 +920,7 @@
        if (db->db_level != 0)
            continue;
        dprintf_dbuf(db, "found buf %s\n", "");
-       if (db->db_blkid < start || db->db_blkid > end)
+       if (db->db_blkid < start || db->db_blkid > end && !freespill)
            continue;

        /* found a level 0 buffer in the range */

ryao pushed a commit to ryao/zfs that referenced this issue Nov 29, 2014
If a spill block's dbuf hasn't yet been written when a spill block is
freed, the unwritten version will still be written.  This patch handles
the case in which a spill block's dbuf is freed and undirties it to
prevent it from being written.

The most common case in which this could happen is when xattr=sa is being
used and a long xattr is immediately replaced by a short xattr as in:

	setfattr -n user.test -v very_very_very..._long_value  <file>
	setfattr -n user.test -v short_value  <file>

The first value must be sufficiently long that a spill block is generated
and the second value must be short enough to not require a spill block.
In practice, this would typically happen due to internal xattr operations
as a result of setting acltype=posixacl.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes openzfs#2663
Closes openzfs#2700
Closes openzfs#2701
Closes openzfs#2717
Closes openzfs#2863
Closes openzfs#2884

Conflicts:
	module/zfs/dbuf.c
@behlendorf

@uwkhahn Yes, that patch looks good.

@bpkroth I'd be very interested to hear if applying this patch resolves the issue for you.

As for the .zfs/snapshot issue, I can explain what's going on there. A lesser-known feature of ZFS is that you can create/destroy/rename snapshots with mkdir/rmdir/mv in the .zfs/snapshot directory. Because you must have snapdir=visible set, rsync traversed into this directory and then created it at the destination with mkdir, which created a snapshot. That snapshot of course doesn't match what was at the source, but that's why it happened.
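
For example (dataset name is illustrative, and it assumes the feature is enabled and you have sufficient privileges):

    mkdir /tank/test/.zfs/snapshot/mysnap      # equivalent to: zfs snapshot tank/test@mysnap
    mv /tank/test/.zfs/snapshot/mysnap /tank/test/.zfs/snapshot/newname
    rmdir /tank/test/.zfs/snapshot/newname     # equivalent to: zfs destroy tank/test@newname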


bpkroth commented Dec 2, 2014

Brian Behlendorf notifications@github.com 2014-12-01 13:43:

@uwkhahn Yes, that patch looks good.

@bpkroth I'd be very interested to hear if applying this patch resolves the issue for you.

We're working on testing it in a VM now. Having a little bit of trouble reproducing the original issue without the patch so far, but with the patch it doesn't appear to fail catastrophically, so that's good at least.

That said, his patch is just the one for fixing the dirty spill block corruption.

It seemed like the rest of the issues we were experiencing with the zfs 7b2d78a and spl 917fef2 builds were related to other code updates that were brought along with them.

As for the .zfs/snapshot issue, I can explain what's going on there. A lesser-known feature of ZFS is that you can create/destroy/rename snapshots with mkdir/rmdir/mv in the .zfs/snapshot directory. Because you must have snapdir=visible set, rsync traversed into this directory and then created it at the destination with mkdir, which created a snapshot. That snapshot of course doesn't match what was at the source, but that's why it happened.

Huh, that's kinda cool/weird. I'd have expected .zfs/snapshot and all of its children to be read-only.

...

I was about to say that we didn't need snapdir=visible anyway, since the point was to allow users to browse backups for themselves via NFS and that was broken, but it looks like that's about to be fixed too:
#2797

Thanks for the info.

Cheers,
Brian


bpkroth commented Dec 17, 2014

FYI, we've been running with Ken's patch in [1] and all volumes resynced fresh with xattr=sa now for about a week. Things seem stable and there's no zfs_iput_taskq task spinning.

Thanks,
Brian

[1] #2863 (comment)

behlendorf pushed a commit that referenced this issue Dec 23, 2014
If a spill block's dbuf hasn't yet been written when a spill block is
freed, the unwritten version will still be written.  This patch handles
the case in which a spill block's dbuf is freed and undirties it to
prevent it from being written.

The most common case in which this could happen is when xattr=sa is being
used and a long xattr is immediately replaced by a short xattr as in:

	setfattr -n user.test -v very_very_very..._long_value  <file>
	setfattr -n user.test -v short_value  <file>

The first value must be sufficiently long that a spill block is generated
and the second value must be short enough to not require a spill block.
In practice, this would typically happen due to internal xattr operations
as a result of setting acltype=posixacl.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2663
Closes #2700
Closes #2701
Closes #2717
Closes #2863
Closes #2884