zfs send very slow with target filesystem recordsize=4k #1771

Closed
treydock opened this issue Oct 7, 2013 · 12 comments
Labels
Type: Documentation (Indicates a requested change to the documentation), Type: Performance (Performance improvement or performance problem)

Comments

@treydock
Contributor

treydock commented Oct 7, 2013

Initial report was https://groups.google.com/a/zfsonlinux.org/d/msg/zfs-discuss/-njH0OwOICw/YBiagvqCXiIJ

The send operation:

zfs send tank@20131007-1136 | zfs receive -F tank2/fhgfs

This operation took less than 2 hours when sending to a filesystem with the default recordsize=128k. Sending to a pre-created recordsize=4k filesystem ran for 2 hours while transferring only 1G out of 88.2G, and has only now sped up after 3 hours of running.

I noticed that zpool iostat for the sending zpool showed fewer than 100 read operations per second for the first few hours of the send. After a few hours the rate climbed to between 7,000 and 15,000 read operations per second.

The first few hours of the receive likewise showed fewer than 100 write operations per second. After that the receiving filesystem writes in bursts: using zpool iostat tank2 1 I see that every 5 seconds anywhere from 28,000 to 50,000 write operations take place.

This behavior was not observed when the receiving filesystem did not exist and was created with the default recordsize of 128k.

The sending filesystem is shown as 88.2G "USED" and 59.1G "REFER" in zfs list. After 3 hours the receiving filesystem only shows 23G "USED" and 30K "REFER"; during the first 2 hours of the send only 1G showed in "USED". The sending filesystem consists of millions (maybe 70 million) of 0-byte files whose xattrs contain FhGFS metadata.

System:

64GB RAM, 16 cores (2 sockets)

# cat /etc/modprobe.d/zfs.conf
options zfs l2arc_nocompress=1 zfs_arc_max=8589934592 zfs_arc_meta_limit=6442450944

Zpool and zfs information:

 zpool status
  pool: tank
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          raidz2-0  ONLINE       0     0     0
            ssd03   ONLINE       0     0     0
            ssd04   ONLINE       0     0     0
            ssd05   ONLINE       0     0     0
            ssd06   ONLINE       0     0     0
            ssd07   ONLINE       0     0     0
            ssd08   ONLINE       0     0     0

errors: No known data errors

  pool: tank2
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank2       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            ssd09   ONLINE       0     0     0
            ssd10   ONLINE       0     0     0

errors: No known data errors
# zfs list
NAME          USED  AVAIL  REFER  MOUNTPOINT
tank         88.2G   788G  59.1G  /tank
tank2        24.4G   194G    31K  /tank2
tank2/fhgfs  24.4G   194G    30K  /tank2/fhgfs
# zpool get all tank
NAME  PROPERTY               VALUE                  SOURCE
tank  size                   1.30T                  -
tank  capacity               9%                     -
tank  altroot                -                      default
tank  health                 ONLINE                 -
tank  guid                   14798683540878421717   default
tank  version                -                      default
tank  bootfs                 -                      default
tank  delegation             on                     default
tank  autoreplace            off                    default
tank  cachefile              -                      default
tank  failmode               wait                   default
tank  listsnapshots          off                    default
tank  autoexpand             off                    default
tank  dedupditto             0                      default
tank  dedupratio             1.00x                  -
tank  free                   1.18T                  -
tank  allocated              132G                   -
tank  readonly               off                    -
tank  ashift                 0                      default
tank  comment                -                      default
tank  expandsize             0                      -
tank  freeing                0                      default
tank  feature@async_destroy  enabled                local
tank  feature@empty_bpobj    active                 local
tank  feature@lz4_compress   enabled                local
# zpool get all tank2
NAME   PROPERTY               VALUE                  SOURCE
tank2  size                   222G                   -
tank2  capacity               11%                    -
tank2  altroot                -                      default
tank2  health                 ONLINE                 -
tank2  guid                   5779269606406075032    default
tank2  version                -                      default
tank2  bootfs                 -                      default
tank2  delegation             on                     default
tank2  autoreplace            off                    default
tank2  cachefile              -                      default
tank2  failmode               wait                   default
tank2  listsnapshots          off                    default
tank2  autoexpand             off                    default
tank2  dedupditto             0                      default
tank2  dedupratio             1.00x                  -
tank2  free                   197G                   -
tank2  allocated              24.7G                  -
tank2  readonly               off                    -
tank2  ashift                 0                      default
tank2  comment                -                      default
tank2  expandsize             0                      -
tank2  freeing                0                      default
tank2  feature@async_destroy  enabled                local
tank2  feature@empty_bpobj    active                 local
tank2  feature@lz4_compress   enabled                local

Both tank and tank2/fhgfs only have atime=off and recordsize=4k altered from default.

@treydock
Contributor Author

treydock commented Oct 7, 2013

May be related to #1357

@dweeezil
Contributor

dweeezil commented Oct 7, 2013

In your case, it sounds like it's actually the zfs recv that's slow.

It would be interesting to know whether the xattrs play any part in this problem. I'd like to try to duplicate this behavior on one of my test systems because I'm interested in xattr-related issues. What's the density of the directories into which the empty files are stored? How many xattrs have they got and how long are the xattrs' values? Are you using xattr=sa?

I've been working on a send/recv-related patch stack ported from illumos which is available in #1760 but I have a feeling your problem isn't going to be helped by any of them.

@treydock
Contributor Author

treydock commented Oct 7, 2013

@dweeezil The xattrs have been a source of concern since we migrated our FhGFS metadata to ZFS. The density of the directories is hard to determine exactly; there are 2 primary directories in the metadata storage location, dentries and inodes, and in total the metadata directory contains ~30 million files.

At first glance, it seems that this is the basic layout

- meta
-- dentries
--- HEX value - 129 currently
---- HEX value - 129 for random samples
----- directories that contain files
------ FILES
-- inodes
--- HEX value - 129 currently
---- HEX value - 129 for random samples
----- directories that contain files
------ FILES

Here's an example of the file xattr contents:

# getfattr -d inodes/7F/7B/12-51FC5D3D-1
# file: inodes/7F/7B/12-51FC5D3D-1
user.fhgfs=0sBAECAAMAAADGDgBSAAAAAFo9AFIAAAAAknkvUgAAAACSeS9SAAAAAGQFAADnAwAA/UEAAAIAAAANAAAAMTItNTFGQzVEM0QtMQAAAAwAAAA1LTUxRkM1RDNELTEAAAAAAQABABgAAAABAAAAAAAIAAQAAAAIAAAAAAAAAA==

Each file under both inodes and dentries contains 1 extended attribute, "user.fhgfs". The attribute's value appears to be opaque binary data, but based on the FhGFS documentation [1] the recommended inode size for our setup (2 storage targets) is 512 bytes for ext4.
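A rough way to check the decoded size of that value, and the per-directory file density asked about above (a sketch only, assuming GNU find and the getfattr tool from the attr package; paths are relative to the metadata directory):

getfattr --only-values -n user.fhgfs inodes/7F/7B/12-51FC5D3D-1 | wc -c   # decoded xattr value size in bytes

find dentries -mindepth 3 -maxdepth 3 -type d | while read -r d; do
  echo "$(find "$d" -maxdepth 1 -type f | wc -l) $d"                      # file count per leaf directory
done | sort -n | tail                                                     # ten densest leaf directories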

Hope that information helps. Right now I think using the xattr metadata storage approach with ZFS was the wrong choice, but so far performance hasn't been terrible, except that our IOPS are not where they were when using MD RAID w/ ext4.

@dweeezil
Contributor

dweeezil commented Oct 8, 2013

@treydock That's very helpful. Since you didn't mention xattr=sa, I presume you're not using it. FWIW, if your files only have the single xattr with values typical of your example, it would fit just fine in the SA space which means that send/recv issues notwithstanding, you'd certainly enjoy much better overall performance by switching to xattr=sa.

I'm going to try to cobble together some scripts to fabricate a directory that has the characteristics of yours and see if I can duplicate this problem.

@treydock
Contributor Author

treydock commented Oct 8, 2013

@dweeezil Sorry, I forgot to mention the xattr setting. You're correct, xattr is set to the default 'on'. Thank you for pointing out xattr=sa.

Would setting that parameter on a new zfs fs and then performing a zfs send to that new fs convert all files to using SA storage instead of the default method?

I'll post or send more detailed views of the file structure and if it helps I can send a tar archive of metadata files.

@dweeezil
Contributor

dweeezil commented Oct 8, 2013

@treydock Yes, if you pre-create the destination and zfs set xattr=sa then the send/recv will convert the xattrs to SA storage on the new file system. The performance of xattr=sa for an FhGFS metadata server should be much better.
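A minimal sketch of that sequence (the snapshot name is just a placeholder; recordsize and atime mirror the existing settings on tank):

zfs create -o recordsize=4k -o atime=off -o xattr=sa tank2/fhgfs   # pre-create the target with the desired properties
zfs snapshot tank@SNAPNAME                                         # placeholder snapshot name
zfs send tank@SNAPNAME | zfs receive -F tank2/fhgfs                # xattrs land in SA storage on the new filesystem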

I think I can hack together a script to fabricate a sample file system. I'm going to try that on one of our test servers today and see if I can duplicate this zfs recv problem. I gather that the 30M files are evenly split between the "dentries" and "inodes" top-level directories (each has 15M)? If so, that would be an average of 901 files per directory, so my plan is to write a script that randomly fills each directory with between 500 and 1500 files, with a knob to adjust the density.
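Something along these lines could generate such a tree (a rough sketch only; the two-level hex fan-out and the dummy xattr payload are guesses at the FhGFS layout described above, and FANOUT would be raised to 129 to match it exactly):

#!/bin/bash
# Sketch: fabricate a dentries/inodes-like tree of empty files, each carrying
# one user.fhgfs xattr. DENSITY is the knob for average files per leaf directory.
DENSITY=${1:-900}
FANOUT=${2:-16}                                   # use 129 to match the real layout
VALUE=$(head -c 100 /dev/urandom | base64 -w0)    # ~136-character dummy xattr payload
for top in dentries inodes; do
  for a in $(seq 0 $((FANOUT - 1))); do
    for b in $(seq 0 $((FANOUT - 1))); do
      d=$(printf '%s/%02X/%02X/leaf' "$top" "$a" "$b")
      mkdir -p "$d"
      n=$(( DENSITY / 2 + RANDOM % DENSITY ))     # roughly 0.5x to 1.5x DENSITY files
      for i in $(seq 1 "$n"); do
        f="$d/file-$i"
        : > "$f"                                  # create a 0-byte file
        setfattr -n user.fhgfs -v "$VALUE" "$f"
      done
    done
  done
done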

@treydock
Contributor Author

treydock commented Oct 8, 2013

@dweeezil Thanks for confirming the xattr=sa can be applied via zfs send.

I'll confirm the distribution of the files between "dentries" and "inodes" shortly. I'm in the process of migrating the metadata filesystem from RAIDZ2 to mirrors via zfs send.

I started a new zfs send -v tank@20131008-0140 | zfs receive -F tank2/fhgfs last night after setting xattr=sa on tank2/fhgfs. The first 10 hours saw only 2GB sent. I then stopped all FhGFS processes and executed zfs unmount -a, and in the last 2 minutes 10GB have transferred.

This is the 3rd "send" run on this system in 3 days, and it's gone from 1 hour, to 3 hours, to 10+ hours to send the same data to a local zpool. Is it possible a reboot or reloading the zfs module could clear some lingering cache or SPL data that is slowing this down? Or is having the receiving and/or sending side mounted a big hindrance to performance?

Sample zpool iostat output from when it was slow (around hour 10):

# zpool iostat tank 1 10
               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
tank         133G  1.18T  2.16K    331  4.77M  1.13M
tank         133G  1.18T     25      0  13.0K      0
tank         133G  1.18T     42      0  21.5K      0
tank         133G  1.18T     20      0  10.5K      0
tank         133G  1.18T     42      0  21.5K      0
tank         133G  1.18T     20      0  10.5K      0
tank         133G  1.18T     42      0  21.5K      0
tank         133G  1.18T     20      0  10.5K      0
tank         133G  1.18T     42      0  21.5K      0
tank         133G  1.18T     28      0  14.5K      0
# zpool iostat tank2 1 10
               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
tank2       36.5G   186G     65    333   187K   315K
tank2       36.5G   186G      0      0      0      0
tank2       36.5G   186G      0     42      0  59.0K
tank2       36.5G   186G      0    372      0   378K
tank2       36.5G   186G      0      0      0      0
tank2       36.5G   186G      0      0      0      0
tank2       36.5G   186G      0      0      0      0
tank2       36.5G   186G      0    257      0   205K
tank2       36.5G   186G      0    125      0   220K
tank2       36.5G   186G      0      0      0      0

zpool iostat after tank2/fhgfs has been unmounted, with tank's unmount still in progress and all processes using the zfs filesystems stopped:

# zpool iostat 1 10
               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
tank         133G  1.18T  2.16K    331  4.77M  1.13M
tank2       42.7G   179G     65    358   186K   334K
----------  -----  -----  -----  -----  -----  -----
tank         133G  1.18T  14.8K      0  25.0M      0
tank2       42.7G   179G      1      0   1021      0
----------  -----  -----  -----  -----  -----  -----
tank         133G  1.18T  6.81K      0  15.8M      0
tank2       42.7G   179G      4  74.4K  2.49K  52.3M
----------  -----  -----  -----  -----  -----  -----
tank         133G  1.18T  14.3K      0  24.6M      0
tank2       42.8G   179G     13    107  13.5K   166K
----------  -----  -----  -----  -----  -----  -----
tank         133G  1.18T  15.9K      0  25.9M      0
tank2       42.8G   179G      1      0   1020      0
----------  -----  -----  -----  -----  -----  -----
tank         133G  1.18T  14.8K      0  24.7M      0
tank2       42.8G   179G      1      0   1020      0
----------  -----  -----  -----  -----  -----  -----
tank         133G  1.18T  14.6K      0  24.6M      0
tank2       42.8G   179G      1      0   1020      0
----------  -----  -----  -----  -----  -----  -----
tank         133G  1.18T  6.50K      0  16.5M      0
tank2       42.8G   179G      4  70.1K  2.49K  49.3M
----------  -----  -----  -----  -----  -----  -----
tank         133G  1.18T  14.8K      0  25.0M      0
tank2       42.8G   179G     12    107  13.0K   166K
----------  -----  -----  -----  -----  -----  -----
tank         133G  1.18T  15.5K      0  28.0M      0
tank2       42.8G   179G      1      0   1020      0
----------  -----  -----  -----  -----  -----  -----

@dweeezil
Contributor

dweeezil commented Oct 8, 2013

@treydock I ran a bunch of tests using a filesystem containing a bit more than 1 million files. I figured that would be enough to see whether I could reproduce the problem.

I was not able to reproduce the originally-reported problem. Send/recv ran at almost identical speed regardless of whether the recv was to a fresh filesystem or to a pre-created filesystem with a 4k recordsize. I ran all my tests with xattr=on. I ran both stock 0.6.2 code and also master code with my latest set of illumos-ported patches, and there was little difference between the two. I was also using a cache device because I saw you had set the l2arc_nocompress option, but I now notice you've not got any cache devices configured.

The new information in your last post suggests that it's either interference from the normal workload or memory starvation/fragmentation that's causing your problem (or some combination of both).

I'll try my tests again with some synthetic load applied to the source filesystem while the send/recv is running. That's going to be a total shot in the dark, however, because I can only guess at the characteristics of your normal load. Do you know what the concurrent workload is? Is it read-heavy, write-heavy, both, neither?

As a final note, I'd suggest sticking a piped instance of "pv" in-between your send/recv (zfs send a@b | pv | zfs recv x) to help you see the throughput in real-time.
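For example, using the dataset names from this issue (pv simply passes the stream through while printing the current rate and running total):

zfs send tank@20131007-1136 | pv | zfs receive -F tank2/fhgfs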

@treydock
Contributor Author

treydock commented Oct 8, 2013

@dweeezil Thanks for testing. I did a full system reboot, then reran the send with no FhGFS services running and nothing else accessing /tank or /tank2. These were the steps:

zpool create -O atime=off -O xattr=sa -O recordsize=4k tank2 mirror ssd09 ssd10
zfs set readonly=on tank
zfs snapshot tank@20131008-1228
zfs set readonly=on tank2
zfs send -v tank@20131008-1228 | zfs receive -F tank2

This time the operation completed in 1 hour 34 minutes, which is what I saw on the first run without a pre-created filesystem.

Memory starvation is unlikely, as my network monitoring application shows the system never going below 70% available memory. Currently zfs_arc_max is ~8GB and zfs_arc_meta_limit is ~6GB on a 64GB system. I think this is likely a non-issue: either the running FhGFS services were causing the sends to be slow, or it was some other quirk outside of zfs.

Thanks! Closing this issue.

@treydock treydock closed this as completed Oct 8, 2013
@treydock
Contributor Author

treydock commented Oct 8, 2013

@dweeezil Out of curiosity, is there a way to "spot check" that the zfs send (now completed) to a zfs fs with xattr=sa correctly used the SA method for storing the data?

Example file

# getfattr -d dentries/7F/7B/12-51FC5D3D-1/cbench
# file: dentries/7F/7B/12-51FC5D3D-1/cbench
user.fhgfs=0sAgMAAAEAAAANAAAAMTUtNTFGQzVEM0QtMQAAAAEA
# stat dentries/7F/7B/12-51FC5D3D-1/cbench
  File: `dentries/7F/7B/12-51FC5D3D-1/cbench'
  Size: 0               Blocks: 1          IO Block: 512    regular empty file
Device: 12h/18d Inode: 29710       Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2013-08-05 19:01:16.692686000 -0500
Modify: 2013-08-05 19:01:16.692686000 -0500
Change: 2013-08-05 19:01:16.692891000 -0500

That output does not differ from a filesystem with xattr=on. Hoping there's some other method to confirm that xattr=sa is actually being used.

Thanks

@ryao
Contributor

ryao commented Oct 9, 2013

My gentoo-next branch has plenty of these patches ported, but I have a list of another 20 patches that I need to review or port. I plan to open a pull request with all of the Illumos changes when I have finished adding them:

ryao/zfs@master...gentoo-next
ryao/spl@master...gentoo-next


@behlendorf
Contributor

@treydock If you're running 0.6.2 or newer you can use zdb to dump more detailed information. Using the inode number provided by stat you can run the following which will show you how the xattrs are stored.

zdb <pool/dataset> 29710
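For example, a quick spot check against the sample file from the earlier comment (the extra -d flags increase verbosity; with xattr=sa the attribute should show up in the dnode's SA area rather than in a separate xattr directory object):

obj=$(stat -c %i dentries/7F/7B/12-51FC5D3D-1/cbench)   # object (inode) number of the sample file
zdb -ddddd tank2/fhgfs "$obj"                           # dump that object's dnode in detail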
