zfs send very slow with target filesystem recordsize=4k #1771
May be related to #1357
In your case, it sounds like it's actually the zfs recv that's slow. It would be interesting to know whether the xattrs play any part in this problem. I'd like to try to duplicate this behavior on one of my test systems because I'm interested in xattr-related issues. What's the density of the directories into which the empty files are stored? How many xattrs have they got and how long are the xattrs' values? Are you using xattr=sa? I've been working on a send/recv-related patch stack ported from illumos which is available in #1760, but I have a feeling your problem isn't going to be helped by any of them.
@dweeezil The xattrs have been a source of concern since we migrated our FhGFS metadata to ZFS. The density of the directories is hard to determine exactly, as there are 2 primary directories in the metadata storage location, dentries and inodes, and in total the metadata directory contains ~30 million files. At first glance, it seems that this is the basic layout:
Here's an example of the file xattr contents:

# getfattr -d inodes/7F/7B/12-51FC5D3D-1
# file: inodes/7F/7B/12-51FC5D3D-1
user.fhgfs=0sBAECAAMAAADGDgBSAAAAAFo9AFIAAAAAknkvUgAAAACSeS9SAAAAAGQFAADnAwAA/UEAAAIAAAANAAAAMTItNTFGQzVEM0QtMQAAAAwAAAA1LTUxRkM1RDNELTEAAAAAAQABABgAAAABAAAAAAAIAAQAAAAIAAAAAAAAAA==

Each file under both inodes and dentries contains 1 extended attribute, "user.fhgfs". The attribute's value appears to be hashed, but based on the FhGFS documentation [1] the recommended inode size for our setup (2 storage targets) is 512 bytes for ext4. Hope that information helps. Right now I think using the xattr metadata storage approach with ZFS was the wrong choice, but so far performance hasn't been terrible, except that our IOPS are not where they were when using MD RAID w/ ext4.
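As an aside on the value above: getfattr prints binary xattr values base64-encoded (the "0s" prefix), so the decoded size of the attribute can be checked directly. A quick sketch; the value string is copied verbatim from the getfattr output above, split across lines only for readability:

```shell
# user.fhgfs value exactly as printed by getfattr ("0s" prefix = base64)
val='BAECAAMAAADGDgBSAAAAAFo9AFIAAAAAknkvUgAAAACSeS9SAAAAAGQFAADnAwAA'
val="$val"'/UEAAAIAAAANAAAAMTItNTFGQzVEM0QtMQAAAAwAAAA1LTUxRkM1RDNELTEAAAAA'
val="$val"'AQABABgAAAABAAAAAAAIAAQAAAAIAAAAAAAAAA=='
# decoded size in bytes -- well under the 512-byte ext4 inode size
# mentioned above, small enough to fit in ZFS's SA space with xattr=sa
printf '%s' "$val" | base64 -d | wc -c
```

Note the decoded bytes even contain the file's own name (12-51FC5D3D-1) in plain text, so the value is structured metadata rather than an opaque hash.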
@treydock That's very helpful. Since you didn't mention xattr=sa, I presume you're not using it. FWIW, if your files only have the single xattr with values typical of your example, it would fit just fine in the SA space, which means that send/recv issues notwithstanding, you'd certainly enjoy much better overall performance by switching to xattr=sa. I'm going to try to cobble together some scripts to fabricate a directory that has the characteristics of yours and see if I can duplicate this problem.
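For reference, enabling the property is a one-liner (dataset name taken from this thread). Note that xattr=sa only affects xattrs written after the change; existing files keep their directory-based xattrs until they are rewritten, which is why receiving into a pre-created xattr=sa dataset matters here:

```shell
# store xattrs in the dnode's system-attribute (SA) area instead of a
# hidden xattr directory; applies to newly written xattrs only
zfs set xattr=sa tank2/fhgfs
```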
@dweeezil Sorry, I forgot to mention the xattr value. You're correct, xattr is set to the default 'on'. Thank you for pointing out xattr=sa. Would setting that parameter on a new zfs fs and then performing a zfs send to that new fs convert all files to using SA instead of the default method? I'll post or send more detailed views of the file structure, and if it helps I can send a tar archive of metadata files.
@treydock Yes, if you pre-create the destination. I think I can hack together a script to fabricate a sample file system; I'm going to try to do that on one of our test servers today and see if I can duplicate this.
@dweeezil Thanks for confirming that xattr=sa can be applied via zfs send. I'll confirm the distribution of the files between "dentries" and "inodes" shortly. I'm in the process of migrating the metadata filesystem from RAIDZ2 to mirrors via zfs send, and I started a new send. This is the 3rd "send" run on this system in 3 days, and it's gone from 1 hour, to 3 hours, to 10+ hours to send the same data to a local zpool. Is it possible a reboot or reloading the zfs module could help clear some lingering cache or SPL data that is slowing this down? Or is having the receiving and/or sending side mounted a big hindrance on performance? Sample zpool iostats when it was slow (around hour 10).
zpool iostat after tank2/fhgfs is unmounted, tank's unmount process is still running, and all processes using the zfs filesystems have been stopped:
@treydock I ran a bunch of tests using a filesystem containing a bit more than 1 million files. I figured that would be enough to see whether I could reproduce the problem. I was not able to reproduce the originally-reported problem. Send/recv ran at almost identical speed regardless of whether the recv was to a fresh filesystem or to a pre-created filesystem with 4k recordsize. I ran all my tests with xattr=on. I ran both stock 0.6.2 code and master code with my latest set of illumos-ported patches, and there was little difference between the two. I was also using a cache device because I saw you had set the l2arc_compress option, but I now notice you don't have any cache devices configured. The new information in your last post suggests that it's either interference from the normal workload or memory starvation/fragmentation that's causing your problem (or some combination of both). I'll try my tests again with some synthetic load applied to the source filesystem while the send/recv is running. That's going to be a total shot in the dark, however, because I can only guess at the characteristics of your normal load. Do you know what the concurrent workload is? Is it read-heavy, write-heavy, both, neither? As a final note, I'd suggest sticking a piped instance of "pv" in between your send/recv pipeline.
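That suggestion would look something like the following (pool and snapshot names taken from the original report; "pv" is the separate pipe-viewer utility and must be installed first):

```shell
# pv in the middle of the pipeline reports bytes transferred, throughput,
# and elapsed time, making stalls like the one reported easy to spot
zfs send tank@20131007-1136 | pv | zfs receive -F tank2/fhgfs
```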
@dweeezil Thanks for testing. I did a full system reboot then reran the sends with no FhGFS services running and nothing else accessing /tank or /tank2. These were the steps.
This time the operation completed in 1 hour 34 minutes, which matches what I saw on the first run without a pre-created filesystem. Memory starvation is unlikely, as my network monitoring application shows the system never going below 70% available memory. Currently zfs_arc_max is ~8GB and zfs_arc_meta_limit is ~6GB on a 64GB system. I think this is likely a non-issue, related to the running FhGFS services causing the sends to be slow or some other quirk outside of zfs. Thanks! Closing this issue.
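For anyone trying to reproduce this, ARC limits like the ones mentioned are ZFS module parameters; a sketch of the equivalent modprobe configuration (the exact byte values are assumptions approximating the ~8GB and ~6GB figures above):

```shell
# /etc/modprobe.d/zfs.conf -- read when the zfs module is loaded
options zfs zfs_arc_max=8589934592 zfs_arc_meta_limit=6442450944
```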
@dweeezil Out of curiosity, is there a way to "spot check" that the zfs send (now completed) to a zfs fs with xattr=sa correctly used the SA method for storing the data? Example file:
That output does not differ from a filesystem with xattr=on. Hoping there's some other method to confirm the correct usage of xattr=sa. Thanks
My gentoo-next branch has plenty of these patches ported, but I have a list of another 20 patches that I need to review or port. I plan to open a pull request with all of the Illumos changes when I have finished adding them: ryao/zfs@master...gentoo-next
@treydock If you're running 0.6.2 or newer, you can use zdb to dump more detailed information. Using the inode number provided by stat, you can run the following, which will show you how the xattrs are stored.
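A sketch of that procedure (the file path is a hypothetical example patterned on this thread; the actual zdb output was not preserved in this issue):

```shell
# object (inode) number of the file in question
stat -c %i /tank2/fhgfs/inodes/7F/7B/12-51FC5D3D-1
# dump the dnode in full detail; with xattr=sa the xattr appears in the
# dnode's SA/spill area rather than as a separate xattr directory object
zdb -ddddd tank2/fhgfs <inode-number>
```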
Initial report was https://groups.google.com/a/zfsonlinux.org/d/msg/zfs-discuss/-njH0OwOICw/YBiagvqCXiIJ
The send operation:
zfs send tank@20131007-1136 | zfs receive -F tank2/fhgfs
This operation, sending to a recordsize=128k zfs filesystem, took less than 2 hours. Sending to a pre-created recordsize=4k filesystem ran for 2 hours while sending only 1G out of 88.2G, and has now sped up after 3 hours of running.
I noticed that zpool iostat for the sending zpool had less than 100 read operations a second for the first few hours of the send. After a few hours the read operations per second ranges from 7,000 to 15,000 read operations per second.
The first few hours of the receive showed less than 100 write operations per second. After the first few hours the receiving filesystem does write operations in bursts. Using "zpool iostat tank2 1" I see that every 5 seconds anywhere from 28,000 to 50,000 write operations take place. This behavior was not observed when the receiving filesystem did not exist and was created with the default recordsize of 128k.
The sending filesystem is shown as 88.2G "USED" and 59.1G "REFER" in zfs list. After 3 hours the receiving filesystem only shows 23G "USED" and 30K "REFER"; during the first 2 hours of the send only 1G showed in "USED". The sending filesystem consists of millions (maybe 70 million) of 0-byte files whose xattrs contain FhGFS metadata.

System:
64GB RAM, 16 cores (2 sockets)
Zpool and zfs information:
Both tank and tank2/fhgfs only have atime=off and recordsize=4k altered from default.
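Pre-creating the destination with those non-default properties would look something like this (a sketch using the names from the report):

```shell
# create the destination filesystem before the receive so it starts out
# with the 4k recordsize and atime disabled
zfs create -o recordsize=4k -o atime=off tank2/fhgfs
```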